This post summarizes some of my experiences getting Swedish stock market data. UPDATE: I updated my setup; it now uses a Timescale database as backend, see this post for the update (Swedish stock market data – pt. 2). I will post something on data analysis later on: I want to do technical analysis reports and then perhaps tie in some more adventurous stuff. If you just want the whole script, it is at the bottom of the page. When it is up on my GitHub I will link it from here. EDIT: I uploaded the script to sync the stocks to my GitHub: https://github.com/vilhelmp/stock_sync/blob/master/sync_stocks.py. Below is a quick example of some data, and a peek at what I am currently working on.

Candlestick plot, showing daily OHLC values, and the Bollinger bands (20 day).
A quick ‘candlestick’ plot along with the 20-day Bollinger band of the past 5 years’ daily OHLC values for the A-stock of the Swedish construction company ‘PEAB’. NB: There might be some issues with how the Bollinger bands are computed in this case, based on the centering of dates in Pandas.
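For reference, the bands themselves are simple: a 20-day rolling mean of the closing price plus/minus two rolling standard deviations. A minimal pandas sketch (on a synthetic price series, using a trailing window rather than a centered one) could look like this:

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    """Rolling mean +/- k rolling standard deviations, over a trailing window."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std()
    return pd.DataFrame({'middle': mid,
                         'upper': mid + k * std,
                         'lower': mid - k * std})

# synthetic daily closes, just for illustration
idx = pd.date_range('2020-01-01', periods=100, freq='B')
close = pd.Series(100 + np.random.randn(100).cumsum(), index=idx)
bands = bollinger_bands(close)
```

Passing center=True to the rolling call gives the centered variant I mention above; which one you want depends on whether you are willing to peek at future dates.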

Initial thoughts

So if I wanted to experiment with stock market analysis I needed (historical) stock market data. I live in Sweden, and while it is fairly easy to get stock data for American markets, it is not as straightforward for other markets. Since I do not want to do intra-day analysis, I am going for daily summaries. I had to be aware of several things, which I will cover in this post, for example:

  • I have not found a flag or setting for Yahoo Finance that returns all Swedish stocks. Thus, I need a list of the names of all Swedish stocks.
  • I need a way to store the data as it grows. 10+ years of daily data for a lot of stocks takes up some space. It also needs to be quick and easy to retrieve all quotes.
  • The sync script should be a bit smart, and not run on weekends. Preferably it should not run on bank holidays either. Skipping weekends is easy to solve with crontab, but skipping holidays is not as easy.
  • Some kind of logging would be good.
  • Easily extendable and sharable. This is something I still have not implemented properly. I intend to include a configuration file for ease of use (using Python’s configparser).
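As a sketch of what that configuration file could look like (the section and option names here are placeholders, not what the script actually uses), the standard library’s configparser handles it in a few lines:

```python
import configparser

# hypothetical config layout -- the names are placeholders, not the script's real options
config = configparser.ConfigParser()
config.read_string("""
[storage]
store_path = /PATH/TO/YOUR/STORE
log_path = /PATH/TO/LOGS/

[sync]
start_date = 2000-01-01
market_close = 16:00
""")

store_path = config['storage']['store_path']
start_date = config['sync']['start_date']
```

In the real script these values would replace the hard-coded STOREPATH, LOGPATH and START_DATE constants further down.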

Anyway, let’s get to it.

The Swedish stock market

The stock markets in the Nordic countries of Sweden, Denmark, Finland and Iceland are run by NASDAQ, which is also trying to acquire the Oslo stock exchange in Norway. They have several segments (i.e. Large, Mid and Small Cap) that list numerous stocks. The page for each stock has the daily open/high/low/close (OHLC) values, but not when you look at the historical prices (getting XML/JSON of the historical prices seems simple, though). A physicist at Uppsala University wrote a Python script using Selenium to scrape the data directly (i.e. not OHLC). There’s gotta be a better way, “and there is, Kevin!“. Swedish stocks are listed on Yahoo Finance; the ticker is just the stock symbol with spaces replaced by hyphens, ending with “.ST” for STockholm (I presume). Thus, the SEB A stock is “SEB-A.ST” (proof), which means all I need is the list of stocks on the Stockholm stock exchange.
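That naming rule is trivial to encode; here is a tiny helper of my own (not from any package) that turns a NASDAQ Stockholm symbol into a Yahoo Finance ticker:

```python
def to_yahoo_ticker(symbol: str, suffix: str = '.ST') -> str:
    """'SEB A' -> 'SEB-A.ST': replace spaces with hyphens, append the exchange suffix."""
    return '-'.join(symbol.split(' ')) + suffix

print(to_yahoo_ticker('SEB A'))   # SEB-A.ST
```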

Get the list of stocks to sync

To get the list of stocks on the Stockholm stock exchange we can just copy-paste, or programmatically read the table on the page at http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm. It even has the sector each company operates in, good for some sector-wide indexes etc. later on. The full table is simple to parse with Pandas like this:

import requests as r
from lxml import html
import pandas as pd
# get the page with the requests module
page = r.get('http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm') 
tree = html.fromstring(page.content)
page.close()
tree.make_links_absolute('http://www.nasdaqomxnordic.com')
# we can use XPath to get our table
trs = tree.xpath('//tbody//tr')
# and putting it into a Pandas DataFrame can (almost) never be wrong
data = pd.DataFrame(
        [[j.text_content() for j in i.getchildren()[:-1]] for i in trs],
        columns = ['name','symbol','currency','isin','sector','icb']
        )
# here we create the ticker names for the queries to Yahoo Finance
data['tickers'] = ["-".join(i.split(" "))+".ST" for i in data['symbol'].values]

Being nice, we only read this page once instead of spamming the server with requests. In theory you could just go to the site, save the page as HTML and read that, and then save the list of stock names into a file. Alternatively, save the DataFrame above to a CSV file or similar for later loading. Remember though, this is just the list of names of the stocks we want to sync.
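Caching that table is one line in each direction with pandas; sketched here on a toy DataFrame (written to an in-memory buffer so the example is self-contained, but a filename like 'stock_list.csv' works the same way):

```python
import io
import pandas as pd

# toy stand-in for the scraped table above
data = pd.DataFrame({'symbol': ['SEB A', 'PEAB B'],
                     'tickers': ['SEB-A.ST', 'PEAB-B.ST']})

buf = io.StringIO()
data.to_csv(buf, index=False)          # or: data.to_csv('stock_list.csv', index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)            # later: pd.read_csv('stock_list.csv')
```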

Storing stock market data

Okay, so now we have the names of all the stocks. Let us get the data! But how do we store it locally once we have fetched it all? It is basically a time-series data set. CSV files galore? I started to investigate options, including:

  • Splunk (link)
    • e.g. together with a stock app (link) or a simple Python script
  • InfluxDB (link) – a time-series database. Powerful, but not necessarily the best (see1, see2 for example comparisons).
  • TimescaleDB (link) packaged as a PostgreSQL plugin. Really good, but it means running a full database server. I did install a TimescaleDB server later on and tried it out as well. I put that in a separate post.
  • A full TIG stack (tig-what?) with Yahoo finance script for Telegraf (here).
  • PyStore (link) – fast and local (folder structure) storage using Parquet files made for stock data.

They all seem nice, but to start off I just wanted something quick to get going. So in the end I went with PyStore, the install was a breeze, and usage is easy. PyStore uses the Parquet file format and Dask together with Pandas and Numpy to store time-series.

EDIT: After rewriting the script, I ended up installing and using a TimescaleDB server for this application. I put that in a separate post.

Get the (historical) stock data

So, back to the stock data: I now have the list of stock names I want. To get the data I use the YahooFinancials package (link), which is really simple. Note that it can only do ‘daily’, ‘weekly’, or ‘monthly’ intervals, however. Taking the ticker ‘SEB-A.ST’ from before, to retrieve daily price summaries I simply do:

from yahoofinancials import YahooFinancials

# START_DATE and END_DATE are 'YYYY-MM-DD' strings
yahoo_financials = YahooFinancials('SEB-A.ST')
hsp = yahoo_financials.get_historical_price_data(START_DATE, END_DATE, 'daily')

So to get the full Swedish stock market data, I do this for all the stock tickers in the list we got from the NASDAQ webpage. Then I need to get it into the PyStore store; to do this I follow the recipe below.

import pystore
import pandas as pd

# First define where to store things
STOREPATH = '/PATH/TO/YOUR/STORE'
SYNC_STORE = 'NAME_OF_STORE'

# Now set up the store and its collection
pystore.set_path(STOREPATH)
store = pystore.store(SYNC_STORE)
collection_prices = store.collection('prices')

# Now format the data in a Pandas DataFrame
price_data = pd.DataFrame( hsp[ticker]['prices'] )
data_types = {'date'            :'float',   
              'formatted_date'  :'str',
              'open'            :'float',
              'high'            :'float',
              'low'             :'float',
              'close'           :'float',
              'adjclose'        :'float',
              'volume'          :'float'}

price_data = price_data.astype( dtype=data_types )
price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
price_data.set_index('formatted_date', inplace=True)

# Finally, write it to the store
collection_prices.write('SEB-A.ST', price_data)

Now we could retrieve the data from the collection with the following short code.

item = collection_prices.item('SEB-A.ST')
data = item.data 
metadata = item.metadata
df = item.to_pandas()

This is basically directly from the PyStore readme. The data item is a Dask dataframe (link); to get a Pandas DataFrame we use the .to_pandas() method (you can also stay with a Dask dataframe and do computations, or try other ways of speeding things up). So this explains some of the things I had to set up to get the data.
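As an example of such a computation, once the daily OHLC DataFrame is back in Pandas you can, say, resample it to weekly bars (sketched on synthetic data here, assuming the same column names as in the store above):

```python
import numpy as np
import pandas as pd

# synthetic daily OHLC data standing in for a retrieved ticker
idx = pd.date_range('2020-01-01', periods=20, freq='B')
rng = np.random.default_rng(0)
px = 100 + rng.standard_normal(20).cumsum()
df = pd.DataFrame({'open': px, 'high': px + 1, 'low': px - 1,
                   'close': px, 'volume': 1000.0}, index=idx)

# aggregate daily bars into weekly ones, column by column
weekly = df.resample('W').agg({'open': 'first', 'high': 'max',
                               'low': 'min', 'close': 'last',
                               'volume': 'sum'})
```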

Further considerations

There are more things to consider as well. We only want to sync historical data that we do not have yet, so first we check the database for what we already have: whether the ticker exists at all and, if so, when it was last synced, and then we try to sync as far back as possible (yes, Yahoo Finance has a limit on the number of requests per hour/day etc.). Furthermore, we do not want to try to sync on, say, a Sunday when we synced on Friday, because the stock market is closed both Saturday and Sunday (later this will be taken care of in CRON if scheduled there). So I need to check what day today and yesterday are, and also whether either was a bank holiday (I use the holidays package for that).
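The day-checking logic boils down to a small predicate; here is a stdlib-only sketch that takes the holiday dates as a plain set (the actual script gets them from the holidays package instead):

```python
import datetime as dt

def is_trading_day(day: dt.date, public_holidays: set) -> bool:
    """True if the exchange is plausibly open: a weekday that is not a public holiday."""
    return day.weekday() < 5 and day not in public_holidays

# hypothetical holiday set, just for illustration
holidays_2019 = {dt.date(2019, 1, 1), dt.date(2019, 12, 25)}
print(is_trading_day(dt.date(2019, 1, 1), holidays_2019))   # False (New Year's Day)
print(is_trading_day(dt.date(2019, 1, 5), holidays_2019))   # False (a Saturday)
print(is_trading_day(dt.date(2019, 1, 7), holidays_2019))   # True (a Monday)
```

Note that the weekday check also covers the Saturday gap that the holidays package leaves open (it marks Sundays and public holidays, but not Saturdays).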

The mother script

So I put together a script that will sync all the Swedish stock market data to a specified location. This is copy-pasted below. I will at some point put this on GitHub for easier tracking, but I am currently too busy applying for jobs and hanging out with my son for that. If you are interested in running it, let me know and I will see if I can package it more nicely. When you manually run it, it shows this nice progress bar.

A nice progress bar to look at while testing the script (copy-pasted from https://gist.github.com/snakers4/91fa21b9dda9d055a02ecd23f24fbc3d).
from yahoofinancials import YahooFinancials
import pandas as pd
from lxml import html
import requests as r
from time import sleep
import json
import pystore
import numpy as np
import datetime as dt
import sys
import holidays
import logging
import os
import matplotlib.dates as mdates

# holidays will show Sundays and public holidays as holidays
# NOT Saturdays
swe_holidays = holidays.Sweden()

STOCKMARKET_CLOSES = dt.time(16,00,00)
# in minutes
WAITFORMARKET = 30              
# attempt to sync this far back
START_DATE = dt.date(2000,1,1)
# define some file locations
STOREPATH = '/PATH/TO/data/stock_data/pystore'
LOGPATH = '/PATH/TO/LOGS/'
# what are the stores going to be called?
# well here I choose the different markets, and included 
# some indexes as well.
SYNC_STORES = ['NASDAQ-STO',
               'NASDAQ-FirstNorthPremier', 
               'NASDAQ-FirstNorth',
               'INDEXES']

# URLs to get the list of companies in the different markets
SYNC_URLS = ['http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm',
             'http://www.nasdaqomxnordic.com/shares/listed-companies/first-north-premier',
             'http://www.nasdaqomxnordic.com/shares/listed-companies/first-north',
             '']

# list of indexes to sync
listofindexes = [
            {'index': '^OMX',
             'name': 'Stockholm OMX',
             },
            {'index': '^OMXSPI',
             'name': 'Stockholm all-share',
             },
            {'index': '^OMXSBGI',
             'name': 'Stockholm Generalindex',
             },
            {'index': '^SX2000PI',
             'name': 'Stockholm Industrials Index',
             },
            ]


# Some stocks are some kind of weird type?
# usually an existing stock with some added name,
# for example both TAGM-B.ST and SPEC.ST work,
# but TAGM-BTA-B.ST and SPEC-TO3.ST do not

# I like to use logging so I can see if anything went wrong, when and perhaps get an idea of what.
logging.basicConfig(filename=os.path.join(LOGPATH,'stock_syncing-{0}.log'.format(dt.datetime.now().strftime('%y%m%d-%H%M%S')) 
                                          ), 
                    filemode='w', format='%(asctime)s-%(levelname)s: %(message)s',
                    level=logging.INFO)

date_today = dt.date.today()
datetime_today = dt.datetime.today()
timeconstraint = dt.datetime.combine(date_today,STOCKMARKET_CLOSES) + dt.timedelta(minutes=WAITFORMARKET)
if (datetime_today.time() <= timeconstraint.time()):
    logging.warning( 'Warning: Stock market closes around *{0}* during weekdays.'.format(STOCKMARKET_CLOSES) )

# Print iterations progress
# from https://gist.github.com/snakers4/91fa21b9dda9d055a02ecd23f24fbc3d
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    # Print New Line on Complete
    if iteration == total: 
        print()


def get_start_date(collection, key):
    # try to get the last date stored, if there
    try:
        last_date = collection.item(key).tail(1).index[0]
        # 
        # If you are not storing Timestamps
        # you have to run last_date = dt.date( *[int(i) for i in last_date.split('-')] )
        # first to get last date into a datetime date object.
        last_date = last_date.date()
        # starting date is last date + one day
        sdate = last_date + dt.timedelta(days=1)
    except(FileNotFoundError):
        # if the ticker doesn't exist, start syncing the full thing
        sdate = START_DATE
    return sdate

def write_to_database(collection, key, data_frame, metadata=dict(source='unknown')):

    try:
        collection.write(key, data_frame, 
                            metadata=metadata
                            )
    except ValueError as e:
        if 'exist' in str(e).lower():
            try:
                # the item already exists, so try to append instead;
                # if this doesn't work then there's something else wrong.
                collection.append( key, data_frame )
            except Exception as e:
                logging.error( str(e) )
                raise
    return True
    

def get_dates_to_sync(collection, key):
    
    date_today = dt.date.today()
    datetime_today = dt.datetime.today()
    # get last date synced, if any, else use global param START_DATE
    sdate = get_start_date(collection, key)
    
    timeconstraint = dt.datetime.combine(date_today,STOCKMARKET_CLOSES) + dt.timedelta(minutes=WAITFORMARKET)
    # always sync until today unless
    # the market hasn't closed yet, sync until yesterday
    if (datetime_today.time() <= timeconstraint.time()):
        edate = date_today - dt.timedelta(days=1)        
    else:
        edate = date_today
    # check that last sync date is not a (bank) holiday
    # NOTE that this will not take care of Saturdays
    while edate in swe_holidays:
        if edate in swe_holidays:
            logging.info('Last sync day is a public holiday ({0}), testing the day before...'.format(edate.strftime('%Y-%m-%d')) )
        edate -= dt.timedelta(days=1)
    # number of days to sync
    daystosync = (edate-sdate).days + 1
    # if number of days to sync is negative it means last sync date is today.
    if daystosync <= 0:
        logging.info('*No data (dates) to sync for ticker: {0}*'.format(key))
        return False
    # If start is on saturday/sunday, and we are trying to sync sat, or sat+sun
    # we have to wait.
    if sdate.weekday() > 4 and daystosync < 2:
        logging.info('*Stock market not open on weekends!*')
        return False

    if sdate>edate:
        logging.info('{0}: No data to sync.'.format(key))
        return False
    # format input for Yahoo
    sdate_str = sdate.strftime('%Y-%m-%d')
    edate_str = edate.strftime('%Y-%m-%d')
    return (sdate,edate),(sdate_str,edate_str)


def hsp_empty(hsp, ticker):
    
    if not hsp[ticker] or 'prices' not in hsp[ticker].keys():
        # if hsp[ticker] empty, or if it doesn't list prices
        return True
    else: # if it has 'prices' but it's empty
        try:
            if not hsp[ticker]['prices']:
                return True
            else:
                return False
        except Exception as e:
            #~ print(e)
            logging.error(str(e))
            raise Exception

# prepare local storage/database
pystore.set_path(STOREPATH)

for sync_store, sync_url in zip(SYNC_STORES, SYNC_URLS):
    logging.info('###############  Syncing {0} ###############'.format(sync_store) )
    
    store = pystore.store(sync_store)
    
    # the various "tables" to store things in
    logging.info('Syncing {0}'.format(sync_store) )
    collection_prices = store.collection('prices')
    if sync_store == 'INDEXES':
        empty_tickers = []
        empty_ticker_counter = 0
        for ind,n in zip(listofindexes, range(len(listofindexes))):
            printProgressBar(n,len(listofindexes), prefix='Estimated ', suffix = 'of *indexes* synced', length=30)
            ticker = ind['index']
            name = ind['name']
            yahoo_financials = YahooFinancials( ticker )
            dates_to_sync = get_dates_to_sync(collection_prices, ticker) 
            if not dates_to_sync:
                continue
            else:
                (sdate, edate), (sdate_str, edate_str) = dates_to_sync
            # get historical prices
            hsp = yahoo_financials.get_historical_price_data(sdate_str, edate_str, 'daily')
            # Create DataFrame and write to store
            # NOTE: 'date' and 'volume' should ideally be 'int64',
            #       but it doesn't support NaN values. Pandas >0.24 
            #       has an update using 'Int64' (not uppercase), dtype
            #       BUT parquet file storage doesn't support this (yet?).
            data_types = {'date'            :'float',   
                          'formatted_date'  :'str',
                          'open'            :'float',
                          'high'            :'float',
                          'low'             :'float',
                          'close'           :'float',
                          'adjclose'        :'float',
                          'volume'          :'float'}
            if hsp_empty(hsp,ticker):
                empty_ticker_counter += 1
                empty_tickers.append(ticker)
                logging.info('   ***{0}***   '.format(ticker))
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data = pd.DataFrame( hsp[ticker]['prices'] )
            price_data = price_data.astype( dtype=data_types )
            if price_data.empty:
                logging.info('   ***{0}***   '.format(ticker))
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
            price_data.set_index('formatted_date', inplace=True)
            if not price_data['open'].any(): # all 'open' values are NaN or zero, i.e. no real data
                logging.info('   ***{0}***   '.format(ticker))
                logging.info('Data is empty, try later.')
                continue
            write_to_database(collection_prices, 
                                ticker, 
                                price_data, 
                                metadata={'source': 'Yahoo', 
                                        'name': str(name), 
                                        }
                                )
            logging.info('   ***{0}***   '.format(ticker))
            logging.info('Synced from {0} to {1}'.format(sdate_str,edate_str))
            price_data = []
            w8 = 1.5/np.random.randint(1,high=20)
            sleep(w8)


    else:

        ################## FIRST SYNC
        # 1. Get stock list

        # first get the stock symbols from the NASDAQ page
        page = r.get(sync_url)
        tree = html.fromstring(page.content)
        page.close()
        # fix links
        tree.make_links_absolute('http://www.nasdaqomxnordic.com')
        # get table rows with stocks
        trs = tree.xpath('//tbody//tr')
        # get the data
        data = pd.DataFrame(
                [[j.text_content() for j in i.getchildren()[:-1]] for i in trs],
                columns = ['name', 'symbol', 'currency', 'isin', 'sector', 'icb']
                )
        
        # 2. Data gathering

        ## 2a. Prepare to fetch stock data

        tickers = ["-".join(i.split(" "))+".ST" for i in data['symbol'].values]
        sectors = data['sector'].values
        names = data['name'].values
        isins = data['isin'].values
        currencies = data['currency'].values

        
        ## 2b. Get data for each stock, one at a time, since we might have 
        #      different sync intervals for every stock, and minimize the 
        #      payload (fly under the radar).
        empty_tickers = []
        empty_ticker_counter = 0
        for ticker, sector, name, isin, currency,n in zip(tickers, sectors, names, isins, currencies, range(len(tickers)) ):
            logging.info('   ***{0}***   '.format(ticker))
            printProgressBar(n, len(tickers), prefix='Estimated ', suffix = 'of {0} *stocks* synced'.format(sync_store), length=30)
            # lets try to find the ticker first,
            # if it doesn't work, catch the exception (FileNotFoundError)
            # and proceed
            yahoo_financials = YahooFinancials(ticker)
            
            ### SYNC STOCK PRICES
            # Fist check the dates to be synced
            dates_to_sync = get_dates_to_sync(collection_prices, ticker) 
            if not dates_to_sync:
                continue
            else:
                (sdate, edate), (sdate_str, edate_str) = dates_to_sync
            # get historical prices
            hsp = yahoo_financials.get_historical_price_data(sdate_str, edate_str, 'daily')
            # Create DataFrame and write to store
            # NOTE: 'date' and 'volume' should ideally be 'int64',
            #       but it doesn't support NaN values. Pandas >0.24 
            #       has an update using 'Int64' (not uppercase), dtype
            #       BUT parquet file storage doesn't support this (yet?).
            data_types = {'date'            :'float',   
                          'formatted_date'  :'str',
                          'open'            :'float',
                          'high'            :'float',
                          'low'             :'float',
                          'close'           :'float',
                          'adjclose'        :'float',
                          'volume'          :'float'}
            if hsp_empty(hsp,ticker):
                empty_ticker_counter += 1
                empty_tickers.append(ticker)
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data = pd.DataFrame( hsp[ticker]['prices'] )
            price_data = price_data.astype( dtype=data_types )
            if price_data.empty:
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
            price_data.set_index('formatted_date', inplace=True)
            if not price_data['open'].any(): # all 'open' values are NaN or zero, i.e. no real data
                logging.info('Data is empty, try later.')
                continue
            write_to_database(collection_prices, 
                                ticker, 
                                price_data, 
                                metadata={'source': 'Yahoo', 
                                        'sector':str(sector), 
                                        'name': str(name), 
                                        'isin': isin,
                                        'currency': str(currency),
                                        }
                                )
            logging.info('   ***{0}***   '.format(ticker))
            logging.info('Synced from {0} to {1}'.format(sdate_str,edate_str))
            price_data = []
            w8 = 1.5/np.random.randint(1,high=20)
            sleep(w8)
        logging.debug('Couldn\'t sync these tickers: {0}'.format(empty_tickers) )

That’s it. Now you will have all the stock market data for Sweden. Try it out, I dare you.