This post summarizes some of my experiences getting Swedish stock market data. UPDATE: I updated my setup, it now uses a Timescale database as backend, see this post for the update (Swedish stock market data – pt. 2). I will post something on data analysis later on; I want to do technical analysis reports and then perhaps tie in some more adventurous stuff. If you just want the whole script, it is at the bottom of the page. When it is up on my GitHub I will link it from here. EDIT: I uploaded the script to sync the stocks to my GitHub: https://github.com/vilhelmp/stock_sync/blob/master/sync_stocks.py. Below is a quick example of some data, and a peek at what I am currently working on.

Initial thoughts
So if I wanted to experiment with stock market analysis I needed (historical) stock market data. I live in Sweden, and while it is fairly easy to get stock data for American markets, it is not as straightforward for other markets. Since I do not want to do intra-day analysis, I am going for daily summaries. There are several things I had to be aware of, which I will cover in this post, for example:
- I have not found a flag or setting to pass to Yahoo Finance that returns all Swedish stocks. Thus, I need a list of the names of all Swedish stocks.
- I need a way to store the data as it grows. 10+ years of daily data for a lot of stocks takes up some space, and it needs to be quick and easy to retrieve all quotes as well.
- The sync script should be a bit smart and not run on weekends. Preferably it should not run on bank holidays either. Skipping weekends is easy to solve with crontab, but skipping holidays is not as easy.
- Some kind of logging would be good.
- Easily extendable and sharable. This is something I have still not implemented properly. I intend to include a configuration file for ease of use (using Python's configparser module); see the sketch after this list.
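For instance, the configuration could be as simple as this (just a sketch; the file name, sections, and options below are placeholders, not something the current script actually reads):

import configparser

# a hypothetical sync_stocks.ini; section and option names are examples only
config = configparser.ConfigParser()
config['paths'] = {'store_path': '/PATH/TO/data/stock_data/pystore',
                   'log_path': '/PATH/TO/LOGS/'}
config['sync'] = {'start_date': '2000-01-01'}
with open('sync_stocks.ini', 'w') as f:
    config.write(f)

# the sync script would then read its settings back like this
config = configparser.ConfigParser()
config.read('sync_stocks.ini')
STOREPATH = config['paths']['store_path']
START_DATE = config['sync']['start_date']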
Anyway, let’s get to it.
The Swedish stock market
The stock markets in the Nordic countries of Sweden, Denmark, Finland and Iceland are run by NASDAQ. They have several segments (i.e. Large, Mid and Small Cap) that list numerous stocks. They are also trying to acquire the Oslo stock exchange in Norway. The NASDAQ page for each stock has the daily open/high/low/close values (OHLC), but not when you look at the historical prices (getting the historical prices as XML/JSON seems simple). A physicist at Uppsala University wrote a Python script using Selenium to scrape the data directly (i.e. not OHLC). There’s gotta be a better way, “and there is, Kevin!“. Swedish stocks are listed on Yahoo Finance; the ticker is just the stock symbol with spaces replaced by hyphens, ending with “.ST” for STockholm (I presume). Thus, the SEB A stock becomes “SEB-A.ST” (proof), which means all I need is the list of stocks on the Stockholm stock exchange.
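As a quick illustration of that naming convention, a tiny helper could look like this (a sketch; the function name is mine, the mapping is the one described above):

def to_yahoo_ticker(symbol):
    """Map a NASDAQ Stockholm symbol, e.g. 'SEB A', to a Yahoo ticker, e.g. 'SEB-A.ST'."""
    return "-".join(symbol.split(" ")) + ".ST"

print(to_yahoo_ticker("SEB A"))   # SEB-A.ST
print(to_yahoo_ticker("VOLV B"))  # VOLV-B.ST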
Get the list of stocks to sync
To get the list of stocks on the Stockholm stock exchange we can just copy-paste, or read the table on the page at http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm. It even lists the sector each company operates in, which is good for some sector-wide indexes etc. later on. The full table is simple to parse with Pandas like this:
import requests as r
from lxml import html
import pandas as pd

# get the page with the requests module
page = r.get('http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm')
tree = html.fromstring(page.content)
page.close()
tree.make_links_absolute('http://www.nasdaqomxnordic.com')

# we can use XPath to get our table
trs = tree.xpath('//tbody//tr')

# and putting it into a Pandas DataFrame can (almost) never be wrong
data = pd.DataFrame(
    [[j.text_content() for j in i.getchildren()[:-1]] for i in trs],
    columns=['name', 'symbol', 'currency', 'isin', 'sector', 'icb']
    )

# here we create the ticker names for the queries to Yahoo Finance
data['tickers'] = ["-".join(i.split(" ")) + ".ST" for i in data['symbol'].values]
We are being nice and only read this once, rather than spamming the server with requests. In theory you could just go to the site, save the page as HTML and read that, and then save the list of stock names to a file. Alternatively, save the DataFrame above to a CSV file or something similar for later loading (see the sketch below). Remember though, this is just the list of the names of the stocks we want to sync.
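Something along these lines would do (a sketch; the file name is arbitrary, and it assumes the data DataFrame from the code above):

import pandas as pd

# cache the stock list locally so we do not hit the NASDAQ page every run
data.to_csv('stockholm_stocks.csv', index=False)

# ...and in later runs just read it back instead of scraping
data = pd.read_csv('stockholm_stocks.csv')
tickers = data['tickers'].tolist()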
Storing stock market data
Okay, so now we have the names of all the stocks. Let us get the data! However, how do we store it locally after we have fetched it all? It is basically a time-series data set. CSV files galore? I started to investigate options, including:
- Splunk (link)
- e.g. together with a stock app (link) or a simple Python script
- InfluxDB (link) – a time-series database. Powerful, but not necessarily the best (see 1, see 2 for example comparisons).
- TimescaleDB (link), packaged as a PostgreSQL plugin. Really good, but it means running a full database server. I did install a TimescaleDB server later on and tried it out as well; I put that in a separate post.
- A full TIG stack (tig-what?) with a Yahoo Finance script for Telegraf (here).
- PyStore (link) – fast and local (folder structure) storage using Parquet files made for stock data.
They all seem nice, but to start off I just wanted something quick to get going. So in the end I went with PyStore: the install was a breeze and usage is easy. PyStore uses the Parquet file format and Dask, together with Pandas and NumPy, to store time series.
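To give an idea of the layout: a path holds stores, stores hold collections, and collections hold items. A quick sketch of browsing a store, following the PyStore readme (the path and names below are placeholders):

import pystore

pystore.set_path('/PATH/TO/YOUR/STORE')   # where the Parquet files end up on disk
print(pystore.list_stores())              # all stores under that path

store = pystore.store('NAME_OF_STORE')    # creates the store if it does not exist
print(store.list_collections())           # e.g. ['prices']

collection = store.collection('prices')
print(collection.list_items())            # the tickers written so far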
EDIT: After rewriting the script, I ended up installing and using a TimescaleDB server for this application. I put that in a separate post.
Get the (historical) stock data
So, back to the stock data: I now have the list of stock names I want. To get the data I use the YahooFinancials package (link), which is really simple. Note, however, that it can only do 'daily', 'weekly', or 'monthly' intervals. Taking the ticker 'SEB-A.ST' from before, to retrieve daily summaries of prices I simply do:
from yahoofinancials import YahooFinancials

# START_DATE and END_DATE are date strings on the form 'YYYY-MM-DD'
yahoo_financials = YahooFinancials('SEB-A.ST')
hsp = yahoo_financials.get_historical_price_data(START_DATE, END_DATE, 'daily')
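The reply is a nested dictionary keyed on the ticker, with the daily rows under 'prices' (the keys match the data_types used further down). A quick way to peek at it, assuming the call above succeeded:

import pandas as pd

prices = hsp['SEB-A.ST']['prices']   # a list of dicts, one per trading day
print(prices[0].keys())              # date, formatted_date, open, high, low, close, adjclose, volume
df = pd.DataFrame(prices)
print(df[['formatted_date', 'close', 'volume']].head())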
So to get the full Swedish stock market data, I do this for all the stock tickers in the lists we got from the NASDAQ webpage. Then I need to get it into the PyStore store; to do this I follow the recipe below.
# First define where to store things
STOREPATH = '/PATH/TO/YOUR/STORE'
SYNC_STORE = 'NAME_OF_STORE'

# Now set up the store and its collection
pystore.set_path(STOREPATH)
store = pystore.store(SYNC_STORE)
collection_prices = store.collection('prices')

# Now format the data in a Pandas DataFrame
price_data = pd.DataFrame( hsp[ticker]['prices'] )
data_types = {'date'           : 'float',
              'formatted_date' : 'str',
              'open'           : 'float',
              'high'           : 'float',
              'low'            : 'float',
              'close'          : 'float',
              'adjclose'       : 'float',
              'volume'         : 'float'}
price_data = price_data.astype( dtype=data_types )
price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
price_data.set_index('formatted_date', inplace=True)

# Finally, write it to the store
collection_prices.write('SEB-A.ST', price_data)
Now we could retrieve the data from the collection with the following short code.
item = collection_prices.item('SEB-A.ST')
data = item.data          # a Dask dataframe
metadata = item.metadata
df = item.to_pandas()     # a Pandas DataFrame
This is basically straight from the PyStore readme. The data item holds a Dask dataframe (link); to get a Pandas DataFrame we use the .to_pandas() method (you can also stay with the Dask dataframe and do computations on it, or try other ways of speeding things up). So this covers most of what I had to set up to get the data.
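For example (a sketch, using the item from above; the 'close' column and the datetime index are the ones written by the sync code):

# item.data is a Dask dataframe, so operations stay lazy until .compute()
lazy_mean = item.data['close'].mean()
print(lazy_mean.compute())

# or convert to Pandas and, for example, resample to monthly closing prices
monthly = item.to_pandas()['close'].resample('M').last()
print(monthly.tail())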
Further considerations
There are more things to consider as well. We need to sync the historical data that we do not have yet. First we check the database for what we already have: whether the ticker exists at all and, if so, when it was last synced, and then we try to sync as far back as possible (yes, Yahoo Finance has a limit on the number of requests per hour/day etc.). Furthermore, we do not want to try to sync on, say, a Sunday when we already synced on Friday, because the stock market is closed on both Saturday and Sunday (later this is taken care of by cron if the script is scheduled there). So I need to check what day today and yesterday are, and also whether it was a bank holiday (I use the holidays package for that).
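The check itself is short. A minimal sketch using the holidays package (which, for Sweden, marks Sundays and public holidays but not Saturdays):

import datetime as dt
import holidays

swe_holidays = holidays.Sweden()

def market_was_open(date):
    """Rough check: a weekday (Mon-Fri) that is not a Swedish public holiday."""
    return date.weekday() < 5 and date not in swe_holidays

today = dt.date.today()
yesterday = today - dt.timedelta(days=1)
print(market_was_open(yesterday), market_was_open(today))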
The mother script
So I put together a script that will sync all the Swedish stock market data to a specified location. It is copy-pasted below. I will at some point put this on GitHub for easier tracking, but I am currently too busy applying for jobs and hanging out with my son for that. If you are interested in running it, let me know and I will see if I can package it more nicely. When you run it manually, it shows a nice progress bar.

from yahoofinancials import YahooFinancials
import pandas as pd
from lxml import html
import requests as r
from time import sleep
import json
import pystore
import numpy as np
import datetime as dt
import sys
import holidays
import logging
import os
import matplotlib.dates as mdates

# holidays will show Sundays and public holidays as holidays
# NOT Saturdays
swe_holidays = holidays.Sweden()

STOCKMARKET_CLOSES = dt.time(16, 0, 0)
# in minutes
WAITFORMARKET = 30
# attempt to sync this far back
START_DATE = dt.date(2000, 1, 1)

# define some file locations
STOREPATH = '/PATH/TO/data/stock_data/pystore'
LOGPATH = '/PATH/TO/LOGS/'

# what are the stores going to be called?
# well here I choose the different markets, and included
# some indexes as well.
SYNC_STORES = ['NASDAQ-STO',
               'NASDAQ-FirstNorthPremier',
               'NASDAQ-FirstNorth',
               'INDEXES']

# URLs to get the list of companies in the different markets
SYNC_URLS = ['http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm',
             'http://www.nasdaqomxnordic.com/shares/listed-companies/first-north-premier',
             'http://www.nasdaqomxnordic.com/shares/listed-companies/first-north',
             '']

# list of indexes to sync
listofindexes = [
    {'index': '^OMX',      'name': 'Stockholm OMX', },
    {'index': '^OMXSPI',   'name': 'Stockholm all-share', },
    {'index': '^OMXSBGI',  'name': 'Stockholm Generalindex', },
    {'index': '^SX2000PI', 'name': 'Stockholm Industrials Index', },
    ]

# Some stocks are some kind of weird type?
# Usually an existing stock with some added name,
# for example both TAGM-B.ST and SPEC.ST work,
# but TAGM-BTA-B.ST and SPEC-TO3.ST do not.

# I like to use logging so I can see if anything went wrong, when,
# and perhaps get an idea of what.
logging.basicConfig(filename=os.path.join(LOGPATH, 'stock_syncing-{0}.log'.format(
                        dt.datetime.now().strftime('%y%m%d-%H%M%S'))),
                    filemode='w',
                    format='%(asctime)s-%(levelname)s: %(message)s',
                    level=logging.INFO)

date_today = dt.date.today()
datetime_today = dt.datetime.today()
timeconstraint = dt.datetime.combine(date_today, STOCKMARKET_CLOSES) + dt.timedelta(minutes=WAITFORMARKET)
if (datetime_today.time() <= timeconstraint.time()):
    logging.warning('Warning: Stock market closes around *{0}* during weekdays.'.format(STOCKMARKET_CLOSES))


# Print iterations progress
# from https://gist.github.com/snakers4/91fa21b9dda9d055a02ecd23f24fbc3d
def printProgressBar(iteration, total, prefix='', suffix='', decimals=1, length=100, fill='█'):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end='\r')
    # Print New Line on Complete
    if iteration == total:
        print()


def get_start_date(collection, key):
    # try to get the last date stored, if there
    try:
        last_date = collection.item(key).tail(1).index[0]
        # If you are not storing Timestamps you have to run
        # last_date = dt.date( *[int(i) for i in last_date.split('-')] )
        # first to get the last date into a datetime date object.
        last_date = last_date.date()
        # starting date is last date + one day
        sdate = last_date + dt.timedelta(days=1)
    except(FileNotFoundError):
        # if the ticker doesn't exist, start syncing the full thing
        sdate = START_DATE
    return sdate


def write_to_database(collection, key, data_frame, metadata=dict(source='unknown')):
    try:
        collection.write(key, data_frame, metadata=metadata)
    except ValueError as e:
        if 'exist' in str(e).lower():
            try:
                # test to append it, if this doesn't work
                # then there's something else wrong.
                collection.append(key, data_frame)
            except Exception as e:
                logging.error(str(e))
                raise
    return True


def get_dates_to_sync(collection, key):
    date_today = dt.date.today()
    datetime_today = dt.datetime.today()
    # get last date synced, if any, else use global param START_DATE
    sdate = get_start_date(collection, key)
    timeconstraint = dt.datetime.combine(date_today, STOCKMARKET_CLOSES) + dt.timedelta(minutes=WAITFORMARKET)
    # always sync until today, unless the market hasn't closed yet,
    # in which case sync until yesterday
    if (datetime_today.time() <= timeconstraint.time()):
        edate = date_today - dt.timedelta(days=1)
    else:
        edate = date_today
    # check that the last sync date is not a (bank) holiday
    # NOTE that this will not take care of Saturdays
    while edate in swe_holidays:
        if edate in swe_holidays:
            logging.info('Last sync day is a public holiday ({0}), testing the day before...'.format(edate.strftime('%Y-%m-%d')))
            edate -= dt.timedelta(days=1)
    # number of days to sync
    daystosync = (edate - sdate).days + 1
    # if the number of days to sync is negative it means the last sync date is today.
    if daystosync <= 0:
        logging.info('*No data (dates) to sync for ticker: {0}*'.format(key))
        return False
    # If start is on Saturday/Sunday, and we are trying to sync Sat, or Sat+Sun,
    # we have to wait.
    if sdate.weekday() > 4 and daystosync < 2:
        logging.info('*Stock market not open on weekends!*')
        return False
    if sdate > edate:
        logging.info('{0}: No data to sync.'.format(key))
        return False
    # format input for Yahoo
    sdate_str = sdate.strftime('%Y-%m-%d')
    edate_str = edate.strftime('%Y-%m-%d')
    return (sdate, edate), (sdate_str, edate_str)


def hsp_empty(hsp, ticker):
    if not hsp[ticker] or 'prices' not in hsp[ticker].keys():
        # if hsp[ticker] is empty, or if it doesn't list prices
        return True
    else:
        # if it has 'prices' but it's empty
        try:
            if not hsp[ticker]['prices']:
                return True
            else:
                return False
        except Exception as e:
            #~ print(e)
            logging.error(str(e))
            raise Exception


# prepare local storage/database
pystore.set_path(STOREPATH)

# keep track of tickers that came back without any data
empty_tickers = []
empty_ticker_counter = 0

for sync_store, sync_url in zip(SYNC_STORES, SYNC_URLS):
    logging.info('############### Syncing {0} ###############'.format(sync_store))
    store = pystore.store(sync_store)
    # the various "tables" to store things in
    logging.info('Syncing {0}'.format(sync_store))
    collection_prices = store.collection('prices')
    if sync_store == 'INDEXES':
        for ind, n in zip(listofindexes, range(len(listofindexes))):
            printProgressBar(n, len(listofindexes), prefix='Estimated ',
                             suffix='of *indexes* synced', length=30)
            ticker = ind['index']
            name = ind['name']
            yahoo_financials = YahooFinancials(ticker)
            dates_to_sync = get_dates_to_sync(collection_prices, ticker)
            if not dates_to_sync:
                continue
            else:
                (sdate, edate), (sdate_str, edate_str) = dates_to_sync
            # get historical prices
            hsp = yahoo_financials.get_historical_price_data(sdate_str, edate_str, 'daily')
            # Create DataFrame and write to store
            # NOTE: 'date' and 'volume' should ideally be 'int64',
            # but it doesn't support NaN values. Pandas >0.24
            # has an update using the 'Int64' (note the uppercase I) dtype,
            # BUT parquet file storage doesn't support this (yet?).
            data_types = {'date'           : 'float',
                          'formatted_date' : 'str',
                          'open'           : 'float',
                          'high'           : 'float',
                          'low'            : 'float',
                          'close'          : 'float',
                          'adjclose'       : 'float',
                          'volume'         : 'float'}
            if hsp_empty(hsp, ticker):
                empty_ticker_counter += 1
                empty_tickers.append(ticker)
                logging.info(' ***{0}*** '.format(ticker))
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data = pd.DataFrame(hsp[ticker]['prices'])
            price_data = price_data.astype(dtype=data_types)
            if price_data.empty:
                logging.info(' ***{0}*** '.format(ticker))
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
            price_data.set_index('formatted_date', inplace=True)
            if not price_data['open'].any():
                # i.e. if every value is False/NaN/None
                logging.info(' ***{0}*** '.format(ticker))
                logging.info('Data is empty, try later.')
                continue
            write_to_database(collection_prices, ticker, price_data,
                              metadata={'source': 'Yahoo',
                                        'name': str(name),
                                        })
            logging.info(' ***{0}*** '.format(ticker))
            logging.info('Synced from {0} to {1}'.format(sdate_str, edate_str))
            price_data = []
            w8 = 1.5 / np.random.randint(1, high=20)
            sleep(w8)
    else:
        ################## FIRST SYNC
        # 1. Get stock list
        # first get the stock symbols
        page = r.get(sync_url)
        tree = html.fromstring(page.content)
        page.close()
        # fix links
        tree.make_links_absolute('http://www.nasdaqomxnordic.com')
        # get table rows with stocks
        trs = tree.xpath('//tbody//tr')
        # get the data
        data = pd.DataFrame(
            [[j.text_content() for j in i.getchildren()[:-1]] for i in trs],
            columns=['name', 'symbol', 'currency', 'isin', 'sector', 'icb']
            )
        # 2. Data gathering
        # 2a. Prepare to fetch stock data
        tickers = ["-".join(i.split(" ")) + ".ST" for i in data['symbol'].values]
        sectors = data['sector'].values
        names = data['name'].values
        isins = data['isin'].values
        currencies = data['currency'].values
        # 2b. Get the data for each stock, one at a time, since we might have
        # different sync intervals for every stock, and to minimize the
        # payload (fly under the radar).
        empty_tickers = []
        empty_ticker_counter = 0
        for ticker, sector, name, isin, currency, n in zip(tickers,
                                                           sectors,
                                                           names,
                                                           isins,
                                                           currencies,
                                                           range(len(tickers))):
            logging.info(' ***{0}*** '.format(ticker))
            printProgressBar(n, len(tickers), prefix='Estimated ',
                             suffix='of {0} *stocks* synced'.format(sync_store), length=30)
            # let's try to find the ticker first,
            # if it doesn't work, catch the exception (FileNotFoundError)
            # and proceed
            yahoo_financials = YahooFinancials(ticker)
            ### SYNC STOCK PRICES
            # First check the dates to be synced
            dates_to_sync = get_dates_to_sync(collection_prices, ticker)
            if not dates_to_sync:
                continue
            else:
                (sdate, edate), (sdate_str, edate_str) = dates_to_sync
            # get historical prices
            hsp = yahoo_financials.get_historical_price_data(sdate_str, edate_str, 'daily')
            # Create DataFrame and write to store
            # NOTE: 'date' and 'volume' should ideally be 'int64',
            # but it doesn't support NaN values. Pandas >0.24
            # has an update using the 'Int64' (note the uppercase I) dtype,
            # BUT parquet file storage doesn't support this (yet?).
            data_types = {'date'           : 'float',
                          'formatted_date' : 'str',
                          'open'           : 'float',
                          'high'           : 'float',
                          'low'            : 'float',
                          'close'          : 'float',
                          'adjclose'       : 'float',
                          'volume'         : 'float'}
            if hsp_empty(hsp, ticker):
                empty_ticker_counter += 1
                empty_tickers.append(ticker)
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data = pd.DataFrame(hsp[ticker]['prices'])
            price_data = price_data.astype(dtype=data_types)
            if price_data.empty:
                logging.info('No data in this reply...')
                logging.info('If this happens often, perhaps wait a couple of hours to sync.')
                continue
            price_data['formatted_date'] = pd.to_datetime(price_data['formatted_date'])
            price_data.set_index('formatted_date', inplace=True)
            if not price_data['open'].any():
                # i.e. if every value is False/NaN/None
                logging.info('Data is empty, try later.')
                continue
            write_to_database(collection_prices, ticker, price_data,
                              metadata={'source': 'Yahoo',
                                        'sector': str(sector),
                                        'name': str(name),
                                        'isin': isin,
                                        'currency': str(currency),
                                        })
            logging.info(' ***{0}*** '.format(ticker))
            logging.info('Synced from {0} to {1}'.format(sdate_str, edate_str))
            price_data = []
            w8 = 1.5 / np.random.randint(1, high=20)
            sleep(w8)
    logging.debug('Couldn\'t sync these tickers: {0}'.format(empty_tickers))
That’s it. Now you will have all the stock market data for Sweden. Try it out, I dare you.
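If you want to check the result after a sync, reading a ticker back out is quick. A sketch, assuming the path and store names from the script above:

import pystore

# same path and store names as in the script above
pystore.set_path('/PATH/TO/data/stock_data/pystore')
store = pystore.store('NASDAQ-STO')
prices = store.collection('prices')

seb = prices.item('SEB-A.ST').to_pandas()
print(seb[['close', 'volume']].tail())

# to keep the store up to date, schedule the script on weekdays only,
# e.g. with a crontab line like: 0 18 * * 1-5 python3 /path/to/sync_stocks.py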
Hi Magnus
This is a really fantastic effort. I tried to run it multiple times but it does not run, even after some modifications. Do you know if it still works?
thanks, Neeraj
Hi Neeraj,
Thanks! What kind of error do you get? I would suspect that the YahooFinancials library has changed, or doesn’t work anymore. However, the best starting point is to check what error you get.
All the best,
Magnus
Hi,
This looks like a great piece of work! However, when I try to run it, the command r.get(‘http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm’) never returns. Do you have any idea why?
Hi,
Hmm, did you try setting a User-Agent in the header? Maybe they are blocking everything that doesn’t look like a normal human browsing. I will test it myself and get back. I just finished a dashboard that doesn’t use the local database, but fetches quotes directly from Investing.com.
I get a time-out as well, but after adding proxies and a random UserAgent I can fetch the lists. I will try to update the post with these findings.
Hi again,
I updated the script on GitHub with some things (random UserAgent string, and random free proxy). It worked for me, so should be ok now. Cheers!
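For anyone hitting the same timeout, a minimal sketch of what that looks like with requests (the User-Agent string and proxy address below are just examples, not the ones used in the updated script):

import requests as r

# a desktop browser User-Agent string (any realistic one should do)
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/90.0 Safari/537.36'}
# optional: route the request through a proxy (address is a placeholder)
proxies = {'http': 'http://PROXY_HOST:PORT', 'https': 'http://PROXY_HOST:PORT'}

url = 'http://www.nasdaqomxnordic.com/shares/listed-companies/stockholm'
page = r.get(url, headers=headers, timeout=30)
# page = r.get(url, headers=headers, proxies=proxies, timeout=30)  # with the proxy
print(page.status_code)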