Sourcing and wrangling messy economic data with Python & Pandas
Part I of a series on building an automated dashboard that monitors the health of the global economy
Platforms like Kaggle offer a great way of learning about data science and applying machine learning algorithms to real problems. But such data science competitions often fall short in one regard: in most real-world projects you are not simply given all relevant data sets and they certainly won’t be delivered in one convenient CSV file.
Granted, in Kaggle competitions you have to do data wrangling too, e.g. dealing with missing values and converting categorical to numeric values. But the focus is clearly on optimising machine learning algorithms and model stacking. So let’s deal with the aspect of data sourcing, importing and wrangling in this blog post. In future parts of this series we will look at data visualisation and turning all of this into an interactive dashboard.
Let’s identify sources that are relevant and free of charge. To monitor the health of the global economy we can look at statistics such as GDP growth, unemployment etc. Such figures are low-frequency — published once a quarter or month — and hence might not give us a very timely picture of how the economy is doing. Some economists like to joke that looking at GDP figures is like driving with a rear-view mirror.
We want to focus on data points that are as real-time as possible, e.g. asset prices that reflect the macro-economic reality. Bond yields, i.e. interest rates, are one such good measure. They represent the “price of money”, which is a very important factor for all kinds of economic, business and financial decisions. Here is a selection of sources that we use, but there are many other websites, especially countries’ central banks that offer such data for free.
- St Louis Federal Reserve FRED
- Ministry of Finance, Japan
- Germany’s Central Bank
- Reserve Bank of Australia
Data science tools
We will use the Python ecosystem, including interactive Jupyter Notebooks and Pandas, the key library for time-series analysis. Below we import these and other relevant dependencies.
Importing and wrangling the data
Data can imported as a CSV file, scraped from a website in HMTL format or simply downloaded from an API. We will look at examples ranging from very easy to very difficult.
1. The good: APIs and easy web scraping
The Federal Reserve Bank of St. Louis’s FRED API, accessed via the pandas-datareader module, is very convenient. It does not require an API token for access and is implemented with only one line of code.
I’m using try and except statements — a simple defensive programming technique — when calling the API. If an error occurs the error will be printed and the script continues to run instead of breaking.
The Quandl API also makes it easy to import data, as shown in the second part of the script right below. (You will need to sign up at Quandl and get your free token though.)
The data we’re importing are inflation expectations in the US and price data for copper and gold futures. The ratio of copper-to-gold prices is often seen an an indicator of economic health with copper going up relative to gold when there is strong economic growth and hence industrial demand. If gold outperforms copper that’s often a sign that investors are worried about growth and inflation.
We also want to import data on Japanese government bond yields. The Yen is considered a safe haven currency which people tend to buy in a time of crisis. That capital is then often parked in Japanese government bonds. More demand for those means rising bond prices and falling yields. In that sense Japanese government bond yields often tell us not only about specific Japan-related developments (growth, inflation) but also global risk-aversion.
The Ministry of Finance of Japan provides this on their website in CSV format. It’s split into two files: one for long-term history and one for more recent data points.
I use the pandas read_csv method to load the files and specify that row 1 is the header and that the column ‘Date’ should be parsed as — surprise — dates. This is import though as we want the date column to be in datetime format, not a string or another Python object.
We also have to go through minor wrangling steps shown and described in comments in the below script.
With the .tail(10) method we show the last ten rows of the Dataframe, which should look like the table below.
2. The bad: scraping the Reserve Bank of Australia’s website
The statistics part of the RBA’s website provides links to both CSV files and Excel files. That looks convenient but turns out to be deceptive as there are a number of unexpected problems.
Problem 1: a very unexpected Unicode character
First we try to import the CSV file with the Pandas read_csv method as shown below.
An exception is raised and the following error printed:
'utf-8' codec can't decode byte 0x96 in position 25: invalid start byte
The first thing to do is to open the CSV file with a spreadsheet tool such as Excel or Pages in MacOS. I used Pages and there was no obvious problem, no unicode character in “position 25” to be seen.
Only when you open the file in a text editor can you see the problem in row 1:
F2 CAPITAL MARKET YIELDS ñ GOVERNMENT BONDS,,,,,,,,,,,,,,,,,,,,,,,,
There is an n with a wiggly line (tilde) that causes the error. Since we’re going to download the file every time we run the script, we would run into the same problem again and again. Luckily the alternative file in Excel format does not have an unexpected “ñ” in there. So we use the pd.read_excel method in pandas to download the XLS file as shown below.
Problem 2: download via pandas read_excel is error-prone
We use the try/except statements again here and having run it multiple times there seems to be a connection error with the RBA’s website. The download is very unreliable using pd.read_excel. The alternative that kicks in when an Exception is raised is the wget downloader. We call this command line tool outside the Jupyter Notebook with !wget.
Sometimes the file that is downloaded ends up being empty so there is another part in the above script that checks if the most recent download returned an empty file. Should that happen we use the most recent backup.
At the end of this unexpectedly difficult importing procedure, we take a look at the first 12 rows of the Dataframe (see below).
This clearly needs some pre-processing to remove unwanted data, to give columns meaningful names etc. The below script deals with all of these issues.
3. And the ugly: scraping the German Bundesbank’s yield data
I tried to think of an aspect that is positive, but I suppose the only good thing about trying to import and use this data from the Bundesbank’s website is a painful learning process in data sourcing and wrangling.
Unlike Quandl’s API or other central banks, the Bundesbank puts the yield data of every tenor (3-month bond, 2-year bond, 10-year bond etc) in a separate CSV file, meaning that we have to look for quite a few different files. But all things considered that’s the smallest problem when you take a first look at the data, as below:
Where to begin? Where is the yield data? Surely interest rates are not at 61% or 63%. Why are things separated in such a strange way? (Hint: the decimal separator in Germany is “,” instead of “.”, so in a CSV file a different separator is used in the shape of a semi-colon “;”. But pd.read_csv assumes a “,” so we have to specifically tell it to use “;” as the delimiter.) There are also a few NaNs (not a number), which it turns out are weekends. Market data is only generated during week days.
This calls for a more extensive wrangling procedure which is best wrapped into a more modular format in the shape of functions. We also separate the logic from data (e.g. all the URLs for downloading the various CSVs) in the two code snippets below.
The functions below have detailed doc strings and also quite a number of comments where appropriate to fully explain the wrangling and cleaning process.
Finally we call the above functions, load each CSV file, wrangle it and finally concatenate each tenor into one Dataframe.
The resulting table should look like this.
We’ve shown how to create an automated process for importing and wrangling data from convenient sources such as the Quandl API all the way to challenging ones like the German Bundesbank. We can now perform analysis with these cleaned data sets, visualise results and create interactive dashboards. Here’s a quick preview of the type of charts we’ll create with Python libraries such as Matplotlib and Seaborn in the next part of this series.
If you are interested in alternative data and automating analytics workflows in asset management and trading, please sign up to our newsletter at Oyler.co