Technical Tuesday: Dealing with missing values in your data science pipeline
Statistics is not known for good jokes, but this one is worth a try as it touches on today’s topic: “There are two kinds of people in the world. 1) Those who can extrapolate from missing data.”
As any experienced data scientist will confirm, 80% of the work of building statistical and machine learning models is pre-processing one’s data properly. That includes steps such as obtaining, cleaning, wrangling, standardising and transforming it. If you import data from Bloomberg into an Excel spreadsheet, you will likely have come across blanks and “N/A” in certain cells. Alternatively, connecting to an API with Python for data downloads often yields a few (or many) NaNs. NaN stands for “not a number”: a special floating-point value used to represent data that are unavailable or undefined.
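A quick illustration of NaN’s peculiarities in plain Python:

```python
import math

x = float("nan")
print(type(x))        # <class 'float'> -- NaN is a floating-point value, not a separate type
print(x == x)         # False -- NaN compares unequal even to itself
print(math.isnan(x))  # True -- the reliable way to test for NaN
```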
Algorithms generally require numeric values, so we have to do something with our N/As, NaNs and blank cells. Some software packages make the decision for you, possibly without your knowledge, and ignore the entire column or (more often) row containing a missing value. By contrast, any data science package or programming language worth its salt should make these choices explicit or raise an error so you know exactly what the problem is.
Stuck between pure theory and glib treatment — a practical middle ground
Handling missing data may seem an obscure topic, but there is a vast array of academic research on it. Many a paper and book have been written, often not with practitioners in mind. At the other end of the spectrum, data science and machine learning tutorials often treat the issue in a hand-wavy way. Showing a single line of code, e.g. a Pandas method like .fillna() to substitute NaNs with some value or .dropna() to discard entire rows or columns containing NaNs, does not do justice to the topic and its trade-offs.
More often than not, handling missing values requires your judgement in combination with some data exploration, and the answer to the question of what to do will often be: “it depends”. We’ll go through the following steps in this post. First, we’ll obtain some equities and FX price data via the Yahoo Finance API. After briefly inspecting the data, we will visualise missing values with a Python module called missingno.
We then explore different reasons for why data points may be missing. Each explanation will require a different solution to address the problem. We go through the main options: deletion, substitution and imputation, which means inferring values from the known parts of the data. Finally, we will download some macro data from the FRED API to investigate how to deal with missing values as a result of our data having different frequencies.
Accessing and exploring equities and FX data
We import basic libraries such as Pandas and special-purpose ones such as yfinance, a Python package for downloading Yahoo Finance data. Calling the yfinance API, we download share prices for Apple, Airbnb, Amazon and Snowflake as well as the EUR/USD exchange rate from 1 January 2020 until today (5 January 2021).
The results are automatically stored in a Pandas DataFrame (variable name: data). The DataFrame has a multi-level column index containing open, high, low and close prices (among others), which are not important to this analysis, so we access only the closing prices via [“Close”]. The .head() method gives us the first five rows of the data set. It is obvious that there are quite a few missing data points, as shown by the multitude of NaNs. Weekends appear to be excluded (the index jumps from Friday Jan 3rd to Monday Jan 6th), so they are not the cause. Some missing values can likely be explained by exchange holidays, e.g. Jan 1st. Let’s find out exactly how many there are per column.
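The selection and counting steps can be sketched as follows. The numbers here are a small illustrative stand-in for the downloaded DataFrame, not actual Yahoo Finance values:

```python
import numpy as np
import pandas as pd

# Stand-in for data["Close"]: closing prices with some gaps (illustrative values)
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"])
close = pd.DataFrame(
    {
        "AAPL": [np.nan, 75.09, 74.36, 74.95],     # NaN on the Jan 1st holiday
        "ABNB": [np.nan, np.nan, np.nan, np.nan],  # not listed until Dec 2020
        "EURUSD=X": [1.1221, 1.1172, 1.1161, 1.1194],
    },
    index=idx,
)

print(close.head())        # first rows of the data set
print(close.isna().sum())  # number of missing values per column
```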
The sums are useful but better yet is to visualise missing values. The missingno module is perfect for this task:
To see dates on the left, our DataFrame needs a date index, which we set with the .set_index() method, and we pass a frequency to the matrix plotting function (here: business month, BM). Apple and Amazon have missing data points during holidays such as Easter, Labour Day, Thanksgiving and Christmas. The reason Airbnb and Snowflake have so many missing values is that they were only listed late in 2020. If you didn’t know that, the above plot would certainly have prompted you to look into it.
The quickest solution: deleting rows and/or columns
Now that we have explored our data set, we need to apply our judgement to determine how to proceed. Assuming the Airbnb and Snowflake share prices are not of crucial importance to our analysis, we can delete those two columns. This takes one line of code, and we save the result in a new DataFrame called data_preprocessed. That way the raw data remains unaffected.
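Dropping the two columns might look like this (again with a small illustrative stand-in for the downloaded prices):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"])
close = pd.DataFrame(
    {
        "AAPL": [75.09, 74.36, 74.95],
        "ABNB": [np.nan, np.nan, np.nan],
        "SNOW": [np.nan, np.nan, np.nan],
        "EURUSD=X": [1.1172, 1.1161, 1.1194],
    },
    index=idx,
)

# Keep the raw data untouched; work on a copy without ABNB and SNOW
data_preprocessed = close.drop(columns=["ABNB", "SNOW"])
print(data_preprocessed.columns.tolist())  # ['AAPL', 'EURUSD=X']
```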
The next step is to deal with rows that have missing values. Two questions to consider: (1) Are all stocks listed on the same exchange? (2) Are those data points missing because that exchange was closed on those particular days? If we are sure about both, then the rows containing NaNs can be discarded. When using dropna(), it’s best to be precise and specify along which axis we want to delete: rows (axis=‘index’) or columns (axis=‘columns’).
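Being explicit about the axis, sketched with illustrative values:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
prices = pd.DataFrame(
    {"AAPL": [np.nan, 75.09, 74.36], "AMZN": [np.nan, 94.30, 93.75]},
    index=idx,
)

# Drop rows containing any NaN -- axis="index" makes the intent explicit
cleaned = prices.dropna(axis="index")
print(len(cleaned))  # 2 -- the Jan 1st holiday row is gone
```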
However, if we compare securities from different exchanges with different holidays (say, a London-listed with a NYSE-listed stock), a blanket dropna() could delete a valid data point for the US stock simply because a UK bank holiday left a NaN in the London column for that day. Regarding the second question, it is worth investigating whether a missing value really is an exchange holiday or whether a problem with the data provider or connection caused it.
Let’s make sure we do this properly. We import a Python module with market calendar data and select NYSE trading dates (Amazon and Apple are Nasdaq-listed, but their exchange holidays are the same). We re-index our data with the NYSE calendar, which excludes days on which the exchange was closed. Looking at the missing data plot again, we have a perfect data set now. Hence all those missing values were in fact holidays and not due to some other reason, such as Yahoo Finance failing to record an existing and valid data point.
Substitution with forward filling
We now introduce another scenario alluded to above: securities listed in different locations with different exchange holidays. Looking at the data from 28 August 2020 to 7 September 2020, we can see the late August bank holiday (the 31st) in the London-listed BP share price and US Labour Day (7 September) for Apple. Deleting rows with missing values in one column would discard valuable data in other columns.
When we’re not dealing with time series, a common solution is to substitute missing values with the mean or median of the entire data set. Applied to our financial time series, this is a very poor choice. If we take the mean of the entire year 2020, we end up with 313.31 for BP on August 31st and 95.62 for Apple on September 7th. Not only do both look like complete outliers, but these means are also computed with data that was not available at the time. We would be introducing look-ahead bias, and that’s the statistical danger zone!
A better way is to forward fill the missing value, which means substituting our NaNs with the last available (prior) observation. This is shown right below. The value for BP on Aug 31st is now the same as on the last available day before (28th) and the value for Apple on Sept 7th is the same as the last available data point from Sept 4th. That way we do not introduce look-ahead bias. Any methodology that adds or removes values is likely to change the distribution of the original data. In addition to being careful not to introduce statistical biases such as look-ahead, we must also aim to find the approach that produces a distribution that is as close as possible to the original data.
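The contrast between mean substitution and forward filling can be sketched with illustrative numbers (not the actual BP and Apple prices):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2020-08-28", "2020-08-31", "2020-09-01"])
prices = pd.DataFrame(
    {
        "BP.L": [292.0, np.nan, 288.5],  # NaN: UK bank holiday on Aug 31st
        "AAPL": [124.8, 129.0, 134.2],
    },
    index=idx,
)

# Mean substitution uses future data -- look-ahead bias
mean_filled = prices.fillna(prices.mean())

# Forward filling only uses the last prior observation
ffilled = prices.ffill()
print(ffilled.loc["2020-08-31", "BP.L"])  # 292.0, carried over from Aug 28th
```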
Imputation with Pandas
Let’s consider some non-sequential data, i.e. not a time-series problem where order matters, as discussed above. We create a column with some values and NaNs. To fill those missing data points we can use the Pandas method interpolate(). I suggest inspecting the documentation, as there are many options and parameters that can be customised. Two examples, for linear and polynomial interpolation, are shown below. More advanced options such as splines are also available. The key is to find a method that retains the characteristics of the original distribution as much as possible. This can be measured by comparing the moments of the distribution before and after imputing missing values. Pandas conveniently provides methods for this: mean() as a measure of central tendency, std() and var() as measures of dispersion, and higher moments via skew() and kurt().
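A sketch of both interpolation flavours and the before/after moment check (the polynomial method requires SciPy under the hood; the series values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0, 9.0])

lin = s.interpolate(method="linear")
poly = s.interpolate(method="polynomial", order=2)

print(lin.tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]

# Compare the moments of the distribution before and after imputation
for name, x in [("original", s), ("linear", lin), ("polynomial", poly)]:
    print(name, round(x.mean(), 3), round(x.std(), 3),
          round(x.skew(), 3), round(x.kurt(), 3))
```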
Dealing with different frequencies
Sometimes our financial data comes in different frequencies. In our example below we compare M1 Money Stock (released weekly) with daily prices for Bitcoin. The St. Louis Fed has an excellent data service called FRED and we use their API to download data. (You need to sign up and create a free account to generate your own API token.)
We can see that there are plenty of NaNs as a result of the daily vs weekly data frequency. To handle those missing values we have two resampling choices: increasing the frequency of the M1 data to daily (up-sampling) or decreasing the frequency of Bitcoin prices to weekly (down-sampling).
Up-sampling is very straightforward. Given that it’s a time series and we do not want to introduce look-ahead bias, we simply use forward fill, as before, to replace the NaNs in the M1 data.
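Up-sampling can be sketched as follows, with a tiny synthetic weekly series standing in for the M1 data:

```python
import pandas as pd

# Weekly observations (stand-in for the M1 series; illustrative values)
weekly = pd.Series(
    [4000.0, 4010.0, 4025.0],
    index=pd.date_range("2020-01-05", periods=3, freq="W-SUN"),
)

# Re-index onto a daily calendar, then forward fill -- no look-ahead bias
daily_index = pd.date_range("2020-01-05", "2020-01-19", freq="D")
m1_daily = weekly.reindex(daily_index).ffill()

print(m1_daily["2020-01-08"])  # 4000.0 -- carried forward from Jan 5th
```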
Down-sampling is the obvious alternative. We create a new DataFrame with weekly data using the resample() method. This is a powerful tool, and I encourage you to consult the documentation to find out more. We need to pass it a frequency, here “W” for weekly, and tell it how to aggregate. We could take the weekly average, for example, by appending mean(). However, I use last(), which takes the last available value of each seven-day period. There are various pros and cons, but a more detailed discussion is beyond the scope of this article.
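Down-sampling, sketched on a synthetic daily series standing in for the Bitcoin prices:

```python
import numpy as np
import pandas as pd

# Daily prices (stand-in for Bitcoin; illustrative values)
idx = pd.date_range("2020-01-01", "2020-01-14", freq="D")
btc = pd.Series(np.arange(7000.0, 7014.0), index=idx)

# Down-sample to weekly, keeping the last value of each seven-day period
btc_weekly = btc.resample("W").last()

# Taking the weekly average instead is one word away: btc.resample("W").mean()
print(btc_weekly)
```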
In both cases it looks like the extraordinary expansion of the money supply in early 2020 preceded the rally in Bitcoin. How we aggregate frequencies is also a question of judgement. If our aim is to predict Bitcoin prices or returns and we have many other daily data sets to use as model inputs, it might be better to up-sample the M1 data rather than remove six out of seven daily data points from the Bitcoin price series by down-sampling. If we’re just interested in a chart showing the general long-term development, weekly data points may suffice. As with most issues related to missing values and data frequency, the answer is often: “it depends”!
This article is also cross-posted on LinkedIn.