Using Granger Causality and SARIMAX Models
Afolabi Lagunju
Introduction
Every day, companies face decisions that involve navigating financial markets, optimising their supply chain operations, or formulating strategies to maintain a competitive edge amidst evolving trends. Nevertheless, high forecasting accuracy can prove elusive when time series models fail to account for interconnected events or other time series that influence the subject of prediction.
In this article, we will explore the concept of leading indicators, how to identify them, and how to leverage them to improve time series forecasting. We will dive into the practical implementation using Python and real-world data sourced from the Federal Reserve Economic Data.
What are Leading Indicators?
Leading indicators are datasets that help in forecasting future trends or activities. An everyday example of a leading indicator is the sudden appearance of cloud cover, which could signal the likelihood of a thunderstorm occurring within the next hour.
How to Identify & Leverage Leading Indicators
- Domain Knowledge: Like all Data Science projects, we start by understanding the domain we will be operating within. A key aspect of this process is to identify variables or factors that are likely to influence the time series we want to forecast. In our practical implementation, we will be forecasting Beer, Wine, and Liquor sales [d1], so factors like Consumer Price Index [5, d2], Liquor Price [1, d3], Real Personal Income [5, d4], Working Age Population [5, d5], Unemployment Rate [4, d6], Labour Force Participation [4, d7], Social Benefits [4, d8], and Consumer Loans [2, d9] will be our potential leading indicators.
- Data Exploration & Visualisation: Next, we employ Time Series Decomposition to analyse the seasonal components of our dependent time series alongside its potential leading indicators. Our objective is to identify peaks or troughs in our leading indicators that precede similar or opposite changes in our dependent time series. It is crucial to note that the relationship between leading indicators and the dependent time series can be either positive or negative. Although we will not be going into this in our implementation, you can use the seasonal_decompose function from statsmodels to achieve this.
- Statistical Tests: In this step, we validate the influence of our hand-picked potential leading indicators on the dependent time series by employing the Granger Causality test.
- Preprocessing: Next, we scale our data to ensure that all features are within the same range, then we apply Principal Component Analysis (PCA) to eliminate multicollinearity between our leading indicators.
- Model Building: Finally, we build our SARIMAX model using the auto_arima function from pmdarima module, and setting the leading indicators as the exogenous values amongst other parameters.
What is Granger Causality?
Granger Causality, first introduced in 1969 by Clive Granger [3], is a statistical hypothesis test that helps determine whether changes in one time series can predict or “cause” changes in another. It is implemented as a function in statsmodels.
What is SARIMAX?
SARIMAX stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous variables. As the name suggests, the model combines several components: autoregression (AR), differencing (I, for integrated), moving average (MA), and the inclusion of external factors (the “X” part), which is where we will plug in our leading indicators.
Python Implementation
Before we start, create an account with the Federal Reserve Economic Data (FRED) service and request an API key from the FRED website. Please note that this product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.
We start with installing and loading the needed modules.
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from pandas.tseries.offsets import MonthEnd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
from pmdarima import auto_arima
Next, we’ll create a custom function to read the data using FRED API.
FRED_API_KEY = '__YOUR_API_KEY__'

# Function to read data from FRED API
def get_fred_data(data_id, data_name):
    response = requests.get(f'https://api.stlouisfed.org/fred/series/observations?series_id={data_id}&api_key={FRED_API_KEY}&file_type=json')
    df = pd.DataFrame(response.json()['observations'])[['date', 'value']].rename(columns={'value': data_name})
    df[data_name] = pd.to_numeric(df[data_name], errors='coerce')
    df['date'] = pd.to_datetime(df['date']) + MonthEnd(1)
    df.set_index('date', inplace=True)
    df.index.freq = 'M'
    return df
Now, let’s read our data and store it in a pandas dataframe.
dependent_timeseries_id = 'MRTSSM4453USN'
dependent_timeseries_name = 'liquor_sales'

potential_leading_indicators = {
    'USACPIALLMINMEI': 'consumer_price_index',
    'PCU44534453': 'liquor_ppi',
    'DSPIC96': 'real_income',
    'LFWA25TTUSM647S': 'working_age_population',
    'UNRATENSA': 'unemployment_rate',
    'LNU01300000': 'labor_force_participation',
    'A063RC1': 'social_benefits',
    'CCLACBM027NBOG': 'consumer_loans',
}
# Read dependent time series
timeseries = get_fred_data(dependent_timeseries_id, dependent_timeseries_name)

# Join timeseries with potential leading indicators
for data_id, data_name in potential_leading_indicators.items():
    df = get_fred_data(data_id, data_name)
    timeseries = timeseries.join(df)
# We will start our analysis from Jan-2010
timeseries = timeseries['2010':]
# Add the month for which we want to predict liquor_sales
timeseries = timeseries.reindex(timeseries.index.union([timeseries.index[-1] + timeseries.index.freq]))
timeseries
A quick visual analysis of our data shows that our dependent time series (liquor sales) more or less follows the same cycle every 12 months. We will use this 12-month cycle as a parameter later on in our time series forecasting.
timeseries[dependent_timeseries_name].plot(figsize=(20,8));
Before we test for causality, we need to confirm the stationarity of our time series data. To achieve this, we will use the Augmented Dickey–Fuller test. If our dataset fails this stationarity test, we must employ a recursive differencing approach until it satisfies the test criteria.
# Create a copy of the timeseries to use for the tests.
# Be sure to exclude the additional row we added in the previous step
timeseries_for_gc_tests = timeseries[:-1]
all_cols = timeseries_for_gc_tests.columns

stationary_cols = []
diff_times = 0
while True:
    # Test for stationarity
    for col in all_cols:
        adf, pvalue, lagsused, observations, critical_values, icbest = adfuller(timeseries_for_gc_tests[col])
        if pvalue <= 0.05:
            stationary_cols.append(col)

    # Difference the time series if at least one column fails the stationarity test
    if set(stationary_cols) != set(all_cols):
        timeseries_for_gc_tests = timeseries_for_gc_tests.diff().dropna()
        diff_times += 1
        stationary_cols = []
    else:
        print(f'No of Differencing: {diff_times}')
        break
Now that our time series data is loaded into a pandas dataframe and passes the stationarity test, we test for causality using the Granger causality test.
maxlag = 6  # maximum number of past time periods to look for potential causality. We cap ours at 6 months
leading_indicators = []

for x in all_cols[1:]:
    gc_res = grangercausalitytests(timeseries_for_gc_tests[[dependent_timeseries_name, x]], maxlag=maxlag, verbose=0)
    leading_indicators_tmp = []
    for lag in range(1, maxlag+1):
        ftest_stat = gc_res[lag][0]['ssr_ftest'][0]
        ftest_pvalue = gc_res[lag][0]['ssr_ftest'][1]
        if ftest_pvalue <= 0.05:
            leading_indicators_tmp.append({'x': x, 'lag': lag, 'ftest_pvalue': ftest_pvalue, 'ftest_stat': ftest_stat, 'xlabel': f'{x}__{lag}_mths_ago'})
    if leading_indicators_tmp:
        leading_indicators.append(max(leading_indicators_tmp, key=lambda d: d['ftest_stat']))
# Display leading indicators as a dataframe
pd.DataFrame(leading_indicators).reset_index(drop=True)
From our tests, we can see that the current month's liquor sales are affected by the Consumer Price Index [d2] and Consumer Loans [d9] of 2 months ago, and the Labour Force Participation [d7] of 6 months ago.
Now that we have established our leading indicators, we will shift their records so that their lagged figures are in the same row as the current data of liquor_sales which they “cause”.
# Shift the leading indicators by their corresponding lag periods
for i in leading_indicators:
    timeseries[i['xlabel']] = timeseries[i['x']].shift(periods=i['lag'], freq='M')

# Select only the dependent time series and leading indicators for further analysis
timeseries = timeseries[[dependent_timeseries_name, *[i['xlabel'] for i in leading_indicators]]].dropna(subset=[i['xlabel'] for i in leading_indicators], axis=0)
timeseries
Next, we scale our data so that all features are within the same range, then apply PCA to eliminate multicollinearity between our leading indicators.
# Scale dependent timeseries
y_scaler = StandardScaler()
dependent_timeseries_scaled = y_scaler.fit_transform(timeseries[[dependent_timeseries_name]])

# Scale leading indicators
X_scaler = StandardScaler()
leading_indicators_scaled = X_scaler.fit_transform(timeseries[[i['xlabel'] for i in leading_indicators]])
# Reduce dimensionality of the leading indicators
pca = PCA(n_components=0.90)
leading_indicators_scaled_components = pca.fit_transform(leading_indicators_scaled)

leading_indicators_scaled_components.shape
Finally, we can build our SARIMAX model with the help of auto_arima. For the purpose of this implementation, we will leave all parameters at their defaults, with the exception of the seasonality flag and the number of periods in each cycle (m).
We will train our model using the timeseries data up until '2024-05-31', test with the '2024-06-30' data, then predict the '2024-07-31' liquor sales.
# Build SARIMAX model
periods_in_cycle = 12  # number of periods per cycle. In our case, it's 12 months
model = auto_arima(y=dependent_timeseries_scaled[:-2], X=leading_indicators_scaled_components[:-2], seasonal=True, m=periods_in_cycle)
model.summary()
# Forecast the next two periods
preds_scaled = model.predict(n_periods=2, X=leading_indicators_scaled_components[-2:])
pred_2024_06_30, pred_2024_07_31 = np.round(y_scaler.inverse_transform([preds_scaled]))[0]

print("TEST\n----")
print(f"Actual Liquor Sales for 2024-06-30: {timeseries[dependent_timeseries_name]['2024-06-30']}")
print(f"Predicted Liquor Sales for 2024-06-30: {pred_2024_06_30}")
print(f"MAPE: {mean_absolute_percentage_error([timeseries[dependent_timeseries_name]['2024-06-30']], [pred_2024_06_30]):.1%}")
print("\nFORECAST\n--------")
print(f"Forecasted Liquor Sales for 2024-07-31: {pred_2024_07_31}")
By following the process step-by-step, we forecasted the liquor sales figure for July 2024 with an estimated MAPE of just 0.4%.
To further improve the accuracy of our prediction, we can explore adding more potential leading indicators and finetuning the models used.
Conclusion
Leading indicators, as we’ve explored, serve as early signals of future trends, providing a crucial edge in anticipating changes before they fully materialise. By leveraging techniques such as Granger causality tests to identify leading indicator series and incorporating them into forecasting models, we can significantly enhance the precision and robustness of our predictions.
Thank you for Reading
I hope you enjoyed this article and are inspired to try this on your dataset. Follow me on Medium for more Data Science posts like this, and let’s connect on LinkedIn.
References
[1] Chaloupka FJ, Grossman M, Saffer H. The effects of price on alcohol consumption and alcohol-related problems. Alcohol Res Health. 2002;26(1):22–34. PMID: 12154648; PMCID: PMC6683806.
[2] Cuffe, Harold E. and Christopher G. Gibbs. “The Effect of Payday Lending Restrictions on Liquor Sales.” Banking & Insurance eJournal (2015)
[3] Granger, C. W. J. “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods.” Econometrica 37, no. 3 (1969): 424–38. https://doi.org/10.2307/1912791.
[4] Jørgensen MB, Pedersen J, Thygesen LC, Lau CJ, Christensen AI, Becker U, Tolstrup JS. Alcohol consumption and labour market participation: a prospective cohort study of transitions between work, unemployment, sickness absence, and social benefits. Eur J Epidemiol. 2019 Apr;34(4):397–407. doi: 10.1007/s10654-018-0476-7. Epub 2019 Jan 10. PMID: 30627937; PMCID: PMC6451700.
[5] Nelson, Jon P., Economic and Demographic Factors in U.S. Alcohol Demand: A Growth-Accounting Analysis. Empirical Economics, Vol. 22, No. 1, March 7, 1997. Available at SSRN: https://ssrn.com/abstract=4686
[6] Prabal, K. De., Drinking during downturn: New evidence from the housing market fluctuations in the United States during the Great Recession. Economics & Human Biology, Vol. 43, December, 2021.
Data Citations
[d1] U.S. Census Bureau, Retail Sales: Beer, Wine, and Liquor Stores [MRTSSM4453USN], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MRTSSM4453USN, August 28, 2024.
[d2] Organization for Economic Co-operation and Development, Consumer Price Indices (CPIs, HICPs), COICOP 1999: Consumer Price Index: Total for United States [USACPIALLMINMEI], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/USACPIALLMINMEI, August 28, 2024.
[d3] U.S. Bureau of Labor Statistics, Producer Price Index by Industry: Beer, Wine, and Liquor Retailers [PCU44534453], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/PCU44534453, August 28, 2024.
[d4] U.S. Bureau of Economic Analysis, Real Disposable Personal Income [DSPIC96], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/DSPIC96, August 28, 2024.
[d5] Organization for Economic Co-operation and Development, Infra-Annual Labor Statistics: Working-Age Population Total: From 25 to 54 Years for United States [LFWA25TTUSM647S], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/LFWA25TTUSM647S, August 28, 2024.
[d6] U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATENSA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/UNRATENSA, August 28, 2024.
[d7] U.S. Bureau of Labor Statistics, Labor Force Participation Rate [LNU01300000], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/LNU01300000, August 28, 2024.
[d8] U.S. Bureau of Economic Analysis, Personal current transfer receipts: Government social benefits to persons [A063RC1], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/A063RC1, August 28, 2024.
[d9] Board of Governors of the Federal Reserve System (US), Consumer Loans: Credit Cards and Other Revolving Plans, All Commercial Banks [CCLACBM027NBOG], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CCLACBM027NBOG, August 28, 2024.