
The Autocorrelation Function

The purpose of this tutorial is to show a simple technique, called autocorrelation, to estimate the periodicity of a time series.

This tutorial is part of a longer series that focuses on how to analyse time series.

Introduction

In the previous part of this tutorial, Time Series Decomposition, we have seen how it is possible to decompose sales data into its original components. One of the inputs of this process is knowing the exact periodicity of the seasonal components. When it comes to real data, this is rarely known in advance.

The Correlation Coefficient

The first step is to find a way of measuring how similar two time series are. There are countless ways of doing this, depending on the underlying assumptions about your data. The most commonly used for this kind of application is called correlation. The correlation between two functions (or time series) is a measure of how similarly they behave. It can be expressed as:

    \[corr \left(X, Y \right) = \frac{cov\left(X, Y\right)}{std\left(X\right) std\left(Y\right)}\]

with std\left(X\right) and mean\left(X\right) being the standard deviation and the mean of X, respectively:

    \[std\left(X\right) = \sqrt{  \frac{1}{N} \sum_{i=1}^{N} \left[ X_i - mean\left(X\right) \right]^2 }\]

    \[mean\left(X\right) = \frac{1}{N} \sum_{i=1}^{N} X_i\]

The mean is simply the average of the whole time series. The standard deviation, instead, indicates how much the points of the series tend to distance themselves from the mean. This quantity is closely related to the variance, defined as:

    \[var\left(X\right) = std\left(X\right)^2\]

When the variance is zero, all the points in the series are equal to the mean. A high variance indicates that the points are scattered far from it.
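
As a quick example, consider the short series X = \left(1, 2, 3, 4\right):

    \[mean\left(X\right) = \frac{1+2+3+4}{4} = 2.5\]

    \[var\left(X\right) = \frac{\left(-1.5\right)^2 + \left(-0.5\right)^2 + \left(0.5\right)^2 + \left(1.5\right)^2}{4} = 1.25\]

    \[std\left(X\right) = \sqrt{1.25} \approx 1.118\]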

The term cov\left(X, Y\right) represents the covariance between X and Y, which generalises the concept of variance to two time series instead of one. The covariance provides a measure of how much two time series change together. It does not necessarily account for how similar they are, but for how similarly they behave. More precisely, it captures whether the two time series increase and decrease at the same time.

The covariance is calculated as follows:

    \[cov \left( X, Y \right) = \frac{1}{N} \sum_{i=1}^{N} \left[  X_i - mean\left(X\right)   \right]\left[ Y_i - mean\left(Y\right) \right]\]

and it is easy to see that indeed cov\left(X,X\right) = var\left(X\right).
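
The definition translates almost directly into code. Below is a minimal sketch; it assumes that both series have the same length, and relies on the Mean helper shown later in this tutorial:

public float Covariance (float [] x, float [] y)
{
	// Assumes x and y have the same length
	float meanX = Mean(x);
	float meanY = Mean(y);

	float sum = 0;
	for (int i = 0; i < x.length; i ++)
		sum += (x[i] - meanX) * (y[i] - meanY);
	return sum / x.length;
}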

Looking back to the definition of correlation, it is now easy to understand what it is trying to capture. It is a measure of how similarly X and Y behave, normalised by their standard deviations to obtain a value between -1 and +1. When both time series tend to increase (or decrease) over time in a similar fashion, they will be positively correlated. Conversely, if one goes up and the other goes down, they will be negatively correlated.
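
Similarly, the correlation coefficient can be sketched on top of the Covariance function above; since std\left(X\right) = \sqrt{cov\left(X,X\right)}, the same function can be reused for all three terms (again, this assumes series of equal length and non-zero variance):

public float Correlation (float [] x, float [] y)
{
	// std(X) = sqrt(cov(X,X)), so Covariance can be reused for both terms
	float stdX = (float) Math.sqrt(Covariance(x, x));
	float stdY = (float) Math.sqrt(Covariance(y, y));
	return Covariance(x, y) / (stdX * stdY);
}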

Autocorrelation Function

The idea behind the concept of autocorrelation is to calculate the correlation coefficient of a time series with itself, shifted in time. If the data has a periodicity, the correlation coefficient will be higher when the shift matches the period (or a multiple of it).

The first step is to define an operator that shifts a time series in time, causing a delay of t. This is known as the lag operator:

    \[lag\left(X_i,t\right) = X_{i-t}\]
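
In code, the lag operator can be sketched as a function that returns a shifted copy of the series. The double modulo below simply keeps the index positive, so that the series wraps around:

public float [] Lag (float [] x, int t)
{
	float [] shifted = new float[x.length];
	for (int i = 0; i < x.length; i ++)
		// The double modulo keeps the index positive, so the series wraps around
		shifted[i] = x[((i - t) % x.length + x.length) % x.length];
	return shifted;
}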

The autocorrelation of a time series with lag t is defined as:

    \[autocorr\left(X,t\right) = corr\left[X, lag\left(X,t\right) \right]\]

which, since shifting a time series does not change its standard deviation (that is, std\left[lag\left(X,t\right)\right] = std\left(X\right)), can also be expressed as:

    \[autocorr\left(X,t\right)=\frac{cov\left[X, lag\left(X,t\right)\right]}{std\left(X\right) std\left[lag\left(X,t\right)\right]}=\frac{cov\left[X, lag\left(X,t\right)\right]}{var\left(X\right)}=\]

    \[= \frac{ \sum_{i=1}^{N} \left[ X_i - mean\left(X\right) \right]\left[ X_{i-t} - mean\left(X\right) \right] }{ \sum_{i=1}^{N} \left[ X_i - mean\left(X\right) \right]^2 }\]

The code

The form above lends itself well to being written as code. The easiest function is surely the one that calculates the mean of a time series:

public float Mean (float [] x)
{
	float sum = 0;
	for (int i = 0; i < x.length; i ++)
		sum += x[i];
	return sum / x.length;
}

A little more complicated is the case of the autocorrelation function. It creates an array which will contain the final result: the t-th element contains autocorr\left(X,t\right). Only the first \frac{N}{2} lags are present, since the function repeats itself after that point.

public float [] Autocorrelation (float [] x)
{
	float mean = Mean(x);

	float [] autocorrelation = new float[x.length/2];
	for (int t = 0; t < autocorrelation.length; t ++)
	{
		float n = 0; // Numerator
		float d = 0; // Denominator

		for (int i = 0; i < x.length; i ++)
		{
			float xim = x[i] - mean;
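			// Inline lag operator: the modulo makes the time series loop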
			n += xim * (x[(i + t) % x.length] - mean);
			d += xim * xim;
		}

		autocorrelation[t] = n / d;
	}

	return autocorrelation;
}

The expression x[(i + t) % x.length] implements an inline lag operator. It shifts i by t, and uses the modulo operator so that the time series loops. If this is not the desired behaviour, you should only loop up to x.length - t, as in the sketch below.
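
For reference, here is what a non-looping variant could look like (AutocorrelationNoWrap is just a placeholder name; note how, for each lag t, only the first x.length - t samples overlap with the shifted series):

public float [] AutocorrelationNoWrap (float [] x)
{
	float mean = Mean(x);

	// The denominator does not depend on t, so it is computed once
	float d = 0;
	for (int i = 0; i < x.length; i ++)
		d += (x[i] - mean) * (x[i] - mean);

	float [] autocorrelation = new float[x.length / 2];
	for (int t = 0; t < autocorrelation.length; t ++)
	{
		float n = 0;
		// Only the first x.length - t samples overlap with the shifted series
		for (int i = 0; i < x.length - t; i ++)
			n += (x[i] - mean) * (x[i + t] - mean);

		autocorrelation[t] = n / d;
	}

	return autocorrelation;
}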

The Correlogram

Autocorrelation is a relatively robust technique, which doesn't come with strong assumptions on how the data has been created. While in the previous post we used synthetic sales data, this time we can confidently use real analytics data:

This is the plot of the autocorrelation function, also known as the correlogram:

All correlograms start at 1; this is because when t=0, we are comparing the time series with itself. The next clear peak in the correlogram appears at t=7, which indicates that the data has a weekly periodicity. If a distribution has a component of period 7, its correlogram will show peaks at every multiple of 7. This is indeed the case, and it strengthens the idea that the weekly periodicity is not just a statistical coincidence.
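
Once the correlogram is available, a crude way of estimating the dominant period is to look for the lag of its highest peak, ignoring t=0. The function below is a hypothetical sketch of this idea; on data with a strong weekly cycle, it should return 7:

public int EstimatePeriod (float [] autocorrelation)
{
	// Starts from t = 1, since the correlogram is always 1 at t = 0
	int period = 1;
	for (int t = 2; t < autocorrelation.length; t ++)
		if (autocorrelation[t] > autocorrelation[period])
			period = t;
	return period;
}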

Because of this resonance, interpreting correlograms is not always easy. There are several refinements of this technique which can help to extract the actual cycles. The partial autocorrelation function controls for the values of the time series at all shorter lags; this removes the interference and resonance between multiple cycles, highlighting the periodicity more clearly. A more advanced technique, called Power Spectral Density, performs a Fourier analysis on the correlogram to find its main components.
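
As a minimal sketch of the first idea, the partial autocorrelation can be derived from the correlogram itself using the Durbin-Levinson recursion. The function below takes the output of Autocorrelation as its input; it is only one possible implementation, shown here for illustration:

public float [] PartialAutocorrelation (float [] rho)
{
	// rho is the correlogram, with rho[0] = 1
	int n = rho.length;
	float [] pacf = new float[n];
	float [] prev = new float[n];
	float [] curr = new float[n];

	pacf[0] = 1;
	if (n > 1)
		pacf[1] = prev[1] = rho[1];

	for (int k = 2; k < n; k ++)
	{
		// phi(k,k) is the partial autocorrelation at lag k
		float num = rho[k];
		float den = 1;
		for (int j = 1; j < k; j ++)
		{
			num -= prev[j] * rho[k - j];
			den -= prev[j] * rho[j];
		}
		curr[k] = num / den;

		// Update the remaining coefficients phi(k,j)
		for (int j = 1; j < k; j ++)
			curr[j] = prev[j] - curr[k] * prev[k - j];

		for (int j = 1; j <= k; j ++)
			prev[j] = curr[j];

		pacf[k] = curr[k];
	}

	return pacf;
}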

Conclusion

This tutorial concludes the series on time series analysis. We have explored valuable techniques to extract information from temporal data, focusing on their potential and limitations.


