The Mathematics of the Kalman Filter

This is the second part of the series dedicated to one of the most popular sensor de-noising techniques: Kalman filters. This article will introduce the Mathematics of the Kalman Filter, with special attention to a quantity that makes it all possible: the Kalman gain.

You can read all the articles in this online course here:

Introduction

Kalman filtering is a very powerful technique that finds application in a vast number of cases. One of its most popular uses is sensor de-noising. Readings from sensors are typically affected by a certain degree of statistical uncertainty, which can lead to poor measurements. Under certain circumstances, Kalman filters have been proven to be optimal, meaning that no other algorithm will perform better.

This series of tutorials aims not just to explain the theory behind Kalman filters, but to create a fully functioning one! Many books present the equations in their “final” form; while that is a very time-efficient approach, it is not the one that we have chosen for this series.

If we want to understand not just how Kalman filters work, but why they work, it is important to do so step by step. And so, this series will build a Kalman filter “iteratively”, adding new features at the end of every article.

By the end of this article, which will focus on a quantity called the Kalman gain, we will have a fully-functioning Kalman filter which is able to de-noise signals that are not expected to change very rapidly.

The following part will lift this limitation by introducing a much more “complete” version of the Kalman filter, which can handle signals that evolve linearly over time.

Finally, another article will introduce the so-called Extended Kalman Filters, which are designed to handle signals that can evolve in more complex ways.

How it Works

When it comes to Kalman filters, most online tutorials start with the example of a moving train. The position of the train at a given time n (let’s call it x_n) is the unknown property that we want to estimate. If we had direct access to the true position of the train, there would be no need for Kalman filters. The only way to access the position of the train is indirectly, by some sort of measurement; for instance, a GPS. If we had a truly perfect GPS, there would be no need for Kalman Filters. Unfortunately, no measurement apparatus is perfect, and no matter what we do, our measurement (let’s call it z_n) will always be affected by noise, errors and inaccuracies.

From a mathematical point of view, we can model this in the following diagram:

The train (generally referred to as the process) changes its position at every new time interval. This is often referred to as process update. In an ideal scenario, if a train travels at a constant speed of v metres per second, it yields the following relationship:

(1)   \begin{equation*} x_{n+1} = x_{n} + v \, dn\end{equation*}

where dn is the difference in time between two updates.

Each measurement (z_n) will be close to the actual position (x_n), plus or minus some random noise (more on that later). When the system starts, z_0 is the first (and only) measurement taken. Consequently, it is also our best guess of the real position of the train, x_0. The “best guess” for the real position of the train, often referred to as the estimated position, is indicated as \hat{x}_n. When the system starts, the first measurement is also the best estimated position at time 0:

(2)   \begin{equation*} \hat{x}_0 = z_0\end{equation*}

What a Kalman filter does is update and refine the estimated position (\hat{x}_n) as the system evolves. It does so by integrating (fusing) the information about where the system believes the train should be (\hat{x}_n) with where the measurement says it is (z_n).

Both quantities are inherently inaccurate, but combining them can help refine the overall estimate. In this example we used two sources of information, but this can theoretically be expanded to include more readings from different sensors, each with its own degree of reliability. This is commonly known in the academic literature as sensor fusion.

The rest of this article will slowly expand on this simplistic interpretation of Kalman filtering, by progressively introducing more nuanced variants.

Step 1: The Model

Each Kalman filter is tailored to the application it needs to be used for, and it does so by incorporating a model of the system. This model is a simplified mathematical description of how we expect the system to evolve. In the case of a train, the Kalman filter would need to integrate the laws of motion into its equations. This allows the filter to not just passively react to the train’s movement, but to predict its future position based on its current state. Equation (1), for instance, models the fact that the train’s position increases linearly with time.

It is important to understand two things about this:

  • This is a belief: all Kalman filters are built with some pre-existing knowledge of how the system should evolve. The idea that the train moves at a constant speed is a belief, which may or may not be entirely correct;
  • This is a model: the Kalman filter does not expect (or require) (1) to be valid at all times. That equation is a simplified model of how the system is expected to behave at a short time scale. As such, there is no requirement for it to be exact. The more precise the model, the better the system will react. But because Kalman filters are designed to be resilient to noise, errors or inaccuracies in the model are generally well tolerated, as their deviation from the “true” behaviour can simply be interpreted as an extra source of noise.

This leads to an algorithm that works in two separate steps:

  • Time update (or prediction step): based on the current state of the system, we predict its new state.
    In the case of a train, this means predicting its next position based on its previous position and previous velocity.
  • Measurement update (or correction step): the prediction generated in the step before is corrected using the data from a sensor.
    In the case of a train, this means integrating the predicted position with the measurement from a GPS.
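
To make this two-step structure concrete, here is a minimal sketch in Python of how such a loop could be organised. The function names and the sample readings are purely illustrative, and the fixed gain is a placeholder; the actual update equations are derived in the rest of this article:

```python
# A minimal sketch of the predict/correct cycle (illustrative only).

def predict(estimate):
    # Time update: project the current estimate forward using the model.
    # For a quantity that is not expected to change, the prediction is the estimate itself.
    return estimate

def correct(prediction, measurement, gain):
    # Measurement update: blend the prediction with the sensor reading.
    return prediction * (1 - gain) + measurement * gain

estimate = 20.0                                 # initial guess (e.g. the first reading)
for measurement in [20.3, 19.8, 20.1, 20.4]:    # hypothetical sensor readings
    prediction = predict(estimate)
    estimate = correct(prediction, measurement, gain=0.5)
    print(estimate)
```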

Under this framework, the expected position computed in the prediction step is called the a priori state estimate (\hat{x}^{-}_n). “A priori” means “before”, as this is the predicted position of the train before the measurement is performed. Once a measurement is available, the Kalman filter can integrate that information to refine its estimate, which is called the a posteriori state estimate.

📡 What if sensor readings are not always available?

A simpler toy example

Most online tutorials start with the example of a train moving forward along a track; and this tutorial was no exception. Such a scenario is very easy to understand. However, when it comes to the actual Maths, following such a path leads to a more complex derivation. This is because each Kalman filter is tailored to the application it needs to be used for, and in this case that means integrating the laws of motion into our equations. Doing so “pollutes” the derivation of the Kalman filter by adding terms that are not part of the filter itself, but of the model the filter is trying to predict.

In this specific article, we will focus only on the correction step, which is the real heart of the Kalman filter; something that all filters belonging to the Kalman family share. But for this to work, we need to pick a scenario in which the prediction step is virtually non-existent. This means measuring a quantity that is not expected to change over time. More realistically, a quantity that is not expected to change too fast. This leads us to consider a system which was already described in the second diagram of this article:

A good example is the temperature of a well-insulated room. Let’s assume that the real temperature at time n is equal to x_n. Under our assumption, we believe that the system evolves according to the following equation:

(3)   \begin{equation*} x_{n+1} = x_{n}\end{equation*}

This means that we expect the temperature of the room to remain constant, as the room itself is well insulated.

Also, to simplify the notation (yet without loss of generality) we will only consider the evolution of the Kalman filter from a generic time n=0 to n=1:

This has no real implication, but it allows us to simplify the notation, getting rid of the n-1, n and n+1 subscripts.

🌡️ The concept of temperature is a statistical one!

Step 2: The Measurement

If we could access the real temperature of the room (x_n) at any arbitrary time (n), we would have no need for a Kalman filter. The problem is that we can never sample the temperature directly and with absolute precision. Instead, we are forced to rely on an indirect, imprecise measure, which introduces some uncertainty. What we read from a thermometer is not a value of the “real temperature”. It is (to use an old-fashioned example) how much mercury expands when subjected to a certain temperature. The expansion of mercury is a proxy for the temperature, which is the real quantity that we want to estimate.

Measuring the temperature of a room using a thermometer introduces a new type of uncertainty, as the sensor itself has a limited precision.

(4)   \begin{equation*} z_1 = x_1 + w_1\end{equation*}

(5)   \begin{equation*}w_1 \sim \mathcal{N}(0,R)\end{equation*}

The quantity z_1 represents the temperature reading at the time 1, which is a function of the actual temperature x_1 plus some uncertainty modelled by the Gaussian variable w_1, referred to as the measurement noise.

The quantity R measures how reliable the sensor is. You can calculate R experimentally, simply by querying the sensor many times and measuring the variance of its output. In our example, however, we assume it is constant throughout the entire life of the filter. This is not always the case; the precision of a GPS, for instance, depends on how fast you travel. Likewise, it is fair to assume that the precision of a thermometer could be a function of the temperature.
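
As a quick illustration of how R could be estimated in practice, the sketch below simulates a sensor that is queried many times while the true temperature stays fixed, and computes the sample variance of its readings. The true temperature and the noise level of the simulated sensor are arbitrary assumptions made for this example:

```python
import random

TRUE_TEMPERATURE = 20.0   # the (unknown) constant value x
SENSOR_STD_DEV = 0.5      # assumed noise level of the simulated sensor

def read_sensor():
    # z = x + w, with w ~ N(0, R), as in equations (4) and (5)
    return TRUE_TEMPERATURE + random.gauss(0.0, SENSOR_STD_DEV)

readings = [read_sensor() for _ in range(10_000)]
mean = sum(readings) / len(readings)
R = sum((z - mean) ** 2 for z in readings) / (len(readings) - 1)

print(R)  # should be close to SENSOR_STD_DEV ** 2 = 0.25
```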

Visualising Noise and Uncertainty

Before we proceed, you should be familiar with the concept of Gaussian distribution. To get a better idea of the true meaning of (4), you can have a look at the interactive chart below. The blue curve shows the probability density function of z_1; for each value on the X axis, it indicates its relative likelihood of appearing.

[Interactive chart: the probability density function of Z_1, with controls for the measured value z_1 and the sensor variance R.]

If the sensor returns the value 0, it does not mean that the temperature really is 0. Its probability density function shows how likely the other temperatures are.

🔔 Tell me more about the Gaussian distribution!

❓ How come the curve can go above 1?

Step 3. Estimated Position

The purpose of sensor de-noising is, at its core, finding a good guess for x_1, based on the information we have previously collected, and on any available sensor reading. This guess is often indicated in the scientific literature as \hat{x}_1 and is called the estimated state (or, in our example, the estimated temperature) at time 1. When it comes to reading sensors, the simplest approach is to blindly trust them. Which means saying that \hat{x}_1=z_1. Very often, this is also the worst approach you can take. Every sensor reading comes with a certain degree of uncertainty, and relying on its raw value means that you are comfortable integrating the maximum amount of noise possible into your application.

A more reasonable approach is to come up with a different equation for \hat{x}_1. One that, for instance, takes into consideration not just the most recent reading from the sensor, z_1, but also how hot we believed the room was one instant ago. This takes the form of the estimated state at the previous time 0, which is called \hat{x}_0. By relying on both pieces of information we are implying that the actual temperature should not be too far from what we believed it was an instant before.

There are infinitely many ways in which \hat{x}_0 and z_1 can be merged together. The simplest is to combine them linearly:

(6)   \begin{equation*} \hat{x}_1 \overset{\triangle}{=} \hat{x}_0 \left(1-k_1\right) + z_1 \, k_1\end{equation*}

Equation (6) interpolates between \hat{x}_0 and z_1 linearly, using the coefficient k_1\in\left[0,1\right]. When k_1=1, we return to the case in which we fully trust the sensor and we have that \hat{x}_1=z_1. When k_1=0.5, instead, we weight the sensor and our previous guess equally and the new position is simply the average of the two: \hat{x}_1=\frac{\hat{x}_0+ z_1}{2}. If we assume k_1=0, we are implying that the sensor is not to be trusted, and its contribution is simply ignored.
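
As a small sanity check of equation (6), the snippet below (a Python sketch written for this article) interpolates an arbitrary previous estimate and an arbitrary new reading for the three values of k_1 discussed above:

```python
def lerp(previous_estimate, measurement, k):
    # Equation (6): blend the previous estimate with the new reading.
    return previous_estimate * (1 - k) + measurement * k

x_hat_0, z_1 = 20.0, 22.0        # arbitrary previous estimate and new reading
print(lerp(x_hat_0, z_1, 1.0))   # 22.0: trust the sensor completely
print(lerp(x_hat_0, z_1, 0.5))   # 21.0: the average of the two
print(lerp(x_hat_0, z_1, 0.0))   # 20.0: ignore the sensor
```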

What we have done here is called linear interpolation: a technique that all game developers should be very familiar with. In case you are interested in learning more about what it can do and how it can be used, An Introduction to Linear Interpolation is a good starting point.

It is obvious at this point that, if we decide to mix \hat{x}_0 and z_1 linearly, the coefficient k_1 indicates how much trust we have in the raw value obtained from the sensor. The interactive chart below shows a Kalman filter designed for signals that are not expected to change over time. You can try changing the value of the Kalman gain (k) to see how the filter adjusts itself.

[Interactive chart: a Kalman filter applied to a noisy constant signal, with controls for the sensor variance R and the Kalman gain k.]

You can obviously choose the value of k_1 by hand, but what the Kalman filter does is find, at each time step, its optimal value. The term “optimal” means that, under the right conditions and provided certain assumptions hold, it converges to the value of k_1 so that the estimated position \hat{x}_1 is as close as possible to the actual position x_1.

โ“ย Deriving Kalman Filter using Least Squares Minimisation

Step 4. Sensor Fusion

To understand how Kalman finds the optimal value for k_1, we have to take a step back to talk about probability distributions. Every random process has a probability density function (or pdf), which indicates the values it can produce, and how likely they are.

In the case of the thermometer used in this example, we have assumed the pdf of its error to follow a Gaussian distribution. When the sensor reads a value of z_1, we know there is an uncertainty associated with that value, which can be modelled as generated from a Gaussian distribution with variance R, as seen in (4). Incidentally, this means that if we assume the sensor noise to be normally distributed, we can also model the sensor reading itself (z_1) as sampled from a normally distributed variable. Let’s call it Z_1:

(11)   \begin{equation*} Z_1 \sim \mathcal{N}(z_1, R)\end{equation*}

which means that the value returned by the sensor (z_1) is not perfect.

What we are trying to do here is to model all parts of the system as normal distributions. This will allow us to borrow the much needed statistical tools which will help compensate for the presence of noise.

This was straightforward for z_1, since its normal distribution was a direct consequence of assuming that the sensor noise followed a normal distribution in the first place. We cannot simply assume the same for the various \hat{x}_n, because the estimated states are the result of an iterative process which combines the current sensor reading with the previous state estimate. However, as we can see from (6), each \hat{x}_n is a linear combination of two variables: the current reading and the previous estimate. And as it turns out, both are normally distributed. This is definitely true in the very first iteration, where \hat{x}_0=z_0, as seen in (2). In the following iteration, \hat{x}_1 will be a linear combination of z_1 (normally distributed) and \hat{x}_0 (normally distributed, as \hat{x}_0=z_0); and a linear combination of two normally distributed variables has a normal distribution. Following this reasoning, every \hat{x}_n is normally distributed, as it is a linear combination of two normally distributed variables.

Thanks to this inductive reasoning, even the estimated temperature (\hat{x}_0) can be modelled as if it was sampled from a Gaussian distribution (let’s call it \hat{X}_0), as its value comes with a certain degree of uncertainty. While the variances of the process and measurement noises, Q and R, are inputs of the Kalman filter (i.e. they are not assumed to change over time), the variance of \hat{X}_0 is a quantity that depends on time and that we will need to calculate. For now, let’s call it P_0, so that the probability distribution of \hat{X}_0 can be expressed in the form:

(12)   \begin{equation*} \hat{X}_0 \sim \mathcal{N}(\hat{x}_0, P_0)\end{equation*}

Practically speaking, P_0 indicates the overall confidence in the estimated state \hat{x}_0, sampled from the random variable \hat{X}_0. While the variance of the measurement noise R stays the same throughout the entire life of the filter, the confidence in our estimate changes over time. It makes sense to update not only our estimate from \hat{x}_0 to \hat{x}_1, but also how confident we are about its accuracy, from P_0 to P_1. We expect the Kalman filter to converge to the real temperature of the room fairly quickly, which means that its confidence should increase over time. On the other hand, if the temperature is subjected to a sudden change, it might take some time for the filter to recover, and we should trust its current estimate less.

Joint Probability

We now have two different probability distributions, each one providing an estimate of how hot the room is. These distributions are not only associated with a mean (which is their best guess), but also with a variance that indicates how confident that guess is. Their joint probability is a new probability distribution which incorporates all of these pieces of information.

What the Kalman filter does is find the most likely estimate for the temperature of the room by calculating the joint probability of the two distributions, \hat{X}_0 and Z_1, effectively finding the distribution for the new estimate of the room temperature, \hat{X}_1.

If we assume \hat{X}_0 and Z_1 to be independent of each other, their joint distribution can be found by multiplying them together:

(13)   \begin{equation*}\hat{X}_1  \overset{\triangle}{=} \hat{X}_0 \times Z_1\end{equation*}

One of the things that make the derivation of the Kalman filter algorithm so clean and efficient is that the product of two Gaussian distributions is another Gaussian distribution:

(14)   \begin{equation*} \mathcal{N}\left(\mu_1, \sigma_1^2\right) \times\mathcal{N}\left(\mu_2, \sigma_2^2\right) =\mathcal{N}\left(\frac{\mu_1 \sigma_2^2 + \mu_2 \sigma_1^2}{ \sigma_1^2+ \sigma_2^2},\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right)\end{equation*}

This means that the process can be repeated indefinitely, without increasing in complexity. And it is, ultimately, what allowed us to assume that even the estimated state (\hat{x}_0) follows a normal distribution, as seen in (12).
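
Equation (14) can also be wrapped into a small helper that, given the means and variances of two Gaussians, returns the mean and variance of their product. This is a sketch written for this article, not part of any library:

```python
def multiply_gaussians(mu_1, var_1, mu_2, var_2):
    # Equation (14): the product of two Gaussian pdfs is (proportional to)
    # another Gaussian, with the following mean and variance.
    mean = (mu_1 * var_2 + mu_2 * var_1) / (var_1 + var_2)
    variance = (var_1 * var_2) / (var_1 + var_2)
    return mean, variance

# Fusing a previous estimate N(20.0, 2.0) with a sensor reading N(22.0, 0.5):
print(multiply_gaussians(20.0, 2.0, 22.0, 0.5))  # (21.6, 0.4)
```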

โ“ Show me the derivation!

We can use (14) to multiply together the probability distributions of \hat{X}_0 and Z_1:

(15)   \begin{equation*} \begin{align}\hat{X}_1 & \overset{\triangle}{=} \hat{X}_0 \phantom{\left(\hat{x}_0, P_0\right)} \times Z_1 \phantom{\left(z_1, R\right)} &=\\&= \mathcal{N}\left(\hat{x}_0, P_0\right) \times\mathcal{N}\left(z_1, R\right) &=\\&=\mathcal{N}\Bigl(\underset{\hat{x}_1}{\underbrace{\frac{\hat{x}_0  R+ z_1 P_0}{P_0+R}}},\underset{P_1}{\underbrace{\frac{P_0  R}{P_0 + R}}}\Bigr) & \overset{\triangle}{=}\\&\overset{\triangle}{=}\mathcal{N}\left(\hat{x}_1, P_1\right)&\end{align}\end{equation*}

Before progressing, it is important to understand what this means in the context of our example. Both \hat{x}_0 and z_1 are different measures of the temperature of the room. They are affected by a random error, so we assume they are sampled from two Gaussian distributions: \hat{X}_0 and Z_1, respectively. Their joint distribution, \hat{X}_1, takes into consideration not only the distribution means (\hat{x}_0 and z_1), but also how confident these values are (represented by their variances, P_0 and R). Since the mean of a Gaussian distribution is the value that is most likely to occur, the mean of the joint distribution (\hat{x}_1) will be the most likely temperature of the room at time 1. Reframing the problem in terms of probability density functions, (15) also allows us to derive the variance of \hat{X}_1, called P_1, which indicates how confident we are in its accuracy.

Step 5. The Kalman Gain

If we look at the new definition that we have derived for \hat{x}_1 in (15), it looks quite different from what was originally proposed in (6). In reality, they are the same quantity, just expressed in a different way. We can see this by rearranging (15) like this:

(16)   \begin{equation*}\hat{x}_1 = \frac{\hat{x}_0 R + z_1 P_0}{P_0+R} =\end{equation*}

    \begin{equation*}\begin{align*}& = \phantom{\hat{x}_0}\frac{\hat{x}_0 R }{P_0+R} & + & \phantom{z_1} \frac{z_1 P_0}{P_0+R} =\\& = \hat{x}_0\underset{1-k_1}{\underbrace{\frac{ R}{P_0+R}}}& +& z_1\underset{k_1}{\underbrace{\frac{P_0}{P_0+R}}} = \\& = \hat{x}_0 \left(1-k_1 \right) & +& z_1 k_1\end{align*}\end{equation*}

There is a hidden beauty in this derivation, which arises from the fact that a probabilistic definition of the problem, based on two Gaussian distributions \hat{X}_0 and Z_1, simplifies into a linear interpolation between \hat{x}_0 and z_1. When expressed in this form, the coefficient k_1 is called the Kalman gain:

(17)   \begin{equation*}k_1 = \frac{P_0}{P_0+R}\end{equation*}

Now we have all the pieces to calculate the temperature of the room \hat{x}_1, based on the previous estimate \hat{x}_0 and the current data from the sensor z_1. This derivation also provides an additional quantity P_1 which can be used to measure the confidence in our result.

To recap:

(18)   \begin{equation*}\begin{align}k_1 &=  \frac{P_0}{P_0+R}  & \textup{Kalman gain} \\\hat{x}_1 & = \hat{x}_0 \left(1-k_1\right) + z_1 \, k_1 & \textup{state update} \\P_1 &=  \frac{P_0  R}{P_0 + R} & \textup{variance update}\end{align}\end{equation*}
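
Translating the recap in (18) directly into code, a single correction step could look like the sketch below. The function kalman_update is a name chosen for this article, and the numbers in the example are arbitrary:

```python
def kalman_update(x_hat_0, P_0, z_1, R):
    # Equation (18): one measurement update for a signal that is not expected to change.
    k_1 = P_0 / (P_0 + R)                        # Kalman gain
    x_hat_1 = x_hat_0 * (1 - k_1) + z_1 * k_1    # state update
    P_1 = (P_0 * R) / (P_0 + R)                  # variance update
    return x_hat_1, P_1

# Previous estimate 20.0 with variance 2.0, new reading 22.0 with variance R = 0.5:
print(kalman_update(20.0, 2.0, 22.0, 0.5))  # (21.6, 0.4)
```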

โ“ย How come Least Squares and the joint probability converge?

Step 6. Iterate

This derivation of the Kalman filter has been shown between an initial time 0 and the following time step 1. In a real application, this process is repeated in a cycle.
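
In code, iterating simply means feeding each new reading back into the same update, carrying the estimate and its variance forward. The sketch below reuses the hypothetical kalman_update function defined above and initialises the filter from the first reading, as in equation (2); the readings and the value of R are made up for illustration:

```python
readings = [20.3, 19.8, 20.1, 20.6, 19.9]   # hypothetical sensor readings
R = 0.25                                    # assumed measurement noise variance

x_hat, P = readings[0], R                   # initialise from the first reading, as in (2)
for z in readings[1:]:
    x_hat, P = kalman_update(x_hat, P, z, R)
    print(x_hat, P)
```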

Like any mathematical model, the Kalman filter works under very strict assumptions. In this case, the evolution of the system has to be linear, and the process and measurement noise have to be Gaussian distributed. For this specific derivation, we have also assumed that the quantity we want to measure never changes.

Visualising The Kalman Filter

Understanding what the Kalman filter really does can be quite challenging at first. The interactive chart below allows controlling the parameters of the random variables which represent the sensor measurement Z_1 and previous estimate \hat{X}_0. The result is the updated estimate \hat{X}_1, which turns out to be another Gaussian distribution.

[Interactive chart: the distributions Z_1 (mean z_1, variance R) and \hat{X}_0 (mean \hat{x}_0, variance P_0), together with their product \hat{X}_1.]

Summary

This post introduced the mathematics of the Kalman filter. This is a list of all the symbols and notations used:

  • x_n: the real temperature of the room at time n;
  • z_n: the measured temperature, which is the data collected from the sensor at time n;
    This is a noisy estimate of the value of x_n;
  • R: the measurement noise, which indicates how reliable the sensor is.
    This is the variance of the sensor readings;
  • \hat{x}_n: the estimated temperature of the room at time n.
    This is an inaccurate guess for x_n, and it is calculated by taking into account \hat{x}_{n-1} and z_n;
  • k_n: the Kalman gain, which represents the best coefficient to merge the estimated temperature and the sensor measurement;
  • P_n: the confidence in the estimated temperature.
    This is the variance of the estimated temperature.

To simplify the notations, we have focused on two generic time intervals referred to as n=0 and n=1.

The next part of this series will extend the equations here presented, so that the filter can work even if the temperature changes.

What’s Next…

You can read all the tutorials in this online course here:
