In today’s world, data is everywhere—and often, it’s overwhelming. Datasets can have hundreds or even thousands of features, many of which are redundant or noisy. Visualising data is an essential skill that allows Data Scientists to effectively highlight structures and relationships that would otherwise remain hidden.
This series of tutorials will focus on Principal Component Analysis (PCA), an essential tool in Data Analysis and Visualisation. PCA offers a compelling way to unlock hidden patterns, simplify data, and gain actionable insights. At its core, PCA is a way to literally see the data from a different perspective, so that its most important features are seen first.
Fully understanding PCA requires deep knowledge of Linear Algebra and Statistics, which can be intimidating to some. My aim with this tutorial is to provide a gentle and intuitive introduction to the subject, specifically tailored to programmers.
This first article in the series will feature little to no Maths, in favour of a more visual approach, with the aid of interactive charts and explorable examples.
Introduction
The way we look at something can dramatically change how we perceive it. This is as true for everyday objects, as it is for data.
The profile of a sphere (or a circle) looks the same from every angle you look at it, but that is generally not true for other shapes. The profile of a box (or a rectangle), for instance, changes depending on its rotation. The interactive chart below shows how the horizontal and vertical projections of a rectangle change, based on its orientation ⬇️:
It is easy to see that when the longer side of the box is aligned with the X axis, its horizontal projection is the largest it can be.
The same logic can be applied to datasets, which are made of points rather than being solid objects. But the principle is the same: there is a direction that makes the data appear as spread out as possible. PCA measures this variability through a statistical quantity called variance, which represents how much the points of a dataset differ from their mean. The interactive chart below shows how the spread of a dataset along the X axis changes, based on its orientation ⬇️:
Even in this scenario, there is a direction which maximises the projected variance on the X axis. In the context of PCA, this is known as the first principal component, and it represents the direction along which the data varies the most.
The principal component is the direction along which the data varies the most.
When we are analysing two-dimensional data, we can identify two principal components. They are the two directions along which the data varies the most. If you imagine the data being fit by an ellipse, the principal components can be thought of as its width and height. Likewise, a dataset in 3 dimensions will have 3 principal components, which can be thought of as the width, height and depth of the ellipsoid of best fit.
PCA identifies the directions along which the data varies the most, and rotates it in such a way that the axes themselves are aligned with those directions. This means that PCA changes the frame of reference, so that the data is centred and aligned with the axes of the ellipsoid of best fit.
You can see the result of PCA running on our toy dataset in the charts below, and how it reprojects the data in the frame of reference which maximises the variance.
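If you prefer to see this in code, below is a minimal Python/NumPy sketch of the same idea. The toy dataset is synthetic, and the brute-force angle scan mirrors the rotation in the interactive chart rather than the analytic method discussed later in the series:

```python
import numpy as np

# A small synthetic 2D dataset: an elongated, rotated cloud of points.
rng = np.random.default_rng(42)
points = rng.normal(size=(200, 2)) * [3.0, 0.5]           # wide in x, narrow in y
angle = np.radians(30)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
points = points @ rotation.T + [10.0, 5.0]                # rotate and shift the cloud

# Centre the data on its mean, as PCA does.
centred = points - points.mean(axis=0)

# Brute-force search: project the points onto a direction at angle theta
# and measure the variance of the projections. The first principal
# component is the direction that maximises this variance.
best_theta, best_variance = 0.0, -np.inf
for theta in np.linspace(0, np.pi, 1800):                  # 0.1 degree steps
    direction = np.array([np.cos(theta), np.sin(theta)])
    projections = centred @ direction                      # signed distance along the direction
    variance = projections.var()
    if variance > best_variance:
        best_theta, best_variance = theta, variance

print(f"First principal component at ~{np.degrees(best_theta):.1f} degrees "
      f"(variance {best_variance:.2f})")                   # expected: close to 30 degrees
```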
A practical example
PCA is typically performed on datasets with multiple features. You can imagine a dataset as a spreadsheet: each row corresponds to a specific specimen, and each column to a different measurement of that specimen. One well-known example often used by Data Scientists is the Iris flower data set, which records four different measurements for 150 iris flowers across three different species (Iris setosa, Iris versicolor, and Iris virginica).



The Iris flower dataset contains four-dimensional data, as four features are measured for each flower. This makes it impossible to visually represent the data in its entirety with a single scatter plot.
| Species | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) |
|---|---|---|---|---|
| I. setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| I. setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| … | … | … | … | … |
| I. versicolor | 7.0 | 3.2 | 4.7 | 1.4 |
| I. versicolor | 6.4 | 3.2 | 4.5 | 1.5 |
| … | … | … | … | … |
| I. virginica | 6.2 | 3.4 | 5.4 | 2.3 |
| I. virginica | 5.9 | 3.0 | 5.1 | 1.8 |
One simple way to get around that is to plot two variables against each other. The chart below, for instance, shows the relationship between the petal length and the petal width of the three species of irises measured:
❓ Why are some markers darker?
Some points in the Iris flower dataset are overlapping. The markers in the chart above are semi-transparent, so that it is easier to see when this is happening.
Darker markers indicate that multiple flowers have those exact measurements.
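If you want to reproduce this chart yourself, the sketch below assumes the copy of the Iris flower dataset bundled with scikit-learn (where the third and fourth columns are the petal length and width, in centimetres) and uses Matplotlib for the scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third column: petal length (cm)
petal_width = iris.data[:, 3]    # fourth column: petal width (cm)

# One scatter series per species; semi-transparent markers reveal overlaps,
# as in the chart above.
for species_index, species_name in enumerate(iris.target_names):
    mask = iris.target == species_index
    plt.scatter(petal_length[mask], petal_width[mask],
                alpha=0.5, label=species_name)

plt.xlabel("Petal length (cm)")
plt.ylabel("Petal width (cm)")
plt.legend()
plt.show()
```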
It should be pretty clear that the data has a structure, and that the three populations are somewhat distinct, as they appear as three separate clusters. It is also easy to see that the two variables are correlated: knowing one can give you some information about the other. For instance, if we measure a random iris flower and find that its petals are around 4 centimetres in length, it is likely to be an Iris versicolor, and its petal width will likely be around 1.2 centimetres. This is known as classification which, in this context, is the problem of finding the right species of an Iris, given its measurements.
Classification is a challenging problem, as flower petals have some inherent variability, and two species could have very similar measurements. For instance, if we measured a petal length of 5 centimetres, the classification would not be as clear-cut, as the populations for Iris versicolor and Iris virginica somewhat overlap. This can be seen clearly in the KDE plot below, which shows the distribution of the petal lengths for the three species:
❓ What is a KDE plot?
The plot above is known as a Kernel Density Estimation (KDE) plot, and can be thought of as a continuous variant of a histogram.
A traditional histogram takes a series of points, and “counts” how many fall in each interval (or bin). By their nature, histograms are a discrete tool, and because of this they can be uninformative when the points are not sufficiently dense.
KDE plots are an effective way to build continuous distributions from discrete data. They do so by assuming that every point in the dataset contributes to the overall distribution by adding a small Gaussian distribution.
This has an interesting side effect: it effectively implies that each point has a certain degree of uncertainty. In the case of measurements, this works well, as no measurement can be 100% accurate.
Depending on the problem, a different distribution (i.e.: a different kernel) can be used.
Later in this article, we will discuss how to calculate the KDE values, which may come in handy for classification.
The best possible scenario for classification would be for the petal lengths to be arranged in three well-separated and compact clusters. Now let's imagine for a second the opposite: what would be the absolute worst scenario? If the three clusters overlapped along the X axis, distinguishing them solely based on the petal length would be very difficult—if not impossible.
At the beginning of the article we identified the principal component by rotating a rectangle. This is equivalent to rotating a line which passes through the mean, until we find the one onto which the projected points appear as close together as possible. The chart below shows exactly this, where all points are projected onto the line which minimises their variability (in green):
If the data had been originally presented like that, the job of classifying flowers based on a single measurement along the green line would have been much harder, as the coordinates are as tightly packed as possible along that direction:
Conversely, we can imagine a different direction for the axis, which makes the points as spread out as possible (below, in red):
The improvement from the original distribution is small, but significant:
❓ Isn’t this just simple linear regression?
If you are familiar with Data Analysis, you might also be familiar with the concept of simple linear regression. This is a technique which finds a line that best fits a set of points, according to some metric.
The idea presented in the chart above is conceptually similar, as we have found a line that best fits the data. What differs is the function used to evaluate the fitness of a line.
In the case of linear regression, one of the most common fitness functions used is the mean square error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}\left(x_i\right)\right)^2 \tag{1}$$

where $y_i$ is the vertical coordinate of the $i$-th point in the distribution, and $\hat{y}\left(x_i\right)$ the vertical coordinate of the regression line evaluated at $x_i$. The mean square error is, as the name suggests, the average of the squared errors between a point and the trending line, measured along the Y axis.
While the trending line minimises the MSE, the principal component is aligned with the direction that maximises the variance of the data:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left(z_i - \bar{z}\right)^2 \tag{2}$$

where $z_i$ is the projection of the $i$-th point onto the principal component, and $\bar{z}$ is the mean of the coordinates on the principal component.
The two equations are eerily similar, as both measure the average of squared errors. However, there are some key differences: the main one being that in PCA the variance is calculated along the principal component, not along the original X axis.
As a result, PCA minimises the perpendicular distances from the points to the principal axis (below, in red), while linear regression minimises the vertical distances from the points to the regression line (below, in blue).
For large datasets with compact points, the principal component and the regression line often coincide. But for more irregular distributions, they can diverge quite significantly.
No line is inherently “better”: they are calculated with different techniques, as they attempt to solve different problems. However, PCA can be, and indeed is, used for linear regression: a technique known as Principal Component Regression.
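As a quick sanity check, the sketch below compares the two lines on the petal data. The regression slope comes from NumPy's polyfit; the principal direction is taken from the leading eigenvector of the covariance matrix, which is a shortcut for what this article does visually (and what the next instalment will derive properly):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]                        # petal length (cm)
y = iris.data[:, 3]                        # petal width (cm)

# Simple linear regression: the line that minimises the mean square error (1),
# measured along the vertical (Y) axis.
slope, intercept = np.polyfit(x, y, deg=1)

# First principal component: the direction that maximises the variance (2).
covariance = np.cov(np.column_stack([x, y]), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
first_pc = eigenvectors[:, np.argmax(eigenvalues)]
if first_pc[0] < 0:                        # fix the (arbitrary) sign of the eigenvector
    first_pc = -first_pc

print(f"Regression line:           {np.degrees(np.arctan(slope)):.1f} degrees")
print(f"First principal component: {np.degrees(np.arctan2(first_pc[1], first_pc[0])):.1f} degrees")
```

The two angles are close for this dataset, but they are not the same quantity: one minimises vertical errors, the other perpendicular ones.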
The green axis is aligned with the direction that minimises the variance (i.e.: it makes the points appear as close together as possible), while the red axis is aligned with the direction that maximises the variance. This means that it is the direction which makes the points as spread out as possible. Or, equivalently, the direction along which the data varies the most. In the context of PCA, these are respectively the second and the first principal components.
Because the data we are analysing has only two dimensions, there are only two directions in which it can vary. Consequently, there are only two principal components. The first one accounts for the direction of most variance, while the last one accounts for the direction of least variance.
For our subset of the Iris flower data, the first principal component is a line that passes through the dataset mean, and has an inclination of around 23 degrees. We can verify this by rotating the data and seeing that it is indeed the one associated with the maximum variance:
❓ Why are the two directions orthogonal?
The first principal component for the petal data is at 23 degrees, while the second is at 113 degrees. The two are suspiciously at exactly 90 degrees from each other, and it is not a coincidence.
As it turns out, all principal components identified by PCA are orthogonal to each other. This result is a consequence of the algorithm used by PCA, which will be explored in detail later in the article.
This also means that PCA works under the assumption that the actual principal components in the original data are orthogonal to begin with. If that is not the case (which can happen if the data appears somewhat “skewed”), PCA will still produce orthogonal principal components, which may fail to capture the actual directions in which the data varies the most.
So far, we have identified the principal components geometrically. PCA actually offers an analytic way to calculate them, which is much more computationally efficient. However, understanding how and why that works requires extensive knowledge of linear algebra. In this first article, we will identify all principal components visually; in the next one, we will see how to derive them mathematically and computationally.
Classification
In the previous section, we have identified the principal component of the scatter plot between the petal length and the petal width. How does this help with our classification problem?
Before we introduced the concept of a principal component, the best approach to classification was to rely on the petal length alone. This works because there is a strong correlation between the species and the petal length itself. However, as we have seen before, the distributions of the petal lengths somewhat overlap, making it difficult to distinguish between Iris versicolor and Iris virginica in some circumstances.
The principal component provides the best separation for those clusters, minimising their overlaps. If we classify flowers not based on a single measurement along the original X axis (i.e.: the petal length), but based on their projection along the principal component (which linearly combines both petal length and width), we can obtain a more reliable classification.
This should improve the performance of the classification, at the cost of requiring two measurements instead of just one. For most applications, this is an acceptable price to pay.
If we want to go down this path, we need to calculate the projection of each point onto the principal component. At the end of this article we will discuss the “correct” way to do that according to PCA; for now, let’s do it based on what we know so far. The principal component is a line that always passes through the mean; in the case of the Iris flower dataset, it has an inclination of around 23 degrees. What we can do is simply rotate the entire dataset by -23 degrees around the mean:
❓ Why are we rotating the points in the opposite direction?
Let's imagine that the data lies along a line which makes an angle $\theta$ with the X axis.
If we want to transform the data so that the points lie on the X axis, we have to rotate it by $-\theta$, i.e.: in the opposite direction.
The rotation causes the principal component to be aligned with the horizontal axis. This means that the new X coordinate of each point is effectively the projection along the principal component. This is the new value we can use to classify our flowers!
Mathematically, the new $x'$ and $y'$ values are calculated by rotating each point around the mean, which can be done with the following equations:

$$x' = \left(x - \mu_x\right)\cos\theta + \left(y - \mu_y\right)\sin\theta + \mu_x \tag{3}$$

$$y' = -\left(x - \mu_x\right)\sin\theta + \left(y - \mu_y\right)\cos\theta + \mu_y \tag{4}$$

where:
- $x$: the petal length;
- $y$: the petal width;
- $\theta$: the rotation of the principal component (23 degrees, in our example);
- $\mu_x$: the average petal length;
- $\mu_y$: the average petal width.
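In code, equations (3) and (4) boil down to a couple of lines of NumPy. The sketch below assumes the scikit-learn copy of the Iris dataset and the roughly 23 degree inclination quoted above:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]                 # petal length (cm)
y = iris.data[:, 3]                 # petal width (cm)

theta = np.radians(23)              # inclination of the first principal component
mu_x, mu_y = x.mean(), y.mean()     # the mean the rotation pivots around

# Equations (3) and (4): rotate every point by -theta around the mean,
# so that the principal component becomes aligned with the X axis.
x_rotated = (x - mu_x) * np.cos(theta) + (y - mu_y) * np.sin(theta) + mu_x
y_rotated = -(x - mu_x) * np.sin(theta) + (y - mu_y) * np.cos(theta) + mu_y

# x_rotated is the projection of each flower onto the principal component,
# and can be used as a single, more robust value for classification.
```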
❓ Where do sine and cosine come from?
Equations (3) and (4) rotate the point $(x, y)$ by $-\theta$ degrees around the point $(\mu_x, \mu_y)$.
They are derived from the simpler equations used to rotate a point by $\theta$ degrees around the origin:

$$x' = x\cos\theta - y\sin\theta \tag{5}$$

$$y' = x\sin\theta + y\cos\theta \tag{6}$$

In order to rotate around $(\mu_x, \mu_y)$, we can first translate all points so that $(\mu_x, \mu_y)$ maps onto the origin. We can then perform the rotation around the origin using (5, 6), and finally translate all points back.
Equations (5, 6) should look very familiar if you have any experience with 2D games. If not, I would suggest reading A Gentle Primer on 2D Rotations, which introduces the subject and provides a full derivation of these expressions.
The value of $x'$ (which represents the projection of the original point onto the principal component) can now be used as a more robust classifier for the Iris species. This can be done by checking where the projected value falls relative to the distributions of the flower species in the KDE plot seen before.
For instance, let's try to classify a flower with a short petal, such as the first Iris setosa entry in the table above, which has a petal length ($x$) and width ($y$) of $1.4$ and $0.2$ centimetres, respectively. We can use (3) to calculate its projection $x'$ on the direction of maximum variance:

$$x' = \left(1.4 - \mu_x\right)\cos 23° + \left(0.2 - \mu_y\right)\sin 23° + \mu_x \tag{7}$$

In this case, it is easy to see that the projected value falls in the middle of the Iris setosa cluster, so the classification is fairly unambiguous.
If we repeat this with a different entry, things might not always be that clear: the projected point may fall in between the Iris versicolor and the Iris virginica clusters. In this case, we can look at where it lands on the KDE plot:
While it is true that the KDE value for Iris virginica is slightly higher than the one for Iris versicolor, they are both pretty close. If you are comfortable with a more probabilistic approach, you can simply calculate their relative proportion, and convert them into probabilities:
$$P\left(\text{versicolor}\right) = \frac{KDE_{\text{versicolor}}\left(x'\right)}{KDE_{\text{versicolor}}\left(x'\right) + KDE_{\text{virginica}}\left(x'\right)} \tag{8}$$
❓ How to calculate the KDE distributions?
In this article we have used KDE plots to show the distribution of points along a specific direction. KDE plots are a “continuous” variant of the more traditional histograms.
As previously explained, a KDE plot produces a continuous distribution given a finite number of discrete measurements. It does so by replacing each point with a Gaussian function. This means that every point now has a contribution which can be sampled anywhere, but which is mostly localised around the original position of the point itself.
The KDE plots used in this article have been generated using a Gaussian function with zero mean and unit variance:

$$K\left(u\right) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} \tag{9}$$

which is translated and rescaled appropriately:

$$KDE\left(x\right) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{10}$$

In the equation (10):
- $x$: the position on the line for which we want to calculate the density;
- $x_i$: the value of the $i$-th point in the series;
- $n$: the number of points;
- $h$: the bandwidth, discussed below.

In the chart above, each point in the dataset contributes to the overall KDE distribution by a tiny amount. Deciding how impactful each point should be on the total KDE distribution is a non-trivial task, and is modulated by a parameter called bandwidth ($h$). In this article, we have used the following bandwidth:

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}} \approx 1.06\,\hat{\sigma}\,n^{-\frac{1}{5}} \tag{11}$$

where $\hat{\sigma}$ is the standard deviation of the points. This is the optimal choice for univariate Gaussian data, as it minimises the mean integrated squared error.
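For reference, here is a small Python sketch of equations (9), (10) and (11). The Silverman-style bandwidth below matches the rule described above, but the exact settings used for the charts in this article are an assumption:

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian with zero mean and unit variance, equation (9)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth, equation (11): optimal for Gaussian data."""
    n = len(samples)
    return (4 * samples.std() ** 5 / (3 * n)) ** 0.2

def kde(x, samples, bandwidth=None):
    """Kernel density estimate at position(s) x, equation (10)."""
    samples = np.asarray(samples, dtype=float)
    h = bandwidth if bandwidth is not None else silverman_bandwidth(samples)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Each sample contributes a small Gaussian centred on itself.
    contributions = gaussian_kernel((x[:, None] - samples[None, :]) / h)
    return contributions.sum(axis=1) / (len(samples) * h)
```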
As it turns out:
(12)
which leads to the following probabilities:
(13)
which confirm that the flower we measured is much more likely to belong to the Iris virginica species.
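Putting the last few sections together, here is a sketch of the whole classification pipeline: project the petal measurements onto the first principal component, build one KDE per species (SciPy's gaussian_kde is used here as a stand-in for the hand-rolled version above), and normalise the densities into probabilities. The 23 degree angle and the equal-priors assumption are simplifications:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]                          # petal length (cm)
y = iris.data[:, 3]                          # petal width (cm)

# Project every flower onto the first principal component (equation (3)).
theta = np.radians(23)
mu_x, mu_y = x.mean(), y.mean()
projection = (x - mu_x) * np.cos(theta) + (y - mu_y) * np.sin(theta) + mu_x

# One KDE per species, built from the projected values.
kdes = {name: gaussian_kde(projection[iris.target == index], bw_method="silverman")
        for index, name in enumerate(iris.target_names)}

def classify(petal_length, petal_width):
    """Return P(species), normalising the KDE values at the projected point
    as in equation (8). This implicitly assumes equal priors for the species."""
    p = (petal_length - mu_x) * np.cos(theta) + (petal_width - mu_y) * np.sin(theta) + mu_x
    densities = {name: kde(p)[0] for name, kde in kdes.items()}
    total = sum(densities.values())
    return {name: density / total for name, density in densities.items()}

print(classify(5.0, 1.9))   # a flower whose petal length alone would be ambiguous
```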
Dimensionality Reduction
In the previous section, we have seen how the principal component can be used to improve the classification of flowers. PCA, however, is more often used for another task: dimensionality reduction. That is the problem of taking high-dimensional data, and finding a different way to represent it with fewer dimensions. This is possible because, generally speaking, not all dimensions are equally important, and because there might be correlations between them.
While there could be flowers with potentially any combination of sepal and petal lengths, in reality the three species measured in the Iris flower dataset fall into three distinct groups. This means that the dataset has a structure which we can take advantage of.
In the same way we have identified the direction of maximum variance between the petal lengths and widths, we could do the same between the sepal lengths and widths. This would reduce what was originally a four-dimensional dataset to two variables, which capture the principal components of the petal and sepal data. All points could be reprojected into this new 2D space, making it easier to visualise multi-dimensional data in a meaningful way.
Such an approach would be valid, but would also come with a certain degree of arbitrariness. The choice of finding the principal component between the petal lengths and widths, rather than—let's say—between the petal lengths and sepal widths, is itself arbitrary.
Luckily for us, PCA can be performed on data that has any number of dimensions: including the original Iris flower dataset. The process works in the exact same way, finding the direction along which the four-dimensional data varies the most. The principal component can still be represented with a line (although one that lives in a four-dimensional space), which can be thought of as the axis which maximises the variance.
The principal component is not the only direction in which the data exhibits some variance, which is why in a dataset with $n$ dimensions we can identify $n$ principal components. Once the first principal component is identified, PCA can be applied again to identify the second one. This is the direction of maximum variance, once all the variance from the first principal component has been removed from the data. Let's see how to do that…
First principal component
This process has a simple and intuitive geometrical explanation, but cannot be easily visualised in four dimensions. So please, allow me to describe it on a toy dataset which has only three dimensions:
The red line represents the principal component: it is easy to see that it is aligned with the direction in which the data is most spread out. Or, more formally, the direction which accounts for the maximum amount of variance in the data. You can drag all these 3D charts to rotate them.
Second principal component
The principal component is not the only direction in which the data exhibits some variance. To identify the second principal component, we first need to remove all the variance across the first one. The variance indicates how spread out the points are along a certain direction, so removing all variance means collapsing them onto a 2D plane. You can use the slider below to collapse the variance across the principal component ⬇️:
In the chart above, each point also casts a “shadow” onto the three planes, to make it easier to visualise how the overall shape of the cloud of points changes.
The slider value is used to “scale” the variance across the principal component. When it is $1$, the data is unchanged; as it approaches zero, the data progressively shrinks towards the centre of the cluster, along the first principal component. Eventually, when it reaches $0$, all points collapse onto a 2D plane (in red) which is orthogonal to the first principal component, and passes through the mean of the dataset.
By doing so, the entire dataset has collapsed from three dimensions to just two, as all points now lie on a 2D plane. Moreover, the way in which this plane is derived makes it always orthogonal to the first principal component. Consequently, the new direction associated with the maximum variance on this plane (the second principal component, in green) will also be orthogonal to the first principal component.
The second principal component belongs to the 2D plane, and its calculation is not dissimilar to what we did at the beginning of the tutorial for the petal lengths and widths.
Third principal component
Once the second principal component has been identified, we can repeat the exact same process. All variance across the second principal component is removed by shrinking the data onto what effectively becomes a line:
The dataset has now collapsed into a single dimension, as all points lie on the same line. This line is the third (and final) principal component, and it accounts for the remaining variance in the data. If we repeated the process, all points would simply collapse onto the mean, as there would be no variance left to “spread out” the points.
Incidentally, the last principal component is also the direction along which the data varies the least. This is because all variance has already been accounted for by the previous principal components.
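The “collapse and repeat” procedure can also be written down directly. The sketch below runs it on a synthetic 3D cloud; the direction of maximum variance at each step is found with power iteration on the covariance matrix, which is just one convenient stand-in for the geometric search shown in the charts:

```python
import numpy as np

rng = np.random.default_rng(0)
# A synthetic 3D cloud, stretched differently along three rotated axes.
data = rng.normal(size=(500, 3)) * [4.0, 2.0, 0.5]
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # a random orthogonal rotation
data = data @ q.T
centred = data - data.mean(axis=0)

def direction_of_max_variance(points, iterations=200):
    """Find the direction of maximum variance via power iteration."""
    covariance = np.cov(points, rowvar=False)
    direction = rng.normal(size=points.shape[1])
    for _ in range(iterations):
        direction = covariance @ direction
        direction /= np.linalg.norm(direction)
    return direction

components = []
remaining = centred.copy()
for _ in range(3):
    w = direction_of_max_variance(remaining)
    components.append(w)
    # "Collapse" the variance along w: subtract each point's projection onto it,
    # flattening the cloud onto the plane (then line) orthogonal to w.
    remaining = remaining - np.outer(remaining @ w, w)

for i, w in enumerate(components, start=1):
    print(f"PC{i}: {np.round(w, 3)}, variance along it: {np.var(centred @ w):.2f}")
```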
❓ Is the total variance preserved by PCA?
The variance of the data generally changes based on the direction along which it is calculated. Likewise, rotating the data generally changes the variance measured along any given axis.
However, the total variance (i.e.: the sum of the variance calculated across all axes) remains the same.
One key property of PCA is, consequently, the ability to preserve the total variance of the data. This happens because PCA effectively aligns (= rotates) the dataset to its principal components. PCA might occasionally flip/mirror one of the axes, but that still preserves the angles and distances between the original points.
Another way to intuitively prove this is that all principal components are, by construction, orthogonal to each other. And a change of basis with orthogonal vectors is equivalent to a rotation and/or flip.
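This property is also easy to verify numerically, for instance with scikit-learn's PCA implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris().data
transformed = PCA(n_components=4).fit_transform(data)

# Per-axis variances change, but their sum (the total variance) is preserved.
print(np.var(data, axis=0).sum())          # total variance of the original features
print(np.var(transformed, axis=0).sum())   # total variance after the PCA rotation
```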
PCA & the Iris Flower Dataset
We now have all we need to finally run PCA on the full Iris flower dataset. What you see below is the same dataset, reprojected using only its first two principal components:
❓ What are the principal components of the Iris dataset?
In this article we have not shown the algebraic way to compute the principal components. But in case you are interested, these are the principal components $\mathbf{w}_1, \ldots, \mathbf{w}_4$ for the Iris flower dataset, presented as unit vectors pointing in the directions of maximum variance:
(14)
The principal components are often joined together as columns of a matrix:

$$W = \begin{bmatrix} \mathbf{w}_1 & \mathbf{w}_2 & \mathbf{w}_3 & \mathbf{w}_4 \end{bmatrix} \tag{15}$$

This turns out to be really helpful, as $W$ can be used to reproject any existing point $\mathbf{x}$ into the new frame of reference aligned with the principal components:

$$\mathbf{x}' = W^\top \left( \mathbf{x} - \boldsymbol{\mu} \right) \tag{16}$$

where $\boldsymbol{\mu}$ is the mean of the dataset. In our case, we were only interested in the first two components. This can be calculated easily by creating a matrix which contains only the first two principal components as column vectors:

$$W_2 = \begin{bmatrix} \mathbf{w}_1 & \mathbf{w}_2 \end{bmatrix} \tag{17}$$
All of these steps will be explained in detail in the next instalments of this series.
Reconstructing the data using only two principal components means that some of the information (measured in terms of variability) is lost. A scree plot can be used to plot the percentage of variance that each principal component adds to the final reconstruction. For instance, you can see in the scree plot below that the first principal component in the Iris flower dataset accounts for 92% of the original variation.
Scree plots can be very informative when deciding how many principal components are worth using to reconstruct a dataset. In the case of the Iris flower dataset, it turns out that even just one component could have been more than enough, as the remaining three only account for a mere 8%.
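For completeness, this is how the reprojection and the scree values can be obtained with scikit-learn (the printed percentages are indicative; run the snippet to get the exact figures):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=4).fit(iris.data)

# The principal components as columns of a matrix, as in (15).
# (scikit-learn stores them as rows, and signs may be flipped.)
W = pca.components_.T

# Reproject the data, keeping only the first two principal components.
projected_2d = pca.transform(iris.data)[:, :2]

# Fraction of the total variance explained by each component: the scree plot.
print(np.round(pca.explained_variance_ratio_ * 100, 1))
# roughly [92.5, 5.3, 1.7, 0.5]: the first component alone dominates
```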
Summary
The first post in this series introduced the concept of Principal Component Analysis, one of the simplest tools to reduce the dimensions of a dataset while trying to preserve as much of its original structure as possible.
PCA works by identifying the principal components, which are the directions along which the data exhibits the most variation. It works best under the following assumptions:
- Linearity: the relationship between the variables that PCA is trying to capture must be linear;
- Meaningful variance: PCA assumes that the directions in which data varies the most are more important. This makes PCA very sensitive to scale, as changing the magnitude of one variable can drastically change the overall result (see the sketch after this list);
- Independence: PCA assumes that the underlying variables from which the data is generated are independent/uncorrelated. As a result, the principal components identified by PCA are always orthogonal to each other. If the underlying structure of the data is non-orthogonal (i.e.: the data is somewhat skewed), PCA may not provide the best representation.
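As mentioned in the second point above, PCA is sensitive to the scale of the variables. A common mitigation, sketched below with scikit-learn, is to standardise each feature to zero mean and unit variance before running PCA (whether that is appropriate depends on the dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris().data

# PCA on the raw measurements: features with larger ranges dominate the result.
raw = PCA().fit(data)

# Standardising each feature to zero mean and unit variance removes that bias.
scaled = PCA().fit(StandardScaler().fit_transform(data))

print(raw.explained_variance_ratio_)     # dominated by the widest-ranging features
print(scaled.explained_variance_ratio_)  # a more balanced split of the variance
```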
Running PCA typically produces the following results:
- Principal components: the unit vectors aligned with the directions of maximum variance. If the data is $n$-dimensional there will be $n$ principal components, and they are all orthogonal to each other;
- Projected data: the original dataset is centred around the mean, and the frame of reference is rotated/flipped to align with the principal components. Some implementations also scale the data along the principal components, so that the variance on each axis is $1$;
- Explained variance: the principal components are sorted by how much variance they explain. This makes it possible to measure how much of the variability in the original data can be reconstructed using a subset of the principal components.
PCA finds a variety of applications, such as:
- Visualisation: the lower dimensional representation of a dataset might be better suited for 2D or 3D scatter plots;
- Modelling: the original dataset (which can have many data points) can be replaced with a simpler model. In the case of PCA, this model is represented by the ellipsoid which best fits the original data;
- Compression: high-dimensional data can be compressed by storing only the dimensions which encode the most information. This is a lossy compression, as a certain amount of information is lost in the process;
- Understanding: the dimensionality reduction can help understand which measurements (or their combinations) are more important. For instance, for the purpose of identifying the species of a flower, certain measurements are more important than others.
In this article we have not yet explained how to analytically derive the principal components. This will be the subject of another article in the series.
What’s next…
- PCA for Programmers (Part 1)
- PCA for Programmers (Part 2) 🚧 (coming soon)
Further reading
Interactive and explorables:
- Principal Component Analysis Explained Visually
- Visualizing PCA in 3D, featuring interactive charts
A more mathematical approach