
GPU Sorting


This article introduces the concept of parallel sorting, discussing the theory and implementation of a shader that can sort pixels.

You can read the full series here:

You can find a link to download the Unity source code at the end of this tutorial.


If you did not study Computer Science in the 80s or 90s, chances are you will struggle to understand the excitement some developers have for sorting algorithms. What could seem at first like a minor task turns out to be a cornerstone of Computer Science.

But first of all, what is a sorting algorithm? Imagine that you have a list of numbers. A sorting algorithm is a program that takes that list and reorders its numbers. Sorting algorithms are often introduced during the study of Computational Complexity, another vast subject that will be covered extensively in a future series. There are countless ways in which a list of items can be sorted, and each strategy offers a unique trade-off between cost and performance.


Most of the complexity of sorting algorithms comes from the way the problem is defined and approached. In order to decide how to rearrange the elements, an algorithm must compare numbers. In the scientific literature, the number of comparisons performed is the standard way in which the complexity of a sorting algorithm is measured.

However, things are not that easy: the number of comparisons and swaps depends on the list itself. This is why, in the field of Computational Complexity, there are more objective ways to measure the performance of an algorithm. What is the worst possible scenario for an algorithm? How many steps will it take, at most, to sort the most un-sorted list it can work with? Such a way to approach the problem is known, unsurprisingly, as worst-case analysis. We can ask the same question for the best-case scenario: what is the minimum number of comparisons that an algorithm has to perform to sort an array? Loosely speaking, these bounds are often expressed with the Big O notation, which measures complexity as a function of the number of elements to sort, n. It has been proven that, under this framework, no comparison-based algorithm can perform better than \mathcal{O}\left(n\log n\right). This means that there cannot exist an algorithm that always sorts every input array of length n by making fewer than roughly n\log n comparisons.

❓ This is not what O(n log n) means!
If you are familiar with the Big O notation, you might have realised that the previous paragraph was massively oversimplifying the topic.

You can imagine the Big O notation as a way to express the “order of magnitude” of a function. So \mathcal{O}\left(n\log n\right) with 10 elements to sort doesn’t really mean 33 comparisons. It could be 10, or even 100 times that. What matters for the purpose of a superficial understanding is that the number of comparisons necessary to sort an array grows proportionally to n\log n.

A future series will focus on the subject of Computational Complexity, from a purely Mathematical point of view.
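To get a feeling for these quantities, we can compare the information-theoretic lower bound, \log_2\left(n!\right), with n\log_2 n. This is a quick Python check, purely for illustration:

```python
import math

# Any comparison-based sort must distinguish between the n! possible
# orderings of its input, so in the worst case it needs at least
# log2(n!) comparisons. By Stirling's approximation, log2(n!) grows
# like n*log2(n), which is where the O(n log n) bound comes from.
for n in [10, 100, 1000]:
    lower_bound = math.log2(math.factorial(n))
    print(n, round(lower_bound), round(n * math.log2(n)))
```

For n = 10, the lower bound is around 22 comparisons, while n\log_2 n is around 33: the two differ by a constant factor, but grow at the same rate.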

The Limitations of the GPU

Traditional sorting algorithms are built based on two simple concepts:

  • Comparison: the act of comparing two elements of a list, in order to decide which one is larger;
  • Swaps: the act of swapping the position of two elements, in order to bring the list in a new state that is closer to the desired one.

The list to be sorted is often given in the form of an array. Accessing random elements of an array is extremely efficient, and it is possible to swap any two elements with no restrictions.
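On a CPU, the two primitives can be sketched in a few lines of Python (`compare_and_swap` is an illustrative helper, not a standard function):

```python
def compare_and_swap(a, i, j):
    # Comparison: decide whether the two elements are out of order.
    if a[i] > a[j]:
        # Swap: exchange the two elements in place.
        a[i], a[j] = a[j], a[i]

data = [4, 3, 2, 1]
compare_and_swap(data, 0, 1)
print(data)  # [3, 4, 2, 1]
```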

If we want to use a shader to sort, we first need to understand the intrinsic limitations of this new medium. While it is possible to provide a list of numbers to a shader, that is not the best way to exploit its parallelism. Conceptually, we can imagine the GPU executing the same shader code on each pixel, at the same time. This is not generally what happens, since a GPU is unlikely to have the computational power to parallelise the computation for each pixel. Some parts of the image will be processed in parallel, but we shouldn’t make any assumptions on which ones. For this reason, the shader code has to operate under the constraints of true parallelism, even if in reality that is not necessarily how it runs.

The most serious constraint, however, is the fact that the shader code is localised. If the GPU is executing a piece of code on a pixel at coordinate \left[x, y\right] we cannot write to any other pixel but \left[x, y\right]. On a shader, the swap operation requires the synchronisation of both pixels.
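This constraint can be modelled outside of a shader, too. As a rough sketch (in Python, purely for illustration), a shader pass is a pure function that can sample any pixel of the previous state, but can only write its own:

```python
# A fragment shader, seen abstractly: a function that receives the
# previous state of the whole texture plus the coordinate of one pixel.
# It can read (sample) any pixel, but it can only return the new value
# of its own pixel. Sketched here on a 1D "texture" (a list of values).
def apply_shader(texture, shader):
    return [shader(texture, x) for x in range(len(texture))]

# Example shader: copy the value of the right-hand neighbour (clamped).
def copy_right(texture, x):
    return texture[min(x + 1, len(texture) - 1)]

print(apply_shader([4, 3, 2, 1], copy_right))  # [3, 2, 1, 1]
```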

Sorting on a GPU

The approach presented in this tutorial is conceptually simple.

  • The data to sort is provided in a texture;
  • If we have an array of n elements, we create a texture of size n \times 1 pixels;
  • The value in the red channel of each pixel contains the number to sort;
  • Each rendering pass will bring the array closer to its final state.

For example, let’s imagine we have a list of four numbers to sort: \left[4,3,2,1\right]. We can feed it to a shader by creating a texture of 4\times 1 pixels, storing the values in the red channel (below):

As discussed before, the swap operation is much more complex when done in a shader, since it needs to be executed in parallel by two independent pixels. The easiest way to work with such a limitation is to swap nearby couples of pixels (below):

It’s important to notice that, conversely to what happens on a CPU, the process happens somewhat in reverse. The first pixel samples its previous value, 4, and the one on its right, 3. Since it is on the left side of the swap couple, it needs to take the smaller one. The second pixel in the couple needs to perform the opposite operation.

Since those calculations are done independently, both pieces of code must agree on which value to propagate without any communication. This is where the challenge of parallel sorting lies. If we are not careful, both pixels in the swap couple could pick the same number, de facto duplicating it.
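Under these assumptions, one swapping pass can be simulated as follows (a Python sketch with illustrative names, assuming an even number of pixels). Both pixels of a couple read from the same old texture, so they agree without communicating:

```python
def even_pass(texture):
    # Couples are (0,1), (2,3), ... Assumes an even number of pixels.
    # Every new value is computed from the *previous* texture, as on a GPU.
    new = []
    for x in range(len(texture)):
        if x % 2 == 0:
            # Left pixel of the couple: keep the smaller value.
            new.append(min(texture[x], texture[x + 1]))
        else:
            # Right pixel of the couple: keep the larger value.
            new.append(max(texture[x], texture[x - 1]))
    return new

print(even_pass([4, 3, 2, 1]))  # [3, 4, 1, 2]
```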

If we repeated this swapping process with the same couples, nothing would change: each couple is fully sorted in a single step, so replicating the pass leads to no further progress. The way to overcome this is to change the swapping couples between passes. We can see this in the diagram below:

❓ What about the pixels on the edge?
There are several approaches that can be used to extend this solution so that it works on the first and last pixels.

The first approach is to consider these two pixels as part of a single swap couple. While this works, it requires writing additional conditions in the shader code that are inevitably going to impact performance.

A simpler solution is to clamp the texture. The two extreme pixels will simply swap with themselves (below):
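With clamped sampling, the alternate pass can be sketched like this (again in Python, with illustrative names). The `clamped` helper mimics a clamped texture lookup, so the edge pixels compare against their own value and stay unchanged:

```python
# The alternate pass pairs pixels (1,2), (3,4), ... Thanks to clamping,
# the first and last pixel sample themselves and are left unchanged.
def clamped(texture, x):
    return texture[max(0, min(x, len(texture) - 1))]

def odd_pass(texture):
    new = []
    for x in range(len(texture)):
        if x % 2 == 1:
            # Left pixel of an odd couple: keep the smaller value.
            new.append(min(texture[x], clamped(texture, x + 1)))
        else:
            # Right pixel of an odd couple (or a clamped edge pixel).
            new.append(max(texture[x], clamped(texture, x - 1)))
    return new

print(odd_pass([3, 4, 1, 2]))  # [3, 1, 4, 2]
```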


We can keep alternating these two steps, until the entire image is sorted:

Such a technique has been around for quite a long time, and it exists in many different flavours and variations. It is often referred to as odd-even sort, since it alternates swaps between odd/even and even/odd indices. Its working mechanism is deeply connected with bubble sort, so it is not surprising to also find this algorithm under the name of parallel bubble sort.
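Putting the two alternating passes together, the whole algorithm can be simulated in a few lines of Python. This is an illustrative sketch of the technique, not the actual shader code:

```python
def sort_pass(tex, phase):
    # phase 0 pairs pixels (0,1), (2,3), ...; phase 1 pairs (1,2), (3,4), ...
    # Clamping makes an unpaired edge pixel compare against itself (a no-op).
    n = len(tex)
    new = []
    for x in range(n):
        if x % 2 == phase:
            p = min(x + 1, n - 1)   # left pixel: clamped partner on the right
            new.append(min(tex[x], tex[p]))
        else:
            p = max(x - 1, 0)       # right pixel: clamped partner on the left
            new.append(max(tex[x], tex[p]))
    return new

def odd_even_sort(tex):
    # n alternating passes are enough to sort any list of n elements.
    for step in range(len(tex)):
        tex = sort_pass(tex, step % 2)
    return tex

print(odd_even_sort([4, 3, 2, 1]))  # [1, 2, 3, 4]
```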


When working on a GPU, we should assume that the same shader code is executed on each pixel independently, and at the same time. On a traditional CPU, each comparison/swap counts as an individual operation, so a single shader pass would amount to n operations. Under the assumption that we can compute all of them in parallel, however, the time to execute one or to execute n is the same. As a result, the parallel complexity of each shader pass is 1.

What is the complexity in the worst-case scenario? We can see that each step brings a pixel closer to its final position. The furthest a pixel might have to travel is n positions, so we need at most n steps to sort the list.
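We can check this bound empirically by simulating the alternating passes and counting them. This is a self-contained Python sketch; `passes_needed` is an illustrative helper:

```python
def sort_pass(tex, phase):
    # One parallel pass: phase 0 pairs (0,1), (2,3), ...; phase 1 pairs
    # (1,2), (3,4), ... Clamped edge pixels compare against themselves.
    n = len(tex)
    new = []
    for x in range(n):
        left = (x % 2 == phase)
        p = min(x + 1, n - 1) if left else max(x - 1, 0)
        new.append(min(tex[x], tex[p]) if left else max(tex[x], tex[p]))
    return new

def passes_needed(tex):
    # Repeat alternating passes until the list is sorted, counting them.
    count = 0
    while tex != sorted(tex):
        tex = sort_pass(tex, count % 2)
        count += 1
    return count

# A reversed list is the furthest from sorted: it needs close to the
# full n passes.
for n in [4, 8, 16]:
    print(n, passes_needed(list(range(n, 0, -1))))
```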

If we look at this problem from a more traditional perspective, things are very different. Each shader pass analyses every pixel at least once, which adds complexity n. Coupled with the n passes required, this brings the total complexity to \mathcal{O}\left(n^2\right). This shows how inefficient the odd-even sort really is when executed sequentially.

Coming Next…

The next post will focus on the implementation details of this algorithm. We will see how to recreate it in Unity, using a vertex and fragment shader.

You can read the full series here:


You can download all the assets necessary to reproduce a GPU sorting shader. There are two downloads available:

Depending on the version (Standard or Premium), the download includes the following features:

  • GPU Sorting Shader
  • In Place Sorting Support
  • Sorting by Hue
  • Sorting by Red Channel
  • Sorting by Luminosity
  • Export as a GIF
📧 Stay updated

A new tutorial is released every week.

💖 Support this blog

This website exists thanks to the contribution of patrons on Patreon. If you think these posts have either helped or inspired you, please consider supporting this blog.

