You can read the full series here:

- Part 1.
**GPU Sorting** - Part 2.
**GPU Sorting**

You can find a link to download the **Unity source code** at the end of this tutorial.

#### Introduction

In the previous part of this tutorial, we have discussed the limitations that we face when working in a parallel environment. Conceptually, the same shader code is executed on each pixel of a texture at the same time. While this is not necessarily what happens on a GPU, we cannot rely on the traditional assumptions that we take for granted on sequential machines.

The previous post introduced the **odd-even sort**, which is one of the simplest sorting algorithms that can be implemented on a GPU. It works by dividing the grouping the pixels on each line of a texture in groups of two. Each one is called a **swap couple**, and the shader swaps them if they are not already in order. The next step shifts the swap couples, and the process repeats. You can see this in the diagram below:

As discussed before, the complexity of this algorithm is , since it takes at most shader passes to sort an image of pixels.

#### Shaders for Simulation

Shaders are usually used to process an input textures. Their result is visualised on the screen, leaving the original textures unaffected. What we want to do here is different, since we need the shader to keep iterating on the same texture. We have discussed how to do this extensive a previous series called How to Use Shaders for Simulations.

The idea behind this is to create two **render textures**, which are special textures amenable to shader manipulation. The diagram below shows how this process works.

The two render textures are cyclically swapped the shader pass can always use the previous one as the input and the other ones as the ouput.

This tutorial relies on the very implementation presented in How to Use Shaders for Simulations. If you have not read that tutorial yet, fear not. While we will focus on the sorting algorithm only, this is all you need to know:

float4 frag(vertOutput i) : COLOR { // X, Y coordinate of the pixel [0, _Pixels-1] float2 xy = (int2)(i.uv * _Pixels); float x = xy.x; float y = xy.y; // UV coordinate of the pixel [0,1] float2 uv = xy / _Pixels; // UV size of one pixel float s = 1.0 / _Pixels; ... }

The **fragment function** of the shader will contain the sorting code. The variable `uv`

indicates the UV coordinates of the current pixel being drawn, while `s`

represents the size of a pixel in UV space.

##### ⭐ Suggested Unity Assets ⭐

**Unity Pro**or

**Unity Plus**subscriptions plans to get more functionality and training resources to power up your projects.

#### Re-ordering the Swap Couple

In the previous part, we have introduced the concept **swap couple**. Pixels are grouped in couples, which are then sorted in a single shader pass. Let’s imagine having two nearby pixels, which colours and represents two numbers. The shader pass has to sample these two pixels and reorder them accordingly.

To do so, the pixel on the left side of the swap couple will sample the two pixels above, and pick as its final colour. The pixel on the right will do the opposite:

The position of the pixel does not only determine which operation to do. It also determines which pixels to sample. **Min pixels** need to sample the pixels at their current position and the one to their right. **Max pixels** need to sample the pixel on their left instead.

// Max operation float3 C = tex2D(_MainTex, uv).rgb; // Current pixel float3 L = tex2D(_MainTex, uv + fixed2(-s, 0)).rgb; // Left pixel result = max(L, C); // Min operation float3 C = tex2D(_MainTex, uv).rgb; // Current pixel float3 R = tex2D(_MainTex, uv + fixed2(+s, 0)).rgb; // Right result = max(C, R);

On top of that, the min and max pixels are swapped after each iteration. The diagram below shows which pixels perform a min operation (blue) and which one performs a max operation (red):

The two determinant factors in deciding which operation to perform are the **iteration number** `_Iteration`

and the **element index** `x`

(or `y`

, if we are sorting columns instead of rows).

By looking at the diagram above, we can derive the final condition that determines whether we have to perform a min or max operation:

// Odd/Even Macros #define EVEN(x) (fmod((x),2)==0) #define ODD(x) (fmod((x),2)!=0) float3 C = tex2D(_MainTex, uv).rgb; // Centre if ( ( EVEN(_Iteration) && EVEN(x) ) || ( ODD (_Iteration) && ODD (x) ) ) { // Max operation float3 L = tex2D(_MainTex, uv + fixed2(-s, 0)).rgb; // Left result = max(L, C); } else { // Min operation float3 R = tex2D(_MainTex, uv + fixed2(+s, 0)).rgb; // Right result = min(C, R); }

That condition makes sure that during even iterations (the second, the fourth, …) pixels in an even position perform a max operation, and pixels in an odd position perform a min operation. On odd iterations (the first, the third, …) this is swapped.

For this effect to work, both render textures must be set to **Clamp**, as the shader code does not have any boundary conditions to prevent accessing pixels outside the texure.

#### Conclusion

You can find more sorting animations in this gallery:

You can read the full series here:

- Part 1.
**GPU Sorting** - Part 2.
**GPU Sorting**

#### Download

Become a Patron!

You can download all the assets necessary to reproduce a GPU sorting shader. There are two downloads available:

Feature |
Standard |
Premium |

GPU Sorting Shader | ✅ | ✅ |

In Place Sorting Support | ❌ | ✅ |

Sorting by Hue | ❌ | ✅ |

Sorting by Red Channel | ❌ | ✅ |

Sorting by Luminosity | ❌ | ✅ |

Export as a GIF | ❌ | ✅ |

Download |
Standard |
Premium |

##### 💖 Support this blog

This website exists thanks to the contribution of patrons on Patreon. If you think these posts have either helped or inspired you, please consider supporting this blog.

##### 📧 Stay updated

You will be notified when a new tutorial is released!

##### 📝 Licensing

You are free to use, adapt and build upon this tutorial for your own projects (even commercially) as long as you credit me.

You are not allowed to redistribute the content of this tutorial on other platforms, especially the parts that are only available on Patreon.

If the knowledge you have gained had a significant impact on your project, a mention in the credit would be very appreciated. ❤️🧔🏻

Great article Alan! One thing I don’t understand:

in these lines

if ( ( EVEN(_Iteration) && EVEN(x) ) ||

( ODD (_Iteration) && ODD (x) )

)

the if statement is always true right? Because it will always be an odd or an even number. So the else statement will never be executed. Am I wrong with my reasoning?

Hi Federico!

If the statement were “EVEN(x) || ODD (X)” yes, you’d be right.

But the condition is slightly different.

It checks is “_Iteration” and “x” are both either EVEN or ODD.

So it’s like: (BOTH EVEN || BOTH ODD).

I hope this makes sense!

Trying to implement this with opengl compute shaders on a 2d list of ints, it works great up until the list of ints hits a certain limit(around 2560) when the sorting stops fully sorting the ints every frame, my condition for calling another swap is while (sFlags[0]|| streak < 2), sFlags[0] is the gpu's way of telling the cpu whether any swaps happened, and streak is ensuring that the gpu had to do no swaps twice in a row(shifted swap and unshifted swap). I feel like the loop condition for doing more parallel swaps should work to ensure the list is sorted every frame but as you can see in this video https://youtu.be/cANG6Yxjy1A (sorting based off height, color representative) the sorting allows many frames to pass before it is completely sorted, what can i do to fix this, it seems like my while condition is good as it is. thx for this article -ben

using glMemoryBarrier(ALL_BIT) after all calls to the gpu for compute and draw shaders.

glFinish() is what i needed to ensure all computing is done before another swap call is made etc… sorry to make a mess of your comments

It would be useful if you wrote more about the performance to expect compared to CPU based sorting of the same type. Is it worthwhile to use the GPU? For example, if you have four million numbers to sort (a 2048×2048 texture), you will need four million iterations from my understanding. In addition there is an overhead of getting the numbers to the GPU and from the GPU after finishing. This is very slow. Furthermore, bubble sort is a very slow brute force algorithm. Given the inflexibility and overhead of the GPU sort, is it even worthwhile to use this rather than implementing a smarter sort method on the CPU?

The animations indicate that for each iteration one whole line is completed sorting in the texture, and that a 2048×2048 pixel texture would only need 2048 passes, and that each new pass would need to compute one less row. Which would significantly reduce the computational requirements. However, this seems to be misleading, as in the text you state that you need as many iterations as pixels, also no mention that each pass can reduce the area covered by the shader.

Hi Tony!

It is totally possible to do sorting algorithms on the GPU in a way that makes sense. The easiest way in Unity would be using compute shaders, which are not subjected to the tricky limitations of “traditional” shaders.

It was not my intention to show a “better”/faster sorting algorithm, but to show a distributed sorting algorithm that works locally. Most of them can access and swap arbitrary positions within a data structure. This is a luxury that you usually do not have with shaders, which is why it was an interesting challenge. It is a way to show how to do a “simple” task under some pretty strong constraints.

That being said, I’ll soon write an article about a novel sorting algorithm, which has a very neat application!