You can read the full series here:

You can find a link to download the Unity source code at the end of this tutorial.

Introduction

In the previous part of this tutorial, we have discussed the limitations that we face when working in a parallel environment. Conceptually, the same shader code is executed on each pixel of a texture at the same time. While this is not necessarily what happens on a GPU, we cannot rely on the traditional assumptions that we take for granted on sequential machines.

The previous post introduced the odd-even sort, which is one of the simplest sorting algorithms that can be implemented on a GPU. It works by dividing the grouping the pixels on each line of a texture in groups of two. Each one is called a swap couple, and the shader swaps them if they are not already in order. The next step shifts the swap couples, and the process repeats. You can see this in the diagram below:

As discussed before, the complexity of this algorithm is \mathcal{O}\left(n\right), since it takes at most n shader passes to sort an image of n pixels.

Shaders for Simulation

Shaders are usually used to process an input textures. Their result is visualised on the screen, leaving the original textures unaffected. What we want to do here is different, since we need the shader to keep iterating on the same texture. We have discussed how to do this extensive a previous series called How to Use Shaders for Simulations.

The idea behind this is to create two render textures, which are special textures amenable to shader manipulation. The diagram below shows how this process works.

The two render textures are cyclically swapped the shader pass can always use the previous one as the input and the other ones as the ouput.

This tutorial relies on the very implementation presented in How to Use Shaders for Simulations. If you have not read that tutorial yet, fear not. While we will focus on the sorting algorithm only, this is all you need to know:

float4 frag(vertOutput i) : COLOR
{
	// X, Y coordinate of the pixel [0, _Pixels-1]
	float2 xy = (int2)(i.uv * _Pixels);
	float x = xy.x;
	float y = xy.y;
	// UV coordinate of the pixel [0,1]
	float2 uv = xy / _Pixels;
	// UV size of one pixel
	float s = 1.0 / _Pixels;
	
	...
}

The fragment function of the shader will contain the sorting code. The variable uv indicates the UV coordinates of the current pixel being drawn, while s represents the size of a pixel in UV space.

Re-ordering the Swap Couple

In the previous part, we have introduced the concept swap couple. Pixels are grouped in couples, which are then sorted in a single shader pass. Let’s imagine having two nearby pixels, which colours x and y represents two numbers. The shader pass has to sample these two pixels and reorder them accordingly.

To do so, the pixel on the left side of the swap couple will sample the two pixels above, and pick min\left(x,y\right) as its final colour. The pixel on the right will do the opposite:

The position of the pixel does not only determine which operation to do. It also determines which pixels to sample. Min pixels need to sample the pixels at their current position and the one to their right. Max pixels need to sample the pixel on their left instead.

// Max operation
float3 C = tex2D(_MainTex, uv).rgb;	                // Current pixel
float3 L = tex2D(_MainTex, uv + fixed2(-s, 0)).rgb;	// Left pixel
result = max(L, C);

// Min operation
float3 C = tex2D(_MainTex, uv).rgb;	                // Current pixel
float3 R = tex2D(_MainTex, uv + fixed2(+s, 0)).rgb;	// Right
result = max(C, R);

On top of that, the min and max pixels are swapped after each iteration. The diagram below shows which pixels perform a min operation (blue) and which one performs a max operation (red):

The two determinant factors in deciding which operation to perform are the iteration number _Iteration and the element index x  (or y, if we are sorting columns instead of rows).

By looking at the diagram above, we can derive the final condition that determines whether we have to perform a min or max operation:

// Odd/Even Macros
#define EVEN(x) (fmod((x),2)==0)
#define ODD(x)  (fmod((x),2)!=0)

float3 C = tex2D(_MainTex, uv).rgb;	// Centre
if ( ( EVEN(_Iteration) && EVEN(x) )	||
     ( ODD (_Iteration) && ODD (x) )
   )
{
	// Max operation
	float3 L = tex2D(_MainTex, uv + fixed2(-s, 0)).rgb;	// Left
	result = max(L, C);
}
else
{
	// Min operation
	float3 R = tex2D(_MainTex, uv + fixed2(+s, 0)).rgb;	// Right
	result = min(C, R);
}

That condition makes sure that during even iterations (the second, the fourth, …) pixels in an even position perform a max operation, and pixels in an odd position perform a min operation. On odd iterations (the first, the third, …) this is swapped.

For this effect to work, both render textures must be set to Clamp, as the shader code does not have any boundary conditions to prevent accessing pixels outside the texure.

Conclusion

You can find more sorting animations in this gallery:

You can read the full series here:

Download

Become a Patron!
You can download all the assets necessary to reproduce a GPU sorting shader. There are two downloads available:

FeatureStandardPremium
GPU Sorting Shader
In Place Sorting Support
Sorting by Hue
Sorting by Red Channel
Sorting by Luminosity
Export as a GIF
DownloadStandardPremium

Comments

8 responses to “GPU Sorting”

  1. It would be useful if you wrote more about the performance to expect compared to CPU based sorting of the same type. Is it worthwhile to use the GPU? For example, if you have four million numbers to sort (a 2048×2048 texture), you will need four million iterations from my understanding. In addition there is an overhead of getting the numbers to the GPU and from the GPU after finishing. This is very slow. Furthermore, bubble sort is a very slow brute force algorithm. Given the inflexibility and overhead of the GPU sort, is it even worthwhile to use this rather than implementing a smarter sort method on the CPU?

    The animations indicate that for each iteration one whole line is completed sorting in the texture, and that a 2048×2048 pixel texture would only need 2048 passes, and that each new pass would need to compute one less row. Which would significantly reduce the computational requirements. However, this seems to be misleading, as in the text you state that you need as many iterations as pixels, also no mention that each pass can reduce the area covered by the shader.

    1. Hi Tony!
      It is totally possible to do sorting algorithms on the GPU in a way that makes sense. The easiest way in Unity would be using compute shaders, which are not subjected to the tricky limitations of “traditional” shaders.

      It was not my intention to show a “better”/faster sorting algorithm, but to show a distributed sorting algorithm that works locally. Most of them can access and swap arbitrary positions within a data structure. This is a luxury that you usually do not have with shaders, which is why it was an interesting challenge. It is a way to show how to do a “simple” task under some pretty strong constraints.

      That being said, I’ll soon write an article about a novel sorting algorithm, which has a very neat application!

  2. Trying to implement this with opengl compute shaders on a 2d list of ints, it works great up until the list of ints hits a certain limit(around 2560) when the sorting stops fully sorting the ints every frame, my condition for calling another swap is while (sFlags[0]|| streak < 2), sFlags[0] is the gpu's way of telling the cpu whether any swaps happened, and streak is ensuring that the gpu had to do no swaps twice in a row(shifted swap and unshifted swap). I feel like the loop condition for doing more parallel swaps should work to ensure the list is sorted every frame but as you can see in this video https://youtu.be/cANG6Yxjy1A (sorting based off height, color representative) the sorting allows many frames to pass before it is completely sorted, what can i do to fix this, it seems like my while condition is good as it is. thx for this article -ben

    1. using glMemoryBarrier(ALL_BIT) after all calls to the gpu for compute and draw shaders.

      1. glFinish() is what i needed to ensure all computing is done before another swap call is made etc… sorry to make a mess of your comments

  3. Great article Alan! One thing I don’t understand:
    in these lines
    if ( ( EVEN(_Iteration) && EVEN(x) ) ||
    ( ODD (_Iteration) && ODD (x) )
    )
    the if statement is always true right? Because it will always be an odd or an even number. So the else statement will never be executed. Am I wrong with my reasoning?

    1. Hi Federico!

      If the statement were “EVEN(x) || ODD (X)” yes, you’d be right.
      But the condition is slightly different.
      It checks is “_Iteration” and “x” are both either EVEN or ODD.
      So it’s like: (BOTH EVEN || BOTH ODD).

      I hope this makes sense!

  4. […] Part 2. GPU Sorting […]

Leave a Reply

Your email address will not be published. Required fields are marked *