Friday, April 3, 2015

Compute Shader Framework

In the last few posts about ray tracing I briefly mentioned compute shaders. If you don't know what they are, here is a short summary:

Introduction


Compute shaders are not part of the ordinary graphics pipeline; they can be used separately from any other stage. They are meant for general-purpose computation on the GPU. Compute shaders are written in the same language as the other pipeline stages, such as the pixel shader; in this case, HLSL. A compute shader takes advantage of the large speedup the GPU can offer over the CPU by exploiting the GPU's parallel computation power.

When I first started out with compute shaders, I saw them as a black box and didn't really understand how to get started using one. After I found out that they can be really useful for large computations, I decided to implement one for my ray tracing project (with success). After this, I decided to make a simple framework that gives everyone access to compute shaders in a more user-friendly way. It is only intended for DirectX compute shaders in C#.

Framework

So without further ado, here is the framework: ComputeShader. You can also view it on GitHub.
On the first run, the framework will download some NuGet packages for SharpDX. If you already have SharpDX installed, you can simply reference it to skip this part.

With this framework it's possible to bind any structure to the GPU. You can do numerous things with these structures in your shader and then output the data you are interested in. You can read this data back in your code and use it later on! A simple example would be updating a particle system: you send all positions and velocities to the GPU, and then calculate the next positions in your shader.

Usage:
In your project, either reference the ComputeShaderAddon.dll or add the project to your solution and reference the project.
You can now calculate anything on the GPU by using the next 4 lines of code (don't forget to include ComputeShaderAddon):
ComputeShaderHelper CSHelper = new ComputeShaderHelper(Device, "effect.fx");
int index = CSHelper.SetData<ExampleStruct>(data);
CSHelper.Execute(50);
CSHelper.GetData<ExampleStruct>(index);

This is the code from the example in the framework. What each line does:
- Initialize the helper; this compiles the shader (if necessary) and sets it up.
- Copy your data, an array of any struct type, to the GPU buffers. The returned index is stored so the data can be retrieved later on.
- Execute the compute shader. The number is the amount of GPU threads to use (called "cores" in this post, although they are threads rather than physical cores). The maximum is 1024 threads, but keep in mind that this puts a heavy load on the GPU!
- Retrieve the data from the GPU, using the index from above.
Create your compute shader. Set the amount of threads you want to use in the brackets above the main function, like this: [numthreads(cores, 1, 1)]

Likewise, store the length of the struct array somewhere in the compute shader if you want to use it the way I did in the framework.
Done! Run your project!
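
Because any struct can be sent to the GPU, its memory layout on the C# side has to match the struct declared in the shader. Below is a minimal sketch, assuming a struct with a single float3 field as in the example. Float3 and LayoutSketch are illustrative names that are not part of the framework; Float3 merely stands in for SharpDX's Vector3.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a 3-component float vector (SharpDX's Vector3 / HLSL's float3).
[StructLayout(LayoutKind.Sequential)]
public struct Float3
{
    public float X, Y, Z;
}

// C# side of the struct. The shader side would declare something like:
//   struct ExampleStruct { float3 Data; };
[StructLayout(LayoutKind.Sequential)]
public struct ExampleStruct
{
    public Float3 Data;
}

public static class LayoutSketch
{
    // Size in bytes of one array element as seen by the marshaller.
    public static int Stride()
    {
        return Marshal.SizeOf<ExampleStruct>();
    }
}
```

Here LayoutSketch.Stride() returns 12 (three 4-byte floats), which is also the size of a struct holding one float3 in an HLSL structured buffer. If the two sizes differ for your own struct, the data read back from the GPU will likely be shifted or garbled.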

Results
The example is a small program I wrote to compare the computation power of the GPU with that of the CPU. The operation to perform is simple: for every struct, sum the numbers from zero up to (but not including) the length of the array and store the result in the struct. Below you'll find a CPU and a GPU version of this in code:
// CPU
for (int i = 0; i < amount; i++)
{
    // Sum 0 + 1 + ... + (amount - 1) for every struct.
    int result = 0;
    for (int j = 0; j < amount; j++)
        result += j;
    data[i].Data = new Vector3(result, result, result);
}

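// GPU -- ComputeShaderExample.fx in framework
[numthreads(nThreads, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Each thread handles its own contiguous block of `range` structs.
    // If nStructs is not a multiple of nThreads, the last few structs are skipped.
    int range = nStructs / nThreads;
    for (uint i = id.x * range; i < id.x * range + range; i++)
    {
        // Same sum as on the CPU: 0 + 1 + ... + (nStructs - 1).
        int result = 0;
        for (uint j = 0; j < nStructs; j++)
        {
            result += j;
        }
        data[i].Data = float3(result, result, result);
    }
}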
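
The way the shader divides the array over its threads can be mirrored on the CPU to check which indices each thread touches. This is only a sketch of the arithmetic in the shader above; PartitionSketch, RangeFor and Unprocessed are hypothetical names, not part of the framework.

```csharp
using System;

public static class PartitionSketch
{
    // Half-open index range [Start, End) handled by one thread,
    // mirroring `range = nStructs / nThreads` in the shader.
    public static (int Start, int End) RangeFor(int threadId, int nStructs, int nThreads)
    {
        int range = nStructs / nThreads; // integer division, as in the shader
        int start = threadId * range;
        return (start, start + range);
    }

    // Number of structs at the end of the array that no thread processes
    // when nStructs is not a multiple of nThreads.
    public static int Unprocessed(int nStructs, int nThreads)
    {
        return nStructs % nThreads;
    }
}
```

With 10,000 structs and 50 threads, RangeFor(0, 10000, 50) gives (0, 200) and RangeFor(49, 10000, 50) gives (9800, 10000), so the whole array is covered. With 48 threads, Unprocessed(10000, 48) is 16, meaning the last 16 structs would be left untouched.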

The framework ran with 50 threads on the GPU, and the results are as follows:

Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the GPU: 102 milliseconds
Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the CPU: 283 milliseconds

With this small calculation the GPU, using 50 threads, is almost 3 times faster than the CPU, using only one core (102 ms versus 283 ms).

Some scalings:
Structs   Threads   GPU time (ms)   CPU time (ms)
10k       50        102             283
20k       50        380             1131
30k       50        751             2463
10k       10        452             283
10k       100       119             283
10k       1000      480             283
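You can see that the amount of threads is something to fiddle with, since the resulting time differs greatly: for 10k structs, 50 threads (102 ms) beat both 10 threads (452 ms) and 1000 threads (480 ms). With too few threads, each thread has to work through many structs one after another; with too many, the overhead of running all the threads costs more than the actual computation itself, so be careful with this!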

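Fun fact: the outcome is easily calculated with n(n + 1) / 2, the closed-form sum of 1 + 2 + ... + n. In this case n = 9999 (because we start at 0 and stop one short of the array length of 10,000), which gives 9999 * 10000 / 2 = 49,995,000, matching the 4,9995E+07 in the output above.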
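
The formula can be checked against the loop with a few lines of C#. This is just a sanity-check sketch; SumCheck and its methods are illustrative names that do not appear in the framework.

```csharp
using System;

public static class SumCheck
{
    // Same loop as in the CPU example: 0 + 1 + ... + (count - 1).
    public static long LoopSum(int count)
    {
        long sum = 0;
        for (int j = 0; j < count; j++)
            sum += j;
        return sum;
    }

    // Closed form n(n + 1) / 2 with n = count - 1.
    public static long ClosedForm(int count)
    {
        long n = count - 1;
        return n * (n + 1) / 2;
    }
}
```

Both SumCheck.LoopSum(10000) and SumCheck.ClosedForm(10000) give 49995000. The sketch uses long on purpose: the examples above accumulate into a 32-bit int, which stays in range for the 10k to 30k structs tested here but would overflow once the array length goes past 65,536.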

Future work

This framework currently only supports Unordered Access View (UAV) bindings. With DirectX 10 you can bind only one array of structures to the compute shader, and that's it. In DirectX 11 this is increased to 8, which is supported in this example project.

Currently you still have to set the length of the array and the amount of threads manually in the compute shader. I don't know if it's possible to change this dynamically from code, but if I ever find a way, I will update the framework for sure.
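
One untested idea, which is not something the framework does, would be to generate the two values from C# as #define lines and put them in front of the shader source before it is compiled. The sketch below assumes the shader uses the names nStructs and nThreads (as the example shader does) and does not define them itself. ShaderDefines and WithDefines are hypothetical names.

```csharp
using System;

public static class ShaderDefines
{
    // Prepends #define lines for nStructs and nThreads to HLSL source text.
    // Assumption: the shader source does not already define these names.
    public static string WithDefines(string shaderSource, int nStructs, int nThreads)
    {
        if (nThreads < 1 || nThreads > 1024)
            throw new ArgumentOutOfRangeException(nameof(nThreads), "Expected 1 to 1024 threads.");
        if (nStructs < 0)
            throw new ArgumentOutOfRangeException(nameof(nStructs), "Array length cannot be negative.");

        return "#define nStructs " + nStructs + "\n" +
               "#define nThreads " + nThreads + "\n" +
               shaderSource;
    }
}
```

The resulting text could then be written to a file (for example with File.WriteAllText) and handed to the helper in place of the original .fx file. The shader would have to be recompiled whenever either value changes.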