Monday, December 21, 2015

Real-time Raytracing part 4

In all of the previous posts, I've talked about optimizing ray tracing. I've mentioned path tracing, but never spoke of path tracing specific optimizations. This post will be a small survey of implementations and research about path tracing optimizations.


Path tracing is great for simulating global illumination by tracing rays, and of course, these rays have to follow "realistic paths". As observed a long time ago (1760) by Johann Lambert, intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal. This is called Lambert's cosine law in optics.

In path tracing, we mainly use the cosine law to calculate the contribution of light from the next ray. When tracing from a diffuse surface, the direction of the next ray is usually calculated by generating a random direction in the hemisphere of the surface. When tracing from a mirror, you can imagine there is only one direction on the hemisphere that is correct: the perfect reflection. Metal-like surfaces contain both elements, called glossy reflections. This image shows what I'm talking about in 2D:

If you would order them by calculation cost for the path tracing algorithm, the reflection would be the easiest, there is only one solution, one ray to calculate. The glossy one is harder to calculate, since you have to accumulate rays with different weights, but still simpler to fully calculate than the diffuse option. Besides these simple examples, materials can contain many functions to bend incoming rays. The collection of these functions describe the Bidirectional Reflectance Distribution Function (BRDF). Which in turn, you can even extend to allow for rays extending through the surface with a Bidirectional Scattering Distribution Function (BSDF).

Cosine weighted sampling
The intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal.
This leads to an interesting optimization: cosine weighted sampling. Since the incoming and outgoing light is measured by a cosine, the outgoing rays nearly parallel to the surface don't add any (useful) information to the final solution. I could write this whole post about this optimization, but that would be a waste of time, since there are already many good examples out there. This post from Rory Driscoll shows results as well as the maths behind it.

This image shows the cosine weighted sampling on the top and the uniform sampling on the bottom. This was taken from the pathtracing blog
Sampling optimization

As mentioned above, when you've found a ray-triangle collision, the direction of the new ray is usually 'bruteforce' calculated by taking a random direction on the hemisphere. For diffuse surfaces, this is no problem, since every outgoing direction weights the same. If you would apply the same to perfectly reflecting surfaces, chances are you will most likely never find the correct outgoing ray.

By knowing the material properties you can determine how to sample the outgoing ray. This sampling strategy can be a very easy optimization if you're willing to end up with an approximation. If you look at the 3 examples from the image above: one variable for the degree of reflectiveness can produce these examples.

The problems with this reflectiveness variable for path tracing are the weights shown in the glossy example. Solving the Monte Carlo problem of path tracing requires a probability distribution function (PDF) over these samples. The outer rays in the glossy examples are shown smaller, because they affect the outcome less than the rays from the perfect reflection.

This optimization is as simple or as hard as you want to make it. The simpler you keep the sampling, the more approximate your solution is going to be. The best solution would be to sample your chosen BRDF. The cosine weighted sampling is shown to be perfect for diffuse only materials (diffuse materials have perfect lambertian reflectance). I found a sampling method for the Cook-Torrance BRDF in this paper: Microfacet Models for Refraction through Rough Surfaces.

Direct and indirect lighting

Direct lighting is easy to calculate. This is shown in almost all released 3D games out there. Simply calculating a dot product of the surface normal and light direction would result in perfect lambertian shading.

In path tracing we can apply an optimization to allow for direct light sampling. It's called next event estimation. This is where the math and statistics get nasty. To allow for direct light sampling, we have to apply a different probability distribution function. We have to sample the light and divide the contribution of direct light by the area of the light:
Area sampling. Image taken from [2]

Whereas the indirect illumination should be sampled by the original method using hemispheres. With one exception, all of the rays hitting the light should be discarded (since the light gets sampled by the other PDF from above).
Hemisphere sampling without direct light sampling (dashed lines). Image taken from [2]
These examples show only sampling of a single light, which is easy to maintain. Imagine a scene with thousands of lights and one sun. If you apply the same theory of the two PDFs from above, you would end up sampling more inefficient than the brute force sampling. The solution is rather simple, we add weights to each light. But these have to be based on different criteria: only adding weights based on distance could influence the effects of the sun (large distance, huge brightness) versus normal lights (small distance, small brightness). Other small optimizations could be added based on needs: you don't have to continually sample the sun during night time and so on. If you like to know the mathematics and details behind these two PDFs, I suggest you read the presentations from the references below.

Multiple importance sampling

In larger applications, thousands of material types and BRDFs are going to be used. Having a different sampling strategy for each of them is going to be time consuming and frankly undo-able. We already have two different PDFs to sample all of these materials, the direct and indirect sampling. The problem that arises from all of this shown in images:

These are the two sampling methods shown from above, the BSDF sampling is the hemisphere indirect sampling and the light source sampling is the direct sampling using the area of the light. The problem is the glossy reflection on the four plates. You can see that indirect sampling works if the light is large enough to have some collisions with rays, and the direct light source sampling works fine for small area lights.

Multiple Importance Sampling is a technique that can combine different sampling techniques to find a low variance estimate of the integral. By combining both sampling methods we can get the following result:
Images taken from [4]

Bidirectional path tracing

This leads us to more recent techniques in path tracing, namely bidirectional path tracing. This technique is based on the inverse of path tracing, instead of finding the light by tracing rays from the camera, find the camera by tracing from the light. These are the shooting rays, and the original camera rays the gathering rays. The bidirectional part indicates that these two types of rays can be combined.

This technique wasn't meant for real time path tracing applications which require responsiveness. Since for every gathering ray you shoot, the amount of calculation is doubled by also tracing shooting rays. This nearly doubles the time per frame, and thus decreases responsiveness. While you lose responsiveness, this algorithm increases the convergence rate for path tracing a lot, so in the end it's worth it.

As extension to bidirectional path tracing, there is an algorithm called Metropolis light transport. This algorithm allows finding more difficult light paths by extending existing paths. An example: A shooting and gathering ray have been found by the bidirectional raytracing method. Every bounce in these rays is recorded as a node. The metropolis light transport can add extra nodes on these rays to modify the existing ray to a new one. These nodes are placed in areas with high gradients. You can find more on this technique in this paper: Metropolis Light Transport.

Another extension I found, but have not fully explored is: Light Transport Simulation with Vertex Connection and Merging. The basic idea is to reuse existing paths by connecting vertices. This can increase the amount of generated paths per pixel:

Vertex merging: reusing paths to increase paths per pixel. Image taken from [6]
There is an opensource implementation of VCM called smallVCM. This implementation also contains multiple importance sampling and normal bidirectional path tracing.

Further reading

Most recent techniques include gradient domain tracing. It first originated from metropolis light transport in: Gradient-domain Metropolis Light Transport. The paper on Gradient Domain Bidirectional Path Tracing applies this to a bidirectional path tracer, and Gradient-Domain Path Tracing applies the same principle for normal path tracers. The path tracer also contains source code. You can see the path tracing algorithm is converging faster already with only one sample per pixel:

Showing the difference between Gradient domain path tracer (G-PT) and normal path tracer (PT). Image taken from [9].
Other blog posts:
Path tracing 3D fractals. Also explains cosine weighted sampling, importance sampling, next event estimation, and image based lighting. Implementation in GLSL.
Everything about ray tracing, and is currently working on a GPU path tracer.
Community for path- and ray tracing.





[3] Walter, Bruce, et al. "Microfacet models for refraction through rough surfaces."Proceedings of the 18th Eurographics conference on Rendering Techniques. Eurographics Association, 2007.
[4] Veach, Eric. Robust monte carlo methods for light transport simulation. Diss. Stanford University, 1997.
[5] Veach, Eric, and Leonidas J. Guibas. "Metropolis light transport." Proceedings of the 24th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1997.
[6] Georgiev, Iliyan, et al. "Light transport simulation with vertex connection and merging." ACM Trans. Graph. 31.6 (2012): 192.
[7] Lehtinen, Jaakko, et al. "Gradient-domain metropolis light transport." ACM Transactions on Graphics (TOG) 32.4 (2013): 95.
[8] Manzi, Marco, et al. "Gradient-Domain Bidirectional Path Tracing." (2015).
[9] Kettunen, Markus, et al. "Gradient-domain path tracing." ACM Transactions on Graphics (TOG) 34.4 (2015): 123.

Saturday, December 5, 2015

Real-time Raytracing part 3.1

In part 3, I've shown some examples on how to tune algorithms on the GPU. Here I would like to address how we can apply those rules to optimize path tracing. As usual, I will show code samples for CUDA, but this doesn't mean it can't be applied on any other graphics programming language.

Fortunately for us, there is an opensource framework for optimized GPU based ray traversal. The framework is from the research paper Understanding the Efficiency of Ray Traversal on GPUs. In 2012 they added Kepler and Fermi Addendum, which added some changes to the framework to optimize traversal for those architectures. In this post I will be looking at the version for the kepler architecture. The framework is quite heavy on magic numbers and hard to read code. In this post I'm going to dissect some of the more important parts and how they improve performance.

Data structures

The first thing to notice is what kind of spatial data structure they are using. From the name of the class: "SplitBVHBuilder" you can already see that the builder is based on this paper. It comes down to a simple binned BVH builder, including triangle splits to optimize traversal. For more information on BVH techniques, you can read part 2 of this series.

More important for this post is how you store this BVH on the GPU and access it without creating any memory latency. If we look at the ray traversal, the following code shows how the nodes are read from memory:

    const float4 n0xy = tex1Dfetch(t_nodesA, nodeAddr + 0); // (c0.lo.x, c0.hi.x, c0.lo.y, c0.hi.y)
    const float4 n1xy = tex1Dfetch(t_nodesA, nodeAddr + 1); // (c1.lo.x, c1.hi.x, c1.lo.y, c1.hi.y)
    const float4 nz   = tex1Dfetch(t_nodesA, nodeAddr + 2); // (c0.lo.z, c0.hi.z, c1.lo.z, c1.hi.z)
            float4 tmp  = tex1Dfetch(t_nodesA, nodeAddr + 3); // child_index0, child_index1
            int2  cnodes= *(int2*)&tmp;

Recalling from part 3, I showed you a similar statement for memory alignment using the fact that arrays in global memory didn't have data fragmentation. Reading memory from a texture using this principle is more natural, because textures ensure no data fragmentation. As seen in the comments above, the information gathered contains two BVH bounding boxes and two child indices. Storing the data this way ensures you have four 128 bit reads and one 64 bit read, since the last two values are unused.

The other data structure I would like to point out is the way they store triangles. The nodeaddr variable can store either positive or negative values. The positive indicate an index in the BVH and negative values store indices to the triangle array. The actual index is found by using the bit negate function ($\sim$) on the index. From this address they fetch the triangle using similar code as above:

    // Tris in TEX (good to fetch as a single batch)
    const float4 v00 = tex1Dfetch(t_trisA, triAddr + 0);
    const float4 v11 = tex1Dfetch(t_trisA, triAddr + 1);
    const float4 v22 = tex1Dfetch(t_trisA, triAddr + 2);

In which you can store the data required for a ray-triangle test and execute it. The thing to note is: BVHs can store more than one triangle per node, especially SBVHs. In order to store this efficiently in one array, they inserted breaks in the array. The code for checking all triangles:

for (int triAddr = ~leafAddr;; triAddr += 3)
 // Read triangle data (see above)

    // End marker (negative zero) => all triangles processed.
    if (__float_as_int(v00.x) == 0x80000000)

 // Ray-triangle intersection (see part 1)

The for loop determines the triangle address and loops endlessly increasing the triangle address. If the first address contains a stop value (they chose for negative zero, but any distinguished value will do), stop the for loop. Simple as that, the array can now contain more than one triangle per node.

Traversal algorithm

The traversal algorithm is as follows: beginning this algorithm assumes the ray hit the BVH containing the model we wish to test for. If it didn't, it's not that bad, after two ray-box intersections the algorithm stops.

  • Starting from the first node address, we read the first two child bounding boxes using the code above
  • Using the optimized intersection test for the GPU from the paper, we test ray-box intersections.
  • If the ray intersects one of them, we change the node address and continue the algorithm. 
  • If we intersect both children, store one of the children on the stack and continue traversing the other.
  • If we intersect none, pull a child index from the stack. If the stack is empty, stop the algorithm.
While this algorithm is pretty easy to implement, optimizing it for the GPU is quite a hassle. In order to minimize divergence, they thought of two neat tricks to alter this algorithm. The first trick is persistent threads. This concept originated from the following thought: A warp will execute until all threads inside it are done processing. In a path tracer like ours this can lead to horrific situations: 31 threads waiting for just the one to be done tracing.

The idea to combat this is quite simple: if there are less than X threads active, fetch new work for the idle threads. Using older graphics hardware, the implementation would take some time for all threads to communicate their status using shuffle functions. Since kepler architecture there are some warp voting functions introduced. From the CUDA toolkit:
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.
Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

Which comes in very useful for this task. The code that remains to execute this test:
if( __popc(__ballot(true)) < DYNAMIC_FETCH_THRESHOLD )
Where __popc is a population count, counting the number of bits equal to 1 from the provided value.

The other trick in the algorithm to minimize divergence is part of triangle checking. It comes down to timing: when to check for triangles. In theory, the threads in the warp could be divergent: one thread checking ray-triangle intersections and the others traversing the BVH. While this is not completely preventable, we could help the algorithm postpone checking triangles until all threads have at least one triangle to check.

By creating an array (or in this case a variable) to store a triangle address, we can continue traversing the BVH until other threads have also found a triangle. The code in the source uses a direct compiled statement to check this, since it wouldn't be compiled in an optimal way. The code that you would need to check if all threads have a triangle is simple:
    if(!__any(leafAddr >= 0))
Where leafAddr indicates a triangle address. The statement simply checks if there is a thread for which the address is larger than or equal to 0. If there are none, it means all threads have found a triangle to check and we can stop traversing the BVH.

I hope to have cleared up some of the magic parts of the source code. Of course, if you have any more questions about it, feel free to post them below.

Saturday, November 21, 2015

Real-time Raytracing part 3

In part 2, I showed you that you can speed up your ray tracing algorithm by means of a spatial data structure. In this post I would like to address optimizing GPU code using CUDA. Below, I give some general tips to boost parallel performance. Be aware that these points will not always be relevant. I will explain why, and under which conditions these points can optimize your algorithm:
  • Try to avoid divergent statements and recursive algorithms. 
  • Keep track of memory alignment for reads and writes of large datasets. 


Why is this so important? Most of the performance issues can be devoted to divergence. To explain divergence, I'm going to show a simple example. Say we want to multiply all even numbers by 2, and divide all uneven number by 2. A simple parallel reduction creates us the following:
__device__ void divergent(int* array, int nNumbers)
   int idx = threadIdx.x + blockDim.x * blockIdx.x;

   for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x)
     if (array[i] % 2 == 0)
       array[i] *= 2;
       array[i] /= 2;

CUDA can run thousands of threads in parallel, but it comes with a small trick called warps. These warps also exist in other GPU specific languages like OpenCL or compute shaders, but they are referred to as wavefronts. CUDA launches a set of 32 threads as one simultaneous bunch called a warp. Here is step by step what happens while executing one warp:

  1. The 32 threads each calculate their respective idx.
  2. The starting point of the for loop is the same as idx.
  3. The first conditional statement: all of the threads for which this statement is true (16 threads) will execute, the other threads (16) are idling.
  4. The second conditional statement: same as the first, only the other way around.
  5. Increase the loop index, and repeat from 3 until the end of the loop.

In total we've just executed the same amount of statements as in a serial variant of this algorithm. However, the serial variant does not have any idle time. If you would sum the total time spent executing this algorithm for both variants, the serial algorithm would take less total time than this parallel variant. Simply because half of the threads are idling all the time!

This is where the term divergence originated from: divergent statements blocking full parallel execution in warps or wavefronts. In the example above, we have at maximum 50% divergence, meaning half of the threads follow a different path than linear code execution. So, how do we get the algorithm from above to execute in parallel without divergence? Fairly simple:
__device__ void divergent(int* array, int nNumbers)
  int idx = threadIdx.x + blockDim.x * blockIdx.x;

  idx *= 2; // All even numbers
  for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x*2)
    array[i] *= 2;

  idx++; // All uneven numbers
  for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x*2)
    array[i] /= 2;

By removing the divergent statements, we get code that leaves no thread idling. Of course this was a simple example, and there are many branching statements which are impossible to solve the same way as above.

Recapping on point 1: Try to avoid divergent statements and recursive algorithms. It's incredibly easy to make a recursive call which causes divergence in the parallel execution. Keep in mind that the recursive algorithms are just a point of view: if you can guarantee non-divergent recursive calls, there is nothing wrong with recursive calls!

Memory alignment

To explain memory alignment I'm going to use an example from raytracing. In order to optimize my path tracer, I had to split the GPU code in different kernels for execution. I will detail the process in the next post. For now, we have to save the ray state in a datastructure as small as possible to avoid high memory latency. This is what I came up with:
struct RayState 
  Ray ray; // contains position and direction (both 3D vectors)
  float minDist; 
  int triangleID; 

The total size of this structure is a simple sum: the amount of floats and integers times their size in bytes. I'm assuming that an integer and float have the same size, because in most compilers they do. With this assumption, in total we have 8 times the size of a float, which is usually 32 bits or 4 bytes.

If you would use this structure to load and save data, there is a small problem: this structure will read in 4 bytes a time, since we did not tell CUDA there is any alignment whatsoever. You can profile this behavior with NSight, which shows the following in memory transactions:

Our tactical choice lead CUDA to read in by the lowest alignment, which is 32 bits. This can be a problem, if you look at this graph:

This was measured for a Tesla C2050, which is a bit outdated by now, but by no means a bad statistic to look at. There is a clear trend showing from this graph: the lower the amount of bits read per transaction, the slower memory access speed. This statement is only true if the amount of threads per multiprocessor is low. This is true in our optimized path tracer, but again, more on this in the next post.

This is where the magic happens. As described in the endless documentation for CUDA (CUDA programming guide), we can declare aligned memory on the device. Which allows us to read our list of structures as an array, if you know the memory size of a single element. We can read our structure as if it is an array of float4 elements (128 bits). The following piece of code does just this:
// Interpret as float4 array, and transform to structure
float4 r1 = ((float4*)rays)[rayidx * 2];
float4 r2 = ((float4*)rays)[rayidx * 2 + 1];
float3 raypos = make_float3(r1.x, r1.y, r1.z);
float3 raydir = make_float3(r1.w, r2.x, r2.y);
float2 data = make_float2(r2.z, r2.w);

I read in two float4 elements from the array by converting the index. This is a simple trick: we know the total size of the structure, which is 256 bits. We are trying to read in 128 bit elements, so the index has to be multiplied by two to get a correct index for a float4. A profiling session now shows the following data:

Which means the data is now read in as aligned 128 bit reads. Mission successful! Recapping on point 2: Keep track of memory alignment for reads and writes of large datasets. We've successfully read in aligned data on the GPU. However, there is room for more memory optimization by using data locality. Since this has already become quite a long post, I will postpone this for another time.


CUDA programming guide

Saturday, October 31, 2015

Flux - Realtime path tracer

After all these posts on techniques, here is a video showing the current state of my path tracer:

And some rendered images:

Thursday, October 29, 2015

Real-time Raytracing part 2.1

In my last post I gave a summary on all available methods for creating bounding volume hierarchies. In this post I'd like to share my implementation of the Fast and Simple Agglomerative LBVH Construction (download link). I'm not using any special heuristic, it's going to be the same simple radix tree as mentioned in the previous post.

This is what we want to end up with. The green nodes in the bottom are the leaf nodes (containing triangles), and in blue are the internal nodes. The numbers below the nodes are their left and right childrens indices, more on this later. The numbers in the bottom green rectangles are the leaf nodes' respective morton codes, they represent an ordering in space. In our case this space is going to be 3D. There are plenty of methods on creating morton codes, so I'm going to skip discussing these methods here.

In Karras' paper on fast LBVH building, they use 2 compute passes to find all relations between the nodes, and build a radix tree. Apetrei found a similar method which only uses one pass to do both. Similar to Karras' method, we split the array of nodes into leaf nodes and internal nodes. You can still store them in the same array if you use an offset though.

First of all, the structures. We're going to used Axis Aligned Bounding Boxes (AABB), for which the structure looks as follows:
struct AABB
 float3 min, max;

The BVHNode contains a little bit more:
struct BVHNode
 AABB bounds;
 BVHNode *children[2];
 int triangleID;
 int atomic;
 int rangeLeft, rangeRight;

The basic things are there: AABB and relations to successors and parent. If this would be a leaf node, the triangles can be stored in the elements array. The rangeLeft and rangeRight are the, soon to be, childrens indices. The atomic counter will play an important role in this algorithm.

Important to note: If you're going to optimize this, think about memory alignment, this structure currently has the size of 6 floats and 6 ints. Storing in multiples of 128 bit (4 ints/floats) and reading/writing in those sizes on the GPU is a good memory optimization.

I started by simply allocating two arrays for the nodes on the GPU. Other requirements you need for this algorithm are the AABB's of every triangle, and their respective morton codes. Obtaining these is simple, but a little obfuscated on the GPU. We're going to use thrust (a package included in the cuda SDK) to sort the created morton keys:
void Sort(unsigned long long* morton, int *keys, const int n)
 thrust::device_ptr<int> d_keys = thrust::device_pointer_cast(keys);
 thrust::device_ptr<unsigned long long> d_morton = thrust::device_pointer_cast(morton);
 thrust::stable_sort_by_key(d_morton, d_morton + n, d_keys);

Note that both morton and keys are both already existing arrays on the GPU. The keys array is a simple int array containing [1,2..n] storing the triangleIDs. Morton are the morton codes of the respective triangles. In this piece of code I simply convert them to something 'thrustworthy' and sort the morton codes together with the keys using stable sort by key (so you end up with both arrays sorted the same way).

Since allocated data on the GPU could contain anything, we're going to reset some parameters in the nodes and start by defining bounding boxes in the leaf nodes:
__global__ void Reset(BVHNode *leafNodes, BVHNode *internalNodes, AABB *data, int *sortedObjectIDs, unsigned long long *morton, int nNodes)
 unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
 if (idx >= nNodes)

 // Reset parameters for internal and leaf nodes here
 // Set ranges
 leafNodes[idx].rangeLeft = idx;
 leafNodes[idx].rangeRight = idx;
 leafNodes[idx].atomic = 1; // To allow the next thread to process
 internalNodes[idx].atomic = 0; // Second thread to process

 // Set triangles in leaf
 // Save triangle ID
 // Expand bounds using min/max functions

 // Special case
 if (nNodes == 1)
 internalNodes[0].bounds = leafNodes[0].bounds;
 internalNodes[0].children[0] = &leafNodes[0];

For those who have never seen CUDA code before, the __host__ and __device__ simply indicate on which device this code can be executed, __global__ indicates the host calls the function, but the device executes the function.

The input for the function contains the same morton codes and keys from the sorting, the nodes themselves, and nNodes, which is simply nLeafNodes in this case. This is always true in a binary radix tree.

Compute shader logic: To find the index in the array, we have to look up the current threadID. I called this function using only one dimensional blocks and grids, so this is easy to calculate by adding threadId to the block index and size. If we surpass the maximum amount of threads (nNodes) we simply tell this thread to stop executing.

Walkthrough: First, we simply reset the leaf and internal nodes parameters to NULL and reset the AABB. We're going to start the algorithm in the leaf nodes, so the range counter will only include the node itself for every leaf node. The atomic counter is set to 1 for leaf nodes, and 0 in internal nodes. This is explained further in the algorithm below.

We want to end up with every node having a bound in terms of an AABB. This can be done by setting them directly. I prefer to use the min/max functions, since you don't have to rethink the process if you try to store multiple triangles later on.

The final part of the code simply handles the case when there is only one node. If this is the case, we don't even have to run the actual algorithm!

The actual algorithm

I made a function which calls the actual algorithm for every leaf node, fetching the threadID in the same manner as above. After that I start the recursion:
// Sets the bounding box and traverses to root
__device__ void ProcessParent(BVHNode *node, int nData, BVHNode* internalNodes, BVHNode *leafNodes, unsigned long long *morton)
 // Allow only one thread to process a node
 if (atomicAdd(&node->atomic, 1) != 1)

 // Set bounding box if the node is no leaf
 if (!isLeaf(node))
 // Expand bounding box with min/max functions from children AABB's

 int left = node->rangeLeft;
 int right = node->rangeRight;
 if (left == 0 || (right != nData - 1 && HighestBit(right, morton) < HighestBit(left - 1, morton)))
 // parent = right, set parent left child and range to node
 // parent = left -1, set parent right child and range to node

 if (left == 0 && right == nData)
 ProcessParent(parent, nData, internalNodes, leafNodes, morton);

Walkthrough: The first part is an atomicAdd, the function adds a value to a reference and returns the previous value. We need this to guarantee parallel reduction. You can read more on what atomics are and how they work here. In our context, we're saying if the old value stored in the atomic counter is equal to one, the thread can continue. This way we stop the first thread that encounters a new node and continue traversing when the second thread arrives. This is the reason why we set the atomic counter on 1 for every leaf node, since the first thread arriving will read this value and continue.

The next part simply states that if the thread encounters an internal node, we set the bounding box by combining the AABB's from the children. We know that both children are set from the logic with the atomic counter above.

Now we can calculate the parent according to the logic from the paper. With the left and right child index and the following function, we can calculate which internal node is the parent:
// Returns the highest differing bit of i and i+1
__device__ int HighestBit(int i, unsigned long long *morton)
 return morton[i] ^ morton[i + 1];

The paper has a clear explanation as to why this algorithm works, so if you want to know why, I suggest reading the paper.

However, I will give a working example to show this. Let's trace the algorithm from the thread that handles internal node 5 from the image. The child indices are 5 and 6. The algorithm states:
if (left == 0 || 
(right != nData - 1 && 
HighestBit(right, morton) < HighestBit(left - 1, morton)))

The first two statements opt out the 0 and n-1 case (left and right part of the tree). This is not the case, so we have to compare some values. The HighestBit of right returns $1101 \wedge 1111 = 0010 = 2$ and the HighestBit of left-1 returns $1000 \wedge 1100 = 0100 = 4$ 2 < 4, so the statement is true. This means the parent has the same index of the right child, which is 6.

I hope you gained some insight in this algorithm from this code sample. Make sure to note that while this algorithm gives you fast building speed it does not give the best tracing speed.

Warning: LBVHs as shown in the graph above represent an optimal solution. Most of the LBVHs you will make using this algorithm will (most likely) contain tons of layers containing one leaf- and one internal node, thus the depth constraint of log2(n) in balanced binary trees does not apply here!

Thursday, October 22, 2015

Real-time Raytracing part 2

Back to part 1.

In this post we will look into a particular optimization for the raytracing algorithm, namely data structures. Once you got your raytracer up and running on the GPU, you will check for every ray, which triangles it intersects. In the last post we saw that checking all triangles for every ray is simply impossible. In order to save some time, we store the triangles in a spatial data structure, which is easier to query for a ray.

There are many possible data structures to store a set of triangles. The most frequently used ones for raytracing are: k-d trees, octrees and bounding volume hierarchies (BVH). Any one of them is going to speed up the process a LOT. If your ray would only hit the background, in both the octree and the BVH there would only be a single ray-box intersection and then we're done. That's already saving n triangle checks!

But this post wouldn't be interesting to just point to some data structures. In this post I'm going more in depth on the BVH, specifically using bounding boxes. Why bounding boxes? Because there is a very simple, computationally efficient test for ray-box intersections. No if-statements, so it's a very solid algorithm for massive parallel execution as well. Fits right in our GPU ray tracer! Now there are basically two types of BVHs. The LBVH (linear BVH) can be built really fast, but has less tracing performance than SAH (Surface Area Heuristic) BVHs. The SAH splits the set of triangles so that the surface areas of the two child spaces, weighted by the number of objects in each child, are equal.

Tero Karras from Nvidia research came up with a very cool algorithm to build LBVHs in parallel on the GPU. On the Nvidia blog you can find this and this post. It explains how you can traverse a BVH on the GPU, and some great performance tips to speed it up. Part three of the tutorial shows how you can build the LBVH on the GPU making use of the massive parallelism available. It's worth checking out the research paper on the subject, but below is an easier and even faster version. The basic principle of the paper is building a radix tree of a set of objects which are sorted using morton codes. This creates a spatial code by interleaving positional vectors in one integer or long. From this radix tree, you can precalculate where the tree would split, and thus process the whole tree in parallel.

This LBVH creation method is really fast: about 16ms for a million triangles is not uncommon. Even though, the algorithm could be way easier, as described in Fast and Simple Agglomerative LBVH Construction (download link). This paper shows that the same restrictions on the radix tree can be used to build the tree bottom-up instead of using an unnecessary top-down iteration. It shows that by comparing values with the neighbor, you can decide the parent node. It's even a bit faster than the method above, and a lot simpler to implement.

This is as fast as it's going to get for building times for BVHs. But how about tracing times? As mentioned above, the SAH is a good heuristic for ray tracing performance. This graph is an awesome comparison:

This graph is from another Nvidia paper: Fast Parallel Construction of High-Quality Bounding Volume Hierarchies. This is another paper on building LBVHs really really fast. They also mention an optimization in tracing speed by pre-splitting some large triangles. There is a lot more in depth information about building times and tracing times as well. They mention a lot of methods, but the main point from this image is to show that (H)LBVH has about 60~70% of the tracing speed in comparison with SAH BVHs. The building time is about 10~20 times slower than that of LBVHs though.

So how do we build these awesome BVHs using the SAH? Ingo Wald wrote a paper on building these kind of BVHs very fast. This algorithm is performed on the CPU in a top-down fashion. There are a lot of possibilities for performance boost which are also explained in the paper. The building times could be about 3 times as slow as building a LBVH by trading in some tracing performance for build optimizations. But, as shown in the graph above, this method still gives the best raytracing performance.

This image below shows a visualization of rays intersecting bounding boxes with 8 stanford dragon models. You can find the source code for this BVH in part 2.1.

I hope to have informed you enough about the possibilities of optimizing raytracing by using spatial data structures. There is some speculation about the creation of BVHs on the GPU, since the GPU will be working on tracing rays all the time anyway. More on this discussion can be read on (where you can also find an implementation of most of these methods).

Part 2.1: code sample of Fast and Simple Agglomerative LBVH Construction
Part 3

Wednesday, October 14, 2015

Real-time Raytracing part 1

In the next few posts I will talk about Real-time Raytracing. The real-time part has different interpretations. Some define it as being 'interactive', which is another vague term as any application running at 1 FPS can still be interactive. In this case we will define a "real-time constraint" as a time constraint per frame of about 16ms (which leads to 60 frames per second). This is a common time constraint used in games which have to be responsive.

Now for the raytracing part. The basic concept of raytracing is tracing the paths of light for every pixel on an image plane. This way we can simulate visual realism by by mimicking the real process. You can create pretty images without much code:

Unfortunately raytracing also has downsides. The process to trace the light paths requires global scene access, and is thus unable to work similar to a rasterizer (which draws objects one by one). This makes the whole rendering process more difficult and the standard route of rendering objects is averted. Raytracing has a very high computational cost. A basic algorithm for checking which triangle was hit already shows the problem:
for (int x = 0; x < screenwidth; x++)
 for (int y = 0; y < screenheight; y++)
 for (int t = 0; t < nTriangles; t++)
 RayTriangle(x, y, t)

The whole algorithm scales with screen size and the amount of triangles. Running this algorithm with the dragon model shown above is basically asking for your computer to explode. The model has 100.000 triangles exactly, and checking that amount of triangles for every pixel in a low resolution (1024x1024) is about 104 billion ray triangle checks. A simple line-triangle intersection already has a lot of instructions, needless to say, this algorithm isn't going to run real-time...

So, how do we make it real-time? Several decades of research show us that there are plenty of possibilities to speed up the process of raytracing. In the upcoming posts I will talk about some of these processes to speed up my path tracer. Most of the subjects will be on parallel programming on the GPU using CUDA. Even though I will try my best to keep it as readable as possible, unexplained terminology can always be found in the excellent ray tracing tutorial of scratchapixel.

If you're really interested in ray tracing, and haven't read it already: Ray tracey's blog already has a huge collection of ray tracing posts.

Part 2.

Friday, April 3, 2015

Compute Shader Framework

In the last few posts about ray tracing I briefly mentioned compute shaders. If you don't know what they are, here is a short summary:


Compute shaders are not part of the ordinary graphics pipeline, they can be used separately of any other stage. They are particularly meant for computation on the GPU. The compute shaders are in the same language as the other pipeline stages, like the pixel shader. In this case, HLSL. The compute shader takes advantage of the huge speedup the GPU has to offer over the CPU. This is done by taking into account the parallel computation power of the GPU.

When I first started out with compute shaders, I saw it as a black box and didn't really understand how to get started using one. After I found out that it can be really useful for large computations, I decided to implement one for my ray tracing project (with success). After this, I decided to make a simple framework allowing everyone access to compute shaders in a more user friendly way. This is only intented for DirectX compute shaders in C#


So without further ado, here is the framework: ComputeShader. You can also view it on GitHub.
On the first run, the framework will download some NuGet packages from SharpDX. If you have already have SharpDX installed, you can simply reference them to skip this part.

With this framework it's possible to bind any structure to the GPU. You can do numerous things with these structures in your shader, and then output some data that you want to know. You can read this data back in your code and use it later on! A simple example would be updating a particle system: you dump all positions and velocities to the GPU, and then calculate the next positions in your shader.

In your project, either reference the ComputeShaderAddon.dll or add the project to your solution and reference the project.
You can now calculate anything on the GPU by using the next 4 lines of code (don't forget to include ComputeShaderAddon):
ComputeShaderHelper CSHelper = new ComputeShaderHelper(Device, "effect.fx");
int index = CSHelper.SetData<ExampleStruct>(data);

This is the code from the example in the framework. What it does per line:
- Initialize the helper, this compiles the shader (if necessary) and sets it up.
- Set your data from any possible struct to the GPU buffers. The index is stored to retrieve the data later on.
- Executes the compute shader, the number is the amount of cores used on the GPU. The maximum number of cores is 1024, however this will use all calculation power of the GPU at once!
- Retrieve the data from the GPU, using the index from above.
Create your compute shader. Set the amount of cores you want to use in the brackets above the main function like this:[numthreads(cores, 1, 1)]

Likewise, save the length of the array of structs somewhere in the compute shader, if you want to use this like I did in the framework.
Done! Run your project!

The example is a small program I wrote to test the computation power of the GPU in comparison with the CPU. The operation to perform is simple: for every struct you get, count numbers from zero to the length of the array and store them in the struct. Below you'll find a CPU and GPU version of this in code:
// CPU
for (int i = 0; i < amount; i++)
 int result = 0;
 for (int j = 0; j < amount; j++)
 result += j;
 data[i].Data = new Vector3(result, result, result);

// GPU -- ComputeShaderExample.fx in framework
[numthreads(nThreads, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
 int range = nStructs / nThreads;
 for (uint i = id.x * range; i < id.x * range + range; i++)
 int result = 0;
 for (uint j = 0; j < nStructs; j++)
 result += j;
 data[i].Data = float3(result, result, result);

The framework ran with 50 cores on the GPU, and the results are as follows:

Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the GPU: 102 milliseconds
Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the CPU: 283 milliseconds

With this small calculation the GPU, using 50 cores, is about 3 times faster than the CPU, using only one core.

Some scalings:
StructsCoresGPU time (ms)CPU time (ms)
You can see that the amount of cores is something to fiddle with, since the resulting time differs greatly. This is because the overhead of running all the threads costs more than the actual computation itself, so be careful with this!

Fun stuff: the outcome is easily calculated by: n(n + 1) / 2. This is an easy way to calculate a numerical sequence like this. In this case, n = 9999 (because we start at 0).

Future work

This framework currently only supports Unordered View bindings, so if you would use DirectX 10 you can only bind one array of structures to the compute shader and that's it. In DirectX 11 this is increased to 8, which is supported in this example project.

Currently you still have to set the length of the array and the amount of threads manually in the compute shader. I don't know if it's possible to change this dynamically from code, but if I ever find a way, I will update the framework for sure.

Wednesday, January 21, 2015

Raytracing Part 3

The final post in this series about my ray tracer. I spent last weekend creating some black magic called a compute shader. It allows you to use your GPU to aid in calculations. In my case this came in quite handy with all these ray calculations. The GPU is optimal for calculations that can be executed completely in parallel. Fortunately, raytracing falls in this category. 

This is the first result worth looking at:

First thing to notice is that it's quite grainy. This is because there are not that many samples per pixel, only about 5. However, the (still local) Cook Torrance shading is already working as you can see on the metal-ish sphere in the back.

The main thing to notice here are the soft shadows. A small change, but 'realism' awaits. There are some incorrect results in this image, because you can see the diffuse reflection of the green box is still biased. The local shading model from two lights draws the reflecting rays in their direction, meaning this is not a totally correct result.

The box in the middle suddenly became radioactive and now emits light. The image is still biased, even the emitting box only emits towards the light sources.

Experimenting with rougher surfaces and more glossy surfaces.

The first results of refracting rays.

Perfectly refracting sphere showing the bias is indeed gone here. The grain is also gone since I actually took the time to sit back and wait for the image to converge (this image took about two minutes to converge to this state).

The same scene with different materials and a more emitting 'roof'.

Wednesday, January 14, 2015

Raytracing Part 2

In this post some more images of the continuing project on raytracing. First of all, I found out that the bounding spheres were way too big, and a big performance hit on my ray tracing project. Instead of spheres I now calculate bounding boxes around all the triangles in the model and test with these instead of the spheres. All in all, this improvement alone did not speed up things enough. By looking at my quickly made octree, I found some fatal flaws rendering the octree quite worthless. All of the triangles were stored in the topmost bounding box instead of the leaves as they should be.

This image shows the octree of the Stanford dragon and 2 teapots:

After I got all this working I was quite amazed by the quick rendering times. It took almost half the time as before, making it possible to even render this huge dragon:

The next image shows some reflections in the windows and the body of the car:

Of course, I could never stop here. So why not add some refraction as well:

By combining the result of reflections and refraction:

Now, these images are already quite fancy. However, there is still a lot of work to do. The following image shows a sneak peak of the next and final part of this raytracing series. By tracing multiple rays within a pixel area we have more information about the color. Combining this with a nice gaussian function over multiple pixels I have created a nice anti-aliasing function:

In the next part I will show results of a distributed raytracer slowly converging towards rendering an image with global illumination effects.

Tuesday, January 6, 2015

Raytracing Part 1

This post shows the progress for an assignment I'm currently creating for a university course. It's about creating a raytracer starting from scratch. I chose to implement it using SharpDX. I first started by creating a standard renderer, which draws triangles on the GPU using DirectX. After I could load in simple objects and draw them on the screen, I created a basic raytracer.

First I had to convert the loaded models in SharpDX to triangles I could actually use to trace. After some work I finally got some results and started taking screenshots. Hence the result so far in graphical representation:

The model drawn with the basic effect

Silhouette by raytracing the boundary. The gray area is the bounding sphere (which is a little off scale).

Material loading, getting the 'correct' colors for the car. The light color interfered here, thats why the colors are still a little off.

I forgot to order the triangles by depth! What a mess..

After ordering the triangles

Blinn-phong shading.

More lights! Shadows!

Soft shading, taking the average of three normals for each point on a triangle.

Our car is back! However some normals are still incorrect, you can see headlights missing.

Soft Shadows! Sending 20 rays from every point to the lightsource to see if we have shadow. The correct normals have been put in.

Blinn-Phong shading back in the picture. The car looks really metallic now:

There are still some strange 'light artifacts' on the floor and the car window. More updates will be in the next post.