Thursday, July 6, 2017

Automated marshalling of managed- to unmanaged structures

An art that I wouldn't even wish upon my greatest enemies to figure out.

I recently started a new project in a language very familiar to me: C#, a managed language. This means that all the memory management is done for you, which is both a blessing and a curse. For this project, I ran into the "curse" part of automatic memory management.

The problem I encountered is as follows: when you have two structures that are identical in terms of their members, their respective sizes can (and often will) differ in managed languages compared to unmanaged languages. For example, I have the following structure in C# and C++ respectively:

// C#
[StructLayout(LayoutKind.Sequential)]
struct DebugComponent
{
    public float4 Float4;
    public float Float;
}

// C++
struct CPP_DebugComponent
{
    float4 Float4;
    float Float;
};

In C#, Marshal.SizeOf() (or sizeof() in unsafe code) reports that the structure is 20 bytes in size, which is correct: 16 bytes for the float4 plus 4 bytes for the float. Note that I already set the StructLayout to Sequential, as this produces a layout similar to unmanaged code.

In C++, sizeof() reports that the same structure is 32 bytes. This is also correct: the float4 type here is aligned to 16 bytes, so the structure receives another 12 bytes of padding at the end to make its size a multiple of 16.
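The padding can be reproduced in plain C++. This is a minimal sketch under stated assumptions: `float4` is a hypothetical stand-in for CUDA's type, declared here with `alignas(16)`, `PackedDebugComponent` only imitates the 20-byte managed layout, and `paddedSize` is a helper name made up for this example.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's float4: four floats, 16-byte aligned.
struct alignas(16) float4 { float x, y, z, w; };

// Native layout: 16 + 4 bytes of data, padded to a multiple of 16.
struct CPP_DebugComponent
{
    float4 Float4;
    float Float;
};

// Imitates the managed layout: five floats with 4-byte alignment.
struct PackedDebugComponent
{
    float Float4[4];
    float Float;
};

// Rounds a raw size up to the next multiple of the alignment.
constexpr std::size_t paddedSize(std::size_t rawSize, std::size_t alignment)
{
    return (rawSize + alignment - 1) / alignment * alignment;
}

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");
static_assert(alignof(CPP_DebugComponent) == 16, "alignment comes from float4");
static_assert(sizeof(CPP_DebugComponent) == 32, "20 bytes of data + 12 bytes of padding");
static_assert(sizeof(PackedDebugComponent) == 20, "no tail padding required");

// Tail padding the managed struct needs to match the native one.
constexpr std::size_t kTailPadding =
    sizeof(CPP_DebugComponent) - sizeof(PackedDebugComponent);
```

`paddedSize(20, 16)` evaluates to 32, which is the total size the managed structure has to reach before both sides agree.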

Unfortunately, when you use this structure with a tool such as ManagedCuda, the CUDA kernel uses the C++ version of the struct, while the C# code that calls the kernel has to use the C# version. This creates a mismatch in memory layout, resulting in very weird artifacts after running the kernel, or even a crash, because in this case the kernel writes past the memory that was allocated.

The "simple" solution I found is to manually expand the C# structure by setting the Size field of the StructLayout attribute, extending the structure to 32 bytes instead of 20. I asked about this on StackOverflow, but found no way to create these structures automatically without adding up the sizes of every individual type in the structure myself.

So I had to switch up my solution a little bit. I created a project which contains the raw C# structures that I want to use, along with all their functionality like loading and serialization. I then created another C# project for the automated code generation. This project loads the Structures project as a DLL, and text templates read from it all the structures and the types they contain:
  • Structures: project that contains raw structures that will be used on the GPU
  • Tools: my general purpose project that will generate GPU versions of the structs defined in Structures.
To make this conversion as safe as possible, I don't want to manually check that the GPU version has the same alignment and size every time I create a new structure. So I created two more projects:
  • AlignedStructsWrapper: A C++/CLI project that combines managed and unmanaged code
  • Tests: a unit test project
Using the C++/CLI project, we can load both versions of the structure: the managed C# version and the unmanaged C++ version. We can now measure the difference in their sizes:

public ref struct WrapperGpuDebugComponent
{
public:
 int SizeDiff()
 {
  int managedSize = sizeof(Tools::Content::Generated::DebugComponent);
  int nativeSize = sizeof(CUDA::CPP_DebugComponent);
  return managedSize - nativeSize;
 }
};

In the unit test project we reference the C++/CLI project, so we can create a simple unit test that calls the SizeDiff function and checks that the difference is indeed 0:

[TestMethod]
public void CheckStructureSizes()
{
 WrapperGpuDebugComponent debugcomponent = new WrapperGpuDebugComponent();
 // Assert.AreEqual takes the expected value first, then the actual value
 Assert.AreEqual(0, debugcomponent.SizeDiff());
}

Of course, I also generate the C++/CLI structures and the unit test functions automatically for every structure, so I only have to recompile the projects to have everything tested.

Sunday, May 14, 2017

CUDA in Visual Studio 2017

Edit: CUDA 9.0 RC is released. This version shows full Visual Studio 2017 support.

Note: this article only shows how to compile Visual Studio 2015 CUDA projects in Visual Studio 2017. For actual VS2017 support we will have to wait for a new CUDA release.

I previously wrote a small article on CUDA support for VS2015, to support CUDA compilation of older projects. Following the same principle we can 'hack' CUDA compilation support in VS2017. 

What you need
  • A CUDA installation with Visual Studio integration for an older VS version. I used CUDA 8.0 with VS2015.
  • VS2017 (any edition)
Copying the required files
  • To allow CUDA compilation we have to copy a few files. Find the CUDA 8.0 setting files in the VS2015 buildcustomizations directory:
C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations
Note: If you use a different VS version, you have to change the 'V140' accordingly (V120 for VS2013 for example).
  • Copy the following files: CUDA 8.0.props, CUDA 8.0.targets, CUDA 8.0.xml, and Nvda.Build.CudaTasks.v8.0.dll
  • Find the VS2017 buildcustomizations directory:
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\BuildCustomizations
Note: I used the VS2017 Community edition. If you have another edition, change 'Community' in the path accordingly.

  • Paste the CUDA files here.

That's it
You can now load and compile your VS2015 CUDA projects in VS2017. When you first open your project in VS2017, make sure to not upgrade your project to VS2017, otherwise this won't work. 

Friday, April 28, 2017

IniGenerator

I wrote a simple C# code interface for ini files. There are already great NuGet packages for ini file IO parsing and writing, but not a lot of packages that automatically generate a code layer. For this project I based my solution on the ini parser to do the file IO.

My package only provides a small overlay to create a C# class which handles all the file IO behind the scenes. Using the text-templates, we define both the ini file and create a C# class. An example file I used to create my configuration:

<#@ include file="$(ProjectDir)IniTemplate.tt" #>
<#
    // All properties in the ini file
    // Name, default value and category
    CreateProperty("Width", 1280, "Video");
    CreateProperty("Height", 720, "Video");
    CreateProperty("Fullscreen", false, "Video");

    // Generate the code layer
    GenerateIniClass();
#>

This creates a C# class with the same name as your text template. The ini file is either created the first time you use this class, or the stored values are read from the existing file.

Usually there is no backwards compatibility with older versions of the ini file: if you add new properties, all values in the ini file are reset to their defaults. I avoid this by using the ini parser's ability to merge two ini files, so I can simply add the new properties to the old ini file without changing the existing values.

Finally, an example of the above template used in code:

// Use the namespace where you placed the template
using IniGenerator.Content.Generated;

// The class is named after the ini file
var config = new Config();
// Can be used directly in code without parsing
var size = new Size(config.Width, config.Height);
var fullscreen = config.Fullscreen;

You can view the source code on GitHub, or download the package from NuGet.

If you have any feedback, leave a comment or post an issue on the GitHub project.

Saturday, April 22, 2017

Master Thesis

Update: You can download the full thesis here.

Level-of-Detail Independent Voxel-Based Surface Approximations was the subject of my master's thesis. I wrote a short summary that explains the basics of my thesis on this page.


This image shows the final result of my thesis work. The models above are voxel models with 4096 (2^12) voxels along every axis. If they were completely filled, I would have to store 4096^3 = 68719476736 voxels per model. There has been a lot of research into compressing the huge amount of data this requires; I mention some examples on the thesis page.

Using a Sparse Voxel Octree (SVO) storing scalar field values, the six models above can be stored in 12GB of memory total. Using my multiresolution method we can store visually comparable models in only 2GB of memory total.

Here is a small video showing the current state of the voxel path tracer:


Thursday, October 27, 2016

Master thesis current state

For my master thesis, I've been working on a project for some time now. In this post I'll share a few debug screenshots of the current state of the project.


The project can load triangle meshes of (almost) any format, and convert them to large resolution sparse voxel octrees (SVO). In the picture above, you can see the voxelized Lucy model, which is about 500MB in triangle format.

Converted to an SVO that stores only whether each voxel is on or off, the data for a 4096^3 grid fits in only ~40MB. That grid holds ~68 billion voxels, but because the octree only stores the voxels that are on, the amount of data drops drastically and compresses very well.
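For comparison, the size of a dense grid that spends one bit on every voxel can be worked out directly. The helper names below are made up for this sketch; only the 4096^3 resolution comes from the post.

```cpp
#include <cstdint>

// Number of voxels in a cubic grid with `resolution` voxels along every axis.
constexpr std::uint64_t voxelCount(std::uint64_t resolution)
{
    return resolution * resolution * resolution;
}

// Bytes needed for a dense occupancy grid that spends one bit per voxel.
constexpr std::uint64_t denseBitmaskBytes(std::uint64_t resolution)
{
    return voxelCount(resolution) / 8;
}

static_assert(voxelCount(4096) == 68719476736ULL, "4096^3 = 2^36 voxels");
static_assert(denseBitmaskBytes(4096) == 8ULL * 1024 * 1024 * 1024,
              "8 GiB at one bit per voxel");
```

A dense bitmask would need 8 GiB, so the ~40MB octree is roughly two hundred times smaller.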

If you want to store more data for every voxel, the amount of memory is going to skyrocket. In the example above, I store a scalar field for every voxel (that's 8 floating point numbers per voxel). All that data nets nearly 870MB for the highest resolution.

I was able to convert and draw a model up to 8192^3 resolution, including all simplifications of a dragon model:


Here you can see that the voxel resolution is even higher than that of the millions of triangles stored in the model.

For my master thesis I will be working on creating level of detail simplifications. The screenshots below show some results of a linear approximation calculated for a group of voxels:



The first model at 512^3 resolution, and the second at 2048^3.

These images are rendered using raymarching to find the intersection point of a ray and a scalar field. I wrote the raytracer in CUDA, which allows for an interactive framerate even with gigantic data sets by using parallelization on the GPU.
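The general idea of raymarching a scalar field can be sketched on the CPU as follows. This is not the CUDA renderer from the thesis: the analytic sphere field, the fixed step size and the linear interpolation at the sign change are all assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Example scalar field (an assumption for this sketch): signed distance to a
// unit sphere at the origin. Negative inside, positive outside.
inline double sphereField(const Vec3& p)
{
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z) - 1.0;
}

// Marches along origin + t * direction in fixed steps and returns the ray
// parameter t of the first zero crossing of the field, or -1.0 on a miss.
// `direction` is expected to be normalized.
template <typename Field>
double rayMarch(const Vec3& origin, const Vec3& direction, Field field,
                double maxDistance, double stepSize)
{
    assert(stepSize > 0.0 && maxDistance > 0.0);

    double previousT = 0.0;
    double previousValue = field(origin);
    if (previousValue <= 0.0) return 0.0;  // the ray starts inside the surface

    const int steps = static_cast<int>(maxDistance / stepSize);
    for (int i = 1; i <= steps; ++i)
    {
        const double t = i * stepSize;
        const Vec3 p{origin.x + t * direction.x,
                     origin.y + t * direction.y,
                     origin.z + t * direction.z};
        const double value = field(p);
        if (value <= 0.0)
        {
            // The field changed sign between the two samples: estimate the
            // crossing by linear interpolation between them.
            const double fraction = previousValue / (previousValue - value);
            return previousT + fraction * (t - previousT);
        }
        previousT = t;
        previousValue = value;
    }
    return -1.0;  // no surface within maxDistance
}
```

A ray starting at (0, 0, -3) and pointing along +z reaches the unit sphere at t = 2. On the GPU every pixel would run such a loop independently, which is what makes the approach easy to parallelize.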

Finally, some more renders:



Wednesday, July 20, 2016

The next step in Entity Component Systems

Let's talk game engines. For a current project, I'm writing a game engine completely from scratch. Before I started, I wrote down some key features the engine should be able to handle. After writing that down, the process of creating the engine boils down to optimizing those features in terms of performance and memory requirements.

Before we dive into engine compositions, let's look at two big AAA game engines that are used by a large number of developers today.

Unreal Engine 4 is mostly known for the impressive graphics that can be created within a few mouse clicks. Its source code is available to licensees, and it is an all-round engine that supports nearly everything you can think of when creating a game. Below you will find some amazing pictures rendered using Unreal Engine 4.



Complete album here.

Unity is an all-round game engine designed to ease the development process by working with C# as its core language. Unity is a lot more oriented towards an entity component system than Unreal Engine 4.


These engines took years to develop, and can practically support any use case. For my engine I'm looking to only support a very narrow subset of these use cases.

The thing both engines have in common is that they are both entity component systems, which is what the remainder of this post is all about.

Engine compositions
Mainly, there are two compositions for game engines. The first one is Object Oriented Programming (OOP), and the second type is an Entity Component System (ECS). There are already a lot of resources available to learn about both. I compiled a small list of examples, and will revisit the topic briefly.

[1] Understanding Component-Entity-Systems by Boreal Games
Shows a clear and concise introduction to OOP vs ECS systems.
[2] Implementation of a component-based entity system in modern C++ by Vittorio Romeo at CppCon
Explains the ECS system in depth and shows an implementation in C++
[3] Evolve your Hierarchy by Mick West
Experience from a programmer changing from OOP to an ECS system. Explains cache optimizations

Object Oriented Programming
The main reason to use OOP for games is that the concept is easy to grasp. You create a hierarchy for all objects, and code is reused through inheritance and polymorphism. In the diagram below you can see an example of OOP used in a game engine.

Image from [1]
It's clear to see that the EvilTree can't fit in the hierarchy, because it would require inheritance from both static and dynamic entities. While this is possible in some languages, it can lead to difficulties known as the diamond problem.

Entity Component System
The entity component system is designed to solve the problem stated above. The structure above would look as follows in an ECS:

Image from [1]
Here you can see that all the objects are composed of different components. The components can be reused for any object, which makes it easy to add new entities.

The next step for Entity Component Systems
Now that we are up to speed with game engine compositions, let's look at the path forward.

Data Oriented Design (DOD)
This paradigm is the design principle behind the ECS. The basic principle is to look at the data instead of the code. It follows from the fact that most applications are bound by memory access rather than by processing power.

The main point of DOD in game engines is to use the ECS in a way that avoids cache misses. To illustrate this, we look at the table below.

Image from [3]
In this ECS from [3], we can see all the entities in the game on the left, and in the table all of the components they consist of. Seen from a distance, the only thing that changed is that instead of reading the diagram from left to right, we now read it from top to bottom. So in our code we define our components as a list of Position components, a list of Movement components, and so on.

For every entity we add, we simply add one of each component to all component arrays. This is why there are holes in the diagram.

All the way at the top of this article, I said that the engine is a balance between performance and memory. We can now clearly see why this is the case: this layout consumes a lot of memory to increase the performance. The "Script only" object creates 5 empty components which will never be used.
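A column-wise layout like the one in the table can be sketched as follows. The component types, the `World` struct and the presence flags are hypothetical; [3] does not prescribe this exact code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Position { float x, y; };
struct Movement { float dx, dy; };

// One array per component type; entity i owns slot i in every array.
// Unused slots are still allocated, which is where the holes come from.
struct World
{
    std::vector<Position> positions;
    std::vector<Movement> movements;
    std::vector<bool> hasPosition;
    std::vector<bool> hasMovement;

    // Adds an entity and reserves one slot in every component array.
    std::size_t addEntity(bool withPosition, bool withMovement)
    {
        positions.push_back(Position{0.0f, 0.0f});
        movements.push_back(Movement{0.0f, 0.0f});
        hasPosition.push_back(withPosition);
        hasMovement.push_back(withMovement);
        return positions.size() - 1;
    }
};

// A system walks the arrays it needs from top to bottom and skips the holes.
inline void moveSystem(World& world, float dt)
{
    assert(world.positions.size() == world.movements.size());
    for (std::size_t i = 0; i < world.positions.size(); ++i)
    {
        if (!world.hasPosition[i] || !world.hasMovement[i]) continue;
        world.positions[i].x += world.movements[i].dx * dt;
        world.positions[i].y += world.movements[i].dy * dt;
    }
}
```

An entity without a Movement component still occupies a Movement slot, which is the memory cost described above.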

The three principles
We can define three points to weigh all our engine needs.
  1. Cache efficiency
  2. Memory efficiency
  3. Parallel efficiency

Now we have to define how to determine the efficiency of all of these points.
  1. Measuring cache efficiency is tricky at best, and very dependent on optimization. So instead of a ratio, we define the following: an engine becomes more cache efficient the fewer arrays it needs at the same time, and the smaller a single item in each array is.
  2. The ratio for memory efficiency is defined as the total used components divided by the total created components.
  3. We can determine the parallel efficiency by looking at the percentage of work that can be executed in parallel. For a complete calculation of the speedup, we can look at Amdahl's law.
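The two measurable ratios can be written down as small functions. The function names are made up for this sketch; the second one is the standard form of Amdahl's law, speedup = 1 / ((1 - p) + p / n) for a parallel fraction p and n workers.

```cpp
#include <cassert>

// Memory efficiency: fraction of the created component slots that are in use.
inline double memoryEfficiency(int usedComponents, int createdComponents)
{
    assert(createdComponents > 0 && usedComponents <= createdComponents);
    return static_cast<double>(usedComponents) / createdComponents;
}

// Amdahl's law: speedup for a workload where `parallelFraction` of the run
// time can be spread over `workers` threads and the rest stays serial.
inline double amdahlSpeedup(double parallelFraction, int workers)
{
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && workers > 0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
}
```

For example, if 90% of the work runs in parallel on 8 threads, the speedup is 1 / (0.1 + 0.1125), or about 4.7 times, far below the 8 times that the thread count alone might suggest.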

ECS revisited
Now that we have the three principles, we can discuss the efficiency of the ECS system. If we assume that we are using the ECS from [3]:
  1. We can see that this would lead to a good cache efficiency, but not the best: A physics component can't operate without a position component, this means we need multiple arrays simultaneously when updating entities. But the big advantage here is that all the arrays are separated, and can be queried individually per component.
  2. The memory efficiency is bad. The ratio for this particular example is already 19/30 = 63%. That means that 37% of our memory is just thrown away for the sake of efficiency.
  3. The parallel efficiency leads to a tricky scenario: how can you execute ECS in parallel? In this particular case, we pretty much can't. The physics system updates the position component, and only after that, we can render the component using the new position.

    In order to process everything in parallel, we'd have to look from left to right (per object) again, and processing the objects in parallel would require all data to be available in the caches, which in turn would lead to a lot of cache misses.

In conclusion, we can say that we traded off space for time. We require more memory, but less processing time due to cache efficiency to process all our objects.

As for parallel efficiency, from a DOD composition such as this, we can run some tasks in parallel. If we look at the implementation from [2], every entity has a signature:

Image from [2]

In this case, the signature tells us that we require AI and Enemy components. In the implementation, we have systems that update all entities containing specific signatures. To create a simple parallel processing ECS, we can simply say that every set of systems that don't have any component in common can execute at the same time.

In our example, with a physics system and rendering system, this would not work, since they both rely on a position component.
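The rule "no component in common" maps directly onto a bitwise AND of two signatures. This is a minimal sketch assuming signatures are stored as 64-bit bitsets; the component ids and function names are hypothetical and not taken from [2].

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

constexpr std::size_t kMaxComponents = 64;
using Signature = std::bitset<kMaxComponents>;

// Hypothetical component ids, used as bit indices in a signature.
enum ComponentId : std::size_t { Position = 0, Velocity = 1, Mesh = 2, AI = 3, Enemy = 4 };

// Builds a signature from a list of component ids.
inline Signature makeSignature(std::initializer_list<ComponentId> ids)
{
    Signature signature;
    for (ComponentId id : ids)
    {
        assert(id < kMaxComponents);
        signature.set(id);
    }
    return signature;
}

// Two systems may run at the same time when their signatures share no component.
inline bool canRunInParallel(const Signature& a, const Signature& b)
{
    return (a & b).none();
}
```

With a physics signature of {Position, Velocity} and a rendering signature of {Position, Mesh}, the check fails because both contain Position, while an {AI, Enemy} system can run next to either of them. The test is conservative: it treats reads and writes alike, so two systems that only read a shared component are also kept apart.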

A solution presented in Vittorio Romeo's presentation is to store all the components in one mega-array:

Image from [2]
While this could be the solution, I very much disliked the amount of work required to implement this, and keep up with all the overhead of adding and removing components.

Setting up the engine
In my engine I opted for a solution that is ideal within the constraints of C++ (static types, known at compile time). With the power of C++11, we can achieve a lot using variadic templates, and I built much of my implementation on them.

I based my system on the implementation of [2]. I use signatures to define a set of components, and I also define systems by providing a signature. But instead of storing components separately in one array per component type, I took a step back in 'engine progression'. I define one array for every possible signature.

If you read that carefully, you should already be thinking: that would require a lot of arrays! If we define up to 64 different components and generate one array for every possible signature, we would have 2^64 ≈ 1.8446744073e+19 arrays, possibly requiring more memory than the complete application that we're building.

So instead of creating one array for every possible signature, let's create one array for every signature that is actually used. The tricky part is finding all used signatures at compile time. We can't do that in C++ itself, since it has no equivalent of C#'s reflection that we could query for this.
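One array per used signature could look roughly like this with variadic templates. This is a sketch, not the engine's code: the component types, `SignatureArray` and `World` are hypothetical, the list of used signatures is written by hand instead of being generated, and `std::get` by type needs C++14.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

struct Position { float x, y; };
struct Velocity { float dx, dy; };
struct Health   { int points; };

// One tightly packed array for one signature (a fixed set of component types).
// Every item holds exactly the components of the signature, so no slot is wasted.
template <typename... Components>
struct SignatureArray
{
    using Item = std::tuple<Components...>;
    std::vector<Item> items;

    std::size_t add(Components... components)
    {
        items.emplace_back(components...);
        return items.size() - 1;
    }
};

// Only the signatures that are actually used get an array.
struct World
{
    SignatureArray<Position, Velocity> movers;
    SignatureArray<Position, Velocity, Health> creatures;
};

// A system that needs Position and Velocity can visit every array whose
// signature contains both, whatever else the items store.
template <typename... Components>
void integrate(SignatureArray<Components...>& array, float dt)
{
    assert(dt >= 0.0f);
    for (auto& item : array.items)
    {
        Position& position = std::get<Position>(item);
        const Velocity& velocity = std::get<Velocity>(item);
        position.x += velocity.dx * dt;
        position.y += velocity.dy * dt;
    }
}
```

Calling `integrate` on an array whose signature lacks Position or Velocity fails to compile, so a mismatch between a system and a signature shows up at build time rather than at run time.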

Text templates
We will use C# to generate our signatures before the C++ code is compiled, so all the types are known at compile time. For this we will use something commonly used in web development: text templates. If it's your first time working with these in Visual Studio, I recommend the syntax highlight plugin.

So in text templates, we can write code that writes code, quite nifty. A small example that generates a list of all our components:

<#@ template debug="false" hostspecific="false" language="C#" #>
<#@ assembly name="$(TargetDir)CodeGeneratorFunctions.dll" #>
<#@ import namespace = "CodeGeneratorFunctions" #>
<#@ output extension=".h" #>
// This file was automatically generated
#pragma once
<#
 string[] components = Generator.GetComponents();
 foreach(string component in components)
 {
  WriteLine("#include \"Components\\" + component + ".h\"");
 }
#>

Here "GetComponents()" is a function I defined in a DLL, CodeGeneratorFunctions, that searches all our files for component signatures. I was pleasantly surprised by the run time: I thought it would take ages to search all those files, but it actually took less than a couple of seconds.

Similar to this example, I wrote text templates for all signatures and systems.

The engine
Back to engine talk, now that we have our component composition, let's determine how efficient this engine could be in theory.

  1. Cache efficiency. At first glance, the cache efficiency would suffer in comparison with the original ECS from [3], but it's hard to tell. We have a larger single item size in the array, but we only require one array simultaneously. Below I explain a small optimization that reduces the individual item size. 
  2. Our memory efficiency is great. Since we can basically store only the necessary components for every object, we waste 0% of memory, with a little overhead of creating a lot of arrays in larger systems. 
  3. The parallel efficiency is similar to the ordinary ECS. We can only execute systems in parallel when they contain unique component sets. 
So in theory, we gained an advantage in terms of memory, but the question remains whether the cache efficiency changed. Looking back at the mega-array structure, we can see that we improved a little: we can store our velocity and position together in one array, so we don't have to access two separate arrays and risk cache misses.

Optimizations
The main disadvantage of this structure shows when you use larger objects. If we have an object storing a lot of components, the item size of the array is going to be large, and a large item size means cache misses. We can make our engine more cache friendly by using hot/cold data separation.

As an example, consider objects storing data for physics (position, velocity) and rendering (huge models of 300MB). Iterating over just two of these objects is a guaranteed cache miss, since the two objects are 300MB apart in memory. The solution is to store a reference to the model rather than the model itself, and to only follow that reference when the model is required for rendering. This way we can update the physics without cache misses caused by the model data (a pointer is only 8 bytes on a 64-bit system).

Hot/cold data separation splits hot data (used often) from cold data (used rarely). If you want to read more, I suggest reading the article on gameprogrammingpatterns about this topic [6].
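The physics and model example can be sketched like this. The struct and function names are hypothetical, and the model is held through a plain non-owning pointer purely to keep the hot item small.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cold data: large, and only touched when rendering.
struct Model
{
    std::vector<float> vertices;
};

// Hot data: small, and touched on every physics update. The model is only
// referenced, so it adds a single pointer to each array item.
struct PhysicsObject
{
    float position[3];
    float velocity[3];
    const Model* model;  // non-owning; the models live elsewhere
};

static_assert(sizeof(PhysicsObject) <= 64,
              "a hot item should fit in a typical 64-byte cache line");

// The physics update walks the packed hot data and never touches a model.
inline void updatePhysics(std::vector<PhysicsObject>& objects, float dt)
{
    for (PhysicsObject& object : objects)
        for (int axis = 0; axis < 3; ++axis)
            object.position[axis] += object.velocity[axis] * dt;
}

// Rendering is the only place that follows the pointer into the cold data.
inline std::size_t countVertices(const std::vector<PhysicsObject>& objects)
{
    std::size_t total = 0;
    for (const PhysicsObject& object : objects)
    {
        assert(object.model != nullptr);
        total += object.model->vertices.size();
    }
    return total;
}
```

On a typical 64-bit build each `PhysicsObject` takes 32 bytes, so two consecutive objects share a cache line no matter how large their models are.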


To improve upon the parallel execution of the system, we can schedule all the work of one system in parallel. Since one system handles multiple arrays of objects, the first option is to process all of these arrays in parallel, and the second option is to process all of the items in one array in parallel. Since the arrays have different lengths, the first option would be a bad choice (one thread would take very long, while the others would be waiting). Thus processing the arrays one after another, and processing all items inside each array in parallel, is the best option. Further optimizations can be done by using SIMD instructions, but that is outside the scope of this article.
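The second option can be sketched with plain `std::thread`. The function names are hypothetical, and for brevity all arrays share one item type here, whereas the engine has a different item type per signature. A real engine would more likely reuse a thread pool than spawn threads for every array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies `update` to every item of one array, splitting the items into
// contiguous chunks of near-equal size, one chunk per worker thread.
template <typename Item, typename Update>
void updateArrayInParallel(std::vector<Item>& items, Update update,
                           std::size_t workerCount)
{
    assert(workerCount > 0);
    const std::size_t chunk = (items.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < items.size(); begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, items.size());
        workers.emplace_back([&items, update, begin, end] {
            for (std::size_t i = begin; i < end; ++i) update(items[i]);
        });
    }
    for (std::thread& worker : workers) worker.join();
}

// The arrays are visited one after another; only the items inside each
// array are processed concurrently, so differing lengths cause no imbalance.
template <typename Item, typename Update>
void runSystem(std::vector<std::vector<Item>>& arrays, Update update,
               std::size_t workerCount)
{
    for (std::vector<Item>& array : arrays)
        updateArrayInParallel(array, update, workerCount);
}
```

Because every thread works on its own contiguous chunk, no two threads write to the same item and no locking is needed, as long as `update` only touches the item it is given.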


Future work
I'd like to see if I can increase parallel efficiency by introducing tech from Naughty Dog's engine. They have a great presentation about it available online. Instead of introducing parallel execution per system, I'd like to see if we can create a set of instructions per signature and create a fiber to execute that.

When I'm done creating the engine (which is never, because it's a game engine) I will compile some benchmarks compared to a 'normal' ECS implementation. I also hope to release some source code from this engine.

Sources
[1] Understanding Component-Entity-Systems by Boreal Games
[2] Implementation of a component-based entity system in modern C++ by Vittorio Romeo at CppCon
[3] Evolve your Hierarchy by Mick West
[4] What is data oriented design StackOverflow
[5] Introduction to data oriented design DICE
[6] Gameprogrammingpatterns: Data Locality
[7] Parallelizing the Naughty Dog engine using fibers

Thursday, February 25, 2016

CUDA in Visual Studio 2015

Update September 2016: CUDA 8.0 is available, and Visual Studio 2015 Update 3 is supported.
Update June 2016: CUDA 8.0 RC is available. Visual Studio 2015 is supported, but Update 2 is not yet included.

In this small post I will explain how you can use CUDA 7.5 in Visual Studio 2015. I don't claim to have full support for VS2015, merely using the VS2015 editor and compiling a VS2013 project. You will still need to have Visual Studio 2013 installed, with the CUDA toolkit extension.

On a separate note: Nsight does have support for VS2015, so no hacks required there!

What you need:

  • Any C++ project using CUDA.
  • CUDA 7.5 toolkit with VS2013 support installed (I only tested it with 7.5, but I can imagine other versions working as well)

In the project properties, make sure the project compiles for "v120", which is VS2013. 

To actually load the project in VS2015 having support for compiling CUDA code, we have to copy some files. Visual Studio uses targets to include several extensions for loading project files. We simply have to copy the VS2013 support to the VS2015 folder:

The CUDA 7.5 extension files for VS2013 are located in:
C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\BuildCustomizations

Find CUDA 7.5.props, CUDA 7.5.targets and CUDA 7.5.xml

Simply copy them to the same folder for VS2015:
C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations

You can now load these projects in VS2015. 


My raytracer running from VS2015 using Nsight.