How Svelto.ECS + Svelto.Task help writing Data Oriented, Cache Friendly, Multi-Threaded code

Note: this is an ongoing article and will be updated with my new findings over time. It assumes the reader knows how to use Svelto.ECS and Svelto.Tasks, although the findings are interesting regardless.

Introduction

New exciting features are coming to the Svelto.ECS and Svelto.Tasks libraries. As I am currently focused on optimizing Robocraft for Xbox One, I added several functionalities that can help make our game faster. Hence I decided to write a simple example to show some of them to you. I have to be honest, it’s very hard to come up with a simple example that makes sense. Every game has its own needs, and what makes sense for one may not be that meaningful for others. However, everything boils down to the fact that I added features to exploit data locality (the CPU cache) and to easily write multithreaded code. I have already discussed how to use the multithreaded Svelto.Tasks ITaskRoutine, so I hope I can now show how to use them with Svelto.ECS. Spoiler alert: this article alone won’t be sufficient to show you the potential of the library, as being more thorough would make this post too long and complex, therefore I encourage you to experiment on your own or wait for my next articles (which come every six months or so :P). This article also assumes that you know the basic concepts behind Svelto.ECS and Svelto.Tasks.

Initially I wanted to write an example that would mimic the boids demo shown at the Unite roadmap talk, which you can briefly see in this Unity blog post. I soon decided to stop going down that route because it’s obvious that Joachim took advantage of the new thread-safe transform class, or even wrote a new rendering pipeline on purpose for this demo, as the standard Transform class and the GameObject-based pipeline would otherwise be the biggest bottleneck, impossible to parallelize. Since I believe that massive parallelism makes more sense on the GPU and that multithreading should be exploited on the CPU in a different way, I decided not to invest more time in it. However, as I had some interesting results, I will use what is left of this useless example I wrote to show you what potential gain in milliseconds you may get using Svelto.ECS and Svelto.Tasks. I will eventually discuss how this approach can potentially be used in real-life scenarios too.

Svelto.ECS and Svelto.Tasks have a symbiotic relationship. Svelto.ECS naturally encapsulates logic running on several entities and lets you treat each engine flow as a separate “fiber“. Svelto.Tasks allows these independent “fibers” to run asynchronously, even on other threads. GPUs work in a totally different fashion than CPUs, and operating systems take on the burden of deciding how processes must share the few cores usually available. While on a GPU we talk about hundreds of cores, on a CPU we usually have only 4-12 cores that have to run thousands of threads. However, each core can run only one thread at a time, and it’s thanks to the OS scheduling system that all these threads can actually share the CPU power. The more threads run, the harder it is for the OS to decide which threads to run and when. That’s why massive parallelism doesn’t make much sense on the CPU. At full capacity, you can’t physically run more threads than cores, hence multithreading on the CPU is not meant for massively parallel intensive operations. While with Svelto.ECS+Tasks it would be possible to run a thread for each engine for each entity, it would probably be better to run just a thread for each engine or sequencer execution.

The Example

You can now download the new Svelto.ECS example, which has been updated to Unity 2017, and open the scene under the Svelto-ECS-Parallelism-Example folder. Since I removed everything from the original demo I was building, I can focus on the code instructions only, and that’s why this example is the simplest I have made so far, as it has only one Engine and one Node. The demonstration will show four different levels of optimization, using different techniques, and how it is possible to make the same instructions run several times faster. In order to show the optimization steps, I decided to use some ugly defines (sorry, I can’t really spend too much time on these exercises), therefore we need to stop the execution of the demo in the editor and change the define symbols every time we want to proceed to the next step. Alternatively, you can build a separate executable for each combination of defines, so you won’t have to worry about the editor overhead during profiling. The first set of defines will be: FIRST_TIER_EXAMPLE;PROFILER (never mind CROSS_PLATFORM_INPUT, it can stay there if needed).

Let’s open and analyse the code in Visual Studio. As usual you can find our standard Context and relative composition root (MainContextParallel and ParallelCompositionRoot). Inside the context initialization, a BoidEngine is created (I didn’t bother renaming the classes) as well as 256,000 entities. It seems a lot, but we are going to run very simple operations on them, nothing to do with a real flock algorithm. Actually, the instructions that run are basically random and totally meaningless. Feel free to change them as you wish.

The operations that must be executed are written inside the BoidEnumerator class. While the code example is ugly, I wrote it to be allocation free in order to be as fast as possible (the only allocation present is due to the ToString() used for the GUI). A preallocated IEnumerator can be reused for each frame. This enumerator doesn’t yield at all, running synchronously, and operates on a set of nodes that must be defined beforehand. The set of operations to run, as already said, is totally meaningless and written just to show how much time very simple operations can take to run on a large set of entities.
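To give an idea of the shape of that class, here is a minimal, hedged sketch of an allocation-free, reusable enumerator. The node type, its fields and the constructor are my own illustrative assumptions, not the actual demo code:

    using System.Collections;
    using UnityEngine;

    // Illustrative node layout only: the real demo node has different fields.
    public sealed class BoidNode
    {
        public Vector3 position;
        public Vector3 direction;
    }

    // Hedged sketch of an allocation-free, reusable enumerator: it is created
    // once, never yields and runs the whole job synchronously in one MoveNext.
    public sealed class BoidEnumeratorSketch : IEnumerator
    {
        readonly BoidNode[] _nodes; // the set of nodes, defined beforehand
        readonly int        _count;

        public BoidEnumeratorSketch(BoidNode[] nodes, int count)
        {
            _nodes = nodes;
            _count = count;
        }

        public bool MoveNext()
        {
            for (int i = 0; i < _count; i++)
            {
                // the meaningless per-entity operations run here
            }

            return false; // completes in one step: no per-frame allocations
        }

        public void Reset() { }

        public object Current { get { return null; } }
    }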

The main loop logic, enclosed in the Update enumerator, will run until the execution is stopped. It runs inside a standard Update scheduler because I am assuming that the result of the operations must be available every frame. The way it’s designed, the main thread cannot run faster than the execution of the operations, even when they run on other threads. This is very easily achievable with Svelto.Tasks.
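As a hedged sketch only (the scheduler and field names in the real demo may differ), the idea is roughly this:

    // Hedged sketch: the main loop is itself an enumerator scheduled on the
    // standard Update scheduler; yielding the operations enumerator means the
    // frame cannot complete before the operations do, even if they run on
    // other threads.
    IEnumerator Update()
    {
        while (true)
        {
            yield return _boidEnumerator; // resumes only once the work is done

            // here the results are guaranteed to be ready for this frame
        }
    }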

 

The Naive Code (Editor 113ms, Client 58ms)

So, assuming you didn’t get too confused by the things happening under the hood, running the example will show you that the operations take several milliseconds to be executed (use the stats window if you use the editor). In my case, on my i7 CPU, running the FIRST_TIER_EXAMPLE operations takes around 113ms (interestingly, using .NET 4.6 it takes longer, I wonder why!).

This is how the code looks at the moment:
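The original listing was an image, so what follows is only a hedged reconstruction of the kind of Vector3-based instructions the naive version runs; the actual operations in the demo are different but of the same nature, and the _nodesList field (a FasterList<BoidNode>) is an assumption of mine:

    // FIRST_TIER_EXAMPLE flavour: Vector3 operators and properties, each of
    // which compiles to a call or callvirt, on top of the list indexer call.
    for (int i = 0; i < _nodesList.Count; i++)
    {
        var node = _nodesList[i];                    // FasterList indexer: a method call

        node.position += node.direction * 0.016f;    // Vector3.op_Multiply + op_Addition
        node.direction = node.direction.normalized;  // property getter (sqrt inside)
    }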

Let’s have a look at the MSIL code generated:

This set of operations runs four times for each entity. As I stated before, they are meaningless and random; it’s just a set of common and simple operations to run.

The Call-less Code (Editor 57ms, Client 23ms)

Let’s now switch to SECOND_TIER_EXAMPLE, changing the defines in the editor and replacing FIRST_TIER_EXAMPLE. Let the code recompile. Exactly the same operations, but with different instructions, now take around 65ms (client 24ms)… what changed? I simply avoided the Vector3 methods, running the operations directly on their components.
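A hedged sketch of the same loop rewritten component by component (only the additive part is shown, and the names are again illustrative):

    // SECOND_TIER_EXAMPLE flavour: the same math written on the x, y and z
    // components directly, so no Vector3 method or operator call is emitted
    // for the arithmetic itself.
    for (int i = 0; i < _nodesList.Count; i++)
    {
        var node = _nodesList[i];

        node.position.x += node.direction.x * 0.016f;
        node.position.y += node.direction.y * 0.016f;
        node.position.z += node.direction.z * 0.016f;
    }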

And now with SECOND_TIER_EXAMPLE:

It appears to me that the main difference is the number of call/callvirt instructions executed, which obviously involves several extra operations. Every call involves saving the current registers and some variables on the stack; callvirt involves an extra pointer dereference to get the address of the function to call. I am reporting the IL code in case you think I am missing something; please let me know, as I am obviously interested.

Let’s quickly switch to THIRD_TIER_EXAMPLE. The code now runs at around 57ms (client 23ms, almost no difference, interesting), but the instructions didn’t change at all. What changed then? I simply exploited the FasterList method that returns the internal list array without trimming or copying it. This removed another extra callvirt, as the operator[] of a collection like FasterList (the same goes for the standard List) is actually a method call. An array, instead, is known directly to the compiler as a contiguous portion of allocated memory, which it therefore knows how to access efficiently.
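In code it could look something like the following; I am assuming the FasterList method that exposes the internal buffer is called ToArrayFast (as in current Svelto versions), and the rest is illustrative:

    // THIRD_TIER_EXAMPLE flavour: fetch the internal backing array once and
    // index it directly, removing the callvirt hidden behind the list indexer.
    int count = _nodesList.Count;
    BoidNode[] nodes = _nodesList.ToArrayFast(); // internal array: no trim, no copy

    for (int i = 0; i < count; i++)
    {
        nodes[i].position.x += nodes[i].direction.x * 0.016f;
        nodes[i].position.y += nodes[i].direction.y * 0.016f;
        nodes[i].position.z += nodes[i].direction.z * 0.016f;
    }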

 

The Cache Friendly Code (Editor 24ms, Client 16ms)

Have you ever heard how ECS design can easily enable data oriented code? This is what the FOURTH_TIER_EXAMPLE define is about. The code now runs at around 24ms, another big jump, but again it seems that the instructions didn’t change much. Let’s see them:

It’s true that the code is now call free, but would this justify the big jump? The answer is no. Just removing the last call left would have saved only around 10ms, while now the procedure is more than 30ms faster. The big saving must be found in the new IStructNodeEngine Svelto.ECS feature that FOURTH_TIER_EXAMPLE enables.
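I won’t reproduce the exact IStructNodeEngine API here; as a rough, hedged idea only, the node in this tier is declared as a value type and stored in a plain contiguous array, something along these lines (illustrative declaration, not the real one):

    // Illustrative only: a node declared as a struct containing only value
    // types, so that an array of them (e.g. BoidStructNode[] structNodes) is
    // one contiguous block of memory the CPU can stream through the cache.
    public struct BoidStructNode
    {
        public Vector3 position;
        public Vector3 direction;
    }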

The cache is what makes the procedure much faster. Nowadays CPUs can read a contiguous block of data up to 64 bytes big in one memory access from the L1 cache (which takes around 4 clock cycles). The L1 cache is efficiently filled by the CPU on data access, caching 32KB of consecutive data. As long as the data you need to access is in that 32KB of cache, the instructions will run much faster. Missing data from the cache will trigger a very slow RAM access, which can take hundreds of cycles! There are many articles around about modern CPU architecture and how data oriented code takes full advantage of it, so I won’t spend more time on it, but I will link some resources once I have the time, so check the end of the article for new edits every now and then.

Caching is weird though. For example, we store the node structure in a local variable. This seems to be a harmless instruction, but in data oriented code it can be a disaster! The following code runs faster, but just a bit faster:
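The original snippet was an image; as a hedged reconstruction of the comparison (using the illustrative BoidStructNode array from above):

    // Faster variant: the struct's fields are accessed directly through the
    // array element, so only the data actually needed is read.
    structNodes[i].position.x += structNodes[i].direction.x * 0.016f;
    structNodes[i].position.y += structNodes[i].direction.y * 0.016f;
    structNodes[i].position.z += structNodes[i].direction.z * 0.016f;

    // Variant with a local copy: harmless with a small struct, but once the
    // struct grows the copy forces a struct-wide read (and a write back),
    // wrecking the cache.
    var node = structNodes[i];
    node.position.x += node.direction.x * 0.016f;
    node.position.y += node.direction.y * 0.016f;
    node.position.z += node.direction.z * 0.016f;
    structNodes[i] = node;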

How does removing the copy of the node into a local variable make it faster? Well, this is where things get very tricky. The reason is that the contiguous amount of data you need to read must not be too big and must be readable in one memory access (32 to 64 bytes, depending on the CPU).

Since our BoidNode struct is quite small, caching the variable here would actually have made the execution just slightly slower. However, if you make the BoidNode struct larger (try adding another four Vector3 fields and cache the whole structure in a local variable), you will wreck the processor and the execution will become largely slower! (Edit: this problem could be solved with the new ref feature of C# 7 which, unluckily, is still not supported by Unity.)

Accessing the single components directly, instead, avoids the struct-wide read caused by the copy into a local variable, and since x, y and z are read and cached at once, these instructions will run at the same speed regardless of the size of the struct. Alternatively, you can cache just the position Vector3 in a local variable, which won’t make it faster, but it will still run fast regardless of the size of the struct.

If you don’t know how data is laid out in memory, well, stop here and study how data structures work. If you know C++ you are already at an advantage, because you know how pointers work; otherwise, you must know that you can write contiguous data in memory only with arrays of value types, including structs, assuming that these structs contain only value types! Svelto.ECS now allows nodes (the mapping of entity components) to be stored either as classes or as structs, but when structs are used, the nodes are always stored as a simple array of structs, so that you can achieve the fastest result, depending on how you design your structs (Structure of Arrays or Array of Structures, it’s up to you!)
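To illustrate the two layouts mentioned (purely illustrative types, not the demo code):

    // Array of Structures (AoS): a BoidNodeAoS[] where each element carries
    // every field of one entity, laid out contiguously.
    public struct BoidNodeAoS
    {
        public Vector3 position;
        public Vector3 direction;
    }

    // Structure of Arrays (SoA): one tightly packed array per field, so
    // iterating a single field streams only that data through the cache.
    public struct BoidNodesSoA
    {
        public float[] positionX, positionY, positionZ;
        public float[] directionX, directionY, directionZ;
    }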

 

The Multi Threaded Cache Friendly Code (Editor 8ms, Client 3ms)

To conclude, let’s keep the FOURTH_TIER_EXAMPLE define but add a new one, called TURBO_EXAMPLE. The code now runs at around 8ms. This is because the new MultiThreadedParallelTaskCollection Svelto.Tasks feature is now enabled. The operations, instead of running on one thread, are split and run on 8 threads. As you have already figured out, 8 threads doesn’t mean 8 times faster, and this is the reality. Splitting the operations over threads not only gives sub-linear gains, but also diminishing returns: the more threads, the smaller the speed-up, until adding threads won’t make any difference. This is due to what I was explaining earlier. Multithreading is no magic. Physically your CPU cannot run more threads than its cores, and this is true for all the threads of all the processes running at the same time on your operating system. That’s why CPU multithreading makes more sense for asynchronous operations that can run over time or operations that involve waiting for external sources (sockets, disks and so on), so that the thread can pause until the data is received, leaving space for other threads to run in the meantime. Memory access is also a big bottleneck, especially when cache misses are involved (the CPU may even decide to switch to other threads while waiting for the memory to be read; this is what Hyper-Threading actually does).

This is how I use the MultiThreadedParallelTaskCollection in the example:
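The original snippet was an image, so the following is only a rough sketch of the idea; the constructor parameter and the Add call reflect my reading of the Svelto.Tasks API, and the slice-based BoidEnumerator constructor is an assumption of mine:

    // Hedged sketch: split the node range into one slice per thread and let
    // the collection run each slice's enumerator on its own pre-activated thread.
    const int NUM_THREADS = 8;

    var multiThreadedCollection = new MultiThreadedParallelTaskCollection(NUM_THREADS);

    int nodesPerThread = count / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; i++)
    {
        // one enumerator per thread, each working on an independent slice
        multiThreadedCollection.Add(
            new BoidEnumerator(structNodes, i * nodesPerThread, nodesPerThread));
    }

    // being an IEnumerator itself, the collection can be scheduled on any
    // runner; here the main loop waits for it, so the results are ready once
    // per frame
    while (true)
        yield return multiThreadedCollection;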

This collection allows enumerators to be added and executed later on multiple threads. The number of threads to activate is passed to the constructor. It’s interesting to note that the number of enumerators to run is independent of the number of threads, although in this case they are mapped 1 to 1. The MultiThreadedParallelTaskCollection, being an IEnumerator, can be scheduled on whatever runner, but the sets of tasks will always run on their own, pre-activated, threads.

The way I am using threading in this example is not the way it should be used in real life. First of all, I am actually blocking the main thread to wait for the other threads to finish, so that I can measure how long the threads take to finish the operations. In a real-life scenario, the code shouldn’t be written to wait for the threads to finish. For example, handling AI could run independently of the frame rate. I am thinking about several ways to manage synchronization so that it will be possible not just to exploit continuation between threads, but even to run tasks on different threads at the same time and synchronize them. WaitForSignalEnumerator is an example of what I have in mind; more will come.

All right, we are at the end of the article now. I need to repeat myself here: this article doesn’t really show what is possible to do with Svelto.ECS and Svelto.Tasks in their entirety; this is just a glimpse of the possibilities opened by this infrastructure. Also, the purpose of this article wasn’t to show the level of complexity that can be handled easily with the Svelto framework, but just to show you how important it is to know how to optimize our code. The most important optimizations are first the algorithmic ones, then the data structure related ones and eventually the ones at instruction level. Multithreading is not about optimization, but rather about being able to actually exploit the power of the CPU in use. I also tried to highlight that CPU threading is not about massive parallelism; GPUs should be used for that purpose instead.

Note: because of the wrong use of the Interlocked.Increment function, which I was stupidly calling inside the inner loop, the previous test was skewed. Obviously the lock contention and the lock itself made all the tests a bit slower, but much slower in the multithreaded case. I didn’t notice it until I had a second look today!

 

The Importance of the Compiler (Client 1.5ms!!)

I started to wonder: what if I had the chance to use a good C++ compiler, to see if it could do a better job with auto-vectorization? After all, we are aware of the fact that the JIT compiler can’t do miracles in this sense. Since IL2CPP is not available for the Windows standalone platform, I compiled the same code for UWP, using the IL2CPP option. I guess the results are pretty clear: IL2CPP produces an executable that is twice as fast as the C# version. Since there isn’t any call to native code, it can’t be because of the lack of marshalling. I haven’t had the time to verify yet what the real difference is, but auto-vectorization may be the reason.

Conclusions

While this article is implicitly (and strongly) advising you to follow the Entity-Component-System pattern to centralize and modularize the logic of your game entities, and gives another example of Svelto.Tasks usage to easily exploit multithreading, its main goal is to show you why it is necessary to pay attention to how the code is written, even when we talk about C# and Unity. While the effectiveness of these extreme optimizations largely depends on the context, understanding how the compiler generates the final code is necessary to avoid degrading the performance of large projects.

Having a light knowledge of modern CPU architectures (like I do) helps to understand how to better exploit the CPU cache and optimize memory accesses. Multithreading really needs more separate articles, but I can now state with certainty that exploiting the full power of the CPU with Unity is more than feasible, although the biggest bottleneck will remain the rendering pipeline itself.

To be honest, most of the Unity optimization issues are often related to the marshalling that the Unity framework performs when it is time to call the Unity native functions. Unluckily these functions are many and are called very often, which made them the biggest issue we found during our optimizations. GameObject hierarchy complexity, calling Transform functions multiple times and similar problems can really kill the performance of your product. I may talk about these findings in the future in a separate article, but many have been discussed already by other developers. The real goal to pursue is to find the right balance between what you can do in C# and what must be left to Unity. The ECS pattern can help a lot with this, avoiding the use of MonoBehaviour functionalities when they are not strictly necessary.

For example, let’s try to instantiate 256k GameObjects (yes, it’s possible to do) and add a MonoBehaviour to each that simply runs the fastest version of the test. On my PC it runs at 105ms, and even if profiling with Unity doesn’t give the best accuracy, it seems that half of this time is spent on marshalling (very likely the time spent switching between C# and C++ for each update; my current assumption is that MonoBehaviours are stored as native objects and they need to switch to managed code for each single Update called, which is just crazy!).
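A rough sketch of what that comparison script could look like (illustrative, not the exact test code):

    using UnityEngine;

    // Hypothetical sketch: attached to each of the 256k GameObjects; Update
    // runs the same component-wise math as the fastest ECS tier, but Unity
    // has to marshal every single Update call from native to managed code.
    public class BoidMonoBehaviour : MonoBehaviour
    {
        Vector3 _position;
        Vector3 _direction = Vector3.one;

        void Update()
        {
            _position.x += _direction.x * 0.016f;
            _position.y += _direction.y * 0.016f;
            _position.z += _direction.z * 0.016f;
        }
    }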

 

Final Results:
                                                          Editor   Client
ECS Naive Code                                            113ms    58ms
ECS No Calls Involved                                     57ms     23ms
ECS Structs only                                          24ms     16ms
ECS Multithreading (8 threads)                            8ms      3ms
ECS Multithreading UWP/IL2CPP (8 threads)                 n/a      1.5ms
Classic GameObject/Update approach (no calls inside)      105ms    45ms
Classic GameObject/Update approach (no calls inside) UWP  n/a      22ms

 

TL;DR:

  • I crammed too many points in this article 🙂
  • Optimize your algorithms first; you don’t want any quadratic algorithm in your code. Logarithmic is the way to go.
  • Optimize your data structures next; you must know how your data structures work in memory to use them efficiently. There is a difference between an array, a list and a linked list!
  • Optimize your instructions last, paying attention to memory accesses and caching (if you want to squeeze those milliseconds out!). Coding to exploit the CPU cache is as important as using multithreading. Of course not all the engines need to be so optimized, so it depends on the situation. I will link some good references when I have the time, as there is much more to discuss about this point.
  • When using Unity, be very careful to minimize the usage of wrappers to native functions. These are many and often can be easily identified in Visual Studio navigating to the declaration of the function.
  • Mind that Unity callbacks (Update, LateUpdate etc.) are often marshalled from native code, making them slow to use. The Unity hierarchical profiler tells you only half the truth, showing the time taken inside the Update, but not between Updates (usually shown in the “others” graph). The Timeline view will show “gaps” between Update calls, which is the time used to marshal them.
  • Multithreading is tricky, but Svelto.Tasks can massively help.
  • The ECS pattern allows you to write code that exploits temporal and spatial locality (as you can manage all your entities in one central place). Try Svelto.ECS to see what you can do 😉
  • The ECS pattern will also help you minimize the use of Unity functionalities, which most of the time involve costly marshalling operations to switch to native code. Most of them are not really justified and due to old legacy code that has never been updated in 10 years!
  • I understand that it is hard to relate to an example that iterates over 256k entities. However, moving logic from MonoBehaviours to ECS engines saved dozens of milliseconds in the execution of the real-life code of our projects.
  • Why don’t they finish IL2CPP for standalone platforms? At least windows only!
  • Don’t use the editor to profile 🙂

Please leave your thoughts here; I will probably expand this article with my new findings, especially the ones related to the cache optimizations.

External resources: (more will come)

 

 

5 thoughts on “How Svelto.ECS + Svelto.Task help writing Data Oriented, Cache Friendly, Multi-Threaded code”

  1. Very good and interesting article!
    Just a note on the preamble (sorry :P)

    – “However each core can run only one thread at a time”: there are some exceptions; with SMT more threads are executed at the same time on the same core.

    – “That’s why massive parallelism doesn’t make much sense on CPU”: actually this is not strictly related to context switching. Indeed, CPUs have OpenCL back-end implementations, which of course do not translate thousands of threads into thousands of OS threads, but still can’t reach the same performance for data-parallel algorithms. To be noted that even on the GPU there is no real “context switching”: groups of threads of fixed size “allocate” and retain all their resources (registers, shared memory portion, slots) until they are completed, then another group of threads is issued; so you choose how many threads you want to schedule based on the size of the problem instead of on the number of cores on the GPU.

    Although on the GPU such “context switching” between threads comes at zero cost, the real difference between CPU and GPU in resulting performance is due to the different aims of those hardware architectures. The CPU’s main aim is to reduce the single-thread processing time, and to do this it requires a lot of fancy optimizations which make each single core very complex and require fast and large caches (so few cores can be placed on the same die); the GPU’s main aim is to do more work in the same time unit, while the single-thread processing time is a secondary problem. So many cores (very unoptimized and simple, except at a coarse level) can be placed on the die, and each one (of many) will solve (slowly) a part of the problem that must be independent from the rest (there are some exceptions though): this usually means that the exact same code processes different data (e.g. using different array indices), so such cores can be very simple and replicated many times.

    1. thanks for the reply!

      1) I am not aware of any SMT architecture running games. Edit: OK, I checked on the wiki; by SMT you mean Intel Hyper-Threading. Hyper-Threading doesn’t allow two threads to run at the same time; it allows super fast context switching, so that a second thread can run while the first one is waiting for something slow, like a RAM memory access.

      2) I didn’t say (and if I did, I didn’t mean it) that massive parallelism doesn’t make sense on a consumer CPU because of the context switching, but because you can’t really run more threads than cores at full speed. The cores of a consumer CPU are at most 6 (12 with HT), while the GPU ones can be more than a thousand!

      3) A simple set of vectorized instructions running on totally independent entities makes waaaaay more sense on the GPU. Vector field algorithms make way more sense on the GPU than on the CPU. However, in the specific boid flock example there is a drawback, which is the use of a read/write data structure to store the global positions of the entities, as every entity needs to know the position of the entities close to it. Unluckily I don’t have enough experience with compute shaders to know what the real challenges of developing such an algorithm on the GPU are. However, compute shaders on the GPU and technologies for high vectorization of CPU instructions, like the Intel® SPMD Program Compiler, will be the next fields I will experiment with.

        1) AMD Ryzen, for example, is 2-way SMT. It is not super fast context switching; instead both threads are active at the same time. The context is not switched between the two, because local resources are allocated for both threads.

        2) Even on GPUs with 1000 cores you usually launch kernels with even more than a million threads. They are just issued and completed in groups of threads; when one group completes its processing, it releases its local resources, which can be used by the next group in the queue. Yeah, vectorized calculations are at the basis of GPU optimized cores, even if they are explicitly used on the CPU with vector instructions and implicitly on the GPU (you code what is done by the single thread, but the hardware actually issues and executes sets of threads, e.g. 32, as a single flow of vectorized instructions, masking the threads that followed divergent branches).

        3) Oh, I wrote an article some years ago (http://www.ce.uniroma2.it/publications/parco2013_uniformgrids.pdf) on a similar subject (particle physics collisions) where I was creating on GPU a set of linked-lists for particles close together.

          1. Eheh. Thanks, it would be super-cool <3 but you know I'm crazy and I've just bought a house in Rome xD
