How Svelto.ECS + Svelto.Task help writing Data Oriented, Cache Friendly, Multi-Threaded code

Note: this article assumes that the reader knows how to use Svelto.ECS and Svelto.Tasks although the findings are interesting regardless.

Introduction

New exciting features are coming to the Svelto.ECS and Svelto.Tasks libraries. As I am currently focused on optimizing Robocraft for Xbox One, I added several functionalities that can help make our game faster, so I decided to write a simple example to show some of them. I have to be honest: it’s very hard to think of a simple example that makes sense. Every game has its own needs, and what makes sense for one may not be meaningful for others. However, everything boils down to the fact that I added features to exploit data locality (the CPU cache) and to easily write multi-threaded code. I have already discussed how to use Svelto.Tasks’ multi-threaded ITaskRoutine, so I hope I can now show how to use it with Svelto.ECS. Spoiler alert: this article alone won’t be sufficient to show you the potential of the library, as being more thorough would make this post too long and complex; I therefore encourage you to experiment on your own or wait for my next articles (which come every six months or so :P). This article also assumes that you know the basic concepts behind Svelto.ECS and Svelto.Tasks.

Initially I wanted to write an example that would mimic the boids demo shown at the Unite roadmap talk, which you can briefly see in this Unity blog post. I soon decided to stop going down that route, because it’s obvious that Joachim took advantage of the new thread-safe Transform class, or even wrote a new rendering pipeline on purpose for that demo, as the standard Transform class and the GameObject-based pipeline would otherwise be the biggest bottleneck, impossible to parallelize (note: I eventually found a workaround to this problem to show visually the power of the multi-core Svelto.Tasks approach). Since I believe that massive parallelism makes more sense on the GPU, and that multi-threading should be exploited on the CPU in a different way, I decided not to invest more time in it. However, as I had some interesting results, I will use what is left of this otherwise useless example to show you what potential gain in milliseconds you may get using Svelto.ECS and Svelto.Tasks. I will eventually discuss how this approach can be used in a real-life scenario too.

Svelto.ECS and Svelto.Tasks have a symbiotic relationship. Svelto.ECS naturally encapsulates logic running on several entities and treats each engine flow as a separate “fiber”. Svelto.Tasks allows these independent “fibers” to run asynchronously, even on other threads.

GPUs work in a totally different fashion than CPUs, and operating systems take on the burden of handling how processes must share the few cores usually available. While on a GPU we talk about hundreds of cores, a CPU usually has only 2-12 cores that have to run thousands of threads. Each core can run only one thread at a time, and it’s thanks to the OS scheduler that all these threads can actually share the CPU’s power. The more threads run, the harder it is for the OS to decide which threads to run and when. That’s why massive parallelism doesn’t make much sense on the CPU: at full capacity, you can’t physically run more threads than cores, thus multi-threading on the CPU is not meant for intensive operations.

The Example

You can now download the new Svelto.ECS cache example and open the scene under the Svelto-ECS-Parallelism-Example folder. Since I removed everything from the original demo I was building, I can focus on the code instructions only; that’s why this is the simplest example I have made so far, with only one Engine and one Node. The demonstration will show four different levels of optimization, using different techniques, and how it is possible to make the same instructions run several times faster. In order to show the optimization steps, I decided to use some ugly defines (sorry, I can’t really spend too much time on these exercises), so we need to stop the execution of the demo in the editor and change the define symbols every time we want to proceed to the next step. Alternatively, you can build a separate executable for each combination of defines, so you won’t have to worry about the editor overhead during the profiling. The first set of defines will be: FIRST_TIER_EXAMPLE;PROFILER.

Let’s open and analyse the code. As usual, you can find our standard Context and its composition root (MainContextParallel and ParallelCompositionRoot). Inside the context initialization, a BoidEngine is created (I didn’t bother renaming the classes) as well as 256,000 entities. That seems like a lot, but we are going to run very simple operations on them, nothing to do with a real flock algorithm. Actually, the instructions that run are basically random and totally meaningless. Feel free to change them as you wish.

The operations that must be executed are written inside the BoidEnumerator class. While the code example is ugly, I wrote it to be allocation free and as fast as possible: a preallocated IEnumerator can be reused every frame. This enumerator doesn’t yield at all, running synchronously, and operates on a set of entities that must be defined beforehand. The operations themselves, as already said, are totally meaningless and written just to show how much time even very simple operations can take on a large set of entities.
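To make the idea concrete, here is a minimal sketch of what such an allocation-free, synchronous enumerator can look like. This is not the demo’s actual code: the BoidNode fields, the constructor parameters and the entity range are my assumptions.

```csharp
using System.Collections;
using UnityEngine;

// Assumed shape of the demo's node: two plain Vector3 fields.
// (At the first three tiers it is a class; the fourth tier turns it into a struct.)
class BoidNode
{
    public Vector3 position;
    public Vector3 velocity;
}

// Minimal sketch of an allocation-free enumerator in the spirit of
// BoidEnumerator: preallocated once, reused every frame, never yielding.
class BoidEnumerator : IEnumerator
{
    readonly BoidNode[] _nodes;
    readonly int _start, _count;

    public BoidEnumerator(BoidNode[] nodes, int start, int count)
    {
        _nodes = nodes;
        _start = start;
        _count = count;
    }

    public bool MoveNext()
    {
        // the whole job completes inside a single MoveNext: no yields,
        // no allocations, just a tight loop over the assigned entity range
        for (int index = _start; index < _start + _count; index++)
        {
            // ... the meaningless operations run here ...
        }

        return false; // done: nothing left to resume
    }

    public void Reset() { } // the instance is reused, not reallocated

    public object Current { get { return null; } }
}
```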

The main loop logic, enclosed in the Update enumerator, will run until the execution is stopped. It runs inside a standard Update scheduler because I am assuming that the result of the operations must be available every frame. In the way it’s designed, the main thread cannot proceed faster than the execution of the operations, even when they run on other threads. This is very easily achievable with Svelto.Tasks.
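The shape of that loop is roughly the following (a sketch with assumed names, for the single-threaded tiers where the enumerator runs synchronously on the main thread):

```csharp
// Sketch of the Update loop: the preallocated enumerator is reset and
// consumed every frame on the Update scheduler.
IEnumerator Update()
{
    while (true)
    {
        _boidEnumerator.Reset();
        _boidEnumerator.MoveNext(); // runs all the operations synchronously

        yield return null;          // resume on the next frame
    }
}
```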

The Naive Code (Editor 113ms, Client 58ms)

So, assuming you didn’t get too confused by the things happening under the hood, running the example will show you that the operations take several milliseconds to execute (use the stats window if you use the editor). In my case, on my i7 CPU, running the FIRST_TIER_EXAMPLE operations takes around 113ms.

This is how the code looks at the moment:
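(As a stand-in for the original listing, here is a minimal sketch of the naive tier; the arithmetic is arbitrary, and BoidNode and its fields are the assumptions sketched earlier. The point is only that every line goes through Vector3 methods and operators.)

```csharp
// Naive tier sketch: every Vector3 operator and method below compiles to a
// call/callvirt and copies Vector3 structs through the stack.
static void NaiveUpdate(FasterList<BoidNode> nodes, int count, float deltaTime)
{
    for (int index = 0; index < count; index++)
    {
        var node = nodes[index]; // the list indexer is itself a method call;
                                 // BoidNode is still a class at this tier,
                                 // so 'node' is just a reference

        for (int iterations = 0; iterations < 4; iterations++)
        {
            node.position = node.position + node.velocity * deltaTime;
            node.velocity = Vector3.Normalize(node.velocity);
        }
    }
}
```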

Let’s have a look at the MSIL code generated.

This set of operations runs four times for each entity. As I stated before, the operations are meaningless and random; it’s just a set of common, simple instructions to run.

The Call-less Code (Editor 57ms, Client 23ms)

Let’s now switch to SECOND_TIER_EXAMPLE, replacing FIRST_TIER_EXAMPLE in the editor’s define symbols, and let the code recompile. Exactly the same operations, compiled to different instructions, now take around 65ms (client 24ms)… what changed? I simply instructed the compiler not to use the Vector3 methods, running the operations directly on the components.

The code looks like:
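(Again a stand-in sketch, under the same assumptions as the naive one: position and velocity are plain Vector3 fields, so their components can be written directly.)

```csharp
// Second tier sketch: the same kind of arithmetic written component by
// component, so the compiler emits plain field loads/stores instead of
// call/callvirt instructions.
static void CalllessUpdate(FasterList<BoidNode> nodes, int count, float deltaTime)
{
    for (int index = 0; index < count; index++)
    {
        var node = nodes[index]; // still a class: the local is a reference

        for (int iterations = 0; iterations < 4; iterations++)
        {
            node.position.x += node.velocity.x * deltaTime;
            node.position.y += node.velocity.y * deltaTime;
            node.position.z += node.velocity.z * deltaTime;
        }
    }
}
```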

It appears to me that the main difference is the number of call/callvirt instructions executed, which obviously involves several extra operations. Every call involves saving the current registers and some variables on the stack; callvirt involves an extra pointer dereference to get the address of the function to call. More importantly, we avoid copying the Vector3 structs, since method results are passed through the stack.

Let’s quickly switch to THIRD_TIER_EXAMPLE. The code now runs at around 57ms (client 23ms, almost no difference, interestingly), yet the operations didn’t change at all. What changed then? I simply exploited the FasterList method that returns the array backing the list. This removed another extra callvirt, as the operator[] of a collection like FasterList (and the same goes for the standard List) is actually a method call. An array, instead, is known directly by the compiler as a contiguous portion of allocated memory, so the compiler knows how to treat it efficiently.
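(A sketch of the change; I am using ToArrayFast() here, which, if I recall the FasterList API correctly, returns the internal array without copying — check the actual Svelto source for the exact name.)

```csharp
// Third tier sketch: iterate the backing array instead of the list.
BoidNode[] nodesArray = nodes.ToArrayFast(); // assumed: returns the internal array

for (int index = 0; index < count; index++)
{
    var node = nodesArray[index]; // plain array indexing, no callvirt

    // ... same component-wise operations as before ...
}
```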

The Cache Friendly Code (Editor 24ms, Client 16ms)

Have you heard how an ECS design can easily enable data-oriented code? This is what the FOURTH_TIER_EXAMPLE define is about. The code now runs at around 24ms, another big jump, but again it seems that the instructions didn’t change much. Let’s see them:
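(A stand-in sketch of the fourth tier: the node, previously a class, becomes a struct of value types that Svelto.ECS stores in a plain array. The fields are the same assumptions as before, and in the real code the struct would implement Svelto’s IEntityStruct interface.)

```csharp
// Fourth tier sketch: an array of these structs is one contiguous block of
// memory, because every field is a value type.
struct BoidNode // would implement Svelto.ECS's IEntityStruct in the real code
{
    public Vector3 position;
    public Vector3 velocity;
}

// Iterating the array touches memory sequentially, so every cache line the
// CPU loads already contains the next entities to process.
static void CacheFriendlyUpdate(BoidNode[] nodes, int count, float deltaTime)
{
    for (int index = 0; index < count; index++)
    {
        nodes[index].position.x += nodes[index].velocity.x * deltaTime;
        nodes[index].position.y += nodes[index].velocity.y * deltaTime;
        nodes[index].position.z += nodes[index].velocity.z * deltaTime;
    }
}
```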

It’s true that the code is now call free, but would this justify the big jump? The answer is no. Removing the last remaining call would have saved only around 10ms, while the procedure is now more than 30ms faster. The big saving must be found in the new EntityViewStruct feature that FOURTH_TIER_EXAMPLE enables. Svelto.ECS now allows building EntityViews both as classes and as structs. When they are built as structs, you can exploit the full power of writing cache-friendly, data-oriented code.

If you don’t know how data is laid out in memory, stop here and study how data structures work. If you know C++ you already have an advantage, because you know how pointers work; otherwise, you must know that you can write contiguous data in memory only with arrays of value types, including structs, provided that those structs contain only value types! Svelto.ECS now allows storing EntityViews either as classes or as structs, but when structs are used, the EntityViews are always stored as simple arrays of structs, so you can achieve the fastest result depending on how you design your structs (Structure of Arrays or Array of Structures, it’s up to you!).
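(A tiny illustration of the two layouts, with made-up names; both keep the actual float data contiguous in memory.)

```csharp
// Array of Structures (AoS): one array, fields interleaved per entity.
struct BoidAoS { public float x, y, z; } // used as BoidAoS[]

// Structure of Arrays (SoA): one array per field, each contiguous on its own.
class BoidsSoA
{
    public float[] x;
    public float[] y;
    public float[] z;
}
```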

To put it simply: when you use references, the memory area pointed to by a reference is very likely nowhere near where the reference itself is stored. This means that the CPU cannot exploit the cache to read contiguous data. Arrays of value types (which can be structs containing value types) are instead always stored contiguously.

The cache is what makes the procedure much faster. Nowadays a CPU can read a contiguous block of up to 64 bytes in a single memory access from the L1 cache (which takes around 4 clock cycles); one 64-byte cache line holds, for example, five 12-byte Vector3s. The L1 cache is efficiently filled by the CPU on data access, caching 32KB of consecutive data. As long as the data you need is within those 32KB, the instructions will run much faster. Missing data in the cache will trigger a very slow RAM access, which can cost hundreds of cycles! There are many articles around about modern CPU architecture and how data-oriented code takes full advantage of it, so I won’t spend more time on it here, but I will link some resources once I have the time, so check the end of the article for new edits every now and then.

Caching is weird, though. For example, it might seem natural to store the whole EntityViewStruct in a local variable; while this looks like a harmless instruction, in a data-oriented context it can be a disaster! How can removing the store of the struct into a local variable make the code faster? Well, this is where things get very tricky. The reason is that the contiguous amount of data you need to read must not be too big, and must be readable in one memory access (32 to 64 bytes, depending on the CPU).

Since our BoidNode struct is quite small, caching the variable here would actually have made the execution only slightly slower. However, if you make the BoidNode struct larger (try adding another four Vector3 fields and caching the whole structure in a local variable), you will wreck the cache usage and the execution will become much slower! (Edit: this problem could be solved with the new ref locals/returns feature of C# 7 which, unfortunately, is still not supported by Unity.)

Accessing the single components directly, instead, avoids the struct-wide read caused by the copy into a local variable; since x, y and z are read and cached at once, these instructions will run at the same speed regardless of the size of the struct. Alternatively, you can cache just the position Vector3 in a local variable, which won’t make the code faster, but will still work fast regardless of the size of the struct.
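(The three access patterns just discussed, side by side in a sketch; BoidNode and its fields are the same assumptions as before.)

```csharp
static void AccessPatterns(BoidNode[] nodes, int index, float deltaTime)
{
    // 1) Caching the whole struct copies every field into the local variable.
    //    Fine while BoidNode is small, a disaster once the struct grows,
    //    because each iteration drags the full struct through the stack.
    BoidNode copy = nodes[index];
    copy.position.x += copy.velocity.x * deltaTime;
    nodes[index] = copy; // the modified copy must also be written back

    // 2) Direct access reads only the fields actually needed, so the cost
    //    is independent of the struct's size.
    nodes[index].position.x += nodes[index].velocity.x * deltaTime;

    // 3) Caching just the position Vector3 is also safe for any struct size.
    Vector3 position = nodes[index].position;
    position.y += nodes[index].velocity.y * deltaTime;
    nodes[index].position = position;
}
```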

The Multi-Threaded Cache Friendly Code (Editor 8ms, Client 3ms)

To conclude, let’s keep the FOURTH_TIER_EXAMPLE define, but add a new one called TURBO_EXAMPLE. The code now runs at around 8ms, because the new MultiThreadedParallelTaskCollection feature of Svelto.Tasks is now enabled. The operations, instead of running on one thread, are split and run on 8 threads. As you have already figured out, 8 threads doesn’t mean 8 times faster (unless you actually have 8 cores :)). Splitting the operations over multiple threads not only gives sub-linear gains, but also diminishing returns: the more threads, the smaller the speed-up, until adding threads won’t make any difference at all. This is due to what I was explaining earlier: multi-threading is no magic. Physically your CPU cannot run more threads than its cores, and this is true for all the threads of all the processes running at the same time on your operating system. That’s why CPU multi-threading makes more sense for asynchronous operations that can run over time, or operations that involve waiting for external sources (sockets, disks and so on), so that a thread can pause until the data is received, leaving space for other threads to run in the meantime. Memory access is also a big bottleneck, especially when cache misses are involved (the CPU may even decide to switch to other threads while waiting for memory to be read; this is what Hyper-Threading actually does).

This is how I use the MultiThreadedParallelTaskCollection in the example:
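(A sketch of the setup; the exact Svelto.Tasks signatures may differ, and NUM_THREADS and the enumerator construction are my assumptions.)

```csharp
// Sketch: split the entities into NUM_THREADS ranges and run one enumerator
// per thread. The thread count is passed to the constructor.
const int NUM_THREADS = 8;

var multiParallelTasks = new MultiThreadedParallelTaskCollection(NUM_THREADS);

int entitiesPerThread = count / NUM_THREADS;
for (int index = 0; index < NUM_THREADS; index++)
    multiParallelTasks.Add(new BoidEnumerator(nodes,
                                              index * entitiesPerThread,
                                              entitiesPerThread));

// Being an IEnumerator itself, the collection can be scheduled on any
// runner; its tasks will still execute on their own pre-activated threads.
```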

This collection allows you to add enumerators to be executed later on multiple threads. The number of threads to activate is passed to the constructor. It’s interesting to note that the number of enumerators to run is independent of the number of threads, although in this case they are mapped 1 to 1. The MultiThreadedParallelTaskCollection, being an IEnumerator, can be scheduled on whatever runner, but its set of tasks will always run on their own, pre-activated, threads.

The way I am using threading in this example is not the way it should be used in real life. First of all, I am actually blocking the main thread to wait for the other threads to finish, so that I can measure how long the threads take to complete the operations. In a real-life scenario, the code shouldn’t be written to wait for the threads to finish; handling AI, for example, could run independently of the frame rate. I am thinking about several ways to manage synchronization, so that it will be possible not just to exploit continuation between threads, but even to run tasks on different threads at the same time and synchronize them. WaitForSignalEnumerator is an example of what I have in mind; more will come.

All right, we are at the end of the article now. I need to repeat myself here: this article doesn’t really show what is possible to do with Svelto.ECS and Svelto.Tasks in their entirety; this is just a glimpse of the possibilities opened by this infrastructure. The purpose of this article also wasn’t to show the level of complexity that can be handled easily with the Svelto framework, but just to show you how important it is to know how to optimize our code. The most important optimizations are first the algorithmic ones, then the data-structure-related ones, and finally the ones at instruction level. Multi-threading is not about optimization; it is about actually exploiting the power of the CPU in use. I also tried to highlight that CPU threading is not about massive parallelism, and that GPUs should be used for that purpose instead.

The Importance of the Compiler (Client 1.5ms!!)

I started to wonder: what if I had the chance to use a good C++ compiler, to see if it could do a better job with auto-vectorization? After all, we are aware of the fact that the JIT compiler can’t do miracles in this sense. Since IL2CPP is not available for the Windows standalone platform, I compiled the same code for UWP using the IL2CPP option. I guess the results are pretty clear: IL2CPP produces an executable that is twice as fast as the C# version. Since there isn’t any call to native code, it can’t be because of the absence of marshalling. I haven’t had the time to verify what the real difference is yet, but auto-vectorization may be the reason.

Conclusions

While this article implicitly (and strongly) advises you to follow the Entity-Component-System pattern to centralize and modularize the logic of your game entities, and gives another example of Svelto.Tasks usage to easily exploit multi-threading, its main goal is to show why you need to pay attention to how code is written, even when we talk about C# and Unity. While the effectiveness of these extreme optimizations largely depends on the context, understanding how the compiler generates the final code is necessary to avoid degrading the performance of large projects.

Having a light knowledge of modern CPU architectures (like I do) helps to understand how to better exploit the CPU cache and optimize memory accesses. Multi-threading really needs more separate articles, but I can now state with certainty that exploiting the full power of the CPU with Unity is more than feasible, although the biggest bottleneck will remain the rendering pipeline itself.

To be honest, most Unity optimization issues are often related to the marshalling that the Unity framework performs when it is time to call the native Unity functions. Unfortunately these functions are many and are called very often, which made them the biggest issue we found during our optimizations. GameObject hierarchy complexity, calling Transform functions multiple times and similar problems can really kill the performance of your product. I may talk about these findings in a separate article in the future, but many have been discussed already by other developers. The real goal to pursue is to find the right balance between what you can do in C# and what must be left to Unity. The ECS pattern can help a lot with this, avoiding the use of MonoBehaviour functionalities when they are not strictly necessary.

For example, let’s try to instantiate 256k GameObjects (yes, it’s possible to do) and add to each a MonoBehaviour that simply runs the fastest version of the test. On my PC it runs at 105ms, and even if profiling with Unity doesn’t give the best accuracy, it seems that half of this time is spent on marshalling (very likely the time spent switching between C# and C++ for each update; my current assumption is that MonoBehaviours are stored as native objects and the engine needs to switch to managed code for every single Update called, which is just crazy!).
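(For reference, this is roughly the shape of that test; the class name and fields are mine, not the demo’s.)

```csharp
using UnityEngine;

// One of these per GameObject, 256k instances in total: the measured cost
// is dominated by the native-to-managed switch Unity performs per Update.
public class ClassicBoid : MonoBehaviour
{
    Vector3 _position;
    Vector3 _velocity = Vector3.one;

    void Update()
    {
        // same component-wise, call-free operations as the fastest ECS tier
        for (int iterations = 0; iterations < 4; iterations++)
        {
            _position.x += _velocity.x * Time.deltaTime;
            _position.y += _velocity.y * Time.deltaTime;
            _position.z += _velocity.z * Time.deltaTime;
        }
    }
}
```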

Final Results:

                                                             Editor    Client
    ECS Naive Code                                           113ms     58ms
    ECS No Calls Involved                                    57ms      23ms
    ECS Structs only                                         24ms      16ms
    ECS Multithreading (8 threads)                           8ms       3ms
    ECS Multithreading UWP/IL2CPP (8 threads)                n/a       1.5ms
    Classic GameObject/Update approach (no calls inside)     105ms     45ms
    Classic GameObject/Update approach (no calls inside) UWP n/a       22ms

TL;DR:

  • I crammed too many points in this article 🙂
  • Optimize your algorithms first; you don’t want any quadratic algorithm in your code. Logarithmic is the way to go.
  • Optimize your data structures next; you must know how your data structures work in memory to use them efficiently. There is a difference between an array, a list and a linked list!
  • Optimize your instructions, paying attention to memory accesses and caching, last (if you want to squeeze those milliseconds out!). Coding to exploit the CPU cache is as important as using multi-threading. Of course not all the Engines need to be so optimized, so it depends on the situation. I will link some good references when I have the time, as there is much more to discuss about this point.
  • When using Unity, be very careful to minimize the usage of wrappers to native functions. These are many and can often be easily identified in Visual Studio by navigating to the declaration of the function.
  • Mind that Unity callbacks (Update, LateUpdate etc.) are often marshalled from native code, making them slow to use. The Unity hierarchical profiler tells you half the truth, showing the time taken inside the Update, but not between Updates (usually shown on the “others” graph). The Timeline view will show “gaps” between Update calls, which is the time used for marshalling them.
  • Multi-threading is tricky, but Svelto.Tasks can massively help.
  • The ECS pattern allows you to write code that exploits temporal and spatial locality (as you can manage all your entities in one central place). Try Svelto.ECS to see what you can do 😉
  • The ECS pattern will also help you minimize the use of Unity functionalities, which most of the time involve costly marshalling operations to switch to native code. Most of them are not really justified and are due to old legacy code that has never been updated in 10 years!
  • I understand that it is hard to relate to an example that iterates over 256k entities. However, moving logic from MonoBehaviours to ECS engines saved dozens of milliseconds in the execution of the real-life code of our projects.
  • Why don’t they finish IL2CPP for standalone platforms? At least for Windows!
  • Don’t use the editor to profile 🙂

Please leave your thoughts here; I will probably expand this article with my new findings, especially the ones related to cache optimizations.

For more multi-threading shenanigans, check my new article: Learning Svelto Tasks by example: The Million Points example

External resources:

7 thoughts on “How Svelto.ECS + Svelto.Task help writing Data Oriented, Cache Friendly, Multi-Threaded code”

  1. Very good and interesting article!
    Just a note on the preamble (sorry :P)

    – “Each core can run only one thread at a time”: there are some exceptions; with SMT, multiple threads are executed at the same time on the same core.

    – “That’s why massive parallelism doesn’t make much sense on the CPU”: actually this is not strictly related to context switching. Indeed, CPUs have OpenCL back-end implementations, which of course do not translate thousands of threads into thousands of OS threads, but still can’t reach the same performance for data-parallel algorithms. To be noted that even on GPU there is no real “context switching”: groups of threads of fixed size “allocate” and retain all their resources (registers, shared memory portion, slots) until they are completed, then another group of threads is issued; so you choose how many threads you want to schedule based on the size of the problem instead of the number of cores on the GPU.

    Although on GPU such “context switching” between threads comes at zero cost, the real difference between CPU and GPU in resulting performance is due to the different aims of those hardware architectures. The CPU’s main aim is to reduce the single-thread processing time, and to do this it requires a lot of fancy optimizations which make each single core very complex and require fast and large caches (so few cores can be placed on the same die); the GPU’s main aim is to do more work in the same time unit, while the single-thread processing time is a secondary problem. So, many cores (very unoptimized and simple, except at a coarse level) can be placed on the die, and each one (of many) will solve (slowly) a part of the problem, which must be independent from the rest (there are some exceptions though): this usually means that the exact same code processes different data (e.g. using different array indices), so such cores can be very simple and replicated many times.

    1. thanks for the reply!

      1) I am not aware of any SMT architecture running games. Edit: OK, I checked on the wiki; by SMT you mean Intel Hyper-Threading. Hyper-Threading doesn’t allow two threads to run at the same time; it allows super fast context switching, so that a second thread can run while the first one is waiting for something slow, like a RAM memory access.

      2) I didn’t say (and if I did, I didn’t mean it) that massive parallelism doesn’t make sense on a consumer CPU because of the context switching, but because you can’t really run more threads than cores at full speed. The cores of a consumer CPU are at most 6 (12 with HT), while the GPU’s can be more than a thousand!

      3) A simple set of vectorized instructions to run on totally independent entities makes waaaaay more sense on the GPU. Vector field algorithms make way more sense on the GPU than on the CPU. However, in the specific boid flock example there is a drawback, which is the use of a read/write data structure to store the global positions of the entities, as every entity needs to know the position of the entities close to it. Unfortunately I don’t have enough experience with compute shaders to know what the real challenges of developing such an algorithm on the GPU are. That said, compute shaders on GPU and technologies for high vectorization of CPU instructions, like the Intel® SPMD Program Compiler, will be the next fields I experiment with.

        1) AMD Ryzen, for example, is 2-way SMT. It is not super fast context switching; instead, both threads are active at the same time. The context is not switched between the two, because local resources are allocated for both threads.

        2) Even on GPUs with 1000 cores you usually launch kernels with even more than a million threads. They are just issued and completed in groups of threads; when one group completes its processing, it releases its local resources, which can be used by the next group in the queue. Yeah, vectorized calculations are at the basis of GPU-optimized cores, even if they are explicitly used on CPU with vector instructions and implicitly on GPU (you code what is done by the single thread, but the hardware actually issues and executes sets of threads, e.g. 32, as a single flow of vectorized instructions, masking the threads that followed divergent branches).

        3) Oh, I wrote an article some years ago (http://www.ce.uniroma2.it/publications/parco2013_uniformgrids.pdf) on a similar subject (particle physics collisions), where I was creating on the GPU a set of linked lists for particles close together.

        1. That’s all super interesting. I’ll do more research and read your post. Are you sure you don’t want to work in the uk? 😉

          1. Eheh. Thanks, it would be super-cool <3 but you know I'm crazy and I've just bought a house in Rome xD

  2. Hey mate,

    I’ve finally found the time to work through your whole article and checked out all the different defines, so I could watch the FPS increase on my machine. Very cool! Lots of fun!

    As already written (somewhere else), I’ve set up a cube collisions demo (where cubes move randomly on a plane and change color once they bump into another cube – inspired by Adam Martin’s blog).

    This gives me a “lean” basis which allows me to integrate your ECS framework and learn while doing so. My git repo is here: https://github.com/CreativeWarlock/CubeCollisionsECS

    After reproducing all the basic mechanics from the Svelto-ECS-Parallelism-Example, I’ve got the cubes moving, but only under the premise that I use EntityViews for the CubeEntityViews.

    In order to get the speedy parallelism I assumed I had to go further, so I let the EntityViews implement IEntityStruct. After some adjustments, I now only see the cubes being placed and colorized during their “spawning”, and then the demo pauses due to an NPE. Some quick debugging revealed that:
    1.) during the initialization, the whole array of structs has its Transforms and BoxColliders set nicely, thanks to the CubeMovementImplementor that is assigned during the creation of the cubes;
    2.) with the very first Update() in CubesEngine, which triggers MoveNext() from the CubeEnumerator, the instances of BoxColliders within the array of cubeEntityViews are all null.
    What happened?

    It seems I have missed something somewhere. If you could spare some time to help me, I’d surely appreciate it! I’d also be able to continue testing your framework, as I’m already convinced to integrate your ECS into my game project, but I’d love to see how things go first before starting with all the refactoring.

    Quick other question about the BuildEntity method:
    _entityFactory.BuildEntity(cube.GetInstanceID(), cube.GetComponentsInChildren());

    Do I need to add the IComponent implementor here or is this only required when I use GenericEntities in the Descriptor?

    Best regards,
    Marcel

    1. EntityStructs are quite an advanced topic and I will need to dedicate an article to the good practices around them. From what you wrote, it seems you aren’t using them as they should be used. Please follow the cache example I wrote that uses them, although it’s not a good example. Their use is also limited to very specific applications where efficiency is crucial; trust me, I still have to find a practical example where they are actually needed :). You would need to iterate over hundreds of entities to actually appreciate them, and the code must be written with cache efficiency in mind, which can be tricky with or without ECS. All that said, EntityStructs differ from EntityViews by the fact that they are actually structs and not classes. They do implement IEntityStruct, but they are still structures. As such, they don’t need an implementor. If you create an entity descriptor with just EntityStructs, you don’t even need to pass implementors to the BuildEntity function; they must be initialized inside engines. I am thinking of adding support for the Add method, to have a simple place where to initialize them.
