Learning Svelto Tasks by example: The million points on multiple CPU cores example

With my previous article on Svelto.Tasks and multi-threaded cache friendly code, I failed to show visually the power of Svelto.Tasks because I didn’t know how to upload a huge amount of data to the GPU without stalling the main thread. Honestly, I stopped being a graphic programmer right before DX 11 was introduced, so my knowledge of modern pipelines is limited. I also thought that to find a good solution I would have needed to write some kind of low level workaround to overcome the Unity API limitations, but with my great surprise, actually Unity has already got what I needed. It was just necessary some investigation and help to find out :).

With my great astonishment, using compute buffers, is possible to upload every frame a huge amount of data without stalling the CPU. I still have to understand how this works an which DX 11 function ComputeBuffer.SetData maps onto, so if you know, please leave a comment, as I need to understand it, although it is not my priority for this current demo.  As matter of fact, it was enough for me to know that uploading the vertices transformed on the CPU would not affect the final performance in a significative way.

Therefore I immediately threw some lines of code to show off how simple and efficient is to work with multi-threads with Svelto.Tasks and after combining it with the new IL2CPP feature of Unity 2018, I achieved the incredible number of 1 million particles transformed on the CPU at >30fps!! If someone would have told me that this was possible, to be honest, I would have had my doubts. However let me clear, these kind of demo are pretty lame because it doesn’t make much sense to do these kind of operations on the CPU. This is exactly what the GPUs are for. It’s more a show off than something practical, but the code, obviously, can be used for better cases.

I was also lucky enough to find a good and, most of all, simple demo on github, called MillionPoints, which does what I needed, that is transforming 1 Million points on the GPU using Compute Shaders and the Graphics.DrawMeshInstancedIndirect function. All I had to do is to convert the simple compute shaders code in pure c# and make it run on the CPU.

While I still don’t consider my self a multi-threaded code expert, because one never stops to learn and I didn’t have the chance to use multi-threading in sophisticated algorithms yet, I may dare to say that I start to have a good understanding of all the problems involved and consequentially I designed Svelto.Tasks to be very simple to use in a multi-thread environment, exactly like the new c# job system claims to do. Since in Freejam we hire a lot of junior programmers, I had the chance to see first hand all the problems that could arise to give advanced tools to inexpert hands. That’s why I had to design something very straightforward to use and most of the time, worries free. This is what I hope to have achieved with Svelto.Tasks, improving it constantly over the years.

There are two keys elements that make Svelto.Tasks so powerful: the runners (or schedulers) and the continuation. The runners are designed to run every kind of IEnumerator (or often called task) on every kind of defined Runner. Svelto.Tasks already ships with a lot of Unity related Runners, but two are platform agnostic: the MultithreadRunner and the SyncRunner.

The concept of continuation (similar to await/async) is even more powerful. It allows to start a task running on a specific scheduler from another task running on another scheduler and continue from there once the new task is finished. This is much simpler to code than to explain. Although there are similarities, Tasks.Net and Svelto.Tasks are obviously designed differently as the latter is designed around all the problematic intrinsic in a game development production, including performance.

Enough talk now so let’s dive in the details. Open the scene main.unity and click on the MillionPoints GameObject. Be sure that the MillionPointsCPU Monobehaviour is enabled and the GPU one is disabled. Run it, don’t expect great performance in the editor though, there is a huge difference between editor and clients in this case. I will show all the differences in bit. BTW I am currently using Unity 2018.1b3.

The goal is to perform the following cache friendly CPU instructions inside the ParticleCPUKernel MoveNext() method inner loop on 1 Million points every frame using tasks running on multiple cores:

I skip all the compute buffer initialization stuff, because is not relevant for this article and we start directly from the function StartSveltoCPUWork(). First of all, I decided to split the job in 16 threads. As you know, when it’s time for CPU intensive work, increasing the number of threads more than the number of available cores gives a logarithmic-like gain,  so even 16 is already probably too much. Since the operations to apply on the vertices are quite straightforward, we can just have each thread operating on a specific segment of the buffer and this is what the function PrepareParallelTasks method does.

The MultiThreadedParallelTaskCollection has a similar interface to the simpler ParallelTaskCollection, but it is able to run N tasks on M threads using M ParallelTaskCollections. You may have figured out how this works already. it basically creates M Threads and runs one ParallelTaskCollection on each. The N tasks are split among the M ParallelTaskCollections. so the execution of M ParallelTaskCollections, and not N tasks, are truly in parallel. When N coincide to M, then all the tasks run in parallel like in the case of this example. In this case we initialize 16 ParallelTaskCollections running 1 task each. This task applies the ParticlesCPUKernel instructions on 1.000.000 / 16 particles.

Remember, the MultiThreadedParallelTaskCollection is just an IEnumerator than must run like the others tasks in Svelto.Tasks, so it doesn’t just start on its own. In order to run it, I put in place 3 different strategies. I know, I should have avoided it to reduce the complexity, but at the same time I wish to give you a better overview of what you can achieve with the continuation and separate runners combination. All the following code assume that you know very well how to work with the yield instruction.

Let’s start from enabling the #Test2 define as it is the simplest scenario. In this case, everything is driven by a single loop running on the main thread.

As I said, this is the simplest flow to understand. The intention is to wait for the other thread to finish, that’s why I run the multithreaded parallel collection execution on the SyncRunner that is designed to stall the current thread. Once the execution is completed, the rest on the code will run still on the main thread.

This is probably not the best solution, because the other threads could start compute the particles operations for the next frame between the end of this function and the next frame. Let’s see this visually:

in the way we are running the loop, the multithreaded collection starts during the Update phase, as the main thread loop is running on the Update Runner. While it’s running it stalls the main thread, so nothing else can be executed, then it finished and the process continue. The red vertical line can give an idea about where the threads start and complete to run. However why should the update phase wait for the threads to finish their operation when those could have ran outside the Update phase?

We could invert the way we trigger the multi-threads operation so that we compute the next values in between updates instead than inside the update phase.

There is more than one way to achieve this. Let’s see an example enabling the define #Test1

In this case we have two loops. One running on the main thread, as we have started the MainThreadOperations task using the standard UpdateRunner, and the other on an other thread, starting OperationsRunningOnOtherThreads on the standard MultiThreadedRunner before the main loop starts. Note that just the loop runs on that thread, as the _multiParallelTasks will use its own threads to run.

At this point _waitForSignal and _otherwaitForSignal are used to signal when the operations on each thread are completed. I hope this is intuitive. The main thread first waits for the other threads to finish, when this happens the draw mesh is issued and the main thread will signal the other thread to start the operations for the next frame. Since the other thread can finish running before the next update, it will yield its execution to other threads until the are signalled to compute the next values.

The _breakit part is a bit awkward at the moment. It’s necessary to be sure that the threads will stop when the code is executed in the editor, as stopping the execution in the editor won’t kill the running threads like it normally happens with a standalone application.

Last scenario is an alternative possibility:

so what’s the deal here? Again two loops, although this time the one running on the main thread just keeps on issuing the draw mesh with the last updated particles. In the multithreaded loop instead continuation is exploited to set the freshly computed particles to the compute buffer. Since Unity will throw an exception if this is done on other threads, we have to run the task on the main thread and wait it to finish.

The frame rate for this case is much higher, but don’t be fooled, the reason being that the mainthread this time is not stalled, however the particles will be updated at the same frame rate of the other cases, more or less.

Before to test the performance, let me spend two words about the cache friendly code created to compute the particle values. As you may notice, I use a lot of ref and out. This is because it’s very important to avoid copying structure when not necessary as it can hit hard the stack. This is also why c# 7.2 has recently introduced the byref returning value and byref variables, so that I can write simpler code to avoid copies on the stack of structs. You should always pass your Vector[N] by ref or out.

Let’s do some benchmarking now, using the #Test1 case:

Mono/.Net 4.6 ~20fps
IL2CPP ~48fps
UWP ~23fps
UWP .Net Native ~59fps

Last test was just an experiment. I knew that UWP .net core code can be compiled in native code through the .net native toolchain, therefore I had obviously to compare it against IL2CPP. The result make me think that a future integration in Unity for standalone platforms, if possible, would be beneficial.

And finally the project can be downloaded and executed from github as usual: https://github.com/sebas77/Svelto.Tasks.MillionPoints

P.S.: If you notice that Svelto.Tasks code can perform better, please tell me, I am sure there can be some areas to improve, as I continuously do it.

4 thoughts on “Learning Svelto Tasks by example: The million points on multiple CPU cores example

  1. Thanks for this great article! The results are very cool! This shows that it is not always necessary to panically rush towards GPU computation as soon as a game (or simulation) has to compute large number of entities. Sure, GPUs are having warp drives with speed 9.9, but that nice multithreading framework of yours allows CPU to make some big jump from sol drive to warp 1.0 – at least! :))

    I might have overseen something, but could it be you might have forgotten to add ‘Svelto.Utilities’ as a submodule? Both Unity and VS can’t find it. Your .gitmodules only lists Svelto.Tasks.

    1. Svelto.Common is a submodule of svelto.tasks everything should work if you update recursively. I will update the read me with some instructions.

      Edit: In reality the GPU must be exploited, especially in these cases. I left the original code using compute shaders working. It runs the same operations at 90fps leaving the CPU free to do anything else!

Leave a Reply