For the game I am currently developing, Gamecraft, we are using Svelto.ECS for the game logic, Unity ECS for the physics and the GPUInstancer compute-shader pipeline for the rendering. I have never had the time to benchmark the UECS rendering pipeline against the GPUInstancer one, mainly because I decided from the start to use GPUInstancer, as its performance even on low-end GPUs (read: Intel HD 4000) is astonishing.

Recently, on Twitter, I was asked whether it was feasible to move 1 million 3D objects. I already knew that GPUInstancer wouldn't break a sweat rendering them, but I also knew from my previous experiments that uploading 1 million matrices from the CPU to the GPU every frame would be impractical.

With GPUInstancer you are not supposed to update ALL the objects, only the ones that actually move, which are usually far fewer. However, knowing what the bottleneck is, I wondered whether it was possible to upload the data to the GPU asynchronously, or in some other way.

It turns out that Unity 2020 introduced two new ComputeBuffer methods, called BeginWrite and EndWrite. This sounded like exactly what I needed to make this experiment work!

Let’s see what happens before and after using these methods. This is the entire code used to create the demo as seen in the tweet:

This code won't run unless you buy the GPUInstancer plugin. As I said, GPUInstancer implements most of the rendering pipeline using compute shaders; however, it still relies on the Unity pipeline (in this case URP) for a few things, like computing the lighting. Materials and shaders are also the ones you would normally use with Unity, with some slight variations to support compute-buffer structured buffers. I won't go into details about it; if you are interested, just check the extensive GPUInstancer wiki page. Just know that it's very simple to use, even for a person like me who knows almost nothing about compute shaders.

The #define OLD_WAY path is how you would normally use the plugin. However, because of how slow ComputeBuffer.SetData is, uploading 1 million matrices makes this solution impractical. I have already suggested to the GPUInstancer authors that they put more thought into data optimizations: a 4×4 matrix may be much more data than I need to upload to the GPU (for example, if I just need to translate the object).
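
In rough terms, and with made-up class and field names rather than the actual demo code, the two upload paths look like this (only the SetData and BeginWrite/EndWrite calls are the real Unity API):

```csharp
using Unity.Collections;
using Unity.Mathematics;
using UnityEngine;

// Hedged sketch, not the demo code: the classic SetData upload versus
// the Unity 2020 BeginWrite/EndWrite path on a SubUpdates buffer.
public class MatrixUploadSketch : MonoBehaviour
{
    NativeArray<float4x4> _matrices;   // filled every frame by the Burst job (see below)
    ComputeBuffer _matrixBuffer;       // the buffer consumed by the instancing shader

    void Upload()
    {
#if OLD_WAY
        // Classic path: copy all 1M matrices from CPU to GPU every frame.
        // This single call is what makes the old approach impractical.
        _matrixBuffer.SetData(_matrices);
#else
        // New path: map a range of the buffer (it must have been created with
        // ComputeBufferMode.SubUpdates), write into it, then publish the count.
        NativeArray<float4x4> mapped = _matrixBuffer.BeginWrite<float4x4>(0, _matrices.Length);
        mapped.CopyFrom(_matrices);
        _matrixBuffer.EndWrite<float4x4>(_matrices.Length);
#endif
    }
}
```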

The job uses Burst to transform 1 million cubes. Burst does a good job (no pun intended), considering that this simple math could probably be optimized even further:
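
A Burst job along these lines does the trick (the struct name and the motion math here are illustrative, not the exact code from the demo):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Illustrative Burst job: writes one transform per cube into the array
// (or directly into the NativeArray returned by BeginWrite).
[BurstCompile]
struct MoveCubesJob : IJobParallelFor
{
    public float Time;
    [WriteOnly] public NativeArray<float4x4> Matrices;

    public void Execute(int index)
    {
        // Example motion: cubes on a 1000x1000 grid, bobbing on a sine wave.
        float3 position = new float3(index % 1000, math.sin(Time + index * 0.01f), index / 1000);
        Matrices[index] = float4x4.TRS(position, quaternion.identity, new float3(1f));
    }
}
```

It would be scheduled with something like `new MoveCubesJob { Time = UnityEngine.Time.time, Matrices = matrices }.Schedule(1_000_000, 64)` and completed before the upload.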

However, the ComputeBuffer.SetData method would take too long to upload the entire array of matrices.

But what happens if I switch to a SubUpdates buffer? Well, I have no clue what happens under the hood and, for now, no interest in learning it; this was just an exercise after all. I believe that, somehow, the NVIDIA driver lets me write directly into the memory the GPU reserves for the compute buffer. What I do know is that I get this in the profiler:

which means plenty of CPU available to even make a game!

In order to achieve this, I had to hack the GPUInstancer code a bit (the source ships with the plugin). I just changed the buffer initialization to:
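
The change boils down to constructing GPUInstancer's visibility buffer with ComputeBufferMode.SubUpdates, roughly like this (the count and stride arguments shown are illustrative):

```csharp
// Hedged reconstruction of the hack: the only real change is the extra
// ComputeBufferMode.SubUpdates argument, which is what allows BeginWrite/EndWrite.
runtimeData.transformationMatrixVisibilityBuffer = new ComputeBuffer(
    instanceCount,              // number of instances (illustrative variable)
    sizeof(float) * 4 * 4,      // one 4x4 matrix per instance
    ComputeBufferType.Default,
    ComputeBufferMode.SubUpdates);
```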

Careful though: I spent almost an hour trying to understand why Begin/EndWrite wasn't working at first. It turned out that, in my setup, it only works with the Vulkan renderer, although the Unity documentation doesn't mention this anywhere!

Edit: it seems it is not necessary to call Begin/EndWrite every frame, but only when needed. Honestly, I am not sure about it, so if someone could shed some light on it, it would be appreciated.

Comments (19)
eizenhorn:

It works not only with Vulkan. We're using it on DX11. There are some limitations for Begin/EndWrite: currently you might hit a slightly slower path on the render thread if the buffer has been created just before calling BeginWrite. So try to reuse the buffer as much as possible, and on DX11 you should not begin/end write several times in a frame.

Sebastiano Mandalà:

Thank you for the feedback. When I used it with DX11 the cubes were simply not moving; it may be due to the large quantity. I didn't understand what you are saying about reusing the buffer. Can I keep the buffer "open" without calling EndWrite for as long as I want?

Edit: I just tested it again; with DX11 the cubes do not even appear on the screen.

eizenhorn:

Then it's a GPUInstancer implementation problem, not a Unity one, because we're using raw DrawMeshInstancedIndirect and it works perfectly for our custom animation and rendering system. As proof, animation and fog of war here, DrawMeshInstancedIndirect and ComputeBuffers with Begin/EndWrite: https://youtu.be/dUq2oVhInSU

"I didn't understand what you are saying about reusing the buffer"

For example, create buffers at OnCreate/Awake and reuse them; if a buffer needs to be resized, resize it only when necessary. Don't call BeginWrite right after buffer creation, otherwise it will work slower. Nothing forces you to call EndWrite in the same frame. In our case we schedule the job every 3rd… Read more »

Sebastiano Mandalà:

OK, this is the kind of feedback I was looking for. I was in doubt myself about when EndWrite should be called.

However, what does EndWrite precisely do, and why should it make a difference when it's called?

"Don't call BeginWrite right after buffer creation, otherwise it will work slower"

Then when should it be called?

"Then it's a GPUInstancer implementation problem, not a Unity one"

This doesn't explain why it works with Vulkan, though.

eizenhorn:

"Then when should it be called?"

As I said (and this is what Unity engineers told me when we discussed this with them): cache and reuse buffers. Create buffers at OnCreate/Awake, and then use them in Update. By "slower path" I mean the render thread processing will be slower. The Main Thread or Jobs writing cost stays the same (because it's just an operation on memory), so it affects only the performance of the render thread itself.

"This doesn't explain why it works with Vulkan."

I don't know; again, it's most likely a GPUInstancer implementation problem. I don't know what they do under the hood, maybe using platform… Read more »

Sebastiano Mandalà:

GPUInstancer doesn't have any special code for Vulkan and it doesn't support SubUpdates; I hacked the code to support SubUpdates, but you could still be right about it.
So your point is this: as long as the buffers do not change, cache them. Got it. I still haven't got when to call EndWrite, though: you say you call it before each job, but from your explanation it seems that you should never call it, because if you call it you will dispose the buffer?

eizenhorn:

Lol, it throws a 403 when the previous comment contains the "On Update \Update" string (without spaces); after removing "On Update \" (without spaces) everything works. It seems "\U" inside a quote is not supported )))

Sebastiano Mandalà:

WordPress is super annoying.

Sebastiano Mandalà:

New comments plugin enabled; now this thread is a mess, but the plugin seems good! 🙂

eizenhorn:

> I still haven't got when to call EndWrite, though: you say you call it before each job, but from your explanation it seems that you should never call it, because if you call it you will dispose the buffer?

Yes, I call it before job scheduling AND before the new BeginWrite 🙂 The loop looks like this:
… -> StartFrame -> BeginWrite, Schedule Job -> EndFrame -> StartFrame -> EndWrite, new BeginWrite, Schedule Job -> EndFrame -> …
Nope, this call only "disposes" the native array reference which you use for writing. It does not dispose the buffer itself.

Sebastiano Mandalà:

Well, then you are using it more or less like I used it in my demo; the only difference is that you call EndWrite before the job and not after. I don't understand where the caching is, then.

eizenhorn:

You already cache the ComputeBuffer if you do this only once in Start:
runtimeData.transformationMatrixVisibilityBuffer = new ComputeBuffer(…);

Sebastiano Mandalà:

Oh, now I understand. I thought you were talking about caching the NativeArray returned by BeginWrite. We are on the same page now. The only thing I am still not sure about is what the difference is between calling EndWrite before and after the job.

eizenhorn:

Oh, you are doing it a bit wrong, I just looked at your code 🙂 You call EndWrite right after scheduling, but the job is not done yet, it has only been scheduled. You should call EndWrite after the Complete call (or, if you don't want to force that, check IsCompleted to wait until the job finishes by itself, call EndWrite after that and schedule the next jobs).
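
A minimal sketch of the ordering eizenhorn describes, using made-up names and the hypothetical MoveCubesJob from the article's sketch (EndWrite is deferred to the next frame, after the job has completed):

```csharp
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

// Hedged illustration of the pattern from the comments: BeginWrite + schedule
// in one frame, Complete + EndWrite (then a new BeginWrite) the next frame.
public class DeferredEndWriteSketch : MonoBehaviour
{
    const int InstanceCount = 1_000_000;
    ComputeBuffer _matrixBuffer;   // created once, reused every frame
    JobHandle _handle;
    bool _hasPendingWrite;

    void Awake()
    {
        _matrixBuffer = new ComputeBuffer(InstanceCount, sizeof(float) * 4 * 4,
            ComputeBufferType.Default, ComputeBufferMode.SubUpdates);
    }

    void Update()
    {
        if (_hasPendingWrite)
        {
            _handle.Complete();                               // the job is done before...
            _matrixBuffer.EndWrite<float4x4>(InstanceCount);  // ...the write is published
        }

        // Map the buffer again and let the Burst job write straight into it.
        NativeArray<float4x4> mapped = _matrixBuffer.BeginWrite<float4x4>(0, InstanceCount);
        _handle = new MoveCubesJob { Time = Time.time, Matrices = mapped }
            .Schedule(InstanceCount, 64);
        _hasPendingWrite = true;
    }

    void OnDestroy()
    {
        _handle.Complete();
        if (_hasPendingWrite) _matrixBuffer.EndWrite<float4x4>(InstanceCount);
        _matrixBuffer.Release();
    }
}
```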

eizenhorn:

I think you should now remove the end part of the article, to avoid misleading someone 🙂

eizenhorn:

So you close the write stream without anything written into it. For Vulkan it can work because at the hardware level it is, maybe, not closing the write stream and you can still access the GPU memory.

eizenhorn:

Moreover, this code should throw an exception like this:

InvalidOperationException: The NativeArray has been deallocated, it is not allowed to access it

eizenhorn:

Because the NativeArray is disposed right after EndWrite and the job has not run yet (Schedule does not mean it executes at that point, it starts execution later in the frame).