So I profiled the minimal project you posted and wasn't able to reproduce the spikes you're talking about (I tried with a 980 and a 1060 on Win10 with 2018.1.0b2).
However, I suggest you don't rely on profiling directly in the editor (there can be spurious spikes due to the editor environment) but attach the profiler to a development build instead. See if you can reproduce the spikes by profiling a dev build.
Note that in the project you posted, you're reading back 4 MB of data, and this is not free. Depending on your bandwidth it should take between 0.5 ms and 1 ms to transfer, and probably as much again for the CPU copy/conversion to the request buffer.
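(For scale: a 1000x1000 float texture is 1000 × 1000 × 4 bytes = 4 MB, and 4 MB at a rough effective transfer rate of 4–8 GB/s works out to the 0.5–1 ms above.)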
Thanks for taking a look.
You're right, I'll keep that in mind. I've only ever profiled C# code, which runs in the ~10 ms range, so any spikes caused by the editor were buried in that; the GPU is another beast.
What do you suggest?
In the image below, a 1000x1000 float array is being downloaded to the CPU. I can see 1 ms of RenderTexture work, which I'm not doing asynchronously, but I don't see any comparable slowdown from downloading the array itself. What would that look like in the profiler?
From the build I can see the time spent rendering, but where is the array transfer?
By the way, what's the preferred way to get rid of that RenderTexture.SetActive?
self answer: Request(RenderTexture)
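For reference, a minimal sketch of what that self answer looks like in code (the `source` field and the RFloat assumption are mine, not from the original project):

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class RenderTextureReadback : MonoBehaviour
{
    public RenderTexture source;   // the RenderTexture the compute shader writes to (hypothetical field)

    void Update()
    {
        // Requesting the RenderTexture directly means no Texture2D.ReadPixels
        // and no switching of RenderTexture.active, which is what shows up as
        // RenderTexture.SetActive in the profiler.
        AsyncGPUReadback.Request(source, 0, OnCompleteReadback);
    }

    void OnCompleteReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError)
        {
            Debug.LogWarning("GPU readback error");
            return;
        }

        // Layout depends on the texture format; for an RFloat texture this is one float per pixel.
        var data = request.GetData<float>();
        // ... consume (or copy) data inside this callback; don't hold onto the NativeArray.
    }
}
```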
How do you combine requests? Say I want the data and the RenderTexture but don't want to make two requests.
Your profiler screenshots from the dev build look OK to me now.
Yes, you should see async readback markers on the main and render threads (CPU profiler), but it probably lacks some GPU markers for them to appear in the GPU profiler.
As for not seeing the transfer time, it may be overlapped with other computations (an async DMA transfer while the GPU is still working). However, adding markers would disable that parallelism when profiling.
How would you download 4 MB from the GPU in Unity?
I'm wondering if this is possible with the way you do things: slow down the transfer to the CPU so the request takes 2x as long but has much less impact per frame, kind of like spreading a calculation across more frames with a coroutine.
Or alternatively, reduce the precision. I did just that, using half, half2 and half4 in the compute shader, but … maybe the size of the transferred data is determined on the C# side. Is there a low-precision Vector4?
Sure, the cost is linear with the size of your data. Make sure to read back only what's needed, in the form you need it.
Normally the copy/conversion cost is on the render thread, but it seems you're using single-threaded rendering mode.
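On the "form you need it" point, and to answer the earlier half-precision question: a rough sketch (field names are mine, and it assumes your Unity version has the destination-format overload of Request) that asks for the readback in RGBAHalf, so only 8 bytes per pixel cross the bus instead of 16:

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class HalfPrecisionReadback : MonoBehaviour
{
    public RenderTexture source;   // hypothetical: the texture written by the compute shader

    void Update()
    {
        // Asking for a half-precision destination format halves the data
        // copied back compared to a full float4 per pixel.
        AsyncGPUReadback.Request(source, 0, TextureFormat.RGBAHalf, OnCompleteReadback);
    }

    void OnCompleteReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError)
            return;

        // There is no built-in half-precision Vector4 in UnityEngine itself;
        // read the raw 16-bit values and convert only where needed
        // (Mathf.HalfToFloat), or use half4 from the Unity.Mathematics package.
        var raw = request.GetData<ushort>();
        // ... consume raw here
    }
}
```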
If the cost is really too high, you can always slice your readbacks (you can pass window/box parameters to your request), then assemble your texture/buffer on the CPU from n frames of readbacks. For instance, instead of reading 4 MB at once, you read 500 KB per frame for 8 frames, i.e. more latency in exchange for spreading the cost even further.
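A rough sketch of that slicing idea, assuming a 1000x1000 RFloat texture and an 8-way split (the field names, the per-frame strip size and the tightly-packed-rows assumption are all mine):

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class SlicedReadback : MonoBehaviour
{
    public RenderTexture source;   // hypothetical: 1000x1000 RFloat texture
    const int Slices = 8;          // read back one eighth of the texture per frame
    float[] cpuCopy;               // assembled result on the CPU
    int currentSlice;

    void Start()
    {
        cpuCopy = new float[source.width * source.height];
    }

    void Update()
    {
        int rowsPerSlice = source.height / Slices;
        int sliceIndex = currentSlice;
        int y = sliceIndex * rowsPerSlice;

        // Windowed request: only 'rowsPerSlice' rows are transferred this frame.
        AsyncGPUReadback.Request(source, 0,
            0, source.width, y, rowsPerSlice, 0, 1,
            request =>
            {
                if (request.hasError)
                    return;

                // Copy this strip into its place in the full CPU-side buffer
                // (assumes the rows come back tightly packed).
                var strip = request.GetData<float>();
                int dstOffset = sliceIndex * rowsPerSlice * source.width;
                for (int i = 0; i < strip.Length; i++)
                    cpuCopy[dstOffset + i] = strip[i];
            });

        currentSlice = (currentSlice + 1) % Slices;
    }
}
```

If the data lives in a ComputeBuffer instead, the size/offset overload of Request should let you do the same per-frame slicing.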
I'm also seeing performance spikes, but I don't think it's AsyncGPUReadback.Request() anymore.
I see them both in the editor and in standalone builds. In the standalone build, I get huge perf spikes every other frame. In the editor, they appear to be more random, but the spikes are still there. The offending routine always appears under Camera.Render() / Gfx.WaitForPresent, which ranges from not appearing at all to 20 ms, and I've seen as high as 83 ms in there.
I'm on Direct3D 11.0 [level 11.1] (according to output_log.txt), and the renderer is an NVIDIA GeForce GTX TITAN X.
I'm also noticing that the profiler now says the GPU is always using 0.00 ms of time.
I tried removing my AsyncGPUReadback.Request calls, and it had no effect. I think it's primarily from loading the GPU with a bit more than it can chew, after which it goes into some kind of degenerate state, presumably because the next frame I'm trying to render the same thing and it hasn't finished the previous frame yet?
Is there some way to make renders synchronous so that I can see how much time they’re really taking?