[Feature Request] Asynchronous execution (or doing graphics updates automatically)

I would really like to be able to run some commands on a separate thread.

e.g. the commands TextureConverter.ToTensor() and engine.Execute() only seem to work on the main thread,

which seems to mean the CPU on the main thread is blocked until the results come back from the GPU.

I have tried the option of running the models a few layers per frame, but I don’t think this solves the issue, since the CPU is still idle until the results come back from the GPU. Also, it’s quite tricky to calculate the optimal number of layers per frame for every device to avoid a choppy framerate.

Or maybe there could be something like:

engine.ExecuteAsync(callback);

Another example: In the ExecuteOnTensor sample there is the line:

 var outputTensor = s_Ops.ArgMax(m_InputTensor, axis: 0, keepdim: true);

Maybe these could be made asynchronous, like:

 var outputTensor = await s_OpsAsync.ArgMax(m_InputTensor, axis: 0, keepdim: true);

When using ONNX Runtime in Unity I am able to do inference on a separate thread without affecting the framerate, which makes it really simple to use. (e.g. I could run Stable Diffusion for 20 seconds in the background while still displaying real-time graphics.)

If there were a simpler way to run models in the background without affecting framerate, that would be excellent. (Maybe this is impossible without rewriting the whole of Unity, IDK.)

Edit:

Just seen this page: Read output from a model asynchronously | Sentis | 1.0.0-exp.6
While it does help a bit (I think mainly because it’s calling the model every other frame), calling Execute() still blocks the CPU for a while.

3 Likes

So the issue is that scheduling GPU work has to happen on the main thread,
the same as jobs in the Unity job system.
So there is no way around this, unfortunately.

Do note that when you execute a model or call an op, we only schedule the work (on the GPU/CPU);
unless you do a readback or download the tensor, there is no waiting.

Your argmax example is only synchronous if you read from the tensor.

For bigger models, Execute itself takes a few ms just for job/compute work scheduling, but apart from that it’s not blocking.

For your case, make sure you are not reading from the tensor, and use the sample to read asynchronously.
Otherwise, do send us the model and we can investigate.
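To illustrate the advice above, here is a minimal sketch of scheduling execution and reading the output back without blocking, based on the “Read output from a model asynchronously” sample for Sentis 1.0-exp. The exact names (AsyncReadbackRequest, MakeReadable, PeekOutput) may differ between Sentis versions, and worker/input setup is omitted:

```csharp
// Sketch only; assumes the Sentis 1.0-exp async readback API from the sample.
using Unity.Sentis;
using UnityEngine;

public class AsyncReadbackExample : MonoBehaviour
{
    IWorker m_Worker;       // created elsewhere, e.g. WorkerFactory.CreateWorker(...)
    TensorFloat m_Input;    // filled elsewhere, e.g. TextureConverter.ToTensor(...)
    bool m_Pending;

    void Update()
    {
        if (m_Pending)
            return;

        m_Worker.Execute(m_Input);                 // only schedules the GPU work
        var output = m_Worker.PeekOutput() as TensorFloat;

        m_Pending = true;
        output.AsyncReadbackRequest(success =>     // GPU->CPU copy completes later
        {
            m_Pending = false;
            if (!success)
                return;
            output.MakeReadable();                 // data is already on the CPU here
            Debug.Log(output[0]);                  // safe, non-blocking read
        });
    }
}
```

The key point is that neither Execute nor AsyncReadbackRequest waits on the GPU; the only blocking read would be indexing the tensor before the callback fires.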

2 Likes

Thanks Alex. I’ll check again with a bigger model.

I did try Execute on its own and it seemed to be blocking for a bit. It might just have been a fraction of a second or so, but enough to make the graphics stall, which would be a problem if you’re running it multiple times a second. Every second counts, as they say :slightly_smiling_face:

I’ll get back to you when I do more tests.

1 Like

Hi Alex,

Apologies, I think you are right it is not blocking the CPU. I think it is blocking the graphics update.

I would like to know a bit more about how Execute works. I am assuming it queues up all the Sentis operations in one go, like this:

ssssssssssssssssssssssssssssssssssssG

s = Sentis operation
G = graphics update

so if this is correct, the frame will not get updated until all the Sentis operations are done, resulting in the graphics freezing and Update not being called for a while.

Whereas for a real-time game you want something like this:

sssssssGsssssssssssGssssssssssGssssssssssssGssssssssssG

running the layers in the background while keeping the graphics updates at a reasonable rate.

In the samples there is a way of doing this by running a few layers per frame. This works OK, but since each layer runs asynchronously, I don’t have an accurate way of measuring how much time those layers take to run on the GPU.

At the moment I am doing this in Update:

while (stopwatch.ElapsedTicks < 20000) // 20,000 ticks × 100 ns = 2 ms
{
    // execute another layer
}

to try to execute as many layers as I can in under 2 ms. But now I realise this is probably not measuring the time correctly. I could fix the number of layers per frame instead, but that is not ideal either, as different layers take different times and different machines run layers at different speeds.

An ideal solution would be a parameter in Execute like this:

model.Execute(forceGraphicsUpdateEvery: 16)

which would run all the Sentis layers but force a graphics update every 16 ms (while doing at least one layer per frame).

This is not a problem with small models like MNIST, which can be run in a single update, but for larger models it is important. In fact, I would go so far as to say that if we had a function like this which did the graphics updates automatically, it would be the single biggest possible usability improvement. (In my opinion :slightly_smiling_face:) (Maybe there is already a way to do this and I just missed it.)

In the meantime, do you think the best option is to do a fixed number of layers per frame? Or perhaps there is a better way, such as using coroutines.

Edit: Did some more tests, and it looks like Execute does take a long time (well, I’m calling 40 ms long, for a 90 MB model) to do its setup. So as well as making it not block the graphics update, ideally it should also spread out its queuing over several frames.
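For what it’s worth, here is a sketch of combining layer-at-a-time execution with a CPU-side time budget, assuming the StartManualSchedule API used by the “Run a model a layer at a time” sample (that name is an assumption and has changed between Sentis versions). Note the stopwatch only measures the CPU scheduling cost per layer; the GPU work it queues is asynchronous and is not captured by this budget:

```csharp
// Sketch only; worker/input creation omitted, API names assumed from the sample.
using System.Collections;
using System.Diagnostics;
using Unity.Sentis;
using UnityEngine;

public class BudgetedExecution : MonoBehaviour
{
    IWorker m_Worker;       // created elsewhere
    TensorFloat m_Input;    // filled elsewhere
    IEnumerator m_Schedule;
    const long k_BudgetTicks = 2 * System.TimeSpan.TicksPerMillisecond; // 2 ms

    void Start()
    {
        m_Schedule = m_Worker.StartManualSchedule(m_Input);
    }

    void Update()
    {
        if (m_Schedule == null)
            return;

        var sw = Stopwatch.StartNew();
        bool more;
        // Always schedule at least one layer, then keep going until the
        // CPU budget for this frame is spent.
        do
        {
            more = m_Schedule.MoveNext();
        } while (more && sw.ElapsedTicks < k_BudgetTicks);

        if (!more)
            m_Schedule = null; // all layers scheduled; now read output asynchronously
    }
}
```

The do/while (rather than while) guarantees forward progress of at least one layer per frame even on very slow machines.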

3 Likes

Having another thought: doing a fixed number of layers per frame might be a good option, as a faster GPU can render more FPS and thus will also run more layers, so the two should scale together. But picking the right number of layers to run is still a tricky business, and it depends on whether you want to prioritise frame rate or model processing on lower-end machines. I would say you would want to ensure a minimum frame rate at least.

1 Like

Yeah, unfortunately this is a pretty classical graphics problem.
The issue is that you cannot know beforehand how much a graphics job will cost, nor can you stop a graphics command if previous jobs are taking more than a given number of ms.
There are two ways to deal with this:

  • do one execution beforehand and profile how many ms each layer takes, either with RenderDoc or the Unity graphics profiler. Then you have the cost of each layer and can make sure the correct number of layers is dispatched every frame
  • do a mock execution and track the previous frame’s ms. Then you have a heuristic and choose how many layers need to be executed each frame (like dynamic resolution).
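The second strategy above could look something like this; all names, the target frame time, and the adjustment thresholds are illustrative assumptions, not a library API:

```csharp
// Sketch of a dynamic-resolution-style heuristic: adjust the number of
// layers scheduled per frame based on how long the previous frame took.
using UnityEngine;

public class LayerCountHeuristic
{
    int m_LayersPerFrame = 5;          // starting guess
    const float k_TargetFrameMs = 16.6f;

    // Call once per frame, e.g. with Time.deltaTime * 1000f.
    public int NextLayerCount(float lastFrameMs)
    {
        if (lastFrameMs > k_TargetFrameMs * 1.1f)
            m_LayersPerFrame = Mathf.Max(1, m_LayersPerFrame - 1); // falling behind: back off
        else if (lastFrameMs < k_TargetFrameMs * 0.9f)
            m_LayersPerFrame++;                                    // headroom: ramp up
        return m_LayersPerFrame;
    }
}
```

The 10% dead zone around the target avoids oscillating between two layer counts every frame; the floor of one layer guarantees the model still makes progress on slow machines.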

We have a tutorial called ExecuteInParts that goes over how to execute N layers every frame.
One thing we should expose: since we know whether a layer is executed on the CPU or GPU, we could let you dispatch N GPU layers, as this info is currently internal…

2 Likes

It’s a tricky one. I guess the best option is to make it user-adjustable, a setting the user can change along with the other usual graphics options. And maybe add a test scene where the user can try out the different settings, or the computer can adjust it dynamically until it finds a good setting.

Thanks :+1:

Hi Alex, could you tell me where to find the ExecuteInParts tutorial you mentioned? I am having the same problem, running a model that cannot finish within one frame. Thanks.

1 Like

~Samples\Run a model a layer at a time
2 Likes