Hi Sentis Team! First, I just wanted to say a big thank you for all your hard work releasing Sentis and integrating it with MLAgents. Super excited that the work from Barracuda was continuing.
First, a bit about my use case. I've previously trained an AI agent for a game using MLAgents. It's a pretty simple model, with roughly 1,000 inputs, 10 discrete outputs, and a few dense layers. It runs on Burst since I found it performed better on CPU than on GPU, although for this type of model, does that change with the latest inference options (e.g. compute or pixel shader)?
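For context, here's roughly how I benchmark the raw model against each backend outside of ML-Agents (a standalone sketch: the ModelAsset field, the dummy input, and the timing harness are mine, and in practice I warm up first and average over many runs):

using Unity.Sentis;
using UnityEngine;

public class BackendComparison : MonoBehaviour
{
    public ModelAsset modelAsset;   // the exported model my agents use

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        foreach (var backend in new[] { BackendType.CPU, BackendType.GPUCompute, BackendType.GPUPixel })
        {
            using var worker = WorkerFactory.CreateWorker(backend, model);
            using var input = new TensorFloat(new TensorShape(1, 1000), new float[1000]);

            var sw = System.Diagnostics.Stopwatch.StartNew();
            worker.Execute(input);
            // read the result back so the GPU backends are timed end to end
            using var output = worker.PeekOutput() as TensorFloat;
            output.MakeReadable();
            sw.Stop();

            Debug.Log($"{backend}: {sw.Elapsed.TotalMilliseconds:F2} ms");
        }
    }
}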
During gameplay, I have multiple (3 to 7) AI agents that all use the same model architecture and use the default Academy.EnvironmentStep(). After upgrading from MLAgents 2.0.1 + Barracuda 3.0.0 to MLAgents 3.0.0 + Sentis 1.2.0, I've run into a few issues.
First off, let's talk about garbage collection. It looks like the amount of garbage generated per step went from 34.5 kB to just 5.1 kB. Yay, this is awesome! Great work! It still looks like Sentis.Execute generates quite a bit of garbage when it creates these tensors. Since I know the shapes of these tensors ahead of time, is there a mechanism for me to pre-allocate them?
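To make the question concrete, this is roughly what I'd hope to be able to do (purely illustrative; I know ModelRunner currently hides the worker and tensor creation, and the names here are mine):

using Unity.Sentis;

class PreallocatedInference
{
    IWorker m_Worker;                                   // created once from the model
    TensorShape m_ObsShape = new TensorShape(7, 743);   // 7 agents x 743-length vector obs
    float[] m_ObsBuffer;                                // reused every decision step

    public void Init(Model model)
    {
        m_Worker = WorkerFactory.CreateWorker(BackendType.CPU, model);
        m_ObsBuffer = new float[m_ObsShape.length];
    }

    public void Step()
    {
        // refill m_ObsBuffer from the sensors, then wrap it and run inference;
        // ideally the TensorFloat itself could also be reused so nothing is allocated per step
        using var obs = new TensorFloat(m_ObsShape, m_ObsBuffer);
        m_Worker.Execute(obs);
    }
}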
Comparing execution times in Development Mode on Android, the model execution time (ModelRunner.DecideAction) increased from 8.91 ms to 13.71 ms for 7 agents, roughly a 50% jump. The top image below is MLAgents 2.0.1 + Barracuda 3.0.0, and the bottom image is MLAgents 3.0.0 + Sentis 1.2.0.
Diving into this, it looks like the delta is caused by GenerateTensors going from 0.37 ms to 8.96 ms, partially offset by Barracuda.Execute going from 8.42 ms to 4.43 ms:
Can you explain the cause of this performance delta? Is it possible to alleviate this by pre-allocating tensors ahead of time, given that I know exactly how many agents I have in my scene and what the model architecture is?
Could I also use CPU time slicing so that, instead of all 7 agents doing model inference in the same timestep, only 1 agent does model inference per timestep? Are there any downsides to this approach? Any suggestions for how to get started?
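Concretely, this is the kind of scheduling I have in mind (a rough sketch, assuming I remove the DecisionRequester components and request decisions myself):

using Unity.MLAgents;
using UnityEngine;

public class RoundRobinDecisionScheduler : MonoBehaviour
{
    public Agent[] agents;   // the 3-7 agents that share the same model
    int m_NextAgent;

    void FixedUpdate()
    {
        for (int i = 0; i < agents.Length; i++)
        {
            if (i == m_NextAgent)
                agents[i].RequestDecision();   // only this agent runs inference this step
            else
                agents[i].RequestAction();     // the others repeat their last action
        }
        m_NextAgent = (m_NextAgent + 1) % agents.Length;
    }
}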
Thanks so much again for all your hard work, and also for any input you might have!
Could you list which layers are creating GC?
We'll patch them in an upcoming release; there's no reason why we shouldn't be at 0 GC.
2/3. Let me get back to you.
Alternatively, for (3), I'm wondering if it's safe to run the model on a background thread so that it doesn't block the main thread. Specifically, I'm wondering whether Academy.EnvironmentStep() can be executed safely on a background thread. I don't use any MonoBehaviour-based sensors. I know this is partially an MLAgents question too, but if you could comment on the Sentis model execution side, that'd be great!
So I tried the CPU time slicing approach by running one inference per update. Unfortunately, it doesn't work too well. Although the times for GenerateTensors (9.1 ms → 0.6 ms), AgentSendState (1.25 ms → 0.2 ms), and AgentAct (0.28 ms → 0.08 ms) are reduced as expected, the time Sentis.Execute takes stays the same. Is this because we're effectively operating with batch size = 1, or is this behavior unexpected?
The total agent execution time goes from 15.6 ms for 7 agents without time slicing to 6 ms for a single agent with time slicing, i.e. roughly 2.2 ms per agent when batched versus 6 ms per agent when sliced. Therefore, this approach probably doesn't make sense.
So my question becomes: is there any way to effectively run model execution over multiple frames without blocking the main thread? I tried a naive implementation using UniTask to run Academy.EnvironmentStep() on a background thread, but the task faulted.
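For reference, the naive attempt looked roughly like this (simplified; error handling and the surrounding update loop are omitted, and automatic stepping is disabled elsewhere so EnvironmentStep() is only driven from here):

using Cysharp.Threading.Tasks;
using Unity.MLAgents;

async UniTaskVoid StepInBackground()
{
    await UniTask.RunOnThreadPool(() =>
    {
        // this is the call that faulted: EnvironmentStep() collects observations,
        // runs inference and applies actions, all of which normally happens on the main thread
        Academy.Instance.EnvironmentStep();
    });
}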
I'll just add one more interesting observation. Inspecting the GenerateTensors block in the profiler, I discovered that it calls JobHandle.Complete about 6k times, taking 4 ms, and then just waits for another 5 ms, which is why it's taking so long. I'm only using 2 VectorSensors and one ActionMask.
Edit: Thinking about it some more, I believe the 6k figure corresponds to the size of the observations (7 agents × a 743-length vector sensor). It's calling JobHandle.Complete for every single element. By contrast, MLAgents 2.0.1 + Barracuda 3.0.0 only calls JobHandle.Complete 3 times, once per VectorSensor and ActionMask.
Edit 2: Tracing through ObservationWriter.cs in ML-Agents, the following code is called for every observation; the difference appears to be in Tensor.cs in Barracuda 3.0.0 vs. Sentis 1.2.0:
((TensorFloat)m_Proxy.data)[m_Batch, index + m_Offset + writeOffset] = val;
In Barracuda 3.0.0, there seems to be a cache which is then uploaded to the device. However, in Sentis 1.2.0, every write seems to call Set(int d0, T value), which I presume doesn't use a cache?
I think the same issue (accessing one value at a time without a cache) is also present when retrieving values in ApplyTensors, specifically in DiscreteActionOutputApplier.Apply in ApplierImpl.cs (line 94).
Yeah, Tensor indexing is now more costly due to the removal of the tensor cache.
That is for sure a cause of slowdown.
There are some workarounds for that.
The best would be to poke the ml-agents team about it to see if they can update their code.
On the Sentis side, I'll add a check on the fences to see if they are done instead of calling Complete every time; it might have an impact.
That would be amazing, thanks Alexandre! I went ahead and forked the ml-agents code myself, and wanted to confirm a few things:
If I first call CompleteAllPendingOperations and then follow up with consecutive calls to set / get, will that be safe, or might that have unintended consequences?
BurstTensorData data = tensorProxy.data.tensorOnDevice as BurstTensorData;
// complete any outstanding Burst jobs once, then write every element directly
data?.CompleteAllPendingOperations();
for (int i = 0; ...)
{
    data.array.Set(index, value);
}
If I launch IWorker.Execute from a background thread, would there be any unexpected behavior? The reason I’d like to do this is so that it won’t block the main thread.
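For the second point, what I'm doing currently looks roughly like this (again via UniTask; simplified, where worker and inputs are the IWorker and input tensors that ModelRunner already builds):

using System.Collections.Generic;
using Cysharp.Threading.Tasks;
using Unity.Sentis;

async UniTask ExecuteOffMainThread(IWorker worker, Dictionary<string, Tensor> inputs)
{
    // run inference on the thread pool so the main thread isn't blocked
    await UniTask.RunOnThreadPool(() => worker.Execute(inputs));

    // make sure we're back on the main thread before touching the outputs
    await UniTask.SwitchToMainThread();
    using var output = worker.PeekOutput() as TensorFloat;
    // ...apply the actions from the output here
}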
I’ve done both and my agent behavior appears to be fine, but I just wanted to double check. Thanks!