Trying to understand visual observation bottlenecks

Hi, I have 10 agents in a scene collecting visual observations with a decision interval of 5. CPU inference framerates are around 120 FPS, but when I lower the interval to 1, performance drops to a crawl, below 5 FPS. According to the profiler, the main culprits seem to be MLAgents.Test.ExecuteGraph [especially Barracuda.Conv2D/…/UnsafeMatrixBlockMultiplyUnrolled8xhJob(Burst)] and MLAgents.Test.GenerateTensors.

So my thinking was that I could boost performance by spreading out decision calls, reducing the amount of work AcademyFixedUpdateStepper has to do per FixedUpdate step. I wrote a staggered decision requester that shifts each agent's decisions by a step offset value, sketched below. With the original interval of 5, only 2 out of 10 agents now request decisions on any given step. Surprisingly though, this resulted in my framerate dropping to around 50 FPS.
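The component looks roughly like this (a simplified sketch rather than my exact code; the class and field names are just illustrative). It mirrors the stock DecisionRequester, but shifts each agent's decision step by a per-agent offset:

```csharp
using Unity.MLAgents;
using UnityEngine;

// Simplified sketch of the staggered requester (illustrative names, not a
// package API). Same period as the stock DecisionRequester, but each agent's
// decision step is shifted by a per-agent offset.
[RequireComponent(typeof(Agent))]
public class StaggeredDecisionRequester : MonoBehaviour
{
    public int DecisionPeriod = 5;  // request a decision every N Academy steps
    public int StepOffset;          // 0..DecisionPeriod-1, varied per agent
    public bool TakeActionsBetweenDecisions = true;

    Agent m_Agent;

    void Awake()
    {
        m_Agent = GetComponent<Agent>();
    }

    void FixedUpdate()
    {
        // Academy.Instance.StepCount advances once per FixedUpdate while
        // automatic stepping is enabled; the offset spreads agents across steps.
        if ((Academy.Instance.StepCount + StepOffset) % DecisionPeriod == 0)
        {
            m_Agent.RequestDecision();
        }
        else if (TakeActionsBetweenDecisions)
        {
            m_Agent.RequestAction();
        }
    }
}
```

With 10 agents and an interval of 5, assigning StepOffset = agentIndex % 5 is what spreads the decisions out to 2 agents per step.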

I don’t really understand why that’s happening. Are multiple visual observations maybe batched somehow, so that processing them together is faster than processing them one after another? Is spreading out agent decisions a bad idea in general? Thanks!

Hi @mbaske ,
Yes, I think so. In general, Barracuda is good at batched calculations, so the cost of running inference for 2 agents is less than 2x the cost of doing it for 1 agent.
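To make that concrete, here's a rough sketch at the Barracuda level (not the package internals; it assumes the 1.x WorkerFactory/Tensor API and a placeholder 84x84x3 visual observation shape):

```csharp
using Unity.Barracuda;
using UnityEngine;

// Sketch: one batched Execute vs. several single-item Executes on the same
// worker. The model asset, backend choice and tensor shape are placeholders.
public class BatchingSketch : MonoBehaviour
{
    public NNModel modelAsset;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        var worker = WorkerFactory.CreateWorker(WorkerFactory.Type.CSharpBurst, model);

        // One call with batch = 10 (N, H, W, C layout):
        using (var batched = new Tensor(10, 84, 84, 3))
        {
            worker.Execute(batched);
            worker.FlushSchedule(true);  // block until the scheduled work finishes
        }

        // Ten calls with batch = 1: usually slower in total, since each
        // Execute pays fixed per-dispatch overhead.
        for (var i = 0; i < 10; i++)
        {
            using (var single = new Tensor(1, 84, 84, 3))
            {
                worker.Execute(single);
                worker.FlushSchedule(true);
            }
        }

        worker.Dispose();
    }
}
```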

Judging from the profiler names, it looks like you’ve upgraded to the 1.8.0-preview version of the package and are using Burst for inference. If not, you might want to give that a try.

Would you mind posting your model file somewhere so that we can look at it? Random weights (i.e., zero training steps) are fine. And if you don’t want to post it publicly, you can message us at ml-agents@unity3d.com.

Thank you @celion_unity.
I’ve now upgraded to 1.8.0-preview, but I’m still seeing the same behaviour. It’s the most basic setup: just 10 agents, each with a camera and a camera sensor. The framerate drops by around 50% if I replace the standard decision requester component (interval = 5: all 10 agents decide every 5th step) with the staggered requester (interval = 5: 2 agents decide at every step). FPS looks about the same regardless of whether I select CPU or Burst for inference; “Burst > Enable Compilation” is checked.
I’m attaching my untrained model; config params were copied from VisualPyramids.

EDIT: Just tried another, more complex project with 1.8.0, and I’m indeed getting about 25% better performance using Burst inference. Great update!

CamTest.zip (2.63 MB)

Thanks, I set up a test scene using your model and compared the two stepping strategies. Here are my average times for stepping the Academy:
Step2AgentsEachFrame: 17.38 ms
StepAllAgentsEvery5Frames: 10.12 ms
(I’m using Unity 2018.4 on a Mac laptop, but I’m guessing the ratio would be roughly similar for you.)
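These are just averages of the Academy step time; a minimal harness along these lines reproduces the measurement (a sketch assuming manual stepping, not necessarily the exact code in my branch):

```csharp
using System.Diagnostics;
using Unity.MLAgents;
using UnityEngine;

// Sketch of a timing harness: steps the Academy manually and logs the
// average cost per step. Class name and logging cadence are illustrative.
public class AcademyStepTimer : MonoBehaviour
{
    readonly Stopwatch m_Timer = new Stopwatch();
    int m_Steps;
    double m_TotalMs;

    void Awake()
    {
        // Take over stepping so the Stopwatch brackets exactly one Academy step.
        Academy.Instance.AutomaticSteppingEnabled = false;
    }

    void FixedUpdate()
    {
        m_Timer.Reset();
        m_Timer.Start();
        Academy.Instance.EnvironmentStep();  // decisions + inference happen here
        m_Timer.Stop();

        m_Steps++;
        m_TotalMs += m_Timer.Elapsed.TotalMilliseconds;
        if (m_Steps % 500 == 0)
        {
            UnityEngine.Debug.Log($"Average Academy step: {m_TotalMs / m_Steps:F2} ms");
        }
    }
}
```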

The time is roughly evenly split between GenerateTensors (which is where the actual camera render is happening) and Barracuda execution. I’m not sure there’s much we can do about the render, but I’ll try to get more information on the Barracuda side and see if there’s anything that can be improved there.
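If you want to see that split without digging through the Profiler window, you can read the two samplers you quoted at runtime; here's a sketch using UnityEngine.Profiling.Recorder (editor and development builds only):

```csharp
using UnityEngine;
using UnityEngine.Profiling;

// Sketch: log the render-vs-inference split each frame by reading the two
// profiler samplers named in the original post.
public class SamplerSplitLogger : MonoBehaviour
{
    Recorder m_Generate;
    Recorder m_Execute;

    void Start()
    {
        m_Generate = Recorder.Get("MLAgents.Test.GenerateTensors");
        m_Execute = Recorder.Get("MLAgents.Test.ExecuteGraph");
        m_Generate.enabled = true;
        m_Execute.enabled = true;
    }

    void Update()
    {
        // elapsedNanoseconds reflects the samples from the previous frame.
        Debug.Log($"GenerateTensors: {m_Generate.elapsedNanoseconds / 1e6:F2} ms, " +
                  $"ExecuteGraph: {m_Execute.elapsedNanoseconds / 1e6:F2} ms");
    }
}
```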

My test case is on the repro-vis-obs-perf branch of Unity-Technologies/ml-agents on GitHub if you want to try it out (and maybe check whether there’s anything I’m not accounting for).


Thanks again for investigating. Good to know it’s better to request all decisions together in a single step.

Well, it’s hard to say definitively: for games where you need a consistent frame rate, it may still be better to spread the decisions out evenly, even though that means more total time spent on inference.