Expected performance using immediate mode with Unity Physics

Hi, I’m experimenting with writing a MonoBehaviour wrapper around Unity Physics for the purpose of integrating it with MB-based networking engines (specifically Photon Fusion). This has been fairly straightforward to do by reverse engineering the Unity Physics ECS systems and I have made a functional prototype.

I am now running into some major performance issues. I expected that the overall system would perform worse than the ECS version, but here I am just profiling Unity.Physics.Simulation.StepImmediate (so none of my code) and it’s significantly slower than PhysX.

My current setup is 5x5x5 (125) cubes stacked with 5 solver iterations and SyncCollisionWorld=true (dt is 0.02, but that shouldn’t really matter). On my i7-10700K I stopwatch it at ~10ms when they’re all stacked. With PhysX, I time Physics.Simulate() at 0.25ms…and Physics.Simulate() encapsulates more than just running the sim step (I use a script to force all the RBs awake, as ofc sleeping is a major perf gain for PhysX over UPhysics).
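For anyone wanting to reproduce stopwatch numbers like these, here is a minimal timing sketch. It is not the code used for the measurements above. StepTimer and MedianMilliseconds are hypothetical names, and the warm-up and sample counts are arbitrary. It takes the median of many runs so that one-off spikes (JIT or Burst compilation, GC) do not skew the result.

```csharp
using System;
using System.Diagnostics;

public static class StepTimer
{
    // Hypothetical helper: runs `step` untimed `warmup` times, then times it
    // `samples` times and returns the median wall-clock duration in milliseconds
    // (the upper middle element when `samples` is even).
    public static double MedianMilliseconds(Action step, int warmup = 10, int samples = 100)
    {
        if (step == null) throw new ArgumentNullException(nameof(step));
        if (samples <= 0) throw new ArgumentOutOfRangeException(nameof(samples));

        // Warm-up runs are executed but not measured.
        for (int i = 0; i < warmup; i++) step();

        var times = new double[samples];
        var sw = new Stopwatch();
        for (int i = 0; i < samples; i++)
        {
            sw.Restart();
            step();
            sw.Stop();
            times[i] = sw.Elapsed.TotalMilliseconds;
        }

        Array.Sort(times);
        return times[samples / 2];
    }
}
```

Usage would look something like StepTimer.MedianMilliseconds(() => Physics.Simulate(0.02f)) for PhysX, and the same call wrapping the Unity Physics step for the other side. Note that every call advances the simulation, so the scene state drifts over the course of the samples.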

This seems pretty out of whack with what I expected. I’d like to sit down and build a similar test comparing with an immediate step using ECS, but I haven’t worked with Unity ECS enough to be able to do this and confidently get the numbers correct, so I figured I’d ask here for opinions. As stated above, I do understand that any MB wrapper will likely be slower than the ECS one, but my assumption is that once the data is initialized and sent to Unity Physics, it no longer matters where it came from.

Thanks for any direction.

StepImmediate() runs entirely single-threaded. So it will be significantly slower than standard parallelized Unity Physics, or parallelized PhysX.

Check the profiler for more information, specifically the jobs section, where you will see in a standard parallelized Unity Physics simulation a large number of jobs launched throughout the physics pipeline.

For high performance, instead of using StepImmediate(), you might want to schedule the jobs required for the different phases in the Unity Physics pipeline, as is done in standard Unity Physics under the hood.
I suggest you have a look at the UnityPhysicsSimulationSystems.cs file to see how this is done.

Hey Daniel, thanks for the response!

I did initially look at implementing the jobs simulation path to compare performance, but ran into some issues that I wasn’t able to resolve. My code for the step is below.

BuildPhysicsWorld(deltaTime, gravity);

SimulationStepInput input = new()
{
    World = PhysicsWorld,
    TimeStep = deltaTime,
    Gravity = gravity,
    NumSolverIterations = solverIterations,
    // Setting this to true causes the dynamic world resulting from
    // integrating velocities to be written back to the collision world,
    // which is used to set the game object transform positions.
    SynchronizeCollisionWorld = true
};

// This works fine
// simulationContext.Reset(input);
// Simulation.StepImmediate(input, ref simulationContext);

// This does not
Simulation sim = Simulation.Create();
SimulationJobHandles handles = sim.ScheduleStepJobs(input, default);

handles.FinalExecutionHandle.Complete();
handles.FinalDisposeHandle.Complete();

sim.Dispose();

Now there’s a lot going on in the BuildPhysicsWorld method, but I don’t think it should be relevant, as the StepImmediate works with the simulation input that is constructed, while stepping with scheduled jobs does not, causing the following exceptions

Exceptions
System.ArgumentException: foreachCount must be > 0foreachCount
This Exception was thrown from a job compiled with Burst, which has limited exception support.
0x00007ff85ef5a5ce (a80b9d0adfc3aede30f0d2b3656a74c) burst_Abort_Trampoline
0x00007ff85ef585db (a80b9d0adfc3aede30f0d2b3656a74c) Unity.Jobs.IJobExtensions.JobStruct`1<Unity.Collections.NativeStream.ConstructJobList>.Execute(ref Unity.Collections.NativeStream.ConstructJobList data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex) -> void_0d64390541958b43604801c180219108 from UnityEngine.CoreModule, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null (at K:/Documents/Game Dev/UnityPhysicsGameObjects/Library/PackageCache/com.unity.burst@1.8.18/.Runtime/unknown/unknown:0)
0x00007ff85ef5827d (a80b9d0adfc3aede30f0d2b3656a74c) 959783104064e8c81fba5d33d94ead01
0x00007ff7a6cb71a2 (Unity) ExecuteJob
0x00007ff7a6cb858f (Unity) ForwardJobToManaged
0x00007ff7a6cb43ba (Unity) ujob_execute_job
0x00007ff7a6cb37ed (Unity) lane_guts
0x00007ff7a6cb6424 (Unity) worker_thread_routine
0x00007ff7a6ee3aeb (Unity) Thread::RunThreadWrapper
0x00007ff906477374 (KERNEL32) BaseThreadInitThunk
0x00007ff9073fcc91 (ntdll) RtlUserThreadStart
Invalid allocation label passed to UnsafeUtility::Malloc

I’ve done lots of multithreaded programming, but I haven’t worked with jobs much so I’m not sure if there’s an issue passing default in to ScheduleStepJobs’s dependencies parameter. The exceptions seem fairly low level, making me think it’s something to do with how I’m handling the jobs?

I’d suggest you run the same thing but without burst so that you can get a more meaningful error message.

Also, SimulationStepInput.SynchronizeCollisionWorld should only be set to true if you need to perform accurate collision queries after the physics update and after rigid bodies have moved.

Disabling Burst gives slightly better error messages:

Exceptions
ArgumentException: foreachCount must be > 0
Parameter name: foreachCount
Unity.Collections.NativeStream.CheckForEachCountGreaterThanZero (System.Int32 forEachCount) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/NativeStream.cs:773)
Unity.Collections.NativeStream.AllocateForEach (System.Int32 forEachCount) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/NativeStream.cs:276)
Unity.Collections.NativeStream+ConstructJobList.Execute () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/NativeStream.cs:242)
Unity.Jobs.IJobExtensions+JobStruct`1[T].Execute (T& data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, Unity.Jobs.LowLevel.Unsafe.JobRanges& ranges, System.Int32 jobIndex) (at <6ff3bcb667574bf9a8630184172fcfbf>:0)

and

Invalid allocation label passed to UnsafeUtility::Malloc
UnityEngine.StackTraceUtility:ExtractStackTrace ()
Unity.Collections.Memory/Unmanaged/Array:Resize (void*,long,long,Unity.Collections.AllocatorManager/AllocatorHandle,long,int) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/Memory.cs:79)
Unity.Collections.Memory/Unmanaged:Allocate (long,int,Unity.Collections.AllocatorManager/AllocatorHandle) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/Memory.cs:20)
Unity.Collections.AllocatorManager:TryLegacy (Unity.Collections.AllocatorManager/Block&) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/AllocatorManager.cs:1097)
Unity.Collections.AllocatorManager:Try (Unity.Collections.AllocatorManager/Block&) (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/AllocatorManager.cs:1129)
Unity.Collections.AllocatorManager/Block:TryFree () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/AllocatorManager.cs:960)
Unity.Collections.AllocatorManager/Block:Dispose () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/AllocatorManager.cs:940)
Unity.Collections.LowLevel.Unsafe.UnsafeStream:Deallocate () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/UnsafeStream.cs:319)
Unity.Collections.LowLevel.Unsafe.UnsafeStream:Dispose () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/UnsafeStream.cs:335)
Unity.Collections.NativeStreamDispose:Dispose () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/NativeStream.cs:797)
Unity.Collections.NativeStreamDisposeJob:Execute () (at ./Library/PackageCache/com.unity.collections@2.5.1/Unity.Collections/NativeStream.cs:808)
Unity.Jobs.IJobExtensions/JobStruct`1<Unity.Collections.NativeStreamDisposeJob>:Execute (Unity.Collections.NativeStreamDisposeJob&,intptr,intptr,Unity.Jobs.LowLevel.Unsafe.JobRanges&,int)

Neither of the stack traces points to anything higher level (not sure what’s up with that), but both seem to revolve around an unsafe collection (a NativeStream being constructed and disposed), making me wonder again if the default JobHandle that I’ve passed in is causing issues.

I’m still building the CollisionWorld using the immediate API, but I figure that shouldn’t matter since it’s all done before the simulation is stepped.

Good to know about SynchronizeCollisionWorld. When I was reverse engineering things, I assumed it was needed in order to read back the resulting state of the sim and copy it to the MB side of things.
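A rough sketch of reading results back without SynchronizeCollisionWorld, for reference. It assumes the post-step poses can be taken from the dynamics world's MotionData array, and that transforms[i] is the GameObject for dynamic body i. ReadbackSketch and CopyToTransforms are hypothetical names, and the body-to-motion composition should be checked against the Unity Physics version in use.

```csharp
using Unity.Collections;
using Unity.Mathematics;
using Unity.Physics;
using UnityEngine;

public static class ReadbackSketch
{
    // Hypothetical helper, not part of Unity Physics.
    // Assumption: motionDatas[i] and transforms[i] refer to the same dynamic body.
    // Call only after all step jobs have completed.
    public static void CopyToTransforms(NativeArray<MotionData> motionDatas, Transform[] transforms)
    {
        int count = math.min(motionDatas.Length, transforms.Length);
        for (int i = 0; i < count; i++)
        {
            MotionData md = motionDatas[i];

            // WorldFromMotion places the center-of-mass frame in world space;
            // undo BodyFromMotion to get the body's own world transform.
            RigidTransform worldFromBody =
                math.mul(md.WorldFromMotion, math.inverse(md.BodyFromMotion));

            transforms[i].SetPositionAndRotation(worldFromBody.pos, worldFromBody.rot);
        }
    }
}
```

It would be called as ReadbackSketch.CopyToTransforms(PhysicsWorld.MotionDatas, transforms) once the step has finished. With this approach the collision world is stale until the next broadphase build, which only matters if queries are run in between.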

Investigated this a bunch. I wasn’t able to figure out which job specifically is throwing the exception; it seems too low level. I did discover that setting multiThreaded to false prevents all the issues.

SimulationJobHandles handles = sim.ScheduleStepJobs(input, default, false);

Which is interesting. Looking at the code, it seems like the same jobs are run, just significantly less parallelized. It makes me think there’s some sort of race condition resulting from the way I’ve set up my physics world, maybe? Not sure how I could debug this further.

@RoystanHonks I was playing with your example project, and I got the same error.

By copying the package content somewhere else as regular C# scripts, I was able to get some debugging context while this happens.

It seems this ConstructJobList job is only scheduled twice, both during the broadphase update, in the ScheduleFindOverlapsJobs():

JobHandle dynamicConstruct = NativeStream.ScheduleConstruct(out dynamicVsDynamicPairsStream, dynamicVsDynamicNodePairIndices, allocateDeps, Allocator.TempJob);
JobHandle staticConstruct = NativeStream.ScheduleConstruct(out staticVsDynamicPairsStream, staticVsDynamicNodePairIndices, allocateDeps, Allocator.TempJob);

Here, staticVsDynamicNodePairIndices seems to be a list of length 0 when the ConstructJobList gets scheduled, which triggers the assert.

That being said, I can’t for the life of me understand why it’s an empty list.

There is a static object registered in the scene, but it looks like the CollisionWorld.StaticTree needs its internal BranchCount to be non-0 for that list to have any values… and it is always 0 (whereas for the dynamic rigidbodies tree, it is non-0).

I’m getting lost in the init procedure of those trees, and I can’t find where it differs for them to end up with different values.

Anyway, hope this can help. I’ll keep digging too, because I’d really like to be able to use a stateless physics system with GOs.

EDIT: I found a solution. It seems that what was required was using CollisionWorld.ScheduleBuildBroadphaseJobs() instead of CollisionWorld.BuildBroadphase(), and forcing a rebuild of the static tree when the number of static objects has changed (which is the case for the first update). I opened a PR in your repo with a fix, if you want to take a look.
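This is roughly the shape of that fix, as a sketch rather than the code from the PR. BroadphaseBuildSketch, ScheduleBroadphase and previousStaticCount are hypothetical names. The ScheduleBuildBroadphaseJobs parameter list is written from memory of the Unity Physics 1.x source, so treat it as an assumption and check it against the installed package.

```csharp
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Physics;

public static class BroadphaseBuildSketch
{
    // Hypothetical helper, not part of Unity Physics.
    // previousStaticCount: caller-owned state, initialised to -1 so that the
    //   first update always rebuilds the static tree.
    // buildStaticTree: caller-owned persistent NativeReference<int>; it must
    //   not be read by any in-flight job when this method is called.
    public static JobHandle ScheduleBroadphase(
        ref PhysicsWorld world,
        float timeStep,
        float3 gravity,
        ref int previousStaticCount,
        NativeReference<int> buildStaticTree,
        JobHandle inputDeps)
    {
        // Force a static tree rebuild whenever the static body count changed.
        int staticCount = world.NumStaticBodies;
        buildStaticTree.Value = staticCount != previousStaticCount ? 1 : 0;
        previousStaticCount = staticCount;

        // Assumed signature (Unity Physics 1.x): world, time step, gravity,
        // read-only rebuild flag, dependencies, multithreaded toggle.
        return world.CollisionWorld.ScheduleBuildBroadphaseJobs(
            ref world, timeStep, gravity,
            buildStaticTree.AsReadOnly(), inputDeps, true);
    }
}
```

The returned handle would then be passed as the dependency to ScheduleStepJobs instead of default. Comparing counts only catches added or removed static bodies; a static body that moved without the count changing would need its own dirty flag.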

Hey, a few days ago actually the GitHub user aliaumem submitted a pull request with roughly the same fix! I’ve integrated it into the multithreading branch and am planning to set up some tools for profiling, ideally to compare it directly against PhysX (or as directly as possible).

Added some simple profiling to my project. Started with a really conservative measurement to establish the best-case performance for Unity Physics against PhysX. For PhysX, I profiled Physics.Simulate(...), while for Unity Physics I profiled just the inner step method that kicks off the threads, Simulation.ScheduleStepJobs(input, default, true).

I found for a 125 cube drop (can see the video here), PhysX clocked in at 0.45ms, while Unity Physics at 4.5ms. Now, obviously the two methods are not directly comparable, as Physics.Simulate does a lot more than just step the simulation (also copies all the states back to the game objects, etc), but it’s a good smoke test.

I’m a bit surprised–I figured that PhysX would have a massive edge when objects are able to fall asleep, but in this test everything was awake. As well, I’ve excluded from the profile all the world build and broadphase steps from the U Physics side, so it has an advantage there.

Curious whether this is about best case performance for the engine, or if I’ve done something odd that’s slowing it down.

I’m curious if you see any performance difference between running UPhysics manually vs using the authoring components and default implementation in ECS. If there is, maybe the way you’re invoking the whole system introduces slow-downs?

This was my first thought, but I enabled Burst compilation and actually got a huge performance boost: down to ~0.6ms for the collapsed stack of cubes, compared to the 0.45ms of PhysX. Setting it to a 10x10x10 stack and including the broadphase call, I get PhysX at 2.3ms and Unity Physics at 1.3ms! Overall, those are really solid results.