Help, performance is (still) bad!

Hi!

I ported Visual Pinball’s physics engine to ECS. It’s a 1:1 port of the original C++ code, so it doesn’t use Unity’s DOTS/Havok physics. Due to Burst and ECS’ memory layout, I expected performance to be in the same ballpark as the C++ implementation, but I was utterly disappointed: it’s several orders of magnitude slower.

I’m looking for hints where to put my focus first, so I’ll briefly describe the parts of my port where I had to improvise without documentation or examples. Both Visual Pinball and my port are open source, so I’ll link to the relevant code.

Game Loop

The physics engine uses 1ms ticks for each cycle. If a collision is detected, there might be additional cycles. So I needed a way to update my physics world more than once per frame.

I did that using a ComponentSystemGroup which has a reference to all the needed systems, and updates them in a loop in OnUpdate(). One of those systems does most of the simulation, and contains another loop.

So we have two groups, one with an outer loop, and one with an inner loop. Most systems are set to [DisableAutoCreation] and are called explicitly by those groups. There are additional groups that are updated, but they are empty classes and their systems are attached via [UpdateInGroup].
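
To make the structure concrete, here's a simplified sketch of the outer group (the member system names and the tick count are placeholders, not the actual code):

using Unity.Entities;

// Simplified sketch of the outer group: it holds direct references to its
// systems and updates them explicitly, several times per rendered frame.
// The member systems are marked [DisableAutoCreation] and only updated here.
public class UpdatePhysicsGroup : ComponentSystemGroup
{
    private BallVelocitySystem _velocitySystem;     // hypothetical system name
    private SimulateCycleGroup _simulateCycleGroup; // hypothetical group with the inner loop

    protected override void OnCreate()
    {
        base.OnCreate();
        _velocitySystem = World.GetOrCreateSystem<BallVelocitySystem>();
        _simulateCycleGroup = World.GetOrCreateSystem<SimulateCycleGroup>();
    }

    protected override void OnUpdate()
    {
        // one iteration per 1ms physics tick elapsed since the last frame
        // (tick accumulation omitted for brevity)
        const int numTicks = 4;
        for (var i = 0; i < numTicks; i++) {
            _velocitySystem.Update();
            _simulateCycleGroup.Update();
        }
    }
}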

I’m wondering what the overhead of this is, and if nesting is a problem.

Concurrency

The original C++ code is single-threaded. It’s a pinball simulator, so apart from simulating multiple balls on multiple threads (and there’s usually only one ball), the job system won’t help a lot in terms of performance. And there’s no ROM emulation yet that would consume resources.

Joachim recognized this as well and suggested using Entities.ForEach().Run() for systems dealing with very few entities in order to still benefit from Burst.
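
So a typical system in the port roughly follows this shape (simplified sketch; BallData here is a trimmed-down stand-in for the real component):

using Unity.Entities;
using Unity.Mathematics;

public struct BallData : IComponentData   // trimmed-down stand-in
{
    public float3 Position;
    public float3 Velocity;
}

// Runs immediately on the main thread, but the lambda is still Burst-compiled.
public class BallVelocitySystem : SystemBase
{
    protected override void OnUpdate()
    {
        var dTime = Time.DeltaTime;
        Entities.ForEach((ref BallData ball) =>
        {
            ball.Position += ball.Velocity * dTime;   // placeholder integration step
        }).Run();
    }
}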

The port currently features 13 systems that are each updated multiple times per frame. Could this cause a significant overhead? I remember reading that DOTS was designed for hundreds of systems, so I would say no, but my gut feeling tells me otherwise. :wink:

Data Size

I haven’t split up the ball data, which is queried pretty much everywhere. It’s not a huge struct (65 bytes), but I could break it into smaller components so that a system doesn’t load attributes it never uses.

To my understanding, splitting up data allows for better concurrency, since the jobs can be scheduled more aggressively, but as mentioned in the previous section, I don’t think that there is much to gain in terms of concurrency.

Maybe the size is a problem though?
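
For illustration, a split could look something like this (hypothetical layout; this is not how the port is structured today):

using Unity.Entities;
using Unity.Mathematics;

// Hypothetical split: each system only queries the pieces it actually touches,
// so chunks stay denser for the hot data.
public struct BallPosition : IComponentData { public float3 Value; }
public struct BallVelocity : IComponentData { public float3 Value; }
public struct BallOrientation : IComponentData { public float3x3 Value; }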

[SOLVED] Chunk Looping

Three of my systems write to other entities’ data, so I have jobs that loop through chunks and somewhat arbitrarily update data. I haven’t figured out any other way, and I find it pretty awkward due to the amount of boilerplate code.

One thing I noticed when debugging is that every ball entity seems to have its own chunk, while the Entity Debugger showed me multiple entities per chunk.

Maybe I’m doing something fundamentally wrong here, so here are my three systems in order of execution:

Context: Dynamic entities are balls, i.e. these systems only deal with ball/ball collisions. The rest of the world is static and handled by other systems.

Solved! Replaced the awkward chunk looping with ComponentDataFromEntity.
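
For reference, the replacement boils down to something like this (simplified sketch; CollisionEventData is my component, but the ColliderEntity field here is a made-up name):

using Unity.Entities;

public class DynamicCollisionSystem : SystemBase
{
    protected override void OnUpdate()
    {
        // read-write lookup into the other ball's data, no chunk iteration needed
        var balls = GetComponentDataFromEntity<BallData>();

        Entities.ForEach((Entity entity, in CollisionEventData collEvent) =>
        {
            var otherEntity = collEvent.ColliderEntity;   // hypothetical field
            if (otherEntity == Entity.Null) {
                return;
            }
            var otherBall = balls[otherEntity];
            // ... resolve the contact ...
            balls[otherEntity] = otherBall;
        }).Run();
    }
}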

Abstract Colliders

To resolve the collision between a ball and an object, I’ve implemented a [Collider](https://github.com/freezy/VisualPinball.Engine/blob/master/VisualPinball.Unity/VisualPinball.Unity/Physics/Collider/Collider.cs) struct. However, a collider can have different types. For example there’s a point collider and a plane collider. Both types contain additional type-specific data and obviously different logic (but the same interface).

In order to get an inheritance-like structure, my Collider struct just contains a header with the base data (including a type). The actual colliders (the “children”) contain the same header, plus their additional data at the tail.

When resolving collisions, the Collider is cast to the type defined in its header and then executed. Since the collider tree is part of a BlobAssetReference, I can allocate each collider based on its type.
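
Boiled down, the pattern looks roughly like this (a sketch with made-up collider payloads; the real structs carry much more data and more types, and the exact cast mechanism may differ):

using Unity.Collections.LowLevel.Unsafe;
using Unity.Mathematics;

public enum ColliderType { Point, Plane /* , ... */ }

// Shared header that sits at the start of every concrete collider.
public struct ColliderHeader
{
    public ColliderType Type;
    public int Id;
}

// "Base" collider: header only. The blob allocates the full-size child struct,
// so the memory behind a Collider is really a PointCollider, PlaneCollider, etc.
public struct Collider
{
    public ColliderHeader Header;

    public float HitTest(float3 ballPosition)
    {
        // dispatch on the type stored in the header and reinterpret the memory
        switch (Header.Type) {
            case ColliderType.Plane:
                return UnsafeUtility.As<Collider, PlaneCollider>(ref this).HitTest(ballPosition);
            case ColliderType.Point:
                return UnsafeUtility.As<Collider, PointCollider>(ref this).HitTest(ballPosition);
            default:
                return -1f;
        }
    }
}

// A "child" collider: same header first, type-specific data after it.
public struct PlaneCollider
{
    public ColliderHeader Header;
    public float3 Normal;
    public float Distance;

    public float HitTest(float3 ballPosition) => math.dot(Normal, ballPosition) - Distance;
}

public struct PointCollider
{
    public ColliderHeader Header;
    public float3 Position;

    public float HitTest(float3 ballPosition) => math.distance(Position, ballPosition);
}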

This works, but it might be really bad for performance. The relevant snippets are in the Collider code linked above.

Get/Set Component Data

Some systems read and write data that isn’t queried in Entities.ForEach. For example, a ball-flipper collision fetches and updates the flipper’s movement data (while Entities.ForEach loops through balls).

So, based on the collider type, I first fetch additional data with GetComponent<>(), update it, and write it back with SetComponent(). (Side note: a compiler warning about updated data not being written back would have saved me days of debugging!)
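
Concretely, the pattern looks roughly like this (sketch; the collision-event field names and the resolution math are made up):

// Sketch: while iterating balls, read and write the hit flipper's movement
// data by entity. GetComponent/SetComponent compile down to
// ComponentDataFromEntity lookups via codegen.
Entities.ForEach((ref BallData ball, in CollisionEventData collEvent) =>
{
    // hypothetical field: the entity of the flipper this ball is colliding with
    var flipperEntity = collEvent.ColliderEntity;
    var movement = GetComponent<FlipperMovementData>(flipperEntity);
    movement.AngleSpeed += 0.1f;              // placeholder resolution math
    SetComponent(flipperEntity, movement);    // forget this line and the update is silently lost
}).Run();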

How’s the performance of GetComponent and SetComponent? At least GetComponent is well documented, so I assume it’s not the bottleneck?

Then I’m also reading component data in the inner loop, i.e. not inside Entities.ForEach. Could this affect performance as well?


Thanks for reading if you made it this far. :slight_smile:

If you’ve identified an issue in the design, or you were courageous enough to crawl through the code and find a problem there, that’s awesome!

If you actually want to test this, clone the repo, create a new (built-in renderer) project in Unity, and add the cloned repo’s package.json as a local package in the Package Manager. Then use the new Visual Pinball menu to import a table (for example this one). “B” adds a new ball when playing.

Thanks in advance!

-freezy.

Have you profiled it? What in the jobs is taking all the time?

Yes, and nothing particular. There’s a chunk of WaitForJobGroupID, but otherwise it just seems like everything is executed slowly:

I mean within your jobs, using the profiler.

That said, from a quick glance at your code I don’t feel like it’s particularly optimal for the very high performance you seem to be seeking (too many gets), but I don’t have time atm to look into it closely, unfortunately.

Yeah, but sloppily. I added markers for every system, but I didn’t see them in the hierarchy and Burst was complaining about strings.

Let me do that properly!

// needs: using Unity.Profiling;
private static ProfilerMarker profileMarker = new ProfilerMarker("MySystem.Something");

protected override void OnUpdate()
{
    // copy to a local so the Burst-compiled lambda doesn't capture the static field
    var marker = profileMarker;
    Entities.ForEach((Entity entity) =>
    {
        marker.Begin();
        // code
        marker.End();
    }).ScheduleParallel();
}

Huh, I was using Profiler.BeginSample("name"). Using ProfilerMarker looks like a better way indeed.

By “get” you mean GetComponent()? Which particular system were you looking at?

That’s the whole point of ProfilerMarker!

I’ve added markers, but I can’t seem to find them in the profiler’s hierarchy view (I’ve also turned on Deep Profile).

However I’ve noticed that getting entities in the non-jobified/bursted SystemGroup seems to be slow:

See also the absence of the “FlipperDisplacementSystem” marker that is supposed to show up under the FlipperDisplacementSystem.

On second try, with actually working markers, there doesn’t seem to be a dominant system. I also noticed that WaitForJobGroupID takes twice as much time as the actual execution. That’s for a system that isn’t scheduled, but run on the main thread.

So out of the 0.59ms that the DynamicBroadPhaseSystem uses, only 0.08ms are actually spent in my code? How is this possible?

Hey! First of all nice job!

I’m running into the same issues, and I think deep profiling screws with the measurement.
Anyway, data acquisition with a low number of entities currently seems like a big bottleneck that only becomes apparent when you run several updates in one ECS update.
For my current project I fast-forward several thousand frames, up to some hours of simulated time, and I’m seeing the same behaviour.

I got my biggest improvements from creating ComponentSystemGroups (which you already did) and watching very carefully in the timeline how jobs are scheduled and whether there are dependencies that shouldn’t be there.
For example, my EnergySystem, totally unrelated to the AttackSystem, was waiting on its complete handle.
Putting it into another group solved this.
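
Schematically (my own class names, just to show the mechanism):

using Unity.Entities;

// Schematic: give the unrelated system its own group so it no longer sits
// behind the other group's combined job handle.
public class EnergyGroup : ComponentSystemGroup { }

[UpdateInGroup(typeof(EnergyGroup))]
public class EnergySystem : SystemBase
{
    protected override void OnUpdate()
    {
        // ... energy logic ...
    }
}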

Also, and that was the weirdest one: I had 3 ForEach in my EnergySystem, and together they accumulated to 100ms when measured. In isolation those ForEach needed 1, 2 and 12ms. I then split the 3 ForEach into 3 SystemBases and performance was better. Yeah, I don’t know how.

Another thing that was a big performance killer was any form of structural change. Don’t add/remove components if you can live with a component always being on an entity. I think empty IComponentData or tags don’t trigger structural changes.

Also, DynamicBuffers are a performance hog. In most cases I’m not sure how to work around DynamicBuffers but keep in mind that they are expensive. Joachim told us it might be better to cast the buffer to a NativeArray but I’ve never tested this. It might be worth it if you have more than 3-5 elements.

It looks like most of the time is spent waiting for other jobs. I suggest looking at this in the timeline profiler too.

On the topic of waiting for jobs. Is there a more sane way of debugging job scheduling and dependency chains than the timeline?

@Enzi Great stuff, thanks a lot. Yes, I’m using quite a few DynamicBuffers. Do you have a link about casting them to NativeArrays? I’m not adding or removing any data, however. The only structural changes are adding balls. I also only have one loop per system.

@Joachim_Ante_1 Yes, and there were a few gaps. However, none of my systems are currently scheduled, everything runs on the main thread.

I remember watching a tutorial from Mike Geig where he put all systems on the main thread because the scheduling overhead wasn’t worth the time actually spent in the job. Does that mean even then there’s still a significant overhead?

The timeline view will give you a much better visual of where that time is going and where you need to put profiler markers. If you haven’t yet, try profiling in a build.

DynamicCollisionJob is super weird… Using ComponentDataFromEntity would be far simpler and more efficient.

That might make things a bit faster, but for the collision code path, packing all the collider data into a native array before running the collision steps and doing collision detection against that array would make things a lot faster. Only when an actual narrow phase collision has been detected should the result be applied back to the entities, using ComponentDataFromEntity.
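
A rough sketch of that idea (illustrative names, not code from the project):

using Unity.Collections;
using Unity.Entities;

public class BallCollisionSystem : SystemBase   // illustrative name
{
    protected override void OnUpdate()
    {
        // 1. pack the ball data into plain arrays
        var query = GetEntityQuery(ComponentType.ReadOnly<BallData>());
        var entities = query.ToEntityArray(Allocator.TempJob);
        var balls = query.ToComponentDataArray<BallData>(Allocator.TempJob);

        var ballsFromEntity = GetComponentDataFromEntity<BallData>();

        Job.WithCode(() =>
        {
            // 2. broad and narrow phase run only against the packed array
            for (var i = 0; i < balls.Length; i++) {
                for (var j = i + 1; j < balls.Length; j++) {
                    // ... detect and resolve the contact between balls[i] and balls[j] ...
                }
            }

            // 3. write results back to the entities (all of them in this naive sketch)
            for (var i = 0; i < balls.Length; i++) {
                ballsFromEntity[entities[i]] = balls[i];
            }
        }).Run();

        entities.Dispose();
        balls.Dispose();
    }
}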

Thanks guys, you’re awesome.

@Joachim_Ante_1 I’ve replaced the awkward chunk loop with ComponentDataFromEntity, I didn’t even know about that one! Still trying to figure out what you mean by packing collider data into a native array, but I’ll figure it out somehow and tomorrow is another day. :wink:

Have you had a look at the flow visualization in Timeline view for schedule and sync flows yet? (Added in 2020.1, with further improvements coming in 2020.2 around the beta start.) It’s still lacking info on inter-job dependencies, but still… I would be interested to hear what could make that saner and what currently makes it less than optimal. (You can PM me too if that would derail this thread too much otherwise.)

Also, there is this thread for feedback on tooling to help debug DOTS.

Since I can’t put a NativeArray into a component, I’m still struggling to understand what @Joachim_Ante_1 meant. What I’ve understood is that there’s an issue with how I’m dealing with data. So here’s an as-brief-as-possible description of that. Maybe someone has ideas for improvement. All links point to the code of my implementation.

There are two types of collisions: Ball-ball collisions (dynamic), and ball-object collisions (static). Both are handled separately.

Static Collisions

Uses a quadtree, which is generated in a conversion system, saved as a BlobAssetReference in a data component, and attached to a singleton entity.

The quadtree references colliders by ID, and the actual colliders are also stored in another BlobAssetReference of a data component attached to a singleton entity. So here’s the flow:

  • In StaticBroadPhaseSystem we load the quadtree singleton entity to get the quadtree blob, loop through all ball entities and attach the matched collider IDs as a DynamicBuffer to the ball entity.
  • In StaticNarrowPhaseSystem we load the collider data singleton entity to get the collider data blob, loop through the matched collider IDs for each ball, simulate the collision and save the collider that is hit next. That collider is stored as an ID in the collision event data on the ball entity.
  • Finally, the StaticCollisionSystem loads the collider data blob again in order to retrieve the matched collider for each ball entity. Depending on the collider type, other component data might have to be fetched in order to resolve the collision (for example, when colliding with a flipper). That data might need to be updated as well (because a ball hitting a flipper will also move the flipper).
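
For the first step, the lookup is roughly along these lines (sketch; the blob and buffer type names are stand-ins for mine, and the overlap query is only shown as a comment):

// Sketch of the broad phase: fetch the quadtree blob from its singleton, then
// collect overlapping collider IDs into each ball's DynamicBuffer.
public class StaticBroadPhaseSystem : SystemBase
{
    protected override void OnUpdate()
    {
        // QuadTreeData is a stand-in for the singleton component holding the blob
        var quadTreeBlob = GetSingleton<QuadTreeData>().Value;

        Entities.ForEach((DynamicBuffer<MatchedColliderId> matchedColliderIds, in BallData ball) =>
        {
            matchedColliderIds.Clear();
            // hypothetical blob method: append the IDs of all colliders whose
            // AABB overlaps the ball's AABB
            // quadTreeBlob.Value.GetAabbOverlaps(in ball, matchedColliderIds);
        }).Run();
    }
}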

Dynamic Collisions

  • DynamicBroadPhaseSystem creates an octree from the balls, then loops through all balls to find other balls with overlapping AABBs. The octree generation takes place in a Job.WithCode; the result is then used by an Entities.ForEach for the matching. The matched balls are attached to the ball entity with a DynamicBuffer.
  • DynamicNarrowPhaseSystem loops through the previously updated DynamicBuffer. I’m using GetComponentDataFromEntity to fetch the ball data. The results are, as in the broad phase, written to the CollisionEventData of the ball entities.
  • DynamicCollisionSystem resolves the collisions using the collision events set before. Depending on the balls’ IDs, either ball A or ball B is updated.

If you’re only reading from a buffer, use DynamicBuffer.AsNativeArray; then it doesn’t have to run as many safety checks.
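
For example (sketch, with a made-up buffer element type):

// read-only pass: reinterpret the buffer as a NativeArray once, instead of
// going through the buffer's per-access checks
Entities.ForEach((DynamicBuffer<MatchedColliderId> matchedColliderIds) =>
{
    var colliderIds = matchedColliderIds.AsNativeArray();
    for (var i = 0; i < colliderIds.Length; i++) {
        // ... read colliderIds[i], don't write back ...
    }
}).Run();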
