On the topic from yesterday: for several days I had a design issue that I couldn’t figure out how to optimise into <16ms territory. I had huge spikes I couldn’t get rid of, caused by a huge number of instantiations. This topic crops up a lot on the forum: high-frequency events that only live for a short amount of time. I have run into it myself in the past. We should be able to use entities and components for this, but right now it is really slow in ECS, because either the instantiation is slow or enabling/disabling is, since that involves structural changes. Even TagComponents, which I read are not supposed to cause structural changes, are slow and unusable for this particular case.
DOTS devs have told us they will work on Enable/Disable or optimise the path in general, but honestly, I think the ability to set a maximum count of entities on an EntityQuery would be enough.
In my case, what works pretty well is instantiating a huge number of entities with components up front as a pool that I never delete, and then using only a slice of the available entities in the chunks in subsequent frames.
My solution with setting ChunkComponents is over-engineered though. I have to set the chunk component data, the chunks get scheduled, each chunk has to check its chunk component and then iterate only over the number of entities it is allowed to, which is often 0. So lots of overhead.
With the ability to set something like a maximum count in a query or an ArchetypeComponent (gasp), we could reduce the amount of work at the root. Having learned about ChunkComponents only recently, I wondered why no equivalent exists for a whole archetype: a place where we can store metadata. But even though this would be nice, the metadata alone in an ArchetypeComponent wouldn’t solve the problem of scheduling chunks and entities that aren’t even needed.
Hope this suggestion finds someone of note.
As always, open for discussion and further input.
If your chunk components already store the number of entities in each chunk, then you can iterate over the metachunks (an EntityQuery composed of your chunk component as a normal component and ChunkHeader) and build a list of ArchetypeChunk (from ChunkHeader). Then you run IJobParallelForDefer over that list.
Funny, I read the docs (again) today and found IJobParallelForDefer. From the documentation I didn’t get much out of it and there’s no sample. From what you’re describing it’s what I’m looking for.
It’s not as straightforward though, and I still have to calculate the number of entities and set the chunk components, which my suggested solution would skip. So while it’s not the perfect solution, it’s getting closer and it’s something I can implement in the current Entities version.
Basically you mean I should schedule an array of ArchetypeChunks instead of an EntityQuery, right? As I’m already iterating over the ArchetypeChunks and set the ChunkComponent I could use this array for scheduling.
I’m confused about 2 things:
I’m not sure what you mean with the ArchetypeChunk in relation to the ChunkHeader. Why would I need that?
I don’t really get what IJobParallelForDefer is for and how it differs from IJobParallelForBatch.
For what it is worth, Unity doesn’t know that information either. So you aren’t going to get any more performance out of such a feature request.
Not an array, a list. A list which only contains the chunks that you need to touch.
Chunk components don’t actually live in the chunks they are associated with. They live in metachunks and are packed together in component arrays. This means that iterating these arrays directly is way faster than jumping around to all their associated chunks. That’s important, because in a single-threaded job we can blitz through these arrays and build a list of the chunks that we actually need to touch in a parallel job. In this single-threaded job, we can build a NativeList<ArchetypeChunk>. We get the ArchetypeChunk pointers from the ChunkHeader, which is another array in the metachunks that we can iterate side-by-side with our chunk components. Once we have collected all the chunks, this is where IJobParallelForDefer comes in.
IJobParallelForDefer allows scheduling a job for each element of a yet-to-be-filled list. In other words, the number of indices in the job is calculated in another job. This allows you to avoid completing jobs while still being able to schedule parallel jobs on dynamically-sized lists. In this particular case, we can schedule an IJobParallelForDefer from the to-be-filled NativeList.
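For anyone following along, here is a minimal sketch of that two-stage idea in plain Python rather than Unity code. All names (`active_counts`, `chunk_handles`, `collect_chunks`, `process_chunk`) are made up for illustration. The two parallel lists stand in for the chunk component array and the ChunkHeader array, and the thread pool only stands in for a parallel job whose length is known after stage 1. It does not model Unity’s job dependencies or the deferred scheduling itself.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: one entry per chunk, stored side by side the way the
# chunk component array and the ChunkHeader array sit together in a metachunk.
active_counts = [0, 3, 0, 0, 2, 0, 1]   # per-chunk "entities to process"
chunk_handles = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
chunk_data = {name: list(range(10)) for name in chunk_handles}


def collect_chunks(counts, handles):
    """Stage 1 (single-threaded): linear scan over the packed metadata."""
    work = []
    for count, handle in zip(counts, handles):
        if count > 0:                    # skip chunks with nothing to do
            work.append((handle, count))
    return work


def process_chunk(item):
    """Stage 2 body: touch only the first `count` entities of one chunk."""
    handle, count = item
    return handle, sum(chunk_data[handle][:count])


def run_frame():
    work = collect_chunks(active_counts, chunk_handles)
    # The number of parallel work items is only known after stage 1 has run.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_chunk, work))


results = run_frame()
```

The point of the split is that stage 1 only reads small, densely packed metadata, so chunks with a count of 0 never reach the parallel stage at all.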
If you are still confused by this, I can share an example of how I computed a dual priority ranking mechanism for entities in parallel, where each entity receives a unique rank value based on its unique priority.
Hm, I know the count at that stage and as far as I see it, it would be just a more streamlined version of what you’re describing. I trust you on this though, knowing that you know far more than I do. Just on the topic of convenience and ease of use, would you not say it’s easier to set up than the whole thing with the chunks, headers and IJobParallelForDefer? Although this could all be wrapped up in an extension, I guess.
I think I understand the sequence now and why the defer job is important. You have explained it nicely.
For completeness, and maybe for others, it would be really nice if you posted a sample.
And one thing I also have to ask at this point. How did you get this knowledgeable about this? I feel like I did something really wrong, having worked with Entities for so long now but knowing so little. lol, I’m missing a lot of low-level stuff. Did you just read a lot of code and test things, or how did you get there? Genuinely interested. And thanks for helping us noobs out. Knowing is one thing, but sharing is another.
The whole point of chunk components is that the entity counts in the chunks don’t change, so the chunk components don’t need to be updated. Only you know that. Not Unity. So Unity would have to do the much simpler thing of running a single-threaded job through the chunk array and counting up entities until the threshold is reached. If that’s what you want, the secret sauce to build that little extension is ArchetypeChunkIterator, although you will need internal access to wrap it.
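A rough sketch of that “count up entities until the threshold is reached” pass, in plain Python and not tied to any Unity API. The function name `slice_chunks` and the example chunk sizes are assumptions for illustration only.

```python
def slice_chunks(chunk_counts, max_entities):
    """Walk the chunk array in order and return (chunk_index, entities_to_process)
    pairs that together cover at most `max_entities` entities."""
    plan = []
    remaining = max_entities
    for index, count in enumerate(chunk_counts):
        if remaining <= 0:
            break                        # threshold reached, skip the rest
        take = min(count, remaining)     # the last chunk may be only partially used
        if take > 0:
            plan.append((index, take))
        remaining -= take
    return plan


# Four chunks holding 128, 128, 128 and 40 entities, with a limit of 300 entities.
plan = slice_chunks([128, 128, 128, 40], 300)
```

Only the last entry in the plan can be a partial chunk, which matches the concern about a chunk that still holds stale entities from earlier frames.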
A lot of it is reverse-engineering Unity’s code. Recently I have been dissecting the Hybrid Renderer in order to build a custom skinning and animation solution. The hybrid renderer uses chunk components for both render bounds and lod selection (as well as others, but those were the two I cared about). The only reason I am trying to understand this stuff is so that I know what I can and cannot do without breaking Unity. But as I figure out these tricks, I quickly learn that I need similar tricks for my own implementations.
Is what you are trying to solve simply the ability to timeslice via EntityQuery?
We also already have support for the following in the next entities release:
IJobEntityBatch.RunWithEntities(NativeArray<Entity> entities, EntityQuery query); and variants
It tries its best to be fast, but if you care about 100k entities this will never be great, since the user first has to get the full entity array. Then RunWithEntities has to establish the ranges of entities to process. Of course the control it gives you is great.
Where basically IJobEntityBatch would only process the chunks in the range. The chunk range being a range against what EntityQuery.CreateChunkArray() returns. This would be pretty much zero overhead.
The downside is of course that it makes it so that users have to reason about slicing in chunks. And if the order changes due to add / remove component then some entities might get processed multiple times.
It’s really about tradeoffs of what is an acceptable performance bar here.
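To make the ordering caveat concrete, here is a tiny sketch in plain Python (not Unity code). The helper `chunk_range` and the chunk names are hypothetical. It only shows what happens when a fixed index range is applied to a chunk array whose order changed between two frames.

```python
def chunk_range(chunks, start, count):
    """Return the chunks a job restricted to [start, start + count) would see."""
    return chunks[start:start + count]


# Frame 1: process the first half of the query's chunk array.
chunks_frame1 = ["A", "B", "C", "D"]
seen_frame1 = chunk_range(chunks_frame1, 0, 2)

# Hypothetical structural change between frames moves chunk C to the front.
chunks_frame2 = ["C", "A", "B", "D"]
# Frame 2: process the second half, assuming the order is still the same.
seen_frame2 = chunk_range(chunks_frame2, 2, 2)

processed_twice = set(seen_frame1) & set(seen_frame2)
never_processed = set(chunks_frame1) - set(seen_frame1) - set(seen_frame2)
```

In this made-up case chunk B is handled in both frames and chunk C in neither, which is the kind of thing a user of a chunk-range API would have to reason about.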
In essence, yes. I have a chunk with data that acts as a pool to get around the CPU spike of instantiation/allocation. Sometimes a big burst of data is needed, and then only a fraction. This big chunk of data is not used every frame, but I leave it in place because the time when another big burst is needed is unknown and depends on what players/NPCs are doing. In any case, when the amount of data exceeds the pool, new entities are created in this chunk. Otherwise I leave this part of memory unchanged. No deleting. It’s only 45 MB in an extreme case, so I think it’ll be okay.
From the parameters I would reason that I can control which entities are processed within the query. This would mean some form of lookup into the array unless it’s perfectly aligned which would never be as fast as a simple count where I don’t care about order, just the amount.
Hm, so would this value imply the amount of chunks that are processed or the amount of entities within the archetype, spread over the chunks? In case it’s amount of entities, perfect.
But I think it’s the amount of chunks, right? This would certainly help somewhat. Only the last chunk needs to be checked then because it could have entities in it that are not supposed to be processed. They could have data from previous frames which is uncleared so that would be a problem.
For this to work I think the chunk.Count (or a new type of Count) has to be overridden by the EntityQuery in case the number of entities in a chunk exceeds the requested limit of entities that should be processed.
Yeah, this would be a no-go in this particular implementation. The archetype really has to stay constant and the order is unimportant.
Thanks for having this on your radar. I know it sounds quite funky but I think this could have huge implications about how we can optimise Entities when dealing with extreme amounts of events in a single frame. A spike would only occur once or not at all when the archetype is filled with a proper amount of entities in initialization.
I’ll be honest, I don’t understand why these events need to be entities. It honestly sounds like you would be better served by another data structure.
Interestingly though, this thread did bring to my attention a use case which the current Entities release doesn’t make easy. And that’s the ability to do single-threaded “find first (n) of” tasks, which does chunk iteration but can early out rather than iterating through all the chunks in a query.
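As a sketch of what such a “find first (n) of” pass looks like, here is a plain Python version that iterates chunk by chunk and returns as soon as it has enough matches. The function `find_first` and the sample data are assumptions for illustration, not an existing Entities API.

```python
def find_first(chunks, predicate, n):
    """Single-threaded chunk iteration that stops once `n` matches are found.
    Returns the matches and the number of chunks that had to be visited."""
    found = []
    chunks_visited = 0
    for chunk in chunks:
        chunks_visited += 1
        for entity in chunk:
            if predicate(entity):
                found.append(entity)
                if len(found) == n:
                    return found, chunks_visited   # early out
    return found, chunks_visited


chunks = [[1, 4, 6], [8, 3, 10], [12, 14], [16]]
matches, visited = find_first(chunks, lambda value: value % 2 == 0, 3)
```

With this data the search stops in the second chunk, so the remaining two chunks are never touched.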
Does that work with NativeList.AsDeferredJobArray()?
Well, why not? They are many and need to be processed somehow. Having them live inside Entities is the perfect place. The creation, where these events are fired, already uses “another” data structure, a NativeQueue for now, which then gets converted to an array, and a job fills up entities from the pool based on the creation data, together with additional data from components that serve as an acceleration structure for the 4 following jobs. Sure, I could iterate 3 or 4 times over the array, but that’s not what’s slow. Instantiation/allocation is, and as it turned out I can’t even get allocating a NativeArray as fast as EntityManager creates an Archetype. As this list can grow I’d need a NativeList, which is slower still.
Funny side anecdote, I built a NativeHashMap for lookups instead of utilizing ComponentDataFromEntity. Wanna guess how much difference it made? Zero. The whole of Entities is built on those NativeCollections anyway. Unless Unity has very unoptimized code paths, there is not much speed to be gained with “another” data structure. If anything, I’d question my whole data design, and I actually do that all the time. I’m just not coming up with anything better because RPGs are an entangled mess of lots of data.
Maybe I’ll end up using a simple array for these events. Right now I don’t see the benefit other than to ponder the question of is it better to use SoA or AoS. (I don’t care as long as one is faster) From testing, it made no difference though, so I’ll stick to Entities because that way I don’t have to write the same thing over and over again.
The important thing is, if we evade pain points constantly we never end up imagining new stuff that can help us in unforeseeable ways.
Oooph. This is some nasty tunnel vision that is really throwing a wrench in your ability to reason about the problem. First off, very little of entities actually uses NativeCollections. They are using raw memory management for most things. ComponentDataFromEntity is not a hashmap either. It is much better than that.
Not every problem can be mapped to Unity’s built-in mechanisms for optimal performance. Sometimes you have to make your own tools better suited for your problems. One of the reasons instantiation is so slow is because you are doing a single-threaded memcpy of 250,000 elements. What if you could parallelize that? Entities, NativeArray, and NativeList can’t do that. But NativeStream can. And if you need the data to be mutable, a chunk list could be what you are looking for.
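To illustrate the “what if you could parallelize that” part, here is a sketch of splitting one large copy into independent ranges. It is plain Python with hypothetical helper names (`split_ranges`, `parallel_copy`). Python threads will not give a real speedup here, so treat it purely as a picture of the partitioning, not as a benchmark or as how NativeStream works internally.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(length, workers):
    """Split [0, length) into at most `workers` contiguous, non-overlapping ranges."""
    step = max(1, (length + workers - 1) // workers)
    return [(start, min(start + step, length)) for start in range(0, length, step)]


def parallel_copy(src, dst, workers=4):
    """Copy src into a preallocated dst, one contiguous range per worker."""
    def copy_range(bounds):
        start, end = bounds
        dst[start:end] = src[start:end]   # same-length slice, ranges never overlap

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_range, split_ranges(len(src), workers)))


src = list(range(250_000))
dst = [0] * len(src)                      # allocated once, up front
parallel_copy(src, dst)
```

Because every worker writes to its own range of an already allocated destination, no synchronization is needed between the copies.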
You read my article, right? The whole point was that by considering all the requirements of what is both needed and not needed, you can make assumptions about the problem that open the door for new opportunities. Your problem is very specific, and you can’t expect Unity to write all the performance for you.
I went back to the other thread, and I think I know why it is slower. You are zero-ing out your NativeArray on allocation. There’s a third argument in the constructor to turn that off.
Do you really want entity pooling to be a common thing?
Why would you say that? Look, the whole thing I’m writing is 13k LOC now. If I give the impression of tunnel vision, it’s mostly because I would not be able to manage lots of changes at once, and I have to try one thing at a time to get results. As I’m not happy right now with the performance and how this works, it’ll change anyway at some point. For all intents and purposes I could write a C++ plugin, but I’d rather push Entities as far as I can and see what sticks and what doesn’t. That’s the process of learning and getting better, wouldn’t you agree? Nothing I say here is really meant to be definitive. It’s just my current state of knowledge and measurement. I’ve written a LOT of ECS code over the years and I’m still learning tricks and do’s and don’ts. I imagine in 5 years it won’t be much different. There are always new things to reflect on. Really, the gist is that I don’t want to give the impression that I’m forcing myself onto something out of habit, or that I expect Unity to write all the performance for me. I expect that only to the extent that I expect a compiler developer to do it for me.
Yeah, that was some stupid phrasing. What I meant was that using NativeCollections on my end leads to the same methods of allocating unmanaged memory. CDFE lookup is as fast as a NativeHashMap, if not faster. (I want to write a real test to prove this and compare memory usage.) I just wanted to point out that a form of optimisation I tried wasn’t an optimisation in the end. They are not the same but perform very similarly.
I plan to use this instead of the NativeQueue!
Chunk list as in, an archetype for that specific case? Isn’t that what my pooling use-case does?
I’m getting some interesting results here. When I init with Uninitialized memory the memcpy takes a long time (16.5ms instead of 3.5ms). Overall timings of init and memcpy are the same then, which is weird as this really shouldn’t happen? Hm.
Not in general. You’re a really active member, so I think you are aware of how often the topic of event systems crops up. It’s common enough in interactive media that we can’t just hand-wave the problem away, and there are NO solutions for it. I don’t think that’s doing Entities any good.
To get more specific and concrete, your space shooter demo, what’s your highest count of bullet instantiations in a single frame? Let’s say you have a server with 1000 players and all engage some form of boss which fires a huge amount of bullets. Can you spawn 250k bullets in a single frame? What would be your path of optimization when even your optimised CreateInstantiateCommandBuffer is not good enough anymore? It may be hypothetical but when you are able to optimise this extreme case it’ll also run faster for way less.
We have to get to a point where there are good answers for how to handle these crazy things, right? Even though it might be a bad example that will never happen, who’s to say someone won’t think of a game where something equivalent could happen?
When you ask “why not” to a solution that produces unsatisfactory results, that suggests to me that you are locked into that solution a little too hard. That’s tunnel vision. It happens to the best of us. And sometimes we need a little reminder to think outside the box. I was trying to provide that for you. Anyways, from your response and the fact that your code is 13k lines (which seems like a lot for the problem you are solving, and I am very much missing a lot of requirements and context to help you much), it may be time to step away from this particular part of the code and work on something else, so that you can come back to it with a fresh mind and some refreshed patience.
Thanks for clarifying! We’re on the same page here.
That’s likely because in that case the cache is cold. Anyways, if the allocation step is fast, the memcpy step can be parallelized to reduce latency.
I’m well aware of how often it comes up and how confusing it is. What a lot of people don’t realize is that the concept of an “event system” that they are used to from OOP just straight up doesn’t scale in DOTS and can’t really be made to scale. The closest I’ve seen is tertle’s, but even it has its tradeoffs. At the scale of DOTS, you need to break down “event systems” into more fundamental functionalities by analyzing your events’ persistence, quantity, mutability, size variability, and production/consumption patterns.
Right now it is a few thousand (single digits) for the levels I expect to run smoothly on most machines. But that isn’t my bottleneck currently. Being stuck with Burst 1.4.1 is preventing some optimizations to physics which is what hurts scale.
That’s over 25k xerktloprysma cannons all firing on the same frame. That’s GPU-sim territory. A little bit of deterministic noise to spread out the spawning over a dozen frames would be much more reasonable. But I am going to be more bottlenecked by other systems at that quantity, mainly physics and audio. 1000 players though is totally in the ballpark assuming I can compress and transfer data over the internet fast enough. A single AI ship is much more expensive than a player ship, and I currently run 10k ships on normal “expected to run well” missions.
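For what the “deterministic noise” idea could look like, here is a small sketch in plain Python. The function `spawn_delay`, the integer mixing constant, and the 12-frame window are my own assumptions for illustration, not taken from any actual game code. Each spawner gets a fixed frame offset derived from its id, so the same spawner always lands on the same frame and the burst is spread over the window.

```python
def spawn_delay(spawner_id, window=12):
    """Map a spawner id to a fixed frame offset in [0, window).
    Uses a simple multiplicative integer mix, so no random state is needed."""
    mixed = (spawner_id * 2654435761) & 0xFFFFFFFF
    mixed ^= mixed >> 16
    return mixed % window


def spawns_per_frame(spawner_count, window=12):
    """Count how many spawners fire on each frame of the window."""
    frames = [0] * window
    for spawner_id in range(spawner_count):
        frames[spawn_delay(spawner_id, window)] += 1
    return frames


frames = spawns_per_frame(25_000)
```

Since the offset depends only on the id, every client or server running the same code would compute the same schedule.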
If it is outrunning Myri by a significant margin (Myri runs its sampling during InstantiateCommandBuffer playback), then I would maybe look into modifying Entities to specify components for Unity to not initialize when instantiating entities, and then initialize all the components myself in a parallel job.
I would say whatever it is about that game that makes such a scenario plausible where instantiation is truly the only bottleneck, then there is likely something else about that game related to scale that would be key to the solution. Data-oriented design often means very specific solutions to very specific problems, rather than general solutions to general problems. Hypotheticals don’t mean a whole lot.
With all that said, I recognize you have a scenario where you need to spawn 250k of something, and the rest of the simulation supposedly runs fine after that spawn. I still don’t really understand what those 250k things even are, and that’s why I feel I can’t be very helpful. I need to know everything about the problem to truly be effective at optimizing. And even then I don’t always get it right on my first try. It takes patience.
NativeChunkedList doesn’t allow parallel writing so that’s not working out. I’ve rewritten some parts to NativeStream and the issue is again that a lot of time is lost with newing the NativeStream and the Allocates when writing to the stream. I’m paying 2ms just for newing a NativeStream with 250k forEachCount.
Right now I can’t find a way to make the NativeStream memory persistent so I don’t take the hit every time. If I reuse the same stream, the problem is that I can only write to an index once. There’s no Clear method. Looking at the source, I think I should be able to write one.
I’m looking hard for other data types that account for the following:
allow parallel writing
allow parallel reading
being persistent over multiple frames to act as buffer, with allocates only once or when being resized/exceeded
Right now the only thing that checks all these boxes without much hassle is Entities itself, although that has other issues, mainly in setting up data. Certainly not a solution that can stick.
I’m not seeing much speedup or loss with any solution. It’s hitting a ceiling on how fast it can go.
NativeArray = ArchetypeChunk > NativeStream > NativeList
Everything else comes down to avoiding cache misses, and I have a lot of them.
The really nice thing about NativeStream is that I can write an array to a certain index which solves a lot of unnecessary code for single and multi target spells. So that’s a big plus!
NativeLists cannot be resized during parallel writes, so at stages where the amount is unknown they can’t replace NativeStream.
Alright. I have three candidates in my head and I want to pick a winner before I spend any more typing on long descriptions. So far what I know about your problem is:
You need to create lots of elements in a single frame with as low latency as possible (preferably in a parallel job)
You need to initialize some aspect of those elements with unique data
You need to iterate through the elements
You need to perform random access lookup on those elements
NativeThreadIndex-determined order is acceptable
Currently you are using pooled entities to solve this problem. I will refer to these elements as entities to make these questions easier for you:
What percentage of the components require per-entity initialization during instantiation?
What percentage of the components inherit initialized values from prefabs?
What percentage of the components could be left completely uninitialized?
How many frames does a typical entity live?
Are any of the components system-state?
What percentage of the components do you typically iterate over at a time using chunk iteration?
What percentage of the components do you look up randomly using ComponentDataFromEntity?
Are all entities of the same archetype?
Do these entities contain DynamicBuffers of variable sizes? If so, is there a max size?
Are any of the components Unity-defined and interact with built-in Unity systems?
I have rewritten most crucial jobs to IJobParallel now. Entities isn’t used anymore. Performance is pretty much the same. Instead of using Entity Archetypes, NativeLists are now used. They solve the problem of allocation and can grow to a certain size to reduce spikes. There are 3 NativeStreams in place now which I will also replace with NativeLists when and where I can. The cost of allocation for NativeStreams is simply too much. If I could get around this it would help a lot.
Using a NativeList has helped me to reduce a bunch of lookups mainly because I have this data already in place when creating the array for further processing and setting to the “buffer chunk” isn’t possible when in the job. I had ideas to handle pointers to the buffer chunks but yeah, this sounded overengineered. So, because of this, I had to do this in parallel job afterwards. Saved some ms to get rid of this job just for setting.
So, I’ve reduced this a lot now. The main remaining points are to find better data structures to replace the NativeStream and maybe even the NativeList. The rest is really bottlenecked by the random CDFE lookups, which I think I cannot optimise away. Maybe dedicated jobs for gathering up this data would help, but honestly I don’t know. If I managed to gather everything up beforehand, the jobs would run a lot faster, that’s for sure. Maybe I can try it with one job to prove whether it helps enough to justify the work. The problem is always that the additional iteration and writing back also cost time, and sometimes that cost is higher than just eating the cost of the random lookup.
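The gather-first idea could be sketched like this, in plain Python with made-up data. The dictionary `health_by_target` only stands in for a random-access component lookup, and `gather` / `apply_damage` are hypothetical names. Whether the extra pass pays off is exactly the open question above; this only shows the shape of the split.

```python
# Stand-in for random-access component data keyed by target id.
health_by_target = {7: 100, 3: 40, 9: 75}
# Events as (target id, damage) pairs.
events = [(7, 10), (3, 5), (7, 20), (9, 1)]


def gather(events, lookup):
    """Pass 1: do all random lookups once and store the results linearly,
    in the same order as the events."""
    return [lookup[target] for target, _ in events]


def apply_damage(events, gathered_health):
    """Pass 2: purely linear work over two aligned arrays, no lookups."""
    return [health - damage for (_, damage), health in zip(events, gathered_health)]


gathered = gather(events, health_by_target)
remaining_health = apply_damage(events, gathered)
```

The cost of pass 1 plus any write-back has to be smaller than the lookups it removes from the later jobs, otherwise the split is not worth it.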
To answer your questions:
I don’t understand this one
4 right now, 2 are very small, 2 are bigger - they were set in a parallel job
None, they are all crucial for calculation.
1 frame, create->calculate->destroy
no
when they get processed, all of them
Sadly a bunch, which is hard to reduce. Things like team/health/effects of the target. There’s not really a linear way of getting this data once the targets have been determined. It’s around 3-5 for some jobs and also a huge reason why the jobs are running slowly.
yes
no - DynamicBuffers are the worst. In the current state of Entities I go as far as to say, never use them unless the entity amount is low. <1000
Instead of using a ForEach count equal to the number of entities, make the ForEach count equal to the number of chunks and use the BatchIndex in IJobEntityBatch.
The UnsafeParallelBlockList I wrote is likely very close to what you want. There are two things you would want to modify though. First, you probably would want to pool blocks rather than allocate them every time. That could be done with some clever per-thread shared statics. The other thing you would want is a per-block iterator to iterate over the elements in parallel.
But the benefits are many. There’s no range-tracking overhead from NativeStream. It is optimized for fixed-sized elements. It is parallel. Addition is crazy fast. The memory is stable, so you can store direct pointers to the data. And you can synchronize multiple instances. So instead of having multiple NativeStreams and NativeLists, you would be able to write directly to this in parallel and then do all your work from within it, saving you a lot of memcpy.
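To give a rough picture of the structure being described, here is a toy model in plain Python. It is not the actual UnsafeParallelBlockList. The class name `BlockListSketch`, its methods, and the per-thread free lists are assumptions for illustration. Each writer thread appends to its own chain of fixed-size blocks, and clearing returns the blocks to a per-thread pool instead of freeing them.

```python
class BlockListSketch:
    """Toy model: per-thread chains of fixed-size blocks with block pooling."""

    def __init__(self, thread_count, block_size):
        self.block_size = block_size
        self.per_thread = [[] for _ in range(thread_count)]   # blocks in use
        self.free = [[] for _ in range(thread_count)]          # pooled blocks

    def write(self, thread_index, value):
        """Append a value using only this thread's own blocks and pool."""
        blocks = self.per_thread[thread_index]
        if not blocks or len(blocks[-1]) == self.block_size:
            pool = self.free[thread_index]
            blocks.append(pool.pop() if pool else [])          # reuse before allocating
        blocks[-1].append(value)

    def blocks(self):
        """All blocks in use; each one could be handed to a separate worker."""
        return [block for chain in self.per_thread for block in chain]

    def clear(self):
        """Empty every block and return it to its thread's pool."""
        for chain, pool in zip(self.per_thread, self.free):
            for block in chain:
                block.clear()
                pool.append(block)
            chain.clear()


block_list = BlockListSketch(thread_count=2, block_size=2)
for value in range(5):
    block_list.write(0, value)
block_list.write(1, 99)
filled = [list(block) for block in block_list.blocks()]
```

Because each thread only touches its own chain and pool, writes need no locking, and the list returned by `blocks()` gives a natural unit for a per-block parallel iterator.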
I already utilized the chunkIndex as ID for the NativeStream index once, had to disable safety checks because for some reason it’s expecting to have the same range as the elements leading to hardly understandable error messages. I need to implement this for 2 of the NativeStreams again, for 1 I really need the entity index because the length is used in another calculation. (target handling)
Interesting to read about the UnsafeParallelBlockList. Do you have this available somewhere? I think starting from scratch I’d be overwhelmed.
Right now I’m pondering how to implement iterating on 2 chunks. Or if that makes sense even and one big one would be better. I split up one of the bigger archetypes but lookups are getting in the way and are probably much much slower. Hard to say what’s performing better without testing though. Have you ever experimented with 2 aligned chunks? Having the entities sorted would be crucial for them to align. The way I instantiate them they are aligning though.
I remember reading your spaceship had several entities. Did you utilize lookups from one to another or were they all contained so you didn’t need that?