Burst inspiration

Hi,

Love messing about with code trying to make things fast; just looking for inspiration on what I can try next.

Currently doing a Burst Mandelbrot (without Compute), but there must be some other things I can try, so I'm reaching out for inspiration. :slight_smile:

Make a small game.
Your goal: animate 1 million things at a minimum of 60 FPS.

Show us the performance before and after Burst and SIMD optimization.
Show us what you changed to increase performance.
Ideally, show the Burst-compiled code for the optimized lines.

1 Like

Ray Tracing (cpu)
https://raytracing.github.io/books/RayTracingInOneWeekend.html
(there are also several versions of that on GitHub, for Unity CPU or GPU, if you just want to take an existing one and make it faster)

2 Likes

Disregarding the stupid things in projects that bring everything to a crawl, the biggest performance culprit in real projects is going to be algorithms with greater than O(n) complexity. Those are especially valuable for optimizing.

So rather than making a million things move, think about 10,000 things each interacting with the 9999 others. XPBD simulations, or even just a general purpose collision detection broadphase are good candidates.

Sorting algorithms are also a great candidate. They are especially important for recovering determinism after running algorithms in parallel jobs.
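
To make the determinism point concrete, here is a minimal sketch (made-up names, assuming the Collections package's Sort extension): a parallel job tags each result with a stable key, and a later single-threaded job sorts by that key so the output order no longer depends on thread timing.

    using System;
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    // Hypothetical result type; 'sortKey' is assigned deterministically (e.g. a per-entity index).
    struct Result : IComparable<Result>
    {
        public int sortKey;
        public float value;
        public int CompareTo(Result other) => sortKey.CompareTo(other.sortKey);
    }

    [BurstCompile]
    struct RestoreOrderJob : IJob
    {
        public NativeArray<Result> results; // filled in nondeterministic order by parallel jobs
        public void Execute() => results.Sort(); // deterministic order recovered via the stable key
    }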

You could also get into really data-heavy algorithms such as audio DSP work. Or pathfinding.

I guess the biggest question I have is how practical do you want to be?

3 Likes

I decided to work on a sorting algorithm. Also tried intrinsics with that, but I think I’ll give that a miss for now!

Audio DSP: not sure how you would do real-time without making native plugins, and that's beyond me at the moment! One of my older projects was actually lossless audio compression; maybe that's something I could update to Burst one day. I suppose Burst would be great for a lot of things like de/compression and de/serialisation?

Always trying to crunch items/fps, but would this be something to work on with ECS, or can the performance be achieved by drawing indirect? I've little experience with ECS but will get into it when it's a stable release.

The project I've been stuck in for ages (it's turning into a test ground for many things) was originally about using audio to define a landscape, and then creating a per-frame localised collider based on position (the collider only needs to exist where there are physical objects). One of those random ideas that turns into many others! I suppose, simply, I could use quads as platforms (rather than a whole mesh, but maybe). I'm just trying to think how to make audio look good in three dimensions, which is kind of tricky really. I did crunch Keirijo's code on BurstFFT and make it faster. I could either make a slice per frame (which is 2D) or think of some way to make a spectrum 3D. I suppose octave is another dimension. I'll keep these formulating in my brain anyway!

Here was my original messing about for 3D audio, but this is with a 2D slice/construct (also based on AudioSource.GetSpectrum, which is not native). Maybe Burst has the capacity to update a lot of verts, but I'm not sure (the cost is in the reload); probably a ComputeShader thing really. There's no lighting in this but I got a nice effect just with colour! With more complexity, it might look quite interesting.

No plugin necessary. If you want, you can play with this code right here. It is a simple linear interpolator that initially looks easy to vectorize, but is actually incredibly difficult. I don't remember which version of Burst started auto-vectorizing it, but it was one of the more recent versions. I was really impressed when Burst started doing it, but despite using AVX, it only sped things up by 2X.
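
For context, the shape of the problem looks roughly like this (an illustrative sketch, not the code from the link): each output sample reads from a fractional position that advances by a non-integer step, so consecutive iterations hit data-dependent, non-contiguous input indices, which is what makes naive auto-vectorization hard.

    // Illustrative linear resampler, not the linked code.
    // Assumes 'input' has at least one extra sample of padding at the end.
    static void ResampleLinear(float[] input, float[] output, float rate)
    {
        float position = 0f;
        for (int i = 0; i < output.Length; i++)
        {
            int index = (int)position;      // whole-sample index
            float t = position - index;     // fractional part
            output[i] = input[index] + (input[index + 1] - input[index]) * t;
            position += rate;               // non-integer step per output sample
        }
    }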

Anyways, if you want to explore audio more, I can definitely help you get started.

Visualization looks awesome by the way!

Edit: What happened to the link? Oh well: https://github.com/Dreaming381/Latios-Framework/blob/v0.5.7/MyriAudio/Jobs/Sampling.cs#L105-L121

1 Like

In regard to my sorting job, I have some questions about multi-threaded tasks.

Is it efficient to compare/copy in a multi-threaded way? As in, if I'm comparing 4 massive separate arrays (each only with itself) across threads, is the read/write going to be any faster?

If these are single-index + for-loop jobs, how can I schedule them in parallel, and is it worth it (since they would be fighting for fast cache rather than being sequenced in one row)?

t1 t2 t3 t4
[ A1 ] [ A2 ] [ A3 ] [ A4 ]
[ B1 ] [ B2 ] [ B3 ] [ B4 ]

I have a theory that might just work (it probably already exists!): comparing bits 32 at a time, although with floats it will be a little complex because of the bit structure.
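
One common trick for the float structure (a sketch of the standard bit remapping, not necessarily what is needed here) is to remap the raw bits so that plain unsigned integer comparison matches the float ordering:

    // Standard radix-sort trick: map an IEEE-754 float to a uint whose unsigned
    // ordering matches the float's numeric ordering (NaNs aside).
    // Requires Unity.Mathematics for math.asuint.
    static uint ToSortableBits(float value)
    {
        uint bits = Unity.Mathematics.math.asuint(value);
        // Negative floats: flip all bits so larger magnitudes sort lower.
        // Positive floats: set the sign bit so they sort above all negatives.
        return (bits & 0x80000000u) != 0 ? ~bits : bits | 0x80000000u;
    }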

[image attachment: upload_2022-10-18_22-33-59.png]

Also, I'm wondering why this reads backwards; I thought a float/int would go the other way? To get the first byte of the float, I have to refer to byte[3].

[image attachment: upload_2022-10-18_22-40-43.png]

If the parallel scheduling cost is negligible relative to the workload, then the worst that can happen is you become 100% memory bound and your multi-threaded reads have nearly the same performance as a single thread. It is only with writes that you have to worry about false sharing, but in practice I have yet to see a parallel job choke on this in Unity, even with some crazy access patterns. I suspect that, due to the safety rules Unity imposes, pending writes on shared cache lines just get stacked in the store buffer, and since those cache lines are usually only written to and not read from, it doesn't slow things down at all.

The best case scenario is each thread working on a block a little smaller than the L2 cache, so every read hits L2.
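
As for how to schedule them in parallel, a minimal sketch (assuming one Burst-compiled IJob per array; the job and method names here are just illustrative):

    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    [BurstCompile]
    struct SortOneArrayJob : IJob        // illustrative name
    {
        public NativeArray<float> numbers;
        public void Execute() { /* sort 'numbers' in place */ }
    }

    // One job per array; the job system runs them concurrently on worker threads.
    static JobHandle ScheduleAll(NativeArray<float> a, NativeArray<float> b,
                                 NativeArray<float> c, NativeArray<float> d)
    {
        var handles = new NativeArray<JobHandle>(4, Allocator.Temp);
        handles[0] = new SortOneArrayJob { numbers = a }.Schedule();
        handles[1] = new SortOneArrayJob { numbers = b }.Schedule();
        handles[2] = new SortOneArrayJob { numbers = c }.Schedule();
        handles[3] = new SortOneArrayJob { numbers = d }.Schedule();
        var combined = JobHandle.CombineDependencies(handles);
        handles.Dispose();
        return combined; // call Complete() on this when the results are needed
    }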

It is called “endianness”.
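
A quick way to see it for yourself (a small sketch using plain .NET APIs): little-endian CPUs store the least significant byte first, which is why what you think of as the "first" byte of the float ends up at index 3.

    // BitConverter returns bytes in the machine's byte order.
    float f = 1.0f;                                  // bit pattern 0x3F800000
    byte[] bytes = System.BitConverter.GetBytes(f);
    // On a little-endian CPU (x86/x64 and most ARM targets):
    //   bytes[0] == 0x00, bytes[1] == 0x00, bytes[2] == 0x80, bytes[3] == 0x3F
    // i.e. the most significant byte (sign + exponent) sits at index 3.
    bool little = System.BitConverter.IsLittleEndian; // true on typical Unity platforms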

So perhaps I should have elaborated more on this problem when I proposed sorting to you.

In practice, very rarely is it valuable to sort one large list with a million elements. At max I think I have ever needed to sort about 50,000 for a command buffer that was written to in parallel and used an entityInQueryIndex sortKey (this is an ECS thing). What is a lot more common is having to sort results after a parallel bucketing algorithm. For example, you might need to sort all the values associated with a given key in a NativeParallelMultiHashMap. One of my use cases is for collision detection I use a coarse-resolution spatial hash and then sort objects in each hash bucket.

So in real use cases, I am much more likely to sort 50 buckets of 1000 objects each or 1000 buckets of 50 objects each (or some mix of sizes) rather than one large list of 50000 objects. The problem is already naturally parallelizable so the biggest performance wins will come from the algorithm and instructions generated.
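
Pulling the values for one bucket out of the map and sorting them looks roughly like this (a sketch; the map, key, and element types are illustrative):

    // Gather all values stored under 'key' in a NativeParallelMultiHashMap,
    // then sort that small bucket in place.
    var bucket = new NativeList<int>(Allocator.Temp);
    if (map.TryGetFirstValue(key, out int item, out var iterator))
    {
        do { bucket.Add(item); } while (map.TryGetNextValue(out item, ref iterator));
    }
    bucket.AsArray().Sort(); // each bucket is small, so the per-bucket sort is cheap
    bucket.Dispose();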

Usually for very small counts, a sorting network is optimal. For slightly larger, there’s insertion sort. For larger still, up to 300 - 1000 items depending on your key size, quick sort typically wins. But then above that there’s radix sort. Then one has to consider the overhead of dynamically choosing the appropriate algorithm, how to manage a payload associated with the key if there is one and if so how big, and then some situations require sorting to be stable which can invalidate certain techniques for certain size ranges.
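
For reference, the insertion-sort tier usually looks something like this (a generic sketch, not Unity's actual implementation):

    // Plain insertion sort over a raw pointer: the kind of loop that tends to win
    // for small-but-not-tiny counts.
    static unsafe void InsertionSort(float* data, int length)
    {
        for (int i = 1; i < length; i++)
        {
            float current = data[i];
            int j = i - 1;
            // Shift larger elements right until 'current' fits.
            while (j >= 0 && data[j] > current)
            {
                data[j + 1] = data[j];
                j--;
            }
            data[j + 1] = current;
        }
    }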

This is the reason why I asked “how practical do you want to be?” because a lot of parameters worth considering can only be derived from the use case. And picking real use cases is hard if you don’t have a real project with real performance bottlenecks. That’s also one of the many reasons why I try to make my projects and experiments open source so that people who don’t have such a project but still want to experiment have something to work off of.

@DreamingImLatios I did check out the various sorting algorithms that exist but didn't take the time to really understand them; I like going on my own tangent of discovery even if it leads to failure. Always learning things along the way.

I try to keep code simple and English-like. C# code is often split up so much that it reminds me of why C++ macros are apparently bad. The difference, though, is that you can write your own English macros. I so wish C# had a #define feature.

I like to be practical; applying it to use cases is another thing. But maybe I'll make something useful one day! I guess that is why I was looking for inspiration. Going with the flow of change is another thing too: trying to embrace what exists at the same time as taking my own stubborn route, hehe!

Well, here are my current sorting algorithm results, thereabouts (slightly less than my predictions, hehe), which I'm quite happy with. For me, messing around is learning more! Based on a pre-warm-up, 1000 loops.

[image attachment: upload_2022-10-28_0-53-46.png]

One interesting factor is that pointer++/-- doesn't seem to work the same way as it would normally, although normally that would be inside a fixed statement. I get some items not sorting, or crashes. Not sure what the difference is compared to assigning the pointer directly?

                        for (int j = 1; j < size + 1; j++)
                        {
                            ptr = &number[Ascender + j];
                            // ptr++; **Unstable**
                            if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                        }

Well, compared to the code snippet you have now, it seems your loop control logic must have changed as well, so it was probably something else in your loop control that was off by one.

Though on x86, not using increment operators may actually be beneficial: the way the code is written now, I could see Burst caching the address of &number[Ascender] and then using [A + kB] addressing. Hard to say without seeing the full code, though.

@DreamingImLatios It may be due to just running the code. My results above were wrong because I was using Execute for all of them (not just the lower arrays); I've now split it into Quick, and the longer arrays (16+) use Run, giving 3-4x.

[image attachment: upload_2022-10-28_8-51-57.png]

Here's the Quick code. Length 16 seems to be the tricky one to enhance using either approach.

        public unsafe void Quick()
        {
            float* number = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(numbers);

            float* ptr = number;

            int size = numbers.Length - 2;

            if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
            for (int i = 0; i < size; i++)
            {
                for (int j = 1; j < size + 1; j++)
                {
                    ptr = &number[j];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
                for (int k = size; k > -1; k--)
                {
                    ptr = &number[k];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
            }
        }

I noticed I can optimise this further, I think; it was a work in progress on top of my main code, which uses bitshifts and branches.

Is Native Sort Unity’s sorting implementation? And is all of this running in Burst without safety checks?

I guess I'm just surprised to see a cocktail sort beat out a quick sort. Though for 16 elements or fewer, Unity's sort uses Insertion Sort, which is a more competitive sort, though I suspect Unity's implementation may have some unnecessary overhead.

Well, I found that NativeSort exists, so I worked with that; not sure what else exists. All safety checks are off. Not really following any algorithm, just using my head and figuring out how different ideas work. Currently on about idea 20!

Also nice to know about the tuple swap, so tidy.

[image attachment: upload_2022-10-28_23-17-8.png]

Here are my current results. I'd love to understand why 256-512 gives almost 6x performance; it must be something to do with the overall size. I'll share my branching code when I'm happy with it, but the 0-16 code (and also the 0-8 branch code) is below. If this can be enhanced further, let me know! :slight_smile:

        public unsafe void Quick()
        {
            float* number = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(numbers);

            float* ptr = number;

            int lhs = -1, rhs = numbers.Length - 2;
            int check = rhs / 2; // Rounds down

            while (lhs < check)
            {
                for (int up = ++lhs; up != rhs + 1; up++)
                {
                    ptr = &number[up];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }

                for (int down = --rhs; down != lhs - 1; down--)
                {
                    ptr = &number[down];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
            }
            if (rhs - lhs != 0)
            {
                ptr = &number[lhs+1];
                if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
            }
        }

Ok. I just reread one of the earlier posts and realized you were only sharing the code for up to the 16 element test cases. That makes a lot more sense. Anyways, this might be a fun read for you: Vectorizing small fixed-size sort

1 Like

I'll have to give it a proper read. I did attempt intrinsics, but it was very slow, most likely my error! I figured Burst probably does a better job of optimising than I can think of, unless I really understood how to use intrinsics in a more optimal way than the compiler can manage from plain code. Not sure where my code fits into O(n), as it depends on branches. Basically, per bit depth, it sorts left and right; left becomes an additional branch until right becomes a dead end and jumps to the next branch. It took me 2 weeks to try and translate it from my brain! I just need to add the initial split for signed types, as they would be reversed. Maybe for larger arrays I could schedule 4 in parallel, but I have no idea if there would be a performance gain from doing that (they would be operating on 4 separate parts of the array).