Reinterpret NativeArray<int> to <v128>?

numbers = numbersIn.Reinterpret<v128>(sizeof(int) * 4)

// numbers - NativeArray<v128>
// numbersIn - NativeArray<int>

InvalidOperationException: Type System.Int32 was expected to be 16 but is 4 bytes

I thought this was going to be a walk in the park: just change the element size to 16 bytes. I hope I don’t have to load the ints into v128 manually!

https://docs.unity3d.com/ScriptReference/Unity.Collections.NativeArray_1.Reinterpret.html

I read the docs more closely and changed the argument to sizeof(int), but:

On a test array of 24 ints, numbers now has a length of 24 when it needs to be 6 x v128.

Assuming a NativeArray simply holds a pointer and a length, how do I change the element size from 4 to 16 bytes and the length from 24 to 6? That is easily done with void pointers, but I’m not sure how to do it in Unity. I’m trying to do things the safe way!

    NativeArray<int> array1 = new NativeArray<int>(24, Allocator.Temp);
    NativeArray<v128> array2 = array1.Reinterpret<v128>(sizeof(int));
    Debug.Log($"{array1.Length} {array2.Length}");

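For reference, here is a minimal sketch of how the Reinterpret call is meant to be used, as far as I understand the API. The argument is the expected size of the source element type (int, 4 bytes), which is why passing 16 raised the exception quoted above. The new length follows from the total byte count, so 24 ints (96 bytes) should come out as 6 v128 elements. The class and method names are made up for illustration.

```csharp
using Unity.Burst.Intrinsics;
using Unity.Collections;
using UnityEngine;

// Hypothetical helper, not from the thread: shows the size arithmetic only.
public static class ReinterpretExample
{
    public static void Run()
    {
        // 24 ints = 96 bytes, which divides evenly into 16-byte v128 values.
        var ints = new NativeArray<int>(24, Allocator.Temp);

        // Pass the size of the SOURCE element type (int), not the target size.
        NativeArray<v128> vectors = ints.Reinterpret<v128>(sizeof(int));

        // By the byte arithmetic: vectors.Length == ints.Length * 4 / 16.
        Debug.Log($"{ints.Length} ints -> {vectors.Length} v128");

        // Both arrays alias one allocation, so free it once via the original.
        ints.Dispose();
    }
}
```

If the int count is not a multiple of 4, the byte count does not divide evenly into v128 elements and the call is expected to throw instead, so pad the source array first.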


OK, I got further, but here’s what I don’t understand. With the commented-out line left disabled, so nothing carries over between loop iterations, it takes 300 ticks. With that line running so I can get the sum out, it takes 5000000 ticks.

I’m assuming 0+1 is much faster than 1000000+1, since that seems to be the only possible reason. But why? Are there more bits to add up?

        public void Execute(int i)
        {
            v128
                lefty = numbersV[i],
                right = new(),
                tally = new(),
                tTally = new();

            for (int idx = numbersV.Length-1; idx > -1; idx--)
            {
                right = numbersV[idx];

                v128 s2 = SSE2.shuffle_epi32(right, _1230);
                v128 s3 = SSE2.shuffle_epi32(right, _2301);
                v128 s4 = SSE2.shuffle_epi32(right, _3012);

                v128 c1 = SSE2.cmpgt_epi32(lefty, right);
                v128 c2 = SSE2.cmpgt_epi32(lefty, s2);
                v128 c3 = SSE2.cmpgt_epi32(lefty, s3);
                v128 c4 = SSE2.cmpgt_epi32(lefty, s4);

                v128 t1 = SSE2.add_epi32(c1, c2);
                v128 t2 = SSE2.add_epi32(c3, c4);
                v128 t3 = SSE2.add_epi32(t1, t2);
               
                tally = SSE2.add_epi32(tTally, t3);   // Empty tally
                //tally = SSE2.add_epi32(tally, t3);  // += tally
            }

            numbersOut[i] = tally;
        }

Or is it some crazy safety check based on the size? I thought 0+1 was no different from 1000000+1 in speed (well, actually these comparisons output -1, so it would be -1000000 + -1).

When you use the “Empty tally” line (tally = SSE2.add_epi32(tTally, t3)) instead of the commented-out “+= tally” line, each iteration of your loop completely replaces the value of tally without using anything from the previous iteration. This means only the last iteration does any actual work and the rest is useless.

The massive difference in timings suggests the Burst compiler noticed this too and optimized the loop away: it only needs to do the last iteration (with idx = 0) to obtain the same result. Across all Execute calls, the actual work is then done only n times for an input array of size n.

With the “+= tally” line instead, every iteration feeds the next, so the whole loop has to run for every element. The amount of work for n items is n², which is quadratic (not exponential) growth as your list gets bigger.
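To make the difference concrete, here is a small scalar sketch in plain C# (no Unity or intrinsics). It is an analogy, not the actual Burst output: the class and method names are invented, and the comparison returns -1 when true to mimic cmpgt_epi32.

```csharp
// Hypothetical scalar stand-in for the SIMD loop in the question.
public static class TallyLoops
{
    // Mirrors the "Empty tally" version: each pass overwrites tally,
    // so only the final pass (idx == 0) decides the result.
    public static int Overwrite(int[] values, int left)
    {
        int tally = 0;
        for (int idx = values.Length - 1; idx > -1; idx--)
        {
            int cmp = left > values[idx] ? -1 : 0; // -1 when true, like cmpgt
            tally = 0 + cmp;                       // previous tally is discarded
        }
        return tally;
    }

    // What a compiler is allowed to reduce Overwrite to: no loop at all.
    public static int LastPassOnly(int[] values, int left)
    {
        if (values.Length == 0) return 0;
        return left > values[0] ? -1 : 0;
    }

    // Mirrors the "+= tally" version: every pass depends on the previous one,
    // so all values.Length passes have to run.
    public static int Accumulate(int[] values, int left)
    {
        int tally = 0;
        for (int idx = values.Length - 1; idx > -1; idx--)
        {
            int cmp = left > values[idx] ? -1 : 0;
            tally += cmp;
        }
        return tally;
    }
}
```

For values { 1, 5, 9, 3 } and left = 4, Overwrite and LastPassOnly both return -1, while Accumulate returns -2. The first two always agree, which is why the loop can be dropped; the third cannot be shortened the same way.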


OK, that makes sense. Still, I have no idea how intrinsics can make things faster here; it seems pretty slow unless I’m missing something (safety checks are off). I’m also realising now that the inputs/outputs are backwards. Intrinsics is one thing, thinking backwards is another! But at least I made a load of consts backwards for shuffling (after realising they need a backwards control). Maybe I can use that part for re-arranging.

Get the pointer to the array first with Unity.Collections.LowLevel.Unsafe.NativeArrayUnsafeUtility.GetUnsafePtr, then use a SIMD load instruction on it: https://docs.unity.cn/Packages/com.unity.burst@1.6/api/Unity.Burst.Intrinsics.X86.Sse.load_ps.html
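A rough sketch of that pointer-based route, under a few assumptions: the job name and fields are hypothetical, the input length is a multiple of 4, and unsafe code is enabled for the assembly. Since the data here is ints rather than floats, it uses the integer load Sse2.loadu_si128 instead of the linked load_ps, and GetUnsafeReadOnlyPtr because the input is marked [ReadOnly]. Treat it as a starting point rather than a tuned implementation.

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using static Unity.Burst.Intrinsics.X86;

// Hypothetical job: copies ints into v128 values four at a time.
[BurstCompile]
public unsafe struct LoadVectorsJob : IJob
{
    [ReadOnly] public NativeArray<int> numbersIn; // assumed: Length % 4 == 0
    public NativeArray<v128> numbersOut;          // assumed: Length == numbersIn.Length / 4

    public void Execute()
    {
        // Raw pointer to the first int; read-only access matches [ReadOnly].
        int* src = (int*)numbersIn.GetUnsafeReadOnlyPtr();

        for (int i = 0; i < numbersOut.Length; i++)
        {
            int* p = src + i * 4; // four ints per v128

            if (Sse2.IsSse2Supported)
            {
                // Unaligned 128-bit integer load.
                numbersOut[i] = Sse2.loadu_si128(p);
            }
            else
            {
                // Scalar fallback for targets without SSE2.
                numbersOut[i] = new v128(p[0], p[1], p[2], p[3]);
            }
        }
    }
}
```

When the whole array only needs to be viewed as v128, the Reinterpret call from earlier in the thread avoids the copy and stays within the safety system; the pointer route is mainly useful for loads at arbitrary offsets.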