# Vectorizing a loop.

I have very limited knowledge about SIMD. I would like to ask if a loop like this is vectorizable.

``````
for ( ; i < EDGE_ARRAY_SIZE; i++ )
{
    graphNodeIntraIndexArray[ i ] = graphIntraEdges[ i ];
    graphNodeInterIndexArray[ i ] = graphInterEdges[ i ];
    parentEdgeArray[ i ] = -1;
    parentPathIndexArray[ i ] = -1;
    gCostArray[ i ] = int.MaxValue;
    openArray[ i ] = false;
    closedArray[ i ] = false;
}
``````

The arrays are all the same size.

I tried doing this, but @DreamingImLatios informed me it is incorrect.

``````
int i = 0;
int vectorLoopSize = EDGE_ARRAY_SIZE - 4;
int4 parentValue = new int4( -1 , -1 , -1 , -1 );
int4 gCost = new int4( int.MaxValue , int.MaxValue , int.MaxValue , int.MaxValue );
bool4 falseVector = new bool4( false , false , false , false );
for ( ; i < vectorLoopSize; i += 4 )
{
int4 loopIndex = new int4(
i ,
i + 1 ,
i + 2 ,
i + 3 );

graphNodeIntraIndexArray[ loopIndex.x ] = graphIntraEdges[ loopIndex.x ];
graphNodeIntraIndexArray[ loopIndex.y ] = graphIntraEdges[ loopIndex.y ];
graphNodeIntraIndexArray[ loopIndex.z ] = graphIntraEdges[ loopIndex.z ];
graphNodeIntraIndexArray[ loopIndex.w ] = graphIntraEdges[ loopIndex.w ];

graphNodeInterIndexArray[ loopIndex.x ] = graphInterEdges[ loopIndex.x ];
graphNodeInterIndexArray[ loopIndex.y ] = graphInterEdges[ loopIndex.y ];
graphNodeInterIndexArray[ loopIndex.z ] = graphInterEdges[ loopIndex.z ];
graphNodeInterIndexArray[ loopIndex.w ] = graphInterEdges[ loopIndex.w ];

parentEdgeArray[ loopIndex.x ] = parentValue.x;
parentEdgeArray[ loopIndex.y ] = parentValue.y;
parentEdgeArray[ loopIndex.z ] = parentValue.z;
parentEdgeArray[ loopIndex.w ] = parentValue.w;

parentPathIndexArray[ loopIndex.x ] = parentValue.x;
parentPathIndexArray[ loopIndex.y ] = parentValue.y;
parentPathIndexArray[ loopIndex.z ] = parentValue.z;
parentPathIndexArray[ loopIndex.w ] = parentValue.w;

gCostArray[ loopIndex.x ] = gCost.x;
gCostArray[ loopIndex.y ] = gCost.y;
gCostArray[ loopIndex.z ] = gCost.z;
gCostArray[ loopIndex.w ] = gCost.w;

openArray[ loopIndex.x ] = falseVector.x;
openArray[ loopIndex.y ] = falseVector.y;
openArray[ loopIndex.z ] = falseVector.z;
openArray[ loopIndex.w ] = falseVector.w;

closedArray[ loopIndex.x ] = falseVector.x;
closedArray[ loopIndex.y ] = falseVector.y;
closedArray[ loopIndex.z ] = falseVector.z;
closedArray[ loopIndex.w ] = falseVector.w;
}
``````

I have tried testing by storing int4/float4 etc in the arrays, and that produces SIMD code. But is there any other way it can be done?

You have 7 operations in your loop.
Operations 1 and 2 are memcpy operations.
The rest are memset operations.

You really don't need a loop here.


How would I set the values of the array without a loop?

UnsafeUtility.MemCpy
UnsafeUtility.MemSet

Edit: actually, for this case you'd probably use
UnsafeUtility.MemCpyStride

Why? Doesn't MemCpyStride just let you write to every Nth array index, for cases where you are trying to interleave data? I don't see how that is better than using MemSet to set the memory to a literal, or possibly even MemClear for operations 6 and 7.


You are totally correct, bit of an early morning brain fart.

Thinking of

MemCpyReplicate
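
To make the "two copies, five fills" breakdown concrete, here is a minimal sketch in plain .NET rather than Unity, so it runs anywhere. `Array.Copy`, `Array.Fill`, and `Array.Clear` are only stand-ins for the `UnsafeUtility` calls named above, and the class name `LoopFreeInit` and method name `Reset` are made up for illustration. It assumes all arrays have the same length, as stated in the question.

```csharp
using System;

public static class LoopFreeInit
{
    // Resets the per-search arrays without writing an element-by-element loop.
    // Assumption: every array has the same length.
    public static void Reset(
        int[] graphIntraEdges, int[] graphInterEdges,
        int[] graphNodeIntraIndexArray, int[] graphNodeInterIndexArray,
        int[] parentEdgeArray, int[] parentPathIndexArray, int[] gCostArray,
        bool[] openArray, bool[] closedArray)
    {
        int length = graphIntraEdges.Length;

        // Operations 1 and 2: straight copies (the memcpy case).
        Array.Copy(graphIntraEdges, graphNodeIntraIndexArray, length);
        Array.Copy(graphInterEdges, graphNodeInterIndexArray, length);

        // Operations 3 to 5: one value repeated across the array (the replicate case).
        Array.Fill(parentEdgeArray, -1);
        Array.Fill(parentPathIndexArray, -1);
        Array.Fill(gCostArray, int.MaxValue);

        // Operations 6 and 7: false is all zero bytes, so a clear is enough.
        Array.Clear(openArray, 0, openArray.Length);
        Array.Clear(closedArray, 0, closedArray.Length);
    }
}
```

Inside a Burst job on NativeArrays, each of these lines would become one of the unsafe memory calls discussed in this thread instead.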

Thank you very much!

I did not know about memset and memcpy operations.

Let's say then that for operations 3-7, the values aren't uniform. How would you go about that?

It depends on what those values are and what patterns they exhibit. Unity provides quite a few powerful tools for doing these things.

OK, for instance, assume that for operations 3-7 the values assigned to the arrays are calculated in the loop from the loop counter.

Also how can I get around memset requiring the value to be a byte?

In that case we are back to the vectorization discussion unless you can pre-calculate these values.

UnsafeUtility.MemCpyReplicate. But for -1 you can just assign the byte the value 255. 255 as a byte is 0xFF, and -1 as an int is 0xFFFFFFFF.
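
The byte trick can be checked in plain .NET. This is only a sketch: `ByteFill` and `FillBytes` are hypothetical names, and `Span<byte>.Fill` stands in for a byte-wise memset over the array's memory.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ByteFill
{
    // Writes the same byte into every byte of the int array's memory,
    // which is what a byte-wise memset does.
    public static void FillBytes(int[] target, byte value)
    {
        Span<byte> raw = MemoryMarshal.AsBytes(target.AsSpan());
        raw.Fill(value);
    }
}
```

Calling `ByteFill.FillBytes(array, 0xFF)` leaves every element equal to -1, because all four bytes of -1 are 0xFF. The same trick does not work for `int.MaxValue` (0x7FFFFFFF), whose bytes are not all equal: filling with 0x7F gives 0x7F7F7F7F instead. That is the case where a replicate-style call is needed.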

OK, sweet! Thanks for being patient. But MemCpyReplicate seems like it copies from one array to another. The value is a void pointer. How can I just send in one value?

And about the vectorization, I am a little lost. On small test loops, unrolling the loop sometimes produces SIMD when assigning values from one array to another or when assigning a value to arrays, but sometimes it doesn't. I don't understand why my code above isn't vectorized. Even if the loop is unnecessary, doesn't Burst detect that I am doing the same operation 4 times on consecutive indexes?

The unary & operator grabs the address of a variable, including stack variables. You can cast the result of that to a void*.
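
As a rough illustration of passing a single value by address, here is a sketch in plain C# (it needs unsafe code enabled). `Replicate` is a hypothetical stand-in written for this example with the same destination, source, size, count parameter order as the Unity call; it is not the Unity implementation. `FillInt` is also a made-up helper.

```csharp
using System;

public static unsafe class ReplicateSketch
{
    // Hypothetical stand-in: copies `size` bytes from `source` into
    // `destination` `count` times, back to back.
    public static void Replicate(void* destination, void* source, int size, int count)
    {
        byte* dst = (byte*)destination;
        for (int i = 0; i < count; i++)
        {
            // Buffer.MemoryCopy takes the source first, then the destination.
            Buffer.MemoryCopy(source, dst + (long)i * size, size, size);
        }
    }

    public static void FillInt(int[] target, int value)
    {
        // `value` lives on the stack, so the unary & operator yields its address.
        void* source = &value;
        fixed (int* destination = target)
        {
            Replicate(destination, source, sizeof(int), target.Length);
        }
    }
}
```

Keeping one source variable per fill, with the size argument matching that variable's type, avoids reading more bytes than the variable actually holds.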

Auto-vectorization is fickle. It would help to know what Burst is doing with your original scalar, not-trying-to-vectorize-or-unroll-anything code. If you can isolate this loop into as small a job as possible, even if the job doesn't do anything useful, it can be a lot easier to figure out what Burst is doing. I use Burst Preview 1.3.0-3, and there's a tab called LLVM IR Optimi[ S ]ation Diagnostics which can give you some useful insight as to why it may not vectorize a loop automagically.

Try to give the small job its own small .cs file so you can share that file (inline it in the post, but make sure the line numbers match the actual file's line numbers) along with the "Copy to Clipboard" output of the Burst inspector assembly, with Enhanced Disassembly checked and Safety Checks unchecked.

Also, make sure you have a test-harness for this small useless job so that you can compare performance between versions with different changes.
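
A test harness does not need to be elaborate. The following is a minimal sketch in plain C#, with `TinyHarness` and `AverageMilliseconds` as made-up names. In a Unity project the action passed in would presumably schedule the job and wait for it to complete; that part is an assumption and is not shown here.

```csharp
using System;
using System.Diagnostics;

public static class TinyHarness
{
    // Calls `action` once as a warm-up, then `iterations` more times,
    // and returns the average wall-clock time per timed call in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up so one-time costs are not part of the measurement

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Running the same harness over each variant of the job, with the same inputs, gives numbers that can be compared directly.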

Also knowing what version of Burst you are using will be super helpful.

Edit: Accidentally tricked the bb interpreter.

I have isolated a loop and it's not showing me any SIMD instructions. In fact, I don't see any of the math operations in the assembly.

``````
.text
.def         @feat.00;
.scl        3;
.type        0;
.endef
.globl        @feat.00
.set @feat.00, 0
.intel_syntax noprefix
.file        "main"
.def         "Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7";
.scl        2;
.type        32;
.endef
.globl        "Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7" # -- Begin function Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7
.p2align        4, 0x90
"Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7": # @"Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7"
.Lfunc_begin0:
# %bb.0:                                # %entry
ret
.Lfunc_end0:
# -- End function
.def         burst.initialize;
.scl        2;
.type        32;
.endef
.globl        burst.initialize        # -- Begin function burst.initialize
.p2align        4, 0x90
burst.initialize:                       # @burst.initialize
.Lfunc_begin1:
# %bb.0:                                # %entry
ret
.Lfunc_end1:
# -- End function

.section        .drectve,"yn"
.ascii        " /EXPORT:Unity.Jobs.IJobExtensions.JobStruct`1<Test.Test2Job>.Execute(ref Test.Test2Job data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_9CFEF681FC23B6D7"
.ascii        " /EXPORT:burst.initialize"
``````
``````
[BurstCompile]
private struct Test2Job : IJob
{
NativeArray<int> a;
NativeArray<int> b;
NativeArray<int> c;

public void Execute()
{
int4 graphClusterSizeVector = new int4( 5 );
int4 graphCellLengthVector = new int4( 5 );
int4 clusterPositionXVector = new int4( 5 );
int4 clusterPositionYVector = new int4( 5 );
int4 loopIndex = new int4(4);
int4 localCol = new int4(4);
int4 localRow = new int4(4);
int4 clusterX = new int4(4);
int4 clusterY = new int4(4);
int4 localIndex = localCol + localRow * graphClusterSizeVector;
int4 graphCol = localCol + clusterX;
int4 graphRow = localRow + clusterY;
int4 graphArrayIndex = graphCol + graphRow * graphCellLengthVector;
int i = 0;
int vectorLoopLength = 5 * 5 - 4;
for ( ; i < vectorLoopLength; i ++ )
{
loopIndex = new int4( i , i + 1 , i + 2 , i + 3 );
localCol = loopIndex / graphClusterSizeVector;
localRow = loopIndex % graphClusterSizeVector;
clusterX = graphClusterSizeVector * clusterPositionXVector;
clusterY = graphClusterSizeVector * clusterPositionYVector;
localIndex = localCol + localRow * graphClusterSizeVector;
graphCol = localCol + clusterX;
graphRow = localRow + clusterY;
graphArrayIndex = graphCol + graphRow * graphCellLengthVector;
}
}
}
``````

I feel like this basic code should produce SSE.

Lololololololol.

You only have one assembly instruction: ret
Why? Well, because Burst realized that you weren't writing to anything other than local variables that were going to go out of scope. Because of that, it decided that none of the work you were doing was actually necessary and optimized everything out.

Try to make your inputs and outputs NativeArrays for this particular test, even if the lengths are going to be 1 and you hardcode access index 0.
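
For reference, this is roughly what a scalar, not-trying-to-vectorize version of the index math looks like when its results land in arrays the caller can see. It is a sketch in plain C# with managed arrays standing in for NativeArrays. The names `ScalarBaseline` and `Fill` are made up, and the loop bound of clusterSize * clusterSize is an assumption about the intended range.

```csharp
public static class ScalarBaseline
{
    // Computes the local and graph index for every cell of one cluster and
    // stores both, so the work has an observable result and cannot be discarded.
    public static void Fill(
        int[] localIndexArray, int[] graphIndexArray,
        int clusterSize, int cellLength,
        int clusterPositionX, int clusterPositionY)
    {
        int clusterX = clusterSize * clusterPositionX;
        int clusterY = clusterSize * clusterPositionY;
        int cellCount = clusterSize * clusterSize;

        for (int i = 0; i < cellCount; i++)
        {
            int localCol = i / clusterSize;
            int localRow = i % clusterSize;
            int localIndex = localCol + localRow * clusterSize;
            int graphCol = localCol + clusterX;
            int graphRow = localRow + clusterY;

            // Writes through the array parameters are visible to the caller.
            localIndexArray[localIndex] = localIndex;
            graphIndexArray[localIndex] = graphCol + graphRow * cellLength;
        }
    }
}
```

With a cluster size of 5, a cell length of 4, and a cluster position of (2, 2), the loop counter 7 gives column 1 and row 2, so it writes to slot 11.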


LOL wow. OK, will do…

``````
While compiling job: System.Void Unity.Jobs.IJobExtensions/JobStruct`1<PathfindingGraph/Test3>::Execute(T&,System.IntPtr,System.IntPtr,Unity.Jobs.LowLevel.Unsafe.JobRanges&,System.Int32)
at <empty>:line 0
.text
.def         @feat.00;
.scl        3;
.type        0;
.endef
.globl        @feat.00
.set @feat.00, 0
.intel_syntax noprefix
.file        "main"
.def         "Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01";
.scl        2;
.type        32;
.endef
.globl        __xmm@66666667666666676666666766666667 # -- Begin function Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01
.p2align        4
__xmm@66666667666666676666666766666667:
.long        1717986919              # 0x66666667
.long        1717986919              # 0x66666667
.long        1717986919              # 0x66666667
.long        1717986919              # 0x66666667
.globl        __xmm@00000005000000050000000500000005
.p2align        4
__xmm@00000005000000050000000500000005:
.long        5                       # 0x5
.long        5                       # 0x5
.long        5                       # 0x5
.long        5                       # 0x5
.globl        __xmm@00000032000000320000003200000032
.p2align        4
__xmm@00000032000000320000003200000032:
.long        50                      # 0x32
.long        50                      # 0x32
.long        50                      # 0x32
.long        50                      # 0x32
.text
.globl        "Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01"
.p2align        4, 0x90
"Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01": # @"Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01"
.Lfunc_begin0:
.seh_proc "Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01"
# %bb.0:                                # %entry
push        r15
.seh_pushreg 15
push        r14
.seh_pushreg 14
push        r13
.seh_pushreg 13
push        r12
.seh_pushreg 12
push        rsi
.seh_pushreg 6
push        rdi
.seh_pushreg 7
push        rbp
.seh_pushreg 5
push        rbx
.seh_pushreg 3
sub        rsp, 72
.seh_stackalloc 72
.seh_endprologue
mov        r13, rcx
movabs        rsi, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc_Ptr"
mov        ecx, 8016
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        rdi, rax
movabs        rbx, offset ".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected_Ptr"
lea        rcx, [rsp + 40]
call        qword ptr [rbx]
movabs        r12, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet_Ptr"
mov        r8d, 8016
mov        rcx, rdi
xor        edx, edx
call        qword ptr [r12]
mov        dword ptr [rdi], -1
mov        dword ptr [rdi + 8008], 1001
mov        rax, -8000
.p2align        4, 0x90
.LBB0_1:                                # %BL.0013.i.i.i
# =>This Inner Loop Header: Depth=1
mov        qword ptr [rdi + rax + 8008], 1001
jne        .LBB0_1
# %bb.2:                                # %"NativeMinHeap.Initialize(NativeMinHeap* this, int capacity, int infimum, int supremum)_E62019AE58A01108.exit.i"
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        r14, rax
lea        rcx, [rsp + 40]
call        qword ptr [rbx]
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        rdi, rbx
mov        rbx, rax
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        qword ptr [rsp + 64], rax # 8-byte Spill
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        r15, rax
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
xor        ebp, ebp
mov        r8d, 4000
mov        rcx, r15
xor        edx, edx
call        qword ptr [r12]
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        qword ptr [rsp + 56], rax # 8-byte Spill
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        ecx, 4000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        r15, r12
mov        r12, rax
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        r8d, 4000
mov        rcx, r12
xor        edx, edx
call        qword ptr [r15]
mov        ecx, 1000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        r12, rax
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        ecx, 1000
mov        edx, 4
mov        r8d, 2
call        qword ptr [rsi]
mov        rsi, rax
lea        rcx, [rsp + 40]
call        qword ptr [rdi]
mov        dword ptr [rsp + 40], -1
mov        byte ptr [rsp + 39], 0
movabs        r15, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate_Ptr"
lea        rdx, [rsp + 40]
mov        rcx, qword ptr [rsp + 64] # 8-byte Reload
mov        r8d, 4
mov        r9d, 1000
call        qword ptr [r15]
mov        dword ptr [rsp + 40], 2147483647
lea        rdi, [rsp + 39]
mov        rcx, qword ptr [rsp + 56] # 8-byte Reload
mov        rdx, rdi
mov        r8d, 4
mov        r9d, 1000
call        qword ptr [r15]
mov        rcx, r12
mov        rdx, rdi
mov        r8d, 1
mov        r9d, 1000
call        qword ptr [r15]
mov        rcx, rsi
mov        rdx, rdi
mov        r8d, 1
mov        r9d, 1000
call        qword ptr [r15]
movabs        rax, offset __xmm@66666667666666676666666766666667
movdqa        xmm0, xmmword ptr [rax]
movabs        rax, offset __xmm@00000005000000050000000500000005
movdqa        xmm1, xmmword ptr [rax]
movabs        rax, offset __xmm@00000032000000320000003200000032
movdqa        xmm2, xmmword ptr [rax]
xor        eax, eax
.p2align        4, 0x90
.LBB0_3:                                # %BL.0155.i
# =>This Inner Loop Header: Depth=1
lea        ecx, [rax + 1]
lea        edx, [rax + 2]
movd        xmm3, eax
lea        esi, [rax + 3]
pinsrd        xmm3, ecx, 1
pinsrd        xmm3, edx, 2
pinsrd        xmm3, esi, 3
pshufd        xmm4, xmm3, 245         # xmm4 = xmm3[1,1,3,3]
pmuldq        xmm4, xmm0
movdqa        xmm5, xmm3
pmuldq        xmm5, xmm0
pshufd        xmm5, xmm5, 245         # xmm5 = xmm5[1,1,3,3]
pblendw        xmm5, xmm4, 204         # xmm5 = xmm5[0,1],xmm4[2,3],xmm5[4,5],xmm4[6,7]
movdqa        xmm4, xmm5
psrld        xmm4, 31
movdqa        xmm4, xmm5
pmulld        xmm4, xmm1
psubd        xmm3, xmm4
movdqa        xmm4, xmm3
pmulld        xmm4, xmm1
pslld        xmm3, 2
movd        ecx, xmm4
movsxd        rcx, ecx
movd        dword ptr [r14 + 4*rcx], xmm4
pextrd        edx, xmm4, 1
movsxd        rdx, edx
pextrd        dword ptr [r14 + 4*rdx], xmm4, 1
pextrd        esi, xmm4, 2
movsxd        rsi, esi
pextrd        dword ptr [r14 + 4*rsi], xmm4, 2
pextrd        edi, xmm4, 3
movsxd        rdi, edi
pextrd        dword ptr [r14 + 4*rdi], xmm4, 2
movd        dword ptr [rbx + 4*rcx], xmm3
pextrd        dword ptr [rbx + 4*rdx], xmm3, 1
pextrd        dword ptr [rbx + 4*rsi], xmm3, 2
pextrd        dword ptr [rbx + 4*rdi], xmm3, 3
mov        rcx, qword ptr [r13]
movdqu        xmmword ptr [rcx + rbp], xmm3
cmp        rax, 100
jb        .LBB0_3
pop        rbx
pop        rbp
pop        rdi
pop        rsi
pop        r12
pop        r13
pop        r14
pop        r15
ret
.Lfunc_end0:
.seh_handlerdata
.text
.seh_endproc
# -- End function
.def         burst.initialize;
.scl        2;
.type        32;
.endef
.globl        burst.initialize        # -- Begin function burst.initialize
.p2align        4, 0x90
burst.initialize:                       # @burst.initialize
.Lfunc_begin1:
.seh_proc burst.initialize
# %bb.0:                                # %entry
push        rsi
.seh_pushreg 6
sub        rsp, 32
.seh_stackalloc 32
.seh_endprologue
mov        rsi, rcx
movabs        rcx, offset .Lburst_abort.function.string
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate_Ptr"
mov        qword ptr [rcx], rax
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc_Ptr"
mov        qword ptr [rcx], rax
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet_Ptr"
mov        qword ptr [rcx], rax
movabs        rcx, offset ".LUnity.Jobs.LowLevel.Unsafe.JobsUtility::get_IsExecutingJob.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::Create_Injected.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected.function.string"
call        rsi
movabs        rcx, offset ".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected_Ptr"
mov        qword ptr [rcx], rax
pop        rsi
ret
.Lfunc_end1:
.seh_handlerdata
.text
.seh_endproc
# -- End function
.section        .rdata,"dr"
.Lburst_abort.function.string:          # @burst_abort.function.string
.asciz        "burst_abort"

.lcomm        ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate_Ptr",8,8 # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate_Ptr"
".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate.function.string": # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate.function.string"
.asciz        "Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemCpyReplicate"

.lcomm        ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc_Ptr",8,8 # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc_Ptr"
".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc.function.string": # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc.function.string"
.asciz        "Unity.Collections.LowLevel.Unsafe.UnsafeUtility::Malloc"

.lcomm        ".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet_Ptr",8,8 # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet_Ptr"
".LUnity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet.function.string": # @"Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet.function.string"
.asciz        "Unity.Collections.LowLevel.Unsafe.UnsafeUtility::MemSet"

".LUnity.Jobs.LowLevel.Unsafe.JobsUtility::get_IsExecutingJob.function.string": # @"Unity.Jobs.LowLevel.Unsafe.JobsUtility::get_IsExecutingJob.function.string"
.asciz        "Unity.Jobs.LowLevel.Unsafe.JobsUtility::get_IsExecutingJob"

".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::Create_Injected.function.string": # @"Unity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::Create_Injected.function.string"
.asciz        "Unity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::Create_Injected"

.lcomm        ".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected_Ptr",8,8 # @"Unity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected_Ptr"
".LUnity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected.function.string": # @"Unity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected.function.string"
.asciz        "Unity.Collections.LowLevel.Unsafe.AtomicSafetyHandle::GetTempMemoryHandle_Injected"

.section        .drectve,"yn"
.ascii        " /EXPORT:Unity.Jobs.IJobExtensions.JobStruct`1<PathfindingGraph.Test3>.Execute(ref PathfindingGraph.Test3 data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_90ACBAFC044D0A01"
.ascii        " /EXPORT:burst.initialize"
``````
``````
[BurstCompile]
private struct Test3 : IJob
{
public NativeArray<int4> ints;

public unsafe void Execute()
{
int PATH_NODE_ARRAY_SIZE = 1000;

// Node sets
NativeMinHeap openSet = new NativeMinHeap();
openSet.Initialize( PATH_NODE_ARRAY_SIZE , -1 , PATH_NODE_ARRAY_SIZE + 1 );
NativeArray<int> localIndexArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );
NativeArray<int> graphIndexArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );
NativeArray<int> parentArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );
NativeArray<int> hCostArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp );
NativeArray<int> gCostArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );
NativeArray<int> fCostArray = new NativeArray<int>( PATH_NODE_ARRAY_SIZE , Allocator.Temp );
NativeArray<Blittable_Bool> openArray = new NativeArray<Blittable_Bool>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );
NativeArray<Blittable_Bool> closedArray = new NativeArray<Blittable_Bool>( PATH_NODE_ARRAY_SIZE , Allocator.Temp , NativeArrayOptions.UninitializedMemory );

int intValue = -1;
bool bValue = false;
void* pointer = ( void* ) &intValue;
UnsafeUtility.MemCpyReplicate( parentArray.GetUnsafePtr() , pointer , sizeof( int ) , PATH_NODE_ARRAY_SIZE );
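intValue = int.MaxValue;
// As posted: pointer now targets bValue, so the gCostArray fill below reads
// sizeof( int ) bytes starting at a one-byte bool instead of reading intValue.
// This is probably unintended.
pointer = ( void* ) &bValue;
UnsafeUtility.MemCpyReplicate( gCostArray.GetUnsafePtr() , pointer , sizeof( int ) , PATH_NODE_ARRAY_SIZE );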
UnsafeUtility.MemCpyReplicate( openArray.GetUnsafePtr() , pointer , sizeof( bool ) , PATH_NODE_ARRAY_SIZE );
UnsafeUtility.MemCpyReplicate( closedArray.GetUnsafePtr() , pointer , sizeof( bool ) , PATH_NODE_ARRAY_SIZE );

int4 graphClusterSizeVector = new int4( 5 );
int4 graphCellLengthVector = new int4( 4 );
int4 clusterPositionXVector = new int4( 2 );
int4 clusterPositionYVector = new int4( 2 );
int4 loopIndex = new int4();
int4 localCol = new int4();
int4 localRow = new int4();
int4 clusterX = new int4();
int4 clusterY = new int4();
int4 localIndex = localCol + localRow * graphClusterSizeVector;
int4 graphCol = localCol + clusterX;
int4 graphRow = localRow + clusterY;
int4 graphArrayIndex = graphCol + graphRow * graphCellLengthVector;
int vectorLoopLength = 100;
int i = 0;

for ( ; i < vectorLoopLength; i += 4 )
{
loopIndex = new int4( i , i + 1 , i + 2 , i + 3 );
localCol = loopIndex / graphClusterSizeVector;
localRow = loopIndex % graphClusterSizeVector;
clusterX = graphClusterSizeVector * clusterPositionXVector;
clusterY = graphClusterSizeVector * clusterPositionYVector;
localIndex = localCol + localRow * graphClusterSizeVector;
graphCol = localCol + clusterX;
graphRow = localRow + clusterY;
graphArrayIndex = graphCol + graphRow * graphCellLengthVector;

localIndexArray[ localIndex.x ] = localIndex.x;
localIndexArray[ localIndex.y ] = localIndex.y;
localIndexArray[ localIndex.z ] = localIndex.z;
// As posted: the .w slot receives localIndex.z rather than localIndex.w.
localIndexArray[ localIndex.w ] = localIndex.z;

graphIndexArray[ localIndex.x ] = graphArrayIndex.x;
graphIndexArray[ localIndex.y ] = graphArrayIndex.y;
graphIndexArray[ localIndex.z ] = graphArrayIndex.z;
graphIndexArray[ localIndex.w ] = graphArrayIndex.w;

ints[ i ] = graphArrayIndex;
}
}
}
``````

So one thing I forgot to mention is that when you use int4, float4, and bool4, you disable Burst's auto-vectorization capabilities. Instead, Burst replaces operations on those data types with individual SIMD instructions. So what you are doing is manually unrolling the loop and manually vectorizing your code.

With that said, you have SIMD instructions. Your for loop starts at .LBB0_3 and a lot of those instructions start with "p". You probably have a lot more instructions than you need for what you are trying to do. But you at least unlocked SIMD instructions, so now you need to start profiling and figure out what is faster.

What about vectorizing the writes to the arrays? It looks to me like they are being written one element at a time. I thought the point of unrolling a loop was so you can work on multiple array indices at once.

And yes, it seems like a lot, but each of those instructions works on 4 values at a time, right? Do you think this loop is even worth manually vectorizing?

``````
localIndexArray.ReinterpretStore(localIndex.x, localIndex.xyzz);
graphIndexArray.ReinterpretStore(localIndex.x, graphArrayIndex);
``````
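
One thing worth checking before swapping this in: a single vector store writes four consecutive elements starting at the given index, while the posted loop writes each lane to its own index. The two only agree when the four lane indices are consecutive. Here is a sketch of the difference in plain C#, with managed arrays standing in for NativeArrays and `StoreSketch`, `Scatter`, and `StoreContiguous` as made-up names.

```csharp
using System;

public static class StoreSketch
{
    // Per-lane scatter: each value goes to its own index, as in the posted loop.
    public static void Scatter(int[] target, int[] indices, int[] values)
    {
        for (int lane = 0; lane < indices.Length; lane++)
        {
            target[indices[lane]] = values[lane];
        }
    }

    // One block store: the values land in consecutive slots starting at `start`.
    public static void StoreContiguous(int[] target, int start, int[] values)
    {
        values.AsSpan().CopyTo(target.AsSpan(start, values.Length));
    }
}
```

With indices 4, 5, 6, 7 both methods produce the same array. With the index math from this thread, loop counters 0 to 3 map to local indices 0, 5, 10, 15, which are five apart, so a contiguous store starting at the first index would fill slots 0 to 3 instead.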