NativeArray approximately an order of magnitude slower than regular arrays

I made a little (amateurish) benchmark and got this:

    System.Diagnostics.Stopwatch stopWatch = new System.Diagnostics.Stopwatch();

    for (int i = 0; i <= 10; i += 2) {

        int length = i * 1000;

        float[] array = new float[length];
        NativeArray<float> nArray = new NativeArray<float>(length, Allocator.Temp);

        for (int a = 0; a < length; a++) {
            array[a] = (a % 2 == 0 ? -1 : 1) * a;
            nArray[a] = (a % 2 == 0 ? -1 : 1) * a;
        }

        long resultA = 0;
        long resultB = 0;

        stopWatch.Restart();

        for (int a = 0; a < length; a++)
            for (int b = 0; b < length; b++)
                array[a] += array[b];

        resultA = stopWatch.ElapsedMilliseconds;
        stopWatch.Restart();

        for (int a = 0; a < length; a++)
            for (int b = 0; b < length; b++)
                nArray[a] += nArray[b];

        resultB = stopWatch.ElapsedMilliseconds;

        nArray.Dispose();

        Debug.Log("length: " + length + "  array: " + resultA + "  /  nArray: " + resultB);
    }

length: 0 array: 0 / nArray: 0
length: 2000 array: 24 / nArray: 337
length: 4000 array: 97 / nArray: 1338
length: 6000 array: 212 / nArray: 2951
length: 8000 array: 389 / nArray: 5241
length: 10000 array: 608 / nArray: 8189

I understand that this will be mitigated by the Burst compiler in a Jobs situation, but I’d be interested in hearing people’s thoughts on it. Any insight as to why? Is it a beta thing that can be expected to go away?


NativeArray access from C# code outside of jobs is slower because of the extra bounds checks etc. This should not be done in builds.


Thanks for the clarification! With "this should not be done", do you mean that it's bad practice and I shouldn't do it, or that the overhead ought to disappear when I build the project? Sorry for being dumb ~

No, the compiler should only add the extra checks for NativeArrays in the editor. I think this is how it works.

Also, the point of using NativeArray isn't faster random-access speed; it's having a linear memory layout and no garbage collection at all.
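To illustrate that point, here's a minimal sketch of that use case, assuming the Unity.Collections package; `NativeBufferHolder` is a hypothetical wrapper name, not anything from Unity's API:

```csharp
// Sketch: a NativeArray holds unmanaged, contiguous memory that the GC
// never scans or moves — but you must free it yourself.
using Unity.Collections;

public class NativeBufferHolder : System.IDisposable
{
    NativeArray<float> buffer;

    public NativeBufferHolder(int length)
    {
        // Allocator.Persistent is meant for data that outlives a frame;
        // the memory is linear and produces zero GC pressure.
        buffer = new NativeArray<float>(length, Allocator.Persistent);
    }

    public void Dispose()
    {
        // Native memory is not garbage collected, so an explicit Dispose
        // is required or the allocation leaks.
        if (buffer.IsCreated) buffer.Dispose();
    }
}
```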

Sure, makes sense! But the ultimate point is performance, right… I'm having trouble actually getting any performance gains out of the Job System, because I'm bottlenecked by shuffling my data to a NativeArray and back again. My other idea was to just keep it in a NativeArray to begin with, but that got me here - basically all the logic interacting with that NativeArray becomes very slow. So at the end of the day I'm not really able to get any performance out of the system.

Perhaps my situation is just not what the job system is for.

That's because Unity isn't yet ready for pure ECS to be used effectively. We still need to convert from and back to managed objects in order to have some basic functionality, like rendering, sound, camera, etc. This should change as the Unity team releases new features compatible with the Job System, ECS, and the Burst compiler.

Also, NativeArrays have a lot of checks to guarantee some safety for us. These slow down access, but they should be removed when you build for production, outside the Editor, as @LennartJohansen pointed out.

Perhaps I should just wait then, and try to solve my problems with old-school C# threading in the meantime. There's a lot of potential in the Job System, but it might not really be viable outside of specific test scenarios just yet!

As for the safety checks, it’s of course good news that it’d be faster in the build, but you still have to have decent performance in the editor to be able to work on the project…

Hi, I have run my own test and got a result similar to yours in-editor (10x slower). RW is via the ++ operator and the element count is 100,000 ints.

array R 3060
nativeArray R 35090
array W 3070
nativeArray W 29991
array RW 3803
nativeArray RW 59456
nativeArray RW (Job) 59796
nativeArray RW (Job Parallel 1) 64967
nativeArray RW (Job Parallel 2) 54481
nativeArray RW (Job Parallel 4) 43390
nativeArray RW (Job Parallel 8) 51902
nativeArray RW (Job Parallel 16) 49467
nativeArray RW (Job Parallel 32) 47657
nativeArray RW (Job Parallel 64) 47050
nativeArray RW (Job Parallel 128) 47827
nativeArray RW (Job Parallel 256) 47104

However, when I run this on an Android device:

array R 8224
nativeArray R 13705
array W 3639
nativeArray W 9387
array RW 10176
nativeArray RW 17188
nativeArray RW (Job) 8859
nativeArray RW (Job Parallel 1) 61381
nativeArray RW (Job Parallel 2) 26555
nativeArray RW (Job Parallel 4) 14314
nativeArray RW (Job Parallel 8) 10844
nativeArray RW (Job Parallel 16) 7672
nativeArray RW (Job Parallel 32) 6644
nativeArray RW (Job Parallel 64) 6740
nativeArray RW (Job Parallel 128) 6800
nativeArray RW (Job Parallel 256) 7462

Don't pay attention to the difference between read and write; when I rerun the test, sometimes read is faster and sometimes write is. I don't know why, but RW should even everything out.

So

  1. Safety checks in the editor add a roughly 10× performance hit.
  2. In a job it performs better than out of a job, but it still loses to a plain array outside of a job. It probably looked better in a job because of data locality. (A regular array has its contents on the heap even if allocated in a local scope, and a NativeArray outside a job maybe requires some pointer dereferencing, whereas inside the job I think Unity made access more direct and more local.)
  3. This is still without Burst

(Blog with the code : [Unity ECS] Native container performance test vs normal array | by 5argon | Medium)


Safety checks in the editor have a significant performance cost.

The safety checks are disabled in the standalone player completely and in IL2CPP there is a fast path making builtin arrays and NativeArrays equally fast.

The real performance gains of NativeArray are leveraged through the Burst compiler, when writing primarily jobified code with the [ComputeOptimization] attribute. We expect developers to write any performance-sensitive code to run in a Burst job.

In burst the speed gains from using NativeArray are very significant. Usually on the order of 5-15x compared to il2cpp / mono.
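As a rough illustration of the recommended path described above, here is a minimal sketch of a Burst-compiled job over a NativeArray. It assumes the Unity.Burst, Unity.Collections, and Unity.Jobs packages; note that in shipped Unity versions the opt-in attribute is [BurstCompile] (the attribute name mentioned in this thread comes from early experimental builds), and `ScaleJob` is an illustrative name:

```csharp
// Sketch of the recommended pattern: performance-sensitive work lives in a
// job struct opted in to Burst compilation.
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct ScaleJob : IJob
{
    public NativeArray<float> Data;
    public float Factor;

    public void Execute()
    {
        // Inside a Burst-compiled job, NativeArray indexing compiles down to
        // raw memory access; the editor-only safety checks are not part of
        // this code path.
        for (int i = 0; i < Data.Length; i++)
            Data[i] *= Factor;
    }
}

// Usage from the main thread (schedule, then wait for completion):
// var job = new ScaleJob { Data = myArray, Factor = 2f };
// job.Schedule().Complete();
```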


I'm finding a similar slowdown. E.g., see Texture2D.GetRawTextureData()… when you deal with it as a generic type and return the native array, accessing the pixel data directly is HUGELY slower than accessing a regular byte array. Like 5 seconds versus 35 seconds. This unfortunately seems to render the function practically useless for anything other than convenience, and it is probably even slower than just using GetPixels()/SetPixels() to make a copy of the pixel data. This seems a bit silly to me. I was hoping for a much faster way to edit pixel data in system memory without having to involve a copy operation on the whole buffer, but the performance is really bad.

I am interested in this new functionality too (i.e., getting pixel data as a native array). When you say no speed gains, did you follow what was said by Joachim in the post before yours?

I will only get to it in 10 days when I return. If you find something out earlier, it would be great if you could share.

Why do you have to add an attribute to enable Burst compilation? I might have missed something, but I can't see any reason not to turn on Burst if it's possible to do so. If there are corner cases where it should be disabled, I think it makes more sense to have a [DisableComputeOptimization] attribute for those cases rather than the other way around.

Or is this simply something you have planned, but not gotten around to yet?

Burst is an experimental package and we want users to opt in for each job for the time being.


If someone uses something like NativeArrayUnsafeUtility.GetUnsafePtr() and uses unsafe (pointers-are-OK) code… would this present lower-level access to the internal "real" memory buffer (like struct data) of the native array, so that when you access elements like in an array there is none of the "overhead" of native arrays, and it performs at maximum access speed (like it would when using pointers on regular memory buffers)? Or will there still be some kind of behind-the-scenes middleman interpreting all the accesses? I.e., can you do away with all the bounds checking and interfaces and so on and just get "full-speed" access to the memory this way, even in the editor and without Burst or trying to write jobs?

And if so, how would you pin or 'fix' the memory so that the garbage collector doesn't try to move it?

So if you actually care about performance, you will use NativeArray together with Burst jobs.
In IL2CPP, performance of a builtin array and a NativeArray is on par; in Mono in a build, NativeArray is slower than a builtin array.

So as a simple rule, just use NativeArray + Burst jobs and you will get the best possible performance.

In Burst using NativeArray is faster than GetUnsafePtr() because we can guarantee aliasing rules.


If I recall, with jobs there's not really a 'safe' way to have multiple jobs/threads literally share access to the same memory buffer, i.e., with the potential of reading and/or writing the same byte in the same buffer 'unsafely'… which I actually want to be able to do in my case.

Performance tests so far report that the UnsafePtr approach is way faster and pretty much the same performance as regular int[] array access. Typically the native array comes in around 7 times slower (in the editor), while the unsafe-pointer version runs at about the same "full" speed as normal array access.
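For context, here is a minimal sketch of the pointer-access pattern being compared above, assuming Unity.Collections and that unsafe code is enabled in the player settings; `SumDirect` is an illustrative name. Note that a NativeArray's buffer lives in natively allocated memory rather than on the managed heap, which is why this pattern can skip the per-access safety checks:

```csharp
// Sketch: reading a NativeArray through its raw pointer, bypassing the
// per-access safety checks (editor / non-Burst code path).
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;

static unsafe long SumDirect(NativeArray<int> native)
{
    // GetUnsafePtr() returns the address of the actual native buffer.
    int* ptr = (int*)native.GetUnsafePtr();
    long sum = 0;
    for (int i = 0; i < native.Length; i++)
        sum += ptr[i]; // plain pointer arithmetic, no bounds checks
    return sum;
}
```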

I was able to pin the native array using:

ulong handle; // GC handle
int* pinned = (int*)UnsafeUtility.PinGCObjectAndGetAddress(native, out handle);
int* myints = (int*)NativeArrayUnsafeUtility.GetUnsafePtr(native);

These two return different addresses. I presume the first is the address of the "object" (which contains data and a pointer to some memory), while the second returns the address of the actual memory? Do you think this guarantees that the pointer to the actual memory is also fixed because the object itself is fixed? Or does the UnsafePtr internally fix the memory behind the scenes?

I'm also not sure if what I'm pinning here is the actual native array or just my reference to it?

(Also, should I use .SetAtomicSafetyHandle() for some reason to lock it down further, or is that just to get some kind of ownership over it so that other methods etc. can ask whether it's OK to also access the data at the same time?)

(note to self, objects larger than 85000 bytes are put on the large object heap and won’t be moved by the garbage collector).

Hey @Joachim_Ante_1, while I'm here… is there any reason why Unity still doesn't let you Apply() a texture with a rect? In my case I am having to split up a large texture into many smaller textures in order to avoid having to push the entire large texture to graphics memory each frame. If Texture2D.Apply() would simply take a rect, like Apply(new Rect(0, 0, 64, 64)), to upload only a small portion of the texture over the graphics bus, I would be able to just use one large texture and deal with which parts of it need uploading myself. This would open up a world of better performance for me, plus vastly fewer draw calls, because splitting up a big texture into small ones ramps up the draw call count pretty fast just to get the upload sizes smaller.

E.g., in OpenGL 1.x there was glCopyTexSubImage2D(), which does exactly this. Any chance we can get a version of Apply() that'll take a rect and only upload the part of the texture within the rectangle?


@imaginaryhuman_1 you don't seem to be testing with Burst; it's not very relevant to discuss what the performance is in the editor with Mono. It's not relevant for any code you want in a job.

PinGCObjectAndGetAddress is not how this is meant to be used.

Please follow the standard way Burst / C# jobs / NativeArray is supposed to be used together.

By default that is not allowed. There are various attributes to opt out and allow behaviour that is not provably safe & deterministic.

NativeDisableParallelForRestrictionAttribute, NativeDisableContainerSafetyRestrictionAttribute, NativeDisableUnsafePtrRestrictionAttribute can be used on containers on jobs to circumvent safety mechanisms.
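As a hedged sketch of how one of these opt-outs is applied (assuming Unity.Collections and Unity.Jobs; `ScatterJob` and its fields are illustrative names, not from the thread):

```csharp
// Sketch: [NativeDisableParallelForRestriction] lets a parallel-for job write
// outside its assigned index range — the safety system can no longer prove the
// job race-free, so correctness becomes your responsibility.
using Unity.Collections;
using Unity.Jobs;

struct ScatterJob : IJobParallelFor
{
    // Without the attribute, a parallel-for job may only write to
    // Output[index]; with it, arbitrary indices are allowed.
    [NativeDisableParallelForRestriction]
    public NativeArray<float> Output;

    [ReadOnly] public NativeArray<int> TargetIndices;

    public void Execute(int index)
    {
        // Scattered write: two indices could map to the same target,
        // which is exactly the kind of race the default check prevents.
        Output[TargetIndices[index]] = 1f;
    }
}
```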

When following the recommended path it is very rare that you need those.


@Joachim_Ante_1 I see.

Are you saying then that if I use the Job System, I should be able to access the native array without any hacking around, and read/write from/to it for massively intensive 'pixel' operations, and it will run at the same performance level as the GetUnsafePtr() method… but with that performance level in the build, not in the editor?

And if I switch over to using jobs then, how would I mark the Unity-created native array from a Texture2D as having a disabled safety restriction, so I can deal with the race conditions "manually"?

You're winning me over to the Job System, but I need reassurance that I'm not going to see LESS performance as a result of working with the texture's native array this way, and seeing it run so much slower in the editor is a bit off-putting.