I am trying to improve the performance of my terrain generation by using vectorized loops with Burst where possible. I’ve found the function Unity.Burst.CompilerServices.Loop.ExpectVectorized(), which asserts at compile time that a loop is being vectorized as expected, and that is working great.
Using that assertion, I’ve narrowed down the non-vectorizable code to my calls to noise.snoise(float2) from the Unity.Mathematics package.
Here is an example of what I am doing to generate the noise values:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void GenerateNoise(float* valuesPtr, int numValues)
{
    for (var i = 0; i < numValues; i++)
    {
        Unity.Burst.CompilerServices.Loop.ExpectVectorized();
        valuesPtr[i] = noise.snoise(new float2(i, 1));
    }
}
Doing something simple in that loop, like valuesPtr[i] = i;, definitely works with the assertion and runs as expected. I’ve pulled the source for the snoise function and started to inline it, which was mostly working. Before going to all of that effort, though, I wanted to ask for people’s thoughts:
- Should the noise.snoise calls be vectorizable by the Burst compiler? Am I missing anything here, or doing anything wrong that is stopping it from working?
- Is it worth heading down the path of inlining the noise code, or finding alternatives that support vectorization? Or is this a dead-end street that isn’t possible for some reason?
Thanks
The implementation uses types float2, float3, and float4. Those types typically break Burst auto-vectorization and sort of enter a “manual vectorization” mode. It is possible to write a vectorized version of this code, but it is not trivial.
In fact, the function is not simple at all. noise.snoise(new float2(i, 1)) is either inlined (expanded) or invoked as a call.
A vectorizable loop cannot have any jump/branch in its code execution, and snoise contains branches, so in both cases it’s non-vectorizable.
EDIT: Having reviewed the source, snoise does not contain a branch, but it’s just too complicated.
Source:
public static float snoise(float2 v)
{
    float4 C = float4(0.211324865405187f,  // (3.0-math.sqrt(3.0))/6.0
                      0.366025403784439f,  // 0.5*(math.sqrt(3.0)-1.0)
                      -0.577350269189626f, // -1.0 + 2.0 * C.x
                      0.024390243902439f); // 1.0 / 41.0
    // First corner
    float2 i = floor(v + dot(v, C.yy));
    float2 x0 = v - i + dot(i, C.xx);

    // Other corners
    float2 i1;
    //i1.x = math.step( x0.y, x0.x ); // x0.x > x0.y ? 1.0 : 0.0
    //i1.y = 1.0 - i1.x;
    i1 = (x0.x > x0.y) ? float2(1.0f, 0.0f) : float2(0.0f, 1.0f);
    // x0 = x0 - 0.0 + 0.0 * C.xx ;
    // x1 = x0 - i1 + 1.0 * C.xx ;
    // x2 = x0 - 1.0 + 2.0 * C.xx ;
    float4 x12 = x0.xyxy + C.xxzz;
    x12.xy -= i1;

    // Permutations
    i = mod289(i); // Avoid truncation effects in permutation
    float3 p = permute(permute(i.y + float3(0.0f, i1.y, 1.0f)) + i.x + float3(0.0f, i1.x, 1.0f));

    float3 m = max(0.5f - float3(dot(x0, x0), dot(x12.xy, x12.xy), dot(x12.zw, x12.zw)), 0.0f);
    m = m * m;
    m = m * m;

    // Gradients: 41 points uniformly over a line, mapped onto a diamond.
    // The ring size 17*17 = 289 is close to a multiple of 41 (41*7 = 287)
    float3 x = 2.0f * frac(p * C.www) - 1.0f;
    float3 h = abs(x) - 0.5f;
    float3 ox = floor(x + 0.5f);
    float3 a0 = x - ox;

    // Normalise gradients implicitly by scaling m
    // Approximation of: m *= inversemath.sqrt( a0*a0 + h*h );
    m *= 1.79284291400159f - 0.85373472095314f * (a0 * a0 + h * h);

    // Compute final noise value at P
    float gx = a0.x * x0.x + h.x * x0.y;
    float2 gyz = a0.yz * x12.xz + h.yz * x12.yw;
    float3 g = float3(gx,gyz);
    return 130.0f * dot(m, g);
}
Thanks for the quick replies. Do you know of any alternative noise functions/libraries/algorithms that support vectorisation?
At the scale I’m trying to generate for, this noise function has a decent performance impact.
This is pretty much the standard noise from GLSL.
If you want it to run faster, consider generating a noise buffer once and accessing it by index. It’s the same technique as a noise map.
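As a rough illustration of the noise-buffer idea, here is a minimal sketch. NoiseTable, Create, and Sample are hypothetical names, not part of Unity.Mathematics. It assumes a square, power-of-two table so that indices can be wrapped with a bit mask.

```csharp
using System;
using Unity.Collections;
using Unity.Mathematics;

// Hypothetical helper (not a Unity API): a square table of precomputed
// simplex noise values that is filled once and then read by index.
public struct NoiseTable : IDisposable
{
    public NativeArray<float> Values;
    public int Size; // assumed to be a power of two so wrapping is a bit mask

    public static NoiseTable Create(int size, float frequency, Allocator allocator)
    {
        if (size <= 0 || (size & (size - 1)) != 0)
            throw new ArgumentException("size must be a power of two", nameof(size));

        var table = new NoiseTable
        {
            Values = new NativeArray<float>(size * size, allocator),
            Size = size
        };

        // The expensive snoise calls happen only here, once per table entry.
        for (var y = 0; y < size; y++)
        for (var x = 0; x < size; x++)
            table.Values[y * size + x] = noise.snoise(new float2(x, y) * frequency);

        return table;
    }

    // Plain indexed load; coordinates outside the table wrap around.
    public float Sample(int x, int y)
    {
        var mask = Size - 1;
        return Values[(y & mask) * Size + (x & mask)];
    }

    public void Dispose()
    {
        if (Values.IsCreated) Values.Dispose();
    }
}
```

The trade-off is that the pattern repeats every Size samples and shows a seam where it wraps, because plain snoise is not periodic. Whether that is acceptable depends on how the terrain uses the values.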
It is difficult to suggest something because we don’t know how the noise is being used and what properties of this noise are strongly desired. If it were me and a 3-4x improvement was worth a few hours of effort, I would do the following:
1. Convert the float2, 3, and 4 into multiple floats, such as float2 v becoming float v_x and v_y.
2. Reimplement the function calls for dot products and the permute function.
3. Change the conditional operator to use math.select.
4. Check if autovectorization works. If not, continue.
5. Change v_x and v_y to be float4 instead of float.
6. Change local variables to also be float4 wherever the compiler complains about assigning a float4 to a float.
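The first three steps above might look roughly like the sketch below. ScalarNoise, SNoise, and Corner are hypothetical names. The mod289 and permute bodies are my assumption of what the library helpers do, based on the standard GLSL version. I have not verified that Burst actually vectorizes this, so it still needs the ExpectVectorized check.

```csharp
using Unity.Mathematics;

// Hypothetical scalar port of noise.snoise(float2): every float2/float3/float4
// from the quoted source is split into individual floats.
public static class ScalarNoise
{
    const float Cx = 0.211324865405187f;  // (3 - sqrt(3)) / 6
    const float Cy = 0.366025403784439f;  // 0.5 * (sqrt(3) - 1)
    const float Cz = -0.577350269189626f; // -1 + 2 * Cx
    const float Cw = 0.024390243902439f;  // 1 / 41

    // Assumed equivalents of the library's mod289 and permute helpers.
    static float Mod289(float x) => x - math.floor(x * (1.0f / 289.0f)) * 289.0f;
    static float Permute(float x) => Mod289((34.0f * x + 1.0f) * x);

    // Contribution of one simplex corner: p is its permuted hash,
    // (dx, dy) is the offset from that corner to the sample point.
    static float Corner(float p, float dx, float dy)
    {
        float m = math.max(0.5f - (dx * dx + dy * dy), 0.0f);
        m *= m;
        m *= m;
        float x = 2.0f * math.frac(p * Cw) - 1.0f;
        float h = math.abs(x) - 0.5f;
        float a0 = x - math.floor(x + 0.5f);
        m *= 1.79284291400159f - 0.85373472095314f * (a0 * a0 + h * h);
        return m * (a0 * dx + h * dy);
    }

    public static float SNoise(float vx, float vy)
    {
        // First corner
        float s = vx * Cy + vy * Cy; // dot(v, C.yy)
        float ix = math.floor(vx + s);
        float iy = math.floor(vy + s);
        float t = ix * Cx + iy * Cx; // dot(i, C.xx)
        float x0x = vx - ix + t;
        float x0y = vy - iy + t;

        // Other corners: math.select(a, b, c) returns b where c is true,
        // replacing the conditional operator from the original.
        float i1x = math.select(0.0f, 1.0f, x0x > x0y);
        float i1y = 1.0f - i1x;
        float x1x = x0x + Cx - i1x;
        float x1y = x0y + Cx - i1y;
        float x2x = x0x + Cz;
        float x2y = x0y + Cz;

        // Permutations, one per corner
        ix = Mod289(ix);
        iy = Mod289(iy);
        float p0 = Permute(Permute(iy) + ix);
        float p1 = Permute(Permute(iy + i1y) + ix + i1x);
        float p2 = Permute(Permute(iy + 1.0f) + ix + 1.0f);

        return 130.0f * (Corner(p0, x0x, x0y) + Corner(p1, x1x, x1y) + Corner(p2, x2x, x2y));
    }
}
```

In the loop from the question, the call would become valuesPtr[i] = ScalarNoise.SNoise(i, 1.0f). For the later steps, most of the math.* functions used here also have float4 overloads, and math.select accepts a bool4 mask, so switching the locals to float4 should mostly be a matter of changing types.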
Thanks for the detailed step by step, that will definitely help going down that path.
Is there any good place to see why the compiler chose not to vectorize the code? The ExpectVectorized assertion is helpful, but it doesn’t give you any indication of what specific code triggered it, or why. Which leaves a process of elimination to find the culprit.
What is the best practice for benchmarking these burst calls directly?
At the moment I’m measuring it loosely via calls from an ECS system’s Entities.ForEach, but that has somewhat variable performance depending on what else is going on in the game/Unity.
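One way to take the rest of the frame out of the measurement is to time a Burst-compiled job that is run synchronously on its own. This is only a rough sketch: NoiseFillJob and NoiseBenchmark are hypothetical names, and the warm-up and iteration counts are arbitrary.

```csharp
using System.Diagnostics;
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job wrapping the loop under test. CompileSynchronously asks
// the editor to finish Burst compilation before the first run.
[BurstCompile(CompileSynchronously = true)]
public struct NoiseFillJob : IJob
{
    public NativeArray<float> Values;

    public void Execute()
    {
        for (var i = 0; i < Values.Length; i++)
            Values[i] = noise.snoise(new float2(i, 1));
    }
}

public static class NoiseBenchmark
{
    // Returns the best (lowest) time in milliseconds over `iterations` runs.
    public static double MeasureMs(int numValues, int warmups = 5, int iterations = 50)
    {
        var values = new NativeArray<float>(numValues, Allocator.TempJob);
        var job = new NoiseFillJob { Values = values };
        try
        {
            // Warm-up runs keep first-call costs out of the timed runs.
            for (var i = 0; i < warmups; i++)
                job.Run();

            var best = double.MaxValue;
            var stopwatch = new Stopwatch();
            for (var i = 0; i < iterations; i++)
            {
                stopwatch.Restart();
                job.Run(); // executes immediately on the calling thread
                stopwatch.Stop();
                best = System.Math.Min(best, stopwatch.Elapsed.TotalMilliseconds);
            }
            return best;
        }
        finally
        {
            values.Dispose();
        }
    }
}
```

Taking the minimum rather than the average reduces the influence of other work on the machine. Editor-only safety checks can still distort the numbers, so figures from a player build are probably more trustworthy.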