Garbage Collection, Allocations, and Third Party Assets in the Asset Store

AKA: PLEASE STOP FEEDING THE GC ANIMAL

Given that we are most probably stuck with the slow garbage collection in Mono 2.6 until Unity 5.0 ( great if we do get an upgrade in 4.x, but I don't hold out much hope ), I'm ruthlessly trying to minimize allocate/destroy cycles so the GC gets fed as little as possible. For a while I've been doing this in the Folk Tale code - although, looking at ObjectCullingManager, I evidently still have some work to do - and now I've turned my attention to third party components.

The profiler grab below illustrates what we're up against with components available in the asset store. The figures in red show how much is being allocated on the heap ( and will later have to be collected by the GC ). Some components do this every frame, others only when the functionality is triggered. While allocation is not necessarily an evil thing, allocating and discarding data for garbage collection every frame is.

[ Profiler screenshot: GC allocators ]
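Before getting into the individual assets, here's the distinction in code - a minimal sketch of my own ( not taken from any asset; assumes the usual UnityEngine and System.Collections.Generic usings ):

	// Allocated once and reused, instead of being re-created every frame.
	private List<Collider> hits = new List<Collider> ( 64 );
	
	public void Update ()
	{
		// BAD: List<Collider> hits = new List<Collider>();   // new heap object every frame, discarded for the GC
		
		// BETTER: clear and refill the cached list; no per-frame heap allocation.
		hits.Clear ();
		// ... populate 'hits' with whatever this frame needs ...
	}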

VLight is first up. VLight.OnWillRenderObject is allocating 8-10KB every frame in our particular scene. It looks like it could benefit from some caching.
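I don't have VLight's source to hand, so this is only a guess at the fix, but the usual pattern for per-frame allocations in OnWillRenderObject is to move any buffer creation into Awake and overwrite it in place ( 'frustumCorners' is a hypothetical name standing in for whatever VLight rebuilds each call ):

	private Vector3[] frustumCorners;
	
	public void Awake ()
	{
		frustumCorners = new Vector3[8];    // allocate the buffer once
	}
	
	public void OnWillRenderObject ()
	{
		// BAD: Vector3[] corners = new Vector3[8];   // a fresh array every call is food for the GC
		
		// BETTER: overwrite the cached array in place.
		for ( int i = 0; i < frustumCorners.Length; i++ )
		{
			frustumCorners[i] = transform.position;    // placeholder for the real per-corner maths
		}
	}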

UnitySteer's AutonomousVehicle.FixedUpdate is next up, allocating 7.1KB every frame, but that'll be path data, so it's expected. I think I'll actually replace UnitySteer with A*Pathfinding, as we no longer have a use for it.

A*Pathfinding allocates sizeable chunks when a path is calculated, but we have a very large grid graph at fine detail, so that's to be expected. Some effort to reduce it would still be welcome: we have a lot of characters moving around, so paths are calculated fairly regularly, and the KB allocated varies with path complexity. Perhaps have a pool of cached path points that is drawn from, and provide a repool function so we can return points that are no longer required?
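Something along these lines is what I have in mind - a rough sketch only; the names are made up and this is not A*Pathfinding's actual API ( assumes System.Collections.Generic and single-threaded use ):

	public static class PathPointPool
	{
		private static readonly Stack<List<Vector3>> pool = new Stack<List<Vector3>>();
		
		// Borrow a list of path points; only allocates when the pool is empty.
		public static List<Vector3> Claim ()
		{
			return pool.Count > 0 ? pool.Pop () : new List<Vector3> ( 128 );
		}
		
		// 'Repool' the list once the agent has finished walking the path.
		public static void Release ( List<Vector3> points )
		{
			points.Clear ();
			pool.Push ( points );
		}
	}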

EDIT ( 30 JULY 2012 ): Aron Granberg has further improved A*Pathfinding, and allocation has dropped to a consistent 21 bytes per frame. It will probably see general release in 3.2.

[ Profiler screenshot: CPU - A*Pathfinding and UnitySteer ]

Extending the investigation another day, it looks like some of Unity's own code might be contributing to the problem. The CharacterController code, for example, is allocating 10KB per update cycle.

And here's what GetComponentsInChildren() allocates. In this instance I'll have to make sure I cache and reuse the array where possible, so it isn't repeatedly handed to the garbage collector.

[ Profiler screenshot: CPU - GetComponentsInChildren ]
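For my own code the plan is simply to fetch the array once and hold onto it - a minimal sketch, assuming the set of children doesn't change at runtime ( Renderer is just an example component type ):

	private Renderer[] childRenderers;
	
	public void Awake ()
	{
		// One array allocation here, instead of one per call.
		childRenderers = GetComponentsInChildren<Renderer> ();
	}
	
	public void Update ()
	{
		// Reuse the cached array; calling GetComponentsInChildren here would
		// hand a fresh array to the GC every frame.
		for ( int i = 0; i < childRenderers.Length; i++ )
		{
			// ... work with childRenderers[i] ...
		}
	}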

I’d really like this thread to serve as encouragement for asset store authors to optimize their code to cache and pool as many objects and variables as possible to minimise any food for the evil gc monster. If other pro user community members are witnessing similar behaviour with other components not listed thus far, please post your profiler data and the function name, and notify the component author. Hopefully we can get more authors to think more about code performance.

WHERE TO OPTIMIZE
The greatest emphasis should be on allocating any reference-based heap objects once at the start and recycling them, to avoid feeding the garbage collector. Beyond that, here are a few points to help developers with further optimizations, backed by the profiler data posted later in this thread; they will have a much smaller impact.

  • Recycling heap objects will prevent the garbage collector kicking in
  • Recycling is faster than using new because it avoids calling the constructor ( ctor() )
  • Value types are subject to garbage collection when used in classes
  • Local variables are significantly faster than member variables
  • There is no performance difference between for and while loops
  • SqrMagnitude is ever so slightly faster than Vector3.Distance
  • For loops are much faster than foreach loops
  • Caching Component.transform, .rigidbody, .audio etc. is considerably quicker ( see the sketch below )
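To illustrate the last point, here's a minimal sketch of the caching pattern ( Unity 3.x/4.x style, where .rigidbody is still available as a shortcut property ):

	private Transform cachedTransform;
	private Rigidbody cachedRigidbody;
	
	public void Awake ()
	{
		// Resolve the component lookups once and keep the references.
		cachedTransform = transform;
		cachedRigidbody = rigidbody;
	}
	
	public void Update ()
	{
		// No per-access property lookup.
		cachedTransform.position += Vector3.forward * Time.deltaTime;
	}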

Disclaimer
These tests have been run on a PC desktop. The results may differ on other platforms and I encourage you to run your own tests.

Update
This thread is now being used to document allocations within the Unity API to assist both the community and UT following on from this topic. If you are thinking of posting a bug here, please read all posts from here onwards to check it hasn’t already been added, and be sure to file a bug report entitled “[Function Name] API call causes c# allocations”.

Keep in mind that value types (such as structs) do not allocate on the heap and therefore do not live in the land of the garbage collector. You could call “new Vector3()” hundreds of thousands of times per frame and not cause a single garbage collection. I’m not sure what real tangible performance impact constructing new structs has, but it absolutely will not manifest in a garbage collection.

Otherwise, in general I agree with your observations. Many people do not take care to reduce or eliminate heap allocations that cause GCs, which can be especially painful on mobile platforms. All assets that do work at runtime should strive to avoid all allocations beyond initialization and startup, otherwise they impact the performance of their customers’ games. It does take some extra work, but it results in a much better product.

Yup, perhaps I should make a clearer distinction between allocation and garbage collection for the benefit of others. I’ve amended the OP accordingly.

Allocation is not normally an issue if it's done once at the start rather than in per-frame functions; at that point it's mostly a raw performance concern. Recycling is much faster than allocating new, because the constructor doesn't get called ( including for structs ) and, for heap objects, there's no new allocation at all.

Feeding the gc monster with heap objects is best avoided because of the CPU spike when a collection runs.

Here’s the test code for performance of constructors v. recycling:

public Vector3 v3;

// Variant 1: ctor() called each time, self ms 1.38
public void Update ()
{
	for ( int i = 0; i < 100000; i++ )
	{
		v3 = new Vector3 ( 1f, 1f, 0.5f );
	}
}

// Variant 2: no ctor, self ms 0.88 ( run as a separate test, not in the same class as Variant 1 )
public void Update ()
{
	for ( int i = 0; i < 100000; i++ )
	{
		v3.x = 1f;
		v3.y = 1f;
		v3.z = 0.5f;
	}
}

// CONCLUSION: recycle Vector3s and Quaternions

Value types do not get garbage collected when they are on the stack. However, if they are allocated on the heap (by being part of a class, for example) they are of course garbage collected.

Just mentioning here that none of my stuff misbehaves like this. :) It’s pretty much “0 B” in the memory column for everything, except where actual new objects are created if necessary. Nothing on a continuous basis though.

–Eric

I’m going to run some tests so we have hard facts to support this discussion. I’ll be editing this post as I complete them.

Test Objective
Test @Jaimi’s statement about value types being garbage collected when part of a class

Case 1: Empty Class

	public class MyClass
	{
	}
	
	public void Update ()
	{
		int i;
		
		// garbage collector kicks in 
		for ( i=0; i<100000; i++ )
		{
			MyClass myObj = new MyClass ();
			myObj = null;
		}
	}

Outcome:

  • GC called less often.
  • 0.8MB collected each gc call

Case 2: Class With Struct

	public class MyClass
	{
		public Vector3 v3;
	}
	
	public void Update ()
	{
		int i;
		
		// garbage collector kicks in 
		for ( i=0; i<100000; i++ )
		{
			MyClass myObj = new MyClass ();
			myObj = null;
		}
	}

Outcome:

  • GC called more regularly
  • 1.9MB collected each gc call

Case 3: Class With Struct, Constructor Called

	public class MyClass
	{
		public Vector3 v3;
	}
	
	public void Update ()
	{
		int i;
		
		// garbage collector kicks in 
		for ( i=0; i<100000; i++ )
		{
			MyClass myObj = new MyClass ();
			myObj.v3 = new Vector3 ( 1f, 1f, 0.5f );
			myObj = null;
		}
	}

Outcome:

  • GC called regularly
  • myObj.v3 = new Vector3 has performance overhead
  • 1.9MB collected each gc call

Conclusion
Valid observation.
Structs do get allocated on the heap when they are part of a class, and are thus subject to garbage collection.

Speaking of performance, it’s best to use local variables where possible (also makes for easier-to-maintain code). Using the above example, change it to this:

public void Update ()
{
    Vector3 v3;

    for ( int i = 0; i<100000; i++ )
    {
        v3.x = 1f;
        v3.y = 1f;
        v3.z = 0.5f;
    }
}

and you should see a few ms shaved off.

–Eric

In this case, the Vector3 is directly on the stack and is not allocated or deallocated - only the stack pointer changes.

Test Objective
Are local variables quicker than member variables?

Case 1: Member Variable

	public Vector3 memberV3;
	
	public void Update ()
	{
		for ( int i = 0; i<1000000; i++ )
		{
			memberV3.x = 1f;
			memberV3.y = 1f;
			memberV3.z = 0.5f;
		}
	}

Outcome:
8.3ms per frame

Case 2: Local Variable

	public void Update ()
	{
		Vector3 localV3;
		
		for ( int i = 0; i<1000000; i++ )
		{
			localV3.x = 1f;
			localV3.y = 1f;
			localV3.z = 0.5f;
		}
	}

Outcome:
5.1ms per frame

Conclusion
Valid observation.
Local variables are considerably faster than member variables.

Test Objective
Where to define boundaries, inside or outside the loop.

Case 1: Boundary Definitions Inside Loop

	public class MyClass
	{
		public Vector3 v3;
	}	
	
	public List<MyClass> myList;
	
	public void Awake ()
	{
		myList = new List<MyClass>();
		for ( int i=0; i<1000000; i++)
		{
			myList.Add ( new MyClass() );
		}
	}
	
	public void Update ()
	{
		for ( int i=0; i<myList.Count; i++ )
		{
			myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
		}
	}

Outcome:

  • 23.8ms per frame

Case 2: Boundary Definitions Outside Loop ( aka hoisting )

	public class MyClass
	{
		public Vector3 v3;
	}	
	
	public List<MyClass> myList;
	
	public void Awake ()
	{
		myList = new List<MyClass>();
		for ( int i=0; i<1000000; i++)
		{
			myList.Add ( new MyClass() );
		}
	}
	
	public void Update ()
	{
		int i;
		int count = myList.Count;
		
		for ( i=0; i<count; i++ )
		{
			myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
		}
	}

Outcome:

  • 19.5ms per frame

Case 3: Mix Of Inside and Outside

	public class MyClass
	{
		public Vector3 v3;
	}	
	
	public List<MyClass> myList;
	
	public void Awake ()
	{
		myList = new List<MyClass>();
		for ( int i=0; i<1000000; i++)
		{
			myList.Add ( new MyClass() );
		}
	}
	
	public void Update ()
	{
		int i;
		
		for ( i=0; i<myList.Count; i++ )
		{
			myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
		}
	}

Outcome:

  • 25.1ms
  • Slightly surprising outcome, but doesn’t change our conclusion.

Conclusion
‘Hoist’ boundary definitions outside the loop.

One unfortunate thing about Mono is that it doesn’t seem to have the optimization that .NET has, where if you use “for (int i = 0; i < array.Length; i++)”, it removes the need for bounds checking in the array, since there’s no possibility of i being outside the bounds, so it’s actually faster than putting the initializer outside the loop. (Not sure if that applies to List.Count too, but I would expect so.)

–Eric

It looks like, at least for complex containers such as Lists, it’s quicker to move the boundary definitions outside the loop. The JIT optimizer should produce the outcome you describe for arrays, but we’d have to test that.
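If anyone fancies running it, a test along those lines might look like this ( two variants of the same loop over a plain array - I haven't profiled these myself ):

	private float[] values = new float[1000000];
	
	// Variant A: array.Length in the loop condition. .NET's JIT can elide the bounds
	// check here; whether Mono 2.6 does the same is exactly what we'd be testing.
	public void UpdateA ()
	{
		for ( int i = 0; i < values.Length; i++ )
		{
			values[i] = 0.5f;
		}
	}
	
	// Variant B: length hoisted into a local before the loop.
	public void UpdateB ()
	{
		int count = values.Length;
		for ( int i = 0; i < count; i++ )
		{
			values[i] = 0.5f;
		}
	}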

Perhaps other members could contribute their tests? These aren’t exactly scientific tests I’m executing.

I’ll have to watch this; I know the JIT optimizes a lot of this away in many of the cases listed above … C# … bleh … cripes, next I’ll be back to unrolling loops too :(

Test Objective
Is there a performance difference between while and for loops?

Case 1: For Loops

	public class MyClass
	{
		public Vector3 v3;
	}	
	
	public List<MyClass> myList;
	
	public void Awake ()
	{
		myList = new List<MyClass>();
		for ( int i=0; i<1000000; i++)
		{
			myList.Add ( new MyClass() );
		}
	}
	
	public void Update ()
	{
		int i;
		int count = myList.Count;
		
		for ( i=0; i<count; i++ )
		{
			myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
		}
	}

Outcome:

  • 20.99ms

Case 2: While Loop

	public class MyClass
	{
		public Vector3 v3;
	}	
	
	public List<MyClass> myList;
	
	public void Awake ()
	{
		myList = new List<MyClass>();
		for ( int i=0; i<1000000; i++)
		{
			myList.Add ( new MyClass() );
		}
	}
		
	public void Update ()
	{
		int i = 0;
		int count = myList.Count;
		
		while ( i<count )
		{
			myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
			i++;
		}
	}

Outcome:

  • 20.98ms

Conclusion
No difference, probably because for and while loops compile down to essentially the same code in .NET.

I wish I had the time to invest in this, as I love finding the most optimal use cases, but sadly I’m too busy.

However, I’m a little dubious about this test case. Repeatedly using the same values isn’t very ‘real-world’, and I’d wonder whether the compiler might be able to do some ‘unfair’ optimisations itself. In my opinion it would be more valid to assign x, y and z using some basic equation, so the values differ on each loop iteration and on every update.

It’s also strange, as I’m sure I profiled this myself in terms of pure performance and found the reverse was true; it was slightly counter-intuitive, as you’d expect the overhead of new Vector3 to add up. That may have been due to other factors, as it was a ‘real-world’ test. Like I said, I wish I had time to repeat my tests in order to provide evidence. Maybe I’ll find the time later.

Anyway, I think this is a great idea; it will be interesting to see what you discover.

It’s also a shame you can’t disable bounds checking in Unity/Mono. I’m still not clear exactly when it comes into effect, but when it does, removing it could certainly speed up some of my heavy loop/array code.

Of course, for real optimisations you’ll want to be examining the output MSIL code, though that’s not always fun ;)

Test Objective
Is there a speed difference between Vector3.Distance and SqrMagnitude?

Case 1: Vector3.Distance

	public Vector3 pointA = Vector3.zero;
	public Vector3 pointB = Vector3.one;
	
	public void Update ()
	{
		int i;
		float distance;
		
		for ( i=0; i<1000000; i++ )
		{
			distance = Vector3.Distance ( pointA, pointB );
		}
	}

Outcome:

  • 33.5ms per frame

Case 2: SqrMagnitude

	public Vector3 pointA = Vector3.zero;
	public Vector3 pointB = Vector3.one;
	
	public void Update ()
	{
		int i;
		float distance;
		
		for ( i=0; i<1000000; i++ )
		{
			distance = ( pointA - pointB ).sqrMagnitude;
		}
	}

Outcome:

  • 30.5ms

Conclusion
SqrMagnitude is a little faster than Vector3.Distance
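One practical note when swapping Distance for sqrMagnitude in range checks: the two return different quantities, so compare against the squared radius to keep the saving - a small sketch:

	public float detectionRadius = 10f;
	
	public bool IsInRange ( Vector3 pointA, Vector3 pointB )
	{
		// Comparing squared distance with squared radius avoids the
		// square root that Vector3.Distance performs internally.
		return ( pointA - pointB ).sqrMagnitude < detectionRadius * detectionRadius;
	}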

Your main point about garbage collection is a good one. I think getting into the little test cases is getting off topic and confuses the post.

“Structs do get allocated to the heap when part of a class, and thus subject to garbage collection.”

Structs may be allocated on the heap or on the stack. The decision is based on the expected lifetime of the variable ( is it short-term or long-term? ). It’s up to the compiler, but generally all local variables ( value or reference ) used in an iterator block are stored on the heap.

More detailed discussion at: http://blogs.msdn.com/b/ericlippert/archive/2010/09/30/the-truth-about-value-types.aspx

EDIT: Oops wrong link
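For anyone wondering what the iterator block case looks like in practice, here's a minimal sketch ( requires System.Collections.Generic ). The compiler rewrites the method into a hidden, heap-allocated state machine class, and the locals become fields of that object rather than living on the stack:

	// Each call allocates a compiler-generated iterator object on the heap;
	// 'current' and 'i' survive across yields as fields of that object.
	public IEnumerable<Vector3> WalkPath ( Vector3 start, Vector3 step, int count )
	{
		Vector3 current = start;
		for ( int i = 0; i < count; i++ )
		{
			yield return current;
			current += step;
		}
	}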

Test Objective
Noisecrime’s “real-world” test simulation - recycling v new Vector3()

Case 1: new Vector3()

	public Vector3 v3;
	
	public void Update ()
	{
		for ( int i = 0; i<1000000; i++ )
		{
			float x = Random.Range ( 0f, 100f );
			float y = Random.Range ( 0f, 100f );
			float z = Random.Range ( 0f, 100f );
			
			// ctor() called lots, 127.8ms
			v3 = new Vector3 ( x, y, z );
		}
	}

Outcome:

  • 127.8ms

Case 2: Recycling

	public Vector3 v3;
	
	public void Update ()
	{
		for ( int i = 0; i<1000000; i++ )
		{
			float x = Random.Range ( 0f, 100f );
			float y = Random.Range ( 0f, 100f );
			float z = Random.Range ( 0f, 100f );
			
			v3.x = x;
			v3.y = y;
			v3.z = z;
		}
	}

Outcome:

  • 108ms

Conclusion
Recycling is faster than new Vector3()

Something to note is that when calling a method in C#, all parameters passed are copied each time. Passing an object reference only requires 4 bytes to be copied ( on a 32-bit runtime ), but a struct has to copy all of its values, so passing a Vector3 struct will take roughly three times longer than passing an object reference. You would also get better cache coherence, as the original values will probably already sit in the CPU cache.

This may be trivial, but since we’re talking performance, it’s worth mentioning, especially if calls like that could happen thousands of times a second.
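A hedged sketch of the point - if a hot path really is dominated by the copy, 'ref' passes the struct by reference instead of copying its 12 bytes ( the method names here are just for illustration ):

	// Pass by value: the whole 12-byte Vector3 is copied into the method.
	public float LengthByValue ( Vector3 v )
	{
		return v.magnitude;
	}
	
	// Pass by reference: only a reference is passed; the caller's struct is used directly.
	public float LengthByRef ( ref Vector3 v )
	{
		return v.magnitude;
	}
	
	public void Example ()
	{
		Vector3 v = new Vector3 ( 1f, 2f, 3f );
		float a = LengthByValue ( v );
		float b = LengthByRef ( ref v );
	}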

Do you have time to send me a PM with a little code test so I can run the comparisons and upload the graphs?