Centralized v. Decentralized Update

Seon from Sector3 Games has mentioned that he uses a single Update and a single FixedUpdate in his games, and that this results in a significant performance increase.

This seemed unlikely, since the engine has to traverse the hierarchy regardless (a great opportunity for it to cache component routines), but I thought I’d give it a quick test, and lo and behold… he’s right.

I really don’t want to uproot my pretty OO design, but the performance gain in a very primitive vector copy at 10K calls was about 10:1, cached v. component.

WTF? Either I am going to have to have a really big Update, or I am going to have to manually cache my update calls and kill the MonoBehaviour dependency in nearly all of my classes. What a pain in the ass!
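For anyone weighing the same trade-off, here is a minimal sketch of the centralized pattern being discussed: one MonoBehaviour owns the engine callback and fans it out to plain C# objects with direct calls. The names (UpdateManager, IUpdatable, Tick) are my own, not from the attached project.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Plain C# contract; implementors carry no MonoBehaviour dependency.
public interface IUpdatable {
	void Tick (float deltaTime);
}

// The single MonoBehaviour that receives Update from the engine
// and dispatches it with direct calls.
public class UpdateManager : MonoBehaviour {
	readonly List<IUpdatable> updatables = new List<IUpdatable> ();

	public void Register (IUpdatable u)   { updatables.Add (u); }
	public void Unregister (IUpdatable u) { updatables.Remove (u); }

	void Update () {
		float dt = Time.deltaTime;
		// One engine-to-script transition per frame; the rest is a tight loop.
		for (int i = 0; i < updatables.Count; i++)
			updatables[i].Tick (dt);
	}
}
```

Interface dispatch still costs a virtual call per object, so on very hot paths a concrete sealed class in a plain array may shave a bit more.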

Thanks for the heads up on this Seon!

The test project is attached… see for yourself.

169294–6098–$updateperformance_823.zip (482 KB)

Here’s some centralized performance feedback for anyone who is interested:

Some of the results surprise me; I expected more overhead with locals and allocs, and less for static vector operations (creating temps, I suppose).

In short, though, Vector3 operations are expensive, which cripples simple tasks like syncing sprites to physics objects.

using UnityEngine;

public class Routine {

	Vector3 target;
	Vector3 _a = new Vector3 (1,1,1);
	Vector3 _b = new Vector3 (1,1,1);
	Vector3 _c = new Vector3 (1,1,1);
	Vector3 _d = new Vector3 (1,1,1);
	Vector3 _e = new Vector3 (1,1,1);
	Vector3 _f = new Vector3 (1,1,1);
	Vector3 _g = new Vector3 (1,1,1);
	Vector3 _h = new Vector3 (1,1,1);
	Vector3 _i = new Vector3 (1,1,1);
	
	public void update () {

// 		40fps @ 1000 calls / Update on device (no leak; new struct is still value-type)
//		target = new Vector3 (0, 5.0f, 0);
		
// 		20fps @ 1000 calls / Update on device
//		target = Vector3.right * 5.0f;
		
//		30fps @ 1000 calls / Update on device
//		Vector3 source = (Vector3)Random.insideUnitCircle;
//		target = source;

//		60fps @ 1000 calls / Update on device
//		target = _a;

//		30fps @ 1000 calls / Update on device
//		target = _a + _b;

//		20fps @ 1000 calls / Update on device
//		target = new Vector3 (_a.x + _b.x, _a.y + _b.y);

//		15fps @ 1000 calls / Update on device
//		target = _a + _b + _c;

//		7.5fps @ 1000 calls / Update on device
//		target = _a + _b + _c + _d + _e + _f;

//		5fps @ 1000 calls / Update on device
//		target = _a + _b + _c + _d + _e + _f + _g + _h + _i;

// 		3fps @ 1000 calls / Update on device
/*		Vector3 a = new Vector3 (1,1,1);
		Vector3 b = new Vector3 (1,1,1);
		Vector3 c = new Vector3 (1,1,1);
		Vector3 d = new Vector3 (1,1,1);
		Vector3 e = new Vector3 (1,1,1);
		Vector3 f = new Vector3 (1,1,1);
		Vector3 g = new Vector3 (1,1,1);
		Vector3 h = new Vector3 (1,1,1);
		Vector3 i = new Vector3 (1,1,1);
		target = a + b + c + d + e + f + g + h + i;
*/
	
// 		3fps @ 1000 calls / Update on device
/*		target =
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1) +
			new Vector3 (1,1,1);		
*/

// 		3fps @ 1000 calls / Update on device
/*		target =
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one +
			Vector3.one;
*/
	}
}
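The pattern behind the slow cases above is that each overloaded `+` is a static method call that returns a new struct. Writing the component math out by hand sidesteps both the calls and the intermediate temps. A hedged sketch (my own, not from the attached project), equivalent to the three-operand case:

```csharp
using UnityEngine;

public class RoutineFast {

	Vector3 target;
	Vector3 _a = new Vector3 (1,1,1);
	Vector3 _b = new Vector3 (1,1,1);
	Vector3 _c = new Vector3 (1,1,1);

	public void update () {
		// Equivalent to: target = _a + _b + _c;
		// but with no operator+ calls and no intermediate structs.
		target.x = _a.x + _b.x + _c.x;
		target.y = _a.y + _b.y + _c.y;
		target.z = _a.z + _b.z + _c.z;
	}
}
```

It is uglier, but it trades one operator call per `+` for plain float adds on fields you already own.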

Keep in mind that 1.1 has significant performance improvements, so I wouldn’t put too much effort into this until 1.1 is out, because some things might not be valid anymore.

–Eric

How you handle Vector creation and manipulation, and how you structure your updates, is fundamental to how your program is built. These things propagate through your entire code-base.

If someone can afford to live with substandard performance, throw away their work, or just wait and see what 1.1 has in store when a) no one knows when it is coming out and b) it seems to be largely geared toward taking advantage of 3GS hardware, then so be it :slight_smile:

And I, too, am looking forward to the update, but until then, hopefully someone can benefit from this data (which took less than an hour to compile and directly challenges Unity’s component implementation) or, better still, add their experience to the list so we can compile a reliable “best practices” list.

Some of this is already out there, but many aspects have yet to be addressed.

That’s definitely not the case. There’s exactly one announced feature that has to do with 3GS (8 texture combiners); the rest applies to all devices.

–Eric

You are right, although it is completely irrelevant.

What I was referring to were the SIMD unit (3GS-specific) and VFP improvements that would directly influence Vector3 operations. But what do you see in that list that addresses Update call overhead? Why should I not be concerned about a 10x performance increase from working around the Unity component model?

I pointed out that that is not a viable solution for me, and more than likely, many others. You suggested to wait. Wait until when? Do you know the release date? If so, please share it!

What was relevant about your point, exactly? Maybe I missed something…

For those who are interested, the feature list is below:

Without downloading and messing around with the demo project, do you have any numbers you can post that might give us an idea of the performance margin between the two approaches?

Problems with update overhead, vector operations, and physics performance have forced me to move on to cocos2d for my current 2D project, so I haven’t put any additional time into this, but the numbers for update overhead with the following load (along with the HUDFPS script) were:

		Vector3 source = (Vector3)Random.insideUnitCircle;
		target = source;

editor (fps ratio, centralized : decentralized)
@1k 1:1
@10k 2:1
@100k 5:1
@250k 5:1
@500k -:-

device (fps ratio, centralized : decentralized)
@100 1:1
@500 3:1
@1k 2.5:1
@5k 3.3:1

editor, centralized (looped direct call of non-MonoBehaviour routine):
1k calls 100fps
10K calls 100fps
100k calls 50fps
250k calls 20fps
500k calls 10fps

editor, decentralized (component Update, flat hierarchy):
1k calls 100fps
10k calls 50fps
100k calls 10fps (with much longer allocation/cleanup time)
250k calls 4fps (~20 sec allocation, even longer cleanup time)
500k calls forget about it

device (iPod touch gen 2), centralized:
100 calls 60fps
500 calls 60fps
1k calls 30fps
5k calls 10fps

device (iPod touch gen 2), decentralized:
100 calls 60fps
500 calls 20fps
1k calls 12fps
5k calls 3fps

So, realistically, it’s not 10:1 like I said before (that number referred to the 10x difference in load required to maintain the same fps), but for all practical purposes it’s about 3:1 on the device, cached call v. component update.
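For context, the “looped direct call” harness above amounts to something like this, driving the `Routine` class posted earlier (a reconstruction, not the exact attached code):

```csharp
using UnityEngine;

// Centralized harness: one MonoBehaviour drives N plain Routine objects,
// so the engine issues a single Update callback per frame regardless of N.
public class CentralizedDriver : MonoBehaviour {

	public int count = 1000;
	Routine[] routines;

	void Start () {
		routines = new Routine[count];
		for (int i = 0; i < count; i++)
			routines[i] = new Routine ();
	}

	void Update () {
		// Direct, non-virtual calls; compare against count GameObjects
		// each carrying its own MonoBehaviour Update.
		for (int i = 0; i < count; i++)
			routines[i].update ();
	}
}
```

The decentralized side of the comparison is simply `count` GameObjects, each with a component whose Update does the same work.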

hope that helps!