GPUs usually have very small caches. They are optimized for throughput, not latency, so they rely on high memory frequencies and wide buses and can get away with small caches. CPUs, on the other hand, are optimized for the “anything can happen now” case, where low latency matters more than high throughput, hence the huge caches.
Anyway, back to the question (warning - long post ahead). I did some tests on my own. The attached project (for Unity 2.0) has two tests:
- Draw Call test. This draws exactly the same number of polygons, of exactly the same size, in exactly the same places, just using a different batch count. In essence, it has lots of plane grids and subdivides them into a variable number of tiles, so GPU vertex processing always stays more or less the same (minus the duplicate vertices along touching tile edges), pixel processing is also the same, and only the batch count differs (the sketch after this list shows the arithmetic).
- Vertex processing throughput test. This always draws the same number of batches (500), with meshes that occupy the same screen area. Each step uses meshes subdivided into a different number of polygons, so the vertex processing requirement on the GPU grows with each step: there are more vertices to process. The CPU should be loaded the same in each step, though, since the number of batches stays the same.
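To make the tile-subdivision idea concrete, here is a rough sketch of the arithmetic (this is not the code from the attached project, and the grid size of 384 quads per side is made up for illustration):

```python
# Rough back-of-the-envelope model of the draw-call test (illustrative numbers,
# not the attached project's code): split one big plane grid of N x N quads
# into t x t tiles, one draw call per tile. Polygon count and covered area stay
# fixed; only the batch count and the duplicated edge vertices change.

def grid_stats(total_quads_per_side, tiles_per_side):
    quads_per_tile_side = total_quads_per_side // tiles_per_side
    verts_per_batch = (quads_per_tile_side + 1) ** 2  # a q x q quad grid needs (q+1)^2 vertices
    batches = tiles_per_side ** 2                     # one draw call per tile
    total_verts = batches * verts_per_batch           # includes duplicates on shared tile edges
    return batches, verts_per_batch, total_verts

for tiles_per_side in (2, 4, 8, 16, 32):
    b, v, total = grid_stats(384, tiles_per_side)
    print(f"{b:5d} batches x {v:6d} verts/batch = {total:7d} verts submitted")
```

The total vertex count submitted grows slightly with more tiles because of the duplicated edge vertices, but the per-frame GPU work stays essentially constant while the batch count varies a lot.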
I tested on a first-gen MacBook Pro: Core Duo 1.83GHz, Radeon X1600, OS X 10.5.1. The GPU is not very fast (it’s underclocked a lot compared to “normal” X1600s). It’s important to test in the standalone player, as in the editor we do a lot more error checking, plus there’s other editor overhead (e.g. drawing the scene view).
1. Draw call testing. I think the proper question to ask here is “how many draw calls can I afford?”. Each draw call takes some CPU time; estimate how many FPS you want to have (say, 60) and how much CPU you want to leave for physics, game logic etc., and you’ll end up with how much CPU time you can afford for submitting objects for rendering.
On my machine, to draw all ~200 thousand vertices in this test while spending all CPU on drawing, it’s best to draw them in 160 batches (2401 vertices/batch, 145 FPS). If, however, I want to spend 10 milliseconds/frame of CPU on other tasks, then it’s best to draw the whole thing in 40-90 batches (9409-4225 vertices/batch, 74.4 FPS - note that each frame I simulate “10 ms spent somewhere”, hence the lower absolute FPS). If I want to spend 20 milliseconds/frame of CPU on other tasks, then it’s best to draw in 40 batches (9409 vertices/batch, 42.7 FPS).
My machine seems to be able to do about 90,000 batches/second if I max out the batch count. Note that the batches here are very simple: just changing a mesh; in a real game you’ll quite often change textures, shaders, colors and whatnot, so “real batches” might be more expensive.
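To turn those numbers into a budget, here is the kind of back-of-the-envelope calculation I mean (the ~90,000 batches/second figure is just this machine’s number; plug in your own measurements):

```python
# How many draw calls per frame fit into a given CPU budget?
ms_per_batch = 1000.0 / 90000.0          # ~0.011 ms of CPU per simple batch on this machine

target_fps = 60
frame_budget_ms = 1000.0 / target_fps    # ~16.7 ms per frame
other_cpu_ms = 10.0                      # physics, game logic, etc.

render_budget_ms = frame_budget_ms - other_cpu_ms
affordable_batches = int(render_budget_ms / ms_per_batch)
print(f"~{affordable_batches} batches/frame fit into {render_budget_ms:.1f} ms of CPU")
# roughly 600 simple batches/frame with these numbers; "real" batches will get you fewer
```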
2. Vertex processing throughput
My results are below: each row is vertices/batch and the resulting processing rate in millions of vertices/second. As said above, this is 500 draw calls per frame. Data:
Verts/Batch MVerts/s
25 0.5
121 2.3
289 5.0
529 8.3
841 12.3
1225 16.6
1681 20.9
2209 24.9
2809 28.7
3481 32.3
// drop!
4225 25.1
5041 26.2
5929 27.0
6889 27.4
// further it slowly goes up to 30.0 Mverts/s
You can see that the limit of this machine is about 30 million vertices/second (quite low… I blame Apple for underclocking it!). And except for a curious drop at about 4000 vertices/batch, the processing throughput increases with batch size.
Why there’s a sudden drop in vertex processing performance at about 4000 vertices/batch, I don’t know. My mesh vertices are position+color, which makes them 16 bytes (12 for position, 4 for color), so at about 4000 vertices the mesh reaches 64 kilobytes of vertex data. Maybe Apple’s OpenGL driver or this particular graphics card switches to some “slower mode” when mesh vertex data exceeds 64 kilobytes? I could imagine the internal format of the graphics card’s push-buffers changing when some limit is exceeded, but I don’t know for sure. Overall, though, larger batches make the graphics card happier.
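For reference, the size arithmetic behind that guess (16 bytes per vertex is just this test’s layout; yours will differ if you have normals, UVs etc.):

```python
# Vertex size in this test: position (3 floats) + color (4 bytes, RGBA).
position_bytes = 3 * 4                       # 12 bytes
color_bytes = 4                              # 4 bytes
vertex_bytes = position_bytes + color_bytes  # 16 bytes per vertex

verts_at_64k = 64 * 1024 // vertex_bytes
print(verts_at_64k)  # 4096 vertices, right between the 3481 and 4225 rows where the drop shows up
```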
Ok, time to sleep now.
58041–2108–$batchtesting_412.zip (481 KB)