I’m creating a cool little app with the Indie version of Unity in which a user can create and delete cubes to create a blocky looking model. Each model contains thousands of cubes, each of which has its own material so that each cube can have its own color. Currently, I’m instantiating a standard Unity cube with each click.
This approach means that I get thousands of draw calls (one per cube). I’ve finished my game’s basic gameplay and UI, and now I’m on to debugging and optimizing. As a fairly new Unity user (unfamiliar with mesh scripting, shader programming, etc.), I’ve been looking for ways to optimize my game’s performance, and failing. I’ve got little knowledge about optimization, and I’m looking for different ways to raise my framerate.
I’m also including two directional lights in my scene. With one light, I’m down to about 900 draw calls per frame on an 876 cube model, but it’s fairly hard to detect cube edges and tell where things are. With two lights, I’m at about 1800 draw calls, which makes sense, looks good, but is far too expensive.
I’m currently using the Transparent/Diffuse shader on my cubes so I can do some neat alpha effects. Is there a better shader I could use?
I’d like to lower the draw calls to at least 400 - 800 per frame. How can I do this?
I’ve heard something about combining all my cubes into one mesh (this thread), but what if each of my cubes needs to have a different color?
Your high number of draw calls is a concern, but that could be resolved through mesh combining. However, of equal concern to me is the use of the Transparent/Diffuse shader, since this makes the z-buffer ineffectual (meaning everything will be rendered), with potentially massive overdraw, where each pixel is rendered many times due to alpha transparency.
Depending on your target hardware you may be fill-rate limited rather than draw-call limited. Effectively this means you could halve the draw calls yet see minimal gain in performance.
With regard to reducing draw calls, you can simply combine cubes that share the same material/colour. It’s not clear from your initial description, though, whether any cubes will share a material/colour or if each cube might have a unique colour. If the former, then you can collapse the cubes together into a single mesh and save considerable draw calls; if the latter, then things get tricky.
Unfortunately, I’m doubtful mesh combining is going to be acceptable, as you lose the locality of each cube, which is currently being used to sort the cubes into a render order. Without a render order (back to front of the scene from the camera’s view) the transparency order cannot be guaranteed, which can cause random changes of order between cubes and visible ‘popping’. In this case, if you combine a collection of non-local cubes, the render order is likely to be wrong, as the meshes are no longer self-contained cubes but can stretch across the model, interfering with other combined cubes.
Apart from swapping to non-transparent materials, the only alternative that comes to mind is an additive transparent shader, whose result I believe is independent of render order. Unfortunately it can also ‘white out’ quickly: with enough overdraw of a colour, additive blending pushes it towards white.
I feel some testing is going to be in order for you to determine the best approach. Out of interest it would be useful to know the specification of your development system, the current fps achieved and then comparable fps for say switching all transparent/diffuse to simple diffuse or vertex-lit. This would show any performance difference caused by transparent overdraw. Another test to try is measuring the framerate of the scene running at different resolutions (e.g. 640x480, 800x600, 1280x1024) as this can show at what point you are fill-rate limited due to transparency.
To view overdraw there is a useful toggle in the ‘scene’ window drop down menus that will illustrate the amount of overdraw.
I do feel that perhaps a better solution is possible, but it’s more challenging and untested.
Essentially you’d combine all the cubes into a single mesh (limited to 65536 vertices) then using a dedicated shader you’d look up a vertex position with respect to the grid of cubes and use this to calculate a (u,v) co-ordinate that looks up a specific pixel in a 2D texture. By careful construction of the texture you could give cubes individual colours and change them at will. It will still rely on using additive shader for blending to get good render results.
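To make that a bit more concrete, here is a rough, untested sketch of just the colour-texture part (the class and names like `UVForCube` and `colourTex` are made up for illustration, not existing code): each cube owns one texel in a small colour texture, every vertex of that cube gets a UV at the centre of that texel, and recolouring a cube is then just a SetPixel/Apply on the texture.

```csharp
// Hypothetical sketch: one texel per cube in a colour texture; all of a cube's
// UVs point at the centre of its texel so a single material can colour every cube.
using UnityEngine;

public class CubeColourTexture : MonoBehaviour
{
    public int textureSize = 64;          // 64x64 = 4096 possible cube colours
    Texture2D colourTex;

    void Awake()
    {
        colourTex = new Texture2D(textureSize, textureSize, TextureFormat.RGBA32, false);
        colourTex.filterMode = FilterMode.Point;   // avoid bleeding between texels
    }

    // UV for cube number 'cubeIndex' (sample the centre of its texel).
    public Vector2 UVForCube(int cubeIndex)
    {
        int x = cubeIndex % textureSize;
        int y = cubeIndex / textureSize;
        return new Vector2((x + 0.5f) / textureSize, (y + 0.5f) / textureSize);
    }

    // Change a cube's colour (including alpha) by rewriting its texel.
    public void SetCubeColour(int cubeIndex, Color colour)
    {
        colourTex.SetPixel(cubeIndex % textureSize, cubeIndex / textureSize, colour);
        colourTex.Apply();
    }
}
```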
It may be possible to resolve the render order (for one view direction) by sorting the triangles yourself before building the single mesh. The idea is to put the furthest triangles first in the triangle list and the closest last. This assumes that Unity does not change the triangle order prior to sending it to the card, though.
I’ve done this in another 3D engine, where it was slow but worked well, allowing for normal transparent blending. However, it does suffer from the old ‘painter’s algorithm’ issue, where inter-penetrating polygons cannot be given a correct render order. Luckily, with cubes I don’t see this being a problem. I’ll admit it’s a brute-force approach; it might be worth searching online to see if there are any modern methods (e.g. depth peeling, weighted average).
A slight variation on this method would be to use vertex colours, which are much easier to deal with than texture look-ups, but the Unity docs don’t explicitly state that they can have alpha values. The example only changes RGB, but since it’s a ‘colour’ type I would assume alpha is supported. I’m also unsure how you’d get Unity to render the vertex colours; the docs mention using a particle shader.
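If vertex colours do pan out, something along these lines is what I have in mind (an untested sketch; the name `vertsPerCube` and the assumption that each cube’s vertices sit in one consecutive block of the combined mesh are mine):

```csharp
// Rough sketch: write per-vertex colours (with alpha) into a combined mesh so
// each cube keeps its own colour under a single material and shader.
using UnityEngine;

public static class VertexColourUtil
{
    public static void ApplyCubeColours(Mesh combined, Color[] cubeColours, int vertsPerCube)
    {
        var colours = new Color[combined.vertexCount];
        for (int v = 0; v < colours.Length; v++)
        {
            // Each consecutive block of vertsPerCube vertices belongs to one cube.
            Color c = cubeColours[v / vertsPerCube];
            colours[v] = c;                    // Color carries alpha in c.a
        }
        combined.colors = colours;             // needs a shader that reads vertex colours
    }
}
```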
An interesting problem, good luck.
Edit: Should point out that this applies to forward rendering (i.e. < Unity 3.0); not sure what the differences would be with deferred shading.
Yes, you can use alpha. You’d use a shader which uses vertex colors, probably a custom one, since while the particle shaders do use vertex colors, they’re not necessarily appropriate otherwise.
@Noisecrime: Thank you very much for taking the time to respond to my question. This program is targeted towards Macs, and is going to be released on the Mac App Store when it is complete. This means I won’t have to worry about 400 - 500 draw-call limited PCs. (Until very recently, I thought that was the limit for all computers, not just PCs, which means I have a little more room to work in, correct?)
@Eric5h5: That’s good to know, thanks for the pointer.
What exactly do you mean by fill-rate? I’ve never heard that term before. If I’m targeting Macs, it sounds as though from the context of that sentence that I will be fill-rate limited as opposed to draw-call limited. Am I right?
I’ll do some testing shortly and let you know when I get the results. For now, as far as specs go, I’ve got a dual-core Intel i3 processor with 4 GB of RAM running Windows 7.
Test Results:
Note: All these tests use the Vertex-Lit render path, which I understand to be the fastest.
I get about 60 - 75 fps with no lighting using the Transparent/Vertex shader with 912 draw calls, which corresponds to the 875 cubes in my scene (which is odd, it’s one draw call per cube, correct?).
Using the Transparent/Diffuse shader with one directional light I get about 38 - 45 fps with 912 draw calls. With no light, same shader, I’m up to the 65 - 75 range again.
Vertex-Lit shader, no transparency, no lighting, up to the 65 - 75 FPS range again.
Vertex-Lit shader, one light, about 54 fps.
However, I pretty much need the transparency, as that’s a crucial part of my game. So I think right now the best option would be no lighting, Transparent/Vertex shader. Any thoughts?
Not sure where you got the idea that PCs are limited to 500 draw calls. PC vs Mac should be irrelevant; it’s the GPU that is the determining factor, although bandwidth between the CPU and GPU can be a factor, and like for like it should make no difference. Besides, 500 draw calls was the recommended value a few years ago now; with continued GPU development I wouldn’t be surprised if that had grown considerably. Indeed, from the results you posted, I thought the fps was quite acceptable for nearly 1000 draw calls.
Again, Mac vs PC makes no difference (in general); it’s the GPU that’s important, although I’ve always felt Macs lagged behind in this area in the past, though they have improved greatly recently. The key is that if you are targeting a specific platform, do some research and determine what system, specifically what GPU, your target user will be using. If it’s considerably less than your development machine, then you need to account for that difference in the performance they are likely to get. But that’s a whole other discussion.
As for fill-rate, it simply means the number of pixels that can be rendered per frame/second; once beyond that value, performance will be affected. Window/desktop dimensions can have an effect, but overdraw, where the same pixel is rendered multiple times in a single frame, is probably a greater issue these days. So for your project with lots of transparent cubes, you are going to have high overdraw, since many pixels are likely to be rendered many times.
I’d strongly urge you to start testing on your target market’s Mac hardware, as I suspect your current development system is somewhat over-powered, but that heavily depends on your GPU, which you don’t give details about.
As to the results, they seemed pretty good considering the number of draw calls and the transparency being used, except when using pixel lighting. The higher values suggest you may be draw-call limited (i.e. even with the simplest non-transparent shader the game can’t run at more than 75 fps), although I guess v-sync might be having an effect here, but then I’d expect results to be either 75 fps or 38 fps and nothing in between.
Really it comes down to you now and what you feel is an adequate framerate for your project (keeping in mind your target machines may be slower than your development machine) and of course how much time you want to invest in optimising, which from my previous post is probably going to be challenging.
Have you tried using the ‘combine mesh’ script that comes with Unity? That might be a good first step. Don’t apply it to all cubes, just collections of them, and see if it improves performance and whether it produces any graphical glitches such as incorrect render order. Alternatively, use the links from other replies in this thread to write your own mesh combiner and have a little more control, along the lines of the sketch below. I think it’s a case of doing the simple, easy things first, seeing if there is any payoff or downside, and then making a decision based on those results.
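For reference, a minimal, untested sketch of a do-it-yourself combiner using Unity’s Mesh.CombineMeshes, on the assumption that the group of cubes under one parent all share a single material (the SetActive call is the modern API; older Unity versions toggled .active instead):

```csharp
// Combine the cube MeshFilters under this object into one mesh / one draw call.
using UnityEngine;

public class CombineCubeGroup : MonoBehaviour
{
    void Start()
    {
        MeshFilter[] filters = GetComponentsInChildren<MeshFilter>();
        var combine = new CombineInstance[filters.Length];
        for (int i = 0; i < filters.Length; i++)
        {
            combine[i].mesh = filters[i].sharedMesh;
            combine[i].transform = filters[i].transform.localToWorldMatrix;
            filters[i].gameObject.SetActive(false);   // hide the original cubes
        }

        var combined = new Mesh();
        combined.CombineMeshes(combine);              // merges into a single mesh
        var mf = gameObject.AddComponent<MeshFilter>();
        mf.sharedMesh = combined;
        gameObject.AddComponent<MeshRenderer>();      // assign the shared material here
    }
}
```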
Might be useful to post a screenshot for future discussion, though I think everything from my first reply still stands; it’s just not necessarily simple or quick to implement, so you have to really consider whether you want/need to go down that route.
I believe I had read it somewhere on the forums… many posts here are quite a few years old, which may have led to the confusion.
Ah, that makes sense.
I will, as soon as I can find some code to get the frames per second in-game. I tried 1.0f / Time.deltaTime, but that number differed greatly from the number shown in the Stats panel.
Oh, sorry. My GPU is an ATI HD5470 Mobility Radeon.
Well, I’m exclusively aiming for Intel Macs, since I plan on releasing to the Mac store. That gives me some wiggle room with the draw calls, I believe, since most of the newer Intel Macs have pretty good graphics cards.
Right, but isn’t the CombineMesh script limited to combining cubes with only the same materials? Since each of my cubes has a material instance, they shouldn’t combine.
Okay, I’ve attached a screencapture of the model I ran the tests on. This was getting 65 - 75 fps with the vertex shader and no lighting whatsoever.
The second model contains 441 cubes and has the same settings as the previous. The FPS counter in the center, kindly referenced by Eric5h5 in the below post, displays the framerate.
Draw calls are a function of the CPU, not the graphics card. As far as the GPU goes, I could buy stuff on the Mac App Store with my 2007 Mac mini (which has an Intel GMA 950), so don’t count on anyone having a good graphics card. (I mean, personally I’d use my Mac Pro since the mini is only for testing/backup, but there are quite a few Macs with Intel GPUs out there.)
Ah, right. I suppose that by now you’ve noticed that I’m sadly unfamiliar with computer parts in general. I’ve been developing for just over ten months, and I’ve not had much experience with the inner workings of computers prior to that. I guess I should mention this because I’m 14 (I’m not saying “I’m too young to learn” but rather, “I’ve not had much experience”).
That’s true, but I’m counting on the fact that the people who download from the App Store know what they’re getting into most of the time. People like that are more likely (not guaranteed) to have better computers than your neighbor, who hasn’t touched his Mini for so long it’s dusty. At least, that’s the theory.
Plus, I’m planning on limiting the number of cubes you can draw to somewhere around 1000 - 1500, so around 1500 draw calls is the highest it could get. (The general theme of this app is “doodling, but in 3D!”, so I can get away with doing something like that.)
And so an important lesson: always perform your own testing.
Whilst the information around here and online is good enough for generalizations, nothing can beat making your own tests, especially with regard to your target minimum system specification. Most information can be trusted, but other stuff obviously becomes out of date over time. However, whilst figures and stats could be argued over, the main concept (i.e. in this case minimising draw calls) is the important part to understand.
Yeah, the Stats panel I believe shows the time taken for just the rendering; it doesn’t take into account the game code or, of course, any performance lost within the editor itself. There are a few fps scripts, but I found 1.0f/Time.smoothDeltaTime produced the same results, and they seemed accurate enough not to bother writing my own.
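For completeness, a bare-bones version of such a script might look like this (the on-screen placement and label format are arbitrary choices):

```csharp
// Minimal in-game FPS readout based on Time.smoothDeltaTime.
using UnityEngine;

public class FPSDisplay : MonoBehaviour
{
    float fps;

    void Update()
    {
        // smoothDeltaTime averages out per-frame spikes.
        fps = 1.0f / Time.smoothDeltaTime;
    }

    void OnGUI()
    {
        GUI.Label(new Rect(10, 10, 120, 20), fps.ToString("F1") + " fps");
    }
}
```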
Interesting. I’m not too up on ATI anymore, but that looks like a pretty good entry/mid-range card, and I suspect it is some way ahead of machines just a few years old. It’s hard to say much about it: being mobile and recent it’s not high end, but neither is it some four-year-old, two-generations-old card that would be quite low end nowadays. My suspicion, then, would be that acceptable frame rates on this system would equate to needing a machine no more than about two years old to run comfortably.
However, to be sure you need to test. One good approach is to upload a demo, either for the general Unity community or just for specific people/local friends, and actually compile some test results. Get details of the age of the machine, the CPU, the GPU, and the fps from the demo. That way you can build a graph of the performance and make a more informed judgment on the minimum system spec (which I would expect to be informed of before purchasing an application), or on how much optimisation you need to do.
Well no, not really. As I said, in ‘general’ it’s the GPU, but the CPU and the bandwidth of the connection between them also come into play. Basically, don’t assume anything: sure, use the information online as a rough guide, but nothing beats actual testing.
You’re right, sorry about that; I forgot you said that they didn’t share a single material. Although you’ve not mentioned whether every cube is going to have a unique colour, or whether, despite each having its own instance, many cubes might share the same colour. These specifics are quite important. If the former, then the suggestions I made previously stand; if the latter, then maybe careful use of combine mesh (or your own version) would provide enough benefit.
Do you always intend to have a base of cubes, or is that just this specific example?
Is it likely that that base will end up using the same material? Forget about whether they are all instances for the moment; it’s whether they all share the exact same colour, alpha values, etc. that’s important. If they do, then they would be a prime candidate for mesh combining, since a single material could be used to render them all. Doing this would remove over 200 cubes in one go.
I get the feeling that generalized optimising is not going to be appropriate here, hence looking more specifically at how your app works. So maybe combine cubes that have the same colour/alpha and are local neighbours, though of course doing so may cause the render-order problems mentioned previously. Getting something like that working, which should be relatively quick, would enable you to make an informed decision as to whether to investigate a more complex solution.
Ha, well, you’re more mature than most 14-year-olds I’ve dealt with online, and your eagerness to learn for yourself is admirable. Unfortunately, learning is time consuming and can be hard, but it’s well worth it. Don’t be afraid of taking some time out from developing your game to do some research online. For example, going to the Nvidia website and reading their white papers can be fascinating, and you learn much useful information…
… I think you might guess what I’d say to this.
Well, certainly in this case I don’t think the number of draw calls is over the top (though it depends on the minimum system spec you are aiming for); anything above 30 fps is perfectly usable.
I may even be tempted to let the user control how many cubes, and thus draw calls, they make. They’ll realise that performance drops with more cubes, but at least they won’t find themselves 10 cubes short of their magnum opus. Maybe a good compromise would be to add a user preference for the maximum number of cubes; if the user increases it, inform them that performance might suffer. More importantly, this way your game won’t be locked into a maximum number of cubes, where in 5 years it might be able to handle double that number.
However, coming back to your original question: if I were you I’d keep developing your application to the point of it being complete in terms of usability. Then I’d look at doing the optimisation, probably along the lines of my initial reply. Heck, you could even release version 1 as it is and work on a version 2 with optimisation afterwards (as a free update, of course).
I only say that because I can’t guarantee any of the methods I suggested above will actually be appropriate for your needs. The issue you are dealing with (order-independent rendering of transparent objects) is pretty cutting edge (e.g. read up on depth peeling and weighted average algorithms) and has always been problematic in realtime 3D graphics. Specifically, it’s the combination of many cubes and transparency that is the problem; either one on its own isn’t too bad to solve.
I intend to release the app for something around 1.99 - 2.99 USD. It’s going to be fairly small and simple, so I’m trying to avoid the time-consuming extra research involved in favor of going for a more simplistic, cheap approach where buyers don’t think twice if it doesn’t run on their computer.
No, that was a user-created base.
Thank you. Ha, “dealt”?
Yes, the Unity learning curve itself was pretty steep, and as a result I’m slightly hesitant to get into mesh scripting.
I could never! Oh, wait, “Run some actual tests yourself and get specifications for the minimal system requirements so you can optimize based on those numbers”?
I think I may be persuaded to do just that. It’s flexible and yet theoretically runs well on even the oldest Intel Macs. It also allows for the users, like you mentioned, to not have to worry about their “magnum opus” as you termed it.
Yes, you’re probably right. I got caught up in trying to both optimize and develop at the same time. Perhaps I’ll call a moratorium on the optimizing for now. After I’ve completed the program usability-wise, I think I’ll return to optimization, definitely working off of the extremely extensive replies you’ve kindly provided. Would you mind me contacting you via PM if I began to work on optimization again and had a question or two?
You know what’s funny about that is, I’ve actually been planning on doing something along those lines (some sort of update, a version 2) ever since I started working on the app.
I’d also like to thank you again for taking the time to write the extremely extensive replies that I’ve received.
Well, if it doesn’t run I’d be asking for my money back; it’s not the amount but the principle. However, it wouldn’t really take much to test on a few machines to gauge a minimum spec. At least if you state a minimum system, buyers can’t complain much if it doesn’t work on a lower one.
Steep but worth it. Learning never really gets easier, as you tend to keep advancing, so stuff that was hard becomes easy and you don’t even think of it as learning, whilst stuff that you thought was impossible to understand becomes the new challenge.
Just keep in mind that you may have to optimise the rendering of the cubes in the future, so ensure you can easily append or switch out code to do so. One important bit of advice, which you may or may not already follow, is not to always try to optimise within your finished project. Sometimes it can be more beneficial to create a new project to test specific ideas without the complexities of existing code. It may be worth keeping a backup of this current build, as from the screenshot it still looks relatively simple and might provide a good basis to explore optimization on. Once you find a solution you can then implement it in your finished project. I find this method lets you focus on the issue/solution and avoids getting into ugly, confusing code with old test stuff commented out.
For example, if it were me I’d write some quick code to create a random 3D grid of cubes assigned random colours (using a fixed random seed to ensure I can recreate the 3D grid exactly each time it’s run). That would let me play with different ideas for mesh combining, shaders, using vertex colours, etc.: a testbed, if you like.
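Something along these lines is what I mean; a quick, untested throwaway where the grid size and seed are arbitrary (newer Unity versions use Random.InitState, older ones set Random.seed instead):

```csharp
// Deterministic testbed: a random 3D grid of coloured cubes, reproducible via a fixed seed.
using UnityEngine;

public class CubeGridTestbed : MonoBehaviour
{
    public int size = 10;        // 10x10x10 = 1000 cubes
    public int seed = 42;

    void Start()
    {
        Random.InitState(seed);  // older Unity: Random.seed = seed;
        for (int x = 0; x < size; x++)
        for (int y = 0; y < size; y++)
        for (int z = 0; z < size; z++)
        {
            var cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
            cube.transform.position = new Vector3(x, y, z);
            var mat = cube.GetComponent<Renderer>().material;   // unique instance per cube
            mat.color = new Color(Random.value, Random.value, Random.value, Random.value);
        }
    }
}
```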
Feel free to PM me, though I may not reply immediately; it depends what I’m doing. Also, I’ll state upfront that it’s unlikely I’m going to want to ‘do the work’ for you, but I’m happy to answer questions or bounce ideas around, etc.
Oh that makes me feel all warm and fuzzy inside, or is it the morning coffee?
Unfortunately (for me) you’ve got me thinking about this and the tricky part of supporting transparency.
So just thinking out loud…
In the past I’ve written per-polygon/triangle sorting code to sort the triangles within a model: furthest triangle first in the mesh triangle index list so that it (in theory) is rendered first, working through to the closest triangle, which is rendered last. However, the sorting can be somewhat slow for high-polygon models. I’m assuming that Unity and the GPU maintain the triangle index order and that the triangles are rendered in that order. I know it’s worked in the past, but it’s something to be aware of, since it relies on the GPU rendering the triangles in the order we give them.
In your case though we have a constrained system, a grid of cubes should mean we can make some assumptions.
Firstly, if we don’t render back-facing triangles, we don’t need to sort the triangle faces within a cube, as there are no overlaps; we only need to know the order between cubes. That means we can simply build the combined mesh by adding the furthest-away cube’s triangles to the beginning of the triangle index list first and working our way to the closest cube.
To calculate the order you simply have to transform the origin point of the cube into camera space (before perspective) to get its depth from the camera. That means we only need to sort a single point per cube (representing up to 12 triangles at a time). This is obviously going to be much faster than sorting all the individual triangles of a cube.
I suspect this will be fast enough already in terms of sorting, though we could use a ‘bucket sort’ instead. This is where you define N number of buckets, each one covering a consecutive range of depths (e.g. bucket 1 contains z>0 and z<=10, bucket 2 contains z>10 and z <=20 etc). So all we would need to do is divide the depth of the point from the camera by say 16 and place it into the resultant bucket (e.g. z= 38, bucket = 38/16 = 2 - we ignore remainder). Then you simply loop through the buckets adding the triangles for each cube within the bucket before moving to the next bucket.
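As a rough, untested sketch of that depth-plus-bucket idea (the bucket size and count here are arbitrary illustrative values): compute each cube’s camera-space depth from its origin, drop it into a bucket, then walk the buckets from farthest to nearest when emitting triangles into the combined mesh.

```csharp
// Bucket-sort cubes by camera-space depth, far to near, for back-to-front mesh building.
using System.Collections.Generic;
using UnityEngine;

public static class CubeDepthSort
{
    public static List<Transform> SortFarToNear(IList<Transform> cubes, Camera cam,
                                                float bucketSize = 16f, int bucketCount = 64)
    {
        var buckets = new List<Transform>[bucketCount];
        Matrix4x4 toCamera = cam.worldToCameraMatrix;

        foreach (Transform cube in cubes)
        {
            // Unity camera space looks down -z, so negate to get a positive depth.
            float depth = -toCamera.MultiplyPoint3x4(cube.position).z;
            int bucket = Mathf.Clamp((int)(depth / bucketSize), 0, bucketCount - 1);
            if (buckets[bucket] == null) buckets[bucket] = new List<Transform>();
            buckets[bucket].Add(cube);
        }

        // Farthest bucket first, nearest last: add each cube's triangles in this order.
        var ordered = new List<Transform>(cubes.Count);
        for (int b = bucketCount - 1; b >= 0; b--)
            if (buckets[b] != null) ordered.AddRange(buckets[b]);
        return ordered;
    }
}
```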
So that would give us very fast sorting; performance really comes down to how efficient building a new mesh of X polygons in Unity is. It’s probably ‘relatively’ slow (though many times faster than what I’ve been used to in other 3D engines), but that shouldn’t be a massive problem, as you could split the mesh into chunks (accounting for potential sort-order issues!) and rebuild one chunk per frame, spreading the total cost across several frames, which users often don’t notice. Threading might be useful, but it adds complexity and I’m not really up on it, especially from within Unity.
In theory, then, we have transparent cubes with near-perfect sorting (painter’s algorithm, google it), since we are using cubes. The next step would be to get vertex colours working so that each cube can have an individual colour and alpha without needing multiple materials, which would defeat the purpose of combining the meshes anyway.
Of course, there is one glaring problem at the moment: the sorting is only valid for a specific view direction/angle. So, to be used where the camera can be rotated, we’d need to re-sort and rebuild the mesh each time. This is where it can get expensive in terms of overall performance, though there are some tricks, such as only re-sorting if the camera direction changes by a certain amount, and doing the sort/build over several frames if necessary.
Edit:
Whilst looking up something unrelated, I came across this thread from a while back. Ares’ reply is quite interesting. It might be useful to play with if you want to understand more about draw call cost.
Okay, you’ve got me convinced. I’ll do some testing and determine a minimum system before putting it out there.
Right, I’d never imagine working with meshes three months ago, even two weeks ago (about when I started working on this project). Now, however, I look forward to the upcoming learning curve with anticipation.
Yes, I’ll keep the parts where I’ll have to stick my optimization code as simple, clean, and commented as possible, without writing a full-length novel about each line of code.
Yeah, that’s probably a good idea, since from this point on, the two code paths (optimized and non-optimized) go very different ways.
I was actually intending to create a separate project in which I could learn and explore dynamic mesh editing, and that’s probably a good place to start.
Thanks, it’s good to know I’ve got an expert to reference. Do the work for me? You probably know what I’d say to that.
Regarding the entire post on mesh sorting you just wrote: it definitely makes a lot of sense to me, as un-mesh-knowledgeable as I am. However, before we dive into the optimization, I’d like to a) release an unoptimized Version 1 of the app and b) get some basic knowledge of mesh scripting by looking at the docs, the procedural mesh generation examples, and just playing around on my own.
Just give me somewhere around two weeks or so more and I’ll have a fully functional app (tested well) ready to put in the market. I’m also moving in about two weeks, so I won’t have as much time on my hands to work on it after that, as I’ll be packing/cleaning, etc.
In the meantime, I’ll play around with mesh scripting and see what I can find out on my own while finishing the app. Like you pointed out, it is rather simple, and I intend to keep it that way. I’m trying to appeal to a larger audience of less technically-minded people, and the more complex the program gets, the narrower my audience range becomes.