CPU optimization

I’m seeing a weird performance bottleneck. Each frame I run the following loop:

        for (int i = 0; i < pickupCount; i++)
        {
            Vector3 pickupPos = cachedPickupPositions[i];
            float radius = cachedPickupRadii[i];

            if (nodeMax.y > pickupPos.y - radius)
                if (nodeMin.y < pickupPos.y + radius)
                    if (nodeMax.x > pickupPos.x - radius)
                        if (nodeMin.x < pickupPos.x + radius)
                        {
                          // I normally do some stuff
                          // here, but I've removed it
                          // in the interest of clarity
                          continue;
                        }
        }

Where pickupCount equals 21 in my current test case.

The thing is that iterating this loop 21 times takes my framerate from about 30 to 20 fps.

I realize the CPU power is limited, but this is crazy, seeing as I’m skinning hundreds of vertices with no problem (I’ve heard that the skinning is 100% CPU; could someone confirm this, btw?). Is there something I’m doing that causes Mono to behave weirdly? Is branching extremely slow?

Btw I know I could do a sphere tree or some such optimization, but that seems very overkill for such a tiny loop.

Thanks for any answers =)

Are you saying that just iterating is causing the bottleneck?

You’ve removed the code in the inner loop for clarity. How does this run without that code vs. with that code?

You have variable declarations inside a loop. It would be better to declare them once outside the loop. I doubt that’s the entire problem, but it couldn’t hurt to move those.

Yes, just the code that I posted causes a big performance drop.

It turns out I exaggerated a bit though, it goes from 30 fps to 25 fps when running the above loop 21 times. Still, this is a huge performance hit for such a small operation.

Is there any way I can see the intermediate code that’s sent to Xcode? I looked around but couldn’t find anything.

Also, any insight into whether this is expected performance or I’m doing something wrong would be greatly appreciated. =)

Move your variable declarations.

I heard you the first time kenlem. :wink: That makes no difference, likely because struct/primitive local variables end up being allocated on the stack when entering the function anyway. And Vector3 (which is a struct) has no user-defined default constructor as far as I can see. (From the Unity docs: “structs do not allocate memory”.)

Hmmm… I didn’t know that the compiler was smart enough to put those structs on the stack when entering the function. That’s a nice optimization.

Can you post the link?

Yep, I second that about the variables. Does the speed go back up if you move the declarations of pickupPos and radius to before the loop? I would assume yes.

Vector3 pickupPos = cachedPickupPositions[i];
float radius = cachedPickupRadii[i];

to

Vector3 pickupPos;
float radius;
for (int i = 0; i < pickupCount; i++)
{
    // Declarations hoisted out of the loop; only the assignments remain inside.
    pickupPos = cachedPickupPositions[i];
    radius = cachedPickupRadii[i];
}

Since the framerate on iPhone is locked to VSync (60Hz), if your frame time changes from, say, 32.5ms to 33.5ms (just 1 ms difference), you’ll see the framerate drop from 30 to 20 fps. Judging your CPU workload from the final framerate is misleading.
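
To make the arithmetic above concrete, here is a minimal sketch of how a VSync lock quantizes the displayed framerate. It assumes a simplified model in which every frame waits for the next refresh after it finishes; the class and method names (VsyncMath, DisplayedFps) are made up for illustration and are not part of Unity.

```csharp
using System;

public static class VsyncMath
{
    // Simplified model (an assumption, not Unity's actual presentation logic):
    // a finished frame is shown at the next vertical refresh, so each frame
    // occupies a whole number of refresh intervals.
    public static double DisplayedFps(double frameTimeMs, double refreshHz)
    {
        double refreshIntervalMs = 1000.0 / refreshHz; // about 16.67 ms at 60 Hz

        // Round the frame time up to whole refresh intervals (at least one).
        double intervals = Math.Max(1.0, Math.Ceiling(frameTimeMs / refreshIntervalMs));

        return refreshHz / intervals;
    }
}
```

Under this model, DisplayedFps(32.5, 60) comes out as 30 and DisplayedFps(33.5, 60) as 20, because 33.5 ms no longer fits into two 16.67 ms refresh intervals and spills into a third.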

Go to AppController.mm and set ENABLE_INTERNAL_PROFILER to 1. Take a look at how the “mono-scripts” stats change when you change your script. Pay attention to “fixed-update-count” as well (especially if a lot of your script code runs from FixedUpdate). Inspect the other stats as well.

Yes, even skinning functions in OpenGLES API actually do skinning on cpu on iPhone.

ReJ - Thanks for the profiling tip. I’m just starting a project and I want to profile as I develop instead of trying to go back once everything is done.

For “expensive” things I usually also try to let them run not in every update but in every 2nd or 3rd update only. Of course depends on the project, but helps a lot.
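
A rough sketch of that idea, written as a plain C# helper so it runs outside Unity. The IntervalRunner class and its members are hypothetical names for illustration; in a Unity script you would presumably call ShouldRun() once at the top of Update and skip the expensive work when it returns false.

```csharp
public sealed class IntervalRunner
{
    private readonly int interval; // run the work once every `interval` updates
    private int frame;             // number of updates seen so far

    public int RunCount { get; private set; }

    public IntervalRunner(int interval)
    {
        // Guard against zero or negative intervals: fall back to every update.
        this.interval = interval < 1 ? 1 : interval;
    }

    // Call once per update. Returns true on the updates where the
    // expensive work should run (the 1st, then every `interval`-th after).
    public bool ShouldRun()
    {
        bool run = (frame % interval) == 0;
        frame++;
        if (run)
        {
            RunCount++;
        }
        return run;
    }
}
```

The trade-off is latency: with an interval of 3, results can be up to two updates stale, which is fine for things like pickup checks on slow-moving objects but may not suit everything.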

ReJ: Ah of course, I didn’t realize that. Still, I calculate the average framerate over 100 frames or something, so it’s not completely useless. But measuring the CPU workload properly is obviously better. Thanks.
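
As a complement to the internal profiler, one way to measure a specific piece of script code directly is to time it with System.Diagnostics.Stopwatch and average over many frames. This is only a sketch: SectionTimer is a hypothetical helper, and it assumes Stopwatch is available and reasonably precise in the Mono runtime you are targeting.

```csharp
using System.Diagnostics;

public sealed class SectionTimer
{
    // Stopwatch accumulates elapsed time across Start/Stop pairs.
    private readonly Stopwatch stopwatch = new Stopwatch();

    public int SampleCount { get; private set; }

    // Call right before the code being measured.
    public void Begin()
    {
        stopwatch.Start();
    }

    // Call right after the code being measured.
    public void End()
    {
        stopwatch.Stop();
        SampleCount++;
    }

    // Average time per Begin/End pair, in milliseconds.
    public double AverageMs
    {
        get
        {
            return SampleCount == 0
                ? 0.0
                : stopwatch.Elapsed.TotalMilliseconds / SampleCount;
        }
    }

    public void Reset()
    {
        stopwatch.Reset();
        SampleCount = 0;
    }
}
```

Wrapping the pickup loop in Begin() and End() and logging AverageMs every 100 frames would give a per-frame cost in milliseconds that is not distorted by the VSync lock.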

Martin: Why would the compiler not optimize that? Why would the compiler allocate them anywhere but the stack? Btw I did try it, no difference.

Hmm, I think I had something similar and noticed a difference when deployed on the phone. It was not the exact same case, but similar. I might be wrong, but my impression was that creating variables inside a loop is costly for the framerate.