Performance benchmarking results inaccurate?

Unity version: 2023.2.11f1
Test Framework version: 1.3.9
Performance testing API: 3.0.3

I’ve been doing some performance benchmarking and am getting results that don’t quite make sense. Overall, it seemed that the values recorded by the benchmarking framework are much higher than they should be.

I decided to build a very simple test measuring the cost of a dictionary lookup and was shocked at the results. First, here is the test code:

        [Test, Performance]
        public void BenchmarkResolve_DictionaryLookup()
        {
            const int WarmupCount = 1000;
            const int Measurements = 1000;
            const int Iterations = 1000;
            const string Key = "Test";

            SampleGroup sampleGroup = new SampleGroup("Time", SampleUnit.Nanosecond);
            Dictionary<string, int> dict = new();
            dict.Add(Key, 10);

            Measure.Method(() =>
            {
                dict.TryGetValue(Key, out int value);
            })
            .SampleGroup(sampleGroup)
            .WarmupCount(WarmupCount)
            .MeasurementCount(Measurements)
            .IterationsPerMeasurement(Iterations)
            .Run();
        }

The results from running this test are as follows (all numbers are in nanoseconds):

Deviation: 0.27 ns
SD: 5139.22 ns
Median: 18800 ns
Min: 18700 ns
Max: 98400 ns

The results were collected in the Editor, in Release mode, from a PlayMode test (I get similar results with EditMode tests).

I don’t think these results are correct. A dictionary lookup should be on the order of 200-250 nanoseconds at most, not almost 20,000. Here is a link to an article in which someone uses BenchmarkDotNet to sample a dictionary lookup with a 200-character string key, which comes in at 214.9 nanoseconds. I understand that there are differences between Mono and the .NET versions tested in that article, but 4.8.2 is not far off from the 4.7.2 that Mono is based on, and I can’t imagine the differences between those versions are so drastic that one is nearly 100x slower than the other.
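
For context, the article’s setup would be roughly equivalent to something like the following sketch (my own reconstruction, not the article’s code; the 200-character key mirrors the key length described there):

    using System.Collections.Generic;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public class DictionaryLookupBenchmark
    {
        // 200-character key, to mirror the key length described in the article.
        private readonly string _key = new string('x', 200);
        private readonly Dictionary<string, int> _dict = new();

        [GlobalSetup]
        public void Setup() => _dict.Add(_key, 10);

        [Benchmark]
        public bool TryGetValue() => _dict.TryGetValue(_key, out _);
    }

    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<DictionaryLookupBenchmark>();
    }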

It’s much more likely that the performance benchmarking tool is either displaying incorrect data or errantly capturing the overhead of other systems unrelated to what is being measured. Indeed, the suspicious Max value of 98400 ns seems to be caused by spikes that I doubt are related to the actual cost of a dictionary lookup. I’m not running anything else within Unity during these tests, so whatever is being captured must be coming from the engine itself.

Am I using the tool wrong? Is my understanding of the data wrong? Is my understanding of how fast a dictionary lookup should be wrong? I’m certainly willing to accept that there’s some user error on my part here; I’m just having difficulty seeing it. Any insight into this problem would be appreciated.

You are exercising a single method, and since this is a TryXxx method it has its exception handling internalized, which can be rather slow. Try a non-Try method, and try something that’s known to be fast: a native collection like NativeHashMap.
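
For example, a NativeHashMap version of the same test could look roughly like this (untested sketch: it assumes the Collections package is installed, and it uses FixedString64Bytes because NativeHashMap keys must be unmanaged, so a managed string won’t work; you could also swap TryGetValue for the plain indexer on the managed Dictionary to compare a non-Try path):

    using NUnit.Framework;
    using Unity.Collections;
    using Unity.PerformanceTesting;

    public class NativeLookupBenchmark
    {
        [Test, Performance]
        public void BenchmarkResolve_NativeHashMapLookup()
        {
            // FixedString64Bytes stands in for string: NativeHashMap keys must be unmanaged.
            var key = new FixedString64Bytes("Test");
            var map = new NativeHashMap<FixedString64Bytes, int>(1, Allocator.Persistent);
            map.Add(key, 10);

            var sampleGroup = new SampleGroup("Time", SampleUnit.Nanosecond);

            Measure.Method(() =>
            {
                map.TryGetValue(key, out int value);
            })
            .SampleGroup(sampleGroup)
            .WarmupCount(5)
            .MeasurementCount(20)
            .IterationsPerMeasurement(100)
            .Run();

            map.Dispose();
        }
    }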

Or measure something for which you have the full source code available, because as you noticed you simply don’t know what code this dictionary method runs, so judging the reasons for the current results is purely speculative.

Also, your warmup count is extremely high; somewhere between 1 and 20 warmup rounds ought to be sufficient, and most commonly it’s just the first round that needs to be skipped. Similarly for iterations: you are already taking a whopping 1,000 measurements, so more than 10 iterations per measurement is simply a waste of time.
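
Concretely, something in this range should already give stable numbers for the dictionary test (the exact values are just a suggestion):

    // Same dictionary test, with the counts dialed down:
    Measure.Method(() =>
    {
        dict.TryGetValue(Key, out int value);
    })
    .SampleGroup(sampleGroup)
    .WarmupCount(5)                  // a handful of warmup rounds is plenty
    .MeasurementCount(20)            // still enough samples for a distribution
    .IterationsPerMeasurement(10)    // keep this low for a cheap operation
    .Run();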

Those spikes could be anything! It’s unlikely to be an engine thing, simply because of how many other things are going on at the same time on any system.

Examples: it could be the CPU throttling its frequency for a short time to reduce heat or because it passed its TDP boost limit; the thread getting interrupted; the memory controller being busy fetching memory for a background task; a memory read being delayed by a refresh cycle; a background task taking priority; the thread being halted and resumed on another core; or almost countless other reasons.

This comment made me re-evaluate how many iterations I’m doing and I’m noticing that if I only do 1 iteration, the median value for the test I wrote in my previous post is 100 nanoseconds. That is much more in line with what I was expecting in terms of Dictionary performance. It seems that increasing the iterations increases the final results. For example, if I increase iterations to 10, the result becomes 500 nanoseconds. If I increase it to 100, it becomes 2100 nanoseconds. And of course, at 1000 iterations the result is 20100 nanoseconds.

Is this not a bug? Or am I misunderstanding the purpose of iterations? Are iterations intended to capture the bulk cost of an operation done X times?
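
For anyone who wants to double-check, here is a rough sketch of how the same loop could be timed by hand and recorded per iteration (untested; it assumes the Measure.Custom(SampleGroup, double) overload, and the Stopwatch resolution comment is my guess):

    using System.Collections.Generic;
    using System.Diagnostics;
    using NUnit.Framework;
    using Unity.PerformanceTesting;

    public class ManualTimingBenchmark
    {
        [Test, Performance]
        public void BenchmarkResolve_DictionaryLookup_PerIteration()
        {
            const int Measurements = 100;
            const int Iterations = 1000;
            const string Key = "Test";

            var sampleGroup = new SampleGroup("TimePerLookup", SampleUnit.Nanosecond);
            var dict = new Dictionary<string, int> { [Key] = 10 };

            for (int m = 0; m < Measurements; m++)
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < Iterations; i++)
                {
                    dict.TryGetValue(Key, out int value);
                }
                sw.Stop();

                // Convert ticks to nanoseconds, then divide by the iteration count so
                // the recorded sample is time per lookup rather than per batch.
                // Stopwatch ticks are coarse (commonly 100 ns), which is also in the
                // ballpark of the 100 ns median I saw with a single iteration.
                double elapsedNs = sw.ElapsedTicks * (1_000_000_000.0 / Stopwatch.Frequency);
                Measure.Custom(sampleGroup, elapsedNs / Iterations);
            }
        }
    }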

Now that you mention it, yes, that hit me too initially. But it makes sense for the time to be accumulated across the total number of iterations, because you also have the number of measurements.

As in: take m measurements, each running n iterations.

Increasing iterations makes sense where the exercised code runs so fast that any small deviation results in a several-hundred-percent increase or decrease of the measured time. For instance, a code snippet like var x = a / b + c * d warrants running it for several iterations, where you’d normally put it in a for loop.
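
A minimal sketch of that idea (the fields are mine; writing to a field keeps the snippet from being optimized away entirely):

    float _a = 1f, _b = 2f, _c = 3f, _d = 4f, _result;

    [Test, Performance]
    public void BenchmarkTinySnippet()
    {
        Measure.Method(() =>
        {
            // A single run of this is far below timer resolution, so batching it
            // with IterationsPerMeasurement is what makes the numbers usable.
            _result = _a / _b + _c * _d;
        })
        .WarmupCount(5)
        .MeasurementCount(20)
        .IterationsPerMeasurement(1000)
        .Run();
    }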

That sort of makes sense. I guess the only thing I’m not certain about is why those fluctuations you mention wouldn’t already be accounted for by the fact that we’re doing 1000 measurements? For small operations, why would increasing iterations make more of a difference than increasing measurements? And if we do increase iterations, (for example 100 iterations) do I divide the median result by 100 with the expectation that that result is more accurate than doing 1 iteration?

Also, I’m noticing that with 1 iteration, the tool never seems to report a value lower than 100 nanoseconds, which seems wrong to me.

You can do that to get the average of a single iteration, but since the measurements already provide you with min/max and average values, by choosing 100 or 1000 iterations you can just shift the decimal point in your head if you really need to know the actual time per iteration.
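
For example, with the numbers from earlier in the thread: 20,100 ns per measurement at 1,000 iterations works out to roughly 20 ns per lookup, and 2,100 ns at 100 iterations to roughly 21 ns, which is at least consistent.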

But since the measurements aren’t meant to provide absolute operation timings so much as something to compare against previous runs or similar operations, you really only need to make sure to use the same scale (number of iterations) for the results you want to compare. Most of the time you measure to see whether a change did or did not improve/impact performance.
