Populating Dictionary getting slow after 1M

I’m populating a Dictionary<Vector3, byte>, and after 1 million entries it starts to slow to a crawl.

Here is the essence of the code:

    for (int x = 0; x < 1000; x++)
    {
        for (int y = 0; y < 1000; y++)
        {
            for (int z = 0; z < 50; z++)
            {
                Dict.Add(new Vector3(x, y, z), byt);
            }
        }
    }

I have this in a coroutine, but I removed the other elements to make the code clear. What is evident is that the fuller the dictionary gets, the slower it becomes to add new entries.

After reading a bit about it, I believe it has something to do with colliding hashes, which I know nothing about. I was hoping someone could point me to a solution for generating good hashes for the Vector3 keys and show how I would implement it in this case. As you can see, these positions total 50 million, though the population slows to a crawl near the 3 millionth entry. I would like to know if it’s possible to keep the same speed throughout the whole population.

  1. a dictionary with a million entries is just going to be a beast in the first place

  2. Vectors are horrible as keys in a dictionary because they’re really just 3 floats… and floats, due to the way they work, make for terrible types to hash.

As for finding a solution to your problem… what are you attempting to do?

From the looks of your code, you seem to be creating a 3D grid to store bytes, so that you can just look up a point at, say, <10,3,5> and know what byte is stored there (is this maybe a grid used for a minecraft clone or something???).

It may be better to just use a 3-dimensional array.
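For instance, a minimal sketch of that with a multi-dimensional array, assuming the 1000 x 1000 x 50 ranges from your loops (the flat-array version later in the thread is the faster variant of the same idea):

    // Sketch only: dimensions assumed from the original loops.
    byte byt = 0; // placeholder value, as in the original snippet
    byte[,,] grid = new byte[1000, 1000, 50];

    for (int x = 0; x < 1000; x++)
    {
        for (int y = 0; y < 1000; y++)
        {
            for (int z = 0; z < 50; z++)
            {
                grid[x, y, z] = byt;
            }
        }
    }

    // Lookup is a plain index, no hashing involved:
    byte value = grid[10, 3, 5];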


This, my go-to solution for stuff like that.

Using a Dict keyed by Vector3 is a poor choice here, because with this setup it’s quite easy to get instant access through something simpler: an array. This is possible since you’re using ints to set up your Vector3s; converting them to floats is pointless and wasteful (and can easily introduce floating-point precision bugs), but keeping them as ints allows you to do this:

    byte[] allBytes = new byte[1000 * 1000 * 50];
    for (int x = 0; x < 1000; x++)
    {
        for (int y = 0; y < 1000; y++)
        {
            for (int z = 0; z < 50; z++)
            {
                int index = z + y * 50 + x * 50 * 1000;
                allBytes[index] = byt;
            }
        }
    }

No hash collisions here, and lookup will be even faster than Dictionary ever could be; it’s literally just multiplying a few numbers and going straight to a spot in memory. It’s instant, really.

You may want to pull out your 1000s and 50s into variables so they can be reused without needing to find&replace, but this is the basic idea.
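A minimal sketch of that suggestion, with the dimensions pulled out into constants and the index math wrapped in a small helper (the names here are just placeholders):

    // Sketch only: constant names and the helper are made up for illustration.
    const int Width = 1000, Height = 1000, Depth = 50;
    byte byt = 0; // placeholder value, as in the snippet above
    byte[] allBytes = new byte[Width * Height * Depth];

    // Same index formula as above, just written once.
    int GetIndex(int x, int y, int z)
    {
        return z + y * Depth + x * Depth * Height;
    }

    allBytes[GetIndex(10, 3, 5)] = byt;        // write
    byte value = allBytes[GetIndex(10, 3, 5)]; // read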


Even ignoring performance issues, Vector3 might have issues as a Dictionary key because Unity overrides operator== for Vector3 to test for approximate equality instead of exact equality. That means you could have two Vector3s that compare as equal with operator== but that have different hash codes. That violates the design assumptions of the Dictionary class and could lead to unpredictable/“buggy” behavior.
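Roughly, the mismatch looks like this (illustrative only; the exact tolerance Unity’s operator== uses is an implementation detail):

    Vector3 a = new Vector3(1f, 2f, 3f);
    Vector3 b = new Vector3(1f, 2f, 3.0000001f);

    bool sameByOperator = (a == b);                         // true: operator== is approximate
    bool sameHash = a.GetHashCode() == b.GetHashCode();     // can be false: hashes use the exact float bits

    // A Dictionary<Vector3, byte> populated with 'a' as a key may then
    // fail to find 'b', even though a == b reports true.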

If you are only intending to use integer coordinates, then Vector3Int might be OK.

But if you are filling in all possible integer vectors in a rectangular area, you should almost certainly be using lists/arrays instead of dictionaries. Dictionaries are good for “sparse” data sets where most possible coordinates are unused or when it’s inconvenient to convert your keys into consecutive integers.

Very elegant solution, thank you =D, I will use this instead.

And thanks to everyone else for the information about not using Vector3 as keys. So for future reference, was the problem with the dictionary overload due to the bad key, the number of entries, or a combination of both? Would a dictionary keyed with ints or strings run into the same problem when getting into high amounts?

Combination of both.

The Vector3s definitely made the issue worse.

But yeah a dictionary in the millions is just going to be slow regardless.

Both issues would create different problems. The overload is just due to the number of entries - there are only a fixed number of hash values that are possible, and more values than that will cause collisions to be common, no matter how the hashes are generated. If you do need massive numbers of things in a dictionary, look for ways to subdivide your list into smaller sets - perhaps a Dict of Dicts, where two superfast lookups would replace one increasingly slow lookup.
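As a rough sketch of the "Dict of Dicts" idea (the chunk size and names here are made up for illustration; for the grid in this thread, an array is still the better fit):

    // Outer key = chunk coordinate, inner key = exact position,
    // so each individual dictionary stays comparatively small.
    var chunks = new Dictionary<Vector3Int, Dictionary<Vector3Int, byte>>();

    void SetByte(Vector3Int pos, byte value)
    {
        var chunk = new Vector3Int(pos.x / 16, pos.y / 16, pos.z / 16);
        if (!chunks.TryGetValue(chunk, out var inner))
        {
            inner = new Dictionary<Vector3Int, byte>();
            chunks[chunk] = inner;
        }
        inner[pos] = value;
    }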

Using Vector3 or anything float-based as a key would cause problems due to floating-point imprecision, as described in the article I linked. You weren’t experiencing this problem yet, but probably would have as soon as you started accessing these values.

That’s kind of a complicated question.

Just having a huge amount of data tends to make things slower. To some extent, this is true regardless of what kind of data structure you use or how cleverly you use it, but there are certain things you can do better or worse.

There are several different problems you can run into with dictionaries, and whether you actually will run into them sometimes depends on the distribution of your data and on implementation details that you would usually ignore.

One problem you were very likely running into here is that dynamically-sized data structures need to guess how much memory they should allocate, and then if they guess too low they need to “reallocate” by creating a bigger data structure and copying all of their existing data into the new structure. For instance, if a List has enough space to store 128 items, but then you try to insert 129 items, it needs to create a new chunk of memory to store the new item, but then it also needs to copy all of the previous 128 items into the new memory in order to keep everything “lined up” (otherwise you’d forfeit most of the efficiency advantages of a list). So inserting that 129th item (in this hypothetical example) is way more expensive than the previous inserts, because it implicitly requires the List to copy the other 128.
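You can watch that growth happen by logging List<T>.Capacity as you add items; the exact numbers are an implementation detail, but the pattern of jumps (and the copying they imply) is the point:

    // Sketch: Capacity jumps whenever Count outgrows it.
    var list = new List<int>();
    int lastCapacity = -1;
    for (int i = 0; i < 1000; i++)
    {
        list.Add(i);
        if (list.Capacity != lastCapacity)
        {
            // Each jump means the existing items were copied into a bigger array.
            Debug.Log($"Count {list.Count}, Capacity {list.Capacity}");
            lastCapacity = list.Capacity;
        }
    }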

Dictionaries are more complicated, but they do something similar. If you start with a small dictionary and then just add things one at a time, it will need to periodically copy everything into a bigger space to make room. The bigger the dictionary gets, the more expensive this copying gets. (Data structures like this usually grow exponentially, so that the frequency of reallocations gets smaller as they get more expensive.)

But if you know you’re going to need a really big dictionary, you can avoid all this copying by saying so in advance. There’s a constructor where you specify what “capacity” you want the dictionary to have. If you know you’re going to add a million items, you can ask for a capacity of a million up-front and it will start out big instead of needing to periodically grow.

This also applies to Lists, by the way! If you know you’re going to stuff a million entries into your List, make sure to construct it with a suitably large capacity.
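For example, using the counts from this thread (just a sketch; for this particular grid the array is still the better choice, and 50 million dictionary entries is a lot of memory either way):

    // Pre-sizing avoids the repeated grow-and-copy while populating.
    int total = 1000 * 1000 * 50; // 50 million, as in the original loops
    var dict = new Dictionary<Vector3Int, byte>(total);
    var list = new List<byte>(total);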

Another problem that’s unique to dictionaries is hash collisions. Under the covers, C# Dictionaries use hash tables to store data. Briefly, that means each key you insert is “hashed” to produce a number that’s used to quickly find the right “approximate location” in the dictionary; think of it like turning to the correct page. Dictionaries work most efficiently when all of your keys have different hashes. When there’s only one entry on a given “page”, then you can easily see the answer as soon as you go there. When multiple keys have the same hash, that’s called a “collision”, and it means the “page” gets more crowded, so the dictionary needs to spend more effort actually reading the “page” once it gets there to find the right individual entry.

So the question is: how different are your hashes? That can be a hard question to answer, since it depends both on the hash algorithm being used and on how your data is distributed. For example, imagine that you write a hash function for Vector2Int that simply returns x ^ y. That means (among other things) that the vector (x, y) will always have the same hash as the vector (y, x). If you happen to be using a lot of mirror-image vectors as keys, that could be a problem! But if your vectors are distributed randomly, it might not matter.
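To make that concrete, a deliberately collision-prone hash like the one described might look like this (a hypothetical struct for illustration, not what Unity’s Vector2Int actually does):

    struct GridPoint
    {
        public int x, y;
        public GridPoint(int x, int y) { this.x = x; this.y = y; }

        public override bool Equals(object obj)
        {
            return obj is GridPoint p && p.x == x && p.y == y;
        }

        public override int GetHashCode()
        {
            return x ^ y; // (3, 7) and (7, 3) both hash to 4: mirror-image keys always collide
        }
    }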

Since you are converting 2 ints (x and y coordinates) into a single int, you are always going to get some vectors that end up with the same hash. Theoretically, for any hash algorithm you use, there’s always some data set that will perform really badly with that hash algorithm. The goal is to try to pick one that performs well for “likely” data sets (while also being really fast to calculate).

If you use some obscure data type and the author didn’t really expect it to be used as a dictionary key, then they might not have put very much effort into that…

If your keys are really big, then there’s this annoying issue that writing a hash algorithm that takes into account every detail of the really big thing will take longer to compute, which slows things down. But if you don’t take every detail into account, then sometimes the details you ignore will be the important ones, and then you get a LOT of collisions. I heard one horror story about some company that decided to hash strings by looking at only the first 5 and last 5 characters, so that if you used a million-character-long string they wouldn’t have to hash the whole thing one character at a time. Then someone made a hash table with URLs as keys, so the first 5 characters were mostly “http:” and the last 5 were mostly “.html”, so they all hashed to the same value and the Dictionary tried to put everything on the same “page”…

So nowadays, if you use strings as keys, the hash function is probably looking at the whole string. This is good for safety, but means that using very long strings as keys can cause its own problems…

Collisions get even more complicated because dictionaries don’t actually store every unique hash separately; they sort the hashes into a limited number of bins based on how big the dictionary is. (You wouldn’t really want every dictionary to allocate int.MaxValue “pages” just-in-case, would you?) So sometimes different hashes end up on the same “page” even though they are different.

I see, thanks for the help!

Very interesting, thanks a lot for the information!! I was using dictionaries with string keys that are quite big; they weren’t giving me problems yet, but now I will review them with this in mind.

By the way, how would I go about reversing this formula? Say I know the index in the array and I know the max size of each dimension; how can I solve the equation to get the respective x, y, z,

given the layout provided by StarManta.

I would say something like this. Keep in mind it is not tested.

The Count variables are the number of elements in each dimension; in his example 1000, 1000, 50.

    // All values/results are of type int
    int x = index / (yCount * zCount);
    int y = (index - x * yCount * zCount) / zCount;
    int z = index - (x * yCount * zCount + y * zCount);
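An equivalent way to write it with / and %, plus a quick round-trip check (also an untested sketch, same caveat):

    int x = index / (zCount * yCount);
    int y = (index / zCount) % yCount;
    int z = index % zCount;

    // Round-trip check: this should reproduce the original index.
    int check = z + y * zCount + x * zCount * yCount;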

As the thread is about performance, be sure to use a single array with a stride, or a jagged array if not. Don’t use a multi-dimensional array, as this will be quite a bit slower.

This begins to matter when iterating over quite a lot of elements.
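For reference, the three layouts being compared (illustrative declarations only; x, y, z are assumed to be in-range indices):

    // Single flat array with a manual stride:
    byte[] flat = new byte[1000 * 1000 * 50];
    byte a = flat[z + y * 50 + x * 50 * 1000];

    // Jagged array (arrays of arrays), each sub-array allocated separately:
    byte[][][] jagged = new byte[1000][][];

    // Multi-dimensional array (simplest syntax, slower element access):
    byte[,,] multi = new byte[1000, 1000, 50];
    byte b = multi[x, y, z];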

Thank you!