Vector2Int.GetHashCode - me making weird little musings about hashcodes

So I’m here to just have a little musing about GetHashCode and some weirdness that can come of it. I thought this find was interesting, and it’s on a topic that I’ve found throughout my career to be a bit cryptic amongst developers. So bear with me as I go on my little diatribe and maybe you’ll find it funny as well.

So in C# all objects inherit a method called ‘GetHashCode’, which is used in various ways throughout the .NET library. An example would be Dictionary or HashSet, in which the hashcode is used to generate indexes into a lookup table (for more info on the implementations of these, check out the ‘hash table’ article on wiki).

Basically GetHashCode will return an integer associated with the state of that value. In the case of, say, an int it’ll just be the int itself. For a float it’ll be the bit-wise representation of the float. For more complicated things, usually some bitwise operations will be performed on the members of the object. A string might XOR together a bunch of the characters in the string. Or a struct of multiple variables will just combine together the hashcodes of each member of the struct. Ref types use the default object.GetHashCode, which returns the “sync block index” (which is out of the scope of this post).

When you define a struct/class you can override GetHashCode to define how it calculates its hashcode. MOST of the time you don’t need to worry about this. But sometimes you do. And knowing what to write for a GetHashCode implementation can be very difficult… and this is why I wanted to chat here. It’s just one of those topics that like no one talks about.
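To make the contract concrete, here’s a minimal sketch (the `GridCoord` struct is a made-up example type, not anything from Unity): if two values compare Equal, they must return the same hash code, and the hash is built from the same members that participate in equality.

```csharp
using System;

// Hypothetical example type: a grid coordinate that overrides the
// Equals/GetHashCode pair together (the two must always agree).
struct GridCoord : IEquatable<GridCoord>
{
    public int X, Y;
    public GridCoord(int x, int y) { X = x; Y = y; }

    public bool Equals(GridCoord other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is GridCoord g && Equals(g);

    // Combine the members: equal (X, Y) pairs always produce the same
    // result (the contract); unequal pairs should collide rarely.
    public override int GetHashCode() => (X << 16) ^ Y;
}
```

The reverse is not required: two unequal values are allowed to share a hash code (that’s just a collision), which is exactly the quality-of-implementation issue the rest of this post is about.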

So with that preface out of the way… here is what happened today.

So I needed to write an algorithm where I wanted to traverse a graph of nodes in a grid, for which the only real unique identifier I had was an (x,y) coordinate. So I used a Vector2Int to represent said coordinate. Now I needed a way to quickly recall if I had already ‘visited’ a node… so I decided to use a HashSet since it’s a unique unordered collection (a value only exists once in it). My code effectively looks like this:

var targets = new List<Vector2Int>();
//SNIPPED - fill targets with like 4 goals

var q = new Queue<Vector2Int>();
var visited = new HashSet<Vector2Int>();
var p = new Vector2Int(); //scratch coordinate reused for the neighbor checks below
q.Enqueue(targets.First());

//SNIP - some random logic you don't need

while (q.Count > 0 && targets.Count > 0)
{
	var n = q.Dequeue();
	visited.Add(n);
	targets.Remove(n); //this only removes IF it was a target, List is always so small the cost is negligible

	p.Set(n.x - 1, n.y);
	if (visited.Add(p) && !this.Occupied(p)) q.Enqueue(p);
	p.Set(n.x + 1, n.y);
	if (visited.Add(p) && !this.Occupied(p)) q.Enqueue(p);
	p.Set(n.x, n.y - 1);
	if (visited.Add(p) && !this.Occupied(p)) q.Enqueue(p);
	p.Set(n.x, n.y + 1);
	if (visited.Add(p) && !this.Occupied(p)) q.Enqueue(p);
}

return targets.Count > 0;

(this is basically just a flood fill that tests if the targets exist in the same fill region)

Here’s the thing: for small grids this is pretty fast. But when I start growing the grid it gets slower and slower for obvious reasons… more nodes to ‘visit’.

But the speed was weirdly SLOW IMO… a 100x100 grid was coming back with > 1s times. Which threw me off. So I pop open the trusty profiler to see what was going on. And it pointed to the ‘HashSet.Add’ method… digging deeper it was the Vector2Int.GetHashCode of all things!

Wat?

Now don’t get me wrong. I know that HashSet.Add can technically be slow due to its reliance on this hashcode. It’s why HashSet having a complexity of O(1) for add/remove/contains doesn’t necessarily mean it’s faster than List (which is O(n)); it just means that for larger collections it’ll scale more flatly in complexity.

But this is a large collection situation with 10,000 nodes! And the speeds I was getting were behaving more like the speeds I’d expect from a List. So I started digging deeper. I went to ‘Vector2Int.GetHashCode’ and found this:

public override int GetHashCode()
{
	return x.GetHashCode() ^ (y.GetHashCode() << 2);
}

What is this gibberish, you might be pondering… well, what it’s doing is taking the hashcode of x and XORing it with the hashcode of y bit-shifted 2 bits to the left. Using XOR in hashcode calculations is actually really common. We need to combine hashcodes, and our fastest operations for that are generally AND, OR, and XOR. Thing is, AND tends towards stripping high bits (1’s) and OR tends towards proliferating high bits (1’s)… meaning the former tends towards 0 and the latter tends towards all bits set (2^32 - 1).

And here’s the thing, when generating hash values we want noisiness. We don’t want dissimilar objects/values generating similar hash codes. But if AND and OR tend towards all high bits or all low bits… that means our algorithm would tend towards a small number of hash codes that are equal more often.

This is called a “collision”.

So XOR is generally the preferred way to combine hashcodes as it doesn’t tend to strip or proliferate high bits like AND and OR. You end up with noisy values. And we want noisy values.
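You can see those tendencies directly by folding a handful of values with each operator (the sample numbers here are arbitrary, just for illustration):

```csharp
static class FoldDemo
{
    // Fold a list of values with AND, OR, and XOR to compare how each
    // operator behaves when used to combine many hashcodes.
    public static (int andFold, int orFold, int xorFold) Fold(int[] samples)
    {
        int andFold = -1, orFold = 0, xorFold = 0;
        foreach (var s in samples)
        {
            andFold &= s; // a 1 bit survives only if present in EVERY value
            orFold  |= s; // a 1 bit survives if present in ANY value
            xorFold ^= s; // bits toggle; result stays roughly as noisy as the inputs
        }
        return (andFold, orFold, xorFold);
    }
}
```

Feed it a few values and the AND result collapses towards 0, the OR result towards all ones, while the XOR result stays a mixed bag of bits.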

So you’d think this algorithm was good then! We even threw in a little 2-bit shift in case x and y were equal (if they’re equal we could accidentally get 0, since anything XORed with itself is 0).

So what’s the problem? Why is this still slow? It’s approaching the problem the way one would expect to approach it. Use bitwise operations that generate more noise!

Here’s the thing… and I’ll mention this now, I dig through the Unity api a lot in my spare time. My IDE will decompile it for you, and the github from unity is always up as well. And I like to just read the code. So when I say that reading this GetHashCode implementation was familiar to me, well that’s because it was, I’ve seen it elsewhere in the unity library.

Here is Vector2.GetHashCode:
public override int GetHashCode()
{
	return x.GetHashCode() ^ (y.GetHashCode() << 2);
}

It’s identical!

I wouldn’t be surprised if whoever implemented Vector2Int just… copied Vector2 and replaced all floats with ints? I mean… why not? I’d do that!

But therein lies the crucial flaw… this GetHashCode implementation was written with floats in mind. Not ints.

See, int.GetHashCode just returns the int (why convolute what is already an int?). 2.GetHashCode returns 2. 537432.GetHashCode returns 537432.

float on the other hand though, it returns the bitwise representation of the float:
(IEEE 754 single-precision float layout: 1 sign bit, 8 exponent bits, 23 mantissa bits)

And the bitwise representation of a float is noisy. There’s a sign bit, an exponent, and a mantissa. The high bits are scattered across the entire 32 bits of the value. Even if the value is just 1, there are several bits all over representing that:

0011 1111 1000 0000 0000 0000 0000 0000

And nearby values like say 2 are very dissimilar:

0100 0000 0000 0000 0000 0000 0000 0000

And of course fractional values like 2.59809 are wildly bizarre:

0100 0000 0010 0110 0100 0111 0001 1011

Integers though are just:

0000 0000 0000 0000 0000 0000 0000 0001
0000 0000 0000 0000 0000 0000 0000 0010
0000 0000 0000 0000 0000 0000 0000 0011
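You can check those bit patterns yourself. On modern .NET, float.GetHashCode returns exactly those raw bits (modulo special handling for NaN and ±0), and `BitConverter.SingleToInt32Bits` exposes them directly (on older frameworks you’d round-trip through `BitConverter.GetBytes` instead):

```csharp
using System;

static class BitPatternDemo
{
    // Returns the raw IEEE 754 bit pattern of a float as a 32-char binary string.
    public static string FloatBits(float f)
        => Convert.ToString(BitConverter.SingleToInt32Bits(f), 2).PadLeft(32, '0');
}
```

So 1f hashes to the bit pattern 0x3F800000 while the int 1 hashes to, well, 1.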

If we XOR together two floats we can get noisy values with high bits scattered front to back of the value.

The int though… well, it depends on how big the int is. And well… when using a Vector2Int, which generally represents coordinates, those values aren’t going to get really big. My 100x100 grid consists of all values in the 0-99 range. That’s what… the first 7 bits of the number? Everything else is zero no matter how many times you XOR them together. Even with that << 2 we top out at 9 bits, and that’s only for the upper portion of the grid.

So what this means is that while I’m feeding Vector2Int’s into this HashSet and it calculates those hashcodes we’re getting tons of collisions. We have 10,000 unique values to fit in a bit depth that contains 512 unique hash codes (and that distribution is NOT consistent). We’re invariably going to get collisions!
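The collision rate is easy to measure. Here’s a sketch that hashes every coordinate of a grid with the same x ^ (y << 2) formula (plain ints stand in for Vector2Int so it runs outside Unity; for ints, GetHashCode is the value itself) and counts how many distinct hash codes actually come out:

```csharp
using System.Collections.Generic;

static class CollisionDemo
{
    // Count distinct hash codes produced by the Vector2Int-style formula
    // x ^ (y << 2) over a width x height grid of coordinates.
    public static int CountDistinct(int width, int height)
    {
        var seen = new HashSet<int>();
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                seen.Add(x ^ (y << 2)); // int.GetHashCode() is just the int
        return seen.Count;
    }
}
```

For a 100x100 grid every hash fits in 9 bits, so at most 512 of the 10,000 coordinates can get distinct codes; the rest are all collisions.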

And if you understand how a HashSet or Dictionary works: it calculates the hashcode, then uses that to generate an index in an array by taking it modulo the length of the array. In the case of C# the length of that array is some prime number in the vicinity of the number of elements in the collection. It uses a prime number so that the modulo of the hashcode against it further reduces collisions (a prime has no factors aside from 1 and itself).
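As a rough sketch of that mapping (this mirrors the idea, not the exact .NET source):

```csharp
static class BucketDemo
{
    // Map a 32-bit hash code to an array slot: clear the sign bit so the
    // value is non-negative, then take it modulo the (prime) bucket count.
    public static int BucketIndex(int hashCode, int bucketCount)
        => (hashCode & 0x7FFFFFFF) % bucketCount;
}
```

Two different hash codes can still land in the same slot after the modulo, and identical hash codes always do, which is where the next bit comes in.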

So what does it do if there is a collision in some slot of the array? Well, it has what are called ‘buckets’. The colliding entries get chained together under that slot. So when you go to look up a value (say you call Contains or Add) it goes to the slot, finds multiple entries, and now has to loop over them linearly to see if any of them match.

BASICALLY… by having a hashcode with a high rate of collisions we’ve effectively turned our HashSet into an O(n) collection (well, not quite O(n), but very nearly). And this is why I was getting times that felt more like a List than a HashSet as the set grew.

So how did I remedy this?

class FastVector2IntComparer : IEqualityComparer<Vector2Int>
{
	public static readonly FastVector2IntComparer Default = new();

	public bool Equals(Vector2Int x, Vector2Int y)
	{
		return x.x == y.x && x.y == y.y;
	}

	public int GetHashCode(Vector2Int v)
	{
		//pack x into the high 16 bits and y into the low 16 bits
		return (v.x << 16) | (v.y & 0xFFFF);
	}
}

I implemented my own equality comparer, passed that into my HashSet:

var visited = new HashSet<Vector2Int>(FastVector2IntComparer.Default);

And used that instead. My result was orders of magnitude faster! To the point that my algorithm became usable for larger graphs (arguably I could also use a less naive algorithm, but that’s not the point of this thread… the point of this thread is to discuss GetHashCode).

And how did I do it? Well I used more of the space available to me in the 32-bit hashcode. I’m splitting the 2 ints across the upper and lower 16 bits.

I’m not using XOR though? Why am I not using XOR? Isn’t that better for combining values? Sure, but in my algorithm I’ve masked the respective ints down to 16 bits each. XOR and OR are effectively the same operation here because I know those bit ranges don’t overlap.

(note I could have also just done ((v.x << 16) ^ v.y), but again less fun for the discussion…).

Now for my 10,000 nodes I get 10,000 unique hashcodes, which speeds up our HashSet. And I mean really… how often do you have a Vector2Int where x or y is > 65,535? You probably don’t… so this will make pretty much all hashcodes in the commonly used Vector2Int space unique.
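Running the same 100x100 distinct-hash count against the packed formula confirms it (again with plain ints standing in for Vector2Int so it runs anywhere):

```csharp
using System.Collections.Generic;

static class PackedHashDemo
{
    // The comparer's hash: x in the high 16 bits, y in the low 16 bits.
    public static int PackedHash(int x, int y) => (x << 16) | (y & 0xFFFF);

    public static int CountDistinct(int width, int height)
    {
        var seen = new HashSet<int>();
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                seen.Add(PackedHash(x, y));
        return seen.Count; // one unique hash per coordinate in this range
    }
}
```

Since every coordinate pair in the 0-99 range maps to its own 32-bit value, all 10,000 hashes come back distinct.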

Anyways… I thought that’d be a fun conversation.

10 Likes

Oh the joy of hash codes.

These days I just use System.HashCode.Combine and don’t think too much about it.
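For anyone following along, HashCode.Combine runs its inputs through a proper mixing function (with a randomized per-process seed), so even tiny coordinates spread across the full 32-bit space. A quick sketch, assuming a runtime where System.HashCode is available (plain ints standing in for Vector2Int components):

```csharp
using System;
using System.Collections.Generic;

static class CombineDemo
{
    // Count distinct hashes HashCode.Combine produces over a grid of
    // small int pairs, mirroring the Vector2Int scenario above.
    public static int CountDistinct(int width, int height)
    {
        var seen = new HashSet<int>();
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                seen.Add(HashCode.Combine(x, y));
        return seen.Count;
    }
}
```

A 100x100 grid comes back with essentially all 10,000 hashes distinct. Note the values are only stable within a single process run, so never persist them.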

If anything your experience highlights the importance of using custom equality comparers for collections like dictionaries and hash sets.

4 Likes

I’ll throw in that Unity Mathematics’s int2 has its own little hashing going on for GetHashCode

If only that were Burst-compatible.

2 Likes

Speaking of Burst: for my real-world situation I’ll likely go with writing something that is more Burst-compatible. This version of the algorithm is working fine enough for now during this early phase of prototyping, as well as to highlight for a discussion (it’s less about the code and more about the thought process).

So the tl;dr is that Vector2Int has a horrible hashcode implementation. It’s probably also one of those things that can never be fixed, because somebody might have put the hash code into some serialized data somewhere and expect it to persist. Oh no.

Good to see that int2 has a decent looking hash code implementation. I’m curious, though, did you ever try to just use (int, int) and see if that hashes decently? I have no clue how C# deals with tuple hashes.

It’s not quite a matter of what C# does, more so what the implementation of ValueTuple<…> does.

Modern .NET does this, Unity’s class libraries do this, this and this.

3 Likes

Nice. Those look pretty sensible, so it might be that just using (int, int) is the correct approach here. Vector2Int does have some methods that work on it, but if the intent of the code is to just have two ints instead of having a 2D vector in integer space that represents a pixel box or whatever, then just two ints is probably the correct thing to use.
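A quick sketch of what that looks like: ValueTuple gives you structural equality and a combined hash for free, so the visited set needs no custom comparer at all.

```csharp
using System.Collections.Generic;

static class TupleSetDemo
{
    // Track visited grid coordinates as plain (int, int) tuples;
    // the tuple's built-in Equals/GetHashCode handle deduplication.
    public static int CountVisited(int width, int height)
    {
        var visited = new HashSet<(int x, int y)>();
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                visited.Add((x, y));
        return visited.Count;
    }
}
```

One caveat: the quality of the tuple hash depends on the runtime’s ValueTuple implementation, so it’s worth checking which one your Unity profile actually ships before relying on it.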

1 Like

lol, yeah, I thought about this as well. It’s why I just talked about it more as a fun little meandering, and not a “HEY UNITY YOU SUCK” post. Mainly because… I totally get how that kind of stuff gets into the code base and then gets stuck.

This is a brilliant way of coping with the problem. Can I use it in my project?

This is a nice add-on functionality you’ve got rolling there; I made a custom equality comparer just recently (for something I wrote here) but never thought of passing it as a comparer for the hash collections.

My solution was to implement a lightweight struct for dictionary use only, then convert between them on the fly. Though I’m surprised that you’ve discovered this issue so recently, because hashing is super useful for procedural meshes, and especially for detecting/cleaning duplicate vertices.

There are so many small and subtle issues with the Vector types (hash codes, not implementing IEquatable, implicit conversions, not having component-wise multiplication, swizzling, [edit] someone’s weird choice to override the equality operator, etc etc), I wish they would revise them somehow.

At this rate they will never do a clean slate, tech debt will eat us all.

It’s brilliant isn’t it?

It’s just xxHash if I’m not mistaken, it should be implementable for Burst.
In fact, here you go.

Edit:
xxHash link
Proof that HashCode.Combine uses xxHash

1 Like

You’re welcome to use whatever I’ve posted here on the forums.

Honestly this post was more me musing on about something as I was looking for a reason to post since the transition to the new forum layout. Trying to get used to this new getup… I’m an old man set in my old ways. It’s taking me time accepting it and I figure me going on a classic wall-o-text about a topic that I’ve noticed juniors often struggle with would be a way for me to get over it.

2 Likes