
I believe Rust uses hashbrown as the underlying implementation now. This just calculates the number of buckets based on the number of items requested:

https://github.com/rust-lang/hashbrown/blob/009969a860290849...

In the worst case all the items you insert will go in one bucket, and you'll have to rehash, which requires allocation. I'm not sure this is any different from Go in that respect.

I suspect that the inherently probabilistic nature of a hashtable makes it impossible to guarantee capacity without allocating way more space than would be required in typical cases. If there is a clever way to avoid this it would certainly be interesting to read about it.

Edit: Also, Go's hashtable implementation does not use linked lists (though it is a chaining implementation).



Your initial observation is correct: Rust's HashMap is these days hashbrown, which is an implementation of Google's Swiss Tables.

But the "bucket" nomenclature in that file has probably misled you: the buckets it is talking about each hold exactly one item, and they're just stored as one huge growable array (like a Vec). Suppose we do as I mentioned before:

    // key type chosen just for illustration; with_capacity only cares about the entry count
    let mut doodads: HashMap<u64, Doodad> = HashMap::with_capacity(623);
The calculation will do 623 * (8/7) which is 712, and then it'll round up to the next power of two [ultimately because shifts are cheaper elsewhere in the arithmetic] which is 1024. So it allocates 1024 buckets sized to store one Doodad each.

The Swiss Tables algorithm would work up until 1023 Doodads are in the table (one slot must stay empty, but it doesn't matter which one). However, performance trade-offs mean you should resize before that: hashbrown does so after 1024 * (7/8) = 896 items, which you'll observe is indeed "at least" 623 items.
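Roughly, the sizing arithmetic looks like this - a back-of-the-envelope sketch of the calculation described above, not hashbrown's actual code:

    fn buckets_for_capacity(requested: usize) -> usize {
        // ceil(requested * 8 / 7), then round up to the next power of two
        ((requested * 8 + 6) / 7).next_power_of_two()
    }

    fn growth_threshold(buckets: usize) -> usize {
        // 7/8 load factor: grow once this many items are in the table
        buckets / 8 * 7
    }

    fn main() {
        let buckets = buckets_for_capacity(623);
        assert_eq!(buckets, 1024);                  // 623 * 8/7 = 712, rounded up to 1024
        assert_eq!(growth_threshold(buckets), 896); // resize after 1024 * 7/8 = 896 items
    }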

> In the worst case all the items you insert will go in one bucket, and you'll have to rehash

In the worst case - which is incredibly unlikely with a good hash, and you'll notice Rust supplies a good hash by default - we'll insert 623 items which all have the same hash, so ideally they'd all be in the same bucket; instead they take up 623 contiguous buckets, all our searches become linear, and our performance is poor, with an average probe length of 312. But we don't in fact grow. Even if you could argue we should, this case is so unlikely that it isn't catered for: you won't find any code to detect this scenario, let alone do anything about it.

> If there is a clever way to avoid this it would certainly be interesting to read about it.

All the modern Open Addressed hash tables avoid this by just not caring about the incredibly unlikely "but what if all my data hashes to the same value?" case. This means that if you did hit that case their performance would be abysmal, but you won't, so who cares. They just degrade gracefully as we get progressively unluckier, whereas Go's approach is forced to allocate as soon as we get unlucky enough.

It might seem like Go's approach is great if we've got plenty of memory and don't mind allocations, since at least we don't degrade performance when we're unlucky. Unfortunately that non-degraded performance isn't very good. Go would choose 128 buckets with space for 8 Doodads in each bucket, but this means most of our buckets have 4 or 5 Doodads in them, and we always try them in order, so a look-up actually probes considerably more Doodads than Open Addressing normally does. The Swiss Tables have to get pretty unlucky before they are that bad, and whenever they're not unlucky they're doing much better...
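For comparison, here is roughly how that 128-bucket figure falls out of Go's sizing rule - a sketch in Rust for consistency with the snippet above, assuming (as I understand it) the 8-slot buckets and 6.5 target load factor used by Go's runtime map, so this is illustrative rather than the runtime's actual code:

    // Sketch: double the bucket count until the hint fits under an average of
    // 6.5 items per 8-slot bucket.
    fn go_bucket_count(hint: usize) -> usize {
        let mut buckets: usize = 1;
        while (hint as f64) > 6.5 * (buckets as f64) {
            buckets *= 2;
        }
        buckets
    }

    fn main() {
        let buckets = go_bucket_count(623);
        assert_eq!(buckets, 128); // 6.5 * 128 = 832 >= 623, so 128 buckets of 8 slots
        println!("average occupancy: {:.2}", 623.0 / buckets as f64); // ~4.87 per bucket
    }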


> In the worst case - which is incredibly unlikely with a good hash, and you'll notice Rust supplies a good hash by default - we'll insert 623 items which all have the same hash, so ideally they'd all be in the same bucket; instead they take up 623 contiguous buckets, all our searches become linear, and our performance is poor, with an average probe length of 312.

Yeah, this is the nub of the issue. You’re avoiding allocation by giving up amortized O(k) lookup. So there’s a trade-off here. Roughly, you have a choice between getting slow in unlikely scenarios or allocating in unlikely scenarios. In typical Go code, neither option is obviously better or worse in general. In the context of Rust, it makes sense to avoid allocations.

Preallocating hashtables of fixed size seems like a rare use case to me, though I don’t doubt that it might come up from time to time.


> You’re avoiding allocation by giving up amortized O(k) lookup

What's "k" here ?

Here's an experiment - I know how to write it in Rust, but not in Go, so let's make it a thought experiment. We have 623 custom Doodads which we've arranged (this is the part I know how to do in Rust but not Go) to have identical hashes despite being different. This clearly could happen (though it's fantastically unlikely) for real data, but here we're doing it artificially.
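The Rust half of the rig is only a few lines - a sketch using a hasher that simply ignores its input (the type names are mine):

    use std::collections::HashMap;
    use std::hash::{BuildHasher, Hasher};

    // A hasher that discards its input, so every key gets the same hash.
    struct ConstantHasher;

    impl Hasher for ConstantHasher {
        fn write(&mut self, _bytes: &[u8]) {}
        fn finish(&self) -> u64 { 42 }
    }

    struct ConstantState;

    impl BuildHasher for ConstantState {
        type Hasher = ConstantHasher;
        fn build_hasher(&self) -> ConstantHasher { ConstantHasher }
    }

    fn main() {
        // 623 distinct keys, all colliding: the table never needs to grow beyond
        // its initial 1024 buckets, but every lookup degrades to a linear scan.
        let mut doodads: HashMap<u32, &str, ConstantState> =
            HashMap::with_capacity_and_hasher(623, ConstantState);
        for i in 0..623u32 {
            doodads.insert(i, "doodad");
        }
        assert_eq!(doodads.len(), 623);
    }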

With Swiss Tables, and several other techniques that come to mind, we end up with 623 contiguous entries and a 312 average probe length, which means our performance sucks, but we didn't allocate beyond the original 1024 single-item "buckets".

What happens with Go's approach? I think it ends up with all buckets empty except one; that one is full and has 77 overflow buckets chained to it. In this case we've also got a 312 average probe length and the performance sucks even worse. We have to allocate those 77 overflow buckets, and if we make the mistake of trying to resize we allocate for that too and improve nothing at all.

ETA: Oh, and pre-allocating makes sense when we know up front how many things will go in the hash table, which happens really often when we're building it once, using it, and then throwing it away. It will usually make sense to pre-allocate even if the table will later grow, because we can skip all the little sizes that we'd otherwise have to grow past.


I think you’re looking at the truly pathological case where the entire value returned by the hash function is identical for all keys. I’m thinking of the more realistic case where it’s identical modulo the number of buckets (for a small number of buckets). In that case, rehashing is likely to help (since if you double the number of buckets, you can use one extra bit of the hash value). In this scenario I think Go will rehash whereas Rust won’t. At the very least, Go in principle has the option of rehashing in this scenario, as it makes no guarantee regarding allocations.
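Concretely, with a power-of-two table the bucket index is just the low bits of the hash, so doubling the table consults one more bit and can split keys that previously collided. A toy sketch of that point, not any particular implementation:

    fn bucket_index(hash: u64, buckets: u64) -> u64 {
        hash & (buckets - 1) // buckets assumed to be a power of two
    }

    fn main() {
        // Two hashes that differ only in bit 4: they collide with 16 buckets,
        // but land in different buckets once the table doubles to 32.
        let (a, b) = (0b1_0110_u64, 0b0_0110_u64);
        assert_eq!(bucket_index(a, 16), bucket_index(b, 16));
        assert_ne!(bucket_index(a, 32), bucket_index(b, 32));
    }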

Preallocating makes sense, yes, but you can successfully preallocate in Go most of the time. Scenarios where you need guaranteed preallocation strike me as rare. At least, it’s not something I’ve ever found myself worrying about.


If it's happening because we genuinely got unlucky, the strategy you describe is no more likely to help than anything else. If it's happening because your hash function is rubbish, your strategy works - but using a hash function which isn't rubbish also works for the good hash table designs, and unlocks far better performance in the process.



