Something similar is used in Swiss tables: metadata bucket entries are a 1-bit occupancy marker plus 7 bits of the hash (I don't remember how tombstones are represented).
The metadata table is scanned first; those 7 hash bits should filter out most colliding entries, and the "false positives" lead to a probe into the entry table and a key comparison (possibly optimized with a full-hash comparison).
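A minimal sketch of that metadata scan, assuming SSE2 and absl-style control bytes (0x80 meaning empty, top bit clear meaning a full slot holding 7 hash bits); the function and variable names here are illustrative, not from any library:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Compare 16 control bytes against the 7-bit hash fragment h2 in parallel.
// Returns a 16-bit mask with one bit set per slot whose control byte == h2.
inline uint32_t match_group(const uint8_t* ctrl, uint8_t h2) {
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    __m128i probe = _mm_set1_epi8(static_cast<char>(h2));
    // Equal bytes become 0xFF; movemask gathers their top bits into an int.
    return static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(group, probe)));
}
```

Each set bit is a candidate slot; the caller still compares full keys there, so a stray 7-bit match costs only a wasted comparison, never a wrong answer.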
Buckets use 16 bytes because SSE2 and ARM NEON SIMD are basically guaranteed.
I was shocked to read how Swiss tables, proclaimed to be the fastest hash tables, actually work. It's just open-addressing linear probing with no technique to deal with collisions and clustering. Plus, the initial version rounded hash % capacity down to the bucket size, thus using 4 fewer bits of the hash, leading to even more collisions and clustering.
Yet the super-fast probe apparently made it a non-issue? Mind-boggling.
(A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)
A simple linear probe is very efficient if the hash quality is high enough. More complex hash tables have the useful property that their performance is less sensitive to lower hash quality.
Historically, the computational cost of excellent hash quality was too high, so the overall cost of a hash table was lower using a worse hash and more complex table design. Today, it is possible to generate high quality hashes with minimal computational cost, so complex hash table designs are less useful. Modern hash table performance can largely be reduced to the average number of cache lines (both data and instruction) touched by the operations.
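As one example of "high quality at minimal cost": a Murmur3-style 64-bit finalizer is a handful of ALU operations yet mixes every input bit into every output bit. The constants are MurmurHash3's fmix64 constants; the function name is mine:

```cpp
#include <cstdint>

// Murmur3-style fmix64: xor-shift and multiply rounds. A few cycles total,
// yet it has good avalanche behavior (each input bit flips ~half the output).
inline uint64_t mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
```

Running an identity-hashed integer key through something like this before probing is often all it takes to make a plain linear probe behave well.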
The absl tables (the canonical Swisstables) are good, but not the fastest. Of course, everything will depend on your hardware, compiler and usage patterns, but if you look at an all-round benchmark like https://martin.ankerl.com/2022/08/27/hashmap-bench-01/, they're about in the middle of the pack even when picking the best-working hash functions (and really bad if you happen to have a bad one). This mirrors my own experience.
Of course, absl tables will have been improved since 2022, but there are many competitors that have improved as well. And you could argue “well, who ever copies or iterates over a map”, but even if you take away those categories, they're not a clear leader really anywhere else either.
I wouldn't say it's in the middle of the pack. They're very clearly at the front of the pack for most operations, but the author's summarization penalizes its weaknesses more than those of others, because they have a horse in the race.
It's worth mentioning that Boost.Unordered now also has a state-of-the-art open-addressing hash table (boost::unordered_flat_map) that beats Abseil in some benchmarks, particularly for unsuccessful look-up: https://bannalia.blogspot.com/2022/11/inside-boostunorderedf...
absl::flat_hash_map performs quite well on the slightly more "realistic" benchmarks that were added to ankerl's map_benchmark after the article came out (GameOfLife and knucleotide). See this slide: https://www.youtube.com/watch?v=Rg8MZ5pJIJA&t=1996s (the talk is also great, btw).
The Swiss Table is not presently the fastest known design for many purposes; other open-addressed schemes take that honour today. And yes, nobody who wants a fast modern hash table tries to mitigate collisions: just use a decent hash, and collisions are rare enough to have a negligible effect, so long as your scheme degrades rather than failing.
The C++ approach is focused on mitigating the price of collision. Accept that you lost, and now try to make that not sting too badly. This made some sense because their provided hash for integers is typically, literally, the identity function; collisions are thus very frequent and predictable. But "probably we lost, let's mitigate that" isn't a route to success, and it only made sense when memory fetches were cheap and ALU operations were expensive. Forty years ago that was a sane choice; thirty years ago, when C++ standardization was in progress, it was already looking dodgy; fifteen years ago, when they actually standardized this feature, it was already very stupid.
Open addressed strategies begin by assuming you probably won. Your first main memory fetch is probably enough to know if this key is present, and if it is the next fetch probably gets the key and value.
The std::unordered_map strategy will only tell you the key was not present on its first fetch a small fraction of the time; on every other occasion you must do a second fetch to even discover whether the key might be present, and several more fetches are needed on average to get the final key + value, or certainty that it's not found.
I think it just exploits the fact that reading a few contiguous cache lines at a random location isn't a whole lot more expensive than reading a single random byte.
After I started experimenting with my priority queues, I found myself never wanting to use binary heaps, instead using a wider arity whose nodes span at least a whole cache line, if not multiple. Wider nodes meant fewer jumps, and jumping between nodes was more expensive (time, cache misses) than the work done at each node.
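A sketch of that idea, assuming 8-byte keys so that one node's eight children span a single 64-byte cache line (the arity, struct, and method names are my choice, not from any library):

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Toy 8-ary min-heap: each node's 8 children are stored contiguously, so one
// sift-down step scans a single cache line instead of chasing two far-apart
// children as a binary heap does.
struct WideHeap {
    static constexpr std::size_t D = 8;
    std::vector<long> a;

    void push(long v) {
        a.push_back(v);
        std::size_t i = a.size() - 1;
        while (i > 0 && v < a[(i - 1) / D]) {  // sift up
            a[i] = a[(i - 1) / D];
            i = (i - 1) / D;
        }
        a[i] = v;
    }

    long pop() {
        long top = a[0];
        long last = a.back();
        a.pop_back();
        if (!a.empty()) {
            std::size_t i = 0;
            for (;;) {  // sift down: pick the smallest of up to D children
                std::size_t c = i * D + 1;
                if (c >= a.size()) break;
                std::size_t end = std::min(c + D, a.size());
                std::size_t m = c;
                for (std::size_t j = c + 1; j < end; ++j)
                    if (a[j] < a[m]) m = j;
                if (a[m] >= last) break;
                a[i] = a[m];
                i = m;
            }
            a[i] = last;
        }
        return top;
    }
};
```

The trade-off is more comparisons per level against a much shallower tree and fewer cache misses, which is usually a win when comparisons are cheap.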
Tombstones use the same bit as empty (empty and deleted slots don't have hash codes, so there's room).
The key realization behind swissdict is that all the complications people add are only necessary if you have a bad hash function, so you can just use a good one and be happier.
Linear probing also has the significant advantage that it allows removing tombstones whenever the next entry is empty.
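A sketch of that trick on a toy control array (the enum and helper are made up for illustration): since no probe sequence continues past an empty slot, an erased slot followed by Empty can itself become Empty, along with any run of tombstones just before it.

```cpp
#include <vector>
#include <cstddef>

enum Ctrl : unsigned char { Empty, Tombstone, Full };

// Erase slot i in a toy linear-probe table whose control bytes live in ctrl.
// If the next slot is Empty, reclaim this slot (and any preceding tombstone
// run); otherwise a probe chain may pass through here, so leave a tombstone.
// Assumes the table contains at least one non-tombstone slot.
void erase_slot(std::vector<Ctrl>& ctrl, std::size_t i) {
    std::size_t n = ctrl.size();
    if (ctrl[(i + 1) % n] == Empty) {
        ctrl[i] = Empty;
        std::size_t j = (i + n - 1) % n;
        while (ctrl[j] == Tombstone) {
            ctrl[j] = Empty;
            j = (j + n - 1) % n;
        }
    } else {
        ctrl[i] = Tombstone;
    }
}
```

This keeps tombstones from accumulating without any rehash, which quadratic probing and similar schemes can't do as easily because their probe sequences aren't contiguous.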
> Later version allowed to scan from arbitrary position by mirroring first bucket as last.
I don't think this would help. The real issue with an arbitrary position is that you can't load 16 bytes into a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.
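A sketch of that unroll-and-mask idea, assuming SSE2 and an absl-style control array (names are illustrative): align the first load down to a 16-byte boundary, then clear the match bits for slots before the real starting offset.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// First (unrolled) probe iteration: ctrl_aligned points at a 16-byte-aligned
// group; offset (0..15) is where the probe actually starts inside it.
inline uint32_t match_from(const uint8_t* ctrl_aligned, unsigned offset,
                           uint8_t h2) {
    __m128i group =
        _mm_load_si128(reinterpret_cast<const __m128i*>(ctrl_aligned));
    uint32_t mask = static_cast<uint32_t>(_mm_movemask_epi8(
        _mm_cmpeq_epi8(group, _mm_set1_epi8(static_cast<char>(h2)))));
    return mask & (0xFFFFu << offset);  // discard matches before the offset
}
```

Subsequent iterations then advance in full aligned groups with no masking needed.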
It's one of the improvements they claimed in the 2019 presentation. https://youtu.be/JZE3_0qvrMg?feature=shared&t=1054
It reports a 10% speedup on find but a 15% slowdown on insert. The speedup probably comes from using 4 more bits of the hash, which leads to fewer collisions, and the slowdown from the more complicated code for the mirroring.
I'm still confused about the SIMD alignment. There are load instructions with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128); both claim the same latency and throughput.
Somewhere I've heard that the slowdown of unaligned access comes from using more bandwidth to load two aligned 128-bit chunks and compose the unaligned value from them. But I have no idea whether this affects multiple loads of contiguous memory.