Something similar is used in Swiss tables: metadata bucket entries are a 1-bit occupancy marker plus 7 bits of the hash (I don't remember how tombstones are represented).
The metadata table is scanned first; those 7 hash bits should filter out most colliding entries, and the "false positives" lead to a probe into the entry table and a key comparison (possibly optimized with a full-hash comparison).
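A minimal sketch of that metadata scan, assuming SSE2 and absl-style control bytes (0x80 meaning empty, top bit clear meaning a full slot holding 7 hash bits); the function and variable names here are illustrative, not from any library:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Compare 16 control bytes against the 7-bit hash fragment h2 in parallel.
// Returns a 16-bit mask with one bit set per slot whose control byte == h2.
inline uint32_t match_group(const uint8_t* ctrl, uint8_t h2) {
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    __m128i probe = _mm_set1_epi8(static_cast<char>(h2));
    // Equal bytes become 0xFF; movemask gathers their top bits into an int.
    return static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(group, probe)));
}
```

Each set bit is a candidate slot; the caller still compares full keys there, so a stray 7-bit match costs only a wasted comparison, never a wrong answer.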
Buckets use 16 bytes because SSE2 and ARM NEON SIMD are basically guaranteed.
I was shocked to read how Swiss tables, proclaimed to be the fastest hash tables, actually work. It's just open-addressing linear probing with no technique to deal with collisions and clustering. Plus, the initial version rounded hash % capacity down to the bucket size, thus using 4 fewer bits of the hash, leading to even more collisions and clustering.
Yet the super-fast probe apparently made it a non-issue? Mind-boggling.
(A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)
A simple linear probe is very efficient if the hash quality is high enough. More complex hash tables have the useful property that their performance is less sensitive to lower hash quality.
Historically, the computational cost of excellent hash quality was too high, so the overall cost of a hash table was lower using a worse hash and more complex table design. Today, it is possible to generate high quality hashes with minimal computational cost, so complex hash table designs are less useful. Modern hash table performance can largely be reduced to the average number of cache lines (both data and instruction) touched by the operations.
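As one example of "high quality at minimal cost": a Murmur3-style 64-bit finalizer is a handful of ALU operations yet mixes every input bit into every output bit. The constants are MurmurHash3's fmix64 constants; the function name is mine:

```cpp
#include <cstdint>

// Murmur3-style fmix64: xor-shift and multiply rounds. A few cycles total,
// yet it has good avalanche behavior (each input bit flips ~half the output).
inline uint64_t mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
```

Running an identity-hashed integer key through something like this before probing is often all it takes to make a plain linear probe behave well.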
The absl tables (the canonical Swisstables) are good, but not the fastest. Of course, everything will depend on your hardware, compiler and usage patterns, but if you look at an all-round benchmark like https://martin.ankerl.com/2022/08/27/hashmap-bench-01/, they're about in the middle of the pack even when picking the best-working hash functions (and really bad if you happen to have a bad one). This mirrors my own experience.
Of course, absl tables will have been improved since 2022, but there are many competitors that have improved as well. And you could argue “well, who ever copies or iterates over a map”, but even if you take away those categories, they're not a clear leader really anywhere else either.
I wouldn't say it's in the middle of the pack. They're very clearly at the front of the pack for most operations, but the author's summarization penalizes its weaknesses more than those of others, because they have a horse in the race.
It's worth mentioning that Boost.Unordered now also has a state-of-the-art open-addressing hash table (boost::unordered_flat_map) that beats Abseil in some benchmarks, particularly for unsuccessful look-up: https://bannalia.blogspot.com/2022/11/inside-boostunorderedf...
absl::flat_hash_map performs quite well on the slightly more "realistic" benchmarks that were added to ankerl's map_benchmark after the article came out (GameOfLife and knucleotide). See this slide: https://www.youtube.com/watch?v=Rg8MZ5pJIJA&t=1996s (the talk is also great, btw).
The Swiss Table is not presently the fastest known design for many purposes; other open-addressed schemes take that honour today. And yes, nobody who wants a fast modern hash table tries to mitigate collisions: just use a decent hash, and collisions are rare enough to have a negligible effect, so long as your scheme degrades rather than failing.
The C++ approach is focused on mitigating the price of collision. Accept that you lost, and now try to make that not sting too badly. This made some sense because their provided hash for integers is typically, literally, the identity function; collisions are thus very frequent and predictable. But "probably we lost, let's mitigate that" isn't a route to success, and it only made sense when memory fetches were cheap and ALU operations were expensive. Forty years ago that was a sane choice; thirty years ago, when C++ standardization was in progress, it was already looking dodgy; fifteen years ago, when they actually standardized this feature, it was already very stupid.
Open addressed strategies begin by assuming you probably won. Your first main memory fetch is probably enough to know if this key is present, and if it is the next fetch probably gets the key and value.
The std::unordered_map strategy will only tell you the key was not present on its first fetch a small fraction of the time; on every other occasion you must do a second fetch to even discover whether the key might be present, and several more fetches are needed on average to get the final key + value, or certainty that it's not found.
I think it just exploits the fact that reading a few contiguous cache lines at a random location isn't a whole lot more expensive than reading a single random byte.
After I started experimenting with my priority queues, I found myself never wanting to use binary heaps, instead using a wider arity whose nodes span at least a whole cache line, if not multiple. Wider nodes meant fewer jumps, and jumping between nodes was more expensive (time, cache misses) than the work done at each node.
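A sketch of that idea, assuming 8-byte keys so that one node's eight children span a single 64-byte cache line (the arity, struct, and method names are my choice, not from any library):

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Toy 8-ary min-heap: each node's 8 children are stored contiguously, so one
// sift-down step scans a single cache line instead of chasing two far-apart
// children as a binary heap does.
struct WideHeap {
    static constexpr std::size_t D = 8;
    std::vector<long> a;

    void push(long v) {
        a.push_back(v);
        std::size_t i = a.size() - 1;
        while (i > 0 && v < a[(i - 1) / D]) {  // sift up
            a[i] = a[(i - 1) / D];
            i = (i - 1) / D;
        }
        a[i] = v;
    }

    long pop() {
        long top = a[0];
        long last = a.back();
        a.pop_back();
        if (!a.empty()) {
            std::size_t i = 0;
            for (;;) {  // sift down: pick the smallest of up to D children
                std::size_t c = i * D + 1;
                if (c >= a.size()) break;
                std::size_t end = std::min(c + D, a.size());
                std::size_t m = c;
                for (std::size_t j = c + 1; j < end; ++j)
                    if (a[j] < a[m]) m = j;
                if (a[m] >= last) break;
                a[i] = a[m];
                i = m;
            }
            a[i] = last;
        }
        return top;
    }
};
```

The trade-off is more comparisons per level against a much shallower tree and fewer cache misses, which is usually a win when comparisons are cheap.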
Tombstones use the same bit as empty (empty and deleted slots don't have hash codes, so there's room).
The key realization behind swissdict is that all the complications people add are only necessary if you have a bad hash function, so you can just use a good one and be happier.
Linear probing also has the significant advantage that it allows removing tombstones whenever the next entry is empty.
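A sketch of that trick on a toy control array (the enum and helper are made up for illustration): since no probe sequence continues past an empty slot, an erased slot followed by Empty can itself become Empty, along with any run of tombstones just before it.

```cpp
#include <vector>
#include <cstddef>

enum Ctrl : unsigned char { Empty, Tombstone, Full };

// Erase slot i in a toy linear-probe table whose control bytes live in ctrl.
// If the next slot is Empty, reclaim this slot (and any preceding tombstone
// run); otherwise a probe chain may pass through here, so leave a tombstone.
// Assumes the table contains at least one non-tombstone slot.
void erase_slot(std::vector<Ctrl>& ctrl, std::size_t i) {
    std::size_t n = ctrl.size();
    if (ctrl[(i + 1) % n] == Empty) {
        ctrl[i] = Empty;
        std::size_t j = (i + n - 1) % n;
        while (ctrl[j] == Tombstone) {
            ctrl[j] = Empty;
            j = (j + n - 1) % n;
        }
    } else {
        ctrl[i] = Tombstone;
    }
}
```

This keeps tombstones from accumulating without any rehash, which quadratic probing and similar schemes can't do as easily because their probe sequences aren't contiguous.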
> Later version allowed to scan from arbitrary position by mirroring first bucket as last.
I don't think this would help. The real issue with an arbitrary position is that you can't load 16 bytes into a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.
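A sketch of that unroll-and-mask idea, assuming SSE2 and an absl-style control array (names are illustrative): align the first load down to a 16-byte boundary, then clear the match bits for slots before the real starting offset.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// First (unrolled) probe iteration: ctrl_aligned points at a 16-byte-aligned
// group; offset (0..15) is where the probe actually starts inside it.
inline uint32_t match_from(const uint8_t* ctrl_aligned, unsigned offset,
                           uint8_t h2) {
    __m128i group =
        _mm_load_si128(reinterpret_cast<const __m128i*>(ctrl_aligned));
    uint32_t mask = static_cast<uint32_t>(_mm_movemask_epi8(
        _mm_cmpeq_epi8(group, _mm_set1_epi8(static_cast<char>(h2)))));
    return mask & (0xFFFFu << offset);  // discard matches before the offset
}
```

Subsequent iterations then advance in full aligned groups with no masking needed.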
It's one of the improvements they claimed in the 2019 presentation. https://youtu.be/JZE3_0qvrMg?feature=shared&t=1054
It reports a 10% speedup on find but a 15% slowdown on insert. The speedup probably comes from using 4 more bits of the hash, which leads to fewer collisions, and the slowdown from the more complicated code for the mirroring.
I'm still confused about the SIMD alignment. There are load instructions with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128); both claim the same latency and throughput.
Somewhere I've heard that the slowdown of unaligned access comes from using more bandwidth to load two aligned 128-bit chunks and compose the unaligned value from them. But I have no idea whether this affects multiple loads of contiguous memory.