the "XOR in hash functions" mentioned at the end most certainly refers to Zobrist hashing, the actually useful XOR trick.
Requires fixed-length keys. Generate a random table for each byte position; to hash a key, look up each byte in its table and XOR the results together.
This is used in chess/go/shogi/... engines because the board/position representation is fixed-length and you can undo moves easily - XOR out the byte from the previous state and XOR in the byte from the new state.
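A minimal sketch of the scheme as described (assuming a toy fixed 8-byte key; the table is just seeded random numbers):

```go
package main

import (
	"fmt"
	"math/rand"
)

const keyLen = 8 // fixed-length keys, e.g. a tiny board encoding

// One table of 256 random values per byte position.
var zobrist [keyLen][256]uint64

func init() {
	rng := rand.New(rand.NewSource(1))
	for pos := range zobrist {
		for b := range zobrist[pos] {
			zobrist[pos][b] = rng.Uint64()
		}
	}
}

// hash XORs together one table entry per byte position.
func hash(key [keyLen]byte) uint64 {
	var h uint64
	for pos, b := range key {
		h ^= zobrist[pos][b]
	}
	return h
}

// update re-hashes incrementally: XOR out the old byte, XOR in the new one.
func update(h uint64, pos int, oldB, newB byte) uint64 {
	return h ^ zobrist[pos][oldB] ^ zobrist[pos][newB]
}

func main() {
	a := [keyLen]byte{1, 2, 3, 4, 5, 6, 7, 8}
	h := hash(a)
	b := a
	b[3] = 42 // "make a move" at position 3
	fmt.Println(update(h, 3, a[3], b[3]) == hash(b)) // true
}
```

The incremental update is the draw for engines: un-making a move is the exact same XOR applied again.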
Another similar trick - the XOR doubly linked list: store prev XOR next in a single link field (shrinking the node), and recover either neighbor when arriving from the other side, since you already know one of the two addresses.
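A sketch of the traversal - since Go's GC can't follow XORed pointers, this stand-in XORs slice indices instead of raw addresses, with index 0 as the nil sentinel:

```go
package main

import "fmt"

// node stores a single link field: prevIndex XOR nextIndex.
type node struct {
	val  int
	link int
}

func main() {
	// nodes[0] is the nil sentinel; build the list 1 <-> 2 <-> 3.
	nodes := []node{{}, {val: 10}, {val: 20}, {val: 30}}
	nodes[1].link = 0 ^ 2
	nodes[2].link = 1 ^ 3
	nodes[3].link = 2 ^ 0

	// Forward traversal: knowing where we came from, next = link XOR prev.
	prev, cur := 0, 1
	for cur != 0 {
		fmt.Println(nodes[cur].val) // 10, then 20, then 30
		prev, cur = cur, nodes[cur].link^prev
	}
}
```

Walking backwards is the same loop started from the tail.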
There is another way to compare floats for rough equality that I haven't seen explored much anywhere: bit-cast to integer, strip a few least significant bits, and then compare for equality.
This is agnostic to magnitude, unlike an epsilon, which has to be tuned to the range of values you expect in order to get a meaningful result.
Great reading, thanks. It describes how to handle ±0 and works with the difference to avoid truncation errors.
The first half of the paper arrives at this correct snippet; the second half is about optimizing it.
bool DawsonCompare(float af, float bf, int maxDiff)
{
    // Reinterpret the float bits as signed integers.
    int ai = *reinterpret_cast<int*>(&af);
    int bi = *reinterpret_cast<int*>(&bf);
    // IEEE floats are sign-magnitude; remap negative values so the
    // integers are monotonic in the float value.
    if (ai < 0)
        ai = 0x80000000 - ai;
    if (bi < 0)
        bi = 0x80000000 - bi;
    // Nearly equal if within maxDiff representable values (ULPs).
    int diff = ai - bi;
    return abs(diff) < maxDiff;
}
I'm unconvinced. Doesn't this just replace the need to choose a suitable epsilon with the need to choose the right number of bits to strip? And the latter affords far fewer choices for the degree of "roughness" than the former does.
Not quite. It's basically a combined mantissa and exponent test, so it can be thought of as functionally equivalent to scaling epsilon by a power of two (the shared exponent of the nearly equal floating point values) and then using that epsilon.
I think I'll just use scaled epsilon... though I've gotten lots of performance wins out of direct bitwise trickery with floats (e.g., fast rounding with mantissa normalization and casting).
This is essentially ULP (units in the last place) comparison, and it's a solid approach. One gotcha: IEEE 754 floats have separate representations for +0 and -0, so values straddling zero (like 1e-45 and -1e-45) will look maximally far apart as integers even though they're nearly equal. You need to handle the sign bit specially.
There's another gotcha. Consider positive, normal x and y where ulp(y) != ulp(x). Bitwise comparison, regardless of tolerance, will consider x to be far from y, even though they might be adjacent numbers, e.g. if y = x+ulp(x) but y is a power of 2.
This case actually works, because for finite numbers of a given sign the integer bit representations are monotonic with the value, due to the placement of the exponent and mantissa fields and the implicit mantissa bit. For instance, 1.0 in IEEE float is 0x3F800000, and the representable value immediately below it is 0x3F7FFFFF.
Signed zero and the sign-magnitude representation are more of an issue, but can be resolved by XORing the sign bit into the mantissa and exponent fields, flipping the negative range. This places -0 adjacent to +0, which is typically enough, and can be fixed up for minimal additional cost (another subtract).
I interpreted OP's "bit-cast to integer, strip few least significant bits and then compare for equality" message as suggesting this kind of comparison (Go):
with the sensitivity controlled by ignoreBits, higher values being less sensitive.
Supposing y is 1.0 and x is the predecessor of 1.0, the smallest value of ignoreBits for which equiv would return true is 24.
But a worst case example is found at the very next power of 2, 2.0 (bitwise 0x40000000), whose predecessor is quite different (bitwise 0x3FFFFFFF). In this case, you'd have to set ignoreBits to 31, and thus equivalence here is no better than checking that the two numbers have the same sign.
Yeah, that's effectively quantization, which will not work for general tolerance checks, where you'd want float similarity to translate into int similarity.
There are cases where the quantization method is useful, hashing/binning floats being an example. Standard similarity checks don't work there because of lack of transitivity. But that's fundamentally a different operation than is-similar.
Sure, but that operator can propagate a carry all the way to the most significant bit, so a check for bitwise equality after "strip[ping] few least significant bits" will yield false in some cases. The pathologically worst case for single precision, for example, is illustrated by the value 2.0 (bitwise 0x40000000) and its predecessor, which differ in all bits except the sign.
Rather than stripping bits, you can just compare if the bit-casted numbers are less than N apart (choose an appropriate N that works for your data; a good starting point is 4).
This breaks down across the positive/negative boundary, but honestly, that's probably a good property. -0.00001 is not all that similar to +0.00001 despite being close on the number line.
It also requires that the inputs are finite (no INF/NAN), unless you are okay saying that FLT_MAX is roughly equal to infinity.
That works well for sorting/bucketing/etc in a few places, but as a comparison it's prone to false negatives (your values are close and computed to not be close), so you're restricted to algorithms tolerant of that behavior.
Like make, it compares the mtime of dependencies against the target to determine whether the dependencies changed. This is flawed and simple to fool with filesystem operations that do not change mtime (move, rename):
1) pick a source file and make a copy of it for later
2) edit the selected source file and rebuild
3) move the copy back to its original location
4) try to rebuild - nothing happens
I might be missing your sarcasm, but this is a common approach for large-scale builds. Virtual filesystems are used to provide a pre-computed tree hash as an xattr. In a more typical case, you can read the git tree hash.
Not sure it was meant as sarcasm really. I just think so many build (and other) problems could have been avoided if a file hash were available on every file by default.
In the current POSIX paradigm yes, it would be expensive. But if the hash was defined as the hash of fixed blocks, it wouldn't be expensive. The raciness depends, a lot, on the semantics we would define. (In the context of a build system, it's no different than that the file could get a new mtime after we read the mtime.)
By not tracking file metadata through an index file, mtime-only incremental build systems trade a lot of reliability for only slightly more simplicity. https://apenwarr.ca/log/20181113
Copy (1) and edit (2) both bump mtime, usually. It's not obvious that in the workflow you describe ninja is problematic, rather than the workflow itself (which is atypical).
Copy and edit do, but move (aka rename) generally does not, and that is the part that is problematic.
I don't think the described sequence of operations is all that unusual. Not the most common case for sure, but hardly unlikely in the grand scheme of things.
ninja fails to detect that the file changed since the last build - its mtime, ctime, inode and size can all change, yet the change goes undetected as long as the mtime is not newer than the target's.
Again, this is just a weird workflow, and you're assuming copy/edit don't bump mtime. That usually isn't the case. If you're doing this weird thing, you can just run `touch` when you move files over existing files like this to explicitly bump mtime.
I run into this issue when building against different environments, each with a different version of a system package:
1. A library depends on a system package. To test against the different versions of the system package, the library is compiled within a container.
2. To minimize the incremental rebuild time, the `build` directory is mounted into the build container. Even when using a different version of the system package, this allows re-use of system-independent portions of the build.
3. When switching to a build container with a different version of the system package, the mtime of the system package is that of its compilation, not that of the build container's initialization. Therefore, the library is erroneously considered up-to-date.
Because the mtime is the only field checked to see if the library is up to date, I need to choose between having larger disk footprint (separate `build` directory for each build container), slower builds (touch the system package on entering the container, forcing a rebuild), or less safe incremental builds (shared `build` directory, manually touch files when necessary).
Incorrect - I only assume that moving/renaming the backup to its original location doesn't change its mtime (which it doesn't, with default flags or from an IDE or file manager). And I don't think this is a weird or obscure workflow; I do it all the time - keep two versions of a file, or make a backup before some experimental changes and restore it later.
I used to do that some 20 years ago when I was learning programming, but now it does seem like a weird workflow when git exists and handles this case well.
Good point. I think it would be fixable by using the Change Time instead of the Modify Time, because that changes when moving the copy over the original.
I wouldn't call wayland-client callback hell. All callbacks are called at the expected time: when you call wl_display_dispatch() (or one of its variants) or during wl_display_roundtrip(). GLFW also works with callbacks and nobody complains about that.
There is no async involved, so the function coloring argument doesn't really apply here.
I don't share the author's hate for them, but they are definitely more verbose than popping from an event queue and switching on the event type à la the SDL loop. Plenty of callbacks just set parameters in some state struct and do not propagate further. And you need to fill in the vtable structs and register them as listeners. This boilerplate is probably the reason basic window examples have ~200 lines instead of 40. But in a larger project this is barely a problem.
Wayland makes it unnecessarily difficult to make simple clients. Gnome still doesn't support server-side window decoration and libdecor is an absolute nightmare and wayland-cursor doesn't even detect the system theme properly.
>and I still don't know what's the difference between them (wl_display_roundtrip() & wl_display_dispatch()) and in what order to call them on
I've been struggling with this initially as well, it's pretty poorly explained in docs. Short explanation:
The wayland-client library implements queues on top of the socket. To understand it, you have to think about when the socket is read from and written to, and when the queues are pulled from or pushed to.
There is always a default queue, but, for example, EGL+OpenGL creates its own queue, which makes it even more confusing.
- `wl_display_dispatch_pending()` only pulls messages from default queue to callbacks
- `wl_display_dispatch()` also tries to do blocking read on the socket if no messages are in queue
- quite recently `wl_display_dispatch_queue_timeout()` was finally added, so you can do non-blocking read from the socket. earlier you had to hack the function yourself
- `wl_display_flush()` writes enqueued messages in queue to socket
- `wl_display_roundtrip()` sends a ping message and blocks waiting for the response. the point is that along the way you also send all enqueued requests and receive and process all responses. for example during init you call it once to create the registry and enumerate the objects, and a second time to enumerate further protocol objects that got registered in the registry callback, such as the seat
- `eglSwapBuffers()` operates on its own queue, but reading from socket also enqueues to default queue, so you should always call `wl_display_dispatch_pending()` (on default queue) afterwards
There is also a way to get around being stuck in `eglSwapBuffers()` during window inhibition: disable the blocking with `eglSwapInterval(0)` and use `wl_surface_frame()` callback, and you get notified in callback when you can redraw and swap again. But you can't do blocking reads with `wl_display_dispatch()` anymore, have to use the timeout variant.
After using it this way, you can also easily manage multiple vsynced windows independently on the same thread, and even use wayland socket in epoll event loop. None of this is documented of course.
The clipboard interface is definitely compromised a bit by being shared with drag-and-drop events, but it's not that complicated. There is also a pitfall when you copy-paste to your own application without any async event loop: you can get deadlocked by being expected to write and read the same file descriptor at the same time.
I would mention stb_textedit.h, but I would not recommend it. It was an interesting thing to study, but the library has many shortcomings and is a pain to integrate and use. It is used in ImGui, but somewhat modified. To illustrate the flaws: it can't easily be used with UTF-8 bytes, there is a large switch over keyboard shortcuts to trigger actions so you have to fake keyboard input to script things, the default word boundary detection is pretty bad, and there is no way to nicely provide double- or triple-click selection.
The two notable functions are stb_text_locate_coord() and stb_textedit_find_charpos(), which connect the physical x,y coordinates with position in text buffer. They both iterate lines of text - accumulating y position; and the chars of last line - accumulating x position.
For windowing, drawing and OS integration, SDL with SDL_ttf is actually pretty good. The SDL3_ttf API got an improvement and no longer requires zero-terminated strings, so you don't need to copy every chunk.
The copy in dear imgui sources has a few tweaks to make it possible to use with UTF-8 without unnecessary lookups: see `STB_TEXTEDIT_GETPREVCHARINDEX`, `STB_TEXTEDIT_GETNEXTCHARINDEX` and the `stb_textedit_text()` use.
I generally love the STB libraries, but I have to agree that stb_textedit is/was a false economy, at least for my use case of Dear ImGui. I can't fault the design, but if you want a fancy text editor, the value you get from stb_textedit is rather minor compared to the whole thing, and it gets in the way. I will aim to remove it from dear imgui at some point.
It's not that bad. You need really large files to notice.
The largest realistic file I'll ever touch - the sqlite3 amalgamation, at 270k lines and 9.1 MB - still takes only 6 ms to memmove on my poor laptop.
Any regular file of up to 10k lines is memmoved in a matter of microseconds.
Yes, absolutely. I've since switched to rope-backed buffers, but I don't think the rope itself is actually adding much from a performance standpoint, even for very large files.
We talk about big-O complexity a lot when talking about things like this, but modern machines are scarily good at copying around enormous linear buffers of data. Shifting even hundreds of megabytes of text might not even be visible in your benchmark profiling, if done right.
When benchmarking, I discovered that the `to_pos`/`to_coord` functions, which translate between buffer byte positions and screen coordinates, were by far the heaviest operation. I could have solved that problem entirely simply by maintaining a list of line offsets and binary-searching through it.
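That line-offset bookkeeping is small; a sketch (names are mine, not from any particular editor):

```go
package main

import (
	"fmt"
	"sort"
)

// lineStarts records the byte offset at which each line begins.
func lineStarts(text string) []int {
	starts := []int{0}
	for i := 0; i < len(text); i++ {
		if text[i] == '\n' {
			starts = append(starts, i+1)
		}
	}
	return starts
}

// toCoord maps a byte position to (line, column) with a binary search
// over the line-start offsets.
func toCoord(starts []int, pos int) (line, col int) {
	line = sort.Search(len(starts), func(i int) bool { return starts[i] > pos }) - 1
	return line, pos - starts[line]
}

func main() {
	starts := lineStarts("hello\nworld\nfoo")
	fmt.Println(toCoord(starts, 8)) // 1 2 (the 'r' in "world")
}
```

On an edit, only the offsets after the edit point need fixing up, which is itself a cheap linear pass.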
That's true for code editing, but it's nice to not have to reach for a different solution when editing huge files. Sometimes I like to open up big log files, JSON test data, etc.
Something similar is used in swiss tables - metadata bucket entries are a 1-bit occupancy marker and the 7 MSB bits of the hash (I don't remember how tombstones are represented).
The metadata table is scanned first; the stored hash bits filter out most colliding entries, and the "false positives" lead to a probe into the entry table and a key comparison (possibly optimized with a full-hash comparison).
Buckets are 16 bytes because SSE2 and ARM NEON SIMD are basically guaranteed.
I was shocked to read how swiss tables - proclaimed to be the fastest hash tables - work. It's just open-addressing linear probing with no technique to deal with collisions and clustering. Plus the initial version rounded hash % capacity down to the bucket size, thus using 4 fewer bits of hash, leading to even more collisions and clustering.
Yet the super fast probe apparently made it a non-issue? Mind-boggling.
(A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)
A simple linear probe is very efficient if the hash quality is high enough. More complex hash tables have the useful property that their performance is less sensitive to lower hash quality.
Historically, the computational cost of excellent hash quality was too high, so the overall cost of a hash table was lower using a worse hash and more complex table design. Today, it is possible to generate high quality hashes with minimal computational cost, so complex hash table designs are less useful. Modern hash table performance can largely be reduced to the average number of cache lines (both data and instruction) touched by the operations.
The absl tables (the canonical Swisstables) are good, but not the fastest. Of course, everything will depend on your hardware, compiler and usage patterns, but if you look at an all-round benchmark like https://martin.ankerl.com/2022/08/27/hashmap-bench-01/, they're about in the middle of the pack even when picking the best-working hash functions (and really bad if you happen to have a bad one). This mirrors my own experience.
Of course, absl tables will have been improved since 2022, but there are many competitors that have improved as well. And you could argue “well, who ever copies or iterates over a map”, but even if you take away those categories, they're not a clear leader really anywhere else either.
I wouldn't say it's the middle of the pack. They're very clearly at the front of the pack for most operations, but the author's summarization penalizes its weaknesses more than those of others, because they have a horse in the race.
It's worth mentioning that boost.unordered now also has a state-of-the-art open-addressing hashtable (boost::unordered_flat_map) that beats abseil in some benchmarks, particularly for unsuccessful look-up: https://bannalia.blogspot.com/2022/11/inside-boostunorderedf...
absl::flat_hash_map performs quite well on the slightly more "realistic" benchmarks that were added to ankerl's map_benchmark after the article came out (GameOfLife and knucleotide). See this slide: https://www.youtube.com/watch?v=Rg8MZ5pJIJA&t=1996s (the talk is also great btw)
The Swiss Table is not presently the fastest known design for many purposes - other open-addressed schemes take that honour today - and yes, nobody who wants a fast modern hash table tries to mitigate collisions: just use a decent hash, and collisions are rare enough to have negligible effect so long as your scheme degrades gracefully rather than failing.
The C++ approach is focused on mitigating the price of a collision: accept that you lost, and try to make it not sting too badly. This made some sense because the provided hash for integers is typically literally the identity function, so collisions are frequent and predictable. But "probably we lost, let's mitigate that" isn't a route to success, and it only made sense when memory fetches were cheap and ALU operations were expensive. Forty years ago that was a sane choice; thirty years ago, when C++ standardization was in progress, it was already looking dodgy; fifteen years ago, when they actually standardized this feature, it was already very stupid.
Open addressed strategies begin by assuming you probably won. Your first main memory fetch is probably enough to know if this key is present, and if it is the next fetch probably gets the key and value.
The std::unordered_map strategy will only tell you the key was not present on its first fetch a small fraction of the time; on every other occasion you must do a second fetch to even discover whether the key might be present, and several more fetches are needed on average to get the final key + value, or certainty that it's not found.
I think it just exploits the fact that reading a few contiguous cache lines at random isn't a whole lot more expensive than reading a single random byte.
After I started experimenting with my priority queues, I found myself never wanting to use binary heaps, and instead using a wider arity where each node fills at least a whole cache line, if not several. Wider nodes meant fewer jumps, and jumping between nodes was more expensive (time, cache misses) than the work done at each node.
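For illustration, a sketch of a d-ary sift-down where one node's children span a cache line or two (arity and element type chosen arbitrarily):

```go
package main

import "fmt"

const d = 8 // 8 eight-byte keys per node ~= one or two cache lines

func child(i, k int) int { return i*d + 1 + k } // k-th child, k in [0, d)

// siftDown restores the heap property below index i; each step scans all
// d children of one node (one contiguous region) before jumping.
func siftDown(h []int, i int) {
	for {
		min := i
		for k := 0; k < d; k++ {
			if c := child(i, k); c < len(h) && h[c] < h[min] {
				min = c
			}
		}
		if min == i {
			return
		}
		h[i], h[min] = h[min], h[i]
		i = min
	}
}

// pop removes and returns the minimum.
func pop(h []int) ([]int, int) {
	top := h[0]
	h[0] = h[len(h)-1]
	h = h[:len(h)-1]
	siftDown(h, 0)
	return h, top
}

func main() {
	h := []int{5, 9, 3, 7, 1}
	for i := len(h) / d; i >= 0; i-- { // heapify
		siftDown(h, i)
	}
	h, top := pop(h)
	fmt.Println(top) // 1
	h, top = pop(h)
	fmt.Println(top) // 3
}
```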
tombstones use the same bit as empty (empty and deleted don't have hash codes, so there's room).
the key realization behind swissdict is that all the complications people add are only necessary if you have a bad hash function, so you can just use a good one and be happier.
linear probing also has the significant advantage that it allows removing tombstones whenever the next entry is empty
> Later version allowed to scan from arbitrary position by mirroring first bucket as last.
I don't think this would help. The real issue with an arbitrary position is that you can't load 16 bytes into a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.
It's one of the improvements they claimed in the 2019 presentation. https://youtu.be/JZE3_0qvrMg?feature=shared&t=1054
Reporting 10% speedup on find, but 15% slowdown on insert. The speedup probably comes from using 4 more bits of hash, which leads to fewer collisions. And slowdown from more complicated code for the mirroring.
I'm still confused about the SIMD alignment. There are load instructions with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128). Both claim the same latency and throughput.
Somewhere I've heard that the slowdown of unaligned access comes from using more bandwidth to load two aligned 128-bit lines to compose the unaligned value. But I have no idea whether this affects multiple loads of contiguous memory.