Whoa. Nice resource! I spent ages last year learning about memory, memory allocators, CPU caches, etc., but never came across such a comprehensive document.
I had to implement my own malloc/free/etc. on an embedded project recently, due to a lack of thread-safety in the vendor's C stdlib*. I wrote a blog post about it (http://spin.atomicobject.com/2013/04/17/embedded-memory-unde...), with a couple of lessons I learned along the way.
* There were compiler flags to generate a version of the stdlib with user-provided mutex functions for an RTOS. I did so, then confirmed (by reading the disassembly of the stdlib's malloc) that it never called them!
This survey paper has been one of my favorite papers (using the term loosely; it's like a small book) since 1996 or so; it's part of what transitioned me from network administration with a sideline in development to full-time software development.
Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective has a pretty solid overview of memory allocation strategies at just the right level of abstraction for someone who's just getting into this kind of thing: doubly- vs. singly-linked free lists, first-fit vs. best-fit vs. segregated fits and so on. That was the book I used in my Intro to Systems Programming class in college and I really loved it.
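To make those strategies concrete, here is a minimal sketch of a first-fit search over a singly-linked free list, roughly in the spirit of what CS:APP walks through; the node layout and names here are mine, purely for illustration:

    #include <stddef.h>

    /* Illustrative free-list node: each free block begins with this header. */
    struct free_block {
        size_t size;              /* usable bytes in this block */
        struct free_block *next;  /* next free block, or NULL */
    };

    static struct free_block *free_list_head;

    /* First fit: return the first block large enough, unlinking it from the
     * list. Best fit would instead scan the whole list and remember the
     * smallest block that still satisfies the request; segregated fits keep
     * a separate list per size class so the search is nearly constant time. */
    static struct free_block *first_fit(size_t size)
    {
        struct free_block **prev = &free_list_head;
        for (struct free_block *b = free_list_head; b != NULL; b = b->next) {
            if (b->size >= size) {
                *prev = b->next;  /* unlink the block */
                return b;
            }
            prev = &b->next;
        }
        return NULL;              /* no fit; a real allocator would grow the heap */
    }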
It continues to amaze me how many workloads can be dominated by malloc() performance. In particular, last summer we at Joyent found that a production node.js load was surprisingly dominated by small object malloc()/free() performance -- and achieving maximum performance ultimately required not just adding a per-thread cache, but cycle bumming the hell out of it.[1] Point is: malloc() still very much matters, despite (or maybe because of) modern interpreted environments.
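The linked post has the real details, but purely as an illustration of the per-thread cache idea (a sketch, not Joyent's actual code): each thread keeps its own free lists keyed by size class and only falls back to the shared allocator when a list runs dry, so the small-object hot path takes no lock.

    #include <stdlib.h>

    #define NCLASSES 8  /* hypothetical size classes, e.g. 16, 32, ..., 2048 bytes */

    /* One free list per size class, per thread; no locking on the fast path.
     * Blocks must be at least pointer-sized so the next link fits inside them. */
    static _Thread_local void *tcache[NCLASSES];

    static void *cached_alloc(int cls, size_t size)
    {
        void *p = tcache[cls];
        if (p != NULL) {                /* fast path: pop from this thread's list */
            tcache[cls] = *(void **)p;  /* first word of a cached block is the next link */
            return p;
        }
        return malloc(size);            /* slow path: shared allocator */
    }

    static void cached_free(int cls, void *p)
    {
        *(void **)p = tcache[cls];      /* push onto this thread's list; a real cache
                                           would cap the list length and flush excess */
        tcache[cls] = p;
    }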
I am sort of horrified by a garbage-collected VM that uses small malloc/free calls to manage memory. I also can't believe the linked article doesn't mention what VM the node.js code was running on.
Edit: reading more of the linked articles, these benchmarks seem to be running on V8, which seems to be the only platform that node.js runs on... facepalm. Still confused why V8 would be using standard malloc/free calls.
True enough, but I think the introduction is quite tractable and contains some important (but subtle) points about memory allocator performance. Quoth the paper:
It is not sufficient to measure the time consumed by the allocator code in isolation. Memory layout can have a significant impact on how quickly the rest of the application runs, due to the effects of CPU cache, RAM, and virtual memory paging.
The only definitive measures of allocator performance are attained by measuring the execution time and memory usage of real applications. This poses challenges when qualifying the performance characteristics of allocators. Consider that an allocator might perform very poorly for certain allocation patterns, but if none of the benchmarked applications manifest any such patterns, then the allocator may appear to perform well, despite pathological performance for some work loads. This makes testing with a wide variety of applications important. It also motivates an approach to allocator design that minimizes the number and severity of degenerate edge cases.
This example implementation should come with a disclaimer: keeping metadata inside the allocated objects is a bad idea because it pollutes the caches and increases fragmentation.
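For anyone unsure what that refers to: the common textbook layout stores a small header immediately in front of every allocation, roughly as in the sketch below (my own illustration, not the article's exact code). Allocators such as tcmalloc sidestep the issue by keeping size metadata per page/span rather than per object:

    #include <stddef.h>
    #include <stdlib.h>

    /* Per-object header stored just before the pointer handed to the caller.
     * free() has to touch it, dragging in a cache line the application may
     * never have used, and the per-object overhead plus alignment padding is
     * where the extra fragmentation comes from. */
    struct obj_header {
        size_t size;
    };

    static void *header_alloc(size_t size)
    {
        struct obj_header *h = malloc(sizeof *h + size);
        if (h == NULL)
            return NULL;
        h->size = size;
        return h + 1;  /* user pointer sits just past the header */
    }

    static void header_free(void *p)
    {
        if (p != NULL)
            free((struct obj_header *)p - 1);
    }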
I think the problem of memory allocation becomes much more interesting in the context of multithreaded applications. I really like the Streamflow paper [0]. For similar implementations that are more widely used nowadays, see Hoard [1], jemalloc [2], and tcmalloc [3].
Writing my own malloc with an eye out for multi-core performance, as part of an OS university course, was one of the most fun assignments I had the pleasure of working on. It's so satisfying to shed yet another layer of magic and get a thorough understanding of how the fundamentals work.
We did exactly the same thing in my C programming course. It's funny how the best way to learn is to reimplement functions and algorithms that were revolutionary 30 years ago =)
Managing free memory is indeed an interesting problem. Besides the simple free-list approach sketched by the author (which sidesteps the first-fit vs. best-fit question, a choice that also affects fragmentation), I always found the buddy system to be particularly interesting.
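The whole scheme hangs on one neat address trick, sketched below under the assumption of power-of-two block sizes aligned within the arena (the function name is mine):

    #include <stddef.h>
    #include <stdint.h>

    /* Buddy system core trick: for a block of size 2^k at a given offset from
     * the start of the arena, the buddy's offset is the same offset with bit k
     * flipped. On free, if the buddy is also free, the pair coalesces into one
     * block of size 2^(k+1), and the check repeats one level up. */
    static uintptr_t buddy_of(uintptr_t offset, size_t block_size)
    {
        return offset ^ (uintptr_t)block_size;
    }

    /* Example: with 64-byte blocks, buddy_of(0, 64) == 64, buddy_of(64, 64) == 0,
     * and buddy_of(128, 64) == 192. */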
In the allocator I wrote (see the Streamflow heading: http://www.scott-a-s.com/projects/), I used both a free-list and the buddy algorithm.
The free-list was for small-object allocations; it maintained free lists of different sizes (say, an 8-byte free list, a 16-byte free list, a 32-byte free list, etc.). However, I obtained the memory for these lists from a page manager, and that page manager used the buddy system. The relationship is that the small-object part of the allocator obtained its memory from the page allocator, and the page allocator obtained its memory from the operating system. This design gave me the benefits of a free list (good cache behavior for small objects allocated and used together) along with low overall fragmentation and good reuse of the free lists.
Full details are in our paper (http://www.scott-a-s.com/files/ismm06.pdf). I have portions of a draft of a more detailed explanation that I never got around to finishing and publishing.
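For a rough feel of the segregated free lists described above: a request is rounded up to a size class and that class indexes the matching list, something like the sketch below. This is just the general shape, not the actual Streamflow code, which uses its own class spacing:

    #include <stddef.h>

    #define MIN_SHIFT 3  /* smallest class is 8 bytes */

    /* Map a request size to a free-list index: round up to the next power of
     * two, starting at 8 bytes. Index 0 -> 8 bytes, 1 -> 16 bytes, 2 -> 32, ... */
    static unsigned size_class(size_t size)
    {
        unsigned cls = 0;
        size_t class_size = (size_t)1 << MIN_SHIFT;
        while (class_size < size) {
            class_size <<= 1;
            cls++;
        }
        return cls;
    }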
By the way, for anyone interested in how the memory manager is implemented in Windows 8, I watched this video a few months back and remember it being pretty interesting.
http://www.akkadia.org/drepper/cpumemory.pdf
The serialized version is at http://lwn.net/Articles/250967/