Whoa. Nice resource! I spent ages last year learning about memory, memory allocators, CPU caches, etc., but never came across such a comprehensive document.
I had to implement my own malloc/free/etc. on an embedded project recently, due to a lack of thread-safety in the vendor's C stdlib*. I wrote a blog post about it (http://spin.atomicobject.com/2013/04/17/embedded-memory-unde...), with a couple of lessons I learned along the way.
* There were compiler flags to generate a version of the stdlib with user-provided mutex functions for an RTOS. I did so, then confirmed (by reading the disassembly of the stdlib's malloc) that it never called them!
This survey paper has been one of my favorite papers (using the term loosely; it's like a small book) since 1996 or so; it's part of what transitioned me from network administration with a sideline in development to full-time software development.
Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective has a pretty solid overview of memory allocation strategies at just the right level of abstraction for someone who's just getting into this kind of thing: doubly- vs. singly-linked free lists, first-fit vs. best-fit vs. segregated fits and so on. That was the book I used in my Intro to Systems Programming class in college and I really loved it.
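To make those strategies concrete, here is a minimal sketch of a first-fit search over a singly-linked free list, roughly in the spirit of what CS:APP walks through; the node layout and names here are mine, purely for illustration:

    #include <stddef.h>

    /* Illustrative free-list node: each free block begins with this header. */
    struct free_block {
        size_t size;              /* usable bytes in this block */
        struct free_block *next;  /* next free block, or NULL */
    };

    static struct free_block *free_list_head;

    /* First fit: return the first block large enough, unlinking it from the
     * list. Best fit would instead scan the whole list and remember the
     * smallest block that still satisfies the request; segregated fits keep
     * a separate list per size class so the search is nearly constant time. */
    static struct free_block *first_fit(size_t size)
    {
        struct free_block **prev = &free_list_head;
        for (struct free_block *b = free_list_head; b != NULL; b = b->next) {
            if (b->size >= size) {
                *prev = b->next;  /* unlink the block */
                return b;
            }
            prev = &b->next;
        }
        return NULL;              /* no fit; a real allocator would grow the heap */
    }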
It continues to amaze me how many workloads can be dominated by malloc() performance. In particular, last summer we at Joyent found that a production node.js load was surprisingly dominated by small object malloc()/free() performance -- and achieving maximum performance ultimately required not just adding a per-thread cache, but cycle bumming the hell out of it.[1] Point is: malloc() still very much matters, despite (or maybe because of) modern interpreted environments.
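The linked post has the real details, but purely as an illustration of the per-thread cache idea (a sketch, not Joyent's actual code): each thread keeps its own free lists keyed by size class and only falls back to the shared allocator when a list runs dry, so the small-object hot path takes no lock.

    #include <stdlib.h>

    #define NCLASSES 8  /* hypothetical size classes, e.g. 16, 32, ..., 2048 bytes */

    /* One free list per size class, per thread; no locking on the fast path.
     * Blocks must be at least pointer-sized so the next link fits inside them. */
    static _Thread_local void *tcache[NCLASSES];

    static void *cached_alloc(int cls, size_t size)
    {
        void *p = tcache[cls];
        if (p != NULL) {                /* fast path: pop from this thread's list */
            tcache[cls] = *(void **)p;  /* first word of a cached block is the next link */
            return p;
        }
        return malloc(size);            /* slow path: shared allocator */
    }

    static void cached_free(int cls, void *p)
    {
        *(void **)p = tcache[cls];      /* push onto this thread's list; a real cache
                                           would cap the list length and flush excess */
        tcache[cls] = p;
    }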
I am sort of horrified by a garbage-collected VM that uses small malloc/free calls to manage memory. I also can't believe the linked article doesn't mention what VM the node.js code was running on.
Edit: reading more of the linked articles, these benchmarks seem to be running on V8, which seems to be the only platform that node.js runs on... facepalm. Still confused why V8 would be using standard malloc/free calls.
True enough, but I think the introduction is quite tractable and contains some important (but subtle) points about memory allocator performance. Quoth the paper:
It is not sufficient to measure the time consumed by the allocator code in isolation. Memory layout can have a significant impact on how quickly the rest of the application runs, due to the effects of CPU cache, RAM, and virtual memory paging.
The only definitive measures of allocator performance are attained by measuring the execution time and memory usage of real applications. This poses challenges when qualifying the performance characteristics of allocators. Consider that an allocator might perform very poorly for certain allocation patterns, but if none of the benchmarked applications manifest any such patterns, then the allocator may appear to perform well, despite pathological performance for some work loads. This makes testing with a wide variety of applications important. It also motivates an approach to allocator design that minimizes the number and severity of degenerate edge cases.
This example implementation should come with a disclaimer: keeping metadata inside the allocated objects is a bad idea because it pollutes the caches and increases fragmentation.
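For anyone unsure what that refers to: the common textbook layout stores a small header immediately in front of every allocation, roughly as in the sketch below (my own illustration, not the article's exact code). Allocators such as tcmalloc sidestep the issue by keeping size metadata per page/span rather than per object:

    #include <stddef.h>
    #include <stdlib.h>

    /* Per-object header stored just before the pointer handed to the caller.
     * free() has to touch it, dragging in a cache line the application may
     * never have used, and the per-object overhead plus alignment padding is
     * where the extra fragmentation comes from. */
    struct obj_header {
        size_t size;
    };

    static void *header_alloc(size_t size)
    {
        struct obj_header *h = malloc(sizeof *h + size);
        if (h == NULL)
            return NULL;
        h->size = size;
        return h + 1;  /* user pointer sits just past the header */
    }

    static void header_free(void *p)
    {
        if (p != NULL)
            free((struct obj_header *)p - 1);
    }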
I think the problem of memory allocation becomes much more interesting in the context of multithreaded applications. I really like the Streamflow paper [0]. For similar implementations that are more widely used nowadays, see Hoard [1], jemalloc [2], and tcmalloc [3].
Writing my own malloc with an eye out for multi-core performance, as part of an OS university course, was one of the most fun assignments I had the pleasure of working on. It's so satisfying to shed yet another layer of magic and get a thorough understanding of how the fundamentals work.
We did exactly the same thing in my C programming course. It's funny how the best way to learn is to reimplement functions and algorithms that were revolutionary 30 years ago =)
Managing free memory is indeed an interesting problem. Besides the simple free-list approach sketched by the author (which sidesteps the first-fit vs. best-fit question, a choice that also affects fragmentation), I always found the buddy system to be particularly interesting.
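The whole scheme hangs on one neat address trick, sketched below under the assumption of power-of-two block sizes aligned within the arena (the function name is mine):

    #include <stddef.h>
    #include <stdint.h>

    /* Buddy system core trick: for a block of size 2^k at a given offset from
     * the start of the arena, the buddy's offset is the same offset with bit k
     * flipped. On free, if the buddy is also free, the pair coalesces into one
     * block of size 2^(k+1), and the check repeats one level up. */
    static uintptr_t buddy_of(uintptr_t offset, size_t block_size)
    {
        return offset ^ (uintptr_t)block_size;
    }

    /* Example: with 64-byte blocks, buddy_of(0, 64) == 64, buddy_of(64, 64) == 0,
     * and buddy_of(128, 64) == 192. */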
In the allocator I wrote (see the Streamflow heading: http://www.scott-a-s.com/projects/), I used both a free-list and the buddy algorithm.
The free-list was for small-object allocations; it maintained free lists of different sizes (say, an 8-byte free list, a 16-byte free list, a 32-byte free list, etc.). However, I obtained the memory for these lists from a page manager, and that page manager used the buddy system. The relationship is that the small-object part of the allocator obtained its memory from the page allocator, and the page allocator obtained its memory from the operating system. This design gave me the benefits of a free list (good cache behavior for small objects allocated and used together) along with low overall fragmentation and good reuse of the free lists.
Full details are in our paper (http://www.scott-a-s.com/files/ismm06.pdf). I have portions of a draft of a more detailed explanation that I never got around to finishing and publishing.
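For a rough feel of the segregated free lists described above: a request is rounded up to a size class and that class indexes the matching list, something like the sketch below. This is just the general shape, not the actual Streamflow code, which uses its own class spacing:

    #include <stddef.h>

    #define MIN_SHIFT 3  /* smallest class is 8 bytes */

    /* Map a request size to a free-list index: round up to the next power of
     * two, starting at 8 bytes. Index 0 -> 8 bytes, 1 -> 16 bytes, 2 -> 32, ... */
    static unsigned size_class(size_t size)
    {
        unsigned cls = 0;
        size_t class_size = (size_t)1 << MIN_SHIFT;
        while (class_size < size) {
            class_size <<= 1;
            cls++;
        }
        return cls;
    }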
By the way, for anyone interested in how the memory manager is implemented in Windows 8, I watched this video a few months back and remember it being pretty interesting.
http://www.akkadia.org/drepper/cpumemory.pdf
The serialized version is at http://lwn.net/Articles/250967/