Ask HN: What is your favorite hand-rolled malloc?

tptacek · on Aug 4, 2009

I have two go-to solutions for malloc. In order of preference:

I use arena allocators when I can get away with it. Arenas are absolutely trivial to code; there's no per-object "free", so allocation is just a pointer bump. You will be surprised at how often you can get away with this; lots of allocation-heavy code paths map to a single operation, transaction, request, file, or what-have-you, and all you're doing is deferring free to the end of it. Using arenas in this scenario basically erases the cost of allocation from your program. It's also dramatically harder to fuck up.

For anything else, when malloc hits the top of my profile, I have a simple freelist library for pools of homogenous allocations, with an embedded linked list of free items.

You could get smarter than an arena, a generic freelist, and malloc, but why?

neilc · on Aug 4, 2009

Yeah, arena allocators are great. I wouldn't immediately go and write my own from scratch, though: talloc (http://talloc.samba.org/), Postgres, and APR all implement variants on the arena idea that you can copy from liberally.

huhtenberg · on Aug 4, 2009

And talloc was derived from halloc (http://swapped.cc/halloc), which is a hierarchical allocator. Or rather a wrapper for an existing allocation api.

neilc · on Aug 4, 2009

Thanks for mentioning that link -- that's a nice library.

adamt · on Aug 4, 2009

At various times I've profiled some CPU bound code and thought malloc is taking too much time. This has led me to read up on, and try and develop various usage-optimised allocators. The one lesson I've learnt is that in most cases the standard libc allocator is actually extremely good.

Where I have had big advantages over libc malloc is where the memory allocations have been specific to a thread context. For example, consider a multi-threaded server, where by you could guarantee that only one thread was active in a connection at anyone time. It's quite easy for a single connection to spend 5-10% of it's time in malloc and there are often big globalish locks around general purpose allocators, so it's easy to get get mutex contention which is far worse than the overhead of a sub-efficient allocator.

In such cases you often need to have small bits of memory within the context of a connection. Here I found big benefits my mallocing memory in bug chunks and then simply splitting that up into small sections within a connection specific allocator to avoid any malloc mutex locking.

What I've found is that in order to beat normal malloc, I think you really have to have a specific usage case that restricts something to do better.

ori_b · on Aug 4, 2009

Slab Allocators, as described by Bonwick (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4...) if I'm allocating lots of same-size objects. Other than that, I mainly stick to the system malloc(), or something specifically optimized for the task at hand.

lacker · on Aug 4, 2009

At Google we use tcmalloc which is now open sourced.

A paper about it:

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

The code:

http://code.google.com/p/google-perftools/

jey · on Aug 4, 2009

It really depends on the application. I just use the default system allocator until it's inadequate for some task, and then I switch to an allocation scheme or heap allocator that's well-suited for that problem.

Can you tell us more about your specific problem? Then we'd be able to give suggestions and help evaluate the options.

leif · on Aug 4, 2009

The current issue is that we allocate and free small structures very, very rapidly. They have very short lifespan, and there are never (in the average case) more than a few around at any one time. After profiling, we saw lots of time spent in malloc, and wrote a fixed-size-slab bitfield-based allocator which gave fantastic performance.

There are some other issues with the same piece of code that need re-working, and we'll end up with a totally different approach by the end, so I don't really have a problem that needs a crowd solution; I was just curious what everyone here thinks.

huhtenberg · on Aug 4, 2009

If it's a fixed-size slab, why bother with the bitfield ? It is hard to beat the performance of allocating a block from a free (double-linked) list. Start with an empty list and full slab. Try allocating from the list first, and from the bottom of the slab second. Free into the free list only. If you memory utilization is stable in a long run, then at some point you will be allocating/freeing from the list only.

kvs · on Aug 4, 2009

My recent allocator design is an attempt to address memory fragmentation in Windows XP & Vista. The unique problem I faced was that there was significant amount of code already in place which used C++ new/delete (about 300KLOC) and lot of coding is ongoing at breakneck speed. The allocator is written in C++.

My story:

- Profiled the code to figure out allocation sizes and number of allocations.

   * Found most objects are 32, 64, 128 bytes and we allocate millions of them.

   * Allocation rate is linear in the sense, rate of allocation is pretty steady within a time window.

   * Life time of these objects are long; Majority of them (99.9%) lasts as long as the application itself.

   * Found couple of large allocations in the range of 64MB to 128MB

Can you tell me yours? With such detail, then perhaps we can figure out what's best for you? :-)

slackerIII · on Aug 5, 2009

Did you try the windows low-fragmentation heap? I've seen it used very successfully.

kvs · on Aug 5, 2009

Yes. That didn't work to our satisfaction.

scott_s · on Aug 4, 2009

Mine: http://www.cs.vt.edu/~scschnei/streamflow

We allocated fixed size chunks of N objects, although we didn't have a smooth function for what the chunk size was. The inflection points where we changed our strategy were cache-line size and page size.

Used/free record keeping was a simple table to map an address to the data structure for that address, and we re-used the free memory for an internal list.

That's what we did to get good sequential performance - it was the multithreaded performance we were interested in. Aiming for good multithreaded performance influenced the data structure design to be simpler to reduce synchronization.

Details are in the paper and source code.

jacquesm · on Aug 4, 2009

I'm using jemalloc in places where we ran in to trouble with the regular malloc.

It's been a long long time since I rolled my own, so I don't think I'm qualified to answer the rest of your questions except for the second, linked list of pointers of free(d) memory would be the best choice I think, with as an extra an array of such lists grouped by size. (I hope that makes sense...) to speed up the search for a block of the right size.

leif · on Aug 4, 2009

I think the regular (gnu, at least) malloc uses bucketed linked lists.

dejb · on Aug 4, 2009

For me the best malloc for me is one I don't have to know anything about - whatever is built in to the scripting language I happen to be using. But I know you aren't playing around with memory allocation for the sheer geekery of it. There are still some problems that can't be solved efficiently with higher level languages. This class of problems I'll gladly leave to you guys.

Locke1689 · on Aug 4, 2009

I usually just use the standard GCC implementation. I remember implementing a pretty quick red-black tree version for my systems class. It got me an A, but I don't think I'd use it unless I find myself having serious problems with GCC.

spot · on Aug 4, 2009

tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html http://mituzas.lt/2006/12/06/google-perftools-tcmalloc-squid...