According to Wikipedia, the behaviour we are experiencing with slab may be a known handicap. Sorry, it's only on a German page:
https://de.wikipedia.org/wiki/Slab_allocator
The issues mentioned in this context appear to refer to the kernel implementation of the slab allocator. The kernel slab allocator has to take care of proper alignment of allocations, so as to avoid friction with multiple processors and non-uniform memory access (NUMA).
The slab allocator in clib2 sidesteps these issues by mostly ignoring them. Unless I made a mistake, allocations are currently aligned to 64-bit word boundaries because of the chunk allocation granularity (this may change, though). No optimizations for multiprocessing or NUMA are needed.
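To illustrate how the alignment falls out of the granularity (a simplified sketch, not the actual clib2 code; the constant and function name are made up): if chunks are only handed out in multiples of 8 bytes, every allocation automatically lands on a 64-bit boundary, with no extra alignment work required.

```c
#include <stddef.h>

/* Hypothetical granularity; clib2's actual chunk size may differ. */
#define CHUNK_GRANULARITY 8

/* Round a requested size up to the next multiple of the chunk
   granularity. Because every chunk starts where the previous one
   ended, all allocations end up 64-bit aligned as a side effect. */
static size_t round_up_to_chunk(size_t size)
{
    return (size + (CHUNK_GRANULARITY - 1)) & ~(size_t)(CHUNK_GRANULARITY - 1);
}
```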
The mentioned buddy allocator is, AFAIK, the default one in Bernd's ixemul library >6x.x, which may explain why it performs well in comparison. Maybe some more consideration should be spent on that issue?
From what I know, buddy allocators are much more complex in operation than the slab allocator. For example, the highly configurable dlmalloc allocator, including documentation comments, is more than 5000 lines long.
In a buddy allocator, effort is spent on merging chunks, and depending upon the order in which allocations are made, buddies may not be released in the order which allows them to be merged, slowly increasing fragmentation over time (see the sketch below). dlmalloc is designed to make best-fit allocations, as opposed to first-fit, which contributes to the complexity and the effort spent (first-fit is fast at the expense of quickly increasing fragmentation over time).
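For reference, here is roughly how the merge test works in a textbook binary buddy system (a sketch of the general technique, not dlmalloc's actual code): the buddy of a block is found by flipping a single address bit, and merging is only possible while that buddy happens to be free and unsplit, which is exactly what an unlucky allocation order can prevent.

```c
#include <stdint.h>

/* In a binary buddy system, a block of 'size' bytes (a power of two)
   at offset 'off' relative to the start of the managed region has
   its buddy at off XOR size. */
static uintptr_t buddy_offset(uintptr_t off, uintptr_t size)
{
    return off ^ size;
}

/* Merging two free buddies yields a block at the lower of the two
   offsets with twice the size, and the test repeats at the next
   level up. If any buddy along the way is still allocated, the free
   block is stuck at its current size, which is how fragmentation
   accumulates over time. */
```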
By comparison, a slab allocator can deliver both best-fit and first-fit performance at the same time without spending any effort on merging chunks. Furthermore, it can deliver this in nearly constant time, i.e. O(1), discounting that it has to obtain the pages which it manages from somewhere.
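A minimal sketch of why this holds (assuming a classic slab layout; the names are made up and not clib2's API): every chunk in a slab has the same size, so a freed chunk is pushed onto a per-slab free list and the next allocation pops it off again. Both operations are O(1), and because all chunks in the slab are the same size, the fit is always exact, so best-fit and first-fit coincide.

```c
#include <stddef.h>

/* Hypothetical, simplified slab: all chunks share one fixed size,
   and free chunks form an intrusive singly-linked list, i.e. the
   "next" pointer is stored inside the free chunk itself (so chunks
   must be at least sizeof(void *) bytes). */
struct slab {
    void  *free_list;   /* head of the list of free chunks */
    size_t chunk_size;  /* fixed size of every chunk in this slab */
};

static void *slab_alloc(struct slab *s)
{
    void *chunk = s->free_list;
    if (chunk != NULL)
        s->free_list = *(void **)chunk; /* pop in O(1) */
    return chunk; /* NULL: slab exhausted, a fresh slab is needed */
}

static void slab_free(struct slab *s, void *chunk)
{
    *(void **)chunk = s->free_list;     /* push in O(1) */
    s->free_list = chunk;
}
```

Fetching a fresh slab when the free list runs dry is the part that falls outside the O(1) bound, which is the caveat about obtaining pages mentioned above.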
