OK, this is more like I would expect - everything running consistently at a slower speed.
I know exactly what is causing the slowdown now. clib2 uses memory pools with a puddle size and expected allocation size of 4K. I modified that in the newer build to use normal memory allocations instead.
What is happening is that early memory allocations are fast and are packed efficiently into 4K chunks. Then, as bits of memory are de-allocated, holes are left behind. When new memory blocks are allocated, the OS - and this is where I'm not sure of the implementation details - presumably tries to fill in the gaps in the already-allocated pools. With a lot of pools it may take some time to search through them all and find a gap of the correct size, which is much like how normal memory allocations work when searching through all of RAM (and hence the similar speed).
Quite simply, we are allocating and de-allocating so much memory that we quickly lose any advantage of memory pools.
To fix it... well, that's tricky. The correct way would be to pool together elements of the same size to avoid fragmentation, but I can't do that in the core and all libraries without rewriting all the memory allocations (which would definitely not be popular). Note that I already do this in the frontend wherever it is practical (this was one of my earlier OS3 optimisation attempts!).
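For the record, the same-size pooling I mean is basically this - a minimal sketch with made-up names (np_*, node_pool), not the actual frontend code, and it assumes the slot size is a multiple of the required alignment:

#include <stddef.h>

struct np_slot { struct np_slot *next; };

struct node_pool {
    size_t slot_size;        /* size of each element */
    struct np_slot *free;    /* head of the free-slot list */
};

/* Carve a caller-supplied block into fixed-size slots and chain them up. */
static int np_init(struct node_pool *p, void *block, size_t block_size,
                   size_t slot_size)
{
    char *base = block;
    size_t i, count = block_size / slot_size;

    if (slot_size < sizeof(struct np_slot) || count == 0)
        return -1;

    p->slot_size = slot_size;
    p->free = NULL;
    for (i = 0; i < count; i++) {
        struct np_slot *s = (struct np_slot *)(base + i * slot_size);
        s->next = p->free;
        p->free = s;
    }
    return 0;
}

static void *np_alloc(struct node_pool *p)
{
    struct np_slot *s = p->free;   /* O(1): pop the first free slot */
    if (s != NULL)
        p->free = s->next;
    return s;
}

static void np_free(struct node_pool *p, void *mem)
{
    struct np_slot *s = mem;       /* O(1): push the slot back */
    s->next = p->free;
    p->free = s;
}

Because every slot is the same size, allocation and de-allocation never have to search for a fitting gap, and the pool cannot fragment internally.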
It may simply be a case of making the memory pools bigger, and I will try that first.
I suspect that this may not make much of a difference. The memory pools, which are what the malloc()/alloca()/realloc()/free() functions in clib2 are built upon, were intended to avoid fragmenting main memory. This is accomplished by having all allocations smaller than the preset puddle size draw from a puddle that still has enough room left for them to fit. Fragmentation then happens inside those puddles.
The problems begin when the degree of fragmentation inside these puddles becomes so high that the only recourse is to allocate more puddles and draw memory from those. The number of puddles in use increases over time, and when you try to allocate more memory, the operating system has to first find a puddle that still has room and then try to make the allocation work. Both these operations take more time the more puddles are in play, and the higher the fragmentation within these puddles is. Allocating memory will scale poorly, and what goes for allocations also goes for deallocations.
The other problem is with memory allocations whose length exceeds the puddle size. These allocations will be drawn from main memory rather than from the puddles. This will likely increase main memory fragmentation somewhat, but the same problems that exist with the puddles apply to main memory, too: searching for a chunk to draw the allocation from takes time, and the same goes when deallocating that chunk. There's an additional burden on this procedure because the memory pool has to keep track of that "larger than puddle size" allocation, too.
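For anyone unfamiliar with the pool interface: this is roughly how the Exec pool functions that clib2 builds on are used - a sketch, not the clib2 source, with the 4K figures mirroring the current puddle size and threshold:

#include <exec/types.h>
#include <exec/memory.h>
#include <proto/exec.h>

void pool_example(void)
{
    /* puddle size and threshold both set to 4096, as in the current build */
    APTR pool = CreatePool(MEMF_ANY, 4096, 4096);

    if (pool != NULL) {
        APTR small = AllocPooled(pool, 256);    /* carved out of a 4K puddle */
        APTR large = AllocPooled(pool, 16384);  /* exceeds the threshold, so it
                                                   is drawn from main memory but
                                                   still tracked by the pool */
        if (small != NULL)
            FreePooled(pool, small, 256);       /* the size must be passed back */
        if (large != NULL)
            FreePooled(pool, large, 16384);

        DeletePool(pool);                       /* releases every puddle at once */
    }
}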
Because all the memory chunk/puddle, etc. allocations and deallocations use the humble doubly-linked Exec list as their fundamental data structure, the amount of time spent finding the right memory chunk, and putting the fragments back together, scales poorly. Does this sound familiar?
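For illustration only - the chunk node here is hypothetical, not Exec's own structure - the kind of linear first-fit walk involved looks like this:

#include <stddef.h>

struct chunk {
    struct chunk *succ;
    struct chunk *pred;
    size_t free_bytes;
};

/* First fit: every allocation walks the list until something is big enough,
 * so the cost grows with the number of chunks and puddles in play. */
static struct chunk *find_first_fit(struct chunk *head, size_t wanted)
{
    struct chunk *c;

    for (c = head; c != NULL; c = c->succ)
        if (c->free_bytes >= wanted)
            return c;

    return NULL;
}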
From the clib2 side I'm afraid that the library can only leverage what the operating system provides, and that is not well-suited for applications which have to juggle a large number of allocated memory fragments.
The question is which memory chunk sizes are common for NetSurf, and how many chunks are in play at any one time. If you have not yet implemented it, you might want to add a memory allocation debugging layer and collect statistics from it over time.
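One possible shape for such a debugging layer, purely as a sketch (the ns_* names are made up): wrap malloc()/free() and keep a histogram of request sizes plus a count of live chunks.

#include <stdio.h>
#include <stdlib.h>

static unsigned long live_chunks;
static unsigned long size_histogram[16];   /* power-of-two size buckets */

/* bucket i holds requests of up to (16 << i) bytes; the last holds the rest */
static int size_bucket(size_t size)
{
    int i = 0;
    while (i < 15 && size > ((size_t)16 << i))
        i++;
    return i;
}

void *ns_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        live_chunks++;
        size_histogram[size_bucket(size)]++;
    }
    return p;
}

void ns_free(void *p)
{
    if (p != NULL) {
        live_chunks--;
        free(p);
    }
}

void ns_dump_stats(void)
{
    int i;
    printf("live chunks: %lu\n", live_chunks);
    for (i = 0; i < 15; i++)
        printf("<= %6lu bytes: %lu\n", (unsigned long)(16UL << i),
               size_histogram[i]);
    printf(" > %6lu bytes: %lu\n", (unsigned long)(16UL << 14),
           size_histogram[15]);
}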
It may be worth investigating how the NetSurf memory allocations could be handled by an application-specific, custom memory allocator that sits on top of what malloc()/alloca()/realloc()/free() can provide and which should offer better scalability.
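As a rough sketch of what I mean - hypothetical names, and it ignores realloc(), threads and memory pressure - freed blocks could be kept on per-size-class free lists and recycled, so the underlying allocator sees far fewer calls and no searching is needed for the common sizes:

#include <stdlib.h>

#define NUM_CLASSES 8   /* size classes of 16, 32, 64, ... 2048 bytes */

struct free_block { struct free_block *next; };

static struct free_block *free_lists[NUM_CLASSES];

/* smallest class the request fits in, or -1 if it is too large to cache */
static int size_class(size_t size)
{
    int i;
    for (i = 0; i < NUM_CLASSES; i++)
        if (size <= ((size_t)16 << i))
            return i;
    return -1;
}

void *app_alloc(size_t size)
{
    int cls = size_class(size);

    if (cls >= 0 && free_lists[cls] != NULL) {
        struct free_block *b = free_lists[cls];   /* reuse a cached block, O(1) */
        free_lists[cls] = b->next;
        return b;
    }
    /* allocate the rounded-up class size so the block can be recycled later */
    return malloc(cls >= 0 ? ((size_t)16 << cls) : size);
}

void app_free(void *mem, size_t size)
{
    int cls = size_class(size);

    if (mem == NULL)
        return;

    if (cls >= 0) {                               /* cache it instead of freeing */
        struct free_block *b = mem;
        b->next = free_lists[cls];
        free_lists[cls] = b;
    } else {
        free(mem);
    }
}

The catch is that app_free() has to be told the original request size, so this only works where the calling code knows how much it allocated.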