I realise it's a bit off topic, but I'm far more surprised by my Q9450 result. I've gotten it down to ~6.5 seconds by using a RAM-based tmpfs for both source and destination and disabling clock scaling so that the machine is locked at 2.67GHz.
I actually find that a pretty embarrassing result for a 64-bit build, which supposedly has SSE3 enabled by default*. This is a chip with 6MB of L2 cache (12MB in total, but the Q9450 design is akin to a pair of dual cores with 6MB each), on a 1333MHz FSB with dual-channel DDR3 at 6-6-6 latency and an X48 chipset. And all I got was a factor of 2.6 over a 1.5GHz G4, when there's a 1.78x increase in clock speed to start with? That makes this machine, what, 46% faster at this task clock for clock? :lol:
*I'm going to have to rebuild this myself and ensure all SSE3 optimizations are enabled.
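For anyone who wants to check the maths, here's a quick sketch of where the 46% comes from. It only uses the figures quoted above (the 2.67GHz and 1.5GHz clocks and the 2.6x speedup); the variable names are just mine for illustration.

```python
# Figures quoted in the post; variable names are illustrative only.
q9450_clock_ghz = 2.67   # Q9450 locked at this clock
g4_clock_ghz = 1.5       # the G4 being compared against
measured_speedup = 2.6   # Q9450 vs. G4 on this task, wall-clock

# How much of the speedup is explained by clock speed alone.
clock_ratio = q9450_clock_ghz / g4_clock_ghz

# What is left over once the clock difference is divided out.
per_clock_speedup = measured_speedup / clock_ratio

print(f"clock ratio: {clock_ratio:.2f}x")
print(f"per-clock speedup: {per_clock_speedup:.2f}x")
```

So roughly 1.78x of the 2.6x is just clock speed, and the remaining ~1.46x is what the newer architecture, cache and memory are actually contributing per clock.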