I think the speed-up from 32-bit to 64-bit that mdma and I are seeing is simply down to the better optimisation that 16 general-purpose registers allow (32-bit x86 only has 8)... Also, the ISA in x86-64 long mode has been "cleaned up", which probably lets instruction sequencing help branch prediction and caching, not to mention it probably looks a lot more like the RISC core inside, which means less overhead too.
-Edit-
<----Look at the size of that Cache!!! :-o
-/Edit-
Note, the Athlon64's registers are now 64 bits wide. Not sure if there is much 64-bit integer processing in Windows, but whatever there is will be sped up, since a 64-bit value now fits in a single register instead of being split across two 32-bit ones. Maybe CPUs are so fast now that Microsoft needs counters that can go above 4 billion for their "Your Brand new CPU is too slow" delay loops ;-)