@mdma
Remember that x86 has used a RISC-style core for a long time. Behind the architectural registers there is a much larger pool of rename registers that act as 'aliases' for them. You can't address them directly, of course, but they allow greater parallelism when subsequent instructions don't depend on the immediate outcome of previous ones: two instructions that merely reuse the same register name no longer have to wait for each other.
AMD64 uses the same trick. It has a lot more than 16 physical registers behind a similar rename scheme. Even the PPC, with 32 registers, uses rename mechanisms to help eliminate stalls when instructions executing concurrently would otherwise contend for the same registers. Even the venerable 603e has five rename registers (roughly one per functional unit in the core).
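To make the renaming idea a bit more concrete, here is a toy model in C. It is only a sketch and not how any real core is built: the names (`RenameTable`, `rename_src`, `rename_dst`) and the sizes (8 visible registers, 32 hidden ones) are made up for illustration, and it never recycles physical registers the way real hardware does.

```c
#define ARCH_REGS 8    /* registers the program can name (hypothetical count) */
#define PHYS_REGS 32   /* hidden physical registers (hypothetical count) */

typedef struct {
    int map[ARCH_REGS];  /* architectural register -> current physical register */
    int next_free;       /* next unused physical register; never recycled in this toy */
} RenameTable;

/* Start with each architectural register mapped to the physical
 * register of the same number. */
void rename_init(RenameTable *t)
{
    for (int i = 0; i < ARCH_REGS; i++)
        t->map[i] = i;
    t->next_free = ARCH_REGS;
}

/* A read uses whichever physical register currently holds the value. */
int rename_src(const RenameTable *t, int arch)
{
    return t->map[arch];
}

/* A write is given a fresh physical register, so it does not have to
 * wait for older instructions still using the previous copy.
 * Returns -1 when the toy pool is exhausted. */
int rename_dst(RenameTable *t, int arch)
{
    if (t->next_free >= PHYS_REGS)
        return -1;
    t->map[arch] = t->next_free++;
    return t->map[arch];
}
```

If two back-to-back instructions both write architectural register 0, `rename_dst` hands them two different physical registers, so the second write does not overwrite the value the first one is still producing.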
What it basically means is that x86 and AMD64 both run from their registers and L1 cache most of the time. Having 16 registers just means the programmer/compiler can write better code by keeping more values in registers. An algorithm on x86 might spend a reasonable amount of time juggling variables between registers and memory (usually the cache) during a loop; the same algorithm on AMD64 could simply keep the important values in registers, cutting down the number of instructions required to perform the overall operation. This is where you will see some speedup.
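As a rough illustration of the kind of loop I mean, here is a small C function (`fir8` is a made-up name, nothing standard) that has about a dozen values live inside the loop: eight coefficients, two pointers, the index and the bound. 32-bit x86 has only 8 general-purpose registers, one of which is the stack pointer, so a compiler would most likely have to keep some of these in memory, while AMD64's 16 leave room for all of them. Whether a given compiler actually does that depends on the compiler and its flags, so treat this as a sketch rather than a benchmark.

```c
#include <stddef.h>

/* Hypothetical example: 8-tap weighted sum over a sliding window.
 * Writes n - 7 results to dst when n >= 8, nothing otherwise. */
void fir8(const int *src, int *dst, size_t n, const int coeff[8])
{
    /* Copy the coefficients into locals so the compiler is free to
     * keep them in registers for the whole loop. */
    const int c0 = coeff[0], c1 = coeff[1], c2 = coeff[2], c3 = coeff[3];
    const int c4 = coeff[4], c5 = coeff[5], c6 = coeff[6], c7 = coeff[7];

    for (size_t i = 0; i + 8 <= n; i++) {
        dst[i] = c0 * src[i]     + c1 * src[i + 1]
               + c2 * src[i + 2] + c3 * src[i + 3]
               + c4 * src[i + 4] + c5 * src[i + 5]
               + c6 * src[i + 6] + c7 * src[i + 7];
    }
}
```

Compiling this for both targets and comparing the assembly output (for example with the compiler's `-S` option) is an easy way to see how many loads and stores to the stack show up in the loop body on each.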