exactly, but Intels have huge caches (I think), so it's not actually pumping it out
to memory but to the cache, thus it will be a lot faster than you think,
The problem is not with memory accesses (x86 processors generally have the most efficient caches in the computing world); the problem is that code goes from (a stupid example):
mov eax, [ebx+myOffset] ; dependency on ebx
add eax, ecx ; dependency on the above load+ecx
to:
mov eax, [ebx+myOffset] ; dependency on ebx
bswap eax ; dependency on above load
add eax, ecx ; dependency on bswap+ecx
For a modern OOO processor the most important compiler optimization is to make the dependency chains as short as possible; bswap lengthens the dependency chains and so makes the code generally slower.
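The bswap itself can't be removed, but a scheduler can at least overlap independent chains. A hypothetical fragment (offsetA/offsetB and the register choices are made up) with two independent big-endian loads:
mov eax, [ebx+offsetA] ; chain 1: load
mov edx, [ebx+offsetB] ; chain 2: independent load, issues alongside chain 1
bswap eax ; chain 1: swaps while chain 2's load is still in flight
bswap edx ; chain 2: swap
add eax, ecx ; chain 1: consume
add edx, esi ; chain 2: consume
Each chain is still one instruction longer, but the throughput loss hides as long as there is independent work to interleave.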
if it has 8 general-purpose registers, that is quite sufficient,
you may be underestimating compilers, see later,
No, you are. Without enough "native" registers the compiler can't really do a number of powerful optimizations such as loop unrolling, software pipelining, loop fusion and more. That is (as with all things in life) not completely true: x86 compilers can use those optimizations in some cases, but not for the more interesting problems.
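To make that concrete, here is a hypothetical 4-way unrolled sum over a dword array (Intel syntax; ebx = array pointer, ecx = end pointer, both made up for the example). The four parallel partial sums alone eat four registers before the pointers are even counted:
xor eax, eax ; partial sum 1
xor edx, edx ; partial sum 2
xor esi, esi ; partial sum 3
xor edi, edi ; partial sum 4
sumloop:
add eax, [ebx] ; the four adds are independent of each other,
add edx, [ebx+4] ; so a superscalar core can issue them together
add esi, [ebx+8]
add edi, [ebx+12]
add ebx, 16
cmp ebx, ecx
jb sumloop
add eax, edx ; combine the partials
add esi, edi
add eax, esi ; final sum in eax
That is already 6 of the 7 usable x86 registers gone, for the most trivial loop imaginable.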
PPC has toooo many registers, the people who designed it are clueless about
what compilers can do, ...
What?!? RISC was designed to make a more effective hardware-software interaction possible, and one of the things that makes it better is more registers! You are really making yourself sound clueless...
to compute an expression with M terms you only need approx log_2(M) registers,
No, you don't need any (programmer-visible) registers at all: a stack machine gets by with zero. LIFO-4-life.
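For example, the x = (y+z)*t expression from further down as hypothetical stack-machine pseudo-code, with zero programmer-visible registers:
push y ; operands live on an operand stack
push z
add ; pop two, push y+z
push t
mul ; pop two, push (y+z)*t
pop x ; store the result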
IBM don't know how to design CPUs, otherwise why did they originally use Intel
and now use Motorola's PPC specification
(I don't know how they did their Mainframes)
Really... It is now clear that you have absolutely no f*c*ing clue about this area, and still you want to expose your ignorance?
IBM invented many things that are now used in processors all over the world, but they have no clue, right?
Their research (the 801 project) was what made RISC a possibility, and their Power line of processors provided the base for the PPC architecture, which clearly shows that they are clueless.
How they did their mainframes? They designed and implemented an architecture (System/360) that inspired most other computer manufacturers and is still, after many, many years, top of the line in its field. But of course they don't know how to design CPUs...
a good compiler would implement this as:
move.l d1,d0
add.l #5,d0
rts
2 registers implementing 4 variables,
No, a good compiler would inline that code...
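A hypothetical call site (call the routine addFive; I'm assuming the argument sits in d1). Before inlining, every use pays the call overhead:
jsr addFive ; branch out, rts back, pipeline disruption
After inlining, the body lands in place and gets optimized together with its surroundings:
move.l d1,d0
add.l #5,d0 ; and if d1 is dead here, copy propagation
 ; leaves just a single add.l #5,d1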
many 1024-term expressions can be done with far fewer registers, e.g.
x1 + x2 + x3 + ... + x1024 only needs 2 registers:
move x1,d0
move x2,d1
add d1,d0
move x3,d1
add d1,d0
move x4,d1
add d1,d0
.....
move x1024,d1
add d1,d0
And suddenly your processor is serialized! The last add is dependent on the 2046 instructions preceding it; a better compiler would make use of the superscalar nature of the processor and let the hardware parallelize it.
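A hypothetical reassociated version, same 68k style and the same x1..x1024, using four accumulators instead of one:
move.l x1,d0
move.l x2,d1
move.l x3,d2
move.l x4,d3
add.l x5,d0 ; these four adds have no dependencies on each other,
add.l x6,d1 ; so they can issue in the same cycle
add.l x7,d2
add.l x8,d3
.....
add.l d1,d0 ; combine the partial sums at the end
add.l d3,d2
add.l d2,d0 ; result in d0
The dependent chain shrinks by roughly a factor of four, and with more accumulator registers it shrinks further; that is exactly what the extra registers are for.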
In real code from real programs a lot of the time only 4 registers are necessary,
Yes, completely true.
with 32 registers you could do expressions with 4 billion terms, which is
an impractical ability,
And yet again your ignorance shows. The reason for having many registers is that one can generate more efficient code; I have sometimes been forced to use ESP (the x86 stack pointer register) in inner loops to get satisfactory results, because the 7 "free" registers were not enough. 16 registers really simplifies code generation (for both ASM programmers and compilers), and 32 is even better.
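For the curious, a sketch of the kind of trick I mean (esp_save, src2 and the loop itself are made up; this assumes nothing that touches the stack, no push/pop/call, runs before ESP is restored):
mov [esp_save], esp ; park the real stack pointer in memory
mov esp, src2 ; ESP now serves as an extra pointer register
copyloop:
mov eax, [esi] ; first source stream
add eax, [esp] ; second source stream, via the hijacked ESP
mov [edi], eax
add esi, 4
add esp, 4
add edi, 4
dec ecx
jnz copyloop
mov esp, [esp_save] ; restore before anything stack-related happens
Ugly and fragile, and completely unnecessary on an architecture with a sane register count.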
Let's see... The year is 2004 and most manufacturers are now beginning to target 90nm processes. Combine that with the fact that register files are compact (== short wires). When we then add that modern processors already have >80 registers internally (for renaming), and that the decode and scheduling parts of a processor are much more complex and larger than a tiny register file, it really seems ridiculous to complain about it. And with more registers we can optimize the code better and thus get faster execution; do you still complain?
real compiler code uses exactly the accumulator concept, see the code fragment
above: there is no alternative,
computing a mathematical expression is entirely an accumulator-process:
x = (y+z)*t ; load y, add z, mul t, store x,
also it's always like this,
this is why accumulator CPUs will be very fast,
LOL!
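To spell out the joke: a single accumulator serializes everything. A hypothetical two-expression case, x = y+z+t and u = v+w+s, with just two registers:
move.l y,d0 ; expression 1 starts
add.l z,d0
move.l v,d1 ; expression 2 starts immediately, no dependency on d0
add.l w,d1
add.l t,d0 ; the two chains run side by side
add.l s,d1
move.l d0,x
move.l d1,u
On a one-accumulator machine the second expression cannot even begin until the first has been computed and stored. "Very fast" indeed.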