@whoosh777
Heh, we're not arguing, just comparing chip implementations. I still say, considering your enthusiasm for the idea, go for it! :-D
Anyhow, back to that circle :-D
1. small number of volatile scratch registers,
2. the first function arguments in non-scratch registers: if function arguments
are in scratch registers then those registers are no longer scratch!
(scratch registers are short-term breathing space),
3. further arguments to the stack,
eg on 68k:
a0,a1,d0,d1 as scratch, and function arguments f(d2,a2,d3,a3)
if you have too many function arguments in registers you run out of breathing space: the called function has to start using the stack to free up registers for internal use,
also, the calling function may already be using those registers for something else, so it has to back them up somewhere,
and if you have too many scratch registers it's wasteful,
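Just to make sure we're picturing the same scheme, here's how I read it in C terms (the register assignments are from your 68k example; the compiler behaviour in the comments is my assumption):

int callee(int x, int *p, int y, int *q)  /* x -> d2, p -> a2, y -> d3, q -> a3 per your scheme */
{
    /* only d0,d1,a0,a1 are scratch: anything needing more than a couple of
       temporaries forces the callee to spill to the stack or to save and
       restore some of d2-d7/a2-a6 itself */
    return x + y + *p + *q;
}

void caller(void)
{
    /* and if the caller already has live values in d2,a2,d3,a3 it has to
       back them up somewhere before loading the arguments */
    static int m = 3, n = 4;
    callee(1, &m, 2, &n);
}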
I sincerely suggest you read the PPC open ABI specification to get a better understanding of how it works. The issues you raise almost never occur because of how the ABI is laid out. Stack-based calling is rare (unless you have more than 8 integer/pointer parameters). Those registers never need to be backed up before a call, because the compiler won't use them to hold anything that needs to survive a call.
As you yourself point out, 4 registers are usually enough for most purposes. With 32 registers, some of which are reserved for the stack, the global data section etc., the compiler can almost always find somewhere to store a variable without using the stack.
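To make that concrete, here's a trivial call the way a typical PPC SysV-style compiler treats it (the register numbers are from the ABI; the rest is my sketch of typical code generation):

int sum3(int a, int b, int c)   /* arguments arrive in r3, r4, r5 - all volatile */
{
    return a + b + c;           /* result goes back in r3, no stack frame needed */
}

int wrapper(int x)
{
    /* x can sit in a non-volatile register (r14-r31), so it survives the
       call to sum3() without any shuffling around the call site */
    return sum3(x, x + 1, x + 2) + x;
}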
Even then, the compiler can spot which volatile registers don't get hammered across calls in the same translation unit. Same with function parameters: it can see which ones 'survive' the call, e.g. const arguments etc.
When I look at well optimised PPC code generated from a good compiler, I see very little stack usage, very little register backup etc.
Now, if you think back to your "stack cache" idea: a large register file, of which the programming standards say "this half is volatile" etc., actually gives you the same functionality.
Large register files are fantastically useful for breaking down complex expressions. All the examples you have posted so far tend to deal with linear code, doing fairly simple arithmetic.
x = a + b * (c + (d*d)); etc.
Chuck in function calls (some of which may be inlined), multidimensional pointer dereferences (which may be used several times in one expression), etc., and more and more volatile registers become useful for holding temporaries that may be needed more than once.
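A made-up fragment of the kind of thing I mean (all the names are invented for the example):

double weight(int i, int j);    /* imagine it defined elsewhere, possibly inlined */

void relax(double **grid, double **out, int i, int j)
{
    /* grid[i][j] and grid[i][j+1] are each needed twice; with registers to
       spare, the compiler keeps them (and the call result) in registers
       rather than re-reading memory or spilling to the stack */
    double cell  = grid[i][j];
    double right = grid[i][j + 1];
    double w     = weight(i, j);
    out[i][j] = w * (cell + right) + (1.0 - w) * cell * right;
}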
For a different example of why large register sets are handy: a small interpretive emulator I wrote as a thought experiment (for a theoretical bytecode CPU) uses several function lookup tables. One for opcodes, one for effective address calculation and one for debugging traps etc.
There is a data structure for the "core", containing a register file, stacks (separate ones for data, register backup and function calls) and a code pointer (the PC, but expressed as an absolute address) etc.
Code is executed in blocks, until either a certain number of statements has been executed or a particular trap (or break signal) has been invoked.
Careful design of the code allowed each function table base, the virtual register file, the virtual stack pointers and the code pointer to persist in the same registers throughout a call to execute(), without needing to be moved, saved etc. across all the calls incurred during execution of the bytecode. Given that we are talking about possibly millions of calls, that saving is considerable.
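To give a flavour of the shape of it (everything here is heavily cut down and the names are invented for this post):

#include <stdint.h>

struct core {
    uint32_t  reg[16];           /* virtual register file                     */
    uint32_t *dsp, *rsp, *csp;   /* data, register-backup and call stacks     */
    uint8_t  *pc;                /* code pointer, held as an absolute address */
    int       trapped;
};

typedef void (*op_fn)(struct core *);

extern op_fn opcode_tab[256];    /* one handler per opcode          */
extern op_fn ea_tab[16];         /* effective address calculation   */
extern op_fn trap_tab[32];       /* debugging traps etc.            */

void execute(struct core *c, int max_ops)
{
    /* the core pointer, the table bases and the counter can all live in
       non-volatile registers for the whole loop, so the millions of
       indirect calls below never force them to be saved or reloaded */
    while (max_ops-- > 0 && !c->trapped)
        opcode_tab[*c->pc++](c);
}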
Regarding the x86 internal RISC issue, I never said that it was better or worse than "up front" RISC.
But we can infer 2 things:
1) A RISC style core clearly makes sense as virtually every CPU manufacturer is using it.
2) Your assumption that the "external CISC" style approach of x86 could be better based on code density is very difficult to judge. You have to consider that the code decomposition into the internal micro-op language is far from simple to achieve. The design of these cores is fantastically complicated, gobbling up silicon like nobody's business.
The problem is, it's not a simple linear process where x86 instruction "X" is always decomposed into micro-op codes "a b c". The early stage of decode may work this way, but once it has to start processing "a b c", it almost works like a mini compiler, looking to see which rename registers are/will become free, which instructions have dependencies etc. All this takes clock cycles - which is partially why modern x86 CPUs have such very long pipelines and "time of flight" for instructions.
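Purely as an illustration (a toy picture, nothing like the real hardware structures), here is what a single read-modify-write x86 instruction decomposes into, and the dependency chain the scheduler then has to juggle at run time:

/* add [ebx], eax  ->  t1 = load [ebx];  t2 = t1 + eax;  store [ebx], t2 */
enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    int dst;         /* rename register written, -1 if none */
    int src1, src2;  /* rename registers read,   -1 if none */
};

static const struct uop add_mem_eax[3] = {
    { UOP_LOAD,   1, -1, -1 },  /* t1 <- memory                            */
    { UOP_ADD,    2,  1,  0 },  /* t2 <- t1 + eax (eax already renamed t0) */
    { UOP_STORE, -1,  2, -1 },  /* memory <- t2                            */
};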
The above work is largely non-existent for up-front RISC designs because it's a compile-time issue. They still have to worry about rename registers and such, but compile-time instruction scheduling has made their life a lot easier.
In fact, part of the whole point of RISC is that it makes the CPU's life easier by making the compiler's life more difficult :-D
Now, the present x86 designs use the above internal RISC approach not because they thought "Hmm, this is better than those young whippersnapper RISC CPUs", but because they *had* to keep x86 object code compatibility and the newer RISC-style processors emerging were seriously threatening them.
You only have to look back at the time when x86 introduced these RISC-style cores and DEC still made the Alpha AXP. We had several Windows NT workstations in our spectroscopy labs at Uni; one was running the latter at 266 MHz, and we had a newer P-II at 300 MHz with its spanking new "internal RISC-style core" - and the Alpha still stuffed it :-D