If they don't get 150 MHz this year, they probably will next year as FPGA speeds increase and prices drop. There are FPGAs that are fast enough now, but they cost a lot of money. The 68k often did outperform the early PPC processors. Even the 68040 outperformed the early PPC processors in the early Macs. Amiga users with a 68060 and ShapeShifter or Fusion had the fastest Mac for about a year. Apple made the later Mac OS incompatible with the 68060 because of it: Mac OS 7 worked great with the 68060, and then Mac OS 8 didn't, for some odd reason. An FPGA N68k isn't going to wow people or steal x86 market share, but it will probably be fast enough to impress Amiga users still on the classic machines, and fast enough for general computing needs. That's enough for me. I'm not opposed to PPC Amigas; I would like to see 68k for the low end and laptops, and PPC for the high end and desktops. The attitude of Hyperion has turned me off, though. I prefer the openness of Natami and AROS (but don't care for the x86 focus of AROS).
Here's what an IBM engineer has to say about the 68060 and PPC...
"With 2 instructions per clock and excellent multiplication and branch performance, the 68060 performs very good. Depending on the workload the 68060 can even outperform similar clocked 60x/G2/G3 PowerPC CPU."
"Actually the 68060 is faster in multiplication than many PowerPC.
The PPC G2 (603) and G3 CPU need 5 cycles for a multiplication.
Which means a 100 Mhz 68060 achieves the same multiplication performance as an 250 Mhz G3."
http://www.natami.net/knowledge.php?b=4&note=2418
Let's look at some 68k code to see what is so great about the 68k. Take a simple 68k memory copy, with the size (in longwords) in d0...
.loop:
move.l (a0)+,(a1)+
subq.l #1,d0
bne.b .loop
Let's say we don't know the alignment of the data, either. This copies one longword per cycle with aligned data, and the whole loop is 6 bytes. If the data is unaligned, it is still pretty good. Now write that on PPC with anywhere near the performance. Don't let the old, outdated 68060, with its tiny caches and only 4 bytes of instruction fetch per cycle, destroy you. I'll even give you a few hints. You had better align the data first or the performance is really bad. You will need twice as many instructions to duplicate what's above. You will need to use an unrolled loop (wasting more code) and preload the cache. Even if you do all that optimally, you are still likely slower than the 68060. No wonder PPC needs all those GHz.
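For reference, the C equivalent of the 68k loop above is a one-line body, with no alignment tricks and no unrolling; how fast it actually runs is then up to the compiler and the CPU:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n longwords (32-bit words) from src to dst -- the same job the
   three-instruction 68k loop does with the count in d0. */
void copy_longs(uint32_t *dst, const uint32_t *src, size_t n) {
    while (n--)
        *dst++ = *src++;
}
```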
K8/K10 and Core 2/i3/i5/i7 can process a 32-bit IMUL every cycle (1-cycle throughput) with 3-cycle latency.
Core i3/i5/i7 can process a 64-bit IMUL every cycle (1-cycle throughput) with 3-cycle latency. K8/K10 and Core 2 can process a 64-bit IMUL every 2 cycles (2-cycle throughput) with 4-cycle latency. To hide the latency, factor in the pipeline and its speculative execution design.
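A toy pipeline model (my own simplification, not vendor data) shows why throughput and latency are different numbers in practice: a chain of dependent multiplies pays the full latency every time, while independent multiplies issue at the throughput rate.

```c
/* Toy model: cost in cycles of n back-to-back multiplies. */

/* Dependent chain: each multiply consumes the previous result, so the
   next one cannot start until the full latency has elapsed. */
unsigned cycles_dependent(unsigned n, unsigned latency) {
    return n * latency;
}

/* Independent multiplies: one issues every `throughput` cycles, and the
   last result arrives `latency` cycles after the last one issues. */
unsigned cycles_independent(unsigned n, unsigned throughput, unsigned latency) {
    return (n - 1) * throughput + latency;
}
```

With the quoted figures (1-cycle throughput, 3-cycle latency), 100 dependent multiplies cost about 300 cycles in this model, but 100 independent ones cost about 102.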
Per-instruction benchmarks don't show the whole picture in regards to overall performance, i.e. use some proper application benchmarks.
....
This means that on the x86 you constantly need to work with variables on the stack, which limits the overall performance of the x86 quite a lot. All Intel chips can at best do one stack operation per clock, so since they often have to work on the stack, the average instructions per clock goes down to 1.x.
That the x86 has to work with the stack constantly also makes it very hard for the x86 to effectively use more ALUs to increase performance.
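The register-pressure argument can be sketched with a hypothetical function (names and data are mine, for illustration): 32-bit x86 exposes only 8 general-purpose registers, with ESP and often EBP reserved, versus the 68k's 16 (8 data + 8 address). With more simultaneously live values than registers, a compiler has to spill some of them to the stack, and each spill and reload is one of those one-per-clock stack operations.

```c
/* Nine temporaries live at the same time: comfortably held in the 68k's
   register file, but more than x86-32 can keep in registers, so some
   are likely spilled to the stack by the compiler. */
int sum_of_products(const int *a, const int *b) {
    int t0 = a[0] * b[0];
    int t1 = a[1] * b[1];
    int t2 = a[2] * b[2];
    int t3 = a[3] * b[3];
    int t4 = a[4] * b[4];
    int t5 = a[5] * b[5];
    int t6 = a[6] * b[6];
    int t7 = a[7] * b[7];
    int t8 = a[8] * b[8];
    return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
}
```

(In practice a compiler may reorder the sums to shorten the live ranges; the point is only that x86-32's small register file pushes work onto the stack sooner than the 68k's does.)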
As for X86's stack arch issue, AMD K10 includes Sideband Stack Optimizer hardware while Intel Core includes Stack Pointer Tracker hardware.
From http://www.xbitlabs.com/articles/cpu/display/amd-k10_5.html:
"Sideband Stack Optimizer unit tracks the stack status changes and modifies the instructions chain into an independent one by adjusting the stack offset for each instruction and placing sync-MOP operations (top of the stack synchronization) in front of the instructions that work directly with the stack register. This way instructions working directly with the stack can be reordered without any limitations"
--
This amounts to a JIT (Just-In-Time) optimizer in hardware.
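A toy sketch of the idea (my simplification of the quoted description): the front end tracks the running stack-pointer delta at decode time, so each PUSH or POP can be rewritten as an access at a known offset from the entry stack pointer, removing the serial dependency through ESP.

```c
#include <stddef.h>

/* Rewrite a sequence of 4-byte pushes ('p') and pops ('o') into
   absolute offsets below the entry stack pointer, the way a sideband
   stack optimizer resolves the ESP delta in the front end. */
void resolve_stack_offsets(const char *ops, size_t n, int *offsets) {
    int depth = 0; /* bytes currently pushed */
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == 'p') {
            depth += 4;
            offsets[i] = -depth;  /* store goes to [entry_sp - depth] */
        } else {
            offsets[i] = -depth;  /* load comes from [entry_sp - depth] */
            depth -= 4;
        }
    }
}
```

Once every access has a fixed offset, consecutive pushes and pops no longer read or write a shared, serially updated ESP, so the memory operations can execute out of order or in parallel.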
On Intel Core i3/i5/i7
Stack Pointer Tracking (SPT) implements the Stack Pointer Register (RSP) update logic of instructions which manipulate the program stack (PUSH, POP, CALL, LEAVE and RET) within the IDU. These macro-instructions were implemented by several micro-ops in previous architectures.
The benefits of SPT include:
+ a single micro-op for these instructions improves decoder bandwidth,
+ execution resources are conserved since RSP updates do not compete for them,
+ parallelism in the execution engine is improved since the implicit serial dependencies have already been taken care of,
+ power efficiency improves since RSP updates are carried out by a small hardware unit.
---
http://www.chip-architect.com/news/2001_10_02_Hammer_microarchitecture.html:
"This is for instance the case in the ESP Look Ahead Unit that allows among other things that consecutive PUSHes and POPs to and from the stack can be executed simultaneously"
PS: the 2001 block diagram resembles the AMD Bulldozer block diagram.