Topic: Motorola 68060 FPGA replacement module (idea)

matthey · « **on:** January 06, 2013, 06:49:18 PM »

Quote from: ChaosLord;721489

... designed by a 200th TechMage.

Quote from: freqmax;721492

What do you mean with that?

I think he means 200th level TechMage. You must not have played D&D as a child.

@TCL
I thought the N050 only implemented write-through caches, which are much easier to implement than copyback. They have excellent compatibility, and somewhat faster modern memory would make up for some of the speed deficit. I agree that at least write-through caches for both instructions and data are needed. Anyone saying otherwise should turn off their accelerator caches and experience 68000 performance all over again.

matthey · « **Reply #1 on:** January 08, 2013, 08:40:38 AM »

Quote from: Mrs Beanbag;721644

It might be a good starting point. [ColdFire] Differences are:

1. No DBcc
2. No bitwise rotation (rol, ror)
3. No bitfield operations
4. Multiply instructions don't set flags. From the Coldfire manual:
CCR[V] is always cleared by MULS/U, unlike the 68K family processors

1-3 on your list are not 68k conflicts but trapping would make them very slow. Here is a list of 68k and ColdFire conflicts that are not fixable by trapping (i.e. ColdFire.library):

1. ColdFire stack is 4 byte aligned (68k 2 byte). MOVE.B/W (SP)+ and -(SP) fail.
2. REMS/REMU encoding is incompatible with DIVSL/DIVUL encoding.
3. ColdFire multiply instructions don't set flags like the 68k.

In addition, practically anything in Supervisor mode will not work.

Quote from: Mrs Beanbag;721644

Coldfire also has a few extra commands (some of which would be quite useful, such as saturate and multiply-accumulate)

MVS, MVZ and BYTEREV should have been in the 68060. SATS is good for DSP/Codec type processing but where is SATU and ABS? The CF MAC processor is powerful but is a bolt on that doesn't fit with the 68k/ColdFire IMO. It's a poor man's SIMD as the CF is low end and cheap, cheap, cheap. Freescale will sell you PPC or now ARM (which they sadly license) if you need some real processing power.

Quote from: JimDrew;721748

Well, there are quite a few Amiga programs - including several of my own that all follow 100% legal programming practices (according to common sense and the RKMs) that will not run on an 060 with superscalar and/or branch caching enabled. I don't recall all of the reasons behind the issues. I should go look at the mmu.library replacement that we made for EMPLANT and FUSION... I know I commented some things there. I know that self modifying code is definitely one of the things that causes a problem when one of the cached instructions in the pipeline has been modified (like a branch table). Yes, I consider self-modifying code 100% legal. You are supposed to flush the caches (or turn them off) with self modifying code, but when you do that you are then running at sub-030 speeds.

Self modifying code needs to flush the caches (including the branch cache), which negates the advantage of the caches and any speed gains of self modifying code. If you don't like caches, stick to the 68000 until you change your mind :/. Some early 68060.library versions may not have flushed all the caches properly, may not have worked around the superscalar bugs in the 68060 properly, or may have had bugs in the CPU support code used for trapping. The best ones matured and work fine. Fusion works fine on the 68060 here except for an occasional random crash. The last ShapeShifter was more stable though. That was using the last version of Fusion, which I bought in your Fusion/PCx CD bundle. Fusion had some nice features over ShapeShifter, like file transfer and automatic screen mode changes from within the Mac, but stability is more important. I would still use Fusion if it were more stable and supported more hard drive options, which ShapeShifter is better at.

The Natami fpga CPU was going to use write-through caching with snooping and automatic flushing of detected dirty cache lines. This is a good option that allows very large caches with excellent compatibility. It would also be possible to automatically flush a branch cache in the address range of the dirty lines detected by snooping. With the faster memory and larger caches of today, this should give cache performance close to that of the 68060 with better tolerance for self modifying code.
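
To make the write-through plus snooping idea concrete, here is a toy model. It is only a sketch under my own assumptions: the class and function names are made up, each "line" holds a single address, and it says nothing about how the Natami core was actually going to do it. The point is just that a store goes straight to memory and any other cache holding that address drops its copy, so modified code is seen without a manual flush.

```python
class WriteThroughCache:
    """Toy write-through cache: one entry per address, no size limit."""

    def __init__(self, memory):
        self.memory = memory  # shared backing store: dict of address -> value
        self.lines = {}       # cached copies

    def read(self, addr):
        # Fill from memory on a miss, then serve from the cache.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through: memory is always updated, so it never goes stale.
        self.memory[addr] = value
        self.lines[addr] = value

    def snoop_write(self, addr):
        # Another bus master wrote this address: drop our stale copy.
        self.lines.pop(addr, None)


def store(dcache, snoopers, addr, value):
    """Store through the data cache and let the other caches snoop the write."""
    dcache.write(addr, value)
    for cache in snoopers:
        cache.snoop_write(addr)


memory = {0x100: "old instruction"}
icache = WriteThroughCache(memory)
dcache = WriteThroughCache(memory)

print(icache.read(0x100))                        # instruction is now cached
store(dcache, [icache], 0x100, "new instruction")  # self modifying code
print(icache.read(0x100))                        # refetched after the snoop hit
```

A branch cache could be handled the same way by treating it as one more snooper on the address range.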

Quote from: JimDrew;721748

The 060 really only adds dual instruction pipelining and a 4-way cache. A higher speed (100MHz+) 040 core would probably be better in the long run, especially if it handled floating point without completely stalling the core like the 060 does.

The MC68060UM says:

"The MC68060 allows simultaneous execution of two integer instructions (or an integer and a float instruction) and one branch instruction during each clock."

"The MC68060's FPU operates in parallel with the integer unit. The FPU performs numeric calculations while the integer unit continues integer processing."

The 68060 FPU was a nice improvement over the 68040 FPU. It dropped a few 040 FPU instructions that were very rarely used and added back the FINT and FINTRZ instructions which compilers use commonly. The execution speeds were also improved across the board and more parallel operation is possible. The 040 can do some limited parallel operation also.

The 68060 is a great processor which does a lot of parallel work but it's not easy to make and it's probably even harder to make in an fpga. A faster clocked, more 68040 like CPU makes sense in the fpga. Bigger caches, a branch cache and more parallel operation are needed for maximizing performance though.

matthey · « **Reply #2 on:** January 13, 2013, 03:22:37 PM »

Quote from: psxphill;722293

Ok. It's bytes that affect a7 by 2, but everything else by 1.

No. The stack is "affected" by the number of bytes pushed to or popped from it as follows.

68k:
byte = 2 bytes (with 1 byte of padding to maintain word alignment)
word = 2 bytes
longword = 4 bytes

ColdFire:
byte = 4 bytes (with 3 bytes of padding to maintain longword alignment)
word = 4 bytes (with 2 bytes of padding to maintain longword alignment)
longword = 4 bytes

This is one of the main incompatibilities between 68k and CF as I mentioned earlier.
Note that movem.w (sp)+ sign extends as it restores registers and is missing on the CF.
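
The two lists above boil down to rounding the operand size up to the stack alignment. A small sketch of that rule (the table and function names are mine, just for illustration):

```python
# Operand sizes in bytes and the stack alignment of each CPU, as listed above.
OPERAND_BYTES = {"byte": 1, "word": 2, "longword": 4}
STACK_ALIGNMENT = {"68k": 2, "ColdFire": 4}


def stack_adjust(cpu, size):
    """Bytes A7 moves for one push or pop: operand size rounded up to the alignment."""
    align = STACK_ALIGNMENT[cpu]
    nbytes = OPERAND_BYTES[size]
    return (nbytes + align - 1) // align * align


for cpu in STACK_ALIGNMENT:
    for size in OPERAND_BYTES:
        print(f"{cpu:8} {size:8} -> A7 moves {stack_adjust(cpu, size)} bytes")
```

Any 68k code that pushes a byte or word and assumes A7 moved by 2 is therefore off by 2 on the CF.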

Quote from: Mrs Beanbag;722302

How about this for a crazy idea, an accelerator with an Arm CPU and an FPGA, the FPGA can function as a 68k CPU if set up as such, so it could run like the PPC accelerator boards. BUT you install AROS for ARM ROM chips and use the Arm as the main CPU, and allow the FPGA to be reconfigured by the Arm chip, so then you could develop your 68k core "live", and install updates through software.

That hardware sounds pretty close to what the fpga Arcade is already.

matthey · « **Reply #3 on:** January 14, 2013, 01:56:17 PM »

Quote from: freqmax;722430

But I still find one FPGA-done the least amount of fuss solution. And the m68k op codes to be way nicer to deal with in contrast to x86 ones.

+1

matthey · « **Reply #4 on:** January 15, 2013, 01:30:50 PM »

Quote from: JimDrew;722601

If you really plan on using strictly a FPGA to emulate the CPU, I would suggest someone modifying WinUAE to make a histogram of instruction usage. This would let someone focus on optimizing the 680x0 core by looking at instruction usage which could help determine things like what changes in the cache, pipelines, etc. will benefit the speed.

Some instructions are used less often but reduce branching (my favorite), are very powerful and/or save code (some operations are easy in hardware but difficult to do in software). Some DSP or SIMD like instructions are used in processor intensive codecs and drivers where they offer huge speedups but are used less in normal code. A simple instruction count only gives a partial picture of what instructions are best.
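
A quick sketch of why a plain count is only part of the picture. The trace and the cycle costs below are invented for illustration and are not measurements of any real program or CPU:

```python
from collections import Counter

# Hypothetical executed-instruction trace and per-instruction cycle costs.
# Both are made-up numbers, only here to show the effect of weighting.
trace = ["move", "add", "move", "bne", "divs", "move", "bne", "add"]
cycles = {"move": 1, "add": 1, "bne": 1, "divs": 38}

# Plain histogram: how often each instruction was executed.
counts = Counter(trace)

# Weighted histogram: how much time each instruction accounts for.
weighted = Counter()
for mnemonic, n in counts.items():
    weighted[mnemonic] = n * cycles[mnemonic]

print("by count: ", counts.most_common())
print("by cycles:", weighted.most_common())
```

With these numbers the rarely used divide dominates the weighted list even though it is last by count, and even a weighted histogram still can't show branches or code size that a new instruction would remove.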

matthey · « **Reply #5 on:** January 15, 2013, 02:32:25 PM »

Quote from: ChaosLord;722637

Yes! I love those! I (and Phil) were always pushing these at the Natami CPU Dezine Dept. but Gunnar did not like them or didn't understand so he was totally against adding a new instruction for this purpose.

Gunnar thought he could solve all short branches with predication but it has issues when variable length and variable cycle instructions are used. I originally thought instructions like ABS would not be useful for this reason but I came to realize predication was not a good idea even before Gunnar did. It doesn't work well with the 68k address register update addressing modes like (An)+ and -(An) either. The original Scc instruction takes the correct approach. It didn't take too much to convince Gunnar that these kinds of instructions were better than predication or conditional moves like x86 CMOV. That's why we added SBcc, SELcc, ABS, POPCNT, etc. to the ISA. They fit and have minimal pipeline overhead (hazards) while reducing short branches. Long branches still need to jump. If we could remove 5-15% of branches (the short ones) and their overhead in the branch cache and history, the 68k would be one of the best processors at branching. Add to that a relatively short pipeline (and so a small mis-predicted branch penalty) and 0 cycle loops and we would have much improved performance, a beautiful CPU to program and even better code density.
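
For anyone who hasn't seen the Scc style of branch removal, here is a rough model of it on 32 bit values. This is the generic mask trick, not the actual definition of the proposed SELcc or ABS instructions (their exact semantics are not given here), and the function names are my own:

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed two's complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def scc_mask(condition):
    """Like Scc followed by EXTB.L: all ones if the condition holds, else zero."""
    return -int(bool(condition)) & MASK32


def select(condition, if_true, if_false):
    """Pick one of two values with masks instead of a branch."""
    m = scc_mask(condition)
    return ((if_true & m) | (if_false & ~m)) & MASK32


def abs32(x):
    """Branchless absolute value: the mask is the sign bit spread over all 32 bits."""
    x &= MASK32
    m = scc_mask((x >> 31) & 1)
    return to_signed32((x ^ m) - m)


print(select(3 > 2, 10, 20))   # condition true, picks the first value
print(abs32(-5), abs32(7))
```

A dedicated instruction does the same job in one opcode with no short, hard to predict branch.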

matthey · « **Reply #6 on:** January 16, 2013, 01:04:14 AM »

Quote from: psxphill;722642

Trying to improve the ISA is a time sink which you'll never get payback from. The only benefit is a bit of ego boosting, but that subsides when reality hits.

It's not a time sink if people are working together in parallel, which is the way it was supposed to be when I started documenting the new 68k ISA. It's not a time sink if the new ISA attracts interest from outside of the retro crowd. It's not a time sink if the ISA is implemented and found to be a substantial improvement in power, code density, compiler support and ease of programming. You give up very little with the possibility to gain much more. There is a market for retro computing but a bigger market for a processor that can handle today's processing needs quickly with compact code as well as being compatible with old code. That's what ARM and x86 did. They evolved and now they are successful. Building a 68020 compatible CPU comes first, but even then it's smart to plan ahead to make future enhancements easier.

Quote from: psxphill;722642

Making the pipeline follow the predicted branch might be hard, but it's doable. Thumb drops a lot of conditional instructions from Arm.

Yes, but they were using predication (unusual for a CPU) that only offers a small advantage in some specific hardware. The smaller the block of predicated instructions and the simpler the instructions the better. Most original ARM ISA instructions could be conditional, which worked ok but was dropped with Thumb because it was not good for code density, which they were going after. The ARM block predication instruction was for multiple instruction predication but ARM went to OoO processors where it didn't work as well. The conditional instructions proposed in the 68kF ISA should work nicely while being a small simplification and improvement over a more generic CMOV like x86. They would work well on a superscalar CPU with a short pipeline and a cheap branch predictor (or no branch predictor), which the 68k is likely to have. There would still be some optimized code that would not want to use them at times. This includes highly predictable branches that are executed often and very tight loops where a highly predictable branch could be used instead. Note that some instructions like ABS (absolute value) have no drawbacks yet remove a branch that can be difficult to predict, and SELcc can remove 2 branches in some cases. I would like to do some testing in an implementation before finalizing the ISA.

matthey · « **Reply #7 on:** January 16, 2013, 09:07:33 AM »

Quote from: bloodline;722713

Sounds interesting, you might want to start a new thread about optimising and evolving the 68k ISA... As any discussion here might get confused with talk about FPGA implementations

There is this thread on amigacoding.de:

http://www.amigacoding.de/index.php?topic=273.msg635;topicseen#msg635

There is also a lot of good techy discussion on the Natami forum where the ideas started. You can do searches there for about any CPU term and find something interesting.

matthey · « **Reply #8 on:** January 16, 2013, 06:52:32 PM »

Quote from: psxphill;722751

You're very optimistic.

Optimistic? Yes! Waste of time? Maybe. At least I can say I tried even if I'm dreaming a little. Reality is only one visionary person with a wad of cash away 8-).

Quote from: psxphill;722751

It's difficult to predict the future, but I can't imagine there is anyone outside of the retro community that will ever have any interest in a 680x0 cpu core. There are far too many other SOC/ASIC/FPGA solutions that have already carved up the market. There is no competitive edge against any of the other alternatives and nobody in business will care if they can run 680x0 code.

No Edge? How about the best ease of use and code density in the industry? There are the FIDO, ColdFire and CPU32 but they were all cut down from the 68k instead of enhanced. ARM with Thumb 2 has moved close to what an enhanced 68k would be and it doesn't have any trouble selling. I think we would be a little more powerful and easier to use while Thumb 2 is a little more power efficient.

Quote from: psxphill;722751

The majority of people want something that can run existing software and use existing compilers, adding instructions will cause market fragmentation if anyone is tempted to ever use them. A product that doesn't ship because the people behind it gets delusions of grandeur is no use to anybody.

You are correct that the 68k is behind in development software. We tried to add instructions that would be easy for existing compilers to support. This includes common instructions on other platforms and ColdFire instructions that could be enabled in the compiler. Also, an optimizing assembler (like Frank Wille's vasm) could do a lot of optimizations without even changing the compiler.

Quote from: psxphill;722751

Chasing rainbows is all well and good, but it's the reason that Natami failed. I'd rather see something ship for once.

I'd like to see more Amiga products ship as well. They should ship with the most usable debugged 68020 core first but an fpga can be modified. The people that want a 68020 only core can stay with that and those who want to try something enhanced could also.

matthey · « **Reply #9 on:** January 18, 2013, 06:32:32 PM »

Quote from: psxphill;723054

Motorola removed some of the instructions added to the 020 and some of the FPU instructions to save space, that could be used for making it run quicker.

By only supporting the 060 instructions then you've saved space in the FPGA and the time taken to implement them.

For the most part, the 68060 chose good instructions to remove from hardware. One big exception is the integer 32x32=64 multiply. This was already used by compilers to turn a divide by a constant into a multiply, saving a huge number of cycles.
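
Here is the divide by constant trick spelled out for an unsigned divide by 10. This is a sketch of the well known reciprocal multiply, not the output of any particular compiler. The magic constant is ceil(2^35 / 10), the helper names are mine, and the 64 bit product is what the 32x32=64 multiply delivers in a register pair:

```python
MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10)


def mul_32x32_64(a, b):
    """Model of an unsigned 32x32=64 multiply: returns (high, low) longwords."""
    product = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)
    return product >> 32, product & 0xFFFFFFFF


def div10(n):
    """Unsigned 32 bit n / 10 using one multiply and one shift, no divide."""
    high, _low = mul_32x32_64(n, MAGIC_DIV10)
    return high >> 3  # together with taking the high longword: a shift by 35


for n in (0, 9, 10, 99, 1234567890, 0xFFFFFFFF):
    print(n, div10(n), n // 10)
```

Without the 64 bit result in hardware the high longword has to come from a trapped and emulated instruction, which throws the saving away.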

The .library should go in flash so it's available very early for bootable games.

Quote from: psxphill;723068

page 3-1

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf

The first 4 stages are for fetching and assigning the instruction to an integer unit. The next 4 stages are the dual integer unit, then the last two stages are completing the instructions.

It's quite a simple design.

I wouldn't say it's simple, although it may be compared to some modern processor designs (e.g. x86). There is an instruction buffer in between the pipeline stages that is very costly (muxes) on an fpga. There is also a translation from 16 bit variable length CISC to a fixed length 16 bit RISC in there. I don't think Motorola released the encoding format of their internal fixed length RISC, making it difficult to duplicate. There are 6 bytes of data with each 16 bit fixed length RISC word and I don't know if, for example, a MOVEA.W #,A0 immediate is extended when decoding or in the OEP. I believe the instruction becomes pOEP only if there is >6 bytes of data from extension words, but what if there is more than 12 bytes of extension word data (up to 18 bytes is possible)? If you think this is all simple, I volunteer you to do the VHDL programming of the replacement 68060.

Quote from: psxphill;723068

It doesn't evenly distribute instructions between integer pipelines, it only uses the second integer pipeline when the first is running an instruction that can be run at the same time. Whether it can will depend on the instruction as not all can even be run on the second pipeline and the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction then the next instruction also has to be put on the primary pipeline.

Also, in some cases the OEPs are locked together to process a single instruction.

Quote from: psxphill;723068

I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something that you can achieve in real world examples, however as a 32bit value can contain two instructions then it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set". But if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.

Long instructions (lots of extension words) are more of a problem than 1 cycle instructions for fetch starvation. The 68060 doesn't have much of a fetch bottleneck with most 68020 code because the instructions are short (the 020/030 has a serious fetch bottleneck). A 68060 fetch bottleneck can be seen in artificial tests. Gunnar made a mini benchmark program (posted on the Natami forum) that used longword immediates continuously, and it did show a substantial slowdown (1/4-1/3 slowdown as I recall). The 68060 needs longword data to be efficient but slows down if it has to fetch it very often. Most longword immediate values fit in 16 bits, and extending data is low overhead even in fpga (ARM uses a shift, which is high overhead in fpga). This is how MOVEA.W #,An and ADDA.W #,An work already. The same could be done for data registers also, as we found, which would be even more common. Also, adding MVS and MVZ would have helped.
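
A small sketch of the 16 bit immediate idea, the way MOVEA.W #,An already behaves: the word is sign extended to a longword, so any value that survives that round trip only needs 2 bytes of immediate data in the instruction stream instead of 4. The function names are mine:

```python
def sign_extend_16(word):
    """Sign extend a 16 bit value to a signed integer, as MOVEA.W #,An does."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def fits_in_word(value):
    """True if a signed 32 bit value is unchanged by truncating to 16 bits and sign extending."""
    return -0x8000 <= value <= 0x7FFF


def immediate_bytes(value):
    """Bytes of immediate data to fetch: 2 if a word form is enough, else 4."""
    return 2 if fits_in_word(value) else 4


for value in (100, -1, 32767, -32768, 32768, 70000):
    print(value, immediate_bytes(value))
```

Halving the immediate data for the common small constants is what takes the pressure off the fetch in the kind of artificial test described above.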

Quote from: psxphill;723068

The branch executing in zero cycles doesn't seem to be very well documented. I can't tell whether they are over-exaggerating what it does or not. My original thought was that the branch is in the primary pipeline and the secondary pipeline has the target or next instruction (depending on what is predicted). This doesn't actually cause it to execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0 cycle overhead.

What is odd is that they claim different for predicted correctly taken and predicted correctly not taken.

Different timing for predicted correctly taken and predicted correctly not taken is normal with a pipelined processor. Branches predicted backward with the branch target in the branch cache are effectively 0 cycles for loops which is awesome as loop unrolling is mostly not needed improving code density. Branches that fall through eat a cycle in the pOEP but a sOEP instruction can execute simultaneously if available (also awesome). Note that the branch unit is a separate unit that can do processing in parallel and that the branch target must be in the branch cache to get the 0 cycle branch taken. That means there is usually some additional overhead the first time executing code. I believe the 68060 does some kind of instruction folding/fusing of the branch with CMP/TST/SUBQ in order to make the 0 cycle branches happen. Very few modern processors have effectively free branches. Jens and Gunnar (Natami) didn't even have all the magic figured out. Joe Circello and the 68060 team had this all figured out back in the 90s and the Motorola marketing guys killed it for PPC. Pencil pusher power!

Quote from: psxphill;723068

So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does.

I think the branch instruction does go through the pOEP. The branch unit looks at it very early, makes a prediction and starts speculative execution. The pOEP still has to verify that the prediction is correct at execution time or flush the pipe and continue executing the other branch path.

Quote from: Mrs Beanbag;723058

The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.

No EXTB.L or TST.W/L An? No misaligned reads or writes? The misaligned reads and writes are a huge saver when not sure of the alignment. Compilers often can't guess the alignment so they bloat up the code and slow down the CPU to align the data before reading or writing.

The 68020+ has some other niceties but they are more advanced.

Quote from: ChaosLord;723082

And Branches >128 bytes :angel:

I think you mean Bcc.L and BSR.L. Branches up to 16 bit were supported on the 68000. The longword branches are big savers but only on fairly large programs. Not too many assembler programmers create programs >65k.

Quote from: freqmax;723088

I presume 16-bit branching is the same as that if a certain flag is set then one can conditionally jump 65536 memory positions?

It's signed so plus or minus ~32k.
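
To put numbers on the branch sizes, here is a sketch of how an assembler could pick the smallest Bcc form for a given displacement. The function name is mine and this is not vasm's actual code. The 8 bit displacement values $00 and $FF are the escape codes for the 16 bit and (68020+) 32 bit forms, so they can't be used as short displacements:

```python
def branch_size(displacement, cpu_020_plus=True):
    """Return the smallest Bcc size suffix that can encode the displacement."""
    # $00 in the 8 bit field selects the word form and $FF the longword form
    # on the 68020+ (a displacement of -1 would be an odd target anyway).
    if -128 <= displacement <= 127 and displacement not in (0, -1):
        return ".S"
    if -32768 <= displacement <= 32767:
        return ".W"
    if cpu_020_plus:
        return ".L"
    raise ValueError("displacement out of range for a 68000 Bcc")


for disp in (100, -128, 0, 200, -32768, 40000):
    print(disp, branch_size(disp))
```

On a plain 68000 anything beyond the word range has to become a JMP or a branch around one, which is where Bcc.L saves code on large programs.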

Quote from: freqmax;723088

I have some memory that x86 is limited to 128 position limit on branching? or perhaps it's 6502
How about ARM?

x86 branches are so screwed up with the early segmentation crap that you really have to define which x86 ISA and then don't ask me. The ARM 32 bit ISA is better but still has some limitations as I recall. I believe it only allows 24 bit addressing, too. It's quite old but the 68k was one of the first to have full 32 bit position independent code done right. An assembly programmer doesn't have to worry about the size with a modern optimizing assembler like vasm. It will automatically generate the most efficient encoding (for more than branching, as the 68020+ allows), including forward and backward branch optimization. The 68020+ enhancements removed a lot of limitations and can be used or optimized transparently, which is great. They should have left out the double memory indirect modes though.

matthey · « **Reply #10 on:** January 18, 2013, 07:32:49 PM »

Quote from: psxphill;723092

Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.

No, but logically something has to happen to 32 bit instructions to fit in 16 bits. It's possible (even logical) that the 2nd word of a 32 bit instruction becomes part of the 6 bytes of "data" per OEP. I don't know if that is possible for all 32 bit instructions though.

Quote from: Mrs Beanbag;723093

I can't honestly say if I use TST.L An or not, off the top of my head. Pretty sure I never do TST.W An though, can't think of much use for that.

You never do:

movea.l myptr,a0
tst.l a0
beq .nullptr

or

jsr (-$xxx,a6)
movea.l d0,a0
tst.l a0
beq .nullptr

Of course the latter is better sometimes:

jsr (-$xxx,a6)
tst.l d0
movea.l d0,a0
beq .nullptr

Some 68k processors could reduce the branch overhead with this one on a pipelined CPU, although be careful that a0 is not an input to an EA calculation right after the branch, or the first option would be better.

Not many have used TST.W An but be careful. It actually operates on a word and not a longword as many OPA.W instructions do. Vasm and PhxAss had an optimization to TST.W that was wrong for many years until it was recently found and fixed. Most of the time it would not cause a problem but it could lead to very rare random crashes.
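
Here is why a word test on a pointer only bites rarely. The sketch models just the Z flag of TST.W and TST.L, and the pointer value is a made-up example whose low word happens to be zero:

```python
def tst_l_sets_z(register):
    """Z flag after TST.L: set only if all 32 bits are zero."""
    return (register & 0xFFFFFFFF) == 0


def tst_w_sets_z(register):
    """Z flag after TST.W: only the low 16 bits are looked at."""
    return (register & 0xFFFF) == 0


ordinary_ptr = 0x00212F40   # hypothetical pointer, low word non-zero
unlucky_ptr = 0x00210000    # hypothetical valid pointer, low word zero

for ptr in (0, ordinary_ptr, unlucky_ptr):
    print(hex(ptr), tst_l_sets_z(ptr), tst_w_sets_z(ptr))
```

Only an address with all 16 low bits clear is mistaken for a null pointer, so the BEQ goes wrong only for that rare case, which fits the "very rare random crashes" pattern.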

Quote from: Mrs Beanbag;723093

I'm actually pretty careful not to do misaligned access, it just seems wrong, somehow. Just because you can, doesn't mean you should!

Good. Treat it like credit. Don't use it when you don't need it and don't abuse it when you do need it.

matthey · « **Reply #11 on:** January 18, 2013, 08:28:09 PM »

Quote from: freqmax;723099

Do you think it's feasible to create something that can get near 50 MHz 68060 in FPGA?

In an affordable fpga? Yes, but I think some techniques different from those used in the 68060 would work better in an fpga. It is probably easier to make a non-superscalar (more 68040 like) CPU that is clocked higher at first. It should be possible to achieve 100MHz+ in a sufficiently pipelined 68020+ processor. A link stack, more code fusing/folding and new instructions could make up for some of the disadvantages of the fpga processor vs the 68060. You will probably not get to 68060@100MHz performance until fpgas get cheaper. A CPU, FPU and MMU in fpga will probably push the logic capacity of affordable fpgas also.

Quote from: freqmax;723099

So there is a a kind of selection process such that instructions that doesn't depend on sequent instructions could be done in parallel while the rest is single pipeline?

The selection process is described in the MC68060UM Section 10 "Instruction Execution Timing".

Quote from: freqmax;723099

Btw, Is there any ISA that is neater and more straightforward than m68k?

Yes. There are simpler ISAs but most are less powerful. Motorola/Freescale have liked simple clean ISAs, favoring RISC since the 68k. The 88k was the 68k RISC replacement before being abandoned for PPC. It's a simple and clean classic RISC but a little weak compared to the 68k. The 96k DSP is an interesting RISC/CISC hybrid borrowing much from the 68k that is quite powerful and fairly clean but more difficult to use. The ColdFire was an attempt to simplify the 68k but in doing so made it inconsistent (more difficult to program but still relatively easy) and less powerful, even though some late enhancements brought some of the power back. The MCORE is a 16 bit fixed length (very simple) RISC that was meant to compete with ARM. It competed in power efficiency but it has to be one of the weakest modern 32 bit processors I have ever seen. It looks straightforward to program but also very tedious. Note that the PPC is not a Motorola/Freescale design. It is not very simple for a RISC (but fairly consistent), not easy to program and is very powerful.

matthey · « **Reply #12 on:** January 18, 2013, 09:24:22 PM »