Topic: Motorola 68060 FPGA replacement module (idea)

matthey · « *Last Edit: January 18, 2013, 06:47:33 PM by matthey* »

Quote from: psxphill;723054

Motorola removed some of the instructions added to the 020 and some of the FPU instructions to save space, that could be used for making it run quicker.

By only supporting the 060 instructions then you've saved space in the FPGA and the time taken to implement them.

For the most part, the 68060 chose good instructions to remove from hardware. One big exception is the integer 32x32=64. This was already used by compilers to turn a divide by a constant into a multiply saving a huge number of cycles.
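
A minimal sketch of the divide-by-constant trick mentioned above, written in Python rather than 68k assembly. It assumes unsigned 32 bit operands and a divisor of 10; the constant 0xCCCCCCCD with a total shift of 35 is the commonly used pair for that divisor, and the function name is made up for illustration. The high longword of the product is exactly what the 32x32=64 multiply delivers.

```python
MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10); assumes the divisor is 10
SHIFT_DIV10 = 35          # 32 bits to reach the high longword, plus 3 more


def div10_by_multiply(n: int) -> int:
    """Unsigned 32 bit n // 10 using one 32x32=64 multiply and a shift."""
    assert 0 <= n <= 0xFFFFFFFF
    product = n * MAGIC_DIV10            # full 64 bit product
    high = product >> 32                 # high longword of the 64 bit result
    return high >> (SHIFT_DIV10 - 32)    # remaining shift by 3
```

Without the 64 bit result in hardware, a compiler has to fall back to a real divide or a much longer sequence, which is where the cycle cost comes from.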

The .library should go in flash so it's available very early for bootable games.

Quote from: psxphill;723068

page 3-1

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf

The first 4 stages are for fetching and assigning the instruction to an integer unit. The next 4 stages are the dual integer unit, then the last two stages are completing the instructions.

It's quite a simple design.

I wouldn't say it's simple, although it may be when compared to some modern processor designs (e.g. x86). There is an instruction buffer in between the pipeline stages that is very costly (muxes) on an fpga. There is also a translation from 16 bit variable length CISC to a fixed length 16 bit RISC in there. I don't think Motorola released the encoding format of their internal fixed length RISC, making it difficult to duplicate. There are 6 bytes of data with each 16 bit fixed length RISC word and I don't know if, for example, a MOVEA.W #,A0 immediate is extended when decoding or in the OEP. I believe the instruction becomes pOEP only if there is >6 bytes of data from extension words, but what if there is more than 12 bytes of extension word data (up to 18 bytes is possible)? If you think this is all simple, I volunteer you to do the VHDL programming of the replacement 68060.

Quote from: psxphill;723068

It doesn't evenly distribute instructions between integer pipelines, it only uses the second integer pipeline when the first is running an instruction that can be run at the same time. Whether it can will depend on the instruction as not all can even be run on the second pipeline and the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction then the next instruction also has to be put on the primary pipeline.

Also, in some cases the OEPs are locked together to process an instruction together.

Quote from: psxphill;723068

I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something that you can achieve in real world examples, however as a 32bit value can contain two instructions then it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set". But if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.

Long instructions (lots of extension words) are more of a problem than 1 cycle instructions for fetch starvation. The 68060 doesn't have much of a fetch bottleneck with most 68020 code because the instructions are short (the 020/030 has a serious fetch bottleneck). A 68060 fetch bottleneck can be seen in artificial tests. Gunnar made a mini bench test program (on the Natami forum) that used longword immediates continuously, and it did show a substantial slowdown (1/4-1/3 slowdown as I recall). The 68060 needs longword data to be efficient but can slow down when it has to fetch it very often. Most longword immediates fit in 16 bits, and extending data is low overhead even in fpga (ARM uses shift which is high overhead in fpga). This is how MOVEA.W #,An and ADDA.W #,An work already. The same could be done for data registers also, as we found, which would be even more common. Also, adding MVS and MVZ would have helped.
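
To make the "extend a 16 bit immediate" idea concrete, here is a small Python sketch. It only models the arithmetic: sign extension as MOVEA.W #,An applies it, and a check for whether a longword immediate could be stored as a word without changing its value. Both function names are made up for illustration.

```python
def sign_extend16(word: int) -> int:
    """Extend a 16 bit value to 32 bits, as MOVEA.W #,An does with its immediate."""
    word &= 0xFFFF
    return (word | 0xFFFF0000) if (word & 0x8000) else word


def fits_in_word(longword: int) -> bool:
    """True if a 32 bit immediate survives a round trip through 16 bits."""
    longword &= 0xFFFFFFFF
    return sign_extend16(longword & 0xFFFF) == longword
```

For example, fits_in_word(0xFFFFFFFC) is true (it is -4), while fits_in_word(0x00008000) is false because the word form would sign extend to 0xFFFF8000.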

Quote from: psxphill;723068

The branch executing in zero cycles doesn't seem to be very well documented. I can't tell whether they are exaggerating what it does or not. My original thought was that the branch is in the primary pipeline and the secondary pipeline has the target or next instruction (depending on what is predicted). This doesn't actually cause it to execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0 cycle overhead.

What is odd is that they claim different for predicted correctly taken and predicted correctly not taken

Different timing for predicted correctly taken and predicted correctly not taken is normal with a pipelined processor. Branches predicted backward with the branch target in the branch cache are effectively 0 cycles for loops which is awesome as loop unrolling is mostly not needed improving code density. Branches that fall through eat a cycle in the pOEP but a sOEP instruction can execute simultaneously if available (also awesome). Note that the branch unit is a separate unit that can do processing in parallel and that the branch target must be in the branch cache to get the 0 cycle branch taken. That means there is usually some additional overhead the first time executing code. I believe the 68060 does some kind of instruction folding/fusing of the branch with CMP/TST/SUBQ in order to make the 0 cycle branches happen. Very few modern processors have effectively free branches. Jens and Gunnar (Natami) didn't even have all the magic figured out. Joe Circello and the 68060 team had this all figured out back in the 90s and the Motorola marketing guys killed it for PPC. Pencil pusher power!

Quote from: psxphill;723068

So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does.

I think the branch instruction does go through the pOEP. The branch unit looks at it very early, makes a prediction and starts speculative execution. The pOEP still has to verify that the prediction is correct at execution time or flush the pipe and continue executing the other branch path.

Quote from: Mrs Beanbag;723058

The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.

No EXTB.L or TST.W/L An? No misaligned reads or writes? The misaligned reads and writes are a huge saver when not sure of the alignment. Compilers often can't guess the alignment so they bloat up the code and slow down the CPU to align the data before reading or writing.

The 68020+ has some other niceties but they are more advanced.

Quote from: ChaosLord;723082

And Branches >128 bytes :angel:

I think you mean Bcc.L and BSR.L. Branches up to 16 bit were supported on the 68000. The longword branches are big savers but only on fairly large programs. Not too many assembler programmers create programs >65k.

Quote from: freqmax;723088

I presume 16-bit branching is the same as that if a certain flag is set then one can conditionally jump 65536 memory positions?

It's signed so plus or minus ~32k.

Quote from: freqmax;723088

I have some memory that x86 is limited to 128 position limit on branching? or perhaps it's 6502
How about ARM?

x86 branches are so screwed up with the early segmentation crap that you really have to define which x86 ISA and then don't ask me. The ARM 32 bit ISA is better but still has some limitations as I recall. I believe it only allows 24 bit addressing, too. The 68k is quite old, but it was one of the first to have full 32 bit position independent code done right. An assembly programmer doesn't have to worry about the size with a modern optimizing assembler like vasm. It will automatically generate the most efficient encoding (for more than branching as 68020+ allows) including forward and backward branch optimization. The 68020+ enhancements removed a lot of limitations and can be used or optimized transparently which is great. They should have left out the double memory indirect modes though.
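
As a rough illustration of the branch sizes discussed above, this Python sketch picks the shortest Bcc form for a given branch and target address. It is not how vasm does it; the function name and the has_long parameter are assumptions for the example. The displacement is taken relative to the branch address + 2, and byte displacements of 0 and -1 are skipped because those bit patterns select the word and (on 68020+) long forms.

```python
def smallest_branch_size(branch_addr: int, target_addr: int, has_long: bool = True) -> str:
    """Pick the shortest Bcc encoding that reaches target_addr.

    has_long=True assumes a 68020+ where Bcc.L exists.
    """
    disp = target_addr - (branch_addr + 2)
    # 8 bit signed displacement, minus the two reserved patterns.
    if -128 <= disp <= 127 and disp not in (0, -1):
        return ".S"
    # 16 bit signed displacement: plus or minus ~32k.
    if -32768 <= disp <= 32767:
        return ".W"
    if has_long:
        return ".L"
    raise ValueError("target out of range without Bcc.L")
```

A real assembler has to repeat this over the whole program, since shrinking one branch moves every label after it and can bring other branches into range.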

psxphill · « **Reply #405 on:** January 18, 2013, 06:49:42 PM »

Quote from: matthey;723091

There is also a translation from 16 bit variable length CISC to a fixed length 16 bit RISC in there. I don't think Motorola released the encoding format of their internal fixed length RISC making it difficult to duplicate.

Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.

Maybe the early decode just figures out how long each instruction is and whether the next instruction is valid to go in the secondary pipeline.

Mrs Beanbag · « **Reply #406 on:** January 18, 2013, 06:51:08 PM »

Quote from: matthey;723091

No EXTB.L or TST.W/L An? No misaligned reads or writes?

Ah, you got me. I do use EXTB.L, on occasion. Although I could easily do without.

I can't honestly say if I use TST.L An or not, off the top of my head. Pretty sure I never do TST.W An though, can't think of much use for that.

I'm actually pretty careful not to do misaligned access, it just seems wrong, somehow. Just because you can, doesn't mean you should!

freqmax · « **Reply #407 on:** January 18, 2013, 07:29:46 PM »

@matthey, Nice insight!

Do you think it's feasible to create something that can get near a 50 MHz 68060 in FPGA?

I was thinking of the Intel 80386 ISA in protected mode (kernel and user) regarding branching. As for the 8086 and segments... yuck

Quote from: psxphill;723092

Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.

Perhaps another reverse engineering approach is to figure out from other parts what you need to make your duplicate work.
A functional 68060 duplicate won't have to be designed the same way. It just has to interact with software in the way that the original programmer intended.

Quote from: psxphill;723092

Maybe the early decode just figures out how long each instruction is and whether the next instruction is valid to go in the secondary pipeline.

So there is a kind of selection process such that instructions that don't depend on sequent instructions could be done in parallel while the rest is single pipeline?

Btw, Is there any ISA that is neater and more straightforward than m68k?

matthey · « **Reply #408 on:** January 18, 2013, 07:32:49 PM »

Quote from: psxphill;723092

Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.

No, but logically something has to happen to 32 bit instructions to fit in 16 bits. It's possible (even logical) that the 2nd word of a 32 bit instruction becomes part of the 6 bytes of "data" per OEP. I don't know if that is possible for all 32 bit instructions though.

Quote from: Mrs Beanbag;723093

I can't honestly say if I use TST.L An or not, off the top of my head. Pretty sure I never do TST.W An though, can't think of much use for that.

You never do:

movea.l myptr,a0
tst.l a0
beq .nullptr

or

jsr (-$xxx,a6)
movea.l d0,a0
tst.l a0
beq .nullptr

Of course the latter is better sometimes:

jsr (-$xxx,a6)
tst.l d0
movea.l d0,a0
beq .nullptr

Some pipelined 68k processors could reduce the branch overhead on this one, although be careful that a0 is not an input to an EA calculation right after the branch, or the first option would have been better.

Not many have used TST.W An but be careful. It actually operates on a word, and not on a longword as many OPA.W instructions do. Vasm and PhxAss were doing an optimization to TST.W which was wrong, and it went unnoticed for many years until it was recently found and fixed. Most of the time it would not cause a problem but could lead to very rare random crashes.
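
A small Python sketch of why the word form is dangerous for pointer tests. It only models the N and Z flags for a 32 bit register value; the function name and the sample address 0x00010000 are made up for illustration.

```python
def tst_flags(value: int, size: str) -> dict:
    """Return the N and Z flags TST would set for a 32 bit register value.

    size is "w" (test the low 16 bits only) or "l" (test all 32 bits).
    """
    bits = {"w": 16, "l": 32}[size]
    operand = value & ((1 << bits) - 1)
    return {"N": bool(operand >> (bits - 1)), "Z": operand == 0}
```

With a valid pointer of 0x00010000, tst_flags(0x00010000, "l") gives Z clear, but tst_flags(0x00010000, "w") gives Z set, so a BEQ after the word form would treat it as a null pointer. That only happens when the low word is zero, which matches the "very rare random crashes" description.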

Quote from: Mrs Beanbag;723093

I'm actually pretty careful not to do misaligned access, it just seems wrong, somehow. Just because you can, doesn't mean you should!

Good. Treat it like credit. Don't use it when you don't need it and don't abuse it when you do need it.

Mrs Beanbag · « **Reply #409 on:** January 18, 2013, 08:04:45 PM »

I do such things like:

move.l myptr(PC),D0
beq .nullptr
move.l D0,A0

In the 2nd example you could always use tst.l D0 anyway.

Flags are set for free when moving to a data register. Also note the first line, I always write relocatable code.

In other news, I've been thinking about a RISC instruction set for internal use in a 68k core for some time. I think we can identify a few obvious simplifications:
1. treat An and Dn identically (use extra instructions if different behaviour is required)
2. only MOVE can use memory as either source or destination operand (load/store architecture)
3. all other instructions register-register, or "quick" short-constant source operands
4. spare "temporary" registers for internal use.
we could map 68k instructions to short sequences of internal instructions, and design those instructions to give the shortest sequences.
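
The four points above can be sketched as a tiny lowering function in Python. Everything here is hypothetical: the internal op names, the temporaries t0 and t1, and the rule that only operands written as "(an)" count as memory are assumptions for illustration, not a proposal for the real internal encoding.

```python
from typing import List, Tuple

# Hypothetical internal temporaries (point 4 in the list above).
TEMP_SRC = "t0"
TEMP_DST = "t1"

Op = Tuple[str, str, str]  # (name, source, destination)


def is_memory(operand: str) -> bool:
    """Treat an operand written in parentheses, e.g. "(a0)", as memory."""
    return operand.startswith("(") and operand.endswith(")")


def lower(op: str, src: str, dst: str) -> List[Op]:
    """Map one two-operand 68k-style instruction to internal ops.

    Only "move" is allowed to carry a memory operand (point 2).
    """
    if op == "move" and not (is_memory(src) and is_memory(dst)):
        return [("move", src, dst)]           # already a plain load, store or copy
    seq: List[Op] = []
    if is_memory(src):
        seq.append(("move", src, TEMP_SRC))   # load the source operand
        src = TEMP_SRC
    if op == "move":
        seq.append(("move", src, dst))        # store half of a memory-to-memory move
        return seq
    if is_memory(dst):
        seq.append(("move", dst, TEMP_DST))   # load the destination operand
        seq.append((op, src, TEMP_DST))       # register-register operation
        seq.append(("move", TEMP_DST, dst))   # store the result
    else:
        seq.append((op, src, dst))            # already register-register
    return seq
```

For example, lower("add", "(a0)", "d1") becomes a load into t0 followed by a register-register add, while lower("add", "d0", "(a1)") becomes a three op read-modify-write sequence.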

psxphill · « **Reply #410 on:** January 18, 2013, 08:15:51 PM »

Quote from: freqmax;723099

So there is a kind of selection process such that instructions that don't depend on sequent instructions could be done in parallel while the rest is single pipeline?

Yeah, you can't execute in parallel if the first instruction modifies a register that the second uses: for example

MOVEQ #0,D0
TST.W D0

The secondary pipeline can't execute all instructions either. Floating point instructions can only be dispatched from the primary pipeline for instance.
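
A minimal Python sketch of the two rules just described: no pairing when the second instruction reads a register the first one writes, and no floating point in the secondary pipeline. The Instr class and can_pair function are made up for illustration, and the real 68060 rules (MC68060UM Section 10) have more conditions than this.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Instr:
    name: str
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()
    is_fp: bool = False


def can_pair(first: Instr, second: Instr) -> bool:
    """True if `second` may run in the secondary pipeline alongside `first`.

    Simplified model: only the two rules from the post are checked.
    """
    if second.is_fp:
        return False                  # FP only dispatches from the primary pipeline
    if first.writes & second.reads:
        return False                  # second needs a result the first is still producing
    return True
```

With MOVEQ #0,D0 writing D0 and TST.W D0 reading it, can_pair returns False, so the TST has to wait for the primary pipeline.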

From the description it would seem that it checks in the DS stage whether it can be executed in parallel, which implies there is one fifo for both execution units. I'd have thought that would make it trickier than a fifo for each pipeline, but the documentation is what you'd have to go on for a pure clone.

The manual is largely vague on the FIFO:

"The instruction is pre-decoded for pipeline control information"

"The MC68060 variable-length instruction system is internally decoded into a fixed-length representation and channeled into an instruction buffer."

There are 96 bytes for the FIFO. Someone claims it's 16 entries of 6 bytes each, but the longest instruction is 10 bytes and there is no way you're going to squeeze a MOVE $10000,$20000 instruction into 6 bytes. It's more likely to be 6 entries of 16 bytes or 4 entries of 24 bytes. I can't find anything that suggests that instructions are split into multiple "micro ops", like Intel does.
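
The arithmetic behind that guess can be written out in a few lines of Python. This only enumerates the ways 96 bytes divide into equal entries that are big enough to hold a 10 byte instruction; it says nothing about what the real chip does, and the names are made up for illustration.

```python
FIFO_BYTES = 96

# MOVE $10000,$20000: one opcode word plus two 32 bit absolute addresses.
MOVE_ABS_ABS_BYTES = 2 + 4 + 4


def fifo_layouts(total_bytes: int, min_entry_bytes: int):
    """List (entries, bytes_per_entry) splits with no bytes left over."""
    return [
        (total_bytes // size, size)
        for size in range(min_entry_bytes, total_bytes + 1)
        if total_bytes % size == 0
    ]
```

fifo_layouts(FIFO_BYTES, MOVE_ABS_ABS_BYTES) gives 8x12, 6x16, 4x24, 3x32, 2x48 and 1x96, so the 16x6 layout is the one ruled out by the 10 byte instruction.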

The 68060 cannot execute out of order and doesn't do anything complex like the register renaming that Intel did on the Pentium Pro. It really is the simplest design for dual issue that you can possibly do.

There is no reason why you have to 100% duplicate the functionality exactly. However if there is documentation available then it might make sense to do it the same as they probably spent a while designing it, so it's probably good.

matthey · « **Reply #411 on:** January 18, 2013, 08:28:09 PM »

Quote from: freqmax;723099

Do you think it's feasible to create something that can get near a 50 MHz 68060 in FPGA?

In an affordable fpga? Yes, but I think some different techniques would work better in an fpga than those used on the 68060. It is probably easier to make a non-superscalar (more 68040 like) CPU that is clocked higher at first. It should be possible to achieve 100MHz+ in a sufficiently pipelined 68020+ processor. A Link stack, more code fusing/folding and new instructions could make up for some of the disadvantages of the fpga processor vs the 68060. You will probably not get to 68060@100MHz performance until fpgas get cheaper. A CPU, FPU and MMU in fpga will probably push the logic capacity of affordable fpgas also.

Quote from: freqmax;723099

So there is a kind of selection process such that instructions that don't depend on sequent instructions could be done in parallel while the rest is single pipeline?

The selection process is described in the MC68060UM Section 10 "Instruction Execution Timing".

Quote from: freqmax;723099

Btw, Is there any ISA that is neater and more straightforward than m68k?

Yes. There are simpler ISAs but most are less powerful. Motorola/Freescale have liked simple, clean ISAs favoring RISC since the 68k. The 88k was the RISC replacement for the 68k before being abandoned for PPC. It's a simple and clean classic RISC but a little weak compared to the 68k. The 96k DSP is an interesting RISC/CISC hybrid borrowing much from the 68k that is quite powerful and fairly clean but more difficult to use. The ColdFire was an attempt to simplify the 68k but in doing so made it inconsistent (more difficult to program but still relatively easy) and less powerful, even though some late enhancements brought some of the power back. The MCORE is a 16 bit fixed length (very simple) RISC that was meant to compete with ARM. It competed in power efficiency but it has to be one of the weakest modern 32 bit processors I have ever seen. It looks straightforward to program but very tedious. Note that the PPC is not a Motorola/Freescale design. It is not very simple for a RISC (but fairly consistent), not easy to program and is very powerful.

freqmax · « **Reply #412 on:** January 18, 2013, 08:29:51 PM »

I suspect there is no documentation except the usual datasheet.

Perhaps someone could interview some of the original engineers?

What's a "Link stack" ..?

Have you looked at the Actel FPGAs? They were way faster than any competitor last time I checked. Of course they are slightly more expensive.

As for ISA, my thinking was whether the ISAs of ARM, Transmeta, PDP-11, MIPS, Sparc, DEC Alpha, PA-RISC, etc. are easier to deal with, without sacrificing performance.

psxphill · « **Reply #413 on:** January 18, 2013, 09:00:57 PM »

Quote from: freqmax;723104

What's a "Link stack" ..?

It might be what the later coldfire has, which is an on chip stack which allows the target of an rts instruction to be predicted.

As well as writing to the a7 stack, the jsr stores the program counter in the on chip stack. When the rts instruction is fetched it assumes the next instruction is the value off the on chip stack. If when it executes the value is different then it flushes the pipeline.

At the moment the rts basically blocks the pipeline until it executes, which is why it's such a slow instruction.

It's only a four entry stack though and as I don't think it can support re-fetching lower return values, then it's probably not that great apart from simple subroutines being called from a loop.
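
A rough Python model of the on chip return stack described above. The depth of four comes from the post; the class and method names are invented, and dropping the oldest entry on overflow is an assumption about how a small hardware stack would behave.

```python
from collections import deque


class ReturnStack:
    """Tiny model of an on chip return address predictor."""

    def __init__(self, depth: int = 4):
        # deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=depth)

    def on_jsr(self, return_addr: int) -> None:
        """jsr: remember the return address next to the normal a7 push."""
        self._entries.append(return_addr)

    def on_rts(self, actual_return_addr: int) -> bool:
        """rts: return True if the prediction was right (no pipeline flush)."""
        if not self._entries:
            return False              # nothing to predict with
        predicted = self._entries.pop()
        return predicted == actual_return_addr
```

Calling five levels deep and returning shows the limit: the four innermost returns are predicted, and the outermost one misses because its entry was pushed out.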

An 060 MMU in an FPGA isn't going to take a huge amount of space. An FPU on the other hand probably will.

In the manual it says it transfers 16 bits for the opcode and 2 x 32 bit operands from the fifo, which is 10 bytes and sounds pretty much like the 68000 instruction set converted to fixed length. So it might be 8 instructions of 12 bytes each, with 2 bytes used for "pipeline control", whatever this means in the manual: "The instruction is pre-decoded for pipeline control information"

matthey · « **Reply #414 on:** January 18, 2013, 09:24:22 PM »