Author Topic: Motorola 68060 FPGA replacement module (idea) (Read 53074 times)

psxphill · « *Last Edit: January 18, 2013, 08:41:23 PM by psxphill* »

Quote from: freqmax;723099

So there is a kind of selection process such that instructions that don't depend on subsequent instructions can be done in parallel, while the rest go through a single pipeline?

Yeah, you can't execute in parallel if the first instruction modifies a register that the second uses. For example:

MOVEQ #0,D0
TST.W D0

The secondary pipeline can't execute all instructions either. Floating point instructions can only be dispatched from the primary pipeline for instance.
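A rough way to picture the pairing rule described here, as a minimal Python sketch. The `Instr` record and `can_dual_issue` function are made-up names for illustration, and the check only covers the two conditions mentioned above (register dependency and floating point); the real 68060 has more rules, and it ignores condition codes entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    # Hypothetical instruction record, not taken from the 68060 manual.
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    is_fpu: bool = False

def can_dual_issue(first, second):
    """True if `second` could go down the secondary pipeline next to `first`."""
    if second.is_fpu:
        return False  # FP instructions only dispatch from the primary pipeline
    if first.writes & second.reads:
        return False  # second needs a register that first is still producing
    return True

moveq = Instr("MOVEQ #0,D0", writes=frozenset({"D0"}))
tst = Instr("TST.W D0", reads=frozenset({"D0"}))
add = Instr("ADD.L D1,D2", reads=frozenset({"D1", "D2"}), writes=frozenset({"D2"}))

print(can_dual_issue(moveq, tst))  # False
print(can_dual_issue(moveq, add))  # True
```

The MOVEQ/TST pair is rejected because TST reads D0 in the same cycle MOVEQ writes it; an unrelated ADD pairs fine.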

From the description it would seem that it checks in the DS stage whether it can be executed in parallel, which implies there is one FIFO for both execution units. I'd have thought that would make it trickier than a FIFO for each pipeline, but the documentation is what you'd have to go on for a pure clone.

The manual is largely vague on the FIFO:

"The instruction is pre-decoded for pipeline control information"

"The MC68060 variable-length instruction system is internally decoded into a fixed-length representation and channeled into an instruction buffer."

There are 96 bytes for the FIFO. Someone claims it's 16 entries of 6 bytes each, but the longest instruction is 10 bytes and there is no way you're going to squeeze a MOVE $10000,$20000 instruction into 6 bytes. It's more likely to be 6 entries of 16 bytes or 4 entries of 24 bytes. I can't find anything that suggests that instructions are split into multiple "micro-ops", like Intel does.
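The arithmetic behind that guess can be checked with a few lines of Python. This is only a sketch: it takes the 96-byte total and the 10-byte longest instruction from the post as given, and lists every even split of the FIFO whose entries are big enough to hold one instruction.

```python
FIFO_BYTES = 96           # total FIFO size quoted in the post
LONGEST_INSTRUCTION = 10  # bytes, the figure assumed in the post

def candidate_layouts(total=FIFO_BYTES, min_entry=LONGEST_INSTRUCTION):
    """Return (entries, bytes_per_entry) pairs that divide the FIFO evenly."""
    layouts = []
    for entries in range(1, total + 1):
        if total % entries == 0:
            size = total // entries
            if size >= min_entry:
                layouts.append((entries, size))
    return layouts

print(candidate_layouts())
# [(1, 96), (2, 48), (3, 32), (4, 24), (6, 16), (8, 12)]
```

16 entries of 6 bytes drops out, while 6 x 16 and 4 x 24 both survive, along with 8 x 12.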

The 68060 cannot execute out of order and doesn't do anything complex like the register renaming that Intel did on the Pentium Pro. It really is the simplest design for dual issue that you can possibly do.

There is no reason why you have to duplicate the functionality exactly. However, if documentation is available, it might make sense to follow it: they probably spent a while on the design, so it's likely a good one.

psxphill · « **Reply #30 on:** January 18, 2013, 09:00:57 PM »

Quote from: freqmax;723104

What's a "Link stack" ..?

It might be what the later ColdFire has, which is an on-chip stack that allows the target of an rts instruction to be predicted.

As well as writing to the a7 stack, the jsr stores the return address in the on-chip stack. When the rts instruction is fetched, it assumes the next instruction is at the address on top of the on-chip stack. If the value turns out to be different when the rts executes, it flushes the pipeline.

At the moment the rts basically blocks the pipeline until it executes, which is why it's such a slow instruction.

It's only a four-entry stack though, and as I don't think it can re-fetch return values that have been pushed out of it, it's probably not that great apart from simple subroutines being called from a loop.
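To make the mechanism concrete, here is a minimal Python sketch of a four-entry return stack. The `LinkStack` class and `rts_needs_flush` helper are invented for illustration, and the "oldest entry is dropped on overflow" behaviour is an assumption, not something taken from ColdFire documentation.

```python
from collections import deque

class LinkStack:
    """Hypothetical on-chip return stack; the oldest entry is lost on overflow."""
    def __init__(self, depth=4):
        self._entries = deque(maxlen=depth)

    def on_jsr(self, return_pc):
        self._entries.append(return_pc)

    def predict_rts(self):
        # None means "no prediction available".
        return self._entries.pop() if self._entries else None

def rts_needs_flush(stack, actual_return_pc):
    """True if the predicted return target was wrong (or missing)."""
    return stack.predict_rts() != actual_return_pc

stack = LinkStack()
calls = [0x100, 0x200, 0x300, 0x400, 0x500]  # five nested jsr return addresses
for pc in calls:
    stack.on_jsr(pc)

flushes = [rts_needs_flush(stack, pc) for pc in reversed(calls)]
print(flushes)  # [False, False, False, False, True]
```

With five nested calls the four innermost returns are predicted, and the outermost one misses because its entry was pushed out, which is the limitation described above.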

An 060 MMU in an FPGA isn't going to take a huge amount of space. An FPU on the other hand probably will.

The manual says it transfers 16 bits for the opcode and 2 x 32-bit operands from the FIFO, which is 10 bytes and sounds pretty much like the 68000 instruction set converted to fixed length. So it might be 8 entries of 12 bytes each, with 2 bytes used for "pipeline control", whatever this means in the manual: "The instruction is pre-decoded for pipeline control information".
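As a sketch of what such a 12-byte entry could look like, here it is packed with Python's struct module. The field order and the meaning of the 2 control bytes are pure guesses for illustration; only the sizes (16-bit opcode, two 32-bit operands, 96-byte FIFO) come from the discussion.

```python
import struct

# Hypothetical entry layout: control, opcode, operand A, operand B.
# ">" means big-endian with no alignment padding.
ENTRY = struct.Struct(">HHII")

def pack_entry(control, opcode, operand_a, operand_b):
    return ENTRY.pack(control, opcode, operand_a, operand_b)

# MOVE.L $10000,$20000 is opcode word $23F9 followed by two absolute long addresses.
entry = pack_entry(0, 0x23F9, 0x00010000, 0x00020000)

print(ENTRY.size)        # 12
print(96 // ENTRY.size)  # 8
print(entry.hex())       # 000023f90001000000020000
```

The 10 bytes of opcode and operands plus 2 control bytes come to 12, and 96 / 12 gives the 8 entries suggested above.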

psxphill · « **Reply #31 on:** January 18, 2013, 11:26:40 PM »

Quote from: matthey;723118

Actually, this may work in parallel. Some very simple instructions are retired early and the longword (only) result made available early. This is not specifically stated but the result is made available early from these types of instructions for change/use stalls and are probably also available early for the other OEP although it's not specifically stated. These early retirement instructions include:

lea
move.l #,Rn
moveq
clr.l Dn

The list is unclear; I'm assuming the first 4 are for the primary and the last 2 are for the secondary.

"Certain instructions have been optimized to ensure no change/use stall occurs on subsequent instructions. The destination register of the following instructions is available for subsequent instructions:
lea
mov.l &imm,Rn
movq
clr.l Dn
any op (An)+
any op -(An)
as a base register for address calculation with no stall, or as an index register for address calculation with no stall, if Xi.l*{1,4}. If the index register used is Xi.l*2, Xi.l*8, or Xi.w, then the previously described 3 cycle stall occurs."

It doesn't have to retire it early; the second pipeline could look in the primary pipeline. MIPS has similar handling for lwl/lwr opcodes: it pulls the register value from the pipeline and stops the register being updated at all. The register doesn't actually get updated until you stop executing lwl/lwr opcodes.
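The stall rule in the manual excerpt quoted above can be written down as a small Python function. This is only a sketch of that one paragraph: the function name and arguments are made up, and it only covers the case where the register was produced by one of the listed optimized instructions.

```python
def stall_after_optimized_producer(use, index_size="l", scale=1):
    """Stall cycles when the next instruction uses the freshly written register
    in an address calculation, per the quoted paragraph.

    use:        "base" or "index"
    index_size: "l" or "w" (only relevant for index use)
    scale:      1, 2, 4 or 8 (only relevant for index use)
    """
    if use == "base":
        return 0
    if use == "index":
        if index_size == "l" and scale in (1, 4):
            return 0
        return 3  # Xi.l*2, Xi.l*8 or Xi.w
    raise ValueError("use must be 'base' or 'index'")

print(stall_after_optimized_producer("base"))             # 0
print(stall_after_optimized_producer("index", "l", 4))    # 0
print(stall_after_optimized_producer("index", "l", 8))    # 3
print(stall_after_optimized_producer("index", "w", 1))    # 3
```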

This one is also vague:

"The MC68060 provides another change/use optimization for a commonly encountered
construct—when an address register is loaded from memory and then used in an operand

address calculation, the OEP experiences a one cycle stall.

mov.l,An

"

I guess they both enter the pipelines at the same time, the primary goes through ea fetch and then on the next clock the secondary goes through ea fetch. I'm assuming that the ea on the second register is literally ea and not adjusted by an immediate or register. It can't advance the pipelines, it must change the state it's in. The primary pipeline might be translated into a move.l immediate once the value is available, to make the short circuiting common.

Quote from: freqmax;723130

If one start with the instructions and then try to impose the correct architecture.. well it could be messy

With any processor emulation, it's always worth starting small and bringing it up an instruction at a time until you know you're on the right track.

psxphill · « **Reply #32 on:** January 19, 2013, 05:32:29 AM »

Quote from: matthey;723140

There are at least 2 different optimizations here. One is the early instruction retirement and register forwarding. The other is more of a MOVE.L+OP.L optimization which is possible because MOVE.L is only half an operation in a register memory architecture that can do both in 1 operation.

Passing register results between pipelines is a pretty standard concept. What I don't get is, if it's going to introduce a one cycle delay when running these two at the same time:

mov.l,An

Then why wouldn't you just run both operations sequentially on the primary pipeline?

psxphill · « **Reply #33 on:** January 19, 2013, 10:49:36 AM »

Quote from: freqmax;723170

gcc version 3.3.3 has these options:
-m68000 -m68020 -m68020-40 -m68030 -m68040 -m68881 -mbitfield -mc68000 -mc68020 -mfpa -mnobitfield -mrtd -mshort -msoft-float

But perhaps SAS C has something more specific?

I'm pretty sure the latest SAS/C does, but it's not good. I found gcc 2.95 with -mc68000 was faster on the projects I tried.

Quote from: matthey;723172

BFCHG, BFCLR, BFEXTS, BFEXTU, BFINS, BFSET and BFTST are simpler but very easy for a compiler. They would be a compiler writer's dream come true if they were fast.

Probably, but with so little code using them they wouldn't be my first choice for optimising.

psxphill · « **Reply #34 on:** January 19, 2013, 11:21:26 AM »

Quote from: freqmax;723174

As the 020, 030, 040 options don't mention the omission of any instructions, it seems the 060 is the only m68k CPU to have fewer instructions than its predecessors.

The 040 has fewer FPU instructions; the 060 is the first to drop integer instructions.

"-m68040 Generate output for a 68040. This is the default when the compiler is configured for 68040-based systems. This option inhibits the use of 68881/68882 instructions that have to be emulated by software on the 68040. Use this option if your 68040 does not have code to emulate those instructions."

I don't know whether it's less:

"-m68020-40
Generate output for a 68040, without using any of the new instructions. This results in code which can run relatively efficiently on either a 68020/68881 or a 68030 or a 68040. The generated code does use the 68881 instructions that are emulated on the 68040.
-m68020-60
Generate output for a 68060, without using any of the new instructions. This results in code which can run relatively efficiently on either a 68020/68881 or a 68030 or a 68040. The generated code does use the 68881 instructions that are emulated on the 68060."

psxphill · « **Reply #35 on:** January 19, 2013, 06:32:04 PM »

Quote from: freqmax;723203

Why were these instructions dropped?

And would be more efficient performance wise to implement a 020, 030, or 040 and then horrendously overclock it?

Nobody used the instructions and they required support from the 68851 MMU. It made no sense to bloat the 68030 and its MMU with them.

It depends on how you measure efficiency, but you'll hit the upper clock speed quickly.

psxphill · « **Reply #36 on:** January 20, 2013, 12:07:49 PM »

Quote from: bloodline;723285

I rather like Mrs Beanbag's idea of a nice simple RISC core tailored to executing instructions that have been decoded from 68k instructions, it could simplify the decode stage maybe

How could it simplify the decode stage? All it does is split one decode stage into two.

psxphill · « **Reply #37 on:** January 20, 2013, 06:13:39 PM »

Quote from: matthey;723339

(we don't even know how the 68060 deals with these very long encodings).

What do you mean? AFAIK the longest instruction is 10 bytes and that is what gets transferred from the FIFO in the decode stage.

Quote from: Mrs Beanbag;723343

Right, simplifying the decoding stage wasn't the idea so much. But if you can split a problem into two parts, it is usually easier to solve. I'm trying to make the developer's job easier really.

I think you'd either have to put up with it being slower, or making the rest of it much more complex to compensate. It's a juggling act.

psxphill · « **Reply #38 on:** January 20, 2013, 06:57:40 PM »

Quote from: matthey;723348

Do you see any mistakes?

What's the encoding for the move.l with that addressing mode? I can't see anything that matches that in the 68020 user manual.

Quote from: Mrs Beanbag;723353

I guess that's not too far from my idea, now I think about it, but with a CPU core specifically designed to emulate 68k. Sort of a hardware emulator, I guess.

If a 1 cycle 68060 instruction translates to 2 of your instructions, then you'd have to clock at double the speed to achieve the same throughput. So each of those would have to be a 1:1 mapping or you've already failed.
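The arithmetic is simple enough to put in a few lines of Python. This is a sketch under the assumption that both cores retire one internal operation per clock and that nothing else (dual issue, stalls) changes; the function name and the 50 MHz figure are just for illustration.

```python
def required_clock_mhz(target_mhz, internal_ops_per_68k_instruction):
    """Clock needed for a translating core to match a core that retires
    one 68k instruction per cycle at `target_mhz`."""
    return target_mhz * internal_ops_per_68k_instruction

print(required_clock_mhz(50, 1))  # 50
print(required_clock_mhz(50, 2))  # 100
```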

psxphill · « **Reply #39 on:** January 20, 2013, 11:32:34 PM »

Quote from: matthey;723370

move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)

The hexadecimal encoding for above is:

23b0 0173 1234 5678 1234 5678 0173 1234 5678 1234 5678

Hmm, for me that disassembles as:

001000: 23B0 0173 1234 5678 0173 1234 5678 move.l ([$12345678,A0],$1234), ($78,A1,D5.w*8)

00100E: 1234 5678 move.b ($78,A4,D5.w*8), D1
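For anyone wanting to check the byte counts by hand, here is a minimal Python sketch that sizes the 68020-style indexed extension words in the encoding quoted above. It assumes both operands use mode 110 (which is what opcode $23B0 encodes), it does not validate reserved field values, and the function names are made up.

```python
BD_BYTES = {1: 0, 2: 2, 3: 4}        # base displacement size field (0 is reserved)
OD_BYTES = {0: 0, 1: 0, 2: 2, 3: 4}  # low two bits of the I/IS field

def extension_bytes(ext):
    """Bytes used by one operand's extension data, including the extension word."""
    if not ext & 0x0100:
        return 2  # brief format: the 8-bit displacement lives inside the word
    return 2 + BD_BYTES[(ext >> 4) & 3] + OD_BYTES[ext & 3]

def brief_operand(ext, base):
    """Render a brief-format extension word as (d8,An,Xn.size*scale)."""
    kind = "A" if ext & 0x8000 else "D"
    reg = (ext >> 12) & 7
    size = "l" if ext & 0x0800 else "w"
    scale = 1 << ((ext >> 9) & 3)
    disp = ext & 0xFF
    if disp & 0x80:
        disp -= 0x100
    sign = "-" if disp < 0 else ""
    return f"({sign}${abs(disp):X},{base},{kind}{reg}.{size}*{scale})"

words = [0x23B0, 0x0173, 0x1234, 0x5678, 0x1234, 0x5678,
         0x0173, 0x1234, 0x5678, 0x1234, 0x5678]
src_len = extension_bytes(words[1])
dst_len = extension_bytes(words[1 + src_len // 2])
print(2 + src_len + dst_len)          # 22
print(brief_operand(0x5678, "A1"))    # ($78,A1,D5.w*8)
```

Each $0173 word is a full-format extension word with a long base displacement and a long outer displacement, so each operand takes 10 bytes and the whole instruction is 22 bytes, matching the 11 words in the quoted encoding. Read as a brief extension word, $5678 gives D5.w*8 with displacement $78, which is what the second half of the disassembly shows.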

psxphill · « **Reply #40 on:** January 21, 2013, 10:32:49 AM »

Quote from: matthey;723378

It's not uncommon to see bugs in assemblers, disassemblers and debuggers using these advanced and seldom used addressing modes. I used vasm typing in:

Code: [Select]
 MC68060
 move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)
 rts
I assembled it from test.asm to test. It disassembled just as I typed it with my modified version of ADis from here:

http://www.heywheel.com/matthey/Amiga/ADis.lha

Disassembling with:

ADis -m6 -a test

The old version of ADis would have had problems. IRA 2.04 fails to disassemble the destination correctly. D68k v2.0.8 is very close but oddly gets the address register in the destination wrong.

BDebug from the Barfly package gets it right. CPR from SAS/C gets it right (although doesn't display the $ for hex numbers on instructions).

I thought you might have been using D68k at first but apparently not. What disassembler did you use?

I used MAME (the arcade game emulator): I typed the hex into memory, then disassembled and executed it. It's not just the disassembler; the emulation consumed the same number of bytes. So that needs looking at. Can you post the exe you assembled?

There is mention in the manual of some instructions being split over two pipelines; it might do that by splitting them into two FIFO entries, with the result of the ea fetch from the primary pipeline getting forwarded to the secondary pipeline so it can be stored.

Have you tried running this encoding on a real 68060?

psxphill · « **Reply #41 on:** January 21, 2013, 03:59:26 PM »

Quote from: matthey;723433

http://www.heywheel.com/matthey/Amiga/test68020

Are you involved with developing or testing mame?

Developing mainly, although I've not had much to do with the 680x0 side.

Quote from: matthey;723433

Right. The OEPs are locked together and each OEP performs 1/2 of the ea for a move ,.

The OEPs are always locked together. The manual hints at how a memory-to-memory move works:

"pOEP-until-last: Many of the non-standard instructions represent a combination of multiple “standard” operations. As an example, consider the memory-to-memory MOVE instruction. This instruction is decomposed into two standard operations: first, a standard read cycle followed by a standard write cycle. This class allows a standard single-cycle instruction to be dispatched from the sOEP during the last cycle of its pOEP execution."

It seems to say that two entries are written to the FIFO and the second entry in the FIFO sits waiting until the primary is about to finish before despatching to the secondary. Although I'd have thought it would despatch earlier so it could calculate the EA.

Quote from: Mrs Beanbag;723435

Actually something just occurred to me. If the most common instruction is "tst", it should be possible to know whether a branch will be taken or not some time in advance. Because "tst" only looks at a single register, the contents of that register must have been determined some time before. So you could look ahead in the instruction queue for a "tst/bcc", and inform the branch predictor well in advance. "tst" instruction then takes effectively NO cycles.

Apart from the cycles it takes to look ahead in the instruction stream every time you hit a tst instruction, and it will get complex to even follow the code as you would have to follow branches as well. Basically to avoid the cycles when a branch happens, you'll end up going through the same overhead as running the code after every tst instruction (tst isn't the only instruction that affects branches).

It also won't help a branch directly after a branch because it will already have started progressing through the pipeline.

psxphill · « **Reply #42 on:** January 21, 2013, 07:38:03 PM »

Quote from: Mrs Beanbag;723441

Instructions are read into a buffer ahead of time, so can detect a tst/bcc when it is first read in. I wouldn't bother following branches, to be able to predict only the next branch would still help. Yes it would only work if the branch follows a tst, but if the profiles from the Megadrive are anything to go by, that is the most common case. Basic RISC principle, "make the common case fast"!

The basic RISC principle is to keep instructions simple so that you can use the spare space for large register sets and caches.

It wouldn't help at all when the branch follows the test, because you're going to have to flush all the following instructions from the pipeline. If you're going to remove the pipeline completely or a significant number of stages then you'll have a huge number of instructions taking multiple cycles and the overhead of incorrectly predicted branches is going to be so insignificant that it won't be worth doing.

psxphill · « **Reply #43 on:** January 21, 2013, 11:21:14 PM »

Quote from: Mrs Beanbag;723463

and as soon as a test followed by a branch is read in, it can do the test immediately (which is a very simple operation) and predict the branch based on that. So as long as the register doesn't change by the time the branch instruction comes out of the other end of the FIFO the branch will have been predicted correctly.

What you're suggesting will break I/O, which is the major use of TST. You can only perform the read once & you can't do the read until all the registers are correct, or you could be reading from anywhere. You can't change the order of memory accesses, you'll have to wait until any instructions that access memory have been run.
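The read-once problem is easy to show with a toy model. The Python sketch below uses a hypothetical read-to-clear status register (not any particular chip) and compares a normal TST with one where the hardware also performed the read early, when the instruction entered the FIFO.

```python
class ReadToClearStatus:
    """Hypothetical device register: a read returns the pending flag and clears it."""
    def __init__(self):
        self.pending = False
        self.reads = 0

    def raise_event(self):
        self.pending = True

    def read(self):
        self.reads += 1
        value = 1 if self.pending else 0
        self.pending = False
        return value

def run(speculate):
    reg = ReadToClearStatus()
    reg.raise_event()
    if speculate:
        reg.read()              # early TST performed when the instruction is fetched
    architectural = reg.read()  # the TST the program actually executes
    return architectural, reg.reads

print(run(speculate=False))  # (1, 1)
print(run(speculate=True))   # (0, 2)
```

With the early read the program's own TST sees 0 and the event is lost, and the device has been accessed twice instead of once.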

You also can't run the EA Fetch for an instruction after a branch in the pipeline, until you've resolved whether it's going to branch or not. I am assuming the 68060 pipeline length enforces that, I haven't checked it out too carefully.

psxphill · « **Reply #44 on:** February 11, 2013, 05:15:25 PM »

Quote from: bloodline;724860

I've actually spent a few days trying to design my own 68K emulator, to work out how to modify the MIPS ISA to make it super efficient at 68k emulation...

Starting from a general-purpose CPU, you would end up with an inefficient microcode-based 68k.

It makes sense to dump the original 68000 & 68010 microcode to emulate those processors. Not all of it's done yet though; some work is hopefully happening soon.

I don't know how much of the 68020 and later were microcoded.
