That's rather longer than I expected.
See page 3-1 of http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf. The first 4 stages fetch and assign instructions to an integer unit, the next 4 stages are the dual integer units, and the last two stages complete the instructions.
It's quite a simple design.
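For reference, the stages break down something like this (a little Python listing just to make the structure concrete; the stage names are from my memory of section 3 of the manual, so treat them as approximate):

# Rough sketch of the 68060 pipeline structure as I read section 3 of the UM.
# Stage names are from memory; treat them as approximate.

INSTRUCTION_FETCH_PIPELINE = [
    "IAG",  # Instruction Address Generation
    "IC",   # Instruction fetch Cycle (cache access)
    "IED",  # Instruction Early Decode
    "IB",   # Instruction Buffer
]

# Two operand execution pipelines (pOEP and sOEP), each with these stages:
OPERAND_EXECUTION_PIPELINE = [
    "DS",   # Decode/Select
    "AG",   # operand Address Generation
    "OC",   # Operand fetch Cycle
    "EX",   # EXecute
]

# Final stages that complete the instruction (writeback side):
COMPLETION_STAGES = ["DA", "ST"]  # Data Available, STore

print("IFP:", " -> ".join(INSTRUCTION_FETCH_PIPELINE))
print("OEP:", " -> ".join(OPERAND_EXECUTION_PIPELINE), "(x2: pOEP and sOEP)")
print("completion:", " -> ".join(COMPLETION_STAGES))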
It doesn't evenly distribute instructions between the integer pipelines; it only uses the second integer pipeline when the first is running an instruction that the next one can be run alongside. Whether it can depends on the instruction (not all instructions can even run on the second pipeline) and on the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction, then the next instruction also has to be put on the primary pipeline.
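To make that pairing rule concrete, here's a toy Python model of how I read it: the secondary pipeline only gets the next instruction if that instruction is allowed to run there and doesn't read a register the primary instruction writes. The instructions and fields are made up for illustration, and the real rules in the manual have more conditions than this.

# Toy model of the 68060 dual-issue rule as I understand it. The second
# (secondary) pipeline is only used when the next instruction is allowed to
# run there AND has no register dependency on the instruction in the primary
# pipeline. The manual's actual pairing rules have more cases than this.

from dataclasses import dataclass, field

@dataclass
class Insn:
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    secondary_ok: bool = True   # some instructions can only use the primary pipeline

def dispatch(program):
    """Greedily pair instructions; returns a list of (primary, secondary-or-None) issues."""
    issues, i = [], 0
    while i < len(program):
        primary = program[i]
        secondary = None
        if i + 1 < len(program):
            nxt = program[i + 1]
            depends = bool(primary.writes & nxt.reads)   # primary changes a register nxt uses
            if nxt.secondary_ok and not depends:
                secondary = nxt
        issues.append((primary, secondary))
        i += 2 if secondary else 1
    return issues

prog = [
    Insn("move.l d0,d1", reads={"d0"}, writes={"d1"}),
    Insn("add.l  d2,d3", reads={"d2", "d3"}, writes={"d3"}),  # independent: pairs
    Insn("move.l d3,d4", reads={"d3"}, writes={"d4"}),
    Insn("add.l  d4,d5", reads={"d4", "d5"}, writes={"d5"}),  # reads d4: can't pair
]
for primary, secondary in dispatch(prog):
    print(primary.name, "||", secondary.name if secondary else "(primary only)")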
I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something you can achieve in real-world examples, although since a 32-bit value can contain two instructions it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set", but if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.
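A quick back-of-the-envelope on why sustaining 2 instructions per cycle looks hard from the fetch side, assuming (my assumption, not a documented figure) that the fetch pipeline delivers one 32-bit longword per clock:

# Can the fetch side keep two 1-cycle pipelines fed?
# Assumption (mine, not from the manual): the IFP delivers one 32-bit
# longword (4 bytes) of instruction stream per clock.

FETCH_BYTES_PER_CLOCK = 4       # assumed
MIN_INSN_BYTES = 2              # shortest 68k instructions are one 16-bit word

print(FETCH_BYTES_PER_CLOCK / MIN_INSN_BYTES)   # 2.0, only enough if every instruction is 2 bytes

# With a more realistic average instruction length the fetch side falls behind:
avg_insn_bytes = 3.0            # made-up average, just for illustration
print(FETCH_BYTES_PER_CLOCK / avg_insn_bytes)   # ~1.33 instructions/clock sustained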
There is a document (http://cdn.preterhuman.net/texts/underground/phreak/68060Info.txt) that explains the pipelines in more detail; I don't know where the PDF is, as the pictures are missing in the ASCII version. It might be this:
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=289639 but I'm not paying for it :-)
The branch executing in zero cycles doesn't seem to be very well documented, and I can't tell whether they are exaggerating what it does or not. My original thought was that the branch goes down the primary pipeline and the secondary pipeline gets the target or the next instruction (depending on which way the branch is predicted). That doesn't actually make it execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0-cycle overhead.
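Here's the sort of cycle counting I mean, on a made-up loop. The timings are illustrative (1-cycle body instructions, pairing ignored), and the "folded branch costs nothing" line is the claim in question rather than something I've measured:

# Cycle counting for a small loop under two readings of the zero-cycle claim.
# Illustrative only: each body instruction is assumed to take 1 cycle and
# pairing is ignored.

BODY_INSNS = 4          # e.g. move/add/cmp style 1-cycle instructions
ITERATIONS = 100

# Reading 1: the correctly predicted taken branch is folded away entirely.
cycles_folded = ITERATIONS * BODY_INSNS

# Reading 2: the branch still occupies a cycle like any other instruction.
cycles_unfolded = ITERATIONS * (BODY_INSNS + 1)

print(cycles_folded, cycles_unfolded)   # 400 vs 500 for this made-up loop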
What is odd is that they claim different things for correctly predicted taken and correctly predicted not-taken branches:
"If the BC indicates that the instruction is a branch and that this branch should be predicted as taken,
the IAG pipeline stage is updated with the target address of the branch
instead of the next sequential address. This approach, along with the
instruction folding techniques that the BC uses, allow the 68060 to achieve a
zero-clock latency penalty for correctly predicted taken branches.
If the BC predicts a branch as not-taken, there is no discontinuity
in the instruction prefetch stream. The IFP continues to fetch instructions
sequentially. Eventually, the not-taken branch instruction executes as a
single-clock instruction in the OEP, so correctly predicted not-taken
branches require a single clock to execute. These predicted as not-taken
branches allow a superscalar instruction dispatch, so in many cases, the next
instruction executes simultaneously in the sOEP."
So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does:
"The 68060 performs the actual condition code checking to evaluate the
branch conditions in the EX stage of the OEP. If a branch has been
mispredicted, the 68060 discards the contents of the IFP and the OEPs, and
the 68060 resumes fetching of the instruction stream at the correct location.
To refill the pipeline in this manner, there is a seven-clock penalty for a
mispredicted branch."
I guess it comes down to how you interpret this from the first quote:
"allow the 68060 to achieve a zero-clock latency penalty for correctly predicted taken branches"