Author Topic: Motorola 68060 FPGA replacement module (idea) (Read 188332 times)

matthey · « **Reply #14 from previous page:** January 18, 2013, 10:21:28 PM »

Quote from: freqmax;723104

What's a "Link stack" ..?

psxphill got it although the link stack can be different sizes. It should make RTS 2 cycles instead of 7 cycles on the 68060.

Quote from: freqmax;723104

Have you looked at the Actel FPGAs?, they are way faster than any competitor last time I checked. Of course they are slightly more expensive.

No. I have only heard. I haven't played around with any fpgas although I have looked at some VHDL code for a 68k CPU or 2.

Quote from: freqmax;723104

As for ISA, my thinking were if the ISA of ARM, Transmeta, PDP-11, MIPS, Sparc, DEC Alpha, PA-RISC, etc is easier to deal with. Without sacrificing performance.

I don't think that any RISC processors are going to be as easy as the 68k. ARM probably comes the closest and MIPS is also logical and usable in assembler from my limited exposure. They both look way easier than PPC despite PPC having as many instructions as many CISC processors. PPC is as bad about using acronyms as the U.S. military.

The PDP-11 should have been very easy to program, possibly easier than the 68k. The performance would be limited by the encodings but it would be interesting to see someone try to implement a modern version in fpga. The instructions are powerful but would probably require a lot of microcode above a RISC core. It's too bad that students will probably not be able to see how easy to program a processor can be. Even the 68k is all but dead.

Quote from: Mrs Beanbag;723123

I would rather optimise for 68000 instructions and provide the rest just for compatibility. How common are the bitfield instructions in real code? I never use them.

It varies. Most old code doesn't use them much but GCC started using them heavily from about GCC 3.x on, even when the timing for them was slower. It's often faster not to use them on the 68060 because it can do a shift and an AND in the same cycle. They are often worthwhile on the 68020-68040 and are good for code density and fairly intuitive. They have 32 bit results which is good for 32 bit register forwarding and make efficient use of registers. They are very useful for processing streams of data in memory (with caches) which the register memory architecture of the 68k can do well. The only drawback is a little bit more complexity than the average instruction. If they were fast, they would be used a lot more. Implementing them would help the performance of GCC where trapping them would slow these newer GCC compiled programs to a crawl. You get faster smaller programs with and slower bigger programs without. It's a not so tough choice for me.
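For anyone who has not used the bitfield instructions, here is a minimal sketch of what BFEXTU on a data register computes next to the shift and AND alternative. It is a Python model, not 68k code and not from the original post; the function names are made up for illustration, and it only covers fields that do not wrap around past bit 0.

```python
MASK32 = 0xFFFFFFFF

def bfextu_reg(value, offset, width):
    """Model of BFEXTU Dn{offset:width}.

    On the 68k the bit field offset counts from the most significant
    bit, so offset 0 is bit 31 of the register.
    """
    assert 0 <= offset < 32 and 1 <= width <= 32 and offset + width <= 32
    shift = 32 - offset - width          # distance of the field from bit 0
    return ((value & MASK32) >> shift) & ((1 << width) - 1)

def shift_and_mask(value, lsb, width):
    """The shift and AND way: LSR by the field's lowest bit number, then AND."""
    return ((value & MASK32) >> lsb) & ((1 << width) - 1)

value = 0x12345678
# An 8 bit field starting 8 bits from the top is bits 23..16.
field_bf = bfextu_reg(value, 8, 8)
field_sm = shift_and_mask(value, 16, 8)
```

Both calls pull out the same byte. The bitfield form names the field directly with offset and width, while the shift and AND form needs the shift count and mask worked out by the programmer or compiler.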

matthey · « **Reply #15 on:** January 19, 2013, 12:04:34 AM »

Quote from: psxphill;723136

The only thing I can find is this:

"If the primary OEP instruction is a simple “move long to register” (MOVE.L,Rx) and the destination register Rx is required as either the sOEP.A or sOEP.B input, the MC68060 bypasses the data as required and the test succeeds."

Which says it's only for move.l, although I guess the others could be translated. It doesn't have to retire it early, the second pipeline could look in the primary pipeline. Mips has a similar handling for lwl/lwr opcodes, it pulls the register value from the pipeline and stops the register being updated at all. The register doesn't actually get updated until you stop executing lwl/lwr opcodes.

There are at least 2 different optimizations here. One is the early instruction retirement and register forwarding. The other is more of a MOVE.L+OP.L optimization which is possible because MOVE.L is only half an operation in a register memory architecture that can do both in 1 operation. The Natami processor was planning to use instruction fusing/folding to handle most of these cases. The non-superscalar v4 ColdFire probably does too:

"Last, ColdFire v4 is smart about collapsing commonly used constructs into a single operation. If two instructions will execute in different stages and have no dependencies, they will execute together in a single cycle. This “instruction folding” is ColdFire’s first move toward superscalar dispatch." -ColdFire Doubles Performance With v4 by Jim Turley

The M68060UM is less than clear about these optimizations, even if you understand how these types of optimizations commonly work. Editors usually make this stuff worse than what the engineers started with too. I can say I don't fully understand it, and I have better knowledge than most people and experience with coding the 68060. By looking at code compiled for the 68060, it looks like many compiler programmers didn't understand either. Most 68060 optimized code doesn't do much except replace some trapped instructions, if that.

matthey · « **Reply #16 on:** January 19, 2013, 10:48:52 AM »

Quote from: ChaosLord;723154

Which compilers even have an 060 option?

GCC, vbcc and SAS/C have 060 options. That should cover 90%+ of Amiga compiling.

Quote from: ChaosLord;723154

I can't remember SASC even having such an option. Or maybe it does and I just don't use it...

Check your SCoptions. It's there or you are using an old version of SAS/C.

Quote from: ChaosLord;723156

Every time I ever wanted to use Bitfield instructions I would consult the timing charts and it was always faster to just do things the RISCy way and not use bitfield instructions. So I have never used them. I just use good ol' ANDing and ORing.

If you are looking at the 68060 timings then yes, they are slow compared to the shifting and logic instructions. BFFFO is an exception to keep in mind. It replaces a loop with the last iteration costing 7 cycles on fall through. If there was enough encoding space and a logical place to put them, I would have done a BFPOPCNT/BFCNTO and possibly a BFFind (find a binary sequence in a BF which could also be used as a BFCMP) for 68kF which would also replace loops/branches. These are a little bit more difficult to use by a compiler but good for specialized code like codecs and decompression. BFCHG, BFCLR, BFEXTS, BFEXTU, BFINS, BFSET and BFTST are simpler but very easy for a compiler. They would be a compiler writers dream come true if they were fast.
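To show what "replaces a loop" means here, this is a rough Python model of the BFFFO result on a data register, once as the bit by bit loop it stands in for and once as a direct calculation. It is only a sketch: the function names are made up, it is not from the original post, and it ignores fields that wrap around the register.

```python
def bfffo_loop(value, offset, width):
    """Scan the field from its most significant end, one bit per iteration."""
    for i in range(width):
        bit = 31 - (offset + i)          # offset 0 is bit 31 on the 68k
        if (value >> bit) & 1:
            return offset + i            # bit offset of the first 1 found
    return offset + width                # no bit set: offset plus width

def bfffo_direct(value, offset, width):
    """Same result without the loop, as a single find-first-one step."""
    assert 0 <= offset < 32 and 1 <= width <= 32 and offset + width <= 32
    shift = 32 - offset - width
    field = (value >> shift) & ((1 << width) - 1)
    if field == 0:
        return offset + width
    return offset + width - field.bit_length()

first_one = bfffo_direct(0x00010000, 0, 32)
```

The loop version takes up to `width` iterations with a branch each time, which is the cost the single instruction avoids.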

matthey · « **Reply #17 on:** January 19, 2013, 11:27:27 AM »

Quote from: freqmax;723174

As the 020, 030, 040 options doesn't mention the omission of any instructions. It seems the 060 is the only m68k CPU to have less instructions than it's predecessors.

If you are talking about integer instructions then yes. The 68040 FPU did remove from hardware most 68881/68882 instructions (which compilers were using and became very slow if used). Most of them are not used often and the 68040 FPU is enough faster than the 68881/68882 for common operations that no slow down will be noticed. The 68060 removed a few more seldom used 68040 FPU instructions but added back the commonly used FINT and FINTRZ. This was a smart move to correct a big mistake in removing them from the 68040 FPU.

matthey · « **Reply #18 on:** January 19, 2013, 06:58:36 PM »

Quote from: freqmax;723203

Why were these instructions dropped?

CALLM and RTM were for calling subroutines (probably from the OS) but MacOS and Atari were trapping A-line instructions to supervisor mode OS calls. Traps are slow so the Amiga uses a regular JSR (Jump to subroutine) instruction for OS calls and the OS is mostly in user mode (as libraries) like everything else. Supervisor violations do still trap to the OS on the Amiga. Having all of the OS in supervisor mode provides a little more security though. The 68020 was used for big Unix boxes and similar back then (before they dropped 68k for RISC) and CALLM was probably to cater to that market although I doubt they ever used it because of the additional overhead.

Oxypatcher, Cyberpatcher and Remus do not even bother patching CALLM/RTM because they are unused on the Amiga except for a very few programs that supposedly use them to detect a 68020 and count on this to trap if not a 68020. This is a poor assumption and so rare that it can be patched if necessary.

Quote from: freqmax;723203

And would be more efficient performance wise to implement a 020, 030, or 040 and then horrendously overclock it?

It doesn't really matter as the fpga implementation would be different than the real chip. The 68020+ ISA is practically the same between all of them except for the FPU and MMU. You certainly wouldn't want the limitations of the earlier 68020/68030 unless a cycle exact CPU was needed.

matthey · « **Reply #19 on:** January 20, 2013, 12:01:49 AM »

Quote from: freqmax;723245

Regarding instruction set (ISA) I was thinking in general why they changed it. Because the end result is a slight confusion.

Which ISA change:

1) 68000->68020 (major change)

2) 68020->68030 removal of CALLM/RTM (very minor change)

matthey · « **Reply #20 on:** January 20, 2013, 02:57:06 AM »

Quote from: bloodline;723248

I have been reading matthey's 68kF2 ISA proposal, and it reminded me how complex the 68k instruction encoding is,

Complex? Take a look at a decoder for x86. Yea, the 68k does need more logic in the decoder but the improved code density allows more instructions to be piped into the processor. Most RISC instructions use a consistent 32 bit fixed length encoding which is great for decoding. The 68k needs several separate decoding tables (lacking a better name) for different encoding areas. Some encoding holes are even divided into a separate table of instructions. This part of the 68k could have been a little better but it's not too much of a problem. The 68k does compress a lot of data with sign extended values which works very well and can be improved on. The overall slowdown from the decoder is minimal on the 68k and can be made up for with powerful instructions and addressing modes which it has and can be improved on. ARM with Thumb 2 works well because of the code density plus powerful instructions for RISC. This was a good tradeoff even though they now have a little more complex decoder. MIPS and PPC have also experimented with code compression (MIPS16E and CodePack respectively) but it never caught on or fit as well for them:

http://www.embedded.com/electronics-blogs/significant-bits/4024933/Code-compression-under-the-microscope

matthey · « **Reply #21 on:** January 20, 2013, 05:33:25 PM »

Quote from: bloodline;723285

I rather like Mrs Beanbag's idea of a nice simple RISC core tailored to executing instructions that have been decoded from 68k instructions, it could simplify the decode stage maybe

psxphill has a point. Decoding and re-encoding into a completely different format adds complexity (the 68060 RISC is likely simpler and based on the CISC encoding). The 68k has EAs which are too long to be encoded into 16 bit RISC and some CISC instructions would have to decode into multiple RISC instructions which increases the data to be dealt with. RISC has a lot of simple instructions to handle while CISC has fewer complex instructions to handle. The extra logic in a CISC decoder is more than made up for by the reduced logic for caches and memories. This even applies to the worst case x86 decoders. There is a potential problem of a slowdown with poorly encoded CISC due to lack of parallel decoding. The instruction length needs to be simple to determine for parallel operations. The x86 has historically had a longer pipeline because this is not possible (and to increase processing power at the expense of good branch performance). The 68k is significantly better although the outer displacement of double memory indirect 68020+ addressing modes hurts decoding significantly as well as increasing the max instruction length (we don't even know how the 68060 deals with these very long encodings). There is not much advantage to them if the EA has to be calculated 2x which also makes execution more complex. I suggested trapping the double memory indirect modes with an outer displacement in the 68kF ISA. We need to see if the decoder is still the bottleneck after that. I'm guessing that it won't be in fpga because of the slow execution of 32 bit multiply and shift in the ALU.

The N68040 fpga pipeline looks like this:
1) Instruction Fetch
2) Decoder *
3) Register Fetch
4) EA Calculation *
5) DCache-Read
6) ALU Execution *
7) Write-Back

With 3 potential bottlenecks in fpga:
B1) Decoder (identifying the instruction length, to be able to fetch the next instruction)
B2) EA Calculation
B3) ALU Execution

matthey · « **Reply #22 on:** January 20, 2013, 06:34:12 PM »

Quote from: Mrs Beanbag;723343

The advantages are that each part can be developed, tested and optimised separately, and indeed the RISC core could conceivably be useful on its own (and an assembler could be modified to compile 68k asm to run on it). It would be easier to add new instructions, much in the same way that microcode does, but the "microcode" in this case is more readily understandable, being 68k-like itself.

Hmm. Why not just use JIT in UAE then? The 68k code does make a nice compressed cross platform intermediate language (it's much better than Android or Java byte code). If the time is taken to optimize the RISC JIT emulation, then it will probably perform almost as well as native RISC code that is not optimized. Optimizing compilers for RISC were supposed to give it the advantage but that failed. I remember reading 15 years ago how the PPC processors of the future would have compilers to add all necessary cache hints, align all data, schedule all instructions properly and optimize most branches. Funny thing is, a PPC assembly language programmer can still beat PPC compiled code in most cases. The not so funny thing is, most computer scientists and engineers stay with this same failed philosophy.

Quote from: psxphill;723345

What do you mean? AFAIK the longest instruction is 10 bytes and that is what gets transferred from the FIFO in the decode stage.

move.l ([xxx.L,a0],xxx.L),([xxx.L,a1],xxx.L) ;move.l <ea>,<ea> 2x double memory indirect

1 word for move.l
2x1 word for full extension word
2x2 words for inner longword displacement
2x2 words for outer longword displacement
----
11 words total = 22 bytes

The 68kF ISA would do away with the outer displacement option which reduces the maximum instruction length to 7 words or 14 bytes. Do you see any mistakes?
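The tally above can be redone as a small calculation. This is only a sketch in Python, not from the original post; the function name and parameters are made up for illustration.

```python
def move_max_words(inner_disp_words=2, outer_disp_words=2):
    """Words in a MOVE where both EAs use the full extension word format."""
    opword = 1
    # Each EA: one full extension word plus its inner and outer displacements.
    per_ea = 1 + inner_disp_words + outer_disp_words
    return opword + 2 * per_ea

with_outer = move_max_words()                        # 68020+ as it stands
without_outer = move_max_words(outer_disp_words=0)   # outer displacement dropped

summary = "%d words (%d bytes) vs %d words (%d bytes)" % (
    with_outer, with_outer * 2, without_outer, without_outer * 2)
```

With longword displacements this gives 11 words (22 bytes) with the outer displacement and 7 words (14 bytes) without it, matching the count in the post.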

matthey · « **Reply #23 on:** January 20, 2013, 10:49:59 PM »

Quote from: psxphill;723354

Whats the encoding for the move.l with that addressing mode? I can't see anything that matches that in the 68020 user manual.

move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)

The hexadecimal encoding for above is:

23b0 0173 1234 5678 1234 5678 0173 1234 5678 1234 5678

We can see the flaw of the 68k here. Both of the full extension format words would be better at the start followed by their data like this:

23b0 xxxx xxxx 1234 5678 1234 5678 1234 5678 1234 5678

As is, we may have to examine up to 16 bytes to determine the instruction length. If we remove the outer displacement, the max instruction length would be:

23b0 xxxx 1234 5678 xxxx 1234 5678

We need to examine 10 bytes to find the instruction length now. The maximum instruction length would become 14 bytes which is fewer than the x86 15 byte maximum instruction length.
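As a rough illustration of why the decoder has to walk through the extension words to find the length, here is a Python sketch that derives the length of the MOVE above from its opword and full extension words. It is not from the original post and the names are made up. It only handles mode 110 EAs that use the full extension word format, and 0x0171 (memory indirect with a null outer displacement) is used here as a stand-in for the "xxxx" words, which is an assumption.

```python
# Base displacement size field (bits 5-4): 1 = null, 2 = word, 3 = long; 0 is reserved.
BD_WORDS = {1: 0, 2: 1, 3: 2}
# The low two bits of the I/IS field (bits 2-0) give the outer displacement size.
OD_WORDS = {0: 0, 1: 0, 2: 1, 3: 2}

def full_ext_words(ext):
    """Words taken by one EA: the full extension word plus its displacements."""
    assert ext & 0x0100, "bit 8 set marks the full extension word format"
    return 1 + BD_WORDS[(ext >> 4) & 3] + OD_WORDS[ext & 3]

def move_length_words(words):
    """Length in words of a MOVE whose source and destination are both mode 110."""
    opword = words[0]
    assert (opword >> 3) & 7 == 6 and (opword >> 6) & 7 == 6
    src = full_ext_words(words[1])
    # The destination extension word can only be located after the source EA is sized.
    dst = full_ext_words(words[1 + src])
    return 1 + src + dst

encoded = [0x23B0, 0x0173, 0x1234, 0x5678, 0x1234, 0x5678,
           0x0173, 0x1234, 0x5678, 0x1234, 0x5678]
no_outer = [0x23B0, 0x0171, 0x1234, 0x5678, 0x0171, 0x1234, 0x5678]

lengths = (move_length_words(encoded), move_length_words(no_outer))
```

The second extension word sits behind a variable amount of data, so its position is only known once the first one has been decoded. That dependency is the flaw being pointed out.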

matthey · « **Reply #24 on:** January 21, 2013, 12:20:02 AM »

Quote from: psxphill;723376

Hmm, for me that disassembles as:

001000: 23B0 0173 1234 5678 0173 1234 5678 move.l ([$12345678,A0],$1234), ($78,A1,D5.w*
00100E: 1234 5678 move.b ($78,A4,D5.w*, D1

It's not uncommon to see bugs in assemblers, disassemblers and debuggers using these advanced and seldom used addressing modes. I used vasm typing in:

Code: [Select]

   MC68060

   move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)
   rts

I assembled it from test.asm to test. It disassembled just as I typed it with my modified version of ADis from here:

http://www.heywheel.com/matthey/Amiga/ADis.lha

Disassembling with:

ADis -m6 -a test

The old version of ADis would have had problems. IRA 2.04 fails to disassemble the destination correctly. D68k v2.0.8 is very close but oddly gets the address register in the destination wrong.

BDebug from the Barfly package gets it right. CPR from SAS/C gets it right (although doesn't display the $ for hex numbers on instructions).

I thought you might have been using D68k at first but apparently not. What disassembler did you use?

Edit:
I also found this excellent article about decoding the x86 to RISC:

http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html

If x86 can do it, we can do it better and easier without a longer pipeline hurting branch performance.

matthey · « **Reply #25 on:** January 21, 2013, 03:19:35 PM »

Quote from: psxphill;723410

I used mame (arcade game emulator), typed the hex into memory and then disassembled and executed it. It's not just the disassembler, the emulation consumed the same number of bytes. So that needs looking at, can you post the exe you assembled?

http://www.heywheel.com/matthey/Amiga/test68020

Are you involved with developing or testing mame?

Quote from: psxphill;723410

There is mention in the manual about some instructions being split over two pipelines, it might do that by splitting it into two FIFO entries. With the result of the ea fetch from the primary pipeline getting forwarded to the secondary pipeline so it can get stored.

Right. The OEPs are locked together and each OEP performs 1/2 of the ea for a move <ea>,<ea>. This is the only 68k instruction that allows 2 EAs by the way.

Quote from: psxphill;723410

Have you tried running this encoding on a real 68060?

It's not safe as it writes memory but I have never had a problem with double memory indirect modes before. Some compiler versions of GCC and SAS/C will use them. ThoR's 68060.library uses them because they save saving and reloading a register on the stack for short functions. They do need to be at least trapped in an fpga 68020+ CPU or compatibility will not be good.

Quote from: bloodline;723432

Brilliant!! I kinda figured the branch, move and compare instructions would be the more popular

I hope MOVE is a popular instruction as it takes almost 1/4 (actually 3/16 but who's counting) of the 68k encoding space.
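The 3/16 figure can be checked by counting opwords. A minimal Python sketch, not from the original post: it counts every 16 bit opword on the MOVE lines (top two bits 00, size field non-zero), which includes MOVEA and combinations that are not valid instructions, so it measures encoding space rather than legal MOVEs. The function name is made up.

```python
from fractions import Fraction

def on_move_line(opword):
    """True if the opword falls in the encoding space given to MOVE."""
    top_two_bits = opword >> 14
    size_field = (opword >> 12) & 3      # 01 = byte, 11 = word, 10 = long
    return top_two_bits == 0 and size_field != 0

move_opwords = sum(on_move_line(w) for w in range(0x10000))
share = Fraction(move_opwords, 0x10000)
```

Three of the sixteen top level lines go to MOVE, one per operand size.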

matthey · « **Reply #26 on:** January 21, 2013, 05:33:03 PM »

Quote from: Mrs Beanbag;723434

Also I have been thinking of a way to make the instruction translation do branch predication in the case a conditional branch skips only a few instructions.

Be careful with the predication on the 68k. It might be possible to get it to work as 1 conditional instruction sometimes. It doesn't work well with multiple instructions, multicycle instructions or addressing modes that update the base register like (An)+ and -(An). The data to be predicated ends up having to be examined for suitability. IMO, this would only be worthwhile with very common code. Imagine handling this:

Code: [Select]

   beq skip
   movem.l d0-d7/a0-a6,-(sp)
skip:
   move.l d0,-(sp)

The N68k fpga CPU is supposedly conditional 3 op internally making predication easier. There were enough problems on the 68k that we decided adding SBcc and SELcc were easier. Even this takes some logic but the 68k already has Scc which is handled much the same way.

Quote from: Mrs Beanbag;723435

Actually something just occurred to me. If the most common instruction is "tst", it should be possible to know whether a branch will be taken or not some time in advance. Because "tst" only looks at a single register, the contents of that register must have been determined some time before. So you could look ahead in the instruction queue for a "tst/bcc", and inform the branch predictor well in advance. "tst" instruction then takes effectively NO cycles.

The 68000 (16 bit) code in a console is going to be very different from 68060 optimized code for a dynamic OS today. I very much doubt TST is going to be number 1 any more. I expect MOVE to be #1. MOVE sets the condition codes so a TST should not be needed too often with optimized code. Folding a TST, CMP, or SUB/SUBQ with a branch is something the 68060 does to help achieve 0 cycle branch prediction although I don't know which specifically it does. TST has a higher likelihood of testing a register that has not been modified for a time than MOVE which sets the cc. Many processors do try to determine the branch rather than predict it. The PPC is especially good at this. It also provides several cc's that can be selectively set and branched on later. Most PPC processors have a fairly short pipeline too so branching on a condition set 3 or 4 instructions ago or testing and immediately branching on a register that hasn't changed recently may be enough to determine the branch without prediction. It probably helps, especially if the compilers can generate good code, but it obviously hasn't helped PPC destroy x86 like was predicted 20 years ago.

Quote from: Mrs Beanbag;723441

Not strictly true. Can also do "cmp (Ax)+,(Ay)+"

addx, subx, abcd and sbcd can use predecrement for both operands.

All of these are two cycle instructions.

Yes, they are more complex on the 68060 but no they don't use 2 EAs. They are special cases that do not calculate even 1 EA. The plus of (An)+ is added after the EA is used and is not part of the calculation.

matthey · « **Reply #27 on:** January 21, 2013, 10:10:56 PM »

Quote from: Mrs Beanbag;723472

Yes it would be a prediction, the prediction isn't always necessarily right, but as long as it's right more than 50% of the time it will help.

Not necessarily. The default BTFN (backward taken forward not taken) logic is ~65% correct and doesn't slow down loops with mispredictions. The 68060 2 bit saturating counter prediction is good for ~90% prediction accuracy. The x86 can have branch prediction up to 95% accurate and can even predict patterns, but the logic needed is large and the prediction is a little slower which is bad for tight loops (some have other optimizations for tight loops).
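To compare the two schemes on something concrete, here is a toy Python model of static BTFN against a 2 bit saturating counter. It is a sketch and not from the original post: the names are made up, the two branch patterns are invented test cases, and it is not meant to reproduce the ~65% and ~90% figures, which come from real code.

```python
class TwoBitCounter:
    """2 bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def hits_two_bit(outcomes):
    """Count correct predictions for one branch using a 2 bit counter."""
    counter, hits = TwoBitCounter(), 0
    for taken in outcomes:
        hits += counter.predict() == taken
        counter.update(taken)
    return hits

def hits_btfn(outcomes, is_backward):
    """Static BTFN: predict taken only if the branch jumps backward."""
    return sum(is_backward == taken for taken in outcomes)

# A loop branch (backward): taken 9 times, then falls through, 10 times over.
loop_branch = ([True] * 9 + [False]) * 10
# A forward branch that is taken 3 times out of 4.
forward_branch = [True, True, True, False] * 25

results = {
    "loop": (hits_btfn(loop_branch, True), hits_two_bit(loop_branch)),
    "forward": (hits_btfn(forward_branch, False), hits_two_bit(forward_branch)),
}
```

On the loop the two are nearly even, since BTFN only misses the final fall through. On the mostly taken forward branch the static guess is wrong most of the time while the counter adapts after the first iteration.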
