
Author Topic: Motorola 68060 FPGA replacement module (idea)  (Read 188656 times)


Offline matthey

  • Hero Member
  • *****
  • Join Date: Aug 2007
  • Posts: 1294
    • Show all replies
Re: Motorola 68060 FPGA replacement module (idea)
« on: January 06, 2013, 06:49:18 PM »
Quote from: ChaosLord;721489
... designed by a 200th TechMage.

Quote from: freqmax;721492
What do you mean with that? ;)

I think he means 200th level TechMage. You must not have played D&D as a child.

@TCL
I thought the N050 only implemented write-through caches, which are much easier to implement than copyback. They have excellent compatibility, and slightly faster modern memory would make up for some of the speed deficit. I agree that at least write-through caches for both instructions and data are needed. Anyone saying otherwise should turn off their accelerator caches and experience 68000 performance all over again ;).
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #1 on: January 08, 2013, 08:40:38 AM »
Quote from: Mrs Beanbag;721644
It might be a good starting point. [ColdFire] Differences are:

1. No DBcc
2. No bitwise rotation (rol, ror)
3. No bitfield operations
4. Multiply instructions don't set flags. From the Coldfire manual:
CCR[V] is always cleared by MULS/U, unlike the 68K family processors

1-3 on your list are not 68k conflicts but trapping would make them very slow. Here is a list of 68k and ColdFire conflicts that are not fixable by trapping (i.e. ColdFire.library):

1. ColdFire stack is 4 byte aligned (68k 2 byte). MOVE.B/W (SP)+ and -(SP) fail.
2. REMS/REMU encoding is incompatible with DIVSL/DIVUL encoding.
3. ColdFire multiply instructions don't set flags like the 68k.

In addition, practically anything in Supervisor mode will not work.

Quote from: Mrs Beanbag;721644
Coldfire also has a few extra commands (some of which would be quite useful, such as saturate and multiply-accumulate)

MVS, MVZ and BYTEREV should have been in the 68060. SATS is good for DSP/Codec type processing but where is SATU and ABS? The CF MAC processor is powerful but is a bolt on that doesn't fit with the 68k/ColdFire IMO. It's a poor man's SIMD as the CF is low end and cheap, cheap, cheap. Freescale will sell you PPC or now ARM (which they sadly license) if you need some real processing power.

Quote from: JimDrew;721748
Well, there are quite a few Amiga programs - including several of my own that all follow 100% legal programming practices (according to common sense and the RKMs) that will not run on an 060 with superscalar and/or branch caching enabled.  I don't recall all of the reasons behind the issues.  I should go look at the mmu.library replacement that we made for EMPLANT and FUSION... I know I commented some things there.  I know that self modifying code is definitely one of the things that causes a problem when one of the cached instructions in the pipeline has been modified (like a branch table).  Yes, I consider self-modifying code 100% legal.  :)  You are supposed to flush the caches (or turn them off) with self modifying code, but when you do that you are then running at sub-030 speeds.

Self modifying code needs to flush the caches (including the branch cache), which negates the advantage of the caches and any speed gains of self modifying code. If you don't like caches, stick to the 68000 until you change your mind :/. Some early 68060.libraries may not have flushed all the caches properly, may not have worked around the superscalar bugs in the 68060 properly, or may have had bugs in the CPU support code used for trapping. The best ones matured and work fine. Fusion works fine on the 68060 here except for an occasional random crash. The last ShapeShifter was more stable though. That was using the last version of Fusion, which I bought in your Fusion/PCx CD bundle. Fusion had some nice features over ShapeShifter, like the file transfer and auto screen mode changes from within the Mac, but stability is more important. I would still use Fusion if it was more stable and supported more hard drive options, which ShapeShifter is better at.

The Natami fpga CPU was going to use writethrough caching with snooping and auto flushing of detected dirty cache lines. This is a good option that allows very large caches with excellent compatibility. It would be possible to auto flush a branch cache in the address range of the dirty lines that are detected by snooping also. With the faster memory and larger caches of today, this should give cache performance close to that of the 68060 with better tolerance for self modifying code.
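To make the snooping idea concrete, here is a toy Python model (not HDL, and not the Natami design). The line size, class name and method names are all assumptions made up for illustration. It only shows the principle: every store goes through to memory, and the snoop drops any instruction cache line that was written, so patched code is refetched without an explicit flush.

```python
# Toy model: write-through stores plus an instruction cache that snoops
# writes and drops any line that was written to. LINE_SIZE and all names
# are assumptions for illustration only.
LINE_SIZE = 16  # bytes per cache line (assumed)


class SnoopingICache:
    def __init__(self, memory):
        self.memory = memory   # backing store, a bytearray
        self.lines = {}        # line base address -> cached bytes

    def _base(self, addr):
        return addr - (addr % LINE_SIZE)

    def fetch(self, addr):
        """Instruction fetch: fill the line on a miss, then read from it."""
        base = self._base(addr)
        if base not in self.lines:
            self.lines[base] = bytes(self.memory[base:base + LINE_SIZE])
        return self.lines[base][addr - base]

    def write(self, addr, value):
        """Write-through store: memory is always updated, and the snoop
        invalidates a matching instruction cache line automatically."""
        self.memory[addr] = value
        self.lines.pop(self._base(addr), None)


mem = bytearray(64)
mem[20] = 0x4E
cache = SnoopingICache(mem)
first = cache.fetch(20)    # the line is now cached
cache.write(20, 0x60)      # self modifying code patches the instruction
second = cache.fetch(20)   # refetched from memory, no explicit flush needed
```

A branch cache could be handled the same way in this model by dropping its entries that fall inside the invalidated line's address range.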

Quote from: JimDrew;721748
The 060 really only adds dual instruction pipelining and a 4-way cache.  A higher speed (100MHz+) 040 core would probably be better in the long run, especially if it handled floating point without completely stalling the core like the 060 does.

The MC68060UM says:

"The MC68060 allows simultaneous execution of two integer instructions (or an integer and a float instruction) and one branch instruction during each clock."

"The MC68060's FPU operates in parallel with the integer unit. The FPU performs numeric calculations while the integer unit continues integer processing."

The 68060 FPU was a nice improvement over the 68040 FPU. It dropped a few 040 FPU instructions that were very rarely used and added back the FINT and FINTRZ instructions which compilers use commonly. The execution speeds were also improved across the board and more parallel operation is possible. The 040 can do some limited parallel operation also.

The 68060 is a great processor which does a lot of parallel work but it's not easy to make and it's probably not as easy to make in an fpga. A faster clocked more 68040 like CPU makes sense in the fpga. Bigger caches, a branch cache and more parallel operation are needed for maximizing performance though.
« Last Edit: January 08, 2013, 09:26:21 AM by matthey »
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #2 on: January 13, 2013, 03:22:37 PM »
Quote from: psxphill;722293
Ok. It's bytes that affect a7 by 2, but everything else by 1.

No. The stack is "affected" by the number of bytes pushed or popped from or to it as follows.

68k:
byte = 2 bytes (with 1 byte of padding to maintain word alignment)
word = 2 bytes
longword = 4 bytes

ColdFire:
byte = 4 bytes (with 3 bytes of padding to maintain longword alignment)
word = 4 bytes (with 2 bytes of padding to maintain longword alignment)
longword = 4 bytes

This is one of the main incompatibilities between 68k and CF as I mentioned earlier.
Note that movem.w (sp)+, sign extends as it restores registers and is missing on the CF.
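The table above boils down to rounding the operand size up to the stack alignment. A small Python sketch of that rule, taking the ColdFire numbers exactly as listed in this post (the function names are made up for illustration):

```python
def sp_adjust(size_bytes, cpu):
    """Bytes A7 moves for a push or pop of size_bytes, per the table above."""
    align = {"68k": 2, "coldfire": 4}[cpu]
    # Round the operand size up to the next multiple of the alignment.
    return (size_bytes + align - 1) // align * align


def sign_extend_word(value):
    """32 bit register contents after MOVEM.W (SP)+ restores a 16 bit value."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value
```

So code that pushes a byte or word and later adds 2 to the stack pointer to clean up is balanced under the 68k rule but leaves the stack off by 2 under the ColdFire rule.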

Quote from: Mrs Beanbag;722302
How about this for a crazy idea, an accelerator with an Arm CPU and an FPGA, the FPGA can function as a 68k CPU if set up as such, so it could run like the PPC accelerator boards. BUT you install AROS for ARM ROM chips and use the Arm as the main CPU, and allow the FPGA to be reconfigured by the Arm chip, so then you could develop your 68k core "live", and install updates through software.

That hardware sounds pretty close to what the fpga Arcade is already.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #3 on: January 14, 2013, 01:56:17 PM »
Quote from: freqmax;722430

But I still find one FPGA-done the least amount of fuss solution. And the m68k op codes to be way nicer to deal with in contrast to x86 ones.


+1
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #4 on: January 15, 2013, 01:30:50 PM »
Quote from: JimDrew;722601
If you really plan on using strictly a FPGA to emulate the CPU, I would suggest someone modifying WinUAE to make a histogram of instruction usage.  This would let someone focus on optimizing the 680x0 core by looking at instruction usage which could help determine things like what changes in the cache, pipelines, etc. will benefit the speed.

Some instructions are used less often but reduce branching (my favorite), are very powerful and/or save code (some operations are easy in hardware but difficult to do in software). Some DSP or SIMD like instructions are used in processor intensive codecs and drivers where they offer huge speedups but are used less in normal code. A simple instruction count only gives a partial picture of what instructions are best.
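A quick Python sketch of why a plain count can mislead. The trace and the cycle numbers below are made-up placeholders, not measurements. A real trace would have to come from an emulator hook such as the WinUAE modification suggested in the quote.

```python
from collections import Counter

# Made-up trace of (mnemonic, cycles) pairs. Placeholder values only.
trace = [("move", 1), ("add", 1), ("move", 1), ("bfextu", 6),
         ("move", 1), ("divs", 20), ("add", 1), ("bne", 1)]


def usage_histograms(trace):
    """Return (how often each mnemonic ran, total cycles spent in each)."""
    counts, cycles = Counter(), Counter()
    for mnemonic, cost in trace:
        counts[mnemonic] += 1
        cycles[mnemonic] += cost
    return counts, cycles


counts, cycles = usage_histograms(trace)
by_count = counts.most_common(1)[0][0]    # most frequently executed
by_cycles = cycles.most_common(1)[0][0]   # where the time actually goes
```

In this toy trace the most frequent instruction and the most expensive one are different, and neither histogram shows how many branches an instruction removed.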
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #5 on: January 15, 2013, 02:32:25 PM »
Quote from: ChaosLord;722637
Yes!  I love those!  I (and Phil) were always pushing these at the Natami CPU Dezine Dept. but Gunnar did not like them or didn't understand so he was totally against adding a new instruction for this purpose. :(


Gunnar thought he could solve all short branches with predication, but it has issues when instructions with variable length and variable cycle counts are used. I originally thought instructions like ABS would not be useful for this reason, but I came to realize predication was not a good idea even before Gunnar did. It doesn't work well with the 68k addressing modes that update an address register, like (An)+ and -(An), either. The original Scc instruction takes the correct approach. It didn't take too much to convince Gunnar that these kinds of instructions were better than predication or conditional moves like x86 CMOV. That's why we added SBcc, SELcc, ABS, POPCNT, etc. to the ISA. They fit the ISA and have minimal pipeline overhead (hazards) while reducing short branches. Long branches still need to jump. If we could remove 5-15% of branches (the short ones) and the overhead in the branch cache and history, the 68k would be one of the best processors at branching. Add to that a relatively short pipeline (and so a small mis-predicted branch penalty) and 0 cycle loops and we would have much improved performance, a beautiful CPU to program and even better code density.
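For anyone wondering what "removing a short branch" looks like, here is the arithmetic modeled in Python on 32 bit values. The function names are made up. Only Scc is an existing 68k instruction here, and the select and abs helpers are just the generic mask tricks, not the encodings proposed for the new ISA.

```python
MASK32 = 0xFFFFFFFF


def scc_mask(condition):
    """Scc style result widened to 32 bits: all ones if true, else zero."""
    return -int(bool(condition)) & MASK32


def select(mask, a, b):
    """Pick a where mask is all ones, b where it is zero, without a branch."""
    return (a & mask) | (b & ~mask & MASK32)


def branchless_abs(x):
    """Absolute value of a 32 bit two's complement number, no branch."""
    x &= MASK32
    sign = -(x >> 31) & MASK32       # all ones if negative, else zero
    return ((x ^ sign) - sign) & MASK32
```

As on real hardware, the most negative value ($80000000) has no positive counterpart and comes back unchanged.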
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #6 on: January 16, 2013, 01:04:14 AM »
Quote from: psxphill;722642
Trying to improve the ISA is a time sink which you'll never get payback from. The only benefit is a bit of ego boosting, but that subsides when reality hits.


It's not a time sink if people are working together in parallel, which is the way it was supposed to be when I started documenting the new 68k ISA. It's not a time sink if the new ISA attracts interest from outside of the retro crowd. It's not a time sink if the ISA is implemented and found to be a substantial improvement in power, code density, compiler support and ease of programming. You give up very little with the possibility to gain much more. There is a market for retro computing but a bigger market for a processor that can handle today's processing needs quickly with compact code as well as being compatible with old code. That's what ARM and x86 did. They evolved and now they are successful. Building a 68020 compatible CPU comes first, but even then it's smart to plan ahead to make future enhancements easier.

Quote from: psxphill;722642

Making the pipeline follow the predicted branch might be hard, but it's doable. Thumb drops a lot of conditional instructions from Arm.


Yes, but they were using predication (unusual for a CPU), which only offers a small advantage on some specific hardware. The smaller the block of predicated instructions and the simpler the instructions, the better. Most original ARM ISA instructions could be conditional, which worked ok but was dropped with Thumb because it was not good for the code density they were going after. The ARM block predication instruction was for multiple instruction predication, but ARM went to OoO processors where it didn't work as well. The conditional instructions proposed in the 68kF ISA should work nicely while being a small simplification over a more generic CMOV like on x86. They would work well on a superscalar CPU with a short pipeline and a cheap branch predictor (or no branch predictor), which the 68k is likely to have. Some optimized code would still not want to use them at times, for example where a highly predictable branch is executed often, or in very tight loops where a highly predictable branch could be used instead. Note that some instructions like ABS (absolute value) have no drawbacks yet remove a branch that can be difficult to predict, and SELcc can remove 2 branches in some cases. I would like to do some testing in an implementation before finalizing the ISA.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #7 on: January 16, 2013, 09:07:33 AM »
Quote from: bloodline;722713
Sounds interesting, you might want to start a new thread about optimising and evolving the 68k ISA... As any discussion here might get confused with talk about FPGA implementations :)


There is this thread on amigacoding.de:

http://www.amigacoding.de/index.php?topic=273.msg635;topicseen#msg635

There is also a lot of good techy discussion on the Natami forum where the ideas started. You can do searches there for about any CPU term and find something interesting.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #8 on: January 16, 2013, 06:52:32 PM »
Quote from: psxphill;722751
You're very optimistic.


Optimistic? Yes! Waste of time? Maybe. At least I can say I tried even if I'm dreaming a little. Reality is only one visionary person with a wad of cash away 8-).
 
Quote from: psxphill;722751

It's difficult to predict the future, but I can't imagine there is anyone outside of the retro community that will ever have any interest in a 680x0 cpu core. There are far too many other SOC/ASIC/FPGA solutions that have already carved up the market. There is no competitive edge against any of the other alternatives and nobody in business will care if they can run 680x0 code.


No Edge? How about the best ease of use and code density in the industry. There is the FIDO, ColdFire and CPU32 but they were all cut down from the 68k instead of enhanced. ARM with Thumb 2 has moved close to what an enhanced 68k would be and it doesn't have any trouble selling. I think we would be a little more powerful and easier to use while Thumb 2 is a little more power efficient.

Quote from: psxphill;722751

The majority of people want something that can run existing software and use existing compilers, adding instructions will cause market fragmentation if anyone is tempted to ever use them. A product that doesn't ship because the people behind it gets delusions of grandeur is no use to anybody.


You are correct that the 68k is behind in development software. We tried to add instructions that would be easy for existing compilers to support. This includes common instructions on other platforms and ColdFire instructions that could be enabled in the compiler. Also, an optimizing assembler (like Frank Wille's vasm) could do a lot of optimizations without even changing the compiler.

Quote from: psxphill;722751

Chasing rainbows is all well and good, but it's the reason that Natami failed. I'd rather see something ship for once.


I'd like to see more Amiga products ship as well. They should ship with the most usable debugged 68020 core first but an fpga can be modified. The people that want a 68020 only core can stay with that and those who want to try something enhanced could also.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #9 on: January 18, 2013, 06:32:32 PM »
Quote from: psxphill;723054
Motorola removed some of the instructions added to the 020 and some of the FPU instructions to save space, that could be used for making it run quicker.
 
By only supporting the 060 instructions then you've saved space in the FPGA and the time taken to implement them.

For the most part, the 68060 chose good instructions to remove from hardware. One big exception is the integer 32x32=64. This was already used by compilers to turn a divide by a constant into a multiply saving a huge number of cycles.
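Here is the divide-by-constant trick modeled in Python for unsigned division by 10. The magic number is the well known ceil(2^35 / 10). The function name is made up, and the comments only indicate where a 32x32=64 multiply would sit. This is not actual compiler output.

```python
MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10)


def div10_via_mul(n):
    """Unsigned n // 10 for 0 <= n < 2**32 using one 32x32=64 multiply."""
    product = n * MAGIC_DIV10   # the 64 bit result of a 32x32 multiply
    high = product >> 32        # upper longword of that result
    return high >> 3            # remaining shift: 35 - 32
```

One multiply and one shift replace the divide, but only if the upper 32 bits of the product are available, which is why trapping the 64 bit multiply hurts this kind of code.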

The .library should go in flash so it's available very early for bootable games.

Quote from: psxphill;723068
page 3-1
 
http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf
 
The first 4 stages are for fetching and assigning the instruction to an integer unit. The next 4 stages are the dual integer unit, then the last two stages are completing the instructions.
 
It's quite a simple design.

I wouldn't say it's simple, although it may be compared to some modern processor designs (e.g. x86). There is an instruction buffer in between the pipeline stages that is very costly (muxes) on an fpga. There is also a translation from 16 bit variable length CISC to a fixed length 16 bit RISC in there. I don't think Motorola released the encoding format of their internal fixed length RISC, making it difficult to duplicate. There are 6 bytes of data with each 16 bit fixed length RISC word and I don't know if, for example, a MOVEA.W #,A0 immediate is extended when decoding or in the OEP. I believe the instruction becomes pOEP only if there are >6 bytes of data from extension words, but what if there are more than 12 bytes of extension word data (up to 18 bytes is possible)? If you think this is all simple, I volunteer you to do the VHDL programming of the replacement 68060 :P.

Quote from: psxphill;723068
It doesn't evenly distribute instructions between integer pipelines, it only uses the second integer pipeline when the first is running an instruction that can be run at the same time. Whether it can will depend on the instruction as not all can even be run on the second pipeline and the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction then the next instruction also has to be put on the primary pipeline.

Also, in some cases the OEPs are locked together to process an instruction together.

Quote from: psxphill;723068
I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something that you can achieve in real world examples, however as a 32bit value can contain two instructions then it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set". But if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.

Long instructions (lots of extension words) are more of a problem than 1 cycle instructions for fetch starvation. The 68060 doesn't have much of a fetch bottleneck with most 68020 code because the instructions are short (the 020/030 has a serious fetch bottleneck). A 68060 fetch bottleneck can be seen in artificial tests. Gunnar made a mini benchmark program (on the Natami forum) that used longword immediates continuously, which did show a substantial slowdown (1/4-1/3 slowdown as I recall). The 68060 needs longword data to be efficient but slows down if it has to fetch it very often. Most longword immediates fit in 16 bits, and extending data is low overhead even in fpga (ARM uses a shift, which is high overhead in fpga). This is how MOVEA.W #,An and ADDA.W #,An work already. The same could be done for data registers also, as we found, which would be even more common. Also, adding MVS and MVZ would have helped.
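A Python sketch of the extension being talked about. MOVEA.W and ADDA.W sign extend their 16 bit source to 32 bits, and the ColdFire MVS/MVZ instructions sign extend or zero fill. The helper names are made up, and `fits_in_word` is just the test an assembler could use to decide whether the short immediate is enough.

```python
MASK32 = 0xFFFFFFFF


def mvs_w(value):
    """Sign extend the low word to 32 bits (what MOVEA.W and MVS.W do)."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value


def mvz_w(value):
    """Zero fill the low word to 32 bits (what MVZ.W does)."""
    return value & 0xFFFF


def fits_in_word(imm32):
    """True if a 32 bit immediate survives a trip through 16 bits."""
    return mvs_w(imm32) == (imm32 & MASK32)
```

Every immediate that passes `fits_in_word` saves one extension word (2 bytes) of fetch per instruction.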

Quote from: psxphill;723068
The branch executing in zero cycles doesn't seem to be very well documented. I can't tell whether they are over-exaggerating what it does or not. My original thought was that the branch is in the primary pipeline and the secondary pipeline has the target or next instruction (depending on what is predicted). This doesn't actually cause it to execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0 cycle overhead.
 
What is odd is that they claim different for predicted correctly taken and predicted correctly not taken

Different timing for predicted correctly taken and predicted correctly not taken is normal with a pipelined processor. Branches predicted backward with the branch target in the branch cache are effectively 0 cycles for loops which is awesome as loop unrolling is mostly not needed improving code density. Branches that fall through eat a cycle in the pOEP but a sOEP instruction can execute simultaneously if available (also awesome). Note that the branch unit is a separate unit that can do processing in parallel and that the branch target must be in the branch cache to get the 0 cycle branch taken. That means there is usually some additional overhead the first time executing code. I believe the 68060 does some kind of instruction folding/fusing of the branch with CMP/TST/SUBQ in order to make the 0 cycle branches happen. Very few modern processors have effectively free branches. Jens and Gunnar (Natami) didn't even have all the magic figured out. Joe Circello and the 68060 team had this all figured out back in the 90s and the Motorola marketing guys killed it for PPC. Pencil pusher power!

Quote from: psxphill;723068
So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does.

I think the branch instruction does go through the pOEP. The branch unit looks at it very early, makes a prediction and starts speculative execution. The pOEP still has to verify that the prediction is correct at execution time or flush the pipe and continue executing the other branch path.

Quote from: Mrs Beanbag;723058
The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.

No EXTB.L or TST.W/L An? No misaligned reads or writes? The misaligned reads and writes are a huge saver when not sure of the alignment. Compilers often can't guess the alignment so they bloat up the code and slow down the CPU to align the data before reading or writing.

The 68020+ has some other niceties but they are more advanced.

Quote from: ChaosLord;723082
And Branches >128 bytes :angel:

I think you mean Bcc.L and BSR.L. Branches up to 16 bit were supported on the 68000. The longword branches are big savers but only on fairly large programs. Not too many assembler programmers create programs >65k.

Quote from: freqmax;723088
I presume 16-bit branching is the same as that if a certain flag is set then one can conditionally jump 65536 memory positions?

It's signed so plus or minus ~32k.
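The size selection an optimizing assembler does for Bcc can be sketched in a few lines of Python. The function name and the `cpu_68020` flag are made up. The ranges are the ones from the 68k encoding: 8 bit, 16 bit, and 32 bit on the 68020+.

```python
def bcc_size(displacement, cpu_68020=True):
    """Return 'S', 'W' or 'L' for the shortest Bcc form that can hold the
    displacement, or None if it cannot be encoded."""
    if displacement % 2:
        return None   # instructions sit on even addresses
    # In the 8 bit field $00 means a 16 bit displacement follows. ($FF
    # selects the 32 bit form on the 68020+, but it is odd and was already
    # rejected above.)
    if -128 <= displacement <= 127 and displacement != 0:
        return "S"
    if -32768 <= displacement <= 32767:
        return "W"
    if cpu_68020 and -2**31 <= displacement <= 2**31 - 1:
        return "L"
    return None
```

The "branches >128 bytes" in the quote are the 'W' case, which the 68000 already has. Only the 'L' case needs a 68020+.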

Quote from: freqmax;723088
I have some memory that x86 is limited to 128 position limit on branching? or perhaps it's 6502 ;)
How about ARM?

x86 branches are so screwed up with the early segmentation crap that you really have to define which x86 ISA, and then don't ask me. The ARM 32 bit ISA is better but still has some limitations as I recall. I believe it only allows 24 bit addressing, too. It's quite old, but the 68k was one of the first to have full 32 bit position independent code done right. An assembly programmer doesn't have to worry about the size with a modern optimizing assembler like vasm. It will automatically generate the most efficient encoding (for more than branching, as the 68020+ allows), including forward and backward branch optimization. The 68020+ enhancements removed a lot of limitations and can be used or optimized transparently, which is great. They should have left out the double memory indirect modes though.
« Last Edit: January 18, 2013, 06:47:33 PM by matthey »
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #10 on: January 18, 2013, 07:32:49 PM »
Quote from: psxphill;723092
Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.


No, but logically something has to happen to 32 bit instructions to fit in 16 bits. It's possible (even logical) that the 2nd word of a 32 bit instruction becomes part of the 6 bytes of "data" per OEP. I don't know if that is possible for all 32 bit instructions though.

Quote from: Mrs Beanbag;723093

I can't honestly say if I use TST.L An or not, off the top of my head. Pretty sure I never do TST.W An though, can't think of much use for that.


You never do:

   movea.l myptr,a0
   tst.l a0
   beq .nullptr

or

   jsr (-$xxx,a6)
   movea.l d0,a0
   tst.l a0
   beq .nullptr

Of course the latter is better sometimes:

   jsr (-$xxx,a6)
   tst.l d0
   movea.l d0,a0
   beq .nullptr

A pipelined 68k CPU could reduce the branch overhead with this one, but be careful that a0 is not an input to an EA calculation right after the branch, or the first option is better.

Not many have used TST.W An, but be careful. It actually tests only the low word, unlike many address register instructions with a .W size (MOVEA.W, ADDA.W, etc.) which sign extend and operate on the whole longword. Vasm and PhxAss were doing an optimization to TST.W which was wrong for many years until it was recently found and fixed. Most of the time it would not cause a problem, but it could lead to very rare random crashes.
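A Python model of the flag results shows how a TST.L to TST.W rewrite can go wrong on a pointer. The function and the example address are made up for illustration.

```python
def tst(value, size):
    """Z and N flags after TST.<size> on a register holding value."""
    bits = {"w": 16, "l": 32}[size]
    tested = value & ((1 << bits) - 1)   # only the low 'bits' are examined
    return {"Z": tested == 0, "N": bool(tested >> (bits - 1))}


ptr = 0x00210000              # valid pointer whose low word happens to be 0
long_test = tst(ptr, "l")     # Z clear: not a null pointer
word_test = tst(ptr, "w")     # Z set: wrongly looks like a null pointer
```

Only addresses with a zero low word trigger the wrong result, which matches the "very rare random crashes" description.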

Quote from: Mrs Beanbag;723093

I'm actually pretty careful not to do misaligned access, it just seems wrong, somehow. Just because you can, doesn't mean you should!


Good. Treat it like credit. Don't use it when you don't need it and don't abuse it when you do need it.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #11 on: January 18, 2013, 08:28:09 PM »
Quote from: freqmax;723099

Do you think it's feasible to create something that can get near 50 MHz 68060 in FPGA?


In an affordable fpga? Yes, but I think some different techniques would be better in an fpga than those used on the 68060. It is probably easier to make a non-superscalar (more 68040 like) CPU that is clocked higher at first. It should be possible to achieve 100MHz+ in a sufficiently pipelined 68020+ processor. A link stack, more code fusing/folding and new instructions could make up for some of the disadvantages of the fpga processor vs the 68060. You will probably not get to 68060@100MHz performance until fpgas get cheaper. A CPU, FPU and MMU in fpga will probably push the logic capacity of affordable fpgas also.

Quote from: freqmax;723099

So there is a kind of selection process such that instructions that don't depend on subsequent instructions could be done in parallel while the rest is single pipeline?


The selection process is described in the MC68060UM Section 10 "Instruction Execution Timing".

Quote from: freqmax;723099

Btw, Is there any ISA that is neater and more straightforward than m68k? ;)


Yes. There are simpler ISAs but most are less powerful. Motorola/Freescale have liked simple clean ISAs, favoring RISC since the 68k. The 88k was the 68k RISC replacement before being abandoned for PPC. It's a simple and clean classic RISC but a little weak compared to the 68k. The 96k DSP is an interesting RISC/CISC hybrid borrowing much from the 68k that is quite powerful and fairly clean but more difficult to use. The ColdFire was an attempt to simplify the 68k but in doing so made it inconsistent (more difficult to program but still relatively easy) and less powerful, even though some late enhancements brought some of the power back. The MCORE is a 16 bit fixed length (very simple) RISC that was meant to compete with ARM. It competed in power efficiency but it has to be one of the weakest modern 32 bit processors I have ever seen. It looks straightforward to program but very tedious. Note that the PPC is not a Motorola/Freescale design. It is not very simple for a RISC (but fairly consistent), not easy to program and is very powerful.
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #12 on: January 18, 2013, 09:24:22 PM »
Quote from: Mrs Beanbag;723101
I do such things like:

move.l myptr(PC),D0
beq .nullptr
move.l D0,A0

in the 2nd example could always use tst.l D0 anyway.


68000 style ;). Needs a scratch data register but it's usually available. TST.L An is not needed here but there are some places it's useful.

Quote from: Mrs Beanbag;723101

flags are set for free when moving to an address register. Also note the first line, I always write relocatable code.


Mind swap on the address register :). I let the assembler do the PC relative and the MOVE.L ,An -> MOVEA.L ,An even though I didn't for clarity in my examples.

Quote from: Mrs Beanbag;723101

In other news, I've been thinking about a RISC instruction set for internal use in a 68k core for some time. I think we can identify a few obvious simplifications:
1. treat An and Dn identically (use extra instructions if different behaviour is required)


That's nice for simplification but not good for code density. Are you looking at a fixed 16 bit or 32 bit RISC encoding?

Quote from: Mrs Beanbag;723101

2. only MOVE can use as either source or destination operand (load/store architecture)


Ok, but now you have to divide up CISC instructions into multiple RISC instructions. Your instruction stream just grew big time.

Quote from: Mrs Beanbag;723101

3. all other instructions register-register, or "quick" short-constant source operands

4. spare "temporary" registers for internal use.
we could map 68k instructions to short sequences of internal instructions, and design those instructions to give the shortest sequences.


http://en.wikipedia.org/wiki/Microcode

I have heard a rumor that as much as 1/3 of the 68060 is microcode. It's generally slower though. The 68060 bit field instructions are a good example. They can be done in 1-3 cycles (data in cache) on an fpga but they take 2x-3x that long on the 68060.


Quote from: psxphill;723102
Yeah, you can't execute in parallel if the first instruction modifies a register that the second uses: for example
 
MOVEQ #0,D0
TST.W D0


Actually, this may work in parallel. Some very simple instructions are retired early and their longword (only) result is made available early. The result from these types of instructions is made available early for change/use stalls and, although it's not specifically stated, it is probably also available early for the other OEP. These early retirement instructions include:

   lea
   move.l #,Rn
   moveq
   clr.l Dn
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #13 on: January 18, 2013, 10:21:28 PM »
Quote from: freqmax;723104

What's a "Link stack" ..?


psxphill got it although the link stack can be different sizes. It should make RTS 2 cycles instead of 7 cycles on the 68060.
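For readers without the quoted explanation: a link stack is generally a small hardware stack of return addresses used to predict where RTS will go. A minimal Python model follows. The depth and all names are assumptions, not taken from any real core.

```python
class LinkStack:
    """Toy return address predictor: push on BSR/JSR, pop on RTS."""

    def __init__(self, depth=4):
        self.depth = depth    # number of hardware entries (assumed)
        self.entries = []

    def call(self, return_address):
        """BSR/JSR: remember where the matching RTS should come back to."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)   # full: the oldest entry is lost
        self.entries.append(return_address)

    def predict_return(self):
        """RTS: predicted target, or None if nothing is recorded."""
        return self.entries.pop() if self.entries else None
```

The prediction still has to be checked against the real return address read from the stack, since code can change that address before the RTS. Calls nested deeper than the stack simply lose the oldest predictions.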

Quote from: freqmax;723104

Have you looked at the Actel FPGAs?, they are way faster than any competitor last time I checked. Of course they are slightly more expensive.


No, I have only heard of them. I haven't played around with any FPGAs, although I have looked at some VHDL code for a 68k CPU or two :).

Quote from: freqmax;723104

As for ISA, my thinking were if the ISA of ARM, Transmeta, PDP-11, MIPS, Sparc, DEC Alpha, PA-RISC, etc is easier to deal with. Without sacrificing performance.


I don't think that any RISC processors are going to be as easy as the 68k. ARM probably comes the closest and MIPS is also logical and usable in assembler from my limited exposure. They both look way easier than PPC despite PPC having as many instructions as many CISC processors. PPC is as bad about using acronyms as the U.S. military.

The PDP-11 should have been very easy to program, possibly easier than the 68k. The performance would be limited by the encodings but it would be interesting to see someone try to implement a modern version in an FPGA. The instructions are powerful but would probably require a lot of microcode above a RISC core. It's too bad that students will probably not be able to see how easy to program a processor can be. Even the 68k is all but dead.

Quote from: Mrs Beanbag;723123

I would rather optimise for 68000 instructions and provide the rest just for compatibility. How common are the bitfield instructions in real code? I never use them.


It varies. Most old code doesn't use them much but GCC started using them heavily from about GCC 3.x on, even when the timing for them was slower. It's often faster not to use them on the 68060 because it can do a shift and an AND in the same cycle. They are often worthwhile on the 68020-68040 and are good for code density and fairly intuitive. They have 32 bit results, which is good for 32 bit register forwarding, and they make efficient use of registers. They are very useful for processing streams of data in memory (with caches), which the register memory architecture of the 68k can do well. The only drawback is a little bit more complexity than the average instruction. If they were fast, they would be used a lot more. Implementing them would help the performance of GCC compiled programs, whereas trapping them would slow these newer programs to a crawl. You get faster, smaller programs with them and slower, bigger programs without. It's a not so tough choice for me.
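
To show what the shift and AND replacement looks like, here is a small model of an unsigned bit field extract from a data register next to the shift/AND version. The function names are mine, and the model assumes the usual 68020+ convention that the offset counts from the most significant bit and that a field in a data register wraps around.

```python
MASK32 = 0xFFFFFFFF


def bfextu_reg(value, offset, width):
    """Model BFEXTU Dn{offset:width}: offset 0 is the most significant bit.

    Assumes value fits in 32 bits and 1 <= width <= 32. The field is allowed
    to wrap around the register, so rotate left first, then shift down.
    """
    offset %= 32
    rotated = ((value << offset) | (value >> (32 - offset))) & MASK32
    return rotated >> (32 - width)


def shift_and(value, offset, width):
    """Same extraction with one shift and one AND (field must not wrap)."""
    return (value >> (32 - offset - width)) & ((1 << width) - 1)


x = 0x12345678
print(hex(bfextu_reg(x, 8, 8)), hex(shift_and(x, 8, 8)))
```

The two agree whenever the field does not wrap; the bit field instruction just says it in one instruction and one destination register.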
 

Offline matthey

Re: Motorola 68060 FPGA replacement module (idea)
« Reply #14 on: January 19, 2013, 12:04:34 AM »
Quote from: psxphill;723136
The only thing I can find is this:
 
"If the primary OEP instruction is a simple “move long to register” (MOVE.L,Rx) and the destination register Rx is required as either the sOEP.A or sOEP.B input, the MC68060 bypasses the data as required and the test succeeds."

Which says it's only for move.l, although I guess the others could be translated. It doesn't have to retire it early, the second pipeline could look in the primary pipeline. Mips has a similar handling for lwl/lwr opcodes, it pulls the register value from the pipeline and stops the register being updated at all. The register doesn't actually get updated until you stop executing lwl/lwr opcodes.

There are at least 2 different optimizations here. One is the early instruction retirement and register forwarding. The other is more of a MOVE.L+OP.L optimization which is possible because MOVE.L is only half an operation in a register memory architecture that can do both in 1 operation. The Natami processor was planning to use instruction fusing/folding to handle most of these cases. The non-superscalar v4 ColdFire probably does too:

"Last, ColdFire v4 is smart about collapsing commonly used constructs into a single operation. If two instructions will execute in different stages and have no dependencies, they will execute together in a single cycle. This “instruction folding” is ColdFire’s first move toward superscalar dispatch."  -ColdFire Doubles Performance With v4 by Jim Turley

The M68060UM is less than clear about these optimizations, even if you understand how these types of optimizations commonly work. Editors usually make this stuff worse than what the engineers started with too. I can say I don't fully understand them, and I have better knowledge than most people plus experience coding for the 68060. By looking at code compiled for the 68060, it looks like many compiler programmers didn't understand either. Most 68060 optimized code doesn't do much except replace some trapped instructions, if that.
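
For what it's worth, the MOVE.L+OP.L fusing idea mentioned earlier in this post can be sketched as a simple peephole pass. This is only an illustration under my own assumptions (instructions as `(op, src, dst)` tuples, a fused form written as `(op, a, b, dst)` meaning dst = a op b); it is not how the Natami core or the v4 ColdFire actually does it.

```python
def fuse(stream):
    """Fold 'move.l a,d' followed by 'op.l b,d' into one three-operand op.

    The pair computes d = a op b, so the fused form is (op, a, b, d).
    Fusing is skipped when b itself uses d, because b would then depend
    on the value the MOVE.L just produced.
    """
    out = []
    i = 0
    while i < len(stream):
        op, src, dst = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if (op == "move.l" and nxt is not None
                and nxt[0] != "move.l" and nxt[0].endswith(".l")
                and nxt[2] == dst and dst not in nxt[1]):
            out.append((nxt[0], src, nxt[1], dst))
            i += 2                      # both instructions consumed
        else:
            out.append(stream[i])
            i += 1
    return out


code = [("move.l", "d1", "d0"), ("add.l", "d2", "d0"), ("sub.l", "d3", "d4")]
print(fuse(code))
```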
« Last Edit: January 19, 2013, 12:20:12 AM by matthey »