Author Topic: Motorola 68060 FPGA replacement module (idea) (Read 53032 times)

Mrs Beanbag · « **Reply #44 from previous page:** January 18, 2013, 08:04:45 PM »

I do such things like:

move.l myptr(PC),D0
beq .nullptr
move.l D0,A0

in the 2nd example could always use tst.l D0 anyway.

flags are set for free when moving to an address register. Also note the first line, I always write relocatable code.

In other news, I've been thinking about a RISC instruction set for internal use in a 68k core for some time. I think we can identify a few obvious simplifications:
1. treat An and Dn identically (use extra instructions if different behaviour is required)
2. only MOVE can use <ea> as either source or destination operand (load/store architecture)
3. all other instructions register-register, or "quick" short-constant source operands
4. spare "temporary" registers for internal use.
we could map 68k instructions to short sequences of internal instructions, and design those instructions to give the shortest sequences.
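
To make the mapping idea concrete, here is a rough sketch of how a translator might expand one 68k instruction into load/store-style internal operations. This is only an illustration under assumptions: the temporaries `T0` and `T1`, the tuple format and the function names are all made up for the example, and addressing modes with side effects such as (An)+ or -(An) are not handled correctly in the read-modify-write case.

```python
# Hypothetical sketch: expand a 68k-style instruction into internal load/store ops.
# T0/T1 are assumed internal temporaries; the (op, src, dst) tuple format is illustrative.

def is_reg(operand):
    """True for D0-D7 / A0-A7, which the internal core would treat identically."""
    return len(operand) == 2 and operand[0] in "DA" and operand[1] in "01234567"

def is_quick(operand):
    """True for an immediate, assumed here to fit a 'quick' short constant."""
    return operand.startswith("#")

def translate(op, src, dst):
    """Return a list of internal (op, src, dst) tuples for one 68k instruction."""
    if op == "move":
        if is_reg(src) or is_quick(src) or is_reg(dst):
            return [("move", src, dst)]          # a single load, store or reg move
        # memory-to-memory move: go through a temporary
        return [("move", src, "T0"), ("move", "T0", dst)]
    out = []
    if not (is_reg(src) or is_quick(src)):
        out.append(("move", src, "T0"))          # load the memory source first
        src = "T0"
    if is_reg(dst):
        out.append((op, src, dst))               # plain register-register op
    else:
        out.append(("move", dst, "T1"))          # read
        out.append((op, src, "T1"))              # modify
        out.append(("move", "T1", dst))          # write back
    return out

if __name__ == "__main__":
    for ins in [("add", "D0", "D1"), ("add", "(A0)", "D1"), ("add", "D0", "(A1)")]:
        print(ins, "->", translate(*ins))
```

Register-register instructions stay 1:1, while memory operands cost one or two extra internal moves, which is the trade-off the list above accepts.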

Mrs Beanbag · « **Reply #45 on:** January 18, 2013, 09:40:31 PM »

Quote from: matthey;723118

Mind swap on the address register .

Oops I meant data register.

Quote

That's nice for simplification but not good for code density. Are you looking at a fixed 16 bit or 32 bit RISC encoding?

Code density doesn't matter here as it would only be used internally, with external 68k code translated into internal code in some kind of buffer. The encoding would be fixed length, but the number of bits could be anything; it's not actually stored in RAM, so it doesn't even need to be 16 or 32.

Quote

I have heard a rumor that as much as 1/3 of the 68060 is microcode. It's generally slower though. The 68060 bit field instructions are a good example. They can be done in 1-3 cycles (data in cache) on an FPGA but they take 2x-3x that long on the 68060.

I would rather optimise for 68000 instructions and provide the rest just for compatibility. How common are the bitfield instructions in real code? I never use them.

Of course see what fits on an FPGA first and maybe we can add more bits in later.

Mrs Beanbag · « **Reply #46 on:** January 19, 2013, 06:16:01 PM »

Quote from: freqmax;723203

Why were these instructions [CALLM & RTM] dropped?

More to the point, why were they ever included in the first place?

Personally I'd suggest a minimalist implementation (68000 + some 020 features) and see how fast we can get it, before adding anything else.

Mrs Beanbag · « **Reply #47 on:** January 20, 2013, 05:58:09 PM »

Right, simplifying the decoding stage wasn't the idea so much. But if you can split a problem into two parts, it is usually easier to solve. I'm trying to make the developer's job easier really.

The advantages are that each part can be developed, tested and optimised separately, and indeed the RISC core could conceivably be useful on its own (and an assembler could be modified to compile 68k asm to run on it). It would be easier to add new instructions, much in the same way that microcode does, but the "microcode" in this case is more readily understandable, being 68k-like itself.

Mrs Beanbag · « **Reply #48 on:** January 20, 2013, 06:41:21 PM »

I don't think so... the Dn,Dn instructions are fairly trivial to translate; <ea>,Dn and Dn,<ea> instructions need to wrap an instruction with a load and/or a store, and I don't think that would be hard. I think it would be simple to get something to work (minus the nastier addressing modes mentioned above), though maybe difficult to master properly and get the most out of it. It's perhaps a bit like Arm + Thumb, which seems to work well enough.

I'm thinking of using 32 bit instruction words, which encode either a single move/lea operation, or a pair of register-only operations that can run in parallel. I call this "explicit superscalar". To give you an idea of how that might work, for example an "exg D0,D1" operation can be synthesised as

move.l D0,D1 ; move.l D1,D0

So kind of like VLIW but not "very long". Finding operations to pair would be the job of the instruction translator, that part might be tricky but the RISC core itself would be simple.

One could also do other tricks, like if you have

add D0,D1
move D1,D2

usually this would have to take (at least) two cycles, but one could use a right-hand side instruction that simply writes the result of the left instruction elsewhere, like this:

add D0,D1; write D2
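
A small simulation may help show why the paired form works. This is a sketch under assumptions, not a design: it assumes both halves of a pair read the register file as it was before the instruction and then write back together, and the `write` op, the tuple format and the function names are invented for the example.

```python
# Hypothetical model of an "explicit superscalar" pair of register-only ops.
# Assumption: both halves read the pre-instruction register state, then both write.

MASK = 0xFFFFFFFF

def compute(op, regs, src, dst):
    """Result of one register-only op, using the given register snapshot."""
    if op == "move":
        return regs[src]
    if op == "add":
        return (regs[dst] + regs[src]) & MASK
    raise ValueError("unsupported op: " + op)

def run_pair(regs, left, right):
    """Execute (left ; right) in one step and return the new register state."""
    old = dict(regs)                       # snapshot seen by both halves
    l_op, l_src, l_dst = left
    l_val = compute(l_op, old, l_src, l_dst)
    r_op, r_src, r_dst = right
    if r_op == "write":
        r_val = l_val                      # forward the left-hand result
    else:
        r_val = compute(r_op, old, r_src, r_dst)
    new = dict(regs)
    new[l_dst] = l_val
    new[r_dst] = r_val
    return new

if __name__ == "__main__":
    regs = {"D0": 1, "D1": 2, "D2": 0}
    print(run_pair(regs, ("move", "D0", "D1"), ("move", "D1", "D0")))  # exg D0,D1
    print(run_pair(regs, ("add", "D0", "D1"), ("write", None, "D2")))  # add + copy
```

Because both moves see the old values, the pair swaps D0 and D1 without needing a temporary, and the `write` half gets the add result in the same step.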

The other thing that occurs to me is you could translate the code into the cache upon reading, so that it would function as a JIT.

Mrs Beanbag · « **Reply #49 on:** January 20, 2013, 06:54:15 PM »

Quote from: matthey;723348

Hmm. Why not just use JIT in UAE then? The 68k code does make a nice compressed cross platform intermediate language

I guess that's not too far from my idea, now I think about it, but with a CPU core specifically designed to emulate 68k. Sort of a hardware emulator, I guess.

I don't like such things as out-of-order execution, I can't see how it can save you very much compared to optimising the code properly, at least for the massive amount of extra logic required. My favourite CPU is still the UltraSparc T1.

Mrs Beanbag · « **Reply #50 on:** January 20, 2013, 07:59:23 PM »

Quote from: psxphill;723354

If a 1 cycle 68060 instruction translates to 2 of your instructions, then you'd have to clock at double the speed to achieve the same throughput. So each of those would have to be a 1:1 mapping or you've already failed.

I've got the instruction execution timings in front of me here. Instructions with indirect addressing modes or immediate data can take longer than 1 cycle on 68060. Move (An),(An) for instance takes two cycles. All the Register-Register instructions take 1 cycle. These could be mapped 1:1, or better. So the answer is it probably depends on the program. But it might be possible to process some combinations of two 32 bit instructions simultaneously as well. (Add some degree of implicit superscalar operation, probably move <ea>,Dn with ALU Dn,Dn operations, with some pipeline trickery.) Also how well the cache performs will have a lot to do with it.

If I don't beat a 68060 clock for clock on my first attempt, I won't be too disheartened, anyway. If I can beat a 68040 it would be a nice start.

Mrs Beanbag · « **Reply #51 on:** January 21, 2013, 01:15:28 PM »

Bear in mind that these complex, base displacement addressing modes do not exist on 68000, only from 68020 onwards.

@mrgreedy98: as has been pointed out already, ColdFire core can be licensed and used, but it cannot be modified because it is encrypted (and no doubt obfuscated as well, I know Xilinx obfuscate their cores before encrypting them). Without modification, ColdFire is not much use, sadly. I have investigated this. The differences might be subtle but that doesn't make them easy to work around.

Mrs Beanbag · « **Reply #52 on:** January 21, 2013, 01:56:55 PM »

There is this information from the Megadrive:
http://emu-docs.org/CPU%2068k/68kstat.txt

although it might be more instructive to see which are the most common addressing modes for these instructions, too.
For instance, rate of "add Dx,Dy" vs "add (Ax),Dy" and "add Dx,(Ay)".

Mrs Beanbag · « **Reply #53 on:** January 21, 2013, 03:22:59 PM »

So keeping the pipeline relatively short is probably a more effective strategy than making sure all instructions are single cycle. We can afford a few 2-cycle instructions if we can shorten the pipeline by at least one stage, I reckon.

Also I have been thinking of a way to make the instruction translation do branch predication in the case a conditional branch skips only a few instructions.

Mrs Beanbag · « **Reply #54 on:** January 21, 2013, 03:58:23 PM »

Actually something just occurred to me. If the most common instruction is "tst", it should be possible to know whether a branch will be taken or not some time in advance. Because "tst" only looks at a single register, the contents of that register must have been determined some time before. So you could look ahead in the instruction queue for a "tst/bcc", and inform the branch predictor well in advance. "tst" instruction then takes effectively NO cycles.

Mrs Beanbag · « **Reply #55 on:** January 21, 2013, 04:50:21 PM »

Quote from: matthey;723433

Right. The OEPs are locked together and each OEP performs 1/2 of the ea for a move <ea>,<ea>. This is the only 68k instruction that allows 2 EAs by the way.

Not strictly true. Can also do "cmp (Ax)+,(Ay)+"

addx, subx, abcd and sbcd can use predecrement for both operands.

All of these are two cycle instructions.

Quote from: psxphill;723436

Apart from the cycles it takes to look ahead in the instruction stream every time you hit a tst instruction, and it will get complex to even follow the code as you would have to follow branches as well. Basically to avoid the cycles when a branch happens, you'll end up going through the same overhead as running the code after every tst instruction (tst isn't the only instruction that affects branches).

Instructions are read into a buffer ahead of time, so it can detect a tst/bcc when it is first read in. I wouldn't bother following branches; being able to predict only the next branch would still help. Yes, it would only work if the branch follows a tst, but if the profiles from the Megadrive are anything to go by, that is the most common case. Basic RISC principle, "make the common case fast"!

Mrs Beanbag · « **Reply #56 on:** January 21, 2013, 07:53:28 PM »

Quote from: psxphill;723460

It wouldn't help at all when the branch follows the test, because you're going to have to flush all the following instructions from the pipeline. If you're going to remove the pipeline completely or a significant number of stages then you'll have a huge number of instructions taking multiple cycles and the overhead of incorrectly predicted branches is going to be so insignificant that it won't be worth doing.

The following instructions wouldn't be in the pipeline yet, at the point you make the prediction, that's the whole point, to avoid having to flush the pipeline when you get to the branch.

I honestly don't know what you mean here. When you say "when the branch follows the test", when would the branch ever not follow the test? There wouldn't be much point doing a test and then not having a conditional branch after it.

I wonder if you understood my idea properly, so I'll try explaining it again. The instruction stream is read into a FIFO (which I believe is a fairly normal thing to do) and as soon as a test followed by a branch is read in, it can do the test immediately (which is a very simple operation) and predict the branch based on that. So as long as the register doesn't change by the time the branch instruction comes out of the other end of the FIFO the branch will have been predicted correctly.
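
In case it helps, the idea can be sketched roughly as follows. This is a toy model under assumptions: the instruction tuples, the `predict_taken` tag and the function names are invented for illustration, only register tests are considered, and the model just records prediction against outcome without actually redirecting fetch.

```python
# Hypothetical sketch of early "tst Dn / bcc" prediction at FIFO fill time.
from collections import deque

def fetch(fifo, regs, instrs):
    """Push instructions into the FIFO; tag a beq/bne that follows 'tst Dn'
    with a prediction based on the register value at fetch time."""
    prev = None
    for ins in instrs:
        entry = {"ins": ins, "predict_taken": None}
        if prev is not None and prev[0] == "tst" and ins[0] in ("beq", "bne"):
            zero = regs[prev[1]] == 0
            entry["predict_taken"] = zero if ins[0] == "beq" else not zero
        fifo.append(entry)
        prev = ins

def execute(fifo, regs):
    """Drain the FIFO; return (predicted, actual) for each branch."""
    z = False
    outcomes = []
    while fifo:
        entry = fifo.popleft()
        op = entry["ins"][0]
        if op == "subq":
            _, n, reg = entry["ins"]
            regs[reg] = (regs[reg] - n) & 0xFFFFFFFF
            z = regs[reg] == 0
        elif op == "tst":
            z = regs[entry["ins"][1]] == 0
        elif op in ("beq", "bne"):
            taken = z if op == "beq" else not z
            outcomes.append((entry["predict_taken"], taken))
    return outcomes

if __name__ == "__main__":
    regs = {"D0": 0}
    fifo = deque()
    fetch(fifo, regs, [("tst", "D0"), ("beq", "nullptr")])
    print(execute(fifo, regs))   # register unchanged, so prediction matches
```

If an instruction still ahead in the FIFO changes the register before the branch comes out, the recorded prediction can differ from the real outcome, which is the case where the guess is wrong.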

Mrs Beanbag · « **Reply #57 on:** January 21, 2013, 08:48:14 PM »

Quote from: billt;723471

The test may not always be able to be done immediately. Might it not depend on the writeback of an instruction ahead of it but still in the pipeline and not yet finished? You may not yet have the right thing there to test just yet. Such as decrementing a loop counter might be right ahead of the test for 0...

Yes it would be a prediction, the prediction isn't always necessarily right, but as long as it's right more than 50% of the time it will help.

In the case of a loop, even if the decrement is right before the branch, the prediction will be right up until the very last iteration.
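
A quick way to check that claim, as a sketch only: assume a loop of the form subq #1,D0 / tst D0 / bne loop, and assume the predictor always sees the stale value of D0 from just before the decrement. The function name is made up for the example.

```python
# Hypothetical model: the prediction uses the counter value read BEFORE the
# decrement has been written back (the worst case described above).

def stale_prediction_hits(n):
    """Return (correct predictions, iterations) for a countdown loop starting
    with D0 = n (n >= 1): subq #1,D0 / tst D0 / bne loop."""
    d0 = n
    hits = 0
    iterations = 0
    while True:
        predicted_taken = d0 != 0      # stale value, decrement not yet visible
        d0 -= 1
        actual_taken = d0 != 0         # what bne really does
        if predicted_taken == actual_taken:
            hits += 1
        iterations += 1
        if not actual_taken:
            break
    return hits, iterations

if __name__ == "__main__":
    print(stale_prediction_hits(10))
```

With a starting count of 10, only the final iteration is mispredicted in this model, so the stale test is still right far more often than 50% of the time.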

It would also be possible, in many cases, for a coder or compiler to optimise for it by re-ordering the instructions.

Mrs Beanbag · « **Reply #58 on:** January 22, 2013, 06:28:52 PM »

Quote from: psxphill;723488

What you're suggesting will break I/O, which is the major use of TST. You can only perform the read once & you can't do the read until all the registers are correct, or you could be reading from anywhere.

Good point. I was only thinking of tests on registers.

Mrs Beanbag · « **Reply #59 on:** February 12, 2013, 11:00:44 AM »

Yes the ALU is a piece you could get from various places ready made. I was pondering the possibility of using the cache management systems out of the OpenSPARC core.

68060 is definitely microcoded to some degree; move <ea>,<ea> is split into two "standard operations". No doubt fancy addressing modes are split up into even more operations. The RISC cores (x2) seem to be able to handle register-memory operations on their own though.
