Author Topic: Die space for m68k on FPGA? (Read 13453 times)

matthey · « **on:** January 05, 2013, 05:59:33 AM »

Quote from: freqmax;721192

I'm curious if an 80386 + VGA can be implemented on the existing FPGA Replay.

DosBox with 68k dynamic recompilation should be able to achieve 386 emulation speeds on a fast 68060 or fpga 68k CPU. The bonus is that the Amiga can multitask at the same time, much like the advantage of ShapeShifter over a real 68k Macintosh. An enhanced fpga 68k CPU could support faster emulation of x86 by providing some useful instructions and addressing modes that the x86 has but the 68k does not. When creating the basic 68k dynamic recompilation, I could see that MVS/MVZ (x86 MOVSX/MOVZX), LEA to a data register, immediate shifts >8, a base register update addressing mode, small longword->word compressed immediates, a PERM instruction and fast bitfield instructions would greatly speed up and simplify x86 emulation. Coincidentally, many of these ideas were previously suggested in the 68koolFusion ISA. A true 386 fpga core should still be a little faster, but then a real 386 DOS machine can be obtained for free.
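To make the MVS/MVZ point concrete, here is a rough Python sketch of what the two operations compute, following the MOVSX/MOVZX mapping given above. The function names `mvs` and `mvz` and the 32-bit register model are just assumptions for illustration, not the 68kF encoding or any real emulator's code.

```python
MASK32 = 0xFFFFFFFF


def mvz(value: int, size_bits: int) -> int:
    """Zero-extend the low size_bits of value to 32 bits (MOVZX-like)."""
    return value & ((1 << size_bits) - 1)


def mvs(value: int, size_bits: int) -> int:
    """Sign-extend the low size_bits of value to 32 bits (MOVSX-like)."""
    low = value & ((1 << size_bits) - 1)
    sign_bit = 1 << (size_bits - 1)
    # XOR with the sign bit, then subtract it: this yields a negative
    # Python int when the sign bit was set. The final mask turns that
    # into the 32-bit two's-complement pattern.
    return ((low ^ sign_bit) - sign_bit) & MASK32


if __name__ == "__main__":
    print(hex(mvz(0x1234ABCD, 16)))  # low word, upper bits cleared
    print(hex(mvs(0x80, 8)))         # byte 0x80 is negative when signed
```

Without a single instruction for this, a recompiler has to emit a clear-plus-move or a move-plus-extend pair for every x86 MOVZX/MOVSX it translates, which is where the saving would come from.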

Quote from: billt

an FPGA on a PGA carrier to replace 68060 which are so hard to find the good ones now.

Quote from: freqmax;721192

I like the idea. Power through the socket might be an issue though.

Yeah, interesting idea. An fpga is low power and low voltage like the 68060. An fpga core would have to be very similar to a 68060 though.

matthey · « **Reply #1 on:** January 05, 2013, 03:44:16 PM »

Quote from: Fats;721327

Why not go for both a m68k and x86 CPU core at the same time on the FPGA ?

A 2nd (and larger in the case of x86) decoder would be needed, but it is possible for an fpga CPU core to be flexible enough internally to handle most modern CPU instructions, addressing modes and functionality. The same task/process could only execute the code for 1 CPU at once because of encoding conflicts. Multiprocessing should be possible with care though. It would be like emulating an Amiga with a bridgeboard all in one fpga. Personally, I think a 68k and x86 together would be a waste. Enhance the 68k as I suggested and the 68k would be able to do practically all the common functionality of the 386, but with general purpose registers, making emulation easier while being much easier to program. More interesting CPU combinations would be 68k+Z80 for emulating several game consoles and 68k+PPC for emulating PPC Amigas, although that would require a big fpga and an efficient PPC core. Also, an enhanced 68k+68000 might be good for max compatibility without rebooting/configuring the fpga.

matthey · « **Reply #2 on:** January 05, 2013, 07:25:41 PM »

Quote from: psxphill;721344

What makes it too big? The ISA itself shouldn't be.

The PowerISA is large for a RISC ISA (although much of it is rarely used and can be trapped, at the cost of poor performance if it is used) and the caches need to be significantly larger than on the 68k for good performance. The code density of the PowerPC is about 1/2 that of an enhanced 68k CPU, so it needs 2x the instruction cache to hold the same amount of code, for example. ARM with Thumb 2 gave up a simple and clean RISC ISA and decoder to return to CISC "compressed" code much like the 68k (although a little simpler for decoding but not as compressed or programmer friendly).
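The cache arithmetic is simple enough to write down. This is only a back-of-the-envelope sketch: the helper name `cache_needed_bytes` is made up, the 0.5 ratio is the "about 1/2" estimate from the paragraph above, and the 8 KB starting size is an example value, not a measurement.

```python
def cache_needed_bytes(reference_code_bytes: int, relative_density: float) -> int:
    """Instruction cache needed to hold the same code on a less dense ISA.

    relative_density is the other ISA's code density relative to the
    reference ISA (1.0 means equally dense, 0.5 means half as dense).
    """
    if relative_density <= 0:
        raise ValueError("relative_density must be positive")
    return round(reference_code_bytes / relative_density)


if __name__ == "__main__":
    hot_code_68k = 8 * 1024  # example: 8 KB of hot 68k code
    print(cache_needed_bytes(hot_code_68k, 0.5))  # bytes needed at half the density
```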

Quote from: psxphill;721344

It would be tricky to match performance of a real chip & you'd probably have to leave a lot of complex features out.

It would be nearly impossible to match the performance of the original Amiga PPC cards or the SAM 440 with a PPC core in an affordable fpga. I think an enhanced 68k CPU core could come close though, if assembler optimized code is used. This is feasible on the 68k because it's almost as easy to program as a high level language, so programmers enjoy using it, but it is almost impossible on RISC. The 68k's advantages are discounted way too much while its disadvantages are exaggerated. The 68060's performance proves it: it outperformed many early PPC processors. That's why Apple put code in MacOS 8.x that kept the 68060 from working while MacOS 7.x using a 68060 worked great and blew away the PPC Macs of the time.

matthey · « **Reply #3 on:** January 05, 2013, 09:51:40 PM »

Quote from: psxphill;721361

The slowest phase 5 board was a PowerPC 603e 160, which had 16kb L1 cache & used the 32bit ISA. I'm sure you could get close to that, a sam 440 maybe not. But then a sam 440 doesn't have aga, so nothing is perfect.

That sounds about right. A 100MHz superscalar enhanced 68k CPU in an fpga would probably be like a 200-300MHz low end PPC in performance on average. Some things, like a simple memory copy and optimized 68k code, would be approaching a low clocked 440. The PPC has some areas where it's strong too though.

Quote from: psxphill;721361

It doesn't look to have a particularly more complex instruction set than the 68020 (once you factor in mmu & fpu). All instructions are 32bit, which affects density. But it also simplifies fetching, reducing density is more about ram usage than speed. It's better for performance if all instructions are the same length.

The PPC instruction set is large but fairly easy to decode. Decoding is one of the slow points of the 68020+ but it's way better than the x86. Constant instruction length is good for performance, but so are small instructions that are simple to decode or allow parallel decoding. Small instructions mean more of them fit in the instruction cache, which is much faster than memory, and more of them can be fetched in the same amount of time. A bonus is less system memory needed and more flexibility for instructions, which results in an easier to program CPU.

matthey · « **Reply #4 on:** January 06, 2013, 01:01:42 AM »

Quote from: psxphill;721373

Fixed length 32 bit instructions (like PPC & MIPS) are the sweet spot for performance.

They're great when there are unlimited resources. The current trend is away from 32 bit fixed length instructions for a CPU though. There is a reason for this. Many knowledgeable engineers thought PPC, MIPS and the original ARM would destroy the x86 and 68k in performance but they don't. I have tried to explain why. Thumb 2 is an attempt to take advantage of smaller instructions and improve code density.

Quote from: psxphill;721373

Thumb is mainly for when you only have 16 bit access to ram, if your ram is 32bit then Thumb is slower (although it will use less ram).

Most modern ARM code is Thumb 2. Yes, it is a little slower in theory but works well with limited resources.

Quote from: psxphill;721373

Variable length instructions are a pain to parallel decode, because you have to decode the first one to know where the second one is.

Parallel decoding isn't even always possible with variable length instructions. In the best case, the decoder would look at one number in the code and know the instruction length. The length of any instruction on the 68k can be determined by looking at the first 32 bits. That's pretty good. The variable length instructions are not that much of a problem in reality while they save cache and improve ease of programming.
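Here is a toy Python sketch of the dependency psxphill is describing and why a cheap length lookup softens it. The encoding is entirely hypothetical (top two bits of the first 16-bit word give the number of extension words); it is not the 68k encoding, where up to the first 32 bits may be needed.

```python
def toy_length(first_word: int) -> int:
    """Length in 16-bit words of a toy instruction.

    Hypothetical rule: the top two bits of the first word hold the
    number of extension words (0-3) that follow it.
    """
    return 1 + ((first_word >> 14) & 0b11)


def instruction_starts(words: list[int]) -> list[int]:
    """Return the index at which each instruction in the stream begins."""
    starts = []
    pc = 0
    while pc < len(words):
        starts.append(pc)
        # The next start is only known once this instruction's length
        # is known. If the length check is cheap, this step is short.
        pc += toy_length(words[pc])
    return starts


if __name__ == "__main__":
    stream = [0x0001, 0x4002, 0xAAAA, 0x8003, 0x1111, 0x2222, 0x0004]
    print(instruction_starts(stream))
```

Note that the extension word 0xAAAA is skipped as data and never interpreted as an opcode, which is exactly why the boundaries have to be found in order.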

matthey · « **Reply #5 on:** January 06, 2013, 05:31:22 AM »

Quote from: psxphill;721380

The whole point of parallel decodes is that you can decode the first and second at the exact same time. If you have to look at the first to see what the length is, then you've failed. You need fixed length, then you can split the instruction cache so that odd/even instructions can be accessed simultaneously.

The superscalar 68060 averages better than 1 instruction per cycle. A good assembler programmer should be able to average about 2 instructions per cycle in some code. This means that the 68060 is able to decode in parallel with variable length instructions. Short and simple instructions are the key. ARM with Thumb 2 also uses variable length instructions (Thumb 1 used a 16 bit instruction mode only).

Quote from: psxphill;721380

Thumb2 sounds slower:

...
The best Thumb-2 is -O3 -funroll-loops -mthumb -march=armv7-a -mtune=cortex-a8 at 88.7% of overall best

...

I agree that Thumb 2 is a little slower. ARM Holdings claimed 15-25% slower but I am guessing that did not consider that the code is in the cache more often. The figure above is more realistic and good enough that Thumb 2 is used most of the time. Thumb 2 is a real ISA that can stand on its own, unlike Thumb 1. Newer ARM processors will likely drop Thumb 1 support and maybe more.

Quote from: psxphill;721380

With PPC we can run powerup/warpup software, implementing arm is boring.

The fpga Arcade has an ARM CPU so there is no need to emulate. PPC would be interesting but fpga PPC CPU performance would be lousy.

matthey · « **Reply #6 on:** January 06, 2013, 04:35:19 PM »

Quote from: ChaosLord;721440

The M68060 dispatches, decodes, executes, completes and writes the results of 2 instructions at the same time.

This applies to most of the common simple instructions.

It does not apply to gigantic complicated instructions or rare instructions.

You have been studying. It's not all done in parallel but more is than not. Code is sequential (in series) by nature so there are some limitations to parallel operation, but the 68060 shows the proper way to do a superscalar CPU.

Quote from: ChaosLord;721440

Furthermore, it does 3 instructions at the same time, as long as 1 of the instructions is a correctly predicted branch. Loops are common structures of computer programming. The branch at the bottom of the loop will be correctly predicted the 2nd thru the nth times it is executed.

I don't know if it will correctly predict the LOOP branch the 1st time it is encountered. But if you have a loop from 1 to 1000 then it will be correctly predicted 999 times out of 1000 which is a fairly good rate.

The 68060 has 4 execution units (2xinteger, fpu and branch) that operate at least partially in parallel. I believe you are correct with the 3 instruction per cycle max though. As far as loops go, there can be a slow down on the first loop iteration if the branch is not in the branch cache. There is a very costly misprediction on the last loop iteration as the branch falls through which can't be avoided. It would be possible to avoid sometimes with hardware help but not on tight loops.
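The loop numbers can be checked with a toy model. This Python sketch simulates a plain "predict whatever happened last time" predictor on the branch at the bottom of a loop. It is an assumption for illustration only and does not model the real 68060 branch cache; the `first_guess_taken` parameter stands in for whether the very first encounter is guessed correctly.

```python
def count_correct_predictions(iterations: int, first_guess_taken: bool) -> int:
    """Count correct predictions for a loop-closing branch.

    The branch is taken on every pass except the last, where it falls
    through. The predictor simply repeats the previous outcome.
    """
    prediction = first_guess_taken
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1  # falls through only on the final pass
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent outcome
    return correct


if __name__ == "__main__":
    print(count_correct_predictions(1000, first_guess_taken=True))
    print(count_correct_predictions(1000, first_guess_taken=False))
```

With a correct first guess only the final fall-through is missed; with a wrong first guess the first pass is missed as well, which matches the two slow spots described above.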

Quote from: psxphill;721432

68060 can dispatch two instructions at the same time, I don't think it decodes them at the same time.

...

I don't believe it can sustain 2 instructions per cycle for long before the fetch pipeline runs dry & that is if you can even find worthwhile work to do in instructions that can run in parallel.

See TCL's answer above and the 68060 documentation. The 68060 can sustain 2 instructions per cycle but it's impractical in most cases. Complex code is going to use some instructions that do not operate in parallel. The 68060 instruction fetch is 32 bits/cycle, which is low but enough for 2 small (16 bit) instructions. Exceeding 32 bits of instructions per cycle usually does not cause a slowdown unless it's common. The 68060 is already a great CPU but it would have been a monster if they had:

made MULS.W, MULU.W and SWAP as 1 cycle pOEP|sOEP (easy)
added MVS, MVZ and small longword immediate to word compression (see 68kF)
added a link stack

I would expect those moves alone would have been good for 1.5+ instructions per cycle in compiled code because they greatly reduce 68060 bottlenecks.

Quote from: psxphill;721468

It's pipelined, while it can dispatch an instruction in 1 clock cycle and execute an instruction in 1 clock cycle. They aren't the same instruction that it's doing, you don't notice that from the point of view of the program until you get a mis-predicted branch.

I thought what TCL described was pipelined processing and correct. I didn't assume that the same 2 instructions were processed in all the pipelined steps at once. I don't understand his English as specifying this information.

Quote from: ChaosLord;721470

Adding new instructions is easy. Adding new instructions modes.... uhmm... you would just have to say that certain instructions are hardwired for this new addressing mode that you want?

No. I want to add the new addressing modes for all 68k effective addresses (EAs). The addressing mode in question, which is used in DosBox, is base register update, which the 68k does not have. I represent it in the 68kF docs as (bd,An,Rn*Scale)! with the exclamation point at the end specifying to update the base register An with the calculated value. The EA is already calculated so there is little additional overhead. ARM also has this addressing mode, and probably some other processors do too, which makes emulation easier. The 68k addressing mode would be much more flexible and usable than the x86 addressing mode because the 68k has general purpose registers and more of them.
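In case the semantics of (bd,An,Rn*Scale)! are not obvious, here is a minimal Python sketch of what the mode does as described above: compute the EA as usual, then write it back to An. The register dictionary and the function name `ea_with_base_update` are assumptions for illustration; sign extension of bd and of a word sized Rn is left out to keep it short.

```python
MASK32 = 0xFFFFFFFF


def ea_with_base_update(regs: dict, base: str, index: str, scale: int, bd: int = 0) -> int:
    """Model (bd,An,Rn*Scale)!: compute the EA, then update An with it."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("68k index scale is *1, *2, *4 or *8")
    ea = (bd + regs[base] + regs[index] * scale) & MASK32
    regs[base] = ea  # the '!' part: base register takes the calculated value
    return ea


if __name__ == "__main__":
    regs = {"a0": 0x1000, "d1": 3}
    ea = ea_with_base_update(regs, "a0", "d1", 4, bd=8)
    print(hex(ea), hex(regs["a0"]))  # both hold the same address afterwards
```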

matthey · « **Reply #7 on:** January 06, 2013, 07:27:31 PM »

Quote from: Mrs Beanbag;721495

How would you encode it in the instruction?

Preliminary 68koolFusion ISA:

OpenOffice Writer
http://www.heywheel.com/matthey/Amiga/68kF_PRM.odt

PDF
http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf

html
http://www.heywheel.com/matthey/Amiga/68kF_PRM.html

The addressing modes are at the top.

Quote from: ChaosLord;721498

It's been such a long time since I studied the addressing mode bits... I thought they were all used up? How will u encode a new universal addressing mode?

Nope. There is 1 free bit in the full format extension word, plus the available I/IS encodings that were reserved. All new addressing modes are fully backward compatible. There is room to add addressing modes with scale > *8 or even (bd,An*Rn), which would be powerful but would cause a slowdown, especially on an fpga. They might be fine in silicon if a longer pipeline was chosen, although I don't like very long pipelines.

Quote from: ChaosLord;721498

The first thing Jens is going to say is "it messes up the pipeline structure and adds more complexity to the processing and I have to add another secret internal register to hold and forward the results and..."

Only a wee bit more complexity and no secret internal registers (that's needed for LEA EA,Dn which would help x86 emulation also :/ ).

Quote from: Mrs Beanbag;721500

move -(An),Dn
for instance, already updates the address register with the calculated value. So it shouldn't be too much trouble.

Bingo. It does require 2 register writes but that is already needed on the 68k.

Quote from: Mrs Beanbag;721500

It's the encoding that worries me, however I have the Motorola reference manual in front of me and it states that IS-I/IS values of 0100 and 1100-1111 are "reserved".

Reserved is reserved for future ISA changes.
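For anyone without the manual at hand, this Python sketch splits a 68020 full format extension word into its fields and flags the reserved IS-I/IS values quoted above (0100 and 1100-1111). The bit positions are from my reading of the Motorola layout and should be checked against the manual; the function and key names are made up, and the 68kF use of bit 3 is not modelled here.

```python
# IS (bit 6) combined with I/IS (bits 2-0), as a 4-bit value.
RESERVED_IS_IIS = {0b0100, 0b1100, 0b1101, 0b1110, 0b1111}


def decode_full_extension(word: int) -> dict:
    """Split a full format extension word into its fields."""
    if not (word >> 8) & 1:
        raise ValueError("bit 8 is clear: brief format, not full format")
    is_bit = (word >> 6) & 1
    is_iis = (is_bit << 3) | (word & 0b111)
    return {
        "index_is_address_reg": bool((word >> 15) & 1),  # D/A
        "index_reg": (word >> 12) & 0b111,
        "index_long": bool((word >> 11) & 1),            # W/L
        "scale": 1 << ((word >> 9) & 0b11),              # *1, *2, *4, *8
        "base_suppress": bool((word >> 7) & 1),          # BS
        "index_suppress": bool(is_bit),                  # IS
        "bd_size": (word >> 4) & 0b11,
        "bit3": (word >> 3) & 1,                         # '0' in the 68020 docs
        "is_iis": is_iis,
        "reserved": is_iis in RESERVED_IS_IIS,
    }


if __name__ == "__main__":
    fields = decode_full_extension(0x1D20)  # D1.L*4 index, word bd, no indirection
    print(fields["scale"], fields["reserved"])
    print(decode_full_extension(0x0104)["reserved"])  # IS-I/IS = 0100
```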

matthey · « **Reply #8 on:** January 07, 2013, 12:12:20 AM »

Quote from: psxphill;721510

No, reserved means you can't use them. Especially for ones that raise invalid instruction exceptions that programs trap. There is software that uses the reserved line-a exceptions for example.

http://forums.sonicretro.org/index.php?showtopic=24409

If you don't care about compatibility then go ahead and add instructions that will make it not work properly, but then why do you want 680x0?

There is some 68k software that uses a missing or invalid instruction to trap, but such programs are very few, and supporting them means no enhancements are possible. The 68020 ISA broke a fair amount of 68000 code but it was a very good enhancement. The 68kF1 ISA would break way less software than 68000->68020. Even 68kF2 would break less. If 99.999% of code runs then I'm happy. The rest can be patched.

A-line is not currently used by either ISA. The MacOS, Atari ST and others used A-line for system call traps. I don't know of any Amiga software that makes use of A-line except emulators.

Quote from: ChaosLord;721511

Ah yes! I remember that. I still want my *32 scalefactors!
Seriously I do want them.

There is room for future enhancements. I want to be sure we don't slow down the EA unit before adding more CPU intensive EA processing. The most power would come from a multiply+add in each EA unit but it is also the most costly in CPU processing although fairly easy to encode. Scale factors greater than *8 are not too CPU intensive in silicon but are challenging to encode. It's doable but may require giving up some options of (bd,An,Rn) like sign extended Rn word sizes and/or not allowing some suppression. The encoding wouldn't be pretty and 68k EA consistency would be ruined or the instruction size would grow. Add to that that the EA unit would be slower in fpga and I'd rather skip it until someone smarter than me tells how and if it should be done.

Quote from: Mrs Beanbag;721525

I didn't spot bit #3 of extension word... in the docs it is simply '0'!

It takes a little studying to figure that out. I'm glad that you were able to understand my documentation. The 68000PRM was a little confusing in regards to addressing modes. The powerful addressing modes of the 68k are what gives her so much power.

Quote from: Mrs Beanbag;721525

"Reserved" means Motorola might use them for something later... I guess they might come back and make a 68080 and then we'd be in trouble!

Motorola/Freescale killed the 68k so that it wouldn't compete with PPC and created the low end ColdFire for microcontrollers and simple embedded uses. In the meantime, ARM Holdings created ARM with Thumb 2, a variable length instruction ISA that is less powerful, less programmer friendly and less code dense than the 68020. It now sells at least 8 billion ARM processors per year and is used in 95% of smartphones, 90% of hard disk drives, 40% of digital televisions and set-top boxes, 15% of microcontrollers and 20% of mobile computers. Oh, Freescale pays license fees to ARM too. Maybe they were a little smarter than C= after all; they aren't bankrupt yet. Tech companies that stop innovating start dying. Freescale, Microsoft, Sony and Amiga Inc. are dying. Apple, Intel, IBM and ARM Holdings keep innovating.

Quote from: Mrs Beanbag;721525

Although personally I think we're getting ahead of ourselves here... make a fully pipelined 680x0 + accelerator first, and worry about extending the ISA later.

Compatible standards need to be in place before there are incompatible products. It also takes time to develop good standards and ISAs. Yes, making a fast 68k CPU is a higher priority but there are already several fpga processors that are quite capable. More than 1 means there is a possibility for incompatible enhancements.

I also try to have fun and innovate with the 68k community's help. There have to be some good ideas that are different from x86 and ARM.
