Author Topic: Die space for m68k on FPGA? (Read 5152 times)

Fats · « **Reply #14 on:** January 05, 2013, 02:50:04 PM »

Quote from: matthey;721296

DOSBox with 68k dynamic recompilation should be able to achieve 386 emulation speeds on a fast 68060 or FPGA 68k CPU. The bonus is that the Amiga can multitask at the same time, much like the advantage of ShapeShifter over a real 68k Macintosh.

Why not go for both a m68k and x86 CPU core at the same time on the FPGA?

matthey · « **Reply #15 on:** January 05, 2013, 03:44:16 PM »

Quote from: Fats;721327

Why not go for both a m68k and x86 CPU core at the same time on the FPGA?

A 2nd decoder (a larger one in the case of x86) would be needed, but it is possible for an FPGA CPU core to be flexible enough internally to handle most modern CPU instructions, addressing modes and functionality. A single task/process could only execute code for one CPU at a time because of encoding conflicts. Multiprocessing should be possible with care, though. It would be like emulating an Amiga with a bridgeboard all in one FPGA. Personally, I think a 68k and x86 together would be a waste. Enhance the 68k as I suggested and it would be able to do practically everything the 386 commonly does, but with general purpose registers, making emulation easier while being much easier to program. More interesting CPU combinations would be 68k+Z80 for emulating several game consoles and 68k+PPC for emulating PPC Amigas, although that would require a big FPGA and an efficient PPC core. Also, an enhanced 68k+68000 might be good for maximum compatibility without rebooting/configuring the FPGA.

psxphill · « **Reply #16 on:** January 05, 2013, 03:45:43 PM »

Quote from: freqmax;721299

This is very true for DOS software that bit-bangs the parallel port. So it's the same issue as with software-emulated Amigas. They can't deal with latency and propagation races properly.

It's OK as long as you have the hardware. The DOS support in 32-bit Windows, or Virtual PC or DOSBox if you're running 64-bit, is pretty good. What is lost in speed probably helps, as the software was designed to run on something 20 times slower.

As soon as you have to use a USB serial port/parallel port (not that I have come across a USB parallel port that copes with anything other than printing), the latency of USB really kills performance.

I have only used laptops for the last 12 years, but I know people who still use desktops with parallel ports that are able to run really old software. My old laptop has a parallel port and only runs 32-bit Windows anyway, so that sometimes gets used. But that's for practical reasons, and I've slowly been moving all of those over to intelligent USB devices. By moving the software onto a CPU on the USB device you can offload the time-critical code but still plug it into pretty much any modern computer.

There are probably some people who would have a use for it, but not as many as want to run Amiga software. The PC doesn't get people as passionate.

xyzzy · « **Reply #17 on:** January 05, 2013, 04:08:35 PM »

Quote from: Fats;721327

Why not go for both a m68k and x86 CPU core at the same time on the FPGA?

Better would be to add specific instructions to the 68k that help with emulation of other processors.

freqmax · « **Reply #18 on:** January 05, 2013, 06:12:51 PM »

The FPGA used in Replay is not likely to have the die space to handle 386 + m68k at the same time. And emulating a 386 on a 68060 on an FPGA is such an inefficient solution that I recommend: think again.
As for PPC, it has been discussed before. The size is just too big to be practical. It's way better to use the ASIC PPC until Moore's law makes it feasible.

psxphill · « **Reply #19 on:** January 05, 2013, 06:17:33 PM »

Quote from: freqmax;721343

As for PPC, it has been discussed before. The size is just too big to be practical. It's way better to use the ASIC PPC until Moore's law makes it feasible.

What makes it too big? The ISA itself shouldn't be. It would be tricky to match performance of a real chip & you'd probably have to leave a lot of complex features out. Most people would only want it for running warpup & powerup based software, it doesn't necessarily need to use the official kernels to do it either.

matthey · « **Reply #20 on:** January 05, 2013, 07:25:41 PM »

Quote from: psxphill;721344

What makes it too big? The ISA itself shouldn't be.

The PowerISA is large for a RISC ISA (although much of it is rarely used and can be trapped, at the cost of poor performance if it is used), and the caches need to be significantly larger than on the 68k for good performance. The code density of the PowerPC is about 1/2 that of an enhanced 68k CPU, so it needs 2x the instruction cache to hold the same amount of code, for example. ARM with Thumb 2 gave up a simple and clean RISC ISA and decoder to return to CISC-style "compressed" code much like the 68k (a little simpler to decode, but not as compressed or programmer friendly).

Quote from: psxphill;721344

It would be tricky to match performance of a real chip & you'd probably have to leave a lot of complex features out.

It would be nearly impossible to match the performance of the original Amiga PPC cards or the SAM 440 with a PPC core in an affordable FPGA. I think an enhanced 68k CPU core could come close though, if assembler-optimized code is used. This is feasible on the 68k, which is almost as easy to program as a high-level language and which programmers enjoy using, but it is almost impossible on RISC. The 68k's advantages are discounted way too much while its disadvantages are exaggerated. The 68060's performance proves it: it outperformed many early PPC processors. That's why Apple put code in MacOS 8.x that kept the 68060 from working, while MacOS 7.x on a 68060 worked great and blew away the PPC Macs of the time.

psxphill · « **Reply #21 on:** January 05, 2013, 08:36:43 PM »

Quote from: matthey;721353

It would be nearly impossible to match the performance of the original Amiga PPC cards or the SAM 440 with a PPC core in an affordable FPGA.

The slowest Phase 5 board was a PowerPC 603e 160, which had 16kb L1 cache & used the 32-bit ISA. I'm sure you could get close to that; a SAM 440, maybe not. But then a SAM 440 doesn't have AGA, so nothing is perfect.

It doesn't look to have a particularly more complex instruction set than the 68020 (once you factor in MMU & FPU). All instructions are 32-bit, which affects density. But it also simplifies fetching; reduced density is more about RAM usage than speed. It's better for performance if all instructions are the same length.

https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2/$file/6xx_pem.pdf

matthey · « **Reply #22 on:** January 05, 2013, 09:51:40 PM »

Quote from: psxphill;721361

The slowest Phase 5 board was a PowerPC 603e 160, which had 16kb L1 cache & used the 32-bit ISA. I'm sure you could get close to that; a SAM 440, maybe not. But then a SAM 440 doesn't have AGA, so nothing is perfect.

That sounds about right. A 100MHz superscalar enhanced 68k CPU in an FPGA would probably be like a 200-300MHz low-end PPC in performance on average. Some things, like a simple memory copy and optimized 68k code, would approach a low-clocked 440. The PPC has some areas where it's strong too, though.

Quote from: psxphill;721361

It doesn't look to have a particularly more complex instruction set than the 68020 (once you factor in MMU & FPU). All instructions are 32-bit, which affects density. But it also simplifies fetching; reduced density is more about RAM usage than speed. It's better for performance if all instructions are the same length.

The PPC instruction set is large but fairly easy to decode. Decoding is one of the slow points of the 68020+, but it's way better than the x86. Constant instruction length is good for performance, but so are small instructions that are simple to decode or allow parallel decoding. Small instructions let more of them fit in the instruction cache, which is much faster than main memory, and let more of them be fetched in the same amount of time. A bonus is that less system memory is needed, and the encoding leaves more flexibility for instructions, which results in an easier-to-program CPU.
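
As a rough sketch of the cache argument, the snippet below counts how many instructions' worth of code fit in an instruction cache for a fixed 4-byte encoding versus a denser one. The 16 KiB size is borrowed from the 603e figure quoted above, and the 2-byte average is only an assumption standing in for the "about 1/2 the code density" claim made earlier in the thread. It is not a measured value.

```python
def instructions_in_cache(cache_bytes, avg_instr_bytes):
    """Approximate number of instructions that fit in an instruction cache."""
    return int(cache_bytes // avg_instr_bytes)

CACHE_BYTES = 16 * 1024      # assumption: 16 KiB instruction cache
PPC_INSTR_BYTES = 4          # PowerPC instructions are a fixed 32 bits
M68K_AVG_INSTR_BYTES = 2.0   # assumption: same code takes half the bytes on the 68k

ppc_count = instructions_in_cache(CACHE_BYTES, PPC_INSTR_BYTES)
m68k_count = instructions_in_cache(CACHE_BYTES, M68K_AVG_INSTR_BYTES)
print(ppc_count, m68k_count)
```

Under those assumptions the denser encoding holds twice as much code in the same cache. How much that matters in practice depends on the working set and hit rates, which this sketch does not model.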

wawrzon · « **Reply #23 on:** January 05, 2013, 10:06:59 PM »

what does it matter to make assumptions about all that. you could be right you could be wrong. all that matters is what there is.

Hattig · « **Reply #24 on:** January 05, 2013, 10:52:56 PM »

I believe that you can buy FPGAs that have an on-board PowerPC core - that would seem to me to be the best solution to getting a PowerPC processor alongside the system implemented in the FPGA.

mongo · « **Reply #25 on:** January 05, 2013, 11:14:48 PM »

Quote from: Hattig;721371

I believe that you can buy FPGAs that have an on-board PowerPC core - that would seem to me to be the best solution to getting a PowerPC processor alongside the system implemented in the FPGA.

You can, but they're not cheap.

psxphill · « **Reply #26 on:** January 05, 2013, 11:56:13 PM »

Quote from: matthey;721368

Constant instruction length is good for performance but so are small instructions that are simple to decode or allow parallel decoding.

Fixed length 32 bit instructions (like PPC & MIPS) are the sweet spot for performance.

Thumb is mainly for when you only have 16 bit access to ram, if your ram is 32bit then Thumb is slower (although it will use less ram).

Variable length instructions are a pain to parallel decode, because you have to decode the first one to know where the second one is.
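
A minimal sketch of that dependency, using a made-up toy encoding (not the real 68k or x86 format): with fixed-width instructions every start offset is known up front, while with variable-length ones each start depends on having looked at the previous instruction.

```python
def fixed_boundaries(code, width=4):
    """Fixed-width ISA: every start offset is known without reading the code."""
    return list(range(0, len(code), width))

def variable_boundaries(code):
    """Toy variable-length ISA (hypothetical encoding): the top two bits of the
    first byte give the instruction length as 2, 4, 6 or 8 bytes."""
    offsets = []
    pc = 0
    while pc < len(code):
        offsets.append(pc)
        length = 2 * ((code[pc] >> 6) + 1)  # must read this instruction first
        pc += length                        # only now is the next start known
    return offsets

# Four toy instructions of 2, 4, 2 and 6 bytes.
stream = bytes([0x00, 0, 0x40, 0, 0, 0, 0x00, 0, 0x80, 0, 0, 0, 0, 0])
print(fixed_boundaries(bytes(16)))
print(variable_boundaries(stream))
```

The loop in `variable_boundaries` is inherently serial. Real decoders work around this with extra hardware, which is the cost being argued about here.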

matthey · « **Reply #27 on:** January 06, 2013, 01:01:42 AM »

Quote from: psxphill;721373

Fixed length 32 bit instructions (like PPC & MIPS) are the sweet spot for performance.

They're great when there are unlimited resources. The current trend is away from 32-bit fixed-length instructions for a CPU, though. There is a reason for this. Many knowledgeable engineers thought PPC, MIPS and the original ARM would destroy the x86 and 68k in performance, but they don't. I have tried to explain why. Thumb 2 is an attempt to take advantage of smaller instructions and improve code density.

Quote from: psxphill;721373

Thumb is mainly for when you only have 16 bit access to ram, if your ram is 32bit then Thumb is slower (although it will use less ram).

Most modern ARM code is Thumb 2. Yes, it is a little slower in theory but works well with limited resources.

Quote from: psxphill;721373

Variable length instructions are a pain to parallel decode, because you have to decode the first one to know where the second one is.

Parallel decoding isn't even always possible with variable length instructions. In the best case, the decoder would look at one number in the code and know the instruction length. The length of any instruction on the 68k can be determined by looking at the first 32 bits. That's pretty good. The variable length instructions are not that much of a problem in reality while they save cache and improve ease of programming.

psxphill · « **Reply #28 on:** January 06, 2013, 01:57:21 AM »

Quote from: matthey;721378

The length of any instruction on the 68k can be determined by looking at the first 32 bits. That's pretty good. The variable length instructions are not that much of a problem in reality while they save cache and improve ease of programming.

The whole point of parallel decodes is that you can decode the first and second at the exact same time. If you have to look at the first to see what the length is, then you've failed. You need fixed length, then you can split the instruction cache so that odd/even instructions can be accessed simultaneously.
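
The odd/even split can be sketched in software as below. This is only a model of the idea, assuming a fixed 4-byte instruction width. The function names are made up for illustration and nothing here describes a specific real cache design.

```python
WIDTH = 4  # bytes per instruction in a fixed 32-bit ISA

def split_banks(code):
    """Split a fixed-width instruction stream into even and odd banks."""
    words = [code[i:i + WIDTH] for i in range(0, len(code), WIDTH)]
    return words[0::2], words[1::2]

def fetch_pair(even_bank, odd_bank, pair_index):
    """Fetch instructions 2*pair_index and 2*pair_index + 1, one from each bank."""
    return even_bank[pair_index], odd_bank[pair_index]

code = bytes(range(16))  # four dummy 4-byte instructions
even, odd = split_banks(code)
first, second = fetch_pair(even, odd, 0)
print(first.hex(), second.hex())
```

Because the bank of instruction `i` is just `i & 1`, both halves of a pair can be addressed without inspecting either instruction, which is what makes the simultaneous access possible.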

Thumb2 sounds slower:

"The best options for armv7-a, thumb-2 and thumb-1 and overall:

The best is -O3 -funroll-loops -marm -march=armv5te -mtune=cortex-a8
The best armv7-a is -O3 -funroll-loops -marm -march=armv7-a -mtune=cortex-a8 at 95.2 % of overall best
The best Thumb-2 is -O3 -funroll-loops -mthumb -march=armv7-a -mtune=cortex-a8 at 88.7% of overall best
The best Thumb-1 is -O2 -mthumb -march=armv5te -mtune=cortex-a8 at 64.4% of overall best"

With PPC we can run powerup/warpup software, implementing arm is boring.

matthey · « **Reply #29 on:** January 06, 2013, 05:31:22 AM »

Quote from: psxphill;721380

The whole point of parallel decodes is that you can decode the first and second at the exact same time. If you have to look at the first to see what the length is, then you've failed. You need fixed length, then you can split the instruction cache so that odd/even instructions can be accessed simultaneously.

The superscalar 68060 averages better than 1 instruction per cycle. A good assembler programmer should be able to average about 2 instructions per cycle in some code. This means that the 68060 is able to decode in parallel with variable-length instructions. Short and simple instructions are the key. ARM with Thumb 2 also uses variable-length instructions (Thumb 1 used a 16-bit instruction mode only).

Quote from: psxphill;721380

Thumb2 sounds slower:

...
The best Thumb-2 is -O3 -funroll-loops -mthumb -march=armv7-a -mtune=cortex-a8 at 88.7% of overall best

...

I agree that Thumb 2 is a little slower. ARM Holdings claimed 15-25% slower, but I am guessing that did not consider that the code is in the cache more often. The figure above is more realistic and good enough that Thumb 2 is used most of the time. Thumb 2 is a real ISA that can stand on its own, unlike Thumb 1. Newer ARM processors will likely drop Thumb 1 support and maybe more.
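
For reference, the quoted benchmark percentages can be turned into relative slowdowns as below. The scores are the ones from the quoted list (percent of the overall best result). The short labels and the helper function are just for illustration.

```python
# Scores as a percentage of the overall best result, from the quoted benchmark.
scores = {
    "ARM armv5te (overall best)": 100.0,
    "ARM armv7-a": 95.2,
    "Thumb-2": 88.7,
    "Thumb-1": 64.4,
}

def slowdown_pct(score, baseline):
    """How much slower `score` is than `baseline`, in percent."""
    return (1.0 - score / baseline) * 100.0

thumb2_vs_best = slowdown_pct(scores["Thumb-2"], scores["ARM armv5te (overall best)"])
thumb2_vs_armv7 = slowdown_pct(scores["Thumb-2"], scores["ARM armv7-a"])
thumb1_vs_best = slowdown_pct(scores["Thumb-1"], scores["ARM armv5te (overall best)"])
print(round(thumb2_vs_best, 1), round(thumb2_vs_armv7, 1), round(thumb1_vs_best, 1))
```

So in that one benchmark Thumb-2 comes out roughly 11% behind the overall best and roughly 7% behind ARM-mode code built for the same armv7-a target, both below the 15-25% range mentioned above.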

Quote from: psxphill;721380

With PPC we can run powerup/warpup software, implementing arm is boring.

The FPGA Arcade has an ARM CPU, so there is no need to emulate one. PPC would be interesting, but FPGA PPC CPU performance would be lousy.
