You mean as in ASIC silicon? That would cost at least 40,000 USD, probably around 400,000 USD.
The function is not protected, as the patents have expired. Perhaps the instruction set opcodes are? But I doubt that too, as there exist many other projects in this area without legal problems.
I suspect the biggest speed boost would come from full pipelining. I would guess the RAM could be fast enough relative to the FPGA for a data cache not to be that important.
Motorola 68060 FPGA replacement module (idea)

It's waaaaaaaaaaaaay too hard to make the signals on the pins perfectly match a real 68060.
I forgot about RAM latency, but wouldn't RAM at 1333MHz and CL9 still be able to max out a 100MHz CPU?
then your speed drops like a rock

Confusing metaphor :crazy: When you drop a rock, it just goes faster and faster!
Ok, well, I'll take your word for it.
You know what, I don't even care about 1333 MHz RAM anymore; you can buy 8 MB chips of 16-bit SRAM:
http://uk.farnell.com/renesas/r1wv6416rbg-5si/sram-64mbit-3v-55ns-48fbga/dp/2068172
I guess that would be fast enough for an off-chip cache for an FPGA 68060 implementation, if not big enough to be the main RAM itself.
Maybe he will shoot this down for some other reason... but if I throw enough ideas at the wall maybe one of them will stick. Like a rock.
Would there be any advantage to using graphics memory in this application? (i.e. GDDR4/5 rather than DDR2/3)
My original idea was to provide a solution for those that want FPGA Arcade (MikeJ) with add on board and want to make use of a fast 68060 CPU but just can't get one.
So in principle, if we were to use SRAM as memory, and FPGA as CPU at 100-ish MHz, would we still need a cache in the CPU core?
I am looking at an adapter board which will fit in the pins of the 68060 daughter board and carry either a 680x0 or a much faster Virtex7 class FPGA running a soft CPU..
Fastest I ever heard about back when I studied RAM chips was 333 MHz.
I will admit an instruction cache; it can be simpler because you don't write to it. Also, even a plain 68020 has an instruction cache (albeit a small one).
Just trying to think in terms of "maximum impact/minimum effort". I'm thinking essentially 68020, but fully pipelined and fastest possible clock speed.
A quick look at the 68060 technical data suggests it needs around 16 kB of cache RAM. It should be possible to handle that within the FPGA's "BlockRAM" (as Xilinx calls it), so there is likely no need for extra RAM. Btw, don't forget the cache coherency issues, so that disk DMA and the CPU don't disagree on what is in memory.
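As a sanity check on that claim: assuming 18 Kbit block RAMs (the size used in many Xilinx parts of that era; an assumption, not from the thread), 16 kB of cache fits in a handful of them.

```python
import math

CACHE_BITS = 16 * 1024 * 8   # 16 kB of 68060-style cache, per the post above
BRAM_BITS = 18 * 1024        # assumed 18 Kbit Xilinx-style block RAM

blocks_needed = math.ceil(CACHE_BITS / BRAM_BITS)
print(blocks_needed, "block RAMs")  # 8 -- a small fraction of even a modest FPGA
```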
10 It had to be written, rewritten, rerewritten to be optimized, oops now there are a ton of bugz, oh crap this is getting really tedious I think I will take a vacation ok now where did I leave off at... oh yeah gotta fix all these bugz hold on if I rewrite it this other way it will work better when we add the MMU later lather rinse repeat goto 10 :D
http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules
It looks like 1066 MHz is the fastest standard I/O clock speed and 266 MHz for the memory clock speed, but the latencies are huge. This isn't a problem if you can prefetch and burst-fill your cache. You get 64 bits per transfer per module as well.
It doesn't sound like much, but compared to chip RAM or the memory in your 90s accelerator, it is pretty quick.
... designed by a 200th TechMage.
What do you mean with that? ;)
Just looking at the price of FPGAs. Can get a 550MHz Virtex 5 for just under £100, not bad. And it has >200k of block RAM!
Now I only have to learn VHDL...
I think he means 200th level TechMage.
@TCL
I thought the N050 only implemented write-through caches, which are much easier to implement than copyback.
They have excellent compatibility, and the slightly faster modern memory would make up for some of the speed deficit. I agree that at least write-through caches for both instructions and data are needed. Anyone saying otherwise should turn off their accelerator caches and experience 68000 performance all over again ;).
For randomly accessing memory your speed is 266Mhz / 16 = 16.625Mhz which is the same speed as the memory u already have on your Amiga accelerator card.
You only get good speed when reading/writing a bunch of bytes in a straight line. Even then it is really, really hard to achieve over 500 MT/s, since your memory controller must be designed by a 200th TechMage.
This is why cache inside the CPU is dramatically important.
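The arithmetic behind this can be sketched in a couple of lines; the 16-cycle cost per random access is the thread's rough figure, not a datasheet value.

```python
MEM_CLOCK_MHZ = 266.0            # DDR memory clock from the post above
CYCLES_PER_RANDOM_ACCESS = 16    # rough cost of activate + CL9 + precharge

# Random (uncached) access rate:
random_mhz = MEM_CLOCK_MHZ / CYCLES_PER_RANDOM_ACCESS
print(random_mhz, "M accesses/s")  # 16.625 -- roughly 90s-accelerator speed

# A burst fill amortizes that setup cost: an 8-beat burst on a 64-bit
# module delivers a whole cache line for one latency hit.
line_bytes = 8 * 64 // 8
print(line_bytes, "bytes per burst")  # 64
```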
I am curious why there is this idea of a shortage of 68060 chips? There are tens of thousands of these chips, both 50 MHz and 60 MHz (MC and XC versions), available from suppliers in China. These were used in the Northern Telecom call-center boards. There is a thread here about this. Just pull the chip with the heat sink and put it in your Amiga (or Replay) board.
eBay has a slew of these boards, for about half of what 68060s by themselves are selling for.
Every once in a while someone starts a thread asking why AmigaKit keeps selling brand new 030 and 020 accelerators when what so many people want is an 060 accelerator. The answer people post in forums is that there are no 060 chips available, or that they cost ridiculous amounts of money, so no 060 accelerators can be built at a profit. Or they say that since the cheap 060 chips are "from China" they can't be trusted, even though everyone who ever bought any of them was pleased with the results.
There is no shortage of 060s. But there is a shortage of 060 accelerator cards.
Why not just build an adapter with an ARM processor equipped with very fast
JIT interpretation?
Because x86 is much faster.
In another thread (http://www.amiga.org/forums/showthread.php?t=63775) the issue came up that there are too few fast Motorola 68060 (https://en.wikipedia.org/wiki/Motorola_68060) CPUs around. A solution could be to join a male socket and an FPGA on top of that, much like the 486 (https://en.wikipedia.org/wiki/Intel_80486_OverDrive) or Pentium (https://en.wikipedia.org/wiki/Pentium_OverDrive) OverDrive solutions for the x86.
The MC68060 (http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MC68060) datasheet (http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf) provides the PGA-206 pinout on page 356. The frequency span is 0 to 75 MHz. The power requirement (p328, p344) is 3.3 V +/- 5% @ 2 A, with 5 V-compatible I/O. There has never been any QFP variant on the commercial market?
So this is what the FPGA has to be able to work with. Some kind of onboard DC/DC circuit will be needed. The voltages of IVDD, EVDD, PVDD and CVDD are unclear, especially in a mixed 040/060 environment. So the question becomes: can a powerful enough FPGA that implements the 060 make do with 6.6 W? And will the mechanical size be within limits? Otherwise circuit board stacking may be needed.
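The numbers above multiply out as follows (simple arithmetic on the datasheet figures, nothing more):

```python
V_NOM = 3.3   # volts, from the datasheet
TOL = 0.05    # +/- 5%
I_MAX = 2.0   # amps

v_min, v_max = V_NOM * (1 - TOL), V_NOM * (1 + TOL)
p_max = V_NOM * I_MAX

print(round(v_min, 3), "-", round(v_max, 3), "V window")  # 3.135 - 3.465 V
print(round(p_max, 1), "W budget for the whole module")   # 6.6 W
```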
Btw, with some additional PGA-114 (020) and PGA-132 (030) to PGA-206 adapters it could be used as an upgrade option for those CPUs too.
OT Found while searching:
a68k.de - Overclocking Amiga.pdf (http://www.a68k.de/xtechwb/filez/AMIGA/Hacks+Reps/Overclocking_Amiga.pdf)
+more expensive
+more power hungry
+...
+ THE ENEMY
Why not just build an adapter with an X86 processor equipped with very fast
JIT interpretation?
THEY ARE MORE POWERFUL

Who cares?
x86 has performance, but it has no charm. Ask yourself which you would rather go for dinner with.
Actually, I would prefer a super fast 040 over an 060 any day. The reason? From a programmer's standpoint, the 060 requires several workarounds for the superscalar and branch prediction caches. When I did the code for the Mac emulation, I ended up turning off half of the 060's features because code would blow up due to the Mac OS and many different Mac apps that were not compatible with a fully running 060. You have to deliberately write your code to be 060 compatible, and since the 060 didn't exist when the Mac OS was written (all the way through OS 8.x), the OS didn't support it. How many 060 boards did you see for the Mac? None that I am aware of... they went from the 040 to the PPC.
This is all an interesting idea. I don't know what the state of available 68040 FPGA cores is, but a cursory search did turn up a ColdFire core. Might another option be to use a ColdFire FPGA core and modify its microcode to get around the incompatibilities? Weren't there just a handful of unimplemented instructions and a couple of instructions that behaved differently?
A practical issue is that, due to the through-hole (https://en.wikipedia.org/wiki/Through-hole_technology) nature of the 68060 PGA socket, any PCB will be occupied by solder pads from the pins. A solution is to put a double-row straight-pin 1.27 mm header around the PCB edges, so that another PCB can be mounted on top and the space used for the FPGA, DC-DC and EEPROM.
I would just like to clarify that MacOS has problems with the 060 because MacOS is crap. MacOS is so bad that its creator threw it in the garbage and switched to a totally completely different OS.
On the Amiga, the 68060 is totally compatible with all normal software and causes no problems.
In order to make a program incompatible with the 060 on the Amiga, one would have to try really hard to do it on purpose, or do something that is plainly illegal, or be banging the MMU or performing some weird esoteric function that no normal programmer would ever need to do.
On the Amiga we have MMU.library so nobody needs to bang the MMU.
I have been writing Amiga software since 1985 and none of my C or asm programs has ever failed on the 060.
Microsoft BASIC fails on the 060 because Microsoft BASIC is a pile of garbage that does wildly illegal things. M$ BASIC won't even work on 020.
Well, there are quite a few Amiga programs, including several of my own that all follow 100% legal programming practices (according to common sense and the RKMs), that will not run on an 060 with superscalar and/or branch caching enabled.
I don't recall all of the reasons behind the issues.
I should go look at the mmu.library replacement that we made for EMPLANT and FUSION... I know I commented some things there.
I know that self-modifying code is definitely one of the things that causes a problem when one of the cached instructions in the pipeline has been modified (like a branch table). Yes, I consider self-modifying code 100% legal. :) You are supposed to flush the caches (or turn them off) with self-modifying code, but when you do that you are then running at sub-030 speeds.
The 060 really only adds dual instruction pipelining and a 4-way cache. A higher speed (100MHz+) 040 core would probably be better in the long run, especially if it handled floating point without completely stalling the core like the 060 does.
It might be a good starting point. [ColdFire] Differences are:
1. No DBcc
2. No bitwise rotation (rol, ror)
3. No bitfield operations
4. Multiply instructions don't set flags. From the Coldfire manual:
CCR[V] is always cleared by MULS/U, unlike the 68K family processors
ColdFire also has a few extra instructions (some of which would be quite useful, such as saturate and multiply-accumulate).
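Point 4 is the subtle one for existing 68k code. A toy model of the difference (a sketch, not a full CCR emulation; the function name is made up):

```python
def muls_long(a, b, coldfire=False):
    """Sketch of how MULS.L affects CCR[V] on 68020+ vs ColdFire.

    68020-family MULS.L sets V when the signed product does not fit
    in 32 bits; the ColdFire manual says MULS/U always clears V.
    """
    product = a * b                    # Python ints never overflow
    fits = -2**31 <= product < 2**31
    result = product & 0xFFFFFFFF      # the CPU keeps the low 32 bits
    v_flag = (not fits) and not coldfire
    return result, v_flag

_, v68k = muls_long(0x10000, 0x10000)                 # 2^32 overflows: V set
_, vcf = muls_long(0x10000, 0x10000, coldfire=True)   # ColdFire: V clear
print(v68k, vcf)  # True False
```

So any 68k code that branches on V after a multiply would silently misbehave on an unmodified ColdFire core.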
The experienced users that have this much hardware/software/firmware knowledge generally do not work for free. So, with an extremely limited market, there is really no desire to work on something where you won't at least recoup your time investment.
actually someone is actively working on an a600 fpga accelerator:
http://www.natami.net/knowledge.php?b=6&note=32232&x=7
http://www.a1k.org/forum/showthread.php?p=589135#post589135
Developers may want to make the most compatible solution; others the fastest with the most bells and whistles, which of course means you lose the common starting point.
Some have different coding styles, or just use different schematic CAD. It might be more fun to make something new than to integrate with existing creations. Some stuff just requires a heavy start, like Kickstart+Workbench, and thus requires dedicated work like that undertaken by AROS-m68k.
etc..
There are reasons why efforts diverge.
He has his own site: http://www.majsta.com/
Seems several FPGAs had to give their lives to that project due to soldering technique, but he seems on track now.
Sadly even a 75MHz 060 is really slow compared to modern Intel processors, so the idea of a dedicated Intel CPU adapter is probably the fastest and least expensive option.
Indeed, and I've been very excited about that. It's basically what I imagine for this discussion, only with a 68000 plug on the bottom rather than 040/060. It's not exactly shaped like a 68000; it has level shifting to be 5V-safe, and has power and memory onboard. But very much the same idea. He's working with the TG68, which has had some issues to work out to fit onto a standard 68000 bus. I'd really like to see the Suska 68000 code in there instead, as I think it would more readily fit the standard bus than the TG68. (Though I understand that further work on the TG68 core is improving that as well, in addition to enhancing it to 020 compatibility.)
I thought everything you needed to interface that x86 pain is in the datasheets? Anyway, perhaps an x86 microcontroller could do the job. But I still see that solution as flawed.
It also adds another dependency on a chip to source. With a generic HDL source you can just pick the nearest powerful-enough FPGA to do the job, and only have one big chip to deal with.
Would the x86 doing nothing more than emulating a 68K be higher performance than the FPGA? That's possible.
Lots more chips to source, route, solder, and debug. I prefer just one FPGA and be done with it.
The datasheets will talk about pinouts, such as connecting the CPU to the PCH (north/southbridge), PCI Express, PCI, etc. They won't talk about how to make a new chip that hooks onto the PCH bus.
I think it's unrealistic to do this in an 68060 replacement module, but an A1200 accelerator design would be awesome.
You can get one of Xilinx's Zynq-7000 devices, which is basically an FPGA with a dual-core ARM Cortex-A9 all in one chip.
http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/silicon-devices/index.htm
or an 030slot board for big box amigas. the glueboard could provide both variants.
edit - these are way too expensive! You are still better off with a $50 x86 CPU.
The XC7Z010 is $63.75 in single quantities.
TobiFlex had the TG68 core running on an A500 about 3 years ago.
http://www.a1k.org/forum/showthread.php?t=20223
Yes. Some glue logic (Mach or some type of small FPGA) and probably a bit of dual-ported RAM would make a great 680x0 emulator. The performance could be quite impressive even with an older x86 CPU. The x86 CPU would be not much more than a state machine and floating point processor. This is a project that makes sense to me... and since I have written a 68040 core in x86 assembly, I could probably lend a hand. :)

Hmmm... That's interesting! :)
Wow, bloodline's right, you have a crucial piece of the puzzle.
There, of course, would still be a lot of work designing the hardware.
But a super fast '040 sounds ideal.
So which socket do we aim for?
The dip or the square '040/'060 type?
It might be easier to design a processor card, but then we'd be limited to A3000s and A4000.
Indeed!
Though I personally think that what might be more fun is to have the x86 interface directly with an FPGA large enough to take the MiniMig core, and bring out the Amiga-compatible I/O (as the MiniMig does) ;)
Sounds even better perhaps, an x86 CPU module for FPGA Arcade?? There would be no doubt about the interface, and the original Amigas might stay what they are, which is what I'm fine with.
@ChaosLord, How can you be absolutely sure there's no 1333 MHz capable RAM ..? ;)
(but signal integrity will be a pain)
DDR = Double Data Rate. So a 1333 MHz stick is running at half that: 666.5 MHz. However, as it's double data rate RAM, it can transfer twice as much data per I/O bus clock than the actual I/O frequency, so in effect it works out at 1333 MHz.
So, to break it down: 1333 MHz is the data transfer rate; the I/O clock is 666.5 MHz.
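The naming arithmetic described above, written out (following the poster's 666.5 MHz figure):

```python
IO_CLOCK_MHZ = 666.5        # I/O bus clock, per the post above
TRANSFERS_PER_CLOCK = 2     # "double data rate": both clock edges

transfer_rate = IO_CLOCK_MHZ * TRANSFERS_PER_CLOCK
print(transfer_rate, "MT/s")  # 1333.0 -- the number on the module label

# With a 64-bit module that is the peak sequential bandwidth:
print(transfer_rate * 64 / 8, "MB/s peak")  # 10664.0
```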
And x86 needs lots of peripherals to run.
What about AMD's Geode line of x86 SoC's?
http://en.wikipedia.org/wiki/Geode_(processor)
The Geode LX in particular would be perfect imo!
http://en.wikipedia.org/wiki/Geode_(processor)#Geode_LX
There are more powerful processors (including processors with more than one core, which could aid in hardware emulation), but the XP-M based versions of the Geode aren't bad and might be powerful enough.
Sounds even better perhaps, an x86 CPU module for FPGA Arcade?? There would be no doubt about the interface, and the original Amigas might stay what they are, which is what I'm fine with.
If it's just a companion to an FPGA and providing FPU support, it should be grunty enough. Could always use the Athlon-based NX series if they fit into the required power envelope?
Can the Geode be directly interfaced via level converters to the 68060 socket?
(it seems the only self-contained one so far that won't cause a serious circuit mess)
Geode does not have a 68060 bus in its pinout, so no.
I saw a PCI bus on the Geode, so do an FPGA PCI-to-68060 bridge. If the Geode's PCI bus is 5V, then you'll need level shifters between FPGA and Geode, as well as perhaps between FPGA and the 680x0 socket.
I'd rather use an FPGA for bus translation.
Going through the PCI bus would be seriously slow.
What I mentioned did use the FPGA for bus translation, PCI bus to 680x0 bus. That's what a bridge does. (Sometimes bridges sit between two of the same bus as well, as in PCI to PCI bridge which helps give more slots total than a single bus can provide)
As someone mentioned the Geode LX, have a look at the datasheet
http://wiki.laptop.org/images/a/a1/Lx_databook.pdf
Page 21 has a diagram showing the pin groupings, basically a schematic symbol. If not PCI, what else would you connect to?
I see that the PCI bus in it is 3.3V signalling, so good there.
After 2 hours, here it is: MC68010 in FPGA. So now I'm convinced, are you?
Seems http://www.majsta.com/ has come slightly further:
On 4 January it had trouble booting. On 9 January the FPGA seems to emulate a 68010.
So all you need is a decelerator board? Let's focus on something else.
i'm happy to see his thing booting at all
i don't understand the slowness
Endian is not an issue. You can swap with the FPGA. :)
My 68040 core handles everything without needing the endian reversed, but I am sure it would be significantly faster without having to do that.
The only thing I don't do in my code is instruction cycle counting. That could be done, but I never bothered. The FPGA could be used to trigger an event to denote the end of the instruction cycle (where a process loop just waited for this to occur). So, based on the speed of your x86 CPU, you could reliably have cycle-exact timing at a speed limited by your fastest instruction (nop). I know my Mac emulation has no JIT type of stuff, is 100% assembly, and is frighteningly fast on modern PC hardware. I will have to test it on my Sandy Bridge setup to see how fast of a 040 Mac it is. :)
My 68040 core handles everything without needing the endian reversed, but I am sure it would be significantly faster without having to do that.
supposing one does a movem.w (SP)+,D0-D7 for instance... urgh
I noticed his website has been hacked. I hope this doesn't hinder his progress.
http://www.majsta.com/
Doesn't movem.w (SP)+ add 4 to the stack pointer for each write? At least on some processors I'm sure the stack pointer gets 4-byte aligned.
I know for a fact you can do the byte swap/word swap with the FPGA in real time, but depending on how fast the x86 is, it could slow down the operation of the 680x0 emulation.
The programmer's reference manual doesn't say so. It says "the address is incremented by the operand length (2 or 4)".
Ok. It's byte operations that affect A7 by 2, but everything else by 1.
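That rule can be written down as a tiny helper (a sketch; the function name is made up), which also answers the `movem.w (SP)+` question above: word transfers step the pointer by 2 each, and only byte accesses through A7 get the special alignment bump.

```python
def postincrement(size, reg):
    """Increment applied to An by (An)+ addressing (68000 family rule).

    Byte operations normally add 1, but A7 (the stack pointer) is kept
    word-aligned, so a byte access through (A7)+ adds 2 instead.
    Word and longword accesses add 2 and 4 as usual.
    """
    step = {"b": 1, "w": 2, "l": 4}[size]
    if size == "b" and reg == "A7":
        step = 2  # keep the stack pointer even
    return step

assert postincrement("b", "A0") == 1
assert postincrement("b", "A7") == 2  # the special case discussed above
assert postincrement("w", "A7") == 2  # movem.w pops 2 bytes per register
assert postincrement("l", "A7") == 4
```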
How about this for a crazy idea, an accelerator with an Arm CPU and an FPGA, the FPGA can function as a 68k CPU if set up as such, so it could run like the PPC accelerator boards. BUT you install AROS for ARM ROM chips and use the Arm as the main CPU, and allow the FPGA to be reconfigured by the Arm chip, so then you could develop your 68k core "live", and install updates through software.
+1
How do you know if you need to swap or not ? For example a memory copy function that uses 32bit transfers but may be copying strings that may not be byte swapped ?
greets,
Staf.
There are several ways to handle this. One requires that the FPGA follows the instruction stream, so as instructions are decoded the data bus can be swapped depending on the incoming instruction. The other way is by having the x86 side (during decoding) change the bus interface.
i dont get it, but sounds like another hybrid idea, aros arm system with 68k apps. imho anything like a warpos solution is a waste of time. we shouldnt create another split/branch with the need for dedicated binaries, and we shouldnt follow a feature creep strategy. lets have a simple 68k accel, as simple as it gets, no strange ideas. lets treat 68k as the virtual common platform/denominator, then we will maybe have a chance to actually achieve something one day.

But we already have AROS for Arm. We could make an Arm board for the A1200 and install AROS ROMs in it. The expertise exists for that, I believe. We wouldn't need an emulation layer or any fancy bus tricks. Dual-CPU PPC accelerators already exist, so we have experience from that.
lets start discussing weird complicated ideas and we can have a fun thread that will follow in natami footsteps, to nowhere.
But we already have AROS for Arm.

Sort of. Actually hosted. A native version is being worked on, especially for the Pi.
Dual CPU PPC accelerators already exist so we have experience from that.

we, meaning who? the documentation of the ppc boards is closed source, owned by dce germany; its outdated, and it is not going to be given to us except for multiple tens of thousands of euros, as has been revealed. also there is no one who would realistically build such a ppc board. jens schoenfeld outright refuses to have anything to do with the ppc architecture, which he considers unreliable. such a hybrid board is very complicated and expensive in fabrication: many layers, bga, and it requires special software (warpos-like) which is even more complicated. besides, an approach like that already exists with ultimateppc. lets see what will come out of it.
Well, I don't want to saw my case up, for one thing... I'm not suggesting using PPC, just pointing out that dual CPU has been made to work. I've never owned one, though, so I can't say how well it works.

i have one. i always refused one but got it at the beginning of this millennium, and can confirm that its nothing great. the best part is the fast scsi controller. i can dispose of the ppc, which can only be taken advantage of by specially precompiled code. all the usual (68k) stuff runs as usual on the 68k processor at its usual speed. so its just okay for what it should be.
Accelerators have an FPGA on board anyway to handle various bus signals; it could just be a case of replacing it with a bigger one, plus an interface to allow the firmware to be updated from software. An FPGA accelerator basically works, but there is a barrier to community development of the core(s). Plus people could create custom cores, which could produce some interesting projects. I'd like to develop my own core but I don't have the means to produce hardware.
Arm chip need not be expensive.
im sure its not just as simple as glueing another fpga to an existing design, if there was one at disposal to start with.
look how much effort has been put into the minimig, fpgaarcade or natami hardware. there are several fpga aware people around the scene, yet besides those little else is available to us.
Hardware would need a redesign but I'm sure it wouldn't be beyond anybody who has made an accelerator before.
But these all try to emulate a complete Amiga system.
With big FPGA+small Arm+Flash ROM (flashable in software) development could be done in the community rather than in isolated groups, that's essentially my "crazy idea" anyway. The hardware could be configured various ways, 68k core in the FPGA, software 68k emulation on the Arm, potentially PPC in the FPGA as well I guess, or anything else you could think of. But could run AROS "out of the box" with nothing in the FPGA but a bus interface.
let us stay realistic rather than pipe dreaming

You're demanding a lot of me here; I don't know if I can manage it.
so we have a well encapsulated non-AGA chipset core at our disposal

I don't even need that, I have a real AGA Amiga.
What we DON'T have is an FPGA accelerator, because we don't have a 68k core. One reason why not is that it means developing both hardware and software at once, while either on its own is a task in itself, so we get stuck at an impasse. No core? FPGA accelerator = useless. No FPGA accelerator? Core = useless.
but I wouldn't go to too much expense to put an x86 in my classic
Anything that isn't 68k will most probably require some FPGA glue.

Even a 68k needs an asynchronous bus interface to run at anything other than the motherboard clock speed. The ACA1230 etc. use an FPGA for this, as well as for the memory controller (I believe).
SPI is too slow; you need something that can handle about 15 MBytes per second for basic use and about 30 MBytes for ZIII to work.
Anything that isn't 68k will most probably require some FPGA glue. That means the minimum configuration is FPGA + EEPROM (core boot image). Adding anything else adds to the BGA soldering hell.
I agree with @psxphill: there already exist the TG68, the OpenCores 68k, and the FPGA Arcade 68030 softcore hybrid, which is essentially a TG68 modded to 68020, modded to 68030.
I think that a CPU core in FPGA is fast enough to saturate the computer bus in Amiga.
So the least amount of hardware mess and using existing software availability is an FPGA + EEPROM with perhaps SRAM for cache.
KISS..
The other issue is getting a fast link between the CPU and the Amiga bus. Most SoCs have limited I/O capability. GPIO isn't fast enough on any I've seen, and local buses are long gone. PCIe is on some of them, but that's not a cheap interface.
i wouldnt mind people not taking solutions as amiga enough. ppc isnt amiga as well. people will complain, then shut up and want one when they see the results. anything that runs the available and potentially possible 68k codebase, but faster, is fine, whatever the tech behind it.
My issue with the FPGA/ARM combo chips is the price.
But then it goes in the cache, and there are any number of things that can go wrong. Such as if you read an address as a longword for some reason, and then go on to process it as part of a string, it will have been byte swapped into the cache. Either you turn the cache off and destroy any advantage of using such a fast CPU, or you have serious difficulties ensuring coherence.
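A toy illustration of the hazard described above, using Python's `struct` in place of the FPGA swapper: a big-endian longword swapped into a little-endian cache comes back reversed when the same bytes are later consumed as a string.

```python
import struct

RAM = bytearray(b"ABCD")  # big-endian 68k memory holding the string "ABCD"

# Read it as a 32-bit longword and byte-swap it into a little-endian
# cache, per the scheme discussed above:
longword = struct.unpack(">I", RAM)[0]   # 0x41424344
cached = struct.pack("<I", longword)     # cache line now holds b"DCBA"

# Later the same cache line is consumed as a string:
print(bytes(RAM), cached)  # b'ABCD' b'DCBA' -- the string came back reversed
```

Which is exactly why the swap has to be tied to the access size (or the cache kept coherent some other way), not applied blindly to every transfer.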
bloodline, I'm seriously not targeting you or anything, just putting my thought process out there for debate, hoping you or someone else will see a flaw in my logic.

I only suggested SPI because it is super simple and cheap to implement, and I've used it in the past for some pretty fast transfers with ARM microcontrollers... Also it was developed by Motorola, so it might keep a few purists in the scene happy...
I read back through it and I was afraid you'd take it the wrong way.
I just haven't had the chance to discuss this CPU stuff with anyone, so I'm enjoying bouncing ideas.
If the FPGA performed the swap instead of doing it in software, I would hope that it would be cached!
But I still find one FPGA-done the least amount of fuss solution. And the m68k op codes to be way nicer to deal with in contrast to x86 ones.
-edit- I notice that you suggested USB too, in your post. That's certainly a great idea, though USB has loads of cool features that one would never need for this project, like hot plugging etc... Not sure if the latency of USB would be an issue?
Xilinx is a better option as their ISE has better Linux & BSD support during development than Altera.
USB has a minimum latency of 1 ms, which makes round trips 2 ms. Compare this with a slow A500 bus that has a round trip of 280 ns, i.e. about 3570 times faster!
Round trip matters..
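To sanity-check those figures (my own arithmetic; the 1 ms and 280 ns numbers are from the post above):

```python
# USB schedules full-speed transfers in 1 ms frames, so the minimum
# one-way latency quoted above is 1 ms and a round trip is ~2 ms.
usb_one_way_s = 1e-3
usb_round_trip_s = 2 * usb_one_way_s          # 2 ms

# A slow A500 bus round trip, per the post: ~280 ns.
amiga_round_trip_s = 280e-9

# The "3570 times faster" figure compares the 1 ms USB latency
# against the 280 ns Amiga round trip:
ratio = usb_one_way_s / amiga_round_trip_s
print(round(ratio))                            # 3571
```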
As for choice of processor if direct FPGA implementation is not used I think ARM is the better choice as it is more efficient, more suitable to single board solutions unlike x86, can switch endianness etc.
then i would divide the approach in two parallel tracks, as there are supporters of either an asic cpu with fpga glue or pure fpga anyway. fpga may give us a simpler, purer solution, while arm or x86 will provide raw power. ok?
@Heiroglyph
+1 on all points
+ i think the approach of looking at it from the 68k emulation perspective, treating the amiga as an expansion card providing an interface to the original chipset and the original interfaces, is exactly right, since this is what it in fact is, even if you use your usual 68k accelerator today.
what's important is to provide some expansion possibility, best not bound to the amiga itself, for instance pci. who has a fast cpu wants fast rtg as well.
By using the PGA-208 68060 socket one can use it on FPGA Arcade and A4000.
I also had the idea a while back also to approach the problem from the opposite end... make an Amiga graphics card (preferably AGA, for me) for PCs. The PC would then only need to emulate the 68k.
I think if we've got some fast CPU running 68k emulation or otherwise, it would be a shame not to make its raw power available to the user in some way.
I think an A1200 accelerator is worth doing with just an FPGA, some flash and some ram.
Making a 68060 socket compatible version might be useful for a minority, but I'm not convinced it's going to be very useful for the FPGA Arcade. It doesn't need a physical 68060 & it has an FPGA waiting for code.
@ Heiroglyph
I'm concerned that USB (any flavour) will have too high a latency to be used for a CPU/chipset bus, so I'm gonna stick with my original idea of SPI and see if I can make the numbers add up :)
That's cool with me; it would be way simpler if possible. I've seen the opposite problem with SPI in my thought experiments: low latency but low throughput.
Maybe I'm shooting for too much memory speed, since I'm trying to match the best numbers I've seen. Lower bandwidth might be acceptable.
Here is my thinking:
The PAL Amiga 500 has a CPU speed of 7.09 MHz, and it only accesses the RAM/chipset on every other cycle (effectively reducing the frequency to 3.545 MHz). With a bus width of 16 bits, that means the highest data rate for the CPU/chipset bus would be 6.76 MiB per second.
An SPI bus with a frequency of 56.7 MHz can match that data rate, and I've used 80 MHz on an SPI bus with an SD card before, so bandwidth (at least for OCS/ECS) should be totally possible with SPI... Latency I guess will depend on how well the protocol is designed, but should be low :)
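The post's arithmetic can be reproduced like this (my own sketch; the 6.76 figure works out if it is read as a binary megabyte, MiB):

```python
# PAL A500 figures from the post above.
cpu_hz = 7.09e6              # 7.09 MHz CPU clock
bus_width_bytes = 2          # 16-bit bus
effective_hz = cpu_hz / 2    # chipset/RAM accessed only every other cycle

peak_bytes_per_s = effective_hz * bus_width_bytes
print(peak_bytes_per_s / (1024 * 1024))   # ~6.76 MiB/s

# A 1-bit serial SPI link must clock 8x the byte rate to keep up.
spi_hz_needed = peak_bytes_per_s * 8
print(spi_hz_needed / 1e6)                # ~56.7 MHz
```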
Inside Out board made with Minimig-AGA, once Yaqube and friends have that finished up.
I never heard of the Inside Out board; it seems to be a complete Amiga, CPU and all?
I've been trying to find a good way to interface the wave of super cheap ARM's that are 800-1200Mhz and loaded with peripherals.
It might take a combo chip, but then we're back to $1000 CPU cards. :(
Don't really need the CPU on board. Just the graphics + sound.
But we've also got AGA machines with faster speeds and 32-bit width. All my numbers were based on the worst case A4000.
I fear you want to run before you can walk ;)
I figure if we can use it on a 4000, the others are a piece of cake.
And the FPGA on the base board isn't large enough to implement a 68060 properly.
I think we're having trouble understanding what you're suggesting, because you can't just convert a little endian processor to big endian using an FPGA.
Has that been tried? I thought it was more that nobody had implemented the FPU & MMU.
Could it be possible to implement a CPU and GPU in the FPGA? Something like ... Cirrus Logic GD5446 or S3 ViRGE?
Sure you can. In fact, the FPGA would only swap certain instructions where this is required.
The swap is only required because the byte order is backwards (Endian issue). So, if you fetch a long word from memory using something like mov.l $12345678,d0 that memory location will appear backwards to an x86 and would have to be swapped. The swap occurs during the opcode emulation, and using some hardware means to perform the swap (like the bus wired backwards momentarily) would eliminate having to do it in software.
The problem with JIT is that it is not cycle exact, so it breaks a LOT of programs. This is where the FPGA would come in handy as it could be used to throttle the instruction cycle speed so it is correct for a given desired performance level.
PCI(e) could be very good, an adapter that essentially converts the Amiga 1200 chipset into a graphics/sound card, and an accelerator that is essentially a tiny motherboard.
Personally I don't see the motivation for trying to squeeze all the data down a serial interface of any kind, when we have naturally parallel devices at both ends.
I hope you mean move.l #$12345678,D0. But look at Psxphill's example. Supposing you do
move.l (A0),D0
move.b (A0),D1
which byte of (A0) ends up in D1? Given that (A0) is now cached.
Let A0 point to an address that contains $12345678. FPGA swaps it to $78563412. Second line then reads $78 instead of $12!
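A tiny Python model of that hazard (my illustration only; "swap on cache fill" stands in for whatever byte-swapping scheme the FPGA would perform):

```python
import struct

# Guest (68k) memory: the big-endian longword $12345678.
guest_mem = bytearray(struct.pack(">I", 0x12345678))

# The naive scheme fills the host-side data cache with the longword
# byte-swapped into little-endian order.
value = struct.unpack(">I", guest_mem)[0]
cache = bytearray(struct.pack("<I", value))

# move.b (A0),D1 -- a byte read of the same address, served from the
# cache, now returns the wrong byte:
print(hex(guest_mem[0]))   # 0x12  <- what the 68k program expects
print(hex(cache[0]))       # 0x78  <- what the swapped cache hands back
```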
Emulators can "lay out" the emulated memory in whatever way they like as long as the emulation makes it look "normal" to the emulated machine.
Since it is A0 itself that is byte swapped, the correct value for the lower byte will be returned.
What do you mean, A0 is byte swapped? How do you byte swap an address register? What address does it point to, the most significant byte of our longword, or the least significant byte? It can't point to two different addresses.
No, because the BYTE value of $78563412 is $12 - just as it would be with a software only op-code interpreter.
But the number is not $78563412. The number is $12345678. The byte value of that is $78.
I think Jim Drew is suggesting that the FPGA only swaps the bus when the CPU makes a request larger than a byte.
Supposing we have a longword access at 2(a0), then what?
I missed this post - That is correct.
But the CPU is not making any request on the bus if the data is in the cache.
We've recently seen the Prometheus PCI open sourced so there is a ZorroIII to PCI interface that works and only needs 1-2 CPLDs. It could be even simpler since there aren't multiple PCI devices, it would be a target not a host and you wouldn't need to multiplex the address and data lines.
2 is added to the value of A0 and swapped. This is no different than how we do it with software emulation. An FPGA can certainly do anything that can be done in software. Instead of the FPGA handling the entire CPU core, it would be handling the memory bus accesses and off-loading the CPU emulation itself to the primary (x86) CPU.
But then instead of getting "cdef" as the data, it gets "bahg" interpreted as "ghab" - completely wrong!
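The misaligned case is easy to demonstrate (again just my model of the argument, using printable bytes so the garbling is visible):

```python
import struct

# Guest big-endian memory: the bytes 'a'..'h'.
guest = b"abcdefgh"

# Host copy where each *aligned* 32-bit word was byte-swapped on load.
host = b"".join(
    struct.pack("<I", struct.unpack(">I", guest[i:i + 4])[0])
    for i in range(0, len(guest), 4)
)
print(host)        # b'dcbahgfe'

# A longword fetch at offset 2 should see 'cdef'...
correct = guest[2:6]
# ...but the swapped copy yields 'bahg', which un-swaps to 'ghab':
fetched = host[2:6]
unswapped = struct.pack(">I", struct.unpack("<I", fetched)[0])
print(correct, fetched, unswapped)   # b'cdef' b'bahg' b'ghab'
```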
i would design it, though, treating the amiga like a pci device mounted on a bus among possible others, e.g. a pci gfx card. of course there is rather no need to have dma between the amiga and a pci gfx card; it's enough to have it between the host cpu board and pci. on the other hand michael boehmer (e3b) has improved the prometheus firmware to enable amiga-pci dma, but this is closed source due to his agreement with the prometheus designers.
on the subject of openpci or alternative standards, i would propose to coordinate the effort with the aros68k maintainers. the aros team has considered and rejected openpci as its standard for various reasons (license, free availability, documentation). it provides a pci hidd based partly on netbsd, as far as i know, and in parallel a prometheus.library.
I guess you guys have to decide if you want 100% compatibility or not. If you do, you absolutely must have a cycle exact emulation. There are quite a few programs that require it. You could also deliberately set emulation thresholds. For example, you could set the speed/emulation type to be Amiga 500 (68000), Amiga 3000 (16MHz or 25MHz 030), Amiga 4000 (25MHz 68040), etc. This way you could run those euro demos in Amiga 500 mode that won't work on anything else. :)
speaking from a user perspective, the idea of an accelerator absolutely contradicts anything being cycle exact, am i right? neither is an a1200 cycle exact to an a500, nor an amiga with any accel to the same device as such. i think it's self-explanatory that we have to sacrifice cycle exactness. and as an owner of practically only accelerated amigas, i'd say this is a very good deal.
So if those are too slow, then what about PCI?
PCI is faster than the Amiga bus both in clock and throughput and has the same bus width. It should just be a matter of timing and translation.
I thought the PCI hidd was reasonably close to or based on OpenPCI? I seem to remember seeing headers from openpci in there a long time ago.
I've recently seen the PLX PCI9054, which converts the 060 bus to PCI and doesn't need a CPLD/FPGA...
ermmm.. what? where?
I'm really surprised nobody hooked one up to make a PCI backplane though.
@wawrzon, Any application that has failed for you because of accelerators ?
For buses in general:
* Latency
* Capacity (Mbit/s)
* Electrical compatibility
* Protocol conversion
So I think PCI is doable but don't forget that translation between PCI and Zorro may introduce bottlenecks. But why introduce any bus at all between the CPU-in-FPGA and the CPU-socket? KISS..
why nobody?
http://www.powerphenix.com/ctpci/english/overview.htm
Ok, nobody on Amiga ;)
you'll try ;)
because there is Mediator on amiga :)
@wawrzon, Any application that has failed for you because of accelerators?
none i remember or would seriously care for.
So I think PCI is doable but don't forget that translation between PCI and Zorro may introduce bottlenecks. But why introduce any bus at all between the CPU-in-FPGA and the CPU-socket? KISS..
i think we are talking about almost dropping zorro, or at least zorro3, and having a direct pci interface, with the amiga next to it being interfaced by pci as well. no zorro bottleneck between cpu and pci anymore. i understand the zorro-pci interface would be used to interface the remaining amiga hardware, or what's left of the zorro bus.
I'd rather get an SOC as the CPU, then you'd get a PCI bus and all other devices on the chip essentially for free.
Forgive me if this sounds terribly stupid, but surely an Arm chip (for instance) has data and address buses that we could connect to the trapdoor slot via some relatively simple FPGA glue logic, just as we would a 68060?
You might look for one with an EBI bus (I think that's what it's called); I forgot about that one until today. I'm not sure what kind of selection there is, though; I'm only aware of moderate-performance ones between 100 MHz and 200 MHz, which don't thrill me for this task. But I don't know much about the higher-end ARM chips.
Interrupts can be handled by letting the Amiga side set an interrupt register in the FPGA, which in turn just signals a general interrupt (like "IRQ" on the C64) to the overdrive CPU. The CPU side then reads which interrupt source triggered the event and acts accordingly. The extra performance will negate any delays for this code.
On 8086 etc. an instruction may take 3 cycles but an IRQ may take 100 cycles, just to hint at the amount of wasted cycles that may occur. Not counting push/pop instructions.
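A rough software model of that mailbox scheme (class and method names like `IrqBridge` and `raise_source` are my own invention, purely illustrative):

```python
class IrqBridge:
    """Hypothetical FPGA-side interrupt mailbox between the Amiga bus
    and the fast 'overdrive' CPU."""

    def __init__(self):
        self.pending = 0                 # one bit per Amiga interrupt source

    def raise_source(self, bit):
        """Amiga side: chipset latches an interrupt source bit."""
        self.pending |= 1 << bit

    def irq_line(self):
        """Single general IRQ line asserted to the overdrive CPU."""
        return self.pending != 0

    def acknowledge(self):
        """CPU side: read-and-clear, then dispatch per source bit."""
        latched, self.pending = self.pending, 0
        return latched


bridge = IrqBridge()
bridge.raise_source(3)                   # some source, e.g. bit 3
print(bridge.irq_line())                 # True
print(bin(bridge.acknowledge()))         # 0b1000
print(bridge.irq_line())                 # False
```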
Choose another CPU ;)
On the FPGA you can make any signal you need..
I'm not suggesting to use MicroBlaze soft core, I'm citing it as example of performance that can be achieved by a soft core.
No, because the BYTE value of $78563412 is $12 - just as it would be with a software only op-code interpreter.
That's what I was thinking also. Anything that would break is already broken with existing accelerators.
A couple of reasons I'm not jumping at pure FPGA:
Large fast FPGAs get really expensive.
A fast enough core hasn't been done by now, which makes me think it's excessively hard to do.
Very few people are capable of writing something that complex and efficient. I'm not one of them.
Using an SOC gives a huge amount of devices for free, FPGA just gives a CPU.
I can help with software and smaller projects, so I'm tending to lean that direction.
If I depend on someone else to do the hardest part there's a really good chance it's not going to happen. If I play to my strengths, I have only myself to blame if it doesn't.
A large enough FPGA like the XC3S1600, as used in the FPGA Arcade, costs 68 USD at Digi-Key. As can currently be seen in the FPGA Arcade thread, it can beat 68030 @ 20 MHz Amigas using a 16-byte cache (4.46 times an A1200), with hope of 28 MHz.
Thread: http://www.amiga.org/forums/printthread.php?t=39806&pp=15&page=57
Sysinfo (https://en.wikipedia.org/wiki/Sysinfo): http://www.yaqube.neostrada.pl/images/SysInfo28-16.gif
So XC3S1600 is more than enough and it has already been done.
yet they make the daughterboard with an 060, which proves it's rather hard to beat it with fpga. with an emu accelerator we might achieve a multiple of that speed, probably at a fraction of the cost.
The FPGA gives you any device you can imagine that can be expressed as binary gates.
I know VHDL is a bitch, but so were assembler, C etc. too. It's hard, but the reward makes it worthwhile. The power is awesome.
Here (http://www.amiga.org/forums/showpost.php?p=608422&postcount=90) is another sysinfo screenshot:
(http://www.yaqube.neostrada.pl/images/SysInfo28-256P-256.gif)
Using an SOC gives a huge amount of devices for free, FPGA just gives a CPU.
Finding one that has that plus external interrupts with levels is tricky.
It's also not standardized across chips so you're back to vendor lock-in.
Edit: Also, like I said, *I* can't reasonably do it in an FPGA. I can do a local-bus interface and I can do software. I'm just tired of waiting for "it's not that hard" to happen and am following my gut on the quickest path to get there.
Isn't the 68000 architecturally too far from the 68060 to gain useful insight into the original 68060..?
There is quite a bit of work going on understanding the basic architecture of the 68K
http://www.visual6502.org/images/pages/Motorola_68000.html
The micro and nano microcode instruction roms are being read out.
If this works, a table based FPGA will be much smaller and more accurate than the current code - and can be tweaked easier.
A lot of the cloning complexity of the 68K is fallout from the way it was efficiently implemented due to die area limitations at the time.
/MikeJ
I've been watching the 68000 and Amiga chips on there for a while... Sadly it's slow progress... But eventually we will have real net lists for these chips! :)
Personally I'd be inclined to start with something like a RISC core with a 68k-like programming model, and optimise for speed first, then work in a compatibility layer. Starting with 68k compatibility and then trying to work in pipelines etc. seems like the difficult way around.
Obviously the most sensible idea... But apparently not popular with the FPGA hobbyists ;)
If you really plan on using strictly an FPGA to emulate the CPU, I would suggest someone modify WinUAE to make a histogram of instruction usage. This would let someone focus on optimizing the 680x0 core by looking at instruction usage, which could help determine what changes to the cache, pipelines, etc. will benefit speed.
Some instructions are used less often but reduce branching (my favorite)
Yes! I love those! I (and Phil) were always pushing these at the Natami CPU Dezine Dept. but Gunnar did not like them or didn't understand so he was totally against adding a new instruction for this purpose. :(
That's why we added SBcc, SELcc, ABS, POPCNT, etc. to the ISA; they fit and have minimal pipeline overhead (hazards) while reducing short branches. Long branches still need to jump. If we could remove 5-15% of branches (the short ones) and the overhead in the branch cache and history, the 68k would be one of the best processors at branching. Add to that a relatively short pipeline (and mis-predicted branch penalty) and 0-cycle loops, and we would have much improved performance, a beautiful CPU to program and even better code density.
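For readers wondering what "removing a branch" buys: here is a branchless ABS and a SELcc-style select sketched in Python on 32-bit values (my illustration of the general technique, not code from the proposed ISA):

```python
MASK32 = 0xFFFFFFFF

def abs32(x):
    """Branchless absolute value of a 32-bit two's-complement integer,
    the operation an ABS instruction would do in one step.
    (Note this maps $80000000 to itself, as two's-complement abs does.)"""
    x &= MASK32
    sign = -(x >> 31) & MASK32        # all-ones if negative, else zero
    return ((x ^ sign) - sign) & MASK32

def sel32(cond, a, b):
    """SELcc-style select: returns a if cond else b, with no branch."""
    m = -int(bool(cond)) & MASK32     # all-ones if cond, else zero
    return (a & m) | (b & ~m & MASK32)

print(abs32(-5 & MASK32))            # 5
print(sel32(True, 0xAAAA, 0x5555))   # 43690 (0xAAAA)
print(sel32(False, 0xAAAA, 0x5555))  # 21845 (0x5555)
```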
Different angle here..... ColdFire is not too far off from the 68K instruction set. A ColdFire V1 core is available for Altera FPGAs. Would it be a shorter, easier journey to start with a ColdFire core and modify it to be more 68K compatible for our purpose?
EDIT: Hold on, I may be reading this Coldfire core information incorrectly. More study required.
EDIT 2: Yes, looks like my first comment was correct.... http://chipdesignmag.com/display.php?articleId=2371
Check the store for this thing at http://www.ip-extreme.com/corestore/
Um, yeah. That's not going to happen. Darn, thought I had something there.
@billt
Is this the ARM project you mentioned? Doesn't look like much additional work has been done in a while.
http://opencores.com/project,core_arm
Plaz
Once upon a time there was a maybe more advanced one than that, but ARM had it disposed of. :( If you look hard enough you might find it in some shady corner of the netiverse, but that was news over 10 years ago...
Here is an implementation of the ARM2 ISA... It's old but would make a great starting point for any CPU project!
Trying to improve the ISA is a time sink which you'll never get payback from. The only benefit is a bit of ego boosting, but that subsides when reality hits.
Making the pipeline follow the predicted branch might be hard, but it's doable. Thumb drops a lot of conditional instructions from Arm.
It's not a time sink if people are working together in parallel, which is the way it was supposed to be when I started documenting the new 68k ISA. It's not a time sink if the new ISA attracts interest from outside of the retro crowd. It's not a time sink if the ISA is implemented and found to be a substantial improvement in power, code density, compiler support and ease of programming. You give up very little with the possibility to gain much more. There is a market for retro computing but a bigger market for a processor that can handle today's processing needs quickly with compact code as well as being compatible with old code. That's what ARM and x86 did: they evolved and now they are successful. Building a 68020-compatible CPU comes first, but even then it's smart to plan ahead to make future enhancements easier.
Yes, but they were using predication (unusual for a CPU), which only offers a small advantage in some specific hardware. The smaller the block of predicated instructions and the simpler the instructions, the better. Most original ARM ISA instructions could be conditional, which worked OK but was dropped with Thumb because it was not good for the code density they were going after. The ARM block-predication instruction was for multiple-instruction predication, but ARM went to OoO processors where it didn't work as well. The conditional instructions proposed in the 68kF ISA should work nicely while being a small simplification improvement over a more generic CMOV like x86's. They would work well on a superscalar CPU with a short pipeline and a cheap branch predictor (or no branch predictor), which the 68k is likely to have. There would still be some optimized code that would not want to use them at times. This includes highly predictable branches that are executed often, and very tight loops where a highly predictable branch could be used instead. Note that some instructions like ABS (absolute value) have no drawbacks yet remove a branch that can be difficult to predict, and SELcc can remove 2 branches in some cases. I would like to do some testing in an implementation before finalizing the ISA.
Sounds interesting, you might want to start a new thread about optimising and evolving the 68k ISA... As any discussion here might get confused with talk about FPGA implementations :)
Maybe "amiga hardware designs" should be a new subforum of the hardware discussion forum.
I took bloodline's reference as a point of study of how to do another cpu project, not necessarily an ARM.
I haven't messed with VHDL in an uncountable number of years. Someone please point me to a good reference where I can retrain some brain cells on some tools needed here.
Plaz
Personally I prefer Verilog, but it lacks some capabilities; I think it was in regard to level triggering etc. So VHDL it is.
Btw, those links are books, not hw.
Sorry for throwing yet another diversion into the thread.
I don't think the PCI to Amiga warrants a separate thread unless someone makes tangible progress.
It's probably just an option for cpuXtoAmiga like FPGA 060 replacement is a subset of 680x0 accelerator.
He asked for references to retrain his brain since he hadn't used VHDL in so long... So book links seemed appropriate.
You're very optimistic.
It's difficult to predict the future, but I can't imagine there is anyone outside of the retro community that will ever have any interest in a 680x0 cpu core. There are far too many other SOC/ASIC/FPGA solutions that have already carved up the market. There is no competitive edge against any of the other alternatives and nobody in business will care if they can run 680x0 code.
The majority of people want something that can run existing software and use existing compilers; adding instructions will cause market fragmentation if anyone is tempted to ever use them. A product that doesn't ship because the people behind it get delusions of grandeur is no use to anybody.
Chasing rainbows is all well and good, but it's the reason that Natami failed. I'd rather see something ship for once.
i'm just trying to figure out how to modularize the project in order to divide it into smaller, more easily doable parts dedicated to the particular talents the contributors may have.
Optimistic? Yes! Waste of time? Maybe. At least I can say I tried even if I'm dreaming a little. Reality is only one visionary person with a wad of cash away 8-).
ARM with Thumb 2 has moved close to what an enhanced 68k would be and it doesn't have any trouble selling. I think we would be a little more powerful and easier to use while Thumb 2 is a little more power efficient.
pci2amiga (one could distinguish the a4000 030 bus, zorro, and the a1000/500 expansion bus as slave) could be unnecessary with a self-made fpga board, but might be a relief when connecting any prefabricated device as master that would usually provide such an interface. having that interface technically working, the other part would be to make the cpu of the host device take advantage of the interface, like 68k emulation on x86 accessing the amiga chipset via pci.
Sorry, misunderstood the first time.
For PCI bridge, you may end up making a different bridge to each of those Amiga targets in order to have it optimized for each. There may be some similarities, but I'm not sure a single thing to fit all of them would be best.
Natami on the other hand needed to lock out anyone from coming along who could write better VHDL.
The builtin CPU core in the FPGA Arcade is limited because of logic matrix constraints.
??? This seems counterintuitive. Bit of a dig at Natami management, I presume.
You seem to care a lot about preventing any new 680x0 CPUs being built.
How is Matt Hey going to prevent MikeJ from shipping the Replay?
I keep trying to ignore your repeated insults of Matt Hey and anyone else who wants to make a faster Amiga but .... Could you just please stop with the insults?
Instruction fragmentation may occur regardless of how it's implemented, be it ARM emulation, FPGA or ASIC.
Adding instructions may cause fragmentation regardless of its implementation. Reread my post ;)
I think we're at the second of these currently with Yaqube's work.
Sorry if I'm dense, are we agreeing?
Adding/removing instructions isn't going to fragment, added instructions can be ignored (see the 68020) and removed instructions can be trapped (see the 68060)... Fragmentation would occur if instruction behaviour is altered...
I thought you implied that no matter what, fragmentation would happen.
My point was that it wouldn't fragment us unless someone added or removed 680x0 instructions.
I guess I am dense. I can't take yes for an answer ;)
So our best option so far for FPGA implementation then is to support Yaqube's efforts? Does he want or need help, are there resources he needs to help things along?
Reusing a previously assigned opcode could cause problems, unless it wasn't commonly used on the Amiga... If it has potential to improve compiler code generation, or speed up execution... then I say go for it!! ;)
The best situation is that he releases the HDL resources. Other than that one could start with TG68 and work from there.
I'll agree. Make it work FIRST. If it ships it's a bonus ;)
The beauty with FPGA is that you can ship first, and code later :P
TG68 source looks like a good starting point, but if Yaqube is well on the way to creating the core needed, then wouldn't it be preferred to support that goal instead of duplicating the effort? Is his project so different in FPGArcade that it wouldn't work well here? I've not followed FPGArcade very closely, will the work be open or closed source?
Plaz
While the 060 bus probably isn't the best example of something requiring obsessive-compulsive signal integrity planning, it is at the low end of where you start to care. The general rule of thumb for this starts around 50MHz. Some say they've seen problems as low as 17KHz...
I agree with everything you've said, but I have a question.
Why do we keep mentioning duplicating the 060 bus?
It's hard to work with sources that you don't have ;)
Read the name of this thread's topic.
This started out as a discussion to replace the very difficult to find, legitimate, best mask-set, full-featured and fastest 68060 chips from Motorola/Freescale, to put into 68060 sockets such as the socket found on some Amiga accelerators and on MikeJ's daughtercard for FPGA-Replay system.
Other things, such as the 3000/4000 accelerator slot, 030 socket, 020 socket, 000 socket, etc. have also come up, and could most likely be used via adapter, or do new PCBs directly targeting those and reuse the FPGA softcore stuff there. No reason a TG68 or N050 or whatever can't be plugged into any one of those things, but this topic came from the 68060 issue and desire to have better than whatever it is we already have.
Shout out to Mr. obvious.
To the point..... has Yaqube ever mentioned opening the source?
For the answer to that, I guess I'll just go ask him myself. (predicting the answer is no)
Second question.... anyone know what detailed documentation is available for 060? Schematic of the internals would be the bomb. I think I still have my 030 motorola dev books from back in the day. Guess I can start there.
Plaz
If someone wants to pump years of work and thousands of dollars into an easily redesigned FPGA Replay addon that a handful of people own and the already hard-to-buy 060 cards, that's their time and money.
It just doesn't make sense to me and I hope it doesn't take resources away from anything actually useful to the community.
I'm thinking of developing on at least one FPGA just to try it out. And you can't have the core until you have a PCB, nor can you have a working PCB test until you have a core ;)
So one makes a PCB. Then generate a core that just toggles bits and does basic bus testing. When that is complete, the next step is to code a 68k core.
Perhaps it's possible to run the 68k bus really slowly, like 1 MHz, just to prove it works.
Let's focus on the design rather than repeating the basics. Please?
Agreed, but so far I'm hearing we don't have the basics covered yet. No core, design no go. Even if starting with TG68... must compare where it is to where it needs to go.
Shout out to Mr. obvious.
To the point..... has Yaqube ever mentioned opening the source?
For the answer to that, I guess I'll just go ask him myself. (predicting the answer is no)
I agree with everything you've said, but I have a question.
Why do we keep mentioning duplicating the 060 bus?
It's hard to find 060 CPU cards.
Real 060's have to be heavily adapted to fit the Amiga bus.
030 cards are dirt cheap and plentiful.
An 060 is no faster than a synchronous 030 with burst for communicating with the Amiga itself. Actually they can often be slower since many 060's are async, can't burst, are running in 040 bus mode and have a lot of glue logic.
The 3000/4000 local bus is basically a straight 030@25MHz, no glue required, and I'd think the 1200 would be very similar but slower. You can't talk to the Amiga faster than 25MHz, period.
Local devices on the CPU card can communicate any way you want them to. They don't have to be limited to 030 Amiga speeds, they can be custom or off the shelf high speed buses.
030 just seems like the sweet spot for our needs.
KiCad (https://en.wikipedia.org/wiki/KiCad)
(free and thus makes sharing easy)
so something like this to some extent?
http://www.gb97816.homepage.t-online.de/gba_tk02.htm
Original problem: The FPGA Arcade has some 68020 hybrid. But the number of logic cells in the XC3S1600 is finite, so any fancier CPU has to be elsewhere. Now the solution mikej has accomplished is a daughterboard with a 68060 CPU.
Several of us have a real problem with this guy - he was on this forum for a while. He is using GPL code and not releasing the source, which is naughty. We will release the code for the Replay system as soon as we start shipping a stable core.
/MikeJ
@A6000, If you get tired of 68060 you can switch to a 68030 in 1/10 second.. (with FPGA)
Jim : "Has anyone worked with the MCC-216 "
Several of us have a real problem with this guy - he was on this forum for a while. He is using GPL code and not releasing the source, which is naughty. We will release the code for the Replay system as soon as we start shipping a stable core.
/MikeJ
I read here that the 060 bus interface is too complex and that the 030 bus is better, also the 060 is less compatible with amiga software than the 030 because many instructions were not implemented, so why do we want an 060 replica, why not try to implement an 030+882 that runs as fast as an 060?
From a hardware standpoint, the MCC-216 is pretty simple - just the Cyclone III, some RAM, and some I/Os. I am not sure how fast it is or what can be done with it.
68060 replacement.
Not 68060 replica.
There's a difference.
Yes, my ideal is something that goes into a 68060 socket, is 680x0 compatible (tg68, n050, n070, whatever), and hopefully outperforms real 680x0 from Motorola/Freescale.
Then do adaptors or different PCB designs for other sockets to hold the otherwise identical CPU core, whatever it may be. (Tg68 or n050 or arm or x86 or whatever)
That issue is solved with a simple mechanical adapter, as already mentioned. The 060 socket has signals that the 030 socket doesn't. So you can go down, but not up.
I read here that the 060 bus interface is too complex and that the 030 bus is better, also the 060 is less compatible with amiga software than the 030 because many instructions were not implemented, so why do we want an 060 replica, why not try to implement an 030+882 that runs as fast as an 060?
68060 has some performance factors:
That's rather longer than I expected.
# 10 stage pipeline.
Motorola removed some of the instructions added to the 020 and some of the FPU instructions to save space, that could be used for making it run quicker.
The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.
That's rather longer than I expected.
The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.
Motorola removed some of the instructions added to the 020 and some of the FPU instructions to save space, that could be used for making it run quicker.
By only supporting the 060 instructions then you've saved space in the FPGA and the time taken to implement them.
page 3-1
http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf
The first 4 stages are for fetching and assigning the instruction to an integer unit. The next 4 stages are the dual integer unit, then the last two stages are completing the instructions.
It's quite a simple design.
It doesn't evenly distribute instructions between integer pipelines, it only uses the second integer pipeline when the first is running an instruction that can be run at the same time. Whether it can depends on the instruction (not all can even run on the second pipeline) and on the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction then the next instruction also has to be put on the primary pipeline.
I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something you can achieve in real-world examples; however, as a 32-bit value can contain two instructions, it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set". But if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.
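The pairing rule described above can be sketched as a toy model. This is a simplification for illustration only; the register sets and the `pairable` flag are assumptions, not the real 68060 dispatch logic:

```python
# Toy model of the dual-dispatch rule described above: the secondary
# pipeline is only used when the instruction is allowed there AND it has
# no register dependency on the instruction in the primary pipeline.
# Instruction = (mnemonic, registers read, registers written, pairable).

def can_pair(primary, secondary):
    _, _, writes_p, _ = primary
    _, reads_s, _, pairable_s = secondary
    if not pairable_s:          # some instructions only run in the pOEP
        return False
    if writes_p & reads_s:      # change/use dependency forces serial issue
        return False
    return True

# MOVEQ #0,D0 writes D0; TST.W D0 reads D0 -> cannot dual-issue
moveq = ("moveq", set(), {"d0"}, True)
tst   = ("tst",  {"d0"}, set(), True)
print(can_pair(moveq, tst))    # False

# Two independent register ops can dual-issue
add = ("add", {"d1"}, {"d1"}, True)
sub = ("sub", {"d2"}, {"d2"}, True)
print(can_pair(add, sub))      # True
```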
The branch executing in zero cycles doesn't seem to be very well documented. I can't tell whether they are over-exaggerating what it does or not. My original thought was that the branch is in the primary pipeline and the secondary pipeline has the target or next instruction (depending on what is predicted). This doesn't actually cause it to execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0 cycle overhead.
What is odd is that they claim different timings for branches predicted correctly as taken and predicted correctly as not taken
So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does.
The only 68020 features I ever used are longword multiplies and divides, and scale factors on indexed addressing modes.
And Branches >128 bytes :angel:
I presume 16-bit branching means that if a certain flag is set then one can conditionally jump 65536 memory positions?
I have some memory that x86 is limited to a 128-position limit on branching? Or perhaps it's the 6502 ;)
How about ARM?
There is also a translation from 16 bit variable length CISC to a fixed length 16 bit RISC in there. I don't think Motorola released the encoding format of their internal fixed length RISC making it difficult to duplicate.
No EXTB.L or TST.W/L An? No misaligned reads or writes?
Ah, you got me. I do use EXTB.L, on occasion. Although I could easily do without.
Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.
Maybe the early decode just figures out how long each instruction is and whether the next instruction is valid to go in the secondary pipeline.
Is there any evidence to show they remap the opcodes at all? They might just store each opcode +operands within the fixed width fifo.
I can't honestly say if I use TST.L An or not, off the top of my head. Pretty sure I never do TST.W An though, can't think of much use for that.
I'm actually pretty careful not to do misaligned access, it just seems wrong, somehow. Just because you can, doesn't mean you should!
So there is a kind of selection process such that instructions that don't depend on subsequent instructions can be done in parallel while the rest go through a single pipeline?
Do you think it's feasible to create something that can get near a 50 MHz 68060 in an FPGA?
So there is a kind of selection process such that instructions that don't depend on subsequent instructions can be done in parallel while the rest go through a single pipeline?
Btw, Is there any ISA that is neater and more straightforward than m68k? ;)
What's a "Link stack" ..?
I do such things like:
move.l myptr(PC),D0
beq .nullptr
move.l D0,A0
in the 2nd example could always use tst.l D0 anyway.
flags are set for free when moving to an address register. Also note the first line, I always write relocatable code.
In other news, I've been thinking about a RISC instruction set for internal use in a 68k core for some time. I think we can identify a few obvious simplifications:
1. treat An and Dn identically (use extra instructions if different behaviour is required)
2. only MOVE can use memory as either source or destination operand (load/store architecture)
3. all other instructions register-register, or "quick" short-constant source operands
4. spare "temporary" registers for internal use.
we could map 68k instructions to short sequences of internal instructions, and design those instructions to give the shortest sequences.
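As a sketch of that mapping, here is a hypothetical translator that expands a register-memory 68k instruction into internal micro-ops under rules 1-4 above. The micro-op names and the spare temporary register `t0` are invented for illustration, not anything from a real design:

```python
# Hypothetical expansion of a 68k register-memory instruction into the
# internal load/store micro-ops described above: only MOVE touches
# memory, everything else becomes register-register, using a spare
# temporary register t0 (rule 4).

def translate(op, src, dst):
    micro = []
    if src.startswith("("):                 # memory source -> load first
        micro.append(("load", src, "t0"))   # MOVE is the only memory op
        src = "t0"
    micro.append((op, src, dst))            # register-register ALU op
    return micro

print(translate("add", "(a0)", "d1"))
# -> [('load', '(a0)', 't0'), ('add', 't0', 'd1')]
print(translate("add", "d0", "d1"))
# -> [('add', 'd0', 'd1')]
```

A register-register 68k instruction maps 1:1, while a memory-source one costs two internal operations, which is the trade-off discussed later in the thread.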
Yeah, you can't execute in parallel if the first instruction modifies a register that the second uses: for example
MOVEQ #0,D0
TST.W D0
Mind swap on the address register :).
Oops I meant data register.
That's nice for simplification but not good for code density. Are you looking at a fixed 16 bit or 32 bit RISC encoding?
Code density doesn't matter here as it would only be used internally, external 68k code translated into internal code in some kind of buffer. Fixed length but the number of bits could be anything, it's not actually stored in the RAM so it doesn't even need to be 16 or 32.
I have heard a rumor that as much as 1/3 of the 68060 is microcode. It's generally slower though. The 68060 bit field instructions are a good example. They can be done in 1-3 cycles (data in cache) on an FPGA but they take 2x-3x that long on the 68060.
I would rather optimise for 68000 instructions and provide the rest just for compatibility. How common are the bitfield instructions in real code? I never use them.
Actually a default model could be to provide just a few instructions and have the rest as trapped instructions. That means one has something workable fast. Then one could make the architecture correct. And then add the full instruction set.
If one starts with the instructions and then tries to impose the correct architecture... well, it could be messy ;)
What's a "Link stack" ..?
Have you looked at the Actel FPGAs?, they are way faster than any competitor last time I checked. Of course they are slightly more expensive.
As for ISA, my thinking were if the ISA of ARM, Transmeta, PDP-11, MIPS, Sparc, DEC Alpha, PA-RISC, etc is easier to deal with. Without sacrificing performance.
I would rather optimise for 68000 instructions and provide the rest just for compatibility. How common are the bitfield instructions in real code? I never use them.
Actually, this may work in parallel. Some very simple instructions are retired early and the longword (only) result made available early. This is not specifically stated, but the result is made available early from these types of instructions for change/use stalls, and is probably also available early for the other OEP. These early retirement instructions include:
lea
move.l #,Rn
moveq
clr.l Dn
If one starts with the instructions and then tries to impose the correct architecture... well, it could be messy ;)
The only thing I can find is this:
"If the primary OEP instruction is a simple “move long to register” (MOVE.L,Rx) and the destination register Rx is required as either the sOEP.A or sOEP.B input, the MC68060 bypasses the data as required and the test succeeds."
Which says it's only for move.l, although I guess the others could be translated. It doesn't have to retire it early, the second pipeline could look in the primary pipeline. Mips has a similar handling for lwl/lwr opcodes, it pulls the register value from the pipeline and stops the register being updated at all. The register doesn't actually get updated until you stop executing lwl/lwr opcodes.
That's a big part of why "they" moved away from hardwired control units in favor of microcoded control units. My own education thus far was about the hardwired style, which is very dependent on the instruction set. I was hoping to take the advanced followup course now, but it wasn't on the schedule. I'm trying to go through the Coursera one now, which is pretty advanced. Not sure if they explain microcoding or if that assumes you already know it. Going to try and find some time to read up on it more regardless.
I think you mean Bcc.L and BSR.L. Branches up to 16 bit were supported on the 68000. The longword branches are big savers but only on fairly large programs. Not too many assembler programmers create programs >65k.
It's signed so plus or minus ~32k.
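Those reaches follow directly from signed two's-complement displacement fields (8-bit for Bcc.B, 16-bit for Bcc.W, 32-bit for Bcc.L from the 68020 on), which a short sketch can confirm:

```python
# Reach of a signed two's-complement branch displacement field of the
# given width, as used by the 68k Bcc.B/Bcc.W/Bcc.L encodings.

def disp_range(bits):
    return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)

print(disp_range(8))    # (-128, 127)
print(disp_range(16))   # (-32768, 32767) -> "plus or minus ~32k"
print(disp_range(32))   # (-2147483648, 2147483647)
```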
By looking at code compiled for the 68060, it looks like many compiler programmers didn't understand either. Most 68060 optimized code doesn't do much except replace some trapped instructions, if that.
How common are the bitfield instructions in real code? I never use them.
There are at least 2 different optimizations here. One is the early instruction retirement and register forwarding. The other is more of a MOVE.L+OP.L optimization which is possible because MOVE.L is only half an operation in a register memory architecture that can do both in 1 operation.
I know some people are using OS2.04 on this device (well, at least they claim they are!)
Which compilers even have an 060 option?
Which compilers even have an 060 option?
I can't remember SASC even having such an option. Or maybe it does and I just don't use it...
Every time I ever wanted to use Bitfield instructions I would consult the timing charts and it was always faster to just do things the RISCy way and not use bitfield instructions. So I have never used them. I just use good ol' ANDing and ORing.
gcc version 3.3.3 has these options:
-m68000 -m68020 -m68020-40 -m68030 -m68040 -m68881 -mbitfield -mc68000 -mc68020 -mfpa -mnobitfield -mrtd -mshort -msoft-float
But perhaps SAS C has something more specific?
BFCHG, BFCLR, BFEXTS, BFEXTU, BFINS, BFSET and BFTST are simpler but very easy for a compiler. They would be a compiler writers dream come true if they were fast.
I'm pretty sure the latest SAS/C does,
As the 020, 030, 040 options don't mention the omission of any instructions, it seems the 060 is the only m68k CPU to have fewer instructions than its predecessors.
As the 020, 030, 040 options don't mention the omission of any instructions, it seems the 060 is the only m68k CPU to have fewer instructions than its predecessors.
The 040 has fewer FPU instructions; the 060 is the first to drop integer instructions.
Why were these instructions [CALLM & RTM] dropped?
More to the point, why were they ever included in the first place?
Why were these instructions dropped?
And would it be more efficient, performance-wise, to implement a 020, 030, or 040 and then horrendously overclock it?
Why were these instructions dropped?
And would it be more efficient, performance-wise, to implement a 020, 030, or 040 and then horrendously overclock it?
Regarding the instruction set (ISA), I was wondering in general why they changed it, because the end result is slightly confusing.
I have been reading matthey's 68kF2 ISA proposal, and it reminded me how complex the 68k instruction encoding is, :)
Complex? Take a look at a decoder for x86 :P. Yea, the 68k does need more logic in the decoder but the improved code density allows more instructions to be piped into the processor. Most RISC instructions use a consistent 32 bit fixed length encoding which is great for decoding. The 68k needs several separate decoding tables (lacking a better name) for different encoding areas. Some encoding holes are even divided into a separate table of instructions. This part of the 68k could have been a little better but it's not too much of a problem. The 68k does compress a lot of data with sign extended values which works very well and can be improved on. The overall slowdown from the decoder is minimal on the 68k and can be made up for with powerful instructions and addressing modes which it has and can be improved on. ARM with Thumb 2 works well because of the code density plus powerful instructions for RISC. This was a good tradeoff even though they now have a little more complex decoder. MIPS and PPC have also experimented with code compression (MIPS16E and CodePack respectively) but it never caught on or fit as well for them:
I rather like Mrs Beanbag's idea of a nice simple RISC core tailored to executing instructions that have been decoded from 68k instructions, it could simplify the decode stage maybe :)
http://www.embedded.com/electronics-blogs/significant-bits/4024933/Code-compression-under-the-microscope
I rather like Mrs Beanbag's idea of a nice simple RISC core tailored to executing instructions that have been decoded from 68k instructions, it could simplify the decode stage maybe :)
I rather like Mrs Beanbag's idea of a nice simple RISC core tailored to executing instructions that have been decoded from 68k instructions, it could simplify the decode stage maybe :)
Regarding the instruction set (ISA), I was wondering in general why they changed it, because the end result is slightly confusing.
(we don't even know how the 68060 deals with these very long encodings).
Right, simplifying the decoding stage wasn't the idea so much. But if you can split a problem into two parts, it is usually easier to solve. I'm trying to make the developer's job easier really.
The advantages are that each part can be developed, tested and optimised separately, and indeed the RISC core could conceivably be useful on its own (and an assembler could be modified to compile 68k asm to run on it). It would be easier to add new instructions, much in the same way that microcode does, but the "microcode" in this case is more readily understandable, being 68k-like itself.
What do you mean? AFAIK the longest instruction is 10 bytes and that is what gets transferred from the FIFO in the decode stage.
Hmm. Why not just use JIT in UAE then? The 68k code does make a nice compressed cross platform intermediate language
I guess that's not too far from my idea, now I think about it, but with a CPU core specifically designed to emulate 68k. Sort of a hardware emulator, I guess.
Do you see any mistakes?
I guess that's not too far from my idea, now I think about it, but with a CPU core specifically designed to emulate 68k. Sort of a hardware emulator, I guess.
If a 1 cycle 68060 instruction translates to 2 of your instructions, then you'd have to clock at double the speed to achieve the same throughput. So each of those would have to be a 1:1 mapping or you've already failed.
I've got the instruction execution timings in front of me here. Instructions with indirect addressing modes or immediate data can take longer than 1 cycle on 68060. Move (An),(An) for instance takes two cycles. All the Register-Register instructions take 1 cycle. These could be mapped 1:1, or better. So the answer is it probably depends on the program. But it might be possible to process some combinations of two 32 bit instructions simultaneously as well. (Add some degree of implicit superscalar operation, probably move
What's the encoding for move.l with that addressing mode? I can't see anything that matches that in the 68020 user manual.
move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)
The hexadecimal encoding for above is:
23b0 0173 1234 5678 1234 5678 0173 1234 5678 1234 5678
Hmm, for me that disassembles as:
001000: 23B0 0173 1234 5678 0173 1234 5678 move.l ([$12345678,A0],$1234), ($78,A1,D5.w*8)
00100E: 1234 5678 move.b ($78,A4,D5.w*8), D1
MC68060
move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)
rts
For those saying that Altera is the better fpga for this task, what exactly makes it better, and what are you comparing it to on the Xilinx side?
It's not uncommon to see bugs in assemblers, disassemblers and debuggers using these advanced and seldom used addressing modes. I used vasm typing in:
MC68060
move.l ([$12345678,a0],$12345678),([$12345678,a1],$12345678)
rts
I assembled it from test.asm to test. It disassembled just as I typed it with my modified version of ADis from here:
http://www.heywheel.com/matthey/Amiga/ADis.lha
Disassembling with:
ADis -m6 -a test
The old version of ADis would have had problems. IRA 2.04 fails to disassemble the destination correctly. D68k v2.0.8 is very close but oddly gets the address register in the destination wrong.
BDebug from the Barfly package gets it right. CPR from SAS/C gets it right (although doesn't display the $ for hex numbers on instructions).
I thought you might have been using D68k at first but apparently not. What disassembler did you use?
There is this information from the Megadrive:
Brilliant!! I kinda figured the branch, move and compare instructions would be the more popular :)
http://emu-docs.org/CPU%2068k/68kstat.txt
although it might be more instructive to see which are the most common addressing modes for these instructions, too.
For instance, rate of "add Dx,Dy" vs "add (Ax),Dy" and "add Dx,(Ay)".
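Statistics like those could be gathered by tallying mnemonics together with a rough addressing-mode class from a disassembly listing. A minimal sketch, assuming a simplified `mnemonic src,dst` input format; the classification buckets are crude and hypothetical:

```python
# Tally (mnemonic, addressing-mode class) pairs from disassembly lines,
# so you can compare e.g. "add Dx,Dy" against "add (Ax),Dy".
from collections import Counter

def classify(operand):
    """Crudely bucket a 68k operand into an addressing-mode class."""
    if operand.startswith("(") and operand.endswith(")+"):
        return "postincrement"
    if operand.startswith("("):
        return "indirect"
    if operand.startswith("#"):
        return "immediate"
    if operand[0] in "dD":
        return "data register"
    if operand[0] in "aA":
        return "address register"
    return "other"

def tally(listing):
    counts = Counter()
    for line in listing:
        mnem, _, operands = line.partition(" ")
        for op in operands.split(","):
            if op:
                counts[(mnem, classify(op.strip()))] += 1
    return counts

stats = tally(["add d0,d1", "add (a0),d1", "move.l #0,d2"])
print(stats[("add", "data register")])   # 3: d0, d1 and d1 again
print(stats[("add", "indirect")])        # 1: (a0)
```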
I used mame (arcade game emulator), typed the hex into memory and then disassembled and executed it. It's not just the disassembler, the emulation consumed the same number of bytes. So that needs looking at, can you post the exe you assembled?
There is mention in the manual about some instructions being split over two pipelines, it might do that by splitting it into two FIFO entries. With the result of the ea fetch from the primary pipeline getting forwarded to the secondary pipeline so it can get stored.
Have you tried running this encoding on a real 68060?
Brilliant!! I kinda figured the branch, move and compare instructions would be the more popular :)
http://www.heywheel.com/matthey/Amiga/test68020
Are you involved with developing or testing mame?
Right. The OEPs are locked together and each OEP performs 1/2 of the ea for a move.
Actually something just occurred to me. If the most common instruction is "tst", it should be possible to know whether a branch will be taken or not some time in advance. Because "tst" only looks at a single register, the contents of that register must have been determined some time before. So you could look ahead in the instruction queue for a "tst/bcc", and inform the branch predictor well in advance. "tst" instruction then takes effectively NO cycles.
Right. The OEPs are locked together and each OEP performs 1/2 of the ea for a move
Not strictly true. Can also do "cmp (Ax)+,(Ay)+". This is the only 68k instruction that allows 2 EAs by the way.
Apart from the cycles it takes to look ahead in the instruction stream every time you hit a tst instruction, and it will get complex to even follow the code as you would have to follow branches as well. Basically to avoid the cycles when a branch happens, you'll end up going through the same overhead as running the code after every tst instruction (tst isn't the only instruction that affects branches).
Instructions are read into a buffer ahead of time, so can detect a tst/bcc when it is first read in. I wouldn't bother following branches, to be able to predict only the next branch would still help. Yes it would only work if the branch follows a tst, but if the profiles from the Megadrive are anything to go by, that is the most common case. Basic RISC principle, "make the common case fast"!
Also I have been thinking of a way to make the instruction translation do branch predication in the case a conditional branch skips only a few instructions.
beq skip
movem.l d0-d7/a0-a6,-(sp)
skip:
move.l d0,-(sp)
Actually something just occurred to me. If the most common instruction is "tst", it should be possible to know whether a branch will be taken or not some time in advance. Because "tst" only looks at a single register, the contents of that register must have been determined some time before. So you could look ahead in the instruction queue for a "tst/bcc", and inform the branch predictor well in advance. "tst" instruction then takes effectively NO cycles.
Not strictly true. Can also do "cmp (Ax)+,(Ay)+"
addx, subx, abcd and sbcd can use predecrement for both operands.
All of these are two cycle instructions.
Instructions are read into a buffer ahead of time, so can detect a tst/bcc when it is first read in. I wouldn't bother following branches, to be able to predict only the next branch would still help. Yes it would only work if the branch follows a tst, but if the profiles from the Megadrive are anything to go by, that is the most common case. Basic RISC principle, "make the common case fast"!
It wouldn't help at all when the branch follows the test, because you're going to have to flush all the following instructions from the pipeline. If you're going to remove the pipeline completely or a significant number of stages then you'll have a huge number of instructions taking multiple cycles and the overhead of incorrectly predicted branches is going to be so insignificant that it won't be worth doing.
The following instructions wouldn't be in the pipeline yet, at the point you make the prediction, that's the whole point, to avoid having to flush the pipeline when you get to the branch.
The following instructions wouldn't be in the pipeline yet, at the point you make the prediction, that's the whole point, to avoid having to flush the pipeline when you get to the branch.
I wonder if you understood my idea properly, so I'll try explaining it again. The instruction stream is read into a FIFO (which I believe is a fairly normal thing to do) and as soon as a test followed by a branch is read in, it can do the test immediately (which is a very simple operation) and predict the branch based on that. So as long as the register doesn't change by the time the branch instruction comes out of the other end of the FIFO the branch will have been predicted correctly.
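That FIFO lookahead idea can be put as a toy model: when a tst/beq pair enters the fetch buffer, sample the register immediately and use the result as the prediction, which then holds as long as the register isn't rewritten before the branch leaves the FIFO. This is a highly simplified illustration; the instruction representation is invented, and it ignores the writeback-timing and I/O objections raised elsewhere in the thread:

```python
# Toy model of early branch prediction from the fetch FIFO: scan for
# tst/beq pairs as instructions are read in and sample the tested
# register ahead of execution.

def predict_on_fill(fifo, regs):
    """Return {branch index in FIFO: predicted taken?} for tst/beq pairs."""
    predictions = {}
    for i in range(len(fifo) - 1):
        op, arg = fifo[i]
        nxt_op, _ = fifo[i + 1]
        if op == "tst" and nxt_op == "beq":
            # sample the register now, well before the branch executes
            predictions[i + 1] = (regs[arg] == 0)   # beq taken on zero
    return predictions

regs = {"d0": 0, "d1": 5}
fifo = [("tst", "d0"), ("beq", "skip"), ("tst", "d1"), ("beq", "other")]
print(predict_on_fill(fifo, regs))   # {1: True, 3: False}
```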
The test may not always be able to be done immediately. Might it not depend on the writeback of an instruction ahead of it but still in the pipeline and not yet finished? You may not yet have the right thing there to test just yet. Such as decrementing a loop counter might be right ahead of the test for 0...
Yes it would be a prediction, the prediction isn't always necessarily right, but as long as it's right more than 50% of the time it will help.
Yes it would be a prediction, the prediction isn't always necessarily right, but as long as it's right more than 50% of the time it will help.
and as soon as a test followed by a branch is read in, it can do the test immediately (which is a very simple operation) and predict the branch based on that. So as long as the register doesn't change by the time the branch instruction comes out of the other end of the FIFO the branch will have been predicted correctly.
What you're suggesting will break I/O, which is the major use of TST. You can only perform the read once & you can't do the read until all the registers are correct, or you could be reading from anywhere.
Good point. I was only thinking of tests on registers.
I think the thing with TST being the #1 instruction in SEGA games is either:
A: All those games were compiled with either SASC or GCC which generates silly wasted TST instructions all the time.
B: The Sega Genesis uses PIO (Polled IO) for some things so it has to constantly TST a certain memory location all the time in a loop.
C: All of the above.
I agree... with the MacOS, TST is not even in the top 10 instructions used.
I've had a bit more of a think about this... Would it make sense to design a RISC CPU for the FPGA with the same condition codes/flags as the 68k (where instructions would set the flags as expected), but limit all the exotic addressing modes to the load/store instructions?
A simple MMU could be added to mark memory blocks, to assist a JIT... That way we could have an FPGA CPU that could execute code really fast, allow for easy mapping of 68k instructions to the native instructions and move the 68k->native decoding to a software JIT :)
This is quite a neat trick, I've been looking into it for my own 68K core.
:)
The Replay PS2 keyboard/mouse controller uses a PicoBlaze softcore as it is smaller than the logic used otherwise - and you can do a lot more.
http://www.roman-jones.com/PB8051Microcontroller.htm
Here they are using it as a 8031...
/MikeJ
This is quite a neat trick, I've been looking into it for my own 68K core.
Thanks for the PicoBlaze hint, that led me to the MicroBlaze which led me to the DLX... A sort of stripped back MIPS :)
The Replay PS2 keyboard/mouse controller uses a PicoBlaze softcore as it is smaller than the logic used otherwise - and you can do a lot more.
http://www.roman-jones.com/PB8051Microcontroller.htm
Here they are using it as a 8031...
/MikeJ
I've actually spent a few days trying to design my own 68K emulator, to work out how to modify the MIPS ISA to make it super efficient at 68k emulation...
Going from a general purpose CPU would end up as an inefficient microcode-based 68k.
It makes sense to dump the original 68000 & 68010 microcode to emulate those processors. Not all of it is done yet though; there is some work hopefully happening soon.
I don't know how much of the 68020 and later were microcoded.
Yes the ALU is a piece you could get from various places ready made. I was pondering the possibility of using the cache management systems out of the OpenSPARC core.
This is quite a neat trick, I've been looking into it for my own 68K core.
The Replay PS2 keyboard/mouse controller uses a PicoBlaze softcore as it is smaller than the logic used otherwise - and you can do a lot more.
http://www.roman-jones.com/PB8051Microcontroller.htm
Here they are using it as a 8031...
/MikeJ
How many LUTs does it take to make a slice?
Trying to figure out which is smaller.
Hello,
I have done that already with the J68 core :
https://github.com/rkrajnc/minimig-de1/blob/master/minimig-src/j68/j68.v
It is loosely based on the J1 core (hence the name). So the heart of it is a stack-based CPU.
The ALU is a 16-bit ALU, compatible with a 68000 ALU.
It has some special micro-instruction for the effective address computation.
The core must run 2x to 3x faster than the original to reach the same speed.
The advantage is the size: less than 2000 LUTs. Micro-code takes 2048 x 20 bits.
With further optimization (cache, prefetch, 32-bit ALU and effective address ALU), I am sure it can be as fast as an 030/040.
Right now, this softcore can boot a Kickstart 2.04 in my AmiSOC core.
Regards,
Frederic
Frederic,
your 68K core (Verilog) code looks interesting and appears to have taken a fair amount of effort. This is some good work.
How much validation have you done on your 68K Verilog CPU core?
Is the MC216 the only place this core is in use as the 68K core? The older Minimig code ports use the TG68K VHDL 68K core (Minimig DE1/DE2/MIST code ports).
I looked over the J1 Forth CPU paper as well.
Okay, so your 16-bit 68000 core is about triple the size of that PB8051 microcontroller thingy that MikeJ is using.
But your core has triple the style points of a PB8051. :)
I was comparing the ways:
- Taking a PicoBlaze to emulate an 8051
- Taking a J1 to emulate a 68000.
I would not take a J68 just to make a keyboard controller. :-)
Moreover, emulation is not the exact term, since the 68000 "emulation" is heavily HW assisted:
- instruction decoder generates microcode address
- specialized micro instructions help evaluating the effective address
- the ALU is 68000 compatible
On the J68, the bus interface is what is taking most of the room (1000+ LUTs).
Big endian is cool but really resource hungry.
Regards,
Frederic
Frederic,
In your 68K Verilog code, I did not see code for the 68k interrupt acknowledge cycle (IACK cycle). The Amiga uses the 68K interrupt acknowledge cycle while the Atari ST (68K) uses both 68K Auto vector interrupts & 68K IACK cycle (for 68901MFP).
I may have missed this in your code, or do you need to still add this support in your 68K core?
Do we care so much about the size of the core anyway? If the aim of the game is the fastest 68k CPU, the fastest rated FPGAs tend to be quite big anyway. Unless we're squeezed for space I wouldn't worry about it.
Since all the Kickstart ROMs contain 0018 0019 001A 001B 001C 001D 001E 001F at locations FFFFF0 - FFFFFF, I am not using interrupt acknowledge cycles, just auto vector.
That's enough for running an Amiga.
Why is big endian resource hungry?
(perhaps why Intel x86 is little endian)
The 68060 is definitely microcoded to some degree; MOVE is split into two "standard operations".
All official Kickstarts contain that & only 68000-based Amigas use it.
However, it can be changed to allow for WHDLoad quit-key support on 68000 by pointing the vector back inside ROM and only passing on to the actual vector if the quit key is not pressed (basically what the ACA500 will do).
I don't know whether Datel Action Replay and the ROMmable HrtMon make use of this functionality also, but I can imagine it would be useful.
I think it's worth supporting the cycles if you're doing a pure 68000 core.
Big vs little endian is an arbitrary choice.
The name comes from Gulliver's travels.
http://en.wikipedia.org/wiki/Lilliput_and_Blefuscu (http://en.wikipedia.org/wiki/Lilliput_and_Blefuscu)
Accessing little endian data on a big endian processor (and vice versa) is an extra overhead. However when emulating a 68000 the overhead is very small because it can't access unaligned data & it just takes a well placed xor of the address lines. 68020 and later plus all x86 processors can do unaligned access, so you have to take those cases into account. I'm not sure there should be any overhead in an FPGA implementation.
On another note... I just found a source in China with the Rev 006 MC68060RC50 chips. They apparently have over 10,000 of them left over from a production run of call center boards. I have a couple of samples coming.
Big endian is humans' natural order; little endian is the computer's natural order.
A xor on the address lines is just your view as a programmer. Where are address lines A0 and A1 when you are dealing with a 32-bit data bus ?
Little endian is no more natural for a computer than big endian. The only difference in hardware is how you calculate the strobes and shifts. 32-bit values on a 32-bit bus are the same whether it's big endian or little endian; it's only when you access bytes or words that anything changes.
Little endian is no more natural for a computer than big endian. The only difference in hardware is how you calculate the strobes and shifts. 32-bit values on a 32-bit bus are the same whether it's big endian or little endian; it's only when you access bytes or words that anything changes.
xor'ing addresses with 3 for bytes and 2 for words just affects how the host CPU generates the strobes and shifts, as most CPUs use byte addresses for everything.
In hardware then you will always have to calculate the strobes and shifts from the low bits of the byte addresses and whether you calculate them for big endian or little endian makes no difference in terms of complexity. If you're doing anything extra for big endian then you're doing it wrong.
You're right! My OS4 laptop is way more charming than any of my PC laptops or my iBook. I'm really happy I can go to dinner with that instead of the others...
I dunno, when you stop to think about it, most big-endian processors that support different size operations on register operands tend to do so as if they were little endian. For example, .b and .w operations on 68000 registers always affect the least significant byte and 16-bit word respectively. Similarly the PowerPC performs byte and halfword operations on the least significant portion of the register. However, when it comes to memory operands (where supported), suddenly it's a different matter.
Conversely, little-endian processors like x86 tend to be consistent in that accessing a byte at a particular address modifies the least significant byte of any wider type considered to exist at that same address, as if that address were just another register.
It seems odd that such a large pile of them is sitting around somewhere when people are having such a terrible time getting them.
Hi,
What speed is your laptop, and who makes it? (stupid question probably Apple)
smerf
I was told that these are all genuine parts, with the one supplier a bit desperate to unload all of them. By the way, the going rate for all of the suppliers I found is about $50 per CPU.
I think this is about the price I paid for a rev 6 here in Germany a few years ago. Might be mistaken though. Anyway, sounds reasonable.
Ummm... I don't see that they are even remotely rare. I found 15 different sources with varying amounts located in Malaysia, Taiwan, and mainland China.
As long as the part works, I would not care if they were knock-offs... but they would definitely have to pass a test rig setup and function well beyond the 50MHz rating. I was told that these are all genuine parts, with the one supplier a bit desperate to unload all of them. By the way, the going rate for all of the suppliers I found is about $50 per CPU. I am not sure how that compares to the normal pricing. I just so happened to ask one of my part suppliers in China for the latest micro pricing, and the 68060 appeared on the list. I was kind of shocked to see that, so then I sent an inquiry to a bunch of chip wholesalers about the 68060. There are a lot of them out there, folks!