Author Topic: Full 68060 implementation? (Read 10215 times)

matthey · « **on:** December 13, 2012, 03:15:40 AM »

Quote from: freqmax;718824

Does there exist enough documentation and FPGA fabric size to make a full soft-HDL 68060 processor? and at full speed? (preferably with Xilinx Spartan series, Virtex is expensive)

A 100% logic-equivalent copy is not possible in an fpga, but a fully functionally equivalent soft CPU is. Note that muxes are much slower and bigger in an fpga than in silicon, which messes up the timing as other logic elements are forced to wait. A fast fpga can more than make up for this deficiency and others, but an optimal design will be different and optimized for an fpga, likely even for a particular fpga. My understanding is that a 68060-like fpga CPU design could be programmed for:

1) speed in the fpga (fastest in fpga, programming can get messy with optimizations)
2) small size in an fpga (for small fpga, slower)
3) burning in an ASIC (slower in fpga but fastest in ASIC and potentially easiest to program and maintain)
4) compatibility close to cycle exact (slower in fpga, difficult to make as deep understanding of a 68060 and logic testing or copying would be necessary)

Personally, I would go for #3 even though speed may be 20% lower than the same soft CPU optimized for speed (probably talking 100MHz speed in affordable fpgas). It would be easier to make enhancements and changes (which can also increase speed), faster to program and we could always go ASIC.

Quote from: freqmax;718824

This could then be used to mitigate shortage, insane prices and make tweaks possible.

The last revision of the 68060 is the one that is expensive and difficult to find. You should be able to find slower ones for between $50 and $100. Ask MikeJ what the slower ones are going for in China. Even the older RC revisions have an MMU, an FPU and fewer bugs than any 68k fpga CPU yet.

matthey · « **Reply #1 on:** December 13, 2012, 04:44:48 AM »

Quote from: freqmax;718841

I think functionally equal is a good goal. But ASIC is no good. 1) It's darn expensive 2) Any bugs will be unfixable

I did not mean to program an ASIC, I meant to program an fpga as one would do if they were later going to burn an ASIC. The logic would be tested and bugs fixed before even contemplating burning an ASIC but that would be an option for later. Having a tested fpga CPU ready to burn could open up possibilities like partners that would be willing to help with the expense part. The 68k FIDO CPU, for example, is burned by a company that specializes in fpgas and making ASICs from them. They have fpga code for many old chips that are no longer available. Several of them can be put in one fpga to reduce component cost and new software programming cost of a newer chip for embedded or retro systems.

matthey · « **Reply #2 on:** December 13, 2012, 02:51:33 PM »

Quote from: mikej;718850

They are really cheap - you can even get brand new ones for ~20USD.
Getting the latest mask set is Really tricky.
/MikeJ

That's even cheaper than I thought. A 50-60MHz 68060 is already a big improvement over what the average Amiga user is using. Supporting faster memory speeds and easier overclocking will make it faster than most of the old 68060@50MHz accelerators.

matthey · « **Reply #3 on:** December 13, 2012, 03:50:43 PM »

@mikej
The 68060.library can make the older mask 68060s reliable with very little decrease in speed. It would be great if the 68060.library was installed from flash/kickstart so that bootable games could work. The older mask 68060s are not nearly as overclockable but still overclockable. They should be good for 54-66MHz depending on cooling and the ambient temperature where they are used. Add to this memory that is 25-50% faster than the old accelerator SIMMs and it should be significantly faster than a CSMK2 with 68060@50MHz. The newest mask 68060 is very nice though and worth the premium to me to fix all bugs and overclock to 100MHz. I have an extra one waiting.

matthey · « **Reply #4 on:** December 13, 2012, 06:25:27 PM »

Quote from: billt;718885

Though I myself would not try to do an exact 68060 clone. The 060 has some instructions removed that were present in 68040 for example. If I were to make an FPGA 680x0, I'd put them all back in, and avoid trapping to software emulation.

It's not necessary to put all the trapped instructions back in. It made sense to get rid of CAS2, CHK2, CMP2 and MOVEP. Getting rid of the integer 64 bit result MULx was a mistake as compilers commonly use it to replace a divide by a constant with a multiply by the constant's reciprocal. It wouldn't have been so bad if they had at least defined and allowed a MULx where Sz=1 (64 bit result) and Dl (result register low) = Dh (result register high), giving the upper 32 bits of the result (like PPC MULH). This is worthwhile to define even with 64 bit results available as it saves trashing a register when the lower 32 bits are not needed. See MULS and MULU here:

http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf
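To make the "invert and multiply" trick concrete, here is a minimal Python sketch, assuming an unsigned divide by 10. The helper names `mulhu` and `div10` are made up for illustration; `mulhu` only models what a MULH-style instruction would return, it is not taken from the 68kF document.

```python
MASK32 = 0xFFFFFFFF

def mulhu(a, b):
    """Upper 32 bits of the unsigned 32x32 -> 64 bit product
    (what a MULH-style instruction would leave in one register)."""
    return ((a & MASK32) * (b & MASK32)) >> 32

def div10(n):
    """Unsigned 32 bit n // 10 without a divide.
    0xCCCCCCCD is ceil(2**35 / 10); the multiply-high supplies a
    shift of 32 and the final shift of 3 supplies the rest."""
    return mulhu(n, 0xCCCCCCCD) >> 3
```

Only the upper half of the product is ever used, which is why a form that returns just the high 32 bits saves a register.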

The integer 64 bit result DIVx is used much less commonly but it is much slower to do in software using the shift method. MOVEP is used in older Amiga software (mostly games) but it is poorly encoded, has limited usefulness and most patched games have already removed them. CAS2 and CHK2 are uncommon supervisor instructions. CMP2 is user mode but has limited usefulness as designed and it's very rare. WHDload mentions only 1 known game and 1 demo as I recall. It would be better to install the 68060.library from flash or kickstart so that traps are never a problem. Some instructions and addressing modes are better trapped and simplifying gives gains elsewhere. Notice that I removed (for trapping) the double indirect addressing modes that used the outer displacement in the 68kF ISA pdf above as there was little advantage (only useful when no free registers) and simplifies the decoder. If all remaining full extension word format addressing modes could be 1 cycle faster then it would be worth it.
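For anyone wondering why the shift method is slow, this is roughly what a software replacement has to do. It is a minimal Python sketch of restoring shift-and-subtract division, assuming an unsigned 64 bit dividend and a 32 bit divisor. The function name is made up, and raising exceptions stands in for the condition codes and trap a real CPU would use.

```python
def divu64_shift(dividend, divisor):
    """Unsigned 64 / 32 -> 32 bit quotient and remainder, one bit per loop.
    Simplification: errors are raised as Python exceptions."""
    if divisor == 0:
        raise ZeroDivisionError("divide by zero")
    rem = dividend >> 32            # upper half starts as the partial remainder
    low = dividend & 0xFFFFFFFF     # lower half is shifted in bit by bit
    if rem >= divisor:
        raise OverflowError("quotient does not fit in 32 bits")
    quot = 0
    for _ in range(32):             # 32 iterations, each with a compare and subtract
        rem = (rem << 1) | (low >> 31)
        low = (low << 1) & 0xFFFFFFFF
        quot <<= 1
        if rem >= divisor:
            rem -= divisor
            quot |= 1
    return quot, rem
```

Thirty-two trips through a loop with a compare in each one is the cost being paid every time the instruction traps.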

The SWAP instruction in the 68060 should have worked in both integer units which was an oversight. The result is longword for forwarding and is common. The bitfield instructions should have been made faster which is possible and they have 32 bit results for forwarding. Of course bigger caches, a link stack and instruction combining like the ColdFire has would be done at minimum if modernizing the 68060.

Quote from: billt;718885

I just did a very small microprocessor design for a Computer Architecture class that finished last night. Very small as in a total of 6 instructions, including load, store, and two kinds of branching, leaving only two ALU operations, add and subtract of BCD numbers. 8bit instruction and 16bit memory address bus, and four registers. But it was nice to learn how this stuff works and the fundamentals of how to approach designing a processor. I actually wrote some assembly language code for what I'd want such a terribly simple thing to do before doing the logic design. You break down the instructions into stages (not pipeline stages, but individual steps taken to complete a single instruction at a time), and then you can extract your hardware logic design based on that. I found it very interesting, and I was surprised at how complicated it was NOT, even for my uselessly simple thing. Sure, a more complete and useful design like a 680x0 will be bigger and more complex than this, but it's not absurdly complicated as I would have imagined previously. That said, it would still be a great deal of work. Could be fun for someone with the time.

Sounds like fun

. It's not rocket science and it's very logical but I imagine complex and time consuming when doing more advanced design and coding.

Quote from: billt;718885

TG and Yaqube have taken some steps in improving the TG68 sortof in that direction. I thought I'd seen something about the Suska guy taking his 68000 core up to 68020 or 030 but I didn't see anything available last I checked. I think the aoocs guy has a 68000 core as well, not sure what his plans for the future of that are. There's also closed-source Natami CPU, but I'm not sure what's happening with that anymore. But they are things you can look at for inspiration.

68020+ support should be the minimum for the Amiga IMO. It makes programming much easier than the 68000 while offering significantly better speed and code density. The Natami CPU (N68050) is not finished and was only partially supporting the 68020 last I understood but it is fairly advanced as far as cache and pipeline design. I don't think Jens is working on it much if any anymore. He has talked about making it open source though. Gunnar is working on a soft CPU based on it but he is not very reliable. He claims to be experimenting with a 200MHz softcore although he increased the pipeline length significantly in order to do it. This increases branch penalties and can cause other stalls much like a highly clocked DSP, GPU or x86 CPU. Even if I could believe him, it's experimental at best and Gunnar has a history of not completing much.

matthey · « **Reply #5 on:** December 13, 2012, 08:13:47 PM »

Quote from: ChaosLord;718904

But from an engineering standpoint I simply don't know how to implement it.
If it can't be done as fast as the other multiplies then u hafta stop the whole pipeline from moving while u spend 2-4 cycles to complete the instruction.

Ok now that I think about it there MUST be a mechanism for stopping the pipeline because all those weirdo addressing modes in multiword instructions take multiple cycles to complete.

Motorola said they removed the 64 bit integer MULx and DIVx to simplify the pipeline while Gunnar said they were no problem to implement. I think they do add complexity to the pipeline which can slow the CPU and this was seen as not being worthwhile for the benefit. The engineers likely did not realize the benefit or did evaluations before compilers started using the invert and multiply for divide trick and before 64 bit operations and CPUs became more common. I am not a hardware design guy that could dig out the truth so this is my best guess.

Quote from: ChaosLord;718904

Maybe the problem was that its bitpattern overlapped the bitpattern of a normal multiply? I donno. But if that is what messed them up they could have invented a NEW completely different bitpattern for large multiplies.

I don't think it would have been a problem to use this existing undefined bit pattern but a completely new encoding of MULH would have been acceptable also.

Quote from: ChaosLord;718904

I used to tell Gunnar all the time that its ok to trap those weirdo hypercomplicated addressing modes. But he said they worked out an optimized way for the address unit to handle it without needing to trap.

Maybe he had an idea at one point but he did not have the double indirect modes implemented in the Apollo. Then again, he changes his experimental designs quite often so maybe it would work one day and not the next.

Quote from: Matt Hey

The SWAP instruction in the 68060 should have worked in both integer units which was an oversight.

Quote from: ChaosLord;718904

I wonder if they had a reason?

There were several instructions that were left out of the 2nd integer unit (sOEP), probably to save space on less commonly used instructions, but this one is common and very simple. It wouldn't have been as bad if immediate shifts worked with counts greater than 8 but, as it is, a shift is 2x as fast as SWAP on the 68060. A modern 68060 would probably have a more balanced sOEP as CPU size is not as important now. Some instructions would still be pOEP only as they are uncommon or don't make sense to execute in parallel.
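The immediate count limit matters because SWAP is just a rotate by 16, and the 68k immediate shift/rotate forms only encode counts of 1 to 8. A small Python sketch of the equivalence, with made-up helper names and ignoring condition codes:

```python
MASK32 = 0xFFFFFFFF

def swap(x):
    """Exchange the upper and lower 16 bit halves of a 32 bit value."""
    x &= MASK32
    return ((x << 16) | (x >> 16)) & MASK32

def rol_imm(x, count):
    """32 bit rotate left with an immediate count, limited to 1..8
    like the 68k immediate shift/rotate encoding."""
    if not 1 <= count <= 8:
        raise ValueError("immediate count must be 1..8")
    x &= MASK32
    return ((x << count) | (x >> (32 - count))) & MASK32

def swap_via_rotates(x):
    """Two immediate rotates of 8 give the same result as SWAP."""
    return rol_imm(rol_imm(x, 8), 8)
```

So replacing one SWAP takes two immediate rotates (or a count in a register), which eats the speed advantage.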

Quote from: ChaosLord;718904

Without clever optimizations u won't get 68060 speeds using affordable FPGA chips.

True.

Quote from: ChaosLord;718907

What are these bugs that everyone is always speaking of?

Is there a list somewhere?

Are they serious? Or mainly just technical? Or ?

http://cache.freescale.com/files/32bit/doc/errata/MC68060DE.pdf

The most serious bugs can be worked around but need the 68060.library.

matthey · « **Reply #6 on:** December 13, 2012, 08:28:25 PM »

Quote from: Blinx123;718910

Is there any particular reason for people still calling the Natami CPU a 68050?
Last I read, it was called a 68070.

Jens coded the N68050, which was far enough along that he was trying to adapt it to work in the Natami fpga up to several months ago. The N68070 was Gunnar's design (based on the N68050 and possibly with Jens's help) of a superscalar (2 integer units) CPU which became the Apollo CPU when he split from the Natami.

matthey · « **Reply #7 on:** December 13, 2012, 09:12:52 PM »

Quote from: ChaosLord;718932

I wish u wouldn't call it Apollo. Wayyyyy to confusing.

We should agree to call it 070. Or N68070 to be precise.

It's confusing but accurate. Gunnar made lots of changes to the N68070 design after he left so it's not really the same anymore. Then again, I can't really separate what's hype and what's real when we are talking about Gunnar so maybe it would be best not to refer to them at all :/.

Quote from: ChaosLord;718932

My brain just melted.

My work around is never use those first 2 mask sets.

Is that the problem with the $20.00 060s? They are the buggy prototype versions?

They are not prototype masks. They just didn't have the bugs fixed yet. The 1f43g mask has all the bugs listed. I had one of these marked 50MHz and the system was reliable although it didn't even overclock to 60MHz which was bad luck. The 1g65v mask fixes 1 bug and the one I had (marked 50MHz) overclocked to 60MHz reliably but not 66MHz which I hear a few will do. There are buggy masks that have higher clock ratings like 60MHz. The 0e41J mask has all known bugs fixed (all on the errata) and can generally be clocked between 90MHz and 105MHz. It had a die shrink that allows for this. Most if not all are still marked 50MHz and there are fakes with the mask changed. All other masks are not full 68060s with MMU and FPU which I do not recommend. The 68060 may not be reliable in an Amiga without an MMU.

Quote from: ChaosLord;718934

He/they actually started adding a 2nd integer unit??

Last I remember the superscalar stuff was "something to do in the future".

That's what Gunnar claimed anyway. The 2nd integer unit was actually what he called a cheater or helper integer unit at first. It could not do calculations, only immediates and register direct which is still good. That's when the Apollo fit inside of a normal sized fpga. He has since made it bigger with 2 units targeting larger fpgas. He considers the Natami dead and no longer a potential target.

matthey · « **Reply #8 on:** December 14, 2012, 02:34:27 AM »

Quote from: freqmax;718967

Is there any software that use 68060 (or 030, 040) specific "features" that make them incompatible if the processor isn't cycle exact and uses the same behaviour?

Programmers relied less and less on CPU timing with the later 68k. On the 68000, it was possible to map out the exact state of the CPU and what it was doing from cycle to cycle depending on the code. The pipeline was very short, there was practically no cache and the memory speed from Amiga to Amiga was pretty close. Relying on CPU timing was generally safe with the exception of the processor clock speed difference between NTSC and PAL, which caused a few games to fail. The later addition of true fast memory was enough of a timing change to kill a fair number of games.

Cycles were a little more difficult to count on the 68020 and 68030 (overlapping cycles) with a little longer pipeline, small caches and a difference in memory timing. Programmers started to learn it wasn't such a good idea to make timing assumptions and they had the benefit of testing on a much wider variety of CPUs with widely different timing. The 68040 and 68060 didn't break much with timing differences. Incompatibilities are usually due to other reasons like a difference in cache size (self modifying code) or not having their CPU libraries installed, which allow them to behave like previous CPUs despite widely different timing, but it wasn't so much of a problem by this time.

Timing on the 68060 is almost impossible to predict. The execution of integer code can vary from one execution to the next depending on which integer unit is currently used, how much of the instruction stream has been prefetched, what's in the caches (now including the branch cache), instruction folding, etc. The Motorola documentation does not contain all the information needed to make a cycle exact 68060. Testing could reveal more info but it's really rather pointless. In theory, there isn't any software that relies on the timing of the 68060. In reality, there are probably a few demos and games that would fail if the timing varies very much. They would probably fail first from other CPU enhancements like faster clock speed, faster memory and faster and bigger caches, or from custom chip enhancements like a faster blitter and CIA timing changes, but it's not enough of a problem that we hear Amigans with 68060@100MHz complaining (the old bugs can be patched too).

matthey · « **Reply #9 on:** December 14, 2012, 05:50:51 AM »

Quote from: freqmax;718973

Is there any CPU specific behaviors other than what the assembler instructions specify. That any software is dependent on?

(Like instruction XX flipping register bit Y when in Z mode etc..)

The behavior of existing instructions and addressing modes is already defined in earlier 68k processors and if it doesn't match, it's a bug and will likely crash. I do know of some software that avoids bugs that might be in the 68060 but it will run on 68060s without the bug as well as other 68k processors and does not depend on the behavior of the 68060. I have also seen documentation of the 68060 changed because it was incorrect (but not a bug). This would be more likely to affect early hardware developed for the 68060 but could affect drivers (software) that rely on a particular timing of such early hardware. I don't foresee many incompatibility problems (and those can be fixed in fpga) when running 68060 code on a non-cycle exact advanced 68k fpga CPU.

Quote from: ChaosLord;718974

Several years ago I coded a special fx. It does some gfx calculations then does a full screen rotation. Obviously it burns a lot of cycles so I timed it often to see how slow it was.

Each time I do the fx, I either get time A or I get a totally different and much slower Time B. I never get anything in between. Its very very confusing to me.

Sometimes the routine goes at the speed I want and other times much slower. It makes no sense. Its like it has 2 gears it can run in.

My guess would be that some cache (could be the branch cache also) gets flushed. It could show like that if inadvertently synced to a task switch. You could try turning off different caches individually or disabling multitasking to see if it makes the timing closer. I often find minor differences in speed myself. It's like the 68060 is alive but chaos is ordered once the complexity is understood.

matthey · « **Reply #10 on:** December 14, 2012, 01:48:37 PM »

Quote from: freqmax;719017

Have look at alignment issues for those timing issues. Especially in combination with pre-fetch (L-cache).

The 68060 does a fantastic job of handling misaligned data, especially on reads. By the documentation, a cycle is lost here and there but I have found no measurable speed difference by aligning code (I-cache) for example. This is in contrast to the 68020/68030 where aligning branch targets to a longword can result in ~5% speedup.

matthey · « **Reply #11 on:** December 15, 2012, 12:56:06 PM »

Quote from: ChaosLord;719184

At one point I was thinking of going back thru all my asm code and massaging it so that all my popular branch targets were longword aligned... But then I never did it. What is the command for doing that in Devpac?

Most assemblers will accept:

CNOP 0,4 ;longword align
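To see how much padding the directive costs at a given spot, here is a minimal Python sketch of the usual CNOP offset,alignment rule (pad until the address is `offset` bytes past a multiple of `alignment`). The function name is made up, and what an assembler actually fills the gap with is not modelled.

```python
def cnop_padding(pc, offset, alignment):
    """Number of pad bytes CNOP offset,alignment inserts at address pc."""
    return (offset - pc) % alignment

def aligned_pc(pc, offset, alignment):
    """Address of the first byte after the padding."""
    return pc + cnop_padding(pc, offset, alignment)
```

For example, `CNOP 0,4` at address $1002 costs 2 bytes and nothing at $1004, while `CNOP 0,16` can cost up to 14 bytes since 68k code addresses are always even.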

Quote from: ChaosLord;719184

Maybe I didn't bother to do the massage because there is no benefit on 040 or 060?

Does 040 get any benefit from longword aligned branch targets?

The 040 handles misalignment well like the 060. It's probable that a cycle is saved from time to time by aligning code, but aligning code can result in less code in a cache line which can cost a cycle from time to time. It may still be effective to align the start of commonly used code to a longword (maybe even to a cache line with CNOP 0,16 but that starts to become wasteful of memory if the code is not extremely common), which is easy enough and doesn't waste cache. Even on the 040/060 where there is no penalty to read any part of a cache line, it still takes 2x as long to load 2 cache lines as 1. The 060 at least does such a good job of code alignment and code caching that I was unable to time a significant difference by aligning code. Many modern processors can't do this.
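The "2 cache lines instead of 1" point is easy to check with a little arithmetic. A minimal Python sketch, assuming the 16 byte line size implied by CNOP 0,16; the function name is made up:

```python
LINE_SIZE = 16  # bytes per cache line, as implied by CNOP 0,16 above

def lines_touched(start, length, line_size=LINE_SIZE):
    """How many cache lines a block of code at `start` occupies."""
    if length <= 0:
        return 0
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1
```

A 16 byte loop starting at $1000 sits in one line, while the same loop starting at $100A straddles two, so it needs two line fills before it runs from cache.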

Quote from: ChaosLord;719184

And by branch targets, does that include bne as well as bsr/jsr and JMP?

Yes, I believe this includes Bcc branch targets. All instruction fetches on the 020 are longword and the 020 is delayed by fetches over a word. I would avoid using NOP instructions for alignment in any code that is executed.

matthey · « **Reply #12 on:** December 15, 2012, 05:07:04 PM »

Quote from: ChaosLord;719204

Calling it 68050 was never cool.

It had new instructions (addressing modes) so if I wrote code for it things would get really messed up!

I could say "This game requires 68050+" but that would be a lie because it would not work on 060.

There can be multiple CPU lines by multiple manufacturers. You could say it requires N68050+ (The Motorola line was M68k). Another option would be to specify the ISA like "68kF1+". That is usually what happens with ARM where there are *many* manufacturers and processor names.

Quote from: psxphill;719207

Adding new instructions or addressing modes is not cool. We need something that implements an 060, fpu & mmu in an fpga. I don't care if it's super scalar or supports out of order execution.

Why is it not cool to add new instructions and addressing modes? What if it can be done in a 99.999% compatible way and offer 5-15% speed and code density improvements at the same clock speed? The 68060 is missing some features that modern processors have, some of which would have improved it very much. Also, it could be made to run ColdFire code with only minor changes. I can see focusing on making a compatible and tested 68020+ CPU first but there is good reason to modernize the 68k and we need ISA standards to do it. Do you think ARM or x86 would be where they are today if they had not changed? Were the 68020+ ISA changes a waste to you? Do you understand enough to have an educated opinion?

Quote from: freqmax;719214

Can one skip out-of-order execution, super scalar, etc.. and still run 68060 code?

What are the minimum feature set that has to be implemented? (albeit slow..)

The code on a 68060 is generally not aware of the superscalar execution. The 68060 resets with superscalar execution turned off and a bit in the PCR (requires supervisor mode) must be turned on to enable it.
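As a rough illustration of the bit twiddling involved, here is a Python sketch. It assumes the enable bit is bit 0 of the PCR (ESS, as I recall it from the 68060 User's Manual), so check the manual before relying on that. The sample register value and function names are made up, and on real hardware the read and write would be MOVEC instructions executed in supervisor mode.

```python
PCR_ESS = 1 << 0  # ASSUMPTION: "enable superscalar" bit position, verify against the manual

def enable_superscalar(pcr):
    """Return the PCR value with the superscalar enable bit set."""
    return (pcr | PCR_ESS) & 0xFFFFFFFF

def disable_superscalar(pcr):
    """Return the PCR value with the superscalar enable bit cleared."""
    return pcr & ~PCR_ESS & 0xFFFFFFFF

def is_superscalar(pcr):
    """True if the superscalar enable bit is set in this PCR value."""
    return bool(pcr & PCR_ESS)

# Hypothetical PCR value as it might read after reset (enable bit clear).
after_reset = 0x04300100
```

Everything except the one bit is passed through untouched, which is what a read-modify-write of the register should do.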

Out-of-Order (OoO) execution is a completely different concept from superscalar execution, where multiple units execute the same code in parallel. Most modern processors use OoO but there are some very powerful exceptions. It generally makes better use of the execution units at the cost of complexity and size. No 68k or ColdFire CPU has ever been OoO that I'm aware of. There was talk of partial OoO execution for division on the N68k. This would allow non-dependent instructions to execute while the costly division is calculated. I believe IBM has done something similar before.

matthey · « **Reply #13 on:** December 15, 2012, 07:53:02 PM »

Quote from: psxphill;719242

a) I'd rather actually see something that can run 68060 code, the extra effort to adding 68020+68882 instructions isn't a big deal to me. Time better spent on a 68060 compatible mmu.

There were few user mode additions to the 68060 that made it incompatible to the 68020/68030. The only user mode instruction that I can think of is MOVE16 and it's not used by any compiler that I have seen. You can take 68060 optimized compiler code and run it on a 68020-68040 in almost every case, it just won't be quite as fast.

Quote from: psxphill;719242

b) Out of order execution is something that 68070 was going to have, but again taking time on things that aren't necessarily needed just means it never comes out.

It was to be superscalar and not OoO except for possibly divide. It would not be considered OoO because of the divide although you could, maybe in the loosest sense, get away with calling it a superscalar OoO hybrid.

Quote from: psxphill;719242

Oh yeah, 68060 compatible cache would be good to.

Most code knows nothing of the cache nor does anything with it except to flush it for DMA or loading/modifying code. The main thing for Amiga compatibility is to have a selectable cache size if copyback caching is used, especially if the cache size grows as is likely for a new CPU. The 68060 had a 1/2 cache size setting that made its caches the same size as the 68040's (which already broke many things compared to the tiny cache in the 020/030). The N68k was going to have writethrough caching only, with bus snooping and automatic instruction cache invalidation on writes, which would have given maximum compatibility for self modifying code.
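A toy model may help show why self modifying code cares about the cache policy. This Python sketch is an illustration only, not how any real 68k cache is built: one "line" per address, no cache size limit, and the class and method names are made up.

```python
class ToyCPU:
    """Separate instruction and data caches over a dictionary 'memory'."""

    def __init__(self, memory, copyback, snoop_invalidate):
        self.memory = dict(memory)
        self.icache = {}                  # instruction cache contents
        self.dirty = {}                   # data held only in a copyback data cache
        self.copyback = copyback
        self.snoop_invalidate = snoop_invalidate

    def fetch(self, addr):
        """Instruction fetch: fill from memory on a miss, then hit forever."""
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def store(self, addr, value):
        """Data write, e.g. code patching itself."""
        if self.copyback:
            self.dirty[addr] = value      # memory is not updated yet
        else:
            self.memory[addr] = value     # writethrough
        if self.snoop_invalidate:
            self.icache.pop(addr, None)   # drop the stale instruction

    def flush(self):
        """What well behaved code does after modifying instructions."""
        self.memory.update(self.dirty)
        self.dirty.clear()
        self.icache.clear()
```

With copyback and no invalidation, a fetch after a store still returns the old instruction until `flush()` is called. With writethrough plus invalidation, the new instruction is seen straight away. In a real cache the stale entry would eventually be evicted, and the bigger the cache, the longer it can survive, which is why a selectable size helps old software.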

Quote from: psxphill;719242

It's just ego masturbation. The speed improvement isn't worth the effort and definitely not worth fragmenting the user base. Plus it wouldn't be so bad if it could actually run all 68060 code, but as they never implemented an mmu it can't.

The 5-15% speed increase would be the "average" of a program optimized for the 68kF ISA. It's not only possible, but likely IMO, that you could see a 25-50% speed up of some codecs including picture and video processing. The 68060 is missing (predates) a lot of speedups for this type of code.

A lack of MMU will not affect very much code. Most code that can use the MMU has an option for turning it off. The FPU is more important for user mode compatibility although an MMU may be needed for Amiga system reliability using the 68060. I would like to see an fpga MMU but the benefit on the Amiga is small as the AmigaOS doesn't use it.

Quote from: ChaosLord;719243

That is better than calling it N68050.

Or we could just call it N68070 and be done with it

Altho the FPGAReplay guys will have theirs out first so it will be F68070 or R68070

The fpga Arcade CPU is already called the TG68. You better ask before changing the name on them. The N68k naming conventions probably don't matter anymore as the Natami is likely dead or the name (where the 'N' comes from) will be renamed by Thomas Hirsch :/.
