Amiga.org
Amiga computer related discussion => Amiga Hardware Issues and discussion => Topic started by: freqmax on December 13, 2012, 01:10:57 AM
-
Does there exist enough documentation and FPGA fabric size to make a full soft-HDL 68060 (https://en.wikipedia.org/wiki/Motorola_68060) processor? and at full speed? (preferably with Xilinx Spartan series, Virtex is expensive)
This could then be used to mitigate shortage, insane prices and make tweaks possible.
-
Sure, if you want to spend the years to write one. FPGA size is not the problem.
-
Does there exist enough documentation and FPGA fabric size to make a full soft-HDL 68060 (https://en.wikipedia.org/wiki/Motorola_68060) processor? and at full speed? (preferably with Xilinx Spartan series, Virtex is expensive)
A 100% logic equivalent is not possible in fpga but a fully functionally equivalent soft CPU is possible. Note that muxes are much slower and bigger in an fpga than in silicon which messes up the timing as other logic elements are forced to wait. A fast fpga can more than make up for this deficiency and others but an optimal design will be different and optimized for an fpga, likely even for a particular fpga. My understanding is that an fpga 68060 like CPU design could be programmed for:
1) speed in the fpga (fastest in fpga, programming can get messy with optimizations)
2) small size in an fpga (for small fpga, slower)
3) burning in an ASIC (slower in fpga but fastest in ASIC and potentially easiest to program and maintain)
4) compatibility close to cycle exact (slower in fpga, difficult to make as deep understanding of a 68060 and logic testing or copying would be necessary)
Personally, I would go for #3 even though speed may be 20% lower than the same soft CPU optimized for speed (probably talking 100MHz speed in affordable fpgas). It would be easier to make enhancements and changes (which can also increase speed), faster to program and we could always go ASIC ;).
This could then be used to mitigate shortage, insane prices and make tweaks possible.
The last revision of the 68060 is the one that is expensive and difficult to find. You should be able to find slower ones for between $50 and $100. Ask MikeJ what the slower ones are going for in China. Even the older RC revisions have an MMU, FPU and less bugs than any 68k fpga CPU yet.
-
A 100% logic equivalent is not possible in fpga but a fully functionally equivalent soft CPU is possible. Note that muxes are much slower and bigger in an fpga than in silicon which messes up the timing as other logic elements are forced to wait. A fast fpga can more than make up for this deficiency and others but an optimal design will be different and optimized for an fpga, likely even for a particular fpga.
+1
Even the older RC revisions have an MMU, FPU and less bugs than any 68k fpga CPU yet.
:roflmao: good point!
-
I think functionally equivalent is a good goal. But ASIC is no good. 1) It's darn expensive 2) Any bugs will be unfixable
-
I think functionally equal is a good goal. But ASIC is no good. 1) It's darn expensive 2) Any bugs will be unfixable
I did not mean to program an ASIC, I meant to program an fpga as one would do if they were later going to burn an ASIC. The logic would be tested and bugs fixed before even contemplating burning an ASIC but that would be an option for later. Having a tested fpga CPU ready to burn could open up possibilities like partners that would be willing to help with the expense part. The 68k FIDO CPU, for example, is burned by a company that specializes in fpgas and making ASICs from them. They have fpga code for many old chips that are no longer available. Several of them can be put in one fpga to reduce component cost and new software programming cost of a newer chip for embedded or retro systems.
-
What about a less complicated CPU (68000/68020) running at a high speed (100 MHz)?
-
What about a less complicated CPU (68000/68020) running at a high speed (100 MHz)?
Already covered by MikeJ and his main developer. ;)
But it could serve as a starting point.
-
Ask MikeJ what the slower ones are going for in China. Even the older RC revisions have an MMU, FPU and less bugs than any 68k fpga CPU yet.
They are really cheap - you can even get brand new ones for ~20 USD.
Getting the latest mask set is really tricky.
/MikeJ
-
They are really cheap - you can even get brand new ones for ~20 USD.
Getting the latest mask set is really tricky.
/MikeJ
That's even cheaper than I thought :). A 50-60MHz 68060 is already a big improvement over what the average Amiga user is using. Supporting faster memory speeds and easier overclocking will make it faster than most of the old 68060@50MHz accelerators.
-
That's even cheaper than I thought :). A 50-60MHz 68060 is already a big improvement over what the average Amiga user is using. Supporting faster memory speeds and easier overclocking will make it faster than most of the old 68060@50MHz accelerators.
Yes, but they are older revisions and cannot be clocked up - and have bugs.
The E41J mask set is tricky to get at all (most are fakes) and price is negotiable.
I am taking a chip tester with me to screen locally.
/MikeJ
-
Perhaps they can be overclocked with peltier and fluid cooling? or even nitrogen for a few hours?
-
@mikej
The 68060.library can make the older mask 68060s reliable with very little decrease in speed. It would be great if the 68060.library was installed from flash/kickstart so that bootable games could work. The older mask 68060s are not nearly as overclockable but still overclockable. They should be good for 54-66MHz depending on cooling and temperature environment where used. Add to this memory that is 25-50% faster than the old accelerator SIMMS and it should be significantly faster than a CSMK2 with 68060@50MHz. The newest mask 68060 is very nice though and worth the premium to me to fix all bugs and overclock to 100MHz. I have an extra one waiting ;).
-
Does there exist enough documentation and FPGA fabric size to make a full soft-HDL 68060 (https://en.wikipedia.org/wiki/Motorola_68060) processor? and at full speed? (preferably with Xilinx Spartan series, Virtex is expensive)
This could then be used to mitigate shortage, insane prices and make tweaks possible.
The assembly language manuals and the 68060 databook are all that you should need, for someone that knows how to do this kind of thing.
Though I myself would not try to do an exact 68060 clone. The 060 has some instructions removed that were present in 68040 for example. If I were to make an FPGA 680x0, I'd put them all back in, and avoid trapping to software emulation.
If you're redoing things in an FPGA, you have freedom to make improvements on things like that. Want to make the cache bigger? Why not? You need to design a cache controller anyway if you support a cache, so do what you like in a way that is compatible with everything.
I just did a very small microprocessor design for a Computer Architecture class that finished last night. Very small as in a total of 6 instructions, including load, store, and two kinds of branching, leaving only two ALU operations, add and subtract of BCD numbers. 8bit instruction and 16bit memory address bus, and four registers. But it was nice to learn how this stuff works and the fundamentals of how to approach designing a processor. I actually wrote some assembly language code for what I'd want such a terribly simple thing to do before doing the logic design. You break down the instructions into stages (not pipeline stages, but individual steps taken to complete a single instruction at a time), and then you can extract your hardware logic design based on that. I found it very interesting, and I was surprised at how complicated it was NOT, even for my uselessly simple thing. Sure, a more complete and useful design like a 680x0 will be bigger and more complex than this, but it's not absurdly complicated as I would have imagined previously. That said, it would still be a great deal of work. Could be fun for someone with the time.
TG and Yaqube have taken some steps in improving the TG68 sort of in that direction. I thought I'd seen something about the Suska guy taking his 68000 core up to 68020 or 030 but I didn't see anything available last I checked. I think the aoocs guy has a 68000 core as well, not sure what his plans for the future of that are. There's also the closed-source Natami CPU, but I'm not sure what's happening with that anymore. But they are things you can look at for inspiration.
-
Though I myself would not try to do an exact 68060 clone. The 060 has some instructions removed that were present in 68040 for example. If I were to make an FPGA 680x0, I'd put them all back in, and avoid trapping to software emulation.
It's not necessary to put all the trapped instructions back in. It made sense to get rid of CAS2, CHK2, CMP2 and MOVEP. Getting rid of the integer 64 bit result MULx was a mistake as it's commonly used by compilers to replace a divide by a constant with a multiply by the constant's reciprocal (invert and multiply). It wouldn't have been so bad if they had at least defined and allowed a MULx where Sz=1 (64 bit result) and Dl (result register low) = Dh (result register high), giving the upper 32 bits of the result (like PPC MULH). This is worthwhile to define even with 64 bit results as it saves trashing a register when the lower 32 bits are not needed. See MULS and MULU here:
http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf
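To make the invert-and-multiply trick concrete, here is a small Python sketch for unsigned divide by 10. The helper names are made up; the constant 0xCCCCCCCD with a shift of 3 is the usual magic pair for this divisor. Only the upper 32 bits of the 64 bit product are used, which is exactly the case a MULH-style instruction covers.

```python
MASK32 = 0xFFFFFFFF

def mulhu32(a, b):
    """Upper 32 bits of the unsigned 32x32 -> 64 bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32

RECIP_10 = 0xCCCCCCCD  # ceil(2**35 / 10)

def div10(x):
    """Unsigned 32-bit x // 10 without a divide instruction."""
    return mulhu32(x, RECIP_10) >> 3

for x in (0, 9, 10, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    assert div10(x) == x // 10
```

On a CPU without the 64 bit result multiply, that one multiply plus shift turns into a trap or a multi-instruction software routine, which is why losing it hurts compiled code.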
The integer 64 bit result DIVx is used much less commonly but it is much slower to do in software using the shift method. MOVEP is used in older Amiga software (mostly games) but it is poorly encoded, has limited usefulness and most patched games have already removed them. CAS2 and CHK2 are uncommon supervisor instructions. CMP2 is user mode but has limited usefulness as designed and it's very rare. WHDload mentions only 1 known game and 1 demo as I recall. It would be better to install the 68060.library from flash or kickstart so that traps are never a problem. Some instructions and addressing modes are better trapped and simplifying gives gains elsewhere. Notice that I removed (for trapping) the double indirect addressing modes that used the outer displacement in the 68kF ISA pdf above as there was little advantage (only useful when no free registers) and simplifies the decoder. If all remaining full extension word format addressing modes could be 1 cycle faster then it would be worth it.
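For reference, the "shift method" for a 64 bit by 32 bit unsigned divide looks roughly like this in Python. It is a generic restoring shift-and-subtract loop, not the actual 68060.library code, and raising exceptions for divide by zero and overflow is just a choice made for this sketch.

```python
MASK32 = 0xFFFFFFFF

def divu64_32(dividend, divisor):
    """64-bit / 32-bit unsigned divide giving a 32-bit quotient and remainder."""
    if divisor == 0:
        raise ZeroDivisionError("divide by zero")
    rem = dividend >> 32
    low = dividend & MASK32
    if rem >= divisor:
        raise OverflowError("quotient does not fit in 32 bits")
    quo = 0
    for _ in range(32):                    # one quotient bit per iteration
        rem = (rem << 1) | (low >> 31)     # shift next dividend bit into remainder
        low = (low << 1) & MASK32
        quo <<= 1
        if rem >= divisor:                 # trial subtract succeeds: quotient bit is 1
            rem -= divisor
            quo |= 1
    return quo, rem

print(divu64_32(0x0123456789ABCDEF, 0xFEDCBA98))
```

Thirty-two trips around that loop, each with a compare and a conditional subtract, is why the emulated version is so much slower than a hardware divide.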
The SWAP instruction in the 68060 should have worked in both integer units which was an oversight. The result is longword for forwarding and is common. The bitfield instructions should have been made faster which is possible and they have 32 bit results for forwarding. Of course bigger caches, a link stack and instruction combining like the ColdFire has would be done at minimum if modernizing the 68060.
I just did a very small microprocessor design for a Computer Architecture class that finished last night. Very small as in a total of 6 instructions, including load, store, and two kinds of branching, leaving only two ALU operations, add and subtract of BCD numbers. 8bit instruction and 16bit memory address bus, and four registers. But it was nice to learn how this stuff works and the fundamentals of how to approach designing a processor. I actually wrote some assembly language code for what I'd want such a terribly simple thing to do before doing the logic design. You break down the instructions into stages (not pipeline stages, but individual steps taken to complete a single instruction at a time), and then you can extract your hardware logic design based on that. I found it very interesting, and I was surprised at how complicated it was NOT, even for my uselessly simple thing. Sure, a more complete and useful design like a 680x0 will be bigger and more complex than this, but it's not absurdly complicated as I would have imagined previously. That said, it would still be a great deal of work. Could be fun for someone with the time.
Sounds like fun :). It's not rocket science and it's very logical but I imagine complex and time consuming when doing more advanced design and coding.
TG and Yaqube have taken some steps in improving the TG68 sort of in that direction. I thought I'd seen something about the Suska guy taking his 68000 core up to 68020 or 030 but I didn't see anything available last I checked. I think the aoocs guy has a 68000 core as well, not sure what his plans for the future of that are. There's also the closed-source Natami CPU, but I'm not sure what's happening with that anymore. But they are things you can look at for inspiration.
68020+ support should be the minimum for the Amiga IMO. It makes programming much easier than the 68000 while offering significantly better speed and code density. The Natami CPU (N68050) is not finished and was only partially supporting the 68020 last I understood but it is fairly advanced as far as cache and pipeline design. I don't think Jens is working on it much if any anymore. He has talked about making it open source though. Gunnar is working on a soft CPU based on it but he is not very reliable. He claims to be experimenting with a 200MHz softcore although he increased the pipeline length significantly in order to do it. This increases branch penalties and can cause other stalls much like a highly clocked DSP, GPU or x86 CPU. Even if I could believe him, it's experimental at best and Gunnar has a history of not completing much.
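A back-of-envelope model of the pipeline length trade-off, in Python. Every number here (branch fraction, mispredict rate, penalty cycles) is an assumed value for illustration only, not a measurement of the N68050, the Apollo core or any real design.

```python
def throughput_mips(clock_mhz, branch_penalty_cycles,
                    branch_fraction=0.2, mispredict_rate=0.3):
    """Millions of instructions per second for a simple in-order pipeline.

    CPI = 1 + (fraction of branches) * (mispredict rate) * (penalty cycles).
    """
    cpi = 1.0 + branch_fraction * mispredict_rate * branch_penalty_cycles
    return clock_mhz / cpi

short_pipe = throughput_mips(clock_mhz=100, branch_penalty_cycles=3)
long_pipe = throughput_mips(clock_mhz=200, branch_penalty_cycles=8)
speedup = long_pipe / short_pipe
print(f"{short_pipe:.1f} vs {long_pipe:.1f} MIPS, speedup {speedup:.2f}x")
```

With these made-up inputs, doubling the clock gives clearly less than double the throughput because each mispredicted branch costs more cycles. Other stalls (cache misses, multi-cycle instructions) push in the same direction.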
-
I hereby appoint billt to dezine us a superfast 680x0 cpu. We can call it the 68070.
You are hired. :D
-
Getting rid of the integer 64 bit result MULx was a mistake as it's used commonly by compilers to do an invert and multiply by a constant instead of a divide by a constant. It wouldn't have been so bad if they would at least have defined and allowed a MULx where Sz=1 (64 bit result) and Dl (result register low) = Dh (result register high) giving the upper 32 bits of the result (like PPC MULH). This is worth while to define even with 64 bit results as it saves trashing a register when the lower 32 bits are not needed. See MULS and MULU here:
http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf
Philosophically I agree with u.
But from an engineering standpoint I simply don't know how to implement it.
If it can't be done as fast as the other multiplies then u hafta stop the whole pipeline from moving while u spend 2-4 cycles to complete the instruction.
Ok now that I think about it there MUST be a mechanism for stopping the pipeline because all those weirdo addressing modes in multiword instructions take multiple cycles to complete.
So ok, I agree it was a total mistake to take that out.
Maybe the problem was that its bitpattern overlapped the bitpattern of a normal multiply? I donno. But if that is what messed them up they could have invented a NEW completely different bitpattern for large multiplies.
It would be better to install the 68060.library from flash or kickstart so that traps are never a problem.
+2
Some instructions and addressing modes are better trapped and simplifying gives gains elsewhere. Notice that I removed (for trapping) the double indirect addressing modes that used the outer displacement in the 68kF ISA pdf above as there was little advantage (only useful when no free registers) and simplifies the decoder. If all remaining full extension word format addressing modes could be 1 cycle faster then it would be worth it.
I used to tell Gunnar all the time that its ok to trap those weirdo hypercomplicated addressing modes. But he said they worked out an optimized way for the address unit to handle it without needing to trap.
The SWAP instruction in the 68060 should have worked in both integer units which was an oversight.
I wonder if they had a reason?
68020+ support should be the minimum for the Amiga IMO. It makes programming much easier than the 68000 while offering significantly better speed and code density.
+1
The Natami CPU (N68050) is not finished and was only partially supporting the 68020 last I understood but it is fairly advanced as far as cache and pipeline design. I don't think Jens is working on it much if any anymore.
:(((((
He has talked about making it open source though.
Dear God I hope so.
They cooked up really kewl trix that can make a really fast 680x0 cpu!
Making a 680x0 softcore is all about how many clever tricks you can come up with and combine them all together to make something fast.
Without clever optimizations u won't get 68060 speeds using affordable FPGA chips.
Gunnar is working on a soft CPU based on it but he is not very reliable. He claims to be experimenting with a 200MHz softcore although he increased the pipeline length significantly in order to do it. This increases branch penalties and can cause other stalls much like a highly clocked DSP, GPU or x86 CPU. Even if I could believe him, it's experimental at best and Gunnar has a history of not completing much.
If someone would give him some medication to make him calm down and just write code without attacking everybody all the time then he could write good code and finish everything. He could start by trying Lorazepam. If that doesn't work for him then he could try something else.
p.s If someone would clone me then my other me would very happily spend 20 hours a day for 2 years to cook up an awesome 68070. With Jens, Matt, Phil and some other asm guys helping I know we could do it. Sadly I am only 1 person and I just can't devote that kind of time this project. :(
Once I start trying to solve a puzzle I get obsessed and can't stop. So I must stay away from puzzles that I know are large and complex since I have other responsibilities in my life that I must tend to. :(
-
IMO, today you'd never get an FPGA which is big enough and fast enough and cheap enough to rival a real MC68060RC50 chip.
You'd either have to do an FPGA-Arcade philosophy (System on a chip, full system recreations) where in the long term you could charge much more.
Or wait 5-10 years for technology to catch up.
That doesn't mean that as a community you couldn't start now: converting the specifications into HDL for logic synthesis, working on the cache, MMU, FPU, branch prediction and superscalar architecture, and using the free tools to target some of today's FPGAs to see what frequencies you can reach and what area you would need in the future.
I mean you can already get fully open source HDL for 68000, 8088, 8051, 6502, 65816, Z80, ARM, MIPS and countless other CPUs.
But I thought that is what the NatAmi team SAID they were doing with their 68050 design?? I take it they never released anything? Not even design specifications?
-
Yes, but they are older revisions and cannot be clocked up - and have bugs.
What are these bugs that everyone is always speaking of?
Is there a list somewhere?
Are they serious? Or mainly just technical? Or ?
-
But I thought that is what the NatAmi team SAID they were doing with their 68050 design?? I take it they never released anything? Not even design specifications?
They were working on it. They did a lot of stuff using simulation software and ran lots of tests and made progress.
But it is true that all the code is topsecret and never got released.
After Gunnar blew up, everything just stopped AFAIK.
If there is a way for Jens to test his 68050 on an FPGAreplay then I think we could motivate him to resume work on it.
And if we could get him to team up with someone who is emotionally stable and helpful then that would speed things up too.
Is there a way that Jens can test the 050 on the FPGAreplay?
I hope so. But the FPGA chip is so small that I don't know for fact it would fit. :(
In any event some of the tricks they were doing won't work on the FPGAreplay so it WILL be slower than it was planned for Natami. FPGAreplay does not have all the SRAM blocks and stuff that Natami FPGA has.
-
Is there any particular reason for people still calling the Natami CPU a 68050?
Last I read, it was called a 68070.
-
I hereby appoint billt to dezine us a superfast 680x0 cpu. We can call it the 68070.
You are hired. :D
Thank you for your faith in my ability, but there's more to learn before I'd take it up. There's a followup course that I'd like to do, but it's not on the schedule this spring like I had expected, so I don't know when that might happen. I'd also like to take the followup class to the FPGA/VHDL class I took last spring. And I lack time for something like this right now anyway. I took the class because I'm interested in tinkering with TG68 and things like that, and wanted to understand this stuff. But I always struggle to find any time for any projects, and I have some other project ideas that are a lot higher on my priority list.
-
But from an engineering standpoint I simply don't know how to implement it.
If it can't be done as fast as the other multiplies then u hafta stop the whole pipeline from moving while u spend 2-4 cycles to complete the instruction.
Ok now that I think about it there MUST be a mechanism for stopping the pipeline because all those weirdo addressing modes in multiword instructions take multiple cycles to complete.
Motorola said they removed the 64 bit integer MULx and DIVx to simplify the pipeline while Gunnar said they were no problem to implement. I think they do add complexity to the pipeline which can slow the CPU and this was seen as not being worthwhile for the benefit. The engineers likely did not realize the benefit or did evaluations before compilers started using the invert and multiply for divide trick and before 64 bit operations and CPUs became more common. I am not a hardware design guy that could dig out the truth so this is my best guess.
Maybe the problem was that its bitpattern overlapped the bitpattern of a normal multiply? I donno. But if that is what messed them up they could have invented a NEW completely different bitpattern for large multiplies.
I don't think it would have been a problem to use this existing undefined bit pattern but a completely new encoding of MULH would have been acceptable also.
I used to tell Gunnar all the time that its ok to trap those weirdo hypercomplicated addressing modes. But he said they worked out an optimized way for the address unit to handle it without needing to trap.
Maybe he had an idea at one point but he did not have the double indirect modes implemented in the Apollo. Then again, he changes his experimental designs quite often so maybe it would work one day and not the next.
The SWAP instruction in the 68060 should have worked in both integer units which was an oversight.
I wonder if they had a reason?
There were several instructions that were left out of the 2nd integer unit (sOEP) probably to save space from less commonly used instructions but this one is common and very simple. It wouldn't have been as bad if the immediate shift worked with greater than 8 shifts but as is shift is 2x as fast as swap in the 68060. A modern 68060 would probably have a more balanced sOEP as CPU size is not as important now. Some instructions would still be sOEP only as they are uncommon or don't make sense to execute in parallel.
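To show what I mean about the shift count: 68k immediate shift and rotate counts only go from 1 to 8, so exchanging the halves of a longword without SWAP takes two rotates by 8. A quick Python check of the equivalence (function names are made up, and condition codes are ignored, which do differ between SWAP and ROL):

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count):
    """Rotate a 32-bit value left by count bits (value must fit in 32 bits)."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32

def swap(value):
    """Exchange the upper and lower 16-bit halves, as SWAP Dn does."""
    return ((value << 16) | (value >> 16)) & MASK32

x = 0x12345678
print(hex(swap(x)), hex(rol32(rol32(x, 8), 8)))
```

Both expressions give the same register result. If the immediate count could reach 16, a single rotate would do the job.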
Without clever optimizations u won't get 68060 speeds using affordable FPGA chips.
True.
What are these bugs that everyone is always speaking of?
Is there a list somewhere?
Are they serious? Or mainly just technical? Or ?
http://cache.freescale.com/files/32bit/doc/errata/MC68060DE.pdf
The most serious bugs can be worked around but need the 68060.library.
-
Is there any particular reason for people still calling the Natami CPU a 68050?
Last I read, it was called a 68070.
Jens coded the N68050 that was far enough along that he was trying to adapt it to work in the Natami fpga up to several months ago. The N68070 was Gunnar's design (based on the N68050 and possibly with Jens's help) of a superscalar (2 integer units) CPU which became the Apollo CPU when he split from the Natami.
-
Maybe he had an idea at one point but he did not have the double indirect modes implemented in the Apollo.
I wish u wouldn't call it Apollo. Wayyyyy too confusing.
We should agree to call it 070. Or N68070 to be precise.
Then again, he changes his experimental designs quite often so maybe it would work one day and not the next.
Good point.
http://cache.freescale.com/files/32bit/doc/errata/MC68060DE.pdf
The most serious bugs can be worked around but need the 68060.library.
My brain just melted.
My work around is never use those first 2 mask sets.
Is that the problem with the $20.00 060s? They are the buggy prototype versions?
-
The N68070 was Gunnar's design (based on the N68050 and possibly with Jens's help) of a superscalar (2 integer units) CPU which became the Apollo CPU when he split from the Natami.
He/they actually started adding a 2nd integer unit??
Last I remember the superscalar stuff was "something to do in the future".
In fact, where I left off at, the L1 cache was not working yet or had been broken or something.
I really wish someone would finish it.
-
I wish u wouldn't call it Apollo. Wayyyyy too confusing.
We should agree to call it 070. Or N68070 to be precise.
It's confusing but accurate. Gunnar made lots of changes to the N68070 design after he left so it's not really the same anymore. Then again, I can't really separate what's hype and what's real when we are talking about Gunnar so maybe it would be best not to refer to them at all :/.
My brain just melted.
My work around is never use those first 2 mask sets.
Is that the problem with the $20.00 060s? They are the buggy prototype versions?
They are not prototype masks. They just didn't have the bugs fixed yet. The 1f43g mask has all the bugs listed. I had one of these marked 50MHz and the system was reliable although it didn't even overclock to 60MHz which was bad luck. The 1g65v mask fixes 1 bug and the one I had (marked 50MHz) overclocked to 60MHz reliably but not 66MHz which I hear a few will do. There are buggy masks that have higher clock ratings like 60MHz. The 0e41J mask has all known bugs fixed (all on the errata) and can generally be clocked between 90MHz and 105MHz. It had a die shrink that allows for this. Most if not all are still marked 50MHz and there are fakes with the mask changed. All other masks are not full 68060s with MMU and FPU which I do not recommend. The 68060 may not be reliable in an Amiga without an MMU.
He/they actually started adding a 2nd integer unit??
Last I remember the superscalar stuff was "something to do in the future".
That's what Gunnar claimed anyway. The 2nd integer unit was actually what he called a cheater or helper integer unit at first. It could not do calculations, only immediates and register direct which is still good. That's when the Apollo fit inside of a normal sized fpga. He has since made it bigger with 2 units targeting larger fpgas. He considers the Natami dead and no longer a potential target.
-
Is there any software that use 68060 (or 030, 040) specific "features" that make them incompatible if the processor isn't cycle exact and uses the same behaviour?
-
Is there any software that use 68060 (or 030, 040) specific "features" that make them incompatible if the processor isn't cycle exact and uses the same behaviour?
Programmers relied less and less on CPU timing with the later 68k processors. On the 68000, it was possible to map out the exact state of the CPU and what it was doing from cycle to cycle depending on the code. The pipeline was very short, there was practically no cache and the memory speed from Amiga to Amiga was pretty close. Relying on CPU timing was generally safe, with the exception of the processor clock speed difference between NTSC and PAL, which caused a few games to fail. The later addition of true fast memory was enough of a timing change to kill a fair number of games.
On the 68020 and 68030 it was a little more difficult to count cycles (overlapping cycles), with a little longer pipeline, small caches and a difference in memory timing. Programmers started to learn it wasn't such a good idea to make timing assumptions and they had the benefit of testing on a much wider variety of CPUs with widely different timing. The 68040 and 68060 didn't break much with timing differences. Incompatibilities are usually due to other reasons like a difference in cache size (self modifying code) or not having their CPU libraries installed, which allows them to have similar behavior to previous CPUs with widely different timing, but it wasn't so much of a problem by this time.
Timing on the 68060 is almost impossible to predict. The execution of integer code can vary from one execution to the next depending on which integer unit is currently used, how much instruction memory has been prefetched, what's in the caches (now including the branch cache), instruction folding, etc. The Motorola engineers did not release all the information needed to make a cycle exact 68060 from the documentation. Testing could reveal more info but it's really rather pointless. In theory, there isn't any software that relies on the timing of the 68060. In reality, there are probably a few demos and games that would fail if the timing varies very much. They would probably fail first from other CPU enhancements like faster clock speed, faster memory and faster and bigger caches, or from custom chip enhancements like a faster blitter and CIA timing changes, but it's not enough of a problem that we hear Amigans with 68060@100MHz complaining (the old bugs can be patched too).
-
Are there any CPU specific behaviors, other than what the assembler instructions specify, that any software is dependent on?
(Like instruction XX flipping register bit Y when in Z mode etc..)
-
The 68060 is almost impossible to predict timing. The execution of integer code can vary from one execution to the next depending on which integer unit is currently used, how much instruction memory has prefetched, what's in the caches including now branch cache, instruction folding, etc. The Motorola engineers did not release all the information needed to make a cycle exact 68060 from the documentation.
Several years ago I coded a special fx. It does some gfx calculations then does a full screen rotation. Obviously it burns a lot of cycles so I timed it often to see how slow it was.
Each time I do the fx, I either get time A or I get a totally different and much slower Time B. I never get anything in between. It's very very confusing to me.
Sometimes the routine goes at the speed I want and other times much slower. It makes no sense. It's like it has 2 gears it can run in.
I just assumed it was either:
A: When I run the exe, the code happens to load in such a way that the code for that fx is somehow compatible with the pipeline, and other times it is somehow misaligned so that it runs much slower.
B: When I run the exe, the memory for the fx gets allocated in a manner that somehow makes a dramatic difference to the speed of the routine. Maybe a certain memory alignment works better/worse with the cache.
I don't specifically know how either of these is possible. It's just my best guess.
What I should have done was picked up the phone and said "Hey Matt why is my code acting wonky?" :biglaugh:
All my timing tests, data and documentation of this anomaly were in my history file which was lost in a hard drive glitch / brownout. :(
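Here is a toy Python model of how guess B could produce two "gears". It simulates a small set-associative cache (128 sets, 4 ways, 16 byte lines, roughly the shape of the 68060's 8 KB data cache as I understand it) with LRU replacement, which is an assumption since the real chip's replacement policy may differ. The workload is hypothetical too: a loop that reads five buffers in lockstep, one byte at a time.

```python
class SetAssocCache:
    def __init__(self, sets=128, ways=4, line=16):
        self.sets, self.ways, self.line = sets, ways, line
        self.tags = [[] for _ in range(sets)]  # per-set LRU list, most recent last
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.line
        index, tag = block % self.sets, block // self.sets
        lru = self.tags[index]
        if tag in lru:
            lru.remove(tag)          # hit: will be re-appended as most recent
            self.hits += 1
        else:
            self.misses += 1
            if len(lru) == self.ways:
                lru.pop(0)           # evict the least recently used line
        lru.append(tag)

def miss_rate(bases, length=256):
    """Read `length` bytes from each buffer, interleaved one byte at a time."""
    cache = SetAssocCache()
    for i in range(length):
        for base in bases:
            cache.access(base + i)
    return cache.misses / (cache.hits + cache.misses)

WAY_SIZE = 128 * 16  # addresses this far apart land in the same set
aligned = [0x10000 + k * WAY_SIZE for k in range(5)]           # all five collide
staggered = [0x10000 + k * (WAY_SIZE + 16) for k in range(5)]  # one line apart

print(miss_rate(aligned), miss_rate(staggered))
```

When the allocator happens to hand back five buffers that all map to the same sets, they fight over four ways and every access misses. Shift each buffer by a single cache line and only the first touch of each line misses. Nothing in between, which would look a lot like time A versus time B.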
-
Are there any CPU specific behaviors, other than what the assembler instructions specify, that any software is dependent on?
I guess u mean like undocumented opcodes?
I haven't heard of undocumented opcodes doing anything since the C64 days.
-
It goes further than opcodes. As the I/O connections and modes may screw around with execution behaviour, or even bus response.
-
I guess u mean like undocumented opcodes?
I haven't heard of undocumented opcodes doing anything since the C64 days.
The Hitachi 6309, now that was the king of hidden opcodes.
An extra 16bit accumulator.
A 32bit accumulator.
I better stop there because the list takes up about a page.
-
Are there any CPU-specific behaviors, other than what the assembler instructions specify, that any software is dependent on?
(Like instruction XX flipping register bit Y when in Z mode etc..)
The behavior of existing instructions and addressing modes is already defined in earlier 68k processors and if it doesn't match, it's a bug and will likely crash. I do know of some software that avoids bugs that might be in the 68060 but it will run on 68060s without the bug as well as other 68k processors and does not depend on the behavior of the 68060. I have also seen documentation of the 68060 changed because it was incorrect (but not a bug). This would be more likely to affect early hardware developed for the 68060 but could affect drivers (software) that rely on a particular timing of such early hardware. I don't foresee many incompatibility problems (and those can be fixed in fpga) when running 68060 code on a non-cycle exact advanced 68k fpga CPU.
Several years ago I coded a special fx. It does some gfx calculations then does a full screen rotation. Obviously it burns a lot of cycles so I timed it often to see how slow it was.
Each time I run the fx, I either get time A or a totally different and much slower time B. I never get anything in between. It's very, very confusing to me.
Sometimes the routine goes at the speed I want and other times much slower. It makes no sense. It's like it has 2 gears it can run in.
My guess would be that some cache (could be the branch cache also) gets flushed. It could show like that if inadvertently synced to a task switch. You could try turning off different caches individually or disabling multitasking to see if it makes the timing closer. I often find minor differences in speed myself. It's like the 68060 is alive but chaos is ordered once the complexity is understood ;).
-
Have a look at alignment issues for those timing issues. Especially in combination with pre-fetch (L-cache).
-
Have a look at alignment issues for those timing issues. Especially in combination with pre-fetch (L-cache).
The 68060 does a fantastic job of handling misaligned data, especially on reads. By the documentation, a cycle is lost here and there but I have found no measurable speed difference by aligning code (I-cache) for example. This is in contrast to the 68020/68030 where aligning branch targets to a longword can result in ~5% speedup.
-
The 68060 does a fantastic job of handling misaligned data, especially on reads. By the documentation, a cycle is lost here and there but I have found no measurable speed difference by aligning code (I-cache) for example. This is in contrast to the 68020/68030 where aligning branch targets to a longword can result in ~5% speedup.
At one point I was thinking of going back thru all my asm code and massaging it so that all my popular branch targets were longword aligned... But then I never did it. What is the command for doing that in Devpac?
I haven't written any real asm in 4 years. I'm forgetting everything.
Maybe I didn't bother to do the massage because there is no benefit on 040 or 060?
Does 040 get any benefit from longword aligned branch targets?
And by branch targets, does that include bne as well as bsr/jsr and JMP?
-
At one point I was thinking of going back thru all my asm code and massaging it so that all my popular branch targets were longword aligned... But then I never did it. What is the command for doing that in Devpac?
Most assemblers will accept:
CNOP 0,4 ;longword align
Maybe I didn't bother to do the massage because there is no benefit on 040 or 060?
Does 040 get any benefit from longword aligned branch targets?
The 040 handles mis-alignment well like the 060. It's probable that a cycle is saved from time to time by aligning code but aligning code can result in less code in a cache line which can cost a cycle from time to time. It may still be effective to align the start of commonly used code to a longword (maybe even to a cache line with CNOP 0,16 but that starts to become wasteful of memory if the code is not extremely common) which is easy enough and doesn't waste cache. Even on the 040/060 where there is no penalty to read any part of a cache line, it still takes 2x as long to load 2 cache lines as 1. The 060 at least, does such a good job of code alignment and code caching that I was unable to time a significant difference by aligning code. Many modern processors can't do this.
And by branch targets, does that include bne as well as bsr/jsr and JMP?
Yes, I believe this includes Bcc branch targets. All instruction fetches on the 020 are longword and the 020 is delayed by fetches over a word. I would avoid using NOP instructions for alignment in any code that is executed.
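To make the trade-off concrete, here is a small Python sketch (not assembler output, just the arithmetic) of how many pad bytes a CNOP offset,alignment would insert at a given address, and how many 16-byte cache lines a routine touches before and after padding. The function names and the example addresses are hypothetical.

```python
def cnop_padding(pc, offset, alignment):
    """Pad bytes needed so that pc lands `offset` bytes past a multiple
    of `alignment`, which is what CNOP offset,alignment asks for."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (offset - pc) % alignment


def cache_lines_touched(start, length, line_size=16):
    """Number of cache lines covered by `length` bytes starting at `start`."""
    if length <= 0:
        return 0
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1


# Hypothetical 20-byte routine that would otherwise start at $100E.
start = 0x100E
pad = cnop_padding(start, 0, 16)              # what CNOP 0,16 would add
before = cache_lines_touched(start, 20)
after = cache_lines_touched(start + pad, 20)
print(pad, before, after)
```

In this made-up case two pad bytes bring the routine from three cache lines down to two. A routine that already fits its lines gains nothing, and the padding still costs memory, which is the waste mentioned above.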
-
Is there any particular reason for people still calling the Natami CPU a 68050?
Last I read, it was called a 68070.
Yeah, whilst working on it everyone was keen for it to be fully super-scalar, OoO, dual-issue etc so it would be architecturally like a 68060 but more advanced. In the end that was deemed a little impractical for a first attempt so instead it would be more like a 68040, but a bit more advanced hence 68050, the "N" was to separate it from the Motorola series.
EDIT: Oh yes and as was pointed out by Matthey, the "N68070" idea eventually became the Apollo thing that Gunnar is off doing now! Very confusing. So it's probably best discussed as a timeline!
1. the "N68070" was the original target,
2. became the more achievable "N68050", ran code in a simulator,
3. "N68070" would be the limited-OoO, dual-issue future version of the "N68050",
4. Gunnar left and started the "Apollo" project which is basically the "N68070".
Erm... the only one I know existed is the N68050 which Jens & Gunnar had running in a simulator whilst I was still involved with Natami. Dunno what happened after that.
-
Most assemblers will accept:
CNOP 0,4 ;longword align
Thanx! I just remembered CNOP 0,4 right before I clicked ur msg. It just popped into my head. Sometimes I hafta tilt my head a little to get the fluids over to the dry part of my brain. :)
The 040 handles mis-alignment well like the 060. It's probable that a cycle is saved from time to time by aligning code but aligning code can result in less code in a cache line which can cost a cycle from time to time. It may still be effective to align the start of commonly used code to a longword (maybe even to a cache line with CNOP 0,16 but that starts to become wasteful of memory if the code is not extremely common) which is easy enough and doesn't waste cache. Even on the 040/060 where there is no penalty to read any part of a cache line, it still takes 2x as long to load 2 cache lines as 1. The 060 at least, does such a good job of code alignment and code caching that I was unable to time a significant difference by aligning code. Many modern processors can't do this.
Now u have reminded me why I didn't do it. Too complicated. I might make things worse. It's easier to just tell ppl to buy an 060 card. :D
-
Calling it 68050 was never cool.
It had new instructions (addressing modes) so if I wrote code for it things would get really messed up!
I could say "This game requires 68050+" but that would be a lie because it would not work on 060.
So I would need to say "This game requires 68050 or 68070+ but NOT 68060"
It's just dumb.
If you add new instructions you should just call it a 68070. Then all programs written for that instruction set can say "Requires 68070+"
It saves thousands of hours of confusion from 10,000s of Amiga users.
I tried to explain this years ago but as usual, Gunnar would not listen.
-
Calling it 68050 was never cool.
It had new instructions (addressing modes) so if I wrote code for it things would get really messed up!
Adding new instructions or addressing modes is not cool. We need something that implements an 060, fpu & mmu in an fpga. I don't care if it's super scalar or supports out of order execution.
-
Can one skip out-of-order execution, super scalar, etc.. and still run 68060 code?
What is the minimum feature set that has to be implemented? (albeit slow..)
-
May I just remind everyone that the 68070 was a licenced clone of the 68k by Philips, with a few extra bits of hardware on the silicon. :)
-
Can one skip out-of-order execution, super scalar, etc.. and still run 68060 code?
What is the minimum feature set that has to be implemented? (albeit slow..)
u can run normal 68060 code on 68020.
Or 68020+68881 FPU
And there is no Out-of-order execution in 68060.
-
Calling it 68050 was never cool.
It had new instructions (addressing modes) so if I wrote code for it things would get really messed up!
I could say "This game requires 68050+" but that would be a lie because it would not work on 060.
There can be multiple CPU lines by multiple manufacturers. You could say it requires N68050+ (The Motorola line was M68k). Another option would be to specify the ISA like "68kF1+". That is usually what happens with ARM where there are *many* manufacturers and processor names.
Adding new instructions or addressing modes is not cool. We need something that implements an 060, fpu & mmu in an fpga. I don't care if it's super scalar or supports out of order execution.
Why is it not cool to add new instructions and addressing modes? What if it can be done in a 99.999% compatible way and offer 5-15% speed and code density improvements at the same clock speed? The 68060 is missing some new features that modern processors have and some that would very much improve it. Also, it could be made to run ColdFire code with only minor changes. I can see focusing on making a compatible and tested 68020+ CPU first but there is good reason to modernize the 68k and we need ISA standards to do it. Do you think ARM or x86 would be where they are today if they had not changed? Were the 68020+ ISA changes a waste to you? Do you understand enough to have an educated opinion?
Can one skip out-of-order execution, super scalar, etc.. and still run 68060 code?
What is the minimum feature set that has to be implemented? (albeit slow..)
The code on a 68060 is generally not aware of the superscalar execution. The 68060 resets with superscalar execution turned off and a bit in the PCR register (requires supervisor mode) must be turned on to enable it.
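As a rough picture of what that bit flip looks like, here is a Python model of the PCR value. The field layout is how I remember it from the 68060 user's manual (ESS in bit 0, DFP in bit 1, revision in bits 15-8, ID in bits 31-16) and should be checked against the manual; the helper names and the revision number in the example are made up. On real hardware the register is read and written with MOVEC in supervisor mode, not like this.

```python
# Assumed PCR bit layout (verify against the 68060 user's manual).
PCR_ESS = 1 << 0   # enable superscalar dispatch
PCR_DFP = 1 << 1   # disable floating point unit


def enable_superscalar(pcr):
    """Return the PCR value with superscalar execution switched on."""
    return pcr | PCR_ESS


def is_superscalar(pcr):
    """True if the ESS bit is set in the given PCR value."""
    return bool(pcr & PCR_ESS)


def cpu_id(pcr):
    """Identification field in the upper 16 bits."""
    return (pcr >> 16) & 0xFFFF


def revision(pcr):
    """Revision number field in bits 15-8."""
    return (pcr >> 8) & 0xFF


# Hypothetical value after reset: 68060 ID, revision 1, ESS clear.
after_reset = 0x04300100
running = enable_superscalar(after_reset)
```

Everything else in the value is left untouched, which is why user mode code never notices: the same instructions run either way, only the dispatch rate changes.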
Out-of-Order (OoO) execution is a completely different concept from superscalar execution, although both are ways of executing the same code in parallel across multiple execution units. Most modern processors use OoO but there are some very powerful exceptions. It generally makes better use of the execution units at the cost of complexity and size. No 68k or ColdFire CPU has ever been OoO that I'm aware of. There was talk of partial OoO execution for division on the N68k. This would allow non-dependent instructions to execute while the costly division is calculated. I believe IBM has done something similar before.
-
u can run normal 68060 code on 68020.
Or 68020+68881 FPU
And there is no Out-of-order execution in 68060.
I know, but
a) I'd rather actually see something that can run 68060 code, the extra effort to adding 68020+68882 instructions isn't a big deal to me. Time better spent on a 68060 compatible mmu.
b) Out of order execution is something that 68070 was going to have, but again taking time on things that aren't necessarily needed just means it never comes out.
Oh yeah, 68060 compatible cache would be good too.
Why is it not cool to add new instructions and addressing modes? What if it can be done in a 99.999% compatible way and offer 5-15% speed and code density improvements at the same clock speed?
It's just ego masturbation. The speed improvement isn't worth the effort and definitely not worth fragmenting the user base. Plus it wouldn't be so bad if it could actually run all 68060 code, but as they never implemented an mmu it can't.
The sane choice is 100% 68060 compatibility and nothing more.
-
There can be multiple CPU lines by multiple manufacturers. You could say it requires N68050+ (The Motorola line was M68k).
That would be very confusing to many "regular Joes".
I don't want to spend time answering emails explaining the cpu naming system.
Another option would be to specify the ISA like "68kF1+". That is usually what happens with ARM where there are *many* manufacturers and processor names.
That is better than calling it N68050.
Or we could just call it N68070 and be done with it :)
Altho the FPGAReplay guys will have theirs out first so it will be F68070 or R68070 :D
-
a) I'd rather actually see something that can run 68060 code, the extra effort to adding 68020+68882 instructions isn't a big deal to me. Time better spent on a 68060 compatible mmu.
There were few user mode additions to the 68060 that made it incompatible with the 68020/68030. The only user mode instruction that I can think of is MOVE16 and it's not used by any compiler that I have seen. You can take 68060 optimized compiler code and run it on a 68020-68040 in almost every case, it just won't be quite as fast.
b) Out of order execution is something that 68070 was going to have, but again taking time on things that aren't necessarily needed just means it never comes out.
It was to be superscalar and not OoO except for possibly divide. It would not be considered OoO because of the divide although you could, maybe in the loosest sense, get away with calling it a superscalar/OoO hybrid.
Oh yeah, 68060 compatible cache would be good too.
Most code knows nothing of the cache nor does anything with it except to flush it for DMA or loading/modifying code. The main thing for Amiga compatibility is to have a selectable cache size if copyback caching is used, especially if the cache size grows as is likely for a new CPU. The 68060 had a 1/2 cache size setting that made its caches the same size as the 68040 (which already broke many things compared to the tiny cache in the 020/030). The N68k was going to have writethrough caching only with bus snooping and auto invalidation of instruction cache writes (self modifying code) which would have given maximum compatibility for self modifying code.
It's just ego masturbation. The speed improvement isn't worth the effort and definitely not worth fragmenting the user base. Plus it wouldn't be so bad if it could actually run all 68060 code, but as they never implemented an mmu it can't.
The 5-15% speed increase would be the "average" of a program optimized for the 68kF ISA. It's not only possible, but likely IMO, that you could see a 25-50% speed up of some codecs including picture and video processing. The 68060 is missing (predates) a lot of speedups for this type of code.
A lack of MMU will not affect very much code. Most code that can use the MMU has an option for turning it off. The FPU is more important for user mode compatibility although an MMU may be needed for Amiga system reliability using the 68060. I would like to see an fpga MMU but the benefit on the Amiga is small as the AmigaOS doesn't use it.
That is better than calling it N68050.
Or we could just call it N68070 and be done with it :)
Altho the FPGAReplay guys will have theirs out first so it will be F68070 or R68070 :D
The fpga Arcade CPU is already called the TG68. You better ask before changing the name on them ;). The N68k naming conventions probably don't matter anymore as the Natami is likely dead or the name (where the 'N' comes from) will be renamed by Thomas Hirsch :/.