@Matt Hey and Gunnar. Can you make the core ColdFire compatible via a software library?
I don't think anyone involved with the Phoenix/Apollo project has considered a software library for 100% ColdFire compatibility up to ISA_C (excluding MAC, EMAC and FPU), but it could be done if there was a specific purpose and enough demand. The focus has been to make the core as instruction-level (not necessarily binary) compatible with CF as is practical. Assembler source code could be converted through the use of aliases and macros, but some hand modification would likely be required. For example, MOV3Q is in the A-line, which is not good for 68k compatibility, but an ISA alias could convert it to assemble as a new sign-extended longword addressing mode. It would also help CF compatibility if the stack (A7) alignment could be configured to word or longword alignment, but I don't know how difficult this would be to do in hardware. The DIVSL/DIVUL encoding conflict and the different CC flags for multiplication mean that 100% binary compatibility with CF is not possible in an enhanced 68k CPU.
Which ColdFire do you want to be compatible with?
Which model, which ColdFire ISA?
To help me understand: can you explain why you want this?
There are libraries of ColdFire code, and ColdFire compilers are more modern than what the 68k has. There is a ColdFire embedded market which is larger than the total 68k market (although probably shrinking) and needs a replacement, which could be Phoenix if it were compatible enough.
It seems like a bait and switch. Yeah, you can have 400 MHz 68060 speed, except you need to port your code to it, and the new code won't run on a real 68060.
For compiled code to take advantage, compiler support and backend code would need to be updated. Adding FPU registers can be done in an orthogonal way (as I proposed anyway), which would make this job much easier. The main changes would be interleaving FPU instructions using the additional registers and coming up with an ABI which passes arguments in FPU registers instead of on the stack. I created the vbcc vclib m060.lib math library, fixed a lot of bugs and added many new C99 math functions in a few months. There could be issues with precision in the vclib code (based on the 68060FPSP) if Gunnar reduces the FPU to 64 bits. Extended precision makes it possible to avoid the tricks needed to maintain maximum precision with only 64 bits. Personally, I would prefer to stay with extended precision for compatibility, but double precision is considerably faster in an FPGA.
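As a minimal sketch of the ABI idea, assuming vbcc's existing __reg extension also accepts FPU register names (the function and register assignments here are illustrative, not a proposed standard):

/* arguments arrive in FPU registers instead of being pushed on the stack */
double dot2(__reg("fp0") double x0, __reg("fp1") double y0,
            __reg("fp2") double x1, __reg("fp3") double y1)
{
    return x0 * y0 + x1 * y1;   /* no stack traffic for the operands */
}

With more FPU registers available, a register-passing convention like this also leaves room for the instruction interleaving mentioned above.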
However, can anyone explain to me the use case for "move.l dx,d(PC)", or the use case for "move zero extended to register"? Sure, that's probably all neat, but the number of applications where such an instruction increases speed by an amount that makes an observable difference is near zero. Let alone without new compilers or assemblers around. Yes, I can imagine that for special applications like decoding, hand-tuned inner loop logic could be tremendously useful and worth the manual work. But seriously, is anyone saying "Ok, I'll now rewrite my application because I now have a move to dx with zero-extend instruction available, and THAT was exactly what I was missing?".
While I see limited use for PC-relative writes, I don't think the encoding space can effectively be used for other purposes, and the protection provided by disallowing PC-relative writes is a joke. I doubt compilers would bother creating a new model like small data or small code, but it should be possible to make tiny programs a little smaller and more efficient with PC-relative writes allowed. I would be willing to go along with whatever makes the 68k most acceptable.
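To make it concrete, a hypothetical sketch of the pattern (the variable name is mine, and the PC-relative destination in the comment is exactly what the 68k currently forbids):

static long counter;   /* lands near the code in a tiny program */

void tick(void)
{
    /* today this needs an absolute or base-register address; with
       PC-relative writes enabled it could be addq.l #1,counter(pc),
       shorter and needing no relocation */
    counter++;
}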
The ColdFire MVS and MVZ instructions can be used very effectively by compilers and peephole-optimizing assemblers (an important consideration for an ISA). The support is already available (ready to turn on), as most compilers share the same 68k and CF backend. I'm confident that Frank Wille could have support working in vasm, with a partial benefit, in a matter of hours. Turning on CF code generation in the backend would be a little more work and requires more testing. Sure, it's not going to make a major difference, but few integer ISA changes will (the exceptions being changes that improve branch performance). The applications are obvious enough. Look at the code your layers.library compiles to and see how many places the MVS and MVZ instructions could be used; intuition.library is another example where these instructions would be very useful. Of course, the gain is probably only going to be a few percent in performance and code density, but it's easy because compilers can use it. Some code would barely use them at all, though.
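As a concrete illustration (the function names are mine), these are the loads where a CF-aware backend or peephole pass could substitute MVS/MVZ:

long widen_signed(short *p)
{
    return *p;   /* 68k: move.w (a0),d0 + ext.l d0    -> CF: mvs.w (a0),d0 */
}

unsigned long widen_unsigned(unsigned char *p)
{
    return *p;   /* 68k: moveq #0,d0 + move.b (a0),d0 -> CF: mvz.b (a0),d0 */
}

Each pair of instructions collapses into one, which is where the few percent in code density and speed would come from.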
I'm surprised you never got into compilers. Your assumptions may be true most of the time, but sometimes the 68020 addressing modes and ISA changes do make a big difference. For example, you say the 64-bit multiplication instructions are rare, and they are for SAS/C, but GCC has been using them since the '90s to convert division by a constant into a multiplication. Simple code like the following, compiled for the 68020 with GCC, will generate a 64-bit multiply.
int d;
scanf("%i", &d);
printf("d / 3 = %d\n", d / 3);
The GCC optimization saves quite a few cycles; the 68060 ISA designers failed to recognize that GCC was already using this effectively. I'm working on a similar but improved magic-number constant generator which I hope can be incorporated into vbcc. It's possible to use magic-number constants for 16-bit integer division, which GCC does not do. I may be a cycle counter, because I know cycles add up, but I still go after the big fish. I pay close attention to what compilers can do and where they fail. One thing I can't fix is where programmers fail. Another example of where the 68020 makes a huge difference is something we recently fixed in vbcc. The current vclib is compiled only for the 68000, which you think is good enough for your programs. Using ldiv() generated four divisions (lacking the 68020 32-bit division instructions, it did a division for the quotient and again for the remainder) and included a 256-byte table used for clz (lacking the 68020 BFFFO instruction). The next version of vbcc should have 68020 and maybe 68060 compiled versions of vclib, but I fixed ldiv() for now with a single inline DIVSL.L in stdlib.h when compiling for the 68020+.
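For anyone who hasn't seen the trick, here is a hedged C sketch of the transformation GCC performs on the d/3 above (the function name is mine; 0x55555556 is the standard magic constant for signed division by 3):

int div3(int d)
{
    /* the 64-bit product is a single muls.l on the 68020-68040;
       the 68060 dropped the 64-bit forms, which is part of the complaint above */
    long long t = (long long)d * 0x55555556LL;
    int q = (int)(t >> 32);              /* high longword of the product */
    return q + ((unsigned int)d >> 31);  /* adjust by 1 for negative d */
}

The divide (tens of cycles on every 68k) becomes a multiply plus what is effectively a free shift, since taking the high longword of the register pair needs no extra instruction.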
It is important to consider what the compiler developers think. They know what they need and can use. They should be part of the process of ISA development, but the hardware developers (or should we say Gunnar) dictate what they will get. The ISA creation process has become secretive, as can be seen by Gunnar refusing to answer questions in public (I showed how it is possible to mark an ISA with a disclaimer saying that it is for evaluation and subject to change). I tried to create an open ISA early for debate, to try to avoid exactly these types of problems, but most of the feedback I got was "there is no need yet". Even if my foresight is better than most people's hindsight, it does me no good, because nobody listens to me no matter how right I am. The truth doesn't seem to matter anymore.