Author Topic: Coldfire AGAIN (Read 25814 times)

Karlos · « **on:** March 29, 2008, 03:36:01 PM »

Quote

though what you are suggesting is just a JIT, that sometimes spits out the instructions unchanged.

*cough* Dynamo-style JIT *cough* ;-)

Dynamo (a JIT made by Hewlett Packard) demonstrates the amusing (and at first glance ludicrous) fact that a hotspot JIT can 'emulate' code for the same processor it is itself running on faster than the CPU can run that code natively.

The reason this is possible is down to the fact that at runtime you know more state information than you ever did at compile time. Consequently, a lot of if/else/switch/case/for/while etc code ends up taking only one or two possible paths at runtime (compared to many more possible paths at compile time) and unused code paths can be optimised away by the JIT.
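To make the idea concrete, here is a toy sketch in Python. It is not how Dynamo is implemented; the names `general`, `specialise`, `ToyHotspot` and the `HOT_THRESHOLD` value are all made up for illustration. It only shows the principle: once a path has run often enough with the same runtime state, swap in a version with the unused branches removed.

```python
HOT_THRESHOLD = 3  # hypothetical: how many runs before a path counts as "hot"


def general(mode, x):
    # The code as a static compiler sees it: every path has to be kept.
    if mode == "scale":
        return x * 2
    elif mode == "offset":
        return x + 100
    else:
        return -x


def specialise(mode):
    # Resolve the branch on `mode` once and return a branch-free version.
    if mode == "scale":
        return lambda x: x * 2
    if mode == "offset":
        return lambda x: x + 100
    return lambda x: -x


class ToyHotspot:
    def __init__(self):
        self.counts = {}  # mode -> number of slow-path runs seen so far
        self.fast = {}    # mode -> specialised function

    def run(self, mode, x):
        fn = self.fast.get(mode)
        if fn is not None:
            return fn(x)  # hot path: no branching on `mode` any more
        self.counts[mode] = self.counts.get(mode, 0) + 1
        if self.counts[mode] >= HOT_THRESHOLD:
            self.fast[mode] = specialise(mode)
        return general(mode, x)  # cold path: run the unspecialised code


jit = ToyHotspot()
results = [jit.run("scale", i) for i in range(5)]
```

After the third call with `"scale"`, later calls skip the `if/elif/else` chain entirely. A real system would also need a guard to fall back when the runtime state changes.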

The main overhead of any JIT system is the on-the-fly recompilation stage that's kicked off when the system encounters new code. Translating code for one CPU to another can be quite expensive where their architectures are very different. However, when most of your "recompilation" involves simply copying (rather than translating) the original code, that overhead is mitigated substantially.
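A rough sketch of what "mostly copying" could look like, using instructions as plain strings. The `NEEDS_REWRITE` set and the `emulate_*` helper names are assumptions purely for illustration, not a statement of what any particular ColdFire core lacks.

```python
# Assumption for illustration only: mnemonics the translator must not copy as-is.
NEEDS_REWRITE = {"MULS.L", "MULU.L"}


def translate_block(block):
    """Return (translated instructions, number copied unchanged)."""
    out = []
    copied = 0
    for insn in block:
        mnemonic = insn.split()[0]
        if mnemonic in NEEDS_REWRITE:
            # Replace with a call to a hypothetical helper routine.
            # Operand passing is omitted to keep the sketch short.
            helper = "emulate_" + mnemonic.replace(".", "_").lower()
            out.append("JSR " + helper)
        else:
            out.append(insn)  # the cheap case: straight copy
            copied += 1
    return out, copied


block = ["MOVE.L d0,d1", "MULS.L d2,d1", "ADD.L d1,d3"]
translated, copied = translate_block(block)
```

In this example two of the three instructions pass through untouched, which is where the saving over a cross-architecture translator would come from.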

Using such a mechanism, I expect a current generation coldfire core could run 680x0 code extremely well and without any of the performance problems that trapping individual unimplemented instructions causes.

If only there were 24 more hours in my day I'd look at it.

Karlos · « **Reply #1 on:** March 29, 2008, 04:18:12 PM »

Semantics, my dear fellow :-D. Dynamo has been described by its creators as a hotspot JIT (like most other JIT implementations, it also allows non-critical code to run through in interpreted mode). It dynamically recompiles critical sections to eliminate dead code branches, early returns etc. It simply happens to be the case that the target CPU is the same class as the source.

What you are alluding to is the deep implementation detail of how it works. That it is similar to the AthlonXP's instruction queue/decoder doesn't mean it is fundamentally different to any existing optimizing JIT, as most of them employ the same sorts of code pruning.

Karlos · « **Reply #2 on:** March 29, 2008, 04:41:56 PM »

Well, FWIW, I don't think a straightforward trap-and-emulate based amiga accelerator mechanism would work that well, otherwise we'd have seen one by now.

I seem to recall, but I may be wrong, the problem is that certain opcodes actually behave differently to the same operations on m68k. That is to say, they are implemented but operate slightly differently to the 680x0.

I mean, an instruction that works, but works differently to what you expect, is probably worse than one that isn't implemented at all, as you can't really trap it in the first place.

Karlos · « **Reply #3 on:** March 29, 2008, 05:06:47 PM »

Quote

biggun wrote:
Quote

Karlos wrote:

I seem to recall, but I may be wrong, the problem is that certain opcodes actually behave differently to the same operations on m68k. That is to say, they are implemented but operate slightly differently to the 680x0.

Can you give a real example, or is this hearsay from the rumor mill?

Well, for one, I seem to recall that MULS and MULU fail to set the overflow bit of the condition code register.

If your 68k code looks at the CCR to see if an overflow occurred after a multiplication and performs some specific action, it isn't going to behave the same on both CPUs under all circumstances.
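As a sketch of the difference I mean, here is a small Python model of a 32x32->32 signed multiply. The `sets_overflow` switch is my own device for comparing the two behaviours as I recall them; check the reference manuals before relying on it.

```python
def to_signed32(v):
    # Interpret the low 32 bits of v as a signed two's-complement value.
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v


def muls_l(a, b, sets_overflow):
    """Model a 32x32->32 signed multiply. Returns (result, V flag)."""
    full = to_signed32(a) * to_signed32(b)
    result = to_signed32(full)  # what ends up in the register
    overflow = full != result   # true if the product did not fit in 32 bits
    return result, (overflow if sets_overflow else False)


# Same operands, same register result, different V flag.
with_v = muls_l(0x10000, 0x10000, sets_overflow=True)
without_v = muls_l(0x10000, 0x10000, sets_overflow=False)
```

Code that follows the multiply with a branch on V takes different paths in the two cases, even though the register contents are identical.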

There were a few other nuances like this, but I'd need to check and don't have time.

Karlos · « **Reply #4 on:** March 29, 2008, 05:22:30 PM »

Quote

Oli_hd wrote:
Quote
Well, for one, I seem to recall that MULS and MULU fail to set the overflow bit of the condition code register.

Correct, but the 68Klib provided free by Freescale can emulate these instructions. You simply have to add an instruction before it to trigger the CPU's invalid instruction trap, and then the emulator will give you a fully 68K-compatible MULS and MULU. (The other instructions are the DIV ones, I think.)
This wouldn't need to be done at compile time; a program could be written to insert the trap code into a binary file at the correct places.

/me goes back to watching all the Coldfire threads

Doesn't that change the length of the instruction stream? If so, presumably you then need to update all the branches too?
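To illustrate the concern, here is a small Python model of a code fragment where each instruction is a `(name, size_in_bytes, branch_target_index)` tuple. The 2-byte `TRAP_PREFIX` is a hypothetical stand-in for whatever the patcher would insert; the displacement is taken relative to the branch address plus 2, as on the 68k.

```python
def layout(insns):
    # Byte address of each instruction, assuming the fragment starts at 0.
    addrs, pc = [], 0
    for _, size, _ in insns:
        addrs.append(pc)
        pc += size
    return addrs


def displacements(insns):
    # Branch displacement for every instruction that has a target.
    addrs = layout(insns)
    return {i: addrs[t] - (addrs[i] + 2)
            for i, (_, _, t) in enumerate(insns) if t is not None}


def insert_before(insns, index, new_insn):
    # Insert new_insn at `index` and re-point branch targets past it.
    # A branch aimed at the patched instruction now lands on the prefix.
    fixed = [(n, s, t + 1 if t is not None and t > index else t)
             for n, s, t in insns]
    return fixed[:index] + [new_insn] + fixed[index:]


code = [
    ("BRA", 2, 3),       # branches over the multiply to the RTS
    ("MOVE.L", 2, None),
    ("MULS.L", 4, None),
    ("RTS", 2, None),
]
before = displacements(code)
patched = insert_before(code, 2, ("TRAP_PREFIX", 2, None))
after = displacements(patched)
```

The branch that used to skip 6 bytes now has to skip 8, so every branch spanning a patch site needs rewriting. Computed jumps and jump tables are harder still, since the patcher can't easily find them.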

Karlos · « **Reply #5 on:** March 29, 2008, 05:29:11 PM »

Quote

bloodline wrote:

You have the most experience with the CF on this board you should say more!!

Quite.

Karlos · « **Reply #6 on:** March 29, 2008, 05:52:46 PM »

Quote

biggun wrote:

BTW muls.L would calculate wrongly if you get the overflow, and there is NO way of recovering from it besides using the 64bit MUL version or a proper multiplication routine.
In other words, if your code can overflow you will never use this instruction in the first place.

I think you'll find it's used in most 68020+ compiler-generated code where the effects of overflow aren't really defined by the language standard.

In hand coded ASM, you would still use it for example if you are writing saturation-based fixed-point arithmetic routines for some visual or audio application. You'd optionally fill the result with your maximum fixed point value on overflow.
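In Python terms, the saturation pattern looks roughly like this. The `overflow` test stands in for checking the V flag after the multiply, and the function name is made up.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def sat_mul32(a, b):
    """Signed 32-bit multiply that clamps instead of wrapping."""
    full = a * b
    # Stand-in for testing the overflow flag after the multiply.
    overflow = not (INT32_MIN <= full <= INT32_MAX)
    if overflow:
        return INT32_MAX if full > 0 else INT32_MIN
    return full


clamped_high = sat_mul32(0x10000, 0x10000)
clamped_low = sat_mul32(-0x10000, 0x10000)
in_range = sat_mul32(1000, 1000)
```

If the flag is never set, the clamp never fires and the wrapped value is used instead, which in audio or graphics code shows up as a glitch rather than a crash.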

Quote

The issue that you are referring to does not exist for A500 programs.

Perhaps, but it is possibly not the only such difference. Anyway, I would have thought that 68020 would be the base level for any 'revived' m68k amiga platform (other than minimig)? After all, you need 68020 compatibility to be able to run OS3.5/3.9, right?

Isn't the NatAmi going for minimum of AGA compatibility? I'm unaware of any working plain 68000+AGA hardware combination.

Karlos · « **Reply #7 on:** March 29, 2008, 07:43:53 PM »

Quote

So what is the real effect?
A few, very limited number of tools might become buggy.
But 99% of the AMIGA applications will run correctly on Coldfire.

This is what it really looks like.

That some people state that the Coldfire cannot
run 68k code is certainly a 100% overstatement.

I don't think anybody is saying the Coldfire can't run 68K code, I think they are saying AmigaOS and applications may not work readily on coldfire. In addition to the behavioural differences mentioned, how many byte and word size logic/arithmetic operations are there in typical amiga 68K object code that are not directly supported on coldfire (existing only as long version)? It may be the case that there will be more trap and emulation overhead than you think.

Remember, some applications using 64-bit integer multiplication on 020/030/040 ran like treacle on the first 060 cards that relied on trap-emulate (anybody remember Breathless) ?

I'm not saying that you can't run a coldfire based Amiga system but I really do think the difficulties are more than you seem to admit. There are a lot of things to consider beyond basic instruction implementation counts.

So far we've only looked at the user mode. Coldfire supervisor mode is a bit different and, if I recall correctly, it doesn't have a separate supervisor stack pointer. This might not sound like a big deal but it does have very real implications.

Any code that writes local data below the current stack depth (eg using negative offsets from a7), whilst working perfectly on a 680x0 Amiga, risks having that data trashed by an interrupt on a coldfire system. This might sound unlikely, but in fact code that has been optimised not to use stack frames within functions may well assume it can safely use address modes such as -4(a7) etc to hold local variables (if it doesn't need to immediately call another function) rather than decrementing a7 first and using positive offsets for them, thus typically saving instructions to modify a7.
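Here is a toy Python model of that failure mode. The addresses, the 8-byte exception frame and the `Machine` class are all illustrative assumptions, not a description of any specific core.

```python
class Machine:
    def __init__(self, separate_interrupt_stack):
        self.mem = {}      # address -> longword value
        self.sp = 0x1000   # the stack pointer the running code sees (a7)
        self.isp = 0x2000  # stack used by interrupts, if there is a separate one
        self.separate = separate_interrupt_stack

    def interrupt(self):
        # Push an illustrative two-longword exception frame
        # on whichever stack interrupts use.
        base = self.isp if self.separate else self.sp
        for offset in (4, 8):
            self.mem[base - offset] = 0xDEADBEEF


def local_survives(separate_interrupt_stack):
    m = Machine(separate_interrupt_stack)
    m.mem[m.sp - 4] = 42  # like move.l #42,-4(a7) without adjusting a7
    m.interrupt()         # arrives before the value is read back
    return m.mem[m.sp - 4] == 42


safe = local_survives(True)
trashed = local_survives(False)
```

With one shared stack pointer, the frame lands exactly where the code parked its local variable, and the failure only shows up when an interrupt happens to hit that window.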

Can you say with certainty that the 100% of A500 applications you refer to as being compatible aren't doing anything like this?

Karlos · « **Reply #8 on:** March 29, 2008, 08:25:01 PM »

Quote

biggun wrote:

This is no problem.

You are referring to the very first Coldfire versions.
The V4 and V5 Coldfire have a separate supervisor stack pointer.

That's good. What other incompatibilities do they address?

Karlos · « **Reply #9 on:** March 29, 2008, 08:30:41 PM »

@HenryCase

Drag(-)on, ey? Elbox don't seem to be in a rush to release it...

Karlos · « **Reply #10 on:** March 30, 2008, 05:39:51 PM »

I know, let's use PowerPC...

*hides*

Karlos · « **Reply #11 on:** March 30, 2008, 05:53:39 PM »

It has the advantage that it could run the PPC descendants of AmigaOS too...

Karlos · « **Reply #12 on:** March 31, 2008, 02:30:59 PM »

Memory protection exists to clean up the mess of bad coders...

*runs away*

Karlos · « **Reply #13 on:** March 31, 2008, 06:53:56 PM »

Quote

...what effectively happens is that the MMU updates the 'virtual' memory addresses with the real memory addresses (like saying to the program 'here's where your data was').

...

I really can't see a problem with what I've described, perhaps you can?

How do you take, say, 16 4KiB pages scattered across the 4G physical address space that an application requested and originally thought was one contiguous 64KiB lump of memory and tell it "here is where your data was" ?

On a VM system using an MMU, a single allocation that an application refers to through a single pointer can translate into many unrelated chunks of genuine physical memory. You can't assume that memory mapped at contiguous addresses is contiguous in physical RAM.
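A minimal Python sketch of that mapping, with 16 virtual pages backed by scattered physical frames. The base address and the frame numbers are arbitrary values picked for illustration.

```python
PAGE_SIZE = 4096          # 4KiB pages
BASE = 0x40000            # arbitrary virtual base of the 64KiB allocation
FIRST_VPN = BASE // PAGE_SIZE

# Arbitrary, deliberately scattered physical frame numbers within a 4G space.
frames = [(i * 7919 + 13) % 0x100000 for i in range(16)]
page_table = {FIRST_VPN + i: frame for i, frame in enumerate(frames)}


def translate(table, vaddr):
    # Split the virtual address into page number and offset,
    # then swap the page number for the physical frame number.
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return table[vpn] * PAGE_SIZE + offset


last_byte_of_page0 = translate(page_table, BASE + PAGE_SIZE - 1)
first_byte_of_page1 = translate(page_table, BASE + PAGE_SIZE)
```

The two virtual addresses are adjacent, but the physical addresses they map to are not. There is no single physical address you could hand back to the application as "where your data was".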

Karlos · « **Reply #14 on:** March 31, 2008, 08:56:13 PM »

Quote

HenryCase wrote:
Quote
Karlos wrote:
How do you take, say, 16 4KiB pages scattered across the 4G physical address space that an application requested and originally thought was one contiguous 64KiB lump of memory and tell it "here is where your data was" ?

By arranging it into a 64KiB lump before you give memory control back to the program.

Assuming you could do this, do you have any idea how complex the algorithm would be to sort all the scattered physical blocks into contiguous lumps that reflect what the code originally allocated, and to ensure that pointers everywhere in the system are updated? That's not even including the overhead of copying pages of memory around.

Quote

Give me an example of when you'd use a pointer to address more than one memory location so I can explain how its done.

Any code that walks arrays, traverses containers, manipulates strings, etc.
