There seems to be a recurring misconception about the PowerUP hardware and the older PPC kernels that ran on it.
First off, the hardware. Both processors (all three, if you include the SCSI script processor on the 603e+ boards) most assuredly run concurrently. However, they do share a single bus which means that only one of them can perform IO on it at any given instant.
However, both processors also had instruction and data caches large enough, run in copyback mode (meaning that reads and writes were generally full cache lines), that either one being forced to wait for the other's in-progress IO was not a particularly limiting factor.
Next, the software. A consequence of the hardware design is that each processor ends up with its own, potentially out-of-date view of what is in RAM, based on its own cache. I may be wrong, but I don't think that bus snooping worked so well (or perhaps at all) with the design, so whichever PPC kernel you used, caches had to be kept in sync by flushing them whenever one processor called the other. The kernel generally managed this cache coherency problem, allowing developers either to write 68K apps that treat the PPC as a co-processor for which certain expensive functions are compiled, or to write mostly PPC-native apps that call back to the 68K for OS stuff, IO or event handling.
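Just to make the co-processor usage model concrete, here's a rough sketch. The CallPPC() gate below is a made-up stand-in for whatever ppc.library (PowerUP) or powerpc.library (WarpOS) actually provide (the real entry points differ between the two kernels); it's only the shape of the thing that matters.

```c
#include <stdio.h>

/* Stand-in for the kernel's call gate: imagine this crossing to the PPC,
 * with the kernel flushing caches on the way over and back. */
#define CallPPC(fn, ...) ((fn)(__VA_ARGS__))

/* PPC side: compiled with a PPC compiler into its own object. */
long SumOfSquares(const long *data, long count)
{
    long i, sum = 0;
    for (i = 0; i < count; i++)
        sum += data[i] * data[i];   /* the expensive work lives here */
    return sum;
}

/* 68K side: the ordinary application. */
void ProcessBuffer(long *buffer, long count)
{
    long total = CallPPC(SumOfSquares, buffer, count);  /* one crossing */
    printf("total = %ld\n", total);                     /* OS and IO stay on the 68K */
}
```

The same sketch works the other way around for a mostly PPC-native application that calls back to the 68K for OS work.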
I can't recall the details for PowerUP, but under WarpOS each application had at least two tasks: one that ran on the 68K and one that ran on the PPC. Together, these "mirror tasks" formed a single "virtual" thread of execution. Signals and such were routed to both tasks, but at any instant only the PPC task or the 68K task would be running, with its counterpart asleep, waiting for the flow of execution to come back to it. There was an exception to this rule, an asynchronous calling method that was rarely used as it required the application software to ensure cache coherency itself.
I think it's this notion of a given application running on only one CPU at any given instant that causes the misconception of a single-processor model. However, Exec, WarpOS and PowerUP are all pre-emptive multitasking kernels. So, when your WarpOS task goes to sleep waiting for a slow 68K call to return, in principle WarpOS is free to schedule any other ready-to-run PPC task in its place, and vice versa for Exec on the 68K.
In theory, this means that if you had two WarpOS applications, one could be executing OS calls on the 68K whilst the other is running PPC-native code on the PPC. In practice, however, the context-switching step whenever an application jumped CPU starved both processors of time until it was complete, and more often than not the actual time spent in a 68K function call (or vice versa) was dwarfed by the time spent just doing the context-switching work. Consequently, whether you coded for PowerUP or WarpOS, the golden optimization rule was to minimize context switches: if you were going to need several OS calls, refactor your code so that they could all be done from a single 68K call that you invoke from the PPC (or vice versa). Sadly, it was not uncommon to see applications that didn't do this.
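As a sketch of what that refactoring looks like: DoOn68K() below is another hypothetical call gate for whichever 68K-calling mechanism your kernel offers, struct Config is made up for the example, and Open()/Read()/Close() are just the ordinary dos.library calls.

```c
#include <dos/dos.h>
#include <proto/dos.h>

/* Stand-in for the 68K call gate: imagine each use costing a full
 * mirror-task context switch there and back. */
#define DoOn68K(fn, ...) ((fn)(__VA_ARGS__))

struct Config { long width, height, depth; };   /* hypothetical */

/* Naive: three separate crossings. */
void LoadConfigSlow(struct Config *cfg)
{
    BPTR fh = DoOn68K(Open, "PROGDIR:config", MODE_OLDFILE);
    if (fh) {
        DoOn68K(Read, fh, cfg, sizeof(*cfg));
        DoOn68K(Close, fh);
    }
}

/* Better: fold the whole sequence into one 68K-side function... */
void LoadConfig68K(struct Config *cfg)      /* runs on the 68K */
{
    BPTR fh = Open("PROGDIR:config", MODE_OLDFILE);
    if (fh) {
        Read(fh, cfg, sizeof(*cfg));
        Close(fh);
    }
}

/* ...and cross over exactly once. */
void LoadConfigFast(struct Config *cfg)     /* runs on the PPC */
{
    DoOn68K(LoadConfig68K, cfg);
}
```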
This brings us back to the original question. Well, the answer is that emulating 68K code on the PPC is faster because it doesn't have this cache-flushing limitation. Even on a 603e, emulated 68K code is generally faster than it would be on the real 68K, certainly if your 68K is an 040 like mine. Instantaneous JIT performance is, of course, variable and highly context-sensitive, so while there may be some 68K code out there that would prove pathological under emulation and thus run faster on the real 68K, any benefit would be instantly lost under the overhead of having to implement a WarpOS/PowerUP-style cache coherency strategy for both processors.
I did have a half-baked idea of my own that I might try if I ever get around to experimenting with it, though it is probably doomed to the waste basket of silly ideas already. Essentially, the thought was to allocate a lump of memory, install some 68K code in it and see if I can get it running. However, I have no intention of using the 68K for running existing 68K applications. Instead, I envisage it as a sort of general-purpose programmable DMA controller for classic hardware. It would use an MMU setup that marks all memory uncached except for the space allocated for the code it executes and some private data workspace. As all other regions are uncached, coherency is only a problem from the host OS side, and that's what CachePreDMA()/CachePostDMA() are there to manage.

Having complete access to the hardware and memory space in the system might make it pretty useful for data transfer tasks. You might have a 68K-uncacheable page of memory somewhere that represents a memory-mapped "register file" for your virtual DMA device. This would ideally need to be uncached from the PPC side too, so a page of Chip RAM might even be an idea. It doesn't matter that it's slow, because you are only putting parameter data in it (e.g. address of a memory region, size of the region) and then, via an interrupt or other mechanism, getting the 68K to do the transfer (which, if done properly, could use MOVE16 for block transfers to/from uncached memory).
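For the sake of argument, that "register file" page might look something like the sketch below. Everything here is hypothetical (the struct layout, the command codes, the Signal68K() doorbell); the point is how little actually has to live in the slow uncached page.

```c
#include <exec/types.h>

/* Hypothetical register file for the virtual DMA device. The page
 * holding this struct would be uncached from both CPUs (e.g. Chip RAM),
 * so neither side needs explicit cache maintenance to see the other's
 * writes; CachePreDMA()/CachePostDMA() only matter for the buffers the
 * transfer itself touches. */
struct VDMARegs {
    volatile ULONG cmd;        /* 0 = idle, 1 = copy, ...              */
    volatile ULONG status;     /* set by the 68K when the job is done  */
    volatile APTR  src;        /* source address                       */
    volatile APTR  dst;        /* destination address                  */
    volatile ULONG length;     /* bytes to move                        */
};

void Signal68K(void);          /* hypothetical doorbell: an interrupt
                                  or similar mechanism to wake the 68K */

/* Host/PPC side: post a copy job and kick the 68K. */
void StartCopy(struct VDMARegs *regs, APTR dst, APTR src, ULONG len)
{
    regs->src    = src;
    regs->dst    = dst;
    regs->length = len;
    regs->status = 0;
    regs->cmd    = 1;          /* written last, so the 68K never sees a
                                  half-filled job                       */
    Signal68K();
}
```

The 68K side would just sit on the doorbell, read the parameters out of the page, do the MOVE16 loop (or whatever the job calls for) and then set status to say it's finished.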
It's all very pie in the sky, but when the idea first occurred to me it seemed theoretically possible that you might be able to produce virtual DMA controllers for things like the onboard IDE, parallel port, PCMCIA or whatever, and then drivers that utilise them.
There are probably many very good reasons why this is impossible, though.