Author Topic: OS4 Classic, why disable the 68k ? (Read 9546 times)

Karlos · « **on:** April 11, 2012, 07:37:25 PM »

There seems to be a recurring misconception about the PowerUP hardware and the older PPC kernels that ran on it.

First off, the hardware. Both processors (all three, if you include the SCSI script processor on the 603e+ boards) most assuredly run concurrently. However, they do share a single bus which means that only one of them can perform IO on it at any given instant.

However, both processors had large enough instruction and data caches, run in copyback mode (meaning that reads and writes were generally full cache lines), that either one being forced to wait for in-progress IO from the other was not a particularly limiting factor.

Next, the software. A consequence of the hardware design is that each processor ends up with its own, potentially out-of-date view of what is in RAM based on its own cache. I may be wrong, but I don't think that bus snooping worked so well (or perhaps at all) with the design, so whichever PPC kernel you used, caches had to be kept in sync by flushing them when one processor called the other. The kernel generally managed this cache coherency problem, allowing developers to write 68K apps and treat the PPC as a co-processor they'd compile certain expensive functions for, or alternatively write mostly PPC native apps that call the 68K for OS stuff, IO or event handling.

I can't recall the details for PowerUP but under WarpOS, each WarpOS application had at least 2 tasks: one that ran on the 68K and one that ran on the PPC. Together, these "mirror tasks" formed a single "virtual" thread of execution. Signals and such were routed to both tasks, but at any instant in time, only the PPC or 68K thread would be running, with its counterpart asleep, awaiting the flow of execution to come back to it. There was an exception to this rule, an asynchronous calling method that was rarely used as it required the application software to ensure cache coherency.

I think it's this notion of a given application running on only one CPU at any given instant that causes the misconception of a single processor model. However, Exec, WarpOS and PowerUP are all pre-emptively multitasking kernels. So, when your WarpOS task goes to sleep waiting for a slow 68K call to return, in principle, WarpOS is free to schedule any other ready-to-run PPC task in its place. The same applies, vice versa, to Exec on the 68K.

In theory, this means that if you had two WarpOS applications, one could be executing some OS calls on the 68K whilst the other is running PPC native code on the PPC. In practice, however, the context switching step whenever any application jumped CPU starved both processors of time until it was complete and, more often than not, the actual time spent on a 68K function call (or vice versa) was dwarfed by the time spent just doing the context switching work. Consequently, whether you coded for PowerUP or WarpOS, the golden optimization rule was to minimize context switches. So, if you were going to need to do several OS calls, refactor your code so that they can be done from a single 68K call that you can invoke from the PPC (or vice versa). Sadly, it was not uncommon to see applications that didn't do this.

This brings us back to the original question. Well, the answer is that emulating 68K code on the PPC is faster due to not having this cache flushing limitation. Even on a 603e, emulated code generally runs faster than it would on the real 68K, certainly if your 68K is an 040 like mine. Instantaneous JIT performance is, of course, variable and highly context sensitive, so while there may be some 68K code out there that would prove pathological under emulation and thus run faster on the real 68K, any benefit would be instantly lost to the overhead of having to implement a WarpOS/PowerUP style cache coherency strategy for both processors.

I did have a half-baked idea of my own that I might try if I ever get around to experimenting, but it is probably doomed to the waste basket of silly ideas already. Essentially what I thought of was allocating a lump of memory to install some 68K code in and trying to see if I can get it running. However, I have no intention of using the 68K for running existing 68K applications. Instead, I envisage it as a sort of general-purpose programmable DMA controller for classic hardware. It would use an MMU setup that would mark all memory uncached except for the space allocated for the code it executes and some private data workspace. As all other regions are uncached, coherency is only a problem from the host OS side and that's what CachePre/PostDMA() is there to manage. Having complete access to the hardware and memory space in the system might make it pretty useful for data transfer tasks. You might have a 68K uncacheable page of memory somewhere that represents a memory-mapped "register file" for your virtual DMA device. This would ideally need to be uncached from the PPC side too, so a page of ChipRAM might even be an idea. It doesn't matter that it's slow because you are only putting parameter data in it (e.g. address of memory region, size of memory region) and then, via an interrupt or other mechanism, getting the 68K to do the transfer (which, if done properly, could use move16 for block transfers to/from uncached memory).
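As a rough illustration of the "register file" idea, here is a minimal sketch in C. Nothing in it comes from a real driver: the struct layout, the `vdma_` names and the status values are all hypothetical, and on real hardware the struct would sit in a page that is uncached for both CPUs rather than in ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for the virtual DMA device. */
enum vdma_status { VDMA_IDLE = 0, VDMA_BUSY = 1, VDMA_DONE = 2, VDMA_ERROR = 3 };

/* Hypothetical register file. Only parameters live here, never bulk data,
 * so it does not matter much if the page holding it is slow. */
struct vdma_regs {
    volatile uint32_t addr;    /* address of the host memory region */
    volatile uint32_t length;  /* size of the region in bytes */
    volatile uint32_t command; /* what the 68K is being asked to do */
    volatile uint32_t status;  /* one of enum vdma_status */
};

/* Host (PPC) side: post one request.
 * Returns 0 on success, -1 if a request is already in flight.
 * The command is written last so that the 68K side never sees a command
 * with stale parameters. Signalling the 68K is not shown. */
static int vdma_post(struct vdma_regs *r, uint32_t addr, uint32_t length,
                     uint32_t command)
{
    assert(r != NULL);
    if (r->status == VDMA_BUSY)
        return -1;
    r->addr = addr;
    r->length = length;
    r->status = VDMA_BUSY;
    r->command = command;
    return 0;
}
```

The host would call `vdma_post()` and then poke the 68K by whatever signalling mechanism turns out to be available; that part is deliberately left out because it is the least certain piece of the idea.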

It's all very pie in the sky, but it seemed theoretically possible when the idea first occurred that you might be able to produce virtual DMA controllers for things like the onboard IDE, parallel port, PCMCIA or whatever and then drivers that utilise them.

There are probably many very good reasons why this is impossible, though.

Karlos · « **Reply #1 on:** April 11, 2012, 08:28:27 PM »

Quote from: itix;688069

68k code that is executed only once probably runs faster on real 68k, though.

Yeah, but as soon as it makes an OS call to the PPC native host, any such advantage would be lost instantly.

Quote

Since interrupts are executed on a PPC side you would have to make some substantial changes there... and since those devices still wouldnt run any faster only advantage would be having more cpu time for idle task =P

I think you misunderstood me. I was suggesting using an interrupt "or other" mechanism to signal the 68K to do something asynchronously from the host.

Consider disk access on the motherboard IDE. It's cripplingly slow whether you do it on the 68K or the PPC. However, if you were able to do block transfers via your "virtual DMA device" (in reality a bit of 68K code doing PIO to/from some range of memory), the PPC would be free to do something else while it waits for it to complete. In that respect, it's no different from the 68K waiting for the NCR7xx to complete a DMA transfer to/from a SCSI device. Why tie up your host processor doing such PIO transfers when the 68K can do it just as quickly in this case while your host schedules some other PPC task to run while this one sleeps?

If the required 68K transfer code were small enough to fit in the instruction cache (which a simple implementation should be), it might work pretty well. It's not hard to imagine reading 16-bit words from the IDE to a small 16-byte aligned buffer in the cacheable area you reserved for the 68K and then move16'ing them over to wherever they are actually needed, all without performing any memory accesses during the loop except those required for data transfer.
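The shape of that transfer loop can be sketched in portable C. This is only an illustration under stated assumptions: the port is modelled by a hypothetical callback (`port_read16_fn`), `fake_port_read16` is a stand-in that merely counts upwards so the loop can be exercised off-target, and `memcpy` stands in for the move16 a real 68040 routine would use on a 16-byte aligned line buffer in its private cacheable area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 16u /* one 68040 cache line, the unit move16 transfers */

/* Hypothetical: reads one 16-bit word from the device data port. */
typedef uint16_t (*port_read16_fn)(void *ctx);

/* Stand-in port for trying the loop without hardware: returns 0, 1, 2, ... */
static uint16_t fake_port_read16(void *ctx)
{
    uint16_t *counter = ctx;
    return (*counter)++;
}

/* Fill a line-sized bounce buffer from the port, then copy the whole line
 * to the destination. Only whole lines are handled here; the number of
 * bytes actually transferred is returned and any tail is left to the caller. */
static size_t pio_read_lines(port_read16_fn read16, void *ctx,
                             uint8_t *dst, size_t nbytes)
{
    uint16_t line[LINE_BYTES / 2];
    size_t done = 0;

    assert(read16 != NULL && dst != NULL);
    while (nbytes - done >= LINE_BYTES) {
        for (size_t i = 0; i < LINE_BYTES / 2; i++)
            line[i] = read16(ctx);
        memcpy(dst + done, line, LINE_BYTES); /* stands in for move16 */
        done += LINE_BYTES;
    }
    return done;
}
```

Asked for 40 bytes, the loop moves two full lines (32 bytes) and reports that, leaving the last 8 bytes for a separate word-by-word path.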

Once finished, the 68K could raise an interrupt for the PPC so that the actual "driver" task for this would wake up, call CachePostDMA() and then return.

Since only its private working area would be cacheable, most of the coherency issues that dogged the traditional software are not a problem. You just have to remember to treat it like any other DMA device within any host drivers you write that use this mechanism.

Karlos · « **Reply #2 on:** April 11, 2012, 09:51:29 PM »

Quote from: Zac67;688075

Sadly, bus snooping doesn't solve the problem: since any CPU could be modifying data segments (that may even have been cached by the other side) without writing them back right away, data caches can diverge substantially.

Indeed. I probably should have worded it a bit more clearly - this is essentially the point I was trying to make. Bus snooping may be implemented on one of the processors (or even both; ISTR it was touted as a feature on the 68040, but did it ever actually work?), but in the dual processor design, cache flushing was the only viable option.

Quote

The only way to get around the extremely expensive context switches is to divide the RAM into regions for each CPU with a shared (non-cached) region for message passing - if I understood correctly this is what you were thinking of. However, shared memory is also expensive as it must not be cached and you need to copy all data passed back and forth to this slow memory - so you'll starve a CPU depending on caching as well (though not as badly).

More or less. To reiterate, a section of ram allocated by the host for the 68K in which it stores the code, MMU tables and working data. The rest of the entire address range would then be marked as completely uncacheable by the 68K.

The shared memory would only be a few pages in which to implement a register file for the PPC to talk to whatever virtual DMA device you were getting the 68K to pretend to be (for simplicity it could even be in chip RAM). Furthermore, it would be used for passing parameters, never bulk data, so even if it were in some very slow uncacheable location it wouldn't make any significant difference as you'd only ever be doing a few reads and writes to set up a "DMA" operation. The 68K would then read/write other memory locations as directed by the PPC. The PPC would have to ensure it called CachePreDMA() on the affected region beforehand and CachePostDMA() afterwards, just as it would with any other DMA device. The 68K, however, wouldn't need to worry about that as the only area it can cache is its own private working set, which the PPC would never touch after initializing it.

The idea is not to make data transfers faster since the bottleneck would be the device you were transferring data to or from. The idea was to make data transfers not chew up PPC cycles.

-edit-

I hope that's a bit clearer, but as I said, there are probably many other technical reasons why the whole idea would fall over that might not be obvious until trying it

Karlos · « **Reply #3 on:** April 11, 2012, 11:41:59 PM »

Quote from: JJ;688101

Running classic game on os4 on a classic amiga will always be slow as hell.

Unless I am mistaken, you are emulating the whole chipset as well as the cpu.

You are mistaken. The native chipset is still supported in OS4 on classics. Even hardware banging stuff tends to work since the hardware being banged is actually present.

Stuff that is critically dependent on 68K v custom chip timing might not work so well but then that's often a problem for faster 68K processors too.

Unless you are talking about using UAE on the classic machines, in which case, yeah, it's pretty slow.

Karlos · « **Reply #4 on:** April 11, 2012, 11:54:11 PM »

Quote from: JJ;688105

Oh ok I was always under the impression that running aga games for instance under os4 for classic had to be run in uae and were hence slow as hell, But you can just run an aga game in os4 and it will bang the hardware and just the cpu will be emulated ?

As a rule, yes. Obviously there are some incompatibilities, just as there are with accelerated 68K systems trying to run old 68000/OCS titles generally (without resorting to patching the games ala whdload for instance).

Karlos · « **Reply #5 on:** April 12, 2012, 12:01:52 AM »

Quote from: JJ;688111

That is pretty cool. Has anyone ever written a version of whdload for Morphos ?

No idea, but I wasn't implying there was a version for OS4, either. That was thrown in as an example that even without 68K emulation you often need a bit of an assist to get some classic games working on a classic system

Quote

That would make the classic version of morphos interesting

I've not had cause to boot it for a while, but my 1.4.5 install only seemed to work with RTG compatible titles.

Karlos · « **Reply #6 on:** April 12, 2012, 12:10:51 AM »

Quote from: JJ;688113

I could never get MorphOS to boot on my 1200. I am guessing though that hardware banging stuff didnt work on it ? Piru ?

I can't say I had a lot of joy on that front but OS/RTG friendly stuff ran well. My kit is pretty much A1200 + phase5 accelerator and graphics card so no surprises there

Quote

Does whdload run on os4 though?

Not tried it but I'd hazard a guess it probably doesn't. Or if it does, it will very much depend on what an individual installer patch does in order to get the specific game working. Poking around the supervisor model of the (emulated) 68K is probably a non starter.

Karlos · « **Reply #7 on:** April 12, 2012, 07:41:18 PM »

Quote from: dreamcast270mhz;688206

Karlos, if I understand what you're saying about this method of operation is to relegate the '060 to interfacing with the hardware, while the PowerPC is waiting for the memory registers to be open for use?

Operationally, it would be something like this. Imagine you have implemented a virtual DMA driver for the motherboard IDE. This is a small 68K application located in memory designated as 68K cacheable, along with some workspace for the working set and MMU tables. The rest of the entire address space is marked as uncacheable.
This means the only stuff the 68K will cache is access to the 68K code, local working set, stack and so on, allowing it to operate at full speed on local data.

The host PPC now wants to load 128K into some nice cache aligned address in fast RAM. It writes the base address, number of bytes to transfer and any other relevant information to a "register file" belonging to this device. In practice, this register file area needs to be uncached on the PPC too.

Having loaded the parameters, it then invokes the 68K into processing the request. How this would be done is still somewhat vague, but the 68K would likely have to be able to respond to interrupts issued by the PPC. The PPC task now goes to sleep, awaiting some indication the process has completed. At this point, the host OS will just find something else to do.

Concurrently, the 68K now starts retrieving data from the motherboard IDE into its local working store. Once it has a 68K cache line's worth, it can transfer that whole line to the fast RAM address and increment. Obviously non-aligned bits will have to be handled, but this side is not rocket science.

When the 68K has finished loading the 128K, or some error occurs, it updates some output values accordingly in its "register file" and signals the PPC to wake up the calling task. Said task ensures the PPC cache isn't now out of date, checks the return status in the register file and takes whatever action is now necessary, be it returning success or invoking some error strategy.
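The ordering of those steps can be sketched as a small single-threaded simulation in C. All names here are hypothetical: `host_cache_pre_dma` and `host_cache_post_dma` are empty stand-ins for wherever the host would call CachePreDMA() and CachePostDMA(), `m68k_service` fakes the 68K side by filling the buffer, and the interrupt-plus-sleep handoff is collapsed into a plain function call. On real hardware the request fields would be accessed through volatile pointers into uncached memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REQ_IDLE, REQ_PENDING, REQ_OK, REQ_FAILED };

/* Hypothetical contents of the register file for one request. */
struct io_request {
    uint8_t *dest;   /* where the data should end up */
    uint32_t nbytes; /* how much to transfer */
    uint32_t state;  /* REQ_* value, written back by the 68K side */
};

/* Records the order in which the steps ran, one letter per step. */
static char trace[8];
static size_t trace_len;
static void step(char c)
{
    if (trace_len < sizeof trace - 1)
        trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Stand-ins for the host's cache maintenance around a DMA transfer. */
static void host_cache_pre_dma(void *p, size_t n)  { (void)p; (void)n; step('P'); }
static void host_cache_post_dma(void *p, size_t n) { (void)p; (void)n; step('Q'); }

/* Fake 68K side: "transfers" by filling the buffer, then reports status. */
static void m68k_service(struct io_request *rq)
{
    step('T');
    memset(rq->dest, 0x5A, rq->nbytes);
    rq->state = REQ_OK;
}

/* Host side: write parameters, prepare caches, hand over, tidy up, check. */
static int host_read(struct io_request *rq, uint8_t *buf, size_t n)
{
    assert(rq != NULL && buf != NULL);
    rq->dest = buf;
    rq->nbytes = (uint32_t)n;
    rq->state = REQ_PENDING;
    step('W');
    host_cache_pre_dma(buf, n);
    m68k_service(rq); /* real system: signal the 68K, sleep until woken */
    host_cache_post_dma(buf, n);
    return rq->state == REQ_OK ? 0 : -1;
}
```

After one call to `host_read()`, `trace` reads `WPTQ`: parameters written, caches prepared, transfer done by the 68K side, caches reconciled before the host looks at the data.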

Quote

If so, this is a lot like what the Sega Saturn does in practice when you use both CPUs. In practice, the good development houses would have to utilize the caches of the CPU that is not performing I/O, but it did work well in games such as VF2 and Fighting Vipers, where one CPU performed the player character, and the other did the AI of the enemy. Parallel processing on older machines with only one memory bus is certainly a complex venture it seems.

I'm not really clear how the Saturn works, but yes, my idea is that the 68K runs as an uncached IO processor with the sole exception that it can cache some private working area that the PPC never touches.

Karlos · « **Reply #8 on:** April 12, 2012, 10:13:15 PM »

Quote from: Zac67;688262

@Karlos
I think I got it now... I'd call it 'fine-grained cache coordination' - definitely doable, provided the quick signalling bit (interrupts passing) works, but I can hardly imagine how anything should be working without it. You'd have to write new drivers for the 68k side though (or provide elaborate frontends to the existing ones, not sure if a generic one can do the trick). I can imagine that it may be easier to write new PPC drivers from scratch (definitely for something generic as IDE - yes, only an example of course). I got a bit distracted with the 'DMA' bit, but that's just one of the ways you could use the passing of jobs back and forth - very much like the AOS messaging system actually!

The reason I mentioned DMA is that this would seem to me to be the most useful thing the 68K could be doing as a service for the PPC native host OS simply because it could be implemented in the non-cache-flush-context-switchy manner described. That is to say, the PPC upon which the whole OS now runs sees the 68K as some IO processor that can read and write data from anywhere in the machine's entire address space - memory, ports, everything. And the most obvious use for this capability would be transferring data to/from the motherboard (IDE, parallel etc) resources from/to host memory buffers. This thus makes the 68K a very versatile IO processor for all the legacy hardware, one that is virtually limitlessly reprogrammable.

If you want to run actual 68K applications, the existing JIT emulation is much more sensible than some inverted WarpOS style idea, which would suffer all the same problems WarpOS itself does.

However, even if my idea could be made to work (and there's no guarantee, but I can't see any massive obstacles), as you suggest, entirely new PPC native drivers (and their 68K counterpart code) would be needed that could leverage such a system. If it could be made to work, however, to me it seems like a legitimate use for the old silicon. My 040 can read and write data from the old ports just as fast as the PPC can since the limit is usually the port. The main benefit of using the 68K to do this is to allow the PPC to do other stuff while the IO is in progress.

Karlos · « **Reply #9 on:** April 12, 2012, 11:23:57 PM »

Quote from: psxphill;688273

There is no reason they couldn't have included a 68k emulator in rom to run AmigaOS.

I disagree. First of all, 68K emulation on PPC was not as mature back when these cards were devised. They'd have spent a long time developing a 68K emulation that probably ended up slower than their faster 68K boards, substantially so if they had to use interpretive methods. Apple had this very issue when they first moved to PPC only. Except at least they had the benefit of a PPC OS to mitigate it. We didn't even have that then.
