Author Topic: PowerPC accelerator - how does that work then? (Read 11085 times)

Karlos · « **on:** January 30, 2012, 08:22:37 PM »

Quote from: Mrs Beanbag;678299

Ok so I know the PowerPC accelerators for classic Amigas have both a PPC chip and a 68040 on board, but when does the PowerPC chip come in and how does it know? Does a program have to run on the 68k at first (like a bootstrap) and then trigger some kind of switch, throw an exception, or call a library function or something?

Neglecting OS4.x and MorphOS 1.x for the moment, there are 2 solutions on OS3.x. The following description is somewhat generalized and may have mistakes and omissions as it was a long time ago that I last looked at it.

First there was PowerUP (or just PUP). This was developed by phase5 as the software to allow applications to take advantage of the PPC. Essentially, it comprises a kernel for the PPC, along with some APIs for developers and a set of patches to the host operating system to bootstrap the processor.

Once the system was up and running, the (patched) host OS was able to load PPC binaries in ELF format (it may have been modified a bit, but essentially the same format described in the old System V). An ELF could be an entire application for the PPC, or just a set of PPC optimised functions to be used by an otherwise 68K application. Code running on the PPC was basically independent of the 68K until it needed to call a host OS function (note that the PUP kernel provides memory management, tasks and so on for the PPC side).
The mechanism required to call 68K code from the PPC (or vice versa) necessitated that both processors flush their data caches. The parameters for the call would then be passed through memory from one to the other. While the "other" CPU was working, the corresponding process on the calling CPU was basically asleep until the other CPU completed and returned. This process was simply called a "context switch" (a term widely used in other areas of computing, but technically accurate nonetheless). I'm afraid I forget the specific underlying details of how all this was achieved. If memory serves, there was an arbitration "server" process involved in the communication between the two processors. Perhaps someone better qualified can answer that.
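
To make the hand-off described above a bit more concrete, here is a toy sketch in Python. It is not how PUP actually did it: two ordinary threads stand in for the two CPUs, and `CrossCpuMailbox`, `call_other_cpu`, `serve_one_call` and `flush_cache` are all made-up names for illustration. It only shows the shape of the thing: parameters parked in shared memory, a flush on each side, and the caller asleep until the other side returns.

```python
import threading


class CrossCpuMailbox:
    """Toy stand-in for the shared memory used to pass a call between CPUs."""

    def __init__(self):
        self.request_ready = threading.Event()
        self.reply_ready = threading.Event()
        self.function = None
        self.args = ()
        self.result = None
        self.flush_count = 0  # counts simulated cache flushes

    def flush_cache(self):
        # Placeholder only: real hardware would write dirty cache lines
        # back to RAM here, which is the expensive part.
        self.flush_count += 1


def call_other_cpu(mailbox, function, *args):
    """Caller side: park the parameters in memory, then sleep until the reply."""
    mailbox.function, mailbox.args = function, args
    mailbox.flush_cache()          # make the parameters visible to the other CPU
    mailbox.request_ready.set()
    mailbox.reply_ready.wait()     # the calling task is asleep here
    mailbox.reply_ready.clear()
    return mailbox.result


def serve_one_call(mailbox):
    """Other CPU: wake on a request, run it, flush, hand the result back."""
    mailbox.request_ready.wait()
    mailbox.request_ready.clear()
    mailbox.result = mailbox.function(*mailbox.args)
    mailbox.flush_cache()          # make the result visible to the caller
    mailbox.reply_ready.set()


def demo():
    """Run one simulated cross-CPU call and report (result, flushes)."""
    mailbox = CrossCpuMailbox()
    other_cpu = threading.Thread(target=serve_one_call, args=(mailbox,))
    other_cpu.start()
    result = call_other_cpu(mailbox, lambda a, b: a + b, 2, 3)
    other_cpu.join()
    return result, mailbox.flush_count
```

Every call costs two flushes in this model, one in each direction, which is why the number of calls mattered far more than what each call did.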

The system worked, but early versions suffered poor performance due to the overhead of said context switches (this was addressed later). In the time it took either processor to flush its cache, it could have done millions of instructions. Other criticisms revolved around the use of ELF, which was seen as an alien standard on the Amiga*, which had its own hunk format for binaries. Soon thereafter, WarpUP appeared (later called WarpOS), a rival system that extended the Amiga hunk format to allow PPC code segments in executables, leading to "fat" binaries.

Fundamentally, it was similar to PUP: a PPC kernel, API and a set of OS patches. The main differences were in the implementation. WarpUP tried to mirror Exec's structure for the services it provided. Threads were in pairs, one for each CPU, only one of which would be active at any time (there were asynchronous context switches but they were tricky to use). A single threaded application that used the PPC would thus really be two threads, one on either chip, only one of which was running at any instant, the other sleeping and waiting for its counterpart to return. However, they shared many fundamental components, such as signals. If you sent CTRL-C to one, the corresponding signal bit would be set in both the Task and TaskPPC (or at least it was routed to whichever was active, I forget).

It still required cache flushing in the same way as PUP; the main advantage of the system (at the time) was that it was considerably quicker at it. That performance gap closed in later versions of PUP, but by then we'd enjoyed the "kernel" wars and WarpUP had become the unofficial standard. At least until OS3.9, where the picture.datatype used it. That's about as official as it ever got.

Each system had its strengths and weaknesses. The Achilles heel of both was the context switch. It may have been faster on WarpUP, but in real terms it was still shockingly expensive. You had to carefully design code to minimise these if you wanted your application to run acceptably.

*ironically, all the offshoot operating systems use it for their native binary format.

Karlos · « **Reply #1 on:** January 30, 2012, 08:36:34 PM »

Quote from: itix;678304

Or to make long story short: PPC is usable only as co-processor where it is completing heavy computing on a background. Calling an OS from PPC is too expensive.

That's not exactly true, though, is it? There were plenty of applications that ran almost entirely on the PPC, calling the OS to do IO, graphics etc.

The way I used to write code (albeit for WarpUP, never really did much for PUP) was to have a main 68K function that was polled from the PPC (usually once per frame in a graphical application). That function would then do all the OS stuff: switch screenbuffers, collect the contents of IDCMP messages and so on, before returning to the PPC. The PPC would then do a bunch of native callbacks for said messages before continuing on with its main loop. It was surprisingly effective.
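
The shape of that loop can be sketched roughly as follows. This is plain Python, not WarpUP code: `FakeHost68k`, `poll_frame` and `run_frames` are hypothetical names, and the "68K side" is just an object that counts how often it gets called. The point is that all the OS work is batched into one cross-CPU call per frame, and the messages are handled natively afterwards.

```python
from collections import deque


class FakeHost68k:
    """Hypothetical stand-in for the 68K side; counts cross-CPU switches."""

    def __init__(self, pending_messages):
        self.pending = deque(pending_messages)
        self.context_switches = 0

    def poll_frame(self):
        """One cross-CPU call per frame: do the OS work, return copied messages."""
        self.context_switches += 1
        # A real version would swap screen buffers and drain the IDCMP
        # port here; this sketch only hands back whatever was queued.
        messages = list(self.pending)
        self.pending.clear()
        return messages


def run_frames(host, handlers, frame_count):
    """PPC-side main loop: poll once per frame, then dispatch natively."""
    handled = []
    for _ in range(frame_count):
        for kind, payload in host.poll_frame():
            handler = handlers.get(kind)
            if handler is not None:       # unknown message kinds are ignored
                handled.append(handler(payload))
        # ... the frame itself would be rendered on the PPC here ...
    return handled
```

With three messages queued and three frames run, this makes three switches in total rather than one per message or per OS call, which is where the saving comes from.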

Karlos · « **Reply #2 on:** January 30, 2012, 08:46:39 PM »

An interesting side-effect of the way both systems worked is that you actually got a degree of multiprocessing.

When a 68K thread invoked a PPC operation, it basically went to sleep until the PPC finished working and returned, and vice versa.

The PPC kernels were both, as far as I know, preemptive, as of course is Exec. So, while a given 68K thread is asleep and the PPC is doing some work for it, Exec is able to run a different 68K thread during that time (memory bandwidth permitting). Likewise, when any given PPC thread is asleep waiting for the 68K, the PPC kernel is able to run a different PPC thread.

Both processors ran with cache and copyback enabled, so if whatever they were executing didn't hit memory at the same time, you could get a 68K thread and an (unrelated) PPC thread running concurrently.

Still, it probably didn't happen that often given that most people ran only one PPC application at a time.

Karlos · « **Reply #3 on:** January 30, 2012, 09:15:10 PM »

Quote from: Tension;678303

barry do teh money dance

Also, the above...

/me slowly edges backwards towards the door, smiling at the strange man from Belfast in a non-threatening manner :lol:

Karlos · « **Reply #4 on:** January 30, 2012, 09:29:15 PM »

Quote from: commodorejohn;678313

So nobody outside of OS4/MOS went the Apple route and simply did 68k in emulation? Huh.

That was the plan for later PPC accelerators that never materialized. Phase5 had announced the CyberstormG4 and BlizzardG4 line. I badly wanted the latter and had already located a source of spare kidneys to fund one, but alas it never appeared. Which is probably good news for one black market transplant recipient, I don't think any credible surgeon would be fooled by what I found at my local butchers...

Later, there was the AmiJoe G3, but again, it was never released.

All of these boards lacked a 68K processor and would require emulation resources in their firmware.

Karlos · « **Reply #5 on:** January 30, 2012, 11:22:47 PM »

Let's be clear about one thing. Commodore (the original) never produced any PPC hardware for Amiga at any time and future PPC compatibility was not on their radar. They did, however, produce x86 boards but they were designed to run x86 operating systems (typically MS-DOS) for business use. They weren't integrated with AmigaOS to any extent.

The PPC boards were entirely third party aftermarket expansions. However, their intent was to augment the host machine and allow AmigaOS to run applications that would otherwise be too processor intensive for 68K.

Karlos · « **Reply #6 on:** January 31, 2012, 12:26:43 AM »

@vox

What about AROS on x86? The first versions of that go back to mid to late 90's. I may be wrong, but I'm sure it appeared before there were any PPC boards on sale.

Also, regarding Amithlon, see the "big-endian x86" target. You could compile code for x86 with the automatic modification that all memory accesses are swapped (according to their size) to ensure interoperability with the OS. It was clever stuff.
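
As an illustration of the idea only (I have no knowledge of how that compiler target actually implemented it), the effect of swapping every access according to its size can be sketched in Python. `store_be`, `load_be` and `swap_bytes` are made-up helper names; the `bytearray` stands in for memory shared with the big-endian OS.

```python
def swap_bytes(value, size):
    """Reverse the byte order of an unsigned integer that is `size` bytes wide."""
    return int.from_bytes(value.to_bytes(size, "little"), "big")


def store_be(memory, address, value, size):
    """Store an unsigned integer so the OS sees it in big-endian order."""
    memory[address:address + size] = value.to_bytes(size, "big")


def load_be(memory, address, size):
    """Load an unsigned integer that the OS wrote in big-endian order."""
    return int.from_bytes(memory[address:address + size], "big")
```

A 32-bit store of `0x12345678` therefore lands in memory as the bytes `12 34 56 78`, exactly as 68K code would expect, while a single byte needs no swap at all. The swap width has to match the access width, which is why it has to be done per access rather than once over a whole buffer.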

Karlos · « **Reply #7 on:** January 31, 2012, 01:11:52 PM »

Quote from: itix;678387

What applications? Unless you mean games, do you

Processor intensive applications; principally games and media players. Not much productivity-wise, though there were some utilities that could benefit. PPC accelerated datatypes were handy, particularly image ones with the OS3.9 picture.datatype, where colour conversion and dithering can be offloaded to the PPC too.

Even StormC 4 had PPC native compiler backends for both PPC and 68K targets, which helped when compiling larger projects with it (Storm had plenty of issues but that's a different discussion).

Karlos · « **Reply #8 on:** January 31, 2012, 08:51:19 PM »

Quote from: itix;678419

PPC native picture datatypes were nice but they were slower than 68k counterparts for small images. Still fast enough to replace 68k datatypes on my system where possible.

That's a bit over-simplistic. Well-written code could use 68K where the cache flush overhead of invoking the PPC becomes too significant for something like decoding a small image. Not sure how many datatypes did that, but I experimented with patching things like CopyMem() on my A1200 and used PPC for large copies (those bigger than 2x the combined cache size). Of course, most copies are too small to benefit, but synthetic tests validated the concept.
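
The dispatch rule from that experiment boils down to a size check. A minimal sketch, assuming the two copy routines and the combined cache size are simply handed in by the caller (`choose_copy_cpu`, `make_patched_copy`, `copy_68k` and `copy_ppc` are illustrative names, not the real patch):

```python
def choose_copy_cpu(size_bytes, combined_cache_bytes):
    """Pick the PPC only for copies bigger than twice the combined cache size."""
    threshold = 2 * combined_cache_bytes
    return "ppc" if size_bytes > threshold else "68k"


def make_patched_copy(copy_68k, copy_ppc, combined_cache_bytes):
    """Build a copy routine that routes each request to one of two backends."""

    def patched_copy(source, dest, size_bytes):
        if choose_copy_cpu(size_bytes, combined_cache_bytes) == "ppc":
            # Large copy: the flush overhead is assumed to be amortised.
            return copy_ppc(source, dest, size_bytes)
        # Small copy: stay on the 68K and avoid the context switch entirely.
        return copy_68k(source, dest, size_bytes)

    return patched_copy
```

The threshold is strict, so a copy of exactly twice the combined cache size still stays on the 68K. Where the break-even point really sits depends on the hardware and would need measuring, as with the synthetic tests mentioned above.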

Quote

Video players on Amiga were always outdated and could display only few formats. And often my BlizzardPPC @ 160MHz was not fast enough to play videos at full frame rate.... PPC accelerated MP3 players were nice even if I had only 8-bit Paula.

Well, I was luckier in that my PPC was 240MHz, but my 040 only 25. The gulf in performance between the two is pretty conspicuous. At least one of the PPC players for 3.x was capable of using cgxvideo.library on my BVision. The Permedia2 doesn't have any video acceleration per se, but it does allow YUV textures to be scaled and painted onto trapezoids (judging by the occasional diagonal shear when not vsynced, it would seem this is exactly how it worked).

Quote

PPC native Quake was very cool though. Even if little illegal. I am sure Quake sold more PPC accelerators than anything else...

But when I think about it paying 250-300 euro for PPC accelerator maybe was not that bad... it is just funny when an accelerator board costs more than your computer is worth :-)

Well, one can slam the old PPC boards, but ever since fully PPC native operating systems have worked on them, they've shown that they are capable of giving a significant boost in performance over 68K when not shackled down by the old context switch mechanisms necessitated by the dual processor approach. My PPC board can run almost all my 68K classic software faster on OS4.1 than my 040 could, sometimes by a wide margin. The next generation of PPC accelerator cards from phase5 were set to be PPC only.
