Author Topic: Do you approve of PPC (in some form) as the future of Amiga? (Read 29453 times)

Karlos · « **on:** October 13, 2010, 08:53:12 AM »

It's all rather irrelevant IMHO.

Approval or not, your choices are dictated by those developing it. OS4 and MOS are written for PPC hardware. I don't expect either of them to jump ship to x86, or even ARM any time soon. If you are disappointed with this, then you can take the AROS route.

There are pros and cons to each platform, but the point is, if you think PPC was a bad idea, you aren't actually stuck with it.

Karlos · « **Reply #1 on:** October 13, 2010, 09:28:15 AM »

Quote from: gertsy;584436

Other. I know I'm dreaming but I'd love to see some kind of multi GPU based system running OS4.1. Something like the Nvidia Tesla of a couple of years ago.

Trust me, you wouldn't want this. GPUs are not a simple replacement for CPUs; they have very specific programming requirements. Beyond the obvious, they aren't multithread-friendly in the way a multicore CPU with SMP is, where several totally unrelated threads of execution can be running concurrently. nVidia GPUs run very large numbers of threads through the same code in lock-step. A multiprocessor will execute a warp of such threads completely in parallel. The moment you hit a conditional branch and some threads take conflicting paths, execution is serialized until the code paths merge again (the hardware scheduler is nice though; whenever any of the threads hits a slow memory access, it will switch the warp out for another warp of threads that is ready to execute).
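As a rough illustration of the divergence point above, here is a toy cost model in C. It is not CUDA and not how the hardware is implemented; `warp_branch_cost` and its cost parameters are hypothetical names made up for this sketch. It only captures the idea that a warp pays for both sides of an if/else as soon as its lanes disagree.

```c
#include <stdbool.h>
#include <stddef.h>

enum { WARP_SIZE = 32 }; /* lanes per warp on nVidia hardware */

/* Toy model (an assumption, not a hardware description): the number of
   issue slots a warp spends on an if/else. If every lane takes the same
   side, only that side is executed. If the lanes disagree, both sides
   are executed one after the other with the inactive lanes masked off. */
unsigned warp_branch_cost(const bool taken[WARP_SIZE],
                          unsigned then_cost, unsigned else_cost)
{
    bool any_then = false;
    bool any_else = false;

    for (size_t lane = 0; lane < WARP_SIZE; lane++) {
        if (taken[lane])
            any_then = true;
        else
            any_else = true;
    }
    return (any_then ? then_cost : 0u) + (any_else ? else_cost : 0u);
}
```

With all 32 lanes agreeing, the cost is that of one path; a single dissenting lane is enough to make the whole warp pay for both.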

Furthermore, prior to the Fermi architecture, the GPU could only execute a single kernel (a block of code to be run over a dataset, not a kernel in the OS core sense) at a time. This is at least one major improvement of the latest generation, but again, even with several kernels running concurrently, each one needs to be some massively parallel task to get any benefit from GPU execution.

Karlos · « **Reply #2 on:** October 13, 2010, 10:07:22 AM »

Quote from: djrikki;584444

Reading through a few comments on here. As far as I understood it PPC has a smaller instruction set than x86 and like for like will run much faster than an x86.

You realise a smaller instruction set means more instructions are required to implement any given bit of code, right? Like for like, once you hit the baseline minimum of one instruction per clock, the CPU with less code to execute will win.

Also, the PPC isn't really that RISC as far as the number of supported operations goes; it has plenty of instructions and variants. Its classification as RISC is rather more architectural (consistent instruction word encoding, load/store design etc).

Anyway, I'm afraid that the observation of PPC versus x86 is far out of date. It certainly hasn't held since the AMD64 architecture at any rate, and all new "x86" desktop processors tend to be AMD64.

Karlos · « **Reply #3 on:** October 14, 2010, 01:03:07 PM »

Quote from: the_leander;584617

Only the earth shattering costs of reworking an arch that hasn't seen development in what, 16 years?

At least the 68010+ already meets the Popek & Goldberg virtualisation requirements

Karlos · « **Reply #4 on:** October 15, 2010, 11:47:12 AM »

Quote

Let's look at some 68k code to see what is so great about the 68k. Let's take a simple 68k memory copy with size (longwords) in d0...

.loop:
move.l (a0)+,(a1)+
subq.l #1,d0
bcc.b .loop

Let's say we don't know the alignment of the data either. This copies 1 longword/cycle with data aligned and is 6 bytes. If data is unaligned this is still pretty good. Now write that on PPC with anywhere near the performance. Don't let the old outdated 68060 with tiny little cache and only 4 bytes of instruction fetch/cycle DESTROY. I'll even give you a few hints. You better align the data first or the performance is really bad. You will need twice as many instructions to duplicate what's above. You will need to use an unrolled loop (wasting more code) and preload the cache. If you do all that optimally, you are still likely slower than the 68060. No wonder PPC needs all those GHz.

This example isn't really good at anything other than demonstrating the 68K is forgiving of lazy programmers. The alignment is never an unknown property; all you need to do is test the least significant bit(s) of the source and destination addresses. You can also test the count and build a nice Duff's device loop in assembler, handling only the leading and trailing bytes separately. On the 060, you might even be able to use move16, under the right circumstances; even when source and destination are not 16-byte aligned, you can often read (or write) via a temporary cache line in an appropriately aligned bit of stack.
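A minimal sketch of that approach, written in C rather than assembler to keep it portable. `copy_alignment_aware` and `move_long` are hypothetical names, not exec.library routines, and the 4-way unroll is an arbitrary choice for illustration. The alignment test is just a mask on the low bits of both addresses, and the longword body uses a Duff's device so that no separate remainder loop is needed for the longword count.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move one longword and advance both pointers. The fixed-size memcpy
   calls keep this well-defined C; compilers typically reduce them to a
   single load and a single store. */
static void move_long(unsigned char **d, const unsigned char **s)
{
    uint32_t w;
    memcpy(&w, *s, sizeof w);
    memcpy(*d, &w, sizeof w);
    *d += sizeof w;
    *s += sizeof w;
}

/* Hypothetical helper: copy n bytes between non-overlapping buffers. */
void copy_alignment_aware(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Longword moves only pay off if both addresses share the same
       misalignment, i.e. their two least significant bits match. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Head: single bytes until both pointers are longword aligned. */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }

        /* Body: Duff's device, four longwords per pass. The switch
           jumps into the middle of the loop to absorb the remainder. */
        size_t longs = n / 4;
        n &= 3u;
        if (longs > 0) {
            size_t passes = (longs + 3) / 4;
            switch (longs & 3u) {
            case 0: do { move_long(&d, &s);
            case 3:      move_long(&d, &s);
            case 2:      move_long(&d, &s);
            case 1:      move_long(&d, &s);
                    } while (--passes > 0);
            }
        }
    }

    /* Tail, or the whole copy when the misalignments differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

When the two misalignments differ, this sketch simply falls back to a byte loop; a tuned routine would instead align one side and shift or merge on the other, which is where the architecture-specific work comes in.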

In short, when implemented properly on the 68K or PPC, this will almost always be significantly faster than the lazy code above.

Karlos · « **Reply #5 on:** October 15, 2010, 04:18:12 PM »

@matthey

You rather missed my point: only a lazy programmer writes the smallest possible loop to do a job and then blames the architecture if performance sucks. The 68060 is forgiving, PPC is not, but the PPC will deliver far better performance when its rules are respected.

Regarding move16, it also depends on how much you want your cache polluting. If you are copying large amounts of data it has many advantages. You should never assume that because most copies are small, they all will be; well written code ought to be prepared for any reasonable eventuality.

Quote

Quote
The alignment is never an unknown property
Check out exec.library/CopyMem(). It will copy memory of any alignment. This function is used way more than exec.library/CopyMemQuick() which does longword aligned copies. Never say never.

No, "never" is perfectly accurate. The fact that any given function may not make use of the information does not mean the information is not there to be made use of. At a machine level, you will never have a transfer of data between addresses where at least the logical alignment is not known, since you have the addresses themselves.

Failing to use that information where it is useful is a bit lazy IMHO.

Karlos · « **Reply #6 on:** October 15, 2010, 08:46:53 PM »

Quote from: the_leander;584924

I remember when the X1000 was first demoed and all the Amigans.net and AW regulars came here in droves to big it up that Karlos put up a link somewhere showing that the PA6T wasn't a huge amount quicker per clock than the G5. Its single biggest selling point was its low power usage compared to the G5.

Either way, it'd be nomm'ed up by anything remotely recent.

Floating point performance of the PA6T was significantly higher than the G5 as I recall, though. Also remember that the performance was for one core. Of course, until OS4 / MOS get some sort of support for more than one core, that's a moot point.

Karlos · « **Reply #7 on:** October 15, 2010, 08:55:19 PM »

I dunno, I'm used to wielding hundreds of GFLOPS nowadays. All CPUs seem insignificant in comparison.

Karlos · « **Reply #8 on:** October 15, 2010, 11:03:28 PM »

Quote

And you missed MattHey's point that a small inlined loop that can execute at the same efficiency of the big optimized loop in a subroutine makes the latter technique obsolete.

No, it does not. The original comparison to which I was replying was one of alleged 68K superiority over PPC in being able to execute such a loop effectively. The critical miss in this argument is the implication that the PPC is a poorer architecture because of this. This is, of course, complete nonsense. It's simply a different architecture with different gains and trade-offs. A non-lazy programmer will learn these and write code accordingly, not complain that the simplest possible loop is not as fast as it could be on the basis of the behaviour of a completely different architecture. Being able to do this on 68060 does not obsolete the technique at all when talking about a different CPU (the PPC) or even an earlier m68k.

The PPC can do floating point multiply add. That requires 2 instructions on 6888x/68040/68060. How horridly inefficient. It can also do bounded rotates and shifts, which require several instructions on 680x0. The 486 had bswap. Does that mean the 68K was utter pants for requiring 3 instructions to accomplish the same?
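For reference, the three-step byte swap can be modelled in C as below. This assumes the usual 68K idiom of `ror.w #8,d0` / `swap d0` / `ror.w #8,d0`, which the post does not spell out; the function names are made up for the sketch.

```c
#include <stdint.h>

/* Models ror.w #8,d0: rotate the low 16 bits right by 8 (which swaps
   their two bytes) and leave the high 16 bits untouched. */
static uint32_t ror_w8(uint32_t x)
{
    uint32_t lo = x & 0xFFFFu;
    lo = ((lo >> 8) | (lo << 8)) & 0xFFFFu;
    return (x & 0xFFFF0000u) | lo;
}

/* Models swap d0: exchange the high and low 16-bit words. */
static uint32_t swap_words(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* The three-step sequence: AABBCCDD -> AABBDDCC -> DDCCAABB -> DDCCBBAA */
uint32_t bswap32_three_step(uint32_t x)
{
    x = ror_w8(x);
    x = swap_words(x);
    x = ror_w8(x);
    return x;
}
```

The result matches what a single 486 `bswap` produces; the difference is instruction count, not capability.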

For the last time, a non-lazy programmer concerned about performance writes the best possible code for the architecture. If that's a simple loop, then great, an easy win. If he has to unroll it and align operands, then that's what he does instead.

Many moons ago, I wrote a series of tests to gather information about memory performance and got a great deal of data back regarding this very type of operation over different types of memory (system ram, chip ram, RTG ram) and on different 680x0 / PPC. FWIW, despite suggestions to the contrary, I have always found that a suitably aligned, unrolled loop even on 68060 performs better (or at least no worse) than the naive case. I just don't presently have the data to hand in order to back that up.
