I'm wondering why that library was not optimized like yours back then.
Are all Amiga libraries being profiled and optimized like the layers one?
The fact is that layers *was* actually optimized; that's probably the thing one should understand. It was just optimized for a different goal. That's probably one of the surprises one has to learn, and one of the "take home" lessons from this whole discussion.
Yes, surprising, isn't it?
The point is: At the time layers was written, the CPU was slow, the blitter was fast, and chipmem was an expensive resource. Thus, layers was designed with these goals in mind: Use as little buffer memory as possible, possibly at the expense of additional data manipulations to be made by the fast blitter. This resulted, amongst many other decisions, in an algorithm that uses the double-xor trick to copy data between the screen and the off-screen buffer without needing an additional buffer. At the time, it was the right algorithm.
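To illustrate the idea of exchanging two regions without a third buffer, here is a minimal sketch in C. It assumes the trick is the classic XOR exchange; the function name xor_swap_words is hypothetical, and the real layers code drove the blitter rather than a CPU loop, so this shows only the principle, not the original implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: exchange two non-overlapping word buffers in place.
 * Three XOR passes replace the temporary buffer a plain swap would need. */
static void xor_swap_words(uint16_t *screen, uint16_t *offscreen, size_t count)
{
    /* XOR-swapping a region with itself would zero it, so forbid aliasing. */
    assert(screen != offscreen);

    /* Pass 1: screen now holds (old screen XOR old offscreen). */
    for (size_t i = 0; i < count; i++)
        screen[i] ^= offscreen[i];

    /* Pass 2: offscreen now holds the old screen contents. */
    for (size_t i = 0; i < count; i++)
        offscreen[i] ^= screen[i];

    /* Pass 3: screen now holds the old offscreen contents. */
    for (size_t i = 0; i < count; i++)
        screen[i] ^= offscreen[i];
}
```

Each pass reads both regions and writes one of them, which is cheap when a fast blitter does the work and costly when the CPU has to do it.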
Nowadays (or rather, ten years ago) things have changed: The CPU became fast, the blitter slow, graphics memory was not even reachable by the blitter, so the CPU had to emulate the blitter, and buffer memory became cheap. Making this fast therefore requires a completely different algorithm: avoid the double-xor, and use extra buffers where necessary to avoid any extra copy operation that would slow down the CPU.
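The opposite trade, spending memory to save CPU work, can be sketched in the same way. This is only an illustration under stated assumptions: buffered_swap_words and its scratch parameter are hypothetical names, and the post does not say how the rewritten layers actually arranges its buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: exchange two non-overlapping word buffers through a
 * caller-supplied scratch buffer holding at least `count` words. */
static void buffered_swap_words(uint16_t *screen, uint16_t *offscreen,
                                uint16_t *scratch, size_t count)
{
    assert(scratch != NULL);

    size_t bytes = count * sizeof *screen;

    memcpy(scratch, screen, bytes);     /* save the old screen contents   */
    memcpy(screen, offscreen, bytes);   /* bring the off-screen data in   */
    memcpy(offscreen, scratch, bytes);  /* park the old screen contents   */
}
```

Here every pass is a plain copy with one source and one destination, at the price of one extra buffer the size of the region.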
Nobody at CBM back then was overly stupid in creating the slice & dice operation in layers. They just optimized for the "wrong" goal, from today's perspective. The *good* part about layers is that it was written in C, not assembly, so one could take the algorithm apart and replace it with something equivalent, optimized for a different goal.
That's "software engineering", actually. Keep in mind that you're shooting at a moving target, and that the target might move too fast for low-level tricks to be a feasible approach to your problem. And within ten years, the hardware was apparently already moving too fast for CBM to take the opportunity to optimize...
Is the current GCC any good at doing optimizations for PowerPC after Apple's departure?
I seriously don't know. I don't have enough experience with the PPC to judge. Yes, I did a bit of programming on PPC, but that's not sufficient to make any statements on the compiler quality.