And you missed MattHey's point that a small inlined loop that can execute at the same efficiency of the big optimized loop in a subroutine makes the latter technique obsolete.
No, it does not. The original comparison I was replying to was one of alleged 68K superiority over PPC in being able to execute such a loop efficiently. The critical flaw in that argument is the implication that the PPC is a poorer architecture because of it. This is, of course, complete nonsense. It's simply a different architecture with different gains and trade-offs. A non-lazy programmer will learn these and write code accordingly, rather than complain that the simplest possible loop is not as fast as it could be, judging by the behaviour of a completely different architecture. Being able to do this on the 68060 does not make the technique obsolete on a different CPU (the PPC) or even on an earlier m68k.
The PPC can do a floating-point multiply-add in a single instruction. That requires 2 instructions on 6888x/68040/68060. How horridly inefficient. It can also do bounded rotates and shifts in one instruction, which require several on 680x0. The 486 had bswap. Does that mean the 68K was utter pants for requiring 3 instructions to accomplish the same?
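To make the bswap comparison concrete, here is a rough C sketch of the two approaches. The three-step version mirrors the usual 68K idiom (rotate the low word by 8, swap the word halves, rotate the low word by 8 again); the direct version is what a single bswap-style instruction computes. The function names are mine, purely for illustration, and what a given compiler actually emits for either is not something this shows.

```c
#include <stdint.h>

/* Rotate a 16-bit value by 8 bits, i.e. exchange its two bytes. */
uint16_t rot16_by_8(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Byte swap in three steps, modelled on the 68K sequence
   ror.w #8 / swap / ror.w #8 (hypothetical helper name). */
uint32_t bswap32_three_step(uint32_t x)
{
    /* Step 1: rotate the low word by 8, high word untouched. */
    x = (x & 0xFFFF0000u) | rot16_by_8((uint16_t)(x & 0xFFFFu));
    /* Step 2: exchange the high and low 16-bit halves. */
    x = (x << 16) | (x >> 16);
    /* Step 3: rotate the (new) low word by 8. */
    x = (x & 0xFFFF0000u) | rot16_by_8((uint16_t)(x & 0xFFFFu));
    return x;
}

/* The same result computed in one expression, which is what a
   single byte-swap instruction gives you. */
uint32_t bswap32_direct(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

Both return the same value for any input; the only difference is how many steps the architecture needs to get there, which is exactly why it says nothing about which architecture is "better".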
For the last time, a non-lazy programmer concerned about performance writes the best possible code for the architecture. If that's a simple loop, then great, an easy win. If he has to unroll it and align operands, then that's what he does instead.
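For anyone unsure what "simple loop" versus "unrolled" means here, a minimal sketch in C, assuming a plain longword copy. The function names and the unroll factor of 4 are arbitrary choices of mine for illustration. Whether the unrolled form wins depends entirely on the CPU, the memory being accessed and the compiler, which may well unroll the naive loop by itself.

```c
#include <stddef.h>
#include <stdint.h>

/* The naive case: one longword per iteration. */
void copy_naive(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Unrolled by 4 (an arbitrary factor for this sketch): four longwords
   per iteration, then a short tail loop for the 0..3 leftovers.
   Using uint32_t pointers assumes the buffers are longword aligned. */
void copy_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Tail: whatever did not fit into a full group of four. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

The two functions do identical work. Which one you ship is decided by measuring on the target, not by what happens to be fastest on some other CPU.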
Many moons ago, I wrote a series of tests to gather information about memory performance and got a great deal of data back regarding this very type of operation over different types of memory (system RAM, chip RAM, RTG RAM) and on different 680x0 and PPC CPUs. FWIW, despite suggestions to the contrary, I have always found that a suitably aligned, unrolled loop performs better than the naive case (or at least no worse), even on 68060. I just don't have the data to hand at present to back that up.