I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.
It is true that most AmigaOS calls to exec.library CopyMem() are small to medium sized, but there are *many* calls. It's not efficient to use CopyMem() for small copies because of the library JSR+JMP overhead, although CPU optimized code can reduce the overall cost to be close to that of non-library code for all but the smallest copies. AmigaOS CopyMem() uses a MOVE.B (A0)+,(A1)+ loop for small copies and a MOVEM.L loop for large copies. It's actually the MOVEM.L loop that is the most inefficient, because it only pays off on the 68000-68030 with large copies. An unrolled MOVE.L (A0)+,(A1)+ loop would be significantly better for the 68000-68060 in most cases.
tiny static size copy use '=' in C
small size copy use quick loop
medium size copy use unrolled loop (the 68060 doesn't benefit from unrolling in this case)
large size copy use MOVEM.L loop for 68000-68030, use unrolled MOVE16 loop for 68040-68060
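To make the size tiers a bit more concrete, here is a rough portable C sketch of the small and medium cases: a byte loop for small copies, and a 4x unrolled longword loop after minimal alignment for everything else. The name tiered_copy() and the SMALL_LIMIT cut-off are made up for illustration and are not measured values. A real 68k version would be written in assembler, and the MOVEM.L and MOVE16 tiers for large copies can't be expressed in C, so they are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off between the "small" and "medium" tiers.
   A placeholder for illustration, not a measured value. */
#define SMALL_LIMIT 16u

/* Sketch of a tiered copy. The regions must not overlap
   (same contract as memcpy). */
static void *tiered_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (n >= SMALL_LIMIT) {
        /* Minimal alignment: byte copy until the destination is
           longword aligned. At most 3 bytes, and n >= 16 here. */
        while (((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }

        /* Medium tier: four longwords (16 bytes) per pass. The
           fixed-size memcpy() calls stand in for MOVE.L (A0)+,(A1)+
           and keep the C free of alignment and aliasing problems. */
        while (n >= 16) {
            uint32_t w0, w1, w2, w3;
            memcpy(&w0, s, 4);
            memcpy(&w1, s + 4, 4);
            memcpy(&w2, s + 8, 4);
            memcpy(&w3, s + 12, 4);
            memcpy(d, &w0, 4);
            memcpy(d + 4, &w1, 4);
            memcpy(d + 8, &w2, 4);
            memcpy(d + 12, &w3, 4);
            s += 16;
            d += 16;
            n -= 16;
        }
    }

    /* Small tier, and the tail of a larger copy: quick byte loop. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Only the destination is aligned here, which is one possible reading of "minimal alignment". Whether aligning the source or the destination is better depends on the CPU and would need measuring.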
There is a trade-off with different memory copy techniques as SpeedGeek has mentioned. This applies to the exec.library CopyMem() as well as the C memcpy() and memmove() functions. Most memcpy()/memmove() calls are small and this is why vbcc uses inlined quick loops to get to work as fast as possible (after minimal alignment). SAS/C uses a subroutine call (BSR+RTS) to a poorly optimized unrolled loop with no aligning and a costly jump table at the end for the 68040+. The SAS/C memcpy() may be faster in some cases than the vbcc memcpy() for large aligned copies. Sadly, the SAS/C memcpy() probably beats the exec.library CopyMem() for medium to large copies on the 68040+.
With vbcc, for best speed use:
tiny static size copy use '=' in C
small size copy use C memcpy() and memmove()
medium size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
large size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
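From the caller's side, that list could look something like the sketch below. USE_EXEC_COPYMEM is a hypothetical build switch you would define yourself when compiling for AmigaOS, COPYMEM_MIN is a placeholder cut-off that would need measuring, and copy_point()/copy_block() are names invented for the example. Without the switch the code simply falls back to memcpy(), so it also builds on other systems.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef USE_EXEC_COPYMEM          /* hypothetical build switch */
#include <proto/exec.h>          /* AmigaOS: declares CopyMem() */
#endif

/* Hypothetical cut-off between "small" and "medium"; a placeholder
   that would have to be tuned by measurement on the target CPU. */
#define COPYMEM_MIN 64u

struct Point { short x, y; };

/* Tiny, statically sized copy: plain assignment, no call at all. */
static void copy_point(struct Point *dst, const struct Point *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}

/* Run-time sized copy of non-overlapping blocks. */
static void copy_block(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

#ifdef USE_EXEC_COPYMEM
    if (n >= COPYMEM_MIN) {
        /* Note the argument order: source first, then destination. */
        CopyMem((APTR)src, dst, (ULONG)n);
        return;
    }
#endif
    /* Small copies (and non-Amiga builds) use the compiler's memcpy(). */
    memcpy(dst, src, n);
}
```

The CopyMem() branch only pays off if exec.library has a CPU optimized CopyMem() installed. With the stock routine on a 68040+ the threshold would probably have to be much higher.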
If the exec.library CopyMem()/CopyMemQuick() used unrolled MOVE.L copy loops then we could be in good shape without patching. Patching for the uncommon large copies would become optional. MOVE16 does have the advantage of not flushing the DCache on large copies although it's questionable whether this is common enough and bug free enough to be standard.
The new version of vbcc 0.9d was recently released by the way:
http://sun.hasenbraten.de/vbcc/
With SAS/C, for best speed use:
tiny static size copy use '=' in C
small size copy use C memcpy() and memmove() for 68000-68030, use CPU optimized exec.library CopyMem() and CopyMemQuick() for 68040+
medium size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
large size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.
Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.
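The first step of such a project would be locating a known routine inside the loaded code. A minimal sketch of that step, assuming the byte signature of the routine you're looking for is already known: find_signature() is a hypothetical helper, not an existing API, and it only finds the match. Walking the segment list, handling relocations and patching the code safely are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the byte offset of the first occurrence of sig in code,
   or -1 if there is none. Only even offsets are tried, because
   68k instructions are always word aligned. */
static long find_signature(const unsigned char *code, size_t code_len,
                           const unsigned char *sig, size_t sig_len)
{
    size_t off;

    assert(code != NULL && sig != NULL);

    if (sig_len == 0 || sig_len > code_len)
        return -1;

    for (off = 0; off + sig_len <= code_len; off += 2) {
        if (memcmp(code + off, sig, sig_len) == 0)
            return (long)off;
    }
    return -1;
}
```

A real scanner would need one signature per compiler library version, and probably masks for bytes that vary between builds, so a plain memcmp() is the optimistic case.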
I considered this but decided it was better to optimize the compiler link library code next, and vbcc was the easiest place to start.