It can never be as optimised because CopyMemQuick() only works on long word aligned data, while CopyMem() has to work on arbitrarily aligned data.
Only the function prologue and epilogue differ. In CopyMem() both are longer: the prologue because it must check long word alignment, and the epilogue because it must check whether any bytes remain to be copied.
I looked at what the NewCMQ060 patch does, and in its CopyMem() patch the prologue is only 6 asm instructions longer than in its CopyMemQuick() patch. This is why small copies (4-16 bytes) take longer in CopyMem(): it has to execute more instructions before the actual memory copy starts.
But it does not matter, because CopyMemQuick() is not optimal for small copies either. It is better than CopyMem(), but it is not the best and it never can be.
In fact, in general use it could be better if CopyMem() shared its memory copy routine with CopyMemQuick(), because that would give a better chance of an L1 instruction cache hit.
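To make the structure concrete, here is a rough sketch in portable C. It is not the exec.library or NewCMQ060 code (those are 68k asm), and the names copy_longs(), copy_quick() and copy_general() are made up for illustration. It only shows where the extra prologue and epilogue work sits, and how both entry points could share one copy loop. As a simplification, the general version falls back to a plain byte loop when the pointers are not long word aligned, where a real implementation would normally align them first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared core: copies `longs` 32-bit words.
   Both pointers must be long word aligned. */
static void copy_longs(uint32_t *dst, const uint32_t *src, size_t longs)
{
    while (longs--)
        *dst++ = *src++;
}

/* CopyMemQuick()-like entry point: the caller guarantees aligned
   pointers and a size that is a multiple of 4, so there is nothing
   to check before or after the loop. */
void copy_quick(void *dst, const void *src, size_t size)
{
    copy_longs(dst, src, size / 4);
}

/* CopyMem()-like entry point: arbitrary pointers and size. */
void copy_general(void *dst, const void *src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Prologue: extra work before the copy, checking alignment. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0) {
        size_t longs = size / 4;

        /* Same loop as copy_quick(), so the code is shared. */
        copy_longs((uint32_t *)d, (const uint32_t *)s, longs);
        d += longs * 4;
        s += longs * 4;
        size -= longs * 4;
    }

    /* Epilogue: extra work after the copy, the 0-3 leftover bytes
       (or everything, if the pointers were not aligned). */
    while (size--)
        *d++ = *s++;
}
```

For a 4-16 byte copy the alignment test and the leftover-byte check are a large share of the total work, which matches the small-copy difference described above.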