@matthey
Incidentally, if you want to speed up your unrolled move16 loop a tiny bit, move the subq instruction before the last move16.
move16 doesn't affect the condition codes (CC), so you can do this safely. Not having to test the CC in the instruction immediately after the one that sets it might give a small speedup.
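
As a rough sketch of the reordering: the unroll factor of four, the registers (a0 = source, a1 = destination, d0 = block counter), the label and the bne.s loop-back are all guesses for illustration, not taken from your code.

```asm
; Hypothetical loop, all names assumed:
;   a0 = source, a1 = destination (both 16-byte aligned)
;   d0 = number of 64-byte blocks left to copy (nonzero on entry)
loop:
        move16  (a0)+,(a1)+
        move16  (a0)+,(a1)+
        move16  (a0)+,(a1)+
        subq.l  #1,d0           ; sets the condition codes
        move16  (a0)+,(a1)+     ; does not change the condition codes
        bne.s   loop            ; still sees the Z flag from subq.l
```

In the original order subq.l would sit directly above the branch. Here the last move16 sits between them, and the loop still copies the same data.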