Don't take the Voodoo figures too seriously; they're off the top of my head. I need to find the real ones (or retest, but that machine is currently in need of attention). The BVision figures are good, though.
I experimented a lot with move16, both for plain copying and for other operations such as byteswap copying. For that, you allocate a cache-aligned block on the stack, read data from the source, swapping as you go, then use move16 to copy the block out to VRAM. If you allocate enough cache-aligned space (say 64 bytes) you can unroll your transfer loop 4x, which was about ideal. With some carefully optimised routines you could even handle misaligned data, since the misaligned access happens when reading from the source rather than when transferring to the bitmap.
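For anyone who wants the shape of that byteswap copy, here's a rough sketch in portable C. It is not the original routine: swap16_copy and copy_line16 are names made up for illustration, the 16-bit swap is an assumption about the pixel format, and copy_line16 is just a memcpy stand-in for where a real 68040/060 build would issue one move16 (which moves a 16-byte line between 16-byte-aligned addresses).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  16u  /* size of one move16 transfer */
#define BLOCK_BYTES 64u  /* staging buffer: four lines, so a 4x unroll */

/* Stand-in for a single move16 (hypothetical helper): copies one
   16-byte line. On real hardware this would be inline assembly. */
static void copy_line16(void *dst, const void *src)
{
    memcpy(dst, src, LINE_BYTES);
}

/* Hypothetical helper: copy `bytes` from src to vram, swapping the two
   bytes of every 16-bit word. The source may be misaligned; the
   destination must be 16-byte aligned and `bytes` a multiple of 64. */
static void swap16_copy(void *vram, const void *src, size_t bytes)
{
    _Alignas(16) uint8_t block[BLOCK_BYTES]; /* aligned staging block on the stack */
    const uint8_t *s = src;
    uint8_t *d = vram;
    size_t i;

    assert((uintptr_t)vram % LINE_BYTES == 0);
    assert(bytes % BLOCK_BYTES == 0);

    for (; bytes != 0; bytes -= BLOCK_BYTES) {
        /* Read from the (possibly misaligned) source, swapping as we go. */
        for (i = 0; i < BLOCK_BYTES; i += 2) {
            block[i]     = s[i + 1];
            block[i + 1] = s[i];
        }
        /* Write the block out as four aligned line transfers. */
        copy_line16(d,      block);
        copy_line16(d + 16, block + 16);
        copy_line16(d + 32, block + 32);
        copy_line16(d + 48, block + 48);
        s += BLOCK_BYTES;
        d += BLOCK_BYTES;
    }
}
```

All the alignment trouble lands on the read side, where byte reads don't care, and the write side only ever sees whole aligned lines. A real routine would also need to deal with lengths that aren't a multiple of 64, which this sketch simply rejects.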
I'm not sure why move16 was faster to BVision VRAM, and it wasn't faster on every system tested, although on the BVision it was never slower. On some other cards, like the CVision64 IIRC, it was actually slower.
All very hardware-dependent.