In this case it's simply easier to render to a chunky buffer in fast RAM, and then copy/translate it to display memory using an optimised C2P (chunky-to-planar) routine. But even that's difficult, as you are turning 32 chunky bytes into 8 planar 32-bit words (one word per bitplane on a 256-colour screen). You inevitably need to park some working data in memory while you are processing things, so a large, fast CPU cache comes in very handy.
Luckily for us, the 68040, with its two independent 4K L1 caches (one for instructions, one for data), was released in 1990, allowing the Amiga to do 3D gaming properly.
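
For anyone who hasn't seen what the translation actually involves, here is a minimal sketch in C of the 32-bytes-into-8-longwords step. It is a naive reference version, not an optimised routine: real C2P code does the same job with far fewer operations using shift-and-mask merge steps, usually in hand-written 68k assembly. The function name `c2p_32px` and the choice of putting the leftmost pixel in bit 31 are my own assumptions for illustration.

```c
#include <stdint.h>

/* Naive reference C2P for one group of 32 pixels (illustrative only).
   chunky: 32 pixels, one byte each (8 bits deep).
   planes: 8 output longwords, planes[p] holds bit p of every pixel.
   Assumption: the leftmost pixel (chunky[0]) lands in bit 31. */
void c2p_32px(const uint8_t chunky[32], uint32_t planes[8])
{
    for (int p = 0; p < 8; p++) {
        uint32_t word = 0;
        for (int x = 0; x < 32; x++) {
            /* Shift the word left and append bit p of pixel x. */
            word = (word << 1) | ((uint32_t)(chunky[x] >> p) & 1u);
        }
        planes[p] = word;
    }
}
```

As a quick check: with pixel 0 set to colour 0xFF, pixel 31 set to colour 0x01 and everything else zero, plane 0 comes out as 0x80000001 and planes 1 to 7 come out as 0x80000000. Every output word depends on all 32 input bytes, which is why the intermediate values tend to spill into memory and why the cache matters so much.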

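And the 68040's data cache is not just a 4K L1 data cache, but a copyback data cache, which massively helps out with this sort of code. When you store things to memory, you don't really store them to memory straight away; you just store them in the L1 cache, which is superfast, and they only get written out to RAM later, when the cache line is pushed out.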
My 68040 saved me from a life of slow 68030 drudgery. Yahoo!