Might I suggest asking Karlos for his thoughts on the matter?
Not much to suggest that I haven't repeated before like some broken record. Totally and utterly forget Warp3D as a driver layer. When all we had were Permedia2, CV3D and basic Voodoo, it was enough, but not any more.
My suggestion for OS 3.x, regardless of what hardware it is running on, would be:
1) Create a new driver layer, totally from scratch, designed from the ground up to be as efficient as possible (as real 68K processors aren't fast).
2) Make sure the driver system works directly with suitable OS BitMaps and has no special needs (other than allocating VRAM for depth/stencil/texture buffers). Then provide as complete a set of accelerated 2D and 3D operations as possible. After all, OpenGL isn't just for 3D. Once installed, it should also be possible to patch various graphics.library functions to use it, since existing RTG sucks as far as acceleration is concerned. For this, some sort of fast, minimal locking protocol should be supported, one that backs up and restores only the bare minimum of potentially-destroyed registers and is essentially independent of whatever is used for the main 3D stuff. This is essential if you plan to accelerate individual graphics.library calls, as you can't afford the setup overhead of a typical W3D_LockHardware() style call (see the first sketch after this list).
3) Provide a transformation / lighting / clipping pipeline in the driver layer, abstracted so that hardware-specific drivers can leverage hardware acceleration for this on Radeon cards, while other hardware can at least interleave the necessary calculations on the CPU with the rasterizing happening on the card, to get the best out of whatever parallelism is available (see the second sketch after this list).
4) Provide a thin-layer wrapper for OpenGL/GLUT around the whole thing.
5) Provide a thin-layer wrapper for Warp3D around it, bypassing any T&L stage (sketched third below).
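To make the locking idea in point 2 concrete, here is a rough sketch of how a minimal lock and BitMap-centric 2D entry points could look. Every name in it (GfxDrv, QuickLock and so on) is invented purely for illustration, and a real design would also have to worry about interrupts, task switching and per-chip register sets:

```c
/* Hypothetical driver interface -- all names are illustrative only. */

#include <exec/types.h>
#include <graphics/gfx.h>

/* Scratch area for the handful of registers a single 2D op may clobber. */
typedef struct GfxDrvRegSave {
    ULONG regs[8];              /* each driver decides how many it needs */
} GfxDrvRegSave;

typedef struct GfxDrv {
    /* Cheap lock: saves only the registers a simple blit/fill touches,
       unlike a full W3D_LockHardware()-style context switch. */
    BOOL (*QuickLock)(struct GfxDrv *drv, GfxDrvRegSave *save);
    void (*QuickUnlock)(struct GfxDrv *drv, GfxDrvRegSave *save);

    /* 2D entry points that work directly on ordinary OS BitMaps. */
    void (*FillRect)(struct GfxDrv *drv, struct BitMap *bm,
                     UWORD x, UWORD y, UWORD w, UWORD h, ULONG pen);
    void (*BlitRect)(struct GfxDrv *drv, struct BitMap *src,
                     struct BitMap *dst, UWORD sx, UWORD sy,
                     UWORD dx, UWORD dy, UWORD w, UWORD h);
} GfxDrv;

/* A patched graphics.library RectFill() would then reduce to roughly: */
static void PatchedRectFill(GfxDrv *drv, struct BitMap *bm,
                            UWORD x, UWORD y, UWORD w, UWORD h, ULONG pen)
{
    GfxDrvRegSave save;
    if (drv->QuickLock(drv, &save)) {   /* back up the bare minimum */
        drv->FillRect(drv, bm, x, y, w, h, pen);
        drv->QuickUnlock(drv, &save);   /* restore and release */
    }
    /* else: fall back to the existing software path */
}
```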
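For point 3, the abstraction could be as simple as a function-pointer stage in the pipeline: a Radeon driver points it at the hardware T&L engine, everything else points it at a 68K routine, and as long as the rasterizer call just queues work and returns, the CPU can be transforming batch N+1 while the card is still drawing batch N. Again, all names here are invented:

```c
/* Hypothetical T&L stage -- illustrative names only. */

#include <exec/types.h>

typedef struct GDVertex {
    float x, y, z, w;           /* position */
    float r, g, b, a;           /* colour after lighting */
} GDVertex;

typedef struct GDPipeline {
    /* Transform, light and clip one batch of vertices.  A Radeon driver
       points this at its T&L engine; everyone else at a 68K routine. */
    ULONG (*TransformBatch)(struct GDPipeline *p, const GDVertex *in,
                            GDVertex *out, ULONG count);
    /* Queue an already-transformed batch for rasterizing and return
       immediately, so CPU and card can work in parallel. */
    void  (*RasterizeBatch)(struct GDPipeline *p,
                            const GDVertex *verts, ULONG count);
} GDPipeline;

static void DrawBatches(GDPipeline *p, const GDVertex *src,
                        GDVertex *out[2], ULONG batches, ULONG perBatch)
{
    ULONG i;
    for (i = 0; i < batches; i++) {
        /* Double-buffered output; a real driver must wait for the card
           to finish with a buffer before reusing it. */
        GDVertex *dst = out[i & 1];
        /* Transform batch i on the CPU (or in hardware)... */
        p->TransformBatch(p, src + i * perBatch, dst, perBatch);
        /* ...then hand it to the card and move straight on to i+1. */
        p->RasterizeBatch(p, dst, perBatch);
    }
}
```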
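Finally, the Warp3D shim from point 5 is almost mechanical, because Warp3D already hands you screen-space vertices: translate each call into the native equivalent and skip the T&L stage entirely. W3D_DrawTriangle() and W3D_Triangle are real Warp3D names (the struct layouts below are simplified stand-ins; check Warp3D.h for the real ones), while the gd_* interface is the same invented one as above:

```c
/* Hypothetical Warp3D compatibility shim. */

#include <exec/types.h>

/* Simplified stand-ins for the real Warp3D structures (see Warp3D.h). */
typedef struct W3D_Vertex   { float x, y, z, w, u, v; } W3D_Vertex;
typedef struct W3D_Triangle { W3D_Vertex v1, v2, v3; } W3D_Triangle;
typedef struct W3D_Context  W3D_Context;

/* The invented native interface: rasterize one screen-space triangle. */
typedef struct GDVertex { float x, y, z, w, r, g, b, a, u, v; } GDVertex;
extern void  gd_RasterizeTriangle(void *drv, const GDVertex v[3]);
extern void *gd_DriverFromW3DContext(W3D_Context *ctx);

static void ConvertVertex(GDVertex *out, const W3D_Vertex *in)
{
    out->x = in->x; out->y = in->y; out->z = in->z; out->w = in->w;
    out->u = in->u; out->v = in->v;
    out->r = out->g = out->b = out->a = 1.0f;   /* colour omitted here */
}

ULONG W3D_DrawTriangle(W3D_Context *ctx, W3D_Triangle *tri)
{
    /* Warp3D vertices are already transformed, so the new driver's
       T&L stage is bypassed: just convert and rasterize. */
    GDVertex v[3];
    ConvertVertex(&v[0], &tri->v1);
    ConvertVertex(&v[1], &tri->v2);
    ConvertVertex(&v[2], &tri->v3);
    gd_RasterizeTriangle(gd_DriverFromW3DContext(ctx), v);
    return 0;                       /* W3D_SUCCESS in the real API */
}
```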
You could look at Gallium and the like for inspiration, but I'd not recommend it as the basis of a 68K driver system: whatever you create needs to be designed and built with the limitations of existing 68K systems in mind, not just what is theoretically possible on far faster emulated systems. If you do it right, it should scale up on the latter but still deliver usable performance on the former.