Is there any "legal" way to put new commands into the ring buffer of the GPU (aka GPU FIFO)? Or do you just examine the GPU registers to see if the read pointer and write pointer are equal, and assume it's ready for new commands to be added there?
I'm not sure what you are asking exactly. The R100/200 3D drivers communicate with the radeon's ring buffer through the same system resource that the 2D driver does. Just to be clear, my musings above are in relation to the Permedia2 specifically.
Purely hypothetically, if I were to retrofit a 3D T&L pipeline to Warp3D I'd probably do so by exposing an interface that's not a million miles from the model already defined for the R100/200: some methods to set the transformation matrices to be applied to geometry and texture coordinates, plus lighting and material properties and clipping planes. In the corresponding drivers for R100/200 these would be mapped onto its hardware implementation of these things for maximum efficiency.
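To make that idea a bit more concrete, here is a rough sketch in C of what such a state-setting interface could look like. Every name in it (TnlState, TnlBackend, tnl_set_modelview and so on) is made up for illustration; none of it is part of Warp3D or any existing driver. The only point is that the application sets state through small methods and each driver decides how to apply it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; this is not an existing Warp3D API. */
enum {
    TNL_DIRTY_MODELVIEW  = 1u << 0,
    TNL_DIRTY_PROJECTION = 1u << 1,
    TNL_DIRTY_MATERIAL   = 1u << 2,
    TNL_DIRTY_CLIPPLANE  = 1u << 3
};

typedef struct {
    float modelview[16];
    float projection[16];
    float diffuse[4];      /* material diffuse colour, RGBA */
    float clip_plane[4];   /* plane coefficients a, b, c, d */
    unsigned dirty;        /* which groups changed since the last flush */
} TnlState;

typedef struct {
    /* A driver with hardware T&L would turn this into register writes;
       a driver without it would keep the state for a software transform. */
    void (*apply)(void *driver_data, const TnlState *state, unsigned dirty);
    void *driver_data;
} TnlBackend;

static void tnl_set_modelview(TnlState *s, const float m[16]) {
    memcpy(s->modelview, m, sizeof s->modelview);
    s->dirty |= TNL_DIRTY_MODELVIEW;
}

static void tnl_set_projection(TnlState *s, const float m[16]) {
    memcpy(s->projection, m, sizeof s->projection);
    s->dirty |= TNL_DIRTY_PROJECTION;
}

static void tnl_set_diffuse(TnlState *s, float r, float g, float b, float a) {
    s->diffuse[0] = r; s->diffuse[1] = g; s->diffuse[2] = b; s->diffuse[3] = a;
    s->dirty |= TNL_DIRTY_MATERIAL;
}

static void tnl_set_clip_plane(TnlState *s, float a, float b, float c, float d) {
    s->clip_plane[0] = a; s->clip_plane[1] = b;
    s->clip_plane[2] = c; s->clip_plane[3] = d;
    s->dirty |= TNL_DIRTY_CLIPPLANE;
}

/* Hand only the changed state to the driver, then clear the dirty bits. */
static void tnl_flush(TnlState *s, const TnlBackend *b) {
    assert(b->apply != NULL);
    if (s->dirty != 0) {
        b->apply(b->driver_data, s, s->dirty);
        s->dirty = 0;
    }
}

/* Example backend for experimenting: records which groups were applied. */
static void record_apply(void *driver_data, const TnlState *state, unsigned dirty) {
    (void)state;
    *(unsigned *)driver_data |= dirty;
}
```

The dirty bits are a design choice of this sketch, not something the interface requires: they just let a driver skip state that has not changed between draws.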
However, as it's still a fairly low-level model, it would lend itself to drivers for chips like the Permedia too. In the Permedia, data registers and command operations are loaded through a FIFO that can be written to directly or through DMA from a buffer. The latter is what I'd like to use, but it's proven elusive for me to get working in the BVision (I'm sure it ought to be possible, even if it's from a location in VRAM rather than host-mapped memory, which is what the original intention was). In the current driver, the FIFO is loaded with data and then the driver has to wait for it to drain. I don't query the remaining FIFO space every time; instead I get a count and only read it again after writing as many entries as the last count value. That removes some overhead, but in the end, if you are rendering blended, perspective-textured, fogged, z-buffered polygons you will inevitably reach a point where the CPU is able to load the FIFO faster than the Permedia is emptying it, and you end up polling the count register until there's enough room for your next operation.
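The count-caching scheme described above can be sketched roughly as follows. This is not the actual driver code: the chip is replaced by a tiny simulation, and SimFifo, FifoWriter, drain_per_poll and the function names are all invented for the example rather than being Permedia2 register names. The simulated chip drains a fixed number of entries each time the count is polled, purely so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the chip's input FIFO (not real registers). */
typedef struct {
    uint32_t capacity;        /* total FIFO entries */
    uint32_t free_entries;    /* what the "count register" would report */
    uint32_t drain_per_poll;  /* simulation only: entries drained per poll */
    unsigned reg_reads;       /* how many times the count was polled */
    unsigned written;         /* how many entries were written in total */
} SimFifo;

/* Simulated read of the free-space count register. */
static uint32_t sim_read_free(SimFifo *f) {
    uint32_t level = f->free_entries + f->drain_per_poll;
    f->reg_reads++;
    f->free_entries = level > f->capacity ? f->capacity : level;
    return f->free_entries;
}

/* Simulated write of one entry; writing to a full FIFO is a protocol error. */
static void sim_write(SimFifo *f, uint32_t value) {
    (void)value;
    assert(f->free_entries > 0);
    f->free_entries--;
    f->written++;
}

/* Driver-side view: remember the last count instead of polling every time. */
typedef struct {
    SimFifo *hw;
    uint32_t cached_free;
} FifoWriter;

static void fifo_write(FifoWriter *w, uint32_t value) {
    /* Only touch the count register once the cached allowance is used up.
       If the chip reports no room, keep polling until it has drained. */
    while (w->cached_free == 0) {
        w->cached_free = sim_read_free(w->hw);
    }
    sim_write(w->hw, value);
    w->cached_free--;
}
```

With a 32-entry simulated FIFO that drains 8 entries per poll, writing 100 entries through fifo_write polls the count 10 times instead of 100. The while loop is where the CPU ends up spinning once it outruns the chip.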
We could use that time more productively if we were reading untransformed vertex data from memory, performing the transformation calculations in software and then writing the result to the FIFO because there'd be an improvement in the parallelism. It would take slightly longer per vertex to take a user-coordinate-space triangle fan and write it to the chip, but at the same time, you'd be doing those calculations at a point where you previously were just polling a register having already spent a similar amount of CPU time beforehand to do the transformation elsewhere.
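As a rough illustration of the ordering being suggested, the sketch below transforms each vertex immediately before its components are pushed out, so the chip can be draining earlier entries while the CPU does the arithmetic for the next vertex. The types, the FifoSink callback and capture_sink are assumptions made up for this example; a real driver would pass something that writes to the chip's FIFO as the sink.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;
typedef struct { float m[4][4]; } Mat4;  /* row-major; result = M * v */

/* Multiply a column vector by a 4x4 matrix. */
static Vec4 mat4_transform(const Mat4 *m, Vec4 v) {
    Vec4 r;
    r.x = m->m[0][0]*v.x + m->m[0][1]*v.y + m->m[0][2]*v.z + m->m[0][3]*v.w;
    r.y = m->m[1][0]*v.x + m->m[1][1]*v.y + m->m[1][2]*v.z + m->m[1][3]*v.w;
    r.z = m->m[2][0]*v.x + m->m[2][1]*v.y + m->m[2][2]*v.z + m->m[2][3]*v.w;
    r.w = m->m[3][0]*v.x + m->m[3][1]*v.y + m->m[3][2]*v.z + m->m[3][3]*v.w;
    return r;
}

/* Hypothetical sink: in a driver this would be the FIFO write routine. */
typedef void (*FifoSink)(void *ctx, float value);

/* Transform and submit one vertex at a time instead of transforming the
   whole fan up front and then streaming it. */
static void submit_fan(const Mat4 *mvp, const Vec4 *verts, size_t count,
                       FifoSink sink, void *ctx) {
    for (size_t i = 0; i < count; i++) {
        Vec4 t = mat4_transform(mvp, verts[i]);  /* CPU work for vertex i */
        sink(ctx, t.x);                          /* chip consumes while the */
        sink(ctx, t.y);                          /* next vertex is computed */
        sink(ctx, t.z);
        sink(ctx, t.w);
    }
}

/* Example sink for experimenting: stores values instead of writing hardware. */
typedef struct { float values[64]; size_t count; } Capture;

static void capture_sink(void *ctx, float value) {
    Capture *c = (Capture *)ctx;
    assert(c->count < 64);
    c->values[c->count++] = value;
}
```

The sketch shows only the interleaving. Whether it wins in practice depends on how the per-vertex transform cost compares with the time the chip needs to drain the FIFO.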
I guess P96/CGX waits until the GPU FIFO is ready, otherwise crashes may happen. I guess the difference between CGX and P96 in this respect is that the CGX driver will probably offer a "legal" way to add new pointers to command buffers, while with P96 you can't "interleave" commands from both systems.
As I said, both the 3D and 2D drivers use the same resource on R200/100. However, Warp3D takes ownership of the hardware, so at least while a hardware lock is in place, nobody else can be submitting packets to it. Outside of that situation, anybody can write to the resource. That's what the command processor / ring buffer was intended for.
And to continue the hijacking... any plans for a BlizzardPPC OS4 scsi driver?
Well, I started, but I haven't really made progress due to a lack of free time. I had to prioritise, and the Warp3D work affects more users overall. It's definitely something I'd like to complete if I get a chance.