Karlos, if I understand you correctly, this method of operation relegates the '060 to interfacing with the hardware, while the PowerPC waits for the memory registers to be open for use?
Operationally, it would be something like this. Imagine you have implemented a virtual DMA driver for the motherboard IDE. This is a small 68K application located in memory designated as 68K cacheable, along with some workspace for the working set and MMU tables. The rest of the entire address space is marked as uncacheable.
This means the only stuff the 68K will cache is access to the 68K code, local working set, stack and so on, allowing it to operate at full speed on local data.
The host PPC now wants to load 128K into some nice cache-aligned address in fast RAM. It writes the base address, number of bytes to transfer and any other relevant information to a "register file" belonging to this device. In practice, this register file area needs to be uncached on the PPC too.
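To make the "register file" idea a bit more concrete, here is a minimal sketch in C of what the PPC side might write. Everything in it is made up for illustration: the `vdma_regs_t` layout, the command and status codes, the `start_block` field and the 32-byte PPC line size are assumptions, not anything that exists on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPC_CACHE_LINE 32u  /* assumed PPC data cache line size */

/* Hypothetical command and status codes for the virtual DMA device. */
enum { VDMA_CMD_NONE = 0, VDMA_CMD_READ = 1 };
enum { VDMA_IDLE = 0, VDMA_BUSY = 1, VDMA_DONE = 2, VDMA_ERROR = 3 };

/* Hypothetical register file. On the real system this would live in
 * memory that is uncached on both CPUs; volatile only stops the
 * compiler from caching values in registers. */
typedef struct {
    volatile uint32_t dest_addr;    /* base address in fast RAM */
    volatile uint32_t byte_count;   /* number of bytes to transfer */
    volatile uint32_t start_block;  /* assumed "other relevant information" */
    volatile uint32_t command;      /* written last by the PPC */
    volatile uint32_t status;       /* updated by the 68K */
    volatile uint32_t bytes_done;   /* updated by the 68K */
} vdma_regs_t;

/* PPC side: post a read request. Returns 0 if the request was posted,
 * -1 if the device is busy or the parameters are rejected. */
static int vdma_post_read(vdma_regs_t *regs, uint32_t dest,
                          uint32_t nbytes, uint32_t start_block)
{
    assert(regs != NULL);
    if (regs->status == VDMA_BUSY)
        return -1;                          /* a request is already in flight */
    if (nbytes == 0 || dest % PPC_CACHE_LINE != 0)
        return -1;                          /* this sketch insists on alignment */

    regs->dest_addr   = dest;
    regs->byte_count  = nbytes;
    regs->start_block = start_block;
    regs->bytes_done  = 0;
    regs->status      = VDMA_BUSY;
    /* Real hardware would likely want a memory barrier here so the
     * parameters are visible before the command is. */
    regs->command     = VDMA_CMD_READ;
    return 0;
}
```

Writing the command word last means the 68K never sees a command with half-filled parameters, whichever way it ends up being notified.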
Having loaded the parameters, it then signals the 68K to process the request. How this would be done is still somewhat vague, but the 68K would likely have to be able to respond to interrupts issued by the PPC. The PPC task now goes to sleep, awaiting some indication the process has completed. At this point, the host OS will just find something else to do.
Concurrently, the 68K now starts retrieving data from the motherboard IDE into its local working store. Once it has a 68K cache line's worth, it can transfer that whole line to the fast RAM address and increment. Obviously non-aligned bits will have to be handled, but this side is not rocket science.
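The 68K side of that loop might look roughly like the sketch below. It is written as portable C so it can be run anywhere: the IDE port is faked with a byte buffer (`fake_ide_t`), fast RAM is just an array indexed by address, and the 16-byte line size is an assumption based on the 68040/68060 cache line. All the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define M68K_LINE 16u  /* assumed 68040/68060 cache line size in bytes */

/* Hypothetical stand-in for the motherboard IDE data port. */
typedef struct {
    const uint8_t *data;  /* bytes the "drive" will deliver */
    size_t pos;           /* next byte to deliver */
} fake_ide_t;

/* Pull n bytes from the fake IDE port into the local working store. */
static void ide_fetch(fake_ide_t *ide, uint8_t *out, size_t n)
{
    memcpy(out, ide->data + ide->pos, n);
    ide->pos += n;
}

/* Copy nbytes from the IDE port to ram[dest_addr...], moving whole
 * 16-byte lines wherever the destination is line aligned.
 * Returns the number of whole-line transfers performed. */
static size_t vdma_copy(fake_ide_t *ide, uint8_t *ram,
                        uint32_t dest_addr, size_t nbytes)
{
    uint8_t line[M68K_LINE];  /* local working store (68K cacheable) */
    size_t done = 0, lines = 0;

    assert(ide != NULL && ram != NULL);

    /* Head: bytes needed to reach the next line boundary. */
    size_t head = (M68K_LINE - dest_addr % M68K_LINE) % M68K_LINE;
    if (head > nbytes)
        head = nbytes;
    if (head > 0) {
        ide_fetch(ide, line, head);
        memcpy(ram + dest_addr, line, head);
        done = head;
    }

    /* Body: one full line at a time. On a real '040/'060 this store
     * could plausibly be a single MOVE16. */
    while (nbytes - done >= M68K_LINE) {
        ide_fetch(ide, line, M68K_LINE);
        memcpy(ram + dest_addr + done, line, M68K_LINE);
        done += M68K_LINE;
        lines++;
    }

    /* Tail: whatever is left over, shorter than a line. */
    if (done < nbytes) {
        size_t tail = nbytes - done;
        ide_fetch(ide, line, tail);
        memcpy(ram + dest_addr + done, line, tail);
    }
    return lines;
}
```

For example, a 40-byte transfer to address 3 splits into a 13-byte head, one whole line and an 11-byte tail.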
When the 68K has finished loading the 128K, or some error occurs, it updates some output values accordingly in its "register file" and signals the PPC to wake up the calling task. Said task ensures the PPC cache isn't now out of date, checks the return status in the register file and takes whatever action is now necessary, be it returning success or invoking some error strategy.
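As a rough sketch of that wake-up path on the PPC side, again in C and again with invented names: `vdma_result_regs_t`, the status codes and `ppc_invalidate_range` are all hypothetical, and the invalidate routine here only counts the lines it would touch. A real one would have to use whatever cache control the host OS or the CPU provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPC_LINE 32u  /* assumed PPC data cache line size */

/* Hypothetical status codes written by the 68K. */
enum { VDMA_BUSY = 1, VDMA_DONE = 2, VDMA_ERROR = 3 };

/* Hypothetical output part of the register file. */
typedef struct {
    volatile uint32_t dest_addr;   /* where the data was written */
    volatile uint32_t byte_count;  /* bytes originally requested */
    volatile uint32_t status;      /* VDMA_DONE or VDMA_ERROR when finished */
    volatile uint32_t bytes_done;  /* bytes actually transferred */
} vdma_result_regs_t;

/* Number of PPC cache lines overlapping [addr, addr + nbytes). */
static uint32_t ppc_lines_spanned(uint32_t addr, uint32_t nbytes)
{
    if (nbytes == 0)
        return 0;
    uint32_t first = addr / PPC_LINE;
    uint32_t last  = (addr + nbytes - 1u) / PPC_LINE;
    return last - first + 1u;
}

/* Stand-in for a real cache invalidate: just records how many lines
 * would have been invalidated. */
static uint32_t invalidated_lines;
static void ppc_invalidate_range(uint32_t addr, uint32_t nbytes)
{
    invalidated_lines += ppc_lines_spanned(addr, nbytes);
}

/* Called by the woken PPC task. Returns 0 on success, -1 on error,
 * 1 if the transfer is still running (spurious wake-up). */
static int vdma_finish(const vdma_result_regs_t *regs, uint32_t *bytes_out)
{
    assert(regs != NULL && bytes_out != NULL);
    if (regs->status == VDMA_BUSY)
        return 1;

    /* Drop any stale copies of the destination before reading it. */
    ppc_invalidate_range(regs->dest_addr, regs->bytes_done);
    *bytes_out = regs->bytes_done;
    return regs->status == VDMA_DONE ? 0 : -1;
}
```

Invalidating only the range actually written keeps the cost proportional to the transfer; for the 128K example that is 4096 lines at an assumed 32 bytes each.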
If so, this is a lot like what the Sega Saturn does when you use both CPUs. In practice, the good development houses would have to utilize the caches of the CPU that is not performing I/O, but it did work well in games such as VF2 and Fighting Vipers, where one CPU handled the player character and the other did the AI of the enemy. Parallel processing on older machines with only one memory bus is certainly a complex venture, it seems.
I'm not really clear how the Saturn works, but yes, my idea is that the 68K runs as an uncached I/O processor, with the sole exception that it can cache some private working area that the PPC never touches.