There seems to be a recurring misconception about the PowerUP hardware and the older PPC kernels that ran on it.
First off, the hardware. Both processors (all three, if you include the SCSI script processor on the 603e+ boards) most assuredly run concurrently. However, they do share a single bus which means that only one of them can perform IO on it at any given instant.
However, both processors also had instruction and data caches large enough, run in copyback mode (meaning that reads and writes were generally full cache lines), that either one being forced to wait for the other's in-progress IO was not a particularly limiting factor.
Next, the software. A consequence of the hardware design is that each processor ends up with its own, potentially out-of-date view of what is in RAM, based on its own cache. I may be wrong, but I don't think that bus snooping worked so well (or perhaps at all) with the design, so whichever PPC kernel you used, caches had to be kept in sync by flushing them whenever one processor called the other. The kernel generally managed this cache coherency problem, allowing developers either to write 68K apps that treat the PPC as a co-processor for which certain expensive functions are compiled, or to write mostly PPC-native apps that call back to the 68K for OS stuff, IO or event handling.
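Just to make the co-processor usage model concrete, here's a rough sketch. The CallPPC() gate below is a made-up stand-in for whatever ppc.library (PowerUP) or powerpc.library (WarpOS) actually provide (the real entry points differ between the two kernels); it's only the shape of the thing that matters.

```c
#include <stdio.h>

/* Stand-in for the kernel's call gate: imagine this crossing to the PPC,
 * with the kernel flushing caches on the way over and back. */
#define CallPPC(fn, ...) ((fn)(__VA_ARGS__))

/* PPC side: compiled with a PPC compiler into its own object. */
long SumOfSquares(const long *data, long count)
{
    long i, sum = 0;
    for (i = 0; i < count; i++)
        sum += data[i] * data[i];   /* the expensive work lives here */
    return sum;
}

/* 68K side: the ordinary application. */
void ProcessBuffer(long *buffer, long count)
{
    long total = CallPPC(SumOfSquares, buffer, count);  /* one crossing */
    printf("total = %ld\n", total);                     /* OS and IO stay on the 68K */
}
```

The same sketch works the other way around for a mostly PPC-native application that calls back to the 68K for OS work.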
I can't recall the details for PowerUP, but under WarpOS each application had at least two tasks: one that ran on the 68K and one that ran on the PPC. Together, these "mirror tasks" formed a single "virtual" thread of execution. Signals and such were routed to both tasks, but at any instant only the PPC task or the 68K task would be running, with its counterpart asleep, waiting for the flow of execution to come back to it. There was an exception to this rule, an asynchronous calling method that was rarely used as it required the application software to ensure cache coherency itself.
I think it's this notion of a given application running on only one CPU at any given instant that causes the misconception of a single-processor model. However, Exec, WarpOS and PowerUP are all pre-emptive multitasking kernels. So, when your WarpOS task goes to sleep waiting for a slow 68K call to return, in principle WarpOS is free to schedule any other ready-to-run PPC task in its place, and vice versa for Exec on the 68K.
In theory, this means that if you had two WarpOS applications, one could be executing OS calls on the 68K whilst the other is running PPC-native code on the PPC. In practice, however, the context-switching step whenever an application jumped CPU starved both processors of time until it was complete, and more often than not the actual time spent in a 68K function call (or vice versa) was dwarfed by the time spent just doing the context-switching work. Consequently, whether you coded for PowerUP or WarpOS, the golden optimization rule was to minimize context switches: if you were going to need several OS calls, refactor your code so that they could all be done from a single 68K call that you invoke from the PPC (or vice versa). Sadly, it was not uncommon to see applications that didn't do this.
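As a sketch of what that refactoring looks like: DoOn68K() below is another hypothetical call gate for whichever 68K-calling mechanism your kernel offers, struct Config is made up for the example, and Open()/Read()/Close() are just the ordinary dos.library calls.

```c
#include <dos/dos.h>
#include <proto/dos.h>

/* Stand-in for the 68K call gate: imagine each use costing a full
 * mirror-task context switch there and back. */
#define DoOn68K(fn, ...) ((fn)(__VA_ARGS__))

struct Config { long width, height, depth; };   /* hypothetical */

/* Naive: three separate crossings. */
void LoadConfigSlow(struct Config *cfg)
{
    BPTR fh = DoOn68K(Open, "PROGDIR:config", MODE_OLDFILE);
    if (fh) {
        DoOn68K(Read, fh, cfg, sizeof(*cfg));
        DoOn68K(Close, fh);
    }
}

/* Better: fold the whole sequence into one 68K-side function... */
void LoadConfig68K(struct Config *cfg)      /* runs on the 68K */
{
    BPTR fh = Open("PROGDIR:config", MODE_OLDFILE);
    if (fh) {
        Read(fh, cfg, sizeof(*cfg));
        Close(fh);
    }
}

/* ...and cross over exactly once. */
void LoadConfigFast(struct Config *cfg)     /* runs on the PPC */
{
    DoOn68K(LoadConfig68K, cfg);
}
```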
This brings us back to the original question. Well, the answer is that emulating 68K code on the PPC is faster because it doesn't have this cache-flushing limitation. Even on a 603e, emulated 68K code is generally faster than it would be on the real 68K, certainly if your 68K is an 040 like mine. Instantaneous JIT performance is, of course, variable and highly context-sensitive, so while there may be some 68K code out there that would prove pathological under emulation and thus run faster on the real 68K, any benefit would be instantly lost under the overhead of having to implement a WarpOS/PowerUP-style cache coherency strategy for both processors.
I did have a half-baked idea of my own that I might try if I ever get around to experimenting with it, though it is probably doomed to the waste basket of silly ideas already. Essentially, the thought was to allocate a lump of memory, install some 68K code in it and see if I can get it running. However, I have no intention of using the 68K for running existing 68K applications. Instead, I envisage it as a sort of general-purpose programmable DMA controller for classic hardware. It would use an MMU setup that marks all memory uncached except for the space allocated for the code it executes and some private data workspace. As all other regions are uncached, coherency is only a problem from the host OS side, and that's what CachePreDMA()/CachePostDMA() are there to manage.

Having complete access to the hardware and memory space in the system might make it pretty useful for data transfer tasks. You might have a 68K-uncacheable page of memory somewhere that represents a memory-mapped "register file" for your virtual DMA device. This would ideally need to be uncached from the PPC side too, so a page of Chip RAM might even be an idea. It doesn't matter that it's slow, because you are only putting parameter data in it (e.g. address of a memory region, size of the region) and then, via an interrupt or other mechanism, getting the 68K to do the transfer (which, if done properly, could use MOVE16 for block transfers to/from uncached memory).
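For the sake of argument, that "register file" page might look something like the sketch below. Everything here is hypothetical (the struct layout, the command codes, the Signal68K() doorbell); the point is how little actually has to live in the slow uncached page.

```c
#include <exec/types.h>

/* Hypothetical register file for the virtual DMA device. The page
 * holding this struct would be uncached from both CPUs (e.g. Chip RAM),
 * so neither side needs explicit cache maintenance to see the other's
 * writes; CachePreDMA()/CachePostDMA() only matter for the buffers the
 * transfer itself touches. */
struct VDMARegs {
    volatile ULONG cmd;        /* 0 = idle, 1 = copy, ...              */
    volatile ULONG status;     /* set by the 68K when the job is done  */
    volatile APTR  src;        /* source address                       */
    volatile APTR  dst;        /* destination address                  */
    volatile ULONG length;     /* bytes to move                        */
};

void Signal68K(void);          /* hypothetical doorbell: an interrupt
                                  or similar mechanism to wake the 68K */

/* Host/PPC side: post a copy job and kick the 68K. */
void StartCopy(struct VDMARegs *regs, APTR dst, APTR src, ULONG len)
{
    regs->src    = src;
    regs->dst    = dst;
    regs->length = len;
    regs->status = 0;
    regs->cmd    = 1;          /* written last, so the 68K never sees a
                                  half-filled job                       */
    Signal68K();
}
```

The 68K side would just sit on the doorbell, read the parameters out of the page, do the MOVE16 loop (or whatever the job calls for) and then set status to say it's finished.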
It's all very pie in the sky, but when the idea first occurred to me it seemed theoretically possible that you might be able to produce virtual DMA controllers for things like the onboard IDE, parallel port, PCMCIA or whatever, and then drivers that utilise them.
There are probably many very good reasons why this is impossible, though.