Unfortunately, that was not the case. The various Mac benchmarking programs showed only minor improvements in certain benchmarks with the 060. SuperScalar always had to be off, only a limited amount of branch caching was allowed in certain portions of the OS code, and the instruction and data caches were toggled off and on without anyone realizing it. Surprisingly, memory functions were quite a bit slower with the 060. We could compare 040 speed vs. 060 speed using the same Phase 5 setup, just swapping the CPU card, so the memory was the same.
The 68060's speed would drop to less than half with superscalar off. The CPU would then be scalar but still carry all the limitations of the superscalar design. Turning branch caching off makes branching performance about the same as the 040. The 040 does outperform the 060 when working in memory and has a larger cache fetch. This allows larger instructions, but the 68060 handles mixed instructions well, is faster with more complex addressing modes, and is faster at shifting and multiplying. The 040 also has the 64-bit integer instructions and optimizations for bit field instructions in registers that would help it. The 060 is a clear winner with the FPU and has a clock speed advantage. It probably comes down to the code for the Mac. Code that is optimized for an 040 is not going to be optimal on an 060. I don't know if optimizing code for an 060 with superscalar disabled and many of the caches turned off would even be possible. It would be quite handicapped but still faster than a 68030 (the 68060 resembles in some ways a superscalar 030).
Keep in mind that the FPU was the Mac's biggest asset for the OS. This is why you didn't see many LC (or any EC) CPUs going into Macs. The MMU was needed of course for virtual memory. The FPU was used by EVERYTHING in the OS! The position of where to draw a pixel on the display was calculated by the FPU, not the CPU, because it was faster to do it this way. When Joe and I rewrote Apple's PACK4 and PACK5 in full assembly (like everything else we did), we actually broke most current benchmark programs in the FPU tests and we made the Mac insanely fast - to the point where production studios like Amblin Entertainment were using Amigas with my Mac emulation to run Avid video editing suites because that setup would run circles around real Macs... and they could use Lightwave for rendering too.
The 040 FPU runs in parallel with the integer units but is still quite slow compared to integer. It's not very easy to go back and forth between the FPU and integer either, given the lack of FINT/FINTRZ and no fp <-> unsigned integer conversion. I'm kind of surprised you were using the FPU for the display. Are you sure it wasn't the MMU? There are drivers for Fusion/Shapeshifter that are faster with the MMU and the way the Mac renders the screen.
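To illustrate the fp <-> unsigned integer point, here is a rough C sketch of the usual workaround when only a signed 32-bit conversion is available. The function name double_to_u32_via_signed is made up for this example, and it assumes the input is already in the range 0 to just under 2^32. It is not taken from any emulator or OS source, just a sketch of the general idea.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a double to an unsigned 32-bit value
   using only a signed 32-bit conversion.
   Assumption: 0 <= x < 2^32 (checked by the assert below). */
static uint32_t double_to_u32_via_signed(double x)
{
    const double two31 = 2147483648.0; /* 2^31 */

    assert(x >= 0.0 && x < 4294967296.0);

    if (x < two31) {
        /* Value fits in the signed range, so one signed conversion is enough.
           The C cast truncates toward zero. */
        return (uint32_t)(int32_t)x;
    }

    /* Value needs the top bit: bias it down into the signed range,
       convert, then add 2^31 back on the integer side. */
    return (uint32_t)(int32_t)(x - two31) + 0x80000000u;
}
```

For example, double_to_u32_via_signed(3000000000.0) gives 3000000000 and double_to_u32_via_signed(12345.75) gives 12345. The extra compare, subtract and add are the kind of overhead that makes bouncing between floating point and unsigned integer awkward.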
I think the only possible future for new accelerators for Classics is FPGA. It's unlikely that anybody will implement an 040/060 as an FPGA but it looks like we'll soon have a fully compatible 020/030 option. If this design could be clocked much higher than a real 020 and have faster memory access then it might actually be able to perform almost as fast as a real 060 when running Classic 68k software. It might even be possible to add some 040/060 instructions to the 020 core in the future to enable it to run some 060 software.
Remember the crazy FPGA hardware I was talking about the other day with a 150MHz enhanced 68k in FPGA soon and a possible 500MHz+ 68k CPU in FPGA in about a year? It is possible but I don't want any "announcements" or vaporware claims. There may be some interesting reading on http://www.amigacoding.de/ if you haven't been over there recently. I have seen an early schematic of the I/O expansion board.