This is a good idea in theory but there are problems. A new SIMD/vector unit in an FPGA may only support integer operations because of the cost of even single precision floating point.
Why is there a difference between eight additional registers in a single FPU and eight additional registers in a dedicated FPU? I don't see one. To save the first set, use FMOVEM with the co-processor ID of the 68881. To save the second set, use the coprocessor ID of the second FPU. Even better, programs only using the first FPU would only store the FPU state of the first FPU on the exec stack frame, and hence the stack frame and the debugging tools would remain untouched.
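To make the coprocessor ID point concrete: on the 68k, the first word of every F-line coprocessor instruction carries a three-bit coprocessor ID in bits 11-9, and the 68881 conventionally sits at ID 1. A minimal sketch of how that word is formed follows. The helper name fline_opword is made up for illustration, and putting the second unit at ID 2 is purely an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Build the first word of a general-type F-line coprocessor instruction:
 *   bits 15-12 = 1111 (F-line), bits 11-9 = coprocessor ID,
 *   bits 8-6 = 000 (general type), bits 5-0 = effective address field.
 * fline_opword is a hypothetical helper, not part of any real toolchain. */
static uint16_t fline_opword(unsigned cpid, unsigned ea)
{
    assert(cpid < 8u);   /* the ID field is three bits wide */
    assert(ea < 64u);    /* the EA field is six bits wide */
    return (uint16_t)(0xF000u | (cpid << 9) | ea);
}

/* EA field for -(A7): mode 100, register 111. */
enum { EA_PREDEC_A7 = 0x27 };

/* The 68881 conventionally uses ID 1. ID 2 for a second unit is an
 * assumption made only for this sketch. */
enum { CPID_FPU = 1, CPID_SECOND_FPU = 2 };
```

With these definitions, fline_opword(CPID_FPU, EA_PREDEC_A7) gives the familiar first word of an FMOVEM to -(A7), and the same call with CPID_SECOND_FPU differs only in the ID bits, so the second register set never touches the encoding that existing programs use.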
1) 8 register FPU with integer only SIMD = slow fp performance
"FPU" and "integer" is a contradiction in terms. (-:
2) 8 register FPU and wait for a SIMD with single precision fp = slow fp performance now
Why do I need to "wait" for something? That is rather a question of how I organize my vector unit. For all the decoding work, two register banks, each representing a vector of eight single precision registers, would be rather ideal. But why stop there? You could also organize this as a 4x4 unit or a 2x8 unit. That's a question of organization, not of "slowness" or "fastness".
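As a rough illustration of why this is only a matter of organization: the same sixteen single precision words can be addressed as two vectors of eight or as four vectors of four. The type VecFile and the helper vec_elem below are hypothetical names for this sketch, and no real hardware layout is implied.

```c
#include <assert.h>
#include <stddef.h>

enum { VEC_WORDS = 16 };  /* sixteen single precision words in total */

/* Hypothetical flat register file for the vector unit. */
typedef struct {
    float w[VEC_WORDS];
} VecFile;

/* Address one element when the file is viewed as (VEC_WORDS / lanes)
 * vector registers of `lanes` elements each, e.g. lanes = 8 for a 2x8
 * view or lanes = 4 for a 4x4 view. */
static float *vec_elem(VecFile *vf, size_t lanes, size_t reg, size_t lane)
{
    assert(lanes != 0 && VEC_WORDS % lanes == 0);
    assert(reg < VEC_WORDS / lanes && lane < lanes);
    return &vf->w[reg * lanes + lane];
}
```

Element 0 of register 1 in the 2x8 view and element 0 of register 2 in the 4x4 view are the same storage word; only the decoding of the register number changes.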
The 68k FPU would be much more efficient with more scratch registers (saving and restoring extended precision FPU registers is very expensive).
I would not even work with extended precision in the vector unit. What for? The applications I would have in mind that would allow speedup by the FPU - multimedia decoding, namely - only require single precision. These are "killer applications" where one can really profit from new features and where one can easily implement a dispatcher in a corresponding datatype.
Keeping the FPU extended precision makes more sense with 16 registers because of the register argument passing and reduced register saves and restores.
Look, 16 registers alone do not buy you much for the applications I have in mind. Besides, they create potential incompatibilities because you need to save and restore the extra registers.
Which do you think would cause the least incompatibility?
Have a dedicated unit that is only enabled for programs that use it. It is easy to add this as a flag on the exec stack frame that indicates whether the second FPU is busy or not, and programs that do not use it would use the same old stack frame.
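A minimal sketch of that idea, simulated in plain C rather than written as real exec code. All names here (Machine, TaskFrame, vector_busy, save_frame, restore_frame) are made up for illustration, and the sizes assume eight scalar registers stored as 12 bytes each (the FMOVEM memory format for extended precision) plus two banks of eight single precision values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FP_REGS = 8, FP_REG_BYTES = 12, VEC_BANKS = 2, VEC_LANES = 8 };

/* Scalar FPU registers in their 96-bit memory image. */
typedef struct { uint8_t fp[FP_REGS][FP_REG_BYTES]; } ScalarRegs;
/* Hypothetical vector unit: two banks of eight singles. */
typedef struct { float v[VEC_BANKS][VEC_LANES]; } VectorRegs;

/* Simulated hardware state. */
typedef struct { ScalarRegs scalar; VectorRegs vector; } Machine;

/* Simulated per-task frame. vector_busy is the flag from the text:
 * it is set only for tasks that have enabled the second unit. */
typedef struct {
    bool       vector_busy;
    ScalarRegs scalar;
    VectorRegs vector;
} TaskFrame;

/* Save the scalar state always, the vector state only if the task
 * uses it. Returns the number of register bytes copied. */
static size_t save_frame(const Machine *m, TaskFrame *f)
{
    assert(m != NULL && f != NULL);
    size_t saved = sizeof f->scalar;
    memcpy(&f->scalar, &m->scalar, sizeof f->scalar);
    if (f->vector_busy) {
        memcpy(&f->vector, &m->vector, sizeof f->vector);
        saved += sizeof f->vector;
    }
    return saved;
}

/* Mirror image of save_frame for switching a task back in. */
static size_t restore_frame(Machine *m, const TaskFrame *f)
{
    assert(m != NULL && f != NULL);
    size_t restored = sizeof f->scalar;
    memcpy(&m->scalar, &f->scalar, sizeof f->scalar);
    if (f->vector_busy) {
        memcpy(&m->vector, &f->vector, sizeof f->vector);
        restored += sizeof f->vector;
    }
    return restored;
}
```

A task that never sets vector_busy pays exactly the same save and restore cost as today, and its frame contents are the same as before. Only tasks that opt in carry the extra vector state.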
Thus: a scalar FPU with 80 bits of precision, as today, plus a vector unit that is single precision only, enabled only when required, and whose context is saved and restored only when required. Vectors of size eight or four, probably two sets of them.