I'd actually like an architecture that only has SIMD registers, but still allows "normal" operations on them.
To enable this, each instruction could support a "mask" where a 16-bit word describes which sub elements of the register pair in an operation are to be affected.
Consider this pseudo-example (assuming 128-bit wide registers):
add.b r0,r1
This would add the "bottom" byte in r0 to the equivalent byte in r1, leaving the rest of r1 unaffected.
Whereas
add.b r0,r1,#FFFF
would perform the same operation on every byte and
add.l r0,r1,#A
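would perform a 32-bit add on two of the four 32-bit words (#A is binary 1010, so the highest word and the second-lowest one if bit 0 selects the bottom word), leaving the rest unaffected.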
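To make the masking idea concrete, here is a rough Python sketch of how such a masked add could behave. It is only a model, not a real instruction set: the name masked_add, the default mask of 1, and the rule that mask bit i selects lane i (lane 0 being the bottom one) are all my assumptions, and I treat the second register as the destination as in the examples above.

```python
REG_BITS = 128  # register width assumed in the post


def masked_add(src, dst, lane_bits, mask=0x1):
    """Add src to dst lane by lane, only in lanes whose mask bit is set.

    Assumptions (not part of any real ISA): mask bit i selects lane i,
    lane 0 is the lowest-order lane, and an omitted mask means lane 0 only.
    Each lane wraps around on overflow without carrying into its neighbour.
    """
    lanes = REG_BITS // lane_bits
    lane_mask = (1 << lane_bits) - 1
    result = dst
    for i in range(lanes):
        if (mask >> i) & 1:
            shift = i * lane_bits
            a = (src >> shift) & lane_mask
            b = (dst >> shift) & lane_mask
            total = (a + b) & lane_mask  # wrap within the lane
            # Clear the lane in the destination, then write the sum back.
            result = (result & ~(lane_mask << shift)) | (total << shift)
    return result
```

With src = 0x0101 and dst = 0x12FF, masked_add(src, dst, 8) models add.b r0,r1 and gives 0x1200 (only the bottom byte changes), while masked_add(src, dst, 8, 0xFFFF) models add.b r0,r1,#FFFF and gives 0x1300.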
One obvious complication would be the status flags. You'd probably want up to 16 of them in this instance, one per byte lane. However, in every VM I've written for fun, I never bothered implementing condition code flags, relying instead on instructions that compare two operands and branch.
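If you did want the flags, one way to picture "up to 16 of them" is a 16-bit flag word with one carry bit per lane. The sketch below is purely illustrative: lane_carry_flags is a made-up name, and it again assumes mask bit i and flag bit i both refer to lane i, counted from the bottom.

```python
REG_BITS = 128  # register width assumed in the post


def lane_carry_flags(src, dst, lane_bits, mask):
    """Return a flag word with bit i set if the add in lane i carries out.

    Assumption: lanes not selected by the mask report no carry, and
    flag bit i corresponds to lane i (lane 0 is the lowest-order lane).
    """
    lanes = REG_BITS // lane_bits
    lane_mask = (1 << lane_bits) - 1
    flags = 0
    for i in range(lanes):
        if (mask >> i) & 1:
            shift = i * lane_bits
            a = (src >> shift) & lane_mask
            b = (dst >> shift) & lane_mask
            if a + b > lane_mask:  # sum does not fit in the lane
                flags |= 1 << i
    return flags
```

For byte operations this yields up to 16 flag bits; for 32-bit operations only the low 4 bits would ever be used.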