And because they are clocked twice as fast they can still perform 32-bit operations in one clock cycle...
Actually, it takes 3 cycles according to the info I read. In the first "fast" cycle, the lower 16 bits are added and are ready for forwarding. In the next cycle the upper 16 bits are added, and finally the condition codes are set in a third cycle.
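To make the three steps concrete, here is a rough software model of a staggered 32-bit add. It is only a sketch of the description above, not a model of the real hardware; the names `add_result` and `staggered_add32` are made up for illustration, and only the carry and zero flags are modelled.

```c
#include <stdint.h>

/* Hypothetical result type: the 32-bit sum plus two of the condition codes. */
typedef struct {
    uint32_t sum;
    int carry;   /* carry out of bit 31 */
    int zero;    /* 1 if the 32-bit sum is zero */
} add_result;

/* Software model of a 32-bit add done as two 16-bit halves plus a flag step. */
add_result staggered_add32(uint32_t a, uint32_t b)
{
    add_result r;

    /* Step 1: add the lower 16 bits. This half of the result is
       available first, so a dependent operation could start on it. */
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);
    uint32_t carry_lo = lo >> 16;            /* carry into the upper half */

    /* Step 2: add the upper 16 bits together with the carry from step 1. */
    uint32_t hi = (a >> 16) + (b >> 16) + carry_lo;

    /* Step 3: combine the halves and derive the condition codes. */
    r.sum   = ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
    r.carry = (int)(hi >> 16);
    r.zero  = (r.sum == 0);
    return r;
}
```

The point is that the low half never depends on the high half, so it can be forwarded early, while the flags need the whole result.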
It's pretty stupid really; they could have had a 32-bit ALU and run it at the same speed, which would have performed better.
Interestingly, a lot of the arithmetic operations generated by the compiler use the effective address calculation hardware to evaluate them, bypassing the ALU altogether.
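As an illustration of the kind of arithmetic that fits the address-calculation form (base + index*scale + displacement), here is a small C sketch. The function name `scale_and_offset` is made up, and the instruction shown in the comment is only what x86 compilers commonly emit with optimization on; the actual output depends on the compiler and flags.

```c
#include <stdint.h>

/* x*5 + 3 can be rewritten as x + x*4 + 3, which has the same shape as
   an x86 effective address: base + index*scale + displacement. */
uint32_t scale_and_offset(uint32_t x)
{
    /* An optimizing x86-64 compiler will often turn this into a single
       instruction along the lines of: lea eax, [rdi + rdi*4 + 3]
       (assumption: typical output, not guaranteed). */
    return x * 5u + 3u;
}
```

Looking at the compiler's assembly output (for example with `-S`) is the easy way to check what your own compiler actually does with it.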
