040-optimised code is only faster on a real 040, because it relies on tricks that the 040 hardware handles better than other CPUs. It's not faster under emulation, trust me. I notice no difference between 020-, 040- and 060-optimised code on the fastest 68k emulation currently available, MorphOS Trance.
In fact, UAE's 040 emulation can be a wee bit buggy too, which is another reason to stick to 020 code.