The programmer's reference manual doesn't say so. It says "the address is incremented by the operand length (2 or 4)".
OK. So it's byte operations that change A7 by 2, but every other address register by 1.
"Indirect addressing with postincrement
Assembler syntax:
Same as indirect addressing, but An will be increased by the size of the operation after the instruction is executed. The only exception is byte operations on A7 - this register must point to an even address, so it will always increment by at least 2. Example:"
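To put that rule in code (this is my own sketch in C, not the example from the quoted page; `postinc_step` and `postinc` are made-up names, and the register file is assumed to be a plain array of eight 32-bit values):

```c
#include <stdint.h>

/* Hypothetical helper: how far (An)+ advances the register.
   reg is the address register number (0..7), size is the operand
   size in bytes (1, 2 or 4). */
static uint32_t postinc_step(unsigned reg, unsigned size)
{
    /* Byte access through A7: the stack pointer has to stay even,
       so it moves by 2 instead of 1. */
    if (reg == 7 && size == 1)
        return 2;
    return size;
}

/* Returns the effective address used by the access, then bumps An. */
static uint32_t postinc(uint32_t a[8], unsigned reg, unsigned size)
{
    uint32_t ea = a[reg];
    a[reg] += postinc_step(reg, size);
    return ea;
}
```

So a byte read through (A0)+ moves A0 by 1, while the same read through (A7)+ moves A7 by 2.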
How do you know whether you need to swap or not? For example, a memory copy function that uses 32-bit transfers but may be copying strings that may not be byte swapped?
My only thought would be to access bytes, words and dwords at different address ranges and disable all the caches. You could implement a cache inside the FPGA, although it would probably still be slower than the on-chip cache because it would be limited to the FSB speed. You'd have to try it to find out which came out better or worse.
Most of the time the 68k code would be doing aligned accesses, so the branch predictor should pick that up. A test for whether the access is aligned, followed by the xor, is therefore unlikely to have much impact at all.
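Roughly what I mean by the alignment test plus xor, as a C sketch. The assumptions are mine: 68k memory is kept as an array of host `uint32_t`, each element holding one big-endian longword as a native value, and the names `read8`, `read16` and `read32` are made up. I've written it with shifts so it doesn't depend on host byte order; on a little-endian host the same thing falls out of xoring the byte address with 3 (or a word address with 2).

```c
#include <stdint.h>

/* Assumption: mem[i] holds the 68k longword at address 4*i as a
   native host value, so aligned 32-bit accesses need no swap. */

static uint8_t read8(const uint32_t *mem, uint32_t addr)
{
    /* 68k byte 0 of a longword is its most significant byte. */
    unsigned shift = ((addr & 3u) ^ 3u) * 8u;
    return (uint8_t)(mem[addr >> 2] >> shift);
}

static uint16_t read16(const uint32_t *mem, uint32_t addr)
{
    /* addr is assumed even, as the 68000 requires for word access. */
    unsigned shift = ((addr & 2u) ^ 2u) * 8u;
    return (uint16_t)(mem[addr >> 2] >> shift);
}

static uint32_t read32(const uint32_t *mem, uint32_t addr)
{
    /* Common case: longword aligned, one native load. */
    if ((addr & 3u) == 0)
        return mem[addr >> 2];
    /* Word aligned only: stitch the value together from two words. */
    return ((uint32_t)read16(mem, addr) << 16) | read16(mem, addr + 2);
}
```

With a layout like this, a 32-bit copy just moves whole array elements, so a string copied that way still reads back byte for byte through `read8` without the copy routine having to know what it is copying.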
I know Mike Coates spent a long time optimising his 68k core back in the day; it was used in MAME in the late 1990s and early 2000s. Obviously some of the optimisations are likely to be deoptimisations on recent CPUs, but the clock speed outweighs the effort required. Even Musashi (the C core that MAME now uses) would probably be more than sufficient.
Or if someone does an ARM board instead, then there is always http://notaz.gp2x.de/cyclone.php
My thought on using x86 (or even better x64) is that using a VM on the card could allow a bridge board style PC emulator to run at the same time as the 68k. Crazy idea I guess, but heh, these ideas are all crazy.