@Karlos
Indeed, this is exactly what I did as well: group as many OS calls as possible into one chunk of m68k asm code, then call it with an argument structure.
Imagine, for example, calling some 68k routine 256 times in a loop with a full cache flush in between each call, versus calling all 256 68k routines in one go and then doing a *single* cache flush afterwards. Needless to say, the speedup was massive.
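
To make the difference concrete, here is a rough C sketch of the two call patterns. It is not the actual code: `full_cache_flush`, `emulated_68k_routine` and `CallArgs` are made-up stand-ins, and the flush is just a counter so you can see how often it would be hit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: counters replace the real (expensive) cache
 * flush and the real 68k routine, so the call pattern can be observed. */
unsigned flush_count = 0;
unsigned call_count = 0;

/* Assumed argument structure passed to the 68k side. */
typedef struct {
    int input;
    int result;
} CallArgs;

/* Stand-in for a full cache flush. */
void full_cache_flush(void)
{
    flush_count++;
}

/* Stand-in for one 68k routine; it just doubles its input. */
void emulated_68k_routine(CallArgs *a)
{
    call_count++;
    a->result = a->input * 2;
}

/* Naive pattern: every call is followed by its own full flush. */
void run_one_by_one(CallArgs *args, size_t n)
{
    assert(args != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        emulated_68k_routine(&args[i]);
        full_cache_flush();
    }
}

/* Batched pattern: run all calls in one go, then flush once. */
void run_batched(CallArgs *args, size_t n)
{
    assert(args != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        emulated_68k_routine(&args[i]);
    }
    full_cache_flush();
}
```

With 256 entries, `run_one_by_one` triggers the flush 256 times while `run_batched` triggers it once, and both do the same amount of actual work.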