The fastest jumptable I ever managed was one implemented in assembler. The handling code for each opcode had to align to and fit into 64 bytes (with branches out and back if needed for complex ops), including the code required to fetch the next opcode, which was inlined at the end of each handler. In that code, instead of actually looking up the address in a table, you just took the opcode value, left-shifted it by 6 and jumped straight to the handling code via jmp (aN, dN), where aN held the address of the first handler. So in essence, you didn't really have a loop, you just had a branching frenzy, but in a relatively small area of code space. By moving to 128-byte alignment, there was no need to "branch out" for any of the handlers, but overall performance suffered because less of the handling code would fit into the cache on the 040. 64 was the compromise: about 90% of all the handlers fit completely within that constraint.
I couldn't really see any obvious way to make an interpreter any quicker.
That's pretty much the fastest pure interpreter method there is. You can get fairly close to this with certain versions of gcc using computed goto. You'll probably end up with an extra layer of indirection in the final code compared to the assembly approach you mention, but it's still pretty fast. Rather than an array of function pointers you have an array of label addresses. Unfortunately, not only is this dependent on a non-standard gcc extension, it doesn't even produce good code in reasonably recent versions of gcc when optimizations are turned on.
The next best thing is to stick a switch statement with a bunch of gotos in a macro and then call that macro wherever you would have used goto. You end up with an unnecessary range check, but otherwise the code generated is pretty much the same. It seems more resistant to gcc's optimizer screwing it up and works in MSVC (and presumably a number of other compilers).
Out of curiosity, Karlos, what CPU were you emulating in that example of yours?