@bloodline
Well, there are some things I'd do differently if I were to do it again. That VM had 16 general-purpose 64-bit registers, each able to hold a value of any 8-, 16-, 32- or 64-bit elemental type. The opcode defined how the register contents were to be interpreted. Each instruction was (at least) a 2-byte entity, with one byte for the operation and usually one byte that encoded the source and destination registers. It was, in effect, a load/store architecture.
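To make that layout concrete, here is a rough sketch in C. It is not the original VM's code: the opcode names and numbers, the `Vm` type, and the choice of putting the source register in the high nibble of the second byte are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 64-bit register that can be viewed at any of the element widths. */
typedef union {
    uint8_t  u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
} Reg;

enum { NUM_REGS = 16 };

/* Hypothetical opcodes: the width lives in the opcode, not in the register. */
enum { OP_HALT = 0, OP_ADD_U8 = 1, OP_ADD_U32 = 2, OP_ADD_U64 = 3 };

typedef struct {
    Reg r[NUM_REGS];
} Vm;

/* Clear every register through its widest view. */
static void vm_init(Vm *vm)
{
    assert(vm != NULL);
    for (int i = 0; i < NUM_REGS; i++)
        vm->r[i].u64 = 0;
}

/* Execute 2-byte instructions until OP_HALT, an unknown opcode, or the end
 * of the code. Returns the number of non-halt instructions executed. */
static size_t vm_run(Vm *vm, const uint8_t *code, size_t len)
{
    assert(vm != NULL && code != NULL);
    size_t pc = 0, count = 0;
    while (pc + 1 < len) {
        uint8_t op  = code[pc];
        uint8_t src = code[pc + 1] >> 4;   /* assumed: high nibble = source */
        uint8_t dst = code[pc + 1] & 0x0F; /* assumed: low nibble = destination */
        pc += 2;
        switch (op) {
        case OP_HALT:
            return count;
        case OP_ADD_U8:
            vm->r[dst].u8 = (uint8_t)(vm->r[dst].u8 + vm->r[src].u8);
            break;
        case OP_ADD_U32:
            vm->r[dst].u32 += vm->r[src].u32;
            break;
        case OP_ADD_U64:
            vm->r[dst].u64 += vm->r[src].u64;
            break;
        default:
            return count; /* unknown opcode: stop */
        }
        count++;
    }
    return count;
}
```

The point is that the same register can be used as a byte by one instruction and as a 64-bit value by the next; only the opcode decides which view is used.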
This made it easy to design and write, but if I were starting from scratch, I'd probably go for a stack-frame machine. It wouldn't actually be any slower, since the registers above are just memory locations anyway, and, done correctly, it would let you have as many "registers" as you have local data inside any function. That is to say, I'd keep the same register-like topology but give each function context only as many registers as it needs.
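Something along these lines is what I mean by a stack-frame machine. Again, this is only a sketch under my own assumptions: `FrameVm`, `fvm_enter`, `fvm_leave` and `fvm_local` are hypothetical names, and a real design would also need to deal with arguments and return values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STACK_SLOTS = 1024 };

typedef struct {
    uint64_t slots[STACK_SLOTS]; /* one contiguous stack of 64-bit slots */
    size_t   base;               /* first slot of the current frame */
    size_t   top;                /* one past the last slot of the current frame */
} FrameVm;

static void fvm_init(FrameVm *vm)
{
    vm->base = 0;
    vm->top  = 0;
}

/* Enter a function that needs nlocals "registers". Returns the caller's
 * base so that it can be restored by fvm_leave. */
static size_t fvm_enter(FrameVm *vm, size_t nlocals)
{
    assert(vm->top + nlocals <= STACK_SLOTS);
    size_t saved_base = vm->base;
    vm->base = vm->top;
    vm->top += nlocals;
    for (size_t i = vm->base; i < vm->top; i++)
        vm->slots[i] = 0;
    return saved_base;
}

/* Leave the current function and restore the caller's frame. */
static void fvm_leave(FrameVm *vm, size_t saved_base)
{
    vm->top  = vm->base;
    vm->base = saved_base;
}

/* "Register" n of the current function is just slot base + n. */
static uint64_t *fvm_local(FrameVm *vm, size_t index)
{
    assert(vm->base + index < vm->top);
    return &vm->slots[vm->base + index];
}
```

An instruction operand is then an index relative to the frame base rather than one of 16 fixed registers, so a function with 3 locals gets 3 slots and a function with 40 gets 40.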
I did write some documentation, but it is rather out of date, I expect:
http://extropia.co.uk/projects/vm/