I don't think he ever said that the audio data itself would be 16-bit, just that the instruction set for his virtual DSP would use 16-bit opcodes. That's much like how the SuperH series uses 16-bit opcodes even though it can work on 32- and 64-bit floating-point numbers (well, certain members of the SuperH series anyway).
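Of course, 16-bit opcodes can be quite limiting if you have a fixed-width instruction set, since the operation code, the register fields, and any immediate value all have to fit into those 16 bits. There's a reason why most RISC architectures use 32-bit opcodes.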
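
To put rough numbers on that, here's a quick sketch of the bit budget. The function name and the field sizes are my own illustrative assumptions, not anything from his DSP design: it just subtracts the opcode and register fields from the instruction width to see what's left for an immediate.

```python
def immediate_bits(width, opcode_bits, reg_operands, num_regs):
    """Bits left over for an immediate in a hypothetical fixed-width encoding."""
    # Bits needed to name one register out of num_regs (e.g. 16 regs -> 4 bits).
    reg_bits = (num_regs - 1).bit_length()
    return width - opcode_bits - reg_operands * reg_bits


# Assumed 16-bit format: 4-bit operation field, two register operands, 16 registers.
print(immediate_bits(16, 4, 2, 16))  # 16 - 4 - 2*4 = 4

# Assumed 32-bit format: 6-bit operation field, two register operands, 32 registers
# (the same split as a MIPS I-type instruction).
print(immediate_bits(32, 6, 2, 32))  # 32 - 6 - 2*5 = 16
```

So under those assumptions the 16-bit format has only 4 bits left for a constant or displacement, while the 32-bit one has 16, and that's with twice as many registers to address.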