Try to find a use for d3, d4, d5, d6 and d7. It seems you are only using d0 to d2. There are various frequently used constants and variables in your code that you could keep in these registers to improve access times, at least on slower machines. Even on faster ones, a read from the data cache is not as fast as a direct register access.
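
To illustrate the idea, here is a rough sketch of what I mean. It is not taken from your code: the labels `offset` and `count`, the mask `$00ff`, and the use of a0/a1 as source and destination pointers are all made up for the example, and it assumes `count` is at least 1.

```
; Before: the variable and the constant are fetched on every pass
loop:
        move.w  (a0)+,d0
        add.w   offset,d0       ; variable read from memory each time
        and.w   #$00ff,d0       ; immediate fetched each time
        move.w  d0,(a1)+
        subq.w  #1,count        ; loop counter kept in memory
        bne.s   loop

; After: load once into the unused registers, then stay in registers
        move.w  offset,d3       ; hypothetical variable, now in d3
        move.w  #$00ff,d4       ; hypothetical constant, now in d4
        move.w  count,d7
        subq.w  #1,d7           ; dbra runs the body d7+1 times
loop2:
        move.w  (a0)+,d0
        add.w   d3,d0           ; register to register
        and.w   d4,d0
        move.w  d0,(a1)+
        dbra    d7,loop2
```

If the routine is called from code that expects d3 to d7 to be preserved, you would also want to save and restore them around it, for example with `movem.l d3-d7,-(sp)` on entry and `movem.l (sp)+,d3-d7` before returning.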