I'm a bit too drunk atm to make any detailed analysis, but I'll check it out later.
[EDIT]
But even when wasted I can give some ideas on how to make it much faster (at least on older systems):
- Inline the subroutine calls made from the inner loop (.NextX).
- Keep variables in registers instead of memory. Do this at least for the variables used in the inner loop.
- Move any 'y'-related calculations out of the .NextX loop and compute those values before entering the X loop (this appears to have been done mostly, however).
[/EDIT]
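
To make the three suggestions concrete, here's a rough sketch in C rather than in the original assembly. Everything in it is made up for illustration: `WIDTH`, `HEIGHT`, `shade()`, `render_naive()` and `render_fast()` are hypothetical names, and the pixel formula is just a placeholder for whatever the real .NextX loop computes. Also note that in C you only write the code so the compiler can keep values in registers; in assembly you'd assign the registers by hand.

```c
#include <stddef.h>
#include <stdint.h>

enum { WIDTH = 320, HEIGHT = 200 };  /* assumed screen size, illustration only */

/* Hypothetical per-pixel routine, standing in for the calls made from .NextX. */
static uint8_t shade(int x, int y_term)
{
    return (uint8_t)((x + y_term) & 0xFF);
}

/* Before: one call per pixel, and the y-only work (y * 3 and the row
 * offset y * WIDTH) is recomputed for every x. */
void render_naive(uint8_t *buf)
{
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            buf[(size_t)y * WIDTH + x] = shade(x, y * 3);
}

/* After: same output, with the three changes applied. */
void render_fast(uint8_t *buf)
{
    for (int y = 0; y < HEIGHT; y++) {
        /* y-only values are computed once, before entering the X loop. */
        uint8_t *row = buf + (size_t)y * WIDTH;
        int y_term = y * 3;

        /* row, y_term and x are plain locals, so they can live in
         * registers for the whole inner loop. */
        for (int x = 0; x < WIDTH; x++)
            row[x] = (uint8_t)((x + y_term) & 0xFF);  /* shade() inlined */
    }
}
```

A modern optimizing compiler will often do all of this on its own, so in C the gain may be small. In hand-written assembly, especially on older CPUs where calls and memory accesses are expensive, nothing does it for you.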