I am with utri007 on this. What is the point of being faster at 1mio copies of 128 bytes if this situation never happens in reality? Where is the bottleneck you are trying to fix, and how did you even find it? Optimizations are probably fun for you, but if the gain only shows up in artificial tests - what is the point?
A few more thoughts from Mr. Spock:
Logically, if 1mio copies of 128 bytes are faster, then 100K copies of 128 bytes should also be faster, and so should 10K copies, 1K copies, and so on.
The problem is that most lamers don't understand that coding a benchmark program which offers even a moderate degree of accuracy is no easy task. In fact, given the limitations of the Amiga hardware (timer.device) and the Amiga OS (multitasking and serious interrupt dependencies), compounded by the limitations of the developer tools (SAS Crap compiler), it's truly amazing and very impressive that some Amiga benchmark programs work as well as they do.
But the one thing all Benchmark coders will appreciate (given the above constraints) is that
More Iterations = Better Accuracy. With a coarse timer, the measurement error is roughly fixed, so dividing it across a million copies makes the per-copy figure far more precise. So, logically speaking, the 1mio copies of 128 bytes has a practical purpose (even if some lamer still doesn't get it).
The bottleneck fix has already been explained, and I don't like repeating myself. Optimizations are often more work than fun, but there is some satisfaction in finally solving the bottleneck problem. Now, if I could only do that and avoid these silly lamer questions...