@Jose
Loop unrolling basically reduces the overhead common to all loops: the time they spend testing the exit condition and branching.
For instance, a simple loop might be:
while counter > 0
    perform action
    counter = counter - 1
end while
Suppose this loop is going to have a large counter value. You could unroll it 4x like this:
rem - do as much of the loop in blocks of 4 as possible
unroll_counter = counter / 4
while unroll_counter > 0
    perform action
    perform action
    perform action
    perform action
    unroll_counter = unroll_counter - 1
end while
rem - handle any remaining stuff
counter = counter % 4
while counter > 0
    perform action
    counter = counter - 1
end while
The above pseudocode shows the general loop unrolling idea. The bulk of any large loop is carried out in the unrolled block. Any odd remainder at the end that is smaller than the unrolled block is handled by the second loop.
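For anyone who prefers C, here is roughly the same thing as a minimal sketch. The function name `sum_unrolled4` and the choice of "action" (adding up an array) are just made up for illustration; the structure is the point: one unrolled loop for the blocks of 4, one small loop for the 0 to 3 leftovers.

```c
#include <stddef.h>

/* Hypothetical example: sum n ints with the loop body unrolled 4x. */
long sum_unrolled4(const int *data, size_t n)
{
    long total = 0;
    size_t blocks = n / 4;   /* number of full 4-element blocks */
    size_t rest   = n % 4;   /* 0..3 elements left over */

    /* Unrolled loop: one exit test and branch per four additions. */
    while (blocks > 0) {
        total += data[0];
        total += data[1];
        total += data[2];
        total += data[3];
        data += 4;
        blocks--;
    }

    /* Remainder loop: handles whatever did not fill a block of 4. */
    while (rest > 0) {
        total += *data++;
        rest--;
    }
    return total;
}
```

With 7 elements this does one pass through the unrolled block and three passes through the remainder loop.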
Duff's device elegantly does away with the need for this second part. It handles any odd remainder from unrolling by calculating an offset into the unrolled block and jumping straight into it.
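In C the trick is usually written as a `switch` wrapped around a `do`/`while`. The following is only a sketch with an 8x unroll; the function name `copy_words_duff` is invented here, and it assumes `count` is greater than zero (with a count of zero the classic form would run a whole pass, hence the assert).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: copy count 32-bit words, unrolled 8x.
   Precondition (assumed, not checked in release builds): count > 0. */
void copy_words_duff(uint32_t *to, const uint32_t *from, size_t count)
{
    assert(count > 0);

    /* Number of passes through the block, rounding up so the
       partial first pass is included. */
    size_t passes = (count + 7) / 8;

    /* The switch jumps into the middle of the unrolled body so the
       first pass performs only count % 8 copies (or all 8 if that is 0).
       Every later pass runs the full block from case 0. */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

For a count of 11 the switch enters at `case 3`, copies 3 words, then loops once more for the remaining 8.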
Since you are an asm guy, here is a Duff's-device-style fragment from one of my memory copy routines (note that this only performs the 32-bit aligned section of a larger copy, which handles any odd bytes before and after it):
; d0 counter (in bytes)
; d1 scratch for jump position
; a0 to (destination)
; a1 from (source)
; unrolled section moves 64 bytes in 16 longwords
        move.l  d0,d1
        add.l   #60,d0
        lsr.l   #2,d1
        lsr.l   #6,d0           ; d0 = (counter+60)>>6
        and.l   #$F,d1          ; d1 = (counter>>2) & 15
        beq     .case0
        ; calculate position to jump to
        ; jump offset = pc + (16 - d1) * size of move.l inst
        neg.w   d1
        add.w   #16,d1
        jmp     .case0(pc,d1.w*2)
        CNOP    0,4
.case0  move.l  (a1)+,(a0)+
.case15 move.l  (a1)+,(a0)+
.case14 move.l  (a1)+,(a0)+
.case13 move.l  (a1)+,(a0)+
.case12 move.l  (a1)+,(a0)+
.case11 move.l  (a1)+,(a0)+
.case10 move.l  (a1)+,(a0)+
.case9  move.l  (a1)+,(a0)+
.case8  move.l  (a1)+,(a0)+
.case7  move.l  (a1)+,(a0)+
.case6  move.l  (a1)+,(a0)+
.case5  move.l  (a1)+,(a0)+
.case4  move.l  (a1)+,(a0)+
.case3  move.l  (a1)+,(a0)+
.case2  move.l  (a1)+,(a0)+
.case1  move.l  (a1)+,(a0)+
        subq.l  #1,d0
        bgt.b   .case0
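If the register arithmetic is hard to follow, here is a small C model of just the setup part, as a sketch. The names `duff_entry` and `duff_entry_for` are invented for this post. It assumes the byte count is at least 4: with fewer than 4 bytes the asm above would end up with d0 = 0 and d1 = 0, branch to `.case0` and still do one full pass, so the surrounding routine has to rule that case out. Each `move.l (a1)+,(a0)+` is 2 bytes long, which is why the index is scaled by 2 in the `jmp` (the scaled index needs a 68020 or later).

```c
#include <stdint.h>

/* Hypothetical model of the setup arithmetic only; nothing is copied. */
typedef struct {
    uint32_t passes;   /* final d0: passes through the 16-move block */
    uint32_t skipped;  /* final d1: moves skipped on the first pass  */
} duff_entry;

duff_entry duff_entry_for(uint32_t byte_count)
{
    duff_entry e;
    uint32_t rem = (byte_count >> 2) & 15;    /* longwords mod 16 */

    e.passes  = (byte_count + 60) >> 6;       /* longwords / 16, rounded up */
    e.skipped = (rem == 0) ? 0 : 16 - rem;    /* beq .case0 covers rem == 0 */
    return e;
}
```

For example, 100 bytes is 25 longwords: `passes` comes out as 2 and `skipped` as 7, so the jump lands 14 bytes past `.case0`, the first pass does 9 moves and the second does all 16.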
-edit-
Only just noticed the code tag thing in the edit window :lol: