Author Topic: Move16 to chipmem ... (Read 1495 times)

Karlos · « **on:** March 17, 2004, 11:34:21 PM »

@Jose

You should listen to this guy. He's no Shawn, that's for sure ;-)

-edit-

slightly OT:

move16 works in my BVision's VRAM area :-D

-edit 2-

...and patrik tells me to say it does it via the "linetransfer dance"...

It is faster. Copying from RAM to VRAM with longword moves on my 040 gives 6-7 MB/s (best case, aligned transfer).

A move16 loop gets about 9.3 MB/s from RAM to VRAM.

Karlos · « **Reply #1 on:** March 19, 2004, 12:13:37 AM »

@peroxide chicken

Whatever you do, don't trash a7. Movem is a mixed bag: it's slower on some 680x0 than consecutive move.l instructions.

Your best bet is to use loop unrolled code for this kind of stuff...

Karlos · « **Reply #2 on:** March 19, 2004, 01:36:52 AM »

Quote

itix wrote:
Or Duff's device :-)

Indeed, but remember to replace the modulus with an and operation when calculating the jump offset into the loop (assuming the unrolled length is a nice power of two); it's way quicker ;-)

All my loop unrolled C / asm stuff is Duff's device style - it's a neat trick :-D

Karlos · « **Reply #3 on:** March 19, 2004, 10:08:30 PM »

@Jose

Duff's device is a famous bit of C code that takes advantage of the switch/case construct to create an unrolled loop.

It goes something like this:

-edit-
Safety check for count <= 0 added - cheers Piru
-/edit-

Code:


/* Copy count ints from 'from' to 'to', unrolled 8x (Duff's device) */
void copy(int* to, int* from, int count)
{
    if (count<=0)
        return; /* safe as houses */

    int n = (count+7)/8;   /* passes through the do/while, rounded up */
    switch (count%8) {     /* jump into the middle of the first pass */
        case 0:    do { *to++ = *from++;
        case 7:         *to++ = *from++;
        case 6:         *to++ = *from++;
        case 5:         *to++ = *from++;
        case 4:         *to++ = *from++;
        case 3:         *to++ = *from++;
        case 2:         *to++ = *from++;
        case 1:         *to++ = *from++; } while (--n);
    }
}


The above code simply copies 8 ints on each pass through the loop. If the total number of ints to be copied has a remainder when divided by 8, the remainder ints are copied by an 'incomplete loop' thanks to the switch construct.

E.g., if you had 15 ints, the odd 7 ints are handled first, as the switch (count%8) statement jumps straight to case 7. The second pass then copies the remaining 8.
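
To see where the 15 int case actually enters the loop, here is a minimal sketch that keeps the control flow of copy() but counts the moves in each pass instead of copying anything. The name duff_trace and the passes[] array are made up for illustration; they are not part of the code above.

```c
/* Hypothetical tracing version of copy(): same switch/do-while shape,
   but each "move" just bumps a counter so the size of every pass is
   visible. passes[] needs room for (count+7)/8 entries.
   Returns the number of passes made. */
int duff_trace(int count, int *passes)
{
    int n, pass = 0, moved = 0;

    if (count <= 0)
        return 0;

    n = (count + 7) / 8;
    switch (count % 8) {
        case 0:    do { moved++;
        case 7:         moved++;
        case 6:         moved++;
        case 5:         moved++;
        case 4:         moved++;
        case 3:         moved++;
        case 2:         moved++;
        case 1:         moved++;
                        passes[pass++] = moved; /* record this pass */
                        moved = 0;
                   } while (--n);
    }
    return pass;
}
```

With count = 15 this makes two passes: the first enters at case 7 and does 7 moves, the second does the full 8.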

It's quite a nifty thing - very asm-like, but in C.

-edit-

Try as I might, I can't seem to get the bugger to indent the code anymore... the old non-breaking space trick isn't working

Karlos · « **Reply #4 on:** March 19, 2004, 11:28:07 PM »

@Piru

Indeed :-)

I was aiming only at an explanation of how the Duff's device mechanism works, but you are correct that a count of zero would still run the loop in the code I used: it enters at case 0, and because n starts out as 0 the "while (--n)" test doesn't stop it after that first pass either. If you're going to start checking the count in this particular example, as it's an int, you might also want to modify the check to cover the negative case:

if (count<=0) return;

Otherwise a negative count value could cause a very long copy given the "while (--n)" condition ;-)

An optimisation for real applications is not to use the modulus and division; simply use ands and shifts:

int n = ((count+7)>>3);
switch (count & 7)
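
Dropped into the full routine, the substitution might look like the sketch below. The name copy_masked is made up for illustration. The count <= 0 guard is doing real work here: the and/shift forms only match the modulus/division forms for non-negative values.

```c
/* Sketch only: copy() with the and/shift substitution.
   copy_masked is a hypothetical name. For count > 0,
   (count + 7) >> 3 equals (count + 7) / 8 and
   count & 7 equals count % 8. */
void copy_masked(int *to, const int *from, int count)
{
    int n;

    if (count <= 0)
        return;

    n = (count + 7) >> 3;      /* number of passes, rounded up */
    switch (count & 7) {       /* entry point for the first pass */
        case 0:    do { *to++ = *from++;
        case 7:         *to++ = *from++;
        case 6:         *to++ = *from++;
        case 5:         *to++ = *from++;
        case 4:         *to++ = *from++;
        case 3:         *to++ = *from++;
        case 2:         *to++ = *from++;
        case 1:         *to++ = *from++; } while (--n);
    }
}
```

Many compilers will make this substitution by themselves when the operand is unsigned, so it is worth checking the generated code before assuming the hand-written version wins.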

Incidentally, how did you indent the code? I tried code tags and non breaking spaces, but neither seemed to work for me :-?

Karlos · « **Reply #5 on:** March 20, 2004, 02:31:02 AM »

@Jose

Loop unrolling basically lessens the performance hit common to all loops - the time they spend testing the loop exit and branching.

For instance, a simple loop might be:

Code:


while counter > 0
    perform action
    counter = counter - 1
end while


Suppose this loop is going to have a large counter value. You could unroll it 4x thus:

Code:


rem - do as much of the loop in blocks of 4 as possible

unroll_counter = counter / 4

while unroll_counter > 0
    perform action
    perform action
    perform action
    perform action
    unroll_counter = unroll_counter - 1
end while

rem - handle any remaining stuff

counter = counter % 4

while counter > 0
    perform action
    counter = counter - 1
end while


The above pseudocode shows the general loop unrolling idea. The bulk of any large loop is carried out in the unrolled block. Any odd fraction that remains at the end that is smaller than the unrolled block is performed in the second loop.

Duff's device elegantly does away with the need for this second part. It handles any odd remainder from unrolling by calculating an offset into the unrolled block and jumping straight into it.
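
For comparison, the same two-loop shape in C might look like this. It is only a sketch: copy_unrolled4 is a made-up name, and "perform action" has been assumed to be a single int copy.

```c
/* Hypothetical C version of the 4x unrolled pseudocode: a bulk loop
   that copies four ints per pass, then a plain loop for the 0-3 ints
   left over. */
void copy_unrolled4(int *to, const int *from, int count)
{
    int blocks, rest;

    if (count <= 0)
        return;

    blocks = count / 4;   /* unroll_counter in the pseudocode */
    rest   = count % 4;   /* what the second loop mops up */

    while (blocks > 0) {
        to[0] = from[0];
        to[1] = from[1];
        to[2] = from[2];
        to[3] = from[3];
        to   += 4;
        from += 4;
        blocks--;
    }

    while (rest > 0) {
        *to++ = *from++;
        rest--;
    }
}
```

For 10 ints the first loop runs twice (8 ints) and the second loop picks up the last 2, so the exit test is evaluated far fewer times than in the plain one-int-per-pass loop.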

Since you are an asm guy, here is a Duff's device style fragment from one of my memory copy routines (note this only performs the 32-bit aligned section of a larger copy, which handles any odd bytes before and after separately):

Code:


; d0 counter (in bytes)
; d1 scratch for jump position
; a0 from
; a1 to

; unrolled section moves 64 bytes in 16 longwords

    move.l   d0,        d1
    add.l    #60,       d0
    lsr.l    #2,        d1
    lsr.l    #6,        d0 ; d0 = (counter+60)>>6
    and.l    #$F,       d1 ; d1 = (counter>>2) & 15
    beq      .case0

; calculate position to jump to
; jump offset = pc + (16 - d1)* size of move.l inst

    neg.w    d1
    add.w    #16,       d1
    jmp      .case0(pc, d1.w*2)

    CNOP     0,4
.case0  move.l  (a0)+,  (a1)+
.case15 move.l  (a0)+,  (a1)+
.case14 move.l  (a0)+,  (a1)+
.case13 move.l  (a0)+,  (a1)+
.case12 move.l  (a0)+,  (a1)+
.case11 move.l  (a0)+,  (a1)+
.case10 move.l  (a0)+,  (a1)+
.case9  move.l  (a0)+,  (a1)+
.case8  move.l  (a0)+,  (a1)+
.case7  move.l  (a0)+,  (a1)+
.case6  move.l  (a0)+,  (a1)+
.case5  move.l  (a0)+,  (a1)+
.case4  move.l  (a0)+,  (a1)+
.case3  move.l  (a0)+,  (a1)+
.case2  move.l  (a0)+,  (a1)+
.case1  move.l  (a0)+,  (a1)+

    subq.l    #1,        d0
    bgt.b     .case0
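
The register setup at the top of the fragment is easier to check with numbers plugged in, so here is a small C model of it. This is a sketch under assumptions: duff_setup and model_setup are made-up names, and the byte count is assumed to be a non-zero multiple of 4, since the surrounding routine is said to deal with the odd bytes itself.

```c
#include <stdint.h>

/* Hypothetical C model of the d0/d1 setup in the asm fragment. */
typedef struct {
    uint32_t passes;      /* final d0: (bytes + 60) >> 6 */
    uint32_t jump_bytes;  /* byte offset from .case0 taken by the jmp */
    uint32_t longwords;   /* total move.l instructions executed */
} duff_setup;

duff_setup model_setup(uint32_t bytes)
{
    duff_setup s;
    uint32_t rem     = (bytes >> 2) & 15;   /* d1 after the and.l */
    uint32_t skipped = rem ? 16 - rem : 0;  /* rem == 0 takes the beq */

    s.passes     = (bytes + 60) >> 6;
    s.jump_bytes = skipped * 2;             /* each move.l here is 2 bytes */
    s.longwords  = s.passes * 16 - skipped;
    return s;
}
```

For a 100 byte (25 longword) section this gives 2 passes and a jump 14 bytes past .case0, which lands on .case9: 9 longwords in the first pass and 16 in the second. The *2 scale in the jmp works because move.l (An)+,(An)+ encodes as a single 16-bit word; the scaled index mode itself needs a 68020 or later.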


-edit-

Only just noticed the code tag thing in the edit window :lol:
