Author Topic: Move16 to chipmem ... (Read 3340 times)

Jose · « **on:** March 17, 2004, 10:48:12 PM »

I read some time ago, don't remember where, that one should take special precautions when using the Move16 instruction to chipmem. What are they?
Is there any benefit, or is the bandwidth not enough? Maybe that's where the problem is: the processor halting till the instruction is finished (till all bytes are copied)?

patrik · « **Reply #1 on:** March 17, 2004, 11:31:03 PM »

@Jose:

(edit):

The bus logic should assert the /TBI (Transfer Burst Inhibit) pin of the 68040/68060 on chipram accesses, which tells the cpu that the target (chipmem) doesn't support burst transfers. This forces the cpu to carry out four consecutive normal longword reads/writes instead of a line read/write transfer (burst) when executing a move16 instruction.

There is a possibility that some accelerators use a write buffer to finish writes to motherboard resources (motherboard fastmem etc.) faster. This write buffer itself might accept bursts, but as chipmem still can't be accessed with bursts of any kind after the write buffer, there would be no bandwidth gain towards chipmem. It would possibly save the cpu a few cycles though.

It should theoretically be safe, but if I am not mistaken, Commodore advised against using the move16 instruction.

(edit 2):

According to this page, there is a bug in some of the oldest 68040 revisions requiring you to issue a nop instruction before all sets of move16 instructions. I don't think there are many of these 68040s in use in Amigas though; Apple probably and hopefully got their hands on most of the first 68040 cpus. Nevertheless it can be good to know that an issue might exist.

/Patrik

Karlos · « **Reply #2 on:** March 17, 2004, 11:34:21 PM »

@Jose

You should listen to this guy. He's no Shawn, that's for sure ;-)

-edit-

slightly OT:

move16 works in my BVision's VRAM area :-D

-edit 2-

...and patrik tells me to say it does it via the "linetransfer dance"...

It is faster. Copying RAM with longwords to VRAM on my 040 gives 6-7Mb/s (for best case aligned transfer).

A move16 loop gets about 9.3 Mb/s from RAM to VRAM.

Jose · « **Reply #3 on:** March 18, 2004, 07:17:54 PM »

Thx for the info..
With the stuff I've read I think I should've started coding some stuff earlier ... :-D

peroxidechicken · « **Reply #4 on:** March 18, 2004, 09:00:57 PM »

Recently I was writing a small routine for clearing memory and I thought to myself 'Doing this 4 bytes at a time is kind of a drag'. But I didn't want to write something that would require a cpu check. Then I realized there's a kind of 'move64' that all 68ks will do - movem.l d0-d7/a0-a7. Although in most cases, one address register is going to be needed as a pointer into memory. Not sure if that instruction works with a direct address...

itix · « **Reply #5 on:** March 18, 2004, 11:33:55 PM »

A7 is the stack pointer and can't be used here.

Karlos · « **Reply #6 on:** March 19, 2004, 12:13:37 AM »

@peroxide chicken

Whatever you do, don't trash a7. Movem is a mixed bag; it's slower on some 680x0 than consecutive move.l.

Your best bet is to use loop unrolled code for this kind of stuff...

itix · « **Reply #7 on:** March 19, 2004, 12:19:53 AM »

Or Duff's device :-)

Karlos · « **Reply #8 on:** March 19, 2004, 01:36:52 AM »

Quote

itix wrote:
Or Duff's device :-)

Indeed, but remember to replace the modulus with an and operation when calculating the jump offset into the loop (assuming it's a nice power of two long); it's way quicker ;-)

All my loop unrolled C / asm stuff is Duff's device style - it's a neat trick :-D

Jose · « **Reply #9 on:** March 19, 2004, 09:54:41 PM »

Knowing only some 68k asm stuff, I checked around on what this rolled/unrolled stuff is...

Wouldn't unrolled be similar to writing all the instructions that would be executed in a loop one after the other, avoiding the condition checking and jumps each time the loop is executed? This occupies a BIG space for some things, no?

Ok, what da hell is Duff's device by the way? I need to know now :-o 8-)

Karlos · « **Reply #10 on:** March 19, 2004, 10:08:30 PM »

@Jose

Duff's device is a famous bit of C code that takes advantage of the switch/case construct to create an unrolled loop.

It goes something like this

-edit-
Safety check for count <= 0 added - cheers Piru
-/edit-

Code:


void copy(int* to, int* from, int count)
{
    if (count<=0)
        return; /* safe as houses */

    int n = (count+7)/8;
    switch (count%8) {
        case 0:    do { *to++ = *from++;
        case 7:         *to++ = *from++;
        case 6:         *to++ = *from++;
        case 5:         *to++ = *from++;
        case 4:         *to++ = *from++;
        case 3:         *to++ = *from++;
        case 2:         *to++ = *from++;
        case 1:         *to++ = *from++; } while (--n);
    }
}


The above code simply copies 8 ints for each loop. If the total number of ints to be copied has a remainder when divided by 8, the remainder ints are copied by an 'incomplete loop' thanks to the switch construct.

Eg, if you had 15 ints, the odd 7 ints are handled first as the code jumps to case 7: from the switch (count%8) statement.
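
To make the 15-int case concrete, here is a rough sketch that keeps the same control flow but also records how many ints each trip through the do/while copies. The name traced_copy and the pass_sizes parameter are made up for this illustration and are not part of the code above.

```c
/* Illustration only: traced_copy and pass_sizes are hypothetical names.
   Same switch/do-while structure as copy(), plus bookkeeping.
   pass_sizes must have room for (count + 7) / 8 entries.
   Returns the number of passes made through the unrolled block. */
int traced_copy(int *to, const int *from, int count, int *pass_sizes)
{
    int passes = 0;   /* trips through the do/while so far */
    int copied = 0;   /* ints copied in the current trip */
    int n;

    if (count <= 0)
        return 0;

    n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do { *to++ = *from++; copied++;
        case 7:      *to++ = *from++; copied++;
        case 6:      *to++ = *from++; copied++;
        case 5:      *to++ = *from++; copied++;
        case 4:      *to++ = *from++; copied++;
        case 3:      *to++ = *from++; copied++;
        case 2:      *to++ = *from++; copied++;
        case 1:      *to++ = *from++; copied++;
                     pass_sizes[passes++] = copied;
                     copied = 0;
                } while (--n);
    }
    return passes;
}
```

With count = 15, n is (15+7)/8 = 2 and 15%8 is 7, so the first pass enters at case 7 and copies 7 ints, and the second pass runs the whole block and copies the other 8.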

It's quite a nifty thing - very asm like, but in C.

-edit-

Try as I might, I can't seem to get the bugger to indent the code anymore...the old non-breaking space trick isn't working

Piru · « **Reply #11 on:** March 19, 2004, 10:35:18 PM »

The code misbehaves when called with count of 0. Here is a proper version:

Code:


void copy(int *to, int *from, int count)
{
  int n;

  if (!count)
    return;

  n = (count + 7) / 8;

  switch (count % 8)
  {
    case 0: do { *to++ = *from++;
    case 7: *to++ = *from++;
    case 6: *to++ = *from++;
    case 5: *to++ = *from++;
    case 4: *to++ = *from++;
    case 3: *to++ = *from++;
    case 2: *to++ = *from++;
    case 1: *to++ = *from++; } while (--n);
  }
}

Karlos · « **Reply #12 on:** March 19, 2004, 11:28:07 PM »

@Piru

Indeed :-)

I was aiming only at an explanation of how the Duff's device mechanism works, but you are correct that a count of zero would actually cause one iteration of the loop in the code I used. If you're going to start checking the count in this particular example, as it's an int, you might also want to modify the check for the negative case:

if (count<=0) return;

Otherwise a negative count value could cause a very long copy given the "while (--n)" condition ;-)

An optimisation for real applications is not to use the modulus and division; simply use ands and shifts:

int n = ((count+7)>>3);
switch (count & 7)
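
Put together, the shift/and variant might look something like the sketch below. The function name copy_ints is invented here, and the count is taken as an unsigned size_t (an assumption, not something from the code above) so that `count & 7` and `count % 8` are guaranteed to agree and there is no negative case left to guard against.

```c
#include <stddef.h>

/* Sketch only: copy_ints is a hypothetical name. Same Duff's device
   layout as before, with the divide and modulus swapped for a shift
   and an and. Assumes the unroll factor stays a power of two (8). */
void copy_ints(int *to, const int *from, size_t count)
{
    size_t n;

    if (count == 0)
        return;

    n = (count + 7) >> 3;      /* same value as (count + 7) / 8 */
    switch (count & 7) {       /* same value as count % 8 */
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n);
    }
}
```

A decent compiler will usually make this substitution itself for unsigned operands, so the hand-written form mostly matters for signed counts or for asm.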

Incidentally, how did you indent the code? I tried code tags and non breaking spaces, but neither seemed to work for me :-?

Jose · « **Reply #13 on:** March 19, 2004, 11:29:15 PM »

Cool. I think I got the idea (at least the goal), though I don't understand the code. Maybe later when I also learn C, which I plan to, but I'm only into asm now!! :-D
So was my idea of unrolled correct?
It's always cool that you guys are here to talk about this stuff by the way. This is what every long-time Amiga user should be into :-)

Karlos · « **Reply #14 on:** March 20, 2004, 02:31:02 AM »

@Jose

Loop unrolling basically lessens the performance hit common to all loops - the time they spend testing the loop exit and branching.

For instance, a simple loop might be:

Code:


while counter > 0
    perform action
    counter = counter - 1
end while


Suppose this loop is going to have a large counter value, you could unroll it 4x thus:

Code:


rem - do as much of the loop in blocks of 4 as possible

unroll_counter = counter / 4

while unroll_counter > 0
    perform action
    perform action
    perform action
    perform action
    unroll_counter = unroll_counter - 1
end while

rem - handle any remaining stuff

counter = counter % 4

while counter > 0
    perform action
    counter = counter - 1
end while


The above pseudocode shows the general loop unrolling idea. The bulk of any large loop is carried out in the unrolled block. Any odd fraction that remains at the end that is smaller than the unrolled block is performed in the second loop.
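
For comparison, the same two-loop shape in C could look roughly like this. The "action" is assumed to be clearing one 32-bit word, and clear_longs is a name picked just for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: clear_longs is a hypothetical name, and clearing a
   longword stands in for "perform action" from the pseudocode. */
void clear_longs(uint32_t *p, size_t count)
{
    size_t blocks = count / 4;   /* unroll_counter in the pseudocode */
    size_t rest   = count % 4;   /* whatever does not fill a block */

    /* bulk of the work: one test and branch per 4 actions */
    while (blocks > 0) {
        p[0] = 0;
        p[1] = 0;
        p[2] = 0;
        p[3] = 0;
        p += 4;
        blocks--;
    }

    /* handle any remaining stuff, at most 3 iterations */
    while (rest > 0) {
        *p++ = 0;
        rest--;
    }
}
```

The unrolled loop pays for the exit test once per four stores instead of once per store, which is the whole gain. The price is the extra code size and the small cleanup loop.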

Duff's device elegantly does away with the need for this second part. It handles any odd remainder from unrolling by calculating an offset into the unrolled block and jumping straight into it.

Since you are an asm guy, here is a Duff's device style fragment from one of my memory copy routines (note this only performs the 32-bit aligned section of a larger copy that handles any odd bytes before and after):

Code:


; d0 counter (in bytes)
; d1 scratch for jump position
; a0 to
; a1 from
; note: the scaled index (d1.w*2) in the jmp needs a 68020 or better

; unrolled section moves 64 bytes in 16 longwords

    move.l   d0,        d1
    add.l    #60,       d0
    lsr.l    #2,        d1
    lsr.l    #6,        d0 ; d0 = (counter+60)>>6
    and.l    #$F,       d1 ; d1 = (counter>>2) & 15
    beq      .case0

; calculate position to jump to
; jump offset = pc + (16 - d1)* size of move.l inst

    neg.w    d1
    add.w    #16,       d1
    jmp      .case0(pc, d1.w*2)

    CNOP     0,4
.case0  move.l  (a1)+,  (a0)+
.case15 move.l  (a1)+,  (a0)+
.case14 move.l  (a1)+,  (a0)+
.case13 move.l  (a1)+,  (a0)+
.case12 move.l  (a1)+,  (a0)+
.case11 move.l  (a1)+,  (a0)+
.case10 move.l  (a1)+,  (a0)+
.case9  move.l  (a1)+,  (a0)+
.case8  move.l  (a1)+,  (a0)+
.case7  move.l  (a1)+,  (a0)+
.case6  move.l  (a1)+,  (a0)+
.case5  move.l  (a1)+,  (a0)+
.case4  move.l  (a1)+,  (a0)+
.case3  move.l  (a1)+,  (a0)+
.case2  move.l  (a1)+,  (a0)+
.case1  move.l  (a1)+,  (a0)+

    subq.l    #1,        d0
    bgt.b     .case0
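
The setup arithmetic at the top of that fragment can be checked in C. The sketch below only mirrors the calculations and copies nothing. The struct and function names are invented for illustration, and it assumes the byte count is a non-zero multiple of 4, as the fragment itself does.

```c
#include <stdint.h>

/* Illustration only: hypothetical names, mirrors the asm setup code. */
struct duff_entry {
    uint32_t passes;      /* d0 after lsr.l #6: trips through the block */
    uint32_t first_case;  /* d1 after and.l #$F: longwords in first trip,
                             0 meaning a full trip starting at .case0 */
    uint32_t jump_bytes;  /* byte offset from .case0 where the jmp lands */
};

struct duff_entry duff_entry_for(uint32_t byte_count)
{
    struct duff_entry e;

    e.passes     = (byte_count + 60) >> 6;
    e.first_case = (byte_count >> 2) & 15;

    /* each move.l (an)+,(an)+ is a single 16-bit opcode word,
       which is why the index is scaled by 2 */
    e.jump_bytes = e.first_case ? (16 - e.first_case) * 2 : 0;
    return e;
}
```

For 100 bytes (25 longwords) this gives 2 passes, first_case 9 and a jump offset of 14 bytes. So the first trip skips 7 of the 16 moves and copies 9 longwords, and the second trip copies the remaining 16.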


-edit-

Only just noticed the code tag thing in the edit window :lol:
