Author Topic: One for 680x0 experts - move16 issues... (Read 6693 times)

Karlos · « **on:** May 22, 2005, 02:58:04 PM »

Hi,

I was revisiting some old 68K asm sources last night and bumped into an old bug I never quite managed to fathom to my satisfaction.

I have a set of asm functions that can copy or fill memory using move16.

Normal cache-aligned copying with move16 works perfectly OK. It is the fill functions that are problematic. They do work perfectly well on many 040 and 060 but alas not on mine...

They work by allocating a 16-byte aligned area on the stack (sufficient to hold the source for up to 4 move16's in the first instance), which is then filled with the value to write to memory.

The basic plan was to fill the memory by writing cache lines.

The first basic loop looked something like this:

; d1 contains unrolled loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination

.loop
add.l #-64, a1; a1 now points to start of aligned stack area
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This gives 22MB/s on my 68040 compared to 15MB/s using a basic 16x unrolled move.l d0, (a0)+ implementation.

However, it tends to go awry, randomly filling parts of the area with garbage. It is as if the area read is trashed somehow.

I was wondering if perhaps somehow the cache lines get messed up, so I then tried the following non-unrolled mechanism to see what difference it might make.

; d1 contains loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination

.loop
add.l #-16, a1; a1 now points to start of aligned stack area
tst.l (a1); ensure line stays cached by accessing the first long
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This version was not perceptibly slower (~21MB/s) and produced fewer glitches in the area written to, but it was still pretty lousy.

So later, out of curiosity, I tried this:

; d0 contains 32-bit value to fill
; d1 contains loop counter
; a1 initially points to the end of our aligned stack area
; a0 is destination

.loop
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This produced the least glitching of all, but was also considerably slower (just 18MB/s) than the first one (but still faster than a move.l based fill).

Anyway, it only seems to be my BPPC/040 that does this. Other systems I have tried all work completely fine with the first version.

I exhaustively checked that my aligned area on the stack was both aligned and not touched by something in my code. There seemed no logical reason for the effect, but basically the only way to minimise the glitching was keep refilling the line I wanted to write.

Given I couldn't reproduce the problem on at least 3 other 68040's and 2 060's, I eventually reasoned to myself that perhaps my BPPC's 68040 has a bug even though a basic move16 block copy always worked OK.

Then last night, I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.
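A minimal sketch of such a wrapper, for reference. It assumes the standard exec.library offsets (Forbid at -132, Permit at -138) and ExecBase at address 4. `GuardedFill` and `FillMove16` are hypothetical labels, the latter standing in for any of the fill routines above.

```
_LVOForbid      equ     -132            ; exec.library Forbid() offset
_LVOPermit      equ     -138            ; exec.library Permit() offset

GuardedFill:                            ; hypothetical wrapper name
        move.l  a6,-(sp)                ; preserve caller's a6
        move.l  4.w,a6                  ; a6 = ExecBase
        jsr     _LVOForbid(a6)          ; no task switches from here on
                                        ; (assumed to preserve d0/d1/a0/a1, as the exec autodocs state)
        bsr     FillMove16              ; hypothetical: one of the fill loops above
        move.l  4.w,a6                  ; reload in case the fill routine changed a6
        jsr     _LVOPermit(a6)          ; task switching allowed again
        move.l  (sp)+,a6                ; restore caller's a6
        rts
```

This only masks the symptom while the fill runs. It says nothing about what is overwriting the source line.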

So now I am back to square one. Any asm experts here have an answer?

Doobrey · « **Reply #1 on:** May 22, 2005, 03:43:51 PM »

Quote

subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

I don't have the 68k docs to hand at the moment, so I can't check this out.
But does the move16 change any condition codes in the way a normal move.l (a1)+,(a0)+ would do?

Or change the subq/bgt section to
move16 (a1)+,(a0)+
dbra d1, .loop .. dunno if this is slower though.

Another thing that pops into my head (yeah I know, a lot of free space at the moment !), are you running a replacement scheduler on the system that shows these problems?

Anyway, I'm sure Piru will be along any moment to tell me I'm wrong and give you the right answer :-P

Piru · « **Reply #2 on:** May 22, 2005, 04:00:38 PM »

Quote

does the move16 change any condition codes in the way a normal move.l (a1)+,(a0)+ would do?

No it does not. CCs are unaffected.

Karlos · « **Reply #3 on:** May 22, 2005, 04:02:58 PM »

Quote

Piru wrote:
Quote
does the move16 change any condition codes in the way a normal move.l (a1)+,(a0)+ would do?

No it does not. CCs are unaffected.

Quite. It was a token attempt to optimise the loop ;-)

Piru · « **Reply #4 on:** May 22, 2005, 04:03:17 PM »

Quote

I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.

This points to you forgetting to change A7 (SP) properly. Perhaps your src_stack_ptr < a7 (it could easily happen due to buggy alignment code, for example: sub.l #x,a7 / move.l a7,d0 / and.w #-16,d0 / move.l d0,a1 .. Depending on the initial alignment of the stack, no trashing or partial trashing would occur) ? If this is the case then task scheduling will trash the src stack area when rescheduling occurs.

Also, Forbid would obviously hide the problem as no task scheduling happens -> no registers are pushed to stack -> no trashing.
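To make the failure mode concrete, here is that buggy sequence traced with made-up numbers. The stack addresses are purely illustrative assumptions, and x is taken as 16.

```
; assume sp = $0007FFF8 on entry (8 bytes past a 16-byte boundary)
        sub.l   #16,a7          ; sp = $0007FFE8
        move.l  a7,d0           ; d0 = $0007FFE8
        and.w   #-16,d0         ; d0 = $0007FFE0, rounded DOWN past sp
        move.l  d0,a1           ; buffer = $0007FFE0..$0007FFEF
; $0007FFE0..$0007FFE7 lie below sp, so a context save on this stack
; can overwrite the first 8 bytes of the buffer.
; Had sp been $0007FFF0 on entry, d0 would equal sp and nothing would be exposed.
```

So the same code can run clean or trash part of the line, depending only on where the stack pointer happens to sit on entry.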

The proper alignment would do something like this (this is just one way of implementing it):

move.l sp,d0
sub.l  #16,d0  ; at least 16 bytes storage
and.w  #-16,d0 ; ...aligned by 16
move.l sp,a2   ; save old sp
move.l d0,sp
move.l d0,a1

....

move.l a2,sp   ; restore original sp

Karlos · « **Reply #5 on:** May 22, 2005, 04:10:43 PM »

Quote

Doobrey wrote:

Another thing that pops into my head (yeah I know, a lot of free space at the moment !), are you running a replacement scheduler on the system that shows these problems?

Tested with and without executive, same problem. It also affects some copy code I have which handles non-cache aligned copies (after copying up to the first cache aligned destination) by reading the misaligned data into an aligned buffer on the stack before moving it to the destination using move16. Despite this 'shuffling' of data, the speed gains over a straightforward copy are quite conspicuous in some cases.

Anyway, whilst the above works, I observe the most 'glitching' for a relative alignment of 4 bytes (that is the ultimate source and destination are out by 4 bytes). When the relative alignment is 2 or 6 bytes, the glitching is far less frequent.

Again, as with filling, the glitching is stopped completely when the call is wrapped in a Forbid()/Permit().

Karlos · « **Reply #6 on:** May 22, 2005, 04:18:00 PM »

Quote

Piru wrote:
This points to you forgetting to change A7 (SP) properly. Perhaps your src_stack_ptr < a7 (it could easily happen due to buggy alignment code, for example: sub.l #x,a7 / move.l a7,d0 / and.w #-16,d0 / move.l d0,a1 .. Depending on the initial alignment of the stack, no trashing or partial trashing would occur) ? If this is the case then task scheduling will trash the src stack area when rescheduling occurs.

Quite unusual that I have never experienced any problems outside my BPPC :-/

From memory, the alignment code is something like this. Eg, for a single 16-byte aligned area

link a2,#-32 ; allocate 32 byte slot within which we find the first 16-byte aligned address
move.l a7, d0
add.l #15, d0 ; round up to the next 16-byte boundary
and.l #$FFFFFFF0, d0
move.l d0, a1

;a1 now points to start of aligned area on stack

;rest of function

unlk a2

Basically I always allocate the size I require + 16, so for 2 cache lines, it would be link a2, #-48 etc.
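Following the same "size + 16" rule, the four-line source used by the first fill loop would need a frame of 64 + 16 = 80 bytes. A sketch under that assumption follows. The fill step and the use of d1 as scratch are illustrative, not copied from the real sources.

```
        link    a2,#-80         ; 64 bytes for four lines + 16 bytes of slack
        move.l  sp,d1
        add.l   #15,d1          ; round up ...
        and.w   #$FFF0,d1       ; ... to the next 16-byte boundary
        move.l  d1,a1           ; a1 = aligned start: sp <= a1 and a1+64 <= a2

        moveq   #15,d1          ; 16 longwords = 64 bytes
.fill   move.l  d0,(a1)+        ; d0 = 32-bit fill value
        dbra    d1,.fill        ; a1 ends at the END of the area, as the loop expects

        ; ... load d1 with the loop count and run the move16 loop here ...

        unlk    a2              ; releases the frame and restores a2
```

Because the aligned address is rounded up from sp rather than down, the whole 64-byte area sits at or above the stack pointer.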

Karlos · « **Reply #7 on:** May 22, 2005, 04:22:14 PM »

Quote

Piru wrote:

The proper alignment would do something like this (this is just one way of implementing it):
move.l sp,d0
sub.l  #16,d0  ; at least 16 bytes storage
and.w  #-16,d0 ; ...aligned by 16
move.l sp,a2   ; save old sp
move.l d0,sp
move.l d0,a1

....

move.l a2,sp   ; restore original sp

Cheers, I will try that later.

-edit-

Hmmm,

Aside from the increase in compactness, I don't yet see what the code above does that mine does not. You choose the lowest address that is 16 byte aligned and gives at least 16 bytes storage, but unless I misunderstood something about how link/unlk works, so does mine :-?

Am I missing something?

Piru · « **Reply #8 on:** May 22, 2005, 04:27:55 PM »

@Karlos

That looks ok to me.

In this case the only explanation can be some other process / task trashing the stack memory... You could perhaps try running the test after minimal boot (no startup-sequence) and see if it still happens.

If it were a hardware problem, Forbid wouldn't cure it.

Karlos · « **Reply #9 on:** May 22, 2005, 04:35:48 PM »

Quote

Piru wrote:
@Karlos

That looks ok to me.

In this case the only explanation can be some other process / task trashing the stack memory... You could perhaps try running the test after minimal boot (no startup-sequence) and see if it still happens.

If it were a hardware problem, Forbid wouldn't cure it.

It would also explain why I don't observe it elsewhere.

It has to be something even in my most basic 3.5 installation then.

-edit-

The thing is, it would need to be something very specific. I can't imagine I could run my machine for days on end without needing to reboot if some rogue task was trashing stacks here and there :-?

Karlos · « **Reply #10 on:** May 22, 2005, 05:07:19 PM »

@self

Time for an absolutely minimal OS3.1 + CGX installation?

boing · « **Reply #11 on:** May 22, 2005, 05:31:27 PM »

Piru, good catch on the stack changes with Forbid/Permit.

But what is weird, is how other 040 and 060 systems don't behave like his does. That's the thing that bugs me.

itix · « **Reply #12 on:** May 22, 2005, 05:39:10 PM »

I vaguely remember that the move16 instruction was buggy on some 040 chips. Couldn't find anything on Google though.

But it could be a cache issue?

Framiga · « **Reply #13 on:** May 22, 2005, 05:53:27 PM »

In the readme of Piru's NewCMQ060 archive, there's something about a 040 move16 bug.

« **Reply #14 on:** May 22, 2005, 05:58:23 PM »

Quote

there’s a bug in some early versions of the 68040 chip (including some shipped Quadras) that requires you to use a Nop instruction before any set of Move16 instructions. The problem is that if you have a pending write to an address subsequently referenced by a Move16 instruction that executes before the pending write completes you’ll get bogus data. The Nop instruction flushes the instruction pipeline (including the pending write) and eliminates the possibility that the bug will show up. Strictly speaking, I don’t think I need the Nop given the instructions that execute in my code before the first Move16 but I left it in there for instructional purposes and because it might be needed for some other rare set of circumstances on certain batches of 040’s.

from here

Dunno if this is of any use?
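Applied to the fill routines in this thread, the workaround in that quote would amount to a NOP between the last write into the source line and the MOVE16 that reads it. Here is a sketch based on the third loop above. Whether it cures the glitching on the BPPC is untested here.

```
; d0 = 32-bit fill value, d1 = loop counter
; a1 = end of aligned stack area, a0 = destination
.loop
        move.l  d0,-(a1)
        move.l  d0,-(a1)
        move.l  d0,-(a1)
        move.l  d0,-(a1)        ; source line rewritten
        nop                     ; on the 040, NOP waits for pending writes to complete
        subq.l  #1,d1           ; sets the flags for bgt (nop and move16 leave CCs alone)
        move16  (a1)+,(a0)+
        bgt.b   .loop
```

For the faster variants that fill the source area only once, a single NOP before the first MOVE16 would be the equivalent placement.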
