Hi,
I was revisiting some old 68K asm sources last night and bumped into an old bug I never quite managed to fathom to my satisfaction.
I have a set of asm functions that can copy or fill memory using move16.
Normal cache-aligned copying with move16 works perfectly OK. It is the fill functions that are problematic. They do work perfectly well on many 040s and 060s, but alas not on mine...
They work by allocating a 16-byte aligned area on the stack (sufficient to hold the source for up to 4 move16's in the first instance), which is then filled with the value to write to memory.
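To make the setup concrete, here is a minimal sketch of that kind of aligned stack buffer. It is illustrative rather than the original source: the register choices (a2, d2), the 80-byte reservation and the .fill label are all assumptions.

```asm
; Sketch only: a2/d2, the 80-byte reservation and .fill are assumptions
; d0 = 32-bit fill value
        move.l  sp,a2           ; remember the original stack pointer
        lea     -80(sp),sp      ; reserve 64 bytes plus 16 bytes of alignment slack
        move.l  sp,d2
        add.l   #15,d2
        and.w   #$fff0,d2       ; round up to the next 16-byte boundary
        move.l  d2,a1           ; a1 = start of the aligned 64-byte area
        moveq   #15,d2          ; 16 longwords = 64 bytes
.fill
        move.l  d0,(a1)+        ; store the fill value
        dbra    d2,.fill        ; a1 ends up pointing at the end of the area
        ; ... move16 fill loop goes here ...
        move.l  a2,sp           ; release the area
```

The point of moving sp down first is that the whole aligned area then sits at or above the stack pointer, so nothing pushed later can land inside it.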
The basic plan was to fill the memory by writing cache lines.
The first basic loop looked something like this:
; d1 contains unrolled loop counter
; a1 initially points to end of our aligned stack area after filling it
; a0 is destination
.loop
add.l #-64, a1; a1 now points to start of aligned stack area
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This gives 22MB/s on my 68040 compared to 15MB/s using a basic 16x unrolled move.l d0, (a0)+ implementation.
However, it tends to go awry, randomly filling parts of the area with garbage. It is as if the area read is trashed somehow.
I was wondering if perhaps somehow the cache lines get messed up, so I then tried the following non-unrolled mechanism to see what difference it might make.
; d1 contains loop counter
; a1 initially points to end of our aligned stack area after filling it
; a0 is destination
.loop
add.l #-16, a1; a1 now points to start of aligned stack area
tst.l (a1); ensure line stays cached by accessing the first long
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This version was not perceptibly slower (~21MB/s) and produced fewer glitches in the area written to, but it was still pretty lousy.
So later, out of curiosity, I tried this:
; d0 contains 32-bit value to fill
; d1 contains loop counter
; a1 initially points to end of our aligned stack area
; a0 is destination
.loop
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This produced the least glitching of all, but was also considerably slower (just 18MB/s) than the first one (but still faster than a move.l based fill).
Anyway, it only seems to be my BPPC/040 that does this. Other systems I have tried all work completely fine with the first version.
I exhaustively checked that my aligned area on the stack was both aligned and not touched by something in my code. There seemed no logical reason for the effect, but basically the only way to minimise the glitching was to keep refilling the line I wanted to write.
Given I couldn't reproduce the problem on at least 3 other 68040s and 2 060s, I eventually reasoned to myself that perhaps my BPPC's 68040 has a bug, even though a basic move16 block copy always worked OK.
Then last night, I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.
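For anyone wanting to try the same thing, the wrapping amounts to something like the sketch below. FillMem16 is a hypothetical name standing in for any of the fill routines above, and the two offsets are the standard exec.library ones, defined by hand here only to keep the fragment self-contained.

```asm
; Sketch only: FillMem16 is a hypothetical name for the fill routine
_LVOForbid      equ     -132    ; exec.library offsets, normally taken
_LVOPermit      equ     -138    ; from the exec LVO include file

        move.l  a6,-(sp)        ; preserve caller's a6
        move.l  4.w,a6          ; ExecBase
        jsr     _LVOForbid(a6)  ; disable task switching
        ; load d0/d1/a0 for the fill here
        bsr     FillMem16       ; hypothetical; assumed to preserve a6
        jsr     _LVOPermit(a6)  ; re-enable task switching
        move.l  (sp)+,a6        ; restore caller's a6
```

Note that Forbid() only stops task switching, not interrupts, so the fact that it cures the glitching suggests that whatever disturbs the source line happens when another task gets scheduled.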
So now I am back to square one. Any asm experts here have an answer?