Author Topic: One for 680x0 experts - move16 issues... (Read 9558 times)

Karlos · « **on:** May 22, 2005, 02:58:04 PM »

Hi,

I was revisiting some old 68K asm sources last night and bumped into an old bug I never quite managed to fathom to my satisfaction.

I have a set of asm functions that can copy or fill memory using move16.

Normal cache-aligned copying with move16 works perfectly OK. It is the fill functions that are problematic. They do work perfectly well on many 040 and 060 but alas not on mine...

They work by allocating a 16-byte aligned area on the stack (sufficient to hold the source for up to 4 move16's in the first instance), which is then filled with the value to write to memory.

The basic plan was to fill the memory by writing cache lines.

The first basic loop looked something like this:

; d1 contains unrolled loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination

.loop
add.l #-64, a1; a1 now points to start of aligned stack area
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This gives 22MB/s on my 68040 compared to 15MB/s using a basic 16x unrolled move.l d0, (a0)+ implementation.
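For reference, the intended behaviour of that loop can be modelled in portable C. This is only a sketch of what the fill is supposed to produce, not the asm itself: `copy_line` and `fill_lines` are hypothetical names, `memcpy` stands in for each move16, and the model neither requires nor checks 16-byte alignment. Unlike the asm loop, which always runs at least once, this version does nothing when `passes` is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16, LINES_PER_PASS = 4 };

/* Stand-in for one "move16 (a1)+,(a0)+": copies a single 16-byte line. */
void copy_line(uint8_t *dst, const uint8_t *src)
{
    memcpy(dst, src, LINE_BYTES);
}

/* Fill passes * 64 bytes at dst with the 32-bit value, four lines per pass. */
void fill_lines(uint8_t *dst, uint32_t value, size_t passes)
{
    uint8_t staging[LINE_BYTES * LINES_PER_PASS];
    size_t i;

    assert(dst != NULL);

    /* Prepare the staging area once, as the asm does on the stack. */
    for (i = 0; i < sizeof staging; i += sizeof value)
        memcpy(staging + i, &value, sizeof value);

    while (passes-- > 0) {
        const uint8_t *src = staging;   /* rewind to the start of the staging area */
        int line;
        for (line = 0; line < LINES_PER_PASS; ++line) {
            copy_line(dst, src);
            dst += LINE_BYTES;
            src += LINE_BYTES;
        }
    }
}
```

Comparing a buffer filled this way against the output of the asm routine is one way to pin down exactly which lines get corrupted.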

However, it tends to go awry, randomly filling parts of the area with garbage. It is as if the area read is trashed somehow.

I was wondering if perhaps somehow the cache lines get messed up, so I then tried the following non-unrolled mechanism to see what difference it might make.

; d1 contains loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination

.loop
add.l #-16, a1; a1 now points to start of aligned stack area
tst.l (a1); ensure line stays cached by accessing the first long
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This version was not perceptibly slower (~21MB/s) and produced fewer glitches in the area written to, but it was still pretty lousy.

So later, out of curiosity, I tried this:

; d0 contains 32-bit value to fill
; d1 contains loop counter
; a1 initially points to the end of our aligned stack area
; a0 is destination

.loop
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop

This produced the least glitching of all, but was also considerably slower (just 18MB/s) than the first one (but still faster than a move.l based fill).

Anyway, it only seems to be my BPPC/040 that does this. Other systems I have tried all work completely fine with the first version.

I exhaustively checked that my aligned area on the stack was both aligned and not touched by anything in my code. There seemed to be no logical reason for the effect, but the only way to minimise the glitching was to keep refilling the line I wanted to write.

Given I couldn't reproduce the problem on at least 3 other 68040's and 2 060's, I eventually reasoned to myself that perhaps my BPPC's 68040 has a bug even though a basic move16 block copy always worked OK.

Then last night, I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.

So now I am back to square one. Any asm experts here have an answer?

Karlos · « **Reply #1 on:** May 22, 2005, 04:02:58 PM »

Quote

Piru wrote:
Quote
does the move16 change any condition codes in the way a normal move.l(a1)+,(a0)+ would do ?

No it does not. CCs are unaffected.

Quite. It was a token attempt to optimise the loop ;-)

Karlos · « **Reply #2 on:** May 22, 2005, 04:10:43 PM »

Quote

Doobrey wrote:

Another thing that pops into my head (yeah I know, a lot of free space at the moment !), are you running a replacement scheduler on the system that shows these problems?

Tested with and without Executive, same problem. It also affects some copy code I have which handles non-cache-aligned copies (after copying up to the first cache-aligned destination) by reading the misaligned data into an aligned buffer on the stack before moving it to the destination using move16. Despite this 'shuffling' of data, the speed gains over a straightforward copy are quite conspicuous in some cases.

Anyway, whilst the above works, I observe the most 'glitching' for a relative alignment of 4 bytes (that is the ultimate source and destination are out by 4 bytes). When the relative alignment is 2 or 6 bytes, the glitching is far less frequent.

Again, as with filling, the glitching is stopped completely when the call is wrapped in a Forbid()/Permit().
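The 'shuffle' idea described above can be sketched in portable C as follows. It is only a model under stated assumptions: `shuffle_copy` is a hypothetical name, the first `memcpy` stands in for the ordinary moves that read the misaligned source, and the second stands in for the move16 from the aligned stack buffer to the (already line-aligned) destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16 };

/* Copy `lines` 16-byte lines from a possibly misaligned src to dst,
   staging each line through a small local buffer on the way. */
void shuffle_copy(uint8_t *dst, const uint8_t *src, size_t lines)
{
    uint8_t staging[LINE_BYTES];  /* would be the 16-byte aligned stack area */

    assert(dst != NULL && src != NULL);

    while (lines-- > 0) {
        memcpy(staging, src, LINE_BYTES);  /* gather the misaligned data */
        memcpy(dst, staging, LINE_BYTES);  /* one line out, as move16 would do */
        src += LINE_BYTES;
        dst += LINE_BYTES;
    }
}
```

Since the staging buffer is rewritten before every line goes out, anything that disturbs it between the two steps would show up as a short span of wrong data in the destination.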

Karlos · « **Reply #3 on:** May 22, 2005, 04:18:00 PM »

Quote

Piru wrote:
This points to you forgetting to change A7 (SP) properly. Perhaps your src_stack_ptr < a7 (it could easily happen due to buggy alignment code, for example: sub.l #x,a7 / move.l a7,d0 / and.l #-16,d0 / move.l d0,a1 .. Depending on the initial alignment of the stack, no trashing or partial trashing would occur) ? If this is the case then task scheduling will trash the src stack area when rescheduling occurs.

Quite unusual that I have never experienced any problems outside my BPPC :-/

From memory, the alignment code is something like this. Eg, for a single 16-byte aligned area

link a2,#-32 ; allocate a 32-byte slot within which we find the first 16-byte aligned address
move.l a7,d0
add.l #15,d0 ; round up to the next 16-byte boundary...
and.l #$FFFFFFF0,d0 ; ...by clearing the low four bits
move.l d0,a1

;a1 now points to start of aligned area on stack

;rest of function

unlk a2

Basically I always allocate the size I require + 16, so for 2 cache lines, it would be link a2, #-48 etc.
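The "size + 16" rule can be checked off the Amiga with a small C model of the add-15-and-mask step. This is a sketch of the address arithmetic only, not of link/unlk; `align_up16` and `area_fits` are names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an address up to the next 16-byte boundary: add 15, clear the low 4 bits. */
uintptr_t align_up16(uintptr_t addr)
{
    assert(addr <= UINTPTR_MAX - 15u);  /* the model assumes no wrap-around */
    return (addr + 15u) & ~(uintptr_t)15u;
}

/* Returns 1 if an aligned area of area_size bytes fits entirely inside the
   slot [slot_base, slot_base + slot_size), otherwise 0. */
int area_fits(uintptr_t slot_base, size_t slot_size, size_t area_size)
{
    uintptr_t start = align_up16(slot_base);
    return start >= slot_base && start + area_size <= slot_base + slot_size;
}
```

For every possible low nibble of the slot base, a 32-byte slot holds one aligned line and a 48-byte slot holds two, whereas a slot of exactly 16 bytes only works if it happens to start on a line boundary.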

Karlos · « **Reply #4 on:** May 22, 2005, 04:22:14 PM »

Quote

Piru wrote:

The proper alignment would do something like this (this is just one way of implementing it):
Code:
move.l sp,d0
sub.l #16,d0 ; at least 16 bytes storage
and.w #-16,d0 ; ...aligned by 16
move.l sp,a2 ; save old sp
move.l d0,sp
move.l d0,a1
....
move.l a2,sp ; restore original sp

Cheers, I will try that later.

-edit-

Hmmm,

Aside from the increase in compactness, I don't yet see what the code above does that mine does not. You choose the lowest address that is 16 byte aligned and gives at least 16 bytes storage, but unless I misunderstood something about how link/unlk works, so does mine :-?

Am I missing something?
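One way to compare the two schemes is to model both in C and check the property that matters for task switching: the staging area must never sit below the stack pointer. This is a sketch under assumptions. `frame_t`, `frame_round_down`, `frame_round_up` and `area_is_protected` are hypothetical names, addresses are plain 32-bit integers, and the 4 bytes in the round-up model account for link pushing the old frame register.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sp;    /* stack pointer after the adjustment */
    uint32_t area;  /* start of the aligned staging area (a1) */
} frame_t;

/* Round-down scheme: step below the old sp by size, mask, move sp there. */
frame_t frame_round_down(uint32_t old_sp, uint32_t size)
{
    frame_t f;
    assert(old_sp > size + 36u);  /* the model assumes no wrap-around */
    f.area = (old_sp - size) & ~UINT32_C(15);
    f.sp = f.area;
    return f;
}

/* link-style scheme: reserve size + 16 bytes, then round up inside the slot. */
frame_t frame_round_up(uint32_t old_sp, uint32_t size)
{
    frame_t f;
    assert(old_sp > size + 36u);  /* the model assumes no wrap-around */
    f.sp = old_sp - 4u - (size + 16u);
    f.area = (f.sp + 15u) & ~UINT32_C(15);
    return f;
}

/* 1 if the area is aligned, not below sp, and ends at or below the old sp. */
int area_is_protected(frame_t f, uint32_t old_sp, uint32_t size)
{
    return (f.area % 16u == 0u) && (f.area >= f.sp) && (f.area + size <= old_sp);
}
```

In this model both schemes keep the area at or above the stack pointer for every even starting value of sp. The only difference it shows is stack usage: the round-down form consumes between 16 and 30 bytes for one line, while the link form always consumes 36.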

Karlos · « **Reply #5 on:** May 22, 2005, 04:35:48 PM »

Quote

Piru wrote:
@Karlos

That looks ok to me.

In this case the only explanation can be some other process / task trashing the stack memory... You could perhaps try running the test after minimal boot (no startup-sequence) and see if it still happens.

If it was hw problem then Forbid wouldn't cure it.

It would also explain why I don't observe it elsewhere.

It has to be something even in my most basic 3.5 installation then.

-edit-

The thing is, it would need to be something very specific. I can't imagine I could run my machine for days on end without needing to reboot if some rogue task was trashing stacks here and there :-?

Karlos · « **Reply #6 on:** May 22, 2005, 05:07:19 PM »

@self

Time for an absolutely minimal OS3.1 + CGX installation?

Karlos · « **Reply #7 on:** May 22, 2005, 06:10:42 PM »

@all

If you have an 040 / 060, you can test this yourself (note you still need an RTG system for this particular test)

There is a version of pixeltest here that tests some of these operations on VRAM and asks if you see any glitches in the pixels.

Note that there is one test, just before the 'shufflecopy' test that times a byteswap accelerated with move16. The output pixels will naturally look screwy after byteswapping - appearing as a strange patchwork pattern. This is not a glitch.

The glitches are the occasional flickering short spans of duff pixels you might see as some of the copy and set tests are performed.

Does anybody else see them, or is it just my one system? (It works fine on another 040 here.)

Karlos · « **Reply #8 on:** May 22, 2005, 07:20:03 PM »

@framiga

Only the OS copy test should be affected by your CMQ060 patch.

So you didn't see any glitched pixels during the various write/copy tests, I take it?

@patrik

That looks like another system that prefers not to use move16 to VRAM, eh?

-edit-

Quote

Read RAM : 58624.13 K/sec
Write RAM : 45418.33 K/sec
Write RAM(C) : 33136.09 K/sec
Write RAM(16) : 36941.41 K/sec

That's interesting. The version you are running is using the last version of the asm I posted. I guess resetting the cache line with the four move.l is killing it. Or something :lol:

Karlos · « **Reply #9 on:** May 22, 2005, 08:15:48 PM »

@Piru

Hmm.

Maybe it is a hardware bug after all. I tried again with Forbid() just now and still observed glitching in some copy tests.

I also allocated enough space so that there would be at least one cache line before and after my test area. Same weirdness.

Karlos · « **Reply #10 on:** May 23, 2005, 12:24:10 AM »

@mdma

I have to use OS X and OSX Server at work on a daily basis. I can quite assure anybody reading, both are way overhyped. That is not to say they are bad but once you get past the eyecandy, they really are nothing exceptional.

Karlos · « **Reply #11 on:** May 23, 2005, 11:19:57 AM »

@all with 040/060 + RTG

Could you please test this version of pixeltest?

This version reinstates my original move16 code (the 4x unrolled one) and the timing measurement should also be a fair bit more accurate. It also supports a new CLI argument, 'lock', which will hold off task switching during iterations.

I am just curious to see if the original move16 code I wrote will outperform the move.l based write as the previous one did not (not too surprising considering it kept re-updating the cache line in a vain attempt to beat the glitch).

Karlos · « **Reply #12 on:** May 23, 2005, 11:36:27 AM »

So can anybody shed any light on why the BPPC/BVision seems to support burst access to VRAM but the CSPPC/CVision does not?

I am inferring this only from the fact that my system (and one or two others) is faster when writing to VRAM with move16, but all the CSPPCs seem slower.

Karlos · « **Reply #13 on:** May 24, 2005, 01:55:24 PM »

Quote

Framiga wrote:

EDIT- and remember that there are 2 different models of CVPPC, that uses different VRAM chips.

Hmm. There is that.

BTW, do you find your overclocked P2 that much faster?

Karlos · « **Reply #14 on:** May 24, 2005, 02:25:41 PM »

Depending on the pixelclock, your lower frequencies might mean that you have more memory bandwidth left for the graphics core :-)

-edit-

Unless like me, you insist on using nothing less than 1280x960x16-bit for anything :-D
