Amiga.org
Operating System Specific Discussions => Amiga OS => Amiga OS -- Development => Topic started by: Karlos on May 22, 2005, 02:58:04 PM
-
Hi,
I was revisiting some old 68K asm sources last night and bumped into an old bug I never quite managed to fathom to my satisfaction.
I have a set of asm functions that can copy or fill memory using move16.
Normal cache-aligned copying with move16 works perfectly OK. It is the fill functions that are problematic. They do work perfectly well on many 040 and 060 but alas not on mine...
They work by allocating a 16-byte aligned area on the stack (sufficient to hold the source for up to 4 move16's in the first instance), which is then filled with the value to write to memory.
The basic plan was to fill the memory by writing cache lines.
The first basic loop looked something like this:
; d1 contains unrolled loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination
.loop
add.l #-64, a1; a1 now points to start of aligned stack area
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
move16 (a1)+,(a0)+
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This gives 22MB/s on my 68040 compared to 15MB/s using a basic 16x unrolled move.l d0, (a0)+ implementation.
However, it tends to go awry, randomly filling parts of the area with garbage. It is as if the area read is trashed somehow.
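For readers who don't speak 68K asm, here is a rough portable C model of what the first loop is meant to do. This is only a sketch: memcpy stands in for move16, the function name fill_lines_model is made up for illustration, and (unlike the asm) it also copes with a line count that is not a multiple of 4. It cannot reproduce the glitch, it only shows the intended data flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 16u  /* one 68040 cache line, the unit a move16 transfers */

/* Hypothetical model: fill `lines` 16-byte lines at dst with `fill`.
   dst must be 16-byte aligned, as move16 requires. */
static void fill_lines_model(void *dst, uint32_t fill, size_t lines)
{
    /* Staging area: 4 lines (64 bytes), like the aligned stack area above. */
    _Alignas(16) uint32_t staging[4 * LINE_BYTES / sizeof(uint32_t)];
    unsigned char *out = dst;
    size_t i;

    assert(((uintptr_t)dst % LINE_BYTES) == 0);

    /* Fill the staging area once with the value to write. */
    for (i = 0; i < sizeof staging / sizeof staging[0]; i++)
        staging[i] = fill;

    /* Each pass re-reads the staging area from its start and writes
       up to 4 lines, mirroring the 4x unrolled move16 loop. */
    while (lines > 0) {
        size_t n = lines < 4 ? lines : 4;
        memcpy(out, staging, n * LINE_BYTES);  /* stand-in for move16 */
        out += n * LINE_BYTES;
        lines -= n;
    }
}
```

The point to notice is that the staging area is only ever read inside the loop, so anything else that writes to it while the loop runs shows up as garbage in the destination.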
I was wondering if perhaps somehow the cache lines get messed up, so I then tried the following non-unrolled mechanism to see what difference it might make.
; d1 contains loop counter
; a1 initially points to the end of our aligned stack area after filling it
; a0 is destination
.loop
add.l #-16, a1; a1 now points to start of aligned stack area
tst.l (a1); ensure line stays cached by accessing the first long
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This version was not perceptibly slower (~21MB/s) and produced fewer glitches in the area written to, but it was still pretty lousy.
So later, out of curiosity, I tried this:
; d0 contains 32-bit value to fill
; d1 contains loop counter
; a1 initially points to the end of our aligned stack area
; a0 is destination
.loop
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
move.l d0, -(a1)
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
This produced the least glitching of all, but was also considerably slower (just 18MB/s) than the first one (but still faster than a move.l based fill).
Anyway, it only seems to be my BPPC/040 that does this. Other systems I have tried all work completely fine with the first version.
I exhaustively checked that my aligned area on the stack was both aligned and not touched by something in my code. There seemed no logical reason for the effect, but basically the only way to minimise the glitching was to keep refilling the line I wanted to write.
Given I couldn't reproduce the problem on at least 3 other 68040's and 2 060's, I eventually reasoned to myself that perhaps my BPPC's 68040 has a bug even though a basic move16 block copy always worked OK.
Then last night, I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.
So now I am back to square one. Any asm experts here have an answer?
-
subq.l #1, d1
move16 (a1)+,(a0)+
bgt.b .loop
I don't have the 68k docs to hand at the moment, so I can't check this out.
But does the move16 change any condition codes in the way a normal move.l (a1)+,(a0)+ would do?
Or change the subq - bgt section to
move16 (a1)+,(a0)+
dbne d1, .loop .. dunno if this is slower though.
Another thing that pops into my head (yeah I know, a lot of free space at the moment !), are you running a replacement scheduler on the system that shows these problems?
Anyway, I'm sure Piru will be along any moment to tell me I'm wrong and give you the right answer :-P
-
does the move16 change any condition codes in the way a normal move.l(a1)+,(a0)+ would do ?
No it does not. CCs are unaffected.
-
Piru wrote:
does the move16 change any condition codes in the way a normal move.l(a1)+,(a0)+ would do ?
No it does not. CCs are unaffected.
Quite. It was a token attempt to optimise the loop ;-)
-
I discovered that calling Forbid() / Permit() around the function call stops any glitching with any of the versions above.
This points to you forgetting to change A7 (SP) properly. Perhaps your src_stack_ptr < a7 (it could easily happen due to buggy alignment code, for example: sub.l #x,a7 / move.l a7,d0 / and.w #-16,d0 / move.l d0,a1 .. Depending on the initial alignment of the stack, no trashing or partial trashing would occur) ? If this is the case then task scheduling will trash the src stack area when rescheduling occurs.
Also, Forbid would obviously hide the problem as no task scheduling happens -> no registers are pushed to stack -> no trashing.
The proper alignment would do something like this (this is just one way of implementing it):
move.l sp,d0
sub.l #16,d0 ; at least 16 bytes storage
and.w #-16,d0 ; ...aligned by 16
move.l sp,a2 ; save old sp
move.l d0,sp
move.l d0,a1
....
move.l a2,sp ; restore original sp
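The failure mode described above comes down to plain address arithmetic, so it can be checked off the Amiga. A minimal C sketch, assuming a descending stack; the names frame_suspect, frame_safe and bytes_below_sp are invented for illustration and the addresses are arbitrary:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sp;   /* stack pointer after the allocation */
    uint32_t buf;  /* 16-byte aligned buffer address handed to move16 */
} frame_t;

/* Suspect pattern: lower SP by `size`, then round a copy of SP down.
   The buffer pointer can end up below SP. */
static frame_t frame_suspect(uint32_t sp, uint32_t size)
{
    frame_t f;
    assert(size >= 16u);
    f.sp  = sp - size;
    f.buf = f.sp & ~(uint32_t)15;
    return f;
}

/* Safe pattern: round down first, then move SP to the rounded address,
   so nothing in the buffer lies below SP. */
static frame_t frame_safe(uint32_t sp, uint32_t size)
{
    frame_t f;
    assert(size >= 16u);
    f.buf = (sp - size) & ~(uint32_t)15;
    f.sp  = f.buf;
    return f;
}

/* Number of buffer bytes that sit below SP and are therefore fair game
   for anything pushed on this stack (e.g. a task switch). */
static uint32_t bytes_below_sp(frame_t f)
{
    return f.buf < f.sp ? f.sp - f.buf : 0u;
}
```

For example, with an initial SP of 0x1004 and size 16, the suspect pattern leaves 4 bytes of the buffer below SP, while an initial SP of 0x1000 leaves none. That matches the "no trashing or partial trashing depending on initial alignment" behaviour.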
-
Doobrey wrote:
Another thing that pops into my head (yeah I know, a lot of free space at the moment !), are you running a replacement scheduler on the system that shows these problems?
Tested with and without executive, same problem. It also affects some copy code I have which handles non-cache aligned copies (after copying up to the first cache aligned destination) by reading the misaligned data into an aligned buffer on the stack before moving it to the destination using move16. Despite this 'shuffling' of data, the speed gains over a straightforward copy are quite conspicuous in some cases.
Anyway, whilst the above works, I observe the most 'glitching' for a relative alignment of 4 bytes (that is the ultimate source and destination are out by 4 bytes). When the relative alignment is 2 or 6 bytes, the glitching is far less frequent.
Again, as with filling, the glitching is stopped completely when the call is wrapped in a Forbid()/Permit().
-
Piru wrote:
This points to you forgetting to change A7 (SP) properly. Perhaps your src_stack_ptr < a7 (it could easily happen due to buggy alignment code, for example: sub.l #x,a7 / move.l a7,d0 / and.l #-16,d0 / move.l d0,a1 .. Depending on the initial alignment of the stack, no trashing or partial trashing would occur) ? If this is the case then task scheduling will trash the src stack area when rescheduling occurs.
Quite unusual that I have never experienced any problems outside my BPPC :-/
From memory, the alignment code is something like this. Eg, for a single 16-byte aligned area
link a2,#-32 ; allocate 32 byte slot within which we find the first 16-byte aligned address
move.l a7, d0
add.l #15, d0 ; addq only accepts immediates 1-8, so this has to be add.l
and.l #$FFFFFFF0, d0
move.l d0, a1
;a1 now points to start of aligned area on stack
;rest of function
unlk a2
Basically I always allocate the size I require + 16, so for 2 cache lines, it would be link a2, #-48 etc.
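The "size + 16" rule can be sanity-checked with the same arithmetic in C. This is a sketch only: sp stands for a7 immediately after the link, the helper names align_up16 and buffer_fits are made up, and it assumes the stack pointer is at least word-aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address up to the next 16-byte boundary (add 15, mask). */
static uint32_t align_up16(uint32_t addr)
{
    return (addr + 15u) & ~(uint32_t)15;
}

/* Returns 1 if a buf_bytes buffer, placed at the first 16-byte aligned
   address at or above sp, lies entirely inside the frame_bytes frame
   that starts at sp. */
static int buffer_fits(uint32_t sp, uint32_t frame_bytes, uint32_t buf_bytes)
{
    uint32_t buf = align_up16(sp);
    assert(buf_bytes > 0u);
    return buf >= sp && buf + buf_bytes <= sp + frame_bytes;
}
```

With a 32-byte frame and a 16-byte buffer this holds for every even sp, and likewise 48 bytes for two lines. A frame of only 16 bytes fails as soon as sp is not already aligned. Because the buffer is rounded up from a7 rather than down, it never dips below the stack pointer.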
-
Piru wrote:
The proper alignment would do something like this (this is just one way of implementing it):
move.l sp,d0
sub.l #16,d0 ; at least 16 bytes storage
and.w #-16,d0 ; ...aligned by 16
move.l sp,a2 ; save old sp
move.l d0,sp
move.l d0,a1
....
move.l a2,sp ; restore original sp
Cheers, I will try that later.
-edit-
Hmmm,
Aside from the increase in compactness, I don't yet see what the code above does that mine does not. You choose the lowest address that is 16 byte aligned and gives at least 16 bytes storage, but unless I misunderstood something about how link/unlk works, so does mine :-?
Am I missing something?
-
@Karlos
That looks ok to me.
In this case the only explanation can be some other process / task trashing the stack memory... You could perhaps try running the test after minimal boot (no startup-sequence) and see if it still happens.
If it was hw problem then Forbid wouldn't cure it.
-
Piru wrote:
@Karlos
That looks ok to me.
In this case the only explanation can be some other process / task trashing the stack memory... You could perhaps try running the test after minimal boot (no startup-sequence) and see if it still happens.
If it was hw problem then Forbid wouldn't cure it.
It would also explain why I don't observe it elsewhere.
It has to be something even in my most basic 3.5 installation then.
-edit-
The thing is, it would need to be something very specific. I can't imagine I could run my machine for days on end without needing to reboot if some rogue task was trashing stacks here and there :-?
-
@self
Time for an absolutely minimal OS3.1 + CGX installation?
-
Piru, good catch on the stack changes per the Forbid/Permit.
But what is weird, is how other 040 and 060 systems don't behave like his does. That's the thing that bugs me.
-
I vaguely remember that the move16 instruction was buggy on some 040 chips. Couldn't find anything on Google though.
But it could be a cache issue?
-
In Piru's NewCMQ060 archive readme, there's something about an 040 move16 bug.
-
there’s a bug in some early versions of the 68040 chip (including some shipped Quadras) that requires you to use a Nop instruction before any set of Move16 instructions. The problem is that if you have a pending write to an address subsequently referenced by a Move16 instruction that executes before the pending write completes you’ll get bogus data. The Nop instruction flushes the instruction pipeline (including the pending write) and eliminates the possibility that the bug will show up. Strictly speaking, I don’t think I need the Nop given the instructions that execute in my code before the first Move16 but I left it in there for instructional purposes and because it might be needed for some other rare set of circumstances on certain batches of 040’s.
from here (http://www.mactech.com/articles/mactech/Vol.09/09.05/68040BlockMove/)
Dunno if this is of any use?
-
@all
If you have an 040 / 060, you can test this yourself (note you still need an RTG system for this particular test)
There is a version of pixeltest here (http://www.megaburken.net/~karlos/demos/pixeltest_2004-11-29.lzx) that tests some of these operations on VRAM and asks if you see any glitches in the pixels.
Note that there is one test, just before the 'shufflecopy' test that times a byteswap accelerated with move16. The output pixels will naturally look screwy after byteswapping - appearing as a strange patchwork pattern. This is not a glitch.
The glitches are the occasional flickering short spans of duff pixels you might see as some of the copy and set tests are performed.
Does anybody else see them, or is it just my one system (it works fine on another 040 here)?
-
none here . . . i have CMQ060Move16 installed.
Do you want the output?
CVPPC 1024x768 16bit CGX 4.3 68060
-
@Framiga:
Just post it here so we all can see :=).
/Patrik
-
10/0.AmigaOS:> "Ram Disk:pixeltest"
Surface width: 640, height: 480, modulus: 0
Surface hwWidth: 640, hwHeight: 480
Test data pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Window pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Results (negative value indicates glitches were observed
Read RAM : 58624.13 K/sec
Write RAM : 45418.33 K/sec
Write RAM(C) : 33136.09 K/sec
Write RAM(16) : 36941.41 K/sec
RAM->RAM : 27755.91 K/sec
RAM->RAM(C) : 15264.19 K/sec
RAM->RAM(OS) : 33201.58 K/sec
RAM->RAM(16) : 32738.10 K/sec
-------------------------------
Read VRAM : 6094.18 K/sec
Write VRAM : 23460.41 K/sec
Write VRAM(C) : 5813.95 K/sec
Write VRAM(16): 16617.21 K/sec
RAM->VRAM : 16568.05 K/sec
RAM->VRAM(C) : 5273.44 K/sec
RAM->VRAM(OS) : 15774.65 K/sec
RAM->VRAM(16) : 16000.00 K/sec
RAM->VRAM(SC0): 12903.23 K/sec
RAM->VRAM(SC2): 11952.15 K/sec
RAM->VRAM(SC4): 12928.42 K/sec
RAM->VRAM(SC6): 11834.20 K/sec
RAM->VRAM(SC8): 12877.88 K/sec
RAM->VRAM(swp): 11583.01 K/sec
-------------------------------
VRAM->RAM : 5157.59 K/sec
VRAM->RAM(16) : 5853.66 K/sec
-------------------------------
Conversion : 16617.21 K/sec [output bandwidth]
Conversion : 8508011.87 pix/sec
Conversion attained 100.30% copy speed
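The two derived "Conversion" figures can be reproduced from the raw numbers. A small C sketch of the arithmetic, resting on two inferences from the output rather than anything documented: that "K" means 1024 bytes, and that "copy speed" is the RAM->VRAM figure. The function names are invented for illustration.

```c
#include <assert.h>

/* Pixels per second from a bandwidth in K/sec, taking K as 1024 bytes
   (an assumption inferred from the figures, not from documentation). */
static double pixels_per_sec(double k_per_sec, unsigned bytes_per_pixel)
{
    assert(bytes_per_pixel > 0u);
    return k_per_sec * 1024.0 / (double)bytes_per_pixel;
}

/* One rate expressed as a percentage of another. */
static double percent_of(double rate, double reference)
{
    assert(reference > 0.0);
    return 100.0 * rate / reference;
}
```

With 2 bytes per pixel, pixels_per_sec(16617.21, 2) lands within a pixel of the reported 8508011.87 pix/sec, and percent_of(16617.21, 16568.05) rounds to the reported 100.30%.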
-
@framiga
Only the OS copy test should be affected by your CMQ060 patch.
So you didn't see any glitched pixels during the various write/copy tests I take it.
@patrik
That looks like another system that prefers not to use move16 to VRAM, eh?
-edit-
Read RAM : 58624.13 K/sec
Write RAM : 45418.33 K/sec
Write RAM(C) : 33136.09 K/sec
Write RAM(16) : 36941.41 K/sec
That's interesting. The version you are running is using the last version of the asm I posted. I guess resetting the cache line with the four move.l is killing it. Or something :lol:
-
@Piru
Hmm.
Maybe it is a hardware bug after all. I tried again with Forbid() just now and still observed glitching in some copy tests.
I also allocated enough space so that there would be at least one cache line before and after my test area. Same weirdness.
-
@mdma
Found this from the article:
there’s a bug in some early versions of the 68040 chip (including some shipped Quadras) that requires you to use a Nop instruction before any set of Move16 instructions. The problem is that if you have a pending write to an address subsequently referenced by a Move16 instruction that executes before the pending write completes you’ll get bogus data.
-
Heh. After reading that article I feel MacOS classic is a piece of {bleep} :-D
-
itix wrote:
Heh. After reading that article I feel MacOS classic is piece of {bleep} :-D
Hehe, some might say MacOS X is too! ;-)
-
@mdma
I have to use OS X and OSX Server at work on a daily basis. I can quite assure anybody reading, both are way overhyped. That is not to say they are bad but once you get past the eyecandy, they really are nothing exceptional.
-
@all with 040/060 + RTG
Could you please test this (http://www.megaburken.net/~karlos/demos/pixeltest_2005-05-23.lzx) version of pixeltest?
This version reinstates my original move16 code (the 4x unrolled one) and the timing measurement should also be a fair bit more accurate. It also supports a new CLI argument, 'lock', which will hold off task switching during iterations.
I am just curious to see if the original move16 code I wrote will outperform the move.l based write as the previous one did not (not too surprising considering it kept re-updating the cache line in a vain attempt to beat the glitch).
-
8/0.AmigaOS:> "Ram Disk:pixeltest/pixeltest" lock
Surface width: 640, height: 480, modulus: 0
Surface hwWidth: 640, hwHeight: 480
Test data pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Window pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Locking enabled. No task switches or interrupts during iterations.
68040/060 detected. MOVE16 based tests will be performed.
Results (negative value indicates glitches were observed
Read RAM : 57380.75 K/sec
Write RAM : 45855.11 K/sec
Write RAM(C) : 33108.27 K/sec
Write RAM(16) : 52591.97 K/sec
RAM->RAM : 27757.65 K/sec
RAM->RAM(C) : 15544.19 K/sec
RAM->RAM(OS) : 32728.97 K/sec
RAM->RAM(16) : 32656.29 K/sec
-------------------------------
Read VRAM : 6408.90 K/sec
Write VRAM : 23131.52 K/sec
Write VRAM(C) : 6307.71 K/sec
Write VRAM(16): 19469.71 K/sec
RAM->VRAM : 16668.71 K/sec
RAM->VRAM(C) : 6022.64 K/sec
RAM->VRAM(OS) : 15999.47 K/sec
RAM->VRAM(16) : 15976.67 K/sec
RAM->VRAM(SC0): 12945.05 K/sec
RAM->VRAM(SC2): 11964.15 K/sec
RAM->VRAM(SC4): 12944.53 K/sec
RAM->VRAM(SC6): 11968.74 K/sec
RAM->VRAM(SC8): 12936.73 K/sec
-------------------------------
VRAM->RAM : 5970.79 K/sec
VRAM->RAM(16) : 6512.34 K/sec
-------------------------------
Conversion : 16668.29 K/sec [output bandwidth]
Conversion : 8534163.20 pix/sec
Conversion attained 100.00% copy speed
-
So can anybody shed any light on why the BPPC/BVision seems to support burst access to VRAM but the CSPPC/CVision does not?
I am inferring this only from the fact that my system (and one or two others) are faster when writing to VRAM with move16, but all the CSPPCs seem slower.
-
CyberStormPPC(060@50MHz) + CyberVisionPPC:
Surface width: 640, height: 480, modulus: 0
Surface hwWidth: 640, hwHeight: 480
Test data pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Window pixel format
Bytes : 2, endian native
Bits : A[ 0] R[ 5] G[ 6] B[ 5]
Offsets : A[ 0] R[ 11] G[ 5] B[ 0]
Maxima : A[ 0] R[ 31] G[ 63] B[ 31]
Locking enabled. No task switches or interrupts during iterations.
68040/060 detected. MOVE16 based tests will be performed.
Results (negative value indicates glitches were observed
Read RAM : 47892.42 K/sec
Write RAM : 38258.97 K/sec
Write RAM(C) : 27613.75 K/sec
Write RAM(16) : 43855.58 K/sec
RAM->RAM : 23149.21 K/sec
RAM->RAM(C) : 12955.79 K/sec
RAM->RAM(OS) : 21271.21 K/sec
RAM->RAM(16) : 27232.07 K/sec
-------------------------------
Read VRAM : 6197.98 K/sec
Write VRAM : 19459.75 K/sec
Write VRAM(C) : 5740.48 K/sec
Write VRAM(16): 16227.62 K/sec
RAM->VRAM : 13894.56 K/sec
RAM->VRAM(C) : 6426.90 K/sec
RAM->VRAM(OS) : 13680.40 K/sec
RAM->VRAM(16) : 13320.74 K/sec
RAM->VRAM(SC0): 10787.77 K/sec
RAM->VRAM(SC2): 9974.31 K/sec
RAM->VRAM(SC4): 10787.71 K/sec
RAM->VRAM(SC6): 9976.13 K/sec
RAM->VRAM(SC8): 10782.86 K/sec
-------------------------------
VRAM->RAM : 6448.74 K/sec
VRAM->RAM(16) : 6507.22 K/sec
-------------------------------
Conversion : 13894.99 K/sec [output bandwidth]
Conversion : 7114236.87 pix/sec
Conversion attained 100.00% copy speed
/Patrik
-
forgot to mention . . . mine is a
68060 @ 60 MHz (ram settings>60ns)
CVPPC @ 92 MHz (Permedia active cooled)
My CSPPC is a very early 604e 150 MHz OC at 200 MHz. (the model with NO RAM Precharge settings enabled)
EDIT- and remember that there are 2 different models of CVPPC, that use different VRAM chips.
-
Framiga wrote:
EDIT- and remember that there are 2 different models of CVPPC, that uses different VRAM chips.
Hmm. There is that.
BTW, do you find your overclocked P2 that much faster?
-
to be honest . . .not so much :-)
Some benchmarks here and there but for the everyday use . . . i don't see it.
Another note . . . due to my very old IIyama 21", i'm forced to use rather low frequency rates (about 50 kHz horizontal and 65 Hz vertical)
For the above reason, i think that my P2 is running very very . . . . quiet (don't know if that's the exact definition)
-
Depending on the pixelclock, your lower frequencies might mean that you have more memory bandwidth left for the graphics core :-)
-edit-
Unless like me, you insist on using nothing less than 1280x960x16-bit for anything :-D
-
unfortunately 1024x768 16 bit here :-(
1120x900 only for ProStationAudio . . . as i said, they are very, very old chaps :-(
Idek IIYama MF5121 (2 identical monitors, given to me as a gift by a relative of mine).
They were used on a sort of "custom printing workstation", in 1990.
I know . . .time to replace them but . . . no money now for at least a 19" (21" monitor addiction)
-
Framiga wrote:
unfortunately 1024x768 16 bit here :-(
To be fair, I'd say that's a good resolution. It doesn't consume too much VRAM and Workbench isn't anywhere near as wasteful of screen space (well, save for the lack of off-screen window dragging in OS 3.x without a patch like PowerWindows) as many other OSes that fill every spare pixel with eyecandy.
I am yet to do a proper study of how write bandwidth to the VRAM of these cards (also blitting etc) is affected (if at all) by the resolution.
-edit-
Damn, am I a boring bugger or what? :lol:
-
oh, just for curiosity.
Would it be possible to write (you, obviously) a sort of AvailP96 thing for CGX?
Would be useful . . or not? :-)
EDIT- i've already an "very original, unique" name in mind . . . AvailCGX . . . :-D
-
I could look into it, I suppose :-)
-edit-
How about a generic P96 / CGX one, AvailVRAM ?
-
:-D
ah OK . . . so no royalties for me!
OK anyway :-)
-
How did move16 problems become Avail(RTG|CGX|VRAM)?
Only on amiga.org :-D
-
Also forgot to mention some stuff:
My 604e is a 200MHz model and the CyberStormPPC itself is the earlier model with no RAM precharge option. Other than that - my RAM-settings are set to the lowest latency for both the 060 and the 604e. The CyberVisionPPC is not clocked.
/Patrik
-
...and for those of us still wondering what precharge is, see here (http://xtronics.com/memory/how_memory-works.htm) ;-)