Author Topic: CopyMem Quick & Small released! (Read 14190 times)

SpeedGeek · « **on:** December 28, 2014, 04:54:04 PM »

Here is a link to this thread:

http://eab.abime.net/showthread.php?p=993920

SpeedGeek · « **Reply #1 on:** December 29, 2014, 02:48:32 PM »

Zero bytes? You know how to make a CMQ patch this small? Goodie, I can't wait to try it out.

Four longword moves are faster than Move16 only when the Copyback cache is enabled and the moves get their best case performance; in the worst case, Move16 is much faster. The average case performance probably occurs at about 50% of the size of the 040's data cache (4 KB, so 2048 bytes)... and that's why I have a copy block size limit of >= 2048 bytes before any Move16 is enabled!
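
To make the size rule above concrete, here is a minimal C sketch of the decision it describes. The patch itself is 68k assembly and its source is not shown in this thread, so the names `choose_copy_path`, `PATH_MOVEL`, `PATH_MOVE16` and `DEFAULT_BLOCK_LIMIT` are hypothetical; only the 2048 byte limit comes from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; the real patch is 68k assembly. */
enum copy_path { PATH_MOVEL, PATH_MOVE16 };

/* Limit quoted in the post: half of the 68040's 4 KB data cache. */
#define DEFAULT_BLOCK_LIMIT 2048u

/* Pick the copy loop purely by size: small copies stay on longword
 * moves (which can hit in the Copyback cache), large copies use the
 * Move16 loop. Alignment checks are deliberately left out here. */
static enum copy_path choose_copy_path(size_t size, size_t block_limit)
{
    assert(block_limit >= 16);  /* one MOVE16 line is 16 bytes */
    return (size >= block_limit) ? PATH_MOVE16 : PATH_MOVEL;
}
```

With the default limit, a 2047 byte copy stays on the longword loop and a 2048 byte copy switches to Move16. Passing a different `block_limit` models a build of the patch with a different block size.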

SpeedGeek · « **Reply #2 on:** December 30, 2014, 06:12:21 PM »

Quote from: Thomas Richter;780720

It's called "Use the Os provided function".

So you really don't have a patch that small. If the "OS provided function" was fast enough, then why would anyone bother to make a patch in the first place?

Quote from: Thomas Richter;780720

No, it's not. Again, whether bursting works over Zorro or not is a matter of luck. For my A2000, MOVE16 is *slow* when I move into the graphics card ram of the GVP spectrum. This is non-cacheable (!), imprecise, non-serial. Thus, the CPU may reorder accesses, does not need to expect bus-errors, but may not cache them, but yet, surprisingly, MOVE16 is slower than four moves. I already said why that is: Bursts over Zorro are no-no's, and the hardware may have to run in circles to get the data over the bus. We tested all this back then for P96, as it was suggested that MOVE16 may improve some blitter emulation cases. It does not. Worst, it may break things. Simply don't try that, it's a bad idea.

Other than that, have you made measurements which speed benefit this program has? I mean, in a realistic use case? If so, I would be interested to learn about your results. Which programs to run, what did the program do, and how did you measure?

The Zorro2 bus does NOT support Burst, so again, as with Chip RAM, Burst is a non-issue. Move16 does NOT need Burst to obtain a performance benefit. While it's certainly true that Burst-capable memory can improve Move16 performance, that's true only to the same extent that Burst would improve the performance of MoveL and any other instruction.

Move16 gets its main performance benefit because it interacts differently with the data cache than MoveL does. This means Move16 is not affected by the worst case performance problem when the Copyback cache is enabled. This also means it can't benefit from the best case performance as MoveL can.

I have already posted a Testit result indicating a 44% speed increase with Move16 on EAB. As I said previously it's the SIZE of the copy which determines whether or not Move16 offers any performance benefit.

Move16 should not cause any problems with the MMU reordering a write to any Zorro2 or Chip RAM since the write cycle will be completed as 4 separate longword writes. But if you want to play it safe you can always fix the MMU config.

What's really surprising here is how people can continue to read the 040 and 060 documentation and ignore the very obvious:

5.4.6 Transfer Burst Inhibit (TBI)
This input signal indicates to the processor that the accessed device cannot support burst mode accesses and that the requested line transfer should be divided into individual longword transfers. Asserting TBI with TA terminates the first data transfer of a line access, which causes the processor to terminate the burst and access the remaining data for the line as three successive long-word transfers. During alternate bus master accesses, the M68040 samples the TBI to detect completion of each bus transfer.

SpeedGeek · « **Reply #3 on:** December 30, 2014, 06:13:21 PM »

Quote from: Thomas Richter;780720

It's called "Use the Os provided function".

So you really don't have a patch that small. If the "OS provided function" was really fast enough, then why would anyone bother to make a patch in the first place?

Quote from: Thomas Richter;780720

No, it's not. Again, whether bursting works over Zorro or not is a matter of luck. For my A2000, MOVE16 is *slow* when I move into the graphics card ram of the GVP spectrum. This is non-cacheable (!), imprecise, non-serial. Thus, the CPU may reorder accesses, does not need to expect bus-errors, but may not cache them, but yet, surprisingly, MOVE16 is slower than four moves. I already said why that is: Bursts over Zorro are no-no's, and the hardware may have to run in circles to get the data over the bus. We tested all this back then for P96, as it was suggested that MOVE16 may improve some blitter emulation cases. It does not. Worst, it may break things. Simply don't try that, it's a bad idea.

Other than that, have you made measurements which speed benefit this program has? I mean, in a realistic use case? If so, I would be interested to learn about your results. Which programs to run, what did the program do, and how did you measure?

The Zorro2 bus does NOT support Burst, so again, as with Chip RAM, Burst is a non-issue. Move16 does NOT need Burst to obtain a performance benefit. While it's certainly true that Burst-capable memory can improve Move16 performance, that's true only to the same extent that Burst would improve the performance of MoveL and any other instruction.

Move16 gets its main performance benefit because it interacts differently with the data cache than MoveL does. This means Move16 is not affected by the worst case performance problem when the Copyback cache is enabled.

I have already posted a Testit result indicating a 44% speed increase with Move16 on EAB. As I said previously it's the SIZE of the copy which determines whether or not Move16 offers any performance benefit.

Move16 should not cause any problems with the MMU reordering a write to any Zorro2 or Chip RAM since the write cycle will be completed as 4 separate longword writes. But if you want to play it safe you can always fix the MMU config.

What's really surprising here is how people can continue to read the 040 and 060 documentation and ignore the very obvious:

5.4.6 Transfer Burst Inhibit (TBI)
This input signal indicates to the processor that the accessed device cannot support burst mode accesses and that the requested line transfer should be divided into individual longword transfers. Asserting TBI with TA terminates the first data transfer of a line access, which causes the processor to terminate the burst and access the remaining data for the line as three successive long-word transfers. During alternate bus master accesses, the M68040 samples the TBI to detect completion of each bus transfer.

SpeedGeek · « **Reply #4 on:** December 31, 2014, 02:32:35 AM »

** NEWS UPDATE **

CMQ&S040 v1.6 released

v1.6 minor change
- source address compare code misqualified Move16 on 8 byte offset
(This is fixed now but the 4 byte offset still doesn't work for some reason)
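
For readers wondering what "qualifying" a copy for Move16 on an address offset means, here is a rough C sketch of one way such a check can work. This is an assumption about the general technique, not the patch's actual compare code (which is not shown in this thread); `same_line_offset`, `lead_in_bytes` and `plan_copy` are hypothetical names. The idea is that MOVE16 moves whole 16 byte lines, so source and destination must sit at the same offset within a line before a short lead-in copy can align both at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 16u  /* MOVE16 transfers one 16-byte line */

/* True when src and dst share the same offset inside a 16-byte line.
 * A relative offset of 4 or 8 bytes fails this test. */
static bool same_line_offset(uintptr_t src, uintptr_t dst)
{
    return ((src ^ dst) & (LINE_SIZE - 1u)) == 0;
}

/* Bytes to copy before src reaches the next line boundary (0 if aligned). */
static size_t lead_in_bytes(uintptr_t src)
{
    return (size_t)((LINE_SIZE - (src & (LINE_SIZE - 1u))) & (LINE_SIZE - 1u));
}

/* Hypothetical split of a copy into lead-in bytes, whole lines, and tail. */
struct copy_plan { size_t lead, lines, tail; };

static struct copy_plan plan_copy(uintptr_t src, size_t size)
{
    struct copy_plan p;
    p.lead = lead_in_bytes(src);
    if (p.lead > size)
        p.lead = size;
    p.lines = (size - p.lead) / LINE_SIZE;
    p.tail = (size - p.lead) % LINE_SIZE;
    assert(p.lead + p.lines * LINE_SIZE + p.tail == size);
    return p;
}
```

Under this check, a source at `0x1000` with a destination at `0x2008` (8 byte offset) or `0x2004` (4 byte offset) would not qualify, while `0x1004` and `0x2004` would, after a 12 byte lead-in.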

SpeedGeek · « **Reply #5 on:** December 31, 2014, 05:03:31 PM »

Quote from: psxphill;780764

Some people will spend time doubling the speed of a routine that takes 100ms and is only ever run once.

IIRC matthey logged copymem/copymemquick calls on an Amiga with >100MB of RAM and ran out of memory in 1 minute!

Quote from: psxphill;780764

Do you have any benchmarks of real software before and after installing the patch?

MOVE16 doesn't appear to be safe on an mmu less 040 as you can't use the workaround in the errata, although it's arguable that an mmu less 040 is safe in an amiga at all (yet they seem to exist).

Testit is really not a good program for testing Move16 performance (of course, it was written for 020 and earlier CPUs). I can run CMQ&S040 before Setpatch and any MMU code is installed. I can then execute the startup-sequence (s-s), which loads Setpatch and the MMU code.

Quote from: psxphill;780764

The TBI line isn't a solution, it completes the burst and then throws away the extra results. If you write and the data isn't in the cache it will try to burst read the cache line and throw that away too.

http://amigadev.elowar.com/read/ADCD_2.1/AmigaMail_Vol2_guide/node0161.html

WTF? TBI doesn't complete the burst, it TERMINATES the burst! Throws away the extra results? What extra results are there? 4 longwords requested = 4 longwords completed. FYI, the cache control logic really doesn't care if the 4 longwords were transferred in a burst or non-burst cycle.

SpeedGeek · « **Reply #6 on:** January 02, 2015, 05:24:45 PM »

Quote from: Oldsmobile_Mike;780712

Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

http://aminet.net/package/util/boot/CopyMem

Since it seems like they both rely on Move16?

That's a very general question to ask, but a question which has very specific and qualified answers.

Faster in which category? Best, average, or worst case copies? Large, medium, or small size copies? Faster on 16 byte, longword, word or byte copies? Faster on aligned or misaligned copies? Faster on 020, 030, 040 or 060?

Any CMQ patch can be optimized to give better performance for a specific category, but that will reduce its performance in another category.

SpeedGeek · « **Reply #7 on:** January 03, 2015, 03:39:57 PM »

Quote from: Thomas Richter;780957

The specific one is that MOVE16 is not a good instruction to use on the Amiga. Problem is that MOVE16 runs a burst-cycle, even into memory or target regions that are marked as "cache-inhibited". The problem is now that it depends on the well-behavedness of the turbo-board to detect this case and abort the burst. Given the "rather average" quality of some expansions and extensions, I would not be surprised that this actually doesn't always work as it should. Indeed, if I test this on my A2000, *not* trying to burst provides a small but measurable speed advantage over trying to initiate the burst.

There you go again, sounding the Burst warning alarm system you invented. I've tried to explain this many times (but you still don't get it). Burst is just an optional feature which under best case conditions can improve performance, but there are also worst case conditions where it reduces performance.

The CPU may only request a Burst cycle but the hardware (memory controller logic) makes the final decision on when (if ever) any Burst cycle will happen.

SpeedGeek · « **Reply #8 on:** January 04, 2015, 02:53:59 PM »

Quote from: Thomas Richter;780983

Exactly. But you silently assume that there is a memory controller logic, and that this memory controller logic is smart enough to pick the right decisions at all times. In fact, you can get away without ever touching the burst. RAM would be on the Turbo card anyhow, chip ram has to be cache inhibited, and the rest of I/O space has to be cache-inhibited as well. Cache-inhibited accesses do not burst, hence no extra logic required. Or almost.

IOW, you rely on the hardware to be well-behaved, and that the vendor implemented an extra-logic just for a corner case. I really wonder where you take your confidence from. All what I learned over the years was that whenever there was a chance to cut the budget, hardware vendors took it. Here you have one...

Take it as you like, but I call it "defensive programming".

Yes, I can implicitly (and correctly) make the assumption that the accelerator card logic disables Burst by default, or permanently disables it on cards which don't support it (it could be memory controller logic, glue logic, PLD logic or even a pull down/up resistor). Otherwise, you won't even be able to boot your Amiga. It's as simple as that.

Exec tries to enable the instruction cache in early startup. Now, what would happen when the CPU tries to run a Burst cycle to the Kickstart ROMs, Chip RAM or the ZorroII bus with Burst enabled and none of the above support Burst?

Quote from: Thomas Richter;780999

To find a solution, one first has to identify the problem. And that's exactly what I do not see here. So far, nobody has mentioned yet a real-world problem (e.g. a program, a series of programs, a particular use case) where the current CopyMemQuick() is the bottleneck, and not fast enough to address the needs of the user. I would rather say that if memory copy is your bottleneck, there is probably something wrong with your algorithm requiring to copy so much data in first place.

But anyhow - I had little problem to exchange it should there ever be a new version of exec, but as the situation currently is, I consider the option of a patch for an otherwise bug-free Os function less desirable than the small speed impact (if at all) of CopyMemQuick() as we have it now.

One of many examples from Aminet (Vbak2091):

INTRODUCTION: ZorroII boards can only reach the lower 16MB of address space. So DMA SCSI controllers must find another way to transfer data to expansion RAM. Some of them (especially the A2091) do a very bad job in this situation. In an A4000/40 transfer rates may drop to 50KB/s. This program patches the (2nd.)scsi.device to use MEMF_24BITDMA RAM as a buffer followed (in case of CMD_READ) by CopyMem(). It was developed with the A4000/A2091 combination in mind, but should work with other configurations, too (see REQUIREMENTS). Some people reported good results with GVP controllers.
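
The readme excerpt describes a bounce buffer: the controller DMAs into memory it can reach, and the CPU then copies each chunk to the real destination, which is why CopyMem() speed matters here. Below is a portable C sketch of that pattern only, not Vbak2091's actual code. `bounced_read`, `dma_read_fn` and `fake_disk_read` are hypothetical names, and `memcpy` stands in for CopyMem().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: a DMA read that can only target the bounce
 * buffer. Returns the number of bytes actually read. */
typedef size_t (*dma_read_fn)(void *dma_buf, size_t offset, size_t len, void *ctx);

/* Read `total` bytes into `dest` by going through `bounce` chunk by chunk. */
static size_t bounced_read(dma_read_fn read_dma, void *ctx,
                           void *bounce, size_t bounce_size,
                           void *dest, size_t total)
{
    size_t done = 0;
    assert(bounce_size > 0);
    while (done < total) {
        size_t chunk = total - done;
        if (chunk > bounce_size)
            chunk = bounce_size;
        size_t got = read_dma(bounce, done, chunk, ctx);
        /* This copy is the step that CopyMem() performs in the real patch. */
        memcpy((char *)dest + done, bounce, got);
        done += got;
        if (got < chunk)
            break;  /* short read: stop early */
    }
    return done;
}

/* Stand-in "device" for trying the sketch: ctx points at a memory image. */
static size_t fake_disk_read(void *dma_buf, size_t offset, size_t len, void *ctx)
{
    memcpy(dma_buf, (const char *)ctx + offset, len);
    return len;
}
```

Every byte read passes through the CPU copy once, so in this arrangement the copy routine sits directly on the data path of every disk read into expansion RAM.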

SpeedGeek · « **Reply #9 on:** January 11, 2015, 11:30:29 PM »

** 2ND NEWS UPDATE **

CMQ&S v1.6 released
v1.6 minor change
- fixed install code which could (but seldom ever did) trash a few bytes
of memory past the end of the patch

CMQ&S040 v1.7 released
v1.7 minor changes
- fixed install code which could (but seldom ever did) trash a few bytes
of memory past the end of the patch
- fixed 4 byte offset on Move16 compare code

SpeedGeek · « **Reply #10 on:** January 22, 2015, 03:54:12 PM »

** 3RD NEWS UPDATE **

No version change
- New 1024-8192 byte Block Size versions added to archive

(The new Block Size versions allow you to "Tune" the
MoveL vs. Move16 performance of your system).
