Author Topic: CopyMem Quick & Small released! (Read 14287 times)

itix · « **Reply #14 on:** December 30, 2014, 07:31:28 PM »

Quote from: psxphill;780726

I'm not convinced that you ever see a real-world improvement with these patches. I don't remember my Amiga copying memory constantly; the entire OS design is based around never copying. A lot of software has its own memcpy() and doesn't use exec anyway because the overhead of calling into exec when you're copying small amounts of data is not worth it.

Many RTG-based games use CopyMem() because they work directly with ARGB/LUT buffers and copy data around. Those could be good candidates for benchmarking CopyMem() patches in real life.

But of course... if the CPU is too slow, it is too slow, and no patch can help it.

Oldsmobile_Mike · « **Reply #15 on:** December 30, 2014, 08:14:52 PM »

Quote from: psxphill;780766

I think that might be a perception bias. I have a 2.5 GHz Windows 8.1 laptop and if Commodore had anything that felt this quick they wouldn't have gone bankrupt. The boot-up speed is probably the only thing the Amiga wins on, but my C128 boots up even faster.

Okay, there "might be" some perception bias involved, but let's see... 2000 / 33 = 60. It certainly doesn't feel 60x faster than my Amiga! :laughing:

...and it's for darn sure not nearly as fun, either. :lol:

psxphill · « **Reply #16 on:** December 30, 2014, 10:07:06 PM »

Quote from: Oldsmobile_Mike;780770

Okay, there "might be" some perception bias involved, but let's see... 2000 / 33 = 60. It certainly doesn't feel 60x faster than my Amiga! :laughing:

If you do similar things on both then how much faster does it feel?

RAM speed hasn't kept up with CPU speed, so you can't expect it to be 60 times quicker anyway.

matthey · « **Reply #17 on:** December 30, 2014, 10:11:46 PM »

Quote from: SpeedGeek;780711

4 Long moves are faster than Move16 when the Copyback cache is enabled and the 4 Long moves obtain best case performance, but in the worst case Move16 is much faster. The average case performance probably occurs at 50% of the size of the 040's data cache... and that's why I have a copy block size limit >= 2048 bytes before any Move16 is enabled!

The best copy block size limit to use before switching to an unrolled MOVE16 copy loop is different for the 68040 and 68060. The 68060 doesn't need an unrolled MOVE.L loop to give maximum copy performance, which preserves the ICache (faster). Of course the AmigaOS uses a MOVEM.L loop instead of an unrolled MOVE.L loop, which is poor for every 68k CPU except for large copies on the 68000/68010 and 68020/68030.
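
As a rough illustration of the size cutoff discussed above, the dispatch could be sketched in C like this. The 2048-byte limit is the figure SpeedGeek quotes for the 68040; the names `CopyPath`, `choose_copy_path` and `MOVE16_MIN_BYTES` are made up for this sketch, and a real patch would be written in 68k assembly with a limit tuned per CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not taken from any released patch. */
#define MOVE16_MIN_BYTES 2048u   /* cutoff quoted above for the 68040 */

typedef enum { COPY_BYTES, COPY_LONGS, COPY_MOVE16 } CopyPath;

/* Pick a copy strategy from the block size and pointer alignment. */
static CopyPath choose_copy_path(const void *src, const void *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;

    /* MOVE16 moves whole 16-byte lines, so both ends must be line aligned. */
    if (len >= MOVE16_MIN_BYTES && ((s | d) & 15u) == 0)
        return COPY_MOVE16;

    /* Longword moves want both pointers on a 4-byte boundary. */
    if (len >= 4 && ((s | d) & 3u) == 0)
        return COPY_LONGS;

    /* Everything else falls back to the slowest, safest path. */
    return COPY_BYTES;
}
```

The point of returning an enum rather than copying directly is only to make the decision visible; where exactly the cutoff belongs on a 68060 is left open here.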

Quote from: Oldsmobile_Mike;780712

Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

http://aminet.net/package/util/boot/CopyMem

Since it seems like they both rely on Move16?

There is a version of CopyMem which does not use MOVE16. ZorroII and custom chip address space can be checked and avoided with little overhead also. I have not heard from anyone experiencing instability on any Amiga from using MOVE16 though.
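
To give an idea of how cheap such a check can be, here is a minimal C sketch. It assumes the usual Amiga layout, where chip RAM, Zorro II space, the CIAs and the custom chip registers all sit in the low 16 MB (24-bit) address range; `avoid_move16` and `LOW_16MB_LIMIT` are names invented for this example, not taken from any released patch.

```c
#include <stdint.h>

/* Assumption for this sketch: everything below 16 MB is 24-bit space. */
#define LOW_16MB_LIMIT 0x01000000u

/* Return 1 if either address falls in the range MOVE16 should skip. */
static int avoid_move16(uint32_t src, uint32_t dst)
{
    /* One unsigned compare per pointer covers chip RAM, Zorro II
       boards and the custom chip registers in a single test. */
    return (src < LOW_16MB_LIMIT) || (dst < LOW_16MB_LIMIT);
}
```

In assembly this comes down to a compare and a branch per pointer, which fits the "little overhead" remark above.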

Quote from: itix;780767

Many RTG-based games use CopyMem() because they work directly with ARGB/LUT buffers and copy data around. Those could be good candidates for benchmarking CopyMem() patches in real life.

This is true. With exec.library CopyMem() patched, the overhead of using the library function is less than most compiler memory copy functions (for example the SAS/C copy routine). Some programmers like NovaCoder take advantage of this. GCC 3.4 may use exec.library CopyMem() for C memcpy(). AWeb uses CopyMem() for screen updates and while scrolling where there is a noticeable difference in scrolling speed on a 68060 (and probably 68040) with CopyMem() patched. AmigaOS and MUI use CopyMem() a lot so patching should free some CPU cycles as well.

Oldsmobile_Mike · « **Reply #18 on:** December 30, 2014, 10:42:48 PM »

Quote from: psxphill;780774

If you do similar things on both then how much faster does it feel?

RAM speed hasn't kept up with CPU speed, so you can't expect it to be 60 times quicker anyway.

OMG guys. You all have about the squarest sense of humor, ever. I've been working on computer hardware for 30 years, of course I know that. I was trying to make a joke! *facepalm*

But I'll be damned if Wordsworth on my Amiga doesn't feel faster than OpenOffice on the Linux box. Of course you can say just one word to that: Java. HA!

SpeedGeek · « **Reply #19 on:** December 31, 2014, 02:32:35 AM »

** NEWS UPDATE **

CMQ&S040 v1.6 released

v1.6 minor change
- source address compare code misqualified Move16 on 8 byte offset
(This is fixed now but the 4 byte offset still doesn't work for some reason)

guest11527 · « **Reply #20 on:** December 31, 2014, 11:27:05 AM »

Quote from: SpeedGeek;780789

** NEWS UPDATE **

CMQ&S040 v1.6 released

v1.6 minor change
- source address compare code misqualified Move16 on 8 byte offset
(This is fixed now but the 4 byte offset still doesn't work for some reason)

The problem is - if something breaks, people are rarely aware or even able to relate that to the patch. As I said, MOVE16 *may* work fine on the CPU memory directly on the turbo board, but may fail when going over Zorro, or may be at least slower.

Now think again: How many people will consider your patch faulty if some program creates graphics defects? How many people will benchmark the copy operation to *rtg memory*? Actually, *did you* benchmark? Did you benchmark on every possible hardware combination? I can only reassure you that it's slower on my A2000.

psxphill · « **Reply #21 on:** December 31, 2014, 12:09:18 PM »

Quote from: Thomas Richter;780795

The problem is - if something breaks, people are rarely aware or even able to relate that to the patch. As I said, MOVE16 *may* work fine on the CPU memory directly on the turbo board, but may fail when going over Zorro, or may be at least slower.

The only solution to that problem is to run something that actually tests every single memory type in your computer and tells you whether it worked and what speed it was.

Then ideally it would be able to create a configuration so that certain types of memory could be excluded etc. Effectively a CopyMem construction kit.
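
A first step toward such a tool might look like the C sketch below: it runs a candidate copy routine against one pair of buffers, checks the result byte for byte, and times the copy calls. All names here (`copy_fn`, `RegionResult`, `test_region`, `copy_memcpy`) are hypothetical, `clock()` stands in for a proper Amiga timer, and on a real machine the two buffers would have to be allocated from the specific memory type under test.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

typedef struct {
    int    ok;        /* 1 if every pass matched the source */
    double seconds;   /* time spent inside the copy calls only */
} RegionResult;

/* Copy src to dst `rounds` times with `fn`, verifying each pass. */
static RegionResult test_region(copy_fn fn, unsigned char *dst,
                                unsigned char *src, size_t len, int rounds)
{
    RegionResult r = { 1, 0.0 };
    clock_t total = 0;
    size_t i;
    int n;

    for (n = 0; n < rounds; n++) {
        clock_t start;

        /* Non-zero pattern that changes each round, and a zeroed
           destination, so a routine that copies nothing cannot pass. */
        for (i = 0; i < len; i++)
            src[i] = (unsigned char)((i + (size_t)n) | 1u);
        memset(dst, 0, len);

        start = clock();
        fn(dst, src, len);
        total += clock() - start;

        if (memcmp(dst, src, len) != 0)
            r.ok = 0;
    }

    r.seconds = (double)total / CLOCKS_PER_SEC;
    return r;
}

/* Reference routine to compare candidates against: plain memcpy. */
static void copy_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```

Running this once per memory type and per candidate routine would give the "did it work, and how fast" table; turning that table into an exclusion list is the part this sketch leaves out.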

SpeedGeek · « **Reply #22 on:** December 31, 2014, 05:03:31 PM »

Quote from: psxphill;780764

Some people will spend time doubling the speed of a routine that takes 100ms and is only ever run once.

IIRC matthey logged copymem/copymemquick calls on an Amiga with >100MB of RAM and ran out of memory in 1 minute!

Quote from: psxphill;780764

Do you have any benchmarks of real software before and after installing the patch?

MOVE16 doesn't appear to be safe on an MMU-less 040 as you can't use the workaround in the errata, although it's arguable whether an MMU-less 040 is safe in an Amiga at all (yet they seem to exist).

Testit is really not a good program for testing Move16 performance (of course, it was written for 020 and earlier CPUs). I can run CMQ&S040 before Setpatch and any MMU code is installed. I can then execute the startup-sequence, which loads Setpatch and the MMU code.

Quote from: psxphill;780764

The TBI line isn't a solution, it completes the burst and then throws away the extra results. If you write and the data isn't in the cache it will try to burst read the cache line and throw that away too.

http://amigadev.elowar.com/read/ADCD_2.1/AmigaMail_Vol2_guide/node0161.html

WTF? TBI doesn't complete the burst, it TERMINATES the burst! Throws away the extra results? What extra results are there? 4 longwords requested = 4 longwords completed. FYI, the cache control logic really doesn't care if the 4 longwords were transferred in a burst or non-burst cycle.

psxphill · « **Reply #23 on:** December 31, 2014, 05:30:15 PM »

You know what, I'm only going on what I read. It's not new though.

http://www.programd.com/2_f594b4220be791d2_1.htm

SpeedGeek · « **Reply #24 on:** January 02, 2015, 05:24:45 PM »

Quote from: Oldsmobile_Mike;780712

Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

http://aminet.net/package/util/boot/CopyMem

Since it seems like they both rely on Move16?

That's a very general question to ask, but a question which has very specific and qualified answers.

Faster in which category? Best, average, or worst case copies? Large, medium, or small size copies? Faster on 16 byte, longword, word or byte copies? Faster on aligned or misaligned copies? Faster on 020, 030, 040 or 060?

Any CMQ patch can be optimized to give better performance for a specific category but that will reduce its performance in another category.

guest11527 · « **Reply #25 on:** January 02, 2015, 06:02:35 PM »

Quote from: SpeedGeek;780920

Any CMQ patch can be optimized to give better performance for a specific category but that will reduce its performance in another category.

Actually, I would be more curious to hear about any application that profits from such a patch. For me, uses of CopyMemQuick() are too rare to make any measurable difference in everyday usage. There may be exceptions, as always.

I would rather say that an application that critically depends on memory copies implements the copy itself, without going through the Os, as there are many other factors only the calling program can know. For example, a "move" moves into and out of the cache. A move16 does not. Is that good or bad? MOVE16 doesn't "pollute" the cache. move "already fills the cache with the target data". Whether that is something you want or do not want cannot be distinguished by CopyMemQuick(). It is something only the calling program can know - and hence, only the calling program can select the optimal strategy. CopyMemQuick() is the "Ford Escort" you may select if it is "fast enough", so it's usually not worth the trouble patching into this call, even more so as it is rarely used.

Thorham · « **Reply #26 on:** January 02, 2015, 07:22:48 PM »

Quote from: Thomas Richter;780921

I would rather say that an application that critically depends on memory copies implements the copy itself, without going through the Os, as there are many other factors only the calling program can know. For example, a "move" moves into and out of the cache. A move16 does not. Is that good or bad? MOVE16 doesn't "pollute" the cache. move "already fills the cache with the target data". Whether that is something you want or do not want cannot be distinguished by CopyMemQuick(). It is something only the calling program can know - and hence, only the calling program can select the optimal strategy. CopyMemQuick() is the "Ford Escort" you may select if it is "fast enough", so it's usually not worth the trouble patching into this call, even more so as it is rarely used.

Remember our OS blitting routine argument? You just stated the reason for writing one's own blit routine: A one size fits all routine isn't always the best solution.

guest11527 · « **Reply #27 on:** January 02, 2015, 10:24:11 PM »

Quote from: Thorham;780928

Remember our OS blitting routine argument? You just stated the reason for writing one's own blit routine: A one size fits all routine isn't always the best solution.

Not exactly. The questions are "what is the solution you look for", "what can the Os do for you", "is a patch worth doing", and "is not using the Os worth it". Each decision has advantages and drawbacks.

In case of doubt: Avoid a patch, especially if the average savings are negligible. In case of doubt: Use the Os for the job, unless you get substantial savings doing otherwise.

What happens now in the average program? If you don't care much, you probably pick memcpy() from the standard library or CopyMemQuick(). The former may or may not use the Os - it is more likely inlined. If it matters much, you probably have your own routine.

For the blitter, however, you get a substantial disadvantage from not using the Os: If you try to implement a graphics primitive, it might simply not work on an RTG system if you don't use the Os. Is it worth not using the Os? Typically not, because you "shoot yourself in the foot".

Thus, the situation between "patch" and "program", "copy mem" and "blitter" are not quite as symmetric as you may want to present them.

kolla · « **Reply #28 on:** January 02, 2015, 11:14:33 PM »

What's the big deal, it's just one out of hundreds of patches around. It is entirely optional to add patches and updates.

Thorham · « **Reply #29 on:** January 03, 2015, 07:03:08 AM »

Quote from: Thomas Richter;780935

For the blitter, however, you get a substantial disadvantage from not using the Os: If you try to implement a graphics primitive, it might simply not work on an RTG system if you don't use the Os. Is it worth not using the Os? Typically not, because you "shoot yourself in the foot".

Whether it's worth it or not depends entirely on one's requirements. The OS blit routine is generic, and therefore unsuitable for fast, non-generic blits, even when the nature of the blits is very simple (you can see this clearly when you look at the function call). The right solution for getting both maximum performance on native screen modes and have GFX card compatibility is to simply implement both methods.
