Author Topic: CopyMem Quick & Small released!  (Read 14377 times)

Offline SpeedGeek (Topic starter)

CopyMem Quick & Small released!
« on: December 28, 2014, 04:54:04 PM »
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #1 on: December 28, 2014, 09:44:45 PM »
Quote from: SpeedGeek;780669
Here is link to this thread:

http://eab.abime.net/showthread.php?p=993920

Two notes on this: First, the smallest patch on CopyMemQuick is zero bytes, no patch. Second, MOVE16 is *not* a safe instruction. It initiates a burst. Unfortunately, Zorro knows nothing about bursts. Thus, depending on the CPU card you have, multiple things may happen. If you MOVE16 from on-board RAM to on-board RAM, you're fine. If you copy into RAM on a Zorro board, the copy could either fail or be slow. For my machine, it is only slow: the glue logic on the CPU board detects the attempt of the CPU to burst, aborts it, and runs regular cycles instead. Net result: MOVE16 is slower than four moves.

General idea: Don't try to optimize unless you're sure what you're doing. MOVE16 is, in general, not a good idea. There are cases where it is, if you can be perfectly sure that there's no Zorro involved and you are moving from CPU-RAM to CPU-RAM, but the entire detection logic for this may take longer than performing the actual move.
 

Offline SpeedGeek (Topic starter)

Re: CopyMem Quick & Small released!
« Reply #2 on: December 29, 2014, 02:48:32 PM »
Zero bytes? You know how to make a CMQ patch this small? Goodie, I can't wait to try it out.

The case when 4 Long moves is faster than Move16 is when the Copyback cache is enabled and the 4 Long moves obtain best case performance, but in the worst case Move16 is much faster. The average case performance probably occurs at 50% of the size of the 040's data cache... and that's why I have a copy block size limit of >= 2048 bytes before any Move16 is enabled!
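For readers following along, the size limit described above can be sketched in portable C. This is only an illustration, not the code of the actual patch: the names `choose_copy_path`, `COPY_LINES16`, `COPY_LONGWORDS` and `LINE_THRESHOLD` are made up here, and the 16-byte alignment check is an assumption based on MOVE16 transferring whole 16-byte lines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the patch source. */
enum copy_path { COPY_LONGWORDS, COPY_LINES16 };

#define LINE_SIZE      16u    /* bytes moved by one MOVE16 (one cache line) */
#define LINE_THRESHOLD 2048u  /* size limit quoted in the post above */

/* Returns nonzero if the address sits on a 16-byte line boundary. */
static int is_line_aligned(const void *p)
{
    return ((uintptr_t)p & (LINE_SIZE - 1u)) == 0;
}

/* Picks the line-copy path only for large copies whose source and
 * destination are both line aligned; everything else falls back to
 * plain longword moves. The pointers are inspected, never dereferenced. */
enum copy_path choose_copy_path(const void *src, const void *dst, size_t size)
{
    if (size >= LINE_THRESHOLD && is_line_aligned(src) && is_line_aligned(dst))
        return COPY_LINES16;
    return COPY_LONGWORDS;
}
```

Note that a check like this says nothing about which bus the memory sits on, which is the objection raised elsewhere in the thread.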
 

Offline Oldsmobile_Mike

Re: CopyMem Quick & Small released!
« Reply #3 on: December 29, 2014, 03:08:19 PM »
Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

http://aminet.net/package/util/boot/CopyMem

Since it seems like they both rely on Move16?


(bonus points for the author mentioning @Thomas_Richter in the description of this program, haha) ;)
Amiga 500: 2MB Chip|16MB Fast|30MHz 68030+68882|3.9|Indivision ECS|GVP A500HD+|Mechware card reader + 8GB CF|Cocolino|SCSI DVD-RAM
Amiga 2000: 2MB Chip|136MB Fast|50MHz 68060|3.9|Indivision ECS + GVP Spectrum|Mechware card reader + 8GB CF|AD516|X-Surf 100|RapidRoad|Cocolino|SCSI CD-RW
 Amiga videos and other misc. stuff at https://www.youtube.com/CompTechMike/videos
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #4 on: December 29, 2014, 06:31:01 PM »
Quote from: Oldsmobile_Mike;780712
Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

You cannot make promises, in general, whether MOVE16 is slower, or faster, or even works at all; this is really the major problem. As said, MOVE16 initiates a burst, even if the memory region is marked as "non-cacheable", which may or may not work, depending on the hardware, the bus in between and other factors. A burst access over Zorro-II is not described anywhere in the Zorro documents, so whether that works or not is really up to your hardware.

One way or another, it is a corner case, and if it works for you, good for you. It certainly does not work for me, and you should be careful installing such patches in either case. They *may* or *may not* provide a speed benefit, or may even break the system.
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #5 on: December 29, 2014, 06:36:11 PM »
Quote from: SpeedGeek;780711
Zero bytes? You know how to make a CMQ patch this small? Goodie, I can't wait to try it out.
It's called "Use the OS provided function".

Quote from: SpeedGeek;780711
The case when 4 Long moves is faster than Move16 is when the Copyback cache is enabled and the 4 Long moves obtain best case performance, ...
No, it's not. Again, whether bursting works over Zorro or not is a matter of luck. For my A2000, MOVE16 is *slow* when I move into the graphics card RAM of the GVP Spectrum. This is non-cacheable (!), imprecise, non-serial. Thus, the CPU may reorder accesses and does not need to expect bus errors, but may not cache them; yet, surprisingly, MOVE16 is slower than four moves. I already said why that is: Bursts over Zorro are no-nos, and the hardware may have to run in circles to get the data over the bus. We tested all this back then for P96, as it was suggested that MOVE16 may improve some blitter emulation cases. It does not. Worse, it may break things. Simply don't try that, it's a bad idea.

Other than that, have you measured what speed benefit this program has? I mean, in a realistic use case? If so, I would be interested to learn about your results. Which programs did you run, what did the programs do, and how did you measure?
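To make the "how did you measure" question concrete, here is a minimal timing sketch in portable C. It is an assumption-laden stand-in: `time_copy` is a hypothetical helper, `memcpy()` takes the place of exec's CopyMem(), and `clock()` is used only because it is standard C. It times a synthetic loop, so it cannot answer the question about realistic use cases; it only shows how copy time at a given block size could be captured before and after installing a patch.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: returns the CPU seconds spent on `reps` copies of
 * `size` bytes, or -1.0 if size is zero or the buffers cannot be allocated.
 * memcpy() stands in for the routine under test. */
double time_copy(size_t size, unsigned reps)
{
    unsigned char *src, *dst;
    volatile unsigned char sink = 0;  /* keeps the copies observable */
    clock_t start, end;
    unsigned i;

    if (size == 0)
        return -1.0;

    src = malloc(size);
    dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);

    start = clock();
    for (i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        sink ^= dst[size - 1];  /* read back so the copy is not optimized away */
    }
    end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running it for several sizes on either side of 2048 bytes, with many repetitions each, would show where (if anywhere) the two copy paths cross over on a given machine.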
 

Offline Oldsmobile_Mike

Re: CopyMem Quick & Small released!
« Reply #6 on: December 29, 2014, 06:36:34 PM »
Quote from: Thomas Richter;780719
One way or another, it is a corner case, and if it works for you, good for you. It certainly does not work for me, and you should be careful installing such patches in either case. They *may* or *may not* provide a speed benefit, or may even break the system.

Understood, no warranties.  ;)  Reason I asked is because I've been running the other version (the one in the Aminet link) on my '040 A2000 for over a year.  No issues.  I always like having "the latest and greatest", so I was wondering if SpeedGeek thinks his new version is an improvement on this version that's already available, since it seems like they do much the same thing (and the source code for the Aminet version is included in the archive, so it should be pretty easy to compare the two).

But, maybe I should just test it and see.  :roflmao:
 

Offline psxphill

Re: CopyMem Quick & Small released!
« Reply #7 on: December 29, 2014, 08:28:55 PM »
Quote from: Thomas Richter;780676
If you MOVE16 from on-board RAM to on-board RAM, you're fine.

It depends. The 040 errata mentions problems with MOVE16 and I don't think anyone ever implemented any of the workarounds.

It's like reducing the weight of your car to make it go quicker, by removing the brakes and airbags.

I'm not convinced that you ever see a real world improvement with these patches. I don't remember my Amiga copying memory constantly, the entire OS design is based around never copying. A lot of software has its own memcpy() and doesn't use exec anyway because the overhead of calling into exec when you're copying small amounts of data is not worth it.
 

Offline SpeedGeek (Topic starter)

Re: CopyMem Quick & Small released!
« Reply #8 on: December 30, 2014, 06:12:21 PM »
Quote from: Thomas Richter;780720
It's called "Use the OS provided function".
So you really don't have a patch that small. If the "OS provided function" was fast enough then why would anyone bother to make a patch in the first place?

Quote from: Thomas Richter;780720
No, it's not. Again, whether bursting works over Zorro or not is a matter of luck. For my A2000, MOVE16 is *slow* when I move into the graphics card RAM of the GVP Spectrum. This is non-cacheable (!), imprecise, non-serial. Thus, the CPU may reorder accesses and does not need to expect bus errors, but may not cache them; yet, surprisingly, MOVE16 is slower than four moves. I already said why that is: Bursts over Zorro are no-nos, and the hardware may have to run in circles to get the data over the bus. We tested all this back then for P96, as it was suggested that MOVE16 may improve some blitter emulation cases. It does not. Worse, it may break things. Simply don't try that, it's a bad idea.

Other than that, have you measured what speed benefit this program has? I mean, in a realistic use case? If so, I would be interested to learn about your results. Which programs did you run, what did the programs do, and how did you measure?

The Zorro2 bus does NOT support Burst and so again, as with Chip RAM, Burst is a non-issue. Move16 does NOT need Burst to obtain a performance benefit. While it's certainly true that Burst-capable memory can improve Move16 performance, that is true to the same extent that Burst would improve the performance of MoveL and any other instruction.

Move16 gets its main performance benefit because it interacts differently with the data cache than MoveL. This means Move16 is not affected by the worst case performance problem when the Copyback cache is enabled. This also means it can't benefit from the best case performance as MoveL can.

I have already posted a Testit result indicating a 44% speed increase with Move16 on EAB. As I said previously it's the SIZE of the copy which determines whether or not Move16 offers any performance benefit.

Move16 should not cause any problems with the MMU reordering a write to any Zorro2 or Chip RAM since the write cycle will be completed as 4 separate longword writes. But if you want to play it safe you can always fix the MMU config.

What's really surprising here is how people can continue to read the 040 and 060 documentation and ignore the very obvious:

5.4.6 Transfer Burst Inhibit (TBI)
This input signal indicates to the processor that the accessed device cannot support burst mode accesses and that the requested line transfer should be divided into individual longword transfers. Asserting TBI with TA terminates the first data transfer of a line access, which causes the processor to terminate the burst and access the remaining data for the line as three successive long-word transfers. During alternate bus master accesses, the M68040 samples the TBI to detect completion of each bus transfer.
« Last Edit: December 30, 2014, 06:39:54 PM by SpeedGeek »
 

Offline psxphill

Re: CopyMem Quick & Small released!
« Reply #11 on: December 30, 2014, 06:50:03 PM »
Quote from: SpeedGeek;780761
So you really don't have a patch that small. If the "OS provided function" was fast enough then why would anyone bother to make a patch in the first place?

Some people will spend time doubling the speed of a routine that takes 100ms and is only ever run once.

Do you have any benchmarks of real software before and after installing the patch?

MOVE16 doesn't appear to be safe on an MMU-less 040, as you can't use the workaround in the errata, although it's arguable whether an MMU-less 040 is safe in an Amiga at all (yet they seem to exist).

The TBI line isn't a solution, it completes the burst and then throws away the extra results. If you write and the data isn't in the cache it will try to burst read the cache line and throw that away too.
 
 http://amigadev.elowar.com/read/ADCD_2.1/AmigaMail_Vol2_guide/node0161.html
« Last Edit: December 30, 2014, 06:54:41 PM by psxphill »
 

Offline Oldsmobile_Mike

Re: CopyMem Quick & Small released!
« Reply #12 on: December 30, 2014, 07:06:42 PM »
Quote from: psxphill;780764
Some people will spend time doubling the speed of a routine that takes 100ms and is only ever run once.

IMHO I love that people like SpeedGeek and Cosmos are taking on these "micro optimizations" of old Amiga code.  I know other people's mileage may vary, but I'm running a ton of their patches on my A2000, and sitting right next to a 2000MHz PC running the latest version of Lubuntu, my 33MHz Amiga still feels like it flies.  :)
 

Offline psxphill

Re: CopyMem Quick & Small released!
« Reply #13 on: December 30, 2014, 07:31:21 PM »
Quote from: Oldsmobile_Mike;780765
I know other people's mileage may vary, but I'm running a ton of their patches on my A2000, and sitting right next to a 2000MHz PC running the latest version of Lubuntu, my 33MHz Amiga still feels like it flies. :)

I think that might be a perception bias. I have a 2.5GHz Windows 8.1 laptop and if Commodore had had anything that felt this quick they wouldn't have gone bankrupt. The boot-up speed is probably the only thing the Amiga wins on, but my C128 boots up even faster.
 

Offline itix

  • Hero Member
  • *****
  • Join Date: Oct 2002
  • Posts: 2380
Re: CopyMem Quick & Small released!
« Reply #14 on: December 30, 2014, 07:31:28 PM »
Quote from: psxphill;780726

I'm not convinced that you ever see a real world improvement with these patches. I don't remember my Amiga copying memory constantly, the entire OS design is based around never copying. A lot of software has its own memcpy() and doesn't use exec anyway because the overhead of calling into exec when you're copying small amounts of data is not worth it.


Many RTG-based games use CopyMem() because they work directly with ARGB/LUT buffers and copy data around. Those could be good candidates for benchmarking CopyMem() patches in real life.

But of course... if the CPU is too slow, it is too slow, and no patch can help it.
My Amigas: A500, Mac Mini and PowerBook
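
As a rough picture of the kind of copying an RTG game might do, here is a sketch in portable C that copies a rectangle of 32-bit ARGB pixels row by row. It is hypothetical: `copy_rect_argb` and its parameters are invented for illustration, and `memcpy()` stands in for exec's CopyMem() so the example builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies a w x h block of 32-bit ARGB pixels from
 * src to dst. Strides are given in pixels, not bytes. One copy call is
 * issued per row; memcpy() stands in for CopyMem(). */
void copy_rect_argb(uint32_t *dst, size_t dst_stride,
                    const uint32_t *src, size_t src_stride,
                    size_t w, size_t h)
{
    size_t y;

    /* A row must fit inside both buffers' line pitch. */
    assert(w <= dst_stride && w <= src_stride);

    for (y = 0; y < h; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride,
               w * sizeof(uint32_t));
}
```

With a loop like this, each call copies only one row: a 320-pixel ARGB row is 320 * 4 = 1280 bytes, which is below the 2048-byte limit mentioned earlier in the thread. Whether a patch helps such a game would therefore depend on whether it copies row by row or whole contiguous frames.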