Author Topic: CopyMem Quick & Small released! (Read 14261 times)

matthey · « **on:** December 30, 2014, 10:11:46 PM »

Quote from: SpeedGeek;780711

The case when 4 Long moves is faster than Move16 is when the Copyback cache is enabled and the 4 Long moves obtain best case performance, but in the case of worst case performance Move16 is much faster. The average case performance probably occurs at 50% of the size of the 040's data cache... and that's why I have copy block size limit >= 2048 bytes before any Move16 is enabled!

The best copy block size limit before switching to an unrolled MOVE16 copy loop is different for the 68040 and 68060. The 68060 reaches maximum copy performance without an unrolled MOVE.L loop, and the shorter loop preserves the ICache (faster). Of course the AmigaOS uses a MOVEM.L loop instead of an unrolled MOVE.L loop, which is a poor choice on every 68k CPU except for large copies on the 68000-68030.

Quote from: Oldsmobile_Mike;780712

Breaking this down into layman's terms, would you say this version is faster than, not as fast, or equal to this version:

http://aminet.net/package/util/boot/CopyMem

Since it seems like they both rely on Move16?

There is a version of CopyMem which does not use MOVE16. ZorroII and custom chip address space can be checked and avoided with little overhead also. I have not heard from anyone experiencing instability on any Amiga from using MOVE16 though.
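To show how cheap such a check can be, here is a minimal C sketch of a MOVE16 eligibility test. It is not the code from any released CopyMem patch. The names `can_use_move16`, `ADDR_24BIT_LIMIT` and `MOVE16_MIN_SIZE` are made up for illustration, the 2048 byte limit is borrowed from SpeedGeek's post, and treating everything below 16 MB (chip RAM, ZorroII, custom chips) as off-limits is a conservative simplification.

```c
#include <stdint.h>

/* Assumption: the whole 24-bit address space (chip RAM, ZorroII,
   custom chip registers) is excluded from MOVE16 copies. */
#define ADDR_24BIT_LIMIT 0x01000000UL
/* Assumption: size limit taken from SpeedGeek's 2048 byte figure. */
#define MOVE16_MIN_SIZE  2048UL

/* Hypothetical helper: returns 1 if a MOVE16 loop may be used, else 0. */
static int can_use_move16(uint32_t src, uint32_t dst, uint32_t size)
{
    if (size < MOVE16_MIN_SIZE)
        return 0;   /* too small for MOVE16 to pay off */
    if (src < ADDR_24BIT_LIMIT || dst < ADDR_24BIT_LIMIT)
        return 0;   /* ZorroII, chip RAM or custom chip space */
    if (((src | dst) & 15UL) != 0)
        return 0;   /* MOVE16 transfers whole 16-byte aligned lines */
    return 1;
}
```

That is two compares and one mask per call. A real routine would probably copy a few leading bytes to reach 16-byte alignment instead of rejecting unaligned blocks outright.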

Quote from: itix;780767

Many RTG-based games use CopyMem() because they manage directly with ARGB/LUT buffers and copy data around. Those could be good candidate for benchmarking CopyMem() patches in real life.

This is true. With exec.library CopyMem() patched, the overhead of using the library function is less than most compiler memory copy functions (for example the SAS/C copy routine). Some programmers like NovaCoder take advantage of this. GCC 3.4 may use exec.library CopyMem() for C memcpy(). AWeb uses CopyMem() for screen updates and while scrolling where there is a noticeable difference in scrolling speed on a 68060 (and probably 68040) with CopyMem() patched. AmigaOS and MUI use CopyMem() a lot so patching should free some CPU cycles as well.

matthey · « **Reply #1 on:** January 03, 2015, 09:51:55 PM »

Quote from: olsen;780979

I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.

It is true that most AmigaOS calls of exec.library CopyMem() are small to medium size but there are *many* calls. It's not efficient to use CopyMem() for small copies because of the library JSR+JMP overhead, although CPU optimized code can reduce the overall cost to be close to that of non-library code for all but the smallest copies. AmigaOS CopyMem() uses a MOVE.B (A0)+,(A1)+ loop for small copies and a MOVEM.L loop for large copies. It's actually the MOVEM.L loop that is most inefficient because it is only good for the 68000-68030 with large copies. An unrolled MOVE.L (A0)+,(A1)+ loop would be significantly better for the 68000-68060 in most cases.
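For readers who don't read 68k assembler, the shape of an unrolled longword loop can be sketched in portable C. This is only an illustration, not the exec.library code or any patch. The function name `copy_unrolled` is made up, aligning the destination first is an assumption of this sketch, and the fixed 4-byte memcpy() calls are stand-ins for MOVE.L (A0)+,(A1)+ that keep the C legal when the source is unaligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable analogue of an unrolled MOVE.L copy loop. */
static void *copy_unrolled(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    uint32_t w0, w1, w2, w3;

    /* Byte loop until the destination is longword aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Four longwords per pass: stand-in for 4x MOVE.L (A0)+,(A1)+. */
    while (n >= 16) {
        memcpy(&w0, s, 4);      memcpy(&w1, s + 4, 4);
        memcpy(&w2, s + 8, 4);  memcpy(&w3, s + 12, 4);
        memcpy(d, &w0, 4);      memcpy(d + 4, &w1, 4);
        memcpy(d + 8, &w2, 4);  memcpy(d + 12, &w3, 4);
        s += 16;
        d += 16;
        n -= 16;
    }

    /* Byte loop for the remaining 0-15 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The unroll factor of four is arbitrary here. The real trade-off is loop overhead against code size in the ICache, which is why the best factor differs between CPUs.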

tiny static size copy use '=' in C
small size copy use quick loop
medium size copy use unrolled loop (the 68060 doesn't benefit from unrolling in this case)
large size copy use MOVEM.L loop for 68000-68030, use unrolled MOVE16 loop for 68040-68060
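The list above can be read as a small dispatcher. A rough C sketch follows, with the caveat that `SMALL_LIMIT` and `LARGE_LIMIT` are made-up numbers (the right limits differ per CPU and would need benchmarking) and the enum and function names are hypothetical. The tiny static case is missing because the compiler handles '=' at compile time.

```c
#include <stddef.h>

/* Hypothetical CPU identifiers, ordered so they can be compared. */
enum cpu_model { CPU_68000, CPU_68010, CPU_68020, CPU_68030, CPU_68040, CPU_68060 };

enum copy_strategy { COPY_QUICK_LOOP, COPY_UNROLLED, COPY_MOVEM, COPY_MOVE16 };

/* Assumed limits, for illustration only. */
#define SMALL_LIMIT 32u
#define LARGE_LIMIT 2048u

/* Picks a copy technique from the block size and CPU, following the list. */
static enum copy_strategy pick_strategy(size_t size, enum cpu_model cpu)
{
    if (size < SMALL_LIMIT)
        return COPY_QUICK_LOOP;          /* small: quick loop */
    if (size < LARGE_LIMIT)
        return COPY_UNROLLED;            /* medium: unrolled loop */
    /* large: MOVEM.L for 68000-68030, unrolled MOVE16 for 68040-68060 */
    return (cpu >= CPU_68040) ? COPY_MOVE16 : COPY_MOVEM;
}
```

On a 68060 the medium case could fall back to the quick loop since unrolling gains nothing there. That refinement is left out to keep the sketch short.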

There is a trade-off with different memory copy techniques as SpeedGeek has mentioned. This applies to the exec.library CopyMem() as well as the C memcpy() and memmove() functions. Most memcpy()/memmove() calls are small and this is why vbcc uses inlined quick loops to get to work as fast as possible (after minimal alignment). SAS/C uses a subroutine call (BSR+RTS) to a poorly optimized unrolled loop with no aligning and a costly jump table at the end for the 68040+. The SAS/C memcpy() may be faster in some cases than the vbcc memcpy() for large aligned copies. Sadly, the SAS/C memcpy() probably beats the exec.library CopyMem() for medium to large copies on the 68040+.

With vbcc, the fastest choices are:

tiny static size copy use '=' in C
small size copy use C memcpy() and memmove()
medium size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
large size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
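On the caller's side, the vbcc advice could look something like the sketch below. It is written to compile anywhere, so the library routine is passed in as a function pointer and `fake_copymem` is only a stand-in. On an Amiga that slot would be filled by the patched exec.library CopyMem(), which takes its arguments as (source, dest, size), the reverse of memcpy(). The `LIB_COPY_MIN` cutoff and all names here are assumptions, not measured values.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cutoff between "small" and "medium". Needs benchmarking. */
#define LIB_COPY_MIN 64u

/* Same argument order as CopyMem(): source first, then destination. */
typedef void (*lib_copy_fn)(const void *source, void *dest, unsigned long size);

/* Stand-in for the library call so the sketch runs off-Amiga. */
static unsigned long lib_calls = 0;
static void fake_copymem(const void *source, void *dest, unsigned long size)
{
    lib_calls++;
    memcpy(dest, source, size);   /* like CopyMem(), no overlap support */
}

/* Small copies stay with the compiler's memcpy(), the rest go to the library. */
static void copy_block(void *dest, const void *src, size_t n, lib_copy_fn lib_copy)
{
    if (n < LIB_COPY_MIN)
        memcpy(dest, src, n);
    else
        lib_copy(src, dest, (unsigned long)n);
}
```

A call would be `copy_block(dst, src, len, fake_copymem);`. When the size is a compile time constant the branch disappears, so the wrapper costs nothing in the common case.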

If the exec.library CopyMem()/CopyMemQuick() used unrolled MOVE.L copy loops then we could be in good shape without patching. Patching for the uncommon large copies would become optional. MOVE16 does have the advantage of not flushing the DCache on large copies although it's questionable whether this is common enough and bug free enough to be standard.

The new version of vbcc 0.9d was recently released by the way:

http://sun.hasenbraten.de/vbcc/

With SAS/C, the fastest choices are:

tiny static size copy use '=' in C
small size copy use C memcpy() and memmove() for 68000-68030, use CPU optimized exec.library CopyMem() and CopyMemQuick() for 68040+
medium size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()
large size copy use CPU optimized exec.library CopyMem() and CopyMemQuick()

Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.

Quote from: olsen;780979

Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.

I considered this but decided it was better to optimize compiler link lib code next and vbcc was the easiest place to start.

matthey · « **Reply #2 on:** January 07, 2015, 02:15:23 AM »

Quote from: psxphill;781156

I've heard that argument before and I don't buy it. If it's easily possible to write code that can render text faster then you should do that, because there are easily situations where an average editor is too slow. Like if you're running something reasonably intensive in the background.

Just being fast enough when nothing else is running isn't fast enough.

Sure we need it all to be standardised and consistent so it makes it easy to write software, but that should be doable.

I agree. I like the idea of using the OS but it needs to provide reasonably optimal functions. Is aligning the destination and using an unrolled MOVE.L loop too much to ask for CopyMem()/CopyMemQuick() when that approach is among the fastest across the 68000-68060? Would it be a bad thing if Olsen sold more copies of Roadshow because the memory copying bottleneck was reduced? We need to improve and use Amiga profilers but memory copying is a CPU intensive task that is easily improved. The Amiga philosophy has always been about efficiency and not just replacing the CPU with a faster one.

Quote from: itix;781160

If you are running something CPU intensive in the background, like compiling large project with GCC, all you need is a good scheduler.

The 68k frontend for vbcc, vc, had the task priority lowered for better multi-tasking. Editing is now practical while compiling which is very convenient.

I believe 68k GCC will use the current shell process priority (ChangeTaskPri).

matthey · « **Reply #3 on:** January 13, 2015, 03:14:39 AM »

Quote from: olsen;781548

Back in the early days of Amiga programming (that would have been 1987/1988 in my case) it was hard to find a decent programmer's editor.

I knew "Z" but quickly discarded it for being too obtuse. Funny that the Aztec 'C' documentation gave it such prominence, stressing the fact how compatible it was with "vi". I think the defining sentence in the documentation was "if you know vi, then you know Z", which works the other around, too, but not in Z's favour: I didn't have a clue what the documentation was talking about in the first place ("vi"? was that a roman numeral or something? and what does the number six have to do with text editors anyway?) and had to conclude that whatever the authors were so excited about probably wasn't for me.

People seem to forget the history and how everything that wasn't assembler was related. We have BPTRs in dos.library which I believe came from the BCPL language?

BCPL -> B -> C

The Amiga was one of the first affordable computers to use C for most of the OS and it was a common development environment. The 68000 chip made it easier to use a high level language which was popular on non-affordable hardware (the 68k is a cheaper successor to a VAX and PDP-11). This was another important choice in foresight by Jay Miner. The Amiga and Atari ST helped make C popular, even though most computer people would think C came from the PC, where it was slow to catch on, or from Unix, which is partially true but Unix was rare outside universities and a few big businesses at the time. Dennis Ritchie, Jay Miner, Carl Sassenrath and even RJ Mical were pioneers and innovators that few people know about today while Steve Jobs and Bill Gates get the glory for being good at marketing inferior products.

Quote from: Thorham;781574

Why use Ed at all?

Because it is free (with AmigaOS), available and works. Ed was at one time not too bad. It has powerful ARexx support and the menus are configurable so maybe it was the FrexxEd of the day? I did a lot with Ed and ARexx but the vanishing 1st line bug and the slow speed finally killed it for me.

I went to CED 3.5 and then CED 4.20 where I am now. CED is fast and very powerful but not perfect either.

o I wish I could change the menus to be more style guide compliant like Ed.
o I wish all major bugs were fixed before moving to a paid upgrade. I shouldn't have to pay for bug fixes or upgrade to get bug fixes. CED 4.20 has 2 major bugs. Some files will not load and this seems to have something to do with the path and file name of the file. The other is the tab size changing when using an ARexx script, which can be worked around by restoring the tab setting with ARexx after an ARexx script. These are very annoying bugs even though they don't cause data loss. I have installed the patch from Aminet which didn't fix the problem.
o I wish there was a 68020 compiled version. It's amazing that CED is as fast as it is when SAS/C uses a branch to a branch because there is no 32 bit branch on the 68000. A multiply or divide can take several times longer without 68020 MUL/DIV instructions. That SAS/C memory copy routine is less than spectacular also. Fortunately, the good algorithms are more important than optimal compiler code generation.
o I wish an "editor" wasn't so expensive to upgrade and the process easy (my CD has no serial number).

The Amiga has many good editors now like CED, GoldEd, FrexxEd and BED. There are better free editors on Aminet now than ED, sometimes with source code.
