
Author Topic: CopyMem Quick & Small released!  (Read 14164 times)


Offline psxphill

Re: CopyMem Quick & Small released!
« Reply #44 on: January 03, 2015, 04:57:16 PM »
Quote from: SpeedGeek;780974
The CPU may only request a Burst cycle but the hardware (memory controller logic) makes the final decision on when (if ever) any Burst cycle will happen.

And when you rely on this, it's slower than not requesting the burst in the first place. Does Zorro III actually support it?
« Last Edit: January 03, 2015, 04:59:36 PM by psxphill »
 

Offline olsen

Re: CopyMem Quick & Small released!
« Reply #45 on: January 03, 2015, 04:59:56 PM »
Quote from: biggun;780976
One question:

If you compare the time needed to develop the memcopy with the time spent on talking about / defending it here, how does this time compare?
Seems to me it's not so much about developing something, it's more about making sure that what is developed has a positive impact and no side-effects. Measure twice, cut once ;)

However, this somewhat sober and "not much fun" side of system software development doesn't seem to be much in favour here. More or less, this reflects that the Amiga in its current form is a hobby.

Nothing wrong with computers as hobbies, or the fun of tinkering with the operating system. Spoilsports like Thomas and me do seem to have the engineering side of the operating system patches in mind, because that's what a lot of software builds upon, and it's sadly too easy to break things and never quite find out what actually caused the problems. If you're playing with the fundamentals of the operating system there comes a bit of responsibility with it, and that can't always be fun.

I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.
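For illustration, a minimal sketch in C of that kind of small-copy fast path (the function names are made up, and the ROM routine is of course 68k assembly rather than C):

#include <stddef.h>

/* Sketch of the fast path described above: transfers shorter than 256 bytes
 * go through a short, partly unrolled byte loop (the C analogue of a
 * move.b (a0)+,(a1)+ loop); everything else falls back to a general routine.
 * copymem_general() is a placeholder, not a real OS or library function. */
extern void copymem_general(const void *src, void *dst, size_t size);

void copymem_smallpath(const void *src, void *dst, size_t size)
{
    if (size < 256) {
        const unsigned char *s = src;
        unsigned char *d = dst;

        while (size >= 8) {               /* unrolled by eight */
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
            d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
            d += 8; s += 8; size -= 8;
        }
        while (size--)                    /* remaining 0..7 bytes */
            *d++ = *s++;
    } else {
        copymem_general(src, dst, size);  /* hypothetical general-purpose copy */
    }
}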

I have no data on how third-party applications use CopyMem()/CopyMemQuick(), but if these are written in 'C', it's likely that they will use the memcpy()/memmove() function which the standard library provides, and that typically isn't some crude, bumbling implementation. However, it might benefit from optimization.

Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.
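A very rough sketch of what such a patcher could look like: walk the segment list that LoadSeg() returns and scan each segment for a known byte pattern of a statically linked memcpy(). The pattern bytes below are placeholders, and the actual redirection, cache flushing and false-positive checking are left out:

#include <exec/types.h>
#include <dos/dos.h>
#include <string.h>

/* Sketch only: walk a DOS segment list (as returned by LoadSeg()) and scan
 * each segment for a known byte sequence, e.g. the entry code of a
 * statically linked memcpy(). The pattern below is a placeholder. */
static const UBYTE pattern[] = { 0x20, 0x6F, 0x00, 0x04 };   /* hypothetical */

static void ScanSegList(BPTR seglist)
{
    while (seglist != 0) {
        ULONG *seg   = (ULONG *)BADDR(seglist);   /* segment as a real pointer */
        ULONG  bytes = seg[-1];                   /* block size stored in front */
        UBYTE *code  = (UBYTE *)&seg[1];          /* code starts after next ptr */
        ULONG  i;

        if (bytes > 2 * sizeof(ULONG))
            bytes -= 2 * sizeof(ULONG);           /* size covers the header longwords */
        else
            bytes = 0;

        for (i = 0; i + sizeof(pattern) <= bytes; i++) {
            if (memcmp(code + i, pattern, sizeof(pattern)) == 0) {
                /* candidate found: verify the surrounding code, then redirect
                   it to an optimized memcpy()/memmove() (not shown here) */
            }
        }
        seglist = (BPTR)seg[0];                   /* BPTR to the next segment */
    }
}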
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #46 on: January 03, 2015, 06:23:21 PM »
Quote from: SpeedGeek;780974
The CPU may only request a Burst cycle but the hardware (memory controller logic) makes the final decision on when (if ever) any Burst cycle will happen.

Exactly. But you silently assume that there is memory controller logic, and that this memory controller logic is smart enough to make the right decisions at all times. In fact, you can get away without ever touching burst. RAM would be on the Turbo card anyhow, chip RAM has to be cache-inhibited, and the rest of the I/O space has to be cache-inhibited as well. Cache-inhibited accesses do not burst, hence no extra logic is required. Or almost.

IOW, you rely on the hardware being well-behaved, and on the vendor having implemented extra logic just for a corner case. I really wonder where you get your confidence from. All I learned over the years was that whenever there was a chance to cut the budget, hardware vendors took it. Here you have one...

Take it as you like, but I call it "defensive programming".
 

Offline kolla

Re: CopyMem Quick & Small released!
« Reply #47 on: January 03, 2015, 06:30:20 PM »
Why not a patch to remove CopyMem() and CopyMemQuick() altogether, and then see what breaks ;)
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC
---
A3000/060CSPPC+CVPPC/128MB + 256MB BigRAM/Deneb USB
A4000/CS060/Mediator4000Di/Voodoo5/128MB
A1200/Blz1260/IndyAGA/192MB
A1200/Blz1260/64MB
A1200/Blz1230III/32MB
A1200/ACA1221
A600/V600v2/Subway USB
A600/Apollo630/32MB
A600/A6095
CD32/SX32/32MB/Plipbox
CD32/TF328
A500/V500v2
A500/MTec520
CDTV
MiSTer, MiST, FleaFPGAs and original Minimig
Peg1, SAM440 and Mac minis with MorphOS
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #48 on: January 03, 2015, 06:38:59 PM »
Quote from: Thorham;780971
1. Ask the user.
2. Icon tool type and have two icons.
You are already demanding too much from the average user. Does it make a measurable difference? Does another option provide an advantage? Or does it confuse the user?  
Quote from: Thorham;780971
Depends on the software. How much code are we talking about anyway?  
How much testing are we talking about? The more decisions you have in the code, the easier it is to break. In reality, for any serious-sized program, I prefer to have the minimum number of options to perform a given task - for example, rendering something on the screen. I can test that once, and then rely on the correctness of the Os (hopefully; with the given patches around, this is a somewhat arbitrary decision nowadays). I'm not talking about the "implemented in two weeks" kind of program.
Quote from: Thorham;780971
Two, maybe three kb extra? Hardly a waste if it means more users can enjoy the software.
Speed is not the only factor in how much you enjoy software. What about ease of use and correctness? There are many factors that play into making such a decision.
Quote from: Thorham;780971
Do that twice, transpose, write to chipmem. After that you can unroll to use the pipeline on 20+ and get the transposes almost for free. I don't see how that's going to be anywhere near as fast with the OS blit function, so in this case it's crystal clear that it matters, because it lowers the CPU requirements.
How much of the overall running time of the program is spent in that copy? How much development time goes into writing that? How much into testing? Would a user really bother? You are giving me all factors that make the "coder enjoy the development", but probably no argument concerning the overall "quality of experience" of the resulting program.

I cannot really give you a single "rule of thumb" for what is correct and what isn't. I would probably try first to use the Os, then check whether the program satisfies my needs. If I see any lags, I benchmark, find where the bottleneck is, and optimize there. If that means that I have to re-implement parts of what I could do with the Os, so be it, but that's rarely the case.
Quote from: Thorham;780971
I have another example. I wrote a simple real-time memory viewer. It opens a single bit plane 640x512 screen and blits 8x8 chars to the screen with my own code (which contains some optimized transposes from kalms' c2p routines). Very fast, and I highly doubt the OS can match that speed. It's important that such a program is fast because you're also running the program you're working on.
Maybe the Os wouldn't match the speed, but would it matter, actually? If I have a memory viewer (e.g. MonAm2, or COP), then I don't mind whether the screen updates faster than I type (or view). Actually, I would raise a couple of more important issues, as in "does it cooperate well with the rest of the system", "does it know not to touch I/O spaces to avoid interaction with the hardware", "can it print the memory contents to a printer and make a hardcopy". You have a very one-sided view of the qualities and requirements of the software, where in real life a lot of other aspects play a role, too. Whether the output of the program is twice as fast as that of a competing program is not important in most use cases.

Look, I understand your joy writing such a program, but in reality, users probably have other needs you didn't take into account.  
Quote from: Thorham;780971
That's the whole reason. Something doesn't run at sufficient speed, or you know this is going to happen, or simply want to reduce CPU usage as much as possible, so you write your own code. Nothing wrong with that.

This is Amiga land after all. Lots of not so fast < 68060s out there, and you can do more on those lower end machines if your code is faster.

For P96, the situation was really, "we need to emulate the blitter even on not-so-fast systems", and "CopyMemQuick() doesn't even provide the interface for doing what needs to be done", hence it was necessary to come up with something in assembler that had better be fast - because you can see the difference when moving windows around. In my average C program, if I copy a string around, I use memcpy(), the built-in compiler primitive from the standard C library, simply because it doesn't make any difference.

The situation was different, the requirements were different, the bottleneck was observed and benchmarked, hence a solution for the problem was required.
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #49 on: January 03, 2015, 08:08:59 PM »
Quote from: Thomas Richter;780985
You are already demanding too much from the average user.
I don't see how providing two icons, one for native and one for GFX card, or having something like a screen mode requester is demanding too much. If someone can't be bothered with that, then they have issues.

Quote from: Thomas Richter;780985
Does it make a measurable difference?
Oh, come on! The example I gave makes it crystal clear that it would make a difference.

Quote from: Thomas Richter;780985
How much testing are we talking about?
Not much, because we're talking about some simple blit routines here. It's not rocket science.

Quote from: Thomas Richter;780985
How much of the overall running time of the program is spent in that copy?
That particular blit happens for half the screen at about eight frames per second (320x256x7 bpls). The other half is similar, but you have only three sources instead of seven. The faster this runs, the better.

Quote from: Thomas Richter;780985
How much development time goes into writing that?
This kind of trivial code is very easy to write in a close to optimal way. Obviously, it's already written.

Quote from: Thomas Richter;780985
Would a user really bother?
They would get the option of running native, or if I'm going to implement it, GFX card. What do they have to bother with? They double click an icon and that's it.

Quote from: Thomas Richter;780985
You are giving me all factors that make the "coder enjoy the development", but probably no argument concerning the overall "quality of experience" of the resulting program.
Part of the quality of the experience comes from making sure people can actually play the game properly on a 25 mhz 68030 with AGA and some fastmem.

Quote from: Thomas Richter;780985
Maybe the Os wouldn't match the speed, but would it matter, actually? If I have a memory viewer (e.g. MonAm2, or COP), then I don't mind whether the screen updates faster than I type (or view). Actually, I would raise a couple of more important issues, as in "does it cooperate well with the rest of the system", "does it know not to touch I/O spaces to avoid interaction with the hardware", "can it print the memory contents to a printer and make a hardcopy".
I wrote this memviewer for my own needs, and wouldn't release it in its current state. The speed is a requirement, because I want to be able to use it with heavier software without slowing things down too much. Not to mention that it's realtime, and the screen is updated once per VBL.

Quote from: Thomas Richter;780985
You have a very one-sided view of the qualities and requirements of the software, where in real life a lot of other aspects play a role, too. Whether the output of the program is twice as fast as that of a competing program is not important in most use cases.
The way I see it, software must be well-written (this includes maintainability), functional, easy to use and fast. The reason why I seem focused on speed is because of the target platform I'm interested in: as close to a 68020 with some fastmem as I can get it without making any concessions (my reason for insisting on ASM has nothing to do with this).
« Last Edit: January 03, 2015, 08:11:28 PM by Thorham »
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #50 on: January 03, 2015, 08:23:24 PM »
Quote from: Thorham;780988
They would get the option of running native, or if I'm going to implement it, GFX card. What do they have to bother with? They double click an icon and that's it.


Part of the quality of the experience comes from making sure people can actually play the game properly on a 25 mhz 68030 with AGA and some fastmem.
Ah, you're talking about a game. That's a different matter entirely. If the Os doesn't give you the game speed you need, then this is a justification, of course. There is *some* support for moving objects in graphics, and that's even supported natively by P96, but admittedly, the Bob support in graphics is pretty much broken.
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #51 on: January 03, 2015, 08:51:30 PM »
Quote from: Thomas Richter;780989
Ah, you're talking about a game. That's a different matter entirely. If the Os doesn't give you the game speed you need, then this is a justification, of course. There is *some* support for moving objects in graphics, and that's even supported natively by P96, but admittedly, the Bob support in graphics is pretty much broken.
The requirements for the game are that I can have 320 animated tiles and 320 animated sprites (20x16 tile positions), all at the same time. All of these are 16x16 pixel aligned (sprites are 24 pixels high, and only one can move freely, because it's turn based). This all has to run at around eight fps (for the animations) and leave enough CPU time to handle the AI and the 28 kHz 14-bit stereo audio (stereo music with sound effects).

However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).

It's basically about how much effort you think is worthwhile to put into writing optimized custom code for things. It's also a hobby, and while you should of course try to actually finish software, it's also about writing the software the way you want (without making a mess). In a pro environment it's probably quite different.
« Last Edit: January 03, 2015, 08:55:11 PM by Thorham »
 

Offline itix

  • Hero Member
  • Join Date: Oct 2002
  • Posts: 2380
Re: CopyMem Quick & Small released!
« Reply #52 on: January 03, 2015, 09:25:01 PM »
Quote from: Thorham;780991
However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).


That would suck badly if antialiased text were introduced to AmigaOS.

But of course, there won't be a new AmigaOS. Hence patches are "future-safe". (Not bringing "NG" into the discussion here.)
My Amigas: A500, Mac Mini and PowerBook
 

Offline matthey

  • Hero Member
  • Join Date: Aug 2007
  • Posts: 1294
Re: CopyMem Quick & Small released!
« Reply #53 on: January 03, 2015, 09:51:55 PM »
Quote from: olsen;780979
I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.


It is true that most AmigaOS calls of the exec.library CopyMem() are small to medium sized, but there are *many* calls. It's not efficient to use CopyMem() for small copies because of the library JSR+JMP overhead, although CPU-optimized code can reduce the overall cost to be close to that of non-library code for all but the smallest copies. The AmigaOS CopyMem() uses a MOVE.B (A0)+,(A1)+ loop for small copies and a MOVEM.L loop for large copies. It's actually the MOVEM.L loop that is most inefficient, because it is only good for the 68000-68030 with large copies. An unrolled MOVE.L (A0)+,(A1)+ loop would be significantly better for the 68000-68060 in most cases.

tiny static-size copy: use '=' in C
small copy: use a quick loop
medium copy: use an unrolled loop (the 68060 doesn't benefit from unrolling in this case)
large copy: use a MOVEM.L loop for the 68000-68030, an unrolled MOVE16 loop for the 68040-68060
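
Purely to illustrate that tiering, a C sketch (the thresholds and names are invented; a real replacement would be written in assembly and would pick the 68000/68030 or 68040/68060 large-copy path once, when installed):

#include <stddef.h>
#include <string.h>

/* Illustration of the size-tiered strategy above. The thresholds (16, 256)
 * and the function name are invented for this sketch. */
void tiered_copy(void *dst, const void *src, size_t n)
{
    if (n < 16) {
        memcpy(dst, src, n);              /* tiny: the '=' / inlined memcpy case */
    } else if (n < 256) {
        unsigned char *d = dst;           /* small: plain quick loop */
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
    } else {
        /* medium/large: in assembly this would be an unrolled MOVE.L loop,
           or MOVEM.L / MOVE16 depending on the CPU; the sketch just defers
           to the library memcpy() */
        memcpy(dst, src, n);
    }
}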

There is a trade-off between different memory copy techniques, as SpeedGeek has mentioned. This applies to the exec.library CopyMem() as well as to the C memcpy() and memmove() functions. Most memcpy()/memmove() calls are small, and this is why vbcc uses inlined quick loops that get to work as fast as possible (after minimal alignment). SAS/C uses a subroutine call (BSR+RTS) to a poorly optimized unrolled loop with no alignment handling and a costly jump table at the end for the 68040+. The SAS/C memcpy() may be faster than the vbcc memcpy() in some cases for large aligned copies. Sadly, the SAS/C memcpy() probably beats the exec.library CopyMem() for medium to large copies on the 68040+.

With vbcc, the best speed comes from:

tiny static-size copy: use '=' in C
small copy: use the C memcpy() and memmove()
medium copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()
large copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()

If the exec.library CopyMem()/CopyMemQuick() used unrolled MOVE.L copy loops, then we could be in good shape without patching. Patching for the uncommon large copies would become optional. MOVE16 does have the advantage of not flushing the DCache on large copies, although it's questionable whether this is common enough and bug-free enough to be standard.

The new version of vbcc 0.9d was recently released by the way:

http://sun.hasenbraten.de/vbcc/


With SAS/C, the best speed comes from:

tiny static-size copy: use '=' in C
small copy: use the C memcpy() and memmove() for the 68000-68030, a CPU-optimized exec.library CopyMem() and CopyMemQuick() for the 68040+
medium copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()
large copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()

Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.

Quote from: olsen;780979

Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.


I considered this but decided it was better to optimize compiler link lib code next and vbcc was the easiest place to start ;).
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #54 on: January 03, 2015, 10:01:55 PM »
Quote from: Thorham;780991
However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).

This, however, is a pretty bad idea. The Os routine is quite ok, and there is little to be gained if your text editor is to support arbitrary fonts (and yes, that's really a desired and useful feature, given that you can adjust the screen size and hence the resolution).

The Os 1.3 Text() was a rather poor implementation that blitted glyph by glyph, but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there. The function used there is quite optimal given its genericity, and optimizations are likely only possible if you aim at specific font sizes only, e.g. 8x8 glyphs as for the topaz.font. However, this font is typically too small for today's applications.

For the record, there was a patch for 1.3 that optimized Text() for topaz.8 only ("FastFonts"), and a similar patch by myself that optimized topaz.8 (8x8) and topaz.9 (10x9) only. Both are pretty much obsolete by today's standards due to their inability to support arbitrary fonts or styles.
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #55 on: January 03, 2015, 10:08:48 PM »
Quote from: matthey;780996
Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.

To find a solution, one first has to identify the problem. And that's exactly what I do not see here. So far, nobody has yet mentioned a real-world problem (e.g. a program, a series of programs, a particular use case) where the current CopyMemQuick() is the bottleneck and not fast enough to address the needs of the user. I would rather say that if memory copying is your bottleneck, there is probably something wrong with your algorithm if it requires copying so much data in the first place.

But anyhow - I would have little problem exchanging it should there ever be a new version of exec, but as the situation currently is, I consider the option of a patch for an otherwise bug-free Os function less desirable than the small speed impact (if any) of CopyMemQuick() as we have it now.
 

Offline SpeedGeek (Topic starter)

Re: CopyMem Quick & Small released!
« Reply #56 on: January 04, 2015, 02:53:59 PM »
Quote from: Thomas Richter;780983
Exactly. But you silently assume that there is memory controller logic, and that this memory controller logic is smart enough to make the right decisions at all times. In fact, you can get away without ever touching burst. RAM would be on the Turbo card anyhow, chip RAM has to be cache-inhibited, and the rest of the I/O space has to be cache-inhibited as well. Cache-inhibited accesses do not burst, hence no extra logic is required. Or almost.

IOW, you rely on the hardware being well-behaved, and on the vendor having implemented extra logic just for a corner case. I really wonder where you get your confidence from. All I learned over the years was that whenever there was a chance to cut the budget, hardware vendors took it. Here you have one...

Take it as you like, but I call it "defensive programming".

Yes, I can implicitly (and correctly) assume that the accelerator card logic disables Burst by default, or permanently disables it on cards which don't support it (it could be memory controller logic, glue logic, PLD logic or even a pull-down/up resistor). Otherwise you wouldn't even be able to boot your Amiga. It's as simple as that.

Exec tries to enable the instruction cache in early startup. Now, what would happen if the CPU tried to run a Burst cycle to the Kickstart ROMs, Chip RAM or the ZorroII bus with Burst enabled, when none of the above support Burst?

Quote from: Thomas Richter;780999
To find a solution, one first has to identify the problem. And that's exactly what I do not see here. So far, nobody has yet mentioned a real-world problem (e.g. a program, a series of programs, a particular use case) where the current CopyMemQuick() is the bottleneck and not fast enough to address the needs of the user. I would rather say that if memory copying is your bottleneck, there is probably something wrong with your algorithm if it requires copying so much data in the first place.

But anyhow - I would have little problem exchanging it should there ever be a new version of exec, but as the situation currently is, I consider the option of a patch for an otherwise bug-free Os function less desirable than the small speed impact (if any) of CopyMemQuick() as we have it now.

One of many examples from Aminet (Vbak2091):

INTRODUCTION

ZorroII boards can only reach the lower 16MB of address space. So DMA SCSI controllers must find another way to transfer data to expansion RAM. Some of them (especially the A2091) do a very bad job in this situation. In an A4000/40 transfer rates may drop to 50KB/s. This program patches the (2nd.)scsi.device to use MEMF_24BITDMA RAM as a buffer followed (in case of CMD_READ) by CopyMem(). It was developed with the A4000/A2091 combination in mind, but should work with other configurations, too (see REQUIREMENTS). Some people reported good results with GVP controllers.
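
The bounce-buffer idea described there boils down to something like the following C sketch; the dma_read_into() placeholder and the function name are invented here, and the real patch of course hooks the scsi.device I/O path rather than exposing a function like this:

#include <exec/types.h>
#include <exec/memory.h>
#include <proto/exec.h>

/* Sketch of the bounce-buffer technique: DMA into memory the ZorroII
 * controller can actually reach (MEMF_24BITDMA), then CopyMem() the data up
 * to the caller's expansion-RAM destination. dma_read_into() stands in for
 * the controller's own transfer and is purely hypothetical. */
extern LONG dma_read_into(APTR buffer, ULONG length);

LONG BounceRead(APTR destination, ULONG length)
{
    APTR bounce = AllocMem(length, MEMF_24BITDMA | MEMF_PUBLIC);
    LONG error;

    if (bounce == NULL)
        return -1;                             /* no DMA-capable memory left */

    error = dma_read_into(bounce, length);     /* controller DMAs below 16MB */
    if (error == 0)
        CopyMem(bounce, destination, length);  /* CPU copies to expansion RAM */

    FreeMem(bounce, length);
    return error;
}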
« Last Edit: January 04, 2015, 05:27:42 PM by SpeedGeek »
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #57 on: January 05, 2015, 11:04:27 AM »
Quote from: Thomas Richter;780998
The Os routine is quite ok
No, it's not, hence the reason FBlit+FText makes a real difference.

Quote from: Thomas Richter;780998
but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there.
Which is slow, because you get additional memory accesses. Far better to do everything in registers, write to chipmem and be able to use the CPU pipeline on 68020+.

Quote from: Thomas Richter;780998
optimizations are likely only possible if you aim at specific font sizes only, e.g. 8x8 glyphs as for the topaz.font.
You can write a properly optimized font renderer for any normal text editor font size. You can also take syntax coloring into account and not write all bit planes for each character.
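
As a sketch of that idea (byte-aligned 8x8 glyphs on a non-interleaved planar bitmap; all names are made up, and the background is assumed to be pen 0): for a given pen, each plane receives either the glyph rows or cleared bytes, and planes known to stay at the background colour could be skipped entirely.

#include <exec/types.h>

/* Sketch: draw one 8x8 glyph at a byte-aligned x position into a
 * non-interleaved planar bitmap. For syntax colouring, each plane gets
 * either the glyph rows or cleared bytes, depending on the pen. */
void PutGlyph8(UBYTE *planes[], int depth, ULONG bytes_per_row,
               int x, int y, const UBYTE glyph[8], UBYTE pen)
{
    ULONG offset = (ULONG)y * bytes_per_row + (ULONG)(x >> 3);
    int d, row;

    for (d = 0; d < depth; d++) {
        UBYTE *dst = planes[d] + offset;
        UBYTE fill = ((pen >> d) & 1) ? 0xFF : 0x00;  /* plane carries glyph? */

        for (row = 0; row < 8; row++) {
            *dst = glyph[row] & fill;     /* glyph rows or cleared bytes */
            dst += bytes_per_row;
        }
    }
}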
 

Offline Georg

  • Jr. Member
  • Join Date: Feb 2002
  • Posts: 90
Re: CopyMem Quick & Small released!
« Reply #58 on: January 05, 2015, 11:15:56 AM »
Quote from: Thomas Richter;780998

The Os 1.3 Text() was a rather poor implementation that blitted glyph by glyph, but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there.


Maybe smart if this works on a "per cliprect" basis, but does it?

Otherwise for things like text output in hidden simple refresh windows (like output in a shell window while compiling something, with the source code text editor in the front hiding all or most of it) it can do a lot of unnecessary work in the off-screen buffers.

Similar for long text strings where big parts may end up being clipped away, like maybe in a listview gadget.
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #59 on: January 05, 2015, 03:33:51 PM »
Quote from: Georg;781067
Maybe smart if this works on a "per cliprect" basis, but does it?

Actually, it is a single buffer. Manually clipping the text before rendering it to screen would complicate matters a lot. Clipping is done in BltTemplate() of the graphics library once rendering is complete.
Quote from: Georg;781067
Otherwise for things like text output in hidden simple refresh windows (like output in a shell window while compiling something, with the source code text editor in the front hiding all or most of it) it can do a lot of unnecessary work in the off-screen buffers.
I wouldn't be so sure. Look, you have to clip at some point. You can either clip while rendering the glyphs (which is what 1.3 did) or clip only once. Given that the complexity of the clipping is pretty high compared to rendering the text itself, it is probably better to "do the additional work" because it results in a much simpler algorithm. I believe the right approach is to optimize for the *typical* case, and the typical case is that the window you render text to is front-most, thus no clipping is done.
Quote from: Georg;781067
Similar for long text strings where big parts may ends up being clipped away. Like maybe in a listview gadget.

Actually, the typical ASL/Reqtools requester isn't *that* stupid. I don't know how MUI works, but the system requesters only render those lines that are actually visible on the screen and not those that are clipped away completely.