1. Ask the user.
2. Icon tool type and have two icons.
You are already demanding too much from the average user. Does it make a measurable difference? Does another option provide an advantage? Or does it confuse the user?
Depends on the software. How much code are we talking about anyway?
How much testing are we talking about? The more decisions you have in the code, the easier it is to break. In reality, for any seriously sized program, I prefer to have the minimum number of options to perform a given task - for example, rendering something on the screen. I can test that once, and then rely on the correctness of the OS (hopefully; with the patches around, this is a somewhat arbitrary decision nowadays). I'm not talking about the "implemented in two weeks" program.
Two, maybe three KB extra? Hardly a waste if it means more users can enjoy the software.
Speed is not the only factor in enjoying software. What about ease of use and correctness? There are many factors that play into making such a decision.
Do that twice, transpose, write to chipmem. After that you can unroll to use the pipeline on 20+ and get the transposes almost for free. I don't see how that's going to be anywhere near as fast with the OS blit function, so in this case it's crystal clear that it matters, because it lowers the CPU requirements.
How much of the overall running time of the program is spent in that copy? How much development time goes into writing that? How much into testing? Would a user really care? You are giving me all the factors that make the "coder enjoy the development", but probably no argument concerning the overall "quality of experience" of the resulting program.
I cannot really give you a single "rule of thumb" for what is correct and what isn't. I would probably first try to use the OS, then check whether the program satisfies my needs. If I see any lag, I benchmark, find where the bottleneck is, and optimize there. If that means that I have to re-implement parts of what I could do with the OS, so be it, but that's rarely the case.
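To make the "benchmark first, then optimize" step concrete, here is a minimal sketch in portable C. Everything in it is an assumption for illustration: `copy_library`, `copy_custom` and `time_copy` are hypothetical names, `memcpy()` merely stands in for "the routine the system already gives you", and `clock()` is only a coarse timer. It is not how any particular Amiga program measures things.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so both candidates can be timed the same way. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline: the routine the library already provides. */
static void copy_library(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical hand-written candidate (deliberately naive here). */
static void copy_custom(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
}

/* Run fn `reps` times and return the elapsed CPU time in seconds. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, unsigned reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (unsigned i = 0; i < reps; i++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copy()` once per candidate on a buffer of realistic size gives two numbers to compare. Only if the copy also shows up as a significant share of the program's total running time is the custom routine worth its extra testing. An optimizing compiler may shorten or remove such loops, so a real measurement needs more care than this.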
I have another example. I wrote a simple real-time memory viewer. It opens a single bit plane 640x512 screen and blits 8x8 chars to the screen with my own code (which contains some optimized transposes from kalms' c2p routines). Very fast, and I highly doubt the OS can match that speed. It's important that such a program is fast because you're also running the program you're working on.
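For readers unfamiliar with the technique, here is a rough sketch in portable C of what writing an 8x8 character into a single 640x512 bit plane involves. It is not the poster's routine (which is described as using optimized transposes). The names `draw_glyph`, `BYTES_PER_ROW` and so on are made up for illustration, and the layout (80 bytes per row, most significant bit as the leftmost pixel, characters on byte-aligned cells) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: one bit plane, 640x512, 8 pixels per byte. */
enum {
    SCREEN_W      = 640,
    SCREEN_H      = 512,
    BYTES_PER_ROW = SCREEN_W / 8,  /* 80 bytes per scan line */
    TEXT_COLS     = SCREEN_W / 8,  /* 80 character cells per row */
    TEXT_ROWS     = SCREEN_H / 8   /* 64 character rows */
};

/* Draw one 8x8 glyph at character cell (col, row).
 * glyph[y] holds the 8 pixels of glyph line y, one bit per pixel. */
static void draw_glyph(uint8_t *plane, unsigned col, unsigned row,
                       const uint8_t glyph[8])
{
    assert(col < TEXT_COLS && row < TEXT_ROWS);

    /* A cell is exactly one byte wide, so each glyph line is a
     * single byte store; the lines are BYTES_PER_ROW apart. */
    uint8_t *dst = plane + (size_t)row * 8 * BYTES_PER_ROW + col;
    for (int y = 0; y < 8; y++) {
        *dst = glyph[y];
        dst += BYTES_PER_ROW;
    }
}
```

Because the cells are byte-aligned, no shifting or masking is needed, which is part of why a dedicated routine can be so cheap for this special case. A general-purpose blit has to handle arbitrary positions and sizes.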
Maybe the OS wouldn't match the speed, but would it actually matter? If I have a memory viewer (e.g. MonAm2, or COP), then I don't mind whether the screen updates faster than I type (or view). Actually, I would raise a couple of more important issues, such as "does it cooperate well with the rest of the system", "does it know not to touch I/O spaces to avoid interaction with the hardware", "can it print the memory contents to a printer and make a hardcopy". You have a very one-sided view of the qualities and requirements of the software, whereas in real life a lot of other aspects play a role, too. Whether the output of the program is twice as fast as that of a competing program is not important in most use cases.
Look, I understand your joy writing such a program, but in reality, users probably have other needs you didn't take into account.
That's the whole reason. Something doesn't run at sufficient speed, or you know this is going to happen, or simply want to reduce CPU usage as much as possible, so you write your own code. Nothing wrong with that.
This is Amiga land after all. Lots of not so fast < 68060s out there, and you can do more on those lower end machines if your code is faster.
For P96, the situation was really "we need to emulate the blitter even on not so fast systems" and "CopyMemQuick() doesn't even provide the interface for doing what needs to be done", hence it was necessary to come up with something in assembler that had better be fast - because you can see the difference when moving windows around. In my average C program, if I copy a string around, I use memcpy(), the built-in compiler primitive from the standard C library, simply because it doesn't make any difference.
The situation was different and the requirements were different: the bottleneck was observed and benchmarked, hence a solution for the problem was required.