I pretty much doubt this. If there is no cliprect (i.e. the rastport has no layer) graphics.library goes directly to the hardware on native amigas. There is no overhead in such a case. Again, before making such claims, please measure.
If there is a graphics card, it goes to the hardware of the graphics card if it supports 2D acceleration, same story. Only difference: It works on all hardware.
I agree with your point on using the AmigaOS (where possible given constraints) but I disagree with the "no overhead" claim to using the AmigaOS, even if it "goes directly to the the hardware". Function calls through the jump table have overhead and compiled AmigaOS code is not optimal. For example, your new layers.library is riddled with instructions like:
lea (0,a6),a4 ; optimize to move.l a6,a4
move.l #0,-(sp) ; optimize to clr.l -(sp)
lea (a3),a6 ; optimize to move.l a3,a6
That's just talking about peephole optimizations. It's compiled for the 68000 which isn't a huge loss with this particular code but it's a few percent more. Actually, the biggest gain could be using the auto word extension of address registers which SAS/C doesn't take advantage of (MVS or MOVE.W+EXT.L fusion/folding would be a huge gain with this code). There is no instruction scheduling for superscalar processors like the 68060. Cache accesses sometimes look like they aren't even avoided like this:
move.w (a5),d0
move.w (a5)+,d1
We are not talking about a cycle or 2. All these lack of optimizations add up and then programmers roll their own code to gain 10%+ speed over the AmigaOS. I want programmers to use the AmigaOS functions (but not required). We need to improve compilers and try to make code close to optimal for this to happen. Call me a cycle counter and ignore me if you like.
Write software that's fast on 20s and 30s with AGA, and when possible OCS/ECS, while keeping it system friendly, but without sacrificing speed (no need for syskill practices, custom screens and what not, of course).
With a little bit of learning, it's possible to write code that is fairly optimal on the 68020-68060. Instruction scheduling for a 68060 generally doesn't affect 68020/68030 performance but can double 68060 performance with some code. Learning about modern CPU pipelining, superscalar execution, caches, hazards/bubbles, etc. (not just 68020/68030 timings and specifics) will improve your code for the 68020/68030 also. We may get a superscalar 68k fpga CPU someday where your code will magically be much faster too

.
What you say it correct.
But it does not need to be like this.
A CIA need minimal FPGA
By adding the smallest and cheapest FPGA to a new system - every system could be made CIA compatible for a price of close to nothing.
The same is true with USB and accessing them via DFFxxx registers.
A USB to DFFxxx bridge logic costs around $2.
Technically there is really no reason to not havce both.
A NEO sytem could with no problem at all implement USB and DFF chipset and CIAs for nearly not money.
The SAM 440 and 460 have a small Lattice fpga. The mentality of some of the so called next generation Amiga guys is to get away from hardware dependency. They also may be trying to keep their AmigaOS closed for proprietary and security reasons.