
Author Topic: CopyMem Quick & Small released!  (Read 14164 times)


Offline psxphill

Re: CopyMem Quick & Small released!
« Reply #44 on: January 03, 2015, 04:57:16 PM »
Quote from: SpeedGeek;780974
The CPU may only request a Burst cycle but the hardware (memory controller logic) makes the final decision on when (if ever) any Burst cycle will happen.

And when you rely on this, it's slower than not requesting the burst in the first place. Does Zorro III actually support it?
« Last Edit: January 03, 2015, 04:59:36 PM by psxphill »
 

Offline olsen

Re: CopyMem Quick & Small released!
« Reply #45 on: January 03, 2015, 04:59:56 PM »
Quote from: biggun;780976
One question:

If you compare the time needed to develop the memcopy with the time spent on talking about / defending it here, how does this time compare?
Seems to me it's not so much about developing something, it's more about making sure that what is developed has a positive impact and no side-effects. Measure twice, cut once ;)

However, this somewhat sober and "not much fun" side of system software development doesn't seem to be much in favour here. More or less, this reflects that the Amiga in its current form is a hobby.

Nothing wrong with computers as hobbies, or the fun of tinkering with the operating system. Spoilsports like Thomas and me do seem to have the engineering side of the operating system patches in mind, because that's what a lot of software builds upon, and it's sadly too easy to break things and never quite find out what actually caused the problems. If you're playing with the fundamentals of the operating system there comes a bit of responsibility with it, and that can't always be fun.

I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.
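For illustration, a minimal sketch in C of that kind of small-copy fast path (the function names are made up, and the ROM routine is of course 68k assembly rather than C):

#include <stddef.h>

/* Sketch of the fast path described above: transfers shorter than 256 bytes
 * go through a short, partly unrolled byte loop (the C analogue of a
 * move.b (a0)+,(a1)+ loop); everything else falls back to a general routine.
 * copymem_general() is a placeholder, not a real OS or library function. */
extern void copymem_general(const void *src, void *dst, size_t size);

void copymem_smallpath(const void *src, void *dst, size_t size)
{
    if (size < 256) {
        const unsigned char *s = src;
        unsigned char *d = dst;

        while (size >= 8) {               /* unrolled by eight */
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
            d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
            d += 8; s += 8; size -= 8;
        }
        while (size--)                    /* remaining 0..7 bytes */
            *d++ = *s++;
    } else {
        copymem_general(src, dst, size);  /* hypothetical general-purpose copy */
    }
}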

I have no data on how third-party applications use CopyMem()/CopyMemQuick(), but if these are written in 'C', it's likely that they will use the memcpy()/memmove() function which the standard library provides, and that typically isn't some crude, bumbling implementation. However, it might benefit from optimization.

Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.
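A very rough sketch of what such a patcher could look like: walk the segment list that LoadSeg() returns and scan each segment for a known byte pattern of a statically linked memcpy(). The pattern bytes below are placeholders, and the actual redirection, cache flushing and false-positive checking are left out:

#include <exec/types.h>
#include <dos/dos.h>
#include <string.h>

/* Sketch only: walk a DOS segment list (as returned by LoadSeg()) and scan
 * each segment for a known byte sequence, e.g. the entry code of a
 * statically linked memcpy(). The pattern below is a placeholder. */
static const UBYTE pattern[] = { 0x20, 0x6F, 0x00, 0x04 };   /* hypothetical */

static void ScanSegList(BPTR seglist)
{
    while (seglist != 0) {
        ULONG *seg   = (ULONG *)BADDR(seglist);   /* segment as a real pointer */
        ULONG  bytes = seg[-1];                   /* block size stored in front */
        UBYTE *code  = (UBYTE *)&seg[1];          /* code starts after next ptr */
        ULONG  i;

        if (bytes > 2 * sizeof(ULONG))
            bytes -= 2 * sizeof(ULONG);           /* size covers the header longwords */
        else
            bytes = 0;

        for (i = 0; i + sizeof(pattern) <= bytes; i++) {
            if (memcmp(code + i, pattern, sizeof(pattern)) == 0) {
                /* candidate found: verify the surrounding code, then redirect
                   it to an optimized memcpy()/memmove() (not shown here) */
            }
        }
        seglist = (BPTR)seg[0];                   /* BPTR to the next segment */
    }
}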
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #46 on: January 03, 2015, 06:23:21 PM »
Quote from: SpeedGeek;780974
The CPU may only request a Burst cycle but the hardware (memory controller logic) makes the final decision on when (if ever) any Burst cycle will happen.

Exactly. But you silently assume that there is memory controller logic, and that this memory controller logic is smart enough to make the right decisions at all times. In fact, you can get away without ever touching burst. RAM would be on the Turbo card anyhow, chip RAM has to be cache-inhibited, and the rest of the I/O space has to be cache-inhibited as well. Cache-inhibited accesses do not burst, hence no extra logic is required. Or almost.

IOW, you rely on the hardware being well-behaved, and on the vendor having implemented extra logic just for a corner case. I really wonder where you get your confidence from. All I learned over the years was that whenever there was a chance to cut the budget, hardware vendors took it. Here you have one...

Take it as you like, but I call it "defensive programming".
 

Offline kolla

Re: CopyMem Quick & Small released!
« Reply #47 on: January 03, 2015, 06:30:20 PM »
Why not a patch to remove CopyMem() and CopyMemQuick() altogether, and then see what breaks ;)
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC
---
A3000/060CSPPC+CVPPC/128MB + 256MB BigRAM/Deneb USB
A4000/CS060/Mediator4000Di/Voodoo5/128MB
A1200/Blz1260/IndyAGA/192MB
A1200/Blz1260/64MB
A1200/Blz1230III/32MB
A1200/ACA1221
A600/V600v2/Subway USB
A600/Apollo630/32MB
A600/A6095
CD32/SX32/32MB/Plipbox
CD32/TF328
A500/V500v2
A500/MTec520
CDTV
MiSTer, MiST, FleaFPGAs and original Minimig
Peg1, SAM440 and Mac minis with MorphOS
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #48 on: January 03, 2015, 06:38:59 PM »
Quote from: Thorham;780971
1. Ask the user.
2. Icon tool type and have two icons.
You are already demanding too much from the average user. Does it make a measurable difference? Does another option provide an advantage? Or does it confuse the user?  
Quote from: Thorham;780971
Depends on the software. How much code are we talking about anyway?  
How much testing are we talking about? The more decisions you have in the code, the easier it is to break. In reality, for any serious-sized program, I prefer to have the minimum number of options to perform a given task - for example, rendering something on the screen. I can test that once, and then rely on the correctness of the Os (hopefully; with the given patches around, this is a somewhat arbitrary decision nowadays). I'm not talking about the "implemented in two weeks" kind of program.
Quote from: Thorham;780971
Two, maybe three kb extra? Hardly a waste if it means more users can enjoy the software.
Speed is not the only factor in how much you enjoy software. What about ease of use and correctness? There are many factors that play into making such a decision.
Quote from: Thorham;780971
Do that twice, transpose, write to chipmem. After that you can unroll to use the pipeline on 20+ and get the transposes almost for free. I don't see how that's going to be anywhere near as fast with the OS blit function, so in this case it's crystal clear that it matters, because it lowers the CPU requirements.
How much of the overall running time of the program is spent in that copy? How much development time goes into writing that? How much into testing? Would a user really bother? You are giving me all factors that make the "coder enjoy the development", but probably no argument concerning the overall "quality of experience" of the resulting program.

I cannot really give you a single "rule of thumb" for what is correct and what isn't. I would probably try first to use the Os, then check whether the program satisfies my needs. If I see any lags, I benchmark, find where the bottleneck is, and optimize there. If that means that I have to re-implement parts of what I could do with the Os, so be it, but that's rarely the case.
Quote from: Thorham;780971
I have another example. I wrote a simple real-time memory viewer. It opens a single bit plane 640x512 screen and blits 8x8 chars to the screen with my own code (which contains some optimized transposes from kalms' c2p routines). Very fast, and I highly doubt the OS can match that speed. It's important that such a program is fast because you're also running the program you're working on.
Maybe the Os wouldn't match the speed, but would it matter, actually? If I have a memory viewer (e.g. MonAm2, or COP), then I don't mind whether the screen updates faster than I type (or view). Actually, I would raise a couple of more important issues, as in "does it cooperate well with the rest of the system", "does it know not to touch I/O spaces to avoid interaction with the hardware", "can it print the memory contents to a printer and make a hardcopy". You have a very one-sided view of the qualities and requirements of the software, where in real life a lot of other aspects play a role, too. Whether the output of the program is twice as fast as that of a competing program is not important in most use cases.

Look, I understand your joy writing such a program, but in reality, users probably have other needs you didn't take into account.  
Quote from: Thorham;780971
That's the whole reason. Something doesn't run at sufficient speed, or you know this is going to happen, or simply want to reduce CPU usage as much as possible, so you write your own code. Nothing wrong with that.

This is Amiga land after all. Lots of not so fast < 68060s out there, and you can do more on those lower end machines if your code is faster.

For P96, the situation was really, "we need to emulate the blitter even on not-so-fast systems", and "CopyMemQuick() doesn't even provide the interface for doing what needs to be done", hence it was necessary to come up with something in assembler that had better be fast - because you can see the difference when moving windows around. In my average C program, if I copy a string around, I use memcpy(), the built-in compiler primitive from the standard C library, simply because it doesn't make any difference.

The situation was different, the requirements were different, the bottleneck was observed and benchmarked, hence a solution for the problem was required.
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #49 on: January 03, 2015, 08:08:59 PM »
Quote from: Thomas Richter;780985
You are already demanding too much from the average user.
I don't see how providing two icons, one for native and one for GFX card, or having something like a screen mode requester is demanding too much. If someone can't be bothered with that, then they have issues.

Quote from: Thomas Richter;780985
Does it make a measurable difference?
Oh, come on! The example I gave makes it crystal clear that it would make a difference.

Quote from: Thomas Richter;780985
How much testing are we talking about?
Not much, because we're talking about some simple blit routines here. It's not rocket science.

Quote from: Thomas Richter;780985
How much of the overall running time of the program is spent in that copy?
That particular blit happens for half the screen at about eight frames per second (320x256x7 bpls). The other half is similar, but you have only three sources instead of seven. The faster this runs, the better.

Quote from: Thomas Richter;780985
How much development time goes into writing that?
This kind of trivial code is very easy to write in a close to optimal way. Obviously, it's already written.

Quote from: Thomas Richter;780985
Would a user really bother?
They would get the option of running native, or if I'm going to implement it, GFX card. What do they have to bother with? They double click an icon and that's it.

Quote from: Thomas Richter;780985
You are giving me all factors that make the "coder enjoy the development", but probably no argument concerning the overall "quality of experience" of the resulting program.
Part of the quality of the experience comes from making sure people can actually play the game properly on a 25 mhz 68030 with AGA and some fastmem.

Quote from: Thomas Richter;780985
Maybe the Os wouldn't match the speed, but would it matter, actually? If I have a memory viewer (e.g. MonAm2, or COP), then I don't mind whether the screen updates faster than I type (or view). Actually, I would raise a couple of more important issues, as in "does it cooperate well with the rest of the system", "does it know not to touch I/O spaces to avoid interaction with the hardware", "can it print the memory contents to a printer and make a hardcopy".
I wrote this memviewer for my own needs, and wouldn't release it in its current state. The speed is a requirement, because I want to be able to use it with heavier software without slowing things down too much. Not to mention that it's realtime, and the screen is updated once per VBL.

Quote from: Thomas Richter;780985
You have a very one-sided view of the qualities and requirements of the software, where in real life a lot of other aspects play a role, too. Whether the output of the program is twice as fast as that of a competing program is not important in most use cases.
The way I see it, software must be well-written (this includes maintainability), functional, easy to use and fast. The reason why I seem focused on speed is because of the target platform I'm interested in: as close to a 68020 with some fastmem as I can get it without making any concessions (my reason for insisting on ASM has nothing to do with this).
« Last Edit: January 03, 2015, 08:11:28 PM by Thorham »
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #50 on: January 03, 2015, 08:23:24 PM »
Quote from: Thorham;780988
They would get the option of running native, or if I'm going to implement it, GFX card. What do they have to bother with? They double click an icon and that's it.


Part of the quality of the experience comes from making sure people can actually play the game properly on a 25 mhz 68030 with AGA and some fastmem.
Ah, you're talking about a game. That's a different matter entirely. If the Os doesn't give you the game speed you need, then this is a justification, of course. There is *some* support for moving objects in graphics, and that's even supported natively by P96, but admittedly, the Bob support in graphics is pretty much broken.
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #51 on: January 03, 2015, 08:51:30 PM »
Quote from: Thomas Richter;780989
Ah, you're talking about a game. That's a different matter entirely. If the Os doesn't give you the game speed you need, then this is a justification, of course. There is *some* support for moving objects in graphics, and that's even supported natively by P96, but admittedly, the Bob support in graphics is pretty much broken.
The requirements for the game are that I can have 320 animated tiles and 320 animated sprites (20x16 tile positions), all at the same time. All of these are 16x16 pixel aligned (sprites are 24 pixels high, and only one can move freely, because it's turn based). This all has to run at around eight fps (for the animations) and leave enough CPU time to handle the AI and the 28 kHz 14-bit stereo audio (stereo music with sound effects).

However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).

It's basically about how much effort you think is worthwhile to put into writing optimized custom code for things. It's also a hobby, and while you should of course try to actually finish software, it's also about writing the software the way you want (without making a mess). In a pro environment it's probably quite different.
« Last Edit: January 03, 2015, 08:55:11 PM by Thorham »
 

Offline itix

  • Hero Member
  • Join Date: Oct 2002
  • Posts: 2380
Re: CopyMem Quick & Small released!
« Reply #52 on: January 03, 2015, 09:25:01 PM »
Quote from: Thorham;780991
However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).


That would suck badly if antialiased text were introduced to AmigaOS.

But of course, there won't be a new AmigaOS. Hence patches are "future-safe". (Not bringing "NG" into the discussion here.)
My Amigas: A500, Mac Mini and PowerBook
 

Offline matthey

  • Hero Member
  • Join Date: Aug 2007
  • Posts: 1294
Re: CopyMem Quick & Small released!
« Reply #53 on: January 03, 2015, 09:51:55 PM »
Quote from: olsen;780979
I'm not sure if this has been mentioned before, but the operating system itself, as shipped, hardly ever uses CopyMem() or CopyMemQuick() in a way which would benefit from optimization. In ROM CopyMem() is used to save space, and it usually moves only small amounts of data around, with the NCR scsi.device and ram-handler being the exceptions. The disk-loaded operating system components tend to prefer their own memcpy()/memmove() implementations, and only programs which were written with saving disk space in mind (e.g. the prefs editors) use CopyMem() to some extent. Again: only small amounts of data are being copied, which in most cases means that you could "optimize" CopyMem() by plugging in a short unrolled move.b (a0)+,(a1)+ loop for transfers shorter than 256 bytes.


It is true that most AmigaOS calls of the exec.library CopyMem() are small to medium sized, but there are *many* calls. It's not efficient to use CopyMem() for small copies because of the library JSR+JMP overhead, although CPU-optimized code can reduce the overall cost to be close to that of non-library code for all but the smallest copies. The AmigaOS CopyMem() uses a MOVE.B (A0)+,(A1)+ loop for small copies and a MOVEM.L loop for large copies. It's actually the MOVEM.L loop that is most inefficient, because it is only good for the 68000-68030 with large copies. An unrolled MOVE.L (A0)+,(A1)+ loop would be significantly better for the 68000-68060 in most cases.

tiny static-size copy: use '=' in C
small copy: use a quick loop
medium copy: use an unrolled loop (the 68060 doesn't benefit from unrolling in this case)
large copy: use a MOVEM.L loop for the 68000-68030, an unrolled MOVE16 loop for the 68040-68060
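
Purely to illustrate that tiering, a C sketch (the thresholds and names are invented; a real replacement would be written in assembly and would pick the 68000/68030 or 68040/68060 large-copy path once, when installed):

#include <stddef.h>
#include <string.h>

/* Illustration of the size-tiered strategy above. The thresholds (16, 256)
 * and the function name are invented for this sketch. */
void tiered_copy(void *dst, const void *src, size_t n)
{
    if (n < 16) {
        memcpy(dst, src, n);              /* tiny: the '=' / inlined memcpy case */
    } else if (n < 256) {
        unsigned char *d = dst;           /* small: plain quick loop */
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
    } else {
        /* medium/large: in assembly this would be an unrolled MOVE.L loop,
           or MOVEM.L / MOVE16 depending on the CPU; the sketch just defers
           to the library memcpy() */
        memcpy(dst, src, n);
    }
}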

There is a trade-off between different memory copy techniques, as SpeedGeek has mentioned. This applies to the exec.library CopyMem() as well as to the C memcpy() and memmove() functions. Most memcpy()/memmove() calls are small, and this is why vbcc uses inlined quick loops that get to work as fast as possible (after minimal alignment). SAS/C uses a subroutine call (BSR+RTS) to a poorly optimized unrolled loop with no alignment handling and a costly jump table at the end for the 68040+. The SAS/C memcpy() may be faster than the vbcc memcpy() in some cases for large aligned copies. Sadly, the SAS/C memcpy() probably beats the exec.library CopyMem() for medium to large copies on the 68040+.

With vbcc, the best speed comes from:

tiny static-size copy: use '=' in C
small copy: use the C memcpy() and memmove()
medium copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()
large copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()

If the exec.library CopyMem()/CopyMemQuick() used unrolled MOVE.L copy loops, then we could be in good shape without patching. Patching for the uncommon large copies would become optional. MOVE16 does have the advantage of not flushing the DCache on large copies, although it's questionable whether this is common enough and bug-free enough to be standard.

The new version of vbcc 0.9d was recently released by the way:

http://sun.hasenbraten.de/vbcc/


With SAS/C, the best speed comes from:

tiny static-size copy: use '=' in C
small copy: use the C memcpy() and memmove() for the 68000-68030, a CPU-optimized exec.library CopyMem() and CopyMemQuick() for the 68040+
medium copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()
large copy: use a CPU-optimized exec.library CopyMem() and CopyMemQuick()

Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.

Quote from: olsen;780979

Now if you wanted to make a difference and speed up copying operations that are measurable and will affect a large number of programs, I'd propose a project to scan the executable code loaded by LoadSeg() and friends and replace the SAS/C, Lattice 'C' and Aztec 'C' statically linked library implementations of their respective memcpy()/memmove() functions with something much nicer. That would not be quite the "low-hanging fruit" of changing the CopyMem()/CopyMemQuick() implementation, but it might have a much greater impact.


I considered this but decided it was better to optimize compiler link lib code next and vbcc was the easiest place to start ;).
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #54 on: January 03, 2015, 10:01:55 PM »
Quote from: Thorham;780991
However, I would write my own text blitting routine if I were to write a text editor, for example. The OS simply uses the blitter, hence the reason for FBlit and FText making a real difference (also, double scan modes).

This, however, is a pretty bad idea. The Os routine is quite ok, and there is little to be gained if your text editor is to support arbitrary fonts (and yes, that's really a desired and useful feature, given that you can adjust the screen size and hence the resolution).

The Os 1.3 Text() was a rather poor implementation that blitted glyph by glyph, but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there. The function used there is quite optimal given its genericity, and optimizations are likely only possible if you aim at specific font sizes only, e.g. 8x8 glyphs as for the topaz.font. However, this font is typically too small for today's applications.

For the record, there was a patch for 1.3 that optimized Text() for topaz.8 only ("FastFonts"), and a similar patch by myself that optimized topaz.8 (8x8) and topaz.9 (10x9) only. Both are pretty much obsolete by today's standards due to their inability to support arbitrary fonts or styles.
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #55 on: January 03, 2015, 10:08:48 PM »
Quote from: matthey;780996
Without patching or changing exec.library CopyMem() and CopyMemQuick(), the available options are not good. Most 68040-68060 memory copies will be significantly slower than what is possible. We want programmers to be able to use compilers and the AmigaOS without wasting time and code creating faster re-implementations of functions. This is what ThoR seems to ignore. He wants to stop the patching chaos but ignores the reason for the patching and the solution.

To find a solution, one first has to identify the problem. And that's exactly what I do not see here. So far, nobody has yet mentioned a real-world problem (e.g. a program, a series of programs, a particular use case) where the current CopyMemQuick() is the bottleneck and not fast enough to address the needs of the user. I would rather say that if memory copying is your bottleneck, there is probably something wrong with your algorithm if it requires copying so much data in the first place.

But anyhow - I would have little problem exchanging it should there ever be a new version of exec, but as the situation currently is, I consider the option of a patch for an otherwise bug-free Os function less desirable than the small speed impact (if any) of CopyMemQuick() as we have it now.
 

Offline SpeedGeek (Topic starter)

Re: CopyMem Quick & Small released!
« Reply #56 on: January 04, 2015, 02:53:59 PM »
Quote from: Thomas Richter;780983
Exactly. But you silently assume that there is memory controller logic, and that this memory controller logic is smart enough to make the right decisions at all times. In fact, you can get away without ever touching burst. RAM would be on the Turbo card anyhow, chip RAM has to be cache-inhibited, and the rest of the I/O space has to be cache-inhibited as well. Cache-inhibited accesses do not burst, hence no extra logic is required. Or almost.

IOW, you rely on the hardware being well-behaved, and on the vendor having implemented extra logic just for a corner case. I really wonder where you get your confidence from. All I learned over the years was that whenever there was a chance to cut the budget, hardware vendors took it. Here you have one...

Take it as you like, but I call it "defensive programming".

Yes, I can implicitly (and correctly) assume that the accelerator card logic disables Burst by default, or permanently disables it on cards which don't support it (it could be memory controller logic, glue logic, PLD logic or even a pull-down/up resistor). Otherwise you wouldn't even be able to boot your Amiga. It's as simple as that.

Exec tries to enable the instruction cache in early startup. Now, what would happen if the CPU tried to run a Burst cycle to the Kickstart ROMs, Chip RAM or the ZorroII bus with Burst enabled, when none of the above support Burst?

Quote from: Thomas Richter;780999
To find a solution, one first has to identify the problem. And that's exactly what I do not see here. So far, nobody has yet mentioned a real-world problem (e.g. a program, a series of programs, a particular use case) where the current CopyMemQuick() is the bottleneck and not fast enough to address the needs of the user. I would rather say that if memory copying is your bottleneck, there is probably something wrong with your algorithm if it requires copying so much data in the first place.

But anyhow - I would have little problem exchanging it should there ever be a new version of exec, but as the situation currently is, I consider the option of a patch for an otherwise bug-free Os function less desirable than the small speed impact (if any) of CopyMemQuick() as we have it now.

One of many examples from Aminet (Vbak2091):

INTRODUCTION

ZorroII boards can only reach the lower 16MB of address space. So DMA SCSI controllers must find another way to transfer data to expansion RAM. Some of them (especially the A2091) do a very bad job in this situation. In an A4000/40 transfer rates may drop to 50KB/s. This program patches the (2nd.)scsi.device to use MEMF_24BITDMA RAM as a buffer followed (in case of CMD_READ) by CopyMem(). It was developed with the A4000/A2091 combination in mind, but should work with other configurations, too (see REQUIREMENTS). Some people reported good results with GVP controllers.
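
The bounce-buffer idea described there boils down to something like the following C sketch; the dma_read_into() placeholder and the function name are invented here, and the real patch of course hooks the scsi.device I/O path rather than exposing a function like this:

#include <exec/types.h>
#include <exec/memory.h>
#include <proto/exec.h>

/* Sketch of the bounce-buffer technique: DMA into memory the ZorroII
 * controller can actually reach (MEMF_24BITDMA), then CopyMem() the data up
 * to the caller's expansion-RAM destination. dma_read_into() stands in for
 * the controller's own transfer and is purely hypothetical. */
extern LONG dma_read_into(APTR buffer, ULONG length);

LONG BounceRead(APTR destination, ULONG length)
{
    APTR bounce = AllocMem(length, MEMF_24BITDMA | MEMF_PUBLIC);
    LONG error;

    if (bounce == NULL)
        return -1;                             /* no DMA-capable memory left */

    error = dma_read_into(bounce, length);     /* controller DMAs below 16MB */
    if (error == 0)
        CopyMem(bounce, destination, length);  /* CPU copies to expansion RAM */

    FreeMem(bounce, length);
    return error;
}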
« Last Edit: January 04, 2015, 05:27:42 PM by SpeedGeek »
 

Offline Thorham

  • Hero Member
  • Join Date: Oct 2009
  • Posts: 1150
Re: CopyMem Quick & Small released!
« Reply #57 on: January 05, 2015, 11:04:27 AM »
Quote from: Thomas Richter;780998
The Os routine is quite ok
No, it's not, hence the reason FBlit+FText makes a real difference.

Quote from: Thomas Richter;780998
but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there.
Which is slow, because you get additional memory accesses. Far better to do everything in registers, write to chipmem and be able to use the CPU pipeline on 68020+.

Quote from: Thomas Richter;780998
optimizations are likely only possible if you aim at specific font sizes only, e.g. 8x8 glyphs as for the topaz.font.
You can write a properly optimized font renderer for any normal text editor font size. You can also take syntax coloring into account and not write all bit planes for each character.
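
As a sketch of that idea (byte-aligned 8x8 glyphs on a non-interleaved planar bitmap; all names are made up, and the background is assumed to be pen 0): for a given pen, each plane receives either the glyph rows or cleared bytes, and planes known to stay at the background colour could be skipped entirely.

#include <exec/types.h>

/* Sketch: draw one 8x8 glyph at a byte-aligned x position into a
 * non-interleaved planar bitmap. For syntax colouring, each plane gets
 * either the glyph rows or cleared bytes, depending on the pen. */
void PutGlyph8(UBYTE *planes[], int depth, ULONG bytes_per_row,
               int x, int y, const UBYTE glyph[8], UBYTE pen)
{
    ULONG offset = (ULONG)y * bytes_per_row + (ULONG)(x >> 3);
    int d, row;

    for (d = 0; d < depth; d++) {
        UBYTE *dst = planes[d] + offset;
        UBYTE fill = ((pen >> d) & 1) ? 0xFF : 0x00;  /* plane carries glyph? */

        for (row = 0; row < 8; row++) {
            *dst = glyph[row] & fill;     /* glyph rows or cleared bytes */
            dst += bytes_per_row;
        }
    }
}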
 

Offline Georg

  • Jr. Member
  • Join Date: Feb 2002
  • Posts: 90
Re: CopyMem Quick & Small released!
« Reply #58 on: January 05, 2015, 11:15:56 AM »
Quote from: Thomas Richter;780998

The Os 1.3 Text() was a rather poor implementation that blitted glyph by glyph, but from 2.0 on graphics is smart enough to place the text into an off-screen buffer first and blit from there.


Maybe smart if this works on a "per cliprect" basis, but does it?

Otherwise for things like text output in hidden simple refresh windows (like output in a shell window while compiling something, with the source code text editor in the front hiding all or most of it) it can do a lot of unnecessary work in the off-screen buffers.

Similar for long text strings where big parts may end up being clipped away, like maybe in a listview gadget.
 

guest11527

  • Guest
Re: CopyMem Quick & Small released!
« Reply #59 on: January 05, 2015, 03:33:51 PM »
Quote from: Georg;781067
Maybe smart if this works on a "per cliprect" basis, but does it?

Actually, it is a single buffer. Manually clipping the text before rendering it to screen would complicate matters a lot. Clipping is done in BltTemplate() of the graphics library once rendering is complete.
Quote from: Georg;781067
Otherwise for things like text output in hidden simple refresh windows (like output in a shell window while compiling something, with the source code text editor in the front hiding all or most of it) it can do a lot of unnecessary work in the off-screen buffers.
I wouldn't be so sure. Look, you have to clip at some point. You can either clip while rendering the glyphs (which is what 1.3 did) or clip only once. Given that the complexity of the clipping is pretty high compared to rendering the text itself, it is probably better to "do the additional work" because it results in a much simpler algorithm. I believe the right approach is to optimize for the *typical* case, and the typical case is that the window you render text to is front-most, thus no clipping is done.
Quote from: Georg;781067
Similar for long text strings where big parts may ends up being clipped away. Like maybe in a listview gadget.

Actually, the typical ASL/Reqtools requester isn't *that* stupid. I don't know how MUI works, but the system requesters only render those lines that are actually visible on the screen and not those that are clipped away completely.