Author Topic: CopyMem Quick & Small released! (Read 14385 times)

matthey · « **Reply #74 from previous page:** January 07, 2015, 02:15:23 AM »

Quote from: psxphill;781156

I've heard that argument before and I don't buy it. If it's easily possible to write code that can render text faster then you should do that, because there are easily situations where an average editor is too slow. Like if you're running something reasonably intensive in the background.

Just being fast enough when nothing else is running isn't fast enough.

Sure we need it all to be standardised and consistent so it makes it easy to write software, but that should be doable.

I agree. I like the idea of using the OS but it needs to provide reasonably optimal functions. Is aligning the destination and using an unrolled MOVE.L loop too much to ask for CopyMem()/CopyMemQuick() when it is competitively the fastest for the 68000-68060? Would it be a bad thing if Olsen sold more copies of Roadshow because the memory copying bottleneck was reduced? We need to improve and use Amiga profilers but memory copying is a CPU intensive task that is easily improved. The Amiga philosophy has always been about efficiency and not just replacing the CPU with a faster one.
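To make "aligning the destination and using an unrolled MOVE.L loop" concrete, here is a minimal portable C sketch of that idea. It is not the exec.library code and not the patch discussed in this thread; `copy_aligned_unrolled` is a hypothetical name, and each fixed 4-byte `memcpy` merely stands in for one MOVE.L (compilers typically lower it to a single load/store pair).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, non-overlapping regions only.
 * Step 1 aligns the destination, step 2 moves 16 bytes per loop pass,
 * step 3 cleans up the tail. */
void *copy_aligned_unrolled(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* 1. Byte copies until the destination sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* 2. Unrolled loop: four long-word moves per iteration.
     * The source may still be misaligned here; a plain 68000 would
     * fault on such an access, so a real routine needs a fallback. */
    while (n >= 16) {
        memcpy(d,      s,      4);
        memcpy(d + 4,  s + 4,  4);
        memcpy(d + 8,  s + 8,  4);
        memcpy(d + 12, s + 12, 4);
        d += 16;
        s += 16;
        n -= 16;
    }

    /* 3. Remaining long words, then remaining bytes. */
    while (n >= 4) {
        memcpy(d, s, 4);
        d += 4;
        s += 4;
        n -= 4;
    }
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The unroll factor of four is an arbitrary choice for the sketch; the best factor depends on the CPU and its caches.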

Quote from: itix;781160

If you are running something CPU intensive in the background, like compiling a large project with GCC, all you need is a good scheduler.

The 68k frontend for vbcc, vc, had the task priority lowered for better multi-tasking. Editing is now practical while compiling which is very convenient.

I believe 68k GCC will use the current shell process priority (ChangeTaskPri).

itix · « **Reply #75 on:** January 07, 2015, 08:31:52 AM »

Quote from: matthey;781162

The Amiga philosophy has always been about efficiency and not just replacing the CPU with a faster one.

Wrong... When Amiga/Commodore was at its height everyone needing better performance replaced the CPU with a faster one. Now the 68k CPU family has been discontinued and the best CPU available is 20 years old.

guest11527 · « **Reply #76 on:** January 07, 2015, 09:42:13 AM »

Quote from: psxphill;781156

I've heard that argument before and I don't buy it. If it's easily possible to write code that can render text faster then you should do that, because there are easily situations where an average editor is too slow.

The real problem is that as soon as something in the Os gets updated, you'll break things or the renderer will no longer work correctly. Besides, the Os routine isn't exactly slow either, unlike the 1.3 version.

So yes, one can write stupid editors, but it doesn't require the Os for them to be slow. (AmigaBasic is an example). But one can do pretty well with the Os, without compromising speed or compatibility.

guest11527 · « **Reply #77 on:** January 07, 2015, 09:49:15 AM »

Quote from: matthey;781162

I agree. I like the idea of using the OS but it needs to provide reasonably optimal functions. Is aligning the destination and using an unrolled MOVE.L loop too much to ask for CopyMem()/CopyMemQuick() when it is competitively the fastest for the 68000-68060? Would it be a bad thing if Olsen sold more copies of Roadshow because the memory copying bottleneck was reduced?

You are mixing a couple of things here that do not really belong together. First of all, CopyMemQuick() is reasonably optimal. Yes, one can improve it here or there, but it's surely better than the compiler-generated byte-wise copy - which also has its justification for short string moves, though.

I don't know how Roadshow could or would depend on CopyMemQuick(). Actually, the trick is to avoid the copy in the first place, and if that is not possible for one reason or another, it is still possible to implement something yourself. A memory copy is a reasonably trivial operation and trivial to do if it needs to be done, and it can then be tuned towards your precise requirements, unlike the Os function. But if your program spends a lot of time copying data around, you'll likely have a design problem somewhere.

psxphill · « **Reply #78 on:** January 07, 2015, 12:25:49 PM »

Quote from: itix;781160

If you are running something CPU intensive in the background, like compiling a large project with GCC, all you need is a good scheduler.

A good scheduler would be nice, however it doesn't solve the problem. If your text editor was written by someone who thinks it doesn't matter if it needs 100% cpu as long as it's fast enough to keep up with you typing, the scheduler will allow the text editor to be responsive but now your GCC builds get no CPU time at all.

There is a point where optimising further makes no sense, but giving up when something is just barely fast enough to keep up with a slow typist is not that point.

As for a good scheduler, it would be nice if this was open sourced.
http://aminet.net/package/util/misc/Executive

At least it's now free http://aminet.net/package/util/misc/Executive_key

The problem of course is it's another patch....

olsen · « **Reply #79 on:** January 07, 2015, 12:49:43 PM »

Quote from: kolla;781105

CygnusEd has the option of using OS routines, and on native chipset that is a major slowdown. Ditto for MuchMore iirc.

CygnusEd's own custom display update routines bypass several layers of operating system routines which need to be able to handle any case of moving and clearing the screen contents. By comparison CygnusEd can restrict itself to dealing with just one single bit plane and there is no need to handle clipping or occlusion. Actually, if you are using the topaz/8 font then CygnusEd can even bypass the operating system's text rendering operations altogether.

Well, this is how it can work out if you know exactly which special case you need to cater for. If you have to have a general solution you'll always end up making sacrifices with regards to performance and resource usage.

olsen · « **Reply #80 on:** January 07, 2015, 01:11:43 PM »

Quote from: matthey;781162

I agree. I like the idea of using the OS but it needs to provide reasonably optimal functions. Is aligning the destination and using an unrolled MOVE.L loop too much to ask for CopyMem()/CopyMemQuick() when it is competitively the fastest for the 68000-68060? Would it be a bad thing if Olsen sold more copies of Roadshow because the memory copying bottleneck was reduced?

Roadshow is a peculiar case. Because incoming and outgoing data needs to be copied repeatedly, the copying operation better be really, really well-optimized. For Roadshow I adapted the most efficient copying routine I could find, and this is what accounts for Roadshow's performance (among a few other tricks).

Because the TCP/IP stack needs to handle overlapping copying operations it would not have been possible to use CopyMem(), which is not specified to support it.

There is also special case copying code in Roadshow in order to support the original Ariadne card which due to a hardware bug handles single byte writes to its transmit buffer incorrectly (a single byte write operation is treated like a word write operation with a random MSB value). This is what the S2_CopyFromBuff16 SANA-IIR3 command is for, in case you always wondered why this oddball command is part of the standard.
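The workaround described here boils down to never issuing a byte-sized store to the card. A minimal sketch of such a copy, assuming a 16-bit aligned destination with room for the length rounded up to an even number; `copy_to_buffer16` is a hypothetical name, and this is neither Roadshow's code nor the SANA-II hook itself:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill a transmit buffer using only 16-bit stores.
 * dst must be 16-bit aligned and hold at least (n + 1) / 2 words. */
void copy_to_buffer16(volatile uint16_t *dst, const unsigned char *src, size_t n)
{
    uint16_t w;

    /* Full words: assemble two source bytes in a temporary, then issue
     * one word-sized store, keeping the bytes in memory order. */
    while (n >= 2) {
        memcpy(&w, src, 2);
        *dst++ = w;
        src += 2;
        n -= 2;
    }

    /* Odd trailing byte: pad it to a full word with a defined zero
     * instead of letting the hardware supply a random second byte. */
    if (n == 1) {
        unsigned char pair[2] = { src[0], 0 };
        memcpy(&w, pair, 2);
        *dst = w;
    }
}
```

The `volatile` qualifier is there so that the compiler does not split or merge the word stores; whether that is sufficient for a given compiler and bus is an assumption of this sketch.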

A general, efficient copying function has its merits, but it also ought to be sufficiently general in operation. Neither CopyMem() nor its subset CopyMemQuick() will handle overlapping copy operations, and none even flag an error condition if they cannot do what the caller asks them to (they show "undefined behaviour" instead).
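For comparison, this is roughly what "handling overlapping copy operations" means in practice: the routine picks the copy direction so that source bytes are read before they are overwritten. A minimal byte-wise sketch in portable C, with the hypothetical name `copy_overlap_safe`; it is neither Roadshow's routine nor anything from exec.library:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n bytes correctly even if the two regions
 * overlap, in the manner of the C library's memmove(). */
void *copy_overlap_safe(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d == s || n == 0)
        return dst;

    if ((uintptr_t)d < (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + n) {
        /* Destination is below the source, or past its end:
         * a forward copy never clobbers unread source bytes. */
        while (n--)
            *d++ = *s++;
    } else {
        /* Destination starts inside the source region:
         * copy backwards, from the last byte to the first. */
        d += n;
        s += n;
        while (n--)
            *--d = *--s;
    }
    return dst;
}
```

A forward-only routine such as an unrolled MOVE.L loop corrupts the data in the second case, which is why the overlap check has to come first.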

The lack of support for overlapping copy operations is likely a deliberate design choice. If you read the addendum to the original exec.library AutoDocs, it mentions that a future version of CopyMem() might use hardware acceleration to perform the operation. To me it's not quite clear what was meant by that. It could mean that somebody was thinking about putting the blitter to use, if available, to move the data around (which could have been twice as fast as copying the data using the CPU on the original 68000 Amiga design; you would have to wait for the blitter to become ready for use, and you'd have to wait for it to complete its work, which taken together may have nullified any possible speed advantage over using the CPU straight away). Or it could mean that somebody was considering adding DMA-assisted memory copying functionality to the Amiga system design, like the 1985 Sun workstations reportedly had.

Now there is an interesting question which I'd like to ask Carl Sassenrath.

itix · « **Reply #81 on:** January 07, 2015, 01:17:37 PM »

Quote from: psxphill;781171

A good scheduler would be nice, however it doesn't solve the problem. If your text editor was written by someone who thinks it doesn't matter if it needs 100% cpu as long as it's fast enough to keep up with you typing, the scheduler will allow the text editor to be responsive but now your GCC builds get no CPU time at all.

So what? GCC gains CPU time again when you stop typing.

Quote

There is a point where optimising further makes no sense, but giving up when something is just barely fast enough to keep up with a slow typist is not that point.

I probably would optimize it as much as possible but rationally it makes no sense. Most developers optimize code only for fun, to see how fast it can run, but it rarely gives anything back. In commercial software development such optimizations are often money pits...

...unless you are going to add IntelliSense-style features to 68k class systems. Then you might want to skip OS routines to cram more features into your software package.

Quote

As for a good scheduler, it would be nice if this was open sourced.
http://aminet.net/package/util/misc/Executive

At least it's now free http://aminet.net/package/util/misc/Executive_key

The problem of course is it's another patch....

It is very good but it works against documented behaviour so some software may just lock up. But multitasking was so much smoother on my A1200 that I didn't care. I recall the utilities were written in C but the scheduler is pure 68k asm.

guest11527 · « **Reply #82 on:** January 07, 2015, 01:33:04 PM »

Quote from: olsen;781172

CygnusEd's own custom display update routines bypass several layers of operating system routines which need to be able to handle any case of moving and clearing the screen contents. By comparison CygnusEd can restrict itself to dealing with just one single bit plane and there is no need to handle clipping or occlusion. Actually, if you are using the topaz/8 font then CygnusEd can even bypass the operating system's text rendering operations altogether.

For the record: I took the time and checked what ViNCEd does (the 3.9 Shell console). Actually, it does have the raster-scroll optimization as well, however, if I look at the sources nowadays and see how I had to "jump in circles" to get this done correctly, I would really not recommend doing so anymore.

Here, every line has a "raster mask" that describes which bitplanes it uses, and each window has a raster mask, too. When scrolling, not only does the raster mask have to be taken into account, but also whether the user has installed a custom border (probably drawing something in the console with other pens I'm not aware of) and whether the stuff is running on a graphics board, disabling the masking because it makes things slower, not faster. Then, one has to take into account whether the user has marked a block, changing the raster due to the inversion of the marked text, and so on...

Anyhow, it's quite an amount of code that went into this, and quite an amount of debugging, too. Looking at this today makes it a maintenance nightmare because changing the contents of the screen requires updating a lot of flags and settings to have everything consistent. No, I really won't do it like this anymore anytime...

There is no particular optimization for topaz.8 in ViNCEd, though. Back then, I had my own "FastFonts" in the system which optimized both topaz.8 and topaz.9 (unlike the CBM version which claimed to do the latter, but never did), only for solid, non-underline, non-italic, non-bold. All that stopped making sense, already with Workbench 2.0 and its better Text() (Yes, ViNCEd is really *that* old, it goes back to Workbench 1.3).

If you see how much code is between calling Text() and making the graphics appear on the screen with a graphics card, you would know why. Actually, I believe the "planar to chunky" conversion within BlitTemplate() that lies beyond Text() is probably the major contributor.

Thus, if you would want to optimize Text() nowadays, you would be rather well advised to change the internal representation of fonts right away, as soon as you open one, change it from planar to chunky and have a chunky-optimized text renderer in the first place. That makes a lot more sense than trying to squeeze out the bits from a couple of bitplanes or bit-shifts on a rather obsolete graphics representation.
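As an illustration of the "convert once when the font is opened" idea, here is a minimal sketch that expands a 1-bit-per-pixel glyph into one byte per pixel. The function name and the per-glyph layout are assumptions made for the example; real Amiga fonts pack all glyphs into one shared bit strip with per-character offsets, which this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand a bit-packed glyph to chunky form.
 * bits          : glyph rows, most significant bit = leftmost pixel
 * bytes_per_row : distance in bytes from one row to the next
 * chunky        : receives width * height bytes, each 0 or 1 */
void glyph_to_chunky(const uint8_t *bits, size_t bytes_per_row,
                     size_t width, size_t height, uint8_t *chunky)
{
    for (size_t y = 0; y < height; y++) {
        const uint8_t *row = bits + y * bytes_per_row;

        for (size_t x = 0; x < width; x++) {
            /* Select bit x of the row, counting from the left. */
            uint8_t mask = (uint8_t)(0x80u >> (x & 7));
            chunky[y * width + x] = (row[x >> 3] & mask) ? 1 : 0;
        }
    }
}
```

With the glyphs cached like this, a chunky renderer can copy or recolour whole bytes per pixel instead of shifting and masking bits on every Text() call.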

olsen · « **Reply #83 on:** January 07, 2015, 01:44:25 PM »

Quote from: Thomas Richter;781175

For the record: I took the time and checked what ViNCEd does (the 3.9 Shell console). Actually, it does have the raster-scroll optimization as well, however, if I look at the sources nowadays and see how I had to "jump in circles" to get this done correctly, I would really not recommend doing so anymore.

What CygnusEd does is not limited to picking a single bit plane to render into, and which should be scrolled: it directly talks to the blitter itself to perform the update operation, bracketed between LockLayers()/OwnBlitter() and DisownBlitter()/UnlockLayers(), calling WaitBlit() before hitting the blitter registers.

The parameters which are fed into the blitter registers are precalculated in 'C', and only the hardware access itself is written in assembly language.

Cool stuff indeed

psxphill · « **Reply #84 on:** January 07, 2015, 01:57:47 PM »

Quote from: itix;781174

So what? GCC gains CPU time again when you stop typing.

Why should you have to stop typing?

Also the text editor could use 100% cpu all the time and still fit the requirement of being "fast enough because it keeps up with your typing".

psxphill · « **Reply #85 on:** January 07, 2015, 02:07:32 PM »

Quote from: Thomas Richter;781175

Thus, if you would want to optimize Text() nowadays, you would be rather well advised to change the internal representation of fonts right away, as soon as you open one, change it from planar to chunky and have a chunky-optimized text renderer in the first place.

To optimise for modern hardware you'd probably want a much more complex system where you stored fonts in multiple different formats, potentially in the vram of each graphics card you are using the font on. So topaz might be stored in chip ram for ocs/ecs/aga and if you also have a couple of pci graphics cards it's probably stored on those too.

itix · « **Reply #86 on:** January 07, 2015, 02:59:50 PM »

Quote from: psxphill;781179

Why should you have to stop typing?

Don't stop if there is still more text coming. However, you must be a pretty good coder to type so quickly that GCC never gets any CPU time.

Quote

Also the text editor could use 100% cpu all the time and still fit the requirement of being "fast enough because it keeps up with your typing".

That is just a multitasking-unfriendly editor. It is a different requirement.

kolla · « **Reply #87 on:** January 07, 2015, 07:17:03 PM »

I remember when compiling linux kernels on remote machines became so fast that atelnet (or whatever I was using) could no longer keep up with the output and locked up my A3000 (CSPPC/CVPPC), haha.

Cosmos Amiga · « **Reply #88 on:** January 11, 2015, 09:30:20 AM »

Great ideas and work, SpeedGeek !

Your patch gives me motivation to update the BVisionPPC monitor :

BVisionPPC v4.4

- section removed
- four 060 emulated fmovecr removed
- two realtime 68030 checking removed (what's that ?)
- one realtime FPU checking removed
- ugly internal SAS/C copyroutine replaced by a _LVOCopyMem call

There are still two ugly ugly ugly copyroutines with move16 in this new version : ashamed of slowness...

I'll email them to you, you will see...

psxphill · « **Reply #89 on:** January 11, 2015, 09:54:05 AM »

Quote from: itix;781182

That is just a multitasking-unfriendly editor. It is a different requirement.

No it is exactly the same requirement as the text editor being able to use as much of your cpu as it wants as long as it can keep up with your typing. If you have other secret requirements that are obvious to you then you should specify them.

There may be software running in the background that needs to run at a high priority so that only 50% of cpu is left. The text editor needing 100% of cpu to keep up with typing is now unable to keep up with your typing.

I'm using quite extreme examples because it makes the maths easier, I understand that you can't practically optimise everything completely and there is an upper limit on the cpu resources available.

There are other vague issues like not saying how big the document is when you measure if it's able to keep up. A text editor that is designed for editing large documents will be able to insert and delete text in the middle without CopyMem()'ing the rest of the document around every time you type a character. On Windows I regularly edit text files hundreds of megabytes in size.
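One common way to get that property is a gap buffer: the text lives in a single array with a hole at the cursor, so typing fills the hole and only moving the cursor moves bytes. A minimal fixed-capacity sketch, with all names (`GapBuffer`, `gb_insert` and so on) made up for the example; it is not taken from any editor mentioned in this thread:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *buf;        /* text before the gap, the gap, text after it */
    size_t cap;        /* total size of buf */
    size_t gap_start;  /* first free byte, i.e. the cursor position */
    size_t gap_end;    /* one past the last free byte */
} GapBuffer;

int gb_init(GapBuffer *g, size_t cap)
{
    g->buf = malloc(cap);
    if (g->buf == NULL)
        return 0;
    g->cap = cap;
    g->gap_start = 0;
    g->gap_end = cap;
    return 1;
}

void gb_free(GapBuffer *g)
{
    free(g->buf);
    g->buf = NULL;
}

size_t gb_length(const GapBuffer *g)
{
    return g->cap - (g->gap_end - g->gap_start);
}

/* Move the gap so that it starts at text position pos. Only the bytes
 * between the old and the new cursor position are moved. */
void gb_move_gap(GapBuffer *g, size_t pos)
{
    if (pos < g->gap_start) {
        size_t n = g->gap_start - pos;
        memmove(g->buf + g->gap_end - n, g->buf + pos, n);
        g->gap_start -= n;
        g->gap_end -= n;
    } else if (pos > g->gap_start) {
        size_t n = pos - g->gap_start;
        memmove(g->buf + g->gap_start, g->buf + g->gap_end, n);
        g->gap_start += n;
        g->gap_end += n;
    }
}

/* Insert one character at pos; returns 0 if pos is invalid or the gap
 * is full (a real editor would grow the buffer instead). */
int gb_insert(GapBuffer *g, size_t pos, char c)
{
    if (pos > gb_length(g) || g->gap_start == g->gap_end)
        return 0;
    gb_move_gap(g, pos);
    g->buf[g->gap_start++] = c;
    return 1;
}

/* Copy the text without the gap into out; returns the text length. */
size_t gb_copy_out(const GapBuffer *g, char *out)
{
    size_t head = g->gap_start;
    size_t tail = g->cap - g->gap_end;
    memcpy(out, g->buf, head);
    memcpy(out + head, g->buf + g->gap_end, tail);
    return head + tail;
}
```

Consecutive keystrokes at the same position cost one byte store each, no matter how large the document is; editors for very large files often go further and use piece tables or ropes.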
