
Author Topic: newb questions, hit the hardware or not?  (Read 33454 times)


Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: newb questions, hit the hardware or not?
« Reply #59 on: July 16, 2014, 02:44:28 PM »
Quote from: Thomas Richter;769125
That's plain and simple SAS/C 6.51, simply because the OS development chain depends on it (with the exception of intuition, actually, which depended on a really rotten compiler).


I've worked with that one too.  It generates pretty good code most of the time, but with deeply nested expressions it spills the temporaries to the stack regardless of how many registers are free.  Also, ChaosLord used SAS/C for his game writing, and it occasionally got confused and generated pure nonsense code that wouldn't even execute.  In that event, inline assembly is unavoidable.
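
For illustration, a toy example of the kind of expression I mean (my own made-up code, not ChaosLord's):

   /* Toy C example: with an expression nested this deeply, SAS/C 6.x
      evaluates the inner sub-expressions into temporaries and spills
      them to the stack even while data registers are still free. */
   long deep(long a, long b, long c, long d)
   {
       return ((a * b + c) * (d - a) + (b - c) * (a + d))
            * ((c * d - b) * (a + c) + (d - b) * (b + a));
   }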

Quote from: Thomas Richter;769125
Anyhow, I stand by my opinion. Pointless argument. If you want to write a video codec, the *blitter* is your least problem. The decoding algorithm will make a huge difference, and even there it makes sense to optimize the algorithm first. Been there, done that. That was actually a JPEG 2000 codec, if you care.


I would care if I were making a bitmap-based codec.  I was planning on using mostly filled vectors, though.  I know how to optimize a full-screen vector into the minimum number of line-draws so that the whole screen can be done with a single vector-fill operation.  That full-screen, full-bitplane-depth pass is going to be costly, though, as are the uncompressed audio samples.  I may have to triple-buffer the display and use the CPU to clear the screen after it's been displayed, just to take some strain off the blitter (sketched below).
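
Roughly what I have in mind, as a sketch only; every name below is a placeholder, not a real routine:

   /* Triple-buffer rotation: while one buffer is being scanned out, the
      blitter draws the next frame and the CPU clears the third buffer,
      so the blitter never spends time clearing. All helpers are
      hypothetical placeholders. */
   #define NBUF 3

   extern void cpu_clear(void *plane);   /* CPU-driven screen clear        */
   extern void blit_frame(void *plane);  /* blitter line-draws + area fill */
   extern void show_frame(void *plane);  /* repoint the copper list        */

   static void *plane[NBUF];             /* bitplanes, allocated elsewhere */

   void next_frame(void)
   {
       static int show = 0;
       int draw  = (show + 1) % NBUF;    /* blitter's target this frame */
       int clear = (show + 2) % NBUF;    /* CPU scrubs the oldest one   */

       cpu_clear(plane[clear]);
       blit_frame(plane[draw]);
       show = draw;
       show_frame(plane[show]);          /* flip at vertical blank */
   }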
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: newb questions, hit the hardware or not?
« Reply #60 on: July 16, 2014, 04:23:58 PM »
Quote from: Thomas Richter;769129
Please get your math fixed. If you have n algorithms, each of them spends 1/nth of the time solving a problem, and each of them is sped up by 10%, the overall speedup is still 10%. In fact, if you only speed up one of them (e.g. layers) by 10%, the overall improvement is much smaller, depending on n, and even marginal.
Yes, obviously. But: on a platform with a bus speed of less than 8 MHz, 10% can make the difference between having enough time to get everything done in one frame and suffering a reduction in framerate. It doesn't matter how big or small a percentage it is; it's the practical impact that matters.

Quote
If, however, you have an algorithm whose running time grows as O(N^2) (N being the number of layers being moved, arranged or resized) and it is replaced by an O(N) algorithm (as actually happened), then even for suitably small N the improvement can be enormous. It is really that simple. Do not waste your time optimizing useless details. Get the big picture correct. Then, if performance is still not right, check where the problem is, find the bottlenecks, and either get rid of them by changing the algorithm, or optimize only there.
Nobody's arguing that algorithmic optimization shouldn't be the first resort, or that it doesn't have much greater potential for performance increases. Of course that's true; it's so well known to be the case that everybody but you is taking it as a given. But that in no way makes the question of whether good code is being generated "useless details." You can't come up with an infinite series of successive algorithmic optimizations - eventually you're going to hit the optimal algorithm for the particular application and platform, and if that's not enough, no amount of casting about for better, purer Ideas will get you any further; you either have to get down and dirty with the low-level stuff, or give up on making it work. Cycles count, no matter how much you want to pretend that isn't the case - if that weren't true, there would be no difference between my Amiga 1200 and my Core 2 Duo laptop other than the quality of their algorithms.
Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
Synthesizers: Roland JX-10/MT-32/D-10, Oberheim Matrix-6, Yamaha DX7/FB-01, Korg MS-20 Mini, Ensoniq Mirage/SQ-80, Sequential Circuits Prophet-600, Hohner String Performer

"\'Legacy code\' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup
 

guest11527

  • Guest
Re: newb questions, hit the hardware or not?
« Reply #61 on: July 16, 2014, 07:32:17 PM »
Quote from: commodorejohn;769133
Yes, obviously. But: on a platform with a bus speed of less than 8 MHz, 10% can make the difference between having enough time to get everything done in one frame and suffering a reduction in framerate. It doesn't matter how big or small a percentage it is; it's the practical impact that matters.
Was layers V40 usable on that machine before? Yes. Thus, apparently, even with the deficiencies the algorithm had, the result was OK. The improvement you get now is more than 10%. Did you notice? Most of the time, no. Thus, do you think a 10% improvement in micro-optimizations would make a difference? I can tell you my answer: I don't care.
Quote from: commodorejohn;769133
But that in no way makes the question of whether good code is being generated "useless details." You can't come up with an infinite series of successive algorithmic optimizations - eventually you're going to hit the optimal algorithm for the particular application and platform, and if that's not enough, no amount of casting about for better, purer Ideas will get you any further; you either have to get down and dirty with the low-level stuff, or give up on making it work. Cycles count, no matter how much you want to pretend that isn't the case - if that weren't true, there would be no difference between my Amiga 1200 and my Core 2 Duo laptop other than the quality of their algorithms.

No, cycles do not count, at least not at this microscopic level. Improve the algorithm. If that's not enough, identify the bottlenecks, then optimize there. Arguing about micro-optimizations like replacing LEAs with MOVEs is pointless, since it won't make a difference. What's the bottleneck in layers? Not the LEAs. The bottleneck is copying data from A to B: bitmap data. Layers V40 copied N times when re-arranging a single one of N overlapping layers. Layers V45 copies a constant number of times, twice: current data to backing store, backing store to front layer, independent of N. *That* makes a difference, because the number of cycles required to copy graphics data around makes a difference. Not the individual instructions that arrange the layer structure.

It is a pointless cycle-counter argument to replace individual instructions at the microscopic level, because that's not where the bottleneck is. The algorithm is better, the number of copies is lower. So, no, whoever counts cycles does not understand the problem; you're looking at the problem on much too fine a scale to be able to identify it. The way to go is to measure performance, use a profiler or some other tool to identify bottlenecks, then improve there. I do not need to count cycles to get there.
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: newb questions, hit the hardware or not?
« Reply #62 on: July 16, 2014, 07:40:17 PM »
Quote from: Thomas Richter;769148
Thus, do you think a 10% improvement in micro-optimizations would make a difference? I can tell you my answer: I don't care.
Then why are you getting so worked up about it?
Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
Synthesizers: Roland JX-10/MT-32/D-10, Oberheim Matrix-6, Yamaha DX7/FB-01, Korg MS-20 Mini, Ensoniq Mirage/SQ-80, Sequential Circuits Prophet-600, Hohner String Performer

"\'Legacy code\' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup
 

Offline Thorham

  • Hero Member
  • *****
  • Join Date: Oct 2009
  • Posts: 1149
Re: newb questions, hit the hardware or not?
« Reply #63 on: July 16, 2014, 08:56:39 PM »
Use the right algorithm? Really? How obvious :rolleyes: And because we're now using the right algorithm, optimizing its performance isn't necessary, implying any crappy implementation will do. Yeah, right :rolleyes:

Use the right algorithm, AND write a PROPER implementation of it. Why bother with anything less? And besides, cycle counting is fun. Nothing beats hand-optimizing tight loops on '020s and '030s :p Waste of time? Not to me; it's a hobby :p
 

Offline matthey

  • Hero Member
  • *****
  • Join Date: Aug 2007
  • Posts: 1294
Re: newb questions, hit the hardware or not?
« Reply #64 on: July 16, 2014, 10:15:17 PM »
Quote from: Thomas Richter;769121
That's exactly what I call a "cycle counter party argument". It is completely pointless because it makes no observable difference. Probably the reverse: the compiler likely made the choice for a reason.

In all 3 of my peephole optimization examples, the CCR is set in the same way. Vbcc's vasm assembler would optimize all 3 of them by default. The savings may be more than you expect, even if a single peephole optimization "makes no observable difference". Let's take a look at this simple one:

   lea (0,a6),a4        ; copy a6 to a4: 4 bytes, where move.l a6,a4 is 2

It looks short and harmless. The instruction is 4 bytes instead of 2 bytes for the MOVE equivalent so the extra fetch is only a fraction of the cost of executing an instruction on the 68000-68030, not counting any code that falls out of the ICache. The 68040 and 68060 can handle this instruction in 1 cycle with the 68060 only using 1 pipe. Now let's use A4 for addressing in the next instruction like this:

   lea (0,a6),a4        ; copy a6 to a4
   move.l (8,a4),d0     ; loads through the fresh copy -> pipeline bubble

There is a 2-4 cycle bubble between the instructions on the 68040+ (including pipelined FPGA 68k processors). A superscalar processor like the 68060 can have 2 integer pipes sitting idle for several cycles instead of executing half a dozen instructions. The above code looks like a compiler flaw anyway: if a6=a4, then it should use a6 as the base instead of copying it and addressing through the copy.
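
For illustration, a hypothetical C fragment of the shape that tends to produce this pattern (my own toy code, not taken from layers):

   /* If the compiler holds 'n' in an address register (say a6), the
      redundant local copy below can come out as lea (0,a6),a4 followed
      by move.l (8,a4),d0 -- the load stalls waiting on a4. */
   struct node { long pad[2]; long value; };   /* 'value' at offset 8 */

   long get_value(struct node *n)
   {
       struct node *p = n;   /* redundant copy -> lea (0,a6),a4   */
       return p->value;      /* -> move.l (8,a4),d0, after bubble */
   }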

Quote from: Thomas Richter;769121
Anyhow, the low-level graphics.library is in assembly, if that makes you feel any better. Still, it does not make a difference. Fast code comes from smart algorithms, not micro-optimizations. V45 is smarter in many respects because it avoids thousands of CPU cycles of worthless copy operations in most cases, probably at the expense of a couple of hundred CPU cycles elsewhere.

Smart algorithms are the starting point for efficient executables, and I appreciate your work to that end. Your layers.library probably runs at half the speed the 68060 is capable of because of non-algorithmic issues, even though it may be several times faster than it was before with many overlapping windows. You could say the 68060 is fast enough already, so there is no need for efficient code. If you were a compiler writer, you would have a 68000 backend for the 68060, an 8086 (or would it be 8080?) backend for x86/x86_64, and a 32-bit ARM backend for Thumb and ARMv8 processors, all with no optimization options and no optimizations. Your job would be complete, and any complaints would be met with "make better algorithms".

Quote from: Thomas Richter;769121
Which I actually doubt, and even if so, it would be hardly noticeable, because there is more that adds up to the complexity of moving windows than a couple of trivial move operations. Actually, V45 is faster, not slower, because it is smarter algorithmically.

I estimated your layers.library code could be ~10% faster on the 68020/68030 if compilers were better and you cared. Yes, that doesn't mean layers operations would be 10% faster overall, because other code probably has the same problems.

Quote from: Thomas Richter;769121
Pointless argument, see above. It requires algorithmic improvements, or probably additional abstractions to make it fit to the requirements of its time. Arguing about a

I guess it's such a pointless argument that you stopped typing mid-sentence?
« Last Edit: July 16, 2014, 11:23:52 PM by matthey »
 

Offline Thorham

  • Hero Member
  • *****
  • Join Date: Oct 2009
  • Posts: 1149
Re: newb questions, hit the hardware or not?
« Reply #65 on: July 16, 2014, 10:26:56 PM »
You tell'm, matthey :D
 

guest11527

  • Guest
Re: newb questions, hit the hardware or not?
« Reply #66 on: July 17, 2014, 07:36:36 AM »
Quote from: matthey;769161
In all 3 of my peephole optimization examples, the CCR is set in the same way. Vbcc's vasm assembler would optimize all 3 of them by default. The savings may be more than you expect, even if a single peephole optimization "makes no observable difference". Let's take a look at this simple one:

   lea (0,a6),a4        ; copy a6 to a4: 4 bytes, where move.l a6,a4 is 2

It looks short and harmless. The instruction is 4 bytes instead of 2 bytes for the MOVE equivalent so the extra fetch is only a fraction of the cost of executing an instruction on the 68000-68030, not counting any code that falls out of the ICache. The 68040 and 68060 can handle this instruction in 1 cycle with the 68060 only using 1 pipe. Now let's use A4 for addressing in the next instruction like this:

   lea (0,a6),a4        ; copy a6 to a4
   move.l (8,a4),d0     ; loads through the fresh copy -> pipeline bubble

There is a 2-4 cycle bubble between the instructions on the 68040+ (including pipelined FPGA 68k processors). A superscalar processor like the 68060 can have 2 integer pipes sitting idle for several cycles instead of executing half a dozen instructions. The above code looks like a compiler flaw anyway: if a6=a4, then it should use a6 as the base instead of copying it and addressing through the copy.

And I tell you again that this is a pointless argument, because it makes no observable difference. It is a waste of time to spend such effort on micro-optimizations, because you lose sight of the big picture. Not much time is spent in these algorithms in the first place, but a lot of time is spent copying data around that does not need copying. The problem was elsewhere. If any of you cycle-counters had layers (or any other component of the OS, for that matter) under your fingers, you would probably replace LEAs with MOVEs or vice versa, and probably gain some minor improvement, but the project would still not be done, would be bug-ridden from using assembly, and you would have lost the real opportunities for optimization because you would not have been able to work at the necessary level of abstraction.

Actually, major parts of gfx show that exactly this happened: gfx is bound too tightly to the hardware, ill-designed, and lacks a proper abstraction. Needless to say, major parts of it are in assembly. On the other hand, intuition has a proper level of abstraction (at least for its time) and a very small interface. It's written entirely in C. Intuition is fast enough for the 68K. Do you see a pattern here?
 

Offline Georg

  • Jr. Member
  • **
  • Join Date: Feb 2002
  • Posts: 90
Re: newb questions, hit the hardware or not?
« Reply #67 on: July 17, 2014, 08:33:38 AM »
Quote from: Thomas Richter;769148
Layers V40 copied N times when re-arranging a single one of N overlapping layers. Layers V45 copies a constant number of times, twice: current data to backing store, backing store to front layer, independent of N.


How do you do the re-arranging with only two copies if, for example, 3 or 4 smart-refresh layers have their visible area changed after a single layer operation (moving, depth arrangement)? And the hidden area of a single smart-refresh layer may consist of several ClipRects, i.e. more than one backing store.
 

Offline biggun

  • Sr. Member
  • ****
  • Join Date: Apr 2006
  • Posts: 397
    • http://www.greyhound-data.com/gunnar/
Re: newb questions, hit the hardware or not?
« Reply #68 on: July 17, 2014, 08:34:30 AM »
Aren't these two tasks for two people?

1) The application developer should focus on solving his problems effectively.

2) The compiler writer should improve the compiler so that reasonably good code is generated.

To write an application like a text editor, using C and OS calls sounds ideal to me.

If you want to write a 1KB bootblock demo with sinus scroller and copper-plasma, then using ASM and banging the hardware is probably the ideal way of doing it.

guest11527

  • Guest
Re: newb questions, hit the hardware or not?
« Reply #69 on: July 17, 2014, 08:55:01 AM »
Quote from: Georg;769184
How do you do the re-arranging with only two copies if, for example, 3 or 4 smart-refresh layers have their visible area changed after a single layer operation (moving, depth arrangement)? And the hidden area of a single smart-refresh layer may consist of several ClipRects, i.e. more than one backing store.

No, of course in that case two copies are not sufficient. If I have three overlapping layers and I move one of them down such that areas of the two other layers become visible, then three copies are obviously necessary.

The problem V40 had was in cases where you had a stack of layers sitting on top of each other (say, five stacked windows) and you depth-arranged them, i.e. moved the topmost to the bottom. V45 simply does that: copy the frontmost layer to its backing store, and the backing store of the next one to the screen. Sounds like the logical thing to do. Unfortunately, V40 did *not* operate like this. Instead, it copied the data of *all* layers in the stack around.

V40 (or rather, V32) was designed for a different target: back then, copying always used the blitter, the blitter was fast, the CPU was slow, and backing store was an expensive resource. So V32 used an algorithm that minimized the amount of backing store allocated at once, at the expense of too many copy operations in situations where many windows overlap. Nowadays the situation has turned around: copying is slow, the CPU is fast, the blitter is slow, and enough memory is available. Thus, the algorithm had to change to adapt to the new requirements.

V32 tried to minimize the memory footprint of the backing store by all means, at the price of too many copy operations. It also used the double-XOR trick to swap regions (good for the blitter, bad on a graphics card since it requires emulation).
V45 tries to minimize the number of copy operations, at the price of potentially using more backing store. It no longer uses double-XOR, and it uses the primitives of the graphics card where available; in particular, it allocates the bitmaps for the backing store from graphics-card memory if possible.
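
In pseudo-C (invented names, an illustration of the copy counts only, not the actual library code), depth-arranging the frontmost of N fully stacked layers to the back looks roughly like this:

   /* Illustration only: all helpers are hypothetical. */
   struct Layer;                                  /* opaque here */
   extern void save_backing(struct Layer *l);     /* visible pixels -> store */
   extern void restore_backing(struct Layer *l);  /* store -> screen         */
   extern void rotate_to_back(struct Layer *stack[], int n);

   /* V32/V40 style: the pixels of every layer in the stack get shuffled,
      so the number of copies grows with N. */
   void send_to_back_old(struct Layer *stack[], int n)
   {
       int i;
       for (i = 0; i < n - 1; i++) {
           save_backing(stack[i]);
           restore_backing(stack[i + 1]);
       }
       rotate_to_back(stack, n);
   }

   /* V45 style: two copies, independent of N. */
   void send_to_back_new(struct Layer *stack[], int n)
   {
       save_backing(stack[0]);       /* frontmost -> its backing store */
       restore_backing(stack[1]);    /* next layer's store -> screen   */
       rotate_to_back(stack, n);
   }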
 

guest11527

  • Guest
Re: newb questions, hit the hardware or not?
« Reply #70 on: July 17, 2014, 08:57:44 AM »
Quote from: biggun;769185
Aren't these two tasks for two people?

1) The application developer should focus on solving his problems effectively.

2) The compiler writer should improve the compiler so that reasonably good code is generated.

To write an application like a text editor, using C and OS calls sounds ideal to me.

That pretty much sums it up. It is the job of the compiler writer to create a reasonably good compiler. If that is *still* not good enough, one can go and hand-optimize the bottlenecks, which is exactly what has happened in the past with my professional (as in "work for money") software.
Quote from: biggun;769185
If you want to write a 1KB bootblock demo with sinus scroller and copper-plasma, then using ASM and banging the hardware is probably the ideal way of doing it.

Except that, if you really want such a thing, you'd probably be better off nowadays with a couple of lines of JavaScript run in a browser. (-; Yes, times have changed.
 

Offline vxm

  • Jr. Member
  • **
  • Join Date: Apr 2012
  • Posts: 59
Re: newb questions, hit the hardware or not?
« Reply #71 on: July 17, 2014, 09:24:36 AM »
Quote from: biggun;769185
Aren't these two tasks for two people?

1) The application developer should focus on solving his problems effectively.

2) The compiler writer should improve the compiler so that reasonably good code is generated.
This is correct. In the same way that small streams make big rivers, the sum of the various levels of optimization contributes to overall system performance.
 

Offline Thorham

  • Hero Member
  • *****
  • Join Date: Oct 2009
  • Posts: 1149
Re: newb questions, hit the hardware or not?
« Reply #72 on: July 17, 2014, 10:50:20 AM »
Quote from: Thomas Richter;769182
would be bug ridden by using assembly
Blame the language for bugs; what complete and utter nonsense.

Also, just because people optimize implementations doesn't mean they're unable to choose the right algorithms in the first place. You talk about this as if the two were mutually exclusive, and that's nonsense.

Another thing: just because YOU think something is a waste of time doesn't mean everyone does. Some nerve you have, speaking for everyone like that.

And if you're using a compiler that writes lea (0,a0),a1 instead of move.l a0,a1, then you need a better compiler. It's crap code, whether you agree with that or not.
 

guest11527

  • Guest
Re: newb questions, hit the hardware or not?
« Reply #73 on: July 17, 2014, 12:19:00 PM »
Quote from: Thorham;769192
Blame the language for bugs; what complete and utter nonsense.
You haven't worked in professional software development, have you? Yes, humans make mistakes, no matter what the language. And yes, the language matters: some languages provide facilities to detect mistakes, some more, some less. Higher-level languages are better at that than assembly.
Quote from: Thorham;769192
Also, just because people optimize implementations doesn't mean they're unable to choose the right algorithms in the first place. You talk about this as if the two were mutually exclusive, and that's nonsense.
See above. Apparently, you do not yet have enough experience implementing large, complex software projects. Such projects iterate, are in constant flux, change; several algorithms are tried, benchmarked, debugged. You cannot do that efficiently in assembly.
Quote from: Thorham;769192
Another thing: just because YOU think something is a waste of time doesn't mean everyone does. Some nerve you have, speaking for everyone like that.
Probably because I've been in the business a bit longer than you have? It's called experience; it comes with time. I've done many things in assembly, and many things in other languages: C, C++, Pascal, Java, many more. Each has its drawbacks and advantages. Assembly is something you rarely ever need. In most cases it is a waste of time: too much time to develop, too much time to debug, diminishing returns in performance. Hence: waste of time. If you do not believe me, ask other experienced developers.
Quote from: Thorham;769192

And if you're using a compiler that writes lea (0,a0),a1 instead of move.l a0,a1, then you need a better compiler. It's crap code, whether you agree with that or not.

The code is good enough; it makes no observable difference. You are of course invited to write a better compiler. I personally do not waste my time on micro-optimizations. Optimizations make sense once the bottleneck is identified. Even assembly sometimes makes sense, if you find that more than 50% of the time is spent in an isolated part of the code and there is no further algorithmic improvement to be had. But layers is no such code. The low-level copy functions that move backing store and images around are. Guess what: that's assembly. It's still slow, but there's no chance to make it faster, since the Zorro bus is the bottleneck; the only way to make it faster is to avoid the operations in the first place whenever possible, and that's exactly what V45 does. Whether the LEAs are coded as MOVEs or not makes no bloody difference.
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: newb questions, hit the hardware or not?
« Reply #74 on: July 17, 2014, 01:05:46 PM »
Quote from: Thomas Richter;769182
If any of you cycle-counters had layers (or any other component of the OS, for that matter) under your fingers, you would probably replace LEAs with MOVEs or vice versa, and probably gain some minor improvement, but the project would still not be done, would be bug-ridden from using assembly, and you would have lost the real opportunities for optimization because you would not have been able to work at the necessary level of abstraction.
That's not even remotely how that works. Low-level optimization does not prevent you from doing high-level optimization (and the idea that assembly is more prone to bugs than high-level languages is a myth; bugs come from sloppy thinking, not from a lack of language features).

Quote from: Thomas Richter;769187
Except that, if you really want such a thing, you'd probably be better off nowadays with a couple of lines of JavaScript run in a browser. (-; Yes, times have changed.
No they haven't.

Quote from: Thomas Richter;769195
You haven't worked in professional software development, have you?
I have - and you're talking nonsense.

Quote
Such projects iterate, are in constant flux, change; several algorithms are tried, benchmarked, debugged. You cannot do that efficiently in assembly.
Bull.

Quote
Probably because I've been in the business a bit longer than you have? It's called experience; it comes with time.
As the canonical New Yorker cartoon caption says, "Christ, what an asshøle."
Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
Synthesizers: Roland JX-10/MT-32/D-10, Oberheim Matrix-6, Yamaha DX7/FB-01, Korg MS-20 Mini, Ensoniq Mirage/SQ-80, Sequential Circuits Prophet-600, Hohner String Performer

"\'Legacy code\' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup