Author Topic: newb questions, hit the hardware or not? (Read 71789 times)

matthey · « **on:** July 16, 2014, 12:16:29 AM »

Quote from: Thomas Richter;769032

I pretty much doubt this. If there is no cliprect (i.e. the rastport has no layer) graphics.library goes directly to the hardware on native amigas. There is no overhead in such a case. Again, before making such claims, please measure.

If there is a graphics card, it goes to the hardware of the graphics card if it supports 2D acceleration, same story. Only difference: It works on all hardware.

I agree with your point on using the AmigaOS (where possible given constraints) but I disagree with the "no overhead" claim for using the AmigaOS, even if it "goes directly to the hardware". Function calls through the jump table have overhead and compiled AmigaOS code is not optimal. For example, your new layers.library is riddled with instructions like:

lea (0,a6),a4 ; optimize to move.l a6,a4
move.l #0,-(sp) ; optimize to clr.l -(sp)
lea (a3),a6 ; optimize to move.l a3,a6

That's just talking about peephole optimizations. It's compiled for the 68000, which isn't a huge loss with this particular code, but it's a few percent more. Actually, the biggest gain could be using the auto word extension of address registers, which SAS/C doesn't take advantage of (MVS or MOVE.W+EXT.L fusion/folding would be a huge gain with this code). There is no instruction scheduling for superscalar processors like the 68060. It sometimes looks like redundant cache accesses aren't even avoided, like this:

move.w (a5),d0
move.w (a5)+,d1

We are not talking about a cycle or 2. All these missing optimizations add up and then programmers roll their own code to gain 10%+ speed over the AmigaOS. I want programmers to use the AmigaOS functions (without requiring it). We need to improve compilers and try to make code close to optimal for this to happen. Call me a cycle counter and ignore me if you like.
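To make the three peephole rewrites above concrete, here is a rough sketch in Python that applies them to assembly text. It is only an illustration: the rule table and the `peephole` function are made-up names, and this is not how vasm or SAS/C implement anything internally. It assumes one instruction per line and lower-case register names.

```python
import re

# Illustrative rules only: (pattern, replacement) pairs for the three
# rewrites quoted in the post. Hypothetical, not taken from any assembler.
PEEPHOLE_RULES = [
    # lea (0,An),Am -> move.l An,Am
    (re.compile(r"^lea\s+\(0,(a[0-7])\),(a[0-7])$"), r"move.l \1,\2"),
    # lea (An),Am -> move.l An,Am
    (re.compile(r"^lea\s+\((a[0-7])\),(a[0-7])$"), r"move.l \1,\2"),
    # move.l #0,-(sp) -> clr.l -(sp)
    (re.compile(r"^move\.l\s+#0,-\(sp\)$"), "clr.l -(sp)"),
]


def peephole(line):
    """Return the rewritten instruction, or the cleaned-up line if no rule applies."""
    # Drop any trailing ';' comment and normalise case before matching.
    text = line.split(";")[0].strip().lower()
    for pattern, replacement in PEEPHOLE_RULES:
        if pattern.match(text):
            return pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    for source in ("lea (0,a6),a4", "move.l #0,-(sp)", "lea (a3),a6"):
        print(source, "->", peephole(source))
```

A nonzero displacement such as `lea (8,a6),a4` is left alone, since it is not a plain register copy.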

Quote from: Thorham;769052

Write software that's fast on 20s and 30s with AGA, and when possible OCS/ECS, while keeping it system friendly, but without sacrificing speed (no need for syskill practices, custom screens and what not, of course).

With a little bit of learning, it's possible to write code that is fairly optimal on the 68020-68060. Instruction scheduling for a 68060 generally doesn't affect 68020/68030 performance but can double 68060 performance with some code. Learning about modern CPU pipelining, superscalar execution, caches, hazards/bubbles, etc. (not just 68020/68030 timings and specifics) will improve your code for the 68020/68030 also. We may get a superscalar 68k fpga CPU someday where your code will magically be much faster too.

Quote from: biggun;769054

What you say is correct.

But it does not need to be like this.
A CIA needs a minimal FPGA
By adding the smallest and cheapest FPGA to a new system - every system could be made CIA compatible for a price of close to nothing.

The same is true with USB and accessing them via DFFxxx registers.
A USB to DFFxxx bridge logic costs around $2.

Technically there is really no reason to not have both.
A NEO system could with no problem at all implement USB and DFF chipset and CIAs for nearly no money.

The SAM 440 and 460 have a small Lattice fpga. The mentality of some of the so-called next generation Amiga guys is to get away from hardware dependency. They also may be trying to keep their AmigaOS closed for proprietary and security reasons.

matthey · « **Reply #1 on:** July 16, 2014, 09:23:23 AM »

Quote from: Thorham;769105

Can't do it. 20s and 30s have priority for me. Not to mention that instruction scheduling sucks. My goal is to get something to run well on the lower end machines (25 mhz 68020). When something runs well on such machines, why would I need to optimize for 68060s? For me anything above 68030 is irrelevant in terms of optimizing, because if a '30 can run it fast enough, then so can a '40 or '60. I also don't have a '40 or '60.

More performance is always useful. Settings with better gfx, more sound effects/music and more options can be turned on with a 68040/68060. Some games are nicer at 30fps than 20fps even if they are playable and fun at 20fps on a 68020/030. It does take a little more time to instruction schedule code but the code becomes reusable for more and expanded projects.

Quote from: Thorham;769105

Really? So, you're telling me that on 20/30 there's more than cache+timings+pipeline? Interesting!

The 020/030 is friendly, being lightly pipelined, but performance is affected by alignment and data sizes (32 bit is sometimes faster than 16 bit) at least. Unfortunately, documentation is lacking in general for 68k instructions in regards to hazards/bubbles and instruction scheduling. I know the 020/030 has some instruction overlap but I don't know if it's enough to affect resource availability from instruction to instruction. Unlike most 68k Amiga programmers, I have studied the 040 and 060 more (and I know more about the AmigaOS functions than banging the Amiga hardware also). Avoiding general slowdowns for the 040/060 rarely hurts and sometimes helps 020/030 performance. This is in contrast to the 68000, where optimizing for 68000-68060 is difficult as the 68000 is a 16 bit processor.

matthey · « **Reply #2 on:** July 16, 2014, 10:15:17 PM »

Quote from: Thomas Richter;769121

That's exactly what I call a "cycle counter party argument". It is completely pointless because it makes no observable difference. Probably the reverse, the compiler had likely made the choice for a reason.

In all 3 of my peephole optimization examples, the CCR is set in the same way. Vbcc's vasm assembler would optimize these 3 examples by default. The savings may be more than you expect even if a single peephole optimization "makes no observable difference". Let's take a look at the simple:

lea (0,a6),a4

It looks short and harmless. The instruction is 4 bytes instead of 2 bytes for the MOVE equivalent so the extra fetch is only a fraction of the cost of executing an instruction on the 68000-68030, not counting any code that falls out of the ICache. The 68040 and 68060 can handle this instruction in 1 cycle with the 68060 only using 1 pipe. Now let's use A4 for addressing in the next instruction like this:

lea (0,a6),a4
move.l (8,a4),d0

There is a 2-4 cycle bubble between the instructions on the 68040+ (including pipelined fpga 68k processors). A superscalar processor like the 68060 can have 2 integer pipes sitting idle for several cycles instead of executing half a dozen instructions. The above code looks like a compiler flaw anyway. If a6=a4 then it should use a6 as a base instead of copying it and using the copy.
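The pattern described above (write an address register, then address through it in the very next instruction) is easy to spot mechanically. Below is a rough Python sketch of such a check. The function name and the heuristic are my own assumptions for illustration. It does not model cycle counts, it treats any instruction whose last operand is a bare address register as a write to it, and it ignores writes caused by `(An)+` and `-(An)`.

```python
import re

# Matches a bare address register as the final (destination) operand, e.g. "...,a4".
DEST_AREG = re.compile(r",(a[0-7])$")


def find_address_hazards(lines):
    """Return indices i where line i writes An and line i+1 addresses through An."""
    hazards = []
    for i in range(len(lines) - 1):
        # Strip ';' comments and normalise case on both instructions.
        current = lines[i].split(";")[0].strip().lower()
        following = lines[i + 1].split(";")[0].strip().lower()
        match = DEST_AREG.search(current)
        if not match:
            continue
        reg = match.group(1)
        # Register used inside parentheses: (a4), (8,a4), (a4)+, -(a4)
        if re.search(r"\([^)]*\b" + reg + r"\b[^)]*\)", following):
            hazards.append(i)
    return hazards


if __name__ == "__main__":
    code = ["lea (0,a6),a4", "move.l (8,a4),d0"]
    print(find_address_hazards(code))
```

Running it on the two-instruction example flags index 0, the `lea`. Using a6 directly as the base in the second instruction, as suggested above, gives an empty list.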

Quote from: Thomas Richter;769121

Anyhow, the low-level graphics.library is in assembly, if that makes you feel any better. Still, does not make a difference. Fast code comes from smart algorithms, not micro-optimizations. V45 is smarter in many respects because it avoids thousands of CPU cycles of worthless copy operations in most cases, probably at the expense of a couple of hundred CPU cycles elsewhere.

Smart algorithms are the starting point to efficient executables and I appreciate your work to that end. Your layers.library probably runs at half the speed of what the 68060 is capable of because of non-algorithm issues even though it may be several times faster than it was before with many overlapping windows. You could say the 68060 is fast enough already so there is no need to have efficient code for it. If you were a compiler writer, you would have a 68000 backend for the 68060, an 8086 (or would it be 8080) backend for x86/x86_64 and a 32 bit ARM backend for Thumb and ARMv8 processors, all with no optimization options and no optimizations. Your job would be complete and any complaints would be met with "make better algorithms".

Quote from: Thomas Richter;769121

Which I actually doubt, and even if so, it would be hardly noticeable because there is more that adds up to the complexity of moving windows than a couple of trivial move operations. Actually, V45 is faster, not slower, because it is smarter algorithmically.

I estimated your layers.library code could be ~10% faster on the 68020/68030 if compilers were better and you cared. Yes, that doesn't mean layers operations will be 10% faster because other code probably has the same problems.
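The gap between "this library is ~10% faster" and "the whole operation is 10% faster" is just Amdahl's law. Here is a small Python sketch of the arithmetic. The function name is made up, and the fraction of time spent in layers.library is a placeholder assumption, not a measurement.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speedup of the whole operation when only `fraction`
    of its run time becomes `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Assumed figure: half of a window operation's time is spent in the
    # library that gets 10% faster; the rest is untouched code.
    print(round(overall_speedup(0.5, 1.10), 3))
```

With that assumed 50% share, a 10% gain in the library is only about a 4.8% gain overall, which is why the rest of the code path needs the same treatment.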

Quote from: Thomas Richter;769121

Pointless argument, see above. It requires algorithmic improvements, or probably additional abstractions to make it fit to the requirements of its time. Arguing about a

I guess it's such a pointless argument that you stopped typing mid sentence?

matthey · « **Reply #3 on:** July 17, 2014, 06:43:29 PM »

Quote from: Thorham;769222

Who the hell ever said that? Of course it takes longer and is easier to mess up (doesn't mean you end up with bug riddled code, like someone claimed).

It doesn't mean algorithms can't be improved in assembler either (even if ThoR considers them worthless micro-optimizations also). There are high level algorithm improvements that are generally easier in high level languages and low level algorithm improvements that high level languages may make difficult to see or implement because of abstraction. Assembler is complete freedom, it's just that most people don't know what to do with it. It requires logical and organized thinking to create good code. It's like a puzzle with beauty in the simplest and most logical code. A time schedule would take away from the creative freedom though. Some people have to code for a living and that's why we should try to improve the assembler in compilers for those imprisoned souls.

Algorithms can require assumptions that change also. New fpga Amiga chipsets may have a blitter much faster than the CPU with faster memory again. Perhaps SmartRefresh should be selected for the chipset and SimpleRefresh for a gfx board. Maybe layers should query the gfx system for the best refresh method to use for a gfx mode. There is all that algorithm work fishing for the big fish and that can all become outdated if the "best" algorithm becomes obsolete. At least thousands of micro-optimizations can be applied with quick compiler switches over and over again, provided they are used. Personally, I like tuna and sardines. I prefer to be a little more open-minded than "640kB of memory with perfect algorithms is enough for everyone". Isn't it really about the time saved compared to the amount of work? I would think that optimizing compilers so that a compiler switch can be used would be efficient in processing time saved vs programming time spent even if it was a few percent of savings. Of course, I'm not a professional programmer so my opinion doesn't seem to count, according to some people.

matthey · « **Reply #4 on:** July 20, 2014, 08:41:15 AM »

Quote from: DamageX;769408

Thanks for jacking my thread, LOL.

From 2006 no less.
