Author Topic: GCC asm() warning suppression options? (Read 19973 times)

Trev · « **Reply #14 on:** June 30, 2009, 05:50:28 AM »

Quote from: Karlos;513878

You're building gcc 4.4 for an m68k backend just to check this? That's hardcore

Hardcore boredom, maybe. ;-) gcc 4.4.0 is definitely smarter than gcc 2.95.3:

input:

static inline unsigned _rotl(unsigned val, int shift)
{
    shift &= 0x1f;
    val = (val>>(0x20 - shift)) | (val << shift);
    return val;
}

int main(void)
{
    volatile unsigned val = 1;
    volatile unsigned a = _rotl(val, 1);
    return 0;
}

output:

Code:

00000000 <main>:
   0:   4e56 fff8       linkw %fp,#-8
   4:   7001            moveq #1,%d0
   6:   2d40 fffc       movel %d0,%fp@(-4)
   a:   202e fffc       movel %fp@(-4),%d0
   e:   721f            moveq #31,%d1
  10:   e2b8            rorl %d1,%d0
  12:   2d40 fff8       movel %d0,%fp@(-8)
  16:   4280            clrl %d0
  18:   4e5e            unlk %fp
  1a:   4e75            rts

It's even decided that a register based rotate right is faster than an immediate rotate left, I guess. Or maybe not. I can't get it to produce a rotate left. I guess gcc isn't an ambiturner.
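For what it's worth, a left rotate by n and a right rotate by 32-n give the same result on a 32-bit value, which is why the rorl by 31 above is still a correct translation of _rotl(val, 1). A minimal sketch to check that, assuming a 32-bit unsigned; rotl32, rotr32 and rotates_agree are names made up here, and the second shift count is masked so a count of 0 never shifts by the full width (which the _rotl above would do for shift == 0):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not from the thread: 32-bit rotates written so that
// neither shift count can ever reach 32 (a full-width shift is undefined in C/C++).
static inline std::uint32_t rotl32(std::uint32_t val, unsigned shift)
{
    shift &= 31u;
    return (val << shift) | (val >> ((32u - shift) & 31u));
}

static inline std::uint32_t rotr32(std::uint32_t val, unsigned shift)
{
    shift &= 31u;
    return (val >> shift) | (val << ((32u - shift) & 31u));
}

// Checks that rotating left by n matches rotating right by 32 - n for every count.
static inline bool rotates_agree(std::uint32_t val)
{
    for (unsigned n = 0; n < 32; ++n) {
        if (rotl32(val, n) != rotr32(val, (32u - n) & 31u)) {
            return false;
        }
    }
    return true;
}
```

Something like assert(rotl32(1u, 1) == rotr32(1u, 31)) would be the case from the listing above.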

In this case, at least, the optimizer does OK. I don't know anything about gcc internals, but maybe the optimization occurs at the RTL level based on the capabilities of the underlying architecture. That would mean an optimization should apply regardless of the architecture, as long as the architecture supports it.

But why doesn't it rotate left? The execution times are the same for both. Correction: The execution times are the same if the bit count is the same: 8+2n for register operand, 12+2n for immediate operand. What am I missing?

I compiled with an m68k-elf target, but now the m68k-amigaos gods are calling. I know there are some gcc 4.x.x builds out there somewhere, but I haven't used one. I'd also like to see ixemul and libnix go away and be replaced with newlib or another current library.

If necessary, one could even target specific releases, e.g. m68k-*-amigaos1.2, m68k-*-amigaos2.0, m68k-*-amigaos3.0, et al. It depends on how tightly coupled the tool chain is to the target environment. The main differences from a tool chain perspective, though, should be in the hunks supported. Everything else could be handled as it is today, and newlib, crt0.s, and amiga.lib could be written to run optimally on arbitrary releases. Having that magic at compile time, though, would result in much tighter binaries.

Trev · « **Reply #15 on:** June 30, 2009, 04:33:14 PM »

Quote from: Piru;513963

Well that's not it. The concurrent jobs are only used for things that can be concurrent. Obviously make cannot change the order of commands being executed, that'd never work.

You have to trust that your target makefile is essentially thread-safe, i.e. all dependencies are properly documented for synchronization, no race conditions exist in similar commands used by different rules, etc.

Trev · « **Reply #16 on:** June 30, 2009, 04:41:32 PM »

Quote from: Karlos;513934

Hmm, unusual. I'm pretty sure an immediate left shift of 1 place ought to be faster than a register based shift right of 31 since you spare yourself the cost of the additional move.l #31, d1. It also reduces register pressure too, which could make all the difference in real code.

Probably in the test code here there's no need for it to do that.

Yeah, I'm really not sure why, since ror.l #n,Dn and ror.l Dx,Dn execute in the same number of cycles assuming n == Dx. Barring outside influences, the only reason to use a register is for a shift >8 and <24 (>8 in the opposite direction), as you've done in your template. Right?
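A rough way to put numbers on this, using the 8+2n figure quoted earlier in the thread for a long rotate on a data register. The 4-cycle cost for the moveq that loads the count is an assumption for a plain 68000, and none of this would carry over to later CPUs. The names are invented for illustration:

```cpp
#include <cassert>

// Hypothetical cycle estimate for a plain 68000, using the 8+2n formula
// quoted in the thread for a 32-bit rotate on a data register.
constexpr unsigned rotate_long_cycles(unsigned count)
{
    return 8u + 2u * count;
}

// Assumption: moveq #imm,Dn costs 4 cycles on a 68000.
constexpr unsigned kMoveqCycles = 4u;

// roll #1,%d0: immediate count, no extra instruction needed.
constexpr unsigned rol_imm_1 = rotate_long_cycles(1);

// moveq #31,%d1 followed by rorl %d1,%d0: same result, count held in a register.
constexpr unsigned ror_reg_31 = kMoveqCycles + rotate_long_cycles(31);

static_assert(rol_imm_1 == 10u, "8 + 2*1");
static_assert(ror_reg_31 == 74u, "4 + 8 + 2*31");
```

On those assumed numbers the immediate rotate left wins by a wide margin on a 68000, so whatever drives the choice in gcc 4.4.0 presumably isn't a 68000 cycle count.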

Quote

Incidentally, you might want to compile that test code with -fomit-frame-pointer

Well, sure, but it's not actually going to run anywhere and it doesn't change the output of the test itself. :-P

Trev · « **Reply #17 on:** June 30, 2009, 06:10:34 PM »

Quote from: Karlos;513975

Well, you could probably do it with two successive rotates, but I figured that the register method might be better than a pair of rotates. Having said that, I didn't test the latter. You'd swap a move for a rotate but you'd gain a free register overall.

I'm leaning towards trusting the compiler. One can always hand optimize code, but writing specializations for all possible cases? Tedium. ;-)

Trev · « **Reply #18 on:** June 30, 2009, 07:17:27 PM »

Quote from: Karlos;513995

After analysing the output (from older gcc) it was clear that the emitted code wasn't very good. So, I made the inline assembler implementations you saw one of in this thread. There's no question they produce better code now than when they were using shifts and ors

Oh, for sure. I was just thinking that for the newer compiler, the optimizer should be much better at dynamically optimizing for all possible scenarios than I would be at hand coding them. I'm no amigaksi, after all.

Trev · « **Reply #19 on:** June 30, 2009, 09:59:17 PM »

Quote from: Karlos;514007

ROFL (+1)

;-)

Quote

Tell you what. Seeing as you've compiled a working 4.4 compiler, we could try some synthetic benchmarks. My template rotate versus a standard implementation based on shifting and or'ing that the compiler is left to optimize.

I'd actually be quite interested in the results

Well, it's m68k-elf with no real back end. We could count cycles in a simulator, I suppose. :-)

Also, fun with generic templates. Here's an x86 rotate that reduces positive and negative shift values to a positive shift (assuming +right, -left), and "optimizes" based on the width of the shift:

Code:


template <signed N> inline unsigned rotate(unsigned val)
{
    if ((32-(-(N%32)))%32 != 0) {
        if ((32-(-(N%32)))%32 < 16) {
            asm("rorl %1, %0;" : "=r"(val) : "I"((32-(-(N%32)))%32), "0"(val) : "cc");
        }
        else {
            asm("roll %1, %0;" : "=r"(val) : "I"(32-((32-(-(N%32)))%32)), "0"(val) : "cc");
        }
    }

    return val;
}

(I haven't looked at the execution times, so the optimization might not even make sense. But that wasn't the point, regardless.)
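The (32-(-(N%32)))%32 expression is what folds any signed N into a right-rotate count between 0 and 31. A small sketch of just that arithmetic, pulled out into a constexpr function so it can be checked at compile time. reduce_to_right is a made-up name, and it relies on C++11 rules where % truncates toward zero:

```cpp
#include <cassert>

// Hypothetical helper mirroring the template's expression:
// maps any signed N (+right, -left) to an equivalent right-rotate count in 0..31.
constexpr int reduce_to_right(int n)
{
    return (32 - (-(n % 32))) % 32;
}

static_assert(reduce_to_right(1) == 1, "rotate right by 1 stays 1");
static_assert(reduce_to_right(-1) == 31, "rotate left by 1 is rotate right by 31");
static_assert(reduce_to_right(33) == 1, "counts wrap modulo 32");
static_assert(reduce_to_right(0) == 0, "zero reduces to no rotate");
static_assert(reduce_to_right(-32) == 0, "a full turn left is also no rotate");
```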

But guess what! N==0 (or any value that reduces to 0) throws this:

Code:


warning: asm operand 1 probably doesn't match constraints

Bugger! It still compiles, still runs, and doesn't leave any dead code. Not sure how to get rid of the warning, though, if it's parsing code it shouldn't be parsing after templatization. Template misuse, maybe?
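A likely explanation, for what it's worth: when N reduces to 0, the roll branch still gets compiled even though it can never run, and its operand works out to 32-0 = 32, which is outside the 0..31 range that the x86 "I" constraint accepts. One way around that without newer language features is to dispatch on the reduced count through a class template and specialize the zero case, so no rotate body is instantiated for it. A minimal sketch, with made-up names (Reduced, RotateRightBy, rotate_checked) and plain shifts standing in for the asm, assuming a 32-bit unsigned:

```cpp
#include <cassert>

// Reduced right-rotate count, same arithmetic as the template in the post.
template <int N> struct Reduced {
    static const int value = (32 - (-(N % 32))) % 32;
};

// Hypothetical dispatcher: the primary template handles counts 1..31.
// Plain shifts stand in here for the inline asm, which would go in apply().
template <int K> struct RotateRightBy {
    static unsigned apply(unsigned val)
    {
        return (val >> K) | (val << (32 - K));
    }
};

// Specialization for a count of 0: no rotate code is instantiated at all,
// so nothing with an out-of-range operand is ever handed to the compiler.
template <> struct RotateRightBy<0> {
    static unsigned apply(unsigned val) { return val; }
};

// Entry point: +N rotates right, -N rotates left, as in the original template.
template <int N> inline unsigned rotate_checked(unsigned val)
{
    return RotateRightBy<Reduced<N>::value>::apply(val);
}
```

With this shape, rotate_checked<0>(x) and rotate_checked<32>(x) both land in the specialization and just return x.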

Trev · « **Reply #20 on:** June 30, 2009, 11:08:33 PM »

Quote from: Karlos;514035

Or I could write the function to be benchmarked, you can compile it and post the assembler output of the function and I'll put that source back into a test project?

Or that. :-P

Trev · « **Reply #21 on:** June 30, 2009, 11:12:12 PM »

Quote from: Karlos;514033

Boo! So you basically get the same warning I started this whole thread in aid of? :roflmao:

It's only taken us 50 posts to come full circle

I think we're safe as long as no one invokes Sir Elton.

Quote

It isn't quite as cheeky as the processor trap -> C++ exception throw that I used. Frankly, I'm amazed that bugger worked at all. Inside the (asm) m68k trap handler (which you install into your exec Task structure), you poke the stack frame to change the return address to a function which does nothing other than throw an exception of a type suitably mapped to the nature of the trap. Saves having to check for divide by zero when you can just put a try/catch block around a bit of code and trap ZeroDivide

Actually, that sounds like a quite valid use. Within the design of the operating system even. (Well, sort of. But manipulating stack frames is kind of at the core of exception handling, isn't it?)

Trev · « **Reply #22 on:** July 01, 2009, 12:15:44 AM »

Were you ever able to simulate a null pointer exception, short of wrapping all pointers in a class and overloading the indirection operator?
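For reference, the wrapper approach mentioned here might look something like the sketch below. CheckedPtr, NullDereference and throws_on_deref are made-up names, and this is only the "wrap pointers in a class and overload the indirection operator" idea, not the trap-based mechanism from the quoted post:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical exception type thrown on a null dereference.
struct NullDereference : std::runtime_error {
    NullDereference() : std::runtime_error("null pointer dereferenced") {}
};

// Hypothetical non-owning pointer wrapper that checks before every access.
template <typename T> class CheckedPtr {
public:
    explicit CheckedPtr(T* p = 0) : ptr_(p) {}

    T& operator*() const
    {
        if (!ptr_) throw NullDereference();
        return *ptr_;
    }

    T* operator->() const
    {
        if (!ptr_) throw NullDereference();
        return ptr_;
    }

private:
    T* ptr_;
};

// Returns true if dereferencing p throws NullDereference.
template <typename T> bool throws_on_deref(const CheckedPtr<T>& p)
{
    try {
        (void)*p;
    } catch (const NullDereference&) {
        return true;
    }
    return false;
}
```

The obvious cost is a compare and branch on every access, which is exactly what a hardware trap would avoid.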

Trev · « **Reply #23 on:** July 01, 2009, 07:07:31 AM »

I suspect your template will be faster, but only because the optimizer isn't doing rol's:

Code:

template <signed N> static inline unsigned rotate(unsigned val)
{
    if ((32-(-(N%32)))%32 != 0) {
        if ((32-(-(N%32)))%32 < 9) {
            asm("rorl %1, %0;" : "=d"(val) : "I"((32-(-(N%32)))%32), "0"(val) : "cc");
        }
        else if ((32-(-(N%32)))%32 > 23) {
            asm("roll %1, %0;" : "=d"(val) : "I"(32-((32-(-(N%32)))%32)), "0"(val) : "cc");
        }
        else if ((32-(-(N%32)))%32 == 16) {
            asm("swap %0;" : "=d"(val) : "0"(val) : "cc");
        }
        else {
            asm("rorl %1, %0;" : "=d"(val) : "d"((32-(-(N%32)))%32), "0"(val) : "cc");
        }
    }

    return val;
}

static inline unsigned _rotl(unsigned val, int shift)
{
    shift &= 0x1f;
    val = (val>>(0x20 - shift)) | (val << shift);
    return val;
}

static inline unsigned _rotr(unsigned val, int shift)
{
    shift &= 0x1f;
    val = (val<<(0x20 - shift)) | (val >> shift);
    return val;
}

int main(void)
{
    volatile unsigned x = 1;

    volatile unsigned a = _rotl(x, 1);
    volatile unsigned b = _rotr(x, 1);

    volatile unsigned c = rotate<-1>(x);
    volatile unsigned d = rotate<1>(x);

    return 0;
}

/*
00000000 <main>:
   0:   4e56 ffec       linkw %fp,#-20

volatile unsigned x = 1;
   4:   7001            moveq #1,%d0

volatile unsigned a = _rotl(x, 1);
   6:   2d40 fffc       movel %d0,%fp@(-4)
   a:   202e fffc       movel %fp@(-4),%d0
   e:   721f            moveq #31,%d1
  10:   e2b8            rorl %d1,%d0
  12:   2d40 fff8       movel %d0,%fp@(-8)

volatile unsigned b = _rotr(x, 1);
  16:   202e fffc       movel %fp@(-4),%d0
  1a:   e298            rorl #1,%d0
  1c:   2d40 fff4       movel %d0,%fp@(-12)

volatile unsigned c = rotate<-1>(x);
  20:   202e fffc       movel %fp@(-4),%d0
  24:   e398            roll #1,%d0
  26:   2d40 fff0       movel %d0,%fp@(-16)

volatile unsigned d = rotate<1>(x);
  2a:   202e fffc       movel %fp@(-4),%d0
  2e:   e298            rorl #1,%d0
  30:   2d40 ffec       movel %d0,%fp@(-20)

return 0;
  34:   4280            clrl %d0
  36:   4e5e            unlk %fp
  38:   4e75            rts
*/

I don't know anything about how the optimizer works, really, so I don't know why it's always opting for one solution over another.

Trev · « **Reply #24 on:** July 01, 2009, 07:52:05 AM »

I've been digging into GCC's SSA trees, but it's getting late here. Maybe I'll have a moment of clarity tomorrow and actually understand how they work. :-P

Trev · « **Reply #25 on:** July 01, 2009, 07:53:48 AM »

Quote from: Karlos;514076

Well, that and the fact it doesn't require an additional register to hold the shift value for many of the sizes. Saving a register gives the optimizer more breathing space in 'real' code.

And I suspect that GCC will reduce to constant values anything that isn't defined as or determined to be volatile.
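That's easy enough to see with a constant input: if the rotate can be evaluated at compile time there's nothing left to emit, which is why the test code in this thread routes everything through volatile. A sketch using C++11 constexpr (which neither gcc 2.95.3 nor gcc 4.4.0 supports, so this is purely illustrative); rotl_const and rotl_from_volatile are made-up names and a 32-bit unsigned is assumed:

```cpp
#include <cassert>

// Hypothetical constexpr rotate; the masks keep both shift counts below 32.
constexpr unsigned rotl_const(unsigned val, unsigned shift)
{
    return (val << (shift & 31u)) | (val >> ((32u - (shift & 31u)) & 31u));
}

// With constant arguments the whole call is evaluated at compile time.
static_assert(rotl_const(1u, 1u) == 2u, "folded by the compiler");

// A volatile source forces a real load, so the rotate has to happen at run time.
inline unsigned rotl_from_volatile(volatile unsigned& src, unsigned shift)
{
    const unsigned val = src;   // the read of a volatile object must be performed
    return rotl_const(val, shift);
}
```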

Trev · « **Reply #26 on:** July 01, 2009, 07:24:56 PM »

In gcc 3.4.4, the traditional shift-or expressions are reduced to rotates while the SSA tree is being built, during constant folding and arithmetic reduction, before tree optimization and RTL generation occur. gcc 4.4.0 is probably similar. (I'm on a system without gcc 4.4.0 at the moment.) No idea what gcc 2.95.3 does yet.

EDIT: Hope to have an understanding later today of why gcc 4.4.0 m68k reduces to a shifted right rotate instead of a left rotate. None of this helps gcc 2.95.3, of course, but it's fun nonetheless.

EDIT2: Constant folding and arithmetic reduction should be done prior to RTL generation in gcc 2.95.3 as well.

Trev · « **Reply #27 on:** July 01, 2009, 09:38:42 PM »

gcc 2.95.3 isn't that bad, actually. For the most part, it optimizes in the same way your template would.

Code:


static inline unsigned _rotl(unsigned val, int shift)
{
    shift &= 0x1f;
    val = (val>>(0x20 - shift)) | (val << shift);
    return val;
}

static inline unsigned _rotr(unsigned val, int shift)
{
    shift &= 0x1f;
    val = (val<<(0x20 - shift)) | (val >> shift);
    return val;
}

int main()
{
    volatile unsigned x = 1;

    volatile unsigned c = _rotl(x, 64);
    volatile unsigned d = _rotl(x, 48);
    volatile unsigned e = _rotl(x, 41);
    volatile unsigned f = _rotl(x, 36);
    volatile unsigned g = _rotl(x, 32);
    volatile unsigned h = _rotl(x, 24);
    volatile unsigned i = _rotl(x, 16);
    volatile unsigned j = _rotl(x, 9);
    volatile unsigned k = _rotl(x, 4);
    volatile unsigned l = _rotl(x, 0);

    volatile unsigned m = _rotr(x, 0);
    volatile unsigned n = _rotr(x, 4);
    volatile unsigned o = _rotr(x, 9);
    volatile unsigned p = _rotr(x, 16);
    volatile unsigned q = _rotr(x, 24);
    volatile unsigned r = _rotr(x, 32);
    volatile unsigned s = _rotr(x, 36);
    volatile unsigned t = _rotr(x, 41);
    volatile unsigned u = _rotr(x, 48);
    volatile unsigned v = _rotr(x, 64);

    return 0;
}

Code:

00000000 <main>:
   0:   4e56 ffac       linkw %fp,#-84
   4:   4eb9 0000 0000  jsr 0


   
volatile unsigned x = 1;
   a:   7001            moveq #1,%d0
   c:   2d40 fffc       movel %d0,%fp@(-4)

volatile unsigned c = _rotl(x, 64);
  10:   202e fffc       movel %fp@(-4),%d0
  14:   2d40 fff8       movel %d0,%fp@(-8)

volatile unsigned d = _rotl(x, 48);
  18:   202e fffc       movel %fp@(-4),%d0
  1c:   4840            swap %d0
  1e:   2d40 fff4       movel %d0,%fp@(-12)

volatile unsigned e = _rotl(x, 41);
  22:   202e fffc       movel %fp@(-4),%d0
  26:   7209            moveq #9,%d1
  28:   e3b8            roll %d1,%d0
  2a:   2d40 fff0       movel %d0,%fp@(-16)
  
volatile unsigned f = _rotl(x, 36);
  2e:   202e fffc       movel %fp@(-4),%d0
  32:   e998            roll #4,%d0
  34:   2d40 ffec       movel %d0,%fp@(-20)

volatile unsigned g = _rotl(x, 32);
  38:   202e fffc       movel %fp@(-4),%d0
  3c:   2d40 ffe8       movel %d0,%fp@(-24)
  
volatile unsigned h = _rotl(x, 24);  
  40:   202e fffc       movel %fp@(-4),%d0
  44:   e098            rorl #8,%d0
  46:   2d40 ffe4       movel %d0,%fp@(-28)

volatile unsigned i = _rotl(x, 16);
  4a:   202e fffc       movel %fp@(-4),%d0
  4e:   4840            swap %d0
  50:   2d40 ffe0       movel %d0,%fp@(-32)

volatile unsigned j = _rotl(x, 9);
  54:   202e fffc       movel %fp@(-4),%d0
  58:   e3b8            roll %d1,%d0
  5a:   2d40 ffdc       movel %d0,%fp@(-36)

volatile unsigned k = _rotl(x, 4);
  5e:   202e fffc       movel %fp@(-4),%d0
  62:   e998            roll #4,%d0
  64:   2d40 ffd8       movel %d0,%fp@(-40)
  
volatile unsigned l = _rotl(x, 0);
  68:   202e fffc       movel %fp@(-4),%d0
  6c:   2d40 ffd4       movel %d0,%fp@(-44)

volatile unsigned m = _rotr(x, 0);
  70:   202e fffc       movel %fp@(-4),%d0
  74:   2d40 ffd0       movel %d0,%fp@(-48)

volatile unsigned n = _rotr(x, 4);
  78:   202e fffc       movel %fp@(-4),%d0
  7c:   e898            rorl #4,%d0
  7e:   2d40 ffcc       movel %d0,%fp@(-52)

volatile unsigned o = _rotr(x, 9);
  82:   202e fffc       movel %fp@(-4),%d0
  86:   e2b8            rorl %d1,%d0
  88:   2d40 ffc8       movel %d0,%fp@(-56)

volatile unsigned p = _rotr(x, 16);
  8c:   202e fffc       movel %fp@(-4),%d0
  90:   7210            moveq #16,%d1
  92:   e2b8            rorl %d1,%d0
  94:   2d40 ffc4       movel %d0,%fp@(-60)

volatile unsigned q = _rotr(x, 24);
  98:   202e fffc       movel %fp@(-4),%d0
  9c:   7218            moveq #24,%d1
  9e:   e2b8            rorl %d1,%d0
  a0:   2d40 ffc0       movel %d0,%fp@(-64)

volatile unsigned r = _rotr(x, 32);
  a4:   202e fffc       movel %fp@(-4),%d0
  a8:   2d40 ffbc       movel %d0,%fp@(-68)

volatile unsigned s = _rotr(x, 36);
  ac:   202e fffc       movel %fp@(-4),%d0
  b0:   e898            rorl #4,%d0
  b2:   2d40 ffb8       movel %d0,%fp@(-72)

volatile unsigned t = _rotr(x, 41);
  b6:   202e fffc       movel %fp@(-4),%d0
  ba:   7209            moveq #9,%d1
  bc:   e2b8            rorl %d1,%d0
  be:   2d40 ffb4       movel %d0,%fp@(-76)

volatile unsigned u = _rotr(x, 48);
  c2:   202e fffc       movel %fp@(-4),%d0
  c6:   7210            moveq #16,%d1
  c8:   e2b8            rorl %d1,%d0
  ca:   2d40 ffb0       movel %d0,%fp@(-80)

volatile unsigned v = _rotr(x, 64);
  ce:   202e fffc       movel %fp@(-4),%d0
  d2:   2d40 ffac       movel %d0,%fp@(-84)

return 0;
  d6:   4280            clrl %d0

  d8:   4e5e            unlk %fp
  da:   4e75            rts

If I had to choose a compiler based on this alone, I'd go with gcc 2.95.3. Notice, though, how it does a swap on _rotl(x, ) and not _rotr(x, ). The same goes for direction changes for large shifts.
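To make the comparison concrete, here is a sketch of the selection rules the m68k template from Reply #23 applies to the reduced right-rotate count: immediate ror for 1..8, immediate rol for 24..31, swap for 16, and a register count otherwise. The enum and function names are invented; only the thresholds come from the template:

```cpp
#include <cassert>

// Hypothetical labels for the instruction forms used in the template.
enum RotateForm {
    kNone,          // count reduces to 0: nothing to emit
    kRorImmediate,  // rorl #n,Dn with n in 1..8
    kRolImmediate,  // roll #(32-n),Dn for n in 24..31
    kSwap,          // swap Dn for n == 16
    kRorRegister    // count loaded into a register, then rorl Dx,Dn
};

// Picks a form for a right-rotate count already reduced to 0..31.
inline RotateForm pick_form(unsigned n)
{
    assert(n < 32u);
    if (n == 0u)  return kNone;
    if (n < 9u)   return kRorImmediate;
    if (n > 23u)  return kRolImmediate;
    if (n == 16u) return kSwap;
    return kRorRegister;
}
```

By these rules a right rotate by 16 would get a swap and a right rotate by 24 would become roll #8, which is where the gcc 2.95.3 listing above differs: it loads 16 or 24 with moveq and uses rorl %d1,%d0.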

Your template is better in that regard, but as you noted, you might exclude the templated asm from further optimization. I think, though, that the code should be optimized (or at least scheduled) properly as long as you don't use asm volatile (...).

Trev · « **Reply #28 on:** July 01, 2009, 10:08:01 PM »

Quote from: Karlos;514157

How is it with 8/16-bit rotate?

I'll take a look.

Quote

2.95.3's behaviour is slightly moot at this point as I'm hoping to use a later version anyway. Still a bit confused by your findings above though. Perhaps this could be down to stormgcc's backend? I was under the impression they hadn't messed about with the m68k compiler part at all.

I don't know. If the source on Alinea's web site is current, we can take a look.

I'm thinking I'll have a go at amigaos targets. I'm building win32 native, non-Cygwin tools, which I'm sure would be useful to others, particularly people that don't want their Cygwin environment hijacked by a single target a la the current solutions out there.

EDIT: The StormC gcc (m68k-storm) is a bit of a mess. They built a modified m68k-amigaos binutils, added an m68k-storm target to gcc (modified from the Geek Gadgets m68k-amigaos), configured for the target, and then created a bunch of StormC projects to bootstrap the compiler, probably from a vanilla Geek Gadgets install. Funky. Anyhow, I don't have it built yet, but I'm not sure I'm seeing a benefit to completing it. StormC 4 is based on gcc 2.95.2. It's not difficult to get a new native m68k compiler.

And what I was really interested in is why gcc 4.4.0 doesn't optimize correctly--in fact, worse than gcc 2.95.3 (which still isn't optimal). A shiny new gcc 4.4.0 m68k-*-amigaos* with fixed optimization (for this particular issue, anyway) and a native newlib implementation would be, well, shiny.

Trev · « **Reply #29 from previous page:** July 03, 2009, 08:23:39 PM »

I've started adding m68k*-*-amigaos* target support to gcc 4.4.0, and I have a freestanding compiler built. There's a bug in the adtools gas parser (or in my build of it), however, that causes assembly like 'jsr a6@(-0x228:W)' to be assembled as 'js a6@(-0x228:W)', resulting in an assembler error. 'jsrr a6@(-0x228:W)' assembles as 'jsr a6@(-0x228:W)', so that's a bit funny. Anyway, I think it has something to do with the way the offsets are parsed. If the bit after -0x is longer than two characters, the parser eats the r in jsr.

So, I need to fix that before I can move forward.
