
Author Topic: OT - assembler versus C for Amiga development  (Read 5647 times)


Offline Franko (Topic starter)

  • Hero Member
  • *****
  • Join Date: Jun 2010
  • Posts: 5707
OT - assembler versus C for Amiga development
« on: August 29, 2011, 05:16:58 PM »
OT discussion continued from here: http://www.amiga.org/forums/showthread.php?t=58923

@ Jose @ Golem

Afraid C is not the programming language for me, for the simple reason that, as Golem said, the generated code is waaay bigger (normally increasing the size by a third at least), and it's a lot slower and highly inefficient... :(

I know a lot of folk back in the day who tried C and its variants, but they could never produce the results they wanted with it, especially when working with fast-moving gfx, and most of them ended up coming back to assembler... :)

Assembler to me personally is very easy and simple to understand and use, and much easier to follow when writing large pieces of code. Best of all, you get the absolute best speed and efficiency from your code, and at the end of the day on the Amiga that's what's important... :)
« Last Edit: August 29, 2011, 07:19:44 PM by Karlos »
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: Intuition questions...
« Reply #1 on: August 29, 2011, 05:23:25 PM »
Quote from: Franko;656712
@ Jose @ Golem

Afraid C is not the programming language for me, for the simple reason that, as Golem said, the generated code is waaay bigger (normally increasing the size by a third at least), and it's a lot slower and highly inefficient... :(


That's not a problem with the language, that is a problem with the compiler.  If Sidewinder and I can get Clang to work on AROS 68k, we'll have a bit newer compiler technology and maybe we'll get closer to parity with Assembly developers.

One example is that sometimes assembly programmers could pack 2 words in a long variable to get better register loading.  Clang does this if you use the PBQP register allocator.  I don't know of any existing Amiga compilers that do this.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #2 on: August 29, 2011, 05:36:34 PM »
GCC has options for packing suitably small structures into single register fields. Certainly you can do it for return values using -freg-struct-return.

Packing several values, eg two shorts, into a register doesn't always make sense depending on the CPU you are coding for. For example, on 68040, almost any memory access that results in a direct cache hit is probably going to be faster than having to write several instructions to transform a register to get hold of the element you want, do the operation and then put it all back together again. Yet on the 020, the reverse is almost always true.

Of course, a good compiler should be able to apply the appropriate analysis to the code generation and see which makes more sense here.
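To make the trade-off concrete, here is a minimal sketch in plain C of packing two 16-bit values into one 32-bit word. The function names (pack16x2, unpack_hi, unpack_lo, add_lo) are made up for illustration and are not from any Amiga compiler or library. The add_lo() helper shows the extract, modify and reinsert steps that may or may not beat a cached memory access, depending on the CPU.

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word: hi in the upper half, lo in the lower. */
uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Extract the upper 16-bit element. */
uint16_t unpack_hi(uint32_t packed)
{
    return (uint16_t)(packed >> 16);
}

/* Extract the lower 16-bit element. */
uint16_t unpack_lo(uint32_t packed)
{
    return (uint16_t)(packed & 0xFFFFu);
}

/* Add to the low element only (wrapping at 16 bits), leaving the high element alone.
   This is the "take it apart, operate, put it back together" cost in miniature. */
uint32_t add_lo(uint32_t packed, uint16_t delta)
{
    uint16_t lo = (uint16_t)(unpack_lo(packed) + delta);
    return (packed & 0xFFFF0000u) | lo;
}
```

Whether the packed form wins is exactly the per-CPU question above: one register holds both values, but every update of a single element costs a mask and an OR.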
int p; // A
 

Offline Franko (Topic starter)

  • Hero Member
  • *****
  • Join Date: Jun 2010
  • Posts: 5707
Re: Intuition questions...
« Reply #3 on: August 29, 2011, 05:37:13 PM »
Quote from: SamuraiCrow;656715
That's not a problem with the language, that is a problem with the compiler.  If Sidewinder and I can get Clang to work on AROS 68k, we'll have a bit newer compiler technology and maybe we'll get closer to parity with Assembly developers.

One example is that sometimes assembly programmers could pack 2 words in a long variable to get better register loading.  Clang does this if you use the PBQP register allocator.  I don't know of any existing Amiga compilers that do this.


Whether it's the language itself or the compiler, it makes no difference... :)

C and its derivatives will always produce larger code and will never be as efficient in terms of speed as something written in pure assembler... :)

Way, way back we used to have this argument all the time especially with mates who went to Uni and for whatever reason they had to learn C. They would produce a bit of code in C on the Amiga claiming it was every bit as efficient as Assembler but ALWAYS those of us who coded in Assembler would prove this wrong by producing the same routine in Assembler smaller & faster... :)

The thing was, back then most folk were using an unexpanded A500 (as it cost about 250 quid for a 512KB RAM board), so with just the chipmem available, coding had to be as efficient and fast as possible for such machines... :)

It taught us to be very proficient and not to be lazy when coding something so that we could get the very best out of such small resources and to this day that has always been to me the way to do things on the Amiga, small, efficient and speedy... :)
 

Offline Thorham

  • Hero Member
  • *****
  • Join Date: Oct 2009
  • Posts: 1149
Re: Intuition questions...
« Reply #4 on: August 29, 2011, 05:45:07 PM »
Quote from: Franko;656712
Assembler to me personally is very easy and simple to understand and use, and much easier to follow when writing large pieces of code. Best of all, you get the absolute best speed and efficiency from your code, and at the end of the day on the Amiga that's what's important... :)
Indeed. C is nice on the peecee, but on 680x0 Amigas, assembly language rules.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #5 on: August 29, 2011, 06:06:40 PM »
Personally, I go for C(++) first, then anything that still isn't fast enough after reviewing algorithm changes and profiling I will look at writing assembler replacements for.

Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will basically speed up having to wait for an asynchronous event to happen.

The problem with assembly coding is that unless you keep absolutely up-to-date with each revision of your target architecture, you will always fall foul of bad assumptions in the end.

There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

Then there are general changes in system architecture. On the cacheless 68000, precomputed lookup tables were king to speed up various complex operations. As processors have gotten faster in relation to memory, it often ends up quicker to evaluate the expression than it does to precompute it and perform memory lookups, unless you can arrange your precomputed data in a very cache friendly way.
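As a rough illustration of the two approaches, here is a sketch in plain C that counts the set bits in a byte both ways. The names (popcount8_computed, popcount_table_init, popcount8_lookup) are invented for this example. Which one is faster is not something the code can tell you; it depends on the CPU and on whether the 256-byte table is sitting in the data cache when you need it.

```c
#include <stdint.h>

/* Computed version: clear the lowest set bit until none remain. */
unsigned popcount8_computed(uint8_t v)
{
    unsigned n = 0;
    while (v) {
        v &= (uint8_t)(v - 1);
        n++;
    }
    return n;
}

/* Precomputed version: one byte of table per possible input value. */
static uint8_t popcount_table[256];

/* Fill the table once at start-up using the computed version. */
void popcount_table_init(void)
{
    for (unsigned i = 0; i < 256; i++)
        popcount_table[i] = (uint8_t)popcount8_computed((uint8_t)i);
}

/* Lookup version: a single memory read, fast only if the table is cached
   (or if there is no cache and memory is as fast as the CPU, as on a 68000). */
unsigned popcount8_lookup(uint8_t v)
{
    return popcount_table[v];
}
```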

Anyhow, this is somewhat off-topic.
int p; // A
 

Offline Franko (Topic starter)

  • Hero Member
  • *****
  • Join Date: Jun 2010
  • Posts: 5707
Re: Intuition questions...
« Reply #6 on: August 29, 2011, 06:33:42 PM »
Quote from: Karlos;656729
Personally, I go for C(++) first, then anything that still isn't fast enough after reviewing algorithm changes and profiling I will look at writing assembler replacements for.

Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will basically speed up having to wait for an asynchronous event to happen.

The problem with assembly coding is that unless you keep absolutely up-to-date with each revision of your target architecture, you will always fall foul of bad assumptions in the end.

There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

Then there are general changes in system architecture. On the cacheless 68000, precomputed lookup tables were king to speed up various complex operations. As processors have gotten faster in relation to memory, it often ends up quicker to evaluate the expression than it does to precompute it and perform memory lookups, unless you can arrange your precomputed data in a very cache friendly way.

Anyhow, this is somewhat off-topic.


While I see the points you're making, they are not exactly correct... :)

Take for example something written for doing IO like an HD DOS driver or device. If you wrote that entirely in C it would be highly inefficient in comparison to coding it in assembler... ;)

Sure, no matter which method you choose to write your code in, both are restricted by the hardware, the hardware bus and the physical speeds of the IO lines... :)

BUT that's not where it ends, as the actual code for shifting all this data back and forth is in the driver you write, and if you write it in C and not in highly optimised assembler you lose speed overall, as your routine runs its code x times per second... :)

No matter what you write and which version of the OS or processor you write it for at the end of day C will never outperform Assembler, proven fact and easy to prove... :)
 

Offline Tension

Re: Intuition questions...
« Reply #7 on: August 29, 2011, 06:57:59 PM »
Quote from: Karlos;656729
Personally, I go for C(++) first, then anything that still isn't fast enough after reviewing algorithm changes and profiling I will look at writing assembler replacements for.

Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will basically speed up having to wait for an asynchronous event to happen.

The problem with assembly coding is that unless you keep absolutely up-to-date with each revision of your target architecture, you will always fall foul of bad assumptions in the end.

There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

Then there are general changes in system architecture. On the cacheless 68000, precomputed lookup tables were king to speed up various complex operations. As processors have gotten faster in relation to memory, it often ends up quicker to evaluate the expression than it does to precompute it and perform memory lookups, unless you can arrange your precomputed data in a very cache friendly way.

Anyhow, this is somewhat off-topic.


but very interesting!

Offline golem

  • Sr. Member
  • ****
  • Join Date: May 2002
  • Posts: 430
Re: Intuition questions...
« Reply #8 on: August 29, 2011, 07:07:02 PM »
Quote from: Tension;656743
but very interesting!


+1.
                                                             
A1200 desktop, Blizzard 1260, OS3.9BB2, Indivision Mk II, SCSI Jaz, Ethernet
A1200 desktop, Blizzard 1230, OS3.1, Ethernet
A500, OS1.3
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: Intuition questions...
« Reply #9 on: August 29, 2011, 07:13:34 PM »
@Karlos

Could you split from post 11 onward into a separate thread?  I'd like to continue this discussion without going off-topic.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #10 on: August 29, 2011, 07:18:43 PM »
Quote from: SamuraiCrow;656750
@Karlos

Could you split from post 11 onward into a separate thread?  I'd like to continue this discussion without going off-topic.


Done.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #11 on: August 29, 2011, 07:29:41 PM »
Quote from: franko
While I see the points you're making, they are not exactly correct...

Take for example something written for doing IO like an HD DOS driver or device. If you wrote that entirely in C it would be highly inefficient in comparison to coding it in assembler...

Sure, no matter which method you choose to write your code in, both are restricted by the hardware, the hardware bus and the physical speeds of the IO lines...

BUT that's not where it ends, as the actual code for shifting all this data back and forth is in the driver you write, and if you write it in C and not in highly optimised assembler you lose speed overall, as your routine runs its code x times per second...

You are making some poor assumptions there. If you are dealing with a slow bus, then "inefficient" generated code can often be as fast as hand-written assembler simply because the latency of the IO hides the cost of the operation being performed.

If you want demonstrable proof of this, look no further than C2P to chip RAM on any decent 040 or higher. The most highly tuned implementations tend to run at copy speed, that is to say, as fast as a vanilla unrolled move.l (a0)+,(a1)+ style loop. And yet they have many more instructions per longword transferred than the latter. The point being that the cost of the many extra instructions (compared to a basic copy loop) is entirely masked by the slow bus.

Likewise, a naive C longword copy loop such as the following

Code:
while (count--) {
   *dest++ = *src++;
}

will perform almost as well as a hand written move.l based loop when it comes to slow buses like the Chip RAM or Zorro-II interface. However, the compiler will almost certainly unroll the above at any modest level of optimization, resulting in more efficient code than the above loop implies.

Sure there are other tricks you can try, like playing around with MMU settings and imprecise cache modes on 060 that can get you a boost, so you can definitely improve upon what vanilla C can do in some cases, but not all.

I've tested various techniques to try and burst data faster to my graphics card, using hand-written move16 and other such contrivances, and in the end they simply weren't significantly faster than well-tuned C code (I used a 16x unrolled Duff's device loop).
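For anyone who hasn't met Duff's device, here is a minimal sketch of the idea in C. It is unrolled 8x rather than 16x to keep it short, and the function name copy_longs_duff is made up for this post; it is not the code from the tests mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'count' longwords using an 8x unrolled Duff's device.
   The switch jumps into the middle of the loop body to handle the
   remainder (count % 8) on the first pass; every later pass copies 8. */
void copy_longs_duff(uint32_t *dest, const uint32_t *src, size_t count)
{
    if (count == 0)
        return;

    size_t passes = (count + 7) / 8;   /* trips round the do-while, rounded up */

    switch (count % 8) {
    case 0: do { *dest++ = *src++;
    case 7:      *dest++ = *src++;
    case 6:      *dest++ = *src++;
    case 5:      *dest++ = *src++;
    case 4:      *dest++ = *src++;
    case 3:      *dest++ = *src++;
    case 2:      *dest++ = *src++;
    case 1:      *dest++ = *src++;
            } while (--passes > 0);
    }
}
```

The fall-through between case labels is deliberate: it is what lets one loop body serve both the remainder and the full passes, so there is no separate clean-up loop.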

Anyway, the two aren't in opposition. One of the best features of C is that it's usually fairly easy to add assembler in places where you know the compiler isn't going to be able to compete with your own ingenuity or domain knowledge. Meanwhile, C takes the drudgery out of almost everything else.
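A small sketch of what that mixing can look like with GCC-style inline assembler. The function name byteswap16 is invented for this example, and the target check assumes the compiler defines __m68k__ when building for 68k (treat that macro as an assumption and check your own toolchain). On any other target the same function falls back to plain C, so the calling code never needs to know which path it got.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t byteswap16(uint16_t val)
{
#if defined(__GNUC__) && defined(__m68k__)
    /* 68k build: a single word rotate by 8 swaps the two bytes.
       "=d" asks for a data register, "0" ties the input to the same register. */
    __asm__("rol.w #8, %0" : "=d"(val) : "0"(val) : "cc");
    return val;
#else
    /* Any other target: plain C, left to the compiler to optimise. */
    return (uint16_t)((val << 8) | (val >> 8));
#endif
}
```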
« Last Edit: August 29, 2011, 07:34:41 PM by Karlos »
int p; // A
 

Offline golem

  • Sr. Member
  • ****
  • Join Date: May 2002
  • Posts: 430
Re: Intuition questions...
« Reply #12 on: August 29, 2011, 07:37:43 PM »
Quote from: Karlos;656729
Personally, I go for C(++) first, then anything that still isn't fast enough after reviewing algorithm changes and profiling I will look at writing assembler replacements for.

Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will basically speed up having to wait for an asynchronous event to happen.

The problem with assembly coding is that unless you keep absolutely up-to-date with each revision of your target architecture, you will always fall foul of bad assumptions in the end.

There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

Then there are general changes in system architecture. On the cacheless 68000, precomputed lookup tables were king to speed up various complex operations. As processors have gotten faster in relation to memory, it often ends up quicker to evaluate the expression than it does to precompute it and perform memory lookups, unless you can arrange your precomputed data in a very cache friendly way.

Anyhow, this is somewhat off-topic.


I am not a programmer, only a hobbyist (professionally I do IT support), but I get your point that whether you go to machine code is dependent upon what you are trying to do. I had a big project that was very CPU intensive (mainly subset-generating algorithms) and I coded this in 68k machine code, which was fast but stupid. It taught me about the 68000, but when I converted it to C 12 years later I could even recompile it on Linux with very few changes and it worked. If I was banging the Amiga hardware then obviously this wouldn't have been possible, and I suppose that is one of the cases where assembler rules.
                                                             
A1200 desktop, Blizzard 1260, OS3.9BB2, Indivision Mk II, SCSI Jaz, Ethernet
A1200 desktop, Blizzard 1230, OS3.1, Ethernet
A500, OS1.3
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: Intuition questions...
« Reply #13 on: August 29, 2011, 07:43:47 PM »
Most of the reason you write C code isn't for performance but maintainability.

I've rewritten the hash function of my AmigaE hash table class in Assembly.  It cut out a lot of cruft but mostly the cruft was the result of E not supporting bit rotations.

In order to run it on a non-Classic Amiga, such as an AROS system, I also had to write the code in PortablE. It generated some hacky-looking C++ code, but the GCC compiler knows how to convert a couple of shifts and an OR into a rotate internally. Now suddenly I don't have to worry about writing new Assembly code for my x86 AROS hosted environment for the Mac, nor for PPC AROS, nor anything else.
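For reference, the shift-and-OR idiom looks roughly like this in plain C. This is a sketch, not the PortablE output, and the name rotl32 is just a label for the example. The masking keeps both shift counts in the 0..31 range, so a rotate by 0 or by 32 stays well defined.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by 'bits' using two shifts and an OR.
   Compilers that recognise the pattern can turn this into a single
   rotate instruction on targets that have one. */
uint32_t rotl32(uint32_t val, unsigned bits)
{
    bits &= 31;                                          /* keep the count in 0..31 */
    return (val << bits) | (val >> ((32 - bits) & 31));  /* second mask avoids a shift by 32 */
}
```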

It's a tradeoff that's becoming increasingly biased against hand-optimized code beyond what C can offer.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #14 on: August 29, 2011, 07:53:54 PM »
@SamuraiCrow

LOL, regarding bitwise rotate, I totally agree:
Code:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  File:         platforms/amigaos3_68k/systemlib/machine_bitops_native.hpp
//  Tab Size:     2
//  Max Line:     120
//  Description:  AmigaOS Specific implementation of systemlib internals
//  Comment(s):
//  Library:      System
//  Created:      2006-10-08
//  Updated:      2006-10-08
//  Author(s):    Karl Churchill
//  Note(s):
//  Copyright:    (C)2006+, eXtropia Studios
//                Karl Churchill
//                All Rights Reserved.
//
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

#ifndef _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP
# define _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP

////////////////////////////////////////////////////////////////////////////////
//
//  Native bit operations
//
////////////////////////////////////////////////////////////////////////////////

namespace Machine {

  template<typename T>
  inline T _rotLeft8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        asm(
          "rol.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    } else {
      asm(
        "rol.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotLeft16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T  _rotRight8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        // only rotate when modulus 8 > 0
        asm(
          "ror.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    }
    else {
      asm(
        "ror.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotRight16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }


  inline uint16 swap16(uint16 val)
  {
    if (__builtin_constant_p(val)) {
      val = val<<8|val>>8;
    } else {
      asm(
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP16

  inline uint32 swap32(uint32 val)
  {
    if (__builtin_constant_p(val)) {
      val = val<<16 | val>>16;
      val = ((val&0x00FF00FF)<<8) | ((val&0xFF00FF00)>>8);
    } else {
      asm(
        "rol.w #8, %0\n\t"
        "swap %0\n\t"
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP32

  inline uint64 swap64(uint64 val)
  {
    if (__builtin_constant_p(val)) {
      return  (((val & 0xff00000000000000ull) >> 56)
            | ((val & 0x00ff000000000000ull) >> 40)
            | ((val & 0x0000ff0000000000ull) >> 24)
            | ((val & 0x000000ff00000000ull) >> 8)
            | ((val & 0x00000000ff000000ull) << 8)
            | ((val & 0x0000000000ff0000ull) << 24)
            | ((val & 0x000000000000ff00ull) << 40)
            | ((val & 0x00000000000000ffull) << 56));
    }
    else {
      union { uint64 u64; uint32 u32[2]; };
      u64 = val;
      uint32 msw  = swap32(u32[0]);
      u32[0]      = swap32(u32[1]);
      u32[1]      = msw;
      return u64;
    }
  }
  #define _EXNG2_MACHINE_HAS_SWAP64

  // runtime known rotate
  inline uint32 rotLeft8_32(uint32 bits, uint32 val)  { return _rotLeft8<uint32>(bits, val); }
  inline uint16 rotLeft8_16(uint32 bits, uint16 val)  { return _rotLeft8<uint16>(bits, val); }
  inline uint8  rotLeft8(uint32 bits, uint8 val)      { return _rotLeft8<uint8>(bits, val); }



  #define _EXNG2_MACHINE_HAS_ROL8

  inline uint32 rotLeft16_32(uint16 bits, uint32 val) { return _rotLeft16<uint32>(bits, val); }
  inline uint16 rotLeft16(uint32 bits, uint16 val)    { return _rotLeft16<uint16>(bits, val); }


  #define _EXNG2_MACHINE_HAS_ROL16

  inline uint32 rotLeft32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROL32

  inline uint32 rotRight8_32(uint32 bits, uint32 val) { return _rotRight8<uint32>(bits, val); }
  inline uint16 rotRight8_16(uint32 bits, uint16 val) { return _rotRight8<uint16>(bits, val); }
  inline uint8  rotRight8(uint32 bits, uint8 val)     { return _rotRight8<uint8>(bits, val);  }

  #define _EXNG2_MACHINE_HAS_ROR8

  inline uint32 rotRight16_32(uint32 bits, uint32 val)  { return _rotRight16<uint32>(bits, val); }
  inline uint16 rotRight16(uint32 bits, uint16 val)     { return _rotRight16<uint16>(bits, val); }

  #define _EXNG2_MACHINE_HAS_ROR16

  inline uint32 rotRight32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROR32

  inline sint32 mostSigBit32(uint32 val)
  {
    asm(
      "bfffo %0 {#0:#32}, %0" "\n\t"
      "eor.w #31,%0\n"
      : "=d"(val) : "0"(val) : "cc"
    );
    return val;
  }
  #define _EXNG2_MACHINE_HAS_BFFFO

};

#endif

;-)

If you know your GNU C, you'll recognise that almost all of that reduces down to inserting just the right bitwise rotate operation and, despite the apparent awesome size of the C++ code, it usually boils down to 1-3 instructions that are identical to what you'd write for an assembler version. The __builtin_constant_p() test is a compile-time operation that, when the operand is determined to be a constant value, ends up emitting a constant value for the output. After all, there's no sense in rotating a constant when you can just use the constant it would evaluate to.
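Stripped of the 68k specifics, the __builtin_constant_p() pattern can be sketched like this in plain GNU C. The macro SWAP16 and the helper swap16_runtime are invented names for the illustration, and the helper is ordinary C standing in for where an asm version would go. Note that the macro evaluates its argument more than once, so don't hand it anything with side effects.

```c
#include <stdint.h>

/* Hypothetical runtime path; in the real header this is where the asm lives. */
uint16_t swap16_runtime(uint16_t val)
{
    return (uint16_t)((val << 8) | (val >> 8));
}

#if defined(__GNUC__)
/* If the argument is a compile-time constant, use a pure expression the
   compiler can fold down to a constant; otherwise call the runtime path. */
# define SWAP16(v) \
    (__builtin_constant_p(v) \
        ? (uint16_t)((((v) & 0x00FFu) << 8) | (((v) >> 8) & 0x00FFu)) \
        : swap16_runtime(v))
#else
/* Compilers without the builtin always take the runtime path. */
# define SWAP16(v) swap16_runtime(v)
#endif
```

Both branches give the same answer, which is the whole point: the test only changes what code gets generated, never the result.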
« Last Edit: August 29, 2011, 08:00:19 PM by Karlos »
int p; // A