Author Topic: OT - assembler versus C for Amiga development (Read 13487 times)

Karlos · « **on:** August 29, 2011, 05:36:34 PM »

GCC has options for packing suitably small structures into single register fields. Certainly you can do it for return values using -freg-struct-return.

Packing several values, e.g. two shorts, into a register doesn't always make sense; it depends on the CPU you are coding for. For example, on the 68040, almost any memory access that results in a direct cache hit is probably going to be faster than having to write several instructions to transform a register to get hold of the element you want, do the operation and then put it all back together again. Yet on the 020, the reverse is almost always true.
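To make the trade-off concrete, here is a rough sketch of the two layouts. The names (Pair16, packPair, unpackPair, addToLow) are made up for illustration and are not from any real library. It only shows the extra extract/operate/reinsert work that the packed form needs; which version is faster on a given 68k is something you would have to measure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example type: two 16-bit fields kept as a plain struct.
// Touching one field is a single (ideally cached) memory access.
struct Pair16 {
  std::int16_t hi;
  std::int16_t lo;
};

// The same two fields packed by hand into one 32-bit value,
// i.e. something that fits in a single data register.
inline std::uint32_t packPair(Pair16 p)
{
  return (static_cast<std::uint32_t>(static_cast<std::uint16_t>(p.hi)) << 16)
       | static_cast<std::uint16_t>(p.lo);
}

inline Pair16 unpackPair(std::uint32_t packed)
{
  Pair16 p;
  p.hi = static_cast<std::int16_t>(packed >> 16);
  p.lo = static_cast<std::int16_t>(packed & 0xFFFFu);
  return p;
}

// Updating one field of the packed form: extract it, do the operation,
// then put it all back together again.
inline std::uint32_t addToLow(std::uint32_t packed, std::int16_t delta)
{
  std::uint16_t lo = static_cast<std::uint16_t>(packed & 0xFFFFu);
  lo = static_cast<std::uint16_t>(lo + delta);
  return (packed & 0xFFFF0000u) | lo;
}

// Small usage check: the packed round trip preserves both fields.
inline void demoPackedPair()
{
  Pair16 p = { -2, 7 };
  Pair16 q = unpackPair(addToLow(packPair(p), 5));
  assert(q.hi == -2 && q.lo == 12);
}
```

With the struct, the equivalent of addToLow is just `p.lo += delta`, which is the memory access the 040 tends to handle well.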

Of course, a good compiler should be able to apply the appropriate analysis to the code generation and see which makes more sense here.

Karlos · « **Reply #1 on:** August 29, 2011, 06:06:40 PM »

Personally, I go for C(++) first. Then, for anything that still isn't fast enough after reviewing the algorithm and profiling, I will look at writing assembler replacements.

Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will speed up waiting for an asynchronous event to happen.

The problem with assembly coding is that unless you keep absolutely up-to-date with each revision of your target architecture, you will always fall foul of bad assumptions in the end.

There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

Then there are general changes in system architecture. On the cacheless 68000, precomputed lookup tables were king to speed up various complex operations. As processors have gotten faster in relation to memory, it often ends up quicker to evaluate the expression than it does to precompute it and perform memory lookups, unless you can arrange your precomputed data in a very cache friendly way.
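As a small illustration of the table-versus-compute choice, here is a sketch that counts the set bits in a byte both ways. The names (BitCountTable, bitCountLookup, bitCountComputed) are invented for this example. It makes no claim about which is faster on any particular 68k; that depends on cache behaviour and has to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Table approach: 256 precomputed bit counts, one memory lookup per byte.
struct BitCountTable {
  std::uint8_t count[256];

  BitCountTable()
  {
    count[0] = 0;
    for (int i = 1; i < 256; ++i) {
      // bits in i = lowest bit of i + bits in (i / 2)
      count[i] = static_cast<std::uint8_t>((i & 1) + count[i >> 1]);
    }
  }
};

inline unsigned bitCountLookup(const BitCountTable& table, std::uint8_t v)
{
  return table.count[v];
}

// Computed approach: a handful of register-only operations, no data memory access.
inline unsigned bitCountComputed(std::uint8_t v)
{
  unsigned x = v;
  x = (x & 0x55u) + ((x >> 1) & 0x55u);  // sums of adjacent bit pairs
  x = (x & 0x33u) + ((x >> 2) & 0x33u);  // sums of adjacent nibble halves
  x = (x & 0x0Fu) + ((x >> 4) & 0x0Fu);  // sum of the two nibbles
  return x;
}

// Usage check: both approaches agree for every possible byte.
inline void demoBitCount()
{
  BitCountTable table;
  for (unsigned v = 0; v < 256; ++v) {
    assert(bitCountLookup(table, static_cast<std::uint8_t>(v))
           == bitCountComputed(static_cast<std::uint8_t>(v)));
  }
}
```

A 256-byte table is small enough to sit in cache on the later chips, which is the "cache friendly" case; larger tables are where evaluating the expression starts to look attractive.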

Anyhow, this is somewhat off-topic.

Karlos · « **Reply #2 on:** August 29, 2011, 07:18:43 PM »

Quote from: SamuraiCrow;656750

@Karlos

Could you split from post 11 onward into a separate thread? I'd like to continue this discussion without going off-topic.

Done.

Karlos · « **Reply #3 on:** August 29, 2011, 07:29:41 PM »

Quote from: franko

While I see the points you're making they are not exactly correct however...

Take for example something written for doing IO like an HD DOS driver or device. If you wrote that entirely in C it would be highly inefficient in comparison to coding it in assembler...

Sure, no matter which method you choose to write your code in they both are restricted by hardware & the hardware bus and physical speeds of IO lines...

BUT that's not where it ends, as the actual code for shifting all this data back and forth is in the driver you write and if you write it in C and not in highly optimised assembler you lose speed overall as your routine performs its code x amount of times per second...

You are making some poor assumptions there. If you are dealing with a slow bus, then "inefficient" generated code can often be as fast as hand-written assembler simply because the latency of the IO hides the cost of the operation being performed.

If you want demonstrable proof of this, look no further than C2P to chip RAM on any decent 040 or higher. The most highly tuned implementations tend to run at copy speed, that is to say, as fast as a vanilla unrolled move.l (a0)+,(a1)+ style loop. And yet they have many more instructions per longword transferred than the latter. The point being that the cost of the extra instructions (compared to a basic copy loop) is entirely masked by the slow bus.

Likewise, a naive C longword copy loop such as the following

Code:

while (count--) {
   *dest++ = *src++;
}

will perform almost as well as a hand written move.l based loop when it comes to slow buses like the Chip RAM or Zorro-II interface. However, the compiler will almost certainly unroll the above at any modest level of optimization, resulting in more efficient code than the above loop implies.

Sure, there are other tricks you can try, like playing around with MMU settings and imprecise cache modes on the 060, that can get you a boost, so you can definitely improve upon what vanilla C can do in some cases, but not all.

I've tested various techniques to try and burst data faster to my graphics card, using hand generated move16 and other such contrivances, and in the end they simply weren't significantly faster than well-tuned C code (I used a 16x unrolled Duff's device loop).
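For anyone who hasn't seen one, a 16x unrolled Duff's device for longword copies looks roughly like this. This is a generic sketch, not the actual code used for the graphics card tests; the function name copyLongsDuff16 is made up here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy 'count' longwords from src to dest, 16 per loop pass.
// The switch jumps into the middle of the loop body to deal with the
// count % 16 leftover copies first; every later pass does a full 16.
inline void copyLongsDuff16(std::uint32_t* dest, const std::uint32_t* src, std::size_t count)
{
  if (count == 0) {
    return;
  }
  std::size_t passes = (count + 15) / 16;
  switch (count % 16) {
    case 0:  do { *dest++ = *src++;
    case 15:      *dest++ = *src++;
    case 14:      *dest++ = *src++;
    case 13:      *dest++ = *src++;
    case 12:      *dest++ = *src++;
    case 11:      *dest++ = *src++;
    case 10:      *dest++ = *src++;
    case 9:       *dest++ = *src++;
    case 8:       *dest++ = *src++;
    case 7:       *dest++ = *src++;
    case 6:       *dest++ = *src++;
    case 5:       *dest++ = *src++;
    case 4:       *dest++ = *src++;
    case 3:       *dest++ = *src++;
    case 2:       *dest++ = *src++;
    case 1:       *dest++ = *src++;
             } while (--passes > 0);
  }
}

// Usage check with a count that is not a multiple of 16.
inline void demoCopyLongs()
{
  std::uint32_t src[35];
  std::uint32_t dst[35];
  for (std::size_t i = 0; i < 35; ++i) {
    src[i] = static_cast<std::uint32_t>(1000 + i);
    dst[i] = 0;
  }
  copyLongsDuff16(dst, src, 35);
  for (std::size_t i = 0; i < 35; ++i) {
    assert(dst[i] == src[i]);
  }
}
```

The deliberate case fall-through may trigger warnings on modern compilers, but it is the whole point of the construct.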

Anyway, the two aren't in opposition. One of the best features of C is that it's usually fairly easy to add assembler in places where you know the compiler isn't going to be able to compete with your own ingenuity or domain knowledge. Meanwhile, it takes the drudgery out of almost everything else.
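A minimal sketch of what "adding assembler into C" can look like with gcc: one inline instruction on 68k, plain C++ everywhere else. The function name swapBytes16 is invented for this example, and the check for `__m68k__` assumes a GCC-style compiler that predefines that macro for 68k targets.

```cpp
#include <cassert>
#include <cstdint>

// Swap the two bytes of a 16-bit value.
inline std::uint16_t swapBytes16(std::uint16_t val)
{
#if defined(__GNUC__) && defined(__m68k__)
  // 68k with gcc: a single word rotate by 8 does the job.
  asm("rol.w #8, %0" : "=d"(val) : "0"(val) : "cc");
  return val;
#else
  // Anything else: portable fallback, left to the compiler to optimise.
  return static_cast<std::uint16_t>((val << 8) | (val >> 8));
#endif
}

// Usage check: swapping twice gives the original value back.
inline void demoSwapBytes16()
{
  assert(swapBytes16(0x1234) == 0x3412);
  assert(swapBytes16(swapBytes16(0xBEEF)) == 0xBEEF);
}
```

Callers only ever see an ordinary inline function, so the assembler stays confined to the one place where it earns its keep.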

Karlos · « **Reply #4 on:** August 29, 2011, 07:53:54 PM »

@SamuraiCrow

LOL, regarding bitwise rotate, I totally agree:

Code:


///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  File:         platforms/amigaos3_68k/systemlib/machine_bitops_native.hpp
//  Tab Size:     2
//  Max Line:     120
//  Description:  AmigaOS Specific implementation of systemlib internals
//  Comment(s):
//  Library:      System
//  Created:      2006-10-08
//  Updated:      2006-10-08
//  Author(s):    Karl Churchill
//  Note(s):
//  Copyright:    (C)2006+, eXtropia Studios
//                Karl Churchill
//                All Rights Reserved.
//
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

#ifndef _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP
# define _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP

////////////////////////////////////////////////////////////////////////////////
//
//  Native bit operations
//
////////////////////////////////////////////////////////////////////////////////

namespace Machine {

  template<typename T>
  inline T _rotLeft8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        // only rotate when modulus 8 > 0
        asm(
          "rol.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    } else {
      asm(
        "rol.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotLeft16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T  _rotRight8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        // only rotate when modulus 8 > 0
        asm(
          "ror.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    }
    else {
      asm(
        "ror.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotRight16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }


  inline uint16 swap16(uint16 val)
  {
    if (__builtin_constant_p(val)) {
      // constant input: plain C++ so the compiler can fold the result
      val = val<<8|val>>8;
    } else {
      asm(
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP16

  inline uint32 swap32(uint32 val)
  {
    if (__builtin_constant_p(val)) {
      // constant input: swap the halves, then the bytes within each half
      val = val<<16 | val>>16;
      val = ((val&0x00FF00FF)<<8) | ((val&0xFF00FF00)>>8);
    } else {
      asm(
        "rol.w #8, %0\n\t"
        "swap %0\n\t"
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP32

  inline uint64 swap64(uint64 val)
  {
    if (__builtin_constant_p(val)) {
      return  (((val & 0xff00000000000000ull) >> 56)
            | ((val & 0x00ff000000000000ull) >> 40)
            | ((val & 0x0000ff0000000000ull) >> 24)
            | ((val & 0x000000ff00000000ull) >> 8)
            | ((val & 0x00000000ff000000ull) << 8)
            | ((val & 0x0000000000ff0000ull) << 24)
            | ((val & 0x000000000000ff00ull) << 40)
            | ((val & 0x00000000000000ffull) << 56));
    }
    else {
      union { uint64 u64; uint32 u32[2]; };
      u64 = val;
      uint32 msw  = swap32(u32[0]);
      u32[0]      = swap32(u32[1]);
      u32[1]      = msw;
      return u64;
    }
  }
  #define _EXNG2_MACHINE_HAS_SWAP64

  // runtime known rotate
  inline uint32 rotLeft8_32(uint32 bits, uint32 val)  { return _rotLeft8<uint32>(bits, val); }
  inline uint16 rotLeft8_16(uint32 bits, uint16 val)  { return _rotLeft8<uint16>(bits, val); }
  inline uint8  rotLeft8(uint32 bits, uint8 val)      { return _rotLeft8<uint8>(bits, val); }



  #define _EXNG2_MACHINE_HAS_ROL8

  inline uint32 rotLeft16_32(uint32 bits, uint32 val) { return _rotLeft16<uint32>(bits, val); }
  inline uint16 rotLeft16(uint32 bits, uint16 val)    { return _rotLeft16<uint16>(bits, val); }


  #define _EXNG2_MACHINE_HAS_ROL16

  inline uint32 rotLeft32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROL32

  inline uint32 rotRight8_32(uint32 bits, uint32 val) { return _rotRight8<uint32>(bits, val); }
  inline uint16 rotRight8_16(uint32 bits, uint16 val) { return _rotRight8<uint16>(bits, val); }
  inline uint8  rotRight8(uint32 bits, uint8 val)     { return _rotRight8<uint8>(bits, val);  }

  #define _EXNG2_MACHINE_HAS_ROR8

  inline uint32 rotRight16_32(uint32 bits, uint32 val)  { return _rotRight16<uint32>(bits, val); }
  inline uint16 rotRight16(uint32 bits, uint16 val)     { return _rotRight16<uint16>(bits, val); }

  #define _EXNG2_MACHINE_HAS_ROR16

  inline uint32 rotRight32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROR32

  inline sint32 mostSigBit32(uint32 val)
  {
    // bfffo (68020+) gives the offset of the first set bit counted from the MSB;
    // the eor with 31 turns that offset into a bit number
    asm(
      "bfffo %0 {#0:#32}, %0" "\n\t"
      "eor.w #31,%0\n"
      : "=d"(val) : "0"(val) : "cc"
    );
    return val;
  }
  #define _EXNG2_MACHINE_HAS_BFFFO

} // namespace Machine

#endif

;-)

If you know your GNU C, you'll recognise that almost all of that reduces down to inserting just the right bitwise rotate operation. Despite the apparent awesome size of the C++ code, it usually boils down to 1-3 instructions that are identical to what you'd write for an assembler version. The __builtin_constant_p() test is resolved at compile time. In the swap functions, when the operand is determined to be a constant value, the plain C++ branch is taken and the compiler ends up emitting a constant value for the output. After all, there's no sense in rotating a constant when you can just use the constant it would evaluate to. In the rotate functions, a constant rotate count lets the code pick the shortest immediate form of the instruction.
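To show the "constant in, constant out" idea without any inline assembler, here is a portable sketch. The name rotLeft32Portable is made up and this is not part of the header above. Whether a given compiler turns the runtime case into a single rol.l is an assumption you would need to confirm by looking at the generated code.

```cpp
#include <cassert>
#include <cstdint>

// Portable 32-bit rotate left. Masking both shift counts keeps the
// expression well defined for any 'bits', including 0 and 32.
constexpr std::uint32_t rotLeft32Portable(std::uint32_t bits, std::uint32_t val)
{
  return (val << (bits & 31u)) | (val >> ((32u - bits) & 31u));
}

// With constant operands the whole call is evaluated by the compiler;
// no rotate instruction is needed at run time for these.
static_assert(rotLeft32Portable(8, 0x12345678u) == 0x34567812u, "rotate by 8");
static_assert(rotLeft32Portable(32, 0x12345678u) == 0x12345678u, "rotate by 32 is a no-op");

// Usage check with a rotate count only known at run time.
inline void demoRotLeft32(std::uint32_t runtimeBits)
{
  std::uint32_t once  = rotLeft32Portable(runtimeBits, 0x80000001u);
  std::uint32_t back  = rotLeft32Portable(32u - (runtimeBits & 31u), once);
  assert(back == 0x80000001u);
}
```

The static_assert lines play the same role as the __builtin_constant_p() branches: when everything is known up front, the answer is just a number in the binary.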

Karlos · « **Reply #5 on:** August 29, 2011, 08:51:55 PM »

Quote from: Sidewinder;656765

This is a good argument when thinking about a single process or thread, but on multi-tasking systems like the Amiga a different process may be able to use the open processor time for further computation. Thus an optimized I/O routine would be preferable to an unoptimized one in terms of overall system performance.

Not necessarily. While the processor is waiting for the bus, as it would be with slow IO, you can't just assume you can go away and run another thread. The OS divides processor time into quanta that are much larger than the granularity we are talking about here.

What you are saying is true when you are literally Wait()ing for IO, that is, having put the thread to sleep while waiting for an interrupt or IPC event of some kind.

Karlos · « **Reply #6 on:** August 29, 2011, 08:59:50 PM »

Quote from: billt;656773

What is a good 68k assembler to use today?

Quite a few. If you can find it, DevPac was great. As I'm only writing subcomponents of code in assembler these days, I tend to use PhxAss. Failing that, I just inject assembler directly into C code when using gcc.

Karlos · « **Reply #7 on:** August 29, 2011, 09:02:30 PM »

One final comment on the overall subject: as far as I'm concerned, you don't need to know anything about assembler to be a C programmer, but writing assembly language gives you a much better insight into how to write C optimized for a given platform.

Everybody who wants to write fast code in any compiled language should be at least a bit familiar with assembler, just to understand the inner workings of their kit.

Karlos · « **Reply #8 on:** August 29, 2011, 09:07:36 PM »

Quote from: SamuraiCrow;656776

The author of PhxAss has written a newer Assembler that is more flexible. See VAsm for details.

It was probably vasm that I meant :lol: Old naming traditions die hard.

Karlos · « **Reply #9 on:** August 29, 2011, 10:11:22 PM »

Quote from: itix;656787

I sort of agree but I wouldn't recommend it. It can lead to bad habits. I know a coder who used to write lengthy C# methods because he knew there is always a small overhead when calling subroutines in assembler code. Obviously that is not relevant to C# anymore and often not relevant even to low level languages like C/C++.

Agreed. You should understand how your high-level language works first and foremost, particularly how it is optimized by your compiler. C# adds an extra layer of indirection through the CLR that means a lot of assumptions you might make about low level performance of language constructs may be invalid. However, I stand by the assertion that being able to look at code performance from both ends is better than understanding it from one end only.
