Author Topic: OT - assembler versus C for Amiga development  (Read 5635 times)

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #14 on: August 29, 2011, 07:53:54 PM »
@SamuraiCrow

LOL, regarding bitwise rotate, I totally agree:
Code:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  File:         platforms/amigaos3_68k/systemlib/machine_bitops_native.hpp
//  Tab Size:     2
//  Max Line:     120
//  Description:  AmigaOS Specific implementation of systemlib internals
//  Comment(s):
//  Library:      System
//  Created:      2006-10-08
//  Updated:      2006-10-08
//  Author(s):    Karl Churchill
//  Note(s):
//  Copyright:    (C)2006+, eXtropia Studios
//                Karl Churchill
//                All Rights Reserved.
//
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

#ifndef _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP
# define _EXNG2_SYSTEMLIB_BITOPS_NATIVE_HPP

////////////////////////////////////////////////////////////////////////////////
//
//  Native bit operations
//
////////////////////////////////////////////////////////////////////////////////

namespace Machine {

  template<typename T>
  inline T _rotLeft8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        asm(
          "rol.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    } else {
      asm(
        "rol.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotLeft16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T  _rotRight8(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&7) {
        // only rotate when modulus 8 > 0
        asm(
          "ror.b %1, %0\n"
          : "=d"(val) : "I"(bits&7), "0"(val) : "cc"
        );
      }
    }
    else {
      asm(
        "ror.b %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }

  template<typename T>
  inline T _rotRight16(uint32 bits, T val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&15) {
        // only rotate when modulus 16 > 0
        if ((bits&15) < 9) {
          asm(
            "ror.w %1, %0\n"
            : "=d"(val) : "I"(bits&15), "0"(val) : "cc"
          );
        }
        else {
          // use opposite rotate for N > 8
          asm(
            "rol.w %1, %0\n"
            : "=d"(val) : "I"(16-(bits&15)), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.w %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }


  inline uint16 swap16(uint16 val)
  {
    if (__builtin_constant_p(val)) {
      val = val<<8|val>>8;
    } else {
      asm(
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP16

  inline uint32 swap32(uint32 val)
  {
    if (__builtin_constant_p(val)) {
      val = val<<16 | val>>16;
      val = ((val&0x00FF00FF)<<8) | ((val&0xFF00FF00)>>8);
    } else {
      asm(
        "rol.w #8, %0\n\t"
        "swap %0\n\t"
        "rol.w #8, %0\n"
        : "=d"(val)
        : "0"(val)
        : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_SWAP32

  inline uint64 swap64(uint64 val)
  {
    if (__builtin_constant_p(val)) {
      return  (((val & 0xff00000000000000ull) >> 56)
            | ((val & 0x00ff000000000000ull) >> 40)
            | ((val & 0x0000ff0000000000ull) >> 24)
            | ((val & 0x000000ff00000000ull) >> 8)
            | ((val & 0x00000000ff000000ull) << 8)
            | ((val & 0x0000000000ff0000ull) << 24)
            | ((val & 0x000000000000ff00ull) << 40)
            | ((val & 0x00000000000000ffull) << 56));
    }
    else {
      union { uint64 u64; uint32 u32[2]; };
      u64 = val;
      uint32 msw  = swap32(u32[0]);
      u32[0]      = swap32(u32[1]);
      u32[1]      = msw;
      return u64;
    }
  }
  #define _EXNG2_MACHINE_HAS_SWAP64

  // runtime known rotate
  inline uint32 rotLeft8_32(uint32 bits, uint32 val)  { return _rotLeft8<uint32>(bits, val); }
  inline uint16 rotLeft8_16(uint32 bits, uint16 val)  { return _rotLeft8<uint16>(bits, val); }
  inline uint8  rotLeft8(uint32 bits, uint8 val)      { return _rotLeft8<uint8>(bits, val); }



  #define _EXNG2_MACHINE_HAS_ROL8

  inline uint32 rotLeft16_32(uint32 bits, uint32 val) { return _rotLeft16<uint32>(bits, val); }
  inline uint16 rotLeft16(uint32 bits, uint16 val)    { return _rotLeft16<uint16>(bits, val); }


  #define _EXNG2_MACHINE_HAS_ROL16

  inline uint32 rotLeft32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "rol.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROL32

  inline uint32 rotRight8_32(uint32 bits, uint32 val) { return _rotRight8<uint32>(bits, val); }
  inline uint16 rotRight8_16(uint32 bits, uint16 val) { return _rotRight8<uint16>(bits, val); }
  inline uint8  rotRight8(uint32 bits, uint8 val)     { return _rotRight8<uint8>(bits, val);  }

  #define _EXNG2_MACHINE_HAS_ROR8

  inline uint32 rotRight16_32(uint32 bits, uint32 val)  { return _rotRight16<uint32>(bits, val); }
  inline uint16 rotRight16(uint32 bits, uint16 val)     { return _rotRight16<uint16>(bits, val); }

  #define _EXNG2_MACHINE_HAS_ROR16

  inline uint32 rotRight32(uint32 bits, uint32 val)
  {
    if (__builtin_constant_p(bits)) {
      if (bits&31) {
        // only rotate when modulus 32 > 0
        if ((bits&31) < 9) {
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "I"(bits&31), "0"(val) : "cc"
          );
        }
        else if ((bits&31)==16) {
          asm(
            "swap %0\n"
            : "=d"(val) : "0"(val) : "cc"
          );
        }
        else if ((bits&31)>23) {
          // use opposite rotate for N > 23
          asm(
            "rol.l %1, %0\n"
            : "=d"(val) : "I"(32-(bits&31)), "0"(val) : "cc"
          );
        }
        else {
          // use register rotate for all intermediate sizes
          asm(
            "ror.l %1, %0\n"
            : "=d"(val) : "d"(bits&31), "0"(val) : "cc"
          );
        }
      }
    }
    else {
      asm(
        "ror.l %1, %0\n"
        : "=d"(val) : "d"(bits), "0"(val) : "cc"
      );
    }
    return val;
  }
  #define _EXNG2_MACHINE_HAS_ROR32

  inline sint32 mostSigBit32(uint32 val)
  {
    asm(
      "bfffo %0 {#0:#32}, %0" "\n\t"
      "eor.w #31,%0\n"
      : "=d"(val) : "0"(val) : "cc"
    );
    return val;
  }
  #define _EXNG2_MACHINE_HAS_BFFFO

};

#endif

;-)

If you know your GNU C, you'll recognise that almost all of that reduces down to inserting just the right bitwise rotate operation. Despite the apparent awesome size of the C++ code, it usually boils down to 1-3 instructions that are identical to what you'd write for an assembler version. The __builtin_constant_p() test is resolved at compile time: when the rotate count is a known constant, the code picks the cheapest instruction form for it, and when the value passed to a swap function is a known constant, the plain C path lets the compiler emit the constant result directly. After all, there's no sense in rotating a constant when you can just use the constant it would evaluate to.
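For anyone without a 68k toolchain to hand, the constant-folding idea can be sketched in portable C++. This is only an illustration, not the eXtropia code above: the names rotLeft32Portable, swap16Portable and checkRuntimeRotate are made up, it relies on standard C++11 constexpr rather than __builtin_constant_p(), and whether a given compiler turns the shift idiom into a single rotate instruction is an assumption you'd want to confirm in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable stand-in for rotLeft32(); not part of the original header.
// The count is masked to 0..31 so that a zero rotate never shifts by 32.
constexpr std::uint32_t rotLeft32Portable(std::uint32_t bits, std::uint32_t val)
{
  return (bits & 31) ? ((val << (bits & 31)) | (val >> (32 - (bits & 31)))) : val;
}

// Hypothetical portable stand-in for swap16().
constexpr std::uint16_t swap16Portable(std::uint16_t val)
{
  return static_cast<std::uint16_t>((val << 8) | (val >> 8));
}

// Constant operands: the compiler has to evaluate these itself, so no rotate
// instruction is needed for them at run time.
static_assert(rotLeft32Portable(8, 0x12345678u) == 0x34567812u, "rotate folded at compile time");
static_assert(rotLeft32Portable(32, 0x12345678u) == 0x12345678u, "count is taken modulo 32");
static_assert(swap16Portable(0x1234) == 0x3412, "swap folded at compile time");

// Runtime operands go through the same source: rotating left by N and then by
// (32 - N) must give back the original value.
inline void checkRuntimeRotate(std::uint32_t bits, std::uint32_t val)
{
  assert(rotLeft32Portable(bits, rotLeft32Portable(32 - (bits & 31), val)) == val);
}
```

The asm version above does the same job with more control: it picks the immediate or register form of the instruction explicitly instead of hoping the optimiser spots the idiom.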
« Last Edit: August 29, 2011, 08:00:19 PM by Karlos »
int p; // A
 

Offline Sidewinder

  • Full Member
  • ***
  • Join Date: Mar 2002
  • Posts: 241
    • http://www.liquido2.com
Re: Intuition questions...
« Reply #15 on: August 29, 2011, 08:38:19 PM »
Quote
Some stuff just doesn't need optimizing. Basic IO, for example, is almost always going to be limited by the speed of the device being communicated with. Event handling is another example. No amount of assembler will basically speed up having to wait for an asynchronous event to happen.


This is a good argument when thinking about a single process or thread, but on multi-tasking systems like the Amiga a different process may be able to use the open processor time for further computation.  Thus an optimized I/O routine would be preferable to an unoptimized one in terms of overall system performance.

If I had unlimited time I'd probably want to write everything in assembly to be the most efficient possible.  But, since I do not, the question boils down to priorities.  If it will take another 100 hours of coding to rewrite the I/O system to save only a few cycles, it's probably not worth it unless overall system speed and efficiency are the priority.  In most cases the 100 hours would be better spent speeding up the most used code sections.

In addition, there is the case of specialization.  Very few people can say they are experts in all areas of computer architecture.  There are many cases where I'm fairly certain I would not be able to improve upon the work of others.  For me, I/O is one such case.  I'm content to trust that the authors of the I/O library have done their homework and made their code as efficient as possible.


And the original poster made the claim that with assembler...

Quote
you get the absolute best speed and efficiency from your code...


Clearly this statement cannot always be true.  In the right hands an assembly program can be a masterpiece, but in naive hands it can be a disaster.  Writing efficient assembly code takes skill and extensive knowledge of system architecture.  If one doesn't have this knowledge, C would probably be a better choice--unless, of course, the goal is to gain this knowledge.
Sidewinder
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: Intuition questions...
« Reply #16 on: August 29, 2011, 08:47:31 PM »
The problem with C on small systems is that everybody uses GCC, which just plain isn't designed for efficiency so much as for massively multi-platform reliability. I don't know if there's a decent C99 compiler for 68k out there, but it would sure help things if there were.

That said, I kind of agree with Franko - 68k assembler is the nicest I've ever used in terms of programmer-friendliness, and it's pretty much guaranteed to be better than GCC at least. If you're doing an Amiga-specific project that's fairly simple in organization, I don't see a reason not to use it - it'll help majorly on low-end Amiga systems (and yes, people still use them,) and for those of us with a slightly beefier setup, the extra efficiency is just gravy :)
Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
Synthesizers: Roland JX-10/MT-32/D-10, Oberheim Matrix-6, Yamaha DX7/FB-01, Korg MS-20 Mini, Ensoniq Mirage/SQ-80, Sequential Circuits Prophet-600, Hohner String Performer

"'Legacy code' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: Intuition questions...
« Reply #17 on: August 29, 2011, 08:50:56 PM »
@CommodoreJohn

VBCC is largely C99 compliant.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: Intuition questions...
« Reply #18 on: August 29, 2011, 08:51:55 PM »
Quote from: Sidewinder;656765
This is a good argument when thinking about a single process or thread, but on multi-tasking systems like the Amiga a different process may be able to use the open processor time for further computation.  Thus an optimized I/O routine would be preferable to an unoptimized one in terms of overall system performance.

Not necessarily. While the processor is waiting for the bus, as it would be with slow IO, you can't just assume you can go away and run another thread. The OS divides processor time into quanta that are much larger than the granularity we are talking about here.

What you are saying is true when you are literally Wait()ing for IO, that is, having put the thread to sleep while waiting for an interrupt or IPC event of some kind.
« Last Edit: August 29, 2011, 09:08:18 PM by Karlos »
 

Offline billt

  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 910
    • http://www.billtoner.net
Re: OT - assembler versus C for Amiga development
« Reply #19 on: August 29, 2011, 08:55:33 PM »
What is a good 68k assembler to use today? I may have more use for it on system-agnostic 68000 work than for anything Amiga-specific: something I could use with the easy68K or ide68k simulators and other things that don't have much of a system attached to the CPU. Ideally for all 68K flavors from 68000 to 68060, but at least 68000.

For Amiga programming I'll go C and maybe learn a little C++.

Can one use WinUAE to test generic 68k assembler binaries without an OS in the way?

I'd like to tinker with the free/open Verilog/VHDL 68k CPUs and compare particular behaviours in a simulator against a known working chip or software sim, and it seems UAE or easy68k is probably easier to get and use than some 68k experimenter board today.
Bill T
All Glory to the Hypnotoad!
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: OT - assembler versus C for Amiga development
« Reply #20 on: August 29, 2011, 08:59:50 PM »
Quote from: billt;656773
What is a good 68k assembler to use today?


Quite a few. If you can find it, DevPac was great. As I'm only writing subcomponents of code in assembler these days, I tend to use PhxAss. Failing that, I just inject assembler directly into C code when using gcc.
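As a rough illustration of what injecting assembler into C or C++ looks like with gcc's extended asm, here is a minimal sketch. The rol.w form and its constraints are lifted from swap16 in the header posted earlier in the thread; the function name swapBytes16, the __m68k__ guard and the portable fallback branch are assumptions added so the block also builds on a non-68k host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: byte-swap a 16-bit value.
inline std::uint16_t swapBytes16(std::uint16_t val)
{
#if defined(__GNUC__) && defined(__m68k__)
  // 68k build: a single rotate of the low word by 8 swaps the two bytes.
  //   "=d"(val) : the result lives in a data register
  //   "0"(val)  : the input uses the same register as operand 0
  //   "cc"      : the condition codes are clobbered
  asm("rol.w #8, %0" : "=d"(val) : "0"(val) : "cc");
  return val;
#else
  // Any other target: plain C++ with the same result.
  return static_cast<std::uint16_t>((val << 8) | (val >> 8));
#endif
}

// Swapping twice must give back the original value on either path.
inline void checkSwapRoundTrip(std::uint16_t val)
{
  assert(swapBytes16(swapBytes16(val)) == val);
}
```

Keeping the asm behind a small inline function like this means the rest of the code base stays ordinary C++ and only the leaf routine is machine-specific.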
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: OT - assembler versus C for Amiga development
« Reply #21 on: August 29, 2011, 09:02:30 PM »
One final comment on the overall subject: as far as I'm concerned, you don't need to know anything about assembler to be a C programmer, but writing assembly language gives you a much better insight into how to write C optimized for a given platform.

Everybody that wants to write fast code in any compiled language should be a bit familiar with assembler at least, just to understand the inner workings of how their kit works.
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: OT - assembler versus C for Amiga development
« Reply #22 on: August 29, 2011, 09:04:21 PM »
Quote from: Karlos;656774
Quite a few. If you can find it, DevPac was great. As I'm only writing subcomponents of code in assembler these days, I tend to use PhxAss. Failing that, I just inject assembler directly into C code when using gcc.


The author of PhxAss has written a newer Assembler that is more flexible.  See VAsm for details.
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: OT - assembler versus C for Amiga development
« Reply #23 on: August 29, 2011, 09:07:36 PM »
Quote from: SamuraiCrow;656776
The author of PhxAss has written a newer Assembler that is more flexible.  See VAsm for details.


It was probably vasm that I meant :lol: Old naming traditions die hard.
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: Intuition questions...
« Reply #24 on: August 29, 2011, 09:34:04 PM »
Quote from: SamuraiCrow;656769
VBCC is largely C99 compliant.
Hmm. How's the code quality?
Quote from: billt;656773
What is a good 68k assembler to use today? I may have more use for it on system-agnostic 68000 work than for anything Amiga-specific: something I could use with the easy68K or ide68k simulators and other things that don't have much of a system attached to the CPU. Ideally for all 68K flavors from 68000 to 68060, but at least 68000.
I agree with Karlos that Devpac is quite nice; if you want to code on a non-Amiga platform, vasm (which SamuraiCrow already linked, but it bears repeating) apparently supports Devpac's directives on top of the standard Motorola syntax.
Quote from: Karlos;656775
One final comment on the overall subject: as far as I'm concerned, you don't need to know anything about assembler to be a C programmer, but writing assembly language gives you a much better insight into how to write C optimized for a given platform.

Everybody that wants to write fast code in any compiled language should be a bit familiar with assembler at least, just to understand the inner workings of how their kit works.
Amen. Amen. Even if you never write a single project in assembler, understanding the nuances of your architecture(s) is crucial to being able to write good code for them.
 

Offline itix

  • Hero Member
  • *****
  • Join Date: Oct 2002
  • Posts: 2380
Re: OT - assembler versus C for Amiga development
« Reply #25 on: August 29, 2011, 10:04:13 PM »
Quote from: Karlos;656775
Everybody that wants to write fast code in any compiled language should be a bit familiar with assembler at least, just to understand the inner workings of how their kit works.

I sort of agree, but I wouldn't recommend it. It can lead to bad habits. I know a coder who used to write lengthy C# methods because he knew there is always a small overhead when calling subroutines in assembler code. Obviously that is not relevant to C# anymore, and often not even relevant to low-level languages like C/C++.

I sometimes see similar behaviour in my own code when I am monitoring generated assembly to optimize software-pipelined loops...

But of course assembly coding has its place in time-critical routines, and sometimes you just can't get compilers to produce efficient code for your task (e.g. AltiVec optimizations or using special PPC instructions, move16 on 68k and so on).

Quote
There are many clock cycle optimisations for the basic 68000 that are slower on the 68020. Likewise, once you master the 020's behaviour, a lot of it ends up being counter-productive on the 68040.

That is so true. It is also possible to beat machine language if you select a better algorithm. Bubble sort is always slow no matter how many hours are spent squeezing the last clock cycles away. Refactoring is so much easier in higher-level languages, and you are also more productive.
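To put a rough number on the bubble sort point, here is a small sketch that counts comparisons instead of timing anything. The function names bubbleSortCount, stdSortCount and reversedInput are made up for this example, and std::sort simply stands in for "a better algorithm"; its exact comparison count depends on the standard library, so only the bubble sort figure is stated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain bubble sort that returns how many comparisons it performed.
// For n elements this is always n*(n-1)/2, however well each pass is tuned.
std::size_t bubbleSortCount(std::vector<int>& v)
{
  std::size_t comparisons = 0;
  for (std::size_t end = v.size(); end > 1; --end) {
    for (std::size_t i = 1; i < end; ++i) {
      ++comparisons;
      if (v[i - 1] > v[i]) {
        std::swap(v[i - 1], v[i]);
      }
    }
  }
  return comparisons;
}

// std::sort with a counting comparator, for comparison.
std::size_t stdSortCount(std::vector<int>& v)
{
  std::size_t comparisons = 0;
  std::sort(v.begin(), v.end(), [&comparisons](int a, int b) {
    ++comparisons;
    return a < b;
  });
  return comparisons;
}

// Build the descending sequence n, n-1, ..., 1.
std::vector<int> reversedInput(std::size_t n)
{
  std::vector<int> v(n);
  for (std::size_t i = 0; i < n; ++i) {
    v[i] = static_cast<int>(n - i);
  }
  return v;
}
```

With 1000 elements the bubble sort always performs 1000*999/2 = 499500 comparisons, and shaving a cycle off each one does not change that count; switching to an O(n log n) sort does.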
« Last Edit: August 29, 2011, 10:07:06 PM by itix »
My Amigas: A500, Mac Mini and PowerBook
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: OT - assembler versus C for Amiga development
« Reply #26 on: August 29, 2011, 10:11:22 PM »
Quote from: itix;656787
I sort of agree, but I wouldn't recommend it. It can lead to bad habits. I know a coder who used to write lengthy C# methods because he knew there is always a small overhead when calling subroutines in assembler code. Obviously that is not relevant to C# anymore, and often not even relevant to low-level languages like C/C++.

Agreed. You should understand how your high-level language works first and foremost, particularly how it is optimized by your compiler. C# adds an extra layer of indirection through the CLR that means a lot of assumptions you might make about low level performance of language constructs may be invalid. However, I stand by the assertion that being able to look at code performance from both ends is better than understanding it from one end only.
 

Offline SamuraiCrow

  • Hero Member
  • *****
  • Join Date: Feb 2002
  • Posts: 2280
  • Country: us
  • Gender: Male
Re: Intuition questions...
« Reply #27 on: August 29, 2011, 10:12:43 PM »
Quote from: commodorejohn;656782
Hmm. How's the code quality?


Better than GCC.
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: OT - assembler versus C for Amiga development
« Reply #28 on: August 29, 2011, 10:17:36 PM »
Quote from: itix;656787
I sort of agree, but I wouldn't recommend it. It can lead to bad habits. I know a coder who used to write lengthy C# methods because he knew there is always a small overhead when calling subroutines in assembler code. Obviously that is not relevant to C# anymore, and often not even relevant to low-level languages like C/C++.

I sometimes see similar behaviour in my own code when I am monitoring generated assembly to optimize software-pipelined loops...

That is so true. It is also possible to beat machine language if you select a better algorithm. Bubble sort is always slow no matter how many hours are spent squeezing the last clock cycles away. Refactoring is so much easier in higher-level languages, and you are also more productive.
Yes and no. Unthinking application of optimization techniques learned by rote is going to lead to convoluted code that probably isn't even that optimal, whether you're writing in C, assembler, Forth, or what-the-hell-have-you. And while you're absolutely right that intelligent refactoring can make much more difference than a couple assembler tweaks ever will, I wouldn't say that's "beating machine language" - refactoring is refactoring no matter what it's written in.

This is why I love Michael Abrash. His Black Book is specifically geared towards 386/486 optimization, but there's so much information in there just as suitable to any architecture...one key point of which is "the best optimizer is between your ears." Learning how to identify problem areas and optimize them intelligently will serve you well in any language.