Back in the day, when instructions took a fixed number of ticks, when the registers referenced in opcodes actually corresponded to what was physically in the processor core, when you knew precisely how many ticks it took to fetch a byte from memory, and when there was no on-die instruction cache, you could make a serious argument for designing your algorithm around a processor's instruction set.
Times have changed. The function you just called might happen to live in the cache at that moment, in which case great, or it might have to be fetched from RAM, and there's no way to predict which it will be. Instructions are now broken down into micro-ops and executed out of order and in parallel, guided by prediction hardware making its best guess.
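You can see the unpredictability for yourself with a rough sketch like the one below. It assumes an x86 machine with GCC or Clang (for __rdtsc from x86intrin.h); the buffer size and the "cold vs warm" labels are just illustrative assumptions, not claims about any particular core.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); assumes GCC/Clang on x86 */

#define N (1 << 15)      /* 256 KB of doubles, small enough to stay cached */
static double buf[N];    /* zero-initialised, untouched until first call */

/* Stand-in for "that function you just called": walks the whole buffer. */
static double touch_all(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += buf[i];
    return s;
}

int main(void)
{
    uint64_t t0 = __rdtsc();
    volatile double a = touch_all();   /* cold: page faults plus fetches from RAM */
    uint64_t t1 = __rdtsc();
    volatile double b = touch_all();   /* warm: same code, now mostly cache hits */
    uint64_t t2 = __rdtsc();

    printf("cold: %llu ticks, warm: %llu ticks\n",
           (unsigned long long)(t1 - t0),
           (unsigned long long)(t2 - t1));
    (void)a; (void)b;
    return 0;
}
```

Same instructions both times; wildly different tick counts depending on what the memory hierarchy happens to be holding.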
In many cases a nice hand-rolled assembler routine will look really efficient but in reality stalls the pipeline on every iteration, and then you've got a shiny chunk of assembler garbage. See the sketch below for what that kind of stall looks like.
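Here's a rough sketch of the classic case, written in C rather than asm for readability; the four-accumulator split is just an illustration, not a statement about what any specific core needs:

```c
#include <stddef.h>

/* Every add depends on the previous one, so the loop runs at the
   latency of a single floating-point add: the rest of the core idles. */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains give the out-of-order hardware work to
   overlap; roughly what an optimising compiler emits for the loop
   above if you let it reassociate FP adds (e.g. -ffast-math). */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

A hand-written loop that looks like sum_serial is the "shiny garbage" case: tight, obvious, and bottlenecked on one dependency chain.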
A lot of the optimisations made in the processor cores are tuned for the output of the major compilers, so in many cases writing your asm in a way the compiler wouldn't just wastes cycles.