Author Topic: Disassembled ASM problem. (Read 2852 times)

Doobrey · « **on:** November 11, 2004, 12:40:33 AM »

Just wondered if any of you ASM gurus could tell me wtf is going on with the branch to +2 ?


; a1=dos
; a0=exec
; on stack =proc struct,temp,dunno
LAB_0206:
	MOVEM.L	D2-D3/A3-A4/A6,-(A7)	;2D8A: 48E7301A
	MOVEA.L	36(A7),A3		;2D8E: 266F0024	Addr of buffer to a3
	MOVEA.L	24(A7),A4		;2D92: 286F0018 

;## Clear whitespace from start.
LAB_0207:
	CMPI	#$528B,D0		;2D96: 0C40528B
	MOVE.B	(A3),D0		;2D9A: 1013
	MOVEQ	#32,D1		;2D9C: 7220
	CMP.B	D1,D0		;2D9E: B001		;## Check for space.
	BEQ.S	LAB_0207+2		;2DA0: 67F6
	MOVEQ	#9,D1		;2DA2: 7209
	CMP.B	D1,D0		;2DA4: B001		;## Check for tab.
	BEQ.S	LAB_0207+2		;2DA6: 67F0

Seeing as the code at 0207+2 is actually 'addq.l #1,a3', why the compare?
Is it simply some compiler trick to shave a couple of bytes off the exe, and speed it up by not having another branch?
(Both IRA and VDA68k produced the same code)

BTW, I'll donate £10 to amiga.org if the first person to correctly answer this *isn't* Piru :-)

Karlos · « **Reply #1 on:** November 11, 2004, 12:50:27 AM »

That's quite strange.

Assuming the code entry is at LAB_0206, the compare operation at LAB_0207 would seem to be entirely wasted, since the cc fields will be set by the move.b that follows it.

The compare also seems to take no part in the loop because the branch offsets skip past it. Actually, I am not sure they do. They jump to the address + 2, which assuming a byte based offset actually jumps into the immediate word data of the instruction (the #$528B) - which is your addq #1, a3.

So the effects are that it harmlessly eliminates the effect of the addition operation on initial loop entry, since it is interpreted as a compare (which simply does nothing useful of itself) and each successive iteration afterwards it is interpreted as the add quick.

Hence, I guess it's simply a method of suppressing the add operation the first time through the loop, as a form of flow optimisation in a do/while style construct.
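
For anyone following along, the branch arithmetic can be checked by hand. A short BEQ.S lands at the address of its own opcode word, plus 2, plus the sign-extended low byte of that word. The Python sketch below redoes that sum for both branches, using only the addresses and opcode bytes from the listing; the helper names are invented for illustration.

```python
# Sketch: resolve the two BEQ.S targets from the listing and look at the
# word they land on. Addresses and bytes are copied from the disassembly;
# the function names are made up for this example.

def sign_extend_8(value):
    """Interpret an 8-bit value as a signed displacement."""
    return value - 0x100 if value & 0x80 else value

def bcc_short_target(instr_addr, opcode_word):
    """Target of a short Bcc: opcode address + 2 + signed low byte."""
    return instr_addr + 2 + sign_extend_8(opcode_word & 0xFF)

LAB_0207 = 0x2D96
CMPI_BYTES = bytes.fromhex("0C40528B")   # CMPI.W #$528B,D0 at LAB_0207

targets = [
    bcc_short_target(0x2DA0, 0x67F6),    # first BEQ.S (space)
    bcc_short_target(0x2DA6, 0x67F0),    # second BEQ.S (tab)
]

for t in targets:
    offset = t - LAB_0207
    word = int.from_bytes(CMPI_BYTES[offset:offset + 2], "big")
    print(f"target ${t:04X} = LAB_0207+{offset}, word there = ${word:04X}")
```

Both branches resolve to $2D98, the second word of the CMPI, which is the $528B that decodes as addq.l #1,a3.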

Doobrey · « **Reply #2 on:** November 11, 2004, 01:00:10 AM »

Quote

Karlos wrote:
They jump to the address + 2, which assuming a byte based offset actually jumps into the immediate word data of the instruction (the #$528B).

Yup.. and $528B is the 'addq.l #1,a3' I mentioned, and it makes sense given what the code does (it's part of a routine that reads a config file and strips all whitespace and comments from each line before parsing with ParseArgs() )

I figured the compare is just a dummy op to save on branching past the addq when it's first called, otherwise it'd scan from the 2nd char in the line.
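
To make that control flow concrete, here is a rough behavioural model in Python (a sketch, not an emulation). It assumes a NUL-terminated buffer and simply returns once a non-whitespace byte is found; the listing doesn't show what follows the second BEQ, so both of those are assumptions, and the function names are made up.

```python
SPACE, TAB = 32, 9

def skip_ws_overlapped(buf, pos=0):
    """Model of LAB_0207: the ADDQ is skipped on entry, run on every loop-back."""
    entered_via_branch = False
    while True:
        if entered_via_branch:
            pos += 1                 # LAB_0207+2: ADDQ.L #1,A3
        # otherwise LAB_0207: CMPI.W #$528B,D0, whose result is never used
        ch = buf[pos]                # MOVE.B (A3),D0
        if ch in (SPACE, TAB):       # CMP.B D1,D0 / BEQ.S LAB_0207+2
            entered_via_branch = True
            continue
        return pos                   # index of the first non-whitespace byte

def skip_ws_always_increment(buf, pos=0):
    """Hypothetical variant where the ADDQ also runs on first entry."""
    while True:
        pos += 1
        if buf[pos] not in (SPACE, TAB):
            return pos

print(skip_ws_overlapped(b"  \tFOO=bar\0"))   # index of 'F'
print(skip_ws_overlapped(b"FOO\0"))           # nothing to skip
print(skip_ws_always_increment(b"FOO\0"))     # starts one byte too late
```

The last call shows the "scan from the 2nd char" problem: without the dummy compare (or some other way of skipping the first increment), the first character of the line would never be examined.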

It's something I've seen quite a bit lately, poking about inside the OS. Some parts of the OS are riddled with it, others don't have it at all.

Anyway, ta for confirming my suspicions.

Karlos · « **Reply #3 on:** November 11, 2004, 01:01:38 AM »

I think you replied whilst I was editing my post :-)

As optimisations go, it's a bit pointless, as at best it saves one unconditional branch on loop entry. It's most likely a space-saving optimisation rather than a performance one.

Doobrey · « **Reply #4 on:** November 11, 2004, 01:06:59 AM »

Ditto :lol:

Karlos · « **Reply #5 on:** November 11, 2004, 01:10:26 AM »

The places that have this type of optimisation were possibly written in asm in the first instance. Those that don't are perhaps the result of C compilers?

Or even vice versa, if a C compiler is optimising for size ;-)

-edit-

Would this count as a form of self modifying code? It's certainly "self reinterpreting" code ;-)

Piru · « **Reply #6 on:** November 11, 2004, 01:36:56 AM »

Quote

The places that have this type of optimisation were possibly written in asm in the first instance. Those that don't are perhaps the result of C compilers?

Such code is commonly generated by C compilers because it's typically faster than branching, and often reduces code size too.

Quote

Would this count as a form of self modifying code? It's certainly "self reinterpreting" code

This can indeed cause some problems with the 68060's Branch Prediction cache. Specifically, it can result in an Access Error exception with the Branch Prediction Error bit set (the BPE bit in the FSLW). In such cases the exception handler needs to flush the branch cache before continuing.

NOTE: This isn't something coders need to worry about, unless you're implementing your own OS or replacing the access error exception vector. :-)

Karlos · « **Reply #7 on:** November 11, 2004, 01:41:31 AM »

@Piru

As I said, in this instance it's wasted as a speed optimisation, since it affects the first iteration of the loop only. Surely a size optimisation then - it would save 2 bytes overall compared to a bra.b into the loop body after the add instruction, right?

Given that on the 060 it may invoke an exception, it's hardly a speed optimisation anymore :-)

Piru · « **Reply #8 on:** November 11, 2004, 01:48:47 AM »

Quote

As I said, in this instance it's wasted as a speed optimisation, since it affects the first iteration of the loop only. Surely a size optimisation then - it would save 2 bytes overall compared to a bra.b into the loop body after the add instruction, right?

Yeah.. But in cases where two sets of instructions are equally fast, if one of them needs fewer memory/cache instruction fetches, the smaller code is generally faster. This is highly academic though, as it depends a lot on the state of the cache and the total size of the code being executed, as well as the target CPU. Anyway, in essence a size optimization that doesn't slow down execution can be considered a speed optimization, too.

Quote

Given that on the 060 it may invoke an exception, it's hardly a speed optimisation anymore

Well, it was ok for the 68020/68030 at least, and possibly for the 68040. But if the code is to be executed on a 68060, it's obviously not recommended.

Karlos · « **Reply #9 on:** November 11, 2004, 01:59:02 AM »

All true, of course. I guess it's because I always see the loop unrolled in my mind that optimisations such as this appear virtually futile to me, other than for space saving ;-)

-edit-

Speaking of all things low down and dirty, did you solve my float to int problem yet? I need a non conditional branching way of handling the x=1.0 case and if anybody can solve that one it's you. Don't make me come and extinguish the sauna now.... :-D

-edit2-

Dang, I forgot, I posted that problem prior to the great thread massacre of 2004...

Doobrey · « **Reply #10 on:** November 11, 2004, 02:14:18 AM »

Quote

Piru wrote:
But if the code is to be executed on 68060, it's obviously not recommended.

So would you be horrified to learn that this snippet of code came from Setpatch (v44.38), specifically the NSDPatch part?

Doobrey · « **Reply #11 on:** November 11, 2004, 02:49:49 AM »

Quote

Karlos wrote:
at best it saves one unconditional branch on loop entry.

I dunno why they didn't just do..

subq.l #1,a3
Loop:
addq.l #1,a3

It doesn't make the code any bigger or use any extra branches.
As you've both said, it looks like it was the result of a C compiler being clever (maybe too clever!)
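
As a sanity check on the "any bigger" part, the Python sketch below just adds up the opcode words for the three entry sequences discussed in this thread. The hex words are the standard 68000 encodings as far as I can tell ($6002 is a bra.s with a displacement of 2, $538B is subq.l #1,a3); the dictionary labels are only descriptions.

```python
# Opcode words for three ways of entering the loop without bumping A3
# on the first pass. Each hex string is one 16-bit word.
ENTRY_SEQUENCES = {
    # CMPI.W opcode word, then its immediate, which doubles as ADDQ.L #1,A3
    "overlapped (as disassembled)": ["0C40", "528B"],
    # BRA.S over the ADDQ.L, then the ADDQ.L itself
    "bra.s over the addq": ["6002", "528B"],
    # SUBQ.L #1,A3 before the loop, ADDQ.L #1,A3 at the top of it
    "subq then addq": ["538B", "528B"],
}

def size_in_bytes(words):
    """Total size of a list of hex-encoded opcode words."""
    return sum(len(bytes.fromhex(w)) for w in words)

sizes = {name: size_in_bytes(words) for name, words in ENTRY_SEQUENCES.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

By this count all three come to 4 bytes, so any difference would be in how the first pass executes rather than in size.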

Gofromiel · « **Reply #12 on:** November 11, 2004, 06:20:48 AM »

Have you tried another disassembler like d68k? I've used it since birth and I've never seen an "addr+2" thing!

PiR · « **Reply #13 on:** November 18, 2004, 06:29:33 PM »

@Gofromiel

Sorry, but I think you've missed the main point of this listing... It is not possible to disassemble such code cleanly.

@Doobrey

I guess this dates from the days of optimisation for the plain 68000. Notice that a single cmpi.w seems to be cheaper to execute than two arithmetic operations (both of which also set the CC afterwards, so it is calculation + cmp for both).

But for newer processors your solution is obviously better (and much better than bra).

I think I've also read somewhere that on the 68040 it is faster to have fewer, longer instructions than more, simpler ones.

No such problems for PowerPC code. :-)

And one last word, to extend what Piru said about the 68060 Branch Prediction Cache, for WinUAE users:
Imagine what all this can do to JIT emulation.
Let's assume that we even have an optimising JIT that is able to analyse and store whether the CC needs to be prepared for an instruction (or that even eliminates dead-code instructions).
