My intuition.library source is full of examples like the ones I gave... For me, this library is the slowest part of Kickstart 3.1...
I'll show you another example for the beta 7 release if you want...
Note: passing args in registers is 4 or 5 times faster than on the stack, especially on the 68000/68010/68020, which have no data cache... And you don't need the addq/lea stack cleanup after the subroutine returns...
There are, give or take, some 800 functions which make up intuition.library V40, each of which passes parameters on the stack, and that's not counting the function calls made through amiga.lib stubs. This may be a ripe field for peephole optimizations, but my best guess is that the impact any such optimizations could have on overall system performance will be rather low.
Intuition is largely event-driven: its main purpose is capturing user input and routing the resulting events to the clients which consume and react to them. This kind of activity is inherently infrequent: it rarely happens more than 60 times per second (e.g. moving the mouse pointer), and more often fewer than 10 times per second (hitting a key, clicking a mouse button). Events arrive only as fast as the user can produce them.
Aside from the event processing and routing, Intuition also contains API wrapper code which makes interacting with screens and windows, and rendering into them, possible (and nicer, too, from a programmer's point of view). These wrappers connect directly to graphics.library and layers.library, respectively.
Then there's the rest of the code, which consists of utility functions. For example, if you click on a gadget in a window Intuition will need to figure out if the click hit the gadget or the window. Utility functions such as the one which figures out geometric relationships are written in pure 68k assembly in intuition.library V40.
The parts of Intuition which interface with graphics.library and layers.library are more likely to yield improvements if optimized than those which merely react to the user's input, at the user's own pace.
If the work you are doing in order to optimize code is fun for you, then that's OK, no harm done.
For the record, I would like to point out the scope your optimizations fit into, and where you might want to be selective about what to look into.
You could chip away at the code which reacts to user input in user time and you would see no benefit whatsoever (if a mouse click is processed a microsecond faster than without optimization, the user will most definitely not notice the difference), but changes in the interaction between intuition.library and graphics.library/layers.library might have an impact. That is, assuming that you can actually measure the impact, rather than merely infer it because the modified code spends fewer cycles than it used to.
And generating a thousand super-tiny subroutines (3 or 4 instructions each) called via bsr/jsr is not the sign of a good compiler... Inlining is MUCH faster, really...

The Green Hills 'C' compiler had, for its time, really great data flow analysis capabilities, which allowed it to optimize the operations carried out within a single function. The compiler knew well how the function-local operations were carried out, but it had no idea of the bigger picture of which function called which.
As far as I can tell, Intuition was not written to benefit from function inlining, which would in any case have been constrained by the size of the respective function (back then, preprocessor macros were used instead). You would have had to mark local functions as 'static' and let the compiler decide whether inlining them made sense.
Anyway, as great as the compiler was, it needed hints from the programmer to tell it what to do, and in their absence no function inlining happened. You are asking too much of an optimizing compiler that was a product of its time.