My point is that it was far from smooth, and I am sure the Apollo Core will now be trimmed and the routines hand optimised to fit exactly this use case so that it will be smooth.
Not quite clear to me what your point is. Of course, if you optimize for a particular target, you get better results. The instruction set has, possibly, been optimized for this particular use case, and the assembly has been optimized specifically to take maximum advantage of the extended instruction set. That seems completely legit to me, no problem.
The question would be how generic the instructions are, that is, whether they can be carried over to another use case. I cannot answer that, but as far as I know, the instruction set is documented, so you can check.
Or, to put it a different way, all the extended instruction sets of the x86 were obtained in exactly this way, namely by looking at particular use cases (such as 3D rendering) and providing shortcuts for exactly such cases. Looks all fine to me.