That's rather longer than I expected.
See page 3-1 of http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf. The first 4 stages fetch and assign instructions to an integer unit, the next 4 stages are the dual integer units, and the last two stages complete the instructions.
It's quite a simple design.
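For reference, the stages break down something like this (a little Python listing just to make the structure concrete; the stage names are from my memory of section 3 of the manual, so treat them as approximate):

# Rough sketch of the 68060 pipeline structure as I read section 3 of the UM.
# Stage names are from memory; treat them as approximate.

INSTRUCTION_FETCH_PIPELINE = [
    "IAG",  # Instruction Address Generation
    "IC",   # Instruction fetch Cycle (cache access)
    "IED",  # Instruction Early Decode
    "IB",   # Instruction Buffer
]

# Two operand execution pipelines (pOEP and sOEP), each with these stages:
OPERAND_EXECUTION_PIPELINE = [
    "DS",   # Decode/Select
    "AG",   # operand Address Generation
    "OC",   # Operand fetch Cycle
    "EX",   # EXecute
]

# Final stages that complete the instruction (writeback side):
COMPLETION_STAGES = ["DA", "ST"]  # Data Available, STore

print("IFP:", " -> ".join(INSTRUCTION_FETCH_PIPELINE))
print("OEP:", " -> ".join(OPERAND_EXECUTION_PIPELINE), "(x2: pOEP and sOEP)")
print("completion:", " -> ".join(COMPLETION_STAGES))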
It doesn't evenly distribute instructions between the integer pipelines; it only uses the second integer pipeline when the first is running an instruction that the next one can be run alongside. Whether it can depends on the instruction (not all instructions can even run on the second pipeline) and on the registers involved. If the instruction in the primary pipeline changes a register used in the next instruction, then the next instruction also has to be put on the primary pipeline.
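To make that pairing rule concrete, here's a toy Python model of how I read it: the secondary pipeline only gets the next instruction if that instruction is allowed to run there and doesn't read a register the primary instruction writes. The instructions and fields are made up for illustration, and the real rules in the manual have more conditions than this.

# Toy model of the 68060 dual-issue rule as I understand it. The second
# (secondary) pipeline is only used when the next instruction is allowed to
# run there AND has no register dependency on the instruction in the primary
# pipeline. The manual's actual pairing rules have more cases than this.

from dataclasses import dataclass, field

@dataclass
class Insn:
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    secondary_ok: bool = True   # some instructions can only use the primary pipeline

def dispatch(program):
    """Greedily pair instructions; returns a list of (primary, secondary-or-None) issues."""
    issues, i = [], 0
    while i < len(program):
        primary = program[i]
        secondary = None
        if i + 1 < len(program):
            nxt = program[i + 1]
            depends = bool(primary.writes & nxt.reads)   # primary changes a register nxt uses
            if nxt.secondary_ok and not depends:
                secondary = nxt
        issues.append((primary, secondary))
        i += 2 if secondary else 1
    return issues

prog = [
    Insn("move.l d0,d1", reads={"d0"}, writes={"d1"}),
    Insn("add.l  d2,d3", reads={"d2", "d3"}, writes={"d3"}),  # independent: pairs
    Insn("move.l d3,d4", reads={"d3"}, writes={"d4"}),
    Insn("add.l  d4,d5", reads={"d4", "d5"}, writes={"d5"}),  # reads d4: can't pair
]
for primary, secondary in dispatch(prog):
    print(primary.name, "||", secondary.name if secondary else "(primary only)")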
I don't know if the pipelines will get starved if you're continuously using both integer pipelines for instructions that only take 1 clock cycle to execute. It's not something you can achieve in real-world examples, although since a 32-bit value can contain two instructions it might be possible. There isn't much explained as to how this works though. They do say it's "capable of sustained execution rates of < 1 machine cycle per instruction of the M68000 instruction set", but if it could sustain 2 instructions per machine cycle then I would have thought they would have claimed that.
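A quick back-of-the-envelope on why sustaining 2 instructions per cycle looks hard from the fetch side, assuming (my assumption, not a documented figure) that the fetch pipeline delivers one 32-bit longword per clock:

# Can the fetch side keep two 1-cycle pipelines fed?
# Assumption (mine, not from the manual): the IFP delivers one 32-bit
# longword (4 bytes) of instruction stream per clock.

FETCH_BYTES_PER_CLOCK = 4       # assumed
MIN_INSN_BYTES = 2              # shortest 68k instructions are one 16-bit word

print(FETCH_BYTES_PER_CLOCK / MIN_INSN_BYTES)   # 2.0, only enough if every instruction is 2 bytes

# With a more realistic average instruction length the fetch side falls behind:
avg_insn_bytes = 3.0            # made-up average, just for illustration
print(FETCH_BYTES_PER_CLOCK / avg_insn_bytes)   # ~1.33 instructions/clock sustained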
There is a document (http://cdn.preterhuman.net/texts/underground/phreak/68060Info.txt) that explains the pipelines in more detail; I don't know where the PDF is, as the pictures are missing in the ASCII version. It might be this:
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=289639 but I'm not paying for it :-)
The branch executing in zero cycles doesn't seem to be very well documented, and I can't tell whether they are exaggerating what it does or not. My original thought was that the branch goes down the primary pipeline and the secondary pipeline gets the target or the next instruction (depending on which way the branch is predicted). That doesn't actually make it execute in 0 cycles when looking at the pipeline as a whole, but when looking at the branch on its own it does have a 0-cycle overhead.
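Here's the sort of cycle counting I mean, on a made-up loop. The timings are illustrative (1-cycle body instructions, pairing ignored), and the "folded branch costs nothing" line is the claim in question rather than something I've measured:

# Cycle counting for a small loop under two readings of the zero-cycle claim.
# Illustrative only: each body instruction is assumed to take 1 cycle and
# pairing is ignored.

BODY_INSNS = 4          # e.g. move/add/cmp style 1-cycle instructions
ITERATIONS = 100

# Reading 1: the correctly predicted taken branch is folded away entirely.
cycles_folded = ITERATIONS * BODY_INSNS

# Reading 2: the branch still occupies a cycle like any other instruction.
cycles_unfolded = ITERATIONS * (BODY_INSNS + 1)

print(cycles_folded, cycles_unfolded)   # 400 vs 500 for this made-up loop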
What is odd is that they claim different things for correctly predicted taken and correctly predicted not-taken branches:
"If the BC indicates that the instruction is a branch and that this branch should be predicted as taken,
the IAG pipeline stage is updated with the target address of the branch
instead of the next sequential address. This approach, along with the
instruction folding techniques that the BC uses, allow the 68060 to achieve a
zero-clock latency penalty for correctly predicted taken branches.
If the BC predicts a branch as not-taken, there is no discontinuity
in the instruction prefetch stream. The IFP continues to fetch instructions
sequentially. Eventually, the not-taken branch instruction executes as a
single-clock instruction in the OEP, so correctly predicted not-taken
branches require a single clock to execute. These predicted as not-taken
branches allow a superscalar instruction dispatch, so in many cases, the next
instruction executes simultaneously in the sOEP."
So it would imply that the branch doesn't hit the execute stage of the pipeline, but then the document goes on to say it does:
"The 68060 performs the actual condition code checking to evaluate the
branch conditions in the EX stage of the OEP. If a branch has been
mispredicted, the 68060 discards the contents of the IFP and the OEPs, and
the 68060 resumes fetching of the instruction stream at the correct location.
To refill the pipeline in this manner, there is a seven-clock penalty for a
mispredicted branch."
I guess it comes down to how you interpret this from the first quote:
"allow the 68060 to achieve a zero-clock latency penalty for correctly predicted taken branches"