And I think *I* have something wrong. Michal Shulz did some rough performance calculations, and even though my method (wait for quantum to expire) is semantically correct, the performance penalty is terrifying.
Yeah, that was my point. Making forbid() wait for the other cpus means that these four lines of code will take over 1 task quantum:
forbid()
permit()
forbid()
permit() 
The first forbid() will take anywhere from nothing to 1 task quantum depending on how it aligns with the other cpu's tasks. 
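To put a rough number on that wait, here is a minimal sketch in C. It is a toy model, not AROS code: forbid_wait_cost() and the remaining_us array are hypothetical names, and it assumes forbid() blocks until every other cpu's running task reaches the end of its quantum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not an AROS API. Times are in microseconds.
 * remaining_us[i] is how much of its quantum the task currently
 * running on other cpu i still has left when forbid() is called. */
unsigned long forbid_wait_cost(const unsigned long *remaining_us, size_t ncpus)
{
    unsigned long worst = 0;

    assert(remaining_us != NULL || ncpus == 0);

    /* forbid() can only return once the slowest cpu has reached the
     * end of its quantum, so the cost is the largest remaining slice. */
    for (size_t i = 0; i < ncpus; i++) {
        if (remaining_us[i] > worst)
            worst = remaining_us[i];
    }
    return worst;
}
```

With three other cpus that have 3000, 12000 and 500 microseconds left, the caller sits in forbid() for 12000 microseconds. With more cpus the largest remaining slice tends to be closer to a full quantum.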
Stopping the other cpus immediately will have some performance penalty, but even though it's much higher than the overhead in AOS 3.1, it should be nowhere near a quantum.
You also don't want the other cpus' tasks to lose their quantum when another cpu does a forbid(). Their quantum should be extended by the time they spent suspended.
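The quantum extension could be done with something like the sketch below. Again this is only an illustration under my own assumptions: struct task_slice, slice_resume() and slice_remaining() are made-up names, and I'm assuming the scheduler tracks each task's quantum as an absolute expiry tick.

```c
#include <assert.h>

/* Hypothetical per-task bookkeeping, not an AROS structure. */
struct task_slice {
    unsigned long deadline_tick;   /* tick at which the quantum expires */
};

/* Called when a cpu resumes after being suspended by another cpu's
 * forbid(). The deadline is pushed out by the time spent suspended,
 * so the task keeps exactly the slice it had left. */
void slice_resume(struct task_slice *s,
                  unsigned long suspended_at, unsigned long resumed_at)
{
    assert(resumed_at >= suspended_at);
    s->deadline_tick += resumed_at - suspended_at;
}

/* Ticks left in the quantum at time 'now' (0 if already expired). */
unsigned long slice_remaining(const struct task_slice *s, unsigned long now)
{
    return s->deadline_tick > now ? s->deadline_tick - now : 0;
}
```

So a task with 15 ticks left when it gets suspended still has 15 ticks left when the permit() finally lets it run again, no matter how long the forbid() section lasted.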
Rather than signalling the other cpus, it might be enough to actually stop them. The performance might depend on architecture, plus I don't know how you're abstracting all this stuff, so either way might make more sense.