So it'll be a near rewrite for any AOS-based API?
It would be a rewrite for any OS not already designed for many-core, I'd say. CUDA (and by extension OpenCL) requires you to take a completely different look at how you write code. Essentially, you write code that launches many (read: thousands of) concurrent threads at once, each working on a different section of a dataset that you've divided into a grid. It's like SIMD but a bit more flexible, in that each thread can take a different path of execution at a conditional branch (you pay a penalty when that happens, though, because threads in the same warp that diverge are run one path after the other).
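To make that a bit more concrete, here is a minimal sketch of what such code tends to look like in CUDA. The kernel name `scale_or_negate`, the host helper `run_scale_or_negate`, and the block size of 256 are all made up for illustration; nothing here comes from a particular codebase. Each thread works out its own position in the grid, handles one element, and picks a branch based on the data, which is where divergence can creep in.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per array element.
__global__ void scale_or_negate(const float* in, float* out, int n)
{
    // Global index of this thread within the 1D grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // the grid may be slightly larger than the data

    // Data-dependent branch: threads in the same warp may disagree here,
    // in which case the hardware runs both paths one after the other.
    if (in[i] >= 0.0f) {
        out[i] = 2.0f * in[i];
    } else {
        out[i] = -in[i];
    }
}

// Hypothetical host helper: copies the input to the GPU, launches enough
// blocks to cover n elements, and copies the result back.
// Returns false if n is not positive or a CUDA call fails.
bool run_scale_or_negate(const float* h_in, float* h_out, int n)
{
    if (n <= 0) return false;

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    bool ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover all of n
        scale_or_negate<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // This copy also waits for the kernel to finish.
    if (ok) {
        ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `run_scale_or_negate(in, out, n)` from an ordinary `main` is enough to try it. Note there is no loop over the data anywhere: the "loop" is the grid itself, which is the change in mindset I mean.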
Unfortunately, not all code can be reworked for the many-core approach. Only problems that contain inherent parallelism are suitable.