Calling XMOS parallelism massive is quite a stretch.
GPUs do have that, however. My ATI Radeon HD 6970 has 1536 cores and a total of 2.7 TFLOPS (single precision), and it's rather easy to put to use with OpenCL.
Boooo! nVidia+CUDA 4.x FTW :lol:
I'd always be careful quoting teraflop values for graphics cards. They're almost always unattainable in real code, even code that is explicitly parallel by nature, unless your code is a perfectly structured sequence of fused multiply-adds without any memory accesses, branches or scheduling overhead.
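To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The 1536 cores come from the post above; the 880 MHz clock and 176 GB/s memory bandwidth are my assumptions for a stock HD 6970, and the function names are just made up for illustration. It uses a simple roofline-style bound (the lower of the compute peak and bandwidth times FLOPs per byte), which is a simplification and not a benchmark.

```python
def peak_gflops(alus, clock_ghz, flops_per_cycle=2):
    """Theoretical peak: every ALU retires one FMA (2 FLOPs) each cycle."""
    return alus * clock_ghz * flops_per_cycle


def attainable_gflops(peak, bandwidth_gbs, flops_per_byte):
    """Roofline-style upper bound: capped by compute or by memory traffic."""
    return min(peak, bandwidth_gbs * flops_per_byte)


# 1536 cores is from the thread; clock and bandwidth are assumed values.
peak = peak_gflops(alus=1536, clock_ghz=0.88)

# Single-precision SAXPY (y = a*x + y): 2 FLOPs per element,
# 12 bytes of traffic per element (read x, read y, write y; 4 bytes each).
saxpy_intensity = 2 / 12
saxpy = attainable_gflops(peak, bandwidth_gbs=176.0,
                          flops_per_byte=saxpy_intensity)

print(f"peak:  {peak:.0f} GFLOPS")
print(f"saxpy: {saxpy:.1f} GFLOPS ({100 * saxpy / peak:.1f}% of peak)")
```

Under those assumptions the peak works out to the quoted 2.7 TFLOPS, while a streaming kernel like SAXPY is bounded at around 1% of it, before branches or scheduling overhead even enter the picture.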