I think you have a good point about these chips being useful for special purposes.
Yes, it's a bit unfortunate that it's been easy to get the impression that these are competing with large desktop CPUs. Maybe a descendant of them can, far down the line, but getting there would take a *huge* change in how we write software, as this design is *never* going to compete with large, complicated cores on per-core performance: the cores lack caches, aren't massively superscalar, and never will be, since that would defeat the entire design goal of having something that can scale up to a ridiculous number of cores.
The embedded market would probably be OK with the software pain if it gave a good multiplier in performance per dollar or per watt vs. other architectures. That's the key, though: it has to offer that multiplier and stay well ahead of the curve. Cell was a pretty amazing chip for raw horsepower when it came out, touting the same "a new design doesn't have to carry baggage, thus can perform better for the money" line, but it wasn't long at all before Intel chips were surpassing it again.
I think the difference here is that Cell did not keep bumping the core count. Trying to keep up with Intel on per-core performance is a fool's game unless you happen to have a bunch of super-geniuses and a few hundred billion dollars to blow through.
Epiphany has the potential to be successful *if* they can find suitable niches: workloads that parallelise massively but aren't a good fit for GPUs (a GPU will trounce it on performance for things like vector math), or where "somewhat massively parallel" at very low power works fine. And they do *need* to keep bumping the core count rapidly, to make up for their almost guaranteed inability to increase per-core performance fast enough to compete that way.
E.g. I mentioned networking hardware - I was thinking more about that yesterday, as there's a *huge* price jump from 1Gbps ethernet, where we have stacks of old switches sitting in cupboards - they're cheap enough to use as door-stops - while a decent-sized 10Gbps ethernet switch starts in the $1k-$2k range.
If someone were to put an Epiphany (or more..) on a PCIe card, hook one (or two...) of the 10Gbps links up to a suitable set of PCIe lanes, and make one or two of the other off-chip links available externally, plus a little "Epiphany switch" consisting of a small mesh, you'd potentially have an amazingly high-performance, programmable local interconnect, *much cheaper* than most current 10GE switches and cards. Cable length would likely be a limiting factor, but for very fast local interconnects within a rack, I'd pay good money for that even with cables down to <50cm...
E.g. the Epiphany cores could do a significant amount of TCP/IP offload and routing before anything even hit the CPU, and that would be ideal: it's trivial to parallelise (have one core handle the low-level ethernet, one handle low-level packet filtering, and split responsibility for connections across the rest of the cores, distributing the packet stream between them), yet definitely not suitable for GPU-style acceleration (the work is separate enough to need many independent instruction streams).
It also sounds interesting how quickly one core can read memory from another CPU's memory. However, I think reading memory from other CPUs is a pretty specialized thing that requires even more complex software to take advantage of. It means you are probably implementing an assembly-line multi-core software model, where each core takes the work from the previous core and does another set of operations on it. This was tossed around a lot with the Cell early on, as it can essentially do the same thing via DMA between SPUs, but the drawbacks of trying to manage this efficiently are ridiculous: you have to ensure each task on each CPU takes about the same number of cycles in order to keep the CPUs optimally busy with work. I don't think that model got much use at all.
Sort of. They have a number of examples that do depend on timing, I believe. But writes to the same core arrive in order, so if you don't want to time everything to the cycle, you can do this "transactionally": have one core process a "job", write the result into the next core's memory, and finish by writing to a location used as a semaphore, which the receiving core checks to verify the job is ready for processing. (There's a per-core DMA engine that can move data at 8GB/sec between cores while the CPU does other work, or you can simply write the result onto the mesh in anything from 1 to 8 byte chunks per store instruction.) That'd be roughly the same model.
However, since each core can also read directly from other cores' memory, if you are unsure about the workload split you can easily pull data instead (again using the per-core DMA engines if you prefer). That lets a scheduler on the main CPU, or even on one of the Epiphany cores, periodically "rebalance" tasks if any core gets stuck waiting, since you don't need to move any application data around: if a core ends up idle, just write a few KB of program code to it to switch it to one of the bottlenecked tasks. (The Epiphany cores can request DMA between two external locations, so the job manager could live on one or more of the cores and just fire off DMA calls to repurpose underutilised cores for different jobs.)
They actually have a simple job manager written that'll monitor the utilisation and throughput of each core and schedule jobs on them.
It's obviously going to be less efficient than keeping each core constantly busy through careful timing, but people don't manage that optimally even with regular CPUs or GPUs. So even if you lose some percentage shuffling jobs around, you gain the flexibility of dynamic scheduling across a potentially huge number of cores, and for a lot of tasks that's likely well worth it.
Both the 16-core and 64-core versions measure 15mm x 15mm with a 2 watt maximum. You could fit dozens of them on a PCIe card, for several thousand cores on a single card, without even giving a typical PC PSU or fans a slight workout. The chips can be interconnected in a mesh: each has four external "network" interfaces that can be connected directly to other Epiphany chips or to other IO. They're "only" 64Gbps each, I believe, so memory reads/writes/DMA between cores on different chips are slower, but that's still vastly better than having to go over most "normal" network interconnects within the budget of ordinary mortals.