I've only watched the video, but it sounds something like that, and I think I saw this many months ago. The core concept seems to be keeping each CPU working out of its own local memory instead of dealing with shared caches/memory (e.g. Cell).
Not quite. Each CPU can access the full memory of the system directly. No setting up DMA channels or other nonsense. There's a predictable, fixed per-hop penalty for accessing the local memory of cores elsewhere in the mesh.
(And each chip has a number of 10Gbps external links that you can use to hook multiple chips together, or to interface to external devices)
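For concreteness, here's a minimal sketch of what that looks like from C, assuming the Epiphany-style flat address map where each mesh node owns a 1MB slice of the 32-bit address space with its (row, col) ID in the top bits. The helper name and exact bit layout are my own illustration, not Adapteva's API - check the architecture reference for the real encoding:

    #include <stdint.h>

    /* Illustrative only: form a pointer into another core's local SRAM,
       assuming 6-bit row/column IDs packed into address bits [31:20]
       and a 20-bit (1MB) local offset. */
    static inline volatile uint32_t *
    remote_addr(uint32_t row, uint32_t col, uint32_t local_offset)
    {
        uint32_t core_id = (row << 6) | col;
        return (volatile uint32_t *)(uintptr_t)((core_id << 20) | local_offset);
    }

    void send_word(uint32_t row, uint32_t col, uint32_t off, uint32_t val)
    {
        /* An ordinary store; the mesh routes it with a fixed per-hop
           penalty. No DMA setup, no message-passing library needed. */
        *remote_addr(row, col, off) = val;
    }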
I think they are doomed to failure.
The hardware might work great, but I'd bet it won't perform much better than a typical x64 + GPU.
It's not meant to. This is a dev platform to let people experiment with the chips and programming model.
The interesting part is their roadmap, which points towards thousands of cores with up to 1MB per core on a single chip through process shrinks if/when they can afford it.
They may be advertising it as cheap, but I'm not sure it would still be cheap by the time it made it into a consumer device.
The size of each core means it is, at least in theory, viable to get per-chip manufacturing costs for the 16 and 64 core versions down to a couple of dollars. They're *tiny* compared to current-generation "normal" CPUs, and currently manufactured on relatively cheap/old processes.
The real problem is software. Who is going to run out and port their software to this?
That's the entire reason why they did the Kickstarter, got 6000 people to order boards to experiment with, and released extensive manuals and design files, ranging from the detailed layout of the board to chip design details well beyond what most manufacturers will give you.
Realistically they don't *need* lots of people to port. They need a few people here and there to port applications that those people would order crates' worth of chips to run: simulations that aren't vectorisable, really low-power sensor systems, automotive systems, etc.
Selling a few systems to hobbyists is great for buzz and PR, but where they want to be is volume shipments to OEMs.
Not only that, but this kind of architecture imposes a standing constraint: never having quite enough local memory means constantly juggling the design between fitting your working set into it and keeping work units large enough to pipeline things.
Yes, that is a challenge. But they have demos that pipeline streams of individual words between cores and still achieve performance well in excess of what the much more expensive ARM Cortex CPU can do on its own, thanks to the low-latency inter-core links. So while you won't get maximum throughput that way, they've already shown they can get *good* throughput without agonising over every single byte placement.
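As a toy illustration of that kind of word-granularity pipeline (my own sketch, not Adapteva's demo code; it reuses the hypothetical remote_addr helper from above):

    #define BUF_WORDS 256

    /* Core A streams words straight into a buffer in core B's local
       SRAM, then raises a flag word. B spins on the flag in its own
       local memory, so the polling never touches the mesh. */
    void produce(const uint32_t *src, uint32_t dst_row, uint32_t dst_col)
    {
        volatile uint32_t *buf  = remote_addr(dst_row, dst_col, 0x0000);
        volatile uint32_t *flag = remote_addr(dst_row, dst_col, 0x1000);
        for (int i = 0; i < BUF_WORDS; i++)
            buf[i] = src[i];  /* back-to-back stores pipeline in the mesh */
        *flag = 1;
    }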
You also get all sorts of code/feature design problems from the fact that you need to stay in local memory to be efficient: try to go read some global variable and you're toast.
You're not toast, though of course it helps to stay in local memory. Each core has throughput of up to 64GB/sec, and latency for a memory operation is 1.5 clock cycles per node-to-node hop.
In a 16-core system, the worst-case latency is thus 9 cycles (between opposing corners of the 4x4 mesh: 3 hops across the rows plus 3 across the columns, for 6 hops at 1.5 cycles each), or 21 cycles for a 64-core version (14 hops).
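If you want to sanity-check those numbers, it's just Manhattan distance times the per-hop cost (purely illustrative arithmetic, not a hardware API):

    /* Worst-case one-way latency, corner to corner, at 1.5 cycles/hop.
       Returns tenths of a cycle so the arithmetic stays in integers. */
    static inline unsigned worst_case_tenth_cycles(unsigned rows, unsigned cols)
    {
        unsigned hops = (rows - 1) + (cols - 1);  /* Manhattan distance */
        return hops * 15;                         /* 1.5 cycles per hop */
    }
    /* 4x4 mesh (16 cores): 6 hops  -> 90  -> 9 cycles
       8x8 mesh (64 cores): 14 hops -> 210 -> 21 cycles */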
You can keep pipelining memory operations, so "off core" writes and reads are fine as long as you can set things up to execute a number of them in sequence, and ideally allocate memory at "nearby" nodes.
A single 16-core Epiphany chip has higher peak bandwidth than a Cell, and the 64-core chips are being produced in sample quantities.
(In fact, given the cost of current high-speed network interconnects (per-port switch costs for 10Gbit Ethernet are still in the $100-$200 range, plus cards), there are obvious opportunities here for networking hardware with intelligent offloading capabilities; if you took even the current board design and hooked some of the Epiphanies' IO lines to PCIe lanes, you'd already have a product I'd buy.)
All this to make it work on this chip that is probably going to have less market share than the Amiga? Sounds like a winner.
There are likely to be more Parallellas in active use than Amigas within a month or two, given the batch currently being manufactured. But that's beside the point.
The point is not really to make them key features of end-user systems, but to get them into the hands of developers so they end up in embedded systems etc. Routers. Sensor systems. Software-defined networking. Controllers for various hardware (the RAID cards in my servers at work already have PPCs with much lower memory throughput on them; most of our hard drives have ARM SoCs controlling them - yet a big problem for SSDs today is that existing interfaces limit throughput, to the extent that some of our servers have PCIe cards with multiple RAID'ed SSDs straight on the board...).
If they prove to do well for more general computing tasks too (and Adapteva do keep churning out interesting demos, with source), then that's an awesome bonus, but they don't need that for success.
If this architecture ever does catch on, and some variant of it probably will decades from now, it will probably come from Intel or some other giant. This chip is going to just be played with by some nerds and researchers.
Maybe. But it will be fun.
And consider that Intel are not the guys to beat if your goal is volume rather than high margins per unit. In units, Intel is far down the list of CPU manufacturers. They're the Apple of the CPU world: making tons of money on premium products, but without much presence in the budget space. And all the innovation in the many-tiny-cores space is coming from startups (see e.g. GreenArrays, whose 144-core chip is even smaller than the Epiphany 16-core chip...) because they are free from "having" to try to scale up massively complex existing CPU architectures.