Author Topic: Adapteva Parallella-16

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« on: November 05, 2013, 01:33:36 PM »
Quote from: Iggy;751792
The fact that they haven't produced many of these yet does not bode well for them.
And using a parallel computing device is a complex task.


That's the entire point of the Parallella board: to get the chips into the hands of developers who want to experiment with them. They've manufactured about 6000 boards to meet Kickstarter rewards and pre-orders, plus a few thousand more chips AFAIK.

Of course it has a high chance of failure, but it's a very fascinating design because these are all separate CPU cores rather than vector units, which has the potential to open up different use cases.

Quote

The idea that really caught my attention was the Parallella 64.


Note that their roadmap is actually targeting *thousands* of cores. These are just the first baby steps...

Quote

And, of course, the Parallella board's use of a Cortex A9 Arm processor to coordinate the use of all these cores is pretty neat.


It's neat, but again this is a dev board - you can now buy just the chips too, in packs of 8 (for $595 - obviously still intended for sampling / dev usage while they ramp up).
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #1 on: November 05, 2013, 03:00:21 PM »
Quote from: Bif;751763
Only watched the video but it sounds something like that, and I think I've seen this many months before. The core concept seems to be keeping all the CPUs to their local memory instead of dealing with shared caches/memory (e.g. Cell).


Not quite. Each CPU can access the full memory of the system directly - no setting up DMA channels or other nonsense. There's a predictable, fixed per-hop penalty for accessing the local memory of cores elsewhere in the mesh.
(And each chip has a number of 10Gbps external links that you can use to hook multiple chips together, or to interface to external devices)
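
As a rough illustration of what "access directly" means in practice - a minimal sketch in plain C, not Adapteva SDK code. The address split below follows the published Epiphany addressing scheme, where the upper 12 bits of a 32-bit address name the destination core's mesh coordinates and the lower 20 bits an offset into its local 1MB address window; the helper name and the coordinates are made up for the example:

```c
#include <stdint.h>

/* Build a global address pointing into another core's local memory.
 * On Epiphany a 32-bit address is (row:6 | col:6 | offset:20), so a
 * plain pointer dereference becomes a mesh transaction - no DMA setup. */
static inline volatile uint32_t *remote_ptr(unsigned row, unsigned col,
                                            uint32_t local_offset)
{
    uint32_t addr = (row << 26) | (col << 20) | (local_offset & 0xFFFFF);
    return (volatile uint32_t *)(uintptr_t)addr;
}

void send_word(void)
{
    /* Write a word straight into core (1,2)'s local memory at offset
     * 0x4000. The store costs a fixed penalty per mesh hop it crosses. */
    *remote_ptr(1, 2, 0x4000) = 0xCAFEBABE;
}
```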

Quote

I think they are doomed to failure.

The hardware might work great, but I'd make a bet it wouldn't perform much better than a typical X64 + GPU.


It's not meant to. This is a dev platform to let people experiment with the chips and programming model.

The interesting part is their roadmap, which points towards thousands of cores with up to 1MB per core on a single chip through process shrinks if/when they can afford it.

Quote

 They may be advertising it as cheap but I'm not sure it would be too cheap by the time it made it into a consumer device.


The size of each core means it is, at least in theory, viable to get per-chip manufacturing costs for the 16- and 64-core versions down to a couple of dollars per chip. They're *tiny* compared to current-generation "normal" CPUs, and currently manufactured on relatively cheap/old processes.

Quote

The real problem is software. Who is going to run out and port their software to this?


That's the entire reason why they did the Kickstarter and got 6000 people to order boards to experiment with, and why they've released extensive manuals and design files - ranging from the detailed layout of the board to chip design details well beyond what most manufacturers will give you.

Realistically they don't *need* lots of people to port. They need a few people here and there to port applications that those people would order crates' worth of chips to run - such as simulations that aren't vectorisable, really low-power sensor systems, automotive systems, etc.

Selling a few systems to hobbyists is great for buzz and PR, but where they want to be is volume shipments to OEMs.

Quote

Not only that, but this kind of architecture imposes a constant limitation on you where not having quite enough local memory causes constant design juggling of fitting your work in it vs. keeping work units large enough to pipeline things.


Yes, that is a challenge. But they do have demos that pipeline streams of individual words between cores and still achieve performance well in excess of what the much more expensive ARM Cortex CPU can do on its own, thanks to the low-latency inter-core links. So while you won't get maximum throughput that way, they have already shown they can get *good* throughput without agonising over the placement of every single byte.

Quote

You also get all sorts of code/feature design problems due to the fact that you need to stay in local memory to be efficient, try to go read some global variable and you are toast.


You are not toast, though of course it helps to stay in local memory. Each core has a possible throughput of up to 64GB/sec. Latency for a memory operation is 1.5 clock cycles per node-to-node hop.

In a 16-core system the worst-case latency is thus 9 cycles (between opposing corners of the 4x4 mesh: 3 hops to reach the furthest row and 3 more to reach the furthest column, for 6 hops at 1.5 cycles each), or 21 cycles worst case for a 64-core version (14 hops across the 8x8 mesh).

You can keep pipelining memory operations, so "off core" writes and reads are fine as long as you can set things up to execute a number of them in sequence, and ideally allocate memory at "nearby" nodes.
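
To make the hop arithmetic concrete, here's a tiny sketch (plain C, using only the 1.5-cycles-per-hop figure quoted above):

```c
#include <stdlib.h>

/* Latency between cores at (r1,c1) and (r2,c2) in an Epiphany mesh,
 * at the fixed 1.5 clock cycles per node-to-node hop quoted above. */
static double hop_latency_cycles(int r1, int c1, int r2, int c2)
{
    int hops = abs(r1 - r2) + abs(c1 - c2);   /* Manhattan distance */
    return hops * 1.5;
}

/* hop_latency_cycles(0, 0, 3, 3) == 9.0   -> 4x4 (16-core) worst case
 * hop_latency_cycles(0, 0, 7, 7) == 21.0  -> 8x8 (64-core) worst case */
```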

A single 16-core Epiphany chip has a higher peak bandwidth than a Cell, and the 64-core chips are in production in sample quantities.

(In fact, given the cost of current high-speed network interconnects - per-port switch costs for 10Gbit Ethernet are still in the $100-$200 range, plus cards - there are obvious opportunities for networking hardware with intelligent offloading capabilities here; if you took even the current card design and hooked some of the IO lines of the Epiphanies up to PCIe lanes, you'd already have a product I'd buy.)

Quote

All this to make it work on this chip that is probably going to have less market share than the Amiga? Sounds like a winner.


There are likely to be more Parallellas in active use than Amigas within a month or two, given the currently manufactured batch. But that is beside the point.

The point is not really to make them key features of end-user systems, but to get them into the hands of developers and from there into embedded systems etc. Routers. Sensor systems. Software-defined networking. Controllers for various hardware (the RAID cards in my servers at work already have PPCs with much lower memory throughput on them; most of our hard drives have ARM SoCs controlling them - and a big problem for SSDs today is that existing interfaces limit throughput, to the extent that some of our servers have PCIe cards with multiple RAID'ed SSDs straight on the board...).

If they prove to do well for more general computing tasks too (and Adapteva do keep churning out interesting demos, with source), then that's an awesome bonus, but they don't need that for success.

Quote

If this architecture ever does catch on, and some variant of it probably will decades from now, it will probably come from Intel or some other giant. This chip is going to just be played with by some nerds and researchers.


Maybe. But it will be fun.

And consider that Intel are not the guys to beat if your goal is volume rather than high margins per unit. In units, Intel is far down the list of CPU manufacturers. They're the Apple of the CPU world - making tons of money on premium products, but without much presence in the budget space. And all the innovation in the many-tiny-cores space is coming from startups (see e.g. GreenArrays, whose 144-core chip is even smaller than the Epiphany 16-core chip...) because they are free from "having" to try to scale up massively complex existing CPU architectures.
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #2 on: November 05, 2013, 04:46:43 PM »
Quote from: Iggy;751881
@ vidarh

I remember when this was first announced and they were quoting all kinds of silly frequency figures, supposedly based on what the "equivalent" of all these cores running at once would be.


They messed up in their marketing copy when trying to come up with some way of communicating the capacity to non-technical users. All the raw values, down to clock-cycle timings for the instruction set and communication latencies, were available, and they acknowledged in retrospect that it perhaps wasn't a smart thing to do.

Quote

I think some of the past comparisons to the Cell B.E. are apt.
Just like the Cell's SPE units, coordinating all these separate processing units is going to be quite a task.


The Epiphany cores are more equivalent to a general-purpose CPU - e.g. they had some interns put together an (optional) task scheduler that allocates tasks to cores just as a normal OS would. Coordinating them is no different from coordinating threads on any other general-purpose CPU. And the cores can access each other's memory directly. Or you can "just" use OpenCL (but you won't get the full benefit of the chip).

But yes, it is going to be quite a task to get the most out of them, and that's the reason for the Parallella board, so they can start getting feedback on what works and what doesn't and get the boards in the hands of people who want to learn how to work with them. I'm eagerly awaiting my two boards...
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #3 on: November 05, 2013, 05:01:30 PM »
Quote from: nicholas;751885
Oh that's quite different than the Cell and more akin to the traditional multicore SMP model if each of these cores is just a standard ARM CPU.


They are *not* ARM CPUs.

It's their own RISC architecture. Very simple design. Slightly superscalar (it can execute one full integer instruction and one full floating-point instruction per clock cycle), and it looks like there's some degree of pipelining. It has 64 general-purpose 32-bit registers. The instruction set is, from what I can see, quite a bit smaller than M68k's, but powerful enough.

They are definitely fully general-purpose CPUs. You could run a "real" (non-MMU) OS on one with enough porting effort (though the small amount of per-core memory would make that very wasteful).

Quote

I'd like to see how a heavily multithreaded API/OS design like the BeOS would scale on one of these things.


I think trying to port a general purpose OS directly to the Epiphany architecture won't make much sense at this stage. It might be fun to do if/when they meet their goals of versions with substantially more per-core memory.

Something more esoteric, like a FORTH-based micro-kernel, might be fun and very much doable, though.
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #4 on: November 06, 2013, 10:43:50 AM »
Quote from: Bif;751911

I think you have a good point about these chips being useful for special purposes.


Yes, it's a bit unfortunate that it's been easy to get the impression that these are competing with large desktop CPUs. Maybe a descendant of them can at some point far down the line, but getting there would take a *huge* change in how we write software, as this design is *never* going to compete with large, complicated cores on per-core performance (the cores lack caches, don't have the advantage of being massively superscalar etc., and never will, as that would defeat the entire design goal of something that can scale up to a ridiculous number of cores).

Quote

 The embedded market would probably be OK with the software pain if it could give a good multiplier in performance for $ or power consumption vs. other architectures. That is the key though, it has to offer that multiplier and stay well ahead of the curve. Cell was a pretty amazing chip when it came out for raw horsepower and touted the same "new design doesn't have to carry baggage thus can perform better for the money", but it wasn't too long at all before Intel chips were surpassing it again.


I think the difference here is that the Cell did not keep bumping the core count. Trying to keep up with Intel on per-core performance is a fool's game unless one happens to have a bunch of super-geniuses and a few hundred billion dollars to blow through.

Epiphany has the potential to be successful *if* they can find suitable niches for massively parallel work that isn't well suited to GPUs (a GPU will trounce it on performance for things like vector maths), or where "somewhat massively parallel" at very low power works fine. And they do *need* to keep bumping the core count rapidly to make up for their almost guaranteed inability to increase per-core performance fast enough to compete that way.

E.g. I mentioned networking hardware - I was thinking more about that yesterday. There's a *huge* price jump from 1Gbps Ethernet, where we have stacks of old switches sitting in cupboards (they're cheap enough to use as door-stops), to 10Gbps, where a decent-size Ethernet switch starts in the $1k-$2k range.

If someone were to put an Epiphany (or more..) on a PCIe card, hook one (or two...) of the 10Gbps links up to a suitable set of PCIe lanes, make one or two of the other off-chip links available externally, and build a little "Epiphany switch" consisting of a small mesh, you'd potentially have an amazingly high-performance, programmable local interconnect - *much cheaper* than most current 10GE switches and cards. Cable length would likely be a limiting factor, but for very fast local interconnects within a rack I'd pay good money for that, even with cables down to <50cm...

E.g. the Epiphany cores could do a significant amount of TCP/IP offload and routing before traffic even hit the host CPU. That would be ideal, as it is trivial to parallelise (have one core handle the low-level Ethernet framing, one handle low-level packet filtering, and split responsibility for connections across the rest of the cores, distributing the packet stream between them - see the sketch below) yet definitely not suitable for GPU-style acceleration (the tasks are separate enough to need many separate instruction streams).
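
A minimal sketch of the connection-splitting idea (plain C; the hash function and `NUM_WORKER_CORES` are made up for the example, not anything from Adapteva):

```c
#include <stdint.h>

#define NUM_WORKER_CORES 14  /* 16 cores minus one for Ethernet, one for filtering */

/* Pick a worker core for a TCP connection by hashing its 4-tuple, so all
 * packets of one connection always land on the same core. That keeps
 * per-connection state in that core's local memory - no sharing needed. */
static unsigned core_for_connection(uint32_t src_ip, uint32_t dst_ip,
                                    uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);
    h ^= h >> 16;            /* mix the high bits down */
    h *= 0x45d9f3bu;         /* cheap multiplicative scramble */
    return (h >> 8) % NUM_WORKER_CORES;
}
```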

Quote

Also sounds interesting how quick it is to read from other CPUs' memory. However, I think reading memory from other CPUs is a pretty specialized thing that requires even more complex software to take advantage of. This means you are probably implementing an assembly-line multi-core software model where each core takes the work from the previous core and does another set of operations on it. This was tossed around a lot with the Cell early on as it can essentially do the same thing via DMA between SPUs, but the drawbacks of trying to manage this efficiently are ridiculous, as you have to ensure each task on each CPU takes about the same amount of cycles in order to keep the CPUs optimally busy with work. I don't think that model got much use at all.


Sort of. They have a number of examples that do depend on timing, I believe. But write order to the same core is deterministic, so if you don't want to time everything to the cycle you can do this "transactionally": have one core process a "job" and write the result into the memory of the next core (there's a DMA engine that can move data at 8GB/sec between cores while the CPU does other work, or you can simply write the result onto the mesh in 1- to 8-byte chunks per store instruction), finishing with a write to a location used as a semaphore that the receiving core checks to verify the job is ready for processing. That'd be roughly the same model.
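
As a sketch of that write-then-flag pattern (plain C for the idea only, not Adapteva SDK calls; the `mailbox_t` layout and function names are invented, and on real hardware the producer's pointer would be a mesh address into the consumer's local memory, as in the earlier address sketch):

```c
#include <stdint.h>

#define JOB_WORDS 64

/* Layout in the *consumer's* local memory. The producer writes the job
 * payload first and `ready` last; since writes between a given pair of
 * cores arrive in order, ready != 0 guarantees the payload is complete.
 * Everything is volatile so the compiler also preserves the store order. */
typedef struct {
    volatile uint32_t payload[JOB_WORDS];
    volatile uint32_t ready;   /* the "semaphore" word, written last */
} mailbox_t;

void produce(mailbox_t *remote, const uint32_t *job)
{
    for (int i = 0; i < JOB_WORDS; i++)
        remote->payload[i] = job[i];   /* plain stores onto the mesh */
    remote->ready = 1;                 /* publish: ordered after payload */
}

void consume(mailbox_t *local, uint32_t *out)
{
    while (!local->ready)
        ;                              /* spin until producer publishes */
    for (int i = 0; i < JOB_WORDS; i++)
        out[i] = local->payload[i];    /* local reads are cheap */
    local->ready = 0;                  /* hand the mailbox back */
}
```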

However, since each core can read from other cores' memory directly too, you can easily pull data instead if you are unsure about the workload split between cores (also using the per-core DMA engines if you prefer). That lets a scheduler on the main CPU - or even on one of the Epiphany cores - periodically "rebalance" tasks if any of the cores get stuck waiting, because you don't need to move any of the application data around: if a core ends up idle, just write a few KB of program code to it to have it switch to one of the tasks that's bottlenecked. (The Epiphany cores can request DMA between two external locations, so the job manager could live on one or more of the cores and just fire off DMA calls to repurpose underutilised cores for different jobs.)
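
A toy sketch of that rebalancing loop (plain C; `core_status`, `core_code`, and `rebalance` are all invented for illustration - the real eSDK has its own program loader and DMA interfaces):

```c
#include <stdint.h>
#include <string.h>

#define NUM_CORES 16

/* One status word per core, living in that core's local memory and
 * updated by the core itself (0 = idle, 1 = busy). In a real setup
 * these pointers would be initialised to mesh addresses. */
extern volatile uint32_t *core_status[NUM_CORES];

/* Base of each core's code region, reachable through the flat mesh
 * address space (or via a DMA request between two remote locations). */
extern void *core_code[NUM_CORES];

/* Copy a bottlenecked task's kernel (a few KB of code) into any idle
 * core and flag it to start - no application data has to move. */
void rebalance(const void *kernel_blob, size_t kernel_len)
{
    for (int c = 0; c < NUM_CORES; c++) {
        if (*core_status[c] == 0) {            /* core sits idle */
            memcpy(core_code[c], kernel_blob, kernel_len);
            *core_status[c] = 1;               /* kick it off */
        }
    }
}
```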

They actually have a simple job manager written that'll monitor the utilisation and throughput of each core and schedule jobs on them.

It's obviously going to be less efficient than keeping each core constantly busy through careful timing, but people don't manage that optimally even with regular CPUs or GPUs. So even if you lose some percentage shuffling jobs around, gaining the flexibility of dynamic scheduling across a potentially huge number of cores is likely well worth it for a lot of tasks.

Both the 16-core and the 64-core versions measure 15mm x 15mm with a 2-watt maximum, so you could fit dozens of them on a PCIe card - several thousand cores on a single card, interconnected in a mesh, without even giving a typical PC PSU or fans a slight workout. (Each chip has four external "network" interfaces that can be directly connected to other Epiphany chips or to other IO; they're "only" 64Gbps I believe, so memory reads/writes/DMA between cores on different chips are slower, but that's still vastly better than going over most "normal" network interconnects within the budget of ordinary mortals.)
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #5 on: November 06, 2013, 10:57:14 AM »
Quote from: nicholas;751962
Speaking of esoteric Forth implementations, this one is pretty impressive. :)

http://jupiterace.proboards.com/index.cgi?board=otherforth&action=display&thread=204


Then you should also take a look at http://www.colorforth.com. It's the site of Chuck Moore (inventor of Forth) and his ColorForth. You can boot straight into ColorForth on a PC - take a look at the supplied low level IDE driver: http://www.colorforth.com/ide.html (the small text; the big text is documentation...)

Particularly relevant since the company he's involved with now - GreenArrays - is producing another high-core-count CPU.

Theirs has *144* cores per (1cm^2) chip. Their eval board comes with two of them. They're far more minimalistic than the Epiphany - 144 bytes of RAM and 144 bytes of ROM per core. But the ColorForth linked above is their native instruction set, and they're incredibly low power.