Author Topic: Adapteva Parallella-16  (Read 3521 times)

Offline Bif

  • Full Member
  • Join Date: Aug 2009
  • Posts: 124
Re: Adapteva Parallella-16
« Reply #14 from previous page: November 05, 2013, 07:56:56 PM »
Quote from: vidarh;751898
I think trying to port a general purpose OS directly to the Epiphany architecture won't make much sense at this stage. It might be fun to do if/when they meet their goals of versions with substantially more per-core memory.


Looks like you've looked at this more deeply, thanks for all the details.

I think you have a good point about these chips being useful for special purposes. The embedded market would probably be OK with the software pain if it gave a good multiplier in performance per dollar or per watt vs. other architectures. That is the key, though: it has to offer that multiplier and stay well ahead of the curve. Cell was a pretty amazing chip for raw horsepower when it came out, touted with the same "new design doesn't have to carry baggage, so it can perform better for the money" line, but it wasn't long at all before Intel chips surpassed it again. Cell was also touted as being great for embedded use, but I don't think it saw much adoption beyond a few TVs and such. I will be curious to see how well these chips perform against others over time. I think you also have to throw GPU-type chips into the mix when looking at cost/performance for embedded devices that need a lot of horsepower (e.g. TVs).

It also sounds interesting how quickly a core can read from other CPUs' memory. However, I think reading memory from other CPUs is a pretty specialized thing that requires even more complex software to take advantage of. It means you are probably implementing an assembly-line multi-core software model, where each core takes the work from the previous core and does another set of operations on it. This was tossed around a lot with Cell early on, as it can essentially do the same thing via DMA between SPUs, but the drawbacks of trying to manage this efficiently are ridiculous: you have to ensure each task on each CPU takes about the same number of cycles to keep the CPUs optimally busy. I don't think that model got much use at all.

Anyway, I do find this all interesting as a nerd type myself; I'm just trying to relate how I think it might shake out based on past experience with stuff like this.
 

Offline Iggy

  • Hero Member
  • Join Date: Aug 2009
  • Posts: 5348
Re: Adapteva Parallella-16
« Reply #15 on: November 05, 2013, 10:17:54 PM »
@ Bif

In regard to the Cell, I still don't know what they were thinking with that one.
The design works pretty smoothly if you don't have many unexpected branches, and the floating-point performance is good (it was even greatly improved in the now-discontinued PowerXCell 8i).
But it's not an ideal design for general-purpose computing. And in its original form it required XDR memory (and any time you see a Rambus-designed idea backed by a limited number of vendors, you should run away at high speed).

Also, while IBM did a pretty good job of documenting the chip, their marketing left something to be desired.
You could contact IBM about it, but they didn't want to sell any without "qualifying" the user's design and intended use.
In other words, they expected companies to partner with them.

That might work when you are building millions of game consoles of a relatively static design, but it doesn't work so well in other more rapidly evolving consumer products.

Anyway, enough talk about dead architectures.
The real competition for ideas like the Parallella is likely to come from GPU computing (where parallelism has already been taken to the extreme).
"Not making any hard and fast rules means that the moderators can use their good judgment in moderation, and we think the results speak for themselves." - Amiga.org, terms of service

"You, got to stem the evil tide, and keep it on the the inside" - Rogers Waters

"God was never on your side" - Lemmy

Amiga! "Our appeal has become more selective"
 

Offline nicholas

Re: Adapteva Parallella-16
« Reply #16 on: November 06, 2013, 10:18:30 AM »
Quote from: vidarh;751898
They are *not* ARM CPUs.

It's their own RISC architecture. Very simple design. Slightly superscalar (it can execute one full integer instruction and one full floating-point instruction per clock cycle), with what looks like some degree of pipelining. It has 64 general-purpose 32-bit registers. The instruction set is, from what I can see, quite a bit smaller than the M68k's, but powerful enough.

They are definitely fully general-purpose CPUs. You could run a "real" (non-MMU) OS on one with enough porting effort (though the small amount of per-core memory would make that very wasteful).



I think trying to port a general purpose OS directly to the Epiphany architecture won't make much sense at this stage. It might be fun to do if/when they meet their goals of versions with substantially more per-core memory.

Something more esoteric, like a Forth-based micro-kernel, might be fun and very much doable, though.

Speaking of esoteric Forth implementations, this one is pretty impressive. :)

http://jupiterace.proboards.com/index.cgi?board=otherforth&action=display&thread=204
"The regime occupying Jerusalem must vanish from the page of time." - Imam Ayatollah Sayyed Ruhollah Khomeini
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #17 on: November 06, 2013, 10:43:50 AM »
Quote from: Bif;751911

I think you have a good point about these chips being useful for special purposes.


Yes, it's a bit unfortunate that it's been easy to get the impression that these are competing with large desktop CPUs. Maybe a descendant of them can, at some point far down the line, but getting there would take a *huge* change in how we write software: this design is *never* going to compete with large, complicated cores on per-core performance, given that the cores lack caches, aren't massively superscalar, and so on - and never will be, as that would defeat the entire design goal of having something that can scale up to a ridiculous number of cores.

Quote

 The embedded market would probably be OK with the software pain if it could give a good multiplier in performance for $ or power consumption vs. other architectures. That is the key though, it has to offer that multiplier and stay well ahead of the curve. Cell was a pretty amazing chip when it came out for raw horsepower and touted the same "new design doesn't have to carry baggage thus can perform better for the money", but it wasn't too long at all before Intel chips were surpassing it again.


I think the difference here is that Cell did not keep bumping the core count. Trying to keep up with Intel on per-core performance is a fool's game unless you happen to have a bunch of super-geniuses and a few hundred billion dollars to blow through.

Epiphany has the potential to be successful *if* they find suitable niches: workloads that are massively parallel but not well suited to GPUs (a GPU will trounce it on performance for things like vector math), or where "somewhat massively parallel" at very low power works fine. And they do *need* to keep bumping the core count rapidly, to make up for their almost guaranteed inability to increase per-core performance fast enough to compete that way.

E.g. I mentioned networking hardware - I was thinking more about that yesterday. There's a *huge* price jump from 1Gbps Ethernet, where we have stacks of old switches sitting in cupboards - cheap enough to use as door-stops - to 10Gbps Ethernet, where a decent-sized switch starts in the $1k-$2k range.

If someone were to put an Epiphany (or more..) on a PCIe card, hook one (or two...) of the 10Gbps links up to a suitable set of PCIe lanes, make one or two of the other off-chip links available externally, and build a little "Epiphany switch" out of a small mesh, you'd potentially have an amazingly fast, local, programmable interconnect, *much cheaper* than most current 10GE switches and cards. Cable length would likely be a limiting factor, but for very fast local interconnects within a rack, I'd pay good money for that even with cables down to <50cm...

E.g. the Epiphany cores could do a significant amount of TCP/IP offload and routing before anything even hit the host CPU. That would be ideal, as it's trivial to parallelise (have one core handle the low-level Ethernet work, one handle low-level packet filtering, then split responsibility for connections across the rest of the cores and distribute the packet stream between them), yet definitely not suitable for GPU-type acceleration (the connections are separate enough to need many independent instruction streams).
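Roughly, the dispatch step could look like this in C - just a sketch, not real Epiphany SDK code; packet_t, worker_enqueue, and the core counts are all invented for illustration. The point is that hashing the connection 4-tuple keeps each connection's packets on one core, in order:

#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS 14  /* e.g. 16 cores minus one for Ethernet, one for filtering */

typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    /* payload would follow in a real packet descriptor */
} packet_t;

/* FNV-1a hash over the 12 bytes of the connection 4-tuple. */
static uint32_t conn_hash(const packet_t *p)
{
    const uint8_t *b = (const uint8_t *)p;
    uint32_t h = 2166136261u;
    for (unsigned i = 0; i < 12; i++) {
        h ^= b[i];
        h *= 16777619u;
    }
    return h;
}

/* Stand-in for pushing the packet into the chosen core's local-memory
   queue; on real hardware this would be a store or DMA over the mesh. */
static void worker_enqueue(unsigned core, const packet_t *p)
{
    printf("connection %u:%u -> worker core %u\n",
           (unsigned)p->src_ip, (unsigned)p->src_port, core);
}

/* The same 4-tuple always hashes to the same worker, so each
   connection is handled by a single core and its packets stay in order. */
static void dispatch(const packet_t *p)
{
    worker_enqueue(conn_hash(p) % NUM_WORKERS, p);
}

int main(void)
{
    packet_t p = { 0x0a000001u, 0x0a000002u, 49152, 80 };
    dispatch(&p);
    return 0;
}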

Quote

It also sounds interesting how quickly a core can read from other CPUs' memory. However, I think reading memory from other CPUs is a pretty specialized thing that requires even more complex software to take advantage of. It means you are probably implementing an assembly-line multi-core software model, where each core takes the work from the previous core and does another set of operations on it. This was tossed around a lot with Cell early on, as it can essentially do the same thing via DMA between SPUs, but the drawbacks of trying to manage this efficiently are ridiculous: you have to ensure each task on each CPU takes about the same number of cycles to keep the CPUs optimally busy. I don't think that model got much use at all.


Sort of. They have a number of examples that do depend on timing, I believe. But the write order to a given core is deterministic, so if you don't want to time everything to the cycle, you can do this "transactionally": have one core process a "job", write the result into the next core's memory, and finish by writing to a location used as a semaphore, which the receiving core checks to see that the job is ready. (There's a DMA engine that can move data at 8GB/sec between cores while the CPU does other work, or you can simply write the result onto the mesh in 1- to 8-byte chunks per store instruction.) That'd be roughly the same.
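Here's a minimal sketch of that payload-then-flag handshake, assuming (as described above) that stores from one core to a given other core arrive in program order; the mailbox layout, the fixed sizes, and the idea that mbox points into the next core's local memory are all made up for illustration:

#include <stdint.h>

typedef struct {
    uint8_t           data[256];  /* job payload */
    volatile uint32_t ready;      /* semaphore: 0 = empty, 1 = job ready */
} mailbox_t;

/* Producer core: mbox would point into the *next* core's local memory. */
void send_job(mailbox_t *mbox, const uint8_t *job, unsigned len)
{
    for (unsigned i = 0; i < len && i < sizeof mbox->data; i++)
        mbox->data[i] = job[i];  /* payload first... */
    mbox->ready = 1;             /* ...flag last; deterministic write order makes this safe */
}

/* Consumer core: mbox is in its own local memory. */
void receive_job(mailbox_t *mbox, uint8_t *out, unsigned len)
{
    while (!mbox->ready)
        ;                        /* spin until the producer's flag lands */
    for (unsigned i = 0; i < len && i < sizeof mbox->data; i++)
        out[i] = mbox->data[i];
    mbox->ready = 0;             /* hand the mailbox back */
}

int main(void)
{
    static mailbox_t mbox;       /* stands in for a core's local memory */
    uint8_t job[4] = {1, 2, 3, 4}, out[4];
    send_job(&mbox, job, sizeof job);
    receive_job(&mbox, out, sizeof out);
    return 0;
}

The same pattern works whether the payload is moved by plain stores or by the DMA engine; the only thing that matters is that the flag write comes last.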

However, since each core can also read directly from other cores' memory, if you are unsure about the workload split you can just as easily pull data instead (using the per-core DMA engines if you prefer). That lets a scheduler on the main CPU - or even on one of the Epiphany cores - periodically "rebalance" tasks if any core gets stuck waiting, because you don't need to move any of the application data around: if a core ends up idle, just write a few KB of program code to it to have it switch to one of the tasks that's bottlenecked. (The Epiphany cores can request DMA between two external locations, so the job manager could live on one or more of the cores and just fire off DMA calls to repurpose underutilised cores for different jobs.)
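To make the rebalancing idea concrete, here's a rough sketch of such a scheduler loop - all the names and the status/task layout are invented, and a real version would go through the SDK's core-memory and DMA calls rather than plain pointers:

#include <stdint.h>

#define NUM_CORES 16
#define NUM_TASKS 4

typedef struct {
    volatile uint32_t idle;          /* set by a core when it runs out of work */
    volatile uint32_t assigned_task; /* written by the scheduler */
} core_ctrl_t;

/* In reality these would map onto each core's local memory over the mesh. */
static core_ctrl_t cores[NUM_CORES];
static uint32_t    backlog[NUM_TASKS];  /* pending items per task queue */

static uint32_t busiest_task(void)
{
    uint32_t best = 0;
    for (uint32_t t = 1; t < NUM_TASKS; t++)
        if (backlog[t] > backlog[best])
            best = t;
    return best;
}

/* Called periodically: point any idle core at the most backlogged task.
   No application data moves - only a task id (and, on real hardware,
   a few KB of code via DMA) goes to the idle core. */
void rebalance(void)
{
    for (unsigned c = 0; c < NUM_CORES; c++) {
        if (cores[c].idle) {
            cores[c].assigned_task = busiest_task();
            cores[c].idle = 0;
        }
    }
}

int main(void)
{
    backlog[2] = 10;     /* pretend task 2 is bottlenecked */
    cores[5].idle = 1;   /* core 5 just went idle */
    rebalance();         /* core 5 now gets pointed at task 2 */
    return 0;
}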

They have actually written a simple job manager that monitors the utilisation and throughput of each core and schedules jobs on them.

It's obviously going to be less efficient than keeping each core constantly, fully busy through careful timing, but people don't manage that optimally even with regular CPUs or GPUs. So even if you lose some percentage shuffling jobs around, you gain the flexibility of dynamic scheduling across a potentially huge number of cores, which is likely well worth it for a lot of tasks.

Both the 16-core and 64-core versions measure 15mm x 15mm with a 2-watt maximum - you could fit dozens of them on a PCIe card, for several thousand cores on a single card, without even giving a typical PC PSU or its fans a slight workout. They can be interconnected in a mesh: each chip has four external "network" interfaces that can be connected directly to other Epiphany chips or to other IO. Those links are "only" 64Gbps, I believe, so memory reads/writes/DMA between cores on different chips are slower, but that's still vastly better than having to go over most "normal" network interconnects within the budget of ordinary mortals.
 

Offline vidarh

  • Sr. Member
  • Join Date: Feb 2010
  • Posts: 409
Re: Adapteva Parallella-16
« Reply #18 on: November 06, 2013, 10:57:14 AM »
Quote from: nicholas;751962
Speaking of esoteric Forth implementations, this one is pretty impressive. :)

http://jupiterace.proboards.com/index.cgi?board=otherforth&action=display&thread=204


Then you should also take a look at http://www.colorforth.com. It's the site of Chuck Moore (inventor of Forth) and his ColorForth. You can boot straight into ColorForth on a PC - take a look at the supplied low-level IDE driver: http://www.colorforth.com/ide.html (the small text; the big text is documentation...)

Particularly relevant since the company he's involved with now - GreenArrays - is producing another high-core-count CPU.

Theirs has *144* cores per chip (1 cm^2), and their eval board comes with two of them. They're far more minimalistic than the Epiphany - 144 bytes of RAM and 144 bytes of ROM per core - but the ColorForth linked above is essentially their native instruction set, and they're incredibly low power.