I've only watched the video, but it sounds something like that, and I think I saw this many months ago. The core concept seems to be keeping each CPU working out of its own local memory instead of dealing with shared caches/memory (e.g. Cell).
Not quite. Each CPU can access the full memory of the system directly. No setting up DMA channels or other nonsense. There's a predictable, fixed per-hop penalty for accessing the local memory of cores elsewhere in the mesh.
(And each chip has a number of 10Gbps external links that you can use to hook multiple chips together, or to interface to external devices)
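For concreteness, here's a minimal sketch of what that looks like from C, assuming the Epiphany-style flat address map where each mesh node owns a 1MB slice of the 32-bit address space with its (row, col) ID in the top bits. The helper name and exact bit layout are my own illustration, not Adapteva's API - check the architecture reference for the real encoding:

    #include <stdint.h>

    /* Illustrative only: form a pointer into another core's local SRAM,
       assuming 6-bit row/column IDs packed into address bits [31:20]
       and a 20-bit (1MB) local offset. */
    static inline volatile uint32_t *
    remote_addr(uint32_t row, uint32_t col, uint32_t local_offset)
    {
        uint32_t core_id = (row << 6) | col;
        return (volatile uint32_t *)(uintptr_t)((core_id << 20) | local_offset);
    }

    void send_word(uint32_t row, uint32_t col, uint32_t off, uint32_t val)
    {
        /* An ordinary store; the mesh routes it with a fixed per-hop
           penalty. No DMA setup, no message-passing library needed. */
        *remote_addr(row, col, off) = val;
    }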
I think they are doomed to failure.
The hardware might work great, but I'd bet it won't perform much better than a typical x64 + GPU.
It's not meant to. This is a dev platform to let people experiment with the chips and programming model.
The interesting part is their roadmap, which points towards thousands of cores with up to 1MB per core on a single chip through process shrinks if/when they can afford it.
They may be advertising it as cheap, but I'm not sure it would still be cheap by the time it made it into a consumer device.
The size of each core means it is, at least in theory, viable to get per-chip manufacturing costs for the 16 and 64 core versions down to a couple of dollars. They're *tiny* compared to current-generation "normal" CPUs, and currently manufactured on relatively cheap/old processes.
The real problem is software. Who is going to run out and port their software to this?
That's the entire reason why they did the Kickstarter, got 6000 people to order boards to experiment with, and released extensive manuals and design files, ranging from the detailed layout of the board to chip design details well beyond what most manufacturers will give you.
Realistically they don't *need* lots of people to port. They need a few people here and there to port applications that those people would order crates' worth of chips to run: simulations that aren't vectorisable, really low-power sensor systems, automotive systems, etc.
Selling a few systems to hobbyists is great for buzz and PR, but where they want to be is volume shipments to OEMs.
Not only that, but this kind of architecture imposes a standing constraint: never having quite enough local memory means constantly juggling the design between fitting your working set into it and keeping work units large enough to pipeline things.
Yes, that is a challenge. But they have demos that pipeline streams of individual words between cores and still achieve performance well in excess of what the much more expensive ARM Cortex CPU can do on its own, thanks to the low-latency inter-core links. So while you won't get maximum throughput that way, they've already shown they can get *good* throughput without agonising over every single byte placement.
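As a toy illustration of that kind of word-granularity pipeline (my own sketch, not Adapteva's demo code; it reuses the hypothetical remote_addr helper from above):

    #define BUF_WORDS 256

    /* Core A streams words straight into a buffer in core B's local
       SRAM, then raises a flag word. B spins on the flag in its own
       local memory, so the polling never touches the mesh. */
    void produce(const uint32_t *src, uint32_t dst_row, uint32_t dst_col)
    {
        volatile uint32_t *buf  = remote_addr(dst_row, dst_col, 0x0000);
        volatile uint32_t *flag = remote_addr(dst_row, dst_col, 0x1000);
        for (int i = 0; i < BUF_WORDS; i++)
            buf[i] = src[i];  /* back-to-back stores pipeline in the mesh */
        *flag = 1;
    }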
You also get all sorts of code/feature design problems from the fact that you need to stay in local memory to be efficient: try to go read some global variable and you're toast.
You're not toast, though of course it helps to stay in local memory. Each core has throughput of up to 64GB/sec, and latency for a memory operation is 1.5 clock cycles per node-to-node hop.
In a 16-core system, the worst-case latency is thus 9 cycles (between opposing corners of the 4x4 mesh: 3 hops across the rows plus 3 across the columns, for 6 hops at 1.5 cycles each), or 21 cycles for a 64-core version (14 hops).
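If you want to sanity-check those numbers, it's just Manhattan distance times the per-hop cost (purely illustrative arithmetic, not a hardware API):

    /* Worst-case one-way latency, corner to corner, at 1.5 cycles/hop.
       Returns tenths of a cycle so the arithmetic stays in integers. */
    static inline unsigned worst_case_tenth_cycles(unsigned rows, unsigned cols)
    {
        unsigned hops = (rows - 1) + (cols - 1);  /* Manhattan distance */
        return hops * 15;                         /* 1.5 cycles per hop */
    }
    /* 4x4 mesh (16 cores): 6 hops  -> 90  -> 9 cycles
       8x8 mesh (64 cores): 14 hops -> 210 -> 21 cycles */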
You can keep pipelining memory operations, so "off core" writes and reads are fine as long as you can set things up to execute a number of them in sequence, and ideally allocate memory at "nearby" nodes.
A single 16-core Epiphany chip has higher peak bandwidth than a Cell, and the 64-core chips are being produced in sample quantities.
(In fact, given the cost of current high-speed network interconnects (per-port switch costs for 10Gbit Ethernet are still in the $100-$200 range, plus cards), there are obvious opportunities here for networking hardware with intelligent offloading capabilities; if you took even the current board design and hooked some of the Epiphanies' IO lines to PCIe lanes, you'd already have a product I'd buy.)
All this to make it work on this chip that is probably going to have less market share than the Amiga? Sounds like a winner.
There are likely to be more Parallellas in active use than Amigas within a month or two, given the batch currently being manufactured. But that's beside the point.
The point is not really to make them key features of end-user systems, but to get them into the hands of developers so they end up in embedded systems etc. Routers. Sensor systems. Software-defined networking. Controllers for various hardware (the RAID cards in my servers at work already have PPCs with much lower memory throughput on them; most of our hard drives have ARM SoCs controlling them - yet a big problem for SSDs today is that existing interfaces limit throughput, to the extent that some of our servers have PCIe cards with multiple RAID'ed SSDs straight on the board...).
If they prove to do well for more general computing tasks too (and Adapteva do keep churning out interesting demos, with source), then that's an awesome bonus, but they don't need that for success.
If this architecture ever does catch on, and some variant of it probably will decades from now, it will probably come from Intel or some other giant. This chip is going to just be played with by some nerds and researchers.
Maybe. But it will be fun.
And consider that Intel are not the guys to beat if your goal is volume rather than high margins per unit. In units, Intel is far down the list of CPU manufacturers. They're the Apple of the CPU world: making tons of money on premium products, but without much presence in the budget space. And all the innovation in the many-tiny-cores space is coming from startups (see e.g. GreenArrays, whose 144-core chip is even smaller than the Epiphany 16-core chip...) because they are free from "having" to try to scale up massively complex existing CPU architectures.