
Author Topic: a golden age of Amiga  (Read 14466 times)


Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #29 on: January 31, 2012, 08:44:14 PM »
Yes, a computer with some FPGAs on board that could be reconfigured on the fly through the OS could lead to some very interesting projects...

Tonight I've been looking into hardware/realtime raytracing.  Of course we all remember The Juggler.  Raytracing was what made Amiga's name!

So I'm now wondering just how many ARM coprocessors we could pack on one board... forget GPUs, maybe a stack of CPU/FPGA pairs could do all sorts of crazy things.
Signature intentionally left blank
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: a golden age of Amiga
« Reply #30 on: January 31, 2012, 09:06:24 PM »
Quote from: Mrs Beanbag;678166
ARM is the future of computing though, I'm sure of that.  x86 has to end sooner or later, it's too stupid to continue indefinitely.  Surprised it's lasted this long to be honest.


ARM is very much the present of computing, let alone the future. I wouldn't write off the x86 though. The current generation of these processors is a far cry from the clunky old components. The modern 64-bit implementations are actually quite nice and extremely high performance.
int p; // A
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #31 on: January 31, 2012, 10:23:56 PM »
Quote from: Karlos;678482
ARM is very much the present of computing, let alone the future. I wouldn't write off the x86 though. The current generation of these processors is a far cry from the clunky old components. The modern 64-bit implementations are actually quite nice and extremely high performance.

Oh yeah there's a really neat RISC core hiding behind all that microcode gubbins, shame we can't get at it directly and turn off all those redundant transistors...

Still.  Check this out: http://www.youtube.com/watch?v=oLte5f34ya8

This is the sort of trick a new Amiga ought to aim for.  Forget GPUs, massive parallelism is the way to go.  Maybe a single standard supervisor CPU with a whole load of barrel co-processors similar to the UltraSPARC T1.  The throughput of those things is incredible, given the right workloads.  They threw away such complexities as out-of-order execution in exchange for simultaneous multithreading, which all but hides cache latencies.  This strategy would be perfect for highly parallelisable workloads such as ray-tracing.

We could call them Juggler Chips!
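
Just to make the shape of that idea concrete, here is a rough sketch in C++, not any real design: a "supervisor" (just a shared tile counter here) hands out chunks of the frame, and a pool of simple workers traces them independently.  All the names (TILE, trace and so on) are made up for illustration.

// Rough sketch (not a real design): a "supervisor" hands out tiles of the
// frame and many simple workers ray trace them independently. The supervisor
// here is just a shared atomic counter; trace() is a stand-in for a real tracer.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int WIDTH = 640, HEIGHT = 480, TILE = 32;

struct Pixel { float r, g, b; };

static Pixel trace(int x, int y) {                 // placeholder for the per-pixel ray trace
    return { x / float(WIDTH), y / float(HEIGHT), 0.5f };
}

int main() {
    std::vector<Pixel> frame(WIDTH * HEIGHT);
    const int tiles_x = (WIDTH + TILE - 1) / TILE;
    const int tiles_y = (HEIGHT + TILE - 1) / TILE;
    std::atomic<int> next_tile{0};                 // the "supervisor": whoever is free takes the next tile

    auto worker = [&]() {
        for (;;) {
            const int t = next_tile.fetch_add(1);
            if (t >= tiles_x * tiles_y) return;    // no work left
            const int tx = (t % tiles_x) * TILE, ty = (t / tiles_x) * TILE;
            for (int y = ty; y < ty + TILE && y < HEIGHT; ++y)
                for (int x = tx; x < tx + TILE && x < WIDTH; ++x)
                    frame[y * WIDTH + x] = trace(x, y);   // every pixel is independent work
        }
    };

    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("rendered %dx%d with %u workers\n", WIDTH, HEIGHT, n);
}

The point of the barrel cores in this picture is only that the workers can be very simple, as long as there are a lot of them.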
Signature intentionally left blank
 

Offline orange

  • Hero Member
  • *****
  • Join Date: Dec 2003
  • Posts: 2794
Re: a golden age of Amiga
« Reply #32 on: January 31, 2012, 10:57:38 PM »
Quote from: Mrs Beanbag;678487
Oh yeah there's a really neat RISC core hiding behind all that microcode gubbins, shame we can't get at it directly and turn off all those redundant transistors...

Still.  Check this out: http://www.youtube.com/watch?v=oLte5f34ya8

This is the sort of trick a new Amiga ought to aim for.  Forget GPUs, massive parallelism is the way to go.  Maybe a single standard supervisor CPU with a whole load of barrel co-processors similar to the UltraSPARC T1.  The throughput of those things is incredible, given the right workloads.  They threw away such complexities as out-of-order execution in exchange for simultaneous multithreading, which all but hides cache latencies.  This strategy would be perfect for highly parallelisable workloads such as ray-tracing.

We could call them Juggler Chips!


um, are you sure 'Mr. Beanbag' needs ray-tracing? :)
Better sorry than worry.
 

Offline Thorham

  • Hero Member
  • *****
  • Join Date: Oct 2009
  • Posts: 1149
Re: a golden age of Amiga
« Reply #33 on: January 31, 2012, 11:15:55 PM »
While everyone is talking about 'new' hardware, I have to ask: What about the software?
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: a golden age of Amiga
« Reply #34 on: January 31, 2012, 11:16:30 PM »
Quote from: Mrs Beanbag;678487
Oh yeah there's a really neat RISC core hiding behind all that microcode gubbins, shame we can't get at it directly and turn off all those redundant transistors...

Still.  Check this out: http://www.youtube.com/watch?v=oLte5f34ya8

This is the sort of trick a new Amiga ought to aim for.  Forget GPUs, massive parallelism is the way to go.  Maybe a single standard supervisor CPU with a whole load of barrel co-processors similar to the UltraSPARC T1.  The throughput of those things is incredible, given the right workloads.  They threw away such complexities as out-of-order execution in exchange for simultaneous multithreading, which all but hides cache latencies.  This strategy would be perfect for highly parallelisable workloads such as ray-tracing.

We could call them Juggler Chips!


On the contrary, I'd say forget CPUs and focus on GPU if you like massive parallelism. My (now old news) quad core can run four threads concurrently. My (equally old) GTX275 can run 30720 of them at full pelt. Thread switching to hide latencies caused by memory access and the like is completely built into the hardware.

Full ray tracing is a tough one due to the tendency of threads to become divergent in their flow of execution but far from impossible with modest GPUs today. Then there is ray marching, which is the poor man's next best thing. And they can do that entirely realtime. In your browser, even, if you happen to have a WebGL capable one and supported hardware.
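
For anyone who hasn't met ray marching: the inner loop is tiny.  A minimal CPU-side sketch follows (the GPU versions run essentially this once per pixel), with the scene reduced to a single sphere's signed-distance function purely for illustration.

// Minimal sketch of ray marching (sphere tracing): step along the ray by the
// distance to the nearest surface until you hit something or give up.
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };
static Vec3  add(Vec3 a, Vec3 b)    { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3  scale(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float length(Vec3 a)         { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Signed distance to a unit sphere centred five units down the z axis.
static float scene_sdf(Vec3 p) { return length({ p.x, p.y, p.z - 5.0f }) - 1.0f; }

// March the ray (origin o, unit direction d); returns true and the hit distance on a hit.
static bool ray_march(Vec3 o, Vec3 d, float& t_hit) {
    float t = 0.0f;
    for (int i = 0; i < 128; ++i) {                    // step count is data dependent in practice
        const float dist = scene_sdf(add(o, scale(d, t)));
        if (dist < 1e-3f) { t_hit = t; return true; }  // close enough: call it a hit
        t += dist;                                     // safe step: cannot overshoot the nearest surface
        if (t > 100.0f) break;                         // ray escaped the scene
    }
    return false;
}

int main() {
    float t = 0.0f;
    if (ray_march({ 0, 0, 0 }, { 0, 0, 1 }, t)) std::printf("hit at t = %.3f\n", t);
    else std::printf("miss\n");
}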
int p; // A
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #35 on: February 01, 2012, 11:56:31 AM »
Quote from: Karlos;678500
Full ray tracing is a tough one due to the tendency of threads to become divergent in their flow of execution but far from impossible with modest GPUs today. Then there is ray marching, which is the poor man's next best thing. And they can do that entirely realtime. In your browser, even, if you happen to have a WebGL capable one and supported hardware.

On ray marching, or "volume ray casting" as they call it, Wikipedia states

"However, adaptive ray-casting upon the projection plane and adaptive  sampling along each individual ray do not map well to the SIMD  architecture of modern GPU; therefore, it is a common perception that  this technique is very slow and not suitable for interactive rendering.  Multi-core CPUs, however, are a perfect fit for this technique and may  benefit marvelously from an adaptive ray-casting strategy, making it  suitable for interactive ultra-high quality volumetric rendering."

http://en.wikipedia.org/wiki/Volume_ray_casting

Here Intel are doing real-time ray tracing to show off their Nehalem core:
http://www.youtube.com/watch?v=ianMNs12ITc

Obviously that is an expensive top-of-the-range CPU there (or rather, four of them).  It makes me wonder what could be done with a big bunch of ARM chips.  GPUs can be made to do this but you'd not be using them optimally.  Likewise even a general purpose chip like the Nehalem is a lot more complex than necessary.

I think to sum it up, GPUs are designed for a task too specific, while mainstream CPUs are designed for tasks too general.  I wonder if this goes some way to explain AMD's strategy with their Bulldozer chips, which seems to have confused a lot of people.
Signature intentionally left blank
 

Offline HenryCase

  • Hero Member
  • *****
  • Join Date: Oct 2007
  • Posts: 800
Re: a golden age of Amiga
« Reply #36 on: February 01, 2012, 01:34:33 PM »
@Mrs Beanbag
Quote from: Mrs Beanbag;678487
Forget GPUs, massive parallelism is the way to go.


I do agree that FPGAs represent a big opportunity to change how flexible computing architecture can be, but the line I quoted above doesn't make sense. The very reason GPGPU is a growing field is due to the massively parallel nature of modern GPUs. GPU computing and FPGA computing are not identical, but they are clearly related.

Quote from: Mrs Beanbag;678487

We could call them Juggler Chips!


I've got good news for you, your Juggler chips already exist:
http://www.eetimes.com/electronics-products/processors/4115523/Xilinx-puts-ARM-core-into-its-FPGAs
"OS5 is so fast that only Chuck Norris can use it." AeroMan
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #37 on: February 01, 2012, 01:48:31 PM »
Quote from: HenryCase;678577
@Mrs Beanbag

I do agree that FPGAs represent a big opportunity to change how flexible computing architecture can be, but the line I quoted above doesn't make sense. The very reason GPGPU is a growing field is due to the massively parallel nature of modern GPUs. GPU computing and FPGA computing are not identical, but they are clearly related.

Why are you talking about FPGAs?  I never mentioned FPGAs in that post.  I'm talking about massive parallelism built from generic CPUs.  GPUs are SIMD.  Well, some degree of SIMD is still useful for raytracing, because ray tracers do a lot of basic vector arithmetic, but not with the same amount of repetition as a GPU is designed for.
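
To illustrate the "some degree of SIMD" point, here is a small sketch using the standard SSE intrinsics: with the ray data laid out structure-of-arrays, one short-vector instruction does the same arithmetic for four rays at once.  The ray directions and plane normal are arbitrary made-up values.

// Sketch: four ray/plane-normal dot products at once with SSE, the kind of
// short-vector SIMD that helps ray tracing without needing GPU-style width.
#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

int main() {
    // Four ray directions, stored structure-of-arrays: all x's, all y's, all z's.
    alignas(16) float dx[4] = { 0.0f, 0.1f, 0.2f, 0.3f };
    alignas(16) float dy[4] = { 0.0f, 0.0f, 0.1f, 0.1f };
    alignas(16) float dz[4] = { 1.0f, 1.0f, 1.0f, 1.0f };
    const float nx = 0.0f, ny = 0.0f, nz = -1.0f;      // plane normal, same for every ray

    __m128 x = _mm_load_ps(dx), y = _mm_load_ps(dy), z = _mm_load_ps(dz);
    __m128 dots = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(x, _mm_set1_ps(nx)),
                   _mm_mul_ps(y, _mm_set1_ps(ny))),
        _mm_mul_ps(z, _mm_set1_ps(nz)));               // dot(d_i, n) for i = 0..3

    alignas(16) float out[4];
    _mm_store_ps(out, dots);
    for (int i = 0; i < 4; ++i) std::printf("ray %d: d.n = %f\n", i, out[i]);
}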

GPUs are of course massively parallel, but they are optimised for a specific sort of workload, although they are becoming more general purpose lately.

Quote
I've got good news for you, your Juggler chips already exist:
http://www.eetimes.com/electronics-products/processors/4115523/Xilinx-puts-ARM-core-into-its-FPGAs

These are not barrel processors.  They are FPGAs with an ARM core attached.  Which is also cool and useful, but not the "Juggler chip" described above.  The "Juggler chip" would follow a similar design strategy to the UltraSPARC T1, but with the ARM instruction set.
Signature intentionally left blank
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: a golden age of Amiga
« Reply #38 on: February 01, 2012, 03:18:19 PM »
Quote from: Mrs Beanbag;678562
On ray marching, or "volume ray casting" as they call it, Wikipedia states

"However, adaptive ray-casting upon the projection plane and adaptive  sampling along each individual ray do not map well to the SIMD  architecture of modern GPU; therefore, it is a common perception that  this technique is very slow and not suitable for interactive rendering.  Multi-core CPUs, however, are a perfect fit for this technique and may  benefit marvelously from an adaptive ray-casting strategy, making it  suitable for interactive ultra-high quality volumetric rendering."

http://en.wikipedia.org/wiki/Volume_ray_casting

Here Intel are doing real-time ray tracing to show off their Nehalem core:
http://www.youtube.com/watch?v=ianMNs12ITc

Obviously that is an expensive top-of-the-range CPU there (or rather, four of them).  It makes me wonder what could be done with a big bunch of ARM chips.  GPUs can be made to do this but you'd not be using them optimally.  Likewise even a general purpose chip like the Nehalem is a lot more complex than necessary.

I think to sum it up, GPUs are designed for a task too specific, while mainstream CPUs are designed for tasks too general.  I wonder if this goes some way to explain AMD's strategy with their Bulldozer chips, which seems to have confused a lot of people.


Wikipedia must be out of date there. I can assure you raymarching works fine on my GTX 275, and better still on Fermi-based GPUs, which have superior divergent conditional branch handling and cache. There are several realtime examples written entirely in GLSL for Mr.doob's web GLSL playground which run at full speed on my kit. I've tested even better CUDA-specific examples. Lastly, even SIMD does not accurately describe the operation of these GPUs. SIMD better describes SSE or AltiVec. It's a poor description for modern stream processors.
int p; // A
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #39 on: February 01, 2012, 04:15:30 PM »
Quote from: Karlos;678592
Wikipedia must be out of date there. I can assure you raymarching works fine on my GTX 275, and better still on Fermi-based GPUs, which have superior divergent conditional branch handling and cache. There are several realtime examples written entirely in GLSL for Mr.doob's web GLSL playground which run at full speed on my kit. I've tested even better CUDA-specific examples. Lastly, even SIMD does not accurately describe the operation of these GPUs. SIMD better describes SSE or AltiVec. It's a poor description for modern stream processors.

Shader engines aren't really ray tracing, impressive though they may be.  CUDA can get closer to what I'm talking about, but "General Purpose GPU" is a self-contradictory phrase!  Either these chips are general purpose or they are special purpose.  Maybe we only call them GPUs because they happen to be used for graphics.

I'm not saying it can't be done, or even done well, I'm only saying it's not optimal, because the chips are designed for something else, and to make them do it you have to work around their limitations.  In other words, if they are so good at ray tracing already, imagine if they were actually designed for ray tracing instead of rasterisation... it seems to me that the complexity of graphics these days is getting to the point where ray tracing could actually be faster!  But a mainstream CPU is also far more complex than it needs to be, having been optimised for single-threaded performance, which is the opposite of what we want.

I mean look at this:
http://www.youtube.com/watch?v=x5aXxJGefxU

100% CPU work, and "Running in an E2140 1.6GHZ", that's not a lot of CPU; it doesn't even have hyperthreading.  Now if you had 16 such cores instead of only two, each with 8-way hyperthreading instead of superscalar execution... this is where CPU and GPU would meet in the middle.  The compromises made for streaming processors no longer seem appropriate.
Signature intentionally left blank
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: a golden age of Amiga
« Reply #40 on: February 01, 2012, 05:17:40 PM »
Shader engine is an obsolete term. Modern GPUs are massively parallel stream processors that are Turing complete. You can use them to perform any inherently parallel task you like, provided you know how to code it. If you program them to ray trace, that is exactly what they do. Or you could program them to perform all-pairs n-body particle interaction, or brute-force md5 sums. They are nothing whatsoever like the fixed-function, discrete shader unit graphics chips of a few years ago, any more than a modern multicore x64 is like a 286.

Their main application is graphics processing because that is the sort of inherently parallel task they excel at, whether it is simple rasterization or complex per-pixel shading. However, you need to look at this in the abstract: it can be any algorithm operating on a set of data using thread-per-unit-data parallelism. There is no shader; the shader is merely a software construct running on a truly general-purpose (algorithmically speaking) stream processor. And it crushes CPUs for this kind of work.
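
The "thread per unit data" idea has a direct CPU-side analogue in C++17's parallel algorithms, which gives a feel for the programming model even though a real GPU spreads it across tens of thousands of hardware threads.  A toy sketch (the brightness tweak is arbitrary, just something applied once per element):

// Sketch: the "kernel" model as a plain function applied to every element of a
// data set in parallel; here a toy brightness adjustment, but it could be any
// per-element algorithm (particles, hashes, rays...).
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

struct Pixel { float r, g, b; };

int main() {
    std::vector<Pixel> image(1920 * 1080, Pixel{ 0.2f, 0.4f, 0.6f });

    // The lambda plays the role of the kernel: one invocation per element,
    // no ordering between invocations, free to branch internally.
    std::for_each(std::execution::par_unseq, image.begin(), image.end(),
                  [](Pixel& p) {
                      const float gain = (p.r + p.g + p.b > 1.0f) ? 1.1f : 1.3f;  // divergence is allowed
                      p.r *= gain; p.g *= gain; p.b *= gain;
                  });

    std::printf("first pixel: %.2f %.2f %.2f\n", image[0].r, image[0].g, image[0].b);
}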
int p; // A
 

Offline HenryCase

  • Hero Member
  • *****
  • Join Date: Oct 2007
  • Posts: 800
Re: a golden age of Amiga
« Reply #41 on: February 01, 2012, 07:00:42 PM »
Quote from: Mrs Beanbag;678578
These are not barrel processors.  They are FPGAs with an ARM core attached.  Which is also cool and useful, but not the "Juggler chip" described above.  The "Juggler chip" would follow a similar design strategy to the UltraSPARC T1, but with the ARM instruction set.

I guess I misread what you meant. Perhaps it would be best to outline in more detail what design you had in mind for the 'juggler chip', I'm interested to hear your thoughts.

In the meantime, here's another couple of links about massively parallel chips that you may be interested in following up on:
http://www.greenarraychips.com/
http://www.tilera.com/
"OS5 is so fast that only Chuck Norris can use it." AeroMan
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #42 on: February 01, 2012, 07:02:58 PM »
Well if that is the case then a modern GPU *is* a CPU, the only difference being the way it is connected to the memory.  But I still don't think that is quite the case.  As I understand it, a GPU is given a "kernel", which is a small program that is run for every piece of data that comes in on the stream.  They don't run a "full program" like a CPU does, but continually apply the same function over and over to the incoming data.  Which is very useful.  But its "Turing completeness" is limited to the bounds of the kernel; that is, you can branch and loop as much as you like within a kernel, but you can't arbitrarily call one kernel from another.

Also the data goes in one end and out the other, which is very useful if you can split your dataset up into loads of small independent chunks.  If you're doing rasterisation this is very easy, because every triangle can be done independently.  Maybe there's some cunning trick to it, but I don't know how ray tracing would work in that scheme, because you want to do blocks of pixels in parallel rather than triangles or objects, so every pipeline needs access to the complete scene structure.
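
To put that contrast in code form, a toy sketch (Triangle, Scene and the shading maths are all made up): the per-triangle kernel only ever looks at its own triangle, while the per-pixel kernel writes one pixel but has to be able to read the whole scene.

// Toy sketch of the two kernel shapes. Rasterisation-style: one invocation per
// triangle, each chunk of input independent. Ray-tracing-style: one invocation
// per pixel, but every invocation needs read access to the complete scene.
#include <cstdio>
#include <vector>

struct Triangle { float v[9]; };                      // three vertices, flattened
struct Scene    { std::vector<Triangle> triangles; }; // shared and read-only while rendering
struct Pixel    { float r, g, b; };

// Rasterisation-style kernel: looks only at its own triangle.
static Pixel shade_triangle(const Triangle& tri) {
    return { tri.v[0], tri.v[1], tri.v[2] };
}

// Ray-tracing-style kernel: the ray from pixel (x, y) could hit anything,
// so the whole scene has to be visible to every invocation.
static Pixel trace_pixel(const Scene& scene, int x, int y) {
    Pixel p{ 0, 0, 0 };
    for (const Triangle& tri : scene.triangles)
        if (tri.v[0] > float(x) && tri.v[1] > float(y))   // stand-in for a real intersection test
            p.r += 0.1f;
    return p;
}

int main() {
    Scene scene{ { Triangle{ { 4, 5, 0.2f } }, Triangle{ { 9, 9, 0.7f } } } };
    const Pixel a = shade_triangle(scene.triangles[0]);
    const Pixel b = trace_pixel(scene, 2, 3);
    std::printf("triangle shade %.1f, pixel trace %.1f\n", a.r, b.r);
}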

But theory aside, I've been putting "real time ray tracing" into Youtube and I get a lot of stuff on CPUs and GPUs, and a lot of it is very impressive, but I don't see that GPUs actually have any obvious advantage over CPUs so far.
Signature intentionally left blank
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16867
  • Country: gb
  • Thanked: 4 times
Re: a golden age of Amiga
« Reply #43 on: February 01, 2012, 07:29:33 PM »
The modern GPU is basically a very large collection of arithmetic/logic units. Think of these as very simple CPU cores where stuff like conditional branching is expensive but data processing is not. Then imagine them in clusters, each cluster running the same code but on different data. Not like a SIMD unit, but as an array of cores, able to branch independently but optimal when in step. Now imagine a set of work supervisors that oversee them, detecting when clusters are waiting for IO and able to switch the thread group they are executing for one that is ready to go. Finally, imagine these being served by multiple memory controllers on demand. That's your basic GPU today. Current GPUs can even execute multiple kernels concurrently, so if one cannot occupy all the stream units, you can run more.
int p; // A
 

Offline Mrs Beanbag

  • Sr. Member
  • ****
  • Join Date: Sep 2011
  • Posts: 455
Re: a golden age of Amiga
« Reply #44 on: February 01, 2012, 07:36:33 PM »
Quote from: HenryCase;678626
I guess I misread what you meant. Perhaps it would be best to outline in more detail what design you had in mind for the 'juggler chip', I'm interested to hear your thoughts.

Ok, just look up the UltraSPARC T1 to get what I mean.  I'll summarise.  Traditionally CPUs have been designed for single-threaded performance, by inventing such things as instruction-level parallelism (you can do several consecutive instructions at once if they don't clash), branch prediction, speculative execution, out-of-order execution etc.  All of these things require extra circuitry, of course, but it's worth it for the performance boost.  The problem is you don't get 2x performance for 2x transistors, so multiple cores came into play.  But lots of programs don't run in multiple threads, so designers still try to maximise the performance of single cores.  And those cores are still held back when one instruction depends on a previous one that hasn't finished yet, or when they're waiting on memory reads.

The UltraSPARC T1 took a more holistic approach.  Since servers always run umpteen threads at once, there's really no point in all that extra complexity to squeeze out the last drop of single-threaded performance.  So Sun ditched it all and instead made a CPU core that can switch threads on every cycle.  You only need a register file for each thread, and you rotate through them (hence the term "barrel processor").  That gets rid of a whole load of complexity and goes back to a very simple core that issues one instruction at a time, which leaves room for loads more cores on a die, and cache misses can be made to vanish into the background.  Single-thread performance is terrible, but if you can throw enough threads at it, it can keep up with CPUs that run at far faster clock speeds.  The T1 typically ran at 1.2GHz and, given the right sort of workloads, could keep pace with 3GHz Xeons.
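
A toy software model of the barrel idea (nothing like the real T1 pipeline, just the scheduling shape): several thread contexts, one instruction issued per cycle, rotating round-robin and skipping anything stalled on a pretend cache miss.

// Toy model of a barrel processor: one simple core, several hardware thread
// contexts, one instruction issued per cycle, rotating round-robin and
// skipping any thread that is stalled waiting on memory.
#include <array>
#include <cstdio>

struct HWThread {
    int pc = 0;          // stands in for the per-thread register file and program counter
    int stall = 0;       // cycles left waiting on a cache miss
    int done = 0;        // instructions retired
};

int main() {
    constexpr int THREADS = 4, PROGRAM_LEN = 8, MISS_PENALTY = 3;
    std::array<HWThread, THREADS> ctx;
    int current = 0;

    for (int cycle = 0; cycle < 64; ++cycle) {
        for (auto& t : ctx) if (t.stall > 0) --t.stall;    // outstanding memory accesses progress

        // Rotate to the next thread context that can issue this cycle.
        int tried = 0;
        while (tried < THREADS &&
               (ctx[current].stall > 0 || ctx[current].pc >= PROGRAM_LEN)) {
            current = (current + 1) % THREADS;
            ++tried;
        }
        if (tried == THREADS) continue;                    // nothing ready: an idle cycle

        HWThread& t = ctx[current];
        ++t.pc; ++t.done;                                  // "issue" one instruction, in order
        if (t.pc % 4 == 0) t.stall = MISS_PENALTY;         // pretend every 4th instruction misses cache
        current = (current + 1) % THREADS;                 // barrel: rotate on every cycle
    }

    for (int i = 0; i < THREADS; ++i)
        std::printf("thread %d retired %d instructions\n", i, ctx[i].done);
}

With four contexts and a three-cycle miss penalty, the core finds something to issue on most cycles even though each individual thread spends a fair chunk of its time waiting, which is the whole trick.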
Signature intentionally left blank