Author Topic: What is the real power of Akiko chip in cd32 ? (Read 35230 times)

Karlos · « **on:** February 21, 2010, 01:54:25 PM »

The figures I remember being quoted suggested that the Akiko could do C2P at "copy speed". In other words, it could perform the C2P operation in the same amount of time that it would take to just copy the data to Chip RAM without any operations being performed on it. Once you hit that limit, no amount of "doing it faster" on the CPU is going to help; it'll just end up waiting to write the transformed data.

Karlos · « **Reply #1 on:** February 22, 2010, 12:09:53 AM »

Quote from: Crumb;544289

@Piru

I lack knowledge about how Akiko works but wouldn't it allow the coder to write directly to chipmem groups of 8-32 pixels without relying on a chunky buffer? Wouldn't Akiko work as a small buffer that converts chunky pixels to bitplanes? Something like that perhaps would allow optimizing chipmem access more. I guess that writing to a chunky buffer and later copying to chipmem may not be as fast as writing directly to chipmem.

One problem with the Akiko C2P is that it forced you to convert 32 consecutive 8-bit pixels (8 32-bit words) into 8 32-bit planar words. That's actually fine if your application renders spans of pixels sequentially. However, a lot of the 3D engines of the day used column rendering to texture vertical surfaces. If your application needs this sort of thing, you end up having to render everything into a Fast RAM buffer and then do a C2P pass over the entire buffer. The CPU then has to read the data back from the Fast RAM buffer, feed it into the Akiko, read the result back and then shovel it into the bitplanes. All that shuffling can end up using almost as much CPU as a good software C2P routine. On faster processors, pure software C2P ends up being the most effective method.
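To make the 32-pixel unit concrete, here is a rough Python sketch of the transformation itself: 32 chunky bytes in, 8 planar longwords out. It only models the data layout, not the Akiko's register interface, and the bit ordering (leftmost pixel in the most significant bit, plane p holding bit p of each pixel) is an assumption based on the usual Amiga bitplane convention. The name `c2p_32` is made up for illustration.

```python
def c2p_32(pixels):
    """Convert 32 8-bit chunky pixels into 8 32-bit planar words.

    Assumed layout: plane p collects bit p of every pixel, and the
    leftmost pixel ends up in the most significant bit of each word.
    """
    if len(pixels) != 32:
        raise ValueError("expected exactly 32 pixels")
    planes = [0] * 8
    for x, pixel in enumerate(pixels):
        for p in range(8):
            bit = (pixel >> p) & 1
            planes[p] |= bit << (31 - x)
    return planes
```

A span renderer produces those 32 pixels in order anyway, which is why the fixed unit is harmless there; a column renderer produces one pixel per row, so it never has 32 horizontal neighbours ready without an intermediate buffer.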

Karlos · « **Reply #2 on:** February 22, 2010, 09:29:56 AM »

Quote from: stefcep2;544329

Amiga CD32 68020/14Mhz 8Mb (Optimised 020 C2P) - 18971 realtics (3.9 fps)
Amiga CD32 68020/14Mhz 8Mb (Optimised Akiko C2P) - 12872 realtics (5.8 fps)

5.8/3.9 = 149%, i.e. roughly a 50% boost in frame rate. People spend big dollars on a 25% frame rate improvement on modern graphics cards. You get a 50% increase in frame rate for free with Akiko, whereas a 50MHz '030 with 32 meg RAM at the time could have cost about $1800, with the RAM costing about half of that. Sure prices came down, but that was years after the CD32 was released.

Most people that have graphics cards aren't using them on a 68020 so your comparison isn't valid. The speed up for my 040 on an RTG card over AGA was immediately obvious.

-edit-

Oh wait, you are referring to modern cards on other systems, not RTG versus AGA. Yeah, you have a point. I currently have a GTX260, hopefully I won't get any newer card that isn't at least 50% faster.

Karlos · « **Reply #3 on:** October 30, 2011, 11:23:39 AM »

Quote from: Digiman;665755

You miss the point, big deal a 486 33mhz requiring game requires an 040 25mhz Amiga shocker.

The benchmarks show a 50% speed improvement. So if you wrote any game engine and it managed 13-15fps on A1200 sans Akiko you can make it run 22fps for free on Akiko. The issue is 95% of Amiga games had sloppy n00b coding.

I think it's you that has rather missed the point. The 50% it gave for an 020 won't scale upwards linearly with faster systems. Akiko gave a 50% speed increase on slow systems where the time it takes to do C2P per frame is comparable to the time it takes to do everything else, that is to say, systems which are too slow to run the game at all. Going from 3FPS to 5FPS doesn't at all imply a 15FPS game will go to 22FPS.

For an engine of a given complexity, as the CPU gets faster, the time it takes to do C2P on the CPU becomes a smaller and smaller portion of the overall execution time until you hit the magical copyspeed bottleneck. At that point, performing C2P takes no more processor time than simply copying data from Fast RAM to Chip RAM.
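A toy model shows why the gain doesn't carry over. Treat a frame as render time plus C2P time and see what happens when only the C2P part shrinks. All the millisecond figures here are made up for illustration: the slow case is picked so it roughly reproduces the 3.9 to 5.8 fps result quoted earlier, and the fast case assumes software C2P is already close to the copyspeed floor.

```python
def fps(render_ms, c2p_ms):
    """Frames per second when each frame costs render_ms + c2p_ms."""
    return 1000.0 / (render_ms + c2p_ms)


def relative_gain(render_ms, c2p_ms, faster_c2p_ms):
    """Fractional fps gain from cutting C2P cost to faster_c2p_ms."""
    return fps(render_ms, faster_c2p_ms) / fps(render_ms, c2p_ms) - 1.0


# Hypothetical slow system: C2P is about half of the frame time.
slow_gain = relative_gain(render_ms=130, c2p_ms=126, faster_c2p_ms=42)

# Hypothetical faster system at ~15 fps: C2P is already near copyspeed,
# so there is almost nothing left to save.
fast_gain = relative_gain(render_ms=55, c2p_ms=12, faster_c2p_ms=11)
```

With these numbers `slow_gain` comes out at about 0.49 and `fast_gain` at under 0.02, so the same hardware help that looks like 50% on the slow machine is barely measurable on the fast one.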

Some fast 030 C2P routines are already approaching copyspeed; once you start using an 040, they get there. On these systems, using an Akiko would likely result in a lower FPS count than a copyspeed software C2P routine. The reason for that is the way Akiko works. You have to write your chunky data to it, read back the planar result, then write that to Chip RAM, all using the CPU. All that copying back and forth takes longer than simply reading the source data from Fast RAM, transforming it, then writing it to Chip RAM.

At copyspeed for AGA, even an infinitely fast CPU/Fast RAM combination (where the Chip RAM write speed of ~7MB/s becomes the only rate limiting factor) would be limited to around 90fps at 320x256 for 256 colours.
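The ~90fps ceiling is just the Chip RAM write rate divided by the size of one frame. A quick sketch of the arithmetic, where the function name is made up and the ~7MB/s figure is taken from above (whether that means 7 x 10^6 or 7 x 2^20 bytes shifts the answer a little):

```python
def max_copyspeed_fps(width, height, chip_write_bytes_per_s, bits_per_pixel=8):
    """Upper bound on fps when writing the frame to Chip RAM is the only cost."""
    frame_bytes = width * height * bits_per_pixel // 8
    return chip_write_bytes_per_s / frame_bytes


binary_mb = max_copyspeed_fps(320, 256, 7 * 1024 * 1024)   # about 89.6
decimal_mb = max_copyspeed_fps(320, 256, 7_000_000)        # about 85.4
```

Either way a 320x256 8-bit frame is 81920 bytes, so the bound lands in the high 80s.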

Doom uses a fairly simple (and efficient) 2D BSP and optimised column and span texturing routines for walls and floors and a sprite routine for objects, each of which has been pretty well-tuned. For more complex game engines, the time taken to render the frame becomes dominant and so C2P time becomes even less important. Take Quake for example. Here you have a full 3D BSP and a much more general-case textured/shaded/depth-buffered polygon rendering system. All much more CPU work than Doom. On 68K, eliminating the C2P altogether (i.e. using an RTG card) only gives a comparatively small increase in fps over AGA. For 603/604 PPC systems, you see a bigger improvement because they are able to render the frame significantly faster than the 040/060 can and the whole argument about time spent rendering versus time spent C2P/copying becomes relevant again.

Karlos · « **Reply #4 on:** October 30, 2011, 04:48:49 PM »

Quote from: commodorejohn;665814

Just curious, you make it sound like Doom has separate rendering code for sprites? I guess I'd kinda figured it'd just reuse the column routine, since that's capable of transparency anyway.

It might do, but sprites are a special case as they are always "face on" and thus don't require anything beyond scaling - certainly not the recalculation of scale per column that walls do. It would seem an obvious case where an optimisation could be made.

Quote from: matthey

I'm not so sure that the 603/604 PPC on the Phase5 boards were significantly faster than the 68060

I only mentioned PPC from the perspective of Quake, not Doom, as Quake is an example of an engine where even the fastest 68K spends most of its time doing something other than C2P or copying data to the video device.

On 68060, using C2P+AGA versus an RTG board shows only a modest performance increase, since doing C2P takes only a small fraction of the overall time per frame. When you move up to PPC, going from C2P+AGA to RTG (at least when the bus is suitably faster than AGA, such as BVisionPPC/CVision/G-REX), you observe a more noticeable increase.

Still, now that you mention it, the 040 version of DoomAttack runs significantly faster on my 603e under emulation. Fast enough for 640x400 (RTG) to be playable (only the transition effects are slow):
[youtube]AQ1t5q3xmYk[/youtube]

Out of curiosity, how is it at 640x400 RTG with 68060 (I can't test as I only have 040, but it's cripplingly slow for me at that resolution)?

Karlos · « **Reply #5 on:** October 30, 2011, 06:48:16 PM »

Quote from: matthey;665832

ADoom at 640x400 RTG (Mediator with Voodoo 4 and no gfx bus overclocking) is ~11 fps using the timedemo test on my 68060@75MHz.

Mediator1200 or Mediator4000?

Quote

The actual game play feels a little faster than Quake at 20 fps (HW 3D) on the same setup. It takes VERY heavy combat situations to perceive any slow down at all. There is no slow down just walking around, turning and shooting a barrel like in your video, or was that my slow 1.86 GHz Windows machine pausing just playing the low resolution video :/. That would be a good show for a 68040 though.

It wasn't an 040, it was the 040 executable running under OS4's emulation on a 603e. The device recording the footage was a bit slow too. My old mouse was acting up; it has a tendency to move in jumps (you can see that when opening the preferences application and it overshoots). The performance of the game itself was pretty consistent.

Quote

I'm surprised Doom Attack worked at 640x400 as the docs say it doesn't support increasing the horizontal resolution.

Give it a bash, it seems to work regardless.

Karlos · « **Reply #6 on:** October 30, 2011, 08:18:29 PM »

Quote from: matthey;665837

Mediator 3000T/4000T (works through Zorro II/III slots but only 6-8 MB/s as I recall)

Hmm, I thought it was at least faster than the 1200 version.

Quote

The PPC emulation in AmigaOS 4.x is fast at least. Maybe AmigaOS 4.x should just use emulated 68k code as applets like Android OS uses Java. The faster 604 CyberStorm PPCs should be close to emulating the 68k at 68060 speed with 3x the clock speed, 4x the cache, over 2x the memory bandwidth and much more memory needed. That's progress though :/.

Emulation is a funny thing. Sometimes it's faster, sometimes it isn't, it all depends.

Karlos · « **Reply #7 on:** October 30, 2011, 08:51:39 PM »

Quote from: Piru;665842

Data cache size? Or is it rendering to framebuffer directly (in which case the writes would be going to cache inhibited memory anyway...)?

Not sure with DoomAttack, to be honest. I doubt it's rendering to the framebuffer directly, given the way Doom renders the scene a column at a time. You'd at least want to use a buffer that could hold, say, four 8-bit pixel columns so that you could then transfer them as 32-bit words over the bus. That would be 800 bytes for a 200 pixel display. Or 3200, if you wanted to have a nice 16 column wide, cache-aligned buffer.
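As a rough sketch of that idea (not how DoomAttack actually does it, which I don't know): render four columns into a small buffer, then emit one 32-bit word per row so each bus write carries four pixels. Both function names are made up for illustration, and the big-endian packing (leftmost column in the top byte) is an assumption matching 68K byte order.

```python
def column_buffer_bytes(columns, height):
    """Size in bytes of a buffer holding `columns` 8-bit pixel columns."""
    return columns * height


def pack_four_columns(cols):
    """Turn four equal-height pixel columns into one 32-bit word per row.

    Assumes big-endian packing: the leftmost column lands in the top byte.
    """
    if len(cols) != 4:
        raise ValueError("expected exactly four columns")
    height = len(cols[0])
    if any(len(c) != height for c in cols):
        raise ValueError("columns must have equal height")
    words = []
    for y in range(height):
        word = 0
        for col in cols:
            word = (word << 8) | (col[y] & 0xFF)
        words.append(word)
    return words
```

For four columns on a 200 line display, `column_buffer_bytes(4, 200)` gives the 800 bytes mentioned above, and `pack_four_columns` turns them into 200 longword writes instead of 800 byte writes.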

Karlos · « **Reply #8 on:** October 30, 2011, 09:11:42 PM »

Quote from: Piru;665842

Speaking of emulation: MorphOS JIT was significantly faster than the parent 68k from very early on. Even the slowest 603 can easily beat 060@50 in pretty much everything

I don't doubt it, at least for anything processor intensive. There's little that can be done for the rest of the system though. Slow buses, in particular.

My remark about emulation performance was intended to be a general observation, not specific to emulation of 68K under OS4. There is the expectation that whenever you emulate something - particularly via any sort of JIT mechanism - that the result is going to be faster as long as the host processor is faster. It isn't always the case, however. How is 6888x extended precision floating point performance, for example? IIRC, the PPC only supports 64-bit IEEE 754, so any accurate emulation is probably going to end up hampered trying to find interesting ways of handling the extra precision. Or, is it (as I suspect) the case that only extended precision read/write are really emulated and any extended precision operations are all carried out at 64-bit instead?

Karlos · « **Reply #9 on:** October 30, 2011, 10:17:23 PM »

Quote from: Piru;665851

You're right, for performance reasons all FPU calculations are performed by using the native FPU datatypes. This means that the FPU does give slightly different results under emulation. It's fast though.

Which is fair enough. Other than a few esoteric fractal generators, did anything use extended precision?

I know there are methods of doubling up (if you'll pardon the pun) and using two floats to represent a number with greater precision, though I can't recall the specifics at the moment.

-edit-

Yep, I remember now, it was for GPGPU stuff where real double precision may not be supported (or, it might be too slow):

http://forums.nvidia.com/index.php?showtopic=73067
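For what it's worth, the general trick looks something like the sketch below. This is the textbook error-free addition (Knuth's TwoSum) rather than necessarily the exact method in that thread. Python floats are 64-bit doubles, so here a pair of doubles stands in for the pair of 32-bit floats you'd use on a GPU; the idea is the same. The `dd_add` helper is a simplified illustration, not a full-accuracy double-double add.

```python
def two_sum(a, b):
    """Return (s, e) where s is the rounded sum a + b and e is the
    rounding error, so that a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def dd_add(x, y):
    """Add two (hi, lo) pairs, keeping the low-order bits in lo."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)
```

For example, `1.0 + 1e-20` is just `1.0` in plain double arithmetic, but `dd_add((1.0, 0.0), (1e-20, 0.0))` returns `(1.0, 1e-20)`, so the small term survives in the second component.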

Karlos · « **Reply #10 on:** October 30, 2011, 10:38:47 PM »

So, getting back on topic: Akiko is great if you need to do C2P on a CPU as weak as the one the CD32 ships with. Unfortunately, that also tends to imply that you are restricted to very simple engines since the moment you've got a 50MHz 030 expansion in there, it might be faster just to use CPU C2P...

Karlos · « **Reply #11 on:** October 31, 2011, 08:18:55 AM »

Quote from: matthey;665859

Yes. All fp values are extended to extended precision in registers and all calculations use extended precision on the 68k FPU unless specified to use less precision.

I realise that, what I meant was, are there many applications which depend on being able to use the additional precision? I know some fractal generators used "long double" types for increased precision at high magnification, but were there any others?

Karlos · « **Reply #12 on:** November 04, 2011, 09:56:10 AM »

Quote from: Digiman;666422

Simple point is nobody bought 040s to play games. 68030 was a minimal improvement MHz per MHz over 020, ergo A4000/030 25MHz was slow and a dead end for games.

Speak for yourself. I got my first 040 not long after the AB3D2 demo first appeared

Quote

My point is spot on thanks. So getting back on topic in relevance to A1200 and CD32 (ie 1992 and 1993) chunky screen gaming, and accepting the known fact that A1200 had to compete out of the box for less than half the price of a 386DX for £899 with monitor blah blah......

Sorry, but no. Perhaps if the Akiko were implemented in a somewhat better fashion it would be useful, but it wasn't.

Let's imagine we have a 020+FastRAM+Akiko.

You first have to calculate 32 pixels' worth of data in Fast RAM (no data cache on 020), then copy it as 8 longwords to the Akiko. Then you read it back again as 8 longwords, then write each of those 8 longwords to a different address (an offset into each bitplane) in your Chip RAM. Rinse and repeat until the entire buffer has been converted.

Had the Akiko been implemented in such a fashion that you fed it your 8 bitplane addresses and then simply wrote 8 longwords of chunky pixels to it at a time and it in turn wrote them to ChipRAM incrementing the address pointers as it went, then it wouldn't have sucked.
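Counting the longword transfers the CPU has to perform per 32 pixels makes the difference obvious. This is only a tally of bus operations as described in this post, not a timing model: the transfers don't all cost the same, and the "write-through" variant is the hypothetical design from the paragraph above, not real hardware.

```python
LONGWORDS_PER_BLOCK = 8  # 32 8-bit pixels = 8 32-bit longwords


def akiko_as_described():
    """CPU transfers per 32 pixels with the Akiko as it actually works."""
    return {
        "read chunky from Fast RAM": LONGWORDS_PER_BLOCK,
        "write chunky to Akiko": LONGWORDS_PER_BLOCK,
        "read planar from Akiko": LONGWORDS_PER_BLOCK,
        "write planar to Chip RAM": LONGWORDS_PER_BLOCK,
    }


def akiko_hypothetical_write_through():
    """Hypothetical Akiko that writes the planar result to Chip RAM itself."""
    return {
        "read chunky from Fast RAM": LONGWORDS_PER_BLOCK,
        "write chunky to Akiko": LONGWORDS_PER_BLOCK,
    }


def software_c2p():
    """Pure CPU conversion: read chunky, transform in registers, write planar."""
    return {
        "read chunky from Fast RAM": LONGWORDS_PER_BLOCK,
        "write planar to Chip RAM": LONGWORDS_PER_BLOCK,
    }


def total_transfers(steps):
    """Sum the longword transfers across all steps."""
    return sum(steps.values())
```

That's 32 transfers per block for the real Akiko against 16 for either the hypothetical version or a software routine, which is why a CPU fast enough to hide the bit shuffling behind the Chip RAM writes comes out ahead.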
