Author Topic: Idea... (Read 2755 times)

Karlos · « **on:** November 28, 2003, 07:57:00 PM »

In a post AOS4 / MOS world, would a grex style busboard for BlizzPPC / CSPPC with a genuine AGP controller be feasible?

Speed issues aside, the biggest argument I heard against genuine AGP capable bus boards was the lack of a suitable 680x0 friendly northbridge. Moving to PPC presumably opens up the possibility?

I realise that such a product is unlikely ever to be manufactured, but is it feasible?

Effy · « **Reply #1 on:** November 28, 2003, 08:02:19 PM »

Wasn't the key argument against using a fast gfx card on the Amiga (AGP or a fast PCI card) that the busboard just can't supply the gfx card with data fast enough? Wasn't this the story of putting a V8 in a Fiat Uno?

That V8 would then be the brandnew fast gfx card ... :-D

Karlos · « **Reply #2 on:** November 28, 2003, 08:10:10 PM »

Yes, you are correct. Remember, however, that the g-rex is the fastest PCI expansion for classic Amigas, provided you have the h/w to use it.

I was actually wondering how much of the existing bus speed limitation is down to 680x0 support etc.

For instance, some systems with 68060/grex can hit 40MB/s for CPU->VRAM. Before anybody says that's a myth, I have the test results to prove it. But it was a minority of cases - not sure what the exact configs of those machines are.

That suggests to me that it is possible to get more out of the BlizzPPC / CSPPC expansion connector than most people do. If we had logic only for the PPC side and totally ignored the 680x0, perhaps the old arguments don't apply.

lempkee · « **Reply #3 on:** November 28, 2003, 08:48:33 PM »

G-Rex is dead (discontinued 12+ months ago), and yes, it's faster than a Mediator, but not as much as 4 times the speed...

13.9 (G-Rex) vs 7.8 (Mediator) are the numbers I just got when I compared with a friend now. How his setup is I dunno... but it sounds just about right...

for BlizzPPC...

An AGP system and similar things like a SharkPPC+ model, well, that's when things start to make sense... I fear it would be just another waste of money if not...

patrik · « **Reply #4 on:** November 28, 2003, 10:24:28 PM »

@Karlos:

I got two questions:
1. Does your test program measure the speed of writing or reading or both?

2. Does the PPC have the possibility to access all addresses in the system?

(edit) The bus speed results with the Grex should be very sensitive to whether or not the CPU does its line transfers to/from PCI cards on the Grex in burst mode.

/Patrik

Karlos · « **Reply #5 on:** November 29, 2003, 12:28:20 AM »

@patrik

Well, it's a bit involved. It was designed to measure the performance of my direct-to-video-RAM pixel translations (a backend part of a system I have been designing).

However, if you run it without specifying a source pixel format, it just performs what is basically a copy operation. The code for this is loop unrolled asm, moving 16x32-bits in the main loop. It so happens that this gives a pretty good estimate of the bus speed which has proven to be the limiting factor on all systems so far, even directly connected cards like the BVision.

Later, when looking at 3D issues, I found I needed a quick test for VRAM access bandwidth (limited by the speed of the bus, basically) and it was handy to use my existing program.

As for line transfers and the like, I am not sure how likely that is. VRAM especially is usually mapped as non cacheable due to its inherently volatile nature.

The PPC can indeed address anywhere the 680x0 can. That was part of the design aim of the first PPC cards.

If, however, the 680x0 can't write at full speed because of the bus logic limitations, it's fair to assume the PPC won't be much faster. Hence my thoughts about better bus logic (northbridge or whatever) that might be available for PPC only systems.

patrik · « **Reply #6 on:** November 29, 2003, 02:10:34 AM »

@Karlos:

Yes, you are right about the caching, my bad. Accesses to the video memory of a graphics card should indeed be marked as non-cacheable.

This makes me curious about one thing then. The PCI-buscontroller of the PPC-card/Grex should implement translation from 040/060 burst transfers to PCI burst transfers, and the same the other way around, if the designers were aiming for any performance. If this is the case, shouldn't using the MOVE16 instruction (which transfers memory blocks of 16 bytes using the same burst as line transfers) to move data to/from the video RAM result in higher performance than using regular MOVE instructions?

/Patrik

Karlos · « **Reply #7 on:** November 29, 2003, 02:23:41 AM »

@patrik

Well, you could only really use move16 for the cache aligned parts of large writes.
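To make the "cache aligned parts" point concrete, here is a minimal Python sketch of how a write might be carved up. The function name `split_for_move16` is made up for illustration, and the sketch only looks at the destination address: it assumes the source happens to share the same alignment modulo 16, which a real copy routine would have to check.

```python
LINE_BYTES = 16  # MOVE16 moves one 16-byte line per instruction


def split_for_move16(dest_addr, length):
    """Split a write of `length` bytes starting at `dest_addr` into
    (head_bytes, full_lines, tail_bytes). Only the `full_lines` part is
    16-byte aligned and therefore a candidate for MOVE16 (hypothetical
    helper, destination alignment only)."""
    # Bytes needed to reach the next 16-byte boundary, capped at length.
    head = min(-dest_addr % LINE_BYTES, length)
    remaining = length - head
    lines = remaining // LINE_BYTES
    tail = remaining % LINE_BYTES
    return head, lines, tail


# 100 bytes starting 4 bytes past a line boundary.
print(split_for_move16(0x1004, 100))  # (12, 5, 8)
```

So for small or badly aligned writes most of the work still falls to ordinary moves, and only large writes would see much of any MOVE16 benefit.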

On cards like the Blizzard I'm not sure that the move16 is really much faster than 4 moves...

patrik · « **Reply #8 on:** November 29, 2003, 03:13:49 AM »

@Karlos:

Non-burst transfers to/from the PCI-bus will inevitably result in non-burst transfers on the PCI-bus. The PCI-bus multiplexes the address and data lines, so it will take several clock cycles on the PCI-bus to do a basic read or write. After doing a quick check in the PCI 2.1 specification (so I reserve myself for errors) I found that a basic read takes 4 clock cycles minimum and a basic write takes 3 clock cycles minimum. With a little calculation this gives:

Basic read maximum bandwidth:
(33*10^6 * (32 / 8)) / 4 = 33000000B/Sec which is 31.47MB/Sec

Basic write maximum bandwidth:
(33*10^6 * (32 / 8)) / 3 = 44000000B/Sec which is 41.96MB/Sec

Taking these figures and your report about 40MB/Sec with the Grex into account (assuming that the largest portion of datashuffling your test-program uses are writes, correct me if I am wrong), it seems like the Grex is able to deliver about maximum theoretical bandwidth using non-bursts transfers on the PCI bus.

Btw, the theoretical figures should be a little bit higher as the bus is clocked at 33 and a third MHz, not 33 as I used in my calculations... let's count that into the +-0.5 fault margin of 33.
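For anyone who wants to re-run the arithmetic above, here is a minimal Python sketch. The name `pci_bandwidth` and its parameters are made up for illustration. The cycle counts (4 per basic read, 3 per basic write) are the ones quoted above, and MB is taken to mean 2^20 bytes, which is what the 31.47 and 41.96 figures imply. It also prints the figures for a 33 1/3 MHz clock.

```python
MIB = 2 ** 20  # the MB/Sec figures above correspond to 2^20 bytes per MB


def pci_bandwidth(clock_hz, cycles, words=1, bus_bits=32):
    """Peak bytes per second for a transfer of `words` bus-wide words
    that occupies `cycles` PCI clock cycles (hypothetical helper)."""
    bytes_per_transfer = words * bus_bits // 8
    return clock_hz * bytes_per_transfer / cycles


for label, clock_hz in (("33 MHz", 33e6), ("33 1/3 MHz", 100e6 / 3)):
    read_bps = pci_bandwidth(clock_hz, cycles=4)   # basic (non-burst) read
    write_bps = pci_bandwidth(clock_hz, cycles=3)  # basic (non-burst) write
    print(f"{label}: read {read_bps / MIB:.2f} MB/Sec, "
          f"write {write_bps / MIB:.2f} MB/Sec")
```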

/Patrik

patrik · « **Reply #9 on:** November 29, 2003, 03:55:33 AM »

@Karlos:

It would indeed be interesting to know if the Grex supports burst transfers. The 68040 and 68060 bursts are 4 longwords long, so it would translate to a 4 longword burst on the PCI-bus. A read burst of four longwords would take 7 clock cycles minimum on the PCI-bus and a write burst of four longwords would take 6 clock cycles minimum. That would give these maximum theoretical bandwidth figures:

Read-burst (4 longwords):
(33*10^6 * (32 / 8) * 4) / 7 = 75428571.43B/Sec which is 71.93MB/Sec

Write-burst (4 longwords):
(33*10^6 * (32 / 8) * 4) / 6 = 88000000B/Sec which is 83.92MB/Sec

These figures are double or a little more than double the transfer speed using non-burst transfers. Maybe it is of no great practical use for graphics cards, but it would be very interesting to see if figures like this could be achieved using burst transfers on the Grex.
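The burst figures and their ratio to the non-burst ones can be checked with a short Python sketch. The name `burst_bandwidth` is made up for illustration. It uses the cycle counts stated above (7 for a 4-longword read burst, 6 for a 4-longword write burst, 4 and 3 for single transfers) and again treats MB as 2^20 bytes.

```python
MIB = 2 ** 20  # MB as used in the figures above


def burst_bandwidth(clock_hz, cycles, longwords=4, bytes_per_longword=4):
    """Peak bytes per second when `longwords` 32-bit words are moved in
    `cycles` PCI clock cycles (hypothetical helper)."""
    return clock_hz * longwords * bytes_per_longword / cycles


CLOCK_HZ = 33e6  # same rounded clock as in the calculations above

read_burst = burst_bandwidth(CLOCK_HZ, cycles=7)
write_burst = burst_bandwidth(CLOCK_HZ, cycles=6)
read_single = burst_bandwidth(CLOCK_HZ, cycles=4, longwords=1)
write_single = burst_bandwidth(CLOCK_HZ, cycles=3, longwords=1)

print(f"read burst:  {read_burst / MIB:.2f} MB/Sec "
      f"({read_burst / read_single:.2f}x non-burst)")
print(f"write burst: {write_burst / MIB:.2f} MB/Sec "
      f"({write_burst / write_single:.2f}x non-burst)")
```

On these numbers the write burst comes out at exactly twice the non-burst write rate, and the read burst at a bit over twice the non-burst read rate.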

Of course the graphics card has to be able to handle this too, so when testing this a low resolution, low bitdepth and low refresh rate would be very good - ideally the card shouldn't be refreshing the screen at all during the test.

/Patrik

Karlos · « **Reply #10 on:** November 29, 2003, 04:00:32 AM »

Yeah it would.

I can't see it for VRAM mind, but other areas might be cacheable.

The 40MB/s transfer was for CPU register -> VRAM (so no system memory data read). That was on a CSPPC 060 @ 66MHz.

As for the refresh thing, I render to an offscreen buffer during a locked call, then update the display afterwards.

You can never eliminate the actual display refresh (that draws the screen) but in SGRAM rated at 10ns or better, the MB/s you are getting are always bus limited, not VRAM limited.

-edit-

Anyhoo, the copy/clear/set routines are part of my Mem class and implemented in 680x0 assembler. So the 40MB/s test was for basically a 16x block of "move d0, (a0)+" type stuff.

I might have to see about testing/comparing a cache aligned block of VRAM using a move16 loop. You got me curious now...

Better still, write a dedicated test for VRAM access times only (the pixeltest has a bunch of conversion options that are irrelevant to the argument and are there for timing my conversion routines only).

patrik · « **Reply #11 on:** November 29, 2003, 05:09:05 AM »

Just realised one important thing about using MOVE16. MOVE16 loads 16Bytes (a line) from the source-address into a temporary register and then saves it at the destination-address, right?

I'll continue babbling on assuming that.

If so, testing with MOVE16 would probably give worse results than you achieved with your test program. The fastmemory bandwidth of the accelerators to which a Grex can be connected ranges from about 30MB/Sec with an 040@25MHz to maybe 70MB/Sec with an 060@50MHz (just rough estimations though). At that level of fastmem utilization the local 68040/68060 bus is fully utilized (the time not spent transferring data to/from fastmem is spent in waitstates), leaving no spare bandwidth for transferring data to/from the PCI-buscontroller without sacrificing fastmem bandwidth. Transferring data to/from the PCI-buscontroller at the same time as transferring data to/from fastmemory (as the case would be when testing with MOVE16) should effectively and approximately cut the fastmemory-bandwidth in half. This would make the accelerator cards with the fastest fastmem interfaces able to transfer about 35MB/Sec to/from the Grex using MOVE16.

To try to max out the Grex, the 68040/68060 local bus has to be free of bandwidth-hogging memory transfers, so the register to VRAM method you are using in your test program must be used! The only problem will be to use that method with bursts... what do you think about marking a part of the VRAM writethrough-cacheable using the MMU and then doing a write test à la your recipe?

(edit):

Quote

"Transferring data to/from the PCI-buscontroller at the same time as transferring data to/from fastmemory (as the case would be when testing with MOVE16) should effectively and approximately cut the fastmemory-bandwidth in half. "

This is half-true... the transfers are of course not happening at the same time; first you read from fastmemory, then you write to the PCI-buscontroller, for example. But as you can't read from fastmemory while writing to the PCI-buscontroller, the fastmemory bandwidth is approximately and effectively cut in half (the same of course applies when transferring data the other way) - so the important part about the bandwidth was at least true.
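The read-then-write serialisation described above can be put into a tiny Python model. This is only a sketch under stated assumptions: the name `serialised_copy_bandwidth` is made up, the two phases are assumed never to overlap on the local bus, and the write to the PCI-buscontroller is assumed to cost about as much local bus time as the fastmem read. The 30 and 70 MB/Sec inputs are the rough fastmem estimates from the post.

```python
def serialised_copy_bandwidth(read_mb_s, write_mb_s):
    """Throughput of a copy in which every chunk is first read and then
    written over the same bus, with no overlap between the two phases
    (hypothetical model, not a measurement)."""
    time_per_mb = 1.0 / read_mb_s + 1.0 / write_mb_s
    return 1.0 / time_per_mb


# Rough fastmem bandwidth estimates from the post, in MB/Sec.
for fastmem_mb_s in (30.0, 70.0):
    copy_mb_s = serialised_copy_bandwidth(fastmem_mb_s, fastmem_mb_s)
    print(f"fastmem {fastmem_mb_s:.0f} MB/Sec -> "
          f"MOVE16 copy about {copy_mb_s:.0f} MB/Sec")
```

With equal read and write rates the model reduces to a plain halving, which is where the roughly 35MB/Sec estimate for the fastest cards comes from.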

/Patrik

Karlos · « **Reply #12 on:** November 29, 2003, 05:29:17 AM »

I think I should point out that I don't personally have a G-Rex dude :-) These were results returned from people kind enough to run them here..

I'm sure the MMU trickery can be done from within the ASM, but I'd definitely need to get the manual out since supervisor-mode, chip-dependent coding isn't something I do much of.

As it goes, I was measuring the speed of direct writes (rather than bursts) originally for a good reason. I needed to estimate how quickly I can write data to a GPU's rasterizer engine (often a FIFO buffer). Some of these are command registers and burstmode / out of order writes would totally screw it up so were never considered in the testing strategy.
As with VRAM, except for perhaps the slowest 3D GPU used (the ViRGE), such transfers are always limited by the bus performance.

Karlos · « **Reply #13 on:** November 29, 2003, 05:31:59 AM »

@patrik

You sound far more than capable enough to write a test application yourself - you don't need me :lol:

Incidentally, have you ever tried BusSpeed?

-edit-

Incidentally it's 5am here, so if I don't make any sense, you'll know why :-)

patrik · « **Reply #14 on:** November 29, 2003, 03:58:51 PM »

As you don't have a Grex yet, I think you should check out this opportunity.

If you want to get a quick overview of how the caching areas and modes are controlled by the MMU, you should look at these sections in the User Manual for the respective processor:

For the 68040 (2.5 Pages of reading):
4.3, 4.3.1, 4.3.1.1, 4.3.1.2 and 3.1.3

For the 68060 (3.5 Pages of reading):
5.4, 5.4.1, 5.4.1.1, 5.4.1.2, and 4.1.3

I do recommend reading the 060 User Manual sections first as it is written in a better way... at least I think so.

I understand your reasons, I just got kind of goo-gaa interested in finding out what levels of performance can be crammed out of the Grex. As the Grex seems to be able to fully utilize the PCI-bus using non-burst transfers in some configurations, I am incredibly curious about how it can perform using burst transfers. There should be some use for burst transfers btw... when transferring big chunks of data from fastmemory to VRAM (textures for example maybe), burst transfers with MOVE16 should be very sensible.

Though as I said in my earlier post, MOVE16 would be crap for measuring the raw bus speed. Some MMU tricking, setting some pages of the VRAM as writethrough cacheable and using your method of reading/writing from/to these pages of VRAM, should theoretically make it possible to test the raw bus speed of the Grex.

If you mean bustest I have used it. Uhm.. was there any point you were coming to regarding this?

The bottom line anyhow is this: I need your expertise for this man!

Have a great day!

/Patrik
