Hans has been saying that it is unlikely that he will support the old Picasso96 PIP APIs. This is unfortunate, as it will lock all old applications using the API out of textured video. In that sense overlay has indeed been obsoleted for OS4.
I think that's extrapolating what he says to the worst-case outcome. I offer the following interpretation: he's not going to implement video texturing within the constraints of the PiP interface. That doesn't preclude wrapper/glue logic that exposes whatever new method/API he implements to the PiP interface.
It wouldn't even require him to do it, or for him to release his driver source. Picasso96 uses a bunch of structures filled with function pointers for driver implementations to override. Anybody familiar enough with P96 driver development could probably have a go at writing the glue once some new method exists to hook it up to; a rough sketch of the idea follows below. (All this talk of overlay has piqued my interest in implementing it for the Permedia2 Picasso96 driver, which presently has no PiP either. It too will require a video texture implementation.)
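Something like this, purely as an illustration. Every name in it is invented: VTexCreate/VTexUpdate/VTexPresent/VTexDelete stand in for whatever new video-texture API actually appears, and PipFuncs is a simplified stand-in for P96's real function-pointer table, not the actual driver interface.

    /* Hypothetical new driver API for textured video (invented names). */
    typedef void VTexHandle;

    extern VTexHandle *VTexCreate(int width, int height, unsigned long yuvFormat);
    extern void        VTexUpdate(VTexHandle *vt, const void *yuvFrame);
    extern void        VTexPresent(VTexHandle *vt, int x, int y, int w, int h);
    extern void        VTexDelete(VTexHandle *vt);

    /* Simplified stand-in for the function-pointer slots a P96 driver fills. */
    struct PipFuncs {
        void *(*CreatePip)(int w, int h, unsigned long fmt);
        void  (*UpdatePip)(void *pip, const void *frame, int x, int y, int w, int h);
        void  (*DeletePip)(void *pip);
    };

    static void *gluePipCreate(int w, int h, unsigned long fmt)
    {
        return VTexCreate(w, h, fmt);      /* back the "PiP" with a YUV texture */
    }

    static void gluePipUpdate(void *pip, const void *frame,
                              int x, int y, int w, int h)
    {
        VTexUpdate(pip, frame);            /* upload the decoded YUV frame */
        VTexPresent(pip, x, y, w, h);      /* draw it as a textured quad   */
    }

    static void gluePipDelete(void *pip)
    {
        VTexDelete(pip);
    }

    /* The old PiP entry points stay intact; only the slots behind them change. */
    static const struct PipFuncs pipGlue = {
        gluePipCreate, gluePipUpdate, gluePipDelete
    };

The point being that the application-facing API never needs to know it's talking to a texture rather than a hardware overlay.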
This also means that if, in the future, some application wishes to support fast video display on OS4, it has to have two code paths: one for the classic Picasso96 PIP API (for older graphics cards) and a second for some yet-to-be-determined new API (for newer graphics cards). Hardly an ideal solution.
Only in the worst case scenario.
It actually is considerably slower to use textured video. There is some setup involved, and you need to wait for the operation to finish (actual performance of course depends on the implementation details).
I don't expect speed to be a major issue. My G200 uses video texturing and handles 1080p just fine, and some of these later HD5xxx cards are considerably more powerful. The bottleneck is going to end up being the CPU decode, especially if DMA retrieval of texture data is thrown into the mix.
Classic overlay gives the application a frame buffer it can write to, and it will be displayed automagically without any extra calls or waiting needed; the sketch below shows the usage model. It is even possible to completely disable the OS and bang the framebuffer, and it will update on screen just fine (this is, btw, why with a Mediator setup you could route the Amiga display to a video-grabbing card, display the graphics in an overlay window, and run HW-banging games and demos).
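Roughly this shape, as a sketch. OverlayLock/OverlayUnlock are invented placeholder names, not the real P96 PiP calls; the point is that writing the buffer is all the "present" there is:

    #include <string.h>

    /* Hypothetical lock/unlock pair: lock yields a pointer into the
       overlay buffer plus its row pitch (invented names). */
    extern void *OverlayLock(void *pipWindow, long *bytesPerRow);
    extern void  OverlayUnlock(void *pipWindow);

    static void showFrame(void *pipWindow, const unsigned char *yuv,
                          int width, int height)
    {
        long pitch;
        unsigned char *dst = OverlayLock(pipWindow, &pitch);

        /* Copy one 16bpp YUV line at a time; the overlay hardware scans
           the buffer out continuously, so no draw call or wait follows. */
        for (int y = 0; y < height; y++)
            memcpy(dst + y * pitch, yuv + (long)y * width * 2, width * 2);

        OverlayUnlock(pipWindow);
    }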
This is a distinct advantage for such machines, but I doubt that, other than on actual classic Amiga machines, there'll be much hardware banging going on.
If there's a possibility to have both classic overlay and textured video, overlay would be the better choice for the low-end systems.
That rather depends. Your assertion assumes that video texturing will add significant latency to the operation due to having to wait for texture mapping and so on, but texture fill rate is already measured in gigatexels/second.
If my experience messing around with decoding on my Linux system is anything to go by, it will probably take longer to transfer a single 1080p frame of YUV texels from system memory to the card (even with DMA) than it will for the GPU to fill the framebuffer with the RGB output. The reason is obvious here:
karlos@Megaburken-II:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
[bandwidthTest] starting...
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 275
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 2438.4

Device to Host Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 1961.3

Device to Device Bandwidth, 1 Device(s)
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 105118.2
[bandwidthTest] test results...
PASSED
The fastest host->device transfer, using DMA, achieves a modest 2438.4 MB/s over a PCIe x16 connection. A 16-bit YUV 1080p frame occupies ~3.955 MB (1920 x 1080 x 2 bytes), so you'd expect to be able to transfer it in around 1.62 ms. (None of these figures get particularly lower if I use a slower CPU, simply because everything in this test is directed by the GPU.)
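For the sake of showing the working, here's the same sum in C; nothing in it but the numbers already quoted above:

    #include <stdio.h>

    int main(void)
    {
        /* Figures straight from the bandwidthTest run above. */
        const double host_to_dev_mbs = 2438.4;            /* host->device, DMA */
        const double frame_mb = 1920.0 * 1080.0 * 2.0     /* 16bpp packed YUV  */
                              / (1024.0 * 1024.0);        /* ~3.955 MB         */

        printf("frame: %.3f MB, upload: %.2f ms\n",
               frame_mb, frame_mb / host_to_dev_mbs * 1000.0);
        return 0;                                         /* -> ~1.62 ms       */
    }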
Once on the card, even high-quality floating-point conversion from YUV to RGB using, for example, CUDA is entirely memory-IO limited, let alone doing it with direct hardware support for YUV texture formats and simply blasting out a textured quad to the framebuffer. You now need to read 3.955 MB and write 7.91 MB (the 32-bit RGB output frame) entirely on the device (also note that writing VRAM is generally a bit faster than reading, even for the GPU), and you have 105 GB/s of copy bandwidth to play with. All things being equal, the bottleneck is not on the card here; it's down to how quickly the rest of the system can get the next YUV frame ready. And this is on a card several years old now.
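And the equivalent back-of-envelope for the on-card pass, again using only the quoted numbers; the one assumption is that the conversion is purely bandwidth-bound:

    #include <stdio.h>

    int main(void)
    {
        const double dev_to_dev_mbs = 105118.2; /* device-device, from above */
        const double yuv_mb = 3.955;            /* 16bpp YUV frame read      */
        const double rgb_mb = 7.910;            /* 32bpp RGB frame written   */

        /* Lower bound: the conversion must at least stream the source in
           and the destination out through VRAM. */
        printf("conversion floor: ~%.3f ms/frame\n",
               (yuv_mb + rgb_mb) / dev_to_dev_mbs * 1000.0);
        return 0;                               /* -> ~0.113 ms              */
    }

So even this crude floor puts the on-card work more than an order of magnitude below the ~1.62 ms upload cost.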
Moving away from this machine: even in the lousiest scenario, where the PPC has to transfer the frame data to VRAM itself because for some reason DMA transfer over the PCIe bus wasn't possible, the bottleneck is still not going to be on the graphics card side. If it is anywhere, it will be in the PIO transfer of data to VRAM for a faster CPU, or still in the decode phase for a slower one.
This also means that it is likely a better idea to use a graphics card with true overlay in such low-end systems, instead of a card that doesn't have classic overlay.
That's certainly true at the moment.
Textured video has the benefit of being part of the actual display frame buffer, though, which means you can perform other effects on it (such as transparency) and, for example, take a screenshot of the video.
Not to mention postprocessing (deinterlacing, denoising, deblocking, etc.), if you have the necessary shader infrastructure for it.