Hans has been saying that it is unlikely that he will support the old Picasso96 PIP APIs. This is unfortunate, as it will lock all old applications using the API out of textured video. In that sense overlay has indeed been obsoleted for OS4.
I think that's extrapolating what he says to the worst-case outcome. I offer the following interpretation: he's not going to implement video texturing within the constraints of the PiP interface. That doesn't preclude wrapper/glue logic that exposes whatever new method/API he implements to the PiP interface.
It wouldn't even require him to do it, or for him to release his driver source. Picasso96 uses a bunch of structures filled with function pointers for driver implementations to override. Anybody familiar enough with P96 driver development could probably have a go at writing the glue once some new method exists to hook it up to; a rough sketch of the idea follows below. (All this talk of overlay has piqued my interest in implementing it for the Permedia2 Picasso96 driver, which presently has no PiP either. It too will require a video texture implementation.)
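Something like this, purely as an illustration. Every name in it is invented: VTexCreate/VTexUpdate/VTexPresent/VTexDelete stand in for whatever new video-texture API actually appears, and PipFuncs is a simplified stand-in for P96's real function-pointer table, not the actual driver interface.

    /* Hypothetical new driver API for textured video (invented names). */
    typedef void VTexHandle;

    extern VTexHandle *VTexCreate(int width, int height, unsigned long yuvFormat);
    extern void        VTexUpdate(VTexHandle *vt, const void *yuvFrame);
    extern void        VTexPresent(VTexHandle *vt, int x, int y, int w, int h);
    extern void        VTexDelete(VTexHandle *vt);

    /* Simplified stand-in for the function-pointer slots a P96 driver fills. */
    struct PipFuncs {
        void *(*CreatePip)(int w, int h, unsigned long fmt);
        void  (*UpdatePip)(void *pip, const void *frame, int x, int y, int w, int h);
        void  (*DeletePip)(void *pip);
    };

    static void *gluePipCreate(int w, int h, unsigned long fmt)
    {
        return VTexCreate(w, h, fmt);      /* back the "PiP" with a YUV texture */
    }

    static void gluePipUpdate(void *pip, const void *frame,
                              int x, int y, int w, int h)
    {
        VTexUpdate(pip, frame);            /* upload the decoded YUV frame */
        VTexPresent(pip, x, y, w, h);      /* draw it as a textured quad   */
    }

    static void gluePipDelete(void *pip)
    {
        VTexDelete(pip);
    }

    /* The old PiP entry points stay intact; only the slots behind them change. */
    static const struct PipFuncs pipGlue = {
        gluePipCreate, gluePipUpdate, gluePipDelete
    };

The point being that the application-facing API never needs to know it's talking to a texture rather than a hardware overlay.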
This also means that if, in the future, some application wishes to support fast video display on OS4, it has to have two code paths: one for the classic Picasso96 PIP API (for older graphics cards) and a second for some yet-to-be-determined new API (for newer graphics cards). Hardly an ideal solution.
Only in the worst case scenario.
It actually is considerably slower to use textured video. There is some setup involved, and you need to wait for the operation to finish (actual performance of course depends on the implementation details).
I don't expect speed to be a major issue. My G200 uses video texturing and handles 1080p just fine, and some of these later HD5xxx cards are considerably more powerful. The bottleneck is going to end up being the CPU decode, especially if DMA retrieval of texture data is thrown into the mix.
Classic overlay gives the application a frame buffer it can write to, and it will be displayed automagically without any extra calls or waiting needed; the sketch below shows the usage model. It is even possible to completely disable the OS and bang the framebuffer, and it will update on screen just fine (this is, btw, why with a Mediator setup you could route the Amiga display to a video-grabbing card, display the graphics in an overlay window, and run HW-banging games and demos).
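Roughly this shape, as a sketch. OverlayLock/OverlayUnlock are invented placeholder names, not the real P96 PiP calls; the point is that writing the buffer is all the "present" there is:

    #include <string.h>

    /* Hypothetical lock/unlock pair: lock yields a pointer into the
       overlay buffer plus its row pitch (invented names). */
    extern void *OverlayLock(void *pipWindow, long *bytesPerRow);
    extern void  OverlayUnlock(void *pipWindow);

    static void showFrame(void *pipWindow, const unsigned char *yuv,
                          int width, int height)
    {
        long pitch;
        unsigned char *dst = OverlayLock(pipWindow, &pitch);

        /* Copy one 16bpp YUV line at a time; the overlay hardware scans
           the buffer out continuously, so no draw call or wait follows. */
        for (int y = 0; y < height; y++)
            memcpy(dst + y * pitch, yuv + (long)y * width * 2, width * 2);

        OverlayUnlock(pipWindow);
    }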
This is a distinct advantage for such machines, but I doubt that, other than on actual classic Amiga machines, there'll be much hardware banging going on.
If there's a possibility to have both classic overlay and textured video, overlay would be the better choice for the low-end systems.
That rather depends. Your assertion assumes that video texturing will add significant latency to the operation due to having to wait for texture mapping and so on, but texture fill rate is already measured in gigatexels/second.
If my experience messing around with decoding on my Linux system is anything to go by, it will probably take longer to transfer a single 1080p frame of YUV texels from system memory to the card (even with DMA) than it will for the GPU to fill the framebuffer with the RGB output. The reason is obvious here:
karlos@Megaburken-II:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
[bandwidthTest] starting...
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 275
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 2438.4

Device to Host Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 1961.3

Device to Device Bandwidth, 1 Device(s)
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 105118.2
[bandwidthTest] test results...
PASSED
The fastest host->device transfer, using DMA, achieves a modest 2438.4 MB/s over a PCIe x16 connection. A 16-bit YUV 1080p frame occupies ~3.955 MB (1920 x 1080 x 2 bytes), so you'd expect to be able to transfer it in around 1.62 ms. (None of these figures get particularly lower if I use a slower CPU, simply because everything in this test is directed by the GPU.)
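For the sake of showing the working, here's the same sum in C; nothing in it but the numbers already quoted above:

    #include <stdio.h>

    int main(void)
    {
        /* Figures straight from the bandwidthTest run above. */
        const double host_to_dev_mbs = 2438.4;            /* host->device, DMA */
        const double frame_mb = 1920.0 * 1080.0 * 2.0     /* 16bpp packed YUV  */
                              / (1024.0 * 1024.0);        /* ~3.955 MB         */

        printf("frame: %.3f MB, upload: %.2f ms\n",
               frame_mb, frame_mb / host_to_dev_mbs * 1000.0);
        return 0;                                         /* -> ~1.62 ms       */
    }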
Once on the card, even high-quality floating-point conversion from YUV to RGB using, for example, CUDA is entirely memory-IO limited, let alone doing it with direct hardware support for YUV texture formats and simply blasting out a textured quad to the framebuffer. You now need to read 3.955 MB and write 7.91 MB (the 32-bit RGB output frame) entirely on the device (also note that writing VRAM is generally a bit faster than reading, even for the GPU), and you have 105 GB/s of copy bandwidth to play with. All things being equal, the bottleneck is not on the card here; it's down to how quickly the rest of the system can get the next YUV frame ready. And this is on a card several years old now.
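And the equivalent back-of-envelope for the on-card pass, again using only the quoted numbers; the one assumption is that the conversion is purely bandwidth-bound:

    #include <stdio.h>

    int main(void)
    {
        const double dev_to_dev_mbs = 105118.2; /* device-device, from above */
        const double yuv_mb = 3.955;            /* 16bpp YUV frame read      */
        const double rgb_mb = 7.910;            /* 32bpp RGB frame written   */

        /* Lower bound: the conversion must at least stream the source in
           and the destination out through VRAM. */
        printf("conversion floor: ~%.3f ms/frame\n",
               (yuv_mb + rgb_mb) / dev_to_dev_mbs * 1000.0);
        return 0;                               /* -> ~0.113 ms              */
    }

So even this crude floor puts the on-card work more than an order of magnitude below the ~1.62 ms upload cost.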
Moving away from this machine: even in the lousiest scenario, where the PPC has to transfer the frame data to VRAM itself because for some reason DMA transfer over the PCIe bus wasn't possible, the bottleneck is still not going to be on the graphics card side. If it is anywhere, it will be in the PIO transfer of data to VRAM for a faster CPU, or still in the decode phase for a slower one.
This also means that it is likely a better idea to use a graphics card with true overlay in such low-end systems, instead of a card that doesn't have classic overlay.
That's certainly true at the moment.
Textured video has the benefit of being part of the actual display frame buffer, though, which means you can perform other effects on it (such as transparency) and, for example, take a screenshot of the video.
Not to mention postprocessing (deinterlacing, denoising, deblocking, etc.), if you have the necessary shader infrastructure for it.