All I am stating regarding hardware standards is that they should be based on I/O ports and memory maps, as VGA/EGA/CGA were, rather than on API calls. That way you don't have to rely on any drivers or API calls, although those can also be present in a system.
>From the above statement, clearly not. I could program a "palette change" for my GPU that sets every colour register in parallel in a couple of shader clock cycles.
Now call an API to do that which works on the majority of PCs, and see how well it performs compared to an Amiga swapping two palette registers, or to a standard VGA swapping color registers. You can't just target your own machine, since we are talking about making it work in general; that's why I am talking about hardware compatibility to begin with.
1) Not just my machine: any CUDA 1.0+ capable machine. That's every G80 upwards. There are more of those installed in machines today than there are Amigas. Hence my code will run on more machines than yours.
2) It won't run on the rest. Well, there's a pity. Perhaps this is why API calls exist in the first place, eh?
>However, palette changes are a thing of the past for modern hardware. I haven't used an indexed colour mode for more than a few hours (usually when retrogaming) in almost 10 years, even on the Amiga.
That's subjective.
No it isn't, it's FACT. The only time I have used indexed colour modes is on a physical AGA machine when playing old games. The rest of the time I run 32-bit truecolour displays or, at worst, 16-bit high-colour ones.
But regardless, I was giving an example where I/O accesses are better than API calls, and Amiga I/O accesses aren't that slow.
Compared to the speed of modern graphics memory it is ACHINGLY SLOW. Why don't you time how many palette registers you can update in 1 second? Doing it from either the copper or the CPU, you'll hit the limit of the bus speed. However you decide to divide the workload between the copper and the CPU, you are going to be restricted by the write bandwidth, which is at best 7-8 MB/s for AGA-class hardware and even less for OCS/ECS.
That particular limitation is mitigated in current hardware, where memory bandwidth is in the GiB/s range. So, in the time it takes you to set just one palette register with an I/O instruction, a modern G200-class GPU could have set all of them dozens of times over, and that's before you even parallelise the code.
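To put rough numbers on that (assuming a 256-entry palette at 32 bits per entry, i.e. 1 KiB of writes): at 7 MB/s the whole block costs about 1024 / 7,000,000 s, roughly 150 microseconds of bus time, whereas at a conservative 10 GiB/s the same 1 KiB takes on the order of 0.1 microseconds. That's well over a thousandfold difference before any parallelism even enters the picture.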
Assuming you want to set 256 32-bit registers, you can do so by having each thread move a vector of four 32-bit integers to the target location. CUDA executes parallel threads in "warps" of 32 threads. In the absence of any conditional branching, every thread in a warp runs the same instruction concurrently as its neighbours.
So, you'd need a GPU kernel that you invoke as a pair of concurrent warps (giving 64 threads in total), where each thread sets a block of 4 registers using a vector of 4 ints, ensuring coalesced memory access.
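A minimal sketch of what such a kernel might look like, assuming the palette simply lives in an ordinary device buffer; setPalette, d_palette and d_src are illustrative names, not any real driver interface:

    #include <cuda_runtime.h>

    // Each of the 64 threads (two warps) writes one int4, i.e. four
    // 32-bit palette entries, as a single coalesced 16-byte store.
    __global__ void setPalette(int4 *palette, const int4 *src)
    {
        int i = threadIdx.x;      // 0..63
        palette[i] = src[i];
    }

    int main()
    {
        int4 *d_palette, *d_src;
        cudaMalloc(&d_palette, 256 * sizeof(int));  // 256 entries = 64 int4s
        cudaMalloc(&d_src, 256 * sizeof(int));
        // ... fill d_src with the new palette values ...
        setPalette<<<1, 64>>>(d_palette, d_src);    // one block of 64 threads = 2 warps
        cudaDeviceSynchronize();
        cudaFree(d_src);
        cudaFree(d_palette);
        return 0;
    }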
64 threads, each setting 4 registers, all running in parallel. On my GPU, which will happily run 24576 parallel threads at any one instant, that's only 64/24576 = 0.26% utilisation.
In theory, that's pretty damn fast, especially at 1.3 GHz (the shader clock of said GPU).
In practice, it would take much longer to set up than it does to execute. That dominates the timing, so much so that I'd almost guarantee that a basic API call on the CPU that just does 256 successive 32-bit writes to the same address space would be faster in "real" time and would work on any graphics card.
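For comparison, here is roughly what the "plain API call" alternative amounts to: one 1 KiB transfer and no custom kernel. Again a hedged sketch rather than how any particular driver actually implements palette updates; d_palette is just an illustrative device-side buffer:

    #include <cuda_runtime.h>

    int main()
    {
        unsigned int newPalette[256] = {0};   // host-side palette, filled elsewhere
        unsigned int *d_palette;              // illustrative device-side palette buffer
        cudaMalloc(&d_palette, sizeof(newPalette));
        // One ordinary API call performs all 256 successive 32-bit writes;
        // there is no kernel launch or per-thread setup to pay for.
        cudaMemcpy(d_palette, newPalette, sizeof(newPalette), cudaMemcpyHostToDevice);
        cudaFree(d_palette);
        return 0;
    }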
So, for all that I can actually set the registers faster by "hitting the metal", the end result is less portable and most likely slower than using the API, since the job at hand is so easy to do anyway that it ceases to benefit from HW acceleration.