
Author Topic: question about DMA  (Read 4841 times)


Offline billt (Topic starter)

  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 910
    • http://www.billtoner.net
question about DMA
« on: September 09, 2011, 04:17:56 PM »
Hi all, I'm taking a graduate class in microprocessor systems, and last night's discussion included DMA. In particular, it covered something about DMA that seems different from what I've always assumed DMA to be or do. We all know that DMA lets the peripheral (a serial port, a SCSI drive, whatever) talk directly to memory, so the processor is not involved in that data transfer.

Now, I've always assumed that this means the processor can go off and do something else productive.

But last night's discussion defined DMA as an operation that does a data transfer faster than the processor can, because the processor needs to read from the source into a register and then write it back to the destination, increment a transfer-size counter, and do a compare to know if it's finished before moving on to the next unit. The DMA controller can do the counter and compare in hardware pretty much instantaneously on each transfer unit, compared to the processor needing to use instructions, and thus instruction time, to do the same thing. And this discussion said that while the DMA controller is doing this, the CPU is idle, waiting for DMA to complete. The processor is not off doing something productive at all. This is partly because they would at least be colliding with each other on bus usage, as apparently things are expected to be on the same bus.
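To make that per-unit overhead concrete, here's a toy cycle-count model. The per-step costs are made-up round numbers for illustration, not timings for any real CPU:

```python
# Illustrative cycle-count model of a programmed-I/O copy loop vs. DMA.
# All per-step costs below are invented round numbers, not real timings.

def pio_cycles(n_words, read=4, write=4, inc=1, cmp_branch=2):
    """CPU copy loop: each word costs a read, a write, a counter
    increment, and a compare/branch, all done with instructions."""
    return n_words * (read + write + inc + cmp_branch)

def dma_cycles(n_words, per_word=4, setup=20):
    """DMA: the counter and compare happen in hardware alongside the
    transfer, so each word costs only its bus cycle, plus one-time setup."""
    return setup + n_words * per_word

print(pio_cycles(1024))  # 11264
print(dma_cycles(1024))  # 4116
```

Even with these arbitrary numbers, the hardware counter/compare is what makes the DMA path cheaper per word.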

OK, thinking in terms of Amiga Classics, we have the split between chip and fast memory. Can the chipset do DMA in chip mem while the CPU does stuff in fast mem, all at the same time? Or do they need to interleave and share the bus, so no one is really going at full speed on any bus?

Also, thinking about the ARM chips I'm involved with at work, I'm not sure you would consider that to be a single bus. The AHB goes through a "matrix", and things go through huge multiplexors, leading me to believe that any AHB master can get to a different slave. So one master takes a different route through the matrix to its slave than another master does, and thus they are not competing for the same bus, and thus each master can go full speed.

It also seems like the processor idling during DMA could be a waste, in that I know our ARM has DMA-capable serial ports, which seem rather slow to me. It seems that even if all things were on the same bus, the CPU could be doing useful stuff on the bus in between serial-port DMA cycles.
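The point about a slow serial port leaving bus cycles free can be put in numbers with a toy timeline; the rates here are arbitrary assumptions, chosen only to show the ratio:

```python
# Toy bus timeline: a slow DMA device requests the bus only once every
# `dma_period` cycles; every other cycle is available to the CPU.
# The figures are illustrative, not taken from any real UART or bus.

def free_cpu_cycles(total_cycles, dma_period):
    dma = total_cycles // dma_period   # cycles consumed by the DMA device
    return total_cycles - dma          # cycles left over for the CPU

# e.g. a UART that needs one bus cycle per 1000 bus-clock cycles:
print(free_cpu_cycles(1_000_000, 1000))  # 999000
```

With a device that slow, locking the CPU off the bus for the whole transfer would waste nearly all of the available cycles.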

I just was surprised about some of these details, and wanted to see some discussion from Amiga gurus that know how this stuff works elsewhere.
Bill T
All Glory to the Hypnotoad!
 

Offline SpeedGeek

Re: question about DMA
« Reply #1 on: September 09, 2011, 05:02:01 PM »
Well, you pretty much have it explained. Yes, the Amiga was designed to allow chip bus sharing via interleaved memory cycles with the primary DMA controller (Agnus or Alice). However, the CPU has low priority on the chip bus. It all depends on the screen mode and overscan usage, which determine how much DMA bandwidth the chipset needs.
 
This is completely different from SCSI DMA controllers, which master the bus and totally prevent CPU access until their transfer is complete. The 68040 and 68060 were designed to function as low-priority bus masters and can still operate from their internal caches during DMA. However, the 68030 and earlier do nothing except wait for the DMA controller to release the bus.
 
The only issue you have not addressed is bus arbitration: the basic protocol by which the CPU and alternate bus masters negotiate ownership of the bus.
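On the 68000 family that handshake uses three signals (BR, BG, BGACK). A rough trace of one hand-off, much simplified from the real signal timing, looks like this:

```python
# Minimal sketch of 68000-style bus arbitration: a requester asserts
# Bus Request (BR), the CPU answers with Bus Grant (BG), and the
# requester holds Bus Grant Acknowledge (BGACK) while it owns the bus.
# This is a simplification; the real protocol has more timing states.

def arbitration_trace():
    trace = []
    trace.append("DMA: assert BR")       # device asks for the bus
    trace.append("CPU: assert BG")       # CPU finishes its cycle, grants
    trace.append("DMA: assert BGACK")    # device takes ownership
    trace.append("DMA: transfer words")  # CPU is off the bus meanwhile
    trace.append("DMA: release BGACK")   # ownership returns to the CPU
    return trace

for step in arbitration_trace():
    print(step)
```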
« Last Edit: September 09, 2011, 05:04:18 PM by SpeedGeek »
 

Offline itix

  • Hero Member
  • *****
  • Join Date: Oct 2002
  • Posts: 2380
Re: question about DMA
« Reply #2 on: September 09, 2011, 05:38:11 PM »
Quote from: billt;658651
And this discussion said that while the DMA controller is doing this, the CPU is idle, waiting for DMA to complete. The processor is not off doing something productive at all. This is partly because they would at least be colliding with each other on bus usage, as apparently things are expected to be on the same bus.


They will be competing for the bandwidth, but thanks to multitasking the CPU can still do other work at the same time.
My Amigas: A500, Mac Mini and PowerBook
 

Offline Zac67

  • Hero Member
  • *****
  • Join Date: Nov 2004
  • Posts: 2890
Re: question about DMA
« Reply #3 on: September 09, 2011, 06:02:14 PM »
Quote from: billt;658651
Now, I've always assumed that this means the processor can go off and do something else productive.
In general, yes, it can. It depends on the load the DMA operation puts on the bus(es).
Quote
But last night's discussion defined DMA as an operation that does a data transfer faster than the processor can, because the processor needs to read from the source into a register and then write it back to the destination, increment a transfer-size counter, and do a compare to know if it's finished before moving on to the next unit. The DMA controller can do the counter and compare in hardware pretty much instantaneously on each transfer unit, compared to the processor needing to use instructions, and thus instruction time, to do the same thing. And this discussion said that while the DMA controller is doing this, the CPU is idle, waiting for DMA to complete.
Not necessarily. If there's a bottleneck that gets saturated by DMA, other stuff slows to a halt. A standard PCI bus saturated by gigabit Ethernet can do nothing else, e.g. serve I/O requests for a HDD. Modern systems increasingly avoid bottlenecks; PCI Express is no longer a physical shared bus, for instance. On older Pentium III-age systems you could easily saturate the memory interface with fast I/O for SCSI or Ethernet, but a modern system can have dual-channel 1333 MHz RAM (or faster), a theoretical throughput of 16 bytes x 1333 MHz ~= 21 GB/s. Extremely heavy-duty hardware aside, there's nothing on the planet able to saturate that interface. If the PCIe subsystem is properly crossbar-switched, you can't saturate that either.
So, any DMA operation just competes for memory time slots. If there's plenty, the CPU will have to yield the bus only for extremely short times - depending on what's running that may not even be noticeable. OTOH, today's CPUs are so fast that they're practically waiting for the slow RAM all the time... ;)
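The back-of-envelope figure above is just width times transfer rate; two 64-bit channels move 16 bytes per transfer:

```python
# Dual-channel DDR3-1333 back-of-envelope, as in the post:
# two 64-bit channels = 16 bytes per transfer, at 1333 MT/s.
bytes_per_transfer = 16
transfers_per_sec = 1333e6

bandwidth = bytes_per_transfer * transfers_per_sec
print(round(bandwidth / 1e9, 1))  # 21.3 (GB/s, theoretical peak)
```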

Anyway, DMA is much more efficient since the data has to go through the bus only once. For PIO, the CPU needs to read the data from the device and write it to memory, the memory load is doubled. Additionally, status registers need to be polled, increasing the load even more. Furthermore, the CPU (or one core/thread) is busy handling the I/O.
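Counting bus transactions makes the difference visible; the one-poll-per-word rate here is an illustrative assumption:

```python
# Bus transactions per transferred word, as described above.
# PIO: the CPU reads from the device and writes to memory (two data
# moves per word), plus status-register polls (assumed one per word).
# DMA: the data crosses the bus once, device to memory.

def pio_bus_ops(n_words, polls_per_word=1):
    return n_words * 2 + n_words * polls_per_word

def dma_bus_ops(n_words):
    return n_words

print(pio_bus_ops(100))  # 300
print(dma_bus_ops(100))  # 100
```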

Quote
This is partly because they would at least be colliding with each other on bus usage, as apparently things are expected to be on the same bus.
See the cache as a separate bus that's running independently.
Quote
OK, thinking in terms of Amiga Classics, we have the split between chip and fast memory. Can the chipset do DMA in chip mem while the CPU does stuff in fast mem, all at the same time?
Yes! That's the plan. Chip DMA competes for chip bus time alone. The CPU competes for fast bus time with the (mostly Zorro) DMA devices present. For chip RAM access it needs to compete for both(!) simultaneously.
Quote
Or do they need to interleave and share the bus, so no one is really going at full speed on any bus?
No.
Quote
Also, thinking about the ARM chips I'm involved with at work, I'm not sure you would consider that to be a single bus. The AHB goes through a "matrix", and things go through huge multiplexors, leading me to believe that any AHB master can get to a different slave. So one master takes a different route through the matrix to its slave than another master does, and thus they are not competing for the same bus, and thus each master can go full speed.
Sounds like a crossbar switch.
« Last Edit: September 09, 2011, 06:13:33 PM by Zac67 »
 

Offline nicholas

Re: question about DMA
« Reply #4 on: September 09, 2011, 06:10:32 PM »
I can't contribute anything of use to this thread but I would like to say how much I've enjoyed reading it already.

Can't wait for the rest! :)
“Een rezhim-i eshghalgar-i Quds bayad az sahneh-i ruzgar mahv shaved.” - Imam Ayatollah Sayyed  Ruhollah Khomeini
 

Offline Zac67

  • Hero Member
  • *****
  • Join Date: Nov 2004
  • Posts: 2890
Re: question about DMA
« Reply #5 on: September 09, 2011, 06:20:10 PM »
Come to think of it...
Actually, the Amiga bears some similarity to modern dual/triple/... memory channel systems. Essentially the latter have two or more separate memory subsystems, making it possible to compete for more than a single chunk of RAM. Those channels are symmetrical (in how they're implemented, not necessarily how they're loaded) and all-purpose. The Amiga's "channels" are very asymmetrical and specialized, but the idea was the same!
 

Offline billt (Topic starter)

  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 910
    • http://www.billtoner.net
Re: question about DMA
« Reply #6 on: September 12, 2011, 09:17:21 PM »
Well, prof says that it's a general definition that the CPU goes idle until DMA completes. He did say that additional circuitry could be added to be smart about sharing bus cycles, but that such added complexity may not be worthwhile, and is often not done. But, a couple of references:

Section 4.1.2 of Wolf's book Computers as Components (not required for the class but suggested reading) suggests that the CPU could be doing things independently until it needs the bus again.

And section 4.6 of Microprocessor Theory and Applications with 68000/68020 and Pentium, which I found while searching, suggests a few methods to divvy up bus cycles with the CPU rather than completely taking it over (cycle stealing and interleaving).
Bill T
All Glory to the Hypnotoad!
 

Offline Tension

Re: question about DMA
« Reply #7 on: September 12, 2011, 11:27:29 PM »
Quote from: Zac67;658664
In general - yes, it can. It depends on the load the DMA operation puts onto the bus(ses).

Not necessarily. If there's a bottleneck that gets saturated by DMA, other stuff slows to a halt. A standard PCI bus saturated by gigabit Ethernet can do nothing else, e.g. serve I/O requests for a HDD. Modern systems increasingly avoid bottlenecks; PCI Express is no longer a physical shared bus, for instance. On older Pentium III-age systems you could easily saturate the memory interface with fast I/O for SCSI or Ethernet, but a modern system can have dual-channel 1333 MHz RAM (or faster), a theoretical throughput of 16 bytes x 1333 MHz ~= 21 GB/s. Extremely heavy-duty hardware aside, there's nothing on the planet able to saturate that interface. If the PCIe subsystem is properly crossbar-switched, you can't saturate that either.
So, any DMA operation just competes for memory time slots. If there's plenty, the CPU will have to yield the bus only for extremely short times - depending on what's running that may not even be noticeable. OTOH, today's CPUs are so fast that they're practically waiting for the slow RAM all the time... ;)

Anyway, DMA is much more efficient since the data has to go through the bus only once. For PIO, the CPU needs to read the data from the device and write it to memory, the memory load is doubled. Additionally, status registers need to be polled, increasing the load even more. Furthermore, the CPU (or one core/thread) is busy handling the I/O.


See the cache as a separate bus that's running independently.

Yes! That's the plan. Chip DMA competes for chip bus time alone. The CPU competes for fast bus time with the (mostly Zorro) DMA devices present. For chip RAM access it needs to compete for both(!) simultaneously.

No.

Sounds like a crossbar switch.


You, sir, are completely correct.

Offline psxphill

Re: question about DMA
« Reply #8 on: September 13, 2011, 12:09:20 AM »
Quote from: billt;659088
Well, prof says that it's a general definition that the CPU goes idle until DMA completes. He did say that additional circuitry could be added to be smart about sharing bus cycles, but that such added complexity may not be worth while, and is often not done.

He is wrong. While it's possible to do what he says, there are no commercially available computers or consoles that always stop the CPU whenever any DMA activity is occurring. On a unified memory architecture, where the CPU and GPU access the same RAM, the CPU would never get a chance to run unless you turned the screen off.
 
It's possible to completely use up all the chip RAM bandwidth on the Amiga and lock out the CPU; however, by using fewer bitplanes and setting the blitter priority lower than the CPU's, you can run code while the DMA is occurring.
 

Offline Tension

Re: question about DMA
« Reply #9 on: September 13, 2011, 12:40:40 AM »
Quote from: psxphill;659104
He is wrong. While it's possible to do what he says, there are no commercially available computers or consoles that always stop the CPU whenever any DMA activity is occurring. On a unified memory architecture, where the CPU and GPU access the same RAM, the CPU would never get a chance to run unless you turned the screen off.
 
It's possible to completely use up all the chip RAM bandwidth on the Amiga and lock out the CPU; however, by using fewer bitplanes and setting the blitter priority lower than the CPU's, you can run code while the DMA is occurring.


true

Offline freqmax

  • Hero Member
  • *****
  • Join Date: Mar 2006
  • Posts: 2179
Re: question about DMA
« Reply #10 on: September 13, 2011, 06:08:18 AM »
Could the CPU work on Fast memory while the custom chips did DMA to Chip memory simultaneously?
(how does "slow memory" fit in?)

Btw, even the crappy x86 architecture's Intel 8237 could do "single mode", where CPU and DMA cycles are interleaved, and "demand mode", where transfers continue until TC or EOP goes active or DRQ goes inactive, which allows the CPU to use the bus when no transfer is requested. So that professor should reconsider "not often", as on x86 it's "quite common" ;)
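The two 8237 modes can be sketched as toy bus timelines; the cycle counts are illustrative, not datasheet timings:

```python
# Toy model of the two 8237 sharing modes described above.
# Each list entry is one bus cycle; the numbers are illustrative only.

def single_mode(n_words):
    """Single ('cycle stealing') mode: one DMA cycle per word, with the
    bus released to the CPU before the next word is transferred."""
    timeline = []
    for _ in range(n_words):
        timeline += ["DMA", "CPU"]   # one stolen cycle, then a CPU cycle
    return timeline

def demand_mode(n_words, drq_high_for):
    """Demand mode: DMA keeps the bus while DRQ stays asserted, then the
    CPU gets it back until the device requests again."""
    timeline = []
    while n_words > 0:
        burst = min(drq_high_for, n_words)
        timeline += ["DMA"] * burst + ["CPU"]
        n_words -= burst
    return timeline

print(single_mode(3))     # ['DMA', 'CPU', 'DMA', 'CPU', 'DMA', 'CPU']
print(demand_mode(4, 2))  # ['DMA', 'DMA', 'CPU', 'DMA', 'DMA', 'CPU']
```

Either way, the CPU is never locked out for the whole transfer, which is the point against the "CPU goes idle until DMA completes" definition.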
 

Offline commodorejohn

  • Hero Member
  • *****
  • Join Date: Mar 2010
  • Posts: 3165
    • http://www.commodorejohn.com
Re: question about DMA
« Reply #11 on: September 13, 2011, 06:15:26 AM »
Quote from: freqmax;659144
(how does "slow memory" fit in?)
Slow memory is RAM that sits on the chip bus but that the custom chips can't address, so all the chip-bus timing constraints apply to it as well.
Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
Synthesizers: Roland JX-10/MT-32/D-10, Oberheim Matrix-6, Yamaha DX7/FB-01, Korg MS-20 Mini, Ensoniq Mirage/SQ-80, Sequential Circuits Prophet-600, Hohner String Performer

"\'Legacy code\' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup
 

Offline delshay

  • Hero Member
  • *****
  • Join Date: Mar 2004
  • Posts: 1009
Re: question about DMA
« Reply #12 on: September 13, 2011, 06:45:03 AM »
Perhaps this is the right thread to ask this question, saving me time looking at the A1200 docs.

Why do some A1200 & CD32 machines have 70ns chip RAM instead of the normal 80ns chip RAM?

What is the wait state of chip RAM?
« Last Edit: September 13, 2011, 06:54:00 AM by delshay »
-------------
power is nothing without control
 

Offline Zac67

  • Hero Member
  • *****
  • Join Date: Nov 2004
  • Posts: 2890
Re: question about DMA
« Reply #13 on: September 13, 2011, 07:02:32 PM »
It highly depends on what you mean by "until DMA completes". If bus arbitration is per single cycle, then it is just that (of course there can't be multiple users on the bus simultaneously). Usually there's a "burst", so the bus gets allocated for a maximum of n cycles (many Pentium-era PCI boards allowed you to set a 'PCI latency'; that's the length of that burst), and bus mastership doesn't change within that burst.

However, these are practical limitations. In theory each bus cycle could be arbitrated independently, so a longer DMA operation (without buffering and bursts) that doesn't saturate the bus could get interleaved with CPU cycles. So, in general, that prof is wrong.
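Burst-limited arbitration like that PCI latency timer can be sketched as a round-robin arbiter; the masters and cycle counts here are made up for illustration:

```python
# Sketch of burst-limited, round-robin bus arbitration: each master may
# hold the bus for at most `latency` cycles per grant, then arbitration
# runs again. Masters and cycle counts are illustrative assumptions.
from collections import deque

def arbitrate(requests, latency):
    """requests: {master: total cycles wanted}. Returns the grant log
    as (master, cycles_granted) tuples."""
    queue = deque(requests.items())
    log = []
    while queue:
        master, wanted = queue.popleft()
        burst = min(latency, wanted)           # cap the burst length
        log.append((master, burst))
        if wanted > burst:
            queue.append((master, wanted - burst))  # rejoin the queue
    return log

print(arbitrate({"dma": 10, "cpu": 3}, 4))
# [('dma', 4), ('cpu', 3), ('dma', 4), ('dma', 2)]
```

Note how the CPU's request is served between the DMA bursts rather than waiting for the whole 10-cycle transfer to finish.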

Additionally, the CPU could easily run on cache alone as long as no memory cycle is required.

Furthermore, a dual (triple, ...) RAM channel design (unganged) could very well run both DMA and CPU cycles simultaneously, or even several DMAs (Xeon EXs have up to four memory channels!).

Even more complicated: integrating the memory controller into the CPU and using a peripheral interconnect for I/O (like HyperTransport, QPI, ...) could leave your I/O interconnect saturated while the memory subsystem idles for a few cycles, which could be scooped up by the CPU.

So, all in all, he's talking crap. Sorry. ;)


Quote
Could the CPU work on Fast memory while the Custom chips did DMA with the Chip memory simultainously ?
YES! YES! YES!

Quote
why are some A1200 & CD32 have 70ns chip ram to the normal 80ns chip ram?
That doesn't matter. 80 ns is fast enough; there's no way to make use of faster RAM unless you're overclocking the chipset (yes, I've tried that once :D).
 

Offline billt (Topic starter)

  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 910
    • http://www.billtoner.net
Re: question about DMA
« Reply #14 on: September 13, 2011, 07:29:08 PM »
This course is more about embedded computers than desktops, and perhaps there's a bit of context influencing things in that distinction, I'm not sure. For the class lab assignments we're programming a Rabbit 3000 board, 8-bit stuff. But still, I just had to try and find some information about these things.

One of the examples I think of is the DMA on two serial ports in our ARM chip at work, which we're designing into an SoC. It seems the ARM goes a lot faster than serial ports, at least the serial ports from back when I had a modem, and I do realize that speeds have increased far beyond what I ever used.

But, with a serial port doing DMA, it seems the CPU could have a few bus cycles between each serial port DMA bus cycle to do something else. Even with two serial ports each doing its own DMA, it would seem the CPU could have some free bus cycles to make use of. And that's ignoring internal caches.

Perhaps there are things that do saturate the bus for DMA; I suppose a big gigabit Ethernet or SATA transfer might be capable of that, or USB 3. And that I can accept: some things really do not leave any free cycles for the CPU until DMA is finished. But it seems like some things can be slow enough that sharing the bus (each taking its own cycles, of course) is possible and, in my/our minds, worth doing.

And I don't think it makes sense to DMA no matter what. There is some overhead involved in configuring the DMA controller: telling it where to read from, where to write to, how many transfers to do, and whether to increment the addresses (if you're reading from a serial port, for example, you're probably reading the same register every time, while a memory destination would increment to the next address on every cycle). A single-cycle DMA probably won't make any sense. The transfer would need to be larger than your overhead cycles before it starts being worth it, I think.
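That break-even point falls out of a quick comparison; the setup and per-word costs below are illustrative assumptions, not measurements:

```python
# Break-even sketch: DMA only pays off once the transfer is long enough
# to amortize the controller setup. Cycle figures are invented.

def cpu_copy_cost(n, per_word=10):
    """CPU copy loop: no setup, but a higher per-word cost."""
    return n * per_word

def dma_cost(n, setup=100, per_word=4):
    """DMA: fixed setup overhead, then a cheaper per-word cost."""
    return setup + n * per_word

# Smallest transfer where DMA beats the CPU copy loop:
break_even = next(n for n in range(1, 1000)
                  if dma_cost(n) < cpu_copy_cost(n))
print(break_even)  # 17
```

Below that size (here, 17 words with these made-up costs) the setup overhead means a plain CPU copy is cheaper, which matches the intuition above.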

For the very little DMA software I've written or looked at recently, the CPU does wait for DMA to complete. But this is a case of simulating the chip RTL before silicon, to make sure we have the DMA controller hooked up correctly inside the chip. There's nothing else to do but wait for the test results: no applications, no GUI to update, no OS, nothing else at all. I don't consider that a normal situation for writing software around a DMA controller.

I was just really surprised that I was the only one surprised by this detail in class that day. Thank you for making me feel like I'm on the sane side of this debate. :)
Bill T
All Glory to the Hypnotoad!