Author Topic: question about DMA (Read 4849 times)

Zac67 · « **on:** September 09, 2011, 06:02:14 PM »

Quote from: billt;658651

Now, I've always assumed that this means the processor can go off and do something else productive.

In general - yes, it can. It depends on the load the DMA operation puts on the bus(es).

Quote

But last night's discussion defined DMA as an operation that does a data transfer faster than the processor can, because the processor needs to read from source into a register and then write it back to destination, increment a transfer size counter, and do a compare to know if it's finished before moving on to the next unit. The DMA can do the counter and compare in hardware pretty much instantaneously on each transfer unit, compared to the processor needing to use instructions and thus instruction time to do the same thing. And this discussion said that while the DMA controller is doing this, the CPU is idle, waiting for DMA to complete.

Not necessarily. If there's a bottleneck that gets saturated by DMA, other stuff gets slowed to a halt. A standard PCI bus saturated by gigabit Ethernet can do nothing else, e.g. serve I/O requests for the HDD. Modern systems increasingly avoid bottlenecks; PCI Express, for instance, is no longer a physical shared bus. On older Pentium III era systems you could easily saturate the memory interface with fast I/O for SCSI or Ethernet, but a modern system can have dual-channel 1333 MHz RAM (or faster), a theoretical throughput of 16 bytes x 1333 MHz ~= 21 GB/s. Extremely heavy-duty hardware aside, there's nothing on the planet able to saturate that interface. If the PCIe subsystem is properly crossbar-switched you can't saturate that either.
So, any DMA operation just competes for memory time slots. If there are plenty, the CPU will have to yield the bus only for extremely short times - depending on what's running, that may not even be noticeable. OTOH, today's CPUs are so fast that they're practically waiting for the slow RAM all the time...
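
To put rough numbers on that, here is a back-of-the-envelope sketch. The figures are theoretical peaks only, and the 32-bit/33 MHz PCI variant is my assumption for what "standard PCI" means here; the helper name `peak_bytes_per_s` is made up for the example.

```python
# Theoretical peak rates only; real-world throughput is lower.
def peak_bytes_per_s(width_bytes, transfers_per_s):
    """Peak throughput of a parallel interface: width times transfer rate."""
    return width_bytes * transfers_per_s

pci = peak_bytes_per_s(4, 33_000_000)        # assumed: 32-bit PCI at 33 MHz
gige = 1_000_000_000 // 8                    # gigabit Ethernet, one direction
ram = peak_bytes_per_s(16, 1_333_000_000)    # two 64-bit channels, 1333 MT/s

pci_share = gige / pci    # fraction of the PCI bus eaten by the NIC
ram_share = gige / ram    # fraction of RAM bandwidth eaten by the NIC

print(f"PCI: {pci / 1e6:.0f} MB/s, NIC uses {pci_share:.0%}")
print(f"RAM: {ram / 1e9:.1f} GB/s, NIC uses {ram_share:.1%}")
```

Under these assumptions a single gigabit NIC takes nearly the whole PCI bus but well under 1% of the RAM interface, which is why the same DMA load is a bottleneck in one system and background noise in the other.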

Anyway, DMA is much more efficient since the data has to go through the bus only once. For PIO, the CPU needs to read the data from the device and then write it to memory, so the bus load is doubled. Additionally, status registers need to be polled, increasing the load even more. Furthermore, the CPU (or one core/thread) is busy handling the I/O.
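
As a toy count of bus transfers (not a model of any real hardware), the difference looks like this. The function names and the one-status-poll-per-word default are assumptions for illustration.

```python
def dma_bus_transfers(words):
    """DMA: each word crosses the bus once, device straight to memory."""
    return words

def pio_bus_transfers(words, polls_per_word=1):
    """PIO: status poll(s), device -> CPU register, CPU register -> memory."""
    # polls_per_word is a made-up parameter; real drivers may poll more or less.
    return words * (polls_per_word + 2)

words = 1024
print(dma_bus_transfers(words), pio_bus_transfers(words))
```

Even with no polling at all (`polls_per_word=0`) PIO still needs twice the transfers, and that is before counting the CPU time spent executing the copy loop.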

Quote

This is partly because it would at least be colliding with each other on bus usage, as apparently things are expected to be on the same bus.

See the cache as a separate bus that's running independently.

Quote

OK, thinking in terms of Amiga Classics, we have the split between chip and fast memory. Can the chipset do DMA in chip mem while CPU does stuff in fast mem, all at the same time?

Yes! That's the plan. Chip DMA competes for chip bus time alone. The CPU competes for fast bus time with the (mostly Zorro) DMA devices present. For chip RAM access the CPU needs to compete for both(!) buses simultaneously.

Quote

Or do they need to interleave and share the bus, so no one is really going at full speed on any bus?

No.

Quote

Also thinking about the ARM chips I'm involved with at work, I'm not sure you would consider that to be a single bus. The AHB goes through a "matrix", and things go through huge multiplexers, leading me to believe that any AHB master can get to a different slave. So one master takes a different route through the matrix to its slave than another master does, and thus they are not competing for the same bus, and thus each master can go full speed.

Sounds like a crossbar switch.

Zac67 · « **Reply #1 on:** September 09, 2011, 06:20:10 PM »

Coming to think of it...
Actually, the Amiga bears some similarity to modern dual/triple/... memory channel systems. Essentially the latter have two separate memory subsystems, offering the possibility to compete for more than a single chunk of RAM. These channels are symmetrical (in how they're implemented, not necessarily how they're loaded) and all-purpose. The Amiga's "channels" are very asymmetrical and specialized, but the idea was the same!

Zac67 · « **Reply #2 on:** September 13, 2011, 07:02:32 PM »

It highly depends on what you mean by "until DMA completes". If bus arbitration is done per single cycle, then the wait is just that one cycle (of course there can't be multiple users on the bus simultaneously). Usually there's a "burst", so the bus gets allocated for a maximum of n cycles (many Pentium era PCI boards allowed you to set a 'PCI latency' - that's the length of that burst) and bus mastership doesn't change within that burst.

However, these are practical limitations. In theory each bus cycle could be arbitrated independently, so a longer DMA operation (without buffering and bursts) not saturating the bus could get interleaved with CPU cycles. So, in general, that prof is wrong.
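
A minimal sketch of the burst idea, assuming a very simple arbiter that I made up for illustration: DMA may hold the bus for up to `burst` cycles, then the CPU gets one cycle if it wants one. Real arbiters are more involved; `schedule` and `longest_cpu_stall` are hypothetical helpers.

```python
def schedule(dma_words, cpu_words, burst):
    """Return the bus owner per cycle: 'D' for DMA, 'C' for CPU."""
    if burst < 1:
        raise ValueError("burst must be at least 1 cycle")
    timeline = []
    while dma_words or cpu_words:
        grant = min(burst, dma_words)   # DMA keeps the bus for one burst
        timeline += ["D"] * grant
        dma_words -= grant
        if cpu_words:                   # then the CPU gets a single cycle
            timeline.append("C")
            cpu_words -= 1
    return "".join(timeline)

def longest_cpu_stall(timeline):
    """Longest run of DMA cycles the CPU had to sit through before a grant."""
    runs = timeline.split("C")[:-1]     # only runs followed by a CPU cycle
    return max((len(run) for run in runs), default=0)

print(schedule(8, 8, 1), longest_cpu_stall(schedule(8, 8, 1)))
print(schedule(8, 8, 4), longest_cpu_stall(schedule(8, 8, 4)))
```

With `burst=1` the two masters interleave cycle by cycle; with a longer burst the CPU stalls for the burst length, not for the whole DMA operation.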

Additionally, the CPU could easily run on cache alone as long as no memory cycle is required.

Furthermore, a dual (triple, ...) RAM channel design (unganged) could very well run both DMA and CPU cycles simultaneously, or even several DMAs (Xeon EXs have up to four memory channels!).

Even more complicated: with the memory controller integrated into the CPU and a peripheral interconnect for I/O (like HyperTransport, QPI, ...), your I/O link could be saturated while the memory subsystem still idles for a few cycles, which could be scooped up by the CPU.

So, all in all, he's talking crap. Sorry.

Quote

Could the CPU work on Fast memory while the Custom chips did DMA with the Chip memory simultaneously?

YES! YES! YES!

Quote

why do some A1200 & CD32 have 70ns chip ram instead of the normal 80ns chip ram?

That doesn't matter. 80 ns is fast enough; there isn't any way to go faster unless you're overclocking the chipset (yes, I've tried that once).

Zac67 · « **Reply #3 on:** September 13, 2011, 08:27:16 PM »

I guess he was talking about one particular system, which may have a rather restrictive way of doing DMA - nothing wrong with that in embedded.

For serial, it does make sense to do even single cycle, repetitive DMA for longer I/Os (depending on serial and memory speed) - you can have a higher throughput with much less overhead than with PIO. With DMA:
- set up buffer start address
- set up buffer size / max I/O size (for input)
- set up timeout
- go play somewhere else
- until an interrupt tells you it's done

For PIO:
- install IRQ handler
- set up buffer start address
- set up buffer length
- set up timeout timer IRQ
- do something else
- on each interrupt in the IRQ handler:
  - restore current buffer address
  - read serial port buffer
  - copy to buffer
  - increase buffer pointer
  - compare buffer length
- on timer IRQ:
  - cancel the pending I/O

You see, DMA is much less hassle. IRQ handling can be much shorter which is a very good thing in embedded where you usually try to be very deterministic.
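
The two step lists above can be sketched as a toy simulation. This is not driver code: the class and method names are invented, `hardware_transfer` merely stands in for a DMA controller filling memory without CPU involvement, and timeout handling is left out.

```python
class PioReceiver:
    """PIO: the CPU's IRQ handler moves every byte itself."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.pos = 0          # current buffer position, kept across IRQs
        self.irqs = 0
        self.done = False

    def on_rx_irq(self, byte):
        self.irqs += 1
        self.buf[self.pos] = byte          # copy to buffer
        self.pos += 1                      # increase buffer pointer
        if self.pos == len(self.buf):      # compare against buffer length
            self.done = True

class DmaReceiver:
    """DMA: set up once, then a single 'finished' interrupt."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.irqs = 0
        self.done = False

    def hardware_transfer(self, data):
        # Stand-in for the DMA controller; assumes len(data) <= len(self.buf).
        self.buf[:len(data)] = data
        self.on_done_irq()

    def on_done_irq(self):
        self.irqs += 1
        self.done = True

data = bytes(range(64))
pio = PioReceiver(len(data))
for b in data:
    pio.on_rx_irq(b)                       # one interrupt per byte
dma = DmaReceiver(len(data))
dma.hardware_transfer(data)                # one interrupt per transfer
print(pio.irqs, dma.irqs)
```

Both end up with the same buffer contents, but the PIO path runs its handler once per byte while the DMA path runs it once per transfer.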

With PCI, NICs began using main memory as buffers via DMA. The achievable speed is high enough, and once you've built the (simple) DMA engine you get away with extremely small hardware buffers, saving costs and reducing latency(!): when the NIC signals that a frame is done, the data is already present in main memory and the software doesn't need to wait for the driver to copy the data out of the NIC.

Zac67 · « **Reply #4 on:** September 13, 2011, 08:37:44 PM »

Quote from: billt;659245

For the very little DMA software I've written or looked at recently, the CPU does wait for DMA to complete.

That highly depends on the system it's all running on. A single-tasking OS has no way to spend the time elsewhere, so you have to wait. (Unless you've implemented asynchronous I/O, which usually isn't the case.) In a multitasking OS, the driver simply lets go of the CPU after DMA setup and is revived once the 'finished' IRQ drops in. All the other threads can do what they want during that time.
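
Here is a rough user-space analogy of the multitasking case, using Python threads. Everything in it is an assumption for illustration: a timer plays the DMA hardware, an `Event` plays the 'finished' IRQ, and `run_demo` is a made-up name.

```python
import threading

def run_demo(dma_seconds=0.05):
    finished = threading.Event()        # set by the simulated 'finished' IRQ
    log = []

    def driver():
        # DMA setup: hand the transfer to the "hardware" (a timer here).
        threading.Timer(dma_seconds, finished.set).start()
        log.append("driver: DMA started, sleeping")
        finished.wait()                 # blocks this thread only
        log.append("driver: woken by IRQ")

    t = threading.Thread(target=driver)
    t.start()
    other_work = 0
    while True:                         # another thread keeps using the CPU
        other_work += 1
        if finished.is_set():
            break
    t.join()
    return other_work, log

other_work, log = run_demo()
print(other_work, log)
```

The driver thread sleeps between setup and completion while the loop in the main thread keeps making progress, which is the behaviour described above. A single-tasking system would correspond to calling `finished.wait()` with nothing else to run.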
