Some more thoughts and findings...
If I limit DiagROM's memory test range to $0-$E283, I see no errors even after dozens of passes. If I extend the range to $0-$E28F, there are address errors within a few passes. The range $0-$E28B seems marginal, sometimes passing quite a few times before showing errors. If I go much beyond that (say $0-$E3FF), DiagROM fairly promptly crashes (outputting unexpected text or binary on the serial port).
When address errors are detected, address bits 10-12 are always indicated (though DiagROM acknowledges that these are only estimates). I also often see bits 5, 7 and 8, and occasionally 2, 3, 4 and 6. I don't believe I've seen any other bits involved.
Once errors happen, things seem to stay bad: if I go back and test a previously "good" memory range, it will now show errors. Something seems to get stuck somehow, and a cold boot restores things.
I wondered about the mapping between the CPU address lines and the DRAM addressing controlled by Agnus, thinking that that might help identify underlying hardware problems. The Agnus specification document is quite helpful here:
The device generates RAM address from two sources, the processor or from the device performing DMA cycles, selected by a multiplexer. This multiplexer allows the processor to access RAM when AS* and RAMEN* are both low. At this time, the device also multiplexes the processor address (A1-A18) onto the MA bus. The device places A1 to A8 & A17 on the MA0 to MA9 outputs, respectively, during the row address time and places A9 to A16 & A18 on the MA0 to MA9, respectively, during the column address time. The A19 line is used by the IC to determine which RAS line is to be asserted. If A19 is low, RAS0* is enabled, and if high, RAS1* is enabled. The device also senses the LDS* and UDS* inputs to determine which CAS to drop. If LDS* is low, the IC will drop CALS* and if UDS* is low, CASU* is dropped.
https://retro-commodore.eu/files/downloads/amigamanuals-xiik.net/Hardware/Specifications%20Agnus%20-%20Manual-ENG%20.pdf
Note that MA is named DRA on other schematics, and presumably the references to MA9 in that excerpt are mistakes (there are 9 lines in total, numbered MA0..MA8). The version of Agnus described uses DRA0..DRA8 for 9 bits of row/column addressing, for (2^9)^2 = 262,144 total words (512 KiB).
So, during row address times, the mapping is:
A1 DRA0
A2 DRA1
A3 DRA2
A4 DRA3
A5 DRA4
A6 DRA5
A7 DRA6
A8 DRA7
A17 DRA8
...and during column address times:
A9 DRA0
A10 DRA1
A11 DRA2
A12 DRA3
A13 DRA4
A14 DRA5
A15 DRA6
A16 DRA7
A18 DRA8
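For anyone who wants to play with this, here is a rough Python sketch of the two tables above. It splits a CPU byte address into the 9-bit row and column values Agnus would put on DRA0..DRA8, plus the RAS bank chosen by A19. The function name agnus_row_col is my own invention, and the bit layout is taken purely from the spec excerpt, so treat it as an illustration rather than a verified model of the chip.

```python
def agnus_row_col(addr):
    """Split a CPU byte address into (row, column, ras_bank).

    Assumes the mapping in the spec excerpt:
      row:    A1..A8  -> DRA0..DRA7, A17 -> DRA8
      column: A9..A16 -> DRA0..DRA7, A18 -> DRA8
      A19 selects RAS0* (0) or RAS1* (1)
    A0 is not part of the DRAM address (byte selection is via UDS*/LDS*).
    """
    row = ((addr >> 1) & 0xFF) | (((addr >> 17) & 1) << 8)
    col = ((addr >> 9) & 0xFF) | (((addr >> 18) & 1) << 8)
    ras_bank = (addr >> 19) & 1
    return row, col, ras_bank


if __name__ == "__main__":
    # The end addresses of the test ranges mentioned earlier in this post.
    for addr in (0xE283, 0xE28B, 0xE28F, 0xE3FF):
        row, col, bank = agnus_row_col(addr)
        print(f"${addr:05X}: row={row:09b} col={col:09b} RAS{bank}")
```

Running it prints the row and column bit patterns for each of the range end addresses, which makes it easier to see which DRA lines change between a "good" and a "bad" range.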
Combining this with the pattern of bad address bits doesn't reveal any magical pattern to me, but it does at least show that the most commonly (apparently) flaky address bits are associated with DRA1-3 during column accesses.
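To make that cross-reference less error-prone, here is a small Python sketch that looks up which DRA line and which phase (row or column) each flagged address bit lands on. It assumes DiagROM's "bit N" means CPU address line A N, which I haven't confirmed, and the names ROW_BITS, COL_BITS and dra_for_bit are just ones I made up for this example.

```python
# CPU address lines in DRA0..DRA8 order, per the spec excerpt.
ROW_BITS = [1, 2, 3, 4, 5, 6, 7, 8, 17]
COL_BITS = [9, 10, 11, 12, 13, 14, 15, 16, 18]


def dra_for_bit(bit):
    """Return (phase, dra_line) for a CPU address bit, or None if unmapped."""
    if bit in ROW_BITS:
        return ("row", ROW_BITS.index(bit))
    if bit in COL_BITS:
        return ("column", COL_BITS.index(bit))
    return None  # e.g. A0 or A19, which are not multiplexed onto DRA


if __name__ == "__main__":
    # Bit groups as reported by DiagROM in my tests (estimates only).
    observed = {
        "always": [10, 11, 12],
        "often": [5, 7, 8],
        "occasionally": [2, 3, 4, 6],
    }
    for label, bits in observed.items():
        for bit in bits:
            phase, line = dra_for_bit(bit)
            print(f"{label:>12}: A{bit:<2} -> DRA{line} ({phase})")
```

One thing this makes visible is that A2-A4 map to DRA1-3 during row time, the same lines that A10-A12 use during column time. Whether that means anything, given that DiagROM's bit reports are only estimates, I can't say.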
I also found it interesting that A8-A13 are associated with the CIA chips, and one of these tested bad. A8-11 connect directly to both CIAs, and A12 and A13 are involved in the chip select logic.
Helpful? I really don't know.
