Author Topic: Zorro III memory card... now with Ethernet (Read 13326 times)

olsen · « **Reply #29 from previous page:** November 30, 2013, 09:15:00 AM »

Quote

...

This would probably make sense if the AllocVec () call used MEMF_CLEAR in addition to MEMF_PUBLIC, but this is not the case. Even if the allocated memory has been zeroed previously why not provide NULL as the default value? The very same source uses NULL just a couple of lines further down:

Code: [Select]
opener->filter_hook = (APTR)GetTagData (S2_PacketFilter, NULL, tag_list);
opener->dma_tx_function = (APTR)GetTagData (S2_DMACopyFromBuff32, NULL, tag_list);

Or perhaps there is some reason of (not) doing so?

As far as I can tell (the callbacks are initialized exactly once), this is risky, and there is no benefit in initializing the callbacks in this manner.

I would not call it a bug, since every client of the SANA-II driver is likely to provide the proper callbacks. But if it does not, for some reason, then the device will crash.

The 3c589.device should be more paranoid, and verify that each parameter provided by the client is sound.

I already put a snapshot of the whole 3c589.device/pccard.library source code into my SVN repository, for rework, but there's been too little time to rework it so far :-(

Bobo68 · « **Reply #30 on:** November 30, 2013, 09:31:19 AM »

Quote from: nyteschayde;753590

This is awesome and please don't take this question as a complaint, but with RAM being so cheap, why only 64MB? Is there some limitation that I'm unaware of? Is this so it can be used in Zorro II in addition to Zorro III? I actually am unaware of the limitation per card for these devices in regards to addressable memory.

there is a limit of chip size

tnt23 · « **Reply #31 on:** December 01, 2013, 08:32:47 AM »

Quote from: nyteschayde;753590

This is awesome and please don't take this question as a complaint, but with RAM being so cheap, why only 64MB? Is there some limitation that I'm unaware of? Is this so it can be used in Zorro II in addition to Zorro III? I actually am unaware of the limitation per card for these devices in regards to addressable memory.

64M was the biggest SDRAM chip I was able to find in TSSOP package. The common approach is to have two or more chips on board, but I wasn't brave enough to route another one. Zorro III itself is able to address more than 1G per card.

Speaking of Zorro II, the limit is 8M there, unless someone comes up with some sort of banking driver or something.

tnt23 · « **Reply #32 on:** December 01, 2013, 10:44:20 AM »

Quote from: olsen;753602

As far as I can tell (the callbacks are initialized exactly once), this is risky, and there is no benefit in initializing the callbacks in this manner.

I would not call it a bug, since every client of the SANA-II driver is likely to provide the proper callbacks. But if it does not, for some reason, then the device will crash.

The 3c589.device should be more paranoid, and verify that each parameter provided by the client is sound.

I already put a snapshot of the whole 3c589.device/pccard.library source code into my SVN repository, for rework, but there's been too little time to rework it so far :-(

Well, this indeed does not seem like a bug, at least no one complained so far. I think I understand what this code should do: since the S2_CopyToBuff is obligatory, the RX hook will be set to S2_CopyToBuff first. If the caller provides S2_CopyToBuff16 then the hook will be assigned that new tag value; otherwise, it will stick to S2_CopyToBuff, and so on. That way, as it seems to me, the request will be serviced using the fastest hook caller provides.

Anyway, the bug on my side was so silly it isn't even worth mentioning. Time to dig into DHCP (and try Sgrab):

olsen · « **Reply #33 on:** December 01, 2013, 07:20:47 PM »

Quote from: tnt23;753638

Well, this indeed does not seem like a bug, at least no one complained so far. I think I understand what this code should do: since the S2_CopyToBuff is obligatory, the RX hook will be set to S2_CopyToBuff first. If the caller provides S2_CopyToBuff16 then the hook will be assigned that new tag value; otherwise, it will stick to S2_CopyToBuff, and so on.

Yes, that seems to be the intention. However, if S2_CopyToBuff were missing, and S2_CopyToBuff16 were missing, too, then the code would end up using an uninitialized pointer, which should be caught before it happens. Same goes for the S2_CopyFromBuff tags.
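A minimal sketch of the check described here, assuming a simplified tag list. The tag IDs, `struct tag_item`, `lookup_tag ()` and `select_rx_hook ()` are hypothetical stand-ins (a real driver would use the S2_* values and GetTagData () from the system headers); the point is only that the cascade starts from NULL, so a missing callback can be detected instead of being called.

```c
#include <stddef.h>

/* Placeholder tag IDs for illustration only; a real driver takes the
 * S2_CopyToBuff / S2_CopyToBuff16 values from the SANA-II header. */
enum {
    TAG_END_MARK       = 0,
    TAG_COPY_TO_BUFF   = 0x1001,
    TAG_COPY_TO_BUFF16 = 0x1002
};

struct tag_item {
    unsigned long tag;
    void *data;
};

/* Hypothetical stand-in for GetTagData(): return the value stored for
 * `tag`, or `fallback` if the tag is not in the list. */
static void *lookup_tag(unsigned long tag, void *fallback,
                        const struct tag_item *list)
{
    for (; list != NULL && list->tag != TAG_END_MARK; list++) {
        if (list->tag == tag)
            return list->data;
    }
    return fallback;
}

/* Pick the receive hook: start from NULL, take the plain callback if it
 * is present, then let the 16 bit variant override it. Returns NULL if
 * neither tag was supplied, so the caller can reject the open request. */
static void *select_rx_hook(const struct tag_item *list)
{
    void *hook = lookup_tag(TAG_COPY_TO_BUFF, NULL, list);

    return lookup_tag(TAG_COPY_TO_BUFF16, hook, list);
}
```

A driver following this pattern would fail the OpenDevice () request when `select_rx_hook ()` returns NULL, rather than crash later on the first received packet.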

Quote

That way, as it seems to me, the request will be serviced using the fastest hook caller provides.

The purpose of S2_CopyToBuff16 is not to speed up copying. It is the counterpart to the S2_CopyFromBuff16 tag, which is a workaround for a hardware bug. As far as I know this bug only exists in one type of Amiga Ethernet card, which is the original "Ariadne".

There is a bug in how byte-sized Zorro II accesses to the card are handled. These are treated like word-sized accesses, which means that garbage data will go out or come in the high order byte. This isn't much of a problem for reading (if you read a byte from the receive buffer, you'll probably write it back as a byte, too), but if you write bytes to the Ariadne buffer, this will trash half the buffer contents.

The solution is to copy only in word-sized portions to the buffer, or in long-sized portions if possible. For this purpose the ariadne.device allocates a side-buffer, which all writes will go through. First the data will be copied into the side-buffer, then the side-buffer will be copied quickly to the transmit buffer on the card (in long-sized portions). Problem solved, but at the expense of speed.

The S2_CopyFromBuff16 method solves the problem by requiring that the client copies only in word-sized portions (or long-sized portions). As far as I know, no ariadne.device with the S2_CopyToBuff16 method enabled was ever shipped. The ariadne.device supports a different method, which is functionally identical to S2_CopyToBuff16. The tag ID for this method is (S2_Dummy + 1968). I suppose the ariadne.device author (Stefan Sticht, if I remember correctly) may have been born in 1968.

Put another way, no driver is really required to support the S2_CopyFromBuff16 method unless the driver really, really needs it.
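As a rough illustration of "copy only in word-sized portions", here is a sketch of a copy loop that never issues a byte-sized store. The function name is made up, both pointers are assumed to be 16 bit aligned, and both buffers are assumed to be valid for the length rounded up to an even number of bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `len` bytes using 16 bit stores only. If `len` is odd, the last
 * store carries one byte of padding, so both buffers must be valid for
 * the rounded-up length. Returns the number of bytes actually moved. */
static size_t copy_words_only(volatile uint16_t *dst, const uint16_t *src,
                              size_t len)
{
    size_t words = (len + 1) / 2;   /* round up to whole 16 bit words */

    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];            /* one 16 bit access per iteration */

    return words * 2;
}
```

The padding byte on odd lengths is the price of never writing a single byte to the card buffer.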

Quote

Anyway, the bug on my side was so silly it isn't even worth mentioning. Time to dig into DHCP (and try Sgrab):

Hm... does the DHCP negotiation succeed, eventually? If not, have you tried tcpdump yet?

tnt23 · « **Reply #34 on:** December 02, 2013, 05:45:22 AM »

Quote from: olsen;753651

Yes, that seems to be the intention. However, if S2_CopyToBuff were missing, and S2_CopyToBuff16 were missing, too, then the code would end up using an uninitialized pointer, which should be caught before it happens. Same goes for the S2_CopyFromBuff tags.

The purpose of S2_CopyToBuff16 is not to speed up copying. It is the counterpart to the S2_CopyFromBuff16 tag, which is a workaround for a hardware bug. As far as I know this bug only exists in one type of Amiga Ethernet card, which is the original "Ariadne".

That's fascinating

One would think the 16/32 buffer management routines have been proposed into SANA with performance in mind, not as some certain bug workarounds.

Quote

Put another way, no driver is really required to support the S2_CopyFromBuff16 method unless the driver really, really needs it.

Since the DM9000 in my design is wired in 16 bits, and I tend to use word accesses wherever possible, using x16 routines would be preferable in my case.

Quote

Hm... does the DHCP negotiation succeed, eventually? If not, have you tried tcpdump yet?

No, the DHCP gives up after a minute timeout. I suspect there are at least two reasons for that, first that the queueing TX is not done properly, and then there is a good load of KPrintF () calls all over the code - running at 9600 by default. If the serial debug routines are blocking then this would also impact timings. I will change the speed to 115200 and also will fix the TX queueing.
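To put rough numbers on the serial debug overhead, here is a small sketch. It assumes 8N1 framing (10 bits on the wire per character) and output that blocks until every character has been sent; neither is confirmed in this thread, and the function name is made up for illustration.

```c
#include <stddef.h>

/* Time in microseconds needed to push `chars` characters through a
 * serial port running at `baud`, assuming 10 bits per character (8N1). */
static unsigned long long serial_time_us(size_t chars, unsigned long baud)
{
    return ((unsigned long long)chars * 10ULL * 1000000ULL) / baud;
}
```

Under these assumptions a 60-character debug line costs about 62.5 ms at 9600 baud and about 5.2 ms at 115200, so a handful of such lines per packet could plausibly eat into a DHCP timeout.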

Haven't tried tcpdump yet, but definitely will

olsen · « **Reply #35 on:** December 02, 2013, 07:52:30 AM »

Quote from: tnt23;753674

That's fascinating. One would think the 16/32 buffer management routines have been proposed into SANA with performance in mind, not as some certain bug workarounds.

Since the DM9000 in my design is wired in 16 bits, and I tend to use word accesses wherever possible, using x16 routines would be preferable in my case.

I would not recommend it. The 16/32 bit copy functions require that the data being copied is aligned to a particular address boundary, and that in itself is a restriction. That restriction may be necessary (if your hardware chokes on unaligned accesses, which would be rather unfortunate), but it does not produce speed gains. On the contrary: the 68030 would benefit from word-sized access restrictions, but since the Zorro II space is marked as non-cacheable there would be no advantage after all. And on a Zorro III board that question wouldn't even come up.

Sticking with S2_CopyFromBuff/S2_CopyToBuff has no downsides. Any client (e.g. TCP/IP stack) should use optimized copying code which would automatically use long-sized accesses.

So, in a nutshell: your driver should use S2_CopyFromBuff/S2_CopyToBuff and ignore everything else, unless your hardware has very specific requirements for which the 16/32 bit aligned copying functions would solve a really big problem.

The same goes for the S2_DMACopyFromBuff32/S2_DMACopyToBuff32 functions: unless your hardware supports this functionality perfectly (that is, it actually supports DMA to/from arbitrary 32 bit aligned addresses) don't bother implementing it. The benefits of these functions are very, very small if you don't support DMA. You might be able to skip one copying step inside the TCP/IP stack, but the gains are small. One case (perhaps the only case) in which the gains are not so small is the PPPoE driver which I cooked up, and which is practically useless today

Quote

No, the DHCP gives up after a minute timeout. I suspect there are at least two reasons for that, first that the queueing TX is not done properly, and then there is a good load of KPrintF () calls all over the code - running at 9600 by default. If the serial debug routines are blocking then this would also impact timings. I will change the speed to 115200 and also will fix the TX queueing.

Haven't tried tcpdump yet, but definitely will

tcpdump is worth a shot if you suspect that traffic has gone missing which should have been processed by the TCP/IP stack. Readability of the output tends to be a rather mixed bag, though, so this might be a good idea only if all other options have been exhausted (or if you create binary capture files and view them in "Wireshark").

tnt23 · « **Reply #36 on:** December 03, 2013, 05:41:40 AM »

Thank you Olsen, after spending some time studying tcpdump and sashimi logs I came to the conclusion that the card simply wasn't picking up the DHCP ACK from the server. No wonder, since the code responsible for multicast/broadcast stuff was, ahem, mostly commented out.

So I went and cowardly let the card accept all and every frame to see if this was an issue. Bingo!
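For comparison with the accept-everything approach, here is a sketch of the destination address check a driver could apply in software. It is an illustration only: the function names are made up, and on real hardware most of this filtering would normally be left to the Ethernet controller's own address filter.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* The Ethernet broadcast address is all ones. */
static bool is_broadcast(const uint8_t *dst)
{
    static const uint8_t bcast[MAC_LEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    return memcmp(dst, bcast, MAC_LEN) == 0;
}

/* Group (multicast) addresses have the lowest bit of the first octet set. */
static bool is_multicast(const uint8_t *dst)
{
    return (dst[0] & 0x01) != 0;
}

/* Decide whether a frame should be passed up. `frame` points at the
 * destination address, which is the first field of an Ethernet frame. */
static bool accept_frame(const uint8_t *frame, const uint8_t *own_addr,
                         bool accept_multicast)
{
    if (memcmp(frame, own_addr, MAC_LEN) == 0)
        return true;                 /* unicast to this station */
    if (is_broadcast(frame))
        return true;                 /* broadcast is always accepted */
    return accept_multicast && is_multicast(frame);
}
```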

Ping is reporting duplicates, and FTP won't work even in passive mode, but being connected makes me feel better.

tnt23 · « **Reply #37 on:** December 04, 2013, 01:16:50 PM »

Quote from: olsen;753681

Sticking with S2_CopyFromBuff/S2_CopyToBuff has no downsides. Any client (e.g. TCP/IP stack) should use optimized copying code which would automatically use long-sized accesses.

In Roadshow, if the COPYMODE=FAST option is set, buffer management will offer S2_CopyFromBuff16. Is there a way to have it also provide S2_CopyToBuff16? I can imagine an environment where using word-sized and word-aligned access would indeed speed things up on the device driver's side compared with S2_CopyFromBuff/S2_CopyToBuff.

tnt23 · « **Reply #38 on:** December 04, 2013, 07:37:20 PM »

Quote from: tnt23;753795

In Roadshow, if the COPYMODE=FAST option is set, buffer management will offer S2_CopyFromBuff16. Is there a way to have it also provide S2_CopyToBuff16? I can imagine an environment where using word-sized and word-aligned access would indeed speed things up on the device driver's side compared with S2_CopyFromBuff/S2_CopyToBuff.

Here's what I've been looking into: (http://wiki.amigaos.net/index.php/Revision_3)

Code: [Select]


   These are optional callbacks presented to the device with the
   same calling interface as for S2_CopyToBuff or S2_CopyFromBuff,
   respectively. The difference to the original callbacks is the
   required and guaranteed transfer size and alignment for
   accessing the device's buffer for a single piece of a data of
   either 16 or 32 bits, a data word. The copy function called may
   only use 16/32 bit aligned read/write commands of 16/32 bits at
   once to transfer the data words, respectively. If the buffer
   data length is not a multiple of the required data word
   transfer size, the last data word transfer may contain garbage
   padding in either transfer direction.

tnt23 · « **Reply #39 on:** December 09, 2013, 04:32:42 PM »

That's what I get with non-debug version of dm9000.device. A4000 with 68030/25MHz and 2MB of Chip RAM, 0MB of Fast RAM, 64MB of Zorro III RAM clocked at 100MHz.

Code: [Select]

NETIO - Network Throughput Benchmark, Version 1.32
(C) 1997-2012 Kai Uwe Rommel

UDP server listening.
TCP server listening.
TCP connection established ...
Receiving from client, packet size  1k ...  135.32 KByte/s
Sending to client, packet size  1k ...  7.59 KByte/s
Receiving from client, packet size  2k ...  143.53 KByte/s
Sending to client, packet size  2k ...  149.24 KByte/s
Receiving from client, packet size  4k ...  146.89 KByte/s
Sending to client, packet size  4k ...  151.80 KByte/s
Receiving from client, packet size  8k ...  142.36 KByte/s
Sending to client, packet size  8k ...  156.03 KByte/s
Receiving from client, packet size 16k ...  144.04 KByte/s
Sending to client, packet size 16k ...  155.74 KByte/s
Receiving from client, packet size 32k ...  134.22 KByte/s
Sending to client, packet size 32k ...  157.05 KByte/s
Done.

I wonder if tweaking the priorities of RX/TX routines would give any boost. Also will try moving to INT6 chain, although I don't think this will improve things dramatically. The CNet driver is able to squeeze ~500KBytes through pccard interface, which is also sharing the INT2 interrupt.

I can use WGET to upgrade the device driver by simply pulling the new version from my PC over HTTP. So I'd judge that a single TCP connection works more or less stably. (Obviously even less). A mix of WGETs and pings also runs in parallel quite all right, with sanautil on top of that. However, when FTP opens a second socket in passive mode it never gets the remote directory listing. I can see the listing in the tcpdumped packets, so the device driver probably does something odd to them upon reception.

Oh, and MiamiDX cannot complete DHCP configuration for some reason, as opposed to Roadshow. Perhaps I will need more packet dumping inside the device driver.

tnt23 · « **Reply #40 on:** December 10, 2013, 11:14:32 AM »

Have just resolved the FTP issue.

This also seems to fix the small packet transfer speed. According to NetIO test, Tx/Rx is around 130K in both directions.

olsen · « **Reply #41 on:** December 10, 2013, 12:22:42 PM »

Quote from: tnt23;753805

Here's what I've been looking into: (http://wiki.amigaos.net/index.php/Revision_3)

Code: [Select]
These are optional callbacks presented to the device with the same calling interface as for S2_CopyToBuff or S2_CopyFromBuff, respectively. The difference to the original callbacks is the required and guaranteed transfer size and alignment for accessing the device's buffer for a single piece of a data of either 16 or 32 bits, a data word. The copy function called may only use 16/32 bit aligned read/write commands of 16/32 bits at once to transfer the data words, respectively. If the buffer data length is not a multiple of the required data word transfer size, the last data word transfer may contain garbage padding in either transfer direction.

I don't know if this has been clarified yet.

The purpose of 16 or 32 bit variants of the S2_CopyToBuff and S2_CopyFromBuff callbacks is to restrict all copying to operations which transfer data in amounts of a specific granularity. In the 16 bit variant, only 16 or 32 bit transfer operations will be used. In the 32 bit variant, only 32 bit transfer operations will be used. By contrast, the S2_CopyToBuff and S2_CopyFromBuff methods will use 8, 16 or 32 bit transfer operations, as necessary.

The S2_CopyFromBuff/S2_CopyFromBuff16/S2_CopyFromBuff32 callbacks transfer data to a contiguous buffer. If your hardware has no such contiguous buffer to transfer data to, you will have to copy the data to a contiguous side-buffer, which is then given to S2_CopyFromBuff/S2_CopyFromBuff16/S2_CopyFromBuff32 to process.

It works exactly the same with the S2_CopyToBuff/S2_CopyToBuff16/S2_CopyToBuff32 callbacks, except that the data is transferred into the opposite direction.

You may be able to avoid using a contiguous side-buffer if the TCP/IP stack supports the S2_DMACopyToBuff32 and S2_DMACopyFromBuff32 callbacks. With these callback functions, you may receive a pointer to a contiguous buffer which is at least as large as you requested. You may then access this buffer and directly copy to/from it. Note that you may get a NULL pointer if no such buffer is available, in which case you would need to fall back to calling S2_CopyToBuff or S2_CopyFromBuff instead, respectively.
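The fallback logic described here could be sketched as follows. The callback types and names are hypothetical simplifications, not the real SANA-II hook signatures; only the control flow (try the direct pointer, fall back to the plain copy callback on NULL) is taken from the description above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback types standing in for the SANA-II hooks. */
typedef void *(*dma_query_fn)(void *cookie);   /* may return NULL */
typedef int (*copy_to_buff_fn)(void *cookie, const void *src, size_t len);

/* Deliver a received frame: ask for a direct buffer pointer first and
 * fall back to the plain copy callback if none is offered.
 * Returns 1 on success, 0 on failure. */
static int deliver_frame(void *cookie, const void *frame, size_t len,
                         dma_query_fn dma_query, copy_to_buff_fn copy_to_buff)
{
    if (dma_query != NULL) {
        void *direct = dma_query(cookie);

        if (direct != NULL) {
            memcpy(direct, frame, len);   /* copy straight into the client buffer */
            return 1;
        }
    }
    if (copy_to_buff == NULL)
        return 0;                         /* no usable callback at all */
    return copy_to_buff(cookie, frame, len) != 0;
}

/* Example client callbacks used to exercise the sketch. */
static void *no_dma(void *cookie) { (void)cookie; return NULL; }
static void *direct_dma(void *cookie) { return cookie; }
static int plain_copy(void *cookie, const void *src, size_t len)
{
    memcpy(cookie, src, len);
    return 1;
}
```

The same shape applies in the transmit direction with S2_DMACopyFromBuff32 and S2_CopyFromBuff.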

tnt23 · « **Reply #42 on:** December 12, 2013, 07:23:49 AM »

Quote from: olsen;754124

I don't know if this has been clarified yet.

The purpose of 16 or 32 bit variants of the S2_CopyToBuff and S2_CopyFromBuff callbacks is to restrict all copying to operations which transfer data in amounts of a specific granularity. In the 16 bit variant, only 16 or 32 bit transfer operations will be used. In the 32 bit variant, only 32 bit transfer operations will be used. By contrast, the S2_CopyToBuff and S2_CopyFromBuff methods will use 8, 16 or 32 bit transfer operations, as necessary.

Frankly speaking, I don't understand why, for the 16-bit case, there would be any 32-bit transfer at all. Say, if we need to transfer two 16-bit words with respect to both size AND alignment, then it should look like two "move.w (src)+, (dst)+" instructions should it not? The addressing will be done in words, and that's nice. In my perception this is not equal to one "move.l (src)+, (dst)+" instruction as the latter breaks both the size (transferring 32 bits at once) and alignment constraints (crossing the 16 bit boundary).

Quote

The S2_CopyFromBuff/S2_CopyFromBuff16/S2_CopyFromBuff32 callbacks transfer data to a contiguous buffer. If your hardware has no such contiguous buffer to transfer data to, you will have to copy the data to a contiguous side-buffer, which is then given to S2_CopyFromBuff/S2_CopyFromBuff16/S2_CopyFromBuff32 to process.

That's exactly what I am trying to figure out. It is possible to implement the said contiguous buffer on my card, with the restriction that it is accessed only in 16-bit units at even addresses. If the S2_CopyFromBuff16/S2_CopyToBuff16 hooks followed that "move.w (src)+, (dst)+" restriction, everything should work smoothly - and that would eliminate the need for any side buffering, saving memory and improving performance.

Now, if the S2_CopyFromBuff16/S2_CopyToBuff16 hooks at some point won't follow the granularity convention and decide to switch to transferring 32 bits at once, that would break the whole idea, I think.

Quote

You may be able to avoid using a contiguous side-buffer if the TCP/IP stack supports the S2_DMACopyToBuff32 and S2_DMACopyFromBuff32 callbacks. With these callback functions, you may receive a pointer to a contiguous buffer which is at least as large as you requested. You may then access this buffer and directly copy to/from it. Note that you may get a NULL pointer if no such buffer is available, in which case you would need to fall back to calling S2_CopyToBuff or S2_CopyFromBuff instead, respectively.

I understand the DMA callbacks idea better now

In fact, I am trying to perform exactly like that, checking if the DMA hook is available, then asking for the pointer etc. It even seems to work, although is slow as hell. Lot to check on my side.

So, back to our 16-bit stuff. Do you think it would be feasible to implement that 'strict' behaviour S2_CopyFromBuff16/S2_CopyToBuff16 in Roadshow?

UPDATE. I'm afraid I have been terribly wrong: the hardware buffer on my side could only be arranged for long-aligned 16-bit access

tnt23 · « **Reply #43 on:** December 19, 2013, 07:28:12 AM »

Quick update regarding performance. That's the best of the driver (stock A4000 with 68030@25MHz I guess? no Fast RAM, Zorro memory running at 120MHz), compiled for 030 with -O3.

On the PC side, netio reports RX faster by ~30K.

With Fast RAM, rx/tx speeds increase slightly by ~50K in both directions. I guess I'll leave it as it is for now, will try various optimizations later.

olsen · « **Reply #44 on:** December 19, 2013, 08:52:57 AM »

Quote from: tnt23;754231

Frankly speaking, I don't understand why, for the 16-bit case, there would be any 32-bit transfer at all. Say, if we need to transfer two 16-bit words with respect to both size AND alignment, then it should look like two "move.w (src)+, (dst)+" instructions should it not? The addressing will be done in words, and that's nice. In my perception this is not equal to one "move.l (src)+, (dst)+" instruction as the latter breaks both the size (transferring 32 bits at once) and alignment constraints (crossing the 16 bit boundary).

There are two reasons.

The first is historic: up until very recently (and with the exception of the DKB WildFire, which I believe was capable of 32 bit wide memory access) all Amiga Ethernet hardware was either accessible only through the Zorro II bus, or did not permit 32 bit wide memory access. On the Zorro II bus, a 32 bit wide access will be broken up into two consecutive 16 bit accesses. How this worked out with hardware which could not support 32 bit wide accesses was up to the glue logic on the board.

The second is performance: the ratio of instructions executed vs. the amount of data copied is terrible for "move.w (a0)+,(a1)+", less terrible for "move.l (a0)+,(a1)+" and becomes better if you can leverage "movem.l (a0)+,d1-d7/a2-a6 ; movem.l d1-d7/a2-a6,(a1)+" style copying (better still if you can unroll the copying loop in which movem.l is used).

I stopped counting execution cycles more than 15 years ago, but I believe that performance of even an unrolled "move.w (a0)+,(a1)+" loop will be quite poor.
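The instruction-count side of this argument can be made concrete with a small sketch. It counts transfer steps only and says nothing about bus cycles or actual execution time; the function name is made up for illustration.

```c
#include <stddef.h>

/* Number of transfer steps needed to copy `len` bytes when each step
 * moves `unit` bytes: 2 for move.w, 4 for move.l, 48 for one movem.l
 * load/store pair using the 12 registers d1-d7/a2-a6. */
static size_t moves_needed(size_t len, size_t unit)
{
    return (len + unit - 1) / unit;   /* round up to whole steps */
}
```

For a 1500-byte frame this gives 750 steps with move.w, 375 with move.l, and 32 load/store pairs with the 12-register movem.l variant.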

Roadshow contains a restricted version of the original, optimized copying function, with the restriction being that only 16 and 32 bit copying operations are used. The goal was to provide better performance than the S2_CopyFromBuff/S2_CopyToBuff callbacks could, and this was done specifically for the "Ariadne".

There is a slow "move.w (a0)+,(a1)+" variant available in Roadshow already. It is enabled by default, but all the example interface configuration files disable it. To switch back to the slow variant, either remove the "copymode=fast" parameter from the respective interface file, or replace it with "copymode=slow".

Quote

That's exactly what I am trying to figure out. It is possible to implement the said contiguous buffer on my card, with the restriction that it is accessed only in 16-bit units at even addresses. If the S2_CopyFromBuff16/S2_CopyToBuff16 hooks followed that "move.w (src)+, (dst)+" restriction, everything should work smoothly - and that would eliminate the need for any side buffering, saving memory and improving performance.

Now, if the S2_CopyFromBuff16/S2_CopyToBuff16 hooks at some point won't follow the granularity convention and decide to switch to transferring 32 bits at once, that would break the whole idea, I think.

Could be, but then your code needs to be able to handle the regular S2_CopyFromBuff/S2_CopyToBuff callbacks, which are likely going to be much worse in terms of performance. You will always have to be able to provide for a side-buffer, in case S2_CopyFromBuff/S2_CopyToBuff callbacks are invoked and the client offers no alternative callbacks.

Quote

I understand the DMA callbacks idea better now. In fact, I am trying to perform exactly like that, checking if the DMA hook is available, then asking for the pointer etc. It even seems to work, although is slow as hell. Lot to check on my side.

So, back to our 16-bit stuff. Do you think it would be feasible to implement that 'strict' behaviour S2_CopyFromBuff16/S2_CopyToBuff16 in Roadshow?

See above: it's already supported

Quote

UPDATE. I'm afraid I have been terribly wrong: the hardware buffer on my side could only be arranged for long-aligned 16-bit access

If you can make it appear on a 32 bit aligned start address, then testing it with Roadshow's built-in slow 16 bit copy callback might just work out.
