LwIP RAW + Zynq - Unresponsive Tx path when Rx is active

Nenad Pekez

Hello,

I hope to get some advice on how to debug a problem. I am still investigating it, and my knowledge of Ethernet drivers and lwIP is still limited.

We have 2 lwIP (2.0.2) RAW servers on a Zynq-based board. Each server can serve only 1 client at a time. The server on port 4040 only receives data and the server on port 4041 only sends data. Everything works fine as long as we don't receive and send simultaneously. When we do both at once, communication over 4041 (the Tx port) stalls at some point. We send and receive at data rates of around 1 MB/s, over a 1 Gb network.

Looking at Wireshark (121 is the PC, 124 is the Zynq), we see a lot of Dup ACKs before communication stalls. But once it stalls, no traffic is captured at all. For a whole minute nothing is captured, until we close our PC app.

The lwIP memory stats on the Zynq show nothing unusual. What is strange, though, is that when communication stalls, the LINK STAT xmit value keeps increasing, while TCP STAT xmit is not updated at all. Therefore, the TCP send buffers are never emptied and we are not able to write anything at all. We call tcp_output when tcp_sndbuf(pcb) < TCP_MSS/2, where TCP_MSS is 1460. We don't use the tcp_sent callback at all (is that a problem?); we just periodically write to the TCP send buffer. The other side of the communication is alive the whole time without problems.
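
The stall condition described above could also be detected in code by comparing two snapshots of the counters. This is only a minimal sketch in plain C: the struct and function names are hypothetical, and on a real target the two fields would be copied from lwip_stats.link.xmit and lwip_stats.tcp.xmit (assuming LINK_STATS and TCP_STATS are enabled).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the two counters discussed above. On a real
 * target the values would be copied from lwip_stats.link.xmit and
 * lwip_stats.tcp.xmit. */
struct xmit_snapshot {
    uint32_t link_xmit;
    uint32_t tcp_xmit;
};

/* True when the link-layer counter changed between the two snapshots but
 * the TCP counter did not, which is the pattern seen during the stall. */
bool tcp_stalled_while_link_active(struct xmit_snapshot prev,
                                   struct xmit_snapshot now)
{
    bool link_moved = now.link_xmit != prev.link_xmit;
    bool tcp_moved  = now.tcp_xmit != prev.tcp_xmit;
    return link_moved && !tcp_moved;
}
```

Taking a snapshot once per second in the main loop and setting a breakpoint or printing the pcb state when the function returns true would at least pin down the moment the stall begins.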

Any ideas on what this situation with the LINK and TCP xmit stats means? Any ideas on how to debug this problem further?

Thank you very much,
Nenad

  


_______________________________________________
lwip-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/lwip-users

Re: LwIP RAW + Zynq - Unresponsive Tx path when Rx is active

Sergio R. Caprile
Well, the RAW API is tricky and there is no such thing as half-duplex or
unidirectional communications in TCP, so I'd need to see your code.
You should (read: must) be calling sys_check_timeouts() frequently
enough for TCP to handle its internal timers; and since your current
problem is on the send side, I guess you can do with the built-in
handler, that is, no tcp_sent() callback. Do you have one?
BUT, as Simon said (and more than once), out-of-order packets are mostly
due to driver bugs, particularly DMA drivers.

Please don't paste screen captures, some of us like to scroll the frames
to see what is going on.

Your device is sending something on the 9.82M and suddenly a 9.79M frame
(that should have already been sent but I can't see it because you just
pasted the black and red parts) pops up.
Your client then starts shouting about a previous frame (that should
have already been sent as frame #165377 but I can't see it for the same
reason), I don't have more (read: previous) information, and then
restarts the connection.

To check if a driver is properly working, as in most engineering
endeavours, you have to reduce the number of random uncontrolled
variables to a quantity you can manage. So, instead of your application,
use a known-good application (read: written by someone with some
expertise on the subject, that is: the example apps in the tree or in
the contrib tree). Since you are debugging a heavy sender, I don't think
the echo is a good candidate, but the web server might be a better
choice. You can also use a bare minimum app that "just sends", but in
that case please send the code so we can check and give you the thumbs
up, otherwise you are debugging your driver and your port and your app
at the same time.
Ruling out the app, then there is your port and your driver.

Another source of errors is people (and vendors) not reading the docs
and calling lwIP low-level functions from different contexts (like main
loop and interrupts on a bare metal system). You said you use RAW API,
and I remember answering some question of yours, but I don't remember if
you are running bare metal or an OS (neither should I).

And sometimes caches... what is it that you have there?


Re: LwIP RAW + Zynq - Unresponsive Tx path when Rx is active

Nenad Pekez
> You said you use RAW API,
> and I remember answering some question of yours, but I don't remember if
> you are running bare metal or an OS (neither should I).

Yes, I am running bare metal.

> You should (read: must) be calling sys_check_timeouts() frequently
> enough for TCP to handle its internal timers;

> Another source of errors is people (and vendors) not reading the docs
> and calling lwIP low-level functions from different contexts (like main
> loop and interrupts on a bare metal system).

I discussed this in detail with Simon in a previous thread: http://lists.nongnu.org/archive/html/lwip-users/2018-07/msg00005.html

I do check TCP timeouts all the time and I am sure nothing is called from interrupt context.

> I guess you can do with the built-in
> handler, that is, no tcp_sent() callback. Do you have one?

I have added a tcp_sent callback. The behavior seen in Wireshark is somewhat different, but the Tx path still hangs. I have attached some Wireshark captures at this link: https://files.fm/u/2mhggcav.

> So, instead of your application,
> use a known to work good application (read: written by someone with some
> expertise on the subject, that is: the example apps in the tree or in
> the contrib tree).

Well, the problem is that I cannot find an appropriate third-party application that sends and receives at the same time. I have checked the iperf application provided in the lwIP sources, but there the sending is done after the receiving is finished. In any case, I did use this application as a reference for implementing my application. I was also checking some Xilinx iperf examples.

> Since you are debugging a heavy sender

Is this really a heavy sender? It's just 1.5MB per second on a 1Gb Ethernet.

> the web server might be a better
> choice.

I cannot find this application in the lwIP sources. Maybe you can give me some references?

> You can also use a bare minimum app that "just sends"

When I do "just sending" or "just receiving" the problem does not exist. I can send you my code showing how the sending is done. But basically I just write to the TCP buffer from time to time in the main loop and have a counter in the tcp_sent callback counting how much data has been acknowledged. Nothing more than that. Maybe I should continue sending from the tcp_sent callback?
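
To make that last question concrete, here is a minimal sketch of what "continue sending from the sent callback" could look like. It is a plain C model, not lwIP code: the struct and the names sender_try_send and sender_on_sent are hypothetical. In a real raw-API application the marked places would use tcp_sndbuf(), tcp_write() and tcp_output(), and the second function would be the callback registered with tcp_sent().

```c
#include <stddef.h>

/* Hypothetical model of a sender; none of these names come from lwIP. */
struct sender {
    size_t pending;   /* application bytes still waiting to be queued   */
    size_t sndbuf;    /* free send-buffer space (tcp_sndbuf() in lwIP)  */
    size_t in_flight; /* bytes queued but not yet acknowledged          */
};

/* Queue as much pending data as the send buffer accepts and return the
 * number of bytes queued. In a real raw-API application this is where
 * tcp_write() and tcp_output() would be called. */
size_t sender_try_send(struct sender *s)
{
    size_t n = s->pending < s->sndbuf ? s->pending : s->sndbuf;
    s->pending   -= n;
    s->sndbuf    -= n;
    s->in_flight += n;
    return n;
}

/* Model of the sent callback: len bytes were acknowledged, so that much
 * buffer space is free again and sending resumes immediately instead of
 * waiting for the next pass of the main loop. */
size_t sender_on_sent(struct sender *s, size_t len)
{
    if (len > s->in_flight) {
        len = s->in_flight;
    }
    s->in_flight -= len;
    s->sndbuf    += len;
    return sender_try_send(s);
}
```

With 5000 bytes pending and 2920 bytes of buffer, the first sender_try_send() queues 2920 bytes, a second call queues nothing, and each acknowledgement reported through sender_on_sent() queues the next chunk.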

> And sometimes caches... what is it that you have there?

Wow, caches are a story of their own. I still need to check some things with caches. Will report on this one.

Sergio, thank you very much for your answers and ideas.

Best regards,
Nenad

Re: LwIP RAW + Zynq - Unresponsive Tx path when Rx is active

goldsimon@gmx.de
On 14.08.2018 15:23, Nenad Pekez wrote:
[..]
> Well, the problem is that I cannot find an appropriate third-party application that sends and receives at the same time. I have checked the iperf application provided in the lwIP sources, but there the sending is done after the receiving is finished.

Have a 2nd look. When I last checked, this worked :-)
It depends on the iperf client's arguments though. And it might depend on the version of lwIP...

> In any case, I did use this application as a reference for implementing my application. I was also checking some Xilinx iperf examples.
>
>> Since you are debugging a heavy sender
>
> Is this really a heavy sender? It's just 1.5 MB per second on a 1 Gb Ethernet.

Depending on processor speed, yes, this is a heavy sender.
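
A rough worked calculation shows why the link speed says little here. The sketch below assumes 1.5 MB/s means 1,500,000 payload bytes per second, full 1460-byte segments, and no header overhead; the function names are made up for illustration. Under those assumptions the link is loaded to about 12 per mille, but the CPU has to produce roughly 1028 segments per second, which leaves under 1 ms per segment before any receive-side work is counted.

```c
#include <stdint.h>

/* Full-sized segments per second needed for a payload rate (rounded up). */
uint32_t segments_per_second(uint32_t bytes_per_s, uint32_t mss)
{
    return (bytes_per_s + mss - 1) / mss;
}

/* Payload share of the link in parts per thousand; headers are ignored. */
uint32_t link_load_permille(uint32_t bytes_per_s, uint64_t link_bits_per_s)
{
    return (uint32_t)(((uint64_t)bytes_per_s * 8u * 1000u) / link_bits_per_s);
}

/* Average time available per segment, in microseconds (rounded down). */
uint32_t budget_us_per_segment(uint32_t bytes_per_s, uint32_t mss)
{
    return 1000000u / segments_per_second(bytes_per_s, mss);
}
```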



>> the web server might be a better
>> choice.
>
> I cannot find this application in the lwIP sources. Maybe you can give me some references?

src/apps/httpd?


>> You can also use a bare minimum app that "just sends"
>
> When I do "just sending" or "just receiving" the problem does not exist. I can send you my code showing how the sending is done. But basically I just write to the TCP buffer from time to time in the main loop and have a counter in the tcp_sent callback counting how much data has been acknowledged. Nothing more than that. Maybe I should continue sending from the tcp_sent callback?
>
>> And sometimes caches... what is it that you have there?
>
> Wow, caches are a story of their own. I still need to check some things with caches. Will report on this one.

I still suspect the netif driver...

Cheers,
Simon