We saw some weird behavior with a TCP connection on LWIP 1.4.1 when the peer (non-LWIP) has a cable disconnect:
LWIP has an established TCP connection #1 running fine
Peer has a cable disconnect
Our application on top of LWIP runs into a receive timeout and closes the socket (500ms)
Peer reconnects cable
Our application opens a new connection #2 which again is established and running fine
The FIN/ACK and PSH/ACK retransmissions of connection #1 also reach the peer, which answers with RST/ACK
This keeps on looping until we restart the whole machine with LWIP
Also, I have a sort of “netstat” implemented on top of the LWIP socket API. It runs over all possible sockets we have and, if it finds a valid conn pointer there, prints info (local addr, remote addr, port, TCP state
and such). And connection #1 does not show up anymore in this view!
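One possible reason a socket-based view loses connection #1, offered as an assumption and not a confirmed diagnosis: closing the socket releases the socket slot right away, while the TCP control block can stay on the stack's internal list until the close handshake finishes. The toy model below only illustrates that difference. The names `demo_pcb`, `demo_sock` and `demo_close` are hypothetical and are not lwIP types or functions; in lwIP the corresponding internal list is `tcp_active_pcbs`.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_ESTABLISHED, DEMO_FIN_WAIT_1 } demo_state_t;

/* Hypothetical stand-in for a TCP control block kept by the stack. */
struct demo_pcb {
    demo_state_t state;
    struct demo_pcb *next;
};

/* Hypothetical stand-in for one slot of the socket table. */
struct demo_sock {
    struct demo_pcb *pcb;   /* NULL once the application closed the socket */
};

/* View 1: what a netstat built on the socket table can see. */
static size_t count_via_sockets(const struct demo_sock *socks, size_t n)
{
    size_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (socks[i].pcb != NULL) {
            seen++;
        }
    }
    return seen;
}

/* View 2: what a walk over the stack's own control block list can see. */
static size_t count_via_pcb_list(const struct demo_pcb *head)
{
    size_t seen = 0;
    for (const struct demo_pcb *p = head; p != NULL; p = p->next) {
        seen++;
    }
    return seen;
}

/* Models an application close: the slot is detached immediately, the
 * control block stays listed while its FIN is still unacknowledged. */
static void demo_close(struct demo_sock *s)
{
    assert(s != NULL);
    if (s->pcb != NULL) {
        s->pcb->state = DEMO_FIN_WAIT_1;
        s->pcb = NULL;
    }
}
```

If this assumption holds for the setup in question, a diagnostic that walks the stack's control block list instead of the socket table would still show connection #1 and its state.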
In my mind, the TCP state machine should be in FIN_WAIT_1 while the peer cable is disconnected?
And it should just jump to either CLOSED or TIME_WAIT when receiving the RSTs upon cable reconnect?
I attached a clipped pcap with only connection #1 shown and the problem starting at packet #19. Imagine the final exchange going on forever to understand the problem ;o)
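For reference, the expectation about FIN_WAIT_1 can be written down as a small transition function. This is a minimal sketch of the active-close path as described in RFC 793, not lwIP code; the enum and function names are made up for illustration. Under RFC 793 an acceptable RST aborts the connection and leads to CLOSED directly, without passing through TIME_WAIT.

```c
#include <assert.h>

/* States of the active-close path (RFC 793). The passive-close states
 * CLOSE_WAIT and LAST_ACK are deliberately left out of this sketch. */
typedef enum {
    ST_ESTABLISHED,
    ST_FIN_WAIT_1,
    ST_FIN_WAIT_2,
    ST_CLOSING,
    ST_TIME_WAIT,
    ST_CLOSED
} tcp_state_t;

typedef enum {
    EV_APP_CLOSE,       /* application closes, our FIN is sent        */
    EV_RX_ACK_OF_FIN,   /* peer acknowledged our FIN                  */
    EV_RX_FIN,          /* peer sent its own FIN                      */
    EV_RX_VALID_RST     /* RST that passed the sequence number check  */
} tcp_event_t;

/* Returns the next state. Events not modeled here leave the state unchanged. */
static tcp_state_t tcp_step(tcp_state_t s, tcp_event_t ev)
{
    assert(s <= ST_CLOSED);

    if (s == ST_CLOSED) {
        return ST_CLOSED;
    }
    if (ev == EV_RX_VALID_RST) {
        /* An acceptable RST aborts the connection from any of these states. */
        return ST_CLOSED;
    }

    switch (s) {
    case ST_ESTABLISHED:
        return (ev == EV_APP_CLOSE) ? ST_FIN_WAIT_1 : s;
    case ST_FIN_WAIT_1:
        if (ev == EV_RX_ACK_OF_FIN) return ST_FIN_WAIT_2;
        if (ev == EV_RX_FIN)        return ST_CLOSING;
        return s;
    case ST_FIN_WAIT_2:
        return (ev == EV_RX_FIN) ? ST_TIME_WAIT : s;
    case ST_CLOSING:
        return (ev == EV_RX_ACK_OF_FIN) ? ST_TIME_WAIT : s;
    default:
        return s;
    }
}
```

In this model the cable-pull scenario is ESTABLISHED, then EV_APP_CLOSE, then EV_RX_VALID_RST, which ends in CLOSED. The important word is "valid": a RST that fails the sequence number check never produces that event.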
mmm... the ACK number..., I think I've seen this one or two years ago,
search the list and/or the patches for "one less" or something like that.
I'm not fresh on this, but I think that is the problem, the ACK to the
RST has the wrong number and causes a retransmission. I can't remember
if this is also related to the half-closed connection; you might check
on that too.
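To make the "one less" hint concrete, here is a hedged sketch of how an incoming RST can be classified by its sequence number, following the rules of RFC 5961 section 3.2. It is not taken from lwIP 1.4.1 and does not claim to reproduce its behavior; `classify_rst` and the enum are hypothetical names. The comparison helpers use the same wraparound-safe signed-difference idea as lwIP's `TCP_SEQ_*` macros.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe sequence number comparisons (signed 32-bit difference). */
static int seq_lt(uint32_t a, uint32_t b)  { return (int32_t)(a - b) < 0; }
static int seq_geq(uint32_t a, uint32_t b) { return (int32_t)(a - b) >= 0; }

typedef enum {
    RST_ACCEPT,         /* exact match: reset the connection           */
    RST_CHALLENGE_ACK,  /* in window but not exact: answer with an ACK */
    RST_DROP            /* outside the window: silently discard        */
} rst_verdict_t;

/* Classifies a RST segment by its sequence number (RFC 5961, section 3.2).
 * rcv_nxt is the next sequence number we expect, rcv_wnd our receive window. */
static rst_verdict_t classify_rst(uint32_t seg_seq, uint32_t rcv_nxt,
                                  uint32_t rcv_wnd)
{
    assert(rcv_wnd <= 0x7FFFFFFFu); /* keeps the signed comparisons meaningful */

    if (seg_seq == rcv_nxt) {
        return RST_ACCEPT;
    }
    if (seq_geq(seg_seq, rcv_nxt) && seq_lt(seg_seq, rcv_nxt + rcv_wnd)) {
        return RST_CHALLENGE_ACK;
    }
    return RST_DROP;
}
```

With `rcv_nxt = 1000`, a RST carrying sequence number 999 is dropped and one carrying 1001 only triggers an ACK, so an off-by-one on either side is enough to keep the connection alive and retransmitting. Whether that is what happens in the capture depends on the actual numbers in the pcap.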
On 08.03.2019 15:47, Sergio R. Caprile wrote:
> mmm... the ACK number..., I think I've seen this one or two years ago,
> search the list and/or the patches for "one less" or something like that.
> I'm not fresh on this, but I think that is the problem, the ACK to the
> RST has the wrong number and causes a retransmission. I can't remember
> if this is also related to the half-closed connection; you might check
> on that too.
I know this might not be an option, but 1.4.1 is *really* old and this
one as well as numerous other things might already be fixed in one of
the newer versions.
> I know this might not be an option, but 1.4.1 is *really* old and
> this one as well as numerous other things might already be fixed
> in one of the newer versions.
Yes. Unfortunately, this is not an option. We considered updating to 2.0.3 a while ago and I tried integrating it, but it requires quite a few changes, which we can't justify for the old product.
Is there a chance to find a specific fix/patch for this?
I tried searching the bug reports and patches on savannah and the mailing list but did not find anything that really matches.
Even after copying the top of tcp_process() from the current LWIP master into my 1.4.1 working copy, I still get this RST/ACK ping-pong.
I fear that something in tcp_receive(), or even deeper, has changed as well.
At least in the referenced bug report, the current state of master seems to have helped. My only idea right now is to blame it on the IP stack of the peer, but that is a bit counterproductive.
Any further ideas or help on this?
From: lwip-users <lwip-users-bounces+fabian.koch=[hidden email]> On Behalf Of Sergio R. Caprile
Sent: Monday, 11 March 2019 13:49
To: [hidden email]
Subject: Re: [lwip-users] TCP state machine problem? LWIP 1.4.1
Nope, I'm sorry, no further ideas on my side.
Since you can't upgrade, perhaps diffing against git commits around that
time or against git head would provide a clue on what to change.
There could be another bug report more related to your problem, I just
didn't find it but I haven't done an exhaustive search.