Re: endless loop in netconn_write

Re: endless loop in netconn_write

M. Gotink
I'm working with Edwin on the project and I have been debugging the problem some more.
 
First of all, I enabled the stack's debug output and am now sending all debug info over the UART.
I also wrote a small tool that creates a number of threads, each of which makes one request to the webserver. This lets me repeat the same requests at the same interval every time.
Running the tool with 3 threads crashes the webserver after a random number of requests; there is no noticeable pattern. Running the tool with 1 thread doesn't seem to crash the webserver, or it takes a long time to do so.
 
Checking the log, the only strange thing I noticed is the 'mem_malloc: could not allocate 1528 bytes' errors. In the beginning they occur only once in a while, but the longer the webserver runs and the more requests are made, the more often this error occurs. My first guess would be some kind of memory leak, but the lwip_stat->mem->used variable tells me memory is being malloced and freed without any problems: it's 0x3C on boot, and 0x3C again after a request has been handled.
 
The crash still starts in the netconn_write loop, where at some point it's not able to allocate memory for the send buffer. After about 1-2 minutes it breaks out of the loop because of a connection error, ECONNABORTED.
That memory should be freed by an ACK packet acknowledging data which has already been sent, which doesn't seem to happen. This sounds like the problem, but how does that explain the growing number of malloc errors? I didn't see any retransmits from the client in the Ethereal logs (except at the point where the server stops sending data), and the memory is still being freed... The problem seems to be in some file other than api_lib.c, since that file doesn't directly free or fill the send buffer.
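
To illustrate the stall described above, here is a minimal sketch of a write loop that depends on ACKs to free send-buffer space. It is a toy model, not lwIP's netconn_write; the names toy_sndbuf, toy_space, toy_ack and toy_write are made up for this example, and max_rounds exists only so that the stalled case terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a TCP send buffer. NOT lwIP code; every name is hypothetical. */
struct toy_sndbuf {
    size_t capacity;  /* total send buffer size in bytes */
    size_t unacked;   /* bytes queued that the peer has not ACKed yet */
};

/* Space a writer may still fill. */
static size_t toy_space(const struct toy_sndbuf *b)
{
    assert(b->unacked <= b->capacity);
    return b->capacity - b->unacked;
}

/* Processing an ACK releases up to n bytes of queued data. */
static void toy_ack(struct toy_sndbuf *b, size_t n)
{
    b->unacked -= (n < b->unacked) ? n : b->unacked;
}

/*
 * Write loop: queue as much as fits, then "wait" for ACKs and retry.
 * ack_per_round stands in for the ACKs processed while waiting.
 * max_rounds bounds the loop so the stalled case terminates here.
 * Returns the number of bytes that could not be queued (0 = done).
 */
static size_t toy_write(struct toy_sndbuf *b, size_t size,
                        size_t ack_per_round, unsigned max_rounds)
{
    while (size > 0 && max_rounds > 0) {
        size_t len = toy_space(b);
        if (len > size) {
            len = size;
        }
        b->unacked += len;   /* len == 0 when the buffer is full */
        size -= len;
        if (size > 0) {
            toy_ack(b, ack_per_round);
        }
        max_rounds--;
    }
    return size;
}
```

With ack_per_round set to 0, every pass after the first computes len = 0 and the loop makes no progress until something external (here max_rounds, on the target presumably the connection error) ends it.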
 
After it breaks out of that loop it won't send data anymore. If I start the tool with 3 threads, I see only one thread coming in through the debug log, but I don't see the stack doing anything with it. The same goes for ICMP ping packets. ARP packets still seem to be handled, according to the debug log.
 
I attached the whole debug log of 1 session; if you search for mem_malloc you will notice the growing number of malloc errors. After the last malloc error I waited some time before making new requests. There are 4 new requests in the log ('TCP connection request'): I started the tool with 3 threads 4 times, and it seems only 1 connection gets through each time.
The whole time I was pinging the webserver with the default packet size and a 10-second timeout; you will see the pings throughout the log. During and after the netconn_write loop it no longer responded to ping messages either. The ping packets didn't seem to have any effect on the stability of the webserver.
 
I hope someone can shed some more light on this problem. After all the debugging we've done, we don't really know where to look next...
 
(Just for your information: the lwIP stack is v1.1.1, running on FreeRTOS 4.0.4 on the Atmel AT91SAM7X256 controller on an AT91SAM7X-EK evaluation kit.)

Martin
_______________________________________________
lwip-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/lwip-users

Attachment: 3_threads_debug.zip (36K)

Re: endless loop in netconn_write

E.F.Spijksma
We've been studying the debug output the lwIP stack provides and noticed that this
happens when multiple threads access the stack.

Sometimes a "tcp_receive: received FIN" isn't followed by these two log lines:

TCP connection closed 3613 -> 80.
tcp_pcb_purge

This would mean the buffers aren't purged. Wouldn't this end up in
a loop in netconn_write(), because len = 0 due to unpurged data in the send
buffer?

Grtz
Edwin



Re: endless loop in netconn_write

Kieran Mansley
On Wed, 2006-07-26 at 12:10 +0200, E. Spijksma wrote:

> We've been studying the debug info the lwipstack provides and noticed that this
> happens with multiple threads to the stack.
>
> Sometimes a "tcp_receive: received FIN" isn't followed by
> TCP connection closed 3613 -> 80.
>
> tcp_pcb_purge
>
> This would mean the buffers aren't purged and wouldn't this end up in
> a loop in netconn_write() because len = 0 due to unpurged data in the send
> buffer?

I'm not sure about the details of your problem, but a received FIN will
not result in a closed connection until the process at the local end
also closes it.  The received FIN just means that the other end will not
send any more data, but the connection is still usable for the local end
to send.
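
To make the half-close point concrete, here is a minimal sketch of the closing part of the TCP state machine as described in RFC 793. It is an illustrative model only, not lwIP code; the enum and function names are made up for this example, and a segment carrying both FIN and ACK is modelled as two separate events.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the RFC 793 states that matter when closing a connection. */
enum tcp_st {
    ST_ESTABLISHED, ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_CLOSING,
    ST_CLOSE_WAIT, ST_LAST_ACK, ST_TIME_WAIT, ST_CLOSED
};

enum tcp_ev {
    EV_LOCAL_CLOSE,   /* application closes; our FIN is sent */
    EV_RECV_FIN,      /* the peer's FIN arrives */
    EV_RECV_FIN_ACK,  /* the peer ACKs our FIN */
    EV_TIMEOUT        /* the TIME_WAIT timer expires */
};

/* Return the state that follows s on event e; unknown pairs leave s unchanged. */
static enum tcp_st tcp_next(enum tcp_st s, enum tcp_ev e)
{
    switch (s) {
    case ST_ESTABLISHED:
        if (e == EV_LOCAL_CLOSE) return ST_FIN_WAIT_1;
        if (e == EV_RECV_FIN)    return ST_CLOSE_WAIT;
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RECV_FIN_ACK) return ST_FIN_WAIT_2;
        if (e == EV_RECV_FIN)     return ST_CLOSING;
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RECV_FIN) return ST_TIME_WAIT;
        break;
    case ST_CLOSING:
        if (e == EV_RECV_FIN_ACK) return ST_TIME_WAIT;
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_LOCAL_CLOSE) return ST_LAST_ACK;
        break;
    case ST_LAST_ACK:
        if (e == EV_RECV_FIN_ACK) return ST_CLOSED;
        break;
    case ST_TIME_WAIT:
        if (e == EV_TIMEOUT) return ST_CLOSED;
        break;
    case ST_CLOSED:
        break;
    }
    return s;
}

/* Apply n events in order, starting from state s. */
static enum tcp_st tcp_run(enum tcp_st s, const enum tcp_ev *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        s = tcp_next(s, ev[i]);
    }
    return s;
}

/* The local end may still send after the peer's FIN (CLOSE_WAIT). */
static int tcp_can_send(enum tcp_st s)
{
    return s == ST_ESTABLISHED || s == ST_CLOSE_WAIT;
}
```

A received FIN on an established connection only moves it to CLOSE_WAIT, where tcp_can_send() is still true; the connection reaches CLOSED only after the local side closes as well and its FIN is acknowledged.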

Kieran




Re: endless loop in netconn_write

M. Gotink
> On Wed, 2006-07-26 at 12:10 +0200, E. Spijksma wrote:
> > We've been studying the debug info the lwipstack provides and noticed that this
> > happens with multiple threads to the stack.
> >
> > Sometimes a "tcp_receive: received FIN" isn't followed by
> > TCP connection closed 3613 -> 80.
> >
> > tcp_pcb_purge
> >
> > This would mean the buffers aren't purged and wouldn't this end up in
> > a loop in netconn_write() because len = 0 due to unpurged data in the send
> > buffer?
>
> I'm not sure about the details of your problem, but a received FIN will
> not result in a closed connection until the process at the local end
> also closes it.  The received FIN just means that the other end will not
> send any more data, but the connection is still usable for the local end
> to send.
>
> Kieran


Our process is a webserver, so the server (lwIP) sends a FIN packet when it's
done sending the data (we call netconn_close at the end of the process). The
client will ACK this FIN and send a FIN packet from its side. When that packet is
received, the connection should be closed and the pcb should be purged... But
that's not happening.



Re: endless loop in netconn_write

Kieran Mansley
On Wed, 2006-07-26 at 13:37 +0200, M. Gotink wrote:

>
> Our process is a webserver so the server (lwIP) sends a FIN packet when its
> done sending the data (we call netconn_close at the end of the process), the
> client will ack this FIN with a FIN packet from his side. When that packet is
> received the connection should be closed and the pcb should be purged... But
> thats not happening.

Ahh, OK.  Could it be that the purge doesn't happen until the connection
has exited the TCP TIME_WAIT state?  I'm not sure how lwIP deals with
this and don't have time to check at the moment, so it's just an idea
rather than anything concrete.

Kieran




Re: endless loop in netconn_write

M. Gotink
>
> Ahh, OK.  Could it be that the purge doesn't happen until the connection
> has exited the TCP TIME_WAIT state?  I'm not sure how lwIP deals with
> this and don't have time to check at the moment, so it's just an idea
> rather than anything concrete.
>

We've updated lwIP to the latest version in CVS, with no noticeable change in
the stability of our webserver.
We seem to be missing a bit of the debug information, since it writes a lot of
data over the UART, so that's not totally reliable. We also use the lwip_stats
to locate errors, which is reliable.
Today we noticed that the MEMP_TCP_PCB pool (which is set to 32 right now)
shows some errors. As far as I know, this pool just holds the active
connections. So when we do 3 requests at once and wait until all
connections are closed before doing a new request, its usage should be 3
at most...
So it looks like not all connections are closed, which can cause problems with
some memory that won't be freed.
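
The reasoning above can be sketched with a small counter model. This is only an illustration of how used/max/err style pool counters behave; the struct and function names are hypothetical and are not lwIP's stats API, and the pool size of 32 is simply the value mentioned above.

```c
#include <assert.h>

/* Hypothetical pool counters: current usage, high-water mark, failed allocs. */
struct toy_stats {
    unsigned used;
    unsigned max;
    unsigned err;
};

/* Take one element from a pool of `avail` elements; returns 1 on success. */
static int toy_stats_alloc(struct toy_stats *s, unsigned avail)
{
    if (s->used >= avail) {
        s->err++;            /* pool exhausted: count the failure */
        return 0;
    }
    s->used++;
    if (s->used > s->max) {
        s->max = s->used;
    }
    return 1;
}

/* Give one element back to the pool. */
static void toy_stats_free(struct toy_stats *s)
{
    assert(s->used > 0);
    s->used--;
}

/*
 * Run `rounds` bursts of `per_round` connections. After each burst all
 * connections are released except `leaked` of them, which stay allocated.
 */
static void toy_run_rounds(struct toy_stats *s, unsigned avail,
                           unsigned rounds, unsigned per_round,
                           unsigned leaked)
{
    for (unsigned r = 0; r < rounds; r++) {
        unsigned got = 0;
        for (unsigned i = 0; i < per_round; i++) {
            got += (unsigned)toy_stats_alloc(s, avail);
        }
        for (unsigned i = leaked; i < got; i++) {
            toy_stats_free(s);
        }
    }
}
```

With 3 connections per burst and nothing leaked, max stays at 3 and err at 0 no matter how many bursts run. If even one connection per burst is never released, used climbs until the pool of 32 is exhausted and err starts counting.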

Martin



Re: endless loop in netconn_write

Kieran Mansley
On Thu, 2006-07-27 at 10:06 +0200, M. Gotink wrote:

> Today we've noticed that the MEMP_TCP_PCB (which is set to 32 right now) buffer
> shows some errors. As far as i know this buffer is just the number of active
> connections. So when we are doing 3 requests at once and wait until all
> connections are closed before doing a new request, this buffer should be 3
> max...
> So it looks like not all connections are closed, which can cause problems with
> some memory which won't be freed.

Again, connections in the TCP TIME_WAIT state could explain that: the
PCBs will continue to exist after the connection has closed until they
time out.  

Kieran




Re: endless loop in netconn_write

M. Gotink
>
> Again, connections in the TCP TIME_WAIT state could explain that: the
> PCBs will continue to exist after the connection has closed until they
> time out.  
>

According to the source, before a connection enters the TIME_WAIT state it is
purged, removed from the active list and then added to the TIME_WAIT
list. If a new connection arrives and no PCBs are left, the oldest PCB on
the TIME_WAIT list is removed.
Since rather many active connections are open (10 at the latest attempt),
it could be that the send buffer is full because this memory isn't freed, and
it isn't freed because somehow not all connections enter the TIME_WAIT state.
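
The allocation behaviour described above can be sketched roughly as follows. This is a simplified model based only on the description in this message, not a copy of the lwIP source; all names are hypothetical and the pool is shrunk to 4 entries to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_PCBS 4  /* hypothetical pool size for the example */

enum toy_state { TOY_FREE, TOY_ACTIVE, TOY_TIME_WAIT };

struct toy_pcb {
    enum toy_state state;
    unsigned long tmr;   /* tick at which the PCB entered its current state */
};

/* Statically allocated pool; zero-initialised, so every entry starts FREE. */
static struct toy_pcb toy_pool[TOY_NUM_PCBS];

/*
 * Hand out a free PCB. If none is free, recycle the PCB that has been in
 * TIME_WAIT the longest. Returns NULL when every PCB is still active.
 */
static struct toy_pcb *toy_pcb_alloc(unsigned long now)
{
    struct toy_pcb *pick = NULL;
    for (size_t i = 0; i < TOY_NUM_PCBS; i++) {
        struct toy_pcb *p = &toy_pool[i];
        if (p->state == TOY_FREE) {
            pick = p;        /* a free PCB always wins */
            break;
        }
        if (p->state == TOY_TIME_WAIT &&
            (pick == NULL || p->tmr < pick->tmr)) {
            pick = p;        /* oldest TIME_WAIT candidate so far */
        }
    }
    if (pick == NULL) {
        return NULL;
    }
    pick->state = TOY_ACTIVE;
    pick->tmr = now;
    return pick;
}

/* A cleanly closed connection leaves the active set and enters TIME_WAIT. */
static void toy_pcb_close(struct toy_pcb *p, unsigned long now)
{
    assert(p != NULL && p->state == TOY_ACTIVE);
    p->state = TOY_TIME_WAIT;
    p->tmr = now;
}
```

In this model, PCBs that reach TIME_WAIT can always be reclaimed for new connections, while PCBs stuck in the active state cannot, so a pool full of active PCBs makes every further allocation fail.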

Martin

