netconn_write blocking


netconn_write blocking

Lukefahr, Andrew Robert (UMC-Student)
Hi,

I have an application that sends TCP data to several clients using the sequential API.  When a client disconnects gracefully, netconn_write returns a negative value, and I can close the connection. However, if any of the clients locks up (I'm using embedded clients), or a cable gets unplugged, etc., netconn_write keeps queuing packets until it fills up the buffer, and then blocks.  I've been playing around with debugging, and so far all I get is lots of messages about the queue being full.

My question is: what is the proper way to deal with ungraceful disconnections using the sequential API? Am I doing something wrong, should netconn_write return an error for ungraceful disconnections, or is there another way to check for connection timeouts?

I'm using FreeRTOS and a CVS version of lwIP from about 2 weeks ago.

Andrew Lukefahr
[hidden email]

Open Source, Open Minds




_______________________________________________
lwip-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/lwip-users

Re: netconn_write blocking

goldsimon@gmx.de

> Hi,
>  
Hi!
> I have an application that sends TCP data to several clients using the sequential API.  When a client disconnects gracefully, netconn_write returns a negative value, and I can close the connection. However, if any of the clients locks up (I'm using embedded clients), or a cable gets unplugged, etc., netconn_write keeps queuing packets until it fills up the buffer, and then blocks.  I've been playing around with debugging, and so far all I get is lots of messages about the queue being full.
>  
It complains about the queue being full?? That would be a
misconfiguration and maybe an error in your port! The queues should
never be full! That's why sys_arch_mbox_post has no return value, and
the port should assert to check that a queue is never full.
Misconfiguration could lead to this: too big TCP windows vs. too small
queues...

But to be sure about this, could you post an excerpt of your debug
output so that I know which function / file complains?
> My question is: what is the proper way to deal with ungraceful disconnections using the sequential API? Am I doing something wrong, should netconn_write return an error for ungraceful disconnections, or is there another way to check for connection timeouts?
>  
Unfortunately, there is only RX timeout currently. TX timeout is
planned, I think...


Simon



Re: netconn_write blocking

Frédéric BERNON-2
Hi,

When you say "lots of messages about the queue being full", can you tell me
exactly which messages you got?

I think that if you wait long enough (several minutes), you should get an
error. In fact, there is a kind of "timeout" in lwIP for a TCP "write", but
the default values in opt.h and tcp.h are too big (from my point of view). I
am thinking mainly of TCP_MAXRTX, which you could reduce (TCP_SYNMAXRTX is
its counterpart for unacknowledged SYN segments). This is the number of
retransmissions you have to wait for before lwIP aborts a TCP connection
whose segments are not acknowledged. When you unplug your "peer", your server
continues to send packets until it fills the TCP send buffer. Even in this
case, your "write" doesn't return as long as the segment is not enqueued (it
retries each time the tcp_sent callback is invoked, with do_writemore). Since
the cable is unplugged, the TCP segments you sent are never acknowledged, so
the "slow" TCP timer tries to resend them (TCP assumes these segments may
have been lost in the network, so this is a normal TCP retransmission). It
tries to resend them TCP_MAXRTX times, not at a "linear" rate but at an
"exponential" one. After that, it aborts the connection. If you can do a
Wireshark capture, I expect you will see these retransmissions (that is what
I did, see below). So the "solution" is to reach TCP_MAXRTX faster. To do
that, you can:

- Reduce TCP_MAXRTX in your lwipopts.h (you can try 4)
- Reduce TCP_TMR_INTERVAL in your lwipopts.h (you can try 100)

Kieran and I discussed lwIP's retransmission implementation in these emails
(it is not exactly the same case, but the cause is the same; be careful, I
talk about a dirty hack there, don't use it, it was just an experiment):

http://lists.nongnu.org/archive/html/lwip-devel/2007-09/msg00061.html
http://lists.nongnu.org/archive/html/lwip-devel/2007-09/msg00062.html
http://lists.nongnu.org/archive/html/lwip-devel/2007-09/msg00063.html

I attach some captures I made during these tests, but I can't remember the
TCP_TMR_INTERVAL value I used. What you can see in "TCP_MAXRTX=12.cap" is
that it takes 412 seconds until the connection is aborted (we can't see
anything in the capture at that point: lwIP doesn't send any RST packet when
it aborts the connection). You can also see that the delay between
retransmissions increases (it doubles at first, until it reaches a maximum
value). It uses the tcp_backoff table, which can be found in tcp.c:

const u8_t tcp_backoff[13] ={ 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7};

In "TCP_MAXRTX=6.cap", you can see that the abort is reached faster.
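
The effect of the retransmission count can be estimated with a small calculation. What follows is only a rough sketch, under the simplifying assumption (not copied from the lwIP sources) that retransmission n waits a base timeout, counted in slow-timer ticks, shifted left by tcp_backoff[n]. The names base_rto_ticks and ticks_until_abort() are hypothetical and exist only for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Backoff table quoted above from tcp.c. */
static const uint8_t tcp_backoff[13] = { 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7 };

/*
 * Hypothetical helper: estimate how many slow-timer ticks pass before a
 * connection is given up, ASSUMING retransmission n waits
 * (base_rto_ticks << tcp_backoff[n]) ticks. The real stack also adapts the
 * base timeout to the measured round-trip time, which is ignored here.
 */
static uint32_t ticks_until_abort(uint32_t base_rto_ticks, unsigned max_rtx)
{
    uint32_t total = 0;
    unsigned n;

    assert(max_rtx <= 13);  /* the table has only 13 entries */
    for (n = 0; n < max_rtx; n++) {
        total += base_rto_ticks << tcp_backoff[n];
    }
    return total;
}
```

With a base timeout of one tick, ticks_until_abort(1, 12) gives 894 ticks and ticks_until_abort(1, 4) gives 30, which shows why lowering the retransmission count shortens the wait so sharply. Multiplying by the tick length gives a time; the real figures depend on the round-trip time the stack has measured, so this model is not expected to reproduce the captures exactly.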

I hope it can help you...



Attachments: TCP_MAXRTX=12.cap (26K), TCP_MAXRTX=6.cap (17K)

Re: netconn_write blocking

Lukefahr, Andrew Robert (UMC-Student)
In reply to this post by Lukefahr, Andrew Robert (UMC-Student)
Hi,
Ok, first let me describe in a little more detail how my system works...
There are two threads. The first blocks listening for new connections,
and when it gets one, it adds the *netconn to a linked list.  A second
thread then generates some data and iterates through the linked list,
sending the same packet to every conn on the list. When any of the
netconn_write API calls returns <0, the connection is closed and deleted
from the linked list.

          { /* send */
              struct node *i = sockets_list->head;
              while (i != NULL) {
                  /* Save the successor first: delete_node() may free i. */
                  struct node *next = i->next;
                  struct netconn *conn = (struct netconn *) i->data;

                  if (netconn_write(conn, outputBuffer,
                                    (u16_t) strlen(outputBuffer),
                                    NETCONN_COPY) < 0) {
                      /* Write failed: drop this connection from the list. */
                      debug_printk("Closing connection\r\n");
                      netconn_close(conn);
                      netconn_delete(conn);
                      delete_node(sockets_list, i);
                  }
                  i = next;
              } /* while */
          } /* send */

I realize that this probably isn't the best way to write to sockets, as
if anything blocks, it stops sending data to all of the sockets.  This
is what I'm pretty sure is currently happening with netconn_write.

Ok, next, this is what I get for my debug output when I have the
following in my lwipopts.h
...
#define DBG_MIN_LEVEL                     DBG_LEVEL_SERIOUS
#define DBG_TYPES_ON    ( API_LIB_DEBUG |\
                                            API_MSG_DEBUG | \
                                            )
#define API_LIB_DEBUG                   LWIP_DBG_ON
#define API_MSG_DEBUG                   LWIP_DBG_ON

Debug Output:
( I added the LWIP:\t to each line)
...
LWIP:     tcp_write(pcb=0x207e78, data=0x20574c, len=18, copy=1)
LWIP:     tcp_enqueue(pcb=0x207e78, arg=0x20574c, len=18, flags=0, co
LWIP:     tcp_enqueue: queuelen: 32
LWIP:     tcp_enqueue: too long queue 32 (max 32)
LWIP:     tcp_output: snd_wnd 1052, cwnd 128, wnd 128, effwnd 135, se
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     tcpip_thread: PACKET 0x2085d0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x1146bd(0x0)
LWIP:     tcpip: ip_reass_tmr()
LWIP:     sys_timeout: 0x2087ac msecs=1000 h=0x1146bd arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     tcpip_thread: PACKET 0x2085d0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     tcp_slowtmr: polling application
LWIP:     tcp_write(pcb=0x207e78, data=0x20574c, len=18, copy=1)
LWIP:     tcp_enqueue(pcb=0x207e78, arg=0x20574c, len=18, flags=0, co
LWIP:     tcp_enqueue: queuelen: 32
LWIP:     tcp_enqueue: too long queue 32 (max 32)
LWIP:     tcp_output: snd_wnd 1052, cwnd 128, wnd 128, effwnd 135, se
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x1146bd(0x0)
LWIP:     tcpip: ip_reass_tmr()
LWIP:     sys_timeout: 0x2087ac msecs=1000 h=0x1146bd arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     sys_timeout: 0x20877c msecs=100 h=0x114629 arg=0x0
LWIP:     smf calling h=0x114629(0x0)
LWIP:     tcp_slowtmr: processing active pcb
LWIP:     tcp_slowtmr: polling application
LWIP:     tcp_write(pcb=0x207e78, data=0x20574c, len=18, copy=1)
LWIP:     tcp_enqueue(pcb=0x207e78, arg=0x20574c, len=18, flags=0, co
LWIP:     tcp_enqueue: queuelen: 32
LWIP:     tcp_enqueue: too long queue 32 (max 32)
... etc...

So, tcp_enqueue is complaining because the queue is too long.  Will
netconn_write block until it can write the next segment in the queue?

Ok, now the good news....
By setting the following:

#define TCP_MAXRTX              4
#define TCP_SYNMAXRTX           4
#define TCP_TMR_INTERVAL        100

I was able to cut the amount of time for a timeout down to like ~20
seconds.  This isn't great, as all of the other connections will be
without data for that ~20 seconds, and also, it seems like I'm setting
the number of times the packet can be retransmitted pretty low.  But
once the connection timed out, netconn_write returns, and the netconn is
deleted, and everything continues to operate normally.

This is at best a flimsy fix, so if anyone has better ideas, please feel
free to send them my way.

Thanks

-
Andrew Lukefahr
[hidden email]

Open Source, Open Minds




RE: netconn_write blocking

Goldschmidt Simon


> [..]
> I realize that this probably isn't the best way to write to
> sockets, as if anything blocks, it stops sending data to all
> of the sockets.

You can get problems this way, of course, but nevertheless, it should
work... Until a client stops responding.

> This is what I'm pretty sure is currently
> happening with netconn_write
>
> Ok, next, this is what I get for my debug output when i have
> the following in my lwipopts.h ...
> #define DBG_MIN_LEVEL                     DBG_LEVEL_SERIOUS
> #define DBG_TYPES_ON    ( API_LIB_DEBUG |\
>                                             API_MSG_DEBUG | \
>                                             )
> #define API_LIB_DEBUG                   LWIP_DBG_ON
> #define API_MSG_DEBUG                   LWIP_DBG_ON
>
> Debug Output:
> ( I added the LWIP:\t to each line)
> ...
> LWIP:     tcp_write(pcb=0x207e78, data=0x20574c, len=18, copy=1)
> LWIP:     tcp_enqueue(pcb=0x207e78, arg=0x20574c, len=18, flags=0, co
> LWIP:     tcp_enqueue: queuelen: 32
> LWIP:     tcp_enqueue: too long queue 32 (max 32)

Oh, THAT queue! That's something different, of course! :-) There are two
defines in opt.h to limit the amount of data being enqueued in one
tcp_pcb:

- TCP_SND_BUF: sender buffer space in bytes
- TCP_SND_QUEUELEN: number of pbufs allowed in the sender buffer

In your case, TCP_SND_QUEUELEN seems to be defined to 32, and 32 pbufs have
already been enqueued. If you do all your writes using 18 bytes, you will
have (18 bytes * 32 pbufs) 576 bytes in the queue, maybe that's a problem
for the Nagle algorithm or something else?

In any case, you should try to increase TCP_SND_QUEUELEN and see what
happens!
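
A possible lwipopts.h fragment for such an experiment is sketched below. The values are assumptions chosen for illustration only (the thread shows only that the queue limit was 32 and that the congestion window was 128 bytes); the constraint in the comment is the one stated next to TCP_SND_QUEUELEN in opt.h:

```c
/* lwipopts.h (illustrative values, not taken from the original setup) */
#define TCP_MSS             128    /* assumed segment size in bytes */
#define TCP_SND_BUF         2048   /* send buffer per connection, in bytes */
/* opt.h asks for at least 2 * TCP_SND_BUF / TCP_MSS pbufs (32 here);
   each small write can take a pbuf of its own, so leave extra headroom. */
#define TCP_SND_QUEUELEN    64
```

A larger queue consumes more segment and pbuf memory per connection, so the memory pool settings may have to grow with it.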


>  Will netconn_write block until it can write the next segment
> in the queue?

Yes.

>
> Ok, now the good news....
> By setting the following:
>
> #define TCP_MAXRTX              4
> #define TCP_SYNMAXRTX           4
> #define TCP_TMR_INTERVAL        100
>
> I was able to cut the amount of time for a timeout down to
> like ~20 seconds.  This isn't great, as all of the other
> connections will be without data for that ~20 seconds, and

And that's the bad news! A better solution (which is not yet implemented)
is to use send timeouts (like we have receive timeouts already).
But as I said, it remains a todo for lwIP...

> also, it seems like I'm setting the number of times the
> packet can be retransmitted pretty low.  But once the
> connection timed out, netconn_write returns, and the netconn
> is deleted, and everything continues to operate normally.
>
> This is at best a flimsy fix, so if anyone has better ideas,
> please feel free to send them my way.

Yes: increase TCP_SND_QUEUELEN and if it still locks up, use Wireshark
to see what's happening and post the packet trace here if you have
questions about it!

Simon



RE : netconn_write blocking

Frédéric BERNON
In reply to this post by Lukefahr, Andrew Robert (UMC-Student)
I think that increasing TCP_SND_QUEUELEN is not the solution in this case, since the problem is that the application has to keep running when one of the peers is unplugged. So a bigger TCP_SND_QUEUELEN would just increase the time before it blocks (if I'm not mistaken).
 

>> #define TCP_MAXRTX              4
>> #define TCP_SYNMAXRTX           4
>> #define TCP_TMR_INTERVAL        100
>>
>> I was able to cut the amount of time for a timeout down to
>> like ~20 seconds.  This isn't great, as all of the other
>> connections will be without data for that ~20 seconds, and  
>
>And that's the bad news! A better solution (which is not yet
>implemented)
>is to use send timeouts (like we have receive timeouts already). But as I said, it remains a todo for lwIP...
You could continue to reduce TCP_MAXRTX and TCP_TMR_INTERVAL if you want to reduce the timeout (but be careful: reducing TCP_TMR_INTERVAL will increase the CPU time used by the stack, and a TCP_MAXRTX that is too low could sometimes "drop" your TCP connections; 2 should be the lowest value, I think). This is already a "send timeout" (for packets already enqueued). Another one is keepalive packets, but they are not more efficient in this case (the default values drop a "dead" connection after several hours, if I remember correctly; they are mainly meant to drop TCP connections on which no data is exchanged for long periods, a telnet server for example).

In any case, a send timeout can't be the solution if you don't want to block at all (even for a few seconds). I think the solution lies more in the design of your application:

- you can use the raw API, where there are no blocking functions (if the send buffer is full, you will get an error, so you can handle the case). It's more difficult to use than the sequential API, but you can handle all cases.

Or

- you can use one thread per socket and a non-blocking queue between the "main sender" and the "socket senders": this way, you use the standard lwIP code, and if a "peer" is unplugged, only the matching "socket sender" is blocked.

Or

- you can check the send buffer space available on a connection before calling netconn_write: "available = tcp_sndbuf((( struct netconn *) i->data)->pcb.tcp);". It's not very thread-safe (since you access pcb.tcp, which can be deleted when err_tcp is called) and you have to implement your own "retry". If "available" stays too low for too long, you can close the connection.

Or

- you can patch api_lib.c and api_msg.c to avoid blocking when tcp_write fails to enqueue data, and "return" only how much data was really sent: it's more difficult, but thread-safe, and you still have to implement your own "retry" at application level.
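
The option of checking the free send-buffer space before each write needs a little bookkeeping per connection. A minimal sketch of that bookkeeping follows, assuming the caller obtains the free space itself (for example with the tcp_sndbuf() expression quoted in the list above). peer_watch_t, peer_watch_update() and the max_stalled threshold are hypothetical names, not part of lwIP:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PEER_SEND, PEER_SKIP, PEER_CLOSE } peer_action_t;

/* Hypothetical per-connection state: consecutive polls without enough room. */
typedef struct {
    unsigned stalled_polls;
} peer_watch_t;

/*
 * Decide what to do with one connection for one outgoing packet.
 *   available   - free send-buffer bytes reported by the stack
 *   needed      - size of the packet we want to write
 *   max_stalled - how many consecutive full-buffer polls we tolerate
 */
static peer_action_t peer_watch_update(peer_watch_t *w, size_t available,
                                       size_t needed, unsigned max_stalled)
{
    assert(w != NULL);

    if (available >= needed) {
        w->stalled_polls = 0;      /* room again: the peer is draining data */
        return PEER_SEND;          /* enough space reported for this packet */
    }
    w->stalled_polls++;
    if (w->stalled_polls > max_stalled) {
        return PEER_CLOSE;         /* stalled too long: treat the peer as dead */
    }
    return PEER_SKIP;              /* skip this packet for this peer, retry later */
}
```

The sending loop would call this once per connection and per packet, and only call netconn_write on PEER_SEND. Note that the debug output earlier in the thread shows the pbuf count (TCP_SND_QUEUELEN) running out rather than the byte count, so a byte-based check alone may not catch every case.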

I hope it gives you some ideas...
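
For the one-thread-per-socket option in the list above, the important property is that the main sender never waits on a slow peer. A minimal, portable sketch of such a queue follows, assuming fixed-size messages. msg_queue_t and its functions are hypothetical names, and the locking between the two threads is deliberately left out (on FreeRTOS the same role could be played by a queue that is written to with a zero block time):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_SIZE    32  /* assumed maximum message length in bytes */
#define QUEUE_SLOTS 8   /* assumed queue depth per connection */

typedef struct {
    char   slots[QUEUE_SLOTS][MSG_SIZE];
    size_t lens[QUEUE_SLOTS];
    size_t head;   /* next slot to read */
    size_t count;  /* number of used slots */
} msg_queue_t;

/* Called by the main sender: returns false instead of waiting when full. */
static bool msg_queue_try_push(msg_queue_t *q, const char *data, size_t len)
{
    size_t tail;

    assert(q != NULL && data != NULL);
    if (len > MSG_SIZE || q->count == QUEUE_SLOTS) {
        return false;              /* full or oversized: caller drops or closes */
    }
    tail = (q->head + q->count) % QUEUE_SLOTS;
    memcpy(q->slots[tail], data, len);
    q->lens[tail] = len;
    q->count++;
    return true;
}

/* Called by the per-connection sender before its (possibly blocking) write. */
static bool msg_queue_try_pop(msg_queue_t *q, char *out, size_t *len)
{
    assert(q != NULL && out != NULL && len != NULL);
    if (q->count == 0) {
        return false;              /* nothing queued for this connection */
    }
    *len = q->lens[q->head];
    memcpy(out, q->slots[q->head], *len);
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}
```

A queue that stays full for a long time is itself a hint that the peer behind it has stopped responding, so the main sender can use a failed push as a trigger to close that connection.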


====================================
Frédéric BERNON
HYMATOM SA
IT Project Manager
Microsoft Certified Professional
Tel.: +33 (0)4-67-87-61-10
Fax: +33 (0)4-67-70-85-44
Email: [hidden email]
Web Site: http://www.hymatom.fr
====================================
Before printing, think about the environment
 



RE: netconn_write blocking

Kieran Mansley
In reply to this post by Goldschmidt Simon
On Wed, 2007-10-10 at 08:21 +0200, Goldschmidt Simon wrote:

> > I was able to cut the amount of time for a timeout down to
> > like ~20 seconds.  This isn't great, as all of the other
> > connections will be without data for that ~20 seconds, and
>
> And that's the bad news! A better solution (which is not yet
> implemented)
> is to use send timeouts (like we have receive timeouts already).
> But as I said, it remains a todo for lwIP...

In this case with the current code it will time out eventually.  The
"problem" is that TCP is by default incredibly tolerant of a network
outage and keeps trying for a very long time before giving up.  We're
exposing this very long time up to the user, which is probably the right
thing to do, but I suppose it might be desirable to give the user the
option of having a shorter API-level timeout as well.

Kieran


