lwip_close() doesn't work when lwip_write() hangs

alhadpalkar
I am using lwIP 1.4.1. I have a thread that connects to a remote server and writes data to it using lwip_write(). Sometimes this call hangs indefinitely. It looks like it is waiting on the conn->op_completed semaphore, which never gets signaled.

I tried using the SO_SNDTIMEO socket option, but that just causes panics in my system. So I tried another approach: a watchdog that detects the hang and calls lwip_close(). lwIP doesn't handle that either; I hit this assert:

netconn_free(struct netconn *conn)
{
  LWIP_ASSERT("PCB must be deallocated outside this function", conn->pcb.tcp == NULL);
...
}

Debugging further, it looks like do_delconn() ends up setting ERR_INPROGRESS:

do_delconn(struct api_msg_msg *msg)
{
  /* @todo TCP: abort running write/connect? */
  if ((msg->conn->state != NETCONN_NONE) &&
      (msg->conn->state != NETCONN_LISTEN) &&
      (msg->conn->state != NETCONN_CONNECT)) {
    /* this only happens for TCP netconns */
    LWIP_ASSERT("msg->conn->type == NETCONN_TCP", msg->conn->type == NETCONN_TCP);
    km_printf("err in progress\n");
    msg->err = ERR_INPROGRESS;
...
}

so the PCBs are never cleaned up, which leads to this assert.

Is there a way around this?



Re: lwip_close() doesn't work when lwip_write() hangs

Joel Cunningham-2
lwIP doesn't support this kind of threading model. Multiple threads cannot perform simultaneous operations (read+write, write+close, etc.) on the same socket. The main limitation is that each netconn has only a single semaphore for blocking the calling thread while its request is processed in the core (tcpip thread) context.

On the master branch there is support for this model, but the feature is in an alpha state (see LWIP_NETCONN_FULLDUPLEX). lwIP 1.4.1 does not support it.
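For anyone who wants to try the master-branch feature, a minimal lwipopts.h sketch follows. The option names are the ones found in lwIP 2.x's opt.h and may differ in a particular snapshot of master; neither exists in 1.4.1. opt.h describes LWIP_NETCONN_FULLDUPLEX as alpha and as depending on LWIP_NETCONN_SEM_PER_THREAD, which in turn requires the port to supply per-thread semaphores (check opt.h for the exact requirements of the version you use).

```c
/* lwipopts.h fragment (sketch; lwIP 2.x / master only, not 1.4.1) */

/* Each calling thread blocks on its own semaphore instead of the
 * netconn's single op_completed semaphore. The port must provide the
 * per-thread semaphore hooks (LWIP_NETCONN_THREAD_SEM_GET() etc.). */
#define LWIP_NETCONN_SEM_PER_THREAD  1

/* Allow reading, writing and closing one netconn/socket from
 * different threads at the same time (documented as alpha). */
#define LWIP_NETCONN_FULLDUPLEX      1
```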

Joel

In reply to alhadpalkar (Oct 10, 2015, 12:29 AM).

--
View this message in context: http://lwip.100.n7.nabble.com/lwip-close-doesn-t-work-when-lwip-write-hangs-tp25191.html
Sent from the lwip-users mailing list archive at Nabble.com.

_______________________________________________
lwip-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/lwip-users


Re: lwip_close() doesn't work when lwip_write() hangs

alhadpalkar
Thanks. Is there any way to keep lwip_write() from blocking forever?

In reply to Joel Cunningham (Mon, Oct 12, 2015, 7:44 AM).

Re: lwip_close() doesn't work when lwip_write() hangs

Joel Cunningham-2
You can use SO_SNDTIMEO, which should work on lwIP 1.4.1. I have used it in my port with lwIP 1.4.1, so possibly there's a problem with your port?
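A small helper for setting the send timeout, sketched under assumptions: in lwIP 1.4.1, SO_SNDTIMEO takes a plain int holding milliseconds (matching the `&timeout_int, sizeof(timeout_int)` call in this thread) and is only compiled in when LWIP_SO_SNDTIMEO is set to 1 in lwipopts.h, whereas POSIX stacks expect a struct timeval. The USE_LWIP_1_4_SOCKETS switch and the set_send_timeout() name are hypothetical; the POSIX branch is what compiles by default.

```c
#include <assert.h>

#ifdef USE_LWIP_1_4_SOCKETS          /* hypothetical build switch */
#include "lwip/sockets.h"
#else
#include <sys/socket.h>
#include <sys/time.h>
#endif

/* Set a send timeout of timeout_ms milliseconds on socket s.
 * Returns 0 on success, -1 on failure. */
static int set_send_timeout(int s, int timeout_ms)
{
    assert(timeout_ms >= 0);
#ifdef USE_LWIP_1_4_SOCKETS
    /* lwIP 1.4.1: the option value is an int in milliseconds
     * (requires LWIP_SO_SNDTIMEO == 1). */
    return lwip_setsockopt(s, SOL_SOCKET, SO_SNDTIMEO,
                           &timeout_ms, sizeof(timeout_ms));
#else
    /* POSIX: the option value is a struct timeval. */
    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
#endif
}
```

A timeout of 0 disables the timeout (the call blocks forever) on both stacks.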

I've also written applications that use non-blocking sockets and select() to get blocking I/O that can be canceled. The trick is to add a second socket to select()'s read FD set and then let select() block until either your write or read is ready. This second socket is bound to the loopback address. When you want to cancel the blocking select() from another thread, simply send a datagram to the additional socket, which makes select() return. You can then detect that a cancel/wake-up happened because the second socket is marked as ready.
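A minimal sketch of the loopback wake-up socket, written with POSIX names so it builds on a host. lwIP offers lwip_socket(), lwip_bind(), lwip_getsockname(), lwip_sendto(), lwip_recv() and lwip_select() (or the plain names with LWIP_COMPAT_SOCKETS), and a loopback netif has to be enabled (for example via LWIP_HAVE_LOOPIF). The struct wakeup type and the helper names are illustrative. One design choice here: the cancelling thread sends from its own socket (tx_fd) and never touches the socket being waited on, so no socket is operated on by two threads at once.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: rx_fd sits in select()'s read set; tx_fd is
 * used only by the thread that wants to cancel the wait. */
struct wakeup {
    int rx_fd;
    int tx_fd;
    struct sockaddr_in addr;   /* loopback address rx_fd is bound to */
};

typedef enum { WAIT_READY, WAIT_WOKEN, WAIT_ERROR } wait_result;

static int wakeup_init(struct wakeup *w)
{
    socklen_t len = sizeof(w->addr);

    memset(&w->addr, 0, sizeof(w->addr));
    w->addr.sin_family = AF_INET;
    w->addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    w->addr.sin_port = htons(0);          /* let the stack pick a port */

    w->rx_fd = socket(AF_INET, SOCK_DGRAM, 0);
    w->tx_fd = socket(AF_INET, SOCK_DGRAM, 0);
    /* getsockname() reports the port that bind() assigned. */
    if (w->rx_fd < 0 || w->tx_fd < 0 ||
        bind(w->rx_fd, (struct sockaddr *)&w->addr, sizeof(w->addr)) < 0 ||
        getsockname(w->rx_fd, (struct sockaddr *)&w->addr, &len) < 0) {
        if (w->rx_fd >= 0)
            close(w->rx_fd);
        if (w->tx_fd >= 0)
            close(w->tx_fd);
        return -1;
    }
    return 0;
}

/* Called from the cancelling thread (e.g. a watchdog). */
static int wakeup_signal(const struct wakeup *w)
{
    const char b = 1;
    ssize_t n = sendto(w->tx_fd, &b, 1, 0,
                       (const struct sockaddr *)&w->addr, sizeof(w->addr));
    return n == 1 ? 0 : -1;
}

/* Block until fd is readable (or writable if want_write) or a wake-up
 * datagram arrives. A pending wake-up wins and is consumed. */
static wait_result wait_ready(int fd, int want_write, const struct wakeup *w)
{
    fd_set rd, wr;
    int maxfd = fd > w->rx_fd ? fd : w->rx_fd;

    assert(fd >= 0 && maxfd < FD_SETSIZE);
    FD_ZERO(&rd);
    FD_ZERO(&wr);
    FD_SET(w->rx_fd, &rd);
    if (want_write)
        FD_SET(fd, &wr);
    else
        FD_SET(fd, &rd);

    if (select(maxfd + 1, &rd, &wr, NULL, NULL) < 0)
        return WAIT_ERROR;
    if (FD_ISSET(w->rx_fd, &rd)) {
        char b;
        (void)recv(w->rx_fd, &b, 1, 0);   /* drain the wake-up datagram */
        return WAIT_WOKEN;
    }
    return WAIT_READY;
}
```

Applied to the original problem, the writer thread would loop on wait_ready(sock, 1, &w) followed by a non-blocking write, the watchdog would call wakeup_signal() instead of lwip_close(), and the writer would close its own socket after seeing WAIT_WOKEN.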

Joel

In reply to Alhad Palkar (Oct 12, 2015, 12:45 PM).

Re: lwip_close() doesn't work when lwip_write() hangs

Sylvain Rochet
Hi Joel,

On Mon, Oct 12, 2015 at 07:10:39PM +0000, Joel Cunningham wrote:

> I've also written applications that used non-blocking sockets and
> select to achieve a similar behavior of having blocking I/O that can
> be canceled.  The trick here is adding a second socket to the read FD
> set in select and then set select to block until your write or read is
> ready.  This second socket is bound to the loopback address.  When you
> want to cancel the blocking select from another thread, simply send a
> datagram to the additional socket, which will return the select call.  
> Then you can detect that a cancel/wakeup happened because the second
> socket is marked as ready.

I really like this trick. It reminds me of the well-known wake-up pipe I explained here[1], but using the loopback interface to do it in lwIP is very, very clever :-)

Sylvain

[1] http://lists.gnu.org/archive/html/lwip-devel/2015-09/msg00028.html


Re: lwip_close() doesn't work when lwip_write() hangs

Joel Cunningham-2
It's a great trick; hopefully others can leverage it as well :)

I'm not sure what I'd do without it. Using select() on non-blocking sockets as the single blocking point of a server's event loop is essential for handling many connections with good performance.
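A sketch of the non-blocking half of such an event loop: each connection keeps its unsent bytes, and whenever select() reports the socket writable, the loop hands over as much as the stack accepts without blocking. struct pending and flush_pending() are hypothetical names. The sketch uses fcntl() with O_NONBLOCK; as far as I know, lwIP 1.4.1's lwip_fcntl() only handles F_GETFL/F_SETFL with O_NONBLOCK (which is all this uses), and lwip_ioctl(s, FIONBIO, &on) is an alternative.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Switch fd to non-blocking mode, keeping its other status flags. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical per-connection output state. */
struct pending {
    const char *data;
    size_t len;    /* total bytes queued by the application */
    size_t off;    /* bytes already handed to the stack */
};

/* Call when select() reports fd writable. Returns 1 when everything
 * has been sent, 0 when the socket buffer filled up (keep fd in the
 * write set), -1 on error. */
static int flush_pending(int fd, struct pending *p)
{
    assert(p->off <= p->len);
    while (p->off < p->len) {
        ssize_t n = write(fd, p->data + p->off, p->len - p->off);
        if (n > 0)
            p->off += (size_t)n;
        else if (n < 0 && errno == EINTR)
            continue;
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        else
            return -1;
    }
    return 1;
}
```

Because flush_pending() never blocks, a slow peer only keeps its own descriptor in the write set instead of stalling every other connection served by the loop.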

Joel 

In reply to Sylvain Rochet (Oct 12, 2015, 02:19 PM).

Re: lwip_close() doesn't work when lwip_write() hangs

alhadpalkar

I tried using SO_SNDTIMEO. Here is what I am seeing:

1. I create a socket using lwip_socket(AF_INET, SOCK_STREAM, 0) and then connect using lwip_connect().
2. I set SO_SNDTIMEO to 1 s using lwip_setsockopt(socket, SOL_SOCKET, SO_SNDTIMEO, &timeout_int, sizeof(timeout_int)).
3. Then I start sending data as follows:

 while(err > 0) {
     err = lwip_write(socket, buffer, size);
 }

I hit the timeout condition in the second-to-last transfer:

lwip_send(0, data=0x20A72D78, size=86264, flags=0x0)  <--- this is the packet that times out
...
so_sndtimeo <-- added these printouts to check the condition I am hitting
partial write
lwip_send(0) err=0 written=67348

lwip_send(0, data=0x20A8348C, size=86264, flags=0x0) <--- the last packet that causes the assert
panic: already writing or closing <--- we hit this in do_write() (api_msg.c:1360)


Some questions:
1. Does that mean I cannot call lwip_write() again after a previous lwip_write() call has timed out?
2. Can I treat a return value from lwip_write() that is smaller than the total buffer size as an indication that the transfer timed out?
 
Thanks,
Alhad

In reply to Joel Cunningham (Mon, Oct 12, 2015, 1:16 PM).

Re: lwip_close() doesn't work when lwip_write() hangs

alhadpalkar
Figured out why I am hitting this issue. In the partial-write case, write_offset is not reset to zero, which triggers the "already writing or closing" assert in the next lwip_write() or lwip_close(). Once I reset it to zero like this:

diff --git a/lib/lwip/src/api/api_msg.c b/lib/lwip/src/api/api_msg.c
index 0a7edfe..2160766 100644
--- a/lib/lwip/src/api/api_msg.c
+++ b/lib/lwip/src/api/api_msg.c
@@ -1239,6 +1239,7 @@ do_writemore(struct netconn *conn)
       /* partial write */
       err = ERR_OK;
       conn->current_msg->msg.w.len = conn->write_offset;
+      conn->write_offset = 0; /* reset this back to zero */
       km_printf("partial write\n");
     }
   } else

things seem to work fine. Is this a known bug?

Thanks,
Alhad


In reply to Alhad Palkar (Mon, Oct 12, 2015, 2:02 PM).

Re: lwip_close() doesn't work when lwip_write() hangs

alhadpalkar
Ah, I think I found the fix:

Commit: aecbce283db243ebfc786879e50b0da8d7006ed8 [aecbce2]
Parents: b8d798158b
Author: goldsimon <[hidden email]>
Date: October 21, 2014 at 2:09:07 AM PDT

fixed bug #38219 Assert on TCP netconn_write with sndtimeout set

Alhad.

In reply to Alhad Palkar (Mon, Oct 12, 2015, 2:50 PM).