TCP break down with concurrent client accept


Oldrich Kepka
Hi,

we run lwIP 1.4.0 on a PPC440 and experience rare, random TCP hangs. I was able to create a minimal example that reproduces the hang. Set up a TCP server on the PPC:

    int socketId = socket(AF_INET, SOCK_STREAM, 0);
    if (socketId == -1) {... return;}

    struct sockaddr_in server;
    server.sin_family = AF_INET;
    server.sin_port = htons(12121);
    server.sin_addr.s_addr = INADDR_ANY;

    int err = bind(socketId, (struct sockaddr *) &server, sizeof (server));
    if (err < 0) {... return;}

    err = listen(socketId, 1);
    if (err < 0) {... return;}

    while (1) {
        int socketConn = accept(socketId, NULL, NULL);
        sys_thread_t thread = sys_thread_new("tcip_server", processConnection,
                                             (void*)socketConn,
                                             2*THREAD_STACKSIZE,
                                             DEFAULT_THREAD_PRIO);
    }

and spawn a thread for each accepted connection:

void processConnection(void *p) {

    int sd = (int)p;
    uint8_t *buffer = new uint8_t[CMD_MAX_SIZE];
    uint32_t n = 0, bufOffset = 0;

    while((n  = read(sd, buffer+bufOffset, CMD_MAX_SIZE-bufOffset)) > 0 ) {
        bufOffset += n;
    }
    ......
    delete[] buffer;  // allocated with new[], so release with delete[]

    close(sd);
}
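
One detail in the handler above that may be worth ruling out: read() returns a signed value, so storing it in a uint32_t turns an error return of -1 into a large positive number, and the loop condition never sees the failure. Below is a minimal sketch of the same loop with a signed return value. The names readAll, ReadResult and the ReadFn callback are hypothetical, introduced only so the loop can be exercised without a socket, and this is not claimed to be the cause of the hang.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>
#include <sys/types.h>  // ssize_t

// Hypothetical stand-in for read(sd, ...):
// returns >0 for bytes read, 0 on end of stream, <0 on error.
using ReadFn = std::function<ssize_t(uint8_t*, size_t)>;

struct ReadResult {
    size_t bytes;   // number of bytes stored in the buffer
    bool   failed;  // true if the reader reported an error
};

// Fills the buffer until end of stream, an error, or the buffer is full.
ReadResult readAll(const ReadFn& readFn, std::vector<uint8_t>& buffer) {
    ReadResult result{0, false};
    while (result.bytes < buffer.size()) {
        // Keep the return value signed so that -1 stays distinguishable.
        ssize_t n = readFn(buffer.data() + result.bytes,
                           buffer.size() - result.bytes);
        if (n < 0) {
            result.failed = true;  // error: stop instead of advancing the offset
            break;
        }
        if (n == 0) {
            break;                 // peer closed the connection
        }
        result.bytes += static_cast<size_t>(n);
    }
    return result;
}
```

With a real socket the callback would simply wrap read(sd, p, len). Using std::vector also means the buffer is released automatically when the handler returns.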

Then repeatedly send the contents of a small file (~30 characters) to the device with
for i in {1..n}; do cat some_file > /dev/tcp/DEVICE_IP/12121; done
from two shells at the same time. For a while I only see an occasional

cat: write error: Connection reset by peer

However, after some time this message is printed after every command in both shells. At this point TCP is broken. I admit that this example is rather aggressive, but it lets me bring the system into a problematic state similar to the one we experience in production.


I found that after TCP breaks down, UDP communication still works, so I am able to inspect the state of the system. For example, lwip_stats.tcp still counts incoming TCP packets correctly. However, no new TCP socket can be created, and I no longer see TCPIP_MSG_API messages in tcpip_thread. Adding printouts or usleep(1000) in some places makes the (race condition) problem disappear, but also slows the system down.

Any advice on how to move forward in debugging this would be very much appreciated. opt.h and tcp_impl.h are attached. I tried changing a few parameters blindly (TCP_TMR_INTERVAL, TCP_SLOW_INTERVAL, MEM_ALIGNMENT, MEMP_OVERFLOW_CHECK, MEMP_NUM_TCP_PCB, MEMP_NUM_TCP_PCB_LISTEN), without success.

Best,
Oldrich

_______________________________________________
lwip-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/lwip-users

Attachments: tcp_impl.h (26K), opt.h (74K)

Re: TCP break down with concurrent client accept

goldsimon@gmx.de
On 01.04.2018 01:26, Oldrich Kepka wrote:
> we run lwip-1.4.0 on PPC440 and experience rare random hanging of TCP

Two things that I think are worth noting:
a) 1.4.0 is really old. There have been numerous fixes since then. Can
you reproduce the issue with current git master?
b) 1.4.0 does not contain a file called tcp_impl.h. Although, both
tcp_impl.h and opt.h are stack-internal files that should not be
changed. Why did you attach them to your mail?

Simon


Re: TCP break down with concurrent client accept

Oldrich Kepka
Hi Simon,

I am returning to this old post. As you requested, I migrated our system to lwIP 2.0.2, using the Xilinx port, and I observe the same behaviour that I reported above.

I am able to reproduce the break with SOCKETS_DEBUG=1. This shows:
17:24:27.846: lwip_accept(2)...                                                                                    
17:24:27.848: lwip_accept(2): netconn_acept failed, err=-13
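
Since accept() can fail like this, the accept loop in the original example would hand an invalid descriptor to the worker thread. Below is a minimal sketch of guarding that step. ConnectionHooks, acceptAndSpawn and the three callbacks are hypothetical names, used so the logic can be shown without the lwIP port; how a failed thread creation is detected depends on the sys_arch implementation.

```cpp
#include <functional>

// Hypothetical callbacks standing in for accept(), sys_thread_new() and close().
struct ConnectionHooks {
    std::function<int()>     acceptConn;   // returns a descriptor, or <0 on failure
    std::function<bool(int)> spawnWorker;  // returns false if no thread could be started
    std::function<void(int)> closeConn;    // releases the descriptor
};

// Accepts one connection and hands it to a worker.
// Returns true only if a worker now owns the descriptor.
bool acceptAndSpawn(const ConnectionHooks& hooks) {
    int conn = hooks.acceptConn();
    if (conn < 0) {
        return false;           // accept failed: do not start a worker
    }
    if (!hooks.spawnWorker(conn)) {
        hooks.closeConn(conn);  // avoid leaking the socket and its resources
        return false;
    }
    return true;
}
```

The server loop would call acceptAndSpawn repeatedly and could log or back off briefly whenever it returns false.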

I have not yet been able to reproduce it with TCPIP_DEBUG=1 enabled as well. There seems to be a race condition that is not triggered when more information is printed.

Attached are lwipopts.h and sys_arch.c, which might be useful.

Thanks for any hint where to look or how to get more debug information.

Best,
Oldrich

Attachments: lwipopts.h (2K), sys_arch.c (33K)

Re: TCP break down with concurrent client accept

Oldrich Kepka
Hi again,

I found that TCP networking breaks because no semaphores are left to allocate. We run out of semaphores because some connections never close, and they never close because their threads are stuck in the read function:

    while(1) {
       n = read(s, ...)
       if(n<=0) break;
    }
    close(s)

More concretely, they are stuck in sys_arch_mbox_fetch, called from netconn_recv_data. The loop above runs once and the data are read. However, the thread then gets stuck in read during the second pass.

Looking at the pcbs that tcp_slowtmr iterates over, I found that the stuck connections are in state CLOSE_WAIT. So the application should normally call close(s), but due to some race condition we are stuck in read and cannot do that.
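
One possible way to keep a worker from blocking in read forever would be a receive timeout, so that the thread eventually reaches close(s) and releases its resources. Below is a minimal sketch using the standard SO_RCVTIMEO socket option. It is written against POSIX sockets, and setReceiveTimeout and readUntilIdle are hypothetical helper names. In lwIP this needs LWIP_SO_RCVTIMEO enabled, and depending on the configuration the option value may be an int in milliseconds instead of a struct timeval. It would only work around the symptom, not explain the race.

```cpp
#include <cstddef>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Limits how long a blocking read() on sd may wait for data.
bool setReceiveTimeout(int sd, long milliseconds) {
    timeval tv{};
    tv.tv_sec  = milliseconds / 1000;
    tv.tv_usec = (milliseconds % 1000) * 1000;
    return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Reads until end of stream, an error, an expired timeout, or a full buffer.
// Returns the number of bytes stored; the caller closes sd afterwards.
size_t readUntilIdle(int sd, char* buffer, size_t capacity) {
    size_t offset = 0;
    while (offset < capacity) {
        ssize_t n = read(sd, buffer + offset, capacity - offset);
        if (n <= 0) {
            break;  // 0: peer closed; <0: error or timeout expired
        }
        offset += static_cast<size_t>(n);
    }
    return offset;
}
```

The timeout value is a trade-off: too short and slow clients are cut off, too long and stuck connections hold their semaphores for that long.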

Thanks for any help,
Cheers,
Oldrich