Summary: LWIP_NETCONN_FULLDUPLEX: Assertion failed: sockets[i].select_waiting == 0
Project: lwIP - A Lightweight TCP/IP stack
Submitted by: pschlang
Submitted on: Thu 19 Dec 2019 09:42:42 AM UTC
Severity: 3 - Normal
Item Group: None
Assigned to: None
Discussion Lock: Any
Planned Release: None
lwIP version: git head
I've discovered a possible issue with lwip_select() when LWIP_NETCONN_FULLDUPLEX is enabled.
When closing a socket which is being lwip_select()ed on from another
task/thread, the socket might end up in a state where it is closed but
select_waiting is not decremented properly. This will trigger the assertion
"sockets[i].select_waiting == 0" when re-allocating that socket.
Here is my understanding of the failure mechanism:
1. A thread calls lwip_select() to wait for events on a specific socket.
2. lwip_select() will increment the used count for each socket via
lwip_select_inc_sockets_used() to ensure it's not freed during the select.
3. After increasing the select_waiting for the socket but before decrementing
it again (i.e. while waiting for events), the socket is closed from another
thread/task. Since the socket is still in use by lwip_select(),
fd_free_pending will be set.
4. In lwip_select(), the loop to decrease select_waiting is entered. In the
loop, tryget_socket_unconn_locked is used to retrieve the socket structure.
For the socket closed in (3), tryget_socket_unconn_locked will return NULL
because fd_free_pending is set (checked in sock_inc_used_locked). Since
tryget_socket_unconn_locked returned NULL, lwip_select() will correctly set
nready to -1 and errno to EBADF, but it never decrements select_waiting for
that socket.
5. lwip_select_dec_sockets_used() is used to decrement the used count before
returning -1 from lwip_select(). The used count of the closed socket will
become 0 and the socket is actually freed, but select_waiting is still 1.
6. Later, when re-using the socket structure in alloc_socket(),
"sockets[i].select_waiting == 0" assertion fails.
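The sequence above can be replayed in a small single-threaded model. This is only a sketch: struct model_sock and the model_* functions are hypothetical stand-ins written for this illustration, not the real struct lwip_sock or the real lwIP functions, and locking and the actual event wait are left out. It assumes the bookkeeping works as described in steps 1-6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one socket slot (NOT the real struct lwip_sock). */
struct model_sock {
  bool in_use;           /* slot is currently allocated */
  int  fd_used;          /* number of users holding the slot */
  bool fd_free_pending;  /* close was requested while fd_used > 0 */
  int  select_waiting;   /* threads waiting in select on this slot */
};

/* Models the lookup in step 4: fails once a free is pending. */
static struct model_sock *model_tryget(struct model_sock *s)
{
  if (!s->in_use || s->fd_free_pending) {
    return NULL;
  }
  s->fd_used++;
  return s;
}

/* Drops one use; the slot is really freed when the last user leaves. */
static void model_done(struct model_sock *s)
{
  assert(s->fd_used > 0);
  s->fd_used--;
  if (s->fd_used == 0 && s->fd_free_pending) {
    s->in_use = false;
    s->fd_free_pending = false;
  }
}

/* Models a close from another thread (step 3). */
static void model_close(struct model_sock *s)
{
  if (s->fd_used > 0) {
    s->fd_free_pending = true;  /* still in use: defer the free */
  } else {
    s->in_use = false;
  }
}

/* Replays steps 1-6. Returns select_waiting of the freed slot,
 * or -1 if the slot was unexpectedly not freed. */
int model_close_during_select(void)
{
  struct model_sock s = { true, 0, false, 0 };
  struct model_sock *p;

  s.fd_used++;         /* steps 1-2: select takes a use on the slot */
  s.select_waiting++;  /* select registers itself as a waiter */

  model_close(&s);     /* step 3: close while select is waiting */

  p = model_tryget(&s);  /* step 4: lookup fails, free is pending */
  if (p != NULL) {
    p->select_waiting--; /* skipped for the closed slot */
    model_done(p);
  }

  model_done(&s);      /* step 5: last use dropped, slot is freed */

  /* step 6: a re-allocation would now check select_waiting == 0 */
  return s.in_use ? -1 : s.select_waiting;
}
```

In this model the slot ends up freed with a nonzero select_waiting, which is the state the assertion in alloc_socket() checks for.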
Is this an issue in lwIP or am I just using it in a non-supported way?
I've prepared a small patch which fixes the problem in my tests. Since I'm not
an expert on lwIP internals, I'd appreciate it if somebody could double-check
whether the fix is valid.
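For discussion, one possible shape of a fix is sketched below. It is not necessarily what the patch does, and struct fix_sock and the fix_* functions are hypothetical names for a simplified model, not lwIP code. The idea it illustrates: lwip_select() still holds its own use on the socket during cleanup, so it could undo its select_waiting registration even when fd_free_pending is set, while still reporting EBADF to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for one socket slot (NOT the real struct lwip_sock). */
struct fix_sock {
  bool in_use;           /* slot is currently allocated */
  int  fd_used;          /* number of users holding the slot */
  bool fd_free_pending;  /* close was requested while fd_used > 0 */
  int  select_waiting;   /* threads waiting in select on this slot */
};

/* Cleanup for one slot at the end of select. The caller is assumed to
 * still hold the use it took when select started. */
static void fix_select_cleanup(struct fix_sock *s)
{
  assert(s->fd_used > 0);
  assert(s->select_waiting > 0);

  /* Undo the waiter registration regardless of fd_free_pending. */
  s->select_waiting--;

  /* Release select's use; this may complete the deferred free. */
  s->fd_used--;
  if (s->fd_used == 0 && s->fd_free_pending) {
    s->in_use = false;
    s->fd_free_pending = false;
  }
}

/* Returns select_waiting of the freed slot, or -1 if it was not freed. */
int fix_close_during_select(void)
{
  /* State while select is blocked: one use held, one waiter registered. */
  struct fix_sock s = { true, 1, false, 1 };

  s.fd_free_pending = true;  /* close from another thread, free deferred */
  fix_select_cleanup(&s);    /* select wakes up and cleans up */

  return s.in_use ? -1 : s.select_waiting;
}
```

With this ordering the slot is freed with select_waiting back at 0 in the model. Whether the same approach is safe in the real code depends on the locking around the cleanup loop, which the model does not cover.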