[TCP raw API] Nagle + tcp_output interaction (behavior in 24 throughput tests)


vr roriz
Dear colleagues,

I am writing my master's thesis in a project using the raw API of lwIP 2.0.3. Although my implementation works, I want to understand a certain interaction between the Nagle algorithm and the way I call (or do not call) tcp_output, but I am not quite sure what is happening.

On a TCP write request, the sender function is invoked as sender(data[ ], size, send_now). Its pseudocode is:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sender(*data, size, send_now) {

  left = size       // how much data is left to be put in the queue
  available = 0     // how much data can be put in the queue
  pos = 0           // current position in the data vector
  err = ERR_OK      // result of the last tcp_write attempt

  while (left > 0) {
    do {
      available = tcp_sndbuf(pcb)

      if ((left <= available) && (available > 0)) {
        err = tcp_write(pcb, &data[pos], left, TCP_WRITE_FLAG_COPY)
        if (err == ERR_OK) {
          left = 0
          if (send_now) {
            tcp_output(pcb)
          }
        } else { // err == ERR_MEM
          block and wait for the tcp_sent callback, indicating sent data was acked
        }

      } else if (available > 0) { // left > available
        err = tcp_write(pcb, &data[pos], available, TCP_WRITE_FLAG_MORE | TCP_WRITE_FLAG_COPY)
        if (err == ERR_OK) {
          left = left - available
          pos = pos + available
        } else { // err == ERR_MEM
          block and wait for the tcp_sent callback, indicating sent data was acked
        }

      } else { // available == 0
        block and wait for the tcp_sent callback, indicating sent data was acked
      }

    } while (err == ERR_MEM)
  }
}
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Initially I didn't have a send_now option, so I always called tcp_output when there was enough space to queue all the remaining data. The option was requested because the team writing the application needs two specific TCP segments to be sent together, even when they call sender twice in a row. Therefore, I have to give the application control over when tcp_output is called: "if the window size >= MSS and available data is >= MSS" [1], or if there is NO unacknowledged data still in the pipe, the segment will be sent immediately even with the Nagle algorithm enabled. The application's case is the second: it sends data smaller than MSS while there is no unacknowledged data in the pipe.

Then I added the send_now control option, letting tcp_output be called by lwIP itself when send_now = 0. I've searched for all references to tcp_output in the lwIP code. As far as I understand, not counting retransmissions, connect/close, etc., tcp_output is called from the TCP slow timer and at the end of tcp_input (with the comment /* Try to send something out. */). That makes sense: when an ACK is received, lwIP tries to flush the TCP Tx queue.

Ok, this strategy seems to work fine for our purposes, but I would like to understand the behavior during throughput tests. In the throughput test the application is a client: it connects to a TCP server and sends a defined amount of data every 1 ms period. The amount of data is derived from the throughput setpoint I set for the test. I ran the test for the 4 scenarios s = (Nagle, send_now) with 6 different throughput setpoints, i.e. 24 tests. The server runs on a PC, using Python sockets, and the nodes are directly connected. The network layer is IPv6, so I configure the MSS as 1500 (Ethernet MTU) - 40 (IPv6 header) - 20 (TCP header) = 1440 bytes.


------------------------------------------ Test summary - Task period = 1 ms - MSS = 1440 (IPv6) --------------------------------------------------------------------------------------
throughput_setpoint = 1 Mbps (125 bytes / period)
test_id = 1 : s = (0,0) -- throughput below setpoint, floating; RTT between 170 ms and 200 ms
test_id = 2 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 3 : s = (1,0) -- throughput below setpoint, floating; RTT between 170 ms and 200 ms
test_id = 4 : s = (1,1) -- throughput_measured = setpoint, stable

throughput_setpoint = 10 Mbps (1250 bytes / period)
test_id = 5 : s = (0,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 6 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 7 : s = (1,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 8 : s = (1,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms

throughput_setpoint = 25 Mbps (3125 bytes / period)
test_id = 9 : s = (0,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 10 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 11 : s = (1,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 12 : s = (1,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms

throughput_setpoint = 35 Mbps (4375 bytes / period)
test_id = 13 : s = (0,0) -- throughput_measured = setpoint, stable; RTT close to 1 ms (for this amount of data the RTT drops and the setpoint is reached)
test_id = 14 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 15 : s = (1,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 16 : s = (1,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms

throughput_setpoint = 45 Mbps (5625 bytes / period)
test_id = 17 : s = (0,0) -- throughput_measured = setpoint, stable; RTT close to 1 ms (for this amount of data the RTT drops and the setpoint is reached)
test_id = 18 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 19 : s = (1,0) -- throughput below setpoint, floating; RTT between 1 ms and 200 ms (a bit better)
test_id = 20 : s = (1,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms

throughput_setpoint = 49 Mbps (6125 bytes / period)
test_id = 21 : s = (0,0) -- throughput_measured = setpoint, stable; RTT close to 1 ms (for this amount of data the RTT drops and the setpoint is reached)
test_id = 22 : s = (0,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms
test_id = 23 : s = (1,0) -- throughput_measured = setpoint, stable; RTT close to 1 ms (for this amount of data the RTT drops and the setpoint is reached)
test_id = 24 : s = (1,1) -- throughput_measured = setpoint, stable; RTT close to 1 ms

-------------------------------------------------------------------------------------------------------------------------------------------------------

For the (0,1) and (1,1) cases (send_now always 1):
It doesn't matter whether Nagle is on or off; I always achieve the throughput setpoint (per Wireshark measurements) up to 49 Mbps. That is the maximum value we can reach, because the OS is message-passing based and we have limited the maximum message length in the OS, which limits the application's ability to enqueue data.

For the (0,0) cases (Nagle = 0 and send_now = 0):
If I write 4375 or more bytes per period (tests 13, 17 and 21), the RTT decreases and the throughput is achieved. I don't understand why things change at this point. I thought it could be related to delayed ACKs from the server, so I changed the advertised window size (on the server side) to smaller values and also set the socket's TCP_NODELAY parameter to 1, but the overall behavior is the same.

For the (1,0) cases (Nagle = 1 and send_now = 0):
The behavior is similar to (0,0): the RTT starts very high and improves as more data is sent, but the point where it finally drops back to around 1 ms is only at test_id = 23, when we are sending 6125 bytes per period. Obviously, the throughput is affected by the RTT.

Therefore, I would like to understand what causes the RTT to reach such high values when send_now is 0, and to drop dramatically for certain amounts of data. What else could it be, if not delayed ACKs, and why do we observe different behavior from test_id = 13 onwards (for Nagle = 0) but only from test_id = 23 (for Nagle = 1)?

--------
Attachments:
* lwipopts.h
* test_description_table summarizes the test configs.
* test_wireshark_tracefiles:
Wireshark trace files named testX_Na_SNb_Ty, where X is the test ID, "Na" is N0 (Nagle off) or N1 (Nagle on), "SNb" is SN0 (send_now = 0) or SN1 (send_now = 1), and "Ty" is the throughput setpoint y. The Toyota device is the client in the trace files.
--------


Sorry for the long email, but I think this set of tests with behavior analysis can be quite useful for future developers, since it is not easy to find complete throughput tests in the forums.
Thank you very much!

Kind regards,
Vitor


_______________________________________________
lwip-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/lwip-users

Re: [TCP raw API] Nagle + tcp_output interaction (behavior in 24 throughput tests)

goldsimon@gmx.de


vr roriz wrote:
>[..]
>Then, I added the send_now control option, letting tcp_output (with
>send_now = 0) to be called by lwip itself.

Ok, so the application *never* calls tcp_output() but you leave this completely to the stack? That might work somehow, but will lead to totally unpredictable performance, as you have measured.

Also, even if you don't call tcp_output(), that doesn't always ensure two segments get sent together. Trying to achieve something like that is just not what TCP is like. TCP is a streaming protocol, which is what people tend to not understand. You can try to work around its streaming nature and make it datagram like, but once you think you got it working like you want to, don't be surprised if it doesn't in the next version...

Simon


Re: [TCP raw API] Nagle + tcp_output interaction (behavior in 24 throughput tests)

vr roriz
>Ok, so the application *never* calls tcp_output() but you leave this completely to the stack? That might work somehow, but will lead to totally unpredictable performance, as you have measured.

That's my point: I thought it would be totally unpredictable. But after
a certain amount of data is periodically queued, the RTT starts to
go down again and the throughput is achieved. That is what I would
like to understand. Do you think it would be a good idea to call
tcp_output from some custom timer? (Since the poll callback has a very
coarse resolution.)

>Also, even if you don't call tcp_output(), that doesn't always ensure two segments get sent together. Trying to achieve something like that is just not what TCP is like. TCP is a streaming protocol, which is what people tend to not understand. You can try to work around its streaming nature and make it datagram like, but once you think you got it working like you want to, don't be surprised if it doesn't in the next version.

I agree, it does not guarantee it; I used that as an argument and they
are aware of it. However, according to them, they need this due to a bug
with TLS in an analysis tool they are using. It is only for debugging,
and once the software is deployed I doubt they will set send_now = 0.

Regards,
Vitor


Re: [TCP raw API] Nagle + tcp_output interaction (behavior in 24 throughput tests)

goldsimon@gmx.de


vr roriz wrote:
>That's my point: I thought it would be totally unpredictable. But after
>a certain amount of data is periodically queued, the RTT starts to
>go down again and the throughput is achieved. That is what I would
>like to understand.

I think tcp_output() is called every time an rx segment is processed for a pcb, even if it doesn't contain data but only an ACK. This is done to achieve throughput like you want; you just cannot rely on it regarding throughput. But this could well explain your measurements...

>Do you think it would be a good idea to call
>tcp_output from some custom timer? (Since the poll callback has a very
>coarse resolution.)

No. You'll just get threading problems when starting something like that...

Call tcp_output() when you want to send data. In other words, if the application calling your code controls send_now, make them set it to 1 for the last call.

Simon


Re: [TCP raw API] Nagle + tcp_output interaction (behavior in 24 throughput tests)

vr roriz
>No. You'll just get threading problems when starting something like that...
In my architecture, when I need an interrupt (e.g. for handling Rx),
the interrupt just triggers a handler process. The handler process and
all other processes in the driver have the same priority and do not
preempt each other. By doing this I believe I comply with the
"multi-thread issue" of the raw API, because any invoked lwIP function
will return before any other lwIP function starts executing.

