[task #7040] Work on tcp_enqueue

[task #7040] Work on tcp_enqueue

Simon Goldschmidt

URL:
  <http://savannah.nongnu.org/task/?7040>

                 Summary: Work on tcp_enqueue
                 Project: lwIP - A Lightweight TCP/IP stack
            Submitted by: goldsimon
            Submitted on: Tuesday 26.06.2007 at 19:56
                Category: None
         Should Start On: Tuesday 26.06.2007 at 00:00
   Should be Finished on: Tuesday 26.06.2007 at 00:00
                Priority: 1 - Later
                  Status: None
                 Privacy: Public
        Percent Complete: 0%
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________

Details:

If the last segment on pcb->unsent is smaller than pcb->mss, it could hold
additional data. As it stands, tcp_enqueue only fills it with additional data
if all of the data passed to it fits in that segment.

ToDo: first check space in last unsent segment, then create pbufs.
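
For illustration, a minimal sketch of that first check, assuming the
struct tcp_seg layout of this era (the helper name unsent_tail_space is
hypothetical, not part of lwIP):

  #include "lwip/tcp.h"

  /* Hypothetical helper: how many bytes still fit into the last unsent
   * segment before it reaches a full MSS. */
  static u16_t
  unsent_tail_space(const struct tcp_pcb *pcb)
  {
    const struct tcp_seg *seg = pcb->unsent;
    if (seg == NULL) {
      return 0;               /* nothing queued, nothing to top up */
    }
    while (seg->next != NULL) {
      seg = seg->next;        /* walk to the last unsent segment */
    }
    return (seg->len < pcb->mss) ? (u16_t)(pcb->mss - seg->len) : 0;
  }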





[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #1, task #7040 (project lwip):

OK, some more details: When sending much data very fast (so fast that
tcp_pcb->snd_buf becomes empty), an incomplete segment (smaller than pcb->mss)
can be queued (and in most cases will be).

When send buffer space is available again, the caller will call tcp_enqueue
again, which splits the data into pcb->mss-sized chunks and tries to enqueue
them. At this point, the last queued segment (which is smaller than mss) will
not be filled, which results in sending segments smaller than mss even though
there is enough data to be sent/enqueued.

To solve this, we can either
- in tcp_enqueue, check whether the last segment on pcb->unsent is < mss, or
- at application level, check the last segment and call tcp_write with a
smaller length so that the last segment will be filled.

I'd prefer the first solution because it also solves the problem for the raw
API and is faster.

Any comments? Maybe this is intended in order to be small?


[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #2, task #7040 (project lwip):

I'd suggest this is not worth worrying about.  The consequences are very
minor.




[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #3, task #7040 (project lwip):

Not? If you call tcp_write many times with small chunks instead of once with
a big chunk, you will get every chunk in a single segment. Now that's what I
call inefficient!

I'm not saying I'm using it like this, but that's not how I thought TCP
should behave...




[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #4, task #7040 (project lwip):

re comment #3:
What I've seen there might also come from an incorrect implementation of
tcp_output_nagle(): tcp_output should only be called if there is a full-sized
segment to be sent (or unacked == NULL). pcb->snd_queuelen is not sufficient
for this when calling tcp_write with many small chunks! I'll work on that.

Anyway, the current code already tries to combine an existing segment with
the newly created segments, but only succeeds if the first segment on 'queue'
fits in. I would see this as a bug (it both wastes space and is inefficient).


[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #5, task #7040 (project lwip):

Nagle should indeed be solving that problem, and I suggest we fix that rather
than the segmentation in the queue.


[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #6, task #7040 (project lwip):

OK, I'll wait until Nagle is fixed and then take a look at this again. I
would still love to fix this, but only if the fix isn't much bigger or slower
than the existing solution!


[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #7, task #7040 (project lwip):

Can we revisit this?  I don't think this has to do with Nagle, or at least
it also occurs without Nagle.  On large transfers (RAW_API) I see it fail to
send a full payload in about 1 in 40 packets.  True, this isn't a huge
problem, but I would expect that when sending e.g. 300kB, all but the last
packet would have a full payload.

Do we know why it sends a partial packet?  In my tcp_sent callback, if I have
more than MTU bytes to send, should I wait for tcp_sndbuf to have MTU bytes
free?  Is this what was meant by solving it at the application level?

Thanks,
Bill


Re: [task #7040] Work on tcp_enqueue

Jakob Stoklund Olesen-2
Bill Auerbach <[hidden email]> writes:

> Can we revisit this?  I don't think this has to do with Nagle. Or it also
> occurs without Nagle.  On large transfers (RAW_API) I see it not send a full
> payload about 1 in 40 packets.  True, this isn't a huge problem.  I would
> expect if sending e.g. 300kB that all but the last packet would have a full
> payload.

True, it is not Nagle doing this. Of course, setting SO_NDELAY causes
many small segments to be sent by design.

> Do we know why it sends a partial packet?  In my tcp_sent, if I have more
> than MTU bytes to send, should I wait for tcp_sndbuf to have MTU free?  Is
> this what was meant by solving it at the application level?

To solve it at the application level, you must call tcp_write() with
data lengths that are a multiple of the mss.

Here is what happens in tcp_enqueue():

if write-length > mss then
  break into mss-sized segments
end

if last-unsent-segment + first-write-segment <= mss then
  concat onto last unsent segment
else
  start new segment
end

This means that every tcp_write() with more than mss bytes will always
start a new segment.

For many small writes, this is OK: Small writes are concatenated as long
as they fit in the last unsent segment.

For large writes this is also OK: Each write creates a number of full
segments + one partial. This leads to a small overhead, nothing to worry
about.

For just-over-mss sized writes this is really bad: Repeated writes of
1461 bytes lead to alternating segment sizes of 1460, 1, 1460, 1, ...

A quick fix to tcp_enqueue() is relatively easy: When breaking the write
data into segments, the first segment should be made small enough to fit
at the end of the last unsent segment. The remaining segments should
still be mss-sized.
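
A sketch of that idea in C (hedged: this is the shape of the fix, not
the eventual patch; unsent_tail_space is the hypothetical helper
sketched under the task details above):

  /* Fill the tail of the unsent queue first, then cut the remainder
   * into mss-sized chunks exactly as the current code does. */
  u16_t space = unsent_tail_space(pcb);  /* room left in last unsent seg */
  u16_t first = LWIP_MIN(len, space);    /* bytes to concat onto the tail */
  u16_t rest  = (u16_t)(len - first);    /* bytes for new mss-sized segments */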

/stoklund




RE: Re: [task #7040] Work on tcp_enqueue

Bill Auerbach
>To solve it at the application level, you must call tcp_write() with
>data lengths that are a multiple of the mss.

Thank you!  Exactly right.  If I add the second if to my tcp_sent callback
it works perfectly.  All packets are 1460 bytes with one trailing packet
less than 1460.

  if( tcp_sndbuf(pcb) >= dataLen )   /* everything left fits in the send buffer */
    len = dataLen;
  else
    len = tcp_sndbuf(pcb);           /* otherwise queue as much as fits */
  if( len > pcb->mss )
    len -= len % pcb->mss;           /* round down to a multiple of MSS */
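
For context, here is roughly how that snippet sits in a complete sent
callback (a sketch only; struct app_state and its fields are
hypothetical application bookkeeping, not lwIP API):

  #include "lwip/tcp.h"

  struct app_state {      /* hypothetical application bookkeeping */
    const u8_t *data;     /* next byte to queue */
    u32_t left;           /* bytes not yet passed to tcp_write */
  };

  static err_t
  app_sent(void *arg, struct tcp_pcb *pcb, u16_t acked)
  {
    struct app_state *s = (struct app_state *)arg;
    u16_t len;

    LWIP_UNUSED_ARG(acked);
    if (s->left == 0) {
      return ERR_OK;                    /* everything queued already */
    }
    len = (u16_t)LWIP_MIN(s->left, tcp_sndbuf(pcb));
    if (len > pcb->mss) {
      len -= len % pcb->mss;            /* keep segments full-sized */
    }
    if (tcp_write(pcb, s->data, len, TCP_WRITE_FLAG_COPY) == ERR_OK) {
      s->data += len;
      s->left -= len;
    }
    return ERR_OK;
  }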

>
>Here is what happens in tcp_enqueue():
>
>if write-length > mss then
>  break into mss-sized segments
>end
>
>if last-unsent-segment + first-write-segment <= mss
>  concat onto last unsent segment
>else
>  start new segment
>end
>
>This means that every tcp_write() with more than mss bytes will always
>start a new segment.

Ok.  This makes sense.  tcp_enqueue couldn't hang on to the last partial
piece unless tcp_write had a flag to signal a flush or "end of contiguous
data".  Someone like me, using the example code above to call tcp_write in
the tcp_sent callback, knows when the end of the data has been reached (the
sent callback has to determine that anyway), so implementing this flag in the
application would be trivial.

Maybe the second if above should be included in the tcp_sent examples, with a
comment: // this test ensures full payloads will be transmitted.  Wouldn't
the netconn API benefit from this addition?

>For many small writes, this is OK: Small writes are concatenated as
>long as they fit in the last unsent segment.
>
>For large writes this is also OK: Each write creates a number of full
>segments + one partial. This leads to a small overhead, nothing to worry
>about.
>
>For just-over-mss sized writes this is really bad: Repeated writes of
>1461 bytes leads to alternating segments 1460 , 1, 1460, 1, ...
>
>A quick fix to tcp_enqueue() is relatively easy: When breaking the write
>data into segments, the first segment should be made small enough to fit
>at the end of the last unsent segment. The remaining segments should
>still be mss-sized.

I don't know the code well enough to know where you mean - I'll look at it,
since I'm on the trail and it's trivial to test the change.  If we have a
partial mss worth of data at the end and we leave it queued, will it get
sent if there is no following tcp_write?

Thank you!
Bill




Re: [task #7040] Work on tcp_enqueue

Jakob Stoklund Olesen-2
"bill" <[hidden email]> writes:

> Thank you!  Exactly right.  If I add the second if to my tcp_sent callback
> it works perfectly.  All packets are 1460 bytes with one trailing packet
> less than 1460.
>
>   if( tcp_sndbuf(pcb) >= dataLen )
>     len = dataLen;
>   else
>     len = tcp_sndbuf(pcb);
>   if( len > pcb->mss )
>     len -= len % pcb->mss;

If this works for you, it is an easy solution.

[...]

>>This means that every tcp_write() with more than mss bytes will always
>>start a new segment.
>
> Ok.  This makes sense.  Tcp_enqueue couldn't hang on to the last partial
> piece unless tcp_write had a flag to signal a flush or "end of contiguous
> data".  Someone like me using the example code above to call tcp_write in
> the tcp_sent callback knows when the end of the data has been reached
> because the sent callback has to determine that so implementing this flag in
> the application is trivial.

That is not quite how it works. tcp_enqueue() doesn't actually transmit
any data right away. You should call tcp_output_nagle() after writing
continuous data. If you don't, nothing is transmitted until an ACK is
received or some timer goes off. This can make a big difference in
throughput.
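
As a hedged illustration (hdr/body and their lengths are hypothetical
buffers; error handling omitted):

  /* Queue a burst of writes, then push them explicitly. Without the
   * final call, the data waits for an ACK or a timer. */
  tcp_write(pcb, hdr,  hdr_len,  TCP_WRITE_FLAG_COPY);
  tcp_write(pcb, body, body_len, TCP_WRITE_FLAG_COPY);
  tcp_output_nagle(pcb);   /* Nagle-aware wrapper around tcp_output() */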

I am not sure if it is legal to call tcp_output() from inside the
tcp_sent callback. It is probably harmless, but it will be called anyway
when your callback function returns.

> Maybe the second if above should be included in the tcp_sent examples with a
> comment // this test ensures full payloads will be transmitted.  Wouldn't
> the netconn API benefit from this addition?

I would prefer to fix tcp_enqueue() instead. The raw API is low-level,
but it doesn't have to be that low-level.

[...fix tcp_enqueue()...]

> I don't know the code that well to know where you mean - I'll look at it
> since I'm on the trail and it's trivial to test the change.  If we have a
> partial mss-worth of data at the end, if we leave it queued, will it get
> sent if there is no following tcp_write?

Yes, it will be sent the next time tcp_output_nagle() is called (or when
an ACK is received).

/stoklund




Re: Re: [task #7040] Work on tcp_enqueue

goldsimon@gmx.de

> I am not sure if it is legal to call tcp_output() from inside the
> tcp_sent callback. It is probably harmless, but will be called anyway
> when your callback function returns.

As far as I remember, tcp_output just returns if it is called while input is being processed for the given pcb, so yes, it's harmless. ;-)


> I would prefer to fix tcp_enqueue() instead. The raw API is low-level,
> but it doesn't have to be that low-level.

That's exactly the point of task #7040 (which, by the way, would be a better place for this discussion than lwip-devel).

I don't know when I will find the time to do that, but a ready patch is of course always welcome.

Simon

Re: Re: [task #7040] Work on tcp_enqueue

Kieran Mansley
In reply to this post by Jakob Stoklund Olesen-2
On Thu, 2009-01-29 at 06:15 +0100, Jakob Stoklund Olesen wrote:

> A quick fix to tcp_enqueue() is relatively easy: When breaking the write
> data into segments, the first segment should be made small enough to fit
> at the end of the last unsent segment. The remaining segments should
> still be mss-sized.

A patch for that would be very happily accepted.

Kieran




[task #7040] Work on tcp_enqueue

Simon Goldschmidt
In reply to this post by Simon Goldschmidt

Follow-up Comment #8, task #7040 (project lwip):

First of all: I am sorry for posting comments directly to the lwip-devel
list. For any new listeners out there, there is some discussion about this
task in the January 2009 archives of lwip-devel.

I have attached a patch for tcp_enqueue that fixes the issue discussed here.
It was not as simple as I had hoped. The tcp_enqueue function is quite scary.
I have tested this patch, but please try to test it on different systems.
There are many things that can go wrong.

The code to extend the last unsent segment is separate from the code that
creates new segments. This has a couple of advantages:
1. pbufs are allocated with PBUF_RAW, so no memory is wasted on a header
that is never prepended.
2. The old code would allocate a segment only to free it again immediately.
In the no-copy case, we can also avoid allocating an extra pbuf for the
header.

The downside is that the pbuf allocation code is duplicated. Twice as many
bugs!

I think it would be a good idea to split tcp_enqueue into two separate
functions:

tcp_enqueue_data to be called from tcp_write and
tcp_enqueue_options for everybody else.

The logic trying to handle data or options has become rather convoluted.
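
For concreteness, the split might look like this (hypothetical names
and signatures, loosely mirroring tcp_enqueue's current parameters):

  /* Proposed split, not existing API: */
  err_t tcp_enqueue_data(struct tcp_pcb *pcb, const void *data,
                         u16_t len, u8_t apiflags);    /* from tcp_write */
  err_t tcp_enqueue_options(struct tcp_pcb *pcb, u8_t tcp_flags,
                            u8_t optflags);            /* SYN, FIN, ... */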

I think the user should be able to call tcp_write with small data chunks
without worrying too much about performance. My own user code does that.
Currently small writes cause very long pbuf chains because each write creates
at least one pbuf. With no-copy tcp_write this is inevitable. With copying
tcp_write it is really not necessary.

tcp_enqueue_data could allocate a pbuf with room for a full segment (or some
smaller chunk size). Following tcp_writes could append data to the same pbuf.
This would use more memory, so it must be considered carefully.

This is perhaps related to the zero-copy driver discussion on lwip-devel?
Maybe the driver could have a say in how outgoing pbufs are allocated?

My own Ethernet driver cannot directly DMA the long pbuf chains created by
small writes. I have to copy them into contiguous memory first. This means
that data is copied both by tcp_enqueue AND the driver.


(file #17364)
    _______________________________________________________

Additional Item Attachment:

File name: tcp-enqueue-concat             Size: 7 KB



[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #9, task #7040 (project lwip):

I can confirm that this works with the same test I was using when I first
noticed the problem.

I don't know the impact on code size, but the impact on performance
(throughput) when sending large amounts of data is about a 1% loss.  It might
help if the last packet in the queue were saved in the pcb, to eliminate the
search to the end that is currently required. Maybe for long chains this is
where the time goes?  Or have tcp_write return the number of bytes queued and
let it simply queue 'len - len % mss' bytes and return this amount.  The
problem is that this breaks code that calls tcp_write.
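
A sketch of the tail-pointer idea (hedged: the cached field is
hypothetical; the stock code walks the queue):

  /* Today the tail of the unsent queue is found by walking the list
   * on every call (O(queue length)): */
  struct tcp_seg *useg = pcb->unsent;
  while (useg != NULL && useg->next != NULL) {
    useg = useg->next;
  }
  /* The suggestion: keep a hypothetical pcb->unsent_tail, updated
   * whenever a segment is queued or sent, so this walk disappears. */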

I know there was a lot of effort put into this and I appreciate that.  From
my position, I would like to use it, but I need a performance increase, not a
decrease.


On PBUF was: Work on tcp_enqueue

Alain Mouette
In reply to this post by Simon Goldschmidt
I am hijacking this thread for a subject that has come up in it:

PBUF

I would like to suggest a place to list all the different requirements
regarding pbufs, maybe a special page in the wiki (if many developers
can have write access).

There are lots of different aspects, just to name a few:
- zero copy
- copy optimized: on 32-bit archs, and especially ARM and Cortex-M3, there
are a lot of tricks to speed up copying by about 5 times! Things like word
alignment, but also copying multiples of 4/8 words at a time for fast copy,
etc.
- DMA: this is very useful and varies wildly across archs
- internal copying and pbuf lists, as stated in this message about tcp_enqueue
- etc.

Maybe if we manage to put all this together, a future version or
patch could be more flexible!

first, list what is needed
second, plan
third, do it (if someone volunteers, of course :)

Alain


Jakob Stoklund Olesen wrote:

> Follow-up Comment #8, task #7040 (project lwip):
>
> First of all: I am sorry for posting comments directly to the lwip-devel
> list. For any new listeners out there, there is some discussion about this
> task in the January 2009 archives of lwip-devel.
>
> I have attached a patch for tcp_enqueue that fixes the issue discussed here.
> It was not as simple as I had hoped. The tcp_enqueue function is quite scary.
> I have tested this patch, but please try to test it on different systems.
> There are many things that can go wrong.




[task #7040] Work on tcp_enqueue

Simon Goldschmidt
In reply to this post by Simon Goldschmidt

Follow-up Comment #10, task #7040 (project lwip):

Bill, thanks for testing the patch!

What is the limiting factor for your performance?
CPU? Network bandwidth? Bandwidth-delay product? DMA
bandwidth/fragmentation?
Do you have retransmissions?

Sending full-sized TCP segments only helps you if you are saturating the
network link. It minimizes the protocol header overhead. If you are
CPU-limited, it could well be more work. The pbuf chains will be slightly
longer on average.

Your 1% number is rather small. Did you measure the variance as well?




[task #7040] Work on tcp_enqueue

Simon Goldschmidt

Follow-up Comment #11, task #7040 (project lwip):

Re comment #8: I think that this is leaning even further towards the argument
of shaking up the raw TCP API by allowing data to be written as pbufs rather
than char * + lengths.

Because of the need to allocate pbufs in chunks of the segment size, there
could perhaps be a separate function to allocate a pbuf chain of the
appropriate dimensions. Thinking out loud, that function could then maybe
also be used in future to allocate pbufs complying with zero-copy
requirements of alignment etc., or at least call such a function. So for
example:

tcp_pbuf_alloc(tcp_pcb, size, pbuf_type)
(where pbuf_type is PBUF_RAM, PBUF_ROM etc.)

and in turn that calls pbuf_alloc_zerocopy(pbuflayer, len, pbuftype, netif)

The netif would be required in order to identify driver/hardware requirements
for alignment, address, etc.
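
Collected as declarations (a hedged sketch: these two functions do not
exist yet, while pbuf_layer, pbuf_type and struct netif are existing
lwIP types):

  #include "lwip/pbuf.h"
  #include "lwip/netif.h"
  #include "lwip/tcp.h"

  /* Proposed allocation entry points, not existing API: */
  struct pbuf *tcp_pbuf_alloc(struct tcp_pcb *pcb, u16_t size,
                              pbuf_type type);
  struct pbuf *pbuf_alloc_zerocopy(pbuf_layer layer, u16_t len,
                                   pbuf_type type, struct netif *netif);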

Should something like this even be a prerequisite before doing a full
zero-copy implementation, essentially by being its first phase? I think it
might be.

There would be API breakage doing this of course. Maybe we could minimise
this by having a wrapper used for backwards compatibility.

Further discussion about this should perhaps move back to lwip-devel though,
as it's out of place in this bug.

Jifl



Re: On PBUF was: Work on tcp_enqueue

Piero 74
In reply to this post by Alain Mouette
hi

I agree with you.
We have to find a place and a way to work together, producing code and docs.

There is a lot of work to do...
- define a separate layer for pbuf management, which can cover, through
customization, different DMA architectures
- break up the TCP write code in lwIP, to allow this pbuf layer to do
zero-copy sending
- guidelines for writing an ethernetif whose RX path is zero-copy
- we talked here about socket2, built without netconn. BSD sockets
work by copying, but we can offer a way to break this rule: a zero-copy
option, and a user-level API in the new pbuf layer that works together
with socket2 and hides the pbuf implementation and DMA management from
the user. So a well-done lwIP customization can offer, on a given
hardware architecture, high performance, low RAM usage, and even the
power of the socket API!

In my opinion, socket2 should have 3 layers:
- the internal structure (flags, counters, ...)
- code for socket feature management in the user-level API (an extended
version of BSD sockets)
- low-level functions, which work on the internal structure and are
called through function pointers

In this way a developer could have a ready-to-use socket2 on top of the
network driver, but could also add other socket2 families using different
low-level functions (for example over serial streaming, or over OS data
streaming...) which share the same code for socket management and the user
API... think of a "select" function over different kinds of sockets... In an
embedded device, cooperation between the OS, networking, and other hardware
streams can become necessary, and having only one interface above them all -
socket2 - could be a killer idea!

I know... I'm talking about lwIP 2.0.0, right :o)?  Does someone share the
same dream, or am I the only crazy developer?


2009/1/30, Alain M. <[hidden email]>:

> [...]



Re: [task #7040] Work on tcp_enqueue

Jakob Stoklund Olesen-2
In reply to this post by Simon Goldschmidt
OK, back on lwip-devel...

Jonathan Larmour <[hidden email]> writes:
> Follow-up Comment #11, task #7040 (project lwip):
>
> Re comment #8: I think that this is leaning even further towards the argument
> of shaking up the raw TCP API by allowing data to be written as pbufs rather
> than char * + lengths.

I don't think I understand why this would be a good idea. Could you
elaborate, please?

For UDP I get it: UDP protocols are often very simple. You allocate a
pbuf, cast p->payload to a struct pointer, fill out the fields and send
it. Very easy.

TCP is more difficult: You want to send a 1500-byte message. You call
tcp_pbuf_alloc and get back a pbuf chain with 17+1460+23 bytes because
that layout gives the best TCP segmentation. Now you have to deal with
next pointers, len/tot_len, and alignment issues. You give up, malloc
1500 bytes, format your message, and copy it into the pbuf chain.

Annoying example, I know, but not unthinkable.

I went back and read some of the old discussions about this issue. In
particular task #6735. I think the issue of TCP segmentation and
application code behaviour was not discussed.

I think we need to consider how lwIP is used, looking at the full
software stack.

In a small system with very limited resources, you would write
specialized application code directly on top of the raw API:

  App code -> Raw API

A larger system with more complex software would provide some form of
abstraction:

  App code -> Adaptation layer -> Raw API

The adaptation layer could be netconn for threads, sockets for legacy
code, or something different. (I use C++ based asynchronous message
passing).

In the first scenario you want the raw API to be relatively easy to
use. There should not be too many special cases, or you will get obscure
bugs. In the second scenario, the app code should be oblivious to issues
like segmentation and scatter-gather DMA. The adaptation layer should be
able to do a decent job with different traffic patterns (within reason).

Typical traffic patterns include:

Small writes: Syslog over TCP. Each write is 20-120 bytes, no alignment,
multiple writes must be combined into one segment for proper throughput.

Medium-sized writes: CORBA IIOP, iSCSI. It would be reasonable to send
each write in its own segment most of the time. Application might even
set SO_NDELAY to encourage that.

Large writes: FTP, HTTP. Throughput is important. Large chunks of
contiguous data are available, so full-sized zero-copy segments should be
possible.

In the case of small writes we probably cannot expect zero-copy
transmission, but single-copy would not be unreasonable. In my system,
small writes are copied twice: Once in tcp_enqueue, and once in the
driver because it cannot handle the long irregular pbuf chains.

Medium-sized writes should be zero-copy if the app code delivers aligned
data. Of course this depends on the particulars of the driver.

Large writes should always be zero-copy if at all possible.

When the driver requires data in a special region of memory, we should
aim for single-copy transmission. Zero-copy would require bad layering
violations.

It would be possible to make a 100% zero-copy API, but in reality you
would just be moving the copying into the app code.

My point is this: I would prefer a clean API that allows single-copy over
a complicated one that supports zero-copy in every case. When we copy
data, we should make the most of it: make sure that it only happens
once, and calculate the checksum while we are at it.

/stoklund



