From owner-tcp-impl@lerc.nasa.gov  Mon Feb  1 22:42:49 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA08847
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Feb 1999 22:42:48 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA16154; Mon, 1 Feb 1999 21:10:45 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA14916; Mon, 1 Feb 1999 21:06:51 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 2 Feb 1999 02:30:01 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m107VET-001xh6C; Mon, 1 Feb 1999 18:06:29 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id SAA01109; Mon, 1 Feb 1999 18:07:12 -0800 (PST)
Message-Id: <199902020207.SAA01109@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@lerc.nasa.gov
Subject: internet draft on suggested mod to the Nagle algorithm
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 01 Feb 1999 18:07:11 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Tcp-impl people,

Just in case you didn't see the following, i'm forwarding the announcement 
along.  I'd love to hear any thoughts people might have, both on the concept 
and on how to implement it.  (I need to do at least one modification to add in 
how Sam Manthorpe/SGI has addressed this issue.)

Thanks,  Greg Minshall
------- Forwarded Message

Message-Id: <199902011653.LAA23867@ietf.org>
Mime-Version: 1.0
Content-Type: Multipart/Mixed; Boundary="NextPart"
To: IETF-Announce: ;
From: Internet-Drafts@ietf.org
Reply-to: Internet-Drafts@ietf.org
Subject: I-D ACTION:draft-minshall-nagle-00.txt
Date: Mon, 01 Feb 1999 11:53:39 -0500
Sender: cclark@ns.cnri.reston.va.us
X-UIDL: 2d7829f9365bbfc5454f7c0d6a61f975

- --NextPart

A New Internet-Draft is available from the on-line Internet-Drafts directories.

	Title		: A Suggested Modification to Nagle's Algorithm
	Author(s)	: G. Minshall
	Filename	: draft-minshall-nagle-00.txt
	Pages		: 4
	Date		: 29-Jan-99
	
   The Nagle algorithm is one of the primary mechanisms which protects
   the internet from poorly designed and/or implemented applications.
   However, for a certain class of applications (notably,
   request-response protocols) the Nagle algorithm interacts poorly
   with delayed acknowledgements to give these applications poorer
   performance.

   This draft is NOT suggesting that these applications should disable
   the Nagle algorithm.

   This draft suggests a fairly small and simple modification to the
   Nagle algorithm to preserve Nagle as a means of protecting the
   internet while at the same time giving better performance to a
   wider class of applications.

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-minshall-nagle-00.txt

Internet-Drafts are also available by anonymous FTP. Login with the username
"anonymous" and a password of your e-mail address. After logging in,
type "cd internet-drafts" and then
	"get draft-minshall-nagle-00.txt".

A list of Internet-Drafts directories can be found in
http://www.ietf.org/shadow.html 
or ftp://ftp.ietf.org/ietf/1shadow-sites.txt


Internet-Drafts can also be obtained by e-mail.

Send a message to:
	mailserv@ietf.org.
In the body type:
	"FILE /internet-drafts/draft-minshall-nagle-00.txt".
	
NOTE:	The mail server at ietf.org can return the document in
	MIME-encoded form by using the "mpack" utility.  To use this
	feature, insert the command "ENCODING mime" before the "FILE"
	command.  To decode the response(s), you will need "munpack" or
	a MIME-compliant mail reader.  Different MIME-compliant mail readers
	exhibit different behavior, especially when dealing with
	"multipart" MIME messages (i.e. documents which have been split
	up into multiple messages), so check your local documentation on
	how to manipulate these messages.
		
		
Below is the data which will enable a MIME compliant mail reader
implementation to automatically retrieve the ASCII version of the
Internet-Draft.

- --NextPart
Content-Type: Multipart/Alternative; Boundary="OtherAccess"

- --OtherAccess
Content-Type: Message/External-body;
	access-type="mail-server";
	server="mailserv@ietf.org"

Content-Type: text/plain
Content-ID:	<19990129145320.I-D@ietf.org>

ENCODING mime
FILE /internet-drafts/draft-minshall-nagle-00.txt

- --OtherAccess
Content-Type: Message/External-body;
	name="draft-minshall-nagle-00.txt";
	site="ftp.ietf.org";
	access-type="anon-ftp";
	directory="internet-drafts"

Content-Type: text/plain
Content-ID:	<19990129145320.I-D@ietf.org>

- --OtherAccess--

- --NextPart--


------- End of Forwarded Message


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 17:49:04 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA10415
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 17:49:03 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA09503; Wed, 3 Feb 1999 16:07:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA07259; Wed, 3 Feb 1999 16:03:57 -0500 (EST)
Received: from ehsco.com ([209.31.7.45]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 415;
          Wed, 3 Feb 1999 13:03:43 -0800
Message-ID: <36B8B9A5.4BBBEA81@ehsco.com>
Date: Wed, 03 Feb 1999 13:03:33 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Greg Minshall <minshall@siara.com>
CC: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902020207.SAA01109@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> Just in case you didn't see the following, i'm forwarding the
> announcement along. I'd love to hear any thoughts people might have,
> both on the concept and on how to implement it.

I'll bite.

Q: If Nagle is a problem for my application, I can just disable it
altogether. What benefit is there in changing Nagle so that it affects
ALL transactions instead of just the ones I'm troubled by?

IE, by changing Nagle, you're going to be changing everything, and not
just those applications that are request/response-intensive. Can you
prove that your theory is always better with every application?
Otherwise I'd say just fix the apps that are causing you problems.

I'm interested in hearing your answer.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 17:49:28 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA10437
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 17:49:27 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA15771; Wed, 3 Feb 1999 16:17:31 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA13776; Wed, 3 Feb 1999 16:14:10 -0500 (EST)
Received: from ehsco.com ([209.31.7.45]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 456;
          Wed, 3 Feb 1999 13:13:33 -0800
Message-ID: <36B8BBF3.EAFAD2C@ehsco.com>
Date: Wed, 03 Feb 1999 13:13:24 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Greg Minshall <minshall@siara.com>, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902020207.SAA01109@red.mtv.siara.com> <36B8B9A5.4BBBEA81@ehsco.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> Q: If Nagle is a problem for my application, I can just disable it
> altogether.

One related question: Does everybody support applications disabling
Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 18:50:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA11777
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 18:50:29 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA01789; Wed, 3 Feb 1999 17:37:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sgi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA29099; Wed, 3 Feb 1999 17:32:48 -0500 (EST)
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id OAA04269; Wed, 3 Feb 1999 14:32:45 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: from bossette.engr.sgi.com (bossette.engr.sgi.com [150.166.61.12])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id OAA39128;
	Wed, 3 Feb 1999 14:32:43 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: (from sm@localhost) by bossette.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) id OAA33575; Wed, 3 Feb 1999 14:32:43 -0800 (PST)
From: sm@bossette.engr.sgi.com (Sam Manthorpe)
Message-Id: <199902032232.OAA33575@bossette.engr.sgi.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: ehall@ehsco.com (Eric A. Hall)
Date: Wed, 3 Feb 1999 14:32:43 -0800 (PST)
Cc: minshall@siara.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <36B8BBF3.EAFAD2C@ehsco.com> from "Eric A. Hall" at Feb 3, 99 01:13:24 pm
X-Mailer: ELM [version 2.4 PL25 PGP3 *ALPHA*]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Eric A. Hall wrote:
> 
> 
> > Q: If Nagle is a problem for my application, I can just disable it
> > altogether.
> 
> One related question: Does everybody support applications disabling
> Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?

IRIX does.

-- Sam

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sam Manthorpe, SGI.  tel: (650)933-2856 fax: (650)932-2856  sm@engr.sgi.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 18:51:48 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA11816
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 18:51:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA29078; Wed, 3 Feb 1999 17:32:31 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sgi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA28449; Wed, 3 Feb 1999 17:31:12 -0500 (EST)
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id OAA05597; Wed, 3 Feb 1999 14:31:09 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: from bossette.engr.sgi.com (bossette.engr.sgi.com [150.166.61.12])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id OAA62548;
	Wed, 3 Feb 1999 14:31:08 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: (from sm@localhost) by bossette.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) id OAA33831; Wed, 3 Feb 1999 14:30:58 -0800 (PST)
From: sm@bossette.engr.sgi.com (Sam Manthorpe)
Message-Id: <199902032230.OAA33831@bossette.engr.sgi.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: ehall@ehsco.com (Eric A. Hall)
Date: Wed, 3 Feb 1999 14:30:57 -0800 (PST)
Cc: minshall@siara.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <36B8B9A5.4BBBEA81@ehsco.com> from "Eric A. Hall" at Feb 3, 99 01:03:33 pm
X-Mailer: ELM [version 2.4 PL25 PGP3 *ALPHA*]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Hi Eric,

Eric A. Hall wrote:
> I'll bite.
> 
> Q: If Nagle is a problem for my application, I can just disable it
> altogether. What benefit is there in changing Nagle so that it affects
> ALL transactions instead of just the ones I'm troubled by?

Because you just have to make one change to the kernel code instead
of many changes to all applications which do bulk data transfer.  Moreover, 
in some cases, you will have to introduce some special intelligence
in the application so that it can decide _whether_ it is doing bulk
transfer or not.  Why not just put this extra intelligence in the
kernel?

> IE, by changing Nagle, you're going to be changing everything, and not
> just those applications that are request/response-intensive. 

The request/response-intensive applications should notice no difference,
except where the responses consist of significant amounts of data.  I
can't think of any cases where fixing this in the kernel would cause
problems with other applications or for network utilisation; can anyone
else?

-- Sam

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sam Manthorpe, SGI.  tel: (650)933-2856 fax: (650)932-2856  sm@engr.sgi.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 18:56:00 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA11927
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 18:55:59 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA10113; Wed, 3 Feb 1999 17:52:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA07937; Wed, 3 Feb 1999 17:48:37 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id OAA05193;
	Wed, 3 Feb 1999 14:48:35 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id OAA02339;
	Wed, 3 Feb 1999 14:48:35 -0800 (PST)
Date: Wed, 3 Feb 1999 14:48:35 -0800 (PST)
Message-Id: <199902032248.OAA02339@rum.isi.edu>
To: tcp-impl@lerc.nasa.gov, minshall@siara.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> To: tcp-impl@lerc.nasa.gov
> Subject: internet draft on suggested mod to the Nagle algorithm
> Date: Mon, 01 Feb 1999 18:07:11 -0800
> From: Greg Minshall <minshall@siara.com>
> 
> 	Title		: A Suggested Modification to Nagle's Algorithm
> 	Author(s)	: G. Minshall
> 	Filename	: draft-minshall-nagle-00.txt
> 	Pages		: 4
> 	Date		: 29-Jan-99
> 	
>    The Nagle algorithm is one of the primary mechanisms which protects
>    the internet from poorly designed and/or implemented applications.
>    However, for a certain class of applications (notably,
>    request-response protocols) the Nagle algorithm interacts poorly
>    with delayed acknowledgements to give these applications poorer
>    performance.
> 
>    This draft is NOT suggesting that these applications should disable
>    the Nagle algorithm.

Why not?

Nagle was a solution to char-at-a-time remote logins, and
is discouraged for transactional systems, even ones with
bursts as small as a few characters, e.g., X11.

See:

	John Heidemann. Performance Interactions Between P-HTTP 
	and TCP Implementations. ACM Computer Communication Review, 
	27 2, 65-73, April, 1997.

	http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html


Joe


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 19:03:19 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA12083
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 19:03:19 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA15675; Wed, 3 Feb 1999 18:02:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA15182; Wed, 3 Feb 1999 18:02:00 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id XAA21343; Wed, 3 Feb 1999 23:01:52 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m108CAn-0007U1C; Wed, 3 Feb 99 23:57 GMT
Message-Id: <m108CAn-0007U1C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: sm@bossette.engr.sgi.com (Sam Manthorpe)
Date: Wed, 3 Feb 1999 23:57:32 +0000 (GMT)
Cc: ehall@ehsco.com, minshall@siara.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <199902032232.OAA33575@bossette.engr.sgi.com> from "Sam Manthorpe" at Feb 3, 99 02:32:43 pm
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > One related question: Does everybody support applications disabling
> > Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?
> 
> IRIX does.

Linux does


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 19:32:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA12465
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 19:32:46 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA28708; Wed, 3 Feb 1999 18:27:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA26249; Wed, 3 Feb 1999 18:22:45 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id RAA01161;
	Wed, 3 Feb 1999 17:22:41 -0600 (CST)
Date: Wed, 3 Feb 1999 17:22:41 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902032322.RAA01161@frantic.bsdi.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Cc: ehall@ehsco.com
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Eric,

> One related question: Does everybody support applications disabling
> Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?

Most BSD systems support this via:

	int on = 1;
	setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
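
A slightly fuller sketch of the same call, with the return value checked
and the option read back through getsockopt(). The two function names are
made up for illustration, and the read-back only shows what the stack
reports, not how it behaves on the wire.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm on TCP socket s.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int disable_nagle(int s)
{
    int on = 1;

    assert(s >= 0);
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Report whether TCP_NODELAY is set on s:
 * 1 if set, 0 if clear, -1 if getsockopt fails. */
int nagle_is_disabled(int s)
{
    int value = 0;
    socklen_t len = sizeof(value);

    assert(s >= 0);
    if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

The option is per socket, so it has to be set on each connection that
needs it.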

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 19:43:11 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA12574
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 19:43:10 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA01526; Wed, 3 Feb 1999 18:32:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA29740; Wed, 3 Feb 1999 18:30:02 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id PAA17971
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Feb 1999 15:30:03 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id PAA14194 for <tcp-impl@lerc.nasa.gov>; Wed, 3 Feb 1999 15:29:57 -0800 (PST)
Message-ID: <36B8DBF4.FBCB23FB@cup.hp.com>
Date: Wed, 03 Feb 1999 15:29:56 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902032230.OAA33831@bossette.engr.sgi.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Because you just have to make one change to the kernel code instead
> of many changes to all applications which do bulk data transfer.  Moreover,

Back up a step here - I thought that the problem being fixed here was
for request/response applications. From the original draft text:

begin quote:

The interaction of delayed ACKs and Nagle

   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet.  As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.  On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

   When the sending TCP receives the delayed ACK, it can then transmit
   its second packet.

   In a request-response protocol, this second packet will complete
   either a request or a response, which then enables a succeeding
   response or request.

   Note two (related) bad results of the interaction of delayed ACKs
   and the Nagle algorithm in this case: the request-response time may
   be increased by up to 400ms (if both the request and the response
   are delayed); and, the number of transactions per second is
   substantially reduced.

end quote.

Perhaps my definition of bulk data transfer application is different,
but I take that to mean a very large (>>> MSS) quantity of data
transferred unidirectionally (e.g. FTP). Given that, I'm not sure there
is any problem with such an application and the current Nagle algorithm,
is there? Enough data accumulates from the small sends to make full-MSS
segments and away they go.

Now, in separate email I was discussing that first quoted paragraph with
Greg. I'll try to paraphrase it here :) First, the paragraph again:


   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet.  As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.  On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

My understanding of "the way things are supposed to work" :) is that if
that one-and-a-fraction MSS-worth of data was presented to the TCP in a
single "send" that all of it should go out without delay. And yes, that
could mean a series of 1460,1,1460,1 segments traversing the network. I
demonstrated that being the case with a quick netperf TCP_RR test with
"-r 1461,1" - if the paragraph's description was correct, I should have
only gotten ~5 transactions per second (200 ms delayed ACK), but I got
several hundred.
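
The "~5 transactions per second" figure is simply the reciprocal of the
stall. A minimal sketch of that arithmetic, assuming transactions are
strictly serialized and that each one waits out exactly one delayed-ACK
timer (round-trip and processing time are ignored; the function name is
illustrative only):

```c
#include <assert.h>

/* Upper bound on serialized transactions per second when every
 * transaction stalls for stall_ms milliseconds on a delayed ACK. */
double max_transactions_per_sec(double stall_ms)
{
    assert(stall_ms > 0.0);
    return 1000.0 / stall_ms;
}
```

With a 200 ms stall this gives 5 per second; with the draft's worst case
of 400 ms (request and response both delayed) it gives 2.5.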

My further understanding (going further in the quoted text above) was
that if the application presented the data to the TCP in two (or more)
"sends" and that the remote could only reply when all the data arrived,
that the application was essentially broken - it was presenting
logically associated data to the TCP in separate "send" calls when it
should have been presented in a single send (perhaps a gathering send).
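
For what a "gathering send" might look like in practice, here is a
sketch using writev(), assuming a request built from a separate header
and body buffer. The function name and the two-buffer layout are
hypothetical; the point is only that TCP is handed the whole request in
one call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hand a header and a body to the kernel in a single writev() call
 * instead of two separate sends.
 * Returns the byte count from writev() (which may be short) or -1. */
ssize_t send_gathered(int fd, const void *hdr, size_t hdr_len,
                      const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;   /* iov_base is not const-qualified */
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);
}
```

A caller would still need to loop on short writes, which is left out
here to keep the sketch small.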

Depending on how that application structured its sends, I'm not sure the
proposed change: 

   The proposed Nagle algorithm modifies this as follows:

        If a TCP has less than a full-sized packet to transmit,
        and if any previous less than full-sized packet has not
        yet been acknowledged, do not transmit a packet.

would actually "fix" things. For instance, a (broken IMO) application
that sent a 2048 byte "request" as a pair of 1024 byte "sends" would
still encounter the delayed ACK. So, I am guessing, would an application
dribbling data out through stdio (say a web plug-in or something).
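
To make the two cases concrete, here is a rough sketch of the send
decision under each rule. It models the classic rule as the draft
describes it, which is not necessarily what deployed stacks do, and the
function and parameter names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Classic rule, as the draft describes it: hold a less-than-full
 * segment while any previously sent data is still unacknowledged. */
bool classic_nagle_may_send(size_t seg_len, size_t mss, size_t unacked_bytes)
{
    assert(mss > 0);
    return seg_len >= mss || unacked_bytes == 0;
}

/* Proposed rule: hold a less-than-full segment only while an earlier
 * less-than-full segment is still unacknowledged. */
bool proposed_nagle_may_send(size_t seg_len, size_t mss, bool small_seg_unacked)
{
    assert(mss > 0);
    return seg_len >= mss || !small_seg_unacked;
}
```

With an MSS of 1460, the trailing 1 byte of a 1461-byte write is held by
the classic rule (1460 bytes are unacknowledged) but released by the
proposed rule (no small segment is outstanding). For two 1024-byte
sends, the first goes out and the second is held under both rules, since
the first was itself a small segment.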

Now, there may still be sufficient cases of reasonable applications that
are not broken, but still "suffer" from Nagle, but I'm not sure (yet -
I'm sure someone else will point some out :) what they are. _Maybe_
something to do with pipelined http requests generating sub MSS
responses?

The idea presented in the draft about an explicit flush mechanism was
interesting, though it would be a coupled change in both the
application(s) and the kernel.

rick jones
-- 
today, an ACK is just as expensive to the server as a data segment...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 19:47:27 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA12604
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 19:47:27 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA06639; Wed, 3 Feb 1999 18:42:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA06164; Wed, 3 Feb 1999 18:41:41 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA06301
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 3 Feb 1999 16:41:39 -0700 (MST)
Date: Wed, 3 Feb 1999 16:41:39 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902032341.QAA06301@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > One related question: Does everybody support applications disabling
> > Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?
>
> IRIX does.

] Linux does.


Let's not have a burst of 10,000 messages listing implementations
that do support TCP_NODELAY.  Instead, let's wait for the few, if
any, announcements of those that don't.

You would expect that the TCP_NODELAY setsockopt() would work on most BSD
derived TCP stacks, including IRIX, BSD/OS, HP-UX, Ultrix and the other
(OSF?) DEC system, System V Release 4 and beyond, AIX, and the zillions
of others I can't think of in 3 seconds.  TCP_NODELAY also seems to work
in some other popular, not exactly BSD-derived systems, notably at least
some "winsock" libraries for WIN32 platforms, including WIN95, WIN98, and
NT 4.0.  (I've rather limited confidence that they do everything right,
but at least those Microsoft systems do not complain when my code makes
the TCP_NODELAY request.)

I bet there are more systems that do not support the Nagle algorithm at
all than those that support the Nagle algorithm but cannot turn it off
(except, of course, systems that cannot be changed to want to turn it off,
such as embedded code burned into ROM).

Is TCP_NODELAY somewhere in POSIX?  I bet so.  
And also DLPI?  Ditto.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 19:55:31 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA12647
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 19:55:31 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA11758; Wed, 3 Feb 1999 18:52:32 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from alcor.process.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA09498; Wed, 3 Feb 1999 18:48:12 -0500 (EST)
Received: by process.com (MX V5.1-X A2w8g) id 103;
          Wed, 3 Feb 1999 18:48:10 -0400
Date: Wed, 3 Feb 1999 18:48:10 -0400
From: Bernie Volz <volz@process.com>
To: tcp-impl@lerc.nasa.gov, minshall@siara.com
Message-ID: <009D332E.35B42D12.103@process.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Folks:

Perhaps this is a dumb question and has been discussed before, but is
there any value in having TCP know a bit about what the application is
doing? This might be considered a protocol layering violation. Has
anyone implemented this kind of scheme?

I'm thinking one could have logic that says "delay ACK unless all data
delivered to application". I don't mean moved into the queue, but a read
was actually waiting and all data delivered or on its way to the
application. Of course, you would ONLY not delay if you received a small
segment (in the case where you believe the sender may be doing the Nagle
algorithm). One could also consider including an explicit "full" window
update with this immediate ACK.

A small change to this logic might be to only avoid delaying the ACK
if the application can accept more data (ie, all has been delivered and
there is still a read outstanding) - you know the application wants more
data!

One could also implement this so that, when a read is queued by an
application, if a delayed ACK is pending on the connection (and the last
segment was small), the ACK is sent immediately.

This of course is wasteful if your round trip time is long (since the
delay ACK time is insignificant in that case). So, perhaps this would
only be turned on if the round-trip estimate is "short".

However, the cost is also small since, if all data was read, the application
must be ready for the data, so why not expedite delivery of more? Also,
for a fast network, it is unlikely that an application could keep up
with network speeds for long and hence this wouldn't trigger that often.
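
A minimal sketch of the decision described above, written as a plain
Python predicate rather than real stack code. The function name, its
parameters, and the 10 ms "short RTT" threshold are assumptions made
for illustration; none of them come from an existing implementation.

```python
def should_ack_immediately(segment_len, mss, reader_waiting, unread_bytes,
                           srtt_ms, short_rtt_ms=10.0):
    """Return True if the ACK for the segment just received should skip the delay.

    All names and the default threshold are hypothetical.
    """
    # Only a small segment suggests the sender may be stalled by Nagle.
    small_segment = segment_len < mss
    # "All data delivered": a read is outstanding and nothing is left queued.
    app_caught_up = reader_waiting and unread_bytes == 0
    # On a long path the delayed-ACK time is insignificant, so leave it alone.
    rtt_is_short = srtt_ms <= short_rtt_ms
    return small_segment and app_caught_up and rtt_is_short
```

For example, should_ack_immediately(100, 1460, True, 0, 2.0) is true,
while the same call with a full 1460-byte segment is false.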

- Bernie Volz
  Process Software


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 20:02:32 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA12737
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 20:02:31 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA11764; Wed, 3 Feb 1999 18:52:51 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA09027; Wed, 3 Feb 1999 18:48:03 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA06505
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 3 Feb 1999 16:48:01 -0700 (MST)
Date: Wed, 3 Feb 1999 16:48:01 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902032348.QAA06505@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: sm@bossette.engr.sgi.com (Sam Manthorpe)

> ...
> > Q: If Nagle is a problem for my application, I can just disable it
> > altogether. What benefit is there in changing Nagle so that it affects
> > ALL transactions instead of just the ones I'm troubled by?
>
> Because you just have to make one change to the kernel code instead
> of many changes to all applications which do bulk data transfer.  Moreover, 
> in some cases, you will have to introduce some special intelligence
> in the application so that it can decide _whether_ it is doing bulk
> transfer or not.  Why not just put this extra intelligence in the
> kernel.

While I think the proposed change is a good thing, that argument
in favor of it is a pretty compelling argument against it.

Outside of some outfits that believe their customers are best served with
amazingly buggy bloatware and have no idea of the meaning of "freaping
creaturism" (or have business reasons for throwing absolutely everything
into their so called operating systems), a "kernel" is universally expected
to be minimal.  You put only the minimal features into a kernel.  "Neat"
is a very pejorative word when it comes to adding things to kernels.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 20:11:28 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA12802
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 20:11:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA17056; Wed, 3 Feb 1999 19:02:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from alcor.process.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA16002; Wed, 3 Feb 1999 19:00:36 -0500 (EST)
Received: by process.com (MX V5.1-X A2w8g) id 21; Wed, 3 Feb 1999 18:56:13 -0400
Date: Wed, 3 Feb 1999 18:56:13 -0400
From: Bernie Volz <volz@process.com>
To: raj@cup.hp.com
CC: tcp-impl@lerc.nasa.gov
Message-ID: <009D332F.558E2768.21@process.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>Back-up a step here - I thought that the problem being fixed here was
>for request/response applications. From the original rfc txt:
>
>begin quote:
>
>The interaction of delayed ACKs and Nagle
>
>   If a TCP has more application data to transmit than will fit in one
>   packet, but less than two full-sized packets' worth of data, it
>   will transmit the first packet.  As a result of Nagle, it will not
>   transmit the second packet until the first packet has been
>   acknowledged.  On the other hand, the receiving TCP will delay
>   acknowledging the first packet until either i) a second packet
>   arrives (which, in this case, won't arrive), or ii) approximately
>   100ms (and a maximum of 200ms) has elapsed.
>
>   When the sending TCP receives the delayed ACK, it can then transmit
>   its second packet.
>
>   In a request-response protocol, this second packet will complete
>   either a request or a response, which then enables a succeeding
>   response or request.
>
>   Note two (related) bad results of the interaction of delayed ACKs
>   and the Nagle algorithm in this case: the request-response time may
>   be increased by up to 400ms (if both the request and the response
>   are delayed); and, the number of transactions per second is
>   substantially reduced.
>
>end quote.

This is missing one point ... bad applications. In the early days (and I
bet it still happens more often than we would like to think), a lot of
applications sent the request or reply (without the terminator) and then
the terminator as two separate writes. Put another way, the message
delimiter and data are not a single write. Even worse, the data may be
"sent" as many small writes. If the amount of data is small, the Nagle
algorithm introduces lots of delays.

If you look at a trace, you'd see:
				<command> --->
after DELAY_ACK time			  <--- ACK
				<terminator>
after DELAY_ACK time			  <--- ACK or REPLY

Anyway, this is solvable if all applications do "large" writes (write a
single request as one write).
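
A minimal sketch of the "one write per request" pattern, in Python. The
helper name send_request and the CRLF terminator are assumptions for
illustration; the point is only that the command and its terminator are
joined in user space and handed to the stack in a single call.

```python
import socket


def send_request(sock, command: bytes, terminator: bytes = b"\r\n") -> None:
    """Send a command and its terminator as one write (hypothetical helper)."""
    # Joining first means TCP sees a single buffer, so the terminator is
    # never left behind as a separate small write waiting on an ACK.
    sock.sendall(command + terminator)
```

The pattern to avoid is sock.sendall(command) followed by a separate
sock.sendall(terminator).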

- Bernie Volz
  Process Software


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 20:17:13 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA12828
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 20:17:13 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA22205; Wed, 3 Feb 1999 19:12:31 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mail3.microsoft.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA19570; Wed, 3 Feb 1999 19:07:36 -0500 (EST)
Received: by mail3.microsoft.com with Internet Mail Service (5.5.2524.0)
	id <D8V4RJSD>; Wed, 3 Feb 1999 16:07:35 -0800
Message-ID: <3FF8121C9B6DD111812100805F31FC0D0CAE85F0@RED-MSG-59>
From: Art Shelest <artshel@microsoft.com>
To: "'Bernie Volz'" <volz@process.com>, tcp-impl@lerc.nasa.gov,
        minshall@siara.com
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Date: Wed, 3 Feb 1999 16:07:28 -0800 
X-Mailer: Internet Mail Service (5.5.2524.0)
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

The distinction between "full" and "small" segments is not certain
whenever PMTU discovery is used. There may be other ways to detect
that a specific segment is at the end of the user's buffer and
accelerate an ACK.


-----Original Message-----
<snip>

I'm thinking one could have logic that says "delay ACK unless all data
delivered to application". I don't mean moved into the queue, but a read
was actually waiting and all data delivered or on its way to the
application. Of course, you would ONLY not delay if you received a small
segment (in the case where you believe the sender may be doing the Nagle
algorithm). One could also consider including an explicit "full" window
update with this immediate ACK.

<snip>

- Bernie Volz
  Process Software


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 20:48:26 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA13128
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 20:48:25 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA07441; Wed, 3 Feb 1999 19:42:32 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA04834; Wed, 3 Feb 1999 19:37:57 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id AAA23199; Thu, 4 Feb 1999 00:37:49 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m108Dfg-0007U1C; Thu, 4 Feb 99 01:33 GMT
Message-Id: <m108Dfg-0007U1C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: volz@process.com (Bernie Volz)
Date: Thu, 4 Feb 1999 01:33:31 +0000 (GMT)
Cc: raj@cup.hp.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <009D332F.558E2768.21@process.com> from "Bernie Volz" at Feb 3, 99 06:56:13 pm
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> 				<terminator>
> after DELAY_ACK time			  <--- ACK or REPLY
> 
> Anyway, this is solvable if all applications do "large" writes (write a
> single request as one write).

It's actually non-trivial because no write is always big enough. You also get
interactions with stuff like sendfile. HP seem to have done a "send this
data, then the file" type thing. It's not clear what the nice answer is. Linux
2.2 has a TCP_CORK option - which is kind of like supernagle[1]. And also of
course very easy to keep clean in the performance path.

The right answer? I don't know. I do know that almost every attempt we
made to do self-adjusting ack delays didn't work.

[1] It's 'supernagle' as in never send a frame when it's not MTU sized
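
A minimal sketch of how an application might use such an option, in
Python. It assumes a Linux host where the socket module exposes
TCP_CORK; the helper name send_corked and the fallback of joining the
parts on other platforms are assumptions for illustration.

```python
import socket

# socket.TCP_CORK is only defined where the platform provides it (Linux).
TCP_CORK = getattr(socket, "TCP_CORK", None)


def send_corked(sock, parts):
    """Write several buffers while asking TCP to hold back partial frames."""
    if TCP_CORK is None:
        # No cork available: fall back to a single combined write.
        sock.sendall(b"".join(parts))
        return
    sock.setsockopt(socket.IPPROTO_TCP, TCP_CORK, 1)
    try:
        for part in parts:
            sock.sendall(part)
    finally:
        # Clearing the option releases whatever partial frame is queued.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CORK, 0)
```

A server could call send_corked(conn, [header, body]) so that a short
header does not leave as its own small segment ahead of the body.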


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 21:06:18 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA13271
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 21:06:18 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA14889; Wed, 3 Feb 1999 19:57:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA13449; Wed, 3 Feb 1999 19:54:58 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id TAA02789
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Feb 1999 19:54:54 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id QAA14328 for <tcp-impl@lerc.nasa.gov>; Wed, 3 Feb 1999 16:54:55 -0800 (PST)
Message-ID: <36B8EFDF.75390A6C@cup.hp.com>
Date: Wed, 03 Feb 1999 16:54:55 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <m108Dfg-0007U1C@the-village.bc.nu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> It's actually non-trivial because no write is always big enough. You also get
> interactions with stuff like sendfile. HP seem to have done a "send this
> data, then the file" type thing.

When we implemented sendfile(), we made sure that the amalgam (?) of the
header data specified by the application and the file data arrived at
TCP in a way that made it behave as if it were one "send" call, so there
were no Nagle/delayed ACK interactions. 

(Not to say that we didn't experience that in some early protos of
course :)

rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  3 23:09:18 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA15959
	for <tcpimpl-archive@lists.ietf.org>; Wed, 3 Feb 1999 23:09:17 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA05117; Wed, 3 Feb 1999 21:37:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA02812; Wed, 3 Feb 1999 21:32:52 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA10857; Wed, 3 Feb 1999 18:13:03 -0800
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id SAA21015; Wed, 3 Feb 1999 18:13:00 -0800
Received: from dors (awe185-153.AWE.Sun.COM [192.29.185.153])
	by jurassic.eng.sun.com (8.9.3.Beta0+Sun/8.9.3.Beta0) with SMTP id SAA12247;
	Wed, 3 Feb 1999 18:12:58 -0800 (PST)
Date: Wed, 3 Feb 1999 18:14:37 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: Rick Jones <raj@cup.hp.com>
Cc: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36B8DBF4.FBCB23FB@cup.hp.com>
Message-ID: <Roam.SIMC.2.0.6.918094477.28747.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> My understanding of "the way things are supposed to work" :) is that if
> that one-and-a-fraction MSS-worth of data was presented to the TCP in a
> single "send" that all of it should go out without delay. And yes, that
> could mean a series of 1460,1,1460,1 segments traversing the network. I
> demonstrated that being the case with a quick netperf TCP_RR test with
> "-r 1461,1" - if the paragraph's description was correct, I should have
> only gotten ~5 transactions per second (200 ms delayed ACK), but I got
> several hundred.

This is interesting.  So are you saying that the receiver did not delay acks
in your test case?  Assuming MSS is 1460 bytes.  After the sender sends the
first 1460 bytes of the 1461 bytes request (in a single send call), it cannot
send any more because of Nagle.  And if the receiver implements delayed ACKs
and acks every other segment, it will not send back an ack immediately if the
1460-byte segment happens to be the first of the 2 segments.  Have you looked at
the transfer trace to see what is going on?

> Now, there may still be sufficient cases of reasonable applications that
> are not broken, but still "suffer" from Nagle, but I'm not sure (yet -
> I'm sure someone else will point some out :) what they are. _Maybe_
> something to do with pipelined http requests generating sub MSS
> responses?

Well, in your test case, Nagle is supposed to kick in and reduce the
RR rate.  Are you sure the stack you tested with does not have some form
of optimization for this?
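
The arithmetic behind the "~5 transactions per second" expectation can
be sketched as below. This is a back-of-the-envelope bound in Python,
assuming strictly serial transactions that each wait out the full
delayed-ACK timer; the function and parameter names are made up for
illustration.

```python
def max_transactions_per_second(rtt_ms, ack_delay_ms=200.0, delayed_legs=1):
    """Upper bound on a serial request/response rate under Nagle + delayed ACK.

    delayed_legs is 1 if only the request (or only the response) stalls,
    and 2 if both do. All names here are hypothetical.
    """
    time_per_transaction_ms = rtt_ms + delayed_legs * ack_delay_ms
    return 1000.0 / time_per_transaction_ms
```

With a negligible round-trip time, one stalled leg at 200 ms gives
1000 / 200 = 5 transactions per second, and two stalled legs (the 400 ms
case) give 2.5. Several hundred per second therefore means the stall
did not happen in that test.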

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 00:02:59 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id AAA16640
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 00:02:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA12004; Wed, 3 Feb 1999 22:52:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from daffy.ee.lbl.gov (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA11538; Wed, 3 Feb 1999 22:51:31 -0500 (EST)
Received: (from vern@localhost)
	by daffy.ee.lbl.gov (8.9.2/8.9.2) id TAA10469;
	Wed, 3 Feb 1999 19:51:31 -0800 (PST)
Message-Id: <199902040351.TAA10469@daffy.ee.lbl.gov>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Date: Wed, 03 Feb 1999 19:51:30 PST
From: Vern Paxson <vern@ee.lbl.gov>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

An observation: some of the comments on this thread make me wonder if some
contributors are speculating about the I-D before having read it.  If you
haven't read it, please do, it's short and a very simple change.

		Vern


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 00:34:44 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id AAA17031
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 00:34:44 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id XAA24388; Wed, 3 Feb 1999 23:17:31 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Twig.Rodents.Montreal.QC.CA (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id XAA22118; Wed, 3 Feb 1999 23:12:39 -0500 (EST)
Received: (from mouse@localhost)
	by Twig.Rodents.Montreal.QC.CA (8.8.8/8.8.8) id XAA09174;
	Wed, 3 Feb 1999 23:12:34 -0500 (EST)
Date: Wed, 3 Feb 1999 23:12:34 -0500 (EST)
From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
Message-Id: <199902040412.XAA09174@Twig.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>>> One related question: Does everybody support applications disabling
>>> Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?
>> [foo] does.
>> [bar] does.
> Let's not have a burst of 10,000 messages listing implementations
> that do support TCP_NODELAY.  Instead, let's wait for the few, if
> any, announcements of those that don't.

I suspect this won't work very well.  I feel quite sure that having a
representative on tcp-impl correlates positively, and probably fairly
strongly, with having a relatively decent implementation.  Thus, the
people who could truthfully chime in as you request - even assuming
they'd be willing to publicly admit their failing - are the people who
probably won't see the question.

The only way I expect to see "<foo> doesn't do this right" here is from
someone who wrote code that tried to disable Nagle (eg, TCP_NODELAY)
and then discovered in a packet trace that it was happening anyway.

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 00:47:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id AAA17095
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 00:47:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id XAA03843; Wed, 3 Feb 1999 23:37:30 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sgi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id XAA02013; Wed, 3 Feb 1999 23:33:36 -0500 (EST)
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id UAA06499; Wed, 3 Feb 1999 20:33:28 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: from bossette.engr.sgi.com (bossette.engr.sgi.com [150.166.61.12])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id UAA07748;
	Wed, 3 Feb 1999 20:33:28 -0800 (PST)
	mail_from (sm@bossette.engr.sgi.com)
Received: (from sm@localhost) by bossette.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) id UAA35010; Wed, 3 Feb 1999 20:33:27 -0800 (PST)
From: sm@bossette.engr.sgi.com (Sam Manthorpe)
Message-Id: <199902040433.UAA35010@bossette.engr.sgi.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: vjs@calcite.rhyolite.com (Vernon Schryver)
Date: Wed, 3 Feb 1999 20:33:27 -0800 (PST)
Cc: tcp-impl@lerc.nasa.gov
In-Reply-To: <199902032348.QAA06505@calcite.rhyolite.com> from "Vernon Schryver" at Feb 3, 99 04:48:01 pm
X-Mailer: ELM [version 2.4 PL25 PGP3 *ALPHA*]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon Schryver wrote:
> 
> > From: sm@bossette.engr.sgi.com (Sam Manthorpe)
> 
> > ...
> > > Q: If Nagle is a problem for my application, I can just disable it
> > > altogether. What benefit is there in changing Nagle so that it affects
> > > ALL transactions instead of just the ones I'm troubled by?
> >
> > Because you just have to make one change to the kernel code instead
> > of many changes to all applications which do bulk data transfer.  Moreover, 
> > in some cases, you will have to introduce some special intelligence
> > in the application so that it can decide _whether_ it is doing bulk
> > transfer or not.  Why not just put this extra intelligence in the
> > kernel.
> 
> While I think the proposed change is a good thing, that argument
> in favor of it is a pretty compelling argument against it.
> 
> Outside of some outfits that believe their customers are best served with
> amazingly buggy bloatware and have no idea of the meaning of "freaping
> creaturism" (or have business reasons for throwing absolutely everything
> into their so called operating systems), a "kernel" is universally expected
> to be minimal.  You put only the minimal features into a kernel.  "Neat"
> is a very pejorative word when it comes to adding things to kernels.

Point taken.  However, from the point of view of kernel implementation 
this trivial mod can result in a significant performance improvement for 
lots of applications.    And it isn't a `neat feature'; it's a harmless 
bending of the rules that IMO falls well into the grey area of RFC deviance 
justified by the `be conservative in what you specify but liberal in what 
you accept' philosophy.  I like Greg's ID and proposed legitimisation of
this change.

-- Sam

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sam Manthorpe, SGI.  tel: (650)933-2856 fax: (650)932-2856  sm@engr.sgi.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 11:43:46 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id LAA00111
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 11:43:45 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id JAA12273; Thu, 4 Feb 1999 09:57:36 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ausmail1.austin.ibm.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id JAA11727; Thu, 4 Feb 1999 09:56:37 -0500 (EST)
Received: from netmail2.austin.ibm.com (netmail2.austin.ibm.com [9.53.250.97])
	by ausmail1.austin.ibm.com (8.9.1/8.8.5) with ESMTP id IAA17806;
	Thu, 4 Feb 1999 08:50:37 -0600
Received: from mojave.austin.ibm.com (mojave.austin.ibm.com [9.53.150.76])
        by netmail2.austin.ibm.com (8.8.5/8.8.5) with ESMTP id IAA03758;
        Thu, 4 Feb 1999 08:56:30 -0600
Received: (from marquard@localhost) by mojave.austin.ibm.com (AIX4.3/UCB 8.8.8/8.7-client1.01) id IAA31352; Thu, 4 Feb 1999 08:56:29 -0600
To: "Eric A. Hall" <ehall@ehsco.com>
Cc: Greg Minshall <minshall@siara.com>, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902020207.SAA01109@red.mtv.siara.com> <36B8B9A5.4BBBEA81@ehsco.com> <36B8BBF3.EAFAD2C@ehsco.com>
From: Dave Marquardt <marquard@austin.ibm.com>
Date: 04 Feb 1999 08:56:29 -0600
In-Reply-To: "Eric A. Hall"'s message of "Wed, 03 Feb 1999 13:13:24 -0800"
Message-ID: <v5tr9s65iaq.fsf@mojave.austin.ibm.com>
Lines: 13
X-Mailer: Gnus v5.6.2/Emacs 19.34
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

"Eric A. Hall" <ehall@ehsco.com> writes:
> > Q: If Nagle is a problem for my application, I can just disable it
> > altogether.
> 
> One related question: Does everybody support applications disabling
> Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?

AIX does.

I suspect most implementations based on BSD do.  It's the TCP_NODELAY
socket option.
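
A minimal sketch of setting that option from an application, using
Python's socket module. The two helper names are assumptions for
illustration. Reading the option back only confirms that the flag was
stored; whether the stack really stops delaying small segments still
has to be checked in a packet trace.

```python
import socket


def disable_nagle(sock):
    """Turn off the Nagle algorithm on this one connection."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def nagle_disabled(sock):
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```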

-Dave


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 11:53:21 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id LAA00112
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 11:43:46 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id JAA03014; Thu, 4 Feb 1999 09:42:38 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id JAA01745; Thu, 4 Feb 1999 09:40:18 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id HAA21740
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 4 Feb 1999 07:40:12 -0700 (MST)
Date: Thu, 4 Feb 1999 07:40:12 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902041440.HAA21740@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: der Mouse  <mouse@Rodents.Montreal.QC.CA>

> ...
> > that do support TCP_NODELAY.  Instead, let's wait for the few, if
> > any, announcements of those that don't.
>
> I suspect this won't work very well.  I feel quite sure that having a
> representative on tcp-impl correlates positively, and probably fairly
> strongly, with having a relatively decent implementation.  Thus, the
> people who could truthfully chime in as you request - even assuming
> they'd be willing to publicly admit their failing - are the people who
> probably won't see the question.

That's all quite right, but the problem applies even more to the other
tactic of collecting notes about systems that their advocates claim work
correctly.

There's also the issue of why the IETF cares whether any particular
implementation does or does not implement TCP_NODELAY.  It's one thing
to define a standard, and quite another thing to monitor, test, or
otherwise worry about the compliance of every implementation deployed or
announced.


> The only way I expect to see "<foo> doesn't do this right" here is from
> someone who wrote code that tried to disable Nagle (eg, TCP_NODELAY)
> and then discovered in a packet trace that it was happening anyway.

In general, that's also probably true.  However, if you have an application
which uses TCP_NODELAY because it is undesirable for it to follow the
rules (e.g. does things like write-write-read-...) and so gets hit by the
Nagle+delayed_Ack interaction, you probably won't need a packet trace to
have the problem brought to your attention.  The performance effects are
hard to miss.

I have such an application, but it wouldn't be helped by the proposal
because its naughty writes are too small.  It still needs TCP_NODELAY.
Such is life.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 13:30:08 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id NAA06577
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 13:30:07 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA06217; Thu, 4 Feb 1999 12:12:38 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA03147; Thu, 4 Feb 1999 12:08:07 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id LAA02391;
	Thu, 4 Feb 1999 11:07:58 -0600 (CST)
Date: Thu, 4 Feb 1999 11:07:58 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902041707.LAA02391@frantic.bsdi.com>
To: minshall@siara.com, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Greg,

(This message got a bit longer than I anticipated... Oh well.)

I like the proposed modification to the Nagle algorithm.  But I do
have to say that I'm biased, because I implemented this in BSD/OS
several years ago, as part of an overall effort to improve BSD/OS's
performance when doing web traffic (BSD/OS 2.1, patch K210-019, August
1996).  My goal was to eliminate any artificial delays in the TCP
traffic, and to make the TCP packets as full as possible.

Basically, I changed our TCP so that when an application does a single
write, if there isn't any outstanding un-acked data, all the new data
will go out (subject to constraints of the congestion window), including
the trailing partial packet.  It has been working very well, and is
basically similar to the proposed text of:

        If a TCP has less than a full-sized packet to transmit,
        and if any previous less than full-sized packet has not
        yet been acknowledged, do not transmit a packet.

My changes were well worth it.  Your suggestion of adding a SND_SML
state might be a simpler implementation than what I did, though we
were dealing with more issues than just the TCP Nagle issue.  In BSD,
sosend() only copies down at most MCLBYTES (2K in BSD/OS) from the user
to the kernel, and then hands that single piece down to tcp_output().

We were seeing problems on FDDI where TCP would get the initial 2K
of the data and send it out right away, even when there was more data
that could have filled a 4K packet; it just hadn't been copied down
from the user yet.  The next 2K would stall due to Nagle, and the
successive 2K would then allow a 4K packet to be sent.

We were also seeing the problem you addressed, where a 2000 byte write
would send the initial 1460 bytes, followed by a delay before sending
the other 540 bytes, due to Nagle.

In a nutshell, our code does the Nagle decision once on the whole
chunk of data written in a single write by the application.  So,
the trailing partial packet goes out immediately, because there was
no outstanding data when the write call was issued.  We actually
remember this state across successive calls to tcp_output(), since
sosend() hands down the large write from the user in multiple pieces.
The old code would generate:

	User write 4096 -> kernel
		sosend 2048 -> tcp_output
			send 1460 (defer 588 bytes)
		sosend 2048 -> tcp_output
			send 1460 (defer 1176 bytes)
	(wait for ACK or timeout)
			send 1176
			
What we didn't want, and what we initially got, was:

	User write 4096 -> kernel
		sosend 2048 -> tcp_output()
			send 1460
			send 588
		sosend 2048 -> tcp_output()
			send 1460 (defer 588 bytes)
	(wait for ACK or timeout)
			send 588 bytes

After adding state across successive calls to tcp_output(), our
code does:

	User write 4096 -> kernel
		sosend 2048 -> tcp_output()
			send 1460 (defer 588 bytes)
		sosend 2048 -> tcp_output()
			send 1460
			send 1176

(All the above examples assume the congestion window is large
enough to allow all the data to go out...)
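
To make the three traces concrete, here is a small user-space model of
them.  It is only a sketch, not the BSD/OS code: it assumes an MSS of
1460, sosend() pieces of 2048 bytes, no ACK arriving while the write is
in progress, and an open congestion window.  The names nagle_scope,
trace and simulate_write are made up for the illustration.

```c
#include <assert.h>

#define MAX_SEGS 32

/* Where the "is the connection idle?" Nagle test is evaluated. */
enum nagle_scope {
    PER_SEGMENT,   /* old code: re-checked before every partial segment   */
    PER_CALL,      /* checked once per tcp_output() call                  */
    PER_WRITE      /* checked once per user write, state kept across calls */
};

struct trace {
    int n;                  /* number of segments recorded             */
    int len[MAX_SEGS];      /* payload length of each segment          */
    int waited[MAX_SEGS];   /* 1 if sent only after an ACK or timeout  */
};

static void emit(struct trace *t, int len, int waited)
{
    assert(t->n < MAX_SEGS);
    t->len[t->n] = len;
    t->waited[t->n] = waited;
    t->n++;
}

/* Model one user write that is copied down in pieces of at most chunk bytes. */
static void simulate_write(enum nagle_scope scope, int write_len,
                           int chunk, int mss, struct trace *t)
{
    int remaining = write_len;  /* bytes not yet copied down from the user */
    int queued = 0;             /* bytes in the send buffer, not yet sent  */
    int outstanding = 0;        /* bytes sent but not yet acknowledged     */
    int idle_at_write = 1;      /* nothing was un-acked when write() began */

    assert(write_len > 0 && chunk > 0 && mss > 0);
    t->n = 0;
    while (remaining > 0) {
        int piece = remaining < chunk ? remaining : chunk;
        int idle_at_call = (outstanding == 0);
        int send_partial = 0;

        remaining -= piece;
        queued += piece;

        /* Full-sized segments are never held back by Nagle. */
        while (queued >= mss) {
            emit(t, mss, 0);
            queued -= mss;
            outstanding += mss;
        }
        if (queued == 0)
            continue;

        switch (scope) {
        case PER_SEGMENT:
            send_partial = (outstanding == 0);
            break;
        case PER_CALL:
            send_partial = idle_at_call;
            break;
        case PER_WRITE:
            /* Hold the tail while more user data is still to be copied. */
            send_partial = idle_at_write && remaining == 0;
            break;
        }
        if (send_partial) {
            emit(t, queued, 0);
            outstanding += queued;
            queued = 0;
        }
    }
    if (queued > 0)
        emit(t, queued, 1);     /* goes out only after an ACK or timeout */
}
```

Calling simulate_write(PER_WRITE, 4096, 2048, 1460, &t) records three
segments (1460 + 1460 + 1176 = 4096), none of which had to wait for an
ACK.  The "remaining == 0" test is the part that depends on knowing
whether more data is still waiting to be copied down from the user.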

So, it is important to point out to implementors that with this
change to Nagle, you need to keep track of whether or not there is
more data waiting to be copied down from the user.

Another thought is that perhaps Nagle should only apply when you have
successive small packets.  Then the text becomes something like:

	If a TCP has less than a full-sized packet to transmit,
	and if the last in-sequence packet sent was less than a
	full-sized packet and has not yet been acknowledged, do
	not transmit a packet.

But then if the application did:
	write(s, buf1, 1500);
	write(s, buf2, 1500);
	write(s, buf3, 1500);

You'd get:
	1st write:	send 1460 bytes
			send 40 bytes
	2nd write:	send 1460 bytes
			send 40 bytes
	3rd write:	send 1460 bytes
			send 40 bytes

whereas the current Nagle code would produce:

	1st write:	send 1460 bytes  (defer 40 bytes)
	2nd write:	send 40 + 1420 = 1460 bytes (defer 80 bytes)
	3rd write:	send 80 + 1380 = 1460 bytes  (defer 120 bytes)
			(wait for ACK or timeout)
			send 120 bytes

and the proposed Nagle (and the current BSD/OS) would produce:

	1st write:	send 1460 bytes
			send 40 bytes
	2nd write:	send 1460 bytes (defer 40 bytes)
	3rd write:	send 40 + 1420 = 1460 bytes (defer 80 bytes)
			(wait for ACK or timeout)
			send 80 bytes

As long as the application is not doing multiple writes to send
a single transaction request, the proposed code would work just
fine.  In fact, my implementation was geared to optimize with the
assumption that the application was smart enough to send the entire
request in a single write system call.
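
The three traces for the 1500-byte writes can be reproduced with a
small model of the "may I send a partial packet?" test.  This is only a
sketch under stated assumptions: MSS is 1460, no ACK arrives between
the writes, and the congestion window is open.  The names nagle_policy,
conn, app_write and ack_everything are invented for the example and do
not come from any real stack.

```c
#include <assert.h>
#include <stddef.h>

enum nagle_policy {
    NAGLE_CLASSIC,     /* hold a partial packet while anything is un-acked  */
    NAGLE_LAST_SMALL,  /* hold it only if the last packet sent was small    */
    NAGLE_ANY_SMALL    /* proposed: hold it if any small packet is un-acked */
};

struct conn {
    int queued;          /* bytes waiting in the send buffer         */
    int unacked;         /* bytes sent and not yet acknowledged      */
    int small_unacked;   /* a less than full-sized packet is un-acked */
    int last_was_small;  /* the most recent packet sent was small    */
};

static int may_send_partial(const struct conn *c, enum nagle_policy p)
{
    switch (p) {
    case NAGLE_CLASSIC:
        return c->unacked == 0;
    case NAGLE_LAST_SMALL:
        return !(c->last_was_small && c->unacked > 0);
    case NAGLE_ANY_SMALL:
        return !c->small_unacked;
    }
    return 0;
}

/* Record one transmitted packet of len bytes in out[] and update state. */
static void transmit(struct conn *c, int len, int mss,
                     int *out, size_t cap, size_t *n)
{
    assert(*n < cap);
    out[(*n)++] = len;
    c->queued -= len;
    c->unacked += len;
    c->last_was_small = (len < mss);
    if (len < mss)
        c->small_unacked = 1;
}

/* One application write: full packets always go, the tail is policy-gated. */
static void app_write(struct conn *c, enum nagle_policy p, int len, int mss,
                      int *out, size_t cap, size_t *n)
{
    c->queued += len;
    while (c->queued >= mss)
        transmit(c, mss, mss, out, cap, n);
    if (c->queued > 0 && may_send_partial(c, p))
        transmit(c, c->queued, mss, out, cap, n);
}

/* An ACK covering everything arrives: any deferred data may now go out. */
static void ack_everything(struct conn *c, int mss,
                           int *out, size_t cap, size_t *n)
{
    c->unacked = 0;
    c->small_unacked = 0;
    c->last_was_small = 0;
    while (c->queued >= mss)
        transmit(c, mss, mss, out, cap, n);
    if (c->queued > 0)
        transmit(c, c->queued, mss, out, cap, n);
}
```

Three calls to app_write() with 1500 bytes followed by ack_everything()
give 1460, 40, 1460, 1460, 80 under NAGLE_ANY_SMALL, with only the last
80 bytes waiting for the ACK.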

As for sending multiple requests in a row, I think the additional
idea about SO_EXPLICITPUSH could be used to fix the delay of the
final packet.  I've bounced this idea around before, but never
followed through on it.  My suggestion is to just add a MSG_PUSH
flag that would be used on the last send to indicate that the
application doesn't intend to send any more data at this time.

I don't like SO_EXPLICITPUSH, because that means that the kernel
is trusting the application to remember to tell it when it is
done sending the data.  If the application forgets, then it could
adversely affect the performance below what would have happened
had SO_EXPLICITPUSH not been set.  I'd rather have the kernel
continue to be as aggressive as normal in sending out data, but
take the MSG_PUSH bit as a suggestion that it should be more
aggressive and transmit any delayed data, since no more data will
be written at this time.

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 13:36:17 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id NAA06887
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 13:36:16 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA24470; Thu, 4 Feb 1999 11:52:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sco.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA21257; Thu, 4 Feb 1999 11:47:56 -0500 (EST)
Received: from tyne.london.sco.COM(150.126.1.103), claiming to be "tyne.sco.com"
 via SMTP by scol.london.sco.COM, id smtpdAAAa003gB; Thu Feb  4 11:33:07 1999
Received: from dhcp3-23.pd.london.sco.com by tyne.sco.com id aa29913;
          4 Feb 99 11:32 GMT
Message-ID: <024201be5032$2eeb1de0$17037e96@barbarella.pd.london.sco.com>
From: Jonathan Webb <jonwe@sco.COM>
To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
MMDF-Warning:  Parse error in original version of preceding line at scol.sco.COM
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Date: Thu, 4 Feb 1999 11:27:01 -0000
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 4.72.2106.4
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

SCO UnixWare7 and OSR5 do.

- Jonathan

-----Original Message-----
From: Vernon Schryver <vjs@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov <tcp-impl@lerc.nasa.gov>
Date: 03 February 1999 23:52
Subject: Re: internet draft on suggested mod to the Nagle algorithm


>> > One related question: Does everybody support applications disabling
>> > Nagle on a per-circuit basis, as mandated by 4.2.3.4 in RFC 1122?
>>
>> IRIX does.
>
>] Linux does.
>
>
>Let's not have a burst of 10,000 messages listing implementations
>that do support TCP_NODELAY.  Instead, let's wait for the few, if
>any, announcements of those that don't.
>
>You would expect that the TCP_NODELAY setsockopt() would work on most BSD
>derived TCP stacks, including IRIX, BSD/OS, HP-UX, Ultrix and the other
>(OSF?) DEC system, System V Release 4 and beyond, AIX, and the zillions
>of others I can't think of in 3 seconds.  TCP_NODELAY also seems to work
>in some other popular, not exactly BSD-derived systems, notably at least
>some "winsock" libraries for WIN32 platforms, including WIN95, WIN98, and
>NT 4.0.  (I've rather limited confidence that they do everything right,
>but at least those Microsoft systems do not complain when my code makes
>the TCP_NODELAY request.)
>
>I bet there are more systems that do not support the Nagle algorithm at
>all than those that support the Nagle algorithm but cannot turn it off
>(except, of course, systems that cannot be changed to want to turn it off,
>such as embedded code burned into ROM).
>
>Is TCP_NODELAY somewhere in POSIX?  I bet so.  
>And also DLPI?  Ditto.
>
>
>Vernon Schryver    vjs@rhyolite.com
>
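
For reference, the per-connection switch discussed above looks like
this through the sockets API.  A minimal sketch: set_nodelay() and
get_nodelay() are illustrative helper names, and only setsockopt() and
getsockopt() with IPPROTO_TCP and TCP_NODELAY are the real interface.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Turn the Nagle algorithm off (on != 0) or back on (on == 0) for one
 * TCP socket.  Returns 0 on success, -1 with errno set on failure. */
static int set_nodelay(int fd, int on)
{
    int v = on ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, sizeof v);
}

/* Returns 1 if Nagle is disabled, 0 if it is enabled, -1 on error. */
static int get_nodelay(int fd)
{
    int v = 0;
    socklen_t len = sizeof v;

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, &len) < 0)
        return -1;
    return v != 0;
}
```

The option applies to a single socket, so an application that wants
Nagle on most of its connections can leave them alone and call
set_nodelay(fd, 1) only where it is needed.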


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 13:56:54 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id NAA08043
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 13:56:54 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA27720; Thu, 4 Feb 1999 12:47:35 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA24710; Thu, 4 Feb 1999 12:42:19 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id JAA11102
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 09:42:22 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id JAA15448 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 09:42:14 -0800 (PST)
Message-ID: <36B9DBF6.499265E4@cup.hp.com>
Date: Thu, 04 Feb 1999 09:42:14 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <Roam.SIMC.2.0.6.918094477.28747.kcpoon@jurassic>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Kacheong Poon wrote:
> 
> > My understanding of "the way things are supposed to work" :) is that if
> > that one-and-a-fraction MSS-worth of data was presented to the TCP in a
> > single "send" that all of it should go out without delay. And yes, that
> > could mean a series of 1460,1,1460,1 segments traversing the network. I
> > demonstrated that being the case with a quick netperf TCP_RR test with
> > "-r 1461,1" - if the paragraph's description was correct, I should have
> > only gotten ~5 transactions per second (200 ms delayed ACK), but I got
> > several hundred.
> 
> This is interesting.  So are you saying that the receiver did not delay acks
> in your test case?  Assuming MSS is 1460 bytes.  After the sender sends the
> first 1460 bytes of the 1461 bytes request (in a single send call), it cannot
> send any more because of Nagle.  And if the receiver implements delay ack and
> ack every other segment, it will not send back an ack immediately if the
> 1460-byte segment happens to be the first of the 2 segments.  Have you looked at
> the transfer trace to see what is going on?

No, I'm saying that the last byte did not wait for the ack to go out -
the Nagle check was applied to the "send" as a whole, not the individual
segments that made up the "send."

rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 14:03:59 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA08423
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 14:03:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA01121; Thu, 4 Feb 1999 12:52:35 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA00500; Thu, 4 Feb 1999 12:51:47 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id RAA12172; Thu, 4 Feb 1999 17:51:35 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m108To4-0007U2C; Thu, 4 Feb 99 18:47 GMT
Message-Id: <m108To4-0007U2C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: dab@BSDI.COM (David Borman)
Date: Thu, 4 Feb 1999 18:47:15 +0000 (GMT)
Cc: minshall@siara.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <199902041707.LAA02391@frantic.bsdi.com> from "David Borman" at Feb 4, 99 11:07:58 am
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> continue to be as aggressive as normal in sending out data, but
> take the MSG_PUSH bit as a suggestion that it should be more
> aggressive and transmit any delayed data, since no more data will
> be written at this time.

The Linux TCP_CORK option is basically doing the same thing but the other
way around. An application has to actively turn this behaviour on and request
its use.  It also solves the sendfile() interaction problems more cleanly than
a "send this data, then the file" API, which is really oriented at web traffic.

That the unix socket API needs a "just buffer this for a minute" option is
fairly indisputable. Since most applications won't support sensible use of
such APIs, a better heuristic is good news too.
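
A minimal sketch of how an application might use the option on Linux.
TCP_CORK is set with setsockopt() at level IPPROTO_TCP; the helper
names set_cork() and send_corked() are made up here, short sends are
not retried, and the #else branch is only a guess at a sensible
fallback for systems without the option.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Cork (on != 0) or uncork (on == 0) a TCP socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_cork(int fd, int on)
{
#ifdef TCP_CORK
    int v = on ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &v, sizeof v);
#else
    (void)fd;
    (void)on;
    errno = ENOPROTOOPT;    /* option not available on this system */
    return -1;
#endif
}

/* Hand header and body to TCP as two sends, but ask it to hold partial
 * packets until both are queued.  Returns 0 on success, -1 on error. */
static int send_corked(int fd, const void *hdr, size_t hlen,
                       const void *body, size_t blen)
{
    int rc = 0;

    if (set_cork(fd, 1) < 0)
        return -1;
    if (send(fd, hdr, hlen, 0) < 0 || send(fd, body, blen, 0) < 0)
        rc = -1;
    if (set_cork(fd, 0) < 0)    /* uncork: flush whatever is queued */
        rc = -1;
    return rc;
}
```

The application opts in by corking and tells the kernel it is done by
uncorking, which is the "other way around" from a per-send push hint.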

Alan


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 14:19:51 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA09133
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 14:19:51 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA10781; Thu, 4 Feb 1999 13:07:35 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA10188; Thu, 4 Feb 1999 13:06:20 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id LAA25474
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 4 Feb 1999 11:06:19 -0700 (MST)
Date: Thu, 4 Feb 1999 11:06:19 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902041806.LAA25474@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: David Borman <dab@BSDI.COM>

> ...
> I don't like SO_EXPLICITPUSH, because that means that the kernel
> is trusting the application to remember to tell it when it is
> done sending the data.  If the application forgets, then it could
> adversely affect the performance below what would have happened
> had SO_EXPLICITPUSH not been set.

Good point.

>                                    I'd rather have the kernel
> continue to be as aggressive as normal in sending out data, but
> take the MSG_PUSH bit as a suggestion that it should be more
> aggressive and transmit any delayed data, since no more data will
> be written at this time.

MSG_PUSH sounds interesting, but

  - you would definitely need to pick some other name, since otherwise
     you would have hordes of experts up in arms when they saw the PUSH
     bit in the TCP header set or not set in concert with the MSG_PUSH
     flag in the system call.  Recall the continuing popularity of their
     complaints and questions about the PUSH bit as sent by the BSD code.
     They would think that something in the API named *PUSH would be a
     direct control on the TCP header bit, and never mind how much sense
     that makes.

  - I bet that most applications that now need to set TCP_NODELAY
     would still need to use TCP_NODELAY.  In other words, what could you
     do with MSG_PUSH that you could not do more efficiently (i.e. with
     fewer CPU cycles) with writev()?
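
For readers who have not used it, the writev() approach looks roughly
like this.  A minimal sketch: send_request() is a hypothetical helper
name, and a real caller would also have to handle a short write.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand a header and a body to the kernel in one system call, so TCP
 * sees the whole request as a single write.
 * Returns the number of bytes written, or -1 with errno set. */
static ssize_t send_request(int fd, const void *hdr, size_t hlen,
                            const void *body, size_t blen)
{
    struct iovec iov[2];

    assert(fd >= 0);
    iov[0].iov_base = (void *)hdr;   /* writev() does not modify the data */
    iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = blen;
    return writev(fd, iov, 2);
}
```

Compared with two write() calls, this costs one system call instead of
two and needs no extra copy to glue the buffers together.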


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 15:00:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id PAA10873
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 15:00:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA00192; Thu, 4 Feb 1999 13:37:47 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from zephyr.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA28332; Thu, 4 Feb 1999 13:34:34 -0500 (EST)
From: braden@ISI.EDU
Received: from gra.isi.edu (gra.isi.edu [128.9.160.133])
	by zephyr.isi.edu (8.8.7/8.8.6) with SMTP id KAA14733;
	Thu, 4 Feb 1999 10:34:18 -0800 (PST)
Date: Thu, 4 Feb 1999 10:29:59 -0800
Posted-Date: Thu, 4 Feb 1999 10:29:59 -0800
Message-Id: <199902041829.AA24314@gra.isi.edu>
Received: by gra.isi.edu (5.65c/4.0.3-6)
	id <AA24314>; Thu, 4 Feb 1999 10:29:59 -0800
To: minshall@siara.com, tcp-impl@lerc.nasa.gov, dab@BSDI.COM
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

  *> 
  *> As for sending multiple requests in a row, I think the additional
  *> idea about SO_EXPLICITPUSH could be used to fix the delay of the
  *> final packet.  I've bounced this idea around before, but never
  *> followed through on it.  My suggestion is to just add a MSG_PUSH
  *> flag that would be used on the last send to indicate that the
  *> application doesn't intend to send any more data at this time.
  *> 
  *> I don't like SO_EXPLICITPUSH, because that means that the kernel
  *> is trusting the application to remember to tell it when it is
  *> done sending the data.  If the application forgets, then it could
  *> adversely affect the performance below what would have happened
  *> had SO_EXPLICITPUSH not been set.  I'd rather have the kernel
  *> continue to be as aggressive as normal in sending out data, but
  *> take the MSG_PUSH bit as a suggestion that it should be more
  *> aggressive and transmit any delayed data, since no more data will
  *> be written at this time.
  *> 
  *> 			-David Borman, dab@bsdi.com
  *> 

Dave,

Yes. Your definition of MSG_PUSH seems to exactly match the PUSH bit in
the TCP spec (FINALLY!!!!!!!).

Bob


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 15:36:08 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id PAA12359
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 15:36:07 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA10317; Thu, 4 Feb 1999 13:52:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA09688; Thu, 4 Feb 1999 13:51:28 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id KAA13461 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 10:51:25 -0800
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id KAA01501 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 10:51:20 -0800
Received: from dors (awe185-162.AWE.Sun.COM [192.29.185.162])
	by jurassic.eng.sun.com (8.9.3.Beta0+Sun/8.9.3.Beta0) with SMTP id KAA20673
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 10:51:21 -0800 (PST)
Date: Thu, 4 Feb 1999 10:53:02 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36B9DBF6.499265E4@cup.hp.com>
Message-ID: <Roam.SIMC.2.0.6.918154382.12573.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> No, I'm saying that the last byte did not wait for the ack to go out -
> the Nagle check was applied to the "send" as a whole, not the individual
> segments that made up the "send."

So this is an optimization your stack does.  For TCP, it just looks at
how much data there is to send.  It does not care if it is a single send
or not.  I infer from what you said that two segments, one 1460 bytes
and one 1 byte, would go out immediately regardless of Nagle because it
was a single send.

I think this is another variant of the proposed change.  It seems that
people are not against the change.  Maybe the author can think of another
way to phrase it.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 16:25:13 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA14696
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 16:25:12 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA25873; Thu, 4 Feb 1999 15:02:36 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA23596; Thu, 4 Feb 1999 14:59:23 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id NAA02712;
	Thu, 4 Feb 1999 13:59:21 -0600 (CST)
Date: Thu, 4 Feb 1999 13:59:21 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902041959.NAA02712@frantic.bsdi.com>
To: tcp-impl@lerc.nasa.gov, vjs@calcite.rhyolite.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Thu, 4 Feb 1999 11:06:19 -0700 (MST)
> From: Vernon Schryver <vjs@calcite.rhyolite.com>
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
> ...
> MSG_PUSH sounds interesting, but
>
>   - you would definitely need to pick some other name, since otherwise
>      you would have hordes of experts up in arms when they saw the PUSH
>      bit in the TCP header set or not set in concert with the MSG_PUSH
> ...

The name doesn't matter, e.g. MSG_NDELAY would work as well.

>   - I bet that most applications that now need to set TCP_NODELAY
>      would still need to use TCP_NODELAY.  In other words, what could you
>      do with MSG_PUSH that you could not do more efficiently (i.e. with
>      fewer CPU cycles) with writev()?

First, the combination of modifying Nagle and adding MSG_PUSH
is what I would recommend, not just MSG_PUSH.  The combination
of the two would eliminate most of the need for TCP_NODELAY.
Secondly, yes, sending the transaction in a single writev() is
by far the preferred method.

The usual need for using TCP_NODELAY is to get around Nagle, due
to it interacting poorly with delayed acks.

1) With the modification to Nagle, then for a single request sent
   in a single writev(), no matter how large the request, there would
   be no need for TCP_NODELAY.

2) If a request is sent via multiple writes, then using MSG_PUSH
   on the last write will eliminate the need for TCP_NODELAY.

3) If an application is sending multiple requests with a separate
   write for each request, with or without the Nagle change, the
   trailing data will get delayed.  Using MSG_PUSH on the last write
   will eliminate that.

The point is that rather than bypassing Nagle in all instances,
using MSG_PUSH you can bypass Nagle at the specific points where
you need to.
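
As a rough sketch of the send-side test this implies: full-sized
packets always go, a trailing partial packet goes if no small packet is
un-acked, and otherwise it goes only when the caller passed the push
hint.  HYP_MSG_PUSH is a hypothetical flag invented for this example
(no such MSG_* flag exists in the sockets API), as are nagle_state and
sendable_now(); the congestion and receive windows are ignored.

```c
#include <assert.h>

/* Hypothetical per-call hint; NOT a real sockets API flag. */
#define HYP_MSG_PUSH 0x1

struct nagle_state {
    int queued;          /* unsent bytes in the send buffer             */
    int mss;             /* size of a full-sized packet                 */
    int small_unacked;   /* nonzero while a short packet awaits its ACK */
};

/* Returns how many of the queued bytes may be transmitted right now. */
static int sendable_now(const struct nagle_state *s, int flags)
{
    int full, tail;

    assert(s->mss > 0 && s->queued >= 0);
    full = (s->queued / s->mss) * s->mss;  /* bytes in full-sized packets */
    tail = s->queued - full;               /* trailing partial packet     */

    if (tail == 0)
        return full;
    if (!s->small_unacked || (flags & HYP_MSG_PUSH))
        return s->queued;   /* modified Nagle allows it, or caller pushed */
    return full;            /* hold the tail until the small packet is ACKed */
}
```

With 1540 bytes queued, an MSS of 1460 and a small packet still
un-acked, sendable_now() returns 1460 without the hint and 1540 with
it, so Nagle is bypassed only on the call that asks for it.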

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 16:29:51 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA15006
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 16:29:51 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA22436; Thu, 4 Feb 1999 14:57:36 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA21274; Thu, 4 Feb 1999 14:55:45 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id OAA21776
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 14:55:42 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id LAA15622 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 11:55:43 -0800 (PST)
Message-ID: <36B9FB3E.EA081676@cup.hp.com>
Date: Thu, 04 Feb 1999 11:55:43 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <Roam.SIMC.2.0.6.918154382.12573.kcpoon@jurassic>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Kacheong Poon wrote:
> 
> > No, I'm saying that the last byte did not wait for the ack to go out -
> > the Nagle check was applied to the "send" as a whole, not the individual
> > segments that made up the "send."
> 
> So this is an optimization your stack does.  

I guess I've been so used to this behaviour over the last N years that I
thought it was present in all stacks :) Silly me :)

rick
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 16:45:14 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA15928
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 16:45:13 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA08633; Thu, 4 Feb 1999 15:22:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA08086; Thu, 4 Feb 1999 15:21:34 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id NAA27954
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 4 Feb 1999 13:21:33 -0700 (MST)
Date: Thu, 4 Feb 1999 13:21:33 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902042021.NAA27954@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

] From: braden@ISI.EDU

] Yes. Your definition of MSG_PUSH seems to exactly match the PUSH bit in
] the TCP spec (FINALLY!!!!!!!).

As I tried to say earlier, it's almost 20 years too late for such
considerations.  The PUSH bit irrevocably means what it now means,
as useless as that might be.  Or in other words, you can't start
saying tomorrow that a write without the MSG_PUSH option will be
arbitrarily delayed.  It might have been really swell to have the
MSG_PUSH bit in ~1981 when sockets were defined, but since none of
us are of the Wishful Thinking School of Engineering, ...

And one writev() is vastly better than two writes, no matter how many
bits you add to the API.   You could consider the BSD definition of
the PUSH bit as a stupid tax.  People who use two write()'s when one
is possible are likely to have designs and code that won't notice
the additional slow down from Nagle+delayed Ack.  (From my experience,
that's empirically true.)

 ....................


> From: David Borman <dab@BSDI.COM>

> ...
> The name doesn't matter, e.g. MSG_NDELAY would work as well.
>
> >   - I bet that most applications that now need to set TCP_NODELAY
> >      would still need to use TCP_NODELAY.  In other words, what could you
> >      do with MSG_PUSH that you could not do more efficiently (i.e. with
> >      fewer CPU cycles) with writev()?
>
> First, the combination of modifying Nagle and adding MSG_PUSH
> is what I would recommend, not just MSG_PUSH.  The combination
> of the two would eliminate most of the need for TCP_NODELAY.

Maybe so, but I've doubts.   My current application would still need
TCP_NODELAY or would set MSG_NDELAY on every write, which I think amounts
to the same thing on the wire.

> Secondly, yes, sending the transaction in a single writev() is
> by far the preferred method.

I agree that the MSG_NDELAY bit would be a nice supplement for writev(),
when it is hard to get all of the buffers together for a writev().
It would also be a better way than TCP_NODELAY to patch code from the many
people that don't realize why write-write is a bad idea.   TCP_NODELAY is
a big hammer unless you really know there are no write-write sequences in
which the first write should be delayed.


> The usual need for using TCP_NODELAY is to get around Nagle, due
> to it interacting poorly with delayed acks.
>
> 1) With the modification to Nagle, then for a single request sent
>    in a single writev(), no matter how large the request, there would
>    be no need for TCP_NODELAY.
>
> 2) If a request is sent via multiple writes, then using MSG_PUSH
>    on the last write will eliminate the need for TCP_NODELAY.
>
> 3) If an application is sending multiple requests with a separate
>    write for each request, with or without the Nagle change, the
>    trailing data will get delayed.  Using MSG_PUSH on the last write
>    will eliminate that.
>
> The point is that rather than bypassing Nagle in all instances,
> using MSG_PUSH you can bypass Nagle at the specific points where
> you need to.

I really like the Nagle algorithm, but how many applications really need
it today?  Among carefully written applications (i.e. use writev or at
least never more than 1 write/request) that might use MSG_NDELAY, how
many would not get exactly the same results by TCP_NODELAY?

The biggest trouble with MSG_NDELAY is the same as with writev().
You'd have it where you don't really need it (e.g. BSD systems)
but not where you do need it (e.g. winsock).


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 18:15:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA21668
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 18:15:23 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA06132; Thu, 4 Feb 1999 16:52:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA04973; Thu, 4 Feb 1999 16:50:45 -0500 (EST)
Received: from Eng.Sun.COM (engmail1 [129.146.1.13]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id NAA08991 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 13:50:44 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id NAA09308
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 13:50:42 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id NAA14789
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 13:50:39 -0800 (PST)
Date: Thu, 4 Feb 1999 13:50:39 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902041707.LAA02391@frantic.bsdi.com>
Message-ID: <Roam.SIMCSD.2.0.4.918165039.2813.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> 	If a TCP has less than a full-sized packet to transmit,
>       and if the last in-sequence packet sent was less than a
> 	full-sized packet and has not yet been acknowledged, do
> 	not transmit a packet.

I second this change.  This implies that in the TCP context, if an
application does a single send, TCP will send out all available data.
I think Rick Jones will also agree with this as this is what his stack
does.  In case people are interested, Solaris's TCP stack also does
this.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 19:17:00 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA24137
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 19:16:59 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA17838; Thu, 4 Feb 1999 18:02:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA15973; Thu, 4 Feb 1999 18:00:02 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id RAA00547;
	Thu, 4 Feb 1999 17:00:01 -0600 (CST)
Date: Thu, 4 Feb 1999 17:00:01 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902042300.RAA00547@frantic.bsdi.com>
To: tcp-impl@lerc.nasa.gov, vjs@calcite.rhyolite.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Thu, 4 Feb 1999 13:21:33 -0700 (MST)
> From: Vernon Schryver <vjs@calcite.rhyolite.com>
>
> ] From: braden@ISI.EDU
>
> ] Yes. Your definition of MSG_PUSH seems to exactly match the PUSH bit in
> ] the TCP spec (FINALLY!!!!!!!).
>
> As I tried to say earlier, it's almost 20 years too late for such
> considerations.  The PUSH bit irrevocably means what it now means,
> as useless as that might be.  Or in other words, you can't start
> saying tomorrow that a write without the MSG_PUSH option will be
> arbitrarily delayed.

I think there is a misunderstanding here.  Setting MSG_PUSH means
push this data through (and set the PUSH flag when it gets to TCP).
Lack of MSG_PUSH does not mean that the data is not sent, nor
that the PUSH flag will not be set.  At least that's my thinking.
But this may just be support for your argument to call it something
other than MSG_PUSH.

> ...
> > First, the combination of modifying Nagle and adding MSG_PUSH
> > is what I would recommend, not just MSG_PUSH.  The combination
> > of the two would eliminate most of the need for TCP_NODELAY.
>
> Maybe so, but I've doubts.   My current application would still need
> TCP_NODELAY or would set MSG_NDELAY on every write, which I think amounts
> to the same thing on the wire.

Well, I've tossed around the idea of MSG_PUSH off and on for many years,
and the fact that I haven't done anything about it sort of says that I
haven't even convinced myself that it is really needed.

I do think the change to Nagle is good (supported by the fact that
I implemented it in BSD/OS over 2 years ago).

>...
> I agree that the MSG_NDELAY bit would be a nice supplement for writev(),
> when it is hard to get all of the buffers together for a writev().
> It would also be a better way than TCP_NODELAY to patch code from the many
> people that don't realize why write-write is a bad idea.   TCP_NODELAY is
> a big hammer unless you really know there are no write-write sequences in
> which the first write should be delayed.

The question is whether there are enough real-world situations where
this would be worthwhile.

> > The point is that rather than bypassing Nagle in all instances,
> > using MSG_PUSH you can bypass Nagle at the specific points where
> > you need to.
>
> I really like the Nagle algorithm, but how many applications really need
> it today?  Among carefully written applications (i.e. use writev or at
> least never more than 1 write/request) that might use MSG_NDELAY, how
> many would not get exactly the same results by TCP_NODELAY?

Another question is, if you have the modification to Nagle, how
many of those carefully written applications that use TCP_NODELAY
could drop use of it?  Or would they still need TCP_NODELAY/MSG_NDELAY
because they send multiple requests without waiting for a response
to the first request?

> The biggest trouble with MSG_NDELAY is the same as with writev().
> You'd have it where you don't really need it (e.g. BSD systems)
> but not where you do need it (e.g. winsock).

Patient: Doctor, it hurts when I do this.
Doctor: Then don't do that.
:-)

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 19:59:48 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA25537
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 19:59:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA04302; Thu, 4 Feb 1999 18:32:37 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA03279; Thu, 4 Feb 1999 18:30:55 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA01201
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 4 Feb 1999 16:30:54 -0700 (MST)
Date: Thu, 4 Feb 1999 16:30:54 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902042330.QAA01201@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: David Borman <dab@BSDI.COM>

> > As I tried to say earlier, it's almost 20 years too late for such
> > considerations.  The PUSH bit irrevocably means what it now means,

> I think there is a misunderstanding here.  Setting MSG_PUSH means
> push this data through (and set the PUSH flag when it gets to TCP).
> Lack of the MSG_PUSH does not mean that the data is not sent, nor
> that the PUSH flag will not be set.  At least that's my thinking.
> But this may just be support for your argument to call it something
> else than MSG_PUSH.

I think we agree.


> ...
> > Maybe so, but I've doubts. ...

> Well, I've tossed around the idea of MSG_PUSH off and on for many years,
> and the fact that I haven't done anything about it sort of says that I
> haven't even convinced myself that it is really needed.

I had thoughts along those lines, but forgot to be offensive by mentioning
them.


> I do think the change to Nagle is good (supported by the fact that
> I implemented it in BSD/OS over 2 years ago).

Yes.
Although I'm not convinced that BSD/OS sees really high traffic or as
many installations as other systems.  Yes, BSD/OS is used on lots of
web servers (and others such as this box), but I think the really big
servers use boxes bigger than you can get with 80x86 CPUs.  The HP
report is the overwhelming endorsement.


> ...
> The question is are there enough real-world situations where this
> would be worth-while.

Attempts to change ancient, extremely popular APIs are awfully painful
for everyone, and often unsuccessful in various ways.  Winsock 2.0 is an
example of how more or less necessary changes can get out of hand.  (I've
found the complications around select, non-blocking I/O, and WIN32 messages
in winsock 2.0 ... engrossing.)


> Another question is, if you have the modification to Nagle, how
> many of those carefully written applications that use TCP_NODELAY
> could drop use of it?  Or would they still need TCP_NODELAY/MSG_NDELAY
> because they send multiple requests without waiting for a response
> to the first request?

Maybe I don't understand the proposed modification, but I have the
impression that the answer to that question is "most of them."  It
seems the modification helps only two sets of cases:

    1. bad code that does write-write-read-write-write-read-...
    2. odd cases such as write(>1 segment but not a MByte)

If you have an application that involves lots of small random writes mixed
with random reads, then TCP_NODELAY/MSG_NDELAY is unavoidable.  Perhaps
X clients, as someone said.  Certainly the code I'm currently bashing.
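
For concreteness, the two existing tools mentioned here can be sketched
as follows: writev() to hand a header and body to the stack in a single
call (avoiding write-write-read), and TCP_NODELAY as the per-socket
switch.  The function names send_request and set_nodelay are made up
for this example.  MSG_PUSH and MSG_NDELAY are only proposals in this
thread, so they are not used.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Queue header and body with one system call, so the stack sees a
 * single write instead of two.  Returns bytes queued, or -1 on error. */
static ssize_t send_request(int fd, const void *hdr, size_t hlen,
                            const void *body, size_t blen)
{
    struct iovec iov[2];

    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;   /* writev() does not modify the data */
    iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = blen;
    return writev(fd, iov, 2);
}

/* The "big hammer": disable Nagle (on != 0) or re-enable it (on == 0)
 * for every write on this socket.  Returns 0 on success, -1 on error. */
static int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
```

An application with truly random small writes would still call
set_nodelay(fd, 1); one that merely splits a request across two writes
could switch to send_request() and leave Nagle enabled.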


> > The biggest trouble with MSG_NDELAY is the same as with writev().
> > You'd have it where you don't really need it (e.g. BSD systems)
> > but not where you do need it (e.g. winsock).
>
> Patient: Doctor, it hurts when I do this.
> Doctor: Then don't do that.
> :-)

"If wishes were horses, then beggars would ride"


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 20:49:03 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA27104
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 20:49:03 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA08548; Thu, 4 Feb 1999 19:35:45 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA05209; Thu, 4 Feb 1999 19:29:16 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 426;
          Thu, 4 Feb 1999 11:07:08 -0800
Message-ID: <36B9EFD5.D90FABF1@ehsco.com>
Date: Thu, 04 Feb 1999 11:07:01 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: David Borman <dab@BSDI.COM>
CC: minshall@siara.com, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902041707.LAA02391@frantic.bsdi.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> whereas the current Nagle code would produce:
> 
>         1st write:      send 1460 bytes  (defer 40 bytes)
>         2nd write:      send 40 + 1420 = 1460 bytes (defer 80 bytes)
>         3rd write:      send 80 + 1380 = 1460 bytes  (defer 120 bytes)
>                         (wait for ACK or timeout)
>                         send 120 bytes
> 
> and the proposed Nagle (and the current BSD/OS) would produce:
> 
>         1st write:      send 1460 bytes
>                         send 40 bytes
>         2nd write:      send 1460 bytes (defer 40 bytes)
>         3rd write       send 40 + 1420 = 1460 bytes (defer 80 bytes)
>                         (wait for ACK or timeout)
>                         send 80 bytes

This is at least one extra send operation, which can't be a good thing,
particularly when the network is over-utilized (or near to it) already.
While this isn't as bad as Rick's example (1461-byte writes), it is
still bad utilization. (I'd hate to own the network that did both.)

I'd say that based on this info, it is better to just disable Nagle if
your application is going to be writing small (1.5 segments) chunks, and
leaving it enabled for applications that do large writes, rather than
introducing lots of new frames to the network.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 20:49:24 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA27125
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 20:49:24 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA04206; Thu, 4 Feb 1999 19:27:35 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA02143; Thu, 4 Feb 1999 19:23:48 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id QAA22524 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 16:23:47 -0800
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id QAA24417 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 16:23:37 -0800
Received: from locked.eng.sun.com (locked [129.146.85.189])
	by jurassic.eng.sun.com (8.9.3.Beta0+Sun/8.9.3.Beta0) with ESMTP id QAA10761
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 16:23:39 -0800 (PST)
Received: (from mohanp@localhost)
	by locked.eng.sun.com (8.9.1b+Sun/8.9.1) id QAA00546
	for tcp-impl@lerc.nasa.gov; Thu, 4 Feb 1999 16:22:56 -0800 (PST)
Date: Thu, 4 Feb 1999 16:22:56 -0800 (PST)
From: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM>
Message-Id: <199902050022.QAA00546@locked.eng.sun.com>
To: tcp-impl@lerc.nasa.gov
Subject: iss calculation during TIME_WAIT ressurection ?
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

RFC1122 states that

	    When a connection is closed actively, it MUST linger in
            TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime).
            However, it MAY accept a new SYN from the remote TCP to
            reopen the connection directly from TIME-WAIT state, if it:

            (1)  assigns its initial sequence number for the new
                 connection to be larger than the largest sequence
                 number it used on the previous connection incarnation,

Looking at BSD code, it used the "rcv_nxt" field to calculate the new
iss :

	if (tiflags & TH_SYN &&
	    tp->t_state == TCPS_TIME_WAIT &&
	    SEQ_GT(ti->ti_seq, tp->rcv_nxt)) {
		iss = tp->rcv_nxt + TCP_ISSINCR;
		tp = tcp_close(tp);
		goto findpcb;
	}

Is this a bug in the TCP code ? Should it not use snd_nxt ?

thanks
-mohan


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 21:04:20 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA27661
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 21:04:20 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA07052; Thu, 4 Feb 1999 19:32:36 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA05175; Thu, 4 Feb 1999 19:29:11 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 381;
          Thu, 4 Feb 1999 09:14:48 -0800
Message-ID: <36B9D582.1BDF7B49@ehsco.com>
Date: Thu, 04 Feb 1999 09:14:43 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Vernon Schryver <vjs@calcite.rhyolite.com>
CC: tcp-impl@lerc.nasa.gov, Greg Minshall <minshall@siara.com>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902041440.HAA21740@calcite.rhyolite.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> I have such an application, but it wouldn't be helped by the proposal
> because its naughty writes are too small.  It still needs TCP_NODELAY.

Maybe this is an irrelevant thread, then. Obviously there will be some
apps that would still need to disable the modified Nagle. That does not
mean the proposal is invalid, but probably means that the question I
asked is irrelevant.

I suppose the question ought to be: how would the modified Nagle affect
everyday applications? I can't really think of any situations in which
the proposal would hurt throughput. I'd like to see some testing of
typical apps to verify it though. Greg, do you have data showing
different kinds of apps over extended periods that are benefitted by
this, or that (more important) are hurt by it?

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb  4 23:00:30 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA03004
	for <tcpimpl-archive@lists.ietf.org>; Thu, 4 Feb 1999 23:00:29 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA14071; Thu, 4 Feb 1999 21:37:37 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA12197; Thu, 4 Feb 1999 21:33:45 -0500 (EST)
Received: from Eng.Sun.COM (engmail4 [129.144.134.6]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA18891 for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 18:33:43 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id SAA04250
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 18:33:42 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id SAA14901
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Feb 1999 18:33:42 -0800 (PST)
Date: Thu, 4 Feb 1999 18:33:42 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36B9EFD5.D90FABF1@ehsco.com>
Message-ID: <Roam.SIMCSD.2.0.4.918182022.18810.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> This is at least one extra send operation, which can't be a good thing,
> particularly when the network is over-utilized (or near to it) already.
> While this isn't as bad as Rick's example (1461-byte writes), it is
> still bad utilization. (I'd hate to own the network that did both.)

Let's look at the problem from an application programmer's point of view.
You are writing a network application, and the transfer pattern can be
anything.  You notice a performance problem and find that disabling the
Nagle algorithm makes it go away.  What will you do in the production code?

With the changed Nagle algorithm, you will not notice any performance
problem for certain kinds of transfer patterns, so you will not turn the
algorithm off.  It may still kick in sometimes and help the network; you
just see a small impact from time to time.

The Nagle algorithm helps a lot with telnet-like traffic.  This we should
not change.  But for some kinds of transfer patterns, it helps a lot if we
relax the algorithm.  By doing this, we encourage application programmers
not to disable the algorithm.  IMHO, this is what the proposed draft is
about.  We do not want the algorithm to be disabled in general.  We are
creating an incentive for programmers not to do that.  At the same time, we
do not adversely affect the network.  Imagine what will happen if every
application programmer disables the algorithm "by default."

> I'd say that based on this info, it is better to just disable Nagle if
> your application is going to be writing small (1.5 segments) chunks, and
> leaving it enabled for applications that do large writes, rather than
> introducing lots of new frames to the network.

The problem is that an application can turn the algorithm off at any time,
regardless of the transfer pattern.  And the proposed change is not to
disable the algorithm, but to relax it in some cases.  I'd rather see a
slight increase in network traffic in some cases than have most
applications turn off the algorithm by default.

Solaris's TCP stack has had some form of the proposed change for many
years.  I have not heard of customer complaints because of it.  At the very
least, there was no network meltdown...

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 09:56:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id JAA29493
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 09:56:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id HAA02605; Fri, 5 Feb 1999 07:57:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from pc-jcs.coded.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id HAA01559; Fri, 5 Feb 1999 07:55:57 -0500 (EST)
From: jsnader@ix.netcom.com
Received: (from jcs@localhost)
	by pc-jcs.coded.com (8.8.5/8.8.5) id HAA23050;
	Fri, 5 Feb 1999 07:55:06 -0500 (EST)
Message-ID: <19990205075505.11619@ix.netcom.com>
Date: Fri, 5 Feb 1999 07:55:05 -0500
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <Roam.SIMC.2.0.6.918154382.12573.kcpoon@jurassic> <36B9FB3E.EA081676@cup.hp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.74e
In-Reply-To: <36B9FB3E.EA081676@cup.hp.com>; from Rick Jones on Thu, Feb 04, 1999 at 11:55:43AM -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Thu, Feb 04, 1999 at 11:55:43AM -0800, Rick Jones wrote:
> Kacheong Poon wrote:
> > 
> > So this is an optimization your stack does.  
> 
> I guess I've been so used to this behaviour over the last N years that I
> thought it was present in all stacks :) Silly me :)
> 

BSD (at least since 4.4BSD--I haven't checked the earlier code)
does the same thing.  Since Kacheong reports that Solaris also does
this, it looks as if this is common behavior.

Jon Snader


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 12:34:21 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA05590
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 12:34:20 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA03820; Fri, 5 Feb 1999 11:07:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA01359; Fri, 5 Feb 1999 11:04:06 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id KAA01524;
	Fri, 5 Feb 1999 10:04:00 -0600 (CST)
Date: Fri, 5 Feb 1999 10:04:00 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902051604.KAA01524@frantic.bsdi.com>
To: Mohan.Parthasarathy@eng.Sun.COM, tcp-impl@lerc.nasa.gov
Subject: Re: iss calculation during TIME_WAIT ressurection ?
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Thu, 4 Feb 1999 16:22:56 -0800 (PST)
> From: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM>
> Subject: iss calculation during TIME_WAIT ressurection ?
> ...
> Looking at BSD code, it used the "rcv_nxt" field to calculate the new
> iss :
>
> 			if (tiflags & TH_SYN &&
>                             tp->t_state == TCPS_TIME_WAIT &&
>                             SEQ_GT(ti->ti_seq, tp->rcv_nxt)) {
>                                 iss = tp->rcv_nxt + TCP_ISSINCR;
>                                 tp = tcp_close(tp);
>                                 goto findpcb;
>                         }
>
> Is this a bug in the TCP code ? Should it not use snd_nxt ?

Yes.  This was fixed in the 4.4BSD code.
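
A minimal sketch of the check with the new ISS derived from snd_nxt, as
the question suggests.  The struct, the function name, and the value of
ISS_INCR are assumptions made for illustration; this is not the actual
BSD source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ISS_INCR 64000u  /* placeholder standing in for TCP_ISSINCR */

/* Modular comparison of 32-bit sequence numbers, in the style of SEQ_GT. */
static bool seq_gt(uint32_t a, uint32_t b) { return (int32_t)(a - b) > 0; }

/* Hypothetical subset of the control block of the old incarnation. */
struct old_conn {
    uint32_t rcv_nxt;  /* next sequence number expected from the peer */
    uint32_t snd_nxt;  /* next sequence number we would have sent */
};

/* Decide whether a SYN carrying syn_seq may reopen a connection that is
 * in TIME_WAIT.  On success, *iss is set above our own old send space. */
static bool reopen_iss(const struct old_conn *c, uint32_t syn_seq,
                       uint32_t *iss)
{
    assert(c != NULL && iss != NULL);
    if (!seq_gt(syn_seq, c->rcv_nxt))
        return false;              /* SYN is not beyond the peer's old data */
    *iss = c->snd_nxt + ISS_INCR;  /* larger than anything we sent before */
    return true;
}
```

The peer's SYN is still compared against rcv_nxt, since that is the
peer's sequence space; only the ISS we choose is based on our own
snd_nxt.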

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 12:45:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA05752
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 12:45:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA10341; Fri, 5 Feb 1999 11:17:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mail4.microsoft.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA09391; Fri, 5 Feb 1999 11:16:10 -0500 (EST)
Received: by mail4.microsoft.com with Internet Mail Service (5.5.2524.0)
	id <D8W1QS2Q>; Fri, 5 Feb 1999 08:16:10 -0800
Message-ID: <3FF8121C9B6DD111812100805F31FC0D0CAE860F@RED-MSG-59>
From: Art Shelest <artshel@microsoft.com>
To: tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Date: Fri, 5 Feb 1999 08:16:04 -0800 
X-Mailer: Internet Mail Service (5.5.2524.0)
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Why not parameterize the Nagle algorithm?

Instead of "allow only one unacknowledged segment,"
allow "N unacknowledged segments/bytes" and
specify a recommended range of values.

N=2 would be similar to the current proposal and would also
fix the case where the application follows the
"small write, small write, read" pattern,
which is not addressed by the proposed draft.

N=2 seems like a good choice, because receivers
typically ACK every other packet.




From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 13:18:35 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id NAA06516
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 13:18:34 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA14282; Fri, 5 Feb 1999 12:07:47 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA13565; Fri, 5 Feb 1999 12:06:41 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id LAA01625;
	Fri, 5 Feb 1999 11:06:35 -0600 (CST)
Date: Fri, 5 Feb 1999 11:06:35 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902051706.LAA01625@frantic.bsdi.com>
To: ehall@ehsco.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Cc: tcp-impl@lerc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Thu, 04 Feb 1999 11:07:01 -0800
> From: "Eric A. Hall" <ehall@ehsco.com>
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
>
> > whereas the current Nagle code would produce:
> > 
> >         1st write:      send 1460 bytes  (defer 40 bytes)
> >         2nd write:      send 40 + 1420 = 1460 bytes (defer 80 bytes)
> >         3rd write:      send 80 + 1380 = 1460 bytes  (defer 120 bytes)
> >                         (wait for ACK or timeout)
> >                         send 120 bytes
> > 
> > and the proposed Nagle (and the current BSD/OS) would produce:
> > 
> >         1st write:      send 1460 bytes
> >                         send 40 bytes
> >         2nd write:      send 1460 bytes (defer 40 bytes)
> >         3rd write       send 40 + 1420 = 1460 bytes (defer 80 bytes)
> >                         (wait for ACK or timeout)
> >                         send 80 bytes
>
> This is at least one extra send operation, which can't be a good thing,
> particularly when the network is over-utilized (or near to it) already.
> While this isn't as bad as Rick's example (1461-byte writes), it is
> still bad utilization. (I'd hate to own the network that did both.)
>
> I'd say that based on this info, it is better to just disable Nagle if
> your application is going to be writing small (1.5 segments) chunks, and
> leaving it enabled for applications that do large writes, rather than
> introducing lots of new frames to the network.

But turning off Nagle is even worse.  You then get:
	1st write:	send 1460 bytes
			send 40 bytes
	2nd write:	send 1460 bytes
			send 40 bytes
	3rd write:	send 1460 bytes
			send 40 bytes

Also, I might not have said it, but this example was to show
3 separate transaction requests, not 1 transaction done with
3 writes.  If it were one request, it should really be done with
a single 4500 byte write, yielding:

	4500 write:	send 1460 bytes
			send 1460 bytes
			send 1460 bytes
			(wait for ACK or timeout)
			send 120 bytes
the difference being that with the modified Nagle algorithm, the
wait for ACK or timeout would be skipped.
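
The segment sequences above can be reproduced with a small model.  This
is only a sketch: the sender struct and its functions are invented for
the example, "classic" means the strict rule used in the traces above
(hold any partial segment while anything is unacknowledged), and ACKs
are modeled as arriving all at once.

```c
#include <assert.h>
#include <stddef.h>

enum policy { NAGLE_OFF, NAGLE_CLASSIC, NAGLE_MODIFIED };

/* Hypothetical sender model; it mirrors no real stack's layout. */
struct sender {
    enum policy policy;
    unsigned mss;         /* payload size of a full-sized segment */
    unsigned queued;      /* bytes written but not yet sent */
    unsigned in_flight;   /* segments sent but not yet acknowledged */
    int small_in_flight;  /* nonzero if a small segment is unacknowledged */
    unsigned sent[16];    /* sizes of the segments sent so far */
    size_t nsent;
};

/* Put one segment of len bytes on the wire. */
static void emit(struct sender *s, unsigned len)
{
    assert(s->nsent < sizeof s->sent / sizeof s->sent[0]);
    s->sent[s->nsent++] = len;
    s->queued -= len;
    s->in_flight++;
    if (len < s->mss)
        s->small_in_flight = 1;
}

/* Send whatever the policy currently allows. */
static void try_send(struct sender *s)
{
    assert(s->mss > 0);
    while (s->queued >= s->mss)
        emit(s, s->mss);          /* full-sized segments always go */
    if (s->queued == 0)
        return;
    if (s->policy == NAGLE_CLASSIC && s->in_flight > 0)
        return;                   /* anything unacknowledged holds the tail */
    if (s->policy == NAGLE_MODIFIED && s->small_in_flight)
        return;                   /* only an unacked small segment holds it */
    emit(s, s->queued);
}

/* The application writes len bytes. */
static void app_write(struct sender *s, unsigned len)
{
    s->queued += len;
    try_send(s);
}

/* Treat everything in flight as acknowledged, then retry sending. */
static void ack_all(struct sender *s)
{
    s->in_flight = 0;
    s->small_in_flight = 0;
    try_send(s);
}
```

Three 1500-byte writes with mss = 1460 give 3 segments before the ACK
under NAGLE_CLASSIC, 4 under NAGLE_MODIFIED, and 6 under NAGLE_OFF.  A
single 4500-byte write under NAGLE_MODIFIED sends all 4 segments without
waiting.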

		-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 14:31:09 1999
Received: from assateague.lerc.nasa.gov ([139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA07842
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 14:31:08 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA25712; Fri, 5 Feb 1999 13:07:44 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA25021; Fri, 5 Feb 1999 13:06:50 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id LAA20288
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 5 Feb 1999 11:06:43 -0700 (MST)
Date: Fri, 5 Feb 1999 11:06:43 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902051806.LAA20288@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Art Shelest <artshel@microsoft.com>

> Why not parametrize the Nagle?
>
> Instead of "allow only one unacknowledged segment"
> allow "allow N unacknowledged segments/bytes" and 
> specify recommended range of values.
>
> The N=2 would be similar to current proposal and also
> fix the case where application follows the 
> "small write, small write, read" operation, 
> which is not addressed by the proposed draft.
>
> The N=2 seems like a good choice, because receivers
> typically ACK every other packet.

How would you implement that?  You could record not merely the sequence
numbers of RFC 793 and the one additional sequence number of the draft,
but the starting sequence number of each of the last M segments, where M
is the maximum value of your N.  You'd probably want to record the sequence
numbers corresponding to the starts of application write requests instead
of segments on the wire.  On each write, you'd just cycle the stack of
previous sequence numbers.  For likely values of M, you wouldn't need more
than a couple 100 bytes for every TCB, and they could be maintained and
used with only a few 100 instructions at start, another few 100
instructions on every send.  However, we're not talking about typical
GUI bloatware, where adding cycles to every keystroke does not matter.
TCP/IP can now be done with a few dozen cycles total, so adding such
overhead would not be good.  I don't see how to implement a parameterized
Nagle algorithm in real life.

I am unhappy about the English descriptions of the current and modified
Nagle algorithms in the draft.  As I read it, the English description of
the modified Nagle algorithm is only slightly related to the later concrete
description.  Like talk of a parameterized Nagle scheme, the English in
the draft sounds nice until you try to figure out what it might really
mean, and then you realize it is at best fuzzy.  That is a Bad Thing(tm).
To fix this problem, I would delete the English description of the modified
algorithm and stick to the concrete talk of sequence numbers.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 15:37:33 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id PAA08952
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 15:37:33 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA16322; Fri, 5 Feb 1999 14:22:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel2.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA13926; Fri, 5 Feb 1999 14:19:14 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel2.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id OAA08742
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 14:19:07 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id LAA17607 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 11:19:10 -0800 (PST)
Message-ID: <36BB442E.FD5715EE@cup.hp.com>
Date: Fri, 05 Feb 1999 11:19:10 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902051806.LAA20288@calcite.rhyolite.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Are we arriving at the conclusion/consensus that the problem being
solved here (that a single send of say 1.5MSS bytes encounters a delay
in the second segment) has already been dealt with in a number of
production TCP stacks by interpreting Nagle on a "send" rather than a
"segment" basis? And that a pair of 0.75 MSS sends would still encounter
the "son of Nagle" interaction?

rick jones

PS where can I find the original "chapter and verse" on Nagle?

PPS - As this thread continues, I've been in contact with one person who
inherited an application that disabled Nagle, but did not know why, nor
what the Nagle algorithm was. So, as part of "outreach" :) I went through
the 30 second description of what it did and how buggy applications might
interact with it and the ACK policies - he seems very pleased to learn
what was going on and thought he could remove the call. Is that perhaps
the root problem here - that people are writing apps without
understanding? 

I'm guessing that with the possible exception of a very small number of
applications (small, streaming transactions?), the disabling of Nagle is
really just a kludge, and that the networks as a whole would be best-off
if we got the applications fixed? If they were going to adopt the
secondary flush proposal in the RFC there would need to be app changes
anyway... so we are talking about doing outreach with the apps types
either way.

-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 15:54:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id PAA09248
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 15:54:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA03827; Fri, 5 Feb 1999 14:48:01 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA03156; Fri, 5 Feb 1999 14:46:59 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 5 Feb 1999 20:10:21 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108rD3-001xhHC; Fri, 5 Feb 1999 11:46:37 -0800 (PST)
Received: from ip220.san-francisco41.ca.pub-ip.psi.net ([38.28.91.220]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 5 Feb 1999 20:10:20 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA01318; Fri, 5 Feb 1999 11:47:18 -0800 (PST)
Message-Id: <199902051947.LAA01318@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Art Shelest <artshel@microsoft.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 05 Feb 1999 08:16:04 PST."
             <3FF8121C9B6DD111812100805F31FC0D0CAE860F@RED-MSG-59> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 11:47:17 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Art,

> Instead of "allow only one unacknowledged segment"
> allow "allow N unacknowledged segments/bytes" and 
> specify recommended range of values.

Mainly, i think, because there doesn't seem to be any need to do this.  Using 
the current or modified Nagle works for almost all applications.  Disabling 
Nagle may be required for a very small set of applications.

Greg Minshall


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 16:01:27 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09376
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 16:01:27 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA07592; Fri, 5 Feb 1999 14:52:44 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA03219; Fri, 5 Feb 1999 14:47:05 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 5 Feb 1999 20:10:27 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108rD0-001xhGC; Fri, 5 Feb 1999 11:46:34 -0800 (PST)
Received: from ip220.san-francisco41.ca.pub-ip.psi.net ([38.28.91.220]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 5 Feb 1999 20:10:17 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA01310; Fri, 5 Feb 1999 11:47:15 -0800 (PST)
Message-Id: <199902051947.LAA01310@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 04 Feb 1999 13:50:39 PST."
             <Roam.SIMCSD.2.0.4.918165039.2813.kcpoon@jurassic> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 11:47:15 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Dave Borman wrote a long message which i am still digesting.  However, 
Kacheong Poon quoted one part and let me answer just that:

> 	If a TCP has less than a full-sized packet to transmit,
>       and if the last insequence packet sent was less than a
> 	full-sized packet and has not yet been acknowledged, do
> 	not transmit a packet.

The problem is that if an application does a sequence of small and large 
writes, you send more than one small packet per RTT, which i'd rather avoid.

(All of this is complicated by the fact that most stacks, including the BSD 
stack, actually look at the size of the send() from the application rather 
than the size of packets transmitted over the wire.  I think this is 
reasonable in the sense that, really, applications don't have much hope of 
knowing what MSS might be at any particular point in time.  Others earlier in 
this thread have mentioned this, especially w.r.t. path MTU discovery).

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 16:02:33 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09400
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 16:02:32 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA03798; Fri, 5 Feb 1999 14:47:43 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA03109; Fri, 5 Feb 1999 14:46:34 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 5 Feb 1999 20:09:56 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108rCd-001xhEC; Fri, 5 Feb 1999 11:46:11 -0800 (PST)
Received: from ip220.san-francisco41.ca.pub-ip.psi.net ([38.28.91.220]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 5 Feb 1999 20:09:54 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA01285; Fri, 5 Feb 1999 11:46:47 -0800 (PST)
Message-Id: <199902051946.LAA01285@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Wed, 03 Feb 1999 13:03:33 PST."
             <36B8B9A5.4BBBEA81@ehsco.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 11:46:45 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

All and sundry,

I'm going to try to respond to several people in one (or more?) message(s).

Eric Hall and Joe Touch both ask, quite reasonably, why not just go ahead and 
disable Nagle and be done with it.

The reason is that Nagle is very effective at protecting the network from 
hordes of small packets.  One of the main reasons i am proposing this 
modification is to keep people from being tempted to disable Nagle.

I've seen a lot of cases where people wanted to turn off Nagle, but most of 
them have been because of not really understanding what is going on, 
especially with respect to buffering.

As an example, some colleagues of mine recently turned off Nagle on a Usenet 
server.  They saw article retrieval latency drop substantially, which was a 
good thing.  However, at first they did not notice that the number of packets 
being transmitted (for the same request stream) increased an order of 
magnitude (with the average packet size dropping from, basically, MSS to 140 
bytes!).  (It turned out that when they went back in, turned Nagle *back on*, 
and improved the buffering of the Usenet server, they got the reduced 
latencies *without* introducing any more load into the network.)

Thus, turning off Nagle is a very dangerous thing to do.  I wouldn't want lots 
of applications running with Nagle turned off; in fact, i wouldn't want any 
more applications turning off Nagle than already do.

Again, my hope is that by introducing this change, there will be less of a 
tendency for application writers to think "bad performance?  oh, i know, i'll 
disable Nagle!".


Eric Hall also asks (basically) could changing the Nagle algorithm in this way 
either make things worse for existing applications, or increase the load on 
the network for existing applications.  Here is about all i know how to say 
about that:  if an application is dribbling out data to TCP at a little over 1 
MSS worth of data per RTT, then with the original Nagle (and assuming the 
timing is exactly right) you might see one ``small'' packet, then a stream of 
big packets (about one per RTT), whereas with the modified Nagle (again, 
assuming the timing is exactly right) you might see one ``small'' packet and 
one big packet each RTT.


In a later mail, you say:

> the idea here is to avoid waiting for a timer right? specifically, the
> standalone ACK timer.

Actually, in some cases the RTT that Nagle imposes may be as significant as 
the delayed ACK timer.

> that would seem to indicate that if one allowed up to 2MSS of unacked,
> "small" sends and _then_ stopped one could be reasonably sure that an
> ACK would be arriving within one RTT of the 2MSSth byte being sent.
> 
> it would allow a "pathological" application to put up to 2MSS packets
> out there, but I'm guessing it would be cleaner than simply disabling
> nagle entirely?

This would probably be cleaner than disabling Nagle entirely, but i don't 
think this is a good idea.  Imagine 2*MSS of 1 byte packets!

The point is to get the entire request or response to the other side with as 
few packets as possible (each as large as possible).  Once it is there, then 
the other side can respond or send another request.

Greg Minshall


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 16:16:32 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09598
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 16:16:32 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA17694; Fri, 5 Feb 1999 15:07:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA17047; Fri, 5 Feb 1999 15:07:04 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 5 Feb 1999 20:30:26 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108rCw-001xhEC; Fri, 5 Feb 1999 11:46:30 -0800 (PST)
Received: from ip220.san-francisco41.ca.pub-ip.psi.net ([38.28.91.220]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 5 Feb 1999 20:10:12 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA01294; Fri, 5 Feb 1999 11:47:10 -0800 (PST)
Message-Id: <199902051947.LAA01294@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Wed, 03 Feb 1999 15:29:56 PST."
             <36B8DBF4.FBCB23FB@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 11:47:10 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

> Now, there may still be sufficient cases of reasonable applications
> that are not broken, but still "suffer" from Nagle, but I'm not sure
> (yet - I'm sure someone else will point some out :) what they are.
> _Maybe_ something to do with pipelined http requests generating sub
> MSS responses? 

If a response (could equally well be a request) sends an odd number of MSS 
packets, and then wants to send one small packet, Nagle will block until all 
the MSS packets have been acknowledged.  Because the response stream was an 
odd number of packets, the remote TCP will probably decide not to ACK for 
100ms or so, thus preventing the response from making it through.  So, the 
response will take RTT+100ms (approximately) longer than it could have taken.

So, this is the application where there is an argument to be made that Nagle 
(especially when coupled with delayed ACKs) is slowing things down somewhat.

With the proposed modification, the small packet will be allowed to follow the 
MSS packets (assuming the window, congestion stuff, etc.).
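
The difference between the two send tests can be sketched in C. This is
only an illustration under stated assumptions: the struct, the function
names, and the snd_sml field (one extra sequence number marking the end
of the last small segment sent) are hypothetical and are not taken from
any particular stack. Window and congestion checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence-space comparison that tolerates 32-bit wraparound. */
#define SEQ_GT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

struct conn {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;   /* next sequence number to be sent */
    uint32_t snd_sml;   /* assumed field: end of the last small segment sent */
    uint32_t mss;       /* effective send MSS */
};

/* Original Nagle: hold a small segment while any data is unacknowledged. */
bool nagle_holds(const struct conn *c, uint32_t len)
{
    return len < c->mss && SEQ_GT(c->snd_nxt, c->snd_una);
}

/* Modified test: hold a small segment only while an earlier small
 * segment is still unacknowledged. */
bool modified_nagle_holds(const struct conn *c, uint32_t len)
{
    return len < c->mss && SEQ_GT(c->snd_sml, c->snd_una);
}

/* Record that a segment of `len` bytes was transmitted. */
void note_sent(struct conn *c, uint32_t len)
{
    c->snd_nxt += len;
    if (len < c->mss)
        c->snd_sml = c->snd_nxt;    /* remember where the small one ended */
}
```

After three full-sized segments with nothing acknowledged, nagle_holds()
blocks a trailing 120-byte segment while modified_nagle_holds() lets it
go; a second small segment is then held by both tests until the first
one is acknowledged.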

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 16:26:54 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09726
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 16:26:53 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA24764; Fri, 5 Feb 1999 15:17:43 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Tandem.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA22821; Fri, 5 Feb 1999 15:14:21 -0500 (EST)
Received: from adm.loc201.tandem.com (adm.loc201.tandem.com [155.186.68.56])
	by Tandem.com (8.8.8/2.0.1) with SMTP id MAA20899;
	Fri, 5 Feb 1999 12:14:16 -0800 (PST)
Received: from tahoe.loc201.tandem.com by adm.loc201.tandem.com (4.1/6main.940209)
	id AA05723; Fri, 5 Feb 99 12:14:13 PST
Received: by tahoe.loc201.tandem.com (5.x/6leaf.940209)
	id AA21342; Fri, 5 Feb 1999 12:18:13 -0800
From: jayanth@loc201.tandem.com (vijayaraghavan_jayanth)
Message-Id: <9902052018.AA21342@tahoe.loc201.tandem.com>
Subject: Re: iss calculation during TIME_WAIT ressurection ?
To: dab@bsdi.com (David Borman)
Date: Fri, 5 Feb 1999 12:18:12 -0800 (PST)
Cc: tcp-impl@lerc.nasa.gov
In-Reply-To: <199902051604.KAA01524@frantic.bsdi.com> from "David Borman" at Feb 5, 99 10:04:00 am
X-Mailer: ELM [version 2.4 PL24]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


The FreeBSD code and the 4.4BSD-Lite do not seem to have this fix.

jayanth


> > Looking at BSD code, it used the "rcv_nxt" field to calculate the new
> > iss :
> >
> > 			if (tiflags & TH_SYN &&
> >                             tp->t_state == TCPS_TIME_WAIT &&
> >                             SEQ_GT(ti->ti_seq, tp->rcv_nxt)) {
> >                                 iss = tp->rcv_nxt + TCP_ISSINCR;
> >                                 tp = tcp_close(tp);
> >                                 goto findpcb;
> >                         }
> >
> > Is this a bug in the TCP code ? Should it not use snd_nxt ?
> 
> Yes.  This was fixed in the 4.4BSD code.
> 
> 			-David Borman, dab@bsdi.com
> 


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 16:31:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09824
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 16:31:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA28391; Fri, 5 Feb 1999 15:22:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA25384; Fri, 5 Feb 1999 15:18:15 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id OAA01909;
	Fri, 5 Feb 1999 14:09:25 -0600 (CST)
Date: Fri, 5 Feb 1999 14:09:25 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902052009.OAA01909@frantic.bsdi.com>
To: raj@cup.hp.com, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

> PS where can I find the original "chapter and verse" on Nagle?

You can find it in:
	"Congestion Control in IP/TCP," J. Nagle, RFC-896, January 1984.
and section 4.2.3.4 of RFC-1122.

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 17:05:11 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA10428
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 17:05:11 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA16573; Fri, 5 Feb 1999 15:48:59 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel2.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA14514; Fri, 5 Feb 1999 15:45:12 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel2.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id PAA08588
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 15:45:06 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id MAA17764 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 12:45:08 -0800 (PST)
Message-ID: <36BB5854.DACB5A24@cup.hp.com>
Date: Fri, 05 Feb 1999 12:45:08 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902051947.LAA01294@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Greg Minshall wrote:
> 
> Rick,
> 
> > Now, there may still be sufficient cases of reasonable applications
> > that are not broken, but still "suffer" from Nagle, but I'm not sure
> > (yet - I'm sure someone else will point some out :) what they are.
> > _Maybe_ something to do with pipelined http requests generating sub
> > MSS responses?
> 
> If a response (could equally well be a request) sends an odd number of MSS
> packets, and then wants to send one small packet, Nagle will block until all
> the MSS packets have been acknowledged.  Because the response stream was an
> odd number of packets, the remote TCP will probably decide not to ACK for
> 100ms or so, thus preventing the response from making it through.  So, the
> response will take RTT+100ms (approximately) longer than it could have taken.
> 
> So, this is the application where there is an argument to be made that Nagle
> (especially when coupled with delayed ACKs) is slowing things down somewhat.

But if that application is presenting that data in separate send calls,
and it was indeed data that was really one unit, is that application
broken or not? The whole issue here is for request/response applications
right? Which means that there is all the data of the "request" and then
all the data of the "response."

> With the proposed modification, the small packet will be allowed to follow the
> MSS packets (assuming the window, congestion stuff, etc.).

I guess part of my hangup is considering any transport that did not
interpret Nagle on a per"send" basis rather than a per segment basis as
broken to begin with - perhaps then I'm in violent agreement with the
spirit of the proposed change :) And with the sonofNagle, apps which are
broken and sending logically associated data as two sub-MSS sends are
still broken and still going to either get fixed by the app developer
(unlikely but good) or disable Nagle (likely and bad) anyway. 

I'm guessing (common and bad :) that most apps that disable nagle today
are like that - successive sub-mss sends of logically-associated data
synchronously awaiting a response/request. Those apps are only going to
be fixed with apps mods - either by getting all (the major) TCP's to
include the explicit flush mechanism, or by getting the app to do a
gathering write. If we are going to have to outreach to get apps modified
anyway, might as well get all the "rightness" into them we can and get
them onto gathering writes right?

rick jones

-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 17:25:50 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA10875
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 17:25:49 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA22446; Fri, 5 Feb 1999 15:57:42 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA21194; Fri, 5 Feb 1999 15:55:34 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id OAA01979;
	Fri, 5 Feb 1999 14:55:15 -0600 (CST)
Date: Fri, 5 Feb 1999 14:55:15 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902052055.OAA01979@frantic.bsdi.com>
To: jayanth@loc201.tandem.com
Subject: Re: iss calculation during TIME_WAIT ressurection ?
Cc: tcp-impl@lerc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Jayanth,

> The FreeBSD code and the 4.4BSD-Lite do not seem to have this fix.
>
> > > Looking at BSD code, it used the "rcv_nxt" field to calculate the new
> > > iss :
> > >
> > > 			if (tiflags & TH_SYN &&
> > >                             tp->t_state == TCPS_TIME_WAIT &&
> > >                             SEQ_GT(ti->ti_seq, tp->rcv_nxt)) {
> > >                                 iss = tp->rcv_nxt + TCP_ISSINCR;
> > >                                 tp = tcp_close(tp);
> > >                                 goto findpcb;
> > >                         }
> > >
> > > Is this a bug in the TCP code ? Should it not use snd_nxt ?
> > 
> > Yes.  This was fixed in the 4.4BSD code.

I don't know about FreeBSD, but the fix is there in the 4.4BSD-Lite
code:
                        if (tiflags & TH_SYN &&
                            tp->t_state == TCPS_TIME_WAIT &&
                            SEQ_GT(ti->ti_seq, tp->rcv_nxt)) {
                                iss = tp->snd_nxt + TCP_ISSINCR;
                                tp = tcp_close(tp);
                                goto findpcb;
                        }

Note that setting iss now uses tp->snd_nxt instead of tp->rcv_nxt.

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 17:31:45 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA11105
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 17:31:45 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA12292; Fri, 5 Feb 1999 16:27:45 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from pc-jcs.coded.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA11161; Fri, 5 Feb 1999 16:25:40 -0500 (EST)
From: jsnader@ix.netcom.com
Received: (from jcs@localhost)
	by pc-jcs.coded.com (8.8.5/8.8.5) id QAA23709;
	Fri, 5 Feb 1999 16:26:49 -0500 (EST)
Message-ID: <19990205162649.62256@ix.netcom.com>
Date: Fri, 5 Feb 1999 16:26:49 -0500
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902051806.LAA20288@calcite.rhyolite.com> <36BB442E.FD5715EE@cup.hp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.74e
In-Reply-To: <36BB442E.FD5715EE@cup.hp.com>; from Rick Jones on Fri, Feb 05, 1999 at 11:19:10AM -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Fri, Feb 05, 1999 at 11:19:10AM -0800, Rick Jones wrote:
> 
> PS where can I find the original "chapter and verse" on Nagle?
> 
RFC 896 (the original RFC on the Nagle algorithm) says:

 The solution is to inhibit the sending of new TCP  segments  when
 new  outgoing  data  arrives  from  the  user  if  any previously
 transmitted data on the connection remains unacknowledged.   This
 inhibition  is  to be unconditional; no timers, tests for size of
 data received, or other conditions are required.   Implementation
 typically requires one or two lines inside a TCP program.

RFC 1122 (Host Requirements) says:

 If there is unacknowledged data (i.e., SND.NXT >
 SND.UNA), then the sending TCP buffers all user
 data (regardless of the PSH bit), until the
 outstanding data has been acknowledged or until
 the TCP can send a full-sized segment (Eff.snd.MSS
 bytes; see Section 4.2.2.6).

As we have seen, many implementations honor neither of these.
I don't know of any other RFC citations. Anyone?

Jon Snader


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 18:21:01 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA12638
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 18:21:01 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA08175; Fri, 5 Feb 1999 17:07:43 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mail5.microsoft.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA07632; Fri, 5 Feb 1999 17:06:49 -0500 (EST)
Received: by INET-IMC-05 with Internet Mail Service (5.5.2524.0)
	id <D8W7J162>; Fri, 5 Feb 1999 14:06:48 -0800
Message-ID: <3FF8121C9B6DD111812100805F31FC0D0CAE8620@RED-MSG-59>
From: Art Shelest <artshel@microsoft.com>
To: "'Vernon Schryver'" <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Date: Fri, 5 Feb 1999 14:06:41 -0800 
X-Mailer: Internet Mail Service (5.5.2524.0)
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon Schryver wrote:

> How would you implement that?  

It's a valid question; however, I'd suggest that the benefit of the
algorithm be discussed before its perceived implementation complexity.

If the recommended value for N is 2, the implementation is trivial: it
consumes an extra 2-4 bytes* per TCB, and the overhead is 1-2 extra
instructions per send.

*One can store the beginning sequence number of the last packet (4 bytes)
or the size of the last segment (2 bytes). If storing the last segment
size, the Nagle check (the condition under which new data is held)
would change:
N=1 (today):	SND.NXT > SND.UNA
N=2:		SND.NXT - SND.SIZ > SND.UNA
where SND.SIZ is the last segment size.
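
A minimal sketch of the two checks, assuming 32-bit sequence numbers
compared modulo 2^32. The function names and the seq_gt() helper are
hypothetical; only SND.NXT, SND.UNA and SND.SIZ come from the text above.

```c
#include <stdint.h>

/* Nonzero if sequence number a is strictly after b, modulo 2^32.
 * The cast assumes the usual two's-complement conversion. */
static int seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* N=1 (today): hold new data while anything is unacknowledged. */
int nagle1_must_hold(uint32_t snd_una, uint32_t snd_nxt)
{
    return seq_gt(snd_nxt, snd_una);
}

/* N=2: hold only if data older than the last segment is still
 * unacknowledged, i.e. SND.NXT - SND.SIZ > SND.UNA. */
int nagle2_must_hold(uint32_t snd_una, uint32_t snd_nxt, uint16_t snd_siz)
{
    return seq_gt(snd_nxt - snd_siz, snd_una);
}
```

With one 100-byte segment outstanding (SND.UNA=1000, SND.NXT=1100,
SND.SIZ=100), the N=1 check holds the next small write and the N=2 check
lets it go.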

Back to the real issue: would allowing N=2 unacknowledged segments
resolve the effects of the Nagle/Delayed ACK interaction?

Cheers,

    _Art Shelest.

-----Original Message-----
From: Vernon Schryver [mailto:vjs@calcite.rhyolite.com]
Sent: Friday, February 05, 1999 10:07 AM
To: tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm


> From: Art Shelest <artshel@microsoft.com>

> Why not parametrize the Nagle?
>
> Instead of "allow only one unacknowledged segment"
> allow "allow N unacknowledged segments/bytes" and 
> specify recommended range of values.
>
> The N=2 would be similar to current proposal and also
> fix the case where application follows the 
> "small write, small write, read" operation, 
> which is not addressed by the proposed draft.
>
> The N=2 seems like a good choice, because receivers
> typically ACK every other packet.

How would you implement that?  You could record not merely the sequence
numbers of RFC 793 and the one additional sequence number of the draft,
but the starting sequence number of each of the last M segments, where M
is the maximum value of your N.  You'd probably want to record the sequence
numbers corresponding to the starts of application write requests instead
of segments on the wire.  On each write, you'd just cycle the stack of
previous sequence numbers.  For likely values of M, you wouldn't need more
than a couple 100 bytes for every TCB, and they could be maintained and
used with only a few 100 instructions at start, another few 100
instructions on every send.  However, we're not talking about typical
GUI bloatware, where adding cycles to every keystroke does not matter.
TCP/IP can now be done with a few dozen cycles total, so adding such
overhead would not be good.  I don't see how to implement a parameterized
Nagle algorithm in real life.

I am unhappy about the English descriptions of the current and modified
Nagle algorithms in the draft.  As I read it, the English description of
the modified Nagle algorithm is only slightly related to the later concrete
description.  Like talk of a parameterized Nagle scheme, the English in
the draft sounds nice until you try to figure out what it might really
mean, and then you realize it is at best fuzzy.  That is a Bad Thing(tm).
To fix this problem, I would delete the English description of the modified
algorithm and stick to the concrete talk of sequence numbers.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 18:47:19 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13432
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 18:47:19 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA26905; Fri, 5 Feb 1999 17:37:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA23984; Fri, 5 Feb 1999 17:32:22 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id PAA24992
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 5 Feb 1999 15:32:20 -0700 (MST)
Date: Fri, 5 Feb 1999 15:32:20 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902052232.PAA24992@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Art Shelest <artshel@microsoft.com>

> > How would you implement that?  
>
> It's a valid question, however I'd suggest that benefit of the
> algorithm is discussed before perceived implementation complexity.

In general, I strongly disagree.  Implementation costs are never merely
perceived.  Good ideas are a subset of those that have reasonable
implementations.  Without a good implementation, talk about the
wonderfulness of any idea is no better than useless usenet jabber, of
which there is no shortage.  Besides, the name of this mailing list
is relevant.  That said, all that's necessary to make it reasonable to
discuss an idea is a plausible hint that it can be reasonably implemented.


> If the recommended value for N is 2, the implementation is trivial,
> consumes extra 2-4 bytes* per TCB and the overhead is 1-2 extra
> instructions per send.
>
> *One can store beginning sequence number of the last packet (4 bytes),
> or the size of the last segment (2 bytes). If storing last segment
> size, then Nagle check would change:
> N=1 (today):	SND.NXT > SND.UNA
> N=2:			SND.NXT -SND.SIZ > SND.UNA 
> where SND.SIZ is last segment size.

"Configurable" implies that sometimes N=1 and other times N=2.  That means
that the time costs are at least the following additional instructions,
  1   fetch of the switch
  2   test of the switch
  3   branch on the switch's value
  4   fetch SND.SIZ
  5   subtract 

Depending on the CPU, you might combine #1, #2, and #3 into 2 instructions,
such as a test and a branch, or a fetch and a branch-zero.  The largest
costs are probably not the instructions but the primary cache misses and
dirtying.  The secondary and tertiary cache misses probably don't matter since
the entire TCB is likely to be faulted in and dirtied.  Overall, that
sounds fairly cheap to me.
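
As a rough sketch, the five steps map onto a per-connection switch like
the one below. The struct tcb_sketch layout and the nagle_n field are
hypothetical, invented for illustration; no real stack is being quoted.

```c
#include <stdint.h>

/* Hypothetical slice of a TCB; field names are made up for this sketch. */
struct tcb_sketch {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;   /* next sequence number to send */
    uint16_t snd_siz;   /* size of the last segment sent */
    uint8_t  nagle_n;   /* the "switch": 1 or 2 */
};

/* Nonzero if Nagle should hold a new small segment. */
int nagle_must_hold(const struct tcb_sketch *tp)
{
    uint32_t edge = tp->snd_nxt;

    if (tp->nagle_n == 2)       /* steps 1-3: fetch, test, branch */
        edge -= tp->snd_siz;    /* steps 4-5: fetch SND.SIZ, subtract */

    /* Compare modulo 2^32 so the check survives sequence wraparound. */
    return (int32_t)(edge - tp->snd_una) > 0;
}
```

All four fields would normally sit close together in the TCB, which is
why the cache behavior rather than the instruction count dominates.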


> Back to the real issue: would allowing N=2 unacknowledged segments 
> resolve effects of Nagle/Delayed ACK interaction?

How is this idea better than the idea of the draft?  

This idea would effectively turn off Nagle for the most common (in my
experience) bad code, which loves write-write-read-write-write-read....
It would double the number of packets on the wire for such bad
applications.  I think that would be Not Good.  Would relaxing Nagle for
2 send requests be much different from turning it off entirely?


Do we agree that I misread the following or it was not tightly phrased?
I read it as a proposal for letting the application vary the value of N
from 1 to some large number.

> > Why not parametrize the Nagle?
> >
> > Instead of "allow only one unacknowledged segment"
> > allow "allow N unacknowledged segments/bytes" and 
> > specify recommended range of values.
> >
> > The N=2 would be similar to current proposal and also
> > fix the case where application follows the 
> > "small write, small write, read" operation, 
> > which is not addressed by the proposed draft.
> >
> > The N=2 seems like a good choice, because receivers
> > typically ACK every other packet.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 19:27:50 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA14373
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 19:27:49 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA20790; Fri, 5 Feb 1999 18:17:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mail5.microsoft.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA20196; Fri, 5 Feb 1999 18:16:20 -0500 (EST)
Received: by INET-IMC-05 with Internet Mail Service (5.5.2524.0)
	id <D8W7JMR9>; Fri, 5 Feb 1999 15:16:19 -0800
Message-ID: <3FF8121C9B6DD111812100805F31FC0D0CAE8627@RED-MSG-59>
From: Art Shelest <artshel@microsoft.com>
To: "'Vernon Schryver'" <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Date: Fri, 5 Feb 1999 15:16:17 -0800 
X-Mailer: Internet Mail Service (5.5.2524.0)
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon Schryver wrote:

> Do we agree that I misread the following or it was not tightly phrased?
> I read it as a proposal for letting the application vary the value of N
> from 1 to some large number.

Requiring the implementation to be configurable is a stronger requirement
than what I had in mind. The wording should have been: "Hosts MUST implement
Nagle(N) where N may be one of the following values...". Configurability
is optional, of course.

> This idea would effectively turn off Nagle for the most common (in my
> experience) bad code, which loves write-write-read-write-write-read....
> It would double the number of packets on the wire for such bad
> applications.  I think that would be Not Good.  Would relaxing Nagle for
> 2 send requests be much different from turning it off entirely?

In the case above, the present and the draft-proposed Nagle will both
generate 2 packets, because the first will be sent immediately. With N=2,
however, performance will improve from 2 packets per 200ms to 2 packets
per RTT (10+ times on many LANs).

I don't agree with the notion that W-W-R applications come from bad
programmers. Application programmers should not be required to be TCP
experts when they simply treat a socket as a file handle. For all they
know, it may or may not be a network device. If W-W-R works well with
fwrite, why shouldn't it work with send?

Large sends are a more complex case. Both Nagle(2) and modified Nagle will
create extra packets with sends slightly bigger than MSS, which is bad for
bulk transfers but good for interactive operations.

It almost sounds like Nagle needs its own timer rather than being driven by
the receiver's delayed ACK timer or the RTT. This way, Nagle(T) could read:
"Delay sending data until there's a full MSS, but no longer than period T".
This would guarantee a predictable Nagle delay, would not inhibit bulk
transfers, would improve interactive performance, and would not depend on
the receiver's delayed ACK policy.
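
A minimal sketch of the Nagle(T) rule as worded above. The function name,
the millisecond clock and the first_queued_ms timestamp are all
assumptions made for illustration; the rule itself only says "full MSS
or period T".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero if queued data may be sent now: either a full MSS is waiting,
 * or the oldest unsent byte has been waiting for at least t_ms.
 * Times are milliseconds from a free-running 32-bit clock.
 */
int nagle_t_may_send(size_t queued, size_t mss,
                     uint32_t now_ms, uint32_t first_queued_ms,
                     uint32_t t_ms)
{
    assert(mss > 0);
    if (queued == 0)
        return 0;               /* nothing to send */
    if (queued >= mss)
        return 1;               /* full-sized segment: never delayed */
    /* Unsigned subtraction keeps the elapsed time correct across
     * clock wraparound. */
    return (uint32_t)(now_ms - first_queued_ms) >= t_ms;
}
```

Something still has to wake the sender when T expires, so the predicate
alone does not remove the need for a per-connection timer.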

Now we have 4 possible solutions: Nagle, modified Nagle, Nagle(N) and
Nagle(T). It would be interesting to compare their behavior for the cases
previously described, from telnet to bulk transfer to MSS+1 sends.



From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 19:42:44 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA14555
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 19:42:44 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA02920; Fri, 5 Feb 1999 18:37:42 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA00873; Fri, 5 Feb 1999 18:34:07 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA26026
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 5 Feb 1999 16:34:06 -0700 (MST)
Date: Fri, 5 Feb 1999 16:34:06 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902052334.QAA26026@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Art Shelest <artshel@microsoft.com>

> ...
> > This idea would effectively turn off Nagle for the most common (in my
> > experience) bad code, which loves write-write-read-write-write-read....
> > It would double the number of packets on the wire for such bad
> > applications.  I think that would be Not Good.  Would relaxing Nagle for
> > 2 send requests be much different from turning it off entirely?
>
> In the case above, present or draft-proposed Nagle will both generate 2
> packets because first will be sent immediately. The performance, however,
> will improve from 2 packets per 200ms to 2 packets per RTT (10+ times on
> many LANs)

True, Nagle does not reduce the number of packets per transaction from
bad applications, but it reduces or at least strongly limits the number
of packets per second from them.  As far as the rest of the network
is concerned, the effect of Nagle is to reduce the number of packets.


> I don't agree with notion that W-W-R applications come from bad programmers,
> application programmer should not be required a TCP expert when they simply
> treat
> socket as a file handle. For all they know, it may or may not be a network
> device.
> If W-W-R works well with fwrite, why shouldn't it work with send?

NO!

  - W-W-R does NOT work well with fwrite(), compared to a single write().

  - when confronted with W-W-R, fwrite() will make a single write() 
      system call, in the relevant, tiny-write cases.  That's why
      fwrite() and the rest of stdio were invented.

  - mixing stdio writes and reads without any intervening fseek()s is not
     for the faint of heart.  Are such combinations now legal, eg. in POSIX?

  - any programmer who doesn't design for the target media is incompetent.
     Competent programmers do what is necessary to make their code work
     well on all of their design-targeted media.  What works well on
     network streams often does not work on other media, and vice versa.
     No amount of CASE/OOP snake oil will make rewind work on a TCP stream,
     nor does select() generally make much sense on disk files or tapes.
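
One way an application can turn write-write-read into write-read on a
socket is to hand both pieces to the kernel in one call. The sketch below
uses POSIX writev() for that; writev() is not mentioned in this thread, so
treat it as one possible technique, and the function names as
hypothetical. The demo runs over a pipe as a stand-in for a connected
socket.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit two buffers with a single system call instead of two writes.
 * Returns what writev() returns; a short count is possible and a real
 * caller would have to resume from where it stopped. */
ssize_t send_two_parts(int fd, const void *hdr, size_t hdr_len,
                       const void *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)hdr;   /* cast only; writev() reads the data */
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);
}

/* Demo over a pipe: returns 1 if both parts arrive as one byte stream. */
int demo_send_two_parts(void)
{
    int p[2];
    char buf[16];
    ssize_t n;
    int ok;

    if (pipe(p) != 0)
        return 0;
    ok = send_two_parts(p[1], "HDR:", 4, "body", 4) == 8;
    n = read(p[0], buf, sizeof buf);
    ok = ok && n == 8 && memcmp(buf, "HDR:body", 8) == 0;
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Copying both pieces into one buffer and calling write() once has the same
effect on the wire; writev() just avoids the copy.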


> Large sends is a more complex case. Both Nagle(2) and modified Nagle will
> create 
> extra packets with sends slightly bigger than MSS, which is bad for bulk
> transfers, 
> but good for interactive operations.

Not so.  Extra packets are bad for interactive operations.  Extra packets
cost both time and CPU cycles.  It is also true that longer delays while
the wire is idle are worse for interactive applications.  Moreover, large
transmissions are generally not involved in interactive applications, for
obvious reasons.


> It almost sounds like Nagle needs it's own timer rather than being driven by
> receiver's delayed ACK timer or RTT. This way, Nagle(T) could read:
> "Delay sending data until there's full MSS, but no longer than period T".
> This will guarantee predictable Nagle delay, will not inhibit bulk
> transfers,
> improve interactive performance and not depend on the receiver's delayed ACK
> policy.
>
> Now we have 4 possible solutions: Nagle, modified Nagle, Nagle(N) and
> Nagle(T)
> it would be interesting to compare their behavior for the cases previously
> described, from telnet to bulk transfer to MSS+1 sends.

Adding yet another timer to TCP is a non-starter, since it fails the
reasonable-implementation criterion.  The several timers that are now
needed for every TCP stream are already a major burden in TCP
implementations that care about speed, especially speed when dealing with
lots of simultaneous streams, as on a large web server.  The gyrations
required to make a large, fast multi-processor deal with the current
zillions of TCP timers are painful to contemplate.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 19:48:38 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA14586
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 19:48:38 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA02926; Fri, 5 Feb 1999 18:38:02 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA01815; Fri, 5 Feb 1999 18:36:05 -0500 (EST)
Received: from Eng.Sun.COM (engmail3 [129.144.170.5]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id PAA05873 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 15:35:59 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id PAA10548
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 15:35:58 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id PAA15126
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 15:35:58 -0800 (PST)
Date: Fri, 5 Feb 1999 15:35:58 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36BB442E.FD5715EE@cup.hp.com>
Message-ID: <Roam.SIMCSD.2.0.4.918257758.5558.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Are we arriving at the conclusion/concensus that the problem being
> solved here (that a single send of say 1.5MSS bytes encounters a delay
> in the second segment) has already been dealt with in a number of
> production TCP stacks by interpreting Nagle on a "send" rather than a
> "segment" basis? And that a pair of 0.75 MSS sends would still encounter
> the "son of Nagle" interaction?

I don't think there is a consensus.  First, the proposed change is not on a
per-send basis.  Please refer to Greg's latest mail for why he did not like
this idea.  It seems to me (correct me if I am wrong) that Greg is trying to
tackle it in the context of TCP, which looks only at segment size and the
data available to be sent.  And for BSD-derived implementations, Dave pointed
out that the sosend() problem needs to be fixed first.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 19:50:07 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA14612
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 19:50:07 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA05926; Fri, 5 Feb 1999 18:42:42 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA03432; Fri, 5 Feb 1999 18:38:37 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 6 Feb 1999 00:02:00 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108upA-001xhEC; Fri, 5 Feb 1999 15:38:12 -0800 (PST)
Received: from ip19.san-francisco41.ca.pub-ip.psi.net ([38.28.91.19]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 6 Feb 1999 00:01:55 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id PAA02547; Fri, 5 Feb 1999 15:38:35 -0800 (PST)
Message-Id: <199902052338.PAA02547@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Art Shelest <artshel@microsoft.com>
cc: "'Vernon Schryver'" <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 05 Feb 1999 14:06:41 PST."
             <3FF8121C9B6DD111812100805F31FC0D0CAE8620@RED-MSG-59> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 15:38:35 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Art,

> Back to the real issue: would allowing N=2 unacknowledged segments 
> resolve effects of Nagle/Delayed ACK interaction?

No, it would not.  On at least the BSD stack, the returning ACK won't be sent 
until there are 2*MSS of window update to be sent (or until the delayed ACK 
timer goes off).
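
A deliberately simplified model of that receiver rule, to show why two
small segments do not help. The function name is hypothetical, and "bytes
received but not yet ACKed" is used as a stand-in for the pending window
update; real BSD code has more conditions than this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Nonzero if the modeled receiver sends an ACK now: either at least
 * 2*MSS of data is waiting to be acknowledged, or the delayed ACK
 * timer has fired.
 */
int receiver_acks_now(size_t unacked_rcvd_bytes, size_t mss,
                      int delack_timer_fired)
{
    assert(mss > 0);
    return delack_timer_fired || unacked_rcvd_bytes >= 2 * mss;
}
```

With an MSS of 1460, two 100-byte segments leave the receiver at 200
bytes, far short of 2920, so the sender still waits for the timer.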

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 20:12:01 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA14773
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 20:12:01 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA20840; Fri, 5 Feb 1999 19:07:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA19702; Fri, 5 Feb 1999 19:05:29 -0500 (EST)
Received: from Eng.Sun.COM (engmail1 [129.146.1.13]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id QAA12945 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 16:05:25 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id QAA20570
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 16:05:20 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id QAA15132
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 16:05:16 -0800 (PST)
Date: Fri, 5 Feb 1999 16:05:16 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: RE: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <3FF8121C9B6DD111812100805F31FC0D0CAE8627@RED-MSG-59>
Message-ID: <Roam.SIMCSD.2.0.4.918259516.16235.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I don't agree with notion that W-W-R applications come from bad programmers,
> application programmer should not be required a TCP expert when they simply
> treat
> socket as a file handle. For all they know, it may or may not be a network
> device.
> If W-W-R works well with fwrite, why shouldn't it work with send?

If application programmers know what value to set for N in your Nagle(N)
proposal, I don't think they can be unaware that they are working with a
network device.  If they do not know they are working with a network device,
they had better not use socket() directly and should use a higher-level
library instead.  Let those libraries handle the network part.

> It almost sounds like Nagle needs it's own timer rather than being driven by
> receiver's delayed ACK timer or RTT. This way, Nagle(T) could read:
> "Delay sending data until there's full MSS, but no longer than period T".
> This will guarantee predictable Nagle delay, will not inhibit bulk
> transfers,
> improve interactive performance and not depend on the receiver's delayed ACK
> policy.

What value do you set for T?  First, it cannot be larger than 200ms, the
commonly used delayed ACK timer.  And T has to depend on the RTT and the
application.  TCP can measure RTT.  But how can TCP know when the application
will send?  I'd say TCP will guess wrong most of the time.  And the most
important question is what we gain from all this complicated work, as
compared to the simple proposed change to the Nagle algorithm.

> Now we have 4 possible solutions: Nagle, modified Nagle, Nagle(N) and
> Nagle(T)
> it would be interesting to compare their behavior for the cases previously
> described, from telnet to bulk transfer to MSS+1 sends.

IMHO, we should not even consider Nagle(T).  And for Nagle(N), I'd say the
little, if any, gain when compared to the proposed change does not justify
the extra code needed to implement it (although less than for Nagle(T)).

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 20:36:26 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA14982
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 20:36:25 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA02350; Fri, 5 Feb 1999 19:27:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from alpha.xerox.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA01814; Fri, 5 Feb 1999 19:26:59 -0500 (EST)
Received: from crevenia.parc.xerox.com ([13.2.116.11]) by alpha.xerox.com with SMTP id <61355(1)>; Fri, 5 Feb 1999 16:26:57 PST
Received: from localhost by crevenia.parc.xerox.com with SMTP id <177534>; Fri, 5 Feb 1999 16:26:54 -0800
To: David Borman <dab@bsdi.com>
cc: jayanth@loc201.tandem.com, tcp-impl@lerc.nasa.gov
Subject: Re: iss calculation during TIME_WAIT ressurection ? 
In-reply-to: Your message of "Fri, 05 Feb 99 12:55:15 PST."
             <199902052055.OAA01979@frantic.bsdi.com> 
Date: Fri, 5 Feb 1999 16:26:40 PST
From: Bill Fenner <fenner@parc.xerox.com>
Message-Id: <99Feb5.162654pst.177534@crevenia.parc.xerox.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

It looks like the bug was still present in 4.4BSD-Lite, but fixed in
4.4BSD-Lite2.  FreeBSD started with -Lite and did a manual merge to
-Lite2 because the T/TCP modifications made simply merging diffs
problematic; it appears as though this fix was omitted.

  Bill

(The diff from -Lite to -Lite2 can be seen at
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/tcp_input.c.diff?r1=1.1.1.1&r2=1.1.1.2
).


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 21:46:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA15430
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 21:46:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA06874; Fri, 5 Feb 1999 20:27:45 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA04608; Fri, 5 Feb 1999 20:23:30 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 6 Feb 1999 01:46:40 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m108w9M-001xhFC; Fri, 5 Feb 1999 17:03:08 -0800 (PST)
Received: from ip51.san-francisco41.ca.pub-ip.psi.net ([38.28.91.51]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 6 Feb 1999 01:26:51 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id PAA02571; Fri, 5 Feb 1999 15:52:04 -0800 (PST)
Message-Id: <199902052352.PAA02571@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 05 Feb 1999 12:45:08 PST."
             <36BB5854.DACB5A24@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 05 Feb 1999 15:52:04 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

I see what you are saying:

> But if that application is presenting that data in separate send
> calls, and it was indeed data that was really one unit, is that
> application broken or not? The whole issue here is for request/
> response applications right? Which means that there is all the data of
> the "request" and then all the data of the "response." 

I think it is reasonable to have the application buffer reasonable amounts of 
data, but i'm not sure it is reasonable to say it must buffer 50Kbytes or 
100Kbytes.  It seems to me (but, i agree, there is room here to differ) that 
if the application is passing down its response (or request) in 4096 byte 
chunks (to pick a popular size), TCP should "work well".

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 22:37:00 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA17467
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 22:37:00 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA08725; Fri, 5 Feb 1999 21:22:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA07680; Fri, 5 Feb 1999 21:20:41 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id SAA01840
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 18:20:45 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id SAA18287 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 18:20:38 -0800 (PST)
Message-ID: <36BBA6F6.7297D2CC@cup.hp.com>
Date: Fri, 05 Feb 1999 18:20:38 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902052352.PAA02571@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I think it is reasonable to have the application buffer reasonable amounts of
> data, but i'm not sure it is reasonable to say it must buffer 50Kbytes or

I will go along with the statement that (while desirable :) it is not
always reasonable to expect an application with a 50K response/request
to provide that to TCP in one call. 

I will point out, though, that I regularly see (ok, ok, in benchmarking
:) web servers handing the transport as much data as it will take in a
"send" call and that being more than 50K often enough depending on the
"SO_SNDBUF" setting. 

By the time we are talking about 50K request/responses, are we really
talking request/response any more, or are we back to (short)
unidirectional streams? Another way to ask that: is there really a
problem with applications sending 50KB of data across the net and then a
single byte?

> 100Kbytes.  It seems to me (but, i agree, there is room here to differ) that
> if the application is passing down its response (or request) in 4096 byte
> chunks (to pick a popular size), TCP should "work well".

I would probably say "work well enough" :) I think that if data is given
in 4096 byte chunks that well enough is (assuming a 1460 MSS) 1460,
1460, 1176. I see that today in netperf TCP_RR tests over some
transports, with no nagle/delayedack induced delays:

$ uname -a
HP-UX loiter B.10.20 A 9000/735 2004221734 two-user license
$ remsh hpisrdq "uname -a"
HP-UX hpisrdq A.09.05 E 9000/725 2007119367 8-user license
$ ./netperf -t TCP_RR -H hpisrdq -- -r 1,4096 -S 32K -s 32K
TCP REQUEST/RESPONSE TEST to hpisrdq
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

32768  32768  1        4096    10.00     151.95   
32768  32768 
$ ./netperf -t TCP_RR -H hpisrdq -- -r 4096,1 -S 32K -s 32K
TCP REQUEST/RESPONSE TEST to hpisrdq
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

32768  32768  4096     1       10.00     150.17   
32768  32768 

here's some to include HP-UX 11 as a sender and to show it at something
that might not be MCLBYTES in size :)

# ./netperf -H hpisrdq -t TCP_RR -- -r 4097,1 -S 32K -s 32K
TCP REQUEST/RESPONSE TEST to hpisrdq
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

32768  32768  4097     1       10.01     107.13   
32768  32768 
# uname -a
HP-UX hpntc11a B.11.00 A 9000/778 2011481613 two-user license

back to the 10.20 box...
$ ./netperf -t TCP_RR -H hpisrdq -- -r 4097,1 -S 32K -s 32K
TCP REQUEST/RESPONSE TEST to hpisrdq
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

32768  32768  4097     1       10.00     146.46   
32768  32768 

I'm even happy if that was 8192 bytes in two 4096 byte sends - unless of
course I'm talking over a link with a 9000 byte MTU...which takes us
into other territory :) 
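
The segment sizes discussed above can be checked with a few lines of
arithmetic.  This is a minimal sketch, not taken from any stack: the
function name segment_sizes is hypothetical, and the 8960-byte MSS for a
9000-byte MTU assumes 40 bytes of IP and TCP headers with no options.

```python
def segment_sizes(write_len, mss=1460):
    """Split one write into TCP payload sizes: full segments, then the remainder."""
    full, tail = divmod(write_len, mss)
    return [mss] * full + ([tail] if tail else [])


print(segment_sizes(4096))            # [1460, 1460, 1176]
print(segment_sizes(4097))            # [1460, 1460, 1177]
print(segment_sizes(8192))            # five full segments plus an 892-byte tail
print(segment_sizes(4096, mss=8960))  # [4096]: each 4096-byte send is sub-MSS
```

With the larger MSS every 4096-byte send is itself a "small" segment,
which is why the 9000-byte MTU case behaves differently.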

I guess I would be in the camp that feels that interpreting nagle send
by send is sufficient and the way to go and that we need to make sure
app writers understand just what they are doing when they set
TCP_NODELAY. A "mantra" along the lines of "provide as much of the
logically associated data as you can to the transport at one time" or
something...

rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  5 23:31:28 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA18111
	for <tcpimpl-archive@lists.ietf.org>; Fri, 5 Feb 1999 23:31:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA06252; Fri, 5 Feb 1999 22:12:47 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA03331; Fri, 5 Feb 1999 22:08:04 -0500 (EST)
Received: from Eng.Sun.COM (engmail4 [129.144.134.6]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id TAA22212 for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 19:08:03 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id TAA12843
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 19:08:03 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id TAA15205
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Feb 1999 19:08:02 -0800 (PST)
Date: Fri, 5 Feb 1999 19:08:02 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36BBA6F6.7297D2CC@cup.hp.com>
Message-ID: <Roam.SIMCSD.2.0.4.918270482.31348.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I guess I would be in the camp that feels that interpreting nagle send
> by send is sufficient and the way to go and that we need to make sure
> app writers understand just what they are doing when they set
> TCP_NODELAY. A "mantra" along the lines of "provide as much of the
> logically associated data as you can to the transport at one time" or
> something...

Is there such a document available somewhere?  If not, maybe we should write
up an informational draft describing these simple but not widely understood
guidelines for writing network applications that use TCP.

I just checked Stevens's "UNIX Network Programming" book.  It does suggest
that people use writev().  But I don't think it puts enough emphasis on the
adverse effect of disabling the Nagle algorithm by default.  It is this point
that we need to bring to the application programmers.
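
As an illustration of the "hand the transport the whole logical unit at
once" advice, here is a minimal sketch using Python's socket.sendmsg(),
which takes a list of buffers much as writev() does.  The helper name
send_gathered is hypothetical, and the demonstration uses a local
socketpair() purely so that it runs without a network; on a real TCP
socket the same call would apply.

```python
import socket


def send_gathered(sock, parts):
    """Pass all parts to the transport together; return the number of calls made."""
    buffers = [memoryview(p) for p in parts if len(p)]
    calls = 0
    while buffers:
        sent = sock.sendmsg(buffers)   # gather write, like writev()
        calls += 1
        # Drop buffers that were sent completely and trim a partially sent one.
        while sent and buffers:
            if sent >= len(buffers[0]):
                sent -= len(buffers[0])
                buffers.pop(0)
            else:
                buffers[0] = buffers[0][sent:]
                sent = 0
    return calls


left, right = socket.socketpair()
n = send_gathered(left, [b"HEADER: 5 bytes follow\r\n", b"hello"])
print(n, right.recv(1024))
left.close()
right.close()
```

The point is that header and body reach the transport in one call, so a
per-send Nagle implementation never sees the header as a separate small
write.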

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 05:01:02 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id FAA15907
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 05:01:01 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id CAA04767; Mon, 8 Feb 1999 02:52:52 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id CAA03196; Mon, 8 Feb 1999 02:48:53 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 439;
          Fri, 5 Feb 1999 10:08:21 -0800
Message-ID: <36BB3383.C6258375@ehsco.com>
Date: Fri, 05 Feb 1999 10:08:03 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: David Borman <dab@BSDI.COM>
CC: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902051706.LAA01625@frantic.bsdi.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> Also, I might not have said it, but this example was to show
> 3 separate transaction requests, not 1 transaction done with
> 3 writes.  If it was one request, it should really be done with
> a single 4500 byte write, yielding:

You're not showing any response activity so it's hard to guess what's
really going on.

If the sender is actually waiting for a response from the far-end after
writing each block of 1500 bytes, then disabling Nagle on the sender
would be the most efficient thing:

          1st write:     send 1460 bytes
                         send 40 bytes
          Response

          2nd write:     send 1460 bytes
                         send 40 bytes
          Response
etc.

If the client is writing 3 blocks of 1500 bytes because that's how it
was written, then it should be deleted. :)   Unfortunately, since you're
not showing any responses, we have to assume that's what is happening.
In this case, the app should at least know that it is doing multiple
writes for a single operation, since it's not waiting for a response.
And in  that case, using Nagle is the most efficient:

> But turning off Nagle is even worse.  You then get:
>         1st write:      send 1460 bytes
>                         send 40 bytes
>         2nd write:      send 1460 bytes
>                         send 40 bytes
>         3rd write:      send 1460 bytes
>                         send 40 bytes

Six sends when writing 3 blocks >1 segment.

> > > and the proposed Nagle (and the current BSD/OS) would produce:
> > >
> > >     1st write:      send 1460 bytes
> > >                     send 40 bytes
> > >     2nd write:      send 1460 bytes (defer 40 bytes)
> > >     3rd write       send 40 + 1420 = 1460 bytes (defer 80  bytes)
> > >                     (wait for ACK or timeout)
> > >                     send 80 bytes

Five sends. Better.

> > > whereas the current Nagle code would produce:
> > >
> > >     1st write:      send 1460 bytes  (defer 40 bytes)
> > >     2nd write:      send 40 + 1420 = 1460 bytes (defer 80 bytes)
> > >     3rd write:      send 80 + 1380 = 1460 bytes  (defer 120 bytes)
> > >                     (wait for ACK or timeout)
> > >                     send 120 bytes

Four sends. Best.

If the app knows that it's going to send >2 segments in an operation, it
should use Nagle. If the app knows that it's going to send <2 segments
in a single operation, it should disable Nagle.

Modifying Nagle doesn't benefit either of those scenarios. It might
benefit a scenario where the client doesn't know what it's doing.
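
The three packet counts quoted above (six, five, four) can be reproduced
with a small model.  This is only a sketch under stated assumptions: no
ACK arrives while the three writes are made, one ACK covering everything
arrives afterwards, windows are ignored, and the names simulate, "off",
"classic" and "modified" are made up for the illustration.

```python
MSS = 1460


def simulate(writes, policy, mss=MSS):
    """Return the payload sizes sent for back-to-back writes.

    policy is "off" (Nagle disabled), "classic" (hold a small segment while
    any data is unacked) or "modified" (hold a small segment only while an
    earlier small segment is unacked).
    """
    sent = []              # payload sizes in transmission order
    queued = 0             # bytes accepted from the application, not yet sent
    small_unacked = False  # a sub-MSS segment is in flight
    for w in writes:
        queued += w
        while queued:
            size = min(queued, mss)
            if size < mss:
                # No ACKs arrive here, so "sent is non-empty" means "unacked data".
                if policy == "classic" and sent:
                    break
                if policy == "modified" and small_unacked:
                    break
                small_unacked = True
            sent.append(size)
            queued -= size
    if queued:
        sent.append(queued)  # the ACK arrives and the deferred tail goes out
    return sent


for policy in ("off", "modified", "classic"):
    segments = simulate([1500, 1500, 1500], policy)
    print(policy, len(segments), segments)
```

The model counts packets only; it says nothing about how long the
deferred tail waits, which depends on the receiver's ACK behaviour.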


ps--I'm at the end of what appears to be a very long list. I don't see
my posts until several hours after sending them. This makes reading and
participating somewhat difficult. Apologies if out-of-order.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 12:54:18 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA24911
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 12:54:17 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA14641; Mon, 8 Feb 1999 11:09:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from algw1.lucent.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA09399; Mon, 8 Feb 1999 11:00:10 -0500 (EST)
Received: from hoserve.ho.lucent.com by alig1.firewall.lucent.com (SMI-8.6/EMS-L sol2)
	id LAA15017; Mon, 8 Feb 1999 11:28:22 -0500
Received: from qc74.qcs by hoserve.ho.lucent.com (SMI-8.6/EMS-1.4.1 sol2)
	id LAA16173; Mon, 8 Feb 1999 11:04:01 -0500
Received: by qc74.qcs (SMI-8.6/SMI-SVR4)
	id KAA20943; Mon, 8 Feb 1999 10:52:52 -0500
Date: Mon, 8 Feb 1999 10:52:52 -0500
From: ramesh@hoserve.ho.lucent.com (Ramesh Nagarajan)
Message-Id: <199902081552.KAA20943@qc74.qcs>
To: tcp-impl@lerc.nasa.gov
Subject: INFOCOM'2000: Advance Call for Papers
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

			 INFOCOM 2000 CALL FOR PAPERS

			      IEEE Infocom 2000

	    (Israel)  http://www.comnet.technion.ac.il/infocom2000
	      (U.S.A.)  http://www.cse.ucsc.edu/~rom/infocom2000
	       (Japan) http://halo.kuamp.kyoto-u.ac.jp/~infocom
		(general)  http://www.comsoc.org/confs/infocom

                    Dan Panorama Hotel, Tel Aviv, Israel
			      March 26-30, 2000

	 Sponsored by the IEEE Communications and Computer Societies


SCOPE
=====

For the last 18 years, Infocom has been the major conference on computer
communications and networking, bringing together researchers and
implementors of every aspect of data communications and networks
presenting the most up-to-date results and achievements in the field.

The 19-th annual conference on Computer Communications, Infocom 2000,
will be held at the Dan Panorama Hotel in Tel-Aviv, Israel, during the
week of March 26-30, 2000. The conference is sponsored by the technical
committees on computer communications of the IEEE Communications and
Computer Societies.

The Infocom 2000 organizing committee is soliciting original papers
describing state-of-the-art research and development in all areas of
computer networking and data communications.  Topics of interest include,
but are not limited to, the following:

Active Networks                  	Network Management and Control
BISDN and ATM                    	Network Reliability
Billing and Pricing              	Network Restoration
Congestion and Admission Control 	Network Signaling
Distributed Network Algorithms   	Network Standards
High-Speed Network Protocols     	Network and Protocol Performance
Integrated Control of Networks   	Optical Networks
Intelligent Networks             	Personal Communications Systems
Internet                         	Photonic Switching
Internetworking                  	Protocol Design and Analysis
Lightwave Networks               	Quality of Service
Mobility                         	Routing and Routing Protocols
Multicast/Broadcast Algorithms   	Security and Privacy
Multimedia Protocols             	Switch Architectures
Multimedia Terminals and Systems 	Testbeds and Measurements
Multiple Access                         Traffic Management and Control
Network Architectures            	Video Networking
Network Design and Planning      	Wireless Networks and Protocols

PAPER SUBMISSION
================

Papers must be submitted electronically in the manner and format
detailed below.  Authors for whom this presents a severe problem should
contact one of the technical program committee co-chairs to discuss
alternatives.

Papers must be formatted according to the IEEE standard format except
for the font size, which must be 11pt.  To make it easy to adhere to the
formatting standard we offer templates and samples for LaTeX, MSWord,
and FrameMaker (consult the web pages referenced at the top of this
message).

Submission must be in PDF.  However, the committee will also accept
PostScript generated from LaTeX, FrameMaker, or MSWord source files.  PostScript
papers must use only standard PostScript fonts: Times Roman, Courier,
Symbol, and Helvetica.  (Postscript output from MSWord typically does
not work on non-Microsoft platforms.  The use of the Apple LaserWriter
II printer driver is strongly recommended).  The above formatted papers
can be submitted in a compressed form (gzip, zip, compress).

Because of the size limitation on the final manuscript, and to ensure
that the reviewed paper and the final version have a similar size,
-----------------------------------------------------
|papers with more than 11 pages will not be reviewed| (this is roughly
-----------------------------------------------------
equivalent to 20 double-spaced pages).

Papers must be submitted electronically using the Web site at
<http://www.cs.columbia.edu/~hgs/edas/infocom2000>.  This web page
contains exact and detailed instructions.  The submission process
includes providing detailed contact information.  To save space authors
can omit this information from the paper itself.  Authors will receive
an immediate notification of the successful receipt of the file
containing their paper.  Subsequently, a formal notification will be
sent after verifying that the paper can be printed successfully.

-------------------------------------------------------------------------
|Submissions will only be accepted between April 1st and July 1st, 1999.| 
-------------------------------------------------------------------------

Submission deadlines are strict!  Papers that have been improperly
submitted or improperly formatted by the submission date will not be
considered.  To avoid last minute problems, authors are encouraged to
submit their papers well in advance of the deadline.


THE REVIEW PROCESS
==================

Each paper will typically be reviewed by three independent reviewers,
whose reviews will be relayed to the corresponding author.  To
facilitate the review process authors will be asked to classify the
paper according to a list of categories so that the most appropriate
reviewers handle the paper.  This year a new step will be introduced
into the process whereby authors will have a chance to provide a
limited rebuttal on the reviews before the program committee makes its
final decision.


TRAVEL GRANTS
=============

Limited travel assistance to students, post-docs and junior faculty for
defraying some of the costs of presenting a paper in the conference
will be available. Please refer to the conference web sites for
further details later this year.


IMPORTANT DATES
===============

    Complete paper due		April 1 - July 1, 1999
    Notification of acceptance	October 31, 1999
    Final version due		December 17, 1999


PROGRAM COMMITTEE CO-CHAIRS [infocom@comnet.technion.ac.il]
===========================

    Raphael Rom, Technion, Israel
    Henning Schulzrinne, Columbia University, USA


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 13:07:12 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id NAA25414
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 13:07:09 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA13602; Mon, 8 Feb 1999 11:08:00 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA11052; Mon, 8 Feb 1999 11:02:59 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id KAA05330;
	Mon, 8 Feb 1999 10:02:54 -0600 (CST)
Date: Mon, 8 Feb 1999 10:02:54 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902081602.KAA05330@frantic.bsdi.com>
To: ehall@ehsco.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Cc: tcp-impl@lerc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From ehall@ehsco.com Mon Feb  8 01:48:27 1999
> Date: Fri, 05 Feb 1999 10:08:03 -0800
> From: "Eric A. Hall" <ehall@ehsco.com>
> Organization: EHS Company
> X-Accept-Language: en
> To: David Borman <dab@BSDI.COM>
> CC: tcp-impl@lerc.nasa.gov
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
>
>
> > Also, I might not have said it, but this example was to show
> > 3 separate transaction requests, not 1 transaction done with
> > 3 writes.  If it was one request, it should really be done with
> > a single 4500 byte write, yielding:
>
> You're not showing any response activity so it's hard to guess what's
> really going on.

Ok.  My example was for an application that is request/response
oriented, but it does not necessarily wait for the preceding
response before sending the next request.  The point being that
the first two requests will get sent out, but there will be a
delay in the 3rd request.

With the current Nagle code:
	1st request:	send 1460 bytes  (defer 40 bytes)
			(pause for some short period of time)
	2nd request:	send 40 + 1420 = 1460 bytes (defer 80 bytes)
			(pause for some arbitrary period of time)
	3rd request:	send 80 + 1380 = 1460 bytes  (defer 120 bytes)
			(wait for ACK of 4380 bytes or timeout)
			send 120 bytes

With the new Nagle code:

	1st write:      send 1460 bytes
			send 40 bytes
			(pause for some short period of time)
	2nd write:      send 1460 bytes (defer 40 bytes)
			(pause for some short period of time)
	3rd write       send 40 + 1420 = 1460 bytes (defer 80  bytes)
			(wait for ACK of first 1500 bytes or timeout)
			send 80 bytes

The difference is that with the new Nagle code, the whole initial
request gets sent immediately, whereas with the old code it is only
the addition of the 2nd request that actually pushes out the rest
of the 1st request.

Additionally, note that I've clarified that there is a difference in
when the final piece gets sent, relative to when the ACKs arrive.
With the old Nagle code, it will be waiting for an ACK of all the
outstanding data before sending the final 120 bytes.  With the new
Nagle code, once the first 40 byte packet has been acked, the trailing
80 byte packet can go out because there are no longer any previous
unacked small packets.
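
The difference in release timing can be sketched as a check against a
stream of cumulative ACKs.  The ACK sequences below are hypothetical
(a receiver that acks every second segment and otherwise waits for its
delayed-ACK timer); the function name and the use of byte offsets from
the start of the connection, with no wraparound, are assumptions made
for the illustration.

```python
def first_releasing_ack(acks, snd_nxt, snd_sml, policy):
    """Return the first cumulative ACK that lets the deferred tail be sent."""
    for snd_una in acks:
        if policy == "classic":
            released = snd_una == snd_nxt   # nothing at all is left unacked
        else:
            released = snd_una >= snd_sml   # no small segment is left unacked
        if released:
            return snd_una
    return None


# Old code: 1460 + 1460 + 1460 in flight (offsets 0..4380), 120 bytes deferred.
print(first_releasing_ack([2920, 4380], snd_nxt=4380, snd_sml=0, policy="classic"))

# New code: 1460 + 40 + 1460 + 1460 in flight (offsets 0..4420), 80 bytes
# deferred; the small 40-byte segment ends at offset 1500.
print(first_releasing_ack([1500, 4420], snd_nxt=4420, snd_sml=1500, policy="modified"))
```

Under these assumed ACK patterns the old code waits for the ACK of the
last full segment, which may be a delayed ACK, while the new code is
released by the ACK covering the first 1500 bytes.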

Also, you said, wrt to this example and the New Nagle code:
> Five sends. Better.
And the old Nagle:
> Four sends. Best.

Not necessarily.  Though it is usually best to send as few packets
as possible, strict packet count is not the only criterion.  In this
situation we are also looking at how quickly the whole requests get
sent out, and what artificial delays are introduced as an artifact
of interaction between the Nagle algorithm and delayed acks.

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 21:56:32 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA07017
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 21:56:32 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA25053; Mon, 8 Feb 1999 20:17:57 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA23020; Mon, 8 Feb 1999 20:13:06 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 9 Feb 1999 01:36:38 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10A1Q4-001xhGC; Mon, 8 Feb 1999 16:52:52 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id QAA01403; Mon, 8 Feb 1999 16:53:38 -0800 (PST)
Message-Id: <199902090053.QAA01403@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 05 Feb 1999 18:20:38 PST."
             <36BBA6F6.7297D2CC@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 08 Feb 1999 16:53:38 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

> I guess I would be in the camp that feels that interpreting nagle send
> by send is sufficient and the way to go and that we need to make sure
> app writers understand just what they are doing when they set
> TCP_NODELAY. A "mantra" along the lines of "provide as much of the
> logically associated data as you can to the transport at one time" or
> something... 

I think two things.

One, i don't think application writers can always guarantee that even if they 
get the write buffer size, all the data will be exempt from Nagle.  For 
example, in the BSD stack at least, if you pass too much data for the current 
(congestion or whatever) window in your send() call, subsequent packets may be 
subject to Nagle (thus susceptible to bad interactions with delayed ACKs at 
the remote side).  Check out, for example, John Heidemann's article in the 
April, 1997, ACM SIGCOMM Computer Communication Review, or Henrik Nielsen, et 
al, from SIGCOMM 1997, for examples of fairly well-crafted applications that 
still run afoul of Nagle/delayed ACK interactions.

Two, i think that making this change may have the effect of preventing people 
from wholesale disabling of Nagle.

(I also think i'm happy to have the change apply to "send()s", as is quite 
often current practice, rather than to the output of TCP.  I think, though, 
that it should be *written* as if it applies to the output of TCP, just as 
RFC896.)

Does that make sense?

Cheers,  Greg


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 21:56:42 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA07028
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 21:56:41 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA02199; Mon, 8 Feb 1999 20:32:57 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA29457; Mon, 8 Feb 1999 20:27:14 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 9 Feb 1999 01:50:46 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10A1ww-001xhGC; Mon, 8 Feb 1999 17:26:50 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id RAA01497; Mon, 8 Feb 1999 17:27:35 -0800 (PST)
Message-Id: <199902090127.RAA01497@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: David Borman <dab@BSDI.COM>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 04 Feb 1999 11:07:58 CST."
             <199902041707.LAA02391@frantic.bsdi.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 08 Feb 1999 17:27:34 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Dave,

If i understand what you did several years ago:

> I changed our TCP so that when an application does a single write, if
> there isn't any outstanding un-acked data, all the new data will go
> out (subject to constraints of the congestion window), including the
> trailing partial packet.

you are doing what various other people (Rick Jones, most notably) have said 
they are doing.  I.e., Nagle sort of at the output of send() rather than in 
tcp_output().  (The BSD code i have, from FreeBSD 2.2.x, does Nagle on *input* 
to tcp_output(), rather than as tcp_output() emits packets to ip_output(); 
this seems sort of an approximation [possibly less than ideal] to what you are 
doing.)

I think this is a fine thing to do.  However, i don't think (as you seem to) 
that this is the same as what the I-D is proposing.

According to the I-D (or, at least the *intent* of the I-D, as Vernon points 
out that the english description may not be ideal!; see below for some 
pseudo-code), if an application calls send() with 20 bytes, and some MSS-sized 
packets are not acknowledged, but no packets < MSS are unacknowledged, then 
(consistent with windows) a packet with a TCP payload of 20 bytes will be 
transmitted.

I think that is *not* what your modification would do.  (Correct me if i am 
wrong!)

So, the $64,000 question is: now that you (hopefully, if i'm getting better at 
explaining it!) know what the modification is, what are your thoughts about it?

Thanks,  Greg
----
(eff.snd.mss is RFC1122 terminology.)

Current Nagle:

	if ((available_data < eff.snd.mss) && (snd.una != snd.nxt))
		don't send;

New Nagle:

	if ((available_data < eff.snd.mss) && (snd.una < snd.sml))
		don't send;
	else
		snd.sml = snd.nxt+available_data;

(and, code to "pull" snd.sml along, so that it doesn't get left behind if the 
sequence space wraps.)
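
The two tests above can be written out as a runnable sketch.  It follows
the pseudo-code literally, including the update of snd.sml in the else
branch; the function names, the 32-bit modular comparison helper seq_lt,
and the example sequence numbers are assumptions added for illustration
and are not taken from the I-D or from any stack.

```python
SEQ_MOD = 1 << 32


def seq_lt(a, b):
    """True if sequence number a comes before b in 32-bit modular space."""
    return a != b and ((b - a) % SEQ_MOD) < (1 << 31)


def classic_nagle_may_send(available, mss, snd_una, snd_nxt):
    """Current Nagle: hold a small segment while any data is unacked."""
    return not (available < mss and snd_una != snd_nxt)


def modified_nagle(available, mss, snd_una, snd_nxt, snd_sml):
    """New Nagle: return (may_send, new_snd_sml) as in the pseudo-code."""
    if available < mss and seq_lt(snd_una, snd_sml):
        return False, snd_sml
    return True, (snd_nxt + available) % SEQ_MOD


# 20 bytes to send, two MSS-sized packets unacked, no small packet unacked.
mss, snd_una, snd_nxt, snd_sml = 1460, 1000, 3920, 1000
print(classic_nagle_may_send(20, mss, snd_una, snd_nxt))
print(modified_nagle(20, mss, snd_una, snd_nxt, snd_sml))

# A second 20-byte send is held until the first small packet is acked.
print(modified_nagle(20, mss, snd_una, 3940, 3940))
```

Using a modular comparison is one way to keep the snd.una < snd.sml test
meaningful across a sequence-number wrap; it does not replace the code
needed to pull snd.sml along on an idle connection.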


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 22:24:20 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA07297
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 22:24:20 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA17181; Mon, 8 Feb 1999 21:07:58 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA16465; Mon, 8 Feb 1999 21:05:26 -0500 (EST)
Received: from Eng.Sun.COM (engmail2 [129.146.1.25]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA03513 for <tcp-impl@lerc.nasa.gov>; Mon, 8 Feb 1999 18:05:21 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id SAA16632
	for <tcp-impl@lerc.nasa.gov>; Mon, 8 Feb 1999 18:05:19 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id SAA16333
	for <tcp-impl@lerc.nasa.gov>; Mon, 8 Feb 1999 18:05:16 -0800 (PST)
Date: Mon, 8 Feb 1999 18:05:16 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902090053.QAA01403@red.mtv.siara.com>
Message-ID: <Roam.SIMCSD.2.0.4.918525916.11885.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> One, i don't think application writers can always guarantee that even if
> they  get the write buffer size, all the data will be exempt from Nagle. 
> For  example, in the BSD stack at least, if you pass too much data for the
> current  (congestion or whatever) window in your send() call, subsequent
> packets may be  subject to Nagle (thus susceptible to bad interactions with
> delayed ACKs at  the remote side).

If we agree that Nagle algorithm should be applied on a per send basis, not on
a per segment basis, then change the BSD code so that it will send out the
remaining data.  And I think Dave has pointed out that his BSD code does that,
more or less.  Please correct me if I misunderstood that.

> Two, i think that making this change may have the effect of preventing
> people  from wholesale disabling of Nagle.

I don't understand.  We are trying to discourage people from disabling Nagle
by default, right?  Do I misunderstand your sentence above?

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 22:58:16 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA09066
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 22:58:16 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA29887; Mon, 8 Feb 1999 21:37:59 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA28625; Mon, 8 Feb 1999 21:34:42 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id TAA10622
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Mon, 8 Feb 1999 19:34:40 -0700 (MST)
Date: Mon, 8 Feb 1999 19:34:40 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902090234.TAA10622@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>                                              ... if i'm getting better at 
> explaining it! ...

The best possible explanation should be put into anything that is to
eventually become an RFC.

In other words, should the proposal be advanced, the text needs to be
sharpened.   For example, it would be quite wrong to leave issues such as
whether the algorithm is to be applied to application requests or the
output of the segmenter to the archives of this mailing list, since it
seems the results on the wire differ.  A standard must support a


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  8 23:14:19 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA09157
	for <tcpimpl-archive@lists.ietf.org>; Mon, 8 Feb 1999 23:14:18 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA08501; Mon, 8 Feb 1999 21:58:02 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA06343; Mon, 8 Feb 1999 21:52:07 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id VAA16222
	for <tcp-impl@lerc.nasa.gov>; Mon, 8 Feb 1999 21:52:03 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id SAA23114 for <tcp-impl@lerc.nasa.gov>; Mon, 8 Feb 1999 18:52:05 -0800 (PST)
Message-ID: <36BFA2D5.83017FB5@cup.hp.com>
Date: Mon, 08 Feb 1999 18:52:05 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902090053.QAA01403@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> One, i don't think application writers can always guarantee that even if they
> get the write buffer size, all the data will be exempt from Nagle.  For
> example, in the BSD stack at least, if you pass too much data for the current
> (congestion or whatever) window in your send() call, subsequent packets may be
> subject to Nagle (thus susceptible to bad interactions with delayed ACKs at
> the remote side).  Check out, for example, John Heidemann's article in the
> April, 1997, ACM SIGCOMM Computer Communications Review, or Henrik Nielsen, et
> al, from SIGCOMM 1997, for examples of fairly well-crafted applications that
> still run afoul of Nagle/delayed ACK interactions.

OK, so one of the problems reported was (IMO) a bug in Apache (sending
the headers separate from the data and yes, prompted by a browser bug
:), so I'm not sure that is not a priori cause to modify TCP.

The second biggie there - the small last segment is at least in part
solved by applying nagle on a per-send basis. That does indeed leave-out
the case where an application with 4097 bytes of data sends it as 4096
+1, and that last "1" gets delayed because it was a separate send. I'm
still not sure that is not an application bug for not presenting the
4097 bytes of data in a writev() or the like.
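
As a rough illustration of the writev() suggestion, here is a Python sketch that hands the 4096-byte piece and the trailing byte to the kernel in one call. socket.sendmsg() accepts a list of buffers, in the same spirit as writev(). The helper name send_gathered is hypothetical, and the demonstration runs over a local socketpair rather than TCP, so it shows only the shape of the call, not any Nagle behaviour.

```python
import socket


def send_gathered(sock, pieces):
    """Pass all pieces to the kernel in a single gather write."""
    # sendmsg() may accept fewer bytes than offered; a real caller would
    # loop until everything has been queued.
    return sock.sendmsg(pieces)


a, b = socket.socketpair()
sent = send_gathered(a, [b"x" * 4096, b"y"])

# Read back what was sent so both ends can be closed cleanly.
received = b""
while len(received) < sent:
    received += b.recv(8192)
a.close()
b.close()
```

With a single call there is no separate one-byte send() for TCP to consider holding back.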

Where that magic constant changes I do not know, and I can see that as
an argument for some explicit flush mechanism between the app and TCP -
ioctl or send() flag I suppose. I think though that an application ought
to be able to present up to SO_SNDBUF bytes of a > SO_SNDBUF bytes
request or response. 

Regardless, it is hard for me not to believe that the experiments in
the paper were simply conducted with buggy implementations.

> Two, i think that making this change may have the effect of preventing people
> from wholesale disabling of Nagle.

People are already wholesale disabling Nagle, the only question is
whether or not we can reverse that trend.

> (I also think i'm happy to have the change apply to "send()s", as is quite
> often current practice, rather than to the output of TCP.  I think, though,
> that it should be *written* as if it applies to the output of TCP, just as
> RFC896.)

If the updated RFC is written as if it applies to the output of TCP,
there will be someone who really wants to pick nits with implementations
that are still doing the right/sufficient thing, just not implementing
it precisely the same way. That concerns me.

Also, I'm not sure what part of RFC896 is stating that the algorithm
should be implemented on a per-segment basis rather than per-send. For
example, the second paragraph of the section "The solution to the small
packet problem" reads:

#begin quote
The solution is to inhibit the sending of new TCP  segments  when
new  outgoing  data  arrives  from  the  user  if  any previously
transmitted data on the connection remains unacknowledged.   This
inhibition  is  to be unconditional; no timers, tests for size of
data received, or other conditions are required.   Implementation
typically requires one or two lines inside a TCP program.
#end quote

Notice how it says "new data arrives from the user" - that sounds much
more like someone calling "send()" than TCP sending a segment. Two
paragraphs later it reads:

#begin quote
When a user process writes to a TCP connection, TCP receives some
data.   It  may  hold  that data for future sending or may send a
packet immediately.  If it refrains from  sending  now,  it  will
typically send the data later when an incoming packet arrives and
changes the state of the system.  The state changes in one of two
...
Thus, when we omit sending data on arrival from the
user,  we  are  simply  deferring its transmission until the next
message arrives from the distant host. 
#end quote

again "a user process writes to a TCP connection" and "when we omit
sending data on arrival from the user" sounds like a call to "send()" 

If I then switch to RFC1122, I find this in 4.2.2.14:

#begin quote
                 Another important TCP performance issue is that some
                 applications, especially remote login to character-at-
                 a-time hosts, tend to send streams of one-octet data
                 segments.  To avoid deadlocks, every TCP SEND call from
                 such applications must be "pushed", either explicitly
                 by the application or else implicitly by TCP.  The
                 result may be a stream of TCP segments that contain one
                 data octet each, which makes very inefficient use of
                 the Internet and contributes to Internet congestion.
                 The Nagle Algorithm described in Section 4.2.3.4
                 provides a simple and effective solution to this
                 problem.  It does have the effect of clumping
#end quote

with the bit about "every TCP SEND call from such applications..."
wouldn't that be the send() socket call? That is the call the
application makes. And it says that each send should be "pushed" modulo
4.2.3.4 so:

#begin quote
         4.2.3.4  When to Send Data
 
            A TCP MUST include a SWS avoidance algorithm in the sender.
 
            A TCP SHOULD implement the Nagle Algorithm [TCP:9] to
            coalesce short segments.  However, there MUST be a way for
            an application to disable the Nagle algorithm on an
            individual connection.  In all cases, sending data is also
            subject to the limitation imposed by the Slow Start
            algorithm (Section 4.2.2.15).
 
            DISCUSSION:
                 The Nagle algorithm is generally as follows:
 
                      If there is unacknowledged data (i.e., SND.NXT >
                      SND.UNA), then the sending TCP buffers all user
                      data (regardless of the PSH bit), until the
                      outstanding data has been acknowledged or until
                      the TCP can send a full-sized segment (Eff.snd.MSS
                      bytes; see Section 4.2.2.6).
#end quote

So, one might interpret the "regardless of the PSH" bit as being counter
to the "interpret on each send()" that I am espousing, but 1122 was
expanding upon 896 anyway (since 896 was an unconditional wait, not tied
to MSS) so I would argue that interpreting per-send is either a return
to (the possible) intent of 896 (it talked of user sends, not tcp_output
calls) or simply a further refinement of 1122 (I'm sure there is more
chapter and verse out there...). 

Short of an explicit end of message indicator, or an explicit flush from
the user, the send() boundary is the best bet TCP has for guessing if
there is going to be any more data to even be able to make-up an
MSS-sized segment in a request/response situation. 

We are already (or ought to be) telling apps writers that bigger sends
are better. The bigger the sends become, the more Nagle+delACK becomes a
race condition or less of a concern yes? 

Similarly for a per-send interpretation of Nagle on a bulk-data
application - if the application gets ahead of the cwnd (how often does
a bulk-data app remain behind cwnd anyway?) cwnd is going to do the
aggregating for us anyway. 

Make SO_SNDBUF just a skosh larger (perhaps simply as a matter of when
one decides it is full...) than the TCP window and you get the same
thing when you do stay ahead of cwnd - for instance, HP-UX TCP's from
HP-UX 9 up through 10.20 allow a full MCLBYTES to be queued to the
socket even if there is only 1 byte left. Fewer icky, small mbuf issues
then, and it gives you a better shot of buffering up data in bulk
transfers and not having small segments.

I'm not sure which is easier to implement or the most logical (I have an
opinion on the latter :) - per-send or tracking that sndnxt-snduna is
not an MSS multiple. It sounds like many stacks have already gone
per-send. That (and the proposed change) still leaves all those "series
of sub-mss send" apps wanting to disable Nagle, and I'm betting they are
the ones we really need to get "fixed" - which is only going to be fixed
with the explicit flush. (I do believe that explicit flush appears
easier to implement than CORK...)

It is interesting (at least to me) to consider what the packets would
look like had HTTP been implemented on top of UDP - packet generation
would have been on a per-send basis as well...and we would probably be
encouraging the http apps types to be presenting as much data to the
transport at one time as possible and using writev() and such to avoid
long series of sub-MTU-sized datagrams...

rick
-- 
today, an ACK is just as many CPU cycles as a data segment...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Tue Feb  9 18:39:53 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA01623
	for <tcpimpl-archive@lists.ietf.org>; Tue, 9 Feb 1999 18:39:53 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA03832; Tue, 9 Feb 1999 16:58:04 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from saba.cs.washington.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA01527; Tue, 9 Feb 1999 16:52:49 -0500 (EST)
Received: from localhost (cardwell@localhost) by saba.cs.washington.edu (8.8.8+CS/7.2ws+) with SMTP id NAA08272; Tue, 9 Feb 1999 13:52:44 -0800
Date: Tue, 9 Feb 1999 13:52:44 -0800 (PST)
From: Neal Cardwell <cardwell@cs.washington.edu>
To: Greg Minshall <minshall@siara.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-Reply-To: <199902090127.RAA01497@red.mtv.siara.com>
Message-ID: <Pine.LNX.4.02A.9902091344370.5238-100000@saba.cs.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> ----
> (eff.snd.mss is RFC1122 terminology.)
> 
> Current Nagle:
> 
> 	if ((available_data < eff.snd.mss) && (snd.una != snd.nxt))
> 		don't send;
> 
> New Nagle:
> 
> 	if ((available_data < eff.snd.mss) && (snd.una < snd.sml))
> 		don't send;
> 	else
> 		snd.sml = snd.nxt+available_data;
> 
> (and, code to "pull" snd.sml along, so that it doesn't get left behind if the 
> sequence space wraps.)

Just to clarify, i believe the New Nagle pseudocode is a little closer to:

	if (available_data < eff.snd.mss) {
		if (snd.una < snd.sml)
			don't send;
		else
			snd.sml = snd.nxt+available_data;
	}
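
For comparison, the two rules can be written out as a small Python sketch. It uses plain integer comparisons and so ignores sequence-space wrap. The function names and the (may_send, snd_sml) return convention are assumptions made for illustration, not part of the draft.

```python
def current_nagle_may_send(available, mss, snd_una, snd_nxt):
    """Classic Nagle: hold a short segment while anything is unacknowledged."""
    return not (available < mss and snd_una != snd_nxt)


def new_nagle(available, mss, snd_una, snd_nxt, snd_sml):
    """Modified Nagle, following the nested form of the pseudocode.

    Returns (may_send, updated_snd_sml).
    """
    if available < mss:
        if snd_una < snd_sml:
            # An earlier small segment is still unacknowledged: hold.
            return False, snd_sml
        # Send, and remember where this small segment ends.
        return True, snd_nxt + available
    # Full-sized segment: send, and leave snd.sml untouched.
    return True, snd_sml
```

With two full 1460-byte segments outstanding (snd.una = 1000, snd.nxt = 3920) and 33 bytes available, current_nagle_may_send() says to hold, while new_nagle() allows the send and moves snd.sml to 3953. A further short write is then held until that small segment is acknowledged.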

neal


From owner-tcp-impl@lerc.nasa.gov  Tue Feb  9 22:28:18 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA04042
	for <tcpimpl-archive@lists.ietf.org>; Tue, 9 Feb 1999 22:28:18 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA18306; Tue, 9 Feb 1999 20:53:08 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA15917; Tue, 9 Feb 1999 20:48:01 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 10 Feb 1999 02:11:36 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10AOkZ-001xhQC; Tue, 9 Feb 1999 17:47:35 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id RAA03874; Tue, 9 Feb 1999 17:24:56 -0800 (PST)
Message-Id: <199902100124.RAA03874@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Neal Cardwell <cardwell@cs.washington.edu>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Tue, 09 Feb 1999 13:52:44 PST."
             <Pine.LNX.4.02A.9902091344370.5238-100000@saba.cs.washington.edu> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 09 Feb 1999 17:24:55 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Just to clarify, i believe the New Nagle pseudocode is a little closer to:

yes, right.  thanks!


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 15:59:22 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id PAA21666
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 15:59:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA15092; Thu, 11 Feb 1999 14:18:33 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA12588; Thu, 11 Feb 1999 14:14:39 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 11 Feb 1999 19:38:20 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10B1Yg-001xhGC; Thu, 11 Feb 1999 11:13:54 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA05328; Thu, 11 Feb 1999 11:14:23 -0800 (PST)
Message-Id: <199902111914.LAA05328@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Mon, 08 Feb 1999 18:05:16 PST."
             <Roam.SIMCSD.2.0.4.918525916.11885.kcpoon@jurassic> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 11 Feb 1999 11:14:22 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

K. Poon,

> If we agree that Nagle algorithm should be applied on a per send
> basis, not on a per segment basis, then change the BSD code so that it
> will send out the remaining data.  And I think Dave has pointed out
> that his BSD code does that, more or less.  Please correct me if I
> misunderstood that. 

1.  I may well have misunderstood what Dave said.  However, if the final 
send() of a response is for 33 bytes, say, and there is unacknowledged data 
(from "full sized" send() calls), then the current Nagle would block the 33 
bytes until all previous data has been acknowledged.  The modified Nagle would 
send the 33 bytes immediately (modulo windows, etc.).

> I don't understand.  We are trying to discourage people from disabling
> Nagle by default, right?  Do I misunderstand your sentence above?

Yes, that is what we are trying to do.  I'm sorry that my sentence was unclear.


By the way, there are a couple of ways that you could apply "Nagle on a per 
send basis".

The way the TCP stack on my machine works (FreeBSD 2.2.x) is that when 
tcp_output is called, if there is no unacknowledged data, it will send *all* 
the available data (modulo windows), even if the last segment (possibly only 
segment) it transmits is less than a full MSS.  In *this* case, data that is 
*not* transmitted in this call to tcp_output because of window restrictions 
may be subject to being delayed by Nagle the next time tcp_output is called 
(because of a returning ACK, say).

It sounds like Dave has made sure that the entire send()'s amount of data gets 
at least this treatment (rather than on a per-call-to-tcp_output() basis as is 
true in my stack).  But, again, it may be the case that in his, or others', 
stacks, the "residual" data not transmitted this time around may be subject to 
being delayed by Nagle later in the connection.

Greg


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 16:26:38 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA21945
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 16:26:38 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA16954; Thu, 11 Feb 1999 15:09:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA10023; Thu, 11 Feb 1999 14:58:57 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 11 Feb 1999 20:22:38 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10B1tC-001xhOC; Thu, 11 Feb 1999 11:35:06 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA05403; Thu, 11 Feb 1999 11:35:31 -0800 (PST)
Message-Id: <199902111935.LAA05403@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Mon, 08 Feb 1999 18:52:05 PST."
             <36BFA2D5.83017FB5@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 11 Feb 1999 11:35:31 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

> The second biggie there - the small last segment is at least in part
> solved by applying nagle on a per-send basis. That does indeed
> leave-out the case where an application with 4097 bytes of data sends
> it as 4096 +1, and that last "1" gets delayed because it was a
> separate send. I'm still not sure that is not an application bug for
> not presenting the 4097 bytes of data in a writev() or the like. 

Let's say a web server (httpd) is reading data from a CGI script, buffering 
it, and passing it down to TCP to send out over an ethernet.  Buffering it in 
4096 byte chunks seems reasonable.  Let's say the CGI script writes 4096 bytes 
(or, 5000 bytes, or ...).  Then the question is "should the trailing bytes be 
delayed by Nagle?".

I think the answer is "no".  The application, in this case, is doing an 
adequate job of buffering (larger than the MSS), and the trailing data should 
be sent "immediately" (modulo windows stuff).


> People are already wholesale disabling Nagle, the only question is
> whether or not we can reverse that trend.

I think making this change is a necessary, though perhaps not sufficient, step 
in reversing that trend.


> If the updated RFC is written as if it applies to the output of
> TCP, there will be someone who really wants to pick nits with
> implementations that are still doing the right/sufficient thing, just
> not implementing it precisely the same way. That concerns me. 

I think this is a valid point (as you have pointed out multiple times, and as 
many other people have pointed out again and again).  I will try to re-write 
the draft in a way that allows either behavior.  (I think i will say that some 
implementations apply Nagle on application send(), and that that is perfectly 
reasonable; i don't think i will say "that is preferred", since that seems to 
open a can of worms.  Thoughts from the assembled multitude?)


> The bigger the sends become, the more Nagle+delACK becomes a
> race condition or less of a concern yes?

My experience is that the more you close the window for a race condition, the 
more it seems to occur!


I think that by modifying the Nagle algorithm in the manner presented in the 
draft, we can improve the performance for a percentage of the request/response 
interactions for many internet applications (including the biggie, http) in a 
way that is transparent to application developers.  I think this is a win.


Greg (who'd like to move a revised draft into the RFC process, possibly as 
experimental)


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 17:07:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA22591
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 17:07:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA02921; Thu, 11 Feb 1999 15:33:21 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA02203; Thu, 11 Feb 1999 15:32:13 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id PAA26395
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 15:32:08 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id MAA28185 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 12:32:09 -0800 (PST)
Message-ID: <36C33E49.4F80B1AD@cup.hp.com>
Date: Thu, 11 Feb 1999 12:32:09 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902111935.LAA05403@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > separate send. I'm still not sure that is not an application bug for
> > not presenting the 4097 bytes of data in a writev() or the like.
> 
> Let's say a web server (httpd) is reading data from a CGI script, buffering
> it, and passing it down to TCP to send out over an ethernet.  Buffering it in
> 4096 byte chunks seems reasonable.  Let's say the CGI script writes 4096 bytes
> (or, 5000 bytes, or ...).  Then the question is "should the trailing bytes be
> delayed by Nagle?".
> 
> I think the answer is "no".  The application, in this case, is doing an
> adequate job of buffering (larger than the MSS), and the trailing data should
> be sent "immediately" (modulo windows stuff).

Well, unless 2*MSS is an integral divisor (?) of the blessed application
buffering size, the second, small send is going to be delayed anyway, in
either nagle case - original or updated right? (and I suspect we do not
want to go down the rathole of having apps getsockopt(TCP_MAXSEG)?-)

send(4096)
  transmit 1460
  transmit 1460
  transmit 1176  
send(904)
  wait for ack
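
The trace above can be reproduced with a small Python sketch of the modified rule applied to each outgoing segment. It assumes an MSS of 1460, that no ACK arrives between the two sends, and that windows never limit transmission. trace() is a made-up helper, not code from any stack.

```python
def trace(sends, mss=1460):
    """List the segments emitted (or "wait") for a series of send() sizes."""
    snd_una = snd_nxt = snd_sml = 0  # nothing is ever ACKed in this model
    queued = 0
    events = []
    for size in sends:
        queued += size
        while queued > 0:
            seg = min(queued, mss)
            if seg < mss and snd_una < snd_sml:
                # A small segment is already in flight: hold this one.
                events.append("wait")
                break
            if seg < mss:
                snd_sml = snd_nxt + seg  # remember the end of the small segment
            snd_nxt += seg
            queued -= seg
            events.append(seg)
    return events
```

trace([4096, 904]) gives 1460, 1460, 1176 and then "wait", as in the trace. trace([2920, 904]) sends the trailing 904 bytes at once, because the first send left no small segment outstanding.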

The only thing that will force that second, small send out would be an
explicit flush, and if there is an explicit flush, is there any real
need to change the interpretation of nagle?

Perhaps the thing to focus on and codify then is the explicit flush? I
suspect that the explicit flush would be sufficient to reverse the trend
of TCP_NODELAY usage. The app will always know better than TCP just what
is and is not the end of a message no matter what naglesque heuristic we
use, and even if apps started flushing on every send the behaviour on
the net would be no worse than it is today with TCP_NODELAY.

> I think that by modifying the Nagle algorithm in the manner presented in the
> draft, we can improve the performance for a percentage of the request/response
> interactions for many internet applications (including the biggie, http) in a
> way that is transparent to application developers.  I think this is a win.

I do not think that just the interpretation change alone will make much
of a difference because the internal application buffering such as you
discuss will not align nicely with 2*MSS, or the applications will make
a pair of sub-mss sends.

Also, it sounds like we have a number of stacks out there already which
are implementing something very close to the proposed change to Nagle,
and we still have rampant use of TCP_NODELAY.

So, we could go ahead with the codification of the change to nagle as
being per-send/the draft, but unless explicit flush becomes reality I
don't think there will be a significant impact on the use of
TCP_NODELAY.

rick 
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 17:17:38 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA22986
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 17:17:37 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA18071; Thu, 11 Feb 1999 15:58:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA16544; Thu, 11 Feb 1999 15:55:58 -0500 (EST)
Received: from Eng.Sun.COM (engmail2 [129.146.1.25]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id MAA14327 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 12:55:57 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id MAA25458
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 12:55:54 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id MAA17877
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 12:55:54 -0800 (PST)
Date: Thu, 11 Feb 1999 12:55:54 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902111935.LAA05403@red.mtv.siara.com>
Message-ID: <Roam.SIMCSD.2.0.4.918766554.29208.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Let's say a web server (httpd) is reading data from a CGI script, buffering 
> it, and passing it down to TCP to send out over an ethernet.  Buffering it
> in  4096 byte chunks seems reasonable.  Let's say the CGI script writes 4096
> bytes  (or, 5000 bytes, or ...).  Then the question is "should the trailing
> bytes be  delayed by Nagle?".

I think most implementations support the socket option TCP_MAXSEG.  If a
well written application is going to buffer data, I think it is better
to buffer data in multiples of SMSS (sending MSS) bytes.  Then the
problem mentioned above should not arise, assuming PMTU does not change...
Should application buffering in multiples of SMSS bytes be
mentioned in the draft as a best practice?
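
A minimal sketch of the buffering suggestion above, not taken from the
draft or from any implementation.  The function names are hypothetical,
and the SMSS of 1460 is an assumed Ethernet value; a real application
would read it from the connected socket, for example with the
TCP_MAXSEG option mentioned above.

```python
def smss_aligned_buffer(preferred_size, smss):
    """Largest multiple of smss that fits in preferred_size (at least one smss)."""
    return max(1, preferred_size // smss) * smss


def split_writes(total_bytes, buffer_size):
    """Sizes of the successive writes made by an application that flushes
    its buffer each time it fills, followed by one final partial write."""
    sizes = []
    while total_bytes > 0:
        n = min(total_bytes, buffer_size)
        sizes.append(n)
        total_bytes -= n
    return sizes


SMSS = 1460  # assumed value; in practice read it from the connected socket

print(split_writes(5000, 4096))                             # [4096, 904]
print(split_writes(5000, smss_aligned_buffer(4096, SMSS)))  # [2920, 2080]
```

With the aligned buffer, every write except the last is a whole number
of segments.  The final write can still end in a short segment (here
2080 = 1460 + 620), so this removes the mid-response short segment
rather than the trailing one.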

> I think that by modifying the Nagle algorithm in the manner presented in the 
> draft, we can improve the performance for a percentage of the
> request/response  interactions for many internet applications (including the
> biggie, http) in a  way that is transparent to application developers.  I
> think this is a win.

Yes, I think we all agree with the draft's good intention.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 17:54:43 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA23750
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 17:54:42 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA15964; Thu, 11 Feb 1999 16:44:58 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA14780; Thu, 11 Feb 1999 16:42:44 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 214;
          Thu, 11 Feb 1999 13:42:41 -0800
Message-ID: <36C34EC7.DD2DBE3A@ehsco.com>
Date: Thu, 11 Feb 1999 13:42:31 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Greg Minshall <minshall@siara.com>
CC: Rick Jones <raj@cup.hp.com>, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902111935.LAA05403@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> Greg (who'd like to move a revised draft into the RFC process,
> possibly as experimental)

I'm all for experimental.

I'd also like to see some sort of wording that says "developers should
only disable Nagle if they know for certain that they will only be
generating data that is cumulatively smaller than two full-sized
segments, as many writes of small blocks of data will negatively impact
the network, possibly to the point that performance degradation occurs
for all applications (including the disabler)" or something similar.
This helps to get the message out.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 17:55:14 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA23767
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 17:55:14 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA09325; Thu, 11 Feb 1999 16:33:16 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA07022; Thu, 11 Feb 1999 16:29:22 -0500 (EST)
Received: from big (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id QAA32144;
	Thu, 11 Feb 1999 16:29:19 -0500
Message-Id: <3.0.5.32.19990211162919.03034330@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Thu, 11 Feb 1999 16:29:19 -0500
To: Joe Touch <touch@ISI.EDU>, tcp-impl@lerc.nasa.gov, minshall@siara.com
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 14:48 2/3/99 -0800, Joe Touch wrote:

>>    This draft is NOT suggesting that these applications should disable
>>    the Nagle algorithm.
>
>Why not?

This is actually a hard question to answer.

>Nagle was a solution to char-at-a-time remote logins, and
>is discouraged for transactional systems, even ones with
>bursts as small as a few characters, e.g., X11.

In HTTP the size of requests and responses vary arbitrarily from message to
message so even though the proposed change fixes the particular problem of
the odd, small final segment in Apache, it is always possible to find
scenarios where Nagle is triggered to delay packets even though it is not
desired.

If I understand correctly then the following HTTP pipeline example will
trigger Nagle into delaying the 2nd response:

	client			server
	HTTP REQ 1	-->	
			<--	1st segment of RES 1
	HTTP REQ 2	-->
			<--    last 1.25 segment of RES 1
			<--	0.25 segment RES 2 delayed

From (at least a Web) application point of view, the main problem is
actually not Nagle itself but the fixed delay (at least in the sense that
it isn't adjusted to fit the link). Risking this delay is likely to have
app writers disable Nagle regardless of the proposed change. One
solution would be to adjust the delay but knowing timers, this is normally
not an easy task.

Instead, added control at the application layer would be of tremendous help
as the app is likely to know what is going on. In other words, I like the
SO_EXPLICITPUSH flag idea a lot and think that it would solve at least our
needs.

Henrik

--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 18:06:12 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA23964
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 18:06:11 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA15355; Thu, 11 Feb 1999 16:43:17 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA12204; Thu, 11 Feb 1999 16:38:18 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id NAA21886;
	Thu, 11 Feb 1999 13:38:10 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id NAA16638;
	Thu, 11 Feb 1999 13:38:10 -0800 (PST)
Date: Thu, 11 Feb 1999 13:38:10 -0800 (PST)
Message-Id: <199902112138.NAA16638@rum.isi.edu>
To: touch@ISI.EDU, tcp-impl@lerc.nasa.gov, minshall@siara.com, frystyk@w3.org
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Thu, 11 Feb 1999 16:29:19 -0500
> To: Joe Touch <touch@ISI.EDU>, tcp-impl@lerc.nasa.gov, minshall@siara.com
> From: Henrik Frystyk Nielsen <frystyk@w3.org>
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
> 
...
> Instead, added control at the application layer would be of tremendous help
> as the app is likely to know what is going on. In other words, I like the
> SO_EXPLICITPUSH flag idea a lot and think that it would solve at least our
> needs.
> 
> Henrik

I agree with this. It isn't clear how the current efforts
to generalize Nagle will help this; if anything, they 
potentially lull the application programmer into assuming
TCP will "do the right thing".

The right thing, in this case, is complicated semantically,
and hard to infer simply from send calls or message sizes.
The explicit signalling seems like a much more useful solution.

Joe


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 18:12:42 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA24029
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 18:12:42 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA27379; Thu, 11 Feb 1999 17:03:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA25016; Thu, 11 Feb 1999 16:59:33 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id OAA19953
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 11 Feb 1999 14:59:32 -0700 (MST)
Date: Thu, 11 Feb 1999 14:59:32 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902112159.OAA19953@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>>I think that by modifying the Nagle algorithm in the manner presented in the
>>draft, we can improve the performance for a percentage of the request/response
>>interactions for many internet applications (including the biggie, http) in a
>>way that is transparent to application developers.  I think this is a win.

As the discussion has gone on, there seems to me less justification for
changing the Nagle algorithm.  Could someone review the plausible scenarios
where the change helps?  (I.e. not just switching the interpretation of
"send" from what it seems no one used to something closer to what everyone
is doing.)


> I do not think that just the interpretation change alone will make much
> of a difference because the internal application buffering such as you
> discuss will not align nicely with 2*MSS, or the applications will make
> a pair of sub-mss sends.
>
> Also, it sounds like we have a number of stacks out there already which
> are implementing something very close to the proposed change to Nagle,
> and we still have rampant use of TCP_NODELAY.
>
> So, we could go ahead with the codification of the change to nagle as
> being per-send/the draft, but unless explicit flush becomes reality I
> don't think there will be a significant impact on the use of
> TCP_NODELAY.

That's an excellent point.  Its weakness is in the implicit assumption
that there exist many request-response applications that use transactions
larger than the MSS and smaller than the windows, where 100 ms one way or
another matters, which are not turning off Nagle for other reasons, and
also where turning off Nagle would be less desirable.  Concrete examples
of such applications are required for any change to TCP.

That bad programmers might imitate people who turn off Nagle for good
reasons is not justification for adding a TCP Flush knob.  The incompetent
will do the wrong thing regardless.  The best you might hope is that
they'd use the flush knob as well as turning off Nagle.


One must never add code to an operating system by reasoning "why not,"
since down that path lie many infamous piles of bloat.  It is even more
important to never write standards based on vague hopes or "it can't
hurt" reasoning.  Down that road lie the dead protocols such as the ISO
OSI suite, 100VG-AnyLAN, and FDDI-II.  That way also is populated by
Winsock 2.0 (read the spec to see the zillions of geegaws, geewhizzes,
and "wouldn't-it-be-nice"s) and WIN32 (which must be used to be believed,
with its 37 ways to do anything, all of which require at least 300 lines
more than you'd expect).


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 18:23:40 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA24095
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 18:23:39 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA05940; Thu, 11 Feb 1999 17:18:20 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA04380; Thu, 11 Feb 1999 17:15:53 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 11 Feb 1999 22:39:22 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10B4OG-001xhGC; Thu, 11 Feb 1999 14:15:20 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id OAA05567; Thu, 11 Feb 1999 14:15:48 -0800 (PST)
Message-Id: <199902112215.OAA05567@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Henrik Frystyk Nielsen <frystyk@w3.org>
cc: Joe Touch <touch@ISI.EDU>, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 11 Feb 1999 16:29:19 EST."
             <3.0.5.32.19990211162919.03034330@localhost> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 11 Feb 1999 14:15:48 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Henrik,

The modification to Nagle is designed to help lock-step request/response 
protocols (such as HTTP used in a non-pipelined fashion(**)).

Your example is:
----
	client			server
	HTTP REQ 1	-->	
			<--	1st segment of RES 1
	HTTP REQ 2	-->
			<--    last 1.25 segment of RES 1
			<--	0.25 segment RES 2 delayed
----

If httpd had seen REQ 2, then i think it should have done one send which had 
the 1.25 RES 1 and the 0.25 RES 2.

If httpd had not seen REQ 2, then you are right, the 0.25 RES 2 will be 
delayed.

Notice, however, that if you were using something like SO_EXPLICITPUSH, and 
you set SO_EXPLICITPUSH for the 1.25 RES 1, and then turned around and set 
SO_EXPLICITPUSH for the 0.25 RES 2, *my* [fictional] implementation would 
*also* delay the 0.25 RES 2.  (This is because i would be using 
SO_EXPLICITPUSH as a way to *avoid* sending < MSS packets, but would *still* 
be protecting the network from excess traffic from small packets.)

Having an adaptive delayed ack scheme is an interesting idea (that others, 
including John Nagle recently, have thought about).  I don't know if it would 
be based on link speed, "one-way-ness" of application traffic, or what.  It is 
independent of the Nagle modification.

Greg

(**) Any idea how much HTTP is pipelined these days, or what the trend is?


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 18:53:10 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA24613
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 18:53:09 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA22757; Thu, 11 Feb 1999 17:48:19 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA21606; Thu, 11 Feb 1999 17:46:24 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 11 Feb 1999 23:10:05 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10B4VU-001xhGC; Thu, 11 Feb 1999 14:22:48 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id OAA05610; Thu, 11 Feb 1999 14:23:16 -0800 (PST)
Message-Id: <199902112223.OAA05610@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 11 Feb 1999 12:32:09 PST."
             <36C33E49.4F80B1AD@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 11 Feb 1999 14:23:16 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

In the case you are doing Nagle on send() (rather than on segments leaving 
tcp_output()), then the send(904) would *not* be delayed by the modified Nagle 
algorithm.

Greg


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 19:00:39 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA24713
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 19:00:38 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA25518; Thu, 11 Feb 1999 17:53:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA23252; Thu, 11 Feb 1999 17:49:56 -0500 (EST)
Received: from big (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id RAA05377;
	Thu, 11 Feb 1999 17:49:53 -0500
Message-Id: <3.0.5.32.19990211174953.053cf1e0@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Thu, 11 Feb 1999 17:49:53 -0500
To: Greg Minshall <minshall@siara.com>
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
Cc: Joe Touch <touch@ISI.EDU>, tcp-impl@lerc.nasa.gov
In-Reply-To: <199902112215.OAA05567@red.mtv.siara.com>
References: <Your message of "Thu, 11 Feb 1999 16:29:19 EST."             <3.0.5.32.19990211162919.03034330@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 14:15 2/11/99 -0800, Greg Minshall wrote:

>If httpd had seen REQ 2, then i think it should have done one send which had 
>the 1.25 RES 1 and the 0.25 RES 2.
>
>If httpd had not seen REQ 2, then you are right, the 0.25 RES 2 will be 
>delayed.

Yes, it is a timing issue, you are right.

>Notice, however, that if you were using something like SO_EXPLICITPUSH, and 
>you set SO_EXPLICITPUSH for the 1.25 RES 1, and then turned around and set 
>SO_EXPLICITPUSH for the 0.25 RES 2, *my* [fictional] implementation would 
>*also* delay the 0.25 RES 2.  (This is because i would be using 
>SO_EXPLICITPUSH as a way to *avoid* sending < MSS packets, but would *still* 
>be protecting the network from excess traffic from small packets.)

Hmm, I think a simple write-through mechanism would be easier to deal with.
Isn't the point exactly that the app knows it is a small packet but it
wants it to go out anyway because it is "the Right Thing" in that
particular situation?

>Having an adaptive delayed ack scheme is an interesting idea (that others, 
>including John Nagle recently, have thought about).  I don't know if it would 
>be based on link speed, "one-way-ness" of application traffic, or what.  It is 
>independent of the Nagle modification.

...

>(**) Any idea how much HTTP is pipelined these days, or what the trend is?

In HTTP itself, there are only a few clients that do pipelining - roughly
speaking applications based on my libwww code

	http://www.w3.org/Library/

The major argument against it on client side is that it is (too) hard to
implement (one reason being that HTTP/1.1 has 5 different ways of
delimiting messages). As it is significantly simpler on server side, this
is much more commonly supported.

However, if work on WebMUX is truly starting in the IETF, a lot of the
complexity will go away as WebMux has a much better handle on the message
length. In that case, pipelining is likely to become much more common than
it is now.

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 19:12:44 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA24859
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 19:12:43 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA03437; Thu, 11 Feb 1999 18:08:21 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA02382; Thu, 11 Feb 1999 18:06:50 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id SAA08770
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:06:45 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id PAA28439 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 15:06:48 -0800 (PST)
Message-ID: <36C36287.4CE7ECDC@cup.hp.com>
Date: Thu, 11 Feb 1999 15:06:47 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902112223.OAA05610@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> In the case you are doing Nagle on send() (rather than on segments leaving
> tcp_output()), then the send(904) would *not* be delayed by the modified Nagle
> algorithm.

OK, I'm lost then :) Or perhaps I missed a clarification:

   The proposed Nagle algorithm modifies this as follows:

        If a TCP has less than a full-sized packet to transmit,
        and if any previous less than full-sized packet has not
        yet been acknowledged, do not transmit a packet.

So, the application doing internal buffering in 4096 byte chunks trying
to send 5000 bytes goes

send(4096) - this is > MSS so it all goes out right away
 xmit 1460
 xmit 1460
 xmit 1176
send(904)
 
the second send call is < MSS, and there is an unacked, less than
full-sized segment on the network, so the send(904) waits.
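
The trace above can be reproduced with a small simulation.  This is only
a sketch of the rule as quoted, applied packet by packet: the function
name is hypothetical, the MSS of 1460 is taken from the trace, and the
model ignores windows and assumes no ACK arrives between the two sends.

```python
MSS = 1460  # Ethernet MSS used in the trace above


def segments_for_send(nbytes, mss, small_unacked):
    """Apply the quoted rule to each packet cut from one send() call.

    Returns (sent_segments, held_bytes, small_unacked), where
    small_unacked says whether a less-than-full-sized packet is still
    unacknowledged after this call.
    """
    sent = []
    remaining = nbytes
    while remaining > 0:
        size = min(remaining, mss)
        if size < mss and small_unacked:
            break  # a sub-MSS packet is already in flight: hold the rest
        sent.append(size)
        if size < mss:
            small_unacked = True
        remaining -= size
    return sent, remaining, small_unacked


sent, held, small = segments_for_send(4096, MSS, small_unacked=False)
print(sent, held)  # [1460, 1460, 1176] 0
sent, held, small = segments_for_send(904, MSS, small_unacked=small)
print(sent, held)  # [] 904
```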

Is the clarification I'm missing that in essence _all_ instances of
"packet" were replaced by "send?" If a TCP is presented with less than
an MSS-sized send to transmit, and if any previous less than MSS-sized
send has not been acknowledged, do not transmit the send. Is that it?
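
If that per-send reading is the intended one, the decision might look
like the sketch below.  The function name and the "sent"/"held" labels
are hypothetical, and the model again assumes that no ACK arrives
between the sends in the list.

```python
def send_based_nagle(send_sizes, mss):
    """Decide, per send() call, whether the data goes out or is held.

    A send is held only when it is smaller than mss and an earlier
    smaller-than-mss send is still unacknowledged.
    """
    small_send_unacked = False
    decisions = []
    for size in send_sizes:
        if size < mss and small_send_unacked:
            decisions.append((size, "held"))
        else:
            decisions.append((size, "sent"))
            if size < mss:
                small_send_unacked = True
    return decisions


print(send_based_nagle([4096, 904], 1460))  # [(4096, 'sent'), (904, 'sent')]
print(send_based_nagle([904, 96], 1460))    # [(904, 'sent'), (96, 'held')]
```

Under this reading the send(4096) does not count as a small send, so
the send(904) that follows it is not delayed.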

I've been assuming something a trifle simpler (presumably) to
implement - if the send is larger than MSS send all of it now (windows
willing). If the send is less than MSS, only send it if it and unsent
data combines to form a >=MSS-sized segment.
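
A sketch of this simpler rule, under assumptions the paragraph does not
settle: the function name is hypothetical, a send of exactly MSS bytes
is treated as large, and when a small send completes a full-sized
segment everything queued is transmitted, including any sub-MSS
remainder.  What later releases held data (an ACK, a timer) is not
modelled.

```python
def per_send_rule(send_size, unsent_queued, mss):
    """Return (bytes_transmitted_now, bytes_still_queued) for one send()."""
    total = unsent_queued + send_size
    if send_size >= mss:
        # Large send: all of it goes out now (windows willing).
        return total, 0
    if total >= mss:
        # Small send that, with unsent data, fills at least one segment.
        return total, 0
    # Small send that cannot fill a segment: keep it queued.
    return 0, total


print(per_send_rule(4096, 0, 1460))   # (4096, 0)
print(per_send_rule(904, 0, 1460))    # (0, 904)
print(per_send_rule(904, 600, 1460))  # (1504, 0)
```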

Which have you been assuming?

rick

PS - since everyone in the discussion is on the mailing list, might as
well trim the reply to just tcp-impl - that will serialize the
discussion through the list server, and cut-down on the number of
messages some of us receive :)

-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 19:34:53 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA25120
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 19:34:52 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA11722; Thu, 11 Feb 1999 18:23:16 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA10731; Thu, 11 Feb 1999 18:21:50 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA21349
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 11 Feb 1999 16:21:48 -0700 (MST)
Date: Thu, 11 Feb 1999 16:21:48 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902112321.QAA21349@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Greg Minshall <minshall@siara.com>

> 	client			server
> 	HTTP REQ 1	-->	
> 			<--	1st segment of RES 1
> 	HTTP REQ 2	-->
> 			<--    last 1.25 segment of RES 1
> 			<--	0.25 segment RES 2 delayed
> ----
> If httpd had seen REQ 2, then i think it should have done one send which had 
> the 1.25 RES 1 and the 0.25 RES 2.

Who is urging that applications be smart enough to see REQ2 before
finishing sending RES1?  It's nice to gobble all currently waiting input
before sending your output, but I think it's above and beyond the call of
duty, and edges into geewhiz freaping creaturism territory.


> If httpd had not seen REQ 2, then you are right, the 0.25 RES 2 will be 
> delayed.

With all current and proposed forms of Nagle, right?


> Notice, however, that if you were using something like SO_EXPLICITPUSH, and 
> you set SO_EXPLICITPUSH for the 1.25 RES 1, and then turned around and set 
> SO_EXPLICITPUSH for the 0.25 RES 2, *my* [fictional] implementation would 
> *also* delay the 0.25 RES 2.  (This is because i would be using 
> SO_EXPLICITPUSH as a way to *avoid* sending < MSS packets, but would *still* 
> be protecting the network from excess traffic from small packets.)

That is inconsistent with my understanding of Dave Borman's SO_EXPFLUSH
bit.  (PLEASE call it something other than SO_EXPLICITPUSH, because it
has nothing to do with the other PSH thing!)   As I understood it, his
bit would flush data no matter how small.  

> Having an adaptive delayed ack scheme is an interesting idea (that others, 
> including John Nagle recently, have thought about).  I don't know if it would 
> be based on link speed, "one-way-ness" of application traffic, or what.  It is 
> independent of the Nagle modification.

Yes, and everyone who thinks about a timer immediately looks for something
else, something practical.  It's not as if we don't already know how hard
adaptive timers are, from experience with congestion stuff (even if you're
not impressed by the high costs that such timers impose on large systems).
There are unrelated, very good reasons for an adaptive delayed Ack timer
on any shared medium LAN.  When you're pushing lots of data, the reverse
stream of TCP ACK's does terrible things to performance.  For example, on
BLEB 802.3, reducing the Ack rate can be good for a 30% throughput boost.
But ...

  ___________________


] From: "Eric A. Hall" <ehall@ehsco.com>

] ...
] I'd also like to see some sort of wording that says "developers should
] only disable Nagle if they know for certain that they will only be
] generating data that is cumulatively smaller than two full-sized  
] segments, as many writes of small blocks of data will negatively impact
] the network, possibly to the point that performance degradation occurs 
] for all applications (including the disabler)" or something similar.   
] This helps to get the message out.
  
Such words might be ok in an RFC that is justified on other grounds,
although they seem to me to duplicate RFC 896.  However, I strongly oppose
a purely consciousness raising RFC.  You don't need to get the word out to
people who would read a new RFC, since they've already read RFC 896 and RFC
1122.  Publishing broadsides is satisfying to the already enlightened,
but as we know from watching real life, doesn't do much to raise the
consciousness of those who actually need it.

Has everyone who favors a consciousness raising RFC read RFC 896 and the
relevant parts of RFC 1122?  More words can be less when it comes to
raising consciousness.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 20:16:07 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA25415
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 20:16:07 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA29856; Thu, 11 Feb 1999 18:58:20 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from atlrel2.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA28887; Thu, 11 Feb 1999 18:56:14 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel2.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id SAA25445
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:56:05 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id PAA28539 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 15:56:06 -0800 (PST)
Message-ID: <36C36E16.6005C63D@cup.hp.com>
Date: Thu, 11 Feb 1999 15:56:06 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902112223.OAA05610@red.mtv.siara.com> <36C36287.4CE7ECDC@cup.hp.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

One thought that just struck me is that we've all been bantering about
with a 1460 MSS thinking primarily about 1500 byte MTU networks and how
4096 is big enough for a 1500 byte MTU network and all that.

However, once someone takes that nice Apache (etc) application up to
some (yes, possibly rare, but not unknown) larger MTU intranet (ATM,
JumboFrame Ethernet, 802.5, etc) that "4096 bytes of application
buffering is fine" thinking falls apart and we again have even sends
that are sub-MSS.
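
The arithmetic behind this can be sketched as follows.  The names are
hypothetical, the 40 bytes of overhead assume IPv4 and TCP headers with
no options, and 9000 is only an assumed jumbo-frame MTU for
illustration.

```python
HEADER_BYTES = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_for_mtu(mtu):
    """MSS implied by a link MTU under the header assumption above."""
    return mtu - HEADER_BYTES


def describe_send(send_size, mtu):
    """Return (mss, full_segments, tail_bytes) for one send of send_size."""
    mss = mss_for_mtu(mtu)
    full_segments, tail = divmod(send_size, mss)
    return mss, full_segments, tail


print(describe_send(4096, 1500))  # (1460, 2, 1176)
print(describe_send(4096, 9000))  # (8960, 0, 4096)
```

On the larger MTU the whole 4096-byte send is smaller than one segment,
which is the sub-MSS case described above.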

Which, I think, is yet another reason why if we codify anything it ought
to be the explicit flush mechanism. That is the only thing that seems to
have any chance of putting a serious dent in the use of TCP_NODELAY.

rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 20:40:38 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA25643
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 20:40:37 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA18074; Thu, 11 Feb 1999 19:33:17 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA15918; Thu, 11 Feb 1999 19:29:47 -0500 (EST)
Received: from danmark (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id TAA12212;
	Thu, 11 Feb 1999 19:29:38 -0500
Message-Id: <3.0.5.32.19990211192939.00b6f1d0@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Thu, 11 Feb 1999 19:29:39 -0500
To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
In-Reply-To: <199902112321.QAA21349@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 16:21 2/11/99 -0700, Vernon Schryver wrote:

>Who is urging that applications be smart enough to see REQ2 before
>finishing sending RES1?  It's nice to gobble all currently waiting input
>before sending your output, but I think it's above and beyond the call of
>duty, and edges into geewhiz freaping creaturism territory.

Welcome to HTTP/1.1 pipelining [1] :)

It's in Apache, Jigsaw, and possibly many other HTTP/1.1 servers.

Henrik

[1] http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 21:22:00 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA25898
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 21:21:59 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA08555; Thu, 11 Feb 1999 20:13:20 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA05913; Thu, 11 Feb 1999 20:08:33 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id SAA23695
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 11 Feb 1999 18:08:32 -0700 (MST)
Date: Thu, 11 Feb 1999 18:08:32 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902120108.SAA23695@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Henrik Frystyk Nielsen <frystyk@w3.org>

> >Who is urging that applications be smart enough to see REQ2 before
> >finishing sending RES1?  It's nice to gobble all currently waiting input
> >before sending your output, but I think it's above and beyond the call of
> >duty, and edges into geewhiz freaping creaturism territory.
>
> Welcome to HTTP/1.1 pipelining [1] :)
>
> It's in Apache, Jigsaw, and possibly many other HTTP/1.1 servers.

Didn't I also see a note from you saying the only existing HTTP/1.1 code
that does it is yours?   If so, that sounds like a nicer way to say about
the same thing as I tried to.


I'm not enthused by the creeping featurism aspects of the explicit flush
bit.  It all seems too much like the Winsock 2.0 mess.  (Don't take my
word; go get the spec at http://www.stardust.com/wsresource/wsresrce.html)
But an explicit flush bit certainly sounds more useful, more flexible,
easier to understand, possible to implement, possible to test for standards
compliance, and remotely possible to explain to the many people who code
send-send-read applications.
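
For readers unfamiliar with the send-send-read pattern, here is a minimal
sketch of the alternative: join the pieces of one logical message in user
space and hand them to the stack in a single call.  The helper name and the
toy header format are hypothetical, and socket.socketpair() only stands in
for a real TCP connection.

```python
import socket


def send_coalesced(sock, *parts):
    """Send all parts of one logical message with a single send call."""
    sock.sendall(b"".join(parts))


# Local socket pair used purely for demonstration (an assumption; a real
# client would be talking over a connected TCP socket).
a, b = socket.socketpair()
send_coalesced(a, b"HEADER:5\n", b"hello")
a.close()

# Read until the peer's close is seen.
received = b""
while True:
    chunk = b.recv(4096)
    if not chunk:
        break
    received += chunk
b.close()
```

The point is only that header and body leave the application together, so
the stack never sees a small first write followed by a small second one.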

If I had a nickel for every time I've tried to explain why it's not just
obvious common sense to reduce the number of sends per transaction ...
"Set the SO_SEGFLUSH bit" sounds more likely to be understood than
"code it correctly."  It does less harm to the net than "turn off Nagle."


Again, what are the concrete example applications on real life systems
where the proposed modification to Nagle is better?  I've just re-read
the draft, and don't see the advantages of the proposal, given that most
systems (that care about performance) are doing as RFC 896 and 1122 say
and operating on application send requests instead of on packets, and
since most transaction applications either send a lot more than 2920 bytes
or less than 1460.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 22:50:25 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA28287
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 22:50:24 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA22379; Thu, 11 Feb 1999 21:38:19 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA20975; Thu, 11 Feb 1999 21:35:38 -0500 (EST)
Received: from Eng.Sun.COM (engmail4 [129.144.134.6]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA14129 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:35:37 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id SAA06102
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:35:37 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id SAA18176
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:35:33 -0800 (PST)
Date: Thu, 11 Feb 1999 18:35:33 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902112321.QAA21349@calcite.rhyolite.com>
Message-ID: <Roam.SIMCSD.2.0.4.918786933.17714.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Has everyone who favors a consciousness raising RFC read RFC 896 and the
> relevant parts of RFC 1122?  More words can be less when it comes to
> raising consciousness.

Yes, I have read the 2 RFCs.  The reason I suggested a draft describing the
preferred way to write network applications is that many application
programmers do not really understand how TCP works.  Reading those RFCs may
just confuse them more, as those RFCs assume an understanding of how TCP
works.  And those RFCs do not provide a programming guideline.  By guideline
I mean actual example code.  To me, a document with example code and an
intuitive explanation of why the code is better in simple networking terms
is more suitable for most programmers.  I don't know if there is such a
document freely available on the Web.  If you know, please post a reference.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 22:55:20 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA28303
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 22:55:19 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA15000; Thu, 11 Feb 1999 21:23:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA12835; Thu, 11 Feb 1999 21:19:33 -0500 (EST)
Received: from Eng.Sun.COM (engmail3 [129.144.170.5]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA11430 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:19:26 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id SAA04254
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:19:22 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id SAA18158
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 18:19:20 -0800 (PST)
Date: Thu, 11 Feb 1999 18:19:20 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
Message-ID: <Roam.SIMCSD.2.0.4.918785960.15719.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

While reading all those examples of how the Nagle modification will fail and
how an EPS (explicit push send (-:) will help, I'm wondering if we should also
look at how the current delayed ack is implemented.  Most stacks delay the ack
for every other full-MSS segment.  Some delay the ack for every other segment
regardless of size.  And we know how it helps, as explained in section 4.2.3.2
of RFC 1122.

Just for an exercise, suppose TCP does not delay an ack for a segment smaller
than MSS size if the segment received prior to that is a full MSS size
segment.  It still uses the current delay ack strategy for full MSS size
segments.  Will this solve the problem we are discussing when combined with
the Nagle modification on a per-send basis?

For telnet like traffic, TCP delays ack the same way as before.  For
request/response traffic, it will not delay ack for the last small segment. 
This does create one more ack as the server should respond immediately after
getting the complete request.  For the HTTP example Henrik sent out, it will
not delay ack for the .25 segment of RES 1.  So the other .25 segment of RES 2
can be sent out after getting the ack.  In this case, it does not create an
extra ack.  Can anyone think of an example when it will still delay ack at the
wrong time?  Or an example when it does not delay ack when it should?

This should be easy to implement, much easier than an adaptive delay ack
algorithm.  Will Vernon consider this as a bloat to OS?  Are those extra
acks in some situations acceptable to people on this list?  Note that this
is just an exercise on thinking what delay ack strategy should be.
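
The exercise above can be sketched as a receiver-side decision function.
This is only one reading of the rule, under stated assumptions: the function
and parameter names are hypothetical, "delay" means leaving the ack to the
usual delayed-ack timer, and full-sized segments follow the conventional
ack-every-second-segment behaviour.

```python
def should_ack_now(seg_len, prev_seg_len, mss, pending_full):
    """Decide whether to ACK immediately on receiving a segment.

    seg_len      -- payload length of the segment just received
    prev_seg_len -- payload length of the previous segment, or None
    mss          -- what the receiver believes a full-sized segment is
    pending_full -- full-sized segments received since the last ACK,
                    not counting this one
    """
    if seg_len < mss:
        # Small segment: ack at once only if it directly follows a
        # full-sized one (the tail of a larger send).  Otherwise, as with
        # telnet-like traffic, leave it to the delayed-ack timer.
        return prev_seg_len is not None and prev_seg_len >= mss
    # Full-sized segment: conventional delayed ack, every second segment.
    return pending_full + 1 >= 2


# Tail of a response: a 365-byte segment right after a full 1460-byte one.
print(should_ack_now(365, 1460, 1460, 1))
# Telnet-like traffic: one-byte segments following one-byte segments.
print(should_ack_now(1, 1, 1460, 0))
```

Note that the function depends on the receiver knowing what "full-sized"
means for the sender, which this sketch simply takes as a parameter.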

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 11 23:43:51 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA28865
	for <tcpimpl-archive@lists.ietf.org>; Thu, 11 Feb 1999 23:43:51 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA19397; Thu, 11 Feb 1999 22:33:17 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA16904; Thu, 11 Feb 1999 22:28:31 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id TAA10838
	for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 19:28:37 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id TAA28808 for <tcp-impl@lerc.nasa.gov>; Thu, 11 Feb 1999 19:28:28 -0800 (PST)
Message-ID: <36C39FDB.9EECFAA3@cup.hp.com>
Date: Thu, 11 Feb 1999 19:28:27 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <Roam.SIMCSD.2.0.4.918785960.15719.kcpoon@jurassic>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Just for an exercise, suppose TCP does not delay an ack for a segment smaller
> than MSS size if the segment received prior to that is a full MSS size
> segment.  It still uses the current delay ack strategy for full MSS size
> segments.  Will this solve the problem we are discussing when combined with
> the Nagle modification on a per-send basis?

Would a PMTU smaller than the MSS interfere with the receiver's ability
to tell a full-sized segment from a small one? Especially in a
unidirectional case, or an asymmetric route case.

> This should be easy to implement, much easier than an adaptive delay ack
> algorithm.  Will Vernon consider this as a bloat to OS?  Are those extra
> acks in some situations acceptable to people on this list?  Note that this
> is just an exercise on thinking what delay ack strategy should be.

Well, it is my belief that "today" an ACK costs just as many CPU cycles
as a data segment.

rick 
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 02:12:32 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id CAA06262
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 02:12:32 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id AAA21409; Fri, 12 Feb 1999 00:38:17 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id AAA20961; Fri, 12 Feb 1999 00:37:13 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id WAA28399
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 11 Feb 1999 22:37:12 -0700 (MST)
Date: Thu, 11 Feb 1999 22:37:12 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902120537.WAA28399@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>

> Just for an exercise, suppose TCP does not delay an ack for a segment smaller
> than MSS size if the segment received prior to that is a full MSS size
> segment.  It still uses the current delay ack strategy for full MSS size
> segments.  Will this solve the problem we are discussing when combined with
> the Nagle modification on a per-send basis?

Which Nagle modification is that?  The algorithm as classically described
and widely implemented that mostly worries about application send requests
instead of packets?  Or the new proposal with its intentional (and I think
undesirable) ambiguity between application requests and packets?


> For telnet like traffic, TCP delays ack the same way as before.  For
> request/response traffic, it will not delay ack for the last small segment. 
> ...

Assuming there is a problem here that needs to be solved, and that the
problem cannot be easily and cleanly solved with an explicit flush bit,
then I think that's a good area to investigate.  The old delayed Ack is
a badly aimed hammer that has mashed a lot of thumbs over the years.
(E.g. the 10X performance hit taken by some commercial UNIX systems
talking to some version of Sun's system.)


> This should be easy to implement, much easier than an adaptive delay ack
> algorithm.  Will Vernon consider this as a bloat to OS?  Are those extra
> acks in some situations acceptable to people on this list?  Note that this
> is just an exercise on thinking what delay ack strategy should be.

Maybe.

First we need some evidence of a problem that needs to be and can be
solved, other than the familiar committee need to Publish Something. 
I suspect there is a real problem, but so far, there is no clear evidence.
As far as I can recall, all of the proposed examples of problems that
could only be solved with the various proposals have failed.
It's hard to invent a solution to an unknown problem.

The HTTP/1.1 problem is interesting, but I suspect it has already been
solved with NODELAY, smart use of select(), and so forth.  A handicap for
any new idea is that it will not be widely implemented for a long time.
Good programmers must and will continue to use the old solutions.  If
you're writing portable code, if on some platforms you must use NODELAY,
and if NODELAY always works as well as the new scheme, do you bloat your
code with #ifdef's or runtime bytes to use the new scheme?  Or do you just
use NODELAY and forget the new mechanism?


} From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>

} ...
} Yes, I have read the 2 RFCs.  The reason I suggested a draft describing the
} preferred way to write network applications is that many application
} programmers do not really understand how TCP works.  Reading those RFCs may
} just confuse them more, as those RFCs assume an understanding of how TCP
} works.  And those RFCs do not provide a programming guideline.  By guideline
} I mean actual example code.  To me, a document with example code and an
} intuitive explanation of why the code is better in simple networking terms
} is more suitable for most programmers.  I don't know if there is such a
} document freely available on the Web.  If you know, please post a reference.

A document that would teach good programming sounds like a good thing.
Such documents are often called textbooks.  Some good textbooks contain
lots of example code.  However, as you basically say about the old RFCs,
the IETF is not in the introductory** textbook business.  The IETF is a
standards committee, not Addison-Wesley, John Wiley, or even The Microsoft
Press.  Writing, printing, and distributing good textbooks is hard work.
Textbook authors and publishers should be properly paid.  Regardless of
anyone's intentions, I doubt that the profits for an RFC can buy more than
the usual trade rag consultant blabber and inter-ad filler, which is
usually wrong, and always much less reliable than the ads.

**The technical issue is basic, no matter how many lazy people get it
wrong.  As I said earlier about why fwrite()/stdio was invented--everyone
competent knows that extra I/O requests are bad, whether on wires or to
disks.  Would you feel good about hiring a programmer whose code does
unnecessary multiple database accesses?--of course not, if you've ever heard
about locking, commits, concurrency, disk latency, and so forth.
The small TCP write problem is very much like writing single disk sectors
instead of whole tracks or cylinders.


] From: Rick Jones <raj@cup.hp.com>

] ...
] Would a PMTU smaller than the MSS interfere with the receiver's ability
] to tell a full-sized segment from a small one? Especially in a
] unidirectional case, or an asymmetric route case.

It might be good enough to consider anything bigger than 512 bytes full
size, and any segment smaller than its immediate predecessor.  Not that
I'm advocating anything--just speculating.

] ...
] Well, it is my belief that "today" an ACK costs just as many CPU cycles
] as a data segment.

As Rick knows, I mostly agree with that.
But the current issue is not CPU cycles, but 100 ms request-response latency.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 09:04:28 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id JAA09157
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 09:04:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id GAA25012; Fri, 12 Feb 1999 06:48:21 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id GAA24487; Fri, 12 Feb 1999 06:47:54 -0500 (EST)
Received: from danmark (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id GAA14210;
	Fri, 12 Feb 1999 06:47:51 -0500
Message-Id: <3.0.5.32.19990212064751.03d63100@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Fri, 12 Feb 1999 06:47:51 -0500
To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
In-Reply-To: <199902120108.SAA23695@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 18:08 2/11/99 -0700, Vernon Schryver wrote:
>> From: Henrik Frystyk Nielsen <frystyk@w3.org>
>> It's in Apache, Jigsaw, and possibly many other HTTP/1.1 servers.
>
>Didn't I also see a note from you saying the only existing HTTP/1.1 code
>that does it is yours?   If so, that sounds like a nicer way to say about
>the same thing as I tried to.

Nope, I said there are two cases: client side and server side. What you
explicitly referred to is server side pipelining which is trivially
implemented.

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 21:44:42 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA19397
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 21:44:41 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA19715; Fri, 12 Feb 1999 20:38:23 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA18195; Fri, 12 Feb 1999 20:35:20 -0500 (EST)
Received: from Eng.Sun.COM (engmail3 [129.144.170.5]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id RAA29511 for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:35:19 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id RAA08863
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:35:16 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id RAA18628
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:35:18 -0800 (PST)
Date: Fri, 12 Feb 1999 17:35:17 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902120537.WAA28399@calcite.rhyolite.com>
Message-ID: <Roam.SIMCSD.2.0.4.918869717.10149.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>> segments.  Will this solve the problem we are discussing when combined with
>> the Nagle modification on a per-send basis?
>
> Which Nagle modification is that?  The algorithm as classically described

I guess the phrase "on a per-send basis" is not clear enough...

> Assuming there is a problem here that needs to be solved, and that the
> problem cannot be easily and cleanly solved with an explicit flush bit,
...

Let's see if we can agree on what problem the draft tries to solve.
Greg and others can comment on this if I am mistaken.  I observe the following
points so far.

1. There is a bad interaction between some applications, e.g. transaction
type applications, and the Nagle algorithm.

2. We have observed that many network applications set the TCP_NODELAY flag by
default, thus disabling the Nagle algorithm, even though it may not be
necessary.  Many application programmers are not aware of the implications of
setting the TCP_NODELAY flag and do not understand the Nagle algorithm.  But
by setting the TCP_NODELAY flag, they see their applications run better.

3. In many circumstances, the Nagle algorithm helps reduce unnecessary network
traffic.

4. Many TCP stacks already have some form of modified Nagle algorithm.  The
interesting thing is that most people do not know about that.  And since
there is no "standard" modified Nagle, application programmers cannot rely
on this fact if they want their applications to run well on multiple
platforms.  Thus having a modified Nagle algorithm does not help reduce
the number of applications setting TCP_NODELAY.
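
For reference, the switch that point 2 refers to is a one-line socket
option.  A minimal sketch, assuming a platform whose socket module exposes
TCP_NODELAY; the helper names are hypothetical.

```python
import socket


def set_nagle(sock, enabled):
    """Enable or disable the Nagle algorithm on a TCP socket."""
    # TCP_NODELAY = 1 disables Nagle; TCP_NODELAY = 0 leaves it on.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                    0 if enabled else 1)


def nagle_enabled(sock):
    """Report whether Nagle is currently on for this socket."""
    # Some platforms return a non-zero value other than 1 when the option
    # is set, so compare against zero rather than against one.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_state = nagle_enabled(s)   # state on a freshly created socket
set_nagle(s, False)                # what many applications do by default
after_disable = nagle_enabled(s)
s.close()
```

Because it is this easy to set, the option tends to be turned on as a
reflex rather than after looking at how the application issues its sends.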

Because of 3, Greg and many others thought that we should find a way to 
remedy 2.  And many such applications are like 1.  So Greg wrote the draft
to solve 1 hoping 2 will go away.  Note that the draft does not intend to
solve all bad interactions in 1.  But by getting rid of the most common set
of bad interactions, 2 may go away.

Do we agree on the above points?  If we do, the question becomes whether
modifying the Nagle algorithm helps.  And if we set a "standard" modified
Nagle, can application programmers rely on it and no longer need to set the
TCP_NODELAY flag even if their applications run on multiple platforms, thus
eliminating 4?  Does the proposed algorithm really get rid of the most common
set of bad interactions?  Or do we only need to have an EPS mechanism, as
described in the draft?

My observation is that Vernon, and maybe others, think that there is no
problem at all.  It is just a problem of incompetent programmers and the IETF
should not do anything about it.  Rick and others think that we should
document the modified Nagle many stacks already have, namely doing the check
on a per-send basis.  This helps 4.  And to further reduce the problem set
of bad interactions, we should investigate EPS.  Have I misunderstood the
flow of discussions so far?

> A document that would teach good programming sounds like a good thing.
> Such documents are often called textbooks.  Some good textbooks contain
> lots of example code.  However, as you basically say about the old RFCs,
> the IETF is not in the introductory** textbook business.  The IETF is a
> standards committee, not Addison-Wesley, John Wiley, or even The Microsoft
  ^^^^^^^^^^^^^^^^^^^

You have been involved with the IETF much longer than I have.  I guess I have
misunderstood the intention of FYI notes and various other informational
documents the IETF has published...  Is the IETF really only a "standards
committee?"  Maybe you can tell me what those documents are for.  In your
opinion, do many of them also fall into the category of textbooks, so that
the IETF should not have published them in the first place?  I guess probably.

My point is that it seems that so far no textbook has been clear on what the
Nagle algorithm is and what its implications are.  Maybe textbook authors do
not care, or people designing protocols do not care to tell them about it,
or ...  IMHO, a brief document from TCP implementors urging people not to
disable the algorithm by default and describing the right way to write network
programs using TCP is another way to tackle the problem the draft tries to
solve, but without modifying the Nagle algorithm.  And it will carry more
credibility than any textbook.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 21:44:56 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA19408
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 21:44:55 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA14447; Fri, 12 Feb 1999 20:28:24 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA12373; Fri, 12 Feb 1999 20:24:29 -0500 (EST)
Received: from Eng.Sun.COM (engmail2 [129.146.1.25]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id RAA28174 for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:24:28 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id RAA29425
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:24:27 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id RAA18598
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:24:22 -0800 (PST)
Date: Fri, 12 Feb 1999 17:24:22 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36C36E16.6005C63D@cup.hp.com>
Message-ID: <Roam.SIMCSD.2.0.4.918869062.17619.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> However, once someone takes that nice Apache (etc) application up to
> some (yes, possibly rare, but not unknown) larger MTU intranet (ATM,
> JumboFrame Ethernet, 802.5, etc) that "4096 bytes of application
> buffering is fine" thinking falls apart and we again have even sends
> that are sub-MSS.

What should the size of a small segment be in the Nagle algorithm?  If the MSS
can be as large as, say, 4056 bytes, then should a 1500-byte segment be
considered small?  I would say no.  While we are talking about modifications
to the Nagle algorithm, maybe we should also think about what a small segment
really means.  Defining small as less than 1 SMSS bytes may not be good any
more with today's technology.  BTW, although you think it is a rathole, I guess
doing a getsockopt(TCP_MAXSEG) is useful, especially in this case.
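
A minimal sketch of the getsockopt(TCP_MAXSEG) idea, assuming a platform
whose socket module exposes TCP_MAXSEG.  The helper names are hypothetical,
and the value read before connect() is only the stack's default, not a
negotiated MSS.

```python
import socket


def current_mss(sock):
    """Return the MSS the stack currently reports for this TCP socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def is_sub_mss(write_len, sock):
    """True if a write of write_len bytes cannot fill one full segment."""
    return write_len < current_mss(sock)


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mss = current_mss(s)   # default value; query again after connect()
s.close()
```

An application could use such a check to size its buffering to the
connection instead of hard-coding a figure like 4096.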

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 21:47:50 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA19434
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 21:47:49 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA15764; Fri, 12 Feb 1999 20:30:38 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA12910; Fri, 12 Feb 1999 20:25:48 -0500 (EST)
Received: from Eng.Sun.COM (engmail2 [129.146.1.25]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id RAA28296 for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:25:47 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (SMI-8.6/SMI-5.3) with ESMTP id RAA29604
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:25:45 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id RAA18615
	for <tcp-impl@lerc.nasa.gov>; Fri, 12 Feb 1999 17:25:46 -0800 (PST)
Date: Fri, 12 Feb 1999 17:25:46 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <36C39FDB.9EECFAA3@cup.hp.com>
Message-ID: <Roam.SIMCSD.2.0.4.918869146.24550.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Would a PMTU smaller than the MSS interfere with the receiver's ability
> to tell a full-sized segment from a small one? Especially in a
> unidirectional case, or an asymmetric route case.

Yup, this can be a problem.  This is the delayed ack problem described in 
draft-ietf-tcpimpl-cong-control-03.txt, SMSS <> RMSS.  There are some
solutions proposed, though they are not "standardised."  But I guess this
can be worked around.  And there are stacks which already implement solutions
to handle the delayed ack problem.

> Well, it is my belief that "today" an ACK costs just as many CPU cycles
> as a data segment.

Sometimes the cost can be justified...  I think it is OK in this case to
avoid the extra delay.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 22:03:48 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA19536
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 22:03:48 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA03085; Fri, 12 Feb 1999 21:03:23 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA02049; Fri, 12 Feb 1999 21:01:31 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 13 Feb 1999 02:25:16 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10BUOE-001xhTC; Fri, 12 Feb 1999 18:01:02 -0800 (PST)
Received: from ip201.san-francisco41.ca.pub-ip.psi.net ([38.28.91.201]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 13 Feb 1999 02:25:09 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id SAA03427; Fri, 12 Feb 1999 18:01:42 -0800 (PST)
Message-Id: <199902130201.SAA03427@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Vernon Schryver <vjs@calcite.rhyolite.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 11 Feb 1999 14:59:32 MST."
             <199902112159.OAA19953@calcite.rhyolite.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 12 Feb 1999 18:01:42 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon (and others, i am sure),

> As the discussion has gone on, there seems to me less justification
> for changing the Nagle algorithm.  Could someone review the plausible
> scenarios where the change helps?  (I.e. not just switching the
> interpretation of "send" from what it seems no one used to something
> closer to what everyone is doing.) 

The scenario is that TCP has been willing to send lots of data because (either 
at tcp_output() or at send()) it was presented with "lots" of data to send (>= 
MSS), and so has sent lots of data.  Now, it is presented (either at 
tcp_output() or at send()) with a "little bit" of data.  In this case, TCP 
will wait until all the previous data has been acknowledged.

(The change would allow TCP to transmit this "little bit" of data, but not 
transmit any subsequent "little bit" of data until *this* "little bit" of data 
has been acknowledged.)
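
The difference between the two rules described above can be sketched as
follows.  This is only an illustration under simplifying assumptions (no
window limits, no timers); the names SenderState, classic_nagle_allows
and modified_nagle_allows are hypothetical and not taken from any stack.

```python
from dataclasses import dataclass

@dataclass
class SenderState:
    mss: int
    unacked_bytes: int = 0       # any data sent but not yet acknowledged
    small_unacked: bool = False  # a sub-MSS segment is still unacknowledged

def classic_nagle_allows(state, length):
    """Classic rule: a small segment waits until ALL earlier data is acked."""
    if length >= state.mss:
        return True
    return state.unacked_bytes == 0

def modified_nagle_allows(state, length):
    """Proposed rule: a small segment waits only for an earlier SMALL one."""
    if length >= state.mss:
        return True
    return not state.small_unacked

if __name__ == "__main__":
    # Two full segments are in flight; now a 100-byte "little bit" arrives.
    st = SenderState(mss=1460, unacked_bytes=2920)
    print(classic_nagle_allows(st, 100))   # held back
    print(modified_nagle_allows(st, 100))  # allowed out
```

Once that small segment has been sent, small_unacked would be set, and a
second "little bit" would be held under either rule.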

Greg (who believes many things, such as that a mode of program-controlled 
flushing would probably be a good API addition, and that some heuristics to 
delayed ACKs might be a good thing [but make sure you understand about Silly 
Window Syndrome avoidance before messing too much with that! -- see RFC813, 
one of the Dave Clark 5], etc., but *still* feels this change will help things)


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 12 22:09:18 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA19552
	for <tcpimpl-archive@lists.ietf.org>; Fri, 12 Feb 1999 22:09:18 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA05592; Fri, 12 Feb 1999 21:08:24 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA03628; Fri, 12 Feb 1999 21:05:01 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 13 Feb 1999 02:28:46 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10BU8U-001xhUC; Fri, 12 Feb 1999 17:44:46 -0800 (PST)
Received: from ip196.san-francisco41.ca.pub-ip.psi.net ([38.28.91.196]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 13 Feb 1999 02:08:52 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id RAA03387; Fri, 12 Feb 1999 17:45:17 -0800 (PST)
Message-Id: <199902130145.RAA03387@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Rick Jones <raj@cup.hp.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 11 Feb 1999 15:06:47 PST."
             <36C36287.4CE7ECDC@cup.hp.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 12 Feb 1999 17:45:17 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Rick,

Sorry.  You are right.

> Is the clarification I'm missing that in essence _all_ instances of
> "packet" were replaced by "send?" If a TCP is presented with less than
> an MSS-sized send to transmit, and if any previous less than MSS-sized
> send has not been acknowledged, do not transmit the send. Is that it?

This *is* what i am assuming.

Also, to deal with something else you said in another e-mail:

> One thought that just struck me is that we've all been bantering about
> with a 1460 MSS thinking primarily about 1500 byte MTU networks and how
> 4096 is big enough for a 1500 byte MTU network and all that.

While i don't think it is reasonable to assume that an application tracks the 
actual MSS (given path MTU, etc.), i think it *is* reasonable that an 
application know *something* (such as the largest MTU on a connected, 
non-loopback interface).

Greg


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 13 00:20:09 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id AAA22357
	for <tcpimpl-archive@lists.ietf.org>; Sat, 13 Feb 1999 00:20:08 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id XAA11219; Fri, 12 Feb 1999 23:13:22 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id XAA10297; Fri, 12 Feb 1999 23:12:05 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id VAA22104
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 12 Feb 1999 21:12:03 -0700 (MST)
Date: Fri, 12 Feb 1999 21:12:03 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902130412.VAA22104@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Greg Minshall <minshall@siara.com>
 
> > for changing the Nagle algorithm.  Could someone review the plausible
> > scenarios where the change helps?  (I.e. not just switching the
> > interpretation of "send" from what it seems no one used to something
> > closer to what everyone is doing.)
>
>The scenario is that TCP has been willing to send lots of data because (either
>at tcp_output() or at send()) it was presented with "lots" of data to send (>=
> MSS), and so has sent lots of data.  Now, it is presented (either at
> tcp_output() or at send()) with a "little bit" of data.  In this case, TCP
> will wait until all the previous data has been acknowledged.
>
> (The change would allow TCP to transmit this "little bit" of data (but not
>transmit any subsequent "little bit" of data until *this* "little bit" of data 
> has been acknowledged.)

That's toward answering my question, but doesn't quite hit it.
First, could we forget the "at tcp_output()" clause since it seems everyone
agrees that the original RFC 896 text, the RFC 1122 text, and common
implementations want the delaying to happen only at send() and many
implementations come pretty close to achieving that?  To the extent that
some do not achieve that, it might be worthwhile to write something,
probably in the 'common bugs' RFC.
  
Second, could we please have something to make the bad scenario compelling?
For me, that means either a convincing argument that significantly many
well written applications (no write-write junk) suffer the problem, or a
pointer to a real application that exemplifies the class of suffering
applications.  These must be applications that now suffer unnecessary
latency on real world TCP stacks on private nets or the Internet.
(Statements like "this code would have problems if it did something other
than what it does" are inadmissible.)

Statements like "my intuition says there must be lots of code that writes
1700 bytes with MSS=1460 on a system that worries about tcp_output() but
not send()" are not convincing.  My intuition says that most applications
use stacks that watch send() (or won't be fixed so this is all moot), or
either write less than an MSS or so much more than an MSS that they hit
windows, or would have to use TCP_NODELAY regardless of Nagle
modifications.
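
To make the 1700-byte, MSS=1460 case concrete, here is a minimal sketch
of the two places the small-segment test can be applied.  It is a toy
model that ignores send and congestion windows, and segments_released is
a hypothetical helper, not code from any stack.

```python
def segments_released(write_len, mss, data_in_flight, check_per_send):
    """Return sizes of the segments sent immediately for one write.

    check_per_send=True  : test the size of the whole send() call.
    check_per_send=False : test each segment as it is built at tcp_output().
    """
    full, tail = divmod(write_len, mss)
    sizes = [mss] * full
    if tail:
        if check_per_send:
            # Only a write smaller than one MSS counts as "small".
            hold_tail = write_len < mss and data_in_flight
        else:
            # The tail is a small segment, and any full segments just
            # sent from this same write are now unacknowledged.
            hold_tail = data_in_flight or full > 0
        if not hold_tail:
            sizes.append(tail)
    return sizes

if __name__ == "__main__":
    print(segments_released(1700, 1460, False, check_per_send=True))
    print(segments_released(1700, 1460, False, check_per_send=False))
```

Under these assumptions the per-send check releases both the 1460-byte
and the 240-byte segments, while the per-segment check holds the 240-byte
tail until the first segment is acknowledged.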

Good design changes depend on a kind of Occam's Razor that says that in the
absence of objective evidence, the 'gut feeling' that supports the
status quo is always right.
 ________________________________
 
] From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>

] ...
] 1. There is a bad interaction between some applications, e.g. transaction
] type applications, and Nagle algorithm.

We have only hypothetical applications that suffer bad interactions.

] ...
] 4. Many TCP stacks already have some form of modified Nagle algorithm.  The
] interesting thing is that most people do not know about that.  And since
] there is no "standard" modified Nagle, application programmers cannot rely
] on this fact if they want their applications to run well on multiple
] platforms.  Thus having a modified Nagle algorithm does not help reduce
] the number of applications setting TCP_NODELAY.

As far as I can tell, much of the so called "modification" consists of
doing what RFC 896 actually says instead of what some people thought it
said, namely delaying by send() instead of tcp_output().  It's not clear
"do what RFC 896 says instead of what we thought it says" needs to be
published, except perhaps in the 'common bugs' RFC.

] Because of 3, Greg and many others thought that we should find a way to 
] remedy 2.  And many such applications are like 1. ...

The assertion that there are many applications that would benefit has been
repeated many times.  Maybe I've been made cynical by politicians, but each
repetition of an unsupported claim convinces me more of the opposite.
Again, please name or compellingly describe an application that is well
written (e.g. uses writev() or WSASend()), suffers undesirable latencies,
and would be helped by a change to the Nagle algorithm, as the Nagle
algorithm is defined in RFC 896 and widely implemented.
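
For readers unfamiliar with the gathered-write style referred to above,
a minimal sketch follows.  It uses Python's socket.sendmsg(), which plays
the role of writev() on Unix-like platforms; send_gathered is a
hypothetical helper name.

```python
import socket

def send_gathered(sock, parts):
    """Hand several buffers (e.g. header and body) to the stack in one call,
    instead of one small write per buffer.  Returns the total byte count."""
    total = sum(len(p) for p in parts)
    sent = sock.sendmsg(parts)        # one system call for all buffers
    if sent < total:
        # Partial send: push the remainder so no data is lost.
        sock.sendall(b"".join(parts)[sent:])
    return total

if __name__ == "__main__":
    a, b = socket.socketpair()        # local pair, used only for the demo
    n = send_gathered(a, [b"HEADER:", b"payload"])
    print(n, b.recv(64))
    a.close()
    b.close()
```

The point is only the call pattern: the stack sees one send of the
combined size rather than two tiny writes back to back.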


] ...
] My observation is that Vernon, and maybe others, think that there is no
] problem at all.  It is just a problem of incompetent programmers and IETF
] should not do anything about it.  Rick and others think that we should
] document the modified Nagle many stacks also have, namely doing the check
] on a per-send basis.  This helps 4.  And to further reduce the problem set
] of bad interactions, we should investigate EPS.  Have I misunderstood the
] flow of discussions so far?

That overstates my view--I could be easily convinced the problem exists
by other than repetitions of unsupported claims.  The availability of
NODELAY may make finding an exemplar difficult.  However, I'd be glad to
hear of a real application that could stop using NODELAY thanks to a
modified Nagle.
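
For reference, this is roughly what "using NODELAY" looks like from the
application side.  A minimal sketch; set_nodelay is a hypothetical
helper, and the read-back is there only to show the option took effect.

```python
import socket

def set_nodelay(sock, enabled=True):
    """Turn the Nagle algorithm off (enabled=True) or back on for sock,
    and return the setting the stack reports afterwards."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("Nagle disabled:", set_nodelay(s, True))
```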


] ...
] You have been involved with IETF much longer than I do.  I guess I have
] misunderstood the intention of FYI notes and various other informational
] documents IETF has published...  Is IETF really only a "standards committee?"
] Maybe you can tell me what those documents are for.  In your opinion, do
] many of them also fall into the category of textbooks and IETF should not
] have published them in the first place?  I guess probably.

Yes, the IETF is merely a standards committee.  Standards committees have
many people who feel more publish-or-perish pressure than the canonical
academic.  The FYI's and "various other informational documents" (presumably
the Informational RFC's, since the IETF only publishes RFCs, some of which
also get the FYI label) are generally not the most admired, admirable,
and/or influential documents from the IETF.  Some are of only trade rag
quality, between just plain wrong and painfully naive.

] My point is that it seems that so far no textbook has been clear on what Nagle
] algorithm is and its implication.  Maybe textbook authors do not care, or
] people designing protocol do not care to tell them about it, or ...  IMHO, a
] brief document from TCP implementors urging people not to disable the
] algorithm by default  and describing the right way to write network programs
] using TCP is another way to tackle the problem the draft tries to solve, but
] without modifying Nagle algorithm.  And it will carry more credits than from
] any textbook.

To paraphrase, you want to use the bully pulpit of the IETF to preach
about something we agree is bad.  You figure that an RFC would influence
more people, be more authoritative, and more effectively fix the problem
than a textbook.

I've had this argument many times over the years with people who want to
write mini-textbook RFC's.  Since I first took my position, I've seen too
many poor informational RFC's published.   I've also seen (and written)
countless words about subjects like the Nagle problem (including the Nagle
problem) in forums more widely read than the RFC archives (e.g.
mailing lists and netnews).  The problems go on.  If RFC 896 didn't fix
it, then your RFC 3999 won't either.  The best and most effective thing
you might do to fix the Nagle problem is to help/convince the big TCP
textbook authors to produce revised editions.  (By "Nagle problem", I mean
junk code that does unnecessary tiny_write-tiny_write or necessary
write-write but does not know about Nagle.)  I offer RFC 896 and those
#$%*@! textbooks and trade rag idiots that say that Ethernet is limited
to 37% as proofs.

I've seen a very little, merely second hand, of what it costs and profits
an author to write a real textbook.  That has convinced me the real thing
is too much like hard labor for my tastes, and made me a little cynical about
the common ambition to spend a few hours and whip out a mini-textbook RFC.
   ________________


] From: Greg Minshall <minshall@siara.com>

] ...
] > One thought that just struck me is that we've all been bantering about
] > with a 1460 MSS thinking primarily about 1500 byte MTU networks and how
] > 4096 is big enough for a 1500 byte MTU network and all that.
]
] While i don't think it is reasonable to assume that an application tracks the 
] actual MSS (given path MTU, etc.), i think it *is* reasonable that an 
] application know *something* (such as the largest MTU on a connected, 
] non-loopback interface).

I think that's asking too much of applications, what with multihoming,
dynamic routing, the difficulties in finding the MTU's of all currently
active interfaces on some platforms, and interfaces that come and go
(e.g. SLIP and PPP).


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 13 14:31:32 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA02822
	for <tcpimpl-archive@lists.ietf.org>; Sat, 13 Feb 1999 14:31:31 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA06841; Sat, 13 Feb 1999 13:03:32 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from jupiter.nal.utoronto.ca (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA02072; Sat, 13 Feb 1999 12:53:45 -0500 (EST)
Received: from nal.utoronto.ca by jupiter.nal.utoronto.ca (SMI-8.6/SMI-SVR4)
	id MAA06291; Sat, 13 Feb 1999 12:53:17 -0500
Message-ID: <36C5BD85.B63933B3@nal.utoronto.ca>
Date: Sat, 13 Feb 1999 12:59:33 -0500
From: Raouf Boutaba <rboutaba@jupiter.nal.utoronto.ca>
Organization: University of Toronto
X-Mailer: Mozilla 4.02 [en] (Win95; I)
MIME-Version: 1.0
To: webrepl@cs.utk.edu
Subject: MMM'99 Call for Paper
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

***[My apologies if you receive this more than once]***
=======================================================
               MMM'99
The International Conference on
        Multimedia Modeling
http://ocri.genie.uottawa.ca/mmm99

==========================================
        4-6 October 1999, Ottawa, Canada
===========================================
        Call for Papers and Participation
============================================
Scope
======
The Multimedia & Agents Research Laboratory at
the School of Information Technology & Engineering
University of Ottawa, Canada, the National Research
Council of Canada, and Graphics Society are proud to announce
MMM'99 Conference. This conference will provide a unique
opportunity for researchers, software and application developers,
and computer graphics technologists to discuss new developments
on multimedia modeling technologies and applications.
As with last year's conference in Lausanne, Switzerland,
MMM'99 will focus on modeling issues across the areas of
interactive and intelligent presentation, media production,
processing and visualization, techniques for media abstractions
and intelligent multimedia information retrieval.

Call for Papers, Tutorials, Panels and Exhibits
===============================================

Contributions describing original research, surveys and
applications in the following areas are solicited:

- Formal Support for MM Modeling
- Topological and Geometric Modeling
- Integration of MM Information
- Interactive MM
- MM Operating system
- MM Database Modeling
- Hypermedia
- Model-based Video/Vision/Graphics
- Media Abstractions
- Web Mining Agents and Agent Modeling
- Broadband Home Services
- Representation of MM Information
- Indexing & Retrieval of MM Information
- Integration of Graphics & Vision
- Synchronization of MM Information
- MM and Virtual Reality
- Networked MM
- Speech and Music Modeling
- Education and Applications of MM
Authors are invited to submit papers (6,000 words)
on completed research or work in progress to the MMM'99
conference Chair (Send 5 copies of the full manuscript):


Prof. Ahmed Karmouch
MMM'99 Conference Chair
School of Information Technology & Engineering
University of Ottawa
161, Louis Pasteur, K1N 6N5
Ottawa, Ontario, Canada
Tel. (613) 562-5800 x6203
Fax. (613) 562-5175 email: Karmouch @site.uottawa.ca

Important Dates
===================================================

Full Paper due                  15 April 1999
Proposals for Tutorials         15 April 1999
Notice of Acceptance            1 July 1999
Camera-ready paper due          1 August 1999

=====================================================
Electronic submission by Email <mmm99@ocri.genie.uottawa.ca> is preferred.
Alternatively, you may use our ftp server <ocri.genie.uottawa.ca>
(user: ftp, psw: email), under </pub/mmm99/incoming>.
(Please send a note by email specifying name, type, and contents of the file).
All paper submissions should have a cover page containing the title, names,
email address and complete postal addresses (including telephone and fax
numbers) for all authors. Please indicate the main author for the purpose of
correspondence.
The cover page should also provide an abstract (150 words maximum),
and a list of keywords. Please include a statement stating that "when accepted,
one of the authors will attend the conference to present the paper".

Tutorials, panels and prototype demonstrations are also invited.
 More information on MMM'99 is available at
http://ocri.genie.uottawa.ca/mmm99

Conference Committee Members

Conference Chair:               Karmouch, Ahmed (University of Ottawa)
Technical Program Chair:        Yeap, Tet  (University of Ottawa)
Tutorial and Exhibit Chair:     Impey, Roger  (National Research Council)
Conference Management:          Mahoney, Kahty (Ottawa Centre for Research
                                and Innovation)
                                Dinsdale, Dianne (Communications &
                                Information Technology of Ontario)

---------------------------------------------------------------------------


From owner-tcp-impl@lerc.nasa.gov  Sun Feb 14 00:26:46 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id AAA07423
	for <tcpimpl-archive@lists.ietf.org>; Sun, 14 Feb 1999 00:26:45 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA24210; Sat, 13 Feb 1999 22:58:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA22056; Sat, 13 Feb 1999 22:53:31 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 187;
          Sat, 13 Feb 1999 19:53:27 -0800
Message-ID: <36C648B7.36598766@ehsco.com>
Date: Sat, 13 Feb 1999 19:53:27 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
CC: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <Roam.SIMCSD.2.0.4.918869062.17619.kcpoon@jurassic>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> What should be the size of small segment in Nagle algorithm?  If MSS
> can be as large as, say 4056 bytes, then should a 1500 bytes segment
> be considered as small?  I will say no.

Nagle should stay as it is, at least wrt the definition of "small
segments." Think of an org using Token Ring everywhere. They benefit
from having large frames everywhere mostly by being able to fill them,
and here you are wanting to take that away from them. Tsk tsk.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Mon Feb 15 12:24:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA25553
	for <tcpimpl-archive@lists.ietf.org>; Mon, 15 Feb 1999 12:24:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id KAA21831; Mon, 15 Feb 1999 10:33:31 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ietf.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id KAA18892; Mon, 15 Feb 1999 10:28:23 -0500 (EST)
Received: from CNRI.Reston.VA.US (localhost [127.0.0.1])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id KAA19239;
	Mon, 15 Feb 1999 10:28:19 -0500 (EST)
Message-Id: <199902151528.KAA19239@ietf.org>
Mime-Version: 1.0
Content-Type: Multipart/Mixed; Boundary="NextPart"
To: IETF-Announce:;;@ns.cnri.reston.va.us
Cc: tcp-impl@lerc.nasa.gov
From: Internet-Drafts@ietf.org
Reply-to: Internet-Drafts@ietf.org
Subject: I-D ACTION:draft-ietf-tcpimpl-cong-control-04.txt
Date: Mon, 15 Feb 1999 10:28:19 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

--NextPart

Note: This revision reflects comments received during the last call period.

A New Internet-Draft is available from the on-line Internet-Drafts directories.
This draft is a work item of the TCP Implementation Working Group of the IETF.

	Title		: TCP Congestion Control
	Author(s)	: M. Allman, V. Paxson, W. Stevens
	Filename	: draft-ietf-tcpimpl-cong-control-04.txt
	Pages		: 11
	Date		: 12-Feb-99
	
    This document defines TCP's four intertwined congestion control
    algorithms: slow start, congestion avoidance, fast retransmit, and
    fast recovery.  In addition, the document specifies how TCP should
    begin transmission after a relatively long idle period, as well as
    discussing various acknowledgment generation methods.

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-ietf-tcpimpl-cong-control-04.txt

Internet-Drafts are also available by anonymous FTP. Login with the username
"anonymous" and a password of your e-mail address. After logging in,
type "cd internet-drafts" and then
	"get draft-ietf-tcpimpl-cong-control-04.txt".

A list of Internet-Drafts directories can be found in
http://www.ietf.org/shadow.html 
or ftp://ftp.ietf.org/ietf/1shadow-sites.txt


Internet-Drafts can also be obtained by e-mail.

Send a message to:
	mailserv@ietf.org.
In the body type:
	"FILE /internet-drafts/draft-ietf-tcpimpl-cong-control-04.txt".
	
NOTE:	The mail server at ietf.org can return the document in
	MIME-encoded form by using the "mpack" utility.  To use this
	feature, insert the command "ENCODING mime" before the "FILE"
	command.  To decode the response(s), you will need "munpack" or
	a MIME-compliant mail reader.  Different MIME-compliant mail readers
	exhibit different behavior, especially when dealing with
	"multipart" MIME messages (i.e. documents which have been split
	up into multiple messages), so check your local documentation on
	how to manipulate these messages.
		
		
Below is the data which will enable a MIME compliant mail reader
implementation to automatically retrieve the ASCII version of the
Internet-Draft.

--NextPart
Content-Type: Multipart/Alternative; Boundary="OtherAccess"

--OtherAccess
Content-Type: Message/External-body;
	access-type="mail-server";
	server="mailserv@ietf.org"

Content-Type: text/plain
Content-ID:	<19990212084058.I-D@ietf.org>

ENCODING mime
FILE /internet-drafts/draft-ietf-tcpimpl-cong-control-04.txt

--OtherAccess
Content-Type: Message/External-body;
	name="draft-ietf-tcpimpl-cong-control-04.txt";
	site="ftp.ietf.org";
	access-type="anon-ftp";
	directory="internet-drafts"

Content-Type: text/plain
Content-ID:	<19990212084058.I-D@ietf.org>

--OtherAccess--

--NextPart--


From owner-tcp-impl@lerc.nasa.gov  Mon Feb 15 16:44:45 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id QAA09657
	for <tcpimpl-archive@lists.ietf.org>; Mon, 15 Feb 1999 16:44:45 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA27158; Mon, 15 Feb 1999 15:23:33 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA26088; Mon, 15 Feb 1999 15:21:49 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id OAA16630
	for tcp-impl@lerc.nasa.gov; Mon, 15 Feb 1999 14:21:43 -0600 (CST)
Date: Mon, 15 Feb 1999 14:21:43 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902152021.OAA16630@frantic.bsdi.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

I just remembered one very important item.  The whole issue of
delayed acks vs Nagle, and applying Nagle to the whole send()
doesn't help at all if you have an initial congestion window of
only 1*MSS, and the request is >1*MSS.  Case in point, an initial
2K request sends 1440 bytes and defers the rest not due to Nagle,
but due to the congestion window.  The other side only gets 1 packet,
so it delays the ack.  And you have the same problem.

The fix is to have at least a 2*MSS initial window, so you need
to implement RFC 2414 "Increasing TCP's Initial Window".
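
The arithmetic behind the 2K example above can be sketched as follows.
This is a toy model that looks only at the initial congestion window
(no receiver window, no Nagle); first_flight is a hypothetical helper.

```python
def first_flight(request_len, mss, initial_window_segments):
    """Return the sizes of the segments that fit in the initial congestion
    window when request_len bytes are handed to TCP on a new connection."""
    cwnd_bytes = initial_window_segments * mss
    to_send = min(request_len, cwnd_bytes)
    full, tail = divmod(to_send, mss)
    return [mss] * full + ([tail] if tail else [])

if __name__ == "__main__":
    # A 2K request with the 1440-byte segments of the example above.
    print(first_flight(2048, 1440, 1))  # one segment; the rest waits for an ACK
    print(first_flight(2048, 1440, 2))  # the whole request goes out at once
```

With a one-segment window the receiver sees a single packet and may
delay its ACK; with a two-segment window the whole request is delivered
without waiting for that ACK.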

		-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 16 06:26:03 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id GAA25059
	for <tcpimpl-archive@lists.ietf.org>; Tue, 16 Feb 1999 06:26:02 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id EAA26474; Tue, 16 Feb 1999 04:38:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from daffy.ee.lbl.gov (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id EAA24565; Tue, 16 Feb 1999 04:34:57 -0500 (EST)
Received: (from vern@localhost)
	by daffy.ee.lbl.gov (8.9.2/8.9.2) id BAA02406;
	Tue, 16 Feb 1999 01:34:52 -0800 (PST)
Message-Id: <199902160934.BAA02406@daffy.ee.lbl.gov>
To: David Borman <dab@BSDI.COM>
Cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
In-reply-to: Your message of Mon, 15 Feb 1999 14:21:43 CST.
Date: Tue, 16 Feb 1999 01:34:52 PST
From: Vern Paxson <vern@ee.lbl.gov>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> The fix is to have at least a 2*MSS initial window, so you need
> to implement RFC 2414 "Increasing TCP's Initial Window".

Or 2001.bis, which should pop out as an RFC quite soon - it also allows an
initial window of two segments.

		Vern


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 16 12:55:10 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA05291
	for <tcpimpl-archive@lists.ietf.org>; Tue, 16 Feb 1999 12:55:10 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id KAA10259; Tue, 16 Feb 1999 10:49:16 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ietf.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id KAA07273; Tue, 16 Feb 1999 10:45:42 -0500 (EST)
Received: from CNRI.Reston.VA.US (localhost [127.0.0.1])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id KAA28279;
	Tue, 16 Feb 1999 10:06:41 -0500 (EST)
Message-Id: <199902161506.KAA28279@ietf.org>
Mime-Version: 1.0
Content-Type: Multipart/Mixed; Boundary="NextPart"
To: IETF-Announce:;;@ns.cnri.reston.va.us
Cc: tcp-impl@lerc.nasa.gov
From: Internet-Drafts@ietf.org
Reply-to: Internet-Drafts@ietf.org
Subject: I-D ACTION:draft-ietf-tcpimpl-newreno-02.txt
Date: Tue, 16 Feb 1999 10:06:40 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

--NextPart

A New Internet-Draft is available from the on-line Internet-Drafts
directories.  This draft is a work item of the TCP Implementation
Working Group of the IETF.

	Title           : The NewReno Modification to TCP's Fast
			  Recovery Algorithm
	Author(s)	: S. Floyd, T. Henderson
	Filename	: draft-ietf-tcpimpl-newreno-02.txt
	Pages		: 11
	Date		: 15-Feb-99
	
RFC 2001 [RFC2001] documents the following four intertwined TCP
congestion control algorithms: Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery.  RFC 2001-bis [RFC2001-bis] explicitly
allows certain modifications of these algorithms, including
modifications that use the TCP Selective Acknowledgement (SACK) option
[MMFR96], and modifications that respond to "partial acknowledgments"
(ACKs which cover new data, but not all the data outstanding when loss
was detected) in the absence of SACK.  This document describes a
specific algorithm for responding to partial acknowledgments, referred
to as NewReno.  This response to partial acknowledgments was first
proposed by Janey Hoe in [Hoe95].

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-ietf-tcpimpl-newreno-02.txt

Internet-Drafts are also available by anonymous FTP. Login with the username
"anonymous" and a password of your e-mail address. After logging in,
type "cd internet-drafts" and then
	"get draft-ietf-tcpimpl-newreno-02.txt".

A list of Internet-Drafts directories can be found in
http://www.ietf.org/shadow.html 
or ftp://ftp.ietf.org/ietf/1shadow-sites.txt


Internet-Drafts can also be obtained by e-mail.

Send a message to:
	mailserv@ietf.org.
In the body type:
	"FILE /internet-drafts/draft-ietf-tcpimpl-newreno-02.txt".
	
NOTE:	The mail server at ietf.org can return the document in
	MIME-encoded form by using the "mpack" utility.  To use this
	feature, insert the command "ENCODING mime" before the "FILE"
	command.  To decode the response(s), you will need "munpack" or
	a MIME-compliant mail reader.  Different MIME-compliant mail readers
	exhibit different behavior, especially when dealing with
	"multipart" MIME messages (i.e. documents which have been split
	up into multiple messages), so check your local documentation on
	how to manipulate these messages.
		
		
Below is the data which will enable a MIME compliant mail reader
implementation to automatically retrieve the ASCII version of the
Internet-Draft.

--NextPart
Content-Type: Multipart/Alternative; Boundary="OtherAccess"

--OtherAccess
Content-Type: Message/External-body;
	access-type="mail-server";
	server="mailserv@ietf.org"

Content-Type: text/plain
Content-ID:	<19990216090351.I-D@ietf.org>

ENCODING mime
FILE /internet-drafts/draft-ietf-tcpimpl-newreno-02.txt

--OtherAccess
Content-Type: Message/External-body;
	name="draft-ietf-tcpimpl-newreno-02.txt";
	site="ftp.ietf.org";
	access-type="anon-ftp";
	directory="internet-drafts"

Content-Type: text/plain
Content-ID:	<19990216090351.I-D@ietf.org>

--OtherAccess--

--NextPart--


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 16 23:27:53 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA20654
	for <tcpimpl-archive@lists.ietf.org>; Tue, 16 Feb 1999 23:27:53 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA06815; Tue, 16 Feb 1999 21:36:14 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA05035; Tue, 16 Feb 1999 21:33:44 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 17 Feb 1999 02:57:40 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10Cwnf-001xhWC; Tue, 16 Feb 1999 18:33:19 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id SAA05086; Tue, 16 Feb 1999 18:00:03 -0800 (PST)
Message-Id: <199902170200.SAA05086@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Vernon Schryver <vjs@calcite.rhyolite.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 12 Feb 1999 21:12:03 MST."
             <199902130412.VAA22104@calcite.rhyolite.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 16 Feb 1999 18:00:02 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon,

> Second, could we please have something to make the bad scenario
> compelling? For me, that means either a convincing argument that
> significantly many well written applications (no write-write junk)
> suffer the problem, or a pointer to a real application that
> exemplifies the class of suffering applications.  These must be
> applications that now suffer unnecessary latency on real world TCP
> stacks on private nets or the Internet. 

I'm not sure we're converging, but let's try this:  for request-response 
protocols with variable-sized responses, with the current Nagle and delayed 
ACKs, there is no way to write the server application such that some 
percentage of the responses don't get delayed.  The modified Nagle allows an 
application writer to write their application to avoid that particular delay.
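
To make the scenario concrete, here is a minimal Python sketch of the two send rules. Everything in it is an assumption made for illustration: the MSS value, the function names, and the bookkeeping. The "modified" rule follows the usual summary of the draft (hold a short segment only while an earlier short segment is still unacknowledged). Windows, timers, and ACK arrival are ignored entirely.

```python
MSS = 1460  # assumed maximum segment size, in bytes


def classic_nagle_may_send(seg_len, unacked_bytes):
    """Classic rule: a short segment waits while any data is unacknowledged."""
    return seg_len >= MSS or unacked_bytes == 0


def modified_nagle_may_send(seg_len, unacked_small_segments):
    """Modified rule: a short segment waits only behind an unacked short segment."""
    return seg_len >= MSS or unacked_small_segments == 0


def segments(response_len):
    """Split one application write into full-sized segments plus a short tail."""
    full, tail = divmod(response_len, MSS)
    return [MSS] * full + ([tail] if tail else [])


def held_back(response_len, rule):
    """Return the segment sizes that `rule` refuses to send before any ACK arrives."""
    unacked_bytes = 0
    unacked_small = 0
    held = []
    for seg in segments(response_len):
        if rule == "classic":
            ok = classic_nagle_may_send(seg, unacked_bytes)
        else:
            ok = modified_nagle_may_send(seg, unacked_small)
        if ok:
            unacked_bytes += seg
            if seg < MSS:
                unacked_small += 1
        else:
            held.append(seg)
    return held
```

With a response of 1.5 * MSS, `held_back(2190, "classic")` holds the 730-byte tail until the full-sized segment is ACKed (which a delayed-ACK receiver may not do promptly), while `held_back(2190, "modified")` holds nothing.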

Here's something i mentioned to Rick Jones (and the list); maybe this 
helps:

> Check out, for example, John Heidemann's article in the April 1997
> ACM SIGCOMM Computer Communication Review, or Henrik Nielsen, et al.,
> from SIGCOMM 1997, for examples of fairly well-crafted applications
> that still run afoul of Nagle/delayed ACK interactions.

Greg


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 16 23:42:58 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA21152
	for <tcpimpl-archive@lists.ietf.org>; Tue, 16 Feb 1999 23:42:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA07200; Tue, 16 Feb 1999 22:21:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA03701; Tue, 16 Feb 1999 22:16:18 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id UAA16861
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Tue, 16 Feb 1999 20:16:17 -0700 (MST)
Date: Tue, 16 Feb 1999 20:16:17 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902170316.UAA16861@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Greg Minshall <minshall@siara.com>

> ...
> > suffer the problem, or a pointer to a real application that
> > exemplifies the class of suffering applications.  These must be
> > applications that now suffer unnecessary latency on real world TCP
> > stacks on private nets or the Internet. 
>
> I'm not sure we're converging, but let's try this:  for request-response 
> protocols with variable-sized responses, with the current Nagle and delayed 
> ACKs, there is no way to write the server application such that some 
> percentage of the responses don't get delayed.  The modified Nagle allows an 
> application writer to write their application to avoid that particular delay.

That's still only a sketch of a hypothetical application instead of a
concrete pointer.  Do you have a name of a product? 

My intuition claims there are essentially no real applications that fit
that sketch and do not also need to turn off the Nagle algorithm for other
reasons.  The most common reason is the incompetent use of write-write
instead of writev().  Another common but respectable reason is occasional,
unidirectional traffic asynchronous with the main requests and responses.
If your application has extra, one-sided operations, such as updating
unimportant status, then you must turn off Nagle, or the real requests will
stall when they come.
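
For readers who have not met the write-write versus writev() distinction, here is a small sketch of the gathered-write form, in Python for brevity. The function names and the sample header are made up for illustration. The demo runs over a Unix-domain socket pair, so Nagle itself is not exercised; it only shows the call shape that hands header and body to the kernel together instead of as two separate writes.

```python
import os
import socket


def send_response(sock, header, body):
    """Hand header and body to the kernel in one gathered write (writev)."""
    return os.writev(sock.fileno(), [header, body])


def demo():
    """Send a tiny response over a local socket pair and read it back."""
    a, b = socket.socketpair()
    try:
        header = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\n"
        body = b"hello"
        sent = send_response(a, header, body)
        data = b""
        while len(data) < sent:
            chunk = b.recv(4096)
            if not chunk:  # peer closed early
                break
            data += chunk
        return sent, data
    finally:
        a.close()
        b.close()
```

On a TCP socket the same call gives the stack one contiguous run of data to segment, rather than a short header write followed by a second write that Nagle may hold back.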

It has been said that the proposal would wean users of write-write instead
of writev() from turning off Nagle.  I disagree.  They'll still reflexively
turn off Nagle instead of using writev(), and won't notice the modified
Nagle algorithm.

The modification is not free.  Its modest cost <<MUST>> have concrete
beneficiaries!

I think you've convinced me the current Nagle algorithm is good enough
for fixing telnet, that it does not matter to FTP, NFS, and other bulk
applications, and it is and always will be turned off for HTTP and
everything else for various good and bad reasons.


> Here's something i mentioned to Rick Jones (and the list); maybe this 
> helps:
>
> > Check out, for example, John Heidemann's article in the April 1997
> > ACM SIGCOMM Computer Communication Review, or Henrik Nielsen, et al.,
> > from SIGCOMM 1997, for examples of fairly well-crafted applications
> > that still run afoul of Nagle/delayed ACK interactions.

Do you have an on-line reference?   A year or two ago, after 15 going on
30 years of increasing irritation, the ACM's insistence on charging me
for the nearly content-free trade rag that CACM had become exhausted my
patience and extravagance.  You can get ads separated by bogus statistics
proving the wonders of the latest panacea for the Information Technology
Crisis du jour for free, and with a lot less stilted pedantry.  And you
don't get invoices with hefty "contributions" conveniently included.

I do recall seeing something somewhere about HTTP 1.0 and maybe 1.1 versus
the Nagle algorithm.  As I vaguely recall it, even the proposed modified
Nagle algorithm would still have to be turned off.  That's not the subject
of those articles, is it?


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 02:19:59 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id CAA28505
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 02:19:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id AAA20302; Wed, 17 Feb 1999 00:56:16 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id AAA19705; Wed, 17 Feb 1999 00:55:36 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 188;
          Tue, 16 Feb 1999 21:55:33 -0800
Message-ID: <36CA59D1.A56D07A3@ehsco.com>
Date: Tue, 16 Feb 1999 21:55:29 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Vernon Schryver <vjs@calcite.rhyolite.com>
CC: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902170316.UAA16861@calcite.rhyolite.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> My intuition claims there are essentially no real applications that
> fit that sketch

On a lark I did a search for +"nagle algorithm" on AltaVista, and almost
all references to it are things that explain what it is and then
admonish the programmer against using it. I'm starting to wonder too if
anybody actually disables it for non-interactive, non-realtime stuff.

Apple's OT devsite: http://developer.apple.com/qa/nw/nw26.html
Winsock FAQ:
http://telin.rug.ac.be/~sid/Netwerken/Winsock/advanced.html#q15
WinSock API:
http://www.medusa.uni-bremen.de/intern/knowhow/winsock/winsock4.htm

> I do recall seeing something somewhere about HTTP 1.0 and maybe 1.1
> versus the Nagle algorithm.  As I vaguely recall it, even the
> proposed modified Nagle algorithm would still have to be turned off.

http://www.isi.edu/lsam/publications/phttp_tcp_interactions/

In this particular test, disabling Nagle helped a lot since the data
being returned from the HTTP server was sent as two writes (one for the
MIME header and another for the data).

http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html also
discusses the issue some, suggesting that Nagle had no effect on some
pipelined transactions, while it did have an effect in others.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 09:41:13 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id JAA03299
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 09:41:13 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id HAA13063; Wed, 17 Feb 1999 07:41:15 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from pc-jcs.coded.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id HAA09159; Wed, 17 Feb 1999 07:36:27 -0500 (EST)
From: jsnader@ix.netcom.com
Received: (from jcs@localhost)
	by pc-jcs.coded.com (8.8.5/8.8.5) id HAA13153;
	Wed, 17 Feb 1999 07:39:01 -0500 (EST)
Message-ID: <19990217073901.24228@ix.netcom.com>
Date: Wed, 17 Feb 1999 07:39:01 -0500
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199902170316.UAA16861@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.74e
In-Reply-To: <199902170316.UAA16861@calcite.rhyolite.com>; from Vernon Schryver on Tue, Feb 16, 1999 at 08:16:17PM -0700
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Tue, Feb 16, 1999 at 08:16:17PM -0700, Vernon Schryver wrote:
> >
> > > Check out, for example, John Heidemann's article in the April 1997
> > > ACM SIGCOMM Computer Communication Review, or Henrik Nielsen, et al.,
> > > from SIGCOMM 1997, for examples of fairly well-crafted applications
> > > that still run afoul of Nagle/delayed ACK interactions.
> 
> Do you have an on-line reference?

Both papers are available at:

http://www.acm.org/sigcomm/ccr/archive/ccr-toc/ccr-toc-97.html

Jon Snader


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 11:38:43 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id LAA06673
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 11:38:41 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id JAA26415; Wed, 17 Feb 1999 09:51:18 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id JAA23760; Wed, 17 Feb 1999 09:48:30 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id HAA03396
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 17 Feb 1999 07:48:28 -0700 (MST)
Date: Wed, 17 Feb 1999 07:48:28 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902171448.HAA03396@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: "Eric A. Hall" <ehall@ehsco.com>

> ...
> On a lark I did a search for +"nagle algorithm" on AltaVista, and almost
> all references to it are things that explain what it is and then
> admonish the programmer against using it. I'm starting to wonder too if
> anybody actually disables it for non-interactive, non-realtime stuff.

Why would anyone care, since you're talking about "non-interactive,
non-realtime stuff", and the worst effect of Nagle on bulk transfers is
an extra 0.1 second delay in sending the FIN?


> Apple's OT devsite: http://developer.apple.com/qa/nw/nw26.html
> Winsock FAQ:
> http://telin.rug.ac.be/~sid/Netwerken/Winsock/advanced.html#q15

That talks about an especially weak reason to turn off the Nagle
algorithm, to try to retain packet boundaries on the wire, presumably
so that application receive requests get the same block boundaries
as sent by the sender.


> WinSock API:
> http://www.medusa.uni-bremen.de/intern/knowhow/winsock/winsock4.htm

That's a nice ASCII version of the Winsock 1 spec.  Thanks.
I see it mentions RCVBUF and SNDBUF.


> ...
> http://www.isi.edu/lsam/publications/phttp_tcp_interactions/

> In this particular test, disabling Nagle helped a lot since the data
> being returned from the HTTP server was sent as two writes (one for the
> MIME header and another for the data).

The following sentence in that paper is interesting:

|   Apache supports keep-alive connections, an early implementation
|   of P-HTTP.  When handling a keep-alive connection, Apache sends
|   its headers as a separate segment. (It does so to work around
|   a bug in a popular browser.)

That seems to me to imply that of the two reasons to turn off the Nagle
algorithm in Apache, broken writes like those and what they call the
"Odd/Short-Final-Segment Problem", the first forces them to turn off Nagle
regardless.  That means that while the second would be helped by the
proposed modification to Nagle, it is moot.  In other words, it's
consistent with my claim that the proposed modification to Nagle won't
gain anything in the real world.

(I am puzzled by that work-around for the browser bug.  Apache cannot
guarantee that its segments won't be combined by the sending kernel.
Oh well, maybe it works often enough to be worthwhile.)


> http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html also
> discusses the issue some, suggesting that Nagle had no effect on some
> pipelined transactions, while it did have an effect in others.

They're talking about turning off Nagle regardless because sometimes doing
so helps a lot.  That again makes the proposed modification moot.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 18:18:57 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13103
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 18:18:57 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA13965; Wed, 17 Feb 1999 15:36:25 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA10629; Wed, 17 Feb 1999 15:32:19 -0500 (EST)
Received: from big (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id PAA17432;
	Wed, 17 Feb 1999 15:31:56 -0500
Message-Id: <3.0.5.32.19990217153156.02eb5c30@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Wed, 17 Feb 1999 15:31:56 -0500
To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
In-Reply-To: <199902171448.HAA03396@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 07:48 2/17/99 -0700, Vernon Schryver wrote:

>> http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html also
>> discusses the issue some, suggesting that Nagle had no effect on some
>> pipelined transactions, while it did have an effect in others.
>
>They're talking about turning off Nagle regardless because sometimes doing
>so helps a lot.  That again makes the proposed modification moot.

That's because we can't control it the way we would like and don't want to
suffer from the potential penalty. Give me the on/off flag and I am happy
to not say that Nagle should be turned off by default.

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 18:19:03 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13114
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 18:19:03 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA04515; Wed, 17 Feb 1999 17:12:48 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from databus.databus.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA01101; Wed, 17 Feb 1999 17:08:27 -0500 (EST)
From: Barney Wolff <barney@databus.com>
To: tcp-impl@lerc.nasa.gov
Date: Wed, 17 Feb 1999 17:03 EST
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Content-Type: text/plain
Message-ID: <36cb3dd80.1db0@databus.databus.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At the risk of looking silly, don't I remember that setting NODELAY
has to be done before the socket is connected?  That means that a
server has to make its choice before it knows who's asking and what
the request is, unless it can listen on different ports.  To me,
that's the advantage of the flush bit.  Same problem with setting
SND/RCVBUF.
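
Whether the option can be set late is easy to probe on a given stack. The sketch below, in Python for brevity, sets TCP_NODELAY on an already-accepted loopback connection and reads the option back. The function name is made up for illustration, and a True result says something only about the platform it runs on, not about every TCP implementation.

```python
import socket


def nodelay_after_accept():
    """Set TCP_NODELAY on an accepted loopback connection; report whether it stuck."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = None
    conn = None
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()
        # The connection is fully established here; try to turn Nagle off now.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return conn.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    except OSError:
        return False
    finally:
        for s in (conn, client, listener):
            if s is not None:
                s.close()
```

The same probe does not settle the SND/RCVBUF question, since buffer sizes can interact with options negotiated when the connection is set up.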

Barney Wolff  <barney@databus.com>

> Date: Wed, 17 Feb 1999 14:03:41 -0700 (MST)
> From: Vernon Schryver <vjs@calcite.rhyolite.com>
> 
> Second, for an application I'm hacking, I've been trying to see how I
> would use a flag or the proposed change or something else.  As far as I
> can tell, nothing does better than simply turning off Nagle, and I bet
> the same applies to HTTP.  It's not that you could not use a flush bit,
> but that the packet traces and everything else would be essentially the
> same as with Nagle off.  So why bother writing code that only works on
> those platforms with the flush bit implemented?  Why not forget the
> flush bit even when available and just turn off Nagle?


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 18:26:58 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13186
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 18:26:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id PAA16605; Wed, 17 Feb 1999 15:40:08 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from jupiter.nal.utoronto.ca (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA07786; Wed, 17 Feb 1999 15:29:33 -0500 (EST)
Received: from nal.utoronto.ca by jupiter.nal.utoronto.ca (SMI-8.6/SMI-SVR4)
	id PAA05793; Wed, 17 Feb 1999 15:29:09 -0500
Message-ID: <36CB282D.6BAD9528@nal.utoronto.ca>
Date: Wed, 17 Feb 1999 15:35:58 -0500
From: Raouf Boutaba <rboutaba@jupiter.nal.utoronto.ca>
Organization: University of Toronto
X-Mailer: Mozilla 4.02 [en] (Win95; I)
MIME-Version: 1.0
To: theme.src@lip6.fr
Subject: MATA'99 1st Int Workshop on Mobile Agents for Telecom...
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

[My apologies if you receive this more than once]
======================================================
                         MATA'99

First International Workshop on Mobile Agents for
       Telecommunication Applications
        http://ocri.genie.uottawa.ca/mata99

======================================================
                6-8 October 1999, Ottawa, Canada
======================================================
                Call for Papers and Participation
======================================================
 Scope
=========

Mobile agents refer to self-contained and identifiable
computer programs that can move within the network
and can act on behalf of the user or another entity.
Most of the current research work on the mobile agent
paradigm has two general goals: reduction of network
traffic and asynchronous interaction. These two goals stem
directly from the desire to reduce information overload
and to efficiently use network resources. There are
certainly many motivations for the use of the mobile agent
paradigm; however, intelligent information retrieval,
network and mobility management, and network services
are currently the three most cited application targets
for a mobile agent system. The aim of the workshop
is to provide a unique opportunity for researchers,
software and application developers, and computer network
technologists to discuss new developments on the mobile
agent technology and applications.   The workshop will
focus on mobile agent issues across the areas of network
management, mobile applications, Nomadic computing,
feature interactions, Internet applications, QoS management,
policy-based management, interactive multimedia,
Tele-learning applications, and Computer Telephony Integration.

Call for Papers, Tutorials, Panels and Exhibits
====================================================
Contributions describing original research, surveys
and applications in the following areas are solicited:

- Mobile Agent Architecture and Models
- Agent Identification, Tracking and Persistence
- Agent-based Mobility Management in Mobile Networks
- Web Agent Systems
- Agent Integration with CORBA and TINA
- Active Networks and Mobile Agents
- Feature Interaction and Agents
- Mobile Agents Communication Language
- Security in Mobile Agent Systems
- Interactive Multimedia Presentation Agents
- Agent-based Electronic Commerce
- Agent-based Access to Legacy Services
- Managing QoS with agents
- Information Discovery and Gathering using agents
- Data Mining Agents
- Network Management Agents
- Policy-based Management using Mobile Agents
- Education and Applications of Mobile Agents
- Prototypes and Experience with Mobile Agents
- Seamless Messaging and Mobile Agents

Authors are invited to submit papers (6,000 words)
on completed research or work in progress to
the MATA'99 Workshop Chair
(Send 5 copies of the full manuscript):

Prof. Ahmed Karmouch
MATA'99 Workshop Chair
School of Information Technology & Engineering
University of Ottawa
161, Louis Pasteur
Ottawa, Ontario, Canada, K1N 6N5
Tel. (613) 562-5800 x6203
Fax. (613) 562-5175
Email: Karmouch@site.uottawa.ca

Important Dates

Full Paper due                   30 April 1999
Proposals for Tutorials          30 May 1999
Notice of Acceptance              1 July 1999
Camera-ready paper due            1 August 1999


Electronic submission by email <mata99@ocri.genie.uottawa.ca> is
preferred.  Alternatively, you may use our ftp server
<ocri.genie.uottawa.ca> (user: ftp, psw: email), under
</pub/mata99/incoming>.  (Please send a note by email specifying name,
type, and contents of the file.)

All paper submissions should have a cover page containing the title,
names, email address and complete postal addresses (including telephone
and fax numbers) for all authors.  Please indicate the main author for
the purpose of correspondence.  The cover page should also provide an
abstract (150 words maximum), and a list of keywords.  Please include a
statement stating that "when accepted, one of the authors will attend
the Workshop to present the paper".

Tutorials, panels and prototype demonstrations are also invited.  More
information on MATA'99 is available at
http://ocri.genie.uottawa.ca/mata99

Workshop Committee Members

Workshop Chair:              Karmouch, Ahmed (University of Ottawa, Canada)
Technical Program Co-Chairs: Impey, Roger (National Research Council, Canada)
                             Horlait, Eric (LIP6, CNRS, France)
Tutorial & Exhibit Chair:    Liscano, Ramiro (National Research Council, Canada)
Workshop Manager:            Mahoney, Kathey (Ottawa Center for Research & Innovation)
================================================================


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 18:42:09 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13450
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 18:42:08 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA09848; Wed, 17 Feb 1999 16:06:20 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA07542; Wed, 17 Feb 1999 16:03:43 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id OAA10699
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 17 Feb 1999 14:03:41 -0700 (MST)
Date: Wed, 17 Feb 1999 14:03:41 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902172103.OAA10699@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Henrik Frystyk Nielsen <frystyk@w3.org>

> >> http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html also
> >> discusses the issue some, suggesting that Nagle made no effect on some
> >> pipelined transactions, while it did have an effect in others.
> >
> >They're talking about turning off Nagle regardless because sometimes doing
> >so helps a lot.  That again makes the proposed modification moot.
>
> That's because we can't control it the way we would like and don't want to
> suffer from the potential penalty. Give me the on/off flag and I am happy
> to not say that Nagle should be turned off by default.

(I assume you mean an explicit flush bit, and not the proposed modification
to the Nagle algorithm itself.)

Are you sure about that?  That you would use it?

First is the problem that the flag would not be available on most
platforms for at least a few years.

Second, for an application I'm hacking, I've been trying to see how I
would use a flag or the proposed change or something else.  As far as I
can tell, nothing does better than simply turning off Nagle, and I bet
the same applies to HTTP.  It's not that you could not use a flush bit,
but that the packet traces and everything else would be essentially the
same as with Nagle off.  So why bother writing code that only works on
those platforms with the flush bit implemented?  Why not forget the
flush bit even when available and just turn off Nagle?

My application involves mostly request-response transactions resulting
from explicit user actions.  Looking at packet traces, I see mostly

      data -->
	<-- data+Ack
	(delay)
       Ack --> 

with occasional

      extra data --> or <--
      (delay, and possibly Ack packet)
      data -->
	<-- data
       Ack -->

The extra data is necessary, is triggered by various async. events, must
be handled in sequence with the main stuff, and cannot be delayed more
than ~0.3 sec without noticeable (objectionable) user interface effects.
I think aggressive caching or anticipatory cache filling in something like
HTTP would be similar.

I have mechanisms to buffer the extra data in the application, and then write
it with the next request or response in one write().  There are timers to
flush the buffer when no request or response comes along in time.  Still,
I've been unable to avoid turning off Nagle, because if I do emit an extra
data write and then discover I need to send a request, I don't want the
request to be delayed.  If I had a flush flag I could use it on main data,
but that would result in the same packet traces as with Nagle off.
So why bother?
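
The application-side buffering described above might look roughly like
the following Python sketch.  The class name CoalescingWriter, its
methods, and the injectable clock are hypothetical and chosen only for
illustration; the 0.3 second limit is the figure mentioned above.  It
assumes the caller polls periodically instead of using a real timer.

```python
import time


class CoalescingWriter:
    """Hold extra data until a request/response or a deadline arrives."""

    def __init__(self, write, max_delay=0.3, clock=time.monotonic):
        self._write = write          # callable that takes one bytes object
        self._max_delay = max_delay  # seconds the extra data may wait
        self._clock = clock          # injectable so the logic can be tested
        self._pending = bytearray()
        self._deadline = None        # when the oldest pending byte must go

    def queue_extra(self, data):
        """Buffer extra data; start the flush timer on the first byte."""
        if not self._pending:
            self._deadline = self._clock() + self._max_delay
        self._pending += data

    def send_main(self, data):
        """Send a request or response, with pending extra data in front."""
        self._emit(bytes(self._pending) + data)

    def poll(self):
        """Call periodically; flushes pending data once the timer expires."""
        if self._pending and self._clock() >= self._deadline:
            self._emit(bytes(self._pending))

    def _emit(self, payload):
        self._pending.clear()
        self._deadline = None
        self._write(payload)         # exactly one write per emitted unit
```

With Nagle off on the underlying socket, each call to the write callable
would normally leave as soon as the window allows, which is the
behaviour the paragraph above relies on.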

For a day or two, I thought an inverse flush bit, something that says
"delay this segment" would be cool.  But that's no good either.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 18:44:15 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA13463
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 18:44:15 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA02245; Wed, 17 Feb 1999 17:47:48 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA28287; Wed, 17 Feb 1999 17:42:36 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id QAA22004;
	Wed, 17 Feb 1999 16:42:22 -0600 (CST)
Date: Wed, 17 Feb 1999 16:42:22 -0600 (CST)
From: David Borman <dab@BSDI.COM>
Message-Id: <199902172242.QAA22004@frantic.bsdi.com>
To: barney@databus.com, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Barney Wolff <barney@databus.com>
> Date: Wed, 17 Feb 1999 17:03 EST
>
> At the risk of looking silly, don't I remember that setting NODELAY
> has to be done before the socket is connected?  That means that a
> server has to make its choice before it knows who's asking and what
> the request is, unless it can listen on different ports.  To me,
> that's the advantage of the flush bit.  Same problem with setting
> SND/RCVBUF.

No.  TCP_NODELAY can be set/unset at any time.  What needs to be
done before the connect is the setting of RCVBUF, so that the
correct window scale option will be negotiated at connect time. 
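
As a rough illustration of that ordering, here is a Python sketch using
the standard socket module.  The helper names and the 256 KB buffer size
are assumptions made for the example, not recommendations.

```python
import socket


def connect_with_options(address, rcvbuf=256 * 1024):
    """Create a TCP socket, size its receive buffer, then connect."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must happen before connect(): the window scale offered at
    # connection setup is derived from the receive buffer size.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    sock.connect(address)
    return sock


def set_nodelay(sock, enabled):
    """Turn TCP_NODELAY on or off; allowed at any point in the connection."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


def get_nodelay(sock):
    """Report whether TCP_NODELAY is currently set."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```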

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 19:58:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA14004
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 19:58:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA28672; Wed, 17 Feb 1999 18:57:49 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from shasta-pc.shastanets.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA27072; Wed, 17 Feb 1999 18:55:33 -0500 (EST)
Received: from STEVEA-PC ([209.31.25.164]) by shasta-pc.shastanets.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2232.9)
	id 1M2Q9MY9; Wed, 17 Feb 1999 15:53:12 -0800
Reply-To: <stevea@shastanets.com>
From: "Steve Alexander" <stevea@shastanets.com>
To: "'David Borman'" <dab@BSDI.COM>, <barney@databus.com>,
        <tcp-impl@lerc.nasa.gov>
Subject: RE: internet draft on suggested mod to the Nagle algorithm
Date: Wed, 17 Feb 1999 15:55:56 -0800
Message-ID: <000301be5ad1$101e52c0$a4191fd1@stevea-pc.ShastaNets.COM>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2232.26
Importance: Normal
In-Reply-To: <199902172242.QAA22004@frantic.bsdi.com>
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

I'm pretty sure that in a lot of BSD-based implementations,
TCP_NODELAY is not inherited across accept() because so_options
is copied, but t_flags is not.  Maybe this is where the
confusion originates.
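
If inheritance cannot be relied on, a server can simply set the option
on each accepted socket.  A minimal Python sketch, where the helper name
accept_nodelay is made up for this example:

```python
import socket


def accept_nodelay(listener):
    """Accept one connection and set TCP_NODELAY on it explicitly."""
    conn, peer = listener.accept()
    # Do not depend on the listening socket's setting being copied
    # to the new socket; set it on the connection itself.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return conn, peer
```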

-- Steve

> -----Original Message-----
> From: David Borman [mailto:dab@BSDI.COM]
> Sent: Wednesday, February 17, 1999 2:42 PM
> To: barney@databus.com; tcp-impl@lerc.nasa.gov
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
> 
> 
> > From: Barney Wolff <barney@databus.com>
> > Date: Wed, 17 Feb 1999 17:03 EST
> >
> > At the risk of looking silly, don't I remember that setting NODELAY
> > has to be done before the socket is connected?  That means that a
> > server has to make its choice before it knows who's asking and what
> > the request is, unless it can listen on different ports.  To me,
> > that's the advantage of the flush bit.  Same problem with setting
> > SND/RCVBUF.
> 
> No.  TCP_NODELAY can be set/unset at any time.  What needs to be
> done before the connect is the setting of RCVBUF, so that the
> correct window scale option will be negotiated at connect time. 
> 
> 			-David Borman, dab@bsdi.com
> 


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 20:05:20 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA14125
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 20:05:19 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA02796; Wed, 17 Feb 1999 19:02:50 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA02042; Wed, 17 Feb 1999 19:01:45 -0500 (EST)
Received: from big (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id TAA30314;
	Wed, 17 Feb 1999 19:01:41 -0500
Message-Id: <3.0.5.32.19990217190140.03116100@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Wed, 17 Feb 1999 19:01:40 -0500
To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: internet draft on suggested mod to the Nagle algorithm
In-Reply-To: <199902172103.OAA10699@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 14:03 2/17/99 -0700, Vernon Schryver wrote:

>First is the problem that the flag would not be available on most
>platforms for at least a few years.

That's simple - check for the flag - if it exists then use it, otherwise
drop Nagle.

>Second, for an application I'm hacking, I've been trying to see how I
>would use a flag or the proposed change or something else.  As far as I
>can tell, nothing does better than simply turning off Nagle, and I bet
>the same applies to HTTP.  It's not that you could not use a flush bit,
>but that the packet traces and everything else would be essentially the
>same as with Nagle off.  So why bother writing code that only works on
>those platforms with the flush bit implemented?  Why not forget the
>flush bit even when available and just turn off Nagle?

I feel we are circling in this thread :) I started out answering Joe Touch
that this is in fact a hard question (the archives don't allow me to refer
to individually archived messages which IMHO is really bad - HTML'ized
archivers come cheap these days and would avoid a lot of circular and
repeated arguments).

Yes, I too do application level buffering with all the problems of figuring
out what a good buffer size is (I ended up using 1K as a compromise). In
most cases this means I don't need Nagle at all.

But, from an architectural viewpoint, I think it makes a lot more sense if
I didn't have to do that at the application layer and that I instead could
keep the services separated. This means that I need enough control to be
able to indicate what I want - an explicit flush is one of those hooks.
Today, I don't have that so I have to repeat a lot of the functionality in
a layer above and short-circuit things like Nagle in lower layers.

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 20:31:53 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id UAA14430
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 20:31:53 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA25758; Wed, 17 Feb 1999 19:32:48 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA21825; Wed, 17 Feb 1999 19:27:43 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id RAA15423
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 17 Feb 1999 17:27:42 -0700 (MST)
Date: Wed, 17 Feb 1999 17:27:42 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902180027.RAA15423@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Henrik Frystyk Nielsen <frystyk@w3.org>

> ...
> Yes, I too do application level buffering with all the problems of figuring
> out what a good buffer size is (I ended up using 1K as a compromise). In
> most cases this means I don't need Nagle at all.
>
> But, from an architectural viewpoint, I think it makes a lot more sense if
> I didn't have to do that at the application layer and that I instead could
> keep the services separated. This means that I need enough control to be
> able to indicate what I want - an explicit flush is one of those hooks.
> Today, I don't have that so I have to repeat a lot of the functionality in
> a layer above and short-circuit things like Nagle in lower layers.

That's a fair argument, but I don't see how it actually applies here.


Case 0: no buffering, Nagle off, use flush bit
Case 1: no buffering, Nagle off, don't use flush bit
   same packets on the wire in either case...far too many!

Case 2: you don't buffer, leave Nagle on, and use the flush bit.  
    Your writes of slow data generally go out immediately, because you
    send them when things are idle.  When you have fast data, you use the
    flush bit, so they also go out immediately instead of possibly being
    delayed for an Ack of slow data.  The result on the wire is too many
    packets, almost as bad as #0 and #1.  The only improvement in packet
    counts over #0 and #1 is when you happen to send slow data within
    100 ms after receiving fast data or send 2 slow bursts within 100 ms.

Case 3:  you buffer, leave Nagle on, and use the flush bit
Case 4:  you buffer, turn Nagle off, and do not use the flush bit.
    Both cases have similar application code complexity and the
    same, minimal count of packets on the wire.

Case 5-7:  boring.

Do you agree that #0 and #1 are bad, and so you pick either #3 or #4?

Given that #4 works everywhere today and forever, why would you ever bother
to code #3?  For the architectural purity seal of approval?


What's the difference between a flush bit in the send() system call, and
explicit setsockopt()'s turning Nagle off (setting TCP_NODELAY) before a
send() and back on afterwards?  No, I don't like system call overhead--I'm
asking what happens on the wire.
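
For concreteness, the setsockopt() variant of that question could be
written as the Python sketch below.  The function name send_flushed is
hypothetical.  Whether this yields the same segments as a real flush bit
depends on the stack; the sockets API itself promises nothing about
segment boundaries.

```python
import socket


def send_flushed(sock, data):
    """Send data with Nagle disabled for this call, then restore it."""
    previous = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        sock.sendall(data)
    finally:
        # Put the option back the way the caller had it.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                        1 if previous else 0)
```

This costs three system calls per flushed write, which is the overhead
the paragraph above sets aside.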


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 21:28:35 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA15019
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 21:28:34 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA00755; Wed, 17 Feb 1999 20:17:49 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA29946; Wed, 17 Feb 1999 20:16:34 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id RAA23902;
	Wed, 17 Feb 1999 17:16:30 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id RAA25177;
	Wed, 17 Feb 1999 17:16:30 -0800 (PST)
Date: Wed, 17 Feb 1999 17:16:30 -0800 (PST)
Message-Id: <199902180116.RAA25177@rum.isi.edu>
To: tcp-impl@lerc.nasa.gov, vjs@calcite.rhyolite.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Wed, 17 Feb 1999 17:27:42 -0700 (MST)
> From: Vernon Schryver <vjs@calcite.rhyolite.com>
...
> What's the difference between a flush bit in the send() system call, and
> explicit setsockopt()'s turning Nagle off (setting TCP_NODELAY) before a
> send() and back on afterwards?  No, I don't like system call overhead--I'm
> asking what happens on the wire.

I don't think there is an answer to this. The problem is that
there isn't really a required correlation between send() calls
and segments, regardless of NODELAY (or anything else I can find,
excepting _only_ the proposed flush bit, which might be defined as
guaranteeing that the last byte of the write is the last byte
of a segment).

E.g., Nagle waits for a timeout or enough data for a full-sized
segment. However, there is no requirement that this data is sent 
in a single segment, per se.

Granted, this may seem broken, but given it holds, it may be a bigger
hole in the spec than can be fixed by tweaking Nagle... 

Joe


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 21:53:49 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id VAA15261
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 21:53:49 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA17205; Wed, 17 Feb 1999 20:37:51 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA15516; Wed, 17 Feb 1999 20:35:13 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id RAA25191;
	Wed, 17 Feb 1999 17:35:10 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id RAA25638;
	Wed, 17 Feb 1999 17:35:10 -0800 (PST)
Date: Wed, 17 Feb 1999 17:35:10 -0800 (PST)
Message-Id: <199902180135.RAA25638@rum.isi.edu>
To: vjs@calcite.rhyolite.com, tcp-impl@lerc.nasa.gov, frystyk@w3.org
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Wed, 17 Feb 1999 19:01:40 -0500
> To: Vernon Schryver <vjs@calcite.rhyolite.com>, tcp-impl@lerc.nasa.gov
> From: Henrik Frystyk Nielsen <frystyk@w3.org>
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
...
> But, from an architectural viewpoint, I think it makes a lot more sense if
> I didn't have to do that at the application layer and that I instead could
> keep the services separated. This means that I need enough control to be
> able to indicate what I want - an explicit flush is one of those hooks.
> Today, I don't have that so I have to repeat a lot of the functionality in
> a layer above and short-circuit things like Nagle in lower layers.

TCP currently lacks a few things that would certainly be useful
to applications that want something other than a byte-stream, 
including (but not limited to):

	in-band EOF
		separate "close" and "disconnect" system calls, e.g.


	flush
		implementing the intended semantics of the PUSH bit,
		i.e., an application signal to avoid intermediate
		buffering and emit whatever data the lower layers have
		ASAP

	record mark
		a way to indicate the last byte of a segment, i.e.,
		to force a (minor) correlation between segment and app 
		message boundaries

		note, this is different from FLUSH, since the system
		could be emitting segments as fast as possible while
		a subsequent send() could add more data

	aggregate-for-me
		a way to ask the system to aggregate packet emissions
		if they're not "MSS", by stalling writes when there
		is less than 1 MSS to send.

		this isn't Nagle, which stalls sending if there
		is any outstanding unack'd segment

		this isn't Minshall's recent proposal, which stalls
		sending if there is any outstanding unack'd PARTIAL
		segment (even if there is currently a full segment to 
		be sent).

		i.e., use a flag to indicate "SENT_TINY";
			on emit,
				SENT_TINY = "segment < MSS"
			on attempted non-timer emit:
				if "current segment < MSS" &&
				"SENT_TINY == 1",
					wait
			on timer-based emit:
				ignore SENT_TINY
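
The SENT_TINY rule above can be sketched as a small Python state
machine.  The class and method names (TinyGate, try_emit) are
hypothetical; the logic follows the three cases listed above and assumes
that a timer-based emit also updates the flag, which the pseudocode does
not say explicitly.

```python
class TinyGate:
    """Allow at most one sub-MSS segment in a row unless a timer fires."""

    def __init__(self, mss):
        self.mss = mss
        self.sent_tiny = False   # True if the last emitted segment was < MSS

    def try_emit(self, segment_len, timer_fired=False):
        """Return True if the segment may be sent now, False to wait."""
        tiny = segment_len < self.mss
        if not timer_fired and tiny and self.sent_tiny:
            return False         # wait: the previous emit was also tiny
        self.sent_tiny = tiny    # on emit: remember whether it was tiny
        return True
```

For example, with an MSS of 1460, two 100-byte writes in a row would let
the first go out and hold the second until either more data arrives to
fill a segment or the timer fires.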


The most useful of these, for message-based systems, require
changes to the semantics of the send() API, however...

Joe
			

From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 22:40:37 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA17315
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 22:40:37 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA26603; Wed, 17 Feb 1999 21:27:51 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Twig.Rodents.Montreal.QC.CA (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA23541; Wed, 17 Feb 1999 21:23:12 -0500 (EST)
Received: (from mouse@localhost)
	by Twig.Rodents.Montreal.QC.CA (8.8.8/8.8.8) id VAA05141;
	Wed, 17 Feb 1999 21:23:07 -0500 (EST)
Date: Wed, 17 Feb 1999 21:23:07 -0500 (EST)
From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
Message-Id: <199902180223.VAA05141@Twig.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> TCP currently lacks a few things that would certainly be useful to
> applications that want something other than a byte-stream, [...]

Um, yeah.  I'm not quite sure what your point is.  This sounds like
"TCP isn't all things to all applications", which doesn't really sound
to me like something that is possible to fix, nor do I think it would
be desirable to try.

> including (but not limited to):

> 	in-band EOF
> 		separate "close" and "disconnect" system calls, e.g.

Um, isn't that what the FIN bit is all about?  shutdown(2) in the
socket API, with second argument 1?  Or have I misunderstood?  (In this
vein, I really think TCP needs a way to implement shutdown(...,0), that
is, to push "no more receives" back to the sender in a way that doesn't
break data flow in the other direction.)
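
A minimal Python sketch of the half-close just mentioned.  It uses
socket.socketpair() as a stand-in for a TCP connection (on Unix that is
a local stream socket, not TCP), and the function name half_close_demo
is made up for the example.  On a TCP socket the same shutdown call is
what sends the FIN.

```python
import socket


def half_close_demo():
    """Close one direction of a stream and keep using the other."""
    a, b = socket.socketpair()
    try:
        a.sendall(b"last request")
        a.shutdown(socket.SHUT_WR)   # "second argument 1": a sends no more
        received = b""
        while True:
            chunk = b.recv(4096)
            if not chunk:            # empty read marks end of stream
                break
            received += chunk
        b.sendall(b"reply")          # the other direction is still open
        reply = a.recv(4096)
        return received, reply
    finally:
        a.close()
        b.close()
```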

> 	flush
> 		implementing the intended semantics of the PUSH bit,

This sounds a lot like one of the things proposed in this Nagle thread.
I think it would be a good thing to have, but it's an API issue, not
really a TCP issue, as far as I can see.

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 22:40:43 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id WAA17339
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 22:40:43 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA04163; Wed, 17 Feb 1999 21:37:52 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Twig.Rodents.Montreal.QC.CA (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA01298; Wed, 17 Feb 1999 21:33:32 -0500 (EST)
Received: (from mouse@localhost)
	by Twig.Rodents.Montreal.QC.CA (8.8.8/8.8.8) id VAA05196;
	Wed, 17 Feb 1999 21:33:27 -0500 (EST)
Date: Wed, 17 Feb 1999 21:33:27 -0500 (EST)
From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
Message-Id: <199902180233.VAA05196@Twig.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>> What's the difference [on the wire] between a flush bit in the
>> send() system call, and explicit setsockopt()'s turning Nagle off
>> (setting TCP_NODELAY) before a send() and back on afterwards?

None that I can see.

> The problem is that there isn't really a required correlation between
> send() calls and segments, regardless of NODELAY (or anything else I
> can find, excepting _only_ the proposed flush bit, which might be
> defined as guaranteeing that the last byte of the write is the last
> byte of a segment).

That's not how I read it.  Suppose there is no send window available,
MSS 1500, 500 bytes buffered waiting for window, and then, without any
network traffic occurring during it, the application does
write-with-flush 10 bytes then ordinary write 10 bytes more.  I don't
think any of the flush-bit proposals would require the next segment to
hold 510 bytes rather than 520 (assuming that when the window opens, it
opens at least 520 bytes' worth).
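
The arithmetic in that example, under the reading given above (a flush
does not force a segment boundary), is sketched here in Python.  The
function name next_segment_len is hypothetical, and the model ignores
everything except buffered bytes, window, and MSS.

```python
def next_segment_len(buffered, window, mss):
    """Bytes in the next segment if flush marks impose no boundary."""
    return min(buffered, window, mss)


# 500 bytes already waiting, then a 10-byte write-with-flush,
# then an ordinary 10-byte write, all before the window opens.
buffered = 500 + 10 + 10
segment = next_segment_len(buffered, window=520, mss=1500)
```

Here segment comes out as 520, not 510: all three writes travel
together once the window opens.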

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 17 23:04:54 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id XAA17683
	for <tcpimpl-archive@lists.ietf.org>; Wed, 17 Feb 1999 23:04:54 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id WAA26143; Wed, 17 Feb 1999 22:07:13 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id WAA21646; Wed, 17 Feb 1999 22:01:07 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id SAA00496;
	Wed, 17 Feb 1999 18:57:48 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id SAA28151;
	Wed, 17 Feb 1999 18:57:48 -0800 (PST)
Date: Wed, 17 Feb 1999 18:57:48 -0800 (PST)
Message-Id: <199902180257.SAA28151@rum.isi.edu>
To: tcp-impl@lerc.nasa.gov, mouse@Rodents.Montreal.QC.CA
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From owner-tcp-impl@lerc.nasa.gov Wed Feb 17 18:43:40 1999
> Date: Wed, 17 Feb 1999 21:23:07 -0500 (EST)
> From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
> To: tcp-impl@lerc.nasa.gov
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
> 
> > TCP currently lacks a few things that would certainly be useful to
> > applications that want something other than a byte-stream, [...]
> 
> Um, yeah.  I'm not quite sure what your point is.  This sounds like
> "TCP isn't all things to all applications", which doesn't really sound
> to me like something that is possible to fix, nor do I think it would
> be desirable to try.

Restated, TCP lacks some features that appear to be consistent
with a general reliable byte-stream model, but have not yet been
designed in.

> > including (but not limited to):
> 
> > 	in-band EOF
> > 		separate "close" and "disconnect" system calls, e.g.
> 
> Um, isn't that what the FIN bit is all about? 

What I mean is to be able to send "EOF" more than once in a stream.
You can for a file - i.e., you can read the EOF, then rewind, then
get an EOF again. But you can't for a stream. Once you send EOF, you
can't send any more data.

Joe


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 01:01:22 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id BAA19407
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 01:01:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id XAA13266; Wed, 17 Feb 1999 23:47:51 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Twig.Rodents.Montreal.QC.CA (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id XAA11094; Wed, 17 Feb 1999 23:44:38 -0500 (EST)
Received: (from mouse@localhost)
	by Twig.Rodents.Montreal.QC.CA (8.8.8/8.8.8) id XAA05853;
	Wed, 17 Feb 1999 23:44:32 -0500 (EST)
Date: Wed, 17 Feb 1999 23:44:32 -0500 (EST)
From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
Message-Id: <199902180444.XAA05853@Twig.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>>> 	in-band EOF
>> Um, isn't that what the FIN bit is all about?
> What I mean is to be able to send "EOF" more than once in a stream.

I'm not sure this really makes sense.  Once you do this, it isn't a
byte-stream any longer; it's a (byte-or-EOF)-stream, and will have a
*real* EOF when there are no more (byte-or-EOF)s possible.  You can
then repeat the construction as many times as you like...but what's the
point of doing it at all?

> You can for a file - i.e., you can read the EOF, then rewind, then
> get an EOF again.  But you can't for a stream.

You can't rewind a byte-stream.  They also don't support a lot of other
things files do, like random access (of which rewinding is really just
a special case).  I don't see this as a problem; it is the nature of
byte-streams to be byte-streams.

If an application wants to send more than 256 different things over a
stream, I am inclined to let the application deal with encoding them.
Different applications will have different tradeoffs (such as how
frequent different values are and thus how to represent them)....
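
As one illustration of an application doing its own encoding, here is a
minimal sketch of an escape scheme that lets a byte-stream carry any number
of in-band "EOF" markers.  The escape byte and the two codes are hypothetical
values picked for the example, not part of any protocol discussed here.

```python
ESC = 0x1B        # hypothetical escape byte
EOF_CODE = 0x01   # hypothetical: ESC followed by this means "EOF marker"
ESC_CODE = 0x02   # hypothetical: ESC followed by this means "literal ESC byte"

EOF_MARK = bytes([ESC, EOF_CODE])


def encode_data(data: bytes) -> bytes:
    """Escape every literal ESC byte so it cannot be mistaken for a marker."""
    return data.replace(bytes([ESC]), bytes([ESC, ESC_CODE]))


def decode(stream: bytes) -> list:
    """Return the data chunks in order, with None standing for each EOF marker."""
    items = []
    chunk = bytearray()
    i = 0
    while i < len(stream):
        byte = stream[i]
        if byte != ESC:
            chunk.append(byte)
            i += 1
            continue
        if i + 1 >= len(stream):
            raise ValueError("truncated escape sequence")
        code = stream[i + 1]
        if code == ESC_CODE:
            chunk.append(ESC)
        elif code == EOF_CODE:
            if chunk:
                items.append(bytes(chunk))
                chunk.clear()
            items.append(None)
        else:
            raise ValueError("unknown escape code")
        i += 2
    if chunk:
        items.append(bytes(chunk))
    return items
```

A sender would write encode_data(part) followed by EOF_MARK for each part;
the receiver gets the parts back with a None between them.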

"If you want pseudo-terminals, you know where to find them."

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 02:03:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id CAA24614
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 02:03:29 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id AAA23897; Thu, 18 Feb 1999 00:42:50 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id AAA20754; Thu, 18 Feb 1999 00:38:11 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id WAA20674
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 17 Feb 1999 22:38:10 -0700 (MST)
Date: Wed, 17 Feb 1999 22:38:10 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902180538.WAA20674@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: der Mouse  <mouse@Rodents.Montreal.QC.CA>

> >> What's the difference [on the wire] between a flush bit in the
> >> send() system call, and explicit setsockopt()'s turning Nagle off
> >> (setting TCP_NODELAY) before a send() and on afterwards?
>
> None that I can see.

I couldn't think of one either, but the issue is slippery.

> > The problem is that there isn't really a required correlation between
> > send() calls and segments, regardless of NODELAY (or anything else I
> > can find, excepting _only_ the proposed flush bit, which might be
> > defined as guaranteeing that the last byte of the write is the last
> > byte of a segment).
> ...

That sidesteps my intended question.  As with previous rounds in this
discussion of the Nagle algorithm, what we think is a correct exegesis of
the (or a) standard is basically irrelevant.  Does it matter whether RFC
896 talked about send() system calls, passages through tcp_output(), or
cycles of the RFC 793 state machine?  NO!--What really matters is what
happens in representative, currently deployed implementations.  First
figure out the problem, and after that choose an architecture, a layering,
and a religion.  That order is what made TCP, and the opposite is what
made TP0-TP4/CONS-CLNS.


] From: Joe Touch <touch@ISI.EDU>
]
] > > TCP currently lacks a few things that would certainly be useful to
] > > applications that want something other than a byte-stream, [...]
] > 
] > Um, yeah.  I'm not quite sure what your point is.  This sounds like
] > "TCP isn't all things to all applications", which doesn't really sound
] > to me like something that is possible to fix, nor do I think it would
] > be desirable to try.

I was horrified by some of the ideas.  EOF and record markers particularly
bother me because they are so trivially implemented by the application
and there are so many more kinds of markers than any of us would dream
of putting into a kernel (or so I would like to think).  You would need
at the very least all of ANSI magnetic tape markers, including "my summer
vacation" narratives optionally associated with every mark.

On the other hand, the notions of "delay sending these low priority bits
until I find some more or you really have nothing better to do" and
"send this glob immediately" sound very enticing.  I think they might
finally be an improvement on Nagle.

On the third hand, was it in this mailing list where Joe Touch and others
tonight mentioned Gbit/sec TCP for clusters?  Every minor optional feature
that you add to a protocol hurts enormously at the boundaries, on very
large and fast and on very small systems.  It wouldn't take more than a
smidgeon more improvements to transform the TCP/IP of 1985 that buried
TP0-TP4/CONS-CLNS into a monster that makes those poor guys look agile
and svelte.  The necessary but so far small and subtle changes including
header prediction, fast retransmission, fast recovery, slow start, double
initial MSS, window scaling and SACK, etc. are a long way down that road
paved with good intentions.  Start adding neat knobs to the API, and TCP
will instantly flash under the lintel with the damning motto.


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 14:25:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA08827
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 14:25:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA01479; Thu, 18 Feb 1999 12:36:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ietf.org (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA29518; Thu, 18 Feb 1999 12:34:56 -0500 (EST)
Received: from CNRI.Reston.VA.US (localhost [127.0.0.1])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id MAA02681;
	Thu, 18 Feb 1999 12:34:30 -0500 (EST)
Message-Id: <199902181734.MAA02681@ietf.org>
Mime-Version: 1.0
Content-Type: Multipart/Mixed; Boundary="NextPart"
To: IETF-Announce:;;@ns.cnri.reston.va.us
Cc: tcp-impl@lerc.nasa.gov
From: Internet-Drafts@ietf.org
Reply-to: Internet-Drafts@ietf.org
Subject: I-D ACTION:draft-ietf-tcpimpl-cong-control-05.txt
Date: Thu, 18 Feb 1999 12:34:30 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

--NextPart

A New Internet-Draft is available from the on-line Internet-Drafts directories.
This draft is a work item of the TCP Implementation Working Group of the IETF.

Note: This revision reflects comments received during the last call period.

	Title		: TCP Congestion Control
	Author(s)	: M. Allman, V. Paxson, W. Stevens
	Filename	: draft-ietf-tcpimpl-cong-control-05.txt
	Pages		: 11
	Date		: 17-Feb-99
	
    This document defines TCP's four intertwined congestion control
    algorithms: slow start, congestion avoidance, fast retransmit, and
    fast recovery.  In addition, the document specifies how TCP should
    begin transmission after a relatively long idle period, as well as
    discussing various acknowledgment generation methods.

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-ietf-tcpimpl-cong-control-05.txt

Internet-Drafts are also available by anonymous FTP. Login with the username
"anonymous" and a password of your e-mail address. After logging in,
type "cd internet-drafts" and then
	"get draft-ietf-tcpimpl-cong-control-05.txt".

A list of Internet-Drafts directories can be found in
http://www.ietf.org/shadow.html 
or ftp://ftp.ietf.org/ietf/1shadow-sites.txt


Internet-Drafts can also be obtained by e-mail.

Send a message to:
	mailserv@ietf.org.
In the body type:
	"FILE /internet-drafts/draft-ietf-tcpimpl-cong-control-05.txt".
	
NOTE:	The mail server at ietf.org can return the document in
	MIME-encoded form by using the "mpack" utility.  To use this
	feature, insert the command "ENCODING mime" before the "FILE"
	command.  To decode the response(s), you will need "munpack" or
	a MIME-compliant mail reader.  Different MIME-compliant mail readers
	exhibit different behavior, especially when dealing with
	"multipart" MIME messages (i.e. documents which have been split
	up into multiple messages), so check your local documentation on
	how to manipulate these messages.
		
		
Below is the data which will enable a MIME compliant mail reader
implementation to automatically retrieve the ASCII version of the
Internet-Draft.

--NextPart
Content-Type: Multipart/Alternative; Boundary="OtherAccess"

--OtherAccess
Content-Type: Message/External-body;
	access-type="mail-server";
	server="mailserv@ietf.org"

Content-Type: text/plain
Content-ID:	<19990217165702.I-D@ietf.org>

ENCODING mime
FILE /internet-drafts/draft-ietf-tcpimpl-cong-control-05.txt

--OtherAccess
Content-Type: Message/External-body;
	name="draft-ietf-tcpimpl-cong-control-05.txt";
	site="ftp.ietf.org";
	access-type="anon-ftp";
	directory="internet-drafts"

Content-Type: text/plain
Content-ID:	<19990217165702.I-D@ietf.org>

--OtherAccess--

--NextPart--


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 14:39:31 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id OAA09976
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 14:39:30 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA21188; Thu, 18 Feb 1999 13:21:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA15532; Thu, 18 Feb 1999 13:17:01 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id KAA21225;
	Thu, 18 Feb 1999 10:16:43 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id KAA20959;
	Thu, 18 Feb 1999 10:16:42 -0800 (PST)
Date: Thu, 18 Feb 1999 10:16:42 -0800 (PST)
Message-Id: <199902181816.KAA20959@rum.isi.edu>
To: tcp-impl@lerc.nasa.gov, vjs@calcite.rhyolite.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Vernon Schryver <vjs@calcite.rhyolite.com>
> To: tcp-impl@lerc.nasa.gov
> Subject: Re: internet draft on suggested mod to the Nagle algorithm
...
> > > The problem is that there isn't really a required correlation between
> > > send() calls and segments, regardless of NODELAY (or anything else I
> > > can find, excepting _only_ the proposed flush bit, which might be
> > > defined as guaranteeing that the last byte of the write is the last
> > > byte of a segment).
> > ...
> 
> That sidesteps my intended question.  As with previous rounds in this
> discussion of the Nagle algorithm, what we think is a correct exegesis of
> the (or a) standard is basically irrelevant.  Does it matter whether RFC
> 896 talked about send() system calls, passages through tcp_output(), or
> cycles of the RFC 793 state machine?  NO!--What really matters is what
> happens in representative, currently deployed implementations.  First
> figure out the problem, and after that choose an architecture, a layering,
> and a religion.  That order is what made TCP, and the opposite is what
> made TP0-TP4/CONS-CLNS.

AOK - stated differently, current implementations do not guarantee
any correlation between send() calls and segments.  There are more
deployed implementations than just BSD. It makes more sense, as Sally Floyd
has said in other threads, to keep the requirements as open and flexible
as possible. We might be better off less concerned with "what would
OS X.Y.Z do with this mod" than with "what's the best and worst a 'compliant'
system could do, given these requirements".

> ] From: Joe Touch <touch@ISI.EDU>
> ]
> ] > > TCP currently lacks a few things that would certainly be useful to
> ] > > applications that want something other than a byte-stream, [...]
> ] > 
> ] > Um, yeah.  I'm not quite sure what your point is.  This sounds like
> ] > "TCP isn't all things to all applications", which doesn't really sound
> ] > to me like something that is possible to fix, nor do I think it would
> ] > be desirable to try.
> 
> I was horrified by some of the ideas.  EOF and record markers particularly
> bother me because they are so trivially implemented by the application

Reliability is also trivially implemented at the app layer, but TCP/IP
is preferred to TCP/UDP/IP. Some things are more effectively, efficiently,
or consistently implemented in the transport layer. Granted, it's open
to debate what that list is; I was giving examples, not requirements.

> On the third hand, was it in this mailing list where Joe Touch and others
> tonight mentioned Gbit/sec TCP for clusters?  Every minor optional feature
> that you add to a protocol hurts enormously at the boundaries, on very
> large and fast and on very small systems. 

Not always - some can be disabled (such as disabling windowing for
connections on the same subnet), others do not come into play unless
triggered (e.g., some of the markings proposed could be implemented
in options; there's only one check to see if there are "no options", and
it has to be performed anyway).

Joe


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 17:27:14 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id RAA22215
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 17:27:14 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA27956; Thu, 18 Feb 1999 14:46:48 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA22399; Thu, 18 Feb 1999 14:41:59 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id MAA06670
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 18 Feb 1999 12:41:58 -0700 (MST)
Date: Thu, 18 Feb 1999 12:41:58 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902181941.MAA06670@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Joe Touch <touch@ISI.EDU>

> AOK - stated differently, current implementations do not guarantee
> any correlation between send() calls and segments.  There are more
> deployed implementations than just BSD. It makes more sense, as Sally Floyd
> has said in other threads, to keep the requirements as open and flexible
> as possible. We might be better off less concerned with "what would
> OS X.Y.Z do with this mod" than with "what's the best and worst a 'compliant'
> system could do, given these requirements".

So what are we supposed to do with the standards based non-answer?
Of course there is no guaranteed correlation between send() and segments,
and there could not be even with an explicit flush bit (eg. window space
& MSS).  Of course there are more than BSD implementations, and even among
BSD implementations there are differences.  We can nevertheless talk about
typical, representative, useful, and/or counter-productive behaviors.

Let me try to state a pedantic proposition that does not admit standards
exegesis equivocation:
  - in RFC 793, 896, & 1122 TCP conformant implementations which allow
     dynamic switching of the Nagle algorithm, turning off Nagle before
     a send() request and on afterwards MAY have identical results in
     segments on the wire as the proposed explicit flush bit.  Moreover,
     all existing implementations that allow dynamic control of the Nagle
     algorithm would have such identical results if the flush bit were
     added to their send() API's.

Now, is that proposition true or false?
I do not know the answer; if I did, I'd say so.  I do suspect it's true.
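
For concreteness, the "turn Nagle off before a send() and on afterwards"
half of that proposition might look like the sketch below.  It uses Python's
socket module, where TCP_NODELAY set to 1 disables Nagle; the helper name
send_flushed is made up for the example.  Whether the segments on the wire
match what an explicit flush bit would produce is the open question above,
and nothing in this sketch guarantees it.

```python
import socket


def send_flushed(sock, data):
    """Send data with Nagle disabled, then turn Nagle back on.

    Hypothetical helper: an approximation of a per-write "flush" request
    built only from the existing TCP_NODELAY socket option.
    """
    # TCP_NODELAY=1 disables the Nagle algorithm on this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        sock.sendall(data)
    finally:
        # TCP_NODELAY=0 re-enables Nagle for subsequent writes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
```

The cost is two extra system calls per flushed write, which is part of what
a flag on send() itself would avoid.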

sheesh--will I have to write standards language to specify Dave Borman's
flush bit?  Or will we now have an argument about the definitions of truth,
models, satisfiability, and the rest of basic mathematical logic?
Rejecting a statement about the integers on the grounds that nonstandard
models of arithmetic also satisfy the Peano Axioms but falsify the
statement is fine, but this mailing list is named "tcp-impl".


> . ..
> > I was horrified by some of the ideas.  EOF and record markers particularly
> > bother me because they are so trivially implemented by the application
>
> Reliability is also trivially implemented at the app layer, but TCP/IP
> is preferred to TCP/UDP/IP.

If you'd step aside from the heat of the rhetorical battle, I hope you'd
not like that statement.  Reliability requires timers and buffering for
retransmissions, but markers need no more than a byte count at
the start of the file, record, or block, and never more than simple pattern
recognition (e.g. escaping) if you don't have the file, record or block
size a priori.  You must know that even ANSI tape labels are not in the
same league as any kind of reliability.  I know you've heard of slow start,
fast retransmission, and fast recovery.  The reasons those necessary
complications of reliability exist have nothing to do with convenient
abstractions like "layers."  You must recall and surely agree with the
contemptuous remarks in this mailing list about the persistent proposals
to replace TCP/IP with simplistic retransmitting of UDP/IP.
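
To make the "byte count at the start of the record" case concrete, here is
a minimal sketch of application-level record framing over a byte-stream.
The 4-byte big-endian length header and the function names are assumptions
made for the example.

```python
import struct

# Assumed header: one unsigned 32-bit length in network byte order.
HEADER = struct.Struct("!I")


def frame(record: bytes) -> bytes:
    """Prefix a record with its byte count."""
    return HEADER.pack(len(record)) + record


def unframe(buffer: bytes):
    """Split a receive buffer into (complete_records, leftover_bytes).

    The leftover is a partial header or partial record that must wait
    for more bytes to arrive.
    """
    records = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if len(buffer) - start < length:
            break  # record not fully received yet
        records.append(buffer[start:start + length])
        offset = start + length
    return records, buffer[offset:]
```

The receiver needs no timers and no retransmission buffers, only the
leftover bytes from the previous read.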


>                             Some things are more effectively, efficiently,
> or consistently implemented in the transport layer. Granted, it's open
> to debate what that list is; I was giving examples, not requirements.

hmmm...I wrote "in the application" while you wrote "at the app layer."
I've noticed people with .edu addresses on the Southern Calif. coast tend
to have strong views on the sanctity of Architecture and Network Layers.
I view both as mere tools, as handy ways to talk, or more commonly not
talk, about things.  Suppressing momentarily irrelevant details is necessary
for effective thinking, but those curtains to hide details have no virtues
of their own.  The statement that something is more effective in a layer
makes little sense to me.  Network layers and architecture do not exist
independent of real, concrete systems.  There is no such thing as the
transport layer except when you are talking about implementations or
writing an academic (pejorative sense) paper.  Computer architecture
independent of implementations makes as little sense as theoretical civil
architecture.


> > On the third hand, was it in this mailing list where Joe Touch and others
> > tonight mentioned Gbit/sec TCP for clusters?  Every minor optional feature
> > that you add to a protocol hurts enormously at the boundaries, on very
> > large and fast and on very small systems. 
>
> Not always - some can be disabled (such as disabling windowing for
> connections on the same subnet), others do not come into play unless
> triggered (e.g., some of the markings proposed could be implemented
> in options; there's only one check to see if there are "no options", and
> it has to be performed anyway).

NO!  ALWAYS!  In the common case, "disabling" involves code in the critical
execution path.  That code is minimally an executed fetch, test and a whole
gob of unexecuted bloating code that spreads out the executed code and so
by I-cache bloat slows the code that is executed.  Even without cache
effects, that fetch and test have costs.  As you know, TCP can be done
with a few dozen instructions (not counting byte copies or checksum, which
can be done with hardware).  Each optional feature would noticeably increase
those <<few dozen>> instructions even if not used.

In some cases, "disabling" involves conditional compilation, and so no
run-time cost, but the maintenance costs of that solution are usually even
higher than the run time costs of the other tactic.  There is a kernel of
truth to the claims that TCP is too heavyweight for this or that (currently
"clusters").  That kernel is the intellectual baggage like slow-start that
comes with TCP.  You couldn't make TCP go fast or fit tiny processors if
all of your programmers run screaming when you mention it...proof: the
ISO OSI suite could be made to go almost as fast as TCP or be as small as
TCP, but no one really tried.  Most competent people ran screaming from
the specs, foreseeing a hell of maintenance programming for the rest of
their lives because of all of that OSI flexibility.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 18:32:53 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id SAA27300
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 18:32:52 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA29287; Thu, 18 Feb 1999 17:15:05 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA22053; Thu, 18 Feb 1999 17:09:08 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id OAA25332;
	Thu, 18 Feb 1999 14:09:07 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id OAA26759;
	Thu, 18 Feb 1999 14:09:06 -0800 (PST)
Date: Thu, 18 Feb 1999 14:09:06 -0800 (PST)
Message-Id: <199902182209.OAA26759@rum.isi.edu>
To: vjs@calcite.rhyolite.com, tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Cc: touch@ISI.EDU
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vern et al,

Here are some clarifications; perhaps those who would like to continue
the discussion can circulate remaining comments to Vernon and myself
"off-line"... 

> So what are we supposed to do with the standards based non-answer?
> Of course there is no guaranteed correlation between send() and segments,

We appear to be tuning Nagle based on some very specific
behaviors which are particular to an implementation.

Granted, this is (or was) tcp-impl, but that refers to "places where
implementations have bugs or stray from the desired behavior". It does
not necessarily mean "constrained to behave like a particular implementation".

> Let me try to state a pedantic proposition that does not admit standards
> exegesis equivocation:
>   - in RFC 793, 896, & 1122 TCP conformant implementations which allow
>      dynamic switching of the Nagle algorithm, turning off Nagle before
>      a send() request and on afterwards MAY have identical results in
>      segments on the wire as the proposed explicit flush bit.  Moreover,
>      all existing implementations that allow dynamic control of the Nagle
>      algorithm would have such identical results if the flush bit were
>      added to their send() API's.
> 
> Now, is that proposition true or false?
> I do not know the answer; if I did, I'd say so.  I do suspect it's true.

This is precisely where we differ. While I suspect it may be true
for MOST implementations, I firmly claim that an entirely conformant
TCP implementation could have different results, and that it would be
inappropriate to claim either result as uniquely correct.
Adapting Postel's motto and Floyd's position, I am advocating that
(excepting bugs, which was where this WG started), we do not 
overspecify the behavior of a TCP implementation.

> > > I was horrified by some of the ideas.  EOF and record markers particularly
> > > bother me because they are so trivially implemented by the application
> >
> > Reliability is also trivially implemented at the app layer, but TCP/IP
> > is preferred to TCP/UDP/IP.
> 
> If you'd step aside from the heat of the rhetorical battle, I hope you'd
> not like that statement.  Reliability requires timers and buffering for
> retransmissions, but markers need no more than a byte count at
> the start of the file, record, or block, and never more than simple pattern
> recognition (e.g. escaping) if you don't have the file, record or block
> size a priori.

If I don't know the sizes ahead of time, I have to do pattern recognition,
which is computationally intensive (must examine each byte), and can
cause the packet size to increase by up to 50% in the worst cases.
Out-of-band indicators, i.e., in the transport protocol, avoid that.

The differences between out of band and in-band, application vs. kernel,
are all just optimizations. Few, if any of the ones we've described
are necessarily of one flavor or another.

> >                             Some things are more effectively, efficiently,
> > or consistently implemented in the transport layer. 
> 
> hmmm...I wrote "in the application" while you wrote "at the app layer."

OK. I stepped on a religion button somewhere. We're talking about
the same thing - in the application. 

> I've noticed people with .edu addresses on the Southern Calif. coast tend
> to have strong views on the sanctity of Architecture and Network Layers.

Yes, we are minimalists about architecture, and some would claim
that has helped, rather than hurt. 

> of their own.  The statement that something is more effective in a layer
> makes little sense to me. 

Would it be better to refer to it as in-band vs. out-of-band? Or "restricted
to a common, central implementation" vs "per application program"? These are
the relevant differences to which I, and others using the term 'layer', usually
refer.

> > > On the third hand, was it in this mailing list where Joe Touch and others
> > > tonight mentioned Gbit/sec TCP for clusters?  Every minor optional feature
> > > that you add to a protocol hurts enormously at the boundaries, on very
> > > large and fast and on very small systems. 
> >
> > Not always - some can be disabled (such as disabling windowing for
> 
> NO!  ALWAYS!  In the common case, "disabling" involves code in the critical
> execution path.

See the Fast Sockets work at Berkeley. You can make almost all your decisions 
at connection setup time, and run completely different implementations
if you prefer. Even those decisions that are made while an implementation is
running need not be made on a per-packet basis, necessarily.
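
The general idea of deciding at setup time rather than per packet can be
sketched as follows.  This is only an illustration of the pattern, not a
description of how Fast Sockets is built; the marker byte and all names are
hypothetical.  It also says nothing about the code-size and maintenance
costs of carrying both paths.

```python
def send_plain(payload: bytes) -> bytes:
    """Fast path: no optional processing at all."""
    return payload


def send_with_marker(payload: bytes) -> bytes:
    """Optional path: prepend a hypothetical one-byte marker."""
    return b"\x01" + payload


class Connection:
    """Chooses its send routine once, when the connection is set up."""

    def __init__(self, use_markers: bool):
        # The option is tested here only; the per-send path below
        # contains no check for it.
        self.send = send_with_marker if use_markers else send_plain
```

A connection created with use_markers=False never executes, or even
branches around, the marker code on its send path.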

Joe


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 18 19:08:49 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.8.5/8.8.7a) with ESMTP id TAA29438
	for <tcpimpl-archive@lists.ietf.org>; Thu, 18 Feb 1999 19:08:49 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA17486; Thu, 18 Feb 1999 17:55:11 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA10929; Thu, 18 Feb 1999 17:50:05 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id PAA11070
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 18 Feb 1999 15:50:03 -0700 (MST)
Date: Thu, 18 Feb 1999 15:50:03 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902182250.PAA11070@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Joe Touch <touch@ISI.EDU>

> ...
> > So what are we supposed to do with the standards based non-answer?
> > Of course there is no guaranteed correlation between send() and segments,
>
> We appear to be tuning Nagle based on some very specific
> behaviors which are particular to an implementation.

We are talking about proposed changes to TCP, specifically Minshall's
proposed change to the Nagle algorithm and alternatives including explicit
flush and go-slow bits in the application send-request API.  Which is the
"very specific behavior particular to an implementation," that which all
except hypothetical implementations already do, or that which none now do
and which would require a new state variable that would have to be checked
and updated on most send()'s?


> ...
> >   - in RFC 793, 896, & 1122 TCP conformant implementations which allow
> >      dynamic switching of the Nagle algorithm, turning off Nagle before
> >      a send() request and on afterwards MAY have identical results in
> >      segments on the wire as the proposed explicit flush bit.  Moreover,
> >      all existing implementations that allow dynamic control of the Nagle
> >      algorithm would have such identical results if the flush bit were
> >      added to their send() API's.
> > 
> > Now, is that proposition true or false?
> > I do not know the answer; if I did, I'd say so.  I do suspect it's true.
>
> This is precisely where we differ. While I suspect it may be true
> for MOST implementations, I firmly claim that an entirely conformant
> TCP implementation could have different results, and that it would be
> inappropriate to claim either result as uniquely correct.
> Adapting Postel's motto and Floyd's position, I am advocating that
> (excepting bugs, which was where this WG started), we do not 
> overspecify the behavior of a TCP implementation.

SHEESH!  Why do you think I care about the answer?  Give me more credit
than to care about getting your hypothetical, probably non-existent
implementations declared non-conformant.  If, as you say, "MOST
implementations" satisfy my pedantic proposition, then instead of
defining yet another API bit, we can talk about the heretofore ill
defined results of turning Nagle off and on.  Most of us would agree
that adding yet another flag to the standard send() API is more of a
constraint on or over-specification of implementations than specifying
what happens when you toggle Nagle, particularly when the latter would
require no changes to any existing implementation.


> ...
> > I've noticed people with .edu addresses on the Southern Calif. coast tend
> > to have strong views on the sanctity of Architecture and Network Layers.
>
> Yes, we are minimalists about architecture, and some would claim
> that has helped, rather than hurt. 

No, just the opposite of "minimalist architecture".  (Never mind how
rewindable streams might be called minimalist anything.)  The tendency
I've noticed over there is to support surprising positions by appeals
to the Oracles of Network Architecture and the Sacred Network Layering.


> ...
> > NO!  ALWAYS!  In the common case, "disabling" involves code in the critical
> > execution path.
>
> See the Fast Sockets work at Berkeley. You can make almost all your decisions 
> at connection setup time, and run completely different implementations
> if you prefer. Even those decisions that are made while an implementation is
> running need not be made on a per-packet basis, necessarily.

DOUBLE SHEESH!...I heard about the Berkeley (or Livermore) squashed stacks
at least 6 and I think 8 years ago.  Didn't you notice my reference to a
"few dozen instructions"?  For that matter, what do you think the far
older header-prediction amounts to, except moving optional stuff out of
the fast path?

The problem with "making almost all of your decisions at connection setup"
or otherwise out of the performance path is that it puts you in a worse
mess than compile-time option selection.  You still have zillions of lines
of bug-prone code that does not get well tested, and that sends competent
programmers screaming and running away.  It's a worse mess because you
don't only have the optional code in the source tree, but also sitting in
your kernel, ready for execution when you least expect it, and bloating
your system's minimum memory footprint.   Do you think that kernel memory
is free?  Yes, you might make your kernel TCP code pageable, but that's
not without large costs.

THERE IS NO SUCH THING AS A FREE OPTION!  Not in any real implementations,
that is.  Implementations that exist only on standards committee overhead
projectors and management whiteboards are not so limited.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 23 19:50:42 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA22886
	for <tcpimpl-archive@lists.ietf.org>; Tue, 23 Feb 1999 19:50:41 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA09869; Tue, 23 Feb 1999 16:00:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id PAA07054; Tue, 23 Feb 1999 15:55:12 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 512
          for <tcp-impl@lerc.nasa.gov>; Tue, 23 Feb 1999 12:55:08 -0800
Message-ID: <36D315AC.B9FCFCE0@ehsco.com>
Date: Tue, 23 Feb 1999 12:55:08 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: TCP Implementations <tcp-impl@lerc.nasa.gov>
Subject: dynamic rwin adjustments
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit


Anybody doing anything with dynamic default rwin sizes? Like, if the
incoming connection request is from a local node whose RTT is already
known, set window for X? or, probably better, if connection request is
from a remote system with no defined route, assume high-latency RTT
equal to previous remote connections and set rwin to X?

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Tue Feb 23 21:00:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA26523
	for <tcpimpl-archive@lists.ietf.org>; Tue, 23 Feb 1999 21:00:46 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA06702; Tue, 23 Feb 1999 19:46:25 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA05649; Tue, 23 Feb 1999 19:43:19 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id RAA22527
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Tue, 23 Feb 1999 17:43:17 -0700 (MST)
Date: Tue, 23 Feb 1999 17:43:17 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902240043.RAA22527@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: dynamic rwin adjustments
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: "Eric A. Hall" <ehall@ehsco.com>

> Anybody doing anything with dynamic default rwin sizes? Like, if the
> incoming connection request is from a local node whose RTT is already
> known, set window for X? or, probably better, if connection request is
> from a remote system with no defined route, assume high-latency RTT
> equal to previous remote connections and set rwin to X?

Why bother, at least using tactics like those?  The RTT is only one of
the parameters for computing the window size that gives you maximum
throughput.  Given an RTT of 1 msec, 100 msec or 100 usec, what window
would you recommend?  Besides, for many TCP connections, more throughput
than can be provided by 4 MSS worth of window is useless and will not and
cannot be used.
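
As a rough sketch of the arithmetic behind that question: the window
that keeps a path full is approximately bandwidth times RTT, so the
RTT alone does not settle the answer.  The link speeds below are
made-up examples rather than measurements, and the function name is
hypothetical.

```python
def window_for_max_throughput(bandwidth_bits_per_sec, rtt_sec):
    """Bytes that must be in flight to keep the path full (bandwidth * RTT)."""
    return bandwidth_bits_per_sec * rtt_sec / 8.0


# Illustrative link speeds only; the RTTs are the three from the text.
LINKS = [("10 Mbit/s Ethernet", 10e6), ("T1, 1.544 Mbit/s", 1.544e6)]
RTTS = [100e-6, 1e-3, 100e-3]

for name, bps in LINKS:
    for rtt in RTTS:
        window = window_for_max_throughput(bps, rtt)
        print(f"{name:20s} RTT {rtt * 1e3:6.1f} ms -> {window:9.0f} bytes")
```

The same RTT gives windows that differ by the ratio of the link
speeds, which is why an RTT-only rule for picking X is underdetermined.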

What resources are you conserving or better utilizing by adjusting your
window?  Why not just advertise a 2 GByte receive window for all
connections?  I can see two reasons, worst case buffer space used and the
buffer space in the router(s) just beyond the big bottleneck(s) in the
path.  You care about the former because it's not nice to have a packet
arrive when you have nowhere to put it.  You might care about the latter,
because having a lot of bytes waiting in that router means they're more
likely to be discarded or to contribute to latency on your other incoming
traffic.   However, the other guy's congestion window is also supposed to
worry about such things.

The paper recently mentioned here about dynamically adjusting receive
windows looks interesting.  As I understand it, their idea is to adjust
receive windows based on the out-going congestion window and/or recent
receive throughput and a notion of fairness.  Their plan seems intended
to give every connection the largest window that maximizes total system
throughput, while honoring the familiar operating system notion of fairness
or at least avoiding starvation.  (I'd repeat the URL of the paper, but
I think I've lost it.)


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 13:51:00 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA21860
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 13:50:59 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA06782; Wed, 24 Feb 1999 11:46:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sabre.sjf.novell.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA05228; Wed, 24 Feb 1999 11:43:35 -0500 (EST)
Received: (from mahdavi@localhost)
	by sabre.sjf.novell.com (8.9.1/8.9.1) id IAA17985;
	Wed, 24 Feb 1999 08:48:22 -0800
Reply-To: mahdavi@novell.com
To: Vernon Schryver <vjs@calcite.rhyolite.com>
Cc: tcp-impl@lerc.nasa.gov
Subject: Re: dynamic rwin adjustments
References: <199902240043.RAA22527@calcite.rhyolite.com>
From: Jamshid Mahdavi <mahdavi@novell.com>
Date: 24 Feb 1999 08:48:22 -0800
In-Reply-To: Vernon Schryver's message of "Tue, 23 Feb 1999 17:43:17 -0700 (MST)"
Message-ID: <yu8xsobvd9yh.fsf@sabre.sjf.novell.com>
Lines: 41
X-Mailer: Gnus v5.5/Emacs 20.3
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Vernon Schryver <vjs@calcite.rhyolite.com> writes:

> The paper recently mentioned here about dynamically adjusting receive
> windows looks interesting.  As I understand it, their idea is to adjust
> receive windows based on the out-going congestion window and/or recent
> receive throughput and a notion of fairness.  Their plan seems intended
> to give every connection the largest window that maximizes total system
> throughput, while honoring the familiar operating system notion of fairness
> or at least avoiding starvation.  (I'd repeat the URL of the paper, but
> I think I've lost it.)

A couple corrections, and the URL.  The paper covers dynamic tuning of
the sender's socket buffer (and hence sending window).  It is dynamic
throughout the connection, not just at startup as Eric asked in his
original email.

For receiver window, we argue that just setting the window to max (as
Vernon suggests) is just as effective as dynamically tuning.  The
reason for this is that, under normal circumstances, receivers use 
very little memory.  The data in the receive window is normally stored
in the network.  The cases where you would need the receiver memory
are when packets are dropped or when applications get behind.

Unless you tuned *very* conservatively, tuning would still probably
allow for situations where loss in the network would fill up all of
the available memory (mbufs) on the receiver.  There is perhaps a
better case for trying to protect the system from slow applications.

In either event, data in the receive queue is eligible for dropping.
This is a fairly draconian solution, but since (in most
implementations) this data is not ACKed until it is delivered to the
application, you can free up memory very simply by dropping some data
from connections which are being memory-hogs.  In our testing, we
never encountered a situation where this was necessary -- although we
didn't go out of our way to look for one, either.

Oh yes, here is the URL:

http://www.psc.edu/networking/papers/auto_abstract.html

--Jamshid


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 13:58:14 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA22059
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 13:58:13 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA05614; Wed, 24 Feb 1999 12:41:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from justice.atc-bos.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA02950; Wed, 24 Feb 1999 12:36:35 -0500 (EST)
Received: (from mcmanus@localhost)
	by justice.atc-bos.com (8.8.7/8.8.7) id MAA06539;
	Wed, 24 Feb 1999 12:36:34 -0500
From: Patrick McManus <mcmanus@appliedtheory.com>
Message-Id: <199902241736.MAA06539@justice.atc-bos.com>
Subject: Idle Restart Algorithms
To: tcp-impl@lerc.nasa.gov
Date: Wed, 24 Feb 1999 12:36:33 -0500 (EST)
Reply-To: mcmanus@appliedtheory.com
X-Mailer: Franken-ELM [version 2.9X PL99 PGP2]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

So, I've been thinking about TCP idle restarts based on the reading of
both tcpimpl-prob-05 (paxson, et al..) and tcpimpl-restart-00 (hughes,
touch, heidemann exp 09/98).. and something doesn't feel right.

I offer up this narrative in case I've not grasped something correctly
and someone wants to do the kind favor of setting me straight before I
go much further. This rambles a bit, but it does capture my thinking on the
topic. 

From a heuristic pov, the purpose of congestion control is to
dynamically rate limit the pace of a stream to match the current
capacity characteristics of the underlying medium. Because both the
contention for that medium and the medium itself may change (the
latter due to routing adjustments), this must be constantly
re-evaluated. TCP's basic approach is to race forward to about 1/2 of
what it thinks the limit might be, and then inch upwards until it gets
rebuked. This creates some kind of rolling estimate of an acceptable
send rate based on a feedback loop from the network. 

After a period of inactivity this estimate becomes less useful due to
lack of recent pertinent feedback. Traditional implementations react
to this by re-entering slow start which in effect re-discovers the
current available bandwidth. Alternatively, [Hughes,Touch,Heidemann]
suggest Congestion Window Monitoring which fundamentally imposes a
'use it or lose it' policy on growing CWNDs that are in excess of 4
plus the amount of outstanding data which would ensure that after an
idle period the cwnd is maxed out at 4.
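
A minimal sketch of the 'use it or lose it' rule as it is paraphrased
above, not as the draft itself specifies it: cwnd is capped at the
outstanding data plus 4 segments.  The function and constant names are
hypothetical, and units are whole segments.

```python
CWM_SLACK_SEGMENTS = 4  # the constant "4" from the description above


def cwm_clamp(cwnd, outstanding):
    """Cap cwnd (in segments) at outstanding data plus a small slack."""
    return min(cwnd, outstanding + CWM_SLACK_SEGMENTS)


# A busy connection keeps its window; an idle one is cut back to 4.
print(cwm_clamp(32, 30))  # window is in use, left alone
print(cwm_clamp(32, 0))   # nothing outstanding after an idle period
```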

The basic issue at hand is that the past performance of the network is
being used to predict the future. It is recognized that the longer the
interval between past input and the current decision time is, the less
reliable that data is for making the decision. However, traditional
reaction to this issue uses a simple 2 step function (valid/not valid)
to make that determination. The only input to this decision is RTO, which
in my opinion exerts a counter-productive force on the decision. For
two links with the same average RTT the link with the greater variance
is going to fall into the 'valid data' step more than the one with a
lesser variance. In essence, this increases the likelihood of using
past data to predict the future on a link that has proven _more_
volatile (and thus less likely to obey past patterns of
behavior). That doesn't seem like a good thing.
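
The variance effect can be sketched numerically, assuming the usual
estimator RTO = A + 4D (A the smoothed RTT, D the mean deviation) and
an idle test of the form "restart if idle time exceeds RTO".  The
numbers are invented for illustration, and minimum-RTO clamps and
timer granularity are ignored.

```python
def rto(srtt, rttvar):
    """Retransmit timeout from smoothed RTT (A) and mean deviation (D)."""
    return srtt + 4 * rttvar


def cwnd_still_trusted(idle_time, srtt, rttvar):
    """Two-step decision: old cwnd is kept only if idle time <= RTO."""
    return idle_time <= rto(srtt, rttvar)


IDLE = 0.2  # seconds of idle time, same for both links
steady = cwnd_still_trusted(IDLE, srtt=0.1, rttvar=0.005)   # RTO about 0.12 s
volatile = cwnd_still_trusted(IDLE, srtt=0.1, rttvar=0.05)  # RTO about 0.30 s
print("steady link keeps old cwnd:  ", steady)
print("volatile link keeps old cwnd:", volatile)
```

With the same average RTT and the same idle time, only the more
volatile link still counts its old cwnd as valid.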

On a gut level, I'm not much of a fan of CWM either.. While leaky
bucket scenarios are attractive to me, CWM essentially enforces a
gushing bucket where sending capacity is throttled quickly back if the
sending rate doesn't equal the ack-reception rate. It strikes me as
over-aggressive... using 4 as a fall back instead of 1 is likely meant
to mitigate this, but I dislike the constants as they cannot apply
well to all mediums current or future.. (the increased use of
satellite services for downstream in use for things like DirectPC and
rural African service exhibit huge bandwidth delay properties that are
un-necessarily crippled by this)

I think there needs to be a little more discussion on the types of
timeouts that are realistic. Discussion generally just cites HTTP/1.1
and leaves it at that. I see several different scenarios within just
HTTP and also suspect NNTP exhibits relevant behavior. Some possible sources:

  * Short timeouts.. probably as a result of document retrieval,
    parsing time, and subsequent requests for embedded objects.. Just
    how long this takes depends on a lot of things of course, not the
    least of which are document composition and client
    implementation.. but it's probably on the order of 10RTT
  * longer timeouts.. the result of human interpretation of past
    results and the subsequent request to follow one of the presented
    related references.. on the order of a few whole seconds.
  * Folks reading news with NNTP (as opposed to NNTP used for
    transport) are going to exhibit patterns similar to the HTTP
    longer timeout mentioned above.. more and more private news servers
    are being introduced into the fabric of the web for use as a
    discussion medium, but for security/demand reasons are not propagated in
    any kind of general purpose fashion, so interactive news reading
    is becoming more of a WAN phenomenon.
  * HTTP cache interaction.. particularly pre-emptive retrieval
    scenarios that propagate information between caches.

I think the essence of the problem is the overloading of burst
management and flow rate control onto a single mechanism. A large,
previously established, cwnd may have been achieved through an earlier
long sustained flow and slowly ramped up. Even if the properties of
that link have not changed there is no guarantee that the link can
support a full cwnd burst presented all at once without dropping
packets. However, it's quite possible that it can, and in the
case of large bandwidth*delay links that is really something that
should be taken advantage of.

So for the sake of clarity, I want to remove the concept of burst
control from the discussion of restart algorithms. Perhaps a 'burst
window' or other such technique can be orthogonally implemented to
address the issue (something that would learn in a similar fashion
what the maximum allowable network burst would be).

Moving back to idle algorithms I would like to experiment with
something adaptive.. maybe a cwnd reduction of 20% (min 1 segment) for
every A-4D elapsed in the idle period (apply in a compounding
fashion).. This reverses the impact of the variance mentioned above
while providing more fine grained input into the 'trust' process.
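
One possible reading of that idea, as a sketch only.  Assumptions not
stated above: "min 1 segment" is taken to mean each step removes at
least one segment, cwnd never drops below one segment, and the decay
interval A-4D is given a small positive floor so that a very noisy
link cannot make it zero or negative.  All names are hypothetical.

```python
def decay_cwnd(cwnd, idle_time, srtt, rttvar, floor_interval=0.01):
    """Shrink cwnd (segments) by 20% per A-4D of idle time, compounding."""
    # A-4D, guarded against zero or negative values (an assumption).
    interval = max(srtt - 4 * rttvar, floor_interval)
    steps = int(idle_time // interval)
    for _ in range(steps):
        if cwnd <= 1:
            break
        cut = max(1, int(cwnd * 0.2))  # 20%, but at least one segment
        cwnd = max(1, cwnd - cut)
    return cwnd


# Same average RTT and idle time; only the deviation differs.
print(decay_cwnd(20, idle_time=0.25, srtt=0.1, rttvar=0.005))
print(decay_cwnd(20, idle_time=0.25, srtt=0.1, rttvar=0.02))
```

The link with the larger deviation has the shorter decay interval, so
it loses more of its old cwnd over the same idle period, the reverse
of what an RTO-based test does.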

While I intend to go ahead in looking at this, I'm soliciting any
helpful insights or 'been there done that' comments folks may have.

Thanks,
-Pat

Patrick R. McManus - AppliedTheory Communications  -	Software Engineering
http://pat.appliedtheory.com/~mcmanus			Lead Developer
mcmanus@AppliedTheory.com	'Prince of Pollywood'	Standards, today!


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 14:47:44 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA22847
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 14:47:43 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA03205; Wed, 24 Feb 1999 13:36:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA02723; Wed, 24 Feb 1999 13:35:28 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id KAA00974;
	Wed, 24 Feb 1999 10:35:19 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id KAA14028;
	Wed, 24 Feb 1999 10:35:19 -0800 (PST)
Date: Wed, 24 Feb 1999 10:35:19 -0800 (PST)
Message-Id: <199902241835.KAA14028@rum.isi.edu>
To: tcp-impl@lerc.nasa.gov, mcmanus@appliedtheory.com
Subject: Re: Idle Restart Algorithms
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From owner-tcp-impl@lerc.nasa.gov Wed Feb 24 10:00:51 1999
> From: Patrick McManus <mcmanus@appliedtheory.com>
> Subject: Idle Restart Algorithms
> To: tcp-impl@lerc.nasa.gov
> Date: Wed, 24 Feb 1999 12:36:33 -0500 (EST)
> 
> So, I've been thinking about TCP idle restarts based on the reading of
> both tcpimpl-prob-05 (paxson, et al..) and tcpimpl-restart-00 (hughes,
> touch, heidemann exp 09/98).. and something doesn't feel right.
...

This is certainly a difficult issue, exemplified by the delay in
our getting out the update to the restart ID...

The primary issue, to me, is as follows:

	- the current scheme (time since receive) is defeated
		by request/response protocols, unintentionally
		generating excessive bursts

	- timeout based schemes are 'too' binary in nature (as you noted)
		there is no timeout for 'just under an RTO'
		there is a full window adjustment for anything larger
		(i.e., they are step functions - all on, or all off,
		not smooth)

CWM was one attempt at a continuous, graceful degradation of
performance, aimed at avoiding line-rate bursts. We have a
different one, designed with Sally Floyd, which addresses another
source of bursts - packet loss. This mechanism is a more direct
leaky-bucket, which itself is a nice tool for controlling 
and limiting bursts to acceptable values.

Granted, as we have noted, there are many parameters that need
to be tuned - 

	- does the bucket size vary?

		e.g., a time-based decay of the bucket size
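
For readers who want something concrete, a burst limiter of this
general kind might look like the sketch below.  It is written in the
token-bucket form of the leaky-bucket idea, it is not the mechanism
from the ID, and the class name, parameters and units (segments,
seconds) are all assumptions.  The bucket size here is fixed; a
time-based decay would make `size` a function of idle time.

```python
class BurstBucket:
    """Limits how many segments may be sent back-to-back."""

    def __init__(self, size, rate):
        self.size = size      # largest permitted burst, in segments
        self.rate = rate      # refill rate, in segments per second
        self.tokens = size    # start with a full bucket
        self.last = 0.0       # time of the previous call, in seconds

    def allow(self, now, segments):
        """Return how many of `segments` may be sent at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.size, self.tokens + elapsed * self.rate)
        self.last = now
        sendable = min(segments, int(self.tokens))
        self.tokens -= sendable
        return sendable


bucket = BurstBucket(size=4, rate=10)
print(bucket.allow(0.0, 10))  # first burst is capped at the bucket size
print(bucket.allow(0.0, 10))  # bucket is empty, nothing more may go yet
print(bucket.allow(1.0, 10))  # refilled, but never beyond the bucket size
```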


> From a heuristic pov, the purpose of congestion control is to
> dynamically rate limit the pace of a stream to match the current
> capacity characteristics of the underlying medium.

Agreed - which is why true rate-pacing is preferable, but somewhat more
cumbersome. John Heidemann developed a version of this (RBP), which
kicks in only when the RTO goes off, and goes away when ACK clocking
resumes. Again, this has a step-mode which I would prefer to smooth 
somehow.

> After a period of inactivity this estimate becomes less useful due to
> lack of recent pertinent feedback. Traditional implementations react
> to this by re-entering slow start which in effect re-discovers the
> current available bandwidth. Alternatively, [Hughes,Touch,Heidemann]
> suggest Congestion Window Monitoring which fundamentally imposes a
> 'use it or lose it' policy on growing CWNDs that are in excess of 4
> plus the amount of outstanding data which would ensure that after an
> idle period the cwnd is maxed out at 4.

It seems reasonable that, after a very long period of time, the window
should essentially revert to the same as a new connection would be
able to inject. I.e., after long idles, a connection knows no more
about the network than a new connection.

> The basic issue at hand is that the past performance of the network is
> being used to predict the future. It is recognized that the longer the
> interval between past input and the current decision time is, the less
> reliable that data is for making the decision. However, traditional
> reaction to this issue uses a simple 2 step function (valid/not valid)
> to make that determination. The only input to this decision is RTO, which
> in my opinion exerts a counter-productive force on the decision.


> On a gut level, I'm not much of a fan of CWM either.. While leaky
> bucket scenarios are attractive to me, CWM essentially enforces a
> gushing bucket where sending capacity is throttled quickly back if the
> sending rate doesn't equal the ack-reception rate. It strikes me as
> over-aggressive... using 4 as a fall back instead of 1 is likely meant
> to mitigate this,

4 is intended to allow the window to open at full rate,
given non-sequential ACK losses. The sending capacity is throttled
back only if less than 1/4 the expected ACKs are returned within 
a RTT. 

> but I dislike the constants as they cannot apply
> well to all mediums current or future.. (the increased use of
> satellite services for downstream in use for things like DirectPC and
> rural African service exhibit huge bandwidth delay properties that are
> un-necessarily crippled by this)

Huge BW*delay is not directly an issue. The number of outstanding
packets is, and when it is high, the system is necessarily more
volatile. The question is whether to be aggressive or conservative;
satellite systems often work better with aggressive behavior. That
is something that could be discovered on a per-path basis (e.g., 
pathchar) and applied per-connection (e.g., some new TCP option {?}
or by control-block sharing {e.g., RFC2140}). 

> I think there needs to be a little more discussion on the types of
> timeouts that are realistic. Discussion generally just cites HTTP/1.1
> and leaves it at that. I see several different scenarios within just
> HTTP and also suspect NNTP exhibits relevant behavior. Some possible sources:
> 
>   * Short timeouts..
>   * longer timeouts...
>   * Folks reading news with NNTP 
>   * HTTP cache interaction..

This would be designing step-functions based on current behavior;
when behavior changes (as it did with the advent of HTTP), these
are just as likely to be inappropriate.

> I think the essence of the problem is the overloading of burst
> management and flow rate control onto a single mechanism.

Agreed (as in the ID). True rate pacing decouples the two.

> So for the sake of clarity, I want to remove the concept of burst
> control from the discussion of restart algorithms. Perhaps a 'burst
> window' or other such technique can be orthogonally implemented to
> address the issue (something that would learn in a similar fashion
> what the maximum allowable network burst would be).

A burst window is what a leaky bucket implements. The question
appears to be "is it variable, and on what basis".

----

As noted above, we are continuing to refine the ID.  It is not intended
to solve the problem completely immediately, but rather to 

	- define the problem
	- describe some issues in a solution
	- propose a few candidate solutions
	- recommend a quick fix for current implementations,
		based on minimal code modifications and
		generally conservative behavior

We would be very interested in hearing about other solutions, and
would be happy to collect these into a web page... send me your links,
and I'll post a URL shortly....

Thanks!

Joe


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 15:49:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA24023
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 15:49:22 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA00950; Wed, 24 Feb 1999 14:26:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mail2.microsoft.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA28272; Wed, 24 Feb 1999 14:21:47 -0500 (EST)
Received: by mail2.microsoft.com with Internet Mail Service (5.5.2524.0)
	id <FCC3V21Q>; Wed, 24 Feb 1999 11:21:46 -0800
Message-ID: <3924E6A6E200D211BAC900805F6FC9E103CDFD07@RED-MSG-11>
From: Venkat Padmanabhan <padmanab@microsoft.com>
To: "'mcmanus@appliedtheory.com'" <mcmanus@appliedtheory.com>
Cc: "'tcp-impl@lerc.nasa.gov'" <tcp-impl@lerc.nasa.gov>
Subject: RE: Idle Restart Algorithms
Date: Wed, 24 Feb 1999 11:21:41 -0800
X-Mailer: Internet Mail Service (5.5.2524.0)
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

I think you've outlined the issues very well. Here are a few comments.

> The basic issue at hand is that the past performance of the network is
> being used to predict the future. It is recognized that the longer the
> interval between past input and the current decision time is, the less
> reliable that data is for making the decision. However, traditional

I am not convinced that there is a strong correlation between the
length of the idle period and the reliability of old congestion
information. Consider a bottleneck link that is shared by multiple
connections. If one of them becomes idle, the others would soon
(within several RTTs) ramp up and consume the bandwidth that the
idle connection freed up. If the idle connection were to restart
after this time and immediately start sending at its old rate, it
may cause overload and packet loss. (This is even assuming that 
packets are sent out smoothly.) So the appropriateness of having
an idle connection restart without slow start dissipates within 
(at most) several RTTs, a duration which is likely to be much 
shorter than the typical user "think time" in a Web browsing session.


> So for the sake of clarity, I want to remove the concept of burst
> control from the discussion of restart algorithms. Perhaps a 'burst
> window' or other such technique can be orthogonally implemented to
> address the issue (something that would learn in a similar fashion
> what the maximum allowable network burst would be).
> 

I couldn't agree with you more! Avoiding bursts, be it via timers,
"burst windows", etc., is the easy part. The hard part is figuring
out what the congestion window and other variables should be set
to after an idle period.

> Moving back to idle algorithms I would like to experiment with
> something adaptive.. maybe a cwnd reduction of 20% (min 1 segment) for
> every A-4D elapsed in the idle period (apply in a compounding
> fashion).. This reverses the impact of the variance mentioned above
> while providing more fine grained input into the 'trust' process.
> 

This may be worth investigating, but as I have outlined above, I
don't intuitively understand why a gradual decay of the congestion
window (over the timescale of interest -- 10s of seconds) would be
appropriate. Given the current congestion window size, should we be
more confident in estimating the appropriate window size after a
1-minute idle period compared to a 2-minute idle period?

> While I intend to go ahead in looking at this, I'm soliciting any
> helpful insights or 'been there done that' comments folks may have.
> 

The approach I took in my work on "TCP Fast Start" was to have an 
explicit protocol mechanism to "hedge" any bets (guesses) we may 
make about the appropriate window size after an idle period. With 
such a mechanism in place, it may be okay to err on the side of being 
too aggressive (i.e., using a larger window size than appropriate). 
The mechanism I used was priority dropping -- i.e., packets sent 
when the sender is unsure of the appropriateness of congestion window
size are assigned a low drop priority. This would shield (to an extent)
other traffic in the event of the guess being incorrect. While
priority dropping is a simple mechanism and similar in spirit to
some of the diffserv mechanisms being considered, the Internet is
not quite ready for it yet. More information on this work is available
from:
http://www.cs.berkeley.edu/~padmanab/papers/gi98.ps
http://www.cs.berkeley.edu/~padmanab/phd-thesis.html (Chapter 8)
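
To illustrate the general shape of priority dropping at a bottleneck
queue (a toy sketch, not the scheme from the paper or thesis above):
packets sent while the sender is unsure of its window are flagged as
speculative, and a full queue discards those before any normal
packet.  The class, its fields and the eviction order are assumptions
made for the example.

```python
from collections import deque


class PriorityDropQueue:
    """FIFO queue that sacrifices speculative packets first when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()  # entries are (packet_id, speculative)

    def enqueue(self, packet_id, speculative):
        """Queue a packet; return the id of any dropped packet, else None."""
        if len(self.queue) < self.capacity:
            self.queue.append((packet_id, speculative))
            return None
        if speculative:
            return packet_id  # full queue: the speculative arrival is dropped
        # Normal arrival: evict the newest queued speculative packet, if any.
        for i in range(len(self.queue) - 1, -1, -1):
            if self.queue[i][1]:
                victim = self.queue[i][0]
                del self.queue[i]
                self.queue.append((packet_id, False))
                return victim
        return packet_id  # only normal traffic queued: ordinary tail drop


q = PriorityDropQueue(capacity=2)
q.enqueue("a", speculative=False)
q.enqueue("b", speculative=True)
print(q.enqueue("c", speculative=False))  # the speculative packet gives way
print(list(q.queue))
```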

One final note: I believe the deployment of RED in routers would
improve the chances of success for a fast start like scheme, because
the router buffers would better be able to absorb the surge in
load due to connections that are restarting.

-Venkat

Venkat Padmanabhan
Microsoft Research
padmanab@microsoft.com
http://www.research.microsoft.com/~padmanab


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 17:51:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA27019
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 17:51:29 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA03084; Wed, 24 Feb 1999 16:21:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tnt.isi.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA02643; Wed, 24 Feb 1999 16:20:58 -0500 (EST)
Received: from rum.isi.edu (rum-e.isi.edu [128.9.160.237])
	by tnt.isi.edu (8.8.7/8.8.6) with ESMTP id NAA19013;
	Wed, 24 Feb 1999 13:20:57 -0800 (PST)
From: Joe Touch <touch@ISI.EDU>
Received: (from touch@localhost)
	by rum.isi.edu (8.8.7/8.8.6) id NAA18230;
	Wed, 24 Feb 1999 13:20:56 -0800 (PST)
Date: Wed, 24 Feb 1999 13:20:56 -0800 (PST)
Message-Id: <199902242120.NAA18230@rum.isi.edu>
To: mcmanus@appliedtheory.com, padmanab@microsoft.com
Subject: RE: Idle Restart Algorithms
Cc: tcp-impl@lerc.nasa.gov
X-Sun-Charset: US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From owner-tcp-impl@lerc.nasa.gov Wed Feb 24 11:45:45 1999
> From: Venkat Padmanabhan <padmanab@microsoft.com>
> To: "'mcmanus@appliedtheory.com'" <mcmanus@appliedtheory.com>
> Cc: "'tcp-impl@lerc.nasa.gov'" <tcp-impl@lerc.nasa.gov>
> Subject: RE: Idle Restart Algorithms
> Date: Wed, 24 Feb 1999 11:21:41 -0800
> 
> I think you've outlined the issues very well. Here are a few comments.
...
> explicit protocol mechanism to "hedge" any bets (guesses) we may 
> make about the appropriate window size after an idle period. With 
> such a mechanism in place, it may be okay to err on the side of being 
> too aggressive (i.e., using a larger window size than appropriate). 
> The mechanism I used was priority dropping -- i.e., packets sent 
> when the sender is unsure of the appropriateness of congestion window
> size are assigned a low drop priority. This would shield (to an extent)
> other traffic in the event of the guess being incorrect. While
> priority dropping is a simple mechanism and similar in spirit to
> some of the diffserv mechanisms being considered, the Internet is
> not quite ready for it yet. More information on this work is available
> from:
> http://www.cs.berkeley.edu/~padmanab/papers/gi98.ps
> http://www.cs.berkeley.edu/~padmanab/phd-thesis.html (Chapter 8)

There was a version of something similar developed at UPenn,
as a follow-on to my own dissertation (Mirage), in which 
the send window was constant, and split into two sub-windows,
the boundary of which varied with the feedback. It would have
differed from Venkat's only in that the scheme was always
running, and that the overall window was constant (a function
of the upper-bound RTT*BW, but stable compared to the boundary).

The only reference I have to this is:

	Univ. Penn. CS Tech Report
	MS-CIS-94-55 (DISTRIBUTED SYSTEMS LAB 80) 
	Congestion control by bandwidth-delay tradeoff in 
	very high-speed networks: the case of window-based control 
	Hyogon Kim, David J. Farber

Joe


From owner-tcp-impl@lerc.nasa.gov  Wed Feb 24 18:31:29 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA29448
	for <tcpimpl-archive@lists.ietf.org>; Wed, 24 Feb 1999 18:31:28 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA29326; Wed, 24 Feb 1999 17:16:27 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from sabre.sjf.novell.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA27468; Wed, 24 Feb 1999 17:12:14 -0500 (EST)
Received: (from mahdavi@localhost)
	by sabre.sjf.novell.com (8.9.1/8.9.1) id OAA18172;
	Wed, 24 Feb 1999 14:17:42 -0800
Reply-To: mahdavi@novell.com
To: "Eric A. Hall" <ehall@ehsco.com>
Cc: Matt Mathis <mathis@psc.edu>, Jeff Semke <semke@psc.edu>,
        TCP Implementations <tcp-impl@lerc.nasa.gov>
Subject: Re: dynamic rwin adjustments
References: <36D315AC.B9FCFCE0@ehsco.com> <yu8xu2wbdak3.fsf@sabre.sjf.novell.com> <36D42C50.F9CFB14A@ehsco.com>
From: Jamshid Mahdavi <mahdavi@novell.com>
Date: 24 Feb 1999 14:17:41 -0800
In-Reply-To: "Eric A. Hall"'s message of "Wed, 24 Feb 1999 08:44:00 -0800"
Message-ID: <yu8x7lt7phtm.fsf@sabre.sjf.novell.com>
Lines: 15
X-Mailer: Gnus v5.5/Emacs 20.3
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

"Eric A. Hall" <ehall@ehsco.com> writes:

> Well that's the problem I think. Defining MAX as 64k is definitely
> overkill for a 10 Mb/s link with 2ms response time, but it's underkill (?)
> when the link is a multi-hop 100 Mb/s connection with something like
> 10ms latency from multiple bridges/routers between the client and server
> systems (not uncommon on large complex LANs).

The question is (not that I'm saying you're wrong, mind you) why do
you think it is overkill?  In principle, you aren't using any
resources on the receiver by choosing 64k, and the sender is still
doing congestion control and properly utilizing the network path.  Or
so the theory goes :-)

--J


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 25 04:25:42 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id EAA05300
	for <tcpimpl-archive@lists.ietf.org>; Thu, 25 Feb 1999 04:25:41 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id CAA18027; Thu, 25 Feb 1999 02:31:28 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id CAA17656; Thu, 25 Feb 1999 02:30:25 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id AAA29529
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Thu, 25 Feb 1999 00:30:21 -0700 (MST)
Date: Thu, 25 Feb 1999 00:30:21 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902250730.AAA29529@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: dynamic rwin adjustments
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Jamshid Mahdavi <mahdavi@novell.com>

>                              ...    The paper covers dynamic tuning of
> the sender's socket buffer (and hence sending window).  ...

Yes, I got it backwards.

> ...
> For receiver window, we argue that just setting the window to max (as
> Vernon suggests) is just as effective as dynamically tuning. ...

> Unless you tuned *very* conservatively, tuning would still probably
> allow for situations where loss in the network would fill up all of
> the available memory (mbufs) on the receiver. ...

Be careful not to say that too loudly.  It seems true to me, but I've
heard complaints about the BSD tactic of advertising more receive window
than real bytes.  Some people feel strongly, or at least once felt, that
the sum of all of the advertised windows should be no more than the
system's available buffering.

Some BSD variants did a good imitation of crashing when they ran out of
mbufs.  That's a good reason to build better IP fragment handling than in
original 4.3BSD.  I think there were CERT advisories about denial of
service on some systems involving sending orphan IP fragments.  A system
using `ttcp` to try to send 200 Mbit/sec or more of 8K UDP/IP/FDDI
generates a stream of orphans.  Watching a second system try to listen to
such a stream is one way to discover that lots of surprising system calls
depended on mget().


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 25 09:49:23 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id JAA08905
	for <tcpimpl-archive@lists.ietf.org>; Thu, 25 Feb 1999 09:49:23 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id HAA27129; Thu, 25 Feb 1999 07:46:29 -0500 (EST)
Received: from sophia.inria.fr (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id HAA26173; Thu, 25 Feb 1999 07:44:28 -0500 (EST)
Received: from sophia.inria.fr by sophia.inria.fr (8.8.8/8.8.5) with ESMTP id NAA10099 for <tcp-impl@lerc.nasa.gov>; Thu, 25 Feb 1999 13:44:19 +0100 (MET)
X-Authentication-Warning: sophia.inria.fr: Host clope.inria.fr [138.96.48.13] claimed to be sophia.inria.fr
Message-ID: <36D545A3.64812927@sophia.inria.fr>
Date: Thu, 25 Feb 1999 13:44:19 +0100
From: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
Organization: MISTRAL - INRIA Sophia Antipolis
X-Mailer: Mozilla 4.5 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: TCP Slow-Start
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hello, 

My current work concerns the slow start phase of TCP. This algorithm is
clearly necessary: it increases the congestion window gradually so as
not to overwhelm the network buffers, and it starts the ACK clock
smoothly. At the beginning of a connection, it also serves to estimate
the available capacity of the network.
But such a slow increase degrades the performance of the protocol when
the connection crosses long delay paths such as satellite links.
A large initial window saves some round trips at the beginning of the
connection, but it may result in losses because the initial window is
transmitted in a burst. I think that the appropriate solution to this
problem is some kind of available bandwidth estimation. However,
estimating the available bandwidth is difficult to implement. I am
investigating another solution which lets TCP send many packets at the
beginning of the connection, but not in a bursty manner as with a large
initial window.

A timer of value smaller than the timeout (say 50ms or 100ms) is set when
the first packet is sent. This timer is canceled when an ACK is
received. If the timer expires, this means that an ACK has not been
received. The source then assumes that the connection crosses a long
delay path, and therefore it transmits a new packet and sets the timer
again. The absence of an ACK may also be due to the loss of the first
packet. Such a loss is still detected via timeout.
This timer is used during the first Round Trip Time at the beginning of
the connection. We can also set it whenever a timeout occurs. Once we
receive an ACK, a normal slow start is done: the window is increased by
one MSS for every new ACK.
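
As a rough illustration, the send schedule during the first round trip
can be sketched as follows. This assumes an idealized path on which no
packet is lost and the first ACK arrives exactly one RTT after the first
packet; the function name `probe_send_times` and its millisecond
parameters are hypothetical and not part of any TCP implementation.

```python
def probe_send_times(rtt_ms, timer_ms):
    """Return the times (in ms) at which packets are sent before the first ACK.

    Assumption: the first ACK arrives exactly rtt_ms after the first
    packet is sent, and the extra timer is cancelled at that moment.
    """
    if rtt_ms <= 0 or timer_ms <= 0:
        raise ValueError("rtt_ms and timer_ms must be positive")
    # One packet at t=0, then one more at each timer expiry that
    # occurs strictly before the ACK comes back.
    return list(range(0, rtt_ms, timer_ms))


if __name__ == "__main__":
    for rtt in (20, 100, 550):
        times = probe_send_times(rtt, timer_ms=100)
        print(f"RTT {rtt} ms: {len(times)} packet(s) sent at {times} ms")
```

With a 100ms timer, a 20ms path sends a single packet (as in ordinary
slow start), while a 550ms path sends six packets spaced 100ms apart.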

This solution results in many packets being transmitted at the beginning
of the connection (proportional to the RTT) but spaced so as not to
overload the network. It takes effect only on long delay paths. With
small RTTs, an ACK is received before the added timer expires and a
single packet is thus transmitted in the first RTT (the delayed ACK at
the receiver may affect this behavior). I think that spreading out the
first packets sent during slow start may result in separate bursts
during the subsequent RTTs. It is known that bursts sent during slow
start overload the network buffers due to their high rate. Reducing the
size of these bursts may avoid some losses.

The idea is at an early stage, and I don't know exactly what its
negative effects on TCP and on the network will be. Also, the load it
places on the operating system must be investigated. Any feedback on
this issue would be much appreciated.

Thanks in advance 

Chadi
-- 
                    **  Chadi Mohamad BARAKAT  **
           http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                  /\
PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 25 14:24:39 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA17096
	for <tcpimpl-archive@lists.ietf.org>; Thu, 25 Feb 1999 14:24:36 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA07488; Thu, 25 Feb 1999 12:51:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from saba.cs.washington.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA05290; Thu, 25 Feb 1999 12:48:00 -0500 (EST)
Received: from localhost (cardwell@localhost) by saba.cs.washington.edu (8.8.8+CS/7.2ws+) with SMTP id JAA04088; Thu, 25 Feb 1999 09:47:52 -0800
Date: Thu, 25 Feb 1999 09:47:52 -0800 (PST)
From: Neal Cardwell <cardwell@cs.washington.edu>
To: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: TCP Slow-Start
In-Reply-To: <36D545A3.64812927@sophia.inria.fr>
Message-ID: <Pine.LNX.4.02A.9902250933060.239-100000@saba.cs.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> A timer of value smaller than timeout (say 50ms or 100ms) is set when
> the first packet is sent. This timer is canceled when an ACK is
> received. If the timer expires, this means that an ACK has not been
> received. The source supposes here that a long delay path has been
> crossed by the connection and therefore it transmits a new packet and
> sets the timer again.

Resetting this new timer sounds risky. It implicitly makes an assumption
about the absolute magnitude of the bandwidth of Internet paths, an
assumption which could be incorrect in some common cases. For example,
what happens if there is a low bandwidth link on this path (say, a 28Kbps
modem)? With MSS=1460 and using 100ms for this timer, by the end of the
first RTT (more than 500ms later) this scheme will have sent 5 packets, or
7300 bytes, when the bandwidth-delay product of the path is closer to
28Kbps*.5s = 1750 bytes, slightly more than one packet.
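
The arithmetic above can be checked with a short sketch. It assumes an
RTT of exactly 500ms (the text says "more than 500ms"), so the timer
fires at 100, 200, 300 and 400ms before the first ACK returns; all
constant names are illustrative.

```python
MSS_BYTES = 1460      # segment size used in the example
TIMER_MS = 100        # value of the proposed extra timer
RTT_MS = 500          # assumed round-trip time of the modem path
LINK_BPS = 28_000     # 28Kbps modem link

# Packets go out at t = 0, 100, 200, 300, 400 ms.
packets_sent = len(range(0, RTT_MS, TIMER_MS))
bytes_sent = packets_sent * MSS_BYTES

# Bandwidth-delay product: bits in flight over one RTT, converted to bytes.
bdp_bytes = LINK_BPS * (RTT_MS / 1000) / 8

if __name__ == "__main__":
    print(f"packets sent in first RTT: {packets_sent}")   # 5
    print(f"bytes sent in first RTT:   {bytes_sent}")     # 7300
    print(f"bandwidth-delay product:   {bdp_bytes:.0f}")  # 1750
    print(f"overshoot factor:          {bytes_sent / bdp_bytes:.1f}x")
```

Under these assumptions the scheme injects roughly four times what the
path can hold before any feedback arrives.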

> I think that spreading out the
> first packets sent during slow start may result in separate bursts
> during the subsequent RTTs. It is known that bursts sent during slow
> start overload the network buffers due to their high rate. Reducing the
> size of these bursts may avoid some losses. 

This will smooth out some of the burstiness, but there will still be a lot
left. Consider that when the sender gets ACKs, usually for two packets, it
will send out a burst of three packets. This burst of three will result in
another (in general, larger) burst when the ACKs for these packets are
received in the next RTT, and so on...

neal


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 25 14:26:31 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA17129
	for <tcpimpl-archive@lists.ietf.org>; Thu, 25 Feb 1999 14:26:30 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA03797; Thu, 25 Feb 1999 12:44:37 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA01222; Thu, 25 Feb 1999 12:39:58 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 448;
          Wed, 24 Feb 1999 14:27:33 -0800
Message-ID: <36D47CD4.28344567@ehsco.com>
Date: Wed, 24 Feb 1999 14:27:32 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: mahdavi@novell.com
CC: Matt Mathis <mathis@psc.edu>, Jeff Semke <semke@psc.edu>,
        TCP Implementations <tcp-impl@lerc.nasa.gov>
Subject: Re: dynamic rwin adjustments
References: <36D315AC.B9FCFCE0@ehsco.com> <yu8xu2wbdak3.fsf@sabre.sjf.novell.com> <36D42C50.F9CFB14A@ehsco.com> <yu8x7lt7phtm.fsf@sabre.sjf.novell.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> The question is (not that I'm saying you're wrong, mind you) why do
> you think it is overkill?

Because it is overkill; whether or not it hurts anything, I don't know.
I wouldn't be surprised if at least some systems set aside resources
according to the window. Doesn't WinSock do this?

There are absolutely problems when it is underestimated.

In either case, it's just wrong to use a "guessed" value: no matter how
you guess, you will be wrong a good bit of the time. Optimize for local
LAN usage and you're probably going to get hurt when you go out on the
WAN or when you connect to a server that's on the other side of the
campus. Conversely, optimizing for either of those would probably be
inadequate for the other scenarios. Therefore, it seems like doing
calculations on a per-link basis would be a good idea.

Also, before Vernon gets worked up, let me say right off that I'm not
proposing anything. I'm just wondering how much gain there would be
from doing it. It would seem that there would be a substantial gain in
overall throughput (measured over weeks anyway). Would the lost setup
time take away from that?

Just some random thoughts.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 25 14:28:15 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA17193
	for <tcpimpl-archive@lists.ietf.org>; Thu, 25 Feb 1999 14:28:14 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id MAA02549; Thu, 25 Feb 1999 12:41:34 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Arachnid.NTRG.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id MAA01221; Thu, 25 Feb 1999 12:39:55 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 499;
          Wed, 24 Feb 1999 08:44:01 -0800
Message-ID: <36D42C50.F9CFB14A@ehsco.com>
Date: Wed, 24 Feb 1999 08:44:00 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: mahdavi@novell.com
CC: Matt Mathis <mathis@psc.edu>, Jeff Semke <semke@psc.edu>,
        TCP Implementations <tcp-impl@lerc.nasa.gov>
Subject: Re: dynamic rwin adjustments
References: <36D315AC.B9FCFCE0@ehsco.com> <yu8xu2wbdak3.fsf@sabre.sjf.novell.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> See our autotuning paper at:
> http://www.psc.edu/networking/papers/auto_abstract.html

I've read it.

> We argue in the paper (without real proof) that just maxing
> out the receiver window is just as effective as tuning would be, and
> much simpler.

Well that's the problem I think. Defining MAX as 64k is definitely
overkill for a 10 Mb/s link with 2ms response time, but it's underkill (?)
when the link is a multi-hop 100 Mb/s connection with something like
10ms latency from multiple bridges/routers between the client and server
systems (not uncommon on large complex LANs).
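
For a rough sense of scale, the bandwidth-delay products of the two
cases can be worked out directly. This is only a sketch: it treats the
quoted response time and latency as round-trip times (an assumption),
ignores protocol headers, and uses the illustrative names `bdp_bytes`
and `MAX_UNSCALED_WINDOW`.

```python
# Largest window a 16-bit TCP window field can advertise without scaling.
MAX_UNSCALED_WINDOW = 65535


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed and round-trip time."""
    # bits/s * ms / 1000 gives bits in flight; divide by 8 for bytes.
    return bandwidth_bps * rtt_ms // 8000


if __name__ == "__main__":
    cases = [
        ("10 Mb/s, 2ms", 10_000_000, 2),
        ("100 Mb/s, 10ms", 100_000_000, 10),
    ]
    for label, bps, rtt in cases:
        need = bdp_bytes(bps, rtt)
        verdict = "fits within" if need <= MAX_UNSCALED_WINDOW else "exceeds"
        print(f"{label}: BDP = {need} bytes, {verdict} a 64k window")
```

Under these assumptions the first path needs about 2500 bytes of window
and the second about 125000 bytes, nearly twice what a 64k window allows.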

Making any sort of reasonable guess at all depends on knowing the
bandwidth, latency and MTU for that link. All of this info is either
already known, or is at least detectable, so maybe it should be used.

I'm just wondering if any vendors are building this into their stacks.
Or is it not worthwhile? The defaults are of course wrong at least some
of the time, but is figuring out the correct window just too much work
for too little gain?

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 03:59:49 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id DAA16720
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 03:59:48 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id CAA17587; Fri, 26 Feb 1999 02:11:35 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id CAA17118; Fri, 26 Feb 1999 02:10:09 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 26 Feb 1999 07:34:34 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10GHLf-001xhtC; Thu, 25 Feb 1999 23:06:11 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id IAA00342; Wed, 24 Feb 1999 08:06:16 -0800 (PST)
Message-Id: <199902241606.IAA00342@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Thu, 11 Feb 1999 12:55:54 PST."
             <Roam.SIMCSD.2.0.4.918766554.29208.kcpoon@jurassic> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 24 Feb 1999 08:06:16 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

K. Poon,

> I think most implementations support the socket option TCP_MAXSEG.  If
> a well written application is going to buffer data, I think it is
> better to buffer data in multiples of SMSS (sending MSS) bytes size.
> Then the problem mentioned above should not be there, assuming PMTU
> does not change... Should this, application buffering in chunks of
> SMSS bytes buffers, be mentioned in the draft as a best practise?

Three things:

1.  I don't know how to guarantee an application a way of tracking Eff.snd.MSS 
(rfc1122 terminology).  They can get a snapshot and, most times, that is 
sufficient but, as you mention, this number can change.

2.  Even so, i've added a parenthetical remark to the draft: "(ideally a 
multiple of Eff.snd.MSS)".

3.  Even if the application *does* buffer response (say) data in multiples of 
Eff.snd.MSS, there will quite possibly be a trailing portion of data that is 
less than Eff.snd.MSS.  The modified algorithm is designed to ensure timely 
transmission of that trailing portion.
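
A small sketch of point 3, assuming the application hands TCP one
buffer at a time; the function name `split_into_segments` and the
example sizes are hypothetical.

```python
def split_into_segments(nbytes, eff_snd_mss):
    """Return (full_segments, trailing_bytes) for one application write."""
    if nbytes < 0 or eff_snd_mss <= 0:
        raise ValueError("nbytes must be >= 0 and eff_snd_mss must be > 0")
    return divmod(nbytes, eff_snd_mss)


if __name__ == "__main__":
    # A 4000-byte response with Eff.snd.MSS = 1460 leaves a short tail.
    print(split_into_segments(4000, 1460))  # (2, 1080)
    # Only an exact multiple avoids the trailing small packet.
    print(split_into_segments(2920, 1460))  # (2, 0)
```

Unless the response length happens to be an exact multiple of
Eff.snd.MSS, the last packet is small and is the one Nagle may hold back.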

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 16:21:40 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA27392
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 16:21:39 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA29450; Fri, 26 Feb 1999 14:54:37 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA26336; Fri, 26 Feb 1999 14:49:07 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 26 Feb 1999 20:13:34 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10GTFd-001xhoC; Fri, 26 Feb 1999 11:48:45 -0800 (PST)
Received: from ip5.san-francisco41.ca.pub-ip.psi.net ([38.28.91.5]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 26 Feb 1999 20:13:33 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA00997; Fri, 26 Feb 1999 11:49:43 -0800 (PST)
Message-Id: <199902261949.LAA00997@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@lerc.nasa.gov
Subject: revised internet draft on suggested mod to the Nagle algorithm
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 26 Feb 1999 11:49:42 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hi, all.

I am sending (in a separate message) a revision of the internet draft.  I 
haven't submitted it to the I-D directory yet, wanting to pass it by everyone 
first.

There are various minor changes, but there are also some more major changes:

1.  "algorithmic" descriptions of both the current and modified Nagle 
algorithms.

2.  a discussion of implementing Nagle in send() (both current and proposed).

3.  an appendix giving some application code providing a model for how 
applications might ensure not having their data being subject to Nagle.

I think most people wanted "2" above, though i'm still a bit queasy.  I worry 
it will generate so much discussion that we won't get to any agreement.

Number "3" is *motivated* by Eric Hall's desire to have a warning included for 
application developers thinking of disabling Nagle (which is included 
separately in the document).  I'd be glad to delete the appendix (though TCP 
implementors might want to consider it as a test case outline).

Anyway, let me know what you think (not that anyone on this list is shy about 
that!).

Greg Minshall


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 16:21:48 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA27403
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 16:21:47 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA01190; Fri, 26 Feb 1999 14:58:05 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA27274; Fri, 26 Feb 1999 14:50:14 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 26 Feb 1999 20:14:40 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10GTGg-001xhoC; Fri, 26 Feb 1999 11:49:50 -0800 (PST)
Received: from ip5.san-francisco41.ca.pub-ip.psi.net ([38.28.91.5]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 26 Feb 1999 20:14:37 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id LAA01011; Fri, 26 Feb 1999 11:50:46 -0800 (PST)
Message-Id: <199902261950.LAA01011@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@lerc.nasa.gov
Subject: revised internet draft on suggested mod to the Nagle algorithm
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 26 Feb 1999 11:50:46 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Internet Engineering Task Force                            Greg Minshall
INTERNET-DRAFT                                             Siara Systems
draft-minshall-nagle-00.5                              February 26, 1999


	     A Proposed Modification to Nagle's Algorithm


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as ``work in
   progress.''

   To view the entire list of current Internet-Drafts, please check
   the ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
   (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
   (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US
   West Coast).

   This draft proposes a modification to Nagle's algorithm (as
   specified in RFC896) to allow TCP, under certain conditions, to
   send a small sized packet immediately after one or more maximum
   segment sized packets.


Abstract

   The Nagle algorithm is one of the primary mechanisms which protects
   the internet from poorly designed and/or poorly implemented
   applications.  However, for a certain class of applications
   (notably, request-response protocols) the Nagle algorithm interacts
   poorly with delayed acknowledgements to give these applications
   poorer performance.

   This draft is NOT suggesting that these applications should disable
   the Nagle algorithm.

   This draft suggests a fairly small and simple modification to the
   Nagle algorithm which preserves the Nagle algorithm as a means of
   protecting the internet while at the same time giving better
   performance to a wider class of applications.


Introduction to the Nagle algorithm

   The Nagle algorithm [RFC896] protects the internet from
   applications (most notably Telnet [RFC854], at the time the
   algorithm was developed) which tend to dribble small amounts of
   data to TCP.  Without the Nagle algorithm, TCP would transmit a
   packet, with a small amount of data, in response to each of the
   application's writes to TCP.  With the Nagle algorithm, a first
   small packet will be transmitted, then subsequent writes from the
   application will be buffered at the sending TCP until either i)
   enough application data has accumulated to enable TCP to transmit a
   maximum sized packet, or ii) the initial small packet is
   acknowledged by the receiving TCP.  This limits the number of small
   packets to one per round trip time.

   While the current Nagle algorithm does a very good job of
   protecting the internet from such applications, there are other
   applications, such as request-response protocols (with HTTP
   [RFC2068]  being a topical example) in which the current Nagle
   algorithm interacts with TCP's ``delayed ACK'' policy [RFC1122]
   to produce non-optimal results.


Delayed ACKs

   A receiving TCP tries to avoid acknowledging every received data
   packet in the hope of ``piggy-backing'' the acknowledgement on a
   data packet flowing in the reverse direction or combining the
   acknowledgement with a window update flowing in the reverse
   direction.  This process, known as ``delayed ACKing'' [RFC1122],
   typically causes an ACK to be generated for every other received
   (full-sized) data packet.  In the case of an ``isolated'' TCP
   packet (i.e., where a second TCP packet is not going to arrive
   anytime soon), the delayed ACK policy causes an acknowledgement for
   the data in the isolated packet to be delayed up to 200
   milliseconds after the receipt of the isolated packet (the actual
   maximum time the acknowledgement can be delayed is 500ms [RFC1122],
   but most systems implement a maximum of 200ms, and we shall assume
   that number in this document).  The way delayed ACKs are
   implemented in some systems causes the delayed ACK to be generated
   anytime between 0ms and 200ms; in this case, the average amount of
   time before the delayed ACK is generated is 100ms.


The interaction of delayed ACKs and Nagle

   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet.  As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.  On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

   When the sending TCP receives the delayed ACK, it can then transmit
   its second packet.

   In a request-response protocol, this second packet will complete
   either a request or a response, which then enables a succeeding
   response or request.

   Note two (related) bad results of the interaction of delayed ACKs
   and the Nagle algorithm in this case: the request-response time may
   be increased by up to 400ms (if both the request and the response
   are delayed); and, consequently, the number of transactions per
   second is substantially reduced.
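
   To give a feel for the second point, the following sketch estimates
   the transaction rate of a strictly serial client (one outstanding
   request at a time).  The 10ms base transaction time is a hypothetical
   figure chosen for illustration, and `transactions_per_second` is not
   taken from any implementation.

```python
def transactions_per_second(base_ms, stall_ms):
    """Rate of a serial request-response client.

    base_ms:  time for one transaction without any Nagle/delayed-ACK stall
    stall_ms: extra delay added per transaction by the interaction
    """
    if base_ms <= 0 or stall_ms < 0:
        raise ValueError("base_ms must be > 0 and stall_ms must be >= 0")
    return 1000.0 / (base_ms + stall_ms)


if __name__ == "__main__":
    base = 10  # hypothetical transaction time on a fast LAN, in ms
    for label, stall in [("no stall", 0),
                         ("one average stall (100ms)", 100),
                         ("two maximum stalls (400ms)", 400)]:
        rate = transactions_per_second(base, stall)
        print(f"{label}: {rate:.1f} transactions/second")
```

   Under these assumptions the rate falls from 100 transactions per
   second to about 9 with a single average stall, and to under 3 in the
   worst case.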


A proposed modification to the Nagle algorithm

   In the following discussion we make use of the following variables
   defined in the TCP RFC [RFC793] and in the host requirements RFC
   [RFC1122]: ``snd.nxt'' is a TCP variable which names the next byte
   of data to be transmitted; ``snd.una'' is a TCP variable which
   names the next byte of data to be acknowledged (if snd.nxt equals
   snd.una, then all previous packets have been acknowledged);
   Eff.snd.MSS is the largest TCP payload (user data) that can be
   transmitted in one packet.

   The current Nagle algorithm does not require any other state to be
   kept by TCP on a system.

   The proposed modification to the Nagle algorithm does,
   unfortunately, require one new state variable to be kept by TCP:
   ``snd.sml'' is a TCP variable which names the last byte of data in
   the most recently transmitted small packet.

   The current Nagle algorithm can be described as follows:

	"If a TCP has less than a full-sized packet to transmit,
	and if any previous packet has not yet been acknowledged,
	do not transmit a packet."

   and in pseudo-code:

	if ((packet.size < Eff.snd.MSS) && (snd.nxt > snd.una)) {
		do not send the packet;
	}

   The proposed Nagle algorithm modifies this as follows:

	"If a TCP has less than a full-sized packet to transmit,
	and if any previously transmitted less than full-sized
	packet has not yet been acknowledged, do not transmit
	a packet."

   and in pseudo-code:

	if ((packet.size < Eff.snd.MSS) && (snd.sml > snd.una)) {
		do not send the packet;
	} else {
		if (packet.size < Eff.snd.MSS) {
			snd.sml = snd.nxt+packet.size;
		}
		send the packet;
	}

   In other words, when running Nagle, only look at the recent
   transmission (and acknowledgement) of small packets (rather than
   all packets, as in the current Nagle).

   (In writing the above, I am aware that TCP acknowledges bytes, not
   packets.  However, expressing the algorithm in terms of packets
   seems to make the explanation a bit clearer.)
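
   As a minimal sketch of the two tests in C (not taken from any
   particular TCP), the following assumes a hypothetical structure
   ``nagle_state'' holding the variables named above, and records
   snd.sml only when the packet being sent is small, following the
   definition of snd.sml given earlier.  It also assumes that snd.sml
   starts out equal to snd.una.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; field names follow the draft. */
struct nagle_state {
	uint32_t snd_una;	/* next byte to be acknowledged */
	uint32_t snd_nxt;	/* next byte to be transmitted */
	uint32_t snd_sml;	/* end of most recent small packet */
	uint32_t eff_snd_mss;	/* largest payload in one packet */
};

/* "a > b" in sequence space, tolerant of 32-bit wraparound. */
static bool
seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Current Nagle: hold a small packet while any data is unacked. */
bool
nagle_current_may_send(const struct nagle_state *tp, uint32_t size)
{
	return !(size < tp->eff_snd_mss &&
	    seq_gt(tp->snd_nxt, tp->snd_una));
}

/*
 * Modified Nagle: hold a small packet only while an earlier small
 * packet is unacked.  Returns false if the packet must wait;
 * otherwise updates the state as if the packet had been sent.
 */
bool
nagle_modified_send(struct nagle_state *tp, uint32_t size)
{
	bool small = size < tp->eff_snd_mss;

	if (small && seq_gt(tp->snd_sml, tp->snd_una))
		return false;		/* do not send the packet */
	if (small)
		tp->snd_sml = tp->snd_nxt + size;
	tp->snd_nxt += size;		/* stands in for "send" */
	return true;
}
```

   For a 1600 byte write with Eff.snd.MSS of 1460, the current test
   holds the trailing 140 bytes until the first packet is
   acknowledged, while the modified test lets them go immediately and
   then holds any further small packet until byte 1600 is
   acknowledged.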


Implementing Nagle at Send

   The above description of the current Nagle algorithm and of the
   proposed modification assumes that the Nagle algorithm is being
   implemented just as TCP is about to hand a packet to IP to be
   transmitted, i.e., the algorithm is looking at the sizes of the
   packets it transmits.

   In reality, many TCPs essentially implement Nagle at the interface
   where applications present data to TCP to be transmitted (i.e., in
   the call to ``SEND'', as defined in section 3.8 of the TCP
   specification [RFC793]).  The motivation for this is to not
   penalize applications that provide data to TCP in large chunks
   (ideally a multiple of Eff.snd.MSS).

   This allows a single application send to be broken into zero or
   more full-sized packets, possibly followed by one small packet,
   without forcing any delay on the trailing small packet.  For
   example, one implementation with which the author is familiar first
   captures the boolean ``snd.nxt > snd.una'' in a temporary variable
   (``busy''):

	busy = (snd.nxt > snd.una);

   then goes into a loop transmitting packets out of the data which
   has been presented to TCP by the application; the loop contains the
   following code to implement the current Nagle algorithm:

	if ((packet.size < Eff.snd.MSS) && busy) {
		do not send the packet;
	}

   Since ``busy'' is a constant in the loop transmitting packets, a
   trailing small packet will be transmitted (after zero or more large
   packets transmitted by the same call to send) if the connection had
   no outstanding data at the time the application presented data to
   TCP for transmission (assuming the TCP window allows this).

   To implement the modified Nagle algorithm in such a system, we
   replace snd.sml with two variables: ``snd.sml.add'' is a TCP
   variable which names the last byte presented to TCP by the
   application with a ``small'' send (i.e., the application called
   SEND with fewer than Eff.snd.MSS bytes of data); and
   ``snd.sml.snt'' is a TCP variable which names the highest value of
   snd.sml.add which has, in fact, been transmitted.  The send routine
   contains the following code:

	if (byte.count < Eff.snd.MSS) {
		snd.sml.add = snd.una + snd.bytes.queued;
	}

   (where ``snd.bytes.queued'' is the number of bytes queued for
   transmission, and has already been updated with ``byte.count'', the
   number of bytes being presented to TCP in this call to SEND).

   The loop that transmits packets contains the following code:

	if (packet.size < Eff.snd.MSS) {
		if (snd.sml.snt > snd.una) {
			do not send the packet;
		} else {
			if ((snd.nxt + packet.size) <= snd.sml.add) {
				snd.sml.snt = snd.sml.add;
			}
			send the packet;
		}
	}

   (In most implementations, the most deeply nested ``if'' statement
   above is unnecessary, as a small-sized packet will contain all the
   data available to be transmitted, and so will include, or be
   beyond, snd.sml.add.  In this case, the modified Nagle algorithm
   adds one test, one addition, and one assignment in the send
   routine, and one assignment in the output routine.)
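
   The two fragments above might fit together as in the following C
   sketch.  The structure ``tcp_send_state'' and both function names
   are hypothetical, and ACK processing (which would advance snd_una
   and reduce snd_bytes_queued) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical send-side state; field names follow the draft. */
struct tcp_send_state {
	uint32_t snd_una;
	uint32_t snd_nxt;
	uint32_t snd_sml_add;	   /* end of last small application send */
	uint32_t snd_sml_snt;	   /* highest snd_sml_add transmitted */
	uint32_t snd_bytes_queued; /* bytes queued from snd_una onward */
	uint32_t eff_snd_mss;
};

/* "a > b" in sequence space, tolerant of 32-bit wraparound. */
static bool
seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Send routine: queue byte_count bytes and note a small send. */
void
nagle_note_send(struct tcp_send_state *tp, uint32_t byte_count)
{
	tp->snd_bytes_queued += byte_count;
	if (byte_count < tp->eff_snd_mss)
		tp->snd_sml_add = tp->snd_una + tp->snd_bytes_queued;
}

/*
 * Output loop: returns true if a packet of the given size may be
 * sent now.  The caller advances snd_nxt after actually sending.
 */
bool
nagle_output_may_send(struct tcp_send_state *tp, uint32_t size)
{
	if (size < tp->eff_snd_mss) {
		if (seq_gt(tp->snd_sml_snt, tp->snd_una))
			return false;
		/* (snd.nxt + packet.size) <= snd.sml.add */
		if (!seq_gt(tp->snd_nxt + size, tp->snd_sml_add))
			tp->snd_sml_snt = tp->snd_sml_add;
	}
	return true;
}
```

   In this sketch the trailing small packet of one large application
   send is never recorded in snd_sml_snt, so it does not delay a
   later small send; only packets that carry a small application send
   do.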


A Failure Mode

   If an application sends a large amount of data, followed by a small
   amount of data, followed by a large amount of data, the current
   Nagle algorithm would perform better than the proposed
   modification.  The current Nagle algorithm would send at most one
   small packet (possibly the last packet), delaying the middle
   (small) amount of data which would allow the application to send
   the following large amount of data; the modified Nagle algorithm
   would send as many as two small packets (the middle packet, plus
   possibly a last packet).


A separate, but desirable, system facility

   In addition to the Nagle algorithm (or the modification proposed by
   this draft), it would be desirable for a system providing TCP
   service to applications to allow the application to set TCP into a
   mode in which the TCP would only transmit small packets at the
   explicit direction of the application.  For example, a system based
   on BSD might implement a socket option (using setsockopt(2))
   SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly
   overloading the semantics of an existing flag, such as MSG_EOF).

   In this scenario, an application would set a socket into
   SO_EXPLICITPUSH mode, then enter a mode of writing data to the
   socket and, at the last write, using send(2) with the MSG_EOF flag.
   The underlying TCP would recognize the MSG_EOF flag as an indicator
   to transmit the (possibly) small packet.

   Like the proposed modification to the Nagle algorithm, this is
   fairly simple to implement.

   If a system were to implement this interface, it would be important
   to NOT disable Nagle when using this interface.  In other words,
   when using this interface, the default mode for TCP would be to NOT
   transmit a small packet (even in the presence of MSG_EOF) if a
   previously transmitted small packet was as yet unacknowledged.

   Note, also, that implementing this interface does not eliminate the
   desirability of using the modified Nagle algorithm as the default
   for applications.  More sophisticated networking applications might
   well use the new interface, but naive applications will often be
   adequately served by the modified Nagle algorithm.
   

Application scenarios that will not be helped by this modification

   The proposed modification helps applications which do not need to
   transmit more than one small packet in a single round-trip time.
   This characterizes one-way file transfer applications (such as FTP
   [RFC959]) and request/response protocols (such as NNTP [RFC977] and
   HTTP [RFC2068] without pipelining).

   However, applications that need to transmit more than one small
   packet in a single round-trip time are not served by this
   modification.  An example of such an application is HTTP [RFC2068]
   using ``pipelining'', in which multiple requests (responses) are
   transmitted asynchronously.

   Applications needing to transmit more than one small packet in a
   single round-trip time will need other mechanisms to satisfy their
   requirements.  (One possible such mechanism would be to use more
   than one TCP connection.)

   If an application developer is considering disabling the Nagle
   algorithm, they should be very careful to ensure that their
   application will generally provide data to TCP in chunks larger
   than two full-sized segments (> 2*Eff.snd.MSS), and they should
   verify after their development that this is, in fact, true.  With
   Nagle disabled, many writes of small blocks of data can add
   significant load to the network, reducing the network's performance.


Acknowledgements

   Jim Gettys, Henrik Frystyk Nielsen, Jeff Mogul, and Yasushi Saito,
   as well as a message forwarded to the end2end-interest list by Sean
   Doran, have motivated my current interest in the Nagle algorithm.
   John Heidemann's work related to the Nagle algorithm has informed
   some of the thinking in this draft; discussions with John have also
   been helpful.  Members of the End-to-End Research Group (under
   the direction of Bob Braden) patiently listened to my discussion of
   the current state of the Nagle algorithm and to the modifications
   proposed in this document.

   Members of the TCP implementors mailing list
   <tcp-impl@lerc.nasa.gov> have been very helpful in refining this
   proposal, in particular Rick Jones, Neal Cardwell, Vernon
   Schryver, Bernie Volz, Sam Manthorpe, Art Shelest, David Borman,
   Kacheong Poon, Jon Snader, Eric Hall, Joe Touch, and Alan Cox.


Security Considerations

   The Nagle algorithm does not have major security consequences.

   Implementation of this algorithm should not negatively impact
   the performance of the internet.  Any negative impact of
   implementing this algorithm should be significantly less
   than that of disabling the Nagle algorithm.


Appendix -- Sample application code

   The following code is provided to give application developers a
   model for buffering.  We assume a BSD-style sockets API.

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	#define	SNDBUF_MULT 3     /* * 2 * TCP_MAXSEG -> SO_SNDBUF */

	/*
	 * Given a connected socket (s), configure the socket
	 * with good buffer size defaults, and return the
	 * size the application should use for issuing
	 * writes to the socket.
	 *
	 * Returns size to use for application buffering, or
	 * zero (0) on error.
	 */
	int
	getbufsize(int s)
	{
		int bufsize, parm;	/* these options take an int */
		socklen_t buflen;	/* length argument to getsockopt(2) */

		buflen = sizeof bufsize;
		if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
					&bufsize, &buflen) == -1) {
			perror("getsockopt(...TCP_MAXSEG...)");
			return 0;
		}

		/* Set socket transmit buffer */
		parm = 2*SNDBUF_MULT*bufsize;
		if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
					&parm, sizeof parm) == -1) {
			perror("setsockopt(SO_SNDBUF)");
			return 0;
		}

		/* Now, set socket low water threshold */
		parm = 2*bufsize;
		if (setsockopt(s, SOL_SOCKET, SO_SNDLOWAT,
					&parm, sizeof parm) == -1) {
			perror("setsockopt(...SO_SNDLOWAT...)");
			return 0;
		}

		return 2*bufsize;
	}

	int
	main(int argc, char *argv[])
	{
		char *buffer = 0;
		int buflen;
		int sock;

		/*
		 * ... allocate a socket (sock) and get it connected
		 * via either connect(2) or listen(2)/accept(2).
		 */

		buflen = getbufsize(sock);
		if (buflen == 0) {
			fprintf(stderr, "aborting\n");
			exit(1);
		}

		buffer = malloc(buflen);
		if (buffer == 0) {
			fprintf(stderr,
				"no room for buffer of size %d\n",
							    buflen);
			exit(1);
		}

		/*
		 * ... loop generating ``buflen'' data in buffer
		 * and using send(2) to hand it to TCP.
		 * When there is no more data to send, call
		 * send(2) one last time with <= ``buflen''
		 * bytes.
		 */

		return 0;
	}

		
References

[RFC793]	Postel, J. (ed), "Transmission Control Protocol",
			Sep-1981.
[RFC854]	Postel, J., J. Reynolds, "Telnet Protocol
			Specification", May-1983.
[RFC959]	Postel, J., J. Reynolds, "File Transfer Protocol
			(FTP)", Oct-1985.
[RFC977]	Kantor, B., P. Lapsley, "Network News Transfer
			Protocol", Feb-1986.
[RFC896]        Nagle, J., "Congestion control in IP/TCP internetworks",
			Jan-06-1984.
[RFC1122]       Braden, R. T., "Requirements for Internet hosts -
			communication layers", Oct-01-1989.
[RFC2068]	Fielding, R., J. Gettys, J. Mogul, H. Frystyk,
			T. Berners-Lee, "Hypertext Transfer Protocol
			-- HTTP/1.1", Jan-1997.


Author's Address

   Greg Minshall
   Siara Systems
   300 Ferguson Drive, 2nd floor
   Mountain View, CA  94043
   USA

   <minshall@siara.com>


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 18:24:05 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA04272
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 18:24:04 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id QAA13947; Fri, 26 Feb 1999 16:59:37 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id QAA11883; Fri, 26 Feb 1999 16:55:20 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id OAA04335
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 26 Feb 1999 14:55:18 -0700 (MST)
Date: Fri, 26 Feb 1999 14:55:18 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902262155.OAA04335@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

I have the distinct impression that the consensus of this mailing list
is that:

  - as much as possible, turning off the Nagle algorithm should be avoided.
    However, no one has reported any harm done to any network by
    applications that have set TCP_NODELAY.

  - an explicit flush mechanism might be useful, but it would be
    functionally identical to how toggling TCP_NODELAY should work.

  - modifying the Nagle algorithm as proposed would not be profitable.
     + The proposal requires additional CPU cycles, code, and state in
        every TCP implementation, albeit not a lot.
     + There are no existing applications or programs that would benefit
        from the change.  All existing programs that would benefit already
        set TCP_NODELAY, have not wrecked the Internet or even caused any
        reported harm, and in most cases would still be forced to set
        TCP_NODELAY and so could not benefit from the proposed change
        or avoid doing hypothetical harm to networks.


> ...
>    applications.  However, for a certain class of applications
>    (notably, request-response protocols) the Nagle algorithm interacts
>    poorly with delayed acknowledgements to give these applications
>    poorer performance.

That's not quite right, is it?  Many request-response applications do
not suffer any bad Nagle effects.  The canonical Nagle application, telnet,
is a request-response protocol.  Its requests are keystrokes and the
responses are echos.

Doesn't the class of applications that suffer from the Nagle algorithm
consist of:
  1. poorly written applications that do send()-send()-receive()
      instead of writev()-receive().
  2. applications that necessarily do send()-send() of small blocks.
  3. applications that send blocks larger than 1 MSS but not a multiple
      of the MSS on systems that do not block based on send() requests.

As far as I can see, only #3 might benefit from the proposal.  #1 and #2
would still be forced to set TCP_NODELAY.
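
For illustration, the fix for the first pattern can be sketched as
follows, assuming a POSIX system.  The function name send_request()
and the header/body split are hypothetical; writev(2) is the standard
call.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>

/*
 * Hand a header and a body to the kernel in one call, so that TCP
 * sees a single write instead of two small ones.  Returns the
 * result of writev(2): the number of bytes written, or -1 on error.
 */
ssize_t
send_request(int fd, const char *hdr, const char *body)
{
	struct iovec iov[2];

	iov[0].iov_base = (void *)hdr;
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = (void *)body;
	iov[1].iov_len = strlen(body);
	return writev(fd, iov, 2);
}
```

With two separate send() calls, the second small block can sit behind
the Nagle test until the first is acknowledged; with one writev() the
stack has both pieces available when it decides what to transmit.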


>    This draft is NOT suggesting that these applications should disable
>    the Nagle algorithm.

Why not?  Exactly what is the harm done to the network or anything else
when type #3 applications disable the Nagle algorithm?  The only harm I've
heard mentioned can be best summarized as "application writers are idiots
and will write junk code that will trash the network if we give them the
slightest opportunity.  Therefore, we must make the API so complicated and
obscure so that only the smartest programmers will figure out how to do
what they need to do."  I like the Nagle algorithm, but I am offended by
some of the implicit attitudes offered in its defense.  Some programmers
are idiots, but some others are not.


>    This draft suggests a fairly small and simple modification to the
>    Nagle algorithm which preserves the Nagle algorithm as a means of
>    protecting the internet while at the same time giving better
>    performance to a wider class of applications.

> ...
>    The current Nagle algorithm can be described as follows:

> ...
> 	if ((packet.size < Eff.snd.MSS) && (snd.nxt > snd.una)) {
> 		do not send the packet;
> 	}

My copy of 4.4BSD-Lite uses

 	if ((packet.size < Eff.snd.MSS) && (snd.nxt != snd.una)) {

That avoids hassles with computing (snd.nxt > snd.una).

> ...
> 	if ((packet.size < Eff.snd.MSS) && (snd.sml > snd.una)) {
> 		do not send the packet;
> 	} else {
> 		snd.sml = snd.nxt+packet.size;
> 		send the packet;
> 	}

Do you have a favorite way of dealing with the problems behind computing 
(snd.sml > snd.una) ?
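
One possible answer, offered as an assumption rather than as the
draft's method, is the signed-difference comparison that BSD-derived
stacks use for their SEQ_GT() macro in tcp_seq.h.  The function names
below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * "a > b" in 32-bit sequence space.  The unsigned subtraction wraps
 * modulo 2^32, and reading the result as signed gives the right
 * answer as long as a and b are less than 2^31 apart.
 */
static bool
seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* The (snd.sml > snd.una) test in wraparound-tolerant form. */
bool
small_packet_outstanding(uint32_t snd_sml, uint32_t snd_una)
{
	return seq_gt(snd_sml, snd_una);
}
```

Unlike snd.nxt, snd.sml is not dragged along by normal traffic, so a
stale snd.sml would appear to be ahead of snd.una again once more
than 2^31 bytes had been acknowledged past it.  An implementation
using this comparison would presumably also need to pull snd.sml
forward whenever snd.una passes it.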


> ...
>    To implement the modified Nagle algorithm in such a system, we
>    replace snd.sml with two variables: ``snd.sml.add'' is a TCP
> ...

Why modify the Nagle algorithm in such a system?  Exactly how do any of
the "certain class of applications" benefit?


> ...
>    explicit direction of the application.  For example, a system based
>    on BSD might implement a socket option (using setsockopt(2))
>    SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly
>    overloading the semantics of an existing flag, such as MSG_EOF).

Why would any system put the SO_EXPLICITPUSH into setsockopt()?
Or exactly what would be the difference in such a system between
setting SO_EXPLICITPUSH and setting TCP_NODELAY?

And as we all seem to agree, note that turning TCP_NODELAY on and
off around a send() request is or should be functionally identical
to putting SO_EXPLICITPUSH in the send request.
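
As a rough sketch of the toggling approach, assuming a BSD-style
sockets API: the helper name send_and_push() is hypothetical, and
whether changing TCP_NODELAY also pushes out data that is already
queued differs between stacks, so this is an illustration rather
than a portable flush.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Send the final piece of a message with the Nagle algorithm
 * briefly disabled, then re-enable it.  Returns the result of
 * send(2), or -1 if either setsockopt(2) call fails.
 */
ssize_t
send_and_push(int s, const void *buf, size_t len)
{
	int on = 1, off = 0;
	ssize_t n;

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
				&on, sizeof on) == -1)
		return -1;
	n = send(s, buf, len, 0);
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
				&off, sizeof off) == -1)
		return -1;
	return n;
}
```

An application would use ordinary send() calls for everything except
the last write of a request or response, and call send_and_push() for
that one.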


>    In this scenario, an application would set a socket into
>    SO_EXPLICITPUSH mode, then enter a mode of writing data to the
>    socket and, at the last write, using send(2) with the MSG_EOF flag.
>    The underlying TCP would recognize the MSG_EOF flag as an indicator
>    to transmit the (possibly) small packet.

So SO_EXPLICITPUSH as a setsockopt() is not an action, but an
enabler for MSG_EOF?  Why is an enabler needed?  Why not just honor
MSG_EOF (assuming you don't instead turn TCP_NODELAY on and off)?


> ...
>    If a system were to implement this interface, it would be important
>    to NOT disable Nagle when using this interface.  In other words,
>    when using this interface, the default mode for TCP would be to NOT
>    transmit a small packet (even in the presence of MSG_EOF) if a
>    previously transmitted small packet was as yet unacknowledged.

If there were a packet stuck from a previous application of the Nagle
algorithm, would you not want to push it out along with the new data that
has been explicitly marked with MSG_EOF as needing to go out?  Why delay
the new, urgent data because there is old, non-urgent data that is waiting?
Regardless of what any RFC said, such behavior would generate plenty of
bug reports.


> ...
> Application scenarios that will not be helped by this modification
>
>    The proposed modification helps applications which do not need to
>    transmit more than one small packet in a single round-trip time.
>    This characterizes one way file transfer applications (such as FTP
>    [RFC959]) and request/response protocols (such as NNTP [RFC977] and
>    HTTP [RFC2068] without pipelining).

In practice, that paragraph is false.  
  - Doesn't the new NNTP include pipelining?  And what would be the harm
      should an NNTP implementation set TCP_NODELAY?
  - When was the last time someone complained about FTP performance
      effects of Nagle?
  - Since all HTTP implementations set TCP_NODELAY, modifications to
      Nagle on their behalf are at best moot and irrelevant.

>    However, applications that need to transmit more than one small
>    packet in a single round-trip time are not served by this
>    modification.  An example of such an application is HTTP [RFC2068]
>    using ``pipelining'', in which multiple requests (responses) are
>    transmitted asynchronously.

When I last asked about concrete applications that would benefit from the
proposal, only new HTTP was nominated, but seemed a better fit with
a flush bit or TCP_NODELAY toggling.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 18:38:12 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA05171
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 18:38:11 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id RAA10330; Fri, 26 Feb 1999 17:34:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from zephyr.isi.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id RAA08415; Fri, 26 Feb 1999 17:32:26 -0500 (EST)
From: braden@ISI.EDU
Received: from gra.isi.edu (gra.isi.edu [128.9.160.133])
	by zephyr.isi.edu (8.8.7/8.8.6) with SMTP id OAA21005;
	Fri, 26 Feb 1999 14:32:20 -0800 (PST)
Date: Fri, 26 Feb 1999 14:28:03 -0800
Posted-Date: Fri, 26 Feb 1999 14:28:03 -0800
Message-Id: <199902262228.AA21500@gra.isi.edu>
Received: by gra.isi.edu (5.65c/4.0.3-6)
	id <AA21500>; Fri, 26 Feb 1999 14:28:03 -0800
To: tcp-impl@lerc.nasa.gov, vjs@calcite.rhyolite.com
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


  *>     However, no one has reported any harm done to any network by
  *>     applications that have set TCP_NODELAY.
  *> 

Vernon,

This is a niggle, but the Nagle algorithm  made a significant
contribution to avoiding Internet meltdown in the early days when there
was a lot of character-at-a-time telnet traffic and "high speed" was 56
Kbps. I suspect there may still be some local Internet regions where
its contribution is significant, but you are correct that I have no
data to prove it.

Bob Braden


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 19:32:40 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA08545
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 19:32:39 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id SAA29291; Fri, 26 Feb 1999 18:19:44 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id SAA24201; Fri, 26 Feb 1999 18:14:50 -0500 (EST)
Received: from Eng.Sun.COM (engmail3 [129.144.170.5]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id PAA14023 for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 15:14:49 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (8.8.8+Sun/SMI-5.3) with ESMTP id PAA16366
	for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 15:14:46 -0800 (PST)
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id PAA03082
	for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 15:14:47 -0800 (PST)
Date: Fri, 26 Feb 1999 15:14:47 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902241606.IAA00342@red.mtv.siara.com>
Message-ID: <Roam.SIMCSD.2.0.4.920070887.22113.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> 1.  I don't know how to guarantee an application a way of tracking
> Eff.snd.MSS  (rfc1122 terminology).  They can get a snapshot and, most
> times, that is  sufficient but, as you mention, this number can change.

This may not be a problem.  Suppose the application gets SMSS right after the
connection is established.  This SMSS is the maximum SMSS this connection can
ever use because PMTUd will only make SMSS smaller.  If the app buffers, say,
2 times the max SMSS of data and the TCP stack applies Nagle on a per-send
basis, no delay of sending will happen.

Actually, as long as an app buffers data in chunks greater than 1 SMSS and the
TCP stack applies Nagle on a per send basis, no delay of sending will happen. 
So maybe we should answer this question first: was the original idea of the
Nagle algorithm to work on a per-send basis, not the per-segment basis
implemented in the original BSD code?  If so, maybe we should change the draft
to clarify this.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 20:27:58 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA11871
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 20:27:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA23177; Fri, 26 Feb 1999 19:09:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from cs.rice.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA17900; Fri, 26 Feb 1999 19:04:23 -0500 (EST)
Received: (from aron@localhost)
	by cs.rice.edu (8.9.0/8.9.0) id SAA25202;
	Fri, 26 Feb 1999 18:04:11 -0600 (CST)
From: Mohit Aron <aron@cs.rice.edu>
Message-Id: <199902270004.SAA25202@cs.rice.edu>
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
To: vjs@calcite.rhyolite.com (Vernon Schryver)
Date: Fri, 26 Feb 1999 18:04:11 -0600 (CST)
Cc: tcp-impl@lerc.nasa.gov
In-Reply-To: <199902262155.OAA04335@calcite.rhyolite.com> from "Vernon Schryver" at Feb 26, 99 02:55:18 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> 
> Doesn't the class of applications that suffer from the Nagle algorithm
> consist of:
>   1. poorly written applications that do send()-send()-receive()
>       instead of writev()-receive().
>   2. applications that necessarily do send()-send() of small blocks.
>   3. applications that send blocks larger than 1 MSS but not a multiple
>       of the MSS on systems that do not block based on send() requests.
> 
> As far as I can see, only #3 might benefit from the proposal.  #1 and #2
> would still be forced to set TCP_NODELAY.
> 


Even #3 might not benefit under some cases. An example is associated with the
idiosyncrasies of the way application data is copied into mbufs in BSD based
systems. Data is copied in 4K or 1K chunks (whichever is larger) unless
the data is less than 1K when it is copied as a whole. Suppose the application
tries to write 1600 bytes of data (assuming 1MSS = 1500 bytes). Then the 
mbuf layer would make TCP send 1K first, and the remaining 576 bytes would
be held according to Nagle's algorithm (even under the new proposal).

Due to the above, applications still might keep using TCP_NODELAY even if
the underlying TCP implements the proposed mod to Nagle's algorithm.


- Mohit


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 20:50:24 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA13315
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 20:50:23 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id TAA06445; Fri, 26 Feb 1999 19:49:39 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id TAA05392; Fri, 26 Feb 1999 19:48:12 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id RAA07199
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Fri, 26 Feb 1999 17:48:10 -0700 (MST)
Date: Fri, 26 Feb 1999 17:48:10 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199902270048.RAA07199@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: braden@ISI.EDU

>   *>     However, no one has reported any harm done to any network by
>   *>     applications that have set TCP_NODELAY.
>   *> 

> This is a niggle, but the Nagle algorithm  made a significant
> contribution to avoiding Internet meltdown in the early days when there
> was a lot of character-at-a-time telnet traffic and "high speed" was 56
> Kbps. I suspect there may still be some local Internet regions where
> its contribution is significant, but you are correct that I have no
> data to prove it.

I agree on all counts except for the first 4 words.  I phrased what I said
intentionally.  TCP_NODELAY did not exist before the Nagle algorithm, when
the problems it fixed were (I assume) rampant.  I believe that it would
be a Bad Thing(tm) if the Nagle algorithm were off by default.  I believe
making the Nagle algorithm on by default in RFC 1122 was a Good Thing(tm)
that reduces the number of packets, CPU cycles spent on packets, and
bandwidth spent on headers, and at the trivial cost of forcing people to
stop and think about tinygrams, and then do something explicit if they
really need them.

That is different from believing that any reportable harm has been done
to any network by any program that used TCP_NODELAY (or equivalent) to
turn off the Nagle algorithm.  I believe there's plenty of junk code that
unnecessarily send()'s tinygrams and even plenty of junk that also bogusly
turns off Nagle.  However, that does not imply any harm has been done.
Junk is usually too broken in other ways to harm the net.


] From: Mohit Aron <aron@cs.rice.edu>

] >   1. poorly written applications that do send()-send()-receive()
] >       instead of writev()-receive().
] >   2. applications that necessarily do send()-send() of small blocks.
] >   3. applications that send blocks larger than 1 MSS but not a multiple
] >       of the MSS on systems that do not block based on send() requests.
] > 
] > As far as I can see, only #3 might benefit from the proposal.  #1 and #2
] > would still be forced to set TCP_NODELAY.

] Even #3 might not benefit under some cases.
 
quite true.

]                                             An example is associated with the
] idiosyncrasies of the way application data is copied into mbufs in BSD based
] systems. Data is copied in 4K or 1K chunks (whichever is larger) unless
] the data is less than 1K when it is copied as a whole. Suppose the application
] tries to write 1600 bytes of data (assuming 1MSS = 1500 bytes). Then the 
] mbuf layer would make TCP send 1K first, and the remaining 576 bytes would
] be held according to Nagle's algorithm (even under the new proposal).

(nits:  what's an mbuf layer?--"layer" seems a tad fancy for sosend().
And MSS=1500 is a lot less common than MSS=1460, the still minority
MSS=1452, or the uncommon MSS=1444.)

Wouldn't you consider a TCP implementation that divided the stream into
segments based on mbuf partitioning fundamentally broken?--I would.
I'd sneer at any TCP implementation that sends an initial 1K segment
given a send() request and an MSS both significantly larger than 1K,
no matter what its excuse (well, other than window space).
Consider the retransmissions from such an implementation.

I also don't see how tcp_output() in 4.4BSD-Lite would do such evil deeds.
Given mbuf payloads of 1K and 576 or any other combination, it seems to
me that len = min(so->so_snd.sb_cc, win) - off.

Finally, as I understand the proposal, while the classic Nagle algorithm
would delay the remainder of 100 (MSS=1500) or 140 (MSS=1460), the
proposed algorithm would not.  It would immediately send the remainder.
That is exactly the case for which it has been proposed.
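
A minimal sketch of that difference, assuming a single 1600-byte write
on an idle connection, no window limits, and no ACK arriving before the
write has been handed to TCP.  The function name segments_sent and its
parameters are made up for illustration and do not come from any TCP
implementation.

```python
def segments_sent(write_size, mss, modified):
    """Return (sizes sent immediately, bytes held back) for one write
    on an idle connection, before any ACK arrives."""
    sent = []
    unacked = 0            # bytes in flight
    small_unacked = False  # is a less-than-MSS packet in flight?
    remaining = write_size
    while remaining > 0:
        size = min(remaining, mss)
        if size < mss:
            # Classic Nagle holds a small packet while anything is unacked;
            # the proposed rule holds it only behind another small packet.
            blocked = small_unacked if modified else unacked > 0
            if blocked:
                break
            small_unacked = True
        sent.append(size)
        unacked += size
        remaining -= size
    return sent, remaining


for mss in (1500, 1460):
    print(mss, segments_sent(1600, mss, False), segments_sent(1600, mss, True))
```

Under these assumptions the classic rule holds back the 100- or 140-byte
remainder, while the proposed rule sends it right behind the full-sized
segment.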

] Due to the above, applications still might keep using TCP_NODELAY even if
] the underlying TCP implements the proposed mod to Nagle's algorithm.

That conclusion is certainly true, but I don't agree with the example.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 21:22:13 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA15297
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 21:22:12 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id UAA08817; Fri, 26 Feb 1999 20:19:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id UAA05250; Fri, 26 Feb 1999 20:16:08 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 27 Feb 1999 01:40:32 UT
Received: from gateway2.mtv.siara.com by siara.com with smtp
	id m10GYLw-001xhqC; Fri, 26 Feb 1999 17:15:36 -0800 (PST)
Received: from ip157.san-francisco41.ca.pub-ip.psi.net ([38.28.91.157]) by gateway2.mtv.siara.com
          via smtpd (for [192.168.1.48]) with SMTP; 27 Feb 1999 01:40:24 UT
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id RAA01746; Fri, 26 Feb 1999 17:16:27 -0800 (PST)
Message-Id: <199902270116.RAA01746@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Fri, 26 Feb 1999 15:14:47 PST."
             <Roam.SIMCSD.2.0.4.920070887.22113.kcpoon@jurassic> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 26 Feb 1999 17:16:26 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

K. Poon,

Yes, you are right, the MSS at the beginning is the largest a session will 
see, so that may be sufficient.  I had forgotten that it can't grow (since, as 
you point out, the PMTU may change, but the negotiated MSS will not grow).

However:

> Actually, as long as an app buffers data in chunks greater than 1 SMSS and
> the TCP stack applies Nagle on a per send basis, no delay of sending will 
> happen. 

If the *final* application write is < 1 SMSS, it may be delayed by the current 
algorithm.  The modification would allow that last application write to be 
transmitted.

Greg


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 26 22:30:19 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA20203
	for <tcpimpl-archive@lists.ietf.org>; Fri, 26 Feb 1999 22:30:18 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id VAA24635; Fri, 26 Feb 1999 21:29:41 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from mercury.Sun.COM (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id VAA23735; Fri, 26 Feb 1999 21:28:33 -0500 (EST)
Received: from Eng.Sun.COM (engmail1 [129.146.1.13]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id SAA19196 for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 18:28:31 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (8.8.8+Sun/SMI-5.3) with ESMTP id SAA09149
	for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 18:28:27 -0800 (PST)
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id SAA03127
	for <tcp-impl@lerc.nasa.gov>; Fri, 26 Feb 1999 18:28:29 -0800 (PST)
Date: Fri, 26 Feb 1999 18:28:29 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
To: tcp-impl@lerc.nasa.gov
In-Reply-To: "Your message with ID" <199902270116.RAA01746@red.mtv.siara.com>
Message-ID: <Roam.SIMCSD.2.0.4.920082509.15757.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> K. Poon,
> If the *final* application write is < 1 SMSS, it may be delayed by the
> current  algorithm.  The modification would allow that last application
> write to be  transmitted.

Hmmm, it may be a little bit more complicated than this.  Suppose the last but
one send is a little bit bigger than 1 SMSS.  With your proposed change, TCP
will send the small tail segment right away.  Then the final send you
mentioned above will also be delayed because that small tail segment of the
previous send has not been acked.  So an app must write in exactly SMSS size,
which we know may not be possible.  It seems to me that your proposed change
may be equivalent to applying Nagle on a per send basis in most situations.

Are we going around a circle here (-:  Let me ask this question again.  I
think many implementors also want to know the answer.  Is there any
objection that Nagle algorithm "should" be applied on a per send basis, as
opposed to what RFC 1122 section 4.2.3.4 says?

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 27 02:36:36 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA09018
	for <tcpimpl-archive@lists.ietf.org>; Sat, 27 Feb 1999 02:36:35 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id BAA13607; Sat, 27 Feb 1999 01:04:40 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from Twig.Rodents.Montreal.QC.CA (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id BAA10295; Sat, 27 Feb 1999 01:01:26 -0500 (EST)
Received: (from mouse@localhost)
	by Twig.Rodents.Montreal.QC.CA (8.8.8/8.8.8) id BAA18960;
	Sat, 27 Feb 1999 01:00:59 -0500 (EST)
Date: Sat, 27 Feb 1999 01:00:59 -0500 (EST)
From: der Mouse  <mouse@Rodents.Montreal.QC.CA>
Message-Id: <199902270600.BAA18960@Twig.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
To: Greg Minshall <minshall@siara.com>
Cc: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 8bit

>    The proposed modification to the Nagle algorithm does,
>    unfortunately, require one new state variable to be kept by TCP:
>    ``snd.sml'' is a TCP variable which names the last byte of data in
>    the most recently transmitted small packet.
[...]
>    The proposed Nagle algorithm modifies this as follows:
> 
> 	"If a TCP has less than a full-sized packet to transmit,
> 	and if any previously transmitted less than full-sized
> 	packet has not yet been acknowledged, do not transmit
> 	a packet."

This sounds like what I remember seeing on the list.

>    and in pseudo-code:
> 
> 	if ((packet.size < Eff.snd.MSS) && (snd.sml > snd.una)) {
> 		do not send the packet;
> 	} else {
> 		snd.sml = snd.nxt+packet.size;
> 		send the packet;
> 	}

This does not.  Shouldn't the assignment be done only
if (packet.size < Eff.snd.MSS)?
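
A small sketch of why the placement of the assignment matters, assuming
a toy sender that tracks only snd.una, snd.nxt and snd.sml.  The class
Sender, the method try_send and the flag update_sml_always are
hypothetical names for illustration; with the flag set, the code follows
the draft's pseudo-code as quoted, and without it the assignment is done
only for small packets.

```python
class Sender:
    def __init__(self, mss):
        self.mss = mss
        self.una = 0   # snd.una: oldest unacknowledged byte
        self.nxt = 0   # snd.nxt: next byte to be sent
        self.sml = 0   # snd.sml: end of the most recent small packet

    def try_send(self, size, update_sml_always=False):
        """Apply the modified Nagle test; return True if the packet goes out."""
        small = size < self.mss
        if small and self.sml > self.una:
            return False               # a small packet is still unacked
        if small or update_sml_always:
            # Assumed correction: record snd.sml only for small packets.
            # With update_sml_always=True this mimics the quoted pseudo-code.
            self.sml = self.nxt + size
        self.nxt += size
        return True


fixed = Sender(1460)
print(fixed.try_send(1460), fixed.try_send(140))    # full packet, then tail

draft = Sender(1460)
print(draft.try_send(1460, update_sml_always=True),
      draft.try_send(140, update_sml_always=True))  # tail is held back
```

With the unconditional assignment, an unacked full-sized packet blocks
the following small one, which is the classic Nagle behaviour rather
than the stated rule.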

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 13:24:40 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA25687
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 13:24:40 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id LAA15353; Mon, 1 Mar 1999 11:34:58 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ietf.org (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id LAA13882; Mon, 1 Mar 1999 11:32:17 -0500 (EST)
Received: from CNRI.Reston.VA.US (localhost [127.0.0.1])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA22597;
	Mon, 1 Mar 1999 11:31:42 -0500 (EST)
Message-Id: <199903011631.LAA22597@ietf.org>
To: IETF-Announce:;;@ns.cnri.reston.va.us
Cc: RFC Editor <rfc-editor@isi.edu>
Cc: Internet Architecture Board <iab@isi.edu>
Cc: tcp-impl@lerc.nasa.gov
From: The IESG <iesg-secretary@ietf.org>
Subject: Protocol Action: TCP Congestion Control to Proposed Standard
Date: Mon, 01 Mar 1999 11:31:42 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


The IESG has approved the Internet-Draft 'TCP Congestion Control'
<draft-ietf-tcpimpl-cong-control-05.txt> as a Proposed Standard.
This document obsoletes RFC 2001.


The IESG also approved The NewReno Modification to TCP's Fast Recovery
Algorithm <draft-ietf-tcpimpl-newreno-02.txt> as an Experimental RFC.


These documents are the product of the TCP Implementation Working
Group.  The IESG contact persons are Scott Bradner and Vern Paxson.


Technical Summary

  The TCP Congestion Control document specifies four TCP
  congestion control algorithms: slow start, congestion
  avoidance,  fast retransmit and fast recovery.   The document
  is an update of RFC 2001.  In addition to specifying
  the congestion control algorithms, it specifies what TCP
  connections should do after a relatively long idle period,
  as well as specifying and clarifying some of the issues
  pertaining to TCP ACK generation.

  The NewReno Modification document describes a specific
  algorithm for responding to partial acknowledgments, referred
  to as NewReno.  This document is published as an Experimental
  RFC to elicit implementation and testing experience.


Working Group Summary

  There was significant discussion about early versions of these
  documents but all issues were resolved in the final versions.

Protocol Quality
  These documents were reviewed for the IESG by Scott Bradner


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 14:30:47 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA27510
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 14:30:46 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA20721; Mon, 1 Mar 1999 13:15:05 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from tesla.comm.toronto.edu (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA18256; Mon, 1 Mar 1999 13:08:25 -0500 (EST)
Received: from plato.comm (plato.comm [128.100.10.11])
	by tesla.comm.toronto.edu (8.9.0/8.9.0) with ESMTP id NAA07464
	for <tcp-impl@lerc.nasa.gov>; Mon, 1 Mar 1999 13:04:47 -0500 (EST)
From: Irene Katzela <irene@comm.toronto.edu>
Received: (from irene@localhost)
	by plato.comm (8.9.0/8.9.0) id NAA23728
	for tcp-impl@lerc.nasa.gov; Mon, 1 Mar 1999 13:04:46 -0500 (EST)
Date: Mon, 1 Mar 1999 13:04:46 -0500 (EST)
Message-Id: <199903011804.NAA23728@plato.comm>
To: tcp-impl@lerc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Sorry if you receive multiple copies of this message


Call to register for Infocom 99.

Infocom 99 will be held in NY from March 21-25, 99. Please
register ASAP since the early registration deadline is March 1, 99.

For details, see
http://www.comm.utoronto.ca/~infocom


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 14:49:58 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA28193
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 14:49:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA01158; Mon, 1 Mar 1999 13:41:29 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from guns.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA01152; Mon, 1 Mar 1999 13:41:26 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id NAA02334; Mon, 1 Mar 1999 13:41:25 -0500 (EST)
Message-Id: <199903011841.NAA02334@guns.lerc.nasa.gov>
To: tcp-impl@grc.nasa.gov
From: Mark Allman <mallman@grc.nasa.gov>
Reply-To: mallman@grc.nasa.gov
Subject: another test message
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Centerfold
Date: Mon, 01 Mar 1999 13:41:25 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

 
another test...

sorry.

allman


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 14:54:59 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA28370
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 14:54:58 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA24565; Mon, 1 Mar 1999 13:25:07 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from ns1.siara.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA23565; Mon, 1 Mar 1999 13:22:42 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 1 Mar 1999 18:47:17 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10HXKd-001xhzC; Mon, 1 Mar 1999 10:22:19 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id KAA00372; Mon, 1 Mar 1999 10:23:17 -0800 (PST)
Message-Id: <199903011823.KAA00372@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Sat, 27 Feb 1999 01:00:59 EST."
             <199902270600.BAA18960@Twig.Rodents.Montreal.QC.CA> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 01 Mar 1999 10:23:17 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> This does not.  Shouldn't the assignment be done only
> if (packet.size < Eff.snd.MSS)?

Yes, thanks, this is the *2nd* time i've gotten this wrong!


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 15:04:04 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA28854
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 15:04:02 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA05771; Mon, 1 Mar 1999 13:54:04 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from guns.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA05659; Mon, 1 Mar 1999 13:54:00 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id NAA02839; Mon, 1 Mar 1999 13:54:00 -0500 (EST)
Message-Id: <199903011854.NAA02839@guns.lerc.nasa.gov>
To: tcp-impl@grc.nasa.gov
From: Mark Allman <mallman@grc.nasa.gov>
Reply-To: mallman@grc.nasa.gov
Subject: tcpimpl: list news
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Centerfold
Date: Mon, 01 Mar 1999 13:54:00 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

 
actually, just another test message...

Sorry for all this.

allman


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 15:08:05 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA29034
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 15:08:04 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA28927; Mon, 1 Mar 1999 13:36:32 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from guns.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA28920; Mon, 1 Mar 1999 13:36:29 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id NAA01930; Mon, 1 Mar 1999 13:36:28 -0500 (EST)
Message-Id: <199903011836.NAA01930@guns.lerc.nasa.gov>
To: tcp-impl@grc.nasa.gov
From: Mark Allman <mallman@lerc.nasa.gov>
Reply-To: mallman@lerc.nasa.gov
Subject: test message
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Centerfold
Date: Mon, 01 Mar 1999 13:36:27 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

 
just a quick test.  please ignore.

allman


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 15:21:36 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA29514
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 15:21:35 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id NAA02858; Mon, 1 Mar 1999 13:46:08 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from guns.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id NAA02669; Mon, 1 Mar 1999 13:46:03 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id NAA02548; Mon, 1 Mar 1999 13:46:03 -0500 (EST)
Message-Id: <199903011846.NAA02548@guns.lerc.nasa.gov>
To: tcp-impl@grc.nasa.gov
From: Mark Allman <mallman@grc.nasa.gov>
Reply-To: mallman@grc.nasa.gov
Subject: final test
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Centerfold
Date: Mon, 01 Mar 1999 13:46:03 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

 
final test.

allman


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 15:26:24 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA29667
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 15:26:23 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA08945; Mon, 1 Mar 1999 14:02:20 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from guns.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA08938; Mon, 1 Mar 1999 14:02:18 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id OAA03171; Mon, 1 Mar 1999 14:02:17 -0500 (EST)
Message-Id: <199903011902.OAA03171@guns.lerc.nasa.gov>
To: tcp-impl@grc.nasa.gov
From: Mark Allman <mallman@grc.nasa.gov>
Reply-To: mallman@grc.nasa.gov
Subject: tcpimpl: list news (really, this time)
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Centerfold
Date: Mon, 01 Mar 1999 14:02:17 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Folks-

First, I apologize for the bunches of test messages you ended up
getting this afternoon.  Sometimes "progress" is messy, I guess.

In any case, there is a slight modification in the "tcp-impl"
mailing list.  The US Congress has renamed NASA's Lewis Research
Center in honor of John Glenn.  The center is now known as the John
Glenn Research Center at Lewis Field.  Accordingly, the mailing list
and web site for the tcp-impl list have been slightly altered:

    The mailing list:

	tcp-impl@grc.nasa.gov

    To add or delete yourself:
    
	majordomo@grc.nasa.gov

    The web site:

	http://tcp-impl.grc.nasa.gov/tcp-impl/

The old "lerc" version will work for a while (where "a while" has
not yet been defined).  But, it is probably best to just go ahead
and start using the new version of the address.

Thanks,
allman


---
http://roland.grc.nasa.gov/~mallman/


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 15:54:37 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA00649
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 15:54:36 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA26841; Mon, 1 Mar 1999 14:50:11 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from alpha.xerox.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with SMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA25212; Mon, 1 Mar 1999 14:45:59 -0500 (EST)
From: spreitze@parc.xerox.com
Received: from augustus.parc.xerox.com ([13.2.116.14]) by alpha.xerox.com with SMTP id <53307(4)>; Mon, 1 Mar 1999 11:45:50 PST
Received: by augustus.parc.xerox.com id <105927>; Mon, 1 Mar 1999 11:45:32 PST
Date: Mon, 1 Mar 1999 11:45:19 PST
Subject: TCP and/or sockets vs. the last message in full-duplex applications
To: tcp-impl@grc.nasa.gov
Cc: Mike Spreitzer <spreitze@parc.xerox.com>
Message-Id: <99Mar1.114532pst."105927"@augustus.parc.xerox.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


The OMG seems to be in the process of adopting a messy fix to a bug in either TCP or the sockets interface.  I was wondering if you could give me a reality check as to whether it really is a bug, whether the bug is in TCP or certain APIs and their interfaces, what's the prognosis for getting it fixed, and whether the work-around outlined below is the appropriate one.  Apologies if this is not the right forum (and what is the right one?).

The issue below concerns the OMG's standard RPC protocol, GIOP, and its mapping to TCP, IIOP.  The problem occurs when a server initiates a connection closure at roughly the same time the client starts writing a new request message.  Clean shutdown messages, sent from server to client just before closure, were recently introduced to GIOP, but the bug seems to prevent them from being seen when the timing is close.  The proposed fix is to make the server: send the clean shutdown, close the sending half of the TCP connection, consume data until EOF or timeout, then close the receiving half.  This seems like an inappropriate amount of bother just to reliably send a message and close.

---------------- Forwarded  Message ----------------------

Date: Sun, 28 Feb 1999 14:49:41 PST
From: Jonathan Biggar <jon@floorboard.com>
Subject: Re: Draft Final Proposal for CloseConnection Issue for Review
To: Bill Janssen <janssen@parc.xerox.com>
cc: terutt@lucent.COM, interop@omg.ORG

Bill Janssen wrote:
> 
> I notice a problem with another section of this.
> 
> >    * After reliably issuing a CloseConnection message, the issuing orb
> >      may close the  connection.  Some transport protocols (not
> >      including TCP) do not provide an "orderly disconnect" capability,
> >      guaranteeing reliable delivery of the last message sent. When
> >      GIOP is used with such protocols, an additional handshake needs
> >      to be provided as part of  the mapping to that protocol's
> >      connection mechanisms,  to guarantee that both ends of the
> >      connection understand the disposition of any outstanding GIOP
> >      requests.
> 
> Actually, TCP does not seem to provide reliable delivery of the last
> message sent.  Take the case where the server sends a CloseConnection,
> then closes the TCP connection.  If the client is writing to the socket
> during this, it will send data to a closed connection on the server.
> The server will then respond with a TCP reset.  This reset ``jumps the
> queue'', and instead of seeing the CloseConnection message on its next
> read on the socket, the client will see an ECONNRESET error.  This means
> that one side may still see connection failures while the other side
> presumes to have sent an orderly shutdown.

This is actually dependent on the TCP implementation.  Some throw away
any data pending to read when the ECONNRESET occurs, others keep it.  It
would be best to document this using the TCP standard rather than the
sockets implementation of it anyway.  In the real TCP standard, each
direction of the connection can be shutdown (i.e. send a FIN)
independently.  In practice, for sockets, use the shutdown() call
instead, with the argument 1, which means close the sending side only.

I would rewrite this paragraph like this:

   * After reliably issuing a CloseConnection message, the issuing orb
     may close the  connection.  Some transport protocols (not
     including TCP) do not provide an "orderly disconnect" capability,
     guaranteeing reliable delivery of the last message sent. When
     GIOP is used with such protocols, an additional handshake needs
     to be provided as part of  the mapping to that protocol's
     connection mechanisms,  to guarantee that both ends of the
     connection understand the disposition of any outstanding GIOP
     requests.  For TCP, the orb should only shutdown the sending side
     of the connection, and then read and discard any incoming data
     until it receives an indication that the other side has also
     shutdown.  At this point, the TCP connection can be closed
     completely.
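
A rough sketch of that sequence with the sockets API, assuming a
connected stream socket.  The helper name graceful_close, the timeout
value and the 4096-byte read size are illustrative choices, not part of
GIOP or of any ORB; socket.SHUT_WR is the symbolic name for the
shutdown() argument 1 mentioned above.

```python
import socket
import time


def graceful_close(sock, final_message, timeout=5.0):
    """Send final_message, half-close, drain the peer's data, then close.

    Returns the number of bytes that were read and discarded.
    """
    sock.sendall(final_message)
    sock.shutdown(socket.SHUT_WR)      # close the sending side only
    discarded = 0
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                  # give up waiting for the peer
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                break
            if not chunk:              # EOF: the peer has shut down too
                break
            discarded += len(chunk)
    finally:
        sock.close()
    return discarded
```

Draining before close() is what keeps unread incoming data from
provoking a reset; the timeout only bounds how long a silent peer can
hold the connection open.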

-- 
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org


---------------- End of Forwarded Message ---------------


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 16:03:34 1999
Received: from assateague.lerc.nasa.gov (assateague-fi.lerc.nasa.gov [139.88.112.23])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA00943
	for <tcpimpl-archive@lists.ietf.org>; Mon, 1 Mar 1999 16:03:34 -0500 (EST)
Received: (listserv@localhost) by assateague.lerc.nasa.gov (NASA LeRC 8.7.4.1/2.01-main)
        id OAA27483; Mon, 1 Mar 1999 14:51:51 -0500 (EST)
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Received: from calcite.rhyolite.com (lombok-fi.lerc.nasa.gov [139.88.112.33]) by assateague-fi.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-main)
        id OAA25813; Mon, 1 Mar 1999 14:47:44 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id MAA21557
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Mon, 1 Mar 1999 12:47:43 -0700 (MST)
Date: Mon, 1 Mar 1999 12:47:43 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903011947.MAA21557@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

While it's clear there is little support for the current proposal for
modifying the Nagle algorithm, I think it is also clear that there would
be wide support for an ID that said that the Nagle algorithm should be
implemented according to the spirit of RFC 896.  We seem to agree that
the right way is to delay an application's send request when the connection
is not idle, but not delay anything otherwise.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 19:48:07 1999
Received: from lombok-fi.lerc.nasa.gov ([139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA08170
	for <tcpimpl-archive@odin.ietf.org>; Mon, 1 Mar 1999 19:48:07 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA01469
	for tcp-impl-outgoing; Mon, 1 Mar 1999 15:09:43 -0500 (EST)
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id PAA01455
	for <tcp-impl@lerc.nasa.gov>; Mon, 1 Mar 1999 15:09:38 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id OAA19829
	for tcp-impl@lerc.nasa.gov; Mon, 1 Mar 1999 14:09:32 -0600 (CST)
Date: Mon, 1 Mar 1999 14:09:32 -0600 (CST)
From: David Borman <dab@bsdi.com>
Message-Id: <199903012009.OAA19829@frantic.bsdi.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Date: Fri, 26 Feb 1999 18:28:29 -0800 (PST)
> From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
> ...
> Is there any
> objection that Nagle algorithm "should" be applied on a per send basis, as
> opposed to what RFC 1122 section 4.2.3.4 says?

Greg and I have exchanged a couple of messages off-line, and in the
course of that, it became clear to me that "applying Nagle on a
per-send basis" has some ambiguity that needs to be cleared up.

This message is a bit longer than I'd like, but giving examples
is the clearest way for me to make my points.

Assume 2 4K and 1 500 byte writes, over ethernet.  What I implemented
in BSD/OS will do (assuming no window limitations...):

	write(4096)
	   so_send()
	      tcp_output(2048, moretosend=1)
		send(1460)
		<defer trailing 588, more to send>
	      tcp_output(2048, moretosend=0)
		send(1460 (588+872))
		<no more to send, no outstanding unacked data from
		 before this write, send the trailing piece>
		send(1176)

	write(4096)
	   so_send()
	      tcp_output(2048, moretosend=1)
		send(1460)
		<defer trailing 588, more to send>
	      tcp_output(2048, moretosend=0)
		send(1460 (588+872))
		<no more to send, outstanding unacked data from
		 before this write, defer data>

	write(500)
	   so_send()
		tcp_output(500, moretosend=0)
		send(1176+284);

	recv(ACK 2920); pkts 1-2
	recv(ACK 7016); pkts 3-5
	<next ACK is delayed...>
	recv(ACK 8476); pkt 6
		<all outstanding data is acked, send last piece>
		send(216)

So, in this instance, Nagle is applied to the write as a whole,
allowing the trailing piece of the initial write to be sent out
immediately.  This was the goal, so that a single write of a request
that exceeded the MTU would be sent out immediately, rather than
having the last part delayed waiting for an ACK.  After that, as
long as there is outstanding unacked data, the trailing tiny piece
is not sent out until all outstanding data is acked.  (In our test
case, it was a request of ~2K that now gets sent out immediately in
two packets.)

Now, taking it one step farther, another way to apply Nagle to the
entire write() could be:

	write(4096)
	   so_send()
	      tcp_output(2048)
		send(1460)
		<defer trailing 588, more to send>
	      tcp_output(2048)
		send(1460 (588+872))
		<trailing piece of the initial "big" write, send it>
		send(1176)

	write(4096)
	   so_send()
	      tcp_output(2048)
		send(1460)
		<defer trailing 588, more to send>
	      tcp_output(2048)
		send(1460 (588+872))
		<trailing piece of a "big" write, no outstanding
		 unacked data from a "small" write, send it>
		send(1176)

	write(500)
	   so_send()
		tcp_output(500)
		<A "small" write, and no outstanding unacked data from
		 a previous "small" write, so send it.>
		send(500);

In a sense, just as a single write > MTU can get all its big packets
and one little packet sent out, applying the same logic to a
sequence of writes would allow all the big writes followed by
one little write to get sent out immediately.

Of course, the problem with this (which is also what you can get
if you set TCP_NODELAY) is that you don't get a stream of full
sized packets followed by one small packet,
	send(1460); send(1460); send(1176);
	send(1460); send(1460); send(1176);
	send(500);

As opposed to the former which gives you:
	send(1460); send(1460); send(1176);
	send(1460); send(1460);
	send(1460);
	send(216);
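
A quick tally of the two sequences above may help.  The lists are
copied from the traces; the summarize() helper and its field names
are hypothetical, introduced only for this comparison.

```python
MSS = 1460

# Segment sizes copied from the two traces above.
per_write_nagle = [1460, 1460, 1176, 1460, 1460, 1176, 500]
bsdos_behaviour = [1460, 1460, 1176, 1460, 1460, 1460, 216]


def summarize(packets, mss=MSS):
    """Return total bytes, packet count, and count of sub-MSS packets."""
    return {
        "bytes": sum(packets),
        "packets": len(packets),
        "short": sum(1 for p in packets if p < mss),
    }


print(summarize(per_write_nagle))
print(summarize(bsdos_behaviour))
```

Both sequences carry the same 8692 bytes in the same number of
packets; what differs is how many of those packets are shorter than
a full segment.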

And in reality, with slow-start, the BSD/OS code actually generates
(with sufficiently large RTT):
	write(4096)
	   so_send()
	      tcp_output(2048, moretosend=1)
		send(1460)
		<defer trailing 588, more to send>
	      tcp_output(2048, moretosend=0)
		send(1460 (588+872))
		<no more to send, no outstanding unacked data from
		 before this write, but CW=2*1460; defer 1176>

	write(4096)
	   so_send()
	      tcp_output(2048, moretosend=1)
		<congestion window full, no send>
	      tcp_output(2048, moretosend=0)
		<congestion window full, no send>

	write(500)
	   so_send()
		tcp_output(500, moretosend=0)
		<congestion window full, no send>

	recv(ACK 2920); pkts 1-2
		<CW=3*1460, all data acked>
		send(1460); send(1460); send(1460);
	recv(ACK 5840); pkts 3-4
		<due to Nagle, defer final 1392 bytes>
	<delay for next ACK>
	recv(ACK 7300);
		<all outstanding data ACKed, send final piece>
		send(1392);
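
The slow-start trace can be approximated with a small model.  It is
a sketch only: slow_start_trace() is a hypothetical helper, all three
writes are assumed to be queued before the first ACK arrives (the
"sufficiently large RTT" case), the congestion window grows by one
MSS per ACK, and the receiver's window is ignored.

```python
MSS = 1460


def slow_start_trace(total_bytes, initial_cwnd_segments, ack_points):
    """Return the segment sizes sent, in order, for the given ACKs."""
    cwnd = initial_cwnd_segments * MSS
    snd_una = 0  # highest byte acknowledged so far
    snd_nxt = 0  # next byte to send
    sent = []

    def output():
        nonlocal snd_nxt
        while snd_nxt < total_bytes:
            room = cwnd - (snd_nxt - snd_una)
            left = total_bytes - snd_nxt
            if left >= MSS:
                if room < MSS:
                    break  # congestion window full
                n = MSS
            else:
                # Short tail: Nagle holds it while anything is unacked.
                if snd_nxt > snd_una or room < left:
                    break
                n = left
            sent.append(n)
            snd_nxt += n

    output()
    for ack in ack_points:
        snd_una = ack
        cwnd += MSS  # slow start: one MSS per ACK received
        output()
    return sent


print(slow_start_trace(8692, 2, [2920, 5840, 7300]))
```

With the ACK points taken from the trace, the model sends two full
segments, then three more after the first ACK, and holds the final
1392 bytes until everything outstanding has been acknowledged.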

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 20:04:07 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA08823
	for <tcpimpl-archive@odin.ietf.org>; Mon, 1 Mar 1999 20:04:07 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA09365
	for tcp-impl-outgoing; Mon, 1 Mar 1999 17:06:25 -0500 (EST)
Received: from bsd.jcs.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA09358
	for <tcp-impl@lerc.nasa.gov>; Mon, 1 Mar 1999 17:06:22 -0500 (EST)
Received: (from jcs@localhost)
	by bsd.jcs.com (8.8.8/8.8.8) id RAA17438
	for tcp-impl@lerc.nasa.gov; Mon, 1 Mar 1999 17:08:23 -0500 (EST)
	(envelope-from jsnader@ix.netcom.com)
X-Authentication-Warning: bsd.jcs.com: jcs set sender to jsnader@ix.netcom.com using -f
Message-ID: <19990301170820.C17386@ix.netcom.com>
Date: Mon, 1 Mar 1999 17:08:20 -0500
From: Jon Snader <jsnader@ix.netcom.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Mail-Followup-To: tcp-impl@lerc.nasa.gov
References: <199903011947.MAA21557@calcite.rhyolite.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.93.1i
In-Reply-To: <199903011947.MAA21557@calcite.rhyolite.com>; from Vernon Schryver on Mon, Mar 01, 1999 at 12:47:43PM -0700
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Mon, Mar 01, 1999 at 12:47:43PM -0700, Vernon Schryver wrote:
> I think it is also clear that there would
> be wide support for an ID that said that the Nagle algorithm should be
> implemented according to the spirit of RFC 896.  We seem to agree that
> the right way is to delay an application's send request when the connection
> is not idle, but not delay any anything otherwise.
> 

RFC 896 says that *no* newly arrived data is to be sent if
the connection is not idle.  Most BSD derived implementations
will send new data if there is at least a full segment.  Are
you proposing to disallow this practice or is this still in
the spirit of RFC 896?

Jon Snader


From owner-tcp-impl@lerc.nasa.gov  Mon Mar  1 22:41:32 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA18565
	for <tcpimpl-archive@odin.ietf.org>; Mon, 1 Mar 1999 22:41:29 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA17692
	for tcp-impl-outgoing; Mon, 1 Mar 1999 20:03:10 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA17688
	for <tcp-impl@lerc.nasa.gov>; Mon, 1 Mar 1999 20:03:08 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id SAA27399
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Mon, 1 Mar 1999 18:03:07 -0700 (MST)
Date: Mon, 1 Mar 1999 18:03:07 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903020103.SAA27399@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: revised internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Jon Snader <jsnader@ix.netcom.com>

> > I think it is also clear that there would
> > be wide support for an ID that said that the Nagle algorithm should be
> > implemented according to the spirit of RFC 896.  We seem to agree that
> > the right way is to delay an application's send request when the connection
> > is not idle, but not delay any anything otherwise.

> RFC 896 says that *no* newly arrived data is to be sent if
> the connection is not idle.  Most BSD derived implementations
> will send new data if there is at least a full segment.  Are
> you proposing to disallow this practice or is this still in
> the spirit of RFC 896?

I think sending full segments is in the spirit of RFC 896.  The problem
that RFC 896 tries to solve is "too many small packets."  RFC 896 is not
about sending the absolute minimum number of packets or about minimizing
latency.  It is an "engineering solution" to "too many small packets."
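
The two readings being contrasted can be written as predicates.  This
is a simplified sketch, not code from any stack: both function names
are hypothetical, and window checks, silly window avoidance, and
forced output are left out.

```python
def rfc896_may_send(queued, unacked):
    """Strict reading: hold all new data while anything is unacked."""
    return queued > 0 and unacked == 0


def bsd_style_may_send(queued, mss, unacked, nodelay=False):
    """Behaviour described in this thread for BSD-derived stacks:
    a full segment always goes; a short one goes only if the
    connection is idle or the Nagle algorithm is turned off."""
    if queued <= 0:
        return False
    return queued >= mss or unacked == 0 or nodelay


# A full segment arriving while 2920 bytes are still unacknowledged:
print(rfc896_may_send(1460, unacked=2920))           # strict reading
print(bsd_style_may_send(1460, 1460, unacked=2920))  # BSD-style reading
```

The two predicates differ only when a full segment's worth of data is
queued on a connection that is not idle.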

On the other hand, does it matter whether you disallow that practice?
If the link is not idle, then an Ack is coming back.  If the previous
write which is causing that Ack was large, then the Ack will appear
soon and the system is probably congestion-window constrained and
unable to send any new data regardless.  If the previous write before
the big write was small, then it makes sense to delay the new data no
matter how large (provided the application does not have the Nagle algorithm
turned off in general).  In either of those cases, delaying a big
write is either almost harmless or desirable.

Nothing can eliminate the many legitimate situations where the Nagle
algorithm must be turned off.  You can at most hope to make it do the
right things on pure request-response applications.

I envision a flag sent through pru_send() like MORETOSEND but that says
"DO (NOT) DELAY IF IDLE," and removing the test of the NODELAY bit from
tcp_output().  That would move the policy issue to sosend() which could
worry about the definition of "small" or honor per-request "do (not) delay"
flag bits from the application.  If data were delayed by tcp_output because
tcp_output decided the link is idle and the new bit concurred, then later
when an Ack finally arrived, all queued data would be sent (subject to
windows, silly window avoidance, etc.).  That would solve the problem the
previous proposal addresses.  This idea would require no new state in the
TCB and less processing than the previous proposal.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Tue Mar  2 19:02:20 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA19026
	for <tcpimpl-archive@odin.ietf.org>; Tue, 2 Mar 1999 19:02:19 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA10280
	for tcp-impl-outgoing; Tue, 2 Mar 1999 12:49:36 -0500 (EST)
Received: from aland.bbn.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id MAA10261
	for <tcp-impl@grc.nasa.gov>; Tue, 2 Mar 1999 12:49:32 -0500 (EST)
Received: from aland.bbn.com (localhost [127.0.0.1])
	by aland.bbn.com (8.9.0.Beta3/8.9.0.Beta3) with ESMTP id JAA29527;
	Tue, 2 Mar 1999 09:49:24 -0800 (PST)
Message-Id: <199903021749.JAA29527@aland.bbn.com>
To: spreitze@parc.xerox.com
cc: tcp-impl@lerc.nasa.gov
Subject: Re: TCP and/or sockets vs. the last message in full-duplex applications 
In-reply-to: Your message of "Mon, 01 Mar 1999 11:45:19 PST."
             <99Mar1.114532pst."105927"@augustus.parc.xerox.com> 
Date: Tue, 02 Mar 1999 09:49:24 -0800
From: Craig Partridge <craig@aland.bbn.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Well there are two issues here:

1. What to do with a FIN if you have data still to read

    Jumping the queue sounds busted to me.

2. Whether a reliable close is possible

    It is not -- it is an instance of the two armies problem.

Craig

In message <99Mar1.114532pst."105927"@augustus.parc.xerox.com>,
spreitze@parc.xerox.com writes:

>
>The OMG seems to be in the process of adopting a messy fix to a bug in either
>TCP or the sockets interface.  I was wondering if you could give me a reality
>check as to whether it really is a bug, whether the bug is in TCP or certain
>APIs and their interfaces, what's the prognosis for getting it fixed, and
>whether the work-around outlined below is the appropriate one.  Apologies if
>this is not the right forum (and what is the right one?).
>
>The issue below concerns the OMG's standard RPC protocol, GIOP, and its
>mapping to TCP, IIOP.  The problem occurs when a server initiates a connection
>closure at roughly the same time the client starts writing a new request
>message.  Clean shutdown messages, sent from server to client just before
>closure, were recently introduced to GIOP, but the bug seems to prevent them
>from being seen when the timing is close.  The proposed fix is to make the
>server: send the clean shutdown, close the sending half of the TCP connection,
>consume data until EOF or timeout, then close the receiving half.  This seems
>like an inappropriate amount of bother just to reliably send a message and
>close.
>
>---------------- Forwarded  Message ----------------------
>
>Date: Sun, 28 Feb 1999 14:49:41 PST
>From: Jonathan Biggar <jon@floorboard.com>
>Subject: Re: Draft Final Proposal for CloseConnection Issue for Review
>To: Bill Janssen <janssen@parc.xerox.com>
>cc: terutt@lucent.COM, interop@omg.ORG
>
>Bill Janssen wrote:
>> 
>> I notice a problem with another section of this.
>> 
>> >    * After reliably issuing a CloseConnection message, the issuing orb
>> >      may close the  connection.  Some transport protocols (not
>> >      including TCP) do not provide an "orderly disconnect" capability,
>> >      guaranteeing reliable delivery of the last message sent. When
>> >      GIOP is used with such protocols, an additional handshake needs
>> >      to be provided as part of  the mapping to that protocol's
>> >      connection mechanisms,  to guarantee that both ends of the
>> >      connection understand the disposition of any outstanding GIOP
>> >      requests.
>> 
>> Actually, TCP does not seem to provide reliable delivery of the last
>> message sent.  Take the case where the server sends a CloseConnection,
>> then closes the TCP connection.  If the client is writing to the socket
>> during this, it will send data to a closed connection on the server.
>> The server will then respond with a TCP reset.  This reset ``jumps the
>> queue'', and instead of seeing the CloseConnection message on its next
>> read on the socket, the client will see an ECONNRESET error.  This means
>> that one side may still see connection failures while the other side
>> presumes to have sent an orderly shutdown.
>
>This is actually dependent on the TCP implementation.  Some throw away
>any data pending to read when the ECONNRESET occurs, others keep it.  It
>would be best to document this using the TCP standard rather than the
>sockets implementation of it anyway.  In the real TCP standard, each
>direction of the connection can be shutdown (i.e. send a FIN)
>independently.  In practice, for sockets, use the shutdown() call
>instead, with the argument 1, which means close the sending side only.
>
>I would rewrite this paragraph like this:
>
>   * After reliably issuing a CloseConnection message, the issuing orb
>     may close the  connection.  Some transport protocols (not
>     including TCP) do not provide an "orderly disconnect" capability,
>     guaranteeing reliable delivery of the last message sent. When
>     GIOP is used with such protocols, an additional handshake needs
>     to be provided as part of  the mapping to that protocol's
>     connection mechanisms,  to guarantee that both ends of the
>     connection understand the disposition of any outstanding GIOP
>     requests.  For TCP, the orb should only shutdown the sending side
>     of the connection, and then read and discard any incoming data
>     until it receives an indication that the other side has also
>     shutdown.  At this point, the TCP connection can be closed
>     completely.
>
>-- 
>Jon Biggar
>Floorboard Software
>jon@floorboard.com
>jon@biggar.org
>
>
>
>---------------- End of Forwarded Message ---------------


From owner-tcp-impl@lerc.nasa.gov  Tue Mar  2 22:53:15 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA04504
	for <tcpimpl-archive@odin.ietf.org>; Tue, 2 Mar 1999 22:53:13 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA13711
	for tcp-impl-outgoing; Tue, 2 Mar 1999 19:52:14 -0500 (EST)
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA13631
	for <tcp-impl@lerc.nasa.gov>; Tue, 2 Mar 1999 19:48:42 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id AAA12783; Wed, 3 Mar 1999 00:48:27 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m10I0gm-0007U1C; Wed, 3 Mar 99 01:43 GMT
Message-Id: <m10I0gm-0007U1C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: TCP and/or sockets vs. the last message in full-duplex applications
To: craig@aland.bbn.com (Craig Partridge)
Date: Wed, 3 Mar 1999 01:43:08 +0000 (GMT)
Cc: spreitze@parc.xerox.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <199903021749.JAA29527@aland.bbn.com> from "Craig Partridge" at Mar 2, 99 09:49:24 am
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Well there are two issues here:
> 
> 1. What to do with a FIN if you have data still to read
>     Jumping the queue sounds busted to me.

Jumping the queue destroys everything. Been there, done that while I was
still a tcp writing newbie. You break ftp, everything and people bitch. I
thus find it hard to believe anyone got away with the mistake in a release
product without receiving hate mail by the truckload.

Alan


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 00:56:09 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA16077
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 00:56:08 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA16349
	for tcp-impl-outgoing; Tue, 2 Mar 1999 22:12:16 -0500 (EST)
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id WAA16271
	for <tcp-impl@lerc.nasa.gov>; Tue, 2 Mar 1999 22:10:18 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 3 Mar 1999 03:33:59 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10I21p-001xi2C; Tue, 2 Mar 1999 19:08:57 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id TAA15487; Tue, 2 Mar 1999 19:09:53 -0800 (PST)
Message-Id: <199903030309.TAA15487@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Mon, 01 Mar 1999 14:09:32 CST."
             <199903012009.OAA19829@frantic.bsdi.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 02 Mar 1999 19:09:52 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

All,

In a message yesterday, David Borman pointed out that "applying Nagle on a 
per-send basis" is somewhat ambiguous.  I'd love to hear how various people 
interpret that statement (in terms, i guess, of how their own stack implements 
this).

I think the question of whether the proposed modification proposes anything 
new, or simply codifies existing practice, is probably wrapped up in this 
issue.

Thanks,  Greg


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 06:07:07 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id GAA05597
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 06:07:06 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id DAA20174
	for tcp-impl-outgoing; Wed, 3 Mar 1999 03:02:17 -0500 (EST)
Received: from alpha.xerox.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id CAA20130
	for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 02:58:35 -0500 (EST)
Received: from deimos ([13.0.209.39]) by alpha.xerox.com with SMTP id <61146(5)>; Tue, 2 Mar 1999 23:58:33 PST
From: "Mike Spreitzer" <spreitze@parc.xerox.com>
To: "Craig Partridge" <craig@aland.bbn.com>
Cc: <tcp-impl@lerc.nasa.gov>
Subject: RE: TCP and/or sockets vs. the last message in full-duplex applications 
Date: Tue, 2 Mar 1999 23:58:32 PST
Message-ID: <000101be654b$a1a48960$27d1000d@deimos.parc.xerox.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2377.0
In-reply-to: <199903021749.JAA29527@aland.bbn.com>
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3155.0
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> Well there are two issues here:
>
> 1. What to do with a FIN if you have data still to read
>
>     Jumping the queue sounds busted to me.

No, the FIN isn't the problem here.  The RST is.

> 2. Whether an reliable close is possible
>
>     It is not -- it is an instance of the two armies problem.

The problem at hand is more one of reliable data delivery before the close.
If only the RST were handled differently upon receipt at the client side
(i.e., didn't cause discarding of the segments received from the net but not
yet delivered to the app), we'd have this reliable data delivery.  I think
one possible answer would be to refine the RST bit into two bits: a more and
a less severe one.  But I'm also not sure it's right for the server to
respond with RST in the case at hand.

Mike


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 13:41:39 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA15900
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 13:41:38 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id KAA15962
	for tcp-impl-outgoing; Wed, 3 Mar 1999 10:15:39 -0500 (EST)
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id KAA15526
	for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 10:13:36 -0500 (EST)
Received: from big (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id KAA23677;
	Wed, 3 Mar 1999 10:13:33 -0500
Message-Id: <3.0.5.32.19990303101333.02ed2100@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Wed, 03 Mar 1999 10:13:33 -0500
To: spreitze@parc.xerox.com, tcp-impl@lerc.nasa.gov
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: TCP and/or sockets vs. the last message in full-duplex
  applications
In-Reply-To: <99Mar1.114532pst."105927"@augustus.parc.xerox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 11:45 3/1/99 PST, spreitze@parc.xerox.com wrote:

>The proposed fix is to make the server: send the clean shutdown,
>close the sending half of the TCP connection, consume data until
>EOF or timeout, then close the receiving half.  This seems like an
>inappropriate amount of bother just to reliably send a message and
>close.

This is old news in the HTTP world - in fact (although I can't find any mails
in the HTTP archives to show this), I believe I first found this
problem in early 1995 when experimenting with cross-Atlantic HTTP/1.0 PUT
requests between MIT and CERN. The problem that I saw for HTTP/1.0 PUT is
the following:

* Client A sends a PUT request with a large body to the server
* The server B sends back a short 401 Access Denied response
  and closes the connection in both directions.
* A receives the response but has already sent a large part of
  the body because of the RTT across the Atlantic
* A ACKs the response but the ACK is pending behind the data
  already sent to B
* B sees that data is still coming and sends a RST to A
* A gets the RST and passes it up immediately to the application
  dropping the HTTP response because the seq counter hasn't
  been updated.
* A gets a RST but no HTTP response - it doesn't know what happened.

At the time I talked with Dave Clark about it and he pointed out that it
was not a bug but normal behavior.

The problem can also occur in HTTP/1.1 pipelining which we found when
implementing this in the libwww HTTP code. This is described in an IETF
draft by Jim Gettys and Alan Freier that was never finished [1], section 8:
	
   In simple request/response protocols (e.g. HTTP/1.0), a server can go
   ahead and close both receive and transmit sides of its connection
   simultaneously whenever it needs to. A pipelined or streaming
   protocol (e.g. HTTP/1.1) connection, is more complex [Frystyk et.
   al.], and an implementation which does so can create major problems.

   The scenario is as follows: an HTTP/1.1 client talking to a HTTP/1.1
   server starts pipelining a batch of requests, for example 15 on an
   open TCP connection.  The server decides that it will not serve more
   than 5 requests per connection and closes the TCP connection in both
   directions after it successfully has served the first five requests.
   The remaining 10 requests that are already sent from the client will
   along with client generated TCP ACK packets arrive on a closed port
   on the server. This "extra" data causes the server's TCP to issue a
   reset which makes the client TCP stack pass the last ACK'ed packet to
   the client application and discard all other packets. This means that
   HTTP responses that are either being received or already have been
   received successfully but haven't been ACK'ed will be dropped by the
   client TCP. In this situation the client does not have any means of
   finding out which HTTP messages were successful or even why the
   server closed the connection. The server may have generated a
   "Connection: Close" header in the 5th response but the header may
   have been lost due to the TCP reset. Servers must therefore close
   each half of the connection independently.

In HTTP/1.1 rev 6 [2], this has been moved to section 10.4:

	If the client is sending data, a server implementation
	using TCP SHOULD be careful to ensure that the client
	acknowledges receipt of the packet(s) containing the
	response, before the server closes the input connection.
	If the client continues sending data to the server after
	the close, the server's TCP stack will send a reset packet
	to the client, which may erase the client's unacknowledged
	input buffers before they can be read and interpreted by
	the HTTP application.

I believe that all modern HTTP servers in fact do the half-close. However,
this is not without problems as the server also has to be able to protect
itself. It therefore has to have some notion of a "reasonable lingering
time" for the connection.
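
The half-close sequence discussed in this thread (send the last
message, shut down the sending side, drain input until EOF or a
timeout, then close) can be sketched with the standard sockets API.
The function name send_then_close() and the 2-second default are
hypothetical choices for illustration; the timeout here applies to
each recv() call, so a real server would probably also want to bound
the total lingering time.

```python
import socket


def send_then_close(sock, last_message, linger_seconds=2.0, bufsize=4096):
    """Send a final message, half-close, drain input, then close.

    Returns the number of bytes read and discarded while waiting for
    the peer's end-of-file.
    """
    sock.sendall(last_message)
    sock.shutdown(socket.SHUT_WR)     # FIN follows the queued data
    sock.settimeout(linger_seconds)   # per-recv() timeout (assumption)
    discarded = 0
    try:
        while True:
            data = sock.recv(bufsize)
            if not data:              # peer has shut down its side too
                break
            discarded += len(data)
    except socket.timeout:
        pass                          # give up waiting; close anyway
    finally:
        sock.close()
    return discarded
```

Because the receiving side stays open until the peer's EOF or the
timeout, data that arrives late is read and discarded instead of
landing on a fully closed connection.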

Henrik

[1] http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt
[2] http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-rev-06.txt
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 15:05:29 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA17562
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 15:05:28 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA10386
	for tcp-impl-outgoing; Wed, 3 Mar 1999 12:04:54 -0500 (EST)
Received: from alpha.xerox.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id MAA10109
	for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 12:03:23 -0500 (EST)
Received: from phobos ([13.2.118.23]) by alpha.xerox.com with SMTP id <61528(4)>; Wed, 3 Mar 1999 09:03:18 PST
From: "Mike Spreitzer" <spreitze@parc.xerox.com>
To: "'Craig Partridge'" <craig@aland.bbn.com>
Cc: <tcp-impl@lerc.nasa.gov>
Subject: RE: TCP and/or sockets vs. the last message in full-duplex applications 
Date: Wed, 3 Mar 1999 09:03:04 PST
Message-ID: <000001be6597$b4c2e090$1776020d@phobos.parc.xerox.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2377.0
Importance: Normal
In-Reply-To: <199903021749.JAA29527@aland.bbn.com>
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3155.0
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

Actually, it looks to me like things would be great if we could just do the
"CLOSE" operation defined in the TCP RFC (793).  The "Event Processing"
section there outlines a series of transitions in which the TCB is not
discarded immediately at close time but rather is put through a series of
state transitions that do not call for RST to be sent (even if user data
arrives).  So one analysis is that sockets close() isn't TCP's "CLOSE".  Is
shutdown() with (how==2)?

Cheers,
Mike


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 17:08:00 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA20858
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 17:08:00 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA08026
	for tcp-impl-outgoing; Wed, 3 Mar 1999 14:15:00 -0500 (EST)
Received: from atlrel1.hp.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA07625
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 14:12:31 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by atlrel1.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id OAA15314
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 14:12:23 -0500 (EST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id LAA29972 for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 11:12:29 -0800 (PST)
Message-ID: <36DD899C.EADF0FED@cup.hp.com>
Date: Wed, 03 Mar 1999 11:12:28 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199903030309.TAA15487@red.mtv.siara.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

Greg Minshall wrote:
> In a message yesterday, David Borman pointed out that "applying Nagle on a
> per-send basis" is somewhat ambiguous.  I'd love to hear how various people
> interpret that statement (in terms, i guess, of how their own stack implements
> this).

How does this interpretation sound:

If an application makes an API call providing for transmission a
quantity of data greater than or equal to the MSS of the associated TCP
connection, all data provided to the API call is immediately transmitted
by TCP, subject to the constraints of the receiver and congestion
windows.

This is a slightly formalized version of "If an application calls send()
with at least an MSS's worth of data, its transmission will not be
delayed by the Nagle algorithm."
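
One way to read that interpretation is as a function from a single
send() call to the segments transmitted right away.  This is a sketch
under assumptions: segments_sent_now() is a hypothetical name, data
left over from earlier calls is not modelled, and the receiver and
congestion windows are ignored.

```python
def segments_sent_now(nbytes, mss, connection_idle):
    """Segments sent immediately for one send() call of nbytes.

    A call with at least an MSS of data is never delayed, including
    its short trailing piece; a smaller call goes out only if the
    connection is idle.
    """
    if nbytes <= 0:
        return []
    if nbytes < mss and not connection_idle:
        return []  # deferred until outstanding data is acknowledged
    full, tail = divmod(nbytes, mss)
    return [mss] * full + ([tail] if tail else [])


print(segments_sent_now(4096, 1460, connection_idle=False))
print(segments_sent_now(500, 1460, connection_idle=False))
```

Under this reading a 4096-byte send on a busy connection still goes
out as two full segments plus its 1176-byte tail, while a 500-byte
send on the same connection waits.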


rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 18:14:27 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA26260
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 18:14:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA24083
	for tcp-impl-outgoing; Wed, 3 Mar 1999 15:29:56 -0500 (EST)
Received: from web4.rocketmail.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id PAA20455
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 15:11:30 -0500 (EST)
Message-ID: <19990303200114.20713.rocketmail@web4.rocketmail.com>
Received: from [202.54.22.130] by web4; Wed, 03 Mar 1999 12:01:14 PST
Date: Wed, 3 Mar 1999 12:01:14 -0800 (PST)
From: vijay singh <vijjus@rocketmail.com>
Subject: TCP-offload on IBM mainframe
To: tcp-impl@lerc.nasa.gov
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


I wanted to put a question before the forum:

On an IBM mainframe, while offloading the TCP
processing to a 3172 Controller, can we control the
buffer allocation (mbufs) on the offload (not in the
TCP MVS bufferpool)?  I face a problem where the
offload runs out of mbufs and the server application
comes down and does not come up again as it fails in
the BIND call.  Or can we free some mbufs by dropping
some connections through netstat?

Vijay
E-MAIL: vijjus@rocketmail.com


---Vernon Schryver <vjs@calcite.rhyolite.com> wrote:
>
> While it's clear there is little support for the current proposal for
> modifying the Nagle algorithm, I think it is also clear that there would
> be wide support for an ID that said that the Nagle algorithm should be
> implemented according to the spirit of RFC 896.  We seem to agree that
> the right way is to delay an application's send request when the connection
> is not idle, but not delay any anything otherwise.
> 
> 
> Vernon Schryver    vjs@rhyolite.com
> 



From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 18:17:19 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA26396
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 18:17:18 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA26124
	for tcp-impl-outgoing; Wed, 3 Mar 1999 15:40:03 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id PAA24906
	for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 15:35:56 -0500 (EST)
Received: from Eng.Sun.COM (engmail4 [129.144.134.6]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id MAA15184 for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 12:35:53 -0800
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by Eng.Sun.COM (8.8.8+Sun/SMI-5.3) with ESMTP id MAA13766
	for <tcp-impl@grc.nasa.gov>; Wed, 3 Mar 1999 12:35:52 -0800 (PST)
Received: (from kcpoon@localhost)
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) id MAA07521
	for tcp-impl@grc.nasa.gov; Wed, 3 Mar 1999 12:35:52 -0800 (PST)
Date: Wed, 3 Mar 1999 12:35:52 -0800 (PST)
From: Kacheong Poon <kcpoon@shield.Eng.Sun.COM>
Message-Id: <199903032035.MAA07521@shield.eng.sun.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>Of course, the problem with this (which is also what you can get
>if you set TCP_NODELAY) is that you don't get a stream of full
>sized packets followed by one small packet,
>	send(1460); send(1460); send(1176);
>	send(1460); send(1460); send(1176);
>	send(500);
>
>As opposed to the former which gives you:
>	send(1460); send(1460); send(1176);
>	send(1460); send(1460);
>	send(1460);
>	send(216);

In this particular example, the number of packets is the same, so I don't see
a problem.  Yes, we can construct an example in which, with the "one step
further definition," TCP will send out more packets.  But if an app is well
written and does appropriate buffering, I think that is OK.
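
The packet counts above can be checked with a little arithmetic.  The
sketch below is illustrative only: it assumes the quoted traces come from
application writes of 4096, 4096 and 500 bytes over a 1460-byte MSS (this is
inferred from the segment sizes, not stated in the thread), and the function
names are made up for this example.

```c
#include <stddef.h>

/* Segments needed to carry `bytes` at a given MSS (ceiling division). */
static size_t segments_for(size_t bytes, size_t mss)
{
    return (bytes + mss - 1) / mss;
}

/* Every write is segmented on its own, so each write may end in a
 * short segment. */
static size_t segments_per_write(const size_t *writes, size_t n, size_t mss)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += segments_for(writes[i], mss);
    return total;
}

/* The first write goes out on its own; all later writes are coalesced
 * into one byte stream before being segmented (assumed model of the
 * second trace). */
static size_t segments_coalesced_after_first(const size_t *writes, size_t n,
                                             size_t mss)
{
    size_t rest = 0;
    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++)
        rest += writes[i];
    return segments_for(writes[0], mss) + segments_for(rest, mss);
}
```

With writes of 4096, 4096 and 500 bytes both functions give 7 segments,
matching the two traces.  With writes of 4096, 100 and 100 bytes, per-write
segmentation needs 5 segments against 4 for coalescing, which is the kind of
constructed case where more packets are sent.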

>	recv(ACK 2920); pkts 1-2
>		<CW=3*1460, all data acked>
>		send(1460); send(1460); send(1460);
>	recv(ACK 5840); pkts 3-4
>		<due to Nagle, defer final 1392 bytes>
>	<delay for next ACK>
>	recv(ACK 7300);
>		<all outstanding data ACKed, send final piece>
>		send(1392);

I think these last 1392 bytes should not be delayed, given that they are part
of a send which should not be delayed.  If the sequence of packets is
like

-->	PKT3 (1460); PKT4 (1460); PKT5 (1460)
<--	ACK (4)

Solaris will not delay sending

-->	PKT6 (1392)

Vernon brought up an interesting point in an earlier mail: apply Nagle
either to send requests or to segments, but not both.  That means TCP
will not apply the Nagle check when a packet is sent because of an ACK.
When we consider the meaning of Nagle on a per-send basis, I think we should
also address that in the definition.

							K. Poon
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 18:57:51 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA28502
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 18:57:50 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id QAA03504
	for tcp-impl-outgoing; Wed, 3 Mar 1999 16:20:00 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id QAA02699
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 16:15:38 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id OAA16869
	for tcp-impl@lerc.nasa.gov  env-from <vjs>;
	Wed, 3 Mar 1999 14:15:22 -0700 (MST)
Date: Wed, 3 Mar 1999 14:15:22 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903032115.OAA16869@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Rick Jones <raj@cup.hp.com>

> ...
> If an application makes an API call providing for transmission a
> quantity of data greater than or equal to the MSS of the associated TCP
> connection, all data provided to the API call is immediately transmitted
> by TCP, subject to the constraints of the receiver and congestion
> windows.
>
> This is a slightly formalized version of "If an application calls send()
> with at least an MSS's worth of data, its transmission will not be
> delayed by the Nagle algorithm."


I prefer the perhaps less formalized version.  It says what we all seem
to mean and want (with the quibble that "MSS's worth" is not good).  The
more formal version invites other questions, such as whether silly window
avoidance mechanisms, sender operating system scheduling or buffer
management, and a myriad of other things might delay the data, with or
without interaction with the Nagle algorithm.

I'd vote for the less formal but wordy "If an application calls send()
with at least 512 bytes or if it has turned off the Nagle algorithm, the
transmission of the data will not be delayed by the Nagle algorithm either
immediately or after other delays such as those caused by receiver or
congestion windows."
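
As a rough illustration of the wording above, the rule can be written as a
single predicate.  This is only a sketch: the function name and the `conn_idle`
flag (meaning no unacknowledged data is outstanding, as in RFC 896) are
assumptions made for the example, and the 512-byte figure is simply the one
used in the sentence above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold taken from the wording above; an MSS-based value is the
 * alternative discussed in the thread. */
#define NAGLE_BYPASS_BYTES 512

/* Returns true if the Nagle algorithm may hold back this send() request.
 * Window limits are a separate matter and are not modelled here. */
static bool nagle_may_delay(size_t send_len, bool nodelay, bool conn_idle)
{
    if (nodelay)
        return false;               /* application turned Nagle off */
    if (send_len >= NAGLE_BYPASS_BYTES)
        return false;               /* large send: never delayed by Nagle */
    return !conn_idle;              /* small send: delayed only if not idle */
}
```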

Given something like the MORETOSEND hack in recent BSD code, don't the
only undesirable Nagle delays occur after window delays?


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 19:24:02 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA01315
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 19:24:00 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id QAA09055
	for tcp-impl-outgoing; Wed, 3 Mar 1999 16:50:04 -0500 (EST)
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id QAA07969
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 16:45:12 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id NAA25117
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 13:45:20 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id NAA00417 for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 13:45:02 -0800 (PST)
Message-ID: <36DDAD5E.A8BBAD3D@cup.hp.com>
Date: Wed, 03 Mar 1999 13:45:02 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: TCP-offload on IBM mainframe
References: <19990303200114.20713.rocketmail@web4.rocketmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

vijay singh wrote:
> 
> I wanted to put a question before the forum:
> 
> On an IBM mainframe, when offloading TCP processing
> to a 3172 Controller, can we control the buffer
> allocation (mbufs) on the offload (not in the TCP MVS
> buffer pool)? I face a problem where the offload runs
> out of mbufs and the server application comes down and
> does not come up again because it fails in the BIND
> call. Alternatively, can we free some mbufs by dropping
> some connections through netstat?

This might be straying a trifle too far from the _impl_ part of the
list but...

On the surface it sounds more like a capacity planning failure in
specifying the capacity of the offload device. If one's netstat enables the
arbitrary dropping of connections, one could do that, but probably
should not, and it would probably not be a very long-term solution
regardless - having a bind succeed implies that there will be new
connections, which would just soak up the mbufs again. 

Besides, how do you know what TCP connections are "safe" to drop? An
existing TCP connection could be for an application that does not know
how to recover gracefully. 

So, I would say that what you propose sounds like a rather bad practice,
and that the root cause needs to be addressed - the load of TCP
connections to the mainframe is beyond the capacity of the system. 

rick jones

-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 19:41:22 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA03945
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 19:41:20 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id QAA10506
	for tcp-impl-outgoing; Wed, 3 Mar 1999 16:59:54 -0500 (EST)
Received: from alpha.xerox.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id QAA09743
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 16:54:05 -0500 (EST)
Received: from mango.parc.xerox.com ([13.1.102.232]) by alpha.xerox.com with SMTP id <56356(5)>; Wed, 3 Mar 1999 13:53:57 PST
Received: from mango.parc.xerox.com (localhost.parc.xerox.com [127.0.0.1])
	by mango.parc.xerox.com (8.8.8/8.8.8) with ESMTP id NAA03192;
	Wed, 3 Mar 1999 13:53:56 -0800 (PST)
	(envelope-from fenner@mango.parc.xerox.com)
Message-Id: <199903032153.NAA03192@mango.parc.xerox.com>
To: Greg Minshall <minshall@siara.com>
cc: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm 
In-reply-to: Your message of "Tue, 02 Mar 1999 19:09:52 PST."
             <199903030309.TAA15487@red.mtv.siara.com> 
Date: Wed, 3 Mar 1999 13:53:55 PST
From: Bill Fenner <fenner@parc.xerox.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

I think a lot of the question of what applying Nagle on a per-send
basis means includes what happens when data is delayed due to cwnd
or the receive window.  My imagination says that you keep another
variable in the TCB that's the highest sequence number that has
passed Nagle (in order to not delay the last packet of a large
write), but I haven't particularly thought through exactly how this
variable is updated or used except in the obvious case.  In
particular, I don't know how it should be updated when you perform
a send() and there is still data outstanding.

  Bill


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 21:39:00 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA21193
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 21:38:59 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA26913
	for tcp-impl-outgoing; Wed, 3 Mar 1999 18:54:59 -0500 (EST)
Received: from frantic.bsdi.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA26211
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 18:48:22 -0500 (EST)
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.0/8.9.0) id RAA02345;
	Wed, 3 Mar 1999 17:48:10 -0600 (CST)
Date: Wed, 3 Mar 1999 17:48:10 -0600 (CST)
From: David Borman <dab@bsdi.com>
Message-Id: <199903032348.RAA02345@frantic.bsdi.com>
To: fenner@parc.xerox.com
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Cc: tcp-impl@lerc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Subject: Re: internet draft on suggested mod to the Nagle algorithm 
> Date: Wed, 3 Mar 1999 13:53:55 PST
> From: Bill Fenner <fenner@parc.xerox.com>
>
> I think a lot of the question of what applying Nagle on a per-send
> basis means includes what happens when data is delayed due to cwnd
> or the receive window.  My imagination says that you keep another
> variable in the TCB that's the highest sequence number that has
> passed Nagle (in order to not delay the last packet of a large
> write), but I haven't particularly thought through exactly how this
> variable is updated or used except in the obvious case.  In
> particular, I don't know how it should be updated when you perform
> a send() and there is still data outstanding.

I've had similar thoughts.

The Nagle sequence number advances when:
      o there is a large send() or TCP_NODELAY is set,
	to the end of the data.
      o any data is actually sent, to the end of that data.
It never retreats.

Decisions on whether or not to send based on TCP_NODELAY are
changed to decide whether or not to send based upon if the
send next sequence number is in front of the Nagle sequence number.
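
A minimal sketch of the bookkeeping described above, for illustration only.
The structure and function names are hypothetical, "large" is assumed to mean
at least one MSS, and "in front of" is read here as "snd_nxt is below the
Nagle sequence number"; none of this is taken from actual BSD/OS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe sequence comparisons, in the style of BSD's SEQ_LT/SEQ_GT. */
static bool seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static bool seq_gt(uint32_t a, uint32_t b) { return (int32_t)(a - b) > 0; }

struct nagle_state {
    uint32_t snd_nxt;    /* next sequence number to send */
    uint32_t nagle_seq;  /* highest sequence number that has passed Nagle */
    uint32_t mss;
    bool     nodelay;    /* TCP_NODELAY set by the application */
};

/* The Nagle sequence number only moves forward; it never retreats. */
static void nagle_advance(struct nagle_state *s, uint32_t seq)
{
    if (seq_gt(seq, s->nagle_seq))
        s->nagle_seq = seq;
}

/* Called from send(); buf_end is the sequence number just past the data
 * now queued.  A large send or TCP_NODELAY passes all of it. */
static void nagle_on_send(struct nagle_state *s, uint32_t len, uint32_t buf_end)
{
    if (s->nodelay || len >= s->mss)
        nagle_advance(s, buf_end);
}

/* Called when a segment is actually transmitted. */
static void nagle_on_transmit(struct nagle_state *s, uint32_t seg_len)
{
    s->snd_nxt += seg_len;
    nagle_advance(s, s->snd_nxt);
}

/* Replaces the plain TCP_NODELAY test in the output decision. */
static bool nagle_permits_small_segment(const struct nagle_state *s)
{
    return seq_lt(s->snd_nxt, s->nagle_seq);
}
```

In this sketch a 4096-byte send() marks all 4096 bytes as passed, so the short
final segment is still allowed out after a window-imposed wait; a later
100-byte send() does not move the marker and remains subject to the usual
idle check.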

			-David Borman, dab@bsdi.com


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 21:39:54 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA21332
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 21:39:53 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA28004
	for tcp-impl-outgoing; Wed, 3 Mar 1999 19:04:59 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA27577
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 19:01:04 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id RAA20144
	env-from <vjs>;
	Wed, 3 Mar 1999 17:00:59 -0700 (MST)
Date: Wed, 3 Mar 1999 17:00:59 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903040000.RAA20144@calcite.rhyolite.com>
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Bill Fenner <fenner@parc.xerox.com>

> I think a lot of the question of what applying Nagle on a per-send
> basis means includes what happens when data is delayed due to cwnd
> or the receive window.  My imagination says that you keep another
> variable in the TCB that's the highest sequence number that has
> passed Nagle (in order to not delay the last packet of a large
> write), but I haven't particularly thought through exactly how this
> variable is updated or used except in the obvious case.  In
> particular, I don't know how it should be updated when you perform
> a send() and there is still data outstanding.

I think that's still confusing applying the algorithm per-send with
applying it per-Ack, and that is a Bad Thing(tm).  If you apply the
algorithm per-send, then future Acks are irrelevant, and there is no need
to save any state.

Applying it per-send is the equivalent of:

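	link_new_data_into_socket_buffer();
	/* send now if nothing is outstanding or Nagle is turned off;
	 * otherwise leave the data queued until an Ack arrives */
	if (idle || NO_DELAY) {
		tcp_output();
	}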

and that's all; no code whatsoever anywhere or anyplace else.

"delaying per-send" does not mean "delay per-send at first, then randomly
delay arbitrary segments at random future times as determined by the
vagaries of interface and MTU changes."

If you are delaying-per-send, then there should be at most one opportunity
for the entire blob of application data to be delayed.  If some of the
data is not delayed at first, then none of it should ever be delayed at
all.  If any of it is delayed, then (of course) all of it should be
delayed.  Later (e.g. when an Ack arrives), all of the data should be
sent, just as it is now sent in a system with TCP_NODELAY=1.


Vernon Schryver    vjs@rhyolite.com


P.S. I'm sending to tcp-impl@lerc.nasa.gov instead of @grc.nasa.gov because
  that's how the return address arrived here.


From owner-tcp-impl@lerc.nasa.gov  Wed Mar  3 23:04:08 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA03201
	for <tcpimpl-archive@odin.ietf.org>; Wed, 3 Mar 1999 23:04:07 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA05447
	for tcp-impl-outgoing; Wed, 3 Mar 1999 20:19:55 -0500 (EST)
Received: from palrel3.hp.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA05260
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 20:16:35 -0500 (EST)
Received: from loiter.cup.hp.com (root@loiter.cup.hp.com [15.8.80.103])
	by palrel3.hp.com (8.8.6 (PHNE_14041)/8.8.5tis) with ESMTP id RAA25977
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 17:16:48 -0800 (PST)
Received: from cup.hp.com (raj@loiter [15.8.80.103]) by loiter.cup.hp.com with ESMTP (8.8.6/8.7.3 TIS Messaging 5.0) id RAA00812 for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 17:16:31 -0800 (PST)
Message-ID: <36DDDEEF.D2AFF571@cup.hp.com>
Date: Wed, 03 Mar 1999 17:16:31 -0800
From: Rick Jones <raj@cup.hp.com>
Organization: SNSL
X-Mailer: Mozilla 4.08 [en] (X11; I; HP-UX B.10.20 9000/735)
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Re: internet draft on suggested mod to the Nagle algorithm
References: <199903032348.RAA02345@frantic.bsdi.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

OK, time for another one of my naive, ignorance-exposing questions...

Why is there any need to track a sequence number here? I send data, but
cwnd prevents me from sending it all. The remote will send me an ACK
within one RTT, and say I only have a small quantity of data left to
send because the app has not sent more or has only trickled a little
more out. The app has had an RTT to give TCP more data, and that is
probably eons in local system timescales, so if there is no more data by
the time the ACK advancing cwnd comes back, there isn't likely
to be any additional data for any appreciable time anyway.

So, why not just send the remainder all the time? 

rick jones
-- 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, or post, but please do not do both...
my email address is raj in the cup.hp.com domain...


From owner-tcp-impl@lerc.nasa.gov  Thu Mar  4 02:02:33 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA18348
	for <tcpimpl-archive@odin.ietf.org>; Thu, 4 Mar 1999 02:02:32 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id XAA20578
	for tcp-impl-outgoing; Wed, 3 Mar 1999 23:14:54 -0500 (EST)
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id XAA20212
	for <tcp-impl@lerc.nasa.gov>; Wed, 3 Mar 1999 23:10:31 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 523;
          Wed, 3 Mar 1999 20:10:15 -0800
Message-ID: <36DE07A6.EF6F1025@ehsco.com>
Date: Wed, 03 Mar 1999 20:10:14 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Henrik Frystyk Nielsen <frystyk@w3.org>
CC: spreitze@parc.xerox.com, tcp-impl@lerc.nasa.gov
Subject: Re: TCP and/or sockets vs. the last message in full-duplexapplications
References: <3.0.5.32.19990303101333.02ed2100@localhost>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> The problem can also occur in HTTP/1.1 pipelining which we found when
> implementing this in the libwww HTTP code. This is described in an
> IETF draft by Jim Gettys and Alan Freier that was never finished [1],
> section 8:

>    The remaining 10 requests that are already sent from the client
>    will along with client generated TCP ACK packets arrive on a
>    closed port on the server.

So when would the server have received the ACKs for the data it sent?
This just never would have worked in anything resembling reliability.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Thu Mar  4 16:08:58 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA12823
	for <tcpimpl-archive@odin.ietf.org>; Thu, 4 Mar 1999 16:08:57 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id JAA14795
	for tcp-impl-outgoing; Thu, 4 Mar 1999 09:29:54 -0500 (EST)
Received: from prawn.fishy.net (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id JAA14257
	for <tcp-impl@lerc.nasa.gov>; Thu, 4 Mar 1999 09:26:36 -0500 (EST)
Received: from prodigy.net ([172.17.26.241]) by prawn.fishy.net (8.8.5/8.7.3) with ESMTP id JAA19846; Thu, 4 Mar 1999 09:05:42 -0500
Message-ID: <36DEDF5E.7A961FDD@prodigy.net>
Date: Thu, 04 Mar 1999 09:30:39 -1000
From: Oleg Vishnepolsky <oleg@prodigy.net>
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: vijay singh <vijjus@rocketmail.com>
CC: tcp-impl@lerc.nasa.gov
Subject: Re: TCP-offload on IBM mainframe
References: <19990303200114.20713.rocketmail@web4.rocketmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

vijay singh wrote:

> I wanted to put a question before the forum:
>
> On an IBM mainframe, when offloading TCP processing
> to a 3172 Controller, can we control the buffer
> allocation (mbufs) on the offload (not in the TCP MVS
> buffer pool)? I face a problem where the offload runs
> out of mbufs and the server application comes down and
> does not come up again because it fails in the BIND
> call. Alternatively, can we free some mbufs by dropping
> some connections through netstat?
>
> Vijay
> E-MAIL: vijjus@rocketmail.com
>

I am the original author of the offload code on 3172, and of the
TCP/IP stack running on 3172 (and OS/2), although I left IBM
in 1993. Yes, you can blame everything on me :-)
Anyway, to increase the number of mbufs, the TCP/IP stack code,
that sits in a device driver, has to be recompiled. You can ask IBM
to do it for you.

Oleg Vishnepolsky


From owner-tcp-impl@lerc.nasa.gov  Fri Mar  5 14:44:28 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA20975
	for <tcpimpl-archive@odin.ietf.org>; Fri, 5 Mar 1999 14:44:27 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id LAA06163
	for tcp-impl-outgoing; Fri, 5 Mar 1999 11:19:25 -0500 (EST)
Received: from tux.w3.org (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id LAA04637
	for <tcp-impl@lerc.nasa.gov>; Fri, 5 Mar 1999 11:14:12 -0500 (EST)
Received: from spiderman.w3.org (root@localhost [127.0.0.1])
	by tux.w3.org (8.8.7/8.8.7) with SMTP id LAA28910;
	Fri, 5 Mar 1999 11:14:01 -0500
Message-Id: <3.0.5.32.19990305111351.00881ad0@localhost>
X-Sender: frystyk@localhost
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Fri, 05 Mar 1999 11:13:51 -0500
To: "Eric A. Hall" <ehall@ehsco.com>
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Subject: Re: TCP and/or sockets vs. the last message in
  full-duplexapplications
Cc: spreitze@parc.xerox.com, tcp-impl@lerc.nasa.gov
In-Reply-To: <36DE07A6.EF6F1025@ehsco.com>
References: <3.0.5.32.19990303101333.02ed2100@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 20:10 03/03/1999 -0800, Eric A. Hall wrote:
>
>> The problem can also occur in HTTP/1.1 pipelining which we found when
>> implementing this in the libwww HTTP code. This is described in an
>> IETF draft by Jim Gettys and Alan Freier that was never finished [1],
>> section 8:
>
>>    The remaining 10 requests that are already sent from the client
>>    will along with client generated TCP ACK packets arrive on a
>>    closed port on the server.
>
>So when would the server have received the ACKs for the data it sent?
>This just never would have worked in anything resembling reliability.

It was less an issue of knowing that this wouldn't work than of making sure
it would actually be fixed in HTTP servers moving from HTTP/1.0 (where it
wasn't an issue) to HTTP/1.1.

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 10 02:13:31 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA06567
	for <tcpimpl-archive@odin.ietf.org>; Wed, 10 Mar 1999 02:13:31 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id XAA29836
	for tcp-impl-outgoing; Tue, 9 Mar 1999 23:01:07 -0500 (EST)
Received: from daffy.ee.lbl.gov (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA28416
	for <tcp-impl@grc.nasa.gov>; Tue, 9 Mar 1999 22:57:39 -0500 (EST)
Received: (from vern@localhost)
	by daffy.ee.lbl.gov (8.9.2/8.9.2) id TAA15651;
	Tue, 9 Mar 1999 19:57:38 -0800 (PST)
Message-Id: <199903100357.TAA15651@daffy.ee.lbl.gov>
To: tcp-impl@grc.nasa.gov
Subject: Fwd: RFC 2525 on TCP Implementation Problems
Date: Tue, 09 Mar 1999 19:57:38 PST
From: Vern Paxson <vern@ee.lbl.gov>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


------- Forwarded Message

Date:  Tue, 09 Mar 1999 16:19:27 -0800
From:  RFC Editor <rfc-ed@ISI.EDU>
Subject:  RFC 2525 on TCP Implementation Problems
To:  IETF-Announce: ;
Cc:  rfc-ed@ISI.EDU
Mime-Version:  1.0
Content-Type:  Multipart/Mixed; Boundary=NextPart


- --NextPart


A new Request for Comments is now available in online RFC libraries.


        RFC 2525: 

        Title:	    Known TCP Implementation Problems
	Author(s):  V. Paxson, M Allman, S. Dawson, W. Fenner,
		    J. Griner, I. Heavens, K. Lahey, J. Semke, B. Volz
	Status:     Informational
	Date:       March 1999
        Mailbox:    vern@ee.lbl.gov, mallman@lerc.nasa.gov,
		    sdawson@eecs.umich.edu, fenner@parc.xerox.com,
		    jgriner@lerc.nasa.gov, ian@spider.com,
		    kml@nas.nasa.gov, semke@psc.edu, volz@process.com
	Pages:      61
        Characters: 137201
        Updates/Obsoletes/See Also: None
        I-D Tag:    draft-ietf-tcpimpl-prob-05.txt


        URL:        ftp://ftp.isi.edu/in-notes/rfc2525.txt


This memo catalogs a number of known TCP implementation problems.
The goal in doing so is to improve conditions in the existing
Internet by enhancing the quality of current TCP/IP implementations.
It is hoped that both performance and correctness issues can be
resolved by making implementors aware of the problems and their
solutions.  In the long term, it is hoped that this will provide a
reduction in unnecessary traffic on the network, the rate of
connection failures due to protocol errors, and load on network
servers due to time spent processing both unsuccessful connections
and retransmitted data.  This will help to ensure the stability of
the global Internet.

This document is a product of the TCP Implementation Working Group of
the IETF.

This memo provides information for the Internet community.  It does
not specify an Internet standard of any kind.  Distribution of this
memo is unlimited.

This announcement is sent to the IETF list and the RFC-DIST list.
Requests to be added to or deleted from the IETF distribution list
should be sent to IETF-REQUEST@IETF.ORG.  Requests to be
added to or deleted from the RFC-DIST distribution list should
be sent to RFC-DIST-REQUEST@RFC-EDITOR.ORG.

Details on obtaining RFCs via FTP or EMAIL may be obtained by sending
an EMAIL message to rfc-info@RFC-EDITOR.ORG with the message body 
help: ways_to_get_rfcs.  For example:

        To: rfc-info@RFC-EDITOR.ORG
        Subject: getting rfcs

        help: ways_to_get_rfcs

Requests for special distribution should be addressed to either the
author of the RFC in question, or to RFC-Manager@RFC-EDITOR.ORG.  Unless
specifically noted otherwise on the RFC itself, all RFCs are for
unlimited distribution.

Submissions for Requests for Comments should be sent to
RFC-EDITOR@RFC-EDITOR.ORG.  Please consult RFC 2223, Instructions to RFC
Authors, for further information.


Joyce K. Reynolds and Alegre Ramos
USC/Information Sciences Institute

...

Below is the data which will enable a MIME compliant Mail Reader 
implementation to automatically retrieve the ASCII version
of the RFCs.

- --NextPart
Content-Type: Multipart/Alternative; Boundary="OtherAccess"

- --OtherAccess
Content-Type:  Message/External-body;
        access-type="mail-server";
        server="RFC-INFO@RFC-EDITOR.ORG"

Content-Type: text/plain
Content-ID: <990309161746.RFC@RFC-EDITOR.ORG>

RETRIEVE: rfc
DOC-ID: rfc2525

- --OtherAccess
Content-Type:   Message/External-body;
        name="rfc2525.txt";
        site="ftp.isi.edu";
        access-type="anon-ftp";
        directory="in-notes"

Content-Type: text/plain
Content-ID: <990309161746.RFC@RFC-EDITOR.ORG>

- --OtherAccess--
- --NextPart--


------- End of Forwarded Message


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 10 15:14:49 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA16172
	for <tcpimpl-archive@odin.ietf.org>; Wed, 10 Mar 1999 15:14:48 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id LAA11938
	for tcp-impl-outgoing; Wed, 10 Mar 1999 11:41:33 -0500 (EST)
Received: from sophia.inria.fr (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id LAA08468
	for <tcp-impl@lerc.nasa.gov>; Wed, 10 Mar 1999 11:32:52 -0500 (EST)
Received: from sophia.inria.fr by sophia.inria.fr (8.8.8/8.8.5) with ESMTP id RAA06655 for <tcp-impl@lerc.nasa.gov>; Wed, 10 Mar 1999 17:32:42 +0100 (MET)
X-Authentication-Warning: sophia.inria.fr: Host clope.inria.fr [138.96.48.13] claimed to be sophia.inria.fr
Message-ID: <36E69EA8.2C49B333@sophia.inria.fr>
Date: Wed, 10 Mar 1999 17:32:40 +0100
From: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
Organization: MISTRAL - INRIA Sophia Antipolis
X-Mailer: Mozilla 4.5 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: tcp-impl@lerc.nasa.gov
Subject: Bandwidth estimation
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hello,

Two weeks ago, I talked about using some kind of packet spacing to send many
packets at the beginning of slow start, so as to save some RTTs and to
reduce the time required to reach the available bandwidth. It is clear
that such a solution involves some implicit bandwidth estimation.
An additional mechanism is required to estimate the bandwidth on the path
and to set the timer accordingly. This lets us avoid the situation Neal
mentioned, that of a slow modem and large packets. As for how to estimate
the bandwidth, I think that the Timestamp option provides us with the
required information. Namely, the time elapsed between the transmission of
two ACKs, together with the number of packets acked, can give the source an
idea of the bottleneck bandwidth. Moreover, the time between the beginning
of the connection and the obtaining of such information guarantees that
the network is not very congested (because a certain number of packets
have succeeded in crossing it), and hence a faster increase of the window
with spaced packets is possible. We can use the same algorithm
proposed by Hoe to estimate the bandwidth and then the slow start
threshold. The difference is that here, once the bandwidth is
estimated, a large window is transmitted by spacing the packets.

But here I ask: once such an estimate is obtained, why not profit
from it to spread out any subsequent burst sent by TCP, even in
congestion avoidance? It is known that the loss and the compression of
ACKs cause burstiness at the source. With some estimate of the
bandwidth, these bursts can be spread out and some losses avoided (this
can be beneficial on asymmetric paths). Do you think that some kind of
bandwidth estimation and packet spreading is feasible and forms a
practical solution? I see this as a way to decouple the transmission
rate of the source from the arrival rate of ACKs (so as not to rely on
the ACK clock when it is disturbed). With asymmetric paths and reverse
traffic, the ACK stream doesn't represent what is happening on the
forward path.
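
To make the idea concrete, here is a minimal sketch in C of the two
calculations involved: a bandwidth estimate from the spacing of two ACKs,
and the packet spacing that follows from it. The function names, the
microsecond units, and the assumption that the ACK send times have
already been recovered (for instance from the Timestamp option) are all
illustrative assumptions, not part of any existing implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimate the bottleneck bandwidth in bytes per
 * second.  first_ack_us and second_ack_us are the times (microseconds)
 * at which the receiver sent two ACKs; bytes_acked is the amount of
 * data newly covered by the second ACK. */
uint64_t est_bandwidth_bytes_per_sec(uint64_t first_ack_us,
                                     uint64_t second_ack_us,
                                     uint64_t bytes_acked)
{
    assert(second_ack_us > first_ack_us);
    return bytes_acked * 1000000ULL / (second_ack_us - first_ack_us);
}

/* Hypothetical helper: gap (microseconds) to leave between packets of
 * seg_bytes so that they leave at roughly bw bytes per second. */
uint64_t pacing_interval_us(uint64_t seg_bytes, uint64_t bw)
{
    assert(bw > 0);
    return seg_bytes * 1000000ULL / bw;
}
```

For instance, two ACKs sent 10 ms apart that cover two 1460-byte packets
give an estimate of 292000 bytes per second, which corresponds to one
1460-byte packet every 5 ms.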

I am currently studying the feasibility of such a solution; any help
will be much appreciated.

Chadi


-- 
                    **  Chadi Mohamad BARAKAT  **
           http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                  /\
PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Thu Mar 11 19:17:15 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA16099
	for <tcpimpl-archive@odin.ietf.org>; Thu, 11 Mar 1999 19:17:14 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA17729
	for tcp-impl-outgoing; Thu, 11 Mar 1999 15:31:23 -0500 (EST)
Received: from ns1.siara.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id PAA16556
	for <tcp-impl@grc.nasa.gov>; Thu, 11 Mar 1999 15:26:04 -0500 (EST)
Received: from [192.168.1.48] by ns1.siara.com
          via smtpd (for fw01.lerc.nasa.gov [139.88.145.14]) with SMTP; 11 Mar 1999 20:51:13 UT
Received: from red.mtv.siara.com by siara.com with smtp
	id m10LC1W-001xi2C; Thu, 11 Mar 1999 12:25:42 -0800 (PST)
Received: from red.mtv.siara.com by red.mtv.siara.com (8.8.7) id MAA00329; Thu, 11 Mar 1999 12:26:45 -0800 (PST)
Message-Id: <199903112026.MAA00329@red.mtv.siara.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: tcp-impl@grc.nasa.gov
Subject: Nagle on send
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 11 Mar 1999 12:26:45 -0800
From: Greg Minshall <minshall@siara.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hi.  Thanks for people's responses to my query of "what does Nagle on send 
mean?".

Here is what i think the variants are:

1.  FreeBSD 2.2.5, for example, says that if the send is bigger than an MTU
but isn't too big (i.e., no bigger than MCLBYTES), and there is no outstanding
unacknowledged data, then all the data in the send, including any residual
data making up only a small packet, will be transmitted.

2.  What i think most people would agree on:  If the send is bigger than an
MTU, then even if there is unacknowledged data, all the data in the send
(no matter how large the send was, more or less), including any residual data
making up only a small packet, will be transmitted.

3.  (I think this is what Vernon has in mind.)  If the send is not big enough 
and there is data in the pipe and NODELAY is not set, then don't call 
tcp_output().  And (if i understand correctly) don't do any Nagle in 
tcp_output().


OK, so forget #1 above (it doesn't match what most of us mean by "Nagle on
send").

(I'm going to concentrate on #2.  I have a problem with #3: i think
it means that the next ACK received will cause a small packet to be generated,
even if that ACK doesn't acknowledge all outstanding data, and i'm not
sure that is a good idea.  Also, the *intent* (and, i guess, mostly the
*effect*) of #3 is the same as that of #2.)

Case #2 breaks into two categories, based on what happens if the current 
window/cwnd doesn't allow you to immediately transmit all the data from the 
send:

2a.  If because of window issues all the data can't be transmitted, then 
future transmissions will be constrained by things like "traditional Nagle".

2b.  If because of window issues all the data can't be transmitted, then 
future transmissions will *not* be constrained by things like "traditional 
Nagle".


I think one would prefer 2b.  I.e., if the application makes a 64KB send()
into a somewhat full pipe (i.e., data already in flight), then you'd like to
send the final 1296 (or whatever) bytes as soon as the window allows, rather
than waiting for all the data to drain from the pipe and for the
acknowledgment of the final 1460 byte segment.
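
A minimal sketch in C of how 2b differs from the traditional check may
help.  Everything here is a hypothetical illustration: the struct, the
field and function names, and the use of the MSS as the "big send"
threshold are assumptions, and the window/cwnd test is assumed to happen
elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state, not taken from any real stack. */
struct conn {
    uint32_t mss;                /* maximum segment size in bytes */
    uint32_t unacked;            /* bytes sent but not yet acknowledged */
    bool     nodelay;            /* application set TCP_NODELAY */
    bool     large_send_pending; /* queued data is the tail of a send()
                                    that was larger than one MSS */
};

/* Traditional Nagle: hold back a sub-MSS segment while any data is
 * still unacknowledged. */
bool may_send_traditional(const struct conn *c, uint32_t seg_len)
{
    return seg_len >= c->mss || c->nodelay || c->unacked == 0;
}

/* Variant 2b: the residual of a large send may go out as soon as the
 * window allows, even with data outstanding. */
bool may_send_2b(const struct conn *c, uint32_t seg_len)
{
    return c->large_send_pending || may_send_traditional(c, seg_len);
}

/* Size of the final short segment of a send of send_len bytes. */
uint32_t residual_bytes(uint32_t send_len, uint32_t mss)
{
    assert(mss > 0);
    return send_len % mss;
}
```

With a 65536-byte send and a 1460-byte MSS, residual_bytes() gives the
1296 bytes mentioned above; with data still outstanding, the traditional
check holds that segment back while the 2b check lets it go.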

Thoughts?

Greg


From owner-tcp-impl@lerc.nasa.gov  Thu Mar 11 21:02:20 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA17185
	for <tcpimpl-archive@odin.ietf.org>; Thu, 11 Mar 1999 21:02:20 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA23183
	for tcp-impl-outgoing; Thu, 11 Mar 1999 18:16:12 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA22490
	for <tcp-impl@grc.nasa.gov>; Thu, 11 Mar 1999 18:12:22 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id QAA04048
	for tcp-impl@grc.nasa.gov  env-from <vjs>;
	Thu, 11 Mar 1999 16:12:18 -0700 (MST)
Date: Thu, 11 Mar 1999 16:12:18 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903112312.QAA04048@calcite.rhyolite.com>
To: tcp-impl@grc.nasa.gov
Subject: Re: Nagle on send
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Greg Minshall <minshall@siara.com>

> ...
> 2.  What i think most people would agree on:  If the send is bigger than an 
> MTU, then even if there is unacknowledged data, then all the data in the send 
> (no matter how large the send was, more or less), including any residual data 
> making up only a small packet, will be transmitted.
>
> 3.  (I think this is what Vernon has in mind.)  If the send is not big enough 
> and there is data in the pipe and NODELAY is not set, then don't call 
> tcp_output().  And (if i understand correctly) don't do any Nagle in 
> tcp_output().

No, I think I said 
    put the data in the socket buffer;
    if (!NODELAY)
        tcp_output();

There's nothing there about sizes.


> OK, so forget #1 above (it doesn't meet most of our idea of "Nagle on send").
>
>(I'm going to concentrate on #2, since i have a problem with #3, since i think 
>it means that the next ACK received will cause a small packet to be generated, 
> even if the next ACK doesn't acknowledge all outstanding data, and i'm not 
> sure that is a good idea.  Also, the *intent* (and, i guess, mostly the 
> *effect*) of #3 is the same as that of #2.)

    - silly window avoidance is not the Nagle algorithm, and vice versa.
       If your tcp_output() is not broken, then it won't generate any
       bad small packets.

    - if you have 1 byte waiting to send, 10K waiting to be Acked, and
       you receive an Ack for 1 MSS and with no data, then (ignoring SWS
       prevention), it is ok to send even with Nagle on because you know
       you're not going to be able to piggyback.   If the Ack came with
       data, then you should definitely send your tiny data (modulo SWS).

    - most of us agree that the intention or purpose of the Nagle algorithm
       is <<NOT>> to prevent small packets, but to piggyback small data
       on Acks instead of in their own packets.   That distinction is
       significant, and allows you to avoid being obsessed with the sizes of
       packets.  That obsession is a Bad Thing because it leads to
       additional TCB state, code, and debugging opportunities.

    - sending small packets is <<GOOD>> even with the Nagle algorithm on
       provided there is no opportunity for piggybacking, and provided
       you believe the algorithm should be applied to application send
       requests instead of in the bowels of the TCP state machine.

> Case #2 breaks into two categories, based on what happens if the current 
> ...

> Thoughts?

I think I'm opposed, because it is still not applying the Nagle algorithm
to application send requests, and it is still obsessing on the sizes of
packets instead of trying to utilize opportunities for piggybacking.


vjs


From owner-tcp-impl@lerc.nasa.gov  Fri Mar 12 22:33:47 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA23334
	for <tcpimpl-archive@odin.ietf.org>; Fri, 12 Mar 1999 22:33:47 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA08383
	for tcp-impl-outgoing; Fri, 12 Mar 1999 19:26:14 -0500 (EST)
Received: from smtp8.ny.us.ibm.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA08159
	for <tcp-impl@grc.nasa.gov>; Fri, 12 Mar 1999 19:25:48 -0500 (EST)
From: mrosu@us.ibm.com
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.117.200.22])
	by smtp8.ny.us.ibm.com (8.8.7/8.8.7) with ESMTP id TAA10898;
	Fri, 12 Mar 1999 19:25:26 -0500
Received: from D51MTA03.pok.ibm.com (d51mta03.pok.ibm.com [9.117.200.31])
	by northrelay02.pok.ibm.com (8.8.7m1/NCO v1.8) with SMTP id TAA83666;
	Fri, 12 Mar 1999 19:25:44 -0500
Received: by D51MTA03.pok.ibm.com(Lotus SMTP MTA v4.6.4  (817.1 3-4-1999))  id 85256733.000257E7 ; Fri, 12 Mar 1999 19:25:35 -0500
X-Lotus-FromDomain: IBMUS
To: floyd@acm.org, tomh@CS.Berkeley.EDU
cc: tcp-impl@grc.nasa.gov, mrosu@us.ibm.com
Message-ID: <85256733.0002563A.00@D51MTA03.pok.ibm.com>
Date: Fri, 12 Mar 1999 19:25:28 -0500
Subject: Counting ACKs in NewReno
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Hi,

I've a question regarding your draft (draft-ietf-tcpimpl-newreno-02):

I believe that the NewReno draft/RFC should be more specific about the
handling of the duplicate ACKs counter (dupacks) in step 5.

For both partial and full ACKs, I propose that dupacks should be
decremented by the number of segments acknowledged minus one. This is
similar to the draft's recommendation for updating cwnd upon receiving
a partial ACK. The proposed change will only affect the retransmissions
of segments 'around' the right edge of the window, i.e., segments with
sequence numbers 'around' recover.

The draft is unclear about dupacks handling when a partial ACK is
received. If dupacks is left unchanged upon receiving a partial ACK,
unnecessary transmissions may occur (see Example 1). Instead,
decrementing dupacks, as I propose, delays the retransmission of
a segment until there is enough evidence that three segments sent
after it were received (see Examples 1 and 2).

The draft recommends that dupacks be reset upon receiving a full ACK.
Example 3 shows how resetting dupacks prevents a useful retransmission
from happening. In addition, Example 3 shows how information useful for
segment retransmission is preserved upon exiting fast recovery if dupacks
is updated as I propose.

I've checked the tcp-impl archive and didn't find any discussion on
this issue. If this has already been discussed in the WG, please let
me know.

Looking forward to your comments.
Regards,
Marcel


The partial ACK case is described first, followed by the full ACK
case. I'll assume that ACK messages are never lost. Lost ACKs will
only make the proposed scheme more conservative...

A. The partial ACKs case:
From draft-ietf-tcpimpl-newreno, February 1999, page 4, step 5:

       If this ACK does *not* acknowledge all of the data up to and
       including "recover", then this is a partial ACK.  In this case,
       retransmit the first unacknowledged segment.  Deflate the
       congestion window by the amount of new data acknowledged, then
       add back one MSS and send a new segment if permitted by the new
       value of cwnd.  This "partial window deflation" attempts to
       ensure that, when Fast Recovery eventually ends, approximately
       ssthresh amount of data will be outstanding in the network.

       >> Do not clear the counter recording the number of duplicate
       acknowledgements (i.e., do not exit the Fast Recovery procedure).<<

Should we understand that dupacks should not be changed upon receiving
a partial ACK?  If this is the case, the next example shows a situation
in which the draft recommends retransmitting a segment too early.

Example 1.
Consider the following sequence of eight segments, with an initial
congestion window (cwnd) of 8.  The 1st and the 7th segment in the
sequence are lost:
[1] 2 3 4 5 6 [7] 8    ([ ] =  lost segments)

After sending segments 1-8, the source receives three duplicate acks,
and it retransmits segment 1. Immediately before receiving the partial
ACK, which acknowledges segment 6, cwnd is 10 = 4 (half the initial
cwnd, 8/2) + 6 (duplicate ACKs generated by segments 2-6, 8).

When the partial ack arrives, dupacks is 6 and the draft says only
that it shouldn't be reset.  As dupacks >= 3, should segment 7 be
retransmitted immediately?

Since only two segments sent after 7 (segment 8 and the retransmission
of 1) were acked by the destination, it is probably too early to
retransmit segment 7.

I believe that, upon receiving a partial ACK, the sender should
decrement dupacks by the number of segments acknowledged minus one.
In this example, the resulting dupacks is 1 (6 - (6 - 1)). If we want
to be more aggressive and count the retransmission of segment 1 as a
segment sent after 7 but acked before 7, then the new value of dupacks
should be 2. Either way, I don't think that segment 7 should be
retransmitted at this time.

According to the draft, upon receiving the partial ack, cwnd is
deflated by 6 (the amount of data acknowledged) and inflated by 1. The
new congestion window is 5 (10 - 6 + 1). Based on the new value of
cwnd, the source will send three new segments. It is probably closer
to the spirit of the "Fast Retransmit" algorithm to retransmit segment
7 only after (if ever) two acks (or one ack if we count the retransmission
of 1) for segment 6 are received. Alternatively, if an ACK for segments
7, 8, 9, or 10 is received before dupacks becomes 3, the proposed rule
for updating dupacks eliminates an unnecessary retransmission.

The next example shows that the proposed rule for updating the dupacks
counter doesn't prevent partial ACKs from generating retransmissions.

Example 2.
Consider an initial cwnd of 10, where messages 1 and 7 are lost and
messages 2-6 and 8-10 are received. If dupacks is decremented upon
receiving a partial ack (for segment 6), then dupacks is updated to
8 - 5 = 3. You can think of the new value of dupacks as representing the
three duplicate ACKs generated by the receipt of segments 8, 9, and 10.
As the new value of dupacks is >= 3, segment 7 is retransmitted
immediately.
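
The proposed rule, as applied in Examples 1 and 2, can be sketched in C
as follows.  The function names and the clamp at zero are assumptions
made for illustration; the draft does not define them.

```c
#include <assert.h>

/* Proposed rule: on a partial or full ACK, decrement dupacks by the
 * number of segments newly acknowledged minus one.  Clamping at zero
 * is an assumption, not something stated in the proposal. */
int update_dupacks(int dupacks, int segs_acked)
{
    assert(dupacks >= 0 && segs_acked >= 1);
    int updated = dupacks - (segs_acked - 1);
    return updated > 0 ? updated : 0;
}

/* Usual Fast Retransmit threshold of three duplicate ACKs. */
int should_retransmit(int dupacks)
{
    return dupacks >= 3;
}
```

In Example 1, update_dupacks(6, 6) leaves dupacks at 1, so segment 7 is
not yet retransmitted; in Example 2, update_dupacks(8, 6) leaves it at 3,
so segment 7 is retransmitted at once.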

B. The full ACK case:
From draft-ietf-tcpimpl-newreno, February 1999, page 4, step 5:

       If this ACK acknowledges all of the data up to and including
       "recover", then the ACK acknowledges all the intermediate
       segments sent between the original transmission of the lost
       segment and the receipt of the third duplicate ACK.  Set cwnd to
       either (1) min (ssthresh, FlightSize + MSS); or (2) ssthresh,
       where ssthresh is the value set in step 1; this is termed
       "deflating" the window.  (We note that "FlightSize" in step 1
       referred to the amount of data outstanding in step 1, when Fast
       Recovery was entered, while "FlightSize" in step 5 refers to the
       amount of data outstanding in step 5, when Fast Recovery is
       exited.) If the second option is selected, the implementation
       should take measures to avoid a possible burst of data, in case
       the amount of data outstanding in the network was much less than
       the new congestion window allows [HTH98].  >> Clear the counter
       recording the number of duplicate acknowledgements, exiting the
       Fast Recovery procedure. <<


Example 3.
Consider an initial cwnd of 9, where messages 1 and 7 are lost and
messages 2-6, 8, and 9 are received. recover is set to 9 and ssthresh
to 4. If we update dupacks as in example 1, upon receiving a partial ack
for segment 6, segment 7 is not retransmitted immediately because the
new value of dupacks is 2. However, the new cwnd is 6 and three new data
segments (10, 11, 12) are sent. Let's assume segment 10 is dropped:
[1] 2 3 4 5 6 [7] 8 9 [10] 11 12 13   ([ ] = lost segments)

The sender receives two additional ACKs for segment 6 (dupacks == 4),
retransmits 7, and gets an ACK for segment 9. As recover is 9, this last
ACK is a full ACK.

According to the draft, upon receiving a full ACK, the source sets
dupacks to zero. In addition, the source sets cwnd to ssthresh (= 4)
and it sends a new data segment, 13. The receipt of segment 13 generates
a second ACK for segment 9 and dupacks is incremented from 0 to 1.

Clearing dupacks erases all the information the source has recorded
on segments 10, 11, and 12. Ideally, dupacks should be reset to 2
in order to account for segments 11 and 12.

If we apply the proposed rule for updating dupacks and decrement it
by the number of segments acknowledged minus one, the resulting value
is 2 (= 4 - 2).  After the second ACK for 9 is received (the ACK generated
by segment 13), segment 10 can be retransmitted.

Please correct me if I am wrong.


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 15 07:27:40 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA19933
	for <tcpimpl-archive@odin.ietf.org>; Mon, 15 Mar 1999 07:27:39 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id CAA28422
	for tcp-impl-outgoing; Mon, 15 Mar 1999 02:31:14 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id CAA28249
	for <tcp-impl@grc.nasa.gov>; Mon, 15 Mar 1999 02:28:09 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with SMTP id XAA24180;
	Sun, 14 Mar 1999 23:27:50 -0800 (PST)
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id XAA24901; Sun, 14 Mar 1999 23:27:46 -0800
Received: (from kcpoon@localhost)
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) id XAA15344;
	Sun, 14 Mar 1999 23:27:46 -0800 (PST)
Date: Sun, 14 Mar 1999 23:27:46 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Message-Id: <199903150727.XAA15344@jurassic.eng.sun.com>
To: mrosu@us.ibm.com
Subject: Re: Counting ACKs in NewReno
Cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Included message from mrosu@us.ibm.com:

>----
>I've a question regarding your draft (draft-ietf-tcpimpl-newreno-02):

The following is my understanding of NewReno and its variants.  The 2 basic
ideas behind NewReno are stated in the draft.  They are:

1. Multiple drops of TCP segments in a window should be treated as a
single congestion event.
2. It is safe to infer from a partial ACK during the fast retransmit
phase that the next unack'ed segment is lost, and to retransmit it
immediately.

Note that there is no mention of dup ack count in the above 2 ideas.  The
count is an implementation detail about when to start the fast retransmit
phase.

Reading what you suggested, it seems to me that you do not agree that
2 above is a good idea.  I infer that you think TCP should not retransmit a
segment unless 3 dup acks of the previous segment are received.  While this
may be a good idea, I think it has its shortcomings.  Let's look at
your example.

>Example 1.
...
>[1] 2 3 4 5 6 [7] 8    ([ ] =  lost segments)
...

>When the partial ack arrives, dupacks is 6 and the draft says only
>that it shouldn't be reset.  As dupacks >= 3, should segment 7 be
>retransmitted immediately?

Idea 2 says that it should.  

...
>The next example shows that the proposed rule for updating the dupacks
>counter doesn't prevent partial ACKs from generating retransmissions.
...

Let's consider a simpler example.  Using your notation, with cwnd = 5, the
following segments are sent.

[1] [2] 3 4 5

If we use the implementation suggested by the draft, seg 1 is fast
retransmitted as with the old algorithm.  If the retransmitted seg 1 is not
lost, the partial ack of 1 will trigger the retransmission of seg 2.  Note
that because of the small cwnd, no new segments are sent during the fast
retransmit phase.  But with your suggestion about dup ack count, seg 2
will not be retransmitted.  And because no new acks will come back, TCP
has to time out.  This is what NewReno tries to avoid.

...
>Example 3.
>Consider an initial cwnd of 9, message 1 and 7 are lost, messages 2-6,
>8, and 9 are received. recover is set to 9 and ssthresh to 4. If we update
>dupacks as in example 1, upon receiving a partial ack for segment 6,
>segment 7 is not retransmitted immediately because the new value of
>dupacks is 2. However, the new cwnd is 6 and three new  data segments
>(10, 11, 12) are sent. Lets assume segment 10 is dropped:
>[1] 2 3 4 5 6 [7] 8 9 [10] 11 12 13   ([ ] = lost segments)

NewReno will retransmit seg 7 immediately after getting a partial ack for
seg 6 because TCP is still in fast retransmit phase.  It seems to me that
you are mixing your suggestion with NewReno.

Let's look at what NewReno will do in your example more closely.  Note the
differences from the description in your mail.

1. Seg 1 is fast retransmitted after getting dup acks elicited by segs 2, 3,
and 4.  cwnd is set to 9/2 + 3 = 7 (with 9/2 rounded down to 4).
2. Dup acks elicited by segs 5, 6, 8, and 9 increase cwnd to 11.  Segs 10
and 11 are sent.
3. Partial ack for 6 is received, seg 7 is retransmitted.  cwnd is set to 11
- 6 + 1 = 6.  Seg 12 can be sent.
4. Dup ack elicited by seg 11 is received, cwnd is set to 7.  Seg 13 can be
sent.
5. "Complete" ack for seg 9 is received, cwnd is set to 4.  Fast retransmit
phase ends.  No new seg can be sent.
6. Dup acks elicited by segs 12 and 13 are received.  Dup ack count is 2, no
fast retransmit will happen.  TCP has to time out.
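
The cwnd arithmetic in steps 1 to 5 can be replayed with a small sketch
in C.  It counts cwnd and ssthresh in whole segments, as the walkthrough
does, and uses option (2) of the draft (cwnd = ssthresh) on the full ack.
The struct and function names are hypothetical, and the sketch leaves
out everything except the window arithmetic.

```c
#include <assert.h>

/* Hypothetical sender state, in units of segments. */
struct newreno {
    int cwnd;
    int ssthresh;
};

/* Third dup ack: halve the window, then inflate by the three dup acks. */
void enter_fast_retransmit(struct newreno *s)
{
    s->ssthresh = s->cwnd / 2;   /* integer division: 9/2 == 4 */
    s->cwnd = s->ssthresh + 3;
}

/* Each further dup ack inflates cwnd by one segment. */
void on_dupack(struct newreno *s)
{
    s->cwnd += 1;
}

/* Partial ack: deflate by the segments acknowledged, add back one. */
void on_partial_ack(struct newreno *s, int segs_acked)
{
    assert(segs_acked >= 1);
    s->cwnd = s->cwnd - segs_acked + 1;
}

/* Full ack: deflate to ssthresh and leave the fast retransmit phase. */
void on_full_ack(struct newreno *s)
{
    s->cwnd = s->ssthresh;
}
```

Starting from cwnd = 9, the calls enter_fast_retransmit(), four times
on_dupack(), on_partial_ack() with 6 segments, one more on_dupack(), and
on_full_ack() give cwnd values of 7, 11, 6, 7, and 4, matching the steps
above.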

Although the end result is the same as you described, there are some subtle
differences in the sequence of events.  And one may argue that this time out
is a good thing.  In fast retransmit phase, the sender has already slowed
down.  Seg 10 is sent in this phase.  If segment loss is solely because of
congestion, the loss of seg 10 indicates that the network is congested even
after the sending rate is dropped.  It is good for TCP to slow down further.
This conforms to the idea that multiple drops in a window should be treated
as one single congestion event.  Seg 10 is outside the original window.  It
should be treated as another congestion event.  That is why dup ack count
is reset to 0.

Note that NewReno does not try to recover all segment drops.  It just tries
"harder" than the original fast retransmit algorithm.  Can your suggestion
be added to NewReno?  Maybe.  But there are several things you need to
clarify first.

1. You have to decide what a congestion event is.  You cannot keep on 
manipulating the dup ack count.  When should this end?  Does TCP halve its
window whenever dup ack count equals 3?
2. I think idea 2 of NewReno is pretty safe and very helpful when drops happen
in a burst.  Do you agree that we should not change it?  If not, why?

Basically you need to describe the reasons behind your suggestion and then
tell us how your suggestion should be integrated into NewReno.  As mentioned
in the draft (isn't it an RFC now?), there are many variations to NewReno.
You are encouraged to describe yours.

>----


							K. Poon
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 15 08:46:22 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id IAA20528
	for <tcpimpl-archive@odin.ietf.org>; Mon, 15 Mar 1999 08:46:21 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id FAA13532
	for tcp-impl-outgoing; Mon, 15 Mar 1999 05:06:09 -0500 (EST)
Received: from sophia.inria.fr (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id FAA13211
	for <tcp-impl@grc.nasa.gov>; Mon, 15 Mar 1999 05:01:17 -0500 (EST)
Received: from sophia.inria.fr by sophia.inria.fr (8.8.8/8.8.5) with ESMTP id LAA19785 for <tcp-impl@grc.nasa.gov>; Mon, 15 Mar 1999 11:01:13 +0100 (MET)
X-Authentication-Warning: sophia.inria.fr: Host clope.inria.fr [138.96.48.13] claimed to be sophia.inria.fr
Message-ID: <36ECDA67.BFB7878D@sophia.inria.fr>
Date: Mon, 15 Mar 1999 11:01:11 +0100
From: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
Organization: MISTRAL - INRIA Sophia Antipolis
X-Mailer: Mozilla 4.5 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
CC: tcp-impl@grc.nasa.gov
Subject: Re: Counting ACKs in NewReno
References: <85256733.0002563A.00@D51MTA03.pok.ibm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hello,

The idea behind the three duplicate ACKs of the fast retransmit
algorithm is to rule out packet reordering. It is assumed that the
receipt of three packets after the missing one is a good signal that
this packet was lost, so it is retransmitted without waiting for a
timeout. Now, in NewReno, the first loss in a window requires three
duplicate ACKs, but once a partial ACK is received, the source can
conclude directly that the next packet is lost and it doesn't need to
wait for any additional dup ACKs; for this reason it doesn't change its
dup ACK counter. Why? Because a partial ACK is most probably sent in
response to the retransmission of the previous loss in the same window,
so at least one round-trip time has passed since the first transmission
of the missing packet. This round trip is enough to decide that the
packet is lost. You can consider that the three dup ACKs counted at
the beginning of Fast Recovery are used by the source to fast retransmit
any further loss in the same window.

Now for the value of the dup ACK counter at the end of Fast Recovery: I
think that here the value of this counter doesn't matter much. Setting
it to zero may prevent the source from detecting some losses, as you
show in your example. But taking the number of duplicate ACKs into
account may also cause a problem, because we are not sure whether these
duplicate ACKs correspond to packets sent before or after the loss. A
duplicate ACK for a packet sent before the missing packet is not a
signal of the loss of this packet. I propose here not to rely on the
counter and instead to stretch the Fast Recovery phase to cover the new
packets sent during Fast Recovery. Let X be the highest sequence number
sent when the Fast Recovery phase is entered, and let Y be the highest
sequence number sent when the last loss in the window is retransmitted.
When an ACK for this loss is received, if this ACK asks for a packet
less than Y, then we can conclude that this packet is also lost without
waiting for dup ACKs, because at least one round-trip time has passed
since its transmission. Now if the ACK carries a sequence number greater
than or equal to Y, then the dup ACK counter can be set to zero and
normal Fast Retransmit can be used to detect the subsequent losses.
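
As a rough sketch only, the decision at the end of the stretched Fast
Recovery phase could look like the C below.  The enum and function names
are hypothetical, packets are identified by plain packet numbers, and
sequence-number wraparound is ignored; these are assumptions made for
illustration.

```c
#include <stdint.h>

enum recovery_action {
    RETRANSMIT_IMMEDIATELY, /* treat the requested packet as lost */
    EXIT_AND_RESET_DUPACKS  /* leave Fast Recovery, dup ACK counter = 0 */
};

/* ack_next: the packet the ACK asks for, received in response to the
 * retransmission of the last loss in the window.
 * y: the highest packet number sent when that retransmission was made. */
enum recovery_action on_ack_for_last_retransmit(uint32_t ack_next,
                                                uint32_t y)
{
    return ack_next < y ? RETRANSMIT_IMMEDIATELY
                        : EXIT_AND_RESET_DUPACKS;
}
```

For example, with Y = 12 an ACK asking for packet 10 leads to an
immediate retransmission of 10, while an ACK asking for packet 13 ends
Fast Recovery and resets the counter.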

Chadi
-- 
                    **  Chadi Mohamad BARAKAT  **
           http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                  /\
PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 15 18:22:31 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA25441
	for <tcpimpl-archive@odin.ietf.org>; Mon, 15 Mar 1999 18:22:31 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA20584
	for tcp-impl-outgoing; Mon, 15 Mar 1999 14:57:05 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA18999
	for <tcp-impl@grc.nasa.gov>; Mon, 15 Mar 1999 14:51:26 -0500 (EST)
Received: from engmail4.Eng.Sun.COM ([129.144.134.6])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id LAA28792;
	Mon, 15 Mar 1999 11:51:25 -0800 (PST)
Received: from shield.eng.sun.com (shield.Eng.Sun.COM [129.146.85.114])
	by engmail4.Eng.Sun.COM (8.8.8+Sun/8.8.8) with ESMTP id LAA08553;
	Mon, 15 Mar 1999 11:51:16 -0800 (PST)
Received: from shield (shield [129.146.85.114])
	by shield.eng.sun.com (8.9.1b+Sun/8.9.1) with SMTP id LAA12748;
	Mon, 15 Mar 1999 11:51:14 -0800 (PST)
Date: Mon, 15 Mar 1999 11:51:14 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: Counting ACKs in NewReno
To: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
Cc: tcp-impl@grc.nasa.gov
In-Reply-To: "Your message with ID" <36ECDA67.BFB7878D@sophia.inria.fr>
Message-ID: <Roam.SIMCSD.2.0.4.921527474.19927.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> during Fast Recovery. Let X be the highest sequence number sent when the
> Fast Recovery phase is called, let Y be the highest sequence number sent
> when the last loss in the window is retransmitted. When an ACK for this
> loss is received, if this ACK askes for a packet less than Y than we can
> conclude that this packet is also lost without waiting for dup ACKs
> because at least one Roundtrip time has passed since its transmission.

Should cwnd be adjusted for this kind of ack?  Should new segments be sent
during this "extension" period?  If yes, should the fast retransmit phase
keep on extending?  As I mentioned in a previous mail, NewReno treats this
kind of drop as another congestion event.  And depending on the number of
segments in flight, another fast retransmit can be triggered to recover
this kind of drop.  I think you need to clarify your suggestion in the
context of idea 1 of NewReno, as stated in my previous mail.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 16 04:58:16 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id EAA08272
	for <tcpimpl-archive@odin.ietf.org>; Tue, 16 Mar 1999 04:58:15 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id BAA08238
	for tcp-impl-outgoing; Tue, 16 Mar 1999 01:47:01 -0500 (EST)
Received: from smtp8.ny.us.ibm.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id BAA07546
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 01:41:40 -0500 (EST)
From: mrosu@us.ibm.com
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.117.200.22])
	by smtp8.ny.us.ibm.com (8.8.7/8.8.7) with ESMTP id BAA39052;
	Tue, 16 Mar 1999 01:41:19 -0500
Received: from D51MTA03.pok.ibm.com (d51mta03.pok.ibm.com [9.117.200.31])
	by northrelay02.pok.ibm.com (8.8.7m1/NCO v1.8) with SMTP id BAA256378;
	Tue, 16 Mar 1999 01:41:38 -0500
Received: by D51MTA03.pok.ibm.com(Lotus SMTP MTA v4.6.4  (817.1 3-4-1999))  id 85256736.0024BE27 ; Tue, 16 Mar 1999 01:41:19 -0500
X-Lotus-FromDomain: IBMUS
To: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
cc: tcp-impl@grc.nasa.gov, mrosu@us.ibm.com
Message-ID: <85256736.0024BCC6.00@D51MTA03.pok.ibm.com>
Date: Tue, 16 Mar 1999 01:41:20 -0500
Subject: Re: Counting ACKs in NewReno
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


   Hello,


   Now for the value of the dup ACK counter at the end of fast Recovery, I
   think that here the value of this counter doesn't count a lot.

Why? Because of lost ACKs?

   Setting
   it to zero may prohibit the source from detecting some losses as you
   show in your example. Also taking the number of duplicate ACKs into
   account may cause a problem because we are not sure if these duplicate
   ACKs correspond to packets sent before or after the loss. A duplicate
   ACK for a packet sent before the missing packet is not a signal for the
   loss of this packet.

At the end of Fast Recovery, if you update the dup ACK counter after
each partial or full ACK as I proposed, you have a useful count of the
packets sent after the loss and received by the other end. As long as
you don't overestimate this count, any number should be more useful than zero.

Marcel

   I propose here to not count on the counter and
   instead to stretch the Fast Recovery phase to cover the new packets sent
   during Fast Recovery. Let X be the highest sequence number sent when the
   Fast Recovery phase is called, let Y be the highest sequence number sent
   when the last loss in the window is retransmitted. When an ACK for this
   loss is received, if this ACK asks for a packet less than Y then we can
   conclude that this packet is also lost without waiting for dup ACKs
   because at least one Roundtrip time has passed since its transmission.
   Now if the ACK carries a sequence number more than or equal to Y then
   dup ACK can be set to zero and normal Fast Retransmit can be used to
   detect the subsequent losses.

   Chadi
   --
                       **  Chadi Mohamad BARAKAT  **
              http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                     /\
   PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
   2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
   06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 16 05:28:15 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id FAA08507
	for <tcpimpl-archive@odin.ietf.org>; Tue, 16 Mar 1999 05:28:15 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id BAA06123
	for tcp-impl-outgoing; Tue, 16 Mar 1999 01:27:07 -0500 (EST)
Received: from smtp8.ny.us.ibm.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id BAA05901
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 01:24:25 -0500 (EST)
From: mrosu@us.ibm.com
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.117.200.22])
	by smtp8.ny.us.ibm.com (8.8.7/8.8.7) with ESMTP id BAA34792;
	Tue, 16 Mar 1999 01:23:33 -0500
Received: from D51MTA03.pok.ibm.com (d51mta03.pok.ibm.com [9.117.200.31])
	by northrelay02.pok.ibm.com (8.8.7m1/NCO v1.8) with SMTP id BAA265958;
	Tue, 16 Mar 1999 01:23:48 -0500
Received: by D51MTA03.pok.ibm.com(Lotus SMTP MTA v4.6.4  (817.1 3-4-1999))  id 85256736.002325E3 ; Tue, 16 Mar 1999 01:23:35 -0500
X-Lotus-FromDomain: IBMUS
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU,
        mrosu@us.ibm.com
Message-ID: <85256736.00231DBA.00@D51MTA03.pok.ibm.com>
Date: Tue, 16 Mar 1999 01:23:32 -0500
Subject: Re: Counting ACKs in NewReno
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Thank you very much for your comments.


   Included message from mrosu@us.ibm.com:

   >----
   >I've a question regarding your draft (draft-ietf-tcpimpl-newreno-02):


First, let me clarify my proposal: upon receiving a partial or full ack,
dupacks should be decremented by (the number of segments acknowledged - 1),
i.e., dupacks -= #segs_acked - 1.

Second, I don't think your counterexample is correct because the algorithm
I propose retransmits if dupacks >= 3, even if the dup acks counted don't
ack the previous segment:

   Let's consider a simpler example. Using your notation, with cwnd = 5,
   the following segments are sent.

   [1] [2] 3 4 5

   If we use the implementation suggested by the draft, seg 1 is fast
   retransmitted as with the old algorithm.  If the retransmitted seg 1
   is not lost, the partial ack of 1 will trigger the retransmission of seg 2.
   Note that because of the small cwnd, no new segments are sent during
   the fast retransmit phase.  But with your suggestion about dup ack count,
   seg 2 will not be retransmitted.

No. If you count dupacks as I suggested, segment 2 will be retransmitted
upon receiving the partial ack of 1.

Reason: before retransmitting 1, dupacks is 3. Upon receiving
the partial ack of 1, dupacks is decremented by 0, i.e. dupacks remains 3
and segment 2 is retransmitted immediately.

A segment is retransmitted if dupacks is >= 3. The duplicate acks don't
have to ack the previous segment.
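
A minimal sketch of the counter rule described above, in Python. The
names on_new_ack, should_retransmit and DUPACK_THRESHOLD are hypothetical
and do not come from the draft; the clamp at zero follows the remark later
in this message that dupacks should always be >= 0.

```python
DUPACK_THRESHOLD = 3  # retransmit once this many dup ACKs are counted


def on_new_ack(dupacks, segs_acked):
    """Update the counter on a partial or full ACK.

    Implements dupacks -= #segs_acked - 1, never going below zero.
    """
    return max(0, dupacks - (segs_acked - 1))


def should_retransmit(dupacks):
    """A segment is retransmitted whenever dupacks >= 3."""
    return dupacks >= DUPACK_THRESHOLD


# The "[1] [2] 3 4 5" case: dupacks is 3 when segment 1 is retransmitted.
dupacks = 3
# The partial ACK of 1 acknowledges one segment, so the decrement is 0.
dupacks = on_new_ack(dupacks, segs_acked=1)
print(dupacks, should_retransmit(dupacks))  # 3 True
```

With the counter still at 3, segment 2 would be retransmitted as soon as
the partial ACK arrives, which is the behaviour claimed above.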

   And because no new acks will come back, TCP has to time out. This is what
   NewReno tries to avoid.

If you decrement dupacks upon receiving a full or partial ack, as I
suggested, you don't always need new acks in order to retransmit.  For
segments close to the left edge of the sender window (the window when fast
retransmit is initiated) both 'old' and 'new' algorithms have the same
behaviour. For segments close to and beyond the right edge of the sender
window, the two algorithms behave differently.

   The following is my understanding of NewReno and its variants. The 2 basic
   ideas behind NewReno are stated in the draft.  They are:

   1. Multiple drops of TCP segments in a window should be treated as a
   single congestion event.
   2. It is safe to infer from a partial ACK during fast retransmit phase
   that the next unack'ed segment is lost, retransmit it immediately.

   Note that there is no mention of dup ack count in the above 2 ideas.  The
   count is an implementation detail about when to start the fast retransmit
   phase.

   Reading from what you suggested, it seems to me that you did not agree that
   2 above is a good idea.

I didn't say that 2 is not a good idea. What I'm saying is that within
the window, counting dupacks more carefully will get you about the same
results as the old algorithm but with less risk. Furthermore, as you advance
beyond the right edge of the window, the new algorithm preserves "hard
to collect info" (# of dupacks) and having this info helps you avoid timeouts
when segments immediately to the right of the window are lost.


   I infer that you think TCP should not retransmit a segment unless 3 dup acks
   of the previous segment are received.

Sorry for the confusion. I propose that TCP should not retransmit
unless dupacks >= 3. The dup acks that are counted don't have to ack
the previous segment or even the same segment. Let's go back to example 1.

   While this may be a good idea, I think it has its shortcomings.
   Let's look at your example.

   >Example 1.
   ...
   >[1] 2 3 4 5 6 [7] 8    ([ ] =  lost segments)

   >When the partial ack arrives, dupacks is 6 and the draft says only
   >that it shouldn't be reset.  As dupacks >= 3, should segment 7 be
   >retransmitted immediately?

   Idea 2 says that it should.

Segment 8 generates a dup ack for segment 1.
The retransmission of 1 generates a partial ack for 6 and segments
9 and 10 are transmitted. The dupacks counter is decremented by 5,
from 6 to 1. Segments 9 and 10 generate two acks for 6, which are both
duplicate acks. The dupacks counter is incremented twice by one,
it becomes 3 and  segment 7 is retransmitted.

In summary, 7 is retransmitted because dupacks is >= 3. One of the three
dup acks counted is for 1, the other two for 6.
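
The counter values in Example 1 can be replayed with a short Python
sketch. The helper on_new_ack is a hypothetical name for the proposed
decrement rule; only the counter is modelled, not cwnd or the actual
segment transmissions.

```python
def on_new_ack(dupacks, segs_acked):
    """Proposed rule: dupacks -= #segs_acked - 1, floored at zero."""
    return max(0, dupacks - (segs_acked - 1))


# [1] 2 3 4 5 6 [7] 8: segments 2-6 and 8 each elicit a dup ACK for 1.
dupacks = 6

# Retransmitted segment 1 arrives; the partial ACK covers segments 1-6.
dupacks = on_new_ack(dupacks, segs_acked=6)
after_partial = dupacks

# New segments 9 and 10 each elicit a dup ACK for 6.
dupacks += 2
retransmit_7 = dupacks >= 3

print(after_partial, dupacks, retransmit_7)  # 1 3 True
```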
   ...
   >Example 3.
   >Consider an initial cwnd of 9, message 1 and 7 are lost, messages 2-6,
   >8, and 9 are received. recover is set to 9 and ssthresh to 4. If we update
   >dupacks as in example 1, upon receiving a partial ack for segment 6,
   >segment 7 is not retransmitted immediately because the new value of
   >dupacks is 2. However, the new cwnd is 6 and three new  data segments
   >(10, 11, 12) are sent. Let's assume segment 10 is dropped:
   >[1] 2 3 4 5 6 [7] 8 9 [10] 11 12 13   ([ ] = lost segments)

   NewReno will retransmit seg 7 immediately after getting a partial ack for
   seg 6 because TCP is still in fast retransmit phase.  It seems to me that
   you are mixing your suggestion with NewReno.

Again, sorry for mixing the two. Example 3 argues against clearing the
dupacks counter upon receiving a full ack.

   Let's look at what NewReno will do in your example more closely.  Note the
   differences from the description in your mail.

   1. Seg 1 is fast retransmitted after getting dup acks elicited by segs 2, 3,
   and 4.  cwnd is set to 9/2 + 3 = 7.
   2. Dup acks elicited by segs 5, 6, 8, and 9 increases cwnd to 11.  Segs 10
   and 11 are sent.
   3. Partial ack for 6 is received, seg 7 is retransmitted.  cwnd is set to 11
   - 6 + 1 = 6.  Seg 12 can be sent.
   4. Dup ack elicited by seg 11 is received, cwnd is set to 7.  Seg 13 can be
   sent.
   5. "Complete" ack for seg 9 is received, cwnd is set to 4.  Fast retransmit
   phase ends.  No new seg can be sent.
   6. Dup ack elicited by segs 12, 13 are received.  Dup ack count is 2, no
   fast retransmit will happen.  TCP has to time out.

   Although the end result is the same as you described, there are some subtle
   differences in the sequence of events.
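
The cwnd arithmetic in steps 1-5 of the quoted trace can be checked with
a small Python sketch. It assumes cwnd is counted in whole segments and
that ssthresh is the initial window of 9 halved with integer division;
the function name is hypothetical.

```python
def newreno_example3_cwnd():
    """Return cwnd (in segments) after each of steps 1-5 of the trace."""
    trace = []
    ssthresh = 9 // 2        # initial cwnd of 9, halved -> 4
    cwnd = ssthresh + 3      # step 1: fast retransmit of seg 1
    trace.append(cwnd)
    cwnd += 4                # step 2: dup ACKs elicited by segs 5, 6, 8, 9
    trace.append(cwnd)
    cwnd = cwnd - 6 + 1      # step 3: partial ACK covers 6 segments
    trace.append(cwnd)
    cwnd += 1                # step 4: dup ACK elicited by seg 11
    trace.append(cwnd)
    cwnd = ssthresh          # step 5: full ACK ends the phase
    trace.append(cwnd)
    return trace


print(newreno_example3_cwnd())  # [7, 11, 6, 7, 4]
```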

You're right, there are differences in the behaviour of the two algorithms
for the segments inside the window, i.e., 1 to 9. But these differences are
minor: the algorithm I suggested will retransmit 7, although a little later
than NewReno.

But there is also an important difference: the algorithm I suggested will
retransmit 10 without a timeout while NewReno times out because it drops all
the info on the dup acks generated by the segments outside of the window,
i.e., segments 11, 12, and 13.

   And one may argue that this time out is a good thing.

Due to my limited experience, I can't agree or disagree with you. I thought
that timeouts and unnecessary retransmissions are always bad.

   In fast retransmit phase, the sender has already slowed
   down.  Seg 10 is sent in this phase.  If segment loss is solely because of
   congestion, the loss of seg 10 indicates that the network is congested even
   after the sending rate is dropped.  It is good for TCP to slow down further.

Agree.

   This conforms to the idea that multiple drops in a window should be treated
   as one single congestion event.  Seg 10 is outside the original window.  It
   should be treated as another congestion event.

I think that reducing the congestion window is the best way to slow down TCP.

   That is why dup ack count is reset to 0.

I don't see the benefit of clearing the dupacks counter. Please elaborate.

   Note that NewReno does not try to recover all segment drops.  It just tries
   "harder" than the original fast retransmit algorithm.  Can your suggestion
   be added to NewReno?  Maybe.

In its simplest form, I believe that it can be added to NewReno without
increasing NewReno's complexity or overhead significantly.

   But there are several things you need to clarify first.

I agree.

   1. You have to decide what a congestion event is.  You cannot keep on
   manipulating the dup ack count.  When should this end?

When there are no losses, the dupacks counter will go to zero rapidly.
Because of lost acks, there will be attempts to decrement the counter
below zero. Anyway, dupacks should always be >= 0.

   Does TCP halve its window whenever dup ack count equals 3?

No. The congestion window should be handled as in the current draft.

   2. I think idea 2 of NewReno is pretty safe and very helpful when drops happen
   in a burst.  Do you agree that we should not change it?  If no, why?

I'm not sure idea 2 is always safe. For instance,
consider an initial window of 5, with segments 1 and 5 lost:
[1] 2 3 4 [5].
How safe is it to retransmit 5 upon receiving a partial ack for 4?

Let me know what you think.

Anyway, unless my comment on your "[1] [2] 3 4 5" example is incorrect,
I believe the algorithm I suggested handles a burst of lost messages as
well as NewReno.

   Basically you need to describe the reasons behind your suggestion and then
   tell us how your suggestion should be integrated into NewReno.

Reasons behind my suggestion:
         1. avoid unnecessary retransmissions of segments inside and
            close to the right edge of the window, and
         2. avoid timeouts on segments outside and close to the right
            edge of the window.

The integration, if considered useful, should be relatively easy.

   As mentioned in the draft (isn't it an RFC now?), there are many
   variations to NewReno. You are encouraged to describe yours.
   >----

                                    K. Poon
                                    kcpoon@eng.sun.com


Thank you again for your comments,
Marcel


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 16 06:34:41 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id GAA08828
	for <tcpimpl-archive@odin.ietf.org>; Tue, 16 Mar 1999 06:34:40 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id DAA18291
	for tcp-impl-outgoing; Tue, 16 Mar 1999 03:12:03 -0500 (EST)
Received: from sophia.inria.fr (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id DAA16953
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 03:05:01 -0500 (EST)
Received: from sophia.inria.fr by sophia.inria.fr (8.8.8/8.8.5) with ESMTP id JAA08330; Tue, 16 Mar 1999 09:04:50 +0100 (MET)
X-Authentication-Warning: sophia.inria.fr: Host clope.inria.fr [138.96.48.13] claimed to be sophia.inria.fr
Message-ID: <36EE10A2.2F0987A2@sophia.inria.fr>
Date: Tue, 16 Mar 1999 09:04:50 +0100
From: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
Organization: MISTRAL - INRIA Sophia Antipolis
X-Mailer: Mozilla 4.5 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: mrosu@us.ibm.com
CC: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>, floyd@ee.lbl.gov,
        tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU
Subject: Re: Counting ACKs in NewReno
References: <85256736.00231DBA.00@D51MTA03.pok.ibm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

All you said, Marcel, is correct. The problem arises when a loss appears
near the end of the window (fewer than three packets before the end) and
you don't send enough new segments during Fast Recovery to bring dupacks
back up to 3. Take this counterexample:

If the source transmits packets 

[1] 2 3 [4] 5

I use the same notation for lost packets. The source receives first
three dup acks that correspond to the receipt of packets 2, 3 and 5.
dupacks is then 3 and the source retransmits packet 1, sets its window
to 2 and adds three to cwnd which becomes 5. Here no new segments are
transmitted. When packet 1 arrives at the destination, an ack asking for
packet 4 is transmitted. According to your proposal, when the source
receives this ACK, it decrements dupacks by two which becomes 1. Here
packet 4 cannot be retransmitted and a timeout is required. However
with the current version of New Reno it is retransmitted directly. I
think that it must be retransmitted because one Round trip time has
passed since its transmission and it is very probable that it is lost.

Now for my proposal about stretching the Fast Recovery phase: by
stretching I don't mean trying to recover the new packets sent during the
Fast Recovery phase without reducing the window further. I just want to
profit from the information that these packets were sent a long time ago
(at least one Round trip time). As in current New Reno, Fast Recovery
must always end when all the packets that were in the window when it was
called are acknowledged. Any loss in another round trip time is another
congestion signal and the window must be reduced again.

Let X and Y be the lowest and the highest sequence numbers transmitted when
Fast Recovery is called. X is lost. 

Let Z be the highest sequence number between X and Y that is also lost.
When the source receives the partial ACK asking for packet Z, it has
transmitted new packets until T with T>Y.
Here it retransmits Z and maybe some new packets beyond T. Packet Z
arrives at the destination which must ack all the packets till T if they
are correctly received of course. Suppose that a packet between Y and T
is also lost, call it U, we have the following figure

X  [Z]  Y  [U]  T    

Here the source receives an ACK asking for packet U. If it resets the
dupacks counter after leaving fast recovery and if the number of
segments sent after T is less than three, the detection of the loss of U
requires a timeout. Recall that all the dupack of segments X till T have
been already received when the ack asking for segment U arrived. However
if the source uses the dupack counter as proposed by Marcel, it will
consider the dup acks of packets sent between U and T with any dup ack
for a packet sent after T. This may also cause a failure of Fast
Retransmit if the number of these dup acks is less than three. However
this proposition is better than resetting the counter because it takes
account of dup acks for packets between U and T.

I propose here that the source should retransmit directly packet U
without waiting for three dupack or timeout. At least one round trip
time has passed since the transmission of U and it is very probable that
it is lost. How to do this??

The source stores in a variable the highest sequence number sent and it
updates this variable when it receives a partial ACK (less than Y). When
it leaves Fast Recovery and receives an ACK asking for a packet less
than this variable (T in our example), it retransmits this packet
directly.
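
The bookkeeping described above might look like the following Python
sketch. It is only an illustration under stated assumptions: packets are
numbered by whole-packet sequence numbers, an "ACK asking for packet n" is
passed as next_expected, and the class, method and attribute names
(StretchedRecovery, high_mark, last_ack) are hypothetical. Window
reduction and ordinary dup ACK counting are left out, and ACKs that do
not advance are ignored so the same packet is not retransmitted twice.

```python
class StretchedRecovery:
    """Tracks the highest packet sent so losses beyond Y can be resent."""

    def __init__(self):
        self.in_fast_recovery = False
        self.recover = 0     # Y: highest packet sent when recovery began
        self.high_mark = 0   # T: highest packet sent, updated on partial ACKs
        self.last_ack = 0    # highest next_expected seen so far

    def enter_fast_recovery(self, highest_sent):
        self.in_fast_recovery = True
        self.recover = highest_sent
        self.high_mark = highest_sent

    def on_ack(self, next_expected, highest_sent):
        """Return the packet to retransmit at once, or None."""
        if next_expected <= self.last_ack:
            return None      # duplicate ACK: left to normal dup ACK logic
        self.last_ack = next_expected
        if self.in_fast_recovery:
            if next_expected < self.recover:
                # Partial ACK (less than Y): remember how far we have sent.
                self.high_mark = highest_sent
                return next_expected
            self.in_fast_recovery = False   # window up to Y is acknowledged
        if next_expected < self.high_mark:
            # Sent at least one round trip ago: assume it is lost.
            return next_expected
        return None


# X=1, Y=8, Z=5, T=11, U=10 in the notation of the message.
r = StretchedRecovery()
r.enter_fast_recovery(highest_sent=8)
results = [
    r.on_ack(next_expected=5, highest_sent=11),   # partial ACK asking for Z
    r.on_ack(next_expected=10, highest_sent=12),  # ACK asking for U < T
    r.on_ack(next_expected=10, highest_sent=12),  # duplicate, ignored here
    r.on_ack(next_expected=13, highest_sent=13),  # nothing outstanding
]
print(results)  # [5, 10, None, None]
```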

What do you think?

Chadi


-- 
                    **  Chadi Mohamad BARAKAT  **
           http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                  /\
PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 16 07:11:27 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA08987
	for <tcpimpl-archive@odin.ietf.org>; Tue, 16 Mar 1999 07:11:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id DAA22213
	for tcp-impl-outgoing; Tue, 16 Mar 1999 03:47:08 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id DAA22011
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 03:44:39 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with SMTP id AAA12448;
	Tue, 16 Mar 1999 00:44:34 -0800 (PST)
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id AAA18647; Tue, 16 Mar 1999 00:44:29 -0800
Received: (from kcpoon@localhost)
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) id AAA21819;
	Tue, 16 Mar 1999 00:44:25 -0800 (PST)
Date: Tue, 16 Mar 1999 00:44:25 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Message-Id: <199903160844.AAA21819@jurassic.eng.sun.com>
To: mrosu@us.ibm.com
Subject: Re: Counting ACKs in NewReno
Cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Included message from mrosu@us.ibm.com:

>----
>First, let me clarify my proposal: upon receiving a partial or full ack, dupacks
>should be
>decremented by (the number of segments acknowledged - 1), i.e.,
>dupacks -= #segs_acked -1.

Oops, I misunderstood yours as

dup_ack_cnt = dup_ack_cnt - segs_acked - 1.

Now, I see where that - 1 comes from.

>A segment is retransmitted if dupacks is >= 3. The duplicate acks don't
>have to ack the previous segment.

Let me try another example.  Using your notation, cwnd = 5,

[1] 2 [3] 4 5

After fast retransmitting seg 1, partial ack for seg 2 comes back.  Then
dup_ack_cnt -= 2 - 1 = 2.  (I hope I get it right this time (-:)  So
using your idea, since dup_ack_cnt < 3, seg 3 will not be retransmitted.
cwnd is set to 5 - 2 + 1 = 3.  So no new segment can be sent.  TCP has
to time out.

>I didn't say that 2 is not a good idea. What I'm saying is that within
>the window, counting dupacks more carefully will get you about the same
>results as the old algorithm but with less risk. Furthermore, as you advance
>beyond the right edge of the window, the new algorithm preserves "hard
>to collect info" (# of dupacks) and having this info helps you avoid timeouts
>when segments immediately to the right of the window are lost.

I think we can separate your modification into 2 cases.  Using NewReno's
terms, partial ack and ack covers recover.  Firstly, I don't believe that it
is "risky" at all in retransmitting the first unack'ed segment after getting
a partial ack.  Fast retransmit is triggered because of 3 dup acks.  We
assume that there is no reordering, thus the segment is really dropped.  This
means that no partial ack or complete ack can be sent back until this
dropped segment is retransmitted and received by the receiver.  Thus
retransmitting a segment because of a partial ack is as "risky" as the first
fast retransmission.  They are all linked together.

Suppose for the moment, we just apply half of your idea, do the dup_ack_cnt
calculation but relax the dup_ack_cnt >= 3 rule.  If after leaving fast
retransmit phase, dup_ack_cnt is still > 0, can we infer that some of the
new segments sent during fast retransmit phase are also dropped?  I'll say
probably.

>Due to my limited experience, I can't agree or disagree with you. I thought
>that timeouts and unnecessary retransmissions are always bad.

If the network is very congested, what is not good for a connection can be
the best way to prevent network collapse.

>I don't see the benefit of clearing the dupacks counter. Please elaborate.

The dup_ack_cnt is just an implementation detail in NewReno which is used to
identify when fast retransmit starts and ends.  Resetting it to 0 just means
it is the end.

>When there are no losses, the dupacks counter will go to zero rapidly.
>Because of lost acks, there will be attempts to decrement the counter
>below zero. Anyway, dupacks should always be >= 0.
>No. The congestion window should be handled as in the current draft.

Actually, my intention is to ask when you stop the fast retransmit phase.
Do you stop only when dup_ack_cnt becomes 0 or a time out happens?  The
reason I ask is that a fast retransmit phase covers a single congestion
event.  If cwnd is adjusted as in the draft, the loss of a new segment sent
during fast retransmit phase is treated as the same congestion event.  I'd
suggest cwnd needs to be decremented when recovering for segments sent
beyond recover.

>I'm not sure idea 2 is always safe. For instance,
>consider an initial window of 5, with segments 1 and 5 lost:
>[1] 2 3 4 [5].
>How safe is to retransmit 5 upon receiving a partial ack for 4?

As I described above, it is as "safe" as the first fast retransmission
of seg 1, especially in this example.

>----


							K. Poon
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 16 07:37:02 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA09153
	for <tcpimpl-archive@odin.ietf.org>; Tue, 16 Mar 1999 07:37:01 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id EAA25350
	for tcp-impl-outgoing; Tue, 16 Mar 1999 04:17:02 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id EAA24745
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 04:10:54 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with SMTP id BAA17859;
	Tue, 16 Mar 1999 01:10:48 -0800 (PST)
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id BAA21283; Tue, 16 Mar 1999 01:10:42 -0800
Received: (from kcpoon@localhost)
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) id BAA22650;
	Tue, 16 Mar 1999 01:10:42 -0800 (PST)
Date: Tue, 16 Mar 1999 01:10:42 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Message-Id: <199903160910.BAA22650@jurassic.eng.sun.com>
To: Chadi.Barakat@sophia.inria.fr, mrosu@us.ibm.com
Subject: Re: Counting ACKs in NewReno
Cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Included message from "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>:

>----
>[1] 2 3 [4] 5
>
>I use the same notation for lost packets. The source receives first
>three dup acks that correspond to the receipt of packets 2, 3 and 5.
>dupacks is then 3 and the source retransmits packet 1, sets its window
>to 2 and adds three to cwnd which becomes 5. Here no new segments are
>transmitted. When packet 1 arrives at the destination, an ack asking for
>packet 4 is transmitted. According to your proposal, when the source
>receives this ACK, it decrements dupacks by two which becomes 1. Here
>packet 4 cannot be retransmitted and a timeout is required.

Actually, after getting partial ack of seg 3, cwnd is set to 5-3+1 = 3.
That means a new seg can be sent, assuming there is new data.  This
new seg 6 will in turn elicit another dup ack, which will increase cwnd
by 1 and dup_ack_cnt by 1.  Another new seg 7 can be sent, which will
elicit another dup ack.  After getting this dup ack, dup_ack_cnt will
be equal to 3, thus seg 4 is retransmitted according to Marcel's
suggestion.
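
The sequence above can be replayed with a short Python sketch that tracks
only cwnd and the dup ACK counter for the "[1] 2 3 [4] 5" case. The helper
on_new_ack is a hypothetical name for Marcel's decrement rule, and the
sketch assumes new data is always available to send.

```python
def on_new_ack(dupacks, segs_acked):
    """Marcel's rule: dupacks -= #segs_acked - 1, floored at zero."""
    return max(0, dupacks - (segs_acked - 1))


# State right after seg 1 is fast retransmitted.
cwnd, dupacks = 5, 3

# Partial ACK covering segs 1-3 arrives.
cwnd = cwnd - 3 + 1                  # window deflation: 5 - 3 + 1 = 3
dupacks = on_new_ack(dupacks, 3)     # 3 - 2 = 1
history = [(cwnd, dupacks)]

# New segs 6 and 7 each elicit one more dup ACK.
for _new_seg in (6, 7):
    cwnd += 1
    dupacks += 1
    history.append((cwnd, dupacks))

print(history)        # [(3, 1), (4, 2), (5, 3)]
print(dupacks >= 3)   # True: seg 4 is retransmitted without a timeout
```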

>The source stores in a variable the highest sequence number sent and it
>updates this variable when it receives a partial ACK (less than Y). When
>it leaves Fast Recovery and receives an ACK asking for a packet less
>than this variable (T in our example), it retransmits this packet
>directly.

What I want to know is if you retransmit a segment because of this ack,
should cwnd be decremented?  I think you agree that it should.  Another
question is should this retransmission be the start of another fast
retransmit?  If it should, then maybe your T will become recover of the next
phase.  And because cwnd and ssthresh are getting smaller, this extension
will eventually end.

As I stated in my previous mail, if the window is relatively large, your
dropped U can be recovered by another fast retransmit.  Your extension
mechanism can recover it faster and for smaller windows.  You have a 
"fast track" fast retransmit proposal (-:

>----


							K. Poon
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 17 02:31:41 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA27260
	for <tcpimpl-archive@odin.ietf.org>; Wed, 17 Mar 1999 02:31:41 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id XAA25383
	for tcp-impl-outgoing; Tue, 16 Mar 1999 23:22:08 -0500 (EST)
Received: from smtp7.ny.us.ibm.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id XAA24800
	for <tcp-impl@grc.nasa.gov>; Tue, 16 Mar 1999 23:19:04 -0500 (EST)
From: mrosu@us.ibm.com
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.117.200.22])
	by smtp7.ny.us.ibm.com (8.8.7/8.8.7) with ESMTP id XAA77836;
	Tue, 16 Mar 1999 23:18:18 -0500
Received: from D51MTA03.pok.ibm.com (d51mta03.pok.ibm.com [9.117.200.31])
	by northrelay02.pok.ibm.com (8.8.7m1/NCO v1.8) with SMTP id XAA141534;
	Tue, 16 Mar 1999 23:18:22 -0500
Received: by D51MTA03.pok.ibm.com(Lotus SMTP MTA v4.6.4  (817.1 3-4-1999))  id 85256737.00179FDE ; Tue, 16 Mar 1999 23:18:02 -0500
X-Lotus-FromDomain: IBMUS
To: "Chadi M. BARAKAT" <Chadi.Barakat@sophia.inria.fr>
cc: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>, floyd@ee.lbl.gov,
        tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU, mrosu@us.ibm.com
Message-ID: <85256737.00179E9B.00@D51MTA03.pok.ibm.com>
Date: Tue, 16 Mar 1999 23:18:04 -0500
Subject: Re: Counting ACKs in NewReno
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


   .........

   Let X and Y be the lowest and the highest sequence numbers transmitted when
   Fast Recovery is called. X is lost.

   Let Z be the highest sequence number between X and Y that is also lost.
   When the source receives the partial ACK asking for packet Z, it has
   transmitted new packets until T with T>Y.
   Here it retransmits Z and maybe some new packets beyond T. Packet Z
   arrives at the destination which must ack all the packets till T if they
   are correctly received of course. Suppose that a packet between Y and T
   is also lost, call it U, we have the following figure

   X  [Z]  Y  [U]  T

   Here the source receives an ACK asking for packet U. If it resets the
   dupacks counter after leaving fast recovery and if the number of
   segments sent after T is less than three, the detection of the loss of U
   requires a timeout. Recall that all the dupack of segments X till T have
   been already received when the ack asking for segment U arrived. However
   if the source uses the dupack counter as proposed by Marcel, it will
   consider the dup acks of packets sent between U and T with any dup ack
   for a packet sent after T. This may also cause a failure of Fast
   Retransmit if the number of these dup acks is less than three. However
   this proposition is better than resetting the counter because it takes
   account of dup acks for packets between U and T.

   I propose here that the source should retransmit directly packet U
   without waiting for three dupack or timeout. At least one round trip
   time has passed since the transmission of U and it is very probable that
   it is lost. How to do this??

   ......

   What do you think?

   Chadi

I'm not sure your argument is correct.

Let's assume that U = T. What if Z is retransmitted immediately after U?
On their way to the receiver, Z and U get reordered, Z is received first and
an ACK asking for U is sent back to the source. Immediately after that,
U is acked in a separate ACK message. The two acks are received by the source
in the order they were sent. Upon receiving the first of the two ACKs, U is
retransmitted unnecessarily.

In the following, U != T.
I believe you can use your idea only to fast retransmit the packets beyond Y
that were sent before V, where V is the lost packet with the highest sequence
number between X and Z. In other words, you can fast
retransmit U if T is the highest new packet sent before the retransmission
of V, where X [V] [Z] Y [U] T.

A second case when it is safe to fast retransmit U is when there are at
least two packets between U and T. In this case, there are three packets
between U and the retransmission of Z and it is safe to ignore packet
reordering.
(This second case will be handled correctly by Reno/NewReno and U will be
fast retransmitted because dupacks >= 3 when the ACK asking for U is received).

Please correct me if I'm wrong.

If my understanding of NewReno is correct, the first retransmission is protected
by the first three duplicate ACKs, while the next retransmissions are protected
by the "at least one roundtrip" argument. Once you exit the window, you are
protected by the "at least one roundtrip" argument only in the special case
described above. Otherwise, you need to wait for the three dup acks.

In my proposal, you fast retransmit only when protected by three dup ACKs,
whether you're inside or outside the window.

Marcel

   --
                       **  Chadi Mohamad BARAKAT  **
              http://www.inria.fr/mistral/personnel/Chadi.Barakat
                                     /\
   PhD Student - MISTRAL - INRIA    /  \   Chadi.Barakat@sophia.inria.fr
   2004, Route des Lucioles BP 93   \  /   Phone : + 33 4 92 38 71 99
   06902 Sophia Antipolis - France   \/    Cell  : + 33 6 10 42 36 30


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 17 04:19:48 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id EAA27776
	for <tcpimpl-archive@odin.ietf.org>; Wed, 17 Mar 1999 04:19:47 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id BAA10647
	for tcp-impl-outgoing; Wed, 17 Mar 1999 01:17:14 -0500 (EST)
Received: from smtp8.ny.us.ibm.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id BAA10133
	for <tcp-impl@grc.nasa.gov>; Wed, 17 Mar 1999 01:13:24 -0500 (EST)
From: mrosu@us.ibm.com
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.117.200.22])
	by smtp8.ny.us.ibm.com (8.8.7/8.8.7) with ESMTP id BAA47886;
	Wed, 17 Mar 1999 01:12:31 -0500
Received: from D51MTA03.pok.ibm.com (d51mta03.pok.ibm.com [9.117.200.31])
	by northrelay02.pok.ibm.com (8.8.7m1/NCO v1.8) with SMTP id BAA83512;
	Wed, 17 Mar 1999 01:12:46 -0500
Received: by D51MTA03.pok.ibm.com(Lotus SMTP MTA v4.6.4  (817.1 3-4-1999))  id 85256737.0022192A ; Wed, 17 Mar 1999 01:12:26 -0500
X-Lotus-FromDomain: IBMUS
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU,
        mrosu@us.ibm.com
Message-ID: <85256737.00221895.00@D51MTA03.pok.ibm.com>
Date: Wed, 17 Mar 1999 01:12:25 -0500
Subject: Re: Counting ACKs in NewReno
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


   Included message from mrosu@us.ibm.com:

   .....
   Let me try another example.  Using your notation, cwnd = 5,

   [1] 2 [3] 4 5

   After fast retransmitting seg 1, partial ack for seg 2 comes back.  Then
   dup_ack_cnt -= 2 - 1 = 2.  (I hope I get it right this time (-:)  So
   using your idea, since dup_ack_cnt < 3, seg 3 will not be retransmitted.
   cwnd is set to 5 - 2 + 1 = 3.  So no new segment can be sent.  TCP has
   to time out.

Yes, you're right...
However, in my original message, under Example 1, I say:
"....If we want to be more aggressive and count the retransmission of
segment 1 as a segment sent after 7 but acked before 7, then the new
value of dupacks should be 2...."

I was trying to say that you can be more aggressive for the second
retransmission. The more aggressive algorithm won't time out on
"[1] 2 [3] 4 5".

Here is a better way to count fast retransmissions as messages sent after X
but acked before X:

Enter in fast retransmit phase and retransmit if dupacks >= 3.
Upon receiving a partial ACK, always decrement dupacks by (# segs_acked - 1)
and:
if this would be the 2nd fast retransmit, then retransmit if dupacks >= 2,
or if this would be the 3rd fast retransmit, then retransmit if dupacks >= 1,
otherwise, retransmit unconditionally.
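
As a rough sketch of the rule above, in Python (the function names are made
up for illustration and do not come from any TCP stack or from the draft):

```python
def retransmit_threshold(nth_retransmit):
    """Dupacks required before the n-th fast retransmit (1-based) of a phase."""
    # 1st retransmit: 3, 2nd: 2, 3rd: 1, 4th and later: 0 (unconditional).
    return max(0, 4 - nth_retransmit)


def on_partial_ack(dupacks, segs_acked, nth_retransmit):
    """Return (new_dupacks, should_retransmit) after a partial ACK."""
    dupacks -= segs_acked - 1
    threshold = retransmit_threshold(nth_retransmit)
    # A zero threshold means "retransmit unconditionally", even if lost ACKs
    # have driven the counter negative.
    return dupacks, threshold == 0 or dupacks >= threshold
```

For "[1] 2 [3] 4 5", the partial ACK covers two segments while dupacks is 3,
so the counter drops to 2 and the second retransmit (threshold 2) goes ahead.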

I believe that you can prove that the three shortest examples for which the
aggressive algorithm times out are the following: [1] 2 3 4 [5] [6] 7,
[1] 2 3 4 [5] 6 [7], and [1] 2 3 4 [5] [6] [7]. The main observation is that
while in the fast retransmit phase, cwnd is always >= cwnd0/2, where cwnd0
is the initial congestion window (before the congestion event).

Using the same observation, you can prove that the aggressive algorithm
times out after the first retransmit only if 1/2 of the window or more is
lost (in the examples above, at least 3 messages are lost for a window of 7).
NewReno won't time out if 1/2 window or more is lost.

   ......
   I think we can separate your modification into 2 cases.  Using NewReno's
   terms, partial ack and ack covers recover.  Firstly, I don't believe that it
   is "risky" at all in retransmitting the first unack'ed segment after getting
   a partial ack.  Fast retransmit is triggered because of 3 dup acks.  We
   assume that there is no reordering, thus the segment is really dropped.  This
   means that no partial ack or complete ack can be sent back until this
   dropped segment is retransmitted and received by the receiver.  Thus
   retransmitting a segment because of a partial ack is as "risky" as the first
   fast retransmission.  They are all linked together.

   Suppose for the moment, we just apply half of your idea, do the dup_ack_cnt
   calculation but relax the dup_ack_cnt >= 3 rule.  If after leaving fast
   retransmit phase, dup_ack_cnt is still > 0, can we infer that some of the
   new segments sent during fast retransmit phase are also dropped?  I'll say
   probably.

   .......

OK, let's separate my proposal into two cases and apply only part of the idea.
In the fast retransmit phase upon receiving a partial ACK, adjust dupacks as
I proposed and retransmit unconditionally (because it is safe to assume
the segment was lost after at least one roundtrip).
Upon receiving a full ACK, adjust dupacks as I proposed. If dupacks > 0, use
it to determine the start of a new fast retransmit phase.

I believe that if you filter window updates correctly, a positive value
of dupacks upon leaving the fast retransmit is a pretty safe indication of
a lost segment.
If ACKs are lost on their way back to the source, dupacks underestimates
the number of segments received out of order.
Why do you say 'probably'?

   >I don't see the benefit of clearing the dupacks counter. Please elaborate.

   The dup_ack_cnt is just an implementation detail in NewReno which is used to
   identify when fast retransmit starts and ends.  Resetting it to 0 just means
   it is the end.

Comparing against the current value of the 'recover' variable will do the same.
I assume that 'recover' is never cleared; only set when fast retransmit starts.

   >When there are no losses, the dupacks counter will go to zero rapidly.
   >Because of lost acks, there will be attempts to decrement the counter
   >below zero. Anyway, dupacks should always be >= 0.
   >No. The congestion window should be handled as in the current draft.

   Actually, my intention is to ask when you stop the fast retransmit phase.
   Do you stop only when dup_ack_cnt becomes 0 or a time out happens?  The
   reason I ask is that a fast retransmit phase covers a single congestion
   event.

You stop the current retransmit phase upon receiving a full ACK. Adjust cwnd
as the draft says and adjust dupacks as I suggested. If updated dupacks >= 3,
start a new retransmit phase; follow steps 1 and 2 in the draft. Clear dupacks
only if negative (if ACKs are lost, my algorithm for adjusting dupacks can
yield negative values).
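
A minimal sketch of that full-ACK step, in Python. It assumes the same
(# segs_acked - 1) decrement used for partial ACKs also applies here, which
the paragraph above does not spell out; the function name is hypothetical.

```python
def on_full_ack(dupacks, segs_acked):
    """Return (new_dupacks, start_new_phase) when a full ACK ends the phase."""
    dupacks -= segs_acked - 1      # assumed: same adjustment as for partial ACKs
    if dupacks < 0:
        dupacks = 0                # lost ACKs can push the count below zero
    return dupacks, dupacks >= 3   # three or more left over: start a new phase
```

A positive leftover below three is kept rather than cleared, so later dup
ACKs add to it.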

   If cwnd is adjusted as in the draft, the loss of a new segment sent
   during fast retransmit phase is treated as the same congestion event.
   I'd suggest cwnd needs to be decremented when recovering for segments
   sent beyond recover.

Upon receiving a full or partial ACK, cwnd should be adjusted as specified in
the draft.

   >----


                                    K. Poon
                                    kcpoon@eng.sun.com

Looking forward to your comments,
Marcel


From owner-tcp-impl@lerc.nasa.gov  Thu Mar 18 02:06:49 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA18507
	for <tcpimpl-archive@odin.ietf.org>; Thu, 18 Mar 1999 02:06:49 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id XAA25719
	for tcp-impl-outgoing; Wed, 17 Mar 1999 23:12:11 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id XAA24836
	for <tcp-impl@grc.nasa.gov>; Wed, 17 Mar 1999 23:07:44 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with SMTP id UAA28561;
	Wed, 17 Mar 1999 20:07:35 -0800 (PST)
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id UAA10155; Wed, 17 Mar 1999 20:07:31 -0800
Received: from dors (awe185-28.AWE.Sun.COM [192.29.185.28])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id UAA07151;
	Wed, 17 Mar 1999 20:07:17 -0800 (PST)
Date: Wed, 17 Mar 1999 20:09:37 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: Counting ACKs in NewReno
To: mrosu@us.ibm.com
Cc: floyd@ee.lbl.gov, tcp-impl@grc.nasa.gov, tomh@CS.Berkeley.EDU
In-Reply-To: "Your message with ID" <85256737.00221895.00@D51MTA03.pok.ibm.com>
Message-ID: <Roam.SIMC.2.0.6.921730177.22721.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> Enter in fast retransmit phase and retransmit if dupacks >= 3.
> Upon receiving a partial ACK, always decrement dupacks by (# segs_acked - 1)
> and:
> if this would be the 2nd fast retransmit, then retransmit if dupacks >= 2,
> or if this would be the 3rd fast retransmit, then retransmit if dupacks >= 1,
> otherwise, retransmit unconditionally.

It seems to me that you agree that applying your idea during fast retransmit
phase is probably "too safe."  Note that when implementors add code to the
common path of TCP, we want the benefits to outweigh the complexity.  Let's skip
this part and look more closely at the other part, and from an implementation
point of view.

> I believe that if you filter window updates correctly, a positive value
> of dupacks upon leaving the fast retransmit is a pretty safe indication of
> a lost segment.
> If ACKs are lost on their way back to the source, dupacks underestimates
> the number of segments received out of order.
> Why do you say 'probably'?

The question here is how you are going to implement it.  We can think in terms
of distinct segments.  But in implementation, we cannot do that.  While we can
keep track of dup acks, we don't really know how many segments a partial or
complete ack acknowledges.  TCP does not need to send full MSS size segments.
(Look at recent thread of modifying Nagle algorithm in tcp-impl.)   That means
an ack for 2920 bytes may actually ack 3 segments, one of 1460 bytes and two
of 730 bytes.  What an implementation can do efficiently is to guess.  If
MSS is 1460 bytes and TCP gets an ack for 2920 bytes, assume that it is for 2
segments.  This means that when doing the subtraction in your idea, TCP is
probably going to subtract less than it actually should.  Thus a positive
dup_ack_cnt may not really indicate a drop.
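
To illustrate the guess with the numbers above, here is a small Python
sketch.  The helper name is hypothetical, and rounding up to whole segments
is an assumption about how an implementation might guess.

```python
def estimated_segs_acked(bytes_acked, mss):
    """Guess the segment count from the byte count, assuming full-sized segments."""
    return -(-bytes_acked // mss)            # ceiling division


actual_segments = [1460, 730, 730]           # what the sender really sent
bytes_acked = sum(actual_segments)           # 2920 bytes covered by one ack
guess = estimated_segs_acked(bytes_acked, 1460)
undercount = len(actual_segments) - guess    # segments the guess misses
```

Here the guess is 2 segments while 3 were acked, so dup_ack_cnt ends up one
higher than it should be.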

If you have an idea of how your counting can be achieved efficiently, please
send it to the mailing list and we can discuss it.  The above is just how I
think your idea can be implemented.  It does not mean that it has to be done
this way.

> Comparing against the current value of the 'recover' variable will do the
> same. I assume that 'recover' is never cleared; only set when fast
> retransmit starts.

As I said repeatedly, dup_ack_cnt is just an implementation detail.  Note that
TCP cannot just compare recover with any ack.  TCP needs to know that it
really needs to compare an ack with recover.  The variable dup_ack_cnt serves
this purpose.  If dup_ack_cnt is >= 3, it means that TCP is in fast retransmit
mode and the seq number stored in recover is valid.  TCP needs to compare the
ack number with recover.  Resetting dup_ack_cnt means that TCP does not need to
do any comparison.  Note that you can introduce a state variable to indicate
that TCP is in fast retransmit mode and needs to check recover, or use
various other methods.  They are just implementation details.
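
A minimal sketch of the explicit state variable alternative, in Python.  The
class and method names are invented for illustration, sequence numbers are
plain integers with no wraparound handling, and "full" is taken to mean an
ack number beyond recover.

```python
class FastRecoveryState:
    """Tracks the phase with an explicit flag instead of dup_ack_cnt >= 3."""

    def __init__(self):
        self.in_fast_recovery = False
        self.recover = 0              # only meaningful while the flag is set

    def enter(self, highest_seq_sent):
        """Called when fast retransmit starts."""
        self.in_fast_recovery = True
        self.recover = highest_seq_sent

    def classify_ack(self, ack_no):
        """Return 'normal', 'partial', or 'full' for a new cumulative ack."""
        if not self.in_fast_recovery:
            return "normal"           # recover is stale, so skip the comparison
        if ack_no > self.recover:
            self.in_fast_recovery = False
            return "full"
        return "partial"
```

Clearing the flag on a full ack plays the same role as resetting dup_ack_cnt
to 0: later acks are no longer compared against recover.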

> You stop the current retransmit phase upon receiving a full ACK. Adjust cwnd
> as the draft says and adjust dupacks as I suggested. If updated dupacks >= 3,
> start a new retransmit phase; follow steps 1 and 2 in the draft. Clear
> dupacks only if negative (if ACKs are lost, my algorithm for adjusting
> dupacks can yield negative values).

This is what I would like you to specify clearly.  Now that you've done that,
we can look further to see how this can be implemented.

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 05:04:28 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id FAA16469
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 05:04:27 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id BAA07889
	for tcp-impl-outgoing; Mon, 22 Mar 1999 01:22:06 -0500 (EST)
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id BAA07491
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 01:19:16 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 479
          for <tcp-impl@grc.nasa.gov>; Sun, 21 Mar 1999 22:18:51 -0800
Message-ID: <36F5E0C9.1F5899D4@ehsco.com>
Date: Sun, 21 Mar 1999 22:18:50 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: TCP Implementations <tcp-impl@grc.nasa.gov>
Subject: alternative PMTU
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Hi,

I'm sure somebody's thought of this before, but I can't figure out why
it wouldn't have worked.

Suppose that instead of doing this complex probe operation that we have
today, we just send datagrams with fragmentation allowed. Then, if
fragmentation does occur somewhere, the recipient responds with a TCP
option that states "fragmentation occurred: fragment size is X"

This would let the sender immediately drop the MSS down to the reported
size, and we wouldn't have any of this crap that we have to go through
today for PMTU (ICMP probs, etc.).

The only issue I can see here is that the sender may transmit lots of
too-big segments, and on a lossy network the resulting fragmentation
could cause all data to get lost. However, slow start would seem to
prevent this from happening. In that case, the same strategies used
today would still work (scale down until successful).

The other issue would be coordinating between TCP and IP, so that IP
could tell TCP when fragmentation had in fact occurred (based on the
header bits). I suppose one alternative to this would be for IP to send
this option/message if it gets data for TCP where fragmentation has
occurred.

The upsides to using an option are that it gets to ride for free on the
ACKs, it will get ignored if the original sender doesn't understand it,
and it won't be blocked by firewalls.

Thoughts?

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 10:50:48 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA19392
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 10:50:47 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA18424
	for tcp-impl-outgoing; Mon, 22 Mar 1999 07:39:16 -0500 (EST)
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA16042
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 07:32:43 -0500 (EST)
Received: from ehsco.com ([192.168.10.10]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 503;
          Mon, 22 Mar 1999 04:32:12 -0800
Message-ID: <36F6384C.A3E4436@ehsco.com>
Date: Mon, 22 Mar 1999 04:32:12 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
CC: tcp-impl@grc.nasa.gov
Subject: Re: alternative PMTU
References: <m10P4Zr-0007U2C@the-village.bc.nu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


> Am I missing something here. The ICMP reply optionally gives you the
> mtu size

Sometimes an intermediary will, but the end-point never does. Also, it
can take multiple efforts to find the "true" end-to-end mtu.

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 10:52:13 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA19433
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 10:52:12 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA15197
	for tcp-impl-outgoing; Mon, 22 Mar 1999 07:27:06 -0500 (EST)
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA14658
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 07:23:17 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id MAA31178; Mon, 22 Mar 1999 12:23:11 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m10P4Zr-0007U2C; Mon, 22 Mar 99 13:17 GMT
Message-Id: <m10P4Zr-0007U2C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: alternative PMTU
To: ehall@ehsco.com (Eric A. Hall)
Date: Mon, 22 Mar 1999 13:17:10 +0000 (GMT)
Cc: tcp-impl@grc.nasa.gov
In-Reply-To: <36F5E0C9.1F5899D4@ehsco.com> from "Eric A. Hall" at Mar 21, 99 10:18:50 pm
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> This would let the sender immediately drop the MSS down to the reported
> size, and we wouldn't have any of this crap that we have to go through
> today for PMTU (ICMP probs, etc.).

Am I missing something here. The ICMP reply optionally gives you the 
mtu size

> ACKs, it will get ignored if the original sender doesn't understand it,
> and it won't be blocked by firewalls.

Dream on. Quite a few firewall products "clean up" the tcp sessions or
semi-proxy them. Look at both sides of a cisco PIX for example and you'll
see all sorts of magic going on, including some rather out of spec 
fragment handling.

Alan


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 10:58:58 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA19642
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 10:58:57 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA17389
	for tcp-impl-outgoing; Mon, 22 Mar 1999 07:37:47 -0500 (EST)
Received: from snowcrash.cymru.net (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA16175
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 07:35:00 -0500 (EST)
Received: from the-village.bc.nu (lightning.swansea.uk.linux.org [194.168.151.1]) by snowcrash.cymru.net (8.8.7/8.7.1) with SMTP id MAA31440; Mon, 22 Mar 1999 12:34:56 GMT
Received: by the-village.bc.nu (Smail3.1.29.1 #2)
	id m10P4lB-0007U2C; Mon, 22 Mar 99 13:28 GMT
Message-Id: <m10P4lB-0007U2C@the-village.bc.nu>
From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: alternative PMTU
To: ehall@ehsco.com (Eric A. Hall)
Date: Mon, 22 Mar 1999 13:28:53 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk, tcp-impl@grc.nasa.gov
In-Reply-To: <36F6384C.A3E4436@ehsco.com> from "Eric A. Hall" at Mar 22, 99 04:32:12 am
Content-Type: text
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > Am I missing something here. The ICMP reply optionally gives you the
> > mtu size
> 
> Sometimes an intermediary will, but the end-point never does. Also, it
> can take multiple efforts to find the "true" end-to-end mtu.

You want the path mtu's passed in the BGP4 routing tables ?


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 13:35:17 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA24675
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 13:35:16 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id KAA01367
	for tcp-impl-outgoing; Mon, 22 Mar 1999 10:37:21 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id KAA00634
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 10:34:05 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id IAA28868
	for tcp-impl@grc.nasa.gov  env-from <vjs>;
	Mon, 22 Mar 1999 08:34:02 -0700 (MST)
Date: Mon, 22 Mar 1999 08:34:02 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903221534.IAA28868@calcite.rhyolite.com>
To: tcp-impl@grc.nasa.gov
Subject: Re: alternative PMTU
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: "Eric A. Hall" <ehall@ehsco.com>

> ...
> Suppose that instead of doing this complex probe operation that we have
> today, we just send datagrams with fragmentation allowed. Then, if
> fragmentation does occur somewhere, the recipient responds with a TCP
> option that states "fragmentation occurred: fragment size is X"
> ...

Contrast the current path MTU discovery mechanism with that idea:

  current scheme:
    deals with UDP, ICMP, and other protocols

    works even when the other host and all routers do not know the new
      PMTU ICMP option.  It works very well with only some routers updated,
      and with the far larger number of hosts running the original TCP.

    Routers do not absolutely need to be updated, but if they are,
      the change is tiny.

    always discovers the MTU that the sender needs to use to avoid
      fragmentation.

    handles routing changes


  alternate scheme:
    handles only TCP.

    Does not work at all unless the remote host has new TCP and
      IP code.  There are far more hosts that would need to be updated
      with new TCP and IP code than routers in the current scheme.

    The new code needed in hosts is much larger than the change in routers,
      and again, the routers do not absolutely need to be changed at all
      in the current scheme.

    gets the fragment seen by the far host, which can be larger than the
      smallest MTU if any router does IP reassembly, and usually is
      smaller.  What if the bottleneck router fragments into even sizes
      instead of maximum and tiny sizes?   What about the effects of the
      8-byte granularity of IP fragments?  Also consider the IP fragments
      an Ethernet host (1500) can receive from an FDDI sender: 1500,
      1500, and 1412 (assuming no TCP or IP options).  Does the alternate
      scheme report 1500 or 1412?   Now consider the fragments received
      on the Ethernet when the FDDI host prefers to use an MSS of 4096
      instead of 4332, because that is often much faster.  (There are many
      commercial systems that do that.)

    Say you handled those problems by looking for the largest fragment or
      datagram received, and ignored the fact that would not always give
      the right answer.  Then what about routing changes?
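
The fragment sizes above can be reproduced with a short Python sketch.  It
assumes a 20-byte IP header with no options and that the 1500, 1500, 1412
figures correspond to a 4352-byte IP payload; the function name is
hypothetical.

```python
def fragment_lengths(payload_len, mtu, ip_header_len=20):
    """Total length of each IP fragment produced for one datagram payload."""
    # Fragment offsets are counted in 8-byte units, so every fragment except
    # the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    lengths = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        lengths.append(chunk + ip_header_len)
        remaining -= chunk
    return lengths


ethernet_view = fragment_lengths(4352, 1500)   # what the Ethernet host sees
odd_mtu_view = fragment_lengths(2000, 1006)    # MTU not a multiple of 8 plus header
```

With a 1006-byte MTU the full fragments are 1004 bytes, not 1006, so even
the largest fragment seen by the receiver understates the bottleneck MTU.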


] From: "Eric A. Hall" <ehall@ehsco.com>

] > Am I missing something here. The ICMP reply optionally gives you the
] > mtu size
]
] Sometimes an intermediary will, but the end-point never does.

The endpoint never does any IP fragmenting, so why do you care what it
guesses is the smallest MTU in the path?

]                                                               Also, it
] can take multiple efforts to find the "true" end-to-end mtu.

Not in practice, unless the router doing the fragmenting does not know
about the new ICMP option.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 18:18:29 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA03839
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 18:18:28 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA15484
	for tcp-impl-outgoing; Mon, 22 Mar 1999 15:13:10 -0500 (EST)
Received: from jupiter.nal.utoronto.ca (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with SMTP id PAA14404;
	Mon, 22 Mar 1999 15:08:49 -0500 (EST)
Received: from nal.utoronto.ca by jupiter.nal.utoronto.ca (SMI-8.6/SMI-SVR4)
	id PAA15535; Mon, 22 Mar 1999 15:00:49 -0500
Message-ID: <36F6A2B4.A51783D7@nal.utoronto.ca>
Date: Mon, 22 Mar 1999 15:06:12 -0500
From: Raouf Boutaba <rboutaba@jupiter.nal.utoronto.ca>
Organization: University of Toronto
X-Mailer: Mozilla 4.51 [en] (Win95; I)
X-Accept-Language: en
MIME-Version: 1.0
To: xtp-relay@cs.concordia.ca
Subject: NETWORKING 2000
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

[My apologies if you receive this more than once]
======================================================
=======================================================
                  NETWORKING 2000
                  ================
             IFIP - TC6/ European Union

            Broadband communications (BB),
            High Performance Networking (HPN),
            Performance of Communication Networks (PCN)


                            Paris, France
                     Cité des Sciences, La Villette
                            May 14 - 19, 2000
                            Main Sponsors
                         IFIP TC6, European Union

National Main Sponsor
RNRT (Réseau National de la Recherche en Télécommunications)

Supporters
Alcatel, CISCO, Thomson-CSF Detexis, EDF, European Union, France
Telecom, Sprint

Organizer
University of Paris VI , University of Versailles, ENST

General Chair
Guy Pujolle - France

Organization Committee
Serge Fdida - France
Jean-Alain Hernandez - France
Eric Horlait - France
René Joly - France
Guy Pujolle - France

Steering Committee
 Augusto Casaca - Portugal
André Danthine - Belgium
Olli Martikainen - Finland
Harry Perros - USA
Jan Slavik - Czech Republic
Otto Spaniol - Germany
Y. Takahashi - Japan
S. Tohme - France

General Program committee chair
Harry Perros - USA

Program committee chair for BB track
Ulf Korner - Sweden

Program committee chair for HPN track
Serge Fdida - France

Program committee chair for PCN track
Ioannis Stavrakakis - USA


 Program committee members
 Augusto Albuquerque - EC
Harmen van As - Austria
Augusto Casaca - Portugal
Paolo Castelli - Italy
Imrich Chlamtac -USA
Jean-Pierre Coudreuse - France
Nelson L. S. Fonseca - Brazil
Luigi Fratta - Italy
Giorgio Gallassi - Italy
Andre Girard - Canada
Villy B. Iversen - Denmark
Bijan Jabbari -USA
Konosuke Kawashima - Japan
Peter Key - USA
Daniel Kofman - France
Paul Kuehn - Germany
Helmut Leopold - Austria
John Luetchford - Canada
Lorne Mason - Canada
Serafim Nunes - Portugal
Guido Petit - Belgium
Sathya Rao - Switzerland
Joao Rodrigues - Portugal
Catherine Rosenberg - UK
Tadao Saito - Japan
Amardeo Sarnea - Germany
Marion Schreinemachers - Netherl.
Jan Slavik - Czech Republic
Samir Tohme - France
Danny Tsang - Hong Kong
Finn Arve Aagesen - Norway
Andres Albanese - USA
Chase Bailey - USA
Ermanno Berruto - Italy
Andrew Campbell - USA
Jon Crowcroft - UK
Andre Danthine - Belgium
Walid Dabbous - France
Michel Diaz - France
Sarolta Dibuz - Hungary
Christophe Diot - USA
Otto Duarte - Brazil
Wolfgang Effelsberg - Germany
Nicolas D. Georganas - Canada
Enrico Gregori - Italy
Roch Guerin - USA
Christian Huitema - USA
David Hutchinson -UK
Marjory Johnson - USA
Farouk Kamoun - Tunisia
Koos Koen - RSA
Jacques Labetoulle - France
Jean-Yves Le Boudec - Switzerl.
Guy Leduc - Belgium
Olli Martikainen - Finland
Steve Pink - Sweden
Nina Taft Plotkin - USA
Radu Popescu-Zeletin - Germany
Luigi Rizzo - Italy
Aruna Seneviratne -Australia
Otto Spaniol - Germany
Ralf Steinmetz - Germany
Chris Blondia - Belgium
Miklos Boda - Hungary
Herwig Bruneel - Belgium
Olga Casals - Spain
Prosper Chemouil - France
Giovanni Colombo - Italy
Tony Ephremides - USA
Andras Farago - USA
Erol Gelenbe - USA
Mario Gerla - USA
Fabrice Guillemin - France
Gerard Hebuterne - France
Demetres Kouvatsos - UK
Jim Kurose - USA
Karl Lindberger -Sweden
Jon Mark - Canada
Marco Marsan - Italy
Lazaros Merakos - Greece
Jouni Mikkonen - Finland
Debasis Mitra - USA
Naotama Morita - Japan
Arne Nilsson - USA
Raif Onvural - USA
Ramon Puigjaner - Spain
James Roberts - France
Yutaka Takahashi - Japan
Ahmed Tantawy - USA
Phouk Tran-Gia - Germany
Jorma Virtamo - Finland
Adam Wolisz - Germany


Tutorials Committee Chair
Eric Horlait -France

Publicity Committee Chair
Raouf Boutaba - Canada

Networking 2000 Conference
Networking 2000 Conference will provide an international technical forum
for experts from industry and academia to exchange ideas and present
results of ongoing research in networking. It is a joint conference of
the following three series of conferences:
Broadband Communications (BB)
High Performance Networking (HPN)
Performance of Communication Networks (PCN).


Topics
 Switching Design
Routing Design
Switching and Routing
Internet/Intranet
LAN, WAN Global Network Interconnection
High Performance Networking and Protocols
Service Provider Interworking
Control and Optimization of Communication Systems
Quality of Service
Multicast
Active Networks
Terrestrial Radio Systems
Wireless and Mobile Communications
Satellite and Space Communications
Low Earth Orbit Satellite Communication Systems
Optical Communications
Access Networks
Video Coding and Distribution
Object Orientation
Network Operations and Management
Signalling
Tariffing
Network Reliability, Availability and Survivability
Internet Services and Applications
Network Design Problems in Gigabit
Terabit Networks
Performance Evaluation of Telecommunication Systems
Traffic Models and Measurements
Traffic and Congestion Control

KEY DATES
* September 1st, 1999 Deadline for submitting Short Course Proposals
* September 15, 1999  Deadline for submitting papers
* January 1st, 2000  Notification of acceptance of papers
* February 15, 2000  Papers received in camera ready form
* May 14-15, 2000  Tutorials
* May 16-19, 2000  Networking 2000 Conference

SUBMISSION OF PAPERS
Papers can be submitted by mail or email. Monitor the web site for
details: www.noc.uoa.gr/net2000


Papers are invited on the conference theme and related topics. Authors
must state that their paper has neither been published before nor is
currently being submitted elsewhere.
Full papers should be no longer than 15 pages, including tables,
diagrams and pictures. The font size must be 12 points or larger. The
cover page must contain an abstract of about 150 words, name and
affiliation of author(s) as well as the lead author's postal address,
telephone number, fax number and e-mail.
Accepted papers will be included in the conference proceedings and will
be distributed to the attendees at the conference.
The language of the conference is English and papers must be in this
language.

SHORT COURSES
We solicit proposals for half/full day short courses outlining the
subject area and a short biography of the organizer(s).

Short courses can be submitted by mail or email to
Prof. E. Horlait
LIP6,
4 place Jussieu
75252 Paris Cedex 05

* by email to Eric.Horlait@lip6.fr

INFORMATION
www.noc.uoa.gr/net2000
www.prism.uvsq.fr/~net2000


Mini conferences and workshops will take place during the conference.
The following proposals are being considered: (specific call for papers
will be provided)

MWCN - Mobile and Wireless Communication Networks (Chairman Guy Omidyar
- Egypt) omidyar.guy@gtenet.com.eg

IATE - Intelligent agents for telecommunication environments
(Chairpersons Dominique Gaiti - France and Olli Martikainen - Finland)
Dominique.Gaiti@lip6.fr, Olli.Martikainen@hut.fi

Quality of Service and Multimedia Applications (Chairman Eric Horlait -
France)
Eric.Horlait@lip6.fr

Constellation of satellites (Chairwoman C. Rosenberg - Canada)
C.Rosenberg@nortel.co.uk

Resource allocation for mobile networks (Chairman Sami Tabbane -
Tunisia)
sami.tabbane@supcom.rnu.tn

Internet and the Web (Chairman Keith Ross - France)
ross@eurecom.fr

Performance of Adaptive Intelligent Networks (Chairman E. Gelenbe - USA)
erol@cs.ucf.edu

Programmable Networks and Active Networks (Chairman R. Boutaba - Canada,
Andrew Campbell - USA, Rolf Stadler - USA and Ian Wakeman - USA)
rboutaba@jupiter.nal.utoronto.ca   campbell@comet.columbia.edu

Multimedia Management (Chairman Jose Neuman de Souza - Brazil)
neuman@ufc.br


From owner-tcp-impl@lerc.nasa.gov  Mon Mar 22 21:57:31 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA06786
	for <tcpimpl-archive@odin.ietf.org>; Mon, 22 Mar 1999 21:57:31 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA07945
	for tcp-impl-outgoing; Mon, 22 Mar 1999 19:17:06 -0500 (EST)
Received: from Arachnid.NTRG.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA07245
	for <tcp-impl@grc.nasa.gov>; Mon, 22 Mar 1999 19:12:27 -0500 (EST)
Received: from ehsco.com ([192.168.10.6]) by Arachnid.NTRG.com
          (Netscape Messaging Server 3.62)  with ESMTP id 416;
          Mon, 22 Mar 1999 10:24:40 -0800
Message-ID: <36F68AE1.1F7E3863@ehsco.com>
Date: Mon, 22 Mar 1999 10:24:33 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
Organization: EHS Company
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
CC: tcp-impl@grc.nasa.gov
Subject: Re: alternative PMTU
References: <m10P4lB-0007U2C@the-village.bc.nu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit


> You want the path mtu's passed in the BGP4 routing tables ?

No, what I'm thinking is that instead of sending a series of recursive
probes through the network (which don't get returned, or which get eaten
by ICMP-hostile devices, or which don't provide any useful info), just
have the end-node return an option that says "hey fragmentation has
occurred and this is the biggest fragment that I got."

It is a lot easier to deal with the one message that is explicit than
trying to deal with many messages of many different flavors that don't
necessarily tell you anything even when they do succeed (the current
design of using DF and hoping for informative errors).

One downside to the one-shot model is that the remote end-point HAS to
support the option if it is going to be relied upon. Otherwise you'd
just keep sending fragmentable data and never get a response saying "got
frags."

-- 
Eric A. Hall                                            ehall@ehsco.com
+1-650-685-0557                                    http://www.ehsco.com


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 23 03:20:23 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id DAA18425
	for <tcpimpl-archive@odin.ietf.org>; Tue, 23 Mar 1999 03:20:22 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id AAA20317
	for tcp-impl-outgoing; Tue, 23 Mar 1999 00:12:09 -0500 (EST)
Received: from calcite.rhyolite.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id AAA19286
	for <tcp-impl@grc.nasa.gov>; Tue, 23 Mar 1999 00:06:57 -0500 (EST)
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.0/calcite) id WAA03533
	for tcp-impl@grc.nasa.gov  env-from <vjs>;
	Mon, 22 Mar 1999 22:06:55 -0700 (MST)
Date: Mon, 22 Mar 1999 22:06:55 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <199903230506.WAA03533@calcite.rhyolite.com>
To: tcp-impl@grc.nasa.gov
Subject: Re: alternative PMTU
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: "Eric A. Hall" <ehall@ehsco.com>

> ...
> No, what I'm thinking is that instead of sending a series of recursive
> probes through the network (which don't get returned, or which get eaten
> by ICMP-hostile devices, or which don't provide any useful info), just
> have the end-node return an option that says "hey fragmentation has
> occurred and this is the biggest fragment that I got."

But once again, the biggest fragment that a host received has no useful
relationship to the path MTU.  Please consider the implications of the
8-byte granularity of IP fragments.  Because of the 8-byte granularity,
it can be impossible for a router to generate a fragment that is as
large as the MTU allows.
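
The granularity point can be checked with a little arithmetic.  The sketch
below is only an illustration, assuming a 20-byte IPv4 header with no
options; the function name and the sample MTU values are hypothetical.
Every fragment except the last must carry a payload that is a multiple of
8 bytes, so the largest fragment a router can emit may be a few bytes
smaller than the link MTU.

```python
IP_HEADER_LEN = 20  # assumption: IPv4 header without options


def largest_fragment(mtu: int, header_len: int = IP_HEADER_LEN) -> int:
    """Total length of the largest non-final fragment that fits in `mtu`."""
    # Fragment offsets are expressed in 8-byte units, so the payload of
    # every non-final fragment is rounded down to a multiple of 8.
    payload = (mtu - header_len) // 8 * 8
    return header_len + payload


for mtu in (1500, 1006, 576):
    print(mtu, largest_fragment(mtu))
# 1500 -> 1500, 1006 -> 1004, 576 -> 572
```

For an MTU of 1006 the largest possible fragment is 1004 bytes, so even a
router that fills each fragment as far as it can would lead a receiver to
report a value below the real MTU.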

Consider the possibility that the router with the smallest MTU in the path
might not fragment stupidly.  What if the MSS is 2048, the PMTU is
1500, and the largest fragment the far host receives is 1024?  Your scheme
would incorrectly declare the PMTU to be 1024.

Consider load sharing a TCP connection with MSS=4312 (FDDI) over two paths,
one with PMTU of 1500 (Ethernet) and the other with PMTU 4352 (FDDI).
Your scheme would falsely declare the PMTU is 4352.

Also as I said before, your notion does not support UDP.
In practice, UDP often needs PMTU discovery while TCP usually does not,
because of the TCP MSS option.

All of those cases are handled correctly by the current mechanism of RFC 1191.


> It is a lot easier to deal with the one message that is explicit than
> trying to deal with many messages of many different flavors that don't
> necessarily tell you anything even when they do succeed (the current
> design of using DF and hoping for informative errors).

Please take time to read RFC 1063 and RFC 1191, and see that there
are not "many messages of many different flavors", but only two.
Please also notice that one of those two flavors contains the actual,
accurate value, unlike the hopelessly inaccurate value of your proposal.

Consider the 1988 date of RFC 1063, and consider the likelihood that
support for the second flavor that contains all of the accurate information
is practically universal today in reasonable routers.


> One downside to the one-shot model is that the remote end-point HAS to
> support the option if it is going to be relied upon. Otherwise you'd
> just keep sending fragmentable data and never get a response saying "got
> frags."

Yes, and that issue was one of the design constraints on RFC 1063
and RFC 1191.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 23 12:38:47 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA23386
	for <tcpimpl-archive@odin.ietf.org>; Tue, 23 Mar 1999 12:38:46 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id JAA05432
	for tcp-impl-outgoing; Tue, 23 Mar 1999 09:17:41 -0500 (EST)
Received: from cqmx.corp.comsat.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id JAA04772
	for <tcp-impl@grc.nasa.gov>; Tue, 23 Mar 1999 09:15:15 -0500 (EST)
From: Minghua.Lu@comsat.com
Received: from smtpgw2.cws.comsat.com ([134.133.178.29])
          by cqmx.corp.comsat.com (Post.Office MTA v3.5.3 release 223
          ID# 0-0U10L2S100V35) with SMTP id com
          for <tcp-impl@grc.nasa.gov>; Tue, 23 Mar 1999 09:14:45 -0500
Received: from ccMail by smtpgw2.cws.comsat.com
  (IMA Internet Exchange 2.11 Enterprise) id 000FD9E9; Tue, 23 Mar 1999 09:14:46 -0500
Mime-Version: 1.0
Date: Tue, 23 Mar 1999 09:12:01 -0500
Message-ID: <000FD9E9.003332@comsat.com>
Subject: Question about caching ssthresh in routing table
To: tcp-impl@grc.nasa.gov
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Description: cc:Mail note part
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

     We are running some HTTP applications on the Solaris 2.6 platform. In some
     TCP connections, we noticed that ssthresh was initialized to a
     very low value (for example, 2 MSSs) for some reason, even though
     there were no packet losses in these connections. The slow start phase
     in these connections was very short, and cwnd stayed at low values
     because of the slow increase in the linear phase. This behavior leads to
     long delays and low throughput in the application.
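
A rough way to see the size of this effect is to count round trips in an
idealized model.  The sketch below is not Solaris code; it assumes an
initial window of one segment, a doubling of cwnd per RTT in slow start, and
growth of one segment per RTT in the linear phase, with no losses and no
delayed ACKs.  The function name and the sample values are hypothetical.

```python
def rtts_to_reach(target_segments: int, ssthresh_segments: int) -> int:
    """Round trips needed for cwnd to grow from 1 segment to the target."""
    cwnd = 1  # assumption: initial window of one segment
    rtts = 0
    while cwnd < target_segments:
        if cwnd < ssthresh_segments:
            # Slow start: cwnd roughly doubles every round trip.
            cwnd = min(cwnd * 2, ssthresh_segments)
        else:
            # Linear phase: about one extra segment per round trip.
            cwnd += 1
        rtts += 1
    return rtts


print(rtts_to_reach(32, 2))   # ssthresh of 2 segments: 31 round trips
print(rtts_to_reach(32, 64))  # ssthresh above the target: 5 round trips
```

Under these assumptions a short HTTP transfer that would finish within a few
round trips spends most of its lifetime in the linear phase when ssthresh
starts at 2 segments.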
     
     We also noticed that if there are no more packet losses,
     ssthresh stays at the same low value for all subsequent TCP
     connections. ssthresh increases only if there is a packet loss.
     
     We assumed that the above problem was caused by caching the ssthresh 
     in the routing table. It seems that caching ssthresh might not always 
     be desired.
     
     I have two questions regarding the Solaris 2.6 implementation:
     
     1) ssthresh is saved in the routing table when a tcp connection is 
     closed. A condition for saving the ssthresh is that there must be at 
     least a certain amount of data transferred during the connection. BSD 
     uses 16 times the high water mark of the send buffer. Solaris 2.6 does 
     not seem to be using the same value. Does anybody know what the 
     condition is for saving ssthresh in Solaris 2.6?
     
     2) I assumed that the ssthresh in the routing table will time out
     eventually if there are no more updates to it after some time. Does
     anyone know what the timer is for this in Solaris 2.6?
     
     Minghua


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 24 06:04:03 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id GAA19498
	for <tcpimpl-archive@odin.ietf.org>; Wed, 24 Mar 1999 06:04:02 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id CAA06195
	for tcp-impl-outgoing; Wed, 24 Mar 1999 02:30:27 -0500 (EST)
Received: from mercury.Sun.COM (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id CAA05449
	for <tcp-impl@grc.nasa.gov>; Wed, 24 Mar 1999 02:25:18 -0500 (EST)
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with SMTP id XAA03047;
	Tue, 23 Mar 1999 23:25:20 -0800 (PST)
Received: from jurassic.eng.sun.com by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id XAA01805; Tue, 23 Mar 1999 23:25:14 -0800
Received: from dors (awe185-55.AWE.Sun.COM [192.29.185.55])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id XAA14751;
	Tue, 23 Mar 1999 23:25:09 -0800 (PST)
Date: Tue, 23 Mar 1999 23:27:32 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: Question about caching ssthresh in routing table
To: Minghua.Lu@comsat.com
Cc: tcp-impl@grc.nasa.gov
In-Reply-To: "Your message with ID" <000FD9E9.003332@comsat.com>
Message-ID: <Roam.SIMC.2.0.6.922260452.29839.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>      We assumed that the above problem was caused by caching the ssthresh 
>      in the routing table. It seems that caching ssthresh might not always 
>      be desired.

If you are using Solaris 2.6, I don't believe caching is on by default.  Can
you check with `ndd /dev/tcp tcp_rtt_updates` to see if the value is non-zero?
Do you have a snoop trace for the connections you talked about?  Do you infer
the ssthresh value from a trace?  Can you send a copy to me?  Thanks.

>      1) ssthresh is saved in the routing table when a tcp connection is 
>      closed. A condition for saving the ssthresh is that there must be at 
>      least a certain amount of data transferred during the connection. BSD 
>      uses 16 times the high water mark of the send buffer. Solaris 2.6 does 
>      not seem to be using the same value. Does anybody know what the 
>      condition is for saving ssthresh in Solaris 2.6?

tcp_rtt_updates controls when things like rtt_sa, rtt_sd, and ssthresh are
cached.  You can set it to a value you like.  To be the same as BSD, set it
to 16.  But in 2.6, it is 0 by default.  That means no caching is done.

>      2) I assumed that the ssthresh in the routing table will timeout 
>      eventually if there are no more updates to it after some time. Does 
>      anyone know what the timer is for this in Solaris 2.6?

If a destination is not directly connected and not being used, its cache will
be flushed in 30 seconds (ip_ire_cleanup_interval).  

							K. Poon.
							kcpoon@eng.sun.com


From owner-tcp-impl@lerc.nasa.gov  Sat Mar 27 22:29:24 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA28741
	for <tcpimpl-archive@odin.ietf.org>; Sat, 27 Mar 1999 22:29:23 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA01135
	for tcp-impl-outgoing; Sat, 27 Mar 1999 18:46:18 -0500 (EST)
Received: from pneumatic-tube.sgi.com (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA00458
	for <tcp-impl@grc.nasa.gov>; Sat, 27 Mar 1999 18:44:38 -0500 (EST)
From: vinayb@vinayb.engr.sgi.com
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by pneumatic-tube.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id PAA6346668
	for <@etube.sgi.com:tcp-impl@grc.nasa.gov>; Sat, 27 Mar 1999 15:44:36 -0800 (PST)
	mail_from (vinayb@vinayb.engr.sgi.com)
Received: from vinayb.engr.sgi.com (vinayb.engr.sgi.com [150.166.75.35])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id PAA84311;
	Sat, 27 Mar 1999 15:44:29 -0800 (PST)
	mail_from (vinayb@vinayb.engr.sgi.com)
Received: (from vinayb@localhost) by vinayb.engr.sgi.com (980427.SGI.8.8.8/980728.SGI.AUTOCF) id PAA08255; Sat, 27 Mar 1999 15:43:33 -0800 (PST)
Message-Id: <199903272343.PAA08255@vinayb.engr.sgi.com>
Subject: ISN numbers and TIME_WAIT
To: tcp-impl@grc.nasa.gov
Date: Sat, 27 Mar 1999 15:43:33 -0800 (PST)
Cc: fisher@cthulhu.engr.sgi.com, sm@cthulhu.engr.sgi.com,
        greg@cthulhu.engr.sgi.com
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

Hi folks,

This is a constantly recurring theme on the tcpimpl mailing list but the
solution seems to be elusive or divided among vendors. I have collected
some snoop traces from some of the TCP sessions between Irix, SunOS, Linux
and NT SP4.  This investigation was part of a bug that I was looking at
involving Irix and NT SP4. The test case involved running a remote rsh
command from an NT PC running SP4 to connect to an SGI Indy running
Irix 6.5.x. The command run on the client machine was

rsh <remote-hostname> -l guest ls

Rsh uses a reserved port and not the ephemeral port for initiating the
connection. This sometimes leads to reuse of the same port number when the
command is re-issued. This might result in the remote TCP server getting a
SYN packet while in the midst of 2MSL timeout in the TIME_WAIT state. It
is also assumed that so long as the ISN number is monotonically increasing
(as specified in RFC 793), it is okay for the client to send another SYN
packet to the remote host to initiate another connection. However, the
behavior of the NT SP4 stack appears to not honor this "MUST"
requirement. I have collected snoop traces in which the SYN is sent to a
port in TIME_WAIT with an ISN that is lower than what was used for
the previous connection.

qepc-miles ---> NT client executing the rsh command

vinayb---> Indy running Irix 6.5.x acting as the remote server

Here is an excerpt of the snoop trace.

________________________________
 20   0.124234       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
128 bytes
 20   0.124234       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=114, ID=6923
 20   0.124234       vinayb -> qepc-miles   TCP D=1023 S=514 Fin Ack=40086
Seq=1227531872 Len=74 Win=61320
 20   0.124234       vinayb -> qepc-miles   RSHELL R port=1023
Desktop\ndumpster\ngue
________________________________
 21   0.000657       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
54 bytes
 21   0.000657       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=40, ID=6924
 21   0.000657       vinayb -> qepc-miles   TCP D=1022 S=1023 Fin
Ack=40058 Seq=1227593920 Len=0 Win=61320
________________________________
 22   0.002294   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 22   0.002294   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=40, ID=32012
 22   0.002294   qepc-miles -> vinayb       TCP D=514 S=1023
Ack=1227531947 Seq=40086 Len=0 Win=8685
 22   0.002294   qepc-miles -> vinayb       RSHELL C port=1023
________________________________
 23   0.000047   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 23   0.000047   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=40, ID=32268
 23   0.000047   qepc-miles -> vinayb       TCP D=1023 S=1022
Ack=1227593921 Seq=40058 Len=0 Win=8760
________________________________
 24   0.000522   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 24   0.000522   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=40, ID=32524
 24   0.000522   qepc-miles -> vinayb       TCP D=514 S=1023 Fin
Ack=1227531947 Seq=40086 Len=0 Win=8685
 24   0.000522   qepc-miles -> vinayb       RSHELL C port=1023
________________________________
 25   0.000107   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 25   0.000107   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=40, ID=32780
 25   0.000107   qepc-miles -> vinayb       TCP D=1023 S=1022 Fin
Ack=1227593921 Seq=40058 Len=0 Win=8760
________________________________
 26   0.000187       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
54 bytes
 26   0.000187       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=40, ID=6925
 26   0.000187       vinayb -> qepc-miles   TCP D=1023 S=514     Ack=40087
Seq=1227531947 Len=0 Win=61320
 26   0.000187       vinayb -> qepc-miles   RSHELL R port=1023
________________________________
 27   0.000312       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
54 bytes
 27   0.000312       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=40, ID=6926
 27   0.000312       vinayb -> qepc-miles   TCP D=1022 S=1023
Ack=40059 Seq=1227593921 Len=0 Win=61319
________________________________
 28   8.296753   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 28   8.296753   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=44, ID=34060
 28   8.296753   qepc-miles -> vinayb       TCP D=514 S=1023 Syn Seq=40059
Len=0 Win=8192
 28   8.296753   qepc-miles -> vinayb       RSHELL C port=1023
________________________________
 29   0.000329       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
54 bytes
 29   0.000329       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=40, ID=6988
 29   0.000329       vinayb -> qepc-miles   TCP D=1023 S=514     Ack=40087
Seq=1227531947 Len=0 Win=61320
 29   0.000329       vinayb -> qepc-miles   RSHELL R port=1023
________________________________
 30   0.002796   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 30   0.002796   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=40, ID=34316
 30   0.002796   qepc-miles -> vinayb       TCP D=514 S=1023 Rst Seq=40087
Len=0 Win=0
 30   0.002796   qepc-miles -> vinayb       RSHELL C port=1023
________________________________
 31   2.919513   qepc-miles -> vinayb       ETHER Type=0800 (IP), size =
60 bytes
 31   2.919513   qepc-miles -> vinayb       IP  D=150.166.75.35
S=192.82.162.65 LEN=44, ID=35084
 31   2.919513   qepc-miles -> vinayb       TCP D=514 S=1023 Syn Seq=40059
Len=0 Win=8192
 31   2.919513   qepc-miles -> vinayb       RSHELL C port=1023
________________________________
 32   0.000290       vinayb -> qepc-miles   ETHER Type=0800 (IP), size =
54 bytes
 32   0.000290       vinayb -> qepc-miles   IP  D=192.82.162.65
S=150.166.75.35 LEN=40, ID=7002
 32   0.000290       vinayb -> qepc-miles   TCP D=1023 S=514     Ack=40087
Seq=1227531947 Len=0 Win=61320
 32   0.000290       vinayb -> qepc-miles   RSHELL R port=1023

Packets prior to 28 are part of the earlier connection. Packets from 28
onwards belong to the new session (executing the same command again). The
packets of interest are 28 and 29. 

The NT box initiates a new TCP connection as shown in packet 28. However,
it chooses an ISN which is lower than the last sequence number of the
earlier connection on the same ports, and the 2MSL timeout has not yet
expired. The Irix box responds by sending an ACK packet for the previous
connection, as shown in packet 29.  The NT box responds with a RST in
packet 30. This is where the great debate starts. Since the 2MSL has not
expired, and according to RFC 1337 (Time Wait Assassination Hazard), we do
nothing but send an ACK from the previous session.  Isn't this the correct
behavior? I ask because I have noticed that some vendors drop the PCB from
the TIME_WAIT state and respond with a SYN-ACK to the retransmitted SYN
from the PC. I have snoop traces for SunOS and others which do that. I
guess I am looking for what the correct approach is for these kinds of
situations. On the other hand, if NT SP4 were to generate the proper ISN,
then all this would not be required.
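
The two behaviors can be put side by side in a small sketch.  It is not
taken from any vendor's stack; it assumes the rule that RFC 1122 (section
4.2.2.13) permits, namely that a SYN arriving in TIME_WAIT may reopen the
connection only if its ISN is larger than the largest sequence number seen
on the previous incarnation.  The function names and return strings are
hypothetical.

```python
def seq_gt(a: int, b: int) -> bool:
    """True if 32-bit sequence number a comes after b, modulo 2**32."""
    return a != b and ((a - b) & 0xFFFFFFFF) < 0x80000000


def time_wait_action(syn_seq: int, last_seq_seen: int) -> str:
    """Decide how a PCB in TIME_WAIT reacts to an incoming SYN."""
    if seq_gt(syn_seq, last_seq_seen):
        return "reopen"  # drop the old PCB and answer with a SYN-ACK
    return "ack"         # stay in TIME_WAIT and re-send the last ACK


# Values from the trace: the old connection ended with Ack=40087, and the
# new SYN in packet 28 carries Seq=40059.
print(time_wait_action(40059, 40087))  # ack
```

With the numbers from the trace the SYN does not qualify, which matches the
ACK sent in packet 29.  A stack that answers such a SYN with a SYN-ACK is
applying a looser rule than the one sketched here.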

Comments are appreciated.
Vinay
-- 
Vinay Bannai <vinayb@engr.sgi.com>,  Email Pager: vinayb_p@pager.sgi.com 
Networking Core, Ph: (650)-933-2510, Pager: 1-888-515-3512


From owner-tcp-impl@lerc.nasa.gov  Tue Mar 30 11:57:04 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA21051
	for <tcpimpl-archive@odin.ietf.org>; Tue, 30 Mar 1999 11:57:03 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA25274
	for tcp-impl-outgoing; Tue, 30 Mar 1999 07:36:19 -0500 (EST)
Received: from cerbero.elet.polimi.it (fw01.lerc.nasa.gov [139.88.145.14])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA23385
	for <tcp-impl@grc.nasa.gov>; Tue, 30 Mar 1999 07:31:44 -0500 (EST)
Received: from cerbero.elet.polimi.it (IDENT:vorko@cerbero.elet.polimi.it [131.175.15.1])
	by cerbero.elet.polimi.it (8.9.1a/8.9.1) with SMTP id OAA12153;
	Tue, 30 Mar 1999 14:31:52 +0200 (MET DST)
Message-ID: <3700C437.66E0@cerbero.elet.polimi.it>
Date: Tue, 30 Mar 1999 14:31:51 +0200
From: Emanuele Zanotti <zanotti@cerbero.elet.polimi.it>
X-Mailer: Mozilla 3.04 (X11; I; SunOS 5.5.1 sun4u)
MIME-Version: 1.0
To: tcp-impl@grc.nasa.gov
CC: bettanet@tin.it
Subject: Doubts about RFC 2001 (Slow Start and Congestion Avoidance in TCP)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

-- 
Emanuele Zanotti
Politecnico di Milano
Dipartimento di Elettronica e Informazione
piazza Leonardo da Vinci 32
20133 Milano, Italy
mailto:zanotti@cerbero.elet.polimi.it
phone: +39.02.2399.3604
fax: +39.02.2399.3413


To the kind attention of the TCP implementation Working Group.

While reading RFC 2001 concerning congestion control algorithms (Slow
Start and Congestion Avoidance) a doubt has arisen among me and some of my
colleagues at the University. I kindly ask you to
let us know your opinion about the following points:

1) The RFC states that when congestion occurs, ssthresh is set to: max (2,
0.5*min(cwnd,rcvwnd));
   Somebody states that the correct value is: max(2, min(cwnd/2,
rcvwnd))
So, which is the right version?

2) Can the value of cwnd be greater than the value of rcvwnd? Somebody
states it is not possible. This fact could modify the performance of
these algorithms.

Please send your answers to the following e-mail address too:

emanuzan@tin.it.

Thank you very much for your kind collaboration.


Best Regards.				
					Emanuele Zanotti


From owner-tcp-impl@lerc.nasa.gov  Wed Mar 31 13:39:23 1999
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA27628
	for <tcpimpl-archive@odin.ietf.org>; Wed, 31 Mar 1999 13:39:23 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id JAA13462
	for tcp-impl-outgoing; Wed, 31 Mar 1999 09:55:26 -0500 (EST)
Received: from guns.lerc.nasa.gov (guns.lerc.nasa.gov [139.88.44.160])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id JAA13425;
	Wed, 31 Mar 1999 09:55:00 -0500 (EST)
Received: from guns.lerc.nasa.gov by guns.lerc.nasa.gov with ESMTP (NASA LeRC 8.7.4.1/2.01-local)
        id JAA25967; Wed, 31 Mar 1999 09:55:00 -0500 (EST)
Message-Id: <199903311455.JAA25967@guns.lerc.nasa.gov>
To: Emanuele Zanotti <zanotti@cerbero.elet.polimi.it>
From: Mark Allman <mallman@grc.nasa.gov>
Reply-To: mallman@grc.nasa.gov
cc: tcp-impl@grc.nasa.gov, bettanet@tin.it, emanuzan@tin.it
Subject: Re: Doubts about RFC 2001 (Slow Start and Congestion Avoidance in TCP) 
Organization: Late Night Hackers, NASA Glenn, Cleveland, Ohio
Song-of-the-Day: Sweet Emotion
Date: Wed, 31 Mar 1999 09:54:59 -0500
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


[If this is a repeat, I apologize...  My mailer is acting funny
 today. --allman]

Hopefully these are cleared up in 2001.bis, which is currently in
the RFC editor's queue.  (draft-ietf-tcpimpl-cong-control-05.txt).

> 1) RFC states that when a congestion occurs, ssthresh is set to: max (2,
> 0.5*min(cwnd,rcvwnd));
>    Somebody states that the correct value is: max(2, min(cwnd/2,
> rcvwnd))
> So, which the right version?

When congestion occurs you should set ssthresh to:

    max (2*MSS,FlightSize/2)

where FlightSize is the actual amount of outstanding data that has
been sent into the network.  (Depending on the implementation,
FlightSize may equal cwnd, or it may not).
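
As a minimal sketch of that rule, assuming all quantities are kept in bytes
and that integer division is acceptable (the function name and sample
values are hypothetical):

```python
def ssthresh_after_loss(flight_size: int, mss: int) -> int:
    """New ssthresh in bytes: max(2*MSS, FlightSize/2)."""
    return max(2 * mss, flight_size // 2)


MSS = 1460
print(ssthresh_after_loss(16 * MSS, MSS))  # 11680, i.e. 8 segments
print(ssthresh_after_loss(1 * MSS, MSS))   # 2920, the floor of 2 segments
```

The floor of 2*MSS keeps ssthresh from collapsing when very little data was
in flight at the time of the loss.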

> 2) Can the value of cwnd be greater than the value of rcvwnd ?

Yes.  The value of cwnd in some implementations grows beyond the
size of the advertised window.  However, the amount of outstanding
data that can be sent into the network at any point in time is min
(cwnd,rwnd).  So, even if cwnd grows incidentally large, TCP is not
allowed to send more data than the window dictated by the receiver.
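
The limit can be sketched as follows.  This is only an illustration in
bytes; the function name and the sample numbers are hypothetical.

```python
def sendable_bytes(cwnd: int, rwnd: int, flight_size: int) -> int:
    """Bytes that may still be sent given data already outstanding."""
    # The effective window is the smaller of the two, whatever cwnd is.
    return max(0, min(cwnd, rwnd) - flight_size)


# cwnd has grown well past the receiver's advertised window of 8760 bytes.
print(sendable_bytes(cwnd=65535, rwnd=8760, flight_size=4380))  # 4380
print(sendable_bytes(cwnd=65535, rwnd=8760, flight_size=8760))  # 0
```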

allman