From: clive@demon.net (Clive D.W. Feather) Date: Tue, 6 Jan 2004 10:34:32 +0000 Subject: [xml2rfc] Re: Abbreviation or end of sentence? In-Reply-To: <5.1.0.14.2.20040105124833.00ba3138@127.0.0.1> References: <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> <5.1.0.14.2.20040105124833.00ba3138@127.0.0.1> Message-ID: <20040106103432.GD51961@finch-staff-1.thus.net> Graham Klyne said: >> I might want to mechanically put tags round each i.e. and e.g. >> and Dr. and Mrs. in my text because, after all, they're abbreviations. I >> could then automatically expand "Dr." to "Doctor" in some >> situations. When I do so, I don't want the sentence-end semantics to >> change. > > [Having very little to do with XML2RFC ...] > > I learned [1] that a period is not used following a contraction (where the > last letter of the abbreviation is also the last letter of the full > word). So "Dr" and "Mrs" above would properly not be followed by a period, > unless (in some contorted way?), they appear at the end of a sentence. Even if so, there are similar abbreviations such as "Rev." which this clearly doesn't apply to. -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: clive@demon.net (Clive D.W. Feather) Date: Mon, 5 Jan 2004 17:15:12 +0000 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <3FF5AFC4.4060906@gmx.de>
References: <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> <3FF5AFC4.4060906@gmx.de>
Message-ID: <20040105171512.GD2703@finch-staff-1.thus.net>

Julian Reschke said:
> I could live with &nbsp;, although I think it doesn't exactly mean what
> we need (after all it means "no break here", and we don't want to forbid
> a break, we just don't want an additional *space*).

My reading is that &nbsp; means "no break and no expansion". The latter is what we want after an abbreviation; I would normally want the former as well.

If you want a "can break but can't expand" character, then &ensp; (U+2002) seems the right choice, since one en is the traditional width of an inter-word space. If you want a "can't break but can expand", then you put a fixed width space like &nbsp; followed by &zwsp; (U+200B), which is a zero width space which is allowed to expand - i.e. it marks where space can be added.

I would recommend to Marshall that he implement all of these:

    &nbsp;  U+00A0  single space that never breaks or widens to two
    &ensp;  U+2002  single space that can break but never widens to two
    &emsp;  U+2003  double space that can break but never shrinks to one
    &fsp;   U+2007  space always exactly the same width as a digit
    &zwsp;  U+200B  place where space can be added when justifying

( and &fsp; are useful semantic objects).

> Anyway, I'm not going to support a grammar element that is defined to
> have "no" meaning except for a strange side effect...

Agreed.

-- Clive D.W.
Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Mon, 5 Jan 2004 16:25:50 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105151402.GI77123@finch-staff-1.thus.net> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <20040105145024.GG77123@finch-staff-1.thus.net> <20040105160741.7d639afa.henrik@levkowetz.com> <20040105151402.GI77123@finch-staff-1.thus.net> Message-ID: <20040105162550.2c22b141.henrik@levkowetz.com> --Signature=_Mon__5_Jan_2004_16_25_50_+0100_b=GOnnbEt8_qT8R7 Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Clive, Going by the excellent exposition given by John Klensin earlier, two of your sentence ending examples below are not sentence endings in either the british nor the logical style, only in the traditional american style: Monday 5 January 2004, Clive D.W. Feather wrote: ... > Um, why? The commonest cases are: > > Sentence end: > He said "I am here." Not end of sentence except in traditional US. style > He said "I am here". > He said "What is it?" Not end of sentence except in traditional US. style > He said "What is it?". > She said 'he said "I am here"'. ... In view of this, it seems simpler to let the editor add the necessary end of sentence indications according to which style he's following. 
Henrik --Signature=_Mon__5_Jan_2004_16_25_50_+0100_b=GOnnbEt8_qT8R7 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+YH/eVhrtTJkXCMRApCPAKCwZlFXncUnSVjYjNVk0UoHe8cJYQCgl5Od g6a8rU6iS9uDK5H4jZS9uBs= =YyHL -----END PGP SIGNATURE----- --Signature=_Mon__5_Jan_2004_16_25_50_+0100_b=GOnnbEt8_qT8R7-- From: clive@demon.net (Clive D.W. Feather) Date: Mon, 5 Jan 2004 15:14:02 +0000 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105160741.7d639afa.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <20040105145024.GG77123@finch-staff-1.thus.net> <20040105160741.7d639afa.henrik@levkowetz.com> Message-ID: <20040105151402.GI77123@finch-staff-1.thus.net> Henrik Levkowetz said: >> I'm suggesting that markup is the wrong thing to do. > Well, we seem to pretty consistently agree to disagree on this point :-) Indeed. > I think markup is right and adding significance to particular characters > in the manner you propose is bad. Yet the primary problem is because particular characters (dot etc.) have significance over and above their use as a graphic. >> (4) A "sentence end sequence" is one of the characters . ! or ? followed by >> zero or more of the characters ' or ". > Yes, but you'll still have this come out wrong for some styles. Um, why? The commonest cases are: Sentence end: He said "I am here." He said "I am here". He said "What is it?" He said "What is it?". She said 'he said "I am here"'. Not sentence end: Pilate asked 'What is truth?' and did not wait for an answer. My proposal would get all of these right without problem. About the only hard case I can come up with is: I go ... I come back. where I would not expect the ... to end the sentence but my rules would. And that's only because of the capital letter. 
In my proposal, you'd have to write it as: I go ... I come back. > We might > just as well leave it to explicit markup. It ought to work right for as many unaltered texts as possible. Whatever the override mechanism, it should only be for the corner cases. -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Mon, 5 Jan 2004 16:11:02 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <3FF97B5A.6000705@gmx.de> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <3FF977A9.50909@gmx.de> <20040105155047.3dd41400.henrik@levkowetz.com> <3FF97B5A.6000705@gmx.de> Message-ID: <20040105161102.054176e8.henrik@levkowetz.com> --Signature=_Mon__5_Jan_2004_16_11_02_+0100_jet.ZTFkpqhty6aq Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Monday 5 January 2004, Julian Reschke wrote: ... > >>However, I think in reality it's more complex than that. Should there be > >>an automatism to detect > >> > >> Foo. > >> > >>as sentence end, or should that be handled through the overrides? > > > > > > I haven't used , so don't have a strong opinion on this. > > What would you suggest? > > Whatever we define, it shouldn't be a special case. Either say that > sentence ends are only detected automatically when there's no > interleaved markup (easy) or state that sentence end detection happens > logically during the output state (that would be harder). Ok. > Personally I think that the easy approach is good enough and more likely > to work robustly. If we have a manual override, that should be enough, i.e. > > Foo. Sounds good. 
> BTW: I can live with and (now that these elements *do* > have semantics :-), but I think I still prefer PIs. Ok. I guess we leave it up to Marshall then... Henrik --Signature=_Mon__5_Jan_2004_16_11_02_+0100_jet.ZTFkpqhty6aq Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+X6GeVhrtTJkXCMRAlzQAJwIjeCMxVzRKS3aTL+kRAiZUHOpuACgnif0 q/Z4gnGagixH42ZGibeRPJg= =zya/ -----END PGP SIGNATURE----- --Signature=_Mon__5_Jan_2004_16_11_02_+0100_jet.ZTFkpqhty6aq-- From: henrik@levkowetz.com (Henrik Levkowetz) Date: Mon, 5 Jan 2004 16:07:41 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105145024.GG77123@finch-staff-1.thus.net> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <20040105145024.GG77123@finch-staff-1.thus.net> Message-ID: <20040105160741.7d639afa.henrik@levkowetz.com> --Signature=_Mon__5_Jan_2004_16_07_41_+0100_dh.LzwFho9o0AfUg Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Monday 5 January 2004, Clive D.W. Feather wrote: > Henrik Levkowetz said: > > I think that your proposal below could be made to work, but I also > > think that the rules are to complex and non-intuitive for the casual > > user. > > I don't. > > > Having a basic rule of > > a) any of "." "!" "?" followed by whitespace is sentence end > > My basic rule allows closing quotes after them, but otherwise agrees. Oh, > only if followed by a word beginning with a capital letter. > > > b) override markup and > > I'm suggesting that markup is the wrong thing to do. Well, we seem to pretty consistently agree to disagree on this point :-) I think markup is right and adding significance to particular characters in the manner you propose is bad. 
> Rather, use a wide > space character to force end of sentence, and a non-break space to force > not-end-of-sentence. > The rest of my proposal is gravy. > > >> (4) The characters . ! and ? are "sentence end characters". Any sequence of > >> one or more sentence end characters, possibly mixed in with zero or more ' > >> or " characters, is a "sentence end sequence". [So both ". and ." are such > >> sequences, but "' without one of the other three isn't.] > > I just realized that could be simplified to: > > (4) A "sentence end sequence" is one of the characters . ! or ? followed by > zero or more of the characters ' or ". Yes, but you'll still have this come out wrong for some styles. We might just as well leave it to explicit markup. Henrik --Signature=_Mon__5_Jan_2004_16_07_41_+0100_dh.LzwFho9o0AfUg Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+X29eVhrtTJkXCMRAkaiAKDNceYcnUiO1NUXI4NGvIQb4wX72wCfcb1o Gz5zNAkKK2CZKnKtItXqznI= =280Y -----END PGP SIGNATURE----- --Signature=_Mon__5_Jan_2004_16_07_41_+0100_dh.LzwFho9o0AfUg-- From: julian.reschke@gmx.de (Julian Reschke) Date: Mon, 05 Jan 2004 15:57:30 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105155047.3dd41400.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <3FF977A9.50909@gmx.de> <20040105155047.3dd41400.henrik@levkowetz.com> Message-ID: <3FF97B5A.6000705@gmx.de> Henrik Levkowetz wrote: > Monday 5 January 2004, Julian Reschke wrote: > >>Henrik Levkowetz wrote: > > ... > >>> I think that your proposal below could be made to work, but I also >>>think that the rules are to complex and non-intuitive for the casual >>>user. Having a basic rule of >>> a) any of "." "!" "?" 
followed by whitespace is sentence end >>>and >>> b) override markup and >>>is a bit easier to both explain and use, I think. >> >>However, I think in reality it's more complex than that. Should there be >>an automatism to detect >> >> Foo. >> >>as sentence end, or should that be handled through the overrides? > > > I haven't used , so don't have a strong opinion on this. > What would you suggest? Whatever we define, it shouldn't be a special case. Either say that sentence ends are only detected automatically when there's no interleaved markup (easy) or state that sentence end detection happens logically during the output state (that would be harder). Personally I think that the easy approach is good enough and more likely to work robustly. If we have a manual override, that should be enough, i.e. Foo. BTW: I can live with and (now that these elements *do* have semantics :-), but I think I still prefer PIs. Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Mon, 5 Jan 2004 15:50:47 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <3FF977A9.50909@gmx.de> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> <3FF977A9.50909@gmx.de> Message-ID: <20040105155047.3dd41400.henrik@levkowetz.com> --Signature=_Mon__5_Jan_2004_15_50_47_+0100_DHv=wwtROPQg8JDV Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Monday 5 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: ... > > I think that your proposal below could be made to work, but I also > > think that the rules are to complex and non-intuitive for the casual > > user. Having a basic rule of > > a) any of "." "!" "?" followed by whitespace is sentence end > > and > > b) override markup and > > is a bit easier to both explain and use, I think. 
> > However, I think in reality it's more complex than that. Should there be > an automatism to detect > > Foo. > > as sentence end, or should that be handled through the overrides? I haven't used , so don't have a strong opinion on this. What would you suggest? Henrik --Signature=_Mon__5_Jan_2004_15_50_47_+0100_DHv=wwtROPQg8JDV Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+XnHeVhrtTJkXCMRAthPAKDZqSrnriLiIhdPw1HQlkM/WhrbYgCgugF9 /C8mCQe7WZaBunwMlU+pC8s= =AaOE -----END PGP SIGNATURE----- --Signature=_Mon__5_Jan_2004_15_50_47_+0100_DHv=wwtROPQg8JDV-- From: clive@demon.net (Clive D.W. Feather) Date: Mon, 5 Jan 2004 14:50:25 +0000 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105153408.4cd888d0.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> Message-ID: <20040105145024.GG77123@finch-staff-1.thus.net> Henrik Levkowetz said: > I think that your proposal below could be made to work, but I also > think that the rules are to complex and non-intuitive for the casual > user. I don't. > Having a basic rule of > a) any of "." "!" "?" followed by whitespace is sentence end My basic rule allows closing quotes after them, but otherwise agrees. Oh, only if followed by a word beginning with a capital letter. > b) override markup and I'm suggesting that markup is the wrong thing to do. Rather, use a wide space character to force end of sentence, and a non-break space to force not-end-of-sentence. The rest of my proposal is gravy. >> (4) The characters . ! and ? are "sentence end characters". Any sequence of >> one or more sentence end characters, possibly mixed in with zero or more ' >> or " characters, is a "sentence end sequence". [So both ". and ." are such >> sequences, but "' without one of the other three isn't.] 
I just realized that could be simplified to: (4) A "sentence end sequence" is one of the characters . ! or ? followed by zero or more of the characters ' or ". -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: julian.reschke@gmx.de (Julian Reschke) Date: Mon, 05 Jan 2004 15:41:45 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105153408.4cd888d0.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> <20040105153408.4cd888d0.henrik@levkowetz.com> Message-ID: <3FF977A9.50909@gmx.de> Henrik Levkowetz wrote: > Hi Clive, > > I think that your proposal below could be made to work, but I also > think that the rules are to complex and non-intuitive for the casual > user. Having a basic rule of > a) any of "." "!" "?" followed by whitespace is sentence end > and > b) override markup and > is a bit easier to both explain and use, I think. However, I think in reality it's more complex than that. Should there be an automatism to detect Foo. as sentence end, or should that be handled through the overrides? 
Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Mon, 5 Jan 2004 15:34:08 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040105121141.GE77123@finch-staff-1.thus.net> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <20040105121141.GE77123@finch-staff-1.thus.net> Message-ID: <20040105153408.4cd888d0.henrik@levkowetz.com> --Signature=_Mon__5_Jan_2004_15_34_08_+0100_if4DuIeGfT78B6Nx Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Hi Clive, I think that your proposal below could be made to work, but I also think that the rules are to complex and non-intuitive for the casual user. Having a basic rule of a) any of "." "!" "?" followed by whitespace is sentence end and b) override markup and is a bit easier to both explain and use, I think. Henrik Monday 5 January 2004, Clive D.W. Feather wrote: > > (1) Within a (and anything else equivalent) all sequences of ordinary > spaces, tabs, and newlines are treated as equivalent. I call them > "whitespace" in the rest of this proposal. > > (2) The following characters are recognised but are *not* whitespace: > > U+00A0 > U+2002 > U+2003 > &fsp; U+2007 > > These are "other space characters". Their semantics are described below. > [&fsp; is optional.] > > (3) [Optional, to aid readability] Whitespace adjacent to other space > characters is ignored. > > (4) The characters . ! and ? are "sentence end characters". Any sequence of > one or more sentence end characters, possibly mixed in with zero or more ' > or " characters, is a "sentence end sequence". [So both ". and ." are such > sequences, but "' without one of the other three isn't.] 
> > (5) A actual sentence ending occurs in three places: > (A) At the end of a paragraph, irrespective of the last character; > (B) Immediately before an character; > (C) At the end of a sentence end sequence followed by whitespace and > then an uppercase letter. > Case (A) can be ignored. In the case of (B) or (C), the or whitespace > is a "sentence gap". > > (6) [Optional, to allow more control over spacing] The following character > is recognised: > > &zwsp; U+200B > > If it occurs within whitespace it is ignored. If it occurs elsewhere it has > semantics described below. > > (7) The visual appearance, line-break, and justification properties of the > various characters and sequences are: > > Appearance Break Expand > sentence gap 2 spaces yes yes > other whitespace 1 space yes yes > 1 space no no > 1 space yes no > &fsp; 1 space no no > &zwsp; nothing no yes > > "Break" means that the space (and any adjacent space) can be replaced by > a line-break. > "Expand" means that extra space can be added after the space to provide > justification. > > [The difference between and &fsp; is semantic - a &fsp; is a digit > that happens to be all white, rather than a space character. In particular, > it should in general not be viewed as a word separator.] > > (8) [Optional, to provide user convenience] Define a processing directive > something like: > > indicating that "i.e." is not a sentence ending string. You might also want > a predefined list of these ("i.e", "e.g.", "Dr.", "Mrs.", "etc.", etc.), > and a way of turning that list off. 
--Signature=_Mon__5_Jan_2004_15_34_08_+0100_if4DuIeGfT78B6Nx Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+XXgeVhrtTJkXCMRAkwGAJ9sTufUSXrLLHR62jLi+08nrRNSOQCaAhp3 8aNax/VM4n4RmfviCg5HPP4= =p5M7 -----END PGP SIGNATURE----- --Signature=_Mon__5_Jan_2004_15_34_08_+0100_if4DuIeGfT78B6Nx-- From: GK@ninebynine.org (Graham Klyne) Date: Mon, 05 Jan 2004 13:00:54 +0000 Subject: [xml2rfc] Abbreviation or end of sentence? In-Reply-To: <20040105113421.GB77123@finch-staff-1.thus.net> References: <20040103171959.18366810.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> Message-ID: <5.1.0.14.2.20040105124833.00ba3138@127.0.0.1> At 11:34 05/01/04 +0000, Clive D.W. Feather wrote: >I might want to mechanically put tags round each i.e. and e.g. >and Dr. and Mrs. in my text because, after all, they're abbreviations. I >could then automatically expand "Dr." to "Doctor" in some >situations. When I do so, I don't want the sentence-end semantics to >change. [Having very little to do with XML2RFC ...] I learned [1] that a period is not used following a contraction (where the last letter of the abbreviation is also the last letter of the full word). So "Dr" and "Mrs" above would properly not be followed by a period, unless (in some contorted way?), they appear at the end of a sentence. #g -- [1] Bill Bryson, "Dictionary of Troublesome Words", section "abbreviations, contractions, acronyms" ------------ Graham Klyne For email: http://www.ninebynine.org/#Contact From: clive@demon.net (Clive D.W. 
Feather)
Date: Mon, 5 Jan 2004 12:11:41 +0000
Subject: [xml2rfc] Another end of sentence handling proposal
In-Reply-To: <20040103213016.21c87c2a.henrik@levkowetz.com>
References: <20040103213016.21c87c2a.henrik@levkowetz.com>
Message-ID: <20040105121141.GE77123@finch-staff-1.thus.net>

Henrik Levkowetz said:
> So, here's yet another end of sentence handling proposal.

And here is mine, based on using the various Unicode spaces.

(1) Within a <t> (and anything else equivalent) all sequences of ordinary spaces, tabs, and newlines are treated as equivalent. I call them "whitespace" in the rest of this proposal.

(2) The following characters are recognised but are *not* whitespace:

    &nbsp; U+00A0
    &ensp; U+2002
    &emsp; U+2003
    &fsp;  U+2007

These are "other space characters". Their semantics are described below. [&fsp; is optional.]

(3) [Optional, to aid readability] Whitespace adjacent to other space characters is ignored.

(4) The characters . ! and ? are "sentence end characters". Any sequence of one or more sentence end characters, possibly mixed in with zero or more ' or " characters, is a "sentence end sequence". [So both ". and ." are such sequences, but "' without one of the other three isn't.]

(5) An actual sentence ending occurs in three places:
    (A) At the end of a paragraph, irrespective of the last character;
    (B) Immediately before an &emsp; character;
    (C) At the end of a sentence end sequence followed by whitespace and then an uppercase letter.
Case (A) can be ignored. In the case of (B) or (C), the &emsp; or whitespace is a "sentence gap".

(6) [Optional, to allow more control over spacing] The following character is recognised:

    &zwsp; U+200B

If it occurs within whitespace it is ignored. If it occurs elsewhere it has semantics described below.
(7) The visual appearance, line-break, and justification properties of the various characters and sequences are:

                        Appearance   Break   Expand
    sentence gap        2 spaces     yes     yes
    other whitespace    1 space      yes     yes
    &nbsp;              1 space      no      no
    &ensp;              1 space      yes     no
    &fsp;               1 space      no      no
    &zwsp;              nothing      no      yes

"Break" means that the space (and any adjacent space) can be replaced by a line-break.
"Expand" means that extra space can be added after the space to provide justification.

[The difference between &nbsp; and &fsp; is semantic - a &fsp; is a digit that happens to be all white, rather than a space character. In particular, it should in general not be viewed as a word separator.]

(8) [Optional, to provide user convenience] Define a processing directive something like:

indicating that "i.e." is not a sentence ending string. You might also want a predefined list of these ("i.e.", "e.g.", "Dr.", "Mrs.", "etc.", etc.), and a way of turning that list off.

-- 
Clive D.W. Feather | Work: | Tel: +44 20 8495 6138
Internet Expert | Home: | *** NOTE CHANGE ***
Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937
Thus plc | | Mobile: +44 7973 377646

From: clive@demon.net (Clive D.W. Feather)
Date: Mon, 5 Jan 2004 11:42:55 +0000
Subject: [xml2rfc] Another end of sentence handling proposal
In-Reply-To: <526732190.1073164446@scan.jck.com>
References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com>
Message-ID: <20040105114255.GD77123@finch-staff-1.thus.net>

John C Klensin said:
> The problem/mess is that the conventions for quotes at
> sentence-end differ depending on which side of the pond one is
> on. If I recall, so do the quotes. Specifically, the form
>
>     Sentence ends "here."
>
> is, indeed, standard usage.

Please don't characterise this as a US v UK issue, since it's much more complex than that (as some of your other quotes show).

-- Clive D.W.
Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: clive@demon.net (Clive D.W. Feather) Date: Mon, 5 Jan 2004 11:34:21 +0000 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103171959.18366810.henrik@levkowetz.com> References: <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> Message-ID: <20040105113421.GB77123@finch-staff-1.thus.net> Henrik Levkowetz said: > * To avoid abbreviations triggering this, they may be enclosed in > . Abbreviations ending a sentence can be left > without markup. I don't like this, because it overloads the meaning. I might want to mechanically put tags round each i.e. and e.g. and Dr. and Mrs. in my text because, after all, they're abbreviations. I could then automatically expand "Dr." to "Doctor" in some situations. When I do so, I don't want the sentence-end semantics to change. -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: clive@demon.net (Clive D.W. Feather) Date: Mon, 5 Jan 2004 11:29:18 +0000 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <20040103011917.4d57f586.henrik@levkowetz.com> References: <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> <3FF5AFC4.4060906@gmx.de> <20040102233932.7a95820f.henrik@levkowetz.com> <3FF5F9E1.8070008@gmx.de> <20040103011917.4d57f586.henrik@levkowetz.com> Message-ID: <20040105112918.GA77123@finch-staff-1.thus.net> Henrik Levkowetz said: > Julian, I've clarified what semantics has, not said that it has > no semantics. Please. > > Supposing you have an input symbol stream in which "." is followed > immediately by WSP, or contrarily one in which "." is followed > immediately by a symbol and then WSP, or any other tag and then > WSP. Since has no semantics, that means you're using the presence or absence of markup to decide where you have a sentence end - your "any other tag". And I've already shown that that's broken. -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: braden@ISI.EDU (Bob Braden) Date: Sun, 4 Jan 2004 14:21:38 -0800 (PST) Subject: [xml2rfc] end of sentence: two spaces? Message-ID: <200401042221.i04MLc800806@boreas.isi.edu> *> > *> Of COURSE I do agree with the desire/need to have xml2rfc do the RIGHT *> and CORRECT thing. *> *> All I was saying is that it seems wasting cycles if (RFC-)editor *> (or any person for that matter) goes through a document in a very *> detailed way and just add extra space in between sentences. *> *> Bert *> Bert, Sorry, we don't get your point. RFCs have always used two spaces after periods separating sentences. It makes them more readable, in the absence of a variable-width font. This was a Jon Postel convention. 
The RFC Editor has an automatic tool to find exceptions to this rule, but it cannot correct them entirely automatically because of ambiguities that require humans for resolution. But this is no big deal; it represents a negligible part of the editorial effort.

OTOH, I would be astonished to learn that xml2rfc does not already take care of this.

Happy New Year to all,

Bob Braden for the RFC Editor

From: henrik@levkowetz.com (Henrik Levkowetz)
Date: Sun, 4 Jan 2004 23:09:31 +0100
Subject: [xml2rfc] Another end of sentence handling proposal
In-Reply-To: <20040104101138.4b0e7525.mrose+internet.xml2rfc@dbc.mtview.ca.us>
References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com> <3FF7F796.6040305@gmx.de> <20040104161641.124e78fb.henrik@levkowetz.com> <20040104101138.4b0e7525.mrose+internet.xml2rfc@dbc.mtview.ca.us>
Message-ID: <20040104230931.1720c1d9.henrik@levkowetz.com>

--Signature=_Sun__4_Jan_2004_23_09_31_+0100_v8Z8/4UresmiPY=O
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Sunday 4 January 2004, Marshall Rose wrote:
> so, what is the proposal currently under consideration, as stated in
> its most concise form?

This is the current proposal, amended based on comments from Alex Rousskov and Julian. Julian has not indicated agreement on using markup rather than processing instructions, but seems OK with the proposal otherwise.

* "." | "!" | "?" followed by one or more whitespace characters is identified as sentence end. (This also includes endings like "...", "?!", "!..").

* Sentences ending in a quote, where the punctuation is placed inside the quote (as in: Sentence blah blah "quote.") will not be automatically identified as end of sentence, but will need explicit markup instead.

* Forced sentence ending (needed e.g. when a sentence ends with a quote) is indicated by something like or , with preference for a short form markup like .
* Whitespace adjacent to sentence end determined either implicitly or explicitly by markup will not be preserved; it will be possible to write (space represented as "_"): " _____This sentence ends with a quote "To be or not to be." _____ And here's the next sentence." or: " _____Foo_bar. _____Foo_bar." and have it come out with exactly 2 spaces between sentences. * Ignore sentence end (needed after abbreviations ending in ".") is indicated by something like or with preference for a short form markup like * Sentence end is rendered by the .txt renderer as "." SP SP Henrik --Signature=_Sun__4_Jan_2004_23_09_31_+0100_v8Z8/4UresmiPY=O Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/+I8beVhrtTJkXCMRAjxqAJ9nbBq5tApbYZzddFaB6t8EprTPmQCgnrPT x3xMymb8/oxBmehwH+9XPLI= =2PaY -----END PGP SIGNATURE----- --Signature=_Sun__4_Jan_2004_23_09_31_+0100_v8Z8/4UresmiPY=O-- From: swb@employees.org (Scott W Brim) Date: Sun, 4 Jan 2004 13:58:27 -0500 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040104101138.4b0e7525.mrose+internet.xml2rfc@dbc.mtview.ca.us> References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com> <3FF7F796.6040305@gmx.de> <20040104161641.124e78fb.henrik@levkowetz.com> <20040104101138.4b0e7525.mrose+internet.xml2rfc@dbc.mtview.ca.us> Message-ID: <20040104185827.GC1496@sbrim-w2k01> On Sun, Jan 04, 2004 10:11:38AM -0800, Marshall Rose allegedly wrote: > so, what is the proposal currently under consideration, as stated in > its most concise form? I suggest leaving things as they are for the time being.
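For readers following the thread, the rules in Henrik's proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not xml2rfc code: the markers `{{END}}` and `{{NOEND}}` are hypothetical stand-ins for the forced-end and ignore-end markup (the thread had not settled on names), and `render_text` is an invented helper name.

```python
import re

# Hypothetical placeholder markers; the thread never fixed the real markup names.
FORCE_END = "{{END}}"      # forced sentence end, e.g. after a closing quote
IGNORE_END = "{{NOEND}}"   # "." here is an abbreviation, not a sentence end

_SINGLE = "\x00"           # private sentinel for a space that must stay single


def render_text(source: str) -> str:
    """Collapse whitespace, then emit exactly two spaces after a sentence end."""
    # Whitespace in the source is not preserved: every run becomes one space.
    text = " ".join(source.split())
    # Ignore-end marker: protect the following gap from the implicit rule.
    text = re.sub(r" ?" + re.escape(IGNORE_END) + r" ?", _SINGLE, text)
    # Forced-end marker: always becomes a two-space sentence gap.
    text = re.sub(r" ?" + re.escape(FORCE_END) + r" ?", "  ", text)
    # Implicit rule: ".", "!" or "?" followed by whitespace ends a sentence.
    # Only the last mark of runs like "..." or "?!" needs to match.
    text = re.sub(r"([.!?]) (?! )", r"\1  ", text)
    return text.replace(_SINGLE, " ").rstrip()


if __name__ == "__main__":
    print(render_text("   Foo bar.\n   Foo bar."))
    print(render_text("See e.g.{{NOEND}} the draft. Next."))
    print(render_text('Ends with "a quote."\n   {{END}}\n   Next one.'))
```

A sentence whose punctuation sits inside the closing quote is deliberately not detected by the implicit rule, which is why the third call needs the explicit marker.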
From: mrose+internet.xml2rfc@dbc.mtview.ca.us (Marshall Rose) Date: Sun, 4 Jan 2004 10:11:38 -0800 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040104161641.124e78fb.henrik@levkowetz.com> References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com> <3FF7F796.6040305@gmx.de> <20040104161641.124e78fb.henrik@levkowetz.com> Message-ID: <20040104101138.4b0e7525.mrose+internet.xml2rfc@dbc.mtview.ca.us> > I now agree. Thanks John, for the clarifying information! so, what is the proposal currently under consideration, as stated in its most concise form? thanks, /mtr From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sun, 4 Jan 2004 16:16:41 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <3FF7F796.6040305@gmx.de> References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com> <3FF7F796.6040305@gmx.de> Message-ID: <20040104161641.124e78fb.henrik@levkowetz.com> Sunday 4 January 2004, Julian Reschke wrote: > John, > > thanks for the explanation. I was suspecting that this was partly an > Americanism :-) > > So I'd vote for xml2rfc *not* to use these additional rules. I now agree. Thanks John, for the clarifying information! Henrik From: julian.reschke@gmx.de (Julian Reschke) Date: Sun, 04 Jan 2004 12:23:02 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <526732190.1073164446@scan.jck.com> References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> <526732190.1073164446@scan.jck.com> Message-ID: <3FF7F796.6040305@gmx.de> John, thanks for the explanation. I was suspecting that this was partly an Americanism :-) So I'd vote for xml2rfc *not* to use these additional rules.
Regards, Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: julian.reschke@gmx.de (Julian Reschke) Date: Sun, 04 Jan 2004 12:19:40 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040104010119.69d1ae43.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <3FF73EBF.7020905@gmx.de> <20040104010119.69d1ae43.henrik@levkowetz.com> Message-ID: <3FF7F6CC.6060608@gmx.de> Henrik Levkowetz wrote: > Oh, you can rest assured, this will be enforced. Haven't you noticed > the third paragraph of a draft, in the "status of this memo" section? Indeed. Needless to say, when I wrote rfc2629.xslt I thought that was a typo and "fixed" it. > ... >>>Something short like would be more likely to be used, >>>IMHO. This should cover cases like (not the markup on a different >>>line, with indentation) >>> >>>_____This sentences ends with the following URL: http://foo.org/ >>>_____ >> >>Interesting example. I'd say that a sentence never can end in anything >>other than a punctuation mark, and that the above should be written as: >> >>_____This sentences ends with the following URL: . > > > ????? If you insert markup saying "the sentence ends here", it should > shouldn't be conditional on the particular preceding character! Maybe > I'm not understanding you correctly here. Yep. I didn't intend to say that. What I wanted to say is that I don't understand that example. A sentence should always end with punctuation, thus if a sentence ends in a URL it should be written as: See also http://xml.resource.org. Or: See also . But never: See also http://xml.resource.org Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: julian.reschke@gmx.de (Julian Reschke) Date: Sun, 04 Jan 2004 12:08:53 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <8169.1073173309@marajade.sandelman.ottawa.on.ca> References: <8169.1073173309@marajade.sandelman.ottawa.on.ca> Message-ID: <3FF7F445.8040004@gmx.de> Michael Richardson wrote: > Um. How important are two spaces between sentences to the RFC-editor. That was indeed the first question asked, and the answer was that the RFC Editor indeed inserts the spaces. This is a bad thing because - the RFC Editor needs to spend time on things that should be automated and - the xml2rfc txt output differs from published material, causing for instance differences in page breaks, TOC and index getting out of sync and so on... > We seem to be doing a lot of thinking here. > > If it is really worth keeping the two spaces, then we should have sentences > marked, as well as paragraphs. I.e. ~~Subject Verb Nounce~~ But we don't, and it seems that (almost) nobody is willing to add that markup. > I know how ugly that could be. > (Frankly, I wish XML was more like latex, with a blank line between paragraphs!) That's not XML's fault. The RFC2629 could be that way, but it isn't. > It there anyway we can go the other way? Less markup rather than more? > I say this as someone that liked the HTML 0.9 revision :-) > > I think many of us do not edit the XML with anything other than Emacs or equivalent. Yes, that's why we try to achieve the result with no additional markup (except in edge cases). Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: john+xml@jck.com (John C Klensin) Date: Sat, 03 Jan 2004 21:14:06 -0500 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> References: <20040104000200.19036.69117.Mailman@qawoor.dbc.mtview.ca.us> Message-ID: <526732190.1073164446@scan.jck.com> Folks, speaking as a native speaker of one of the languages called "English", we are rapidly headed into a mess here.
It would be good if the RFC Editor would say something definitive but, in the past, presumably out of respect to good sense in one instance and to our British colleagues in the other, both sets of forms have, I believe, been permitted. The problem/mess is that the conventions for quotes at sentence-end differ depending on which side of the pond one is on. If I recall, so do the quotes. Specifically, the form Sentence ends "here." is, indeed, standard usage. It is also, as some native speakers of English and virtually every non-native speaker has observed, bizarre -- bizarre enough that many technical publications have adopted style manuals that insist on the other form to avoid massive confusion. Consider the sentence The name of the most-heavily-populated TLD is "COM.". Now, neither of those periods is superfluous. The second one ends the sentence, the first one is the formal notation for the root. This could, of course, rather easily appear in an RFC (and probably does, somewhere). Strict application of the rule would yield The name of the most-heavily-populated TLD is "COM.." which defies all reason and is confusing and ambiguous. Even Fowler (a, if not the, standard reference for anal-compulsive English) doesn't like the situation. He says Questions of order between inverted commas and stops are much debated and a writer's personal preference often conflicts with the style rules of editors and publishers. There are two schools of thought, which might be called the conventional and the logical. The conventional prefers to put stops within the inverted commas, if it can be done without ambiguity, on the ground that this has a more pleasing appearance. The logical punctuates according to sense, and puts them outside except when they actually form part of the quotation. [...] In the treatment of question and exclamation marks the systems tend to merge, perhaps because those symbols show up so glaringly the illogicality of the conventional one.
In the following examples, the punctuation is standard under either system: [...] I said 'Am I my brother's keeper?' Did you say 'Am I my brother's keeper'? he then proceeds to say [...] The conventional system flouts common sense, and it is not easy to see what merit it is supposed to have to outweigh that defect; even the more pleasing appearance claimed for it is not likely to go unquestioned. So this argument has been going on for at least 80 years or so. Even the strongest advocates of the conventional system believe that it should yield to clarity when that is necessary, and it quite often is necessary in technical writing. Any sort of heuristics must, I think, either prefer the "logical" form or are regularly going to end up in deep trouble, requiring yet more markup. And, just to make this more exciting, the convention in American English is to use double quotes to set off quotations, using single quotes only within them for nesting, e.g., Tom said "Dick claims 'Harry is incoherent'". while, if one crosses the pond (I suspect either pond, but am not sure and am now curious), single quotes are the preferred form, with the double ones being nested, e.g., Tom said 'Dick claims "Harry is incoherent"'. And, just in case we aren't having enough fun yet, the most-cited style manual for American English (the "Chicago Manual..."), at least in my early edition, claims that the examples given by Fowler many years earlier are incorrect, since they want quotations in running text to start with a lower-case letter even if the original was a complete sentence, e.g., I said "am I my brother's keeper?" Sigh. john Henrik Levkowetz (I think) wrote: >> It looks like the above does not cover the strange English >> quotation rules where you are supposed to write >> >> Sentence ends "here." >> and not >> Sentence ends "here". > > True. I thought of those anomalies, but thought I'd not bring > that into the discussion yet... 
,:-) > >> The same set of rules probably includes things like >> >> Does this sentence end with "foo?" > > Right. > >> but a native speaker should double check that. >> >> I suspect single quotes should be handled the same way. > > Possibly. In this case, I simply don't know. From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sun, 4 Jan 2004 01:01:19 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <3FF73EBF.7020905@gmx.de> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <3FF73EBF.7020905@gmx.de> Message-ID: <20040104010119.69d1ae43.henrik@levkowetz.com> Saturday 3 January 2004, Julian Reschke wrote: > Alex Rousskov wrote: > > > I assume the above already covers endings like "...", "?!", and > > "!.." > > > > It looks like the above does not cover the strange English quotation > > rules where you are supposed to write > > > > Sentence ends "here." > > and not > > Sentence ends "here". > > I'm starting to think that RFCs should be written in Latin :-) Honestly, > I hope that the RFC Editor doesn't enforce *these* rules as well (insert > flame war about America being old-fashioned, the metric system and date > formats here...). Oh, you can rest assured, this will be enforced. Haven't you noticed the third paragraph of a draft, in the "status of this memo" section? Personally, I think this is perverse and misguided if not outright unethical :-) but if the style guide says that the way it should be, we'll probably follow that till an RFC says otherwise or the manual of style is changed. > > The same set of rules probably includes things like > > > > Does this sentence end with "foo?" > > > > but a native speaker should double check that. > > Seems to be another good reason to have manual overrides. I'd certainly > *not* want to put that burden onto the RFC2629 formatting engine. That's not particularly hard. If you can recognise ("." | "?" | "!")(" ") you certainly can recognise ("." | "?" 
| "!")(" " | "\"") > > Something short like would be more likely to be used, > > IMHO. This should cover cases like (not the markup on a different > > line, with indentation) > > > > _____This sentences ends with the following URL: http://foo.org/ > > _____ > > Interesting example. I'd say that a sentence never can end in anything > other than a punctuation mark, and that the above should be written as: > > _____This sentences ends with the following URL: . ????? If you insert markup saying "the sentence ends here", it should shouldn't be conditional on the particular preceding character! Maybe I'm not understanding you correctly here. Henrik From: mcr@sandelman.ottawa.on.ca (Michael Richardson) Date: Sat, 03 Jan 2004 18:41:49 -0500 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: Your message of "Sat, 03 Jan 2004 01:19:17 +0100." <20040103011917.4d57f586.henrik@levkowetz.com> Message-ID: <8169.1073173309@marajade.sandelman.ottawa.on.ca> -----BEGIN PGP SIGNED MESSAGE----- Um. How important are two spaces between sentences to the RFC-editor. We seem to be doing a lot of thinking here. If it is really worth keeping the two spaces, then we should have sentences marked, as well as paragraphs. I.e. ~~Subject Verb Nounce~~ I know how ugly that could be. (Frankly, I wish XML was more like latex, with a blank line between paragraphs!) It there anyway we can go the other way? Less markup rather than more? I say this as someone that liked the HTML 0.9 revision :-) I think many of us do not edit the XML with anything other than Emacs or equivalent. ] ON HUMILITY: to err is human. To moo, bovine. 
| firewalls [ ] Michael Richardson, Xelerance Corporation, Ottawa, ON |net architect[ ] mcr@xelerance.com http://www.sandelman.ottawa.on.ca/mcr/ |device driver[ ] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Finger me for keys iQCVAwUBP/dTO4qHRg3pndX9AQG81wP/f9aNCtI3R/rNK3WPsYJ9ZV55SB62p7T8 3uXtnY12BSjop1AFVmdfeokg6u5GqlqfAnBu0Dg/dj3EFR56NBHFdq9DOd85+O+F mygHvY/RGfCLf3mI3Sd3zcBEcN6aTLrgR3inNZITmjzCZ3coijTC0KCxTOUFq70s WNWOYWm51sA= =seN6 -----END PGP SIGNATURE----- From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 23:41:20 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <3FF73D66.2080706@gmx.de> References: <20040103213016.21c87c2a.henrik@levkowetz.com> <3FF73D66.2080706@gmx.de> Message-ID: <20040103234120.62f72a22.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_23_41_20_+0100__h+2yaaQyagd2uzZ Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > So, here's yet another end of sentence handling proposal. The first > > name I've proposed for the markup element is inspired by the > > element, but that this be the exact name isn't essential: > > > > * "." | "!" | "?" followed by one or more whitespace characters is > > identified as sentence end. (Should any others characters trigger > > this?) > > > > * Forced sentence ending (needed when a sentence ends with an > > abbreviation) : or maybe > > > > * Ignore sentence end (needed after abbreviations ending in "."): > > or maybe > > > > ... > > I'd like to understand why you feel that a simple PI such as (or > ) wouldn't work. Do you feel that PIs in some way are > second-class XML features? I'm sure it would work. I think style-wise it's inappropriate. 
I've been using PI's together with markup in files used to generate parts of my website for something like 4 years, and definitely don't think they're second class. I just 1. think it's not the most appropriate in this case and also 2. can just see how often I'll be writing draft text, working with markup, and write instead of (or more probably ). > In this particular case they seem to me the exactly right thing to use. > RFC2629 doesn't care about marking up sentence boundaries, and all > except one output format (text) doesn't care about it, so this is a > specific processing instruction (in the non-XML meaning) for a specific > formatter. Well, as a draft editor I see it as part of writing the draft, rather than part of indicating how I'd like it generated (which is the case for most of the rfc PIs), so I guess we'll just have to continue to disagree on this point. Henrik --Signature=_Sat__3_Jan_2004_23_41_20_+0100__h+2yaaQyagd2uzZ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/90UQeVhrtTJkXCMRAnHyAKC66LAbWSIkAuET6tX3t2YB85y4UQCdFuMT WW2mRtxCZPp6upk56gRk3Bc= =EYBh -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_23_41_20_+0100__h+2yaaQyagd2uzZ-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 23:14:23 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: References: <20040103213016.21c87c2a.henrik@levkowetz.com> Message-ID: <3FF73EBF.7020905@gmx.de> Alex Rousskov wrote: > I assume the above already covers endings like "...", "?!", and > "!.." > > It looks like the above does not cover the strange English quotation > rules where you are supposed to write > > Sentence ends "here." > and not > Sentence ends "here". I'm starting to think that RFCs should be written in Latin :-) Honestly, I hope that the RFC Editor doesn't enforce *these* rules as well (insert flame war about America being old-fashioned, the metric system and date formats here...). 
> The same set of rules probably includes things like > > Does this sentence end with "foo?" > > but a native speaker should double check that. Seems to be another good reason to have manual overrides. I'd certainly *not* want to put that burden onto the RFC2629 formatting engine. > Something short like would be more likely to be used, > IMHO. This should cover cases like (not the markup on a different > line, with indentation) > > _____This sentences ends with the following URL: http://foo.org/ > _____ Interesting example. I'd say that a sentence never can end in anything other than a punctuation mark, and that the above should be written as: _____This sentences ends with the following URL: . > Should HTML renderer do the same? Should it insert sentence markup > (e.g., sentence.) so that users can > control spacing with CSS? In rfc2629.xslt I'd probably add an empty span element that can be made visible for debugging purposes. > Thanks guys for converging on a solution before killing each other! Still working on it. In the end Marshall needs to decide whether he wants to implement all of that, and how to markup the special cases. -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 23:08:38 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040103213016.21c87c2a.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> Message-ID: <3FF73D66.2080706@gmx.de> Henrik Levkowetz wrote: > So, here's yet another end of sentence handling proposal. The first > name I've proposed for the markup element is inspired by the > element, but that this be the exact name isn't essential: > > * "." | "!" | "?" followed by one or more whitespace characters is > identified as sentence end. (Should any others characters trigger > this?) 
> > * Forced sentence ending (needed when a sentence ends with an > abbreviation) : or maybe > > * Ignore sentence end (needed after abbreviations ending in "."): > or maybe > > ... I'd like to understand why you feel that a simple PI such as (or ) wouldn't work. Do you feel that PIs in some way are second-class XML features? In this particular case they seem to me the exactly right thing to use. RFC2629 doesn't care about marking up sentence boundaries, and all except one output format (text) doesn't care about it, so this is a specific processing instruction (in the non-XML meaning) for a specific formatter. Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 23:05:41 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103212822.240867e3.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> <3FF6F3D1.6040505@gmx.de> <20040103191152.6dc6128a.henrik@levkowetz.com> <3FF70E2E.9070209@gmx.de> <20040103212822.240867e3.henrik@levkowetz.com> Message-ID: <3FF73CB5.2070607@gmx.de> Henrik Levkowetz wrote: >>Sure. For instance (using "_" instead of SP), in >> >> >>___Foo_bar. >>___Foo_bar. >> >> >>I'd like to see xml2rfc to produce: >> >>Foo_bar.__Foo_bar. 
>> >>Not: >> >>___Foo_bar.____Foo_bar. > > > Thanks for the example. For the end-of-sentence part I agree, for > any other case there should be no change to the current processing. Correct. That *wouldn't' be a change. I'd just like that to be clarified (multiple whitespace characters in text content are treated as a single one). >>Thus >> >>- we need to describe the rules that xml2rfc can use to detect sentence >>endings to do "the right thing", and >> >>- possibly we need explicit "instructions" by which the default rules >>can be overridden (elements, PIs or special characters have been discussed). >> >>Of course it makes sense to minimize the number of these special cases, >>therefore the idea of making abbreviations explicit. That's useful >>anyway, because processors may be able to take advantage of that when >>producing other outputs, such as HTML. > > > Could be so, sure. But this is an addition to what we set out to solve. Agreed. We can discuss this independantly. > ... -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 22:55:21 +0100 Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: References: <20040103213016.21c87c2a.henrik@levkowetz.com> Message-ID: <20040103225521.22c90987.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_22_55_21_+0100_YMuyO5q+hlG..VjM Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Alex Rousskov wrote: > > On Sat, 3 Jan 2004, Henrik Levkowetz wrote: > > > So, here's yet another end of sentence handling proposal. The first > > name I've proposed for the markup element is inspired by the > /> element, but that this be the exact name isn't essential: > > > > * "." | "!" | "?" followed by one or more whitespace characters is > > identified as sentence end. (Should any others characters trigger > > this?) 
> > I assume the above already covers endings like "...", "?!", and > "!.." Right. > It looks like the above does not cover the strange English quotation > rules where you are supposed to write > > Sentence ends "here." > and not > Sentence ends "here". True. I thought of those anomalies, but thought I'd not bring that into the discussion yet... ,:-) > The same set of rules probably includes things like > > Does this sentence end with "foo?" Right. > but a native speaker should double check that. > > I suspect single quotes should be handled the same way. Possibly. In this case, I simply don't know. > > * Forced sentence ending (needed when a sentence ends with an > > abbreviation) : or maybe > > Something short like would be more likely to be used, > IMHO. This should cover cases like (note the markup on a different > line, with indentation) > > _____This sentences ends with the following URL: http://foo.org/ > _____ Yes. > > * Ignore sentence end (needed after abbreviations ending in "."): > > or maybe > > > > * Sentence end is rendered by the .txt renderer as "." SP SP > > Should HTML renderer do the same? Should it insert sentence markup > (e.g., sentence.) so that users can > control spacing with CSS? I don't think so, but that's not a strong opinion. > Thanks guys for converging on a solution before killing each other! 
We were close, weren't we ,,:-) Henrik --Signature=_Sat__3_Jan_2004_22_55_21_+0100_YMuyO5q+hlG..VjM Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9zpJeVhrtTJkXCMRAjhFAJ0Y7Gb/3lgQNXcdirduXwdVWaQQ+ACfaK4u XGQvQ7A66ojIhqH+whxhfyY= =RnN5 -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_22_55_21_+0100_YMuyO5q+hlG..VjM-- From: rousskov@measurement-factory.com (Alex Rousskov) Date: Sat, 3 Jan 2004 14:38:21 -0700 (MST) Subject: [xml2rfc] Another end of sentence handling proposal In-Reply-To: <20040103213016.21c87c2a.henrik@levkowetz.com> References: <20040103213016.21c87c2a.henrik@levkowetz.com> Message-ID: On Sat, 3 Jan 2004, Henrik Levkowetz wrote: > So, here's yet another end of sentence handling proposal. The first > name I've proposed for the markup element is inspired by the /> element, but that this be the exact name isn't essential: > > * "." | "!" | "?" followed by one or more whitespace characters is > identified as sentence end. (Should any others characters trigger > this?) I assume the above already covers endings like "...", "?!", and "!.." It looks like the above does not cover the strange English quotation rules where you are supposed to write Sentence ends "here." and not Sentence ends "here". The same set of rules probably includes things like Does this sentence end with "foo?" but a native speaker should double check that. I suspect single quotes should be handled the same way. > * Forced sentence ending (needed when a sentence ends with an > abbreviation) : or maybe Something short like would be more likely to be used, IMHO. This should cover cases like (not the markup on a different line, with indentation) _____This sentences ends with the following URL: http://foo.org/ _____ > * Ignore sentence end (needed after abbreviations ending in "."): > or maybe > > * Sentence end is rendered by the .txt renderer as "." SP SP Should HTML renderer do the same? 
Should it insert sentence markup (e.g., sentence.) so that users can control spacing with CSS? Thanks guys for converging on a solution before killing each other! Alex. From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 21:30:16 +0100 Subject: [xml2rfc] Another end of sentence handling proposal Message-ID: <20040103213016.21c87c2a.henrik@levkowetz.com> So, here's yet another end of sentence handling proposal. The first name I've proposed for the markup element is inspired by the element, but that this be the exact name isn't essential: * "." | "!" | "?" followed by one or more whitespace characters is identified as sentence end. (Should any others characters trigger this?) * Forced sentence ending (needed when a sentence ends with an abbreviation) : or maybe * Ignore sentence end (needed after abbreviations ending in "."): or maybe * Sentence end is rendered by the .txt renderer as "." SP SP Henrik From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 21:28:22 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <3FF70E2E.9070209@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> <3FF6F3D1.6040505@gmx.de> <20040103191152.6dc6128a.henrik@levkowetz.com> <3FF70E2E.9070209@gmx.de> Message-ID: <20040103212822.240867e3.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_21_28_22_+0100_zSS7l1U_cGfF2/++ Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > >>I'm not sure why it's relevant whether it has the same name in HTML. In > >>fact, if it means the same thing, I'd *prefer* it to use the same name. > >>But that's not really important. > > > > > > As we're defining markup for xml2rfc and related processors here, we don't > > necessarily want to bind this element to have exactly the same semantics > > as . > > I didn't say that I *necessarily* want the same syntax. I just find it > pointless to avoid identical names. RFC2629 is not HTML, and should > there ever be a need to mix both vocabularies, this should be done using > XML namespaces. > > BTW: DocBook has an element > () which -- surprise -- > seems to have the same semantics. Well, I'd be happy to leave this up to Marshall, if he chooses to incorporate something like this. > >>> * "." | "!" | "?" 
followed by whitespace is identified as sentence end. > >>> (any others?) > >> > >>Plus: > >> > >>* Sequences of multiple whitespace characters inside text (t, spanx, > >>annotation...) are treated as a single character. > > > > > > Could you give an example of what you mean? > > Sure. For instance (using "_" instead of SP), in > > > ___Foo_bar. > ___Foo_bar. > > > I'd like to see xml2rfc to produce: > > Foo_bar.__Foo_bar. > > Not: > > ___Foo_bar.____Foo_bar. Thanks for the example. For the end-of-sentence part I agree, for any other case there should be no change to the current processing. > >>Well. In fact they must be left without markup, unless no sentence end > >>will be detected (at least as far as I understand the model you're > >>proposing). Obviously (not being able to use because it occured > >>at a sentence end) that would be a bad thing. > > > > > > The purpose of using would be to avoid interpreting "." SP > > as sentence end. So if you don't need to avoid it because you _have_ > > sentence end, you're fine, no? Unless you want to give > > additional semantics, of course. > > Sure. That's the whole point. If we introduce a new element to indicate > "this is an abbreviation" it needs to work everywhere. I was just > proposing it because in many cases it would be enough for disambiguating > the types of dots. > > The issue we need to solve is caused by the fact that RFC2629 (like > almost all other vocabularies, btw) does not specifically markup > sentence boundaries. This seems to be a non-issue, unless you have > indeed to produce monospaced output and are stuck with specific > formatting requirements. > > It seems that we all agree that > > - we don't want to break existing files, Agreed > - we don't want to do any additional typing unless it's completely > unavoidable. 
Agreed > Thus > > - we need to describe the rules that xml2rfc can use to detect sentence > endings to do "the right thing", and > > - possibly we need explicit "instructions" by which the default rules > can be overridden (elements, PIs or special characters have been discussed). > > Of course it makes sense to minimize the number of these special cases, > therefore the idea of making abbreviations explicit. That's useful > anyway, because processors may be able to take advantage of that when > producing other outputs, such as HTML. Could be so, sure. But this is an addition to what we set out to solve. > >>> * Sentence end is rendered by the .txt renderer as "." SP SP > >> > >>That still leaves the issue open how to handle things like. > >> > >> First sentence. Second sentence. > >> > >>I think in this case the processor should be able to determine that > >>there was in fact a sentence ending here. > > OK, so we finally seem to converge on the same plan. > > However, I think we haven't fully described that algorithm yet. I'll > have to do some more experimentation with example texts to see which > cases we haven't considered. > > Generally I think we'll need overrides for both cases (sentence end not > detected, or sentence end detected where there wasn't any). These > situations should be rare, and I think in these cases special PIs are > the least intrusive solution. 1. If you want to provide both overrides, we can dispense with the element; that is superfluous in this case, and becomes a separate proposition. 2. I still think markup rather than PIs is more appropriate for this. So, if we can't agree to let be the override, and used to override sentence end is too hard to understand, I'll be posting another proposal. 
Henrik --Signature=_Sat__3_Jan_2004_21_28_22_+0100_zSS7l1U_cGfF2/++ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9yXmeVhrtTJkXCMRAthaAJ9phXAPFVEsqaTEodniBI4Cdf08QQCeIK00 6RuMhZQYKmTyMaf2fwS6Tdw= =Z1JE -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_21_28_22_+0100_zSS7l1U_cGfF2/++-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 19:47:10 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103191152.6dc6128a.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> <3FF6F3D1.6040505@gmx.de> <20040103191152.6dc6128a.henrik@levkowetz.com> Message-ID: <3FF70E2E.9070209@gmx.de> Henrik Levkowetz wrote: >>I'm not sure why it's relevant whether it has the same name in HTML. In >>fact, if it means the same thing, I'd *prefer* it to use the same name. >>But that's not really important. > > > As we're defining markup for xml2rfc and related processors here, we don't > necessarily want to bind this element to have exactly the same semantics > as . I didn't say that I *necessarily* want the same syntax. I just find it pointless to avoid identical names. RFC2629 is not HTML, and should there ever be a need to mix both vocabularies, this should be done using XML namespaces. 
BTW: DocBook has an element () which -- surprise -- seems to have the same semantics. >>> * "." | "!" | "?" followed by whitespace is identified as sentence end. >>> (any others?) >> >>Plus: >> >>* Sequences of multiple whitespace characters inside text (t, spanx, >>annotation...) are treated as a single character. > > > Could you give an example of what you mean? Sure. For instance (using "_" instead of SP), in ___Foo_bar. ___Foo_bar. I'd like to see xml2rfc to produce: Foo_bar.__Foo_bar. Not: ___Foo_bar.____Foo_bar. >>Well. In fact they must be left without markup, unless no sentence end >>will be detected (at least as far as I understand the model you're >>proposing). Obviously (not being able to use because it occured >>at a sentence end) that would be a bad thing. > > > The purpose of using would be to avoid interpreting "." SP > as sentence end. So if you don't need to avoid it because you _have_ > sentence end, you're fine, no? Unless you want to give > additional semantics, of course. Sure. That's the whole point. If we introduce a new element to indicate "this is an abbreviation" it needs to work everywhere. I was just proposing it because in many cases it would be enough for disambiguating the types of dots. The issue we need to solve is caused by the fact that RFC2629 (like almost all other vocabularies, btw) does not specifically markup sentence boundaries. This seems to be a non-issue, unless you have indeed to produce monospaced output and are stuck with specific formatting requirements. It seems that we all agree that - we don't want to break existing files, - we don't want to do any additional typing unless it's completely unavoidable. Thus - we need to describe the rules that xml2rfc can use to detect sentence endings to do "the right thing", and - possibly we need explicit "instructions" by which the default rules can be overridden (elements, PIs or special characters have been discussed). 
Of course it makes sense to minimize the number of these special cases, therefore the idea of making abbreviations explicit. That's useful anyway, because processors may be able to take advantage of that when producing other outputs, such as HTML. >>> * Sentence end is rendered by the .txt renderer as "." SP SP >> >>That still leaves the issue open how to handle things like. >> >> First sentence. Second sentence. >> >>I think in this case the processor should be able to determine that >>there was in fact a sentence ending here. OK, so we finally seem to converge on the same plan. However, I think we haven't fully described that algorithm yet. I'll have to do some more experimentation with example texts to see which cases we haven't considered. Generally I think we'll need overrides for both cases (sentence end not detected, or sentence end detected where there wasn't any). These situations should be rare, and I think in these cases special PIs are the least intrusive solution. Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 19:11:52 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <3FF6F3D1.6040505@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> <3FF6F3D1.6040505@gmx.de> Message-ID: <20040103191152.6dc6128a.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_19_11_52_+0100_7TlsSP6MTU8af=6K Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > > In that case, let's use this, except maybe call it since html > > has an element. Seems workable to me, it is simple and when I > > think through the different cases it seems they can all be covered by > > this. > > I'm not sure why it's relevant whether it has the same name in HTML. In > fact, if it means the same thing, I'd *prefer* it to use the same name. > But that's not really important. As we're defining markup for xml2rfc and related processors here, we don't necessarily want to bind this element to have exactly the same semantics as . > > * "." | "!" | "?" followed by whitespace is identified as sentence end. > > (any others?) > > Plus: > > * Sequences of multiple whitespace characters inside text (t, spanx, > annotation...) are treated as a single character. Could you give an example of what you mean? > > * To avoid abbreviations triggering this, they may be enclosed in > > . 
Abbreviations ending a sentence can be left > > without markup. > > Well. In fact they must be left without markup, unless no sentence end > will be detected (at least as far as I understand the model you're > proposing). Obviously (not being able to use because it occured > at a sentence end) that would be a bad thing. The purpose of using would be to avoid interpreting "." SP as sentence end. So if you don't need to avoid it because you _have_ sentence end, you're fine, no? Unless you want to give additional semantics, of course. > > * Sentence end is rendered by the .txt renderer as "." SP SP > > That still leaves the issue open how to handle things like. > > First sentence. Second sentence. > > I think in this case the processor should be able to determine that > there was in fact a sentence ending here. Fine. Henrik --Signature=_Sat__3_Jan_2004_19_11_52_+0100_7TlsSP6MTU8af=6K Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9wXoeVhrtTJkXCMRAv4JAKDnPhcRehOCFPeBjW/A4PZ5iNNaDACg5RnR t3zpEV7sHR+HZqd+qwTqOw0= =XcsI -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_19_11_52_+0100_7TlsSP6MTU8af=6K-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 17:54:41 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <20040103171959.18366810.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> <20040103171959.18366810.henrik@levkowetz.com> Message-ID: <3FF6F3D1.6040505@gmx.de> Henrik Levkowetz wrote: > In that case, let's use this, except maybe call it since html > has an element. Seems workable to me, it is simple and when I > think through the different cases it seems they can all be covered by > this. I'm not sure why it's relevant whether it has the same name in HTML. In fact, if it means the same thing, I'd *prefer* it to use the same name. But that's not really important. > * "." | "!" | "?" followed by whitespace is identified as sentence end. > (any others?) Plus: * Sequences of multiple whitespace characters inside text (t, spanx, annotation...) are treated as a single character. > * To avoid abbreviations triggering this, they may be enclosed in > . Abbreviations ending a sentence can be left > without markup. Well. In fact they must be left without markup, unless no sentence end will be detected (at least as far as I understand the model you're proposing). Obviously (not being able to use because it occured at a sentence end) that would be a bad thing. > * Sentence end is rendered by the .txt renderer as "." SP SP That still leaves the issue open how to handle things like. First sentence. Second sentence. 
I think in this case the processor should be able to determine that there was in fact a sentence ending here. Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 17:19:59 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <3FF6CE94.5060805@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> <3FF6CE94.5060805@gmx.de> Message-ID: <20040103171959.18366810.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_17_19_59_+0100_=ScbyYpJMe_xLM4. Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > > I see you snipped the e.g. proposal. > > Because we agreed. Why would I repeat everything? In that case, let's use this, except maybe call it since html has an element. Seems workable to me, it is simple and when I think through the different cases it seems they can all be covered by this. * "." | "!" | "?" followed by whitespace is identified as sentence end. (any others?) * To avoid abbreviations triggering this, they may be enclosed in . Abbreviations ending a sentence can be left without markup. * Sentence end is rendered by the .txt renderer as "." SP SP Henrik --Signature=_Sat__3_Jan_2004_17_19_59_+0100_=ScbyYpJMe_xLM4. 
Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9uuveVhrtTJkXCMRAreKAJ9q3d+GoPPg21Tdahe2ov9OjHdH/QCfR45w nMZDJOLx8C08Vfpklv1AAl8= =UpBV -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_17_19_59_+0100_=ScbyYpJMe_xLM4.-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 15:15:48 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103145204.18880b92.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> <20040103145204.18880b92.henrik@levkowetz.com> Message-ID: <3FF6CE94.5060805@gmx.de> Henrik Levkowetz wrote: >>I don't understand this statement. The raw text emitted for any sequence >>of xml2rfc markup is (more or less) well-defined, so it's possible to >>state whether the a specific element will cause text ending with >>punctuation to be output. So yes, this is possible. It's just non-trivial. > > > Well, think about it a bit more. The whole reason for this discussion > is exactly that the pure text does not contain enough information to > consistently determine whether you have an end of sentence or not. If > you look at the text produced by the markup it's too late, you've lost > too much information. I didn't say that. What I said is that you can *also* look at that text, this way you won't have lost that information. 
For instance, for any given text node (XPath-wise), I can determine the preceding node and decide whether that is rendered with a trailing punctuation mark. > I see you snipped the e.g. proposal. Because we agreed. Why would I repeat everything? > ... >>That's correct, and this is why I'd prefer either a P.I. or a Unicode >>character that doesn't have exactly the NBSP semantics. For instance, we >>could just use one from the private-use area (that's what it's for). > > > Except that what we want is really markup that will save us from the > need to put Da da da. around each sentence. It's > not strictly a character we want, and I'm doubtful whether it's a > processing instruction. Please stop making wrong assumptions about what I was suggesting. Neither a specific control character nor a P.I. forces you to mark up sentences with start/end tags, nor did anybody suggest that you need to mark up *all* sentence endings. Maybe all the confusion is caused by you not having understood what Clive and I are suggesting? To summarize: I'd like the processor to - ignore excess whitespace in elements (just like it should unless xml:space is "preserve") - do a best-effort guess about when a punctuation-mark-followed-by-white-space sequence is a sentence end and in particular make that algorithm smart enough to handle cases like the spanx example - have explicit syntax (P.I. or special character) to indicate a) "yes, this is a sentence end although it may not look like it" and/or b) "no, this isn't, although it does look like it" (we may need both a) and b) to cover all cases) Regards, Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 14:52:04 +0100 Subject: [xml2rfc] end of sentence: two spaces?
In-Reply-To: <3FF6C463.9070702@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> <3FF6C463.9070702@gmx.de> Message-ID: <20040103145204.18880b92.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_14_52_04_+0100_Cs6vJH4GZMZ0A4a9 Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > > ... > >>However, from a pure markup point of view, only the text produced by the > >>markup (ideally identical to the text *content* of the markup) should count. > > > > No. As long as we want to detect sentence end _withouth having markup for it_ > > this is not strictly possible. We are effectively saying that a certain form > > of text input has markup implications. We can't have the cake and eat it too. > > I don't understand this statement. The raw text emitted for any sequence > of xml2rfc markup is (more or less) well-defined, so it's possible to > state whether the a specific element will cause text ending with > punctuation to be output. So yes, this is possible. It's just non-trivial. Well, think about it a bit more. The whole reason for this discussion is exactly that the pure text does not contain enough information to consistently determine whether you have an end of sentence or not. If you look at the text produced by the markup it's too late, you've lost too much information. > > ... 
> >>While the iref example can simply be rewritten, the spanx example can't. > >>So *if* we choose that as a solution, there should be a way to enforce a > >>sentence end no matter what the surrounding markup is. > > > > > > Agreed. > > Great. > I see you snipped the e.g. proposal. > >>>That's OK; we have different opinions then. I think would be > >>>better because it would have absolutely no added implications, while > >>>". Xxxx" or ". Xxxx" would have some. > >> > >>Such as...? > > > > > > Sigh, do you really need this spelled out? > > Yes, because if you say "has implications" without saying which, it's > impossible to address them. Some things are sufficiently obvious that it breaks up a discussion into details to go into them. I'm pretty sure you could have written the two paragraphs below yourself. > > ". Xxxx" : > > You have binding together a "." and a " ", which would render > > as "." SP SP inside a line, and prevent "." from coming in the last > > text position on a line, as and " " would be bound to it, > > with possibly changed line breaking as a result. > > > > ". Xxxx" : > > Again, you'd get changed line breaking as a line break between "." > > and "X" would not be permitted. > > > > Essentially, with this it would not be possible to produce an > > abbreviation followed by one space, then more text, which would permit > > line breaking at that space. > > That's correct, and this is why I'd prefer either a P.I. or a Unicode > character that doesn't have exactly the NBSP semantics. For instance, we > could just use one from the private-use area (that's what it's for). Except that what we want is really markup that will save us from the need to put Da da da. around each sentence. It's not strictly a character we want, and I'm doubtful whether it's a processing instruction.
Henrik --Signature=_Sat__3_Jan_2004_14_52_04_+0100_Cs6vJH4GZMZ0A4a9 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9skFeVhrtTJkXCMRAg7LAKCaiG2MO6cbZ6iVmEjgp/8IMGlH2QCgpAcq 1AcpFAw3X3Y5UO2AoTOFzD8= =hKFd -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_14_52_04_+0100_Cs6vJH4GZMZ0A4a9-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 14:32:19 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103142444.24bded48.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> <20040103142444.24bded48.henrik@levkowetz.com> Message-ID: <3FF6C463.9070702@gmx.de> Henrik Levkowetz wrote: > ... >>However, from a pure markup point of view, only the text produced by the >>markup (ideally identical to the text *content* of the markup) should count. > > > No. As long as we want to detect sentence end _withouth having markup for it_ > this is not strictly possible. We are effectively saying that a certain form > of text input has markup implications. We can't have the cake and eat it too. I don't understand this statement. The raw text emitted for any sequence of xml2rfc markup is (more or less) well-defined, so it's possible to state whether the a specific element will cause text ending with punctuation to be output. So yes, this is possible. It's just non-trivial. > ... >>While the iref example can simply be rewritten, the spanx example can't. 
>>So *if* we choose that as a solution, there should be a way to enforce a >>sentence end no matter what the surrounding markup is. > > > Agreed. Great. > ... >>>That's OK; we have different opinions then. I think would be >>>better because it would have absolutely no added implications, while >>>". Xxxx" or ". Xxxx" would have some. >> >>Such as...? > > > Sigh, do you really need this spelled out? Yes, because if you say "has implications" without saying which, it's impossible to address them. > ". Xxxx" : > You have binding together a "." and a " ", which would render > as "." SP SP inside a line, and prevent "." from coming in the last > text position on a line, as and " " would be bound to it, > with possibly changed line breaking as a result. > > ". Xxxx" : > Again, you'd get changed line breaking as a line break between "." > and "X" would not be permitted. > > Essentially, with this it would not be possible to produce an > abbreviation followed by one space, then more text, which would permit > line breaking at that space. That's correct, and this is why I'd prefer either a P.I. or a Unicode character that doesn't have exactly the NBSP semantics. For instance, we could just use one from the private-use area (that's what it's for). Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 14:24:44 +0100 Subject: [xml2rfc] end of sentence: two spaces?
In-Reply-To: <3FF6B71D.9080007@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> <3FF6B71D.9080007@gmx.de> Message-ID: <20040103142444.24bded48.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_14_24_44_+0100_kQSO9hCZQM35AQjB Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > ... > >>If so, what about: > >> > >> need a mumble. RFC 1234 says this. > > > > > > Here you have "." followed by whitespace. So we have a end of sentence. > > > > > >>or > >> > >> need a mumble. Meanwhile, we see > > > > > > Here we don't have "." followed by whitespace. This will not generate "." SP SP > > And this seems to be problematic. No, it's straightforward. > What we're doing here is letting the detection of a sentence end rely on > the fact that neither the end of the previous sentence > > I don't like that. So foo bar... > > nor the space between the sentences (as in the iref example) contains > markup (so from an XML point of view the end of the previous sentence > and the start of the next sentence are in a single piece of running text). Yes. > However, from a pure markup point of view, only the text produced by the > markup (ideally identical to the text *content* of the markup) should count. No. As long as we want to detect sentence end _withouth having markup for it_ this is not strictly possible. We are effectively saying that a certain form of text input has markup implications. 
We can't have the cake and eat it too. > So > > need a mumble. Meanwhile, we > > would be a sentence end (because the iref expands to zero text) and > > I don't like that. So foo bar... > > as well (as the spanx has text content expanding to something ending in > a dot, and the subsequent text starts with whitespace). > > While the iref example can simply be rewritten, the spanx example can't. > So *if* we choose that as a solution, there should be a way to enforce a > sentence end no matter what the surrounding markup is. Agreed. > I just spent some time trying sentence end detection in XSLT, and here are > some more things to consider: > > - Sentences may also stop with other punctuation (such as !). Yes, that should be taken into consideration as well. > - If the main issue is not to assume sentence ends because of previous > abbreviations, the best solution may be to add explicit markup for that > situation, such as: e.g.. However, there'd still be an issue > if the abbreviation would indeed be a sentence end. e.g., and an explicit sentence end markup would be OK with me. > > ... > > That's OK; we have different opinions then. I think would be > > better because it would have absolutely no added implications, while > > ". Xxxx" or ". Xxxx" would have some. > > Such as...? Sigh, do you really need this spelled out? ". Xxxx" : You have binding together a "." and a " ", which would render as "." SP SP inside a line, and prevent "." from coming in the last text position on a line, as and " " would be bound to it, with possibly changed line breaking as a result. ". Xxxx" : Again, you'd get changed line breaking as a line break between "." and "X" would not be permitted. Essentially, with this it would not be possible to produce an abbreviation followed by one space, then more text, which would permit line breaking at that space.
Henrik --Signature=_Sat__3_Jan_2004_14_24_44_+0100_kQSO9hCZQM35AQjB Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9sKceVhrtTJkXCMRAqmzAKDTTo6nv3nHELGFGkPZDXSdyKnkjQCg5B97 r6JMBVp/yqCl/5FR+QNyKyc= =6ljh -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_14_24_44_+0100_kQSO9hCZQM35AQjB-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 13:35:41 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040103002408.381f6c59.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> <20040103002408.381f6c59.henrik@levkowetz.com> Message-ID: <3FF6B71D.9080007@gmx.de> Henrik Levkowetz wrote: > ... >>If so, what about: >> >> need a mumble. RFC 1234 says this. > > > Here you have "." followed by whitespace. So we have a end of sentence. > > >>or >> >> need a mumble. Meanwhile, we see > > > Here we don't have "." followed by whitespace. This will not generate "." SP SP And this seems to be problematic. What we're doing here is letting the detection of a sentence end rely on the fact that neither the end of the previous sentence I don't like that. So foo bar... nor the space between the sentences (as in the iref example) contains markup (so from an XML point of view the end of the previous sentence and the start of the next sentence are in a single piece of running text). However, from a pure markup point of view, only the text produced by the markup (ideally identical to the text *content* of the markup) should count. So need a mumble. 
Meanwhile, we would be a sentence end (because the iref expands to zero text) and I don't like that. So foo bar... as well (as the spanx has text content expanding to something ending in a dot, and the subsequent text starts with whitespace). While the iref example can simply be rewritten, the spanx example can't. So *if* we choose that as a solution, there should be a way to enforce a sentence end no matter what the surrounding markup is. I just spent some time trying sentence end detection in XSLT, and here are some more things to consider: - Sentences may also stop with other punctuation (such as !). - If the main issue is not to assume sentence ends because of previous abbreviations, the best solution may be to add explicit markup for that situation, such as: e.g.. However, there'd still be an issue if the abbreviation would indeed be a sentence end. > ... > That's OK; we have different opinions then. I think would be > better because it would have absolutely no added implications, while > ". Xxxx" or ". Xxxx" would have some. Such as...? -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 01:19:17 +0100 Subject: [xml2rfc] end of sentence: two spaces?
In-Reply-To: <3FF5F9E1.8070008@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> <3FF5AFC4.4060906@gmx.de> <20040102233932.7a95820f.henrik@levkowetz.com> <3FF5F9E1.8070008@gmx.de> Message-ID: <20040103011917.4d57f586.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_01_19_17_+0100_zAxrsjBca+h8I2_A Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Saturday 3 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: > > Friday 2 January 2004, Julian Reschke wrote: > > > >>Anyway, I'm not going to support a grammar element that is defined to > >>have "no" meaning except for a strange side effect... > > > > Very well, don't support it. The side effect is analogous to that of > > using zeroes when writing the number 1000. The contribution of each > > zero is in its placement, but it neither adds nor subtracts from the > > total value by itself. Very strange, I'm sure. > > I fully understand the concept. I just feel that it doesn't have any > place in designing markup vocabularies. So we disagree. > If you really think we should proceed that way, please make a complete > proposal that fully explains exactly when the processor should add the > additional space. Keep in mind that -- as you say that has in > fact no semantics -- you'll have to define that for any combination of > legal markup inside the element (remember Clive's questions)? Julian, I've clarified what semantics has, not said that it has no semantics. Please. Supposing you have an input symbol stream in which "." 
is followed immediately by WSP, or contrarily one in which "." is followed immediately by a symbol and then WSP, or any other tag and then WSP. Throughout lexing and parsing of the symbol stream, you won't gratuitously throw away information until the point where you make decisions about the output rendering; but at that point the transformation probably becomes irreversible. Identifying the symbol (or token) sequence "." immediately followed by WSP as end of sentence can be done at any point before or at the point where you reduce the information content (by doing the rendering or otherwise). If the token sequence has anything but "." preceding WSP, you don't have end of sentence. If you have end of sentence, you insert an end-of-sentence (EOS) token in your token stream at this point. Subsequently, if you come across a token at the point of output rendering, that token in no way changes the state of the renderer, but is discarded. If you come across some other token, you render it according to its specification. If you come across an end-of-sentence token, you render it as SP (which makes "." WSP EOS come out as "." SP SP) in an RFC txt renderer. A renderer of some other format may render the EOS token in some other, appropriate manner. (Of course, the detailed mechanics of this can be done differently. You could replace the "." WSP token sequence by a single EOS, which is subsequently rendered as "." SP SP; or variations on this theme.) > I'll make a similar proposal for a special character code and/or PI > based solution (both would work almost identically) in the next few days. Sure, go ahead. Henrik.
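The token-stream scheme described in the message above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the xml2rfc implementation: the token kinds, the function names, and the `<nop/>` element used in the example are all hypothetical (the thread never fixes a name for the null element), and for simplicity every tag is treated as rendering nothing.

```python
import re

# Hypothetical token kinds, invented for this sketch.
# Assumes well-formed input: a stray "<" with no closing ">" is skipped.
TOKEN_RE = re.compile(
    r"(?P<TAG><[^>]*>)|(?P<DOT>\.)|(?P<WSP>\s+)|(?P<TEXT>[^<.\s]+)"
)

def tokenize(source):
    """Split the input into (kind, value) tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

def mark_sentence_ends(tokens):
    """Insert an EOS token after each WSP that immediately follows a "."."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok[0] == "WSP" and i > 0 and tokens[i - 1][0] == "DOT":
            out.append(("EOS", ""))
    return out

def render_txt(tokens):
    """Render for plain-text output: WSP is one SP, EOS adds one more SP."""
    parts = []
    for kind, value in tokens:
        if kind == "TAG":
            continue            # simplification: tags render nothing here
        if kind in ("WSP", "EOS"):
            parts.append(" ")   # "." WSP EOS comes out as "." SP SP
        else:
            parts.append(value)
    return "".join(parts)

example = "We need a mumble. Meanwhile, i.e.<nop/> this."
print(render_txt(mark_sentence_ends(tokenize(example))))
```

In this sketch the tag after "i.e." separates the "." from the WSP, so no EOS token is inserted there, while the "." WSP after "mumble" is widened to two spaces.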
--Signature=_Sat__3_Jan_2004_01_19_17_+0100_zAxrsjBca+h8I2_A Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9gqFeVhrtTJkXCMRAmYdAJ94txy4kqgg5xDoy5Frzd05t+/qoACeJTew wb7yg6WljPnoG/ebz+ynIYQ= =qt+B -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_01_19_17_+0100_zAxrsjBca+h8I2_A-- From: henrik@levkowetz.com (Henrik Levkowetz) Date: Sat, 3 Jan 2004 00:24:08 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040102144320.GJ75652@finch-staff-1.thus.net> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <20040102144320.GJ75652@finch-staff-1.thus.net> Message-ID: <20040103002408.381f6c59.henrik@levkowetz.com> --Signature=_Sat__3_Jan_2004_00_24_08_+0100_4R6M58x3aGhysi7. Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Friday 2 January 2004, Clive D.W. Feather wrote: > Henrik Levkowetz said: > >> But what does the null directive *mean*? > > The element would mean nothing. A markup element being present, but > > not in itself causing anything to be rendered. But by being there, you > > wouldn't have "." WSP, but "." WSP, which would avoid triggering > > the rendering of "." WSP as ". " > > So you're suggesting that any directive after a full stop would have this > effect? Yes. > If so, what about: > > need a mumble. RFC 1234 says this. Here you have "." followed by whitespace. So we have an end of sentence. > or > > need a mumble. Meanwhile, we see Here we don't have "." followed by whitespace. This will not generate "."
SP SP > If not, you need to decide which ones have this effect and which don't. And > you need to give specific semantics. I'd propose no exceptions, no additional rules, as indicated above. The semantics would be indeed to add nothing, subtract nothing, the only effect being that the sequence of symbols in the input stream ". " would not have "." immediately before " ". > > It would be similar to the 'nop' processing instruction, which sometimes > > is useful in programming. > > But it isn't a NOP. No, it would be a > I still think that instead of the space is the right approach. That's OK; we have different opinions then. I think would be better because it would have absolutely no added implications, while ". Xxxx" or ". Xxxx" would have some. Henrik --Signature=_Sat__3_Jan_2004_00_24_08_+0100_4R6M58x3aGhysi7. Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9f2ZeVhrtTJkXCMRAnMNAJwKxokrrrqE4J5tt6DAz8c/xRG6FACg6bgL PfjUOowQMwzhMe6NMQNx1Kc= =IJzb -----END PGP SIGNATURE----- --Signature=_Sat__3_Jan_2004_00_24_08_+0100_4R6M58x3aGhysi7.-- From: julian.reschke@gmx.de (Julian Reschke) Date: Sat, 03 Jan 2004 00:08:17 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <20040102233932.7a95820f.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> <3FF5AFC4.4060906@gmx.de> <20040102233932.7a95820f.henrik@levkowetz.com> Message-ID: <3FF5F9E1.8070008@gmx.de> Henrik Levkowetz wrote: > Friday 2 January 2004, Julian Reschke wrote: > >>Anyway, I'm not going to support a grammar element that is defined to >>have "no" meaning except for a strange side effect... > > > Very well, don't support it. The side effect is analogous to that of > using zeroes when writing the number 1000. The contribution of each > zero is in its placement, but it neither adds nor subtracts from the > total value by itself. Very strange, I'm sure. I fully understand the concept. I just feel that it doesn't have any place in designing markup vocabularies. If you really think we should proceed that way, please make a complete proposal that fully explains exactly when the processor should add the additional space. Keep in mind that -- as you say that has in fact no semantics -- you'll have to define that for any combination of legal markup inside the element (remember Clive's questions)? I'll make a similar proposal for a special character code and/or PI based solution (both would work almost identically) in the next few days. Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Fri, 2 Jan 2004 23:39:32 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <3FF5AFC4.4060906@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> <3FF5AFC4.4060906@gmx.de> Message-ID: <20040102233932.7a95820f.henrik@levkowetz.com> --Signature=_Fri__2_Jan_2004_23_39_32_+0100_Jch5_EiKBBhMT029 Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Friday 2 January 2004, Julian Reschke wrote: > Anyway, I'm not going to support a grammar element that is defined to > have "no" meaning except for a strange side effect... Very well, don't support it. The side effect is analogous to that of using zeroes when writing the number 1000. The contribution of each zero is in its placement, but it neither adds nor subtracts from the total value by itself. Very strange, I'm sure. Henrik --Signature=_Fri__2_Jan_2004_23_39_32_+0100_Jch5_EiKBBhMT029 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9fMkeVhrtTJkXCMRAhvlAKDxMCUZerlyijhWIGgY1RuPWnFLVQCdEwaV zDK7ehTGh71UM/ksFOLc9xw= =NqFk -----END PGP SIGNATURE----- --Signature=_Fri__2_Jan_2004_23_39_32_+0100_Jch5_EiKBBhMT029-- From: julian.reschke@gmx.de (Julian Reschke) Date: Fri, 02 Jan 2004 18:52:04 +0100 Subject: [xml2rfc] end of sentence: two spaces? 
In-Reply-To: <20040102172621.72261265.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> <20040102172621.72261265.henrik@levkowetz.com> Message-ID: <3FF5AFC4.4060906@gmx.de> Henrik Levkowetz wrote: > Maybe tone down your arrogance a notch? The element would indeed mean I'm not going to reply to that... > "nothing", that _is_ its semantics. Its only function would be to be > there at this particular position, so that you don't have "." adjacent > to WSP. So it does *not* mean "nothing", right? > Anyway, I'd be perfectly happy with an empty if that wouldn't > have any side effects. The whole idea is to make the code that looks for > end of sentence ( "." WSP ) not see it where one wants to avoid that. As Clive showed, making this depend on the occurrence of *any* XML element is not going to work. So we need either an element, a P.I. or a special Unicode character that *exactly* carries that semantics. I could live with , although I think it doesn't exactly mean what we need (after all it means "no break here", and we don't want to forbid a break, we just don't want an additional *space*). Thus a Processing Instruction may make more sense. Anyway, I'm not going to support a grammar element that is defined to have "no" meaning except for a strange side effect... Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Fri, 2 Jan 2004 17:26:21 +0100 Subject: [xml2rfc] end of sentence: two spaces?
In-Reply-To: <3FF58380.2030201@gmx.de> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> <3FF58380.2030201@gmx.de> Message-ID: <20040102172621.72261265.henrik@levkowetz.com> --Signature=_Fri__2_Jan_2004_17_26_21_+0100_=1VFmRTAK3asknVG Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Friday 2 January 2004, Julian Reschke wrote: > Henrik Levkowetz wrote: ... > >>But what does the null directive *mean*? > > > > The element would mean nothing. A markup element being present, but > > not in itself causing anything to be rendered. But by being there, you > > wouldn't have "." WSP, but "." WSP, which would avoid triggering > > the rendering of "." WSP as ". " > > > > It would be similar to the 'nop' processing instruction, which sometimes > > is useful in programming. > > > > If the DTD permitted nested inside , you could have > > achieved the same thing by saying " Ta dum, ta dum (i.e. ta > > dum) teedeelee". You mostly can work around not having a null > > element, but sometimes it's quite useful. > > Come on. If it's defined to mean "nothing", it's useless. What you are > proposing has semantics (otherwise it wouldn't do anything). If we would > want to go that way, we could always use an empty . On the other > hand, looks almost exactly right. Maybe tone down your arrogance a notch? The element would indeed mean "nothing", that _is_ its semantics. Its only function would be to be there at this particular position, so that you don't have "." adjacent to WSP. Anyway, I'd be perfectly happy with an empty if that wouldn't have any side effects.
The whole idea is to make the code that looks for end of sentence ( "." WSP ) not see it where one wants to avoid that. Henrik --Signature=_Fri__2_Jan_2004_17_26_21_+0100_=1VFmRTAK3asknVG Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9ZuteVhrtTJkXCMRAojYAKCACIXel35h9IJlqGb3wWD4bnr3TQCghrWM qWYnhHuR3E5EXaaiHKwp8tA= =qrdD -----END PGP SIGNATURE----- --Signature=_Fri__2_Jan_2004_17_26_21_+0100_=1VFmRTAK3asknVG-- From: clive@demon.net (Clive D.W. Feather) Date: Fri, 2 Jan 2004 14:43:20 +0000 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040102153617.32014ac8.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> Message-ID: <20040102144320.GJ75652@finch-staff-1.thus.net> Henrik Levkowetz said: >> But what does the null directive *mean*? > The element would mean nothing. A markup element being present, but > not in itself causing anything to be rendered. But by being there, you > wouldn't have "." WSP, but "." WSP, which would avoid triggering > the rendering of "." WSP as ". " So you're suggesting that any directive after a full stop would have this effect? If so, what about: need a mumble. RFC 1234 says this. or need a mumble. Meanwhile, we see If not, you need to decide which ones have this effect and which don't. And you need to give specific semantics. > It would be similar to the 'nop' processing instruction, which sometimes > is useful in programming. But it isn't a NOP. I still think that instead of the space is the right approach. -- Clive D.W. 
Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From: julian.reschke@gmx.de (Julian Reschke) Date: Fri, 02 Jan 2004 15:43:12 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040102153617.32014ac8.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> <20040102153617.32014ac8.henrik@levkowetz.com> Message-ID: <3FF58380.2030201@gmx.de> Henrik Levkowetz wrote: > Friday 2 January 2004, Clive D.W. Feather wrote: > >>>I go back to stating a preference for markup, something like , >>>e.g. "i.e. " to avoid triggering the rendering of "." WSP as >>>"." SP SP >> >>But what does the null directive *mean*? > > > The element would mean nothing. A markup element being present, but > not in itself causing anything to be rendered. But by being there, you > wouldn't have "." WSP, but "." WSP, which would avoid triggering > the rendering of "." WSP as ". " > > It would be similar to the 'nop' processing instruction, which sometimes > is useful in programming. > > If the DTD permitted nested inside , you could have > achieved the same thing by saying " Ta dum, ta dum (i.e. ta > dum) teedeelee". You mostly can work around not having a null > element, but sometimes it's quite useful. Come on. If it's defined to mean "nothing", it's useless. What you are proposing has semantics (otherwise it wouldn't do anything). If we would want to go that way, we could always use an empty . On the other hand, looks almost exactly right.
Julian -- bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 From: henrik@levkowetz.com (Henrik Levkowetz) Date: Fri, 2 Jan 2004 15:36:17 +0100 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20040102111006.GF75652@finch-staff-1.thus.net> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> <20040102111006.GF75652@finch-staff-1.thus.net> Message-ID: <20040102153617.32014ac8.henrik@levkowetz.com> --Signature=_Fri__2_Jan_2004_15_36_17_+0100_z4HDCpTAaAutzY8Q Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit Friday 2 January 2004, Clive D.W. Feather wrote: > > I go back to stating a preference for markup, something like , > > e.g. "i.e. " to avoid triggering the rendering of "." WSP as > > "." SP SP > > But what does the null directive *mean*? The element would mean nothing. A markup element being present, but not in itself causing anything to be rendered. But by being there, you wouldn't have "." WSP, but "." WSP, which would avoid triggering the rendering of "." WSP as ". " It would be similar to the 'nop' processing instruction, which sometimes is useful in programming. If the DTD permitted nested inside , you could have achieved the same thing by saying " Ta dum, ta dum (i.e. ta dum) teedeelee". You mostly can work around not having a null element, but sometimes it's quite useful. 
Henrik --Signature=_Fri__2_Jan_2004_15_36_17_+0100_z4HDCpTAaAutzY8Q Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQE/9YHreVhrtTJkXCMRAmsrAJ40zVGPiqcvm7VgRHdKbs5JmYrYDQCg7L1v yxnCElIv9u8r8X+kN1htxqs= =H7II -----END PGP SIGNATURE----- --Signature=_Fri__2_Jan_2004_15_36_17_+0100_z4HDCpTAaAutzY8Q-- From: clive@demon.net (Clive D.W. Feather) Date: Fri, 2 Jan 2004 11:10:06 +0000 Subject: [xml2rfc] end of sentence: two spaces? In-Reply-To: <20031231005501.31892d3b.henrik@levkowetz.com> References: <3FEEDA07.3070900@gmx.de> <20031228170447.5f6f7254.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FEFFFD7.6020001@gmx.de> <20031229071158.6985f569.mrose+internet.xml2rfc@dbc.mtview.ca.us> <3FF0A5C5.6090800@gmx.de> <20031230105949.GH76219@finch-staff-1.thus.net> <20031231005501.31892d3b.henrik@levkowetz.com> Message-ID: <20040102111006.GF75652@finch-staff-1.thus.net> Henrik Levkowetz said: >> Unicode has several special characters, so it's a question of picking the >> right one. >> >> I would argue that things like "i.e." need to be written as *either* >> i.e. no line break is permitted, but justification space is. >> i.e.‍ the zero width joiner shows a closer association. >> >> is or NO-BREAK SPACE >> ‍ is ‍ or ‍ ZERO WIDTH JOINER > > Good exposition. > > However, the special character option is starting to look like a rathole > to me. What if somebody actually needs to use this character withouth > implying the additional semantics we've added to it? Well, these characters *do* have semantics already. NO-BREAK SPACE means "a space that can not be changed to a line break". On further reading, it turns out that I was wrong in assuming it was otherwise like a space - Unicode treats space as a gap between words or sentences, but NO-BREAK SPACE only as the former. In particular, you are supposed to use it in contexts like: "Dr. Jones" where two words must remain visually together. 
I would argue that abbreviations are an exactly similar situation. So this strikes me as the right thing for the job. ZERO WIDTH JOINER means "the things either side are associated more closely than normal". In this case, we're saying that the dot and space are associated more closely than normal - the space is attached to the dot rather than the dot being the end of a sentence. People can still use ZWJ for other purposes without affecting us; it would only have our special semantic ("not the end of a sentence") when joining dot to space. However, I only suggested it because I misunderstood NO-BREAK SPACE. > I go back to stating a preference for markup, something like , > e.g. "i.e. " to avoid triggering the rendering of "." WSP as > "." SP SP But what does the null directive *mean*? Or is the rule rather "dot followed by space is the end of a sentence if and only if there is no intervening directive"? That's surely more complicated than using . -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646
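The NO-BREAK SPACE approach argued for in the message above might look something like this in a plain-text renderer. It is a minimal sketch under assumptions: the function name `render_spaces` is invented here, only the sentence-spacing aspect is modelled (a real renderer would also have to refuse to break a line at the NO-BREAK SPACE), and "." is the only sentence-ending punctuation considered.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

def render_spaces(text):
    """Widen "." + ordinary whitespace to two spaces; NBSP stays a single space.

    The character class lists ordinary whitespace explicitly, because
    Python's \\s would also match NO-BREAK SPACE in a str pattern.
    """
    widened = re.sub(r"\.[ \t\n]+", ".  ", text)
    # For ASCII output the NBSP is finally written as one plain space.
    return widened.replace(NBSP, " ")

sample = "See Dr." + NBSP + "Jones. He agrees, i.e." + NBSP + "yes."
print(render_spaces(sample))
```

Here the author marks each abbreviation by writing NO-BREAK SPACE after its dot, so the sentence-end rule stays a simple "dot followed by ordinary space" test with no extra element or directive.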