It looks like MIME.decode_words_text_remapped eats spaces in places it shouldn't:
MIME.decode_words_text_remapped("=?ISO-8859-1?Q?a?= b c =?ISO-8859-1?Q?d?=");
(17) Result: "ab cd"
The result should have been "a b c d". Whitespace is only supposed to be removed between adjacent encoded-words (section 6.2).
Maybe the samples from rfc2047 should be added to the MIME testsuite?
More test cases are always welcome.
/ Peter Bortas
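A minimal sketch of what such test cases might look like, written as a standalone Pike script rather than in the testsuite's own format. The expected strings follow the original poster's reading of RFC 2047 section 6.2 and are an assumption, not settled behaviour; the name text_cases is made up for illustration.

```pike
// Sketch only: expected values assume that whitespace is dropped
// solely between two adjacent encoded-words.
array(array(string)) text_cases = ({
  // Whitespace next to plain text is kept.
  ({ "=?ISO-8859-1?Q?a?= b c =?ISO-8859-1?Q?d?=", "a b c d" }),
  ({ "=?ISO-8859-1?Q?a?= b", "a b" }),
  // Whitespace between two adjacent encoded-words is dropped.
  ({ "=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=", "ab" }),
});

int main()
{
  int failures = 0;
  foreach (text_cases, array(string) c) {
    string got = MIME.decode_words_text_remapped(c[0]);
    if (got != c[1]) {
      failures++;
      write("MISMATCH %O: got %O, expected %O\n", c[0], got, c[1]);
    }
  }
  write("%d of %d cases differ\n", failures, sizeof(text_cases));
  return failures ? 1 : 0;
}
```

The script only reports where the decoder disagrees with the assumed expectations; it does not decide which reading of the RFC is right.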
Previous text:
2002-12-12 18:34: Subject: incorrect rfc2047 MIME decoding?
/ Brevbäraren
The examples you refer to are for structured fields, not text fields. Thus, it is MIME.decode_words_tokenized_remapped() you should compare them against.
Pike v7.4 release 4 running Hilfe v3.5 (Incremental Pike Frontend)
MIME.decode_words_tokenized_remapped("=?ISO-8859-1?Q?a?= b c =?ISO-8859-1?Q?d?=");
(1) Result: ({ /* 4 elements */ "a", "b", "c", "d" })
Here, you get four separate tokens, just like the spec says.
For text fields, the examples section says that "the rules are slightly different", but gives no relevant examples. Can you find a better reference that claims your variant is correct for text fields?
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
In the last episode (Dec 12), Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum said:
For text fields, the examples section says that "the rules are slightly different", but gives no relevant examples. Can you find a better reference that claims your variant is correct for text fields?
My interpretation is that the only difference between structured fields and regular text is that in structured fields, encoded text inside a comment () can butt up against the parens instead of requiring whitespace.
Unfortunately, only one of the examples in RFC2047 is a text field, and the only whitespace in the text is between two encoded-words (and should be eaten). A couple of Google searches didn't come up with anything useful, so I started grepping my email archives for examples.
The best I could find is a header from the mutt-dev list (see http://groups.yahoo.com/group/mutt-dev/message/7390?source=1 )
The subject line reads
Subject: change to =?us-ascii?Q?rfc2047=5Fencode=5Fstring?=
, which should be decoded to "change to rfc2047_encode_string". I guess most of the time this issue never comes up, since if your subject is filled with non-ASCII characters, your MUA will end up encoding the entire header instead of only the offending word.
Probably the most conformant way is to only eat whitespace between two encoded words. Since the RFC doesn't seem to mention any other kinds of whitespace, the intention might be that they should be left alone.
It does pose something of a semantic problem for _encode_ though: Given the input
x = ({ ({ "Hello", 0 }), ({ "Wor", "iso-8859-1" }), ({ "ld", "iso-8859-2" }), ({ "!", 0 }) });
what should MIME.encode_words_text(x, "q") produce? It is not possible to put the first encoded-word directly after the "o", but if a space is inserted, the resulting string will decode to
Hello World !
and not
HelloWorld!
as intended. Tricky...
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
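The round-trip problem described above can be checked with a short script. This is only a sketch: it prints whatever the installed MIME module produces and makes no claim about what that output is.

```pike
int main()
{
  // The array from the message above; a charset of 0 asks for the
  // text to be emitted literally, not as an encoded-word.
  array x = ({ ({ "Hello", 0 }), ({ "Wor", "iso-8859-1" }),
               ({ "ld", "iso-8859-2" }), ({ "!", 0 }) });

  string encoded = MIME.encode_words_text(x, "q");
  string decoded = MIME.decode_words_text_remapped(encoded);

  write("encoded: %O\n", encoded);
  write("decoded: %O\n", decoded);
  // The intended text is "HelloWorld!"; any inserted space that
  // survives decoding shows up as a changed round trip.
  write("round trip %s\n",
        decoded == "HelloWorld!" ? "preserved" : "changed");
  return 0;
}
```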
Previous text:
2002-12-12 19:46: Subject: Re: incorrect rfc2047 MIME decoding?
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
In the last episode (Dec 12), Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum said:
Probably the most conformant way is to only eat whitespace between two encoded words. Since the RFC doesn't seem to mention any other kinds of whitespace, the intention might be that they should be left alone.
It does pose something of a semantic problem for _encode_ though: Given the input
x = ({ ({ "Hello", 0 }), ({ "Wor", "iso-8859-1" }), ({ "ld", "iso-8859-2" }), ({ "!", 0 }) });
what should MIME.encode_words_text(x, "q") produce? It is not possible to put the first encoded-word directly after the "o", but if a space is inserted, the resulting string will decode to
Hello World !
and not
HelloWorld!
as intended. Tricky...
RFC2047 says that "An 'encoded-word' that appears within a 'phrase' MUST be separated from any adjacent 'word', 'text' or 'special' by 'linear-white-space'". That means any strings adjacent to a string that gets encoded must also get encoded, unless they contain a leading (or trailing) space. So your array must end up being encoded as:
"=?us-ascii?q?Hello?= =?iso-8859-1?q?Wor?= =?iso-8859-2?q?ld?= =?us-ascii?q?!?="
or, if you choose to "extend" the charset into the adjacent string (which only works if the charset is a superset of us-ascii):
"=?iso-8859-1?q?HelloWor?= =?iso-8859-2?q?ld!?="
If element 0 were "Hello " and element 3 were " !", only then could you leave them unencoded, and the result would be
"Hello =?iso-8859-1?q?Wor?= =?iso-8859-2?q?ld?= !"
So your array must end up being encoded as:
"=?us-ascii?q?Hello?= =?iso-8859-1?q?Wor?= =?iso-8859-2?q?ld?= =?us-ascii?q?!?="
That's not correct. By setting the charset for "Hello" to 0, rather than "us-ascii", I have requested that the Hello part is encoded literally, and not as an encoded-word.
or, if you choose to "extend" the charset into the adjacent string (which only works if the charset is a superset of us-ascii):
"=?iso-8859-1?q?HelloWor?= =?iso-8859-2?q?ld!?="
That's not correct either. Since no charset is provided for the "Hello" part, I can't assume it's a subset of <whatever the encoding for "Wor" is>, and I can't even assume that it is a subset of "us-ascii" as you did in the first suggestion.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2002-12-12 23:46: Subject: Re: incorrect rfc2047 MIME decoding?
/ Brevbäraren
In the last episode (Dec 13), Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum said:
So your array must end up being encoded as:
"=?us-ascii?q?Hello?= =?iso-8859-1?q?Wor?= =?iso-8859-2?q?ld?= =?us-ascii?q?!?="
That's not correct. By setting the charset for "Hello" to 0, rather than "us-ascii", I have requested that the Hello part is encoded literally, and not as an encoded-word.
Ah. Then you have asked the impossible, and MIME.encode_words_text should have thrown an exception, or failed in some other manner. Unencoded text must have a space between itself and encoded text.
The question was whether it would be most useful to give an error, or to silently insert the required whitespace.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2002-12-13 00:12: Subject: Re: incorrect rfc2047 MIME decoding?
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
In the last episode (Dec 13), Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum said:
The question was whether it would be most useful to give an error, or to silently insert the required whitespace.
Depends, I guess, on whether you want the encoding to be reversible or not. The array notation lets you generate input that cannot legally be encoded, so if you bend a bit during encoding, you end up with something that, when decoded, does not match your original array. Since we're really only talking about email headers here, it probably doesn't matter one way or another.
pike-devel@lists.lysator.liu.se
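For the "give an error" option, one possible shape is a wrapper that encodes and then rejects any result that does not decode back to the same text. This is a sketch under assumptions: phrase_text and encode_words_text_strict are hypothetical helpers, not part of the MIME module, and the comparison looks only at the text parts and ignores charsets.

```pike
// Hypothetical helper: joins the text parts of a phrase,
// ignoring the charset of each part.
string phrase_text(array phrase)
{
  string s = "";
  foreach (phrase, mixed w)
    s += stringp(w) ? w : w[0];
  return s;
}

// Hypothetical strict variant: encode, then refuse the result if
// decoding it would not give back the same text.
string encode_words_text_strict(array phrase, string encoding)
{
  string encoded = MIME.encode_words_text(phrase, encoding);
  string back = phrase_text(MIME.decode_words_text(encoded));
  if (back != phrase_text(phrase))
    error("Phrase %O cannot be encoded reversibly (got %O).\n",
          phrase, encoded);
  return encoded;
}

int main()
{
  array x = ({ ({ "Hello", 0 }), ({ "Wor", "iso-8859-1" }),
               ({ "ld", "iso-8859-2" }), ({ "!", 0 }) });
  // Report the error instead of aborting, so both outcomes are visible.
  mixed err = catch {
    write("%O\n", encode_words_text_strict(x, "q"));
  };
  if (err) werror(describe_error(err));
  return 0;
}
```

Whether the wrapper throws for this input depends on how the installed decoder treats whitespace next to plain text, which is the open question of the thread.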