Currently, Pike's MIME.Message parser doesn't handle non-ASCII headers with specified encodings:
MIME.Message("Hello, world!", (["Subject": "Hello, \U0001F310"]));
(10) Result: Message(([ ]))
(string)_;
(11) Result: "Subject: Hello, \U0001f310\r\n" "Content-Length: 13\r\n" "\r\n" "Hello, world!"
Going the other way:
MIME.Message("Subject: =?UTF-8?B?SGVsbG8sIPCfjJA=?=\r\n\r\nHello, world!");
(13) Result: Message(([ ]))
_->headers;
(14) Result: ([ /* 1 element */ "subject": "=?UTF-8?B?SGVsbG8sIPCfjJA=?=" ])
I'm currently working with IMAP and RFC[2]822 messages. I could either implement RFC 2047 parsing in my app, or enhance MIME.Message to return Unicode strings automatically.
Would this functionality be welcomed in trunk?
ChrisA
Adding charset decoding to MIME.Message sounds good to me, perhaps with a flag to enable it on decoding? (A compat problem I can think of is that applications may assume that decoded data is 8bit strings and fail to apply proper encoding before writing to file, causing an exception.)
/Marty
On Sun, Oct 30, 2016 at 4:17 AM, Martin Karlgren marty@roxen.com wrote:
Adding charset decoding to MIME.Message sounds good to me, perhaps with a flag to enable it on decoding? (A compat problem I can think of is that applications may assume that decoded data is 8bit strings and fail to apply proper encoding before writing to file, causing an exception.)
I agree about backward compat, and that's a bit problematic. So here's my thinking: MIME.UnicodeMessage will be a subclass of MIME.Message with the express goal of making everything use 21-bit strings. Any time it returns an eight-bit string, that is a bug to be fixed. So future incompatibility won't be a problem, as it's expressly documented that way; and past compatibility is fine, because MIME.Message itself isn't changing. Methods like MIME.Message()->get_filename, which currently do the decoding at that late point, can simply be overridden in UnicodeMessage.
Does that seem like a reasonable API?
ChrisA
I've pushed a change to 8.1 that ought to be 100% backward compatible. If there's a problem, I can revert it, but there shouldn't be. (Just in case, it's not in 8.0.) The two notable features are:
1) MIME.UnicodeMessage, as described above
2) MIME.parse_headers() now takes an additional parameter 'unicode'.
Everything else should be completely invisible to most programs, and both of these can be ignored.
ChrisA
Turns out, I haven't been seeing all the messages on this list :( Sorry all! This change has been reverted.
New proposal: MIME.decode_words_text_remapped is the single most obvious way to decode a Subject header (among others), but that's really not obvious from the docs. Can we somehow make that more discoverable? I ended up implementing my own version of that (tediously and buggily [1]), and then wanted to fold it into core so other people wouldn't have to.
ChrisA
[1] https://github.com/Rosuav/zawinski/commit/43a09b87a1f7b89e553e2f842149c64186...
On 30 Oct 2016, at 23:15 , Chris Angelico rosuav@gmail.com wrote:
Turns out, I haven't been seeing all the messages on this list :( Sorry all! This change has been reverted.
The LysKOM bridge is broken again. Asking whomever it may concern: can it be fixed reliably or should everyone be forced to use the mailing list instead? ;)
Copy+paste of Marcus Comstedt’s reply in LysKOM:
"It would probably make sense to link to decode_words_text and decode_words_tokenized from the entries for MIME.parse_headers and MIME.Message->headers. The various specializations (including _mapped) are then cross-linked from there."
/Marty
Good plan. I've added some copy to those exact places. I've also mentioned RFC 2047 alongside RFC 1522: when I did my research, it was 2047 that I found, and searching the docs for that number came up blank. (2047 is the current standard; 1522 is where these were first introduced. Like 2822 vs 822.)
ChrisA
In fact, 1522 is not the first either. It is predated by 1342, just like 1341 predates 1521. Mentioning the latest ratified version would probably be enough, since they back-reference earlier versions anyway.
I'm sorry, but your code is incorrect, and it cannot be done correctly either. As I explained, the encoding rules depend on the grammar of the particular header field. Use of the _text functions is correct only for headers whose grammar declares them to be "text". Headers which instead use the "phrase" non-terminal in their grammar (such as the "From:" and "To:" headers) need to be treated differently. There is even an example in RFC1522 which is not correctly decoded by your class.
The reasonable API is the one that has already been implemented for 11 years, for the reasons already given. :-) I wish MIME was simpler, but pretending that it is when it actually isn't will not solve problems but create them I'm afraid...
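To make the "text" versus "phrase" distinction above concrete, here is a minimal sketch of how an application might choose the decoding function per header. The helper names decode_text_header and decode_phrase_header and the sample message are hypothetical and exist only for illustration; the MIME.decode_words_*_remapped functions are the existing library calls discussed in this thread. It assumes the application already knows which grammar each header follows.

```pike
// Hypothetical helper: decode a header whose grammar is "text"
// (for example Subject). Returns a Unicode string.
string decode_text_header(MIME.Message msg, string name)
{
  return MIME.decode_words_text_remapped(msg->headers[name]);
}

// Hypothetical helper: decode a header built from "phrase" tokens
// (for example From or To). The field is tokenized first, so the
// result is an array of decoded strings and special characters.
array(string|int) decode_phrase_header(MIME.Message msg, string name)
{
  return MIME.decode_words_tokenized_remapped(msg->headers[name]);
}

int main()
{
  // Sample message made up for this sketch.
  MIME.Message msg = MIME.Message(
    "From: =?UTF-8?B?SGVsbG8sIPCfjJA=?= <someone@example.com>\r\n"
    "Subject: =?UTF-8?B?SGVsbG8sIPCfjJA=?=\r\n"
    "\r\n"
    "Hello, world!");

  // %O prints wide characters in escaped form, so the output stays 7-bit.
  write("%O\n", decode_text_header(msg, "subject"));
  write("%O\n", decode_phrase_header(msg, "from"));
  return 0;
}
```

The point of keeping two helpers is that the caller, not the parser, decides which grammar applies, which is the reason given above for not decoding automatically.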
Currently, Pike's MIME.Message parser doesn't handle non-ASCII headers with specified encodings:
Sure it does.
MIME.Message("Subject: =?UTF-8?B?SGVsbG8sIPCfjJA=?=\r\n\r\nHello, world!");
(1) Result: Message(([ ]))
MIME.decode_words_text_remapped(_->headers->subject);
(2) Result: "Hello, \U0001f310"
It is not done automatically for two reasons:
1) RFC1522 encoding is only applicable to certain headers, and the way it is applied differs between two types of fields (tokenized fields, and free text fields). Thus, the application will need to use the function that is appropriate for the specific field it is accessing.
2) A remapping to Unicode is not always needed or preferable. Therefore an option is given to use a different set of functions that preserve the original encoding:
MIME.decode_words_text(_->headers->subject);
(2) Result: ({ /* 1 element */ ({ /* 2 elements */ "Hello, \360\237\214\220", "utf-8" }) })
Encoding works similarly:
MIME.Message("Hello, world!", (["Subject": MIME.encode_words_text_remapped("Hello, \U0001F310", "base64", "utf-8")]));
(1) Result: Message(([ ]))
(string)_;
(2) Result: "Content-Length: 13\r\n" "Subject: Hello, =?utf-8?b?8J+MkA==?=\r\n" "\r\n" "Hello, world!"
pike-devel@lists.lysator.liu.se