How is the decoder returned by Locale.Charset.decoder("utf-8") supposed to behave when fed a byte stream that is not valid UTF-8? It seems to return some peculiar results instead of throwing an error (as utf8_to_string quite correctly does):
object dec = Locale.Charset.decoder("utf8"); dec->feed("\xc0\xc0")->drain();
(62) Result: "@"
utf8_to_string("\xc0\xc0");
utf8_to_string(): Expected continuation character at index 1 (got 0xc0).
Locale.Charset.decoder never throws errors (except for internal error conditions). Instead, it makes a best-effort interpretation of the data. In this case, you have something that is almost a valid two-byte encoding of '?' (\xc0\xbf), but the continuation byte has been increased by one, making it an illegal sequence. Well, if it _had_ been legal to increase the continuation byte by one, it would of course have meant that the character code should be increased by one (giving '@'), since this is the last continuation byte, so that's how it is interpreted.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
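The arithmetic described above can be reproduced directly. This is a minimal sketch of the bit manipulation, not Pike's actual implementation; the function name is made up for illustration (shown in Python for verifiability):

```python
def best_effort_two_byte(lead, cont):
    """Best-effort decode of a two-byte UTF-8 sequence, as described
    above: combine the low five bits of the lead byte with the
    continuation byte's offset from 0x80, without checking that the
    continuation byte really lies in the legal 0x80-0xbf range."""
    return chr(((lead & 0x1F) << 6) + (cont - 0x80))

# The (overlong) sequence \xc0\xbf encodes '?' (0x3f)...
print(best_effort_two_byte(0xC0, 0xBF))  # '?'
# ...and bumping the continuation byte to \xc0 bumps the result to '@' (0x40).
print(best_effort_two_byte(0xC0, 0xC0))  # '@'
```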
Previous text:
2003-03-05 23:01: Subject: decoder for utf-8
/ rjb
Well, Locale.Charset.decoder does at least throw when fed an encoding name it can't recognize:
Locale.Charset.decoder("foo");
Unknown character encoding foo
/usr/local/pike/7.4.13/lib/modules/_Charset.pmod:214: Locale.Charset->decoder("foo")
and that certainly is a Good Thing. The current behavior on "utf-8" unfortunately rules out using the decoder in an XML parser that wants to make a best effort to comply with the spec (even if full compliance isn't a realistic goal, in view of the bloated overengineered specification, *sigh*). That of course can be worked around by special-casing "utf-8" to use utf8_to_string, which seems to be more strict. But who knows what traps lurk in the handling of other encodings...
Wishful thinking: perhaps someday the Charset module might support a "strict mode", where it refuses to swallow sequences that are invalid in the given encoding?
/ rjb
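For comparison, the "strict mode" wished for here is roughly the default policy of Python's codec machinery, which also offers the lenient policy as an option. A sketch of the two policies on the same illegal sequence:

```python
data = b"\xc0\xc0"  # the illegal sequence from the example above

# Strict policy: refuse to swallow invalid sequences.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# Lenient policy: substitute U+FFFD REPLACEMENT CHARACTER instead of guessing.
print(data.decode("utf-8", errors="replace"))
```

Note that the lenient policy here still differs from the Charset module's: it marks the damage with U+FFFD rather than producing a plausible-looking character.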
Previous text:
2003-03-06 10:28: Subject: decoder for utf-8
Yes, _instantiating_ a Locale.Charset.decoder can throw an error.
Using utf8_to_string for utf-8 wouldn't make the parser strict; you still wouldn't catch errors in other encodings.
You are welcome to implement such a strict mode. In addition to detecting illegal sequences in UTF-* and ISO-2022, it should do range checking on "normal" character encodings, so that you can't use \x7f in US-ASCII for example.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
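Range checking of the kind described can be seen in Python's ascii codec, which rejects any byte with the high bit set, since US-ASCII only defines code points 0x00-0x7f (illustrated here with \x80; exactly which code points a strict mode should reject per charset is the design question):

```python
# A strict US-ASCII decoder must range-check each byte and report
# where the first out-of-range byte occurs.
try:
    b"hello \x80 world".decode("ascii")
except UnicodeDecodeError as e:
    print("rejected byte at index", e.start)  # index 6
```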
Previous text:
2003-03-06 11:03: Subject: decoder for utf-8
Certainly, utf8_to_string only handles utf-8 - that's what I meant by special-casing. I do realize that there is no equivalent option for other encodings. However, utf-8 is the most common (and one of the defaults) for XML.
As for the DIY comment: heh, just what I expected to hear ;-) "Wishful thinking" is an appropriate phrase here... A more realistic option is to use the system's iconv for preprocessing, if available; at least the glibc currently shipped with the better Linux dists seems to be quite capable. So much for portability, though :-(
(Redundant, tired remark: in a perfect world, the behavior we're discussing would be documented... sorry, couldn't resist)
/ rjb
Previous text:
2003-03-06 11:09: Subject: decoder for utf-8
Contributions to the documentation are of course welcome as well.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-03-06 11:45: Subject: decoder for utf-8
I'd be interested to hear why the charset module treats imperfect input so forgivingly. I can easily see cases where that is very useful, but it does not strike me as a typical Comstedt design choice when there are rigid rules or standards on offer. Is it best practice recommended by RFC 1345 (which I have hardly read at all)?
/ Johan Sundström (folkskådare)
Previous text:
2003-03-06 10:28: Subject: decoder for utf-8
Because text streams processed by Locale.Charset.decoder are typically intended for human consumption, not for further machine processing.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-03-06 12:13: Subject: decoder for utf-8
pike-devel@lists.lysator.liu.se