decoder for utf-8

6 Mar 2003

Well, Locale.Charset.decoder does at least throw when fed an encoding name it can't recognize:
...
Locale.Charset.decoder("foo");
Unknown character encoding foo
/usr/local/pike/7.4.13/lib/modules/_Charset.pmod:214:
Locale.Charset->decoder("foo")
and that certainly is a Good Thing.
The current behavior on "utf-8" unfortunately rules out using the decoder
in an XML parser that wants to make a best effort to comply with the spec
(even if full compliance isn't a realistic goal, in view of the bloated
overengineered specification, *sigh*). That of course can be worked around
by special-casing "utf-8" to use utf8_to_string, which seems to be more
strict. But who knows what traps lurk in the handling of other encodings...
Wishful thinking: perhaps someday the Charset module might support a
"strict mode", where it refuses to swallow sequences that are invalid in
the given encoding?
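
To make the wish concrete, here is a minimal sketch of what "strict mode" would mean for a caller. It is written in Python purely for illustration and is not Pike's Charset API; strict_decode is a hypothetical helper name. It assumes only that strict decoding should fail on both an unknown encoding name and a byte sequence that is invalid in the named encoding.

```python
import codecs


def strict_decode(encoding: str, data: bytes) -> str:
    """Hypothetical helper: decode data, refusing anything questionable."""
    # codecs.lookup raises LookupError for an unrecognized encoding name,
    # which mirrors the "Unknown character encoding foo" case.
    codecs.lookup(encoding)
    # errors="strict" raises UnicodeDecodeError on sequences that are
    # invalid in the given encoding instead of guessing at a meaning.
    return data.decode(encoding, errors="strict")


if __name__ == "__main__":
    print(strict_decode("utf-8", b"?"))
    for name, raw in (("foo", b"?"), ("utf-8", b"\xc0\xc0")):
        try:
            strict_decode(name, raw)
        except (LookupError, UnicodeDecodeError) as err:
            print("rejected:", err)
```

An XML parser built on such a helper could report a fatal error as soon as the input stops being well-formed in its declared encoding, rather than passing reinterpreted characters downstream.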
/ rjb
Previous text:
...
2003-03-06 10:28:
Subject: decoder for utf-8

Locale.Charset.decoder never throws errors (except for internal error
conditions).  Instead, it makes a best effort interpretation of the
data.  In this case, you have something that is almost a valid
two-byte encoding of '?' (\xc0\xbf), but the continuation byte has
been increased by one, making it an illegal sequence.  Well, if it
_had_ been legal to increase the continuation byte by one, it would of
course have meant that the character code should be increased by one
(giving '@') since this is the last continuation byte, so that's how
it is interpreted.
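
The arithmetic described above can be sketched as follows. This is a Python illustration, not the Pike decoder's actual code; lenient_two_byte is a hypothetical name, and the assumption is that the continuation byte is treated as an offset from 0x80 rather than being range-checked or masked.

```python
def lenient_two_byte(lead: int, cont: int) -> str:
    """Decode a two-byte UTF-8-style sequence without validating it."""
    # Payload of the lead byte: the low five bits of 110xxxxx.
    high = lead & 0x1F
    # Assumption: the continuation byte is taken as an offset from 0x80,
    # so a value past 0xBF spills over into the next character code.
    low = cont - 0x80
    return chr((high << 6) + low)


if __name__ == "__main__":
    print(lenient_two_byte(0xC0, 0xBF))  # the two-byte form of '?'
    print(lenient_two_byte(0xC0, 0xC0))  # continuation byte increased by one
```

Under that assumption, \xc0\xbf yields 0x3f ('?') and \xc0\xc0 yields 0x40 ('@'), which matches the behavior described in the post.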
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
