I'm not convinced that I like the utf8_to_string crippling. I can see why you don't want to do it in a lot of applications, but how do I decode the non-shortest form with this change?
I didn't think they would occur, and they certainly shouldn't occur. What sort of broken stuff generates such things?
If it's really necessary, I guess a flag for it could be added.
Java, as Nilsson already mentioned. I think I missed something here. Does utf8_to_string now reject non-minimal encodings? What's the point of that? As long as string_to_utf8 generates minimal encodings, I don't see any problem with being able to decode non-minimal encodings.
He didn't say more specifically where and why it happens in Java. I can't see it as anything but a bug that ought to be fixed there. (Not that that is any reason for not providing the option to decode non-shortest forms. That's decided by the severity of the bug and how common its faulty output is in practice.)
You can read all about the reasons why decoding non-shortest forms is bad in the technical reports at unicode.org, e.g. http://www.unicode.org/reports/tr36/tr36-2.html. Basically, it's very useful to be able to interpret the various ASCII chars that have special meaning in protocols without knowing or caring whether UTF-8 is being used. That's also one of the design goals for UTF-8, and it's only achieved by forbidding both coding and decoding of non-shortest forms.
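To make it concrete, here's a minimal sketch of the failure mode (not any particular application's code):

    void demo()
    {
      // 0x2f is '/' in UTF-8; 0xc0 0xaf is a non-shortest ("overlong") form of it.
      string evil = "\xc0\xaf";
      int pos = search(evil, "/");          // byte-level scan: -1, no slash seen
      mixed err = catch { utf8_to_string(evil); };
      // A conforming decoder throws here; a lenient one returns "/" and quietly
      // reintroduces the very character the scan above never saw.
    }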
Java uses a two-byte encoding of the NUL character, to avoid having NUL-bytes embedded in strings. I'm not sure you'll be able to convince Sun that this is a bug and have them change it retroactively...
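For reference, it looks like this (byte values as I understand their format; "strict" here meaning the new utf8_to_string behaviour):

    // Standard UTF-8 encodes U+0000 as the single byte 0x00, i.e.
    //   string_to_utf8("\0") == "\0"
    // Java's "modified UTF-8" instead writes the overlong pair 0xc0 0x80,
    // so the encoded string never contains an embedded NUL byte:
    string java_nul = "\xc0\x80";
    // A strict utf8_to_string() rejects this; Java's DataInput.readUTF() accepts it.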
I see. I don't intend to convince them of anything, but it could be interesting to see their stance on whether they are going to force their variant through the Unicode consortium or call their encoding something else.
This is interesting concerning the Java variant: http://mail.nl.linux.org/linux-utf8/2002-12/msg00306.html
It's not only the overlong encoding of NUL that's special. I think the way to handle this one is by adding a special encoding for it to the Charset module.
(Note that according to the author of that text, this "Java modified UTF-8" isn't intended to be used for generic I/O of UTF-8 strings, but rather for object serialization. There are other Java libraries that read and write correct UTF-8.)
Yes, characters outside BMP are incorrectly encoded as well. But in that case I don't know if it was intentional, and I don't know if utf8_to_string() actually decoded it "correctly".
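For the record, that case looks like this (byte sequences written out by hand, so double-check them before relying on this):

    // U+10400 in standard UTF-8 is a single four-byte sequence:
    string std = string_to_utf8(sprintf("%c", 0x10400));  // "\xf0\x90\x90\x80"
    // The Java serialization format instead encodes the UTF-16 surrogate pair
    // D801 DC00 as two three-byte sequences:
    //   "\xed\xa0\x81" "\xed\xb0\x80"
    // A strict decoder rejects those, since D800-DFFF aren't valid characters.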
For all cases where you want to be forgiving in what input you accept, is the world order that you use Locale.Charset.decoder("utf8")?
With no more background to the issue than what has(n't) been presented here, I would expect that to be the standardized default practice, no flags given. Feel free to forward other concerns, as you no doubt have good reasons behind this change.
> For all cases where you want to be forgiving in what input you accept, is the world order that you use Locale.Charset.decoder("utf8")?
Is that a design goal for the Charset module? I didn't know that, so I made it equally stringent.
> With no more background to the issue than what has(n't) been presented here, I would expect that to be the standardized default practice, no flags given.
I did nothing other than make the UTF-8 encoders and decoders comply with the standard as it stands since Unicode 3.1 from March 2001. They didn't comply before.
> Is that a design goal for the Charset module? I didn't know that, so I made it equally stringent.
Possibly not, but it's often a design goal in real-world applications. I'm just trying to figure out what method I should adopt if I want to parse decipherable though bad (BAD Java! :) input benevolently. Being able to do so with support from the language feels like a good aim for any programming language, though the method need not be the same as the one used to parse strict, correct, UTF-8 compliant input.
For the Charset module, I believe that the decoder should be lenient. The reason is that the module handles more than UTF-8; it also handles e.g. EBCDIC and UTF-7, which do _not_ share the design goal of UTF-8 that you should be able to do ASCII processing of the "encoded" form. If you look for "/" in an EBCDIC string, for example, you will not find any slashes, as they are encoded as "a". So the general operating principle for the Charset module is that you decode the string _first_, _then_ you look for specific characters. If you deliberately violate this principle because you _know_ you are dealing with UTF-8, which lets you get away with it, you can just as well use utf8_to_string. That way you know that you have to rewrite the code anyway if you want to change to a different character encoding.
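I.e. the intended pattern is roughly this (just a sketch; the charset name is whatever the data actually arrives in):

    int has_slash(string raw, string charset)
    {
      // Decode the whole chunk first, then look for protocol characters in
      // the decoded (wide) string.  This works the same whether the data is
      // UTF-8 or e.g. EBCDIC, where '/' is not the byte 0x2f.
      string decoded = Locale.Charset.decoder(charset)->feed(raw)->drain();
      return search(decoded, "/") != -1;
    }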
Yes, there's no doubt that the property of being able to process ASCII in encoded form is specific to each encoding. But I don't see why there should be a general principle in the Charset module that that property must be disregarded in the encodings that have it just because there are other encodings that do not.
You are correct that if someone uses the Charset module with an arbitrary encoding, (s)he can't assume this property, but if that person uses a predetermined encoding that is known to have it, why not let him/her take advantage of that? There are, after all, other cases where Charset module users rely on this and various other encoding-specific properties, typically in heuristics that guess the encoding.
But more importantly, this is not a matter of choice when it comes to UTF-8. The standard clearly states that an implementation MUST NOT decode non-shortest forms. If it does, it doesn't decode UTF-8 anymore; it decodes a superset of UTF-8 or, if you like, a dated version of it. I think the decoder returned by Locale.Charset.decoder("utf-8") should comply with the UTF-8 standard. There's every possibility to add more decoders to the Charset module for other variants. I can take it upon myself to fix an extended/historic encoder and decoder if someone proposes a name for it.
As for the argument to use utf8_to_string instead, that one doesn't have the feature to handle streaming operation. If streaming isn't wanted, I think most people already use utf8_to_string when they only deal with UTF-8.
If you want to check for specific characters while streaming, is there really any problem with checking for them in the output of the decoder rather than the input? In the case of streaming, the conversion needs to be a part of the main processing logic anyway.
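Something along these lines is what I mean (a sketch only; buffering of any data after the delimiter is left out):

    object dec = Locale.Charset.decoder("utf-8");

    void got_chunk(string chunk)
    {
      // Feed the raw bytes to the streaming decoder and look for the
      // delimiter in the *decoded* output instead of in the raw input.
      string out = dec->feed(chunk)->drain();
      int end = search(out, "\n");
      if (end != -1)
        werror("got one part: %O\n", out[..end - 1]);
    }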
The so-called protocol step and the decoding step might very well be clearly separated even when the overall operation is streaming.
Really? I don't see any particular reason to think so. This principle might not be prevalent in all parts of the protocol.
Because then you first need to separate out the parts that are. Each such part can then be decoded individually.
You can stream the decoding of a part while at the same time looking for the end of it (not unlikely by looking for a certain ASCII char).
Ok, I'm a little confused now. So you are looking for a particular ASCII char which terminates the UTF-8 encoded part. Is this character itself part of a UTF-8 encoded part or not? If it is, then you should decode before looking for it. If it is not, then there is only one possible encoding of it, and any overlong UTF-8 representation of it is clearly not the end marker.
Now you're back in the general case why it's a security problem to decode non-shortest forms. I trust I don't need to repeat the explanation for that. The whole thing, with the repeated protocol interpretation, might be different parts of the same program and these parts might very well stream to each other.
No. The general case where you have the security problem is when you have a corpus of text that is all UTF-8-encoded, and you try to do some processing on it before UTF-8-decoding it. That is a different case from the one we were discussing now, where only parts of the data are UTF-8-encoded.
The parts that might or might not be UTF-8 encoded can still very well contain structural (or "protocol") information. That structural info can very well be interpreted and reinterpreted on several levels, some before decoding, some afterwards. The whole multilevel thing might very well be streaming.
And to reiterate, I think this whole line of discussion isn't particularly important compared to the simple argument that the encoding called "utf-8" in the Charset module should comply with the UTF-8 standard.
> And to reiterate, I think this whole line of discussion isn't particularly important compared to the simple argument that the encoding called "utf-8" in the Charset module should comply with the UTF-8 standard.
As long as it correctly decodes text which complies with the UTF-8 standard (and always generates standard compliant output when encoding, of course), I don't see any particular problem with disregarding other parts of the UTF-8 standard. It's not like we need to go through some kind of UTF-8 certification or anything.
I see a particular problem with disregarding certain parts of the UTF-8 standard. The same one as the Unicode people themselves have noticed.
I am certain that it will lead to seemingly erratic errors when other applications communicate with Pike applications. The question is whether the tradeoff for the potentially added problems is worth it. I don't know, since I don't know how frequent illegal UTF-8 strings are. In any event there should be a #pike-goo to prevent old applications from suddenly starting to throw exceptions on previously accepted data.
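What I have in mind is that code which pins the old dialect keeps the old behaviour, along the lines of this (whether the decoder change would actually honour the directive is of course exactly what I'm asking for):

    #pike 7.4

    // Under the pinned compatibility version the idea would be that
    // utf8_to_string() keeps accepting the non-shortest forms it used to.
    string old_decode(string raw) { return utf8_to_string(raw); }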
I don't believe non-shortest forms are common at all, except in deliberate variants like the Java version discussed earlier. That's because it isn't easy to make an encoder that produces non-shortest forms through sheer sloppiness.
> /.../ In any event there should be a #pike-goo to prevent old applications from suddenly starting to throw exceptions on previously accepted data.
I don't think that's a good idea since I consider this a security related fix that potentially stops exploits against old code. That's also the reason why I've patched all pikes back to 7.4 to not accept non-shortest forms.
What perhaps could be added is compat goo for the Java special NUL encoding, to cope with that specific case. It's not a common delimiter in ASCII-based protocols. Still, it could conceivably be used as an exploit for C-based libraries, although I'd like to think that Pike isn't very susceptible to that.
> I don't think that's a good idea since I consider this a security related fix that potentially stops exploits against old code. That's also the reason why I've patched all pikes back to 7.4 to not accept non-shortest forms.
That is just not cool.
> What perhaps could be added is compat goo for the Java special NUL encoding, to cope with that specific case. It's not a common delimiter in ASCII-based protocols. Still, it could conceivably be used as an exploit for C-based libraries, although I'd like to think that Pike isn't very susceptible to that.
I'd like to see the non-shortest version accepted by default and only be filtered out if the I-might-decide-to-shoot-myself-in-the-foot-with-bad-string-handling-later flag is set.
Can you exemplify some of the situations where you actually encounter non-shortest forms, besides the Java encoding which has been discussed here? Since you feel so strongly about allowing them, I take it you get them frequently?
I, on the other hand, have the distinct impression that they don't occur except in exploits (except, again, the Java NUL trick). To me it's sort of like not fixing stack smashing bugs for fear of incompatibilities.
I don't, that I know of. The reason I feel strongly about it is that Pike is my toolbox that I use daily to solve real problems. I don't want to end up with an unsolvable problem one day because someone reinterpreted a standard to exclude everything potentially dangerous.
And no, I don't like non-executable stacks as a general solution either.
I use it too for real problems. I'm also very concerned about reliability and stability, since it's my job to support paying customers that have some fairly critical applications running on Pike. Yet I'm confident this change will solve more problems than it generates.
I don't consider unwillingness to move old code to new APIs a good reason for breaking old APIs. It's not that I don't see the benefit, but you'd have to convince me it's big enough.
I wouldn't call it breaking an API. It's rather not maintaining bug compatibility, something that we usually don't consider in general. (And besides, I'm quite certain much worse incompatibilities than this one slip by unnoticed.)
The reason I believe there's a real chance of security issues in this area is that a quite important property of UTF-8 doesn't hold. If something that is assumed to behave a certain way doesn't, there's a real chance of code that is written with that behavior in mind and therefore doesn't work right. And in this case the misbehavior obviously can have security effects.
I've also seen code that assumes this property of UTF-8 in the LDAP module. Afaics there's no exploitable vulnerability there, but there's no reason to believe that's the only instance.
But do you have a problem with it _because it's disregarding the standard_, or for a more rational reason? 13114448 seems to suggest the former. I don't think "the Unicode people themselves" noticed a particular problem with disregarding standards, but rather put something into the standard because they figured they saw a real problem. It's this problem we have been discussing, but whether it's in the standard or not is mostly a non-issue for me.
pike-devel@lists.lysator.liu.se