What do you mean they're mandatory in valid UTF-8? They can occur in valid UTF-8, but they don't have to.
Yes, you are right. It was a thought error on my part. It is UTF-9 which has this property. Dang.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
Yes. My suggestion is that the decode function takes a second argument with a charset name to override the charset heuristic. We could also allow something like "raw" as an alias for "iso-8859-1", to signify "unspecified 8-bit charset" (whenever that would be useful...).
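To make the suggestion concrete, here is a minimal sketch in Python (the thread is about Pike's Standards.URI, but the idea is language-independent). The function name `uri_decode`, the "raw" alias, and the fallback-to-iso-8859-1 behaviour are assumptions for illustration, not the actual API:

```python
from urllib.parse import unquote_to_bytes

def uri_decode(s, charset=None):
    """Percent-decode s, then charset-decode the resulting bytes.

    charset overrides the heuristic; "raw" is a hypothetical alias
    for iso-8859-1 ("unspecified 8-bit charset").
    """
    raw = unquote_to_bytes(s)
    if charset == "raw":
        charset = "iso-8859-1"
    if charset:
        return raw.decode(charset)
    # Heuristic: try UTF-8 first; fall back to iso-8859-1, which
    # maps every byte value and therefore never fails.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

With this shape, `uri_decode("%c3%a5")` gives "å" via the UTF-8 path, `uri_decode("%e5")` gives "å" via the fallback, and `uri_decode("%c3%a5", "raw")` gives the raw two-character "Ã¥".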
[...] If the Unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
Um, yes? It would be encoded as %c3%83%c2%a5, which would then be decoded as "Ã¥" by the decode function. That's pretty intact, no?
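The round trip can be checked with the Python stdlib (again just an illustration of the encoding arithmetic, not the Standards.URI API): the two characters Ã (U+00C3) and ¥ (U+00A5) are each two bytes in UTF-8, giving four percent escapes, and decoding reverses it exactly:

```python
from urllib.parse import quote, unquote

s = "\u00c3\u00a5"            # the literal two-character sequence "Ã¥"
enc = quote(s)                # percent-encodes the UTF-8 bytes
print(enc)                    # %C3%83%C2%A5
print(unquote(enc) == s)      # True: it survives intact
```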
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I don't see why you wouldn't mind special chars being encoded if you mind that wide chars are. As long as something is encoded, it will neither display nicely, nor be usable in any other context than URL manipulation.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI tries to stay out of the charset issue altogether, at least by default. It only does what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would be misinterpreted otherwise, and decoding nothing.
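A charset-agnostic encoder of that kind is small: escape only the US-ASCII characters that URI syntax would otherwise misinterpret (controls, space, the delimiter set), and pass everything else, including 8-bit chars, through untouched. A rough sketch, with the name `minimal_quote` and the exact unsafe set chosen for illustration:

```python
# Controls, space, and the characters RFC 2396 marks as delimiters
# or unwise; everything else is left alone.
UNSAFE = ({chr(c) for c in range(0x21)} | {chr(0x7f)}
          | set('<>"#%{}|\\^[]`'))

def minimal_quote(s):
    """Escape only chars that would break URI parsing."""
    return "".join("%%%02X" % ord(c) if c in UNSAFE else c
                   for c in s)
```

So `minimal_quote("a b")` gives "a%20b", while "å" comes back unchanged: no charset decision is made at all, by default.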
This only means that the issue is pushed somewhere else. It doesn't make it go away. A really conservative approach would be to not supply any Standards.URI at all, that way we can be absolutely sure it never does anything wrong. We can also be absolutely sure that we're not helping the users achieve anything.
I think the default behaviour should be to help the user as much as possible.
More encoding and decoding services should be optional. It could be in another class or perhaps enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
With the API we have now, fully decoded strings cannot be returned. So rather than having a property, I think we should have a decode function, to which the strings can be passed after the user code separates them on "/" or whatever URI syntax still remains in the string. (In retrospect, it would be better if the URI class actually parsed all the URI syntax, rather than returning something half parsed. That would mean path being array(string) instead of string. Other fields might also be affected, I haven't checked.)
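The split-then-decode order matters: an encoded "/" (%2F) must not be mistaken for a path separator, which is exactly why decoding has to happen per segment, after splitting. A small Python illustration (the path value is made up):

```python
from urllib.parse import unquote

path = "/a%2Fb/r%C3%A4ksm%C3%B6rg%C3%A5s"
# Split on the URI syntax first, then decode each segment.
segments = [unquote(seg) for seg in path.split("/")[1:]]
# %2F becomes "/" only after splitting, so "a/b" stays one segment,
# and the UTF-8 escapes decode to the intended wide chars.
print(segments)
```

Decoding the whole path first would have turned it into "/a/b/räksmörgås", silently merging "a/b" into two segments.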
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encode too much - by default they shouldn't touch 8-bit chars.
And we need a Standards.URI.decode.