/.../ the fact that a printable string encoded as ISO-8859-1 never is valid UTF-8, and vice versa, due to the range 0x80-0x9f being mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in ISO-8859-1.
What do you mean they're mandatory in valid utf-8? They can occur in valid utf-8, but they don't have to. Take the utf-8 encoding of "å", for instance: 0xc3 0xa5 (i.e. the all too familiar "Ã¥") - neither byte is in the 0x80-0x9f range.
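For the record, it's trivial to check (string_to_utf8 is the ordinary builtin; the rest is just illustration):

  // The one-character wide string "å" (U+00E5)...
  string s = "\345";
  // ...encodes to the two bytes 0xc3 0xa5 - neither of which is in the
  // 0x80-0x9f range, and read as ISO-8859-1 they are the perfectly
  // printable "Ã¥".
  string bytes = string_to_utf8(s);
  write("%x %x\n", bytes[0], bytes[1]);   // prints "c3 a5"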
Granted, such odd sequences of characters practically never occur in other 8-bit character sets. So in practice it's fairly safe to just try utf-8 decode and fall back to "unspecified 8-bit charset" if it doesn't work.
It's however not completely safe, and another danger with that approach is that if there's a utf-8 encoding error somewhere (e.g. a value truncated in the middle of a utf-8 sequence), then suddenly nothing gets decoded at all and there's no error to tell you so.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
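In code the dwim approach would be roughly this - utf8_to_string and catch are the real builtins, but the function name and the strict flag are made up for the example:

  // Sketch: try utf-8 first, fall back to "unspecified 8-bit charset" if
  // the decode fails. Note that a single truncated sequence makes the
  // whole string fall back silently, which is the danger mentioned above.
  string dwim_decode(string raw, void|int(0..1) strict)
  {
    string decoded;
    if (catch { decoded = utf8_to_string(raw); }) {
      if (strict)
        error("Invalid UTF-8 in %O\n", raw);
      return raw;       // silently keep the raw 8-bit string
    }
    return decoded;
  }

Making strict (or the whole function) user-settable would give the overridable default.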
/.../ Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
That'd be rather clumsy. In this use case they wouldn't get encoded and they wouldn't have to be decoded. If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be), then it should still be intact on the other side.
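Just to make the round-trip point concrete, the behaviour being asked for is roughly this - whether today's Standards.URI accepts raw non-ASCII input like this is of course exactly what's being discussed:

  // Proposed behaviour: nothing outside US-ASCII is escaped or decoded,
  // so whatever the producer put there - "å", or even the unlikely
  // sequence "Ã¥" - reaches the consumer byte for byte.
  string iri = "http://example.org/blåbär/Ã¥";
  Standards.URI u = Standards.URI(iri);
  // Proposed: (string) u == iri.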
/.../
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters, and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much if characters are encoded in the input to said function, as long as they aren't in the output.
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI should stay out of the charset issue altogether, at least by default, and only do what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would otherwise be misinterpreted, and decoding nothing.
This way the user can afterwards, on the complete URI/IRI, choose to encode chars outside US-ASCII if it's going to be used somewhere where that's required.
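A sketch of what "only what it has to" could look like - the helper name and the exact set of chars are illustrative, not a spec:

  // Illustration: %-encode only the US-ASCII octets that would otherwise
  // be misinterpreted in a URI (controls, space, and a handful of
  // delimiters), pass everything at 0x80 and above through verbatim, and
  // never decode anything implicitly.
  string minimal_quote(string s)
  {
    multiset(int) unsafe = (< '%', '"', '<', '>', '\\', '^', '`', '{', '|', '}' >);
    string out = "";
    foreach (values(s), int c) {
      if (c < 0x80 && (c <= ' ' || c == 0x7f || unsafe[c]))
        out += sprintf("%%%02X", c);
      else
        out += sprintf("%c", c);   // 8-bit and wide chars are left as-is
    }
    return out;
  }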
More encoding and decoding services should be optional. They could live in another class or perhaps be enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encode too much - by default they shouldn't touch 8-bit chars.
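I.e. with the proposed default, something along the lines of the minimal_quote sketch above:

  // Proposed default: only genuinely unsafe ASCII gets escaped.
  write("%s\n", minimal_quote("/foo bar/blåbär"));
  // -> /foo%20bar/blåbär
  // Today the 8-bit chars would get %-encoded as well, which is the
  // "too much" referred to above.

Charset-aware encoding (utf-8, the dwim variant, or whatever) would then only happen when the user explicitly asks for it, e.g. via the optional "charset" property suggested above.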