Perhaps, but not if you start with a URI that isn't a transformed IRI. Or are you suggesting that the URI class should just try to decode it as an IRI and silently continue without the UTF-8 decode if that fails?
As far as I can see, there are 5 cases (assuming we start with a URI we got from somewhere, and not with an IRI):
1) There are no (encoded) non-ASCII characters involved.
2) The URI is an IRI with non-ASCII characters which has been mapped to a URI.
3) The URI has not been mapped from an IRI, but contains non-ASCII characters encoded as UTF-8 anyway.
4) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as ISO-8859-1.
5) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as something which is neither UTF-8 nor ISO-8859-1.
If we start with case 5, there is no way to decode that correctly (without additional context information), since we can't know what character encoding to use. The reasonable approach here would be to throw an error. However, this case may very well be indistinguishable from cases 2-4. So in order to guarantee an error here, we'd have to always give an error for non-ASCII characters. But that would be bad, because we should at least handle case 2 correctly, since 1 and 2 are the sane cases.
Cases 2 and 3 can be handled in the same way, so there is no need to distinguish between them. Case 4 can in practice be distinguished from case 3 (and 2) by the fact that printable text encoded as ISO-8859-1 is almost never valid UTF-8, and vice versa: a non-ASCII UTF-8 sequence requires a lead byte in the range 0xC2-0xF4 immediately followed by continuation bytes in 0x80-0xBF, a pattern that ordinary ISO-8859-1 text practically never produces (and the 0x80-0x9F part of that range consists of non-printable control codes in ISO-8859-1).
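To make that concrete, here is a minimal sketch of the detection step (Python and the sample string are my own assumptions for illustration; the point is just that a strict UTF-8 decode acts as the detector):

    # Percent-decoded bytes that came from ISO-8859-1 text generally fail a
    # strict UTF-8 decode, which is what makes the case-3/case-4 split workable.
    latin1_bytes = "på".encode("iso-8859-1")   # b'p\xe5'
    utf8_bytes = "på".encode("utf-8")          # b'p\xc3\xa5'

    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("ISO-8859-1 bytes rejected by the UTF-8 decoder")  # the usual outcome

    print(utf8_bytes.decode("utf-8"))          # 'på' -- cases 2 and 3 round-trip cleanly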
So I see two options:
A) Decode as UTF-8 when possible, and throw an error otherwise. This gives correct results for cases 1, 2, and 3, and throws an error for case 4. Case 5 would usually give an error, but might give an incorrect result in some rare cases.
B) Decode as UTF-8 when possible, and decode as ISO-8859-1 otherwise (this is identical to the approach you mention). This gives correct results for cases 1, 2, 3, and 4, but always gives an incorrect result for case 5.
It would be nice to allow the user to specify an encoding, so that case 5 could also be handled correctly, but if no such specification is given, I think the default behaviour should be either A or B, depending on the relative frequency of cases 4 and 5 in the real world. After all, the purpose of the standard class is to provide the user with a service, so some kind of best effort is in order here. If the user has information that would allow him to do a better job (which will usually not be the case, I predict), it's better that this information can be passed on to the standard code.
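As a sketch of what option B plus that override could look like (Python again, purely as illustration; decode_uri is a hypothetical helper, not an existing API in any library discussed here):

    from typing import Optional
    from urllib.parse import unquote_to_bytes

    def decode_uri(uri: str, encoding: Optional[str] = None) -> str:
        """Hypothetical helper illustrating option B plus a user-supplied encoding.

        If `encoding` is given, trust the caller (covers case 5). Otherwise try
        UTF-8 first (cases 1-3) and fall back to ISO-8859-1 (case 4). Option A
        would simply drop the fallback and let the UTF-8 error propagate.
        """
        raw = unquote_to_bytes(uri)           # undo the %XX escaping first
        if encoding is not None:
            return raw.decode(encoding)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("iso-8859-1")   # every byte maps to U+0000..U+00FF

With that, decode_uri("p%C3%A5") and decode_uri("p%E5") both come out as "på", while the explicit encoding argument is the escape hatch for case 5.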
I didn't say it was a performance issue either, rather one of functionality. An IRI can, for example, contain "ôôÌ" unencoded in a Unicode context, whereas a URI can't. When writing documents containing IRIs in a Unicode environment, it is of course nice to see and handle the real glyphs directly. Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
Whether the "picked apart" pieces contain wider chars or not seems irrelevant, since you need to decode them anyway (%25, %2f).
When used as I described above, the wider chars wouldn't be encoded to begin with.
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much whether characters are encoded in the input to that function, as long as they aren't in the output.
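A quick illustration of that (Python, with made-up strings): the decode call is the same whether the wide characters arrive escaped or raw; only escapes like %25 and %2F have to be undone in both cases:

    from urllib.parse import unquote

    # Hypothetical path segment containing 'å', a literal '/', and a literal '%'.
    as_uri = "b%C3%A5d%2F100%25"   # everything escaped, UTF-8 percent-encoding
    as_iri = "båd%2F100%25"        # 'å' left raw; '%' and '/' must still be escaped

    print(unquote(as_uri))         # 'båd/100%'
    print(unquote(as_iri))         # 'båd/100%' -- same output, same decode call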
The decoding should give you wide chars regardless of whether you start with an IRI or an IRI mapped into an IRI (see above).
I assume at least one of the "IRI" there should be "URI".
Indeed. And if you followed the suggestion to "see above", you can probably guess which one. :-)
Decoding a URI in general can't produce wide chars since it can't assume that the URI is a transformed IRI.
Neither can it assume that it's not. See the case study above.