Perhaps, but not if you start with a URI that isn't a transformed IRI. Or are you suggesting that the URI class should just try to decode it as an IRI and silently continue without the UTF-8 decode if that fails?
As far as I can see, there are 5 cases (assuming we start with a URI we got from somewhere, and not with an IRI):
1) There are no (encoded) non-ASCII characters involved.
2) The URI is an IRI with non-ASCII characters which has been mapped to a URI.
3) The URI has not been mapped from an IRI, but contains non-ASCII characters encoded as UTF-8 anyway.
4) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as ISO-8859-1.
5) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as something which is neither UTF-8 nor ISO-8859-1.
If we start with case 5, there is no way to decode that correctly (without additional context information), since we can't know what character encoding to use. The reasonable approach here would be to throw an error. However, this case may very well be indistinguishable from cases 2-4. So in order to guarantee an error here, we'd have to always give an error for non-ASCII characters. But that would be bad, because we should at least handle case 2 correctly, since 1 and 2 are the sane cases.
Cases 2 and 3 can be handled in the same way, so there is no need to distinguish between them. Case 4 can in practice be distinguished from case 3 (and 2) by the fact that printable text encoded as ISO-8859-1 is almost never valid UTF-8, and vice versa: a non-ASCII UTF-8 sequence requires a lead byte in the range 0xC2-0xF4 immediately followed by continuation bytes in 0x80-0xBF, a pattern that ordinary ISO-8859-1 text practically never produces (and the 0x80-0x9F part of that range consists of non-printable control codes in ISO-8859-1).
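To make that concrete, here is a minimal sketch of the detection step (Python and the sample string are my own assumptions for illustration; the point is just that a strict UTF-8 decode acts as the detector):

    # Percent-decoded bytes that came from ISO-8859-1 text generally fail a
    # strict UTF-8 decode, which is what makes the case-3/case-4 split workable.
    latin1_bytes = "på".encode("iso-8859-1")   # b'p\xe5'
    utf8_bytes = "på".encode("utf-8")          # b'p\xc3\xa5'

    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("ISO-8859-1 bytes rejected by the UTF-8 decoder")  # the usual outcome

    print(utf8_bytes.decode("utf-8"))          # 'på' -- cases 2 and 3 round-trip cleanly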
So I see two options:
A) Decode as UTF-8 when possible, and throw an error otherwise. This gives correct results for cases 1, 2, and 3, and throws an error for case 4. Case 5 would usually give an error, but might give an incorrect result in some rare cases.
B) Decode as UTF-8 when possible, and decode as ISO-8859-1 otherwise (this is identical to the approach you mention). This gives correct results for cases 1, 2, 3, and 4, but always gives an incorrect result for case 5.
It would be nice to allow the user to specify an encoding, so that case 5 could also be handled correctly, but if no such specification is given, I think the default behaviour should be either A or B, depending on the relative frequency of cases 4 and 5 in the real world. After all, the purpose of the standard class is to provide the user with a service, so some kind of best effort is in order here. If the user has information that would allow him to do a better job (which will usually not be the case, I predict), it's better that this information can be passed on to the standard code.
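As a sketch of what option B plus that override could look like (Python again, purely as illustration; decode_uri is a hypothetical helper, not an existing API in any library discussed here):

    from typing import Optional
    from urllib.parse import unquote_to_bytes

    def decode_uri(uri: str, encoding: Optional[str] = None) -> str:
        """Hypothetical helper illustrating option B plus a user-supplied encoding.

        If `encoding` is given, trust the caller (covers case 5). Otherwise try
        UTF-8 first (cases 1-3) and fall back to ISO-8859-1 (case 4). Option A
        would simply drop the fallback and let the UTF-8 error propagate.
        """
        raw = unquote_to_bytes(uri)           # undo the %XX escaping first
        if encoding is not None:
            return raw.decode(encoding)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("iso-8859-1")   # every byte maps to U+0000..U+00FF

With that, decode_uri("p%C3%A5") and decode_uri("p%E5") both come out as "på", while the explicit encoding argument is the escape hatch for case 5.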
I didn't say it was a performance issue either, rather one of functionality. An IRI can, for example, contain "ôôÌ" unencoded in a Unicode context, whereas a URI can't. When writing documents containing IRIs in a Unicode environment, it is of course nice to see and handle the real glyphs directly. Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
Whether the "picked apart" pieces contain wider chars or not seems irrelevant, since you need to decode them anyway (%25, %2f).
When used as I described above, the wider chars wouldn't be encoded to begin with.
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much whether characters are encoded in the input to that function, as long as they aren't in the output.
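A quick illustration of that (Python, with made-up strings): the decode call is the same whether the wide characters arrive escaped or raw; only escapes like %25 and %2F have to be undone in both cases:

    from urllib.parse import unquote

    # Hypothetical path segment containing 'å', a literal '/', and a literal '%'.
    as_uri = "b%C3%A5d%2F100%25"   # everything escaped, UTF-8 percent-encoding
    as_iri = "båd%2F100%25"        # 'å' left raw; '%' and '/' must still be escaped

    print(unquote(as_uri))         # 'båd/100%'
    print(unquote(as_iri))         # 'båd/100%' -- same output, same decode call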
The decoding should give you wide chars regardless of whether you start with an IRI or an IRI mapped into an IRI (see above).
I assume at least one of the "IRI" there should be "URI".
Indeed. And if you followed the suggestion to "see above", you can probably guess which one. :-)
Decoding a URI in general can't produce wide chars since it can't assume that the URI is a transformed IRI.
Neither can it assume that it's not. See the case study above.