/.../ the fact that a printable string encoded as ISO-8859-1 never is valid UTF-8, and vice versa, due to the range 0x80-0x9f being mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in ISO-8859-1.
What do you mean they're mandatory in valid utf-8? They can occur in valid utf-8, but they don't have to. Take the utf-8 encoding of "å", for instance: 0xc3 0xa5 (i.e. the all too familiar "Ã¥") - neither byte is in the 0x80-0x9f range.
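For the record, it's trivial to check (string_to_utf8 is the ordinary builtin; the rest is just illustration):

  // The one-character wide string "å" (U+00E5)...
  string s = "\345";
  // ...encodes to the two bytes 0xc3 0xa5 - neither of which is in the
  // 0x80-0x9f range, and read as ISO-8859-1 they are the perfectly
  // printable "Ã¥".
  string bytes = string_to_utf8(s);
  write("%x %x\n", bytes[0], bytes[1]);   // prints "c3 a5"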
Granted, such odd sequences of characters practically never occur in other 8-bit character sets. So in practice it's fairly safe to just try utf-8 decode and fall back to "unspecified 8-bit charset" if it doesn't work.
It's however not completely safe, and another danger with that approach is that if there's a utf-8 encoding error somewhere (e.g. a value truncated in the middle of a utf-8 sequence), then suddenly nothing gets decoded at all and there's no error to tell you so.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
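In code the dwim approach would be roughly this - utf8_to_string and catch are the real builtins, but the function name and the strict flag are made up for the example:

  // Sketch: try utf-8 first, fall back to "unspecified 8-bit charset" if
  // the decode fails. Note that a single truncated sequence makes the
  // whole string fall back silently, which is the danger mentioned above.
  string dwim_decode(string raw, void|int(0..1) strict)
  {
    string decoded;
    if (catch { decoded = utf8_to_string(raw); }) {
      if (strict)
        error("Invalid UTF-8 in %O\n", raw);
      return raw;       // silently keep the raw 8-bit string
    }
    return decoded;
  }

Making strict (or the whole function) user-settable would give the overridable default.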
/.../ Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
That'd be rather clumsy. In this use case they wouldn't get encoded and they wouldn't have to be decoded. If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be), then it should still be intact on the other side.
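Just to make the round-trip point concrete, the behaviour being asked for is roughly this - whether today's Standards.URI accepts raw non-ASCII input like this is of course exactly what's being discussed:

  // Proposed behaviour: nothing outside US-ASCII is escaped or decoded,
  // so whatever the producer put there - "å", or even the unlikely
  // sequence "Ã¥" - reaches the consumer byte for byte.
  string iri = "http://example.org/blåbär/Ã¥";
  Standards.URI u = Standards.URI(iri);
  // Proposed: (string) u == iri.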
/.../
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters, and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much if characters are encoded in the input to said function, as long as they aren't in the output.
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI should stay out of the charset issue altogether, at least by default, and only do what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would otherwise be misinterpreted, and decoding nothing.
This way the user can afterwards, on the complete URI/IRI, choose to encode chars outside US-ASCII if it's going to be used somewhere where that's required.
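A sketch of what "only what it has to" could look like - the helper name and the exact set of chars are illustrative, not a spec:

  // Illustration: %-encode only the US-ASCII octets that would otherwise
  // be misinterpreted in a URI (controls, space, and a handful of
  // delimiters), pass everything at 0x80 and above through verbatim, and
  // never decode anything implicitly.
  string minimal_quote(string s)
  {
    multiset(int) unsafe = (< '%', '"', '<', '>', '\\', '^', '`', '{', '|', '}' >);
    string out = "";
    foreach (values(s), int c) {
      if (c < 0x80 && (c <= ' ' || c == 0x7f || unsafe[c]))
        out += sprintf("%%%02X", c);
      else
        out += sprintf("%c", c);   // 8-bit and wide chars are left as-is
    }
    return out;
  }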
More encoding and decoding services should be optional. They could live in another class or perhaps be enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encode too much - by default they shouldn't touch 8-bit chars.
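I.e. with the proposed default, something along the lines of the minimal_quote sketch above:

  // Proposed default: only genuinely unsafe ASCII gets escaped.
  write("%s\n", minimal_quote("/foo bar/blåbär"));
  // -> /foo%20bar/blåbär
  // Today the 8-bit chars would get %-encoded as well, which is the
  // "too much" referred to above.

Charset-aware encoding (utf-8, the dwim variant, or whatever) would then only happen when the user explicitly asks for it, e.g. via the optional "charset" property suggested above.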