What do you mean they're mandatory in valid UTF-8? They can occur in valid UTF-8, but they don't have to.
Yes, you are right. It was a thought error on my part. It is UTF-9 which has this property. Dang.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
Yes. My suggestion is that the decode function takes a second argument with a charset name to override the charset heuristic. We could also allow something like "raw" as an alias for "iso-8859-1", to signify "unspecified 8-bit charset" (whenever that would be useful...).
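To make the suggestion concrete, here is a minimal sketch in Python (the thread is about Pike's Standards.URI, but the idea is language-independent). The function name `uri_decode`, the "raw" alias, and the fallback-to-iso-8859-1 behaviour are assumptions for illustration, not the actual API:

```python
from urllib.parse import unquote_to_bytes

def uri_decode(s, charset=None):
    """Percent-decode s, then charset-decode the resulting bytes.

    charset overrides the heuristic; "raw" is a hypothetical alias
    for iso-8859-1 ("unspecified 8-bit charset").
    """
    raw = unquote_to_bytes(s)
    if charset == "raw":
        charset = "iso-8859-1"
    if charset:
        return raw.decode(charset)
    # Heuristic: try UTF-8 first; fall back to iso-8859-1, which
    # maps every byte value and therefore never fails.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

With this shape, `uri_decode("%c3%a5")` gives "å" via the UTF-8 path, `uri_decode("%e5")` gives "å" via the fallback, and `uri_decode("%c3%a5", "raw")` gives the raw two-character "Ã¥".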
[...] If the Unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
Um, yes? It would be encoded as %c3%83%c2%a5, which would then be decoded as "Ã¥" by the decode function. That's pretty intact, no?
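The round trip can be checked with the Python stdlib (again just an illustration of the encoding arithmetic, not the Standards.URI API): the two characters Ã (U+00C3) and ¥ (U+00A5) are each two bytes in UTF-8, giving four percent escapes, and decoding reverses it exactly:

```python
from urllib.parse import quote, unquote

s = "\u00c3\u00a5"            # the literal two-character sequence "Ã¥"
enc = quote(s)                # percent-encodes the UTF-8 bytes
print(enc)                    # %C3%83%C2%A5
print(unquote(enc) == s)      # True: it survives intact
```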
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I don't see why you wouldn't mind special chars being encoded if you mind that wide chars are. As long as something is encoded, it will neither display nicely, nor be usable in any other context than URL manipulation.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI tries to stay out of the charset issue altogether, at least by default. It only does what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would be misinterpreted otherwise, and decoding nothing.
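A charset-agnostic encoder of that kind is small: escape only the US-ASCII characters that URI syntax would otherwise misinterpret (controls, space, the delimiter set), and pass everything else, including 8-bit chars, through untouched. A rough sketch, with the name `minimal_quote` and the exact unsafe set chosen for illustration:

```python
# Controls, space, and the characters RFC 2396 marks as delimiters
# or unwise; everything else is left alone.
UNSAFE = ({chr(c) for c in range(0x21)} | {chr(0x7f)}
          | set('<>"#%{}|\\^[]`'))

def minimal_quote(s):
    """Escape only chars that would break URI parsing."""
    return "".join("%%%02X" % ord(c) if c in UNSAFE else c
                   for c in s)
```

So `minimal_quote("a b")` gives "a%20b", while "å" comes back unchanged: no charset decision is made at all, by default.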
This only means that the issue is pushed somewhere else. It doesn't make it go away. A really conservative approach would be to not supply any Standards.URI at all, that way we can be absolutely sure it never does anything wrong. We can also be absolutely sure that we're not helping the users achieve anything.
I think the default behaviour should be to help the user as much as possible.
More encoding and decoding services should be optional. It could be in another class or perhaps enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
With the API we have now, fully decoded strings cannot be returned. So rather than having a property, I think we should have a decode function, to which the strings can be passed after the user code separates them on "/" or whatever URI syntax still remains in the string. (In retrospect, it would be better if the URI class actually parsed all the URI syntax, rather than returning something half parsed. That would mean path being array(string) instead of string. Other fields might also be affected, I haven't checked.)
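The split-then-decode order matters: an encoded "/" (%2F) must not be mistaken for a path separator, which is exactly why decoding has to happen per segment, after splitting. A small Python illustration (the path value is made up):

```python
from urllib.parse import unquote

path = "/a%2Fb/r%C3%A4ksm%C3%B6rg%C3%A5s"
# Split on the URI syntax first, then decode each segment.
segments = [unquote(seg) for seg in path.split("/")[1:]]
# %2F becomes "/" only after splitting, so "a/b" stays one segment,
# and the UTF-8 escapes decode to the intended wide chars.
print(segments)
```

Decoding the whole path first would have turned it into "/a/b/räksmörgås", silently merging "a/b" into two segments.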
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encode too much - by default they shouldn't touch 8-bit chars.
And we need a Standards.URI.decode.