I was just notified (crunch [bug 4560]) that Protocols.HTTP.http_encode_string doesn't work right for chars wider than 7 bits:
According to RFCs 3986 (URI) and 3987 (IRI), chars should be utf-8 encoded and then %XX encoded. http_encode_string instead leaves 8-bit chars unencoded and uses that strange %uXXXX encoding for wider chars, a form that as far as I've been able to tell has no grounds in any standard. (Must say I'm curious where it comes from. A comment says it's some kind of Safari encoding. My limited googling suggests that Safari, at least nowadays, uses the RFC method.)
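For concreteness, the RFC behaviour can be sketched in Python (an illustration only; `rfc3986_encode` is an invented name, not the Pike API): the string is utf-8 encoded first, and every resulting byte outside the unreserved set is then %XX-escaped.

```python
def rfc3986_encode(s, safe="-._~"):
    # Sketch of RFC 3987 -> RFC 3986 encoding: utf-8 encode first,
    # then percent-encode each byte outside the unreserved set
    # (ASCII letters, digits, and "-._~").
    out = []
    for byte in s.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and (ch.isalnum() or ch in safe):
            out.append(ch)
        else:
            out.append("%%%02X" % byte)
    return "".join(out)

print(rfc3986_encode("råka"))  # "å" becomes %C3%A5, never %uXXXX
```

The point of contention above is exactly this last step: http_encode_string emits %uE5 instead of the two utf-8 bytes %C3%A5.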
The corresponding functions in Roxen have been corrected since 4.0. Encoding/decoding functions are always hazardous to change, so it's perhaps not an ideal time to do it right now. Otoh it would be rather nice to have correctly working functions in Pike instead of only in Roxen. So what do you say about changing it now?
Btw, maybe Standards.URI()->http_encode should be fixed at the same time? It doesn't seem to encode wide characters at all, and encodes 8-bit characters as %XX (iso-8859-1).
Also, is there a http_decode_string() function somewhere? Standards.URI()->path et al. seem to return the string with escapes still in it. Is that the correct behaviour?
There is a _Roxen.http_decode_string which decodes the %XX escapes themselves (along with those peculiar %uXXXX) but it doesn't do the subsequent utf-8 decoding. I was planning on making a Protocols.HTTP.http_decode() (losing the superfluous "_string" suffix at the same time) which wraps both together.
It's not entirely safe to assume that any %XX-encoded string is utf-8-encoded underneath however, as the whole elaborate "magic_roxen_automatic_charset_variable" system in Roxen shows (although this is getting better since nonconforming browsers are starting to get rare). Still, I think Pike modules should allow the user to choose a different interpretation.
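The planned two-step decode might look like this, sketched in Python with invented names (the real function would live in Protocols.HTTP): first undo the %XX escapes to get raw bytes, then decode those bytes with a caller-chosen charset, defaulting to utf-8.

```python
import re

def http_decode(s, charset="utf-8"):
    # Hypothetical counterpart to the proposed Protocols.HTTP.http_decode:
    # step 1 resolves the %XX escapes into raw bytes (the escaped string
    # itself is plain ASCII), step 2 applies the chosen charset.
    raw = re.sub(b"%([0-9A-Fa-f]{2})",
                 lambda m: bytes([int(m.group(1), 16)]),
                 s.encode("ascii"))
    return raw.decode(charset)

print(http_decode("r%C3%A5ka"))             # utf-8 interpretation
print(http_decode("r%E5ka", "iso-8859-1"))  # caller overrides the charset
```

Letting the user pass a charset is what allows a different interpretation than utf-8, as argued above.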
As for Standards.URI.path, it wouldn't be safe to decode all %XX escapes there since the caller then wouldn't be able to tell a quoted "/" inside a path segment from a path segment separator.
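The ambiguity is easy to demonstrate (Python, using the standard urllib for illustration): once every escape is decoded, two different paths collapse into the same string.

```python
from urllib.parse import unquote

# "a%2Fb" is one path segment containing a slash; "a/b" is two segments.
# After a full decode the two are indistinguishable:
print(unquote("a%2Fb"), unquote("a/b"))  # both print "a/b"
```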
As for your first question, Standards.URI.http_encode appears to be correct since only 7-bit chars are allowed in URIs (it does however also encode some 8-bit chars that it really doesn't have to). To follow the standards accurately, we should rather add a Standards.IRI (see RFC 3987) which handles wider chars and transformation to/from URIs.
Same reasoning can be applied to Protocols.HTTP, btw: The http scheme is only defined for URIs and hence simply can't handle chars wider than 7 bits. But in that case it's practical to implicitly "switch" to IRI when wider chars are detected and automatically do the transformation to URI.
Is there actually any benefit to having different classes for URI and IRI? It seems to me it just increases the possibility of selecting the wrong one.
What I mean is, if Standards.URI->create() is passed an IRI, can't it just convert that IRI to the corresponding URI and initialize the object with that? Shouldn't that be sufficient to handle IRIs as well? Why would we need a Standards.IRI?
Well, one could argue that it'd help people pick the right one and realize that they actually aren't using URIs anymore if they go outside US-ASCII, which probably is a widespread misconception.
But there would also be practical differences:
o An IRI class can decode the utf-8 sequences, which an URI class can't. (More precisely, it must do this as part of the transformation from an URI.)
o An IRI class doesn't necessarily have to do the transformation to URI since an IRI can contain wide chars in unencoded form. I.e. it should be able to put together and pick apart the IRI syntax with 8 bit and wider chars on both sides.
o As for the encoding side, extending the URI class to automatically do an IRI-to-URI conversion for wider chars is safe from a standards perspective (i.e. it wouldn't break the URI standard). But in practice it wouldn't be strictly compatible since Standards.URI currently treats 8-bit chars differently.
Last argument applies to the proposed change to Protocols.HTTP.http_encode_string too, btw.
Well, one could argue that it'd help people pick the right one and
How so? If there is only one, then that one is always the right one.
realize that they actually aren't using URIs anymore if they go outside US-ASCII, which probably is a widespread misconception.
Since IRIs can be mapped to URIs, they can be made to actually use URIs without having to realize it.
But there would also be practical differences:
o An IRI class can decode the utf-8 sequences, which an URI class can't. (More precisely, it must do this as part of the transformation from an URI.)
When must one transform from an URI then? Using the URI representation seems more powerful since it can represent both URIs and IRIs.
Of course, having a function to decode the utf-8 sequences is something we want, but this should be possible (and done in the same way) regardless of whether you start with an IRI or an IRI mapped into an URI, IMO.
o An IRI class doesn't necessarily have to do the transformation to URI since an IRI can contain wide chars in unencoded form. I.e. it should be able to put together and pick apart the IRI syntax with 8 bit and wider chars on both sides.
No, but I don't think the performance issue warrants a confusing split in the namespace. Whether the "picked apart" pieces contain wider chars or not seems irrelevant since you need to decode it anyway (%25, %2f). The decoding should give you wide chars regardless of whether you start with an IRI or an IRI mapped into an IRI (see above).
o As for the encoding side, extending the URI class to automatically do an IRI-to-URI conversion for wider chars is safe from a standards perspective (i.e. it wouldn't break the URI standard). But in practice it wouldn't be strictly compatible since Standards.URI currently treats 8-bit chars differently.
Yes, but that is a bug, AFAICT, just as http_encode_string() is currently bugged. The behaviour we'd be removing is wrong, from a standards point of view, so removing it from the Standards module seems the right thing to do.
How so? If there is only one, then that one is always the right one.
It's not quite so simple when communicating with the outside world which doesn't unify the two concepts.
The IETF chose to make IRI a separate standard instead of extending URI. They've obviously pondered that approach at length, so I guess they did it with good reason.
realize that they actually aren't using URIs anymore if they go outside US-ASCII, which probably is a widespread misconception.
Since IRIs can be mapped to URIs, they can be made to actually use URIs without having to realize it.
The receiving side might not do the same transformation back. E.g. when the URI is passed as an url over http there is no obligation - not even in the latest standards - to do the reverse URI-to-IRI transformation on the url after receiving the request. This makes it good to be aware of what is happening, so one can judge better how the receiver might (mis)behave.
When must one transform from an URI then?
Huh? To process it, of course. E.g. unicode data sent in a web form, where the de-facto behavior of modern browsers is to do an IRI-to-URI transformation first. It'd be nice to have that decoding built into the class.
Using the URI representation seems more powerful since it can represent both URIs and IRIs.
The problem is that a URI can't fully represent an IRI. It can only contain a (transformed) IRI, just like an octet string can contain a URI.
Of course, having a function to decode the utf-8 sequences is something we want, but this should be possible (and done in the same way) regardless of whether you start with an IRI or an IRI mapped into an URI, IMO.
Perhaps, but not if you start with an URI that isn't a transformed IRI. Or are you suggesting that the URI class should just try to decode it as an IRI and silently continue without the utf-8 decode if that fails?
/.../ I don't think the performance issue warrants a confusing split in the namespace.
I didn't say it was a performance issue either, rather one of functionality. An IRI can e.g. contain "åäö" in a unicode context without any encoding whatsoever, whereas a URI can't. When writing documents containing IRIs in a unicode environment it is of course nice to see and handle the real glyphs directly. Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
Whether the "picked apart" pieces contain wider chars or not seems irrelevant since you need to decode it anyway (%25, %2f).
When used as I described above, the wider chars wouldn't be encoded to begin with.
But besides, more functionality to relieve the user of decoding %XX escapes is in order.
The decoding should give you wide chars regardless of whether you start with an IRI or an IRI mapped into an IRI (see above).
I assume at least one of the "IRI" there should be "URI". Decoding a URI in general can't produce wide chars since it can't assume that the URI is a transformed IRI.
Footnote: Now my pike discussion quota is used up for at least today.
Perhaps, but not if you start with an URI that isn't a transformed IRI. Or are you suggesting that the URI class should just try to decode it as an IRI and silently continue without the utf-8 decode if that fails?
As far as I can see, there are 5 cases (assuming we start with an URI we got from somewhere, and not with an IRI):
1) There are no (encoded) non-ASCII characters involved
2) The URI is an IRI with non-ASCII characters which has been mapped to an URI
3) The URI has not been mapped from an IRI, but contains non-ASCII characters encoded as UTF-8 anyway
4) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as ISO-8859-1
5) The URI has not been mapped from an IRI, and contains non-ASCII characters encoded as something which is neither UTF-8 nor ISO-8859-1
If we start with case 5, there is no way to decode that correctly (without additional context information), since we can't know what character encoding to use. The reasonable approach here would be to throw an error. However, this case may very well be indistinguishable from cases 2-4. So in order to guarantee an error here, we'd have to always give an error for non-ASCII characters. But that would be bad, because we should at least handle case 2 correctly, since 1 and 2 are the sane cases.
Case 2 and 3 can be handled in the same way, so there is no need to distinguish between them. Case 4 can be distinguished from case 3 (and 2) by the fact that a printable string encoded as ISO-8859-1 never is valid UTF-8, and vice versa, due to the range 0x80-0x9f being mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in ISO-8859-1.
So I see two options:
A) Decode as UTF-8 when possible, and throw an error otherwise. This gives correct results for case 1, 2, and 3, and throws an error in case 4. Case 5 would usually give an error, but might give an incorrect result in some rare cases.
B) Decode as UTF-8 when possible, and decode as ISO-8859-1 otherwise (this is identical to the approach you mention). This gives correct results for case 1, 2, 3, and 4, but always gives an incorrect result for case 5.
It would be nice to allow the user to specify an encoding, so that case 5 could also be handled correctly, but if no such specification is given I think the default behaviour should be either A or B, depending on the relative frequency of case 4 and case 5 in the real world. After all, the purpose of the standard class is to provide the user with a service, so some kind of best effort is in order here. If the user has information that would allow him to do a better job (which will usually not be the case, I predict), it's better that this information is provided to the standard code.
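Option B, with the suggested charset override covering case 5, might look like this (a Python sketch with hypothetical names, not the Pike API):

```python
def decode_component(raw_bytes, charset=None):
    # Hypothetical heuristic: honour an explicit charset if given (case 5),
    # otherwise try utf-8 (cases 1-3) and fall back to iso-8859-1 (case 4).
    if charset is not None:
        return raw_bytes.decode(charset)
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1")

print(decode_component(b"r\xc3\xa5ka"))  # valid utf-8, decoded as such
print(decode_component(b"r\xe5ka"))      # not utf-8, falls back to latin-1
```

Both calls print "råka"; the fallback never throws, which is exactly the trade-off between options A and B discussed above.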
I didn't say it was a performance issue either, rather one of functionality. An IRI can e.g. contain "åäö" in a unicode context without any encoding whatsoever, whereas a URI can't. When writing documents containing IRIs in a unicode environment it is of course nice to see and handle the real glyphs directly. Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
Whether the "picked apart" pieces contain wider chars or not seems irrelevant since you need to decode it anyway (%25, %2f).
When used as I described above, the wider chars wouldn't be encoded to begin with.
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters, and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much if characters are encoded in the input to said function, as long as they aren't in the output.
The decoding should give you wide chars regardless of whether you start with an IRI or an IRI mapped into an IRI (see above).
I assume at least one of the "IRI" there should be "URI".
Indeed. And if you followed the suggestion to "see above", you can probably guess which one. :-)
Decoding a URI in general can't produce wide chars since it can't assume that the URI is a transformed IRI.
Neither can it assume that it's not. See the case study above.
/.../ the fact that a printable string encoded as ISO-8859-1 never is valid UTF-8, and vice versa, due to the range 0x80-0x9f being mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in ISO-8859-1.
What do you mean they're mandatory in valid utf-8? They can occur in valid utf-8, but they don't have to. Take the utf-8 encoding of "å", for instance: 0xc3 0xa5 (i.e. the all too familiar "Ã¥") - no byte in the 0x80-0x9f range there.
Granted, such odd sequences of characters practically never occur in other 8-bit character sets. So in practice it's fairly safe to just try utf-8 decode and fall back to "unspecified 8-bit charset" if it doesn't work.
It's however not completely safe, and another danger with that approach is that if there is a utf-8 encoding error somewhere (e.g. a variable truncated in the middle of a utf-8 sequence) then suddenly nothing gets decoded and there's no error.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
/.../ Hence the class should be able to both produce and parse IRIs without escaping the non-US-ASCII chars.
It can contain the characters unencoded, but it can also contain them encoded. So in order to see nice characters, you should always decode.
That'd be rather clumsy. In this use case they wouldn't get encoded and they wouldn't have to be decoded. If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
/.../
Did you actually read the sentence you commented on here? I said that you need to decode "%" and "/", which are not wider characters, and which will be encoded even in an IRI. Since you need to call a decode function, it doesn't matter much if characters are encoded in the input to said function, as long as they aren't in the output.
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI tries to stay out of the charset issue altogether, at least by default. It only does what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would be misinterpreted otherwise, and decoding nothing.
This way the user can afterwards, on the complete URI/IRI, choose to encode chars outside US-ASCII if it's going to be used somewhere where that's required.
More encoding and decoding services should be optional. It could be in another class or perhaps enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encodes too much - by default they shouldn't touch 8-bit chars.
What do you mean they're mandatory in valid utf-8? They can occur in valid utf-8, but they don't have to.
Yes, you are right. It was a thought error on my part. It is UTF-9 which has this property. Dang.
Anyway, I agree on this point: Most of the time - when the URI comes from the outside - it's probably a good idea to just try utf-8 decoding and silently ignore errors, but not all the time. I.e. it could be a default that the user may override.
Yes. My suggestion is that the decode function takes a second argument with a charset name to override the charset heuristic. We could also allow something like "raw" as an alias for "iso-8859-1", to signify "unspecified 8-bit charset" (whenever that would be useful...).
[...] If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
Um, yes? It would be encoded as %c3%83%c2%a5, which would then be decoded as "Ã¥" by the decode function. That's pretty intact, no?
Yes I did read it. Perhaps you've missed the point, namely that I could very well be able to use it without decoding afterwards at all, as long as the wider chars are kept intact and I don't mind that the special chars are kept encoded.
I don't see why you wouldn't mind special chars being encoded if you mind that wide chars are. As long as something is encoded, it will neither display nicely, nor be usable in any other context than URL manipulation.
I think there's merit to jhs' reasoning in 16642703, namely that Standards.URI tries to stay out of the charset issue altogether, at least by default. It only does what it has to do to parse and format a URI. That means encoding only the US-ASCII chars that would be misinterpreted otherwise, and decoding nothing.
This only means that the issue is pushed somewhere else. It doesn't make it go away. A really conservative approach would be to not supply any Standards.URI at all; that way we can be absolutely sure it never does anything wrong. We can also be absolutely sure that we're not helping the users achieve anything.
I think the default behaviour should be to help the user as much as possible.
More encoding and decoding services should be optional. It could be in another class or perhaps enabled by an optional "charset" property. That charset property could also take a special value for the "dwim try-utf-8" approach discussed earlier.
With the API we have now, fully decoded strings cannot be returned. So rather than having a property, I think we should have a decode function, to which the strings can be passed after the user code separates them on "/" or whatever URI syntax still remains in the string. (In retrospect, it would be better if the URI class actually parsed all the URI syntax, rather than returning something half parsed. That would mean path being array(string) instead of string. Other fields might also be affected, I haven't checked.)
So to sum up, with this reasoning Standards.URI.http_encode and Standards.URI.quote currently encodes too much - by default they shouldn't touch 8-bit chars.
And we need a Standards.URI.decode.
[...] If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
Um, yes? It would be encoded as %c3%83%c2%a5, which would then be decoded as "Ã¥" by the decode function. That's pretty intact, no?
No. With "the other side" I meant in the formatted URI, not when it has been picked apart into its components again by another object. I.e. something like this:
object o = Standards.URI("http://x.com/");
o->path = "recept/räksmörgås.html";
(string) o;
Result: "http://x.com/recept/r%C3%A4ksm%C3%B6rg%C3%A5s.html"
This is a perfectly acceptable IRI that can be put into an iso-8859-1 document. Applies when the URI/IRI is parsed too, of course. That's the reason it can be useful to skip the encoding of chars outside US-ASCII.
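The use case can be illustrated in Python (hypothetical `format_iri_path`, not an existing API): only "%" and "/" inside segments are escaped, the only chars that would otherwise be misparsed, so the wide characters survive into the formatted IRI.

```python
def format_iri_path(segments):
    # Hypothetical IRI-friendly formatter: escape only "%" and "/" inside
    # each segment; leave wide characters untouched.
    def esc(seg):
        return seg.replace("%", "%25").replace("/", "%2F")
    return "/" + "/".join(esc(s) for s in segments)

print("http://x.com" + format_iri_path(["recept", "räksmörgås.html"]))
# the wide chars stay as real glyphs in the formatted IRI
```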
/.../ So rather than having a property, I think we should have a decode function, to which the strings can be passed after the user code separates them on "/" or whatever URI syntax still remains in the string.
Sure, why not? Maybe it could take a charset too to know how to handle the 8-bit chars. If the extra encoding gets likewise optional, it both gets more symmetric and works in the use case I've been trying to describe.
(In retrospect, it would be better if the URI class actually parsed all the URI syntax, rather than returning something half parsed. That would mean path being array(string) instead of string. /.../
I'm not so sure; a path on array form gets unbearably cumbersome to handle compared to the standard string form. An alternative is to only decode as much as possible, i.e. leave only %2F (for "/") and %25 (for "%"). That's a consistent encoding too that can be decoded the same way after path splitting, if the user wants to. It's a bit unfortunate that the "%" chars have to be left encoded too, though.
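The partial decode described here could be sketched like so (Python, illustrative only; note that charset decoding of the resulting bytes would still be a separate step): every escape is resolved except %2F and %25, which stay encoded so splitting on "/" afterwards remains unambiguous.

```python
import re

def partial_decode(s):
    # Decode all %XX escapes except %2F ("/") and %25 ("%"), so the path
    # can still be split on "/" without ambiguity afterwards.
    def sub(m):
        byte = int(m.group(1), 16)
        if byte in (0x2F, 0x25):
            return m.group(0).upper()  # keep the escape, normalised
        return chr(byte)
    return re.sub("%([0-9A-Fa-f]{2})", sub, s)

print(partial_decode("a%2fb/c%20d"))  # "a%2Fb/c d"
```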
On Sun, Jul 20, 2008 at 10:50:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
[...] If the unicode sequence "Ã¥" does happen to occur (as uncommon as it might be) then it should still be intact on the other side.
Um, yes? It would be encoded as %c3%83%c2%a5, which would then be decoded as "Ã¥" by the decode function. That's pretty intact, no?
No. With "the other side"
the "other side" of this conversation is missing again in the exported list...
greetings, martin.
/.../ So rather than having a property, I think we should have a decode function, to which the strings can be passed after the user code separates them on "/" or whatever URI syntax still remains in the string.
Sure, why not? Maybe it could take a charset too to know how to handle the 8-bit chars.
Yes, that was exactly what I was suggesting higher up in the text. :-)
If the extra encoding gets likewise optional, it both gets more symmetric and works in the use case I've been trying to describe.
True, right now no characters are encoded by the constructor, so it would indeed be more symmetrical to add this functionality to a function like quote().
By the way, does anyone know why 7.7 has two encode functions (http_encode and quote) whereas 7.6 only has one (quote)?
(In retrospect, it would be better if the URI class actually parsed all the URI syntax, rather than returning something half parsed. That would mean path being array(string) instead of string. /.../
I'm not so sure; a path on array form gets unbearably cumbersome to handle compared to the standard string form. An alternative is to only decode as much as possible, i.e. leave only %2F (for "/") and %25 (for "%"). That's a consistent encoding too that can be decoded the same way after path splitting, if the user wants to. It's a bit unfortunate that the "%" chars have to be left encoded too, though.
The thing is that decoding is not a question of whether the user wants to decode. The user has to decode to get something useful. Unless the class does it for him, of course. I don't see that a partial decode would be particularly useful to anyone.
True, right now no characters are encoded by the constructor, so it would indeed be more symmetrical to add this functionality to a function like quote().
And maybe also a Standards.URI.encoded_uri(void|string charset) which returns an encoded URI directly, to make it a little more convenient.
/.../ I don't see that a partial decode would be particularly useful to anyone.
The thing is that when it comes to "/" inside path segments, at least 95% of all users don't want to bother with them at all, because there's no good way to handle them later on either. A partial decoding then does what the user wants for the input that the user wants to handle, without ambiguity, and the still-encoded slashes might very well be easier to deal with in higher layers (they can e.g. be stored in a filename in an ordinary file system). But a problem is that this approach also affects "%" which is a more sane char in ordinary paths.
Another alternative is to provide a function that throws an error if an encoded "/" is encountered. Normally that'd allow the user to simply ignore the problem.
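That alternative is a near one-liner (Python sketch with hypothetical names):

```python
from urllib.parse import unquote

def split_path_strict(path):
    # Hypothetical splitter: refuse paths where an encoded "/" would be
    # conflated with a real separator after decoding.
    if "%2f" in path.lower():
        raise ValueError("encoded '/' in path segment")
    return [unquote(seg) for seg in path.split("/")]

print(split_path_strict("recept/r%C3%A4ka"))  # decodes normally
```

Callers that never see quoted slashes can thus ignore the problem entirely, and get a loud error in the rare case it does occur.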
And maybe also a Standards.URI.encoded_uri(void|string charset) which returns an encoded URI directly, to make it a little more convenient.
I take it that you mean that it would _only_ encode non-ascii characters then? Because the way the API looks now, we have no way of knowing which characters are meant to be metacharacters unless we require that all metacharacters are already encoded in the input. I'm not entirely happy with the suggested name; it should be clearer that it merely takes a partially encoded URL and encodes it some more (and exactly what gets encoded).
The thing is that when it comes to "/" inside path segments, at least 95% of all users don't want to bother with them at all, because there's no good way to handle them later on either. A partial decoding then does what the user wants for the input that the user wants to handle, without ambiguity, and the still-encoded slashes might very well be easier to deal with in higher layers (they can e.g. be stored in a filename in an ordinary file system). But a problem is that this approach also affects "%" which is a more sane char in ordinary paths.
Apart from the fact that you miss out on "%", x->path*"/" isn't all that inconvenient IMO. And it would allow you to use x->path*"\" on NT if you really want to. :)
I take it that you mean that it would _only_ encode non-ascii characters then? Because the way the API looks now, we have no way of knowing which characters are meant to be metacharacters unless we require that all metacharacters are already encoded in the input. /.../
My intention was that it would return a thoroughly and correctly encoded url. That means that it would encode characters that are neither in the reserved set (i.e. what you call metacharacters) nor the unreserved set (i.e. US-ASCII letters, digits and a few other chars).
You're right that the current implementation apparently assumes that any reserved chars occurring in the component variables should retain their metameaning (e.g. "/" in path). This should be stated more clearly to avoid confusion, but it's not necessarily a problem: It'd be easy to add more functions to build the components from parts where every char is taken as data.
To take the path as an example, one could create a function build_path:
void build_path (array(string) path_segments)
{
  path = map (path_segments, encode_reserved) * "/";
}
where encode_reserved only encodes the reserved chars and "%". Then the encoded_uri would do the rest of the job. E.g:
object uri = Standards.URI("http://foo.com");
uri->build_path (({"odd/path%", "räksmörgås.html"}));
uri->path;
Result: "odd%2Fpath%25/räksmörgås.html"
uri->encoded_uri ("iso-8859-1");
Result: "http://foo.com/odd%2Fpath%25/r%C3%A4ksm%C3%B6rg%C3%A5s.html"
It'd be neat to use the getters and setters for this, so that we get a virtual variable called "split_path" or something.
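Under the assumptions in the proposal above (encode_reserved escapes only "%" and the RFC 3986 reserved set; encoded_uri then utf-8-encodes and escapes whatever non-ASCII remains, charset handling omitted), the example can be reproduced in Python for illustration:

```python
RESERVED = ":/?#[]@!$&'()*+,;="

def encode_reserved(seg):
    # Escape "%" first, then each RFC 3986 reserved character.
    seg = seg.replace("%", "%25")
    for ch in RESERVED:
        seg = seg.replace(ch, "%%%02X" % ord(ch))
    return seg

def build_path(segments):
    # Sketch of the proposed Standards.URI.build_path.
    return "/".join(encode_reserved(s) for s in segments)

def encoded_uri(iri):
    # Sketch of the proposed encoded_uri: utf-8-encode and %-escape the
    # non-ASCII chars; existing escapes and reserved chars stay as-is.
    return "".join(c if ord(c) < 0x80 else
                   "".join("%%%02X" % b for b in c.encode("utf-8"))
                   for c in iri)

path = build_path(["odd/path%", "räksmörgås.html"])
print(path)
print(encoded_uri("http://foo.com/" + path))
```

This reproduces both results shown above: the intermediate path keeps the wide chars as glyphs, and the final URI is fully 7-bit.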
Apart from the fact that you miss out on "%", x->path*"/" isn't all that inconvenient IMO. And it would allow you to use x->path*"\" on NT if you really want to. :)
I don't understand what you mean by missing out on "%" there. If you're suggesting that the user should simply join a fully decoded and split path using path*"/" then the only effect is that the user in his/her own code reintroduces the ambiguity we're trying to avoid. That's not a solution.
Btw, after reading RFC 3986 section 2.2 more carefully, it's clear that no reserved character can be decoded in the path component (or in any other component for that matter), since even if some reserved chars have no metameaning for a component in the standard, they can still have scheme-specific or implementation-specific metameaning if they occur literally.
I.e. even a function that returns the path segments in an array can't decode the reserved chars, unless it assumes that the implementation - i.e. the caller - doesn't differentiate between the meta- and data meaning of those chars. That'd in most cases be a user friendly assumption, but still one that should be stated.
void build_path (array(string) path_segments)
{
  path = map (path_segments, encode_reserved) * "/";
}
where encode_reserved only encodes the reserved chars and "%". Then the encoded_uri would do the rest of the job. E.g:
But why would this be better than
path = map (path_segments, encode) * "/";
where encode does all the job?
I don't understand what you mean by missing out on "%" there.
I mean that if a user uses a path with encoded "%"s in it without decoding it, he will not get the desired result, i.e. he misses out.
If you're suggesting that the user should simply join a fully decoded and split path using path*"/" then the only effect is that the user in his/her own code reintroduces the ambiguity we're trying to avoid. That's not a solution.
Well, it depends on the context if this is the right thing to do or not, of course. If the encoded and unencoded /:s actually need to be handled differently in the application, then the user needs to do something anyway. If that something is simply to complain, then
Array.sum(map(path, has_value, "/"));
would do as a simple test. Depending on the situation there are other characters than "/" that you might want to treat specially. "\" for example, would introduce exactly the same ambiguity that you refer to if running on an NT system. So simply leaving "/" encoded does not really solve that problem.
/.../ But why would this be better than
path = map (path_segments, encode) * "/";
where encode does all the job?
Because, again, before the encoded_uri() step it's a valid IRI which needs no further encoding.
Well, it depends on the context if this is the right thing to do or not, of course. If the encoded and unencoded /:s actually need to be handled differently in the application, then the user needs to do something anyway.
Well, the different handling of that particular char in a path is mandated by RFC 3986 if the URI is hierarchical. The application can't choose to not differentiate in that case (except, of course, if it chooses to not conform to the URI standard). For other reserved chars it can make that choice, though.
If that something is simply to complain, then
Array.sum(map(path, has_value, "/"));
would do as a simple test.
That's not simple for applications that don't want to handle it. For them it's simpler if the path splitter function in Standards.URI throws an error, just like it does on other malformed URI:s.
Depending on the situation there are other characters than "/" that you might want to treat specially. "\" for example, would introduce exactly the same ambiguity that you refer to if running on an NT system.
Can't see how it would, since it can never occur unencoded in a URI or IRI. If that happens then the URI is invalid, and I think the best way to handle that is to treat it as if it was encoded (i.e. just leave it as it is during %-decoding).
So simply leaving "/" encoded does not really solve that problem.
I never said it does, I only said that in some cases it can be enough.
Because, again, before the encoded_uri() step it's a valid IRI which needs no further encoding.
So now we are talking about the hypothetical Standards.IRI class again? As an URI, it _does_ need further encoding.
Depending on the situation there are other characters than "/" that you might want to treat specially. "\" for example, would introduce exactly the same ambiguity that you refer to if running on an NT system.
Can't see how it would, since it can never occur unencoded in a URI or IRI.
Um, weren't we discussing how decoding everything except %25 and %2f was supposed to make the user happy somehow? In that case %5c would be decoded into "\", no?
So now we are talking about the hypothetical Standards.IRI class again?
Yes. Since you objected to having a separate Standards.IRI I'm talking about what's required to merge that functionality into the current class. But by now I think this functionality fits well into Standards.URI since it doesn't do any implicit encoding or decoding.
Um, weren't we discussing how decoding everything except %25 and %2f was supposed to make the user happy somehow? In that case %5c would be decoded into "\", no?
Right, but as opposed to "/" there can never be any unencoded "\" with metameaning that it can be ambiguous with.
And as I said, after reading the RFC more carefully I no longer consider it an option to decode everything except %25 and %2F. It must leave all reserved chars encoded.
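To make that concrete, here is a minimal sketch (in Python for illustration; the helper name is hypothetical, not the Pike API) of a %XX-decoder that only unescapes the unreserved characters of RFC 3986. Reserved characters, "%" itself, and non-ASCII bytes stay encoded, so a quoted "/" (%2F) remains distinguishable from a real path segment separator:

```python
import re

# The unreserved characters of RFC 3986; everything else stays escaped.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")

def decode_unreserved(s):
    def sub(m):
        ch = chr(int(m.group(1), 16))
        # Only unescape chars whose meaning cannot change; reserved
        # chars, "%", and non-ASCII bytes are left as %XX escapes.
        return ch if ch in UNRESERVED else m.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", sub, s)

print(decode_unreserved("%41%2F%25"))  # "A%2F%25": %2F and %25 stay encoded
```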
Yes. Since you objected to having a separate Standards.IRI I'm talking about what's required to merge that functionality into the current class. But by now I think this functionality fits well into Standards.URI since it doesn't do any implicit encoding or decoding.
My suggestion was not to merge the functionality, but simply to allow the constructor to be called with an IRI, and in that case convert it to the equivalent URI, just like Gmp.mpq() can be called with a bignum, converting it to the equivalent fraction. IMO, this should cover most real use-cases.
Um, weren't we discussing how decoding everything except %25 and %2f was supposed to make the user happy somehow? In that case %5c would be decoded into "\", no?
Right, but as opposed to "/" there can never be any unencoded "\" with metameaning that it can be ambiguous with.
No, but a "\" can be ambiguous with a "/", so you need to check for it before you use the path as a local filename. So we have the following scenarios:
1) path is fully encoded:
* Need to check for %2f and %2F before decoding, and \ after decoding (or %5c and %5C before decoding)
2) path is decoded except for %25 and %2f:
* Need to check for %2f, %2F and \ (and you'll get the wrong result if %25 is present)
3) path is array of fully decoded components
* Need to check for / and \
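For scenario 3, the check is cheap enough to sketch (Python, hypothetical helper name; "\" is only a hazard on NT, but rejecting it unconditionally is the safe default):

```python
def segments_to_local_path(segments, sep="/"):
    # Scenario 3 above: the path arrives as an array of fully decoded
    # components. Before joining them into a local filename, reject
    # "/" (ambiguous everywhere) and "\" (ambiguous on NT) inside any
    # single segment, since a decoded %2F or %5C would otherwise be
    # indistinguishable from a real separator.
    for seg in segments:
        if "/" in seg or "\\" in seg:
            raise ValueError("separator inside path segment: %r" % seg)
    return sep.join(segments)

print(segments_to_local_path(["docs", "a.txt"]))  # docs/a.txt
```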
Of course, it's perfectly reasonable to keep the old behaviour of both Protocols.HTTP.http_encode_string and Standards.URI->create() in compat mode, to avoid breaking existing applications.
Btw, maybe Standards.URI()->http_encode should be fixed at the same time? It doesn't seem to encode wide characters at all, and encodes 8-bit characters as %XX (iso-8859-1).
I think Johan Schön and I chickened out when making Standards.URI and only aimed for the basic principle of taking URI:s apart and putting them together again, without losing data or precision. The latter is a bug, especially today, and probably ought to be fixed with prior utf-8 encoding.
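What "prior utf-8 encoding" means in practice, sketched in Python rather than Pike (the function name is hypothetical): each character outside the unreserved set is first utf-8 encoded, and each resulting byte is then %XX-escaped, per RFCs 3986/3987, instead of the %uXXXX form or raw iso-8859-1 bytes:

```python
# Unreserved characters per RFC 3986; everything else gets encoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")

def http_encode(s):
    # utf-8 encode first, then %XX-escape every byte of the result.
    return "".join(
        ch if ch in UNRESERVED
        else "".join("%%%02X" % b for b in ch.encode("utf-8"))
        for ch in s)

print(http_encode("r\u00e4ksm\u00f6rg\u00e5s"))  # r%C3%A4ksm%C3%B6rg%C3%A5s
```

A wide char such as "ä" thus becomes the two escapes %C3%A4 (its utf-8 bytes), which any RFC 3987 aware decoder can reverse losslessly.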
I would very much welcome an improved variant with getters and setters doing automatic encode/decode translation, perhaps in the form of a Standards.URL, where such behaviours are more well defined (especially for schemes http, https, ftp, ftps and maybe a few others) than for the generic case of URIs, or abominations like the javascript: scheme.
Doing it in the form of an inheriting Standards.URL would have a bonus benefit of not fscking up prior code. In practice you rarely have URIs that are not URLs too, anyway, so getting a kick-ass Standards.URL for such matters would be an improvement, and afford more useful defaults.
For tinkering with the URI parts, setting and getting them raw, the low level Standards.URI could stay mostly as is, while most API users would instead adopt tools better equipped for playing with URLs.
pike-devel@lists.lysator.liu.se