After having needed to program (HTTP) parameter decoding/encoding for the nth time, I decided to add some proper code for generic use.
Any objections to the interface and/or naming? Is the solution fast enough?
Which "parameters" are you referring to? Query encoding/decoding is already in Standards.URI. Header encoding/decoding exists in Protocols.HTTP.Query, although the decoding function doesn't seem to be callable on its own...
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
> Which "parameters" are you referring to? Query encoding/decoding is already in Standards.URI. Header encoding/decoding exists in Protocols.HTTP.Query, although the decoding function doesn't seem to be callable on its own...
Well, if they're there somewhere, I'd gladly make them callable.
I'm referring to the encoding and decoding of lines like:
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits; server_max_window_bits=10, permessage-deflate; client_max_window_bits
Or
Forwarded: for=1.2.3.4;host="foo.b a;r",for=5.6.7.8,for=9.10.11.12;host=bar.foo
It's needed at various spots internally in the Pike libs, and now I need it externally too (to handle forwarded requests in my mini-http server).
Well, those "lines" are headers (which may actually span multiple physical lines), with a name ('Sec-WebSocket-Extensions' and 'Forwarded' respectively), and a value ('permessage-deflate; client_max_window_bits; server_max_window_bits=10, permessage-deflate; client_max_window_bits', and 'for=1.2.3.4;host="foo.ba;r",for=5.6.7.8,for=9.10.11.12;host=bar.foo' respectively).
If you are referring to tokenization and formatting of structured values of such headers, then MIME.tokenize and MIME.quote should be your friends. The detailed interpretation of the actual tokens would be specific to the header at hand though.
Parsing headers from "lines" can be done with MIME.Message, which can also format them back into "lines", but there is also Protocols.HTTP.Query.headers_encode for that. And of course if you are using Protocols.HTTP to make/process requests the module will take care of this for you.
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
> If you are referring to tokenization and formatting of structured values of such headers, then MIME.tokenize and MIME.quote should be your friends. The detailed interpretation of the actual tokens would
The main problem is indeed the tokenisation. What is needed at various points is the decomposition of such a headerfield into an ordered structure which contains the key-value pairs.
Browsing through MIME, I find: MIME.decode_words_tokenized_labled(). It does some of the parsing, but the result it produces still takes considerable postprocessing in order to get to the key-value pairs we need. Maybe there is something better in there, which almost does what I need.
So the MIME library as-is seems to have all the basic tools, but they still need to be integrated to get the key-value pairs.
> Parsing headers from "lines" can be done with MIME.Message, which can also format them back into "lines",
MIME.Message would be far too heavy-handed for parsing just a single headerfield.
> but there is also Protocols.HTTP.Query.headers_encode for that. And of course if you are using Protocols.HTTP to make/process requests the module will take care of this for you.
headers_encode creates the header-fields, but does nothing with the internal parameters of each header-field. Then again, converting from structured data back to a valid header-field is easy and not the issue here.
I'll browse through the MIME libs some more to see if I can find salvation there. So far it seems a bit murky to me (for the use cases I described).
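The integration step discussed above (turning the flat token list from MIME.tokenize into key-value pairs) could look roughly like the following. This is only a minimal sketch: fold_params is a hypothetical helper, not something in the MIME module, the value 1 for parameters without '=' is a placeholder chosen here, and a plain mapping does not preserve the ordering asked for above.

```pike
// Hypothetical helper, NOT part of Pike's MIME module.
// Splits a header value on ',' into elements and on ';' into parameters.
array(mapping) fold_params(string value)
{
  array(mapping) elements = ({});
  mapping current = ([]);
  array(string|int) pending = ({});   // tokens of the parameter being read

  // A trailing ',' sentinel makes the last element flush like the others.
  foreach (MIME.tokenize(value) + ({ ',' }), string|int tok) {
    if (tok != ';' && tok != ',') {
      pending += ({ tok });
      continue;
    }
    if (sizeof(pending) >= 3 && pending[1] == '=')
      current[pending[0]] = pending[2];   // key=value
    else if (sizeof(pending))
      current[pending[0]] = 1;            // bare key; 1 is a placeholder
    pending = ({});
    if (tok == ',') {
      elements += ({ current });
      current = ([]);
    }
  }
  return elements;
}

int main()
{
  write("%O\n", fold_params("a=b;c, d=e"));
  return 0;
}
```

Malformed input (for example several tokens after the '=') is silently truncated here; a real implementation would need to decide how forgiving to be.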
Which of the (plethora of) MIME functions is the best for you depends a bit on your requirements. The ones with "words" in the name support RFC 2047 (the son of RFC 1522) encoding for non-ASCII attributes, for example if you have a header such as
Content-Disposition: attachment; filename==?iso-8859-1?q?sn=F6gubbe.txt?=
where the value of the "filename" attribute contains a non-ASCII character. By using decode_words_tokenized{,_labled}, you get the character encoding ("iso-8859-1" in this case) separately, and the value still encoded in that encoding. If you use decode_words_tokenized{,_labled}_remapped instead, the values are remapped to Pike Unicode strings (losing the information about the original encoding).
If you are only dealing with headers where non-ASCII text is not allowed, you can use the tokenize functions without the "decode_words" prefix.
As for the "labled" variants, these are only needed if you want to keep comments, or to distinguish between quoted and non-quoted values (even though they are semantically equivalent). This is mostly intended for GUIs with fancy display mechanisms. The variants without "labled" give you the information you normally need: strings are tokens or quoted strings, ints are tspecials (such as '=' for the equals sign), and comments and whitespace are simply removed. So for example
text/plain; charset=us-ascii (Plain text)
and
text/plain; charset="us-ascii"
(an example from RFC 2045 of two completely equivalent header values) both yield
({ "text", '/', "plain", ';', "charset", '=', "us-ascii" })
The quote/encode_words_quoted functions work analogously but in the other direction. Note that for encode_words_quoted{,_labled}_remapped you need to specify which character encoding to remap non-ASCII strings to (either as a fixed string like "utf-8", or as a function to dynamically pick an encoding based on the string contents), and also whether to use base64 or quoted-printable.
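As a small sketch of the equivalence described above, the two RFC 2045 values can be run through MIME.tokenize and compared, and the tokens handed back to MIME.quote. The variable names are made up for illustration, and the exact string MIME.quote produces is deliberately not asserted here.

```pike
int main()
{
  // Same header value, once with a comment, once with a quoted string.
  array(string|int) a =
    MIME.tokenize("text/plain; charset=us-ascii (Plain text)");
  array(string|int) b =
    MIME.tokenize("text/plain; charset=\"us-ascii\"");

  // Strings are tokens or quoted strings, ints are tspecials.
  write("%O\n%O\n", a, b);
  write("equal: %d\n", equal(a, b));

  // The other direction: format the token list back into a header value.
  write("%s\n", MIME.quote(a));
  return 0;
}
```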
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
> Which of the (plethora of) MIME functions is the best for you depends a bit on your requirements. The ones with "words" in the name support
Thanks for the explanation. Maybe some of this should go into the MIME docs as examples.
Anyway, after some searching and trying, I finally settled on using MIME.tokenize and MIME.quote.
So I rewrote my implementation to use both primitives, and then moved the implementation, the testsuite and the references to MIME.decode_headerfield_params() and MIME.encode_headerfield_params().
Comments?
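For readers who want to try the two functions, a usage sketch on the Sec-WebSocket-Extensions value from earlier in the thread might look like this. It assumes a Pike version that ships both functions, that the decoder returns one mapping-like entry per comma-separated element, and that the encoder accepts the decoder's result unchanged; the thread itself confirms none of the encoder's details.

```pike
int main()
{
  string value = "permessage-deflate; client_max_window_bits; "
                 "server_max_window_bits=10, "
                 "permessage-deflate; client_max_window_bits";

  // Assumption: one entry per comma-separated element of the header value.
  array decoded = MIME.decode_headerfield_params(value);
  foreach (decoded; int i; mixed params)
    write("element %d: %O\n", i, params);

  // Assumption: the encoder takes the decoder's output as-is.
  write("%s\n", MIME.encode_headerfield_params(decoded));
  return 0;
}
```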
Looks good to me. Technically, though, the variable "key" should be string|int, because the forgiving handling of non-conforming input means you can get tspecials both as keys and as values:
MIME.decode_headerfield_params(":=colon;question=?");
(1) Result: ({ /* 1 element */
                ([ /* 2 elements */
                  58: "colon",
                  "question": 63,
                ])
            })
pike-devel@lists.lysator.liu.se