Recent changes to how the testsuite is run cause error messages from expected failures to be printed. That is not helpful.
Also, we've gone from 0 to 42 failing tests.
Recent changes to how the testsuite is run cause error messages from expected failures to be printed. That is not helpful.
Fixed.
Also, we've gone from 0 to 42 failing tests.
Also fixed. I was considering making Parser.HTML be a real html5 parser, by the way, but then I read it. :)
It would probably be better to start at least the actual parser/tokenizer from scratch.
Anyway, the reason for the changes in Parser.HTML was that omitting the space between attributes turned out to be fairly frequent.
That is, this is perfectly valid HTML:
<a id='id'style='foo'href=''>
and Yandex, as an example, for some reason felt it was a good idea. (In reality the strings inside the quotes contained spaces and such, so the quotes were needed; removing the space after the attribute then saved a byte in the size of the HTML file. I guess they are using an HTML optimizer that does this.)
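To make the HTML-style reading concrete, here is a minimal sketch in Python (not Pike, and not how Parser.HTML is implemented). The helper name parse_attrs_html_style and the regular expression are assumptions made up for illustration: a closing quote always ends the value, so the next attribute name may follow without any whitespace.

```python
import re

# Hypothetical pattern, for illustration only: a name, optionally followed by
# '=' and one single-quoted, double-quoted, or bare value.
ATTR = re.compile(
    r"""\s*([^\s=/>'"]+)(?:\s*=\s*(?:'([^']*)'|"([^"]*)"|([^\s>'"]+)))?"""
)

def parse_attrs_html_style(text):
    """Parse the attribute part of a tag; a closing quote ends the value."""
    attrs = {}
    pos = 0
    while pos < len(text):
        m = ATTR.match(text, pos)
        if m is None:
            break  # nothing that looks like an attribute name is left
        name = m.group(1)
        # Exactly one of the three value groups matched, or none at all.
        values = [v for v in m.groups()[1:] if v is not None]
        # First occurrence of a name wins; a missing value becomes "".
        attrs.setdefault(name.lower(), values[0] if values else "")
        pos = m.end()
    return attrs

print(parse_attrs_html_style("id='id'style='foo'href=''"))
```

With this reading the example tag yields three separate attributes, which is what a browser would see as well.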
Isn't the old syntax what Roxen uses to mix different quotes in the same attribute value? I believe it's supposed to work like this:
<foo attr="use ' here"'use " here'/>
but I'm not entirely sure (can't find an example now). A change to Parser.HTML would be a compat problem in that case.
Isn't the old syntax what Roxen uses to mix different quotes in the same attribute value? I believe it's supposed to work like this:
<foo attr="use ' here"'use " here'/>
I do not really think that was ever an encouraged syntax? &quot; has always been preferable, to my knowledge. It has been a while, however.
And it is an issue if you want Parser.HTML to parse actual HTML, since the 'save a space' trick seems to be surprisingly common. Granted, we only encountered issues in our turbo servers when it was used with "style" or "src", but that happened fairly often, presumably because 'style' has a tendency to contain whitespace.
I don't see any problem with supporting both; an attribute name can never start with ' or ", so there is no ambiguity.
<t a='b'c='d''e'="f">
is clearly a syntax error. If you interpret 'd''e' as the attribute value of c, then you have a spurious = after it, which is not allowed. If you interpret 'd' as the attribute value of c, then you have a quoted attribute name, which is not allowed.
<t a='b'c='d''e'="f">
is clearly a syntax error. If you interpret 'd''e' as the attribute value of c, then you have a spurious = after it, which is not allowed. If you interpret 'd' as the attribute value of c, then you have a quoted attribute name, which is not allowed.
Yes, I know, the question is if changing that is perhaps more dangerous than changing how the quoting works.
A real HTML-compatible parser would be very nice when parsing actual HTML, but perhaps not 100% compatible with RXML (one somewhat common use case of Parser.HTML, although HTML parsing is probably more common).
Yes, I know, the question is if changing that is perhaps more dangerous than changing how the quoting works.
Changing what? Hasn't it always been a syntax error? Just that the parser doesn't actually report it?
Who would write something like that, and what would they intend it to mean?
Changing what? Hasn't it always been a syntax error? Just that the parser doesn't actually report it?
No, Parser.HTML has always _explicitly_ allowed quoted attribute names (it sets the allowed quotes to the same thing for both attribute values and names).
I know this is fairly odd, but so is the more-than-one-quoted-string syntax.
<t "a"='a''b'c'"d> would return ([ "a":"abcd" ])
It will now be ([ "a":"b" "b":"b", "c":"c" "d":"d" ])
Neither is correct.
The only difference in the parser when parsing attribute names and values is that names do not get entities in them parsed. The quoting rules are the same.
No, Parser.HTML has always _explicitly_ allowed quoted attribute names
Where is this documented? Why would anyone use it?
(it sets the allowed quotes to the same thing for both attribute values and names).
I know this is fairly odd, but so is the more-than-one-quoted-string syntax.
<t "a"='a''b'c'"d> would return ([ "a":"abcd" ])
Really? That doesn't even have matched quotes...
I would like something that breaks down an HTML document into a data structure, preferably one with tools for things like searching.
Parser.HTML was created to allow RXML (or similar) parsing with as little computron usage as possible. What I am using it for mostly is breaking down random HTML documents for data gathering, which isn't the intended use...
I would like something that breaks down an HTML document into a data structure, preferably one with tools for things like searching.
Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable specification.
Parser.HTML was created to allow RXML (or similar) parsing with as little computron usage as possible. What I am using it for mostly is breaking down random HTML documents for data gathering, which isn't the intended use...
Ironically enough it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing html. :)
It is faster to have a simple tokenizer that outputs tokens, which are then handled either by a tree generator (as also specified in HTML5) or by something that just calls callbacks for tags (like the current Parser.HTML).
I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)
Handy reference:
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
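As a rough illustration of the tokenizer-plus-consumer split described above, here is a small Python sketch. It is not the HTML5 tokenization algorithm and not Parser.HTML; the names tokenize, run_callbacks and build_tree are made up for this example, and the regular expression deliberately ignores comments, void elements, and '>' inside quoted attribute values.

```python
import re

# Simplified token pattern (an assumption, far from the HTML5 state machine):
# either a start/end tag, or a run of text up to the next '<'.
TOKEN = re.compile(r"<(/?)([A-Za-z][^\s/>]*)([^>]*)>|([^<]+)")

def tokenize(html):
    """Yield ("start", name, raw_attrs), ("end", name) or ("text", data)."""
    for m in TOKEN.finditer(html):
        closing, name, rest, text = m.groups()
        if text is not None:
            yield ("text", text)
        elif closing:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower(), rest.strip())

def run_callbacks(tokens, handlers):
    """Callback-style consumer: call handlers[kind](*payload) if present."""
    for tok in tokens:
        handler = handlers.get(tok[0])
        if handler is not None:
            handler(*tok[1:])

def build_tree(tokens):
    """Tree-style consumer: nest nodes by matching start and end tags."""
    root = {"tag": None, "attrs": "", "children": []}
    stack = [root]
    for tok in tokens:
        if tok[0] == "text":
            stack[-1]["children"].append(tok[1])
        elif tok[0] == "start":
            node = {"tag": tok[1], "attrs": tok[2], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Close the innermost open element with this name, if any.
            for depth in range(len(stack) - 1, 0, -1):
                if stack[depth]["tag"] == tok[1]:
                    del stack[depth:]
                    break
    return root

tokens = list(tokenize("<p>Hi <b>there</b></p>"))
tree = build_tree(tokens)
```

The point of the split is that the same token stream can feed either consumer, so the tokenizer only has to be written (and made fast) once.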
Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable specification.
So maybe it's time for a new tool.
Ironically enough it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing html. :)
Yes, but I believe it was written to search for specific tags, not to parse every single tag or even to build a data structure around it. So it's naturally pretty bad at anything that is not RXML (as RXML was at the time, too, probably) :)
I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)
Which is bad. But it shouldn't be the largest obstacle. :)
Use a subtree. Parser.HTML.Tokenizer?
Also, I'm confused. Is
([ /* 3 elements */ "a": "b", "c": "d", "e": "f" ])
supposed to be the old parser behaviour? Wouldn't that contradict Jonas's claim that multiple quoted strings would get concatenated? Or is it only if they use different quotes? Because in that case there is _still_ no ambiguity...
Here's one of several real-world examples that I found in our CMS (it's a bit tricky to grep for these constructs...):
#define quote(X) (replace((X)||"", "'", "'\"'\"'"))
[...]
"<var name='destname' type='string' size='40' default='" + quote(sbobj->name(id)) + "' />"
A sbobj->name() returning "'foo'" would then produce:
<var ... default=''"'"'foo'"'"'' />
Whether this is a guarantee that we never concatenate two strings with same quote char I don't know, but I'd rather preserve any such backwards compatibility than the very-odd quoted attribute _name_ syntax which I've never seen used.
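To check the expansion above by hand, here is a small Python rendering of the quote macro together with a reader that concatenates adjacent quoted segments. The function join_quoted_segments is a hypothetical stand-in for the old concatenating behaviour as described in this thread, not the actual Parser.HTML code.

```python
def quote(value):
    # Python rendering of the quote() macro: close the single-quoted string,
    # emit a double-quoted ', then reopen the single-quoted string.
    return (value or "").replace("'", "'\"'\"'")

def join_quoted_segments(raw):
    """Concatenate adjacent quoted segments (assumed old-style reading)."""
    parts = []
    i = 0
    while i < len(raw):
        q = raw[i]
        if q not in "'\"":
            raise ValueError("expected a quote at offset %d" % i)
        end = raw.index(q, i + 1)  # raises ValueError if unterminated
        parts.append(raw[i + 1:end])
        i = end + 1
    return "".join(parts)

# The attribute value as it would appear after default= in the generated tag.
attr_value = "'" + quote("'foo'") + "'"
print(attr_value)
print(join_quoted_segments(attr_value))
```

Note that in this output every pair of adjacent segments uses different quote characters, so the example does not settle what should happen for two adjacent segments with the same quote character.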
Well, the quoted attribute thing seems to have been a red herring, since it is actually clear how to interpret every such case:
If there is whitespace between the two quoted things, then the second quoted thing is an attribute name. If there is no whitespace between the two quoted things, then they should both be aggregated into the attribute value.
Since neither of these cases is allowed in HTML, only the old behaviour has any claim on desired behaviour here.
The only problematic case seems to be aggregation of quoted and non-quoted parts of an attribute value. Which your real world example luckily does not exhibit.
No, it was somewhat easy to fix so the attributes work as in HTML, but now they no longer work as they used to.
There should be a Parser.HTML object that can carry a flag to control the behaviour, if someone wants the new/old behaviour?
Is it possible to use heuristics so that a='b'c=d is detected?
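One possible heuristic, along the lines of the "support both" suggestion earlier in the thread, can be sketched in Python as follows. This is only an illustration under stated assumptions: parse_attrs_both is a made-up name, quoted attribute names are not supported at all, and mixing quoted and unquoted parts within one value is not handled. After a closing quote, another quote character continues the same value, while anything else ends it, so a name may follow directly.

```python
def parse_attrs_both(text):
    """Adjacent quoted segments concatenate; a name may follow a quote directly."""
    attrs = {}
    i, n = 0, len(text)
    while i < n:
        while i < n and text[i].isspace():
            i += 1
        start = i
        while i < n and not text[i].isspace() and text[i] not in "='\"/>":
            i += 1
        name = text[start:i]
        if not name:
            break  # quoted name, stray '=' or end of input: stop here
        value = ""
        j = i
        while j < n and text[j].isspace():
            j += 1
        if j < n and text[j] == "=":
            j += 1
            while j < n and text[j].isspace():
                j += 1
            if j < n and text[j] in "'\"":
                parts = []
                # Keep reading segments while the next character is a quote.
                while j < n and text[j] in "'\"":
                    end = text.find(text[j], j + 1)
                    if end < 0:
                        end = n  # unterminated quote: take the rest
                    parts.append(text[j + 1:end])
                    j = end + 1
                value = "".join(parts)
            else:
                start = j
                while j < n and not text[j].isspace() and text[j] not in "'\">":
                    j += 1
                value = text[start:j]
            i = j
        attrs[name] = value
    return attrs

print(parse_attrs_both("a='b'c=d"))
print(parse_attrs_both("id='id'style='foo'href=''"))
print(parse_attrs_both("default=''\"'\"'foo'\"'\"''"))
```

Under these assumptions both the space-saving HTML form and the CMS-style concatenated value parse as intended; the sketch simply stops at input such as a quoted attribute name, which it makes no attempt to interpret.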
pike-devel@lists.lysator.liu.se