testsuite

Martin Nilsson (Opera Mini - AFK!) ＠ Pike (-) developers forum

17 Sep 2014

10:09 a.m.

Recent changes to how the testsuite is run cause error messages from expected failures to be printed out. That is not helpful.

Also, we've gone from 0 to 42 failing tests.

Per Hedbor () ＠ Pike (-) developers forum

17 Sep 2014

10:09 a.m.

...

Recent changes to how the testsuite is run cause error messages from expected failures to be printed out. That is not helpful.

Fixed.

...

Also, we've gone from 0 to 42 failing tests.

Also fixed. I was considering making Parser.HTML be a real html5 parser, by the way, but then I read it. :)

It would probably be better to start at least the actual parser/tokenizer from scratch.

Anyway, the reason for the changes in Parser.HTML was that it was actually fairly frequent to omit the space between arguments.

That is, this is perfectly valid HTML:

and for some reason yandex, as an example, felt like it was a good idea (in reality the string inside the quotes contained spaces and such, so the quotes there were needed; removing the space after the argument then saved a byte on the size of the HTML file. I guess they are using an HTML optimizer that)

-- Per Hedbor
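
To make the missing-space case concrete, here is a minimal Python sketch of HTML-style attribute scanning, where a closing quote ends the value and the next attribute may start immediately. It is not Pike's Parser.HTML; the function name `parse_attrs_html` and the sample input are made up for illustration.

```python
import re

# One attribute: a name, optionally followed by "=" and a double-quoted,
# single-quoted, or unquoted value. Whitespace between attributes is optional.
ATTR = re.compile(
    r"""\s*([^\s"'=<>/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>]+)))?"""
)

def parse_attrs_html(attr_text):
    """Parse the attribute part of a tag; a closing quote ends the value."""
    attrs = {}
    pos = 0
    while pos < len(attr_text):
        m = ATTR.match(attr_text, pos)
        if not m:
            break  # trailing whitespace or something that is not an attribute
        name = m.group(1)
        # Pick whichever value form matched; valueless attributes get "".
        value = next((g for g in m.groups()[1:] if g is not None), "")
        attrs.setdefault(name, value)  # first occurrence wins
        pos = m.end()
    return attrs

print(parse_attrs_html('src="a b.png"style="margin: 0"alt=x'))
# {'src': 'a b.png', 'style': 'margin: 0', 'alt': 'x'}
```

With this reading, a quoted string glued to the previous closing quote is not treated as more of the same value, which is where the compatibility question discussed in the replies comes from.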

Jonas Walldén ＠ Pike developers forum

10:09 a.m.

Isn't the old syntax what Roxen uses to mix different quotes in the same attribute value? I believe it's supposed to work like this:

<foo attr="use ' here"'use " here'/>

but I'm not entirely sure (can't find an example now). A change to Parser.HTML would be a compat problem in that case.

Per Hedbor () ＠ Pike (-) developers forum

10:11 a.m.

...

Isn't the old syntax what Roxen uses to mix different quotes in the same attribute value? I believe it's supposed to work like this:

<foo attr="use ' here"'use " here'/>

I do not really think that was ever an encouraged syntax? &quot; has always been preferable, to my knowledge. It has been a while, however.

And it is an issue if you want Parser.HTML to parse actual HTML, since the 'save a space' trick seems to be surprisingly common. Granted, we only encounter issues in our turbo servers when it is used with "style" or "src", but that happened fairly often, presumably because 'style' has a tendency to contain whitespace.

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

10:11 a.m.

I don't see any problem with supporting both; an attribute name can never start with ' or ", so there is no ambiguity.

Per Hedbor () ＠ Pike (-) developers forum

10:11 a.m.

Well, that would work, except for this:

...

p->finish("<t a='b'c='d''e'="f">")->read();

([ /* 3 elements */ "a": "b", "c": "d", "e": "f" ])

(3) Result: "<t a='b'c=d 'e'="f">"

...

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

10:11 a.m.

<t a='b'c='d''e'="f">

is clearly a syntax error. If you interpret 'd''e' as the attribute value of c, then you have a spurious = after it, which is not allowed. If you interpret 'd' as the attribute value of c, then you have a quoted attribute name, which is not allowed.

Per Hedbor () ＠ Pike (-) developers forum

10:25 a.m.

...

<t a='b'c='d''e'="f">

is clearly a syntax error. If you interpret 'd''e' as the attribute value of c, then you have a spurious = after it, which is not allowed. If you interpret 'd' as the attribute value of c, then you have a quoted attribute name, which is not allowed.

Yes, I know, the question is if changing that is perhaps more dangerous than changing how the quoting works.

A real HTML-compatible parser would be very nice when parsing actual HTML, but perhaps not 100% compatible with RXML (one somewhat common use case of Parser.HTML, although HTML parsing is probably more common).

-- Per Hedbor

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

10:25 a.m.

...

Yes, I know, the question is if changing that is perhaps more dangerous than changing how the quoting works.

Changing what? Hasn't it always been a syntax error? Just that the parser doesn't actually report it?

Who would write something like that, and what would they intend it to mean?

Per Hedbor () ＠ Pike (-) developers forum

10:35 a.m.

...

Changing what? Hasn't it always been a syntax error? Just that the parser doesn't actually report it?

No, Parser.HTML has always _explicitly_ allowed quoted attribute names (it sets the allowed quotes to the same thing for both attribute values and names).

I know this is fairly odd, but so is the more-than-one-quoted-string syntax.

<t "a"='a''b'c'"d> would return ([ "a":"abcd" ])

It will now be ([ "a":"b" "b":"b", "c":"c" "d":"d" ])

Neither is correct.

The only difference in the parser when parsing attribute names and values is that names do not get entities in them parsed. The quoting rules are the same.

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

10:40 a.m.

...

No, Parser.HTML has always _explicitly_ allowed quoted attribute names

Where is this documented? Why would anyone use it?

...

(it sets the allowed quotes to the same thing for both attribute values and names).

I know this is fairly odd, but so is the more-than-one-quoted-string syntax.

<t "a"='a''b'c'"d> would return ([ "a":"abcd" ])

Really? That doesn't even have matched quotes...

Per Hedbor () ＠ Pike (-) developers forum

10:40 a.m.

...

Really? That doesn't even have matched quotes...

You are free to leave out the last endquote...

Mirar ＠ Pike developers forum

10:35 a.m.

I would like something that breaks down an HTML document into a data structure, preferably one with tools like searching.

Parser.HTML is created to allow RXML (or similar) parsing with as little computron usage as possible. What I am using it for mostly is breaking down random HTML documents for data gathering, which isn't the intended use...

Per Hedbor () ＠ Pike (-) developers forum

10:40 a.m.

...

I would like something that breaks down an HTML document into a data structure, preferably one with tools like searching.

Well, yes, a HTML tokenizer would be useful. HTML5 has a very readable specification.

...

Parser.HTML is created to allow RXML (or similar) parsing with as little computron usage as possible. What I am using it for mostly is breaking down random HTML documents for data gathering, which isn't the intended use...

Ironically enough it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing html. :)

It is faster to have a simple tokenizer that outputs tokens that are then handled by either a tree generator (as also specified in HTML5) or something that just calls callbacks for tags (like the current Parser.HTML).

I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)

Handy reference:

https://html.spec.whatwg.org/multipage/syntax.html#tokenization

-- Per Hedbor
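
The tokenizer-plus-consumer split described in this post can be sketched roughly as follows. This is an illustrative Python sketch, not the HTML5 tokenization algorithm and not Parser.HTML; all names (`tokenize`, `run_callbacks`, `build_tree`) are hypothetical, and it ignores comments, void elements, and `>` inside quoted attribute values.

```python
import re

# A tag, a run of text, or a stray "<" that does not start a tag.
TOKEN = re.compile(r"<(/?)([A-Za-z][^\s/>]*)([^>]*)>|([^<]+)|(<)")

def tokenize(html):
    """Yield ('start', name, raw_attrs), ('end', name) or ('text', data)."""
    for m in TOKEN.finditer(html):
        closing, name, raw_attrs, text, stray = m.groups()
        if name is None:
            yield ("text", text if text is not None else stray)
        elif closing:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower(), raw_attrs.strip())

def run_callbacks(tokens, callbacks):
    """Consumer 1: call a handler per token type, if one is registered."""
    for tok in tokens:
        handler = callbacks.get(tok[0])
        if handler:
            handler(*tok[1:])

def build_tree(tokens):
    """Consumer 2: build a nested dict tree from the same token stream."""
    root = {"name": "#root", "attrs": "", "children": []}
    stack = [root]
    for tok in tokens:
        if tok[0] == "start":
            node = {"name": tok[1], "attrs": tok[2], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif tok[0] == "end":
            # Close the nearest matching open element; ignore stray end tags.
            for depth in range(len(stack) - 1, 0, -1):
                if stack[depth]["name"] == tok[1]:
                    del stack[depth:]
                    break
        else:
            stack[-1]["children"].append(tok[1])
    return root

doc = "<p>Hi <b>there</b></p>"
seen = []
run_callbacks(tokenize(doc), {"start": lambda name, attrs: seen.append(name)})
print(seen)                                  # ['p', 'b']
print(build_tree(tokenize(doc))["children"][0]["name"])  # p
```

Both consumers read the same token stream, so the tokenizer stays small and the choice between callbacks and a tree is made by the caller.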

Mirar ＠ Pike developers forum

1:10 p.m.

...

Well, yes, a HTML tokenizer would be useful. HTML5 has a very readable specification.

So maybe it's time for a new tool.

...

Ironically enough it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing html. :)

Yes, but I believe it was written to search for specific tags, not to parse every single tag or even to build a data structure around it. So it's naturally pretty bad at anything that is not RXML (as RXML was at the time, too, probably) :)

...

I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)

Which is bad. But it shouldn't be the largest obstacle. :)

Use a subtree. Parser.HTML.Tokenizer?

Jonas Walldén ＠ Pike developers forum

1:10 p.m.

Or Parser.HTML5?

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

10:30 a.m.

Also, I'm confused. Is

...

([ /* 3 elements */ "a": "b", "c": "d", "e": "f" ])

supposed to be the old parser behaviour? Wouldn't that contradict Jonas's claim that multiple quoted strings would get concatenated? Or is it only if they use different quotes? Because in that case there is _still_ no ambiguity...

Jonas Walldén ＠ Pike developers forum

11 a.m.

Here's one of several real-world examples that I found from our CMS (it's a bit tricky to grep for these constructs...):

#define quote(X) (replace((X)||"", "'", "'\"'\"'"))
[...]
"<var name='destname' type='string' size='40' default='" + quote(sbobj->name(id)) + "' />"

An sbobj->name() call returning "'foo'" would then produce:

<var name='destname' type='string' size='40' default=''"'"'foo'"'"'' />

Whether this is a guarantee that we never concatenate two strings with the same quote char I don't know, but I'd rather preserve any such backwards compatibility than the very odd quoted attribute _name_ syntax, which I've never seen used.

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

11:10 a.m.

Well, the quoted attribute thing seems to have been a red herring, since it is actually clear how to interpret every such case:

If there is whitespace between the two quoted things, then the second quoted thing is an attribute name. If there is no whitespace between the two quoted things, then they should both be aggregated into the attribute value.

Since neither of these cases is allowed in HTML, only the old behaviour has any claim on desired behaviour here.

The only problematic case seems to be aggregation of quoted and non-quoted parts of an attribute value. Which your real world example luckily does not exhibit.
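
The whitespace rule stated in this post can be sketched as below. This is a Python illustration under stated assumptions, not the actual Parser.HTML code: the helper names are hypothetical, bare (unquoted) segments are also glued onto adjacent quoted ones, a missing end quote swallows the rest of the input, and a valueless attribute is mapped to its own name.

```python
QUOTES = "'\""

def read_word(s, i):
    """Join adjacent quoted and bare segments starting at s[i].

    Returns (text, index just past the word). Whitespace or "=" ends
    the word; a missing end quote takes the rest of the input.
    """
    parts = []
    while i < len(s) and not s[i].isspace() and s[i] != "=":
        if s[i] in QUOTES:
            end = s.find(s[i], i + 1)
            if end == -1:
                end = len(s)
            parts.append(s[i + 1:end])
            i = min(end + 1, len(s))
        else:
            start = i
            while (i < len(s) and not s[i].isspace()
                   and s[i] not in QUOTES + "="):
                i += 1
            parts.append(s[start:i])
    return "".join(parts), i

def parse_attrs_concat(s):
    """Whitespace separates attributes; glued segments form one value."""
    attrs, i = {}, 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        name, i = read_word(s, i)
        value = name  # assumption: a valueless attribute maps to its name
        if i < len(s) and s[i] == "=":
            value, i = read_word(s, i + 1)
        attrs[name] = value
    return attrs

print(parse_attrs_concat("name='destname' default=''\"'\"'foo'\"'\"''"))
# {'name': 'destname', 'default': "'foo'"}
print(parse_attrs_concat("a='b' 'c'=d"))
# {'a': 'b', 'c': 'd'}
```

The first call uses the quoted output from the CMS example earlier in the thread; the second shows a quoted attribute name after whitespace.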

Mirar ＠ Pike developers forum

10:11 a.m.

Eep. That shouldn't be too hard to fix in Parser.HTML though?

Per Hedbor () ＠ Pike (-) developers forum

10:11 a.m.

No, it was somewhat easy to fix so the attributes work as in HTML, but now they no longer work as they used to.

Mirar ＠ Pike developers forum

10:11 a.m.

There should be a Parser.HTML object that can carry a flag to control the behaviour, if someone wants the new/old behaviour?

Is it possible to use heuristics so that a='b'c=d is detected?
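
One possible heuristic, sketched in Python purely as an illustration (the pattern and the function name `looks_like_glued_attrs` are assumptions, not anything in Parser.HTML): treat a closing quote that is immediately followed by something shaped like `name=` as the start of a new attribute.

```python
import re

# "=" plus a quoted value, directly followed by what looks like "name=".
GLUED_ATTR = re.compile(
    r"""=\s*(?:"[^"]*"|'[^']*')(?=[A-Za-z_][\w:.-]*\s*=)"""
)

def looks_like_glued_attrs(attr_text):
    """Return True if a quoted value is followed by name= with no space."""
    return GLUED_ATTR.search(attr_text) is not None

print(looks_like_glued_attrs("a='b'c=d"))                       # True
print(looks_like_glued_attrs("a='b' c=d"))                      # False
print(looks_like_glued_attrs("attr=\"use ' here\"'use \" here'"))  # False
```

A parser object carrying a mode flag could use such a check only as a hint; it cannot detect a glued attribute that has no value, and it can misfire on "=" characters inside quoted values.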

Mirar ＠ Pike developers forum

10:11 a.m.

Is the exporter fixed yet?

Peter Bortas ＠ Pike developers forum

10:11 a.m.

I just called Hedda. He will take a look at it ASAP.

pike-devel@lists.lysator.liu.se
