Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable specification.
So maybe it's time for a new tool.
Ironically enough, it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing HTML. :)
Yes, but I believe it was written to search for specific tags, not to parse every single tag or even to build a data structure around them. So it's naturally pretty bad at anything that isn't RXML (as RXML was at the time, too, probably) :)
I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)
Which is bad. But it shouldn't be the largest obstacle. :)
Use a subtree. Parser.HTML.Tokenizer?