I would like something that breaks an HTML document down into a data structure, preferably one that comes with tools such as searching.
Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable specification.
Parser.HTML was created to allow RXML (or similar) parsing with as little computron usage as possible. What I mostly use it for is breaking down random HTML documents for data gathering, which isn't the intended use...
Ironically enough, it is about 10 times slower than ye olde Opera HTML5 parser at actually parsing HTML. :)
It is faster to have a simple tokenizer whose output tokens are handled by either a tree generator (as also specified in HTML5) or something that just calls callbacks for tags (like the current Parser.HTML).
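To make the split concrete, here is a rough sketch in Python (not Pike, and not how Parser.HTML is actually implemented). All names in it (tokenize, run_callbacks, build_tree, Node, VOID) are made up for illustration. The tokenizer is deliberately naive and nowhere near the HTML5 state machine: it does not handle comments, doctypes, script/style content, or ">" inside attribute values.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token shapes: ("start", name, attrs), ("end", name), ("text", data)
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
ATTR_RE = re.compile(r'([^\s=/]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+)))?')

def tokenize(html):
    """Yield a flat stream of tokens; knows nothing about nesting."""
    pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        closing, name, rest = m.groups()
        if closing:
            yield ("end", name.lower())
        else:
            attrs = {k.lower(): (dq or sq or bare or "")
                     for k, dq, sq, bare in ATTR_RE.findall(rest)}
            yield ("start", name.lower(), attrs)
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:])

def run_callbacks(tokens, on_start=None, on_end=None, on_text=None):
    """Consumer 1: just call callbacks per token, no tree is built."""
    for tok in tokens:
        if tok[0] == "start" and on_start:
            on_start(tok[1], tok[2])
        elif tok[0] == "end" and on_end:
            on_end(tok[1])
        elif tok[0] == "text" and on_text:
            on_text(tok[1])

@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node or str items

    def find_all(self, name):
        """Depth-first search for descendant elements with this tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.name == name:
                    yield child
                yield from child.find_all(name)

# Assumed subset of void elements, enough for the example.
VOID = {"br", "hr", "img", "input", "link", "meta"}

def build_tree(tokens):
    """Consumer 2: build a searchable tree from the same token stream."""
    root = Node("#root")
    stack = [root]
    for tok in tokens:
        if tok[0] == "text":
            stack[-1].children.append(tok[1])
        elif tok[0] == "start":
            node = Node(tok[1], tok[2])
            stack[-1].children.append(node)
            if tok[1] not in VOID:
                stack.append(node)
        else:
            # Close the nearest matching open element; ignore stray end tags.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == tok[1]:
                    del stack[i:]
                    break
    return root
```

For data gathering, something like `[n.attrs.get("href") for n in build_tree(tokenize(page)).find_all("a")]` would collect link targets, while `run_callbacks` covers the callback-per-tag style of use without allocating a tree at all.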
I have seriously considered writing one. But the name 'Parser.HTML' is already taken. :)
Handy reference:
https://html.spec.whatwg.org/multipage/syntax.html#tokenization