testsuite

17 Sep 2014


      ...
I would like something that breaks down an html document to a
data structure, preferably one with tools like searching.
Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable
specification.
...
Parser.HTML is created to allow RXML (or similar) parsing with as
little computron usage as possible. What I am using it for mostly is
breaking down random HTML documents for data gathering, which isn't
the intended use...
Ironically enough it is about 10 times slower than ye olde Opera HTML5
parser at actually parsing HTML. :)
It is faster to have a simple tokenizer that then outputs tokens
that are handled by either a tree generator (as also specified in
HTML5) or something that just calls callbacks for tags (like the
current Parser.HTML).
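
A rough sketch of that split, in Python rather than Pike purely for
illustration. The names tokenize, build_tree and run_callbacks are
made up here, and the regex is a drastic simplification of the HTML5
tokenization rules (no attributes, comments, entities or error
recovery), so treat it as a picture of the layering and not as a
proposed implementation:

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# A token is (kind, value), where kind is "start", "end" or "text".
Token = Tuple[str, str]

# Assumption: a toy tag pattern. Real HTML5 tokenization is a state machine.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tokenize(html: str) -> Iterator[Token]:
    """Yield start-tag, end-tag and text tokens in document order."""
    pos = 0
    for m in _TAG.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        yield ("end" if m.group(1) else "start", m.group(2).lower())
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:])


def build_tree(tokens: Iterable[Token]) -> dict:
    """Consumer 1: turn the token stream into a nested dict tree."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "start":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            # Close back to the nearest matching open element;
            # stray end tags are ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == value:
                    del stack[i:]
                    break
        else:
            stack[-1]["children"].append(value)
    return root


def run_callbacks(tokens: Iterable[Token],
                  handlers: Dict[str, Callable[[str], None]]) -> None:
    """Consumer 2: call a handler for each start tag that has one."""
    for kind, value in tokens:
        if kind == "start" and value in handlers:
            handlers[value](value)
```

Both consumers read the same token stream, so the tokenizer itself
does not need to know whether a tree or a set of callbacks sits on
top of it.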
I have seriously considered writing one. But the name 'Parser.HTML' is
already taken. :)
Handy reference:
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
-- 
Per Hedbor
