On Thu, 25 May 2006 08:30:01 +0000 (UTC) "Johan Schön (Opera Software, Firefruit) @ Pike (-) developers forum" 10353@lyskom.lysator.liu.se wrote:
This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
Here are the different stages:
0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
These don't seem to be getting used consistently, and in attempting to clean that up I have just broken things. So, to just fix the looping problem when a URI is denied by robots.txt, does this look good?
Index: Crawler.pmod
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.7/lib/modules/Web.pmod/Crawler.pmod,v
retrieving revision 1.24
diff -u -r1.24 Crawler.pmod
--- Crawler.pmod	19 May 2006 19:15:30 -0000	1.24
+++ Crawler.pmod	25 May 2006 20:36:07 -0000
@@ -457,7 +457,7 @@
     if(sizeof(ready_uris))
     {
       foreach(indices(ready_uris), string ready_uri)
-	if(ready_uris[ready_uri] != 2)
+	if(ready_uris[ready_uri] < 2)
 	{
 	  ready_uris[ready_uri]=2;
 	  return Standards.URI(ready_uri);
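For what it's worth, here is a small stand-alone sketch of why the one-character change matters, assuming a URI denied by robots.txt ends up parked in a stage other than 2 (for example the error stage); the example URI is made up:

  // Hypothetical test of the condition change; not part of Crawler.pmod.
  // Assumes a denied URI has been moved to stage 6 ("error").
  int main()
  {
    mapping(string:int) ready_uris = ([ "http://example.com/denied": 6 ]);

    foreach(indices(ready_uris), string ready_uri)
      // Old test: ready_uris[ready_uri] != 2 is true for stage 6, so the
      // denied URI would be reset to 2 and handed out again on every pass.
      // New test: 6 < 2 is false, so the denied URI is skipped and the
      // queue can drain instead of looping.
      if(ready_uris[ready_uri] < 2)
      {
        ready_uris[ready_uri] = 2;
        write("would return %s\n", ready_uri);
      }
    return 0;
  }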