Web.Crawler.Crawler calls queue->set_stage(real_uri, 6); when a URI is denied by a robots.txt exclusion. Does anyone know what 6 means? It looks like it's just ignored later on in MemoryQueue()->get(). The result is that any time a Crawler hits a URI denied by robots.txt, it loops forever: it checks that URI, calls the error_callback, and then leaves it in the queue to be checked again.
Adam
This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
Here are the different stages:
0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
In that case the code would be more readable with an enum or some defines. Patches accepted.
Yes, but using an enum has the added advantage of giving you a compilation error if you mistype one of the names.
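For illustration, a minimal Pike sketch of what such named stages could look like. The STAGE_* names are made up here and are not part of the module; only the numeric values come from the list above:

// Hypothetical named constants for the queue stages; the numeric
// values match the list above, the names are illustrative only.
enum Stage {
  STAGE_WAITING  = 0,
  STAGE_FETCHING = 1,
  STAGE_FETCHED  = 2,
  STAGE_FILTERED = 3,
  STAGE_INDEXED  = 4,
  STAGE_COMPLETE = 5,
  STAGE_ERROR    = 6
};

// With these in place the robots.txt branch could read e.g.
//   queue->set_stage(real_uri, STAGE_ERROR);
// and a typo such as STAGE_EROR fails at compile time instead of
// silently feeding a wrong integer to the queue.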
On Thu, 25 May 2006 08:30:01 +0000 (UTC) "Johan Schön (Opera Software, Firefruit) @ Pike (-) developers forum" 10353@lyskom.lysator.liu.se wrote:
> This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
> Here are the different stages:
> 0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
These don't seem to be used consistently, and in attempting to clean that up I just broke things. So, to fix only the looping problem when a URI is denied by robots.txt, does this look good?
Index: Crawler.pmod
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.7/lib/modules/Web.pmod/Crawler.pmod,v
retrieving revision 1.24
diff -u -r1.24 Crawler.pmod
--- Crawler.pmod	19 May 2006 19:15:30 -0000	1.24
+++ Crawler.pmod	25 May 2006 20:36:07 -0000
@@ -457,7 +457,7 @@
   if(sizeof(ready_uris))
   {
     foreach(indices(ready_uris), string ready_uri)
-      if(ready_uris[ready_uri] != 2)
+      if(ready_uris[ready_uri] < 2)
       {
         ready_uris[ready_uri]=2;
         return Standards.URI(ready_uri);
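For anyone following along, here is a tiny standalone Pike sketch (not the real MemoryQueue, just a model of the stage test inside get()) showing why the old "!= 2" test keeps handing out a URI parked at stage 6 ("error") by a robots.txt exclusion, while "< 2" leaves it alone:

// Standalone sketch modelling only the stage test from get().
mixed pick_next(mapping(string:int) stages, function(int:int) eligible)
{
  foreach(indices(stages), string uri)
    if(eligible(stages[uri]))
    {
      stages[uri] = 2;          // mark as fetched, as get() does
      return uri;
    }
  return 0;                     // nothing ready
}

int main()
{
  // stage 6 is what Crawler sets when robots.txt denies the URI
  mapping(string:int) denied = ([ "http://example.com/denied": 6 ]);

  // old test: 6 != 2, so the denied URI is returned and retried forever
  write("!= 2 hands out: %O\n",
        pick_next(copy_value(denied), lambda(int s) { return s != 2; }));

  // patched test: only stages 0 and 1 qualify, so the denied URI is skipped
  write("<  2 hands out: %O\n",
        pick_next(copy_value(denied), lambda(int s) { return s < 2; }));
  return 0;
}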