On Thu, 25 May 2006 08:30:01 +0000 (UTC) "Johan Schön (Opera Software, Firefruit) @ Pike (-) developers forum" 10353@lyskom.lysator.liu.se wrote:
This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
Here are the different stages:
0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
These don't seem to be getting used consistently, and in attempting to clean that up I have just broken things. So, to just fix the looping problem when a URI is denied by robots.txt, does this look good?
Index: Crawler.pmod
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.7/lib/modules/Web.pmod/Crawler.pmod,v
retrieving revision 1.24
diff -u -r1.24 Crawler.pmod
--- Crawler.pmod	19 May 2006 19:15:30 -0000	1.24
+++ Crawler.pmod	25 May 2006 20:36:07 -0000
@@ -457,7 +457,7 @@
     if(sizeof(ready_uris))
     {
       foreach(indices(ready_uris), string ready_uri)
-	if(ready_uris[ready_uri] != 2)
+	if(ready_uris[ready_uri] < 2)
 	{
 	  ready_uris[ready_uri]=2;
 	  return Standards.URI(ready_uri);
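For what it's worth, here is a small stand-alone sketch of why the one-character change matters, assuming a URI denied by robots.txt ends up parked in a stage other than 2 (for example the error stage); the example URI is made up:

  // Hypothetical test of the condition change; not part of Crawler.pmod.
  // Assumes a denied URI has been moved to stage 6 ("error").
  int main()
  {
    mapping(string:int) ready_uris = ([ "http://example.com/denied": 6 ]);

    foreach(indices(ready_uris), string ready_uri)
      // Old test: ready_uris[ready_uri] != 2 is true for stage 6, so the
      // denied URI would be reset to 2 and handed out again on every pass.
      // New test: 6 < 2 is false, so the denied URI is skipped and the
      // queue can drain instead of looping.
      if(ready_uris[ready_uri] < 2)
      {
        ready_uris[ready_uri] = 2;
        write("would return %s\n", ready_uri);
      }
    return 0;
  }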