Web.Crawler.Crawler calls queue->set_stage(real_uri, 6); when a URI is denied by a robots.txt exclusion. Does anyone know what 6 means? It looks like it's just ignored later on in MemoryQueue()->get(). The result is that any time a Crawler hits a URI denied by robots.txt, it loops forever: it checks that URI, calls the error_callback, and then leaves it in the queue to be checked again.
Adam
This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
Here are the different stages:
0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
In that case the code would be more readable with an enum or some defines. Patches accepted.
Yes, but using an enum has the added advantage of giving you a compilation error if you mistype one of the names.
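For illustration, a minimal Pike sketch of what such named stages could look like. The STAGE_* names are made up here and are not part of the module; only the numeric values come from the list above:

// Hypothetical named constants for the queue stages; the numeric
// values match the list above, the names are illustrative only.
enum Stage {
  STAGE_WAITING  = 0,
  STAGE_FETCHING = 1,
  STAGE_FETCHED  = 2,
  STAGE_FILTERED = 3,
  STAGE_INDEXED  = 4,
  STAGE_COMPLETE = 5,
  STAGE_ERROR    = 6
};

// With these in place the robots.txt branch could read e.g.
//   queue->set_stage(real_uri, STAGE_ERROR);
// and a typo such as STAGE_EROR fails at compile time instead of
// silently feeding a wrong integer to the queue.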
On Thu, 25 May 2006 08:30:01 +0000 (UTC) "Johan Schön (Opera Software, Firefruit) @ Pike (-) developers forum" 10353@lyskom.lysator.liu.se wrote:
> This should probably have been documented somewhere in the Web.Crawler module (sorry about that, although it's quite some time ago now):
> Here are the different stages:
> 0: "waiting" 1: "fetching" 2: "fetched" 3: "filtered" 4: "indexed" 5: "completed" 6: "error"
These don't seem to be used consistently, and in attempting to clean that up I just broke things. So, to fix only the looping problem when a URI is denied by robots.txt, does this look good?
Index: Crawler.pmod
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.7/lib/modules/Web.pmod/Crawler.pmod,v
retrieving revision 1.24
diff -u -r1.24 Crawler.pmod
--- Crawler.pmod	19 May 2006 19:15:30 -0000	1.24
+++ Crawler.pmod	25 May 2006 20:36:07 -0000
@@ -457,7 +457,7 @@
   if(sizeof(ready_uris))
   {
     foreach(indices(ready_uris), string ready_uri)
-      if(ready_uris[ready_uri] != 2)
+      if(ready_uris[ready_uri] < 2)
       {
         ready_uris[ready_uri]=2;
         return Standards.URI(ready_uri);
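For anyone following along, here is a tiny standalone Pike sketch (not the real MemoryQueue, just a model of the stage test inside get()) showing why the old "!= 2" test keeps handing out a URI parked at stage 6 ("error") by a robots.txt exclusion, while "< 2" leaves it alone:

// Standalone sketch modelling only the stage test from get().
mixed pick_next(mapping(string:int) stages, function(int:int) eligible)
{
  foreach(indices(stages), string uri)
    if(eligible(stages[uri]))
    {
      stages[uri] = 2;          // mark as fetched, as get() does
      return uri;
    }
  return 0;                     // nothing ready
}

int main()
{
  // stage 6 is what Crawler sets when robots.txt denies the URI
  mapping(string:int) denied = ([ "http://example.com/denied": 6 ]);

  // old test: 6 != 2, so the denied URI is returned and retried forever
  write("!= 2 hands out: %O\n",
        pick_next(copy_value(denied), lambda(int s) { return s != 2; }));

  // patched test: only stages 0 and 1 qualify, so the denied URI is skipped
  write("<  2 hands out: %O\n",
        pick_next(copy_value(denied), lambda(int s) { return s < 2; }));
  return 0;
}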