Hi,
There is one problem with Regexp.PCRE when study() is called (or Regexp.PCRE.Studied()). exec() code (in cmod) contains the following:
---snip---
#ifdef PCRE_EXTRA_STUDY_DATA
   if (THIS->extra) opts|=PCRE_EXTRA_STUDY_DATA;
#else
   /* FIXME: Throw an error if THIS->extra is set? */
#endif /* PCRE_EXTRA_STUDY_DATA */
---snip---
opts is later passed to pcre_exec(), but this is incorrect: the flag should be set not in the call to pcre_exec(), but in the pcre_extra struct (field "flags"). Hence, any call to exec()/split() etc. on a studied PCRE gives an error (with PCRE 4.4):
Regexp re = Regexp.PCRE("([a-z]+)"); re->split("123abc123");
error returned from exec: ERROR.BADOPTION
With PCRE 3.9 it passes, but only because the symbol PCRE_EXTRA_STUDY_DATA is not defined there.
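To make the failure mode concrete, here is a minimal, self-contained model of it in C. It does not call PCRE: the MODEL_* names, struct model_extra and model_exec() are hypothetical stand-ins, and the numeric values are assumptions modelled on a PCRE 4.x pcre.h, so check your own header. The idea it sketches is that an exec-style function rejects any option bit it does not know, while the "study data is valid" bit belongs in the flags field of the extra block.

```c
#include <stddef.h>

/* Assumed values, modelled on a PCRE 4.x pcre.h; not taken from the thread. */
#define MODEL_EXTRA_STUDY_DATA 0x0001   /* lives in the extra block's flags */
#define MODEL_ANCHORED         0x0010   /* exec-time options */
#define MODEL_NOTBOL           0x0080
#define MODEL_NOTEOL           0x0100
#define MODEL_NOTEMPTY         0x0400
#define MODEL_ERROR_BADOPTION  (-3)

#define MODEL_PUBLIC_EXEC_OPTIONS \
  (MODEL_ANCHORED | MODEL_NOTBOL | MODEL_NOTEOL | MODEL_NOTEMPTY)

/* Hypothetical stand-in for pcre_extra. */
struct model_extra {
  unsigned long flags;   /* which fields below are valid */
  void *study_data;      /* filled in by the study step */
};

/* Hypothetical stand-in for the option check done by an exec call:
 * any bit outside the known exec options is refused. */
int model_exec(const struct model_extra *extra, int options)
{
  (void)extra;   /* the study data itself is not used in this model */
  if (options & ~MODEL_PUBLIC_EXEC_OPTIONS)
    return MODEL_ERROR_BADOPTION;
  return 0;
}

/* Mirrors the snippet quoted above: the flag is OR'ed into the options. */
int exec_buggy(const struct model_extra *extra, int opts)
{
  if (extra) opts |= MODEL_EXTRA_STUDY_DATA;
  return model_exec(extra, opts);
}

/* Proposed behaviour: pass the extra block through and leave opts alone. */
int exec_fixed(const struct model_extra *extra, int opts)
{
  return model_exec(extra, opts);
}
```

In this model exec_buggy() fails only when an extra block is present, which matches the report that only studied regexps are affected.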
Additionally, PCRE doc says (man pcreapi):
---snip---
Other flag bits should be set to zero. The study_data field is set in the pcre_extra block that is returned by pcre_study(), together with the appropriate flag bit. You should not set this yourself, but you can add to the block by setting the other fields.
---snip---
So, there is absolutely no need to pass any options to the pcre_exec() call while handling exec().
I would commit a fix, but I am not sure how to regenerate the file pcre_glue.cmod.compiled. Or is it enough to change the cmod, and the .compiled file will be (re)generated automatically?
Also, a side note: the ovector size is defined as 3000. Isn't this overkill for most cases? It consumes 12K of stack space on every call to exec() (at least this space is not dynamically allocated, which would be even worse).
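For reference, the 12K figure is just the slot count times sizeof(int). The sketch below assumes the ovector is a local int array, as the message implies; OVECTOR_SLOTS and both function names are made up for illustration. It also shows how many offset pairs such a vector can return, since pcre_exec() uses only the first two thirds of the vector for (start, end) pairs and keeps the rest as workspace.

```c
#include <stddef.h>

/* Slot count from the message above; the macro name is hypothetical. */
#define OVECTOR_SLOTS 3000

/* Bytes of stack taken by a local `int ovector[slots]`. */
size_t ovector_bytes(size_t slots)
{
  return slots * sizeof(int);
}

/* Number of (start, end) offset pairs that fit: two thirds of the
 * slots hold pairs, two slots per pair, so slots / 3 in total.
 * Pair 0 is the whole match; the rest are capturing subpatterns. */
size_t ovector_pairs(size_t slots)
{
  return slots / 3;
}
```

With a 4-byte int, ovector_bytes(OVECTOR_SLOTS) is 12000 bytes, and 3000 slots leave room for the whole match plus 999 captured substrings.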
Regards,
/Al
> Regexp re = Regexp.PCRE("([a-z]+)"); re->split("123abc123");
> error returned from exec: ERROR.BADOPTION
Oops. Good observation.
> Also, a side note: the ovector size is defined as 3000. Isn't this overkill for most cases? It consumes 12K of stack space on every call to exec() (at least this space is not dynamically allocated, which would be even worse).
12K isn't much, since the stack is at least 2 MB on any machine I know. And the ovector is quickly used up if you use recursive regexps, which you can do with PCRE.
Everything is generated from the .cmod file, so patch it and see if it gets any better.
/ Mirar
Previous text:
2004-06-27 02:56: Subject: Regexp.PCRE problem
/ Brevbäraren
> In the last episode (Jun 27), Mirar @ Pike developers forum said:
>>> Regexp re = Regexp.PCRE("([a-z]+)"); re->split("123abc123");
>>> error returned from exec: ERROR.BADOPTION
>> Oops. Good observation.
>>> Also, a side note: the ovector size is defined as 3000. Isn't this overkill for most cases? It consumes 12K of stack space on every call to exec() (at least this space is not dynamically allocated, which would be even worse).
>> 12K isn't much, since the stack is at least 2 MB on any machine I know. And the ovector is quickly used up if you use recursive regexps, which you can do with PCRE.
> The thread stack may be significantly less, though. Back in 2002 I determined the default stack size for threaded programs on a bunch of OSes:
>   AIX       96K
>   FreeBSD   64K
>   HP-UX     64K
>   IRIX     128K
>   Linux      2M
>   Solaris    1M
>   Tru64      5M
> On 32-bit platforms, a large stack size really eats into your memory. 200 threads would require 400MB of space just for the stacks.
Yes, of course. But PCRE->exec() is rarely recursed (can you recurse it?), and even 64K is enough for several recursions.
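The numbers in this exchange are easy to check. The following is only a back-of-the-envelope sketch: the function names are invented here, the 12000-byte frame is the ovector alone (3000 slots, assuming a 4-byte int), and everything else that lives on the stack is ignored.

```c
#include <assert.h>

/* Total bytes reserved when each of `threads` threads gets its own
 * stack of stack_bytes. */
unsigned long long total_stack_bytes(unsigned threads,
                                     unsigned long long stack_bytes)
{
  return (unsigned long long)threads * stack_bytes;
}

/* Upper bound on nested calls whose frames are frame_bytes each,
 * counting nothing else on the stack. */
unsigned long long frames_that_fit(unsigned long long stack_bytes,
                                   unsigned long long frame_bytes)
{
  assert(frame_bytes > 0);   /* a zero-sized frame makes no sense here */
  return stack_bytes / frame_bytes;
}
```

Under these assumptions, 200 threads at 2M each reserve 400MB, and a 64K thread stack holds at most five nested 12000-byte ovectors, so "several recursions" is about the limit on the smaller defaults listed above.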
/ Mirar
Previous text:
2004-06-27 19:00: Subject: Re: Regexp.PCRE problem
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
pike-devel@lists.lysator.liu.se