Here are the docstrings for my set of regexp operators. I doubt much else of my code would be of any use here.
//! @decl RxNode any();
//!
//! Matches any symbol.

//! @decl RxNode seq (LaxRxType... regexps);
//!
//! A sequence. If an array is used as a sub-regexp it's converted to
//! this.

//! @decl RxNode seq_or (LaxRxType... regexps);
//!
//! Like @[Rx.or], but keeps the order between the sub-regexps, so
//! that if two or more of them match the same input, it's always
//! the match in the first one that's returned first. If all
//! alternative matches are requested, they're enumerated in the order
//! that the sub-regexps match.
//!
//! @note
//! This is the union variant that most closely resembles the "|"
//! operator in most other regexp engines. However, in some cases it
//! cannot do as good a job to determinize as @[Rx.or], so if the
//! order isn't relevant, use that one instead.

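To illustrate the ordered "|" behaviour that the note compares seq_or to, here is a small sketch using Python's re module. Python is only a stand-in for "most other regexp engines" here; it says nothing about how Rx itself is implemented.

```python
import re

# Ordered alternation: the first alternative that matches is returned
# first, even though a later alternative would consume more input.
first = re.match(r"a|ab", "ab").group()

# Swapping the alternatives changes which match is returned first.
swapped = re.match(r"ab|a", "ab").group()

print(first)    # a
print(swapped)  # ab
```

With an unordered union such as Rx.or, the docstrings give no such guarantee about which alternative is reported first.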
//! @decl RxNode range (Symbol from, Symbol to);
//!
//! A range of all symbols between @[from] and @[to], inclusive.

//! @decl RxNode rep (LaxRxType regexp, void|int low, void|int high);
//!
//! Repetition, which can be upwardly bounded or unbounded. (In the
//! unbounded forms this includes "Kleene star" and "Kleene plus".)
//! The given regexp must match at least @[low] and at most @[high]
//! times. @[low] defaults to zero. There's no upper bound if @[high]
//! is left out or is negative. If @[high] isn't negative but less
//! than @[low], this matches nothing.
//!
//! @note
//! The first returned match is the longest possible one. Therefore
//! this operator is "greedy". There's also a non-greedy variant
//! @[Rx.lrep].
//!
//! Actually the above is not entirely correct; the first returned
//! match is really the first match of @[regexp], repeated as many
//! times as possible.
//!
//! For example, if @[regexp] matches @tt{"aa"@} and @tt{"a"@} in that
//! order, then the first match on @tt{"aaa"@} will have two
//! repetitions where @[regexp] matched @tt{"aa"@} and then @tt{"a"@},
//! and not three repetitions where each matched @tt{"a"@}.
//!
//! On the other hand, if @[regexp] is lazy and matches @tt{"a"@}
//! before @tt{"aa"@}, and if the repetition is upwardly bounded to
//! two repetitions, then the first match on @tt{"aaa"@} will be two
//! repetitions where each matched @tt{"a"@}. I.e. the first match is
//! not the longest possible one.

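The two "aaa" examples in the note can be reproduced with a minimal backtracking sketch in Python. The helpers lit, seq_or, rep and first below are hypothetical stand-ins written for this illustration, not the Rx implementation; each matcher is a generator that enumerates matches in preference order.

```python
def lit(s):
    """Matcher for the literal string s (hypothetical helper)."""
    def match(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s), (s,)
    return match


def seq_or(*alts):
    """Ordered union: alternatives are enumerated in the order given."""
    def match(text, pos):
        for alt in alts:
            yield from alt(text, pos)
    return match


def rep(rx, low=0, high=-1):
    """Greedy repetition: try one more repetition before offering to stop."""
    def match(text, pos, count=0):
        if high < 0 or count < high:
            for end, pieces in rx(text, pos):
                if end == pos:
                    continue  # ignore empty repetitions so the sketch terminates
                for final, rest in match(text, end, count + 1):
                    yield final, pieces + rest
        if count >= low:
            yield pos, ()
    return match


def first(rx, text):
    """Pieces matched by each repetition in the first enumerated match."""
    return next(rx(text, 0))[1]


greedy_inner = seq_or(lit("aa"), lit("a"))  # matches "aa" before "a"
lazy_inner = seq_or(lit("a"), lit("aa"))    # matches "a" before "aa"

unbounded = first(rep(greedy_inner), "aaa")
bounded = first(rep(lazy_inner, 0, 2), "aaa")
print(unbounded)  # ('aa', 'a')
print(bounded)    # ('a', 'a')
```

The bounded case stops after two repetitions of "a" and so covers only "aa", which is the "not the longest possible one" situation the note describes.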
//! @decl RxNode lrep (LaxRxType regexp, void|int low, void|int high);
//!
//! Like @[Rx.rep], but implements laziness: The first returned match
//! repeats the regexp as few times as possible within the limits,
//! whereas @[Rx.rep] repeats it as many times as possible.
//!
//! @note
//! The first returned match is actually the first match of @[regexp],
//! repeated as few times as possible.
//!
//! For example, if @[regexp] matches @tt{"a"@} and @tt{"aa"@} in that
//! order, then the first match on @tt{"aaa"@} will have three
//! repetitions where each @[regexp] matched @tt{"a"@}, and not two
//! repetitions where one of them matched @tt{"aa"@}.
//!
//! On the other hand, if @[regexp] is greedy and matches @tt{"aa"@}
//! before @tt{"a"@}, and if the repetition must match at least once,
//! then the first match on @tt{"aaa"@} will be one repetition where
//! @[regexp] matched @tt{"aa"@} and not @tt{"a"@}. I.e. the first
//! match is not the shortest possible one.

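The lazy examples can be sketched the same way in Python. Again lit, seq_or, lrep and the two first_* helpers are hypothetical names for this illustration only. The sketch assumes that the first example in the note refers to a match of the whole input "aaa", and the second to a match of a prefix.

```python
def lit(s):
    """Matcher for the literal string s (hypothetical helper)."""
    def match(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s), (s,)
    return match


def seq_or(*alts):
    """Ordered union: alternatives are enumerated in the order given."""
    def match(text, pos):
        for alt in alts:
            yield from alt(text, pos)
    return match


def lrep(rx, low=0, high=-1):
    """Lazy repetition: offer to stop before trying one more repetition."""
    def match(text, pos, count=0):
        if count >= low:
            yield pos, ()
        if high < 0 or count < high:
            for end, pieces in rx(text, pos):
                if end == pos:
                    continue  # ignore empty repetitions so the sketch terminates
                for final, rest in match(text, end, count + 1):
                    yield final, pieces + rest
    return match


def first_prefix(rx, text):
    """Pieces of the first enumerated match starting at position 0."""
    return next(rx(text, 0))[1]


def first_whole(rx, text):
    """Pieces of the first enumerated match that covers the whole input."""
    return next(p for end, p in rx(text, 0) if end == len(text))


lazy_inner = seq_or(lit("a"), lit("aa"))    # matches "a" before "aa"
greedy_inner = seq_or(lit("aa"), lit("a"))  # matches "aa" before "a"

whole = first_whole(lrep(lazy_inner), "aaa")
at_least_once = first_prefix(lrep(greedy_inner, 1), "aaa")
print(whole)          # ('a', 'a', 'a')
print(at_least_once)  # ('aa',)
```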
//! @decl RxNode opt (LaxRxType regexp);
//!
//! Match the regexp optionally, i.e. like
//! @tt{@[Rx.rep] (@[regexp], 0, 1)@}.
//!
//! @note
//! In the case where it's possible to both match the regexp and not
//! match it, the first returned match will be with the regexp. I.e.
//! this operator is "greedy" just like @[Rx.rep]. There's also a
//! non-greedy variant @[Rx.lopt].

//! @decl RxNode lopt (LaxRxType regexp);
//!
//! Match the regexp optionally and lazily, i.e. like
//! @tt{@[Rx.lrep] (@[regexp], 0, 1)@}. So whenever it's possible to
//! not match the regexp, the first returned match won't match it.

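The difference between the greedy and the lazy optional match corresponds to "?" versus "??" in Perl-style syntax. A short sketch with Python's re module, used here only as a familiar engine with the same distinction:

```python
import re

# Greedy optional: when both choices are possible, the first match
# includes the optional part.
greedy = re.match(r"(?:ab)?", "ab").group()

# Lazy optional: when it is possible to skip the optional part, the
# first match skips it.
lazy = re.match(r"(?:ab)??", "ab").group()

# The lazy form still matches the optional part when the rest of the
# pattern cannot succeed otherwise.
forced = re.fullmatch(r"(?:ab)??c", "abc").group()

print(repr(greedy))  # 'ab'
print(repr(lazy))    # ''
print(repr(forced))  # 'abc'
```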
//! @decl RxNode str (string literal);
//!
//! A literal string. If a string is used as a sub-regexp, it's
//! converted to this. Technically this is a syntax parser that treats
//! its whole input as a literal.

//! @decl RxNode set_str (string chars);
//!
//! A set of symbols parsed from a string.

//! @decl RxNode save (LaxRxType regexp, void|string name);
//!
//! Saves the match of @[regexp] for later retrieval. If @[name] is
//! given, it's used as a name to identify the saved submatch,
//! otherwise it's accessed by position.
//!
//! The position is determined by counting the start of each unnamed
//! submatch as they are encountered from left to right, beginning at
//! zero. Note that this might not be well defined if e.g. @tt{(< >)@}
//! or @tt{([ ])@} is used to build the regexp tree.
//!
//! If @[regexp] matches several times (typically when used inside a
//! repetition) every match overwrites the preceding one, so only the
//! last match is available afterwards.

//! @decl RxNode saveall (LaxRxType regexp, void|string name);
//!
//! Like @[Rx.save], but if @[regexp] matches several times (typically
//! when used inside a repetition) then all those matches are saved.
//! The saved value is an array of the matches, in the order they are
//! found.

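The overwrite behaviour of save inside a repetition is the same as for an ordinary capture group in Perl-style engines. The sketch below shows that with Python's re module, and emulates the saveall result by applying the sub-pattern repeatedly. The emulation is an assumption made for illustration; Python's re has no direct counterpart to saveall.

```python
import re

text = "10,20,30"

# A capture group inside a repetition keeps only its last match,
# like Rx.save is described to do.
m = re.fullmatch(r"(?:(?P<item>\d+),?)+", text)
last_only = m.group("item")

# Collecting every match in the order found, which is what the
# saveall docstring describes, has to be emulated here.
all_items = [sub.group("item") for sub in re.finditer(r"(?P<item>\d+),?", text)]

print(last_only)  # 30
print(all_items)  # ['10', '20', '30']
```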
To put the operators above in some perspective, here are the others that I think would be a bit difficult to include in the pcre glue:
//! @decl RxNode sym (Symbol... symbols);
//!
//! A sequence of symbols. The difference from @[Rx.seq] is that the
//! elements are treated as literal symbols and not regexps. This is
//! only necessary when the symbols are of a type that otherwise would
//! be interpreted as something else, e.g. strings.

//! @decl RxNode pair (Symbol from, Symbol to);
//!
//! The pair @tt{@[from]/@[to]@}, where the symbol @[from] in the
//! input is mapped to @[to] in the output. The result is thus a
//! transducer.

//! @decl RxNode or (LaxRxType... regexps);
//!
//! A union; matches if any of the arguments match. If a multiset is
//! used as a sub-regexp it's converted to this.
//!
//! @note
//! When given no arguments, this doesn't match anything at all.
//!
//! @note
//! This operator tries to get as good determinization as possible by
//! allowing any match order between the alternatives. It's therefore
//! effectively "greedy" to the extent that determinization succeeds,
//! but that can't be counted on since determinization isn't
//! guaranteed to be complete. There's also the @[Rx.seq_or] variant
//! that always matches the alternatives in the order they are given
//! (which most closely resembles the behavior in other common regexp
//! engines).

//! @decl RxNode and (LaxRxType... regexps);
//!
//! Intersection; matches only when all the arguments match.

//! @decl RxNode neg (LaxRxType regexp);
//!
//! Negation; matches everything that @[regexp] doesn't match.

//! @decl RxNode sub (LaxRxType a, LaxRxType b);
//!
//! Subtraction; matches when @[a] but not @[b] matches.

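The intended meaning of and, neg and sub can be sketched by treating each regexp as a predicate on whole strings. This is only a model of the set semantics in Python, with hypothetical helper names; it is not how an automaton-based engine would build these operators, and it is part of why they are hard to express in pcre syntax.

```python
import re


def rx(pattern):
    """Turn a pattern into a predicate on whole strings."""
    compiled = re.compile(pattern)
    return lambda s: compiled.fullmatch(s) is not None


def and_(*preds):
    """Intersection: every argument must match."""
    return lambda s: all(p(s) for p in preds)


def neg(pred):
    """Negation: matches exactly what pred does not match."""
    return lambda s: not pred(s)


def sub(a, b):
    """Subtraction: a matches but b does not."""
    return and_(a, neg(b))


lower_word = rx(r"[a-z]+")
starts_with_a = rx(r"a.*")

both = and_(lower_word, starts_with_a)
not_a = sub(lower_word, starts_with_a)

print(both("abc"), both("bcd"), both("aBC"))  # True False False
print(not_a("bcd"), not_a("abc"))             # True False
print(neg(lower_word)("123"))                 # True
```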
//! @decl RxNode set (Symbol... symbols);
//!
//! A set of symbols. Much like @[Rx.or], but the elements are treated
//! as literal symbols and not regexps.

//! @decl RxNode map (LaxRxType from, LaxRxType to);
//!
//! Maps the regexp @[from] to the regexp @[to]. Both must be
//! recognizers and the result is a transducer. If a mapping with a
//! single element is used as a sub-regexp, it's converted to this (a
//! mapping with more elements becomes the union of the pairs in
//! it).
//!
//! (Technically, this is the cross product of @[from] and @[to], i.e.
//! the set of string pairs @tt{a/b@}, where @tt{a@} matches @[from]
//! and @tt{b@} matches @[to].)

//! @decl RxNode test (function(DataList,void|Rx.Rx.Process:int) func, @
//!                    void|int low, void|int high);
//! @decl RxNode test (function(DataList,void|Rx.Rx.Process:int) func, @
//!                    LaxRxType regexp);
//!
//! Calls @[func] to test whether there's a match at this position.
//!
//! The function will be called with a piece of the input and should
//! return nonzero if the whole piece matches, zero otherwise. The
//! second argument to the function is the current @[Rx.Rx.Process]
//! object. Although it can't be used to reliably look at the input it
//! might be useful to look at flags, e.g. @[Rx.Rx.Process.DEBUG_LOG].
//!
//! If @[low] and/or @[high] is given, they give the lower and upper
//! limit of the length of the string that can possibly be matched by
//! @[func]. @[low] defaults to zero. There's no upper bound if
//! @[high] is left out or is negative.
//!
//! If @[regexp] is given, only input which it matches will be tested
//! with @[func].
//!
//! @note
//! If the possible matches aren't screened with @[regexp] or a narrow
//! @[low]/@[high] interval, it's likely that the test function is
//! called excessively often.

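The screening idea in the note can be sketched in Python: a cheap regexp selects candidate pieces, and only those are passed to the test function. The names find_screened and is_octet, and the call counter, are hypothetical and exist only for this illustration.

```python
import re

calls = []


def is_octet(piece):
    """Test function: nonzero (True) if the whole piece is a number <= 255."""
    calls.append(piece)  # record how often the test function runs
    return int(piece) <= 255


def find_screened(text, screen, func):
    """Return the pieces that match the screening regexp and pass func."""
    return [m.group() for m in re.finditer(screen, text) if func(m.group())]


octets = find_screened("10 999 255 abc 7", r"\d{1,3}", is_octet)
print(octets)  # ['10', '255', '7']
print(calls)   # ['10', '999', '255', '7']
```

The test function is called four times here, once per screened candidate, rather than once for every possible piece of the input.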
/ Martin Stjernholm, Roxen IS
Previous text:
2003-09-21 15:43: Subject: wish: string with other quoting then \
Just changing the regexp quote character to something else would make a simple rule.
Of course.
It'd be very simple to implement a similar object/function interface in your pcre glue. It'd just be a set of functions that internally converts to pcre regexp syntax. I can provide the design I've made for that; it's very straightforward.
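A minimal sketch of what such a set of functions could look like, written in Python for brevity rather than in Pike. The function names follow the docstrings, but everything else is an assumption for illustration: the helpers _group and _rep, the generated syntax, and the use of "(?!)" for a repetition that matches nothing. It is not the actual design referred to here.

```python
import re


def _group(rx):
    """Wrap a pattern in a non-capturing group (hypothetical helper)."""
    return "(?:" + rx + ")"


def str_(literal):
    """A literal string, with regexp metacharacters quoted."""
    return re.escape(literal)


def range_(lo, hi):
    """All symbols between lo and hi, inclusive."""
    return "[%s-%s]" % (re.escape(lo), re.escape(hi))


def seq(*rxs):
    return "".join(_group(r) for r in rxs)


def seq_or(*rxs):
    return _group("|".join(_group(r) for r in rxs))


def _rep(rx, low, high, lazy):
    if 0 <= high < low:
        return "(?!)"  # assumption: a pattern that never matches
    bounds = "{%d,}" % low if high < 0 else "{%d,%d}" % (low, high)
    return _group(rx) + bounds + ("?" if lazy else "")


def rep(rx, low=0, high=-1):
    return _rep(rx, low, high, lazy=False)


def lrep(rx, low=0, high=-1):
    return _rep(rx, low, high, lazy=True)


def opt(rx):
    return rep(rx, 0, 1)


def lopt(rx):
    return lrep(rx, 0, 1)


def save(rx, name=None):
    return "(?P<%s>%s)" % (name, rx) if name else "(" + rx + ")"


pattern = seq(save(rep(range_("0", "9"), 1), "num"), opt(str_(".")))
print(pattern)
print(re.fullmatch(pattern, "42.").group("num"))  # 42
```

The generated string is checked against Python's re here only because it accepts the same syntax for these constructs.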
That's true. I'm currently at the point of starting to write the Pike-level glue for Regexp.PCRE... Was there a start of that somewhere? I can't seem to find it.
/ Mirar