Previous: , Up: System Definitions   [Contents][Index]


17.1 Regular Expressions

The function string-match (*Index string-match::) is used to match a regular expression against a string. If the variable *case-fold-search* is not nil, case is ignored in the match. To determine the extent of the match use *Index match-beginning:: and *Index match-end::.

Regular expressions are implemented using Henry Spencer’s package (thank you Henry!), and much of the description of regular expressions below is copied verbatim from his manual entry. Code for delimited searches, case insensitive searches, and speedups to allow fast searching of long files was contributed by W. Schelter. The speedups use an adaptation by Schelter of the Boyer and Moore string search algorithm to the case of branched regular expressions. These allow such expressions as ’not_there|really_not’ to be searched for 30 times faster than in GNU emacs (1995), and 200 times faster than in the original Spencer method. Expressions such as [a-u]bcdex get a speedup of 60 and 194 times respectively. This is based on searching a string of 50000 characters (such as the file tk.lisp).

Ordering Multiple Matches

In general there may be more than one way to match a regular expression to an input string. For example, consider the command

 (string-match "(a*)b*"  "aabaaabb")

Considering only the rules given so far, the value of (list-matches 0 1) might be ("aabb" "aa") or ("aaab" "aaa") or ("ab" "a") or any of several other combinations. To resolve this potential ambiguity string-match chooses among alternatives using the rule first then longest. In other words, it considers the possible matches in order working from left to right across the input string and the pattern, and it attempts to match longer pieces of the input string before shorter ones. More specifically, the following rules apply in decreasing order of priority:

In the example from above, (a*)b* matches aab: the (a*) portion of the pattern is matched first and it consumes the leading aa; then the b* portion of the pattern consumes the next b. Or, consider the following example:

 (string-match "(ab|a)(b*)c"  "xabc") ==> 1
 (list-matches 0 1 2 3) ==> ("abc" "ab" "" NIL)
 (match-beginning 0) ==> 1
 (match-end 0) ==> 4
 (match-beginning 1) ==> 1
 (match-end 1) ==> 3
 (match-beginning 2) ==> 3
 (match-end 2) ==> 3
 (match-beginning 3) ==> -1
 (match-end 3) ==> -1

In the above example the return value of 1 (which is > -1) indicates that a match was found. The entire match runs from 1 to 4. Rule 4 specifies that (ab|a) gets first shot at the input string and Rule 2 specifies that the ab sub-expression is checked before the a sub-expression. Thus the b has already been claimed before the (b*) component is checked and (b*) must match an empty string.

The special characters in the string "\()[]+.*|^$?", must be quoted, if a simple string search is desired. The function re-quote-string is provided for this purpose.

(re-quote-string "*standard*") ==> "\\*standard\\*"

(string-match (re-quote-string "*standard*") "X *standard* ")
 ==> 2

(string-match "*standard*" "X *standard* ")
Error: Regexp Error: ?+* follows nothing

Note there is actually just one \ before the * but the printer makes two so that the string can be read, since \ is also the lisp quote character. In the last example an error is signalled since the special character * must follow an atom if it is interpreted as a regular expression.


Previous: , Up: System Definitions   [Contents][Index]