pregexp is a portable regular expression library for Scheme (R4RS, R5RS, and R6RS), with a Common Lisp variant. It provides Perl-compatible pattern syntax: numeric quantifiers, non-greedy matching, capture groups, POSIX character classes, case-insensitive modes, backreferences, alternation, and lookaround assertions (both lookahead and lookbehind).
Load pregexp.scm into your Scheme environment (module installation is also supported via the distribution's install instructions). The Common Lisp version resides in pregexp.lisp. All examples below use Scheme syntax but translate readily to Common Lisp.
Core Procedures
pregexp.scm exports seven primary procedures: pregexp, pregexp-match-positions, pregexp-match, pregexp-split, pregexp-replace, pregexp-replace*, and pregexp-quote.
pregexp
Compiles a string-based regex (U-regexp) into an S-expression representation (S-regexp):
(pregexp "b.t")
;; => (:sub (:or (:seq #\b :any #\t)))
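A compiled S-regexp can be passed wherever a pattern string is accepted, which avoids recompiling the same pattern on every call. The following is a minimal sketch of that usage; the name word-re is only illustrative and is not part of the library.

```scheme
;; Compile once, then reuse the S-regexp across many calls.
;; word-re is a hypothetical name chosen for this example.
(define word-re (pregexp "[a-z]+"))

(map (lambda (s) (pregexp-match word-re s))
     '("42 apples" "7 pears" "99"))
;; => (("apples") ("pears") #f)
```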
pregexp-match-positions
Returns #f for failed matches or a list of index pairs for successful matches:
(pregexp-match-positions "heart" "hard")
;; => #f
(pregexp-match-positions "thread" "sew thread here")
;; => ((4 . 10))
(substring "sew thread here" 4 10)
;; => "thread"
Optional third and fourth arguments restrict matching to a substring; the returned positions still index into the full string:
(pregexp-match-positions "thread"
"find thread here, more thread there"
15 30)
;; => ((23 . 29))
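The start and end arguments also make it possible to walk through a string and collect every match. The sketch below assumes a helper named all-match-positions, which is hypothetical and not exported by pregexp; it simply restarts the search at the end of the previous match.

```scheme
;; Hypothetical helper, not part of pregexp: collect every match span.
(define (all-match-positions pat str)
  (let ((re (if (string? pat) (pregexp pat) pat)) ; compile once
        (n (string-length str)))
    (let loop ((start 0) (acc '()))
      (let ((m (and (<= start n)
                    (pregexp-match-positions re str start n))))
        (if (not m)
            (reverse acc)
            (let* ((span (car m))
                   ;; Step past an empty match so the loop always advances.
                   (next (if (= (car span) (cdr span))
                             (+ (cdr span) 1)
                             (cdr span))))
              (loop next (cons span acc))))))))

(all-match-positions "[0-9]+" "a1 b22 c333")
;; => ((1 . 2) (4 . 6) (8 . 11))
```

Only the span of the overall match is kept here; submatch positions are dropped for brevity.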
pregexp-match
Returns matched substrings directly rather than positions:
(pregexp-match "heart" "hard")
;; => #f
(pregexp-match "thread" "sew thread here")
;; => ("thread")
pregexp-split
Splits strings using a pattern delimiter:
(pregexp-split "," "alpha,beta,gamma,delta")
;; => ("alpha" "beta" "gamma" "delta")
(pregexp-split " " "split peas")
;; => ("split" "peas")
Empty pattern splits into characters:
(pregexp-split "" "fragment")
;; => ("f" "r" "a" "g" "m" "e" "n" "t")
Use " +" (not " *") for multi-space delimiters, since " *" also matches the empty string and would split between every character:
(pregexp-split " +" "split peas here")
;; => ("split" "peas" "here")
pregexp-replace
Replaces first match occurrence:
(pregexp-replace "re" "fibre" "er")
;; => "fiber"
pregexp-replace*
Replaces all matches:
(pregexp-replace* "re" "fibre spectre" "er")
;; => "fiber specter"
pregexp-quote
Escapes special regex characters in literal strings:
(pregexp-quote "lambda")
;; => "lambda"
(pregexp-quote "list?")
;; => "list\\?"
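A typical use is building a larger pattern around a string that must be matched literally. The sketch below compares a quoted and an unquoted literal; the variable name literal is only illustrative.

```scheme
;; literal is a hypothetical name for a string to be matched verbatim.
(define literal "list?")

;; Quoted: the ? is escaped, so it matches a literal question mark.
(pregexp-match (string-append "^" (pregexp-quote literal) "$") "list?")
;; => ("list?")

;; Unquoted: the ? acts as a quantifier on t, so the match fails.
(pregexp-match (string-append "^" literal "$") "list?")
;; => #f
```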
Pattern Syntax Reference
Basic Anchors
^ and $ anchor to string start/end:
(pregexp-match-positions "^start" "the start")
;; => #f
(pregexp-match-positions "end$" "the end is near the end")
;; => ((20 . 23))
\\b matches word boundaries:
(pregexp-match-positions "cat\\b" "cater to the cat")
;; => ((13 . 16))
\\B negates word boundaries:
(pregexp-match-positions "an\\B" "an analysis")
;; => ((3 . 5))
Characters and Character Classes
Literal characters match themselves. Meta-sequences \\n, \\r, \\t match newline, carriage return, and tab. Dot . matches any character except newline:
(pregexp-match "b.t" "bat")
;; => ("bat")
Square brackets define character classes:
(pregexp-match "b[aeiou]t" "bet")
;; => ("bet")
Ranges use hyphens: "[a-cx-z]" matches a, b, c, x, y, z. A leading ^ negates the class: "do[^g]" matches "dot" and "don" but not "dog". Most metacharacters lose their special meaning inside brackets (except ], -, and ^).
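The range and negation rules can be checked directly. This is a small illustrative sketch using strings chosen for this example:

```scheme
;; Two ranges in one class: a-c and x-z.
(pregexp-match "[a-cx-z]+" "maybe")
;; => ("ayb")

;; Negated class: "dog" is skipped, "dot" matches.
(pregexp-match "do[^g]" "dog dot")
;; => ("dot")
```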
Common Character Classes
\\d = digits [0-9], \\s = whitespace, \\w = word characters [A-Za-z0-9_]. Uppercase variants invert: \\D, \\S, \\W:
(pregexp-match "\\d\\d"
"room 101 has 2 beds")
;; => ("10")
POSIX Character Classes
POSIX classes use [:...:] syntax inside brackets:
[:alnum:] [:alpha:] [:ascii:] [:blank:] [:cntrl:] [:digit:]
[:graph:] [:lower:] [:print:] [:space:] [:upper:] [:word:] [:xdigit:]
(pregexp-match "[[:alpha:]_]" "--x--")
;; => ("x")
(pregexp-match "[[:alpha:]_]" "--_--")
;; => ("_")
Insert ^ after [: to invert: [:^alpha:] matches non-letters.
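As with the other POSIX classes, the inverted form is written inside a bracket expression. A brief sketch with an example string chosen for illustration:

```scheme
;; [:^alpha:] inside brackets matches any non-letter.
(pregexp-match "[[:^alpha:]]+" "abc123def")
;; => ("123")
```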
Quantifiers
* (0+), + (1+), ? (0 or 1):
(pregexp-match-positions "b[aeiou]*d" "bead")
;; => ((0 . 4))
(pregexp-match-positions "b[aeiou]+d" "bd")
;; => #f
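The ? quantifier is not shown above, so here is a short sketch of an optional character, using a pattern chosen for illustration:

```scheme
;; u? makes the u optional, so both spellings match.
(pregexp-match "colou?r" "color")
;; => ("color")
(pregexp-match "colou?r" "colour")
;; => ("colour")
```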
Numeric Quantifiers
{n} matches exactly n instances. {m,n} matches between m and n instances (inclusive). Omit m for 0, omit n for infinity:
(pregexp-match "[aeiou]{3}" "sequoia")
;; => ("uoi")
(pregexp-match "[aeiou]{2,3}" "evolve")
;; => #f
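An open-ended upper bound can be illustrated as follows. This sketch only covers the {m,} form, with an input string chosen for the example:

```scheme
;; {2,} requires at least two digits; the lone "1" is skipped.
(pregexp-match "[0-9]{2,}" "a1b234c")
;; => ("234")
```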
Non-Greedy Quantifiers
Append ? for minimal matching: *?, +?, ??, {m}?, {m,n}?:
(pregexp-match "\\{.*?\\}" "{tag1} {tag2} {tag3}")
;; => ("{tag1}")
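For contrast, the same pattern with a greedy quantifier runs to the last closing brace. This is shown as a comparison sketch on the same input:

```scheme
;; Greedy .* consumes as much as possible before the final \}.
(pregexp-match "\\{.*\\}" "{tag1} {tag2} {tag3}")
;; => ("{tag1} {tag2} {tag3}")
```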
Grouping
Parentheses (...) create capturing groups. Matches return the full match followed by submatches:
(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jul 4, 1776")
;; => ("jul 4, 1776" "jul" "4" "1776")
Groups combine with quantifiers:
(pregexp-match "(ha )*" "ha ha ha ha")
;; => ("ha ha ha " "ha ")
When quantified groups match multiple times, only the last capture appears:
(pregexp-match "([a-z ]+;)*" "wash; rinse; repeat;")
;; => ("wash; rinse; repeat;" " repeat;")
Unmatched optional groups return #f:
(define date-pattern
(pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))
(pregexp-match date-pattern "mar 15, 2020")
;; => ("mar 15, 2020" "mar" "15," "2020")
(pregexp-match date-pattern "mar 2020")
;; => ("mar 2020" "mar" #f "2020")
Backreferences
In replacement strings, \\n references the nth submatch (\\0 or \\& for full match):
(pregexp-replace* "-(.+?)-"
"the -alpha-, the -beta-, and the -gamma-"
"*\\1*")
;; => "the *alpha*, the *beta*, and the *gamma*"
Backreferences in patterns match previously captured text:
(pregexp-match "(\\w+) \\1" "hello hello world")
;; => ("hello hello" "hello")
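Use \\\\ for a literal backslash in a replacement string. \\$ stands for the empty string, which is useful for separating a backreference \\n from a digit that immediately follows it.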
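The replacement escapes can be sketched as follows. The first call uses \\& for the whole match. The second uses \\$ so that the digit 0 is read as literal text rather than as part of the backreference number; under the rules described above it should yield "x70", but no output is asserted here.

```scheme
;; \\& inserts the entire match.
(pregexp-replace* "[0-9]+" "7 of 12" "<\\&>")
;; => "<7> of <12>"

;; \\$ separates backreference 1 from the literal 0 that follows;
;; without it, \\10 would be read as backreference 10.
(pregexp-replace "([0-9])" "x7" "\\1\\$0")
```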
Non-Capturing Groups
(?:...) groups without capturing:
(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
"/usr/local/bin/scheme")
;; => ("/usr/local/bin/scheme" "scheme")
Cloisters
A cloister is a non-capturing group with modifiers placed between ? and :. The i modifier enables case-insensitivity:
(pregexp-match "(?i:scheme)" "Scheme")
;; => ("Scheme")
x enables free-spacing mode (unescaped whitespace and comments in the pattern are ignored; escape a space with a backslash to match it literally):
(pregexp-match "(?x:hello world)" "helloworld")
;; => ("helloworld")
Multiple modifiers combine: (?ix:...). Prefix with - to disable: (?-i:...).
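Cloisters can be nested, so a modifier can be switched off for part of a group. The sketch below allows any casing for "the" and "book" but requires "TeX" exactly as written:

```scheme
;; Outer group is case-insensitive; the inner group turns that off.
(pregexp-match "(?i:the (?-i:TeX)book)" "The TeXbook")
;; => ("The TeXbook")
(pregexp-match "(?i:the (?-i:TeX)book)" "the TEXBOOK")
;; => #f
```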
Alternation
| separates alternatives:
(pregexp-match "b(ee|ea|oa)d" "bead")
;; => ("bead" "ea")
Alternatives are tried left to right, and the first one that allows a match wins. Place longer patterns before shorter ones that are prefixes of them:
(pregexp-match "exec|execution"
"execution")
;; => ("exec") ; the shorter alternative wins
(pregexp-match "execution|exec"
"execution")
;; => ("execution") ; the longer alternative is tried first
Backtracking Control
Greedy quantifiers match maximally but backtrack to enable overall matches:
(pregexp-match "a*aa" "aaaa")
;; => ("aaaa")
Atomic Grouping
(?>...) prevents backtracking within the group:
(pregexp-match "(?>a+)." "aaaa")
;; => #f
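For comparison, the same pattern without the atomic group succeeds, because a+ can give back one character for the dot to match. A brief sketch:

```scheme
;; Ordinary greedy a+ backtracks by one a so that . can match.
(pregexp-match "a+." "aaaa")
;; => ("aaaa")
```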
Lookaround Assertions
Lookahead
Positive (?=...) asserts following pattern matches:
(pregexp-match-positions "win(?=dows)"
"win some windows")
;; => ((9 . 12))
Negative (?!...) asserts following pattern does not match:
(pregexp-match-positions "win(?!dows)"
"windows win some")
;; => ((8 . 11))
Lookbehind
Positive (?<=...) asserts preceding pattern matches:
(pregexp-match-positions "(?<=under)stand"
"understand the standpoint")
;; => ((5 . 10))
Negative (?<!...) asserts preceding pattern does not match:
(pregexp-match-positions "(?<!under)stand"
"understand the stand")
;; => ((15 . 20))