Portable Regular Expression Library for Scheme and Common Lisp: pregexp

pregexp is a portable regular expression library that runs in R4RS, R5RS, and R6RS Scheme implementations and also ships in a Common Lisp version. It provides Perl-compatible pattern syntax including numeric quantifiers, non-greedy matching, capture groups, POSIX character classes, case-insensitive modes, backreferences, alternation, and lookaround assertions (both lookahead and lookbehind, in positive and negative forms).

Load pregexp.scm into your Scheme environment (module installation is also supported via the distribution's install instructions). The Common Lisp version resides in pregexp.lisp. All examples below use Scheme syntax but translate readily to Common Lisp.
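As a minimal sketch of the loading step, assuming pregexp.scm has been copied into the current working directory (the path is an assumption, so adjust it for your setup):

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Quick check that the library is available.
(pregexp-match "thread" "sew thread here")
;; => ("thread")
```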

Core Procedures

pregexp.scm exports seven primary procedures: pregexp, pregexp-match-positions, pregexp-match, pregexp-split, pregexp-replace, pregexp-replace*, and pregexp-quote.

pregexp

Compiles a string-based regex (U-regexp) into an S-expression representation (S-regexp):

(pregexp "b.t")
;; => (:sub (:or (:seq #\b :any #\t)))
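The compiled form is an ordinary value, so it can be bound to a name and passed wherever a pattern string is accepted, as the date-pattern example later in this document does. The sketch below assumes pregexp.scm is already loaded; the name thread-re is purely illustrative.

```scheme
;; Assumes pregexp.scm is already loaded.
;; thread-re is a hypothetical name chosen for this example.
(define thread-re (pregexp "thread"))

;; The compiled pattern can be reused across calls.
(pregexp-match thread-re "sew thread here")
;; => ("thread")

(pregexp-match-positions thread-re "sew thread here")
;; => ((4 . 10))
```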

pregexp-match-positions

Returns #f for failed matches or a list of index pairs for successful matches:

(pregexp-match-positions "heart" "hard")
;; => #f

(pregexp-match-positions "thread" "sew thread here")
;; => ((4 . 10))

(substring "sew thread here" 4 10)
;; => "thread"

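Optional third and fourth arguments give the start and end indices of the portion of the string to search. The positions returned are still relative to the whole string: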

(pregexp-match-positions "thread"
  "find thread here, more thread there"
  15 30)
;; => ((23 . 29))

pregexp-match

Returns matched substrings directly rather than positions:

(pregexp-match "heart" "hard")
;; => #f

(pregexp-match "thread" "sew thread here")
;; => ("thread")
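The relationship between the two procedures can be sketched in plain Scheme. The helper below is hypothetical (it is not part of pregexp) and only illustrates how a list of index pairs maps to substrings; it passes #f through, both for a failed match and for an unmatched group.

```scheme
;; Hypothetical helper, not part of pregexp: turn the kind of result
;; pregexp-match-positions returns into the kind pregexp-match returns.
(define (positions->substrings str positions)
  (and positions                      ; a failed match stays #f
       (map (lambda (p)
              ;; an unmatched group is #f and stays #f
              (and p (substring str (car p) (cdr p))))
            positions)))

(positions->substrings "sew thread here" '((4 . 10)))
;; => ("thread")

(positions->substrings "hard" #f)
;; => #f
```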

pregexp-split

Splits strings using a pattern delimiter:

(pregexp-split "," "alpha,beta,gamma,delta")
;; => ("alpha" "beta" "gamma" "delta")

(pregexp-split " " "split peas")
;; => ("split" "peas")

Empty pattern splits into characters:

(pregexp-split "" "fragment")
;; => ("f" "r" "a" "g" "m" "e" "n" "t")

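Use " +" (not " *") for multi-space delimiters. Because " *" can match the empty string, it behaves like the empty pattern and splits the input into single characters: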

(pregexp-split " +" "split   peas   here")
;; => ("split" "peas" "here")
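For contrast, here is a sketch of what the " *" delimiter does to a similar input, assuming pregexp.scm is loaded. Since " *" can match zero spaces, the string is broken into individual characters and the spaces themselves are consumed as delimiters.

```scheme
;; Assumes pregexp.scm is already loaded.
;; " *" can match the empty string, so every character becomes its own piece.
(pregexp-split " *" "split peas")
;; => ("s" "p" "l" "i" "t" "p" "e" "a" "s")
```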

pregexp-replace

Replaces first match occurrence:

(pregexp-replace "re" "fibre" "er")
;; => "fiber"

pregexp-replace*

Replaces all matches:

(pregexp-replace* "re" "fibre spectre" "er")
;; => "fiber specter"

pregexp-quote

Escapes special regex characters in literal strings:

(pregexp-quote "lambda")
;; => "lambda"

(pregexp-quote "list?")
;; => "list\\?"
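A typical use is matching text that should be taken literally. The sketch below assumes pregexp.scm is loaded; the variable name literal is illustrative only. It compares the quoted and unquoted forms of the same string.

```scheme
;; Assumes pregexp.scm is already loaded.
;; literal is a hypothetical name for text we want to match verbatim.
(define literal "list?")

;; Quoted: the ? is matched as an ordinary character.
(pregexp-match (pregexp-quote literal) "is list? a predicate")
;; => ("list?")

;; Unquoted: the ? makes the preceding t optional, so the ? is never matched.
(pregexp-match literal "is list? a predicate")
;; => ("list")
```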

Pattern Syntax Reference

Basic Anchors

^ and $ anchor to string start/end:

(pregexp-match-positions "^start" "the start")
;; => #f

(pregexp-match-positions "end$" "the end is near the end")
;; => ((20 . 23))

\\b matches word boundaries:

(pregexp-match-positions "cat\\b" "cater to the cat")
;; => ((13 . 16))

\\B negates word boundaries:

(pregexp-match-positions "an\\B" "an analysis")
;; => ((3 . 5))

Characters and Character Classes

Literal characters match themselves. Meta-sequences \\n, \\r, \\t match newline, carriage return, and tab. Dot . matches any character except newline:

(pregexp-match "b.t" "bat")
;; => ("bat")

Square brackets define character classes:

(pregexp-match "b[aeiou]t" "bet")
;; => ("bet")

Ranges use hyphens: "[a-cx-z]" matches a, b, c, x, y, z. A leading ^ negates the class: "do[^g]" matches "dot" and "don" but not "dog". Most metacharacters lose their special meaning inside brackets (except ], -, and ^).
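The range and negation rules above can be tried directly. This is a small sketch that assumes pregexp.scm is loaded; the input strings are made up for illustration.

```scheme
;; Assumes pregexp.scm is already loaded.

;; Two ranges in one class: a-c and x-z.
(pregexp-match "[a-cx-z]" "--y--")
;; => ("y")

;; Negated class: "dog" is skipped because g is excluded.
(pregexp-match "do[^g]" "dog dot")
;; => ("dot")
```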

Common Character Classes

\\d = digits [0-9], \\s = whitespace, \\w = word characters [A-Za-z0-9_]. Uppercase variants invert: \\D, \\S, \\W:

(pregexp-match "\\d\\d"
  "room 101 has 2 beds")
;; => ("10")

POSIX Character Classes

POSIX classes use [:...:] syntax inside brackets:

[:alnum:] [:alpha:] [:ascii:] [:blank:] [:cntrl:] [:digit:]
[:graph:] [:lower:] [:print:] [:space:] [:upper:] [:word:] [:xdigit:]

(pregexp-match "[[:alpha:]_]" "--x--")
;; => ("x")

(pregexp-match "[[:alpha:]_]" "--_--")
;; => ("_")

Insert ^ after [: to invert: [:^alpha:] matches non-letters.
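A short sketch of the inverted form, assuming pregexp.scm is loaded. Like the other POSIX classes, the inverted class still has to appear inside an enclosing pair of brackets.

```scheme
;; Assumes pregexp.scm is already loaded.
;; [:^alpha:] inside brackets matches any character that is not a letter.
(pregexp-match "[[:^alpha:]]+" "abc123def")
;; => ("123")
```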

Quantifiers

* matches zero or more occurrences, + matches one or more, and ? matches zero or one:

(pregexp-match-positions "b[aeiou]*d" "bead")
;; => ((0 . 4))

(pregexp-match-positions "b[aeiou]+d" "bd")
;; => #f

Numeric Quantifiers

{n} matches exactly n instances. {m,n} matches between m and n instances (inclusive). If m is omitted it defaults to 0; if n is omitted there is no upper limit:

(pregexp-match "[aeiou]{3}" "sequoia")
;; => ("uoi")

(pregexp-match "[aeiou]{2,3}" "evolve")
;; => #f
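To illustrate the effect of the upper bound, the sketch below (assuming pregexp.scm is loaded) runs a bounded and an open-ended quantifier over the same four-vowel run in "sequoia".

```scheme
;; Assumes pregexp.scm is already loaded.

;; Upper bound of 3: the match stops after three vowels.
(pregexp-match "[aeiou]{2,3}" "sequoia")
;; => ("uoi")

;; No upper bound: the match takes the whole run of vowels.
(pregexp-match "[aeiou]{2,}" "sequoia")
;; => ("uoia")
```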

Non-Greedy Quantifiers

Append ? for minimal matching: *?, +?, ??, {m}?, {m,n}?:

(pregexp-match "\\{.*?\\}" "{tag1} {tag2} {tag3}")
;; => ("{tag1}")

Grouping

Parentheses (...) create capturing groups. Matches return the full match followed by submatches:

(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jul 4, 1776")
;; => ("jul 4, 1776" "jul" "4" "1776")

Groups combine with quantifiers:

(pregexp-match "(ha )*" "ha ha ha ha")
;; => ("ha ha ha " "ha ")

When quantified groups match multiple times, only the last capture appears:

(pregexp-match "([a-z ]+;)*" "wash; rinse; repeat;")
;; => ("wash; rinse; repeat;" " repeat;")

Unmatched optional groups return #f:

(define date-pattern
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

(pregexp-match date-pattern "mar 15, 2020")
;; => ("mar 15, 2020" "mar" "15," "2020")

(pregexp-match date-pattern "mar 2020")
;; => ("mar 2020" "mar" #f "2020")

Backreferences

In replacement strings, \\n references the nth submatch (\\0 or \\& for full match):

(pregexp-replace* "-(.+?)-"
  "the -alpha-, the -beta-, and the -gamma-"
  "*\\1*")
;; => "the *alpha*, the *beta*, and the *gamma*"

Backreferences in patterns match previously captured text:

(pregexp-match "(\\w+) \\1" "hello hello world")
;; => ("hello hello" "hello")

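In replacement strings, use \\\\ for a literal backslash. \\$ stands for the empty string, which is useful for separating a backreference such as \\1 from a digit that immediately follows it.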
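Two further replacement sketches, assuming pregexp.scm is loaded: one uses \\& to wrap each whole match, the other uses numbered backreferences to reorder submatches. The input strings are illustrative.

```scheme
;; Assumes pregexp.scm is already loaded.

;; \\& inserts the entire match.
(pregexp-replace* "[0-9]+" "room 101 has 2 beds" "<\\&>")
;; => "room <101> has <2> beds"

;; \\1 and \\2 insert the first and second submatches, here swapped.
(pregexp-replace "([a-z]+) ([a-z]+)" "hello world" "\\2 \\1")
;; => "world hello"
```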

Non-Capturing Groups

(?:...) groups without capturing:

(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
  "/usr/local/bin/scheme")
;; => ("/usr/local/bin/scheme" "scheme")

Cloisters

Modifiers between ? and : in non-capturing groups. i enables case-insensitivity:

(pregexp-match "(?i:scheme)" "Scheme")
;; => ("Scheme")

x enables free-spacing mode, in which whitespace and comments inside the pattern are ignored. To match a literal space in this mode, escape it with a backslash:

(pregexp-match "(?x:hello world)" "helloworld")
;; => ("helloworld")

Multiple modifiers combine: (?ix:...). Prefix with - to disable: (?-i:...).
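Cloisters can be nested, so a modifier can be switched off for part of an enclosing group. The following sketch assumes pregexp.scm is loaded: the outer group is case-insensitive, while the inner group insists on the exact casing of TeX.

```scheme
;; Assumes pregexp.scm is already loaded.

;; "the" and "book" may be in any case; "TeX" must match exactly.
(pregexp-match "(?i:the (?-i:TeX)book)" "The TeXbook")
;; => ("The TeXbook")

(pregexp-match "(?i:the (?-i:TeX)book)" "The texbook")
;; => #f
```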

Alternation

| separates alternatives:

(pregexp-match "b(ee|ea|oa)d" "bead")
;; => ("bead" "ea")

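Alternatives are tried left to right, and the first one that matches wins, even if a later alternative would match more text. Place longer patterns before shorter ones:

(pregexp-match "exec|execution"
  "execution")
;; => ("exec") ; the shorter alternative wins

(pregexp-match "execution|exec"
  "execution")
;; => ("execution") ; the longer alternative is tried first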

Backtracking Control

Greedy quantifiers match maximally but backtrack to enable overall matches:

(pregexp-match "a*aa" "aaaa")
;; => ("aaaa")

Atomic Grouping

(?>...) prevents backtracking within the group:

(pregexp-match "(?>a+)." "aaaa")
;; => #f

Lookaround Assertions

Lookahead

Positive (?=...) asserts following pattern matches:

(pregexp-match-positions "win(?=dows)"
  "win some windows")
;; => ((9 . 12))

Negative (?!...) asserts following pattern does not match:

(pregexp-match-positions "win(?!dows)"
  "windows win some")
;; => ((8 . 11))

Lookbehind

Positive (?<=...) asserts preceding pattern matches:

(pregexp-match-positions "(?<=under)stand"
  "understand the standpoint")
;; => ((5 . 10))

Negative (?<!...) asserts preceding pattern does not match:

(pregexp-match-positions "(?<!under)stand"
  "understand the stand")
;; => ((15 . 20))

Tags: scheme Common Lisp pregexp regular expressions R4RS

Posted on Mon, 22 Jun 2026 17:33:47 +0000 by ineedhelpbigtime