Regular Expression Patterns
Regex patterns use special syntax to define matching rules. The following table lists common syntax elements. Note that some meanings change when optional flags are used.
| Pattern | Description |
|---|---|
| `^` | Matches the start of a string |
| `$` | Matches the end of a string |
| `.` | Matches any single character except newline; with the re.DOTALL flag, matches any character including newlines |
| `[chars]` | Defines a character set, e.g., [amk] matches a, m, or k |
| `[^chars]` | Matches any character not in the set, e.g., [^abc] matches any character except a, b, or c |
| `re*` | Matches 0 or more occurrences of the preceding expression |
| `re+` | Matches 1 or more occurrences of the preceding expression |
| `re?` | Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (e.g., *? or +?), it makes that quantifier non-greedy |
| `re{n}` | Matches exactly n occurrences of the preceding expression |
| `re{n,}` | Matches at least n occurrences of the preceding expression |
| `re{n,m}` | Matches between n and m occurrences of the preceding expression, uses greedy matching by default |
| `a \| b` | Matches either a or b |
| `(re)` | Matches the enclosed expression and creates a capturing group |
| `(?imx)` | Enables the i, m, or x flags; in Python's re module these inline flags apply to the whole pattern and must appear at its start |
| `(?-imx)` | Disables the i, m, or x flags in engines that support this form; Python's re module only accepts the scoped form (?-imx: re) |
| `(?: re)` | Matches the enclosed expression without creating a capturing group |
| `(?imx: re)` | Enables the i, m, or x flags only for the enclosed expression |
| `(?-imx: re)` | Disables the i, m, or x flags only for the enclosed expression |
| `(?# comment)` | Adds a comment ignored by the regex engine |
| `(?= re)` | Positive lookahead: succeeds if the enclosed expression is found at the current position, without advancing the matching position |
| `(?! re)` | Negative lookahead: succeeds if the enclosed expression is NOT found at the current position |
| `(?> re)` | Atomic group: matches the enclosed expression as an independent pattern and does not backtrack into it (available in Python's re module from version 3.11) |
| `\w` | Matches alphanumeric characters and underscores |
| `\W` | Matches non-alphanumeric, non-underscore characters |
| `\s` | Matches any whitespace character, equivalent to [ \t\n\r\f\v] |
| `\S` | Matches any non-whitespace character |
| `\d` | Matches any digit, equivalent to [0-9] |
| `\D` | Matches any non-digit character |
| `\A` | Matches the absolute start of a string |
| `\Z` | Matches the end of a string; in Python's re module it matches only at the very end, while Perl-style engines also allow it just before a final newline |
| `\z` | Matches the absolute end of a string regardless of newlines (not available in older versions of Python's re module, where \Z plays this role) |
| `\G` | Matches at the position where the last match ended (not supported by Python's re module) |
| `\b` | Matches a word boundary, e.g., er\b matches er in "never" but not in "verb" |
| `\B` | Matches a non-word boundary, e.g., er\B matches er in "verb" but not in "never" |
| `\n`, `\t`, etc. | Matches newline, tab, and other whitespace escape sequences |
| `\1`-`\9` | Matches the content of the nth capturing group |
| `\10` | Matches the 10th capturing group if it exists; otherwise matches an octal character code |
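A few of the elements in the table are easier to grasp when run. The following is a minimal sketch using Python's re module; the sample strings and variable names are made up for illustration.

```python
import re

# Quantifier {n,m} is greedy by default: it takes as many digits as allowed.
greedy_digits = re.search(r'\d{2,4}', 'id 123456').group()

# Alternation inside a non-capturing group, followed by an optional 's'.
animals = re.findall(r'(?:cat|dog)s?', 'cats, dog, bird')

# Positive lookahead: digits only when followed by 'px'; 'px' is not consumed.
pixel_values = re.findall(r'\d+(?=px)', '12px 5em 30px')

# Backreference: \1 must repeat exactly what group 1 captured.
repeated_word = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

print(greedy_digits)   # 1234
print(animals)         # ['cats', 'dog']
print(pixel_values)    # ['12', '30']
print(repeated_word)   # is is
```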
Common Special Character Classes
| Example | Description |
|---|---|
| `.` | Matches any single character except \n; to match newlines as well, use [\s\S] or the re.S flag |
| `\d` | Matches a single digit, equivalent to [0-9] |
| `\D` | Matches a single non-digit character, equivalent to [^0-9] |
| `\s` | Matches any whitespace character including spaces, tabs, form feeds, etc., equivalent to [ \f\n\r\t\v] |
| `\S` | Matches any non-whitespace character, equivalent to [^ \f\n\r\t\v] |
| `\w` | Matches any word character including underscores, equivalent to [A-Za-z0-9_] |
| `\W` | Matches any non-word character, equivalent to [^A-Za-z0-9_] |
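The character classes above can be checked with a short script. This is a sketch with invented sample strings. The bracket equivalents in the table describe ASCII text: in Python 3, str patterns are Unicode-aware by default, and the re.ASCII flag restricts the shorthand classes to their ASCII ranges.

```python
import re

text = 'line1\nline2'

# Without flags, . stops at the newline.
dot_only = re.match(r'.+', text).group()

# [\s\S] matches any character at all, newline included.
any_char = re.match(r'[\s\S]+', text).group()

# Inside brackets, . is a literal dot, so [.\n] matches only '.' or a newline.
dot_or_newline = re.findall(r'[.\n]', 'a.b\nc')

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [A-Za-z0-9_].
unicode_words = re.findall(r'\w+', 'user_1 caf\u00e9')
ascii_words = re.findall(r'\w+', 'user_1 caf\u00e9', re.ASCII)

print(repr(dot_only))      # 'line1'
print(repr(any_char))      # 'line1\nline2'
print(dot_or_newline)      # ['.', '\n']
print(unicode_words)       # ['user_1', 'café']
print(ascii_words)         # ['user_1', 'caf']
```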
Common Regex Modifiers
| Flag | Purpose |
|---|---|
| `re.I` | Makes matching case-insensitive |
| `re.L` | Performs locale-aware matching |
| `re.M` | Enables multi-line matching: ^ and $ also match at the start and end of each line |
| `re.S` | Makes . match all characters including newlines |
| `re.U` | Parses characters according to Unicode standards, affects \w, \W, \b, \B (already the default for str patterns in Python 3) |
| `re.X` | Allows whitespace and # comments inside the pattern so that it can be laid out more readably |
The most commonly used modifiers for web scraping are re.S and re.I.
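As a quick illustration of the modifiers, the sketch below applies each flag to a small made-up string. Flags are combined with the | operator; the names text and date_pattern are chosen only for this example.

```python
import re

text = 'First line\nsecond LINE'

# re.I: ignore case.
case_insensitive = re.findall(r'line', text, re.I)

# re.M: ^ matches at the start of every line, not just the start of the string.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: . also matches the newline, so the match can span both lines.
across_lines = re.search(r'First.*LINE', text, re.S).group()

# Flags can be combined with |.
combined = re.search(r'first.*line$', text, re.I | re.S).group()

# re.X: whitespace and # comments inside the pattern are ignored.
date_pattern = re.compile(r'''
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
''', re.X)
year_month = date_pattern.match('2024-05').groups()

print(case_insensitive)  # ['line', 'LINE']
print(line_starts)       # ['First', 'second']
print(year_month)        # ('2024', '05')
```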
Using Python's re Module
Python's built-in re module provides regular expression support, so the patterns above can be used directly in Python code.
1. match(): Match From the Start of a String
The match() method attempts to match a regex pattern starting at the beginning of the input string. It returns a match object on success, or None on failure.
Example:
import re
test_string = 'Hello 123 4567 Demo_Example is a Regex Tutorial'
print(len(test_string))
match_obj = re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}', test_string)
print(match_obj)
print(match_obj.group())
print(match_obj.span())
Running this code will output:
47
<re.Match object; span=(0, 25), match='Hello 123 4567 Demo_Examp'>
Hello 123 4567 Demo_Examp
(0, 25)
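Because match() returns None on failure, calling group() on the result without a check raises AttributeError. A minimal sketch of a guard follows; the helper name first_number is hypothetical and not part of the re module.

```python
import re

def first_number(text):
    """Return the digits at the start of text, or None when there are none."""
    match_obj = re.match(r'\d+', text)
    # A match object is truthy; a failed match is None.
    return match_obj.group() if match_obj else None

print(first_number('123abc'))  # 123
print(first_number('abc123'))  # None
```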
Extracting Specific Matched Groups
To extract partial content from a matched string, use parentheses () to define capturing groups. You can retrieve group content using the group() method with the group index.
Example:
import re
test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^Hello\s(\d+)\sDemo', test_string)
print(match_obj)
print(match_obj.group())
print(match_obj.group(1))
print(match_obj.span())
Output:
<re.Match object; span=(0, 18), match='Hello 1234567 Demo'>
Hello 1234567 Demo
1234567
(0, 18)
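When a pattern has several groups, groups() returns all of them at once, and named groups written as (?P<name>...) can be read by name. The sketch below reuses the sample string from above; the group names number and word are chosen for illustration.

```python
import re

test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^Hello\s(?P<number>\d+)\s(?P<word>\w+)', test_string)

all_groups = match_obj.groups()        # every captured group as a tuple
number = match_obj.group('number')     # access a group by name
by_name = match_obj.groupdict()        # all named groups as a dict
number_span = match_obj.span(1)        # start/end index of group 1

print(all_groups)   # ('1234567', 'Demo_Example')
print(number)       # 1234567
print(number_span)  # (6, 13)
```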
Universal Matching with .*
Use .* to match any run of characters except newlines: . matches any non-newline character, and * repeats the preceding expression zero or more times. This lets you skip over text that you do not need to describe precisely.
Example:
import re
test_string = 'Hello 123 4567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^Hello.*Tutorial$', test_string)
print(match_obj.group())
print(match_obj.span())
Output:
Hello 123 4567 Demo_Example is a Regex Tutorial
(0, 47)
Greedy vs Non-Greedy Matching
By default, .* uses greedy matching, which matches as much text as possible. This can lead to unexpected results. For example:
import re
test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^He.*(\d+).*Tutorial$', test_string)
print(match_obj.group(1))
This outputs 7, because .* consumes all text up to the last digit, leaving only 7 for \d+.
Use .*? for non-greedy matching, which matches as little text as possible:
import re
test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^He.*?(\d+).*Tutorial$', test_string)
print(match_obj.group(1))
This correctly outputs 1234567.
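The same difference shows up when matching markup. In this sketch (the html string is an invented example), the greedy pattern runs to the last > in the string, while the non-greedy one stops at the first > it can.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* extends to the final '>' in the string.
greedy_tags = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the nearest '>'.
lazy_tags = re.findall(r'<.*?>', html)

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```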
A common pitfall: when .*? is the last element of the pattern, nothing after it forces it to expand, so it matches the empty string. For example:
import re
url_str = 'http://example.com/comment/abc123'
result1 = re.match(r'http.*?comment/(.*?)', url_str)
result2 = re.match(r'http.*?comment/(.*)', url_str)
print('result1:', result1.group(1))
print('result2:', result2.group(1))
Output:
result1:
result2: abc123
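If a non-greedy group is still wanted at the end of a pattern, it needs something after it to stop at. The sketch below shows three possible ways: an end anchor, a following delimiter, and a negated character class. The url_query string is an assumed variation on the example above.

```python
import re

url_str = 'http://example.com/comment/abc123'
url_query = 'http://example.com/comment/abc123?page=2'

# $ forces the lazy group to extend to the end of the string.
anchored = re.match(r'http.*?comment/(.*?)$', url_str).group(1)

# A literal delimiter after the group gives .*? a place to stop.
before_query = re.match(r'http.*?comment/(.*?)\?', url_query).group(1)

# A negated character class avoids the greedy/lazy question entirely.
by_class = re.match(r'http.*?comment/([^/?]+)', url_query).group(1)

print(anchored, before_query, by_class)  # abc123 abc123 abc123
```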
2. search(): Scan Entire String for First Match
Unlike match(), which only matches from the start of the string, search() scans the entire input string and returns the first successful match.
Example of failed match():
import re
test_string = 'Extra text Hello 1234567 Demo Example'
match_obj = re.match(r'Hello.*Demo', test_string)
print(match_obj)
Outputs None because the string does not start with Hello.
Rewritten with search():
import re
test_string = 'Extra text Hello 1234567 Demo Example'
search_obj = re.search(r'Hello.*?(\d+).*Demo', test_string)
print(search_obj.group(1))
Outputs 1234567.
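The relationship between the anchoring behaviours can be sketched side by side. fullmatch(), which is also part of the re module, goes one step further than match() and requires the whole string to match.

```python
import re

text = 'Extra text Hello 1234567 Demo Example'

at_start = re.match(r'Hello', text)            # only tries index 0
anywhere = re.search(r'Hello', text)           # scans forward
anchored_search = re.search(r'^Hello', text)   # ^ makes search behave like match

whole = re.fullmatch(r'Hello \d+', 'Hello 1234567')   # entire string must match
partial = re.fullmatch(r'Hello', 'Hello 1234567')     # leftover text, so no match

print(at_start)           # None
print(anywhere.start())   # 11
print(anchored_search)    # None
print(whole.group())      # Hello 1234567
print(partial)            # None
```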
Practical Example: Extract Data from HTML
Using a sample HTML snippet:
html_content = '''<div id="song-list">
<h2 class="title">Classic Old Songs</h2>
<p class="intro">Classic song list</p>
<ul id="list" class="list-group">
<li data-view="2">Path I Walk With You</li>
<li data-view="7">
<a href="/1.mp3" singer="Richie Jen">The Laughing Sea</a>
</li>
<li data-view="4" class="active">
<a href="/2.mp3" singer="Chyi Chin">Past随风</a>
</li>
<li data-view="6"><a href="/3.mp3" singer="Beyond">Glory Years</a></li>
<li data-view="5"><a href="/4.mp3" singer="Kelly Chen">Notebook</a></li>
<li data-view="5">
<a href="/5.mp3" singer="Teresa Teng">I Hope You Are Well</a>
</li>
</ul>
</div>'''
To extract the singer and song name from the active <li> node:
import re
match_result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html_content, re.S)
if match_result:
    print(match_result.group(1), match_result.group(2))
Outputs Chyi Chin Past随风.
Without the active keyword in the pattern, search() returns the first matching node:
import re
match_result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html_content, re.S)
if match_result:
    print(match_result.group(1), match_result.group(2))
Outputs Richie Jen The Laughing Sea.
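Both patterns rely on re.S because the <li> and <a> tags sit on different lines. The following sketch uses a shortened, hypothetical snippet (not the full html_content) to show that the same pattern fails without the flag.

```python
import re

# Hypothetical two-line snippet: the <a> tag is on the line after <li>.
snippet = '<li class="active">\n<a href="/2.mp3" singer="Chyi Chin">Song</a>\n</li>'
pattern = r'<li.*?active.*?singer="(.*?)">(.*?)</a>'

# Without re.S, . cannot cross the newline, so the search finds nothing.
without_flag = re.search(pattern, snippet)

# With re.S, . matches the newline and the search succeeds.
with_flag = re.search(pattern, snippet, re.S)

print(without_flag)         # None
print(with_flag.groups())   # ('Chyi Chin', 'Song')
```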
3. findall(): Get All Matches as a List
The findall() method scans the entire string and returns a list of all matches. When the pattern contains more than one capturing group, each list element is a tuple of the captured groups.
Example to extract all song links, singers, and names:
import re
all_results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html_content, re.S)
print(all_results)
for item in all_results:
    print(f"Link: {item[0]}, Singer: {item[1]}, Song: {item[2]}")
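The shape of the list returned by findall() depends on how many capturing groups the pattern has. This sketch, using an invented key=value string, shows the three cases, plus finditer(), which yields match objects when positions are also needed.

```python
import re

text = 'a=1, b=22, c=333'

# No groups: a list of whole matches.
whole_matches = re.findall(r'\w=\d+', text)

# One group: a list of that group's strings.
one_group = re.findall(r'\w=(\d+)', text)

# Several groups: a list of tuples.
two_groups = re.findall(r'(\w)=(\d+)', text)

# finditer() yields match objects, so start positions are available too.
keys_with_pos = [(m.group(1), m.start()) for m in re.finditer(r'(\w)=\d+', text)]

print(whole_matches)  # ['a=1', 'b=22', 'c=333']
print(one_group)      # ['1', '22', '333']
print(two_groups)     # [('a', '1'), ('b', '22'), ('c', '333')]
print(keys_with_pos)  # [('a', 0), ('b', 5), ('c', 11)]
```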
4. sub(): Replace Matched Text
Use sub() to replace matched text with a specified string. This is useful for cleaning up text.
Example to remove all digits from a string:
import re
raw_text = '54aK54yr5oiR54ix5L2g'
clean_text = re.sub(r'\d+', '', raw_text)
print(clean_text)
Outputs aKyroiRixLg.
You can also use sub() to simplify HTML parsing by removing tags first:
import re
# Remove all <a> tags
clean_html = re.sub(r'<a.*?>|</a>', '', html_content)
# Extract song names from <li> tags
song_names = re.findall(r'<li.*?>(.*?)</li>', clean_html, re.S)
for name in song_names:
    print(name.strip())
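Beyond deleting text, sub() can reuse captured groups in the replacement, call a function for each match, and limit the number of replacements with count. A short sketch with made-up inputs:

```python
import re

# Backreferences in the replacement: reorder year-month-day to day/month/year.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-05-12')

# A function as the replacement: it receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b20')

# count limits how many matches are replaced.
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

print(swapped)     # 12/05/2024
print(doubled)     # a2 b40
print(first_only)  # a#b2c3
```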
5. compile(): Precompile Regex Patterns
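Use compile() to convert a regex string into a reusable pattern object. The re module already caches recently used patterns, so the speed gain is usually modest, but a compiled pattern keeps the regex in one place and avoids repeating it when the same pattern is used many times.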
Example:
import re
# Precompile regex to match time strings
time_pattern = re.compile(r'\d{2}:\d{2}')
date1 = '2024-05-12 14:30'
date2 = '2024-05-13 15:45'
date3 = '2024-05-14 16:00'
# Remove time from each date string
clean1 = time_pattern.sub('', date1)
clean2 = time_pattern.sub('', date2)
clean3 = time_pattern.sub('', date3)
print(clean1, clean2, clean3)
Outputs 2024-05-12 2024-05-13 2024-05-14, where each date keeps the space that originally preceded the time.
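A compiled pattern exposes the same operations as methods (sub(), findall(), search(), and so on). The sketch below is one possible variation of the example: it adds \s* to the pattern, an assumption made here so that the space before the time is removed as well.

```python
import re

# \s* also consumes the whitespace in front of the time (an illustrative tweak).
time_pattern = re.compile(r'\s*\d{2}:\d{2}')
dates = ['2024-05-12 14:30', '2024-05-13 15:45', '2024-05-14 16:00']

# Call methods directly on the compiled pattern.
clean_dates = [time_pattern.sub('', d) for d in dates]
times = [time_pattern.search(d).group().strip() for d in dates]

print(clean_dates)  # ['2024-05-12', '2024-05-13', '2024-05-14']
print(times)        # ['14:30', '15:45', '16:00']
```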