Common Regular Expression Usage for Python Web Scraping

Regular Expression Patterns

Regex patterns use special syntax to define matching rules. The following table lists common special syntax elements. Note that some meanings change when optional flags are used.

Pattern Description
^ Matches the start of a string
$ Matches the end of a string
. Matches any single character except newline; when using re.DOTALL flag, matches all characters including newlines
[chars] Defines a character set, e.g., [amk] matches a, m, or k
[^chars] Matches any character not in the set, e.g., [^abc] matches anything except a, b, c
re* Matches 0 or more occurrences of the preceding expression
re+ Matches 1 or more occurrences of the preceding expression
re? Matches 0 or 1 occurrence of the preceding expression (greedy); appended to another quantifier, as in re*? or re+?, it makes that quantifier non-greedy
re{n} Matches exactly n occurrences of the preceding expression
re{n,} Matches at least n occurrences of the preceding expression
re{n,m} Matches between n and m occurrences of the preceding expression, uses greedy matching by default
a|b Matches either a or b
(re) Matches the enclosed expression and creates a capturing group
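(?imx) Enables the i, m, or x flags for the whole pattern; in Python 3.11 and later it must appear at the start of the expression
(?-imx) Disables the i, m, or x flags in some regex engines; Python's re accepts flag removal only in the scoped form (?-imx: re)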
(?: re) Matches the enclosed expression without creating a capturing group
(?imx: re) Enables i, m, or x flags for the enclosed expression within parentheses
(?-imx: re) Disables i, m, or x flags for the enclosed expression within parentheses
(?# comment) Adds a comment ignored by the regex engine
(?= re) Positive lookahead: matches successfully if the enclosed expression is found at the current position, without advancing the matching position
(?! re) Negative lookahead: matches successfully if the enclosed expression is NOT found at the current position
(?> re) Atomic group: matches as an independent pattern that the engine does not backtrack into (supported by Python's re from version 3.11)
\w Matches alphanumeric characters and underscores
\W Matches non-alphanumeric, non-underscore characters
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9]
\D Matches any non-digit character
\A Matches the absolute start of a string
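\Z Matches only at the absolute end of a string in Python; unlike in Perl-style engines, it does not match before a trailing newline
\z Matches the absolute end of a string regardless of newlines; Python's re accepts it only from version 3.14, so use \Z in earlier versions
\G Matches at the position where the last match ended; not supported by Python's re module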
\b Matches a word boundary, e.g., er\b matches er in "never" but not in "verb"
\B Matches a non-word boundary, e.g., er\B matches er in "verb" but not in "never"
\n, \t, etc. Matches newline, tab, and other whitespace escape sequences
\1-\9 Matches the content of the nth capturing group
\10 Matches the content of the 10th capturing group; in Python a reference to a missing group raises an error, and an octal character code needs a leading 0 or three octal digits
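
Several of the constructs in the table can be tried directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Character set and negated set
vowels = re.findall(r'[aeiou]', 'regex')            # ['e', 'e']
non_digits = re.findall(r'[^0-9]+', 'a1bc22d')      # ['a', 'bc', 'd']

# Alternation inside a non-capturing group, followed by an optional "s"
pets = re.findall(r'(?:cat|dog)s?', 'cats and dog')  # ['cats', 'dog']

# Positive lookahead: digits only when "px" follows ("px" is not consumed)
sizes = re.findall(r'\d+(?=px)', '12px 30em 7px')    # ['12', '7']

# Negative lookahead: "foo" only when "bar" does not follow
foo_hits = [m.start() for m in re.finditer(r'foo(?!bar)', 'foobar foobaz')]  # [7]

# Backreference: \1 repeats whatever group 1 captured (a doubled word here)
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group(), '|', doubled.group(1))        # is is | is

# Word boundary versus non-boundary
at_boundary = re.findall(r'\w*er\b', 'never verb')      # ['never']
inside_word = re.findall(r'\w*er\B\w*', 'never verb')   # ['verb']
```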

Common Special Character Classes

Example Description
. Matches any single character except \n; use the re.S flag or [\s\S] to match newlines as well (inside a character class, . is a literal dot)
\d Matches a single digit, equivalent to [0-9]
\D Matches a single non-digit character, equivalent to [^0-9]
\s Matches any whitespace character including spaces, tabs, form feeds, etc., equivalent to [ \f\n\r\t\v]
\S Matches any non-whitespace character, equivalent to [^ \f\n\r\t\v]
\w Matches any word character including underscores, equivalent to [A-Za-z0-9_]
\W Matches any non-word character, equivalent to [^A-Za-z0-9_]
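
The behavior of these classes is easy to check interactively. The sketch below uses made-up sample strings and shows two details worth knowing: how the dot treats newlines, and that str patterns in Python 3 are Unicode-aware unless re.ASCII is passed.

```python
import re

text = 'line1\nline2'

# By default the dot stops at a newline
dot_default = re.match(r'.+', text).group()        # 'line1'
# With re.S the dot also matches newlines
dot_all = re.match(r'.+', text, re.S).group()      # 'line1\nline2'
# [\s\S] matches any character without needing a flag
any_char = re.match(r'[\s\S]+', text).group()      # 'line1\nline2'

# Inside a character class the dot is literal, so [.\n] matches
# only a dot or a newline, not "any character"
dot_or_newline = re.findall(r'[.\n]', 'a.b\nc')    # ['.', '\n']

# \d and \w also match non-ASCII digits and letters by default
unicode_digits = re.findall(r'\d', '4\uff12')            # ['4', '\uff12'] (fullwidth 2)
ascii_digits = re.findall(r'\d', '4\uff12', re.ASCII)    # ['4']
unicode_word = re.findall(r'\w+', 'caf\u00e9')           # ['caf\u00e9']
ascii_word = re.findall(r'\w+', 'caf\u00e9', re.ASCII)   # ['caf']
```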

Common Regex Modifiers

Flag Purpose
re.I Makes matching case-insensitive
re.L Performs locale-aware matching
re.M Enables multi-line matching, affects ^ and $
re.S Makes . match all characters including newlines
re.U Parses characters according to Unicode standards, affects \w, \W, \b, \B (already the default for str patterns in Python 3)
re.X Verbose mode: ignores whitespace in the pattern and allows # comments, so a regex can be laid out readably

The most commonly used modifiers for web scraping are re.S and re.I.
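
As a rough illustration of these flags, the sketch below applies them to small made-up strings. Flags are combined with the | operator.

```python
import re

html = '<P>first\nsecond</P>'

# No flags: the lowercase <p> does not match <P>
no_flags = re.search(r'<p>(.*)</p>', html)                 # None
# re.I alone: the tag matches, but . cannot cross the newline
only_i = re.search(r'<p>(.*)</p>', html, re.I)             # None
# re.I | re.S: case-insensitive, and . also matches newlines
both = re.search(r'<p>(.*)</p>', html, re.I | re.S)
print(repr(both.group(1)))                                 # 'first\nsecond'

# re.M makes ^ match at the start of every line
first_only = re.findall(r'^\w+', 'one\ntwo\nthree')        # ['one']
each_line = re.findall(r'^\w+', 'one\ntwo\nthree', re.M)   # ['one', 'two', 'three']

# re.X allows whitespace and comments inside the pattern
price = re.compile(r'''
    (\d+)      # whole part
    \.         # literal dot
    (\d{2})    # cents
''', re.X)
parts = price.search('cost: 19.99').groups()               # ('19', '99')
```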

Using Python's re Module

Python's standard re library implements full regex functionality, allowing you to use regex patterns in Python code.

1. match(): Match From the Start of a String

The match() method attempts to match a regex pattern starting at the beginning of the input string. It returns a match object on success, or None on failure.

Example:

import re

test_string = 'Hello 123 4567 Demo_Example is a Regex Tutorial'
print(len(test_string))
match_obj = re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}', test_string)
print(match_obj)
print(match_obj.group())
print(match_obj.span())

Running this code will output:

47
<re.Match object; span=(0, 25), match='Hello 123 4567 Demo_Examp'>
Hello 123 4567 Demo_Examp
(0, 25)
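
Because match() returns None on failure, calling group() on the result without a check raises AttributeError. A minimal sketch of a guarded helper follows; the function name extract_code is hypothetical, chosen only for this example.

```python
import re

def extract_code(text):
    """Return the three digits after a leading 'Hello ', or None if absent."""
    match_obj = re.match(r'^Hello\s(\d{3})', text)
    if match_obj is None:
        # No match at the start of the string: nothing to extract
        return None
    return match_obj.group(1)

print(extract_code('Hello 123 4567'))   # 123
print(extract_code('Goodbye 123'))      # None
```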

Extracting Specific Matched Groups

To extract partial content from a matched string, use parentheses () to define capturing groups. You can retrieve group content using the group() method with the group index.

Example:

import re

test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^Hello\s(\d+)\sDemo', test_string)
print(match_obj)
print(match_obj.group())
print(match_obj.group(1))
print(match_obj.span())

Output:

<re.Match object; span=(0, 18), match='Hello 1234567 Demo'>
Hello 1234567 Demo
1234567
(0, 18)
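
Match objects offer a few more ways to read groups. The sketch below reuses the test string from above and adds named groups, written (?P<name>...), which are part of Python's re syntax but not listed in the table earlier; the group names greeting and number are arbitrary.

```python
import re

test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^(?P<greeting>\w+)\s(?P<number>\d+)\s(\w+)', test_string)

# All captured groups at once, as a tuple
print(match_obj.groups())         # ('Hello', '1234567', 'Demo_Example')
# A group can be read by name or by index
print(match_obj.group('number'))  # 1234567
print(match_obj.group(1, 2))      # ('Hello', '1234567')
# Only the named groups, as a dict
print(match_obj.groupdict())      # {'greeting': 'Hello', 'number': '1234567'}
# span() also accepts a group, giving that group's position
print(match_obj.span('number'))   # (6, 13)
```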

Universal Matching with .*

Use .* to match any sequence of characters except newlines: . matches any non-newline character, and * matches zero or more occurrences of the preceding expression. This lets you skip over text you do not need to describe precisely.

Example:

import re

test_string = 'Hello 123 4567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^Hello.*Tutorial$', test_string)
print(match_obj.group())
print(match_obj.span())

Output:

Hello 123 4567 Demo_Example is a Regex Tutorial
(0, 47)

Greedy vs Non-Greedy Matching

By default, .* uses greedy matching, which matches as much text as possible. This can lead to unexpected results. For example:

import re

test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^He.*(\d+).*Tutorial$', test_string)
print(match_obj.group(1))

This outputs 7, because .* consumes all text up to the last digit, leaving only 7 for \d+.

Use .*? for non-greedy matching, which matches as little text as possible:

import re

test_string = 'Hello 1234567 Demo_Example is a Regex Tutorial'
match_obj = re.match(r'^He.*?(\d+).*Tutorial$', test_string)
print(match_obj.group(1))

This correctly outputs 1234567.
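
The same difference shows up when extracting repeated tags, which is the usual scraping case. A small sketch with a made-up snippet:

```python
import re

snippet = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the LAST </b>, swallowing everything in between
greedy = re.findall(r'<b>(.*)</b>', snippet)
# Non-greedy: .*? stops at the FIRST </b> after each <b>
lazy = re.findall(r'<b>(.*?)</b>', snippet)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```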

A common pitfall: .*? matches nothing when the target content is at the end of the string, because nothing after the group forces it to consume more characters. For example:

import re

url_str = 'http://example.com/comment/abc123'
result1 = re.match(r'http.*?comment/(.*?)', url_str)
result2 = re.match(r'http.*?comment/(.*)', url_str)
print('result1:', result1.group(1))
print('result2:', result2.group(1))

Output:

result1: 
result2: abc123
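
Besides switching to a greedy group, one way around this pitfall is to give the lazy group something to stop at, or to describe the wanted characters explicitly. The sketch below shows both options; the second URL with a query string is a made-up example, and the character class [^/?#] is just one reasonable choice for a URL path segment.

```python
import re

url_str = 'http://example.com/comment/abc123'
with_query = 'http://example.com/comment/abc123?page=2'

# Option 1: anchor the lazy group with $ so it must reach the end
anchored = re.match(r'http.*?comment/(.*?)$', url_str).group(1)

# Option 2: match one path segment, stopping at /, ? or #
segment = re.match(r'http.*?comment/([^/?#]+)', url_str).group(1)

# The two options differ once something follows the ID
anchored_q = re.match(r'http.*?comment/(.*?)$', with_query).group(1)
segment_q = re.match(r'http.*?comment/([^/?#]+)', with_query).group(1)

print(anchored, segment)   # abc123 abc123
print(anchored_q)          # abc123?page=2
print(segment_q)           # abc123
```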

2. search(): Scan Entire String for First Match

Unlike match(), which only matches from the start of the string, search() scans the entire input string and returns the first successful match.

Example of failed match():

import re

test_string = 'Extra text Hello 1234567 Demo Example'
match_obj = re.match(r'Hello.*Demo', test_string)
print(match_obj)

Outputs None because the string does not start with Hello.

Rewritten with search():

import re

test_string = 'Extra text Hello 1234567 Demo Example'
search_obj = re.search(r'Hello.*?(\d+).*Demo', test_string)
print(search_obj.group(1))

Outputs 1234567.
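
To make the relationship between the two functions concrete, the sketch below compares them on the same string. It also shows that search() returns only the first match, even when several exist.

```python
import re

test_string = 'Extra text Hello 1234567 Demo Example'

# match() only tries position 0
from_start = re.match(r'Hello', test_string)          # None
# search() tries every position and reports where it matched
anywhere = re.search(r'Hello', test_string)
print(anywhere.span())                                # (11, 16)
# Anchoring search() with ^ makes it behave like match() here
anchored = re.search(r'^Hello', test_string)          # None

# Only the first of several possible matches is returned
first_digit = re.search(r'\d', 'a1b2c3').group()      # '1'
```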

Practical Example: Extract Data from HTML

Using a sample HTML snippet:

html_content = '''<div id="song-list">
    <h2 class="title">Classic Old Songs</h2>
    <p class="intro">Classic song list</p>
    <ul id="list" class="list-group">
        <li data-view="2">Path I Walk With You</li>
        <li data-view="7">
            <a href="/1.mp3" singer="Richie Jen">The Laughing Sea</a>
        </li>
        <li data-view="4" class="active">
            <a href="/2.mp3" singer="Chyi Chin">Past随风</a>
        </li>
        <li data-view="6"><a href="/3.mp3" singer="Beyond">Glory Years</a></li>
        <li data-view="5"><a href="/4.mp3" singer="Kelly Chen">Notebook</a></li>
        <li data-view="5">
            <a href="/5.mp3" singer="Teresa Teng">I Hope You Are Well</a>
        </li>
    </ul>
</div>'''

To extract the singer and song name from the active <li> node:

import re

match_result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html_content, re.S)
if match_result:
    print(match_result.group(1), match_result.group(2))

Outputs Chyi Chin Past随风.

Without active in the pattern, search() returns the first node that matches:

import re

match_result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html_content, re.S)
if match_result:
    print(match_result.group(1), match_result.group(2))

Outputs Richie Jen The Laughing Sea.
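
The re.S flag matters in these searches because an <li> tag and its <a> child can sit on different lines. A minimal sketch with a made-up snippet (the snippet and its song title are placeholders, not part of the sample above):

```python
import re

snippet = '''<li data-view="4" class="active">
    <a href="/2.mp3" singer="Chyi Chin">Sample Song</a>
</li>'''

pattern = r'<li.*?active.*?singer="(.*?)">(.*?)</a>'

# Without re.S, .*? cannot cross the newline between <li> and <a>
without_flag = re.search(pattern, snippet)
# With re.S, . also matches newlines, so the pattern can span lines
with_flag = re.search(pattern, snippet, re.S)

print(without_flag)          # None
print(with_flag.groups())    # ('Chyi Chin', 'Sample Song')
```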

3. findall(): Get All Matches as a List

The findall() method scans the entire string and returns a list of all matching results. When the pattern contains more than one capturing group, each list element is a tuple of the captured groups.

Example to extract all song links, singers, and names:

import re

all_results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html_content, re.S)
print(all_results)
for item in all_results:
    print(f"Link: {item[0]}, Singer: {item[1]}, Song: {item[2]}")
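
The shape of the list that findall() returns depends on how many capturing groups the pattern has. The sketch below shows the three cases on a made-up string, plus finditer(), which yields match objects instead of strings when positions are needed.

```python
import re

text = 'a=1, b=22, c=333'

# No groups: a list of whole matches
whole = re.findall(r'\w=\d+', text)        # ['a=1', 'b=22', 'c=333']
# One group: a list of that group's strings
values = re.findall(r'\w=(\d+)', text)     # ['1', '22', '333']
# Several groups: a list of tuples
pairs = re.findall(r'(\w)=(\d+)', text)    # [('a', '1'), ('b', '22'), ('c', '333')]

# finditer() gives match objects, so start positions are available
positions = [(m.group(1), m.start()) for m in re.finditer(r'(\w)=(\d+)', text)]
print(positions)                           # [('a', 0), ('b', 5), ('c', 11)]
```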

4. sub(): Replace Matched Text

Use sub() to replace matched text with a specified string. This is useful for cleaning up text.

Example to remove all digits from a string:

import re

raw_text = '54aK54yr5oiR54ix5L2g'
clean_text = re.sub(r'\d+', '', raw_text)
print(clean_text)

Outputs aKyroiRixLg.
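
The replacement does not have to be a fixed string. As a sketch of a few other standard options: it can refer back to captured groups, it can be a function, and the number of replacements can be limited or counted. The sample strings and the helper name double_number are made up for illustration.

```python
import re

# Backreferences in the replacement reorder the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-05-12')
print(reordered)   # 12/05/2024

# A function receives each match object and returns the replacement text
def double_number(match_obj):
    return str(int(match_obj.group()) * 2)

doubled = re.sub(r'\d+', double_number, 'a1 b20')
print(doubled)     # a2 b40

# count limits how many matches are replaced
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)
print(first_only)  # a#b2c3

# subn() also reports how many replacements were made
cleaned, n = re.subn(r'\d', '', 'a1b2c3')
print(cleaned, n)  # abc 3
```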

You can also use sub() to simplify HTML parsing by removing tags first:

import re

# Remove all <a> tags
clean_html = re.sub(r'<a.*?>|</a>', '', html_content)
# Extract song names from <li> tags
song_names = re.findall(r'<li.*?>(.*?)</li>', clean_html, re.S)
for name in song_names:
    print(name.strip())

5. compile(): Precompile Regex Patterns

Use compile() to convert a regex string into a reusable pattern object. This keeps the pattern and its flags in one place and can help performance when the same pattern is used many times.

Example:

import re

# Precompile regex to match time strings
time_pattern = re.compile(r'\d{2}:\d{2}')
date1 = '2024-05-12 14:30'
date2 = '2024-05-13 15:45'
date3 = '2024-05-14 16:00'

# Remove time from each date string
clean1 = time_pattern.sub('', date1)
clean2 = time_pattern.sub('', date2)
clean3 = time_pattern.sub('', date3)
print(clean1, clean2, clean3)

Outputs the three dates 2024-05-12, 2024-05-13, and 2024-05-14, each still followed by the space that preceded the removed time.
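
A compiled pattern has the same methods as the module-level functions (match, search, findall, sub, and so on), and flags are passed once at compile time. The sketch below is one possible variation on the example above: the pattern is extended with \s* and $ so that the space before the time is removed as well, which is an assumption about the desired output.

```python
import re

# Match optional whitespace plus a trailing HH:MM time
time_pattern = re.compile(r'\s*\d{2}:\d{2}$')
dates = ['2024-05-12 14:30', '2024-05-13 15:45', '2024-05-14 16:00']

# Call sub() on the pattern object itself
clean_dates = [time_pattern.sub('', d) for d in dates]
print(clean_dates)   # ['2024-05-12', '2024-05-13', '2024-05-14']

# Flags are given to compile() and apply to every later call
greeting = re.compile(r'hello', re.I)
print(greeting.search('Say HELLO').group())   # HELLO
print(greeting.findall('Hello, hello'))       # ['Hello', 'hello']
```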

Posted on Sun, 10 May 2026 08:12:24 +0000 by movieflick