Handling Regular Expressions and Tabular Data in Python

Using the `re` Module for Pattern Matching

Regular expressions (regex) are sequences of characters defining a search pattern, commonly used for string validation and parsing. Python's re module provides full support for these patterns.

Matching from the Start: `re.match`

The match function attempts to match a pattern only at the beginning of a string. If the pattern is not found at index 0, it returns None.

Syntax:

result = re.match(pattern, text, flags=0)

Argument	Description
`pattern`	The regex sequence to search for.
`text`	The string input to be scanned.
`flags`	Modifiers (e.g., `re.IGNORECASE`) to change matching behavior.

Extracting Data:

result.group(num): Returns the specific match group. group(0) returns the whole match.
result.groups(): Returns a tuple containing all subgroups.
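
As a minimal sketch of how `group` and `groups` behave, the snippet below matches a date at the start of a string. The sample text and the date pattern are made up for illustration; the point is the `None` check before touching the match object.

```python
import re

# Hypothetical input: an ISO-style date at the very start of the string.
text = "2024-03-15 backup completed"
result = re.match(r"(\d{4})-(\d{2})-(\d{2})", text)

if result is not None:           # re.match returns None when nothing matches
    whole = result.group(0)      # '2024-03-15' (the entire match)
    year = result.group(1)       # '2024' (first capture group)
    parts = result.groups()      # ('2024', '03', '15')
    print(whole, year, parts)

# The same pattern fails when the date is not at index 0.
no_match = re.match(r"(\d{4})-(\d{2})-(\d{2})", "backup 2024-03-15")
print(no_match)                  # None
```

Calling `.group()` on a failed match raises `AttributeError`, which is why the `None` check comes first.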

Scanning the Whole String: `re.search`

Unlike match, search scans through the entire string and returns a match object for the first location where the pattern is found.

result = re.search(pattern, text, flags=0)
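
The difference between the two functions is easiest to see side by side. This is a small sketch using a made-up input string:

```python
import re

text = "Order ID: 4821"

at_start = re.match(r"\d+", text)    # None: the string starts with 'O', not a digit
anywhere = re.search(r"\d+", text)   # finds the first run of digits anywhere

print(at_start)                      # None
print(anywhere.group())              # 4821
print(anywhere.span())               # (10, 14)
```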

Compiling Patterns

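For repeated use, compile a regex into a pattern object using re.compile. This avoids re-processing the pattern on every call and can help in performance-sensitive code, although the re module also caches recently used patterns internally, so the gain is often modest.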

regex_obj = re.compile(r'\d+')
match_obj = regex_obj.search('User 123')

Common Flags:

re.I: Case-insensitive matching.
re.M: Multi-line matching (affects ^ and $).
re.S: Makes . match any character including newlines.
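
The flags can be combined with `|`. The sketch below, using an invented log-like string, shows what each of the three flags changes:

```python
import re

text = "Error: disk full\nerror: retrying\nOK"

# re.I | re.M: match 'error' at the start of any line, in any letter case.
line_errors = re.findall(r"^error", text, flags=re.I | re.M)
print(line_errors)       # ['Error', 'error']

# Without re.S, '.' stops at the first newline.
single_line = re.search(r"Error:.*", text).group()
print(single_line)       # Error: disk full

# With re.S, '.' also matches '\n', so '.*' runs to the end of the text.
across_lines = re.search(r"Error:.*", text, flags=re.S).group()
print(across_lines == text)  # True
```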

Search and Replace: `re.sub`

To replace occurrences of a pattern, use sub.

altered_text = re.sub(pattern, replacement, text, count=0, flags=0)

replacement: The string (or function) to replace the match with.
count: Maximum number of replacements. 0 means replace all.
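
A short sketch of the three common ways to call `re.sub`: a plain string, a string with backreferences, and a function. The sample text and the `double` helper are invented for illustration.

```python
import re

text = "apples 3, pears 10, plums 7"

# Plain string replacement, limited to the first match with count=1.
first_only = re.sub(r"\d+", "N", text, count=1)
print(first_only)   # apples N, pears 10, plums 7

# Backreferences (\1, \2) reuse captured groups in the replacement.
swapped = re.sub(r"(\w+) (\d+)", r"\2 \1", text)
print(swapped)      # 3 apples, 10 pears, 7 plums

# Function replacement: the callable receives each match object
# and returns the replacement string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)      # apples 6, pears 20, plums 14
```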

Finding All Occurrences

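findall: Returns a list of all non-overlapping matches as strings. If the pattern contains capture groups, the list holds the group contents instead (tuples when there is more than one group).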

matches = re.findall(r'\w+', 'Hello World')
# Result: ['Hello', 'World']

finditer: Returns an iterator yielding match objects. Useful for large texts to save memory.
```
for m in re.finditer(r'\d+', 'a1b2c3'):
    print(m.group())
```
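
The return shape of `findall` depends on how many capture groups the pattern has, while `finditer` always yields match objects. A small sketch with a made-up input:

```python
import re

text = "x=10, y=20"

# No groups: a list of the full matches.
numbers = re.findall(r"\d+", text)
print(numbers)   # ['10', '20']

# One group: a list of that group's text.
names = re.findall(r"(\w)=\d+", text)
print(names)     # ['x', 'y']

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w)=(\d+)", text)
print(pairs)     # [('x', '10'), ('y', '20')]

# finditer gives access to positions as well as the matched text.
spans = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(spans)     # [('10', 2), ('20', 8)]
```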

Splitting Strings: `re.split`

Split a string by the occurrences of a pattern.

parts = re.split(r'\s+', 'Split   by   spaces')
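
A few behaviours of `re.split` that are easy to trip over, sketched with invented inputs: the `maxsplit` limit, capture groups in the separator, and empty strings at the edges.

```python
import re

# maxsplit=1 stops after the first separator.
key_value = re.split(r"\s*=\s*", "key = a = b", maxsplit=1)
print(key_value)        # ['key', 'a = b']

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(,)", "a,b,c")
print(with_separators)  # ['a', ',', 'b', ',', 'c']

# A separator at the very start produces a leading empty string.
leading = re.split(r",", ",a,b")
print(leading)          # ['', 'a', 'b']
```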

Regex Syntax Cheat Sheet

Symbol	Meaning
`^`	Start of string (or line in multiline mode).
`$`	End of string (or line in multiline mode).
`.`	Any character (except newline unless `re.S` is used).
`\d`	Any digit (equivalent to `[0-9]`).
`\D`	Any non-digit.
`\w`	Alphanumeric character or underscore.
`\W`	Any character that is not alphanumeric or underscore.
`\s`	Whitespace (space, tab, newline).
`\b`	Word boundary.
`*`	0 or more repetitions of the preceding element.
`+`	1 or more repetitions.
`?`	0 or 1 repetition (or non-greedy modifier).
`{n,m}`	Between `n` and `m` repetitions.
`[abc]`	Matches a single character: `a`, `b`, or `c`.
`[^abc]`	Matches any character except `a`, `b`, or `c`.
`(a|b)`	Alternation: matches either `a` or `b`.
`(...)`	Grouping. Captures the matched text.
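
To see several of these symbols in action, here is a short sketch with invented sample strings. It uses `(?:...)`, the non-capturing form of a group, which is not listed in the table above, so that `findall` returns whole matches rather than group contents.

```python
import re

html = "<b>one</b><b>two</b>"

# '*' is greedy: it grabs as much as possible.
greedy = re.findall(r"<b>.*</b>", html)
print(greedy)      # ['<b>one</b><b>two</b>']

# '*?' is non-greedy: it stops at the first possible closing tag.
lazy = re.findall(r"<b>.*?</b>", html)
print(lazy)        # ['<b>one</b>', '<b>two</b>']

# '\b' keeps 'cat' from matching inside 'concat'.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_word)  # ['cat', 'cat']

# Alternation inside a non-capturing group.
either = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(either)      # ['gray', 'grey']

# '{2,3}' with word boundaries: numbers of two or three digits only.
codes = re.findall(r"\b\d{2,3}\b", "7 42 128 4096")
print(codes)       # ['42', '128']
```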

Working with `pandas` for Data Analysis

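Pandas is the de facto standard third-party library for handling structured data (such as CSV and Excel files) in Python; it is not part of the Python standard library and must be installed separately. It introduces two primary data structures: Series (1D) and DataFrame (2D).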

Creating a DataFrame

You can create a DataFrame from a dictionary where keys become column names and values become the data.

import pandas as pd

data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard'],
    'Price': [1200, 25, 75],
    'Stock': [10, 150, 60]
}

df = pd.DataFrame(data)
print(df)

Reading External Data

Pandas excels at reading files. The most common function is read_csv.

# Reading a CSV file
df_sales = pd.read_csv('sales_data.csv')

# Reading an Excel file
df_excel = pd.read_excel('report.xlsx', sheet_name='2023')
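
Because `sales_data.csv` is not included here, the sketch below feeds `read_csv` an in-memory string instead of a file path. The column names and values are invented; `parse_dates` is a standard `read_csv` parameter that converts the named columns to datetimes.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "date,region,revenue\n"
    "2023-01-05,North,1200.5\n"
    "2023-01-06,South,830.0\n"
)

# read_csv accepts any file-like object, not just a path.
df_sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df_sales.shape)   # (2, 3)
print(df_sales.dtypes)  # 'date' is parsed as a datetime column
```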

Basic Inspection

To understand the shape and content of your data:

# Display the first 5 rows
df.head()

# Get summary statistics (mean, std, min, max)
df.describe()

# Check data types and non-null counts
df.info()

Filtering and Selection

Selecting specific columns or rows based on conditions is a core operation.

# Select a single column (returns a Series)
prices = df['Price']

# Select multiple columns (returns a DataFrame)
subset = df[['Product', 'Price']]

# Filter rows where Price is greater than 100
high_value = df[df['Price'] > 100]
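
Conditions can be combined, and `.loc` selects rows and columns in one step. A minimal sketch that rebuilds the product table from earlier so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200, 25, 75],
    "Stock": [10, 150, 60],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
well_stocked = df[(df["Price"] > 20) & (df["Stock"] >= 60)]
print(well_stocked["Product"].tolist())  # ['Mouse', 'Keyboard']

# .loc[row_condition, column] filters rows and picks a column together.
cheap_names = df.loc[df["Price"] < 100, "Product"]
print(cheap_names.tolist())              # ['Mouse', 'Keyboard']
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.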

Handling Missing Values

Real-world data often has gaps. Pandas uses NaN (Not a Number) to represent missing values.

# Check for missing values
missing_stats = df.isnull().sum()

# Option 1: Drop rows with any missing values
df_clean = df.dropna()

# Option 2: Fill missing values with a default
df_filled = df.fillna(0)
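
Filling every gap with 0 is not always appropriate. The sketch below, using an invented table with two missing cells, shows a per-column fill as one alternative; which fill value makes sense depends on the data.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200.0, nan, 75.0],   # Mouse has no price
    "Stock": [10, 150, nan],        # Keyboard has no stock count
})

# Count missing cells per column.
missing = df.isnull().sum()
print(missing.to_dict())  # {'Product': 0, 'Price': 1, 'Stock': 1}

# dropna() keeps only rows with no missing values at all.
dropped = df.dropna()
print(dropped["Product"].tolist())  # ['Laptop']

# fillna() with a dict applies a different default to each column.
filled = df.fillna({"Price": df["Price"].mean(), "Stock": 0})
print(filled["Price"].tolist())     # [1200.0, 637.5, 75.0]
```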

Grouping Data

The groupby operation is used to split data, apply a function (like sum or mean), and combine the results.

# Assuming a DataFrame 'orders' with columns 'Category' and 'Revenue'
category_summary = orders.groupby('Category')['Revenue'].sum()
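
Since `orders` is only assumed above, here is a self-contained sketch with a small invented `orders` table. It repeats the sum and then uses `agg` to compute several statistics per group at once.

```python
import pandas as pd

# Hypothetical order data for illustration.
orders = pd.DataFrame({
    "Category": ["Hardware", "Accessories", "Hardware", "Accessories"],
    "Revenue": [1200, 25, 800, 75],
})

# Split by Category, sum Revenue within each group, combine into a Series.
category_summary = orders.groupby("Category")["Revenue"].sum()
print(category_summary.to_dict())  # {'Accessories': 100, 'Hardware': 2000}

# agg() applies several functions in one pass and returns a DataFrame.
stats = orders.groupby("Category")["Revenue"].agg(["sum", "mean", "count"])
print(stats)
```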

Tags: python regular expressions Pandas Data Processing regex

Posted on Sun, 10 May 2026 07:32:09 +0000 by balloontrader
