Handling Regular Expressions and Tabular Data in Python

Using the re Module for Pattern Matching

Regular expressions (regex) are sequences of characters defining a search pattern, commonly used for string validation and parsing. Python's re module provides full support for these patterns.

Matching from the Start: re.match

The match function attempts to match a pattern only at the beginning of a string. If the pattern is not found at index 0, it returns None.

Syntax:

result = re.match(pattern, text, flags=0)
  • pattern: The regex sequence to search for.
  • text: The string input to be scanned.
  • flags: Optional modifiers (e.g., re.IGNORECASE) that change matching behavior.

Extracting Data:

  • result.group(num): Returns the specific match group. group(0) returns the whole match.
  • result.groups(): Returns a tuple containing all subgroups.
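A minimal sketch of match and its group accessors (the "key: value" pattern and sample string are illustrative, not from the original):

```python
import re

# Match a "key: value" pair anchored at the start of the string.
result = re.match(r'(\w+):\s*(\d+)', 'port: 8080 extra')

if result:
    print(result.group(0))   # whole match: 'port: 8080'
    print(result.group(1))   # first subgroup: 'port'
    print(result.group(2))   # second subgroup: '8080'
    print(result.groups())   # all subgroups: ('port', '8080')
```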

Scanning the Whole String: re.search

Unlike match, search scans through the entire string and returns a match object for the first location where the pattern is found.

result = re.search(pattern, text, flags=0)
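The difference from match is easy to see on a string whose pattern does not start at index 0 (the sample text is illustrative):

```python
import re

text = 'Order number: 42'

# match fails because the string does not begin with a digit.
print(re.match(r'\d+', text))    # None

# search scans ahead and finds the first occurrence.
m = re.search(r'\d+', text)
if m:
    print(m.group(), m.start())  # '42' at index 14
```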

Compiling Patterns

For repeated use, compile a regex into a pattern object using re.compile. The compiled object can be reused without re-parsing the pattern, which is more efficient in performance-sensitive code.

regex_obj = re.compile(r'\d+')
match_obj = regex_obj.search('User 123')

Common Flags:

  • re.I: Case-insensitive matching.
  • re.M: Multi-line matching (affects ^ and $).
  • re.S: Makes . match any character including newlines.
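Flags can be combined with the | operator, as in this small sketch (the sample strings are illustrative):

```python
import re

text = 'Hello World\nhello again'

# re.I ignores case; re.M makes ^ anchor at the start of each line.
matches = re.findall(r'^hello', text, flags=re.I | re.M)
print(matches)  # ['Hello', 'hello']
```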

Search and Replace: re.sub

To replace occurrences of a pattern, use sub.

altered_text = re.sub(pattern, replacement, text, count=0, flags=0)
  • replacement: The string (or function) to replace the match with.
  • count: Maximum number of replacements. 0 means replace all.
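A brief illustrative sketch (the sample strings are assumptions, not from the original):

```python
import re

# Mask every digit in a string.
masked = re.sub(r'\d', '#', 'Card 1234, PIN 99')
print(masked)  # 'Card ####, PIN ##'

# Limit the number of replacements with count.
first_only = re.sub(r'\d+', 'N', 'a1 b2 c3', count=1)
print(first_only)  # 'aN b2 c3'
```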

Finding All Occurrences

  • findall: Returns a list of all non-overlapping matches as strings.
    matches = re.findall(r'\w+', 'Hello World')
    # Result: ['Hello', 'World']
    
  • finditer: Returns an iterator yielding match objects. Useful for large texts to save memory.
    for m in re.finditer(r'\d+', 'a1b2c3'):
        print(m.group())
    

Splitting Strings: re.split

Split a string by the occurrences of a pattern.

parts = re.split(r'\s+', 'Split   by   spaces')
# parts == ['Split', 'by', 'spaces']

Regex Syntax Cheat Sheet

Symbol Meaning
^ Start of string (or line in multiline mode).
$ End of string (or line in multiline mode).
. Any character (except newline unless re.S is used).
\d Any digit (equivalent to [0-9]).
\D Any non-digit.
\w Alphanumeric character or underscore.
\W Non-alphanumeric character.
\s Whitespace (space, tab, newline).
\b Word boundary.
* 0 or more repetitions of the preceding element.
+ 1 or more repetitions.
? 0 or 1 repetition (or non-greedy modifier).
{n,m} Between n and m repetitions.
[abc] Matches a single character: a, b, or c.
[^abc] Matches any character except a, b, or c.
(a|b) Matches either a or b (alternation).
(...) Grouping. Captures the matched text.
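Several of these tokens combined in one pattern, as an illustrative sketch (the phone-number-like format is an assumption):

```python
import re

# \b word boundary, \d{3} exactly three digits, [-.] a dash or a dot
pattern = re.compile(r'\b\d{3}[-.]\d{4}\b')

print(bool(pattern.search('Call 555-0199 today')))  # True
print(bool(pattern.search('No number here')))       # False
```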

Working with pandas for Data Analysis

Pandas is the standard library for handling structured data (like CSV, Excel) in Python. It introduces two primary data structures: Series (1D) and DataFrame (2D).

Creating a DataFrame

You can create a DataFrame from a dictionary where keys become column names and values become the data.

import pandas as pd

data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard'],
    'Price': [1200, 25, 75],
    'Stock': [10, 150, 60]
}

df = pd.DataFrame(data)
print(df)

Reading External Data

Pandas excels at reading files. The most common function is read_csv.

# Reading a CSV file
df_sales = pd.read_csv('sales_data.csv')

# Reading an Excel file
df_excel = pd.read_excel('report.xlsx', sheet_name='2023')

Basic Inspection

To understand the shape and content of your data:

# Display the first 5 rows
df.head()

# Get summary statistics (mean, std, min, max)
df.describe()

# Check data types and non-null counts
df.info()

Filtering and Selection

Selecting specific columns or rows based on conditions is a core operation.

# Select a single column (returns a Series)
prices = df['Price']

# Select multiple columns (returns a DataFrame)
subset = df[['Product', 'Price']]

# Filter rows where Price is greater than 100
high_value = df[df['Price'] > 100]

Handling Missing Values

Real-world data often has gaps. Pandas uses NaN (Not a Number) to represent missing values.

# Check for missing values
missing_stats = df.isnull().sum()

# Option 1: Drop rows with any missing values
df_clean = df.dropna()

# Option 2: Fill missing values with a default
df_filled = df.fillna(0)

Grouping Data

The groupby operation is used to split data, apply a function (like sum or mean), and combine the results.

# Assuming a DataFrame 'orders' with columns 'Category' and 'Revenue'
category_summary = orders.groupby('Category')['Revenue'].sum()
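The same split-apply-combine idea as a self-contained sketch, using made-up data:

```python
import pandas as pd

orders = pd.DataFrame({
    'Category': ['Toys', 'Books', 'Toys', 'Books'],
    'Revenue': [100, 40, 60, 10],
})

# Sum revenue within each category.
category_summary = orders.groupby('Category')['Revenue'].sum()
print(category_summary)
# Category
# Books     50
# Toys     160
# Name: Revenue, dtype: int64
```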

Tags: python regular expressions Pandas Data Processing regex

Posted on Sun, 10 May 2026 07:32:09 +0000 by balloontrader