Using the re Module for Pattern Matching
Regular expressions (regex) are sequences of characters defining a search pattern, commonly used for string validation and parsing. Python's re module provides full support for these patterns.
Matching from the Start: re.match
The match function attempts to match a pattern only at the beginning of a string. If the pattern is not found at index 0, it returns None.
Syntax:
result = re.match(pattern, text, flags=0)
| Argument | Description |
|---|---|
| pattern | The regex sequence to search for. |
| text | The string input to be scanned. |
| flags | Modifiers (e.g., re.IGNORECASE) to change matching behavior. |
Extracting Data:
- result.group(num): Returns the text matched by group number num; group(0) returns the whole match.
- result.groups(): Returns a tuple containing all subgroups.
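As a minimal sketch of match together with group and groups, the snippet below pulls a date out of the start of a string. The sample text and the date pattern are made up for illustration.

```python
import re

# Hypothetical sample input and pattern, chosen only for illustration.
text = "2023-10-05 order shipped"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

result = re.match(pattern, text)
if result is not None:
    whole = result.group(0)   # entire matched text
    year = result.group(1)    # first capture group
    parts = result.groups()   # all capture groups as a tuple
else:
    whole, year, parts = None, None, ()

# The same pattern fails when the date is not at index 0.
no_match = re.match(pattern, "order shipped 2023-10-05")
```

Checking for None before calling group avoids an AttributeError when the pattern does not match.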
Scanning the Whole String: re.search
Unlike match, search scans through the entire string and returns a match object for the first location where the pattern is found.
result = re.search(pattern, text, flags=0)
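The difference between match and search is easiest to see on the same input. This is a small sketch with a hypothetical sample string:

```python
import re

text = "Order ID: 4821"  # hypothetical sample string

at_start = re.match(r"\d+", text)   # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)  # finds the first run of digits

found = anywhere.group() if anywhere else None
position = anywhere.span() if anywhere else None
```

Like match, search returns None when nothing is found, so the result should be checked before use.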
Compiling Patterns
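For repeated use, compile a regex into a pattern object using re.compile. The re module already caches recently used patterns, so the gain is modest, but precompiling skips the cache lookup on every call and gives the pattern a reusable name, which helps in tight loops.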
import re

regex_obj = re.compile(r'\d+')            # compile once
match_obj = regex_obj.search('User 123')  # reuse; match_obj.group() is '123'
Common Flags:
- re.I: Case-insensitive matching.
- re.M: Multi-line matching (affects ^ and $).
- re.S: Makes . match any character, including newlines.
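The flags can be combined with the | operator and passed either to re.compile or to the module-level functions. A short sketch, using an invented three-line log string:

```python
import re

text = "Error: disk full\nerror: retrying\nok"  # hypothetical log text

# re.I | re.M: case-insensitive, and ^ matches at the start of every line.
line_start_error = re.compile(r"^error", re.I | re.M)
hits = line_start_error.findall(text)

# Without re.S, "." stops at a newline; with it, "." crosses line breaks.
without_dotall = re.search(r"full.*retrying", text)
with_dotall = re.search(r"full.*retrying", text, re.S)
```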
Search and Replace: re.sub
To replace occurrences of a pattern, use sub.
altered_text = re.sub(pattern, replacement, text, count=0, flags=0)
- replacement: The string (or function) to replace the match with.
- count: Maximum number of replacements. 0 means replace all.
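The sketch below illustrates the three common forms of sub: a plain string with count, a replacement function, and a backreference. The sample strings and the helper name double are illustrative only.

```python
import re

text = "a1 b22 c333"  # hypothetical sample string

# String replacement, limited to the first two matches.
masked = re.sub(r"\d+", "#", text, count=2)

# Function replacement: called with each match object, returns the new text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

# Backreferences in the replacement string: swap "key=value" to "value=key".
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "x=1")
```

Passing count by keyword keeps the call readable and avoids confusing it with flags.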
Finding All Occurrences
- findall: Returns a list of all non-overlapping matches as strings (or as tuples of groups when the pattern contains more than one capture group).
    matches = re.findall(r'\w+', 'Hello World')  # Result: ['Hello', 'World']
- finditer: Returns an iterator yielding match objects. Useful for large texts to save memory.
    for m in re.finditer(r'\d+', 'a1b2c3'):
        print(m.group())
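Two details are worth seeing in code: findall changes its return shape when the pattern has capture groups, and finditer exposes positions through the match objects. A minimal sketch on a made-up string:

```python
import re

text = "x=10, y=20"  # hypothetical sample string

# With two capture groups, findall returns a list of tuples.
pairs = re.findall(r"(\w)=(\d+)", text)

# finditer yields match objects, so positions are available as well.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
```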
Splitting Strings: re.split
Split a string at each occurrence of a pattern.
parts = re.split(r'\s+', 'Split by spaces')  # Result: ['Split', 'by', 'spaces']
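A short sketch of split on mixed delimiters follows. It also covers the maxsplit argument and what a capture group in the pattern does. The input line is invented for illustration.

```python
import re

line = "alpha, beta;gamma"  # hypothetical sample string

# Split on a comma or semicolon followed by optional whitespace.
fields = re.split(r"[,;]\s*", line)

# maxsplit limits the number of splits; the remainder stays intact.
first_two = re.split(r"[,;]\s*", line, maxsplit=1)

# A capture group in the pattern keeps the separators in the result.
with_seps = re.split(r"([,;])\s*", line)
```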
Regex Syntax Cheat Sheet
| Symbol | Meaning |
|---|---|
| ^ | Start of string (or line in multiline mode). |
| $ | End of string (or line in multiline mode). |
| . | Any character (except newline unless re.S is used). |
| \d | Any digit (equivalent to [0-9] for ASCII text). |
| \D | Any non-digit. |
| \w | Alphanumeric character or underscore. |
| \W | Any character that is not alphanumeric or an underscore. |
| \s | Whitespace (space, tab, newline, and similar). |
| \b | Word boundary. |
| * | 0 or more repetitions of the preceding element. |
| + | 1 or more repetitions. |
| ? | 0 or 1 repetition (or, placed after another quantifier, makes it non-greedy). |
| {n,m} | Between n and m repetitions. |
| [abc] | Matches a single character: a, b, or c. |
| [^abc] | Matches any character except a, b, or c. |
| `(a\|b)` | Alternation: matches either a or b. |
| (...) | Grouping. Captures the matched text. |
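A few of the symbols above behave in ways that are easier to see than to describe. The sketch below contrasts greedy and non-greedy repetition, then uses \b, ^, $, and {n,m} on made-up inputs.

```python
import re

html = "<b>one</b> <b>two</b>"  # hypothetical sample string

greedy = re.findall(r"<b>.*</b>", html)   # * takes as much as possible
lazy = re.findall(r"<b>(.*?)</b>", html)  # *? stops at the first closing tag

# \b keeps "cat" from matching inside "concatenate".
words = re.findall(r"\bcat\b", "cat concatenate cat.")

# ^, $, and {n,m} together: a whole-string check for 3 to 5 digits.
is_short_code = re.search(r"^\d{3,5}$", "12345") is not None
too_long = re.search(r"^\d{3,5}$", "123456") is not None
```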
Working with pandas for Data Analysis
pandas is a widely used third-party library (not part of the Python standard library) for handling structured data, such as CSV or Excel files, in Python. It introduces two primary data structures: Series (1D) and DataFrame (2D).
Creating a DataFrame
You can create a DataFrame from a dictionary where keys become column names and values become the data.
import pandas as pd
data = {
'Product': ['Laptop', 'Mouse', 'Keyboard'],
'Price': [1200, 25, 75],
'Stock': [10, 150, 60]
}
df = pd.DataFrame(data)
print(df)
Reading External Data
Pandas excels at reading files. The most common function is read_csv.
# Reading a CSV file
df_sales = pd.read_csv('sales_data.csv')
# Reading an Excel file (needs an Excel engine installed, e.g. openpyxl for .xlsx)
df_excel = pd.read_excel('report.xlsx', sheet_name='2023')
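To try read_csv without a file on disk, the content can be supplied through io.StringIO. This is only a sketch: the CSV text is invented, and usecols is shown as one of several optional parameters.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "Product,Price,Stock\nLaptop,1200,10\nMouse,25,150\n"

# usecols restricts loading to the named columns.
df_sales = pd.read_csv(io.StringIO(csv_text), usecols=["Product", "Price"])
```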
Basic Inspection
To understand the shape and content of your data:
# Display the first 5 rows
df.head()
# Get summary statistics for numeric columns (count, mean, std, min, quartiles, max)
df.describe()
# Check data types and non-null counts
df.info()
Filtering and Selection
Selecting specific columns or rows based on conditions is a core operation.
# Select a single column (returns a Series)
prices = df['Price']
# Select multiple columns (returns a DataFrame)
subset = df[['Product', 'Price']]
# Filter rows where Price is greater than 100
high_value = df[df['Price'] > 100]
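Conditions can also be combined, and .loc can filter rows and pick columns in one step. The sketch below rebuilds the small product table from earlier so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200, 25, 75],
    "Stock": [10, 150, 60],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
cheap_and_stocked = df[(df["Price"] < 100) & (df["Stock"] > 50)]

# .loc selects rows by condition and columns by name in one step.
names = df.loc[df["Price"] < 100, "Product"].tolist()
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.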
Handling Missing Values
Real-world data often has gaps. Pandas represents missing numeric values as NaN (Not a Number), and also treats None and NaT (for dates) as missing.
# Check for missing values
missing_stats = df.isnull().sum()
# Option 1: Drop rows with any missing values
df_clean = df.dropna()
# Option 2: Fill missing values with a default
df_filled = df.fillna(0)
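The sketch below runs these options on a small table with one gap. The data is invented, and filling with the column mean is just one possible choice, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200.0, float("nan"), 75.0],  # one missing price
})

missing_stats = df.isnull().sum()  # missing count per column
df_clean = df.dropna()             # drops the "Mouse" row

# Fill only the Price column, here with the column mean (NaN is skipped by mean).
df_filled = df.fillna({"Price": df["Price"].mean()})
```

Passing a dictionary to fillna limits the fill to the named columns, which avoids writing a numeric default into text columns.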
Grouping Data
The groupby operation is used to split data, apply a function (like sum or mean), and combine the results.
# Assuming a DataFrame 'orders' with columns 'Category' and 'Revenue'
category_summary = orders.groupby('Category')['Revenue'].sum()
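To make the groupby line runnable, the sketch below builds a small orders table with invented values, then uses agg to compute several statistics per group.

```python
import pandas as pd

# Hypothetical orders data with the two columns assumed above.
orders = pd.DataFrame({
    "Category": ["Hardware", "Hardware", "Software"],
    "Revenue": [100, 250, 80],
})

# One total per category; the result is a Series indexed by Category.
category_summary = orders.groupby("Category")["Revenue"].sum()

# agg applies several functions at once, one result column per function.
category_stats = orders.groupby("Category")["Revenue"].agg(["sum", "mean", "count"])
```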