Pandas is a powerful Python library for data manipulation and analysis. It provides two primary data structures: Series (1D) and DataFrame (2D), along with numerous functions for data processing.
Importing Pandas
# Import necessary libraries
import numpy as np
import pandas as pd
Reading and Writting Data
Pandas supports various file formats for data import and export:
Reading CSV Files
# Read a CSV file
customer_data = pd.read_csv('customers.csv')
customer_data.head()
Reading Text Files
# Read a text file with custom separator
log_data = pd.read_table('server_logs.txt', sep='|')
log_data.head()
Reading Excel Files
# Read an Excel file
sales_data = pd.read_excel('monthly_sales.xlsx')
sales_data.head()
Reading Parameters
Key parameters for reading files:
header=None: Treat first row as data, not column namesindex_col: Use specified column(s) as indexusecols: Select specific columns to readnrows: Limit number of rows to read
# Example with parameters
sample_data = pd.read_csv('large_dataset.csv',
index_col=['ID'],
usecols=['ID', 'Name', 'Age'],
nrows=100)
Basic Data Structures
Series
A Series is a one-dimensional labeled array capable of holding any data type:
# Create a Series
product_prices = pd.Series(data=[19.99, 29.99, 9.99, 49.99],
index=['Product_A', 'Product_B', 'Product_C', 'Product_D'],
dtype='float64',
name='Price_List')
product_prices
DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types:
# Create DataFrame from list of lists
employee_records = [[101, 'John Doe', 'Engineering', 75000],
[102, 'Jane Smith', 'Marketing', 68000],
[103, 'Bob Johnson', 'Sales', 72000]]
employees = pd.DataFrame(data=employee_records,
columns=['ID', 'Name', 'Department', 'Salary'])
employees
DataFrame Operaitons
# Select a single column
employees['Name']
# Select multiple columns
employees[['Name', 'Department']]
# Transpose DataFrame
employees.T
Essential DataFrame Functions
Basic Information
# Get column names
employees.columns
# View first n rows (default 5)
employees.head(3)
# View last n rows (default 5)
employees.tail(2)
# Get DataFrame info
employees.info()
# Get descriptive statistics
employees.describe()
Statistical Functions
# Calculate mean
employees['Salary'].mean()
# Calculate maximum value
employees['Salary'].max()
# Calculate quantile
employees['Salary'].quantile(0.75)
# Count non-null values
employees['Department'].count()
# Find index of maximum value
employees['Salary'].idxmax()
Unique Values
# Get unique values
employees['Department'].unique()
# Count unique values
employees['Department'].nunique()
# Value counts
employees['Department'].value_counts()
Duplicates
# Remove duplicates
unique_employees = employees.drop_duplicates(['Department', 'Salary'])
# Check for duplicates
employees.duplicated(['Department', 'Salary'])
Data Replacement
Mapping Replacement
# Replace values using mapping
employees['Department'].replace({'Engineering': 'Tech', 'Marketing': 'BizDev'})
Logical Replacement
# Replace based on condition
employees['Salary'].where(employees['Salary'] > 70000, 70000)
# Mask values based on condition
employees['Salary'].mask(employees['Salary'] < 65000, 65000)
Numerical Operations
# Round values
employees['Salary'].round(-3)
# Absolute values
employees['Salary'].abs()
# Clip values to range
employees['Salary'].clip(lower=60000, upper=80000)
Sorting Data
Value Sorting
# Sort by single column
employees.sort_values('Salary', ascending=False)
# Sort by multiple columns
employees.sort_values(['Department', 'Salary'], ascending=[True, False])
Index Sorting
# Sort by index
employees.sort_index(level=['Department', 'Name'], ascending=[True, False])
Apply Function
The apply() method alllows applying custom functions to DataFrame rows or columns:
# Define custom function
def salary_category(salary):
if salary > 70000:
return 'High'
elif salary > 60000:
return 'Medium'
else:
return 'Low'
# Apply function to column
employees['Salary_Category'] = employees['Salary'].apply(salary_category)