Python Data Analysis in Practice: Processing Titanic Survival Data with Pandas
Preparation
Before starting the analysis, import the Pandas and NumPy libraries under their standard aliases:
import pandas as pd
import numpy as np
1. Data Loading
Use pd.read_csv() to load the Titanic dataset and head() to inspect the first 5 rows:
titanic = pd.read_csv("titanic_train.csv")
titanic.head()
The output shows key fields: passenger ID, survival status, ticket class, name, sex, age, etc.
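After loading, it helps to check the table's size, the inferred column types, and how empty fields were read. The sketch below uses a small in-memory CSV as a stand-in for titanic_train.csv; the three rows are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a Titanic-like layout (not real file contents).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,,71.28\n"
    "3,1,3,female,26,7.92\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # column types inferred by read_csv
print(sample.isna().sum())  # the empty Age field is read as NaN
```

Because one Age field is empty, read_csv stores the whole column as floating point so that it can hold NaN.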
2. Handling Missing Values
Missing values (NaN) can skew analysis. Here we focus on the "Age" column.
2.1 Detecting Missing Values: isnull()
isnull() returns a boolean mask (True = missing):
age = titanic["Age"]
print("First 11 values of Age:")
print(age.loc[0:10])
age_is_null = pd.isnull(age)
print("\nMissing value detection (True = missing):")
print(age_is_null)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(f"\nNumber of missing Age values: {age_null_count}")
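The count can also be obtained without building the filtered Series, and the same mask gives the missing rate. This is a minimal sketch on a made-up Series standing in for titanic["Age"]; isna() is an alias of isnull().

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Age"]; the values are invented.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

missing_count = age.isna().sum()  # each True counts as 1
missing_rate = age.isna().mean()  # fraction of True values in the mask
print(f"Missing: {missing_count} ({missing_rate:.0%})")
```

The missing rate is useful later when deciding between filtering and filling.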
2.2 Impact and Solutions
Consequence of ignoring missing values: computing the mean directly with Python's built-in sum() returns NaN, because any arithmetic involving NaN yields NaN:
mean_age = sum(titanic["Age"]) / len(titanic["Age"])
print("Mean age without handling missing values:", mean_age) # NaN
Three Effective Solutions
Method 1: Filter missing values (use when missing rate is very low):
good_ages = titanic["Age"][~age_is_null]  # ~ inverts the boolean mask
correct_mean_age = sum(good_ages) / len(good_ages)
print("Mean age after filtering missing values:", correct_mean_age)
Method 2: Use mean() (recommended) - automatically ignores NaN:
correct_mean_age = titanic["Age"].mean()
print("Mean age using mean():", correct_mean_age)
Method 3: Fill missing values (keeps every row) - replaces each NaN with the column mean:
age_filled = titanic["Age"].fillna(titanic["Age"].mean())
print("Filled Age (first 11 rows):")
print(age_filled.loc[0:10])
⚠️ Important: As a rule of thumb, filter rows only when the missing rate is low (under about 5%); otherwise filling with the mean, median, or mode is more common because it keeps every row in the dataset.
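The choice of fill value matters when the column contains outliers. The sketch below compares mean and median filling on a small invented Series (not taken from the dataset); it assumes nothing beyond fillna(), mean(), and median().

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (80); not real Titanic values.
age = pd.Series([20.0, 22.0, np.nan, 24.0, 80.0])

filled_mean = age.fillna(age.mean())      # fill value is pulled up by the outlier
filled_median = age.fillna(age.median())  # fill value stays near the typical age

print("Mean fill:  ", filled_mean.loc[2])
print("Median fill:", filled_median.loc[2])
print("Original still has NaN:", age.isna().sum())
```

Note that fillna() returns a new Series and leaves the original untouched unless the result is assigned back.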
3. Data Statistical Analysis
3.1 Average Fare by Ticket Class
Method 1: For loop
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    # Select the rows for this class, then average their fares
    pclass_rows = titanic[titanic["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"].mean()
    fares_by_class[this_class] = pclass_fares
print("Average fare by class (for loop):", fares_by_class)
Method 2: pivot_table() (recommended)
fares_by_class = titanic.pivot_table(
    index="Pclass",
    values="Fare",
    aggfunc="mean"  # string alias; passing np.mean triggers a FutureWarning in recent pandas versions
)
print("Average fare by class (pivot_table):")
print(fares_by_class)
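For a single grouping key and a single value column, groupby() gives the same numbers as pivot_table(). The following sketch uses a tiny invented DataFrame with the same column names as the tutorial; the fares are made up.

```python
import pandas as pd

# Hypothetical rows; only the column names match the Titanic dataset.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
})

by_groupby = df.groupby("Pclass")["Fare"].mean()  # returns a Series
by_pivot = df.pivot_table(index="Pclass", values="Fare", aggfunc="mean")  # returns a DataFrame

print(by_groupby)
print(by_pivot)
```

The main visible difference is the return type: groupby() on one column yields a Series, while pivot_table() yields a DataFrame with one column per value field.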
3.2 Additional Statistical Scenarios
Scenario 1: Average survival rate by class
survival_by_class = titanic.pivot_table(
    index="Pclass",
    values="Survived",
    aggfunc="mean"  # string alias instead of np.mean
)
print("Average survival rate by class:")
print(survival_by_class)
Scenario 2: Average age by class
age_by_class = titanic.pivot_table(
    index="Pclass",
    values="Age",
    aggfunc="mean"  # string alias instead of np.mean
)
print("Average age by class:")
print(age_by_class)
Scenario 3: Multi-field statistics (total fare and survivors by embarkation point)
port_stats = titanic.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc="sum"  # string alias instead of np.sum
)
print("Total fare and survivors by embarkation point:")
print(port_stats)
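pivot_table() also accepts a dict for aggfunc, so each column can use its own aggregation. The sketch below is a variation on the scenario above, not the tutorial's exact computation: it sums Fare but averages Survived, on invented rows. It also shows that a row with a missing Embarked value does not appear in the result.

```python
import pandas as pd

# Hypothetical rows; the last passenger has no embarkation port recorded.
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", None],
    "Fare": [7.0, 8.0, 70.0, 8.0, 80.0],
    "Survived": [0, 1, 1, 0, 1],
})

port_stats = df.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "sum", "Survived": "mean"},  # one function per column
)
print(port_stats)
```

Since rows with a missing grouping key are left out by default, the totals can be smaller than the column sums over the whole table.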
4. Data Cleaning and Selection
4.1 Dropping Missing Values: dropna()
print("Before dropping (first 7 rows):")
print(titanic.head(7))
new_titanic = titanic.dropna(
    axis=0,                 # drop rows (not columns)
    subset=["Age", "Sex"]   # only these columns are checked for NaN
)
print("\nAfter dropping (first 7 rows):")
print(new_titanic.head(7))
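To see what subset changes, compare it with a plain dropna() on a small example. The DataFrame below is invented for illustration; Cabin is included because it is a column that is often empty.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; None and np.nan are both treated as missing.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", None, "female"],
    "Cabin": [None, "C85", None, None],
})

subset_drop = df.dropna(axis=0, subset=["Age", "Sex"])  # checks Age and Sex only
any_drop = df.dropna(axis=0)                            # checks every column

print(len(df), len(subset_drop), len(any_drop))
print(subset_drop.index.tolist())  # surviving rows keep their original labels
```

Without subset, a single sparse column can remove most of the table, so it is usually worth naming the columns that actually matter for the analysis.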
4.2 Precise Selection: loc[]
row_83_age = titanic.loc[83, "Age"]
row_766_pclass = titanic.loc[766, "Pclass"]
print(f"Age at row 83: {row_83_age}")
print(f"Pclass at row 766: {row_766_pclass}")
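Besides single cells, loc[] accepts label slices, lists of columns, and boolean masks. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical rows with the default 0..3 index.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

one_cell = df.loc[2, "Age"]                       # single value at row label 2
block = df.loc[1:2, ["Age", "Fare"]]              # label slice: includes both 1 and 2
first_class = df.loc[df["Pclass"] == 1, "Fare"]   # boolean mask selects rows

print(one_cell)
print(block)
print(first_class.tolist())
```

Unlike ordinary Python slicing, a loc[] slice includes its end label, which is why age.loc[0:10] earlier returned 11 values.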
5. Sorting and Index Reset
5.1 Sorting: sort_values()
new_titanic = titanic.sort_values("Age", ascending=False)
print("Sorted by age descending (first 10 rows):")
print(new_titanic.head(10))
5.2 Resetting Index: reset_index()
titanic_reindexed = new_titanic.reset_index(drop=True)
print("\nAfter reset index (first 10 rows):")
print(titanic_reindexed.head(10))
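The reason for resetting the index is that sort_values() moves the index labels along with the rows. The sketch below, on three invented rows, contrasts label-based loc[] with position-based iloc[] before and after the reset.

```python
import pandas as pd

# Hypothetical passengers; names are placeholders.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, 80.0, 35.0]})

by_age = df.sort_values("Age", ascending=False)
print(by_age.index.tolist())      # labels are now out of order

print(by_age.loc[0, "Name"])      # label 0: still the original first row
print(by_age.iloc[0]["Name"])     # position 0: the oldest passenger

renumbered = by_age.reset_index(drop=True)
print(renumbered.loc[0, "Name"])  # after the reset, label 0 is the oldest passenger
```

With drop=True the old labels are discarded; without it they would be kept as a new column named index.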
6. Custom Functions with apply()
apply() applies a function along one axis of a DataFrame. With axis=0 (the default) the function receives each column as a Series; with axis=1 it receives each row as a Series.
Scenario 1: Get the 100th row (column-wise)
def hundredth_row(column):
    # Index labels start at 0, so label 99 is the 100th row
    return column.loc[99]
hundredth_row_data = titanic.apply(hundredth_row)
print("100th row data:")
print(hundredth_row_data)
Scenario 2: Count missing values per column
def count_nulls(column):
    column_null = pd.isnull(column)    # boolean mask for this column
    return len(column[column_null])    # number of True entries
all_null_count = titanic.apply(count_nulls)
print("\nMissing value count per column:")
print(all_null_count)
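The same per-column count is available without apply(), since isna() works on a whole DataFrame. This sketch checks that both routes agree on a small invented table; count_nulls is copied from the scenario above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values in two columns.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

def count_nulls(column):
    column_null = pd.isnull(column)
    return len(column[column_null])

via_apply = df.apply(count_nulls)  # calls the function once per column
vectorized = df.isna().sum()       # sums the boolean mask column by column

print(vectorized)
```

The apply() version is still a good exercise for understanding how columns are passed to a function; the built-in form is simply shorter.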
Scenario 3: Convert ticket class to text (row-wise)
def class_to_text(row):
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"
classes_text = titanic.apply(class_to_text, axis=1)
print("\nText classes (first 10 rows):")
print(classes_text.head(10))
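When the conversion is a plain lookup, Series.map() with a dict is a common alternative to a row-wise function. The sketch below uses an invented Series in place of titanic["Pclass"]; the labels dict is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical class codes, including one missing value.
pclass = pd.Series([3, 1, 2, np.nan])

# Lookup table from class code to text (illustrative name).
labels = {1: "First Class", 2: "Second Class", 3: "Third Class"}

# Values not found in the dict (including NaN) become NaN, then "Unknown".
classes_text = pclass.map(labels).fillna("Unknown")
print(classes_text.tolist())
```

One behavioral difference from class_to_text: an unexpected code such as 4 would become "Unknown" here, whereas the if/else version would label it "Third Class".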
Scenario 4: Discretize age (row-wise)
def age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "Unknown"
    elif age < 18:
        return "Minor"
    else:
        return "Adult"
titanic["age_labels"] = titanic.apply(age_label, axis=1)
print("\nAge labels (first 10 rows):")
print(titanic[["Age", "age_labels"]].head(10))
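The same three-way labeling can be written without a row-wise function by using NumPy's np.select(), which picks the first matching condition for each element. This is a sketch on invented ages, not a replacement for the tutorial's version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Age"].
age = pd.Series([22.0, 4.0, np.nan, 18.0])

conditions = [
    age.isna().to_numpy(),   # checked first, so missing ages never reach the next test
    (age < 18).to_numpy(),
]
choices = ["Unknown", "Minor"]

# Elements matching no condition get the default label.
age_labels = pd.Series(np.select(conditions, choices, default="Adult"), index=age.index)
print(age_labels.tolist())
```

The order of the conditions mirrors the if/elif chain in age_label: the missing-value check comes first, and exactly 18 counts as "Adult".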
Bonus: Survival rate by age group
age_group_survival = titanic.pivot_table(
    index="age_labels",
    values="Survived",
    aggfunc="mean"  # string alias instead of np.mean
)
print("\nSurvival rate by age group:")
print(age_group_survival)
Key Functions Summary
| Function | Description | Key Parameters/Notes |
|---|---|---|
| pd.read_csv() | Load CSV data | File path, encoding |
| isnull() | Detect missing values | Returns boolean mask |
| mean() | Compute mean | Auto-ignores NaN |
| pivot_table() | Grouped statistics | index, values, aggfunc |
| dropna() | Drop missing values | axis, subset |
| loc[] | Index-based selection | loc[row, column] |
| sort_values() | Sort by column | ascending |
| reset_index() | Reset index | drop=True to discard old index |
| apply() | Apply custom function | axis (0 or 1) |
By mastering these steps, you can complete the full data analysis pipeline: loading, cleaning, statistical analysis, and custom operations. Practice with the dataset to deepen understanding of function parameters and appropriate use cases.