Processing Titanic Survival Data with Pandas

Python Data Analysis in Practice: Processing Titanic Survival Data with Pandas

Preparation

Before starting the analysis, import the Pandas and NumPy libraries under their standard aliases:

import pandas as pd
import numpy as np

1. Data Loading

Use pd.read_csv() to load the Titanic dataset and head() to inspect the first 5 rows:

titanic = pd.read_csv("titanic_train.csv")
titanic.head()

The output shows key fields: passenger ID, survival status, ticket class, name, sex, age, etc.

2. Handling Missing Values

Missing values (NaN) can skew analysis. Here we focus on the "Age" column.

2.1 Detecting Missing Values: isnull()

isnull() returns a boolean mask (True = missing):

age = titanic["Age"]
print("First 11 values of Age:")
print(age.loc[0:10])

age_is_null = pd.isnull(age)
print("\nMissing value detection (True = missing):")
print(age_is_null)

age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(f"\nNumber of missing Age values: {age_null_count}")
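A more compact way to get the same count is to sum the boolean mask, since True counts as 1. The sketch below uses a small made-up Series in place of the real Age column (the values are illustrative, not taken from the dataset). Taking the mean of the mask gives the missing rate, which is useful when deciding how to handle the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Age"]; the values are made up.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# True counts as 1, so summing the mask counts the missing entries.
age_null_count = int(age.isnull().sum())

# The mean of the mask is the fraction of entries that are missing.
age_null_rate = float(age.isnull().mean())

print(f"Missing: {age_null_count} of {len(age)} ({age_null_rate:.0%})")
```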

2.2 Impact and Solutions

Consequence of ignoring missing values: computing the mean directly with sum() and len() returns NaN, because any arithmetic involving NaN yields NaN:

mean_age = sum(titanic["Age"]) / len(titanic["Age"])
print("Mean age without handling missing values:", mean_age)  # NaN

Three Effective Solutions

Method 1: Filter missing values (use when missing rate is very low):

good_ages = titanic["Age"][~age_is_null]  # ~ inverts the boolean mask
correct_mean_age = sum(good_ages) / len(good_ages)
print("Mean age after filtering missing values:", correct_mean_age)

Method 2: Use mean() (recommended) - automatically ignores NaN:

correct_mean_age = titanic["Age"].mean()
print("Mean age using mean():", correct_mean_age)

Method 3: Fill missing values (often the better approach) - keeps every row in the dataset:

age_filled = titanic["Age"].fillna(titanic["Age"].mean())
print("Filled Age (first 11 rows):")
print(age_filled.loc[0:10])

⚠️ Important: As a rule of thumb, filtering is only suitable when the missing rate is low (below roughly 5%); filling with the mean, median, or mode is more common because it keeps every row.
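Median and mode filling follow the same pattern as the mean fill in Method 3. Here is a minimal sketch on made-up data (the values and the embarked Series are illustrative assumptions, not rows from the dataset). The median is less sensitive to outliers than the mean, and the mode is the usual choice for categorical columns:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 80.0 plays the role of an outlier.
age = pd.Series([20.0, 22.0, np.nan, 24.0, 80.0])

mean_filled = age.fillna(age.mean())      # mean of the non-missing values: 36.5
median_filled = age.fillna(age.median())  # median of the non-missing values: 23.0

# Hypothetical categorical column; mode() returns a Series, so take its first entry.
embarked = pd.Series(["S", "C", None, "S"])
embarked_filled = embarked.fillna(embarked.mode()[0])

print(mean_filled[2], median_filled[2], embarked_filled[2])
```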

3. Data Statistical Analysis

3.1 Average Fare by Ticket Class

Method 1: For loop

passenger_classes = [1, 2, 3]
fares_by_class = {}

for this_class in passenger_classes:
    pclass_rows = titanic[titanic["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"].mean()
    fares_by_class[this_class] = pclass_fares

print("Average fare by class (for loop):", fares_by_class)

Method 2: pivot_table() (recommended)

fares_by_class = titanic.pivot_table(
    index="Pclass",
    values="Fare",
    aggfunc="mean"  # string alias; recent pandas versions warn when np.mean is passed
)
print("Average fare by class (pivot_table):")
print(fares_by_class)
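For a single grouping column, groupby() gives the same numbers as pivot_table(). The sketch below checks this on a tiny made-up DataFrame that only borrows the column names from the Titanic data:

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
})

by_pivot = df.pivot_table(index="Pclass", values="Fare", aggfunc="mean")  # DataFrame
by_group = df.groupby("Pclass")["Fare"].mean()                            # Series

print(by_group)
```

pivot_table() returns a DataFrame while this groupby() call returns a Series, so which one to use is mostly a matter of what shape you want downstream.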

3.2 Additional Statistical Scenarios

Scenario 1: Average survival rate by class

survival_by_class = titanic.pivot_table(
    index="Pclass",
    values="Survived",
    aggfunc="mean"
)
print("Average survival rate by class:")
print(survival_by_class)

Scenario 2: Average age by class

age_by_class = titanic.pivot_table(
    index="Pclass",
    values="Age",
    aggfunc="mean"
)
print("Average age by class:")
print(age_by_class)

Scenario 3: Multi-field statistics (total fare and survivors by embarkation point)

port_stats = titanic.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc="sum"
)
print("Total fare and survivors by embarkation point:")
print(port_stats)
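Scenario 3 applies one function to both columns. If each column needs its own aggregation, aggfunc also accepts a dict that maps column names to functions. A minimal sketch on made-up rows (the data is hypothetical; only the column names match the dataset):

```python
import pandas as pd

# Hypothetical rows; only the column names come from the Titanic data.
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q"],
    "Fare": [10.0, 30.0, 50.0, 70.0, 8.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Average fare but total survivors, per embarkation point.
port_stats = df.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "mean", "Survived": "sum"},
)
print(port_stats)
```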

4. Data Cleaning and Selection

4.1 Dropping Missing Values: dropna()

print("Before dropping (first 7 rows):")
print(titanic.head(7))

new_titanic = titanic.dropna(
    axis=0,
    subset=["Age", "Sex"]
)

print("\nAfter dropping (first 7 rows):")
print(new_titanic.head(7))
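To see what subset does, here is a small sketch on made-up rows (hypothetical data). Only the columns listed in subset are checked, so a missing Cabin does not cause a row to be dropped, and the surviving rows keep their original index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; Cabin is mostly missing but is not listed in subset.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", None, "female"],
    "Cabin": [None, "C85", None, None],
})

# Drop rows where Age or Sex is missing; other columns are ignored.
cleaned = df.dropna(axis=0, subset=["Age", "Sex"])

print(len(df), "->", len(cleaned))
print(cleaned.index.tolist())
```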

4.2 Precise Selection: loc[]

row_83_age = titanic.loc[83, "Age"]
row_766_pclass = titanic.loc[766, "Pclass"]

print(f"Age at row 83: {row_83_age}")
print(f"Pclass at row 766: {row_766_pclass}")
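Note that loc[] selects by index label, not by position. The sketch below (made-up data) contrasts it with the position-based iloc[], and shows that label slices include the end label, which is why loc[0:10] earlier returned 11 values:

```python
import pandas as pd

# Hypothetical ages with the default 0..3 index.
df = pd.DataFrame({"Age": [30.0, 20.0, 40.0, 10.0]})
sorted_df = df.sort_values("Age")  # index order becomes 3, 1, 0, 2

by_label = sorted_df.loc[0, "Age"]        # row whose label is 0
by_position = sorted_df["Age"].iloc[0]    # first row in the current order

label_slice = df.loc[0:2, "Age"]   # label slice: end label is included
position_slice = df.iloc[0:2]      # position slice: end position is excluded

print(by_label, by_position, len(label_slice), len(position_slice))
```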

5. Sorting and Index Reset

5.1 Sorting: sort_values()

new_titanic = titanic.sort_values("Age", ascending=False)
print("Sorted by age descending (first 10 rows):")
print(new_titanic.head(10))

5.2 Resetting Index: reset_index()

titanic_reindexed = new_titanic.reset_index(drop=True)
print("\nAfter reset index (first 10 rows):")
print(titanic_reindexed.head(10))
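The following sketch on made-up ages (hypothetical data) shows what happens to the index. Sorting keeps the original labels, NaN values are placed last by default even in descending order, and reset_index() without drop=True moves the old labels into a column named "index":

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
df = pd.DataFrame({"Age": [30.0, np.nan, 80.0, 5.0]})

sorted_df = df.sort_values("Age", ascending=False)
print(sorted_df.index.tolist())       # original labels, NaN row at the end

dropped = sorted_df.reset_index(drop=True)  # fresh 0..n-1 index, old labels discarded
kept = sorted_df.reset_index()              # old labels kept in an "index" column

print(dropped.index.tolist())
print(kept.columns.tolist())
```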

6. Custom Functions with apply()

apply() applies a function along an axis of the DataFrame: axis=0 (the default) passes each column to the function, and axis=1 passes each row.
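A tiny sketch on a made-up two-column DataFrame (the column names a and b are placeholders) makes the difference between the two axes concrete:

```python
import pandas as pd

# Hypothetical toy frame; "a" and "b" are placeholder column names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column, giving one result per column.
col_max = df.apply(lambda column: column.max())

# axis=1: the function receives each row, giving one result per row.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_max.tolist(), row_sum.tolist())
```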

Scenario 1: Get the 100th row (column-wise)

def hundredth_row(column):
    return column.loc[99]

hundredth_row_data = titanic.apply(hundredth_row)
print("100th row data:")
print(hundredth_row_data)

Scenario 2: Count missing values per column

def count_nulls(column):
    column_null = pd.isnull(column)
    return len(column[column_null])

all_null_count = titanic.apply(count_nulls)
print("\nMissing value count per column:")
print(all_null_count)
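The custom function is a good way to learn apply(), but the same per-column counts are available without it by chaining isnull() and sum(). A quick check on made-up rows (hypothetical data) that both approaches agree:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in Age and Cabin.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 8.05],
})

def count_nulls(column):
    return len(column[pd.isnull(column)])

via_apply = df.apply(count_nulls)   # custom function, one call per column
via_builtin = df.isnull().sum()     # boolean mask summed per column

print(via_builtin)
```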

Scenario 3: Convert ticket class to text (row-wise)

def class_to_text(row):
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

classes_text = titanic.apply(class_to_text, axis=1)
print("\nText classes (first 10 rows):")
print(classes_text.head(10))
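For a simple lookup like this one, Series.map() with a dict is a common alternative to a row-wise apply(). The sketch below uses a made-up Series (hypothetical values). One difference from the function above, stated as an assumption of this sketch: any value not in the dict, not only NaN, ends up as "Unknown" instead of "Third Class":

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Pclass"], with one missing value.
pclass = pd.Series([3, 1, 2, np.nan])

class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}

# Values without a dict entry (including NaN) map to NaN, then get the fallback label.
classes_text = pclass.map(class_names).fillna("Unknown")

print(classes_text.tolist())
```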

Scenario 4: Discretize age (row-wise)

def age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "Unknown"
    elif age < 18:
        return "Minor"
    else:
        return "Adult"

titanic["age_labels"] = titanic.apply(age_label, axis=1)
print("\nAge labels (first 10 rows):")
print(titanic[["Age", "age_labels"]].head(10))
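The same labels can be produced without a row-wise apply() by using NumPy's np.select(), which picks the first matching condition for each element. This is a sketch on made-up ages (hypothetical values), not a replacement the original analysis requires:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Age"].
age = pd.Series([22.0, np.nan, 4.0, 18.0])

# Conditions are checked in order; the first True one decides the label.
conditions = [age.isnull(), age < 18]
choices = ["Unknown", "Minor"]
age_labels = pd.Series(
    np.select(conditions, choices, default="Adult"),
    index=age.index,
)

print(age_labels.tolist())
```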

Bonus: Survival rate by age group

age_group_survival = titanic.pivot_table(
    index="age_labels",
    values="Survived",
    aggfunc="mean"
)
print("\nSurvival rate by age group:")
print(age_group_survival)

Key Functions Summary

Function | Description | Key Parameters/Notes
pd.read_csv() | Load CSV data | File path, encoding
isnull() | Detect missing values | Returns boolean mask
mean() | Compute mean | Auto-ignores NaN
pivot_table() | Grouped statistics | index, values, aggfunc
dropna() | Drop missing values | axis, subset
loc[] | Label-based selection | loc[row, column]
sort_values() | Sort by column | ascending
reset_index() | Reset index | drop=True to discard old index
apply() | Apply custom function | axis (0 or 1)

By mastering these steps, you can complete the full data analysis pipeline: loading, cleaning, statistical analysis, and custom operations. Practice with the dataset to deepen understanding of function parameters and appropriate use cases.

Tags: python Pandas Data Analysis Titanic Missing Values

Posted on Wed, 27 May 2026 19:01:42 +0000 by yasir_memon