Titanic Data Analysis

Pandas Operations Used

How I analysed the data

Part A

Dataset Exploration

df.head() / df.describe()
df.dtypes / df.shape

First pass - checking shape, column types, and what the data actually looks like before doing anything else.
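A minimal sketch of that first pass. The rows below are made up and only mimic the column names used in this write-up; they are not the real dataset.

```python
import pandas as pd

# Made-up rows that only mimic the column names used in this write-up
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "gender": ["female", "male", "female", "male"],
    "age": [38.0, 22.0, None, 35.0],   # one missing age on purpose
    "fare": [71.28, 7.25, 13.00, 8.05],
    "survived": [1, 0, 1, 0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # dtype of each column
print(df.head(2))  # first two rows

# describe() summarises numeric columns; its "count" row skips missing values
summary = df.describe()
print(summary.loc["count"])
```

Comparing the "count" row against the row count from df.shape is a quick way to spot columns with missing values before running isnull().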

Part B

Filtering

df[df['age'] > 30]
df.query('pclass == 1')

Boolean indexing and query() to slice specific groups - survivors, females, passengers by class and port.
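The two filtering styles can be compared side by side. This is a sketch on invented rows, assuming the same lowercase column names as above.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [1, 1, 3, 2],
    "gender": ["female", "male", "female", "female"],
    "age": [38.0, 45.0, 26.0, 31.0],
})

# Single condition with boolean indexing
over_30 = df[df["age"] > 30]

# Several conditions: wrap each one in parentheses and join with &
first_class_women = df[(df["pclass"] == 1) & (df["gender"] == "female")]

# The same filter written with query()
same_rows = df.query("pclass == 1 and gender == 'female'")

print(over_30)
print(first_class_women)
```

Both forms return the same rows; query() tends to read better once there are three or more conditions.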

Part C

Unique Values

df['pclass'].unique()
df['gender'].value_counts()

Counted passengers per class, gender, and port. Helped me understand the distribution before any grouping.
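A small sketch of the counting step on invented rows. The normalize=True option is an addition here, not something the notes above mention; it turns counts into shares.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [3, 1, 3, 2, 3],
    "gender": ["male", "male", "female", "male", "female"],
})

classes = sorted(df["pclass"].unique())                    # distinct class labels
counts = df["gender"].value_counts()                      # rows per gender
shares = df["gender"].value_counts(normalize=True)        # fraction per gender

print(classes)
print(counts)
print(shares)
```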

Part D

Sorting

df.sort_values('fare',
  ascending=False).head(10)

Top 10 by fare and youngest passengers. Also sorted by multiple columns - class then age.
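The sorts described above might look like this on a toy frame (invented values). nsmallest() is used here as one possible way to get the youngest passengers; the original notebook may have used sort_values() instead.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [3, 1, 2, 1],
    "age": [22.0, 38.0, 4.0, 54.0],
    "fare": [7.25, 71.28, 29.00, 51.86],
})

# Highest fares first
top_fares = df.sort_values("fare", ascending=False).head(2)

# Youngest passengers
youngest = df.nsmallest(2, "age")

# Multi-column sort: class first, then age within each class
by_class_then_age = df.sort_values(["pclass", "age"])

print(top_fares)
print(youngest)
print(by_class_then_age)
```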

Part E

Missing Values

df.isnull().sum()
df['age'] = df['age'].fillna(df['age'].mean())

177 ages were missing. Filled with mean age for analysis. Also tried dropna() on a separate copy.
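A sketch of the missing-value check and a mean fill done on a copy, so the original frame keeps its nulls. The data is invented; the point to notice is that fillna() returns a new Series and has to be assigned back.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "age": [20.0, None, 40.0, None],
    "fare": [7.25, 8.05, 71.28, 13.00],
})

missing_per_column = df.isnull().sum()    # count of nulls per column
missing_share = df.isnull().mean()        # fraction of nulls per column

# Work on a copy so the original stays available for comparison
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())

print(missing_per_column)
print(df_filled)
```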

Part F

Grouping

df.groupby('pclass')['survived'].mean()

Grouped by class, gender, and port. This is where the survival rate differences became really clear.
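Grouping on more than one key is a natural next step. The sketch below uses invented rows; adding "count" next to "mean" is a suggestion, so that a rate based on a handful of passengers is easy to spot.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3, 3],
    "gender": ["female", "male", "female", "male", "male", "male"],
    "survived": [1, 1, 1, 0, 0, 0],
})

# Survival rate per class (mean of a 0/1 column is the share of 1s)
rate_by_class = df.groupby("pclass")["survived"].mean()

# Survival rate and group size per class and gender
rate_by_class_gender = (
    df.groupby(["pclass", "gender"])["survived"].agg(["mean", "count"])
)

print(rate_by_class)
print(rate_by_class_gender)
```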

Real problems · Real fixes

What broke - and how I fixed it

Error

KeyError on column name

KeyError: 'Gender'
# I was using capital G

Spent longer than I'd like to admit on this. The column was gender, all lowercase. df.columns showed me immediately. Lesson: always check column names first, don't assume casing.

Fix

print(df.columns)
# used df['gender'] after
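One way to avoid this class of KeyError is to normalise the column names once, right after loading. This is a sketch with invented column names, not what the notebook above did.

```python
import pandas as pd

# Invented frame with inconsistent casing and stray spaces in the headers
df = pd.DataFrame({
    "Pclass": [1, 3],
    " Gender ": ["female", "male"],
    "Age": [38.0, 22.0],
})

print(list(df.columns))  # inspect the names before touching them

# Strip whitespace and lowercase every column name
df.columns = df.columns.str.strip().str.lower()

gender_present = "gender" in df.columns
print(list(df.columns), gender_present)
```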

Error

groupby returning NaN for age

df.groupby('pclass')['age'].mean()
# some groups returned NaN

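Ran groupby on age before handling missing values, and some group means came back as NaN. pandas leaves nulls out of mean() by default, so a NaN result means every age in that group was missing; with 177 nulls in the column, small groups are the ones at risk. Had to handle missing values in Part E before running the Part F groupby operations.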

Fix

df['age'] = df['age'].fillna(
  df['age'].mean())
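A toy reproduction of a NaN group mean, on invented rows. It shows that mean() skips nulls inside a group and only returns NaN when a group has no ages at all, and that filling first removes the NaN.

```python
import pandas as pd

# Invented rows: class 1 has one missing age, class 3 has no ages at all
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3],
    "age": [30.0, 50.0, None, None, None],
})

before = df.groupby("pclass")["age"].mean()
print(before)  # class 1 averages the two known ages; class 3 is NaN

# Fill with the overall mean, then group again
df["age"] = df["age"].fillna(df["age"].mean())
after = df.groupby("pclass")["age"].mean()
print(after)
```

Note that after the fill, class 3's mean is just the overall mean echoed back, which is worth keeping in mind when reading the grouped result.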

Error

dropna() wiped too many rows

df.dropna(inplace=True)
# went from 891 to 183 rows

Used dropna() without specifying a column - it dropped any row with a single null anywhere. Lost 708 rows. Had to restart the kernel and work on a separate copy of the DataFrame to preserve the original.

Fix

df_clean = df.dropna(subset=['age'])
# only drops rows missing age
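The difference between the two calls is easy to see on a tiny frame. The rows and the cabin column values are invented; the sketch avoids inplace=True so the original frame is never modified.

```python
import pandas as pd

# Invented rows: cabin is mostly missing, age is missing once
df = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "cabin": [None, "C85", None, "C123"],
    "survived": [0, 1, 1, 1],
})

# Drops every row that has a null in ANY column
any_null_dropped = df.dropna()

# Drops only rows where age is null
age_null_dropped = df.dropna(subset=["age"])

print(len(df), len(any_null_dropped), len(age_null_dropped))
```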

Error

SettingWithCopyWarning

df_filtered['age'] = 30
# pandas threw a warning

When I created a filtered DataFrame and tried to modify it, pandas warned me I might be editing a copy, not the original. Had to use .copy() explicitly when creating subsets.

Fix

df_filtered = df[df['pclass']==1].copy()
# safe to modify now
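Two patterns that avoid the ambiguity, sketched on invented rows. The second one (assigning through .loc) is an alternative not mentioned above, for the case where the goal is to change the original frame rather than a subset.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({"pclass": [1, 3, 1], "age": [38.0, 22.0, 54.0]})

# Pattern 1: take an explicit copy, then edit the copy freely
df_filtered = df[df["pclass"] == 1].copy()
df_filtered["age"] = 30

# Pattern 2: edit rows of a frame in place with a single .loc assignment
df_edited = df.copy()  # copied here only so this demo leaves df untouched
df_edited.loc[df_edited["pclass"] == 1, "age"] = 30

print(df)           # unchanged
print(df_filtered)  # only the copy was modified
print(df_edited)
```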

Going deeper · Age × Class

Does age change survival within each class?

After the basic groupby, I wanted to check whether age interacted with class to affect survival. The result was interesting: in 1st class the pattern wasn't simply "younger is better" - passengers aged 31-50 survived at a higher rate (65%) than those aged 18-30 (60%). In 3rd class, younger passengers had the edge, but survival rates were low across all age groups. I first split ages into under-30 and 30+ using df[df['age'] < 30] and compared groupby results across classes; the table below uses four finer age bands.

Survival rate (% survived) by age group and class

Age Group	1st Class	2nd Class	3rd Class
Under 18	71%	68%	34%
18 – 30	60%	44%	22%
31 – 50	65%	42%	19%
Over 50	58%	28%	14%

Takeaway: Class matters more than age. Even the worst-performing 1st class group (50+ at 58%) outperforms the best 3rd class group (under-18 at 34%). Age shifted survival within each class, most visibly in 2nd class (68% down to 28%), but it couldn't override the socioeconomic divide in access to lifeboats.
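A table like the one above can be built in one pass with pd.cut() and a two-key groupby instead of separate filters. This is a sketch on invented rows; the exact bin edges (18, 31, 51) are an assumption about how the four bands were cut, not taken from the original notebook.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "age": [10.0, 25.0, 40.0, 60.0, 10.0, 25.0],
    "pclass": [1, 1, 1, 1, 3, 3],
    "survived": [1, 1, 0, 1, 1, 0],
})

# Assumed band edges: [0, 18), [18, 31), [31, 51), [51, inf)
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 31, 51, float("inf")],
    labels=["Under 18", "18-30", "31-50", "Over 50"],
    right=False,  # left-closed bins, so age 18 falls in "18-30"
)

# Survival rate per (age band, class), reshaped so classes become columns
table = (
    df.groupby(["age_group", "pclass"], observed=True)["survived"]
      .mean()
      .unstack("pclass")
)
print(table)
```

Cells with no passengers come out as NaN, which is a useful reminder to check group sizes before reading too much into any one percentage.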

Part G · Mini Analysis

Key findings

74%

Female Survival Rate

Women survived at nearly 4× the rate of men (74% vs 19%). That gap is consistent with the "women and children first" protocol being followed.

62%

1st Class Survival

First class had the highest survival rate. Upper deck proximity to lifeboats and crew priority access are the likely explanations, though this dataset can't confirm either on its own.

512

Max Fare Paid (£)

Highest fare in the dataset - over 170× the average 3rd class fare of ~£3. Found using df.sort_values('fare', ascending=False).

Average Age

Mean age across all passengers. 177 missing ages were filled with this value using fillna() before running groupby operations.

Top Port: Southampton

644 passengers boarded at Southampton - nearly 4× Cherbourg (168) and over 8× Queenstown (77).

19%

Male Survival Rate

Only 1 in 5 men survived. Even within males, class had a strong effect - 1st class men fared significantly better than 3rd class.

Applied Thinking

What would this mean in the real world?

The Titanic dataset is historical, but the analytical questions map directly to real operational problems. If I were presenting this to a maritime safety board or an insurance company, here's how I'd frame these findings:

Which passenger profile needs priority evacuation planning?

3rd class male passengers had the lowest survival rate at ~14%. Any evacuation protocol redesign should focus on improving lower-deck access to lifeboats - the data shows a clear structural disadvantage.

→ Emergency Planning

How should an insurer price survival risk by passenger class?

Survival rate drops from 62% to 24% between 1st and 3rd class - a 38-point gap. If an insurer priced life coverage for maritime travel, class would be a strong candidate rating variable alongside gender and age, subject to a formal significance test.

→ Risk & Insurance

Does fare predict survival, or is class the real variable?

Max fare was £512, average was ~£3 for 3rd class. Fare correlates strongly with class, and class correlates with survival. But fare alone isn't the driver - it's what fare bought you (deck location, lifeboat proximity) that mattered.

→ Feature Analysis

Is mean imputation the right choice for missing ages?

For this analysis, yes - it preserved the row count for groupby without skewing distributions too much. But in a production ML pipeline I'd investigate whether missing ages cluster in 3rd class, which would make the missingness informative, not random.

→ Data Quality

Honest reflection

What I learned & what I'd improve

What clicked

groupby().mean() was the moment things made sense. Before that, I was filtering individual groups manually - one query per gender, one per class. Once I understood that groupby collapses all of that into a single line, I went back and rewrote half my Part B code. That kind of refactor is where the real learning happens.

What I'd do differently

Mean imputation for age is a quick fix but it compresses the distribution - 177 ages all become 29, which artificially inflates the count in that bin. If I were taking this further I'd try median imputation (more robust to outliers) or model missing ages from pclass and gender using a simple regression. I'd also check whether missing ages were random or concentrated in 3rd class - if they cluster there, the missingness itself is a signal worth keeping.
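The imputation options can be compared on a toy column. The values are invented, and the group-wise median fill below is a simpler stand-in for the regression idea, not a regression itself.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 60.0, None, 20.0, 22.0, None],
})

# Global fills: every missing age gets the same value
mean_filled = df["age"].fillna(df["age"].mean())
median_filled = df["age"].fillna(df["age"].median())

# Group-wise fill: each missing age gets the median of its own class
class_median = df.groupby("pclass")["age"].transform("median")
group_filled = df["age"].fillna(class_median)

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "by_class": group_filled,
}))
```

The group-wise version keeps some of the between-class age difference that a single global value flattens.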

Limitation I noticed mid-analysis

The dataset has no cabin number for most 3rd class passengers. Cabin location determines lifeboat proximity - so one of the most important variables for survival is also one of the most missing. Any survival model built on this data would have a structural blind spot there. I noted it but couldn't fix it with pandas alone.

What's next

Logistic Regression model: Use sklearn to predict survival probability based on class, gender, and age. The groupby results already hint at which features matter most.
Median vs. mean imputation comparison: Run the full analysis twice - once with mean-filled ages, once with median - and compare groupby outputs to see how much it changes the survival rates.
Visualise with Seaborn or Matplotlib: The charts here are built with Chart.js for the web page. The notebook needs proper Python visualisations - heatmaps with seaborn.heatmap(), distribution plots with histplot() - to complete the analytical picture.
Check if missing ages cluster by class: Run df[df['age'].isnull()].groupby('pclass').size() - if 3rd class has disproportionately more nulls, the missingness is informative and shouldn't be imputed blindly.
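Correlation matrix: df[['age', 'fare', 'pclass', 'survived']].corr() on the numeric columns, to formally quantify which variables actually move together, not just visually. Selecting the numeric columns first matters because recent pandas versions raise an error when corr() meets text columns such as gender.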
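The last two checks in this list are short enough to sketch now. The rows are invented; the share-of-missing line is an addition, since raw counts alone can mislead when classes differ in size.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "age": [38.0, 54.0, 27.0, None, 22.0, None],
    "fare": [71.3, 51.9, 13.0, 7.3, 8.1, 7.9],
    "survived": [1, 0, 1, 0, 1, 0],
})

# Count of missing ages per class (classes with none do not appear)
missing_by_class = df[df["age"].isnull()].groupby("pclass").size()

# Share of missing ages per class, which accounts for class size
missing_share = df["age"].isnull().groupby(df["pclass"]).mean()

# Pairwise correlations on explicitly chosen numeric columns
corr = df[["age", "fare", "pclass", "survived"]].corr()

print(missing_by_class)
print(missing_share)
print(corr.round(2))
```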
