Data Analysis Project · Python · Pandas

Titanic Data Analysis

Exploring survival patterns across 891 passengers using pandas - filtering, grouping, cleaning missing values, and pulling out what the numbers actually say.

891
Passengers
342
Survived
549
Perished
177
Missing Ages
Built with: Python 3, pandas, numpy, Jupyter Notebook, Google Colab, Chart.js
Data Source: Kaggle - Titanic Dataset (titanic.csv). Downloaded from Kaggle's public dataset repository: 891 rows, 12 columns. No API, just a straight CSV download loaded with pd.read_csv('titanic.csv').
38.4%
Survived
342 of 891 passengers made it
61.6%
Perished
549 of 891 passengers lost
Pandas Operations Used

How I analysed the data

Part A
Dataset Exploration
df.head() / df.describe()
df.dtypes / df.shape
First pass - checking shape, column types, and what the data actually looks like before doing anything else.
Part B
Filtering
df[df['age'] > 30]
df.query('pclass == 1')
Boolean indexing and query() to slice specific groups - survivors, females, passengers by class and port.
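The two filtering styles can be compared side by side. This is a minimal sketch on a tiny made-up DataFrame (the rows are hypothetical, only the column names follow this page), not the notebook's actual cell:

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the real dataset.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "gender": ["male", "female", "female", "female", "male"],
    "pclass": [3, 1, 3, 1, 1],
    "survived": [0, 1, 1, 1, 0],
})

# Boolean indexing: join conditions with & and wrap each one in parentheses.
older_first_class = df[(df["age"] > 30) & (df["pclass"] == 1)]

# The same filter expressed with query(), which takes the condition as a string.
older_first_class_q = df.query("age > 30 and pclass == 1")

print(len(older_first_class))
print(older_first_class.equals(older_first_class_q))
```

Both forms return the same rows; query() tends to read better once several conditions are chained.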
Part C
Unique Values
df['pclass'].unique()
df['gender'].value_counts()
Counted passengers per class, gender, and port. Helped me understand the distribution before any grouping.
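As a rough illustration of the counting step, the sketch below uses a handful of invented rows. Passing dropna=False is an addition of mine, useful because the port column has a couple of gaps:

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "pclass": [3, 1, 3, 2, 3],
    "embarked": ["S", "C", "S", None, "Q"],
})

# Distinct class labels, sorted for readability.
classes = sorted(df["pclass"].unique())

# Passengers per class, most common first.
class_counts = df["pclass"].value_counts()

# dropna=False keeps missing ports visible instead of silently dropping them.
port_counts = df["embarked"].value_counts(dropna=False)

print(classes)
print(class_counts)
print(port_counts)
```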
Part D
Sorting
df.sort_values('fare', ascending=False).head(10)
Top 10 by fare and youngest passengers. Also sorted by multiple columns - class then age.
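The sorting variants mentioned here might look like the following. The data is a small hypothetical sample, and nsmallest() is one possible way to get the youngest passengers rather than necessarily the call used in the notebook:

```python
import pandas as pd

# Hypothetical rows; one age is missing on purpose.
df = pd.DataFrame({
    "pclass": [3, 1, 2, 1, 3],
    "age": [22.0, 38.0, None, 19.0, 4.0],
    "fare": [7.25, 71.28, 13.0, 512.33, 16.7],
})

# Highest fares first.
top_fares = df.sort_values("fare", ascending=False).head(3)

# Multi-column sort: class first, then age within each class.
# Missing ages are placed last within their class by default.
by_class_then_age = df.sort_values(["pclass", "age"], ascending=[True, True])

# Youngest passengers; rows with a missing age are not selected.
youngest = df.nsmallest(2, "age")

print(top_fares)
print(by_class_then_age)
print(youngest)
```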
Part E
Missing Values
df.isnull().sum()
df['age'].fillna(df['age'].mean())
177 ages were missing. Filled with mean age for analysis. Also tried dropna() on a separate copy.
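A minimal sketch of the missing-value step, assuming a toy DataFrame with invented values. The key point is that both the fill and the drop run on copies, so the original stays intact:

```python
import pandas as pd

# Hypothetical rows with gaps in two columns.
df = pd.DataFrame({
    "age": [22.0, None, 26.0, None, 54.0],
    "embarked": ["S", "C", None, "S", "Q"],
})

# Count nulls per column before deciding how to handle them.
missing_per_column = df.isnull().sum()

# Fill on a copy so the raw data is still available for comparison.
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())

# Alternative: drop only the rows where age is missing.
df_dropped = df.dropna(subset=["age"])

print(missing_per_column)
print(df_filled["age"].tolist())
print(len(df_dropped))
```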
Part F
Grouping
df.groupby('pclass')['survived'].mean()
Grouped by class, gender, and port. This is where the survival rate differences became really clear.
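To illustrate the grouping step, here is a short sketch on made-up rows. Because survived is coded 0/1, its mean per group is the survival rate. Adding size alongside the mean is my suggestion, so small groups are easy to spot:

```python
import pandas as pd

# Hypothetical rows; survived is 0 or 1.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "gender": ["female", "male", "female", "male",
               "female", "male", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate per class.
rate_by_class = df.groupby("pclass")["survived"].mean()

# Two grouping keys at once, with the group size next to each rate.
rate_by_class_gender = (
    df.groupby(["pclass", "gender"])["survived"].agg(["mean", "size"])
)

print(rate_by_class)
print(rate_by_class_gender)
```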
Parts B – F · Visualized

What the numbers show

Gender Analysis
Survival by Gender
df.groupby('gender')['survived'].mean()
Class Analysis
Survival Rate by Class
df.groupby('pclass')['survived'].mean()
Embarkation
Passengers by Port
df['embarked'].value_counts()
Demographics
Age Distribution
df['age'].fillna(df['age'].mean())
Part F · Groupby Result

Survival Rate by Class

First Class: 62% survived
Second Class: 47% survived
Third Class: 24% survived
Real problems · Real fixes

What broke - and how I fixed it

Error
KeyError on column name
KeyError: 'Gender'
# I was using capital G
Spent longer than I'd like to admit on this. The column was gender, all lowercase. df.columns showed me immediately. Lesson: always check column names first, don't assume casing.
Fix print(df.columns)
# used df['gender'] after
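One way to avoid casing surprises altogether is to normalise the header once, right after loading. The sketch below uses a hypothetical DataFrame with deliberately messy column names. The real CSV header may differ:

```python
import pandas as pd

# Hypothetical messy header: mixed case and stray spaces.
df = pd.DataFrame({" Gender": ["male", "female"], "Survived ": [0, 1]})

# Inspect the names exactly as pandas sees them.
print(list(df.columns))

# Strip whitespace and lowercase every column name in one pass.
df.columns = df.columns.str.strip().str.lower()

print(list(df.columns))
print(df["gender"].tolist())
```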
Error
groupby returning NaN for age
df.groupby('pclass')['age'].mean()
# some groups returned NaN
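Ran groupby on age before handling missing values. With 177 nulls in the column, some group means came back as NaN. Note that mean() skips NaN by default, so a group only returns NaN when every age in it is missing, which is more likely with fine-grained groupings. I now handle missing values in Part E before running the Part F groupby operations.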
Fix df['age'] = df['age'].fillna(
df['age'].mean())
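The behaviour is easy to reproduce on a toy example. In this sketch (invented rows, with class 3 deliberately given no known ages), a partly missing group still gets a mean, while the fully missing group comes back as NaN until the column is filled:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: class 1 has one missing age, class 3 has only missing ages.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [40.0, np.nan, 30.0, 20.0, np.nan, np.nan],
})

# mean() skips NaN, so only the all-missing group yields NaN.
mean_age = df.groupby("pclass")["age"].mean()

# Fill with the overall mean on a copy, then group again.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
mean_age_filled = filled.groupby("pclass")["age"].mean()

print(mean_age)
print(mean_age_filled)
```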
Error
dropna() wiped too many rows
df.dropna(inplace=True)
# went from 891 to 183 rows
Used dropna() without specifying a column - it dropped any row with a single null anywhere. Lost 708 rows. Had to restart the kernel and work on a separate copy of the DataFrame to preserve the original.
Fix df_clean = df.dropna(subset=['age'])
# only drops rows missing age
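The difference between the two calls shows up even on four rows. This is a sketch with invented data. The cabin column stands in for any sparsely filled column:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: cabin is mostly missing, age has one gap.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, "C123"],
    "survived": [0, 1, 1, 1],
})

# Bare dropna() removes every row that has a null in any column.
all_complete = df.dropna()

# subset= limits the check to the columns that matter for the analysis.
age_known = df.dropna(subset=["age"])

# Neither call modifies df because inplace=True is not used.
print(len(df), len(all_complete), len(age_known))
```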
Error
SettingWithCopyWarning
df_filtered['age'] = 30
# pandas threw a warning
When I created a filtered DataFrame and tried to modify it, pandas warned me I might be editing a copy, not the original. Had to use .copy() explicitly when creating subsets.
Fix df_filtered = df[df['pclass']==1].copy()
# safe to modify now
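A small sketch of the two safe patterns, on hypothetical rows. Use .copy() when the subset should be independent, and .loc on the original when the original itself is meant to change:

```python
import pandas as pd

# Hypothetical rows for illustration.
df = pd.DataFrame({"pclass": [1, 3, 1], "age": [38.0, 22.0, 54.0]})

# Independent subset: edits here never touch df.
first_class = df[df["pclass"] == 1].copy()
first_class["age"] = 30

# To change the original instead, assign through .loc in a single step.
df_updated = df.copy()
df_updated.loc[df_updated["pclass"] == 1, "age"] = 30

print(df["age"].tolist())
print(first_class["age"].tolist())
print(df_updated["age"].tolist())
```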
Going deeper · Age × Class

Does age change survival within each class?

After the basic groupby, I wanted to check if age interacted with class to affect survival. The result was interesting - in 1st class, older passengers actually survived at a higher rate than younger ones. In 3rd class it was the opposite: younger passengers had a slight edge, but survival rates were low across all age groups. I split ages into under-30 and 30+ using df[df['age'] < 30] and compared groupby results across classes.

Age Group   1st Class      2nd Class      3rd Class
Under 18    71% survived   68% survived   34% survived
18 – 30     60% survived   44% survived   22% survived
31 – 50     65% survived   42% survived   19% survived
Over 50     58% survived   28% survived   14% survived
Takeaway: Class matters more than age. Even the worst-performing 1st class group (50+ at 58%) outperforms the best 3rd class group (under-18 at 34%). Age had a small effect within each class, but it couldn't override the socioeconomic divide in access to lifeboats.
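A table of this shape can be produced with pd.cut() and pivot_table(). The sketch below is one possible approach on invented rows. The bin edges are my assumption about how the four age bands were defined, not taken from the notebook:

```python
import pandas as pd

# Hypothetical rows; the real analysis would use the full dataset.
df = pd.DataFrame({
    "age": [4.0, 25.0, 40.0, 60.0, 10.0, 28.0, 45.0, 70.0],
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Assumed band edges; right=False makes each band include its lower edge only,
# so the bands are [0, 18), [18, 31), [31, 51) and [51, 120).
bins = [0, 18, 31, 51, 120]
labels = ["Under 18", "18-30", "31-50", "Over 50"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Rows are age bands, columns are classes, cells are survival rates.
table = df.pivot_table(
    index="age_group",
    columns="pclass",
    values="survived",
    aggfunc="mean",
    observed=False,
)

print(table)
```

Rows with a missing age fall outside every band and are left out of the table, so it is worth checking how many passengers each cell actually contains before reading much into a single percentage.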
Part G · Mini Analysis

Key findings

74%
Female Survival Rate
Women survived at nearly 4× the rate of men. The "women and children first" protocol was clearly followed - and the data backs it up.
62%
1st Class Survival
First class had the highest survival rate. Upper deck proximity to lifeboats and crew priority access made a measurable difference.
512
Max Fare Paid (£)
Highest fare in the dataset - roughly 37× the average 3rd class fare of about £13.70. Found using df.sort_values('fare', ascending=False).
29
Average Age
Mean age across all passengers. 177 missing ages were filled with this value using fillna() before running groupby operations.
S
Top Port: Southampton
644 passengers boarded at Southampton - nearly 4× Cherbourg (168) and over 8× Queenstown (77).
19%
Male Survival Rate
Only 1 in 5 men survived. Even within males, class had a strong effect - 1st class men fared significantly better than 3rd class.
Applied Thinking

What would this mean in the real world?

The Titanic dataset is historical, but the analytical questions map directly to real operational problems. If I were presenting this to a maritime safety board or an insurance company, here's how I'd frame these findings:

Which passenger profile needs priority evacuation planning?
3rd class male passengers had the lowest survival rate at ~14%. Any evacuation protocol redesign should focus on improving lower-deck access to lifeboats - the data shows a clear structural disadvantage.
→ Emergency Planning
How should an insurer price survival risk by passenger class?
Survival rate drops from 62% to 24% between 1st and 3rd class - a 38-point gap. If an insurer priced life coverage for maritime travel, class would be a strong candidate rating variable alongside gender and age.
→ Risk & Insurance
Does fare predict survival, or is class the real variable?
Max fare was £512, while the average 3rd class fare was about £13.70. Fare correlates strongly with class, and class correlates with survival. But fare alone isn't the driver - it's what fare bought you (deck location, lifeboat proximity) that mattered.
→ Feature Analysis
Is mean imputation the right choice for missing ages?
For this analysis, yes - it preserved the row count for groupby without skewing distributions too much. But in a production ML pipeline I'd investigate whether missing ages cluster in 3rd class, which would make the missingness informative, not random.
→ Data Quality
Source Code
View the full notebook on GitHub
The complete Jupyter notebook with all pandas operations, cell outputs, and analysis steps - Parts A through G - is available on GitHub.
Honest reflection

What I learned & what I'd improve

What clicked

groupby().mean() was the moment things made sense. Before that, I was filtering individual groups manually - one query per gender, one per class. Once I understood that groupby collapses all of that into a single line, I went back and rewrote half my Part B code. That kind of refactor is where the real learning happens.

What I'd do differently

Mean imputation for age is a quick fix but it compresses the distribution - 177 ages all become 29, which artificially inflates the count in that bin. If I were taking this further I'd try median imputation (more robust to outliers) or model missing ages from pclass and gender using a simple regression. I'd also check whether missing ages were random or concentrated in 3rd class - if they cluster there, the missingness itself is a signal worth keeping.
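The imputation options can be compared on a toy column. This sketch uses invented ages with one outlier. The group-wise median is a simpler stand-in for the regression idea, not the regression itself:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing age per class, plus an outlier (80).
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [80.0, 40.0, np.nan, 20.0, 30.0, np.nan],
})

# Global mean: pulled upwards by the outlier.
mean_filled = df["age"].fillna(df["age"].mean())

# Global median: less sensitive to the outlier.
median_filled = df["age"].fillna(df["age"].median())

# Group-wise median: each gap gets the median age of its own class.
class_median = df.groupby("pclass")["age"].transform("median")
group_filled = df["age"].fillna(class_median)

print(mean_filled.tolist())
print(median_filled.tolist())
print(group_filled.tolist())
```

Filling per class keeps some of the age difference between classes, which a single global value flattens.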

Limitation I noticed mid-analysis

The dataset has no cabin number for most 3rd class passengers. Cabin location determines lifeboat proximity - so one of the most important variables for survival is also one of the most missing. Any survival model built on this data would have a structural blind spot there. I noted it but couldn't fix it with pandas alone.

What's next
  • Logistic Regression model: Use sklearn to predict survival probability based on class, gender, and age. The groupby results already hint at which features matter most.
  • Median vs. mean imputation comparison: Run the full analysis twice - once with mean-filled ages, once with median - and compare groupby outputs to see how much it changes the survival rates.
  • Visualise with Seaborn or Matplotlib: The charts here are built with Chart.js for the web page. The notebook needs proper Python visualisations, such as heatmaps with seaborn.heatmap() and distribution plots with histplot(), to complete the analytical picture.
  • Check if missing ages cluster by class: Run df[df['age'].isnull()].groupby('pclass').size() - if 3rd class has disproportionately more nulls, the missingness is informative and shouldn't be imputed blindly.
  • Correlation matrix: Run df[['age', 'fare', 'pclass', 'survived']].corr() on the numeric columns to formally quantify which variables actually move together, not just visually.
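The last two checks in this list could be sketched as follows, on a handful of invented rows. Computing the missing share per class (rather than only the raw count) is an addition of mine, since class sizes differ:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "age": [38.0, 54.0, np.nan, np.nan, 22.0, np.nan],
    "fare": [71.28, 51.86, 13.0, 7.25, 8.05, 7.75],
    "survived": [1, 0, 1, 0, 1, 0],
})

# Raw count of missing ages per class (classes with none are absent).
missing_by_class = df[df["age"].isnull()].groupby("pclass").size()

# Share of missing ages per class, which accounts for class size.
missing_share = df["age"].isnull().groupby(df["pclass"]).mean()

# Pearson correlations, restricted to explicitly chosen numeric columns.
corr = df[["age", "fare", "pclass", "survived"]].corr()

print(missing_by_class)
print(missing_share)
print(corr.round(2))
```

Selecting the columns explicitly avoids errors from text columns such as name or gender, which recent pandas versions no longer drop silently in corr().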
Contact

Let's connect

Open to new opportunities, collaborations, or just a good conversation about data engineering.