Exploring survival patterns across 891 passengers using pandas - filtering, grouping, cleaning missing values, and pulling out what the numbers actually say.
df = pd.read_csv('titanic.csv')
df.head() / df.describe()
df.dtypes / df.shape
df[df['age'] > 30]
df.query('pclass == 1')
df['pclass'].unique()
df['gender'].value_counts()
df.sort_values('fare', ascending=False).head(10)
df.isnull().sum()
df['age'].fillna(df['age'].mean())
df.groupby('pclass')['survived'].mean()
KeyError: 'Gender'
# I was using capital G
print(df.columns)
# used df['gender'] after
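One way to guard against this kind of case mismatch is to normalise the column names once, right after loading. This is a small sketch on a made-up two-row frame, not the real file, and the capitalised names in it are assumptions for illustration:

```python
import pandas as pd

# Made-up frame with mixed-case headers (hypothetical, not the real CSV).
toy = pd.DataFrame({"Gender": ["female", "male"], " Pclass": [1, 3]})

# Strip stray whitespace and lowercase every column name in one pass.
toy.columns = toy.columns.str.strip().str.lower()

print(toy.columns.tolist())  # ['gender', 'pclass']
print(toy["gender"].value_counts())
```

After this, every lookup can use lowercase names without checking print(df.columns) first.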
df.groupby('pclass')['age'].mean()
# some groups returned NaN
df['age'] = df['age'].fillna(df['age'].mean())
df.dropna(inplace=True)
# went from 891 to 183 rows
df_clean = df.dropna(subset=['age'])
# only drops rows missing age
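The difference between the two dropna calls is easier to see on a tiny frame. The rows below are made up for illustration, and the cabin column name is an assumption about how the file labels it:

```python
import numpy as np
import pandas as pd

# Made-up rows; not the real Titanic data.
toy = pd.DataFrame({
    "age":    [22.0, np.nan, 35.0, 41.0],
    "cabin":  [np.nan, np.nan, "C85", np.nan],
    "pclass": [3, 3, 1, 2],
})

# Drops a row if ANY column is missing, so sparse columns wipe out most rows.
all_complete = toy.dropna()

# Drops a row only when age itself is missing.
age_known = toy.dropna(subset=["age"])

print(len(toy), len(all_complete), len(age_known))  # 4 1 3
```

Neither call uses inplace=True, so the original frame stays available for comparison.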
df_filtered['age'] = 30
# pandas threw a warning
df_filtered = df[df['pclass']==1].copy()
# safe to modify now
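A short sketch of why the .copy() matters, again on made-up rows. Whether pandas actually prints a warning without the copy depends on the pandas version, so the example only shows the safe pattern:

```python
import pandas as pd

# Made-up rows; not the real Titanic data.
toy = pd.DataFrame({"pclass": [1, 3, 1], "age": [38.0, 22.0, 54.0]})

# .copy() gives an independent frame, so the assignment below
# cannot touch (or be confused with) the original.
first_class = toy[toy["pclass"] == 1].copy()
first_class["age"] = 30

print(toy["age"].tolist())          # [38.0, 22.0, 54.0]
print(first_class["age"].tolist())  # [30, 30]
```

If the goal is to change the original frame instead, an assignment through .loc with the same boolean mask, such as toy.loc[toy["pclass"] == 1, "age"] = 30, does that directly.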
After the basic groupby, I wanted to check whether age interacted with class to affect survival. I first split ages into under-30 and 30+ using df[df['age'] < 30] and compared groupby results across classes. The result was interesting: in 1st class, older passengers actually survived at a higher rate than younger ones. In 3rd class it was the opposite: younger passengers had a slight edge, but survival rates were low across all age groups. The table below breaks the same comparison into finer age bands.
| Age Group | 1st Class | 2nd Class | 3rd Class |
|---|---|---|---|
| Under 18 | 71% survived | 68% survived | 34% survived |
| 18 – 30 | 60% survived | 44% survived | 22% survived |
| 31 – 50 | 65% survived | 42% survived | 19% survived |
| Over 50 | 58% survived | 28% survived | 14% survived |
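A table of this shape can be built in one pass with pd.cut and a two-key groupby. This is a sketch on made-up rows, so the numbers it prints are not the ones above. The exact bin edges are my assumption about how the boundaries were handled (here 18 to under 31 counts as "18-30"):

```python
import numpy as np
import pandas as pd

# Made-up rows; not the real Titanic data.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 3, 3, 3, 3, 2],
    "age":      [10, 25, 60, 10, 25, 25, 40, 35],
    "survived": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Left-closed bins: [0, 18), [18, 31), [31, 51), [51, inf).
labels = ["Under 18", "18-30", "31-50", "Over 50"]
toy["age_group"] = pd.cut(
    toy["age"], bins=[0, 18, 31, 51, np.inf], right=False, labels=labels
)

# Mean of a 0/1 column is the survival rate; unstack turns class into columns.
rates = (
    toy.groupby(["age_group", "pclass"], observed=True)["survived"]
    .mean()
    .unstack("pclass")
)

print((rates * 100).round(0))
```

Combinations with no passengers show up as NaN rather than 0%, which keeps "nobody in this cell" distinct from "nobody survived".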
The Titanic dataset is historical, but the analytical questions map directly to real operational problems. If I were presenting this to a maritime safety board or an insurance company, here's how I'd frame these findings:
groupby().mean() was the moment things made sense. Before that, I was filtering individual groups manually - one query per gender, one per class. Once I understood that groupby collapses all of that into a single line, I went back and rewrote half my Part B code. That kind of refactor is where the real learning happens.
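The before and after of that refactor looks roughly like this. The rows are made up for illustration:

```python
import pandas as pd

# Made-up rows; not the real Titanic data.
toy = pd.DataFrame({
    "gender":   ["female", "male", "female", "male"],
    "survived": [1, 0, 1, 1],
})

# Before: one filter per group, repeated by hand.
female_rate = toy[toy["gender"] == "female"]["survived"].mean()
male_rate = toy[toy["gender"] == "male"]["survived"].mean()

# After: one line covers every group at once.
by_gender = toy.groupby("gender")["survived"].mean()

print(female_rate, male_rate)  # 1.0 0.5
print(by_gender)
```

The groupby version also keeps working unchanged if a new category appears in the column, which the hand-written filters would silently miss.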
Mean imputation for age is a quick fix but it compresses the distribution - 177 missing ages all become the same value (the mean, roughly 29.7), which artificially inflates the count in that bin. If I were taking this further I'd try median imputation (more robust to outliers) or model missing ages from pclass and gender using a simple regression. I'd also check whether missing ages were random or concentrated in 3rd class - if they cluster there, the missingness itself is a signal worth keeping.
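A sketch of what those next steps could look like, on made-up rows. It uses a per-class median rather than the regression idea, as a simpler middle ground, and the age_missing and age_filled column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up rows; not the real Titanic data.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3, 3],
    "age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0, np.nan],
})

# Share of missing ages per class: is the gap concentrated anywhere?
missing_share = toy["age"].isna().groupby(toy["pclass"]).mean()

# Keep the missingness as its own column before filling anything.
toy["age_missing"] = toy["age"].isna()

# Fill each gap with the median age of that passenger's class.
class_median = toy.groupby("pclass")["age"].transform("median")
toy["age_filled"] = toy["age"].fillna(class_median)

print(missing_share)
print(toy["age_filled"].tolist())  # [40.0, 50.0, 45.0, 20.0, 25.0, 30.0, 25.0]
```

Filling per class spreads the imputed values across several medians instead of stacking them all in one bin, though each class still gets its own smaller spike.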
The dataset has no cabin number for most 3rd class passengers. Cabin location determines lifeboat proximity - so one of the most important variables for survival is also one of the most missing. Any survival model built on this data would have a structural blind spot there. I noted it but couldn't fix it with pandas alone.