Notes on Titanic on Kaggle

The Titanic competition resembles a real-world problem. Each feature has its own meaning and background, which helps us build a better model. I also learned about the history of the Titanic and the tradition of name prefixes (titles) in English.

If I have time, I want to visit the Merseyside Maritime Museum again. I'm sure I can gain some new and unique experiences there.

Current Score: 0.81340 (Top 7%)

Feature Engineering

[Figure: correlation heatmap]
The correlation heatmap shows the relationships between the features.
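
As a rough sketch of how such a heatmap can be produced: the correlation matrix itself comes from pandas, and seaborn only draws it. The tiny DataFrame below is made-up illustration data, not the real training set, and the plotting call is left as a comment on the assumption that seaborn is installed.

```python
import pandas as pd

# Made-up rows for illustration only; the real x_train has 891 rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 2],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Pearson correlation is only defined for numeric columns, so drop the rest.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))

# To draw it (assuming seaborn and matplotlib are installed):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Text columns such as Sex have to be encoded as numbers first if they should appear in the heatmap.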

The meaning of features

Sex

[Figure: survival by sex]
There is a famous line in the movie Titanic: “Women and children first”. This figure shows how that rule played out: far more women than men survived.
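
The numbers behind a figure like this can be tabulated with a cross-tabulation. This is a minimal sketch on made-up rows, so the printed values say nothing about the real passengers.

```python
import pandas as pd

# Made-up rows for illustration only, not the real passenger list.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Rows: sex; columns: 0 = died, 1 = survived.
counts = pd.crosstab(toy["Sex"], toy["Survived"])
print(counts)

# normalize="index" turns each row into shares that sum to 1,
# so column 1 is the survival rate per sex.
shares = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(shares)
```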

Ages

Running print(x_train["Age"].mean()) shows that the average age of the passengers in the training set is 29.6991.
The distribution of age is shown below:
[Figure: age distribution]
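
A small sketch of the same two steps, the mean and the distribution, on made-up ages. It assumes the column may still contain missing values (NaN), which pandas skips by default in both calls.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; NaN marks a missing age.
age = pd.Series([22.0, 38.0, np.nan, 4.0, 35.0, 61.0, np.nan, 28.0])

# mean() ignores NaN by default, so missing ages do not distort the average.
print(age.mean())

# Bucket the ages into 20-year bins; value_counts() drops NaN by default.
bins = [0, 20, 40, 60, 80]
by_bin = pd.cut(age, bins=bins).value_counts().sort_index()
print(by_bin)
```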

Relation between Name, Ticket Number and Fare

The “Name” attribute looks useless at first, but it contains an essential feature: the prefix (title). The prefix indicates gender, age and social status, all of which clearly influence the survival rate. I'll show how to extract the prefix as a feature in the next part.

“Name” contains the first name and the last name. Notice that members of a family with the same last name share the same ticket number and fare. It is hard to find any other useful information in “Ticket Number”, so we can drop that attribute. “Name” can be dropped too after extracting the prefix.
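
One way to check the observation about families, sketched on made-up passengers: take the last name (the text before the comma) and count how many distinct tickets and fares each last name has. The column name LastName is introduced here only for illustration.

```python
import pandas as pd

# Made-up passengers for illustration; names follow the "Last, Title. First" layout.
toy = pd.DataFrame({
    "Name":   ["Smith, Mr. John", "Smith, Mrs. Anna", "Smith, Master. Tom",
               "Jones, Miss. Mary", "Brown, Mr. Peter"],
    "Ticket": ["A100", "A100", "A100", "B200", "C300"],
    "Fare":   [30.0, 30.0, 30.0, 8.05, 13.0],
})

# The last name is everything before the first comma.
toy["LastName"] = toy["Name"].str.split(",").str[0].str.strip()

# For each last name: number of passengers, distinct tickets and distinct fares.
summary = toy.groupby("LastName").agg(
    passengers=("Name", "size"),
    tickets=("Ticket", "nunique"),
    fares=("Fare", "nunique"),
)
print(summary)
```

On real data, unrelated passengers can share a common last name, so a value above 1 in the tickets column does not by itself contradict the observation.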

Prefix

Extract the Prefix

The prefix contains useful hidden information. We can extract it with a regular expression:

import re

names = x_train["Name"]
prefix = []
for name in names:
    # The title sits between the comma and the first period; the leading space is kept (" Mr").
    prefix.append(re.search(r",(.*?)\.", name).group(1))

This gives us the titles. Here are the counts of all titles:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Major             2
Mlle              2
the Countess      1
Sir               1
Lady              1
Ms                1
Mme               1
Capt              1
Don               1
Jonkheer          1
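
The extraction and the count can also be sketched with the standard library alone. Unlike the loop above, this version strips the spaces around the title and returns None when a name has no "comma ... period" part. The helper name extract_title is hypothetical, and the last entry in the list is an invented edge case.

```python
import re
from collections import Counter

# Title = text between the comma and the first period, without surrounding spaces.
TITLE_RE = re.compile(r",\s*(.*?)\s*\.")

def extract_title(name):
    """Return the title in a 'Last, Title. First' name, or None if there is none."""
    match = TITLE_RE.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",  # invented edge case: no comma, so no match
]
titles = [extract_title(n) for n in names]

# Count the titles, skipping names where nothing was found.
print(Counter(t for t in titles if t is not None).most_common())
```

Because the titles come back without the leading space, they would need "Mr" rather than " Mr" in any later lookup.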

The prefix feature is probably the most interesting part of the Titanic dataset. It gave me a chance to learn about title conventions in different languages. The stories of the Titanic survivors and victims with rare titles are fascinating as well.

Based on the counts, the prefixes can be separated into 4 groups:

  • Men: Mr
  • Women: Miss, Mrs, Ms, Mme, Mlle
  • Children: Master
  • Rare: Sir, Lady, the Countess, Capt, Don, Dona, Jonkheer, Major, Col, Rev, Dr

Then we use a numerical code to represent these 4 groups. The full code for handling the prefix is shown below:

import re
import pandas as pd

def get_prefix(dataset):
    # Extract the title between the comma and the first period.
    # The captured text keeps its leading space, e.g. " Mr".
    names = dataset["Name"]
    prefix = []
    for name in names:
        prefix.append(re.search(r",(.*?)\.", name).group(1))
    prefix = pd.Series(prefix, index=dataset.index)
    # Map each title group to a numerical code.
    prefix = prefix.replace(" Mr", 0)
    prefix = prefix.replace([" Miss", " Mrs", " Ms", " Mme", " Mlle"], 1)
    prefix = prefix.replace(" Master", 2)
    # Rare titles; every entry needs the leading space to match.
    prefix = prefix.replace([" Rev", " Dr", " Col", " Major", " the Countess", " Capt",
                             " Dona", " Don", " Sir", " Lady", " Jonkheer"], 4)
    dataset["Prefix"] = prefix
    return dataset
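
An alternative sketch of the same grouping, using a lookup table instead of chained replace calls. Two things differ from the function above and are assumptions of this sketch: the rare group is coded 3 instead of 4, and any title missing from the table (including one never seen before) falls into the rare group. The names encode_titles and TITLE_TO_GROUP are hypothetical.

```python
import pandas as pd

# Group codes; any title not listed falls back to RARE (an assumption of this sketch).
MAN, WOMAN, CHILD, RARE = 0, 1, 2, 3
TITLE_TO_GROUP = {
    "Mr": MAN,
    "Miss": WOMAN, "Mrs": WOMAN, "Ms": WOMAN, "Mme": WOMAN, "Mlle": WOMAN,
    "Master": CHILD,
}

def encode_titles(names):
    """Map a Series of 'Last, Title. First' names to integer group codes."""
    # Vectorized extraction; \s* drops the spaces around the title.
    titles = names.str.extract(r",\s*(.*?)\s*\.", expand=False)
    # Titles missing from the table become NaN, which is then filled with RARE.
    return titles.map(TITLE_TO_GROUP).fillna(RARE).astype(int)

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])
print(encode_titles(names).tolist())
```

The fallback means a title that appears only in the test set still receives a numeric code instead of staying a string.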

PClass

Passenger class (Pclass) indicates the ticket class: 1 is upper, 2 is middle, 3 is lower.
[Figure: survival by passenger class]
The higher the class, the higher the survival rate. The average fare of each class is shown below:
[Figure: average fare by passenger class]
First class is more expensive than the other classes. It is worth buying the best ticket when taking a ship, since it comes with a higher survival rate. Just in case :P
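
Both quantities discussed here, survival rate and average fare per class, can be computed in one grouped aggregation. The rows below are made up, so only the shape of the result carries over to the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 14.0, 8.0, 7.0, 9.0],
})

# One row per class: share of survivors and mean ticket price.
per_class = toy.groupby("Pclass").agg(
    survival_rate=("Survived", "mean"),
    mean_fare=("Fare", "mean"),
)
print(per_class)
```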

Embarked

Embarked is the port of embarkation, one of three: Southampton (S), Cherbourg (C) and Queenstown (Q).
sns.countplot(x="Embarked", hue="Survived", data=x_train)

[Figure: count plot of Embarked, split by Survived]

Family

sns.countplot(x="Family", hue="Survived", data=x_train)

The figure for the Family feature is shown below.
[Figure: count plot of Family, split by Survived]
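
The post does not show how the Family column was built. A common construction, assumed here, is to add SibSp (siblings and spouses aboard) and Parch (parents and children aboard). The IsAlone flag is a hypothetical extra for illustration, not a feature from the post.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 1],
})

# Assumption: Family = relatives aboard, not counting the passenger.
toy["Family"] = toy["SibSp"] + toy["Parch"]

# Hypothetical extra flag: 1 if the passenger travels without relatives.
toy["IsAlone"] = (toy["Family"] == 0).astype(int)
print(toy)
```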

Filling missing Values

Run print(x_train.info()) to get the training data info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float32
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Family         891 non-null int64
Prefix         891 non-null object
Firstanme      891 non-null object
dtypes: float32(1), float64(1), int64(6), object(7)
memory usage: 101.0+ KB
None

In the training data, Cabin has many missing values (only 204 of 891 entries are non-null). Embarked has 2 missing values as well.
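
The post does not say how these gaps were filled. One possible treatment, sketched on made-up rows: fill the few missing Embarked values with the most frequent port, and reduce the mostly empty Cabin column to a flag. The HasCabin column is a hypothetical name, and both choices are assumptions rather than the author's method.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; NaN marks a missing value.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
    "Cabin":    ["C85", np.nan, np.nan, "B28", np.nan, np.nan],
})

# Embarked: only a few gaps, so fill them with the most frequent port.
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Cabin: mostly missing, so keep only whether a cabin was recorded at all.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
toy = toy.drop(columns="Cabin")
print(toy)
```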

Run print(x_test.info()) to get the test data info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
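
The output above shows gaps in the test data too: Age has 332 of 418 values, Fare 417 and Cabin 91. A minimal sketch of one way to fill the numeric columns, assuming medians are acceptable and computing them on the training set only, so that nothing learned from the test set feeds back into preprocessing. The two small frames are made-up stand-ins for x_train and x_test.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for x_train and x_test.
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                      "Fare": [7.25, 71.28, 7.92, 53.10]})
test = pd.DataFrame({"Age": [34.5, np.nan, 62.0],
                     "Fare": [7.83, 7.00, np.nan]})

# Number of missing values per column.
print(test.isna().sum())

# Medians come from the training data only.
fill_values = train[["Age", "Fare"]].median()

# fillna with a Series fills each column with the value under its own label.
test = test.fillna(fill_values)
print(test)
```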

Reference

  1. Encyclopedia Titanica.
    This wiki contains details on all the survivors.
  2. A journey through Titanic