Titanic is similar to the problem in practical. Each feature has it own meaning and background, which can help us to build a better model. I also acquired knowledge about the history of Titanic and traditional prefix culture in English.
If I have time, I have to been to the Merseyside Maritime Museum again and I’m sure I can gain some new and unique experience from there.
Current Score: 0.81340 (Top 7%)
Feature Engineering
From the correlation heatmap we can find the relationship between features.
The meaning of features
Sex
There is a famous script in the movie Titanic: “Lady and children first”. And this figure shows how this rule works. The female survivor is much more than the male survivor.
Ages
Input print(x_train["Age"].mean())
then get the average age of the survivors is 29.6991
.
The distribution of the age as shown below:
Relation between Name, Ticket Number and Fare
Attribute “Name” looks useless at the beginning. But it contains an essential feature “Prefix”. The prefix shows the gender, age and the social level, which influence the survive rate obviously. I’ll show how to extract the prefix as a relevant feature in the next part.
“Name” contains the first name and the last name. Noticed that the family with the same last name have the same ticket number and fare. And it is hard to find other useful message from the attribute “Ticket Number”. Therefore, we can drop the “Ticket Number” attribute. “Name” can be dropped too after extracting the prefix.
Prefix
Extract the Prefix
Prefix contains useful hidden information. We can extract the prefix by regular expression:1
2
3
4names=x_train["Name"]
prefix=[]
for name in names:
prefix.append(re.search(',(.*?)\.',name).group(1))
Then we can get the titles. Here is the count result of all titles:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Col 2
Major 2
Mlle 2
the Countess 1
Sir 1
Lady 1
Ms 1
Mme 1
Capt 1
Don 1
Jonkheer 1
Prefix feature is the most interesting part of Titanic dataset probably. It gives me a chance to know about the prefix culture in different languages. The story of Titanic’s survivor/victims with the rare title is attracting as well. Here are the extracted prefixes:
- Mr,Miss,Mrs: Common prefix title in English
- Master: a title for child
- Rare prefix: The prefixes that less than 10 people have. These prefixes show the social level, job, or nationalities. Noticed that the people with the rare prefix have a pretty high survivor rate. Maybe it because those prefixes show the strength (like col and major) or respectable society position(like the Countess).
- Dr: Doctor
- Rev: Reverend, the prefix for the priest. Noticed that all people with this prefix dead in this disaster. I remember the scene in the movie Titanic that a priest lead the people pray on the ship. Thank them for brining the peace before the end of the world.
- Col: colonel. e.g.: Simonius-Blumer, Col. Oberst Alfons
- Major: major. e.g.: Peuchen, Major. Arthur Godfrey
- Mlle: = Miss. French prefix title. e.g.:Sagesser, Mlle. Emma
- Mme: = Mrs. French prefix title.
- the Countess: a title for rich woman. e.g.:Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)
- Sir: e.g.:Duff Gordon, Sir. Cosmo Edmund (“Mr Morgan”)
- Lady: e.g.:Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”)
- Ms: = Mrs. Spanish prefix title. e.g.: Reynaldo, Ms. Encarnacion
- Capt: Captain. e.g.:Crosby, Capt. Edward Gifford. Notice that he is not the captain of the Titanic. Here the “Captain” should follow the military meaning.
- Don: a honorific prefix in Latin. e.g.:Uruchurtu, Don. Manuel E
- Jonkheer: a honorific prefix title in Dutch. e.g.:Reuchlin, Jonkheer. John George
Therefore, the prefixes can be separated into 4 groups:
- For man: Mr
- For woman: Miss, Mrs, Ms, Mme, Mlle
- Child: Master
- Rare: Sir, Lady, Capt, Don, Jonkheer, Major, Col, Rev, Dr
Then use the numerical feature to represent those 4 groups. The full code of dealing the prefix as shown below:
1 | def get_prefix(dataset): |
PClass
Passenger class, or PClass
shows the different class of ticket. 1
is upper
, 2
is Middle
, 3
is lower
We can find that the higher the class is, the higher survive rate as well. The average price of different class as shown below:
The 1st class is expensive than the other classes. It is worth to buy the best ticket when take a ship, which can provide a higher survive rate. Just in case :P
Embarked
Embarked means 3 different embarked place: Southampton,sns.countplot(x_train["Embarked"],hue=x_train["Survived"])
Family
1
sns.countplot(x_train["Family"],hue=x_train["Survived"])
The figure of family as shown below.
Filling missing Values
Input print(x_train.info())
to get the train data info:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float32
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
Family 891 non-null int64
Prefix 891 non-null object
Firstanme 891 non-null object
dtypes: float32(1), float64(1), int64(6), object(7)
memory usage: 101.0+ KB
None
In train data, there are lots of missing value in Cabin
. There are 2 missing values in Embarked
as well.
Input print(x_test.info())
to get the test data info:
1 | <class 'pandas.core.frame.DataFrame'> |
Reference
- Encyclopedia Titanic.
This wiki contains the detail of all survivors. - A journey through Titanic