The Titanic problem is similar to a practical, real-world problem. Each feature has its own meaning and background, which can help us build a better model. I also learned about the history of the Titanic and the tradition of name prefixes (titles) in English.

If I have time, I would like to visit the Merseyside Maritime Museum again. I’m sure I can gain some new and unique experiences there.

Current Score: 0.81340 (Top 7%)

Feature Engineering

(Figure: correlation heatmap of the features)
The correlation heatmap shows the relationships between the features.

The meaning of features

Sex

(Figure: survival by sex)
There is a famous line in the movie Titanic: “Women and children first”. This figure shows how the rule worked in practice: far more women survived than men.

Ages

Running print(x_train["Age"].mean()) shows that the average age of the passengers in the training data is 29.6991.
The age distribution is shown below:
(Figure: age distribution)

Relation between Name, Ticket Number and Fare

The “Name” attribute looks useless at first, but it contains an essential feature: the prefix (title). The prefix reflects gender, age and social status, all of which clearly influence the survival rate. I’ll show how to extract the prefix as a feature in the next part.

“Name” also contains the first name and the last name. Note that members of a family with the same last name share the same ticket number and fare, and it is hard to find other useful information in the “Ticket Number” attribute. Therefore, we can drop “Ticket Number”. “Name” can be dropped too once the prefix has been extracted.

Prefix

Extract the Prefix

The prefix contains useful hidden information. We can extract it with a regular expression:

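import re

names = x_train["Name"]
prefix = []
for name in names:
    # Capture the text between the comma and the first period, e.g. " Mr"
    # (the leading space is kept)
    prefix.append(re.search(r',(.*?)\.', name).group(1))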
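
To see what the pattern captures, here is a minimal sketch that runs it on three names quoted later in this post. The helper name extract_title is hypothetical and not part of the original code. The captured group keeps the space after the comma, so the sketch strips it for readability.

```python
import re

# Same pattern as above: lazily capture everything between the comma
# and the first period that follows it.
TITLE_PATTERN = re.compile(r",(.*?)\.")

def extract_title(name):
    """Return the title in a 'Last, Title. First' name, or None if absent.

    extract_title is a hypothetical helper for illustration only.
    """
    match = TITLE_PATTERN.search(name)
    if match is None:
        return None
    # The raw capture is e.g. " Major" (with the space after the comma);
    # strip() removes that space.
    return match.group(1).strip()

samples = [
    "Peuchen, Major. Arthur Godfrey",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Reuchlin, Jonkheer. John George",
]
for name in samples:
    print(extract_title(name))
```

This prints Major, the Countess and Jonkheer. Returning None instead of raising makes it easier to spot names that do not follow the pattern. The full function further down matches on the unstripped form (such as " Mr"), so stripping there would mean changing its lookup strings as well.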

This gives us the titles. Here is the count for each title:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Major             2
Mlle              2
the Countess      1
Sir               1
Lady              1
Ms                1
Mme               1
Capt              1
Don               1
Jonkheer          1

The prefix feature is probably the most interesting part of the Titanic dataset. It gave me a chance to learn about the culture of name prefixes in different languages. The stories of the Titanic’s survivors and victims with rare titles are fascinating as well. Here are the extracted prefixes:

Mr, Miss, Mrs: common titles in English
Master: a title for a boy
Rare prefixes: prefixes held by fewer than 10 people. They indicate social rank, profession, or nationality. People with rare prefixes have a fairly high survival rate, perhaps because these prefixes indicate strength (like Col and Major) or a respected social position (like the Countess).
- Dr: Doctor
- Rev: Reverend, the title for a member of the clergy. Everyone with this prefix died in the disaster. I remember the scene in the movie Titanic in which a priest leads people in prayer on the ship. I thank them for bringing peace before the end of the world.
- Col: Colonel. e.g.: Simonius-Blumer, Col. Oberst Alfons
- Major: Major. e.g.: Peuchen, Major. Arthur Godfrey
- Mlle: Mademoiselle, the French equivalent of Miss. e.g.: Sagesser, Mlle. Emma
- Mme: Madame, the French equivalent of Mrs.
- the Countess: a title of nobility for a woman. e.g.: Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)
- Sir: e.g.: Duff Gordon, Sir. Cosmo Edmund (“Mr Morgan”)
- Lady: e.g.: Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”)
- Ms: an English title for a woman that does not indicate marital status. e.g.: Reynaldo, Ms. Encarnacion
- Capt: Captain. e.g.: Crosby, Capt. Edward Gifford. Note that he was not the captain of the Titanic; here “Captain” presumably has the military meaning.
- Don: an honorific prefix in Spanish. e.g.: Uruchurtu, Don. Manuel E
- Jonkheer: a Dutch honorific prefix. e.g.: Reuchlin, Jonkheer. John George

Therefore, the prefixes can be separated into 4 groups:

Men: Mr
Women: Miss, Mrs, Ms, Mme, Mlle
Children: Master
Rare: Sir, Lady, the Countess, Capt, Don, Jonkheer, Major, Col, Rev, Dr

Each of the 4 groups is then represented by a number (0, 1, 2 and 4 in the code). The full code for handling the prefix is shown below:

import re
import pandas as pd

def get_prefix(dataset):
    names = dataset["Name"]
    prefix = []
    for name in names:
        # The match keeps its leading space (e.g. " Mr"),
        # so every lookup string below starts with a space too.
        prefix.append(re.search(r',(.*?)\.', name).group(1))
    prefix = pd.Series(prefix, index=dataset.index)
    # Men
    prefix = prefix.replace(" Mr", 0)
    # Women
    prefix = prefix.replace([" Miss", " Mrs", " Ms", " Mme", " Mlle"], 1)
    # Children
    prefix = prefix.replace(" Master", 2)
    # Rare titles are encoded as 4
    prefix = prefix.replace([' Rev', ' Dr', ' Col', ' Major', ' the Countess', ' Capt',
                             ' Dona', ' Don', ' Sir', ' Lady', ' Jonkheer'], 4)
    dataset["Prefix"] = prefix
    return dataset
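
The chain of replace calls can also be written as a single lookup table, which keeps each title next to its group and gives unknown titles an explicit fallback. The sketch below is an alternative under stated assumptions, not the code behind the score above: TITLE_GROUPS and group_title are hypothetical names, and titles not listed in the table are assumed to belong to the rare group.

```python
# Group codes follow the function above: 0 = Mr, 1 = women, 2 = Master, 4 = rare.
RARE = 4

TITLE_GROUPS = {
    "Mr": 0,
    "Miss": 1, "Mrs": 1, "Ms": 1, "Mme": 1, "Mlle": 1,
    "Master": 2,
}

def group_title(title):
    """Map a title to its group code.

    Assumption: any title not listed in TITLE_GROUPS (Dr, Rev, Col, ...)
    is treated as rare, including titles never seen in the training data.
    Leading and trailing spaces are stripped before the lookup.
    """
    return TITLE_GROUPS.get(title.strip(), RARE)

for title in ["Mr", " Mlle", "Master", "the Countess", "Dona"]:
    print(title.strip(), group_title(title))
```

With pandas, the same function could be applied to a column of extracted titles through Series.map(group_title). The fallback matters for the test data, where a title such as Dona appears that is not in the training counts listed earlier.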

PClass

Passenger class, or PClass, is the class of the ticket: 1 is upper, 2 is middle, 3 is lower.
(Figure: survival by passenger class)
The higher the class, the higher the survival rate. The average fare for each class is shown below:
(Figure: average fare by passenger class)
1st class is more expensive than the other classes. When taking a ship, it is worth buying the best ticket, since it comes with a higher survival rate. Just in case :P
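
Both figures boil down to a group-by on Pclass. Here is a minimal sketch, assuming pandas is available. The six-row table is made-up toy data for illustration, not rows from the Titanic dataset, and summarize_by_class is a hypothetical helper name.

```python
import pandas as pd

def summarize_by_class(df):
    """Return survival rate and mean fare for each passenger class."""
    return df.groupby("Pclass").agg(
        survival_rate=("Survived", "mean"),
        mean_fare=("Fare", "mean"),
    )

# Made-up toy rows (NOT real Titanic data), only to show the call.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
    "Fare":     [80.0, 60.0, 20.0, 10.0, 8.0, 7.0],
})
print(summarize_by_class(toy))
```

On the real data the call would be summarize_by_class(x_train), since the training frame has the Pclass, Survived and Fare columns.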

Embarked

Embarked is the port of embarkation, one of 3 places: Southampton, Cherbourg and Queenstown. The survival counts for each port can be plotted with:
sns.countplot(x="Embarked", hue="Survived", data=x_train)

(Figure: survival counts by port of embarkation)

Family

The survival counts for each value of Family can be plotted with:

sns.countplot(x="Family", hue="Survived", data=x_train)

The resulting figure is shown below.
(Figure: survival counts by Family)

Filling missing Values

Run print(x_train.info()) to get a summary of the training data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float32
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Family         891 non-null int64
Prefix         891 non-null object
Firstanme      891 non-null object
dtypes: float32(1), float64(1), int64(6), object(7)
memory usage: 101.0+ KB
None

In the training data, Cabin has many missing values (only 204 of 891 entries are non-null). Embarked has 2 missing values as well.

Run print(x_test.info()) to get a summary of the test data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
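
The listings show gaps in Cabin and Embarked (train) and in Age, Fare and Cabin (test). The post does not show how they were filled, so the following is only a sketch of one common approach, assuming pandas. It uses the most frequent port for Embarked and the median for Age and Fare, with all statistics taken from the training data. fill_missing is a hypothetical helper, and the toy frames are made up.

```python
import pandas as pd

def fill_missing(train, test):
    """Return filled copies of both frames (Embarked, Age and Fare only).

    Assumption: fill values are computed on the training data only,
    then reused for the test data.
    """
    fill_values = {
        "Embarked": train["Embarked"].mode()[0],  # most frequent port
        "Age": train["Age"].median(),
        "Fare": train["Fare"].median(),
    }
    # fillna with a dict fills each listed column with its own value
    # and returns a new frame, leaving the inputs unchanged.
    return train.fillna(fill_values), test.fillna(fill_values)

# Made-up toy rows (NOT real Titanic data).
toy_train = pd.DataFrame({
    "Embarked": ["S", "S", None, "C"],
    "Age": [22.0, 30.0, 40.0, 28.0],
    "Fare": [7.0, 8.0, 50.0, 30.0],
})
toy_test = pd.DataFrame({
    "Embarked": ["Q", "S"],
    "Age": [None, 35.0],
    "Fare": [9.0, None],
})
filled_train, filled_test = fill_missing(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

Cabin is left untouched in this sketch because most of its values are missing in both sets, so dropping the column or reducing it to a "known / unknown" flag may be more sensible than filling it.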

Reference

Encyclopedia Titanica.
This wiki contains details of all the survivors.
A journey through Titanic