Import

In this tutorial we will analyse the Titanic dataset from Kaggle and walk through a general workflow that can be followed when analysing a dataset.

We begin by importing the required modules and the data, which I assume you have already downloaded from the link above.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
from ipywidgets import *


%matplotlib inline

train = pd.read_csv('train.csv')  # read the training set CSV

We use a pandas DataFrame to store our dataset; an alternative would be a NumPy array.

The pandas DataFrame object provides a head method to get a glimpse of the data; put a number in the parentheses to see that many rows. Another method that is useful at this stage is info, which makes it easy to spot the features (columns) with the most missing (NA) values. It is important to identify these features before you start building a model, because missing values can significantly reduce model accuracy. There are a number of ways to deal with them. In our case we will delete the Cabin feature, since far too many of its values are missing, fill the missing values in Age with the column mean, and drop the two rows with missing values in Embarked.
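A quick way to count the missing values per column is sketched below (a minimal sketch, using the train DataFrame loaded above; the commented line shows an alternative imputation, not something this tutorial does):

# count missing values in each column, most-affected columns first
train.isnull().sum().sort_values(ascending=False)

# an alternative to the mean fill used below would be the column median,
# which is less sensitive to outliers:
# train.Age.fillna(train.Age.median(), inplace=True)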

In [10]:
train.head()
Out[10]:
Survived PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [11]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [3]:
del train['Cabin']
train.Age.fillna(train.Age.mean(), inplace=True)
train= train.dropna()
train.head()
Out[3]:
Survived PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
2 1 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
4 0 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S

Dealing with categorical variable

Another thing we need to do before training a model is to convert the categorical variables into numbers, since scikit-learn estimators expect numeric input. We use the get_dummies method provided by pandas: it creates one extra column per category in the feature and fills it with 1 in the rows where that category is present and 0 elsewhere, as the small example below illustrates.
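For instance, on a toy categorical column (the values here are made up, not taken from the Titanic data), get_dummies behaves like this:

s = pd.Series(['male', 'female', 'female', 'male'])
pd.get_dummies(s)
# returns two 0/1 columns, 'female' and 'male' (the exact dtype depends on the pandas version):
#    female  male
# 0       0     1
# 1       1     0
# 2       1     0
# 3       0     1

Note that the resulting female and male columns are mutually redundant (one is always 1 minus the other); get_dummies(..., drop_first=True) is a common way to drop one of them, although this tutorial keeps all dummy columns.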

In [4]:
dummies=pd.get_dummies(train['Sex'])
train=train.join(dummies)
dummies1=pd.get_dummies(train['Embarked'], prefix='Embarked')
train = train.join(dummies1)
del train['Sex']
del train['Ticket']
del train['Embarked']
del train['Name']
train.head()
#sns.pairplot(train, hue='Survived')
Out[4]:
Survived PassengerId Pclass Age SibSp Parch Fare female male Embarked_C Embarked_Q Embarked_S
0 0 1 3 22.0 1 0 7.2500 0.0 1.0 0.0 0.0 1.0
1 1 2 1 38.0 1 0 71.2833 1.0 0.0 1.0 0.0 0.0
2 1 3 3 26.0 0 0 7.9250 1.0 0.0 0.0 0.0 1.0
3 1 4 1 35.0 1 0 53.1000 1.0 0.0 0.0 0.0 1.0
4 0 5 3 35.0 0 0 8.0500 0.0 1.0 0.0 0.0 1.0

Feature Selection

You will often read that more data means a better, more accurate model, but there "data" means observations (rows), not necessarily features (columns). Features that explain only a small fraction of the variation in the dependent variable slow down model building without improving accuracy, and when two or more features are correlated they can make the model even worse. It is therefore necessary to remove unnecessary features from the dataset. We will use scikit-learn's feature_selection submodule for this purpose. It provides several methods; here we use SelectKBest, which selects the k best features. The code below prints each feature together with the p-value of a chi-squared test of its relationship with the dependent variable; lower p-values indicate a stronger association with survival.

In [5]:
tt=train.iloc[:,1:]                     # remove Survived(0) column since it is the dependent variable
X_new = SelectKBest(chi2, k=6)
X_new.fit_transform(tt, train.Survived)
pd.options.display.float_format = '{:,.9f}'.format   # make float values easier to read
pd.DataFrame(X_new.pvalues_, tt.columns)
Out[5]:
0
PassengerId 0.068236432
Pclass 0.000000040
Age 0.000000116
SibSp 0.122020833
Parch 0.001227426
Fare 0.000000000
female 0.000000000
male 0.000000000
Embarked_C 0.000005023
Embarked_Q 0.897161121
Embarked_S 0.017516334

Now we keep the features with the lowest p-values: Pclass, Age, Parch, Fare, female, male and Embarked_C. (A programmatic way to read the selection off the fitted selector is sketched below.)
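Instead of reading the p-values off by hand, the fitted selector can also report which columns it kept, via its get_support method (a short sketch using the X_new and tt objects from the cell above; note that SelectKBest ranks features by the chi-squared statistic rather than the p-value, so its choice may differ slightly from the hand-picked list):

# boolean mask of the k=6 columns kept by SelectKBest
selected = tt.columns[X_new.get_support()]
print(list(selected))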

In [6]:
train_f = train.loc[:,['Survived','Pclass','Age','Parch','Fare','female','male','Embarked_C']]
train_f.head()
Out[6]:
Survived Pclass Age Parch Fare female male Embarked_C
0 0 3 22.000000000 0 7.250000000 0.000000000 1.000000000 0.000000000
1 1 1 38.000000000 0 71.283300000 1.000000000 0.000000000 1.000000000
2 1 3 26.000000000 0 7.925000000 1.000000000 0.000000000 0.000000000
3 1 1 35.000000000 0 53.100000000 1.000000000 0.000000000 0.000000000
4 0 3 35.000000000 0 8.050000000 0.000000000 1.000000000 0.000000000

Model Fitting

We now fit several models on the reduced dataset and compare their mean 10-fold cross-validation accuracies.

In [10]:
from sklearn import cross_validation
from sklearn import svm
clfsvc = svm.SVC(kernel='rbf')
scoressvc = cross_validation.cross_val_score(clfsvc, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoressvc.mean()
Out[10]:
0.72335291113380995
In [30]:
from sklearn.naive_bayes import GaussianNB
clfgnb = GaussianNB()
scoregnb = cross_validation.cross_val_score(clfgnb, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoregnb.mean()
Out[30]:
0.77951991828396328
In [31]:
from sklearn import tree
clftree = tree.DecisionTreeClassifier()
scoretree = cross_validation.cross_val_score(clftree, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoretree.mean()
Out[31]:
0.79191777323799784
In [11]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
scoresrf = cross_validation.cross_val_score(clfrf, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoresrf.mean()
Out[11]:
0.8020429009193053
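As a side note, the cross_validation and grid_search modules used in these cells were merged into sklearn.model_selection in later scikit-learn releases. Roughly the same comparison can be written as a single loop against the newer API (a sketch, assuming scikit-learn 0.18 or later; the scores will not exactly match the outputs above because the folds and the random forest's randomness differ between runs and versions):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = train_f.iloc[:, 1:]   # selected features
y = train_f.iloc[:, 0]    # Survived

models = [('SVC (rbf)', SVC(kernel='rbf')),
          ('GaussianNB', GaussianNB()),
          ('DecisionTree', DecisionTreeClassifier()),
          ('RandomForest', RandomForestClassifier())]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=10)
    print('%-13s %.3f' % (name, scores.mean()))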

The random forest gave us the best score; let's see whether we can improve it further using a grid search over its hyperparameters.

In [13]:
parameters = {'n_estimators':[8,15], 'criterion':('gini','entropy'), 'min_samples_split':[2,6], 'n_jobs':[-1]}
from sklearn.grid_search import GridSearchCV
clfgs = GridSearchCV(clfrf, parameters)
In [14]:
scoresgs = cross_validation.cross_val_score(clfgs, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoresgs.mean()
Out[14]:
0.82342951991828384
In [15]:
clfgs.fit(X=train_f.iloc[:,1:] , y=train_f.iloc[:,0])
Out[15]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_jobs': [-1], 'criterion': ('gini', 'entropy'), 'min_samples_split': [2, 6], 'n_estimators': [8, 15]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
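Once the grid search has been fitted, the winning parameter combination and its cross-validated score are available on the fitted object (a short sketch using the clfgs object above; the exact values vary from run to run because of the random forest's randomness):

print(clfgs.best_params_)   # the parameter combination that scored best
print(clfgs.best_score_)    # its mean cross-validated accuracy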
Before we can make predictions, the test set has to go through the same preparation steps as the training set: drop Cabin, fill the missing Age and Fare values (using the training-set means), create the dummy variables and keep the selected features.

In [7]:
# Preparing the test set for prediction

test=pd.read_csv('test.csv')
del test['Cabin']
test.Age.fillna(train.Age.mean(), inplace=True)
test.Fare.fillna(train.Fare.mean(), inplace=True)
test= test.dropna()

dummies=pd.get_dummies(test['Sex'])
test=test.join(dummies)
dummies1=pd.get_dummies(test['Embarked'], prefix='Embarked')
test = test.join(dummies1)
del test['Sex']
del test['Ticket']
del test['Embarked']
del test['Name']

test_f = test.loc[:,['Pclass','Age','Parch','Fare','female','male','Embarked_C']]
test_f.head()
Out[7]:
Pclass Age Parch Fare female male Embarked_C
0 3 34.500000000 0 7.829200000 0.000000000 1.000000000 0.000000000
1 3 47.000000000 0 7.000000000 1.000000000 0.000000000 0.000000000
2 2 62.000000000 0 9.687500000 0.000000000 1.000000000 0.000000000
3 3 27.000000000 0 8.662500000 0.000000000 1.000000000 0.000000000
4 3 22.000000000 1 12.287500000 1.000000000 0.000000000 0.000000000

Now we can predict survival on the prepared test set with the tuned grid-search model.

In [16]:
test_result = pd.DataFrame(clfgs.predict(test_f))
In [17]:
test['Survived'] = test_result[0].values   # take the single prediction column; .values sidesteps index alignment
In [18]:
test1 = test.loc[:,['PassengerId','Survived']]
test1.to_csv('titanic_pred.csv', index=False)   # Kaggle expects only the PassengerId and Survived columns
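Before uploading, it is worth a quick sanity check that the file has the shape Kaggle expects (a sketch; the Titanic test set should yield 418 predictions and exactly the two columns below):

sub = pd.read_csv('titanic_pred.csv')
print(sub.shape)              # should be (418, 2)
print(sub.columns.tolist())   # ['PassengerId', 'Survived']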
This submission scores an accuracy of 77.99% on Kaggle.