Import

In this tutorial we will analyse the Titanic dataset from Kaggle and walk through a general workflow that can be followed when analysing a dataset.

We begin by importing the required modules and the data, which I assume you have already downloaded from the link above.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
from ipywidgets import *


%matplotlib inline

train = pd.read_csv('train.csv')  # read the training set CSV

We use a pandas DataFrame to store our dataset; an alternative would be a NumPy array.

The pandas DataFrame object provides a head method to get a glimpse of the data; put a number in the parentheses to see that many rows. Another method that is useful at this stage is info, which makes it easy to spot the features (columns) with the most missing (NA) values. It is important to identify these features before you start building a model, because missing values can significantly reduce model accuracy. There are a number of ways to deal with them. In our case we will delete the Cabin feature, since far too many of its values are missing, fill the missing values in Age with the column mean, and drop the two rows with missing values in Embarked.
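A quick way to count the missing values per column is sketched below (a minimal sketch, using the train DataFrame loaded above; the commented line shows an alternative imputation, not something this tutorial does):

# count missing values in each column, most-affected columns first
train.isnull().sum().sort_values(ascending=False)

# an alternative to the mean fill used below would be the column median,
# which is less sensitive to outliers:
# train.Age.fillna(train.Age.median(), inplace=True)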

In [10]:
train.head()
Out[10]:
Survived PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [11]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [3]:
del train['Cabin']
train.Age.fillna(train.Age.mean(), inplace=True)
train= train.dropna()
train.head()
Out[3]:
Survived PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
2 1 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
4 0 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S

Dealing with categorical variable

Another thing we need to do before training a model is to convert the categorical variables into numbers, since scikit-learn estimators expect numeric input. We use the get_dummies method provided by pandas: it creates one extra column per category in the feature and fills it with 1 in the rows where that category is present and 0 elsewhere, as the small example below illustrates.
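For instance, on a toy categorical column (the values here are made up, not taken from the Titanic data), get_dummies behaves like this:

s = pd.Series(['male', 'female', 'female', 'male'])
pd.get_dummies(s)
# returns two 0/1 columns, 'female' and 'male' (the exact dtype depends on the pandas version):
#    female  male
# 0       0     1
# 1       1     0
# 2       1     0
# 3       0     1

Note that the resulting female and male columns are mutually redundant (one is always 1 minus the other); get_dummies(..., drop_first=True) is a common way to drop one of them, although this tutorial keeps all dummy columns.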

In [4]:
dummies=pd.get_dummies(train['Sex'])
train=train.join(dummies)
dummies1=pd.get_dummies(train['Embarked'], prefix='Embarked')
train = train.join(dummies1)
del train['Sex']
del train['Ticket']
del train['Embarked']
del train['Name']
train.head()
#sns.pairplot(train, hue='Survived')
Out[4]:
Survived PassengerId Pclass Age SibSp Parch Fare female male Embarked_C Embarked_Q Embarked_S
0 0 1 3 22.0 1 0 7.2500 0.0 1.0 0.0 0.0 1.0
1 1 2 1 38.0 1 0 71.2833 1.0 0.0 1.0 0.0 0.0
2 1 3 3 26.0 0 0 7.9250 1.0 0.0 0.0 0.0 1.0
3 1 4 1 35.0 1 0 53.1000 1.0 0.0 0.0 0.0 1.0
4 0 5 3 35.0 0 0 8.0500 0.0 1.0 0.0 0.0 1.0

Feature Selection

You will often read that more data means a better, more accurate model, but there "data" means observations (rows), not necessarily features (columns). Features that explain only a small fraction of the variation in the dependent variable slow down model building without improving accuracy, and when two or more features are correlated they can make the model even worse. It is therefore necessary to remove unnecessary features from the dataset. We will use scikit-learn's feature_selection submodule for this purpose. It provides several methods; here we use SelectKBest, which selects the k best features. The code below prints each feature together with the p-value of a chi-squared test of its relationship with the dependent variable; lower p-values indicate a stronger association with survival.

In [5]:
tt=train.iloc[:,1:]                     # remove Survived(0) column since it is the dependent variable
X_new = SelectKBest(chi2, k=6)
X_new.fit_transform(tt, train.Survived)
pd.options.display.float_format = '{:,.9f}'.format   # make float values easier to read
pd.DataFrame(X_new.pvalues_, tt.columns)
Out[5]:
0
PassengerId 0.068236432
Pclass 0.000000040
Age 0.000000116
SibSp 0.122020833
Parch 0.001227426
Fare 0.000000000
female 0.000000000
male 0.000000000
Embarked_C 0.000005023
Embarked_Q 0.897161121
Embarked_S 0.017516334

Now we keep the features with the lowest p-values: Pclass, Age, Parch, Fare, female, male and Embarked_C. (A programmatic way to read the selection off the fitted selector is sketched below.)
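Instead of reading the p-values off by hand, the fitted selector can also report which columns it kept, via its get_support method (a short sketch using the X_new and tt objects from the cell above; note that SelectKBest ranks features by the chi-squared statistic rather than the p-value, so its choice may differ slightly from the hand-picked list):

# boolean mask of the k=6 columns kept by SelectKBest
selected = tt.columns[X_new.get_support()]
print(list(selected))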

In [6]:
train_f = train.loc[:,['Survived','Pclass','Age','Parch','Fare','female','male','Embarked_C']]
train_f.head()
Out[6]:
Survived Pclass Age Parch Fare female male Embarked_C
0 0 3 22.000000000 0 7.250000000 0.000000000 1.000000000 0.000000000
1 1 1 38.000000000 0 71.283300000 1.000000000 0.000000000 1.000000000
2 1 3 26.000000000 0 7.925000000 1.000000000 0.000000000 0.000000000
3 1 1 35.000000000 0 53.100000000 1.000000000 0.000000000 0.000000000
4 0 3 35.000000000 0 8.050000000 0.000000000 1.000000000 0.000000000

Model Fitting

We now fit several models on the reduced dataset and compare their mean 10-fold cross-validation accuracies.

In [10]:
from sklearn import cross_validation
from sklearn import svm
clfsvc = svm.SVC(kernel='rbf')
scoressvc = cross_validation.cross_val_score(clfsvc, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoressvc.mean()
Out[10]:
0.72335291113380995
In [30]:
from sklearn.naive_bayes import GaussianNB
clfgnb = GaussianNB()
scoregnb = cross_validation.cross_val_score(clfgnb, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoregnb.mean()
Out[30]:
0.77951991828396328
In [31]:
from sklearn import tree
clftree = tree.DecisionTreeClassifier()
scoretree = cross_validation.cross_val_score(clftree, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoretree.mean()
Out[31]:
0.79191777323799784
In [11]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
scoresrf = cross_validation.cross_val_score(clfrf, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoresrf.mean()
Out[11]:
0.8020429009193053
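As a side note, the cross_validation and grid_search modules used in these cells were merged into sklearn.model_selection in later scikit-learn releases. Roughly the same comparison can be written as a single loop against the newer API (a sketch, assuming scikit-learn 0.18 or later; the scores will not exactly match the outputs above because the folds and the random forest's randomness differ between runs and versions):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = train_f.iloc[:, 1:]   # selected features
y = train_f.iloc[:, 0]    # Survived

models = [('SVC (rbf)', SVC(kernel='rbf')),
          ('GaussianNB', GaussianNB()),
          ('DecisionTree', DecisionTreeClassifier()),
          ('RandomForest', RandomForestClassifier())]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=10)
    print('%-13s %.3f' % (name, scores.mean()))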

The random forest gave us the best score; let's see whether we can improve it further using a grid search over its hyperparameters.

In [13]:
parameters = {'n_estimators':[8,15], 'criterion':('gini','entropy'), 'min_samples_split':[2,6], 'n_jobs':[-1]}
from sklearn.grid_search import GridSearchCV
clfgs = GridSearchCV(clfrf, parameters)
In [14]:
scoresgs = cross_validation.cross_val_score(clfgs, X=train_f.iloc[:,1:] , y=train_f.iloc[:,0], cv=10)
scoresgs.mean()
Out[14]:
0.82342951991828384
In [15]:
clfgs.fit(X=train_f.iloc[:,1:] , y=train_f.iloc[:,0])
Out[15]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_jobs': [-1], 'criterion': ('gini', 'entropy'), 'min_samples_split': [2, 6], 'n_estimators': [8, 15]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
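Once the grid search has been fitted, the winning parameter combination and its cross-validated score are available on the fitted object (a short sketch using the clfgs object above; the exact values vary from run to run because of the random forest's randomness):

print(clfgs.best_params_)   # the parameter combination that scored best
print(clfgs.best_score_)    # its mean cross-validated accuracy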
Before we can make predictions, the test set has to go through the same preparation steps as the training set: drop Cabin, fill the missing Age and Fare values (using the training-set means), create the dummy variables and keep the selected features.

In [7]:
# Preparing the test set for prediction

test=pd.read_csv('test.csv')
del test['Cabin']
test.Age.fillna(train.Age.mean(), inplace=True)
test.Fare.fillna(train.Fare.mean(), inplace=True)
test= test.dropna()

dummies=pd.get_dummies(test['Sex'])
test=test.join(dummies)
dummies1=pd.get_dummies(test['Embarked'], prefix='Embarked')
test = test.join(dummies1)
del test['Sex']
del test['Ticket']
del test['Embarked']
del test['Name']

test_f = test.loc[:,['Pclass','Age','Parch','Fare','female','male','Embarked_C']]
test_f.head()
Out[7]:
Pclass Age Parch Fare female male Embarked_C
0 3 34.500000000 0 7.829200000 0.000000000 1.000000000 0.000000000
1 3 47.000000000 0 7.000000000 1.000000000 0.000000000 0.000000000
2 2 62.000000000 0 9.687500000 0.000000000 1.000000000 0.000000000
3 3 27.000000000 0 8.662500000 0.000000000 1.000000000 0.000000000
4 3 22.000000000 1 12.287500000 1.000000000 0.000000000 0.000000000

Now we can predict survival on the prepared test set with the tuned grid-search model.

In [16]:
test_result = pd.DataFrame(clfgs.predict(test_f))
In [17]:
test['Survived'] = test_result[0].values   # take the single prediction column; .values sidesteps index alignment
In [18]:
test1 = test.loc[:,['PassengerId','Survived']]
test1.to_csv('titanic_pred.csv', index=False)   # Kaggle expects only the PassengerId and Survived columns
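Before uploading, it is worth a quick sanity check that the file has the shape Kaggle expects (a sketch; the Titanic test set should yield 418 predictions and exactly the two columns below):

sub = pd.read_csv('titanic_pred.csv')
print(sub.shape)              # should be (418, 2)
print(sub.columns.tolist())   # ['PassengerId', 'Survived']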
This submission scores an accuracy of 77.99% on Kaggle.