In this tutorial we will analyse the Titanic dataset from Kaggle and walk through a general flow you can follow while analysing a dataset.
We begin by importing the required modules and the data, which I assume you have already downloaded from the link mentioned above.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
from ipywidgets import *
%matplotlib inline
train=pd.read_csv('train.csv') #Read CSV file train set
We are using a pandas DataFrame to store our dataset; an alternative is to use a NumPy array.
The DataFrame object provides a head method to get a glimpse of the data. Go ahead and put a number in the parentheses to see that many rows of the dataset. Another DataFrame method that is useful at this stage is info, which makes it easy to spot the features (columns) with the most NA values. It is important to identify features with missing values before you start building a model, because missing values can significantly reduce model accuracy. There are a number of ways to deal with such features. In our case, we will delete the Cabin feature since it has far too many missing values, fill the NA values in Age with the column average, and drop the two rows with missing values in Embarked.
train.head()
train.info()
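If you prefer an explicit count of missing values over reading them off the info output, a one-liner like the following works (a minimal sketch using standard pandas calls):
train.isnull().sum().sort_values(ascending=False) # NA count per column, largest first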
del train['Cabin']
train.Age.fillna(train.Age.mean(), inplace=True)
train= train.dropna()
train.head()
Another thing we need to do before training a model is to convert all categorical variables into numbers, which reduces processing time and memory footprint. We use the get_dummies method provided by pandas for this. get_dummies creates one extra column per category in the feature, filling in 1 in the rows where that category is present and 0 in the rows where another category is present.
dummies=pd.get_dummies(train['Sex'])
train=train.join(dummies)
dummies1=pd.get_dummies(train['Embarked'], prefix='Embarked')
train = train.join(dummies1)
del train['Sex']
del train['Ticket']
del train['Embarked']
del train['Name']
train.head()
#sns.pairplot(train, hue='Survived')
You will often read that more data means a better, more accurate model, but here "data" means observations (rows), not necessarily features (columns). Features that explain only a small fraction of the variation in the dependent variable slow down model building without improving accuracy, and when two or more features are correlated they can make the model even worse. So it is necessary to remove unnecessary features from the dataset. We will use scikit-learn's feature_selection submodule for this purpose. It provides several methods; here we will use SelectKBest, which selects the k best variables from the dataset. The code below prints each feature alongside its p-value from a chi-squared test against the dependent variable.
tt = train.drop(['PassengerId', 'Survived'], axis=1) # drop the ID column and the dependent variable before scoring features
X_new = SelectKBest(chi2, k=6)
X_new.fit_transform(tt, train.Survived)
pd.options.display.float_format = '{:,.9f}'.format # make float values easier to read
pd.DataFrame(X_new.pvalues_, tt.columns)
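If you would rather have SelectKBest report its choice directly, its get_support method returns a boolean mask over the input columns (a small sketch reusing the fitted X_new from above):
mask = X_new.get_support() # True for the k features SelectKBest kept
list(tt.columns[mask])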
Now we pick the features with the smallest p-values: Pclass, Age, Parch, Fare, female, male and Embarked_C.
train_f = train.loc[:,['Survived','Pclass','Age','Parch','Fare','female','male','Embarked_C']]
train_f.head()
Now we will start fitting models to the reduced dataset, comparing a few classifiers with 10-fold cross-validation.
from sklearn.model_selection import cross_val_score # the old cross_validation module was removed from recent scikit-learn versions
from sklearn import svm
clfsvc = svm.SVC(kernel='rbf')
scoressvc = cross_val_score(clfsvc, X=train_f.iloc[:,1:], y=train_f.iloc[:,0], cv=10)
scoressvc.mean()
from sklearn.naive_bayes import GaussianNB
clfgnb = GaussianNB()
scoregnb = cross_val_score(clfgnb, X=train_f.iloc[:,1:], y=train_f.iloc[:,0], cv=10)
scoregnb.mean()
from sklearn import tree
clftree = tree.DecisionTreeClassifier()
scoretree = cross_val_score(clftree, X=train_f.iloc[:,1:], y=train_f.iloc[:,0], cv=10)
scoretree.mean()
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
scoresrf = cross_val_score(clfrf, X=train_f.iloc[:,1:], y=train_f.iloc[:,0], cv=10)
scoresrf.mean()
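Since the four blocks above differ only in the classifier, you can also compare them in a single loop (a small convenience sketch reusing the objects defined above):
X, y = train_f.iloc[:,1:], train_f.iloc[:,0]
for name, clf in [('SVC', clfsvc), ('GaussianNB', clfgnb), ('DecisionTree', clftree), ('RandomForest', clfrf)]:
    print(name, cross_val_score(clf, X=X, y=y, cv=10).mean()) # mean 10-fold CV accuracy per classifier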
We found that the random forest gave us the best score; let's see whether we can improve it further using grid search.
parameters = {'n_estimators':[8,15], 'criterion':('gini','entropy'), 'min_samples_split':[2,6], 'n_jobs':[-1]}
from sklearn.model_selection import GridSearchCV # the old grid_search module was removed from recent scikit-learn versions
clfgs = GridSearchCV(clfrf, parameters)
scoresgs = cross_val_score(clfgs, X=train_f.iloc[:,1:], y=train_f.iloc[:,0], cv=10)
scoresgs.mean()
clfgs.fit(X=train_f.iloc[:,1:] , y=train_f.iloc[:,0])
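Once the grid search is fitted, it is worth inspecting which parameter combination won; best_params_ and best_score_ are standard GridSearchCV attributes:
print(clfgs.best_params_) # parameters of the best estimator found by the grid search
print(clfgs.best_score_) # its mean cross-validation score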
# Preparing the test set for prediction
test=pd.read_csv('test.csv')
del test['Cabin']
test.Age.fillna(train.Age.mean(), inplace=True)
test.Fare.fillna(train.Fare.mean(), inplace=True)
test= test.dropna()
dummies=pd.get_dummies(test['Sex'])
test=test.join(dummies)
dummies1=pd.get_dummies(test['Embarked'], prefix='Embarked')
test = test.join(dummies1)
del test['Sex']
del test['Ticket']
del test['Embarked']
del test['Name']
test_f = test.loc[:,['Pclass','Age','Parch','Fare','female','male','Embarked_C']]
test_f.head()
test_result = clfgs.predict(test_f) # predict now that test_f has been prepared
test['Survived'] = test_result
test1=test.loc[:,['PassengerId','Survived']]
test1.to_csv('titanic_pred.csv', index=False) # write only PassengerId and Survived, without the DataFrame index
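Before uploading, you can re-read the file to confirm it contains exactly the PassengerId and Survived columns Kaggle expects:
pd.read_csv('titanic_pred.csv').head() # quick sanity check on the submission file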
This submission scores an accuracy of 77.99% on Kaggle.