We work through the Kaggle House Prices competition located at https://www.kaggle.com/c/house-prices-advanced-regression-techniques. First, we import the libraries we need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import r2_score,explained_variance_score
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 14)
Then we read the data and separate the target variable from the predictors.
train = pd.read_csv('/mnt/disk2/Data/HousePrices/train.csv')
#dtrain=xgb.DMatrix('/mnt/disk2/Data/HousePrices/train.csv')
Y=train['SalePrice']
X=train.iloc[:,0:80]
We can get a glimpse of the missing values and data types. It is necessary to deal with these now, otherwise they will cause problems while training the model; XGBoost in particular does not handle the object data type.
X.info()
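For a quicker view of where values are missing and which columns are strings, we can count nulls per column; a minimal sketch:
# Columns with missing values, sorted by count (sketch)
missing = X.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
# Object (string) columns that need encoding before training
print(X.dtypes[X.dtypes == 'object'])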
X=pd.get_dummies(X)
X.fillna(0,inplace=True)
We drop some dummy features from the training set, since the corresponding categories never occur in the test data and therefore produce no dummy columns there.
X.drop(['Utilities_NoSeWa', 'PoolQC_Fa', 'Condition2_RRAn', 'Exterior1st_Stone', 'Condition2_RRAe', 'GarageQual_Ex', 'RoofMatl_Membran', 'RoofMatl_ClyTile', 'Exterior2nd_Other', 'Heating_Floor', 'Heating_OthW', 'RoofMatl_Metal', 'HouseStyle_2.5Fin', 'MiscFeature_TenC', 'Electrical_Mix', 'Condition2_RRNn', 'RoofMatl_Roll', 'Exterior1st_ImStucc'],axis=1,inplace=True)
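Rather than hard-coding that list, the mismatched dummy columns can be computed as a set difference between the one-hot encoded train and test frames; a sketch, reusing the test file path from further below:
# Dummy columns present in train but absent from test (sketch)
test_raw = pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv')
extra_cols = set(pd.get_dummies(train.iloc[:, 0:80]).columns) - set(pd.get_dummies(test_raw).columns)
print(sorted(extra_cols))  # should match the list dropped above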
Fitting an XGBoost regressor with two-fold cross-validation.
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X.iloc[train_index, :], Y.iloc[train_index])
    predictions = xgb_model.predict(X.iloc[test_index, :])
    actuals = Y.iloc[test_index]
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
xgb.plot_importance(xgb_model, max_num_features=20, height=.7, importance_type='gain')
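As a cross-check on the two-fold loop above, scikit-learn's cross_val_score gives the same kind of estimate in one call; a minimal sketch with a fresh regressor:
# Cross-validated R^2 with a fresh regressor (sketch)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(xgb.XGBRegressor(), X, Y, cv=KFold(n_splits=2, shuffle=True), scoring='r2')
print(scores.mean())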
print("Parameter optimization")
#xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]},
                   verbose=1)
clf.fit(X,Y)
print(clf.best_score_)
print(clf.best_params_)
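If more tuning is wanted, the grid can be widened; the values below are purely illustrative, and the rest of the walkthrough sticks with the parameters found by the smaller grid above:
# A wider, illustrative grid (the extra values are assumptions, not tuned results)
clf_wide = GridSearchCV(xgb.XGBRegressor(),
                        {'max_depth': [2, 4, 6],
                         'n_estimators': [50, 100, 200],
                         'learning_rate': [0.05, 0.1, 0.3]},
                        cv=3, scoring='r2', verbose=1)
clf_wide.fit(X, Y)
print(clf_wide.best_params_)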
Creating a regressor with the optimised parameters. We convert the pandas DataFrame to a NumPy array to work around a bug in the XGBoost package.
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X.iloc[train_index, :].values, Y.iloc[train_index].values)
    predictions = xgb_model.predict(X.iloc[test_index, :].values)
    actuals = Y.iloc[test_index].values
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
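Since the competition is scored on the root mean squared error between the logs of predicted and actual prices, it is worth checking that metric as well; a sketch on the last held-out fold, assuming all predictions are positive:
# RMSE on log prices for the last fold (sketch; assumes predictions > 0)
from sklearn.metrics import mean_squared_error
print(np.sqrt(mean_squared_error(np.log(actuals), np.log(predictions))))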
test = pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv')
test=pd.get_dummies(test)
test.fillna(0,inplace=True)
predict = xgb_model.predict(test.values)
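One caveat: pd.get_dummies on the test file can yield a different set and order of columns than the training design matrix, and the model expects exactly the training layout. A safer step, sketched here, is to reindex the test dummies to the training columns before predicting:
# Align test columns with the training design matrix (sketch);
# missing dummy columns are filled with 0, extra ones are dropped
test_aligned = test.reindex(columns=X.columns, fill_value=0)
Predictions could then be made on test_aligned.values instead.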
sub=pd.DataFrame()
sub['Id']=test.Id
sub['SalePrice'] = predict
sub
sub.to_csv('submission.csv', index=False)
This is a basic analysis. There are several ways to improve it further; one is to reduce the number of features used in training, which helps avoid overfitting.
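One simple way to act on that idea is to keep only the most important features from the fitted model and retrain; a rough sketch using scikit-learn's SelectFromModel, with the 'median' threshold chosen arbitrarily:
# Feature selection by importance (sketch; the threshold is an arbitrary choice)
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(xgb_model, threshold='median', prefit=True)
X_reduced = selector.transform(X.values)
reduced_model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X_reduced, Y)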