We work through the Kaggle House Prices competition located at https://www.kaggle.com/c/house-prices-advanced-regression-techniques. First, we import the libraries we need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import r2_score,explained_variance_score
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 14)
Then we read the data and separate the target variable from the predictors.
train = pd.read_csv('/mnt/disk2/Data/HousePrices/train.csv')
#dtrain=xgb.DMatrix('/mnt/disk2/Data/HousePrices/train.csv')
Y=train['SalePrice']
X=train.iloc[:,0:80]
We can get a glimpse of the missing values and data types. It is necessary to deal with these now, otherwise they will cause problems while training the model; XGBoost in particular does not handle the object data type.
X.info()
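For a quicker view of where values are missing and which columns are strings, we can count nulls per column; a minimal sketch:
# Columns with missing values, sorted by count (sketch)
missing = X.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
# Object (string) columns that need encoding before training
print(X.dtypes[X.dtypes == 'object'])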
X=pd.get_dummies(X)
X.fillna(0,inplace=True)
We drop some dummy features from the training set, since the corresponding categories never occur in the test data and therefore produce no dummy columns there.
X.drop(['Utilities_NoSeWa', 'PoolQC_Fa', 'Condition2_RRAn', 'Exterior1st_Stone', 'Condition2_RRAe', 'GarageQual_Ex', 'RoofMatl_Membran', 'RoofMatl_ClyTile', 'Exterior2nd_Other', 'Heating_Floor', 'Heating_OthW', 'RoofMatl_Metal', 'HouseStyle_2.5Fin', 'MiscFeature_TenC', 'Electrical_Mix', 'Condition2_RRNn', 'RoofMatl_Roll', 'Exterior1st_ImStucc'],axis=1,inplace=True)
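Rather than hard-coding that list, the mismatched dummy columns can be computed as a set difference between the one-hot encoded train and test frames; a sketch, reusing the test file path from further below:
# Dummy columns present in train but absent from test (sketch)
test_raw = pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv')
extra_cols = set(pd.get_dummies(train.iloc[:, 0:80]).columns) - set(pd.get_dummies(test_raw).columns)
print(sorted(extra_cols))  # should match the list dropped above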
Fitting an XGBoost regressor with two-fold cross-validation.
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X.iloc[train_index, :], Y.iloc[train_index])
    predictions = xgb_model.predict(X.iloc[test_index, :])
    actuals = Y.iloc[test_index]
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
xgb.plot_importance(xgb_model, max_num_features=20, height=.7, importance_type='gain')
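As a cross-check on the two-fold loop above, scikit-learn's cross_val_score gives the same kind of estimate in one call; a minimal sketch with a fresh regressor:
# Cross-validated R^2 with a fresh regressor (sketch)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(xgb.XGBRegressor(), X, Y, cv=KFold(n_splits=2, shuffle=True), scoring='r2')
print(scores.mean())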
print("Parameter optimization")
#xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]},
                   verbose=1)
clf.fit(X,Y)
print(clf.best_score_)
print(clf.best_params_)
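If more tuning is wanted, the grid can be widened; the values below are purely illustrative, and the rest of the walkthrough sticks with the parameters found by the smaller grid above:
# A wider, illustrative grid (the extra values are assumptions, not tuned results)
clf_wide = GridSearchCV(xgb.XGBRegressor(),
                        {'max_depth': [2, 4, 6],
                         'n_estimators': [50, 100, 200],
                         'learning_rate': [0.05, 0.1, 0.3]},
                        cv=3, scoring='r2', verbose=1)
clf_wide.fit(X, Y)
print(clf_wide.best_params_)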
Creating a regressor with the optimised parameters. We convert the pandas DataFrame to a NumPy array to work around a bug in the XGBoost package.
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X.iloc[train_index, :].values, Y.iloc[train_index].values)
    predictions = xgb_model.predict(X.iloc[test_index, :].values)
    actuals = Y.iloc[test_index].values
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
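Since the competition is scored on the root mean squared error between the logs of predicted and actual prices, it is worth checking that metric as well; a sketch on the last held-out fold, assuming all predictions are positive:
# RMSE on log prices for the last fold (sketch; assumes predictions > 0)
from sklearn.metrics import mean_squared_error
print(np.sqrt(mean_squared_error(np.log(actuals), np.log(predictions))))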
test = pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv')
test=pd.get_dummies(test)
test.fillna(0,inplace=True)
predict = xgb_model.predict(test.values)
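One caveat: pd.get_dummies on the test file can yield a different set and order of columns than the training design matrix, and the model expects exactly the training layout. A safer step, sketched here, is to reindex the test dummies to the training columns before predicting:
# Align test columns with the training design matrix (sketch);
# missing dummy columns are filled with 0, extra ones are dropped
test_aligned = test.reindex(columns=X.columns, fill_value=0)
Predictions could then be made on test_aligned.values instead.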
sub=pd.DataFrame()
sub['Id']=test.Id
sub['SalePrice'] = predict
sub
sub.to_csv('submission.csv', index=False)
This is a basic analysis. There are several ways to improve it further; one is to reduce the number of features used in training, which helps avoid overfitting.
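One simple way to act on that idea is to keep only the most important features from the fitted model and retrain; a rough sketch using scikit-learn's SelectFromModel, with the 'median' threshold chosen arbitrarily:
# Feature selection by importance (sketch; the threshold is an arbitrary choice)
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(xgb_model, threshold='median', prefit=True)
X_reduced = selector.transform(X.values)
reduced_model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X_reduced, Y)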