This notebook works through the Kaggle competition at https://www.kaggle.com/c/house-prices-advanced-regression-techniques. We start by importing the libraries we need.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import r2_score,explained_variance_score
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
import seaborn as sns

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 14)

Then, we read the data and separate the target variable from the predictors.

In [12]:
train = pd.read_csv('/mnt/disk2/Data/HousePrices/train.csv')
#dtrain=xgb.DMatrix('/mnt/disk2/Data/HousePrices/train.csv')
Y = train['SalePrice']    # target variable
X = train.iloc[:, 0:80]   # predictors: every column except SalePrice

We can get a glimpse of the missing values and data types. It is necessary to deal with these now, otherwise they will cause problems while training the model; in particular, xgboost does not handle the object data type.

In [154]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
dtypes: float64(3), int64(34), object(43)
memory usage: 912.6+ KB
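Rather than scanning the full listing, here is a quick sketch that surfaces just the columns containing missing values, using only pandas:

In [ ]:
# summarise missing values per column, most-affected columns first
missing = X.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))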
In [13]:
X = pd.get_dummies(X)       # one-hot encode the object (categorical) columns
X.fillna(0, inplace=True)   # fill the remaining NaNs with 0

We drop some dummy columns from the training set, since these category levels do not occur in the test data and would leave the two sets with mismatched columns.

In [ ]:
X.drop(['Utilities_NoSeWa', 'PoolQC_Fa', 'Condition2_RRAn', 'Exterior1st_Stone', 'Condition2_RRAe', 'GarageQual_Ex', 'RoofMatl_Membran', 'RoofMatl_ClyTile', 'Exterior2nd_Other', 'Heating_Floor', 'Heating_OthW', 'RoofMatl_Metal', 'HouseStyle_2.5Fin', 'MiscFeature_TenC', 'Electrical_Mix', 'Condition2_RRNn', 'RoofMatl_Roll', 'Exterior1st_ImStucc'],axis=1,inplace=True)
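Hard-coding these column names is brittle; if the data changes, the list goes stale. A more robust alternative is to one-hot encode both sets and keep only the columns they share. A sketch (the names test_dummies, X_aligned, and test_aligned are illustrative and not used below):

In [ ]:
test_dummies = pd.get_dummies(pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv'))
# an inner join on the column axis keeps only dummy columns present in both frames
X_aligned, test_aligned = X.align(test_dummies, join='inner', axis=1)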

Fitting an xgboost regressor with 2-fold cross-validation:

In [5]:
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    # .iloc replaces the deprecated .ix indexer
    xgb_model = xgb.XGBRegressor().fit(X.iloc[train_index, :], Y.iloc[train_index])
    predictions = xgb_model.predict(X.iloc[test_index, :])
    actuals = Y.iloc[test_index]
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
0.862577489679
0.86271443844
0.804550146201
0.804569386597
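For reference, the same cross-validated R² can be computed more compactly with scikit-learn's cross_val_score; a minimal sketch:

In [ ]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(xgb.XGBRegressor(), X, Y,
                         cv=KFold(n_splits=2, shuffle=True), scoring='r2')
print(scores)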
In [162]:
xgb.plot_importance(xgb_model, max_num_features=20, height=.7, importance_type='gain')
Out[162]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fde033887b8>
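The same gain-based importances can also be read off programmatically, which helps when the plot gets crowded. A short sketch, assuming an xgboost version whose sklearn wrapper exposes get_booster():

In [ ]:
gain = xgb_model.get_booster().get_score(importance_type='gain')
# print the ten features with the highest average gain
for name, score in sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(name, score)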
In [35]:
print("Parameter optimization")
#xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2,4,6],
                    'n_estimators': [50,100,200]}, verbose=1)
clf.fit(X,Y)
print(clf.best_score_)
print(clf.best_params_)
Parameter optimization
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   26.0s finished
0.887149329662
{'max_depth': 2, 'n_estimators': 200}
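As an aside, the tuned settings can be pulled straight from the search object instead of being retyped. A small sketch (GridSearchCV refits on the full data by default, so best_estimator_ is already trained):

In [ ]:
best_model = clf.best_estimator_                   # already refit on all of X, Y
same_model = xgb.XGBRegressor(**clf.best_params_)  # or rebuild it from the parameters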

Creating a regressor with the optimised parameters. We convert the pandas DataFrame to a NumPy array to work around a bug in the XGBoost package.

In [22]:
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(X):
    # .iloc and .values replace the deprecated .ix and .as_matrix()
    xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(
        X.iloc[train_index, :].values, Y.iloc[train_index].values)
    predictions = xgb_model.predict(X.iloc[test_index, :].values)
    actuals = Y.iloc[test_index].values
    print(r2_score(actuals, predictions))
    print(explained_variance_score(actuals, predictions))
0.875871921819
0.87587764061
0.82333063306
0.823381103422
In [23]:
test = pd.read_csv('/mnt/disk2/Data/HousePrices/test.csv')
test = pd.get_dummies(test)
test.fillna(0, inplace=True)
predict = xgb_model.predict(test.values)  # .values replaces the deprecated .as_matrix()
In [45]:
sub = pd.DataFrame()
sub['Id'] = test.Id
sub['SalePrice'] = predict
In [46]:
sub
Out[46]:
Id SalePrice
0 1461 125202.523438
1 1462 161186.406250
2 1463 179692.140625
3 1464 181828.125000
4 1465 198630.734375
5 1466 175201.906250
6 1467 171272.375000
7 1468 165066.593750
8 1469 193533.781250
9 1470 131481.843750
10 1471 205215.593750
11 1472 93203.812500
12 1473 90192.750000
13 1474 156455.109375
14 1475 138402.796875
15 1476 420012.875000
16 1477 258751.265625
17 1478 279299.875000
18 1479 265375.031250
19 1480 483079.281250
20 1481 339368.875000
21 1482 212813.093750
22 1483 167213.625000
23 1484 164368.531250
24 1485 177559.000000
25 1486 197312.906250
26 1487 346723.187500
27 1488 233565.078125
28 1489 205618.437500
29 1490 213946.640625
... ... ...
1429 2890 89575.328125
1430 2891 129273.234375
1431 2892 64325.347656
1432 2893 89196.976562
1433 2894 76502.132812
1434 2895 330348.125000
1435 2896 277433.593750
1436 2897 194578.812500
1437 2898 146548.015625
1438 2899 220173.968750
1439 2900 152121.468750
1440 2901 202760.890625
1441 2902 190732.953125
1442 2903 373589.593750
1443 2904 363271.500000
1444 2905 98666.945312
1445 2906 197718.187500
1446 2907 110750.460938
1447 2908 129971.304688
1448 2909 149817.125000
1449 2910 90642.859375
1450 2911 79658.070312
1451 2912 137520.187500
1452 2913 81148.210938
1453 2914 74452.132812
1454 2915 79421.335938
1455 2916 80909.085938
1456 2917 180584.140625
1457 2918 117666.476562
1458 2919 228054.140625

1459 rows × 2 columns

In [47]:
sub.to_csv('submission.csv', index=False)  # index=False keeps the stray index column out of the submission

This is a basic analysis. There are several ways to improve it further; one is to reduce the number of features used in training, which helps avoid overfitting.
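As an illustration of that idea, features could be pruned by importance before retraining. A sketch using scikit-learn's SelectFromModel (the 'median' threshold is an arbitrary choice for demonstration):

In [ ]:
from sklearn.feature_selection import SelectFromModel
# keep only features whose importance exceeds the median importance
selector = SelectFromModel(xgb.XGBRegressor(max_depth=2, n_estimators=200),
                           threshold='median')
selector.fit(X, Y)
X_reduced = selector.transform(X)
print(X.shape, '->', X_reduced.shape)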