Investigating the Factors That Affect Real Estate Price's in California

Introduction

Housing prices are highly speculative. Expert price estimates tend to range greatly. The price of a house is seemingly dependent on so many variables that it can be difficult to extract what exactly is most important to look for. This analysis aims to find the primary variables that determine a house's sales price. The analysis will be run on a data set of approximately 1500 houses with over 60 different variables measured on each house. From this data set, my goal is to narrow down the important factors that affect a house's price.

The data used in this analysis is a Kaggle data set of the sales price of properties that includes 82 variables with 1459 observations. The data was curated for a Kaggle competition that investigates price estimation of homes depending on a variety of explanatory variables. The list of explanatory variables can be seen in the appendix.

My objective in this analysis is to identify the primary explanatory variables that affect the price of any given property given the explanatory variables. To achieve this I will be using multiple linear regression to obtain the important factors and end up with a model that can predict sales price with relatively high accuracy.

Method

The method I will use to estimate the property prices is multiple linear regression. However, the data needs to meet the assumptions to use multiple linear regression affectivly:

The dependent variable should be measured on a continuos scale
There must be a linear relationship between the dependent and independent variables
The variance of the residuals should be equal at each level of the explanatory variables
The errors must be independent from one another
The risiduals must be normally distributed
There should be no outliers or influential points
There should be no multi collinearity

There are several ways to insure tha the data meets the given assumptions and this will be seen in the analysis section. Given that all asspects are met we can use Multiple linear regression to analyse the data. Multiple linear regression is an extension of linear regression, as it uses two or more explanatory variables. The method aims to use the explanatory or independent variables to estimate the dependent variable. In this case, our dependent variable is the property's sale price and the explanatory variables are a collection of measurements and observations made on the property.

Data Cleaning

The data set had several different quantitative variable types including continuos, discrete and categorical variables. As a linear regression only deals with numeric values, the categorical data needed to be converted into a numeric format. To do this I cast all the categorical variables to a numeric value based on their nature. For example, the condition of the over all house had 5 categorical variables ranging from horrible condition to excelent condition. Therefore, I cast each category to a number betweek 1 and 5 from bad to good. That being said some of the data found in the dataset was unfortunately unusable given that it was descriptive by nature and was open to interpretation. To keep the analysis relativly simple I cut out these variables. I used the following code snippet to transform the categorical variables into numeric variables.

df = pd.read_csv('train.csv')
df['LotShape'].iloc[df['LotShape'] == 'Reg'] = 0
df['LotShape'].iloc[df['LotShape'] == 'IR1'] = 1
df['LotShape'].iloc[df['LotShape'] == 'IR2'] = 2
df['LotShape'].iloc[df['LotShape'] == 'IR3'] = 3
df['LotShape']=df['LotShape'].astype(int)


df['ExterQual'].iloc[df['ExterQual'] == 'Ex'] = 5
df['ExterQual'].iloc[df['ExterQual'] == 'Gd'] = 4
df['ExterQual'].iloc[df['ExterQual'] == 'TA'] = 3
df['ExterQual'].iloc[df['ExterQual'] == 'Fa'] = 2
df['ExterQual'].iloc[df['ExterQual'] == 'Po'] = 1
df['ExterQual']=df['ExterQual'].astype(int)

df['ExterCond'].iloc[df['ExterCond'] == 'Ex'] = 5
df['ExterCond'].iloc[df['ExterCond'] == 'Gd'] = 4
df['ExterCond'].iloc[df['ExterCond'] == 'TA'] = 3
df['ExterCond'].iloc[df['ExterCond'] == 'Fa'] = 2
df['ExterCond'].iloc[df['ExterCond'] == 'Po'] = 1
df['ExterCond']=df['ExterCond'].astype(int)

df['CentralAir'].iloc[df['CentralAir']== 'N']=0
df['CentralAir'].iloc[df['CentralAir']== 'Y']=1
df['CentralAir']=df['CentralAir'].astype(int)

df['LandContour'].iloc[df['LandContour'] != 'Lvl'] = 0
df['LandContour'].iloc[df['LandContour'] == 'Lvl'] = 1
df['LandContour']=df['LandContour'].astype(int)

df['Utilities'].iloc[df['Utilities'] != 'AllPub'] = 0
df['Utilities'].iloc[df['Utilities'] == 'AllPub'] = 1
df['Utilities']=df['Utilities'].astype(int)

df['LandSlope'].iloc[df['LandSlope'] != 'Gtl'] = 0
df['LandSlope'].iloc[df['LandSlope'] == 'Gtl'] = 1
df['LandSlope']=df['LandSlope'].astype(int)

df['HouseStyle'].iloc['1Story'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['1.5Fin'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['1.5Unf'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['2Story'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['2.5Fin'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['2.5Unf'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['SFoyer'== df['HouseStyle']] = 0
df['HouseStyle'].iloc['SLvl'== df['HouseStyle']] = 0
df['HouseStyle']=df['HouseStyle'].astype(int)

df['Condition1'].iloc['normal'!= df['Condition1']] = 0
df['Condition1'].iloc['normal'== df['Condition1']] = 1
df['Condition1']=df['Condition1'].astype(int)

df['Condition2'].iloc['normal'!= df['Condition1']] = 0
df['Condition2'].iloc['normal'== df['Condition1']] = 1
df['Condition2']=df['Condition2'].astype(int)

df['SaleCondition'].iloc['Normal'!= df['SaleCondition']] = 0
df['SaleCondition'].iloc['Normal'== df['SaleCondition']] = 1
df['SaleCondition']=df['SaleCondition'].astype(int)

df['PoolQC'].iloc[df['PoolQC'] == 'Ex'] = 5
df['PoolQC'].iloc[df['PoolQC'] == 'Gd'] = 4
df['PoolQC'].iloc[df['PoolQC'] == 'TA'] = 3
df['PoolQC'].iloc[df['PoolQC'] == 'Fa'] = 2
df['PoolQC'].iloc[df['PoolQC'] == 'Po'] = 1
df['PoolQC'].iloc[df['PoolQC'].isna()] = 0
df['PoolQC'] = df['PoolQC'].astype(int)

df['Fence'].iloc[df['Fence'].isna()] =0
df['Fence'].iloc[df['Fence'] == 'GdWo'] = 2
df['Fence'].iloc[df['Fence'] == 'MnPrv'] = 1
df['Fence'].iloc[df['Fence'] == 'GdPrv'] = 2
df['Fence'].iloc[df['Fence'] == 'MnWw'] = 1
df['Fence']=df['Fence'].astype(int)

df['GarageQual'].iloc[df['GarageQual'] == 'Ex'] = 5
df['GarageQual'].iloc[df['GarageQual'] == 'Gd'] = 4
df['GarageQual'].iloc[df['GarageQual'] == 'TA'] = 3
df['GarageQual'].iloc[df['GarageQual'] == 'Fa'] = 2
df['GarageQual'].iloc[df['GarageQual'] == 'Po'] = 1
df['GarageQual'].iloc[df['GarageQual'].isna()] = 0
df['GarageQual'] = df['GarageQual'].astype(int)

df['GarageCond'].iloc[df['GarageCond'] == 'Ex'] = 5
df['GarageCond'].iloc[df['GarageCond'] == 'Gd'] = 4
df['GarageCond'].iloc[df['GarageCond'] == 'TA'] = 3
df['GarageCond'].iloc[df['GarageCond'] == 'Fa'] = 2
df['GarageCond'].iloc[df['GarageCond'] == 'Po'] = 1
df['GarageCond'].iloc[df['GarageCond'].isna()] = 0
df['GarageCond'] = df['GarageCond'].astype(int)

df['PavedDrive'].iloc[df['PavedDrive'] == 'Y'] = 2
df['PavedDrive'].iloc[df['PavedDrive'] == 'P'] = 1
df['PavedDrive'].iloc[df['PavedDrive'] == 'N'] = 1
df['PavedDrive'] = df['PavedDrive'].astype(int)

df['HeatingQC'].iloc[df['HeatingQC'] == 'Ex'] = 5
df['HeatingQC'].iloc[df['HeatingQC'] == 'Gd'] = 4
df['HeatingQC'].iloc[df['HeatingQC'] == 'TA'] = 3
df['HeatingQC'].iloc[df['HeatingQC'] == 'Fa'] = 2
df['HeatingQC'].iloc[df['HeatingQC'] == 'Po'] = 1
df['HeatingQC'].iloc[df['HeatingQC'].isna()] = 0
df['HeatingQC'] = df['HeatingQC'].astype(int)

df['GarageType'].iloc[df['GarageType'].isna()] =0;
df['GarageType'].iloc[df['GarageType'] != 0] = 1
df['GarageType'] = df['GarageType'].astype(int)

df['GarageFinish'].iloc[df['GarageFinish'].isna()]=0
df['GarageFinish'].iloc[df['GarageFinish']!=0]=1
df['GarageFinish'] = df['GarageFinish'].astype(int)


df['KitchenQual'].iloc[df['KitchenQual'] == 'Ex'] = 5
df['KitchenQual'].iloc[df['KitchenQual'] == 'Gd'] = 4
df['KitchenQual'].iloc[df['KitchenQual'] == 'TA'] = 3
df['KitchenQual'].iloc[df['KitchenQual'] == 'Fa'] = 2
df['KitchenQual'].iloc[df['KitchenQual'] == 'Po'] = 1
df['KitchenQual'].iloc[df['KitchenQual'].isna()] = 0
df['KitchenQual'] = df['KitchenQual'].astype(int)

df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Ex'] = 5
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Gd'] = 4
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'TA'] = 3
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Fa'] = 2
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Po'] = 1
df['FireplaceQu'].iloc[df['FireplaceQu'].isna()] = 0
df['FireplaceQu'] = df['FireplaceQu'].astype(int)

df['BsmtCond'].iloc[df['BsmtCond'] == 'Ex'] = 5
df['BsmtCond'].iloc[df['BsmtCond'] == 'Gd'] = 4
df['BsmtCond'].iloc[df['BsmtCond'] == 'TA'] = 3
df['BsmtCond'].iloc[df['BsmtCond'] == 'Fa'] = 2
df['BsmtCond'].iloc[df['BsmtCond'] == 'Po'] = 1
df['BsmtCond'].iloc[df['BsmtCond'].isna()] = 0
df['BsmtCond'] = df['BsmtCond'].astype(int)

df['BsmtQual'].iloc[df['BsmtQual'] == 'Ex'] = 5
df['BsmtQual'].iloc[df['BsmtQual'] == 'Gd'] = 4
df['BsmtQual'].iloc[df['BsmtQual'] == 'TA'] = 3
df['BsmtQual'].iloc[df['BsmtQual'] == 'Fa'] = 2
df['BsmtQual'].iloc[df['BsmtQual'] == 'Po'] = 1
df['BsmtQual'].iloc[df['BsmtQual'].isna()] = 0
df['BsmtQual'] = df['BsmtQual'].astype(int)

df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Gd'] = 4
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Av'] = 3
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Mn'] = 2
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'No'] = 1
df['BsmtExposure'].iloc[df['BsmtExposure'].isna()] = 0
df['BsmtExposure'] = df['BsmtExposure'].astype(int)

df['Functional'].iloc[df['Functional'] == 'Typ'] = 8
df['Functional'].iloc[df['Functional'] == 'Min1'] = 7
df['Functional'].iloc[df['Functional'] == 'Min2'] = 6
df['Functional'].iloc[df['Functional'] == 'Mod'] = 5
df['Functional'].iloc[df['Functional'] == 'Maj1'] = 4
df['Functional'].iloc[df['Functional'] == 'Maj2'] = 3
df['Functional'].iloc[df['Functional'] == 'Sev'] = 2
df['Functional'].iloc[df['Functional'] == 'Sal'] = 1
df['Functional'] = df['Functional'].astype(int)

temp = pd.DataFrame(df.groupby(by = 'Neighborhood')['SalePrice'].mean())
temp  = temp.reset_index()
temp = temp.rename(columns = {'SalePrice':'Mean'})
temp = temp.sort_values(by='Mean').reset_index().drop(
'index', axis=1).reset_index().drop('Mean',axis=1)

df = pd.merge(temp,df,how='left',
			right_on='Neighborhood',
			left_on='Neighborhood')
df = df.drop("Neighborhood", axis=1)
df = df.rename(columns={'index':'Neighborhood'})
df = df.drop(['MiscFeature','LotConfig', 'Foundation', 'Electrical',
			'MasVnrType','BsmtFinType1', 'RoofStyle', 'Street',
			'Alley', 'MSZoning', 'SaleType','BsmtFinType2',
			'BldgType','Heating', 'Exterior2nd', 'Exterior1st', 
			'Condition1','Condition2','Id', 'RoofMatl'], axis=1)

Transformation

To insure an accurate analysis, I needed to confirmt that the variables where linear. I first investigated the dependent variable for linearization issues. To do this I used a Q-Q plot with a simple distribution of the dependent variable in this case property sales price.

The following result was obtained by using a log transformation on the dependent variable:

As we can see the QQ-plot seems to have an upwards curvature, this indicates that the data is not linear and and is skewed, this is confirmed when looking at the sale price distribution as it is a right skewed distribution. With this information, we can use a log transformation to transform the dependent variable to achieve a better fit towards a normal distribution.

This transformation significantly improved the normality of the sales price. For the time being I will not be investigating the linearity of any of the dependetn variables given the amount of variables that there are. That being said after obtaining a final linear model I will use the obtained variables and ensure that they are all linear.

Analysis

To begin the analysis I first decided to get more insight into the data to better understand the relationships between the dependent variable and the independent variables. To do this I used a correlation matrix that gives an initial understanding of which variables seem to be highly correlated with the sales price. Unsurprisingly the top 3 variables that seem to correlate with sales prices are: Overall quality, neighborhood, and Ground floor living area. These correlations will give us good insight when deciding which variables to keep in case of multi collinearity.

To reduce the number of variables in the final regression I will be using a stepwise backward selection process. Backward selection is done by starting with a regression that includes all variables and systematically removing variables that don't seem to contribute to the regression given their p-value. In this case, my threshold to keep a variable will be set at an alpha = 0.05. Once all variables are below an alpha of 0.05 the stepwise selection will terminate and produce the final model.

After performing a backward selection we got the following variables that seemed to be the most significant in the linear regression:


	OLS Regression Results
	  Dep. Variable:         SalePrice       R-squared (uncentered):          0.973 
	
	
	  Model:                    OLS          Adj. R-squared (uncentered):     0.972 
	
	
	  Method:              Least Squares     F-statistic:                    1716. 
	
	
	  Date:              Sun, 01 Aug 2021    Prob (F-statistic):              0.00  
	
	
	  Time:                  14:00:50        Log-Likelihood:               -13268. 
	
	
	  No. Observations:         1121         AIC:                         2.658e+04
	
	
	  Df Residuals:             1098         BIC:                         2.670e+04
	
	
	  Df Model:                   23                                               
	
	
	  Covariance Type:       nonrobust                                             
	
	
	
	
			           coef      std err       t       P>|t|   [0.025     0.975]  
	
	
	  Neighborhood   2308.1607    237.480      9.719   0.000   1842.194   2774.127
	
	
	  MSSubClass     -190.0565     30.090     -6.316   0.000   -249.097   -131.016
	
	
	  LotFrontage    -172.0544     54.509     -3.156   0.002   -279.008    -65.101
	
	
	  LotArea           0.4661      0.144      3.234   0.001      0.183      0.749
	
	
	  HouseStyle    -4743.8869   2343.042     -2.025   0.043  -9341.232   -146.542
	
	
	  OverallQual    1.078e+04   1451.791      7.428   0.000   7934.632   1.36e+04
	
	
	  OverallCond    5051.5530   1050.347      4.809   0.000   2990.639   7112.467
	
	
	  MasVnrArea       28.6283      6.294      4.549   0.000     16.279     40.977
	
	
	  ExterQual      9999.8740   3097.383      3.228   0.001   3922.415   1.61e+04
	
	
	  BsmtQual       8313.0791   2205.997      3.768   0.000   3984.634   1.26e+04
	
	
	  BsmtCond      -6206.5204   2708.728     -2.291   0.022  -1.15e+04   -891.653
	
	
	  BsmtExposure   5674.4909   1178.654      4.814   0.000   3361.822   7987.159
	
	
	  BsmtFinSF1       10.2277      3.208      3.188   0.001      3.934     16.522
	
	
	  GrLivArea        49.6567      4.416     11.244   0.000     40.992     58.322
	
	
	  BsmtFullBath   6130.6166   2691.107      2.278   0.023    850.323   1.14e+04
	
	
	  BedroomAbvGr  -5939.7076   1912.752     -3.105   0.002  -9692.770  -2186.645
	
	
	  KitchenAbvGr  -1.792e+04   5946.669     -3.013   0.003  -2.96e+04  -6249.616
	
	
	  KitchenQual    1.033e+04   2406.377      4.291   0.000   5603.935    1.5e+04
	
	
	  TotRmsAbvGrd   4492.2601   1346.548      3.336   0.001   1850.161   7134.359
	
	
	  GarageYrBlt     -72.0393      9.304     -7.743   0.000    -90.294    -53.785
	
	
	  GarageCars     1.446e+04   2185.427      6.618   0.000   1.02e+04   1.88e+04
	
	
	  GarageQual     9975.0846   3996.579      2.496   0.013   2133.290   1.78e+04
	
	
	  ScreenPorch      47.8903     17.782      2.693   0.007     12.999     82.782
	
	
	
	
	  Omnibus:        376.939    Durbin-Watson:         1.824 
	
	
	  Prob(Omnibus):   0.000     Jarque-Bera (JB):   49775.782
	
	
	  Skew:           -0.407     Prob(JB):               0.00 
	
	
	  Kurtosis:       35.634     Cond. No.           7.83e+04

OLS Regression Results
Dep. Variable:	SalePrice	R-squared (uncentered):	0.973
Model:	OLS	Adj. R-squared (uncentered):	0.972
Method:	Least Squares	F-statistic:	1716.
Date:	Sun, 01 Aug 2021	Prob (F-statistic):	0.00
Time:	14:00:50	Log-Likelihood:	-13268.
No. Observations:	1121	AIC:	2.658e+04
Df Residuals:	1098	BIC:	2.670e+04
Df Model:	23
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Neighborhood	2308.1607	237.480	9.719	0.000	1842.194	2774.127
MSSubClass	-190.0565	30.090	-6.316	0.000	-249.097	-131.016
LotFrontage	-172.0544	54.509	-3.156	0.002	-279.008	-65.101
LotArea	0.4661	0.144	3.234	0.001	0.183	0.749
HouseStyle	-4743.8869	2343.042	-2.025	0.043	-9341.232	-146.542
OverallQual	1.078e+04	1451.791	7.428	0.000	7934.632	1.36e+04
OverallCond	5051.5530	1050.347	4.809	0.000	2990.639	7112.467
MasVnrArea	28.6283	6.294	4.549	0.000	16.279	40.977
ExterQual	9999.8740	3097.383	3.228	0.001	3922.415	1.61e+04
BsmtQual	8313.0791	2205.997	3.768	0.000	3984.634	1.26e+04
BsmtCond	-6206.5204	2708.728	-2.291	0.022	-1.15e+04	-891.653
BsmtExposure	5674.4909	1178.654	4.814	0.000	3361.822	7987.159
BsmtFinSF1	10.2277	3.208	3.188	0.001	3.934	16.522
GrLivArea	49.6567	4.416	11.244	0.000	40.992	58.322
BsmtFullBath	6130.6166	2691.107	2.278	0.023	850.323	1.14e+04
BedroomAbvGr	-5939.7076	1912.752	-3.105	0.002	-9692.770	-2186.645
KitchenAbvGr	-1.792e+04	5946.669	-3.013	0.003	-2.96e+04	-6249.616
KitchenQual	1.033e+04	2406.377	4.291	0.000	5603.935	1.5e+04
TotRmsAbvGrd	4492.2601	1346.548	3.336	0.001	1850.161	7134.359
GarageYrBlt	-72.0393	9.304	-7.743	0.000	-90.294	-53.785
GarageCars	1.446e+04	2185.427	6.618	0.000	1.02e+04	1.88e+04
GarageQual	9975.0846	3996.579	2.496	0.013	2133.290	1.78e+04
ScreenPorch	47.8903	17.782	2.693	0.007	12.999	82.782

Omnibus:	376.939	Durbin-Watson:	1.824
Prob(Omnibus):	0.000	Jarque-Bera (JB):	49775.782
Skew:	-0.407	Prob(JB):	0.00
Kurtosis:	35.634	Cond. No.	7.83e+04

The summary of the linear regression gave us a total of 23 highly significant variables that are all below our alpha of 0.05 or 5%. Most of the obtained variables seem to be relatively straight forwards and have logical reasoning behind them such as living area. There are however some peculiar relationships that were uncovered by this analysis. The first that seemed out of place is the variable that accounts for a property having a screen porch. That being said, one possible reasoning behind this could be where the data was obtained(California). Another suspect relationship is the negative relationship lot frontage and sales price has. This seems counterintuitive but on further inspection could be a significant indicator of the size of the house itself. That being said it is still a relatively suspect relationship.

Possible Improvements

The analysis gave a great insight into the factors that seem to contribute most to the sales price of properties in California. That being said there are some things I would like to try to improve the analysis and its accuracy/relevance. The first alteration I would make would be to try and linearize the explanatory variables in the analysis to correct for any nonlinear variables. Additionally, I would like to do a more detailed analysis into the different neighborhoods and living areas and identify the reasonings of their effect on property sales prices.

📔 Overview

An investigation into real estate prices in california. Using a dataset found on kaggle I analyze the major factors that contribute to real estate prices in california.

⚙️ Tools

Python
Scipy
Matplotlib
Pandas
Numpy

🤹 Skills

Multiple Linear Regression, Data Cleaning, Data Transformation