Investigating the Factors That Affect Real Estate Price's in California

Introduction

Housing prices are highly speculative. Expert price estimates tend to range greatly. The price of a house is seemingly dependent on so many variables that it can be difficult to extract what exactly is most important to look for. This analysis aims to find the primary variables that determine a house's sales price. The analysis will be run on a data set of approximately 1500 houses with over 60 different variables measured on each house. From this data set, my goal is to narrow down the important factors that affect a house's price.

The data used in this analysis is a Kaggle data set of the sales price of properties that includes 82 variables with 1459 observations. The data was curated for a Kaggle competition that investigates price estimation of homes depending on a variety of explanatory variables. The list of explanatory variables can be seen in the appendix.

My objective in this analysis is to identify the primary explanatory variables that affect the price of any given property given the explanatory variables. To achieve this I will be using multiple linear regression to obtain the important factors and end up with a model that can predict sales price with relatively high accuracy.

Method

The method I will use to estimate the property prices is multiple linear regression. However, the data needs to meet the assumptions to use multiple linear regression affectivly:

  • The dependent variable should be measured on a continuos scale
  • There must be a linear relationship between the dependent and independent variables
  • The variance of the residuals should be equal at each level of the explanatory variables
  • The errors must be independent from one another
  • The risiduals must be normally distributed
  • There should be no outliers or influential points
  • There should be no multi collinearity

There are several ways to insure tha the data meets the given assumptions and this will be seen in the analysis section. Given that all asspects are met we can use Multiple linear regression to analyse the data. Multiple linear regression is an extension of linear regression, as it uses two or more explanatory variables. The method aims to use the explanatory or independent variables to estimate the dependent variable. In this case, our dependent variable is the property's sale price and the explanatory variables are a collection of measurements and observations made on the property.

Data Cleaning

The data set had several different quantitative variable types including continuos, discrete and categorical variables. As a linear regression only deals with numeric values, the categorical data needed to be converted into a numeric format. To do this I cast all the categorical variables to a numeric value based on their nature. For example, the condition of the over all house had 5 categorical variables ranging from horrible condition to excelent condition. Therefore, I cast each category to a number betweek 1 and 5 from bad to good. That being said some of the data found in the dataset was unfortunately unusable given that it was descriptive by nature and was open to interpretation. To keep the analysis relativly simple I cut out these variables. I used the following code snippet to transform the categorical variables into numeric variables.

df = pd.read_csv('train.csv')
df['LotShape'].iloc[df['LotShape'] == 'Reg'] = 0
df['LotShape'].iloc[df['LotShape'] == 'IR1'] = 1
df['LotShape'].iloc[df['LotShape'] == 'IR2'] = 2
df['LotShape'].iloc[df['LotShape'] == 'IR3'] = 3
df['LotShape']=df['LotShape'].astype(int)


df['ExterQual'].iloc[df['ExterQual'] == 'Ex'] = 5
df['ExterQual'].iloc[df['ExterQual'] == 'Gd'] = 4
df['ExterQual'].iloc[df['ExterQual'] == 'TA'] = 3
df['ExterQual'].iloc[df['ExterQual'] == 'Fa'] = 2
df['ExterQual'].iloc[df['ExterQual'] == 'Po'] = 1
df['ExterQual']=df['ExterQual'].astype(int)

df['ExterCond'].iloc[df['ExterCond'] == 'Ex'] = 5
df['ExterCond'].iloc[df['ExterCond'] == 'Gd'] = 4
df['ExterCond'].iloc[df['ExterCond'] == 'TA'] = 3
df['ExterCond'].iloc[df['ExterCond'] == 'Fa'] = 2
df['ExterCond'].iloc[df['ExterCond'] == 'Po'] = 1
df['ExterCond']=df['ExterCond'].astype(int)

df['CentralAir'].iloc[df['CentralAir']== 'N']=0
df['CentralAir'].iloc[df['CentralAir']== 'Y']=1
df['CentralAir']=df['CentralAir'].astype(int)

df['LandContour'].iloc[df['LandContour'] != 'Lvl'] = 0
df['LandContour'].iloc[df['LandContour'] == 'Lvl'] = 1
df['LandContour']=df['LandContour'].astype(int)

df['Utilities'].iloc[df['Utilities'] != 'AllPub'] = 0
df['Utilities'].iloc[df['Utilities'] == 'AllPub'] = 1
df['Utilities']=df['Utilities'].astype(int)

df['LandSlope'].iloc[df['LandSlope'] != 'Gtl'] = 0
df['LandSlope'].iloc[df['LandSlope'] == 'Gtl'] = 1
df['LandSlope']=df['LandSlope'].astype(int)

df['HouseStyle'].iloc['1Story'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['1.5Fin'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['1.5Unf'== df['HouseStyle']] = 1
df['HouseStyle'].iloc['2Story'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['2.5Fin'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['2.5Unf'== df['HouseStyle']] = 2
df['HouseStyle'].iloc['SFoyer'== df['HouseStyle']] = 0
df['HouseStyle'].iloc['SLvl'== df['HouseStyle']] = 0
df['HouseStyle']=df['HouseStyle'].astype(int)

df['Condition1'].iloc['normal'!= df['Condition1']] = 0
df['Condition1'].iloc['normal'== df['Condition1']] = 1
df['Condition1']=df['Condition1'].astype(int)

df['Condition2'].iloc['normal'!= df['Condition1']] = 0
df['Condition2'].iloc['normal'== df['Condition1']] = 1
df['Condition2']=df['Condition2'].astype(int)

df['SaleCondition'].iloc['Normal'!= df['SaleCondition']] = 0
df['SaleCondition'].iloc['Normal'== df['SaleCondition']] = 1
df['SaleCondition']=df['SaleCondition'].astype(int)

df['PoolQC'].iloc[df['PoolQC'] == 'Ex'] = 5
df['PoolQC'].iloc[df['PoolQC'] == 'Gd'] = 4
df['PoolQC'].iloc[df['PoolQC'] == 'TA'] = 3
df['PoolQC'].iloc[df['PoolQC'] == 'Fa'] = 2
df['PoolQC'].iloc[df['PoolQC'] == 'Po'] = 1
df['PoolQC'].iloc[df['PoolQC'].isna()] = 0
df['PoolQC'] = df['PoolQC'].astype(int)

df['Fence'].iloc[df['Fence'].isna()] =0
df['Fence'].iloc[df['Fence'] == 'GdWo'] = 2
df['Fence'].iloc[df['Fence'] == 'MnPrv'] = 1
df['Fence'].iloc[df['Fence'] == 'GdPrv'] = 2
df['Fence'].iloc[df['Fence'] == 'MnWw'] = 1
df['Fence']=df['Fence'].astype(int)

df['GarageQual'].iloc[df['GarageQual'] == 'Ex'] = 5
df['GarageQual'].iloc[df['GarageQual'] == 'Gd'] = 4
df['GarageQual'].iloc[df['GarageQual'] == 'TA'] = 3
df['GarageQual'].iloc[df['GarageQual'] == 'Fa'] = 2
df['GarageQual'].iloc[df['GarageQual'] == 'Po'] = 1
df['GarageQual'].iloc[df['GarageQual'].isna()] = 0
df['GarageQual'] = df['GarageQual'].astype(int)

df['GarageCond'].iloc[df['GarageCond'] == 'Ex'] = 5
df['GarageCond'].iloc[df['GarageCond'] == 'Gd'] = 4
df['GarageCond'].iloc[df['GarageCond'] == 'TA'] = 3
df['GarageCond'].iloc[df['GarageCond'] == 'Fa'] = 2
df['GarageCond'].iloc[df['GarageCond'] == 'Po'] = 1
df['GarageCond'].iloc[df['GarageCond'].isna()] = 0
df['GarageCond'] = df['GarageCond'].astype(int)

df['PavedDrive'].iloc[df['PavedDrive'] == 'Y'] = 2
df['PavedDrive'].iloc[df['PavedDrive'] == 'P'] = 1
df['PavedDrive'].iloc[df['PavedDrive'] == 'N'] = 1
df['PavedDrive'] = df['PavedDrive'].astype(int)

df['HeatingQC'].iloc[df['HeatingQC'] == 'Ex'] = 5
df['HeatingQC'].iloc[df['HeatingQC'] == 'Gd'] = 4
df['HeatingQC'].iloc[df['HeatingQC'] == 'TA'] = 3
df['HeatingQC'].iloc[df['HeatingQC'] == 'Fa'] = 2
df['HeatingQC'].iloc[df['HeatingQC'] == 'Po'] = 1
df['HeatingQC'].iloc[df['HeatingQC'].isna()] = 0
df['HeatingQC'] = df['HeatingQC'].astype(int)

df['GarageType'].iloc[df['GarageType'].isna()] =0;
df['GarageType'].iloc[df['GarageType'] != 0] = 1
df['GarageType'] = df['GarageType'].astype(int)

df['GarageFinish'].iloc[df['GarageFinish'].isna()]=0
df['GarageFinish'].iloc[df['GarageFinish']!=0]=1
df['GarageFinish'] = df['GarageFinish'].astype(int)


df['KitchenQual'].iloc[df['KitchenQual'] == 'Ex'] = 5
df['KitchenQual'].iloc[df['KitchenQual'] == 'Gd'] = 4
df['KitchenQual'].iloc[df['KitchenQual'] == 'TA'] = 3
df['KitchenQual'].iloc[df['KitchenQual'] == 'Fa'] = 2
df['KitchenQual'].iloc[df['KitchenQual'] == 'Po'] = 1
df['KitchenQual'].iloc[df['KitchenQual'].isna()] = 0
df['KitchenQual'] = df['KitchenQual'].astype(int)

df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Ex'] = 5
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Gd'] = 4
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'TA'] = 3
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Fa'] = 2
df['FireplaceQu'].iloc[df['FireplaceQu'] == 'Po'] = 1
df['FireplaceQu'].iloc[df['FireplaceQu'].isna()] = 0
df['FireplaceQu'] = df['FireplaceQu'].astype(int)

df['BsmtCond'].iloc[df['BsmtCond'] == 'Ex'] = 5
df['BsmtCond'].iloc[df['BsmtCond'] == 'Gd'] = 4
df['BsmtCond'].iloc[df['BsmtCond'] == 'TA'] = 3
df['BsmtCond'].iloc[df['BsmtCond'] == 'Fa'] = 2
df['BsmtCond'].iloc[df['BsmtCond'] == 'Po'] = 1
df['BsmtCond'].iloc[df['BsmtCond'].isna()] = 0
df['BsmtCond'] = df['BsmtCond'].astype(int)

df['BsmtQual'].iloc[df['BsmtQual'] == 'Ex'] = 5
df['BsmtQual'].iloc[df['BsmtQual'] == 'Gd'] = 4
df['BsmtQual'].iloc[df['BsmtQual'] == 'TA'] = 3
df['BsmtQual'].iloc[df['BsmtQual'] == 'Fa'] = 2
df['BsmtQual'].iloc[df['BsmtQual'] == 'Po'] = 1
df['BsmtQual'].iloc[df['BsmtQual'].isna()] = 0
df['BsmtQual'] = df['BsmtQual'].astype(int)

df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Gd'] = 4
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Av'] = 3
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'Mn'] = 2
df['BsmtExposure'].iloc[df['BsmtExposure'] == 'No'] = 1
df['BsmtExposure'].iloc[df['BsmtExposure'].isna()] = 0
df['BsmtExposure'] = df['BsmtExposure'].astype(int)

df['Functional'].iloc[df['Functional'] == 'Typ'] = 8
df['Functional'].iloc[df['Functional'] == 'Min1'] = 7
df['Functional'].iloc[df['Functional'] == 'Min2'] = 6
df['Functional'].iloc[df['Functional'] == 'Mod'] = 5
df['Functional'].iloc[df['Functional'] == 'Maj1'] = 4
df['Functional'].iloc[df['Functional'] == 'Maj2'] = 3
df['Functional'].iloc[df['Functional'] == 'Sev'] = 2
df['Functional'].iloc[df['Functional'] == 'Sal'] = 1
df['Functional'] = df['Functional'].astype(int)

temp = pd.DataFrame(df.groupby(by = 'Neighborhood')['SalePrice'].mean())
temp  = temp.reset_index()
temp = temp.rename(columns = {'SalePrice':'Mean'})
temp = temp.sort_values(by='Mean').reset_index().drop(
'index', axis=1).reset_index().drop('Mean',axis=1)

df = pd.merge(temp,df,how='left',
			right_on='Neighborhood',
			left_on='Neighborhood')
df = df.drop("Neighborhood", axis=1)
df = df.rename(columns={'index':'Neighborhood'})
df = df.drop(['MiscFeature','LotConfig', 'Foundation', 'Electrical',
			'MasVnrType','BsmtFinType1', 'RoofStyle', 'Street',
			'Alley', 'MSZoning', 'SaleType','BsmtFinType2',
			'BldgType','Heating', 'Exterior2nd', 'Exterior1st', 
			'Condition1','Condition2','Id', 'RoofMatl'], axis=1)

Transformation

To insure an accurate analysis, I needed to confirmt that the variables where linear. I first investigated the dependent variable for linearization issues. To do this I used a Q-Q plot with a simple distribution of the dependent variable in this case property sales price.

The following result was obtained by using a log transformation on the dependent variable:

As we can see the QQ-plot seems to have an upwards curvature, this indicates that the data is not linear and and is skewed, this is confirmed when looking at the sale price distribution as it is a right skewed distribution. With this information, we can use a log transformation to transform the dependent variable to achieve a better fit towards a normal distribution.

This transformation significantly improved the normality of the sales price. For the time being I will not be investigating the linearity of any of the dependetn variables given the amount of variables that there are. That being said after obtaining a final linear model I will use the obtained variables and ensure that they are all linear.

Analysis

To begin the analysis I first decided to get more insight into the data to better understand the relationships between the dependent variable and the independent variables. To do this I used a correlation matrix that gives an initial understanding of which variables seem to be highly correlated with the sales price. Unsurprisingly the top 3 variables that seem to correlate with sales prices are: Overall quality, neighborhood, and Ground floor living area. These correlations will give us good insight when deciding which variables to keep in case of multi collinearity.

To reduce the number of variables in the final regression I will be using a stepwise backward selection process. Backward selection is done by starting with a regression that includes all variables and systematically removing variables that don't seem to contribute to the regression given their p-value. In this case, my threshold to keep a variable will be set at an alpha = 0.05. Once all variables are below an alpha of 0.05 the stepwise selection will terminate and produce the final model.

After performing a backward selection we got the following variables that seemed to be the most significant in the linear regression:


The summary of the linear regression gave us a total of 23 highly significant variables that are all below our alpha of 0.05 or 5%. Most of the obtained variables seem to be relatively straight forwards and have logical reasoning behind them such as living area. There are however some peculiar relationships that were uncovered by this analysis. The first that seemed out of place is the variable that accounts for a property having a screen porch. That being said, one possible reasoning behind this could be where the data was obtained(California). Another suspect relationship is the negative relationship lot frontage and sales price has. This seems counterintuitive but on further inspection could be a significant indicator of the size of the house itself. That being said it is still a relatively suspect relationship.

Possible Improvements

The analysis gave a great insight into the factors that seem to contribute most to the sales price of properties in California. That being said there are some things I would like to try to improve the analysis and its accuracy/relevance. The first alteration I would make would be to try and linearize the explanatory variables in the analysis to correct for any nonlinear variables. Additionally, I would like to do a more detailed analysis into the different neighborhoods and living areas and identify the reasonings of their effect on property sales prices.

📔 Overview
An investigation into real estate prices in california. Using a dataset found on kaggle I analyze the major factors that contribute to real estate prices in california.
⚙️ Tools
  • Python
  • Scipy
  • Matplotlib
  • Pandas
  • Numpy
🤹 Skills
Multiple Linear Regression, Data Cleaning, Data Transformation