Predicting Boston Housing Prices: Linear Regression Modeling
- Introduction
- Data Loading & Preview
- Exploratory Data Analysis
- Bivariate Analysis
- Linear Regression Model Development
- Setup For Predictions
Predicting Boston Housing Prices
Introduction
Problem Statement
The goal is to predict housing prices in Boston from the features provided in the dataset.
Data Summary
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if next to river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per 10,000 dollars
- PTRATIO: pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in 1000 dollars
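Before modeling, it can help to confirm that a loaded file actually contains the columns described above. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the one-row frame is made-up data standing in for the real CSV.

```python
import pandas as pd

# Column names taken from the data summary above.
EXPECTED_COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "LSTAT", "MEDV",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Tiny made-up frame standing in for the real CSV; MEDV is left out on purpose.
demo = pd.DataFrame({name: [0.0] for name in EXPECTED_COLUMNS if name != "MEDV"})
print(missing_columns(demo))
```

An empty list means every documented column is present; anything else names the columns to investigate before going further.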
Dependencies
In [15]:
%load_ext nb_black
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
In [4]:
df = pd.read_csv("boston.csv")
df.head()
In [6]:
rowCount, colCount = df.shape
print(f'{rowCount} rows')
print(f'{colCount} columns')
In [7]:
df.info()
In [8]:
df.describe()
In [16]:
# defining the figure size
plt.figure(figsize=(15, 10))
features = df.columns
for i, feature in enumerate(features):
    plt.subplot(4, 4, i + 1)
    sns.histplot(data=df, x=feature)
plt.tight_layout()
plt.show()
In [22]:
plt.figure(figsize=(15, 10))
features = df.columns
for i, feature in enumerate(features):
    plt.subplot(4, 4, i + 1)
    sns.scatterplot(data=df, x=feature, y="MEDV")
plt.tight_layout()
plt.show()
In [23]:
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
In [39]:
sns.pairplot(df);
- NOX and TAX show a fairly strong positive linear relationship with INDUS, while DIS shows a fairly strong negative linear relationship with INDUS.
- NOX shows a fairly strong positive linear relationship with AGE, while DIS shows a fairly strong negative linear relationship with AGE.
- RM shows a fairly strong positive linear relationship with MEDV, while LSTAT shows a fairly strong negative linear relationship with MEDV.
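Observations like these can also be pulled from the correlation matrix programmatically instead of being read off the heatmap. The sketch below is one possible way to do it, on synthetic data rather than the Boston values; the helper name `strongest_pairs` and the 0.6 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame, threshold=0.6):
    """List column pairs whose absolute Pearson correlation meets `threshold`."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    # Sort by strength, strongest first, regardless of sign.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data, not the Boston values.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.1, size=200),  # strongly positive with A
    "C": -a + rng.normal(scale=0.1, size=200),     # strongly negative with A
    "D": rng.normal(size=200),                     # unrelated noise
})
print(strongest_pairs(demo))
```

On the notebook's data the same call would be `strongest_pairs(df)`; the threshold controls how strong a relationship must be before it is listed.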
In [24]:
X = df.drop("MEDV", axis=1)
y = df["MEDV"]
In [25]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
In [26]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
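The cells that follow score the fitted model with R-squared and RMSE. As a reference for how those two numbers are defined, here is a small worked sketch on made-up targets and predictions (not model output), using only NumPy.

```python
import numpy as np

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_pred
sse = np.sum(residuals ** 2)                  # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - sse / sst                     # share of variation explained
rmse = np.sqrt(sse / len(y_true))             # typical error, in the target's units

print("SSE:", sse)
print("SST:", sst)
print("R-squared:", r_squared)
print("RMSE:", rmse)
```

Here SSE is 26 and SST is 500, so R-squared is 1 - 26/500 = 0.948, and RMSE is the square root of 26/4 = 6.5. Because MEDV is measured in 1000 dollars, an RMSE on this dataset reads directly as thousands of dollars.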
In [27]:
print(
    "The score (R-squared) on the training set is ",
    regression_model.score(X_train, y_train),
)
In [29]:
# R-squared = 1 - SSE / SST
# SST: total sum of squares (variation of y around its mean)
# SSE: sum of squared errors (variation left unexplained by the model)
def r_squared(model, X, y):
    y_mean = y.mean()
    SST = ((y - y_mean) ** 2).sum()
    SSE = ((y - model.predict(X)) ** 2).sum()
    r_square = 1 - SSE / SST
    return SSE, SST, r_square


SSE, SST, r_square = r_squared(regression_model, X_train, y_train)
print("SSE: ", SSE)
print("SST: ", SST)
print("R-squared: ", r_square)
In [31]:
print(
    "The score (R-squared) on the test set is ", regression_model.score(X_test, y_test)
)
In [33]:
print(
    "The Root Mean Square Error (RMSE) of the model for the training set is ",
    np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train))),
)
In [34]:
print(
    "The Root Mean Square Error (RMSE) of the model for the test set is ",
    np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test))),
)
In [35]:
coef_df = pd.DataFrame(
    np.append(regression_model.coef_, regression_model.intercept_),
    index=X_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df
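The coefficients and intercept fully describe the fitted model: a prediction is the intercept plus each feature value multiplied by its coefficient. The sketch below checks that on synthetic, noise-free data with a known relationship; the two column names are borrowed from the data summary, but the values and the relationship itself are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 5 + 3*RM - 2*LSTAT (no noise).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "RM": rng.uniform(4, 8, size=50),
    "LSTAT": rng.uniform(2, 30, size=50),
})
y_demo = 5 + 3 * X_demo["RM"] - 2 * X_demo["LSTAT"]

model = LinearRegression().fit(X_demo, y_demo)

# Apply the equation by hand: intercept plus the dot product of features and coefficients.
manual = model.intercept_ + X_demo.to_numpy() @ model.coef_

print(dict(zip(X_demo.columns, model.coef_.round(6))), round(model.intercept_, 6))
print(np.allclose(manual, model.predict(X_demo)))
```

The hand-computed values should agree with `model.predict`, which is what makes it meaningful to print the model as a single equation.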
In [37]:
Equation = "Price = " + str(regression_model.intercept_)
print(Equation, end=" ")
for i in range(len(X_train.columns)):
    if i != len(X_train.columns) - 1:
        print(
            "+ (",
            regression_model.coef_[i],
            ")*(",
            X_train.columns[i],
            ")",
            end=" ",
        )
    else:
        print("+ (", regression_model.coef_[i], ")*(", X_train.columns[i], ")")
Setup For Predictions
The model is already set up to make predictions on new data. Calling regression_model.predict() with a row of data (passed as a two-dimensional, single-row input) returns a prediction for that row. The new data must have the same columns, in the same order, as the data the model was trained on.
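A minimal sketch of that workflow, assuming the new observation arrives as a dictionary keyed by column name. The model here is trained on synthetic data with three columns borrowed from the data summary, so the names `X_train_demo`, `y_train_demo` and `new_row`, and all of the values, are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data; the values and the relationship are made up.
rng = np.random.default_rng(2)
X_train_demo = pd.DataFrame({
    "RM": rng.uniform(4, 8, size=60),
    "LSTAT": rng.uniform(2, 30, size=60),
    "PTRATIO": rng.uniform(12, 22, size=60),
})
y_train_demo = (
    10
    + 4 * X_train_demo["RM"]
    - 0.5 * X_train_demo["LSTAT"]
    - 0.3 * X_train_demo["PTRATIO"]
)
model = LinearRegression().fit(X_train_demo, y_train_demo)

# A new observation supplied with its columns in a different order.
new_row = pd.DataFrame([{"PTRATIO": 18.0, "LSTAT": 12.0, "RM": 6.0}])

# Reorder to match the training columns before predicting;
# a column missing from new_row raises a KeyError here.
new_row = new_row[list(X_train_demo.columns)]

prediction = model.predict(new_row)
print(prediction.shape)
```

With the notebook's model the equivalent step would be `regression_model.predict(new_row[list(X_train.columns)])`. Selecting by the training columns guards against silently feeding values into the wrong coefficients.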