Multiple Regression

In [6]:
import pandas as pd
%matplotlib inline
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
In [7]:
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()
Out [7]:
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1
In [15]:
df1 = df[['Mileage', 'Price']]

# 10K-mile bins; arange excludes its stop value, so these only cover up to 40K miles
bins = np.arange(0, 50000, 10000)

avgPricePerGroup = df1.groupby(pd.cut(df1['Mileage'], bins)).mean()
print(avgPricePerGroup.head())
avgPricePerGroup['Price'].plot.line()
                     Mileage         Price
Mileage                                   
(0, 10000]       5588.629630  24096.714451
(10000, 20000]  15898.496183  21955.979607
(20000, 30000]  24114.407104  20278.606252
(30000, 40000]  33610.338710  19463.670267
Out [15]:
<Axes: xlabel='Mileage'>
[line plot: average price declines steadily as the mileage bin increases]
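Note that np.arange never includes its stop value, so if you really want bins all the way up to 50K miles, push the stop just past 50000. A minimal sketch:

# go one past 50000 so arange yields a 50000 edge, giving a (40000, 50000] bin
bins = np.arange(0, 50001, 10000)
avgPricePerGroup = df1.groupby(pd.cut(df1['Mileage'], bins)).mean()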
In [18]:
scale = StandardScaler()

# extract 3 features to compare (copy, so we modify our own frame rather than a view of df)
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
# set the dependent variable
y = df['Price']

#
# SCALE the features' values
#
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

# Add a constant column to our model so we can have a Y-intercept
X = sm.add_constant(X)

print(X)

est = sm.OLS(y, X).fit()

print(est.summary())

est = sm.OLS(y, X).fit()

print(est.summary())
     const   Mileage  Cylinder     Doors
0      1.0 -1.417485   0.52741  0.556279
1      1.0 -1.305902   0.52741  0.556279
2      1.0 -0.810128   0.52741  0.556279
3      1.0 -0.426058   0.52741  0.556279
4      1.0  0.000008   0.52741  0.556279
..     ...       ...       ...       ...
799    1.0 -0.439853   0.52741  0.556279
800    1.0 -0.089966   0.52741  0.556279
801    1.0  0.079605   0.52741  0.556279
802    1.0  0.750446   0.52741  0.556279
803    1.0  1.932565   0.52741  0.556279

[804 rows x 4 columns]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.360
Model:                            OLS   Adj. R-squared:                  0.358
Method:                 Least Squares   F-statistic:                     150.0
Date:                Wed, 15 Jan 2025   Prob (F-statistic):           3.95e-77
Time:                        19:21:21   Log-Likelihood:                -8356.7
No. Observations:                 804   AIC:                         1.672e+04
Df Residuals:                     800   BIC:                         1.674e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.134e+04    279.405     76.388      0.000    2.08e+04    2.19e+04
Mileage    -1272.3412    279.567     -4.551      0.000   -1821.112    -723.571
Cylinder    5587.4472    279.527     19.989      0.000    5038.754    6136.140
Doors      -1404.5513    279.446     -5.026      0.000   -1953.085    -856.018
==============================================================================
Omnibus:                      157.913   Durbin-Watson:                   0.069
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529
Skew:                           1.278   Prob(JB):                     1.20e-56
Kurtosis:                       4.074   Cond. No.                         1.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
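The summary above is just a rendered report; the same numbers are available programmatically on the fitted results object. A minimal sketch using standard statsmodels attributes:

print(est.params)    # the fitted coefficients: const, Mileage, Cylinder, Doors
print(est.rsquared)  # 0.360 -- the fraction of price variance explained
print(est.pvalues)   # per-coefficient p-values, all effectively 0 here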

The table of coefficients above gives us the values to plug into an equation of the form Price = B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors. Because the features were standardized first, the coefficients are directly comparable as measures of influence (note the R-squared of 0.360, though: these three features only explain about 36% of the variance in price).

  • Cylinder has the strongest effect, with a positive coefficient of about 5,587
  • Mileage has a negative coefficient of about 1,272
  • Doors also has a negative coefficient, about 1,405; the mean-price comparison below confirms that 2-door cars actually sell for more
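To make that equation concrete, here is a minimal sketch that rebuilds a fitted value by hand from est.params and checks it against est.predict() (it assumes the X and est objects from the cell above):

# est.params holds the fitted coefficients: const, Mileage, Cylinder, Doors
b0, b1, b2, b3 = est.params

# plug the first row's (scaled) features into B0 + B1*Mileage + B2*Cylinder + B3*Doors
row = X.iloc[0]
manual = b0 + b1 * row['Mileage'] + b2 * row['Cylinder'] + b3 * row['Doors']

print(manual)                       # prediction computed by hand
print(est.predict(X.iloc[[0]])[0])  # should match statsmodels' own prediction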
In [19]:
# 
# Another, SIMPLER look: average price by door count
# 
y.groupby(df.Doors).mean()
Out [19]:
Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64
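Note that this agrees with the regression: 2-door cars average roughly $3,200 more than 4-door cars, which is exactly why the Doors coefficient came out negative.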

Using the "Model" for New Input

How would you use this to make an actual prediction? Start by scaling the new feature values with the same StandardScaler that was fit on the training data, add the constant column back in, and then call est.predict() on the result:

In [28]:
newCar = { "miles": 45000, "cyl": 8, "doors": 4 }

# transform with the SAME scaler that was fit on the training features
scaled = scale.transform([[newCar['miles'], newCar['cyl'], newCar['doors']]])
scaled = np.insert(scaled[0], 0, 1)  # need to add that constant column in again
print(f'scaled:{scaled}')

predicted = est.predict(scaled)
print(f'predicted price: {predicted[0]}')
scaled:[1.         3.07256589 1.96971667 0.55627894]
predicted price: 27658.157073156413
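If you're going to predict prices for many cars, it may be cleaner to wrap the scale-then-predict steps in a small helper. A minimal sketch, reusing the scale and est objects above (predict_price is just a hypothetical name for illustration):

def predict_price(mileage, cylinders, doors):
    """Scale raw feature values, prepend the constant, and predict with the fitted OLS model."""
    features = scale.transform([[mileage, cylinders, doors]])
    features = np.insert(features[0], 0, 1)  # constant column for the intercept
    return est.predict(features)[0]

print(predict_price(45000, 8, 4))  # same answer as the cell above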
Page Tags:
python
data-science
jupyter
learning
numpy