Multiple Regression

In [6]:
import pandas as pd
%matplotlib inline
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
In [7]:
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()
Out [7]:
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1
In [15]:
df1 = df[['Mileage', 'Price']]

# 10K-mile bins; arange excludes its stop value, so these only cover up to 40K miles
bins = np.arange(0, 50000, 10000)

avgPricePerGroup = df1.groupby(pd.cut(df1['Mileage'], bins)).mean()
print(avgPricePerGroup.head())
avgPricePerGroup['Price'].plot.line()
                     Mileage         Price
Mileage                                   
(0, 10000]       5588.629630  24096.714451
(10000, 20000]  15898.496183  21955.979607
(20000, 30000]  24114.407104  20278.606252
(30000, 40000]  33610.338710  19463.670267
Out [15]:
<Axes: xlabel='Mileage'>
[line plot: average price declines steadily as the mileage bin increases]
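Note that np.arange never includes its stop value, so if you really want bins all the way up to 50K miles, push the stop just past 50000. A minimal sketch:

# go one past 50000 so arange yields a 50000 edge, giving a (40000, 50000] bin
bins = np.arange(0, 50001, 10000)
avgPricePerGroup = df1.groupby(pd.cut(df1['Mileage'], bins)).mean()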
In [18]:
scale = StandardScaler()

# extract 3 features to compare (copy, so we modify our own frame rather than a view of df)
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
# set the dependent variable
y = df['Price']

#
# SCALE the features' values
#
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

# Add a constant column to our model so we can have a Y-intercept
X = sm.add_constant(X)

print(X)

est = sm.OLS(y, X).fit()

print(est.summary())

est = sm.OLS(y, X).fit()

print(est.summary())
     const   Mileage  Cylinder     Doors
0      1.0 -1.417485   0.52741  0.556279
1      1.0 -1.305902   0.52741  0.556279
2      1.0 -0.810128   0.52741  0.556279
3      1.0 -0.426058   0.52741  0.556279
4      1.0  0.000008   0.52741  0.556279
..     ...       ...       ...       ...
799    1.0 -0.439853   0.52741  0.556279
800    1.0 -0.089966   0.52741  0.556279
801    1.0  0.079605   0.52741  0.556279
802    1.0  0.750446   0.52741  0.556279
803    1.0  1.932565   0.52741  0.556279

[804 rows x 4 columns]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.360
Model:                            OLS   Adj. R-squared:                  0.358
Method:                 Least Squares   F-statistic:                     150.0
Date:                Wed, 15 Jan 2025   Prob (F-statistic):           3.95e-77
Time:                        19:21:21   Log-Likelihood:                -8356.7
No. Observations:                 804   AIC:                         1.672e+04
Df Residuals:                     800   BIC:                         1.674e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.134e+04    279.405     76.388      0.000    2.08e+04    2.19e+04
Mileage    -1272.3412    279.567     -4.551      0.000   -1821.112    -723.571
Cylinder    5587.4472    279.527     19.989      0.000    5038.754    6136.140
Doors      -1404.5513    279.446     -5.026      0.000   -1953.085    -856.018
==============================================================================
Omnibus:                      157.913   Durbin-Watson:                   0.069
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529
Skew:                           1.278   Prob(JB):                     1.20e-56
Kurtosis:                       4.074   Cond. No.                         1.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
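The summary above is just a rendered report; the same numbers are available programmatically on the fitted results object. A minimal sketch using standard statsmodels attributes:

print(est.params)    # the fitted coefficients: const, Mileage, Cylinder, Doors
print(est.rsquared)  # 0.360 -- the fraction of price variance explained
print(est.pvalues)   # per-coefficient p-values, all effectively 0 here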

The table of coefficients above gives us the values to plug into an equation of the form Price = B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors. Because the features were standardized first, the coefficients are directly comparable as measures of influence (note the R-squared of 0.360, though: these three features only explain about 36% of the variance in price).

  • Cylinder has the strongest effect, with a positive coefficient of about 5,587
  • Mileage has a negative coefficient of about 1,272
  • Doors also has a negative coefficient, about 1,405; the mean-price comparison below confirms that 2-door cars actually sell for more
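To make that equation concrete, here is a minimal sketch that rebuilds a fitted value by hand from est.params and checks it against est.predict() (it assumes the X and est objects from the cell above):

# est.params holds the fitted coefficients: const, Mileage, Cylinder, Doors
b0, b1, b2, b3 = est.params

# plug the first row's (scaled) features into B0 + B1*Mileage + B2*Cylinder + B3*Doors
row = X.iloc[0]
manual = b0 + b1 * row['Mileage'] + b2 * row['Cylinder'] + b3 * row['Doors']

print(manual)                       # prediction computed by hand
print(est.predict(X.iloc[[0]])[0])  # should match statsmodels' own prediction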
In [19]:
# 
# Another, SIMPLER look: average price by door count
# 
y.groupby(df.Doors).mean()
Out [19]:
Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64
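Note that this agrees with the regression: 2-door cars average roughly $3,200 more than 4-door cars, which is exactly why the Doors coefficient came out negative.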

Using the "Model" for New Input

How would you use this to make an actual prediction? Start by scaling the new feature values with the same StandardScaler that was fit on the training data, add the constant column back in, and then call est.predict() on the result:

In [28]:
newCar = { "miles": 45000, "cyl": 8, "doors": 4 }

# transform with the SAME scaler that was fit on the training features
scaled = scale.transform([[newCar['miles'], newCar['cyl'], newCar['doors']]])
scaled = np.insert(scaled[0], 0, 1)  # need to add that constant column in again
print(f'scaled:{scaled}')

predicted = est.predict(scaled)
print(f'predicted price: {predicted[0]}')
scaled:[1.         3.07256589 1.96971667 0.55627894]
predicted price: 27658.157073156413
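If you're going to predict prices for many cars, it may be cleaner to wrap the scale-then-predict steps in a small helper. A minimal sketch, reusing the scale and est objects above (predict_price is just a hypothetical name for illustration):

def predict_price(mileage, cylinders, doors):
    """Scale raw feature values, prepend the constant, and predict with the fitted OLS model."""
    features = scale.transform([[mileage, cylinders, doors]])
    features = np.insert(features[0], 0, 1)  # constant column for the intercept
    return est.predict(features)[0]

print(predict_price(45000, 8, 4))  # same answer as the cell above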
Page Tags:
python
data-science
jupyter
learning
numpy