Modeling And Wrangling

Here, Python libraries are used to wrangle some data before building a model:

  • Download data from the internet: pandas read_csv accepts a URL that returns a CSV
  • Preview the downloaded data with the pandas head() method
  • Convert categorical values to numbers with one-hot encoding (pandas get_dummies here; sklearn MinMaxScaler and OneHotEncoder are imported as an alternative)
  • Split data into training & testing sets with sklearn train_test_split

In this example, medical insurance data is downloaded, wrangled, and used to build a machine-learning model that predicts insurance costs from age, sex, BMI, number of children, smoking status, and residential region.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

Download Some Data

In [2]:
dataUrl = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
dataFromWeb = pd.read_csv(dataUrl)
dataFromWeb.head()
Out [2]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

Create Number-Only Data with One-Hot-Encoding

In [3]:
oheEncodedData = pd.get_dummies(dataFromWeb)
oheEncodedData.head()
Out [3]:
age bmi children charges sex_female sex_male smoker_no smoker_yes region_northeast region_northwest region_southeast region_southwest
0 19 27.900 0 16884.92400 True False False True False False False True
1 18 33.770 1 1725.55230 False True True False False False True False
2 28 33.000 3 4449.46200 False True True False False False True False
3 33 22.705 0 21984.47061 False True True False False True False False
4 32 28.880 0 3866.85520 False True True False False True False False
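
The introduction lists sklearn's MinMaxScaler and OneHotEncoder, which are imported in the first cell. As a minimal sketch of how they might be combined with make_column_transformer, the block below uses five rows copied from the preview above. The variable names, the choice of which rows to fit on, and sparse_threshold=0 (which forces a dense array) are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Five rows copied from the data preview (the charges column is left out)
raw = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Scale numeric columns to the 0..1 range and one-hot encode text columns
column_transformer = make_column_transformer(
    (MinMaxScaler(), ["age", "bmi", "children"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"]),
    sparse_threshold=0,  # assumption: always return a dense NumPy array
)

# Fit on the rows treated as training data, then reuse the fitted transformer
train_rows, test_rows = raw.iloc[:4], raw.iloc[4:]
column_transformer.fit(train_rows)
scaled_train = column_transformer.transform(train_rows)
scaled_test = column_transformer.transform(test_rows)

print(scaled_train.shape, scaled_test.shape)
```

Fitting on the training rows only keeps the minimum and maximum used for scaling from depending on the test rows.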

Split Data: Features & Labels

The charges column represents the dependent variable here: the labels.
All the other columns represent independent variables: the features.

In [4]:
labelField = 'charges'
featureData = oheEncodedData.drop(labelField, axis=1)
labelData = oheEncodedData[labelField]
featureData.head()
Out [4]:
age bmi children sex_female sex_male smoker_no smoker_yes region_northeast region_northwest region_southeast region_southwest
0 19 27.900 0 True False False True False False False True
1 18 33.770 1 False True True False False False True False
2 28 33.000 3 False True True False False False True False
3 33 22.705 0 False True True False False True False False
4 32 28.880 0 False True True False False True False False

Split Data: Train & Test

In [5]:
testDataPercentage = .2 # how much of our data should we use for "testing"
randomVal = 42
feature_training_data, feature_testing_data, label_training_data, label_testing_data = train_test_split(featureData, 
                                                    labelData, 
                                                    test_size=testDataPercentage, 
                                                    random_state=randomVal) # set random state for reproducible splits
In [6]:
feature_training_data.head()
Out [6]:
age bmi children sex_female sex_male smoker_no smoker_yes region_northeast region_northwest region_southeast region_southwest
560 46 19.95 2 True False True False False True False False
1285 47 24.32 0 True False True False True False False False
1142 52 24.86 0 True False True False False False True False
969 39 34.32 5 True False True False False False True False
486 54 21.47 3 True False True False False True False False
In [7]:
label_training_data.head()
Out [7]:
560      9193.83850
1285     8534.67180
1142    27117.99378
969      8596.82780
486     12475.35130
Name: charges, dtype: float64
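
The comment about random_state in the split cell can be checked on a toy array. The sketch below, which uses made-up data rather than the insurance data, splits the same arrays twice with the same random_state and confirms that both splits match and that test_size=0.2 holds out 20% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with 2 features each, plus 10 labels
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# Split twice with the same seed
first = train_test_split(features, labels, test_size=0.2, random_state=42)
second = train_test_split(features, labels, test_size=0.2, random_state=42)

# Each call returns (train features, test features, train labels, test labels)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
train_rows, test_rows = len(first[0]), len(first[1])

print(same_split, train_rows, test_rows)
```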

Create A Model

In [8]:
epochCount = 100
# Set random seed
tf.random.set_seed(randomVal)

# layers
denseLayer = tf.keras.layers.Dense(1)
# Create a new model
insurance_model = tf.keras.Sequential()
# Note: the same Dense(1) layer object is added twice below
insurance_model.add(denseLayer)
insurance_model.add(denseLayer)

# Compile the model
insurance_model.compile(loss=tf.keras.losses.mae,
                        optimizer=tf.keras.optimizers.SGD(),
                        metrics=['mae'])

# cast features and labels to float32 to prevent a data type error during training
feature_training_data=feature_training_data.astype(np.float32)
label_training_data=label_training_data.astype(np.float32)
feature_testing_data=feature_testing_data.astype(np.float32)
label_testing_data=label_testing_data.astype(np.float32)


# Fit the model
# save output to a variable
modelHistory = insurance_model.fit(feature_training_data, label_training_data, epochs=epochCount)
Epoch 1/100
34/34 [==============================] - 1s 5ms/step - loss: 12929.0977 - mae: 12929.0977
Epoch 2/100
34/34 [==============================] - 0s 5ms/step - loss: 12084.2998 - mae: 12084.2998
Epoch 3/100
34/34 [==============================] - 0s 4ms/step - loss: 11257.6836 - mae: 11257.6836
Epoch 4/100
34/34 [==============================] - 0s 5ms/step - loss: 10501.6211 - mae: 10501.6211
Epoch 5/100
34/34 [==============================] - 0s 4ms/step - loss: 9854.5127 - mae: 9854.5127
Epoch 6/100
34/34 [==============================] - 0s 4ms/step - loss: 9307.1348 - mae: 9307.1348
Epoch 7/100
34/34 [==============================] - 0s 4ms/step - loss: 8834.4043 - mae: 8834.4043
Epoch 8/100
34/34 [==============================] - 0s 4ms/step - loss: 8449.6650 - mae: 8449.6650
Epoch 9/100
34/34 [==============================] - 0s 4ms/step - loss: 8144.5552 - mae: 8144.5552
Epoch 10/100
34/34 [==============================] - 0s 4ms/step - loss: 7902.4521 - mae: 7902.4521
Epoch 11/100
34/34 [==============================] - 0s 4ms/step - loss: 7714.7314 - mae: 7714.7314
Epoch 12/100
34/34 [==============================] - 0s 4ms/step - loss: 7582.1240 - mae: 7582.1240
Epoch 13/100
34/34 [==============================] - 0s 4ms/step - loss: 7493.8091 - mae: 7493.8091
Epoch 14/100
34/34 [==============================] - 0s 4ms/step - loss: 7433.5542 - mae: 7433.5542
Epoch 15/100
34/34 [==============================] - 0s 4ms/step - loss: 7393.4766 - mae: 7393.4766
Epoch 16/100
34/34 [==============================] - 0s 5ms/step - loss: 7362.8184 - mae: 7362.8184
Epoch 17/100
34/34 [==============================] - 0s 4ms/step - loss: 7341.7515 - mae: 7341.7515
Epoch 18/100
34/34 [==============================] - 0s 5ms/step - loss: 7326.7085 - mae: 7326.7085
Epoch 19/100
34/34 [==============================] - 0s 4ms/step - loss: 7314.8198 - mae: 7314.8198
Epoch 20/100
34/34 [==============================] - 0s 5ms/step - loss: 7305.0923 - mae: 7305.0923
Epoch 21/100
34/34 [==============================] - 0s 5ms/step - loss: 7296.5908 - mae: 7296.5908
Epoch 22/100
34/34 [==============================] - 0s 6ms/step - loss: 7290.1104 - mae: 7290.1104
Epoch 23/100
34/34 [==============================] - 0s 5ms/step - loss: 7285.4897 - mae: 7285.4897
Epoch 24/100
34/34 [==============================] - 0s 5ms/step - loss: 7281.0742 - mae: 7281.0742
Epoch 25/100
34/34 [==============================] - 0s 6ms/step - loss: 7276.6904 - mae: 7276.6904
Epoch 26/100
34/34 [==============================] - 0s 5ms/step - loss: 7272.5249 - mae: 7272.5249
Epoch 27/100
34/34 [==============================] - 0s 4ms/step - loss: 7268.4888 - mae: 7268.4888
Epoch 28/100
34/34 [==============================] - 0s 4ms/step - loss: 7264.3525 - mae: 7264.3525
Epoch 29/100
34/34 [==============================] - 0s 4ms/step - loss: 7260.4541 - mae: 7260.4541
Epoch 30/100
34/34 [==============================] - 0s 4ms/step - loss: 7256.5405 - mae: 7256.5405
Epoch 31/100
34/34 [==============================] - 0s 5ms/step - loss: 7252.4370 - mae: 7252.4370
Epoch 32/100
34/34 [==============================] - 0s 5ms/step - loss: 7248.7378 - mae: 7248.7378
Epoch 33/100
34/34 [==============================] - 0s 4ms/step - loss: 7244.6558 - mae: 7244.6558
Epoch 34/100
34/34 [==============================] - 0s 5ms/step - loss: 7240.7632 - mae: 7240.7632
Epoch 35/100
34/34 [==============================] - 0s 4ms/step - loss: 7237.0293 - mae: 7237.0293
Epoch 36/100
34/34 [==============================] - 0s 4ms/step - loss: 7233.1123 - mae: 7233.1123
Epoch 37/100
34/34 [==============================] - 0s 4ms/step - loss: 7229.2168 - mae: 7229.2168
Epoch 38/100
34/34 [==============================] - 0s 4ms/step - loss: 7225.4897 - mae: 7225.4897
Epoch 39/100
34/34 [==============================] - 0s 4ms/step - loss: 7221.4600 - mae: 7221.4600
Epoch 40/100
34/34 [==============================] - 0s 4ms/step - loss: 7217.6426 - mae: 7217.6426
Epoch 41/100
34/34 [==============================] - 0s 4ms/step - loss: 7213.9087 - mae: 7213.9087
Epoch 42/100
34/34 [==============================] - 0s 4ms/step - loss: 7210.1353 - mae: 7210.1353
Epoch 43/100
34/34 [==============================] - 0s 4ms/step - loss: 7206.1602 - mae: 7206.1602
Epoch 44/100
34/34 [==============================] - 0s 4ms/step - loss: 7202.5337 - mae: 7202.5337
Epoch 45/100
34/34 [==============================] - 0s 4ms/step - loss: 7198.9209 - mae: 7198.9209
Epoch 46/100
34/34 [==============================] - 0s 4ms/step - loss: 7195.1436 - mae: 7195.1436
Epoch 47/100
34/34 [==============================] - 0s 4ms/step - loss: 7191.5684 - mae: 7191.5684
Epoch 48/100
34/34 [==============================] - 0s 4ms/step - loss: 7187.6177 - mae: 7187.6177
Epoch 49/100
34/34 [==============================] - 0s 4ms/step - loss: 7184.2520 - mae: 7184.2520
Epoch 50/100
34/34 [==============================] - 0s 4ms/step - loss: 7180.5498 - mae: 7180.5498
Epoch 51/100
34/34 [==============================] - 0s 4ms/step - loss: 7176.8579 - mae: 7176.8579
Epoch 52/100
34/34 [==============================] - 0s 4ms/step - loss: 7173.0317 - mae: 7173.0317
Epoch 53/100
34/34 [==============================] - 0s 4ms/step - loss: 7169.5488 - mae: 7169.5488
Epoch 54/100
34/34 [==============================] - 0s 4ms/step - loss: 7165.8984 - mae: 7165.8984
Epoch 55/100
34/34 [==============================] - 0s 4ms/step - loss: 7162.1387 - mae: 7162.1387
Epoch 56/100
34/34 [==============================] - 0s 5ms/step - loss: 7158.6626 - mae: 7158.6626
Epoch 57/100
34/34 [==============================] - 0s 5ms/step - loss: 7155.1860 - mae: 7155.1860
Epoch 58/100
34/34 [==============================] - 0s 5ms/step - loss: 7151.6074 - mae: 7151.6074
Epoch 59/100
34/34 [==============================] - 0s 5ms/step - loss: 7148.1851 - mae: 7148.1851
Epoch 60/100
34/34 [==============================] - 0s 6ms/step - loss: 7144.7017 - mae: 7144.7017
Epoch 61/100
34/34 [==============================] - 0s 5ms/step - loss: 7141.2495 - mae: 7141.2495
Epoch 62/100
34/34 [==============================] - 0s 4ms/step - loss: 7137.6250 - mae: 7137.6250
Epoch 63/100
34/34 [==============================] - 0s 4ms/step - loss: 7134.3550 - mae: 7134.3550
Epoch 64/100
34/34 [==============================] - 0s 5ms/step - loss: 7131.1562 - mae: 7131.1562
Epoch 65/100
34/34 [==============================] - 0s 5ms/step - loss: 7127.7969 - mae: 7127.7969
Epoch 66/100
34/34 [==============================] - 0s 4ms/step - loss: 7124.3398 - mae: 7124.3398
Epoch 67/100
34/34 [==============================] - 0s 5ms/step - loss: 7121.2031 - mae: 7121.2031
Epoch 68/100
34/34 [==============================] - 0s 4ms/step - loss: 7117.9922 - mae: 7117.9922
Epoch 69/100
34/34 [==============================] - 0s 4ms/step - loss: 7114.6816 - mae: 7114.6816
Epoch 70/100
34/34 [==============================] - 0s 5ms/step - loss: 7111.5186 - mae: 7111.5186
Epoch 71/100
34/34 [==============================] - 0s 5ms/step - loss: 7108.1860 - mae: 7108.1860
Epoch 72/100
34/34 [==============================] - 0s 5ms/step - loss: 7105.2412 - mae: 7105.2412
Epoch 73/100
34/34 [==============================] - 0s 5ms/step - loss: 7101.9375 - mae: 7101.9375
Epoch 74/100
34/34 [==============================] - 0s 5ms/step - loss: 7098.5718 - mae: 7098.5718
Epoch 75/100
34/34 [==============================] - 0s 4ms/step - loss: 7095.4531 - mae: 7095.4531
Epoch 76/100
34/34 [==============================] - 0s 4ms/step - loss: 7092.1846 - mae: 7092.1846
Epoch 77/100
34/34 [==============================] - 0s 4ms/step - loss: 7089.0986 - mae: 7089.0986
Epoch 78/100
34/34 [==============================] - 0s 4ms/step - loss: 7086.0303 - mae: 7086.0303
Epoch 79/100
34/34 [==============================] - 0s 4ms/step - loss: 7083.0830 - mae: 7083.0830
Epoch 80/100
34/34 [==============================] - 0s 4ms/step - loss: 7079.7832 - mae: 7079.7832
Epoch 81/100
34/34 [==============================] - 0s 4ms/step - loss: 7076.8062 - mae: 7076.8062
Epoch 82/100
34/34 [==============================] - 0s 4ms/step - loss: 7074.0054 - mae: 7074.0054
Epoch 83/100
34/34 [==============================] - 0s 4ms/step - loss: 7071.2134 - mae: 7071.2134
Epoch 84/100
34/34 [==============================] - 0s 4ms/step - loss: 7067.8643 - mae: 7067.8643
Epoch 85/100
34/34 [==============================] - 0s 4ms/step - loss: 7065.1138 - mae: 7065.1138
Epoch 86/100
34/34 [==============================] - 0s 4ms/step - loss: 7062.0625 - mae: 7062.0625
Epoch 87/100
34/34 [==============================] - 0s 5ms/step - loss: 7059.3682 - mae: 7059.3682
Epoch 88/100
34/34 [==============================] - 0s 5ms/step - loss: 7056.2017 - mae: 7056.2017
Epoch 89/100
34/34 [==============================] - 0s 4ms/step - loss: 7053.3081 - mae: 7053.3081
Epoch 90/100
34/34 [==============================] - 0s 4ms/step - loss: 7050.1855 - mae: 7050.1855
Epoch 91/100
34/34 [==============================] - 0s 5ms/step - loss: 7047.3662 - mae: 7047.3662
Epoch 92/100
34/34 [==============================] - 0s 6ms/step - loss: 7044.6016 - mae: 7044.6016
Epoch 93/100
34/34 [==============================] - 0s 6ms/step - loss: 7041.6235 - mae: 7041.6235
Epoch 94/100
34/34 [==============================] - 0s 6ms/step - loss: 7038.6577 - mae: 7038.6577
Epoch 95/100
34/34 [==============================] - 0s 6ms/step - loss: 7035.6338 - mae: 7035.6338
Epoch 96/100
34/34 [==============================] - 0s 5ms/step - loss: 7032.9272 - mae: 7032.9272
Epoch 97/100
34/34 [==============================] - 0s 4ms/step - loss: 7030.1157 - mae: 7030.1157
Epoch 98/100
34/34 [==============================] - 0s 4ms/step - loss: 7027.0903 - mae: 7027.0903
Epoch 99/100
34/34 [==============================] - 0s 4ms/step - loss: 7024.1489 - mae: 7024.1489
Epoch 100/100
34/34 [==============================] - 0s 4ms/step - loss: 7021.0322 - mae: 7021.0322

Review Model Results

In [9]:
# Check the results of the insurance model
insurance_model.evaluate(feature_testing_data, label_testing_data)
9/9 [==============================] - 0s 5ms/step - loss: 7002.0923 - mae: 7002.0923
Out [9]:
[7002.09228515625, 7002.09228515625]
In [10]:
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model MAE: {insurance_model.get_metrics_result()["mae"].numpy()}')
Training Label Median: 9575.4423828125
Training Label Mean: 13346.08984375
model MAE: 7002.09228515625

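The MAE (mean absolute error) on the test data is about 7,002, which is roughly 52% of the mean training label (about 13,346) and 73% of the median (about 9,575). An average error that large relative to typical charges means the model is not great.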
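
One way to judge whether an MAE is large is to compare it with the label scale and with a naive predictor. The sketch below is an illustration, not part of the original notebook: the helper name mean_absolute_error is hypothetical, the five labels are copied from the label_training_data preview, and the reported MAE and mean are copied from the printed output above.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Five training labels copied from the label preview
labels = np.array([9193.8385, 8534.6718, 27117.99378, 8596.8278, 12475.3513])

# Naive baseline: always predict the median of these labels
median_prediction = np.full_like(labels, np.median(labels))
baseline_mae = mean_absolute_error(labels, median_prediction)

# Values copied from the notebook output
reported_mae = 7002.09228515625
training_mean = 13346.08984375
mae_to_mean_ratio = reported_mae / training_mean

print(f"baseline MAE on five rows: {baseline_mae:.2f}")
print(f"model MAE as a share of the mean label: {mae_to_mean_ratio:.1%}")
```

The five-row baseline is only a toy figure; a fair comparison would compute the baseline on the full test set.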

Visualize & Analyze The Loss Curve

In [11]:
pd.DataFrame(modelHistory.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [11]:
Text(0.5, 0, 'epochs')
[Figure: loss and mae per epoch for insurance_model]

The loss dropped sharply during roughly the first 15 epochs.
After that the curve flattened, but the loss was still falling by about 3 per epoch at epoch 100, so the model had not fully converged.
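
The observation that the loss is still dropping can be quantified from the training log. The sketch below copies the loss values of the last five epochs from the log above and computes the average drop per epoch; the variable names are illustrative.

```python
# Loss values copied from epochs 96 to 100 of the training log
last_losses = [7032.9272, 7030.1157, 7027.0903, 7024.1489, 7021.0322]

# Difference between each epoch and the next one
per_epoch_drop = [
    earlier - later for earlier, later in zip(last_losses, last_losses[1:])
]

average_drop = sum(per_epoch_drop) / len(per_epoch_drop)
still_improving = all(drop > 0 for drop in per_epoch_drop)

print(f"average drop per epoch: {average_drop:.2f}")
print(f"still improving: {still_improving}")
```

The same calculation could be run on modelHistory.history['loss'] to cover every epoch instead of the last five.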

Experiment I

  • Different layers (more layers, with more units)
  • Different optimizer function (Adam instead of SGD)

Build A New Model Version

In [12]:
insurance_model_2 = tf.keras.Sequential()
model2EpochCount = 100
# different & more layers
l1 = tf.keras.layers.Dense(100)
l2 = tf.keras.layers.Dense(10)
l3 = tf.keras.layers.Dense(1)

insurance_model_2.add(l1)
insurance_model_2.add(l2)
insurance_model_2.add(l3)

# Compile the model
insurance_model_2.compile(loss=tf.keras.losses.mae,
                          optimizer=tf.keras.optimizers.Adam(), # Adam works but SGD doesn't
                          metrics=['mae'])

# Fit the model and save the history (we can plot this)
model_2_history = insurance_model_2.fit(feature_training_data, label_training_data, epochs=model2EpochCount, verbose=0)
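
The three Dense layers above are created without an activation argument, and Keras Dense layers apply no activation by default, so each layer is an affine map. A stack of affine maps is itself affine, which the NumPy sketch below checks with random weights of the same shapes (11 input features, then 100, 10, and 1 units). The weights are random placeholders, not the trained model's weights.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random placeholder weights and biases shaped like Dense(100), Dense(10), Dense(1)
w1, b1 = rng.normal(size=(11, 100)), rng.normal(size=100)
w2, b2 = rng.normal(size=(100, 10)), rng.normal(size=10)
w3, b3 = rng.normal(size=(10, 1)), rng.normal(size=1)

# Five hypothetical feature rows with 11 columns each
x = rng.normal(size=(5, 11))

# Forward pass through three layers with no activation function
stacked_output = ((x @ w1 + b1) @ w2 + b2) @ w3 + b3

# The same computation folded into one weight matrix and one bias
combined_w = w1 @ w2 @ w3
combined_b = (b1 @ w2 + b2) @ w3 + b3
single_layer_output = x @ combined_w + combined_b

outputs_match = np.allclose(stacked_output, single_layer_output)
print(outputs_match)
```

Passing a nonlinear activation such as activation='relu' to the hidden layers is a common way to let extra layers capture nonlinear relationships; that is a suggestion to try, not something the original notebook does.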

Evaluate The Model

In [13]:
insurance_model_2.evaluate(feature_testing_data, label_testing_data)
9/9 [==============================] - 0s 5ms/step - loss: 4758.9893 - mae: 4758.9893
Out [13]:
[4758.9892578125, 4758.9892578125]

Review Model Results

In [14]:
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model_2 MAE: {insurance_model_2.get_metrics_result()["mae"].numpy()}')
Training Label Median: 9575.4423828125
Training Label Mean: 13346.08984375
model_2 MAE: 4758.9892578125

Visualize Model Loss

In [15]:
pd.DataFrame(model_2_history.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [15]:
Text(0.5, 0, 'epochs')
[Figure: loss and mae per epoch for insurance_model_2]

Experiment II

In [16]:
insurance_model_3 = tf.keras.Sequential()
model3EpochCount = 200

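# Note: l1, l2, and l3 are the same layer objects used in insurance_model_2,
# so this model starts from the weights already trained there
insurance_model_3.add(l1)
insurance_model_3.add(l2)
insurance_model_3.add(l3)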

# Compile the model
insurance_model_3.compile(loss=tf.keras.losses.mae,
                          optimizer=tf.keras.optimizers.Adam(), # Adam works but SGD doesn't 
                          metrics=['mae'])

# Fit the model and save the history (we can plot this)
model_3_history = insurance_model_3.fit(feature_training_data, label_training_data, epochs=model3EpochCount, verbose=0)
In [17]:
# Note: this evaluates insurance_model_2, which shares its layer objects with insurance_model_3
insurance_model_2.evaluate(feature_testing_data, label_testing_data)
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model_3 MAE: {insurance_model_3.get_metrics_result()["mae"].numpy()}')
9/9 [==============================] - 0s 5ms/step - loss: 3230.5137 - mae: 3230.5137
Training Label Median: 9575.4423828125
Training Label Mean: 13346.08984375
model_3 MAE: 3515.3447265625
In [18]:
pd.DataFrame(model_3_history.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [18]:
Text(0.5, 0, 'epochs')
[Figure: loss and mae per epoch for insurance_model_3]