Table Of Contents
- Modeling And Wrangling
- Download Some Data
- Create Number-Only Data with One-Hot-Encoding
- Split Data: Features & Labels
- Split Data: Train & test
- Create A Model
- Review Model Results
- Visualize & Analyze The Loss Curve
- Experiment I
- Build A New Model Version
- Evaluate the model
- Review Model Results
- Visualise Model Loss
- Experiment II
Modeling And Wrangling
Here, Python libraries are used to wrangle some data before building a model:
- Download data from the internet: the pandas read_csv function accepts a URL that returns a CSV
- Preview the downloaded data with the pandas head() method
- Normalize data values with sklearn MinMaxScaler and OneHotEncoder
- Split data into training & testing sets with sklearn train_test_split
In this example, some data about medical insurance will be downloaded, wrangled, and used to build a machine-learning model that predicts insurance costs based on age, sex, bmi, children, smoking_status and residential_region.
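The cells below import MinMaxScaler, OneHotEncoder and make_column_transformer, but do the encoding with pd.get_dummies. As a minimal sketch of how the sklearn pieces could be combined instead, assuming a small made-up DataFrame (the `sample` rows and column choices are illustrative, not taken from the insurance data):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical rows shaped like the insurance data (values are made up)
sample = pd.DataFrame({
    "age": [19, 33, 64],
    "bmi": [27.9, 22.7, 30.0],
    "smoker": ["yes", "no", "no"],
})

# Scale numeric columns to [0, 1] and one-hot encode the categorical column
transformer = make_column_transformer(
    (MinMaxScaler(), ["age", "bmi"]),
    (OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    sparse_threshold=0,  # always return a dense array
)

# Columns come out in transformer order: age, bmi, smoker_no, smoker_yes
scaled = transformer.fit_transform(sample)
print(scaled.shape)
```

In a real pipeline the transformer would be fitted on the training split only and then applied to the test split with transform(), so that test values do not influence the scaling.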
In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
Download Some Data
In [2]:
dataUrl = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
dataFromWeb = pd.read_csv(dataUrl)
dataFromWeb.head()
Out [2]:
Create Number-Only Data with One-Hot-Encoding
In [3]:
oheEncodedData = pd.get_dummies(dataFromWeb)
oheEncodedData.head()
Out [3]:
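To make the effect of get_dummies concrete, here is a small sketch on a made-up two-row frame (the `sample` values are illustrative only). Numeric columns pass through unchanged, and each text column is replaced by one indicator column per category. Passing dtype=float is an optional choice made here so the indicators are numeric; recent pandas versions default to bool indicators.

```python
import pandas as pd

# Hypothetical two-row frame with one numeric and one text column
sample = pd.DataFrame({"age": [19, 33], "smoker": ["yes", "no"]})

# One indicator column per category; dtype=float keeps them numeric
encoded = pd.get_dummies(sample, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```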
Split Data: Features & Labels
The charges column represents the dependent variable here: the labels.
All the other columns represent independent variables: the features.
In [4]:
labelField = 'charges'
featureData = oheEncodedData.drop(labelField, axis=1)
labelData = oheEncodedData[labelField]
featureData.head()
Out [4]:
Split Data: Train & test
In [5]:
testDataPercentage = .2 # how much of our data should we use for "testing"
randomVal = 42
feature_training_data, feature_testing_data, label_training_data, label_testing_data = train_test_split(
    featureData,
    labelData,
    test_size=testDataPercentage,
    random_state=randomVal,  # set random state for reproducible splits
)
In [6]:
feature_training_data.head()
Out [6]:
In [7]:
label_training_data.head()
Out [7]:
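A quick way to sanity-check a split is to confirm the sizes and that no row lands in both sets. The sketch below uses a made-up 10-row array rather than the insurance data, with the same test_size and random_state values as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 feature columns, labels 0..9
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing
print(len(X_train), len(X_test))

# No label should appear in both splits
overlap = set(y_train.tolist()) & set(y_test.tolist())
print(overlap)
```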
Create A Model
In [8]:
epochCount = 100
# Set random seed
tf.random.set_seed(randomVal)
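# layers: two separate Dense(1) instances, since adding the same layer object twice would reuse one set of weights
denseLayer1 = tf.keras.layers.Dense(1)
denseLayer2 = tf.keras.layers.Dense(1)
# Create a new model
insurance_model = tf.keras.Sequential()
insurance_model.add(denseLayer1)
insurance_model.add(denseLayer2)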
# Compile the model
insurance_model.compile(loss=tf.keras.losses.mae,
optimizer=tf.keras.optimizers.SGD(),
metrics=['mae'])
# cast everything to float32 to prevent a dtype error (pd.get_dummies can produce bool columns)
feature_training_data=feature_training_data.astype(np.float32)
label_training_data=label_training_data.astype(np.float32)
feature_testing_data=feature_testing_data.astype(np.float32)
label_testing_data=label_testing_data.astype(np.float32)
# Fit the model
# save output to a variable
modelHistory = insurance_model.fit(feature_training_data, label_training_data, epochs=epochCount)
Review Model Results
In [9]:
# Check the results of the insurance model
insurance_model.evaluate(feature_testing_data, label_testing_data)
Out [9]:
In [10]:
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model MAE: {insurance_model.get_metrics_result()["mae"].numpy()}')
Because the MAE (mean absolute error) is large relative to the median and mean of the training labels, the model's predictions are not very accurate.
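One way to judge whether an MAE is large is to compare it with a naive baseline that always predicts the median label. The sketch below computes MAE by hand with NumPy on made-up values; the helper name `mean_absolute_error` and the numbers are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up labels standing in for insurance charges
y_true = np.array([1000.0, 2000.0, 9000.0])

# Naive baseline: predict the median label for every row
baseline_pred = np.full_like(y_true, np.median(y_true))
baseline_mae = mean_absolute_error(y_true, baseline_pred)

print(f"baseline MAE: {baseline_mae:.2f}")
```

A model whose test MAE is not clearly below this kind of baseline has learned little beyond the typical value of the labels.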
Visualize & Analyze The Loss Curve
In [11]:
pd.DataFrame(modelHistory.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [11]:
The loss drops sharply during the first epochs.
The curve then flattens, but it still appears to be decreasing at the end of training, which suggests that training for more epochs could lower the loss further.
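Whether a curve is "still dropping" can also be checked numerically rather than by eye. The sketch below compares the average loss over the last two windows of epochs; `relative_improvement` is a hypothetical helper, and the loss values are made up to stand in for something like modelHistory.history["loss"].

```python
def relative_improvement(losses, window=10):
    """Fractional drop in mean loss between the previous and the latest window."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) / previous

# Made-up, steadily decreasing loss values for 100 epochs
example_losses = [100.0 / (epoch + 1) for epoch in range(100)]
improvement = relative_improvement(example_losses)

# A value clearly above 0 suggests the loss had not plateaued yet
print(f"relative improvement over the last window: {improvement:.3f}")
```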
Experiment I
- different layers: more layers, with more units
- different optimizer function: Adam instead of SGD
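The source does not say why the optimizer matters here. One plausible factor, offered as an assumption, is step size: a plain SGD update is proportional to the gradient, while Adam's first update is roughly the learning rate regardless of gradient size. The sketch below shows single update steps in NumPy using the Keras default hyperparameters; the gradient value is made up, and the functions are simplified illustrations, not the Keras implementations.

```python
import numpy as np

def sgd_step(weight, grad, lr=0.01):
    """Plain SGD: the step is proportional to the gradient."""
    return weight - lr * grad

def adam_first_step(weight, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """First Adam update, starting from zero moment estimates."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction at step 1
    v_hat = v / (1 - beta2)
    return weight - lr * m_hat / (np.sqrt(v_hat) + eps)

# Made-up large gradient, as unscaled features and large labels might produce
large_grad = 50_000.0

sgd_move = abs(sgd_step(0.0, large_grad))
adam_move = abs(adam_first_step(0.0, large_grad))

print(f"SGD step size:  {sgd_move}")
print(f"Adam step size: {adam_move}")
```

With unscaled inputs, very large SGD steps can make training diverge, which is one reason scaling features (for example with MinMaxScaler) is often paired with SGD.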
Build A New Model Version
In [12]:
insurance_model_2 = tf.keras.Sequential()
modell2EpochCount = 100
# different & more layers
l1 = tf.keras.layers.Dense(100)
l2 = tf.keras.layers.Dense(10)
l3 = tf.keras.layers.Dense(1)
insurance_model_2.add(l1)
insurance_model_2.add(l2)
insurance_model_2.add(l3)
# Compile the model
insurance_model_2.compile(loss=tf.keras.losses.mae,
optimizer=tf.keras.optimizers.Adam(), # Adam works but SGD doesn't
metrics=['mae'])
# Fit the model and save the history (we can plot this)
model_2_history = insurance_model_2.fit(feature_training_data, label_training_data, epochs=modell2EpochCount, verbose=0)
Evaluate the model
In [13]:
insurance_model_2.evaluate(feature_testing_data, label_testing_data)
Out [13]:
Review Model Results
In [14]:
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model_2 MAE: {insurance_model_2.get_metrics_result()["mae"].numpy()}')
Visualise Model Loss
In [15]:
pd.DataFrame(model_2_history.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [15]:
Experiment II
In [16]:
insurance_model_3 = tf.keras.Sequential()
model3EpochCount = 200
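# fresh layers, so model_3 starts from new weights instead of sharing model_2's trained layers
insurance_model_3.add(tf.keras.layers.Dense(100))
insurance_model_3.add(tf.keras.layers.Dense(10))
insurance_model_3.add(tf.keras.layers.Dense(1))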
# Compile the model
insurance_model_3.compile(loss=tf.keras.losses.mae,
optimizer=tf.keras.optimizers.Adam(), # Adam works but SGD doesn't
metrics=['mae'])
# Fit the model and save the history (we can plot this)
model_3_history = insurance_model_3.fit(feature_training_data, label_training_data, epochs=model3EpochCount, verbose=0)
In [17]:
# evaluate model_3 so the metrics printed below belong to this model
insurance_model_3.evaluate(feature_testing_data, label_testing_data)
print(f'Training Label Median: {label_training_data.median()}')
print(f'Training Label Mean: {label_training_data.mean()}')
print(f'model_3 MAE: {insurance_model_3.get_metrics_result()["mae"].numpy()}')
In [18]:
pd.DataFrame(model_3_history.history).plot()
plt.ylabel('loss')
plt.xlabel('epochs')
Out [18]: