K-Fold Cross-Validation

scikit-learn includes a few modules that make this straightforward to implement.

Dependencies

In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm
In [2]:
irisData = datasets.load_iris()

Split Data into Train & Test Data

A single train/test split is made easy with the train_test_split function in the model_selection module:

In [3]:
# Split the iris data into train/test data sets with 40% reserved for testing
# DOCS: 
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
trainingData = irisData.data
expectedResults = irisData.target
percentageOfTestingData = .4
X_train, X_test, y_train, y_test = train_test_split(trainingData, expectedResults, test_size=percentageOfTestingData, random_state=0)
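As a quick sanity check (a minimal sketch, using a synthetic stand-in array the same size as the iris data), the split reserves 40% of the 150 samples for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the iris arrays: 150 samples, 4 features
X = np.arange(150 * 4).reshape(150, 4)
y = np.arange(150) % 3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# 40% of 150 samples are held out for testing
print(len(X_train), len(X_test))  # 90 60
```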
In [4]:
# Build an SVC model for predicting irisData classifications using training data
# SVC DOCS
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

svcKernel = 'linear'
svcLinearModel = svm.SVC(kernel=svcKernel, C=1).fit(X_train, y_train)
In [11]:
# Now measure its performance with the test data
linearModelScore = svcLinearModel.score(X_test, y_test)
print(f'linearModelScore: {linearModelScore}')
linearModelScore: 0.9
In [6]:
# set "k", the number of folds to split the dataset into
k = 5
In [12]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
scores = cross_val_score(svcLinearModel, irisData.data, irisData.target, cv=k)

# Print the accuracy for each fold:
print("svcLinearModel cross-validation score:")
print(scores)

# And the mean accuracy of all 5 folds:
print("mean of scores:",scores.mean())
svcLinearModel cross-validation score:
[0.96666667 1.         0.96666667 0.96666667 1.        ]
mean of scores: 0.9800000000000001
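Under the hood, cross_val_score fits and scores the model once per fold. A rough sketch of that loop, written by hand with KFold (note: this is an illustration, not scikit-learn's exact internals; for classifiers cross_val_score actually defaults to stratified folds, so the per-fold scores can differ slightly):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
model = svm.SVC(kernel='linear', C=1)

# Split the indices into 5 folds; each fold takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(iris.data):
    model.fit(iris.data[train_idx], iris.target[train_idx])
    fold_scores.append(model.score(iris.data[test_idx], iris.target[test_idx]))

print(np.round(fold_scores, 4))
print(np.mean(fold_scores))
```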

Use K-Fold With Different Variables

Our linear model performs well.
Let's run the same cross-validation with a polynomial kernel:

In [17]:
polyKernel = 'poly'
svcPolyModel = svm.SVC(kernel=polyKernel, C=1)
scores = cross_val_score(svcPolyModel, irisData.data, irisData.target, cv=k)
print('svcPolyModel scores')
print(scores)
print(scores.mean())
svcPolyModel scores
[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001

With these settings, the more complex polynomial kernel matched the linear kernel's cross-validated accuracy. A more complex kernel can overfit, though, and we couldn't have detected that from a single train/test split, which gives the polynomial model the same 0.9 score as the linear one:

In [18]:
# Build an SVC model for predicting iris classifications using training data
svcModel = svm.SVC(kernel=polyKernel, C=1).fit(X_train, y_train)

# Now measure its performance with the test data
svcModel.score(X_test, y_test)   
Out [18]:
0.9
In [19]:
svcModelTwo = svm.SVC(kernel=polyKernel, degree=2, C=1).fit(X_train, y_train)
svcModelFour = svm.SVC(kernel=polyKernel, degree=4, C=1).fit(X_train, y_train)

# Now measure its performance with the test data
print(f'svcModelTwoScore: {svcModelTwo.score(X_test, y_test)}')
print(f'svcModelFourScore: {svcModelFour.score(X_test, y_test)}')
svcModelTwoScore: 0.95
svcModelFourScore: 0.9166666666666666
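Rather than comparing degrees on one split, each candidate degree can be cross-validated directly. A sketch of that sweep (the degrees chosen here are just illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Cross-validate each polynomial degree on the full dataset,
# so the comparison doesn't hinge on one lucky (or unlucky) split
means = {}
for degree in (2, 3, 4):
    model = svm.SVC(kernel='poly', degree=degree, C=1)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    means[degree] = scores.mean()
    print(degree, round(scores.mean(), 4))
```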
Page Tags:
python
data-science
jupyter
learning
numpy