K-Fold Cross-Validation

scikit-learn includes a few modules that make this straightforward to implement.

Dependencies

In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm
In [2]:
irisData = datasets.load_iris()

Split Data into Train & Test Data

A single train/test split is made easy with the train_test_split function in the model_selection module:

In [3]:
# Split the iris data into train/test data sets with 40% reserved for testing
# DOCS: 
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
trainingData = irisData.data
expectedResults = irisData.target
percentageOfTestingData = .4
X_train, X_test, y_train, y_test = train_test_split(trainingData, expectedResults, test_size=percentageOfTestingData, random_state=0)
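As a quick sanity check (a minimal sketch, using a synthetic stand-in array the same size as the iris data), the split reserves 40% of the 150 samples for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the iris arrays: 150 samples, 4 features
X = np.arange(150 * 4).reshape(150, 4)
y = np.arange(150) % 3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# 40% of 150 samples are held out for testing
print(len(X_train), len(X_test))  # 90 60
```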
In [4]:
# Build an SVC model for predicting irisData classifications using training data
# SVC DOCS
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

svcKernel = 'linear'
svcLinearModel = svm.SVC(kernel=svcKernel, C=1).fit(X_train, y_train)
In [11]:
# Now measure its performance with the test data
linearModelScore = svcLinearModel.score(X_test, y_test)
print(f'linearModelScore: {linearModelScore}')
linearModelScore: 0.9
In [6]:
# set "k", the number of folds to split the dataset into
k = 5
In [12]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
scores = cross_val_score(svcLinearModel, irisData.data, irisData.target, cv=k)

# Print the accuracy for each fold:
print("svcLinearModel cross-validation score:")
print(scores)

# And the mean accuracy of all 5 folds:
print("mean of scores:",scores.mean())
svcLinearModel cross-validation score:
[0.96666667 1.         0.96666667 0.96666667 1.        ]
mean of scores: 0.9800000000000001
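Under the hood, cross_val_score fits and scores the model once per fold. A rough sketch of that loop, written by hand with KFold (note: this is an illustration, not scikit-learn's exact internals; for classifiers cross_val_score actually defaults to stratified folds, so the per-fold scores can differ slightly):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
model = svm.SVC(kernel='linear', C=1)

# Split the indices into 5 folds; each fold takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(iris.data):
    model.fit(iris.data[train_idx], iris.target[train_idx])
    fold_scores.append(model.score(iris.data[test_idx], iris.target[test_idx]))

print(np.round(fold_scores, 4))
print(np.mean(fold_scores))
```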

Use K-Fold With Different Variables

Our linear model performs well.
Let's run the same cross-validation with a polynomial kernel:

In [17]:
polyKernel = 'poly'
svcPolyModel = svm.SVC(kernel=polyKernel, C=1)
scores = cross_val_score(svcPolyModel, irisData.data, irisData.target, cv=k)
print('svcPolyModel scores')
print(scores)
print(scores.mean())
svcPolyModel scores
[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001

With these settings, the more complex polynomial kernel matched the linear kernel's cross-validated accuracy. A more complex kernel can overfit, though, and we couldn't have detected that from a single train/test split, which gives the polynomial model the same 0.9 score as the linear one:

In [18]:
# Build an SVC model for predicting iris classifications using training data
svcModel = svm.SVC(kernel=polyKernel, C=1).fit(X_train, y_train)

# Now measure its performance with the test data
svcModel.score(X_test, y_test)   
Out [18]:
0.9
In [19]:
svcModelTwo = svm.SVC(kernel=polyKernel, degree=2, C=1).fit(X_train, y_train)
svcModelFour = svm.SVC(kernel=polyKernel, degree=4, C=1).fit(X_train, y_train)

# Now measure its performance with the test data
print(f'svcModelTwoScore: {svcModelTwo.score(X_test, y_test)}')
print(f'svcModelFourScore: {svcModelFour.score(X_test, y_test)}')
svcModelTwoScore: 0.95
svcModelFourScore: 0.9166666666666666
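Rather than comparing degrees on one split, each candidate degree can be cross-validated directly. A sketch of that sweep (the degrees chosen here are just illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Cross-validate each polynomial degree on the full dataset,
# so the comparison doesn't hinge on one lucky (or unlucky) split
means = {}
for degree in (2, 3, 4):
    model = svm.SVC(kernel='poly', degree=degree, C=1)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    means[degree] = scores.mean()
    print(degree, round(scores.mean(), 4))
```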
Page Tags:
python
data-science
jupyter
learning
numpy