K-Fold Cross-Validation
- Dependencies
- Load Some Data
- Split Data into Train & Test Data
- Build A Model From Training Data
- Measure the Model's Performance
- Start K-Fold Cross Validation
- Use K-Fold With Different Variables
- Comparing Polynomial Degrees

K-Fold Cross Validation
This is a way to address over-fitting a model to the training data.
- split data into K random groups
- reserve 1 set as testing data
- train over-and-over on the K-1 segments
- take the AVERAGE of the K scores (r-squared for regressors, accuracy for classifiers) as the predictability score
scikit-learn includes a few modules that make this simple to implement.
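To make the procedure above concrete, here is a minimal hand-rolled version of the K-fold loop using KFold. This is a sketch, not from the notebook: the variable names are illustrative, and it uses the same iris data and linear SVC that appear below.

import numpy as np
from sklearn.model_selection import KFold
from sklearn import datasets, svm

# manually split the data into K folds, train on K-1, score on the held-out fold
data = datasets.load_iris()
kf = KFold(n_splits=5, shuffle=True, random_state=0)
foldScores = []
for trainIdx, testIdx in kf.split(data.data):
    model = svm.SVC(kernel='linear', C=1)
    model.fit(data.data[trainIdx], data.target[trainIdx])
    foldScores.append(model.score(data.data[testIdx], data.target[testIdx]))

# the average of the per-fold scores is the predictability score
print('per-fold scores:', foldScores)
print('average score:', np.mean(foldScores))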
Dependencies
In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm
Load Some Data
In [2]:
irisData = datasets.load_iris()
Split Data into Train & Test Data
A single train/test split is made easy with the train_test_split function in sklearn.model_selection:
In [3]:
# Split the iris data into train/test data sets with 40% reserved for testing
# DOCS:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
trainingData = irisData.data
expectedResults = irisData.target
percentageOfTestingData = .4
X_train, X_test, y_train, y_test = train_test_split(trainingData, expectedResults, test_size=percentageOfTestingData, random_state=0)
Build A Model From Training Data
In [4]:
# Build an SVC model for predicting irisData classifications using training data
# SVC DOCS
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
svcKernel = 'linear'
svcLinearModel = svm.SVC(kernel=svcKernel, C=1).fit(X_train, y_train)
Measure the Model's Performance
In [11]:
# Now measure its performance with the test data
linearModelScore = svcLinearModel.score(X_test, y_test)
print(f'linearModelScore: {linearModelScore}')
Start K-Fold Cross Validation
In [6]:
# set a "K" value, the number of folds to split the dataset into
k = 5
In [12]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
scores = cross_val_score(svcLinearModel, irisData.data, irisData.target, cv=k)
# Print the accuracy for each fold:
print("svcLinearModel cross-validation score:")
print(scores)
# And the mean accuracy of all 5 folds:
print("mean of scores:",scores.mean())
Use K-Fold With Different Variables
In [17]:
polyKernel = 'poly'
svcPolyModel = svm.SVC(kernel=polyKernel, C=1)
scores = cross_val_score(svcPolyModel, irisData.data, irisData.target, cv=k)
print('svcPolyModel scores')
print(scores)
print(scores.mean())
The more complex polynomial kernel produced lower accuracy than the simple linear kernel, which suggests the polynomial kernel is overfitting. A single train/test split couldn't have revealed that:
In [18]:
# Build an SVC model for predicting iris classifications using training data
svcModel = svm.SVC(kernel=polyKernel, C=1).fit(X_train, y_train)
# Now measure its performance with the test data
print(f'svcPolyModel single-split score: {svcModel.score(X_test, y_test)}')
Comparing Polynomial Degrees
In [19]:
svcModelTwo = svm.SVC(kernel=polyKernel, degree=2, C=1).fit(X_train, y_train)
svcModelFour = svm.SVC(kernel=polyKernel, degree=4, C=1).fit(X_train, y_train)
# Now measure its performance with the test data
print(f'svcModelTwoScore: {svcModelTwo.score(X_test, y_test)}')
print(f'svcModelFourScore: {svcModelFour.score(X_test, y_test)}')
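Those single-split scores can be misleading for the same reason as before. As a follow-up sketch (reusing irisData, polyKernel, and k from above), the same degree settings can be compared with cross_val_score instead:

# cross-validate each polynomial degree rather than trusting one train/test split
svcPolyTwo = svm.SVC(kernel=polyKernel, degree=2, C=1)
svcPolyFour = svm.SVC(kernel=polyKernel, degree=4, C=1)
print('degree=2 mean cv score:', cross_val_score(svcPolyTwo, irisData.data, irisData.target, cv=k).mean())
print('degree=4 mean cv score:', cross_val_score(svcPolyFour, irisData.data, irisData.target, cv=k).mean())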
Page Tags:
python
data-science
jupyter
learning
numpy