XGBoost & Ensemble Learning

Bagging

Bagging (bootstrap aggregating) trains each model in the ensemble on a random bootstrap sample of the source data (a subset drawn with replacement), then aggregates their predictions.
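A minimal sketch using scikit-learn's BaggingClassifier (assuming sklearn is installed; by default the base models are decision trees):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# each of the 50 base estimators is trained on its own
# bootstrap sample drawn from the training data
bag = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())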

Boosting

Boosting trains several models sequentially, so that each new model focuses on fixing the errors made by the previous ones. AdaBoost, XGBoost, and Gradient Boosting Machines are all examples of boosting.
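A minimal sketch using scikit-learn's AdaBoostClassifier (assuming sklearn; by default it boosts shallow decision trees):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# models are trained one after another; each new one pays more
# attention to the samples the previous ones misclassified
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())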

Stacking

Stacking combines many base models by training a "meta" model on their predictions to produce the final output.
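A minimal sketch using scikit-learn's StackingClassifier (assuming sklearn; the particular base models and meta model here are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

base_models = [
    ('rf', RandomForestClassifier(random_state=0)),
    ('svc', SVC(probability=True, random_state=0)),
]
# the logistic regression "meta" model learns how to combine
# the base models' predictions into a final prediction
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())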

Voting

Voting classifiers combine predictions from several models, either by majority vote ("hard" voting) or by averaging predicted probabilities ("soft" voting).
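A minimal sketch using scikit-learn's VotingClassifier (assuming sklearn; the three base models are arbitrary examples):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=0)),
                ('knn', KNeighborsClassifier())],
    voting='soft')  # 'soft' averages probabilities; 'hard' takes a majority vote
print(cross_val_score(vote, X, y, cv=5).mean())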

Advanced Ensemble Learning

  • Bayes Optimal Classifier
  • Bayesian Parameter Averaging
  • Bayesian Model Combination

These are, apparently, more theoretical Bayesian approaches that weight and combine candidate models by how probable they are given the data, rather than committing to a single model.

XGBoost

Extreme gradient boosted trees.
xgboost is a Python library: pip install xgboost.
Apparently, XGBoost has helped win many Kaggle competitions.

Using XGBoost to Leverage Ensemble Learning

Import Dependencies

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

Load Data

Here, we're using the iris dataset (a small dataset about flowers!)

In [12]:
iris = load_iris()

numSamples, numFeatures = iris.data.shape
print(numSamples)
print(numFeatures)
print(list(iris.target_names))
150
4
['setosa', 'versicolor', 'virginica']
In [13]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

Format The Data for XGBoost

XGBoost's core training API requires the input data to be in its DMatrix format.
Luckily, the XGBoost library provides a DMatrix class that does just that!

In [14]:
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Setup XGBoost Hyper-Parameters

setting up the hyper-parameters and learning-task parameters:

  • max_depth: the maximum depth of each tree. Defaults to 6.
  • eta: the learning rate (step-size shrinkage applied when updating the model), used to help prevent over-fitting. Defaults to 0.3.
  • objective: 'multi:softmax' instructs XGBoost to do multi-class classification with the softmax objective. This requires setting num_class as well.
  • num_class: the number of classes in the target.
In [15]:
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 10  # number of boosting rounds passed to xgb.train
In [16]:
model = xgb.train(param, train, epochs)
In [22]:
predictions = model.predict(test)
print(predictions)
[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]

View Accuracy Score

using the sklearn lib:

In [23]:
accuracy_score(y_test, predictions)
Out [23]:
1.0

1.0.
That. is. perfect. (Though iris is a small, easy dataset, so a perfect score on a 30-sample test set isn't too surprising.)
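As an aside, xgboost also provides a scikit-learn-style wrapper, XGBClassifier, which handles the DMatrix conversion internally. A rough sketch with settings roughly matching the param dict above, reusing the train/test split from earlier:

from xgboost import XGBClassifier

# XGBClassifier infers the multi-class objective from y_train,
# so num_class doesn't need to be set explicitly
clf = XGBClassifier(max_depth=4, learning_rate=0.3, n_estimators=10)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))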

Page Tags:
python
data-science
jupyter
learning
numpy