XGBoost & Ensemble Learning
- Ensemble Learning
- XGBoost
- Using XGBoost to leverage Ensemble Learning
- Import Dependencies
- Load Data
- Split Data into Training & Testing Datasets
- Format The Data for XGBoost
- Setup XGBoost Hyper-Parameters
- Train
- View Test Predictions
- View Accuracy score

# XGBoost & Ensemble Learning
## Ensemble Learning
Use many models to solve a single problem.
### Bagging
Bagging, short for bootstrap aggregating, trains each model on a different random subset/sample of the source data.
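For a quick hedged sketch of bagging (not part of the original walkthrough), scikit-learn's BaggingClassifier can be tried on the same iris data used below; the n_estimators value is just an illustrative pick:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 50 models, each trained on a random bootstrap sample of the rows;
# scikit-learn's default base model here is a decision tree
bagging = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())
```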
### Boosting
Boosting runs several models sequentially, in such a way that each model fixes errors made by the previous model. AdaBoost, XGBoost, and Gradient Boosting Machines are all examples of boosting.
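As a hedged sketch of the boosting idea (again my own addition, with illustrative parameters), scikit-learn's GradientBoostingClassifier fits each new tree to the mistakes of the trees before it:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# each new tree is fit to the errors (gradient of the loss) of the previous trees
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
print(cross_val_score(gbm, X, y, cv=5).mean())
```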
### Stacking
Stacking combines many models by training a "meta" model on their predictions.
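A minimal stacking sketch, assuming scikit-learn's StackingClassifier; the base models and meta model are arbitrary illustrative picks, not prescriptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# the base models make predictions; the "meta" model (final_estimator)
# learns how to combine those predictions into one answer
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```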
### Voting
Voting classifiers combine predictions from several models, either by majority vote ("hard" voting) or by averaging predicted probabilities together ("soft" voting).
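A small voting sketch along the same lines (my own addition), using scikit-learn's VotingClassifier with soft voting; the base models are arbitrary picks:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# "soft" voting averages the predicted class probabilities of the three models
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
print(cross_val_score(vote, X, y, cv=5).mean())
```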
### Advanced Ensemble Learning
- Bayes Optimal Classifier
- Bayesian Parameter Averaging
- Bayesian Model Combination
These are, apparently, complex approaches to "figuring out" which ensemble learning approach is best.
## XGBoost
Extreme gradient boosted trees. xgboost is a Python library:
pip install xgboost
Apparently xgboost has helped win many Kaggle competitions.
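Side note (not from the original notes): besides the native API used in the walkthrough below, the xgboost library also ships scikit-learn-style wrappers such as XGBClassifier. A rough sketch, with parameter values chosen just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=0)

# scikit-learn-compatible wrapper around the same boosted-tree engine
clf = XGBClassifier(n_estimators=10, max_depth=4, learning_rate=0.3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```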
## Using XGBoost to leverage Ensemble Learning
### Import Dependencies
In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
### Load Data
In [12]:
iris = load_iris()
numSamples, numFeatures = iris.data.shape
print(numSamples)
print(numFeatures)
print(list(iris.target_names))
### Split Data into Training & Testing Datasets
In [13]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
### Format The Data for XGBoost
XGBoost's native training API requires the input data to be in the DMatrix format. Luckily the XGBoost library has a DMatrix class that can do just that!
In [14]:
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
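A tiny optional check, not in the original notebook: assuming the train and test DMatrix objects created above, the DMatrix num_row() and num_col() methods can confirm the wrapped shapes:

```python
# quick sanity check on the wrapped data (uses the `train` / `test`
# DMatrix objects built in the cell above)
print(train.num_row(), train.num_col())  # expected: 120 rows x 4 features
print(test.num_row(), test.num_col())    # expected: 30 rows x 4 features
```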
### Setup XGBoost Hyper-Parameters
Setting up hyper-parameters and learning task parameters:
- max_depth: the maximum "depth" of each tree. Defaults to 6.
- eta: the step size used when shrinking & updating the model, which helps prevent over-fitting. Defaults to 0.3.
- objective: here the value multi:softmax instructs xgboost to do multi-class classification with the softmax objective. This requires setting num_class as well.
- num_class: the number of classes.
In [15]:
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3}
epochs = 10
### Train
In [16]:
model = xgb.train(param, train, epochs)
### View Test Predictions
In [22]:
predictions = model.predict(test)
print(predictions)
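Not part of the original output, but worth noting: with the multi:softmax objective the predictions come back as class indices, so (assuming the iris and predictions objects from the cells above) they can be mapped back to species names:

```python
# predictions are class indices (returned as floats); map them to species names
predicted_species = [iris.target_names[int(p)] for p in predictions]
print(predicted_species[:5])
```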
### View Accuracy score
Using the sklearn lib:
In [23]:
accuracy_score(y_test, predictions)
Out [23]:
1.0
That. is. perfect.
Page Tags:
python
data-science
jupyter
learning
numpy