One ML Algorithm for all of your Data Science Needs

Learning how XGBoost became such a popular model in today's data science community.

INSAID
7 min read · Nov 24, 2022

By Daksh Bhatnagar

Introduction

First of all, let's understand what XGBoost means. XGBoost, or Extreme Gradient Boosting, is an upgrade of Gradient Boosting, in which each new model learns to predict the residuals (errors) of the previous decision tree.

Decision trees are trained sequentially to predict these errors, and the model iteratively converges toward its minimum. Here's how a Gradient Boosting Machine model works:

  • The average value of the target column is used as the initial prediction for every input.
  • The residuals (difference/errors) of the predictions with the targets are computed.
  • A decision tree of limited depth is trained to predict just the residuals for each input.
  • Predictions from the decision tree are scaled using a parameter called the learning rate (this helps prevent overfitting).
  • Scaled predictions from the tree are added to the previous predictions to obtain new and improved predictions.
  • Steps 2 to 5 are repeated to create new decision trees, each of which is trained to predict just the residuals from the previous prediction.
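To make these steps concrete, here is a minimal sketch of the same loop using scikit-learn's DecisionTreeRegressor. The toy data, the learning rate of 0.1, and the 50 boosting rounds are illustrative assumptions, not values prescribed by any library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (purely illustrative)
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # scales each tree's contribution (step 4)
n_rounds = 50         # number of boosting rounds
trees = []

# Step 1: start from the average of the target as the prediction for every input
predictions = np.full_like(y, y.mean())

for _ in range(n_rounds):
    residuals = y - predictions                      # step 2: errors of the current predictions
    tree = DecisionTreeRegressor(max_depth=3)        # step 3: shallow tree trained on residuals
    tree.fit(X, residuals)
    predictions += learning_rate * tree.predict(X)   # steps 4-5: scale and add to predictions
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y - predictions) ** 2))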

Here’s a visual representation of gradient boosting:

XGBoost is a framework that adds regularization and automatic tree pruning, both of which help control overfitting; machine learning models often overfit the training dataset, which results in poor performance on the validation set.

What Makes XGBoost Different

XGBoost, as we discussed earlier, is an upgrade of Gradient Boosting; its lineage goes all the way back to Decision Trees, the initial algorithm it builds on to reach where it is today. The algorithm differentiates itself in the following ways:

  1. A wide range of applications: Can be used to solve regression, classification, ranking, and user-defined prediction problems.
  2. Portability: Runs easily on Windows, Linux, and OS X.
  3. Languages: Supports major programming languages including C++, Python, R, Java, Scala, and Julia.
  4. Cloud Integration: Supports AWS and Azure, and works well with Spark and other ecosystems.

The chart below shows the evolution of tree-based algorithms over the years.

Source: towardsdatascience.com

Advantages & Disadvantages of using XGBoost

Advantages:

  • XGBoost has many hyper-parameters that can be tuned, e.g., learning rate, max depth, gamma, number of estimators, etc.
  • XGBoost is built to handle missing values on its own. When the missing parameter is specified, values in the input predictors equal to that parameter are treated as missing; by default, it is NaN (see the sketch after this list).
  • It provides various intuitive features, such as parallelization, auto-pruning, and more.
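As a small illustration of the missing-value handling mentioned above, the sketch below fits an XGBoost model on data containing NaN entries without any imputation step. The toy data and hyperparameter values are assumptions chosen purely for demonstration.

import numpy as np
from xgboost import XGBRegressor

# Toy data with roughly 10% of the entries knocked out (purely illustrative)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.uniform(size=X.shape) < 0.1] = np.nan

# No imputation needed: NaN is the default 'missing' value, and the booster
# learns a default direction for missing entries at each split.
model = XGBRegressor(n_estimators=50, missing=np.nan)
model.fit(X, y)
print(model.predict(X[:5]))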

Disadvantages:

  • Unlike LightGBM, XGBoost requires categorical features to be encoded (for example, one-hot encoded) before they are fed into the model, as shown in the sketch below.
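Here is a minimal sketch of the usual workaround: one-hot encoding a categorical column with pandas before training. The toy DataFrame and its column names are assumptions made up for this example.

import pandas as pd
from xgboost import XGBClassifier

# Toy frame with one categorical column (purely illustrative)
data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
    "age": [25, 32, 40, 28, 35, 30],
    "target": [0, 1, 0, 1, 1, 0],
})

# One-hot encode the categorical column before training
X = pd.get_dummies(data.drop(columns="target"), columns=["city"], dtype=int)
y = data["target"]

model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))

Newer XGBoost releases also offer experimental native categorical support (enable_categorical=True), but explicit encoding remains the most common route.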

Math Behind XGBoost

Now that we have a better understanding of XGBoost, let's dive a little deeper into the math behind it. XGBoost exposes three parameters that help the model avoid overfitting and deliver strong results:

  1. Lambda (for regularization)
  2. Gamma (for node Split)
  3. Eta (for how fast you’d like to converge)

What happens is that a similarity score is calculated for each node with the help of the formula:

Similarity Score = (Sum of Residuals)² / (Number of Residuals + Lambda)

Credit: https://pianalytix.com

There is also a quantity called GAIN, which is given by the formula below. Gain reflects the relative contribution of the corresponding feature to the model: the higher the gain, the more important the feature is when making a prediction.

GAIN = Left Similarity + Right Similarity - Root Similarity

Now, if the gain calculated from the formula above is greater than gamma (a hyperparameter), the node is split; otherwise it is not. This keeps trees from splitting all the way down to pure leaf nodes, which is exactly the kind of unchecked growth that causes overfitting, where the model memorizes the exact training patterns.

Let's take an example where the squared sum of the residuals is 100, the number of residuals is 10, and Lambda is set to 1, which makes the root similarity score 100 / (10 + 1) = 9.09. Let's also assume the left similarity is 40 and the right similarity is 60, so based on the formula above our gain is 40 + 60 - 9.09 = 90.91. If we set Gamma to 90, the node will split, since the gain exceeds Gamma; if Gamma were 100, the node would not split any further.

On the other hand, if we set Lambda to 0 (no regularization), the root similarity becomes 100 / 10 = 10 and the gain becomes 40 + 60 - 10 = 90. This illustrates the inverse relationship between Lambda and the similarity score: increasing Lambda shrinks a node's similarity score, and because the same shrinkage applies to the left and right child nodes as well, a larger Lambda generally lowers the gain and makes it easier for Gamma to prune the split.

In short: the lower the similarity score, the lower the gain; a low gain is more likely to fall below Gamma, which stops the tree from splitting all the way down to the leaf nodes.
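The worked example above can be reproduced in a few lines of Python; the helper functions below are hypothetical names written just for this illustration.

# Reproduce the worked example: sum of residuals squared = 100,
# 10 residuals, Lambda = 1, left similarity = 40, right similarity = 60.

def similarity_score(sum_residuals_squared, n_residuals, lam):
    # Similarity = (sum of residuals)^2 / (number of residuals + lambda)
    return sum_residuals_squared / (n_residuals + lam)

def gain(left_similarity, right_similarity, root_similarity):
    # Gain = Left Similarity + Right Similarity - Root Similarity
    return left_similarity + right_similarity - root_similarity

root = similarity_score(100, 10, lam=1)   # 100 / 11 = 9.09
g = gain(40, 60, root)                    # 40 + 60 - 9.09 = 90.91

gamma = 90
print(f"root similarity = {root:.2f}, gain = {g:.2f}")
print("split the node" if g > gamma else "prune (no split)")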

We can use the ‘feature_importances_’ attribute to get the feature importances and decide which feature is most relevant to the model and which is the least important.

import pandas as pd

# Feature importances of the fitted model, indexed by feature name
# (the last column of df is assumed to be the target).
importancedf = pd.DataFrame(model.feature_importances_,
                            index=df.columns[:-1], columns=['Importance'])
importancedf
XGBClassifier Model Feature Importance

As mentioned earlier, XGBoost does Auto Pruning of the trees that are being created, and the math we discussed earlier is exactly how the pruning is done to avoid overfitting.

The higher the Gamma, the more aggressively pruning is done; the lower the Gamma, the less aggressive the pruning of the trees.

Coming to Eta (the learning rate): this parameter controls how fast or slow we want the model to converge. The default value is 0.3. A low eta value makes the model more robust to overfitting.

A lower eta may need more iterations to converge; with a higher eta, there is a fair chance of overshooting the optimum value (i.e., the convergence point). A general range is 0.1 to 1, and you can experiment to see which value works best for your use case.
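These three knobs map directly onto parameters of XGBoost's scikit-learn API: reg_lambda for Lambda, gamma for Gamma, and learning_rate for Eta. The values below are illustrative starting points rather than recommendations, and X_train / y_train are assumed to be data you have already prepared.

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,    # number of boosting rounds
    learning_rate=0.1,   # eta: how quickly the model converges
    gamma=1.0,           # minimum gain required for a split (controls pruning)
    reg_lambda=1.0,      # L2 regularization used in the similarity score
    max_depth=4,         # limit the depth of each tree
)
# model.fit(X_train, y_train)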

Reason for XGBoost Popularity

Let's discuss a few reasons for XGBoost's strong performance and growing popularity:

  • Handles Missing Values: Usually, we fill missing values using some imputation strategy, but XGBoost handles them natively (learning a default direction for missing values at each split), giving data scientists more time to focus on the business problem at hand.
  • Built-in Cross-Validation: The algorithm comes with a built-in cross-validation method, which reduces the need to wire it up separately. Cross-validation is valuable because it tells you how the model performs on data it has not seen. With k = 5, each of the 5 folds takes a turn as the test set while the remaining folds are used for training. XGBoost ships this capability out of the box (see the sketch after this list), which makes it that much more convenient for users.
  • Parallelization: Although boosting builds trees sequentially, XGBoost parallelizes the work of building each tree (such as split finding) across cores.
  • Tree Pruning: XGBoost uses the ‘max_depth’ parameter and starts pruning trees backward. The ‘depth-first’ approach improves computational performance significantly.
  • Hardware Optimization: The algorithm is designed to make efficient use of hardware resources. It achieves cache awareness by allocating internal buffers in each thread to store gradient statistics, and ‘out-of-core’ computing optimizes available disk space when handling big data frames that do not fit into memory.
  • Weighted Quantile Sketch: XGBoost employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets.
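To illustrate the built-in cross-validation mentioned in the list above, here is a minimal sketch using xgboost's cv helper. X and y are assumed to be a prepared feature matrix and binary target; the parameter values are illustrative only.

import xgboost as xgb

# X and y are assumed to be your prepared features and binary target
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=100,
    nfold=5,                   # 5-fold cross-validation
    metrics="logloss",
    early_stopping_rounds=10,  # stop once the validation metric stops improving
    seed=42,
)
print(cv_results.tail())       # per-round train/test logloss mean and std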

Conclusion

  • XGBoost is a great Machine Learning algorithm that has consistently been shown to perform well.
  • We learned how XGBoost uses software and algorithmic optimizations to deliver results in a short amount of time with less compute.
  • Up Next, I’ll be covering more Algorithms and Architectures that are important for data science like “ModelStacking”, “Ensembling”, and “Meta-Layers”.
  • Follow me! for more upcoming Data Science, Machine Learning, and Artificial Intelligence articles.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to address these gaps, check out the certification programs provided by INSAID on their website. If you liked this article, I recommend you go with the Global Certificate in Data Science & AI, because it will cover your foundations, machine learning algorithms, and deep neural networks (basic to advanced).


INSAID

One of India’s leading institutions providing world-class Data Science & AI programs for working professionals with a mission to groom Data leaders of tomorrow!