# One ML Algorithm for all of your Data Science Needs

## How XGBoost became one of the most popular models in today’s data science community.

# Introduction

First of all, let’s understand what XGBoost means. XGBoost or *Extreme Gradient Boosting* is an **upgrade of Gradient Boosting** where the model simply learns to predict the residuals or the errors of the previous decision tree.

**Sequential decision trees** are trained to predict the errors, and the model iteratively converges toward its **minimum point**. Here’s how a **Gradient Boosting Machine** model works:

- The **average value** of the target column is used as an initial prediction for every input.
- The **residuals** (differences/errors) between the predictions and the targets are computed.
- A decision tree of limited depth is trained to **predict** just the **residuals** for **each input**.
- Predictions from the decision tree are scaled using a parameter called the **learning rate** (this prevents overfitting).
- **Scaled predictions** from the tree are added to the previous predictions to obtain new and improved predictions.
- Steps 2 to 5 are repeated to create new decision trees, each of which is trained to predict just the **residuals** from the **previous prediction**.
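The steps above can be sketched in a few lines of plain Python with NumPy. This is a toy illustration of the boosting loop, not XGBoost itself; the depth-1 stump fitter and the dataset are made up for demonstration:

```python
import numpy as np

# Toy 1-D dataset: learn y as a function of x.
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * x

def fit_stump(x, residuals):
    """Depth-1 'tree': pick the split on x that best predicts the residuals."""
    best = None
    for threshold in x:
        left, right = residuals[x <= threshold], residuals[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= threshold, left.mean(), right.mean())
        sse = np.sum((residuals - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

learning_rate = 0.3
predictions = np.full_like(y, y.mean())      # step 1: start from the average
for _ in range(100):                         # step 6: repeat steps 2-5
    residuals = y - predictions              # step 2: errors of current prediction
    stump = fit_stump(x, residuals)          # step 3: tree fit to the residuals
    predictions += learning_rate * stump(x)  # steps 4-5: scaled update

print(np.mean((y - predictions) ** 2))  # training MSE shrinks as boosting proceeds
```

Each round chips away at the remaining residuals, which is why the ensemble converges even though every individual tree is weak.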

Here’s a visual representation of gradient boosting:

XGBoost is a framework that adds **regularization and auto-pruning of trees**, which helps counter overfitting: machine learning models tend to overfit the training dataset, resulting in poor performance on the validation set.

# What Makes XGBoost Different

XGBoost, as we discussed earlier, is an upgrade of Gradient Boosting; however, its lineage goes all the way back to **decision trees as the initial algorithm**, on which later methods built to reach where it is today. The algorithm differentiates itself in the following ways:

- **A wide range of applications**: can be used to solve regression, classification, ranking, and user-defined prediction problems.
- **Portability**: runs easily on Windows, Linux, and OS X.
- **Languages**: supports major programming languages, including C++, Python, R, Java, Scala, and Julia.
- **Cloud integration**: supports AWS and Azure, and works well with Spark and other ecosystems.

The chart below shows the evolution of tree-based algorithms over the years.

# Advantages & Disadvantages of using XGBoost

**Advantages:**

- XGBoost has **many hyperparameters** that can be tuned, e.g. learning rate, max depth, gamma, number of estimators, etc.
- XGBoost can **handle missing values on its own**. When the `missing` parameter is specified, values in the input predictors equal to that parameter are treated as missing; by default it is `NaN`.
- It provides various intuitive features, such as **parallelization**, **auto-pruning**, and more.

**Disadvantages:**

- Unlike LightGBM, in XGBoost one has to **encode categorical features first** before feeding them into the model.
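As a minimal sketch (the column names here are hypothetical), a categorical column can be one-hot encoded with pandas before training:

```python
import pandas as pd

# Hypothetical dataset with a categorical column that XGBoost
# cannot consume directly as raw strings.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "income": [50, 60, 55, 45],
})

# One-hot encode the categorical column; the result is all-numeric
# and can be passed to an XGBoost model.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```

LightGBM, by contrast, can be told which columns are categorical and handles them natively.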

# Math Behind XGBoost

Now that we understand XGBoost a little better, let’s dive deeper into the math behind it. XGBoost **exposes three parameters** that help the model avoid overfitting and deliver great results:

- **Lambda** (for regularization)
- **Gamma** (for node splitting)
- **Eta** (for how fast you’d like to converge)

First, a similarity score is calculated for each node using the formula:

Similarity Score = (Sum of Residuals)² / (Number of Residuals + λ)

There is also a quantity called **gain**, given by the formula below. Gain measures the relative contribution of a candidate split (and hence of the corresponding feature) to the model: the higher the gain, the more important that split is when making a prediction.

Gain = Left Similarity + Right Similarity - Root Similarity

Now, if the gain calculated from the formula above is greater than gamma (a hyperparameter), the **node splits; otherwise it does not**. This is the pruning mechanism: without it, trees would keep splitting until they memorize the exact training patterns, which is precisely how overfitting happens.

Let’s take an example where the **squared sum of residuals** is 100, the number of residuals is 10, and lambda is set to 1, which makes the similarity score 100 / (10 + 1) ≈ 9.09. Let’s also assume the left similarity is 40 and the right similarity is 60; by the formula above, the gain is 40 + 60 − 9.09 = 90.91. If gamma is set to 90, the gain exceeds it and the node will split further; if gamma were 100, the gain would fall short and splitting would stop.

On the other hand, suppose we set lambda to 0 (no regularization): the root similarity rises to 100 / 10 = 10, and, holding the left and right similarities fixed, the gain drops to 90, which no longer exceeds a gamma of 90. This shows that lambda and the similarity score have an inverse relationship.

In short: the smaller the similarity score, the smaller the gain; and when the gain falls below gamma, the tree is not allowed to split all the way down to the leaf nodes.
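The arithmetic above can be checked in a few lines (a toy calculation, with the left and right similarities taken as given):

```python
def similarity(sum_of_residuals_squared, n_residuals, lam):
    # XGBoost similarity score: (sum of residuals)^2 / (n + lambda)
    return sum_of_residuals_squared / (n_residuals + lam)

root = similarity(100, 10, lam=1)   # 100 / 11 = 9.09...
gain = 40 + 60 - root               # left + right - root = 90.91...

print(gain > 90)    # gamma = 90  -> True, the node splits
print(gain > 100)   # gamma = 100 -> False, the node is pruned

# With lambda = 0 the root similarity rises to 10 and the gain drops to 90,
# so a gamma of 90 is no longer exceeded and splitting stops.
gain_no_reg = 40 + 60 - similarity(100, 10, lam=0)
print(gain_no_reg)  # 90.0
```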

We can use the `feature_importances_` attribute to get the feature importances and decide which features are most, and least, relevant to the model.

```python
import pandas as pd

# Build a one-column DataFrame of importances, indexed by feature name
# (assumes a fitted model and a df whose last column is the target).
importancedf = pd.DataFrame(data=model.feature_importances_).T
importancedf.columns = df.columns[:-1]   # all columns except the target
importancedf = importancedf.T
importancedf.rename(columns={0: 'Importance'}, inplace=True)
importancedf
```

As mentioned earlier, XGBoost does Auto Pruning of the trees that are being created, and the math we discussed earlier is exactly how the pruning is done to **avoid overfitting**.

The higher the gamma, the more aggressive the pruning; the lower the gamma, the less aggressive the pruning.

Coming to eta (the learning rate), this parameter controls **how fast or slow the model converges**. Its default value is 0.3. A low eta makes the model more robust to overfitting.

A lower eta may need more iterations to converge; with a higher eta, there is a fair chance of missing the optimum value (i.e., the convergence point). A general range of 0.1 to 1 is worth experimenting with to find which value works best for your use case.
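A toy illustration of this trade-off (not XGBoost itself): treat each boosting round as adding eta times the remaining residual, and count how many rounds it takes to get close to the target.

```python
def rounds_to_converge(eta, target=100.0, tol=0.01):
    """Count boosting-style updates needed to get within tol of target.

    Each round adds eta * (remaining residual), mimicking how a scaled
    tree prediction shrinks the error by a factor of (1 - eta) per round.
    """
    prediction, rounds = 0.0, 0
    while abs(target - prediction) > tol:
        prediction += eta * (target - prediction)
        rounds += 1
    return rounds

for eta in (0.1, 0.3, 1.0):
    print(eta, rounds_to_converge(eta))
```

A smaller eta takes many more rounds to converge, which is the price paid for the extra robustness.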

# Reason for XGBoost Popularity

Let’s discuss a few reasons for XGBoost’s amazing performance and growing popularity:

- **Handles missing values**: usually, we fill missing values using some sort of strategy, but XGBoost has an **inbuilt imputer** that takes care of missing values, giving data scientists more time to focus on the business problem at hand.
- **Built-in cross-validation**: the algorithm comes with a built-in cross-validation method, which reduces the need to wire this up explicitly. Cross-validation is great because it shows **how the model performs on unseen data**: with k = 5, each fold is used once as the test data while the remaining folds are used for training.
- **Parallelization**: XGBoost builds trees sequentially but uses a parallelized implementation within each tree.
- **Tree pruning**: XGBoost uses the `max_depth` parameter and prunes trees backward. This ‘depth-first’ approach improves computational performance significantly.
- **Hardware optimization**: the algorithm is designed to make efficient **use of hardware resources**, **accomplished via cache awareness** by allocating internal buffers in each thread to store gradient statistics. Concepts such as ‘out-of-core’ computing make use of available disk space to handle big **data frames that do not fit into memory**.
- **Weighted quantile sketch**: XGBoost employs the distributed weighted quantile sketch algorithm to effectively **find the optimal split points** in weighted datasets.
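XGBoost exposes its built-in cross-validation through `xgboost.cv`; the fold mechanics that cross-validation relies on can be sketched in plain Python (a simplified illustration that ignores shuffling and uneven fold sizes):

```python
def kfold_indices(n_samples, k=5):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = [i for i in indices if i not in test]
        yield train, test

# With 10 samples and k = 5, each split trains on 8 samples and tests on 2.
for train, test in kfold_indices(10, k=5):
    print(len(train), test)
```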

# Conclusion

- **XGBoost** is a great machine learning algorithm, since it’s known to **perform well** across a wide range of problems.
- We learned how XGBoost uses **software and algorithmic optimizations** to **deliver results** in a short amount of time with **less computing**.
- **Up next**, I’ll be covering **more algorithms and architectures** that are **important for data science**, like **“Model Stacking”**, **“Ensembling”**, and **“Meta-Layers”**.
- **Follow me** for more **upcoming Data Science, Machine Learning, and Artificial Intelligence articles**.

# Final Thoughts and Closing Comments

There are **some vital points** many **people fail to understand** while pursuing their **Data Science** or **AI journey**. If you are one of them and are looking for a way to **counterbalance** these **shortcomings**, check out the certification programs provided by **INSAID** on their website. If you liked this article, I recommend the Global Certificate in Data Science & AI, since it covers foundations, machine learning algorithms, and deep neural networks (basic to advanced).