Applying AutoML (Part-2) with MLBox

7 min read · Mar 12, 2021
  1. MLBox is a powerful automated machine learning (AutoML) library for Python.
    It provides the following features:
  • Fast reading and distributed data preprocessing/cleaning/formatting.
  • Highly robust feature selection and leak detection.
  • Accurate hyper-parameter optimization in high-dimensional space.
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…).
  • Prediction with model interpretation.

2. MLBox can be used for a variety of ML tasks, including data preprocessing.
It drops high-cardinality and unnamed columns from the data.

3. MLBox can then apply the same transformations to the test data provided.

4. The main MLBox package contains 3 sub-packages:
preprocessing, optimisation, and prediction.

5. They are aimed, respectively, at reading and preprocessing data, testing or optimizing a wide range of learners, and predicting the target on a test dataset.

Also, check out our articles on:

Introduction to AutoML-The future of industry ML execution
Applying AutoML (Part-1) using Auto-Sklearn
Applying AutoML (Part-3) with TPOT
Applying AutoML (Part-4) using H2O
Automated Hyperparameter tuning
AutoML on the Cloud


  • Automatically identifies the type of task (classification or regression).
  • Preprocesses the data while reading it and removes drifting variables, which cleans the files.
  • Uses entity embedding to encode categorical variables, creating new informative features.


  • Things may break unexpectedly, since the library is still under active development.
  • Variable selection is purely mathematical, so it may drop variables that matter from a business standpoint.


mlbox.preprocessing.Reader(sep=None, header=0, to_hdf5=False, to_path='save', verbose=True)
  • The preprocessing sub-package contains a Reader class.
    * It reads and cleans the data.
    * It accepts only CSV, XLS, JSON, and h5 files.
    1. sep = Delimiter to use when reading a file.
    2. header = Accepts an int or None. If header is 0, the first line is treated as the header.
    3. to_hdf5 = Accepts a boolean value. If True, each file is also dumped to HDF5 format.
    4. to_path = Name of the folder where files are saved.

The Reader also offers several methods.

→ clean

clean(path, drop_duplicate = False)
  • It deletes unnamed columns.
  • It casts lists into variables.
  • It tries to cast numerical variables to float.
  • It extracts the year, month, day, day of the week, and hour from timestamps.
  • It optionally drops duplicate rows.
  • It returns a clean pandas DataFrame.
    1. path = The path to the dataset.
    2. drop_duplicate = Accepts a boolean value. If True, duplicates are dropped.
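
To make the timestamp step concrete, here is a minimal sketch in plain Python of the kind of expansion described above. It is not MLBox's code: the helper name `expand_timestamp`, the output keys, and the input format are assumptions chosen for illustration.

```python
from datetime import datetime


def expand_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into calendar parts (hypothetical helper)."""
    ts = datetime.strptime(value, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "dayofweek": ts.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": ts.hour,
    }


parts = expand_timestamp("2021-03-12 14:30:00")
print(parts)
```

Each part becomes its own numerical column, which a model can use far more easily than the raw timestamp string.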

→ train_test_split

train_test_split(Lpath, target_name)
  • It creates train and test datasets from the files at the given paths.
  • It automatically determines the task and, for classification problems, encodes the target.
    1. Lpath = List of paths to the datasets.
    2. target_name = Name of the target variable in the dataset.

Drift Thresholding:

dft = Drift_thresholder()
df = dft.fit_transform(df)
  • It automatically drops variables that drift between the train and test data.
    1. threshold = Set to 0.6 by default; it must be between 0 and 1.
    The lower the threshold, the more strictly only stable (non-drifting) variables are kept: a feature with a drift measure of 0 is very stable, and one with a drift measure of 1 is highly unstable.
    2. inplace = Accepts a boolean value. If True, the train and test datasets are transformed in place; otherwise, the originals are left unchanged.
    3. to_path = Name of the folder where the list of drift coefficients is saved.

Drift occurs when the train and test data come from different distributions, which leads to poor model performance.
For example:
If our train data only describes people aged 20–30, a model trained on it will likely perform poorly on people over 50.
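
One way to put a number on this idea is to ask how well a single feature separates train rows from test rows. The sketch below is a simplified stand-in, not MLBox's implementation: it computes a rank-based AUC for "this value came from the test set" and rescales it so that 0 means no drift and 1 means complete separation. The function names and the toy data are assumptions made for illustration.

```python
def auc_train_vs_test(train_values, test_values):
    """Chance that a random test value exceeds a random train value (ties count 0.5)."""
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))


def drift_score(train_values, test_values):
    """Rescale the AUC to [0, 1]: 0 = indistinguishable, 1 = fully separated."""
    return 2.0 * abs(auc_train_vs_test(train_values, test_values) - 0.5)


def stable_features(train, test, threshold=0.6):
    """Keep the features whose drift score does not exceed the threshold."""
    return [name for name in train
            if drift_score(train[name], test[name]) <= threshold]


# Toy data: ages differ completely between train and test, fares do not.
train = {"Age": [21, 23, 25, 27, 29], "Fare": [10, 20, 30, 40]}
test = {"Age": [52, 55, 60, 63, 70], "Fare": [10, 20, 30, 40]}

kept = stable_features(train, test)
print(kept)  # ['Fare']
```

Here "Age" gets a drift score of 1 and is dropped, while "Fare" scores 0 and is kept, which mirrors the age example above.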

Drift_thresholder also offers the following methods.

→ drifts
It returns a dictionary containing drifts for each feature.

→ fit_transform

  • It fits and transforms the train and test data.
  • It automatically drops IDs and variables that drift between the train and test datasets.
  • The list of drift coefficients is saved as “drifts.txt”.


The encoding sub-module helps impute missing values and encode categorical data as numerical values.

→ Missing values

  • MLBox provides an encoder that imputes missing values for both numerical and categorical data.
mlbox.encoding.NA_encoder(numerical_strategy='mean', categorical_strategy='<NULL>')
    1. numerical_strategy = Set to “mean” by default. Other options are “median”, “most_frequent”, or any int/float value.
    2. categorical_strategy = Set to “<NULL>” by default. Other options are “most_frequent” or any string.
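
The strategies listed above can be illustrated with a few lines of plain Python. This is a sketch of the idea only, not how MLBox implements it; the helper `impute` and the sample columns are assumptions.

```python
from collections import Counter
from statistics import mean, median


def impute(values, strategy):
    """Fill None entries using a named strategy or a constant (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = strategy  # any other value is used as a constant
    return [fill if v is None else v for v in values]


ages = [22, None, 38, 30]
embarked = ["S", None, "C", "S"]

print(impute(ages, "mean"))               # missing age replaced by the mean
print(impute(embarked, "most_frequent"))  # missing port replaced by the mode
print(impute(embarked, "<NULL>"))         # missing port replaced by a constant
```

Using a constant such as “<NULL>” keeps "missing" visible as its own category, whereas “most_frequent” blends missing rows into the largest group.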

→ Categorical features

mlbox.encoding.Categorical_encoder(strategy='label_encoding', verbose=False)
  • It encodes categorical features, with multiple strategies available.
    1. strategy = The strategy to encode categorical features. Available strategies are “label_encoding”, “dummification”, “random_projection”, and “entity_embedding”.
    2. verbose = Accepts a boolean value; it is useful for the entity embedding strategy.
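
To show what the two simplest strategies produce, here is a small plain-Python sketch of label encoding and dummification (one-hot encoding). It is an illustration under assumptions, not MLBox's code: the function names are hypothetical, and the integer codes MLBox assigns may be ordered differently.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[v] for v in values], mapping


def dummify(values):
    """Build one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {cat: [int(v == cat) for v in values] for cat in categories}


embarked = ["S", "C", "S", "Q"]

codes, mapping = label_encode(embarked)
dummies = dummify(embarked)

print(codes)    # [0, 1, 0, 2]
print(dummies)  # {'C': [0, 1, 0, 0], 'Q': [0, 0, 0, 1], 'S': [1, 0, 1, 0]}
```

Label encoding keeps a single column but implies an order between categories, while dummification avoids that at the cost of one column per category.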


MLBox offers models for both classification and regression tasks.
The models are built as wrappers around the scikit-learn library.

→ Classification

Feature Selection:

mlbox.model.classification.Clf_feature_selector(strategy='l1', threshold=0.3)
  • It selects useful features, for classification problems only.
  • It offers several filter and wrapper methods.
    1. strategy = “l1” by default. Other options are “variance” and “rf_feature_importance”.
    2. threshold = The fraction of variables to discard according to the strategy. It must be between 0 and 1.
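
As a rough illustration of how a threshold of 0.3 translates into dropped columns, the sketch below ranks features by variance and discards the lowest-ranked fraction. It is not MLBox's selector: the function name is hypothetical, and rounding the number of dropped columns down is an assumption.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.3):
    """Discard the `threshold` fraction of columns with the lowest variance."""
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]))
    n_drop = int(threshold * len(ranked))  # assumption: round down
    dropped = set(ranked[:n_drop])
    # Preserve the original column order for the survivors.
    return [name for name in columns if name not in dropped]


columns = {
    "constant": [1, 1, 1, 1],
    "narrow": [1, 2, 1, 2],
    "wide": [0, 10, 20, 30],
    "wider": [0, 100, 200, 300],
}

kept = select_by_variance(columns, threshold=0.3)
print(kept)  # ['narrow', 'wide', 'wider']
```

With four columns and a threshold of 0.3, one column is dropped: the constant one, which carries no information.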

Classification model:

  • It offers a variety of algorithms to model the data.
    1. strategy = “LightGBM” by default; other options are “RandomForest”, “ExtraTrees”, “Tree”, “Bagging”, “AdaBoost”, and “Linear”.
    2. **params = Parameters passed to the underlying classifier.


  • The stacking classifier is built with a cross-validation method.
  • It uses the predictions of several classifiers as input to a second-layer estimator.
    1. base_estimators = The list of estimators to fit in the first level using cross-validation.
    2. level_estimator = LogisticRegression by default; the estimator used in the second and last level.
    3. n_folds = 5 by default; the number of folds used to generate the meta-features for the training set.
    4. drop_first = Accepts a boolean value. If True, it outputs n_classes-1 probabilities.
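
The cross-validated generation of meta-features can be sketched in plain Python. This is an illustration of the general stacking idea, not MLBox's implementation: the two toy base learners, the contiguous fold layout, and all function names are assumptions.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


def fit_prior(X, y):
    """Toy base learner: always predicts the positive rate seen in training."""
    rate = sum(y) / len(y)
    return lambda row: rate


def fit_threshold(X, y):
    """Toy base learner: predicts 1.0 if the first feature exceeds the training mean."""
    cut = sum(row[0] for row in X) / len(X)
    return lambda row: 1.0 if row[0] > cut else 0.0


def out_of_fold_meta_features(X, y, base_fitters, n_folds):
    """Each row's meta-features come from models that never saw that row."""
    meta = [[None] * len(base_fitters) for _ in X]
    for train_idx, valid_idx in kfold_indices(len(X), n_folds):
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for j, fit in enumerate(base_fitters):
            predict = fit(X_tr, y_tr)
            for i in valid_idx:
                meta[i][j] = predict(X[i])
    return meta


X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

meta = out_of_fold_meta_features(X, y, [fit_prior, fit_threshold], n_folds=3)
for row in meta:
    print(row)
```

The second-level estimator is then trained on `meta` instead of `X`. Because every meta-feature is an out-of-fold prediction, the second level does not learn from predictions that were made on data the base learners had already seen.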

→ Regression

Feature Selection:

mlbox.model.regression.Reg_feature_selector(strategy='l1', threshold=0.3)
  • It helps select useful features.
  • It offers both filter and wrapper methods.
  • It is used only for regression problems.
    1. strategy = “l1” by default. Other options are “variance” and “rf_feature_importance”.
    2. threshold = The fraction of variables to discard according to the strategy.

Regression model:

  • It offers a variety of algorithms to model the data.
    1. strategy = “LightGBM” by default; other options are “RandomForest”, “ExtraTrees”, “Tree”, “Bagging”, “AdaBoost”, and “Linear”.
    2. **params = Parameters passed to the underlying regressor.


  • It is used to optimize the whole pipeline.
mlbox.optimisation.Optimiser(scoring=None, n_folds=2, random_state=1, to_path='save', verbose=True)
    1. scoring = If None, “neg_log_loss” is used for classification and “neg_mean_squared_error” for regression. It accepts scikit-learn scoring names (for example, “accuracy”).
    2. n_folds = The number of cross-validation folds.


mlbox.prediction.Predictor(to_path='save', verbose=True)
  • It fits the model and predicts on the test dataset.
  • The test data should not contain the target variable.
    1. to_path = Name of the folder where the results are saved.

Python implementation of MLBox

We will use our good old Titanic data to implement MLBox.

1. Installing the Package:

It is pretty easy to install and directly start using it.

!pip install mlbox

2. Importing the modules from MLBox

from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

3. Specifying the path and target name

In MLBox we need to specify the path where our train and test data are present along with the target name.

paths = ["/contents/train.csv","/contents/test.csv"]
target_name = "Survived"

4. Preprocessing and splitting our data

rd = Reader(sep = ",")
df = rd.train_test_split(paths, target_name)

5. Drift thresholding our data to remove any kind of bias

dft = Drift_thresholder()
df = dft.fit_transform(df)

6. Initializing the optimizer

# Hyperparameter tuning
opt = Optimiser(scoring = "accuracy", n_folds = 5)
  • We will define a search space for the optimizer to search over.
space = {'est__strategy':{"search":"choice","space":["LightGBM"]},
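
As a separate illustration of how such a search space can be laid out, here is a hypothetical one. The entries and value ranges are assumptions, not the ones used in this article. The key prefixes follow the convention assumed here from MLBox's documentation: `ne__` for the missing-value encoder, `ce__` for the categorical encoder, `fs__` for the feature selector, and `est__` for the estimator. The `describe` helper is only there to print the layout.

```python
# Hypothetical search space; the entries and value ranges are illustrative only.
example_space = {
    "ne__numerical_strategy": {"search": "choice", "space": ["mean", "median"]},
    "ce__strategy": {"search": "choice", "space": ["label_encoding", "dummification"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [3, 5, 7]},
}

# Assumed meaning of each prefix: the pipeline step the key configures.
STEP_PREFIXES = {
    "ne": "missing-value encoder",
    "ce": "categorical encoder",
    "fs": "feature selector",
    "est": "estimator",
}


def describe(space):
    """Return one readable line per hyper-parameter in the search space."""
    lines = []
    for key, spec in space.items():
        prefix, param = key.split("__", 1)
        lines.append(
            f"{STEP_PREFIXES[prefix]}: {param} <- {spec['search']} over {spec['space']}"
        )
    return lines


for line in describe(example_space):
    print(line)
```

A "choice" entry lists discrete candidates, while a "uniform" entry gives the lower and upper bound of a continuous range.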

7. Optimizing

params = opt.optimise(space, df,15)
  • We pass in our search space, our data, and max_evals.
  • We set max_evals to 15; it is the number of hyper-parameter configurations the optimizer evaluates while searching for the best one. By default it is set to 40.
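
To show what an evaluation budget means in practice, here is a sketch of a plain random search over a small search space. It is a stand-in for illustration, not MLBox's optimiser: the search space, the toy scoring function, and the function names are all assumptions. In a real run the score would come from cross-validating the pipeline.

```python
import random


def sample(space, rng):
    """Draw one configuration from a search space of 'choice'/'uniform' entries."""
    params = {}
    for name, spec in space.items():
        kind = spec.get("search", "choice")
        if kind == "choice":
            params[name] = rng.choice(spec["space"])
        elif kind == "uniform":
            low, high = spec["space"]
            params[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {kind}")
    return params


def random_search(space, score_fn, max_evals, seed=0):
    """Evaluate `max_evals` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_evals):
        candidate = sample(space, rng)
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_score(params):
    """Toy objective (assumption): best at max_depth=5 and subsample=0.8."""
    return -((params["est__max_depth"] - 5) ** 2) - (params["est__subsample"] - 0.8) ** 2


space = {
    "est__max_depth": {"search": "choice", "space": [3, 5, 7]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.9]},
}

best_params, best_score = random_search(space, toy_score, max_evals=15)
print(best_params, best_score)
```

A larger budget explores more of the space and tends to find a better configuration, at the cost of a longer run, which is the trade-off behind choosing 15 evaluations instead of the default.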

8. Prediction

  • The prediction step reports the features that mattered most for the prediction.
  • It shows the top 10 predictions along with the probabilities for each class.

