# Automated Hyperparameter tuning

Ever since advanced algorithms were introduced to the field of Machine Learning, **hyperparameter tuning has been a tedious task**.

- A hyperparameter is a setting chosen before training that controls how an ML model learns and, as a result, how it behaves.
- Hyperparameters are tunable, and tuning them well is how we get the best-performing model.
- It’s always tricky to find the optimal combination of hyperparameters for a specific model and task.
- Not only does it take time to write lines and lines of code, it also takes time to train every candidate model.

## Also, check out our articles on:

- *Introduction to AutoML-The future of industry ML execution*
- *Applying AutoML (Part-1) using Auto-Sklearn*
- *Applying AutoML (Part-2) with MLBox*
- *Applying AutoML (Part-3) with TPOT*
- *Applying AutoML (Part-4) using H2O*
- *AutoML on the Cloud*

# Traditional Hyperparameter Tuning!

Let’s look at the traditional way to tune a RandomForest model.

We are using the RandomForest model because most of us are comfortable with it and know most of the hyperparameters associated with it.

**There are three types of hyperparameter searches:**

a. *GridSearch*

b. *RandomSearch*

c. *ManualSearch*

Read about them in detail in our article: Hyper-parameter Tuning

We will compare the time taken by Grid Search and Random Search to find the optimal hyperparameters.

**→ Grid Search:**

As you can see, it **took 3 hours** to find a set of hyperparameters for only 800 data points. Imagine passing millions of data points! **Grid Search would take days to run.**

**→ Random Search:**

Even though RandomSearch took only 2 min, it never guarantees an optimal set of hyperparameters.

## → Problems with the traditional method.

- Increased time complexity.
- **GridSearch** is an exhaustive search algorithm that runs over every combination of the values we pass to tune the model, whereas **RandomSearch** *randomly chooses among the values*.
- **GridSearch** suffers from the *curse of dimensionality*:
  - **→** The number of model evaluations grows multiplicatively with each hyperparameter added to the search.
  - **→** Additionally, it is not even guaranteed to find the best solution, because it only tries the values listed in the grid.
- The drawback of **RandomSearch** is unnecessarily **high variance**. The method is entirely random and uses no intelligence in selecting which points to try.
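To make the cost difference concrete, here is a small sketch that only counts candidates and does not train anything. The grid values are illustrative assumptions (loosely based on the ranges used later in this article), and the 5-fold cross-validation is also an assumption.

```python
from itertools import product
import random

# Hypothetical grid for a RandomForest; the values are illustrative only.
grid = {
    "n_estimators": [200, 600, 900, 1200, 1500],
    "max_depth": list(range(10, 81, 10)),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10, 12, 15],
    "min_samples_leaf": [1, 2, 4, 7, 9],
    "bootstrap": [True, False],
}
cv_folds = 5  # assumed number of cross-validation folds

# Grid search tries every combination, so the sizes multiply.
grid_points = list(product(*grid.values()))
n_grid = len(grid_points)
n_fits_grid = n_grid * cv_folds

# Random search draws a fixed budget of combinations, whatever the grid size.
rng = random.Random(0)
n_iter = 100
random_points = [
    {name: rng.choice(values) for name, values in grid.items()}
    for _ in range(n_iter)
]
n_fits_random = n_iter * cv_folds

print(f"grid search:   {n_grid} combinations, {n_fits_grid} model fits")
print(f"random search: {n_iter} combinations, {n_fits_random} model fits")
```

Adding one more hyperparameter with five values would multiply the grid size by five, while the random-search budget would stay where we set it.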

# Why use Automated-Hyperparameter tuning?

- GridSearch and RandomSearch are hands-off; however, they take a very long time to run. This is because they spend most of their time evaluating parameters in regions of the search space that do not add any value.
- Increasingly, hyperparameter tuning is done by automated methods that aim to find optimal hyperparameters in less time using an **informed search**, with no manual effort necessary beyond the initial set-up.

Before moving on to the packages for automated hyperparameter tuning in Python, let us understand what Bayesian Optimization is.

# Bayesian Optimization

- **Sequential model-based optimization** (SMBO, also known as Bayesian optimization) is one of the most efficient strategies, per function evaluation, for function minimization.
- This efficiency makes it well suited to tuning the hyperparameters of ML models that are slow to train.
- **SMBO** methods are used when a user wants to *minimize* some *scalar-valued function* f(x) that takes a lot of time to evaluate.
- The advantages of **SMBO** are that it:
  - Leverages smoothness without an analytic gradient.
  - Handles every type of variable (real-valued, discrete, and conditional variables).
  - Handles evaluation of the function **f(x)** in parallel.
  - Adapts to hundreds of variables, even with a budget of just a few hundred function evaluations.
- **Bayesian Optimization** is an approach that uses the Bayes Theorem to direct the search for the minimum or maximum of an objective function.
- It is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.
- **Bayesian optimization** takes past evaluations into account when picking the next set of hyperparameters.
- By picking hyperparameters in this informed way, it can focus on the areas of the parameter space that it believes will yield the most promising validation scores.

# Idea Behind Bayesian Optimization

To understand it, let’s go back to the Bayes Theorem.

Bayesian Optimization uses the Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.

- We know that the Bayes Theorem describes the probability of an event, supported by prior knowledge of conditions that may be associated with the event.

In simple terms, Bayes Theorem calculates the conditional probability of an event:

`P(A|B) = P(B|A) * P(A) / P(B)`

Now, when we apply the same logic to hyperparameter tuning, we get:

`P(metric | combination of hyperparameter) = P(combination of hyperparameter | metric) * P(metric) / P(combination of hyperparameter)`

Here,

* **P(metric | combination of hyperparameter)** gives the probability of the given metric being minimized/maximized given the combination of hyperparameter values.
* **P(combination of hyperparameter | metric)** is the probability of a certain hyperparameter combination if the given metric is minimized/maximized.
* **P(metric)** is the prior: our initial belief about the metric before any hyperparameter combination has been evaluated.
* **P(combination of hyperparameter)** is the probability of getting that particular hyperparameter combination.

However, we do not want to calculate the conditional probabilities exactly; instead, we want to optimize a quantity.

We can simplify the equation by dropping the normalizing value P(B), which turns the conditional probability equation into a proportionality.

Now, we get:

`P(A|B) ∝ P(B|A) * P(A)`

This can also be written as **posterior ∝ likelihood * prior**
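The effect of dropping the normalizing value can be checked with a few lines of arithmetic. The sketch below uses made-up numbers: two hypothetical outcomes for A ("meets target" and "misses target") with assumed priors and likelihoods. It shows that the unnormalized product ranks the outcomes in the same order as the fully normalized posterior.

```python
# Hypothetical prior beliefs P(A) about whether a configuration meets the target.
prior = {"meets target": 0.3, "misses target": 0.7}

# Hypothetical likelihoods P(B|A) of observing evidence B under each outcome.
likelihood = {"meets target": 0.8, "misses target": 0.4}

# posterior is proportional to likelihood * prior (no division by P(B) yet)
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# P(B) is just the sum of the unnormalized terms; dividing by it
# rescales the values so they sum to 1 but does not reorder them.
evidence = sum(unnormalized.values())
posterior = {a: value / evidence for a, value in unnormalized.items()}

print("unnormalized:", unnormalized)
print("posterior:   ", posterior)
```

Since an optimizer only needs to know which option scores higher, the proportional form is enough.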

- The search then moves through various hyperparameter configurations, each one chosen using the results of the previous ones, so the model is trained with a better hyperparameter combination with each passing run.

The posterior captures the updated beliefs about the unknown objective function. One may also interpret this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface).

- Bayesian optimization builds a posterior distribution over the function being optimized, then uses an acquisition function to sample from that posterior and choose the next set of parameters to explore.
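The loop described above can be sketched in plain Python. This is a toy illustration under loud assumptions, not how any of the libraries below work internally: the objective is a cheap stand-in function, the surrogate is deliberately crude (it predicts the loss of the nearest evaluated point and uses the distance to that point as its uncertainty), and the acquisition rule is a simple lower confidence bound with an arbitrary `kappa`.

```python
def objective(x):
    """Hypothetical stand-in for an expensive validation loss."""
    return (x - 1.3) ** 2

# Candidate values of a single hyperparameter, from -5.0 to 5.0.
candidates = [i / 10 for i in range(-50, 51)]

# A few initial evaluations to seed the history.
initial = [(x, objective(x)) for x in (-4.0, 0.0, 4.0)]
history = list(initial)

kappa = 2.0   # assumed exploration weight
n_iter = 15   # assumed evaluation budget

def surrogate(x, history):
    """Crude surrogate: (predicted loss, uncertainty) from the nearest evaluated point."""
    nearest_x, nearest_y = min(history, key=lambda obs: abs(obs[0] - x))
    return nearest_y, abs(nearest_x - x)

def acquisition(x, history):
    """Lower confidence bound: low predicted loss or high uncertainty both look attractive."""
    mean, uncertainty = surrogate(x, history)
    return mean - kappa * uncertainty

for _ in range(n_iter):
    tried = {x for x, _ in history}
    untried = [x for x in candidates if x not in tried]
    # Pick the candidate the acquisition rule likes best, then pay for one real evaluation.
    next_x = min(untried, key=lambda x: acquisition(x, history))
    history.append((next_x, objective(next_x)))

best_x, best_y = min(history, key=lambda obs: obs[1])
print(f"best x = {best_x}, loss = {best_y:.4f}, evaluations = {len(history)}")
```

Each iteration uses everything learned so far to decide where to spend the next expensive evaluation, which is the difference from grid or random search.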

## Surrogate Function

- The **surrogate function** can be thought of as a substitute for the target function.
- It is used to propose parameter sets that are likely to improve the objective function, for example the accuracy score.
- The **surrogate function** approximates the mapping from input examples to an output score.
- **The problem** of forming a **surrogate model** is usually handled as a regression problem: we provide the data as input (with a set of hyperparameters) and it returns an estimate of the objective function, parameterized by a mean and a standard deviation.

**The common choices for surrogate models are:**

**→ Gaussian Process Regression**

* Gaussian Processes are considered a good method for modeling loss functions in a model-based optimization context.
* The Gaussian Process works by building a joint probability distribution over the input features and the true values of the objective function. With sufficient iterations, it is able to capture a valid estimate of the objective function.
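As a rough illustration of how a Gaussian Process turns a handful of evaluations into a mean and a standard deviation, here is a minimal one-dimensional sketch in plain Python. The RBF kernel, the unit length scale, the zero prior mean, the tiny noise term and the observed losses are all assumptions made for the example; real libraries handle these choices (and the linear algebra) far more carefully.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs get similar function values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_predict(x_new, xs, ys, noise=1e-8):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(x_new, a) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    v = solve(K, k_star)       # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Hypothetical evaluated hyperparameter values and their losses.
xs = [0.0, 1.0, 3.0]
ys = [0.30, 0.22, 0.35]

for x_new in (1.0, 2.0, 50.0):
    mean, std = gp_predict(x_new, xs, ys)
    print(f"x = {x_new:5.1f}: mean = {mean:.4f}, std = {std:.4f}")
```

At an already evaluated point the uncertainty collapses to nearly zero, while far from any data the prediction falls back to the prior (mean 0, standard deviation 1 here). That gap in uncertainty is what the acquisition function exploits.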

**→ Tree-structured Parzen Estimator (TPE)**

* The **Tree-structured Parzen Estimator (TPE)** algorithm is designed to optimize hyperparameters by proposing configurations that are expected to improve on the results seen so far.
* **TPE** is an iterative process that uses the history of evaluated hyperparameters to build a probabilistic model, which is then used to suggest the next set of hyperparameters to evaluate.
* **TPE** models p(x|y) by transforming the generative process, replacing the distributions of the configuration prior with non-parametric densities.
* **TPE** supports a wide variety of variables in the parameter search space, e.g. uniform, log-uniform, quantized log-uniform, normally distributed real values, and categorical.
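The core idea of TPE, modeling p(x|y) with two densities, can be sketched for a single hyperparameter. Everything below is a simplified assumption: the history of (max_depth, loss) pairs is invented, the split fraction `gamma` and the kernel bandwidth are arbitrary, and candidates come from a fixed list, whereas the real algorithm samples candidates from the "good" density and handles tree-structured spaces.

```python
import math

# Hypothetical history of (max_depth, validation loss) pairs.
history = [(10, 0.30), (20, 0.22), (30, 0.18), (40, 0.17),
           (50, 0.21), (60, 0.26), (70, 0.31), (80, 0.35)]

gamma = 0.25       # assumed fraction of trials treated as "good"
bandwidth = 10.0   # assumed kernel width for the density estimates

def kde(x, points, h):
    """Gaussian kernel (Parzen window) density estimate at x."""
    norm = len(points) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points) / norm

# Split the history into the best gamma fraction and the rest.
ranked = sorted(history, key=lambda obs: obs[1])
n_good = max(1, int(gamma * len(ranked)))
good = [x for x, _ in ranked[:n_good]]   # used for l(x) = p(x | loss is low)
bad = [x for x, _ in ranked[n_good:]]    # used for g(x) = p(x | loss is high)

# Prefer candidates that are likely under l(x) and unlikely under g(x).
candidates = range(10, 81, 5)
ratio = {x: kde(x, good, bandwidth) / kde(x, bad, bandwidth) for x in candidates}
next_x = max(ratio, key=ratio.get)

print("good points:", sorted(good))
print("next max_depth to try:", next_x)
```

The suggested value lands between the two best trials, in a region the poor trials do not cover.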

## Acquisition Function

The surrogate function is used to test a range of candidate samples in the domain.

- The **acquisition function** is maximized at every iteration to decide where to sample the objective function next. It takes into account the mean and variance of the surrogate's predictions over the space to model the usefulness of sampling each point.
- From these results, one or more candidates are selected and evaluated with the real, and usually computationally expensive, cost function.
- The objective function is then sampled at the argmax of the acquisition function, the Gaussian process is updated, and the whole procedure is repeated.
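The text does not name a specific acquisition function, so the sketch below assumes one common choice, expected improvement (EI), written for a loss that we want to minimize. The surrogate predictions (mean and standard deviation per candidate) and the best loss so far are invented numbers for illustration.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mean, std, best_loss):
    """Expected amount by which a candidate beats best_loss (minimization)."""
    if std == 0.0:
        return max(best_loss - mean, 0.0)
    z = (best_loss - mean) / std
    return (best_loss - mean) * normal_cdf(z) + std * normal_pdf(z)

best_loss = 0.30  # hypothetical best validation loss observed so far

# Hypothetical surrogate predictions: candidate -> (mean, std).
predictions = {
    "a": (0.28, 0.01),  # slightly better than the best, very certain
    "b": (0.32, 0.10),  # slightly worse on average, very uncertain
    "c": (0.35, 0.01),  # clearly worse, very certain
}

scores = {
    name: expected_improvement(mean, std, best_loss)
    for name, (mean, std) in predictions.items()
}
next_candidate = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: EI = {score:.5f}")
print("evaluate next:", next_candidate)
```

With these numbers the uncertain candidate wins over the one with the slightly better mean, which shows how the acquisition function balances exploiting good predictions against exploring uncertain regions.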

# Packages for Automatic Hyperparameter tuning

## Scikit-Optimize

- **Scikit-Optimize** is a library for minimizing expensive and noisy black-box functions.
- It implements several methods for sequential model-based optimization.

**→ Installing Scikit-Optimize**

`!pip install scikit-optimize`

**→ Importing necessary modules**

```python
import numpy as np
import skopt
from skopt import gp_minimize
from skopt.space import Integer, Categorical
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
from sklearn.model_selection import cross_val_score
```

**→ Defining the parameter space**

```python
space = [
    Integer(200, 1500, name="n_estimators"),
    Integer(10, 80, name="max_depth"),
    # "auto" was removed in scikit-learn 1.3; drop it on newer versions
    Categorical(["auto", "sqrt"], name="max_features"),
    Integer(2, 15, name="min_samples_split"),
    Integer(1, 9, name="min_samples_leaf"),
    Categorical([True, False], name="bootstrap"),
]
```

**→ Initializing the objective function**

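```python
@use_named_args(space)
def objective(**params):
    # rf is the RandomForest estimator and x_train, y_train the training data created earlier
    rf.set_params(**params)
    # The scorer returns negative MAE, so negate it to get a loss to minimize
    return -np.mean(cross_val_score(rf, x_train, y_train, cv=5,
                                    n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))
```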

**→ Optimizing the function**

```python
%%time
tune_rand_gp = gp_minimize(objective, space, random_state=1234)
```

We can see that, in a total time of 9 min and 34 seconds, the skopt package found the best set of parameters for our RandomForest model.

**→ Checking the best parameters**

```python
print("Best parameters: \n")
print(f'n_estimators={tune_rand_gp.x[0]}')
print(f'max_depth={tune_rand_gp.x[1]}')
print(f'max_features={tune_rand_gp.x[2]}')
print(f'min_samples_split={tune_rand_gp.x[3]}')
print(f'min_samples_leaf={tune_rand_gp.x[4]}')
print(f'bootstrap={tune_rand_gp.x[5]}')
```

**→ Visualizing the convergence**

`plot_convergence(tune_rand_gp)`

## Hyperopt

- **HyperOpt** *takes as input a space of hyperparameters to search in, and moves through it according to the results of past trials. This means we get an optimizer that can minimize/maximize any function for us.*
- The **Hyperopt** library provides different algorithms and a parallelization infrastructure for performing hyperparameter optimization (model selection) in Python.
- **HyperOpt** provides an optimization interface that takes a configuration space and an evaluation function that attaches real-valued loss values to points within the configuration space.

**→ Importing the necessary modules:**

`from hyperopt import fmin, tpe, hp,Trials,STATUS_OK`

**→ Initializing the parameters:**

Hyperopt provides us with a range of parameter expressions:

- *hp.choice(label, options)*: Returns one of the options provided; **options** should be a list or a tuple.
- *hp.randint(label, upper)*: Returns a random integer in the range from 0 (inclusive) to **upper** (exclusive).
- *hp.uniform(label, low, high)*: Returns a value drawn uniformly between the lower and the upper limit.
- *hp.quniform(label, low, high, q)*: Returns a value like round(uniform(low, high) / q) * q. Suitable for a discrete value with respect to which the objective is still somewhat “smooth”, but which should be bounded both above and below.
- *hp.loguniform(label, low, high)*: Returns a value drawn according to exp(uniform(low, high)), so that the logarithm of the returned value is uniformly distributed.

To read more about the parameters refer to this **link**.
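To see what the two formulas quoted above actually produce, here is a plain-Python imitation of them using only the standard library. The helper names `quniform` and `loguniform` below are local stand-ins written for this illustration, not Hyperopt's own functions, and the ranges are example values.

```python
import math
import random

def quniform(rng, low, high, q):
    """Imitates round(uniform(low, high) / q) * q: values snap to multiples of q."""
    return round(rng.uniform(low, high) / q) * q

def loguniform(rng, low, high):
    """Imitates exp(uniform(low, high)): the logarithm of the result is uniform."""
    return math.exp(rng.uniform(low, high))

rng = random.Random(42)

# Example: a depth between 10 and 80 in steps of 5.
depths = [quniform(rng, 10, 80, 5) for _ in range(1000)]

# Example: a rate between 1e-4 and 1e-1, spread evenly across orders of magnitude.
# Note that low and high are given in log space.
rates = [loguniform(rng, math.log(1e-4), math.log(1e-1)) for _ in range(1000)]

print("distinct depths:", sorted(set(depths)))
print("share of rates below 1e-2:", sum(r < 1e-2 for r in rates) / len(rates))
```

Roughly two thirds of the sampled rates fall below 1e-2, because that interval covers two of the three orders of magnitude. In Hyperopt itself, `hp.quniform` returns floats, so integer-only parameters such as `max_depth` need a cast.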

```python
space = {
    "n_estimators": hp.choice("n_estimators", [200, 600, 900, 1200, 1500]),
    "max_depth": hp.quniform("max_depth", 10, 80, 5),
    # "auto" was removed in scikit-learn 1.3; drop it on newer versions
    "max_features": hp.choice("max_features", ["auto", "sqrt"]),
    "min_samples_split": hp.choice("min_samples_split", [2, 5, 10, 12, 15]),
    "min_samples_leaf": hp.choice("min_samples_leaf", [1, 2, 4, 7, 9]),
    "bootstrap": hp.choice("bootstrap", [True, False]),
}
```

**→ Defining the function to minimize/maximize:**

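```python
def tune_random(params):
    # hp.quniform returns floats, but max_depth must be an integer
    params = {**params, "max_depth": int(params["max_depth"])}
    rand = RandomForestClassifier(**params, n_jobs=-1)
    # x and y are the features and labels loaded earlier
    acc = cross_val_score(rand, x, y, scoring="accuracy").mean()
    # fmin minimizes, so return the negative accuracy as the loss
    return {"loss": -acc, "status": STATUS_OK}
```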

**→ Minimizing the function:**

```python
%%time
trials = Trials()
best = fmin(fn=tune_random, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print("Best: {}".format(best))
```

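After minimizing the function, `fmin` returns the best set of hyperparameters. Note that for `hp.choice` parameters, `best` holds the index of the chosen option rather than its value; `hyperopt.space_eval(space, best)` maps the result back to actual values.

It can be seen that the hyperparameters that could help our model perform better were found in 11 min and 31 seconds.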

**→ Checking out results, losses and statuses:**

`trials.results`

This returns the set of losses and statuses of each trial.

`trials.losses()`

`trials.statuses()`

To check only the losses or only the statuses of individual trials, we can use these two methods.

## Optuna

- It is platform agnostic, which makes it usable with any kind of framework, such as TensorFlow, PyTorch and scikit-learn.
- It has a basic template:

```python
import optuna

def objective(trial):
    # ML logic here
    return evaluation_score

study = optuna.create_study()
# n_trials is the number of trials you want to go through
study.optimize(objective, n_trials=...)
```

- It helps minimize or maximize any function we want.
- It provides an easy mechanism to distribute the optimization, so if we have multiple machines, we can simultaneously run multiple trials that are asynchronous and show near-linear scalability.
- To set up the distribution we just need to change a few lines of the code.

```python
study = optuna.create_study(study_name='',        # name of the experiment
                            storage='',           # URL of the database
                            load_if_exists=True)
```

- Optuna can share the trial history among several processes running in parallel.

Let’s look at it in practice.

**→ Installing Optuna**

`!pip install optuna`

**→ Importing Optuna and defining the objective function to minimize/maximize**

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Each suggest_* call defines one dimension of the search space
    n_estimators = trial.suggest_int("n_estimators", 200, 1500)
    # "auto" was removed in scikit-learn 1.3; drop it on newer versions
    max_features = trial.suggest_categorical("max_features",
                                             ["auto", "sqrt"])
    max_depth = trial.suggest_int("max_depth", 10, 80, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 15)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 9)
    bootstrap = trial.suggest_categorical("bootstrap", [True, False])

    rand = RandomForestClassifier(n_estimators=n_estimators,
                                  max_features=max_features,
                                  max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  bootstrap=bootstrap)
    # x and y are the features and labels loaded earlier
    score = cross_val_score(rand, x, y, n_jobs=-1, cv=5)
    accuracy = score.mean()
    return accuracy
```

**→ Creating and optimizing the optimization task:**

- Creating:

`study = optuna.create_study(direction='maximize')`

- Optimizing:

```python
%%time
study.optimize(objective, n_trials=100)
```

It took roughly 6 min for Optuna to come up with the best set of hyperparameters, which are available as `study.best_params`.

**Conclusions:**

- When dealing with huge data, Bayesian Optimization tends to reduce the time needed for hyperparameter tuning.
- The set of hyperparameters found using Bayesian Optimization methods is usually close to optimal, although, like any search over a limited number of trials, it is not guaranteed to be the global optimum.

