Automated Hyperparameter Tuning

  1. Hyperparameters are tunable and can be used to obtain the best-performing model.
  2. It is always tricky to find the optimal combination of hyperparameters for an ML model on a specific task.
  3. Not only does it take time to write lines and lines of search code, it also takes time to train every candidate model.
  4. A hyperparameter controls the behavior and results of an ML model.

Also, check out our article on traditional hyperparameter tuning for more detail:

Hyper-parameter Tuning

→ Problems with the traditional methods

  • Increased time complexity.
  • GridSearch is an exhaustive search that evaluates every combination of the values we pass, whereas RandomSearch samples combinations at random (see the sketch after this list).
  • GridSearch suffers from the curse of dimensionality:
    The number of model evaluations grows exponentially with the number of hyperparameters being tuned.
    Additionally, it is not even guaranteed to find the best solution.
  • The drawback of RandomSearch is unnecessarily high variance:
    The method is entirely random and uses no intelligence in selecting which points to try.
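For reference, this is roughly what the two traditional methods look like in scikit-learn (a minimal sketch on toy data; the grid itself is an illustrative choice of ours):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

x, y = make_classification(n_samples=500, random_state=0)  # toy data
param_grid = {"n_estimators": [200, 600, 900], "max_depth": [10, 40, 80]}

# GridSearch: exhaustively evaluates all 3 x 3 = 9 combinations
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5).fit(x, y)

# RandomSearch: samples 5 combinations at random from the same grid
rand = RandomizedSearchCV(RandomForestClassifier(), param_grid,
                          n_iter=5, cv=5, random_state=0).fit(x, y)
print(grid.best_params_, rand.best_params_)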

Why use Automated Hyperparameter Tuning?

  • GridSearch and RandomSearch are hands-off, but they have very long run times.
    This is because they waste most of their time evaluating points in the search space that add no value.
  • Increasingly, hyperparameter tuning is therefore done by automated methods that aim to find optimal hyperparameters in less time, using an informed search, with no manual effort necessary beyond the initial set-up.

Bayesian Optimization

  1. Sequential model-based optimization (SMBO, also known as Bayesian optimization) is one of the most efficient strategies (per function evaluation) for function minimization.
  2. This efficiency makes it well suited to tuning the hyperparameters of ML models that are slow to train.
  3. SMBO methods are used when we want to minimize some scalar-valued function f(x) that takes a long time to evaluate (the goal is written out in symbols just after this list).
  4. The advantages of SMBO are that it:
    • Leverages smoothness without requiring an analytic gradient.
    • Handles every type of variable (real-valued, discrete, and conditional variables).
    • Handles evaluations of the function f(x) in parallel.
    • Scales to hundreds of variables, even with a budget of just a few hundred function evaluations.
  5. Bayesian optimization is an approach that uses Bayes' theorem to direct the search for the minimum or maximum of an objective function.
  6. It is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.
  7. Bayesian optimization takes past evaluations into account when picking the next set of hyperparameters to try.
  8. By using this informed method of picking hyperparameter sets, it can focus on the areas of the parameter space that it believes will bring the most promising validation scores.
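Written out in symbols, the goal from point 3 is simply:

x* = argmin f(x), over x in the search space X

where f(x) is the expensive objective (for us, a validation score as a function of the hyperparameters x) and X may mix real-valued, discrete, and conditional variables.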

Idea Behind Bayesian Optimization

  • We know that Bayes' theorem describes the probability of an event based on prior knowledge of conditions that may be associated with the event.
    In simple terms, Bayes' theorem calculates the conditional probability of an event.
  • However, we do not want to calculate conditional probabilities here; we want to optimize a quantity.
    We can simplify the equation by removing the normalizing value P(B), which turns the conditional probability into a proportional quantity:
P(A|B) ∝ P(B|A) * P(A)
  • Now, the search navigates through various hyperparameter configurations, taking advantage of the previous ones, which eventually helps the given machine learning model train with a better hyperparameter combination on each passing run.

The posterior captures our updated beliefs about the unknown objective function. One may also read this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface).

  • Bayesian optimization builds a posterior distribution over the function being optimized and then uses an acquisition function to decide, based on that posterior, the next set of parameters to explore.
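In the same style as the equation above, with the objective function f playing the role of A and the evaluations D observed so far playing the role of B:

P(f|D) ∝ P(D|f) * P(f)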

Surrogate Function

  • The surrogate function can be interpreted as a substitute for the target function.
  • It is used to propose parameter sets to the objective function that are likely to yield an improvement in the score.
  • The surrogate function approximates the mapping from input examples (sets of hyperparameters) to an output score.
  • Building the surrogate model is usually handled as a regression problem: we provide a set of hyperparameters as input, and it returns an estimate of the objective function parameterized by a mean and a standard deviation.

Acquisition Function

  • The acquisition function is maximized at every iteration to decide where next to sample the objective function; it takes into account the mean and variance of the surrogate's predictions over the space to model the utility of sampling each point.
  • From these results, one or more candidates are selected and evaluated with the real, and in practice computationally expensive, cost function.
  • The objective function is then sampled at the argmax of the acquisition function, the Gaussian process is updated, and the whole procedure is repeated (a complete loop is sketched below).
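Tying the surrogate and acquisition pieces together, here is a minimal sketch of the loop on a toy 1-D objective. The GP surrogate (scikit-learn's GaussianProcessRegressor) and the expected-improvement acquisition are standard choices, but the toy objective, bounds, and budget below are our own illustrative picks:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):  # stands in for an expensive black-box function
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, best_y):
    # EI for minimization: expected improvement over the best value so far
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # guard against division by zero
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))            # a few random initial evaluations
y = objective(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)  # candidate points

for _ in range(15):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)  # surrogate
    ei = expected_improvement(grid, gp, y.min())               # acquisition
    x_next = grid[np.argmax(ei)].reshape(1, -1)  # argmax of the acquisition
    X = np.vstack([X, x_next])                   # evaluate the real f there
    y = np.append(y, objective(x_next).ravel())  # update and repeat

print(f"best x = {X[y.argmin()].item():.3f}, best f(x) = {y.min():.3f}")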

Packages for Automatic Hyperparameter tuning

Scikit-Optimize

  • Scikit-Optimize is one of the libraries that is easier to use than other hyperparameter optimization libraries, and it has good community support and documentation.
  • The library implements several methods for sequential model-based optimization of expensive and noisy black-box functions.
!pip install scikit-optimize
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer, Categorical
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor()  # x_train and y_train are assumed to be defined

space = [
    Integer(200, 1500, name="n_estimators"),
    Integer(10, 80, name="max_depth"),
    Categorical(["auto", "sqrt"], name="max_features"),
    Integer(2, 15, name="min_samples_split"),
    Integer(1, 9, name="min_samples_leaf"),
    Categorical([True, False], name="bootstrap"),
]

@use_named_args(space)
def objective(**params):
    rf.set_params(**params)
    # gp_minimize minimizes, so negate the negative MAE to get a loss
    return -np.mean(cross_val_score(rf, x_train, y_train, cv=5,
                                    n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))

%%time
tune_rand_gp = gp_minimize(objective, space, random_state=1234)

In a total time of about 9 minutes and 34 seconds, the skopt package found the best set of parameters for our RandomForest model:

print(f"Best parameters: \n")
print(f'n_estimators={tune_rand_gp.x[0]}')
print(f'max_depth={tune_rand_gp.x[1]}')
print(f'max_features={tune_rand_gp.x[2]}')
print(f'min_samples_split={tune_rand_gp.x[3]}')
print(f'min_samples_leaf={tune_rand_gp.x[4]}')
print(f'bootstrap = {tune_rand_gp.x[5]}')
plot_convergence(tune_rand_gp)

Hyperopt

  • HyperOpt takes as input a space of hyperparameters in which it will search, and it moves according to the results of past trials. This means we get an optimizer that can minimize/maximize any function for us.
  • The Hyperopt library provides different algorithms and a way to parallelize, building an infrastructure for performing hyperparameter optimization (model selection) in Python.
  • HyperOpt's optimization interface takes a configuration space and an evaluation function that attaches real-valued loss values to points within that space.
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
The search space is defined with hyperopt's stochastic expressions:
  1. hp.choice(label, options):
    Returns one of the options provided; options should be a list or a tuple.
  2. hp.randint(label, upper):
    Returns a random integer in the range [0, upper).
  3. hp.uniform(label, lower, upper):
    Returns a value uniformly distributed between the lower and the upper limit.
  4. hp.quniform(label, lower, upper, q):
    Returns a value like round(uniform(lower, upper) / q) * q.
    Suitable for a discrete value with respect to which the objective is still somewhat "smooth", but which should be bounded both above and below.
  5. hp.loguniform(label, lower, upper):
    Returns a value drawn according to exp(uniform(lower, upper)), so that the logarithm of the returned value is uniformly distributed.
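Before running any optimization, you can sanity-check a space by drawing random samples from it with hyperopt's stochastic sampler (a small sketch; the space below is an illustrative one of our own):

import hyperopt.pyll.stochastic
from hyperopt import hp

demo_space = {
    "lr": hp.loguniform("lr", -5, 0),            # exp(uniform(-5, 0))
    "units": hp.quniform("units", 32, 512, 32),  # multiples of 32
}
for _ in range(3):
    print(hyperopt.pyll.stochastic.sample(demo_space))

The search space for our RandomForest model can then be defined: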
space = {"n_estimators": hp.choice("n_estimators",[200,600,900,1200,1500]),"max_depth": hp.quniform("max_depth", 10, 80,5),"max_features": hp.choice("criterion", ["auto", "sqrt"]),"min_samples_split":hp.choice("min_samples_split",[2, 5, 10,12,15]),"min_samples_leaf":hp.choice("min_samples_leaf",[1, 2, 4,7,9]),"bootstrap": hp.choice("bootstrap",[True,False])}
def tune_random(params):
    # RandomForestClassifier and cross_val_score are imported from sklearn;
    # x and y are the training features and labels.
    params["max_depth"] = int(params["max_depth"])  # quniform returns floats
    rand = RandomForestClassifier(**params, n_jobs=-1)
    acc = cross_val_score(rand, x, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}

%%time
trials = Trials()
best = fmin(fn=tune_random, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print("Best: {}".format(best))

After minimizing the function, the result returns the best set of hyperparameters.
The hyperparameters that help our model perform better were found in 11 minutes and 31 seconds.

trials.results     # the list of dicts returned by the objective during the search
trials.losses()    # the list of losses (one float per trial)
trials.statuses()  # the list of status strings
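Note that for hp.choice parameters, fmin returns the index of the selected option rather than the option itself. Hyperopt's space_eval helper maps the result back to actual parameter values:

from hyperopt import space_eval

print(space_eval(space, best))  # the best hyperparameters as real values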

Optuna

  • It is platform agnostic, which makes it usable with frameworks such as TensorFlow, PyTorch, and scikit-learn.
  • It follows a basic template:
import optuna

def objective(trial):
    # ML logic here
    return evaluation_score

study = optuna.create_study()
study.optimize(objective, n_trials=...)
# Here, n_trials is the number of trials you want to run.
  • It helps minimize or maximize any function we want.
  • It provides an easy mechanism to distribute the optimization: with multiple machines, we can run trials asynchronously and in parallel, with near-linear scalability.
  • To set up the distribution, we just need to change a few lines of code:
study = optuna.create_study(study_name='',      # name of the experiment
                            storage='',         # URL of the database
                            load_if_exists=True)
  • Optuna can share the optimization history among multiple processes running in parallel.
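As a rough sketch of that workflow (the study name and the SQLite URL below are illustrative placeholders of our own, not fixed by Optuna), each worker process runs the same few lines against a shared database:

import optuna

# `objective` is any objective function, such as the one defined below.
study = optuna.create_study(study_name="rf_tuning",            # placeholder
                            storage="sqlite:///rf_tuning.db",  # placeholder
                            load_if_exists=True)
study.optimize(objective, n_trials=25)  # each worker adds its own trials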
!pip install optuna
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 200, 1500)
    max_features = trial.suggest_categorical("max_features",
                                             ["auto", "sqrt"])
    max_depth = trial.suggest_int("max_depth", 10, 80, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 15)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 9)
    bootstrap = trial.suggest_categorical("bootstrap", [True, False])
    rand = RandomForestClassifier(n_estimators=n_estimators,
                                  max_features=max_features,
                                  max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  bootstrap=bootstrap)
    # maximize mean 5-fold cross-validated accuracy (x and y as before)
    score = cross_val_score(rand, x, y, n_jobs=-1, cv=5)
    accuracy = score.mean()
    return accuracy
  • Creating:
study = optuna.create_study(direction='maximize')
  • Optimizing:
%%time
study.optimize(objective, n_trials=100)
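Once the optimization finishes, the best trial can be read straight off the study object:

print(study.best_params)  # the best hyperparameter combination found
print(study.best_value)   # the corresponding mean CV accuracy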

Conclusions:

  1. When dealing with huge datasets, Bayesian optimization tends to reduce the time spent on hyperparameter tuning.
  2. The sets of hyperparameters found by Bayesian optimization methods are typically at or near the optimum for the search space explored.
