A Complete Guide to PyCaret!!!

INSAID
11 min read · Sep 18, 2020

What is PyCaret?

  • PyCaret is an open-source low-code machine learning library in Python that aims to reduce the time needed for experimenting with different machine learning models.
  • It helps data scientists perform experiments end-to-end quickly and more efficiently.
  • PyCaret being a low-code library makes you more productive. With less time spent coding, you and your team can now focus on business problems.
  • PyCaret is a wrapper around many ML models and frameworks such as XGBoost, Scikit-learn, and many more.

Why use PyCaret?

  • It helps in data preprocessing.
  • It trains multiple models simultaneously and outputs a table comparing the performance of each model by considering a few performance metrics such as precision, recall, f1-score, and so on.
  • It is easy to analyze and interpret as it requires minimal code to run.
  • In a few lines of code, it increases productivity.

Also, check out our articles on:

Classification using PyCaret
Regression using PyCaret
Anomaly Detection using PyCaret
Clustering using PyCaret

Let’s Get Started with PyCaret!!!

→ Installing PyCaret

Installing PyCaret is easy. However, it is recommended to create a virtual environment, as PyCaret installs dependencies like pandas, NumPy, seaborn, and many more; you can check this link for the complete list of dependencies.

  1. To create a virtual environment, you can use the Anaconda Prompt on your laptop/desktop.
  • Open the Anaconda Prompt and type the code below in the CLI.
conda create -n yourenvname python=x.x anaconda

After pressing Enter, you will shortly get a confirmation prompt on the CLI.

Type ‘y’ and press Enter again, and your environment will be created.

  • To activate the environment.
conda activate yourenvname

Notice that after running the command, the newly created environment becomes the active environment.

  • To deactivate the environment
conda deactivate
  • To delete the environment
conda remove -n yourenvname --all
  • An alternative way to create an environment is to use Anaconda Navigator directly: in the bottom-left corner of its Environments tab you will see a “Create” option.

2. To install PyCaret, activate the virtual environment and open Jupyter Notebook by typing jupyter notebook in the CLI.

  • In Jupyter Notebook
!pip install pycaret
  • In the Anaconda Prompt, after activating the virtual environment, type the command given below.
pip install pycaret
  • Importing PyCaret and checking the version in the Jupyter Notebook:
import pycaret
pycaret.__version__
  • If you are using Google Colab, run the code below first and then proceed.
from pycaret.utils import enable_colab
enable_colab()

→ Checking the pre-loaded datasets in PyCaret

from pycaret.datasets import get_data
get_data('index')
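Any dataset listed in the index can then be loaded into a pandas DataFrame by its name. A minimal sketch, assuming the ‘diabetes’ dataset appears in that index:

# Load the 'diabetes' dataset into a DataFrame and display its first rows
data = get_data('diabetes')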

→ Setting up the PyCaret environment

Before moving on with any kind of experimentation using PyCaret we need to set up the environment.

  • The first step in setting up the environment is to import a module, depending upon the type of experiment to be performed.
#For Regression
from pycaret.regression import *
#For Classification
from pycaret.classification import *
#For Clustering
from pycaret.clustering import *
#For Anomaly Detection
from pycaret.anomaly import *
#For NLP
from pycaret.nlp import *
#For association rule mining
from pycaret.arules import *
  • The second step is to initialize the setup:
    It is a mandatory step that should be done before any machine learning experiment.
reg = setup(data = DataFrame_name, target = 'target_variable_name')
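For instance, a minimal sketch for a classification experiment on the ‘diabetes’ dataset loaded earlier (the target column name 'Class variable' and the option values are assumptions; adjust them to your own data):

from pycaret.classification import *
# 'Class variable' is assumed to be the label column of the diabetes dataset
clf = setup(data = data, target = 'Class variable',
            train_size = 0.7,  # 70/30 train-test split (the default)
            session_id = 123)  # fixed seed for reproducibility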

As you know, PyCaret helps in model deployment too, so the entire experiment is saved in a pipeline, and this pipeline can be deployed into production with ease.

After running setup, press Enter to confirm the inferred data types, and PyCaret will display a summary of the experiment configuration.

The Setup step covers a wide range of pre-processing tasks like:

Data Type Inference:
- It helps determine the correct data types for all the features.

Data Cleaning and Preparation:
- It automatically imputes the missing values present in the data.
- By default, numerical features are imputed with the mean and categorical features with the mode.
- Encoding of the categorical features is also performed automatically.

Train Test Split:
- It automatically splits the data into train and test sets for modeling. In the case of classification problems, it uses stratified splits.
- By default, the split ratio is 70% train and 30% test. This can be changed using the “train_size” parameter within setup.
- Evaluation of every ML model and hyperparameter optimization is done using k-fold cross-validation.

Assigning Session ID as seed:
- The session id is a pseudo-random number generated by default if no session_id parameter is passed.
- PyCaret distributes this id as a seed to all functions to isolate the effect of randomization.
- This allows for reproducibility at a later date in the same or a different environment.

Now that our environment is set up for training, we can carry on with our ML experimentation.

Creating Models

Creating a model in PyCaret is one of the simplest tasks.

The “create_model” function takes just the model ID as a string, trains that model, and evaluates it using cross-validation.

create_model('model_ID')

To change the number of cross-validation folds used during training, we can include one more parameter, “fold”, inside the create_model function.
By default, fold is set to 10.

create_model('model_ID', fold = n)

n is the number of folds required.

After performing this, we get a table of all the metrics, rounded to 4 decimal places, as an output.
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
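As a concrete sketch (the model ID ‘lr’ and the fold count are only examples; valid IDs are listed in the tables below):

# Train a logistic regression classifier with 5-fold cross-validation
lr = create_model('lr', fold = 5)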

→ Classification:

+------------+---------------------------------+
| ID | Name |
+------------+---------------------------------+
| ‘lr’ | Logistic Regression |
| ‘knn’ | K Nearest Neighbour |
| ‘nb’ | Naive Bayes |
| ‘dt’ | Decision Tree Classifier |
| ‘svm’ | SVM – Linear Kernel |
| ‘rbfsvm’ | SVM – Radial Kernel |
| ‘gpc’ | Gaussian Process Classifier |
| ‘mlp’ | Multi-Layer Perceptron |
| ‘ridge’ | Ridge Classifier |
| ‘rf’ | Random Forest Classifier |
| ‘qda’ | Quadratic Discriminant Analysis |
| ‘ada’ | Ada Boost Classifier |
| ‘gbc’ | Gradient Boosting Classifier |
| ‘lda’ | Linear Discriminant Analysis |
| ‘et’ | Extra Trees Classifier |
| ‘xgboost’ | Extreme Gradient Boosting |
| ‘lightgbm’ | Light Gradient Boosting |
| ‘catboost’ | CatBoost Classifier |
+------------+---------------------------------+

→ Regression:

+------------+-----------------------------------+
| ID | Name |
+------------+-----------------------------------+
| ‘lr’ | Linear Regression |
| ‘lasso’ | Lasso Regression |
| ‘ridge’ | Ridge Regression |
| ‘en’ | Elastic Net |
| ‘lar’ | Least Angle Regression |
| ‘llar’ | Lasso Least Angle Regression |
| ‘omp’ | Orthogonal Matching Pursuit |
| ‘br’ | Bayesian Ridge |
| ‘ard’ | Automatic Relevance Determination |
| ‘par’ | Passive Aggressive Regressor |
| ‘ransac’ | Random Sample Consensus |
| ‘tr’ | TheilSen Regressor |
| ‘huber’ | Huber Regressor |
| ‘kr’ | Kernel Ridge |
| ‘svm’ | Support Vector Machine |
| ‘knn’ | K Neighbors Regressor |
| ‘dt’ | Decision Tree |
| ‘rf’ | Random Forest |
| ‘et’ | Extra Trees Regressor |
| ‘ada’ | AdaBoost Regressor |
| ‘gbr’ | Gradient Boosting Regressor |
| ‘mlp’ | Multi-Layer Perceptron |
| ‘xgboost’ | Extreme Gradient Boosting |
| ‘lightgbm’ | Light Gradient Boosting |
| ‘catboost’ | CatBoost Regressor |
+------------+-----------------------------------+

→ Clustering:

+-------------+----------------------------------+
| ID | Name |
+-------------+----------------------------------+
| ‘kmeans’ | K-Means Clustering |
| ‘ap’ | Affinity Propagation |
| ‘meanshift’ | Mean shift Clustering |
| ‘sc’ | Spectral Clustering |
| ‘hclust’ | Agglomerative Clustering |
| ‘dbscan’ | Density-Based Spatial Clustering |
| ‘optics’ | OPTICS Clustering |
| ‘birch’ | Birch Clustering |
| ‘kmodes’ | K-Modes Clustering |
+-------------+----------------------------------+
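The same create_model call works for the unsupervised modules as well. A minimal clustering sketch, assuming setup from pycaret.clustering has already been run on a DataFrame named data (the num_clusters value and the use of assign_model are only examples):

from pycaret.clustering import *

clu = setup(data = data)                           # unsupervised setup, no target needed
kmeans = create_model('kmeans', num_clusters = 4)  # k-means with 4 clusters
results = assign_model(kmeans)                     # original data with an assigned cluster label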

→ Anomaly Detection:

+-------------+-----------------------------------+
| ID | Name |
+-------------+-----------------------------------+
| ‘abod’ | Angle-based Outlier Detection |
| ‘iforest’ | Isolation Forest |
| ‘cluster’ | Clustering-Based Local Outlier |
| ‘cof’ | Connectivity-Based Outlier Factor |
| ‘histogram’ | Histogram-based Outlier Detection |
| ‘knn’ | k-Nearest Neighbors Detector |
| ‘lof’ | Local Outlier Factor |
| ‘svm’ | One-class SVM detector |
| ‘pca’ | Principal Component Analysis |
| ‘mcd’ | Minimum Covariance Determinant |
| ‘sod’ | Subspace Outlier Detection |
| ‘sos’ | Stochastic Outlier Selection |
+-------------+-----------------------------------+

→ NLP:

+-------+-----------------------------------+
| ID | Model |
+-------+-----------------------------------+
| ‘lda’ | Latent Dirichlet Allocation |
| ‘lsi’ | Latent Semantic Indexing |
| ‘hdp’ | Hierarchical Dirichlet Process |
| ‘rp’ | Random Projections |
| ‘nmf’ | Non-Negative Matrix Factorization |
+-------+-----------------------------------+

Compare models

This function trains every model available in PyCaret’s model library for the given problem type.
Each model is trained using its default hyperparameters, and performance metrics are evaluated using cross-validation.

compare_models()

The output of the function is a table showing the average score of all models across the folds. The number of folds can be defined using the fold parameter within the compare_models function. By default, fold is set to 10. The table is sorted (highest to lowest) by the metric of choice, which can be defined using the sort parameter. By default, the table is sorted by Accuracy for classification experiments and R2 for regression experiments. Certain models are excluded from the comparison because of their longer run-time. To include them, the turbo parameter can be set to False.
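For instance, a sketch combining these parameters (the values shown are only examples):

# Compare all models (including the slower ones) using 5-fold CV, ranked by AUC;
# the single best model is returned
best = compare_models(fold = 5, sort = 'AUC', turbo = False)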

To select the top n models, include the n_select parameter within the compare_models function.

compare_models(n_select = n)

We can also sort the results by a metric of our choice.

compare_models(n_select = n, sort = 'AUC')

Tune Model

PyCaret provides a one-line function to perform hyperparameter tuning of any model present in its library.

It tunes the hyperparameters of the model passed as an estimator using a random grid search over pre-defined grids that are fully customizable.

  • First, create a model
dt = create_model('dt') #dt stands for the Decision Tree
  • Tune the model
tuned = tune_model(dt, n_iter = 50)
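By default, tune_model optimizes Accuracy for classification and R2 for regression; a different metric can be targeted with the optimize parameter, as in the sketch below (the metric is only an example):

# Search 50 random hyperparameter combinations, optimizing for AUC
tuned = tune_model(dt, n_iter = 50, optimize = 'AUC')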

Plot a Model

It helps in checking the performance of a model with different graphs in one line of code.

model = create_model('Model_name')
plot_model(model)
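By default, plot_model draws the AUC curve for classifiers and the residuals plot for regressors; a specific plot can be requested with the plot parameter, using the IDs listed in the tables below. A quick sketch:

# Draw the confusion matrix of a trained classifier
plot_model(model, plot = 'confusion_matrix')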

→ Classification:

+-----------------------------+--------------------+
| Name | Plot |
+-----------------------------+--------------------+
| Area Under the Curve | ‘auc’ |
| Discrimination Threshold | ‘threshold’ |
| Precision Recall Curve | ‘pr’ |
| Confusion Matrix | ‘confusion_matrix’ |
| Class Prediction Error | ‘error’ |
| Classification Report | ‘class_report’ |
| Decision Boundary | ‘boundary’ |
| Recursive Feature Selection | ‘rfe’ |
| Learning Curve | ‘learning’ |
| Manifold Learning | ‘manifold’ |
| Calibration Curve | ‘calibration’ |
| Validation Curve | ‘vc’ |
| Dimension Learning | ‘dimension’ |
| Feature Importance | ‘feature’ |
| Model Hyperparameter | ‘parameter’ |
+-----------------------------+--------------------+

→ Regression:

+-----------------------------+-------------+
| Name | Plot |
+-----------------------------+-------------+
| Residuals Plot | ‘residuals’ |
| Prediction Error Plot | ‘error’ |
| Cooks Distance Plot | ‘cooks’ |
| Recursive Feature Selection | ‘rfe’ |
| Learning Curve | ‘learning’ |
| Validation Curve | ‘vc’ |
| Manifold Learning | ‘manifold’ |
| Feature Importance | ‘feature’ |
| Model Hyperparameter | ‘parameter’ |
+-----------------------------+-------------+

→ Clustering:

+-----------------------+----------------+
| Name | Plot |
+-----------------------+----------------+
| Cluster PCA Plot (2d) | ‘cluster’ |
| Cluster t-SNE (3d) | ‘tsne’ |
| Elbow Plot | ‘elbow’ |
| Silhouette Plot | ‘silhouette’ |
| Distance Plot | ‘distance’ |
| Distribution Plot | ‘distribution’ |
+-----------------------+----------------+

→ Anomaly Detection:

+---------------------------+--------+
| Name | Plot |
+---------------------------+--------+
| t-SNE (3d) Dimension Plot | ‘tsne’ |
| UMAP Dimensionality Plot | ‘umap’ |
+---------------------------+--------+

→ Natural Language Processing:

+---------------------------+----------------------+
| Name | Plot |
+---------------------------+----------------------+
| Word Token Frequency | ‘frequency’ |
| Word Distribution Plot | ‘distribution’ |
| Bigram Frequency Plot | ‘bigram’ |
| Trigram Frequency Plot | ‘trigram’ |
| Sentiment Polarity Plot | ‘sentiment’ |
| Part of Speech Frequency | ‘pos’ |
| t-SNE (3d) Dimension Plot | ‘tsne’ |
| Topic Model (pyLDAvis) | ‘topic_model’ |
| Topic Infer Distribution | ‘topic_distribution’ |
| Word cloud | ‘wordcloud’ |
| UMAP Dimensionality Plot | ‘umap’ |
+---------------------------+----------------------+

Interpret Model

After building a model, one of the most important tasks is to interpret the results.

Model Interpretability helps debug the model by analyzing what the model really thinks is important.

model = create_model('Model_name')
interpret_model(model)
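interpret_model is SHAP-based and works with tree-based models. Besides the default summary plot, individual predictions can also be explained; a sketch (the plot name and observation index are only examples):

# SHAP feature-importance summary (the default plot)
interpret_model(model)

# Explain a single prediction, here the first observation of the test set
interpret_model(model, plot = 'reason', observation = 0)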

Finalize Model

It is the last step of building a model in PyCaret.

This function takes a trained model object and returns a model that has been trained on the entire dataset.

model = create_model('Model_name')
finalize_model(model)
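The finalized pipeline can then be used to score new data with predict_model (data_unseen below is a placeholder for your own hold-out DataFrame):

# Generate predictions on new, unseen data using the finalized pipeline
final_model = finalize_model(model)
predictions = predict_model(final_model, data = data_unseen)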

Deploy Model

  1. Once a model is finalized after experimenting on the dataset, it can be saved using the “save_model” function.
    “save_model” saves the pipeline and the trained model as a binary pickle file, which can then be used in applications (see the sketch after the prediction example below).
  2. An alternative, low-code way to deploy the model to the cloud is PyCaret’s “deploy_model” function.
  3. Models can easily be deployed on AWS using PyCaret.
  4. Before deploying a model to an AWS S3 bucket (‘aws’), environment variables must be configured using the command-line interface. To configure the AWS environment variables, type aws configure in your command line. The following information is required, which can be generated using the Identity and Access Management (IAM) portal of your Amazon console account:
  • AWS Access Key ID
  • AWS Secret Key Access
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
model = create_model('Model_name')
final_model = finalize_model(model)
deploy_model(final_model, model_name = 'Model_name_aws', platform = 'aws',
             authentication = {'bucket' : 'pycaret-test'})

This deployed model can also be used to predict.

predictions = predict_model(model_name = 'lr_aws', data = data_unseen, platform = 'aws', authentication = { 'bucket' : 'pycaret-test' })
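As mentioned in step 1 above, the pipeline can also be saved and reloaded locally without any cloud setup (the file name below is only an example):

# Save the pipeline and trained model as 'Final_Model.pkl' in the working directory
save_model(final_model, 'Final_Model')

# Later, reload it and score new data
loaded_model = load_model('Final_Model')
predictions = predict_model(loaded_model, data = data_unseen)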

Also, check out our articles on:

Classification using PyCaret
Regression using PyCaret
Anomaly Detection using PyCaret
Clustering using PyCaret

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful, as your encouragement helps us create more cool stuff like this.

Visit us on https://www.insaid.co/
