Executors in Airflow

By Hiren Rupchandani & Mukesh Kumar

Table of Contents

1. Sequential Executor

2. Local Executor

3. Celery Executor

4. Kubernetes Executor

5. CeleryKubernetes Executor

6. Dask Executor

7. What’s Next?

Airflow executors are the mechanism that handles the running of tasks. Airflow can have only one executor configured at a time; it is set in the airflow.cfg file in the Airflow home directory.
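
For illustration, here is a minimal sketch of that setting (the value shown is Airflow's default; any of the executors covered below can be substituted):

```
# airflow.cfg — lives under your $AIRFLOW_HOME directory
[core]
# The executor Airflow should use; only one may be active at a time
executor = SequentialExecutor
```

The same option can also be set through the AIRFLOW__CORE__EXECUTOR environment variable, which takes precedence over the file.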

Primarily, there are two types of executors in Airflow: local executors, which run tasks on the same machine (or process) as the scheduler, and remote executors, which run tasks on a different machine (or process) from the scheduler.

Airflow offers various built-in executors, which can be referred to by name. These executors are as follows:

Sequential Executor

The Sequential Executor is a lightweight local executor that Airflow uses by default. It runs only one task instance at a time, which makes it the only executor that can work with SQLite, since SQLite does not support multiple connections. It is a single point of failure and is therefore not production-ready; it is best used for debugging.
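
A minimal sketch of that pairing, assuming a default SQLite setup (the database path is a placeholder; newer Airflow releases move sql_alchemy_conn from [core] to the [database] section):

```
[core]
executor = SequentialExecutor
# SQLite allows only one connection, so SequentialExecutor is the
# only executor that can run against it (placeholder path)
sql_alchemy_conn = sqlite:////home/user/airflow/airflow.db
```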

Local Executor

Unlike the Sequential Executor, the Local Executor can run multiple task instances at once. It requires a database system that allows multiple connections, such as MySQL or PostgreSQL, and it can therefore achieve parallelism on a single machine. You can refer to the Airflow documentation on LocalExecutor for the settings that control parallelism in pipelines.
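
A sketch of the relevant settings, assuming a PostgreSQL metadata database (credentials and host are placeholders):

```
[core]
executor = LocalExecutor
# Maximum number of task instances allowed to run concurrently
parallelism = 32
# Multi-connection database required by LocalExecutor (placeholder URI)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```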

Celery Executor

Celery is a task queue that can distribute tasks across multiple Celery workers. With the Celery Executor, the workload is distributed from the main application onto multiple Celery workers with the help of a message broker such as RabbitMQ or Redis: the executor publishes a request to execute a task to the queue, and one of several worker nodes picks up the request and executes it.

RabbitMQ is a message broker that provides communication between multiple services by operating message queues. It exposes an API through which other services publish messages to the queues and subscribe to them.

A MySQL or PostgreSQL database system is required to set up the Celery Executor. It is a remote executor suited to horizontal scaling, with workers distributed across multiple machines in a pipeline. It also allows for real-time processing and task scheduling, and, unlike the local executors, it is fault-tolerant.
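
A configuration sketch, assuming Redis as the broker and PostgreSQL doubling as the metadata database and the Celery result backend (hosts and credentials are placeholders):

```
[core]
executor = CeleryExecutor

[celery]
# Broker the executor publishes task requests to
broker_url = redis://localhost:6379/0
# Backend where workers record task state and results
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```

Each worker machine then starts consuming from the queue with the airflow celery worker command.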

Kubernetes Executor

The Kubernetes Executor uses the Kubernetes API to optimize resource usage. It runs as a process in the scheduler that only requires access to the Kubernetes API, and it launches each task in its own worker Pod. A Pod is the smallest deployable object in Kubernetes.

When a DAG submits a task, the Kubernetes Executor requests a worker Pod from the Kubernetes API. The worker Pod runs the task, reports the result, and terminates.

MySQL or PostgreSQL database systems are required to set up the Kubernetes Executor. It does not require any additional components such as a message broker, but it does require a Kubernetes environment.

One of its advantages over the other executors is that Pods are run only when a task needs to be executed, saving resources. In other executors, the workers are statically configured and are running all the time, regardless of workloads.

It can automatically scale up to meet the workload requirements, scale down to zero (if no DAGs or tasks are running), and is fault-tolerant.
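
A minimal sketch of the setup (the section is called [kubernetes] in earlier Airflow 2.x releases and [kubernetes_executor] in later ones; namespace and image names are placeholders):

```
[core]
executor = KubernetesExecutor

[kubernetes]
# Namespace in which worker pods are launched (placeholder)
namespace = airflow
# Image each worker pod runs (placeholders)
worker_container_repository = apache/airflow
worker_container_tag = 2.2.4
```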

CeleryKubernetes Executor

The CeleryKubernetes Executor allows users to run a Celery Executor and a Kubernetes Executor simultaneously. The CeleryKubernetes Executor documentation mentions a rule of thumb on when to use this executor (a configuration sketch follows the list):

  1. The number of tasks needed to be scheduled at peak exceeds the scale that your Kubernetes cluster can comfortably handle.
  2. A relatively small portion of your tasks requires runtime isolation.
  3. You have plenty of small tasks that can be executed on Celery workers, but you also have resource-hungry tasks that are better run in predefined environments.
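
The split between the two executors is decided per task through its queue. A sketch, assuming Airflow 2.x defaults (the queue name is configurable):

```
[core]
executor = CeleryKubernetesExecutor

[celery_kubernetes_executor]
# Tasks sent to this queue run via the Kubernetes Executor;
# everything else is handled by the Celery Executor
kubernetes_queue = kubernetes
```

Inside a DAG, an individual operator can then opt into Kubernetes by passing queue='kubernetes'.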

Dask Executor

Dask is a parallel computing library in Python with a sophisticated distributed task scheduler. The Dask Executor allows you to run Airflow tasks on a Dask Distributed cluster, and it can be used as an alternative to the Celery Executor for horizontal scaling. You can refer to the differences between the Celery Executor and the Dask Executor on this blog.
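
A sketch of the setup, assuming a Dask distributed scheduler is already listening on its default port (the address is a placeholder):

```
[core]
executor = DaskExecutor

[dask]
# Address of the running Dask scheduler (placeholder)
cluster_address = 127.0.0.1:8786
```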

These are the various predefined executors that Airflow supports. In the upcoming article, we will see how to configure and use the Local Executor to run a DAG in Airflow.

What’s Next?

Local Executor in Airflow
