Executors in Airflow
Table of Contents
7. What’s Next?
Airflow executors are the mechanism that handles the running of tasks. Airflow can only have one executor configured at a time; they can be set in the airflow.cfg file in the airflow directory.
Primarily, there are two types of executors in Airflow — local executors which run tasks on the same machine (or process) as the scheduler and remote executors which run tasks on a different machine (or process) from the scheduler.
Airflow offers various built-in executors which can be referred to by their name. These executors are as follows:
It is a lightweight local executor which is available by default in airflow. It runs only one task instance at a time and is not production-ready. It can run with SQLite since SQLite does not support multiple connections. It is prone to single-point failure and can be used for debugging.
Unlike the sequential executor, the local executor can run multiple task instances. MySQL or PostgreSQL database systems are used since they allow multiple connections with the local executor. Parallelism can be achieved with the help of a local executor. You can refer to airflow documentation on LocalExecutor to achieve parallelism in pipelines.
Celery is a task queue and can distribute tasks across multiple celery workers. Using a Celery Executor, the workload is distributed from the main application onto multiple celery workers with the help of a message broker such as RabbitMQ or Redis. The executor publishes a request to execute a task in the queue, and one of several worker nodes picks up the request and executes it.
RabbitMQ is a message broker and provides communication between multiple services by operating message queues. It provides an API for other services to publish and to subscribe to the queues.
MySQL or PostgreSQL database systems are required to set up the Celery Executor. It is a remote executor and can be used for horizontal scaling where workers are distributed across multiple machines in a pipeline. It also allows for real-time processing and task scheduling. It is fault-tolerant, unlike the local executors.
The Kubernetes Executor uses the Kubernetes API for resource optimization. It runs as a fixed single Pod in the Scheduler that only requires access to the Kubernetes API. A Pod is the smallest deployable object in Kubernetes.
When a DAG submits a task, the Kubernetes Executor requests a worker pod. The worker pod, called from the Kubernetes API, runs the task, reports the result, and terminates.
MySQL or PostgreSQL database systems are required to set up the Kubernetes Executor. It does not require any additional components such as a message broker but it does require a Kubernetes environment.
One of its advantages over the other executors is that Pods are run only when a task needs to be executed, saving resources. In other executors, the workers are statically configured and are running all the time, regardless of workloads.
It can automatically scale up to meet the workload requirements, scale down to zero (if no DAGs or tasks are running), and is fault-tolerant.
The allows users to run simultaneously a Celery Executor and a Kubernetes Executor. The CeleryKubernetes Executor documentation mentions a thumb rule on when to use this executor:
- The number of tasks needed to be scheduled at the peak exceeds the scale that your Kubernetes cluster can comfortably handle
- A relatively small portion of your tasks requires runtime isolation.
- You have plenty of small tasks that can be executed on Celery workers but you also have resource-hungry tasks that will be better to run in predefined environments.
Dask is a parallel computing library in python that has a sophisticated distributed task scheduler. Dask Executor allows you to run Airflow tasks in a Dask Distributed cluster. It can be used as an alternative to the Celery Executor for horizontal scaling. You can refer to the difference between the Celery Executor and Dask Executor on this blog.
These are the various pre-defined executors that airflow supports. In the upcoming article, we will see how we can configure and use the Local Executor to run a DAG in Airflow.