Kubernetes Executor in Airflow

By Hiren Rupchandani & Mukesh Kumar

Although the Celery executor is the most widely used executor for data pipelines in Airflow, it has some notable drawbacks:

  • You need to set up extra infrastructure such as RabbitMQ/Redis and Flower.
  • You need to manage dependencies for Celery, RabbitMQ/Redis, and Flower.
  • Airflow workers stay idle when there is no workload, which wastes resources.
  • Worker nodes are not as resilient as you might expect; if a worker goes down, the tasks running on it go down with it.

The Kubernetes Executor provides the following advantages:

  • Runs the tasks in a Kubernetes cluster.
  • Each task runs in its own pod.
  • Expands and shrinks the number of worker pods according to the workload, so it can even scale down to zero when idle.
  • The scheduler watches the Kubernetes API, so it can launch task pods and track their state directly (enabling the executor is a one-line configuration change, shown after this list).
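
For reference, the executor is selected through Airflow's configuration. The Helm chart used later in this article sets this for you, but if you were configuring Airflow by hand it would be a single setting (shown here using Airflow's environment-variable override convention):

AIRFLOW__CORE__EXECUTOR=KubernetesExecutor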

DISCLAIMER: Before we begin, note that this article focuses on Windows 10 users only.

Prerequisites

  • You should have Docker installed on your system. Docker Desktop (with the Hyper-V backend) does not work on Windows 7, 8, 8.1, or Windows 10 Home Edition.
  • Although you can use WSL2-based Docker on Windows 10 Home Edition, the Professional Edition is preferred.
  • It is expected that you have Windows 10 Professional.
  • Chocolatey, a package installer for Windows (a quick check for both Docker and Chocolatey is shown after this list).
  • You need more than 15 GB of free storage on your system to create a PersistentVolume and keep other relevant files.
  • A working knowledge of Docker, Kubernetes, containers, and pods.
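
Before moving on, you can confirm from PowerShell that both Docker and Chocolatey are available (the exact versions will differ on your machine):

docker --version
choco --version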

Process

  • We will install Kubernetes, kubectl (the CLI for Kubernetes), and Helm, a package manager for Kubernetes (think of Helm as apt in Ubuntu).
  • You can refer to the official Minikube and kubectl documentation for Kubernetes installation.
  • You can refer to the official Helm documentation for Helm installation.
  • To ease the Airflow installation for beginners, we are using a GitHub repository, which you can download (credits are given at the end).
  • We will then install Airflow with the Kubernetes executor in the Kubernetes environment using Helm.
  • This installation will also create a volume to store your DAGs.

Let’s Get Started

Install Kubernetes and Helm

  • The first step is to install Kubernetes using Minikube. To do so, open PowerShell as an administrator and type the following command:
choco install minikube -y
  • Once the installation is complete, you can initialize a local cluster using:
minikube start
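  • Minikube picks an available driver automatically (for example, Hyper-V or Docker on Windows). If the default driver does not work on your machine, you can pin one explicitly (on older Minikube versions the flag is --vm-driver):
minikube start --driver=docker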
  • You can get the node info using this command:
kubectl get nodes
OUTPUT:
NAME       STATUS   ROLES    AGE   VERSION
minikube   Ready    master   2d
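  • You can also confirm that the cluster components are up using:
minikube status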
  • Now we can install Helm using:
choco install kubernetes-helm
  • Helm will allow us to properly install complex packages with various dependencies like Apache Airflow inside a Kubernetes cluster.
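  • You can verify the Helm installation using:
helm version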

Configure and Install Airflow

  • You need to download the repository mentioned earlier and extract the relevant files to an appropriate location, such as C:/Users/username/Documents. Note down this location's absolute path.
  • Now, open the chapter2/airflow-helm-config-kubernetes-executor.yaml file and change the path on line 22 to the absolute path. It should look something like this:
path: "/Users/username/Documents/etl-series/dags"
  • The configuration creates a volume at the given path and mounts it into the Airflow scheduler, webserver, and workers.
  • We can now write DAGs on our local machine and let Airflow, running inside Kubernetes, pick them up from there.
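  • If the stable chart repository has not been added to your Helm installation yet, add it first (the stable charts are hosted at charts.helm.sh):
helm repo add stable https://charts.helm.sh/stable
helm repo update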
  • Now, we can install Airflow with the following command:
helm install airflow stable/airflow -f chapter2/airflow-helm-config-kubernetes-executor.yaml --version 7.2.0
  • The installation status can be checked using:
helm list
  • Once deployed, run the following commands to port-forward the webserver and open the Airflow UI at http://127.0.0.1:8080 (use a bash-compatible shell such as Git Bash or WSL, since the $(...) and $POD_NAME syntax is not valid in PowerShell):
export POD_NAME=$(kubectl get pods --namespace default -l "component=web,app=airflow" -o jsonpath="{.items[0].metadata.name}")
echo http://127.0.0.1:8080
kubectl port-forward --namespace default $POD_NAME 8080:8080

DAGRun

  • You can list the pods running in the cluster using:
kubectl get pods
  • The output will show the various pods that have been spawned, along with their names, statuses, and ages.
  • When you run the DAG, you will observe that Airflow schedules a pod whose name begins with dagthatexecutesviak8sexecutor.
  • This pod, in turn, starts another pod to execute the actual tasks defined in that DAG using the KubernetesPodOperator. Notice the pods whose names begin with dagthatexecutesviakubernetespodoperator; a rough sketch of such a DAG is shown after this list.
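
For orientation, a DAG whose task launches its work through the KubernetesPodOperator looks roughly like the sketch below. The DAG id, image, and command are illustrative placeholders, not the exact contents of the repository's example DAG, and on Airflow 1.10.x the import path is airflow.contrib.operators.kubernetes_pod_operator instead of the provider package shown here.

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Hypothetical example DAG: runs a single task as a separate pod in the cluster.
with DAG(
    dag_id="example_kubernetes_pod_operator",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    say_hello = KubernetesPodOperator(
        task_id="say_hello",
        name="say-hello",                  # name given to the spawned pod
        namespace="default",
        image="python:3.8-slim",           # any image the cluster can pull
        cmds=["python", "-c"],
        arguments=["print('hello from a pod')"],
        get_logs=True,                     # stream the pod's logs back into the Airflow task log
    )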

Congratulations! You have successfully set up Airflow within a Kubernetes environment.

References

What’s Next?

Airflow Error Tracking using Sentry
