Introduction to Apache Airflow
By Hiren Rupchandani & Mukesh Kumar
In the previous article, we gave an overview of what data pipelines are, the components involved, and the tools required to build one. This article explores a workflow management framework: Apache Airflow. Let’s dive in.
What is Airflow?
A formal definition of Airflow from Airbnb is as follows:
Airflow is a platform to programmatically author, schedule and monitor workflows.
Apache Airflow is a workflow management platform open-sourced by Airbnb that manages directed acyclic graphs (DAGs) and their associated tasks. By default, Python is used as the programming language to define a pipeline’s tasks and their dependencies.
Airflow provides a rich interface to create and maintain workflows. It follows these principles, which set it apart from its competitors:
- Dynamic: Pipelines are defined in Python code, so they can be instantiated dynamically.
- Scalable: Thanks to its modular architecture and its ability to orchestrate workers, the number of workers can be scaled up or down according to the user’s requirements.
- Extensible: Airflow lets you define custom operators and integrate with third-party tools such as StatsD and MySQL.
- Elegant: Pipelines can be parametrized with Jinja (a template engine for Python). Airflow also features an easy-to-learn user interface.
Airbnb developed Airflow to manage its large and complex network of computational jobs. The project was made open source in October 2014, entered the Apache Incubator program in March 2016, and became a Top-Level Project of the Apache Software Foundation in January 2019. Today it is used by more than 400 companies, including Airbnb, Robinhood, and Twitter, in their data architecture.
Airflow is a workflow scheduler and management program mainly used for developing and maintaining data pipelines. Jobs are represented as Directed Acyclic Graphs (DAGs), each containing a set of interrelated tasks. Before diving into the architecture, let’s go over some basic terms that will be used throughout the series while building and monitoring data pipelines:
A Task is the basic unit of execution. It can read data from a database, process the data, store the data in a database, and so on. There are three basic types of Tasks in Airflow:
- Operators: Predefined templates used to build most Tasks.
- Sensors: A special subclass of Operators with only one job: to wait for an external event to take place so that their downstream tasks can run.
- TaskFlow: Added in Airflow 2.0, the TaskFlow API lets you write tasks as Python functions marked with the @task decorator and makes it easier to share data between the tasks of a pipeline.
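To make the idea of a Sensor more concrete, here is a minimal sketch in plain Python of the "wait for an external event" loop. It is not Airflow code: the function `wait_for`, the callable `file_has_arrived`, and the `checks` counter are hypothetical names used for illustration. The parameter names `poke_interval` and `timeout` are borrowed from Airflow's sensors, which expose settings with the same names.

```python
import time
from typing import Callable


def wait_for(poke: Callable[[], bool], poke_interval: float = 0.01, timeout: float = 1.0) -> bool:
    """Call `poke` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True   # the event happened: downstream tasks may run
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        time.sleep(poke_interval)


# Hypothetical external event: a file that "arrives" on the third check.
checks = {"count": 0}


def file_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(file_has_arrived)
```

In this sketch the downstream work would start only once `arrived` is true, which is the role a Sensor plays inside a DAG.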
Directed Acyclic Graphs
In basic terms, a DAG is a graph whose nodes are connected by directed edges and which contains no cycles. In Airflow, the Tasks are the nodes and the directed edges represent the dependencies between Tasks.
Dependencies are the directed edges of the DAG: they connect Tasks and define the order in which the workflow is carried out.
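The two defining properties of a DAG, directed edges and no cycles, can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This is a sketch of the graph concept only, not of Airflow's internals, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of upstream tasks it depends on.
# The task names are hypothetical.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load": {"aggregate"},
}

# A valid execution order always places a task after all of its upstream tasks.
order = list(TopologicalSorter(dependencies).static_order())

# Adding an edge from "load" back to "extract" creates a cycle,
# so the graph is no longer a DAG and no valid order exists.
cyclic = dict(dependencies)
cyclic["extract"] = {"load"}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
```

Because the first graph is acyclic, an execution order can be derived from the dependencies alone; the second graph is rejected because no task could ever run first.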
A Task Instance is a single execution of a task. It also indicates the state of the Task, such as “running”, “success”, “failed”, “skipped”, or “up for retry”. The Airflow user interface shows each of these states in its own color.
When a DAG is triggered in Airflow, a DAGrun object is created. A DAGrun is an instance of an executing DAG. It contains the timestamp at which the DAG was instantiated and the state (running, success, failed) of the DAG. DAGruns can be created by an external trigger or at scheduled intervals by the scheduler.
In the following execution of a DAG, we will assume four different tasks:
- Reading Data: To read data from the source.
- Processing Categorical Data: To process the categorical data.
- Processing Continuous Data: To process the continuous data.
- Merging Data: To merge the processed categorical and continuous data.
The dependencies are set as follows:
- Reading Data is an upstream task of Processing Categorical Data and Processing Continuous Data.
- Merging Data is a downstream task of Processing Categorical Data and Processing Continuous Data.
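In an Airflow DAG file, dependencies like these are commonly declared with the `>>` operator, for example `read >> [categorical, continuous] >> merge`. The sketch below shows how that syntax can work, using a small hypothetical `Task` class written in plain Python. It is a simplified model for illustration and not Airflow's actual implementation.

```python
class Task:
    """Hypothetical stand-in for an Airflow task; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def _link(self, downstream_task):
        # Record the edge self -> downstream_task on both ends.
        self.downstream.add(downstream_task.task_id)
        downstream_task.upstream.add(self.task_id)

    def __rshift__(self, other):
        # Handles `task >> task` and `task >> [task, task]`.
        for t in (other if isinstance(other, list) else [other]):
            self._link(t)
        return other  # returning `other` is what allows chaining

    def __rrshift__(self, other):
        # Handles `[task, task] >> task`: lists have no `>>`,
        # so Python falls back to the right-hand operand.
        for t in other:
            t._link(self)
        return self


read = Task("reading_data")
categorical = Task("processing_categorical_data")
continuous = Task("processing_continuous_data")
merge = Task("merging_data")

# Reading Data fans out to both processing tasks, which fan in to Merging Data.
read >> [categorical, continuous] >> merge
```

After the last line runs, `merge.upstream` holds both processing tasks and `read.downstream` holds the same two, matching the dependencies listed above.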
The figure below depicts the DAG’s execution:
The execution steps are as follows:
- When a DAG is triggered, a DAGrun instance is created. It sets the state of the DAG to “running” and also holds the states of the tasks. The task states are set to “queued”, and the tasks are scheduled based on their dependencies.
- The first task (Reading Data) is executed. Its state is set to “running” while the other tasks remain in the “queued” state. If it succeeds, its state is set to “success”; otherwise it is set to “failed”. You can refer to fig. a in the above diagram.
- According to the dependencies set, the processing tasks (Processing Categorical Data and Processing Continuous Data) are scheduled to be executed after the Reading Data task.
- So their state is set to “running” and their execution begins (either serially or in parallel depending on their executor). You can refer to fig. b in the above diagram.
- After their successful execution, their status is changed to “success” and the next task’s (Merging Data) state is set to “running”. You can refer to fig. c in the above diagram.
- Because of the dependencies set, the final task of merging the data is executed only after the processing tasks have successfully finished their execution.
- After the final task has been successfully executed, its state is set to “success”. For further reference, you can check fig. d in the above image.
- Since the DAG has been successfully executed, its state is set to “success”.
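The execution steps above can be approximated with a short simulation in plain Python (3.9 or later, for `graphlib`). This is a simplified model and not how Airflow is implemented: the function `run_dag`, the `actions` mapping, and the snake_case task names are assumptions made for illustration. Ready tasks are run one after another, as a sequential executor would do. If a task fails, the tasks that depend on it are marked `upstream_failed`, a state name borrowed from Airflow.

```python
from graphlib import TopologicalSorter

# Upstream dependencies of each task in the example.
dependencies = {
    "reading_data": set(),
    "processing_categorical_data": {"reading_data"},
    "processing_continuous_data": {"reading_data"},
    "merging_data": {"processing_categorical_data", "processing_continuous_data"},
}


def run_dag(dependencies, actions):
    """Run tasks in dependency order; return (dag_state, task_states, snapshots)."""
    states = {task: "queued" for task in dependencies}
    snapshots = []  # copy of the task states each time a group starts running
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        if not ready:
            break  # a failed upstream task blocks everything that is left
        for task in ready:
            states[task] = "running"
        snapshots.append(dict(states))
        for task in ready:
            try:
                actions[task]()
            except Exception:
                states[task] = "failed"
            else:
                states[task] = "success"
                sorter.done(task)  # unblocks the downstream tasks
    for task, state in states.items():
        if state == "queued":
            states[task] = "upstream_failed"
    dag_state = "success" if all(s == "success" for s in states.values()) else "failed"
    snapshots.append(dict(states))
    return dag_state, states, snapshots


# Every task is a no-op here; real tasks would read, process, and merge data.
actions = {task: (lambda: None) for task in dependencies}
dag_state, task_states, snapshots = run_dag(dependencies, actions)
```

With these no-op tasks, the four entries of `snapshots` correspond to the four stages of the walkthrough: only the reading task running, both processing tasks running, the merging task running, and finally every task in the “success” state.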
Now that the basic terminology is clear with an example, let’s see the primary components that form the architecture of Airflow:
Webserver: A simple user interface to inspect, trigger, and debug the working of DAGs with the help of logs. It displays the task states and enables the user to interact with the metadata database.
Executor: The mechanism that handles running tasks. Airflow ships with several executors. The local ones are the Sequential Executor, Local Executor, and Debug Executor. There are also remote executors for heavier workloads, such as the Celery Executor, Dask Executor, Kubernetes Executor, and CeleryKubernetes Executor.
Workers: The executor works closely with the workers to execute tasks. It assigns the tasks waiting in the queue to the workers.
Scheduler: A scheduler has two tasks:
- To trigger the scheduled DAGs.
- To submit the tasks to the executor to run.
It is a multi-threaded Python process that uses the DAG information to schedule the tasks. It stores the information about each DAG in the metadata database.
Metadata Database: It underpins how the other components interact with each other and stores the state recorded by the other three components (webserver, scheduler, and executor). All processes read from and write to this database. Any database management system supported by SQLAlchemy (such as MySQL or PostgreSQL) can be used for the metadata database.
Working of the components
- The Scheduler continuously polls the DAG directory and makes an entry for each DAG in the database.
- The DAGs are then parsed and the DAGruns are created. The Scheduler also creates instances of the tasks that need to be executed.
- All these tasks are marked “scheduled” in the database.
- The primary scheduler process then takes all the tasks that are marked “scheduled” and sends them to a queue. These tasks are then marked as “queued”.
- The executor fetches the tasks from the scheduler queue and assigns them to the workers.
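The hand-off between scheduler, queue, executor, and workers described above can be sketched with the Python standard library. This is a toy model under stated assumptions: a plain dictionary stands in for the metadata database, worker threads stand in for Airflow workers, dependencies between tasks are ignored, and all names (`metadata_db`, `schedule`, `execute`, `work`, and the task IDs) are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the metadata database: task_id -> state.
metadata_db = {}
# Stand-in for the scheduler queue.
task_queue = queue.Queue()


def schedule(task_ids):
    """Scheduler step: mark tasks 'scheduled', then queue them and mark them 'queued'."""
    for task_id in task_ids:
        metadata_db[task_id] = "scheduled"
    for task_id in task_ids:
        task_queue.put(task_id)
        metadata_db[task_id] = "queued"


def work(task_id):
    """Worker: run one task and record its state in the database."""
    metadata_db[task_id] = "running"
    # The real work of the task would happen here.
    metadata_db[task_id] = "success"
    return task_id


def execute(max_workers=2):
    """Executor step: take tasks off the queue and assign them to workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        while not task_queue.empty():
            futures.append(pool.submit(work, task_queue.get()))
        return sorted(f.result() for f in futures)


schedule(["task_a", "task_b", "task_c"])
states_after_scheduling = dict(metadata_db)  # every task is 'queued' at this point
finished = execute()
```

Every component reads from or writes to `metadata_db`, which mirrors the central role the metadata database plays in the architecture described earlier.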