Introduction to Apache Airflow

Logo of Apache Airflow (Source: Wikipedia)

By Hiren Rupchandani & Mukesh Kumar

Table of Contents

1. What is Airflow?

2. Airflow Principles

3. Airflow’s History

4. Airflow Architecture

5. Airflow’s Components

6. Working of Components

7. What’s Next

In the previous article, we gave an overview of what data pipelines are, the components involved, and the tools required to build one. This article explores a workflow management framework: Apache Airflow. Let’s dive in…

What is Airflow?

Apache Airflow was open-sourced by Airbnb in October 2014

A formal definition of Airflow from Airbnb is as follows:

Airflow is a platform to programmatically author, schedule and monitor workflows.

Apache Airflow is a workflow management platform open-sourced by Airbnb that manages directed acyclic graphs (DAGs) and their associated tasks. By default, Python is used as the programming language to define a pipeline’s tasks and their dependencies.
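
For a flavour of what this looks like, here is a minimal sketch of a DAG defined in Python; the DAG id, schedule, and command are illustrative and not taken from any particular project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical single-task DAG; the id, schedule, and command are examples.
with DAG(
    dag_id="hello_airflow",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # the scheduler will create one run per day
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello, Airflow!'",
    )
```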

Airflow Principles

Airflow provides a rich interface to create and maintain workflows. It is built on the following principles, which set it apart from its competitors:

  • Dynamic: Pipelines are defined in Python code, so they can be instantiated dynamically (see the sketch after this list).
  • Scalable: Thanks to its modular architecture, the number of workers can be scaled up or down to match the user’s requirements.
  • Extensible: Airflow lets you define custom operators and integrate with third-party tools such as StatsD, MySQL, etc.
  • Elegant: With the help of Jinja (a templating engine for Python), pipelines can be parametrized. Airflow also features an easy-to-learn user interface.
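
As a rough illustration of the Dynamic and Elegant principles, the sketch below generates tasks in a Python loop and uses Jinja templating to inject the run’s logical date; the DAG id, task ids, and commands are made up for the example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dynamic_and_templated",   # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Dynamic: tasks can be generated from plain Python, e.g. in a loop.
    for source in ["users", "orders", "payments"]:
        BashOperator(
            task_id=f"extract_{source}",
            # Elegant: Jinja templating injects runtime values such as the
            # run's logical date ({{ ds }}) into the command.
            bash_command=f"echo 'extracting {source} for {{{{ ds }}}}'",
        )
```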

Airflow’s History

Logo of Airbnb (Source: Wikimedia)

Airbnb developed Airflow to manage its large and complex network of computational jobs. The project was open-sourced in October 2014, entered Apache’s Incubator program in March 2016, and became a Top-Level Project of the Apache Software Foundation in January 2019. Today it is used by more than 400 companies, including Airbnb, Robinhood, and Twitter, in their data architectures.

Number of contributions to Airflow’s GitHub from October 2014 to July 2021

Airflow Architecture

Airflow is a workflow scheduler and management program mainly used for developing and maintaining data pipelines. Jobs are represented as Directed Acyclic Graphs (DAGs), each of which contains a set of interrelated tasks. Before diving into the architecture, let’s take a high-level look at some basic terms used in Airflow:

Basic Concepts

Airflow has some basic terms that will be used throughout the series while building and monitoring data pipelines. These terms are as follows:

Tasks

Processes/Tasks represented as individual modules

A task is the basic unit of execution. It can be reading data from a database, processing the data, storing the data in a database, etc. There are three basic types of Tasks in Airflow:

  • Operators: They are pre-defined templates used to build most of the Tasks.
  • Sensors: They are a special subclass of Operators and have only one job — to wait for an external event to take place so they can allow their downstream tasks to run.
  • TaskFlow: Introduced in Airflow 2.0, the TaskFlow API lets you write tasks as decorated Python functions and makes it easy to share data between tasks in a pipeline (see the sketch after this list).
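
A minimal sketch showing the three kinds of tasks side by side is given below; the DAG id, file path, and task names are hypothetical, and the sensor assumes the default filesystem connection:

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="task_types_demo",          # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Sensor: waits for an external event (here, a file appearing on disk).
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/input.csv",     # hypothetical path
        poke_interval=30,
    )

    # Operator: a pre-defined template for a unit of work.
    announce = BashOperator(
        task_id="announce",
        bash_command="echo 'file has arrived'",
    )

    # TaskFlow: a plain Python function decorated with @task; its return
    # value can be passed on to downstream tasks.
    @task
    def count_rows() -> int:
        with open("/tmp/input.csv") as f:
            return sum(1 for _ in f)

    wait_for_file >> announce >> count_rows()
```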

Directed Acyclic Graphs

Tasks related via dependencies, forming a DAG

In basic terms, a DAG is a graph whose nodes are connected by directed edges and which contains no cycles. In Airflow, the tasks are the nodes and the directed edges represent the dependencies between tasks.

Control Flow

Just as a graph’s directed edges connect its nodes, an Airflow DAG’s dependencies connect its tasks. This control flow defines the order in which the workflow is carried out, as sketched below.
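
As a small sketch (with made-up task ids), dependencies are typically declared with the bitshift operator:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="control_flow_demo",        # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")

    # The bitshift operator declares the directed edges (dependencies):
    # extract must finish before transform, which must finish before load.
    extract >> transform >> load
```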

Task instance

A task instance is a single execution of a task. It also records the state of the task, such as “running”, “success”, “failed”, “skipped”, or “up for retry”. The color codes for the various task states are as follows:

States of Tasks indicated by color codes in Airflow UI

DAGrun

When a DAG is triggered in Airflow, a DAGrun object is created. A DAGrun is an instance of an executing DAG. It contains the timestamp at which the DAG was instantiated and the state (running, success, or failed) of the DAG. DAGruns can be created by an external trigger or at scheduled intervals by the scheduler.
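
The sketch below (with a hypothetical DAG id) illustrates both ways a DAGrun can come about: the daily schedule creates runs automatically, and a manual trigger creates one on demand:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Every time the schedule below fires (or the DAG is triggered manually, for
# example with the CLI command `airflow dags trigger daily_report`), Airflow
# creates a DAGrun that records the run's timestamp and state.
with DAG(
    dag_id="daily_report",             # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",        # one scheduled DAGrun per day
    catchup=False,
) as dag:
    DummyOperator(task_id="placeholder")
```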

Example

In the following execution of a DAG, we will assume four different tasks:

  • Reading Data: To read data from the source.
  • Processing Categorical Data: To process the categorical data.
  • Processing Continuous Data: To process the continuous data.
  • Merging Data: To merge the processed categorical and continuous data.

The dependencies are set as follows:

  • Reading Data is upstream of Processing Categorical Data and Processing Continuous Data, while Merging Data is downstream of both processing tasks (a code sketch of this DAG follows).
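
Expressed in code, this example DAG might look like the following sketch; the DAG id, task ids, and placeholder callables are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; a real pipeline would read, process, and merge data here.
def read_data():
    pass

def process_categorical():
    pass

def process_continuous():
    pass

def merge_data():
    pass

with DAG(
    dag_id="example_processing_dag",   # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    reading = PythonOperator(task_id="reading_data", python_callable=read_data)
    categorical = PythonOperator(task_id="processing_categorical_data",
                                 python_callable=process_categorical)
    continuous = PythonOperator(task_id="processing_continuous_data",
                                python_callable=process_continuous)
    merging = PythonOperator(task_id="merging_data", python_callable=merge_data)

    # Reading Data is upstream of both processing tasks;
    # Merging Data is downstream of both.
    reading >> [categorical, continuous] >> merging
```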

The below figure depicts a pictorial representation of the DAG’s execution:

Example: Execution of a DAG

The execution steps are as follows:

  • When the DAG is triggered, a DAGrun instance is created; it sets the state of the DAG to “running” and tracks the states of the tasks. The tasks are set to “queued” and are scheduled based on their dependencies.
  • The first task (Reading Data) is executed. Its state is set to “running” while the other tasks remain “queued”. After a successful execution its state becomes “success”, otherwise “failed”. You can refer to fig. a in the above diagram.
  • According to the dependencies set, the processing tasks (Processing Categorical Data and Processing Continuous Data) are scheduled to execute after the Reading Data task.
  • Their state is set to “running” and their execution begins, either serially or in parallel depending on the executor. You can refer to fig. b in the above diagram.
  • After their successful execution, their status is changed to “success” and the next task’s (Merging Data) state is set to “running”. You can refer to fig. c in the above diagram.
  • Because of the dependencies set, the final task of merging the data is executed only after the processing tasks have successfully finished their execution.
  • After the final task has been successfully executed, its state is set to “success”. For further reference, you can check fig. d in the above image.
  • Since the DAG has been successfully executed, its state is set to “success”.

Airflow’s Components

Now that the basic terminology is clear, let’s look at the primary components that form Airflow’s architecture:

A generalized Airflow Architecture with essential components

Webserver: A simple user interface to inspect, trigger, and debug DAGs with the help of logs. It displays task states and enables the user to interact with the metadata database.

Executor: The mechanism that handles running tasks. Airflow ships with several local executors, primarily the Sequential Executor, Local Executor, and Debug Executor, as well as remote executors for heavier workloads, such as the Celery Executor, Dask Executor, Kubernetes Executor, and CeleryKubernetes Executor.

Workers: The executor works closely with the workers to execute tasks, assigning the tasks waiting in the queue to available workers.

Scheduler: The scheduler has two responsibilities:

  • To trigger the scheduled DAGs.
  • To submit the tasks to the executor to run.

It is a multi-threaded Python process that uses the DAG definitions to schedule tasks and stores the information about each DAG in the metadata database.

Metadata Database: It underpins how the other components interact and stores the state recorded by the other three components (webserver, scheduler, and executor). All processes read from and write to this database. Any database management system supported by SQLAlchemy (such as MySQL or PostgreSQL) can be used as the metadata database.
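
As a rough sketch of how these two components are selected, the executor and the metadata database connection are set in Airflow’s configuration file; the values below (a LocalExecutor and a PostgreSQL connection string) are placeholders, and the exact section name can vary across Airflow versions:

```ini
# airflow.cfg (illustrative values; the connection string is a placeholder)
[core]
# Which executor the scheduler uses to run tasks.
executor = LocalExecutor

# Where Airflow stores DAG, task, and run state (the metadata database).
# In newer Airflow versions this setting lives under a [database] section.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```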

Working of the components

Working of components in the Airflow architecture
  • The Scheduler continuously scans the DAG directory and makes an entry for each DAG in the database.
  • The DAGs are then parsed and the DAGruns are created. The Scheduler also creates instances of the tasks that need to be executed.
  • All these tasks are marked “scheduled” in the database.
  • The primary scheduler process then takes all the tasks that are marked “scheduled” and sends them to a queue. These tasks are then marked as “queued”.
  • The executor fetches tasks from the queue and assigns them to the workers.

What’s Next?
