Atanu Maity
Production-Level Data Processing Workflow Management - Airflow
In today's world, every data-informed company faces the same issue: data teams and data volumes are growing quickly, and so is the complexity of the challenges they take on. The many workflows shared among data engineers, data scientists, and analysts therefore need to be well managed. We need a tool or platform that lets us move fast and keep our momentum as we author, monitor, and retrofit data pipelines.
Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. Here comes Airflow, a tool originally created at Airbnb for exactly this need, later open-sourced and now maintained under the Apache Software Foundation.
Details about the tool can be found here.
In summary, workflows in Airflow are maintained as DAGs (Directed Acyclic Graphs): directed sequences of operators connected to one another in an upstream/downstream manner. A minimal example DAG is sketched after the operator list below.
An Example DAG Graph View:

There are different types of operators available (as listed on the Airflow website):
BashOperator - executes a bash command
PythonOperator - calls an arbitrary Python function
EmailOperator - sends an email
SimpleHttpOperator - sends an HTTP request
MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
Sensor - waits for a certain time, file, database row, S3 key, etc…
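To make the upstream/downstream idea concrete, here is a minimal sketch of a DAG file using Airflow 1.x-style imports. The DAG id, schedule, and the say_hello function are illustrative placeholders, not something shipped with Airflow:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def say_hello():
    # Arbitrary Python callable executed by the PythonOperator
    print("Hello from Airflow!")


default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
}

with DAG(
    dag_id="example_hello_dag",       # illustrative name
    default_args=default_args,
    schedule_interval="@daily",
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pretend we pulled some data'",
    )

    transform = PythonOperator(
        task_id="transform",
        python_callable=say_hello,
    )

    # extract is upstream of transform; transform runs only after extract succeeds
    extract >> transform

A file like this would go into your dags folder, which we will set up in the next section.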
How to install Airflow:
It's a purely Pythonic tool, and hence pip install works fine here.
pip install apache-airflow
You can also install Airflow with support for extra features such as GCP or Postgres:
pip install apache-airflow[postgres,gcp]
More can be found here
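Once installed, you can quickly confirm that the CLI is on your path by printing the installed version (the exact version string will of course depend on your installation):
$ airflow version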
How to start Airflow:
Before you start anything, create a folder and set it as AIRFLOW_HOME
$ export AIRFLOW_HOME=$(pwd)/your_airflow_home_path
Make sure you are in the folder just above your_airflow_home_path before running the export command.
Within your_airflow_home_path, create another folder to keep your own DAGs. Name it dags.
Now you need to initialize a database backend, as Airflow was built to interact with its metadata using the great SqlAlchemy library.
To do that, simply execute a command that will initialize the metadata database (a SQLite database by default) where Airflow stores and accesses its metadata.
$ airflow initdb
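After initdb finishes, your_airflow_home_path will contain a generated airflow.cfg file and, with the default configuration, a SQLite database file named airflow.db. The connection string lives in airflow.cfg and looks roughly like the line below (the path shown is only an illustration; yours will depend on your AIRFLOW_HOME):
sql_alchemy_conn = sqlite:////your_airflow_home_path/airflow.db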
Now you need to start a webserver, which will allow you to access the Airflow UI and play with the DAGs:
$ airflow webserver -p 8080
It will start the server on port 8080.
Now, in another terminal, you need to start the Airflow scheduler, which will allow you to schedule DAG tasks. To do that, run:
$ airflow scheduler
You are done with the initial setup! Open your browser and simply go to
localhost:8080/
The Airflow UI is here. Ta-da!
You can see that you are logged in as Admin by default, with no option to log out or to create a different user. To enable that, you need a small trick.
Go to your_airflow_home_path, open the airflow.cfg file, set the RBAC (role-based access control) parameter under the [webserver] section to True, and save the changes:
rbac = True
Restart the whole Airflow startup process (webserver and scheduler) from the beginning.
There you are, the UI is asking for user credentials.
Now you need to create users with different roles. The default roles that ship with Airflow are Admin, User, Op, Viewer, etc. You can also create your own role with a different set of permissions.
Using the Airflow CLI, you can simply create new users and assign them a specific role:
airflow create_user [-h] [-r ROLE] [-u USERNAME] [-e EMAIL]
                    [-f FIRSTNAME] [-l LASTNAME] [-p PASSWORD]
                    [--use_random_password]
So if you simply run from the CLI:
airflow create_user -r Admin -u atanu -e mymail@example.com -f atanu -l maity -p abcdef
Airflow will create a user 'atanu' with the Admin role and the other supplied details.
Now you can simply log in to the Airflow UI using these credentials.
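Depending on your Airflow version, the CLI may also offer a command to list existing users; treat this as version-dependent and purely optional for the setup:
$ airflow list_users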
Once you are logged in, you will see a few example DAGs that ship with Airflow. Every time you log in, the Airflow UI will show all of those DAGs. But good news: you can turn that off too. Go to airflow.cfg and set
load_examples = False
and save the settings.
I guess this is enough for the introduction and for setting up Airflow. In upcoming blogs, we will learn how to trigger a DAG, how to write custom operators and custom DAGs, how to set up email notifications on failure, and a lot more functionality.
Go ahead, install Airflow on your local machine and start playing with it. Thanks. Happy Learning!