
Production-Level Data Processing Workflow Management - Airflow

In today's world, every data-informed company faces the same issue: its data teams and data volumes are growing quickly, and so is the complexity of the challenges they take on. The many workflows shared among data engineers, data scientists, and analysts therefore need to be well managed. We need a tool or platform that lets us move fast and keep our momentum as we author, monitor, and retrofit data pipelines.


Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. Here comes Airflow, a tool originally created by Airbnb for its own needs and now maintained as an open-source project under the Apache Software Foundation.

Details about the tool can be found here.

In summary, workflows in Airflow are maintained as DAGs (Directed Acyclic Graphs): directed sequences of operators wired to one another in an upstream/downstream manner (a minimal example DAG is sketched after the operator list below).


An Example DAG Graph View:

There are different types of operators available (as listed on the Airflow website):

  • BashOperator - executes a bash command

  • PythonOperator - calls an arbitrary Python function

  • EmailOperator - sends an email

  • SimpleHttpOperator - sends an HTTP request

  • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command

  • Sensor - waits for a certain time, file, database row, S3 key, etc…
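To make the upstream/downstream idea concrete, here is a minimal sketch of a DAG file you could drop into your dags folder (introduced below). It uses Airflow 1.10-style imports; the dag_id, task ids, schedule, and start date are illustrative choices, not anything Airflow requires.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def print_hello():
    # Arbitrary Python callable run by the PythonOperator task
    print("Hello from Airflow!")

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),
}

with DAG(dag_id="example_hello_dag",
         default_args=default_args,
         schedule_interval="@daily") as dag:

    print_date = BashOperator(task_id="print_date", bash_command="date")

    say_hello = PythonOperator(task_id="say_hello", python_callable=print_hello)

    # print_date is upstream of say_hello: it runs first, then say_hello
    print_date >> say_hello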


How to install Airflow:

It's a purely Pythonic tool, and hence pip install works fine here:

pip install apache-airflow

You can also install Airflow with support for extra features such as gcp or postgres:

pip install apache-airflow[postgres,gcp]
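Note that some shells (zsh, for example) treat the square brackets as glob patterns, so you may need to quote the package specifier:

pip install 'apache-airflow[postgres,gcp]'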

More options can be found here.


How to start Airflow:

Before you start anything, create a folder and set it as AIRFLOW_HOME:


$ export AIRFLOW_HOME=$(pwd)/your_airflow_home_path


Make sure you are in the folder just above your_airflow_home_path before running the export command.

Within your_airflow_home_path, create another folder to keep your own DAGs. Name it dags, for example:
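The folder name dags matters here, since that is where Airflow looks for DAG files by default (the dags_folder setting in airflow.cfg). Assuming the AIRFLOW_HOME set above:

$ mkdir -p $AIRFLOW_HOME/dags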

Now you need to initialize a database backend, as Airflow was built to interact with its metadata using the great SQLAlchemy library.

To do that, you simply need to execute a command that will initialize the metadata database (a local SQLite one by default) that Airflow uses to store and access its metadata.

$ airflow initdb
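If you prefer a heavier backend such as PostgreSQL instead of the default SQLite file, you can point sql_alchemy_conn in airflow.cfg at it before running initdb. The connection string below is purely illustrative (user, password, and database name are assumptions):

sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow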

Now you need to start the webserver, which will allow you to access the Airflow UI and play with the DAGs:

$ airflow webserver -p 8080

It will start the server on port 8080.

Now, in another terminal, you need to start the Airflow scheduler, which is what actually schedules your DAG tasks. To do that, run:

$ airflow scheduler 

You are done with the initial setup! Open your browser and simply go to

localhost:8080/

The Airflow UI is here. Ta-da!


Now you can see that, by default, you are logged in as Admin with no option to log out or create a different user. To enable that, you need a small tweak.

Go to your_airflow_home_path, open the airflow.cfg file, set the RBAC (role-based access control) parameter under the [webserver] section to True, and save the changes:

rbac = True

Restart the whole Airflow startup process (webserver and scheduler) from the beginning.

There you are: the UI is now asking for user credentials.

Now you need to create users with different roles. The default roles in Airflow are Admin, User, Op, etc. You can also create your own roles with different sets of permissions.

Using the Airflow CLI, you can simply create new users and assign them a specific role:

airflow create_user [-h] [-r ROLE] [-u USERNAME] [-e EMAIL] [-f FIRSTNAME]
                    [-l LASTNAME] [-p PASSWORD] [--use_random_password]

So if you simply run from the CLI:

airflow create_user -r Admin -u atanu -e mymail@example.com -f atanu -l maity -p abcdef

Airflow will create a user 'atanu' with the Admin role and the other credentials attached.
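If you want to double-check, the same CLI can list the users it knows about:

airflow list_users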

Now you can simply log in using your credentials in Airflow UI.


Once you are logged in, you can see a few example DAGs, which ship with Airflow. Every time you log in, the Airflow UI will list all of those DAGs. The good news is that you can turn that off too. Go to airflow.cfg and set

load_examples = False

and save the settings.
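Note that example DAGs already loaded into the metadata database may keep showing up until the database is reset; on a fresh setup (this wipes all metadata, so use it with care) you can run

$ airflow resetdb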


I guess this is enough for the introduction and for setting up Airflow. In upcoming blogs, we will learn how to trigger a DAG, how to write custom operators and custom DAGs, how to set up email notifications on failure, and a lot more.


So install Airflow on your local machine and start playing with it. Thanks, and happy learning!






