How to run ML and memory-intensive tasks in Airflow — Using Kubernetes Pod Operator (Part 1)

Revathi Prakash · Published in THE ICONIC Tech · 7 min read · May 4, 2020


At THE ICONIC, a lot of our data Extract, Transform, Load (ETL) operations are scheduled and managed on Apache Airflow, the open-source scheduler that lets you manage your data manipulation workflows using constructs called DAGs (Directed Acyclic Graphs). These DAGs let us model the complex, interdependent (but acyclic) operations that are the nature of data manipulation.
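To make the idea concrete, here is a minimal, purely illustrative DAG sketch in the Airflow 1.10 style; the DAG id, schedule and commands are made up and are not from any of our pipelines:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG is just Python code that declares tasks and the dependencies between them.
dag = DAG(
    dag_id="example_etl",  # hypothetical DAG id
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# The >> operator wires up the acyclic dependency graph: extract -> transform -> load.
extract >> transform >> load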

Airflow has grown so popular at THE ICONIC that it is not only used by the Data Scientists, but also used as a service by the Software Engineers and others to schedule their own BYO data pipelines. If you are using Airflow for scheduling tasks, there are situations where you want the ability to spin up a separate pod for heavy-duty operations, such as Machine Learning algorithms or memory-hungry applications that might otherwise queue out all the other DAGs and take the Airflow cluster down, or where you want to trigger tasks independently of the programming language they are written in.

The Kubernetes Pod Operator helps us deal with these situations: we can spin pods up and down depending on the tasks we have to perform, thereby overcoming both the programming-language barrier and the resource constraints.

From Airflow 1.10, we have the KubernetesExecutor and a set of associated operators, which are new and allow us to do a lot more managed scheduling. The main advantages of the Kubernetes Executor are:

  • High level of elasticity where you schedule your resources depending upon the workload.
  • Independent pod for each task.
  • Fault tolerance as a task’s failure affects only the pod where it is run and will not take the entire application down.
  • YAML configuration files.

In this article, we will look at how to use the Kubernetes Pod Operator, test it out locally and use it in production. The KubernetesPodOperator allows you to create pods on Kubernetes. It works with all executors (it is preferable to use it with the Celery Executor as that maximizes efficiency) and resembles what the Kubernetes Executor does in the background, but here we do it explicitly with the Pod Operator.
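As a concrete but hedged illustration of what this looks like on Airflow 1.10, here is a sketch of a kubernetes_sample DAG with a passing task, similar in spirit to the one in the sample repository linked below; treat the exact arguments as illustrative rather than as a copy of that code:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="kubernetes_sample",
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
)

# Each task gets its own pod, so the image and command are completely
# independent of the language the rest of your DAGs are written in.
passing = KubernetesPodOperator(
    task_id="passing-task",
    name="passing-test",
    namespace="default",
    image="python:3.6",
    cmds=["python", "-c"],
    arguments=["print('hello world')"],
    get_logs=True,  # stream the pod's stdout into the Airflow task logs
    dag=dag,
)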

Implementation Details:

Step 1: Set up minikube

You will have to install and set up minikube, which is explained for each operating system at the link below.

https://kubernetes.io/docs/tasks/tools/install-minikube/

Once you have minikube installed we need to set up the cluster and create an Airflow pod.

I haven’t used breeze/tick to set up the Airflow deployment in minikube. This tutorial makes use of the basic minikube setup and kubectl commands.

The first thing we need to do is start minikube, which can be done with the command below.

# Start minikube
minikube start --kubernetes-version 1.18.2

Step 2: Configure your Docker Environment

The command below configures your local environment to re-use the Docker daemon inside the minikube instance, so that images you build locally are available to the cluster.

# Set docker env
eval $(minikube docker-env)

Once we have minikube set up, we need to build the sample Airflow Docker image. I have used the Airflow Dockerfile by puckel from GitHub.

Sample Github code repo — https://github.com/revathijay/airflow-pods-example

Step 3: Build a local Airflow image

Now we need to build the image in the minikube context. If the Dockerfile doesn’t exist in the current working directory, you can specify the path of the file. I have used the path from the sample repository. Also, I have tagged the image to be called airflow-kubes:latest.

docker build -f .cicd/docker/Dockerfile -t airflow-kubes:latest .

Step 4: Deploy Airflow in minikube

Once the image is built we can deploy it in minikube with the following steps.

Before we set out to deploy Airflow and test the Kubernetes Operator, we need to make sure the application is tied to a service account that has the necessary privileges for creating new pods in the default namespace. In Kubernetes, service accounts are used to provide an identity for pods. Here we are going to bind roles to the default service account as listed in role.yaml.

kubectl apply -f .cicd/k8s/role.yaml
Fig1: Cluster role binding for enabling permission to create pods from Airflow.

Once the service account permissions are sorted, we have the option to either create a single pod with the above image or create a Kubernetes deployment.

The below command creates the pod. You have to point it at the default minikube kube config, which can be done by specifying the kubeconfig parameter.

kubectl run airflow-kubes --kubeconfig=$HOME/.kube/config --image=airflow-kubes:latest --image-pull-policy=Never

You should see:

Fig2: Airflow pod creation.

To create a Kubernetes deployment we need to use the YAML file provided in the sample project. If you run the below command it creates a deployment for Airflow.

kubectl apply -f .cicd/k8s/deployment.yaml

This should result in the following:

Fig3: Airflow deployment created.

Step 5: Opening up access to Airflow

To create a service to access Airflow we need to expose the deployment that we created.

kubectl expose deployment airflow-kubes-pod --type=LoadBalancer --port=8080

You should see:

Fig4: Airflow service is created

This exposes the service, and you can open the Airflow UI via the URL returned by:

minikube service airflow-kubes-pod

You should be able to see the Airflow UI.

Fig5: Airflow UI

Step 6: Monitor and spend the rest of your life debugging the pods here

To see the Kubernetes deployments and services, we can open the dashboard with the command below.

minikube dashboard

This opens a dashboard which lists all the deployments, jobs and cron jobs in your Kubernetes cluster and looks like this:

Fig6: Kubernetes Dashboard

Once you have run all of these you are good to go.

You can bash into the running Airflow pod and then run a sample test that I have added here.

kubectl get pods
kubectl exec -it <replace with your pod name> -- /bin/bash

Now it's time to test our sample DAG tasks. We can test out the Kubernetes Pod Operator with the sample DAG that is added in the GitHub repository. As seen in the code, there are two tasks in the sample DAG and we are going to run the passing task.

airflow test kubernetes_sample passing-task 2020-04-12

You should see the logs as below.

Fig7: Airflow tasks logs.

In the background, you should see a pod getting created and running the specific commands defined in the task.

If it exits with an error, just try it again because it takes time to download the image mentioned in the code. I have used a simple Python 3.6 image to print ‘Hello World’.

If you log in to the Airflow UI and trigger the same Kubernetes DAG, you should see the tasks getting executed.

Fig8: Airflow UI DAG execution status.

If you go to the Kubernetes dashboard, you should see pods getting triggered and then successfully terminated.

Fig9: Airflow passing task pod creation.

Quick Summary

These six simple steps will take you from being someone who builds a unique microservice for every single application to someone who sends out a URL and a help document to the next data scientist who wants more infra to build and deploy their ML code.

Things To Remember

Please keep a note of these parameters for the Kubernetes Pod Operator (a short sketch showing them in context follows this list):

  1. is_delete_operator_pod: deletes completed pods so that your cluster resources are managed properly and cleaned up once the tasks have executed.
  2. in_cluster: when deploying to production, this tells the task to look for the kube config inside the cluster.
  3. namespace: identifies where the pod should be created and where to look for configuration.
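As a hedged sketch (the DAG id, task id, image and command below are placeholders, not code from the sample repository), this is where those three parameters sit on the operator:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(dag_id="ml_pipeline", start_date=datetime(2020, 4, 1), schedule_interval=None)

train = KubernetesPodOperator(
    task_id="train-model",
    name="train-model",
    image="python:3.6",
    cmds=["python", "-c"],
    arguments=["print('training...')"],
    namespace="default",          # where the pod is created and where config is looked up
    in_cluster=True,              # in production, load the kube config from inside the cluster
    is_delete_operator_pod=True,  # delete the pod after completion to free cluster resources
    dag=dag,
)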

To set this up and run it in production, the recommended method is to create a separate service account, assign it the required roles and permissions, and add the service account details to the Airflow deployment YAML.

Also, if you want to pull images from a private registry, you need to create a secret based on your existing Docker credentials and reference it with the imagePullSecrets tag in the deployment and the service account creation YAML.

What’s Next

In the next part, we will see how to make use of the Kubernetes Executor, where all tasks are executed as independent pods irrespective of the operator defined for the task.

Happy experimenting with the Kubernetes Pod Operator. If you have any comments or feedback, please leave them in the comments section below. More than happy to learn from all of you!
