k8s-airflow (Autoscaled Celery Executor)

./img/airflow-k8s-infra.png

Installing

  • Edit the git-sync secret with your DAGs repository and credentials, so that the Airflow and Celery sidecar containers can access the repository:
vim src/airflow/git-sync.secret.yaml
  • Then apply the project to your cluster:
make

Because we use a custom resource definition for the custom metrics API, a race condition might occur when applying the cluster configuration. If you encounter the error no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1", wait a few seconds and run make again.
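The retry advice above can be automated. The following is a minimal sketch, not part of the project: the function name apply_with_retry, the number of attempts and the delay are all assumptions chosen for illustration.

```python
import time
from typing import Callable


def apply_with_retry(
    apply: Callable[[], int],
    attempts: int = 5,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `apply` until it returns exit code 0 or the attempts run out.

    `apply` is any callable returning a process exit code (hypothetical
    helper; the project itself only documents running `make` by hand).
    """
    for attempt in range(1, attempts + 1):
        if apply() == 0:
            return True
        if attempt < attempts:
            # Give the API server some time to register the new CRDs
            # before applying the manifests again.
            sleep(delay_seconds)
    return False
```

To drive make with it, one could pass `lambda: subprocess.run(["make"]).returncode` as the `apply` argument (after `import subprocess`). The `sleep` parameter is only there so the waiting can be replaced in tests.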

What will be deployed

Full cluster topology

./img/airflow-k8s-infra-2.png

Airflow

Airflow pods running the Airflow webserver and the Airflow scheduler, with the airflow-exporter project pre-installed to expose metrics at /admin/metrics.

By default, I'm using my own image hosted on Docker Hub (jjaniec/airflow-exporter), but you can easily build your own using something like:

FROM apache/airflow:master
RUN pip install --user airflow-exporter

New DAGs are pulled using the git-sync method. The repository configuration and credentials can be set in src/airflow/git-sync.secret.yaml

Celery

Celery pods execute tasks in a distributed way, in a StatefulSet scaled by a HorizontalPodAutoscaler.

By default, the metric used by the HPA to scale the Celery pods is airflow_tasks_per_worker, defined in the prometheus adapter config as:

(number of pending tasks) / (celery workers count)

With the following PromQL query:

ceil(
    max(
        sum(airflow_task_status{<<.LabelMatchers>>,status=~"up_for_retry|up_for_reschedule|queued|running|scheduled|none"})
        by (namespace, service)
        / ignoring (namespace, service) group_left count(container_memory_usage_bytes{container_label_run="celery"})
        by (namespace, service)
        or count(up{namespace="airflow", service="airflow"})
        by (namespace, service) - 1)
    by (namespace, service)
)

(The count part is only there so that the HPA receives a value of 0, instead of an empty reply, when the airflow_task_status vector is empty. An empty reply is misinterpreted and can lead to scaling out the worker nodes even though no tasks are running.)
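As a rough illustration of what the query computes, here is a minimal sketch in Python. It assumes that the fallback branch evaluates to 0 (that is, exactly one Airflow target is up), and the second function uses the standard HPA scaling rule while ignoring its tolerance and min/max replica bounds. The function names and the target value in the usage note are hypothetical.

```python
import math
from typing import Optional


def airflow_tasks_per_worker(pending_tasks: Optional[int], workers: Optional[int]) -> int:
    """Approximate the query: ceil(pending / workers), or 0 on an empty vector.

    `None` stands for an empty PromQL vector (assumption of this sketch).
    """
    if pending_tasks is None or not workers:
        # Corresponds to the `or count(up{...}) - 1` branch, which yields 0
        # when exactly one Airflow target is up.
        return 0
    return math.ceil(pending_tasks / workers)


def desired_replicas(current_replicas: int, metric_value: float, target_value: float) -> int:
    """Standard HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * metric_value / target_value)
```

For example, with 10 pending tasks and 2 workers the metric is 5; with a hypothetical target of 2 tasks per worker, `desired_replicas(2, 5, 2)` asks for 5 workers, which brings the ratio back to the target.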

Tasks are fetched from, and results are stored in, local redis and mysql deployments. Task logs can be exported to an S3 bucket by setting the AIRFLOW__CORE__REMOTE_LOGGING, AIRFLOW__CORE__REMOTE_LOG_CONN_ID and AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER variables in the airflow/airflow-celery-env.cm.yaml file.

Grafana

A simple Grafana server is deployed. It fetches metrics from the Prometheus server and comes with a base dashboard showing CPU, memory and autoscaling metrics from the Airflow cluster.

Screenshots of the scaling process, with max nodes set to 7 in a GKE node pool of 2 vCPU / 7.5 GB RAM instances:

./img/dashboard1.png

./img/dashboard2.png

cAdvisor

Exposes node metrics, which are collected by the Prometheus server and used by the Celery HPA to calculate the total count of running Celery workers

https://github.com/coreos/prometheus-operator

Prometheus server with service discovery on the airflow and cAdvisor services

Custom metrics api to use prometheus metrics with the celery statefulset hpa

Notes

By default, the HPA takes 5 minutes to scale in Celery workers. If you want faster scale-in of the workers, see: hpa-support-for-cooldown-delay
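The 5 minute delay comes from the HPA's downscale stabilization window: when scaling in, the controller acts on the highest replica recommendation seen during the window. The following is a simplified model of that behaviour, not the controller's actual code; the class name ScaleInStabilizer and its interface are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class ScaleInStabilizer:
    """Keep recent replica recommendations and return the highest in the window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._history: Deque[Tuple[float, int]] = deque()

    def recommend(self, now: float, desired: int) -> int:
        """Record a recommendation at time `now` (seconds) and stabilize it."""
        self._history.append((now, desired))
        # Drop recommendations that are older than the window.
        while self._history and now - self._history[0][0] > self.window_seconds:
            self._history.popleft()
        # Scale-out is immediate (a higher value wins at once); scale-in only
        # happens once every higher recommendation has aged out of the window.
        return max(replicas for _, replicas in self._history)
```

In this model, a recommendation of 7 workers at time 0 keeps the worker count at 7 until it ages out of the 300 second window, even if every later recommendation is 1. Shortening the window is what makes scale-in faster.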

Documentation / Useful links
