

If your organisation deals with a significant amount of data and has huge data pipelines, chances are you have already used or at least heard about Apache Airflow. Airflow is an open-source workflow management platform that enables scheduling and monitoring workflows programmatically.

At Gojek, our products generate a tremendous amount of data, but that's only step one. We constantly make use of that data and give value back to our customers, merchants, and partners - in the form of recommendations and other customisations. The 'Data Engineering' (DE) Team is responsible for building the platform and products to manage the entire lifecycle of data. Needless to say, Airflow is one of our most heavily used tools. We cater to over a thousand pipelines and an enormous amount of data using Airflow.

Monitoring all these pipelines is not easy - especially considering that Airflow is still in its early phase. Like any production application, it becomes crucial to monitor the Airflow jobs and, of course, Airflow itself. Airflow has a very resilient architecture and the design is highly scalable. It has multiple components to enable this, viz. Scheduler, Webserver, Workers, Executor, and so on. At Gojek, we have a few additional processes as well to enable flexibility for our workflows. For example, we have a separate process running to sync our DAGs with GCS/git and a separate process to sync custom Airflow variables. We know very well that the more components you have, the higher the chances of failure. Hence, this requires a thorough monitoring and alerting system.

[Image: A snapshot of the dashboard in a dummy environment]

High-level architecture

At a high level, we have multiple Airflow processes running in our different Kubernetes Pods, and each of them has a statsd client enabled using airflow.cfg. The statsd client sends all the metrics to Telegraf over UDP. Our custom processes also emit their heartbeats and other data in the same way.

[Image: Airflow Monitoring - High-Level Architecture]

We've configured InfluxDB as an output in the Telegraf configuration, which sends the data over HTTP. You can add InfluxDB as a data source in Grafana as well as in Kapacitor. The alerts can then be configured in Kapacitor using TICK scripts, which we'll cover in the next sections.

Understanding Airflow statsd metrics

Airflow's implementation and documentation of metrics are not its strongest points, and both are still in the early stages. In our first attempt, the measurements created by Airflow in InfluxDB were not how we wanted them to be. We solved that by writing some custom statsd Telegraf templates based on the metric names.
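To make this pipeline concrete, here is a minimal configuration sketch. The exact option names depend on your Airflow and Telegraf versions, and the hosts, ports, database name, and template pattern below are illustrative assumptions, not our production values.

```ini
# airflow.cfg - enable the statsd client
# (these options live under [scheduler] in Airflow 1.10.x; newer releases use [metrics])
[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```

```toml
# telegraf.conf - receive statsd metrics over UDP and write them to InfluxDB over HTTP
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"
  metric_separator = "_"
  # Graphite-style template: turn a dotted metric name such as
  # airflow.dag.<dag_id>.<task_id>.duration into one measurement
  # with dag_id and task_id as tags (illustrative pattern only)
  templates = [
    "airflow.dag.*.*.duration measurement.measurement.dag_id.task_id.measurement",
  ]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "airflow_metrics"
```

With a template like this, a Grafana panel can group task durations by the dag_id and task_id tags instead of ending up with one measurement per task, which is the kind of reshaping the custom templates were written for.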

What should you monitor?

[Image: Sample Grafana query to fetch data from Influx]

The following list contains some of the important areas that you should monitor, which can also be helpful for debugging and for finding resource bottlenecks (a sample query is sketched after the list):

- Health checks: Are the scheduler, webserver, workers, and other custom processes running? What is their uptime?
- Are our custom metrics and configurations being reflected in the metrics?
- Number of active DAGs, and DAG parsing time.
- Trend: Jobs execution status (started/ended).
- Trend: Executor tasks status (running/queued/open slots).
- Trend: Operator-wise execution status (failure/success).
- Trend: Task Instances status (successes/failures).
- Trend: Time taken by crucial tasks and sensors.
- Trend: Time taken by the DAGs before reaching an end state.
- Trend: Time spent by DAGs on completing dependency checks.

It's important to track these metrics at an overall level, as well as at the individual task and DAG level. You should also consider tracking the specific operators and tasks that you think have higher chances of failure and/or consume more resources.
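As an illustration of the kind of Grafana panel query behind these trends, here is a minimal InfluxQL sketch for a "Task Instances status" panel. The measurement and field names (airflow_ti_failures, value) depend entirely on your statsd prefix and Telegraf templates, so treat them as assumptions; $timeFilter and $__interval are standard Grafana macros for the InfluxDB data source.

```sql
SELECT sum("value")
FROM "airflow_ti_failures"
WHERE $timeFilter
GROUP BY time($__interval) fill(0)
```

A similar pattern - a mean or percentile over a timing measurement such as task duration - covers the other trend panels, while the health checks can be driven off heartbeat counters.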

Alerting

[Image: A snapshot of the dashboard in a dummy environment]

Now that we have data in InfluxDB and the monitoring is in place, we can use Kapacitor and write TICK scripts to trigger alerts based on our checks and thresholds. The following snippets show some sample alerts that can be set.
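For instance, alerts of this kind might look roughly like the TICKscript sketches below. The measurement names, thresholds, and the Slack handler are assumptions based on the setup described above, not our exact production scripts.

```
// Deadman alert: fire if no scheduler heartbeat points arrive for 5 minutes.
// 'airflow_scheduler_heartbeat' is an assumed measurement name.
stream
    |from()
        .measurement('airflow_scheduler_heartbeat')
    |deadman(0.0, 5m)
        .id('airflow-scheduler-heartbeat')
        .message('No Airflow scheduler heartbeat received in the last 5 minutes')
        .slack()
```

```
// Threshold alert: fire when any task instance failures are recorded in a 5-minute window.
stream
    |from()
        .measurement('airflow_ti_failures')
    |window()
        .period(5m)
        .every(1m)
    |sum('value')
    |alert()
        .crit(lambda: "sum" > 0)
        .message('Airflow task failures detected in the last 5 minutes')
        .slack()
```

Each script can then be registered and switched on with the kapacitor define and kapacitor enable commands, pointed at the database and retention policy that Telegraf writes to.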
