Fix OTel timer metrics using Gauge instead of Histogram #64207

Open
namratachaudhary wants to merge 1 commit into apache:main from namratachaudhary:fix/otel-timer-gauge-to-histogram

namratachaudhary commented Mar 25, 2026

Timer and timing metrics in the OTel logger were recorded using a Gauge instrument, which does not preserve multiple observations within an export interval. In practice, if the scheduler loop runs many times between exports, only a single value is retained, and the distribution of durations is lost.

This PR switches timing() and timer() to use a Histogram instrument, which captures duration distributions via count, sum, and bucketed observations. This enables accurate downstream analysis such as percentiles (p50, p95, p99).

The OTel supplementary guidelines recommend Histogram for measurements where "the statistics about this thing are likely to be meaningful", and durations are a canonical example.
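The difference described above can be illustrated with a small, self-contained model. This is a toy sketch in plain Python, not the OpenTelemetry SDK: `ToyGauge`, `ToyHistogram`, and the bucket boundaries in `BOUNDS_MS` are hypothetical names and values chosen for illustration, and real boundaries depend on SDK and view configuration.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds (assumption for this sketch).
BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]


class ToyGauge:
    """Keeps only the most recent value, like a last-value aggregation."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class ToyHistogram:
    """Keeps count, sum, and per-bucket counts for every observation."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One extra slot for observations above the last bound (the +Inf bucket).
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def record(self, value):
        # bisect_left returns the index of the first bound >= value,
        # so each bucket includes its upper bound.
        self.bucket_counts[bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value


# Four loop durations observed within a single export interval.
durations_ms = [12.0, 48.0, 950.0, 20.0]

gauge = ToyGauge()
histogram = ToyHistogram(BOUNDS_MS)
for duration in durations_ms:
    gauge.set(duration)
    histogram.record(duration)

print("gauge:", gauge.value)
print("histogram:", histogram.count, histogram.total, histogram.bucket_counts)
```

At export time the gauge holds only the final duration, so the 950 ms outlier is invisible, while the histogram still reports all four observations and the bucket each one fell into.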

Changes

  • _OtelTimer.stop() and SafeOtelLogger.timing() now call record_histogram_value() instead of set_gauge_value()
  • New InternalHistogram class wrapping meter.create_histogram(unit="ms")
  • New MetricsMap.record_histogram_value() method
  • Tests updated: test_timing_existing_metric now verifies both observations are recorded (record.call_count == 2), not just the last value
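A rough sketch of how the pieces listed above might fit together is shown below. The class and method names come from this description, but the bodies are assumptions and not the code in this PR. `_StubMeter` and `_StubInstrument` are hypothetical stand-ins so the example runs without the OpenTelemetry SDK; they mimic only the `create_histogram(name, unit=...)` and `record(value, attributes=...)` calls from the OpenTelemetry metrics API.

```python
class InternalHistogram:
    """Thin wrapper around one histogram instrument (sketch only)."""

    def __init__(self, meter, name):
        # Durations are reported in milliseconds, as stated in the change list.
        self.instrument = meter.create_histogram(name=name, unit="ms")

    def record(self, value, attributes=None):
        self.instrument.record(value, attributes=attributes)


class MetricsMap:
    """Caches instruments by name so repeated timings reuse one histogram."""

    def __init__(self, meter):
        self.meter = meter
        self.histograms = {}

    def record_histogram_value(self, name, value, attributes=None):
        if name not in self.histograms:
            self.histograms[name] = InternalHistogram(self.meter, name)
        self.histograms[name].record(value, attributes)


class _StubInstrument:
    """Hypothetical stand-in that remembers every recorded value."""

    def __init__(self):
        self.calls = []

    def record(self, value, attributes=None):
        self.calls.append((value, attributes))


class _StubMeter:
    """Hypothetical stand-in for a Meter, used only to run this sketch."""

    def __init__(self):
        self.created = {}

    def create_histogram(self, name, unit="", description=""):
        self.created[name] = _StubInstrument()
        return self.created[name]


meter = _StubMeter()
metrics = MetricsMap(meter)
metrics.record_histogram_value("task.duration", 120.0, {"dag_id": "my_dag"})
metrics.record_histogram_value("task.duration", 95.5, {"dag_id": "my_dag"})

# Both observations reach the same instrument, not just the last one.
print(len(meter.created["task.duration"].calls))
```

The final check mirrors the intent of the updated test: two timings for an existing metric should produce two `record` calls on a single instrument.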

Users affected

Affected metrics: all timer/timing metrics emitted via the OTel logger (e.g. scheduler.scheduler_loop_duration, dagrun.dependency-check, task.duration, etc.)

OTel users: the instrument type for timer/timing metrics changes from Gauge to Histogram. Dashboards or alerts that treat these metrics as Gauges may need to be updated. Example queries before and after the change:

Before (Gauge):

# Last recorded duration for a specific task
airflow_task_duration{dag_id="my_dag", task_id="my_task"}

# Using ti.finish counter to approximate task completions 
rate(airflow_ti_finish{dag_id="my_dag", task_id="my_task", state="success"}[5m])

After (Histogram):

# Average task duration over last 5 minutes
rate(airflow_task_duration_sum{dag_id="my_dag", task_id="my_task"}[5m])
/ rate(airflow_task_duration_count{dag_id="my_dag", task_id="my_task"}[5m])

# p95 — "95% of runs finished within X ms"
histogram_quantile(0.95,
  rate(airflow_task_duration_bucket{dag_id="my_dag", task_id="my_task"}[5m])
)

# How many task runs completed in the last 5 minutes
increase(airflow_task_duration_count{dag_id="my_dag", task_id="my_task"}[5m])
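To make the p95 query above less opaque, here is a simplified Python model of how a quantile can be estimated from cumulative `_bucket` counts using linear interpolation inside the matching bucket, which is the approach Prometheus documents for classic histograms. The function name `estimate_quantile` and the bucket counts are hypothetical, and the sketch ignores several edge cases that a real implementation handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound_ms, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Nothing to interpolate: report the highest bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 100 task runs (upper bounds in ms).
buckets = [(100.0, 20), (250.0, 60), (500.0, 90), (1000.0, 100), (math.inf, 100)]

print("p50 (ms):", estimate_quantile(0.50, buckets))
print("p95 (ms):", estimate_quantile(0.95, buckets))
```

With these made-up counts the p95 estimate lands roughly halfway through the 500 to 1000 ms bucket, which shows why the accuracy of percentile queries depends on how well the bucket boundaries match the real duration range.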

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Sonnet 4.6 following the guidelines



boring-cyborg bot commented Mar 25, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack
