Fix OTel timer metrics using Gauge instead of Histogram #64207
Open
namratachaudhary wants to merge 1 commit into apache:main
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Timer and timing metrics in the OTel logger were recorded using a `Gauge` instrument, which does not preserve multiple observations within an export interval. In practice, if the scheduler loop runs many times between exports, only the most recent value is retained, and the distribution of durations is lost.
This PR switches `timing()` and `timer()` to use a `Histogram` instrument, which captures duration distributions via count, sum, and bucketed observations. This enables downstream analysis such as percentile estimates (p50, p95, p99).
The OTel supplementary guidelines recommend Histogram for measurements where "the statistics about this thing are likely to be meaningful", and durations are a canonical example.
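To make the difference concrete, here is a toy model of the two aggregations, written in plain Python rather than with the OpenTelemetry SDK. The class names `LastValue` and `BucketHistogram`, the bucket bounds, and the sample durations are all assumptions made up for this illustration; they are not taken from the PR.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical bucket upper bounds in milliseconds (illustration only).
BOUNDS_MS = (5.0, 10.0, 25.0, 50.0, 100.0, 250.0)


@dataclass
class LastValue:
    """Gauge-like aggregation: each observation overwrites the previous one."""

    value: Optional[float] = None

    def record(self, v: float) -> None:
        self.value = v


@dataclass
class BucketHistogram:
    """Histogram-like aggregation: keeps count, sum, and per-bucket counts."""

    count: int = 0
    total: float = 0.0
    buckets: List[int] = field(
        default_factory=lambda: [0] * (len(BOUNDS_MS) + 1)
    )

    def record(self, v: float) -> None:
        self.count += 1
        self.total += v
        # The value lands in the first bucket whose upper bound is >= v;
        # anything above every bound goes into the final overflow bucket.
        self.buckets[bisect_left(BOUNDS_MS, v)] += 1


# Five made-up loop durations observed within one export interval.
durations_ms = [4.0, 7.0, 9.0, 48.0, 220.0]

gauge, hist = LastValue(), BucketHistogram()
for d in durations_ms:
    gauge.record(d)
    hist.record(d)

print("gauge exports:", gauge.value)
print("histogram exports:", hist.count, hist.total, hist.buckets)
```

With these inputs the gauge-like aggregation exports only the final observation, while the histogram-like one still carries enough information (count, sum, bucket counts) for a backend to estimate percentiles.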
Changes
- `_OtelTimer.stop()` and `SafeOtelLogger.timing()` now call `record_histogram_value()` instead of `set_gauge_value()`
- `InternalHistogram` class wrapping `meter.create_histogram(unit="ms")`
- `MetricsMap.record_histogram_value()` method
- `test_timing_existing_metric` now verifies both observations are recorded (`record.call_count == 2`), not just the last value
Users affected
Affected metrics: all timer/timing metrics emitted via the OTel logger (e.g. `scheduler.scheduler_loop_duration`, `dagrun.dependency-check`, `task.duration`).
OTel users: the instrument type for timer/timing metrics changes from `Gauge` to `Histogram`. Dashboards or alerts treating these as Gauges may need to be updated. An example is shown below:
Before (Gauge):
After (Histogram):
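Separately from the before/after output referenced above, the code changes listed under Changes might look roughly like the sketch below. This is not the PR's actual code: the class bodies, constructor arguments, and the name-only cache are assumptions for illustration, and a `MagicMock` stands in for a real OpenTelemetry meter so the example runs without the SDK installed. The only OTel calls assumed are `create_histogram(name, unit=...)` on the meter and `record(value, attributes=...)` on the returned instrument.

```python
from unittest.mock import MagicMock


class InternalHistogram:
    """Hypothetical wrapper around a meter-created histogram instrument."""

    def __init__(self, meter, name: str):
        # Durations are recorded in milliseconds, as stated in the PR.
        self._histogram = meter.create_histogram(name, unit="ms")

    def record(self, value: float, attributes=None) -> None:
        self._histogram.record(value, attributes=attributes)


class MetricsMap:
    """Simplified stand-in: caches one instrument per metric name."""

    def __init__(self, meter):
        self.meter = meter
        self.map = {}

    def record_histogram_value(self, name: str, value: float, attributes=None):
        # Create the instrument on first use, then reuse it so every
        # observation is recorded on the same histogram.
        if name not in self.map:
            self.map[name] = InternalHistogram(self.meter, name)
        self.map[name].record(value, attributes)


# A mock meter lets us inspect what would be sent to the instrument.
meter = MagicMock()
metrics = MetricsMap(meter)

metrics.record_histogram_value("scheduler.scheduler_loop_duration", 12.5)
metrics.record_histogram_value("scheduler.scheduler_loop_duration", 48.0)

record = meter.create_histogram.return_value.record
print("histograms created:", meter.create_histogram.call_count)
print("observations recorded:", record.call_count)
```

Checking `record.call_count == 2` mirrors the assertion style described for `test_timing_existing_metric`: both observations reach the instrument, rather than the second replacing the first.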
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Sonnet 4.6 following the guidelines