Add new HookMetricEmit that can emit arbitrary metrics#1285
Conversation
HookMetric that can emit arbitrary metricsHookMetricEmit that can emit arbitrary metrics
Here, build on the proposal in [1] to add a metric hook to River, and start emitting metrics for the time to lock jobs and the number of jobs locked. Especially the first metric is generally very useful for looking for queue degradation due to dead tuples. [1] riverqueue/river#1285
c3c56ed to
e54c7ad
Compare
945be6d to
e4a4b69
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4a4b69bd8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return | ||
| } | ||
|
|
||
| if len(p.metricEmitHooks) > 0 { |
There was a problem hiding this comment.
Skip metrics when no JobGetAvailable query runs
When a queue is paused, innerFetchLoop still calls dispatchWork with count == 0, and StandardPilot.JobGetAvailable returns immediately for MaxToLock <= 0 without issuing the DB query. This block still emits both metrics, so paused queues produce near-zero job_get_available_duration samples and zero-count fetches that never actually locked anything, skewing the observability signal this hook is meant to expose; gate emission on count > 0 or on an actual fetch being attempted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Modified dispatchWork to return early in case of a count <= 0 (i.e. paused queue).
I was thinking about observability in River the other day, and how if one were trying to track River performance degradation due to dead tuples, how you might do so. Surprisingly, I don't think it's really possible under our current framework -- the best proxy for it is time to lock jobs, a metric which we don't provide any way of extracting out of River. Here, add a new hook called `HookMetricEmit`. The idea behind it is that it may receive an arbitrary metric that River starts to emit, and handle it by passing it onto a metrics system of choice, like OTEL, DataDog, or whatever. Within `HookMetricEmit`, add two specific metrics to get us started: * `JobGetAvailableDurationMetric`: Time to lock available jobs. This will start to show significant degradation in case of dead tuples. * `JobGetAvailableCountMetric`: Number of jobs locked while getting available. This is somewhat useful in seeing how well batch locking is working out in a program (i.e. are we locking one job at a time or 100?).
e4a4b69 to
1717713
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1717713632
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| var startedAt time.Time | ||
| if len(p.metricEmitHooks) > 0 { | ||
| startedAt = time.Now() |
There was a problem hiding this comment.
Use the service clock for fetch duration metrics
When Config.Test.Time supplies a stubbed time generator, this captures real wall-clock time instead of River's configured service clock, and the later time.Since(startedAt) keeps the new metric nondeterministic under time stubbing. Metric hooks used in tests will see durations that don't advance with p.Time, unlike the rest of the producer's timing/database timestamps; capture and subtract with p.Time.Now() on both sides instead.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Yeah, I think it's okay to use the stub-free version for metric durations. I did it on purpose since the code is made a little clearer and I had no plans to sub the value.
I was thinking about observability in River the other day, and how if
one were trying to track River performance degradation due to dead
tuples, how you might do so. Surprisingly, I don't think it's really
possible under our current framework -- the best proxy for it is time to
lock jobs, a metric which we don't provide any way of extracting out of
River.
Here, add a new hook called
HookMetricEmit. The idea behind it is that itmay receive an arbitrary metric that River starts to emit, and handle it
by passing it onto a metrics system of choice, like OTEL, DataDog, or
whatever.
Within
HookMetricEmit, add two specific metrics to get us started:JobGetAvailableDurationMetric: Time to lock available jobs. Thiswill start to show significant degradation in case of dead tuples.
JobGetAvailableCountMetric: Number of jobs locked while gettingavailable. This is somewhat useful in seeing how well batch locking is
working out in a program (i.e. are we locking one job at a time or
100?).