in_gpu_metrics: Handle NVIDIA GPU metrics via NVML#11720
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; you can configure this behavior in settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Adds an NVML-backed GPU metrics backend: new NVML implementation and header, build inclusion, plugin data-structure and config additions, runtime NVML init/detect/collect/shutdown, per-card backend dispatch, shared include/exclude card filtering helpers, and an AMD card creation ordering tweak.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Plugin as Plugin (in_gpu_metrics)
    participant NVMLLib as NVML Library (libnvidia-ml.so)
    participant GPU as GPU Hardware

    rect rgba(100,150,200,0.5)
        Note over Plugin,NVMLLib: Initialization
        Plugin->>NVMLLib: dlopen(libnvidia-ml.so[.1])
        NVMLLib-->>Plugin: handle / failure
        Plugin->>NVMLLib: dlsym() resolve functions
        NVMLLib-->>Plugin: function pointers
        Plugin->>NVMLLib: nvmlInit_v2()
        NVMLLib-->>Plugin: init status
    end
    rect rgba(100,200,150,0.5)
        Note over Plugin,NVMLLib: Detection
        Plugin->>NVMLLib: nvmlDeviceGetCount_v2()
        NVMLLib-->>GPU: query devices
        GPU-->>NVMLLib: device count
        NVMLLib-->>Plugin: count
        Plugin->>Plugin: allocate gpu_card entries (GPU_BACKEND_NVML)
    end
    rect rgba(200,150,100,0.5)
        Note over Plugin,NVMLLib: Per-card Collection
        Plugin->>NVMLLib: nvmlDeviceGetHandleByIndex()/GetHandleByUUID()
        NVMLLib-->>Plugin: device handle
        Plugin->>NVMLLib: nvmlDeviceGetUtilizationRates()/nvmlDeviceGetMemoryInfo()/...()
        NVMLLib-->>Plugin: metric values
        Plugin->>Plugin: update collectors (labels: card_id, backend)
    end
    rect rgba(200,100,150,0.5)
        Note over Plugin,NVMLLib: Shutdown
        Plugin->>NVMLLib: nvmlShutdown()
        NVMLLib-->>Plugin: shutdown complete
        Plugin->>NVMLLib: dlclose()
        NVMLLib-->>Plugin: handle released
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (2)
plugins/in_gpu_metrics/gpu_metrics.c (1)
45-50: Consider logging a warning for unknown backend types. Cards with an uninitialized or unknown `backend_type` will be silently skipped. A debug log for unhandled types could help diagnose configuration issues.

Optional: Add fallback logging

```diff
 if (card->backend_type == GPU_BACKEND_AMD) {
     amd_gpu_collect_metrics(ctx, card);
 }
 else if (card->backend_type == GPU_BACKEND_NVML) {
     nvml_gpu_collect_metrics(ctx, card);
 }
+else {
+    flb_plg_debug(ctx->ins, "unknown backend_type %d for card%d",
+                  card->backend_type, card->id);
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_gpu_metrics/gpu_metrics.c` around lines 45 - 50, The backend-type switch silently ignores unknown values; update the code where card->backend_type is checked (the branch that currently calls amd_gpu_collect_metrics(ctx, card) and nvml_gpu_collect_metrics(ctx, card)) to add an else fallback that emits a warning-level log mentioning the card identifier and the unexpected backend value (e.g., include card->id or similar) so configuration/missing-init issues are visible; keep existing AMD and NVML branches and only add the warning for unhandled backend types.

plugins/in_gpu_metrics/nvml_gpu.c (1)
60-68: Static function pointers may cause issues with multiple plugin instances or threaded use. These static function pointers are global state shared across all `in_gpu_metrics` instances. Since the plugin is marked with `FLB_INPUT_THREADED`, concurrent access could be problematic if multiple instances initialize/shutdown simultaneously. Consider storing these pointers in the `struct in_gpu_metrics` context, or adding a reference count and mutex if shared state is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_gpu_metrics/nvml_gpu.c` around lines 60 - 68, The static NVML function pointer globals (f_nvml_init_v2, f_nvml_shutdown, f_nvml_device_get_count_v2, f_nvml_device_get_handle_by_index_v2, f_nvml_device_get_memory_info, f_nvml_device_get_utilization_rates, f_nvml_device_get_temperature, f_nvml_device_get_power_usage, f_nvml_error_string) create shared global state that can race across threaded plugin instances; move these pointers into the instance context (struct in_gpu_metrics) so each plugin instance has its own set, and update all code that references the static names to use the context fields (e.g., ctx->f_nvml_init_v2) during init/shutdown and usage, or alternatively implement a reference-counted/shared structure protected by a mutex and update init/shutdown paths to manage the refcount to prevent concurrent teardown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_gpu_metrics/gpu_metrics.c`:
- Around line 128-130: The call to nvml_gpu_detect_cards(ctx) ignores its return
value so partial failures go unnoticed; update the block that calls
nvml_gpu_initialize(ctx) and nvml_gpu_detect_cards(ctx) to check
nvml_gpu_detect_cards' return code, and if non-zero log a warning (including ctx
context and the returned error code/message) via the plugin's logging facility
and handle the failure path appropriately (e.g., skip enabling metrics or return
an error) so partial detection errors are surfaced; reference the existing
nvml_gpu_initialize and nvml_gpu_detect_cards calls and use ctx to locate the
surrounding code to add the check and log.
In `@plugins/in_gpu_metrics/nvml_gpu.c`:
- Around line 155-165: In nvml_gpu_detect_cards, a flb_calloc failure leaves
previously added gpu_card entries in ctx->cards causing inconsistent state;
update the error path so that when flb_calloc returns NULL you iterate the
ctx->cards list, remove and free each gpu_card (undoing cfl_list_add) before
returning -1 (use the same list/remove and flb_free helpers already used
elsewhere for gpu_card cleanup), ensuring ctx->cards is empty on failure;
reference nvml_gpu_detect_cards, struct gpu_card, ctx->cards, cfl_list_add and
flb_calloc when locating the fix.
---
Nitpick comments:
In `@plugins/in_gpu_metrics/gpu_metrics.c`:
- Around line 45-50: The backend-type switch silently ignores unknown values;
update the code where card->backend_type is checked (the branch that currently
calls amd_gpu_collect_metrics(ctx, card) and nvml_gpu_collect_metrics(ctx,
card)) to add an else fallback that emits a warning-level log mentioning the
card identifier and the unexpected backend value (e.g., include card->id or
similar) so configuration/missing-init issues are visible; keep existing AMD and
NVML branches and only add the warning for unhandled backend types.
In `@plugins/in_gpu_metrics/nvml_gpu.c`:
- Around line 60-68: The static NVML function pointer globals (f_nvml_init_v2,
f_nvml_shutdown, f_nvml_device_get_count_v2,
f_nvml_device_get_handle_by_index_v2, f_nvml_device_get_memory_info,
f_nvml_device_get_utilization_rates, f_nvml_device_get_temperature,
f_nvml_device_get_power_usage, f_nvml_error_string) create shared global state
that can race across threaded plugin instances; move these pointers into the
instance context (struct in_gpu_metrics) so each plugin instance has its own
set, and update all code that references the static names to use the context
fields (e.g., ctx->f_nvml_init_v2) during init/shutdown and usage, or
alternatively implement a reference-counted/shared structure protected by a
mutex and update init/shutdown paths to manage the refcount to prevent
concurrent teardown.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 48ace6bb-45d1-4576-8f13-4dbbb0c53f05
📒 Files selected for processing (6)
- plugins/in_gpu_metrics/CMakeLists.txt
- plugins/in_gpu_metrics/amd_gpu.c
- plugins/in_gpu_metrics/gpu_metrics.c
- plugins/in_gpu_metrics/gpu_metrics.h
- plugins/in_gpu_metrics/nvml_gpu.c
- plugins/in_gpu_metrics/nvml_gpu.h
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea5272723e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Hiroshi Hatake <[email protected]>
Signed-off-by: Hiroshi Hatake <[email protected]>
♻️ Duplicate comments (1)
plugins/in_gpu_metrics/gpu_metrics.c (1)
128-131: ⚠️ Potential issue | 🟡 Minor. Surface NVML detection failures more prominently.

A non-zero `nvml_gpu_detect_cards()` here still only produces a generic debug log, so partial NVIDIA discovery failures remain easy to miss in production. Please log at least a warning and include the return code.

Suggested change

```diff
 if (nvml_gpu_initialize(ctx) == 0) {
-    if (nvml_gpu_detect_cards(ctx) != 0) {
-        flb_plg_debug(ctx->ins, "NVML card detection encountered errors");
+    ret = nvml_gpu_detect_cards(ctx);
+    if (ret != 0) {
+        flb_plg_warn(ctx->ins,
+                     "NVML card detection encountered errors: ret=%d",
+                     ret);
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_gpu_metrics/gpu_metrics.c` around lines 128 - 131, The NVML card detection path currently only emits a debug message; change the handling in the nvml_gpu_initialize success branch so that when nvml_gpu_detect_cards(ctx) returns non-zero you log a warning (use flb_plg_warn or equivalent) on ctx->ins and include the returned error code and a short contextual message (e.g. "NVML card detection encountered errors, rc=%d") so partial discovery failures are surfaced in production.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@plugins/in_gpu_metrics/gpu_metrics.c`:
- Around line 128-131: The NVML card detection path currently only emits a debug
message; change the handling in the nvml_gpu_initialize success branch so that
when nvml_gpu_detect_cards(ctx) returns non-zero you log a warning (use
flb_plg_warn or equivalent) on ctx->ins and include the returned error code and
a short contextual message (e.g. "NVML card detection encountered errors,
rc=%d") so partial discovery failures are surfaced in production.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e2ac74dd-faa7-495d-817a-7f8c1005f482
📒 Files selected for processing (1)
plugins/in_gpu_metrics/gpu_metrics.c
505af8a to 7d7ed04 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
plugins/in_gpu_metrics/gpu_metrics.h (1)
63-64: Consider enum + namespaced constants for backend IDs. These macros work, but replacing them with an enum (and namespaced values) improves type-safety and avoids global macro collisions.

♻️ Optional refactor

```diff
-#define GPU_BACKEND_AMD 1
-#define GPU_BACKEND_NVML 2
+enum flb_gpu_backend_type {
+    FLB_GPU_BACKEND_AMD = 1,
+    FLB_GPU_BACKEND_NVML = 2
+};

 struct gpu_card {
     int id;
-    int backend_type;
+    enum flb_gpu_backend_type backend_type;
     flb_sds_t hwmon_path;
     struct cfl_list _head;
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_gpu_metrics/gpu_metrics.h` around lines 63 - 64, Replace the two preprocessor macros GPU_BACKEND_AMD and GPU_BACKEND_NVML with a scoped, typed enum (e.g. enum class Backend : int { AMD = 1, NVML = 2 }) or a namespaced constexpr int constants inside a namespace (e.g. namespace gpu::Backend { constexpr int AMD = 1; constexpr int NVML = 2; }) to avoid global macro collisions; remove the `#define` lines and update all uses of GPU_BACKEND_AMD and GPU_BACKEND_NVML to the new symbols (Backend::AMD / Backend::NVML or gpu::Backend::AMD / gpu::Backend::NVML) and adjust any function signatures or switch statements expecting ints to accept the new enum type or cast as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_gpu_metrics/gpu_common.c`:
- Around line 98-106: Replace the unsafe atoi(token) usage in the non-range
branch with strtol: call errno = 0; long val = strtol(token, &endptr, 10);
validate that endptr points to the terminating NUL (i.e., *endptr == '\0'),
errno is 0, and val fits in int bounds (INT_MIN <= val <= INT_MAX); only then
cast to int and compare with card_id (card_id == (int)val); if validation fails
treat the token as non-matching (do not match card 0) and proceed without
changing the free/return behavior. Ensure to include errno and limits checks
(errno, endptr, INT_MAX/INT_MIN) and adjust error handling accordingly in the
same function where token, card_id, start, end and flb_free(dup) are used.
---
Nitpick comments:
In `@plugins/in_gpu_metrics/gpu_metrics.h`:
- Around line 63-64: Replace the two preprocessor macros GPU_BACKEND_AMD and
GPU_BACKEND_NVML with a scoped, typed enum (e.g. enum class Backend : int { AMD
= 1, NVML = 2 }) or a namespaced constexpr int constants inside a namespace
(e.g. namespace gpu::Backend { constexpr int AMD = 1; constexpr int NVML = 2; })
to avoid global macro collisions; remove the `#define` lines and update all uses
of GPU_BACKEND_AMD and GPU_BACKEND_NVML to the new symbols (Backend::AMD /
Backend::NVML or gpu::Backend::AMD / gpu::Backend::NVML) and adjust any function
signatures or switch statements expecting ints to accept the new enum type or
cast as needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f18d3d8-7150-4062-aee8-d77426809ef4
📒 Files selected for processing (8)
- plugins/in_gpu_metrics/CMakeLists.txt
- plugins/in_gpu_metrics/amd_gpu.c
- plugins/in_gpu_metrics/gpu_common.c
- plugins/in_gpu_metrics/gpu_common.h
- plugins/in_gpu_metrics/gpu_metrics.c
- plugins/in_gpu_metrics/gpu_metrics.h
- plugins/in_gpu_metrics/nvml_gpu.c
- plugins/in_gpu_metrics/nvml_gpu.h
✅ Files skipped from review due to trivial changes (2)
- plugins/in_gpu_metrics/CMakeLists.txt
- plugins/in_gpu_metrics/nvml_gpu.h
🚧 Files skipped from review as they are similar to previous changes (3)
- plugins/in_gpu_metrics/gpu_metrics.c
- plugins/in_gpu_metrics/amd_gpu.c
- plugins/in_gpu_metrics/nvml_gpu.c
Signed-off-by: Hiroshi Hatake <[email protected]>
@cosmo0920 thanks for this! would you please describe in the PR summary the new metrics collected? also would be good if we can match the metrics from the official Nvidia prometheus exporter
Signed-off-by: Hiroshi Hatake <[email protected]>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_gpu_metrics/nvml_gpu.c`:
- Around line 327-344: nvml_get_device_handle currently falls back to
nvmlDeviceGetHandleByIndex_v2 when card->uuid is NULL, which returns the parent
GPU handle for MIG children; change it so that if card->uuid is NULL and this
card is a MIG child (detect via the field set in nvml_detect_mig_devices, e.g.,
card->parent_card_id is set/ != card->id or card->is_mig_device), the function
returns -1 instead of calling f_nvml_device_get_handle_by_index_v2; otherwise
keep the existing index-based lookup. Ensure you update only
nvml_get_device_handle and use the card->parent_card_id or card->is_mig_device
symbol to decide the early failure so MIG child devices don’t get the parent GPU
handle.
- Around line 754-764: nvml_gpu_shutdown currently dlcloses the NVML library but
leaves the file-scope function pointers (the f_nvml_* statics) pointing at the
unloaded code; after dlclose you must reset all f_nvml_* pointers to NULL to
avoid a use-after-unload when the plugin is re-initialized. Update
nvml_gpu_shutdown to, after setting ctx->nvml_lib_handle = NULL, assign NULL to
every f_nvml_* symbol (e.g., f_nvml_init, f_nvml_shutdown, f_nvmlDeviceGetCount,
etc. — cover all f_nvml_ prefixed function pointers in the file) so subsequent
nvml_gpu_initialize will not retain stale addresses.
- Around line 466-496: In nvml_collect_process_memory, avoid calling all
resolved NVML variants sequentially; instead select a single compute and a
single graphics API in priority order v3 → v2 → v1 and invoke only that one to
prevent duplicate per-PID emissions. Concretely, check
f_nvml_device_get_compute_running_processes_v3, else _v2, else base
f_nvml_device_get_compute_running_processes and call the corresponding
nvml_emit_process_memory_samples_v3/_v2/_v1 once; do the same for
f_nvml_device_get_graphics_running_processes_* and their emitters so only one
compute emitter and one graphics emitter run per invocation of
nvml_collect_process_memory.
- Around line 620-635: The code calls nvml_read_device_uuid(device, uuid,
sizeof(uuid)) twice—once to populate uuid and again inside the ternary passed to
nvml_detect_mig_devices—causing redundant NVML reads; modify the block to call
nvml_read_device_uuid only once, store its boolean result (e.g., uuid_ok) and
reuse uuid when calling nvml_detect_mig_devices (pass NULL if uuid_ok is false),
adjusting nvml_register_card logic to use uuid/NULL accordingly; also move any
new variable declarations (uuid_ok or similar) to the top of the function to
comply with the project's C style guidelines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 660a1b0c-6753-4c0d-b475-e9500359c2c8
📒 Files selected for processing (4)
- plugins/in_gpu_metrics/amd_gpu.c
- plugins/in_gpu_metrics/gpu_metrics.c
- plugins/in_gpu_metrics/gpu_metrics.h
- plugins/in_gpu_metrics/nvml_gpu.c
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/in_gpu_metrics/gpu_metrics.h
51c173e to ea159a7 (Compare)
ea159a7 to 017a3b9 (Compare)
Signed-off-by: Hiroshi Hatake <[email protected]>
Motivation
Aligns `in_gpu_metrics` NVIDIA behavior with dcgm-exporter expectations.

Description

- Uses `nvmlDeviceGetClockInfo` and emits:
  - `gpu_clock_mhz{type="sm"}`
  - `gpu_clock_mhz{type="memory"}`
  - `gpu_clock_mhz{type="graphics"}`
- Skips unavailable (`UINT64_MAX`) entries
- Tries process-listing API versions in priority order (`v3`, `v2`, `v1`) for compute/graphics paths, improving process coverage on real systems.

Conclusion
Verified against `nvidia-smi` output, including GPU users, in the validated run.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
Set the `ok-package-test` label to test for all targets (requires maintainer to do).
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
New Features
Bug Fixes