in_gpu_metrics: Handle NVIDIA GPU metrics via NVML#11720

Open
cosmo0920 wants to merge 5 commits into master from cosmo0920-implement-nvidia-gpu-metrics

Conversation

cosmo0920 (Contributor) commented Apr 16, 2026

Motivation

  • Align in_gpu_metrics NVIDIA behavior with dcgm-exporter expectations.
  • Add NVML clock collection and stabilize per-process GPU memory collection.
  • Avoid malformed process entries while still capturing all valid running GPU processes.

Description

  • Added optional NVML clock support via nvmlDeviceGetClockInfo, emitting (see the sketches after this list):
    • gpu_clock_mhz{type="sm"}
    • gpu_clock_mhz{type="memory"}
    • gpu_clock_mhz{type="graphics"}
  • Added process entry validation to drop invalid or sentinel samples (also sketched below):
    • invalid PID values
    • sentinel memory (UINT64_MAX) entries
  • Updated process traversal logic to query all available NVML process API variants (v3, v2, v1) for compute/graphics paths, improving process coverage on real systems.
  • Kept previous process/MIG metric implementation intact (no rollback).
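
For reference, a minimal sketch of the clock collection described above, written against the public NVML header (nvml.h) and linked with -lnvidia-ml rather than the plugin's dlopen-based binding; this is illustrative only, not the plugin's actual code:

#include <stddef.h>
#include <stdio.h>
#include <nvml.h>

/* Collect the three clock domains emitted as gpu_clock_mhz{type=...};
   clock support is optional, so samples are skipped on error rather
   than emitted with a bogus value. */
static void collect_clocks(nvmlDevice_t dev)
{
    static const struct {
        nvmlClockType_t type;
        const char *label;
    } clocks[] = {
        { NVML_CLOCK_SM,       "sm"       },
        { NVML_CLOCK_MEM,      "memory"   },
        { NVML_CLOCK_GRAPHICS, "graphics" },
    };
    unsigned int mhz;
    size_t i;

    for (i = 0; i < sizeof(clocks) / sizeof(clocks[0]); i++) {
        if (nvmlDeviceGetClockInfo(dev, clocks[i].type, &mhz) == NVML_SUCCESS) {
            printf("gpu_clock_mhz{type=\"%s\"} = %u\n", clocks[i].label, mhz);
        }
    }
}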
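
And a sketch of the process-entry validation: NVML reports usedGpuMemory as a UINT64_MAX sentinel when per-process memory is unavailable, and such entries are dropped instead of emitted. nvmlProcessInfo_t comes from nvml.h:

#include <stdint.h>
#include <nvml.h>

/* Return 1 only for well-formed process samples: a real PID and a
   usable memory value (not the "unknown" sentinel). */
static int process_entry_is_valid(const nvmlProcessInfo_t *p)
{
    if (p->pid == 0) {
        return 0;                       /* invalid PID */
    }
    if (p->usedGpuMemory == UINT64_MAX) {
        return 0;                       /* sentinel: memory unknown */
    }
    return 1;
}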

Conclusion

  • Clock metrics and process metrics are collected as expected.
  • Process coverage now matches the GPU processes observed via nvidia-smi in the validated run.

Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
$ fluent-bit -i gpu_metrics -o stdout
  • Debug log output from testing the change
Fluent Bit v5.0.3
* Copyright (C) 2015-2026 The Fluent Bit Authors
* Fluent Bit is a CNCF graduated project under the Fluent organization
* https://fluentbit.io

______ _                  _    ______ _ _           _____  _____ 
|  ___| |                | |   | ___ (_) |         |  ___||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   _|___ \ | |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \|  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V //\__/ /\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)\___/


[2026/04/17 15:12:45.594] [ info] Configuration:
[2026/04/17 15:12:45.594] [ info]  flush time     | 1.000000 seconds
[2026/04/17 15:12:45.594] [ info]  grace          | 5 seconds
[2026/04/17 15:12:45.594] [ info]  daemon         | 0
[2026/04/17 15:12:45.594] [ info] ___________
[2026/04/17 15:12:45.594] [ info]  inputs:
[2026/04/17 15:12:45.594] [ info]      gpu_metrics
[2026/04/17 15:12:45.594] [ info] ___________
[2026/04/17 15:12:45.594] [ info]  filters:
[2026/04/17 15:12:45.594] [ info] ___________
[2026/04/17 15:12:45.594] [ info]  outputs:
[2026/04/17 15:12:45.594] [ info]      stdout.0
[2026/04/17 15:12:45.594] [ info] ___________
[2026/04/17 15:12:45.594] [ info]  collectors:
[2026/04/17 15:12:45.595] [ info] [fluent bit] version=5.0.3, commit=63ed88ee65, pid=41257
[2026/04/17 15:12:45.595] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2026/04/17 15:12:45.595] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2026/04/17 15:12:45.595] [ info] [simd    ] SSE2
[2026/04/17 15:12:45.595] [ info] [cmetrics] version=2.1.2
[2026/04/17 15:12:45.595] [ info] [ctraces ] version=0.7.1
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] initializing
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] storage_strategy='memory' (memory only)
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] Starting AMD GPU detection in /sys
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] Found card1
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] Card1 vendor: 0x10de
[2026/04/17 15:12:45.595] [ info] [input:gpu_metrics:gpu_metrics.0] Card1 is not AMD (vendor: 0x10de)
[2026/04/17 15:12:45.604] [ info] [input:gpu_metrics:gpu_metrics.0] NVML backend enabled
[2026/04/17 15:12:45.615] [ info] [input:gpu_metrics:gpu_metrics.0] detected 1 NVIDIA GPU card(s) via NVML
[2026/04/17 15:12:45.615] [debug] [input:gpu_metrics:gpu_metrics.0] [thread init] initialization OK
[2026/04/17 15:12:45.615] [ info] [input:gpu_metrics:gpu_metrics.0] thread instance initialized
[2026/04/17 15:12:45.615] [debug] [gpu_metrics:gpu_metrics.0] created event channels: read=44 write=45
[2026/04/17 15:12:45.615] [debug] [stdout:stdout.0] created event channels: read=48 write=49
[2026/04/17 15:12:45.615] [ info] [sp] stream processor started
[2026/04/17 15:12:45.615] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2026/04/17 15:12:45.625] [ info] [output:stdout:stdout.0] worker #0 started
[2026/04/17 15:12:51.319] [debug] [task] created task=0x78cbc3e2e5a0 id=0 OK
[2026/04/17 15:12:51.319] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
2026-04-17T06:12:50.319605471Z gpu_utilization_percent{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 13
2026-04-17T06:12:50.319605471Z gpu_memory_used_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 1094844416
2026-04-17T06:12:50.319605471Z gpu_memory_total_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 17171480576
2026-04-17T06:12:50.319605471Z gpu_clock_mhz{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",type="sm"} = 210
2026-04-17T06:12:50.319605471Z gpu_clock_mhz{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",type="memory"} = 405
2026-04-17T06:12:50.319605471Z gpu_clock_mhz{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",type="graphics"} = 210
2026-04-17T06:12:50.319605471Z gpu_power_watts{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 16.042000000000002
2026-04-17T06:12:50.319605471Z gpu_temperature_celsius{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 54
2026-04-17T06:12:50.319605471Z gpu_fan_pwm_percent{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia"} = 0
2026-04-17T06:12:50.319605471Z gpu_process_memory_used_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",pid="10583"} = 257449984
2026-04-17T06:12:50.319605471Z gpu_process_memory_used_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",pid="10832"} = 80486400
2026-04-17T06:12:50.319605471Z gpu_process_memory_used_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",pid="11672"} = 172830720
2026-04-17T06:12:50.319605471Z gpu_process_memory_used_bytes{card="GPU-7d0adcaf-2e55-9176-a1ac-6b1b820ffe22",vendor="nvidia",pid="32646"} = 23273472
[2026/04/17 15:12:51.320] [debug] [output:stdout:stdout.0] cmt decode msgpack returned : 1
[2026/04/17 15:12:51.320] [debug] [out flush] cb_destroy coro_id=0
[2026/04/17 15:12:51.320] [debug] [task] destroy task=0x78cbc3e2e5a0 (task_id=0)
^C[2026/04/17 15:12:52] [engine] caught signal (SIGINT)
[2026/04/17 15:12:52.285] [ warn] [engine] service will shutdown in max 5 seconds
[2026/04/17 15:12:52.285] [ info] [engine] pausing all inputs..
[2026/04/17 15:12:52.285] [debug] [input:gpu_metrics:gpu_metrics.0] thread pause instance
[2026/04/17 15:12:52.319] [ info] [engine] service has stopped (0 pending tasks)
[2026/04/17 15:12:52.319] [debug] [input:gpu_metrics:gpu_metrics.0] thread pause instance
[2026/04/17 15:12:52.319] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2026/04/17 15:12:52.320] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2026/04/17 15:12:52.331] [debug] [input:gpu_metrics:gpu_metrics.0] thread exit instance
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries, then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Optional NVML (NVIDIA) backend with runtime detection, enable/disable config, and graceful shutdown.
    • Plugin now emits per-process GPU memory and MIG device metadata metrics.
    • Improved card include/exclude pattern handling for flexible per-card selection.
  • Bug Fixes

    • Per-card backend selection corrected so each GPU uses the appropriate vendor backend during detection and collection.

coderabbitai bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an NVML-backed GPU metrics backend: new NVML implementation and header, build inclusion, plugin data-structure and config additions, runtime NVML init/detect/collect/shutdown, per-card backend dispatch, shared include/exclude card filtering helpers, and an AMD card creation ordering tweak.

Changes

  • Build Configuration (plugins/in_gpu_metrics/CMakeLists.txt): Added nvml_gpu.c to the plugin build source list.
  • Public types & config (plugins/in_gpu_metrics/gpu_metrics.h): Added backend_type, uuid, parent_uuid, gpu_instance_id, and compute_instance_id to struct gpu_card; added enable_nvml, nvml_initialized, nvml_lib_handle, and new gauges to struct in_gpu_metrics; added the GPU_BACKEND_AMD/GPU_BACKEND_NVML macros.
  • Plugin core logic (plugins/in_gpu_metrics/gpu_metrics.c): Initialize NVML-related fields and config (enable_nvml), attempt NVML init/detect at startup with conditional logging, dispatch per-card collection by backend_type, create new gauges, free per-card UUIDs on shutdown, and call NVML shutdown.
  • NVML backend header (plugins/in_gpu_metrics/nvml_gpu.h): New header declaring the NVML lifecycle and collection functions.
  • NVML backend implementation (plugins/in_gpu_metrics/nvml_gpu.c): New NVML implementation: runtime dlopen/dlsym of libnvidia-ml, symbol binding, nvmlInit/Shutdown, device detection (including MIG when available), per-card metric collection (utilization, memory, clocks, optional temp/power/fan, per-process memory), and library unload (the binding pattern is sketched after this list).
  • AMD backend tweak (plugins/in_gpu_metrics/amd_gpu.c): Switched card filtering to the shared gpu_should_include_card(), set card->backend_type = GPU_BACKEND_AMD, and initialize UUID/instance fields before resolving the hwmon path.
  • Common helpers (plugins/in_gpu_metrics/gpu_common.c, plugins/in_gpu_metrics/gpu_common.h): Added match_card_pattern() and exported gpu_should_include_card() with include/exclude pattern logic and logging; added a forward declaration for struct in_gpu_metrics.
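
As a concrete illustration of the runtime binding described for nvml_gpu.c above, here is a hypothetical, self-contained sketch of the dlopen/dlsym pattern; nvmlInit_v2 and nvmlShutdown are real NVML exports, but the flow and names are illustrative, not the plugin's exact code:

#include <dlfcn.h>
#include <stdio.h>

typedef int (*nvml_init_fn)(void);
typedef int (*nvml_shutdown_fn)(void);

int main(void)
{
    nvml_init_fn nvml_init;
    nvml_shutdown_fn nvml_shutdown;

    /* Try the unversioned name first, then the versioned SONAME. */
    void *handle = dlopen("libnvidia-ml.so", RTLD_LAZY);
    if (!handle) {
        handle = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    }
    if (!handle) {
        fprintf(stderr, "NVML not available: %s\n", dlerror());
        return 1;   /* backend disabled, not a hard failure */
    }

    nvml_init = (nvml_init_fn) dlsym(handle, "nvmlInit_v2");
    nvml_shutdown = (nvml_shutdown_fn) dlsym(handle, "nvmlShutdown");
    if (!nvml_init || !nvml_shutdown) {
        fprintf(stderr, "missing NVML symbols\n");
        dlclose(handle);
        return 1;
    }

    if (nvml_init() == 0) {             /* NVML_SUCCESS == 0 */
        /* ... device detection and collection would happen here ... */
        nvml_shutdown();
    }
    dlclose(handle);
    return 0;
}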

Sequence Diagram(s)

sequenceDiagram
    participant Plugin as Plugin (in_gpu_metrics)
    participant NVMLLib as NVML Library (libnvidia-ml.so)
    participant GPU as GPU Hardware

    rect rgba(100,150,200,0.5)
    Note over Plugin,NVMLLib: Initialization
    Plugin->>NVMLLib: dlopen(libnvidia-ml.so[.1])
    NVMLLib-->>Plugin: handle / failure
    Plugin->>NVMLLib: dlsym() resolve functions
    NVMLLib-->>Plugin: function pointers
    Plugin->>NVMLLib: nvmlInit_v2()
    NVMLLib-->>Plugin: init status
    end

    rect rgba(100,200,150,0.5)
    Note over Plugin,NVMLLib: Detection
    Plugin->>NVMLLib: nvmlDeviceGetCount_v2()
    NVMLLib-->>GPU: query devices
    GPU-->>NVMLLib: device count
    NVMLLib-->>Plugin: count
    Plugin->>Plugin: allocate gpu_card entries (GPU_BACKEND_NVML)
    end

    rect rgba(200,150,100,0.5)
    Note over Plugin,NVMLLib: Per-card Collection
    Plugin->>NVMLLib: nvmlDeviceGetHandleByIndex()/GetHandleByUUID()
    NVMLLib-->>Plugin: device handle
    Plugin->>NVMLLib: nvmlDeviceGetUtilizationRates()/nvmlDeviceGetMemoryInfo()/...()
    NVMLLib-->>Plugin: metric values
    Plugin->>Plugin: update collectors (labels: card_id, backend)
    end

    rect rgba(200,100,150,0.5)
    Note over Plugin,NVMLLib: Shutdown
    Plugin->>NVMLLib: nvmlShutdown()
    NVMLLib-->>Plugin: shutdown complete
    Plugin->>NVMLLib: dlclose()
    NVMLLib-->>Plugin: handle released
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • edsiper
  • koleini

Poem

🐰 I dug through libs and found a gleam,
Dl-opened NVML like a dream,
Cards counted, metrics hum,
From AMD to NVIDIA they come,
A rabbit hops—collectors beam!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and concisely summarizes the main change: adding NVIDIA GPU metrics support via NVML to the gpu_metrics plugin.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
plugins/in_gpu_metrics/gpu_metrics.c (1)

45-50: Consider logging a warning for unknown backend types.

Cards with an uninitialized or unknown backend_type will be silently skipped. A debug log for unhandled types could help diagnose configuration issues.

Optional: Add fallback logging
         if (card->backend_type == GPU_BACKEND_AMD) {
             amd_gpu_collect_metrics(ctx, card);
         }
         else if (card->backend_type == GPU_BACKEND_NVML) {
             nvml_gpu_collect_metrics(ctx, card);
         }
+        else {
+            flb_plg_debug(ctx->ins, "unknown backend_type %d for card%d",
+                          card->backend_type, card->id);
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_gpu_metrics/gpu_metrics.c` around lines 45 - 50, The backend-type
switch silently ignores unknown values; update the code where card->backend_type
is checked (the branch that currently calls amd_gpu_collect_metrics(ctx, card)
and nvml_gpu_collect_metrics(ctx, card)) to add an else fallback that emits a
warning-level log mentioning the card identifier and the unexpected backend
value (e.g., include card->id or similar) so configuration/missing-init issues
are visible; keep existing AMD and NVML branches and only add the warning for
unhandled backend types.
plugins/in_gpu_metrics/nvml_gpu.c (1)

60-68: Static function pointers may cause issues with multiple plugin instances or threaded use.

These static function pointers are global state shared across all in_gpu_metrics instances. Since the plugin is marked with FLB_INPUT_THREADED, concurrent access could be problematic if multiple instances initialize/shutdown simultaneously.

Consider storing these pointers in the struct in_gpu_metrics context, or adding a reference count and mutex if shared state is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_gpu_metrics/nvml_gpu.c` around lines 60 - 68, The static NVML
function pointer globals (f_nvml_init_v2, f_nvml_shutdown,
f_nvml_device_get_count_v2, f_nvml_device_get_handle_by_index_v2,
f_nvml_device_get_memory_info, f_nvml_device_get_utilization_rates,
f_nvml_device_get_temperature, f_nvml_device_get_power_usage,
f_nvml_error_string) create shared global state that can race across threaded
plugin instances; move these pointers into the instance context (struct
in_gpu_metrics) so each plugin instance has its own set, and update all code
that references the static names to use the context fields (e.g.,
ctx->f_nvml_init_v2) during init/shutdown and usage, or alternatively implement
a reference-counted/shared structure protected by a mutex and update
init/shutdown paths to manage the refcount to prevent concurrent teardown.
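
A minimal sketch of the per-instance binding suggested above, with the struct trimmed to illustrative fields (not the plugin's actual layout):

#include <dlfcn.h>
#include <stddef.h>

typedef int (*nvml_init_fn)(void);

/* Trimmed, assumed layout: each instance owns its library handle and
   resolved entry points instead of sharing file-scope statics. */
struct in_gpu_metrics {
    void *nvml_lib_handle;
    nvml_init_fn f_nvml_init_v2;        /* one slot per resolved symbol */
    /* ... remaining f_nvml_* pointers ... */
};

static int nvml_bind(struct in_gpu_metrics *ctx)
{
    ctx->nvml_lib_handle = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    if (ctx->nvml_lib_handle == NULL) {
        return -1;
    }
    ctx->f_nvml_init_v2 =
        (nvml_init_fn) dlsym(ctx->nvml_lib_handle, "nvmlInit_v2");
    if (ctx->f_nvml_init_v2 == NULL) {
        dlclose(ctx->nvml_lib_handle);
        ctx->nvml_lib_handle = NULL;
        return -1;
    }
    return 0;
}
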
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/in_gpu_metrics/gpu_metrics.c`:
- Around line 128-130: The call to nvml_gpu_detect_cards(ctx) ignores its return
value so partial failures go unnoticed; update the block that calls
nvml_gpu_initialize(ctx) and nvml_gpu_detect_cards(ctx) to check
nvml_gpu_detect_cards' return code, and if non-zero log a warning (including ctx
context and the returned error code/message) via the plugin's logging facility
and handle the failure path appropriately (e.g., skip enabling metrics or return
an error) so partial detection errors are surfaced; reference the existing
nvml_gpu_initialize and nvml_gpu_detect_cards calls and use ctx to locate the
surrounding code to add the check and log.

In `@plugins/in_gpu_metrics/nvml_gpu.c`:
- Around line 155-165: In nvml_gpu_detect_cards, a flb_calloc failure leaves
previously added gpu_card entries in ctx->cards causing inconsistent state;
update the error path so that when flb_calloc returns NULL you iterate the
ctx->cards list, remove and free each gpu_card (undoing cfl_list_add) before
returning -1 (use the same list/remove and flb_free helpers already used
elsewhere for gpu_card cleanup), ensuring ctx->cards is empty on failure;
reference nvml_gpu_detect_cards, struct gpu_card, ctx->cards, cfl_list_add and
flb_calloc when locating the fix.
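
A sketch of the unwind path described in the previous comment, assuming the cfl_list helpers and flb_free used elsewhere in the plugin; per-card string cleanup is elided:

/* On flb_calloc() failure mid-detection, unlink and free every card
   already added so ctx->cards is empty on the error path. Real code
   must also free per-card strings (uuid, hwmon_path, ...). */
static void nvml_cards_cleanup(struct in_gpu_metrics *ctx)
{
    struct cfl_list *head;
    struct cfl_list *tmp;
    struct gpu_card *card;

    cfl_list_foreach_safe(head, tmp, &ctx->cards) {
        card = cfl_list_entry(head, struct gpu_card, _head);
        cfl_list_del(&card->_head);
        flb_free(card);
    }
}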

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48ace6bb-45d1-4576-8f13-4dbbb0c53f05

📥 Commits

Reviewing files that changed from the base of the PR and between 63ed88e and ea52727.

📒 Files selected for processing (6)
  • plugins/in_gpu_metrics/CMakeLists.txt
  • plugins/in_gpu_metrics/amd_gpu.c
  • plugins/in_gpu_metrics/gpu_metrics.c
  • plugins/in_gpu_metrics/gpu_metrics.h
  • plugins/in_gpu_metrics/nvml_gpu.c
  • plugins/in_gpu_metrics/nvml_gpu.h

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea5272723e

coderabbitai bot left a comment

♻️ Duplicate comments (1)
plugins/in_gpu_metrics/gpu_metrics.c (1)

128-131: ⚠️ Potential issue | 🟡 Minor

Surface NVML detection failures more prominently.

A non-zero nvml_gpu_detect_cards() here still only produces a generic debug log, so partial NVIDIA discovery failures remain easy to miss in production. Please log at least a warning and include the return code.

Suggested change
     if (nvml_gpu_initialize(ctx) == 0) {
-        if (nvml_gpu_detect_cards(ctx) != 0) {
-            flb_plg_debug(ctx->ins, "NVML card detection encountered errors");
+        ret = nvml_gpu_detect_cards(ctx);
+        if (ret != 0) {
+            flb_plg_warn(ctx->ins,
+                         "NVML card detection encountered errors: ret=%d",
+                         ret);
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_gpu_metrics/gpu_metrics.c` around lines 128 - 131, The NVML card
detection path currently only emits a debug message; change the handling in the
nvml_gpu_initialize success branch so that when nvml_gpu_detect_cards(ctx)
returns non-zero you log a warning (use flb_plg_warn or equivalent) on ctx->ins
and include the returned error code and a short contextual message (e.g. "NVML
card detection encountered errors, rc=%d") so partial discovery failures are
surfaced in production.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e2ac74dd-faa7-495d-817a-7f8c1005f482

📥 Commits

Reviewing files that changed from the base of the PR and between ea52727 and 505af8a.

📒 Files selected for processing (1)
  • plugins/in_gpu_metrics/gpu_metrics.c

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
plugins/in_gpu_metrics/gpu_metrics.h (1)

63-64: Consider enum + namespaced constants for backend IDs.

These macros work, but replacing them with an enum (and namespaced values) improves type-safety and avoids global macro collisions.

♻️ Optional refactor
-#define GPU_BACKEND_AMD 1
-#define GPU_BACKEND_NVML 2
+enum flb_gpu_backend_type {
+    FLB_GPU_BACKEND_AMD = 1,
+    FLB_GPU_BACKEND_NVML = 2
+};
 struct gpu_card {
     int id;
-    int backend_type;
+    enum flb_gpu_backend_type backend_type;
     flb_sds_t hwmon_path;
     struct cfl_list _head;
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_gpu_metrics/gpu_metrics.h` around lines 63 - 64, Replace the two
preprocessor macros GPU_BACKEND_AMD and GPU_BACKEND_NVML with a C enum (e.g.
enum flb_gpu_backend_type { FLB_GPU_BACKEND_AMD = 1, FLB_GPU_BACKEND_NVML = 2 })
to improve type-safety and avoid global macro collisions; remove the `#define`
lines, update all uses of GPU_BACKEND_AMD and GPU_BACKEND_NVML to the new
enumerators, and adjust any function signatures or switch statements expecting
ints to accept the enum type or cast as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/in_gpu_metrics/gpu_common.c`:
- Around line 98-106: Replace the unsafe atoi(token) usage in the non-range
branch with strtol: call errno = 0; long val = strtol(token, &endptr, 10);
validate that endptr points to the terminating NUL (i.e., *endptr == '\0'),
errno is 0, and val fits in int bounds (INT_MIN <= val <= INT_MAX); only then
cast to int and compare with card_id (card_id == (int)val); if validation fails
treat the token as non-matching (do not match card 0) and proceed without
changing the free/return behavior. Ensure to include errno and limits checks
(errno, endptr, INT_MAX/INT_MIN) and adjust error handling accordingly in the
same function where token, card_id, start, end and flb_free(dup) are used.
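
A minimal sketch of the validated parse described above, reusing the token and card_id names from the comment; the surrounding list splitting and flb_free(dup) bookkeeping are omitted:

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Replace bare atoi(token), which maps malformed tokens to 0 and would
   silently match card 0; malformed tokens are non-matching instead. */
static int token_matches_card(const char *token, int card_id)
{
    char *endptr;
    long val;

    errno = 0;
    val = strtol(token, &endptr, 10);
    if (endptr == token || *endptr != '\0') {
        return 0;   /* empty token or trailing garbage */
    }
    if (errno != 0 || val < INT_MIN || val > INT_MAX) {
        return 0;   /* overflow or outside int range */
    }
    return card_id == (int) val;
}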

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f18d3d8-7150-4062-aee8-d77426809ef4

📥 Commits

Reviewing files that changed from the base of the PR and between 505af8a and 7d7ed04.

📒 Files selected for processing (8)
  • plugins/in_gpu_metrics/CMakeLists.txt
  • plugins/in_gpu_metrics/amd_gpu.c
  • plugins/in_gpu_metrics/gpu_common.c
  • plugins/in_gpu_metrics/gpu_common.h
  • plugins/in_gpu_metrics/gpu_metrics.c
  • plugins/in_gpu_metrics/gpu_metrics.h
  • plugins/in_gpu_metrics/nvml_gpu.c
  • plugins/in_gpu_metrics/nvml_gpu.h
✅ Files skipped from review due to trivial changes (2)
  • plugins/in_gpu_metrics/CMakeLists.txt
  • plugins/in_gpu_metrics/nvml_gpu.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • plugins/in_gpu_metrics/gpu_metrics.c
  • plugins/in_gpu_metrics/amd_gpu.c
  • plugins/in_gpu_metrics/nvml_gpu.c

edsiper (Member) commented Apr 16, 2026

@cosmo0920 thanks for this! Would you please describe the new metrics collected in the PR summary? It would also be good if we could match the metrics from the official NVIDIA Prometheus exporter.

cosmo0920 added this to the Fluent Bit v5.1 milestone Apr 17, 2026

coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/in_gpu_metrics/nvml_gpu.c`:
- Around line 327-344: nvml_get_device_handle currently falls back to
nvmlDeviceGetHandleByIndex_v2 when card->uuid is NULL, which returns the parent
GPU handle for MIG children; change it so that if card->uuid is NULL and this
card is a MIG child (detect via the field set in nvml_detect_mig_devices, e.g.,
card->parent_card_id is set/ != card->id or card->is_mig_device), the function
returns -1 instead of calling f_nvml_device_get_handle_by_index_v2; otherwise
keep the existing index-based lookup. Ensure you update only
nvml_get_device_handle and use the card->parent_card_id or card->is_mig_device
symbol to decide the early failure so MIG child devices don’t get the parent GPU
handle.
- Around line 754-764: nvml_gpu_shutdown currently dlcloses the NVML library but
leaves the file-scope function pointers (the f_nvml_* statics) pointing at the
unloaded code; after dlclose you must reset all f_nvml_* pointers to NULL to
avoid a use-after-unload when the plugin is re-initialized. Update
nvml_gpu_shutdown to, after setting ctx->nvml_lib_handle = NULL, assign NULL to
every f_nvml_* symbol (e.g., f_nvml_init, f_nvml_shutdown, f_nvmlDeviceGetCount,
etc. — cover all f_nvml_ prefixed function pointers in the file) so subsequent
nvml_gpu_initialize will not retain stale addresses.
- Around line 466-496: In nvml_collect_process_memory, avoid calling all
resolved NVML variants sequentially; instead select a single compute and a
single graphics API in priority order v3 → v2 → v1 and invoke only that one to
prevent duplicate per-PID emissions. Concretely, check
f_nvml_device_get_compute_running_processes_v3, else _v2, else base
f_nvml_device_get_compute_running_processes and call the corresponding
nvml_emit_process_memory_samples_v3/_v2/_v1 once; do the same for
f_nvml_device_get_graphics_running_processes_* and their emitters so only one
compute emitter and one graphics emitter run per invocation of
nvml_collect_process_memory.
- Around line 620-635: The code calls nvml_read_device_uuid(device, uuid,
sizeof(uuid)) twice—once to populate uuid and again inside the ternary passed to
nvml_detect_mig_devices—causing redundant NVML reads; modify the block to call
nvml_read_device_uuid only once, store its boolean result (e.g., uuid_ok) and
reuse uuid when calling nvml_detect_mig_devices (pass NULL if uuid_ok is false),
adjusting nvml_register_card logic to use uuid/NULL accordingly; also move any
new variable declarations (uuid_ok or similar) to the top of the function to
comply with the project's C style guidelines.
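
Two of the comments above (the pointer reset after dlclose and the v3 → v2 → v1 selection) lend themselves to a short combined sketch. The f_* names are illustrative stand-ins for the plugin's function-pointer slots, and the shared nvmlProcessInfo_t signature is a simplification (the real v1/v2/v3 entry points take differently sized process structs):

#include <dlfcn.h>
#include <stddef.h>
#include <nvml.h>

typedef nvmlReturn_t (*proc_list_fn)(nvmlDevice_t, unsigned int *,
                                     nvmlProcessInfo_t *);

/* Slots filled by dlsym() at init time; NULL when unresolved. */
static proc_list_fn f_compute_procs_v3;
static proc_list_fn f_compute_procs_v2;
static proc_list_fn f_compute_procs_v1;

static nvmlReturn_t query_compute_processes(nvmlDevice_t dev,
                                            unsigned int *count,
                                            nvmlProcessInfo_t *infos)
{
    /* Prefer the newest resolved variant and call exactly one, so each
       PID is emitted a single time. */
    proc_list_fn fn = f_compute_procs_v3 ? f_compute_procs_v3
                    : f_compute_procs_v2 ? f_compute_procs_v2
                    : f_compute_procs_v1;

    if (fn == NULL) {
        return NVML_ERROR_FUNCTION_NOT_FOUND;
    }
    return fn(dev, count, infos);
}

static void nvml_backend_unload(void **lib_handle)
{
    if (*lib_handle != NULL) {
        dlclose(*lib_handle);
        *lib_handle = NULL;
    }
    /* Reset every slot so a later re-init cannot call unloaded code. */
    f_compute_procs_v3 = NULL;
    f_compute_procs_v2 = NULL;
    f_compute_procs_v1 = NULL;
}
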
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 660a1b0c-6753-4c0d-b475-e9500359c2c8

📥 Commits

Reviewing files that changed from the base of the PR and between da59549 and 017a3b9.

📒 Files selected for processing (4)
  • plugins/in_gpu_metrics/amd_gpu.c
  • plugins/in_gpu_metrics/gpu_metrics.c
  • plugins/in_gpu_metrics/gpu_metrics.h
  • plugins/in_gpu_metrics/nvml_gpu.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugins/in_gpu_metrics/gpu_metrics.h

cosmo0920 force-pushed the cosmo0920-implement-nvidia-gpu-metrics branch from 51c173e to ea159a7 on April 17, 2026 06:42
cosmo0920 force-pushed the cosmo0920-implement-nvidia-gpu-metrics branch from ea159a7 to 017a3b9 on April 17, 2026 06:48