in_podman_metrics: fix multiple cgroup v2 issues #11719
stondo wants to merge 1 commit into fluent:master
Fix four cgroup v2 bugs in the podman_metrics input plugin:
1. CPU counter division: cgroup v2 cpu.stat reports usage in
microseconds, not nanoseconds like cgroup v1 cpuacct.
Use the correct divisor (1e6) when converting to seconds.
2. RSS memory key: cgroup v2 memory.stat does not have a "rss"
field. The equivalent metric is "anon" (anonymous memory).
Add V2_STAT_KEY_RSS and use it in the v2 collection path.
3. memory.max "max" keyword: cgroup v2 uses the literal string
"max" in memory.max when the memory limit is unlimited.
read_from_file() fails to parse this with fscanf("%lu"),
causing spurious warnings. Add read_from_sysfs_or_max()
helper that returns 0 for "max" (unlimited).
4. PID alt path typo: V2_SYSFS_FILE_PIDS_ALT was set to
"containers/cgroup.procs" (plural) but the actual cgroup v2
subdirectory is "container/cgroup.procs" (singular). This
caused PID lookup to fail for all containers, which in turn
prevented all network metrics from being collected.
Fixes: fluent#7769
Signed-off-by: Stefano Tondo <stefano.tondo.ext@siemens.com>
Walkthrough: These changes enhance cgroup v2 support in the Podman metrics plugin by correcting unit conversions for CPU metrics, updating sysfs file paths for cgroup v2, adding RSS memory metric mapping for cgroup v2, and implementing proper handling of unlimited memory values.
Summary
Fix four cgroup v2 bugs in the `in_podman_metrics` input plugin that caused
most container metrics to be absent or incorrect on systems using cgroup v2
(unified hierarchy).
All four bugs are confirmed on an ARM64 embedded device running
Fluent Bit 4.2.0/5.0.2 with Podman and cgroup v2 unified hierarchy.
Bugs fixed
1. CPU counter division (incorrect unit conversion)
`create_counter()` divides raw CPU values by 1,000,000,000 (nanoseconds) unconditionally. This is correct for cgroup v1 (`cpuacct.usage` reports nanoseconds) but wrong for cgroup v2 (`cpu.stat` reports `usage_usec` and `user_usec` in microseconds).
On v2, integer division truncates all values below 1e9 usec (~16.7 min of CPU time) to zero. In practice, `container_cpu_usage_seconds_total` and `container_cpu_user_seconds_total` always read 0.
Fix: Check `ctx->cgroup_version` and use the correct divisor: 1e9 for v1 (nanoseconds), 1e6 for v2 (microseconds).
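The divisor selection described above can be sketched as follows. This is a minimal illustration, not the plugin's code: the enum, the macros, and `cpu_raw_to_seconds()` are hypothetical names, and the real fix lives in `create_counter()` keyed on `ctx->cgroup_version`.

```c
#include <stdint.h>

/* Hypothetical names for illustration; these are not the plugin's identifiers. */
enum cgroup_version { CGROUP_V1 = 1, CGROUP_V2 = 2 };

#define NSEC_PER_SEC 1000000000ULL /* cgroup v1: cpuacct.usage is in nanoseconds */
#define USEC_PER_SEC 1000000ULL    /* cgroup v2: cpu.stat *_usec is in microseconds */

/* Convert a raw CPU counter to whole seconds, picking the divisor
 * that matches the unit reported by the given cgroup version. */
uint64_t cpu_raw_to_seconds(uint64_t raw, enum cgroup_version version)
{
    uint64_t divisor = (version == CGROUP_V2) ? USEC_PER_SEC : NSEC_PER_SEC;

    return raw / divisor;
}
```

With this sketch, a v2 reading of 5,000,000 usec converts to 5 seconds, whereas applying the v1 divisor to the same value truncates to 0, which is the symptom described above.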
2. RSS memory key name (wrong key for v2)
`STAT_KEY_RSS` is defined as `"rss"`, which is correct for cgroup v1 `memory.stat`. However, cgroup v2 `memory.stat` does not have a `rss` field; the equivalent is `anon` (anonymous memory pages).
Result: the `container_memory_rss` gauge is never reported for any container on v2, with a `[warn] rss not found in .../memory.stat` log message emitted per container per scrape.
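To illustrate why the lookup misses on v2, here is a minimal sketch of a key lookup over `memory.stat`-style text. `stat_lookup()` is a hypothetical helper written for this example and is not the plugin's parser; it assumes the "key value" per-line layout that `memory.stat` uses.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "key value" in memory.stat-style text.
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed. */
int stat_lookup(const char *text, const char *key, uint64_t *out)
{
    size_t key_len = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Match the key only at the start of a line and only when it is
         * followed by a space, so "anon" does not match "anon_thp". */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ' ') {
            if (sscanf(line + key_len + 1, "%" SCNu64, out) == 1) {
                return 0;
            }
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL) {
            line++; /* step past the newline to the next line */
        }
    }
    return -1;
}
```

Given v2-style input such as `"anon 4096\nfile 8192\n"`, looking up `"rss"` fails while looking up `"anon"` succeeds, which mirrors the warning and the missing gauge.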
Fix: Add `V2_STAT_KEY_RSS "anon"` and use it in `fill_counters_with_sysfs_data_v2()`.
3. memory.max "max" keyword (parse failure)
cgroup v2 `memory.max` contains the literal string `"max"` when the memory limit is unlimited (no limit set). `read_from_file()` uses `fscanf(fp, "%lu", &value)`, which fails to parse `"max"`, returning UINT64_MAX and logging a spurious warning per affected container per scrape.
Result: `container_spec_memory_limit_bytes` is missing for containers without an explicit memory limit. Warning spam in logs.
Fix: Add a `read_from_sysfs_or_max()` helper that parses the "max"
keyword and returns 0 (unlimited), matching the convention used by
cAdvisor and other container metric exporters.
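The "max"-aware parsing can be sketched as below. This is an assumption-laden illustration: `parse_limit_or_max()` is a hypothetical name that works on the file contents as a string, and it is not the body of `read_from_sysfs_or_max()`, which is not reproduced in this description.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse the contents of a cgroup v2 memory.max file.
 * The literal "max" maps to 0 (unlimited); otherwise the text must be an
 * unsigned integer. Returns 0 on success, -1 on parse failure. */
int parse_limit_or_max(const char *text, uint64_t *out)
{
    /* strncmp() matching 3 bytes guarantees text[3] is readable. */
    if (strncmp(text, "max", 3) == 0 &&
        (text[3] == '\0' || text[3] == '\n')) {
        *out = 0;
        return 0;
    }
    if (sscanf(text, "%" SCNu64, out) == 1) {
        return 0;
    }
    return -1;
}
```

Checking for the keyword before attempting the numeric parse means an unlimited container yields a defined value instead of a parse error and a warning.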
4. PID fallback path typo (singular vs. plural)
`V2_SYSFS_FILE_PIDS_ALT` is defined as `"containers/cgroup.procs"` (plural), but the actual cgroup v2 subdirectory is `"container/cgroup.procs"` (singular).
On v2, `cgroup.procs` at the scope level is empty for all containers. Processes live only in the `container/cgroup.procs` subdirectory. The plugin correctly tries the alt path, but the typo means it always fails.
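The fallback behaviour described above can be sketched on file contents rather than real paths. Both functions are hypothetical and only illustrate the "try the scope-level file, then the subdirectory file" order; they are not the plugin's PID lookup.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: return the first PID listed in cgroup.procs-style
 * text, or 0 if the text is missing, empty, or unparseable. */
uint64_t first_pid(const char *procs_text)
{
    uint64_t pid = 0;

    if (procs_text == NULL || sscanf(procs_text, "%" SCNu64, &pid) != 1) {
        return 0;
    }
    return pid;
}

/* Hypothetical helper: prefer the scope-level cgroup.procs contents and
 * fall back to the container/cgroup.procs contents when it has no PID. */
uint64_t pid_with_fallback(const char *scope_procs, const char *container_procs)
{
    uint64_t pid = first_pid(scope_procs);

    return pid != 0 ? pid : first_pid(container_procs);
}
```

If the fallback source cannot be read at all, as happens with the misspelled path, the result is 0 and no PID is available for the network lookup.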
Result: PID lookup fails for all containers, which prevents `get_net_data_from_proc()` from being called. All four `container_network_*` metrics are completely absent.
Fix: Change `"containers/cgroup.procs"` to `"container/cgroup.procs"`.
Before/After (gw-cloud-connector container, ARM64 device)
container_cpu_usage_seconds_total
container_cpu_user_seconds_total
container_memory_rss
container_spec_memory_limit_bytes
container_network_receive_bytes_total
container_network_transmit_bytes_total
After the fix, all 10 metric types are successfully emitted for all 9
containers, with zero warnings in the log.
Testing
Tested on:
Verified:
`cpu.stat` divided by 1e6
`anon` field in `memory.stat`
Fixes #7769