Add benchmarks workflow and CI for collector throughput by leostar0412 · Pull Request #224 · cppalliance/boost-data-collector

leostar0412 · 2026-05-20T02:03:14Z

Summary

Add benchmark suite and baseline comparison for key collector paths.
Add GitHub Actions workflow to run benchmarks in CI.

Test plan

CI workflow passes on this branch.
Benchmarks run locally (pytest / project benchmark command as documented).

Closes #213

Summary by CodeRabbit

New Features
- Added performance benchmarking infrastructure to detect throughput regressions and monitor system performance.
Documentation
- Expanded contributing guidelines with comprehensive performance benchmarking instructions and workflows.
- Standardized and corrected documentation references across all modules.
Chores
- Enhanced CI pipeline with automated performance regression detection.
- Updated dependencies and configuration to support benchmark testing infrastructure.

coderabbitai · 2026-05-20T02:03:26Z

📝 Walkthrough

This pull request introduces performance benchmarking infrastructure alongside documentation alignment and CI hardening. The benchmarking framework includes pytest fixtures, two test modules measuring commit processing and service-layer write throughput, baseline tracking with regression detection, and a new CI workflow. Documentation links are updated to point from legacy paths to the canonical CONTRIBUTING.md. GitHub Actions are SHA-pinned for supply-chain security.

Changes

Performance Benchmarking Framework

Layer / File(s)	Summary
Benchmark framework configuration `benchmarks/conftest.py`, `conftest.py`, `pytest.ini`, `requirements-dev.in`, `pyproject.toml`, `.gitignore`	Configures pytest benchmark collection via marker definition, pytest-benchmark dependency, benchmark directory exclusion, fixture providing configurable batch size from environment, and gitignore rule for generated `bench.json`.
Benchmark test implementations `benchmarks/test_github_commits_throughput.py`, `benchmarks/test_service_bulk_insert.py`	Two benchmark tests: one measuring `_process_commit_data` throughput with synthesized GitHub commit payloads over `n` iterations; another measuring service-layer write performance for bulk commit creation and file change recording within a single transaction. Both record iteration count in benchmark metadata.
Baseline tracking and regression detection `benchmarks/baselines.json`, `benchmarks/compare_to_baseline.py`	Baseline JSON stores expected median execution times and sample counts for benchmarks. Comparison script validates run medians against baselines with configurable regression threshold (1.25× default), emits warnings for metadata mismatches, and fails if regressions exceed threshold.
Benchmark CI workflow `.github/workflows/benchmarks.yml`	New workflow provisioning PostgreSQL, installing dependencies via uv, running benchmark tests with `pytest-benchmark` JSON output and GC disabled, executing baseline comparison on success, and uploading artifacts with 30-day retention.
Contributing guide with benchmark documentation `CONTRIBUTING.md`	Updates service API documentation references to new locations and adds "Performance benchmarks" section documenting opt-in collection rules via `RUN_BENCHMARKS=1`, environment prerequisites, local execution with baseline comparison, threshold behavior, and CI automation.
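
The baseline comparison described in the table can be sketched roughly as follows. This is a minimal illustration, not the contents of `benchmarks/compare_to_baseline.py`: the function names `medians_from_report` and `find_regressions` are hypothetical, the baseline file is assumed to be a flat mapping from test ID to an entry with a `median_seconds` field, and the report is assumed to follow the pytest-benchmark JSON layout (a `benchmarks` list whose items carry `fullname` and `stats.median`). The 3.0 s run time is invented for the example; 35.0 and 0.1406 are the baseline values discussed in the review.

```python
def medians_from_report(report):
    """Map each benchmark's full test ID to its median time in seconds."""
    return {b["fullname"]: b["stats"]["median"] for b in report["benchmarks"]}


def find_regressions(run_medians, baselines, regression_ratio=1.25):
    """Return (name, median, limit) for every median above baseline * ratio."""
    regressions = []
    for name, median in run_medians.items():
        ref = baselines.get(name)
        if ref is None:
            continue  # no baseline recorded, so there is nothing to compare
        limit = float(ref["median_seconds"]) * regression_ratio
        if median > limit:
            regressions.append((name, median, limit))
    return regressions


name = "benchmarks/test_service_bulk_insert.py::test_service_bulk_commits_and_file_changes"
# Hypothetical run that is far slower than the measured median of about 0.14 s.
report = {"benchmarks": [{"fullname": name, "stats": {"median": 3.0}}]}
run = medians_from_report(report)

stale = {name: {"median_seconds": 35.0, "n": 50}}    # inflated baseline
tight = {name: {"median_seconds": 0.1406, "n": 50}}  # baseline near the measured median

print(len(find_regressions(run, stale)), len(find_regressions(run, tight)))  # prints: 0 1
```

With the inflated baseline the limit is 43.75 s, so even a 3-second run passes; with a baseline close to the measured median the same run is flagged.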

Documentation Link and Reference Corrections

Layer / File(s)	Summary
Service module docstring corrections `boost_library_tracker/services.py`, `boost_mailing_list_tracker/services.py`, `boost_usage_tracker/services.py`, `cppa_pinecone_sync/services.py`, `cppa_slack_tracker/services.py`, `cppa_user_tracker/services.py`, `cppa_youtube_script_tracker/services.py`, `github_activity_tracker/services.py`	Updates eight service module docstrings to reference `CONTRIBUTING.md` instead of legacy `docs/Contributing.md` for the project-wide service-layer write rule.
Markdown documentation link corrections `README.md`, `docs/How_to_add_a_collector.md`, `docs/Onboarding.md`, `docs/README.md`, `docs/Service_API.md`, `docs/boost_library_docs_tracker.md`, `docs/cross-app-dependencies.md`, `docs/service_api/README.md`, `docs/service_api/boost_usage_tracker.md`, `docs/service_api/clang_github_tracker.md`, `docs/service_api/cppa_pinecone_sync.md`, `docs/service_api/cppa_user_tracker.md`, `docs/service_api/discord_activity_tracker.md`, `docs/service_api/github_activity_tracker.md`	Corrects internal documentation links across generated and manual docs from legacy `docs/Contributing.md` or `Contributing.md` paths to canonical `../CONTRIBUTING.md` or `../../CONTRIBUTING.md` with proper relative path resolution.

CI Workflow and Configuration Updates

Layer / File(s)	Summary
GitHub Actions SHA pinning `.github/workflows/actions.yml`	Replaces floating action version references (`@v4`, `@v3`) with SHA-pinned commit revisions across checkout, setup-uv, cache, and upload-artifact actions in lint, pyright, test, and compose-smoke jobs for reproducibility and supply-chain security.
Version metadata update `core/_version.py`	Updates auto-generated version string to current development build identifier.
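
As a rough illustration of what SHA pinning means in practice, the sketch below scans workflow text for `uses:` entries whose ref is not a full 40-character commit SHA. It is not part of this PR; the helper name `floating_action_refs` and the sample workflow (including its placeholder SHA) are assumptions made for the example, and the regular expression only handles the common single-line `uses:` form.

```python
import re

# Capture the target of each `uses:` key, stopping before whitespace or a comment.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)", re.MULTILINE)
SHA_RE = re.compile(r"[0-9a-f]{40}")


def floating_action_refs(workflow_text):
    """Return `uses:` targets that are not pinned to a full commit SHA."""
    floating = []
    for target in USES_RE.findall(workflow_text):
        if target.startswith("./") or target.startswith("docker://"):
            continue  # local actions and Docker images are not pinned by git ref
        _, _, ref = target.partition("@")
        if not SHA_RE.fullmatch(ref):
            floating.append(target)
    return floating


# Placeholder workflow; the 40-character SHA below is made up.
sample = """
steps:
  - name: Checkout
    uses: actions/checkout@v4
  - uses: actions/cache@0123456789abcdef0123456789abcdef01234567  # v4
"""

print(floating_action_refs(sample))  # prints: ['actions/checkout@v4']
```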

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

cppalliance/boost-data-collector#154: Introduces docs/Onboarding.md and links from docs/README.md; this PR updates those onboarding docs to reference the canonical CONTRIBUTING.md location.

Suggested Reviewers

jonathanMLDev
wpak-ai
snowfox1003

Poem

🐰 Benchmarks bound through data streams,
Baselines tracked by CI dreams,
Links aligned from docs to root,
Actions pinned, no shifting boot!
A rabbit's test of speed and care.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding a benchmarks workflow and CI integration for measuring collector throughput performance.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/Service_API.md (1)

43-43: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the remaining stale Contributing.md link in this file.

Line 43 still points to Contributing.md, which is inconsistent with the canonical root CONTRIBUTING.md path and can break on case-sensitive environments.

Suggested fix

-Tables in each file are **generated** from source; see [Contributing.md](Contributing.md#regenerating-service-api-docs).
+Tables in each file are **generated** from source; see [CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/Service_API.md` at line 43, Update the stale link that points to
"Contributing.md" to use the canonical root path "CONTRIBUTING.md" in the
docs/Service_API.md content (the line that currently reads "see
[Contributing.md](Contributing.md#regenerating-service-api-docs)"); replace both
the link text and the target so the link becomes "see
[CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs)", which
resolves from docs/ to the repository root and avoids case-sensitivity issues.

🧹 Nitpick comments (1)

.github/workflows/benchmarks.yml (1)

31-33: ⚡ Quick win

Add persist-credentials: false to the checkout step for GitHub Actions.

The checkout action retains git credentials by default. Set persist-credentials: false to minimize token exposure in this workflow.

Suggested patch

       - name: Checkout
         uses: actions/checkout@v4
+        with:
+          persist-credentials: false

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmarks.yml around lines 31 - 33, The Checkout step
using actions/checkout@v4 currently retains credentials by default; update the
"Checkout" step to include persist-credentials: false (i.e., add the key
persist-credentials with value false under the uses: actions/checkout@v4 entry)
so the checkout action does not persist git credentials into the workflow
environment.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/benchmarks.yml:
- Line 32: The workflow uses floating action tags (e.g., actions/checkout@v4,
astral-sh/setup-uv@v7, actions/cache@v4 and the other action at around line 74)
which must be replaced with immutable commit SHAs; locate each uses: entry for
actions/checkout, astral-sh/setup-uv, actions/cache and the other referenced
action and replace the `@v`* tag with the corresponding full commit SHA from that
action's GitHub releases/tags page so the workflow pins to a specific commit SHA
instead of a floating version.

In `@bench.json`:
- Around line 2-38: The bench.json file contains sensitive local machine
metadata under machine_info and commit_info and must not be committed; remove
bench.json from the repo, add its filename or pattern to .gitignore, and move
generation of this artifact to CI/artifacts rather than source control; if this
file was already pushed, purge it from history using a history-rewriting tool
(git filter-repo or BFG) or remove it via git rm --cached and force-push, and
update CI (the job that produces bench.json) to upload it as a build artifact
instead of committing.

In `@benchmarks/baselines.json`:
- Around line 5-10: The baseline median_seconds for the benchmarks are
unrealistically high and make regression checks useless; update the
"median_seconds" values for the affected entries (the first entry with
"median_seconds": 45.0 and the
"benchmarks/test_service_bulk_insert.py::test_service_bulk_commits_and_file_changes"
entry with "median_seconds": 35.0) to the actual recorded medians from this PR’s
bench.json (approximately 0.1369 and 0.1406 respectively) while leaving other
fields (like "n": 50) unchanged so CI regression thresholds are meaningful.

In `@benchmarks/compare_to_baseline.py`:
- Line 77: The error string uses a non-ASCII multiplication character "×" which
triggers Ruff RUF001; update the formatted message where f"(baseline
{float(ref):.6f}s × {args.regression_ratio})" is constructed (referencing
variables ref and args.regression_ratio) to use an ASCII character such as "x"
or "*" instead (e.g. "x") so the text becomes f"(baseline {float(ref):.6f}s x
{args.regression_ratio})".

In `@benchmarks/test_service_bulk_insert.py`:
- Line 28: The current commit hash generation in the 'hashes' list uses
f"svcbulk{i:056d}"[:40], which truncates the variable part and makes all entries
identical; update the expression that builds 'hashes' (the list comprehension
assigned to the variable hashes) so the varying suffix is preserved—for example
use the trailing slice f"svcbulk{i:056d}"[-40:] or otherwise include i in the
kept portion (or replace with a deterministic hash like
hashlib.sha1(f"svcbulk{i}".encode()).hexdigest()[:40]) so each iteration
produces a distinct commit_hash.

In `@CONTRIBUTING.md`:
- Around line 89-90: The documented local benchmark command (the pytest
invocation shown: "uv run pytest benchmarks/ -m benchmark --benchmark-only
--benchmark-json=bench.json -v") is missing the CI-only flag; update that
command in CONTRIBUTING.md to include --benchmark-disable-gc so local runs
disable the GC exactly like the CI workflow, ensuring comparable results.

In `@docs/service_api/cppa_pinecone_sync.md`:
- Line 5: In docs/service_api/cppa_pinecone_sync.md update the remaining legacy
CONTRIBUTING link by replacing the incorrect "../Contributing.md" occurrence
(the link text on Line 25) with the correct "../../CONTRIBUTING.md" so both
references use the canonical uppercase CONTRIBUTING.md path; search for the
string "../Contributing.md" and change it to "../../CONTRIBUTING.md" to ensure
consistency with the earlier fix.

---

Outside diff comments:
In `@docs/Service_API.md`:
- Line 43: Update the stale link that points to "Contributing.md" to use the
canonical root path "CONTRIBUTING.md" in the docs/Service_API.md content
(the line that currently reads "see
[Contributing.md](Contributing.md#regenerating-service-api-docs)"); replace both
the link text and the target so the link becomes "see
[CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs)", which
resolves from docs/ to the repository root and avoids case-sensitivity issues.

---

Nitpick comments:
In @.github/workflows/benchmarks.yml:
- Around line 31-33: The Checkout step using actions/checkout@v4 currently
retains credentials by default; update the "Checkout" step to include
persist-credentials: false (i.e., add the key persist-credentials with value
false under the uses: actions/checkout@v4 entry) so the checkout action does not
persist git credentials into the workflow environment.
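
The hash truncation issue raised for `benchmarks/test_service_bulk_insert.py` can be reproduced in isolation. The snippet below is a standalone sketch rather than the benchmark itself; the batch size `n = 50` is chosen only for illustration, and the two alternatives are the ones proposed in the review comment.

```python
import hashlib

n = 50  # illustrative batch size

# Flagged expression: "svcbulk" plus 33 leading zeros fill all 40 kept characters,
# so the varying digits at the end of the string are cut off.
truncated = [f"svcbulk{i:056d}"[:40] for i in range(n)]

# Alternative 1: keep the trailing 40 characters so the index digits survive.
trailing = [f"svcbulk{i:056d}"[-40:] for i in range(n)]

# Alternative 2: derive a 40-character hex digest from the index.
hashed = [hashlib.sha1(f"svcbulk{i}".encode()).hexdigest() for i in range(n)]

print(len(set(truncated)), len(set(trailing)), len(set(hashed)))  # prints: 1 50 50
```

The first expression yields a single distinct value for the whole batch, while both alternatives produce one distinct 40-character string per iteration.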

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 17f058c6-9386-4b86-aedf-5a5e3b4e4528

📥 Commits

Reviewing files that changed from the base of the PR and between ba453d9 and 94302c7.

⛔ Files ignored due to path filters (1)

requirements-dev.lock is excluded by !**/*.lock

📒 Files selected for processing (34)

.github/workflows/benchmarks.yml
CONTRIBUTING.md
README.md
bench.json
benchmarks/baselines.json
benchmarks/compare_to_baseline.py
benchmarks/conftest.py
benchmarks/test_github_commits_throughput.py
benchmarks/test_service_bulk_insert.py
boost_library_tracker/services.py
boost_mailing_list_tracker/services.py
boost_usage_tracker/services.py
conftest.py
cppa_pinecone_sync/services.py
cppa_slack_tracker/services.py
cppa_user_tracker/services.py
cppa_youtube_script_tracker/services.py
docs/How_to_add_a_collector.md
docs/Onboarding.md
docs/README.md
docs/Service_API.md
docs/boost_library_docs_tracker.md
docs/cross-app-dependencies.md
docs/service_api/README.md
docs/service_api/boost_usage_tracker.md
docs/service_api/clang_github_tracker.md
docs/service_api/cppa_pinecone_sync.md
docs/service_api/cppa_user_tracker.md
docs/service_api/discord_activity_tracker.md
docs/service_api/github_activity_tracker.md
github_activity_tracker/services.py
pyproject.toml
pytest.ini
requirements-dev.in

leostar0412 added 2 commits May 19, 2026 18:40

feat(benchmarks): add performance benchmarks and CI integration

3b5e4b6

chore: remove uv.lock file and update version to 0.1.0 in _version.py

94302c7

leostar0412 self-assigned this May 20, 2026

coderabbitai Bot reviewed May 20, 2026

chore: update .gitignore to exclude bench.json and remove the file; modify CI scripts for benchmark integration

5495397

leostar0412 requested a review from jonathanMLDev May 20, 2026 15:50

jonathanMLDev approved these changes May 21, 2026

leostar0412 requested a review from wpak-ai May 21, 2026 17:15

wpak-ai approved these changes May 21, 2026

wpak-ai merged commit 70cc837 into cppalliance:develop May 21, 2026
5 checks passed

wpak-ai merged 3 commits into cppalliance:develop from leostar0412:feat/benchmarks-and-ci
