
Add benchmarks workflow and CI for collector throughput #224

Merged
wpak-ai merged 3 commits into cppalliance:develop from leostar0412:feat/benchmarks-and-ci on May 21, 2026
Conversation

@leostar0412 (Collaborator) commented May 20, 2026

Summary

  • Add benchmark suite and baseline comparison for key collector paths.
  • Add GitHub Actions workflow to run benchmarks in CI.

Test plan

  • CI workflow passes on this branch.
  • Benchmarks run locally (pytest / project benchmark command as documented).

Closes #213

Summary by CodeRabbit

  • New Features

    • Added performance benchmarking infrastructure to detect throughput regressions and monitor system performance.
  • Documentation

    • Expanded contributing guidelines with comprehensive performance benchmarking instructions and workflows.
    • Standardized and corrected documentation references across all modules.
  • Chores

    • Enhanced CI pipeline with automated performance regression detection.
    • Updated dependencies and configuration to support benchmark testing infrastructure.


@leostar0412 leostar0412 self-assigned this May 20, 2026
coderabbitai Bot commented May 20, 2026

📝 Walkthrough


This pull request introduces performance benchmarking infrastructure alongside documentation alignment and CI hardening. The benchmarking framework includes pytest fixtures, two test modules measuring commit processing and service-layer write throughput, baseline tracking with regression detection, and a new CI workflow. Documentation links are updated from legacy paths to the canonical CONTRIBUTING.md. GitHub Actions are SHA-pinned for supply-chain security.

Changes

Performance Benchmarking Framework

| Layer | File(s) | Summary |
| --- | --- | --- |
| Benchmark framework configuration | benchmarks/conftest.py, conftest.py, pytest.ini, requirements-dev.in, pyproject.toml, .gitignore | Configures pytest benchmark collection via marker definition, pytest-benchmark dependency, benchmark directory exclusion, fixture providing configurable batch size from environment, and gitignore rule for generated bench.json. |
| Benchmark test implementations | benchmarks/test_github_commits_throughput.py, benchmarks/test_service_bulk_insert.py | Two benchmark tests: one measuring _process_commit_data throughput with synthesized GitHub commit payloads over n iterations; another measuring service-layer write performance for bulk commit creation and file change recording within a single transaction. Both record iteration count in benchmark metadata. |
| Baseline tracking and regression detection | benchmarks/baselines.json, benchmarks/compare_to_baseline.py | Baseline JSON stores expected median execution times and sample counts for benchmarks. Comparison script validates run medians against baselines with configurable regression threshold (1.25× default), emits warnings for metadata mismatches, and fails if regressions exceed threshold. |
| Benchmark CI workflow | .github/workflows/benchmarks.yml | New workflow provisioning PostgreSQL, installing dependencies via uv, running benchmark tests with pytest-benchmark JSON output and GC disabled, executing baseline comparison on success, and uploading artifacts with 30-day retention. |
| Contributing guide with benchmark documentation | CONTRIBUTING.md | Updates service API documentation references to new locations and adds "Performance benchmarks" section documenting opt-in collection rules via RUN_BENCHMARKS=1, environment prerequisites, local execution with baseline comparison, threshold behavior, and CI automation. |
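
The baseline comparison summarized above can be illustrated with a minimal sketch. It is not the PR's `benchmarks/compare_to_baseline.py`; the function name `find_regressions` and the exact layout of the baselines mapping (test ID mapped to an entry with `median_seconds` and `n`) are assumptions made for illustration. The report layout follows the pytest-benchmark JSON output, which lists each benchmark with a `fullname` and a `stats.median` value.

```python
def find_regressions(bench_report, baselines, regression_ratio=1.25):
    """Return (test_id, median, limit) for each benchmark slower than its limit.

    bench_report: dict shaped like pytest-benchmark's --benchmark-json output.
    baselines:    hypothetical mapping of test ID -> {"median_seconds": float, "n": int}.
    """
    failures = []
    for bench in bench_report["benchmarks"]:
        test_id = bench["fullname"]
        entry = baselines.get(test_id)
        if entry is None:
            continue  # no baseline recorded for this benchmark; nothing to compare
        median = bench["stats"]["median"]
        limit = entry["median_seconds"] * regression_ratio
        if median > limit:
            failures.append((test_id, median, limit))
    return failures


# Small in-memory example; in practice both dicts would come from json.load().
report = {
    "benchmarks": [
        {"fullname": "benchmarks/test_example.py::test_fast", "stats": {"median": 0.10}},
        {"fullname": "benchmarks/test_example.py::test_slow", "stats": {"median": 0.20}},
    ]
}
baselines = {
    "benchmarks/test_example.py::test_fast": {"median_seconds": 0.10, "n": 50},
    "benchmarks/test_example.py::test_slow": {"median_seconds": 0.12, "n": 50},
}

regressions = find_regressions(report, baselines)
for test_id, median, limit in regressions:
    print(f"REGRESSION {test_id}: median {median:.6f}s > limit {limit:.6f}s")
```

Only `test_slow` is reported here, because 0.20 s exceeds 0.12 s x 1.25 = 0.15 s. The same arithmetic shows why baseline values matter: a 45.0 s baseline gives a limit of 56.25 s, so a benchmark that actually runs in about 0.137 s could become roughly 400 times slower before the check failed.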

Documentation Link and Reference Corrections

| Layer | File(s) | Summary |
| --- | --- | --- |
| Service module docstring corrections | boost_library_tracker/services.py, boost_mailing_list_tracker/services.py, boost_usage_tracker/services.py, cppa_pinecone_sync/services.py, cppa_slack_tracker/services.py, cppa_user_tracker/services.py, cppa_youtube_script_tracker/services.py, github_activity_tracker/services.py | Updates eight service module docstrings to reference CONTRIBUTING.md instead of legacy docs/Contributing.md for the project-wide service-layer write rule. |
| Markdown documentation link corrections | README.md, docs/How_to_add_a_collector.md, docs/Onboarding.md, docs/README.md, docs/Service_API.md, docs/boost_library_docs_tracker.md, docs/cross-app-dependencies.md, docs/service_api/README.md, docs/service_api/boost_usage_tracker.md, docs/service_api/clang_github_tracker.md, docs/service_api/cppa_pinecone_sync.md, docs/service_api/cppa_user_tracker.md, docs/service_api/discord_activity_tracker.md, docs/service_api/github_activity_tracker.md | Corrects internal documentation links across generated and manual docs from legacy docs/Contributing.md or Contributing.md paths to canonical ../CONTRIBUTING.md or ../../CONTRIBUTING.md with proper relative path resolution. |

CI Workflow and Configuration Updates

| Layer | File(s) | Summary |
| --- | --- | --- |
| GitHub Actions SHA pinning | .github/workflows/actions.yml | Replaces floating action version references (@v4, @v3) with SHA-pinned commit revisions across checkout, setup-uv, cache, and upload-artifact actions in lint, pyright, test, and compose-smoke jobs for reproducibility and supply-chain security. |
| Version metadata update | core/_version.py | Updates auto-generated version string to current development build identifier. |
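
To make the SHA-pinning rule above concrete, here is a small sketch of how a workflow file could be scanned for `uses:` references that are not pinned to a full 40-character commit SHA. This helper is not part of the PR; the function name `unpinned_actions` is hypothetical, and the SHA in the sample is a placeholder rather than a real commit.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES_RE = re.compile(r"\s*(?:-\s*)?uses:\s*(\S+)")
# A full Git commit SHA is exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text):
    """Return every action reference that is not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if match is None:
            continue
        ref = match.group(1)
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and Docker images are not pinned this way
        _, _, version = ref.partition("@")
        if not SHA_RE.fullmatch(version):
            unpinned.append(ref)
    return unpinned


sample = """
steps:
  - name: Checkout
    uses: actions/checkout@v4
  - uses: actions/cache@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
"""

print(unpinned_actions(sample))  # ['actions/checkout@v4']
```

A check like this could run in a lint job, but it only tests the shape of the reference; it does not verify that the SHA belongs to the named action's repository.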

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • cppalliance/boost-data-collector#154: Introduces docs/Onboarding.md and links from docs/README.md; this PR updates those onboarding docs to reference the canonical CONTRIBUTING.md location.

Suggested Reviewers

  • jonathanMLDev
  • wpak-ai
  • snowfox1003

Poem

🐰 Benchmarks bound through data streams,
Baselines tracked by CI dreams,
Links aligned from docs to root,
Actions pinned, no shifting boot!
A rabbit's test of speed and care.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a benchmarks workflow and CI integration for measuring collector throughput performance. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/Service_API.md (1)

43-43: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the remaining stale Contributing.md link in this file.

Line 43 still points to Contributing.md, which is inconsistent with the canonical root CONTRIBUTING.md path and can break on case-sensitive environments.

Suggested fix
-Tables in each file are **generated** from source; see [Contributing.md](Contributing.md#regenerating-service-api-docs).
+Tables in each file are **generated** from source; see [CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/Service_API.md` at line 43, Update the stale link that points to
"Contributing.md" to use the canonical root path "../CONTRIBUTING.md" in the
docs/Service_API.md content (the line that currently reads "see
[Contributing.md](Contributing.md#regenerating-service-api-docs)"); replace both
the link text and its target so the link becomes "see
[CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs)", matching
the suggested fix and avoiding case-sensitivity issues.
🧹 Nitpick comments (1)
.github/workflows/benchmarks.yml (1)

31-33: ⚡ Quick win

Add persist-credentials: false to the checkout step for GitHub Actions.

The checkout action retains git credentials by default. Set persist-credentials: false to minimize token exposure in this workflow.

Suggested patch
       - name: Checkout
         uses: actions/checkout@v4
+        with:
+          persist-credentials: false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmarks.yml around lines 31 - 33, The Checkout step
using actions/checkout@v4 currently retains credentials by default; update the
"Checkout" step to include persist-credentials: false (i.e., add the key
persist-credentials with value false under the uses: actions/checkout@v4 entry)
so the checkout action does not persist git credentials into the workflow
environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/benchmarks.yml:
- Line 32: The workflow uses floating action tags (e.g., actions/checkout@v4,
actions/setup-node@v7, actions/cache@v4 and the other action at around line 74)
which must be replaced with immutable commit SHAs; locate each uses: entry for
actions/checkout, actions/setup-node, actions/cache and the other referenced
action and replace the `@v`* tag with the corresponding full commit SHA from that
action's GitHub releases/tags page so the workflow pins to a specific commit SHA
instead of a floating version.

In `@bench.json`:
- Around line 2-38: The bench.json file contains sensitive local machine
metadata under machine_info and commit_info and must not be committed; remove
bench.json from the repo, add its filename or pattern to .gitignore, and move
generation of this artifact to CI/artifacts rather than source control; if this
file was already pushed, purge it from history using a history-rewriting tool
(git filter-repo or BFG) or remove it via git rm --cached and force-push, and
update CI (the job that produces bench.json) to upload it as a build artifact
instead of committing.

In `@benchmarks/baselines.json`:
- Around line 5-10: The baseline median_seconds for the benchmarks are
unrealistically high and make regression checks useless; update the
"median_seconds" values for the affected entries (the first entry with
"median_seconds": 45.0 and the
"benchmarks/test_service_bulk_insert.py::test_service_bulk_commits_and_file_changes"
entry with "median_seconds": 35.0) to the actual recorded medians from this PR’s
bench.json (approximately 0.1369 and 0.1406 respectively) while leaving other
fields (like "n": 50) unchanged so CI regression thresholds are meaningful.

In `@benchmarks/compare_to_baseline.py`:
- Line 77: The error string uses a non-ASCII multiplication character "×" which
triggers Ruff RUF001; update the formatted message where f"(baseline
{float(ref):.6f}s × {args.regression_ratio})" is constructed (referencing
variables ref and args.regression_ratio) to use an ASCII character such as "x"
or "*" instead (e.g. "x") so the text becomes f"(baseline {float(ref):.6f}s x
{args.regression_ratio})".

In `@benchmarks/test_service_bulk_insert.py`:
- Line 28: The current commit hash generation in the 'hashes' list uses
f"svcbulk{i:056d}"[:40], which truncates the variable part and makes all entries
identical; update the expression that builds 'hashes' (the list comprehension
assigned to the variable hashes) so the varying suffix is preserved—for example
use the trailing slice f"svcbulk{i:056d}"[-40:] or otherwise include i in the
kept portion (or replace with a deterministic hash like
hashlib.sha1(f"svcbulk{i}".encode()).hexdigest()[:40]) so each iteration
produces a distinct commit_hash.

In `@CONTRIBUTING.md`:
- Around line 89-90: The documented local benchmark command (the pytest
invocation shown: "uv run pytest benchmarks/ -m benchmark --benchmark-only
--benchmark-json=bench.json -v") is missing the CI-only flag; update that
command in CONTRIBUTING.md to include --benchmark-disable-gc so local runs
disable the GC exactly like the CI workflow, ensuring comparable results.

In `@docs/service_api/cppa_pinecone_sync.md`:
- Line 5: In docs/service_api/cppa_pinecone_sync.md update the remaining legacy
CONTRIBUTING link by replacing the incorrect "../Contributing.md" occurrence
(the link text on Line 25) with the correct "../../CONTRIBUTING.md" so both
references use the canonical uppercase CONTRIBUTING.md path; search for the
string "../Contributing.md" and change it to "../../CONTRIBUTING.md" to ensure
consistency with the earlier fix.
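
The truncation issue flagged above for `benchmarks/test_service_bulk_insert.py` is easy to reproduce in isolation. The sketch below uses the expression quoted in the comment and the two proposed alternatives; the batch size of 50 mirrors the `"n": 50` baseline field and is otherwise an arbitrary choice for the demonstration.

```python
import hashlib

N = 50  # illustrative batch size, matching the "n": 50 baseline entries

# Original expression: 7 prefix characters plus 56 zero-padded digits, cut to the
# first 40 characters. The varying digits sit at the end and are discarded.
truncated = [f"svcbulk{i:056d}"[:40] for i in range(N)]

# Proposed fix 1: keep the trailing 40 characters, which contain the counter.
tail = [f"svcbulk{i:056d}"[-40:] for i in range(N)]

# Proposed fix 2: a deterministic SHA-1 digest, already 40 hex characters long.
digest = [hashlib.sha1(f"svcbulk{i}".encode()).hexdigest() for i in range(N)]

print(len(set(truncated)))  # 1: every generated hash is identical
print(len(set(tail)))       # 50
print(len(set(digest)))     # 50
```

With the original slice, every commit in the batch shares one `commit_hash`, so the benchmark may measure far fewer distinct rows than intended. The digest variant also has the advantage of looking like a real commit SHA.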


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 17f058c6-9386-4b86-aedf-5a5e3b4e4528

📥 Commits

Reviewing files that changed from the base of the PR and between ba453d9 and 94302c7.

⛔ Files ignored due to path filters (1)
  • requirements-dev.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .github/workflows/benchmarks.yml
  • CONTRIBUTING.md
  • README.md
  • bench.json
  • benchmarks/baselines.json
  • benchmarks/compare_to_baseline.py
  • benchmarks/conftest.py
  • benchmarks/test_github_commits_throughput.py
  • benchmarks/test_service_bulk_insert.py
  • boost_library_tracker/services.py
  • boost_mailing_list_tracker/services.py
  • boost_usage_tracker/services.py
  • conftest.py
  • cppa_pinecone_sync/services.py
  • cppa_slack_tracker/services.py
  • cppa_user_tracker/services.py
  • cppa_youtube_script_tracker/services.py
  • docs/How_to_add_a_collector.md
  • docs/Onboarding.md
  • docs/README.md
  • docs/Service_API.md
  • docs/boost_library_docs_tracker.md
  • docs/cross-app-dependencies.md
  • docs/service_api/README.md
  • docs/service_api/boost_usage_tracker.md
  • docs/service_api/clang_github_tracker.md
  • docs/service_api/cppa_pinecone_sync.md
  • docs/service_api/cppa_user_tracker.md
  • docs/service_api/discord_activity_tracker.md
  • docs/service_api/github_activity_tracker.md
  • github_activity_tracker/services.py
  • pyproject.toml
  • pytest.ini
  • requirements-dev.in

Comment thread .github/workflows/benchmarks.yml Outdated
Comment thread bench.json Outdated
Comment thread benchmarks/baselines.json Outdated
Comment thread benchmarks/compare_to_baseline.py Outdated
Comment thread benchmarks/test_service_bulk_insert.py Outdated
Comment thread CONTRIBUTING.md Outdated
Comment thread docs/service_api/cppa_pinecone_sync.md
@leostar0412 leostar0412 requested a review from jonathanMLDev May 20, 2026 15:50
@leostar0412 leostar0412 requested a review from wpak-ai May 21, 2026 17:15
@wpak-ai wpak-ai merged commit 70cc837 into cppalliance:develop May 21, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Add Performance Benchmarks for Collection Throughput

3 participants