Add benchmarks workflow and CI for collector throughput#224
Conversation
📝 WalkthroughWalkthroughThis pull request introduces comprehensive performance benchmarking infrastructure alongside documentation alignment and CI hardening. Benchmarking framework includes pytest fixtures, two test modules measuring commit processing and service-layer write throughput, baseline tracking with regression detection, and a new CI workflow. Documentation links are systematically updated to point from legacy paths to the canonical ChangesPerformance Benchmarking Framework
Documentation Link and Reference Corrections
CI Workflow and Configuration Updates
🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly Related PRs
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/Service_API.md (1)
43-43:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix the remaining stale
Contributing.mdlink in this file.Line 43 still points to
Contributing.md, which is inconsistent with the canonical rootCONTRIBUTING.mdpath and can break on case-sensitive environments.Suggested fix
-Tables in each file are **generated** from source; see [Contributing.md](Contributing.md#regenerating-service-api-docs). +Tables in each file are **generated** from source; see [CONTRIBUTING.md](../CONTRIBUTING.md#regenerating-service-api-docs).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/Service_API.md` at line 43, Update the stale link that points to "Contributing.md" to use the canonical uppercase path "CONTRIBUTING.md" in the docs/Service_API.md content (the line that currently reads "see [Contributing.md](Contributing.md#regenerating-service-api-docs)"); replace both the filename and its fragment target if needed so the link becomes "see [CONTRIBUTING.md](CONTRIBUTING.md#regenerating-service-api-docs)" to avoid case-sensitivity issues.
🧹 Nitpick comments (1)
.github/workflows/benchmarks.yml (1)
31-33: ⚡ Quick winAdd
persist-credentials: falseto the checkout step for GitHub Actions.The checkout action retains git credentials by default. Set
persist-credentials: falseto minimize token exposure in this workflow.Suggested patch
- name: Checkout uses: actions/checkout@v4 + with: + persist-credentials: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/benchmarks.yml around lines 31 - 33, The Checkout step using actions/checkout@v4 currently retains credentials by default; update the "Checkout" step to include persist-credentials: false (i.e., add the key persist-credentials with value false under the uses: actions/checkout@v4 entry) so the checkout action does not persist git credentials into the workflow environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/benchmarks.yml:
- Line 32: The workflow uses floating action tags (e.g., actions/checkout@v4,
actions/setup-node@v7, actions/cache@v4 and the other action at around line 74)
which must be replaced with immutable commit SHAs; locate each uses: entry for
actions/checkout, actions/setup-node, actions/cache and the other referenced
action and replace the `@v`* tag with the corresponding full commit SHA from that
action's GitHub releases/tags page so the workflow pins to a specific commit SHA
instead of a floating version.
In `@bench.json`:
- Around line 2-38: The bench.json file contains sensitive local machine
metadata under machine_info and commit_info and must not be committed; remove
bench.json from the repo, add its filename or pattern to .gitignore, and move
generation of this artifact to CI/artifacts rather than source control; if this
file was already pushed, purge it from history using a history-rewriting tool
(git filter-repo or BFG) or remove it via git rm --cached and force-push, and
update CI (the job that produces bench.json) to upload it as a build artifact
instead of committing.
In `@benchmarks/baselines.json`:
- Around line 5-10: The baseline median_seconds for the benchmarks are
unrealistically high and make regression checks useless; update the
"median_seconds" values for the affected entries (the first entry with
"median_seconds": 45.0 and the
"benchmarks/test_service_bulk_insert.py::test_service_bulk_commits_and_file_changes"
entry with "median_seconds": 35.0) to the actual recorded medians from this PR’s
bench.json (approximately 0.1369 and 0.1406 respectively) while leaving other
fields (like "n": 50) unchanged so CI regression thresholds are meaningful.
In `@benchmarks/compare_to_baseline.py`:
- Line 77: The error string uses a non-ASCII multiplication character "×" which
triggers Ruff RUF001; update the formatted message where f"(baseline
{float(ref):.6f}s × {args.regression_ratio})" is constructed (referencing
variables ref and args.regression_ratio) to use an ASCII character such as "x"
or "*" instead (e.g. "x") so the text becomes f"(baseline {float(ref):.6f}s x
{args.regression_ratio})".
In `@benchmarks/test_service_bulk_insert.py`:
- Line 28: The current commit hash generation in the 'hashes' list uses
f"svcbulk{i:056d}"[:40], which truncates the variable part and makes all entries
identical; update the expression that builds 'hashes' (the list comprehension
assigned to the variable hashes) so the varying suffix is preserved—for example
use the trailing slice f"svcbulk{i:056d}"[-40:] or otherwise include i in the
kept portion (or replace with a deterministic hash like
hashlib.sha1(f"svcbulk{i}".encode()).hexdigest()[:40]) so each iteration
produces a distinct commit_hash.
In `@CONTRIBUTING.md`:
- Around line 89-90: The documented local benchmark command (the pytest
invocation shown: "uv run pytest benchmarks/ -m benchmark --benchmark-only
--benchmark-json=bench.json -v") is missing the CI-only flag; update that
command in CONTRIBUTING.md to include --benchmark-disable-gc so local runs
disable the GC exactly like the CI workflow, ensuring comparable results.
In `@docs/service_api/cppa_pinecone_sync.md`:
- Line 5: In docs/service_api/cppa_pinecone_sync.md update the remaining legacy
CONTRIBUTING link by replacing the incorrect "../Contributing.md" occurrence
(the link text on Line 25) with the correct "../../CONTRIBUTING.md" so both
references use the canonical uppercase CONTRIBUTING.md path; search for the
string "../Contributing.md" and change it to "../../CONTRIBUTING.md" to ensure
consistency with the earlier fix.
---
Outside diff comments:
In `@docs/Service_API.md`:
- Line 43: Update the stale link that points to "Contributing.md" to use the
canonical uppercase path "CONTRIBUTING.md" in the docs/Service_API.md content
(the line that currently reads "see
[Contributing.md](Contributing.md#regenerating-service-api-docs)"); replace both
the filename and its fragment target if needed so the link becomes "see
[CONTRIBUTING.md](CONTRIBUTING.md#regenerating-service-api-docs)" to avoid
case-sensitivity issues.
---
Nitpick comments:
In @.github/workflows/benchmarks.yml:
- Around line 31-33: The Checkout step using actions/checkout@v4 currently
retains credentials by default; update the "Checkout" step to include
persist-credentials: false (i.e., add the key persist-credentials with value
false under the uses: actions/checkout@v4 entry) so the checkout action does not
persist git credentials into the workflow environment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 17f058c6-9386-4b86-aedf-5a5e3b4e4528
⛔ Files ignored due to path filters (1)
requirements-dev.lockis excluded by!**/*.lock
📒 Files selected for processing (34)
.github/workflows/benchmarks.ymlCONTRIBUTING.mdREADME.mdbench.jsonbenchmarks/baselines.jsonbenchmarks/compare_to_baseline.pybenchmarks/conftest.pybenchmarks/test_github_commits_throughput.pybenchmarks/test_service_bulk_insert.pyboost_library_tracker/services.pyboost_mailing_list_tracker/services.pyboost_usage_tracker/services.pyconftest.pycppa_pinecone_sync/services.pycppa_slack_tracker/services.pycppa_user_tracker/services.pycppa_youtube_script_tracker/services.pydocs/How_to_add_a_collector.mddocs/Onboarding.mddocs/README.mddocs/Service_API.mddocs/boost_library_docs_tracker.mddocs/cross-app-dependencies.mddocs/service_api/README.mddocs/service_api/boost_usage_tracker.mddocs/service_api/clang_github_tracker.mddocs/service_api/cppa_pinecone_sync.mddocs/service_api/cppa_user_tracker.mddocs/service_api/discord_activity_tracker.mddocs/service_api/github_activity_tracker.mdgithub_activity_tracker/services.pypyproject.tomlpytest.inirequirements-dev.in
…odify CI scripts for benchmark integration
Summary
Test plan
pytest/ project benchmark command as documented).Closes #213
Summary by CodeRabbit
New Features
Documentation
Chores