Skip to content

F-022: perf(services): RegexSet first-pass for secret scanning#25

Closed
Sephyi wants to merge 1 commit intodevelopmentfrom
audit/f-022-regexset-secret-scanning
Closed

F-022: perf(services): RegexSet first-pass for secret scanning#25
Sephyi wants to merge 1 commit intodevelopmentfrom
audit/f-022-regexset-secret-scanning

Conversation

@Sephyi
Copy link
Copy Markdown
Owner

@Sephyi Sephyi commented Apr 22, 2026

Summary

perf(services): RegexSet first-pass for secret scanning.

Audit context

Closes audit entry F-022 from #3.

Verification

  • cargo fmt --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --all-targets

Note: one pre-existing test porcelain_exits_within_timeout_with_no_staged_changes is a known macOS cold-start flake that reproduces on unmodified development — unrelated to this change.

Secret scanning previously ran each of the 24 built-in regexes against
every added diff line, making the cost O(N x 24 x L) per scan. The
common case (no secret on the line) paid the full per-pattern overhead.

Bundle the patterns into a new `PatternSet` that pairs the
`Vec<SecretPattern>` with a `regex::RegexSet` built from the same
sources. Per-line scanning now asks the aggregated automaton first; only
lines that actually hit a pattern resolve an index back to the owning
`SecretPattern`. `RegexSet::matches().iter()` yields indices in
ascending order, so taking `.next()` preserves the prior
"first pattern wins" semantics — output shape of `SecretMatch` is
unchanged.

The `RegexSet` is compiled once per `build_patterns` call (and cached
via `LazyLock` for the default pattern set), so build cost is amortised.
All 38 existing safety tests plus the three proptests continue to pass.

Closes audit entry F-022 from #3.
Copilot AI review requested due to automatic review settings April 22, 2026 19:51
@Sephyi Sephyi added the audit Codebase audit cleanup (issue #3) label Apr 22, 2026
@Sephyi Sephyi self-assigned this Apr 22, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a RegexSet-backed first-pass filter for secret scanning to reduce per-line scanning overhead and address audit finding F-022.

Changes:

  • Add PatternSet to bundle secret patterns with a compiled RegexSet for fast first-pass matching.
  • Update build_patterns/DEFAULT_PATTERNS to produce and cache a PatternSet instead of a raw Vec<SecretPattern>.
  • Refactor both staged-diff and full-diff scanners to use a shared first_match helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/services/safety.rs
Comment on lines +62 to +74
/// Since every individual [`Regex`] in `patterns` is already known to
/// compile, the combined [`RegexSet`] compile is expected to succeed; if
/// it somehow does not (size limits, etc.), we fall back to an empty
/// [`RegexSet`] and the scanner still works correctly — it just loses
/// the first-pass speedup and runs the per-pattern check on every line.
fn from_patterns(patterns: Vec<SecretPattern>) -> Self {
// `Regex::as_str` returns the exact source the regex was built from,
// so feeding the same strings into `RegexSet` yields identical
// semantics to the per-pattern `Regex::is_match` fallback.
let sources: Vec<&str> = patterns.iter().map(|p| p.regex.as_str()).collect();
let set = RegexSet::new(&sources).unwrap_or_else(|_| RegexSet::empty());
Self { patterns, set }
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PatternSet::from_patterns falls back to RegexSet::empty() when RegexSet::new fails, but the rest of the scanner now relies exclusively on RegexSet via first_match. In that failure mode (e.g., size limits with custom patterns), scanning will return false negatives for all secrets. Preserve correctness by recording the compile failure (e.g., Option<RegexSet> / Result) and falling back to per-pattern Regex::is_match when the set is unavailable (or when patterns is non-empty but the set has length 0 due to a compile fallback).

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
Comment on lines +26 to +33
/// A bundle of secret-detection patterns together with a [`RegexSet`] over
/// the same patterns for a fast first-pass filter.
///
/// Per-line scanning calls [`RegexSet::matches`] first; only on hits does it
/// fall back to individual [`Regex::is_match`] on the matching pattern. This
/// turns the common no-match case from `O(N × 24 × L)` into
/// `O(N × L)` (one aggregated automaton pass) plus an `O(1)` membership check.
pub struct PatternSet {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc for PatternSet/from_patterns describes a per-pattern Regex::is_match fallback after RegexSet::matches, but first_match currently returns the first RegexSet hit without any per-pattern verification, and there is no fallback path when the set can’t be built. Please align the documentation with the actual algorithm, or implement the described fallback behavior (especially for the RegexSet compile-failure case).

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
Comment on lines +255 to +261
/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
/// common case (no match) returns without touching any individual
/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
/// taking `.next()` preserves the previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
let idx = patterns.set.matches(line).iter().next()?;
Some(&patterns.patterns[idx])
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first_match uses only patterns.set.matches(line) to decide a match. If the RegexSet was not successfully compiled (currently represented as RegexSet::empty()), this will always return None even when individual patterns would match. Consider making first_match explicitly fall back to scanning patterns.patterns with Regex::is_match when the set is unavailable/empty while patterns is non-empty.

Suggested change
/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
/// common case (no match) returns without touching any individual
/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
/// taking `.next()` preserves the previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
let idx = patterns.set.matches(line).iter().next()?;
Some(&patterns.patterns[idx])
/// [`RegexSet::matches`] as a single-automaton first-pass filter, then
/// falls back to checking individual [`Regex`] values if the set yields
/// no match. This preserves correctness if the set is unavailable while
/// keeping the fast path for the common case. `SetMatches::iter()`
/// yields indices in ascending order, so taking `.next()` preserves the
/// previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
if let Some(idx) = patterns.set.matches(line).iter().next() {
return Some(&patterns.patterns[idx]);
}
patterns
.patterns
.iter()
.find(|pattern| pattern.regex.is_match(line))

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
});
break; // One match per line is enough
}
if let Some(pat) = first_match(line, patterns) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In scan_for_secrets_with_patterns, first_match is called with the full diff line (including the leading '+'), while scan_full_diff_with_patterns strips the '+' before matching. This inconsistency can cause anchored custom patterns (e.g. ^TOKEN_...) to never match in the per-file API. Consider matching against &line[1..] here as well to align semantics across both scanners.

Suggested change
if let Some(pat) = first_match(line, patterns) {
let content = &line[1..];
if let Some(pat) = first_match(content, patterns) {

Copilot uses AI. Check for mistakes.
@Sephyi
Copy link
Copy Markdown
Owner Author

Sephyi commented Apr 30, 2026

Closing as the conflict with #16 (LazyLock cache for builtin patterns, merged) is semantic, not textual: this PR changes build_patterns()'s return type from Vec<SecretPattern> to PatternSet and affects every call site, while #16 also reworked the same flow.

A clean re-implementation should:

  1. Build on top of the merged cache (BUILTIN_PATTERNS: LazyLock<Vec<SecretPattern>> in src/services/safety.rs).
  2. Introduce PatternSet as a wrapper that holds both the cached pattern slice and a derived RegexSet for first-pass matching.
  3. Update scan_diff_with_patterns and scan_full_diff_with_patterns to use the RegexSet first-pass on each line and only fall through to the per-pattern Regex::is_match on a hit.

Reopening as a fresh PR after rebase is the cleanest path.

@Sephyi Sephyi closed this Apr 30, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 30, 2026
Sephyi added a commit that referenced this pull request Apr 30, 2026
Replaces per-pattern `Regex::is_match` loops in the scanners with a
single `RegexSet` traversal that returns the index of the lowest-
indexed pattern that matched. The pattern slice is then consulted only
to look up the descriptive name on a hit.

Layout:
- New `PatternSet` struct holds an owned `Vec<SecretPattern>` plus the
  derived `RegexSet`. `PatternSet::first_match` does the combined NFA
  traversal.
- `BUILTIN_PATTERN_SET: LazyLock<PatternSet>` caches the set for the
  built-in patterns (built once on first access on top of the existing
  `BUILTIN_PATTERNS` cache from F-012).
- `scan_for_secrets` and `scan_full_diff_for_secrets` (no-args) use the
  cached set directly.
- `scan_for_secrets_with_patterns` / `scan_full_diff_with_patterns`
  preserve their `&[SecretPattern]` API by building a one-shot
  `PatternSet` and delegating to private `*_with_pattern_set` helpers,
  so external callers see no API change.
- Behaviour preserved: at most one match per added line; pattern
  precedence is the lowest-index in the configured set, matching the
  previous "first hit, break" semantics.

Closes #25.
@Sephyi Sephyi deleted the audit/f-022-regexset-secret-scanning branch April 30, 2026 17:09
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

audit Codebase audit cleanup (issue #3)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants