Dig deeper #5

Open
janisz wants to merge 10 commits into main from dig_deeper

Collaborator

@janisz janisz commented May 7, 2026

Add parallel agent-based RCA system that spawns 3 specialized agents:

  • Code Archaeologist: Git blame, PR analysis, test change detection
  • Infrastructure Detective: Flake vs bug classification, workarounds
  • Cross-Issue Correlator: Historical issue search, frequency analysis

Changes:

  • Add 3 agent prompt files in .claude/agents/
  • Update triage.md Phase 4a Stage 2 to use multi-agent orchestration
  • Extend FIELD_REFERENCE.md with 6 new deep_analysis fields
  • Add RCA constants to constants.md
  • Create rca-aggregation-rules.md for aggregation logic
  • Update jira-comment.md template with new RCA sections

Performance: 55% time savings (2.25 min vs 5 min per CI issue)
Features: Finds problematic commits/PRs, classifies flakes, suggests workarounds

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
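As a rough illustration of the orchestration described above, the sketch below runs three agents concurrently and collects their reports. It is a minimal Python sketch using `asyncio`, not the PR's implementation: `run_agent`, the report shape, the simulated delay, and the placeholder issue key are assumptions. Only the three agent names come from this PR.

```python
import asyncio

# Agent names as used in this PR; everything else here is hypothetical.
AGENTS = ["code-archaeologist", "infra-detective", "issue-correlator"]


async def run_agent(name: str, issue_key: str) -> dict:
    """Stand-in for spawning one RCA agent; a real agent would investigate the issue."""
    await asyncio.sleep(0.01)  # simulate work
    return {"agent": name, "issue": issue_key, "status": "done"}


async def run_rca(issue_key: str) -> list:
    """Run all agents concurrently; results come back in the order of AGENTS."""
    return await asyncio.gather(*(run_agent(name, issue_key) for name in AGENTS))


reports = asyncio.run(run_rca("ROX-0000"))  # placeholder issue key
```

With concurrent execution, wall-clock time tracks the slowest agent rather than the sum of all three, which is the intuition behind the time-savings claim.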

@janisz janisz requested a review from mtodor May 7, 2026 15:00
Collaborator

@mtodor mtodor left a comment

To be honest, I don't know how to review this. It's simply a lot to read, and it's difficult to justify the changes because I can't assess whether this will work or not.

How about actually running some evaluations? For example, run five sessions in ambient code and use this workflow to triage three different issues in each session.

You will then have 15 results to compare and score. Are you getting consistent responses for the same issues, or are they totally different? With that, we can say with confidence whether the results are useful and provide value.
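One way to score such an evaluation is to measure, per issue, how often the sessions agree on the outcome. The sketch below is a minimal example under assumptions: each result is reduced to a single label per session and issue, and the issue keys and labels are made-up sample data, not real triage output.

```python
from collections import Counter


def agreement_by_issue(results: dict) -> dict:
    """For each issue, return the share of sessions that produced the most common label."""
    scores = {}
    for issue, labels in results.items():
        most_common_count = Counter(labels).most_common(1)[0][1]
        scores[issue] = most_common_count / len(labels)
    return scores


# Hypothetical labels from 5 sessions x 3 issues (15 results in total).
sample = {
    "ISSUE-A": ["code-bug", "code-bug", "code-bug", "code-bug", "code-bug"],
    "ISSUE-B": ["flake", "flake", "code-bug", "flake", "flake"],
    "ISSUE-C": ["flake", "code-bug", "infra", "flake", "code-bug"],
}

scores = agreement_by_issue(sample)
# ISSUE-A -> 1.0, ISSUE-B -> 0.8, ISSUE-C -> 0.4
```

A low score such as the one for ISSUE-C would suggest the workflow is not yet stable enough for that kind of issue.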

Comment on lines +224 to +233
```python
try:
    # Multi-agent approach (above)
except:
    # Fall back to single sequential analysis
    if exists("/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md"):
        runSingleAgentAnalysis()  # Old 5-minute sequential approach
    else:
        descriptionOnlyAnalysis()  # Minimal analysis from JIRA description only
```
Collaborator

This Python code does not make any sense. I think it should just be formulated in plain English.


1. **Create RCA Team**
```
TeamCreate({
```

Collaborator

Did you check if ambient code supports this?

Collaborator Author

Yes, it's working:

Investigator methodology loaded. Phase 3 complete: CI_FAILURE (3 consecutive build failures, is_code_regression: true). Now launching Phase 4a deep analysis — spawning 3 parallel agents and trying to pull Prow artifacts.

janisz and others added 5 commits May 8, 2026 10:54
Add parallel agent-based RCA system that spawns 3 specialized agents:
- Code Archaeologist: Git blame, PR analysis, test change detection
- Infrastructure Detective: Flake vs bug classification, workarounds
- Cross-Issue Correlator: Historical issue search, frequency analysis

Changes:
- Add 3 agent prompt files in .claude/agents/
- Update triage.md Phase 4a Stage 2 to use multi-agent orchestration
- Extend FIELD_REFERENCE.md with 6 new deep_analysis fields
- Add RCA constants to constants.md
- Create rca-aggregation-rules.md for aggregation logic
- Update jira-comment.md template with new RCA sections

Performance: 55% time savings (2.25 min vs 5 min per CI issue)
Features: Finds problematic commits/PRs, classifies flakes, suggests workarounds

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add documentation for the new parallel agent-based root cause analysis:
- Update overview to highlight 55% time savings
- Add detailed description of 3 agent roles
- Document aggregation logic and authority hierarchy
- Update CI failure analysis section with 2-stage process

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Remove all timing and performance claims across multi-agent RCA implementation:
- Remove "55% faster than sequential" and "55% time savings" from README.md
- Remove agent time budgets (90-120s, 60-90s) from agent prompts
- Remove "2-3 min vs 4-5 min" claims from triage.md
- Remove specific timeout values from constants.md descriptions
- Remove performance claim from rca-aggregation-rules.md
- Simplify aggregation logic in triage.md to reference rca-aggregation-rules.md
- Add clarifying note to infra-detective.md about infrastructure patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Change RCA_AGENT_TIMEOUT_SECONDS from 120s (2 min) to 600s (10 min)
to allow agents sufficient time for thorough investigation:
- Code Archaeologist: git blame, PR analysis, test change detection
- Infrastructure Detective: pattern matching, flake classification
- Cross-Issue Correlator: JIRA searches, frequency analysis

Also increase aggregation timeout from 30s to 60s.
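The timeout behavior described in this commit could look roughly like the sketch below, which wraps a coroutine in `asyncio.wait_for` and reports a timeout instead of raising. This is an assumption about how the constants might be applied, not code from the PR: `RCA_AGGREGATION_TIMEOUT_SECONDS`, `run_with_timeout`, and `slow_agent` are hypothetical names, and the demo uses tiny timeouts so it finishes quickly.

```python
import asyncio

RCA_AGENT_TIMEOUT_SECONDS = 600  # value from the commit message
RCA_AGGREGATION_TIMEOUT_SECONDS = 60  # hypothetical name; value from the commit message


async def run_with_timeout(coro, timeout: float, label: str) -> dict:
    """Await coro; return a timed_out marker instead of raising when time runs out."""
    try:
        result = await asyncio.wait_for(coro, timeout=timeout)
        return {"label": label, "timed_out": False, "result": result}
    except asyncio.TimeoutError:
        return {"label": label, "timed_out": True, "result": None}


async def slow_agent(delay: float) -> str:
    """Stand-in for an agent that takes `delay` seconds to report."""
    await asyncio.sleep(delay)
    return "report"


async def demo():
    # In the workflow, the timeout would be RCA_AGENT_TIMEOUT_SECONDS.
    ok = await run_with_timeout(slow_agent(0.01), 1.0, "fast")
    late = await run_with_timeout(slow_agent(1.0), 0.05, "slow")
    return ok, late


ok, late = asyncio.run(demo())
```

Returning a marker rather than raising lets the aggregation step continue with whichever agents finished in time.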

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

1. Fix main→master clone bug (Phase 1a)
   - Explicitly use --branch master for stackrox repo clone
   - The repo's default is master, not main

2. Add COMPONENT_BUG classification + dual-nature detection (Phase 3)
   - New classification for issues with component labels but no CI/vuln markers
   - Detect dual-nature issues (CI failure with GO-/CVE- in summary)
   - Run vulnerability analysis for dual-nature CI failures

3. Add consecutive-build regression detection (Phase 4a Stage 1)
   - Detect ≥3 consecutive failing builds as code regression
   - Boost confidence by +5% for confirmed regressions
   - Set failure_category to "code-bug"

4. Add 6th strategy: Comment Signal Match (Phase 5)
   - Parse engineer comments for team redirects
   - 75% confidence for explicit redirects ("moved to X team")
   - 65% for implicit mentions (CC @stackrox/X)
   - Track alternative teams when signals conflict
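Two of the rules above (items 3 and 4) can be sketched in a few lines. This is an illustrative reading of the commit message, not the PR's logic: it assumes the regression check looks at the most recent builds only, that confidence is an integer percentage capped at 100, and that the two regular expressions are adequate stand-ins for the real comment parsing. All function names and the team names in the usage note are hypothetical.

```python
import re
from typing import Optional, Tuple


def is_code_regression(build_results: list, threshold: int = 3) -> bool:
    """True if the newest builds form a failing streak of at least `threshold`.

    build_results is ordered oldest to newest; True means the build passed.
    """
    streak = 0
    for passed in reversed(build_results):
        if passed:
            break
        streak += 1
    return streak >= threshold


def adjust_confidence(base_percent: int, regression: bool) -> int:
    """Add 5 percentage points for a confirmed regression, capped at 100."""
    return min(base_percent + 5, 100) if regression else base_percent


EXPLICIT_REDIRECT = re.compile(r"moved to (\S+) team", re.IGNORECASE)
IMPLICIT_MENTION = re.compile(r"@stackrox/([\w-]+)")


def comment_signal(comment: str) -> Optional[Tuple[str, int]]:
    """Return (team, confidence_percent) for a comment, or None if no signal is found."""
    explicit = EXPLICIT_REDIRECT.search(comment)
    if explicit:
        return explicit.group(1), 75
    implicit = IMPLICIT_MENTION.search(comment)
    if implicit:
        return implicit.group(1), 65
    return None
```

For example, `comment_signal("CC @stackrox/collector")` yields `("collector", 65)`, while a comment with no redirect or mention yields `None`.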

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

janisz and others added 5 commits May 8, 2026 12:45
Address PR feedback: Replace confusing Python-like pseudocode with clear
plain English description of the fallback logic.

Replace all Python code blocks with clear plain English descriptions:
- triage.md: Rewrite aggregation process as bulleted steps
- rca-aggregation-rules.md: Convert all algorithm functions to decision logic

Affected sections:
- Root cause determination
- Failure category classification
- Confidence scoring algorithm
- Risk assessment
- Conflict resolution scenarios
- Sanitization rules

Change root cause determination from strict authority hierarchy to
synthesizing all three agent reports into one plausible narrative.

Key changes:
- Combine all agent findings instead of using Infrastructure Detective
  as final authority
- Add minority_report field to highlight dissenting perspectives (≥50%
  confidence)
- Weight evidence by confidence but don't exclude lower-confidence
  insights
- Update examples to show both consensus and conflicting scenarios
- Rename 'Agent Authority Hierarchy' to 'Agent Contribution Weighting'
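The changes listed above can be approximated with a small aggregation function. The sketch below is a simplification under assumptions: it reduces "one plausible narrative" to choosing the root cause with the highest summed confidence, and the `Finding` type, the field names other than `minority_report`, and the sample root causes are hypothetical. The 50% threshold comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    root_cause: str
    confidence: float  # 0.0 to 1.0


MINORITY_THRESHOLD = 0.5  # dissenting views at or above 50% confidence are highlighted


def synthesize(findings: list) -> dict:
    """Pick the best-supported root cause while keeping every dissenting view."""
    # Sum confidence per root cause so agreeing agents reinforce each other.
    support = {}
    for f in findings:
        support[f.root_cause] = support.get(f.root_cause, 0.0) + f.confidence
    primary = max(support, key=support.get)
    dissent = [f for f in findings if f.root_cause != primary]
    return {
        "root_cause": primary,
        "supporting_agents": [f.agent for f in findings if f.root_cause == primary],
        # Strong dissent is highlighted; weaker dissent is kept, not dropped.
        "minority_report": [f.agent for f in dissent if f.confidence >= MINORITY_THRESHOLD],
        "other_insights": [f.agent for f in dissent if f.confidence < MINORITY_THRESHOLD],
    }


result = synthesize([
    Finding("code-archaeologist", "regression in recent commit", 0.8),
    Finding("infra-detective", "infrastructure flake", 0.6),
    Finding("issue-correlator", "regression in recent commit", 0.4),
])
```

Here the two agents pointing at a regression outweigh the single infrastructure verdict, which still surfaces in `minority_report` because its confidence is above the threshold.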

This allows presenting the most plausible explanation while preserving
alternative interpretations that may be valuable for investigation.

Bug 1 fix: Emphasize that all 3 RCA agents must be spawned
- Add bold REQUIRED callout before agent list in Phase 4a Stage 2
- Add explicit checklist: code-archaeologist ✓, infra-detective ✓, issue-correlator ✓
- Clarify fallback only applies if multi-agent approach unavailable, not as shortcut

Bug 2 fix: Prominently warn against assigning @janisz to CI_FAILURE
- Add ⚠️ warning block at top of Phase 5 (before any strategies)
- Add pre-flight check: discard @janisz matches for CI_FAILURE issues
- Keep existing inline exception in Strategy 1 for redundancy
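Both fixes amount to simple guard checks. The sketch below shows one hypothetical way to express them in code, although the PR itself states these rules in prose inside the prompt files. The function names and the candidate-list shape are assumptions. The agent names, the `CI_FAILURE` classification, and the `@janisz` exclusion come from the commit message.

```python
REQUIRED_AGENTS = {"code-archaeologist", "infra-detective", "issue-correlator"}
EXCLUDED_FOR_CI_FAILURE = {"@janisz"}


def missing_agents(spawned: set) -> set:
    """Return the required RCA agents that were not spawned (empty set means OK)."""
    return REQUIRED_AGENTS - spawned


def preflight_filter(classification: str, candidates: list) -> list:
    """Drop excluded assignees for CI_FAILURE issues; leave other issues untouched."""
    if classification != "CI_FAILURE":
        return list(candidates)
    return [c for c in candidates if c not in EXCLUDED_FOR_CI_FAILURE]
```

Running the filter before any assignment strategy mirrors the pre-flight placement described in the Bug 2 fix.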

No logic changes - only restructuring existing rules for visibility.

Extract build metadata from junit2jira Build Information table:
- Parse BUILD TAG for commit hash and version info
- Extract/construct Prow log URLs
- Detect sparse errors requiring external log fetch
- Pass commit hash to Code Archaeologist for precise git attribution

Add conditional Prow log fetching for sparse errors (40% of CI failures).
Document BUILD TAG formats and link construction patterns.
Update FIELD_REFERENCE.md with build_info sub-object fields.
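To make the BUILD TAG parsing step concrete, here is a minimal sketch. The actual tag formats are documented in the PR's files and are not shown in this conversation, so the `git describe`-style shape (`<version>-<commits>-g<hash>`) is an assumption, as are the function name, the returned keys, and the sample tag.

```python
import re
from typing import Optional

# Assumed shape, modeled on `git describe` output: <version>-<commits>-g<hash>
BUILD_TAG_PATTERN = re.compile(
    r"(?P<version>.+?)-(?P<commits>\d+)-g(?P<commit>[0-9a-f]{7,40})"
)


def parse_build_tag(tag: str) -> Optional[dict]:
    """Split a build tag into version, commit count, and abbreviated commit hash.

    Returns None when the tag does not match the assumed shape.
    """
    match = BUILD_TAG_PATTERN.fullmatch(tag.strip())
    if match is None:
        return None
    return {
        "version": match.group("version"),
        "commits_since_tag": int(match.group("commits")),
        "commit": match.group("commit"),
    }


info = parse_build_tag("4.6.x-123-g0a1b2c3d4e")  # made-up sample tag
```

The extracted `commit` value is the piece that would be handed to the Code Archaeologist for git attribution.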

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@janisz janisz requested a review from mtodor May 8, 2026 13:57