Dig deeper#5
Conversation
mtodor
left a comment
There was a problem hiding this comment.
To be honest, I don't know how to review this. It's simply a lot to read and it's difficult to put some justification for changes because I'm not able to assess how this will work. Or won't.
How about doing actually some evaluations? Like run five sessions in ambient code and use this workflow to triage 3 different issues in each of these sessions.
You will have 15 results to compare and score. Are you getting consistent responses for same issues, or they are totally different, and so on. With that we can say we are confident that results are useful, beneficial, give value, and so on.
| ```python | ||
| try: | ||
| # Multi-agent approach (above) | ||
| except: | ||
| # Fall back to single sequential analysis | ||
| if exists("/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md"): | ||
| runSingleAgentAnalysis() # Old 5-minute sequential approach | ||
| else: | ||
| descriptionOnlyAnalysis() # Minimal analysis from JIRA description only | ||
| ``` |
There was a problem hiding this comment.
This Python code does not make any sense. I think it should be just formulated in plain English.
|
|
||
| 1. **Create RCA Team** | ||
| ``` | ||
| TeamCreate({ |
There was a problem hiding this comment.
Did you check if ambient code supports this?
Add parallel agent-based RCA system that spawns 3 specialized agents: - Code Archaeologist: Git blame, PR analysis, test change detection - Infrastructure Detective: Flake vs bug classification, workarounds - Cross-Issue Correlator: Historical issue search, frequency analysis Changes: - Add 3 agent prompt files in .claude/agents/ - Update triage.md Phase 4a Stage 2 to use multi-agent orchestration - Extend FIELD_REFERENCE.md with 6 new deep_analysis fields - Add RCA constants to constants.md - Create rca-aggregation-rules.md for aggregation logic - Update jira-comment.md template with new RCA sections Performance: 55% time savings (2.25 min vs 5 min per CI issue) Features: Finds problematic commits/PRs, classifies flakes, suggests workarounds Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add documentation for the new parallel agent-based root cause analysis: - Update overview to highlight 55% time savings - Add detailed description of 3 agent roles - Document aggregation logic and authority hierarchy - Update CI failure analysis section with 2-stage process Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove all timing and performance claims across multi-agent RCA implementation: - Remove "55% faster than sequential" and "55% time savings" from README.md - Remove agent time budgets (90-120s, 60-90s) from agent prompts - Remove "2-3 min vs 4-5 min" claims from triage.md - Remove specific timeout values from constants.md descriptions - Remove performance claim from rca-aggregation-rules.md - Simplify aggregation logic in triage.md to reference rca-aggregation-rules.md - Add clarifying note to infra-detective.md about infrastructure patterns Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Change RCA_AGENT_TIMEOUT_SECONDS from 120s (2 min) to 600s (10 min) to allow agents sufficient time for thorough investigation: - Code Archaeologist: git blame, PR analysis, test change detection - Infrastructure Detective: pattern matching, flake classification - Cross-Issue Correlator: JIRA searches, frequency analysis Also increase aggregation timeout from 30s to 60s. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1. Fix main→master clone bug (Phase 1a)
- Explicitly use --branch master for stackrox repo clone
- The repo's default is master, not main
2. Add COMPONENT_BUG classification + dual-nature detection (Phase 3)
- New classification for issues with component labels but no CI/vuln markers
- Detect dual-nature issues (CI failure with GO-/CVE- in summary)
- Run vulnerability analysis for dual-nature CI failures
3. Add consecutive-build regression detection (Phase 4a Stage 1)
- Detect ≥3 consecutive failing builds as code regression
- Boost confidence by +5% for confirmed regressions
- Set failure_category to "code-bug"
4. Add 6th strategy: Comment Signal Match (Phase 5)
- Parse engineer comments for team redirects
- 75% confidence for explicit redirects ("moved to X team")
- 65% for implicit mentions (CC @stackrox/X)
- Track alternative teams when signals conflict
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address PR feedback: Replace confusing Python-like pseudocode with clear plain English description of the fallback logic.
Replace all Python code blocks with clear plain English descriptions: - triage.md: Rewrite aggregation process as bulleted steps - rca-aggregation-rules.md: Convert all algorithm functions to decision logic Affected sections: - Root cause determination - Failure category classification - Confidence scoring algorithm - Risk assessment - Conflict resolution scenarios - Sanitization rules
Change root cause determination from strict authority hierarchy to synthesizing all three agent reports into one plausible narrative. Key changes: - Combine all agent findings instead of using Infrastructure Detective as final authority - Add minority_report field to highlight dissenting perspectives (≥50% confidence) - Weight evidence by confidence but don't exclude lower-confidence insights - Update examples to show both consensus and conflicting scenarios - Rename 'Agent Authority Hierarchy' to 'Agent Contribution Weighting' This allows presenting the most plausible explanation while preserving alternative interpretations that may be valuable for investigation.
Bug 1 fix: Emphasize that all 3 RCA agents must be spawned - Add bold REQUIRED callout before agent list in Phase 4a Stage 2 - Add explicit checklist: code-archaeologist ✓, infra-detective ✓, issue-correlator ✓ - Clarify fallback only applies if multi-agent approach unavailable, not as shortcut Bug 2 fix: Prominently warn against assigning @janisz to CI_FAILURE - Add⚠️ warning block at top of Phase 5 (before any strategies) - Add pre-flight check: discard @janisz matches for CI_FAILURE issues - Keep existing inline exception in Strategy 1 for redundancy No logic changes - only restructuring existing rules for visibility.
Extract build metadata from junit2jira Build Information table: - Parse BUILD TAG for commit hash and version info - Extract/construct Prow log URLs - Detect sparse errors requiring external log fetch - Pass commit hash to Code Archaeologist for precise git attribution Add conditional Prow log fetching for sparse errors (40% of CI failures). Document BUILD TAG formats and link construction patterns. Update FIELD_REFERENCE.md with build_info sub-object fields. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add parallel agent-based RCA system that spawns 3 specialized agents:
Changes:
Performance: 55% time savings (2.25 min vs 5 min per CI issue)
Features: Finds problematic commits/PRs, classifies flakes, suggests workarounds
Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com