Conversation
* Add agent testing harness for skill verification

  Implements the skill testing harness (#2872-#2882, #2898) that validates AI coding agents can discover, load, and follow Golem skill files to produce correct build artifacts.

  - CLI entrypoint with arg parsing and scenario filtering
  - YAML scenario loader with Zod schema validation
  - AgentDriver interface with Claude Code and Gemini (stub) drivers
  - SkillWatcher with inotifywait (Linux), fswatch (macOS), atime and presence-based fallback detection
  - Skill activation assertion engine (default, strict, allowedExtras)
  - Build verification with golem.yaml directory discovery
  - JSON report generation per scenario run
  - Bootstrap scenario: golem-new-project-ts
  - golem-new-project skill (SKILL.md)
  - CI workflow for unit + integration tests on Ubuntu
  - 21 unit tests (watcher, executor assertions, loader validation)

* Fix CI test glob pattern for Linux compatibility

  Change `**/*.test.js` to `*.test.js` since /bin/sh (dash) on Linux does not support globstar. All test files are in a flat directory so `**` is unnecessary.

* Fix CI: golem binary, health check, and exit code on failure

  - Download golem v1.4.2 binary from golemcloud/golem releases
  - Use golem server run with /healthcheck readiness loop
  - Fix executor health check to use /healthcheck endpoint
  - Exit with code 1 when any scenario fails
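The first commit lists three skill activation assertion modes (default, strict, allowedExtras) without spelling out their semantics. The sketch below is one plausible reading, not the harness's actual code: it assumes default only requires the expected skills to be activated, strict rejects any extra activation, and allowedExtras tolerates extras from an explicit list. The names `ActivationMode`, `ActivationResult`, and `checkActivation` are hypothetical.

```typescript
// Hypothetical sketch; the mode semantics are assumed, not taken from the PR.
type ActivationMode =
  | { kind: "default" }
  | { kind: "strict" }
  | { kind: "allowedExtras"; extras: string[] };

interface ActivationResult {
  ok: boolean;
  missing: string[];     // expected skills that were never activated
  unexpected: string[];  // activated skills the mode does not tolerate
}

function checkActivation(
  expected: string[],
  activated: string[],
  mode: ActivationMode = { kind: "default" },
): ActivationResult {
  const missing = expected.filter((s) => activated.indexOf(s) === -1);
  const extras = activated.filter((s) => expected.indexOf(s) === -1);

  // Assumed: in default mode extra activations are ignored.
  let unexpected: string[] = [];
  if (mode.kind === "strict") {
    // Assumed: strict mode rejects every activation outside the expected list.
    unexpected = extras;
  } else if (mode.kind === "allowedExtras") {
    // Assumed: extras are tolerated only when explicitly listed.
    const allowed = mode.extras;
    unexpected = extras.filter((s) => allowed.indexOf(s) === -1);
  }

  return {
    ok: missing.length === 0 && unexpected.length === 0,
    missing,
    unexpected,
  };
}
```

Under these assumptions, a scenario expecting `golem-new-project` passes in default mode even if the agent also reads another skill, but fails in strict mode unless that second skill is in the allowed list.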
✅ All contributors have signed the CLA.
…oke, shell/sleep/trigger, create/delete agent, opencode driver, aggregated report

- Add assertion engine with exit_code, stdout, body, status, and result_json checks (#2895)
- Add scenario-level settings, prerequisites, step timeout, and continue_session (#2887)
- Add deploy verification with implicit build (#2889)
- Add invoke check with expect assertions (#2890)
- Add shell, sleep, and trigger step actions (#2893)
- Add create_agent and delete_agent step actions (#2894)
- Add OpenCode driver stub with `opencode run` (#2897)
- Add aggregated summary report with summary.json output (#2912)
…cies scenario, update paths
- Add --approval-mode yolo to Gemini CLI to enable all tools (run_shell_command, activate_skill, etc.)
- Symlink skills to .gemini/skills/ so Gemini can discover them
- Watch all agent skill dirs (.claude, .gemini, .agents) for activation
- Fix macOS APFS relatime: reset atime before mtime so reads trigger updates
- Add fswatch -a (access) and -L (follow-links) flags for macOS
- Remove presence-check fallback, log full paths for detected skills
- Move skills dir default from golem/skills to skills/

- Add opencode (opencode-ai) to matrix agents
- Replace Gemini CLI placeholder with actual npm install
- Update path triggers from golem/skills/ to skills/
Each driver now declares a skillDirs array instead of duplicating the symlink loop. Removes ~60 lines of repeated code.
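As a rough illustration of the refactor described above, each driver can expose its skill directories as data and a single shared helper can derive the symlinks to create. This is a sketch under assumptions: `AgentDriverSketch`, `LinkPlan`, and `planSkillLinks` are hypothetical names, and the `.claude/skills` entry is assumed by analogy with the `.gemini/skills/` path mentioned in the commits. The helper only computes the plan; creating the links on disk is left out.

```typescript
// Hypothetical sketch of "drivers declare skillDirs"; not the harness's real types.
interface AgentDriverSketch {
  name: string;
  // Directories (relative to the workspace) where this agent looks for skills.
  skillDirs: string[];
}

interface LinkPlan {
  target: string;   // the shared skills directory
  linkPath: string; // where the symlink would be created
}

// One shared loop replaces the per-driver copies.
function planSkillLinks(
  driver: AgentDriverSketch,
  workspace: string,
  skillsSource: string,
): LinkPlan[] {
  return driver.skillDirs.map((dir) => ({
    target: skillsSource,
    linkPath: `${workspace}/${dir}`,
  }));
}

// Example declarations; the directory names are assumptions.
const claudeDriver: AgentDriverSketch = { name: "claude", skillDirs: [".claude/skills"] };
const geminiDriver: AgentDriverSketch = { name: "gemini", skillDirs: [".gemini/skills"] };
```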
Adds CodexAgentDriver using codex exec with session resume support. Makes --scenarios default to ./scenarios so --scenario can be used alone.
Adds version check fallback between golem and golem-cli binaries and clarifies that golem should not be built from scratch.
- Pre-create ~/.gemini/ dir to prevent ENOENT on projects.json
- Use GEMINI_API_KEY secret directly
- Add codex agent to CI matrix with OPENAI_API_KEY
OpenCode expects GOOGLE_GENERATIVE_AI_API_KEY, not GEMINI_API_KEY.
Codex CLI requires explicit login rather than reading OPENAI_API_KEY directly from the environment.
Copies harness and skills to /tmp/harness-run/ with a fresh git init so agents cannot crawl up into the golem repo.
vigoo (Contributor) requested changes on Mar 16, 2026 and left a comment:
Good start, added some comments.
When those are addressed, let's remove the ticket links from this PR that are not solved yet (drivers that are only stubs etc) so we can merge this and follow-up PRs can target main directly.
Restructure to group skill definitions and the testing harness under a single top-level golem-skills/ directory. Update CI workflow paths, .gitignore, and AGENTS.md accordingly.
Move issue tracking to PR description. Update OpenCode driver comment to clarify session continuity status.
Add .refine() to StepSpecSchema ensuring exactly one action field per step. Define StepSpec as a union type for better type narrowing. Add negative tests for zero and multiple actions per step.
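The "exactly one action field per step" rule can be pictured as a small predicate of the kind one would hand to Zod's `.refine()`. The sketch below is plain TypeScript with no Zod dependency so it stays self-contained. The field list is an assumption: `shell`, `sleep`, `trigger`, `invoke`, `create_agent`, and `delete_agent` come from the step actions named in this PR, while `prompt` is a guess, and the real schema may use different keys.

```typescript
// Assumed action keys; the actual StepSpecSchema may differ.
const ACTION_FIELDS = [
  "prompt",
  "shell",
  "sleep",
  "trigger",
  "invoke",
  "create_agent",
  "delete_agent",
] as const;

// Count how many action fields are present on a raw step object.
function countActions(step: Record<string, unknown>): number {
  return ACTION_FIELDS.filter((field) => step[field] !== undefined).length;
}

// Predicate suitable for a schema-level refinement:
// a step is valid only when it carries exactly one action.
function hasExactlyOneAction(step: Record<string, unknown>): boolean {
  return countActions(step) === 1;
}
```

Non-action keys such as a timeout do not affect the count, which mirrors the negative tests mentioned above: zero actions and multiple actions are both rejected.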
Extract createDriver() function, add SUPPORTED_AGENTS and SUPPORTED_LANGUAGES constants, wrap scenario loop in agent/language matrix. Update report filenames to include agent-language prefix. Show default timeout (300s) in help text.
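A minimal sketch of the agent/language matrix described above follows. The constant names match the commit message, but their contents are assumptions: the agent list is inferred from the drivers mentioned in this PR, and the only language confirmed by the scenario name is `ts` (`rust` is a guess). The report filename pattern `agent-language-scenario.json` is likewise an assumed reading of "agent-language prefix", and `buildMatrix` is a hypothetical helper.

```typescript
// Contents are assumptions based on the drivers and scenario named in the PR.
const SUPPORTED_AGENTS = ["claude", "gemini", "opencode", "codex"] as const;
const SUPPORTED_LANGUAGES = ["ts", "rust"] as const;

type Agent = (typeof SUPPORTED_AGENTS)[number];
type Language = (typeof SUPPORTED_LANGUAGES)[number];

interface RunPlan {
  agent: Agent;
  language: Language;
  scenario: string;
  reportFile: string;
}

// Expand agents x languages x scenarios into individual runs.
function buildMatrix(
  agents: readonly Agent[],
  languages: readonly Language[],
  scenarios: readonly string[],
): RunPlan[] {
  const plans: RunPlan[] = [];
  for (const agent of agents) {
    for (const language of languages) {
      for (const scenario of scenarios) {
        plans.push({
          agent,
          language,
          scenario,
          // Assumed naming scheme: agent-language prefix, then the scenario name.
          reportFile: `${agent}-${language}-${scenario}.json`,
        });
      }
    }
  }
  return plans;
}
```

Prefixing report names this way keeps results from different agent/language combinations from overwriting each other when they run the same scenario.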
Add file existence check via shell step and deploy verification to demonstrate more harness capabilities.
Share the default timeout value between executor and run.ts help text via an exported constant.
The scaffolded project has no components, so deploy produces an empty diff that the CLI misreads as a concurrent modification error.
vigoo (Contributor) approved these changes on Mar 18, 2026 and left a comment:
Let's merge the other PRs based on this (when approved), then merge this.
Contributor
Before merge, unlink some issues that are not solved yet, if any (otherwise they will get closed automatically)
…nd GitHub summary to harness (#2960)

- Template variable substitution ({{agent}}, {{language}}, {{workspace}}, {{scenario}}) in step fields
- Conditional step execution with only_if/skip_if on agent, language, and os
- --dry-run flag to validate scenarios and print step summaries without executing
- Graceful Ctrl+C handling with partial result writing via AbortController
- GitHub Actions job summary markdown output via GITHUB_STEP_SUMMARY
- Remove issue number references from comments

…ness (#2979)

Use golem server clean + restart for full cleanup between scenarios, with interactive confirmation when running on a TTY outside CI.
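The template substitution and conditional execution features from the #2960 commit can be sketched as two small pure functions. This is an illustration under assumptions, not the harness's implementation: the helper names are hypothetical, unknown placeholders are assumed to be left untouched, and `only_if`/`skip_if` are assumed to match when every field they specify equals the current run context.

```typescript
interface RunContext {
  agent: string;
  language: string;
  workspace: string;
  scenario: string;
  os: string;
}

// Replace {{agent}}, {{language}}, {{workspace}}, {{scenario}} in a step field.
// Assumed behavior: placeholders outside that set are left as written.
function substitute(text: string, ctx: RunContext): string {
  const vars: Record<string, string> = {
    agent: ctx.agent,
    language: ctx.language,
    workspace: ctx.workspace,
    scenario: ctx.scenario,
  };
  return text.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : match,
  );
}

interface Condition {
  agent?: string;
  language?: string;
  os?: string;
}

// A condition matches when every field it specifies equals the context value.
function matches(cond: Condition, ctx: RunContext): boolean {
  return (
    (cond.agent === undefined || cond.agent === ctx.agent) &&
    (cond.language === undefined || cond.language === ctx.language) &&
    (cond.os === undefined || cond.os === ctx.os)
  );
}

// Assumed precedence: only_if must match and skip_if must not.
function shouldRun(
  step: { only_if?: Condition; skip_if?: Condition },
  ctx: RunContext,
): boolean {
  if (step.only_if && !matches(step.only_if, ctx)) return false;
  if (step.skip_if && matches(step.skip_if, ctx)) return false;
  return true;
}
```

With this reading, a step carrying `only_if: { os: "linux" }` is skipped on macOS, and a shell command such as `ls {{workspace}}` is expanded before execution.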
Summary
Merges the skill testing harness infrastructure into main. This includes the full agent testing harness for validating AI coding agents can discover and follow Golem skill files, CI workflow, and bootstrap scenarios.
Issues
Closes #2872
Closes #2873
Closes #2874
Closes #2875
Closes #2876
Closes #2878
Closes #2879
Closes #2880
Closes #2881
Closes #2882
Closes #2898
Closes #2906
Closes #2907
Closes #2895
Closes #2897
Closes #2889
Closes #2911, #2916, #2915, #2913, #2912, #2903
Closes #2885, closes #2884, closes #2914, closes #2917, closes #2904