Safety Model

The DevOps Execution Engine is designed with safety as the primary concern. This document explains the safety guarantees and controls.


Core Principle

No action rated MEDIUM risk or higher is executed without explicit human approval.


Risk Classification

Every action is classified into one of four risk levels:

🟢 LOW Risk

Read-only operations with no impact on running systems

Examples:

  • kubectl get (any resource)
  • kubectl describe
  • kubectl logs
  • kubectl top
  • aws ec2 describe-instances
  • docker ps
  • Log analysis
  • System health checks

Safety:

  • May be auto-approved (if configured)
  • No approval required by default
  • Make no changes to running systems

🟡 MEDIUM Risk

Modifications that are reversible and have controlled impact

Examples:

  • kubectl scale deployment
  • kubectl edit configmap
  • kubectl patch (non-critical resources)
  • kubectl rollout restart
  • Scaling operations
  • Configuration changes
  • Non-production deployments

Safety:

  • Always requires approval
  • Plan → Approve → Execute workflow
  • Rollback plan must be included
  • Pre-flight validation required
  • Post-execution validation required

🔴 HIGH Risk

Production changes with potential for service disruption

Examples:

  • Production deployments
  • kubectl delete (any resource not listed under CRITICAL)
  • Database schema changes
  • Certificate updates
  • Network policy changes
  • Security configuration
  • Critical service restarts

Safety:

  • Always requires approval
  • Impact analysis required
  • Detailed rollback plan
  • Multiple validation steps
  • Monitored execution
  • Logged to audit trail

⛔ CRITICAL Risk

Operations that can cause data loss or security issues

Examples:

  • kubectl delete namespace
  • kubectl delete pv (persistent volumes)
  • RBAC/security policy changes
  • Production database deletion
  • Secret/credential rotation
  • Multi-cluster operations

Safety:

  • Blocked by default
  • Requires explicit override
  • Two-person approval (recommended)
  • Comprehensive rollback plan
  • Dry-run required first
  • Full audit trail
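
The four levels above could be expressed as an ordered rule table. The following is a minimal sketch only: `classifyRisk`, `RISK_RULES`, and the patterns are hypothetical names chosen for illustration, not the engine's actual classifier. It checks the most severe rules first and, as an assumed conservative default, rates unrecognized commands as HIGH.

```javascript
// Hypothetical rule table, ordered from most to least severe.
// The patterns mirror a few examples from the lists above; they are not exhaustive.
const RISK_RULES = [
  { risk: 'CRITICAL', pattern: /^kubectl delete (namespace|ns|pv)\b/ },
  { risk: 'HIGH', pattern: /^kubectl delete\b/ },
  { risk: 'MEDIUM', pattern: /^kubectl (scale|edit|patch|rollout restart)\b/ },
  { risk: 'LOW', pattern: /^kubectl (get|describe|logs|top)\b/ },
  { risk: 'LOW', pattern: /^(aws ec2 describe-instances|docker ps)\b/ },
];

// Returns the risk level of the first matching rule.
// Unknown commands fall back to HIGH so they always go through review (assumed default).
function classifyRisk(command) {
  const normalized = command.trim().replace(/\s+/g, ' ');
  const rule = RISK_RULES.find((r) => r.pattern.test(normalized));
  return rule ? rule.risk : 'HIGH';
}

console.log(classifyRisk('kubectl get pods -n production'));
console.log(classifyRisk('kubectl delete namespace staging'));
```

Ordering the table by severity means a command such as `kubectl delete pv` is caught by the CRITICAL rule before the broader HIGH rule for `kubectl delete` can match.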

Approval Workflow

Step 1: Diagnosis

You: The API is slow

Clawd: [performs diagnosis]
- Checks pod status
- Analyzes logs
- Reviews metrics
- Identifies root cause

No approval needed - This is read-only.

Step 2: Plan Generation

Clawd: I found the issue. Generating execution plan...

📋 EXECUTION PLAN: plan-20260126-001
Title: Fix slow API response (increase replicas)
Risk: MEDIUM
Time: ~3min

[detailed plan shown]

Approve? (yes/no/explain)

Human review required - Plan is presented but not executed.

Step 3: Human Review

You can:
- Review the plan in detail
- Ask questions ("explain step 2")
- Request modifications ("use 5 replicas instead of 3")
- Approve ("yes" or "approve")
- Reject ("no" or "cancel")
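
The review options above can be read as a mapping from a reply to one of four decisions. The sketch below is a hypothetical illustration (`parseReply` is not a function from the engine) and assumes a fail-closed design: only an exact approval word leads to execution, and anything unrecognized is treated as a modification request.

```javascript
// Hypothetical mapping from a reviewer's reply to a workflow decision.
// Only an exact approval word can lead to execution.
function parseReply(reply) {
  const text = reply.trim().toLowerCase();
  if (text === 'yes' || text === 'approve') return { decision: 'approve' };
  if (text === 'no' || text === 'cancel') return { decision: 'reject' };
  if (text.startsWith('explain')) {
    return { decision: 'explain', detail: text.slice('explain'.length).trim() };
  }
  // Anything else (for example "use 5 replicas instead of 3") is treated as a
  // modification request: the plan would be revised and shown again, not executed.
  return { decision: 'modify', detail: reply.trim() };
}

console.log(parseReply('approve'));
console.log(parseReply('explain step 2'));
console.log(parseReply('use 5 replicas instead of 3'));
```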

Step 4: Execution (Only After Approval)

You: approve

Clawd: ✅ Executing plan-20260126-001...

Step 1/3: Scaling deployment... ✓
Step 2/3: Waiting for rollout... ✓ (45s)
Step 3/3: Verifying pods... ✓

✅ Complete! All pods running.
Logged to: memory/actions-log.jsonl

Step 5: Verification

Clawd: Post-execution validation:
✓ All 5 replicas running
✓ No error logs
✓ API response time improved

Would you like me to monitor for the next 5 minutes?

Safety Mechanisms

1. Pre-Flight Validation

Before execution, verify:

  • Resources exist
  • Current state matches assumptions
  • Dependencies are met
  • No conflicting operations
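
As a rough illustration of the checks listed above, the sketch below runs a set of pre-flight checks and reports every failure. `runPreflight` and the `{ name, run }` check shape are assumptions made for this example; the cluster state is stubbed rather than queried.

```javascript
// Runs every pre-flight check and collects all failures instead of stopping
// at the first one, so the reviewer sees the full picture.
// Each check is a hypothetical { name, run } pair; run() returns true on success.
function runPreflight(checks) {
  const failures = [];
  for (const check of checks) {
    let ok = false;
    try {
      ok = check.run() === true;
    } catch (err) {
      ok = false; // a check that throws counts as a failure
    }
    if (!ok) failures.push(check.name);
  }
  return { passed: failures.length === 0, failures };
}

// Stubbed state for illustration; a real check would query the cluster.
const state = { deploymentExists: true, currentReplicas: 3, lockHeld: true };
const result = runPreflight([
  { name: 'resource exists', run: () => state.deploymentExists },
  { name: 'state matches assumptions', run: () => state.currentReplicas === 3 },
  { name: 'no conflicting operations', run: () => !state.lockHeld },
]);
console.log(result);
```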

2. Rollback Plans

Every plan includes:

  • How to undo the change
  • Estimated rollback time
  • Verification steps

Example:

rollback:
  method: "Rollback deployment to previous revision"
  commands:
    - "kubectl rollout undo deployment/api-service"
  estimated_time: "2min"
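
One way to enforce that a plan carries a usable rollback section is to validate it before the plan is presented. The sketch below assumes the plan has been parsed into an object whose field names follow the YAML example above; `validateRollback` is a hypothetical helper, not part of the engine.

```javascript
// Returns a list of problems; an empty list means the rollback section is usable.
// Field names (method, commands, estimated_time) follow the YAML example above.
function validateRollback(plan) {
  const rollback = plan && plan.rollback;
  if (!rollback) return ['plan has no rollback section'];
  const problems = [];
  if (typeof rollback.method !== 'string' || rollback.method.trim() === '') {
    problems.push('rollback.method is missing');
  }
  if (!Array.isArray(rollback.commands) || rollback.commands.length === 0) {
    problems.push('rollback.commands is empty');
  }
  if (!rollback.estimated_time) {
    problems.push('rollback.estimated_time is missing');
  }
  return problems;
}

const examplePlan = {
  rollback: {
    method: 'Rollback deployment to previous revision',
    commands: ['kubectl rollout undo deployment/api-service'],
    estimated_time: '2min',
  },
};
console.log(validateRollback(examplePlan));
```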

3. Step-by-Step Execution

  • One command at a time
  • Verify outcome before next step
  • Stop on first error
  • Log each action
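
The four rules above amount to a loop that halts at the first failure. The following sketch is an assumption-laden illustration: `executePlan` is a hypothetical name, and the `run` and `log` functions are injected so that the example can use a stub runner instead of touching a real cluster.

```javascript
// Executes steps in order. `run` is an injected (hypothetical) function that
// executes one command, verifies its outcome, and returns { ok, output };
// `log` receives one entry per executed step.
function executePlan(steps, run, log) {
  for (let i = 0; i < steps.length; i++) {
    const result = run(steps[i].command);
    log({ step: i + 1, command: steps[i].command, status: result.ok ? 'success' : 'failed' });
    if (!result.ok) {
      // Stop on first error: later steps are never started.
      return { completed: i, failedStep: i + 1, error: result.output };
    }
  }
  return { completed: steps.length, failedStep: null, error: null };
}

// Stub runner for illustration: pretends the rollout wait times out.
const entries = [];
const outcome = executePlan(
  [
    { command: 'kubectl scale deployment/api-service --replicas=5' },
    { command: 'kubectl rollout status deployment/api-service' },
    { command: 'kubectl get pods' },
  ],
  (command) => (command.includes('rollout status')
    ? { ok: false, output: 'context deadline exceeded' }
    : { ok: true, output: '' }),
  (entry) => entries.push(entry)
);
console.log(outcome, entries.length);
```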

4. Post-Execution Validation

After execution, verify:

  • Desired state achieved
  • No errors introduced
  • Services healthy
  • Metrics normal
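
Post-execution validation can be sketched as a comparison between expected and observed values. The helper name `validateOutcome` and the check keys below are hypothetical; how the observed values are collected is left out.

```javascript
// Compares observed values with expected ones and reports every mismatch.
function validateOutcome(expected, observed) {
  const mismatches = [];
  for (const [check, want] of Object.entries(expected)) {
    const got = observed[check];
    if (got !== want) mismatches.push({ check, expected: want, got });
  }
  return { passed: mismatches.length === 0, mismatches };
}

// Illustrative values: 5 pods expected, only 3 running.
const validation = validateOutcome(
  { runningPods: 5, errorLogLines: 0 },
  { runningPods: 3, errorLogLines: 0 }
);
console.log(validation);
```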

5. Audit Logging

Every action is logged to actions-log.jsonl:

{
  "timestamp": "2026-01-26T13:00:00Z",
  "plan_id": "plan-20260126-001",
  "action": "kubectl scale deployment",
  "resource": "production/api-service",
  "risk": "MEDIUM",
  "status": "success",
  "approver": "human",
  "duration_seconds": 45
}
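
In a JSONL file each entry occupies exactly one line, so the record above would be stored in compact form. The sketch below shows one way to serialize an entry; `toAuditLine` and the required-field list are assumptions based on the example record, not the engine's logger.

```javascript
// Field names taken from the example record above.
const REQUIRED_FIELDS = [
  'timestamp', 'plan_id', 'action', 'resource',
  'risk', 'status', 'approver', 'duration_seconds',
];

// Serializes one audit entry as a single JSONL line (compact JSON plus newline).
// Throws if a required field is missing so incomplete entries are never written.
function toAuditLine(entry) {
  const missing = REQUIRED_FIELDS.filter((field) => entry[field] === undefined);
  if (missing.length > 0) {
    throw new Error('audit entry is missing: ' + missing.join(', '));
  }
  return JSON.stringify(entry) + '\n';
}

const line = toAuditLine({
  timestamp: '2026-01-26T13:00:00Z',
  plan_id: 'plan-20260126-001',
  action: 'kubectl scale deployment',
  resource: 'production/api-service',
  risk: 'MEDIUM',
  status: 'success',
  approver: 'human',
  duration_seconds: 45,
});
process.stdout.write(line);
```

A line produced this way could be appended with Node's `fs.appendFileSync(path, line)`, which adds to the end of the file without rewriting earlier entries.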

What Can Go Wrong

Scenario 1: Step Fails During Execution

What happens:

  • Execution stops immediately
  • Error logged
  • Rollback plan presented

❌ Step 2 failed: Rollout timeout after 5 minutes

Error: context deadline exceeded

Rollback available:
kubectl rollout undo deployment/api-service

Execute rollback? (yes/no)

You decide whether to roll back or investigate.

Scenario 2: Validation Fails

What happens:

  • Post-execution check fails
  • Alert generated
  • Rollback suggested (you decide whether to run it)

⚠️ Post-execution validation failed
Check: All pods running
Expected: 5 pods
Got: 3 pods (2 in CrashLoopBackOff)

Suggest immediate rollback? (yes/no)

Scenario 3: Accidental Approval

What happens:

  • Execution begins
  • You can still cancel the remaining steps
  • Full audit trail maintained

You: wait, stop!

Clawd: ⏸️ Pausing after current step...
Step 1/5 completed
Remaining steps cancelled

Rollback? (yes/no)
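
The pause behavior in this scenario can be sketched as a stop flag that is checked between steps. This is an illustration under assumptions: `runWithStop` and its parameters are hypothetical, and the step in progress is allowed to finish rather than being interrupted.

```javascript
// Hypothetical stop handling: the flag is checked before each step, so a stop
// request never interrupts the command that is already running.
function runWithStop(steps, run, stopRequested) {
  const done = [];
  for (const step of steps) {
    if (stopRequested()) break;
    run(step);
    done.push(step);
  }
  // Everything after the completed steps is reported as cancelled.
  return { done, cancelled: steps.slice(done.length) };
}

// Simulate a user saying "stop" while step 1 of 5 is running.
let stop = false;
const summary = runWithStop(
  ['step-1', 'step-2', 'step-3', 'step-4', 'step-5'],
  (step) => { if (step === 'step-1') stop = true; },
  () => stop
);
console.log(summary);
```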

Configuration Options

Enable Auto-Approval for LOW Risk

execution:
  auto_approve_low_risk: true

Caution: Even with auto-approval, actions are still logged.

Block Production Changes

safety:
  block_production: true
  production_namespaces:
    - production
    - prod
    - live

Effect: Any action on these namespaces requires explicit override.
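
A check against this setting might look like the sketch below. It assumes the YAML has already been parsed into a plain object with the same keys; `requiresOverride` is a hypothetical helper name.

```javascript
// Returns true when the target namespace is protected and production
// blocking is enabled. Keys match the YAML fragment above.
function requiresOverride(namespace, config) {
  const safety = (config && config.safety) || {};
  if (safety.block_production !== true) return false;
  const protectedNamespaces = safety.production_namespaces || [];
  return protectedNamespaces.includes(namespace);
}

const config = {
  safety: {
    block_production: true,
    production_namespaces: ['production', 'prod', 'live'],
  },
};
console.log(requiresOverride('prod', config));
console.log(requiresOverride('staging', config));
```

The match is exact, so a namespace such as `prod-eu` would not be covered unless it is listed explicitly.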

Require Dry-Run First

execution:
  require_dry_run: true

Effect: High-risk actions must be tested in dry-run mode before actual execution.
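
One possible reading of this setting is a gate that refuses real execution until a dry run of the same plan has succeeded. In the sketch below, `canExecute`, the `plan.id` and `plan.risk` fields, and the set of passed dry runs are all assumptions; applying the gate to both HIGH and CRITICAL plans is likewise an assumption.

```javascript
// Hypothetical gate: with require_dry_run enabled, HIGH and CRITICAL plans may
// only run for real after a dry run of the same plan id has succeeded.
function canExecute(plan, config, passedDryRuns) {
  const execution = (config && config.execution) || {};
  const gated = plan.risk === 'HIGH' || plan.risk === 'CRITICAL';
  if (execution.require_dry_run !== true || !gated) return true;
  return passedDryRuns.has(plan.id);
}

const settings = { execution: { require_dry_run: true } };
const passed = new Set();
const plan = { id: 'plan-20260126-002', risk: 'HIGH' };

console.log(canExecute(plan, settings, passed)); // dry run not yet recorded
passed.add(plan.id);                             // record a successful dry run
console.log(canExecute(plan, settings, passed));
```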


Best Practices

1. Start Read-Only

  • Install with read-only kubeconfig
  • Build trust through diagnosis
  • Graduate to write access

2. Review Every Plan

  • Don't just say "yes"
  • Read the commands
  • Understand the impact
  • Check the rollback plan

3. Test in Non-Production First

  • Generate plans for staging
  • Verify the approach works
  • Then apply to production

4. Monitor Audit Logs

# Review recent actions
tail -20 ~/clawd/memory/actions-log.jsonl

# Check for failures
grep -E '"status": ?"failed"' ~/clawd/memory/actions-log.jsonl
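
For reviews that go beyond a text search, the log can be parsed line by line. The sketch below is a hypothetical helper (`findFailures` is not part of the engine) that assumes each line is one JSON object with a `status` field, as in the audit example earlier in this document.

```javascript
// Parses JSONL text and returns the entries whose status is "failed".
// Blank lines are skipped; malformed lines are counted instead of aborting the scan.
function findFailures(jsonlText) {
  const failures = [];
  let malformed = 0;
  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const entry = JSON.parse(line);
      if (entry.status === 'failed') failures.push(entry);
    } catch (err) {
      malformed += 1;
    }
  }
  return { failures, malformed };
}

// Inline sample data; real input would come from the log file.
const sample = [
  '{"plan_id":"plan-a","status":"success"}',
  '{"plan_id":"plan-b","status":"failed"}',
  'not json',
  '',
].join('\n');
console.log(findFailures(sample));
```

The file contents could be loaded with `fs.readFileSync(path, 'utf8')` and passed to the function as a string.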

5. Keep Plans for Reference

  • Don't delete execution plans
  • Build a library of tested patterns
  • Learn from failures

Emergency Procedures

If Something Goes Wrong

  1. Stop execution - Say "stop" or "cancel"
  2. Assess damage - Check current state
  3. Execute rollback - Use the pre-defined rollback plan
  4. Verify recovery - Confirm services restored
  5. Document incident - Log what happened

If You Need to Override Safety Controls

# In config.yaml
safety:
  allow_critical: true
  require_approval: false  # NOT RECOMMENDED

⚠️ WARNING: Only disable safety controls if you fully understand the risks.


Trust Model

What Clawd Can Do (With Approval)

✅ Diagnose issues (read-only)
✅ Generate execution plans
✅ Execute approved commands
✅ Monitor and verify results
✅ Roll back on failure
✅ Log all actions

What Clawd Cannot Do (Without You)

❌ Execute MEDIUM+ risk commands without approval
❌ Bypass safety controls
❌ Hide actions from audit log
❌ Delete data without confirmation
❌ Modify security policies automatically


Compliance & Audit

Audit Trail

  • Every action logged (JSONL format)
  • Timestamped and attributed
  • Immutable (append-only)
  • Includes approval info

Compliance-Friendly

  • Separation of duties (AI suggests, human approves)
  • Full auditability
  • Rollback capability
  • Change documentation (execution plans)

Export Audit Logs

# Export last 30 days
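node -e "
  // Print audit entries from the last 30 days; exit non-zero on error
  const Logger = require('./core/logger');
  const logger = new Logger();
  const start = new Date();
  start.setDate(start.getDate() - 30);
  logger.exportRange(start, new Date())
    .then(console.log)
    .catch((err) => { console.error(err); process.exit(1); });
"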

Questions?

Q: Can Clawd execute commands without me knowing?
A: No. Every MEDIUM+ risk action requires explicit approval, and every action, including auto-approved LOW risk ones, is logged.

Q: What if I approve something by mistake?
A: If you stop early, the remaining steps are cancelled, and a rollback plan is always available.

Q: How do I know what Clawd is doing?
A: Check the execution plan before approval, monitor during execution, review audit logs after.

Q: Can I trust the execution engine?
A: It's open source (Apache 2.0). Review the code yourself.

Q: What happens if Clawd crashes during execution?
A: Partial execution is logged. Resume manually or roll back using the plan.


Remember: You are always in control. Clawd assists, you decide.