diff --git a/bootstraps/optional/BOOTSTRAP-BEDROCK-BENCHMARK.md b/bootstraps/optional/BOOTSTRAP-BEDROCK-BENCHMARK.md
index ef5a0a2..d899352 100644
--- a/bootstraps/optional/BOOTSTRAP-BEDROCK-BENCHMARK.md
+++ b/bootstraps/optional/BOOTSTRAP-BEDROCK-BENCHMARK.md
@@ -2,7 +2,7 @@
 
 > **Applies to:** Any agent with AWS credentials (IAM role) and Python 3.10+
 
-This bootstrap equips you to run rigorous, reproducible latency benchmarks comparing models on Amazon Bedrock — and optionally against the Anthropic Direct API. The methodology is battle-tested across Haiku, Sonnet, and Opus workloads at various payload sizes and regions.
+This bootstrap equips you to investigate customer-reported latency issues on Amazon Bedrock and to run rigorous, reproducible latency benchmarks comparing models — optionally against the Anthropic Direct API. The methodology is battle-tested across Haiku, Sonnet, and Opus workloads at various payload sizes and regions.
 
 ---
 
@@ -121,6 +121,91 @@ If the user asks for something that will produce misleading results, **say so**:
 ---
+
+## Investigating Customer-Reported Latency Issues
+
+When a customer says "Bedrock is slow" or "X is faster than Y", follow this process **before** characterizing the invocation profile or running diagnostics. The most expensive mistake is investigating a problem that isn't real, or diagnosing the wrong variable.
+
+### Step 0: Validate the Comparison First
+
+**Do not accept the problem framing at face value.** A-vs-B latency comparisons are only meaningful if the workloads are equivalent. Check every item:
+
+- [ ] Are the **same requests** (same prompts, same parameters) sent to both endpoints?
+- [ ] Same **input token range**? Same **output token range**?
+- [ ] Same `max_tokens` value on both sides?
+- [ ] **Thinking** on/off consistent between both?
+- [ ] Same streaming mode? (streaming TTFT ≠ non-streaming E2E)
+- [ ] Same **time window**? (Measurements hours apart aren't comparable)
+- [ ] Comparable **sample sizes** and concurrency levels?
+
+If any fail → state the comparison is invalid before going further. What looks like "Bedrock is slower" may simply be "Bedrock is handling heavier requests."
+
+**Common example:** Customer routes 32K–200K token requests exclusively to Bedrock and smaller requests to Direct API, then compares tail latency. Of course Bedrock looks worse — it's doing more work.
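+
+The checklist above can be partly automated when both sides log per-request metadata. Below is a minimal sketch, assuming each endpoint's requests are exported to a CSV with hypothetical column names (`input_tokens`, `output_tokens`, `max_tokens`, `thinking`, `stream`); the file names, column names, and 25% tolerance are illustrative, not part of this bootstrap.
+
+```python
+import csv
+from statistics import median, quantiles
+
+NUMERIC = ["input_tokens", "output_tokens", "max_tokens"]  # distributions to compare
+FLAGS = ["thinking", "stream"]                             # parameters that must match
+
+def profile(path: str) -> dict:
+    """Summarize the workload one endpoint actually received (one CSV row per request)."""
+    with open(path, newline="") as f:
+        rows = list(csv.DictReader(f))
+    summary = {"n": len(rows)}
+    for col in NUMERIC:
+        vals = sorted(float(r[col]) for r in rows if r.get(col))
+        if len(vals) >= 2:
+            summary[col] = {"p50": median(vals), "p95": quantiles(vals, n=20)[18], "max": vals[-1]}
+    for col in FLAGS:
+        summary[col] = sorted({r.get(col, "?") for r in rows})
+    return summary
+
+def unfair(a: dict, b: dict, tolerance: float = 0.25) -> list[str]:
+    """Return the reasons an A-vs-B latency comparison is not apples-to-apples."""
+    reasons = []
+    for col in NUMERIC:
+        if col in a and col in b:
+            lo, hi = sorted([a[col]["p50"], b[col]["p50"]])
+            if lo and (hi - lo) / lo > tolerance:
+                reasons.append(f"{col} p50 differs by more than {tolerance:.0%}: {lo:.0f} vs {hi:.0f}")
+    for col in FLAGS:
+        if a.get(col) != b.get(col):
+            reasons.append(f"{col} settings differ: {a.get(col)} vs {b.get(col)}")
+    return reasons
+
+bedrock, direct = profile("bedrock_requests.csv"), profile("direct_requests.csv")
+problems = unfair(bedrock, direct)
+print("Workloads look comparable." if not problems else "Comparison is NOT valid:")
+for p in problems:
+    print(" -", p)
+```
+
+If this reports differences, fix the workload split (or segment the analysis by request size) before reading any latency chart.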
+
+### Step 1: Read the Evidence Before Diagnosing
+
+When a customer provides graphs, metrics, or logs, extract these signals before proposing any root cause:
+
+1. **What does the baseline look like?** Mean/median at equivalent request sizes — not just the tail. Bedrock may actually be faster at the baseline even if it has a worse tail.
+2. **Do "outliers" correlate with request size?** High latency at 100K+ input tokens is *expected behavior*, not an anomaly. Verify this before calling it a problem.
+3. **Are there outliers at SMALL request sizes?** Small requests (< 5K tokens) taking minutes — that is the real signal worth investigating.
+4. **What's missing from the visualization?** Common missing dimensions that change the interpretation:
+   - Output token count (dominates E2E latency)
+   - Thinking token count (invisible overhead)
+   - Actual API parameters used (vs framework config)
+   - Timestamps of spikes (to correlate with capacity events)
+
+### Step 2: Distinguish Framework Config from Actual API Parameters
+
+A frequent pitfall: code shows what looks like API configuration but is actually framework-level metadata that never reaches the API.
+
+| Looks like | Actually is | What to do |
+|---|---|---|
+| `max_context_window=200_000` | Framework hint about model capability | Harmless; ignore for latency investigation |
+| `thinking_budget=(1024, 32768)` | Capability range declaration (min, max tuple) | Does NOT mean thinking is enabled — verify actual request body |
+| Model ID in config file | May differ from actual invocation (e.g., comment says `us.` but code uses `global.`) | Check runtime logs or intercepted request |
+
+**Always ask:** "Can you share the actual API request body, or add logging to capture what parameters are sent?" before concluding a parameter is or isn't active.
+
+### Step 3: Work With Incomplete Data
+
+You will rarely have a complete picture. Don't wait for perfect data — investigate with what you have:
+
+1. **Form theories from available evidence.** Label each HIGH / MEDIUM / LOW confidence based on what the data actually shows vs what you're inferring.
+2. **Be explicit about assumptions.** "I'm assuming thinking is enabled because of the config — but this needs confirmation" is better than stating it as fact.
+3. **Identify the single most valuable missing data point.** Ask for the one piece that would confirm or rule out the top theory. Targeted asks beat generic ones:
+   - "Filter your scatter plot to content_len < 32K only — do outliers disappear?"
+   - "Check response usage for `thinking_tokens` on a few of the slow requests"
+   - "What is your `max_tokens` value, and what is your typical actual output token count?"
+   - "Log `cache_creation_input_tokens` vs `cache_read_input_tokens` on slow requests"
+
+### Step 4: Expected Latency Reference
+
+Before calling something an outlier, check whether the latency is simply expected for that request profile. For Sonnet 4.6 (~50–80 output tokens/sec):
+
+| Input tokens | Output tokens | Thinking | Expected E2E |
+|---|---|---|---|
+| 1–5K | 100–500 | Off | 2–8s |
+| 5–20K | 200–1000 | Off | 5–15s |
+| 20–50K | 500–2000 | Off | 15–45s |
+| 50–128K | 500–2000 | Off | 30–90s |
+| 128–200K | 1000–4000 | Off | 60–180s |
+| Any | Any | On (32K budget) | **Add 6–10 min** before first output token |
+
+If the customer's "outliers" fall within these ranges → the issue is likely workload distribution, not Bedrock performance.
+
+**Thinking budget impact:** At 32K `budget_tokens`, the model generates up to 32,768 thinking tokens *before* producing any visible output. At 50–80 OTPS that's 6–10 minutes of silent generation. This looks exactly like extreme latency and leaves no obvious error signal. Verify by checking `thinking_tokens` in the response usage object.
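+
+The table is easy to apply mechanically. Below is a minimal sketch that turns it into an expected-range check for a single request; the throughput band and per-100K-token prefill overhead are assumptions derived from the table above, not measured constants, so calibrate them against your own baseline runs. `thinking_tokens` is whatever your logging reports for thinking usage (0 if thinking is off).
+
+```python
+# Rough E2E expectation: generation time dominated by (output + thinking) tokens
+# at ~50-80 tok/s, plus an assumed input-size-dependent prefill overhead.
+OTPS_RANGE = (50.0, 80.0)           # assumed output tokens/sec band (from the table)
+PREFILL_S_PER_100K = (5.0, 30.0)    # assumed seconds of prefill per 100K input tokens
+
+def expected_e2e_seconds(input_tokens: int, output_tokens: int,
+                         thinking_tokens: int = 0) -> tuple[float, float]:
+    """Return a (low, high) expected end-to-end range in seconds for one request."""
+    generated = output_tokens + thinking_tokens
+    low = generated / OTPS_RANGE[1] + input_tokens / 100_000 * PREFILL_S_PER_100K[0]
+    high = generated / OTPS_RANGE[0] + input_tokens / 100_000 * PREFILL_S_PER_100K[1]
+    return (low, high)
+
+def classify(observed_s: float, input_tokens: int, output_tokens: int,
+             thinking_tokens: int = 0, slack: float = 1.5) -> str:
+    """Label one request as expected / borderline / outlier given its token profile."""
+    low, high = expected_e2e_seconds(input_tokens, output_tokens, thinking_tokens)
+    if observed_s <= high:
+        return f"expected ({low:.0f}-{high:.0f}s for this profile)"
+    if observed_s <= high * slack:
+        return f"borderline (expected up to ~{high:.0f}s)"
+    return f"outlier (expected up to ~{high:.0f}s, observed {observed_s:.0f}s)"
+
+# A 7-minute "outlier" that is actually a 32K thinking budget doing its job:
+print(classify(observed_s=420, input_tokens=8_000, output_tokens=900, thinking_tokens=30_000))
+```
+
+Only requests that come out as outliers under this check (especially small-input ones) are worth escalating as a Bedrock performance problem.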
+
+### Step 5: Structuring Your Response to the Customer
+
+1. **What the data shows** — observations only, no diagnosis yet
+2. **What we can conclude** — theories, labeled by confidence (HIGH/MEDIUM/LOW)
+3. **What we can't conclude yet** — explicitly name the missing data
+4. **Specific asks** — the minimum additional data to confirm/rule out top theories
+5. **Recommendations** — only after theories are validated, or as parallel quick-wins
+
+---
+
 ## Diagnosing Latency Issues
 
 When investigating a latency issue (e.g. "Claude on Bedrock is slower than on Anthropic Direct"), **don't jump to benchmarking**. First, characterize the problem and gather the right information.
@@ -206,6 +291,13 @@ For workloads with repeated system prompts or large contexts, enabling prompt ca
 - Same `max_tokens`? (Bedrock quota reservation doesn't apply to Direct API)
 - Same time window? (Measured hours apart = different load conditions)
+
+#### Unfair Comparison — Workload Mismatch (check before all others)
+One endpoint handles heavier workloads (larger tokens, thinking enabled, different parameters). Appears as "X is slower" when it's actually "X is doing more work." Fix: ensure equivalent requests on both sides before investigating further.
+
+#### Expected Latency for Request Size
+The observed latency is within the normal range for that token count — not an outlier. For Sonnet 4.6 (~50–80 OTPS): 50K-token requests take 30–90s; 128K+ take 60–180s. Check the expected latency table in the Investigation section above before escalating.
+
 ### Step 4: Diagnostic Tools
 
 | Tool | What It Does | When to Use |
@@ -828,6 +920,13 @@ The benchmark should auto-generate a markdown report. Structure:
 - RPM/TPM quota limits for the account+model
 - Actual TPM/RPM utilization during measurement
 
+### Latency Investigation (when customer reports slowness)
+- Validate comparison first: same requests, same params, same token range on both sides?
+- Read data before diagnosing: do "outliers" correlate with large request sizes? That's expected behavior.
+- Distinguish framework config from actual API params — verify what's actually sent
+- Work with incomplete data: form ranked theories (HIGH/MEDIUM/LOW), ask for the single most useful missing data point
+- Expected E2E for Sonnet 4.6: 5-20K tokens→5-15s, 50-128K→30-90s, 128-200K→60-180s; thinking (32K budget) adds 6-10 min
+
 ### Methodology
 - Min 10 iterations (20+ preferred); 2-3 warmup iterations excluded
 - Interleave models (A1,B1,A2,B2...) — never sequential blocks
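+
+The two methodology rules above (fixed warmup, strict interleaving) are easy to get wrong in ad-hoc scripts. Here is a minimal sketch of a schedule builder; the variant names are placeholders, not real model IDs, and the harness that executes each entry is assumed to exist elsewhere in this bootstrap.
+
+```python
+from itertools import product
+
+def build_schedule(variants: list[str], iterations: int = 20, warmup: int = 2) -> list[dict]:
+    """Interleave variants (A1,B1,A2,B2,...) and tag warmup rounds to exclude from stats."""
+    schedule = []
+    for i, variant in product(range(1, iterations + warmup + 1), variants):
+        schedule.append({
+            "variant": variant,
+            "round": i,
+            "warmup": i <= warmup,  # first `warmup` rounds run normally but are excluded from analysis
+        })
+    return schedule
+
+# Two placeholder variants, 20 measured rounds each plus 2 warmup rounds each.
+plan = build_schedule(["model-a-on-bedrock", "model-b-on-bedrock"], iterations=20, warmup=2)
+print(plan[:4])  # A(warmup), B(warmup), A(warmup), B(warmup), then measured rounds
+```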