18 changes: 18 additions & 0 deletions manifest.yaml
@@ -222,6 +222,14 @@ protocols:
Classifies each requirement by cross-source compatibility
(Universal, Majority, Divergent, Extension).

- name: session-profiling
path: protocols/reasoning/session-profiling.md
description: >
Systematic analysis of LLM session logs to detect token
inefficiencies, redundant reasoning, and structural waste.
Maps execution back to PromptKit components and produces
actionable optimization recommendations.

formats:
- name: requirements-doc
path: formats/requirements-doc.md
@@ -502,6 +510,16 @@ templates:
taxonomies: [stack-lifetime-hazards]
format: investigation-report

- name: profile-session
path: templates/profile-session.md
description: >
Analyze a completed LLM session log to identify token
inefficiencies and structural waste. Maps execution back to
PromptKit components and produces optimization recommendations.
persona: specification-analyst
protocols: [anti-hallucination, self-verification, session-profiling]
format: investigation-report

code-analysis:
- name: review-code
path: templates/review-code.md
146 changes: 146 additions & 0 deletions protocols/reasoning/session-profiling.md
@@ -0,0 +1,146 @@
<!-- SPDX-License-Identifier: MIT -->
<!-- Copyright (c) PromptKit Contributors -->

---
name: session-profiling
type: reasoning
description: >
Systematic analysis of LLM session logs to detect token inefficiencies,
redundant reasoning, and structural waste. Maps execution back to
PromptKit components and produces actionable optimization recommendations.
applicable_to:
- profile-session
---

# Protocol: Session Profiling

Apply this protocol when analyzing a completed LLM session to understand
where tokens were spent, which PromptKit components contributed to
inefficiency, and what concrete changes would reduce waste. The goal is
observability — turning an opaque session transcript into an attributed
breakdown of cost and quality.
Copilot AI Mar 24, 2026

The protocol intro says the goal is a breakdown of “cost and quality,” but the accompanying template’s Non-Goals explicitly says not to evaluate whether the session output was correct/high-quality. Consider rephrasing here to avoid implying that correctness/quality is in scope (e.g., focus on cost drivers and efficiency tradeoffs).

Suggested change
breakdown of cost and quality.
breakdown of cost drivers and efficiency tradeoffs.


## Phase 1: Segment the Session Log

Break the raw session log into discrete, analyzable units.

1. Identify each **turn boundary** — where a prompt ends and a response
begins, and where a response ends and a follow-up begins.
2. For each turn, record:
- Turn number (sequential)
- Direction: prompt → response, or follow-up → response
   - Approximate token count (estimate as character count ÷ 4 for
     English text; note that code and structured data may differ)
- Whether the turn is a retry, correction, or continuation of a
previous turn
3. Identify **multi-turn patterns**: Does the session show a single
prompt → response, or iterative refinement with multiple rounds?
4. Flag any turns that appear to be **error recovery** — the LLM
produced something incorrect and was redirected.

**Output**: A numbered turn inventory with token estimates and
classification (initial, continuation, retry, correction).
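
The Phase 1 turn inventory can be sketched as a small data structure with a rough token estimator — a minimal illustration only; the field names and example turns are assumptions, not part of the protocol (the chars ÷ 4 heuristic is the one stated above):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    number: int            # sequential turn number
    direction: str         # "prompt->response" or "follow-up->response"
    text: str
    kind: str = "initial"  # initial | continuation | retry | correction

def estimate_tokens(text: str) -> int:
    # Rough heuristic from the protocol: ~4 characters per token for
    # English prose; code and structured data may differ.
    return max(1, len(text) // 4)

# Hypothetical two-turn session with one error-recovery follow-up
turns = [
    Turn(1, "prompt->response", "Analyze the attached session log for waste."),
    Turn(2, "follow-up->response", "Redo the attribution; Phase 2 was wrong.",
         kind="correction"),
]
inventory = [(t.number, t.kind, estimate_tokens(t.text)) for t in turns]
```

The inventory then carries directly into Phase 2, where each turn segment gets a component attribution.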

## Phase 2: Map Turns to PromptKit Components

Using the assembled prompt as a reference, attribute each segment of
LLM output to the PromptKit component that drove it.

1. **Persona attribution**: Identify where the LLM's behavior reflects
persona instructions — tone, domain framing, epistemic labeling.
Note where persona instructions are followed vs. ignored.
2. **Protocol phase attribution**: For each reasoning protocol declared
in the assembled prompt, identify which turns or turn segments
execute which protocol phase. Record:
- Protocol name and phase number
- Turn(s) where this phase executes
- Estimated tokens consumed by this phase
3. **Format compliance**: Identify where the LLM is structuring output
to match the declared format (section headings, finding IDs, severity
labels). Note format overhead — tokens spent on structure vs. content.
4. **Context utilization**: For each context block provided in the prompt
(code, logs, documents), track whether and where the LLM references it.
A context block that is never referenced is a candidate for pruning.

**Output**: A component attribution map — each turn segment linked to
the persona, protocol phase, or format rule that produced it.
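
One way to represent that output is a flat list of attribution records plus a set difference for unused context; the record shape and the context-block names below are illustrative assumptions, not a prescribed schema:

```python
# Each record links a segment of LLM output to the PromptKit component
# believed to have driven it.
attribution_map = [
    {"turn": 1, "segment": "opening framing",
     "component": "persona:specification-analyst", "est_tokens": 120},
    {"turn": 1, "segment": "turn inventory",
     "component": "protocol:session-profiling/phase-1", "est_tokens": 450},
    {"turn": 1, "segment": "section headings and finding IDs",
     "component": "format:investigation-report", "est_tokens": 60},
]

# Step 4: context blocks never referenced in any response are
# candidates for pruning.
provided = {"code_block_a", "build_log", "design_doc"}
referenced = {"code_block_a", "build_log"}
unused = provided - referenced
```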

## Phase 3: Detect Inefficiencies

Analyze the attributed session for structural waste. For each
inefficiency found, record the type, location (turn number and
approximate position), estimated token cost, and the PromptKit
component responsible.

Inefficiency types to detect:

- **Redundant reasoning** (RE-REDUNDANT): The LLM re-derives a
conclusion it already reached in an earlier turn or earlier in the
same turn. Common when protocol phases overlap or when the persona
instructs broad analysis that a later protocol phase also covers.

- **False starts and backtracking** (RE-BACKTRACK): The LLM begins a
line of analysis, abandons it (sometimes mid-paragraph), and restarts
with a different approach. Indicates unclear or contradictory
instructions in the prompt.

- **Unnecessary re-derivation** (RE-REDERIVE): Information explicitly
provided in the prompt context is derived from scratch instead of
referenced. The LLM explains something the user already provided.
Common when context blocks are large and the LLM loses track.

- **Protocol loops** (RE-LOOP): A protocol phase executes more than
once — typically because the self-verification guardrail triggered a
redo, or because the protocol's phase boundaries are ambiguous.

- **Unused context** (RE-UNUSED): A context block (code, logs,
documents) provided in the prompt is never referenced in any response.
Either the context was unnecessary for this task, or the prompt failed
to direct the LLM to use it.

- **Verbose compliance** (RE-VERBOSE): The LLM spends significant tokens
restating prompt instructions, disclaiming scope, or producing
boilerplate structure that adds no analytical value. Common with
overly detailed format specifications.

- **Persona drift** (RE-DRIFT): The LLM's behavior shifts away from
the declared persona partway through the session — changing tone,
abandoning epistemic labeling, or switching domain framing. Indicates
the persona definition may be too weak to persist across long sessions.

## Phase 4: Quantify Impact

For each inefficiency detected in Phase 3:

1. Estimate the **token cost** — how many tokens were wasted on this
specific inefficiency.
2. Estimate the **percentage of total session tokens** this represents.
3. Assess **quality impact** — did this inefficiency also degrade the
output quality (not just cost)? A redundant derivation wastes tokens
but may not harm quality; a false start followed by an incorrect
restart harms both.
Comment on lines +118 to +121
Copilot AI Mar 24, 2026

Phase 4 asks to assess “quality impact” (including whether output was incorrect), which conflicts with the template’s stated scope of cost/efficiency rather than output correctness. Please align the protocol with the template by either removing correctness evaluation from this step or tightly scoping “quality impact” to efficiency-related effects (e.g., increased retries/loops) without judging task correctness.

Suggested change
3. Assess **quality impact** — did this inefficiency also degrade the
output quality (not just cost)? A redundant derivation wastes tokens
but may not harm quality; a false start followed by an incorrect
restart harms both.
3. Assess **efficiency-related impact** — beyond raw token count, did
this inefficiency cause extra interaction steps (e.g., additional
self-verification cycles, repeated clarifications, or re-running
protocol phases) that increased latency or operational cost? Do not
judge whether the final task output was correct.

4. Rank inefficiencies by token cost (descending) to identify the
highest-impact optimization targets.
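
Steps 1, 2, and 4 reduce to simple arithmetic over the Phase 3 inefficiency records; a sketch under an assumed record shape (the finding values are invented for illustration):

```python
findings = [
    {"id": "F-001", "type": "RE-REDUNDANT", "turn": 3, "est_tokens": 800},
    {"id": "F-002", "type": "RE-UNUSED",    "turn": 1, "est_tokens": 300},
    {"id": "F-003", "type": "RE-VERBOSE",   "turn": 2, "est_tokens": 1200},
]
total_session_tokens = 10_000

for f in findings:
    # Step 2: each inefficiency's share of total session tokens
    f["pct"] = round(100 * f["est_tokens"] / total_session_tokens, 1)

# Step 4: rank by token cost, descending, to surface the
# highest-impact optimization targets
ranked = sorted(findings, key=lambda f: f["est_tokens"], reverse=True)
```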

## Phase 5: Produce Optimization Recommendations

For each inefficiency (or cluster of related inefficiencies), produce a
concrete, actionable recommendation tied to a specific PromptKit component.

Recommendations must follow this pattern:
- **Component**: Which PromptKit file to change (persona, protocol,
format, or template)
- **Change**: What specifically to modify (e.g., "merge Phase 2 and
Phase 4 of the root-cause-analysis protocol", "compress persona
behavioral constraints from 12 to 5 rules", "remove the code context
parameter — it was unused in all 3 sessions analyzed")
- **Expected savings**: Estimated token reduction
- **Risk**: What might degrade if this change is made (e.g., "removing
the self-verification phase saves ~500 tokens per session but may
increase false positive rate")

Order recommendations by expected savings (descending). Distinguish
between:
- **Safe optimizations**: Changes that reduce tokens with no quality risk
- **Tradeoff optimizations**: Changes that reduce tokens but may affect
output quality — flag these clearly
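
The recommendation pattern and the safe/tradeoff split above can be captured as records ordered by expected savings; the component paths and figures below are hypothetical examples, not taken from any real profiling run:

```python
recommendations = [
    {"component": "personas/specification-analyst.md",
     "change": "compress behavioral constraints from 12 to 5 rules",
     "expected_savings": 400,
     "risk": "persona may persist less reliably across long sessions"},
    {"component": "templates/profile-session.md",
     "change": "drop a context parameter that went unused",
     "expected_savings": 900,
     "risk": None},  # no identified quality risk -> safe optimization
]

# Order by expected savings (descending), then split safe vs. tradeoff.
ordered = sorted(recommendations,
                 key=lambda r: r["expected_savings"], reverse=True)
safe = [r for r in ordered if r["risk"] is None]
tradeoff = [r for r in ordered if r["risk"] is not None]
```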
74 changes: 74 additions & 0 deletions templates/profile-session.md
@@ -0,0 +1,74 @@
<!-- SPDX-License-Identifier: MIT -->
<!-- Copyright (c) PromptKit Contributors -->

---
name: profile-session
description: >
Analyze a completed LLM session log to identify token inefficiencies
and structural waste. Maps execution back to PromptKit components
(persona, protocols, format) and produces actionable optimization
recommendations. Designed for batch analysis of completed sessions,
not real-time monitoring.
persona: specification-analyst
protocols:
- guardrails/anti-hallucination
- guardrails/self-verification
- reasoning/session-profiling
format: investigation-report
params:
session_log: "Full transcript of the LLM session (all turns — prompts, responses, follow-ups)"
assembled_prompt: "The assembled PromptKit prompt that was used for the session"
focus_areas: "Optional — specific inefficiency types to prioritize (e.g., redundant reasoning, unused context, protocol loops). Default: all"
input_contract: null
output_contract:
type: investigation-report
description: >
A profiling report with per-component token attribution, classified
inefficiency findings (F-NNN with severity), and optimization
recommendations tied to specific PromptKit components.
---

# Task: Profile Session

Analyze the provided session log and assembled prompt to produce a
token efficiency profile. Identify where the session spent tokens
inefficiently and which PromptKit components contributed.

## Inputs

**Session log:**

{{session_log}}

**Assembled prompt used for the session:**

{{assembled_prompt}}

**Focus areas:** {{focus_areas}}

## Instructions

1. Execute the session-profiling protocol against the inputs above.
2. For each finding, classify using the investigation-report format:
- Use finding IDs (F-001, F-002, …) with severity levels
- Attribute each finding to a specific PromptKit component
(persona file, protocol file and phase, format rule, or
template parameter)
- Include estimated token cost of each inefficiency
Comment on lines +52 to +57
Copilot AI Mar 24, 2026

The investigation-report format requires specific severity values (Critical/High/Medium/Low/Informational) and includes a required Category field per finding. Consider updating these instructions to (a) enumerate the allowed severity values and (b) specify where the RE-* inefficiency codes should go (e.g., use them as the finding Category) so outputs reliably conform to the format.

3. In the Remediation Plan, provide concrete changes to specific
PromptKit component files — not abstract advice.
4. In the Executive Summary, report:
- Total estimated session tokens
- Estimated wasteful tokens and percentage
- Top 3 optimization opportunities by token savings
Comment on lines +60 to +63
Copilot AI Mar 24, 2026

The investigation-report format constrains the Executive Summary to 2–4 sentences. The current instruction to include multiple bullet items (total tokens, waste %, top 3 opportunities) may cause outputs to violate that format requirement; consider relocating these metrics to another section (e.g., Scope/Findings) or explicitly requiring they be condensed into the 2–4 sentence limit.

Suggested change
4. In the Executive Summary, report:
- Total estimated session tokens
- Estimated wasteful tokens and percentage
- Top 3 optimization opportunities by token savings
4. In the Executive Summary, write a concise 2–4 sentence narrative that
includes the total estimated session tokens, the estimated wasteful
tokens and percentage, and the top 3 optimization opportunities by
token savings.


## Non-Goals

- Do NOT evaluate whether the session's *output* was correct or
high-quality. This is a cost/efficiency analysis, not a quality audit.
- Do NOT suggest changes to the user's input parameters (session_log,
code_context, etc.) — only to PromptKit components (persona,
Copilot AI Mar 24, 2026

The Non-Goals section references code_context, but this template’s params are session_log, assembled_prompt, and focus_areas. Please update this line to reference the actual parameters (or remove the extraneous example) so the template stays self-consistent for users.

Suggested change
code_context, etc.) — only to PromptKit components (persona,
assembled_prompt, focus_areas, etc.) — only to PromptKit components (persona,

protocols, format, template).
- Do NOT recommend removing guardrail protocols (anti-hallucination,
self-verification) unless they are demonstrably causing loops.
Guardrails have a cost but exist for a reason.
Copilot AI Mar 24, 2026

This template is missing a "Quality checklist" section. CONTRIBUTING.md specifies that template bodies should include a quality checklist before finalizing output; adding one here would keep new templates consistent with the documented authoring guidelines.

Suggested change
Guardrails have a cost but exist for a reason.
Guardrails have a cost but exist for a reason.
## Quality checklist
Before finalizing your output, verify that:
- [ ] You have executed the session-profiling protocol and followed all steps in the Instructions.
- [ ] The investigation report includes all required sections from the `investigation-report` format; if any section has no content, it explicitly states "None identified".
- [ ] Every finding has a unique ID (F-001, F-002, …), a severity level, an estimated token cost, and is attributed to a specific PromptKit component (persona, protocol phase, format rule, or template parameter).
- [ ] The Executive Summary reports total estimated session tokens, estimated wasteful tokens and percentage, and the top 3 optimization opportunities by token savings.
- [ ] The Remediation Plan contains concrete, file-level changes to PromptKit components and does not suggest changes to user inputs or the removal of guardrail protocols without clear evidence of looping.
