
feat(sdk): Add TestLLM class for testing without mocking LiteLLM#2016

Open
VascoSch92 wants to merge 4 commits into main from openhands/test-llm-profile

Conversation

@VascoSch92
Contributor

@VascoSch92 VascoSch92 commented Feb 11, 2026

Summary

This PR introduces a new TestLLM class in openhands.sdk.testing that provides a clean way to write tests without needing to mock LiteLLM internals.

Key Features

  • TestLLM is a real LLM subclass that works anywhere an LLM is accepted: in Agent(llm=...), in fallback_llms, in condensers, in routers, etc.
  • No patching needed: just pass TestLLM as the llm= argument
  • Tests speak in SDK types (Message, TextContent, MessageToolCall) instead of LiteLLM types
  • Clear error when responses are exhausted: raises TestLLMExhaustedError with a helpful message
  • Zero-cost metrics by default
  • Always uses completion() path (uses_responses_api returns False)

Example Usage

from openhands.sdk.testing import TestLLM
from openhands.sdk.llm import Message, TextContent, MessageToolCall

# Simple text response
llm = TestLLM.from_messages([
    Message(role="assistant", content=[TextContent(text="Done!")]),
])

# Response with tool calls
llm = TestLLM.from_messages([
    Message(
        role="assistant",
        content=[TextContent(text="")],
        tool_calls=[
            MessageToolCall(
                id="call_1",
                name="my_tool",
                arguments='{"arg": "value"}',
                origin="completion",
            )
        ],
    ),
    Message(role="assistant", content=[TextContent(text="Done!")]),
])

What This Replaces

Today a typical test looks like ~20 lines of setup (@patch decorator + mock response factory + side_effect list). With TestLLM.from_messages(...) it's ~3 lines. The maintenance burden drops because tests speak in SDK types instead of LiteLLM types.
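For comparison, here is a rough sketch of the mock-heavy pattern this replaces; the patch target and response factory below are illustrative, not the exact code from the existing tests:

from unittest.mock import patch

from litellm.types.utils import Choices, ModelResponse
from litellm.types.utils import Message as LiteLLMMessage


def make_mock_response(text: str) -> ModelResponse:
    # Hand-build a LiteLLM transport object just to script a single reply.
    return ModelResponse(
        id="mock-id",
        created=0,
        model="test-model",
        object="chat.completion",
        choices=[
            Choices(
                index=0,
                finish_reason="stop",
                message=LiteLLMMessage(role="assistant", content=text),
            )
        ],
    )


# Illustrative patch target: real tests patch wherever the SDK imports
# litellm's completion function; the exact module path may differ.
@patch("openhands.sdk.llm.llm.litellm_completion")
def test_agent_finishes(mock_completion):
    mock_completion.side_effect = [
        make_mock_response("Done!"),
    ]
    ...  # build Agent/Conversation and run as usual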

This PR also refactors tests/sdk/conversation/local/test_agent_status_transition.py to demonstrate the new TestLLM usage.

Closes #2005

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                   Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                       Link
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22   Link
golang    amd64, arm64    golang:1.21-bookworm                         Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9895a77-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9895a77-python \
  ghcr.io/openhands/agent-server:9895a77-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9895a77-golang-amd64
ghcr.io/openhands/agent-server:9895a77-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9895a77-golang-arm64
ghcr.io/openhands/agent-server:9895a77-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9895a77-java-amd64
ghcr.io/openhands/agent-server:9895a77-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9895a77-java-arm64
ghcr.io/openhands/agent-server:9895a77-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9895a77-python-amd64
ghcr.io/openhands/agent-server:9895a77-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9895a77-python-arm64
ghcr.io/openhands/agent-server:9895a77-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9895a77-golang
ghcr.io/openhands/agent-server:9895a77-java
ghcr.io/openhands/agent-server:9895a77-python

About Multi-Architecture Support

  • Each variant tag (e.g., 9895a77-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 9895a77-python-amd64) are also available if needed

This commit introduces a new TestLLM class in openhands.sdk.testing that
provides a clean way to write tests without needing to mock LiteLLM internals.

Key features:
- TestLLM is a real LLM subclass that works anywhere an LLM is accepted
- No @patch decorators needed - just pass TestLLM as the llm= argument
- Tests speak in SDK types (Message, TextContent, MessageToolCall)
- Clear error when scripted responses are exhausted
- Zero-cost metrics by default
- Always uses completion() path (uses_responses_api returns False)

Also refactors test_agent_status_transition.py to demonstrate the new
TestLLM usage, replacing ~20 lines of mock setup with ~3 lines.

Closes #2005
@openhands-ai openhands-ai bot mentioned this pull request Feb 11, 2026
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

Coverage

Coverage Report

File                                               Stmts   Miss   Cover   Missing
openhands-sdk/openhands/sdk/testing/test_llm.py       63      4     93%   182, 192, 221, 296
TOTAL                                               17907   5373     69%

@enyst
Collaborator

enyst commented Feb 11, 2026

@OpenHands WDYT about this draft PR? Feel free to read AGENTS.md and make the project, and find the litellm codebase in .venv. Investigate and understand all that it's doing and how we would use this. Look at the linked issue too.

Post your feedback as a comment to this PR, so I can read it. Note that it will be rendered as markdown. Use smolpaws.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

I'm on it! enyst can track my progress at all-hands.dev

Collaborator

enyst commented Feb 11, 2026

Feedback on PR #2016: TestLLM (no LiteLLM mocking)

Overall assessment

This is a solid direction and does make the test suite better.

The biggest win is decoupling tests from LiteLLM transport objects (ModelResponse, Choices, etc.) and from patching litellm_completion (which is both verbose and a subtle source of global-state leakage, because LiteLLM has global knobs like modify_params). A real LLM subclass that can be injected everywhere is the right shape.

I also like that you put it under openhands.sdk.testing rather than under openhands.sdk.llm.* — that answers the concern raised in the issue thread: users can and do implement their own LLM subclasses, so a test-double living in testing feels appropriate and keeps the runtime surface cleaner.

What I learned from reading the code / LiteLLM internals

I looked at the actual call sites in LLM:

  • LLM._transport_call() (completion path) calls litellm_completion(...) and expects a litellm.types.utils.ModelResponse.
  • LLM.responses() calls litellm_responses(...) and expects a ResponsesAPIResponse (or a streaming iterator).

TestLLM avoids both of those by overriding completion() / responses() directly, so tests never hit LiteLLM.

In our local venv here, LiteLLM is 1.80.10 and ModelResponse is a Pydantic model with fields: id, created, model, object, choices, .... The PR constructs a minimal-but-valid ModelResponse with a single Choices(message=...), which is enough for LLMResponse.id to work and for any downstream debugging that inspects raw_response.

API/design nits (actionable)

These aren’t blockers, but they are the spots I’d tighten before merging:

  1. _return_metrics currently ignored

    • completion(..., _return_metrics=True) is a supported calling convention in the SDK. TestLLM currently always returns metrics, but always zeroed.
    • Suggestion: either (a) respect _return_metrics and return None/an empty snapshot when false (to match expectations), or (b) document in the docstring that TestLLM always returns zero metrics.
  2. uses_responses_api() returning False is good, but responses() is overridden anyway

    • Returning False is consistent with the intent (“don’t force tests to implement both paths”), but since responses() is implemented, uses_responses_api is mostly relevant only for external code that chooses which method to call.
    • Not an issue; just noting the API coupling.
  3. _content_to_string() drops non-text content silently

    • It only includes TextContent. If a scripted message contains ImageContent etc, the generated raw_response will have content="".
    • Probably fine because tests should assert on Message not on raw_response, but it might surprise people when debugging.
    • Suggestion: either include a sentinel like "[non-text content omitted]" or explicitly mention this behavior.
  4. Scripted response queue mutability

    • completion() consumes by pop(0). That’s totally fine for tests, but pop(0) is O(n).
    • Suggestion: use collections.deque for scripted responses (minor perf / cleanliness).

(Edit (Vasco): I changed to use deque; another point in favour of that was that the order of the messages was inverted. See the sketch after this list.)

  5. Error ergonomics
    • TestLLMExhaustedError message is already helpful.
    • A further small improvement would be to include the last user message (or call index) to help debug test failures faster. (Optional.)
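A minimal standalone sketch of the deque-based queue idea from point 4; it mirrors the PR's naming but is an illustration, not the PR's actual implementation:

from collections import deque


class TestLLMExhaustedError(RuntimeError):
    """Raised when the scripted responses run out (name mirrors the PR)."""


class ScriptedQueue:
    """FIFO queue of scripted messages/exceptions; popleft() is O(1)."""

    def __init__(self, responses):
        self._queue = deque(responses)
        self.call_count = 0

    def next(self):
        self.call_count += 1
        if not self._queue:
            raise TestLLMExhaustedError(
                f"No scripted responses left after {self.call_count - 1} call(s)."
            )
        item = self._queue.popleft()
        if isinstance(item, BaseException):
            # Scripted exceptions are raised instead of returned.
            raise item
        return item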

How we would use this (beyond the demo test)

This utility will pay off most in tests that currently patch litellm_completion and have to build complex tool-call payloads.

Patterns I expect to use a lot:

  • Testing tool loops: script assistant(tool_calls=[...]) then script the final assistant message.
  • Testing router/fallback behavior: script TestLLM for primary to raise an exception, then have fallback TestLLM return a message. (Right now TestLLM only scripts messages, but you can extend it to also script exceptions.)
  • Testing condensers: condensers call LLMs internally; being able to inject TestLLM avoids patching deep internals.

CLI usefulness?

If the question is “is this useful for the end-user CLI at runtime?” probably not directly — it’s a test double.

But it is useful for CLI integration tests and for “repro scripts” where we want deterministic agent runs without network calls.

If we want a runtime-feature analogue, that’s more like a “replay / transcript LLM” (load a JSONL of prompts/responses). TestLLM is close to that, but it currently lives (correctly) as a testing utility.
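As a rough illustration of that replay idea (the JSONL schema with role/text fields and the helper name are assumptions, not an existing format):

import json
from pathlib import Path

from openhands.sdk.llm import Message, TextContent
from openhands.sdk.testing import TestLLM


def load_transcript_llm(path: Path) -> TestLLM:
    # Each JSONL line is assumed to look like {"role": "assistant", "text": "..."}.
    messages = []
    for line in path.read_text().splitlines():
        record = json.loads(line)
        messages.append(
            Message(role=record["role"], content=[TextContent(text=record["text"])])
        )
    return TestLLM.from_messages(messages)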

Compatibility / maintenance risks

  • The only real coupling to LiteLLM is the minimal ModelResponse construction. If LiteLLM changes these Pydantic models significantly, TestLLM might need updates.
    • Mitigation: keep the raw_response creation minimal (as you did), and consider making raw_response optional in LLMResponse long-term (bigger change; not for this PR).

One meta point: naming / discoverability

from openhands.sdk.testing import TestLLM is good.

If you want extra discoverability, you might consider exporting it via docs / a short section in the SDK testing guide, but that’s outside the scope of this PR.


Net: 👍 on the direction and module placement. A couple small tweaks (metrics flag + content conversion note) would make it even smoother.


@malhotra5
Collaborator

malhotra5 commented Feb 11, 2026

Ooh this is cool! Super helpful that we can compose the responses with typed objects rather than mocking json strings for llm responses

Is there a way to support multi turn llm responses? (emit all messages in the queue until a "finish" response?)

@VascoSch92
Contributor Author

Ooh this is cool! Super helpful that we can compose the responses with typed objects rather than mocking json strings for llm responses

Is there a way to support multi turn llm responses? (emit all messages in the queue until a "finish" response?)

I think it already supports that. For example:

from pathlib import Path

from openhands.sdk import Agent, Conversation
from openhands.sdk.llm.message import Message, TextContent
from openhands.sdk.testing import TestLLM

llm = TestLLM.from_messages([
    Message(role="assistant", content=[TextContent(text="Hello!")]),
    Message(role="assistant", content=[TextContent(text="Goodbye!")]),
])
agent = Agent(llm=llm)
conversation = Conversation(agent=agent, workspace=Path.cwd())
conversation.send_message("Hey")
conversation.run()
conversation.send_message("Goodbye")
conversation.run()

This fakes the conversation:

Message from User ─────────────────────────────────────────────────────────

Hey

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from Agent ─────────────────────────────────────────────────────────

Hello!

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from User ─────────────────────────────────────────────────────────

Goodbye

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from Agent ─────────────────────────────────────────────────────────

Goodbye!

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

@VascoSch92
Contributor Author

@malhotra5 Actually integration with tools is also working.

For example

if __name__ == "__main__":
    from pathlib import Path

    from openhands.sdk import Agent, Conversation
    from openhands.sdk.llm.message import Message, MessageToolCall, TextContent
    from openhands.sdk.testing import TestLLM

    llm = TestLLM.from_messages([
        Message(
            role="assistant",
            content=[TextContent(text="Oh yeah. Let's call the finish tool")],
            tool_calls=[
                MessageToolCall(
                    id="call_1",
                    name="finish",
                    arguments='{"message": "I like trains"}',
                    origin="completion",
                )
            ],
        ),
        Message(role="assistant", content=[TextContent(text="This message will never been displayed")]),
    ])
    agent = Agent(llm=llm)
    conversation = Conversation(agent=agent, workspace=Path.cwd())
    conversation.send_message("Let's see what happens with tools :-)")
    conversation.run()

This fakes the conversation:

Message from User ─────────────────────────────────────────────────────────

Let's see what happens with tools :-)

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Agent Action ─────────────────────────────────────────────────────────

Summary: finish: {"message": "I like trains"}

Thought:
Oh yeah. Let's call the finish tool

Finish with message:
I like trains

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

This is pretty nice... and it also shows how well structured the SDK is :-)

I think this class can be really helpful for the SDK test suite and the end-to-end CLI tests.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Review Thread Gate

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2016 at branch `openhands/test-llm-profile`

Feel free to include any additional details that might help me get this PR into a better state.


@VascoSch92 VascoSch92 changed the title feat: Add TestLLM class for testing without mocking LiteLLM feat(SDK): Add TestLLM class for testing without mocking LiteLLM Feb 12, 2026
@VascoSch92 VascoSch92 changed the title feat(SDK): Add TestLLM class for testing without mocking LiteLLM feat(sdk): Add TestLLM class for testing without mocking LiteLLM Feb 12, 2026
@malhotra5
Collaborator

malhotra5 commented Feb 12, 2026

This is great! Thanks for clarifying the behavior; this seems super powerful.

Do we want the ability to raise known LLM error types as well? Could be great if we raise a budget error or a context window error for tests (doesn't have to be in this PR).

@VascoSch92
Contributor Author

VascoSch92 commented Feb 13, 2026

This is great! Thanks for clarifying the behavior; this seems super powerful.

Do we want the ability to raise known LLM error types as well? Could be great if we raise a budget error or a context window error for tests (doesn't have to be in this PR).

I made a couple of changes and now you can pass exceptions, which will be raised. In this way you can also test behaviour when errors happen.

from pathlib import Path

from openhands.sdk import Agent, Conversation
from openhands.sdk.llm.message import Message, TextContent
from openhands.sdk.testing import TestLLM

# LLMContextWindowExceedError also needs to be imported from the SDK's LLM
# exceptions; its exact import path is not shown here.

llm = TestLLM.from_messages(
    [
        Message(role="assistant", content=[TextContent(text="Hello!")]),
        Message(role="assistant", content=[TextContent(text="Goodbye!")]),
        LLMContextWindowExceedError("context too long"),
    ]
)
agent = Agent(llm=llm)
conversation = Conversation(agent=agent, workspace=Path.cwd())
conversation.send_message("Hey")
conversation.run()
conversation.send_message("Goodbye")
conversation.run()
conversation.send_message("I forgot something!")
conversation.run()

This will raise LLMContextWindowExceedError

@VascoSch92 VascoSch92 marked this pull request as ready for review February 13, 2026 10:16
@VascoSch92 VascoSch92 requested a review from enyst February 13, 2026 10:17
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Elegant solution that dramatically simplifies test writing (20 lines → 3 lines). Core idea is sound, but found a few important issues around silent data loss and missing test coverage for the testing utility itself.

"""Convert message content to a string."""
from openhands.sdk.llm.message import TextContent

parts = []

🟠 Important: _content_to_string only handles TextContent and silently ignores other content types like ImageContent. This could mask bugs in tests where messages accidentally contain unsupported content types.

Suggest either:

  1. Handle all content types appropriately, or
  2. Raise an error for unsupported types with a clear message
Suggested change
        parts = []
    def _content_to_string(self, message: Message) -> str:
        """Convert message content to a string."""
        from openhands.sdk.llm.message import TextContent
        parts = []
        for item in message.content:
            if isinstance(item, TextContent):
                parts.append(item.text)
            else:
                raise ValueError(
                    f"TestLLM only supports TextContent, got {type(item).__name__}"
                )
        return "\n".join(parts)

>>> llm.completion([...]) # returns "OK"
>>> llm.completion([...]) # raises LLMContextWindowExceedError
"""


🟠 Important: No unit tests for TestLLM itself. A testing utility should have its own tests to ensure:

  • _create_model_response produces valid LiteLLM responses
  • _content_to_string handles edge cases correctly
  • TestLLMExhaustedError is raised when responses are exhausted
  • Exception responses in the queue are raised correctly
  • remaining_responses and call_count properties work

Without these tests, bugs in TestLLM could cause confusing test failures.
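For instance, a couple of these could look roughly like the following; the import paths and the positional completion() call are assumptions based on this PR's description:

import pytest

from openhands.sdk.llm import Message, TextContent
from openhands.sdk.testing import TestLLM, TestLLMExhaustedError


def _user(text: str) -> list[Message]:
    # Small helper to build a one-message user prompt.
    return [Message(role="user", content=[TextContent(text=text)])]


def test_exhausted_queue_raises():
    llm = TestLLM.from_messages(
        [Message(role="assistant", content=[TextContent(text="Done!")])]
    )
    llm.completion(_user("hi"))  # consumes the only scripted response
    with pytest.raises(TestLLMExhaustedError):
        llm.completion(_user("hi again"))


def test_exception_in_queue_is_raised():
    llm = TestLLM.from_messages([RuntimeError("boom")])
    with pytest.raises(RuntimeError):
        llm.completion(_user("hi"))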

tools=tools,
_return_metrics=_return_metrics,
add_security_risk_prediction=add_security_risk_prediction,
on_token=on_token,

🟡 Suggestion: uses_responses_api() always returning False means TestLLM cannot be used to test code paths that specifically use the responses API.

This is a significant limitation that should be documented in the class docstring, or consider adding a parameter to control this behavior:

def __init__(self, *, use_responses_api: bool = False, **data: Any) -> None:
    self._use_responses_api = use_responses_api
    # ...

def uses_responses_api(self) -> bool:
    return self._use_responses_api

prompt_tokens=0,
completion_tokens=0,
),
)

🟡 Suggestion: _create_model_response creates a minimal ModelResponse that might not match real LLM behavior. Real LLMs may include additional fields that downstream code depends on (e.g., usage stats, model-specific metadata).

Consider adding integration tests that verify TestLLM produces responses processed identically to real LLM responses by the Agent/Conversation machinery. This would catch cases where the minimal response causes different behavior than production.

class SignalingExecutor(
    ToolExecutor[StatusTransitionMockAction, StatusTransitionMockObservation]
):
    """Executor that signals when execution starts and captures status."""

🟡 Suggestion: SignalingExecutor is defined inline here, but StatusCheckingExecutor is still referenced in the module (imported but not shown in diff). Consider consolidating these test executors:

  • Move both to conftest.py as reusable fixtures, or
  • Use SignalingExecutor consistently throughout the file

This follows the repo guideline: "If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py!"

@enyst enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026