
feat(sdk): Add TestLLM class for testing without mocking LiteLLM#2016

Open
VascoSch92 wants to merge 4 commits into main from openhands/test-llm-profile

Conversation

@VascoSch92
Contributor

@VascoSch92 VascoSch92 commented Feb 11, 2026

Summary

This PR introduces a new TestLLM class in openhands.sdk.testing that provides a clean way to write tests without needing to mock LiteLLM internals.

Key Features

  • TestLLM is a real LLM subclass that works anywhere an LLM is accepted: in Agent(llm=...), in fallback_llms, in condensers, in routers, etc.
  • No patching needed: just pass TestLLM as the llm= argument
  • Tests speak in SDK types (Message, TextContent, MessageToolCall) instead of LiteLLM types
  • Clear error when responses are exhausted: raises TestLLMExhaustedError with a helpful message
  • Zero-cost metrics by default
  • Always uses completion() path (uses_responses_api returns False)

Example Usage

from openhands.sdk.testing import TestLLM
from openhands.sdk.llm import Message, TextContent, MessageToolCall

# Simple text response
llm = TestLLM.from_messages([
    Message(role="assistant", content=[TextContent(text="Done!")]),
])

# Response with tool calls
llm = TestLLM.from_messages([
    Message(
        role="assistant",
        content=[TextContent(text="")],
        tool_calls=[
            MessageToolCall(
                id="call_1",
                name="my_tool",
                arguments='{"arg": "value"}',
                origin="completion",
            )
        ],
    ),
    Message(role="assistant", content=[TextContent(text="Done!")]),
])

What This Replaces

Today a typical test looks like ~20 lines of setup (@patch decorator + mock response factory + side_effect list). With TestLLM.from_messages(...) it's ~3 lines. The maintenance burden drops because tests speak in SDK types instead of LiteLLM types.
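For comparison, here is a rough sketch of the mock-heavy pattern this replaces; the patch target and response factory below are illustrative, not the exact code from the existing tests:

from unittest.mock import patch

from litellm.types.utils import Choices, ModelResponse
from litellm.types.utils import Message as LiteLLMMessage


def make_mock_response(text: str) -> ModelResponse:
    # Hand-build a LiteLLM transport object just to script a single reply.
    return ModelResponse(
        id="mock-id",
        created=0,
        model="test-model",
        object="chat.completion",
        choices=[
            Choices(
                index=0,
                finish_reason="stop",
                message=LiteLLMMessage(role="assistant", content=text),
            )
        ],
    )


# Illustrative patch target: real tests patch wherever the SDK imports
# litellm's completion function; the exact module path may differ.
@patch("openhands.sdk.llm.llm.litellm_completion")
def test_agent_finishes(mock_completion):
    mock_completion.side_effect = [
        make_mock_response("Done!"),
    ]
    ...  # build Agent/Conversation and run as usual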

This PR also refactors tests/sdk/conversation/local/test_agent_status_transition.py to demonstrate the new TestLLM usage.

Closes #2005

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                   Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                       Link
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22   Link
golang    amd64, arm64    golang:1.21-bookworm                         Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9895a77-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9895a77-python \
  ghcr.io/openhands/agent-server:9895a77-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9895a77-golang-amd64
ghcr.io/openhands/agent-server:9895a77-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9895a77-golang-arm64
ghcr.io/openhands/agent-server:9895a77-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9895a77-java-amd64
ghcr.io/openhands/agent-server:9895a77-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9895a77-java-arm64
ghcr.io/openhands/agent-server:9895a77-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9895a77-python-amd64
ghcr.io/openhands/agent-server:9895a77-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9895a77-python-arm64
ghcr.io/openhands/agent-server:9895a77-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9895a77-golang
ghcr.io/openhands/agent-server:9895a77-java
ghcr.io/openhands/agent-server:9895a77-python

About Multi-Architecture Support

  • Each variant tag (e.g., 9895a77-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 9895a77-python-amd64) are also available if needed

This commit introduces a new TestLLM class in openhands.sdk.testing that
provides a clean way to write tests without needing to mock LiteLLM internals.

Key features:
- TestLLM is a real LLM subclass that works anywhere an LLM is accepted
- No @patch decorators needed - just pass TestLLM as the llm= argument
- Tests speak in SDK types (Message, TextContent, MessageToolCall)
- Clear error when scripted responses are exhausted
- Zero-cost metrics by default
- Always uses completion() path (uses_responses_api returns False)

Also refactors test_agent_status_transition.py to demonstrate the new
TestLLM usage, replacing ~20 lines of mock setup with ~3 lines.

Closes #2005
@openhands-ai openhands-ai bot mentioned this pull request Feb 11, 2026
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

Coverage

Coverage Report

File                                               Stmts   Miss   Cover   Missing
openhands-sdk/openhands/sdk/testing/test_llm.py       63      4     93%   182, 192, 221, 296
TOTAL                                               17907   5373     69%

@enyst
Collaborator

enyst commented Feb 11, 2026

@OpenHands WDYT about this draft PR? Feel free to read AGENTS.md and make the project, and find the litellm codebase in .venv. Investigate and understand all that it's doing and how we would use this. Look at the linked issue too.

Post your feedback as a comment to this PR, so I can read it. Note that it will be rendered as markdown. Use smolpaws.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

I'm on it! enyst can track my progress at all-hands.dev

Collaborator

enyst commented Feb 11, 2026

Feedback on PR #2016: TestLLM (no LiteLLM mocking)

Overall assessment

This is a solid direction and does make the test suite better.

The biggest win is decoupling tests from LiteLLM transport objects (ModelResponse, Choices, etc.) and from patching litellm_completion (which is both verbose and a subtle source of global-state leakage, because LiteLLM has global knobs like modify_params). A real LLM subclass that can be injected everywhere is the right shape.

I also like that you put it under openhands.sdk.testing rather than under openhands.sdk.llm.* — that answers the concern raised in the issue thread: users can and do implement their own LLM subclasses, so a test-double living in testing feels appropriate and keeps the runtime surface cleaner.

What I learned from reading the code / LiteLLM internals

I looked at the actual call sites in LLM:

  • LLM._transport_call() (completion path) calls litellm_completion(...) and expects a litellm.types.utils.ModelResponse.
  • LLM.responses() calls litellm_responses(...) and expects a ResponsesAPIResponse (or a streaming iterator).

TestLLM avoids both of those by overriding completion() / responses() directly, so tests never hit LiteLLM.

In our local venv here, LiteLLM is 1.80.10 and ModelResponse is a Pydantic model with fields: id, created, model, object, choices, .... The PR constructs a minimal-but-valid ModelResponse with a single Choices(message=...), which is enough for LLMResponse.id to work and for any downstream debugging that inspects raw_response.

API/design nits (actionable)

These aren’t blockers, but they are the spots I’d tighten before merging:

  1. _return_metrics currently ignored

    • completion(..., _return_metrics=True) is a supported calling convention in the SDK. TestLLM currently always returns metrics, but always zeroed.
    • Suggestion: either (a) respect _return_metrics and return None/an empty snapshot when false (to match expectations), or (b) document in the docstring that TestLLM always returns zero metrics.
  2. uses_responses_api() returning False is good, but responses() is overridden anyway

    • Returning False is consistent with the intent (“don’t force tests to implement both paths”), but since responses() is implemented, uses_responses_api is mostly relevant only for external code that chooses which method to call.
    • Not an issue; just noting the API coupling.
  3. _content_to_string() drops non-text content silently

    • It only includes TextContent. If a scripted message contains ImageContent etc, the generated raw_response will have content="".
    • Probably fine because tests should assert on Message not on raw_response, but it might surprise people when debugging.
    • Suggestion: either include a sentinel like "[non-text content omitted]" or explicitly mention this behavior.
  4. Scripted response queue mutability

    • completion() consumes by pop(0). That’s totally fine for tests, but pop(0) is O(n).
    • Suggestion: use collections.deque for scripted responses (minor perf / cleanliness).

(Edit (Vasco): I changed to use deque; another point in favour of that was that the order of the messages was inverted. See the sketch after this list.)

  5. Error ergonomics
    • TestLLMExhaustedError message is already helpful.
    • A further small improvement would be to include the last user message (or call index) to help debug test failures faster. (Optional.)
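A minimal standalone sketch of the deque-based queue idea from point 4; it mirrors the PR's naming but is an illustration, not the PR's actual implementation:

from collections import deque


class TestLLMExhaustedError(RuntimeError):
    """Raised when the scripted responses run out (name mirrors the PR)."""


class ScriptedQueue:
    """FIFO queue of scripted messages/exceptions; popleft() is O(1)."""

    def __init__(self, responses):
        self._queue = deque(responses)
        self.call_count = 0

    def next(self):
        self.call_count += 1
        if not self._queue:
            raise TestLLMExhaustedError(
                f"No scripted responses left after {self.call_count - 1} call(s)."
            )
        item = self._queue.popleft()
        if isinstance(item, BaseException):
            # Scripted exceptions are raised instead of returned.
            raise item
        return item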

How we would use this (beyond the demo test)

This utility will pay off most in tests that currently patch litellm_completion and have to build complex tool-call payloads.

Patterns I expect to use a lot:

  • Testing tool loops: script assistant(tool_calls=[...]) then script the final assistant message.
  • Testing router/fallback behavior: script TestLLM for primary to raise an exception, then have fallback TestLLM return a message. (Right now TestLLM only scripts messages, but you can extend it to also script exceptions.)
  • Testing condensers: condensers call LLMs internally; being able to inject TestLLM avoids patching deep internals.

CLI usefulness?

If the question is “is this useful for the end-user CLI at runtime?” probably not directly — it’s a test double.

But it is useful for CLI integration tests and for “repro scripts” where we want deterministic agent runs without network calls.

If we want a runtime-feature analogue, that’s more like a “replay / transcript LLM” (load a JSONL of prompts/responses). TestLLM is close to that, but it currently lives (correctly) as a testing utility.
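As a rough illustration of that replay idea (the JSONL schema with role/text fields and the helper name are assumptions, not an existing format):

import json
from pathlib import Path

from openhands.sdk.llm import Message, TextContent
from openhands.sdk.testing import TestLLM


def load_transcript_llm(path: Path) -> TestLLM:
    # Each JSONL line is assumed to look like {"role": "assistant", "text": "..."}.
    messages = []
    for line in path.read_text().splitlines():
        record = json.loads(line)
        messages.append(
            Message(role=record["role"], content=[TextContent(text=record["text"])])
        )
    return TestLLM.from_messages(messages)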

Compatibility / maintenance risks

  • The only real coupling to LiteLLM is the minimal ModelResponse construction. If LiteLLM changes these Pydantic models significantly, TestLLM might need updates.
    • Mitigation: keep the raw_response creation minimal (as you did), and consider making raw_response optional in LLMResponse long-term (bigger change; not for this PR).

One meta point: naming / discoverability

from openhands.sdk.testing import TestLLM is good.

If you want extra discoverability, you might consider exporting it via docs / a short section in the SDK testing guide, but that’s outside the scope of this PR.


Net: 👍 on the direction and module placement. A couple small tweaks (metrics flag + content conversion note) would make it even smoother.


@malhotra5
Collaborator

malhotra5 commented Feb 11, 2026

Ooh this is cool! Super helpful that we can compose the responses with typed objects rather than mocking json strings for llm responses

Is there a way to support multi turn llm responses? (emit all messages in the queue until a "finish" response?)

@VascoSch92
Contributor Author

Ooh this is cool! Super helpful that we can compose the responses with typed objects rather than mocking json strings for llm responses

Is there a way to support multi turn llm responses? (emit all messages in the queue until a "finish" response?)

I think it already supports that. For example:

from pathlib import Path

from openhands.sdk import Agent, Conversation
from openhands.sdk.llm.message import Message, TextContent
from openhands.sdk.testing import TestLLM

llm = TestLLM.from_messages([
    Message(role="assistant", content=[TextContent(text="Hello!")]),
    Message(role="assistant", content=[TextContent(text="Goodbye!")]),
])
agent = Agent(llm=llm)
conversation = Conversation(agent=agent, workspace=Path.cwd())
conversation.send_message("Hey")
conversation.run()
conversation.send_message("Goodbye")
conversation.run()

This fakes the conversation:

Message from User ─────────────────────────────────────────────────────────

Hey

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from Agent ─────────────────────────────────────────────────────────

Hello!

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from User ─────────────────────────────────────────────────────────

Goodbye

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Message from Agent ─────────────────────────────────────────────────────────

Goodbye!

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

@VascoSch92
Contributor Author

@malhotra5 Actually integration with tools is also working.

For example

if __name__ == "__main__":
    from pathlib import Path

    from openhands.sdk import Agent, Conversation
    from openhands.sdk.llm.message import Message, MessageToolCall, TextContent
    from openhands.sdk.testing import TestLLM

    llm = TestLLM.from_messages([
        Message(
            role="assistant",
            content=[TextContent(text="Oh yeah. Let's call the finish tool")],
            tool_calls=[
                MessageToolCall(
                    id="call_1",
                    name="finish",
                    arguments='{"message": "I like trains"}',
                    origin="completion",
                )
            ],
        ),
        Message(role="assistant", content=[TextContent(text="This message will never been displayed")]),
    ])
    agent = Agent(llm=llm)
    conversation = Conversation(agent=agent, workspace=Path.cwd())
    conversation.send_message("Let's see what happens with tools :-)")
    conversation.run()

This fakes the conversation:

Message from User ─────────────────────────────────────────────────────────

Let's see what happens with tools :-)

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

Agent Action ─────────────────────────────────────────────────────────

Summary: finish: {"message": "I like trains"}

Thought:
Oh yeah. Let's call the finish tool

Finish with message:
I like trains

Tokens: ↑ input 0 • cache hit N/A • ↓ output 0 • $ 0.00

This is pretty nice... and it also shows how well structured the SDK is :-)

I think this class can be really helpful for the SDK test suite and the end-to-end CLI tests.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Review Thread Gate

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2016 at branch `openhands/test-llm-profile`

Feel free to include any additional details that might help me get this PR into a better state.


@VascoSch92 VascoSch92 changed the title feat: Add TestLLM class for testing without mocking LiteLLM feat(SDK): Add TestLLM class for testing without mocking LiteLLM Feb 12, 2026
@VascoSch92 VascoSch92 changed the title feat(SDK): Add TestLLM class for testing without mocking LiteLLM feat(sdk): Add TestLLM class for testing without mocking LiteLLM Feb 12, 2026
@malhotra5
Collaborator

malhotra5 commented Feb 12, 2026

This is great! Thanks for clarifying the behavior; this seems super powerful.

Do we want the ability to raise known LLM error types as well? Could be great if we raise a budget error or a context window error for tests (doesn't have to be in this PR).

@VascoSch92
Contributor Author

VascoSch92 commented Feb 13, 2026

This is great! Thanks for clarifying the behavior; this seems super powerful.

Do we want the ability to raise known LLM error types as well? Could be great if we raise a budget error or a context window error for tests (doesn't have to be in this PR).

I made a couple of changes and now you can pass exceptions, which will be raised. In this way you can also test behaviour when errors happen.

from pathlib import Path

from openhands.sdk import Agent, Conversation
from openhands.sdk.llm.message import Message, TextContent
from openhands.sdk.testing import TestLLM

# LLMContextWindowExceedError also needs to be imported from the SDK's LLM
# exceptions; its exact import path is not shown here.

llm = TestLLM.from_messages(
    [
        Message(role="assistant", content=[TextContent(text="Hello!")]),
        Message(role="assistant", content=[TextContent(text="Goodbye!")]),
        LLMContextWindowExceedError("context too long"),
    ]
)
agent = Agent(llm=llm)
conversation = Conversation(agent=agent, workspace=Path.cwd())
conversation.send_message("Hey")
conversation.run()
conversation.send_message("Goodbye")
conversation.run()
conversation.send_message("I forgot something!")
conversation.run()

This will raise LLMContextWindowExceedError

@VascoSch92 VascoSch92 marked this pull request as ready for review February 13, 2026 10:16
@VascoSch92 VascoSch92 requested a review from enyst February 13, 2026 10:17
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Elegant solution that dramatically simplifies test writing (20 lines → 3 lines). Core idea is sound, but found a few important issues around silent data loss and missing test coverage for the testing utility itself.

"""Convert message content to a string."""
from openhands.sdk.llm.message import TextContent

parts = []

🟠 Important: _content_to_string only handles TextContent and silently ignores other content types like ImageContent. This could mask bugs in tests where messages accidentally contain unsupported content types.

Suggest either:

  1. Handle all content types appropriately, or
  2. Raise an error for unsupported types with a clear message
Suggested change
        parts = []
    def _content_to_string(self, message: Message) -> str:
        """Convert message content to a string."""
        from openhands.sdk.llm.message import TextContent
        parts = []
        for item in message.content:
            if isinstance(item, TextContent):
                parts.append(item.text)
            else:
                raise ValueError(
                    f"TestLLM only supports TextContent, got {type(item).__name__}"
                )
        return "\n".join(parts)

>>> llm.completion([...]) # returns "OK"
>>> llm.completion([...]) # raises LLMContextWindowExceedError
"""


🟠 Important: No unit tests for TestLLM itself. A testing utility should have its own tests to ensure:

  • _create_model_response produces valid LiteLLM responses
  • _content_to_string handles edge cases correctly
  • TestLLMExhaustedError is raised when responses are exhausted
  • Exception responses in the queue are raised correctly
  • remaining_responses and call_count properties work

Without these tests, bugs in TestLLM could cause confusing test failures.
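For instance, a couple of these could look roughly like the following; the import paths and the positional completion() call are assumptions based on this PR's description:

import pytest

from openhands.sdk.llm import Message, TextContent
from openhands.sdk.testing import TestLLM, TestLLMExhaustedError


def _user(text: str) -> list[Message]:
    # Small helper to build a one-message user prompt.
    return [Message(role="user", content=[TextContent(text=text)])]


def test_exhausted_queue_raises():
    llm = TestLLM.from_messages(
        [Message(role="assistant", content=[TextContent(text="Done!")])]
    )
    llm.completion(_user("hi"))  # consumes the only scripted response
    with pytest.raises(TestLLMExhaustedError):
        llm.completion(_user("hi again"))


def test_exception_in_queue_is_raised():
    llm = TestLLM.from_messages([RuntimeError("boom")])
    with pytest.raises(RuntimeError):
        llm.completion(_user("hi"))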

tools=tools,
_return_metrics=_return_metrics,
add_security_risk_prediction=add_security_risk_prediction,
on_token=on_token,

🟡 Suggestion: uses_responses_api() always returning False means TestLLM cannot be used to test code paths that specifically use the responses API.

This is a significant limitation that should be documented in the class docstring, or consider adding a parameter to control this behavior:

def __init__(self, *, use_responses_api: bool = False, **data: Any) -> None:
    self._use_responses_api = use_responses_api
    # ...

def uses_responses_api(self) -> bool:
    return self._use_responses_api

prompt_tokens=0,
completion_tokens=0,
),
)

🟡 Suggestion: _create_model_response creates a minimal ModelResponse that might not match real LLM behavior. Real LLMs may include additional fields that downstream code depends on (e.g., usage stats, model-specific metadata).

Consider adding integration tests that verify TestLLM produces responses processed identically to real LLM responses by the Agent/Conversation machinery. This would catch cases where the minimal response causes different behavior than production.

class SignalingExecutor(
    ToolExecutor[StatusTransitionMockAction, StatusTransitionMockObservation]
):
    """Executor that signals when execution starts and captures status."""

🟡 Suggestion: SignalingExecutor is defined inline here, but StatusCheckingExecutor is still referenced in the module (imported but not shown in diff). Consider consolidating these test executors:

  • Move both to conftest.py as reusable fixtures, or
  • Use SignalingExecutor consistently throughout the file

This follows the repo guideline: "If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py!"

@enyst enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026