
refactor(llm): use litellm params for reasoning support#1990

Merged
enyst merged 17 commits into main from cleanup-reasoning-models on Mar 19, 2026

Conversation

@enyst
Collaborator

@enyst enyst commented Feb 10, 2026

Summary

  • derive reasoning_effort support from LiteLLM and remove the hardcoded model list
  • normalize litellm_proxy-prefixed models before checking supported params
  • test expectations match the old hardcoded reasoning-model list
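
A minimal sketch of this approach, assuming hypothetical helper names (normalize_model, supports_reasoning_effort) rather than the SDK's actual identifiers:

```python
def normalize_model(model: str) -> str:
    """Strip the SDK-owned litellm_proxy/ wrapper prefix so LiteLLM
    sees the real provider/model string."""
    prefix = "litellm_proxy/"
    if model.startswith(prefix):
        model = model[len(prefix):]
    return model

def supports_reasoning_effort(model: str) -> bool:
    """Check whether LiteLLM reports reasoning_effort among the
    model's supported OpenAI-style parameters."""
    from litellm import get_supported_openai_params  # lazy import
    params = get_supported_openai_params(model=normalize_model(model)) or []
    return "reasoning_effort" in params

# The normalization step on its own:
print(normalize_model("litellm_proxy/openai/o1"))  # openai/o1
```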

Testing

  • uv run pre-commit run --files openhands-sdk/openhands/sdk/llm/utils/model_features.py
  • uv run pre-commit run --files tests/sdk/llm/test_model_features.py

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:4704427-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-4704427-python \
  ghcr.io/openhands/agent-server:4704427-python

All tags pushed for this build

ghcr.io/openhands/agent-server:4704427-golang-amd64
ghcr.io/openhands/agent-server:4704427-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:4704427-golang-arm64
ghcr.io/openhands/agent-server:4704427-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:4704427-java-amd64
ghcr.io/openhands/agent-server:4704427-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:4704427-java-arm64
ghcr.io/openhands/agent-server:4704427-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:4704427-python-amd64
ghcr.io/openhands/agent-server:4704427-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:4704427-python-arm64
ghcr.io/openhands/agent-server:4704427-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:4704427-golang
ghcr.io/openhands/agent-server:4704427-java
ghcr.io/openhands/agent-server:4704427-python

About Multi-Architecture Support

  • Each variant tag (e.g., 4704427-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 4704427-python-amd64) are also available if needed

Derive reasoning_effort support from get_supported_openai_params and
update Nova 2 Lite expectations in tests.

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Works but introduces risks without adequate safeguards

Core Trade-off: You replaced a simple, explicit list with a runtime dependency on LiteLLM. This is DRY and reduces maintenance, but sacrifices control and introduces failure modes. The old approach was "worse is better" - ugly but bulletproof. The new approach is elegant but brittle.

Verdict: Needs rework. The bare except Exception: is unacceptable, the cache is unbounded, and error paths are untested.

Key Insight: Dynamic lookups are only better than static lists when the external source is more reliable than you are—prove that LiteLLM is.


all-hands-bot

This comment was marked as outdated.

@all-hands-bot (Collaborator) left a comment

Test comment for line 83

@all-hands-bot (Collaborator) left a comment

Test comment for line 4

@all-hands-bot (Collaborator) left a comment

Test comment for test file line 35

@all-hands-bot (Collaborator) left a comment

Additional Critical Issues

Following up on my initial review with remaining inline comments:

all-hands-bot

This comment was marked as outdated.

@all-hands-bot (Collaborator) left a comment

🟠 Testing Gaps - Critical Missing Coverage

You added complex logic (prefix stripping, provider splitting, exception handling, LRU cache) but only test the happy path through get_features().

Missing test coverage for _supports_reasoning_effort():

  1. Error handling: What happens when get_supported_openai_params raises an exception? Your bare except Exception: hides this.

  2. LRU cache behavior: Does it actually work? Test: Call same model twice, verify only one LiteLLM call made.

  3. Edge cases:

    • Empty string: _supports_reasoning_effort("")
    • None: _supports_reasoning_effort(None)
    • Whitespace-only: _supports_reasoning_effort(" ")
    • Malformed provider: _supports_reasoning_effort("a/b/c/d")
  4. Prefix handling: Does litellm_proxy/ prefix stripping work correctly?

    • _supports_reasoning_effort("litellm_proxy/o1")
    • _supports_reasoning_effort("litellm_proxy/openai/o1")
  5. Custom provider splitting: Does it handle multiple slashes correctly?

Tests should prove the code works, not just assert it was called. Add unit tests for _supports_reasoning_effort directly (yes, it's private, but it has complex logic that deserves dedicated tests).

Bottom line: Without these tests, you're flying blind. When LiteLLM changes or edge cases hit production, you'll have no safety net.
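
Point 2 above (LRU cache behavior) is straightforward to prove with a counting stub; a minimal sketch, with fake_lookup standing in for the real LiteLLM call (names are illustrative, not the SDK's):

```python
from functools import lru_cache

calls = {"n": 0}

def fake_lookup(model: str) -> bool:
    # Stand-in for the LiteLLM capability lookup; counts invocations.
    calls["n"] += 1
    return model.startswith("o1")

@lru_cache(maxsize=None)
def supports(model: str) -> bool:
    return fake_lookup(model)

assert supports("o1-mini") is True
assert supports("o1-mini") is True  # second call served from cache
assert calls["n"] == 1              # underlying lookup ran exactly once
```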

Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@enyst
Collaborator Author

enyst commented Feb 11, 2026

@OpenHands The goal of this PR is good, but the implementation is bad. Remove the cached thing.

LiteLLM has a supports_reasoning method or something named like that, right? Make the project with uv, find litellm codebase in .venv, investigate.

Let's use that for detecting which models support reasoning. Commit, run pre-commit, push.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

Final summary (new changes since the last PR state)

Checklist against the request

  • Removed the cached thing: deleted the @lru_cache-based reasoning-support detection.
  • Investigated LiteLLM in the uv-managed .venv: confirmed litellm.utils.supports_reasoning exists (and inspected its signature/source).
  • Used LiteLLM’s built-in method for detection: _supports_reasoning_effort() now delegates to supports_reasoning(...) after normalizing the model string (including litellm_proxy/ prefix handling).
  • Committed the change: refactor(llm): use litellm supports_reasoning (commit c6847155).
  • Ran pre-commit: uv run pre-commit run --all-files passes.
  • Pushed to origin/cleanup-reasoning-models (PR updated).
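
The delegation described above can be sketched as follows (the helper name is illustrative; only the litellm_proxy/ stripping is logic the SDK owns):

```python
def supports_reasoning_effort(model: str) -> bool:
    """Sketch: strip the SDK-owned litellm_proxy/ wrapper prefix,
    then let LiteLLM answer its own capability question."""
    model = model.removeprefix("litellm_proxy/")  # Python 3.9+
    from litellm.utils import supports_reasoning  # lazy import
    return supports_reasoning(model=model)

# The prefix handling on its own:
print("litellm_proxy/openai/o1".removeprefix("litellm_proxy/"))  # openai/o1
```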

Conciseness / extraneous changes check

  • The diff is limited to one file: openhands-sdk/openhands/sdk/llm/utils/model_features.py.
  • No unrelated refactors/tests/docs were changed.

Branch is clean and fully pushed (no pending local changes).

View full conversation

@enyst enyst marked this pull request as draft February 11, 2026 21:42
Collaborator Author

enyst commented Feb 12, 2026

HUMAN:

So my problem here with your code is this: I think litellm does more complex code to find out what the real provider is. We always send to litellm "full model name" which includes provider, so litellm figured it out. We never did extract provider ourselves.

Why it matters: apart from the obvious nice cases, there are cases like "openrouter/anthropic/sonnet-4-5", or "bedrock/anthropic/haiku-3-5:v1", or "openai/llama-4" (which is NOT openai, but it is openai-compatible provider, including custom, local, provider). Litellm knows how to deal with them and extract provider correctly.

Investigate its code. Find out if litellm has some method we can call so that it parses provider from our model, instead of duplicating it ourselves.

ANSWER:

I investigated LiteLLM 1.80.10 (the version in this repo's uv .venv). There is a built-in provider+model parser, and supports_reasoning() already uses it.

What LiteLLM uses internally

litellm.utils.supports_reasoning() is implemented as:

  • supports_reasoning() delegates to _supports_factory(model, custom_llm_provider, key="supports_reasoning")
  • _supports_factory() calls litellm.get_llm_provider(model=..., custom_llm_provider=...) first.

Source (from .venv/site-packages/litellm/utils.py):

model, custom_llm_provider, _, _ = litellm.get_llm_provider(
    model=model, custom_llm_provider=custom_llm_provider
)

So the intended public entrypoint for parsing a "full model name" is:

  • litellm.get_llm_provider(...)

It returns a tuple:

  • (model, custom_llm_provider, dynamic_api_key, api_base)

and the implementation contains a lot of special-case logic beyond a naive split("/", 1):

  • supports provider prefixes in the model string
  • handles OpenAI-compatible endpoints via api_base
  • has provider-specific routing rules (azure non-openai models, anthropic_text vs anthropic, cohere_chat vs cohere, etc.)

Implication for our code

We should not pre-split the provider ourselves just to call supports_reasoning().

Instead we can pass the full model string (after stripping our own litellm_proxy/ prefix if needed), and call:

  • supports_reasoning(model=full_model_string, custom_llm_provider=None)

LiteLLM will internally call get_llm_provider() and do the complex parsing.

About the examples you mentioned

Your examples like:

  • openrouter/anthropic/sonnet-4-5
  • bedrock/anthropic/haiku-3-5:v1
  • openai/llama-4

are exactly the kind of thing get_llm_provider() is meant to normalize (and this is the same function supports_reasoning() relies on). So the safest approach is to delegate parsing to LiteLLM by passing the full model string.
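
A quick illustration of why a hand-rolled split is not provider resolution (naive_provider is hypothetical; real code should delegate to litellm.get_llm_provider):

```python
def naive_provider(model: str) -> str:
    """What a hand-rolled split would claim the provider is."""
    return model.split("/", 1)[0]

# The first segment is often a routing/wrapper layer, not the vendor:
assert naive_provider("openrouter/anthropic/sonnet-4-5") == "openrouter"
assert naive_provider("bedrock/anthropic/haiku-3-5:v1") == "bedrock"
# "openai/llama-4" is an OpenAI-compatible endpoint serving a
# non-OpenAI model; a naive split cannot distinguish that case.
assert naive_provider("openai/llama-4") == "openai"

def resolved_provider(model: str) -> str:
    """Delegate parsing to LiteLLM's public entrypoint instead."""
    from litellm import get_llm_provider  # lazy import
    model, provider, _api_key, _api_base = get_llm_provider(model=model)
    return provider
```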

(As a side note: when I ran supports_reasoning() on those exact strings locally, it returned False for them in 1.80.10, but that's about capability metadata, not parsing — parsing is still happening via get_llm_provider().)

@OpenHands OpenHands deleted a comment from openhands-ai bot Feb 12, 2026
@OpenHands OpenHands deleted a comment from openhands-ai bot Feb 12, 2026
@OpenHands OpenHands deleted a comment from openhands-ai bot Feb 12, 2026
@github-actions
Contributor

github-actions bot commented Feb 14, 2026

Coverage

Coverage Report

File                                                     Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/llm/options/chat_options.py     40     3    92%  74–76
openhands-sdk/openhands/sdk/llm/utils/model_features.py     48     1    97%  34
TOTAL                                                    20344  5917    70%

@enyst enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026
@OpenHands OpenHands deleted a comment from all-hands-bot Mar 17, 2026
@OpenHands OpenHands deleted a comment from all-hands-bot Mar 17, 2026
@enyst
Collaborator Author

enyst commented Mar 17, 2026

@OpenHands understand this pr, then merge main branch into it and fix conflicts; push.

@openhands-ai

openhands-ai bot commented Mar 17, 2026

I'm on it! enyst can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Mar 17, 2026

Final summary

Since the last summary, I completed the requested PR maintenance work:

  • Understood the PR context by reviewing PR refactor(llm): use litellm params for reasoning support #1990 metadata and the branch diff against main.
  • Merged main into cleanup-reasoning-models.
  • Resolved merge conflicts in:
    • .gitignore
    • openhands-sdk/openhands/sdk/llm/utils/model_features.py
  • Preserved the PR’s intended behavior:
    • kept the LiteLLM-based supports_reasoning(...) reasoning detection
    • kept the PR’s related LLM option/test changes
    • kept both .agent_tmp/ and tests/integration/api_compliance/outputs/ in .gitignore
  • Validated the result with:
    • make build
    • targeted uv run pre-commit run --files ...
    • targeted uv run pytest tests/sdk/llm/test_model_features.py tests/sdk/llm/test_chat_options.py
  • Committed and pushed the merge resolution to the PR branch:
    • commit: 49ec97d1
    • branch: cleanup-reasoning-models

Conciseness check

  • The new changes were concise and limited to merge/conflict resolution plus validation.
  • I did not introduce unrelated edits beyond what was necessary to complete the merge cleanly.

Result

The request was fully addressed: main has been merged into the PR branch, conflicts were fixed, validation passed, and the updated branch was pushed to GitHub.

@github-actions
Contributor

github-actions bot commented Mar 17, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions
Contributor

github-actions bot commented Mar 17, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@enyst
Collaborator Author

enyst commented Mar 19, 2026

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 19, 2026

I'm on it! enyst can track my progress at all-hands.dev

@enyst (Collaborator Author) left a comment

🟢 Good taste - this is the right simplification.

You removed the dumb duplicate model list and delegated provider parsing back to LiteLLM, which is exactly where that logic belongs. The litellm_proxy/ normalization is the only special case you actually own, and keeping that wrapper handling local is good taste.

I don’t see an important correctness issue in the current diff:

  • reasoning support now follows LiteLLM’s provider/model resolution instead of our partial reimplementation
  • the Gemini 2.5 Pro default is preserved
  • tests cover the real behavior change, including provider-prefixed and litellm_proxy/ model IDs

I also re-ran the targeted suite locally:

  • uv run pytest -q tests/sdk/llm/test_model_features.py tests/sdk/llm/test_chat_options.py
  • result: 142 passed

VERDICT: ✅ Worth merging from a code-quality standpoint.

KEY INSIGHT: The real fix here is not the helper function — it’s deleting our fake source of truth and letting LiteLLM answer its own capability question.

One caveat: this does change LLM capability detection / request shaping, so I’m leaving a comment rather than an approval and would still want a human maintainer to make the final merge call after whatever lightweight eval confidence they consider sufficient.

@openhands-ai

openhands-ai bot commented Mar 19, 2026

Final summary

Since the last summary, I completed the requested PR comment handling work:

  • Reviewed PR refactor(llm): use litellm params for reasoning support #1990 in context: checked the current branch, PR metadata, changed files, prior comments, and relevant AGENTS.md guidance.
  • Validated the current state locally by re-running the targeted tests:
    • uv run pytest -q tests/sdk/llm/test_model_features.py tests/sdk/llm/test_chat_options.py
    • Result: 142 passed
  • Processed the /codereview-roasted comment and posted a GitHub comment review on PR refactor(llm): use litellm params for reasoning support #1990.
  • Followed the repo review policy: I left a comment, not an approval, because this PR changes LLM capability detection / request shaping and should get a human maintainer’s final merge decision.

Review conclusion I posted

  • I did not find an important correctness issue in the current diff.
  • I noted that the PR is a good simplification:
    • it removes the duplicate local reasoning-support list,
    • delegates provider/model parsing back to LiteLLM,
    • keeps the litellm_proxy/ handling as the only local special case,
    • and preserves the Gemini 2.5 Pro default behavior.

Conciseness check

  • No code changes were made in this step.
  • The work was limited to review + validation + posting the PR response.
  • There are no extraneous repository changes to revert.

Overall, the request has been fully addressed.

@enyst enyst marked this pull request as ready for review March 19, 2026 19:23
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Solid simplification but with pragmatic concerns

The core refactor is good taste: you deleted the duplicate model list and delegated to LiteLLM's capability metadata. That's the right call. However, the Gemini special case introduces defensive logic that feels fragile, and the silent exception handling makes debugging harder.

@enyst
Collaborator Author

enyst commented Mar 19, 2026

@OpenHands use gh to read the AI review threads (and my reply to one of them, do that), and fix the others if you find them reasonable. Push. Babysit PR until CI is green. Then ask for re-review from this bot, the same bot.

@openhands-ai

openhands-ai bot commented Mar 19, 2026

I'm on it! enyst can track my progress at all-hands.dev

Trust LiteLLM's current Gemini reasoning support handling and drop the obsolete Gemini 2.5 default-low special case. Also remove the redundant wrapper around LiteLLM's own supports_reasoning error handling and update the targeted chat options regression test.

Co-authored-by: openhands <openhands@all-hands.dev>
@enyst enyst requested a review from all-hands-bot March 19, 2026 19:50
@all-hands-bot (Collaborator) left a comment

🟢 Good taste - Clean refactor that eliminates duplicate model lists.

You delegated reasoning capability detection to LiteLLM's metadata, which is exactly where that logic belongs. The behavior changes (nova-2-lite, OpenRouter models) align with LiteLLM's current capabilities and are documented in tests.

tests/integration/outputs/
tests/integration/api_compliance/outputs/

# Agent-generated temp
A Collaborator left an inline comment

🟢 Acceptable: This addition is fine but appears unrelated to the reasoning effort refactor. Minor organizational nit - consider keeping PRs focused on a single concern when possible.

@openhands-ai

openhands-ai bot commented Mar 19, 2026

No new changes since my last summary.

Final status

  • Read the AI review threads with gh, including enyst’s reply
  • Fixed the remaining reasonable issues
  • Pushed the changes to cleanup-reasoning-models
  • Replied to and resolved the addressed review threads
  • Babysat the PR until CI was fully green
  • Requested re-review from the same bot (all-hands-bot)

New changes included in that final round

  • Removed the stale Gemini 2.5 reasoning_effort="low" workaround from chat_options.py
  • Removed the redundant outer except Exception around LiteLLM supports_reasoning() in model_features.py
  • Updated the Gemini chat-options test to match the new behavior
  • Commit pushed: 485f29e7 (fix(llm): remove stale gemini reasoning workaround)

Verification

  • Targeted pre-commit checks passed on all edited files
  • Targeted pytest for test_model_features.py and test_chat_options.py passed
  • PR checks are green
  • The OpenHands re-review workflow completed successfully after requesting all-hands-bot

Conciseness

  • Changes are concise and directly related to the review feedback
  • No extraneous file changes remain
  • Working tree is clean

PR: #1990

@enyst enyst added the run-eval-1 Runs evaluation on 1 SWE-bench instance label Mar 19, 2026
@github-actions
Contributor

Evaluation Triggered

@all-hands-bot
Collaborator

🎉 Swebench Evaluation Complete

Evaluation: 23314477099-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: benchmark (test)
Commit: 485f29e748253eac098d60574ffab0be0173db8d
Timestamp: 26-03-19-20-24
Triggered by: @enyst

📊 Results

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Eval limit: 1
  • Success rate: 1/1 (100.0%)

🔗 Links

Full Archive

@enyst enyst merged commit 26db836 into main Mar 19, 2026
53 of 54 checks passed
@enyst enyst deleted the cleanup-reasoning-models branch March 19, 2026 20:37

Labels

run-eval-1 Runs evaluation on 1 SWE-bench instance
