Skip to content

Managed inference gateway returns 200 with content:null on token exhaustion #2118

Description

@latenighthackathon

Surfaced via NVIDIA/NemoClaw#4398. POSTing to the managed inference endpoint (inference.local -> gateway -> vLLM/NIM) with a low max_tokens against a reasoning model returns HTTP 200 with choices[0].message.content == null and finish_reason: "length" when the token budget is consumed by reasoning_content. OpenAI-compatible clients that concatenate content then fail (can only concatenate str (not NoneType) to str).

content: null with finish_reason: "length" is technically valid OpenAI-compatible output, but it is a sharp edge for clients. NemoClaw is config/probe-only and is not on the /v1/chat/completions request/response path, so smoothing this needs to live in the gateway passthrough or the model server.

Suggested: on finish_reason: "length" with empty assistant content, return the accumulated partial text (or a clear content: "") rather than null, so clients do not receive a null content field on a successful (200) response.

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions