Managed inference gateway returns 200 with content:null on token exhaustion

Surfaced via NVIDIA/NemoClaw#4398. POSTing to the managed inference endpoint (`inference.local` -> gateway -> vLLM/NIM) with a low `max_tokens` against a reasoning model returns HTTP 200 with `choices[0].message.content == null` and `finish_reason: "length"` when the token budget is consumed by `reasoning_content`. OpenAI-compatible clients that concatenate `content` then fail (`can only concatenate str (not NoneType) to str`).

`content: null` with `finish_reason: "length"` is technically valid OpenAI-compatible output, but it is a sharp edge for clients. NemoClaw is config/probe-only and is not on the `/v1/chat/completions` request/response path, so smoothing this needs to live in the gateway passthrough or the model server.

**Suggested:** on `finish_reason: "length"` with empty assistant `content`, return the accumulated partial text (or a clear `content: ""`) rather than `null`, so clients do not receive a null content field on a successful (200) response.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Managed inference gateway returns 200 with content:null on token exhaustion #2118

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Managed inference gateway returns 200 with content:null on token exhaustion #2118

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions