Surfaced via NVIDIA/NemoClaw#4398. POSTing to the managed inference endpoint (inference.local -> gateway -> vLLM/NIM) with a low max_tokens against a reasoning model returns HTTP 200 with choices[0].message.content == null and finish_reason: "length" when the token budget is consumed by reasoning_content. OpenAI-compatible clients that concatenate content then fail (can only concatenate str (not NoneType) to str).
content: null with finish_reason: "length" is technically valid OpenAI-compatible output, but it is a sharp edge for clients. NemoClaw is config/probe-only and is not on the /v1/chat/completions request/response path, so smoothing this needs to live in the gateway passthrough or the model server.
Suggested: on finish_reason: "length" with empty assistant content, return the accumulated partial text (or a clear content: "") rather than null, so clients do not receive a null content field on a successful (200) response.
Surfaced via NVIDIA/NemoClaw#4398. POSTing to the managed inference endpoint (
inference.local-> gateway -> vLLM/NIM) with a lowmax_tokensagainst a reasoning model returns HTTP 200 withchoices[0].message.content == nullandfinish_reason: "length"when the token budget is consumed byreasoning_content. OpenAI-compatible clients that concatenatecontentthen fail (can only concatenate str (not NoneType) to str).content: nullwithfinish_reason: "length"is technically valid OpenAI-compatible output, but it is a sharp edge for clients. NemoClaw is config/probe-only and is not on the/v1/chat/completionsrequest/response path, so smoothing this needs to live in the gateway passthrough or the model server.Suggested: on
finish_reason: "length"with empty assistantcontent, return the accumulated partial text (or a clearcontent: "") rather thannull, so clients do not receive a null content field on a successful (200) response.