feat(multimodal): audio input via OpenAI input_audio content parts #267

Merged
bernardladenthin merged 1 commit into bernardladenthin:main from vaiju1981:feat/audio-input on Jun 21, 2026
@vaiju1981

Summary

Add audio input to the typed multimodal API, extending the vision work to audio
(llama.cpp discussion #13759: Ultravox /
Qwen2-Audio / Qwen2.5-Omni). No native/JNI change is needed — upstream b9739 already decodes the
OpenAI input_audio content part (server-common.cpp) into the same media buffer vision uses, which
the JNI bridge already threads to mtmd's audio pipeline (mtmd_support_audio /
mtmd_bitmap_init_from_audio). The only gap was the Java typed API:

  • ContentPart — new INPUT_AUDIO kind + inputAudio(byte[], "wav"|"mp3") / audioFile(Path)
    factories (extension → format), with base64 data + format accessors.
  • ParameterJsonSerializer.buildMessages emits
    {"type":"input_audio","input_audio":{"data","format"}}; ChatMessage.concatText already skips
    non-text parts, so getContent() is unaffected.
  • LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already
    existed, fed by the native meta's modalities.audio).
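
The validation and JSON shape described in the bullets above can be sketched as follows. This is a minimal, self-contained illustration, not the library's `ContentPart` or `ParameterJsonSerializer`: the class `AudioPartSketch` and its methods `formatFromExtension` and `toJson` are hypothetical names, and the real implementation may validate or serialize differently.

```java
import java.nio.file.Path;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical stand-in for an audio content part; NOT the library's ContentPart. */
final class AudioPartSketch {
    private final String base64Data;
    private final String format;

    private AudioPartSketch(String base64Data, String format) {
        this.base64Data = base64Data;
        this.format = format;
    }

    /** Accepts only "wav" or "mp3" (case-insensitive) and base64-encodes the bytes. */
    static AudioPartSketch inputAudio(byte[] audio, String format) {
        if (audio == null || format == null) {
            throw new IllegalArgumentException("audio and format must be non-null");
        }
        // Locale.ROOT keeps the lowercasing independent of the JVM default locale.
        String normalized = format.toLowerCase(Locale.ROOT);
        if (!normalized.equals("wav") && !normalized.equals("mp3")) {
            throw new IllegalArgumentException("unsupported audio format: " + format);
        }
        return new AudioPartSketch(Base64.getEncoder().encodeToString(audio), normalized);
    }

    /** Derives the format string from the file extension, e.g. "clip.WAV" -> "wav". */
    static String formatFromExtension(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            throw new IllegalArgumentException("no file extension: " + name);
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Emits the OpenAI-style input_audio content part as a JSON object string. */
    String toJson() {
        // Standard base64 output and the two format literals need no JSON escaping.
        return "{\"type\":\"input_audio\",\"input_audio\":{\"data\":\""
                + base64Data + "\",\"format\":\"" + format + "\"}}";
    }
}
```

In this sketch, a file-based factory would read the bytes (for example with `Files.readAllBytes`) and pass `formatFromExtension(path)` to `inputAudio`, so an unknown extension is rejected by the same check as an explicit format string.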

The OpenAI server needs no change — audio content parts already round-trip verbatim through
/v1/chat/completions.

Test plan

  • Affected tests pass locally — ContentPart audio factories + format validation, a ChatRequest
    serializer test asserting the input_audio JSON shape (50 unit tests total); Spotless + Javadoc clean.
  • CI is green on this branch
  • Docs — README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property
    table + run command.

Related issues / PRs

Implements audio input per llama.cpp discussion #13759; extends the vision-input work.

Note for reviewer

AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) is gated: it self-skips unless all three
audio.* system properties (model / mmproj / clip) are set, exactly like MultimodalIntegrationTest.
The audio model download is intentionally not added to CI (Ultravox is large and the test
self-skips); the docs describe it as runnable locally or in CI once the properties are supplied.
As a result, the serialization path is unit-tested, while the real-model audio path is gated and
does not run in CI.
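
The self-skip gate described above can be sketched as a check for required system properties. This is an illustration under stated assumptions, not the code in AudioInputIntegrationTest: the class `GatedTestSketch` is hypothetical, and the property names used in the usage note are placeholders, since the PR only says there are three `audio.*` properties.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper; NOT taken from the repository's integration tests. */
final class GatedTestSketch {

    /** Returns the required property names whose values are unset or blank. */
    static List<String> missing(Function<String, String> lookup, String... names) {
        return Arrays.stream(names)
                .filter(name -> {
                    String value = lookup.apply(name);
                    return value == null || value.trim().isEmpty();
                })
                .collect(Collectors.toList());
    }

    /** True only when every required JVM system property is set. */
    static boolean shouldRun(String... names) {
        return missing(System::getProperty, names).isEmpty();
    }
}
```

A JUnit 5 test could call `Assumptions.assumeTrue(GatedTestSketch.shouldRun("audio.model", "audio.mmproj", "audio.clip"))` at the top of its setup, so the test is reported as skipped rather than failed when the model files are absent. The three names here are assumed for illustration only.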

Checklist

  • I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
  • My commits follow Conventional Commits
  • No security-sensitive changes (if there are, I have notified the maintainer privately per SECURITY.md)

Extends the typed multimodal API from vision to audio (llama.cpp discussion #13759). No native/JNI
change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part
(server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads
through to mtmd's audio pipeline; mtmd supports audio (mtmd_support_audio / mtmd_bitmap_init_from_audio).

- ContentPart: new INPUT_AUDIO kind + factories ContentPart.inputAudio(byte[], "wav"|"mp3") and
  audioFile(Path) (extension -> format), with base64 data + format accessors.
- ParameterJsonSerializer.buildMessages emits {"type":"input_audio","input_audio":{"data","format"}};
  ChatMessage.concatText already skips non-text parts, so getContent() is unaffected.
- LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already existed,
  fed by the native meta's modalities.audio).
- Tests: ContentPart audio factories + format validation, a ChatRequest serializer test asserting the
  input_audio JSON shape, and a gated AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) that
  self-skips without the audio model / mmproj / clip (3 new audio.* system properties).
- Docs: README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property
  table + run command.

The OpenAI server needs no change — audio content parts already round-trip verbatim through
/v1/chat/completions. The audio model download is intentionally NOT added to CI (Ultravox is large and
the test self-skips); it's documented as locally/CI-runnable.

Verified: 50 affected unit tests + audio serializer test green, integration test self-skips,
Spotless + Javadoc clean.
@bernardladenthin merged commit 2c91d1a into bernardladenthin:main on Jun 21, 2026
6 of 9 checks passed
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request Jun 22, 2026
…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established
false-positive categories already suppressed elsewhere with the same rationale:

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING — only field is the opaque native handle; a
  toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize() — fixed "TextToSpeech is closed"
  precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the
rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM — same input-validation guard as
  the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON — the
  canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII
  pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request Jun 22, 2026
…xisting bernardladenthin#267)

The PIT mutation gate (100% on value.*) was failing at 98% — 4 NO_COVERAGE
mutations, all in ContentPart's audio methods from the merged audio-input feature
(bernardladenthin#267): inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path)
had no tests at all. Pre-existing on main; this branch inherits it via the rebase.

Add four ContentPartTest cases — inputAudio("mp3"), audioFile .wav/.mp3 detection,
and audioFile unknown-extension rejection — mirroring the existing imageFile tests.
Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.