feat(multimodal): audio input via OpenAI input_audio content parts #267
Merged
bernardladenthin merged 1 commit on Jun 21, 2026
Extends the typed multimodal API from vision to audio (llama.cpp discussion #13759). No native/JNI
change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part
(server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads
through to mtmd's audio pipeline; mtmd supports audio (mtmd_support_audio / mtmd_bitmap_init_from_audio).
- ContentPart: new INPUT_AUDIO kind + factories ContentPart.inputAudio(byte[], "wav"|"mp3") and
audioFile(Path) (extension -> format), with base64 data + format accessors.
- ParameterJsonSerializer.buildMessages emits {"type":"input_audio","input_audio":{"data","format"}};
ChatMessage.concatText already skips non-text parts, so getContent() is unaffected.
- LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already existed,
fed by the native meta's modalities.audio).
- Tests: ContentPart audio factories + format validation, a ChatRequest serializer test asserting the
input_audio JSON shape, and a gated AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) that
self-skips without the audio model / mmproj / clip (3 new audio.* system properties).
- Docs: README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property
table + run command.
The OpenAI server needs no change — audio content parts already round-trip verbatim through
/v1/chat/completions. The audio model download is intentionally NOT added to CI (Ultravox is large and
the test self-skips); it's documented as locally/CI-runnable.
Verified: 50 affected unit tests + audio serializer test green, integration test self-skips,
Spotless + Javadoc clean.
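
As a rough illustration of the `input_audio` JSON shape described above, the sketch below base64-encodes raw bytes and wraps them in the content-part object. `InputAudioJsonSketch` and `toJson` are hypothetical names invented for this example; this is not the project's `ParameterJsonSerializer`, and it assumes the format string is already validated as "wav" or "mp3".

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not the project's serializer.
final class InputAudioJsonSketch {

    // Builds {"type":"input_audio","input_audio":{"data":"<base64>","format":"<format>"}}.
    // Assumes `format` is already validated ("wav" or "mp3"), so it is not escaped here.
    static String toJson(byte[] audioBytes, String format) {
        // Standard base64 output contains no characters that need JSON escaping.
        String data = Base64.getEncoder().encodeToString(audioBytes);
        return "{\"type\":\"input_audio\",\"input_audio\":{\"data\":\""
                + data + "\",\"format\":\"" + format + "\"}}";
    }

    public static void main(String[] args) {
        // Placeholder bytes (the ASCII tag "RIFF"), not a real audio clip.
        byte[] fakeClip = "RIFF".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toJson(fakeClip, "wav"));
    }
}
```

A real serializer would normally go through a JSON library rather than string concatenation; the concatenation here only keeps the sketch dependency-free.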
Branch head changed from 1c5ce83 to 55a6fa0.
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request on Jun 22, 2026:

…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale.

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING: the only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize(): fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM: same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON: the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request on Jun 22, 2026:

…xisting bernardladenthin#267)

The PIT mutation gate (100% on value.*) was failing at 98%: 4 NO_COVERAGE mutations, all in ContentPart's audio methods from the merged audio-input feature (bernardladenthin#267). inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path) had no tests at all. Pre-existing on main; this branch inherits it via the rebase.

Add four ContentPartTest cases, mirroring the existing imageFile tests: inputAudio("mp3"), audioFile .wav/.mp3 detection, and audioFile unknown-extension rejection.

Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.
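
The extension detection and unknown-extension rejection that these tests cover could look roughly like the sketch below. `AudioFormatSketch` and `formatFromExtension` are hypothetical names, not the project's `ContentPart.audioFile`; the case-insensitive match is an assumption borrowed from the `toLowerCase(Locale.ROOT)` + `equals` validation mentioned for `inputAudio`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical sketch of extension -> format mapping; not the project's ContentPart code.
final class AudioFormatSketch {

    // Returns "wav" or "mp3" based on the file extension, or throws for anything else.
    static String formatFromExtension(Path file) {
        Path fileName = file.getFileName();
        String name = fileName == null ? "" : fileName.toString();
        int dot = name.lastIndexOf('.');
        // Locale.ROOT keeps lowercasing locale-independent (e.g. no Turkish dotless-i surprises).
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (ext.equals("wav") || ext.equals("mp3")) {
            return ext;
        }
        throw new IllegalArgumentException("Unsupported audio extension: " + name);
    }

    public static void main(String[] args) {
        System.out.println(formatFromExtension(Paths.get("clip.wav")));
        System.out.println(formatFromExtension(Paths.get("CLIP.MP3")));
    }
}
```

Exercising both accepted extensions plus one rejected extension touches every branch of a method like this, which is the kind of coverage a mutation gate checks for.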
Summary
Add audio input to the typed multimodal API, extending the vision work to audio (llama.cpp discussion #13759: Ultravox / Qwen2-Audio / Qwen2.5-Omni). No native/JNI change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part (server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads to mtmd's audio pipeline (`mtmd_support_audio` / `mtmd_bitmap_init_from_audio`). The only gap was the Java typed API:
- `ContentPart`: new `INPUT_AUDIO` kind + `inputAudio(byte[], "wav"|"mp3")` / `audioFile(Path)` factories (extension → format), with base64 `data` + `format` accessors.
- `ParameterJsonSerializer.buildMessages` emits `{"type":"input_audio","input_audio":{"data","format"}}`; `ChatMessage.concatText` already skips non-text parts, so `getContent()` is unaffected.
- `LlamaModel.supportsAudio()` (parallel to `supportsVision()`; `ModelMeta.supportsAudio` already existed, fed by the native meta's `modalities.audio`).

The OpenAI server needs no change: audio content parts already round-trip verbatim through `/v1/chat/completions`.
Test plan
- `ContentPart` audio factories + format validation, a `ChatRequest` serializer test asserting the `input_audio` JSON shape (50 unit tests total); Spotless + Javadoc clean.
- Docs: README audio example + system-property rows; CLAUDE.md property table + run command.
Related issues / PRs
Implements audio input per llama.cpp discussion #13759; extends the vision-input work.
Note for reviewer
`AudioInputIntegrationTest` (Ultravox / Qwen2.5-Omni) is gated and self-skips without the three `audio.*` system properties (model / mmproj / clip), exactly like `MultimodalIntegrationTest`. The audio model download is intentionally not added to CI (Ultravox is large and the test self-skips); it's documented as locally/CI-runnable. So the serialization path is unit-tested; the real-model audio path is gated and unrun in CI.
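
The self-skip gate can be pictured with the minimal sketch below. The property names `audio.model`, `audio.mmproj`, and `audio.clip` are placeholders: the PR only states that there are three `audio.*` properties covering model / mmproj / clip, and the real names and checks in `AudioInputIntegrationTest` may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical gate for illustration; property names are assumed, not taken from the PR.
final class AudioGateSketch {

    // Placeholder names for the three audio.* system properties (model / mmproj / clip).
    static final String[] REQUIRED = {"audio.model", "audio.mmproj", "audio.clip"};

    // Returns the required properties that are unset or blank.
    static List<String> missing() {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = System.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missing();
        if (!missing.isEmpty()) {
            // Skip rather than fail when the large model files are not configured.
            System.out.println("SKIPPED: missing " + missing);
            return;
        }
        System.out.println("All audio properties set; the integration test would run here.");
    }
}
```

In a JUnit 5 test the same check would typically feed `Assumptions.assumeTrue(...)`, so the test is reported as skipped instead of failed when the properties are absent.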