feat(multimodal): audio input via OpenAI input_audio content parts by vaiju1981 · Pull Request #267 · bernardladenthin/java-llama.cpp

vaiju1981 · 2026-06-21T18:59:37Z

Summary

Add audio input to the typed multimodal API, extending the vision work to audio
(llama.cpp discussion #13759: Ultravox /
Qwen2-Audio / Qwen2.5-Omni). No native/JNI change is needed — upstream b9739 already decodes the
OpenAI input_audio content part (server-common.cpp) into the same media buffer vision uses, which
the JNI bridge already threads to mtmd's audio pipeline (mtmd_support_audio /
mtmd_bitmap_init_from_audio). The only gap was the Java typed API:

ContentPart — new INPUT_AUDIO kind + inputAudio(byte[], "wav"|"mp3") / audioFile(Path)
factories (extension → format), with base64 data + format accessors.
ParameterJsonSerializer.buildMessages emits
{"type":"input_audio","input_audio":{"data","format"}}; ChatMessage.concatText already skips
non-text parts, so getContent() is unaffected.
LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already
existed, fed by the native meta's modalities.audio).

The OpenAI server needs no change — audio content parts already round-trip verbatim through
/v1/chat/completions.

Test plan

Affected tests pass locally — ContentPart audio factories + format validation, a ChatRequest
serializer test asserting the input_audio JSON shape (50 unit tests total); Spotless + Javadoc clean.
CI is green on this branch
Docs — README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property
table + run command.

Related issues / PRs

Implements audio input per llama.cpp discussion #13759; extends the vision-input work.

Note for reviewer

AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) is gated and self-skips without the three
audio.* system properties (model / mmproj / clip), exactly like MultimodalIntegrationTest. The audio
model download is intentionally not added to CI (Ultravox is large and the test self-skips) — it's
documented as locally/CI-runnable. So the serialization path is unit-tested; the real-model audio path
is gated and unrun in CI.

Checklist

I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
My commits follow Conventional Commits
No security-sensitive changes (if there are, I have notified the maintainer privately per SECURITY.md)

Extends the typed multimodal API from vision to audio (llama.cpp discussion #13759). No native/JNI change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part (server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads through to mtmd's audio pipeline; mtmd supports audio (mtmd_support_audio / mtmd_bitmap_init_from_audio). - ContentPart: new INPUT_AUDIO kind + factories ContentPart.inputAudio(byte[], "wav"|"mp3") and audioFile(Path) (extension -> format), with base64 data + format accessors. - ParameterJsonSerializer.buildMessages emits {"type":"input_audio","input_audio":{"data","format"}}; ChatMessage.concatText already skips non-text parts, so getContent() is unaffected. - LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already existed, fed by the native meta's modalities.audio). - Tests: ContentPart audio factories + format validation, a ChatRequest serializer test asserting the input_audio JSON shape, and a gated AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) that self-skips without the audio model / mmproj / clip (3 new audio.* system properties). - Docs: README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property table + run command. The OpenAI server needs no change — audio content parts already round-trip verbatim through /v1/chat/completions. The audio model download is intentionally NOT added to CI (Ultravox is large and the test self-skips); it's documented as locally/CI-runnable. Verified: 50 affected unit tests + audio serializer test green, integration test self-skips, Spotless + Javadoc clean.

…ladenthin#266/bernardladenthin#267 findings SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale: This PR (TextToSpeech, a native-handle wrapper like LlamaModel): - IMC_IMMATURE_CLASS_NO_TOSTRING — only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression). - WEM_WEAK_EXCEPTION_MESSAGING on synthesize() — fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards). Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them): - OpenAiRequestMapper.toCompletionParameters WEM — same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block. - ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON — the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block. Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).

…xisting bernardladenthin#267) The PIT mutation gate (100% on value.*) was failing at 98% — 4 NO_COVERAGE mutations, all in ContentPart's audio methods from the merged audio-input feature (bernardladenthin#267): inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path) had no tests at all. Pre-existing on main; this branch inherits it via the rebase. Add four ContentPartTest cases — inputAudio("mp3"), audioFile .wav/.mp3 detection, and audioFile unknown-extension rejection — mirroring the existing imageFile tests. Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.

vaiju1981 requested a review from bernardladenthin as a code owner June 21, 2026 18:59

vaiju1981 temporarily deployed to startgate June 21, 2026 18:59 — with GitHub Actions Inactive

vaiju1981 force-pushed the feat/audio-input branch from 1c5ce83 to 55a6fa0 Compare June 21, 2026 19:44

vaiju1981 temporarily deployed to startgate June 21, 2026 19:44 — with GitHub Actions Inactive

bernardladenthin merged commit 2c91d1a into bernardladenthin:main Jun 21, 2026
6 of 9 checks passed

vaiju1981 mentioned this pull request Jun 23, 2026

feat(tts): text-to-speech via the OuteTTS + WavTokenizer pipeline #268

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(multimodal): audio input via OpenAI input_audio content parts#267

feat(multimodal): audio input via OpenAI input_audio content parts#267
bernardladenthin merged 1 commit into
bernardladenthin:mainfrom
vaiju1981:feat/audio-input

vaiju1981 commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vaiju1981 commented Jun 21, 2026

Summary

Test plan

Related issues / PRs

Note for reviewer

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants