feat(tts): text-to-speech via the OuteTTS + WavTokenizer pipeline by vaiju1981 · Pull Request #268 · bernardladenthin/java-llama.cpp

vaiju1981 · 2026-06-21T19:54:12Z

Adds TextToSpeech — an AutoCloseable native type that synthesizes audio from text over llama.cpp's two-model OuteTTS pipeline (OuteTTS text-to-codes → WavTokenizer codes-to-speech vocoder), returning a 24 kHz mono 16-bit WAV byte stream.

Approach: the OuteTTS helpers are DERIVED from upstream, never hand-copied

cmake/generate-tts-upstream.cmake reads the pinned upstream tools/tts/tts.cpp at configure time, drops main(), gives the called helpers external linkage, extracts the two default-speaker literals, and writes build/tts_generated/tts_upstream_gen.cpp (never committed, regenerated on every configure). tts_engine.{h,cpp} is only the orchestration; tts_wav.hpp is our in-memory WAV writer.
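For readers unfamiliar with what an in-memory WAV writer involves, here is a minimal sketch of one, assuming the canonical 44-byte RIFF/WAVE header for PCM data. The names `wav16_bytes_sketch`, `put_le` and `put_tag` are hypothetical and are not the actual tts_wav.hpp API; the rounding and clamping policy is likewise an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `value` to `out` as `n_bytes` little-endian bytes.
static void put_le(std::vector<uint8_t> & out, uint32_t value, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Hypothetical helper: appends a 4-character chunk tag such as "RIFF".
static void put_tag(std::vector<uint8_t> & out, const char (&tag)[5]) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<uint8_t>(tag[i]));
    }
}

// Encodes float samples (nominally in [-1, 1]) as a mono 16-bit PCM WAV image.
// Assumption: samples outside [-1, 1] are clamped, then scaled by 32767.
std::vector<uint8_t> wav16_bytes_sketch(const std::vector<float> & pcm, uint32_t sample_rate) {
    assert(sample_rate > 0);
    const uint16_t channels = 1;
    const uint16_t bits     = 16;
    const uint32_t frame    = channels * (bits / 8);  // bytes per sample frame
    const uint32_t data_size = static_cast<uint32_t>(pcm.size()) * frame;

    std::vector<uint8_t> out;
    out.reserve(44 + data_size);
    put_tag(out, "RIFF");
    put_le(out, 36 + data_size, 4);     // RIFF chunk size = file size - 8
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_le(out, 16, 4);                 // fmt chunk size for plain PCM
    put_le(out, 1, 2);                  // audio format 1 = PCM
    put_le(out, channels, 2);
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * frame, 4);  // byte rate
    put_le(out, frame, 2);                // block align
    put_le(out, bits, 2);
    put_tag(out, "data");
    put_le(out, data_size, 4);
    for (float s : pcm) {
        const float clamped = std::fmax(-1.0f, std::fmin(1.0f, s));
        const int16_t v = static_cast<int16_t>(std::lround(clamped * 32767.0f));
        put_le(out, static_cast<uint16_t>(v), 2);
    }
    return out;
}
```

Calling `wav16_bytes_sketch(samples, 24000)` would yield a buffer that a JNI layer could hand back to Java as a `byte[]`; the real writer may differ in scaling and error handling.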

What's included

Native: generator + tts_upstream.h (declarations) + tts_engine.{h,cpp} (orchestration) + tts_wav.hpp, compiled into libjllama; 3 new JNI methods.
Java: TextToSpeech with synthesize(text) and an explicit-sampling overload; GPU-offload constructor.
Tests: test_tts_wav.cpp (the WAV writer) + TtsIntegrationTest (self-skips without GGUFs). The derived DSP is upstream's, covered end-to-end by the integration test.
CI: both test GGUFs wired into the Linux x86_64 job so TtsIntegrationTest runs in CI.
Docs: README "Text-to-Speech" section + system properties; CLAUDE.md "OuteTTS build-time extraction" section + architecture/test entries.

English number words are now expanded for speech (e.g. 3 → "three", via upstream's real process_text); non-English text is not romanized. Synthesis uses the built-in default speaker profile.
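To illustrate the kind of expansion described above, the following is a rough sketch that spells out runs of up to three digits as English words. It is not upstream's process_text or replace_numbers_with_words; the function names and the exact wording (for example "twenty one" without a hyphen) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: spells out an integer in [0, 999] as English words.
std::string number_to_words_sketch(int n) {
    assert(n >= 0 && n <= 999);
    static const char * ones[] = {"zero", "one", "two", "three", "four", "five", "six",
                                  "seven", "eight", "nine", "ten", "eleven", "twelve",
                                  "thirteen", "fourteen", "fifteen", "sixteen",
                                  "seventeen", "eighteen", "nineteen"};
    static const char * tens[] = {"", "", "twenty", "thirty", "forty", "fifty",
                                  "sixty", "seventy", "eighty", "ninety"};
    if (n < 20) {
        return ones[n];
    }
    std::string out;
    if (n >= 100) {
        out = std::string(ones[n / 100]) + " hundred";
        n %= 100;
        if (n == 0) {
            return out;
        }
        out += " ";
    }
    if (n < 20) {
        return out + ones[n];
    }
    out += tens[n / 10];
    if (n % 10 != 0) {
        out += std::string(" ") + ones[n % 10];
    }
    return out;
}

// Hypothetical helper: replaces each run of 1 to 3 digits with its spelled-out form.
// Longer runs are outside this sketch's range and are copied through unchanged.
std::string expand_numbers_sketch(const std::string & text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::isdigit(static_cast<unsigned char>(text[i]))) {
            std::size_t j = i;
            while (j < text.size() && std::isdigit(static_cast<unsigned char>(text[j]))) {
                ++j;
            }
            const std::string digits = text.substr(i, j - i);
            out += (digits.size() <= 3) ? number_to_words_sketch(std::stoi(digits)) : digits;
            i = j;
        } else {
            out += text[i++];
        }
    }
    return out;
}
```

Under these assumptions, `expand_numbers_sketch("I have 3 cats")` produces "I have three cats".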

Verified end-to-end locally

OuteTTS-0.2-500M-Q4 + WavTokenizer on CPU: TtsIntegrationTest PASS; manual synthesis → valid 24 kHz mono 16-bit WAV, 2.91 s, genuine modulated speech (afinfo-validated). Builds + passes Java tests on every platform (Linux aarch64, Windows VS+Ninja, macOS Metal/no-Metal, Android ±OpenCL, manylinux CUDA).

Responding to the review feedback

1. DRY / no copy of tts.cpp (+ MIT attribution). Done via the build-time extraction above: tts_dsp.hpp and all copied helpers are gone, and because the helpers are re-derived from the pinned source on every build, they cannot silently diverge from upstream. The derived TU carries the upstream MIT banner crediting "The llama.cpp authors".

Why not "add tts.cpp to target_sources" / the wrap-main() patch. Verified against b9739: even with main() guarded, every helper is static (internal linkage → uncallable cross-TU) and the pipeline + default-speaker literals live inside main() (compiled out by the guard) — so direct inclusion links but exports nothing usable, and speaker_from_file / audio_*_from_speaker aren't reachable either (also static). Hence the extraction route (the reviewer's own suggested fallback). It is read-only, so no patches/ entry is required.

3. ArchUnit. Fixed (12/12). The failure was pre-existing on main (the #266 LlamaModelBackend → value edge), not TextToSpeech, which sits in the root layer and needs no rule change.

Also fixed (pre-existing on main, inherited via rebase): SpotBugs findings in ContentPart/OpenAiRequestMapper (#266/#267) suppressed with the established rationale; PIT mutation gate restored to 100% by covering the ContentPart audio paths (#267) that lacked tests.

Deliberate follow-ups (not in this PR)

Companion upstream llama.cpp PR to split tts.cpp into a library TU + thin CLI (so the generator could shrink/drop later) — the "ideal long-term" route; left as a follow-up since the extraction is self-contained.
Upstream SPDX copyright line on committed files — intentionally not added, because no committed file contains upstream code (the derived TU does, and carries the banner). Provenance is referenced in the committed headers' comments.

Pre-existing, not addressable here

FOSSA "License Compliance" flags the dependency tree; this PR changes no dependencies, so its findings are identical to main.

First step toward text-to-speech output. llama.cpp's TTS lives only in the standalone `llama-tts` CLI (tools/tts/tts.cpp), not in the server TUs jllama compiles, and its audio synthesis is hand-rolled DSP. Vendor that pure DSP (no llama/ggml/JNI state) into a header so the eventual JNI bridge and the C++ tests can both use it:

- src/main/cpp/tts_dsp.hpp: fill_hann_window / twiddle / irfft / fold / embd_to_audio vendored byte-faithful from tts.cpp (kept verbatim so a llama.cpp bump is a mechanical re-sync), plus pcm_to_wav16_bytes, an in-memory replacement for tts.cpp's file-writing save_wav16, since the JNI layer will return WAV bytes to Java.
- src/test/cpp/test_tts_dsp.cpp: 5 unit tests (WAV header/payload + little-endian clamping, Hann window, fold trimming, embd_to_audio output-length identity); pure, no model needed.

C++ suite 457/457 (was 452); clang-format clean.

NEXT (separate commits): a JNI method orchestrating the two-model OuteTTS pipeline (TTC LLM -> audio codes via llama_decode; CTS vocoder -> embeddings -> embd_to_audio), then the Java TextToSpeech API returning byte[] WAV, then a gated OuteTTS+WavTokenizer integration test.
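As background for the Hann window test mentioned above, here is a small sketch of the textbook Hann window in both its periodic and symmetric forms. It is written from the standard formula rather than taken from tts.cpp, and the name `hann_window_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: returns a Hann window of `length` samples.
// periodic = true divides by N (suited to STFT framing);
// periodic = false divides by N - 1 (symmetric, endpoints both zero).
std::vector<float> hann_window_sketch(int length, bool periodic) {
    assert(length > 1);
    const double pi    = std::acos(-1.0);
    const int    denom = periodic ? length : length - 1;
    std::vector<float> w(static_cast<std::size_t>(length));
    for (int i = 0; i < length; ++i) {
        w[static_cast<std::size_t>(i)] =
            static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}
```

A unit test for such a function typically checks the zero at index 0, the peak of 1.0 at the midpoint, and the symmetry of the remaining samples.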

Builds on the vendored DSP (milestone 1) to wire the full text-to-speech pipeline.

Native (compiles + links; JNI symbols exported):
- src/main/cpp/tts_engine.{h,cpp}: a self-contained OuteTTS orchestration adapted from tools/tts/tts.cpp main(), single-stream (n_parallel=1) with the built-in default speaker. Loads the TTC (OuteTTS) + CTS (WavTokenizer vocoder) models via common_init_from_params, builds the OuteTTS prompt, runs the llama_decode loop to generate audio codes, filters to the codec token range, runs the vocoder (llama_encode + llama_get_embeddings), and feeds embd_to_audio -> pcm_to_wav16_bytes. OuteTTS prompt helpers + default speaker vendored byte-faithfully.
- jllama.cpp: 3 TextToSpeech JNI methods (loadNative / synthesizeNative -> byte[] WAV / deleteNative), reusing parse_jstring + the c_llama_error exception-conversion pattern. tts_engine.cpp added to the jllama target.

Java:
- net.ladenthin.llama.TextToSpeech (AutoCloseable): new TextToSpeech(ttcPath, vocoderPath[, gpuLayers, threads]); synthesize(text) -> 24 kHz mono 16-bit WAV byte[].
- Gated TtsIntegrationTest (self-skips without the OuteTTS + WavTokenizer GGUFs) + 2 tts.* properties.

Verified: jllama links with the TTS engine, TextToSpeech JNI symbols exported in libjllama, C++ suite 457/457, TtsIntegrationTest compiles + self-skips, Spotless + Javadoc + clang-format clean.

NOT yet verified: the end-to-end synthesis at runtime. It needs the OuteTTS + WavTokenizer GGUFs (not staged here); the gated test is the runtime gate.

Known simplification: number-to-words romanization is a pass-through (digits dropped), as noted in tts_engine.cpp.

Remaining: README/CLAUDE.md docs.
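The "filters to the codec token range" step can be sketched roughly as follows. The range 151672..155772 is the one quoted in the review below; treating both ends as inclusive and rebasing the surviving tokens to start at zero are assumptions of this sketch, as is the name `filter_codec_tokens_sketch`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Codec token range as quoted in the review; inclusive bounds are an assumption.
constexpr int32_t kCodecFirst = 151672;
constexpr int32_t kCodecLast  = 155772;

// Hypothetical helper: drops every token outside the codec range and rebases
// the survivors so that the first codec token maps to code 0.
std::vector<int32_t> filter_codec_tokens_sketch(std::vector<int32_t> tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](int32_t t) { return t < kCodecFirst || t > kCodecLast; }),
                 tokens.end());
    for (int32_t & t : tokens) {
        t -= kCodecFirst;
    }
    // Postcondition: every remaining value is a valid zero-based codec index.
    assert(std::all_of(tokens.begin(), tokens.end(),
                       [](int32_t t) { return t >= 0 && t <= kCodecLast - kCodecFirst; }));
    return tokens;
}
```

The filtered codes are what a vocoder stage would consume; ordinary text tokens emitted by the TTC model are discarded here.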

…ution

Add the two-model OuteTTS TTS pipeline to CI so TtsIntegrationTest runs:
- publish.yml: TTS_MODEL_URL/NAME (OuteTTS-0.2-500M-Q4_K_M) + TTS_VOCODER_URL/NAME (WavTokenizer-Large-75-F16) env vars; download steps + the matching -Dnet.ladenthin.llama.tts.{ttc,vocoder}.model flags on the Linux x86_64 (jcstress) test job.
- validate-models.sh: both GGUFs added to OPTIONAL_MODELS (validated when present, skipped where not downloaded).

Both URLs verified HTTP 200 (OuteTTS ~385 MB, WavTokenizer ~124 MB).

Per request, drop all in-code attribution from the TTS sources (tts_dsp.hpp, tts_engine.cpp, tts_engine.h): remove the "The llama.cpp authors" SPDX line and reword the "vendored/adapted from tts.cpp" comments to neutral descriptions. Each file keeps its single Bernard Ladenthin SPDX header + MIT license (REUSE stays compliant). Comment-only change: native lib builds, clang-format clean, TTS DSP C++ tests pass.

- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop limitation, compatible GGUF links); two new rows in the System Properties Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines); TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the C++ test-file table and the drifted counts reconciled to the actual 457 (test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).

Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.

bernardladenthin · 2026-06-21T20:22:01Z

Thank you very much for the idea and the effort — TTS support via the OuteTTS pipeline is genuinely exciting and a great direction for this project.

However, before this can be merged, a cleanup phase is needed. Here are the issues to address:

1. Copyright / DRY violation: do not copy `tools/tts/tts.cpp`

A line-by-line comparison of tts_dsp.hpp and tts_engine.cpp against llama.cpp/tools/tts/tts.cpp shows that virtually every function — fill_hann_window, twiddle, irfft, fold, embd_to_audio, process_text, prompt_add, prompt_init, prepare_guide_tokens, the default speaker profile strings, and the codec token range 151672..155772 — is copied nearly verbatim. Only cosmetic changes were made (static → inline, the WAV writer refactored to return bytes).

This violates the DRY principle and creates a maintenance burden: every llama.cpp upgrade risks silent divergence between the copied code and the upstream implementation. It also raises a MIT attribution concern: the MIT licence requires preserving the original copyright notice, which is absent from the new files.

The right approach is to compile tools/tts/tts.cpp directly, exactly as server-context.cpp, server-queue.cpp etc. are already wired in via CMakeLists.txt. The functions needed (fill_hann_window, irfft, fold, embd_to_audio, …) live in the upstream file — include it in the build rather than duplicating it.

2. For anything that cannot be used directly: `patches/` + upstream PR

If parts of tools/tts/tts.cpp need adaptation for the JNI embedding (e.g. making the WAV writer return bytes instead of writing a file, or extracting the pipeline into a callable function rather than main()), the correct workflow is:

Add a minimal *.patch file under patches/ (the existing cmake/apply-llama-patches.cmake machinery picks it up automatically).
Open a PR on the llama.cpp side proposing the clean API surface (e.g. a tts_synthesize() function with an output-buffer parameter) so the patch can eventually be dropped.

This keeps all modifications auditable and keeps the upgrade path clean.

3. ArchUnit tests are currently failing

Please check and fix the ArchUnit layering violations before requesting re-review. The TextToSpeech class likely needs to be placed in (or explicitly permitted by) the architecture rules.

Summary of what a revised PR should look like:

CMakeLists.txt adds ${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp to the jllama target (or links the object, same pattern as the server files).
tts_dsp.hpp and the duplicated functions in tts_engine.cpp are removed entirely; the JNI bridge calls into the upstream functions directly.
Any necessary surface changes to tts.cpp (in-memory WAV output, callable API) are delivered as a patches/0002-tts-jni-api.patch with a companion llama.cpp PR.
Copyright headers on new files reference the upstream origin.
ArchUnit violations are fixed.

Happy to help design the patch or the upstream PR if that would be useful. Looking forward to a revised version!

Generated by Claude Code

bernardladenthin · 2026-06-21T20:48:15Z

One more thought on the fallback strategy: the strong preference is to simply add tools/tts/tts.cpp to the CMake compile phase — the same way server-context.cpp, server-queue.cpp etc. are already pulled in from the FetchContent tree. If the functions are in a linkable translation unit, there is nothing to copy at all.

If it turns out that tts.cpp cannot be included directly (e.g. because it defines a main() or has other link conflicts that make it unsuitable as a library TU), the next-best option before resorting to a manual copy would be a build-time extraction script: a small CMake configure_file step or a Python/shell script that runs during the configure or generate phase, reads the specific functions out of the upstream tts.cpp source file (which is already on disk via FetchContent), and writes a derived .cpp into the build tree. That way the "copy" is always derived mechanically from the exact upstream file at the pinned version, divergence is impossible, and upgrading llama.cpp automatically picks up any fixes. It is more complex than a direct include, but still infinitely better than a hand-maintained 1:1 duplicate.

That said — given that the existing server files (server-context.cpp etc.) are already compiled into libjllama without a main() conflict, it is very likely that the TTS functions can be extracted into a small helper header/source by the upstream PR mentioned above, making even the extraction script unnecessary. That route is worth trying first.

Generated by Claude Code

bernardladenthin · 2026-06-21T21:14:17Z

Following up on the earlier comment about the DRY principle and avoiding the hard copy of tools/tts/tts.cpp — here is a concrete proposal for how to include the upstream file directly.

Why direct inclusion is blocked

tools/tts/tts.cpp contains int main(int argc, char ** argv). Adding it to target_sources(jllama ...) would cause a multiple-definition linker error for main when building the shared library. This is the same reason tools/server/server.cpp is excluded from this project's CMakeLists.txt while server-context.cpp, server-queue.cpp, server-task.cpp etc. are included — those library TUs have no main().

Proposed solution: patch via `patches/` mechanism

The project already has a patch mechanism (patches/ + cmake/apply-llama-patches.cmake) that applies git apply-compatible unified diffs to the llama.cpp source before it compiles. A small patch wrapping main() in a preprocessor guard solves the problem cleanly:

patches/0002-tts-wrap-main.patch

--- a/tools/tts/tts.cpp
+++ b/tools/tts/tts.cpp
@@ -312,6 +312,7 @@ static void prepare_guide_tokens(llama_model * model, const std::string & text,
 }
 
+#ifndef JLLAMA_SKIP_TTS_MAIN
 int main(int argc, char ** argv) {
     common_params params;
 
@@ -420,3 +421,4 @@ int main(int argc, char ** argv) {
 
     return 0;
 }
+#endif // JLLAMA_SKIP_TTS_MAIN

Then in CMakeLists.txt, replace the hand-copied tts_engine.cpp with:

target_sources(jllama PRIVATE
    ${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
)
target_compile_definitions(jllama PRIVATE JLLAMA_SKIP_TTS_MAIN)

The JLLAMA_SKIP_TTS_MAIN compile definition is scoped to the jllama target only. If the upstream standalone tts executable is also in the CMake tree, it compiles normally (no define → main() present). The standalone tool is completely unaffected.

This removes tts_dsp.hpp, tts_engine.h, and tts_engine.cpp from the project entirely, replacing them with the single upstream source. The richer upstream API (speaker_from_file, audio_text_from_speaker, audio_data_from_speaker for JSON-based speaker profiles) also becomes available without any extra effort.

A companion PR to llama.cpp to properly split tts.cpp into a library TU + CLI would be ideal long-term (so the patch can eventually be dropped), but the patch is a complete, self-contained solution on its own.

The exact @@ -NNN @@ line numbers in the patch will need to match the actual file at the pinned b9739 tag — please adjust them accordingly when creating the patch file.

Generated by Claude Code

vaiju1981 · 2026-06-22T00:55:55Z

@bernardladenthin, I will take a look at the patch approach; that sounds more realistic. The other option is to look at llama-box as an alternative. I will work on this first thing tomorrow after my work day.

… hand-copy)

Addresses review feedback on PR bernardladenthin#268: the TTS native pipeline reused llama.cpp's tools/tts/tts.cpp by hand-copying its DSP/prompt/text helpers and default-speaker strings into tts_dsp.hpp + tts_engine.cpp. That was a DRY/maintenance hazard that would silently diverge on every llama.cpp upgrade, and a missing-attribution concern.

tts.cpp cannot simply be added to target_sources: it defines its own main() (link clash, same reason server.cpp is excluded) and every helper is `static` (internal linkage, so unreachable from another TU). So instead of copying, the helpers are now DERIVED MECHANICALLY from the pinned upstream source at configure time:

- cmake/generate-tts-upstream.cmake reads the pinned tools/tts/tts.cpp, keeps the pre-main() span, strips `static` from the helpers the engine calls (external linkage), and extracts the two default-speaker literals out of main() into `extern const` strings. Emits build/tts_generated/tts_upstream_gen.cpp (never committed; regenerated from whatever tts.cpp the GIT_TAG resolves to, so a version bump is picked up automatically).
- CMakeLists runs it after FetchContent_MakeAvailable(llama.cpp) and compiles the generated TU into jllama.
- tts_upstream.h: committed, hand-written declarations of the extracted symbols (interface only). tts_engine.cpp keeps only our orchestration + the in-memory WAV writer (tts_wav.hpp, ours). tts_dsp.hpp and all copied helpers are removed.

Fail-loud on drift (same contract as patches/): the generator asserts the `int main(` anchor, every de-static signature, and both speaker literals; a rename aborts the configure, a type change fails the link. Silent divergence is impossible.

Bonus: using upstream's real process_text (which calls replace_numbers_with_words) fixes the previous digit-drop limitation. English numbers are now spoken.

Verified: jllama builds + links, 454 C++ tests pass, and TtsIntegrationTest synthesizes a valid 24 kHz WAV end-to-end against the real OuteTTS + WavTokenizer models. test_tts_dsp.cpp -> test_tts_wav.cpp (now covers only our WAV writer; the DSP is upstream's, covered end-to-end by TtsIntegrationTest).

…adenthin#266 regression) LlamaArchitectureTest.layeredArchitecture was already failing on main (not introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer), but the Value layer's mayOnlyBeAccessedByLayers list — documented as "the EXACT set of packages that reference it today" — was not updated. Add "Server" to it, the same maintenance the rule's own javadoc prescribes. Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its own commit so it can be cherry-picked to main independently.

The synthesizeNative signature added in the TTS milestone was wrapped by a non-pinned clang-format; reflow it with the CI-pinned 22.1.5 so the clang-format check passes. No behavior change.

…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale.

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING: the only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize(): fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM: same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON: the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).

…xisting bernardladenthin#267) The PIT mutation gate (100% on value.*) was failing at 98% — 4 NO_COVERAGE mutations, all in ContentPart's audio methods from the merged audio-input feature (bernardladenthin#267): inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path) had no tests at all. Pre-existing on main; this branch inherits it via the rebase. Add four ContentPartTest cases — inputAudio("mp3"), audioFile .wav/.mp3 detection, and audioFile unknown-extension rejection — mirroring the existing imageFile tests. Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.

vaiju1981 · 2026-06-23T15:04:21Z

@bernardladenthin the PR is ready for re-review.

bernardladenthin · 2026-06-23T15:08:30Z

ty, I'll check later

vaiju1981 requested a review from bernardladenthin as a code owner June 21, 2026 19:54

vaijurao added 4 commits June 21, 2026 12:59

vaiju1981 force-pushed the feat/tts-output branch from 2c0321b to f009837 Compare June 21, 2026 20:05

vaiju1981 temporarily deployed to startgate June 21, 2026 20:05 — with GitHub Actions Inactive

vaijurao added 2 commits June 21, 2026 21:58

vaiju1981 temporarily deployed to startgate June 22, 2026 05:00 — with GitHub Actions Inactive

vaijurao added 2 commits June 21, 2026 23:27

style(jni): clang-format TextToSpeech JNI signature with pinned 22.1.5

f6d1e91


vaiju1981 temporarily deployed to startgate June 22, 2026 06:34 — with GitHub Actions Inactive

vaiju1981 deployed to startgate June 22, 2026 08:24 — with GitHub Actions Active

vaiju1981 wants to merge 9 commits into bernardladenthin:main from vaiju1981:feat/tts-output
