
feat(tts): text-to-speech via the OuteTTS + WavTokenizer pipeline#268

Open
vaiju1981 wants to merge 9 commits into bernardladenthin:main from vaiju1981:feat/tts-output

@vaiju1981 vaiju1981 commented Jun 21, 2026


Adds TextToSpeech — an AutoCloseable native type that synthesizes audio from text over llama.cpp's two-model OuteTTS pipeline (OuteTTS text-to-codes → WavTokenizer codes-to-speech vocoder), returning a 24 kHz mono 16-bit WAV byte stream.

Approach: the OuteTTS helpers are DERIVED from upstream, never hand-copied

cmake/generate-tts-upstream.cmake reads the pinned upstream tools/tts/tts.cpp at configure time, drops main(), gives the called helpers external linkage, extracts the two default-speaker literals, and writes build/tts_generated/tts_upstream_gen.cpp (never committed, regenerated on every configure). tts_engine.{h,cpp} is only the orchestration; tts_wav.hpp is our in-memory WAV writer.
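As a rough illustration of what an in-memory WAV writer of this kind has to do, here is a minimal sketch that packs float PCM into a 16-bit mono RIFF/WAVE byte buffer. The function name `pcm_to_wav16_sketch` and its helpers are hypothetical and are not the contents of tts_wav.hpp; the sketch assumes samples are nominally in [-1, 1] and that out-of-range values are clamped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not taken from tts_wav.hpp): append values in little-endian order.
static void put_u16(std::vector<uint8_t> & out, uint16_t v) {
    out.push_back(static_cast<uint8_t>(v & 0xFF));
    out.push_back(static_cast<uint8_t>((v >> 8) & 0xFF));
}

static void put_u32(std::vector<uint8_t> & out, uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<uint8_t>((v >> shift) & 0xFF));
    }
}

static void put_tag(std::vector<uint8_t> & out, const char * tag) {
    out.insert(out.end(), tag, tag + 4);  // four-character chunk id
}

// Sketch: float PCM -> canonical 44-byte header + 16-bit mono PCM payload.
std::vector<uint8_t> pcm_to_wav16_sketch(const std::vector<float> & pcm, uint32_t sample_rate) {
    assert(sample_rate > 0);
    const uint16_t channels   = 1;
    const uint16_t bits       = 16;
    const uint32_t data_bytes = static_cast<uint32_t>(pcm.size() * sizeof(int16_t));

    std::vector<uint8_t> out;
    out.reserve(44 + data_bytes);
    put_tag(out, "RIFF");
    put_u32(out, 36 + data_bytes);                       // size of everything after this field
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_u32(out, 16);                                    // fmt chunk size for plain PCM
    put_u16(out, 1);                                     // audio format 1 = PCM
    put_u16(out, channels);
    put_u32(out, sample_rate);
    put_u32(out, sample_rate * channels * bits / 8);     // byte rate
    put_u16(out, static_cast<uint16_t>(channels * bits / 8));  // block align
    put_u16(out, bits);
    put_tag(out, "data");
    put_u32(out, data_bytes);

    for (float s : pcm) {
        // Scale, then clamp so loud samples saturate instead of wrapping around.
        const float scaled  = s * 32767.0f;
        const float clamped = std::max(-32768.0f, std::min(32767.0f, scaled));
        put_u16(out, static_cast<uint16_t>(static_cast<int16_t>(clamped)));
    }
    return out;
}
```

Calling `pcm_to_wav16_sketch(samples, 24000)` would yield a buffer in the same general shape as the 24 kHz mono 16-bit output described here; returning bytes rather than writing a file is what lets a JNI layer hand the result to Java as a `byte[]`.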

What's included

  • Native: generator + tts_upstream.h (declarations) + tts_engine.{h,cpp} (orchestration) + tts_wav.hpp, compiled into libjllama; 3 new JNI methods.
  • Java: TextToSpeech with synthesize(text) and an explicit-sampling overload; GPU-offload constructor.
  • Tests: test_tts_wav.cpp (the WAV writer) + TtsIntegrationTest (self-skips without GGUFs). The derived DSP is upstream's, covered end-to-end by the integration test.
  • CI: both test GGUFs wired into the Linux x86_64 job so TtsIntegrationTest runs in CI.
  • Docs: README "Text-to-Speech" section + system properties; CLAUDE.md "OuteTTS build-time extraction" section + architecture/test entries.

English numerals are now expanded into words for speech (e.g. 3 → "three", via upstream's real process_text); non-English text is not romanized. Synthesis uses the built-in default speaker profile.

Verified end-to-end locally

OuteTTS-0.2-500M-Q4 + WavTokenizer on CPU: TtsIntegrationTest PASS; manual synthesis → valid 24 kHz mono 16-bit WAV, 2.91 s, genuine modulated speech (afinfo-validated). Builds + passes Java tests on every platform (Linux aarch64, Windows VS+Ninja, macOS Metal/no-Metal, Android ±OpenCL, manylinux CUDA).


Responding to the review feedback

1. DRY / no copy of tts.cpp (+ MIT attribution). Done via the build-time extraction above: tts_dsp.hpp and all copied helpers are gone, and the helpers are re-derived from the pinned source on every build, so they cannot drift from upstream. The derived TU carries the upstream MIT banner naming the llama.cpp authors.

Why not "add tts.cpp to target_sources" / the wrap-main() patch. Verified against b9739: even with main() guarded, every helper is static (internal linkage → uncallable cross-TU) and the pipeline + default-speaker literals live inside main() (compiled out by the guard) — so direct inclusion links but exports nothing usable, and speaker_from_file / audio_*_from_speaker aren't reachable either (also static). Hence the extraction route (the reviewer's own suggested fallback). It is read-only, so no patches/ entry is required.

3. ArchUnit. Fixed (12/12). The failure was pre-existing on main (the #266 LlamaModelBackend → value edge) and was not caused by TextToSpeech, which sits in the root layer and needs no rule change.

Also fixed (pre-existing on main, inherited via rebase): SpotBugs findings in ContentPart/OpenAiRequestMapper (#266/#267) suppressed with the established rationale; PIT mutation gate restored to 100% by covering the ContentPart audio paths (#267) that lacked tests.

Deliberate follow-ups (not in this PR)

  • Companion upstream llama.cpp PR to split tts.cpp into a library TU + thin CLI (so the generator could shrink/drop later) — the "ideal long-term" route; left as a follow-up since the extraction is self-contained.
  • Upstream SPDX copyright line on committed files — intentionally not added, because no committed file contains upstream code (the derived TU does, and carries the banner). Provenance is referenced in the committed headers' comments.

Pre-existing, not addressable here

  • FOSSA "License Compliance" flags the dependency tree; this PR changes no dependencies, so its findings are identical to main.

vaijurao added 4 commits June 21, 2026 12:59
First step toward text-to-speech output. llama.cpp's TTS lives only in the standalone `llama-tts`
CLI (tools/tts/tts.cpp), not in the server TUs jllama compiles, and its audio synthesis is hand-rolled
DSP. Vendor that pure DSP (no llama/ggml/JNI state) into a header so the eventual JNI bridge and the
C++ tests can both use it:

- src/main/cpp/tts_dsp.hpp: fill_hann_window / twiddle / irfft / fold / embd_to_audio vendored
  byte-faithful from tts.cpp (kept verbatim so a llama.cpp bump is a mechanical re-sync), plus
  pcm_to_wav16_bytes — an in-memory replacement for tts.cpp's file-writing save_wav16, since the JNI
  layer will return WAV bytes to Java.
- src/test/cpp/test_tts_dsp.cpp: 5 unit tests (WAV header/payload + little-endian clamping, Hann
  window, fold trimming, embd_to_audio output-length identity) — pure, no model needed.
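For readers unfamiliar with the window function these tests exercise, the following is a minimal sketch of a Hann window using the textbook formula. The name `hann_window_sketch` and its signature are illustrative assumptions; it is not a copy of the `fill_hann_window` helper in tts.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of a Hann window: w[i] = 0.5 * (1 - cos(2*pi*i / D)).
// D = length for the periodic variant (used with FFT framing),
// D = length - 1 for the symmetric variant (both endpoints are zero).
std::vector<float> hann_window_sketch(int length, bool periodic) {
    assert(length > 1);
    const double pi    = 3.14159265358979323846;
    const int    denom = periodic ? length : length - 1;

    std::vector<float> w(static_cast<size_t>(length));
    for (int i = 0; i < length; ++i) {
        w[static_cast<size_t>(i)] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}
```

A unit test for such a window can check properties that need no model: the first sample is zero, the symmetric variant mirrors around its centre, and the peak value is 1.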

C++ suite 457/457 (was 452); clang-format clean.

NEXT (separate commits): a JNI method orchestrating the two-model OuteTTS pipeline (TTC LLM ->
audio codes via llama_decode; CTS vocoder -> embeddings -> embd_to_audio), then the Java
TextToSpeech API returning byte[] WAV, then a gated OuteTTS+WavTokenizer integration test.

Builds on the vendored DSP (milestone 1) to wire the full text-to-speech pipeline.

Native (compiles + links; JNI symbols exported):
- src/main/cpp/tts_engine.{h,cpp}: a self-contained OuteTTS orchestration adapted from
  tools/tts/tts.cpp main(), single-stream (n_parallel=1) with the built-in default speaker.
  Loads the TTC (OuteTTS) + CTS (WavTokenizer vocoder) models via common_init_from_params, builds
  the OuteTTS prompt, runs the llama_decode loop to generate audio codes, filters to the codec token
  range, runs the vocoder (llama_encode + llama_get_embeddings), and feeds embd_to_audio ->
  pcm_to_wav16_bytes. OuteTTS prompt helpers + default speaker vendored byte-faithfully.
- jllama.cpp: 3 TextToSpeech JNI methods (loadNative / synthesizeNative -> byte[] WAV / deleteNative),
  reusing parse_jstring + the c_llama_error exception-conversion pattern. tts_engine.cpp added to the
  jllama target.
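The "filters to the codec token range" step can be sketched as below. The bounds 151672..155772 are the range quoted in the review of this PR; the function name, the `token_t` alias, and the final rebasing of codes to start at zero are assumptions made for illustration, not a copy of tts_engine.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using token_t = int32_t;  // stand-in for llama_token in this sketch

// Sketch: drop every generated token that is not an audio code, then rebase
// the survivors so the first codec token maps to code 0 (assumed vocoder input).
std::vector<token_t> filter_codec_tokens_sketch(std::vector<token_t> tokens,
                                                token_t first = 151672,
                                                token_t last  = 155772) {
    assert(first <= last);
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [first, last](token_t t) { return t < first || t > last; }),
                 tokens.end());
    for (token_t & t : tokens) {
        t -= first;
    }
    return tokens;
}
```

Filtering before the vocoder stage matters because the text-to-codes model can emit ordinary text or control tokens between audio codes, and those have no meaning to the codes-to-speech model.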

Java:
- net.ladenthin.llama.TextToSpeech (AutoCloseable): new TextToSpeech(ttcPath, vocoderPath[, gpuLayers,
  threads]); synthesize(text) -> 24 kHz mono 16-bit WAV byte[].
- Gated TtsIntegrationTest (self-skips without the OuteTTS + WavTokenizer GGUFs) + 2 tts.* properties.

Verified: jllama links with the TTS engine, TextToSpeech JNI symbols exported in libjllama, C++ suite
457/457, TtsIntegrationTest compiles + self-skips, Spotless + Javadoc + clang-format clean.

NOT yet verified: the end-to-end synthesis at runtime — needs OuteTTS + WavTokenizer GGUFs (not staged
here); the gated test is the runtime gate. Known simplification: number-to-words romanization is a
pass-through (digits dropped), as noted in tts_engine.cpp. Remaining: README/CLAUDE.md docs.
…ution

Add the two-model OuteTTS TTS pipeline to CI so TtsIntegrationTest runs:
- publish.yml: TTS_MODEL_URL/NAME (OuteTTS-0.2-500M-Q4_K_M) + TTS_VOCODER_URL/NAME
  (WavTokenizer-Large-75-F16) env vars; download steps + the matching
  -Dnet.ladenthin.llama.tts.{ttc,vocoder}.model flags on the Linux x86_64
  (jcstress) test job.
- validate-models.sh: both GGUFs added to OPTIONAL_MODELS (validated when present,
  skipped where not downloaded).

Both URLs verified HTTP 200 (OuteTTS ~385 MB, WavTokenizer ~124 MB).

Per request, drop all in-code attribution from the TTS sources (tts_dsp.hpp,
tts_engine.cpp, tts_engine.h): remove the "The llama.cpp authors" SPDX line and
reword the "vendored/adapted from tts.cpp" comments to neutral descriptions.
Each file keeps its single Bernard Ladenthin SPDX header + MIT license (REUSE
stays compliant). Comment-only change: native lib builds, clang-format clean,
TTS DSP C++ tests pass.
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the
  two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop
  limitation, compatible GGUF links); two new rows in the System Properties
  Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp
  method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines);
  TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the
  C++ test-file table and the drifted counts reconciled to the actual 457
  (test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).

Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.

Owner

Thank you very much for the idea and the effort — TTS support via the OuteTTS pipeline is genuinely exciting and a great direction for this project.

However, before this can be merged, a cleanup phase is needed. Here are the issues to address:

1. Copyright / DRY violation: do not copy tools/tts/tts.cpp

A line-by-line comparison of tts_dsp.hpp and tts_engine.cpp against llama.cpp/tools/tts/tts.cpp shows that virtually every function, namely fill_hann_window, twiddle, irfft, fold, embd_to_audio, process_text, prompt_add, prompt_init, and prepare_guide_tokens, along with the default speaker profile strings and the codec token range 151672..155772, is copied nearly verbatim. Only cosmetic changes were made (static → inline, the WAV writer refactored to return bytes).

This violates the DRY principle and creates a maintenance burden: every llama.cpp upgrade risks silent divergence between the copied code and the upstream implementation. It also raises a MIT attribution concern: the MIT licence requires preserving the original copyright notice, which is absent from the new files.

The right approach is to compile tools/tts/tts.cpp directly, exactly as server-context.cpp, server-queue.cpp etc. are already wired in via CMakeLists.txt. The functions needed (fill_hann_window, irfft, fold, embd_to_audio, …) live in the upstream file — include it in the build rather than duplicating it.

2. For anything that cannot be used directly: patches/ + upstream PR

If parts of tools/tts/tts.cpp need adaptation for the JNI embedding (e.g. making the WAV writer return bytes instead of writing a file, or extracting the pipeline into a callable function rather than main()), the correct workflow is:

  1. Add a minimal *.patch file under patches/ (the existing cmake/apply-llama-patches.cmake machinery picks it up automatically).
  2. Open a PR on the llama.cpp side proposing the clean API surface (e.g. a tts_synthesize() function with an output-buffer parameter) so the patch can eventually be dropped.

This keeps all modifications auditable and keeps the upgrade path clean.

3. ArchUnit tests are currently failing

Please check and fix the ArchUnit layering violations before requesting re-review. The TextToSpeech class likely needs to be placed in (or explicitly permitted by) the architecture rules.


Summary of what a revised PR should look like:

  • CMakeLists.txt adds ${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp to the jllama target (or links the object, same pattern as the server files).
  • tts_dsp.hpp and the duplicated functions in tts_engine.cpp are removed entirely; the JNI bridge calls into the upstream functions directly.
  • Any necessary surface changes to tts.cpp (in-memory WAV output, callable API) are delivered as a patches/0002-tts-jni-api.patch with a companion llama.cpp PR.
  • Copyright headers on new files reference the upstream origin.
  • ArchUnit violations are fixed.

Happy to help design the patch or the upstream PR if that would be useful. Looking forward to a revised version!


Generated by Claude Code

Owner

One more thought on the fallback strategy: the strong preference is to simply add tools/tts/tts.cpp to the CMake compile phase — the same way server-context.cpp, server-queue.cpp etc. are already pulled in from the FetchContent tree. If the functions are in a linkable translation unit, there is nothing to copy at all.

If it turns out that tts.cpp cannot be included directly (e.g. because it defines a main() or has other link conflicts that make it unsuitable as a library TU), the next-best option before resorting to a manual copy would be a build-time extraction script: a small CMake configure_file step or a Python/shell script that runs during the configure or generate phase, reads the specific functions out of the upstream tts.cpp source file (which is already on disk via FetchContent), and writes a derived .cpp into the build tree. That way the "copy" is always derived mechanically from the exact upstream file at the pinned version, divergence is impossible, and upgrading llama.cpp automatically picks up any fixes. It is more complex than a direct include, but still infinitely better than a hand-maintained 1:1 duplicate.

That said — given that the existing server files (server-context.cpp etc.) are already compiled into libjllama without a main() conflict, it is very likely that the TTS functions can be extracted into a small helper header/source by the upstream PR mentioned above, making even the extraction script unnecessary. That route is worth trying first.


Generated by Claude Code

Owner

Following up on the earlier comment about the DRY principle and avoiding the hard copy of tools/tts/tts.cpp — here is a concrete proposal for how to include the upstream file directly.

Why direct inclusion is blocked

tools/tts/tts.cpp contains int main(int argc, char ** argv). Adding it to target_sources(jllama ...) would cause a multiple-definition linker error for main when building the shared library. This is the same reason tools/server/server.cpp is excluded from this project's CMakeLists.txt while server-context.cpp, server-queue.cpp, server-task.cpp etc. are included — those library TUs have no main().

Proposed solution: patch via patches/ mechanism

The project already has a patch mechanism (patches/ + cmake/apply-llama-patches.cmake) that applies git apply-compatible unified diffs to the llama.cpp source before it compiles. A small patch wrapping main() in a preprocessor guard solves the problem cleanly:

patches/0002-tts-wrap-main.patch

--- a/tools/tts/tts.cpp
+++ b/tools/tts/tts.cpp
@@ -312,6 +312,7 @@ static void prepare_guide_tokens(llama_model * model, const std::string & text,
 }
 
+#ifndef JLLAMA_SKIP_TTS_MAIN
 int main(int argc, char ** argv) {
     common_params params;
 
@@ -420,3 +421,4 @@ int main(int argc, char ** argv) {
 
     return 0;
 }
+#endif // JLLAMA_SKIP_TTS_MAIN

Then in CMakeLists.txt, replace the hand-copied tts_engine.cpp with:

target_sources(jllama PRIVATE
    ${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
)
target_compile_definitions(jllama PRIVATE JLLAMA_SKIP_TTS_MAIN)

The JLLAMA_SKIP_TTS_MAIN compile definition is scoped to the jllama target only. If the upstream standalone tts executable is also in the CMake tree, it compiles normally (no define → main() present). The standalone tool is completely unaffected.

This removes tts_dsp.hpp, tts_engine.h, and tts_engine.cpp from the project entirely, replacing them with the single upstream source. The richer upstream API (speaker_from_file, audio_text_from_speaker, audio_data_from_speaker for JSON-based speaker profiles) also becomes available without any extra effort.

A companion PR to llama.cpp to properly split tts.cpp into a library TU + CLI would be ideal long-term (so the patch can eventually be dropped), but the patch is a complete, self-contained solution on its own.

The exact @@ -NNN @@ line numbers in the patch will need to match the actual file at the pinned b9739 tag — please adjust them accordingly when creating the patch file.


Generated by Claude Code

@vaiju1981

Author

@bernardladenthin, I will take a look at the patch; that sounds more realistic. The other option is to look at llama-box as an alternative. I will work on this first thing tomorrow after my work day.

vaijurao added 2 commits June 21, 2026 21:58
… hand-copy)

Addresses review feedback on PR bernardladenthin#268: the TTS native pipeline reused llama.cpp's
tools/tts/tts.cpp by hand-copying its DSP/prompt/text helpers and default-speaker
strings into tts_dsp.hpp + tts_engine.cpp — a DRY/maintenance hazard that would
silently diverge on every llama.cpp upgrade, and a missing-attribution concern.

tts.cpp cannot simply be added to target_sources: it defines its own main()
(link clash, same reason server.cpp is excluded) and every helper is `static`
(internal linkage — unreachable from another TU). So instead of copying, the
helpers are now DERIVED MECHANICALLY from the pinned upstream source at configure
time:

- cmake/generate-tts-upstream.cmake reads the pinned tools/tts/tts.cpp, keeps the
  pre-main() span, strips `static` from the helpers the engine calls (external
  linkage), and extracts the two default-speaker literals out of main() into
  `extern const` strings. Emits build/tts_generated/tts_upstream_gen.cpp (never
  committed; regenerated from whatever tts.cpp the GIT_TAG resolves to, so a
  version bump is picked up automatically).
- CMakeLists runs it after FetchContent_MakeAvailable(llama.cpp) and compiles the
  generated TU into jllama.
- tts_upstream.h: committed, hand-written declarations of the extracted symbols
  (interface only). tts_engine.cpp keeps only our orchestration + the in-memory
  WAV writer (tts_wav.hpp, ours). tts_dsp.hpp and all copied helpers are removed.

Fail-loud on drift (same contract as patches/): the generator asserts the
`int main(` anchor, every de-static signature, and both speaker literals; a
rename aborts the configure, a type change fails the link. Silent divergence is
impossible.

Bonus: using upstream's real process_text (which calls replace_numbers_with_words)
fixes the previous digit-drop limitation — English numbers are now spoken.

Verified: jllama builds + links, 454 C++ tests pass, and TtsIntegrationTest
synthesizes a valid 24 kHz WAV end-to-end against the real OuteTTS + WavTokenizer
models.

test_tts_dsp.cpp -> test_tts_wav.cpp (now covers only our WAV writer; the DSP is
upstream's, covered end-to-end by TtsIntegrationTest).
…adenthin#266 regression)

LlamaArchitectureTest.layeredArchitecture was already failing on main (not
introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added
LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer),
but the Value layer's mayOnlyBeAccessedByLayers list — documented as "the EXACT
set of packages that reference it today" — was not updated. Add "Server" to it,
the same maintenance the rule's own javadoc prescribes.

Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its
own commit so it can be cherry-picked to main independently.
vaijurao added 2 commits June 21, 2026 23:27
The synthesizeNative signature added in the TTS milestone was wrapped by a
non-pinned clang-format; reflow it with the CI-pinned 22.1.5 so the clang-format
check passes. No behavior change.
…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established
false-positive categories already suppressed elsewhere with the same rationale:

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING — only field is the opaque native handle; a
  toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize() — fixed "TextToSpeech is closed"
  precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the
rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM — same input-validation guard as
  the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON — the
  canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII
  pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
…xisting bernardladenthin#267)

The PIT mutation gate (100% on value.*) was failing at 98% — 4 NO_COVERAGE
mutations, all in ContentPart's audio methods from the merged audio-input feature
(bernardladenthin#267): inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path)
had no tests at all. Pre-existing on main; this branch inherits it via the rebase.

Add four ContentPartTest cases — inputAudio("mp3"), audioFile .wav/.mp3 detection,
and audioFile unknown-extension rejection — mirroring the existing imageFile tests.
Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.
@vaiju1981 vaiju1981 deployed to startgate June 22, 2026 08:24 — with GitHub Actions Active
@vaiju1981

Author

@bernardladenthin the PR is ready for re-review.

@bernardladenthin

Owner

ty, I'll check later
