feat(tts): text-to-speech via the OuteTTS + WavTokenizer pipeline#268
vaiju1981 wants to merge 9 commits into
First step toward text-to-speech output. llama.cpp's TTS lives only in the standalone `llama-tts` CLI (tools/tts/tts.cpp), not in the server TUs jllama compiles, and its audio synthesis is hand-rolled DSP. Vendor that pure DSP (no llama/ggml/JNI state) into a header so the eventual JNI bridge and the C++ tests can both use it:
- src/main/cpp/tts_dsp.hpp: fill_hann_window / twiddle / irfft / fold / embd_to_audio vendored byte-faithful from tts.cpp (kept verbatim so a llama.cpp bump is a mechanical re-sync), plus pcm_to_wav16_bytes, an in-memory replacement for tts.cpp's file-writing save_wav16, since the JNI layer will return WAV bytes to Java.
- src/test/cpp/test_tts_dsp.cpp: 5 unit tests (WAV header/payload + little-endian clamping, Hann window, fold trimming, embd_to_audio output-length identity); pure, no model needed.
C++ suite 457/457 (was 452); clang-format clean.
NEXT (separate commits): a JNI method orchestrating the two-model OuteTTS pipeline (TTC LLM -> audio codes via llama_decode; CTS vocoder -> embeddings -> embd_to_audio), then the Java TextToSpeech API returning byte[] WAV, then a gated OuteTTS+WavTokenizer integration test.
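For readers unfamiliar with the in-memory WAV idea, here is a minimal sketch of what a writer like pcm_to_wav16_bytes could look like. It is an illustration under assumptions, not the code in this PR: the name `wav16_bytes_sketch`, the `put_*` helpers, and the exact scaling and clamping constants are hypothetical. It emits the canonical 44-byte mono PCM header followed by little-endian 16-bit samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers for this sketch; not the PR's actual implementation.
static void put_u16le(std::vector<uint8_t>& out, uint16_t v) {
    out.push_back(static_cast<uint8_t>(v & 0xFF));
    out.push_back(static_cast<uint8_t>((v >> 8) & 0xFF));
}

static void put_u32le(std::vector<uint8_t>& out, uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<uint8_t>((v >> shift) & 0xFF));
    }
}

static void put_tag(std::vector<uint8_t>& out, const char (&tag)[5]) {
    out.insert(out.end(), tag, tag + 4); // four ASCII bytes, no terminator
}

// Converts float PCM (nominally in [-1, 1]) to a mono 16-bit WAV held in memory.
std::vector<uint8_t> wav16_bytes_sketch(const std::vector<float>& pcm, uint32_t sample_rate) {
    assert(sample_rate > 0);
    const uint32_t data_size = static_cast<uint32_t>(pcm.size() * sizeof(int16_t));

    std::vector<uint8_t> out;
    out.reserve(44 + data_size);

    put_tag(out, "RIFF");
    put_u32le(out, 36 + data_size);  // size of everything after this field
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_u32le(out, 16);              // fmt chunk size for plain PCM
    put_u16le(out, 1);               // audio format: PCM
    put_u16le(out, 1);               // channels: mono
    put_u32le(out, sample_rate);
    put_u32le(out, sample_rate * 2); // byte rate: 1 channel * 2 bytes per sample
    put_u16le(out, 2);               // block align
    put_u16le(out, 16);              // bits per sample
    put_tag(out, "data");
    put_u32le(out, data_size);

    for (float s : pcm) {
        // Scale, then clamp so out-of-range input saturates instead of wrapping.
        const float scaled = std::max(-32768.0f, std::min(32767.0f, s * 32767.0f));
        put_u16le(out, static_cast<uint16_t>(static_cast<int16_t>(scaled)));
    }
    return out;
}
```

Returning a byte vector rather than writing a file is what lets a JNI layer hand the result to Java as a byte[] without touching the filesystem.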
Builds on the vendored DSP (milestone 1) to wire the full text-to-speech pipeline.
Native (compiles + links; JNI symbols exported):
- src/main/cpp/tts_engine.{h,cpp}: a self-contained OuteTTS orchestration adapted from
tools/tts/tts.cpp main(), single-stream (n_parallel=1) with the built-in default speaker.
Loads the TTC (OuteTTS) + CTS (WavTokenizer vocoder) models via common_init_from_params, builds
the OuteTTS prompt, runs the llama_decode loop to generate audio codes, filters to the codec token
range, runs the vocoder (llama_encode + llama_get_embeddings), and feeds embd_to_audio ->
pcm_to_wav16_bytes. OuteTTS prompt helpers + default speaker vendored byte-faithfully.
- jllama.cpp: 3 TextToSpeech JNI methods (loadNative / synthesizeNative -> byte[] WAV / deleteNative),
reusing parse_jstring + the c_llama_error exception-conversion pattern. tts_engine.cpp added to the
jllama target.
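The "filters to the codec token range" step above can be sketched as follows. This is a hedged illustration only: the function name `keep_codec_tokens` and the bounds passed by the caller are hypothetical, and rebasing the surviving tokens to zero-based codec indices is an assumption about what the vocoder expects, not something this commit message states.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Drops every generated token outside [first_code, last_code] and rebases the
// rest so the first codec token becomes index 0. Bounds are caller-supplied;
// the real values depend on the model's vocabulary.
std::vector<int32_t> keep_codec_tokens(std::vector<int32_t> tokens,
                                       int32_t first_code,
                                       int32_t last_code) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [=](int32_t t) { return t < first_code || t > last_code; }),
                 tokens.end());
    for (int32_t& t : tokens) {
        t -= first_code; // assumption: vocoder input is zero-based
    }
    return tokens;
}
```

Taking the token vector by value keeps the caller's raw decode output intact, which can help when debugging what the TTC model actually produced.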
Java:
- net.ladenthin.llama.TextToSpeech (AutoCloseable): new TextToSpeech(ttcPath, vocoderPath[, gpuLayers,
threads]); synthesize(text) -> 24 kHz mono 16-bit WAV byte[].
- Gated TtsIntegrationTest (self-skips without the OuteTTS + WavTokenizer GGUFs) + 2 tts.* properties.
Verified: jllama links with the TTS engine, TextToSpeech JNI symbols exported in libjllama, C++ suite
457/457, TtsIntegrationTest compiles + self-skips, Spotless + Javadoc + clang-format clean.
NOT yet verified: the end-to-end synthesis at runtime — needs OuteTTS + WavTokenizer GGUFs (not staged
here); the gated test is the runtime gate. Known simplification: number-to-words romanization is a
pass-through (digits dropped), as noted in tts_engine.cpp. Remaining: README/CLAUDE.md docs.
…ution
Add the two-model OuteTTS TTS pipeline to CI so TtsIntegrationTest runs:
- publish.yml: TTS_MODEL_URL/NAME (OuteTTS-0.2-500M-Q4_K_M) + TTS_VOCODER_URL/NAME
(WavTokenizer-Large-75-F16) env vars; download steps + the matching
-Dnet.ladenthin.llama.tts.{ttc,vocoder}.model flags on the Linux x86_64
(jcstress) test job.
- validate-models.sh: both GGUFs added to OPTIONAL_MODELS (validated when present,
skipped where not downloaded).
Both URLs verified HTTP 200 (OuteTTS ~385 MB, WavTokenizer ~124 MB).
Per request, drop all in-code attribution from the TTS sources (tts_dsp.hpp,
tts_engine.cpp, tts_engine.h): remove the "The llama.cpp authors" SPDX line and
reword the "vendored/adapted from tts.cpp" comments to neutral descriptions.
Each file keeps its single Bernard Ladenthin SPDX header + MIT license (REUSE
stays compliant). Comment-only change: native lib builds, clang-format clean,
TTS DSP C++ tests pass.
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop limitation, compatible GGUF links); two new rows in the System Properties Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines); TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the C++ test-file table and the drifted counts reconciled to the actual 457 (test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).
Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.
2c0321b to
f009837
|
Thank you very much for the idea and the effort — TTS support via the OuteTTS pipeline is genuinely exciting and a great direction for this project. However, before this can be merged, a cleanup phase is needed. Here are the issues to address: 1. Copyright / DRY violation: do not copy
|
|
One more thought on the fallback strategy: the strong preference is to simply add If it turns out that That said — given that the existing server files ( Generated by Claude Code |
|
Following up on the earlier comment about the DRY principle and avoiding the hard copy of Why direct inclusion is blocked
Proposed solution: patch via
|
|
@bernardladenthin, I will take a look at the patch approach; that sounds more realistic. The other option is looking at llama-box as an alternative. I will work on this first thing tomorrow after my work day.
… hand-copy)
Addresses review feedback on PR bernardladenthin#268: the TTS native pipeline reused llama.cpp's tools/tts/tts.cpp by hand-copying its DSP/prompt/text helpers and default-speaker strings into tts_dsp.hpp + tts_engine.cpp. That is a DRY/maintenance hazard that would silently diverge on every llama.cpp upgrade, and a missing-attribution concern.
tts.cpp cannot simply be added to target_sources: it defines its own main() (link clash, same reason server.cpp is excluded) and every helper is `static` (internal linkage, so unreachable from another TU). So instead of copying, the helpers are now DERIVED MECHANICALLY from the pinned upstream source at configure time:
- cmake/generate-tts-upstream.cmake reads the pinned tools/tts/tts.cpp, keeps the pre-main() span, strips `static` from the helpers the engine calls (external linkage), and extracts the two default-speaker literals out of main() into `extern const` strings. Emits build/tts_generated/tts_upstream_gen.cpp (never committed; regenerated from whatever tts.cpp the GIT_TAG resolves to, so a version bump is picked up automatically).
- CMakeLists runs it after FetchContent_MakeAvailable(llama.cpp) and compiles the generated TU into jllama.
- tts_upstream.h: committed, hand-written declarations of the extracted symbols (interface only). tts_engine.cpp keeps only our orchestration + the in-memory WAV writer (tts_wav.hpp, ours). tts_dsp.hpp and all copied helpers are removed.
Fail-loud on drift (same contract as patches/): the generator asserts the `int main(` anchor, every de-static signature, and both speaker literals; a rename aborts the configure, a type change fails the link. Silent divergence is impossible.
Bonus: using upstream's real process_text (which calls replace_numbers_with_words) fixes the previous digit-drop limitation; English numbers are now spoken.
Verified: jllama builds + links, 454 C++ tests pass, and TtsIntegrationTest synthesizes a valid 24 kHz WAV end-to-end against the real OuteTTS + WavTokenizer models. test_tts_dsp.cpp -> test_tts_wav.cpp (now covers only our WAV writer; the DSP is upstream's, covered end-to-end by TtsIntegrationTest).
…adenthin#266 regression)
LlamaArchitectureTest.layeredArchitecture was already failing on main (not introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer), but the Value layer's mayOnlyBeAccessedByLayers list, documented as "the EXACT set of packages that reference it today", was not updated. Add "Server" to it, the same maintenance the rule's own javadoc prescribes.
Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its own commit so it can be cherry-picked to main independently.
The synthesizeNative signature added in the TTS milestone was wrapped by a non-pinned clang-format; reflow it with the CI-pinned 22.1.5 so the clang-format check passes. No behavior change.
…ladenthin#266/bernardladenthin#267 findings
SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale:
This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING: only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize(): fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards).
Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM: same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON: the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block.
Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
…xisting bernardladenthin#267)
The PIT mutation gate (100% on value.*) was failing at 98% with 4 NO_COVERAGE mutations, all in ContentPart's audio methods from the merged audio-input feature (bernardladenthin#267): inputAudio was never exercised with "mp3" (only "wav"), and audioFile(Path) had no tests at all. Pre-existing on main; this branch inherits it via the rebase.
Add four ContentPartTest cases (inputAudio("mp3"), audioFile .wav/.mp3 detection, and audioFile unknown-extension rejection), mirroring the existing imageFile tests. Local PIT now reports 243/243 killed (100%); ContentPartTest 17 -> 21, all green.
|
@bernardladenthin the PR is ready for re-review.
|
ty, I'll check later |
Adds `TextToSpeech`, an `AutoCloseable` native type that synthesizes audio from text over llama.cpp's two-model OuteTTS pipeline (OuteTTS text-to-codes → WavTokenizer codes-to-speech vocoder), returning a 24 kHz mono 16-bit WAV byte stream.
Approach: the OuteTTS helpers are DERIVED from upstream, never hand-copied
`cmake/generate-tts-upstream.cmake` reads the pinned upstream `tools/tts/tts.cpp` at configure time, drops `main()`, gives the called helpers external linkage, extracts the two default-speaker literals, and writes `build/tts_generated/tts_upstream_gen.cpp` (never committed, regenerated on every configure). `tts_engine.{h,cpp}` is only the orchestration; `tts_wav.hpp` is our in-memory WAV writer.
What's included
- `tts_upstream.h` (declarations) + `tts_engine.{h,cpp}` (orchestration) + `tts_wav.hpp`, compiled into `libjllama`; 3 new JNI methods.
- `TextToSpeech` with `synthesize(text)` and an explicit-sampling overload; GPU-offload constructor.
- `test_tts_wav.cpp` (the WAV writer) + `TtsIntegrationTest` (self-skips without GGUFs). The derived DSP is upstream's, covered end-to-end by the integration test.
- `TtsIntegrationTest` runs in CI.
English number words are now expanded for speech (e.g. `3` → "three", via upstream's real `process_text`); non-English text is not romanized. Synthesis uses the built-in default speaker profile.
Verified end-to-end locally
OuteTTS-0.2-500M-Q4 + WavTokenizer on CPU: `TtsIntegrationTest` PASS; manual synthesis → valid 24 kHz mono 16-bit WAV, 2.91 s, genuine modulated speech (afinfo-validated). Builds + passes Java tests on every platform (Linux aarch64, Windows VS+Ninja, macOS Metal/no-Metal, Android ±OpenCL, manylinux CUDA).
Responding to the review feedback
1. DRY / no copy of `tts.cpp` (+ MIT attribution). Done via the build-time extraction above: `tts_dsp.hpp` and all copied helpers are gone; divergence is impossible (derived from the pinned source each build). The derived TU carries the upstream MIT/the llama.cpp authors banner.
Why not "add `tts.cpp` to `target_sources`" / the wrap-`main()` patch. Verified against b9739: even with `main()` guarded, every helper is `static` (internal linkage → uncallable cross-TU) and the pipeline + default-speaker literals live inside `main()` (compiled out by the guard), so direct inclusion links but exports nothing usable, and `speaker_from_file` / `audio_*_from_speaker` aren't reachable either (also `static`). Hence the extraction route (the reviewer's own suggested fallback). It is read-only, so no `patches/` entry is required.
3. ArchUnit. Fixed (12/12). The failure was pre-existing on `main` (the #266 `LlamaModelBackend` → `value` edge), not `TextToSpeech`, which sits in the root layer and needs no rule change.
Also fixed (pre-existing on `main`, inherited via rebase): SpotBugs findings in `ContentPart` / `OpenAiRequestMapper` (#266/#267) suppressed with the established rationale; PIT mutation gate restored to 100% by covering the `ContentPart` audio paths (#267) that lacked tests.
Deliberate follow-ups (not in this PR)
`tts.cpp` into a library TU + thin CLI (so the generator could shrink/drop later): the "ideal long-term" route; left as a follow-up since the extraction is self-contained.
Pre-existing, not addressable here
`main`.