
bug: Jetson Orin compile + install #560

Description

@mcr-ksh

Issue description

Can't run on Jetson Orin / Thor

Expected Behavior

Be able to run node-llama-cpp on a Jetson Orin / Thor.

Actual Behavior

Crashing:

/root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
[node-llama-cpp] CUDA error: an internal operation failed
[node-llama-cpp]   current device: 0, in function ggml_cuda_op_mul_mat_cublas at /root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1363
[node-llama-cpp]   cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
Aborted (core dumped)

Steps to reproduce

After two full days of trying, I still can't compile llama.cpp through node-llama-cpp so that it runs.

By now I have tried all sorts of compile options. This is only my latest attempt:

NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON npx --no node-llama-cpp source build --gpu

If I don't set GGML_CUDA_NO_VMM=ON, I get memory allocation errors.
If I don't set CMAKE_CUDA_ARCHITECTURES=87, some random virtual CUDA architecture is detected.

Here is another attempt, using the same options as my standalone llama.cpp build from GitHub:
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON NODE_LLAMA_CPP_CMAKE_OPTION_DNLC_VARIANT=cuda.b8121 NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_CUB_3DOT2=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_BACKEND_DL=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUBLAS=OFF NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc NODE_LLAMA_CPP_CMAKE_OPTION_CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda npx --no node-llama-cpp source build --gpu

My Environment

| Dependency | Version |
| --- | --- |
| Operating System | Jetson Orin |
| CPU | ARM aarch64 |
| Node.js version | 24.13.1 |
| TypeScript version | ? |
| node-llama-cpp version | 3.16.2 |

$ cat /etc/nv_tegra_release
R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025
KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

npx --yes node-llama-cpp inspect gpu output:

# npx --yes node-llama-cpp inspect gpu
OS: Ubuntu 22.04.5 LTS (arm64)
Node: 24.13.1 (arm64)

node-llama-cpp: 3.16.2
Prebuilt binaries: b8121
Cloned source: b8121

CUDA: available
Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan

CUDA device: Orin
CUDA used VRAM: 5.8% (3.56GB/61.37GB)
CUDA free VRAM: 94.19% (57.81GB/61.37GB)

CPU model: Cortex-A78AE
Math cores: 12
Used RAM: 5.8% (3.56GB/61.37GB)
Free RAM: 94.19% (57.81GB/61.37GB)
Used swap: 0.67% (211MB/30.68GB)
Max swap size: 30.68GB
mmap: supported

Additional Context

llama.cpp compiled from source directly on the Orin works great.

$ compile.sh

cmake -B build -DGGML_CUDA=ON -DDNLC_VARIANT=cuda.b8121 -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA_CUB_3DOT2=ON -DGGML_BACKEND_DL=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_NO_VMM=ON -DGGML_CUBLAS=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
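
For comparison, the node-llama-cpp build commands under "Steps to reproduce" carry each of these CMake flags over by prefixing the option name with NODE_LLAMA_CPP_CMAKE_OPTION_. Below is a minimal sketch of that mapping. The helper name cmake_flags_to_env is hypothetical (it is not part of node-llama-cpp or llama.cpp); it only prints the assignments and does not start a build.

```shell
# Hypothetical helper, for illustration only: print each -DKEY=VALUE CMake
# flag as a NODE_LLAMA_CPP_CMAKE_OPTION_KEY=VALUE environment assignment.
cmake_flags_to_env() {
  for flag in "$@"; do
    case "$flag" in
      # Strip the leading -D and prepend the node-llama-cpp prefix.
      -D*=*) printf 'NODE_LLAMA_CPP_CMAKE_OPTION_%s\n' "${flag#-D}" ;;
      # Anything else (for example "-B build") has no env-var equivalent here.
      *)     printf 'skipping non -D flag: %s\n' "$flag" >&2 ;;
    esac
  done
}

# A subset of the flags from compile.sh.
cmake_flags_to_env -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA_NO_VMM=ON
```

The printed names match the variables used in the build attempts above, so a mismatch between the two builds is unlikely to come from a mistyped option name.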

Test run:

root@jetson-orin:/usr/src/llama.cpp/build/bin# ./llama-cli -m /root/.cache/qmd/models/hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf -p "Hello, how are you?" -n 128
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: no
load_backend: loaded CUDA backend from /mnt/src/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /mnt/src/llama.cpp/build/bin/libggml-cpu.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8136-9051663d5
model      : hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Hello, how are you?

[Start thinking]
lex: greetings from the virtual assistant
lex: how are you today?
vec: greetings from the virtual assistant
vec: how are you today?
hyde: The topic of hello, how are you? covers greetings from the virtual assistant. Proper implementation follows established patterns and best practices.

[ Prompt: 293.1 t/s | Generation: 60.9 t/s ]

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, but I have no idea how. I'm not a dev.
