LocalAI version:
Tested on:
- localai/localai:latest-gpu-nvidia-cuda-13
- localai/localai:master-gpu-nvidia-cuda-13
- Backends tested: `llama-cpp` (stable), `cuda13-llama-cpp` (stable), `llama-cpp-development`, and `cuda13-llama-cpp-development`
Environment, CPU architecture, OS, and Version:
Linux garayco 6.6.87.2-microsoft-standard-WSL2 #1 SMP PREEMPT_DYNAMIC Thu Jun 5 18:30:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Environment: Docker running inside WSL2 (Ubuntu) on Windows.
Hardware: NVIDIA RTX 3060 (12GB VRAM).
Describe the bug
LocalAI fails to load Gemma 3 models (specifically `gemma3:12b-it-q4_K_M`) when using the `llama-cpp` backend. Model loading aborts with an RPC Internal error stating that the key `gemma3.attention.layer_norm_rms_epsilon` is not found in the model hyperparameters.
To Reproduce
- Start the LocalAI container with GPU support:
  `docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13`
- Download the model via Ollama.
- Run the model in the chat.
- The backend process crashes/times out during the model load phase.
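For reference, the chat request in the last step can also be issued directly against the OpenAI-compatible endpoint (host/port assume the `docker run` command above; the model name matches the `modelID` shown in the logs):

```shell
# Reproduces the failing request; the server returns the RPC Internal error
# from the logs instead of a completion.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:12b-it-q4_K_M",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```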
Logs
```
BackendLoader starting modelID="gemma3:12b-it-q4_K_M" backend="llama-cpp-development" model="gemma3__12b-it-q4_K_M"
ERROR Failed to load model modelID="gemma3:12b-it-q4_K_M" error=failed to load model with internal loader: could not load model: rpc error: code = Internal desc = Failed to load model: /models/gemma3__12b-it-q4_K_M. Error: llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model; llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model; llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model backend="llama-cpp-development"
ERROR Stream ended with error error=failed to load model with internal loader: could not load model: rpc error: code = Internal desc = Failed to load model: /models/gemma3__12b-it-q4_K_M. Error: llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model; llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model; llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model
INFO HTTP request method="POST" path="/v1/chat/completions" status=200
```
Additional context
I initially tested this using the standard stable backends (`llama-cpp` and `cuda13-llama-cpp`). Thinking it was a versioning issue, I then completely removed the stable versions via the WebUI and tested only the `-development` versions to force the latest build. The exact same RPC error persists across all of them. It seems the `-development` builds need to be re-triggered upstream to pull in the latest llama.cpp master branch, which added support for Gemma 3's GGUF metadata.