RPC Internal Error when loading Gemma 3 (llama-cpp backends) #9414

@garayco

Description

LocalAI version:
Tested on:

  • localai/localai:latest-gpu-nvidia-cuda-13
  • localai/localai:master-gpu-nvidia-cuda-13
  • Backends tested: llama-cpp (stable), cuda13-llama-cpp (stable), llama-cpp-development, and cuda13-llama-cpp-development

Environment, CPU architecture, OS, and Version:
Linux garayco 6.6.87.2-microsoft-standard-WSL2 #1 SMP PREEMPT_DYNAMIC Thu Jun 5 18:30:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Environment: Docker running inside WSL2 (Ubuntu) on Windows.
Hardware: NVIDIA RTX 3060 (12GB VRAM).

Describe the bug
LocalAI fails to load the Gemma 3 models (specifically gemma3:12b-it-q4_K_M) when using the llama-cpp backend. The model loading aborts with an RPC Internal Error stating that the key gemma3.attention.layer_norm_rms_epsilon is not found in the model hyperparameters.

To Reproduce

  1. Start the LocalAI container with GPU support:
    docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13
  2. Download the model via Ollama
  3. Run the model from the chat UI
  4. The backend process crashes or times out during the model load phase.
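The steps above, roughly as commands. Note that step 2's import mechanism and the `/models/apply` payload are assumptions about LocalAI's API, not taken from the report; the chat-completions call is the standard OpenAI-compatible endpoint LocalAI exposes:

```
# 1. Start LocalAI with GPU support (as in the report)
docker run -ti --name local-ai -p 8080:8080 --gpus all \
  localai/localai:latest-gpu-nvidia-cuda-13

# 2. Import the Ollama-packaged model (assumed syntax for LocalAI's gallery API)
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "ollama://gemma3:12b-it-q4_K_M"}'

# 3. Trigger a chat completion, which forces the model load and surfaces the error
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3:12b-it-q4_K_M",
       "messages": [{"role": "user", "content": "hi"}]}'
```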

Logs

BackendLoader starting modelID="gemma3:12b-it-q4_K_M" backend="llama-cpp-development" model="gemma3__12b-it-q4_K_M"
ERROR Failed to load model modelID="gemma3:12b-it-q4_K_M" error=failed to load model with internal loader: could not load model: rpc error: code = Internal desc = Failed to load model: /models/gemma3__12b-it-q4_K_M. Error: llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model; llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model; llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model backend="llama-cpp-development"
ERROR Stream ended with error error=failed to load model with internal loader: could not load model: rpc error: code = Internal desc = Failed to load model: /models/gemma3__12b-it-q4_K_M. Error: llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model; llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model; llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon; llama_model_load_from_file_impl: failed to load model
INFO  HTTP request method="POST" path="/v1/chat/completions" status=200

Additional context
I initially tested this using the standard stable backends (llama-cpp and cuda13-llama-cpp). Thinking it was a versioning issue, I then completely removed the stable versions via the WebUI and exclusively tested the -development versions to force the latest build. The exact same RPC error persists across all of them. It seems the -development builds need to be re-triggered upstream to pull the latest llama.cpp master branch.
