Fix a crash on multiple active LoRa (issue 18050) #18375
Closed
Split command line parameters and runtime adapter info into different structs (see the sketch after this list).
Bump max graph size according to LoRa count and tensor size.
Fixes bug #18050
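As a rough illustration of the split, here is a minimal sketch; the struct and field names below are hypothetical, not necessarily the ones used in this PR:

```cpp
// Minimal sketch of the split; names are illustrative, not the PR's actual types.
#include <string>

#include "llama.h"

// What the user asked for on the command line: plain, freely copyable data.
struct lora_cli_param {
    std::string path;         // path to the LoRa adapter file
    float       scale = 1.0f; // requested scale (negative values are valid, e.g. -1.0f)
};

// What exists at runtime after loading: the adapter handle plus its active scale.
// The handle is non-owning here; ownership stays with the init result (see below).
struct lora_runtime_info {
    llama_adapter_lora * adapter = nullptr;
    float                scale   = 1.0f;
};
```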
Some rationale on the changes. LoRa adapters can be as complex as the original model, hooking into every stage, i.e. Wk, Wq, Wv in attention and Wup, Wgate, Wdown in the FFN. With multiple LoRa adapters the graph size can grow to more than double that of the original graph, which leads to the #18050 crash; this means you cannot safely add arbitrary LoRa adapters unless you specifically initialized `llama_context` via `common_init_from_params` for these specific adapters.
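The graph-size bump is roughly of this shape; the function name and constants below are hypothetical and only meant to show that the node budget has to grow with the LoRa tensor count, not just the base model:

```cpp
// Hypothetical sketch: size the graph node budget for the base model *plus*
// all LoRa tensors, instead of for the base model alone.
#include <algorithm>
#include <cstdint>

static uint32_t graph_max_nodes(uint32_t n_model_tensors, uint32_t n_lora_tensors) {
    // every hooked weight adds extra matmul/add nodes per LoRa adapter,
    // so the budget must scale with the combined tensor count
    const uint32_t n_total = n_model_tensors + n_lora_tensors;
    return std::max<uint32_t>(8192u, 8u * n_total);
}
```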
I could have just passed the tensor count into the `llama_context` constructor without much other noise, but that would make the mess even harder to untangle later. And there are more issues to it; for example, with LoRa adapters enabled I get: […]. That's why the "build graph with all lora-s active?" question was left in `llama_context::graph_reserve`.
So in the end I moved LoRa adapter initialization into `common_init_from_params` and explicitly bound the adapters' lifetime to the `common_init_result`: the original code implicitly stored adapters in `common_init_result`, but most handling was done via raw pointers stored in `common_params` instead, which was a timebomb waiting to explode. In particular, `server_context` makes unrestricted deep copies/assignments of `common_params`, so it was all messed up anyway (and that's also why ngxson left TODO comments in the code).
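A sketch of the lifetime binding, assuming an owning smart pointer built on the public `llama_adapter_lora_*` API; the `init_result`/`load_lora` names are illustrative, not the actual types in `common/`:

```cpp
// Hypothetical sketch: the init result owns the adapters; params structs keep no raw pointers.
#include <memory>
#include <string>
#include <vector>

#include "llama.h"

struct lora_deleter {
    void operator()(llama_adapter_lora * a) const { llama_adapter_lora_free(a); }
};
using lora_ptr = std::unique_ptr<llama_adapter_lora, lora_deleter>;

struct init_result {
    std::vector<lora_ptr> loras; // adapters live exactly as long as the init result
};

// Loading during init; model and error handling are simplified for brevity.
static void load_lora(init_result & res, llama_model * model, const std::string & path) {
    if (llama_adapter_lora * a = llama_adapter_lora_init(model, path.c_str())) {
        res.loras.emplace_back(a);
    }
}
```

With this shape, `server_context` can copy its parameters as much as it likes without ever owning, or later dangling, an adapter pointer.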
Another small thing: `-1.0f` is a valid adapter scale, i.e. an inverted adapter, but `server_context` treated negative scales as "adapter disabled", which is incorrect.
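For the scale handling, the fix amounts to treating only an exactly-zero scale as "disabled". A hedged sketch, with a hypothetical `active_lora` struct around the real `llama_set_adapter_lora` call:

```cpp
// Hypothetical sketch: only scale == 0.0f means "disabled"; negative scales
// such as -1.0f (an inverted adapter) are valid and must still be applied.
#include <vector>

#include "llama.h"

struct active_lora {
    llama_adapter_lora * adapter; // non-owning
    float                scale;
};

static void set_active_loras(llama_context * ctx, const std::vector<active_lora> & loras) {
    llama_clear_adapter_lora(ctx);
    for (const auto & la : loras) {
        if (la.scale == 0.0f) {
            continue; // disabled only when the scale is exactly zero
        }
        llama_set_adapter_lora(ctx, la.adapter, la.scale);
    }
}
```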