Add support for end to end in-memory QA evaluation #1865
jperez999 merged 10 commits into NVIDIA:main from
Conversation
Greptile Summary
This PR adds two major features: (1) an end-to-end in-memory QA evaluation mode on graph_pipeline and (2) a live-RAG SDK on Retriever. Most of the issues raised in the previous review round have been addressed: the QA-sweep exit code bug is fixed, and the heterogeneous-judge warning is promoted to a ValueError.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/retriever.py | Adds retrieve(), retrieve_batch(), answer(), and pipeline() live-RAG methods plus RetrieverPipelineBuilder; per-call top_k threading is thread-safe, but the judge builder accesses a private attribute and the embedding credential normalization was silently weakened. |
| nemo_retriever/src/nemo_retriever/llm/clients/litellm.py | New LiteLLMClient with structured transport/sampling split; default temperature now correctly set to 0.0 in the structured constructor, resolving the prior mismatch. |
| nemo_retriever/src/nemo_retriever/llm/clients/judge.py | LLMJudge refactored from evaluation/judges.py with structured params split; missing public transport property forces callers to reach into _client. |
| nemo_retriever/src/nemo_retriever/llm/clients/__init__.py | Package re-export for backward compatibility after the module-to-package refactor; private helpers _build_rag_prompt and _parse_judge_response are included in __all__ unnecessarily. |
| nemo_retriever/src/nemo_retriever/evaluation/retrievers.py | _initialize_index refactored as single source of truth for both file and in-memory construction paths; from_lancedb() and _from_dict() factory methods added with comprehensive tests. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Distinct-judge warning promoted to ValueError; build_eval_chain gains lancedb guard to match build_eval_pipeline; num_retries threaded through to judge constructor. |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Large-column filtering now applied on the dataframe= path via _MARKDOWN_PARQUET_COLUMNS allow-list, matching the parquet_dir path and resolving the prior memory spike concern. |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | End-to-end QA mode added; exit code bug fixed via qa_failed flag + typer.Exit(code=1); eval_config=None guard added before load_eval_config. |
| nemo_retriever/tests/test_live_rag.py | 645-line test file covering Retriever.retrieve, retrieve_batch, answer, pipeline builder, and RetrieverPipelineBuilder with mocked LanceDB and LLM calls. |
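The allow-list filtering described for io/markdown.py can be sketched as follows. This is a minimal illustration of the idea, not the repo's code; the column names in `_MARKDOWN_PARQUET_COLUMNS` and the helper name are assumptions.

```python
import pandas as pd

# Hypothetical allow-list mirroring the _MARKDOWN_PARQUET_COLUMNS idea:
# only columns needed for markdown rendering survive, so large payload
# columns never inflate memory on the dataframe= path.
_MARKDOWN_PARQUET_COLUMNS = ["doc_id", "page_number", "markdown"]

def filter_markdown_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only allow-listed columns that are actually present."""
    keep = [c for c in _MARKDOWN_PARQUET_COLUMNS if c in df.columns]
    return df[keep]

df = pd.DataFrame({
    "doc_id": ["a"],
    "page_number": [1],
    "markdown": ["# Title"],
    "raw_bytes": [b"\x00" * 8],  # large payload column that gets dropped
})
filtered = filter_markdown_columns(df)
```

Intersecting the allow-list with `df.columns` keeps the helper safe on dataframes that carry only a subset of the expected columns.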
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User
    participant R as Retriever
    participant RPB as RetrieverPipelineBuilder
    participant LRO as LiveRetrievalOperator
    participant LDB as LanceDB
    participant LLM as LiteLLMClient
    participant J as LLMJudge
    U->>R: retrieve(query, top_k)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R-->>U: RetrievalResult
    U->>R: answer(query, llm, judge, reference)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R->>LLM: generate(query, chunks)
    LLM-->>R: GenerationResult
    par Tier 1+2 scoring
        R->>R: token_f1 / answer_in_context
    and Tier 3 judging
        R->>J: judge(query, reference, answer)
        J-->>R: JudgeResult
    end
    R-->>U: AnswerResult
    U->>R: pipeline(top_k)
    R-->>RPB: RetrieverPipelineBuilder
    U->>RPB: .generate(llm).score().judge(judge)
    U->>RPB: .run(queries, reference)
    RPB->>LRO: run(df)
    LRO->>LDB: batch vector search
    LDB-->>LRO: hits[]
    LRO-->>RPB: df with context
    RPB->>LLM: QAGenerationOperator
    LLM-->>RPB: answers
    RPB->>J: JudgingOperator
    J-->>RPB: scores
    RPB-->>U: pd.DataFrame
```
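The Tier 1 `token_f1` score in the diagram is a standard lexical metric; a minimal sketch (SQuAD-style token F1, not necessarily the repo's exact implementation) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer:
    precision/recall over the multiset of lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # both empty counts as a match; one-sided empty is a miss
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because it is purely lexical it runs in-process with no model call, which is why it sits in the cheap Tier 1+2 branch while the LLM judge runs in parallel as Tier 3.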
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 573
Comment:
**Private attribute access creates hidden coupling**
`judge._client.transport` reaches into `LLMJudge`'s private implementation. If `LLMJudge` is ever refactored (e.g. `_client` renamed, or the transport stored differently), this raises an `AttributeError` at runtime with no static-analysis warning.
The cleaner fix is to add a public `transport` property to `LLMJudge` (mirroring the existing `model` property) and use that here, then the builder becomes `transport = judge.transport`.
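A minimal sketch of that fix, with stand-in attribute layout (the real `LLMJudge`/client internals may differ):

```python
from types import SimpleNamespace

class LLMJudge:
    """Trimmed sketch of LLMJudge: only the pieces relevant to the fix."""

    def __init__(self, client):
        self._client = client  # private: wraps transport + sampling params

    @property
    def model(self) -> str:
        return self._client.model

    @property
    def transport(self):
        # Public accessor mirroring `model`, so the pipeline builder can
        # write `judge.transport` instead of `judge._client.transport`.
        return self._client.transport

# Stand-in client object; attribute names here are assumptions.
client = SimpleNamespace(
    model="example/judge-model",
    transport=SimpleNamespace(base_url="https://example.invalid"),
)
judge = LLMJudge(client)
transport = judge.transport  # no private attribute access
```

Refactoring `_client` later only requires updating the property body, not every caller.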
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/llm/clients/__init__.py
Line: 57-64
Comment:
**Private helpers included in `__all__`**
`_build_rag_prompt` and `_parse_judge_response` are named with a leading underscore (private convention) but appear in `__all__`, making them part of the documented public surface of this package. Users who discover them via tab-completion will import them expecting stability, but they're intended as implementation details.
Consider either removing them from `__all__` (they remain importable by name for internal use), or renaming them without the `_` prefix if backward compatibility truly requires them to be public.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 133
Comment:
**Whitespace normalization removed from embedding credential**
The previous call site used `.strip()` on `self.embedding_api_key` before passing it to the NIM embed function. This PR removes that guard, passing the raw value. A credential set from an environment variable that includes a trailing newline or spaces (a common shell pitfall) will now reach the NIM endpoint unmodified and cause a cryptic authentication failure instead of succeeding silently. Consider restoring `.strip()` here.
Reviews (10): Last reviewed commit: "change judge shape to support retry para..."
7ce7eba to 37a4d53
The eval CLI (cli.py) already aborts when retrieval coverage falls below the configured min_coverage. The graph_pipeline QA path logged coverage but never enforced the threshold, creating inconsistent behavior between the two entry points. Made-with: Cursor
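The enforcement that this commit adds to the QA path can be sketched as follows; the helper name and exact message are assumptions, only the abort-below-threshold behavior is described in the commit.

```python
def enforce_min_coverage(num_hits: int, num_queries: int,
                         min_coverage: float) -> float:
    """Abort (like the eval CLI) when retrieval coverage falls below the
    configured min_coverage, instead of only logging it."""
    coverage = num_hits / num_queries if num_queries else 0.0
    if coverage < min_coverage:
        raise SystemExit(
            f"Retrieval coverage {coverage:.2%} is below min_coverage "
            f"{min_coverage:.2%}; aborting QA sweep."
        )
    return coverage
```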
- build_eval_chain() now raises a clear ValueError when retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
add144b to b04e4ab
Description
E2E in-memory QA mode on graph_pipeline (d9af7e4, 31d4e88, 632cd36) - one command that ingests, builds the page-markdown index in-memory, queries LanceDB, and runs the full QA sweep. Closes the gap that previously forced Ingest -> Export JSON -> Eval as three separate invocations.
Live-RAG SDK on Retriever (b04e4ab) - the much larger commit. Turns Retriever from a raw-hits wrapper into a first-class RAG SDK with .retrieve(), .retrieve_batch(), .answer() (single-query Tier 1+2+3 scoring), and a fluent .pipeline().generate(...).score().judge(...).run(queries) batch builder. Plus a ground-up nemo_retriever/llm/ package (shared types.py, Pydantic LLMInferenceParams + new LLMRemoteClientParams, clients/litellm.py, clients/judge.py, lazy-loaded `__init__.py`) that the eval framework was refactored to import from.
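The fluent batch builder can be sketched as a minimal step-accumulating pattern; all class internals, column names, and fake functions below are stand-ins for illustration, not the real nemo_retriever API (which also has a .judge() tier).

```python
import pandas as pd

class RetrieverPipelineBuilder:
    """Minimal fluent-builder sketch: each chained call appends a
    dataframe-to-dataframe step; run() executes them in order."""

    def __init__(self, retrieve_fn, top_k: int):
        def retrieve_step(df):
            return df.assign(
                context=[retrieve_fn(q, top_k) for q in df["query"]]
            )
        self._steps = [retrieve_step]

    def generate(self, llm_fn):
        self._steps.append(lambda df: df.assign(
            answer=[llm_fn(q, c) for q, c in zip(df["query"], df["context"])]
        ))
        return self

    def score(self):
        self._steps.append(lambda df: df.assign(
            answer_in_context=[a in c
                               for a, c in zip(df["answer"], df["context"])]
        ))
        return self

    def run(self, queries):
        df = pd.DataFrame({"query": list(queries)})
        for step in self._steps:
            df = step(df)
        return df

def fake_retrieve(query, top_k):
    return f"[top-{top_k} chunks about {query}]"

def fake_llm(query, context):
    return query  # echo, so the answer trivially appears in the context

result = (
    RetrieverPipelineBuilder(fake_retrieve, top_k=3)
    .generate(fake_llm)
    .score()
    .run(["what is lancedb"])
)
```

Returning `self` from each step method is what makes the `.generate(...).score().run(...)` chain read like the one-liner in the description.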
Built off of this existing PR #1754
Checklist