Add support for end to end in-memory QA evaluation #1865
jperez999 merged 10 commits into NVIDIA:main from
Conversation
Greptile Summary
This PR adds two major features: (1) an end-to-end in-memory QA evaluation mode on graph_pipeline and (2) a live-RAG SDK on Retriever. Most of the issues raised in the previous review round have been addressed: the QA-sweep exit code bug is fixed, and the heterogeneous-judge warning is promoted to a ValueError.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/retriever.py | Adds retrieve(), retrieve_batch(), answer(), and pipeline() live-RAG methods plus RetrieverPipelineBuilder; per-call top_k threading is thread-safe, but the judge builder accesses a private attribute and the embedding credential normalization was silently weakened. |
| nemo_retriever/src/nemo_retriever/llm/clients/litellm.py | New LiteLLMClient with structured transport/sampling split; default temperature now correctly set to 0.0 in the structured constructor, resolving the prior mismatch. |
| nemo_retriever/src/nemo_retriever/llm/clients/judge.py | LLMJudge refactored from evaluation/judges.py with structured params split; missing public transport property forces callers to reach into _client. |
| nemo_retriever/src/nemo_retriever/llm/clients/__init__.py | Package re-export for backward compatibility after the module-to-package refactor; private helpers _build_rag_prompt and _parse_judge_response are included in __all__ unnecessarily. |
| nemo_retriever/src/nemo_retriever/evaluation/retrievers.py | _initialize_index refactored as single source of truth for both file and in-memory construction paths; from_lancedb() and _from_dict() factory methods added with comprehensive tests. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Distinct-judge warning promoted to ValueError; build_eval_chain gains lancedb guard to match build_eval_pipeline; num_retries threaded through to judge constructor. |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Large-column filtering now applied on the dataframe= path via _MARKDOWN_PARQUET_COLUMNS allow-list, matching the parquet_dir path and resolving the prior memory spike concern. |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | End-to-end QA mode added; exit code bug fixed via qa_failed flag + typer.Exit(code=1); eval_config=None guard added before load_eval_config. |
| nemo_retriever/tests/test_live_rag.py | 645-line test file covering Retriever.retrieve, retrieve_batch, answer, pipeline builder, and RetrieverPipelineBuilder with mocked LanceDB and LLM calls. |
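The allow-list filtering described for io/markdown.py can be sketched as follows. This is a minimal illustration of the idea, not the repo's code; the column names in `_MARKDOWN_PARQUET_COLUMNS` and the helper name are assumptions.

```python
import pandas as pd

# Hypothetical allow-list mirroring the _MARKDOWN_PARQUET_COLUMNS idea:
# only columns needed for markdown rendering survive, so large payload
# columns never inflate memory on the dataframe= path.
_MARKDOWN_PARQUET_COLUMNS = ["doc_id", "page_number", "markdown"]

def filter_markdown_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only allow-listed columns that are actually present."""
    keep = [c for c in _MARKDOWN_PARQUET_COLUMNS if c in df.columns]
    return df[keep]

df = pd.DataFrame({
    "doc_id": ["a"],
    "page_number": [1],
    "markdown": ["# Title"],
    "raw_bytes": [b"\x00" * 8],  # large payload column that gets dropped
})
filtered = filter_markdown_columns(df)
```

Intersecting the allow-list with `df.columns` keeps the helper safe on dataframes that carry only a subset of the expected columns.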
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User
    participant R as Retriever
    participant RPB as RetrieverPipelineBuilder
    participant LRO as LiveRetrievalOperator
    participant LDB as LanceDB
    participant LLM as LiteLLMClient
    participant J as LLMJudge
    U->>R: retrieve(query, top_k)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R-->>U: RetrievalResult
    U->>R: answer(query, llm, judge, reference)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R->>LLM: generate(query, chunks)
    LLM-->>R: GenerationResult
    par Tier 1+2 scoring
        R->>R: token_f1 / answer_in_context
    and Tier 3 judging
        R->>J: judge(query, reference, answer)
        J-->>R: JudgeResult
    end
    R-->>U: AnswerResult
    U->>R: pipeline(top_k)
    R-->>RPB: RetrieverPipelineBuilder
    U->>RPB: .generate(llm).score().judge(judge)
    U->>RPB: .run(queries, reference)
    RPB->>LRO: run(df)
    LRO->>LDB: batch vector search
    LDB-->>LRO: hits[]
    LRO-->>RPB: df with context
    RPB->>LLM: QAGenerationOperator
    LLM-->>RPB: answers
    RPB->>J: JudgingOperator
    J-->>RPB: scores
    RPB-->>U: pd.DataFrame
```
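The Tier 1 `token_f1` score in the diagram is a standard lexical metric; a minimal sketch (SQuAD-style token F1, not necessarily the repo's exact implementation) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer:
    precision/recall over the multiset of lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # both empty counts as a match; one-sided empty is a miss
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because it is purely lexical it runs in-process with no model call, which is why it sits in the cheap Tier 1+2 branch while the LLM judge runs in parallel as Tier 3.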
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 573
Comment:
**Private attribute access creates hidden coupling**
`judge._client.transport` reaches into `LLMJudge`'s private implementation. If `LLMJudge` is ever refactored (e.g. `_client` renamed, or the transport stored differently), this raises an `AttributeError` at runtime with no static-analysis warning.
The cleaner fix is to add a public `transport` property to `LLMJudge` (mirroring the existing `model` property) and use that here, then the builder becomes `transport = judge.transport`.
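A minimal sketch of that fix, with stand-in attribute layout (the real `LLMJudge`/client internals may differ):

```python
from types import SimpleNamespace

class LLMJudge:
    """Trimmed sketch of LLMJudge: only the pieces relevant to the fix."""

    def __init__(self, client):
        self._client = client  # private: wraps transport + sampling params

    @property
    def model(self) -> str:
        return self._client.model

    @property
    def transport(self):
        # Public accessor mirroring `model`, so the pipeline builder can
        # write `judge.transport` instead of `judge._client.transport`.
        return self._client.transport

# Stand-in client object; attribute names here are assumptions.
client = SimpleNamespace(
    model="example/judge-model",
    transport=SimpleNamespace(base_url="https://example.invalid"),
)
judge = LLMJudge(client)
transport = judge.transport  # no private attribute access
```

Refactoring `_client` later only requires updating the property body, not every caller.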
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/llm/clients/__init__.py
Line: 57-64
Comment:
**Private helpers included in `__all__`**
`_build_rag_prompt` and `_parse_judge_response` are named with a leading underscore (private convention) but appear in `__all__`, making them part of the documented public surface of this package. Users who discover them via tab-completion will import them expecting stability, but they're intended as implementation details.
Consider either removing them from `__all__` (they remain importable by name for internal use), or renaming them without the `_` prefix if backward compatibility truly requires them to be public.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 133
Comment:
**Whitespace normalization removed from embedding credential**
The previous call site used `.strip()` on `self.embedding_api_key` before passing it to the NIM embed function. This PR removes that guard, passing the raw value. A credential set from an environment variable that includes a trailing newline or spaces (a common shell pitfall) will now reach the NIM endpoint unmodified and cause a cryptic authentication failure instead of succeeding silently. Consider restoring `.strip()` here.
Reviews (10): Last reviewed commit: "change judge shape to support retry para..."
7ce7eba to 37a4d53
The eval CLI (cli.py) already aborts when retrieval coverage falls below the configured min_coverage. The graph_pipeline QA path logged coverage but never enforced the threshold, creating inconsistent behavior between the two entry points. Made-with: Cursor
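The enforcement that this commit adds to the QA path can be sketched as follows; the helper name and exact message are assumptions, only the abort-below-threshold behavior is described in the commit.

```python
def enforce_min_coverage(num_hits: int, num_queries: int,
                         min_coverage: float) -> float:
    """Abort (like the eval CLI) when retrieval coverage falls below the
    configured min_coverage, instead of only logging it."""
    coverage = num_hits / num_queries if num_queries else 0.0
    if coverage < min_coverage:
        raise SystemExit(
            f"Retrieval coverage {coverage:.2%} is below min_coverage "
            f"{min_coverage:.2%}; aborting QA sweep."
        )
    return coverage
```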
- build_eval_chain() now raises a clear ValueError when retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
add144b to b04e4ab
Description
E2E in-memory QA mode on graph_pipeline (d9af7e4, 31d4e88, 632cd36) - one command that ingests, builds the page-markdown index in-memory, queries LanceDB, and runs the full QA sweep. Closes the gap that previously forced Ingest -> Export JSON -> Eval as three separate invocations.
Live-RAG SDK on Retriever (b04e4ab) - the much larger commit. Turns Retriever from a raw-hits wrapper into a first-class RAG SDK with .retrieve(), .retrieve_batch(), .answer() (single-query Tier 1+2+3 scoring), and a fluent .pipeline().generate(...).score().judge(...).run(queries) batch builder. Plus a ground-up nemo_retriever/llm/ package (shared types.py, Pydantic LLMInferenceParams + new LLMRemoteClientParams, clients/litellm.py, clients/judge.py, lazy-loaded `__init__.py`) that the eval framework was refactored to import from.
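The fluent batch builder can be sketched as a minimal step-accumulating pattern; all class internals, column names, and fake functions below are stand-ins for illustration, not the real nemo_retriever API (which also has a .judge() tier).

```python
import pandas as pd

class RetrieverPipelineBuilder:
    """Minimal fluent-builder sketch: each chained call appends a
    dataframe-to-dataframe step; run() executes them in order."""

    def __init__(self, retrieve_fn, top_k: int):
        def retrieve_step(df):
            return df.assign(
                context=[retrieve_fn(q, top_k) for q in df["query"]]
            )
        self._steps = [retrieve_step]

    def generate(self, llm_fn):
        self._steps.append(lambda df: df.assign(
            answer=[llm_fn(q, c) for q, c in zip(df["query"], df["context"])]
        ))
        return self

    def score(self):
        self._steps.append(lambda df: df.assign(
            answer_in_context=[a in c
                               for a, c in zip(df["answer"], df["context"])]
        ))
        return self

    def run(self, queries):
        df = pd.DataFrame({"query": list(queries)})
        for step in self._steps:
            df = step(df)
        return df

def fake_retrieve(query, top_k):
    return f"[top-{top_k} chunks about {query}]"

def fake_llm(query, context):
    return query  # echo, so the answer trivially appears in the context

result = (
    RetrieverPipelineBuilder(fake_retrieve, top_k=3)
    .generate(fake_llm)
    .score()
    .run(["what is lancedb"])
)
```

Returning `self` from each step method is what makes the `.generate(...).score().run(...)` chain read like the one-liner in the description.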
Built off of this existing PR #1754
Checklist