
Add support for end to end in-memory QA evaluation#1865

Merged
jperez999 merged 10 commits into NVIDIA:main from KyleZheng1284:full-e2e-eval
Apr 22, 2026

Conversation

Member

@KyleZheng1284 KyleZheng1284 commented Apr 15, 2026

Description

  • E2E in-memory QA mode on graph_pipeline (d9af7e4, 31d4e88, 632cd36) - one command that ingests, builds the page-markdown index in-memory, queries LanceDB, and runs the full QA sweep. Closes the gap that previously forced Ingest -> Export JSON -> Eval as three separate invocations.

  • Live-RAG SDK on Retriever (b04e4ab) - the much larger commit. Turns Retriever from a raw-hits wrapper into a first-class RAG SDK with .retrieve(), .retrieve_batch(), .answer() (single-query Tier 1+2+3 scoring), and a fluent .pipeline().generate(...).score().judge(...).run(queries) batch builder. Also adds a ground-up nemo_retriever/llm/ package (shared types.py, Pydantic LLMInferenceParams plus the new LLMRemoteClientParams, clients/litellm.py, clients/judge.py, and a lazy-loaded `__init__.py`) that the eval framework was refactored to import from.

Builds on the existing PR #1754.
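The fluent `.pipeline().generate(...).score().judge(...).run(queries)` chain described above is a classic builder pattern: each stage method returns the builder itself so calls can be chained, and `run()` executes the accumulated plan. The sketch below mimics that shape with invented stand-in internals; it is not the real nemo_retriever implementation.

```python
# Illustrative sketch of the fluent batch-builder pattern from the PR
# description. Method names mirror the PR; internals are stand-ins.
class PipelineBuilder:
    """Accumulates stages, then runs them in order over the queries."""

    def __init__(self):
        self._stages = []

    def generate(self, llm):
        self._stages.append(("generate", llm))
        return self  # returning self is what makes the chain fluent

    def score(self):
        self._stages.append(("score", None))
        return self

    def judge(self, judge):
        self._stages.append(("judge", judge))
        return self

    def run(self, queries):
        # A real implementation would execute retrieval plus each stage;
        # here we just report the configured plan per query.
        plan = [name for name, _ in self._stages]
        return [{"query": q, "stages": plan} for q in queries]


rows = PipelineBuilder().generate("llm").score().judge("judge").run(["q1", "q2"])
print(rows[0]["stages"])  # ['generate', 'score', 'judge']
```

Returning `self` from each stage method is the whole trick: it lets callers compose an arbitrary subset of stages in one expression while keeping `run()` as the single execution point.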

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, ensure those changes are mirrored in the Helm values.yaml file.

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners April 15, 2026 23:48

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR adds two major features: (1) an end-to-end in-memory QA evaluation mode on graph_pipeline.py that ingests, builds a LanceDB index, and runs the full evaluation sweep in a single invocation; and (2) a Live-RAG SDK surface on Retriever (.retrieve(), .retrieve_batch(), .answer(), .pipeline()) backed by a new nemo_retriever/llm/ package with shared types, Pydantic params, and concrete LiteLLMClient / LLMJudge clients.

Most of the issues raised in the previous review round have been addressed: the QA-sweep exit code bug is fixed, the heterogeneous-judge warning is promoted to a ValueError, build_eval_chain gains the missing lancedb guard, _from_dict is fully tested, self.top_k mutation is eliminated from the concurrency path, and the large-column filtering is now applied on the dataframe= path in io/markdown.py. Remaining findings are P2 style items.

Confidence Score: 4/5

Safe to merge after addressing the private-attribute coupling in RetrieverPipelineBuilder.judge(); all other findings are style-level suggestions.

All previously reported P1 issues have been resolved in this revision. The three remaining findings are P2: accessing judge._client.transport from RetrieverPipelineBuilder (brittle but not currently broken), private helpers in `__all__`, and the removed .strip() on the embedding credential. None blocks correctness today, but the first will silently break if LLMJudge is refactored.

nemo_retriever/src/nemo_retriever/retriever.py (RetrieverPipelineBuilder.judge accesses judge._client), nemo_retriever/src/nemo_retriever/llm/clients/__init__.py (`__all__` includes private helpers)

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/retriever.py | Adds retrieve(), retrieve_batch(), answer(), and pipeline() live-RAG methods plus RetrieverPipelineBuilder; thread-safe top_k threading is correct, but the judge builder accesses a private attribute and the embedding-credential normalization was silently weakened. |
| nemo_retriever/src/nemo_retriever/llm/clients/litellm.py | New LiteLLMClient with a structured transport/sampling split; default temperature now correctly set to 0.0 in the structured constructor, resolving the prior mismatch. |
| nemo_retriever/src/nemo_retriever/llm/clients/judge.py | LLMJudge refactored from evaluation/judges.py with a structured params split; a missing public transport property forces callers to reach into _client. |
| nemo_retriever/src/nemo_retriever/llm/clients/__init__.py | Package re-export for backward compat after the module-to-package refactor; private helpers _build_rag_prompt and _parse_judge_response are included in `__all__` unnecessarily. |
| nemo_retriever/src/nemo_retriever/evaluation/retrievers.py | _initialize_index refactored as the single source of truth for both the file and in-memory construction paths; from_lancedb() and _from_dict() factory methods added with comprehensive tests. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Distinct-judge warning promoted to ValueError; build_eval_chain gains a lancedb guard to match build_eval_pipeline; num_retries threaded through to the judge constructor. |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Large-column filtering now applied on the dataframe= path via the _MARKDOWN_PARQUET_COLUMNS allow-list, matching the parquet_dir path and resolving the prior memory-spike concern. |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | End-to-end QA mode added; exit-code bug fixed via a qa_failed flag + typer.Exit(code=1); eval_config=None guard added before load_eval_config. |
| nemo_retriever/tests/test_live_rag.py | 645-line test file covering Retriever.retrieve, retrieve_batch, answer, the pipeline builder, and RetrieverPipelineBuilder with mocked LanceDB and LLM calls. |
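The mocked-LanceDB approach noted for test_live_rag.py can be illustrated with a minimal sketch. The `retrieve` helper and hit shape below are invented stand-ins, not the actual test code, though `table.search(...).limit(...).to_list()` is the standard LanceDB query-builder chain.

```python
# Sketch of testing a retrieval wrapper against a mocked LanceDB table.
from unittest.mock import MagicMock

def retrieve(table, query_vector, top_k):
    # Stand-in for a Retriever.retrieve-style method: delegate to the
    # LanceDB table, then shape raw hits into (text, distance) pairs.
    hits = table.search(query_vector).limit(top_k).to_list()
    return [(h["text"], h["_distance"]) for h in hits]

# MagicMock lets us stub the whole fluent query chain in one line.
table = MagicMock()
table.search.return_value.limit.return_value.to_list.return_value = [
    {"text": "chunk-a", "_distance": 0.12},
    {"text": "chunk-b", "_distance": 0.34},
]

results = retrieve(table, [0.1, 0.2], top_k=2)
assert results == [("chunk-a", 0.12), ("chunk-b", 0.34)]
table.search.return_value.limit.assert_called_once_with(2)
```

Mocking at the table boundary keeps the tests fast and hermetic: no index is built and no embedding service is called, yet the result-shaping logic is still exercised.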

Sequence Diagram

sequenceDiagram
    participant U as User
    participant R as Retriever
    participant RPB as RetrieverPipelineBuilder
    participant LRO as LiveRetrievalOperator
    participant LDB as LanceDB
    participant LLM as LiteLLMClient
    participant J as LLMJudge

    U->>R: retrieve(query, top_k)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R-->>U: RetrievalResult

    U->>R: answer(query, llm, judge, reference)
    R->>LDB: vector search
    LDB-->>R: hits[]
    R->>LLM: generate(query, chunks)
    LLM-->>R: GenerationResult
    par Tier 1+2 scoring
        R->>R: token_f1 / answer_in_context
    and Tier 3 judging
        R->>J: judge(query, reference, answer)
        J-->>R: JudgeResult
    end
    R-->>U: AnswerResult

    U->>R: pipeline(top_k)
    R-->>RPB: RetrieverPipelineBuilder
    U->>RPB: .generate(llm).score().judge(judge)
    U->>RPB: .run(queries, reference)
    RPB->>LRO: run(df)
    LRO->>LDB: batch vector search
    LDB-->>LRO: hits[]
    LRO-->>RPB: df with context
    RPB->>LLM: QAGenerationOperator
    LLM-->>RPB: answers
    RPB->>J: JudgingOperator
    J-->>RPB: scores
    RPB-->>U: pd.DataFrame
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 573

Comment:
**Private attribute access creates hidden coupling**

`judge._client.transport` reaches into `LLMJudge`'s private implementation. If `LLMJudge` is ever refactored (e.g. `_client` renamed, or the transport stored differently), this raises an `AttributeError` at runtime with no static-analysis warning.

The cleaner fix is to add a public `transport` property to `LLMJudge` (mirroring the existing `model` property) and use that here, then the builder becomes `transport = judge.transport`.
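The suggested fix can be sketched as follows. Class internals here are illustrative stand-ins that only mirror the attribute names mentioned in the comment; the real LLMJudge surface may differ.

```python
# Sketch of exposing a public ``transport`` property on LLMJudge so the
# builder no longer reaches into the private ``_client``.
class _Client:
    def __init__(self, transport):
        self.transport = transport

class LLMJudge:
    def __init__(self, transport):
        self._client = _Client(transport)

    @property
    def transport(self):
        # Public accessor mirroring the existing ``model`` property;
        # insulates callers from the private ``_client`` layout.
        return self._client.transport

judge = LLMJudge(transport="https://judge.example/v1")
# The builder can now write ``judge.transport`` instead of
# ``judge._client.transport``:
assert judge.transport == "https://judge.example/v1"
```

If `_client` is ever renamed or the transport stored differently, only the property body changes; every caller keeps working.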


---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/llm/clients/__init__.py
Line: 57-64

Comment:
**Private helpers included in `__all__`**

`_build_rag_prompt` and `_parse_judge_response` are named with a leading underscore (private convention) but appear in `__all__`, making them part of the documented public surface of this package. Users who discover them via tab-completion will import them expecting stability, but they're intended as implementation details.

Consider either removing them from `__all__` (they remain importable by name for internal use), or renaming them without the `_` prefix if backward compatibility truly requires them to be public.


---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/retriever.py
Line: 133

Comment:
**Whitespace normalization removed from embedding credential**

The previous call site used `.strip()` on `self.embedding_api_key` before passing it to the NIM embed function. This PR removes that guard, passing the raw value. A credential set from an environment variable that includes a trailing newline or spaces (a common shell pitfall) will now reach the NIM endpoint unmodified and cause a cryptic authentication failure instead of succeeding silently. Consider restoring `.strip()` here.


Reviews (10): last reviewed commit "change judge shape to support retry para..."

Comment thread nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/io/markdown.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Kyle Zheng and others added 4 commits April 20, 2026 17:20
The eval CLI (cli.py) already aborts when retrieval coverage falls
below the configured min_coverage.  The graph_pipeline QA path
logged coverage but never enforced the threshold, creating
inconsistent behavior between the two entry points.

Made-with: Cursor
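The enforcement this commit describes can be sketched as a small helper; the names (`min_coverage`, the boolean failure signal) follow the surrounding discussion, while the function itself is illustrative rather than the actual graph_pipeline code.

```python
# Sketch of enforcing the retrieval-coverage threshold in the QA path,
# matching the behavior the eval CLI already had: log the coverage AND
# fail when it falls below min_coverage, instead of only logging it.
def check_coverage(hit_count: int, total_queries: int, min_coverage: float) -> bool:
    coverage = hit_count / total_queries if total_queries else 0.0
    print(f"retrieval coverage: {coverage:.1%} (threshold {min_coverage:.1%})")
    # Return a failure signal the caller can turn into a nonzero exit.
    return coverage >= min_coverage

assert check_coverage(9, 10, 0.8) is True
assert check_coverage(5, 10, 0.8) is False
```

Having both entry points share one check like this is what removes the inconsistency the commit message describes.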
- build_eval_chain() now raises a clear ValueError when
  retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence
  before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
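The guard described in this commit can be sketched as follows; only the "retrieval.type must be 'file'" contract comes from the commit message, while the function signature and stage list are illustrative placeholders.

```python
# Sketch of the build_eval_chain() guard: reject any retrieval.type
# other than 'file' with a clear error, matching build_eval_pipeline(),
# so a misconfigured eval fails fast instead of deep in the run.
def build_eval_chain(retrieval_type: str) -> list[str]:
    if retrieval_type != "file":
        raise ValueError(
            "build_eval_chain only supports retrieval.type='file', "
            f"got {retrieval_type!r}"
        )
    return ["load", "retrieve", "score"]  # placeholder stage list

try:
    build_eval_chain("lancedb")
except ValueError as exc:
    print(exc)  # clear, immediate error instead of a confusing downstream failure
```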
Comment thread nemo_retriever/src/nemo_retriever/retriever.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/retriever.py
Comment thread nemo_retriever/src/nemo_retriever/llm/clients/litellm.py Outdated
Collaborator

@jperez999 jperez999 left a comment


I dont

Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Comment thread nemo_retriever/src/nemo_retriever/retriever.py
Comment thread nemo_retriever/src/nemo_retriever/retriever.py
Comment thread nemo_retriever/src/nemo_retriever/retriever.py
Comment thread nemo_retriever/src/nemo_retriever/retriever.py
@jperez999 jperez999 merged commit 26091df into NVIDIA:main Apr 22, 2026
5 checks passed
