leaderboard: per-dataset metric derivation, chip filters + metric toggle on every table #29

Open
radinhamidi wants to merge 1 commit into main from leaderboard/critical-table-revision

@radinhamidi
Member

Summary

Comprehensive table-quality pass across every leaderboard page, driven by issues found in the latest review:

  • No more phantom metric columns. /datasets/[id] used to render whatever eval_metrics the dataset registry listed (MAP on TREC DL, recall_1000 on BEIR), regardless of whether those metrics actually appeared in the data. The per-dataset shard now derives its columns from the actual run rows, the same approach the home matrix uses.
  • Chip filters on every per-X page, not just the home page. /datasets/[id] gets Method/Model/Retriever/Metric; /methods/[id] gets Model/Retriever/Metric; /models/[id] gets Method/Retriever/Metric; /retrievers/[id] gets Method/Model/Metric. Behavior matches the home page (chip → qg-chip-hidden → qg-itable-reapply handshake).
  • Metric toggle everywhere. Per-method/model/retriever pages now expose both primary (nDCG@10) and secondary (R@1k or R@100) per dataset column, swapped via the Metric chip.
  • Pretty labels everywhere. Dataset short labels + METRIC_LABEL (ndcg_cut_10 → nDCG@10, recall_1000 → R@1k, recall_100 → R@100, map → MAP) on /datasets/[id] and the per-X pages too.
  • Drop the ugly inner scrollbar. Home + /datasets/[id] no longer set max-h-[70vh] overflow-y-auto. The page scrolls naturally; sticky top-0 thead sticks to the viewport.
  • /models index renders the display label (gpt-4.1) not the provider-prefixed id (openai/gpt-4.1) — matches the /methods index convention.
  • /runs/[run_id] reproduce snippet rebuilt against the real example pipeline. Pyserini index names for non-lexical paradigms no longer carry a spurious .flat segment (.flat.splade-pp-ed / .flat.bge-base-en-v1.5); trec_eval references the qrels key from the dataset registry, not the topics key.
  • /runs/[run_id] Method field shows the display name (Q2D (FS) etc.) not the raw method_id.
  • /about no longer claims every row ships a .run.txt and queries.tsv; those are optional under the current schema. The documented path now includes the {retriever} segment added in PR #20 (Schema: optional artifacts + DL-HARD dataset entry).
  • Replaces duplicate cell + chip-bar code across 5 pages with two shared components: MatrixCell.astro (link + primary/secondary spans + sort hooks) and FilterChips.astro (groups + metric special-case + reapply event).
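
To illustrate the "no phantom columns" change, here is a minimal sketch of deriving metric columns from the run rows rather than from the registry. It is written in Python for brevity and is not the code in this PR: the row shape (a `metrics` dict per run), the function names, and the sample scores are all assumptions made up for the example. Only the four `METRIC_LABEL` pairs come from the summary above.

```python
# Label pairs as listed in the summary; the dict itself is illustrative.
METRIC_LABEL = {
    "ndcg_cut_10": "nDCG@10",
    "recall_1000": "R@1k",
    "recall_100": "R@100",
    "map": "MAP",
}


def derive_metric_columns(run_rows):
    """Return metric keys that actually carry a value in some run row.

    Keys come back in first-seen order. A metric that only appears in the
    dataset registry, and never in a row, cannot produce a column.
    """
    seen = {}  # dict used as an ordered set
    for row in run_rows:
        for key, value in row.get("metrics", {}).items():
            if value is not None:
                seen.setdefault(key, None)
    return list(seen)


def column_labels(run_rows):
    """Map derived metric keys to display labels, falling back to the raw key."""
    return [METRIC_LABEL.get(key, key) for key in derive_metric_columns(run_rows)]


# Made-up rows (hypothetical scores) for a BEIR-style dataset.
rows = [
    {"run_id": "a", "metrics": {"ndcg_cut_10": 0.71, "recall_100": 0.93}},
    {"run_id": "b", "metrics": {"ndcg_cut_10": 0.68, "recall_100": None}},
]

print(column_labels(rows))
```

Because the columns are computed from `rows`, a registry entry that lists `recall_1000` for this dataset would not add a column unless some run actually reports that metric.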

Test plan

  • python -m pytest reproducibility/tests/ — 44/44 passing
  • pnpm --filter @qg/leaderboard build — clean (1113 pages built)
  • /datasets/beir-v1.0.0-scifact: single metric column with nDCG@10 + R@100 toggle, no recall_1000 phantom column
  • /datasets/msmarco-v1-passage.trecdl2019: no MAP phantom column
  • /models/ index card titles show display labels (gpt-4.1, Qwen2.5-72B-Instruct…)
  • /runs/* Method field shows Q2D (FS) / Q2D (COT) for query2doc variants
  • /runs/* reproduce snippet generates beir-v1.0.0-trec-covid.splade-pp-ed, not .flat.splade-pp-ed
  • Home page produces no max-h-[70vh] wrapper
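
The index-name check above can be sketched as follows. This is an illustrative Python helper, not the snippet generator in this PR: the function name and the `encoder` parameter are hypothetical, and it assumes lexical (BM25) indexes use a `.flat` suffix while other paradigms use the encoder name alone.

```python
from typing import Optional


def pyserini_index_name(dataset_id: str, encoder: Optional[str] = None) -> str:
    """Build a prebuilt-index name for a dataset.

    Assumption: lexical retrieval uses "<dataset>.flat"; learned sparse or
    dense retrieval uses "<dataset>.<encoder>" with no ".flat" segment.
    """
    if encoder is None:
        return f"{dataset_id}.flat"
    return f"{dataset_id}.{encoder}"


print(pyserini_index_name("beir-v1.0.0-trec-covid", "splade-pp-ed"))
```

Under these assumptions, the call prints `beir-v1.0.0-trec-covid.splade-pp-ed`, which is the name the test plan expects.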

🤖 Generated with Claude Code

…ric toggle on every table

- per-dataset shard reads metrics from actual runs (no MAP/recall_1000 phantom columns)
- shared FilterChips + MatrixCell components reused across home / dataset / method / model / retriever pages
- every per-X table gets chip filters (method/model/retriever/metric as applicable) + metric toggle
- pretty metric labels (nDCG@10, R@1k, R@100, MAP) everywhere
- drop double scrollbar on home + per-dataset tables
- /models index renders display label, not provider-prefixed id
- /runs page shows method display name; reproduce snippet aligned to example pipeline with correct Pyserini index names and qrels-based trec_eval
- /about page no longer claims run.txt/queries.tsv are guaranteed; path includes retriever segment

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>