Create the golden evaluation dataset #2

@AJaccP

Description

🧠 Context

We have no objective way to measure answer quality - right now improvements to scraping, retrieval, and prompts are judged by eyeballing. A small, hand-curated golden dataset of question → ideal answer → expected source URLs fixes that: later eval tickets (retrieval@k, answer-faithfulness) will score the bot against it. This ticket creates that dataset.


🗂 Schema

evals/golden.yaml is a list of entries; each entry has exactly these three fields:

- question: "How do I get into a COMP course that's full?"
  expected_answer: "A concise, factual answer grounded only in the source page(s) below."
  expected_sources:
    - "https://ccss.carleton.ca/resources/faqs/some-faq/"
    - "https://..."
    
- question: "..."
  expected_answer: "..."
  expected_sources:
    - "https://..."
  • question (string): what a student would actually ask.
  • expected_answer (string): the ideal grounded answer, concise and in the bot's style.
  • expected_sources (list of strings): the URL(s) of the page(s) that actually contain the answer.
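
The three-field schema above could be checked mechanically with something like the following. This is only a minimal sketch, not something the ticket requires: `validate_entries` and `load_golden` are hypothetical names, and loading assumes the third-party PyYAML package is installed.

```python
from typing import Any

REQUIRED_FIELDS = {"question", "expected_answer", "expected_sources"}


def validate_entries(entries: Any) -> list[str]:
    """Return human-readable schema problems (an empty list means none)."""
    if not isinstance(entries, list):
        return ["top level must be a list of entries"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be a mapping")
            continue
        keys = set(entry)
        if keys != REQUIRED_FIELDS:
            # Field names are a contract, so report both gaps and extras.
            missing = sorted(map(str, REQUIRED_FIELDS - keys))
            extra = sorted(map(str, keys - REQUIRED_FIELDS))
            problems.append(f"entry {i}: missing={missing} extra={extra}")
            continue
        for field in ("question", "expected_answer"):
            value = entry[field]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: {field} must be a non-empty string")
        sources = entry["expected_sources"]
        if not isinstance(sources, list) or not sources:
            problems.append(f"entry {i}: expected_sources must be a non-empty list")
        elif not all(isinstance(s, str) for s in sources):
            problems.append(f"entry {i}: expected_sources must contain only strings")
    return problems


def load_golden(path: str = "evals/golden.yaml") -> Any:
    """Parse the YAML file; assumes PyYAML is available."""
    import yaml  # third-party: pip install pyyaml

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)
```

Calling `validate_entries(load_golden())` would then return an empty list for a file that matches the schema.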

🛠 How to build it

  1. Create evals/golden.yaml.
  2. Start from the CCSS FAQ collection: https://ccss.carleton.ca/resources/#faqs-heading. Each useful FAQ becomes an entry: the FAQ's question → question, its answer (reworded concisely) → expected_answer, and the page URL plus any sources it links → expected_sources.
    Feel free to curate other entries outside of CCSS FAQs as well.
  3. Aim for ~15–20 entries to start (the set can grow later). Spread them across topics - registration, course information, co-op, etc.
  4. Deliberately include several questions whose answers live in FAQ accordion content, such as the FAQs at the end of this page: https://carleton.ca/scs/current-students/bachelor-of-cybersecurity/bcyber-courses-and-registration/. These are the cases that measure scraper quality.
  5. Prefer expected_sources that are in-scope Carleton CS pages, ideally ones already in data/webpages/list.json or slated for it.
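
While curating, step 5 can be spot-checked with a small helper like the one below. It is a rough sketch under two loud assumptions: that any host on carleton.ca is a reasonable proxy for "in-scope" (the real definition is a project decision), and that the caller has already extracted the known URLs from data/webpages/list.json, whose format is not described here. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def is_carleton_url(url: str) -> bool:
    """True for http(s) URLs on carleton.ca or one of its subdomains."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == "carleton.ca" or host.endswith(".carleton.ca")
    )


def sources_outside_list(entries: list[dict], known_urls: set[str]) -> list[str]:
    """Return cited URLs absent from known_urls, ignoring trailing slashes."""
    known = {u.rstrip("/") for u in known_urls}
    missing = []
    for entry in entries:
        for url in entry.get("expected_sources", []):
            # Report each unknown URL once, in first-seen order.
            if url.rstrip("/") not in known and url not in missing:
                missing.append(url)
    return missing
```

URLs returned by `sources_outside_list` are not necessarily wrong; they may simply be pages that are slated for the list but not yet added.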

Notes

  • This is a cross-component contract: keep the field names exactly as above; later eval scripts parse them. If you think a field needs adding or renaming, check with Jacc first (same rule as the shared domain types).
  • Answers must be grounded in the cited page, not invented or pulled from general knowledge; the whole point is to test grounded retrieval.
  • expected_sources must be real, resolving URLs.
  • No code or tests to run.
  • The eval scripts that consume this file are separate, later tickets; this ticket just produces the data.

✅ Acceptance Criteria

  • evals/golden.yaml exists and is valid YAML.
  • ~15–20 entries, each with question, expected_answer, and a non-empty expected_sources list, using exactly those field names.
  • Several entries target FAQ/accordion content.
  • Every expected_sources URL resolves and is an in-scope Carleton CS page.
  • Answers are concise and grounded in their cited sources (not general knowledge).
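
The ticket itself requires no code, but a reviewer who wants to verify the mechanical criteria above (entry count, non-empty sources, resolving URLs) could use a sketch along these lines. The names `url_resolves` and `check_acceptance` are hypothetical, "~15–20" is read as a floor of 15 because the set is allowed to grow, and "resolves" is assumed to mean a GET that ends in a 2xx status after redirects. Groundedness and conciseness still need a human read.

```python
import http.client
import urllib.request
from typing import Callable


def url_resolves(url: str, timeout: float = 10.0) -> bool:
    """True if a GET request for url completes with a 2xx status."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": "golden-check"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS/connection failures, timeouts, malformed URLs.
        return False


def check_acceptance(
    entries: list[dict],
    resolves: Callable[[str], bool] = url_resolves,
    min_entries: int = 15,
) -> list[str]:
    """Return problems against the mechanically checkable criteria."""
    problems = []
    if len(entries) < min_entries:
        problems.append(f"only {len(entries)} entries, expected at least {min_entries}")
    cache: dict[str, bool] = {}  # avoid fetching the same URL twice
    for i, entry in enumerate(entries):
        sources = entry.get("expected_sources") or []
        if not sources:
            problems.append(f"entry {i}: expected_sources is empty")
        for url in sources:
            if url not in cache:
                cache[url] = resolves(url)
            if not cache[url]:
                problems.append(f"entry {i}: {url} does not resolve")
    return problems
```

Passing a custom `resolves` callable makes it possible to try the check offline or to swap in a stricter definition of "resolves".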
