Create the golden evaluation dataset #2

## 🧠 Context

We have no objective way to measure answer quality. Right now, improvements to scraping, retrieval, and prompts are judged by eyeballing the output. A small, hand-curated **golden dataset** of `question → ideal answer → expected source URLs` fixes that: later eval tickets (retrieval@k, answer-faithfulness) will score the bot against it. This ticket creates that dataset.

---

## 🗂 Schema

`evals/golden.yaml` is a list of entries; each entry has exactly these three fields:

```yaml
- question: "How do I get into a COMP course that's full?"
  expected_answer: "A concise, factual answer grounded only in the source page(s) below."
  expected_sources:
    - "https://ccss.carleton.ca/resources/faqs/some-faq/"
    - "https://..."
    
- question: "..."
  expected_answer: "..."
  expected_sources:
    - "https://..."
```

* **`question`** (string) — what a student would actually ask.
* **`expected_answer`** (string) — the ideal grounded answer, concise and in the bot's style.
* **`expected_sources`** (list of strings) — the URL(s) of the page(s) that actually contain the answer.
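
For a quick local sanity check of the schema, a minimal Python sketch follows. It is optional and not part of this ticket's deliverable. The function name `validate_entries` is hypothetical, and it works on already-parsed entries, so loading the file (for example with `yaml.safe_load`, assuming PyYAML is installed) is left to the caller.

```python
# Hypothetical helper, not an existing project script.
REQUIRED_FIELDS = {"question", "expected_answer", "expected_sources"}


def validate_entries(entries):
    """Return a list of problems; an empty list means the schema holds."""
    if not isinstance(entries, list):
        return ["top level must be a list of entries"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be a mapping")
            continue

        # Field names are a contract: no missing fields and no extras.
        keys = set(entry)
        if keys != REQUIRED_FIELDS:
            missing = sorted(map(str, REQUIRED_FIELDS - keys))
            extra = sorted(map(str, keys - REQUIRED_FIELDS))
            problems.append(f"entry {i}: missing={missing} extra={extra}")
            continue

        for field in ("question", "expected_answer"):
            value = entry[field]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: {field} must be a non-empty string")

        sources = entry["expected_sources"]
        if not isinstance(sources, list) or not sources:
            problems.append(f"entry {i}: expected_sources must be a non-empty list")
        elif not all(isinstance(s, str) and s.strip() for s in sources):
            problems.append(f"entry {i}: every source must be a non-empty string")
    return problems


# Usage (assumes PyYAML is installed):
#   import yaml
#   with open("evals/golden.yaml", encoding="utf-8") as f:
#       print(validate_entries(yaml.safe_load(f)))
```

An empty result only means the structure matches; it says nothing about whether the answers are grounded or the URLs resolve.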

---

## 🛠 How to build it

1. Create `evals/golden.yaml`.
2. **Start from the CCSS FAQ collection:** https://ccss.carleton.ca/resources/#faqs-heading. Each useful FAQ becomes an entry: the FAQ's question → `question`, its answer (reworded concisely) → `expected_answer`, and the page URL plus any sources it links → `expected_sources`. Feel free to curate entries from outside the CCSS FAQs as well.
3. Aim for **~15–20 entries** to start (the set can grow later). Spread them across topics - registration, course information, co-op, etc.
4. **Deliberately include several questions whose answers live in FAQ accordion content**, such as the FAQs at the end of this page: https://carleton.ca/scs/current-students/bachelor-of-cybersecurity/bcyber-courses-and-registration/. These are the cases that measure scraper quality.
5. Prefer `expected_sources` that are **in-scope Carleton CS pages** — ideally ones already in `data/webpages/list.json` or slated for it.

**Notes**

* **This is a cross-component contract** — keep the field names exactly as above; later eval scripts parse them. If you think a field needs adding or renaming, check with Jacc first (same rule as the shared domain types).
* Answers must be **grounded in the cited page**, not invented or pulled from general knowledge — the whole point is to test grounded retrieval.
* `expected_sources` must be real, resolving URLs.
* No code or tests to run. 
* The eval scripts that *consume* this file are separate, later tickets; this ticket just produces the data.

---

## ✅ Acceptance Criteria

* `evals/golden.yaml` exists and is valid YAML.
* ~15–20 entries, each with `question`, `expected_answer`, and a non-empty `expected_sources` list, using exactly those field names.
* Several entries target FAQ/accordion content.
* Every `expected_sources` URL resolves and is an in-scope Carleton CS page.
* Answers are concise and grounded in their cited sources (not general knowledge).
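
As an optional aid for the URL criterion, the sketch below flags `expected_sources` that are clearly out of scope. It rests on an assumption: "in scope" is approximated as "an http(s) URL whose host is `carleton.ca` or one of its subdomains", which is looser than the real scope defined by `data/webpages/list.json`. The names `is_carleton_url` and `suspicious_sources` are hypothetical, and the check does not test whether a URL actually resolves, since that needs network access.

```python
from urllib.parse import urlsplit

# Assumption: a rough stand-in for "in-scope Carleton CS page".
ALLOWED_DOMAIN = "carleton.ca"


def is_carleton_url(url):
    """True if url is http(s) and its host is carleton.ca or a subdomain of it."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:  # malformed URL, e.g. an unbalanced IPv6 bracket
        return False
    host = (parts.hostname or "").lower()
    in_domain = host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN)
    return parts.scheme in ("http", "https") and in_domain


def suspicious_sources(entries):
    """Return (entry index, url) pairs for sources that fail the domain check."""
    return [
        (i, url)
        for i, entry in enumerate(entries)
        for url in (entry.get("expected_sources") or [])
        if not is_carleton_url(url)
    ]
```

Leftover placeholders such as `https://...` are flagged too, which makes this a cheap way to catch entries that were never filled in. A clean result still needs a manual pass to confirm each page resolves and contains the answer.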
