Expand the curated URL list

## 🧠 Context

The knowledge base is only as good as the pages we feed it. `data/webpages/list.json` currently has only a handful of URLs. This ticket expands it with vetted Carleton CS pages that are useful to incoming and current students.

### Important design point - the scraper does *not* crawl

By design, the scraper only fetches the exact URLs listed in `data/webpages/list.json`. It does **not** follow or crawl links it finds on those pages. This is intentional: it keeps the knowledge base a deliberately **curated** set and avoids pulling in un-vetted or inaccurate content.

Practical consequence: **adding an index/parent page does not pull in the pages it links to** — you must add each useful page's URL individually. The parent pages below are *starting points* for discovering sub-pages worth adding; they don't auto-expand.

---

## 🔗 Starting points

* **CCSS resources** — individual articles + FAQs: https://ccss.carleton.ca/resources/
* **School of Computer Science (SCS)** — pages useful to incoming/current students: https://carleton.ca/scs/
* **Co-op program** pages: https://carleton.ca/co-op/
* **Main registration** (new undergrad): https://carleton.ca/registration/new-ug/
* **CS undergrad course calendar**: https://calendar.carleton.ca/undergrad/undergradprograms/computerscience/
* **Academic calendar**: https://calendar.carleton.ca/academicyear/

These are a starting point and lean toward parent pages that link to many useful sub-pages. Use your best judgment on what to include and what to skip — the list can always grow later.

---

## 🛠 How to build it

1. Browse the starting-point pages above and collect the URLs of **individual** pages that would help a CS student (course info, registration, co-op, program requirements, FAQs, etc.).
2. Add each URL as a string to the array in `data/webpages/list.json`. Keep the file a **flat JSON array of URL strings** (same shape as now) — don't restructure it into objects or categories; the ingest script reads it as a plain list of strings.
3. Don't add a URL that is already in the list.
4. Run `make ingest` and confirm the new pages scrape without errors and the stored content grows. If you can't run it, note that in the PR so a reviewer can confirm.
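
As an optional local sanity check before running `make ingest`, the shape and duplicate rules in steps 2 and 3 could be verified with a small script like the sketch below. It is not part of the repo or of this ticket's deliverable. The function name `check_url_list` is hypothetical, and treating a trailing slash as insignificant when comparing URLs is an assumption, not something the ingest script is known to do.

```python
import json
from urllib.parse import urlparse


def check_url_list(raw_json: str) -> list[str]:
    """Return a list of problems found in the JSON text; an empty list means none."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        return ["top-level value is not a JSON array"]

    problems = []
    seen = set()
    for index, item in enumerate(data):
        # The file must stay a flat array of strings (no objects, no nesting).
        if not isinstance(item, str):
            problems.append(f"entry {index} is not a string")
            continue
        parsed = urlparse(item)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"entry {index} is not an http(s) URL: {item}")
        # Assumption: "…/scs/" and "…/scs" point at the same page.
        key = item.rstrip("/")
        if key in seen:
            problems.append(f"entry {index} duplicates an earlier URL: {item}")
        seen.add(key)
    return problems
```

To use it, read `data/webpages/list.json` as text, pass the contents to `check_url_list`, and print whatever it returns.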

**Notes**

* **Include / skip — use judgment.** Good candidates: stable, content-rich pages relevant to incoming/current CS students. Skip: login-walled pages, pages with little real text, news/events with a short shelf life, and anything not actually useful for the CS program.
* Some pages (e.g. FAQ accordions) may not extract perfectly with the current scraper — that's fine, add them anyway if the content is valuable; extraction quality is being improved separately.
* This ticket needs no code changes and no new dependencies.

---

## ✅ Acceptance Criteria

* New vetted URLs are added to `data/webpages/list.json`, and the file is still a valid **flat JSON array of URL strings**.
* All added URLs resolve (no 404s) and are in-scope Carleton CS pages useful to students.
* No duplicate entries.
* `make ingest` completes without errors on the updated list and the stored content grows; otherwise this is noted in the PR for a reviewer to verify.
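
For the "no 404s" criterion, a reviewer could spot-check the added URLs with something like the sketch below. It uses only the Python standard library and is an illustration, not the project's tooling. The names `fetch_status` and `find_broken`, the `User-Agent` string, and the choice to treat any status of 400 or above as broken are all assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    request = urllib.request.Request(url, headers={"User-Agent": "url-list-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; the code is what we want to report.
        return error.code


def find_broken(urls, get_status=fetch_status):
    """Return (url, reason) pairs for URLs that fail or answer with status >= 400."""
    broken = []
    for url in urls:
        try:
            status = get_status(url)
        except OSError as error:  # DNS failure, timeout, refused connection
            broken.append((url, str(error)))
            continue
        if status >= 400:
            broken.append((url, f"HTTP {status}"))
    return broken
```

`get_status` is a parameter so the loop can be exercised without network access. In normal use, call `find_broken` with the list loaded from `data/webpages/list.json`. A passing result only shows that the pages resolve, so whether each page is in scope still needs a human read.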
