🧠 Context
The knowledge base is only as good as the pages we feed it. data/webpages/list.json currently has only a handful of URLs. This ticket expands it with vetted Carleton CS pages that are useful to incoming and current students.
Important design point: the scraper does not crawl
By design, the scraper only fetches the exact URLs listed in data/webpages/list.json. It does not follow or crawl links it finds on those pages. This is intentional: it keeps the knowledge base a deliberately curated set and avoids pulling in unvetted or inaccurate content.
Practical consequence: adding an index/parent page does not pull in the pages it links to — you must add each useful page's URL individually. The parent pages below are starting points for discovering sub-pages worth adding; they don't auto-expand.
🔗 Starting points
These are a starting point and lean toward parent pages that link to many useful sub-pages. Use your best judgment on what to include and what to skip — the list can always grow later.
🛠 How to build it
- Browse the starting-point pages above and collect the URLs of individual pages that would help a CS student (course info, registration, co-op, program requirements, FAQs, etc.).
- Add each URL as a string to the array in data/webpages/list.json. Keep the file a flat JSON array of URL strings (same shape as now). Don't restructure it into objects or categories; the ingest script reads it as a plain list of strings.
- Don't duplicate URLs already in the list.
- Run make ingest and confirm the new pages scrape without errors and the stored content grows. If you can't run it, note that in the PR so a reviewer can confirm.
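For a quick local sanity check before running make ingest, the shape and duplicate rules above can be sketched in a few lines of standard-library Python. This is only an illustration and not part of the repo: the function name check_url_list is hypothetical, and the http(s) prefix test is an assumption about what counts as a URL string here. Nothing needs to be committed for this ticket.

```python
import json
from collections import Counter


def check_url_list(text: str) -> list[str]:
    """Return a list of problems found in the JSON text; empty means it looks fine."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]

    # The file must stay a flat array, not an object or nested structure.
    if not isinstance(data, list):
        return ["top level must be a JSON array"]

    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, str):
            problems.append(f"entry {i} is not a string")
        elif not item.startswith(("http://", "https://")):
            # Assumption: every entry should be an absolute http(s) URL.
            problems.append(f"entry {i} is not an http(s) URL: {item}")

    # Exact-match duplicate detection on the string entries only.
    counts = Counter(item for item in data if isinstance(item, str))
    for url, n in counts.items():
        if n > 1:
            problems.append(f"duplicate URL ({n} times): {url}")
    return problems
```

To use it, read data/webpages/list.json as text, pass the contents to check_url_list, and print whatever comes back. Note that the duplicate check is an exact string match, so two URLs that differ only by a trailing slash would not be flagged.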
Notes
- Include / skip — use judgment. Good candidates: stable, content-rich pages relevant to incoming/current CS students. Skip: login-walled pages, pages with little real text, news/events with a short shelf life, and anything not actually useful for the CS program.
- Some pages (e.g. FAQ accordions) may not extract perfectly with the current scraper — that's fine, add them anyway if the content is valuable; extraction quality is being improved separately.
- No code, no dependencies.
✅ Acceptance Criteria
- New vetted URLs are added to data/webpages/list.json, and the file is still a valid flat JSON array of URL strings.
- All added URLs resolve (no 404s) and are in-scope Carleton CS pages useful to students.
- No duplicate entries.
- make ingest completes without errors on the updated list and the stored content grows; otherwise this is noted in the PR for a reviewer to verify.
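The "no 404s" criterion can also be spot-checked by hand or with a throwaway script. The following is a minimal sketch using only the Python standard library; the names http_status and find_broken are hypothetical, and it assumes the target servers answer HEAD requests (some do not, in which case a page may be reported as broken even though it loads in a browser).

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL, using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code


def find_broken(
    urls: Iterable[str],
    status_of: Callable[[str], int] = http_status,
) -> dict[str, str]:
    """Map each URL that fails to resolve to a short description of the failure."""
    broken = {}
    for url in urls:
        try:
            code = status_of(url)
        except OSError as err:
            # Covers DNS failures, timeouts, and refused connections.
            broken[url] = f"request failed: {err}"
            continue
        if code >= 400:
            broken[url] = f"HTTP {code}"
    return broken
```

The status_of parameter exists so the check can be exercised without network access, for example by passing a dictionary lookup in place of the real request. Any URL this flags is worth opening in a browser before dropping it from the list.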