Expand the curated URL list #1

@AJaccP

🧠 Context

The knowledge base is only as good as the pages we feed it. data/webpages/list.json currently has only a handful of URLs. This ticket expands it with vetted Carleton CS pages that are useful to incoming and current students.

Important design point - the scraper does not crawl

By design, the scraper only fetches the exact URLs listed in data/webpages/list.json. It does not follow or crawl links it finds on those pages. This is intentional: it keeps the knowledge base a deliberately curated set and avoids pulling in un-vetted or inaccurate content.

Practical consequence: adding an index/parent page does not pull in the pages it links to — you must add each useful page's URL individually. The parent pages below are starting points for discovering sub-pages worth adding; they don't auto-expand.
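
To make that consequence concrete, here is a minimal sketch of a fetch loop that does not crawl. It is not the project's actual scraper: `fetch_listed` and its `fetch` parameter are hypothetical names used only for illustration, and the real ingest code may be structured quite differently.

```python
import json
from typing import Callable, Dict, List


def fetch_listed(list_json: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch exactly the URLs in a flat JSON array of strings.

    `fetch` is a hypothetical stand-in for whatever downloads one page.
    """
    urls: List[str] = json.loads(list_json)
    pages: Dict[str, str] = {}
    for url in urls:
        # The returned body may contain links to other pages, but they are
        # deliberately never followed: only listed URLs end up in `pages`.
        pages[url] = fetch(url)
    return pages
```

With this shape, listing a parent page yields one stored page, however many sub-pages it links to.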


🔗 Starting points

These are starting points that lean toward parent pages linking to many useful sub-pages. Use your best judgment on what to include and what to skip; the list can always grow later.


🛠 How to build it

  1. Browse the starting-point pages above and collect the URLs of individual pages that would help a CS student (course info, registration, co-op, program requirements, FAQs, etc.).
  2. Add each URL as a string to the array in data/webpages/list.json. Keep the file a flat JSON array of URL strings (same shape as now) — don't restructure it into objects or categories; the ingest script reads it as a plain list of strings.
  3. Don't duplicate URLs already in the list.
  4. Run make ingest and confirm the new pages scrape without errors and the stored content grows. If you can't run it, note that in the PR so a reviewer can confirm.
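
Before running make ingest, the shape rules from steps 2 and 3 can be checked locally. The sketch below is optional and not part of the repo. `check_url_list` is a hypothetical helper, and its duplicate check compares strings exactly, so two spellings of the same page (for example with and without a trailing slash) would not be caught.

```python
import json
from typing import List


def check_url_list(text: str) -> List[str]:
    """Return a list of problems found in the JSON text; empty means OK."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top level must be a JSON array"]

    problems: List[str] = []
    seen = set()
    for index, item in enumerate(data):
        if not isinstance(item, str):
            # Objects, numbers, or nested arrays break the flat-list shape.
            problems.append(f"entry {index} is not a string: {item!r}")
            continue
        if not item.startswith(("http://", "https://")):
            problems.append(f"entry {index} does not look like a URL: {item!r}")
        if item in seen:
            problems.append(f"duplicate entry: {item}")
        seen.add(item)
    return problems
```

Passing the contents of data/webpages/list.json to `check_url_list` and getting back an empty list suggests the file is still a flat array of unique URL strings. It says nothing about whether the pages actually scrape.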

Notes

  • Include / skip — use judgment. Good candidates: stable, content-rich pages relevant to incoming/current CS students. Skip: login-walled pages, pages with little real text, news/events with a short shelf life, and anything not actually useful for the CS program.
  • Some pages (e.g. FAQ accordions) may not extract perfectly with the current scraper — that's fine, add them anyway if the content is valuable; extraction quality is being improved separately.
  • No code changes or new dependencies are needed for this ticket.

✅ Acceptance Criteria

  • New vetted URLs are added to data/webpages/list.json, and the file is still a valid flat JSON array of URL strings.
  • All added URLs resolve (no 404s) and are in-scope Carleton CS pages useful to students.
  • No duplicate entries.
  • make ingest completes without errors on the updated list and the stored content grows; otherwise this is noted in the PR for a reviewer to verify.
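
The "no 404s" criterion can be spot-checked with the standard library alone. This is a rough sketch, not something the repo provides: `http_status` and `find_broken` are hypothetical names, and it assumes the servers answer HEAD requests, which some do not. A URL flagged here is worth opening in a browser before dropping it.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for url, or None if no response was received."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as HTTPError; keep the code.
        return exc.code
    except (OSError, ValueError):
        # DNS failures, timeouts, refused connections, malformed URLs.
        return None


def find_broken(
    urls: Iterable[str],
    get_status: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, Optional[int]]:
    """Map each URL that failed or returned status >= 400 to its status."""
    broken: Dict[str, Optional[int]] = {}
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken[url] = status
    return broken
```

The `get_status` parameter exists so the check can be exercised without network access by passing in a stub.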
