Improve scraper's content extraction

## 🧠 Context

`html_scraper.scrape()` (`src/ingestion/scrapers/html_scraper.py`) does static-HTML extraction with `httpx` + `trafilatura`. It works for plain prose pages but misses much of the important content on Carleton CS pages. The known gaps (already noted in the file's comments) are:

* Content inside **accordion / FAQ widgets** is lost — often the most important info. Two distinct failure modes:
  1. Content hidden via `aria-hidden="true"` is dropped entirely — **both the question and the answer disappear.**
  2. On other FAQ layouts, the **answer text is captured but the question isn't** — so the answer has no context.
* Pages carry a lot of **nav/header/footer/sidebar boilerplate** that bloats chunks.
* **PDF links and other links** found on a page aren't surfaced usefully — they get dropped or mangled.

This ticket improves extraction quality across the page types we actually ingest, and captures PDF/links instead of losing them.

---

## 🎯 Goals

1. **Main-content extraction** — scope extraction to the page's main content, dropping nav/header/footer/sidebars so chunks aren't full of boilerplate.
2. **Recover accordion/FAQ content** — after this ticket, for both failure modes above, **the question and its answer both appear in the output and stay together** as a coherent Q&A.
3. **Capture links instead of losing them** — surface PDF links as `[PDF: <url>]` markers so they ride into the content and can be cited. The domain schema already has a `pointer` source type (see the `source_type` argument on `Repository.upsert_chunk` / `get_or_create_source`) for representing PDFs/links if you choose to store them as their own entries. Other useful links should be preserved in the text.
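
To make the `[PDF: <url>]` marker in goal 3 concrete, here is a minimal sketch using only the standard library. `LinkMarkerParser` and `text_with_pdf_markers` are hypothetical names, the `.pdf` path-suffix check is an assumed heuristic for recognising PDF links, and the `example.org` URLs are placeholders. It keeps the anchor text of non-PDF links but not their URLs, so the real scraper may want to preserve more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkMarkerParser(HTMLParser):
    """Collect page text, appending a [PDF: <url>] marker after PDF links."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.parts = []
        self._href_stack = []  # resolved href of each currently open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Resolve relative links against the page URL; nothing is fetched.
            self._href_stack.append(urljoin(self.base_url, href) if href else None)

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            url = self._href_stack.pop()
            # Assumed heuristic: a link is a PDF if its URL path ends in ".pdf".
            if url and urlparse(url).path.lower().endswith(".pdf"):
                self.parts.append(f" [PDF: {url}]")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_pdf_markers(html_text, base_url):
    """Return whitespace-normalised text with PDF markers inlined."""
    parser = LinkMarkerParser(base_url)
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Because the marker is written right after the link's anchor text, it stays next to the sentence that mentions the PDF when the text is later chunked.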

## 🚧 Hard boundaries

* **Do not crawl/follow links or download PDFs.** Capture the PDF/link URL only — this keeps the knowledge base a deliberately curated set. No PDF text extraction.
* **Chunking is unchanged.** Leave the `RecursiveCharacterTextSplitter` config as-is — this ticket is about *what text we extract*, not how it's chunked.
* **Don't populate `section_heading`.** Leave it `None`; heading/structure extraction is a separate effort.

---

## 🛠 Approach (investigation-first)

1. **Spike first.** Run the current scraper against a few representative pages (see test targets below) and look at exactly what's missing. Don't commit to an approach before you've seen the real HTML.
2. **Try `httpx` → HTML preprocessing → `trafilatura`.** The likely fix is preprocessing the HTML before extraction: select the main-content container, and un-hide / restructure the accordion markup so the hidden Q&A survives (the content is in the static HTML — trafilatura just discards hidden/widget nodes). Trafilatura already runs with `include_links`/`include_tables`; the gap is the hidden content, which is a preprocessing problem.
3. **Fall back to BeautifulSoup** if preprocessing + trafilatura can't get there. `lxml` is already installed (a good bs4 parser); add `beautifulsoup4` via `uv add beautifulsoup4` and **commit the updated `uv.lock`** (CI runs `uv sync --frozen`).
4. Keep `scrape()`'s return shape stable if you can. If you need to return PDFs/links separately, update `ingest_url` in `src/ingestion/services/ingestion_service.py` to match (both files are in scope for this ticket).
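
The preprocessing idea in step 2 could look roughly like the sketch below: re-serialise the HTML while dropping the attributes that mark content as hidden, then hand the result to the extractor. This is a sketch under assumptions, not the fix itself. `Unhider` and `unhide` are hypothetical names, the list of hiding attributes is a guess to be confirmed against the real pages during the spike, and it uses the standard library's `html.parser` to stay dependency-free (the same idea could be done with `lxml`). Whether removing these attributes is enough for trafilatura to keep the content also needs to be verified in the spike.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these attributes are what hide accordion panels on the target pages.
HIDING_ATTRS = {"hidden", "aria-hidden"}


class Unhider(HTMLParser):
    """Re-emit HTML unchanged except for attributes that hide content."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; pass through untouched.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open_tag(self, tag, attrs, closer=">"):
        kept = []
        for name, value in attrs:
            if name in HIDING_ATTRS:
                continue
            # Drop inline styles that hide the element (loses other inline styles too).
            if name == "style" and value and "display:none" in value.replace(" ", "").lower():
                continue
            kept.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join([tag, *kept]) + closer

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def unhide(html_text):
    """Return html_text with hiding attributes stripped from every tag."""
    parser = Unhider()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The output of `unhide()` would then be passed to the existing trafilatura call in place of the raw response body. This only addresses the first failure mode (hidden content); keeping a question next to its answer on the other FAQ layout likely needs layout-specific restructuring.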

**Files**

* `src/ingestion/scrapers/html_scraper.py` — the extraction logic (main change).
* `src/ingestion/services/ingestion_service.py` — only if the scraper's output shape changes, or if you store PDFs/links as `pointer` entries (it currently hardcodes `source_type="html"`).

## 🔬 Test targets (good representative pages)

* An SCS page with accordion-style FAQs at the end of the page: https://carleton.ca/scs/current-students/bachelor-of-cybersecurity/bcyber-courses-and-registration/
* A university FAQ page with a different design: https://carleton.ca/registration/new-ug/new-student-faqs/
* Any page containing PDF links and links to other pages.

---

## 🧪 Tests

Save a few real-page **HTML snapshots as fixtures** under `tests/ingestion/` and test the scraper against them **offline** (no network, no Docker). Assert that:

* accordion question and answer text both appear in the output,
* nav/header/footer boilerplate is gone,
* PDF links show up as `[PDF: ...]` markers.
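
The three assertions above could be backed by small helpers like the ones sketched here. All names are hypothetical, and `max_gap` (how many characters may separate a question from its answer before they no longer count as "together") is an assumed threshold to tune against the real fixtures.

```python
import re

# Matches the [PDF: <url>] marker format described in the goals.
PDF_MARKER = re.compile(r"\[PDF: (https?://[^\]\s]+)\]")


def qa_stays_together(text, question, answer, max_gap=200):
    """True if `answer` follows `question` within `max_gap` characters."""
    q_start = text.find(question)
    if q_start == -1:
        return False
    q_end = q_start + len(question)
    a_start = text.find(answer, q_end)
    return a_start != -1 and a_start - q_end <= max_gap


def pdf_urls(text):
    """Return every URL that appears inside a [PDF: ...] marker."""
    return PDF_MARKER.findall(text)


def contains_boilerplate(text, phrases):
    """True if any known nav/header/footer phrase survives in the output."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)
```

In a test, the extracted text would come from running the scraper's extraction step on a fixture file read from `tests/ingestion/`. How the fixture HTML gets passed in depends on `scrape()`'s actual signature, which is not assumed here.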

---

## ✅ Acceptance Criteria

* For pages with accordion/FAQ content, **both the question and answer text appear** in the scraped output (both failure modes fixed), and they stay together.
* Extracted content excludes nav/header/footer/sidebar boilerplate — main content only.
* PDF links are captured as `[PDF: <url>]` markers (and/or stored via the `pointer` source type); other useful links are preserved. Links/PDFs are **not** fetched, crawled, or downloaded.
* Chunking config and `section_heading` behavior are unchanged.
* Tests run against saved HTML fixtures (no network/Docker) and assert: recovered accordion Q&A, boilerplate removed, PDF markers present.
* `make test` and `make lint` pass. If `beautifulsoup4` was added, the updated `uv.lock` is committed.
