🧠 Context
html_scraper.scrape() (src/ingestion/scrapers/html_scraper.py) does static-HTML extraction with httpx + trafilatura. It works for plain prose pages but misses a lot of what matters on Carleton CS pages. The known gaps (already noted in the file's comments) are:
- Content inside accordion / FAQ widgets is lost, and it is often the most important info. Two distinct failure modes:
  - Content hidden via aria-hidden="true" is dropped entirely: both the question and the answer disappear.
  - On other FAQ layouts, the answer text is captured but the question isn't, so the answer has no context.
- Pages carry a lot of nav/header/footer/sidebar boilerplate that bloats chunks.
- PDF links and other links found on a page aren't surfaced usefully — they get dropped or mangled.
This ticket improves extraction quality across the page types we actually ingest, and captures PDF/links instead of losing them.
🎯 Goals
- Main-content extraction — scope extraction to the page's main content, dropping nav/header/footer/sidebars so chunks aren't full of boilerplate.
- Recover accordion/FAQ content — after this ticket, for both failure modes above, the question and its answer both appear in the output and stay together as a coherent Q&A.
- Capture links instead of losing them: surface PDF links as [PDF: <url>] markers so they are carried into the content and can be cited. The domain schema already has a pointer source type (see the source_type argument on Repository.upsert_chunk / get_or_create_source) for representing PDFs/links if you choose to store them as their own entries. Other useful links should be preserved in the text.
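
A minimal sketch of the PDF-marker idea, using only the standard library so it runs without the project's dependencies. The names pdf_markers and _PdfLinkCollector are hypothetical, and treating "URL path ends in .pdf" as the test for a PDF link is an assumption. Whether the markers end up inline or appended to the content is left open here. Nothing is fetched; only the URL is recorded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> links that point at PDFs (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL so the marker is citable.
        absolute = urljoin(self.base_url, href)
        # Assumption: a PDF link is one whose path ends in ".pdf" (case-insensitive).
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.urls:
            self.urls.append(absolute)


def pdf_markers(html, base_url):
    """Return one "[PDF: <url>]" marker per distinct PDF link, in document order."""
    collector = _PdfLinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return [f"[PDF: {url}]" for url in collector.urls]
```

Called as pdf_markers(page_html, page_url), this returns a list of marker strings that the scraper could merge into the extracted text or hand to ingest_url for storage as pointer entries.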
🚧 Hard boundaries
- Do not crawl/follow links or download PDFs. Capture the PDF/link URL only — this keeps the knowledge base a deliberately curated set. No PDF text extraction.
- Chunking is unchanged. Leave the RecursiveCharacterTextSplitter config as-is; this ticket is about what text we extract, not how it's chunked.
- Don't populate section_heading. Leave it None; heading/structure extraction is a separate effort.
🛠 Approach (investigation-first)
- Spike first. Run the current scraper against a few representative pages (see test targets below) and look at exactly what's missing. Don't commit to an approach before you've seen the real HTML.
- Try httpx → HTML preprocessing → trafilatura. The likely fix is preprocessing the HTML before extraction: select the main-content container, and un-hide / restructure the accordion markup so the hidden Q&A survives (the content is in the static HTML; trafilatura just discards hidden/widget nodes). Trafilatura already runs with include_links/include_tables; the gap is the hidden content, which is a preprocessing problem.
- Fall back to BeautifulSoup if preprocessing + trafilatura can't get there. lxml is already installed (a good bs4 parser); add beautifulsoup4 by running uv add beautifulsoup4 and commit the updated uv.lock (CI runs uv sync --frozen).
- Keep scrape()'s return shape stable if you can. If you need to return PDFs/links separately, update ingest_url in src/ingestion/services/ingestion_service.py to match (both files are in scope for this ticket).
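
One way the preprocessing step could look, as a sketch rather than the chosen approach (the spike should decide that). It uses only the standard library so it runs anywhere; in the scraper itself the same transformation would more likely be done with lxml or BeautifulSoup. Assumptions: the aria-hidden and hidden attributes are what cause accordion panels to be dropped, and nav/header/footer/aside (plus script/style) are always boilerplate. preprocess_html and the constants are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these subtrees never hold content we want to ingest.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
# Assumption: these attributes are what hides accordion panels.
HIDING_ATTRS = {"aria-hidden", "hidden"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _Preprocessor(HTMLParser):
    """Re-emits HTML without boilerplate subtrees and without hiding attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_tag = None   # boilerplate tag we are currently inside, if any
        self._skip_depth = 0    # nesting count of that same tag

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is not None:
            # Only count same-named tags so unclosed <li>/<p> cannot unbalance us.
            if tag == self._skip_tag:
                self._skip_depth += 1
            return
        if tag in BOILERPLATE_TAGS:
            self._skip_tag, self._skip_depth = tag, 1
            return
        kept = [(k, v) for k, v in attrs if k not in HIDING_ATTRS]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._skip_depth -= 1
                if self._skip_depth == 0:
                    self._skip_tag = None
            return
        if tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_tag is None:
            self.out.append(escape(data, quote=False))


def preprocess_html(html):
    """Return HTML with boilerplate removed and hidden panels un-hidden."""
    parser = _Preprocessor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The output of preprocess_html(raw_html) would then be passed to trafilatura in place of the raw page. One caveat to check during the spike: a header element nested inside the main content (for example a page title block) would also be removed by this sketch, so the tag list may need narrowing.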
Files
- src/ingestion/scrapers/html_scraper.py: the extraction logic (main change).
- src/ingestion/services/ingestion_service.py: only if the scraper's output shape changes, or if you store PDFs/links as pointer entries (it currently hardcodes source_type="html").
🔬 Test targets (good representative pages)
🧪 Tests
Save a few real-page HTML snapshots as fixtures under tests/ingestion/ and test the scraper against them offline (no network, no Docker). Assert that:
- accordion question + answer text now appears in the output,
- nav/header/footer boilerplate is gone,
- PDF links show up as [PDF: ...] markers.
✅ Acceptance Criteria
- For pages with accordion/FAQ content, both the question and answer text appear in the scraped output (both failure modes fixed), and they stay together.
- Extracted content excludes nav/header/footer/sidebar boilerplate — main content only.
- PDF links are captured as [PDF: <url>] markers (and/or stored via the pointer source type); other useful links are preserved. Links/PDFs are not fetched, crawled, or downloaded.
- Chunking config and section_heading behavior are unchanged.
- Tests run against saved HTML fixtures (no network/Docker) and assert: recovered accordion Q&A, boilerplate removed, PDF markers present.
- make test and make lint pass. If beautifulsoup4 was added, the updated uv.lock is committed.