Improve scraper's content extraction #6

@AJaccP

🧠 Context

html_scraper.scrape() (src/ingestion/scrapers/html_scraper.py) does static-HTML extraction with httpx + trafilatura. It works for plain prose pages but misses much of the content that matters on Carleton CS pages. The known gaps (already noted in the file's comments) are:

  • Content inside accordion / FAQ widgets is lost — often the most important info. Two distinct failure modes:
    1. Content hidden via aria-hidden="true" is dropped entirely — both the question and the answer disappear.
    2. On other FAQ layouts, the answer text is captured but the question isn't — so the answer has no context.
  • Pages carry a lot of nav/header/footer/sidebar boilerplate that bloats chunks.
  • PDF links and other links found on a page aren't surfaced usefully — they get dropped or mangled.

This ticket improves extraction quality across the page types we actually ingest, and captures PDF/links instead of losing them.


🎯 Goals

  1. Main-content extraction — scope extraction to the page's main content, dropping nav/header/footer/sidebars so chunks aren't full of boilerplate.
  2. Recover accordion/FAQ content — after this ticket, for both failure modes above, the question and its answer both appear in the output and stay together as a coherent Q&A.
  3. Capture links instead of losing them — surface PDF links as [PDF: <url>] markers so they ride into the content and can be cited. The domain schema already has a pointer source type (see the source_type argument on Repository.upsert_chunk / get_or_create_source) for representing PDFs/links if you choose to store them as their own entries. Other useful links should be preserved in the text.
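
For goal 3, the marker format could be produced by a small helper along these lines. This is a minimal sketch, not the scraper's actual code: render_link and its parameters are hypothetical names, the "(url)" rendering for non-PDF links is an assumption (the ticket only asks that they be preserved), and detecting PDFs by a ".pdf" path suffix will miss PDFs served from URLs without that suffix. Nothing is fetched; the helper only formats the URL.

```python
from urllib.parse import urljoin, urlsplit


def render_link(text: str, href: str, base_url: str) -> str:
    """Return the text to emit for one <a> element (hypothetical helper)."""
    # Resolve relative hrefs against the page URL so the marker is citable.
    url = urljoin(base_url, href)
    # Assumption: a PDF is recognised by its path suffix, ignoring query strings.
    is_pdf = urlsplit(url).path.lower().endswith(".pdf")
    # Collapse whitespace in the anchor text.
    label = " ".join(text.split())
    if is_pdf:
        return f"{label} [PDF: {url}]" if label else f"[PDF: {url}]"
    # Assumption: ordinary links are kept as "label (url)".
    return f"{label} ({url})" if label else url
```

For example, render_link("Course outline", "/docs/outline.pdf?v=2", "https://example.edu/cs/") returns "Course outline [PDF: https://example.edu/docs/outline.pdf?v=2]". The example.edu URL is a placeholder, not one of the ticket's test targets.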

🚧 Hard boundaries

  • Do not crawl/follow links or download PDFs. Capture the PDF/link URL only — this keeps the knowledge base a deliberately curated set. No PDF text extraction.
  • Chunking is unchanged. Leave the RecursiveCharacterTextSplitter config as-is — this ticket is about what text we extract, not how it's chunked.
  • Don't populate section_heading. Leave it None; heading/structure extraction is a separate effort.

🛠 Approach (investigation-first)

  1. Spike first. Run the current scraper against a few representative pages (see test targets below) and look at exactly what's missing. Don't commit to an approach before you've seen the real HTML.
  2. Try httpx → HTML preprocessing → trafilatura. The likely fix is preprocessing the HTML before extraction: select the main-content container, and un-hide / restructure the accordion markup so the hidden Q&A survives (the content is in the static HTML — trafilatura just discards hidden/widget nodes). Trafilatura already runs with include_links/include_tables; the gap is the hidden content, which is a preprocessing problem.
  3. Fall back to BeautifulSoup if preprocessing + trafilatura can't get there. lxml is already installed (a good bs4 parser); add beautifulsoup4 via uv add beautifulsoup4 and commit the updated uv.lock (CI runs uv sync --frozen).
  4. Keep scrape()'s return shape stable if you can. If you need to return PDFs/links separately, update ingest_url in src/ingestion/services/ingestion_service.py to match (both files are in scope for this ticket).
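
The preprocessing idea in step 2 could look roughly like the sketch below. It uses only the standard library's html.parser so it stays dependency-free; in the scraper itself lxml (already installed) would be the more natural tool. Everything here is an assumption to validate during the spike: preprocess_html and DROP_TAGS are hypothetical names, the set of dropped tags is a guess at what counts as boilerplate, and whether removing aria-hidden alone is enough for trafilatura to keep the accordion content has not been verified. Note that a header element inside the main content would also be dropped by this rule.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these subtrees are boilerplate (script/style are dropped too,
# since they carry no extractable prose).
DROP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


def _render_attrs(attrs):
    """Serialize attributes, leaving out aria-hidden so the node is no longer hidden."""
    parts = []
    for name, value in attrs:
        if name == "aria-hidden":
            continue
        if value is None:
            parts.append(f" {name}")
        else:
            parts.append(f' {name}="{escape(value, quote=True)}"')
    return "".join(parts)


class _Preprocessor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        self.out.append(f"<{tag}{_render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if self.skip_depth or tag in DROP_TAGS:
            return
        self.out.append(f"<{tag}{_render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def preprocess_html(html: str) -> str:
    """Return HTML with boilerplate subtrees removed and aria-hidden stripped."""
    parser = _Preprocessor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string would then be handed to trafilatura in place of the raw response body. Comments and the doctype are discarded by this sketch, which should not matter for text extraction.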

Files

  • src/ingestion/scrapers/html_scraper.py — the extraction logic (main change).
  • src/ingestion/services/ingestion_service.py — only if the scraper's output shape changes, or if you store PDFs/links as pointer entries (it currently hardcodes source_type="html").

🔬 Test targets (good representative pages)


🧪 Tests

Save a few real-page HTML snapshots as fixtures under tests/ingestion/ and test the scraper against them offline (no network, no Docker). Assert that:

  • accordion question + answer text now appears in the output,
  • nav/header/footer boilerplate is gone,
  • PDF links show up as [PDF: ...] markers.
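
The three assertions above could share one helper, sketched here under stated assumptions. extraction_problems, PDF_MARKER, and max_gap are hypothetical names; the 200-character gap used to decide that a question and answer "stay together" is an arbitrary threshold, and the marker regex assumes absolute http(s) URLs. In a real test the text argument would be the scraper's output for a saved fixture; that call is left out because the ticket does not spell out scrape()'s signature.

```python
import re

# Assumption: markers look like "[PDF: https://...]" with an absolute URL.
PDF_MARKER = re.compile(r"\[PDF: (https?://[^\]\s]+)\]")


def extraction_problems(text, question, answer, boilerplate=(), max_gap=200):
    """Return a list of human-readable problems; an empty list means the text passes."""
    problems = []
    q_pos = text.find(question)
    a_pos = text.find(answer)
    if q_pos == -1:
        problems.append("question missing")
    if a_pos == -1:
        problems.append("answer missing")
    if q_pos != -1 and a_pos != -1:
        # The answer should start shortly after the question ends.
        gap = a_pos - (q_pos + len(question))
        if gap < 0 or gap > max_gap:
            problems.append("question and answer not adjacent")
    for phrase in boilerplate:
        if phrase in text:
            problems.append(f"boilerplate present: {phrase!r}")
    if not PDF_MARKER.search(text):
        problems.append("no [PDF: ...] marker")
    return problems
```

A fixture test would then reduce to asserting that extraction_problems(...) == [] for each page, which keeps failure messages readable when one check regresses.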

✅ Acceptance Criteria

  • For pages with accordion/FAQ content, both the question and answer text appear in the scraped output (both failure modes fixed), and they stay together.
  • Extracted content excludes nav/header/footer/sidebar boilerplate — main content only.
  • PDF links are captured as [PDF: <url>] markers (and/or stored via the pointer source type); other useful links are preserved. Links/PDFs are not fetched, crawled, or downloaded.
  • Chunking config and section_heading behavior are unchanged.
  • Tests run against saved HTML fixtures (no network/Docker) and assert: recovered accordion Q&A, boilerplate removed, PDF markers present.
  • make test and make lint pass. If beautifulsoup4 was added, the updated uv.lock is committed.
