
[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis#591

Open
svij-sc wants to merge 16 commits into main from svij/easy-analyz-bq


svij-sc (Collaborator) commented Apr 17, 2026

Summary

  • Standalone DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure — so engineers can diagnose training data issues in minutes instead of after a failed training run.
  • 4-tier validation: hard fails (dangling edges, referential integrity, duplicate nodes) → core metrics (degree distribution, hubs, cold-start, memory budget, neighbor explosion estimate) → label/heterogeneous (class imbalance, label coverage, edge type distribution) → opt-in advanced (reciprocity, homophily, connected components, clustering).
  • Thresholds and check selection backed by a literature review of 18 production GNN papers (PinSage, LiGNN, TwHIN, GiGL, BLADE, AliGraph, GraphSMOTE, Beyond Homophily, Feature Propagation, and more). Each threshold cites its source paper.
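
To illustrate what a Tier 1 hard-fail check looks like, here is a minimal sketch of a dangling-edge count. It is not one of the templates in queries.py: the table names (`nodes`, `edges`), the column names (`node_id`, `src_id`, `dst_id`), and the helper functions are hypothetical, and `sqlite3` stands in for BigQuery so the example runs locally. The join pattern itself is plain SQL.

```python
import sqlite3

# Hypothetical schema: nodes(node_id), edges(src_id, dst_id).
# An edge is "dangling" when either endpoint has no matching node row.
DANGLING_EDGES_SQL = """
SELECT COUNT(*)
FROM edges AS e
LEFT JOIN nodes AS src ON e.src_id = src.node_id
LEFT JOIN nodes AS dst ON e.dst_id = dst.node_id
WHERE src.node_id IS NULL OR dst.node_id IS NULL
"""


def count_dangling_edges(conn: sqlite3.Connection) -> int:
    """Return the number of edges whose source or destination node is missing."""
    (count,) = conn.execute(DANGLING_EDGES_SQL).fetchone()
    return count


def build_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory graph with exactly one dangling edge (3 -> 99)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER)")
    conn.execute("CREATE TABLE edges (src_id INTEGER, dst_id INTEGER)")
    conn.executemany("INSERT INTO nodes VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?)", [(1, 2), (2, 3), (3, 99)]
    )
    return conn


if __name__ == "__main__":
    print(count_dangling_edges(build_demo_db()))
```

A nonzero count would be treated as a hard fail. Note that this sketch assumes node IDs are unique; with duplicate node rows the joins would inflate the count, which is one reason to run the duplicate-node check as well.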

Changes

  • gigl/analytics/data_analyzer/config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
  • gigl/analytics/data_analyzer/report/PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
  • tests/unit/analytics/data_analyzer/ — 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
  • tests/test_assets/analytics/sample_analyzer_config.yaml + golden_report.html snapshot
  • docs/plans/ — design doc, literature review, 1-pager, engineering spec (all colocated)
  • pyproject.toml — package-data declaration so .ai.* assets ship in installed wheels

Test plan

  • uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
  • make type_check → clean on 651 files
  • make check_format → clean
  • Manual: run analyzer CLI against a real BQ dataset and inspect the generated HTML

v1 scope cuts (follow-up PRs)

  • FeatureProfiler: TFDV/Dataflow integration is a working stub that logs a warning and returns empty results. The full Beam pipeline wiring (reusing GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.
  • GCS upload: The orchestrator generates the HTML but does not yet upload it; currently returns the target path with a TODO.
  • Tier 4 advanced queries: Reciprocity, homophily, connected components, and clustering coefficient are not implemented. Power-law exponent is computed as a degree-stats approximation.
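
For reviewers who want a picture of the stub behavior described in the first bullet above, a minimal sketch follows. The method name `profile`, its parameter, and the `stats_by_table` field are assumptions made for illustration; the actual FeatureProfiler and FeatureProfileResult in this branch may be shaped differently.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger(__name__)


@dataclass
class FeatureProfileResult:
    # Hypothetical field: per-table statistics, empty until the Beam
    # pipeline wiring lands.
    stats_by_table: Dict[str, dict] = field(default_factory=dict)


class FeatureProfiler:
    """Stub profiler: warns that profiling is skipped and returns no stats."""

    def profile(self, table_names: List[str]) -> FeatureProfileResult:
        logger.warning(
            "Feature profiling is not implemented yet; skipping %d table(s).",
            len(table_names),
        )
        return FeatureProfileResult()
```

Returning an empty result rather than raising lets the orchestrator still produce a report from the graph structure checks alone.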

Docs

  • Design doc: docs/plans/20260415-bq-data-analyzer.md
  • Literature review: docs/plans/20260415-bq-data-analyzer-references.md
  • 1-pager: docs/plans/20260416-data-analyzer-1-pager.md
  • Engineering spec: docs/plans/20260416-data-analyzer-engineering-spec.md
  • Report PRD (product intent): gigl/analytics/data_analyzer/report/PRD.md
  • Report SPEC (technical contract): gigl/analytics/data_analyzer/report/SPEC.md

svij-sc and others added 14 commits April 17, 2026 20:25
Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
…sisResult, FeatureProfileResult)

Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes)
  raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL
  rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column
  presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).
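
The commit message does not spell out the formulas behind the Python-side estimates, so the following is only a sketch under stated assumptions: feature memory is taken as nodes × feature dimension × bytes per value, and neighbor explosion as the cumulative number of nodes reached per seed when each hop samples at most `fanout` neighbors, capped by the average degree. Function and parameter names are hypothetical.

```python
from typing import List


def estimate_feature_memory_bytes(
    num_nodes: int, feature_dim: int, bytes_per_value: int = 4
) -> int:
    """Dense feature matrix size, assuming float32 values by default."""
    return num_nodes * feature_dim * bytes_per_value


def estimate_neighbor_explosion(avg_degree: float, fanouts: List[int]) -> float:
    """Rough upper bound on nodes touched per seed node across all hops."""
    total = 1.0  # the seed node itself
    frontier = 1.0
    for fanout in fanouts:
        # Each frontier node contributes at most `fanout` neighbors, and on
        # average no more than the graph's mean degree.
        frontier *= min(float(fanout), avg_degree)
        total += frontier
    return total


if __name__ == "__main__":
    print(estimate_feature_memory_bytes(1_000_000, 128))
    print(estimate_neighbor_explosion(50.0, [10, 5]))
```

The estimate ignores overlap between sampled neighborhoods, so it overstates the cost on dense or highly clustered graphs.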

Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
…assets

Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the report_generator module that stitches AI-owned template,
styles, and chart JS into a single self-contained HTML report by
replacing the four INJECT_* placeholders. Adds a golden-file snapshot
test (and four structural tests) so future AI-driven edits to the
report assets fail fast until the snapshot is regenerated. Registers
the *.ai.{html,js,css} assets as package-data so importlib.resources
can resolve them from an installed wheel.
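
The stitching step can be pictured roughly as below. This is a sketch, not the code in report_generator.py: the four placeholder names shown (`INJECT_TITLE`, `INJECT_STYLES`, `INJECT_DATA`, `INJECT_CHARTS_JS`), the inline template, and `render_report` are all hypothetical, since the commit message only says that there are four INJECT_* placeholders.

```python
import json

# Hypothetical template; the real one lives in report.ai.html.
TEMPLATE = """<!DOCTYPE html>
<html>
<head><title><!--INJECT_TITLE--></title><style>/*INJECT_STYLES*/</style></head>
<body>
<script>const REPORT_DATA = /*INJECT_DATA*/;</script>
<script>/*INJECT_CHARTS_JS*/</script>
</body>
</html>"""


def render_report(
    template: str, title: str, styles: str, charts_js: str, data: dict
) -> str:
    """Replace each placeholder once, failing loudly if one is missing."""
    replacements = {
        "<!--INJECT_TITLE-->": title,
        "/*INJECT_STYLES*/": styles,
        "/*INJECT_CHARTS_JS*/": charts_js,
        # sort_keys keeps the output stable for golden-file comparison.
        "/*INJECT_DATA*/": json.dumps(data, sort_keys=True),
    }
    html = template
    for placeholder, value in replacements.items():
        if placeholder not in html:
            raise ValueError(f"Template is missing placeholder {placeholder}")
        html = html.replace(placeholder, value)
    return html
```

Failing on a missing placeholder is what lets a snapshot test catch template edits that silently drop an injection point.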

Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the main orchestrator class that coordinates graph structure
analysis, feature profiling, and HTML report generation. Includes CLI
entry point with argparse for analyzer_config_uri and resource_config_uri.
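
A minimal sketch of such an entry point is shown here. Only the two argument names come from the commit message; the `--` flag spelling, the `required=True` settings, the help text, and the `parse_args` helper are assumptions.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two config URIs the analyzer needs."""
    parser = argparse.ArgumentParser(
        description="Run pre-training graph data analysis."
    )
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,
        help="URI of the YAML analyzer config (node/edge tables).",
    )
    parser.add_argument(
        "--resource_config_uri",
        required=True,  # Assumption: the real CLI may make this optional.
        help="URI of the resource config.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.analyzer_config_uri, args.resource_config_uri)
```

Accepting `argv` as a parameter keeps the parser testable without touching `sys.argv`.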

Co-Authored-By: shubhamvij <svij@snapchat.com>
…deferred)

Co-Authored-By: shubhamvij <svij@snapchat.com>
Narrows the Union return type for mypy in the direct-merge test path.

Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
Sits alongside SPEC.md to separate product requirements (why and what)
from technical implementation contract (how). Both are AI-owned and
together form the input for regenerating report.ai.html, charts.ai.js,
and styles.ai.css.

Co-Authored-By: shubhamvij <svij@snapchat.com>
svij-sc added 2 commits April 17, 2026 23:51
… 1-pager, engineering spec

Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation,
  cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18
  production GNN papers with 100+ findings, common themes, and
  consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer
  engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation
  plan that the analyzer code in this branch follows

Co-Authored-By: shubhamvij <svij@snapchat.com>
svij-sc changed the title from "feat(analytics): add BQ Data Analyzer for pre-training graph data analysis" to "Feature Analytics: Add Data Analyzer for pre-training graph data analysis" on Apr 18, 2026
svij-sc changed the title from "Feature Analytics: Add Data Analyzer for pre-training graph data analysis" to "[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis" on Apr 18, 2026