[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis by svij-sc · Pull Request #591 · Snapchat/GiGL

svij-sc · 2026-04-17T23:51:18Z

Summary

Standalone DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure — so engineers can diagnose training data issues in minutes instead of after a failed training run.
4-tier validation: hard fails (dangling edges, referential integrity, duplicate nodes) → core metrics (degree distribution, hubs, cold-start, memory budget, neighbor explosion estimate) → label/heterogeneous (class imbalance, label coverage, edge type distribution) → opt-in advanced (reciprocity, homophily, connected components, clustering).
Thresholds and check selection backed by a literature review of 18 production GNN papers (PinSage, LiGNN, TwHIN, GiGL, BLADE, AliGraph, GraphSMOTE, Beyond Homophily, Feature Propagation, and more). Each threshold cites its source paper.
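For reviewers who want a feel for the Tier 2 Python-side estimates (memory budget and neighbor explosion), here is a minimal sketch. The function names, the 4-bytes-per-value default, and the capped-fanout formula are assumptions for illustration only; this description does not state the exact formulas the analyzer uses.

```python
from typing import Sequence


def estimate_feature_memory_bytes(
    num_nodes: int, feature_dim: int, bytes_per_value: int = 4
) -> int:
    """Dense feature-matrix footprint, assuming fixed-width values (float32 by default)."""
    return num_nodes * feature_dim * bytes_per_value


def estimate_neighbor_explosion(fanouts: Sequence[int], avg_degree: float) -> float:
    """Rough count of nodes touched per seed node across sampling hops.

    Each hop multiplies the frontier by min(fanout, avg_degree), since a
    sampler cannot draw more neighbors than a typical node has. The seed
    itself is counted once.
    """
    total = 1.0
    frontier = 1.0
    for fanout in fanouts:
        frontier *= min(fanout, avg_degree)
        total += frontier
    return total
```

With a two-hop fanout of `[10, 10]` and an average degree of 25, the sketch counts 1 + 10 + 100 = 111 nodes per seed; with an average degree of 4 it counts 1 + 4 + 16 = 21.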

Changes

gigl/analytics/data_analyzer/ — config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
gigl/analytics/data_analyzer/report/ — PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
tests/unit/analytics/data_analyzer/ — 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
tests/test_assets/analytics/ — sample_analyzer_config.yaml + golden_report.html snapshot
docs/plans/ — design doc, literature review, 1-pager, engineering spec (all colocated)
pyproject.toml — package-data declaration so .ai.* assets ship in installed wheels

Test plan

uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
make type_check → clean on 651 files
make check_format → clean
Manual: run analyzer CLI against a real BQ dataset and inspect the generated HTML

v1 scope cuts (follow-up PRs)

FeatureProfiler: TFDV/Dataflow integration is a working stub that logs a warning and returns empty results. The full Beam pipeline wiring (reusing GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.
GCS upload: The orchestrator generates the HTML but does not yet upload it; currently returns the target path with a TODO.
Tier 4 advanced queries: Reciprocity, homophily, connected components, and clustering coefficient are not implemented. Power-law exponent is computed as a degree-stats approximation.

Docs

Design doc: docs/plans/20260415-bq-data-analyzer.md
Literature review: docs/plans/20260415-bq-data-analyzer-references.md
1-pager: docs/plans/20260416-data-analyzer-1-pager.md
Engineering spec: docs/plans/20260416-data-analyzer-engineering-spec.md
Report PRD (product intent): gigl/analytics/data_analyzer/report/PRD.md
Report SPEC (technical contract): gigl/analytics/data_analyzer/report/SPEC.md

Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes) raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).

Co-Authored-By: shubhamvij <svij@snapchat.com>
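The "raise with a partially populated result" behavior of Tier 1 could look roughly like the sketch below. Only the name `DataQualityError` comes from this PR; its constructor signature, the `PartialResult` dataclass, and `run_tier1_checks` are hypothetical stand-ins, and the counts are passed in directly rather than queried from BQ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartialResult:
    """Hypothetical result container; the real result types live in types.py."""

    dangling_edge_count: Optional[int] = None
    duplicate_node_count: Optional[int] = None
    failures: List[str] = field(default_factory=list)


class DataQualityError(Exception):
    """Raised on a hard fail; keeps whatever was computed before the failure."""

    def __init__(self, message: str, partial_result: PartialResult) -> None:
        super().__init__(message)
        self.partial_result = partial_result


def run_tier1_checks(dangling_edge_count: int, duplicate_node_count: int) -> PartialResult:
    """Record Tier 1 counts and raise if any hard-fail condition is violated."""
    result = PartialResult(
        dangling_edge_count=dangling_edge_count,
        duplicate_node_count=duplicate_node_count,
    )
    if dangling_edge_count > 0:
        result.failures.append(f"{dangling_edge_count} dangling edges")
    if duplicate_node_count > 0:
        result.failures.append(f"{duplicate_node_count} duplicate nodes")
    if result.failures:
        # Attach the populated result so callers can still report on it.
        raise DataQualityError("; ".join(result.failures), result)
    return result
```

A caller can catch `DataQualityError` and still render `err.partial_result`, so a hard fail does not discard the metrics gathered up to that point.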

Implements the report_generator module that stitches AI-owned template, styles, and chart JS into a single self-contained HTML report by replacing the four INJECT_* placeholders. Adds a golden-file snapshot test (and four structural tests) so future AI-driven edits to the report assets fail fast until the snapshot is regenerated. Registers the *.ai.{html,js,css} assets as package-data so importlib.resources can resolve them from an installed wheel.

Co-Authored-By: shubhamvij <svij@snapchat.com>
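A minimal sketch of the placeholder-stitching idea follows. The four placeholder names, the inline template, and `render_report` are hypothetical; the commit message only says there are four INJECT_* placeholders, and the real assets are loaded from the package rather than defined inline.

```python
import json
from typing import Dict

# Hypothetical template; the real one is report.ai.html.
TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<title>INJECT_TITLE</title>
<style>INJECT_STYLES</style>
</head>
<body>
<script>const REPORT_DATA = INJECT_DATA;</script>
<script>INJECT_CHARTS_JS</script>
</body>
</html>"""


def render_report(template: str, replacements: Dict[str, str]) -> str:
    """Replace each placeholder once-present in the template with its content."""
    html = template
    for placeholder, content in replacements.items():
        if placeholder not in html:
            # Fail loudly if the template and the generator drift apart.
            raise ValueError(f"placeholder {placeholder!r} not found in template")
        html = html.replace(placeholder, content)
    return html


html = render_report(
    TEMPLATE,
    {
        "INJECT_TITLE": "Data Analyzer Report",
        "INJECT_STYLES": "body { font-family: sans-serif; }",
        "INJECT_DATA": json.dumps({"node_count": 42}),
        "INJECT_CHARTS_JS": "console.log(REPORT_DATA.node_count);",
    },
)
```

A structural test in this style can simply assert that no `INJECT_` token survives in the output, which is cheaper to maintain than a full golden-file comparison.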

Implements the main orchestrator class that coordinates graph structure analysis, feature profiling, and HTML report generation. Includes CLI entry point with argparse for analyzer_config_uri and resource_config_uri.

Co-Authored-By: shubhamvij <svij@snapchat.com>
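The CLI surface described above might be wired up along these lines. The two argument names come from the commit message; the flag spelling (underscores, `--` prefix), the `required=True` settings, and the example URIs are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser for the two config URIs named in the commit message."""
    parser = argparse.ArgumentParser(
        description="Run pre-training graph data analysis and emit an HTML report."
    )
    # Whether these are required, and their exact spelling, is an assumption.
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,
        help="URI of the analyzer YAML config (node/edge table definitions).",
    )
    parser.add_argument(
        "--resource_config_uri",
        required=True,
        help="URI of the resource config.",
    )
    return parser


def parse_cli(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None)."""
    return build_arg_parser().parse_args(argv)


args = parse_cli(
    [
        "--analyzer_config_uri", "gs://example-bucket/analyzer.yaml",
        "--resource_config_uri", "gs://example-bucket/resource.yaml",
    ]
)
```

Accepting `argv` as a parameter keeps the parser testable without patching `sys.argv`.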

Sits alongside SPEC.md to separate product requirements (why and what) from technical implementation contract (how). Both are AI-owned and together form the input for regenerating report.ai.html, charts.ai.js, and styles.ai.css.

Co-Authored-By: shubhamvij <svij@snapchat.com>

… 1-pager, engineering spec

Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation, cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18 production GNN papers with 100+ findings, common themes, and consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation plan that the analyzer code in this branch follows

Co-Authored-By: shubhamvij <svij@snapchat.com>

svij-sc and others added 14 commits April 17, 2026 20:25

feat(analytics): scaffold data_analyzer package structure

c079b9f

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add DataAnalyzerConfig with YAML loading and tests

3988493

Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>

fix(analytics): remove unused imports in config_test.py

cf69b38

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add result type dataclasses (DegreeStats, GraphAnaly…

8abae4a

…sisResult, FeatureProfileResult) Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add 18 SQL query templates for graph structure analysis

f1c7f52

Co-Authored-By: shubhamvij <svij@snapchat.com>

style(analytics): apply black formatter to test files

793190c

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add report SPEC.md and initial AI-owned HTML/JS/CSS …

0b01b5c

…assets Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add FeatureProfiler stub (TFDV/Dataflow integration …

42f8d78

…deferred) Co-Authored-By: shubhamvij <svij@snapchat.com>

fix(analytics): cast OmegaConf.to_object result in config_test

56eb170

Narrows the Union return type for mypy in the direct-merge test path. Co-Authored-By: shubhamvij <svij@snapchat.com>

style(analytics): apply isort and mdformat to data_analyzer files

7f387f6

Co-Authored-By: shubhamvij <svij@snapchat.com>

svij-sc requested review from kmontemayor2-sc, mkolodner-sc, nshah-sc, xgao4-sc, yliu2-sc and zfan3-sc as code owners April 17, 2026 23:51

svij-sc added 2 commits April 17, 2026 23:51

delete plans

d3f1eb8

svij-sc changed the title ~~feat(analytics): add BQ Data Analyzer for pre-training graph data analysis~~ Feature Analytics: Add Data Analyzer for pre-training graph data analysis Apr 18, 2026

svij-sc changed the title ~~Feature Analytics: Add Data Analyzer for pre-training graph data analysis~~ [WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis Apr 18, 2026

svij-sc wants to merge 16 commits into main from svij/easy-analyz-bq
