[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis #591
Open
Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>
…sisResult, FeatureProfileResult) Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes) raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).
Co-Authored-By: shubhamvij <svij@snapchat.com>
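The Tier 1 behaviour described in this commit (a hard-fail that still hands back whatever was computed) could look roughly like the sketch below. Only the name DataQualityError comes from the commit message; PartialResult, its fields, and run_checks are hypothetical stand-ins, and the counts are passed in directly rather than read from BQ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartialResult:
    """Hypothetical stand-in for the analyzer's result type."""

    dangling_edge_count: int = 0
    duplicate_node_count: int = 0
    degree_stats: Dict[str, bool] = field(default_factory=dict)


class DataQualityError(Exception):
    """Raised on a Tier 1 hard-fail; carries the partially populated result."""

    def __init__(self, message: str, partial_result: PartialResult) -> None:
        super().__init__(message)
        self.partial_result = partial_result


def run_checks(dangling_edges: int, duplicate_nodes: int) -> PartialResult:
    """Run Tier 1 checks first and compute Tier 2 metrics only if they pass."""
    result = PartialResult(
        dangling_edge_count=dangling_edges,
        duplicate_node_count=duplicate_nodes,
    )

    # Tier 1: collect every hard failure before raising, so the error
    # message and the partial result describe all problems at once.
    failures: List[str] = []
    if dangling_edges > 0:
        failures.append(f"{dangling_edges} dangling edges")
    if duplicate_nodes > 0:
        failures.append(f"{duplicate_nodes} duplicate nodes")
    if failures:
        raise DataQualityError("; ".join(failures), partial_result=result)

    # Tier 2 placeholder: real metrics would be filled in here.
    result.degree_stats = {"computed": True}
    return result
```

A caller can catch DataQualityError and still render or log `err.partial_result`, which is the point of attaching the result to the exception instead of discarding it.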
…assets Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the report_generator module that stitches AI-owned template,
styles, and chart JS into a single self-contained HTML report by
replacing the four INJECT_* placeholders. Adds a golden-file snapshot
test (and four structural tests) so future AI-driven edits to the
report assets fail fast until the snapshot is regenerated. Registers
the *.ai.{html,js,css} assets as package-data so importlib.resources
can resolve them from an installed wheel.
Co-Authored-By: shubhamvij <svij@snapchat.com>
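As a rough illustration of the placeholder stitching this commit describes, the sketch below replaces tokens in a template string and fails loudly when a token is missing. The token names (INJECT_TITLE, INJECT_STYLES, INJECT_DATA, INJECT_CHARTS_JS) and the render_report function are assumptions for illustration; the commit only says there are four INJECT_* placeholders. In the package the assets would presumably be read through importlib.resources; here they are inline strings so the example is self-contained.

```python
import json
from typing import Dict


def render_report(template: str, injections: Dict[str, str]) -> str:
    """Replace each placeholder token in the template with its content."""
    html = template
    for token, content in injections.items():
        # A missing token usually means the template and generator drifted apart.
        if token not in html:
            raise ValueError(f"placeholder {token!r} not found in template")
        html = html.replace(token, content)
    return html


# Hypothetical template and assets, standing in for the *.ai.* files.
template = (
    "<html><head><title>INJECT_TITLE</title>"
    "<style>INJECT_STYLES</style></head>"
    "<body><script>const DATA = INJECT_DATA;</script>"
    "<script>INJECT_CHARTS_JS</script></body></html>"
)
report = render_report(
    template,
    {
        "INJECT_TITLE": "Data Analyzer Report",
        "INJECT_STYLES": "body { font-family: sans-serif; }",
        "INJECT_DATA": json.dumps({"node_count": 3}),
        "INJECT_CHARTS_JS": "console.log(DATA.node_count);",
    },
)
```

Because the output is a single string with no external references, a golden-file test can compare it byte for byte against a stored snapshot.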
Implements the main orchestrator class that coordinates graph structure analysis, feature profiling, and HTML report generation. Includes CLI entry point with argparse for analyzer_config_uri and resource_config_uri. Co-Authored-By: shubhamvij <svij@snapchat.com>
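A minimal sketch of what such an argparse entry point might look like follows. The two argument names come from the commit message; the `--` flag form, marking both as required, the help strings, and the helper names build_parser and parse_args are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser with the two config URIs named in the commit."""
    parser = argparse.ArgumentParser(
        description="Analyze graph data before training (illustrative sketch)."
    )
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,  # assumption: the analyzer cannot run without it
        help="URI of the YAML analyzer config.",
    )
    parser.add_argument(
        "--resource_config_uri",
        required=True,  # assumption: may be optional in the actual CLI
        help="URI of the resource config.",
    )
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None)."""
    return build_parser().parse_args(argv)
```

Accepting an explicit argv makes the parser easy to exercise from unit tests without touching sys.argv.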
…deferred) Co-Authored-By: shubhamvij <svij@snapchat.com>
Narrows the Union return type for mypy in the direct-merge test path. Co-Authored-By: shubhamvij <svij@snapchat.com>
Sits alongside SPEC.md to separate product requirements (why and what) from technical implementation contract (how). Both are AI-owned and together form the input for regenerating report.ai.html, charts.ai.js, and styles.ai.css. Co-Authored-By: shubhamvij <svij@snapchat.com>
… 1-pager, engineering spec
Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation, cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18 production GNN papers with 100+ findings, common themes, and consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation plan that the analyzer code in this branch follows
Co-Authored-By: shubhamvij <svij@snapchat.com>
Summary
DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure, so engineers can diagnose training data issues in minutes instead of after a failed training run.

Changes
- gigl/analytics/data_analyzer/: config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
- gigl/analytics/data_analyzer/report/: PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
- tests/unit/analytics/data_analyzer/: 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
- tests/test_assets/analytics/: sample_analyzer_config.yaml + golden_report.html snapshot
- docs/plans/: design doc, literature review, 1-pager, engineering spec (all colocated)
- pyproject.toml: package-data declaration so .ai.* assets ship in installed wheels

Test plan
- uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
- make type_check → clean on 651 files
- make check_format → clean

v1 scope cuts (follow-up PRs)
- … GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.

Docs
- docs/plans/20260415-bq-data-analyzer.md
- docs/plans/20260415-bq-data-analyzer-references.md
- docs/plans/20260416-data-analyzer-1-pager.md
- docs/plans/20260416-data-analyzer-engineering-spec.md
- gigl/analytics/data_analyzer/report/PRD.md
- gigl/analytics/data_analyzer/report/SPEC.md