diff --git a/gigl/analytics/README.md b/gigl/analytics/README.md new file mode 100644 index 000000000..25ee7b3e5 --- /dev/null +++ b/gigl/analytics/README.md @@ -0,0 +1,188 @@ +# GiGL Analytics + +Pre-training graph data validation and analysis tooling. Use this module before committing to a GNN training run to +catch data quality and structural issues that silently degrade model quality. + +Two subpackages: + +- [`data_analyzer/`](data_analyzer/) — end-to-end `DataAnalyzer` that runs BigQuery checks and produces a single + self-contained HTML report. **Start here.** +- [`graph_validation/`](graph_validation/) — lightweight standalone validators (currently: `BQGraphValidator` for + dangling-edge checks). Use when you only need one check and not the full report. + +## Quickstart + +**Prerequisites.** Follow the [GiGL installation guide](../../docs/user_guide/getting_started/installation.md) so that +`uv` and GiGL's Python dependencies are available. Then authenticate to BigQuery: + +```bash +gcloud auth application-default login +``` + +**1. Write a YAML config.** Save as `my_analyzer_config.yaml`: + +```yaml +node_tables: + - bq_table: "your-project.your_dataset.user_nodes" + node_type: "user" + id_column: "user_id" + feature_columns: ["age", "country"] # optional; [] or omit if the node has no features + # label_column: "label" # optional; enables Tier 3 label checks + +edge_tables: + - bq_table: "your-project.your_dataset.user_edges" + edge_type: "follows" + src_id_column: "src_user_id" + dst_id_column: "dst_user_id" + +# Where to write the HTML report. Local path for quick iteration, or a gs:// URI. +output_gcs_path: "/tmp/my_analysis/" + +# Optional: sizing for the neighbor-explosion estimate (fan-out per GNN layer). +fan_out: [15, 10, 5] +``` + +**2. Run the analyzer.** + +```bash +uv run python -m gigl.analytics.data_analyzer \ + --analyzer_config_uri my_analyzer_config.yaml +``` + +**3. 
Open the report.** When the run completes: + +``` +[INFO] Report written to /tmp/my_analysis/report.html +``` + +Open the file in any browser. No server, no external dependencies, fully offline. + +## What it checks + +The analyzer organizes checks into four tiers. Tiers 1 and 2 always run; Tier 3 auto-enables when your config supports +it; Tier 4 is opt-in. + +| Tier | When | What it checks | +| ---------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **1. Hard fails** | Always | Dangling edges (NULL src/dst), referential integrity (edges pointing to nodes not in the node table), duplicate nodes. Raises `DataQualityError` — the report still renders to show partial results. | +| **2. Core metrics** | Always | Node/edge counts, degree distribution (in/out) with percentiles, degree buckets, top-K hubs, super-hub int16 clamp count, cold-start node count, self-loops, duplicate edges, NULL rates per column, feature memory budget estimate, neighbor-explosion estimate (requires `fan_out`). | +| **3. Label + heterogeneous** | Auto when `label_column` is set on any node table, or when multiple edge types exist | Class imbalance, label coverage, edge type distribution, per-edge-type node coverage. | +| **4. Advanced** | Opt-in via config flags | Power-law exponent (implemented as a degree-stats approximation). Reciprocity, homophily, connected components, clustering coefficient are **not yet implemented** — the flags are accepted but currently no-op. | + +The thresholds below come from a review of production GNN papers (PinSage, BLADE, LiGNN, TwHIN, AliGraph, GraphSMOTE, +Beyond Homophily, Feature Propagation, and others). 
See the inline citations in the threshold table for what each paper
contributes.

## Interpreting the report

The report color-codes every numeric finding. Summary of the most important thresholds:

| Metric | Green | Yellow | Red | What to do when yellow/red |
| -------------------------------------------------------- | ----- | ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Dangling edges / referential integrity / duplicate nodes | 0 | — | any > 0 | Fix the input tables. Training will fail or silently corrupt otherwise. |
| Feature missing rate | < 10% | 10–90% | > 90% | Plan an imputation strategy; above ~95% the Feature Propagation phase transition (Rossi et al., ICLR 2022) sets in and GNNs stop recovering signal reliably. |
| Isolated node fraction | < 1% | 1–5% | > 5% | Filter isolated nodes or densify (LiGNN, KDD 2024) for cold-start cohorts. |
| Cold-start fraction (degree 0–1) | < 5% | 5–10% | > 10% | Candidates for graph densification; also flag for special handling at serving time. |
| Super-hub int16 clamp (degree > 32,767) | 0 | — | any > 0 | GiGL silently truncates super-hub degrees in `gigl/distributed/utils/degree.py`. Either cap the hub's edges upstream or plan to address the clamp. |
| Degree p99 / median | < 50 | 50–100 | > 100 | Use importance sampling (PinSage, KDD 2018) or degree-adaptive neighborhoods (BLADE, WSDM 2023) — degree skew is the single biggest lever in production GNNs. |
| Class imbalance ratio | < 1:5 | 1:5 – 1:10 | > 1:10 | Message passing amplifies label imbalance 2–3× in representation space (GraphSMOTE, WSDM 2021). Consider resampling or GraphSMOTE-style synthetic nodes. |
| Edge homophily (Tier 4, future) | > 0.7 | 0.3 – 0.7 | < 0.3 | Standard GCN/GAT fail at low h (Zhu et al., NeurIPS 2020). Consider H2GCN-style architectures; below h ≈ 0.2 a plain MLP often wins.
| + +## Advanced config + +Optional YAML keys beyond the minimal quickstart: + +```yaml +# Enable Tier 3 class-imbalance + label-coverage checks for a node type: +node_tables: + - bq_table: ... + label_column: "label" + +# Neighbor explosion estimation — the fan-out per GNN layer you plan to train with: +fan_out: [15, 10, 5] + +# Tier 4 opt-in flags. Default false. +# NOTE: Only `compute_reciprocity` is wired into the analyzer today and it logs a +# warning rather than computing a result. The other three flags are placeholders +# for future work (see "Scope and limitations" below). +compute_reciprocity: true +compute_homophily: true +compute_connected_components: true +compute_clustering: true + +# Per-edge-type timestamp hint. NOTE: accepted by the config schema but not yet +# consumed by any Tier 4 query (temporal freshness check is planned). +edge_tables: + - bq_table: ... + timestamp_column: "created_at" +``` + +## Python API + +The CLI wraps a regular class. Call from your own code when you want programmatic access to the `GraphAnalysisResult`: + +```python +from gigl.analytics.data_analyzer import DataAnalyzer +from gigl.analytics.data_analyzer.config import load_analyzer_config + +config = load_analyzer_config("my_analyzer_config.yaml") +analyzer = DataAnalyzer() +report_path = analyzer.run(config=config) +# report_path points to the written report.html (local path or gs:// URI) +``` + +The underlying `GraphStructureAnalyzer` is also callable directly if you want the raw result dataclass and no HTML: + +```python +from gigl.analytics.data_analyzer.graph_structure_analyzer import GraphStructureAnalyzer + +result = GraphStructureAnalyzer().analyze(config) +print(result.degree_stats) +``` + +See a rendered report example at +[`tests/test_assets/analytics/golden_report.html`](../../tests/test_assets/analytics/golden_report.html) to preview the +output format before authenticating to BQ. 
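If you want to sanity-check a config before spending any BigQuery quota, the two Python-side Tier 2 estimates reduce to simple arithmetic: the feature memory budget assumes 8 bytes (float64) per feature value, and the neighbor-explosion estimate grows with the cumulative product of the per-layer `fan_out`. A minimal sketch under those stated assumptions; the helper names below are illustrative, not the module's API:

```python
# Illustrative sketch of the two Python-side Tier 2 estimates. Helper names
# are hypothetical; the real logic lives in graph_structure_analyzer.py.

BYTES_PER_FEATURE = 8  # analyzer default: float64 per feature column


def feature_memory_bytes(num_nodes: int, num_feature_columns: int) -> int:
    """Rough budget for holding all node features in memory."""
    return num_nodes * num_feature_columns * BYTES_PER_FEATURE


def neighbor_explosion(fan_out: list[int]) -> int:
    """Expected nodes touched per seed: the seed itself plus the cumulative
    product of per-layer fan-outs (15 + 15*10 + 15*10*5 for [15, 10, 5])."""
    total, layer_size = 1, 1
    for fan in fan_out:
        layer_size *= fan
        total += layer_size
    return total


print(feature_memory_bytes(1_000_000, 2))  # 16 MB of float64 features
print(neighbor_explosion([15, 10, 5]))     # 916 nodes per seed
```

A fan-out of `[15, 10, 5]` touching ~916 nodes per seed is why the report flags degree skew so aggressively: a single super-hub in layer one multiplies everything downstream.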
+ +## graph_validation + +One-off validators for the subset of cases where the full analyzer is overkill. Today the only check is dangling-edge +detection: + +```python +from gigl.analytics.graph_validation import BQGraphValidator + +has_dangling = BQGraphValidator.does_edge_table_have_dangling_edges( + edge_table="your-project.your_dataset.user_edges", + src_node_column_name="src_user_id", + dst_node_column_name="dst_user_id", +) +``` + +The `DataAnalyzer` runs this check (and many more) as part of Tier 1, so prefer the full analyzer unless you +specifically need a one-line gate (e.g., inside an Airflow task or a preprocessing job). This subpackage is the intended +home for additional standalone validators in the future. + +## Scope and limitations + +Current implementation status: + +- **FeatureProfiler is a stub.** The class is wired in but the TFDV/Dataflow pipeline that would produce FACETS HTML per + table is deferred to a follow-up PR. Calling it today logs a warning and returns an empty `FeatureProfileResult`. The + main report is fully functional without it. +- **Tier 4 checks are partial.** Power-law exponent is computed as a degree-stats approximation. Reciprocity, homophily, + connected components, and clustering coefficient config flags are accepted but currently no-op. The `timestamp_column` + edge field is accepted but no temporal-freshness query runs yet. +- **Heterogeneous graphs: referential integrity caveat.** For each edge table, the referential-integrity check joins + against `config.node_tables[0]`. On heterogeneous graphs where different edges reference different node types, the + current implementation will under-report integrity violations — fix is tracked for a follow-up. +- **GCS upload** works via `GcsUtils.upload_from_string` when `output_gcs_path` is a `gs://` URI, and falls back to + local filesystem write otherwise. 
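The output-path behavior in the last bullet follows a simple contract: trailing slashes are stripped, `report.html` is appended, and the `gs://` prefix alone selects the upload path. A sketch of that contract (`resolve_report_path` is an illustrative helper, not part of the module):

```python
def resolve_report_path(output_gcs_path: str) -> tuple[str, bool]:
    """Mirror the documented fallback: (full report path, is_gcs_upload).

    A gs:// prefix selects the GcsUtils upload path; anything else is
    treated as a local directory that will be created if missing.
    """
    trimmed = output_gcs_path.rstrip("/")
    return f"{trimmed}/report.html", trimmed.startswith("gs://")


print(resolve_report_path("gs://bucket/run1/"))  # ('gs://bucket/run1/report.html', True)
print(resolve_report_path("/tmp/my_analysis"))   # ('/tmp/my_analysis/report.html', False)
```

In particular, `output_gcs_path` is always treated as a directory, never a file name: pass `gs://bucket/run1/`, not `gs://bucket/run1/report.html`.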
+ +## Related documents + +Within this module: + +- [`data_analyzer/report/PRD.md`](data_analyzer/report/PRD.md) — product intent for the HTML report (AI-owned) +- [`data_analyzer/report/SPEC.md`](data_analyzer/report/SPEC.md) — technical contract for the AI-owned HTML/JS/CSS + assets diff --git a/gigl/analytics/data_analyzer/__init__.py b/gigl/analytics/data_analyzer/__init__.py new file mode 100644 index 000000000..45304dacc --- /dev/null +++ b/gigl/analytics/data_analyzer/__init__.py @@ -0,0 +1,10 @@ +""" +BQ Data Analyzer for pre-training graph data analysis. + +Produces a single HTML report covering data quality, feature distributions, +and graph structure metrics from BigQuery node/edge tables. +""" + +from gigl.analytics.data_analyzer.data_analyzer import DataAnalyzer + +__all__ = ["DataAnalyzer"] diff --git a/gigl/analytics/data_analyzer/__main__.py b/gigl/analytics/data_analyzer/__main__.py new file mode 100644 index 000000000..693551d33 --- /dev/null +++ b/gigl/analytics/data_analyzer/__main__.py @@ -0,0 +1,6 @@ +"""Entry point for running the BQ Data Analyzer as a module: python -m gigl.analytics.data_analyzer.""" + +from gigl.analytics.data_analyzer.data_analyzer import main + +if __name__ == "__main__": + main() diff --git a/gigl/analytics/data_analyzer/config.py b/gigl/analytics/data_analyzer/config.py new file mode 100644 index 000000000..c892edb0f --- /dev/null +++ b/gigl/analytics/data_analyzer/config.py @@ -0,0 +1,177 @@ +import re +from dataclasses import dataclass, field +from typing import Optional + +from omegaconf import MISSING, OmegaConf + +from gigl.common.logger import Logger + +logger = Logger() + +# BigQuery identifier regexes used to reject configs that would be interpolated +# directly into SQL. See https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical +# for the allowed grammar. Tables are of the form project.dataset.table; +# columns are simple unquoted identifiers. 
+_BQ_TABLE_REGEX = re.compile(r"^[A-Za-z0-9_.\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_$\-]+$") +_BQ_COLUMN_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$") + + +def _validate_bq_table(name: str, field_label: str) -> None: + if not _BQ_TABLE_REGEX.fullmatch(name): + raise ValueError( + f"{field_label}={name!r} is not a valid BigQuery table reference. " + f"Expected project.dataset.table with no backticks, whitespace, or quotes." + ) + + +def _validate_bq_column(name: str, field_label: str) -> None: + if not _BQ_COLUMN_REGEX.fullmatch(name): + raise ValueError( + f"{field_label}={name!r} is not a valid BigQuery column identifier. " + f"Expected [A-Za-z_][A-Za-z0-9_]* with no backticks, whitespace, or quotes." + ) + + +@dataclass +class NodeTableSpec: + """Specification for a node table in BigQuery.""" + + bq_table: str = MISSING + node_type: str = MISSING + id_column: str = MISSING + feature_columns: list[str] = field(default_factory=list) + label_column: Optional[str] = None + + +@dataclass +class EdgeTableSpec: + """Specification for an edge table in BigQuery. + + For heterogeneous graphs (more than one node table), src_node_type and + dst_node_type must be set to the node_type of the matching node table. + For homogeneous graphs (single node table) they default to that node_type. + """ + + bq_table: str = MISSING + edge_type: str = MISSING + src_id_column: str = MISSING + dst_id_column: str = MISSING + src_node_type: Optional[str] = None + dst_node_type: Optional[str] = None + feature_columns: list[str] = field(default_factory=list) + timestamp_column: Optional[str] = None + + +@dataclass +class DataAnalyzerConfig: + """Configuration for the BQ Data Analyzer. + + Parsed from YAML via OmegaConf. 
+ + Example: + >>> config = load_analyzer_config("gs://bucket/config.yaml") + >>> config.node_tables[0].bq_table + 'project.dataset.user_nodes' + """ + + node_tables: list[NodeTableSpec] = MISSING + edge_tables: list[EdgeTableSpec] = MISSING + output_gcs_path: str = MISSING + fan_out: Optional[list[int]] = None + compute_reciprocity: bool = False + compute_homophily: bool = False + compute_connected_components: bool = False + compute_clustering: bool = False + + +def _validate_and_backfill(config: DataAnalyzerConfig) -> None: + """Run identifier validation and backfill default node-type references. + + - Every bq_table must match project.dataset.table. + - Every id_column / src_id_column / dst_id_column / feature_column / + label_column / timestamp_column must be a bare BQ identifier. + - For homogeneous configs, an edge table with no src_node_type / + dst_node_type inherits the single node table's node_type. + - For heterogeneous configs, every edge table must explicitly declare + src_node_type and dst_node_type, and both must resolve to a known + node_type. 
+ """ + known_node_types = {nt.node_type for nt in config.node_tables} + single_node_type: Optional[str] = ( + next(iter(known_node_types)) if len(config.node_tables) == 1 else None + ) + + for node_table in config.node_tables: + _validate_bq_table(node_table.bq_table, "node_tables.bq_table") + _validate_bq_column(node_table.id_column, "node_tables.id_column") + for col in node_table.feature_columns: + _validate_bq_column(col, "node_tables.feature_columns") + if node_table.label_column is not None: + _validate_bq_column(node_table.label_column, "node_tables.label_column") + + for edge_table in config.edge_tables: + _validate_bq_table(edge_table.bq_table, "edge_tables.bq_table") + _validate_bq_column(edge_table.src_id_column, "edge_tables.src_id_column") + _validate_bq_column(edge_table.dst_id_column, "edge_tables.dst_id_column") + for col in edge_table.feature_columns: + _validate_bq_column(col, "edge_tables.feature_columns") + if edge_table.timestamp_column is not None: + _validate_bq_column( + edge_table.timestamp_column, "edge_tables.timestamp_column" + ) + + if edge_table.src_node_type is None: + if single_node_type is not None: + edge_table.src_node_type = single_node_type + else: + raise ValueError( + f"edge_type={edge_table.edge_type}: src_node_type is required " + f"when there are multiple node tables" + ) + if edge_table.dst_node_type is None: + if single_node_type is not None: + edge_table.dst_node_type = single_node_type + else: + raise ValueError( + f"edge_type={edge_table.edge_type}: dst_node_type is required " + f"when there are multiple node tables" + ) + if edge_table.src_node_type not in known_node_types: + raise ValueError( + f"edge_type={edge_table.edge_type}: src_node_type=" + f"{edge_table.src_node_type!r} is not a declared node_type. 
" + f"Known: {sorted(known_node_types)}" + ) + if edge_table.dst_node_type not in known_node_types: + raise ValueError( + f"edge_type={edge_table.edge_type}: dst_node_type=" + f"{edge_table.dst_node_type!r} is not a declared node_type. " + f"Known: {sorted(known_node_types)}" + ) + + +def load_analyzer_config(config_path: str) -> DataAnalyzerConfig: + """Load and validate a DataAnalyzerConfig from a YAML file. + + Args: + config_path: Local file path or GCS URI to the YAML config. + + Returns: + Validated DataAnalyzerConfig instance with node-type references + backfilled on edge tables. + + Raises: + omegaconf.errors.MissingMandatoryValue: If required fields are missing. + ValueError: If any bq_table or column name is not a valid BigQuery + identifier, or if a heterogeneous config is missing a required + src_node_type / dst_node_type. + """ + raw = OmegaConf.load(config_path) + merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw) + config: DataAnalyzerConfig = OmegaConf.to_object(merged) # type: ignore + _validate_and_backfill(config) + logger.info( + f"Loaded analyzer config with {len(config.node_tables)} node tables " + f"and {len(config.edge_tables)} edge tables" + ) + return config diff --git a/gigl/analytics/data_analyzer/data_analyzer.py b/gigl/analytics/data_analyzer/data_analyzer.py new file mode 100644 index 000000000..f8062fa56 --- /dev/null +++ b/gigl/analytics/data_analyzer/data_analyzer.py @@ -0,0 +1,137 @@ +"""Main orchestrator and CLI entry point for the BQ Data Analyzer.""" +import argparse +from concurrent.futures import ThreadPoolExecutor +from pathlib import Path +from typing import Optional + +from gigl.analytics.data_analyzer.config import DataAnalyzerConfig, load_analyzer_config +from gigl.analytics.data_analyzer.feature_profiler import FeatureProfiler +from gigl.analytics.data_analyzer.graph_structure_analyzer import ( + DataQualityError, + GraphStructureAnalyzer, +) +from 
gigl.analytics.data_analyzer.report.report_generator import generate_report +from gigl.analytics.data_analyzer.types import FeatureProfileResult, GraphAnalysisResult +from gigl.common import GcsUri, Uri, UriFactory +from gigl.common.logger import Logger +from gigl.common.utils.gcs import GcsUtils + +logger = Logger() + + +def _write_report(html: str, output_gcs_path: str) -> str: + """Write the HTML report to a GCS URI or local path. + + Args: + html: Rendered HTML string. + output_gcs_path: Output directory. If it starts with ``gs://`` the + report is uploaded via ``GcsUtils``. Otherwise it is written to + the local filesystem (the directory is created if missing). + + Returns: + The full path to the written ``report.html`` file. + """ + trimmed = output_gcs_path.rstrip("/") + report_path = f"{trimmed}/report.html" + if trimmed.startswith("gs://"): + GcsUtils().upload_from_string(GcsUri(report_path), html) + else: + local_path = Path(report_path).expanduser().resolve() + local_path.parent.mkdir(parents=True, exist_ok=True) + local_path.write_text(html) + report_path = str(local_path) + return report_path + + +class DataAnalyzer: + """Orchestrates graph structure analysis, feature profiling, and report generation. + + Example: + >>> from gigl.analytics.data_analyzer.config import load_analyzer_config + >>> config = load_analyzer_config("gs://bucket/config.yaml") + >>> analyzer = DataAnalyzer() + >>> report_path = analyzer.run(config=config) + """ + + def run( + self, + config: DataAnalyzerConfig, + resource_config_uri: Optional[Uri] = None, + ) -> str: + """Run the full analysis pipeline and write an HTML report. + + The report is written to ``{config.output_gcs_path}/report.html`` via + ``GcsUtils`` when the output path is a ``gs://`` URI, or to the local + filesystem otherwise (the parent directory is created if missing). + + Args: + config: Analyzer configuration. + resource_config_uri: Optional resource config for Dataflow sizing. 
+ + Returns: + The path to the written ``report.html`` (GCS URI or local path). + """ + structure_analyzer = GraphStructureAnalyzer() + feature_profiler = FeatureProfiler() + + with ThreadPoolExecutor(max_workers=2) as executor: + structure_future = executor.submit(structure_analyzer.analyze, config) + profile_future = executor.submit( + feature_profiler.profile, config, resource_config_uri + ) + + analysis_result: GraphAnalysisResult + try: + analysis_result = structure_future.result() + except DataQualityError as e: + logger.error(f"Tier 1 data quality failure: {e}") + analysis_result = e.partial_result + + profile_result: FeatureProfileResult + try: + profile_result = profile_future.result() + except Exception as e: + logger.exception(f"Feature profiler failed: {e}") + profile_result = FeatureProfileResult() + + html = generate_report( + analysis_result=analysis_result, + profile_result=profile_result, + config=config, + ) + + report_path = _write_report(html, config.output_gcs_path) + logger.info(f"Report written to {report_path}") + return report_path + + +def main() -> None: + """CLI entry point for the BQ Data Analyzer.""" + parser = argparse.ArgumentParser( + description="BQ Data Analyzer: analyze graph data in BigQuery before GNN training" + ) + parser.add_argument( + "--analyzer_config_uri", + required=True, + help="Path or GCS URI to the analyzer YAML config", + ) + parser.add_argument( + "--resource_config_uri", + required=False, + help="Path or GCS URI to the resource config for Dataflow sizing", + ) + args = parser.parse_args() + + config = load_analyzer_config(args.analyzer_config_uri) + resource_config_uri: Optional[Uri] = ( + UriFactory.create_uri(args.resource_config_uri) + if args.resource_config_uri + else None + ) + analyzer = DataAnalyzer() + report_path = analyzer.run(config=config, resource_config_uri=resource_config_uri) + logger.info(f"Report generated at: {report_path}") + + +if __name__ == "__main__": + main() diff --git 
a/gigl/analytics/data_analyzer/feature_profiler.py b/gigl/analytics/data_analyzer/feature_profiler.py new file mode 100644 index 000000000..e1227ac08 --- /dev/null +++ b/gigl/analytics/data_analyzer/feature_profiler.py @@ -0,0 +1,189 @@ +"""TFDV feature profiling via Beam/Dataflow. + +Launches one Dataflow pipeline per (node or edge) table that declares +``feature_columns`` in the analyzer config. Each pipeline reads the +selected columns from BigQuery, emits ``pa.RecordBatch`` batches, and +runs ``tfdv.GenerateStatistics`` to write a Facets HTML visualization +plus a TFDV stats TFRecord to GCS. + +Pipelines are launched concurrently using an internal +``ThreadPoolExecutor``; each worker blocks on +``p.run().wait_until_finish()`` for its table. Per-table exceptions are +logged and the failed table is omitted from the returned +``FeatureProfileResult`` - callers (and the HTML report) already handle +missing keys. +""" +from concurrent.futures import ThreadPoolExecutor, as_completed +from dataclasses import dataclass +from typing import Optional + +import apache_beam as beam + +from gigl.analytics.data_analyzer.config import DataAnalyzerConfig +from gigl.analytics.data_analyzer.types import FeatureProfileResult +from gigl.common import Uri, UriFactory +from gigl.common.beam.tfdv_transforms import ( + BqTableToRecordBatch, + GenerateAndVisualizeStats, +) +from gigl.common.logger import Logger +from gigl.env.pipelines_config import get_resource_config +from gigl.src.common.constants.components import GiGLComponents +from gigl.src.common.types import AppliedTaskIdentifier +from gigl.src.common.utils.dataflow import init_beam_pipeline_options + +logger = Logger() + +_PARALLEL_DATAFLOW_WORKERS = 10 +_APPLIED_TASK_IDENTIFIER = AppliedTaskIdentifier("data-analyzer") + + +@dataclass(frozen=True) +class _ProfileTask: + """One profiling unit: all features of a single node or edge table. 
+ + ``kind`` is ``"node"`` or ``"edge"`` (singular) and is used to build + the GCS output path and the result key (``"node:user"``, etc.). + """ + + kind: str + type_name: str + bq_table: str + feature_columns: list[str] + + @property + def result_key(self) -> str: + return f"{self.kind}:{self.type_name}" + + +class FeatureProfiler: + """Runs TFDV feature profiling on BQ tables via Dataflow. + + Example: + >>> profiler = FeatureProfiler() + >>> result = profiler.profile(config, resource_config_uri=uri) + >>> result.facets_html_paths["node:user"] + 'gs://bucket/analyzer/feature_profiler/nodes/user/facets.html' + """ + + def profile( + self, + config: DataAnalyzerConfig, + resource_config_uri: Optional[Uri] = None, + ) -> FeatureProfileResult: + """Run TFDV profiling on all tables with declared feature columns. + + Launches one Dataflow pipeline per table concurrently. Tables with + no ``feature_columns`` are skipped. Per-table failures are logged + and omitted from the result. + + Args: + config: Analyzer configuration with node and edge table specs. + resource_config_uri: Resource config for Dataflow sizing. + Required - TFDV profiling needs Dataflow. + + Returns: + ``FeatureProfileResult`` with GCS paths keyed by + ``"node:{type}"`` / ``"edge:{type}"``. Empty if no tables + declared feature columns. + + Raises: + ValueError: If ``resource_config_uri`` is None. + """ + if resource_config_uri is None: + raise ValueError( + "FeatureProfiler requires a resource_config_uri for Dataflow sizing. " + "Pass --resource_config_uri when invoking the DataAnalyzer CLI." + ) + # Eagerly populate the process-global resource config so that + # `init_beam_pipeline_options` (called on worker threads below) + # can resolve it without args. 
+ get_resource_config(resource_config_uri=resource_config_uri) + + tasks = _collect_profile_tasks(config) + if not tasks: + logger.info("No tables declared feature_columns; returning empty result.") + return FeatureProfileResult() + + logger.info(f"Launching {len(tasks)} Dataflow feature-profile job(s).") + result = FeatureProfileResult() + with ThreadPoolExecutor(max_workers=_PARALLEL_DATAFLOW_WORKERS) as executor: + future_to_task = { + executor.submit( + self._run_single_pipeline, task, config.output_gcs_path + ): task + for task in tasks + } + for future in as_completed(future_to_task): + task = future_to_task[future] + try: + facets_uri, stats_uri = future.result() + result.facets_html_paths[task.result_key] = facets_uri + result.stats_paths[task.result_key] = stats_uri + except Exception as exc: + logger.exception( + f"Feature profiling failed for {task.result_key} " + f"(table={task.bq_table}): {exc}" + ) + return result + + def _run_single_pipeline( + self, task: _ProfileTask, output_gcs_path: str + ) -> tuple[str, str]: + """Build, run, and block on a single table's Dataflow pipeline. + + Returns the ``(facets_uri, stats_uri)`` strings on success. 
+ """ + base = f"{output_gcs_path.rstrip('/')}/feature_profiler/{task.kind}s/{task.type_name}" + facets_uri = UriFactory.create_uri(f"{base}/facets.html") + stats_uri = UriFactory.create_uri(f"{base}/stats.tfrecord") + + options = init_beam_pipeline_options( + applied_task_identifier=_APPLIED_TASK_IDENTIFIER, + job_name_suffix=f"profile-{task.kind}-{task.type_name}", + component=GiGLComponents.DataAnalyzer, + ) + with beam.Pipeline(options=options) as p: + _ = ( + p + | f"Read {task.result_key} from BQ" + >> BqTableToRecordBatch( + bq_table=task.bq_table, + feature_columns=task.feature_columns, + ) + | f"Generate TFDV stats for {task.result_key}" + >> GenerateAndVisualizeStats( + facets_report_uri=facets_uri, + stats_output_uri=stats_uri, + ) + ) + logger.info(f"Finished feature profiling for {task.result_key}.") + return facets_uri.uri, stats_uri.uri + + +def _collect_profile_tasks(config: DataAnalyzerConfig) -> list[_ProfileTask]: + """Flatten the analyzer config into one ``_ProfileTask`` per table that + has non-empty ``feature_columns``. Tables without features are skipped. + """ + tasks: list[_ProfileTask] = [] + for node_table in config.node_tables: + if node_table.feature_columns: + tasks.append( + _ProfileTask( + kind="node", + type_name=node_table.node_type, + bq_table=node_table.bq_table, + feature_columns=list(node_table.feature_columns), + ) + ) + for edge_table in config.edge_tables: + if edge_table.feature_columns: + tasks.append( + _ProfileTask( + kind="edge", + type_name=edge_table.edge_type, + bq_table=edge_table.bq_table, + feature_columns=list(edge_table.feature_columns), + ) + ) + return tasks diff --git a/gigl/analytics/data_analyzer/graph_structure_analyzer.py b/gigl/analytics/data_analyzer/graph_structure_analyzer.py new file mode 100644 index 000000000..b8a97086f --- /dev/null +++ b/gigl/analytics/data_analyzer/graph_structure_analyzer.py @@ -0,0 +1,561 @@ +"""GraphStructureAnalyzer: 4-tier BigQuery-based graph data quality checks. 
+ +Tier 1 (hard fails) + dangling edges, referential integrity, duplicate nodes. Any violation + raises DataQualityError with a partially populated GraphAnalysisResult. + +Tier 2 (core metrics) + node/edge counts, degree distribution, top-K hubs, INT16 clamp hazards, + isolated/cold-start nodes, duplicate edges, self-loops, NULL rates, and + two Python-side computations (feature memory budget, neighbor explosion). + +Tier 3 (label and heterogeneous) + class imbalance and label coverage (auto-enabled when node_tables have a + label_column); edge-type distribution and per-edge-type node coverage + (auto-enabled when more than one edge table is declared). + +Tier 4 (opt-in) + reciprocity, power-law exponent estimate. Gated by config flags. +""" + +import math +from concurrent.futures import ThreadPoolExecutor +from typing import Optional + +from gigl.analytics.data_analyzer.config import ( + DataAnalyzerConfig, + EdgeTableSpec, + NodeTableSpec, +) +from gigl.analytics.data_analyzer.queries import ( + CLASS_IMBALANCE_QUERY, + COLD_START_NODE_COUNT_QUERY, + DANGLING_EDGES_QUERY, + DEGREE_BUCKET_QUERY, + DEGREE_DISTRIBUTION_QUERY, + DUPLICATE_EDGE_COUNT_QUERY, + DUPLICATE_NODE_COUNT_QUERY, + EDGE_COUNT_QUERY, + EDGE_REFERENTIAL_INTEGRITY_QUERY, + EDGE_TYPE_DISTRIBUTION_QUERY, + EDGE_TYPE_NODE_COVERAGE_QUERY, + ISOLATED_NODE_COUNT_QUERY, + LABEL_COVERAGE_QUERY, + NODE_COUNT_QUERY, + SELF_LOOP_COUNT_QUERY, + SUPER_HUB_INT16_CLAMP_QUERY, + TOP_K_HUBS_QUERY, + build_null_rates_query, +) +from gigl.analytics.data_analyzer.types import DegreeStats, GraphAnalysisResult +from gigl.common.logger import Logger +from gigl.src.common.utils.bq import BqUtils + +logger = Logger() + +# Default assumption for feature memory budget: float64 per feature column. +_BYTES_PER_FEATURE = 8 +_TOP_K_HUBS = 20 +_PARALLEL_BQ_WORKERS = 10 + + +class DataQualityError(Exception): + """Raised when Tier 1 hard-fail checks detect data quality violations. 
+ + Carries a partially populated GraphAnalysisResult so callers can inspect + which specific checks failed without re-running the analyzer. + """ + + def __init__(self, message: str, partial_result: GraphAnalysisResult) -> None: + super().__init__(message) + self.partial_result = partial_result + + +class GraphStructureAnalyzer: + """Runs BigQuery SQL checks across 4 tiers against the tables declared in a config. + + Example: + >>> config = load_analyzer_config("gs://bucket/config.yaml") + >>> analyzer = GraphStructureAnalyzer() + >>> result = analyzer.analyze(config) + >>> result.node_counts["user"] + 1000000 + + Tier 1 is blocking: a violation raises DataQualityError before Tiers 2-4 run. + Tiers 2-4 are aggregated best-effort into a single GraphAnalysisResult. + """ + + def __init__(self, bq_project: Optional[str] = None) -> None: + self._bq_utils = BqUtils(project=bq_project) + + def analyze(self, config: DataAnalyzerConfig) -> GraphAnalysisResult: + """Run all applicable tiers and return aggregated results. + + Args: + config: Data analyzer configuration declaring node and edge tables + plus any opt-in expensive checks (reciprocity, etc.). + + Returns: + GraphAnalysisResult with tier 1-4 fields populated per config. + + Raises: + DataQualityError: If tier 1 checks find any violations. The + exception carries a partial result with the specific counts. + """ + result = GraphAnalysisResult() + logger.info("Starting graph structure analysis (Tier 1: hard fails)") + self._run_tier1(config, result) + + logger.info("Tier 1 passed. 
Running Tier 2 (core metrics)") + self._run_tier2(config, result) + + logger.info("Running Tier 3 (label / heterogeneous)") + self._run_tier3(config, result) + + logger.info("Running Tier 4 (opt-in)") + self._run_tier4(config, result) + return result + + # ------------------------------------------------------------------ # + # Tier 1: hard fails # + # ------------------------------------------------------------------ # + + def _run_tier1( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + """Run all tier 1 checks; raise DataQualityError on any violation.""" + violations: list[str] = [] + node_tables_by_type = {nt.node_type: nt for nt in config.node_tables} + + # Duplicate nodes (per node table). + for node_table in config.node_tables: + query = DUPLICATE_NODE_COUNT_QUERY.format( + table=node_table.bq_table, id_column=node_table.id_column + ) + count = self._query_scalar(query, "duplicate_count") + result.duplicate_node_counts[node_table.node_type] = count + if count > 0: + violations.append( + f"node_type={node_table.node_type} has {count} duplicate IDs" + ) + + # Dangling edges and referential integrity (per edge table). + for edge_table in config.edge_tables: + dangling_query = DANGLING_EDGES_QUERY.format( + table=edge_table.bq_table, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ) + dangling = self._query_scalar(dangling_query, "dangling_count") + result.dangling_edge_counts[edge_table.edge_type] = dangling + if dangling > 0: + violations.append( + f"edge_type={edge_table.edge_type} has {dangling} dangling edges" + ) + + # Referential integrity: src and dst can resolve to different node + # tables on heterogeneous graphs. `load_analyzer_config` guarantees + # src_node_type / dst_node_type are populated and known. 
+ if not config.node_tables: + continue + assert edge_table.src_node_type is not None, ( + f"edge_type={edge_table.edge_type} has no src_node_type; " + "load the config via load_analyzer_config to backfill it." + ) + assert edge_table.dst_node_type is not None, ( + f"edge_type={edge_table.edge_type} has no dst_node_type; " + "load the config via load_analyzer_config to backfill it." + ) + src_node_table = node_tables_by_type[edge_table.src_node_type] + dst_node_table = node_tables_by_type[edge_table.dst_node_type] + ref_query = EDGE_REFERENTIAL_INTEGRITY_QUERY.format( + edge_table=edge_table.bq_table, + src_node_table=src_node_table.bq_table, + dst_node_table=dst_node_table.bq_table, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + src_node_id_column=src_node_table.id_column, + dst_node_id_column=dst_node_table.id_column, + ) + rows = list(self._bq_utils.run_query(query=ref_query, labels={})) + if len(rows) != 1: + raise RuntimeError( + f"Referential integrity query expected exactly 1 row; " + f"got {len(rows)}. 
Query: {ref_query.strip()[:200]}" + ) + missing_src = int(rows[0]["missing_src_count"] or 0) + missing_dst = int(rows[0]["missing_dst_count"] or 0) + total_missing = missing_src + missing_dst + result.referential_integrity_violations[ + edge_table.edge_type + ] = total_missing + if total_missing > 0: + violations.append( + f"edge_type={edge_table.edge_type} has {total_missing} " + "referential integrity violations" + ) + + if violations: + msg = "Tier 1 data quality violations detected:\n - " + "\n - ".join( + violations + ) + logger.error(msg) + raise DataQualityError(msg, partial_result=result) + + # ------------------------------------------------------------------ # + # Tier 2: core metrics # + # ------------------------------------------------------------------ # + + def _run_tier2( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + """Collect core structural metrics, fanning out BQ jobs in parallel. + + Edge-level metrics are computed from the src-side perspective: + isolated/cold-start joins pair each edge with its src_node_type's + table. Hetero dst-perspective coverage is exposed separately via + Tier 3 edge_type_node_coverage. + + BQ jobs are I/O-bound so ThreadPoolExecutor is used. Each worker + writes to distinct keys of the shared `result` dict (one key per + node_type / edge_type), so no lock is required under CPython's GIL. 
+ """ + node_tables_by_type = {nt.node_type: nt for nt in config.node_tables} + + with ThreadPoolExecutor(max_workers=_PARALLEL_BQ_WORKERS) as executor: + futures = [] + for node_table in config.node_tables: + futures.append( + executor.submit(self._tier2_node_metrics, node_table, result) + ) + for edge_table in config.edge_tables: + src_node_table = node_tables_by_type.get(edge_table.src_node_type or "") + futures.append( + executor.submit( + self._tier2_edge_metrics, edge_table, src_node_table, result + ) + ) + for future in futures: + future.result() # re-raise any exception + + # Python-side computations run after all BQ data is collected. + self._compute_feature_memory_budget(config, result) + self._compute_neighbor_explosion_estimate(config, result) + + def _tier2_node_metrics( + self, node_table: NodeTableSpec, result: GraphAnalysisResult + ) -> None: + node_count = self._query_scalar( + NODE_COUNT_QUERY.format(table=node_table.bq_table), "node_count" + ) + result.node_counts[node_table.node_type] = node_count + + columns_to_check: list[str] = [node_table.id_column] + columns_to_check.extend(node_table.feature_columns) + if node_table.label_column: + columns_to_check.append(node_table.label_column) + + null_query = build_null_rates_query( + table=node_table.bq_table, columns=columns_to_check + ) + rows = list(self._bq_utils.run_query(query=null_query, labels={})) + if rows: + row = rows[0] + rates: dict[str, float] = {} + for col in columns_to_check: + key = f"{col}_null_rate" + rate = row[key] + rates[col] = float(rate) if rate is not None else 0.0 + result.null_rates[node_table.node_type] = rates + + def _tier2_edge_metrics( + self, + edge_table: EdgeTableSpec, + node_table: Optional[NodeTableSpec], + result: GraphAnalysisResult, + ) -> None: + edge_type = edge_table.edge_type + + # Scalar counts. 
+ result.edge_counts[edge_type] = self._query_scalar( + EDGE_COUNT_QUERY.format(table=edge_table.bq_table), "edge_count" + ) + result.duplicate_edge_counts[edge_type] = self._query_scalar( + DUPLICATE_EDGE_COUNT_QUERY.format( + table=edge_table.bq_table, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ), + "duplicate_count", + ) + result.self_loop_counts[edge_type] = self._query_scalar( + SELF_LOOP_COUNT_QUERY.format( + table=edge_table.bq_table, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ), + "self_loop_count", + ) + + # Super-hub INT16 clamp check (indexed by src). + result.super_hub_int16_clamp_count[edge_type] = self._query_scalar( + SUPER_HUB_INT16_CLAMP_QUERY.format( + table=edge_table.bq_table, id_column=edge_table.src_id_column + ), + "super_hub_count", + ) + + # Isolated and cold-start require a node table join. + if node_table is not None: + result.isolated_node_counts[edge_type] = self._query_scalar( + ISOLATED_NODE_COUNT_QUERY.format( + node_table=node_table.bq_table, + edge_table=edge_table.bq_table, + node_id_column=node_table.id_column, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ), + "isolated_count", + ) + result.cold_start_node_counts[edge_type] = self._query_scalar( + COLD_START_NODE_COUNT_QUERY.format( + node_table=node_table.bq_table, + edge_table=edge_table.bq_table, + node_id_column=node_table.id_column, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ), + "cold_start_count", + ) + + # Top-K hubs (by src). + top_hub_rows = list( + self._bq_utils.run_query( + query=TOP_K_HUBS_QUERY.format( + table=edge_table.bq_table, + id_column=edge_table.src_id_column, + k=_TOP_K_HUBS, + ), + labels={}, + ) + ) + result.top_hubs[edge_type] = [ + (str(row["node_id"]), int(row["degree"])) for row in top_hub_rows + ] + + # Degree statistics: distribution + buckets, in + out directions. 
+        for direction, id_column in (
+            ("out", edge_table.src_id_column),
+            ("in", edge_table.dst_id_column),
+        ):
+            result.degree_stats[f"{edge_type}_{direction}"] = self._build_degree_stats(
+                table=edge_table.bq_table, id_column=id_column
+            )
+
+    def _build_degree_stats(self, table: str, id_column: str) -> DegreeStats:
+        """Run degree distribution + bucket queries and pack into DegreeStats."""
+        dist_rows = list(
+            self._bq_utils.run_query(
+                query=DEGREE_DISTRIBUTION_QUERY.format(
+                    table=table, id_column=id_column
+                ),
+                labels={},
+            )
+        )
+        bucket_rows = list(
+            self._bq_utils.run_query(
+                query=DEGREE_BUCKET_QUERY.format(table=table, id_column=id_column),
+                labels={},
+            )
+        )
+        dist_row = dist_rows[0]
+        bucket_row = bucket_rows[0]
+
+        # APPROX_QUANTILES returns NULL rather than an empty array when the
+        # table has no rows; fall back to [0] so the index lookups below are safe.
+        percentiles_raw = list(dist_row["percentiles"] or [])
+        percentiles = [int(p) if p is not None else 0 for p in percentiles_raw] or [0]
+        # APPROX_QUANTILES(degree, 100) returns 101 values: index 0..100.
+        median = percentiles[50] if len(percentiles) > 50 else 0
+        p90 = percentiles[90] if len(percentiles) > 90 else percentiles[-1]
+        p99 = percentiles[99] if len(percentiles) > 99 else percentiles[-1]
+        # We only have 100-bucket quantiles, so p999 ~= p99 as best-effort.
+        p999 = p99
+
+        # Bucket keys must match BUCKET_ORDER in report/charts.ai.js for the
+        # histogram to render correctly; keep uppercase K.
+ buckets: dict[str, int] = { + "0-1": int(bucket_row["bucket_0_1"]), + "2-10": int(bucket_row["bucket_2_10"]), + "11-100": int(bucket_row["bucket_11_100"]), + "101-1K": int(bucket_row["bucket_101_1k"]), + "1K-10K": int(bucket_row["bucket_1k_10k"]), + "10K+": int(bucket_row["bucket_10k_plus"]), + } + + return DegreeStats( + min=int(dist_row["min_degree"] or 0), + max=int(dist_row["max_degree"] or 0), + mean=float(dist_row["avg_degree"] or 0.0), + median=median, + p90=p90, + p99=p99, + p999=p999, + percentiles=percentiles, + buckets=buckets, + ) + + # ------------------------------------------------------------------ # + # Tier 3: label and heterogeneous # + # ------------------------------------------------------------------ # + + def _run_tier3( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + # Label-related checks per node table with a label column. + for node_table in config.node_tables: + if not node_table.label_column: + continue + class_rows = list( + self._bq_utils.run_query( + query=CLASS_IMBALANCE_QUERY.format( + table=node_table.bq_table, + label_column=node_table.label_column, + ), + labels={}, + ) + ) + result.class_imbalance[node_table.node_type] = { + str(row["label"]): int(row["count"]) for row in class_rows + } + + coverage_rows = list( + self._bq_utils.run_query( + query=LABEL_COVERAGE_QUERY.format( + table=node_table.bq_table, + label_column=node_table.label_column, + ), + labels={}, + ) + ) + if coverage_rows: + coverage = coverage_rows[0]["coverage"] + result.label_coverage[node_table.node_type] = ( + float(coverage) if coverage is not None else 0.0 + ) + + # Heterogeneous distribution only if more than one edge type. + if len(config.edge_tables) > 1: + for edge_table in config.edge_tables: + edge_type = edge_table.edge_type + # Edge-type distribution is effectively the edge count; reuse. 
+ if edge_type in result.edge_counts: + result.edge_type_distribution[edge_type] = result.edge_counts[ + edge_type + ] + else: + result.edge_type_distribution[edge_type] = self._query_scalar( + EDGE_TYPE_DISTRIBUTION_QUERY.format(table=edge_table.bq_table), + "edge_count", + ) + coverage_rows = list( + self._bq_utils.run_query( + query=EDGE_TYPE_NODE_COVERAGE_QUERY.format( + table=edge_table.bq_table, + src_id_column=edge_table.src_id_column, + dst_id_column=edge_table.dst_id_column, + ), + labels={}, + ) + ) + if coverage_rows: + row = coverage_rows[0] + result.edge_type_node_coverage[edge_type] = { + "distinct_src_count": int(row["distinct_src_count"] or 0), + "distinct_dst_count": int(row["distinct_dst_count"] or 0), + } + + # ------------------------------------------------------------------ # + # Tier 4: opt-in # + # ------------------------------------------------------------------ # + + def _run_tier4( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + """Populate opt-in metrics gated by config flags. + + Power-law exponent is always cheap (derived from existing degree stats) + and is computed whenever degree stats are available. Reciprocity, + homophily, connected components and clustering require dedicated + queries not yet defined; they remain empty unless the corresponding + flag is enabled AND a query is implemented. + """ + # Power-law exponent: approximate from degree stats using a simple + # heuristic: alpha ~= 1 + log(max) / log(median) for median > 1. + for degree_key, stats in result.degree_stats.items(): + if stats.median > 1 and stats.max > stats.median: + exponent = 1.0 + math.log(stats.max) / math.log(stats.median) + result.power_law_exponent[degree_key] = exponent + + if config.compute_reciprocity: + # Query not yet defined; log and skip. + logger.warning( + "compute_reciprocity=True but reciprocity query is not implemented; " + "skipping Tier 4 reciprocity." 
+ ) + + # ------------------------------------------------------------------ # + # Python-only computations # + # ------------------------------------------------------------------ # + + def _compute_feature_memory_budget( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + """Estimate per-node-type memory footprint of features (float64 assumed).""" + for node_table in config.node_tables: + node_count = result.node_counts.get(node_table.node_type, 0) + num_features = len(node_table.feature_columns) + result.feature_memory_bytes[node_table.node_type] = ( + node_count * num_features * _BYTES_PER_FEATURE + ) + + def _compute_neighbor_explosion_estimate( + self, config: DataAnalyzerConfig, result: GraphAnalysisResult + ) -> None: + """Multiply fan-out factors and scale by out-degree mean per edge type.""" + if not config.fan_out: + return + fan_out_product = 1 + for hop in config.fan_out: + fan_out_product *= int(hop) + for edge_table in config.edge_tables: + out_stats = result.degree_stats.get(f"{edge_table.edge_type}_out") + if out_stats is None: + continue + estimate = int(fan_out_product * max(out_stats.mean, 1.0)) + result.neighbor_explosion_estimate[edge_table.edge_type] = estimate + + # ------------------------------------------------------------------ # + # Helpers # + # ------------------------------------------------------------------ # + + def _query_scalar(self, query: str, column: str) -> int: + """Run a single-row, single-column query and return the scalar as int. + + Scalar queries (COUNT, COUNTIF) must return exactly one row with a + non-NULL value for the requested column. Any deviation indicates a + driver, auth, or schema mismatch rather than legitimate data — raise + loudly instead of silently coercing to 0, which would let a broken run + pass through as a green-light result. 
+ """ + rows = list(self._bq_utils.run_query(query=query, labels={})) + if len(rows) != 1: + raise RuntimeError( + f"Scalar query expected exactly 1 row; got {len(rows)}. " + f"Query: {query.strip()[:200]}" + ) + value = rows[0][column] + if value is None: + raise RuntimeError( + f"Scalar query returned NULL for column '{column}'. " + f"Query: {query.strip()[:200]}" + ) + return int(value) diff --git a/gigl/analytics/data_analyzer/queries.py b/gigl/analytics/data_analyzer/queries.py new file mode 100644 index 000000000..3243d2727 --- /dev/null +++ b/gigl/analytics/data_analyzer/queries.py @@ -0,0 +1,189 @@ +"""SQL query templates for graph structure analysis. + +Each constant is a format-string template parameterized with table names +and column names. Pattern matches gigl/src/data_preprocessor/lib/enumerate/queries.py. +""" + +import torch + +INT16_MAX = int(torch.iinfo(torch.int16).max) # 32767 + +# --- Tier 1: Hard fails --- + +DANGLING_EDGES_QUERY = """ +SELECT COUNT(*) AS dangling_count +FROM `{table}` +WHERE {src_id_column} IS NULL OR {dst_id_column} IS NULL +""" + +EDGE_REFERENTIAL_INTEGRITY_QUERY = """ +SELECT + COUNTIF(src_node.{src_node_id_column} IS NULL) AS missing_src_count, + COUNTIF(dst_node.{dst_node_id_column} IS NULL) AS missing_dst_count +FROM `{edge_table}` AS e +LEFT JOIN `{src_node_table}` AS src_node + ON e.{src_id_column} = src_node.{src_node_id_column} +LEFT JOIN `{dst_node_table}` AS dst_node + ON e.{dst_id_column} = dst_node.{dst_node_id_column} +""" + +DUPLICATE_NODE_COUNT_QUERY = """ +SELECT COUNT(*) AS duplicate_count FROM ( + SELECT {id_column} + FROM `{table}` + GROUP BY {id_column} + HAVING COUNT(*) > 1 +) +""" + +# --- Tier 2: Core metrics --- + +NODE_COUNT_QUERY = """ +SELECT COUNT(*) AS node_count FROM `{table}` +""" + +EDGE_COUNT_QUERY = """ +SELECT COUNT(*) AS edge_count FROM `{table}` +""" + +DUPLICATE_EDGE_COUNT_QUERY = """ +SELECT COUNT(*) AS duplicate_count FROM ( + SELECT {src_id_column}, {dst_id_column} + FROM `{table}` + 
GROUP BY {src_id_column}, {dst_id_column} + HAVING COUNT(*) > 1 +) +""" + +SELF_LOOP_COUNT_QUERY = """ +SELECT COUNT(*) AS self_loop_count +FROM `{table}` +WHERE {src_id_column} = {dst_id_column} +""" + +ISOLATED_NODE_COUNT_QUERY = """ +SELECT COUNT(*) AS isolated_count FROM ( + SELECT n.{node_id_column} + FROM `{node_table}` AS n + LEFT JOIN `{edge_table}` AS e_src + ON n.{node_id_column} = e_src.{src_id_column} + LEFT JOIN `{edge_table}` AS e_dst + ON n.{node_id_column} = e_dst.{dst_id_column} + WHERE e_src.{src_id_column} IS NULL + AND e_dst.{dst_id_column} IS NULL +) +""" + +DEGREE_DISTRIBUTION_QUERY = """ +SELECT + MIN(degree) AS min_degree, + MAX(degree) AS max_degree, + AVG(degree) AS avg_degree, + APPROX_QUANTILES(degree, 100) AS percentiles +FROM ( + SELECT {id_column}, COUNT(*) AS degree + FROM `{table}` + GROUP BY {id_column} +) +""" + +DEGREE_BUCKET_QUERY = """ +SELECT + COUNTIF(degree BETWEEN 0 AND 1) AS bucket_0_1, + COUNTIF(degree BETWEEN 2 AND 10) AS bucket_2_10, + COUNTIF(degree BETWEEN 11 AND 100) AS bucket_11_100, + COUNTIF(degree BETWEEN 101 AND 1000) AS bucket_101_1k, + COUNTIF(degree BETWEEN 1001 AND 10000) AS bucket_1k_10k, + COUNTIF(degree > 10000) AS bucket_10k_plus +FROM ( + SELECT {id_column}, COUNT(*) AS degree + FROM `{table}` + GROUP BY {id_column} +) +""" + +TOP_K_HUBS_QUERY = """ +SELECT {id_column} AS node_id, COUNT(*) AS degree +FROM `{table}` +GROUP BY {id_column} +ORDER BY degree DESC +LIMIT {k} +""" + +SUPER_HUB_INT16_CLAMP_QUERY = f""" +SELECT COUNT(*) AS super_hub_count FROM ( + SELECT {{id_column}}, COUNT(*) AS degree + FROM `{{table}}` + GROUP BY {{id_column}} + HAVING COUNT(*) > {INT16_MAX} +) +""" + +COLD_START_NODE_COUNT_QUERY = """ +SELECT COUNT(*) AS cold_start_count FROM ( + SELECT n.{node_id_column}, COALESCE(e.degree, 0) AS degree + FROM `{node_table}` AS n + LEFT JOIN ( + SELECT nid, COUNT(*) AS degree FROM ( + SELECT {src_id_column} AS nid FROM `{edge_table}` + UNION ALL + SELECT {dst_id_column} AS nid FROM 
`{edge_table}` + ) + GROUP BY nid + ) AS e ON n.{node_id_column} = e.nid + WHERE COALESCE(e.degree, 0) <= 1 +) +""" + +# --- Tier 3: Label and heterogeneous --- + +CLASS_IMBALANCE_QUERY = """ +SELECT {label_column} AS label, COUNT(*) AS count +FROM `{table}` +WHERE {label_column} IS NOT NULL +GROUP BY {label_column} +ORDER BY count DESC +""" + +LABEL_COVERAGE_QUERY = """ +SELECT + COUNT(*) AS total, + COUNTIF({label_column} IS NOT NULL) AS labeled, + SAFE_DIVIDE(COUNTIF({label_column} IS NOT NULL), COUNT(*)) AS coverage +FROM `{table}` +""" + +EDGE_TYPE_DISTRIBUTION_QUERY = """ +SELECT COUNT(*) AS edge_count FROM `{table}` +""" + +EDGE_TYPE_NODE_COVERAGE_QUERY = """ +SELECT + APPROX_COUNT_DISTINCT({src_id_column}) AS distinct_src_count, + APPROX_COUNT_DISTINCT({dst_id_column}) AS distinct_dst_count +FROM `{table}` +""" + + +def build_null_rates_query(table: str, columns: list[str]) -> str: + """Build a batched NULL rates query for multiple columns. + + One query, one table scan, one COUNTIF per column. + + Args: + table: Fully qualified BQ table name. + columns: List of column names to check. + + Returns: + SQL query string. + """ + countif_clauses = ",\n ".join( + f"SAFE_DIVIDE(COUNTIF({col} IS NULL), COUNT(*)) AS {col}_null_rate" + for col in columns + ) + return f""" +SELECT + COUNT(*) AS total_rows, + {countif_clauses} +FROM `{table}` +""" diff --git a/gigl/analytics/data_analyzer/report/PRD.md b/gigl/analytics/data_analyzer/report/PRD.md new file mode 100644 index 000000000..43f5fc1e9 --- /dev/null +++ b/gigl/analytics/data_analyzer/report/PRD.md @@ -0,0 +1,170 @@ +# PRD: BQ Data Analyzer HTML Report + +## Status + +**AI-owned.** An AI agent reads this PRD together with the sibling `SPEC.md` and regenerates `report.ai.html`, +`charts.ai.js`, and `styles.ai.css` when the product intent or technical contract changes. This PRD describes *why* and +*what*; `SPEC.md` describes *how*. 
+ +## Problem + +Before training a GNN on graph data in BigQuery, engineers need a fast way to see whether the data is healthy enough to +train on. Today they find out only after a Dataflow job crashes or a trainer produces a poor model, which costs days and +thousands of dollars per iteration. + +A review of 18 production GNN papers ([reference doc](../../../docs/plans/20260415-bq-data-analyzer-references.md)) +found that graph-specific data properties drive 30-230% model quality differences. None of these are caught by standard +tabular data quality tools. We need a report that surfaces these graph-specific issues in a form engineers can act on in +minutes, not days. + +## Users + +| Persona | Primary need | Frequency | +| ---------------------------------------- | ------------------------------------------------------------------------- | -------------------------- | +| **GNN engineer running an applied task** | Decide whether a new BQ dataset is trainable, and if not, what to fix | Per new dataset or refresh | +| **Applied task reviewer / tech lead** | Sanity-check a teammate's dataset choices before approving a training run | Per PR | +| **On-call engineer** | Triage why a training run degraded vs last week | Per incident | + +Out of scope: data scientists doing generic exploratory data analysis, product managers, non-technical stakeholders. + +## User Stories + +1. **As a GNN engineer**, I point the analyzer at a new BQ node/edge table pair and open the resulting HTML report. + Within 30 seconds of scrolling I know whether the dataset has any training-blocking issues (dangling edges, + referential integrity, duplicates). +2. **As a GNN engineer**, I inspect the degree distribution histogram for each edge type and decide whether my planned + fan-out is realistic or will cause neighbor explosion. +3. **As a reviewer**, I share the GCS link to the report in a PR comment. My teammate opens it in a browser without + installing anything. +4. 
**As an on-call engineer**, I run the analyzer on today's data and last week's data and diff the two reports to see
+   what changed.
+5. **As any of the above**, I collapse the sections I do not care about so the overview stays scannable.
+
+## Goals
+
+1. **Zero-setup viewing.** The report opens in any modern browser with no server, no CDN, no authentication beyond the
+   GCS link. Works offline once downloaded.
+2. **Action-oriented.** Every numeric finding is color-coded against a literature-derived threshold (green/yellow/red)
+   so the reader knows what to do about it.
+3. **Traceable.** Every color-coded threshold and every check cites the paper or codebase location that justifies it, so
+   readers can verify claims.
+4. **Portable.** A single `.html` file that can be shared in chat, stored indefinitely in GCS, and archived alongside
+   the training run it describes.
+5. **Graph-native.** Surfaces metrics that matter for GNNs specifically (degree distribution, super-hub int16 clamp,
+   cold-start fraction, homophily, neighbor explosion), not just generic tabular stats.
+6. **AI-regenerable.** The three `.ai.*` assets can be regenerated deterministically from this PRD plus `SPEC.md`
+   without human intervention on the HTML/JS/CSS.
+
+## Non-Goals
+
+- **Not a real-time monitoring dashboard.** Aegis covers that
+  ([Phase 2](../../../docs/plans/20260415-bq-data-analyzer.md#aegis-integration-phase-2)). This report is a
+  point-in-time snapshot.
+- **Not a BI tool.** No filtering, drill-down, or ad-hoc querying. The report is a rendered artifact, not an interactive
+  app.
+- **Not cross-dataset comparison.** Diffing reports is a user workflow (open two tabs), not a report feature.
+- **Not a model evaluation report.** This is about training data, not trained model performance.
+- **Not accessible (WCAG AA) in v1.** We document this gap and will address it if the report is used by users who need
+  it.
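Goal 2's color-coding reduces to a two-threshold mapping from a metric value to a traffic-light status. A minimal sketch — the function name is illustrative, the authoritative thresholds live in `SPEC.md`, and the 5% / 10% values are taken from FR-5's cold-start check, assuming 5% marks yellow and 10% marks red:

```python
def status_for(value: float, yellow_at: float, red_at: float) -> str:
    """Map a metric value onto the report's green/yellow/red scale.

    Illustrative sketch only: the real per-metric thresholds are
    defined in SPEC.md, not here.
    """
    if value >= red_at:
        return "red"
    if value >= yellow_at:
        return "yellow"
    return "green"


# FR-5 cold-start fraction, with the assumed 5% (yellow) / 10% (red) split:
print(status_for(0.07, yellow_at=0.05, red_at=0.10))  # prints: yellow
```

The mapping is deliberately monotone: a finding can only get worse as the metric grows, which keeps the single overview status light (FR-1) well-defined as the worst status across all checks.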
+ +## Functional Requirements + +Each requirement maps to a section of `SPEC.md` where the implementation contract lives. + +**FR-1: Overview at a glance.** The first screen (above the fold) shows total nodes, total edges, node/edge type counts, +and a single green/yellow/red status light summarizing the worst issue found. Rationale: engineers decide "do I need to +look deeper" in the first 5 seconds. + +**FR-2: Hard-fail visibility.** Dangling edges, referential integrity violations, and duplicate nodes render red +regardless of magnitude. These block training entirely. The report shows them prominently even if count is exactly one. +Rationale: [GiGL](../../../docs/plans/20260415-bq-data-analyzer-references.md#6-gigl), +[AliGraph (7.1)](../../../docs/plans/20260415-bq-data-analyzer-references.md#7-aligraph) — silent NaN propagation from +referential integrity violations is a production-documented failure mode. + +**FR-3: Degree distribution per edge type.** Inline SVG histogram using the six literature-aligned buckets: `0-1`, +`2-10`, `11-100`, `101-1K`, `1K-10K`, `10K+`. Separate in-degree and out-degree. Rationale: +[BLADE](../../../docs/plans/20260415-bq-data-analyzer-references.md#3-blade) showed 230% embedding improvement from +degree-adaptive neighborhoods; the reader needs to see which buckets dominate. + +**FR-4: Super-hub warning.** A red call-out appears when any node exceeds the GiGL int16 degree clamp (32,767). Include +the count and the affected edge type. Rationale: +[GiGL (6.2)](../../../docs/plans/20260415-bq-data-analyzer-references.md#6-gigl) — the clamp is silent in production and +corrupts PPR sampling probabilities. Users have no other way to discover this. + +**FR-5: Cold-start visibility.** Show the count and fraction of degree-0-1 nodes per type. Color-code the fraction +against the 5% / 10% threshold. 
Rationale: +[LiGNN (4.1)](../../../docs/plans/20260415-bq-data-analyzer-references.md#4-lignn) — +0.28% AUC from cold-start +densification; the reader decides whether densification is worth investigating. + +**FR-6: Optional Tier 3 visibility.** Class imbalance, label coverage, edge type distribution, and per-edge-type node +coverage are shown only when the input data supports them. Rationale: a report full of "not applicable" sections is +noise. + +**FR-7: Embedded FACETS.** When feature profiling is available, the FACETS HTML output is embedded inline via +`