
feat: add BigQuery partitioning and clustering to data_monitoring_metrics#972

Merged
haritamar merged 1 commit into master from devin/ELE-5300-1773666092
Mar 16, 2026

Conversation


devin-ai-integration bot commented Mar 16, 2026

Summary

Addresses high BigQuery scan costs on data_monitoring_metrics by adding partitioning (on bucket_end) and clustering (on full_table_name, metric_name). This aligns with the hot-path query filters on this table.

Changes:

  • get_partition_by(column="created_at") — Renamed from get_default_partition_by with a new column parameter. A backward-compatible alias get_default_partition_by is retained so existing top-level references keep working.
  • get_cluster_by(columns) — New macro using the same adapter dispatch pattern. Returns the columns list on BigQuery, none elsewhere. Controlled by a separate bigquery_disable_clustering config var.
  • data_monitoring_metrics — Now configured with partition_by=bucket_end and cluster_by=[full_table_name, metric_name].
  • dbt_run_results / dbt_invocations — Updated to call get_partition_by() (no behavior change; default column is still created_at).
  • Integration tests — Two new BigQuery-only tests verifying partition and clustering columns via INFORMATION_SCHEMA.
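The adapter dispatch pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual source: the macro names, the `bigquery_disable_clustering` var, and the dispatched `bigquery__`/`default__` variants come from this PR's description, while the exact macro bodies and the shape of the returned partition config (field/data type dict) are assumptions.

```sql
{# Sketch only -- bodies are illustrative, not the real implementation. #}
{% macro get_partition_by(column="created_at") %}
  {{ return(adapter.dispatch("get_partition_by", "elementary")(column)) }}
{% endmacro %}

{% macro bigquery__get_partition_by(column) %}
  {# Returned shape is an assumption; dbt-bigquery accepts a field/data_type dict. #}
  {{ return({"field": column, "data_type": "timestamp"}) }}
{% endmacro %}

{% macro default__get_partition_by(column) %}
  {{ return(none) }}
{% endmacro %}

{# Backward-compatible alias so existing top-level references keep working. #}
{% macro get_default_partition_by() %}
  {{ return(elementary.get_partition_by()) }}
{% endmacro %}

{% macro get_cluster_by(columns) %}
  {{ return(adapter.dispatch("get_cluster_by", "elementary")(columns)) }}
{% endmacro %}

{% macro bigquery__get_cluster_by(columns) %}
  {# Clustering can be switched off via the bigquery_disable_clustering config var. #}
  {% if elementary.get_config_var("bigquery_disable_clustering") %}
    {{ return(none) }}
  {% endif %}
  {{ return(columns) }}
{% endmacro %}

{% macro default__get_cluster_by(columns) %}
  {{ return(none) }}
{% endmacro %}
```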

Review & Testing Checklist for Human

  • Backward compatibility of adapter dispatch: The old dispatched names (bigquery__get_default_partition_by, default__get_default_partition_by) are replaced with bigquery__get_partition_by / default__get_partition_by. Users who have overridden the adapter-specific macros in their own dbt project will need to rename their overrides. Evaluate whether this is acceptable or if dispatched aliases are also needed.
  • --full-refresh migration: Existing data_monitoring_metrics tables will not gain partitioning/clustering until rebuilt with --full-refresh. Confirm this will be documented in release notes.
  • Verify BigQuery CI passes: The new integration tests (test_data_monitoring_metrics_partitioned, test_data_monitoring_metrics_clustered) can only be validated against a real BigQuery instance. Confirm they pass in CI.
  • Partition column choice: bucket_end was chosen over created_at because both hot-path queries filter on bucket_end. Verify this matches the actual query patterns in the anomaly detection macros.
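The INFORMATION_SCHEMA checks mentioned in the test items above can be approximated with queries like these. The project and dataset names are placeholders; the actual tests in `test_artifacts.py` may phrase the queries differently.

```sql
-- Partition column: BigQuery's INFORMATION_SCHEMA.COLUMNS flags it via is_partitioning_column.
select column_name
from `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = 'data_monitoring_metrics'
  and is_partitioning_column = 'YES';

-- Clustering columns, in their defined order.
select column_name
from `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = 'data_monitoring_metrics'
  and clustering_ordinal_position is not null
order by clustering_ordinal_position;
```

The first query should return `bucket_end`, and the second `full_table_name` followed by `metric_name`, if the model was built with the new config.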

Notes

  • The unique_key='id' merge strategy on data_monitoring_metrics is unaffected — partition key controls scan pruning only, not merge logic.
  • cluster_by is a no-op on non-BigQuery adapters (returns none), and partition_by continues to return none on non-BigQuery adapters as before.
  • Resolves ELE-5300.
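Putting the notes together, the config block on `data_monitoring_metrics` would look roughly like this sketch. The partition and cluster values and the `unique_key='id'` merge key are taken from this PR's description; the rest of the config block is assumed and may differ from the actual model file.

```sql
{{
  config(
    materialized="incremental",
    unique_key="id",
    partition_by=elementary.get_partition_by(column="bucket_end"),
    cluster_by=elementary.get_cluster_by(["full_table_name", "metric_name"]),
  )
}}
```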

Link to Devin session: https://app.devin.ai/sessions/b30f1f8be6ff4ca3b48b98fe0689c60b
Requested by: @haritamar

Summary by CodeRabbit

  • New Features

    • Added configurable table clustering for BigQuery with an option to disable it.
    • Enhanced table partitioning with improved parameterized configuration control.
  • Tests

    • Added tests to verify BigQuery partitioning and clustering behavior on data tables.

feat: add BigQuery partitioning and clustering to data_monitoring_metrics

- Rename get_default_partition_by -> get_partition_by(column='created_at')
  with a column parameter so models can specify their partition column.
  A backward-compatible alias get_default_partition_by is kept.
- Create new get_cluster_by(columns) macro with adapter dispatch pattern.
  Returns columns list on BigQuery, none on other adapters.
  Controlled by a separate bigquery_disable_clustering flag.
- Apply partition_by (on bucket_end) and cluster_by (full_table_name,
  metric_name) to data_monitoring_metrics.
- Update existing callers (dbt_run_results, dbt_invocations) to use
  the renamed get_partition_by macro.
- Add integration tests for data_monitoring_metrics partitioning and
  clustering on BigQuery.

Resolves ELE-5300

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


linear bot commented Mar 16, 2026

@github-actions

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.


coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

The changes introduce BigQuery clustering functionality through a new get_cluster_by macro with a configurable disable flag, refactor existing partition_by macros to accept explicit column parameters, and update models to use the new partitioning and clustering configurations. Integration tests verify BigQuery partitioning and clustering behavior on the data_monitoring_metrics table.

Changes

  • Partitioning and Clustering Macros (macros/utils/cross_db_utils/cluster_by.sql, macros/utils/cross_db_utils/partition_by.sql, macros/edr/system/system_utils/get_config_var.sql): Introduces the get_cluster_by macro with a BigQuery-specific implementation that respects a bigquery_disable_clustering config flag. Refactors the partition_by macros to accept an explicit column parameter via get_partition_by(column) and maintains backward compatibility through a get_default_partition_by() wrapper.
  • Model Configuration Updates (models/edr/data_monitoring/data_monitoring/data_monitoring_metrics.sql, models/edr/dbt_artifacts/dbt_invocations.sql, models/edr/dbt_artifacts/dbt_run_results.sql): Updates the incremental models to use the new get_partition_by() macro instead of the deprecated get_default_partition_by(). Adds partition_by and cluster_by configuration to data_monitoring_metrics, partitioning on bucket_end and clustering on full_table_name and metric_name.
  • Integration Tests (integration_tests/tests/test_dbt_artifacts/test_artifacts.py): Adds two BigQuery-specific tests, test_data_monitoring_metrics_partitioned and test_data_monitoring_metrics_clustered, that verify partitioning and clustering via INFORMATION_SCHEMA queries. Updates the context comment in an existing partition test.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop-hop, the clusters align,
Buckets and metrics in rows so fine,
BigQuery now dances with partitions neat,
A toggleable feature makes cleanup complete!
Tests verify the magic we've made,
In the garden of data, our seeds are laid. 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately summarizes the main change: adding BigQuery partitioning and clustering to data_monitoring_metrics, which is the primary focus of the changeset.



coderabbitai bot left a comment

🧹 Nitpick comments (1)
integration_tests/tests/test_dbt_artifacts/test_artifacts.py (1)

260-293: LGTM!

The new tests correctly verify BigQuery-specific partitioning and clustering behavior using INFORMATION_SCHEMA.COLUMNS. The query for clustering columns properly orders by clustering_ordinal_position, and assertions match the model configuration.

Minor note: Both tests perform full_refresh=True on data_monitoring_metrics. While this ensures test isolation (which is good), you could consider combining them into a single test to reduce CI runtime if BigQuery test runs are slow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines
260 - 293, Two tests both call
dbt_project.dbt_runner.run(select="data_monitoring_metrics", full_refresh=True)
causing duplicate full-refresh work; combine
test_data_monitoring_metrics_partitioned and
test_data_monitoring_metrics_clustered into a single test (e.g.,
test_data_monitoring_metrics_partitioning_and_clustering) that runs the
full_refresh once, then queries INFORMATION_SCHEMA.COLUMNS twice (once for
is_partitioning_column and once for clustering_ordinal_position ORDER BY
clustering_ordinal_position) and asserts both expected column lists so CI
runtime is reduced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59769bc7-f120-4483-b9f4-34aacffe98c0

📥 Commits

Reviewing files that changed from the base of the PR and between f820e11 and abefe28.

📒 Files selected for processing (7)
  • integration_tests/tests/test_dbt_artifacts/test_artifacts.py
  • macros/edr/system/system_utils/get_config_var.sql
  • macros/utils/cross_db_utils/cluster_by.sql
  • macros/utils/cross_db_utils/partition_by.sql
  • models/edr/data_monitoring/data_monitoring/data_monitoring_metrics.sql
  • models/edr/dbt_artifacts/dbt_invocations.sql
  • models/edr/dbt_artifacts/dbt_run_results.sql

haritamar merged commit 7a2b542 into master on Mar 16, 2026
28 checks passed
haritamar deleted the devin/ELE-5300-1773666092 branch on March 16, 2026 at 14:32