feat: add BigQuery partitioning and clustering to data_monitoring_metrics#972
feat: add BigQuery partitioning and clustering to data_monitoring_metrics#972
Conversation
…rics - Rename get_default_partition_by -> get_partition_by(column='created_at') with a column parameter so models can specify their partition column. A backward-compatible alias get_default_partition_by is kept. - Create new get_cluster_by(columns) macro with adapter dispatch pattern. Returns columns list on BigQuery, none on other adapters. Controlled by a separate bigquery_disable_clustering flag. - Apply partition_by (on bucket_end) and cluster_by (full_table_name, metric_name) to data_monitoring_metrics. - Update existing callers (dbt_run_results, dbt_invocations) to use the renamed get_partition_by macro. - Add integration tests for data_monitoring_metrics partitioning and clustering on BigQuery. Resolves ELE-5300 Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
|
👋 @devin-ai-integration[bot] |
📝 WalkthroughWalkthroughThe changes introduce BigQuery clustering functionality through a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
integration_tests/tests/test_dbt_artifacts/test_artifacts.py (1)
260-293: LGTM!The new tests correctly verify BigQuery-specific partitioning and clustering behavior using
INFORMATION_SCHEMA.COLUMNS. The query for clustering columns properly orders byclustering_ordinal_position, and assertions match the model configuration.Minor note: Both tests perform
full_refresh=Trueondata_monitoring_metrics. While this ensures test isolation (which is good), you could consider combining them into a single test to reduce CI runtime if BigQuery test runs are slow.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines 260 - 293, Two tests both call dbt_project.dbt_runner.run(select="data_monitoring_metrics", full_refresh=True) causing duplicate full-refresh work; combine test_data_monitoring_metrics_partitioned and test_data_monitoring_metrics_clustered into a single test (e.g., test_data_monitoring_metrics_partitioning_and_clustering) that runs the full_refresh once, then queries INFORMATION_SCHEMA.COLUMNS twice (once for is_partitioning_column and once for clustering_ordinal_position ORDER BY clustering_ordinal_position) and asserts both expected column lists so CI runtime is reduced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py`:
- Around line 260-293: Two tests both call
dbt_project.dbt_runner.run(select="data_monitoring_metrics", full_refresh=True)
causing duplicate full-refresh work; combine
test_data_monitoring_metrics_partitioned and
test_data_monitoring_metrics_clustered into a single test (e.g.,
test_data_monitoring_metrics_partitioning_and_clustering) that runs the
full_refresh once, then queries INFORMATION_SCHEMA.COLUMNS twice (once for
is_partitioning_column and once for clustering_ordinal_position ORDER BY
clustering_ordinal_position) and asserts both expected column lists so CI
runtime is reduced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 59769bc7-f120-4483-b9f4-34aacffe98c0
📒 Files selected for processing (7)
integration_tests/tests/test_dbt_artifacts/test_artifacts.pymacros/edr/system/system_utils/get_config_var.sqlmacros/utils/cross_db_utils/cluster_by.sqlmacros/utils/cross_db_utils/partition_by.sqlmodels/edr/data_monitoring/data_monitoring/data_monitoring_metrics.sqlmodels/edr/dbt_artifacts/dbt_invocations.sqlmodels/edr/dbt_artifacts/dbt_run_results.sql
Summary
Addresses high BigQuery scan costs on
data_monitoring_metricsby adding partitioning (onbucket_end) and clustering (onfull_table_name,metric_name). This aligns with the hot-path query filters on this table.Changes:
get_partition_by(column="created_at")— Renamed fromget_default_partition_bywith a newcolumnparameter. A backward-compatible aliasget_default_partition_byis retained so existing top-level references keep working.get_cluster_by(columns)— New macro using the same adapter dispatch pattern. Returns the columns list on BigQuery,noneelsewhere. Controlled by a separatebigquery_disable_clusteringconfig var.data_monitoring_metrics— Now configured withpartition_by=bucket_endandcluster_by=[full_table_name, metric_name].dbt_run_results/dbt_invocations— Updated to callget_partition_by()(no behavior change; default column is stillcreated_at).INFORMATION_SCHEMA.Review & Testing Checklist for Human
bigquery__get_default_partition_by,default__get_default_partition_by) are replaced withbigquery__get_partition_by/default__get_partition_by. Users who have overridden the adapter-specific macros in their own dbt project will need to rename their overrides. Evaluate whether this is acceptable or if dispatched aliases are also needed.--full-refreshmigration: Existingdata_monitoring_metricstables will not gain partitioning/clustering until rebuilt with--full-refresh. Confirm this will be documented in release notes.test_data_monitoring_metrics_partitioned,test_data_monitoring_metrics_clustered) can only be validated against a real BigQuery instance. Confirm they pass in CI.bucket_endwas chosen overcreated_atbecause both hot-path queries filter onbucket_end. Verify this matches the actual query patterns in the anomaly detection macros.Notes
unique_key='id'merge strategy ondata_monitoring_metricsis unaffected — partition key controls scan pruning only, not merge logic.cluster_byis a no-op on non-BigQuery adapters (returnsnone), andpartition_bycontinues to returnnoneon non-BigQuery adapters as before.Link to Devin session: https://app.devin.ai/sessions/b30f1f8be6ff4ca3b48b98fe0689c60b
Requested by: @haritamar
Summary by CodeRabbit
New Features
Tests