Optimize regexp_replace by stripping trailing .* from anchored patterns. 2.4x improvement #21379
Dandandan wants to merge 8 commits into apache:main
For anchored patterns like `^prefix(capture)/.*$`, use regex-syntax HIR analysis to build a shorter regex without the trailing `.*`.

Uses captures() + expand() on the shorter regex instead of replacen(), since the replacement replaces the entire string (original was ^...$) and we only need correct capture group positions.

For ClickBench Q28's `^https?://(?:www\.)?([^/]+)/.*$`, the effective regex becomes `^https?://(?:www\.)?([^/]+)/`, so the backtracker stops at the first `/` after the domain instead of scanning the full URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

run benchmarks

🤖 Benchmark running (GKE): comparing optimize-regexp-replace-v2 (62046ab) to c17c87c (merge-base) using tpch, tpcds, and clickbench_partitioned.

🤖 Benchmark completed (GKE): tpcds and clickbench_partitioned, with resource usage reported for base (merge-base) and branch.

run benchmarks

🤖 Benchmark running (GKE): comparing optimize-regexp-replace-v2 (61e3663) to c17c87c (merge-base) using clickbench_partitioned.

🤖 Benchmark running (GKE): comparing optimize-regexp-replace-v2 (ab44f24) to c17c87c (merge-base) using tpcds.

Split anchored ^prefix(capture)suffix.*$ patterns into separate prefix and content regexes (no capture groups). Uses two find() calls instead of captures() + expand(), avoiding capture-group tracking overhead and String allocation in the hot loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

🤖 Benchmark running (GKE): comparing optimize-regexp-replace-v2 (ab44f24) to c17c87c (merge-base) using tpch.

🤖 Benchmark completed (GKE): clickbench_partitioned and tpcds, with resource usage reported for base (merge-base) and branch.

…_read

Replace HIR-based regex splitting with a simple string check: strip trailing `.*$` from anchored patterns and use captures_read with pre-allocated CaptureLocations for direct extraction. Eliminates regex-syntax dependency, expand(), and String allocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
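
The "simple string check" in the commit message above could look roughly like the following. This is a minimal sketch using only the standard library, not the code from this PR: the function name `strip_trailing_dot_star` is hypothetical, and the escaped-dot guard is an assumption about what a safe check needs.

```rust
/// Hypothetical helper (not taken from the PR): returns the pattern without
/// its trailing `.*$` when the pattern is anchored at the start with `^`.
/// Returns `None` when the rewrite does not apply.
fn strip_trailing_dot_star(pattern: &str) -> Option<&str> {
    // Only patterns anchored at the start are candidates.
    if !pattern.starts_with('^') {
        return None;
    }
    // The pattern must end in the literal characters `.*$`.
    let prefix = pattern.strip_suffix(".*$")?;
    // Assumption: reject an escaped dot. An odd number of backslashes right
    // before the `.` means it is `\.` (a literal dot), not "any character".
    let backslashes = prefix.bytes().rev().take_while(|&b| b == b'\\').count();
    if backslashes % 2 == 1 {
        return None;
    }
    Some(prefix)
}
```

For the ClickBench pattern, `strip_trailing_dot_star(r"^https?://(?:www\.)?([^/]+)/.*$")` yields `^https?://(?:www\.)?([^/]+)/`. One caveat the sketch does not address: in the Rust `regex` crate, `.` does not match `\n` by default, so the original pattern fails on inputs with a newline after the matched prefix while the shortened pattern still matches. A real implementation would need to account for that.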

run benchmarks

🤖 Benchmark running (GKE): comparing optimize-regexp-replace-v2 (2bce86b) to c17c87c (merge-base) using clickbench_partitioned, tpcds, and tpch.

🤖 Benchmark completed (GKE): clickbench_partitioned and tpcds, with resource usage reported for base (merge-base) and branch.

Which issue does this PR close?

Performance optimization, no issue.

Rationale for this change

`regexp_replace` with anchored patterns like `^https?://(?:www\.)?([^/]+)/.*$` spends time scanning the trailing `.*$` and using `captures()` + `expand()` with `String` allocation on every row.

This query happens to benefit from the optimization (2.4x faster):

`SELECT regexp_replace(url, '^https?://(?:www\.)?([^/]+)/.*$', '\1')`

What changes are included in this PR?

- Strip the trailing `.*$` from the pattern string for anchored patterns where the replacement is `\1`
- Use `captures_read` with pre-allocated `CaptureLocations` for direct byte-slice extraction

Are these changes tested?

Yes, covered by existing `regexp_replace` unit tests, ClickBench sqllogictests, and the new URL domain extraction sqllogictest.

Are there any user-facing changes?

No.
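
To illustrate what "direct byte-slice extraction" buys, here is a small standard-library-only sketch of the per-row step. It is an illustration under stated assumptions, not the PR's code: `replace_with_group1` and `locate_domain` are hypothetical names, and `locate_domain` is a simplified hand-written stand-in for the regex match that would report the byte range of capture group 1 (the role `captures_read` plays in the PR).

```rust
/// Hypothetical per-row step: when the anchored pattern matches and the
/// replacement is `\1`, the result is just the slice covered by capture
/// group 1; otherwise the input is returned unchanged. No `String` is built.
fn replace_with_group1<'a, F>(input: &'a str, locate_group1: F) -> &'a str
where
    F: Fn(&str) -> Option<(usize, usize)>,
{
    match locate_group1(input) {
        // Borrow the captured bytes directly from the input.
        Some((start, end)) => &input[start..end],
        // No match: regexp_replace leaves the value as it was.
        None => input,
    }
}

/// Simplified stand-in for matching `^https?://(?:www\.)?([^/]+)/`.
/// It approximates the regex and only exists to keep the sketch runnable
/// without external crates.
fn locate_domain(url: &str) -> Option<(usize, usize)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);
    // Byte offset where the domain starts within the original string.
    let start = url.len() - rest.len();
    // The domain ends at the first `/`; nothing after it is scanned.
    let len = rest.find('/')?;
    if len == 0 {
        return None;
    }
    Some((start, start + len))
}
```

With these definitions, `replace_with_group1("https://www.example.com/a/b?c=d", locate_domain)` borrows `example.com` from the input, and a non-matching value such as `"not a url"` is returned as-is.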