perf: Optimize scalar fast path & write() encoding for sha2 #20116

kumarUjjawal · 2026-02-02T18:19:34Z

Which issue does this PR close?

Closes #20046 (Optimize spark sha2).

Rationale for this change

Spark sha2 currently evaluates scalars via make_scalar_function(sha2_impl, vec![]), which expands scalar inputs to size-1 arrays before execution. This adds avoidable overhead for scalar evaluation / constant folding scenarios.

In addition, the existing digest-to-hex formatting uses write!(&mut s, "{b:02x}") in a loop, which is significantly slower than a LUT-based hex encoder.

What changes are included in this PR?

This PR adds:

- a match-based scalar fast path for sha2 to avoid scalar→array expansion, and
- a faster LUT-based hex encoder to replace write! formatting.
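To make the second change concrete, here is a minimal sketch of the two encoding styles side by side. It is an illustration under assumptions, not the code merged in this PR: the function names `hex_encode_fmt` and `hex_encode_lut` are invented for the example, and only `HEX_CHARS` and the `write!` call mirror what the description and diff mention.

```rust
use std::fmt::Write;

/// Lookup table mapping a 4-bit nibble to its lowercase ASCII hex digit.
const HEX_CHARS: &[u8; 16] = b"0123456789abcdef";

/// Baseline (hypothetical name): format each byte with `write!`,
/// the style the description says was replaced.
fn hex_encode_fmt(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // Writing into a String cannot fail, so unwrapping is safe here.
        write!(&mut s, "{b:02x}").unwrap();
    }
    s
}

/// LUT variant (hypothetical name): two table lookups per byte,
/// with no formatting machinery involved.
fn hex_encode_lut(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]); // high nibble
        out.push(HEX_CHARS[(b & 0x0F) as usize]); // low nibble
    }
    // Every pushed byte is an ASCII hex digit, so this always succeeds.
    String::from_utf8(out).expect("hex digits are valid UTF-8")
}
```

Both functions should produce identical output for any input; the difference is only in how much work is done per byte.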

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| `sha2/scalar/size=1` | 1.0408 µs | 339.29 ns | ~3.07x |
| `sha2/array_binary_256/size=1024` | 604.13 µs | 295.09 µs | ~2.05x |
| `sha2/array_binary_256/size=4096` | 2.3508 ms | 1.2095 ms | ~1.94x |
| `sha2/array_binary_256/size=8192` | 4.5192 ms | 2.2826 ms | ~1.98x |

Are these changes tested?

Yes

Are there any user-facing changes?

No

Jefffrey

We should look into adding a fast path for when values is an array but the bit length is a scalar; I assume that would be another common use case.

datafusion/spark/src/function/hash/sha2.rs

comphead · 2026-02-03T02:18:34Z

datafusion/spark/src/function/hash/sha2.rs

+    let bytes = data.as_ref();
+    let mut out = Vec::with_capacity(bytes.len() * 2);
+    for &b in bytes {
+        out.push(HEX_CHARS[(b >> 4) as usize]);


This indexing might cause an extra bounds check in the LLVM output; it may be worth checking this variant as well:

    let hi = b >> 4;
    let lo = b & 0x0F;
    out.push(HEX_CHARS[hi as usize]);
    out.push(HEX_CHARS[lo as usize]);

rustc might be smart enough to do this rewrite by itself.

Thanks! That looks reasonable.

Co-authored-by: Jeffrey Vo <[email protected]>

datafusion/spark/benches/sha2.rs

martin-g · 2026-02-03T14:18:27Z

datafusion/spark/src/function/hash/sha2.rs

+        let hi = b >> 4;
+        let lo = b & 0x0F;
+        out.push(HEX_CHARS[hi as usize]);
+        out.push(HEX_CHARS[lo as usize]);


The hex crate is used in other datafusion-** crates as an optional dependency. It also uses SIMD to be even faster for bigger input.
Consider using it here too.

Thank you! I will look into this.

Since sha2 digests are fixed-size (28/32/48/64 bytes), the LUT approach is already quite fast. I'm not sure that adding the hex crate would help here. What do you think?

martin-g · 2026-02-03T14:19:36Z

datafusion/spark/src/function/hash/sha2.rs

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-        make_scalar_function(sha2_impl, vec![])(&args.args)
+        let values = &args.args[0];
+        let bit_lengths = &args.args[1];


Maybe use take_function_args() for consistency with other functions and better error handling?
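For context on the "better error handling" point: indexing `args.args[0]` and `args.args[1]` directly panics if fewer arguments arrive, whereas an unpacking helper can return a descriptive error. The sketch below is a hypothetical stand-in written for illustration only. `take_args` is an invented name and is not DataFusion's `take_function_args`, whose exact signature and error type should be checked against the DataFusion docs.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for an argument-unpacking helper.
/// Converts a Vec of arguments into a fixed-size array, or returns
/// an error message naming the function and the expected count.
fn take_args<T, const N: usize>(name: &str, args: Vec<T>) -> Result<[T; N], String> {
    let got = args.len();
    // `Vec<T>` converts into `[T; N]` only when the lengths match exactly;
    // on mismatch the original Vec is handed back as the error value.
    let converted: Result<[T; N], Vec<T>> = args.try_into();
    converted.map_err(|_| format!("{} expects {} argument(s), got {}", name, N, got))
}
```

With a helper of this shape, the call site can destructure the result, for example `let [values, bit_lengths] = take_args("sha2", args)?;`, so the argument count is validated in one place.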

Co-authored-by: Martin Grigorov <[email protected]>

comphead

Thanks @kumarUjjawal it is LGTM, I'll wait for other folks to confirm

I've restarted CI; the CI errors don't seem to be related to this PR.

kumarUjjawal · 2026-02-03T17:20:26Z

> Thanks @kumarUjjawal it is LGTM, I'll wait for other folks to confirm
>
> I've restarted CI, seems the CI errors are not PR related

Thank you!

Jefffrey · 2026-02-04T02:41:40Z

datafusion/spark/src/function/hash/sha2.rs

+where
+    BinaryArrType: BinaryArrayType<'a>,
+{
+    sha2_binary_bitlen_iter(values, std::iter::repeat(Some(bit_length)))


I was thinking along the lines of removing the match logic from the hot loop below when we know the bit length for all values. I think it will result in more verbose code, but it could be worth it for performance. We can look into this in a follow-up.
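A rough sketch of the idea in this comment, under stated assumptions: the per-row version re-resolves the bit length for every value, while the hoisted version resolves it once before the loop. Everything here is hypothetical. `fake_digest` is a placeholder that only produces output of the right length (it is not SHA-2), and `pick_hasher`, `hash_per_row`, and `hash_scalar_bitlen` are invented names. The mapping of 0 to 256 and of unsupported lengths to null follows Spark's documented sha2 behavior as I understand it.

```rust
/// Function pointer type for a digest routine.
type Hasher = fn(&[u8]) -> Vec<u8>;

/// Placeholder digest: NOT a real hash, it only returns N bytes.
fn fake_digest<const N: usize>(input: &[u8]) -> Vec<u8> {
    let seed = input.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
    vec![seed; N]
}

/// Resolve a bit length to a hasher (assumption: 0 means 256,
/// anything unsupported yields None, i.e. a null result).
fn pick_hasher(bit_length: i32) -> Option<Hasher> {
    match bit_length {
        224 => Some(fake_digest::<28> as Hasher),
        0 | 256 => Some(fake_digest::<32> as Hasher),
        384 => Some(fake_digest::<48> as Hasher),
        512 => Some(fake_digest::<64> as Hasher),
        _ => None,
    }
}

/// General path: the match on bit length runs inside the loop for every row.
fn hash_per_row<'a>(
    values: impl Iterator<Item = Option<&'a [u8]>>,
    bit_lengths: impl Iterator<Item = Option<i32>>,
) -> Vec<Option<Vec<u8>>> {
    values
        .zip(bit_lengths)
        .map(|(v, b)| match (v, b) {
            (Some(v), Some(b)) => pick_hasher(b).map(|h| h(v)),
            _ => None,
        })
        .collect()
}

/// Hoisted path: the bit length is known up front, so the dispatch
/// happens once and the loop body is a plain call.
fn hash_scalar_bitlen<'a>(
    values: impl Iterator<Item = Option<&'a [u8]>>,
    bit_length: i32,
) -> Vec<Option<Vec<u8>>> {
    match pick_hasher(bit_length) {
        Some(h) => values.map(|v| v.map(h)).collect(),
        None => values.map(|_| None).collect(),
    }
}
```

Passing `std::iter::repeat(Some(bit_length))` to `hash_per_row` gives the same results as `hash_scalar_bitlen`, which is roughly the trade-off discussed here: the `repeat` form reuses existing code, while the hoisted form is more verbose but keeps the dispatch out of the loop.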

Jefffrey · 2026-02-04T02:43:01Z

datafusion/spark/src/function/hash/sha2.rs

+                ColumnarValue::Scalar(value_scalar),
+                ColumnarValue::Scalar(ScalarValue::Int32(Some(bit_length))),
+            ) => {
+                if value_scalar.is_null() {


We should pull all null checks into a single branch at the top, e.g.

    match (values, bit_lengths) {
        (ColumnarValue::Scalar(s), _) | (_, ColumnarValue::Scalar(s)) if s.is_null() => {
            // return scalar null
        }

This means we'd only need 4 arms:

One arm checking if either is null

One arm for scalar + scalar

One arm for array + scalar

Catch all (array + array, scalar + array)

Should I make these changes in this PR, or will they go into the follow-up too?
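The four-arm layout proposed above can be sketched with a simplified model. This is an assumption-laden illustration, not the PR's code: `Column` is a hypothetical stand-in for `ColumnarValue`, and `Path` and `choose_path` are invented to show only which arm is selected, not what each arm computes.

```rust
/// Hypothetical, simplified stand-in for a scalar-or-array argument.
enum Column<T> {
    Scalar(Option<T>),
    Array(Vec<Option<T>>),
}

/// Which of the four proposed arms handles a given input pair.
#[derive(Debug, PartialEq)]
enum Path {
    NullResult,
    ScalarScalar,
    ArrayScalar,
    General,
}

fn choose_path(values: &Column<Vec<u8>>, bit_lengths: &Column<i32>) -> Path {
    match (values, bit_lengths) {
        // Arm 1: either side is a null scalar, so the result is null.
        (Column::Scalar(None), _) | (_, Column::Scalar(None)) => Path::NullResult,
        // Arm 2: scalar + scalar fast path.
        (Column::Scalar(Some(_)), Column::Scalar(Some(_))) => Path::ScalarScalar,
        // Arm 3: array values with a single known bit length.
        (Column::Array(_), Column::Scalar(Some(_))) => Path::ArrayScalar,
        // Arm 4: catch-all (array + array, scalar + array).
        _ => Path::General,
    }
}
```

Because the null arm comes first, the later arms never need their own null checks on scalar inputs, which is the simplification the comment is after.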

perf: Optimize scalar fast path & write() encoding for sha2

9b6d715

github-actions bot added the spark label Feb 2, 2026

fix clippy

1e3c875

Jefffrey reviewed Feb 3, 2026

comphead reviewed Feb 3, 2026

kumarUjjawal and others added 2 commits February 3, 2026 10:06

suggestion from jeffrey

4bc388b

Co-authored-by: Jeffrey Vo <[email protected]>

fast path array, and hex hi,lo check

fab5c2e

martin-g reviewed Feb 3, 2026

kumarUjjawal and others added 3 commits February 3, 2026 20:15

suggestion from martin-g

d6a80f1

Co-authored-by: Martin Grigorov <[email protected]>

use take_function_args

2afc263

fix clone issue

780799c

comphead approved these changes Feb 3, 2026

Merge branch 'main' into perf/sha_scalar_path

ea3be8e

Jefffrey approved these changes Feb 4, 2026

Merge branch 'main' into perf/sha_scalar_path

76266d1
