Skip to content

refactor(workflow-operator): extract shared Sklearn descriptor base#6048

Open
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base
Open

refactor(workflow-operator): extract shared Sklearn descriptor base#6048
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Add abstract SklearnModelOpDesc extends PythonOperatorDescriptor holding the four shared config fields (target, countVectorizer, text, tfidfTransformer) with their Jackson/JSON-schema annotations, the two @JsonIgnore hooks (getImportStatements, getUserFriendlyModelName), and the shared getOutputSchemas.
  • Change SklearnClassifierOpDesc and SklearnTrainingOpDesc to extend the new base, so each keeps only what genuinely differs (generatePythonCode, operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.
  • The 51 leaf subclasses inherit the fields unchanged, so the generated operator schema is identical (no behavior change).

Any related issues, documentation, discussions?

Closes: #6038

How was this PR tested?

  • Pure refactor with no behavior change; covered by existing SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inherited target/countVectorizer fields, and asserts on generatePythonCode() output. Run it in CI via sbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.
  • Verified locally with sbt "WorkflowOperator/Test/compile" (compiles clean; both descriptors resolve against SklearnModelOpDesc).
  • Note: sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec" could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendored JsonSchemaDraft.class on this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under the WorkflowOperator/jacoco job.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions github-actions Bot added refactor Refactor the code common labels Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @aglinxinyuan, @carloea2
    You can notify them by mentioning @aglinxinyuan, @carloea2 in a comment.

@codecov-commenter

codecov-commenter commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.63%. Comparing base (24b587f) to head (406b2ef).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ber/operator/sklearn/SklearnClassifierOpDesc.scala 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6048      +/-   ##
============================================
- Coverage     56.89%   56.63%   -0.27%     
- Complexity     3058     3059       +1     
============================================
  Files          1129     1127       -2     
  Lines         43802    43326     -476     
  Branches       4743     4669      -74     
============================================
- Hits          24922    24536     -386     
+ Misses        17449    17392      -57     
+ Partials       1431     1398      -33     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.59% <ø> (ø) Carriedforward from 16a3ddf
amber 58.66% <86.66%> (+0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (ø)
file-service 62.81% <ø> (ø)
frontend 49.39% <ø> (-0.74%) ⬇️ Carriedforward from 16a3ddf
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.51% <ø> (-0.64%) ⬇️ Carriedforward from 16a3ddf
python 90.69% <ø> (ø) Carriedforward from 16a3ddf
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 6 worse · ⚪ 7 noise (<±5%) · 0 without baseline

Compared against main 24b587f benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 365 0.222 25,474/44,057/44,057 us 🔴 +19.3% / 🔴 +188.2%
🔴 bs=100 sw=10 sl=64 777 0.474 129,690/142,726/142,726 us 🔴 +6.9% / 🔴 +31.7%
bs=1000 sw=10 sl=64 915 0.559 1,087,055/1,144,885/1,144,885 us ⚪ within ±5% / 🔴 +10.4%
Baseline details

Latest main 24b587f from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 365 tuples/sec 405 tuples/sec 770.95 tuples/sec -9.9% -52.7%
bs=10 sw=10 sl=64 MB/s 0.222 MB/s 0.247 MB/s 0.471 MB/s -10.1% -52.8%
bs=10 sw=10 sl=64 p50 25,474 us 23,027 us 12,775 us +10.6% +99.4%
bs=10 sw=10 sl=64 p95 44,057 us 36,915 us 15,286 us +19.3% +188.2%
bs=10 sw=10 sl=64 p99 44,057 us 36,915 us 18,795 us +19.3% +134.4%
bs=100 sw=10 sl=64 throughput 777 tuples/sec 805 tuples/sec 976.93 tuples/sec -3.5% -20.5%
bs=100 sw=10 sl=64 MB/s 0.474 MB/s 0.492 MB/s 0.596 MB/s -3.7% -20.5%
bs=100 sw=10 sl=64 p50 129,690 us 121,325 us 102,557 us +6.9% +26.5%
bs=100 sw=10 sl=64 p95 142,726 us 153,232 us 108,383 us -6.9% +31.7%
bs=100 sw=10 sl=64 p99 142,726 us 153,232 us 115,249 us -6.9% +23.8%
bs=1000 sw=10 sl=64 throughput 915 tuples/sec 921 tuples/sec 1,009 tuples/sec -0.7% -9.3%
bs=1000 sw=10 sl=64 MB/s 0.559 MB/s 0.562 MB/s 0.616 MB/s -0.5% -9.2%
bs=1000 sw=10 sl=64 p50 1,087,055 us 1,083,681 us 997,695 us +0.3% +9.0%
bs=1000 sw=10 sl=64 p95 1,144,885 us 1,119,007 us 1,036,731 us +2.3% +10.4%
bs=1000 sw=10 sl=64 p99 1,144,885 us 1,119,007 us 1,069,334 us +2.3% +7.1%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.65,200,128000,365,0.222,25473.54,44056.63,44056.63
1,100,10,64,20,2575.62,2000,1280000,777,0.474,129689.60,142726.00,142726.00
2,1000,10,64,20,21855.82,20000,12800000,915,0.559,1087054.65,1144884.51,1144884.51

@aglinxinyuan aglinxinyuan requested review from aglinxinyuan and Copilot and removed request for Copilot July 2, 2026 02:29

@JsonIgnore
def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"
override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.


@JsonIgnore
def getUserFriendlyModelName = "RandomForest Training"
override def getUserFriendlyModelName = "RandomForest Training"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common refactor Refactor the code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deduplicate Sklearn classifier and training operator descriptors via a shared base

3 participants