refactor(workflow-operator): extract shared Sklearn descriptor base by Ma77Ball · Pull Request #6048 · apache/texera

Ma77Ball · 2026-07-01T18:23:33Z

What changes were proposed in this PR?

Add abstract SklearnModelOpDesc extends PythonOperatorDescriptor holding the four shared config fields (target, countVectorizer, text, tfidfTransformer) with their Jackson/JSON-schema annotations, the two @JsonIgnore hooks (getImportStatements, getUserFriendlyModelName), and the shared getOutputSchemas.
Change SklearnClassifierOpDesc and SklearnTrainingOpDesc to extend the new base, so each keeps only what genuinely differs (generatePythonCode, operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.
The 51 leaf subclasses inherit the fields unchanged, so the generated operator schema is identical (no behavior change).

Any related issues, documentation, discussions?

Closes: #6038

How was this PR tested?

Pure refactor with no behavior change; covered by existing SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inherited target/countVectorizer fields, and asserts on generatePythonCode() output. Run it in CI via sbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.
Verified locally with sbt "WorkflowOperator/Test/compile" (compiles clean; both descriptors resolve against SklearnModelOpDesc).
Note: sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec" could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendored JsonSchemaDraft.class on this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under the WorkflowOperator/jacoco job.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

…ssifieropdesc and sklearntrainingopdesc

…-shared-base

github-actions · 2026-07-01T18:23:51Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @aglinxinyuan, @carloea2
You can notify them by mentioning @aglinxinyuan, @carloea2 in a comment.

codecov-commenter · 2026-07-01T18:26:44Z

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.63%. Comparing base (24b587f) to head (406b2ef).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...ber/operator/sklearn/SklearnClassifierOpDesc.scala	33.33%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #6048      +/-   ##
============================================
- Coverage     56.89%   56.63%   -0.27%     
- Complexity     3058     3059       +1     
============================================
  Files          1129     1127       -2     
  Lines         43802    43326     -476     
  Branches       4743     4669      -74     
============================================
- Hits          24922    24536     -386     
+ Misses        17449    17392      -57     
+ Partials       1431     1398      -33

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.59% <ø> (ø)`		Carriedforward from 16a3ddf
amber	`58.66% <86.66%> (+0.01%)`	⬆️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (ø)`
file-service	`62.81% <ø> (ø)`
frontend	`49.39% <ø> (-0.74%)`	⬇️	Carriedforward from 16a3ddf
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.51% <ø> (-0.64%)`	⬇️	Carriedforward from 16a3ddf
python	`90.69% <ø> (ø)`		Carriedforward from 16a3ddf
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-07-01T18:29:08Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 6 worse · ⚪ 7 noise (<±5%) · 0 without baseline

Compared against main 24b587f benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	365	0.222	25,474/44,057/44,057 us	🔴 +19.3% / 🔴 +188.2%
🔴	bs=100 sw=10 sl=64	777	0.474	129,690/142,726/142,726 us	🔴 +6.9% / 🔴 +31.7%
⚪	bs=1000 sw=10 sl=64	915	0.559	1,087,055/1,144,885/1,144,885 us	⚪ within ±5% / 🔴 +10.4%

Baseline details

Latest main 24b587f from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	365 tuples/sec	405 tuples/sec	770.95 tuples/sec	-9.9%	-52.7%
bs=10 sw=10 sl=64	MB/s	0.222 MB/s	0.247 MB/s	0.471 MB/s	-10.1%	-52.8%
bs=10 sw=10 sl=64	p50	25,474 us	23,027 us	12,775 us	+10.6%	+99.4%
bs=10 sw=10 sl=64	p95	44,057 us	36,915 us	15,286 us	+19.3%	+188.2%
bs=10 sw=10 sl=64	p99	44,057 us	36,915 us	18,795 us	+19.3%	+134.4%
bs=100 sw=10 sl=64	throughput	777 tuples/sec	805 tuples/sec	976.93 tuples/sec	-3.5%	-20.5%
bs=100 sw=10 sl=64	MB/s	0.474 MB/s	0.492 MB/s	0.596 MB/s	-3.7%	-20.5%
bs=100 sw=10 sl=64	p50	129,690 us	121,325 us	102,557 us	+6.9%	+26.5%
bs=100 sw=10 sl=64	p95	142,726 us	153,232 us	108,383 us	-6.9%	+31.7%
bs=100 sw=10 sl=64	p99	142,726 us	153,232 us	115,249 us	-6.9%	+23.8%
bs=1000 sw=10 sl=64	throughput	915 tuples/sec	921 tuples/sec	1,009 tuples/sec	-0.7%	-9.3%
bs=1000 sw=10 sl=64	MB/s	0.559 MB/s	0.562 MB/s	0.616 MB/s	-0.5%	-9.2%
bs=1000 sw=10 sl=64	p50	1,087,055 us	1,083,681 us	997,695 us	+0.3%	+9.0%
bs=1000 sw=10 sl=64	p95	1,144,885 us	1,119,007 us	1,036,731 us	+2.3%	+10.4%
bs=1000 sw=10 sl=64	p99	1,144,885 us	1,119,007 us	1,069,334 us	+2.3%	+7.1%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.65,200,128000,365,0.222,25473.54,44056.63,44056.63
1,100,10,64,20,2575.62,2000,1280000,777,0.474,129689.60,142726.00,142726.00
2,1000,10,64,20,21855.82,20000,12800000,915,0.559,1087054.65,1144884.51,1144884.51

aglinxinyuan · 2026-07-02T02:31:29Z


-  @JsonIgnore
-  def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"
+  override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"


This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.

aglinxinyuan · 2026-07-02T02:31:34Z


-  @JsonIgnore
-  def getUserFriendlyModelName = "RandomForest Training"
+  override def getUserFriendlyModelName = "RandomForest Training"


Ma77Ball added 2 commits July 1, 2026 10:45

refactor(sklearn): moved base class to shared file between sklearncla…

42ae372

…ssifieropdesc and sklearntrainingopdesc

Merge branch 'main' of github.com:apache/texera into refactor/sklearn…

fd17af0

…-shared-base

github-actions Bot assigned Ma77Ball Jul 1, 2026

github-actions Bot added refactor Refactor the code common labels Jul 1, 2026

aglinxinyuan requested review from aglinxinyuan and Copilot and removed request for Copilot July 2, 2026 02:29

aglinxinyuan approved these changes Jul 2, 2026

View reviewed changes

Merge branch 'main' into refactor/sklearn-shared-base

406b2ef

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(workflow-operator): extract shared Sklearn descriptor base#6048

refactor(workflow-operator): extract shared Sklearn descriptor base#6048
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base

Ma77Ball commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026

Uh oh!

codecov-commenter commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

aglinxinyuan Jul 2, 2026

Uh oh!

aglinxinyuan Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Ma77Ball commented Jul 1, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jul 1, 2026

Automated Reviewer Suggestions

Uh oh!

codecov-commenter commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

aglinxinyuan Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

aglinxinyuan Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading