refactor(workflow-operator): extract shared Sklearn descriptor base#6048
refactor(workflow-operator): extract shared Sklearn descriptor base#6048Ma77Ball wants to merge 3 commits into
Conversation
…ssifieropdesc and sklearntrainingopdesc
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6048 +/- ##
============================================
- Coverage 56.89% 56.63% -0.27%
- Complexity 3058 3059 +1
============================================
Files 1129 1127 -2
Lines 43802 43326 -476
Branches 4743 4669 -74
============================================
- Hits 24922 24536 -386
+ Misses 17449 17392 -57
+ Partials 1431 1398 -33
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 365 | 0.222 | 25,474/44,057/44,057 us | 🔴 +19.3% / 🔴 +188.2% |
| 🔴 | bs=100 sw=10 sl=64 | 777 | 0.474 | 129,690/142,726/142,726 us | 🔴 +6.9% / 🔴 +31.7% |
| ⚪ | bs=1000 sw=10 sl=64 | 915 | 0.559 | 1,087,055/1,144,885/1,144,885 us | ⚪ within ±5% / 🔴 +10.4% |
Baseline details
Latest main 24b587f from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 365 tuples/sec | 405 tuples/sec | 770.95 tuples/sec | -9.9% | -52.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.222 MB/s | 0.247 MB/s | 0.471 MB/s | -10.1% | -52.8% |
| bs=10 sw=10 sl=64 | p50 | 25,474 us | 23,027 us | 12,775 us | +10.6% | +99.4% |
| bs=10 sw=10 sl=64 | p95 | 44,057 us | 36,915 us | 15,286 us | +19.3% | +188.2% |
| bs=10 sw=10 sl=64 | p99 | 44,057 us | 36,915 us | 18,795 us | +19.3% | +134.4% |
| bs=100 sw=10 sl=64 | throughput | 777 tuples/sec | 805 tuples/sec | 976.93 tuples/sec | -3.5% | -20.5% |
| bs=100 sw=10 sl=64 | MB/s | 0.474 MB/s | 0.492 MB/s | 0.596 MB/s | -3.7% | -20.5% |
| bs=100 sw=10 sl=64 | p50 | 129,690 us | 121,325 us | 102,557 us | +6.9% | +26.5% |
| bs=100 sw=10 sl=64 | p95 | 142,726 us | 153,232 us | 108,383 us | -6.9% | +31.7% |
| bs=100 sw=10 sl=64 | p99 | 142,726 us | 153,232 us | 115,249 us | -6.9% | +23.8% |
| bs=1000 sw=10 sl=64 | throughput | 915 tuples/sec | 921 tuples/sec | 1,009 tuples/sec | -0.7% | -9.3% |
| bs=1000 sw=10 sl=64 | MB/s | 0.559 MB/s | 0.562 MB/s | 0.616 MB/s | -0.5% | -9.2% |
| bs=1000 sw=10 sl=64 | p50 | 1,087,055 us | 1,083,681 us | 997,695 us | +0.3% | +9.0% |
| bs=1000 sw=10 sl=64 | p95 | 1,144,885 us | 1,119,007 us | 1,036,731 us | +2.3% | +10.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,144,885 us | 1,119,007 us | 1,069,334 us | +2.3% | +7.1% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.65,200,128000,365,0.222,25473.54,44056.63,44056.63
1,100,10,64,20,2575.62,2000,1280000,777,0.474,129689.60,142726.00,142726.00
2,1000,10,64,20,21855.82,20000,12800000,915,0.559,1087054.65,1144884.51,1144884.51|
|
||
| @JsonIgnore | ||
| def getImportStatements = "from sklearn.ensemble import RandomForestClassifier" | ||
| override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier" |
There was a problem hiding this comment.
This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.
|
|
||
| @JsonIgnore | ||
| def getUserFriendlyModelName = "RandomForest Training" | ||
| override def getUserFriendlyModelName = "RandomForest Training" |
What changes were proposed in this PR?
SklearnModelOpDesc extends PythonOperatorDescriptorholding the four shared config fields (target,countVectorizer,text,tfidfTransformer) with their Jackson/JSON-schema annotations, the two@JsonIgnorehooks (getImportStatements,getUserFriendlyModelName), and the sharedgetOutputSchemas.SklearnClassifierOpDescandSklearnTrainingOpDescto extend the new base, so each keeps only what genuinely differs (generatePythonCode,operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.Any related issues, documentation, discussions?
Closes: #6038
How was this PR tested?
SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inheritedtarget/countVectorizerfields, and asserts ongeneratePythonCode()output. Run it in CI viasbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.sbt "WorkflowOperator/Test/compile"(compiles clean; both descriptors resolve againstSklearnModelOpDesc).sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec"could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendoredJsonSchemaDraft.classon this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under theWorkflowOperator/jacocojob.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF