feat: implemented tokenizer plugins in rust #5583
        fn create_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
            // Note: This is not the most efficient approach for repeated tokenization,
            // but it ensures thread safety and simplifies lifetime management.
            // For production use, consider caching the factory/tokenizer.
            let stream = PluginTokenStreamAdapter::new(Arc::clone(&self.library), &self.config, text);
            BoxTokenStream::new(stream)
        }
    }
Creating the factory is expensive because it loads a tokenizer plugin (~*MB).
Should we cache the factory only, or cache both the factory and the tokenizer?
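One possible shape for the "cache the factory only" option, as a minimal sketch rather than the PR's actual code: the factory is built lazily on first use and shared across clones of the tokenizer through an `Arc<OnceLock<...>>`. `PluginFactory`, `CachedPluginTokenizer`, and `FACTORY_LOADS` are hypothetical stand-ins, and the whitespace split only simulates what a real plugin would do.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Counts how often the expensive load runs (demonstration only).
pub static FACTORY_LOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the expensive plugin factory.
pub struct PluginFactory {
    lowercase: bool,
}

impl PluginFactory {
    /// Stands in for loading the plugin's dictionary or model.
    fn load(config: &str) -> Self {
        FACTORY_LOADS.fetch_add(1, Ordering::SeqCst);
        Self {
            lowercase: config.contains("lowercase"),
        }
    }

    /// Assumed to be cheap once the factory exists.
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace()
            .map(|t| if self.lowercase { t.to_lowercase() } else { t.to_string() })
            .collect()
    }
}

/// Cloneable tokenizer handle; all clones share one lazily built factory.
#[derive(Clone)]
pub struct CachedPluginTokenizer {
    config: String,
    factory: Arc<OnceLock<PluginFactory>>,
}

impl CachedPluginTokenizer {
    pub fn new(config: &str) -> Self {
        Self {
            config: config.to_string(),
            factory: Arc::new(OnceLock::new()),
        }
    }

    pub fn tokenize(&self, text: &str) -> Vec<String> {
        // The closure runs at most once, even if several threads race here.
        let factory = self
            .factory
            .get_or_init(|| PluginFactory::load(&self.config));
        factory.tokenize(text)
    }
}
```

With this layout, cloning the tokenizer (for example, once per indexing thread) does not reload the plugin. Whether the tokenizer instance itself can also be cached depends on whether the plugin's tokenizer is safe to share or reuse across streams, which this sketch does not assume.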
    // Plugin tokenizer is handled separately as it returns LanceTokenizer directly
    if self.base_tokenizer == "plugin" {
        return self.build_plugin_tokenizer();
    }
Since build_plugin_tokenizer() returns a LanceTokenizer, I added an early return instead of adding a new match condition in build_base_tokenizer(), which returns a TextAnalyzerBuilder.
Should I unify this branching logic?
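If unifying is preferred, one option is to have the base-building step return an enum that is either a builder still awaiting filters or an already finished tokenizer, so the plugin case becomes an ordinary match arm. The following is only a sketch with hypothetical stand-in types (`AnalyzerBuilder`, `FinishedTokenizer`, `BaseTokenizer`, `TokenizerConfig`), not the real `TextAnalyzerBuilder` or `LanceTokenizer`, and the `"simple"`/`"whitespace"` arms are illustrative.

```rust
/// Hypothetical stand-in for a builder that still accepts filters.
pub struct AnalyzerBuilder {
    pub base: String,
    pub filters: Vec<String>,
}

/// Hypothetical stand-in for a ready-to-use tokenizer.
#[derive(Debug)]
pub struct FinishedTokenizer {
    pub description: String,
}

/// Result of the base-building step: either more work is needed, or none.
pub enum BaseTokenizer {
    Builder(AnalyzerBuilder),
    Finished(FinishedTokenizer),
}

pub struct TokenizerConfig {
    pub base_tokenizer: String,
    pub lower_case: bool,
}

impl TokenizerConfig {
    /// Every base tokenizer, including the plugin, goes through one match.
    fn build_base(&self) -> Result<BaseTokenizer, String> {
        match self.base_tokenizer.as_str() {
            "simple" | "whitespace" => Ok(BaseTokenizer::Builder(AnalyzerBuilder {
                base: self.base_tokenizer.clone(),
                filters: Vec::new(),
            })),
            "plugin" => Ok(BaseTokenizer::Finished(FinishedTokenizer {
                description: "plugin".to_string(),
            })),
            other => Err(format!("unknown base tokenizer: {other}")),
        }
    }

    pub fn build(&self) -> Result<FinishedTokenizer, String> {
        match self.build_base()? {
            // The plugin variant skips the filter chain entirely.
            BaseTokenizer::Finished(tokenizer) => Ok(tokenizer),
            BaseTokenizer::Builder(builder) => {
                let mut parts = vec![builder.base];
                parts.extend(builder.filters);
                if self.lower_case {
                    parts.push("lowercase".to_string());
                }
                Ok(FinishedTokenizer {
                    description: parts.join("+"),
                })
            }
        }
    }
}
```

The trade-off is an extra enum in exchange for a single dispatch point. The early return in the PR is shorter and arguably fine while the plugin is the only tokenizer that bypasses the filter chain.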
related to #3222
Key changes are as follows:
- include/lance_tokenizer_plugin.h: add C API for the tokenizer plugin
- protos/index_old.proto: add two fields to restore the plugin tokenizer configuration
- rust/lance-index/src/scalar/inverted/tokenizer.rs and rust/lance-index/src/scalar/inverted/plugin/*: implement tokenizer loading
- rust/lance-index/examples/: add an example usage

While preparing this PR, I had two questions, which I left as review comments on the PR.