Skip to content

Conversation

@KazuhitoT
Copy link
Contributor

@KazuhitoT KazuhitoT commented Dec 26, 2025

related to #3222

Key changes are as bellows:

  • include/lance_tokenizer_plugin.h: add C API for the tokenizer plugin
  • protos/index_old.proto: add two fields to restore the plugin tokenizer configuration
  • rust/lance-index/src/scalar/inverted/tokenizer.rs and rust/lance-index/src/scalar/inverted/plugin/*: implement tokenizer loading
  • rust/lance-index/examples/: add an example usage

During the PR creation process, I had two questions and left comments in the PR.

@github-actions github-actions bot added the enhancement New feature or request label Dec 26, 2025
Comment on lines +48 to +55
fn create_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
// Note: This is not the most efficient approach for repeated tokenization,
// but it ensures thread safety and simplifies lifetime management.
// For production use, consider caching the factory/tokenizer.
let stream = PluginTokenStreamAdapter::new(Arc::clone(&self.library), &self.config, text);
BoxTokenStream::new(stream)
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating the factory is expensive because it loads a tokenizer plugin (~*MB).
Should we cache the factory only, or cache both the factory and the tokenizer?

Comment on lines +313 to +317
// Plugin tokenizer is handled separately as it returns LanceTokenizer directly
if self.base_tokenizer == "plugin" {
return self.build_plugin_tokenizer();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since build_plugin_tokenizer() returns a LanceTokenizer, I added an early return instead of adding a new match condition in build_base_tokenizer(), which returns a TextAnalyzerBuilder.
Should I unify this branching logic?

@codecov
Copy link

codecov bot commented Dec 26, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant