Support for ngram indexer #197
Closed
jomart1985 wants to merge 274 commits into
Alain Barbet <alian@amisw.com>
Improved the README file's build instructions
Update for boost-1.5
- Thanks to kojik1010.
Commit e3f8992 ported the code to use Boost.Filesystem V3 API, so the warning about not being able to use Boost 1.50+ is no longer true.
Remove outdated note about Boost incompatibility from the README.
I've observed crashes where one thread is in the middle of initializing ZZ_CMAP while another tries to use the partially initialized array. Use boost::once to ensure that only one thread handles the initialization and that no thread uses the data until it is fully initialized.
Fix races when initializing static arrays in StandardTokenizerImpl
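The once-only initialization described in this commit can be sketched as follows. This is a minimal illustration that uses `std::call_once` from C++11 in place of the Boost facility the commit refers to; the namespace, table size, and fill values are hypothetical stand-ins for `ZZ_CMAP`, not the actual StandardTokenizerImpl code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

namespace once_sketch {

// Hypothetical lookup table standing in for ZZ_CMAP.
constexpr std::size_t kTableSize = 256;
std::array<int, kTableSize> g_table;
std::once_flag g_tableOnce;
int g_initRuns = 0;  // counts how often the initializer actually ran

// Fills the table. It is only ever invoked through std::call_once.
void initTable() {
    for (std::size_t i = 0; i < kTableSize; ++i) {
        g_table[i] = static_cast<int>(i) * 2;
    }
    ++g_initRuns;
}

// Every reader goes through this accessor. std::call_once runs initTable
// exactly once and makes all other callers wait until it has finished, so
// no thread can observe a partially filled table.
const std::array<int, kTableSize>& table() {
    std::call_once(g_tableOnce, initTable);
    assert(g_initRuns == 1);
    return g_table;
}

}  // namespace once_sketch
```

The initializer takes no arguments because the table is reachable without being passed in, which is the same shape the later Boost < 1.54 compatibility commit describes for static member arrays.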
The argument should be named startOffset, not endOffset, otherwise the function is a no-op.
Use StringUtils::toString() before concatenating, or use operator<<, which is more type-safe.
Fix some issues identified using clang's static analysis
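The misnamed-argument issue is easy to miss in review, so here is a small sketch of what such a bug presumably looks like. The class, field, and method names are hypothetical and are not taken from the Lucene++ sources.

```cpp
#include <cassert>

// Hypothetical token type; names are illustrative only.
class OffsetSketch {
public:
    // Buggy shape (shown as a comment only): the parameter is misnamed, so
    // the right-hand side resolves to the member and the call does nothing.
    //   void setStartOffset(int endOffset) { this->startOffset = startOffset; }

    // Fixed shape: the parameter shadows the member, so the value is stored.
    void setStartOffset(int startOffset) { this->startOffset = startOffset; }

    int getStartOffset() const { return startOffset; }

private:
    int startOffset = 0;
};

// Small self-check exercising the fixed setter.
void checkOffsetSketch() {
    OffsetSketch token;
    token.setStartOffset(7);
    assert(token.getStartOffset() == 7);
}
```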
Add support for compiling with -std=c++11.
Before 1.54, there was no support for variadic calls to boost::call_once(), so make the arrays static members to avoid the need to pass them to the static init methods.
Fix compatibility with Boost versions before 1.54
8628278 broke compilation due to a typo (boost:call_once instead of boost::call_once). Additionally, VC++ compilation with precompiled header was broken, because LuceneInc.h must be included as the very first header.
There was a typo in the output expression, appending a number to a string instead of concatenating them as intended.
Fix accidental use of operator+ instead of operator<<.
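For readers unfamiliar with this class of bug, the sketch below shows why `+` between a string literal and a number does not concatenate, together with two safe alternatives. The function names and the `doc=` prefix are hypothetical, and `std::to_wstring` stands in for Lucene++'s `StringUtils::toString()`.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Buggy shape (comment only): adding an integer to a string literal is
// pointer arithmetic on the literal, not concatenation.
//   out << L"doc=" + docId;

// Fix 1: chain operator<< so the stream formats the number itself.
std::wstring describeWithStream(int docId) {
    std::wostringstream out;
    out << L"doc=" << docId;
    return out.str();
}

// Fix 2: convert the number to a string explicitly before concatenating.
std::wstring describeWithConcat(int docId) {
    return std::wstring(L"doc=") + std::to_wstring(docId);
}

// Small self-check: both fixes produce the same text.
void checkDescribe() {
    assert(describeWithStream(42) == L"doc=42");
    assert(describeWithConcat(42) == L"doc=42");
}
```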
Lucene++ keeps paths around as wide strings, but uses narrow char APIs (e.g. std::ifstream) when accessing files, using conversion to UTF-8 to get char* strings. This is correct on OS X and usually(!) correct on modern Unix systems, but is completely wrong on Windows, which _never_ uses UTF-8 for filenames.

Fix this using boost::filesystem classes (path and streams) and appropriate conversions. In one place, use a Windows-specific workaround to deal with lack of wide char boost API.

In particular:
- Use boost::filesystem::*fstream classes that accept Unicode paths.
- Use boost::filesystem::(w)path for manual conversion when needed.
- When using char* only API (interprocess::file_lock), use GetShortPathName() as a workaround.
Fix incorrect paths handling on Windows.
Boost.System has been header only since Boost 1.69.0
Handle file enumeration exceptions in FileUtils::listDirectory
Fix build with new CMake
Fix typo in MAX_VARINT32_LENGTH constant in BufferedIndexInput.cpp
Update DefaultSimilarity.cpp
Fix old comment about C++ standard
BitSet: Partial fix for Boost 1.90
Use conditional compilation to support both old and new Boost.Bind API:
- Boost >= 1.73.0: Use boost/bind/bind.hpp
- Boost < 1.73.0: Use boost/bind.hpp
This approach maintains backward compatibility while fixing deprecation warnings in newer Boost versions.
Use new Boost.Bind API to fix deprecation warnings
Also remove Boost_SYSTEM_LIBRARIES, removed in #219
Several tests have custom mock classes. Unfortunately these frequently have identical names across tests, which creates problems when building with LTO, as everything is merged into a single test executable. GCC rightfully complains about this, since classes with the same name are assumed to have the same shape, and that is just not true here. Therefore rename the mock classes with the initials of the containing test. With this the entire test suite compiles and passes when built with LTO.
While trying to verify tests previously excluded in Gentoo (gentoo/gentoo@b9d1c7a) I noticed that ParallelMultiSearcherTest & SortTest would work, but hang in ~ThreadPool() on threadGroup.join_all(), preventing the test executable from terminating cleanly. Stopping the io_context makes join_all() work immediately.
Stop io_context before joining threads in ThreadPool destructor
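The stop-before-join ordering can be illustrated without Boost. In the sketch below a condition-variable work queue stands in for `boost::asio::io_context` and a vector of `std::thread` stands in for the thread group; the class and its members are hypothetical and are not the Lucene++ `ThreadPool`. If the destructor skipped the stop step, idle workers would stay blocked in `run()` and the joins would never return.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature pool, for illustration only.
class MiniThreadPool {
public:
    explicit MiniThreadPool(std::size_t threadCount) {
        for (std::size_t i = 0; i < threadCount; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~MiniThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;  // plays the role of stopping the io_context
        }
        wake_.notify_all();   // wake every worker blocked in run()
        for (std::thread& worker : workers_) {
            worker.join();    // returns because each worker saw stopped_
        }
    }

    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

    std::size_t executed() const { return executed_.load(); }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
                if (stopped_) {
                    return;  // in this sketch, pending tasks are not run once stopped
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            ++executed_;
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopped_ = false;
    std::atomic<std::size_t> executed_{0};
    std::vector<std::thread> workers_;
};

// Small self-check: run one task, then let the destructor stop and join.
void checkMiniThreadPool() {
    MiniThreadPool pool(2);
    pool.post([] {});
    while (pool.executed() < 1) {
        std::this_thread::yield();
    }
    assert(pool.executed() == 1);
}
```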
Use unique class names for inner test mock classes
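A small sketch of the renaming scheme for test mocks follows. The test names (`FooTest`, `BarTest`) and the mock classes are hypothetical; only the idea of prefixing each mock with the initials of its containing test comes from the commit message.

```cpp
#include <cassert>
#include <vector>

// Before the rename, both imaginary tests would have defined a class called
// MockCollector with different layouts. Two different definitions of the
// same class name in one executable violate the one-definition rule, which
// is what the compiler reports once LTO sees every test together.

// FooTest's mock only counts hits.
struct FTMockCollector {
    int hits = 0;
    void collect(int /*doc*/) { ++hits; }
};

// BarTest's mock records the document ids it was given.
struct BTMockCollector {
    std::vector<int> docs;
    void collect(int doc) { docs.push_back(doc); }
};

// Small self-check: the two mocks coexist and behave independently.
void checkUniqueMocks() {
    FTMockCollector counting;
    BTMockCollector recording;
    counting.collect(7);
    recording.collect(7);
    assert(counting.hits == 1);
    assert(recording.docs.size() == 1 && recording.docs[0] == 7);
}
```

Wrapping each test's mocks in an unnamed namespace would also avoid the clash by giving them internal linkage; the commit describes renaming, so that is what is sketched here.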
1. Added NGramAnalyzer, NGramTokenFilter, and NGramTokenizer classes for n-gram text analysis
2. Implemented configurable min/max gram sizes with validation
3. Added preserve original token option to NGramTokenFilter
4. Included comprehensive test cases for all new components
Update dependencies.cmake for new boost 1.90 without system-libraries
Add N-Gram analyzer components
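To make the min/max gram size idea concrete, here is a minimal standalone sketch of character n-gram generation with argument validation. `makeNGrams` is a hypothetical helper and not the API of the NGramTokenizer added in this commit. The emission order (all grams of one size before the next size) and the exact validation rules are assumptions, and the preserve-original option is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of `text` whose length lies
// in [minGram, maxGram], shortest grams first, left to right within a size.
std::vector<std::wstring> makeNGrams(const std::wstring& text,
                                     std::size_t minGram,
                                     std::size_t maxGram) {
    // Assumed validation rules: sizes start at 1 and min must not exceed max.
    if (minGram < 1) {
        throw std::invalid_argument("minGram must be at least 1");
    }
    if (minGram > maxGram) {
        throw std::invalid_argument("minGram must not exceed maxGram");
    }

    std::vector<std::wstring> grams;
    for (std::size_t n = minGram; n <= maxGram; ++n) {
        if (n > text.size()) {
            break;  // no gram of this size or larger fits in the text
        }
        for (std::size_t start = 0; start + n <= text.size(); ++start) {
            grams.push_back(text.substr(start, n));
        }
    }
    return grams;
}

// Small self-check: "abc" with sizes 1..2 yields a, b, c, ab, bc.
void checkNGrams() {
    const std::vector<std::wstring> grams = makeNGrams(L"abc", 1, 2);
    const std::vector<std::wstring> expected = {L"a", L"b", L"c", L"ab", L"bc"};
    assert(grams == expected);
}
```

Indexing these grams instead of whole words is what lets a search match on fragments of a term, at the cost of a larger index.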
Closed by #228
I hope you can add support for an ngram indexer.