|
ICU4X
International Components for Unicode
|
#include <WordSegmenter.d.hpp>
Public Member Functions | |
| std::unique_ptr< icu4x::WordBreakIteratorUtf8 > | segment (std::string_view input) const |
| std::unique_ptr< icu4x::WordBreakIteratorUtf16 > | segment16 (std::u16string_view input) const |
| std::unique_ptr< icu4x::WordBreakIteratorLatin1 > | segment_latin1 (icu4x::diplomat::span< const uint8_t > input) const |
An ICU4X word-break segmenter, capable of finding word breakpoints in strings.
See the Rust documentation for WordSegmenter for more information.
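The following is a minimal, self-contained sketch of how breakpoints reported by an iterator could be turned into segments. It does not use the ICU4X headers: `FakeBreakIterator` and `split_at_breakpoints` are hypothetical names introduced for illustration, and the convention that `next()` returns a byte offset, or `-1` once the input is exhausted, is an assumption that should be checked against the break iterator documentation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a break iterator (not part of ICU4X):
// yields precomputed byte offsets, then -1 once exhausted.
class FakeBreakIterator {
public:
    explicit FakeBreakIterator(std::vector<std::int32_t> offsets)
        : offsets_(std::move(offsets)) {}

    std::int32_t next() {
        return pos_ < offsets_.size() ? offsets_[pos_++] : -1;
    }

private:
    std::vector<std::int32_t> offsets_;
    std::size_t pos_ = 0;
};

// Slices `input` at each offset reported by `it`.
// Assumption: `it.next()` returns increasing byte offsets and -1 at the end.
template <typename BreakIterator>
std::vector<std::string_view> split_at_breakpoints(std::string_view input,
                                                   BreakIterator& it) {
    std::vector<std::string_view> segments;
    std::size_t start = 0;
    for (std::int32_t bp = it.next(); bp >= 0; bp = it.next()) {
        const auto end = static_cast<std::size_t>(bp);
        if (end > input.size() || end < start) {
            break;  // defensive: stop on out-of-range or non-monotonic offsets
        }
        if (end > start) {
            segments.push_back(input.substr(start, end - start));
        }
        start = end;
    }
    return segments;
}
```

Because the helper is a template, any iterator type exposing a compatible `next()` could be passed in place of the fake one, provided the assumption above holds for it.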
|
inline static |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using compiled data. This does not assume any content locale.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_auto for more information.
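The note above can be read as a small lookup table. The sketch below only restates that note in code; `Language`, `Model`, and `auto_model_for` are hypothetical names, and this is not how ICU4X implements the selection internally.

```cpp
// Hypothetical enums for illustration only (not ICU4X types).
enum class Language { Chinese, Japanese, Burmese, Khmer, Lao, Thai, Other };
enum class Model { Dictionary, Lstm, None };

// Mirrors the note: dictionary for Chinese and Japanese, LSTM for Burmese,
// Khmer, Lao, and Thai. `None` means no complex-script payload is involved.
constexpr Model auto_model_for(Language lang) {
    switch (lang) {
        case Language::Chinese:
        case Language::Japanese:
            return Model::Dictionary;
        case Language::Burmese:
        case Language::Khmer:
        case Language::Lao:
        case Language::Thai:
            return Model::Lstm;
        case Language::Other:
            return Model::None;
    }
    return Model::None;  // unreachable for valid enum values
}
```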
|
inline static |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using compiled data.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_auto for more information.
|
inline static |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using a particular data source.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_auto for more information.
|
inline static |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using compiled data. This does not assume any content locale.
Note: currently, it uses dictionary data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_dictionary for more information.
|
inline static |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using compiled data.
Note: currently, it uses dictionary data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_dictionary for more information.
|
inline static |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using a particular data source.
Note: currently, it uses dictionary data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_dictionary for more information.
|
inline static |
Construct a WordSegmenter with no support for scripts requiring complex, context-dependent word breaks (Chinese, Japanese, Burmese, Khmer, Lao, and Thai), using compiled data. This does not assume any content locale.
See the Rust documentation for new_for_non_complex_scripts for more information.
|
inline static |
Construct a WordSegmenter with no support for scripts requiring complex, context-dependent word breaks (Chinese, Japanese, Burmese, Khmer, Lao, and Thai), using compiled data.
See the Rust documentation for try_new_for_non_complex_scripts for more information.
|
inline static |
Construct a WordSegmenter with no support for scripts requiring complex, context-dependent word breaks (Chinese, Japanese, Burmese, Khmer, Lao, and Thai), using a particular data source.
See the Rust documentation for try_new_for_non_complex_scripts for more information.
|
inline static |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using compiled data. This does not assume any content locale.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_lstm for more information.
|
inline static |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using compiled data.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_lstm for more information.
|
inline static |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using a particular data source.
Note: currently, it uses dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_lstm for more information.
|
inline static |
|
inline |
Segments a UTF-8 string.
Ill-formed input is treated as if errors had been replaced with REPLACEMENT CHARACTERs according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf8 for more information.
|
inline |
Segments a UTF-16 string.
Ill-formed input is treated as if errors had been replaced with REPLACEMENT CHARACTERs according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf16 for more information.
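To make the ill-formed input rule concrete for UTF-16, here is a minimal sketch of lossy decoding in which every unpaired surrogate becomes U+FFFD REPLACEMENT CHARACTER. `decode_utf16_lossy` is a hypothetical helper written for illustration; it is not an ICU4X function, and the segmenter does not require callers to perform this step themselves.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of ICU4X): decodes UTF-16 code units to code
// points, replacing each unpaired surrogate with U+FFFD.
std::u32string decode_utf16_lossy(std::u16string_view input) {
    constexpr char32_t kReplacement = 0xFFFD;
    std::u32string out;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char16_t unit = input[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;
        const bool next_is_low = i + 1 < input.size() &&
                                 input[i + 1] >= 0xDC00 &&
                                 input[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            // Well-formed surrogate pair: combine into one code point.
            const char32_t hi = static_cast<char32_t>(unit - 0xD800);
            const char32_t lo = static_cast<char32_t>(input[i + 1] - 0xDC00);
            out.push_back(static_cast<char32_t>(0x10000 + (hi << 10) + lo));
            ++i;  // consume the low surrogate as well
        } else if (is_high || is_low) {
            out.push_back(kReplacement);  // unpaired surrogate
        } else {
            out.push_back(static_cast<char32_t>(unit));
        }
    }
    return out;
}
```

Under this rule, a lone high surrogate between `a` and `b` decodes to three code points: `a`, U+FFFD, `b`.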
|
inline |
Segments a Latin-1 string.
See the Rust documentation for segment_latin1 for more information.
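In Latin-1 (ISO-8859-1), each byte is exactly one code point in the range U+0000 to U+00FF, so no byte sequence is ill-formed. The sketch below shows that mapping by converting Latin-1 bytes to UTF-8; `latin1_to_utf8` is a hypothetical helper for illustration and is not part of ICU4X.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU4X): converts Latin-1 bytes to UTF-8.
// Bytes below 0x80 are ASCII; bytes 0x80..0xFF become two-byte sequences.
std::string latin1_to_utf8(const std::vector<std::uint8_t>& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (const std::uint8_t byte : latin1) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));
        } else {
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

Note that offsets differ between the two encodings: the Latin-1 bytes for "café" occupy 4 bytes, while the UTF-8 form occupies 5, so breakpoints computed on one form do not apply directly to the other.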