ICU4X
International Components for Unicode
#include <WordSegmenter.d.hpp>
Public Member Functions
std::unique_ptr<icu4x::WordBreakIteratorUtf8> segment(std::string_view input) const
std::unique_ptr<icu4x::WordBreakIteratorUtf16> segment16(std::u16string_view input) const
std::unique_ptr<icu4x::WordBreakIteratorLatin1> segment_latin1(diplomat::span<const uint8_t> input) const
An ICU4X word-break segmenter, capable of finding word breakpoints in strings.
See the Rust documentation for WordSegmenter
for more information.
|
inlinestatic |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using compiled data. This does not assume any content locale.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_auto
for more information.
|
inlinestatic |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using compiled data.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_auto
for more information.
|
inlinestatic |
Construct a WordSegmenter that automatically selects the best available LSTM or dictionary payload data, using a particular data source.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_auto
for more information.
|
inlinestatic |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using compiled data. This does not assume any content locale.
Note: currently, it uses a dictionary for Chinese and Japanese, and also for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_dictionary
for more information.
|
inlinestatic |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using compiled data.
Note: currently, it uses a dictionary for Chinese and Japanese, and also for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_dictionary
for more information.
|
inlinestatic |
Construct a WordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai, using a particular data source.
Note: currently, it uses a dictionary for Chinese and Japanese, and also for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_dictionary
for more information.
|
inlinestatic |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using compiled data. This does not assume any content locale.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for new_lstm
for more information.
|
inlinestatic |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using compiled data.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_lstm
for more information.
|
inlinestatic |
Construct a WordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai, using a particular data source.
Note: currently, it uses a dictionary for Chinese and Japanese, and LSTM for Burmese, Khmer, Lao, and Thai.
See the Rust documentation for try_new_lstm
for more information.
|
inline |
Segments a UTF-8 string.
Ill-formed input is treated as if errors had been replaced with U+FFFD REPLACEMENT CHARACTERs according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf8
for more information.
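A minimal sketch of how the breakpoints from the returned iterator might be gathered into a vector. It assumes, and this page does not state it, that the break iterator exposes a `next()` member returning the next breakpoint as a 32-bit integer and a negative value once the input is exhausted. `collect_breakpoints` and `FakeBreakIterator` are hypothetical names introduced only for illustration; the fake iterator stands in for the library type so the sketch compiles without ICU4X.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Collects every breakpoint offset produced by a break iterator.
// Assumption: `it.next()` returns the next breakpoint, or a negative
// value when there are no more breakpoints.
template <typename BreakIterator>
std::vector<std::int32_t> collect_breakpoints(BreakIterator& it) {
    std::vector<std::int32_t> out;
    for (std::int32_t bp = it.next(); bp >= 0; bp = it.next()) {
        out.push_back(bp);
    }
    return out;
}

// Hypothetical stand-in for a library iterator: replays a fixed list
// of offsets, then reports -1.
struct FakeBreakIterator {
    explicit FakeBreakIterator(std::vector<std::int32_t> o)
        : offsets(std::move(o)), pos(0) {}

    std::int32_t next() {
        return pos < offsets.size() ? offsets[pos++] : -1;
    }

    std::vector<std::int32_t> offsets;
    std::size_t pos;
};
```

If the assumption holds, the same helper could be called on the dereferenced result of `segment`, for example `collect_breakpoints(*segmenter.segment(text))` bound to a named iterator first.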
|
inline |
Segments a UTF-16 string.
Ill-formed input is treated as if errors had been replaced with U+FFFD REPLACEMENT CHARACTERs according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf16
for more information.
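To make the "treated as if replaced" wording concrete for UTF-16, here is a small model of the effect: every unpaired surrogate code unit becomes U+FFFD, while well-formed surrogate pairs are kept. This is a sketch of the observable behavior as read from the WHATWG UTF-16 decoder, not the library's implementation, and `replace_unpaired_surrogates` is a hypothetical helper name. The library itself is only documented to behave as if this substitution had happened; it is not stated to copy or modify the input.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `input` where each unpaired surrogate code unit is
// replaced with U+FFFD. Valid high+low surrogate pairs pass through.
std::u16string replace_unpaired_surrogates(const std::u16string& input) {
    std::u16string out;
    out.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char16_t u = input[i];
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        const bool is_low = u >= 0xDC00 && u <= 0xDFFF;
        const bool next_is_low = i + 1 < input.size() &&
                                 input[i + 1] >= 0xDC00 &&
                                 input[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            // Well-formed pair: copy both code units unchanged.
            out.push_back(u);
            out.push_back(input[++i]);
        } else if (is_high || is_low) {
            // Lone surrogate: substitute the replacement character.
            out.push_back(static_cast<char16_t>(0xFFFD));
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

Because each lone surrogate is replaced by exactly one code unit, the output has the same length as the input in this model.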
|
inline |
Segments a Latin-1 string.
See the Rust documentation for segment_latin1
for more information.
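As a reminder of what Latin-1 input means: each byte is the code point of the same value (U+0000 through U+00FF), so no byte sequence is ill-formed. The sketch below converts Latin-1 bytes to UTF-8, which can be handy when displaying segments. `latin1_to_utf8` is a hypothetical helper, not part of ICU4X, and it says nothing about how the library processes Latin-1 internally.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Converts Latin-1 (ISO-8859-1) bytes to a UTF-8 string.
// Bytes below 0x80 map to one UTF-8 byte; the rest map to two.
std::string latin1_to_utf8(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (std::uint8_t b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            // Two-byte form: 110000xx 10xxxxxx for U+0080..U+00FF.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```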