ICU4X
International Components for Unicode
#include <ICU4XWordSegmenter.hpp>
Public Member Functions
| ICU4XWordBreakIteratorUtf8 | segment_utf8 (const std::string_view input) const |
| ICU4XWordBreakIteratorUtf16 | segment_utf16 (const std::u16string_view input) const |
| ICU4XWordBreakIteratorLatin1 | segment_latin1 (const diplomat::span< const uint8_t > input) const |
| | ICU4XWordSegmenter (capi::ICU4XWordSegmenter *i) |
| | ICU4XWordSegmenter ()=default |
| | ICU4XWordSegmenter (ICU4XWordSegmenter &&) noexcept=default |
| ICU4XWordSegmenter & | operator= (ICU4XWordSegmenter &&other) noexcept=default |
Static Public Member Functions
| static diplomat::result< ICU4XWordSegmenter, ICU4XError > | create_auto (const ICU4XDataProvider &provider) |
| static diplomat::result< ICU4XWordSegmenter, ICU4XError > | create_lstm (const ICU4XDataProvider &provider) |
| static diplomat::result< ICU4XWordSegmenter, ICU4XError > | create_dictionary (const ICU4XDataProvider &provider) |
An ICU4X word-break segmenter, capable of finding word breakpoints in strings.
See the Rust documentation for WordSegmenter for more information.
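ICU4XWordSegmenter::ICU4XWordSegmenter (capi::ICU4XWordSegmenter *i) [inline, explicit]
ICU4XWordSegmenter::ICU4XWordSegmenter () [default]
ICU4XWordSegmenter::ICU4XWordSegmenter (ICU4XWordSegmenter &&) [default, noexcept]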
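static diplomat::result< ICU4XWordSegmenter, ICU4XError > ICU4XWordSegmenter::create_auto (const ICU4XDataProvider &provider) [inline, static]
Constructs an ICU4XWordSegmenter that automatically selects the best available LSTM or dictionary payload data.
Note: it currently uses dictionary data for Chinese and Japanese, and LSTM data for Burmese, Khmer, Lao, and Thai.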
See the Rust documentation for new_auto for more information.
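static diplomat::result< ICU4XWordSegmenter, ICU4XError > ICU4XWordSegmenter::create_dictionary (const ICU4XDataProvider &provider) [inline, static]
Constructs an ICU4XWordSegmenter with dictionary payload data for Chinese, Japanese, Burmese, Khmer, Lao, and Thai.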
See the Rust documentation for new_dictionary for more information.
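static diplomat::result< ICU4XWordSegmenter, ICU4XError > ICU4XWordSegmenter::create_lstm (const ICU4XDataProvider &provider) [inline, static]
Constructs an ICU4XWordSegmenter with LSTM payload data for Burmese, Khmer, Lao, and Thai.
Warning: an ICU4XWordSegmenter created by this function does not handle Chinese or Japanese.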
See the Rust documentation for new_lstm for more information.
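The coverage described in the three constructor notes above can be summarized as a small lookup. This is an illustrative sketch only, not ICU4X code: the `Model` and `Constructor` enums and the `model_for` function are hypothetical names, and the two-letter language codes (zh, ja, my, km, lo, th) are assumed here as shorthand for the languages named in the notes.

```cpp
#include <string_view>

// Kind of payload used for a language (hypothetical, for illustration).
enum class Model { None, Dictionary, Lstm };

// The three static constructors documented above (hypothetical enum).
enum class Constructor { Auto, Lstm, Dictionary };

// Returns the payload kind a constructor uses for a language, following
// the notes above. Languages outside the six listed ones are reported
// as Model::None; this sketch says nothing about how they are handled.
constexpr Model model_for(Constructor c, std::string_view lang) {
    const bool chinese_or_japanese = lang == "zh" || lang == "ja";
    const bool southeast_asian =
        lang == "my" || lang == "km" || lang == "lo" || lang == "th";
    if (!chinese_or_japanese && !southeast_asian) {
        return Model::None;
    }
    switch (c) {
        case Constructor::Auto:
            // Dictionary for Chinese/Japanese, LSTM for the other four.
            return chinese_or_japanese ? Model::Dictionary : Model::Lstm;
        case Constructor::Dictionary:
            // Dictionary data covers all six languages.
            return Model::Dictionary;
        case Constructor::Lstm:
            // LSTM data does not cover Chinese or Japanese.
            return southeast_asian ? Model::Lstm : Model::None;
    }
    return Model::None;
}
```

Under these assumptions, `model_for(Constructor::Lstm, "ja")` yields `Model::None`, which mirrors the warning on `create_lstm`.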
ICU4XWordSegmenter & ICU4XWordSegmenter::operator= (ICU4XWordSegmenter &&other) [default, noexcept]
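ICU4XWordBreakIteratorLatin1 ICU4XWordSegmenter::segment_latin1 (const diplomat::span< const uint8_t > input) const [inline]
Segments a Latin-1 string.
See the Rust documentation for segment_latin1 for more information.
Lifetimes: both this and input must live at least as long as the output.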
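Latin-1 input is passed as raw bytes rather than as a string view, because bytes at or above 0x80 are not valid UTF-8 on their own. The sketch below is a standalone illustration of that difference, not part of ICU4X: the helper name `latin1_to_utf8` is hypothetical, and it relies only on the fact that each Latin-1 byte denotes the code point with the same numeric value (U+0000 to U+00FF).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Converts Latin-1 bytes to UTF-8 (hypothetical helper for illustration).
// ASCII bytes are copied unchanged; bytes 0x80..0xFF become two-byte
// UTF-8 sequences.
std::string latin1_to_utf8(const std::vector<std::uint8_t>& input) {
    std::string out;
    out.reserve(input.size());
    for (std::uint8_t b : input) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            // Two-byte form: 110000xx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

For example, the four Latin-1 bytes `63 61 66 E9` ("café") become five UTF-8 bytes, so offsets into a Latin-1 buffer and offsets into its UTF-8 form can disagree once a non-ASCII character appears.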
ICU4XWordBreakIteratorUtf16 ICU4XWordSegmenter::segment_utf16 (const std::u16string_view input) const [inline]
Segments a UTF-16 string.
Ill-formed input is treated as if errors had been replaced with REPLACEMENT CHARACTERs (U+FFFD) according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf16 for more information.
Lifetimes: both this and input must live at least as long as the output.
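For UTF-16, "ill-formed" means an unpaired surrogate code unit. The following is a minimal sketch of what treating such errors as U+FFFD looks like when decoding to code points. It is an illustration under that assumption, not the segmenter's actual implementation, and `decode_utf16_lossy` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes UTF-16 code units into code points, replacing every unpaired
// surrogate with U+FFFD (hypothetical helper for illustration).
std::u32string decode_utf16_lossy(std::u16string_view input) {
    std::u32string out;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char32_t unit = input[i];
        const bool high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool low = unit >= 0xDC00 && unit <= 0xDFFF;
        if (high && i + 1 < input.size()) {
            const char32_t next = input[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Well-formed surrogate pair: combine into one code point.
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        // Lone surrogates become U+FFFD; everything else passes through.
        out.push_back((high || low) ? U'\uFFFD' : unit);
    }
    return out;
}
```

In this sketch a lone high surrogate between `a` and `b` decodes to three code points (`a`, U+FFFD, `b`), so the input keeps its length in code units while the text being segmented contains a replacement character.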
ICU4XWordBreakIteratorUtf8 ICU4XWordSegmenter::segment_utf8 (const std::string_view input) const [inline]
Segments a UTF-8 string.
Ill-formed input is treated as if errors had been replaced with REPLACEMENT CHARACTERs (U+FFFD) according to the WHATWG Encoding Standard.
See the Rust documentation for segment_utf8 for more information.
Lifetimes: both this and input must live at least as long as the output.
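The lifetime note matters because a `std::string_view` only borrows its data. The sketch below shows the ownership pattern with a stand-in iterator. `SpaceBreakIterator` and `collect_breaks` are hypothetical and not part of ICU4X; the stand-in splits on ASCII spaces only, and returning -1 at the end is this sketch's own convention.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a break iterator. Like the iterators returned
// by the segment_* methods, it borrows the input instead of copying it.
class SpaceBreakIterator {
public:
    explicit SpaceBreakIterator(std::string_view input) : input_(input) {}

    // Returns the next boundary offset, or -1 once the input is exhausted.
    long next() {
        if (done_) return -1;
        if (!started_) {
            started_ = true;
            if (input_.empty()) done_ = true;
            return 0;  // the start of the text is always a boundary
        }
        // Advance past the current run of spaces or non-spaces.
        const bool space = input_[pos_] == ' ';
        while (pos_ < input_.size() && (input_[pos_] == ' ') == space) ++pos_;
        if (pos_ == input_.size()) done_ = true;
        return static_cast<long>(pos_);
    }

private:
    std::string_view input_;  // borrowed: the owner must outlive this object
    std::size_t pos_ = 0;
    bool started_ = false;
    bool done_ = false;
};

// Collects all boundaries. `text` is owned by the caller and stays alive
// for the whole loop, which satisfies the lifetime requirement.
std::vector<long> collect_breaks(const std::string& text) {
    SpaceBreakIterator it{std::string_view(text)};
    std::vector<long> out;
    for (long b = it.next(); b != -1; b = it.next()) out.push_back(b);
    return out;
}
```

By contrast, constructing the iterator from a temporary such as `std::string("ab cd")` and calling `next()` on a later line would read freed memory, because the temporary is destroyed at the end of the statement that created it.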