🚧 [Experimental] Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
This module contains segmenter implementations for the following rules:
- Line breaker that is compatible with Unicode Standard Annex #14 and CSS properties.
- Grapheme cluster breaker, word breaker, and sentence breaker that are compatible with Unicode Standard Annex #29.
Examples
Line Break
Segment a string with default options:
use icu::segmenter::LineSegmenter;

let segmenter =
    LineSegmenter::try_new_unstable(&icu_testdata::unstable())
        .expect("Data exists");

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
See LineSegmenter for more examples.
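The break offsets returned in the line break example are byte indices into the string, and here they do not include the starting offset 0. As a minimal sketch of how such offsets could be turned into line pieces, the following uses only the standard library; `split_at_breaks` is a hypothetical helper, not part of icu_segmenter, and the offsets are hard-coded from the example rather than computed.

```rust
/// Hypothetical helper (not part of icu_segmenter): splits `text` at the
/// given byte offsets. The offsets are assumed to be sorted, to fall on
/// char boundaries, and to end with `text.len()`.
fn split_at_breaks<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for &end in breaks {
        // Skip empty pieces, e.g. when the list starts with offset 0.
        if end > start {
            pieces.push(&text[start..end]);
        }
        start = end;
    }
    pieces
}

fn main() {
    // Offsets copied from the line break example; not computed here.
    let pieces = split_at_breaks("Hello World", &[6, 11]);
    assert_eq!(pieces, ["Hello ", "World"]);
    println!("{:?}", pieces);
}
```

The trailing space stays attached to "Hello " because the break opportunity sits after the space, at offset 6.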
Grapheme Cluster Break
See GraphemeClusterSegmenter for examples.
Word Break
Segment a string:
use icu::segmenter::WordSegmenter;

let segmenter =
    WordSegmenter::try_new_unstable(&icu_testdata::unstable())
        .expect("Data exists");

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
See WordSegmenter for more examples.
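In the word break example the breakpoints include both ends of the string, so each adjacent pair of offsets delimits one segment. The sketch below, which uses only the standard library, pairs the offsets with `windows(2)` and then keeps the segments that contain at least one alphanumeric character. Both `segments_between` and that filter are illustrative assumptions, not icu_segmenter API, and the offsets are hard-coded from the example.

```rust
/// Hypothetical helper (not part of icu_segmenter): returns the substring
/// between each adjacent pair of breakpoints. Offsets are assumed to be
/// sorted byte indices on char boundaries.
fn segments_between<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    let text = "Hello World";
    // Offsets copied from the word break example; not computed here.
    let all = segments_between(text, &[0, 5, 6, 11]);
    assert_eq!(all, ["Hello", " ", "World"]);

    // A simple heuristic for dropping whitespace and punctuation segments.
    let words: Vec<&str> = all
        .iter()
        .copied()
        .filter(|s| s.chars().any(char::is_alphanumeric))
        .collect();
    assert_eq!(words, ["Hello", "World"]);
    println!("{:?}", words);
}
```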
Sentence Break
See SentenceSegmenter for examples.
Modules
Data provider struct definitions for this ICU4X component.
Structs
Segments a string into grapheme clusters.
Implements the Iterator trait over the line break opportunities of the given string. See the examples in LineSegmenter for usage.
Options to tailor line breaking behavior, such as for CSS.
Supports loading line break data, and creating line break iterators for different string encodings.
Implements the Iterator trait over the segmenter break opportunities of the given string.
Supports loading sentence break data, and creating sentence break iterators for different string encodings.
Supports loading word break data, and creating word break iterators for different string encodings.
Enums
A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line breaker.
A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line breaker.
Type Definitions
Grapheme cluster break iterator for a Latin-1 (8-bit) string.
Grapheme cluster break iterator for a potentially invalid UTF-8 string.
Grapheme cluster break iterator for an str (a UTF-8 string).
Grapheme cluster break iterator for a UTF-16 string.
Line break iterator for a Latin-1 (8-bit) string.
Line break iterator for a potentially invalid UTF-8 string.
Line break iterator for an str (a UTF-8 string).
Line break iterator for a UTF-16 string.
Sentence break iterator for a Latin-1 (8-bit) string.
Sentence break iterator for a potentially invalid UTF-8 string.
Sentence break iterator for an str (a UTF-8 string).
Sentence break iterator for a UTF-16 string.
Word break iterator for a Latin-1 (8-bit) string.
Word break iterator for a potentially invalid UTF-8 string.
Word break iterator for an str (a UTF-8 string).
Word break iterator for a UTF-16 string.
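The iterator types above differ in the encoding of their input. As a rough illustration of why that matters, the standard-library sketch below encodes one string as UTF-8, UTF-16, and Latin-1 and compares the lengths in code units. `to_latin1` is a hypothetical helper, not part of icu_segmenter. It is an assumption here, worth checking against each segmenter's documentation, that break offsets are counted in the code units of whichever input form is used.

```rust
/// Hypothetical helper (not part of icu_segmenter): encodes `text` as
/// Latin-1, or returns `None` if any character is above U+00FF.
fn to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let text = "café au lait";

    let utf8: &[u8] = text.as_bytes();
    let utf16: Vec<u16> = text.encode_utf16().collect();
    let latin1 = to_latin1(text).expect("all characters fit in Latin-1");

    // 'é' takes two bytes in UTF-8 but one code unit in the other two forms,
    // so every offset after it differs by one between the encodings.
    assert_eq!(utf8.len(), 13);
    assert_eq!(utf16.len(), 12);
    assert_eq!(latin1.len(), 12);
    assert_eq!(&utf8[3..5], &[0xC3, 0xA9]);
    assert_eq!(utf16[3], 0x00E9);
    assert_eq!(latin1[3], 0xE9);
}
```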