snyy | 1 month ago
I suppose we've only tested this with languages that do have delimiters: Hindi, English, Spanish, and French.
There are two ways to control the splitting point. The first is through delimiters, and the second is by setting a chunk size. If you're parsing a language where chunks can't be described by either of those params, then I suppose memchunk wouldn't work. I'd be curious to see what does work though!
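To make the two knobs concrete, here is a minimal std-only sketch of how a size limit and a delimiter set can interact. The function `chunk_by_size_and_delims` is hypothetical and written just for this illustration; it is not memchunk's API or its implementation.

```rust
/// Hypothetical helper, not part of memchunk: split `text` into chunks of at
/// most `size` bytes. Each chunk ends just after the last delimiter byte found
/// inside its window; if the window has no delimiter, it is cut at `size`.
fn chunk_by_size_and_delims<'a>(text: &'a [u8], size: usize, delims: &[u8]) -> Vec<&'a [u8]> {
    assert!(size > 0, "chunk size must be positive");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + size).min(text.len());
        let window = &text[start..end];
        let cut = if end == text.len() {
            // The remainder fits in one chunk, so take it whole.
            window.len()
        } else {
            // Search backward from the size limit for a delimiter byte.
            match window.iter().rposition(|b| delims.contains(b)) {
                Some(i) => i + 1,
                None => window.len(), // no delimiter: hard cut at the size limit
            }
        };
        chunks.push(&text[start..start + cut]);
        start += cut;
    }
    chunks
}
```

With `b"Hello. World. Test."`, a size of 10, and `b"."` as the delimiter set, this yields `"Hello."`, `" World."`, and `" Test."`. When no delimiter falls inside a window, the cut lands at the byte limit, which for multi-byte UTF-8 text can be in the middle of a character.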
smlacy|1 month ago
ks2048|1 month ago
snyy|1 month ago
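// With multi-byte pattern
// `<japanese_full_stop>` stands in for the ideographic full stop (U+3002), which is 3 bytes in UTF-8
let metaspace = "<japanese_full_stop>".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(metaspace).prefix().collect();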
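For anyone wondering why this needs a pattern rather than a single delimiter byte: the ideographic full stop U+3002 encodes to three bytes in UTF-8, so the split has to match a byte sequence. Below is a rough std-only sketch of that idea. `split_after_pattern` is a hypothetical helper, not memchunk's API, and keeping the pattern at the end of each chunk is an assumption made for the example.

```rust
/// Hypothetical helper (not memchunk's API): split `text` after every
/// occurrence of the byte sequence `pattern`, keeping the pattern at the
/// end of the chunk it terminates.
fn split_after_pattern<'a>(text: &'a [u8], pattern: &[u8]) -> Vec<&'a [u8]> {
    assert!(!pattern.is_empty(), "pattern must not be empty");
    let mut chunks = Vec::new();
    let mut start = 0; // start of the chunk being built
    let mut i = 0; // current scan position
    while i + pattern.len() <= text.len() {
        if &text[i..i + pattern.len()] == pattern {
            // Include the whole pattern in the current chunk, then start a new one.
            i += pattern.len();
            chunks.push(&text[start..i]);
            start = i;
        } else {
            i += 1;
        }
    }
    // Whatever follows the last match becomes the final chunk.
    if start < text.len() {
        chunks.push(&text[start..]);
    }
    chunks
}
```

Calling it on `"今日は晴れ。明日は雨。".as_bytes()` with `"。".as_bytes()` as the pattern gives two chunks, one per sentence. Because UTF-8 is self-synchronizing, a match on the full three-byte sequence cannot start in the middle of another character.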