(no title)
gthompson512 | 9 months ago
Edit: it seems like it just splits in to sentences which is a weird thing to do given in English only 95%ish percent agreement is even possible on what a sentence is. ``` // Process in batches for batch in sentences.chunks(batch_size) { // Truncate each sentence to max_length * median_token_length chars let truncated: Vec<&str> = batch .iter() .map(|text| { if let Some(max_tok) = max_length { Self::truncate_str(text, max_tok, self.median_token_length) } else { text.as_str() } }) .collect(); ```
gthompson512|9 months ago
Edit: That wasn't intended to be mean, although it may come off that way, but what is this supposed to be for? Myself I have text >8k tokens that need to be embedded and test things regularly.
Tananon|9 months ago
stephantul|9 months ago