"""
Task: Divide the provided text into semantically coherent chunks, each containing between 250 and 350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.
Guidelines:
1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks.
2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words.
3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits.
4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.
Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this.
"""
Might sound like a rookie question, but I'm curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048-token limit [0]. I found some hints about stretching this to 8k tokens [1], but I'm still scratching my head over handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much the model can output? From what I've picked up, a decent summary ratio for a large text without losing the good parts is about 10%, which here would be on the order of 7.5K words, or 20+ chunks of output. Appreciate any insights, and apologies if this comes off as basic.
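You generally can't feed the whole book in at once, so the usual workaround is to window the text and carry state across windows. A rough sketch of that loop, where run_phi2 is a placeholder for whatever inference call returns the model's chunk list, and the 1,000-word window size is a guess that leaves headroom for the instructions and the output inside 2048 tokens:

    WINDOW_WORDS = 1000  # rough guess: ~1.3k tokens of input, leaving room
                         # for the prompt and the output within the context

    def windows(words: list[str], size: int = WINDOW_WORDS):
        for i in range(0, len(words), size):
            yield " ".join(words[i:i + size])

    def chunk_book(text: str, run_phi2) -> list[str]:
        chunks, carry = [], ""
        for window in windows(text.split()):
            piece = (carry + " " + window).strip()
            new_chunks = run_phi2(piece)  # placeholder: text in, chunk list out
            # The last chunk may be cut mid-idea at the window edge, so
            # carry it into the next window instead of keeping it.
            carry = new_chunks.pop() if new_chunks else piece
            chunks.extend(new_chunks)
        if carry:
            chunks.append(carry)
        return chunks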
Wild speculation - do you think there could be any benefit to creating two sets of chunks, with the second set at an offset from the first? So the boundary between chunks in the first set would land near the middle of a chunk in the second set?
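That's essentially chunking with 50% overlap, which retrieval pipelines do use, so an idea cut at one set's boundary sits whole inside a chunk of the other set. A word-level sketch, ignoring the semantic-boundary logic for brevity:

    def offset_chunks(words: list[str], size: int = 300) -> tuple[list[str], list[str]]:
        """Return two chunk sets; set B is shifted by half a chunk, so
        every boundary in set A falls near the middle of a chunk in B."""
        set_a = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
        half = size // 2
        set_b = [" ".join(words[i:i + size]) for i in range(half, len(words), size)]
        set_b.insert(0, " ".join(words[:half]))  # leading half-chunk
        return set_a, set_b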
This just isn't working for me: phi-2 starts summarizing the document I give it instead of chunking it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?
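On the GGUF question: GGUF is just the quantized file format llama.cpp uses, so by itself it shouldn't turn a chunking prompt into a summarization one, though aggressive quantization can hurt instruction-following on a 2.7B model. What sometimes helps is greedy decoding plus ending the prompt with the start of the expected answer, so the model completes a format rather than free-associating. A hedged sketch with llama-cpp-python; the model path, token budget, and "Chunk 1:" scaffold are assumptions, not tested values:

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholders: CHUNKING_PROMPT is the instruction block quoted at the
    # top of this thread; `article` is the text you want split. The article
    # must be short enough to leave room for the output inside n_ctx.
    CHUNKING_PROMPT = "..."
    article = "..."

    llm = Llama(model_path="phi-2.Q5_K_M.gguf", n_ctx=2048)  # path is an example

    resp = llm(
        f"{CHUNKING_PROMPT}\n\nText:\n{article}\n\nChunk 1:",
        max_tokens=1500,
        temperature=0.0,  # greedy decoding; less room to drift into summary mode
    )
    print(resp["choices"][0]["text"])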
[0]: https://huggingface.co/microsoft/phi-2
[1]: https://old.reddit.com/r/LocalLLaMA/comments/197kweu/experie...