top | item 36315969

Kelamir | 2 years ago

> We start by parsing documents into chunks. A sensible default is to chunk documents by token length, typically 1,500 to 3,000 tokens per chunk. However, I found that this didn’t work very well. A better approach might be to chunk by paragraphs (e.g., split on \n\n).

Hmm, good insight there. I've experimented with chunking by token length before, and it's been pretty troublesome due to missing context.
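The paragraph-based chunking the quote suggests is just a split on blank lines. A minimal sketch (function name is my own, not from any library):

```python
def chunk_by_paragraphs(text):
    """Split text into paragraph chunks on blank lines (\n\n),
    dropping empty fragments left by runs of newlines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
print(chunk_by_paragraphs(doc))
# ['First paragraph.', 'Second paragraph.', 'Third.']
```

Each chunk is then a semantically coherent unit, at the cost of uneven chunk sizes.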

gwern|2 years ago

You don't do a sliding window? That seems like the logical way to maintain context but allow look up by 'chunks'. Embed it, say, 3 paragraphs at a time, advancing 1 paragraph per embedding.
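The sliding-window scheme described above (3 paragraphs per chunk, advancing 1 paragraph per embedding) could be sketched like this; the function name and parameters are illustrative, not from any particular library:

```python
def sliding_window_chunks(paragraphs, window=3, stride=1):
    """Group paragraphs into overlapping chunks: `window` paragraphs
    per chunk, advancing `stride` paragraphs between chunks."""
    if len(paragraphs) <= window:
        return ["\n\n".join(paragraphs)]
    return [
        "\n\n".join(paragraphs[i:i + window])
        for i in range(0, len(paragraphs) - window + 1, stride)
    ]

paras = ["p1", "p2", "p3", "p4", "p5"]
chunks = sliding_window_chunks(paras)
# three windows: p1-p3, p2-p4, p3-p5
```

Note that with window=3 and stride=1 each paragraph lands in up to three chunks, so you embed roughly 3x the tokens, which is why the cost argument in the replies below matters for paid embedding APIs.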

chaxor|2 years ago

This is only a good idea if you are *specifically not* using OpenAI.

If you use local models then it's a fantastic idea.

screye|2 years ago

If you're concatenating after chunking, then the overlapping windows add quite a lot of repetition. Also, if a window cuts off mid-JSON / mid-structured-output, the overlap causes issues again.

Define a custom recursive text splitter in langchain, and do chunking heuristically. It works a lot better.

That being said, it is useful to maintain some global and local context. But, I wouldn't use overlapping windows.
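langchain ships a recursive splitter for this (`RecursiveCharacterTextSplitter`); to show the heuristic itself, here's a minimal standalone sketch of the idea, with my own function name and separator list rather than langchain's actual API. It tries the coarsest separator first (paragraph, then line, sentence, word) and only falls back to finer ones when a piece is still too long:

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    the coarsest separator that keeps pieces under the limit."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks, buf = [], ""
            for piece in text.split(sep):
                candidate = buf + sep + piece if buf else piece
                if len(candidate) <= max_len:
                    buf = candidate          # piece fits: extend the current chunk
                    continue
                if buf:
                    chunks.append(buf)       # flush the full chunk
                if len(piece) > max_len:
                    # still too long: recurse with the finer separators
                    chunks.extend(recursive_split(piece, max_len, separators))
                    buf = ""
                else:
                    buf = piece
            if buf:
                chunks.append(buf)
            return chunks
    # no separator found at all: hard cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Compared to overlapping windows, this avoids repetition entirely: every chunk is a disjoint span that tries to end on a natural boundary.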

SmooL|2 years ago

I've thought about doing this as well, but I haven't tried it yet. Are there any resources/blogs/information on various strategies on how to best chunk & embed arbitrary text?