

dmpetrov | 1 year ago

I'd guess it involves splitting a file into smaller document snippets, recording metadata such as page numbers, and calculating an embedding for each snippet. That's the usual approach; the specific signals vary by use case.
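To make that concrete, here is a rough sketch of the chunk-and-embed step. The names `chunk_pages` and `embed` are made up for illustration, and `embed` is only a toy hashed bag-of-words vector standing in for whatever embedding model you would actually call.

```python
import hashlib
import math
import re


def embed(text, dim=64):
    """Toy stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token into one of `dim` buckets (assumption: not a real model).
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk_pages(pages, max_words=50):
    """Split each page into snippets of at most max_words words, keeping the page number."""
    snippets = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), max_words):
            body = " ".join(words[start:start + max_words])
            snippets.append(
                {"page": page_number, "text": body, "embedding": embed(body)}
            )
    return snippets


# Hypothetical two-page document: page 1 has 60 words, page 2 has 3.
pages = ["first page " * 30, "second page text"]
snippets = chunk_pages(pages, max_words=50)
```

Each snippet carries its page number alongside the vector, so the retriever can use that metadata later. Real pipelines usually split on sentences or sections rather than a fixed word count.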

Hopefully, @jerednel can add more details.


jerednel | 1 year ago

For HTML, it's the markup tags: h1s, page title, meta keywords, and meta descriptions.
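As an illustration only, pulling those fields out could look something like this with Python's standard-library `html.parser`. The `MetadataParser` class and `extract_metadata` helper are hypothetical names, not anything from an actual pipeline.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the page title, h1 headings, meta keywords, and meta description."""

    def __init__(self):
        super().__init__()
        self.metadata = {"title": "", "h1": [], "keywords": [], "description": ""}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "h1":
            self._current = "h1"
            self.metadata["h1"].append("")
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "keywords":
                self.metadata["keywords"] = [
                    k.strip() for k in content.split(",") if k.strip()
                ]
            elif name == "description":
                self.metadata["description"] = content.strip()

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.metadata["title"] += data
        elif self._current == "h1":
            self.metadata["h1"][-1] += data


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


# Made-up sample page for demonstration.
sample = (
    "<html><head><title>Widget Guide</title>"
    '<meta name="keywords" content="widgets, gears">'
    '<meta name="description" content="How widgets work.">'
    "</head><body><h1>Getting Started</h1></body></html>"
)
meta = extract_metadata(sample)
```

The resulting dict can be stored next to each snippet's embedding.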

My retriever functions typically use metadata in combination with the similarity search, either to impart some sort of influence on the scores or for reranking.
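A minimal sketch of what that combination might look like, assuming a simple additive boost: cosine similarity plus a fixed bonus when a query term appears in a metadata field. The function names, the weights, and the additive formula are all assumptions for illustration, not a description of any particular retriever.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def metadata_boost(query, metadata, weights):
    """Add a field's weight when any query term appears in that metadata field."""
    terms = set(query.lower().split())
    boost = 0.0
    for field, weight in weights.items():
        value = metadata.get(field, "")
        text = " ".join(value) if isinstance(value, list) else value
        if terms & set(text.lower().split()):
            boost += weight
    return boost


def rerank(query, query_vec, candidates, weights):
    """Return (id, score) pairs sorted by similarity plus metadata boost, best first."""
    scored = [
        (
            c["id"],
            cosine(query_vec, c["embedding"])
            + metadata_boost(query, c["metadata"], weights),
        )
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up candidates with 2-d vectors to keep the arithmetic easy to follow.
candidates = [
    {"id": "a", "embedding": [0.9, 0.1], "metadata": {"title": "gear ratios"}},
    {"id": "b", "embedding": [0.8, 0.6], "metadata": {"title": "widget guide"}},
]
ranked = rerank("widget setup", [1.0, 0.0], candidates, {"title": 0.3})
```

Here "a" is closer to the query on similarity alone, but the title match lifts "b" above it. How much weight metadata should get is a tuning question that depends on the use case.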