adtac | 6 months ago

>That's basically your documents lossily encoded.

Vector embeddings are lossy encodings of documents roughly in the same way a SHA256 hash is a lossy encoding. It's virtually impossible to reverse the embedding vector to recover the original document.

Note: when vectors are combined with other components for search and retrieval, it's trivial to end up with a horribly insecure system. But vector embeddings are useful by themselves, and you said "all useful AI retrieval systems are insecure by design", so I felt it necessary to disagree with that part.

sfink|6 months ago

> Vector embeddings are lossy encodings of documents roughly in the same way a SHA256 hash is a lossy encoding.

Incorrect. With a hash, I need to have the identical input to know whether it matches. If I'm one bit off, I get no information. Vector embeddings, by design, produce similar outputs for similar inputs, so if you can reproduce the embedding algorithm then you can tell how close you are to the input. It's like a combination lock that tells you how many numbers match so far (and for ones that don't, how close they are).
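To make the contrast concrete, here is a rough sketch. `toy_embed` is a hypothetical stand-in (character-trigram counts), not a real embedding model, and the three sentences are made up for illustration. Real models produce dense neural vectors, but the property being compared is the same: a one-character change flips roughly half the SHA-256 bits, while the embedding similarity stays high.

```python
import hashlib
import math
from collections import Counter

def sha256_bits(text):
    """Return the SHA-256 digest of text as a 256-character bit string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)

def bit_difference(a, b):
    """Fraction of bit positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def toy_embed(text):
    """Hypothetical toy embedding: counts of character trigrams (an assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Made-up example inputs.
secret = "the quarterly revenue was 4.2 million dollars"
near = "the quarterly revenue was 4.3 million dollars"   # one character off
far = "my cat likes to sleep on the warm windowsill"     # unrelated

hash_near = bit_difference(sha256_bits(secret), sha256_bits(near))
hash_far = bit_difference(sha256_bits(secret), sha256_bits(far))
sim_near = cosine(toy_embed(secret), toy_embed(near))
sim_far = cosine(toy_embed(secret), toy_embed(far))

# The hash differences look alike for both guesses; the similarities do not.
print(f"SHA-256 bits differing: near={hash_near:.2f} far={hash_far:.2f}")
print(f"toy embedding cosine:   near={sim_near:.2f} far={sim_far:.2f}")
```

The hash gives the near miss and the unrelated sentence about the same score, so it tells a guesser nothing. The embedding ranks them, which is the "how many numbers match" signal.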

> It's virtually impossible to reverse the embedding vector to recover the original document.

If you can reproduce the embedding process, it is very possible (with a hot/cold type of search: "you're getting warmer!"). But also, you no longer even need to recover the exact original. You can recover something close enough (and spend more time to make it incrementally closer).
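A minimal sketch of that hot/cold search, under loud assumptions: the attacker can run the same embedding function, `toy_embed` is a hypothetical trigram-count stand-in rather than a real model, and the attacker has a small candidate vocabulary (made up here). Published attacks on real embedding models are more involved, but the feedback loop has the same shape: keep any change that moves the guess closer to the leaked vector.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical toy embedding: counts of character trigrams (an assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def hot_cold_search(target_vec, vocabulary):
    """Greedily append whichever unused word makes the guess 'warmer'.

    Uses only the leaked vector and the embedding function, never the text.
    """
    guess, best = [], 0.0
    while True:
        scored = [
            (cosine(toy_embed(" ".join(guess + [word])), target_vec), word)
            for word in vocabulary
            if word not in guess
        ]
        if not scored:
            break
        score, word = max(scored)
        if score <= best:  # no candidate gets warmer, so stop
            break
        guess.append(word)
        best = score
    return guess, best

# Made-up secret and vocabulary; the attacker only ever sees leaked_vec.
secret = "quarterly revenue fell sharply"
leaked_vec = toy_embed(secret)
vocabulary = ["banana", "cat", "fell", "profit", "quarterly",
              "revenue", "sharply", "tomato", "window"]

recovered, score = hot_cold_search(leaked_vec, vocabulary)
print(" ".join(recovered), f"(similarity {score:.3f})")
```

In this toy setup the search picks out the secret's words and rejects the decoys, though not necessarily in the original order. That matches the point above: the result is close enough to be useful without being an exact recovery.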

mpeg|6 months ago

I wouldn't say those two are equivalent. A cryptographic hash requires the exact full document to be available to "recover it" from the hash. With a vector embedding, you can extract information related to the document from the embedding alone, as long as you know (or can guess) what embedding model was used. You won't be able to reconstruct the document, but you will be able to infer some meaning from the vector alone.

frakt0x90|6 months ago

Yes, there have been multiple papers showing information extraction from embedding vectors if you know the model used. SHA by design maps similar strings pseudo-randomly. Embeddings by design map similar strings similarly.
