Ask HN: Embeddings as "Semantic Hashes"
15 points | DavidHaerer | 2 years ago
To draw an analogy, can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256?
How can we make embeddings future-proof and exchangeable between independent parties?
konstruction|2 years ago
This goes even further, as a model sophisticated enough to capture a probability distribution will produce embeddings that encode this distribution (to some extent) so that any two models of that kind produce "equivalent" embeddings that can be transformed into each other. This is an area of active research (in fact, I've just been to a seminar talk about that).
So the answer to the "How can we .." would be: by capturing the distribution, by making the embedding big enough and the training task difficult enough.
Examples of embeddings that are re-used are variants of word2vec, CLIP and CLAP.
As others have already mentioned: the hash analogy would be correct if you think about non-cryptographic hashes, but I doubt that this clarifies anything.
simne|2 years ago
No, and there is no equivalent. They have different targets.
Cryptographic hashes are designed to give a unique representation that changes completely even when just one bit of the input changes. The target is to detect whether data has changed and to protect data from modification (intentional or not).
Vector representations are designed to make it easy to find similarities, so many pieces of data that differ at the bit level will have equal, or very close, vector representations.
A good vector representation even allows computationally efficient measurement of the distance between different pieces of data.
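That distance measure is typically cosine similarity (or a dot product on normalized vectors). A minimal sketch, with made-up 3-dimensional vectors standing in for real embeddings:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity: ~1.0 for near-identical directions, ~0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: two "paraphrases" and one unrelated text (values invented).
v_cat   = [0.9, 0.1, 0.0]
v_kitty = [0.85, 0.15, 0.05]
v_tax   = [0.0, 0.2, 0.95]

print(cosine_similarity(v_cat, v_kitty))  # close to 1.0
print(cosine_similarity(v_cat, v_tax))    # close to 0.0
```

This is the property a cryptographic hash deliberately destroys: there, nearby inputs land nowhere near each other.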
The biggest problem with the compatibility of vector representations is that several different algorithms exist, each using large parameter sets, and so far I have not seen any attempts to standardize these parameter sets, because they are very large and expensive to create, and there are copyright issues.
Also, I don't know for certain, but some algorithms may be patented.
As an example, consider some legal text in English and a good translation of it into French (or another language): the two will be totally different at the binary level, but equal in some vector representation.
Unfortunately, conversion from one vector space to another is impossible in the general case, because vector spaces do not overlap 100%: some cases possible in one space are impossible in the other.
A second problem: conversions between high-dimensional vector spaces are computationally expensive and not exact.
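A toy sketch of such a conversion: given paired embeddings of the same items from two models, fit a least-squares linear map between the spaces. The data here is synthetic (random vectors, not real model outputs), and the nonzero residual shows the mapping is only approximate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are embeddings of the same 100 texts from two different models.
space_a = rng.normal(size=(100, 8))
true_map = rng.normal(size=(8, 8))
space_b = space_a @ true_map + 0.05 * rng.normal(size=(100, 8))  # noisy relation

# Fit a linear map A -> B from the paired examples (least squares).
learned_map, *_ = np.linalg.lstsq(space_a, space_b, rcond=None)

# The map is only approximate: some residual error always remains.
residual = np.linalg.norm(space_a @ learned_map - space_b)
print(residual)
```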
As an example of the difficulty of converting between vector spaces, there is an anecdote that somebody ran the phrase "the spirit is strong but the flesh is weak" through an early automatic translator into Russian and back, and got the result "the vodka is good, but the meat is rotten".
simne|2 years ago
Fortunately, we have reached a point where nearly everybody can have their own copy of a large subset of terrestrial knowledge (Wikipedia).
So I think that in the near future we will see widely used vector representations based at least on Wikipedia, and some organization like Mozilla could make a standard context for them.
But for AGI we need more: a 3D representation of the world (at least geography and houses/buildings, not all exact, but adequate); an unrestricted knowledge base of pictures (video), sounds, and, for best results, tactile representations of a large enough subset of the world's objects; and some anthropocentric representation of the movements of living objects, like humans, trees, and some animals.
specproc|2 years ago
You couldn't do that with a hash, as far as I understand it, as hashing doesn't attempt to put similar things together -- quite the opposite.
sargstuff|2 years ago
Do a stereoscopic embedding. One eye for meaning, the other for distance. Put GIS coordinates as G-code/GRBL[0] code in a docstore database as a 3D-printable bas relief[1].
[0] g-code/GRBL : https://www.libhunt.com/compare-Universal-G-Code-Sender-vs-G...
[1] : https://www.yeggi.com/q/bias/
tlb|2 years ago
You'll almost certainly want to update the model over time, as the input distribution changes, to maintain good accuracy. So you need to keep the original source data and recalculate embeddings as needed.
DavidHaerer|2 years ago
A use case could be that anyone can provide a piece of content with the accompanying embedding and this can be used for semantic search. I.e. the search engine does not have to compute the embedding of everything, just the query.
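A toy sketch of that protocol, with a trivial word-count function standing in for the shared embedding model (a real deployment would need all parties to agree on one actual model, which is exactly the standardization problem raised in this thread):

```python
from math import sqrt

VOCAB = ["embedding", "hash", "model", "search", "query"]  # stand-in "model"

def embed(text):
    """Toy stand-in for a shared embedding model: vocabulary word counts."""
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Content providers ship (content, embedding) pairs; the engine never re-embeds them.
index = [(doc, embed(doc)) for doc in [
    "a hash maps input to a fixed digest",
    "an embedding model maps text to a vector",
]]

query_vec = embed("embedding model")  # only the query is embedded at search time
best = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best[0])
```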
mikewarot|2 years ago
persnickety|2 years ago
You can recover the original word from the embedding, but not from the hash.
A hash function will return very distant outputs for very similar inputs; an embedding will return similar ones.
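A quick illustration of the contrast, using SHA-256 on the hash side and a toy character-bigram overlap standing in for an embedding:

```python
import hashlib

a, b = "the cat sat", "the cat sat."  # inputs differing by one character

# Cryptographic hash: one changed byte yields an unrelated-looking digest.
ha = hashlib.sha256(a.encode()).hexdigest()
hb = hashlib.sha256(b.encode()).hexdigest()
print(ha[:16], hb[:16])

# Toy "embedding": character-bigram counts -- similar inputs stay similar.
def bigram_counts(s):
    grams = [s[i:i + 2] for i in range(len(s) - 1)]
    return {g: grams.count(g) for g in set(grams)}

va, vb = bigram_counts(a), bigram_counts(b)
shared = sum(min(va[g], vb.get(g, 0)) for g in va)
total = max(sum(va.values()), sum(vb.values()))
print(shared / total)  # high overlap, unlike the digests
```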
travisd|2 years ago
Returning unrelated (distant) hashes for similar inputs is a possible property of a hash function, and oftentimes a desirable one (especially for cryptography), but there are in fact use cases where one wants similar inputs to map to similar (or the same) hash. https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing
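In miniature, random-hyperplane LSH assigns each vector one bit per hyperplane (which side it falls on), so nearby vectors share most signature bits. A sketch with small hand-picked "planes" for readability; a real implementation would draw many at random:

```python
PLANES = [(1, 0), (0, 1), (1, 1), (1, -1)]  # 4 "hyperplanes" in 2-D

def lsh_signature(vec):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if px * vec[0] + py * vec[1] > 0 else 0 for px, py in PLANES)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

base = (2.0, 1.0)
near = (2.1, 0.9)    # nearby vector: identical signature
far = (-1.0, -2.0)   # distant vector: mostly flipped bits

print(hamming(lsh_signature(base), lsh_signature(near)))  # → 0
print(hamming(lsh_signature(base), lsh_signature(far)))   # → 3
```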
dinosaurdynasty|2 years ago
Not all hashes are cryptographic hashes.