It's surprisingly difficult, and the "obvious" techniques (just do embeddings) don't really work. I wrote about it and ran benchmarks here: https://joecooper.me/blog/redundancy/
Thank you for actually testing and measuring an implementation and a hypothesis. I appreciate the leads for evaluating my own similarity problems and the efficacy of my approaches to them.
donavanm|14 days ago