It doesn't sound like the approaches are incompatible. You can use minhash LSH to search a large set and get a top-k list for any individual, then use a weighted average with penalty rules to decide which of those qualifies as a dupe or not. Weighted minhash can also be used to efficiently add repeats to give some additional weighting.
No comments yet.