throw_14JAS | 6 years ago
My use case is a bit different -- I was doing a lot of database cleanups, particularly CRMs. I rewrote/reused code to build a duplicate detector a number of times; I always wished there were a service that I could send data to that would flag my dupes. I was even using human labelers to train domain-specific models.
chiscript | 6 years ago
throw_14JAS | 6 years ago
Specific models might be an interesting add-on. Address parsing, normalization, and deduplication (with potential covariates like phone number, email address, etc.) is a massive pain in the ass for any data engineer who works with sales or marketing folks. Their databases (CRMs) are awful -- it was always a chore to clean these up, but it measurably saved money (imagine you mail physical cards and only want 1 per customer... but you have 5 different contacts at that company for 3 unique individuals).
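To make the covariate idea concrete, here's a minimal sketch of dedup-candidate matching: exact match on a hard key (email or phone) or fuzzy match on a normalized name. Field names and the 0.85 threshold are illustrative assumptions, not anyone's actual product; real CRM exports vary a lot.

```python
# Sketch of covariate-assisted contact deduplication.
# Assumptions: records are dicts with hypothetical "name"/"email"/"phone"
# keys; 0.85 is an arbitrary name-similarity threshold.
import re
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

def is_dupe(a, b, name_threshold=0.85):
    # Hard covariates first: an exact email or phone match is decisive.
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True
    # Otherwise fall back to fuzzy similarity of normalized names.
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= name_threshold

contacts = [
    {"name": "Jane Q. Smith", "email": "jane@acme.com", "phone": ""},
    {"name": "jane q smith",  "email": "",              "phone": "555-0101"},
    {"name": "John Doe",      "email": "john@acme.com", "phone": ""},
]
print(is_dupe(contacts[0], contacts[1]))  # names normalize to the same string
print(is_dupe(contacts[0], contacts[2]))
```

In practice you'd block on something cheap (e.g. first letter of name, ZIP code) before running pairwise comparisons, since all-pairs matching is quadratic in the number of contacts.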
I would have paid for a deduplication service -- say, quarterly batches at somewhere >$500/quarter for e.g. 20-50k contacts.
The one-size-fits-all model isn't really a value add for me; that wasn't so much my issue. For other target users, I can see the use -- for them, the interface is the value add. Especially if you can read/write Excel files directly.
Stop words aren't something I used in my deduplication efforts. How many of your users request or use this? What kind of stop words do you want to exclude from comparing two entries? I would be worried that stop words still carry information: "The Store" versus "Store" might be significant.
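The "The Store" vs. "Store" concern can be shown in a couple of lines -- stripping stop words collapses two possibly distinct business names into the same key (treating "the" as a stop word here is just an assumption; stop lists vary):

```python
# Illustration of the information-loss concern with stop-word removal.
STOP_WORDS = {"the", "a", "an", "of"}  # assumed stop list for this example

def strip_stops(name):
    return " ".join(w for w in name.lower().split() if w not in STOP_WORDS)

# With stop words removed, the two names collide (a possible false merge):
print(strip_stops("The Store") == strip_stops("Store"))  # True
# Without removal, they stay distinct:
print("the store" == "store")                            # False
```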
C1sc0cat | 6 years ago
I used Levenshtein edit distance to generate a list of potential duplicates for human review -- I extended MySQL with a Levenshtein distance function for speed.
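A pure-Python version of that candidate-generation step might look like the sketch below (the original did the distance computation inside MySQL as a custom function for speed; the names and the distance cutoff of 3 here are made up for illustration):

```python
# Levenshtein edit distance via the classic two-row dynamic program,
# then all pairs within a small distance are flagged for human review.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

names = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex Inc"]
candidates = [(a, b)
              for i, a in enumerate(names)
              for b in names[i + 1:]
              if levenshtein(a.lower(), b.lower()) <= 3]
print(candidates)  # [('Acme Corp', 'ACME Corp.')]
```

Pushing this into the database (as a UDF or, in MySQL 8+, possibly a stored function) avoids shipping every row to the client, but the comparison is still pairwise, so blocking/pre-grouping matters at CRM scale.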