throw_14JAS | 6 years ago
My use case is a bit different -- I was doing a lot of database cleanups, particularly CRMs. I rewrote/reused code to build a duplicate detector a number of times; I always wished there were a service that I could send data to that would flag my dupes. I was even using human labelers to train domain-specific models.
chiscript | 6 years ago
throw_14JAS | 6 years ago
Specific models might be an interesting add-on. Address parsing, normalization, and deduplication (with potential covariates like phone number, email address, etc.) is a massive pain in the ass for any data engineer who works with sales or marketing folks. Their databases (CRMs) are awful -- it was always a chore to clean these up, but it measurably saved money (imagine you mail physical cards and only want 1 per customer... but you have 5 different contacts at that company for 3 unique individuals).
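To make the covariate idea concrete, here's a minimal sketch of dedup-candidate matching: exact match on a hard key (email or phone) or fuzzy match on a normalized name. Field names and the 0.85 threshold are illustrative assumptions, not anyone's actual product; real CRM exports vary a lot.

```python
# Sketch of covariate-assisted contact deduplication.
# Assumptions: records are dicts with hypothetical "name"/"email"/"phone"
# keys; 0.85 is an arbitrary name-similarity threshold.
import re
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

def is_dupe(a, b, name_threshold=0.85):
    # Hard covariates first: an exact email or phone match is decisive.
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True
    # Otherwise fall back to fuzzy similarity of normalized names.
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= name_threshold

contacts = [
    {"name": "Jane Q. Smith", "email": "jane@acme.com", "phone": ""},
    {"name": "jane q smith",  "email": "",              "phone": "555-0101"},
    {"name": "John Doe",      "email": "john@acme.com", "phone": ""},
]
print(is_dupe(contacts[0], contacts[1]))  # names normalize to the same string
print(is_dupe(contacts[0], contacts[2]))
```

In practice you'd block on something cheap (e.g. first letter of name, ZIP code) before running pairwise comparisons, since all-pairs matching is quadratic in the number of contacts.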
I would have paid for a deduplication service -- say, quarterly batches at somewhere >$500/quarter for e.g. 20-50k contacts.
The one-size-fits-all model isn't really a value add for me; that wasn't so much my issue. For other target users, I can see the use -- for them, the interface is the value add. Especially if you can read/write Excel files directly.
Stop words aren't something I used in my deduplication efforts. How many of your users request or use this? What kind of stop words do you want to exclude from comparing two entries? I would be worried that stop words still carry information: "The Store" versus "Store" might be significant.
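The "The Store" vs. "Store" concern can be shown in a couple of lines -- stripping stop words collapses two possibly distinct business names into the same key (treating "the" as a stop word here is just an assumption; stop lists vary):

```python
# Illustration of the information-loss concern with stop-word removal.
STOP_WORDS = {"the", "a", "an", "of"}  # assumed stop list for this example

def strip_stops(name):
    return " ".join(w for w in name.lower().split() if w not in STOP_WORDS)

# With stop words removed, the two names collide (a possible false merge):
print(strip_stops("The Store") == strip_stops("Store"))  # True
# Without removal, they stay distinct:
print("the store" == "store")                            # False
```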
C1sc0cat | 6 years ago
I used Levenshtein edit distance to generate a list of potential duplicates for human review -- I extended MySQL with a Levenshtein distance function for speed.
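A pure-Python version of that candidate-generation step might look like the sketch below (the original did the distance computation inside MySQL as a custom function for speed; the names and the distance cutoff of 3 here are made up for illustration):

```python
# Levenshtein edit distance via the classic two-row dynamic program,
# then all pairs within a small distance are flagged for human review.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

names = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex Inc"]
candidates = [(a, b)
              for i, a in enumerate(names)
              for b in names[i + 1:]
              if levenshtein(a.lower(), b.lower()) <= 3]
print(candidates)  # [('Acme Corp', 'ACME Corp.')]
```

Pushing this into the database (as a UDF or, in MySQL 8+, possibly a stored function) avoids shipping every row to the client, but the comparison is still pairwise, so blocking/pre-grouping matters at CRM scale.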