jakequist | 7 years ago

I've felt your pain. So much so that I took 6 months off and put a lot of groundwork into starting a company that would solve this problem. But in the end, I decided to abandon it.

I realized that in order to be 10x better than the alternatives, I was going to need to solve some very tricky AI problems. For example, accurately deduplicating a customer record "John Doe" vs "Johnathon Doe" is not straightforward. Maybe it's two different people? Maybe it's just a spelling mismatch? The system needs a great deal of context to determine whether the data is really duplicated. And even if it is, there may be a perfectly good reason for the spelling mismatch (e.g. one table holds his preferred name, while the other is just referential). In the end, deduplication often comes down to the requirements of the company, and it's hard to generalize.
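
To make the "John Doe" vs "Johnathon Doe" ambiguity concrete, here is a minimal Python sketch using the standard library's difflib. The record layout, the email field, and the 0.7 threshold are all assumptions for illustration, not anything from a real product. It only shows that a name score alone doesn't settle the question and that some extra context has to break the tie.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def likely_same_person(rec_a: dict, rec_b: dict, name_threshold: float = 0.7) -> bool:
    """Hypothetical rule: names must be similar AND the emails must match.

    The threshold and the choice of email as the context field are
    assumptions; a real system would tune both to company requirements.
    """
    similar = name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
    email_a, email_b = rec_a.get("email"), rec_b.get("email")
    same_email = email_a is not None and email_a == email_b
    return similar and same_email


a = {"name": "John Doe", "email": "jdoe@example.com"}
b = {"name": "Johnathon Doe", "email": "jdoe@example.com"}
c = {"name": "Johnathon Doe", "email": "other@example.com"}

score = name_similarity(a["name"], b["name"])  # similar, but well short of 1.0
same_ab = likely_same_person(a, b)             # same email tips it to a match
same_ac = likely_same_person(a, c)             # same names, different email
```

Even with the email check, this would still merge a preferred-name record with a legal-name record, which may or may not be what the business wants.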

I think there's space in the market for this kind of business, but it'll be a slog. Unless you have a 10x solution (i.e. super AI), you'll be competing with the likes of Trifacta, etc. And it's hard to compete with that kind of sales force.

Really good question. Thanks for posting.

PaulHoule|7 years ago

@jakequist I think Trifacta and similar tools are aimed at "data analysis" more than operations.

I think the question is about line-of-business software, and the issues there are very different.

For instance, there is a literature on record matching, and good techniques exist, but without an exception-handling workflow you have no way to deal with the unusual cases the code turns up.
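
A rough Python sketch of what I mean by an exception-handling workflow: the matcher only acts on clear cases and parks everything ambiguous in a queue for a person to resolve. The class name, the two thresholds, and the in-memory list are assumptions made up for this example, and difflib stands in for whatever matching technique you actually use.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class MatchTriage:
    """Routes candidate pairs to merge, distinct, or human review.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    merge_at: float = 0.92
    distinct_below: float = 0.60
    review_queue: list = field(default_factory=list)

    def decide(self, a: str, b: str) -> str:
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= self.merge_at:
            return "merge"
        if score < self.distinct_below:
            return "distinct"
        # Ambiguous case: record it for a person instead of guessing.
        self.review_queue.append((a, b, round(score, 3)))
        return "review"


triage = MatchTriage()
exact = triage.decide("John Doe", "John Doe")
unclear = triage.decide("John Doe", "Johnathon Doe")
different = triage.decide("John Doe", "Alice Smith")
```

The interesting engineering is in what happens to review_queue afterwards (who sees it, how decisions are recorded and fed back), which this sketch deliberately leaves out.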

I would love to talk and share notes about what you did.

tabtab|7 years ago

Paul, you are correct. It's more about preventing duplication in the software design rather than cleaning it up after the fact (which is an interesting problem, but a different topic).

Think of it this way: one could build a detailed Entity Relationship diagram (or OOP equivalent) in a machine-readable format with all the relationship and column-size constraints defined. One could then push a button and have a machine generate a working version of the software. Those tools do exist. But they are usually missing useful details and result in UIs poorly tuned for how employees will likely be using the system.
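
As a toy illustration of the "push a button" idea, here is a Python sketch that turns a machine-readable entity definition into CREATE TABLE statements and loads them into an in-memory SQLite database. The Column class, the schema contents, and the helper name are all invented for this example; real generators also produce forms, validation, and so on.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str
    max_length: Optional[int] = None   # column-size constraint
    nullable: bool = True
    references: Optional[str] = None   # relationship, e.g. "customer(id)"


def create_table_sql(table: str, columns: list) -> str:
    """Render one entity definition as a CREATE TABLE statement."""
    parts = []
    for col in columns:
        sql_type = f"{col.sql_type}({col.max_length})" if col.max_length else col.sql_type
        line = f"{col.name} {sql_type}"
        if not col.nullable:
            line += " NOT NULL"
        if col.references:
            line += f" REFERENCES {col.references}"
        parts.append(line)
    return f"CREATE TABLE {table} (\n  " + ",\n  ".join(parts) + "\n);"


# Hypothetical two-entity model: a purchase belongs to a customer.
SCHEMA = {
    "customer": [
        Column("id", "INTEGER", nullable=False),
        Column("name", "VARCHAR", max_length=80, nullable=False),
    ],
    "purchase": [
        Column("id", "INTEGER", nullable=False),
        Column("customer_id", "INTEGER", nullable=False, references="customer(id)"),
    ],
}

ddl = [create_table_sql(table, cols) for table, cols in SCHEMA.items()]

# SQLite accepts the declared length but does not enforce it.
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)
```

Nothing in the definition says which fields a clerk fills in most often or in what order, which is exactly the kind of detail the generated UI ends up missing.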

Many of the tweaks to make it "practical" will be exceptions or local customizations to the original ER diagram data. Those customizations and deviations are the bottleneck, so in practice most stacks duplicate the information instead. See DRY ("Don't Repeat Yourself") on software engineering slang sites.
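
One way to picture the non-duplicating alternative, as a hedged Python sketch: keep the field definition in one place and express each local customization as a small override layered on top of it. The field names, screen names, and dictionary layout are assumptions invented for this example.

```python
# Single source of truth, as it might be derived from the ER definition.
BASE_FIELDS = {
    "customer_name": {"max_length": 80, "label": "Customer name", "required": True},
}

# Local customizations: only the deviations are written down per screen.
SCREEN_OVERRIDES = {
    "quick_order": {"customer_name": {"label": "Name"}},
}


def field_spec(screen: str, field_name: str) -> dict:
    """Return the base definition with any per-screen overrides applied."""
    spec = dict(BASE_FIELDS[field_name])  # copy so the base stays untouched
    spec.update(SCREEN_OVERRIDES.get(screen, {}).get(field_name, {}))
    return spec


quick = field_spec("quick_order", "customer_name")
admin = field_spec("admin", "customer_name")
```

If max_length changes in the base, every screen picks it up. The hard part in practice is that real deviations are rarely as tidy as a label change.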