Major_Grooves | 4 years ago

Hi, I’m one of the (prospective) co-founders of TiloDB, a serverless “entity-resolution” technology.
We built TiloDB as the tech team at a European consumer credit bureau, where we faced the technical challenge of assembling hundreds of millions of data sets about tens of millions of people in a way that scales, allows fast searching, and doesn't break the bank.
We tried various technologies, such as graph databases, but none of them could give us satisfactory performance.
So we turned to the opportunities of serverless technology (AWS specifically) to build a new type of entity resolution technology.
In this article we write about the technology breakthroughs that led to TiloDB, and there is also an interactive demo where you can submit data, see it linked, and see other people submitting data in real-time.
We want to spin the tech out into a new company and release it as OSS, so we are keen to hear about potential use cases you might have.
willvarfar | 4 years ago

Setting up a business based on a new DB tech that has one user, though, is tricky.

Playing devil's advocate: how do you plan to make money? Who are the users? Why do they turn to TiloDB, how do they learn about it, how do they adopt it, and how are they convinced to pay you something for it? Etc.
lmeyerov | 4 years ago

ER for identity graphs is a great use case! We see teams do this a lot, and with not-great tools (e.g. users/IPs in Splunk/Elastic, which are better suited to simpler matches).

For one Graphistry project, we run a single-node neo4j with 0.5B nodes/edges, so something in the description isn't adding up for me with regard to performance. Maybe an open benchmark would help?

I do agree indexing matters, as that was night-and-day for our use cases. For ML workloads we are looking at vector indexes, which graph DBs do not currently support. The ones in this article are on text and take >100ms, so I'm curious.
skafoi | 4 years ago

Regarding the performance versus neo4j: the challenge for an honest and fair test is how to properly compare a server-based solution with a serverless one. TiloDB automatically scales up and down without any further interaction, because it uses Lambdas for all calculations. So would you compare it with a relatively small neo4j instance or with a large cluster? I honestly don't know.
When we started doing this internally at our previous company, we did of course test the edge cases - after all, we didn't want to have to build our own solution. For graph databases the problem cases are either a lot of edges leaving one node, or a long chain of edges. The first scenario was still handled OK-ish: response times of around 6 seconds for 1,000 nodes, if I remember correctly. The second scenario was a total fail. The problem with the latter lies in the transitivity: a graph database has to jump from one node to the next one, and so on.

To be fair though: when it comes to dynamic entities - choosing at query time which rules are relevant - graph databases might be the better choice, especially when response times don't matter.
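A toy illustration of the two edge cases, in plain Python standing in for any hop-by-hop traversal engine (the numbers are synthetic, not a benchmark): a wide fan-out resolves in one traversal round, while a chain needs one sequential round per edge.

```python
from collections import deque

def resolve_entity(graph, start):
    """Collect every record transitively linked to `start` via BFS.

    `hops` counts frontier expansions -- roughly the number of sequential
    round-trips a hop-by-hop graph engine has to make."""
    seen = {start}
    frontier = deque([start])
    hops = 0
    while frontier:
        next_frontier = deque()
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if next_frontier:
            hops += 1
        frontier = next_frontier
    return seen, hops

# Scenario 1: 1,000 edges leaving one node -- wide but shallow, one hop.
star = {"hub": [f"r{i}" for i in range(1000)]}
print(resolve_entity(star, "hub")[1])   # 1 hop for 1,001 records

# Scenario 2: a chain of 1,000 records -- one sequential hop per edge.
chain = {f"r{i}": [f"r{i+1}"] for i in range(999)}
print(resolve_entity(chain, "r0")[1])   # 999 hops for 1,000 records
```

The chain is the killer: each hop is a dependent lookup that cannot start before the previous one finishes, so latency grows linearly with chain length no matter how fast a single hop is.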
The response times quoted in the article are for the whole process of searching and returning the entity. The indexes themselves are obviously a lot faster - to be precise, we use DynamoDB to store the indexes, which usually returns results in <10ms. Compared to other databases this may still sound slow, but we know we won't run into scaling issues this way, and that's what matters most to us right now.
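To make the index pattern concrete, here is a minimal sketch with an in-memory dict standing in for the DynamoDB table (partition key = index key, value = set of record ids). The key-derivation rules and field names are illustrative assumptions, not TiloDB's actual schema:

```python
# Dict keyed the same way a DynamoDB table would be partition-keyed;
# lookup() reproduces the single key-value read that returns in <10ms.
fuzzy_index = {}  # index key -> set of record ids

def index_record(record_id, record):
    """Derive normalised keys for a record and write them to the index.
    The normalisation rules here are illustrative only."""
    keys = [
        f"email:{record['email'].strip().lower()}",
        f"name:{record['last_name'].strip().upper()}",
    ]
    for key in keys:
        fuzzy_index.setdefault(key, set()).add(record_id)

def lookup(key):
    """One key-value read -- no scan, no join, no traversal."""
    return fuzzy_index.get(key, set())

index_record("r1", {"email": "Jane@Example.com", "last_name": "Doe"})
index_record("r2", {"email": "jane@example.com ", "last_name": "DOE"})
print(sorted(lookup("email:jane@example.com")))   # ['r1', 'r2']
```

Because every read is a single-key get, the access pattern stays flat as the table grows, which is the scaling property the comment above is optimising for.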
Major_Grooves | 4 years ago

I'd be very interested to hear people's thoughts on OSS licences. We are rather new to that world, so we are very rapidly learning about the differences between Open Core, the Elastic License 2.0, the Apache Licence, etc.

What licence should a new company like us adopt when we want to build a community but also want to commercialise the technology, especially when we already have an "enterprise-ready" version of the tech?
trhway | 4 years ago

From experience with a similar product (where we had a similar-sounding approach to entity resolution, based on rule-based fuzzy indexes and fuzzy matching, and it worked for tens of millions of entities on a regular, though beefy, RDBMS more than a decade ago): the issue isn't so much technological. It is that each customer/client has custom everything when it comes to ER, and thus scaling that business is extremely hard (that specific business collapsed primarily for that reason).
skafoi | 4 years ago

I would really love to hear more about your experience with the client customizations. So far the two kinds I can see are domain-model customization and rule customization, with rule customization obviously being the more challenging one.
Major_Grooves | 4 years ago

The real technical challenge is the "transitive hop" problem that we describe. The matching of the data is not so complicated - that can be done with any technology - but searching with data A and getting result Z: that was the tricky bit that took us years to solve, and it was only possible thanks to serverless tech.
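The thread doesn't spell out TiloDB's actual algorithm, but a classic way to make "search with A, get Z" cheap is to resolve the transitivity at write time rather than query time, e.g. with a disjoint-set (union-find) structure. A rough sketch, under that assumption:

```python
class UnionFind:
    """Disjoint-set forest: merges happen at write time, so a later
    lookup never has to walk the match chain hop by hop."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind()
# Pairwise matches only: A~B (shared phone), B~C (shared email), ... Y~Z.
records = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for a, b in zip(records, records[1:]):
    uf.union(a, b)

# Searching with A reaches Z without a 25-hop traversal at query time.
print(uf.find("A") == uf.find("Z"))   # True
```

Each incoming match merges two clusters in near-constant amortised time, so the per-query cost stays flat however long the match chain grows; the hard part a production system adds on top is un-merging when a rule changes.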