skafoi
|
2 years ago
|
on: Automounted, Encrypted SSDs in Windows Subsystem
Not sure if I understood your comment correctly, but this is not about existing drives (though the approach could be applied to them); it is about having a dedicated drive running within WSL with native performance. WSL handles existing drives (those readable under both Windows and Linux) as network drives using the 9p protocol. And, frankly, the performance suffers. E.g. I only reached write speeds of ~125 MB/s via 9p (with the disk NTFS-formatted and encrypted with BitLocker), while native LUKS encryption + ext4 reached 460 MB/s. With this approach, transfers between Windows and Linux would still be slow, though.
Thinking about that... I should probably have mentioned this in the article.
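For anyone wanting to reproduce a comparison like the one above, a minimal sketch of a sequential write benchmark in plain Python; the paths and sizes are placeholder assumptions, not the exact methodology used for the numbers quoted:

```python
import os
import tempfile
import time

def write_throughput_mb_s(directory, total_mb=64, block_kb=1024):
    """Write total_mb MiB in block_kb KiB blocks to a file in `directory`
    and return the achieved sequential write speed in MiB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    path = os.path.join(directory, "throughput_test.bin")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually hit the disk
    elapsed = time.perf_counter() - start
    os.remove(path)
    return total_mb / elapsed

# e.g. compare a 9p-backed path against a natively mounted one:
# print(write_throughput_mb_s("/mnt/c/tmp"))  # 9p mount (hypothetical path)
# print(write_throughput_mb_s("/tmp"))        # native ext4 path
```

Running it once per mount point gives a rough MB/s figure for each; repeated runs with larger `total_mb` reduce caching noise.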
skafoi
|
3 years ago
|
on: Entity Resolution: Reflections on the most common data science challenge
I am one of the developers (and a co-founder) of TiloRes. We are not using a graph database for that purpose, because graph databases would be way too slow for huge entities.
Instead we're using a combination of AWS serverless services, mainly DynamoDB and S3 for storing the data.
skafoi
|
3 years ago
|
on: Show HN: A 3D city created with GitHub real-time contribution datas
That looks pretty. It lags a bit in my browser, though, as long as the "Real-Time Pull Requests" are not paused.
skafoi
|
4 years ago
|
on: Show HN: Alicorn Cloud – Easily move between AWS, GCP and Azure
Do you also offer some kind of abstraction library for, e.g., working with DynamoDB on AWS and then switching to GCP? Or would I still have to prepare my code for that myself?
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
The idea behind that is that typical enterprise features, like authorization for certain records or even attributes, are not publicly available. Encryption of the data in S3 and other parts may also be an enterprise-only feature.
Other things, like API authorization and preventing public access to S3 and the like, must be included in the OSS version for the same reasons you mentioned.
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
Thanks for your input.
1k hops is also not something we see on a regular basis in our old business, which is mostly about people moving houses and transactional data from payment service providers. People with money issues seem to move a lot more often, and fraud cases often involve a lot of hops, too.
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
Regarding the performance of Neo4j: the challenge for an honest and fair test would be how to properly compare a server-based solution with a serverless one. TiloDB automatically scales up and down without any further interaction because it uses Lambdas for all calculations. So would you compare it with a relatively small Neo4j instance or with a large cluster? I honestly don't know.
When we started doing this internally for our previous company, we obviously did test the edge cases; after all, we didn't want to build our own solution. For graph databases, the problematic cases are either a lot of edges leaving one node or a long chain of edges. The first scenario was still handled OK-ish: response times of around 6 seconds for 1,000 nodes, if I remember correctly. The second scenario was a total fail. The problem with the latter lies in the transitivity, as a graph database has to jump from one node to the next, and so on.
To be fair, though: when it comes to dynamic entities, i.e. choosing which rules are relevant, graph databases might be the better choice - especially when response times don't matter.
The response times provided in the article are for the whole process of searching for and returning the entity. The indexes themselves are obviously a lot faster - to be precise, we are using DynamoDB for storing the indexes, which most of the time returns results in <10ms. Compared to other databases this may still sound slow, but we know that we won't run into scaling issues this way, and that is currently what matters most to us.
I hope what I wrote makes sense.
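To illustrate the index-then-fetch pattern described above, a hedged toy sketch with plain dicts standing in for the DynamoDB index and the underlying record store; all names and fields are made up for illustration and are not TiloDB's actual schema:

```python
# The index maps a searchable attribute value to an entity ID
# (DynamoDB's role); the record store holds the full entities
# (S3's role). The search itself is a constant-time key lookup;
# only downloading and aggregating the entity grows with its size.

index = {
    ("email", "jane@example.com"): "entity-42",
    ("phone", "+49-123-456"): "entity-42",
}

record_store = {
    "entity-42": [
        {"name": "Jane Doe", "email": "jane@example.com"},
        {"name": "J. Doe", "phone": "+49-123-456"},
    ],
}

def resolve(attribute, value):
    """Look up the entity ID in the index, then fetch all of its records."""
    entity_id = index.get((attribute, value))
    if entity_id is None:
        return None, []
    return entity_id, record_store[entity_id]

entity_id, records = resolve("email", "jane@example.com")
# entity_id == "entity-42"; records holds both source records
```

The lookup cost stays the same no matter how large the entity is; only the second step (fetching `records`) scales with the number of data sets in it.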
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
Thanks a lot. That sounds similar to the issues we had at our previous company. We provide some basic ETL functionality to tackle these. It will be interesting to see where this eventually ends up, and how much of it our solution covers versus where it is better to use the ETL tools those companies probably already have.
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
That sounds mostly like deduplication, which is often used in marketing contexts. There are indeed some good solutions out there, but in our experience they have difficulties handling huge amounts of data (>1 billion data sets), and they are often batch-based, so your data is always outdated, whereas we constantly add new data in near real-time.
skafoi
|
4 years ago
|
on: A novel approach to entity resolution using serverless technology
I would really love to hear more about your experience with the client customizations. So far, the two things I can see are domain model customization and rule customization, with rule customization obviously being the more challenging one.
skafoi
|
4 years ago
|
on: Show HN: TiloDB – serverless entity resolution technology
Not yet. But we hope to make it open source in the future.
skafoi
|
4 years ago
|
on: Show HN: TiloDB – serverless entity resolution technology
1. We did indeed run benchmarks when we started with this. But since that was quite some time ago, I can't recall the exact numbers right now. For our use case, there are basically two extremes:
a) Everything being the same data: assuming proper deduplication happened, there is one node and everything else is still one node away without being connected to each other. This is still a case where graph databases work quite OK (somewhere around 6 seconds, if I remember correctly).
b) Having a long chain of data: A->B->C->D (in our use case, basically a person who moves very often). Besides having to write an utterly complex query for that, I remember that I was not able to receive any results within an acceptable time.
2. That is only for this use case. But the underlying matching library we developed can work with any kind of structured data. It would be interesting to actually use it in some other contexts, as we have not tested that yet.
3. Currently no - it's a pure GraphQL API at the moment. But I was thinking about that. To actually support something like this, it would be very interesting to also focus on cross-entity linking to make it really cool. We have something like this, but haven't really focused on it yet.
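As a rough illustration of why the chain case (b) need not be slow outside a graph database: the transitive links can be collapsed at write time, e.g. with a union-find structure, so a query becomes a single lookup instead of one hop per edge. This is just a sketch under that assumption, not TiloDB's actual implementation:

```python
# Instead of traversing A->B->C->D at query time (one hop per edge,
# which is what makes long chains slow in a graph database), collapse
# the transitive matches when records are ingested. Afterwards, every
# record resolves to its entity root in near-constant time.

parent = {}

def find(x):
    """Return the entity root of x, with path compression."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # flatten the chain as we walk it
        x = parent[x]
    return x

def union(a, b):
    """Record a match between a and b, merging their entities."""
    parent[find(a)] = find(b)

# Ingest the pairwise matches of a long chain
# (e.g. a person who moves very often):
for left, right in [("A", "B"), ("B", "C"), ("C", "D")]:
    union(left, right)

# Query time: no traversal needed, A and D share one entity root.
assert find("A") == find("D")
```

The write path does a little extra work per ingested match, but the read path no longer depends on the chain length at all.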
skafoi
|
4 years ago
|
on: Show HN: TiloDB – serverless entity resolution technology
Just had another look at what is written on the home page regarding "constant speed". That is indeed misleading. Sorry for that.
skafoi
|
4 years ago
|
on: Show HN: TiloDB – serverless entity resolution technology
Let's start with the last thing:
It is close to constant speed - as pointed out in the article, the more data sets you have, the longer it takes to download and aggregate the entity. But the search itself always consists of the same steps and is therefore constant. Obviously, an entity with 10,000 data sets in it will not load in 150ms - but finding the place where the full data is stored is easily doable in that time.
I agree that a server you bought once (or rented) is a predictable cost. But what happens in case of a burst? You would have to buy another server (without knowing whether you will still need it tomorrow). Predictable in our case means that, since each request goes through the same steps, you can tell the exact cost per request. And from my point of view, that is what matters.
Thinking about that... I should probably have mentioned this in the article.