Using a Hilbert Curve for indexing is a pretty standard approach for geospatial queries. Nothing new here...
However, using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like, it doesn't have to be arbitrary points along your index. Using 1-dimension to shard 2D data isn't optimal.
For example, construct a heatmap of your 'load score' and shard based on that, in two dimensions. Then use an S2 curve to index inside that shard.
However, using a Hilbert Curve for sharding doesn't seem like the best approach.
Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.
> using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like, it doesn't have to be arbitrary points along your index. Using 1-dimension to shard 2D data isn't optimal.
If you want to shard by proximity (items close in space are likely to be in the same shard, then the transform to 1d is the way to go, why wouldn't it be? What is your definition of "optimal"?
Sharding by proximity or by something else depends on the relative frequency of queries by location or something else. If you shard by location, then a query to one location goes (ususally) to one shard. That should scale better. Otherwise, each location-based query goes to each shard.
Thanks for the suggestion and there are definitely room for optimization.
In the original design we did consider approach similar to the "heatmap approach" you mentioned, it would reduce the shard movement for people who commutes in the city.
Later we figured that simply sharding along on the Hilbert Curve already give what we need, and the shard move is unavoidable, we would explain more details around the shard move in future blog.
I present to you yak shedding: pointless bickering about an irrelevant detail that everyone has an opinion of, and the result of this decision process has such fundamental implications that the whole project is re-engineered from scratch to include a yak farm.
How big of a problem is the fact that shards constructed this way don't like crossing the "big" boundaries in the S2 curve? Especially if a high-usage location happens to straddle one of those boundaries (like, IIRC, Toronto)?
Is this an advertisement for S2? there is no description of alternatives, of how the choice was made, no comparison to any other system or provider, ...
Redis has geo search for example, why not using it?
geo-indexing is different from geo-sharding, search indexes normally have geo-indexing, but it is still one big index handles every single search. The alternative(geohash) is mentioned in article, which has high distortion around earth poles.
First of all, we normally do our sharding by the hash of the primary key. Typically it is the publisherId and hash of the name of a “stream”, which is our general dynamic data structure. What this does is essentially distribute the load evenly between hashes without having to worry about the “water” stuff mentioned above.
Ok now about location based search...
1. We use GeoHash for addressing the surface of the Earth. For lookups, we calculate actual corners of the lat-lng rectangle in Haversine distance and then translate it to geohash.
2. Technically we could have just used a database index at this point and be done with it. But we want to handle both PULL requests and PUSH notifications when something happens nearby.
3. So given a radius the subscriber is interested in, we set up 4 streams for SUBSCRIBERS that together (as rectangles) cover the circle surrounding the location and radius.
4. Then we also set up streams for PUBLISHERS which correspond to all the various drop-down values we have for the radius. This is compicated by the fact that we have one set for metric system and another for imperial. But we just make one stream for each of them. The more radii available the more streams you have to post to (like 10-20) but posting doesn’t happen that often and it’s better to precompute this stuff.
5. Relations between streams are one of the many features that streams do out of the box. So basically the user can just subscribe to a certain location and radius, interest, etc. and get either realtime updates or offline notifications. The item being posted is itself a stream and what’s being inserted is relations between it and the category stream the person is interested in.
Thus you can also see right away who is interested in what. In the future however we plan to encrypt the data end to end so the servers won’t know this info. Then the location search will be harder. (qbix.com/blog)
Elasticsearch has geo-indexing as well(based on geohash internally), and by default it does id hashing similar to what you said(murmurhash3), we actually leverages that for location based searches.
The challenge addressed in the blog is not in how to search/address(as said Elasticsearch handles it already), it is about how to distribute the load so calculation only happens on limited nodes, and reduce the index size so it can be more performant.
When implementing similar algorithms on a load-balancing reverse proxy cache-friendliness and lockless reads and updates in the "load scores" can be way more important than optimizing for CPU cycles.
Another option would be to map coords into very small 2D tiles and than look up "tile -> shard number" with a simple, large, array. A variable number of tiles would land into the same shard. Fast and lockless reads and writes. I wonder if it could be faster.
Very interesting. Used to do something similar for a project I work on. Used geohash based sharding, instead of complex curves like that. Worked great, but with recent advances in ES its no longer needed. They don't seem to mention which version of ES they are on, and would be curious how it has changed over time in their experience.
Same strategy can be applied regardless of ES version. ES6 does has index sorting and other optimizations, but not as good as pre-sharding for location based services, cases like Tinder.
They mentioned the distortion near the poles when using geohashing (which uses z-order curves), but I doubt they have many users at the North or South Pole.
a bit off-topic, it's funny you use all this fancy technology behind the scenes and your iOS app is buggy, slow and always problems with login and notifications which do not update.
I came here to write the same thing, so have an upvote. Tinder’s iOS app is such a gigantic travesty of engineering that I’m honestly surprised to hear they have an actual engineering team.
I’m sorry, but Tinder and similar swipe-on-photos cargo-cult hookup apps have de facto cannibalized the online dating-scene with its entire focus on appearance (and atrophying genuine communication), promoting a major captological phenomenon of manipulating women into rating men more negatively based on nothing other than appearances. It contributes to unrealistic expectations, disposable interactions, paradox of choice and a lack of authenticity. If you’re not perfect and super photogenic with 4 professional “candid” pics, you have zero chance on these anti-relationship apps. tldr: it’s far phonier than it needs to be, encourages casual wasting of time and leads to a society where people stop talking to/slam the door on the real world. If you move to a new area, the solid, normal people aren’t going to be on hookup apps, but psycho nymphos with herpes are most def on there.
I use an image of a strong interest I have as my pic, no personal info, not even my real name. It acts as a strong filter, so far my experience has been net positive.
I find it far more rewarding to try to hack the awful world around me instead of complaining how awful the world is/is becoming.
People already judge by looks. There’s nothing noble about wasting your time approaching someone that never found you attractive,
that’s not their choice.
If you have the time and preference to meet people face to face, you can still do that. People still do that and Tinder is a nice supplement.
But many of us look at dating as a numbers game. And Tinder is a hell of an upgrade to sending messages to women on other dating platforms where you don’t even know if they like your skin color. That’s a waste of time.
You don’t need to be perfect to get Tinder dates. But being ugly in this world with or without Tinder already stacks cards against you. Tinder is not so different from cold-approaching women at the bar. Except in Tinder you know she is more likely to give you a shot beforehand. But you have to trade away the ability to charm her in person. Why does it bother you so much that many people like tht trade-off?
Your post reeks of someone who’s mad that all those attractive people seem to be fucking everyone but you. Dating is hard. I’m willing to try many paradigms/sources at once to date at my desired level. If I was meeting new women every week through my social circle (the ideal imo) then I wouldn’t use Tinder or go out just to find single women, but until then...
If you’re not happy with your dating life, then maybe consider increasing your hustle yourself and/or find the approaches that work for you. Maybe Tinder just isn’t for you, but seems a bit outward to turn it into technosocial criticism.
My observation is that most upvotes are because of the title, and then a minority of people will read and critique the article in the top-voted comments.
Doesn't sound like a perfect system, but actually does a wonderful job at debunking very appealing-sounding articles.
[+] [-] jamesrom|7 years ago|reply
However, using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like, it doesn't have to be arbitrary points along your index. Using 1-dimension to shard 2D data isn't optimal.
For example, construct a heatmap of your 'load score' and shard based on that, in two dimensions. Then use an S2 curve to index inside that shard.
[+] [-] mbid|7 years ago|reply
Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.
EDIT: k-means is adapted directly here: https://elki-project.github.io/tutorial/same-size_k_means
[+] [-] geophile|7 years ago|reply
If you want to shard by proximity (items close in space are likely to be in the same shard, then the transform to 1d is the way to go, why wouldn't it be? What is your definition of "optimal"?
Sharding by proximity or by something else depends on the relative frequency of queries by location or something else. If you shard by location, then a query to one location goes (ususally) to one shard. That should scale better. Otherwise, each location-based query goes to each shard.
[+] [-] realfun|7 years ago|reply
In the original design we did consider approach similar to the "heatmap approach" you mentioned, it would reduce the shard movement for people who commutes in the city.
Later we figured that simply sharding along on the Hilbert Curve already give what we need, and the shard move is unavoidable, we would explain more details around the shard move in future blog.
[+] [-] unknown|7 years ago|reply
[deleted]
[+] [-] 101km|7 years ago|reply
tl;dr - S2 is a good library.
There's also a joke in here somewhere about bikeshedding and yak shaving.
[+] [-] cornholio|7 years ago|reply
[+] [-] lainga|7 years ago|reply
[+] [-] lixtra|7 years ago|reply
[+] [-] gregoriol|7 years ago|reply
Redis has geo search for example, why not using it?
[+] [-] realfun|7 years ago|reply
[+] [-] EGreg|7 years ago|reply
https://github.com/Qbix/Platform/tree/master/platform/plugin...
First of all, we normally do our sharding by the hash of the primary key. Typically it is the publisherId and hash of the name of a “stream”, which is our general dynamic data structure. What this does is essentially distribute the load evenly between hashes without having to worry about the “water” stuff mentioned above.
Ok now about location based search...
1. We use GeoHash for addressing the surface of the Earth. For lookups, we calculate actual corners of the lat-lng rectangle in Haversine distance and then translate it to geohash.
2. Technically we could have just used a database index at this point and be done with it. But we want to handle both PULL requests and PUSH notifications when something happens nearby.
3. So given a radius the subscriber is interested in, we set up 4 streams for SUBSCRIBERS that together (as rectangles) cover the circle surrounding the location and radius.
4. Then we also set up streams for PUBLISHERS which correspond to all the various drop-down values we have for the radius. This is compicated by the fact that we have one set for metric system and another for imperial. But we just make one stream for each of them. The more radii available the more streams you have to post to (like 10-20) but posting doesn’t happen that often and it’s better to precompute this stuff.
5. Relations between streams are one of the many features that streams do out of the box. So basically the user can just subscribe to a certain location and radius, interest, etc. and get either realtime updates or offline notifications. The item being posted is itself a stream and what’s being inserted is relations between it and the category stream the person is interested in.
Thus you can also see right away who is interested in what. In the future however we plan to encrypt the data end to end so the servers won’t know this info. Then the location search will be harder. (qbix.com/blog)
[+] [-] realfun|7 years ago|reply
Elasticsearch has geo-indexing as well(based on geohash internally), and by default it does id hashing similar to what you said(murmurhash3), we actually leverages that for location based searches.
The challenge addressed in the blog is not in how to search/address(as said Elasticsearch handles it already), it is about how to distribute the load so calculation only happens on limited nodes, and reduce the index size so it can be more performant.
[+] [-] eeZah7Ux|7 years ago|reply
Another option would be to map coords into very small 2D tiles and than look up "tile -> shard number" with a simple, large, array. A variable number of tiles would land into the same shard. Fast and lockless reads and writes. I wonder if it could be faster.
[+] [-] realfun|7 years ago|reply
[+] [-] Bedon292|7 years ago|reply
[+] [-] realfun|7 years ago|reply
[+] [-] bitL|7 years ago|reply
[+] [-] aeroevan|7 years ago|reply
[+] [-] _b3dj|7 years ago|reply
TLDR; Awful
[+] [-] psergeant|7 years ago|reply
[+] [-] unknown|7 years ago|reply
[deleted]
[+] [-] Lewton|7 years ago|reply
[+] [-] Uberphallus|7 years ago|reply
[+] [-] wilg|7 years ago|reply
[+] [-] modells|7 years ago|reply
[+] [-] pennaMan|7 years ago|reply
I find it far more rewarding to try to hack the awful world around me instead of complaining how awful the world is/is becoming.
[+] [-] wild_preference|7 years ago|reply
If you have the time and preference to meet people face to face, you can still do that. People still do that and Tinder is a nice supplement.
But many of us look at dating as a numbers game. And Tinder is a hell of an upgrade to sending messages to women on other dating platforms where you don’t even know if they like your skin color. That’s a waste of time.
You don’t need to be perfect to get Tinder dates. But being ugly in this world with or without Tinder already stacks cards against you. Tinder is not so different from cold-approaching women at the bar. Except in Tinder you know she is more likely to give you a shot beforehand. But you have to trade away the ability to charm her in person. Why does it bother you so much that many people like tht trade-off?
Your post reeks of someone who’s mad that all those attractive people seem to be fucking everyone but you. Dating is hard. I’m willing to try many paradigms/sources at once to date at my desired level. If I was meeting new women every week through my social circle (the ideal imo) then I wouldn’t use Tinder or go out just to find single women, but until then...
If you’re not happy with your dating life, then maybe consider increasing your hustle yourself and/or find the approaches that work for you. Maybe Tinder just isn’t for you, but seems a bit outward to turn it into technosocial criticism.
[+] [-] ahakki|7 years ago|reply
[+] [-] smt88|7 years ago|reply
Doesn't sound like a perfect system, but actually does a wonderful job at debunking very appealing-sounding articles.
[+] [-] unknown|7 years ago|reply
[deleted]