Thanks for the feedback! Rainbow tables probably won't work here, since the epoch (rounded to 5 minutes) salts everything, so precomputed tables would expire constantly. In a real implementation you would use BSSIDs + SSIDs. The MinHash part creates 128 hash values from the entire observed set, not a subset of 3; LSH then divides these into bands, where similar sets collide probabilistically based on their Jaccard similarity. I've tested with two phones where one saw 3-4 networks and the other saw ~10. They still found each other, and you can try it yourself in the interactive demo.
You're right that it's not entirely privacy preserving; it really depends on your threat model, as I discuss in the security section. Your idea about combining area hints with time-based salts is interesting though, and feels like it could bridge geospatial indexing with this approach! Geospatial indexing was something I became aware of late in the project and didn't go into further. It's probably a much simpler approach if geolocation is important. In the current approach, geolocation doesn't matter at all, only 'the environment', whatever that may be.
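For anyone who wants to poke at the idea, here is a minimal sketch of epoch-salted MinHash + LSH. It is not the project's actual code: the SHA-256 construction, the `epoch|index|network` string layout, the 64 bands of 2 rows, and every function name are assumptions made for illustration. Only the 128 hash values and the 5-minute epoch come from the description above.

```python
import hashlib

NUM_HASHES = 128            # signature length, as described above
BANDS = 64                  # assumed banding: 64 bands of 2 rows each
ROWS = NUM_HASHES // BANDS
EPOCH_SECONDS = 300         # 5-minute epochs

def epoch_of(unix_time: float) -> int:
    """Round a Unix timestamp down to its 5-minute epoch number."""
    return int(unix_time // EPOCH_SECONDS)

def h64(text: str) -> int:
    """64-bit integer taken from a SHA-256 digest (assumed hash choice)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

def minhash(networks: set, epoch: int) -> list:
    """For each of the 128 salted hash functions, keep the minimum over the whole set."""
    return [min(h64(f"{epoch}|{i}|{net}") for net in networks)
            for i in range(NUM_HASHES)]

def lsh_buckets(signature: list) -> set:
    """Hash each band of the signature to one bucket id."""
    buckets = set()
    for b in range(BANDS):
        band = signature[b * ROWS:(b + 1) * ROWS]
        buckets.add(h64(f"band{b}|" + ",".join(map(str, band))))
    return buckets

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Made-up networks: phone A sees 4, phone B sees those 4 plus 6 more.
phone_a = {f"00:11:22:33:44:{i:02x}|net{i}" for i in range(4)}
phone_b = {f"00:11:22:33:44:{i:02x}|net{i}" for i in range(10)}
epoch = epoch_of(1_700_000_000)
sig_a, sig_b = minhash(phone_a, epoch), minhash(phone_b, epoch)
print(estimated_jaccard(sig_a, sig_b))               # estimate of the true value, 4/10
print(len(lsh_buckets(sig_a) & lsh_buckets(sig_b)))  # number of shared buckets
```

Two devices "find each other" when they share at least one bucket. Fewer rows per band makes that more likely for weakly overlapping sets, at the cost of more false matches.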
vessenes|3 months ago
I thought about geo largely because it radically changes the order of magnitude of work necessary; it lets you segment ‘possible’ subsets of APs down to sets of say 100, not millions, and changes the combinatorics. A side effect is knowing a rough spatial location.
Off the top of my head, I don’t think that epochs alone make a big difference. If I want to see if you’ve been somewhere, or tell you I’m somewhere, why not take the 3-4 networks you mentioned, and forward hash them for the next million epochs?
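To make the forward-hashing point concrete, here is a rough sketch under an assumed scheme where each value is SHA-256 over `epoch|bssid`. The real system publishes MinHash/LSH values rather than per-network hashes, but an attacker could compute those the same way. The function names and BSSIDs below are made up for illustration.

```python
import hashlib

def salted_hash(bssid: str, epoch: int) -> str:
    # Assumed scheme: the epoch number is the only salt, and it is predictable.
    return hashlib.sha256(f"{epoch}|{bssid}".encode()).hexdigest()

def precompute(bssids, start_epoch: int, n_epochs: int) -> dict:
    """Map every salted hash in a range of epochs back to (epoch, bssid)."""
    table = {}
    for epoch in range(start_epoch, start_epoch + n_epochs):
        for bssid in bssids:
            table[salted_hash(bssid, epoch)] = (epoch, bssid)
    return table

# Hypothetical known networks and starting epoch.
known = ["00:11:22:33:44:55", "66:77:88:99:aa:bb", "cc:dd:ee:ff:00:11"]
start = 5_666_666
table = precompute(known, start, 1000)

# A hash observed 42 epochs later is already in the table.
observed = salted_hash(known[1], start + 42)
print(table[observed])  # (5666708, '66:77:88:99:aa:bb')
```

Because the epoch is public and predictable, nothing stops the table from being built ahead of time; the salt only forces the attacker to redo the work once per epoch.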
Or, more ambitiously, why not take 3-4 networks each from the geo indexed clusters available at https://wigle.net/ and do the forward and backward epochs, letting me track where you’ve been and pretend to be near you any time in the future?
Wigle reports 1.7bn networks; a rough look at a suburban street near me shows most places have about 10 within a reasonable range; so call it 200 million “locations” with 128 segmented hashes, 250 billion hashes per epoch. I think we’re in the “seconds per epoch” range for a reasonably compute-heavy server to cover the entire space.
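One way to arrive at a figure of that order, as a back-of-envelope sketch: assume each of the 128 MinHash values needs one hash evaluation per network in the set. The hash rate below is a purely hypothetical number chosen to show the scale, not a benchmark.

```python
WIGLE_NETWORKS = 1_700_000_000
NETWORKS_PER_LOCATION = 10
NUM_HASHES = 128

locations_exact = WIGLE_NETWORKS // NETWORKS_PER_LOCATION    # 170_000_000
LOCATIONS = 200_000_000                                      # rounded up, as above

signature_values = LOCATIONS * NUM_HASHES                    # 25_600_000_000
# Assumption: each MinHash value hashes every network in the set once.
hash_evaluations = signature_values * NETWORKS_PER_LOCATION  # 256_000_000_000

ASSUMED_HASH_RATE = 10_000_000_000  # hypothetical: 1e10 hashes per second
seconds_per_epoch = hash_evaluations / ASSUMED_HASH_RATE

print(f"{hash_evaluations:,} hash evaluations per epoch")
print(f"{seconds_per_epoch:.1f} s per epoch at the assumed rate")
```

Even if the assumed rate is off by an order of magnitude, the work still fits comfortably inside a 5-minute epoch.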
Upshot - I think the salting needs to be something local / not predictable or stored remotely.
Hopefully these comments land the right way. I like the idea a lot, and I don’t fully understand the system, but as I understand it, the system does not offer privacy: I could replay any phone’s hashes against a system that costs a few dollars to build and reconstruct your location and time, if my understanding is correct.
waerhert|3 months ago
Edit: Maybe some preshared group hash (kinda beats the point), or combining multiple modalities (e.g. Bluetooth, shared interests), or some kind of proof-of-work token could help mitigate some of these issues. I guess anything that increases the time an attack takes helps in this case? Or anything that really pins down environment + time, like what smath described in his comment. In essence, the core idea of MinHash + LSH works, and it doesn't limit you to just WiFi networks. The key is being able to grab a fingerprint that is unique enough and different enough each epoch. WiFi networks are just easy enough to grab vs something more low level like an AP's beacon timing interval jitter or something.