
Why Searching Through 500M Pwned Passwords Is So Quick

304 points | pstadler | 8 years ago | troyhunt.com

172 comments

[+] zaroth|8 years ago|reply
Ummm... because it’s an O(1) array lookup not a search at all? Infuriating.

It’s read-only static data. Spending even 60ms on the response is ridiculous. Reading from files in blob storage... WTF?

Ctrl-F Redis - was disappointed.

Actually, even forget Redis. Pre-generate each of the ~1 million possible HTTP responses and store them in a string array; the 5-character hex prefix is the index into the array. Write < 100 lines of Go to load the data structure and serve it. What am I missing?

This is like “Hello World” in those HTTP Framework Benchmarks that used to make the rounds every few months.
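The array-lookup idea can be sketched in a few lines (Python here rather than the suggested Go, purely for illustration; the dataset is faked with two hypothetical passwords instead of the real dump):

```python
import hashlib

PREFIX_LEN = 5  # first 5 hex chars of the SHA-1 select one of 16**5 buckets

def build_responses(hashes):
    """Pre-render one response body per 5-char prefix (the 'string array')."""
    buckets = {}
    for h in sorted(hashes):
        buckets.setdefault(h[:PREFIX_LEN], []).append(h[PREFIX_LEN:])
    return {prefix: "\n".join(suffixes) for prefix, suffixes in buckets.items()}

def lookup(responses, prefix):
    # O(1) dict access; a server would write this string straight out.
    return responses.get(prefix.upper(), "")

# Toy dataset: two made-up passwords instead of the real 500M entries.
hashes = [hashlib.sha1(pw).hexdigest().upper() for pw in (b"hunter2", b"P@ssw0rd")]
responses = build_responses(hashes)
```

Serving this from a plain HTTP handler is then a single dictionary read per request, which is the "not a search at all" point.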

[+] djhworld|8 years ago|reply
The 250MB Redis instance on Azure costs $0.055 an hour[1], so nearly $500 per year.

Going down the route of hosting a 100-line Go program, the cheapest "B1" instance type with 1GiB of RAM costs $105.12 a year, and then if you want the service to be HA you probably need another instance in a different zone, maybe a load balancer as well.

I suspect Azure Functions + Blob Storage costs a lot less. Although he didn't mention how much CloudFlare was costing him.

[1] https://azure.microsoft.com/en-us/pricing/details/cache/ [2] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...
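The yearly figures above check out as back-of-the-envelope arithmetic:

```python
# Quoted Azure prices from the comment above.
redis_per_hour = 0.055
redis_per_year = redis_per_hour * 24 * 365   # ~$481.80, i.e. "nearly $500"

b1_per_year = 105.12
b1_per_hour = b1_per_year / (24 * 365)       # works out to ~$0.012/hour
```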

[+] th3byrdm4n|8 years ago|reply
Precisely my first thought. Static read-only data? Why isn't this just sitting in memory?
[+] quadrature|8 years ago|reply
I think this solution is banking on storage + azure function cost being cheaper than the per instance cost to service the same load with an acceptable latency.
[+] nodesocket|8 years ago|reply
Couldn't he run a simple KV Consul cluster in a few zones and regions in either AWS or Google Cloud and be done?
[+] dpweb|8 years ago|reply
For an extreme diet, I'd use the HTML application cache for the pretty pictures. I know it's deprecated, but it still has massive support in all browsers and is instantly fast. That removes all origin hits except the API request.

On the lookups: CF Workers has global state that is shared among nodes. I've been playing with that. If you hold the string array there and can get origin hits below 10%, you could run this service at many millions of requests a day practically for free. On a VPS though, not Amazon.

[+] dx034|8 years ago|reply
Just putting this on a vps or cheap dedicated server with 16gb ram would've led to sub-ms response times at much lower costs (if you don't get Azure and Cloudflare for free like him). At those response speeds, scalability is also not really an issue if you cache aggressively at the edge.

Argo is then nice to have but not really necessary. If the server responds in <1ms, the 30% saved RTT is probably not detectable by the user.

[+] eropple|8 years ago|reply
I'm not sure you quite realize that Troy's intention here is eventually to be soaking up a whole lot of zeroes with this. Add to that that, if people are relying on this as a service, it had better not go down under any remotely plausible circumstances, and some growth-hacker's dedicated server from Hetzner or wherever is unsuitable. Edge caching only blunts this sufficiently if everybody's searching for the same passwords, and proving that that's the case is a burden you have not shouldered.

To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot like the middlebrow thing we're supposed to not be fans of around here.

[+] skrebbel|8 years ago|reply
Couldn't he just pregenerate all 1,048,576 responses, load them somewhere in RAM (or just a bunch of files on an nginx with caching on) and be done with it? I mean, he writes that a single response, gzipped, averages 10KB, so that's only about 10GB of RAM in total.

Even better: host this on a service like Netlify and avoid even the 30-day cache timeout Troy has here (which means 30-day-old info in case of new breaches). Just regenerate the entire set on the dev box whenever there's a new breach (should be fast enough; it's a linear scan and split) and push it to Netlify; it'll invalidate all edge caches automatically.
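A quick sizing check on the pregeneration idea, using the ~10KB gzipped average quoted in the thread:

```python
prefixes = 16 ** 5            # 1,048,576 possible 5-character hex prefixes
avg_response_kb = 10          # average gzipped response size from the post
total_gb = prefixes * avg_response_kb / (1024 * 1024)  # KB -> GB
# About 10GB in total, which still fits on a single well-provisioned box.
```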

[+] Bedon292|8 years ago|reply
The breach data isn't updated that often. It was 6 months between v1 and v2, so a long cache is totally fine. And then invalidate the cache if there is an update.
[+] barrkel|8 years ago|reply
When you don't need transactions, don't have a cache invalidation problem, and are querying read-only data, then this architecture - or really any architecture that takes advantage of what makes HTTP scalable, mostly idempotent, cacheable responses to GETs - makes sense.
[+] sleepychu|8 years ago|reply
Surprised Troy is so pro-cloudflare. I feel like they create a lot of security headaches.
[+] tomalpha|8 years ago|reply
Troy is clearly very pro Cloudflare and Azure. I’m only familiar with them at a high level and it would be great to hear counterpoints to his praise and get (at least the impression of) a debate that encompasses their pros and cons.

+1 for being interested in expanding this

[+] dx034|8 years ago|reply
He says that he supports CF because they enable SSL on so many websites that wouldn't have it otherwise. I guess the fact that he gets a lot of their premium features for free helps as well.
[+] r1ch|8 years ago|reply
Agreed. It's getting a bit much, almost every one of his blog posts has some advertisement for Cloudflare.

2 of Cloudflare's 3 SSL options generate an insecure setup yet look perfectly fine to the user with a shiny green padlock in the address bar. That really isn't acceptable.

[+] Cthulhu_|8 years ago|reply
In this case, though, those security concerns don't apply: the source data is secure, the website doesn't use authentication, and HTTPS adds an extra layer of security to any input (which is hashed and truncated client-side).
[+] darkport|8 years ago|reply
I love the k-Anonymity model. Makes it actually feasible to check passwords against HIBP when carrying out password audits for clients. Shameless plug but I've added it to my Active Directory audit password tool: https://github.com/eth0izzle/cracke-dit
[+] manigandham|8 years ago|reply
Lots of suggestions in this thread about better architecture but they all seem to forget that this is designed to be minimal in cost, complexity and maintenance while delivering 100% availability and great performance.

While Redis or a VM would be faster, that's way more overhead compared to a few cloud functions and table storage. This whole thing is event-driven and easy to build with just your browser, along with having cheap and granular billing. Cloudflare also already caches the responses so there's really no need for the origin to be perfect.

[+] yupyup|8 years ago|reply
Somewhat related (and nitpicky), but there are some spelling errors ("derrivation", "seperate"...) in the Cloudflare post that explains k-anonymity:

https://blog.cloudflare.com/validating-leaked-passwords-with...

P.S.: As a non-native speaker I had to look those words up to check them, since I trusted the spelling in an official blog post.

[+] StavrosK|8 years ago|reply
Here's a slightly more easily auditable version of the checker, in Python:

https://www.pastery.net/wwzqua/

The bash one was fine, I just prefer the readability of Python to make sure I know that only my truncated hash version is ever sent.

[+] jwilk|8 years ago|reply
Which bash checker are you referring to?
[+] zcam|8 years ago|reply
Wouldn't a simple Bloom filter make sense in their case? Just build the thing offline and have the app load it into RAM at startup.
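A minimal sketch of that approach (the size and hash count below are illustrative, not tuned for 500M entries):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives never."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with a counter salt.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build offline; the app then only ever reads the bit array.
bf = BloomFilter()
bf.add(b"hunter2")
```

For 500M entries at a ~1% false-positive rate you'd need roughly 4.8 billion bits (~600MB), so "load it into RAM at startup" is plausible on a modest box.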
[+] euroclydon|8 years ago|reply
Does anyone know: if we permute all 6-16 character alphanumeric strings, how many would have a SHA-1 hash matching a given five-character prefix?

I’m certainly not saying I think this is an issue! I’m just academically curious about the number and how to go about calculating it.

[+] dsacco|8 years ago|reply
A SHA-1 digest is 160 bits, conventionally written as a 40-digit hexadecimal string with 16 possible values per digit, yielding a total search space of 16^40 possible values.

If we splice off the first five digits (which Troy originally used as the database partition key), there are 16^5 = 1,048,576 possible five-digit prefixes. Then we continue counting with the sixth position as the new first position in the string.

The math from here is a straightforward function mapping 16^n -> 16^(n - 5):

* 16 six digit values match any given five digit partition key, p,

* 16^2 = 256 seven digit values match any p,

* 16^3 = 4,096 eight digit values match any p,

* 16^4 = 65,536 nine digit values match any p,

* 16^5 = 1,048,576 ten digit values match any p,

* 16^6 = 16,777,216 11 digit values match any p,

* 16^7 = 268,435,456 12 digit values match any p,

* 16^8 = 4,294,967,296 13 digit values match any p,

* 16^9 = 68,719,476,736 14 digit values match any p,

* 16^10 = 1,099,511,627,776 15 digit values match any p,

* 16^11 = 17,592,186,044,416 16 digit values match any p.

So in general, to calculate how many n digit SHA-1 digests correspond to any m digit prefix, we simply calculate (16^n)/(16^m), which yields 16^(n - m). Thus we have (16^[6..16]) / (16^5) for the five digit prefix case. Hopefully that elucidates it for you!
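The formula is easy to sanity-check mechanically:

```python
def matches(n_digits, prefix_len=5):
    """Count of n-digit hex strings sharing any given prefix_len-digit prefix."""
    return 16 ** (n_digits - prefix_len)

# e.g. matches(6) -> 16, matches(16) -> 16**11, reproducing the list above.
```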

[+] e12e|8 years ago|reply
If by "alphanumeric" you mean [0-9a-zA-Z], that's an alphabet of 2 * 26 + 10 = 62 characters. Since I'll be doing this in my head, I'll add two punctuation characters so we end up at 64 = 2^6.

Now, I assume you mean "five character prefix of the hex-encoded 160-bit hash"? One hex digit is half a byte: a nibble, or 4 bits. 5 * 4 = 20 bits, which leaves a 160 - 20 = 140-bit remainder that can represent 2^140 different values.

For a password of 1 to 16 characters from our alphabet, we have 64^16 passwords. The five character prefixes account for 64^5 of those.

64^16=2^(6 * 16)=2^96

64^5=2^30.

(2^96-2^30) < 2^140.

So, if I understood your question: all of them could (should) have unique representations in the remaining hash.

But I think you meant how many would collide in the prefix?

That'd be (2^96-2^30)/(2^20).

I think, it's getting to be bed time. But at least I should be sufficiently wrong for someone to correct me ;)
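Treating SHA-1 output as uniform, the average bucket size in this setup comes out cleanly in powers of two (a sketch of the estimate, ignoring the small subtraction above):

```python
passwords = 64 ** 16      # 16-char passwords over a 64-symbol alphabet = 2**96
buckets = 16 ** 5         # number of distinct 5-hex-digit prefixes = 2**20
per_bucket = passwords // buckets   # 2**76 passwords per prefix on average
```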

[+] _pdp_|8 years ago|reply
...or CloudFlare/CloudFront plus DynamoDB table with primary key of the first/last n-number of characters from the hash with potential secondary index for filtering.

Btw, indexing can be done cheaper with Google, I think.

There is also another (probably better) way, and that is to use S3. 1TB can be stored for as little as $20; the rest is endpoint caching.

Luckily for all of us it is easier than ever to single-handedly scale to millions of users at minimal cost.

[+] jwilk|8 years ago|reply
TL;DR why brotli is HTTPS-only: some middle-boxes mangle responses with content encodings they don't know.
[+] aplorbust|8 years ago|reply
What does he do with the logs of all the passwords submitted in searches?
[+] reificator|8 years ago|reply
> What does he do with the logs of all the passwords submitted in searches?

> imagine if you wanted to check whether the password "P@ssw0rd" exists in the data set. [...] The SHA-1 hash of that string is "21BD12DC183F740EE76F27B78EB39C8AD972A757" so what we're going to do is take just the first 5 characters, in this case that means "21BD1". That gets sent to the Pwned Passwords API and it responds with 475 hash suffixes (that is everything after "21BD1") and a count of how many times the original password has been seen.
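The flow quoted above is easy to reproduce locally; only the five-character prefix ever leaves the client (the "SUFFIX:COUNT" line format below matches the range API's responses; the counts in the example are made up):

```python
import hashlib

def split_hash(password):
    """SHA-1 the password; return (5-char prefix sent to the API, local suffix)."""
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    return digest[:5], digest[5:]

def times_pwned(suffix, api_body):
    """Scan the API response ("SUFFIX:COUNT" lines) for our suffix, locally."""
    for line in api_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0
```

So the server learns the prefix "21BD1" and nothing more; the match against the returned suffixes happens entirely on the client.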

[+] frogpelt|8 years ago|reply
How am I supposed to pronounce 'pwned'?

Can't we find something other than 4chan language to describe this?

[+] MBCook|8 years ago|reply
‘owned’ with a P on the front.

At this point the term pwned seems far too well known (in the industry) to change it.

For example, it seems to be in my iPhone’s internal dictionary.

[+] slow_donkey|8 years ago|reply
Powned. Literally owned then add a hard p
[+] Quarrelsome|8 years ago|reply
That password header warning is the coolest security improvement I've seen online.
[+] kpennell|8 years ago|reply
Was expecting an Algolia ad but was delightfully surprised.
[+] Sir_Cmpwn|8 years ago|reply
Starting to get a little uncomfortable with how hard this is being pushed on HN right now.

https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...

[+] wyldfire|8 years ago|reply
It's good to be skeptical, but everything I've read so far makes me believe Hunt's motives are good. His efforts are beneficial to HN readers and beyond.

This search shows four results within the last week and then it starts to drop off. How different are these search results from other popular sites? Ars Technica, Techcrunch, Anandtech, Bloomberg, LWN? Even if you focus on individuals, maybe it's not too different from the articles of Bruce Schneier, ESR, Linus, Theo de Raadt, etc?