Down the Cloudflare / Stripe / OWASP Rabbit Hole

[+] a2tech|3 years ago|reply

This boils down to 'Cloudflare did something' and without the Enterprise plan you'd never be able to pull the data required to diagnose the problem. Oh also, Cloudflare knows they did something but plays the blame game until you get to someone that openly acknowledges that they know something is broken.

I've said it before and I'll say it again--Cloudflare is not making the Internet safer, it's just making it less open. I know that is not the overwhelming sentiment of HN and me complaining about it isn't going to change anyone's mind.

[+] cuuupid|3 years ago|reply

As a bit of a meta point it’s a bit astonishing how good Cloudflare is at PR and community management. The sentiment I’ve seen here in the past is overwhelmingly pro-Cloudflare. Every now and then they’ll publish a hit piece on AWS here and there’ll be an AWS hate thread, but no one seems to question Cloudflare’s motives for promoting bad press about their competitor directly to their target market.

Then again in most things, the underdog trying to upstage the incumbent is always a popular narrative. You’re right that talking about it likely isn’t going to change any sentiment.

[+] JohnFen|3 years ago|reply

> Cloudflare is not making the Internet safer

I am certainly not a fan of Cloudflare and wouldn't use their services, but I think this is not an accurate statement. Their services do objectively provide a security benefit. The only question is really whether or not the cost/benefit ratio is favorable.

[+] Lt_Riza_Hawkeye|3 years ago|reply

Overall I agree with you - the only caveat I have to offer is Cloudflare's support of eSNI. My opinion on CF used to be quite black and white, but there is at least someone in there (for who knows how long) contributing to the actual security of the web. Not mutually exclusive with doing harm in other ways.

[+] aestetix|3 years ago|reply

And don't forget, if they don't like you, they will happily deactivate your account with no notice and no reason given.

[+] carapace|3 years ago|reply

> Cloudflare is not making the Internet safer, it's just making it less open.

They're doing both, eh?

I switched my family's home DNS to Cloudflare's family-safe DNS ( https://blog.cloudflare.com/introducing-1-1-1-1-for-families... ) to protect them from malware and porno (I'm not a prude, they don't use the Internet for that (no really! We are weirdos.) and porn sites are often a malware vector anyway.)

A few months ago I noticed that you can't browse weed stores through their DNS anymore.

I don't really blame them, I'm assuming it's due to pressure from the US Federal Gov. (who still consider pot to be some insanely dangerous narcotic!) But it was definitely a personal "until they come for you" moment.

[+] Jamie9912|3 years ago|reply

Cloudflare isn't really to blame here when the customer has FULL control over all security settings - they can define rules as they please, and have all the tools (including the API) to do this

[+] danpalmer|3 years ago|reply

See also: Cloudflare in front of Mastodon.

A lot of naive Mastodon admins are putting default Cloudflare configurations in front of their instances. The problem is that inter-instance requests necessary for federation then get caught as bots (because they essentially are) and connections in the network degrade, causing eventual de-federation.

Cloudflare is magic, in the good and bad ways. I'd only use it for very small personal sites or things that can afford downtime and don't need integrations, or for large businesses that can afford the enterprise plans and will actively manage the account and respond to business and tech needs with config changes.

[+] robga|3 years ago|reply

Are there any public reports of this? It’s not something I’ve seen anywhere and not my experience. I “googled” it and all that came up was your mastodon post saying the same thing.

[+] pbronez|3 years ago|reply

I wonder if their Wildebeest ActivityPub implementation will have similar issues.

[+] worldofmatthew|3 years ago|reply

"very small personal sites" don't need Cloudflare.

[+] Perseids|3 years ago|reply

What I find bleak about the situation is that it is a glaring design failure, even though everyone involved should have the necessary expertise to do better. A callback from your payment provider should never go through a best-effort WAF. Instead, as you already have a strong business relationship, you could easily exchange/store/configure strong credentials with Stripe [1]. When even a security professional doesn't do that, what does it say about the state of documentation of this feature?

Looking at the documentation directly, what they advise you to do is kind of the worst idea they could come up with: https://stripe.com/docs/webhooks/signatures – you need custom logic[2] to verify that their MAC ("signature" they call it incorrectly) is valid and you need to configure a different secret for each of your endpoints. And then you still need to handle replay attacks somehow, which is its own nightmare to do correctly. It's no wonder the WAF can't do that for you.
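For readers who have not seen the scheme, here is a minimal sketch of the kind of check being discussed, written against the standard library only. It assumes the format as I understand it from the linked docs: a `Stripe-Signature` header of the form `t=<timestamp>,v1=<hex MAC>`, where the MAC is HMAC-SHA256 over `<timestamp>.<raw body>` keyed with the endpoint secret. The function name, the 300-second tolerance and the `now` parameter are my own assumptions for illustration; in practice the official SDK does this for you.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(payload: bytes, header: str, secret: str,
                   tolerance: int = 300, now: Optional[float] = None) -> bool:
    """Return True if the header carries a valid, fresh MAC for payload."""
    # Parse "t=...,v1=...,v1=..." into a timestamp and candidate MACs.
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject stale deliveries; this narrows the replay window but does not
    # close it (tolerance value is an assumption, not a documented constant).
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # Recompute the MAC over "<timestamp>.<raw body>" and compare in
    # constant time against every candidate in the header.
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode(), c.encode()) for c in candidates
    )
```

Note that the timestamp check only bounds how old a replayed request can be. Rejecting a replay inside that window still needs you to remember which event IDs you have already processed.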

From personal experience a few years back, I'm really irritated by Stripe's webhook approach overall. Payment process information is such a vital business concern that "let's try to call them and if that fails... well we tried" is broken on principle alone. The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds if your framework doesn't support async well. This is also trivial to do securely: Your HTTPS library already authenticates Stripe during the TLS handshake and that is all that is necessary.

[1] Best case scenario: Let stripe authenticate with mutual TLS, but I know this is quite a long way away from typical web server configurations.

[2] Stripe's approach very much reminds me of DPoP https://news.ycombinator.com/item?id=31266575 which shall in no way be construed as a compliment.

[+] lvice|3 years ago|reply

I just would like to give my opinion on some of your points, which I don't agree with:

> Looking at the documentation directly, what they advise you to do is kind of the worst idea they could come up with: https://stripe.com/docs/webhooks/signatures – you need custom logic[2] to verify that their MAC ("signature" they call it incorrectly) is valid and you need to configure a different secret for each of your endpoints

It certainly helps that I use their official SDK, but it's one line of code to add the signature validation. Also, I'm not sure why you would want to create a lot of endpoints to listen to these webhooks. I simply have one, and the Stripe SDK helps me determine the event type, handle its deserialization, etc.

> Payment process information is such a vital business concern that "let's try to call them and if that fails... well we tried" is broken on principle alone

That's not how it works. The webhooks keep retrying with exponential backoff until they succeed. You can also manually retrigger them for individual events.
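To make "retrying with exponential backoff" concrete, here is a generic sketch of the pattern. It is not Stripe's actual schedule: the base delay, factor, cap and attempt count are all made-up parameters, and `send` and `sleep` are injected so the logic can be exercised without a network or a real clock.

```python
import random


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 3600.0, attempts: int = 8,
                   jitter: float = 0.0, rng=random.random):
    """Yield the wait before each retry: base * factor**n, capped at cap."""
    for n in range(attempts):
        delay = min(cap, base * factor ** n)
        # Optional jitter spreads retries out so that many failed deliveries
        # do not all hit the receiver again at the same instant.
        yield delay * (1 + jitter * rng())


def deliver(send, delays, sleep):
    """Call send() until it returns True or the retry schedule runs out."""
    if send():
        return True
    for delay in delays:
        sleep(delay)
        if send():
            return True
    return False
```

With the defaults and no jitter, the waits are 1, 2, 4, 8, ... seconds. The receiver still has to be idempotent, because a delivery that succeeded but whose response got lost will be sent again.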

> The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds if your framework doesn't support async well

Nothing is preventing you from doing that. In fact, in my codebase I poll the Stripe API as a fallback to check whether a payment succeeded in case there are issues with webhooks. But it's nice to have the webhook tell you immediately if a payment fails/succeeds, in order to give the user fast feedback about the status of their payment (and not wait for the next long-polling iteration).

Not everything on Stripe is perfect, but I do find it really pleasant to work with in general

[+] davedx|3 years ago|reply

> The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds

This is also much much easier for the payments integrations developers to test, compared to all the messing about and running dodgy proxies that testing webhooks involves.

Webhooks for critical application paths just seem like a bad idea all around really

[+] AtNightWeCode|3 years ago|reply

You should not poll a payment provider in a general flow. Do you realize how many requests that will cause if everybody did that? A payment flow is event-driven by nature. The payment provider pushes the states back to the initiating system. If that fails it is up to the consumer to detect problems and make sure that the system is in sync with the provider. Some payment providers do push data up to at least 24h. It is an obvious design flaw...

[+] philipwhiuk|3 years ago|reply

Yeah, Stripe is well documented but very ugly to actually handle in practice.

[+] technion|3 years ago|reply

There are multiple places where he uses a feature only available to enterprise users (read: Not me) to resolve this. I guess the "normal" thing here would be to disable the WAF and be done with it.

[+] e1g|3 years ago|reply

This jumped out at me too. Cloudflare keeps many diagnostic and debugging tools available only on the Enterprise plan (typically >$3k per month). When something like this happens to most small teams and startups, they are working in the dark even on the Business plan at $200/mo.

[+] gz5|3 years ago|reply

And WAF isn't the only design option for webhooks. WAF = define what to block. Another option is to define what to accept (block all else by default).

OpenZiti approach (1) for example:

a. enroll each side of the webhook w/ X.509 identity

b. X.509 gates a network overlay between the servers

c. each server initiates outbound sessions to the overlay

d. block everything else (deny-all inbound on both servers)

(1) disclosure: i am a maintainer of the openziti foss, and you can only (fully) use the technique above if you have enough control of both sides, e.g. use a Lambda function: https://blog.openziti.io/my-intern-assignment-call-a-dark-we...

[+] raverbashing|3 years ago|reply

Yeah, while I would have tried to investigate this a bit, this seems to be an issue on Cloudflare

Bypassing the rules is a workaround, but not a fix

[+] a2tech|3 years ago|reply

And if you have to disable WAF, what exactly is Cloudflare doing for you?

[+] talkingtab|3 years ago|reply

I wonder what the take away from this is? The simple one is "bad cloudflare" or "bad stripe" or even "bad hibp". Or maybe all in conjunction. Or maybe none.

But that seems simplistic to me. The smell of this is a system that is so poorly made that it has layer upon layer of obscure hacks to protect it. It appears that no one can understand why this happened and the best guess is that it had something legitimate that was misunderstood. Maybe the word "alter" and "table"? This is the equivalent of you walking into a bank, telling the person "Hi my name is Rob and I came to the bank today to ..." And then the bank goes into automatic shut down.

This is broken. IMHO.

[+] jsnell|3 years ago|reply

From the information given, bad Cloudflare. These kinds of content-matching rules should be triggering deterministically, and testable in a hermetic test environment. They also have sample payloads that get blocked vs. ones that get through, despite being essentially identical. It should be about as easy to debug as it gets.

That it's tricky to debug suggests there's something totally different going on than just badly understood rules. Maybe a server with a hardware fault that's making it return bogus results (though that should be easy to find in monitoring), maybe some kind of race condition, or running of different rules in parallel + having global or request-scoped state such that the order in which the rules finish running matters.

[+] cromulent|3 years ago|reply

From someone else in the article's comments:

> if we treated the customer's phone number as a hex representation of ASCII, it spelled something that was recognisable as a command.

And the WAF team suggested they ask the customer to change their phone number.

Goodness me.
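To see how that can happen at all, here is a small illustration. The quoted comment gives neither the actual phone number nor the command, so the digits below are invented: they are simply a string that happens to spell a word when each pair of digits is read as a hex byte.

```python
def digits_as_hex_ascii(number: str) -> str:
    """Read a digit string as hex byte pairs and decode them as ASCII."""
    digits = "".join(ch for ch in number if ch in "0123456789")
    if len(digits) % 2:
        digits = digits[:-1]  # drop a dangling half-byte
    return bytes.fromhex(digits).decode("ascii", errors="replace")


# Invented example: 0x65 0x78 0x65 0x63 are the ASCII codes for e, x, e, c.
print(digits_as_hex_ascii("6578 6563"))  # exec
```

Digit-only byte pairs cover the letters a-i and p-y (plus A-I and P-Y), so plenty of ordinary-looking numbers decode to something a pattern matcher might recognise.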

[+] zX41ZdbW|3 years ago|reply

Case 1. It blocks my SQL playground due to SQL injections: https://play.clickhouse.com/play?user=play

- solution: disable WAF.

Case 2: It damages my presentations by removing whitespaces in HTML elements styled as "white-space: pre" at https://presentations.clickhouse.com/

- solution: disable auto minification.

Case 3: It makes the debian packages repository inconsistent https://packages.clickhouse.com/

- solution: disable caching.

In fact, Cloudflare is an amazing service - it is powerful and easy to use, you only have to take care when enabling and disabling its features.

[+] NovemberWhiskey|3 years ago|reply

This is absolutely not the first (or second) time I've seen an outage triggered by a well-meaning security rules update on a WAF.

To be honest, a lot of security-related deployment processes would be regarded as unacceptable, wild-west level shit if they occurred in the software lifecycle - like difficulty to identify that a change had even occurred, inability to see before/after for the change, release processes effected manually via consoles, change deployed directly to production without going through a lower environment, and big-banged as opposed to canaried etc. etc.

[+] tiffanyh|3 years ago|reply

> About Page: “I'm Troy Hunt, an Australian Microsoft Regional Director and Microsoft Most Valuable Professional for Developer Security. I don't work for Microsoft, but they're kind enough to recognise my community contributions by way of their award programs which I've been a part of since 2011.”

Can someone explain what this Microsoft Regional Director role is because it sounds like he works for Microsoft but then says he does not.

[+] mpalfrey|3 years ago|reply

Microsoft Regional Director is effectively someone who is an advisor to Microsoft - https://rd.microsoft.com/en-us/about/ .

[+] eis|3 years ago|reply

Ignoring the question of who is actually doing something wrong and the fact that most customers on lower tiers wouldn't even be able to get most of the presented information, it seems absurd to me that a customer on an Enterprise plan that costs thousands of dollars a month can't get a simple "this is why those requests trigger the firewall" answer from CloudFlare. The diff between the request that passed and the one that triggered the rule is simple enough and the WAF is not some opaque AI I hope.

Why can't the triggering request be replayed through the WAF with output that shows the scores for each bit?
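A toy version of what such a replay tool could look like, assuming an anomaly-scoring model where each matching rule adds points and the request is blocked once a threshold is reached. The rule IDs, patterns, scores and threshold are invented for illustration and are not real OWASP CRS or Cloudflare rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str
    score: int


# Toy rules for illustration only; not real CRS or Cloudflare rules.
RULES = [
    Rule("toy-sql-alter-table", r"\balter\b.*\btable\b", 5),
    Rule("toy-sql-union-select", r"\bunion\b.*\bselect\b", 5),
    Rule("toy-sql-comment", r"--|/\*", 2),
]
THRESHOLD = 5  # hypothetical blocking threshold


def explain(body: str):
    """Return (blocked, total, hits); hits lists each rule and what it matched."""
    hits = []
    for rule in RULES:
        match = re.search(rule.pattern, body, re.IGNORECASE | re.DOTALL)
        if match:
            hits.append((rule.rule_id, rule.score, match.group(0)))
    total = sum(score for _, score, _ in hits)
    return total >= THRESHOLD, total, hits


# A harmless sentence trips the "alter ... table" rule.
print(explain("Please alter my booking to a table for two"))
```

The point is that the output names the rule and the exact substring that matched, and that the same input always gives the same answer. A real engine has far more rules and transformations, but nothing about the model prevents this kind of report.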

[+] ufmace|3 years ago|reply

My first thought is - why is this traffic going through CloudFlare at all? There's no caching benefit and it's all going to be coming from a datacenter anyways.

Maybe it's more trouble than it's worth to bother setting up a separate non-CDNed DNS for API routes for this particular site. But then how much time was spent trying to sort out why those requests were being blocked?

[+] wlonkly|3 years ago|reply

Cloudflare blocks denial of service attacks, too, which is probably important for tools in the security space, and blocking denial of service attacks means making sure that attackers don't know how to bypass Cloudflare and hit your origin directly.

If you let things go direct to the origin, then you're giving away information about where the origin is, even if the things are only Stripe.

[+] mac-chaffee|3 years ago|reply

I've been running the OWASP coreruleset in production for about a year now and it has been a big pain. The way I made it manageable was 1) training users that "if you see a 403 error, tell me ASAP" and 2) learning the ModSecurity rule syntax to be able to create rule exceptions for users very quickly. This is not possible to do at Cloudflare's scale.

Even then, users who didn't know the intricate details of the Web Application Firewall (like Troy in this case) would waste hours hunting down the issue. Since less popular sites often have more illegitimate traffic than legitimate traffic, there was really no good way for me to proactively fix WAF issues.

The conclusion I have drawn is that WAFs really only have a few very narrow use-cases.

The main use-case is when you want to write your own rule to protect hosts from a specific zero day while they are being patched. Like a simple rule to detect Log4J [1] was an effective band-aid while we scrambled to implement real patches. But WAFs have an inherent weakness: clever attackers can pretty much always circumvent rules, or force you to write a rule that is so complex that it causes slowness or blocks legitimate traffic.

Another use-case is when you have to deploy some untrusted code that is likely vulnerable to common (>1 year old) vulnerabilities. Like running an old/archived wordpress instance. This is the only time when the coreruleset makes sense IMO.

As I see it, WAFs are a tool created in a simpler time when the number of possible attacks and applications was small. In the modern era where there is a constant deluge of zero-days, huge attack surfaces, tons of variability in applications, and lots of sites where RCE/SQLi is a feature (think CI job definitions, Jupyter notebooks, custom query languages), WAFs have lost their effectiveness.

[1]: https://github.com/coreruleset/coreruleset/issues/2331

[+] perfecto_maduro|3 years ago|reply

We whitelisted the stripe IPs completely after getting burned once. If Stripe gets hacked so that the hackers jump off to our site, we have far bigger problems to worry about.
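For anyone wanting to do the same check at the application layer, a minimal default-deny sketch using the standard `ipaddress` module. The networks below are placeholders from the RFC 5737 documentation ranges, not Stripe's real addresses; a real setup would load the provider's published webhook source IPs and keep them up to date.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real provider IPs.
ALLOWED = [
    ipaddress.ip_network(net)
    for net in ("203.0.113.0/24", "198.51.100.7/32")
]


def is_allowed(source_ip: str) -> bool:
    """Default deny: accept only addresses inside an allow-listed network."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # unparseable input is never allowed
    return any(addr in net for net in ALLOWED)
```

This only makes sense if `source_ip` is the real peer address. Behind a proxy or CDN you have to take it from a header that only that proxy can set, otherwise the check is trivially spoofable.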

[+] danpalmer|3 years ago|reply

It seems like Cloudflare should be doing this for you. It wouldn't be hard for them to keep a list of IPs from common known-good integrations. They could prompt on first hit to ask you if you want to allow-list those companies, or even just do it by default.

[+] jamespo|3 years ago|reply

Running a WAF with a dynamic ruleset between you and your payment provider seems a bit risky to me.

[+] danuker|3 years ago|reply

Using a third-party's WAF and trusting it will not log your data also shows some risk.

By centralizing all data going through the web you paint a massive target on yourself (for black hats).

[+] yellow_lead|3 years ago|reply

WAF blocking things randomly isn't a good look for Cloudflare. I wonder how many outages this has caused for their customers.

[+] amluto|3 years ago|reply

I’ve generally thought that POST webhooks are the wrong abstraction for this type of payment confirmation, and reading this makes me believe it even more:

This whole process needs something like a message queue. Stripe should publish an event to the queue, and HIBP should receive that event. A message queue is still subject to network failures, but any sensible implementation will notice and recover missed events.
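A rough sketch of why a log-style queue recovers missed events: the consumer keeps a cursor and always asks for "everything after the last thing I processed", so an outage just means a bigger batch next time. `EventLog` here is an in-memory stand-in that I made up; it is not any real Stripe or message-broker API.

```python
class EventLog:
    """In-memory stand-in for a provider-side, append-only event log."""

    def __init__(self):
        self._events = []

    def publish(self, payload: dict) -> int:
        self._events.append(payload)
        return len(self._events)  # 1-based sequence number

    def read_after(self, cursor: int, limit: int = 100):
        """Return (sequence, payload) pairs with sequence > cursor."""
        batch = self._events[cursor:cursor + limit]
        return [(cursor + i + 1, p) for i, p in enumerate(batch)]


class Consumer:
    def __init__(self, log: EventLog):
        self.log = log
        self.cursor = 0   # last sequence number fully processed
        self.handled = []

    def poll(self) -> int:
        """Process everything published since the cursor; return the count."""
        count = 0
        for seq, payload in self.log.read_after(self.cursor):
            self.handled.append(payload["id"])
            self.cursor = seq  # advance only after the event is handled
            count += 1
        return count
```

In a real system the cursor would be persisted together with the side effects of handling the event, so that a crash between the two cannot skip or double-apply a payment.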

[+] berkle4455|3 years ago|reply

Damn this is absurd. Is there not a way on Cloudflare to flag webhooks with a "if request contains this api key, let it through, block the rest"?

[+] tyingq|3 years ago|reply

It looks like "WAF - add exceptions" supports Wireshark style expressions, so there's things like:

(http.request.uri.query contains "some-string")

Or similar for checking headers, post body content, etc.

[+] a2tech|3 years ago|reply

You can create custom rulesets, which is what the article's author did. This allowed the API calls to essentially bypass the WAF.

[+] davedx|3 years ago|reply

"I look at the managed WAF Cloudflare provides more favourably than I did before simply because I have a better understanding of how comprehensive it is. I want to write code and run apps on the web, that's my focus, and I want someone else to provide that additional layer on top that continuously adapts to block new and emerging threats. I want to understand it (and I now do, at least certainly better than before), but I don't want managing it day in and day out to be my job."

My takeaway from this is actually that you can't just have a managed WAF that gets out of your way. This looks like it took easily a day or more of work. What's the advantage of using a CloudFlare "managed" WAF versus running your own within AWS? I guess the "infrastructure" is managed, but the operations isn't...

I'm a pretty experienced engineer, and I honestly don't know if I'd have been able to solve this issue personally. Most likely I would have just whitelisted all of Stripe's IP's, like the initial hotfix did.

[+] pixl97|3 years ago|reply

Having worked around WAFs for years in enterprise support, I can say this kind of crap happens all the time. In my current work we'll have a client try to access our app in some way that oddly explodes. In general a browser HAR file is very useful. Then we have to check our app (hosted on the customer's servers), then we'd have to look at the load balancer, and when that doesn't bear results it's quite often we find a WAF in the network path. Most of the time it's near impossible to find the team that manages it and then get helpful information out of them about the issue.

[+] arpa|3 years ago|reply

It's like complex systems have complex failure modes, who'd thunk?

[+] aaroncloud|3 years ago|reply

I have used a few different WAFs to date: AWS WAF, CloudFlare WAF and Akamai WAF. I have had issues with all of them, and the most frequent issue is the unexpected blocking of a request. It would be interesting to benchmark a set of WAFs to see which ones block threats vs. which block valid traffic.

127 comments