
Post Mortem on Salt Incident

118 points | sfg75 | 5 years ago | blog.algolia.com

65 comments

[+] cetra3|5 years ago|reply
This whole salt-stack incident could've been handled a lot better by salt themselves:

- the notification was sent a week ago to a small mailing list, which is tucked away on their site

- no notification in the package registry when you go to download salt (at least I never received an email, but I still get plenty of marketing spam)

- no posts on social media as far as I can tell; I couldn't find a tweet, anything on reddit, or anything on HN

- they only blogged about it on their official site yesterday, way after damage had been done

- one week's notice between the initial announcement and the patch coming out. The patch being released is basically a disclosure of the vulnerability

- the patch was released late Thursday / early Friday depending on your timezone, giving attackers a weekend head start

- the official salt docker images were only patched yesterday

- you can't get a patch for older versions without filling out a form and supplying details

- Ubuntu and other repositories are still vulnerable

[+] mtam|5 years ago|reply
+1. However, from what I read, the vulnerability can only be exploited if the attacker has network access to the Salt master's ports, which should never be the case. The people that got compromised had Salt exposed to the Internet, which is obviously ridiculous.

Not trying to downplay the critical nature of the vulnerability but the ones that were compromised by this issue have deeper security issues to deal with.

[+] bawolff|5 years ago|reply
> one week's notice between the initial announcement and the patch coming out. The patch being released is basically a disclosure of the vulnerability

While your other points may be valid, one week should be plenty of time between announcement and patch. Any longer and I would call the timetable problematic.

[+] cat199|5 years ago|reply
> - Ubuntu and other repositories are still vulnerable

That isn't really salt's problem though... the same could be said for relying on any distro-provided package

[+] VWWHFSfQ|5 years ago|reply
The intruders had root access to every server in a salt deployment for who knows how long and yet everyone is claiming there's no evidence that any data or secrets (customer's or otherwise) were exfiltrated from the network. This is a very dangerous assumption. Nobody has any idea what was run on the servers since it seems that once the initial attack script was deployed it downloaded and executed new scripts every 60s and then removed themselves. Pretty standard C&C ops. It may have started as a mining operation, but that doesn't mean it was the only thing it was doing.
[+] Jedd|5 years ago|reply
> ... and yet everyone is claiming there's no evidence that any data or secrets (customer's or otherwise) were exfiltrated from the network.

A number of people have carefully reviewed the payload that was deployed to servers, especially during what we're calling v1-v4 of the attack. (v5 onwards got more complex, but that wasn't until Monday, with variability for timezone.)

> Nobody has any idea what was run on the servers ...

Well, that's not true - there are a number of victims with useful IDS tools, including auditd, plus the review of the deployed binaries and shell scripts, etc.

Some of us also have netflow collection at the edge, and can review connections initiated from within our networks.

> ... once the initial attack script was deployed it downloaded and executed new scripts every 60s and then removed themselves.

I don't think any of us have found scripts that removed themselves. While that may sound naive, a few researchers have been analysing these tools, including via large honeypot networks, and this just hasn't (at least for the first 2-3 days) been the profile of the attack.

Thankfully - and I appreciate it's very weird to say this - the initial attacks were very much vanilla cryptocurrency mining opportunities. It could have been a lot worse, and Algolia's assessment matches a lot of other independent assessments on this front.

[+] johann-algolia|5 years ago|reply
Hello,

I'll try to give you some insight as I'm a security engineer at Algolia.

Your concern is valid, and it's true, we cannot know for sure. That's the reason why, as explained in the blog post, we are reinstalling all impacted servers and rotating our secrets. If our assumption is false, this should contain the issue.

That being said, we have good reasons to make that assumption.

- Our analysis of the incident and of how the malware behaved on our systems didn't find any evidence of access to or transfer of data.

- There are other public analyses of the malware. Other companies that were hit reached the same conclusions as us, and you can have a look at https://saltexploit.com/ which maintains an interesting list of what is known about the attack, how it behaved, and how fast it's evolving to adapt.

I hope this answers your concern.

[+] lasdfas|5 years ago|reply
I agree. I would like to see more details of how they determined it was only crypto mining. Finding only mining scripts in your logs doesn't mean they were not running other code once they had root.
[+] kureikain|5 years ago|reply
It's weird that these Salt masters are reachable from the internet and their operators can sleep well with it.

Even with a zero-trust network or the BeyondCorp idea, I still find the extra layer of protection a VPC gives to be great. A few years ago there was an issue with the K8s API server, and updating K8s isn't a walk in the park. I felt relaxed back then because we had everything inside a VPC.

You can use SSH or a VPN to access services inside the VPC, but any tool that has permission to manage your infrastructure should never be exposed to the internet.

Same thing with Jenkins: if you are using Jenkins to manage Terraform or trigger Ansible/Salt/Chef runs, make sure Jenkins is not reachable from the internet. Use a different method to route webhooks into it.

[+] trabant00|5 years ago|reply
I never understood the current trend of saying VPN is a thing of the past. Redundancy in security layers is how you don't get affected by every CVE out there.

Imo this is THE lesson to learn from this story.

Secondary: Salt and Ansible are not very mature yet.

[+] darkwater|5 years ago|reply
Yeah, I completely agree and really don't see the point of having a configuration management server facing the Internet, with basically all your servers connecting to it through the Internet! It's one thing for the BeyondCorp idea to eliminate the road-warrior concept, and another to have your infra management exposed to CVEs in the wild!

For Jenkins it's a bit more complicated because of GitHub webhooks, although they do publish their IPs in a programmatic form so you can whitelist them.
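GitHub does publish the address ranges its webhooks originate from (the `hooks` key of `https://api.github.com/meta`), so a receiving proxy can check the source IP against those CIDRs. A minimal sketch with Python's stdlib `ipaddress` module, using hardcoded illustrative ranges (in practice you would fetch the live list, since GitHub rotates it):

```python
import ipaddress

# Illustrative CIDRs only -- fetch the current list from the "hooks" key of
# https://api.github.com/meta rather than hardcoding, as GitHub rotates ranges.
HOOK_CIDRS = ["192.30.252.0/22", "185.199.108.0/22"]

def is_github_hook(source_ip: str) -> bool:
    """Return True if source_ip falls inside one of the allowed hook ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in HOOK_CIDRS)
```

A front proxy (or the webhook handler itself) can then drop any request whose source address fails this check before it ever reaches Jenkins.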

[+] mtam|5 years ago|reply
“We’ve secured the impacted SaltStack service by updating it and adding additional IP filtering, allowing only our servers to connect to it.”

So this means they had Salt master ports publicly accessible? Why would anyone have salt ports open/exposed to public/internet?
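For reference, the Salt master listens on two ZeroMQ ports: TCP 4505 (publish) and 4506 (request/return). A minimal iptables sketch that admits only a trusted minion range (`10.0.0.0/8` is a placeholder for your own network):

```shell
# Allow Salt's ZeroMQ ports (4505 publish, 4506 request) only from a trusted
# minion range, then drop everything else hitting those ports.
iptables -A INPUT -p tcp --dport 4505:4506 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 4505:4506 -j DROP
```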

[+] dijit|5 years ago|reply
> Why would anyone have salt ports open/exposed to public/internet?

If you're bootstrapping random servers, this is a fine approach.

The whole Salt connection methodology is 'trust on first connect' (a bit like the default SSH), with a manual stage for accepting an incoming request, and the connection stream is encrypted.

If you're using salt to bootstrap your VPN servers or network appliances then it's understandable that you'd have it exposed to a more public network, and the documentation was clear that this was fine.

Not everything is a virtual machine on a cloud provider.
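That manual acceptance stage is handled by the `salt-key` utility on the master; a typical hand-verification flow looks roughly like this (`web01` is a placeholder minion id):

```shell
# On the master: list pending minion keys, check a key's fingerprint
# (compare it out-of-band against the minion's own), then accept by hand.
salt-key -L
salt-key -f web01
salt-key -a web01
```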

[+] mirimir|5 years ago|reply
Yeah, that jumped out for me too. I'm guessing that they didn't want to deploy some sort of private network layer.
[+] lrpublic|5 years ago|reply
Trusting a central control server is the fundamental mistake here.

It creates a very high value target that is difficult to secure.

I prefer a model where the management commands are signed at a management workstation and those commands are pushed by the server and authenticated at the managed node against a security policy.
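A rough sketch of that model, assuming a shared secret distributed out of band (real deployments would use per-operator asymmetric keys such as ed25519): the workstation signs the command, and the node verifies the signature and then checks the command against its local policy. All names here are illustrative, not from any real tool.

```python
import hashlib
import hmac

# Shared key stands in for real per-operator asymmetric keys.
KEY = b"demo-key-distributed-out-of-band"
ALLOWED_COMMANDS = {"restart-nginx", "rotate-logs"}  # node-side security policy

def sign(command: str) -> str:
    """Run at the management workstation: produce a MAC over the command."""
    return hmac.new(KEY, command.encode(), hashlib.sha256).hexdigest()

def authorize(command: str, signature: str) -> bool:
    """Run at the managed node: verify the signature, then apply the policy."""
    expected = sign(command)
    return hmac.compare_digest(expected, signature) and command in ALLOWED_COMMANDS
```

Even a compromised central server then can't mint new commands: it can only relay (or replay) what an operator signed, and the node's own policy bounds the damage. A real design would also add nonces or timestamps to block replays.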

[+] brianjlogan|5 years ago|reply
What configuration management tools use this methodology?
[+] 0x0|5 years ago|reply
Both this and the Ghost CMS incident seem to hint that the only reason this was discovered was that loud crypto miners were exhausting resources. What are the chances a quieter attacker hasn't thoroughly ploughed through the entire infrastructure days ahead?

Also think about how many years this vuln has been present and exposed. Who's to know blackhats haven't sat on this 0day for years, quietly compromising private keys and other data? Spooky.

[+] ciprian_craciun|5 years ago|reply
I've seen mentioned in the comments various "deployment" tools (or call them "configuration management" if you will) being called "insecure" or "immature", or one being claimed better than another; however I think this is a good opportunity to talk about a deeper problem, namely the architectural choices each tool has taken.

These choices all impact the reliability and security of the resulting system, especially the following:

* do they rely on SSH, or have they implemented their own authentication / authorization techniques? (personally I would be very reluctant to trust anything that just listens on a network port for deployment commands and isn't SSH;)

* do the agents run with full `root` privileges, or is there a builtin mechanism that allows the agent to act only in a limited capacity, within the confines of a set of whitelisted actions? (perhaps even requiring a secondary authentication mechanism for certain "sensitive" actions, for example something integrated with `sudo`, that provides a sort of 2-factor-authentication with a human in the loop;)

* do the operators have enough "visibility" into what is happening during the deployments? (more specifically, are the deployment scripts easily auditable or are they a spaghetti of dependencies? are the concrete actions to be taken clearly described, or are they hidden in the source code of the tool?)

* are there builtin mechanisms to "verify" the results of the deployments?

* and building upon the previous item, are there mechanisms to continuously "verify" if the deployment hasn't changed behind the scenes?

I understand that some of these features wouldn't have helped directly to prevent this particular case; however, they would have helped with alerting and diagnosis.
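The second point above, a limited-capacity agent, can be sketched as a tool that maps whitelisted action names to fully declared commands. Everything here (action names, commands) is hypothetical:

```python
# Hypothetical sketch of an agent restricted to a whitelist of named actions,
# instead of accepting arbitrary commands to run as root.
ACTIONS = {
    "reload-config": ["systemctl", "reload", "myapp"],
    "deploy-static": ["rsync", "-a", "/staging/", "/var/www/"],
}

def plan(action: str) -> list[str]:
    """Return the exact command an action would run, or raise if not whitelisted.

    Printing this plan before execution also addresses the "visibility" point:
    the concrete commands are declared up front, not hidden in tool code.
    """
    if action not in ACTIONS:
        raise PermissionError(f"action not whitelisted: {action}")
    return ACTIONS[action]
```

A compromised master could then still trigger the whitelisted actions, but not arbitrary code, which is exactly the blast-radius reduction being argued for.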

[+] alexbrower|5 years ago|reply
Can anyone describe the business benefits of an algolia implementation (vs Elasticsearch?) for a company that doesn't heavily rely on content searches? It seems expensive and something that I'd build on my own.

(Disclaimer: long-time operator and fledgling programmer)

[+] aseure|5 years ago|reply
Disclaimer: I'm a developer at Algolia.

IMHO the two main advantages in favor of Algolia are the sane defaults for relevancy and speed, and the fact that the service is hosted and can grow with your business without needing dedicated engineers to manage both the configuration and the infrastructure.

Also, on top of the Algolia services per se (search, analytics, recommendation, etc.), we're providing a lot of backend and frontend libraries which one would otherwise need to reimplement when using an Elasticsearch- or Solr-based implementation.

[+] vegannet|5 years ago|reply
Search is hard to get right and the cost of Algolia is negligible vs. doing it yourself. As a programmer, every line of code you write is a line of code you own: the less code you own in production, the better off you are. Algolia has saved us hundreds of hours which translates to tens of thousands of dollars.
[+] vbernat|5 years ago|reply
As a point of comparison, you can also expose Puppet masters to the public Internet, but Puppet uses HTTP/HTTPS as a transport, so it is trivial to put a reverse proxy in front of it that requires a valid certificate (managed and signed by Puppet) to contact the service. This way, no need to maintain a whitelist of legitimate clients.
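A sketch of that setup with nginx (paths and the backend port are examples; Puppet's default port is 8140):

```nginx
# Reverse proxy requiring a client certificate signed by the Puppet CA
# before any traffic reaches the Puppet master.
server {
    listen 8140 ssl;
    ssl_certificate         /etc/nginx/puppet-proxy.crt;
    ssl_certificate_key     /etc/nginx/puppet-proxy.key;
    ssl_client_certificate  /etc/puppetlabs/puppet/ssl/certs/ca.pem;
    ssl_verify_client       on;   # reject clients without a CA-signed cert

    location / {
        proxy_pass https://127.0.0.1:8141;  # actual Puppet master, bound locally
    }
}
```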