top | item 42189283

(no title)

jhgg | 1 year ago

When I worked at Discord, we used BEAM hot code loading pretty extensively, built a bunch of tooling around it to apply and track hot-patches to nodes (which in turn could update the code on >100M processes in the system.) It allowed us to deploy hot-fixes in minutes (full tilt deploy could complete in a matter of seconds) to our stateful real-time system, rather than the usual ~hour long deploy cycle. We generally only used it for "emergency" updates though.

The tooling would let us patch multiple modules at a time, which basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say, the hot-patch was entirely deployed over erlang's built-in distribution.

discuss

order

davisp|1 year ago

This matches my experience. I spent a decade operating Erlang clusters and using hot code upgrades is a superpower for debugging a whole class of hard to track bugs. Although, without the tracking for cluster state it can be its own footgun when a hotpatch gets unpatched during a code deploy.

As for relups, I once tried starting a project to make them easier but eventually decided that the number of bazookas pointed at each and every toe made them basically a non-starter for anything that isn’t trivial. And if its trivial it was already covered by the nl (network load, send a local module to all nodes in the cluster and hot load it) style tooling.

scotty79|1 year ago

> Although, without the tracking for cluster state it can be its own footgun when a hotpatch gets unpatched during a code deploy.

This and everything else said sounds so much like PHP+FTP workflow. It's so good.

stouset|1 year ago

Can someone explain how this is not genuinely terrifying from a security perspective?

nelsonic|1 year ago

Where is the security problem? All code commits and builds can still be signed. All of this is just a more efficient way of deploying changes without dropping existing connections.

Are you suggesting that hot code replacement is somehow a attack vector? Ericsson has been using this method for decades on critical infrastructure to patch switches without dropping live calls/connections it works.

No need to fear Erlang/BEAM.

ramchip|1 year ago

Erlang distribution shouldn't be used between nodes that aren't in the same security boundary, it promises and provides no isolation whatsoever. It's kind of inherent to what it does: it makes a bunch of nodes behave as part of a single large system, so compromising one node compromises the system as a whole.

In a use case like clustering together identical web servers, or message broker nodes like RabbitMQ, I don't think it's all that scary. It gives an attacker easier lateral movement, but that doesn't gain them a whole lot if all the nodes have the same permissions, operate on the same data, etc.

Depending on risk appetite and latency requirements you can also isolate clusters at the deployment / datacenter level. RabbitMQ for instance uses Erlang clustering within a deployment (nodes physically close together, in the same or nearly the same configuration) and a separate federation protocol between clusters. This acts as a bulkhead to isolate problems and attackers.

aunderscored|1 year ago

It's the same amount of terrifying as a regular deploy, you need to ensure that you limit access as needed