1. The pools are very shallow: two machines per pool. It's certainly possible for three tasks to be requested in the same region within 30 seconds, but we handle that by falling back to the next closest region when a pool is empty (see the sketch after this list). In practice that's uncommon, though.
2. I haven't considered it, but yeah, the caching seems to work great for us.
3. The tokens are generated per task, so if you're worried about your token getting leaked, you can just delete the task! (Second sketch below.)
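A minimal sketch of the fallback logic from (1), assuming hypothetical names (Pool, TryAcquire, regionsByDistance); the real scheduler obviously does more than this:

    package main

    import (
        "errors"
        "fmt"
    )

    // Illustrative types only; not the actual service code.
    type Machine struct{ ID string }

    type Pool struct {
        Region   string
        machines []Machine
    }

    // TryAcquire pops a machine from the pool, or reports it empty.
    func (p *Pool) TryAcquire() (Machine, bool) {
        if len(p.machines) == 0 {
            return Machine{}, false
        }
        m := p.machines[len(p.machines)-1]
        p.machines = p.machines[:len(p.machines)-1]
        return m, true
    }

    // regionsByDistance would order pools nearest-first for the task's
    // region; hard-coded here to keep the sketch short.
    func regionsByDistance(region string) []*Pool {
        return []*Pool{
            {Region: region}, // drained: three tasks hit a two-machine pool
            {Region: "next-closest", machines: []Machine{{ID: "m2"}}},
        }
    }

    func acquireMachine(region string) (Machine, error) {
        for _, pool := range regionsByDistance(region) {
            if m, ok := pool.TryAcquire(); ok {
                return m, nil
            }
            // Pool is empty: fall back to the next closest region.
        }
        return Machine{}, errors.New("no capacity in any region")
    }

    func main() {
        m, err := acquireMachine("us-east")
        fmt.Println(m, err)
    }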
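And a rough sketch of the per-task tokens from (3): the token lives and dies with its task, so deleting the task is the revocation path. The storage and token format here are made up for illustration:

    package main

    import (
        "crypto/rand"
        "encoding/hex"
        "fmt"
    )

    // A token is only valid while its task exists, so deleting the
    // task revokes a leaked token.
    type Tasks struct {
        tokenToTask map[string]string
        taskToToken map[string]string
    }

    func NewTasks() *Tasks {
        return &Tasks{
            tokenToTask: map[string]string{},
            taskToToken: map[string]string{},
        }
    }

    // Create mints a fresh random token for the task.
    func (t *Tasks) Create(taskID string) string {
        buf := make([]byte, 16)
        if _, err := rand.Read(buf); err != nil {
            panic(err)
        }
        tok := hex.EncodeToString(buf)
        t.tokenToTask[tok] = taskID
        t.taskToToken[taskID] = tok
        return tok
    }

    func (t *Tasks) Valid(tok string) bool {
        _, ok := t.tokenToTask[tok]
        return ok
    }

    // Delete removes the task, which also invalidates its token.
    func (t *Tasks) Delete(taskID string) {
        delete(t.tokenToTask, t.taskToToken[taskID])
        delete(t.taskToToken, taskID)
    }

    func main() {
        ts := NewTasks()
        tok := ts.Create("task-1")
        fmt.Println(ts.Valid(tok)) // true
        ts.Delete("task-1")        // worried about a leak? delete the task
        fmt.Println(ts.Valid(tok)) // false
    }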
hinkley|1 month ago
Splunk was a particular problem that way, but I also started seeing it with Grafana, at least in extremis, once we migrated from a vendor to self-hosted on AWS. Most of the time it was fine, but if we had a bug that none of the teams could quickly disavow as theirs, we had a lot of chefs in the kitchen and things would start to hiccup.
There can be thundering herds in dev, and a bunch of people trying a repro case within a thirty-second window can be one of them. The question is whether anyone has the spare bandwidth to notice that it's happening, or whether everyone trudges along making the same mistakes every time.
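The classic guard against that kind of herd is request coalescing, so five people re-running the same query cost one backend hit instead of five. A quick sketch using Go's golang.org/x/sync/singleflight (fetchRepro is a made-up stand-in for the expensive Splunk/Grafana query):

    package main

    import (
        "fmt"
        "sync"

        "golang.org/x/sync/singleflight"
    )

    var group singleflight.Group

    // fetchRepro stands in for the expensive query everyone re-runs
    // while chasing the same bug.
    func fetchRepro(bugID string) (string, error) {
        // With coalescing this typically prints once, not five times.
        fmt.Println("backend hit for", bugID)
        return "repro data for " + bugID, nil
    }

    func main() {
        var wg sync.WaitGroup
        // Five people try the same repro case in the same window.
        for i := 0; i < 5; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Concurrent callers with the same key share one call.
                v, _, _ := group.Do("BUG-123", func() (interface{}, error) {
                    return fetchRepro("BUG-123")
                })
                fmt.Println(v)
            }()
        }
        wg.Wait()
    }

Note that singleflight only dedupes calls that are actually in flight; for a thirty-second repro window you'd pair it with a short-TTL cache.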