Ekrekr|1 year ago
One thing that wasn't clear to me: if running npm to install dependencies on pod startup is slow, why not pre-build an image with the dependencies already installed and deploy that instead?
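For reference, baking dependencies in at build time is a small change. A minimal Dockerfile sketch; the base image, entrypoint, and `server.js` are illustrative, not from the article:

```dockerfile
# Sketch: install dependencies at image build time, not pod startup.
# Copying the manifests first lets Docker cache the npm ci layer,
# so dependencies are only reinstalled when package*.json changes.
FROM node:20-slim
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "server.js"]
```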
lmz|1 year ago
Surely they weren't running npm at start. It's just that Node.js allows multiple versions of the same module to coexist, and the differently versioned clients each had their own versions of dependencies, which could be collapsed to one common version.
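A sketch of forcing that collapse, assuming npm 8.3+ and its `overrides` field in package.json; the pinned version is purely illustrative:

```json
{
  "overrides": {
    "aws-sdk": "2.1000.0"
  }
}
```

For version ranges that are already semver-compatible, `npm dedupe` does the same collapsing without a pin.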
mavidser|1 year ago
> if running npm to install dependencies on pod startup is slow
Loading the AWS SDK via `require` was slow, not installing it. As the sibling comment says, collapsing the different SDK versions into one helped reduce the load time of the many SDKs.
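That load cost is easy to measure directly. A minimal sketch, with 'aws-sdk' standing in for whichever heavy module is being loaded:

```js
// Minimal sketch: time how long a heavyweight require() takes.
const start = process.hrtime.bigint();
const AWS = require('aws-sdk'); // stand-in for any large dependency
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`require('aws-sdk') took ${elapsedMs.toFixed(0)} ms`);
```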
mrits|1 year ago
The 'one weird trick' could've been spotted in a graphical bundle analyser. But are they not caching npm packages somewhere? It seems like an awful waste to download from the npm registry over and over. I would have thought it was parsing four different versions of the AWS SDK that was so slow.
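For the analyser route, a sketch assuming a webpack build with the webpack-bundle-analyzer plugin; four copies of the SDK would show up as four separate blobs in the treemap:

```js
// webpack.config.js: sketch, assuming webpack + webpack-bundle-analyzer
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');

module.exports = {
  // ...existing entry/output config...
  plugins: [new BundleAnalyzerPlugin()], // opens a treemap of bundle contents
};
```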
throwthrow5643|1 year ago
candiddevmike|1 year ago
> seems like an awful waste downloading from the npm registry over and over
Pondering this question across every organization in the world, and the countless opportunities for caching, leads to dark places. It would be interesting to see CDN usage for Linux distributions before and after Docker builds became popular.
roboben|1 year ago
Sadly, Grafana Cloud comes at a cost too. Does anyone else struggle with this horrible active-metrics-based pricing? It's not only Grafana Cloud; others price it like that too.
We moved shitloads to self-hosted Thanos. While this comes with its own drawbacks, obviously, I think it was worth it.
skrtskrt|1 year ago
zug_zug|1 year ago
Is it possible the prior measurement happened during a high traffic period and the post measurement happened in a low traffic period?
serverlessmom|1 year ago
sebstefan|1 year ago
I really don't understand spinning up a whole pod just for a request.
Wouldn't it be cheaper to just keep a pod up with a service running?
If scalability is an issue, just plop a load balancer in front of it and scale replicas up with load. Surely you can't need a whole pod for every single one of those millions of requests, right?
> Checkly is a synthetic monitoring tool that lets teams monitor their API’s and sites continually, and find problems faster.
> With some users sending *millions of requests a day*, that 300ms added up to massive overall compute savings
No shit, right?
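For what it's worth, the warm-service model being suggested here is tiny to sketch (port and handler are placeholders):

```js
// Minimal sketch of the "keep a pod up with a service running" model:
// one long-lived process serves many requests, and the load balancer
// scales replicas up or down instead of paying a cold start per request.
const http = require('http');

const server = http.createServer((req, res) => {
  // per-request work goes here; loaded modules stay warm across requests
  res.end('ok\n');
});

server.listen(process.env.PORT || 3000);
```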
crummy|1 year ago
The article said they had to do a bunch of cleanup between requests when it was handled by one service. That surprised me, but these requests must be doing more than just plain HTTP requests, I guess.
BobbyTables2|1 year ago
Spending serious engineering time to wrangle with the complexities of cloud orchestration is not something that should be taken lightly.
Cloud services should be required to have a black-box Surgeon’s General warning.
hibikir|1 year ago
The best advantage of cloud was never price: it was not having to argue with your data center organization, which often meant taking months to provision anything, even a very boring VM. If those companies were good at managing data centers, and could hire people actually interested in helping the company run, they'd have had little need for the cloud for predictable compute loads.
Until you get quite big, all necessary interactions with the cloud provider are just bills. It's just much easier, even though it is often expensive.
candiddevmike|1 year ago
Bare metal and datacenter orchestration is leaps and bounds more complex. You're paying for the abstraction.
tamiral|1 year ago
It's not set-up-and-leave-it! You have to continuously monitor and improve. Yes, using some cloud service will save XYZ time, but that doesn't mean it's a set-it-and-forget-it feature.
I'll add that this is a really good write-up! Love this comment:
“There is no harm in using boring/simple methods if the results are right.”
rjmunro|1 year ago
$5k/month was 25% of their pods, so the total was ≈$20k/month. It's entirely possible that self-hosting would cost much more than that, particularly as they wouldn't be able to save costs by scaling down.
helsinkiandrew|1 year ago
The problem wasn't cloud versus self-hosting: the problem was that they had stateful code that didn't scale to thousands of requests from different clients, so they were bringing up new instances on every invocation.
The same 3s runtime startup cost (and the need for more hardware) would apply if they were running their own servers.
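If the cleanup exists because one client's state could leak into the next request, one in-process alternative is a fresh context per job, sketched here with Node's built-in vm module. The usual caveat applies: vm is not a security boundary, so this only helps if the isolation requirements are mild.

```js
// Sketch: run each job in a fresh vm context so per-request state
// can't leak between clients, without paying a pod startup per request.
const vm = require('node:vm');

function runJob(jobCode) {
  const sandbox = vm.createContext({ console }); // fresh globals per job
  return vm.runInContext(jobCode, sandbox, { timeout: 5000 });
}

console.log(runJob('1 + 1')); // => 2
```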
bravetraveler|1 year ago
Routinely: oops, our API usage slipped and we mistakenly paid more than the staff to avoid this would cost
Keep fucking up, tech industry. My job role depends on it (SRE)
serverlessmom|1 year ago
dxbydt|1 year ago
Many of the tricks we learned in the late 90s and 2000s can no longer be pulled off. We used to download jar files over the net. Running a major prop trading platform meant thousands of dependencies. You'd have Swing and friends for front-end tables, SAX XML parsers, various numerical libraries, logging modules - all of it downloaded in the jar while the customer impatiently waited to trade some 100MM worth of FX. We learned how to cut down on dependencies. We built tools to massively compress class files. We traded off one big jar against lots of little jars that downloaded on demand. Better yet, we cached most of these jars so they wouldn't need to download every single time. It became a fine art at one point - the difference between a rookie and a professional was that the latter could not just write a spiffy Java frontend, but actually deploy it in prod so customers wouldn't even know there was a startup time; it would just start instantly. Then that whole industry vanished overnight - poof!
Now I write ML code and deploy it in Docker on GCP, and it's the same issues all over again. You import pandas-gbq and pretty much the entire Google BigQuery set of libraries becomes part of the build. Throw in a few standard ML libs and soon you are looking at upwards of 2 seconds of Cloud Run startup time. You pay a premium for autoscaling, for keeping one instance warm at all times, for your monitoring and metrics, on and on. I have yet to see startup times below 500ms. You can slice the cake any which way; you still pay the startup cost penalty. Quite sad.
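The classic mitigation is deferring heavy imports until first use, so startup stays fast and the first request pays the load cost. A sketch of the idea in Node, to match the rest of the thread; the BigQuery client stands in for any heavyweight dependency:

```js
// Sketch: lazy-load a heavy client so process startup stays fast;
// the first request pays the require() cost, later ones reuse the instance.
let bigquery = null;

function getBigQuery() {
  if (!bigquery) {
    const { BigQuery } = require('@google-cloud/bigquery'); // deferred heavy require
    bigquery = new BigQuery();
  }
  return bigquery;
}

module.exports = { getBigQuery };
```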