This is a great article - I really appreciate when people take the time to assemble details from a bunch of different sources (Firecracker paper, re:Invent talks) and turn them into a useful overview like this.
A couple of days ago, I tried to find out how AWS operates RDS behind the scenes. Since it is a managed stateful service, I was wondering whether it runs in a traditional VM-based way or in a fully containerized environment. Unfortunately, a simple search only leads you to the customer-facing resources out there.
It’s nice that AWS builds their own higher level abstractions on the same primitives outside developers use. Feels like they eat their own dogfood much more than Google where they bypass GCP and instead utilize underlying Borg primitives for many services.
One other thing I learned here is that Lambda@Edge does not actually run on the edge at all. Requests are forwarded to the nearest datacenter to execute. There is not enough capacity at the edges to spin up entire VMs for everything, even with Firecracker.
Great write up. Besides the technical parts, AWS Lambda probably created a ton of new businesses/startups that otherwise would have been hard or at least expensive to get going.
This is great! Awesome writeup with the kind of details that are sometimes opaque and hard to find documentation for. I recently deployed a NextJS app using Serverless framework (and serverless-nextjs), so Lambda@Edge... looking fwd to playing more with compute at the CDN edge in general (eg fly.io). Amazing how easy it is, esp. as someone who came into webdev in 1998.
Considering your long experience, didn't you feel like we lost a lot post-PHP? I also stepped out of the PHP world into JS, and never understood why there isn't any apache2-modnodejs... And to me, the serverless JS movement seems to be just that, but with a lot of unnecessary baggage.
From the architecture, it's not really clear to me why Lambdas have the 15 min limitation. It seems to me AWS could use the same infrastructure to make a product that competes with Google Cloud Run. Maybe it's a business thing?
I can't think of any reason outside of product positioning.
A lot of the novelty of Lambda is its identity as a function: small units of execution run on-demand. A Lambda that can run perpetually is made redundant by EC2, and the opinionated time limit informs a lot of design.
Lots of interesting work is being done in this area (I'm currently doing research around serverless). Cold start times still remain a pretty large issue (125ms startup for a VM is still quite large), but there are some interesting papers trying to attack this through strategies like snapshotting!
These 125ms are only the startup time of the MVM and don't include additional latency introduced by optimizing the code package and the involvement of the placement service.
You can also avoid the cold start penalties entirely, if you're willing to pay extra for provisioned concurrency [1].
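For anyone who wants to try this, here is a minimal sketch of switching provisioned concurrency on through boto3's `put_provisioned_concurrency_config`. The function name `my-function`, the alias `live` and the count of 5 are made-up placeholders, and the helper `enable_provisioned_concurrency` is only for illustration. The client is passed in, so the snippet itself loads without boto3 or AWS credentials.

```python
def enable_provisioned_concurrency(client, function_name, qualifier, executions):
    """Ask Lambda to keep `executions` environments initialized ahead of time.

    `client` is expected to be a boto3 Lambda client, e.g. boto3.client("lambda").
    Provisioned concurrency is configured per published version or alias,
    so `qualifier` has to name one of those rather than $LATEST.
    """
    if executions < 1:
        raise ValueError("executions must be at least 1")
    return client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=executions,
    )


# Hypothetical usage (needs boto3 and AWS credentials):
#   import boto3
#   enable_provisioned_concurrency(boto3.client("lambda"), "my-function", "live", 5)
```

You pay for the configured environments whether or not they receive traffic, which is the "pay extra" part.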
Both the re:Invent talks given by Marc Brooker (lead developer on the AWS Lambda team), which I mentioned in the footnotes, and the official documentation that's out there will provide you with a lot of information.
There's a decent references section at the bottom, and having watched the talks and briefly scanned the Firecracker paper it references, they do back up the writer.
When I was at re:Invent 2019 I joined some chalk talks which weren't recorded (or not published). Some of the hosts shared a lot of details about their internal infrastructure.
Fantastic paper. So I've been playing with the Java and Python runtimes and it's absolutely stunning how much better Python is on execution and startup time.
Also how does an event actually get to the Lambda handler? Because they can come from all kinds of sources.
Of course that just moves the question to "where does the runtime API get the event from?" but at that point they could be doing all kinds of things, I guess.
Now I'm curious to see what happens if you ask for more events before returning a response for the first event you got. Can you actually process events in parallel?
I believe they fire up an HTTP server, based on how their local executor behaves, and then do "servlet-y" (or WSGI-y) dispatch into the entry point method.
i keep seeing talk about fast firecracker boot times... but as far as i can tell firecracker is something that talks to the KVM apis and does monitoring...
wouldn't fast boot times be a result of kvm and the structure of the VM being booted, or does this boot time metric include scheduling (on one or more firecracker hosts) and delivery of the image to the runner host for the VM?
mlerner | 4 years ago
daxfohl | 4 years ago
bschaatsbergen | 4 years ago
simonw | 4 years ago
Clearly Bruno got a lot of the details right: Jeff Barr tweeted a link to this a few weeks ago: https://twitter.com/jeffbarr/status/1404512248152825857
abarrak | 4 years ago
ec109685 | 4 years ago
rorykoehler | 4 years ago
daxfohl | 4 years ago
mcspiff | 4 years ago
tnolet | 4 years ago
chrisweekly | 4 years ago
emteycz | 4 years ago
carlosf | 4 years ago
cloakandswagger | 4 years ago
kolanos | 4 years ago
[0]: https://read.iopipe.com/how-far-out-is-aws-fargate-a2409d2f9...
bschaatsbergen | 4 years ago
bigodanktime | 4 years ago
https://arxiv.org/pdf/2101.09355.pdf
Also predicting function calls to properly schedule and reduce cold start latency
https://www.usenix.org/system/files/atc20-shahrad.pdf
Dunedan | 4 years ago
[1]: https://docs.aws.amazon.com/lambda/latest/dg/configuration-c...
chews | 4 years ago
dr_kretyn | 4 years ago
bschaatsbergen | 4 years ago
garblegarble | 4 years ago
dmarinus | 4 years ago
Something1234 | 4 years ago
ben0x539 | 4 years ago
https://github.com/aws/aws-lambda-go/blob/159d1c69878562cd54...
The response is apparently posted to another API endpoint.
Googling for the environment variable actually specifying the API endpoint, I found https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom... which seems to spell out some of the details.
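Going by that custom runtime doc, the loop can be sketched roughly like this in Python. It is only an illustration of the documented Runtime API (GET `/2018-06-01/runtime/invocation/next`, then POST the result to `/2018-06-01/runtime/invocation/<request id>/response`), not what the real runtimes look like. `run`, `max_events` and the handler are names made up for the sketch, and error reporting is left out.

```python
import http.client
import json

API_VERSION = "2018-06-01"


def next_invocation(api):
    """Block until the runtime API hands out the next event."""
    conn = http.client.HTTPConnection(api)
    conn.request("GET", f"/{API_VERSION}/runtime/invocation/next")
    resp = conn.getresponse()
    # The request id arrives as a header and is needed to post the result.
    request_id = resp.getheader("Lambda-Runtime-Aws-Request-Id")
    event = json.loads(resp.read())
    conn.close()
    return request_id, event


def post_response(api, request_id, result):
    """Send the handler's result back for one specific invocation."""
    conn = http.client.HTTPConnection(api)
    path = f"/{API_VERSION}/runtime/invocation/{request_id}/response"
    conn.request("POST", path, body=json.dumps(result).encode())
    conn.getresponse().read()
    conn.close()


def run(handler, api, max_events=None):
    """Fetch one event, call the handler, post the result, repeat.

    `max_events` is an assumption added so the loop can be stopped in tests;
    a real runtime would loop until the environment is shut down.
    """
    handled = 0
    while max_events is None or handled < max_events:
        request_id, event = next_invocation(api)
        post_response(api, request_id, handler(event))
        handled += 1
    return handled


# Inside a custom runtime this would be started roughly as:
#   import os
#   run(my_handler, os.environ["AWS_LAMBDA_RUNTIME_API"])
```

This sketch only ever has one event in flight: it posts the response before asking for the next one.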
mdaniel | 4 years ago
bigodanktime | 4 years ago
My guess is that Java is slow here because it only works really well once the JIT has had a chance to optimize your code. Longer-running functions will very likely outperform Python.
a-dub | 4 years ago
marp00 | 4 years ago
robertlagrant | 4 years ago
personlurking | 4 years ago
[deleted]
minitoar | 4 years ago
Exmoor | 4 years ago
adflux | 4 years ago
nl | 4 years ago