Behind the scenes, AWS Lambda

[+] mlerner|4 years ago|reply

If you're interested in Firecracker, I wrote a summary of the original paper here: https://www.micahlerner.com/2021/06/17/firecracker-lightweig...

[+] daxfohl|4 years ago|reply

Any idea how much it has diverged from crosvm?

[+] bschaatsbergen|4 years ago|reply

Great article @mlerner

[+] simonw|4 years ago|reply

This is a great article - I really appreciate when people take the time to assemble details from a bunch of different sources (Firecracker paper, re:Invent talks) and turn them into a useful overview like this.

Clearly Bruno got a lot of the details right; Jeff Barr tweeted a link to this a few weeks ago: https://twitter.com/jeffbarr/status/1404512248152825857

[+] abarrak|4 years ago|reply

A couple of days ago, I tried to find out how AWS operates RDS behind the scenes. Since it is a managed stateful service, I was wondering whether it runs in a traditional VM-based way or in a fully containerized environment. Unfortunately, a simple search only turns up the consumer/customer-facing resources out there.

[+] ec109685|4 years ago|reply

This is a good paper that talks about Aurora and provides some insight into how RDS operates: https://www.allthingsdistributed.com/files/p1041-verbitski.p...

It’s nice that AWS builds their own higher level abstractions on the same primitives outside developers use. Feels like they eat their own dogfood much more than Google where they bypass GCP and instead utilize underlying Borg primitives for many services.

[+] rorykoehler|4 years ago|reply

Based on how they bill it, it looks like it's running on VMs

[+] daxfohl|4 years ago|reply

One other thing I learned here is that lambda@edge is not actually run on the edge at all. It is forwarded to the nearest datacenter to execute. Not enough capacity in edges to spin up entire VMs for everything, even with Firecracker.

[+] mcspiff|4 years ago|reply

They do have Cloudfront Functions now for real Edge compute: https://aws.amazon.com/blogs/aws/introducing-cloudfront-func...

[+] tnolet|4 years ago|reply

Great write up. Besides the technical parts, AWS Lambda probably created a ton of new businesses/startups that otherwise would have been hard or at least expensive to get going.

[+] chrisweekly|4 years ago|reply

This is great! Awesome writeup w the kind of details that are sometimes opaque and hard to find documentation for. I recently deployed a NextJS app using Serverless framework (and serverless-nextjs), so Lambda@Edge... looking fwd to playing more with compute at CDN edge in general (eg fly.io). Amazing how easy it is, esp. as someone who came into webdev in 1998.

[+] emteycz|4 years ago|reply

Considering your long experience, didn't you feel like we lost a lot post-PHP? I also stepped out of the PHP world into JS, and never understood why there isn't any apache2-modnodejs... And to me, the serverless JS movement seems to be just that, but with a lot of unnecessary baggage.

[+] carlosf|4 years ago|reply

Really cool post!

From the architecture, it's not really clear to me why Lambdas have the 15 min limitation. It seems to me AWS could use the same infrastructure to make a product that competes with Google Cloud Run. Maybe it's a business thing?

[+] cloakandswagger|4 years ago|reply

I can't think of any reason outside of product positioning.

A lot of the novelty of Lambda is its identity as a function: small units of execution run on-demand. A Lambda that can run perpetually is made redundant by EC2, and the opinionated time limit informs a lot of design.

[+] kolanos|4 years ago|reply

This service exists, it's called AWS Fargate [0].

[0]: https://read.iopipe.com/how-far-out-is-aws-fargate-a2409d2f9...

[+] bschaatsbergen|4 years ago|reply

Good to know you enjoyed the read!

[+] bigodanktime|4 years ago|reply

Lots of interesting work is being done in this area (I'm currently doing research around serverless). Cold start times still remain a pretty large issue (125ms start up for a VM is still quite large), but there are some interesting papers trying to attack this through strategies like snapshotting!

https://arxiv.org/pdf/2101.09355.pdf

Also predicting function calls to properly schedule and reduce cold start latency

https://www.usenix.org/system/files/atc20-shahrad.pdf
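
To make the prediction idea concrete, here is a rough sketch of one way it could work. This is not the method from the linked paper, just an illustration under my own assumptions: look at the idle gaps between past invocations of a function and keep an instance warm for about as long as most of those gaps last. The function names, the 99th percentile, and the 600 second fallback are all made up for the example.

```python
import math


def idle_gaps(invocation_times):
    """Idle time between consecutive invocations (timestamps in seconds)."""
    ordered = sorted(invocation_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def keep_alive_seconds(invocation_times, pct=99, default=600):
    """Pick a keep-alive window that would have covered `pct`% of past gaps.

    `default` is a hypothetical fallback used when there is no history yet.
    """
    gaps = idle_gaps(invocation_times)
    if not gaps:
        return default
    return percentile(gaps, pct)
```

A function invoked at seconds 0, 10, 20, 50 and 60 has gaps of 10, 10, 30 and 10, so a window chosen this way would keep the instance around for 30 seconds after each call. The trade-off is memory held by idle instances versus cold starts avoided.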

[+] Dunedan|4 years ago|reply

These 125ms are only the startup time of the MVM and don't include additional latency introduced by optimizing the code package and the involvement of the placement service.

You can also avoid the cold start penalties entirely, if you're willing to pay extra for provisioned concurrency [1].

[1]: https://docs.aws.amazon.com/lambda/latest/dg/configuration-c...

[+] chews|4 years ago|reply

nice writeup for how the magic really works. lambdas rock!

[+] dr_kretyn|4 years ago|reply

Is this write up correct? How do they know that? I don't see any references for the source of the info except a talk at re:Invent.

[+] bschaatsbergen|4 years ago|reply

Both the re:Invent talks by Marc Brooker (lead developer on the AWS Lambda team), which I mentioned in the footnotes, and the official documentation that's out there will provide you with a lot of information.

[+] garblegarble|4 years ago|reply

There's a decent references section at the bottom, and having watched the talks and briefly scanned the Firecracker paper referenced, they do back up the writer.

[+] dmarinus|4 years ago|reply

When I was at re:invent 2019 I joined some chalk talks which weren't recorded (or not published). Some of the hosts shared a lot of details about their internal infrastructure.

[+] Something1234|4 years ago|reply

Fantastic paper. So I've been playing with the java and python runtimes and it's absolutely stunning how much better python is on execution and start up time.
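
For anyone who wants to poke at this themselves, a minimal sketch of how to tell cold and warm invocations apart from inside a Python function. It relies on module-level code running once per execution environment, while the handler runs on every invocation. The name `handler` and the fields in the returned dict are just my choices for the example, and this only sees time from the moment the module is loaded, not the microVM or runtime boot before it.

```python
import time

# Module-level code runs once per execution environment, i.e. on a cold start.
_INIT_STARTED = time.monotonic()
_is_cold = True


def handler(event, context):
    """Report whether this invocation landed on a freshly initialised environment."""
    global _is_cold
    cold, _is_cold = _is_cold, False
    return {
        "cold_start": cold,
        "seconds_since_init": time.monotonic() - _INIT_STARTED,
    }
```

The first call in a new environment returns `cold_start: True`, and later calls that reuse the same environment return `False`.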

Also how does an event actually get to the lambda handler? Because they can come from all kinds of sources.

[+] ben0x539|4 years ago|reply

Judging from five minutes of browsing the go runtime sources, they poll some lambda API:

https://github.com/aws/aws-lambda-go/blob/159d1c69878562cd54...

The response is apparently posted to another API endpoint.

Googling for the environment variable actually specifying the API endpoint, I found https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom... which seems to spell out some of the details.

Of course that just moves the question to "where does the runtime API get the event from?" but at that point they could be doing all kinds of things, I guess.

Now I'm curious to see what happens if you ask for more events before returning a response for the first event you got. Can you actually process events in parallel?
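
The polling loop described above can be sketched roughly like this, going by the custom runtime docs: the runtime does a GET on the "next invocation" endpoint, runs the handler, and POSTs the result back under the request ID it got in the `Lambda-Runtime-Aws-Request-Id` header. The host comes from the `AWS_LAMBDA_RUNTIME_API` environment variable. The helper names (`fetch_next`, `post_response`, `process_one`) and the echo handler are mine, and error reporting is left out, so treat it as a sketch rather than what any official runtime does.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def next_invocation_url(api):
    return f"http://{api}/{API_VERSION}/runtime/invocation/next"


def response_url(api, request_id):
    return f"http://{api}/{API_VERSION}/runtime/invocation/{request_id}/response"


def fetch_next(api):
    """Block until the runtime API hands out an event; returns (request_id, event)."""
    with urllib.request.urlopen(next_invocation_url(api)) as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        return request_id, json.loads(resp.read())


def post_response(api, request_id, result):
    """Send the handler's result back for the given request ID."""
    req = urllib.request.Request(
        response_url(api, request_id),
        data=json.dumps(result).encode("utf-8"),
        method="POST",
    )
    urllib.request.urlopen(req).close()


def process_one(api, handler, fetch=fetch_next, post=post_response):
    """Handle exactly one event: fetch it, run the handler, post the result."""
    request_id, event = fetch(api)
    post(api, request_id, handler(event))
    return request_id


def main():
    # AWS_LAMBDA_RUNTIME_API is set by Lambda inside the execution environment.
    api = os.environ["AWS_LAMBDA_RUNTIME_API"]
    while True:
        process_one(api, lambda event: {"echo": event})
```

`fetch` and `post` are passed in as parameters only so the loop body can be exercised without a real runtime API behind it.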

[+] mdaniel|4 years ago|reply

I believe they fire up an http server, based on how their local executor behaves, and then do "servlet-y" (or WSGI-y) dispatch into the entry point method

[+] bigodanktime|4 years ago|reply

Pretty sure it's just gRPC calls, and firecracker passes events along to its firecracker-containerd services.

My guess is java is slow here because it only works really well once the JIT has had time to optimize your code. Longer running functions will very likely outperform python.

[+] a-dub|4 years ago|reply

i keep seeing talk about fast firecracker boot times... but as far as i can tell firecracker is something that talks to the KVM apis and does monitoring...

wouldn't fast boot times be a result of kvm and the structure of the VM being booted, or does this boot time metric include scheduling (on one or more firecracker hosts) and delivery of the image to the runner host for the VM?

[+] marp00|4 years ago|reply

Great article. Is there a corresponding one about Google Cloud Functions?

[+] robertlagrant|4 years ago|reply

I'd love to see a similar writeup of Cloudflare Workers.

[+] personlurking|4 years ago|reply

[deleted]

[+] minitoar|4 years ago|reply

It’s not that odd if you consider it is the 11th letter of the Greek alphabet.

84 comments