
The rise of embarrassingly parallel serverless compute

132 points | prostoalex | 5 years ago | davidwells.io | reply

131 comments

[+] Uptrenda|5 years ago|reply
Except that none of these compute resources are free and the cost of cloud-based super computing is a complete rip-off. It's actually thousands of dollars cheaper to build your own super computer from used rack mount servers than it is to use AWS, Azure, or Google cloud for lengthy computations.

As an example: you can buy a Dell PowerEdge M915 from eBay with 128 cores for ~$500 USD, and a rack costs around the same. Five of them is 640 cores for a total cost of just 3k USD. That's 640 cores that you now own forever. Guess how much it would cost to use that many cores for a month on AWS? Well over 10k... and next month it's another 10k... and so is the next month...

With this option you only pay for power and still have the resources for all future experiments. I think the M1000e rack can fit 8 blades in total, so you could upgrade to a max of 1024 cores in the future! The downside with this particular rack is that it's very, very loud. But I've run the numbers here and it's hard to beat high-core-count used rack servers on a $/GHz basis.
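
For what it's worth, here is a rough back-of-envelope of that comparison in Python. Every number in it (hardware price, power draw, electricity rate, cloud per-core-hour rate) is an illustrative assumption rather than a quote, so treat it as a sketch of the reasoning, not real pricing.

    # Back-of-envelope: owned used hardware vs renting the same core count 24/7.
    # All numbers below are illustrative assumptions, not real quotes.
    servers = 5
    cores_per_server = 128
    hardware_cost = servers * 500 + 500     # five blades plus a chassis, USD
    watts_per_server = 1000                 # assumed draw under full load
    electricity_rate = 0.12                 # assumed USD per kWh
    cloud_rate_per_core_hour = 0.02         # assumed blended per-core price

    hours_per_month = 730
    total_cores = servers * cores_per_server

    power_per_month = servers * watts_per_server / 1000 * hours_per_month * electricity_rate
    cloud_per_month = total_cores * cloud_rate_per_core_hour * hours_per_month

    print(f"{total_cores} cores")
    print(f"own hardware: ${hardware_cost} up front + ${power_per_month:.0f}/mo power")
    print(f"cloud, 24/7:  ${cloud_per_month:.0f}/mo")

Of course, the comparison only favors owned hardware if the cores really are busy around the clock, which is the crux of the replies below.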

[+] jungturk|5 years ago|reply
Are any compute resources free?

It's telling that you've neglected your own labor in this assessment, and that of the other specialists required to get such ad-hoc compute working at scale (dynamic software networks, on-demand compute, provisioning/billing/monitoring apparatuses, etc.).

What plenty have realized with cloud-native approaches is that the TCO is pretty compelling, sometimes because it's easier to distribute procurement/provisioning/billing to teams for their own resource needs, or just in sidestepping the inevitable bureaucracy when you try to centralize that decision-making.

Of course it's different if you're talking hobby setups in your own home, but if you're aiming at real scale for absurd parallelism, that ship has sailed.

[+] cortesoft|5 years ago|reply
Where are you going to store this thing? Pay for the power? Cool it? Replace failing hardware?

As someone who works for a company with over 150 data centers around the world, I know space and power are always among our biggest expenses.

[+] louwrentius|5 years ago|reply
You are right. And if you buy metal and rent some space in a DC, it is still more cost-effective.

You need a few people who understand a little bit of hardware and IPMI and you are set. That's not a big deal.

The cloud gaslighting is nuts.

Tons of companies now exist to give you insight into your cloud bill, because that's the last thing cloud providers want you to have.

So in the end you must decide what kind of pain you want, cost pain or tech pain, and that all depends on your company's particular circumstances.

[+] bpodgursky|5 years ago|reply
None of this math is correct. You absolutely cannot get 128 cores for $500. Please link to a single valid-looking eBay offer at that amount (you may have seen 128GB and thought it was 128 cores). I see 32 cores for $500, at best.

Second, 640 cores' worth of m4.large instances, paid 3 years upfront (the equivalent purchase), comes to $8,755/mo, not well over 10k.

Third, you're really underestimating electricity costs if you're running servers 24/7 with all cores loaded.

Last, and most importantly, most people simply don't run servers 24/7. They run batch computations for five hours a week, or a day, and then spend a week or two crunching the numbers.

There's a case to be made that for certain institutions with very particular compute needs, running on-premises might make short-term financial sense (let's not even get started on labor costs). But it's really inaccurate to call cloud computing a "rip-off" for anything but the most niche customers.

[+] jayd16|5 years ago|reply
The way I look at this is that you still need a minimum of 3 people on call to have around-the-clock ops for this rack. Three IT salaries are more than $10k a month. Eventually you'll find the balancing point, but it's higher than you think.
[+] csdreamer7|5 years ago|reply
> you can buy a Dell PowerEdge M915 from ebay with 128 cores for ~$500 USD

Where? I wasn't able to find one. I found a 32-core for $750. It was a pre-Epyc AMD server, which has its own problems.

There were only 9 US listings for the exact term you gave, 'Dell PowerEdge M915', before I got to pricey international sellers. So few options make me skeptical about this.

This is of course before I even consider security and performance issues of used Intel servers.

[+] garethmcc|5 years ago|reply
You forget a few things:

1. AWS and other cloud vendors don't have just one data centre per region; they usually have three redundant locations for availability reasons, connected via direct fibre links.
2. Cost of time: time to find hardware, prep hardware, maintain hardware.
3. Cost of your agility. If you need more compute capacity, it will take you weeks to get the hardware required and installed; otherwise you need to have unused hardware sitting around "just in case".
4. Cost of your availability. What if you have a sudden spike of traffic within minutes and the currently available hardware cannot service it? At least in the cloud you can spin up short-lived resources to manage that load before it throttles back down again.
5. Permanent running costs of fixed hardware. A lot of implementations do not need permanent hardware running and can throttle down to a base of almost nothing, so the average running cost on a monthly basis is actually very low.

[+] p1necone|5 years ago|reply
You can't really compare value just by looking at core count; I imagine an old 128-core server off of eBay is going to perform very differently from something running on a more modern architecture.
[+] 6gvONxR4sf7o|5 years ago|reply
You also have to account for burstiness. If you have nightly/weekly workflows, you might see some wild swings in resource usage, where average and max usage differ by a lot. It's nice to pay for average needs rather than max needs (especially when you can use on-demand pricing).
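
A toy calculation of that point (all rates and sizes are assumed, purely for illustration): paying on demand for an average load plus occasional bursts can come out far cheaper than keeping a peak-sized fleet on all month.

    # Illustrative only: provisioning for peak vs paying on demand for the average.
    avg_cores = 40            # assumed steady-state need
    peak_cores = 640          # assumed nightly/weekly peak
    peak_hours = 20           # assumed hours per month actually spent at peak
    hours_per_month = 730
    on_demand_rate = 0.02     # assumed USD per core-hour
    reserved_rate = 0.012     # assumed discounted always-on USD per core-hour

    always_on_at_peak = peak_cores * reserved_rate * hours_per_month
    burst_on_demand = (avg_cores * hours_per_month
                       + (peak_cores - avg_cores) * peak_hours) * on_demand_rate

    print(f"always-on fleet sized for peak: ${always_on_at_peak:.0f}/mo")
    print(f"on-demand, bursting for peaks:  ${burst_on_demand:.0f}/mo")
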
[+] coldtea|5 years ago|reply
>Except that none of these compute resources are free and the cost of cloud-based super computing is a complete rip-off. It's actually thousands of dollars cheaper to build your own super computer

Only if your time is worthless

[+] jariel|5 years ago|reply
" It's actually thousands of dollars cheaper to build your own super computer from used rack mount servers than it is to use AWS, Azure, or Google cloud for lengthy computations."

The issue is 'total cost of ownership' - not 'unit cost of equipment'.

The operational cost and overhead of running your own gear can be prohibitive.

AWS is successful because it is in fact much cheaper than the alternative in many enterprise situations.

[+] ChuckMcM|5 years ago|reply
Hilariously, in 2008 Google was hugely resistant to this idea. DARPA had just put out an RFP for a new way of running computationally dependent tasks, which then ran on supercomputers, on a 'shared nothing' architecture (which is what Google ran at the time, and I believe still does). I had done some research in that space when I was at NetApp, looking at decomposing network-attached storage services into a shared-nothing architecture, so I had some idea of the kinds of problems that were the "hard bits" in getting it right.

I recall pointing out to Platform's management that if Google could provide an infrastructure that solved these sorts of massively parallel problems, which currently required specialized switching fabrics and massive memory sharing, we would have something very special. But at the time it was a non-starter: way too much money to be made in search ads to bother building a system for something like 200 customers in the world, total.

I didn't care one way or the other if Google did it, so after running into the wall of "under 2s" a couple of times I just said "fine, your loss."

[+] fxtentacle|5 years ago|reply
I strongly believe that the author never tried out his examples.

One time, I wanted to process a lot of images stored on Amazon S3. So I used 3 Xeon quad-core nodes from my render farm together with highly optimized C++ software. We peaked at 800 Mbit/s downstream before the entire S3 bucket went offline.

Similarly, when I was doing molecular dynamics simulations, the initial state was 20 GB, and so were the results.

The issue with these computing workloads is usually IO, not the raw compute power. That's why Hadoop, for example, moves the calculations onto the storage nodes, if possible.

[+] papaf|5 years ago|reply
> That's why Hadoop, for example, moves the calculations onto the storage nodes, if possible

You make a good point about I/O and I actually wanted to comment something along the lines of "why not Hadoop?" since the programming model looks very similar but with less mature tooling.

However, now I think about it, the big win of serverless is that it is not always on. With Hadoop, you build and administer a cluster, which is only efficient if you use it constantly. This serverless setup would suit jobs that only run occasionally.

[+] shoo|5 years ago|reply
there is a subset of embarrassingly parallel problems that are heavily data-intensive.

E.g. suppose you have 100 TB of data files and you want to run some kind of keyword search over the data. If the data can be broken into 1000x100GB chunks, then you can do some map-reduce-ish thing where each 100GB chunk is searched independently, and then the search results from each of the 1000 chunks are aggregated. 1000x speedup! serverless!

however, if you want to execute this across some fleet of rented "serverless" servers, the key factors that will influence cost and running time are (1) where the 100 TB of data is right now, (2) how you are going to copy each 100 GB chunk of the data to each serverless server, and (3) how much time and money that copying will cost.

I.e. in examples like this, where the time required to read the data and send it over the network is much larger than the time required to process it once it is in memory, it is going to be more efficient to move the code & the compute to where the data already is, rather than moving the data and the code to some other physical compute device behind a bunch of abstraction layers and network pipes.
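
A minimal local sketch of the map-reduce shape being described, with local files standing in for the 100 GB chunks and a process pool standing in for the serverless fleet. It deliberately sidesteps the data-movement cost that is the whole point above, so read it as the compute side only.

    # Sketch of map-reduce-style keyword search over independent chunks.
    # Local files stand in for the chunks; in a real fleet the expensive part
    # is getting each chunk to its worker, which is the point being made above.
    import sys
    from concurrent.futures import ProcessPoolExecutor

    def search_chunk(args):
        path, keyword = args
        hits = []
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if keyword in line:
                    hits.append((path, lineno, line.rstrip()))
        return hits

    def search(paths, keyword, workers=8):
        # "map": search each chunk independently in its own process
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(search_chunk, [(p, keyword) for p in paths]))
        # "reduce": aggregate the per-chunk hit lists
        return [hit for part in partials for hit in part]

    if __name__ == "__main__":
        keyword, *files = sys.argv[1:]
        for path, lineno, line in search(files, keyword):
            print(f"{path}:{lineno}: {line}")
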

[+] gopalv|5 years ago|reply
> there is a subset of embarrassingly parallel problems that are heavily data-intensive.

There's an even smaller subset which is one-shot data access.

> in examples like this where the time required to read the data and send the data over the network is much larger than the time required to compute the data once the data is in memory

The annoying thing about lambda and other functional alternatives is that data-access patterns tend to be repetitive in somewhat predictable ways & there is no way to take advantage of that fact easily.

However, if you don't have that & say you were reading from S3 for every pass, then lambda does look attractive because the container lifetime management is outsourced - but if you do have even temporal stickiness of data, then it helps to do your own container management & direct queries closer to previous access, rather than to entirely cold instances.

If there's one thing that Hadoop missed out on building into itself, it's a distributed work queue of functions with slight side effects (i.e. memoization).
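
A toy illustration of the "temporal stickiness" point: if the same worker keeps seeing the same keys, a small local cache turns repeat reads into memory hits, which is exactly what a cold, randomly routed function cannot exploit. The fetch is stubbed out and the bucket/key names are made up.

    # Toy sketch: warm, sticky workers beat cold ones for repetitive data access,
    # because repeated reads of the same chunk hit a local cache instead of S3.
    from functools import lru_cache

    @lru_cache(maxsize=64)            # keep recently used chunks in this worker's memory
    def fetch_chunk(bucket: str, key: str) -> bytes:
        print(f"cold read: s3://{bucket}/{key}")
        return b"..."                 # stand-in for an actual S3 GET

    def handle(query_keys):
        return [fetch_chunk("example-bucket", k) for k in query_keys]

    # A warm worker that keeps being routed the same keys pays the "S3" cost once:
    handle(["chunk-0001", "chunk-0002"])
    handle(["chunk-0001", "chunk-0002"])   # second pass is served from the local cache
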

[+] Grimm1|5 years ago|reply
If you stored the data in a distributed fashion initially then it's not a problem.

Redshift, BigQuery, etc. implement it this way and then have various schemes for computation on top. Redshift bundles individual compute with storage, whereas other implementations scale the compute independently of the distributed storage.

But this has allowed very cheap scaling for querying large datasets, and in practice, with tools like those available, I imagine you very rarely have to worry about implementing the data transport yourself beyond the initial loading.

Edit: Also, on most clouds moving data within their own network is free, so really we're just talking about the time spent moving data, which indirectly influences cost in terms of run time.

[+] mihaaly|5 years ago|reply
Not to mention when the processing of a huge data set isn't meant to produce a short answer but to transform the data into another form, of the same size or still big. Even if the data is massively distributed from the start, the communication and aggregation of distributed data between processing or client nodes can be a considerable bottleneck. Different aspects may call for different distribution logic, otherwise performance suffers, but rearranging the data for the sake of processing can ruin the overall performance of the task.
[+] aquamesnil|5 years ago|reply
Stateless serverless platforms like Lambda force a data-shipping architecture, which hinders any workflow that requires state bigger than a webpage (as in the included tweet) or coordination between functions. The functions are short-lived and not reusable, have limited network bandwidth to S3, and lack P2P communication, which does not fit the efficient distributed programming models we know of today.

Emerging stateful serverless runtimes[1] have been shown to support even big data applications whilst keeping the scalability and multi-tenancy benefits of FaaS. Combined with scalable fast stores[2][3], I believe we have here the stateful serverless platforms of tomorrow.

[1] https://github.com/lsds/faasm (can run on KNative, includes demos)
[2] https://github.com/hydro-project/anna (KVS)
[3] https://github.com/stanford-mast/pocket (multi-tiered storage, DRAM+Flash)

[+] superjan|5 years ago|reply
It reminds me of this quote: “a supercomputer is a device that turns a compute-bound problem into an I/O-bound problem” (Ken Batcher).
[+] ipnon|5 years ago|reply
It's going to take a new kind of software engineer to build these fully distributed systems. You can imagine calls for "Senior Serverless Engineers". Will conventional serverful engineers be left in the dust, or will the serverless engineers just break away and pioneer apps at a new scale?
[+] threeseed|5 years ago|reply
Serverless has been around for many years now.

It doesn't require a new kind of software engineer. It's just another software architecture to go alongside micro services, containerisation etc.

And it hasn't changed the world because (a) it's the ultimate form of vendor lock-in and (b) it makes even simple apps much more complex to reason about and manage.

[+] corty|5 years ago|reply
Frankly, there isn't much engineering to be done. This only works for embarrassingly parallel tasks, so you can just start a loop, distribute data, start compute, collect results. Done. For anything else, this model breaks down.

Need aggregation of results? Communication among nodes? Computation subdivision that is not strictly predeterminable? Sorry, not embarrassingly parallel, won't be doable like this.

You may be able to extract some embarrassingly parallel part, like compilation of independent object files, but very often you still have a longish, complex and time-consuming serial step, like linking those object files. This kind of recognising the different parts of a program is already state of the art; no need to invent a new field...
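
For the embarrassingly parallel part, the loop/distribute/compute/collect shape really is only a few lines. A minimal local sketch, with a process pool standing in for remote workers and a placeholder per-chunk function:

    # The whole "embarrassingly parallel" pattern: split the input, fan the pieces
    # out to workers, collect the results. A local pool stands in for remote nodes,
    # and compute() is a placeholder for the real per-chunk work.
    from multiprocessing import Pool

    def compute(chunk):
        return sum(x * x for x in chunk)

    def split(data, n):
        k = max(1, len(data) // n)
        return [data[i:i + k] for i in range(0, len(data), k)]

    if __name__ == "__main__":
        data = list(range(1_000_000))
        with Pool(processes=8) as pool:
            partials = pool.map(compute, split(data, 8))   # distribute + compute
        print(sum(partials))                               # collect/aggregate
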

[+] adev_|5 years ago|reply
> It's going to take a new kind of software engineer to build these fully distributed systems.

What you are selling here is nothing other than a new proprietary scheduler/runtime to run embarrassingly parallel jobs (the easiest kind of parallelism).

There are already plenty of solutions to do that; the only difference here is that you run on AWS Lambda.

Why would you need an entirely new type of engineer to do that?

There is nothing new here except buzzwords.

Are engineers nowadays script kiddies bound to one technology for their entire career? (Hint: of course not.)

[+] rahimnathwani|5 years ago|reply
The scraping example seems poorly chosen. The original blog post describing that example is no longer online, but archive.org has a copy: https://web.archive.org/web/20180822034920/https://blog.sean...

If the author just wanted to fetch pages in parallel, they could have done better than 8 hours even on their own laptop (you can run more than one chromium process at a time). The real benefit they got from using AWS Lambda is that the requests weren't throttled or ghosted by Redfin, probably because the processes were running on enough different machines, with different IP addresses.

[+] alexchamberlain|5 years ago|reply
> One of the challenging parts about this today is that most software is designed to run on single machines, and parallelization may be limited to the number of machine cores or threads available locally.

Depending how you look at it, I don't think most software is designed to take advantage of multiple cores, let alone multiple machines.

[+] efnx|5 years ago|reply
Both points are true, but the author is talking about writing new software specifically for serverless compute.
[+] raverbashing|5 years ago|reply
But how efficient is serverless?

Has anyone benchmarked the speed of running (let's say, on AWS) 1000x a lambda function vs. running the same function on regular AWS instances?

What about all the overhead (for example, k8s overhead, in both CPU and disk, etc.)?

I'm afraid it would be very easy to get a repeat of this https://adamdrake.com/command-line-tools-can-be-235x-faster-...

[+] yangl1996|5 years ago|reply
Such a scheme depends heavily on whether the cloud providers can efficiently multiplex their bare-metal machines to run these jobs concurrently. Ultimately, a particular computing job takes a fixed amount of CPU-hours, so there are definitely no savings in such a scheme in terms of energy consumption or CPU-hours. At the same time, overhead appears when a job can't be perfectly parallelized: e.g. the same memory content being replicated across all executing machines, synchronization, the cost of starting a ton of short-lived processes, etc. This overhead all adds to the CPU-hours and energy consumption.

So, does serverless computing reduce job completion time? Yes, if the job is somewhat parallelizable. Does it save energy, money, etc.? Definitely not. The question is whether you want to make the tradeoff here: how much more energy would you pay for to reduce the job completion time by half? It's like batch processing vs. realtime operation: the former provides higher throughput, while the latter gives the user lower latency. Having better cloud infrastructure (VMs, schedulers, etc.) helps make this tradeoff more favorable, but the research community has only just started looking at this problem.
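
A toy model of that tradeoff, with made-up numbers: per-worker overhead (startup, duplicated state, synchronization) shrinks the wall-clock win and inflates the total CPU-hours you pay for.

    # Toy model: fanning out cuts wall-clock time but adds per-worker overhead,
    # so total CPU-hours (and therefore cost/energy) go up. Numbers are made up.
    serial_cpu_hours = 100.0     # assumed work in the job itself
    overhead_per_worker = 0.05   # assumed CPU-hours of startup/duplication per worker

    for workers in (1, 10, 100, 1000):
        wall_clock = serial_cpu_hours / workers + overhead_per_worker
        total_cpu = serial_cpu_hours + workers * overhead_per_worker
        print(f"{workers:5d} workers: {wall_clock:7.3f} h wall clock, "
              f"{total_cpu:7.1f} CPU-hours billed")
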

[+] qeternity|5 years ago|reply
Lambda-style compute only seems to be "rising" lately to those who never used it before. We've been running huge amounts of "serverless"-style workloads in a Celery + RabbitMQ setup. Our workloads are fairly stable, so bursting in a public cloud has no real value for us, but we regularly do batch jobs that burst capacity, and we simply spin up more workers for those.

The author seems to think the paradigm is new (it isn't) and claims that it hasn't taken off massively (it has), because he incorrectly points to a number of workloads that aren't embarrassingly parallel. On the other hand, in theory, having a common runtime for these operations from a public cloud provider should let them keep their resource utilization extremely high, such that it would be cheaper for us to use AWS/GCP/etc. instead of rolling our own on OVH/Hetzner. But if anything, the per-compute cost of FaaS is higher than it is for other compute models, which means the economics really only work for small workloads where the fixed overhead of EC2 is larger than the variable overhead of Lambda.
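
A rough sketch of that break-even in Python; the per-GB-second and per-hour rates are illustrative placeholders, not current AWS prices.

    # Rough break-even between per-invocation (Lambda-style) and always-on
    # (EC2-style) pricing. Rates are illustrative placeholders only.
    lambda_gb_second = 0.0000167   # assumed USD per GB-second
    instance_per_hour = 0.10       # assumed USD/hour for a comparable instance
    mem_gb = 1.0
    seconds_per_request = 0.2

    def lambda_cost(requests_per_month):
        return requests_per_month * seconds_per_request * mem_gb * lambda_gb_second

    def instance_cost():
        return 730 * instance_per_hour        # always on, regardless of traffic

    for req in (10_000, 1_000_000, 50_000_000):
        print(f"{req:>11,} req/mo: lambda ${lambda_cost(req):8.2f}  vs  instance ${instance_cost():.2f}")

Below some request volume the always-on instance is mostly idle and per-invocation pricing wins; past it, the per-invocation premium dominates.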

[+] tlarkworthy|5 years ago|reply
Don't overcomplicate it: xargs and curl are often enough to drive big, ad hoc jobs.
[+] kristopolous|5 years ago|reply
These things only have a handful of customers in the world.

Datasets that are tens of gigabytes, or maybe 100 million records or so... this really covers most things.

And for every 1 thing it doesn't, there's 20 more claimed that a single machine using simple tools could handle just fine.

Being able to detect when things have been processed, having a way to set dirty flags, prioritizing things, having regions of interest, supporting re-entrant processing, caching parts of the pipeline and having nuanced rules for invalidating it: these, in my mind, are kinda basic things.

When they aren't done, sure, someone will need giant resources because they're doing foolish things. But that's literally the only reason. Substituting money for sense is an old hack.

[+] lachlan-sneff|5 years ago|reply
There's no way to distribute really lightweight thunks of arbitrary code. Maybe WASM can work here, especially if you shape the standard APIs in the right way.

You'd also need built-in support in tooling and compilers, where you can compile specific functions or modules into something that can run separately without actually doing that manually.

[+] theamk|5 years ago|reply
This depends on how lightweight those chunks are.

If your goal is <0.1 s startup, then yeah, you'll need WASM.

If you are OK with 1-5 seconds of startup, you have a ton of options. Apache Spark uses JVM magic to send out the raw bytecode. You can start up a Docker container. If you are willing to rewrite stdio, you can exec machine code under seccomp/pledge.

There are even full-blown VM solutions, such as Amazon Firecracker, which claims: "Firecracker initiates user space or application code in as little as 125 ms and supports microVM creation rates of up to 150 microVMs per second per host."

[+] j88439h84|5 years ago|reply
I don't really understand your claim. You can send code to another computer, and it can execute it and send back the result. What is the problem?
[+] Lio|5 years ago|reply
This seems to come up a lot in these discussions, but Moore's Law (actually just an observation) says nothing about single-thread performance.

It only tells you that the number of transistors on a silicon die will double every 18 months.

If we’re still able to add parallel threads of execution at the same rate then Moore’s Law still holds.

[+] vsskanth|5 years ago|reply
Yeah, but how expensive will it be? Some numbers would be nice for something like hyperparameter tuning.
[+] choeger|5 years ago|reply
> Software compilation

Well, no. Software compilation is not massively parallel. Maybe parts of the optimization pipeline are, but compiling a 1000-unit program (assuming your language of choice even has separate compilation) normally requires putting the units into a dependency graph (see OCaml, for instance), or puts most of the effort into inherently serial tasks like preprocessing and linking (C++).

[+] cranekam|5 years ago|reply
> Note: If you are comfortable with Kubernetes and scaling out clusters for big data jobs & the parallel workloads described below, godspeed!

Probably my pedantic side showing through but I find reading text where ampersand is used in place of “and” really jarring (same for capitalised regular nouns). It seems somewhat common now so I guess I’ll have to get used to it.

[+] loopback_device|5 years ago|reply
Kubernetes is a proper noun, and thus always to be capitalized in English, while the "&" is a ligature of the Latin "et", the word for "and". Not sure why that would be jarring, it's all fine and correct English.
[+] jonnypotty|5 years ago|reply
We can't even design systems that properly take advantage of multiple CPU cores yet.
[+] dstaley|5 years ago|reply
I'm a huge fan of using Lambda to perform hundreds of thousands of discrete tasks in a fraction of the time it'd take to perform those same tasks locally. A while back I used Lambda and SQS to cross-check 168,000 addresses against my ISP's gigabit availability tool.[1] If I recall correctly each check took about three seconds, but running all 168,000 checks on Lambda only took a handful of minutes. I believe the scraper was written in Python, so I shudder to think how long it would have taken to run on a single machine.

[1] https://dstaley.com/2017/04/30/gigabit-in-baton-rouge/
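
Roughly how such a fan-out can look with boto3. This is a hedged sketch, not the author's actual code; the queue URL and message format are hypothetical, and the Lambda consumer that drains the queue is not shown.

    # Sketch: push every address onto SQS in batches of 10 and let a fleet of
    # Lambda consumers (not shown) pull and check them in parallel.
    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/address-checks"  # hypothetical

    def enqueue(addresses):
        # SQS accepts at most 10 messages per batch call
        for i in range(0, len(addresses), 10):
            batch = addresses[i:i + 10]
            sqs.send_message_batch(
                QueueUrl=QUEUE_URL,
                Entries=[
                    {"Id": str(j), "MessageBody": json.dumps({"address": addr})}
                    for j, addr in enumerate(batch)
                ],
            )

    # enqueue(addresses_to_check)  # workers drain the queue concurrently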