cletus | 9 months ago

So I've worked for Google (and Facebook) and it really drives the point home of just how cheap hardware is and how not worth it optimizing code is most of the time.

More than a decade ago Google had to start managing their resource usage in data centers. Every project has a budget: CPU cores, hard disk space, flash storage, hard disk spindles, memory, etc. These are generally convertible to each other, so you can see the relative cost.

Fun fact: even though at the time flash storage was ~20x the cost of hard disk storage, it was often cheaper net because of the spindle bottleneck.

Anyway, all of these things can be turned into software engineer hours, often called "milli-SWEs", meaning a thousandth of the effort of 1 SWE for 1 year. So projects could save on hardware and hire more people, or hire fewer people but get more hardware, within their current budgets.

I don't remember the exact number of CPU cores amounted to a single SWE but IIRC it was in the thousands. So if you spend 1 SWE year working on optimization across your project and you're not saving 5000 CPU cores, it's a net loss.
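To make the break-even arithmetic concrete, here is a minimal sketch. The 5000 cores per SWE-year rate is only the illustrative figure from the comment above, not a confirmed exchange rate, and the function names are hypothetical.

```python
# Assumption: 1 SWE-year is "worth" 5000 CPU cores, the illustrative
# figure from the comment, not a real internal rate.
CORES_PER_SWE_YEAR = 5000


def cores_to_milli_swe(cores, cores_per_swe_year=CORES_PER_SWE_YEAR):
    """Express a number of cores in milli-SWEs (thousandths of a SWE-year)."""
    return cores * 1000 / cores_per_swe_year


def net_saving_in_cores(swe_years_spent, cores_saved,
                        cores_per_swe_year=CORES_PER_SWE_YEAR):
    """Cores saved minus the core-equivalent cost of the engineering time."""
    return cores_saved - swe_years_spent * cores_per_swe_year


print(cores_to_milli_swe(5))           # 1.0 (5 cores is about 1 milli-SWE)
print(net_saving_in_cores(1, 3000))    # -2000 (net loss)
print(net_saving_in_cores(1, 50000))   # 45000 (clearly worth doing)
```

Under this assumed rate, a year of optimization work only pays off once it frees more than 5000 cores.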

Some projects were incredibly large and used much more than that so optimization made sense. But so often it didn't, particularly when whatever code you wrote would probably get replaced at some point anyway.

The other side of this is that there is (IMHO) a general usability problem with the Web in that it simply shouldn't take the resources it does. If you know people who had to or still do data entry for their jobs, you'll know that the mouse is pretty inefficient. The old terminals from 30-40+ years ago that were text-based had some incredibly efficient interfaces at a tiny fraction of the resource usage.

I had expected that at some point the Web would be "solved" in the sense that there'd be a generally expected technology stack and we'd move on to other problems but it simply hasn't happened. There's still a "framework of the week" and we're still doing dumb things like reimplementing scroll bars in user code that don't work right with the mouse wheel.

I don't know how to solve that problem or even if it will ever be "solved".

mike_hearn|9 months ago

I worked there too and you're talking about performance in terms of optimal usage of CPU on a per-project basis.

Google DID put a ton of effort into two other aspects of performance: latency, and overall machine utilization. Both of these were top-down directives that absorbed a lot of time and attention from thousands of engineers. The salary costs were huge. But, if you're machine constrained you really don't want a lot of cores idling for no reason even if they're individually cheap (because the opportunity cost of waiting on new DC builds is high). And if your usage is very sensitive to latency then it makes sense to shave milliseconds off because of business metrics, not hardware $ savings.

cletus|9 months ago

The key part here is "machine utilization" and absolutely there was a ton of effort put into this. I think before my time servers were allocated to projects, but even early on in my time at Google, Borg had already adopted shared machine usage and there was a whole system of resource quotas implemented via cgroups.

Likewise there have been many optimization projects and they used to call these out at TGIF. No idea if they still do. One I remember was reducing the health checks via UDP for Stubby and given that every single Google product extensively uses Stubby then even a small (5%? I forget) reduction in UDP traffic amounted to 50,000+ cores, which is (and was) absolutely worth doing.

I wouldn't even put latency in the same category as "performance optimization" because often you decrease latency by increasing resource usage. For example, you may send duplicate RPCs and wait for the fastest to reply. That could be doubling or tripling the effort.
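The duplicate-RPC idea can be sketched roughly as follows. This is a toy illustration using Python's asyncio, not Stubby or any real RPC stack: `fake_rpc`, `hedged_call`, and the replica names and delays are all made up for the example.

```python
import asyncio


async def fake_rpc(replica, delay):
    """Stand-in for a real RPC: sleeps for a simulated latency, then replies."""
    await asyncio.sleep(delay)
    return f"reply from {replica}"


async def hedged_call(replica_delays):
    """Send the same request to every replica and return the first reply."""
    tasks = [asyncio.create_task(fake_rpc(name, delay))
             for name, delay in replica_delays.items()]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    # The slower duplicates are wasted work: this is the extra resource
    # cost paid for the lower latency.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


# Replica "b" has the shortest simulated delay, so its reply should win.
result = asyncio.run(hedged_call({"a": 0.3, "b": 0.01, "c": 0.2}))
print(result)
```

Three requests go out for one answer, so latency tracks the fastest replica while the total work is roughly tripled.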

xondono|9 months ago

Except you’re self selecting for a company that has high engineering costs, big fat margins to accommodate expenses like additional hardware, and lots of projects for engineers to work on.

The evaluation needs to happen in the margins, even if it saves pennies/year on the dollar, it’s best to have those engineers doing that than have them idling.

The problem is that almost no one is doing it, because the way we make these decisions has nothing to do with the economic calculus behind them. Most people just do “what Google does”, which explains a lot of the dysfunction.

bjourne|9 months ago

I think the parent's point is that if Google with millions of servers can't make performance optimization worthwhile, then it is very unlikely that a smaller company can. If salaries dominate over compute costs, then minimizing the latter at the expense of the former is counterproductive.

> The evaluation needs to happen in the margins, even if it saves pennies/year on the dollar, it’s best to have those engineers doing that than have them idling.

That's debatable. Performance optimization almost always leads to increased complexity. Doubled performance can easily cause quadrupled complexity. Then one has to consider whether the maintenance burden is worth the extra performance.

arp242|9 months ago

> I don't remember the exact number of CPU cores amounted to a single SWE but IIRC it was in the thousands.

I think this probably holds true for outfits like Google because 1) on their scale "a core" is much cheaper than average, and 2) their salaries are much higher than average. But for your average business, even large businesses? A lot less so.

I think this is a classic "Facebook/Google/Netflix/etc. are in a class of their own and almost none of their practices will work for you"-type thing.

morepork|9 months ago

Maybe not to the same extent, but an AWS EC2 m5.large VM with 2 cores and 8 GB RAM costs ~$500/year (1 year reserved). Even if your engineers are being paid $50k/year, that's the same as 100 VMs or 200 cores + 800 GB RAM.
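As a quick check of that arithmetic, here is a small sketch using only the rough figures quoted above (~$500/year per VM, $50k/year salary); the function name is hypothetical.

```python
def vms_per_salary(salary_per_year, vm_cost_per_year):
    """How many VMs one yearly salary would pay for, at the quoted prices."""
    return salary_per_year // vm_cost_per_year


vms = vms_per_salary(50_000, 500)
# Each VM in the comment has 2 cores and 8 GB RAM.
print(vms, vms * 2, vms * 8)  # 100 200 800
```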

smikhanov|9 months ago

> I don't know how to solve that problem or even if it will ever be "solved".

It will not be “solved” because it’s a non-problem.

You can run a thought experiment imagining an alternative universe where human resources were directed towards optimization, and that alternative universe would look nothing like ours. One extra engineer working on optimization means one less engineer working on features. For what exactly? To save some CPU cycles? Don’t make me laugh.

karmakaze|9 months ago

Google doesn't come up with better compression and binary serialization formats just for fun--it improves their bottom line.