Too much efficiency makes everything worse (2022)

[+] refibrillator|1 year ago|reply

I recognize the author Jascha as an incredibly brilliant ML researcher, formerly at Google Brain and now at Anthropic.

Among his notable accomplishments, he and coauthors mathematically characterized the propagation of signals through deep neural networks via techniques from physics and statistics (mean field and free probability theory). Leading to arguably some of the most profound yet under-appreciated theoretical and experimental results in ML in the past decade. For example see “dynamical isometry” [1] and the evolution of those ideas which were instrumental in achieving convergence in very deep transformer models [2].

After reading this post and the examples given, in my eyes there is no question that this guy has an extraordinary intuition for optimization, spanning beyond the boundaries of ML and across the fabric of modern society.

We ought to recognize his technical background and raise this discussion above quibbles about semantics and definitions.

Let’s address the heart of his message, the very human and empathetic call to action that stands in the shadow of rapid technological progress:

> If you are a scientist looking for research ideas which are pro-social, and have the potential to create a whole new field, you should consider building formal (mathematical) bridges between results on overfitting in machine learning, and problems in economics, political science, management science, operations research, and elsewhere.

[1] Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

http://proceedings.mlr.press/v80/xiao18a/xiao18a.pdf

[2] ReZero is All You Need: Fast Convergence at Large Depth

https://arxiv.org/pdf/2003.04887

[+] tablatom|1 year ago|reply

Interesting timing for me! Just a couple of days ago I discovered the work of biologist Olivier Hamant who has been raising exactly this issue. His main thesis is that very high performance (which he defines as efficacy towards a known goal plus efficiency) and very high robustness (the ability to withstand large fluctuations in the system) are physically incompatible. Examples abound in nature. Contrary to common perception evolution does not optimise for high performance but high robustness. Giving priority to performance may have made sense in a world of abundant resources, but we are now facing a very different period where instability is the norm. We must (and will be forced to) backtrack on performance in order to become robust. It’s the freshest and most interesting take on the poly-crisis that I’ve seen in a long time.

https://books.google.co.uk/books/about/Tracts_N_50_Antidote_...

[+] salawat|1 year ago|reply

>> If you are a scientist looking for research ideas which are pro-social, and have the potential to create a whole new field, you should consider building formal (mathematical) bridges between results on overfitting in machine learning, and problems in economics, political science, management science, operations research, and elsewhere.

Translation to laymen: ML is being analogized to the mathematical structure of signaling between entities and institutions in society.

Mathematician proposes problem that plagues one (overfitting in ML, the phenomena by which a neural network's ability to generalize is negatively impacted by overtraining so the functions it can emulate are tightly coupled to the training data), must plague the other.

In short, there must be a breakdown point at which overdevelopment of societal systems or signaling between them makes things simply worse.

I personally think all one need do is look at what would happen if every system were perfectly complied with to see we may already be well beyond that breakpoint in several industrial verticals.

[+] thomasahle|1 year ago|reply

I love the idea of ReZero, basically using a trainable parameter, alpha, in residual layers like this:

  Deep Network                  | xi+1 = F(xi)                 
  Residual Network              | xi+1 = xi + F(xi)            
  Deep Network + Norm           | xi+1 = Norm(F(xi))           
  Residual Network + Pre-Norm   | xi+1 = xi + F(Norm(xi))      
  Residual Network + Post-Norm  | xi+1 = Norm(xi + F(xi))      
  ReZero                        | xi+1 = xi + αi F(xi)

However, I haven't actually seen this used in practice. The papers we have on Gemma and Llama all still seem to be using layer norms.

Am I missing something?

[+] lubujackson|1 year ago|reply

The exciting thing about this idea is if you can correlate, say, economics with the works of ML, that means a computer program which you can run, revise and alter can directly give you measurable data about these complex system interactions that mostly have existed as a platonic idea since reality is too nuanced and multiple to validate concepts formally.

With the idea that there is some subset of logic that sits below economics that is provable and exact. That is a powerful idea worth pursuing!

[+] throw10920|1 year ago|reply

This is a really manipulative way to categorically hand-wave away objections without actually responding to their content, in addition to having several logical fallacies (such as the appeal to emotion and the argument from authority). This is not in the spirit of intellectual curiosity that HN is for.

[+] mrfox321|1 year ago|reply

More importantly, he invented diffusion models:

http://proceedings.mlr.press/v37/sohl-dickstein15.pdf

[+] RGamma|1 year ago|reply

Brilliant enough to know he's helping build another atom bomb (presumably for peanuts)? And the nuclear briefcase is gonna be controlled by the ultrarich.

[+] LarsDu88|1 year ago|reply

Adding to my reading list!

[+] t_mann|1 year ago|reply

The argument rides on the well-known Goodhart's law (when a measure becomes a target, it ceases to be a good measure). However, it only puts it down to measurement problems, as in, we can't measure the things we really care about, so we optimize some proxies.

That, in my view, is a far too reductionist view of the problem. The problem isn't just about measurement, it's about human behavior. Unlike particles, humans will actively seek to exploit any control system you've set up. This problem goes much deeper than just not being able to measure "peace, love, puppies" well. There's a similar adage called Campbell's law [0] that I think captures this better than the classic formulation of Goodhart's law:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

The mitigants proposed (regularization, early stopping) address this indirectly at best and at worst may introduce new quirks that can be exploited through undesired behavior.

[0] https://en.wikipedia.org/wiki/Campbell%27s_law

[+] layer8|1 year ago|reply

> Unlike particles, humans will actively seek to exploit any control system you've set up.

But that’s only possible because the control system doesn’t exactly (and only) control what we want it to control. The control system is only an imperfect proxy for what we really want, in a very similar way as the measure in Goodhart’s law.

Another variation of that is the law of unintended consequences [0]. There is probably a generalized computational or complex-systems version of it that we haven’t discovered yet.

[0] https://www.sas.upenn.edu/~haroldfs/540/handouts/french/unin...

[+] Edman274|1 year ago|reply

> Unlike particles, humans will actively seek to exploit any control system you've set up.

Well, agents will. If you created a genetic algorithm for an AI agent whose reward function was the amount of dead cobras it got from Delhi, I feel like you'd quickly find that the best performing agent was the one that started breeding cobras. In the human case and in the AI case the reward function has been hacked. In the AI case we decide that the reward function wasn't designed well, but in the human case we decide that the agents are sneaky petes who have a low moral character and "exploited" the system.

[+] EasyMark|1 year ago|reply

I think a big portion of that is humans don’t like to be viewed only as numbers and will rebel and manipulate any system you try to put the thumbscrews to them with. So the quote to mean rings golden and isn’t fallible to much of an extent

[+] netcan|1 year ago|reply

This is true, these "laws" are approximations and imperfect reductions.

Which one is useful or descriptive will depend on the specific example.

Optimizing ML VS Optimizing a social media algorithm VS using standardized testing to optimize education systems.

There is no perfect abstraction that applies to these different scenarios precisely. We don't need that precision. We just need the subsequent intuition about where these things will go wrong.

[+] whizzter|1 year ago|reply

This has become a societal problem in Sweden during the past 20 or so years.

1: Healthcare efficiency is measured by "completed tasks" by primary care doctors, the apparatus is optimized for them handling simple cases and they thus often do some superficial checking and either send one home with some statistically correct medicine (aspirin/antibiotics) or punt away cases to a specialized doctor if it appears to be something more complicated.

The problem is that since there's now fewer of them (efficient) they've more or less assembly line workers and have totally lost the personal "touch" with patients that would give them an indication on when something is wrong. Thus cancers,etc are very often diagnosed too late so even if specialized cancer care is better, it's often too late to do anything anyhow.

2: The railway system was privatized, considering the amount of cargo shipped it's probably been a huge success but the system is plagued by delays due to little gaps in the system to allow late trains to speed up or to even do more than basic maintenance (leading to bigger issues).

[+] ksec|1 year ago|reply

There are examples everywhere. To quote Steve Jobs;

>When a company grows big enough, they want to replicate their initial success. They all thought about the "process" of how the first success was created. So they replicate those "process" across the company. And before very long the people confused that the process was the content.

And you can fit that from small companies to the world's largest government. Most of them forgot about their content.

[+] EasyMark|1 year ago|reply

I wish these were the biggest problems facing US train and healthcare industries.

[+] remram|1 year ago|reply

Those are great points! Another related law is from queuing theory: waiting time goes to infinity when utilization approaches 100%. You need your processes/machines/engineers to have some slack otherwise some tasks will wait forever.

[+] toasterlovin|1 year ago|reply

I’m remembering reading once that cities are incredibly efficient in how they use resources (compared to the suburbs and rural areas, I guess), and, in light of your comment about waiting time, I’m realizing why now why they’re so unpleasant: constant resource contention.

[+] georgeburdell|1 year ago|reply

Yep, I used to work in a factory. Target utilization at planning time was 80%. If you over-predict your utilization, you waste money. If you under-predict, a giant queue of “not important” stuff starts to develop

[+] eru|1 year ago|reply

You can add a measure of robustness to your optimization criteria. You can explicitly optimise for having enough slack in your utilisation to handle these unforeseen circumstances.

For example, you can assign priorities to the loads on your systems, so that you can shed lower priority loads to create some slack for emergencies, without having to run your system idle under during lulls.

I get what the article is trying to say, but they shouldn't write off optimisation as easily as that.

[+] vishnugupta|1 year ago|reply

I feel that a 100% efficient system is not resilient. Even minor disruptions in subsystems lead to major breakdowns.

There’s no room to absorb shocks. We saw a drastic version of this during COVID-19 induced supply chain collapse. Car manufacturers had built near 100% just in time manufacturing that they couldn’t absorb chip shortages and it took them years to get back up.

It also leaves no room for experimentation. Whatever experiment can only happen outside a system not from within it.

[+] I_AM_A_SMURF|1 year ago|reply

That tracks. I worked at a lot of places/teams where anything but a P0 was something that would never be done.

[+] robertclaus|1 year ago|reply

Interesting. My gut reaction is that this is true in reverse: infinite wait time leads to 100% utilization. However, I feel like you can also have 100% utilization with any queue length if input=output. Is that theory just a result of a first order approximation or am I missing something?

[+] appendix-rock|1 year ago|reply

For some it may go without saying, but for the uninitiated, y’all should be reading https://en.wikipedia.org/wiki/The_Goal_(novel)

[+] amelius|1 year ago|reply

Slack __or__ lower priority tasks.

[+] netcan|1 year ago|reply

Another example of this approximate law is in exercise physiology.

To a normal person, there are a lot of good proxy indicators of fitness. You could train sprinting. You could hop up and down. Squat. Clean and jerk.. etc.

Running faster,hopping higher, squatting heavier... all indicators of increasing fitness... and success of your fitness training.

Two points:

1 - The more general your training methodology, the more meaningful the indicators. Ie, if your fitness measure is "can I push a car uphill," and your training method is sprinting and swimming... pushing a heavier car is a really strong indicator of success. If your training method is "practice pushing a car," then an equivalent improvement does not indicate equivalent improvement in fitness.

2- As an athlete (say clean and jerk) becomes more specialized... improvements in performance become less indicative of general fitness. Going from zero to "recreational weighlifter" involves getting generally stronger and musclier. Going from college to olympic level... that typically involves highly specialized fitness attributes that don't cross into other endeavors.

Another metaphor might be "base vs peak" fitness, from sports. Accidentaly training for (unsustainable) peak performance is another over-optimization pitfall. It can happen when someone blindly follows "line go up." Illusary optimizations are actually just trapping you in a local maxima.

I think there are a lot of analogies here to biology, but also ML optimization and social phenomenon.

[+] bob1029|1 year ago|reply

Clean & jerk is one of those movements that I would almost consider "complete". Especially if you are going to mix in variants of the squat.

Not sure these are the best example. I don't know of anyone who can C&J more than their body weight for multiple repetitions who isn't also an absolute terminator at most other meaningful aspects of human fitness.

Human body is one machine. Hormonal responses are global. Endurance/strength is a spectrum but the whole body goes along for the ride.

[+] naasking|1 year ago|reply

I think that's more an indication that "general fitness" is not a rigorous metric. It's fine as a vague notion of "physical ability" up to a point, and past that it loses meaning because improvements in ability are task-specific and no longer translate to other tasks.

[+] bilsbie|1 year ago|reply

This is why I don’t like focusing on GDP. I think a quarterly poll on life satisfaction and optimism would be a better measure.

If you’re curious about GDP. I my car breaks and I get it fixed, that adds to GDP.

If a parent stays home to raise kids, that lowers GDP. If I clean my own house that lowers GDP. Etc.

Unemployment is another crude metric. Are these jobs people want or do they feel forced to work bad jobs.

[+] jebarker|1 year ago|reply

I'm not really disagreeing (as GDP is a crude measure), but rather thinking out loud. I don't think my individual life satisfaction and optimism should be influenced by nation-state economics to the extent that that's what they're optimizing for. The job of my government is to create the conditions for security, prosperity and opportunity without oppressing the rest of the world or destroying the planet. But it's up to me to find a satisfying life within that and that is possible within drastically different economic and social structures. Similarly, there's probably not a set of conditions that gives universal satisfaction to all citizens, so what summary statistics of life satisfaction and optimism do we optimize for?

[+] klysm|1 year ago|reply

The point is it doesn’t matter what you measure

[+] vladms|1 year ago|reply

I find ironic that we are talking about ML where we have vectors of thousands of quantities and then we go to measure social/economic stuff with one (or a couple of numbers).

The general discourse (news, politicians, forums, etc.) over a couple of measures will always be highly simplifying. The discourse over thousands of measures will be too complex to communicate easily.

I hope that at some point most people will acknowledge implicitly that the fewer the number of measures the more probable is that it is a simplification that hides stuff. (ex: "X is a billionaire, means his smart"; "country X has high GDP means it's better than country Y with less GDP" and so forth).

[+] LarsDu88|1 year ago|reply

I was trying to remember where I remember where I heard of this author's name before.

Invented the first generative diffusion model in 2015. https://arxiv.org/abs/1503.03585

[+] redsparrow|1 year ago|reply

This makes me think of going to chain restaurants. Everything has been focus-grouped and optimized and feels exactly like an overfit proxy for an enjoyable meal. I feel like I'm in a bald-faced machine that is optimized to generate profit from my visit. The fact that it's a restaurant feels almost incidental.

"HI! My name is Tracy! I'm going to be your server this evening!" as she flawlessly writes her name upside down in crayon on the paper tablecloth. Woah. I think this place needs to re-calibrate their flair.

[+] usaphp|1 year ago|reply

I think it also applies to when managers try to overoptimize work process, in the end creative people lose interest and work becomes unbearable...little chaos is necessary in a work place/life imo...

[+] thomassmith65|1 year ago|reply

I noticed an example of this rule at my local hardware superstore.

Around a decade ago, the store installed anti-theft cages.

At first they only kept high-dollar items in the cages. It was a bit of an inconvenience, but not so bad. If a customer is dropping $200+ on some fancy power tool, he or she likely doesn't mind waiting five minutes.

But a few years later, there was a change - almost certainly a 'data-driven' change: suddenly there was no discernible logic to which items they caged and which they left uncaged. Now a $500 diagnostics tool is as likely to sit open on a shelf, as a $5 light bulb to be kept under lock and key.

Presumably the change is a result of sorting a database by 'shrinkage' - they lock up the items that cumulatively lose the hardware store the most money, due to theft.

But the result is (a) the store atmosphere reads as "so profit-driven they don't trust the customers not to steal a box of toothpicks" and (b) it's often not worth it for customers to shop there due to the waiting around for an attendant to unlock the cage.

I doubt the optimization helped their bottom-line, even if it has prevented the theft of some $3 bars of soap.

[+] jrochkind1|1 year ago|reply

Calling this the "strong version of Goodhart's law" was immediately brain-expanding for me.

I have been thinking of goodhart's law a lot, but realized I had been leaning toward focusing on human reaction to the metric as the cause of it; but this reminded me it's actually fundamentally about the fact that any metric itself is inherently not an exact representation of the quality you wish to represent.

And that this may, as OP argues, make goodhart's law fundamental to pretty much any metric used as a target. Independently of how well-intentioned any actors. It's not a result of like human laziness or greed or competing interests, it's an epistemic (?) result of the neccesary imperfection of metric validity.

This makes some of OP's more contentious "Inject noise into the system" and "early stopping" ideas more interesting even for social targets.

"The more our social systems break due to the strong version of Goodhart's law, the less we will be able to take the concerted rational action required to fix them."

Well, that's terrifying.

[+] raister|1 year ago|reply

This reminds me of Eli Goldratt's quote: "Tell me how you measure me, I will tell you how I behave."

[+] inglor_cz|1 year ago|reply

When it comes to his Mitigation: Inject noise into the system. proposal: I would be happy to experiment with some sortition in our political systems. Citizens' assemblies et cetera.

Randomly chosen deliberative bodies could keep some of the stupid polarization in check, especially if your chances to be chosen twice into the same body are infinetesimal.

https://en.wikipedia.org/wiki/Sortition

We tend to consider "democracy" as fundamentally equivalent to "free and fair elections", but sortition would be another democratic mechanism that could complement our voting systems. Arguably more democratic, as you need money and a support structure to have a chance to win an election.

[+] orcim|1 year ago|reply

It's an effect that exists, but the examples aren't accurate.

An overemphasis on grades isn't from wanting to educate the population; obesity isn't from prioritizing nutrient-rich food; and increased inequality isn't from wanting to distribute resources based on the needs of society.

Living a well-lived life through culture, cooking, or exercise doesn't make you more susceptible to sensationalism, addiction, or gambling. It's a lack of stimulus that makes you reach for those things.

You can argue that academia enables rankings, industrial food production enables producing empty calories, and economic development enables greater inequality. But that isn't causation.

It also isn't a side effect when significant resources specifically go into promoting education as a private matter best used to educate the elite, that businesses aren't responsible for the externalities they cause, and that resources should be privately controlled.

In many ways, it is far easier to have more public education, heavily tax substances like sugar, and redistribute wealth than it is to do anything else. That just isn't the goal. It used to be hard to get a good education, good food, and a good standard of living. And it still is. For the same reasons.

[+] dooglius|1 year ago|reply

Overfitting may be a special case of Goodhart's Law, but I don't think Goodhart's Law in general is the same as overfitting, so I don't think the conclusion is well-supported supported in general; there may be plenty of instances of proxy measures that do not have issues.

I'll also quibble with the example of obesity: the proxy isn't nutrient-rich good, but rather the evaluation function of human taste buds (e.g. sugar detection). The problem is the abundance of food that is very nutrient-poor but stimulating to taste buds. If the food that's widely available were nutrient-rich, it's questionable whether we would have an obesity epidemic.

[+] feyman_r|1 year ago|reply

We realize now or at least in recent past, the value of true nutrient-rich food or a balanced diet.

Carbohydrate abundance was likely important in moving people out of hunger and poverty but excesses of the same kind of diet are a reflection on obesity.

My guess is that calorie-per-gram-per-dollar of carbohydrates is still lower than fat and protein.

[+] unknown|1 year ago|reply

[deleted]

[+] dangitman|1 year ago|reply

[deleted]

[+] rowanG077|1 year ago|reply

I don't think it's unintuitive at all. 100% optimized means 100% without slack. No slack means any hitch at all will destroy you.

[+] abernard1|1 year ago|reply

The author identifies problems with a system measuring targets, but then all the proposals are about increasing the power and control of the system.

Perhaps the answer—as hippy sounding as it is—is to reduce the control of the system outright. Instead of adding more measures, more controls, which are susceptible to the prejudices of control, we let the system fall where it may.

This, to me, is a classic post of an academic understanding the failures of a system (and people like themselves in control of said system) but then not allowing the mitigation mechanisms of alternate systems to take its place.

This is one of the reasons I come to HN: to view the prime instigators of big-M Modern failure and their inability to recognize their contributions to that problem.

[+] projektfu|1 year ago|reply

And that's leaving out Jevon's paradox, where increasing efficiency in the use of some scarce resource sometimes/often increases its consumption, by making the unit price of the dependent thing affordable and increasing its demand. For example, gasoline has limited demand if it requires ten liters to go one km, but very high demand at 1 L/10km, even at the same price per liter.

[+] whatever1|1 year ago|reply

When we optimize we typically have a specific scenario in our head. With the proper tools one can probably make the mathematically optimal decisions to deal with this exact scenario.

However: 1) This exact scenario will likely never materialize 2) You have not good quantification of the scenario anyway due to noise/biases in measurements.

So now you optimized for something very specific, and the nature throws you something slightly different and you are completely screwed because your optimized solve is not flexible at all.

That is why a more “suboptimal” approach is typically better and why our stupid brains outperform super fancy computers and algorithms in planning.

[+] smokel|1 year ago|reply

I was listening to an episode of the "inControl" podcast [1], in which Ben Recht suggested that overfitting is not always well understood.

Perhaps it is interesting to read his blogpost "Machine Learning has a validity problem" alongside this article.

[1] https://www.incontrolpodcast.com/

[2] https://archives.argmin.net/2022/03/15/external-validity/

360 comments