
Learning Theory from First Principles [pdf]

279 points| magnio | 2 years ago |di.ens.fr

63 comments

[+] sampo|2 years ago|reply
> 2.5 No free lunch theorem
>
> Although it may be tempting to define the optimal learning algorithm that works optimally for all distributions, this is impossible. In other words, learning is only possible with assumptions.

A mention of the no free lunch theorem should come with a disclaimer that the theorem is not relevant in practice. The assumption that your data originates in the real world is sufficient to ensure the no free lunch theorem is not a hindrance.

This book doesn't discuss this at all. Maybe mention that "all distributions" means a generalization, to higher-dimensional spaces of discontinuous functions (which include the tiny subset of continuous functions), of something like all possible bit sequences generated by tossing a coin. So basically, if your data is generated from a uniform distribution over "all possibilities", you cannot learn to predict the outcome of the next coin tosses, or similar.
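The coin-toss intuition is easy to check numerically. A hypothetical sketch (not from the book): when the labels really are uniform coin flips, any predictor, whatever its inductive bias, hovers at chance level.

```python
import random

random.seed(0)

def majority_predictor(history):
    # A reasonable-looking inductive bias: guess the more frequent outcome so far.
    return 1 if sum(history) * 2 >= len(history) else 0

def constant_predictor(history):
    # A trivial strategy: always guess heads.
    return 1

def score(predictor, n_trials=20000):
    # Fraction of correct next-flip predictions on a truly uniform sequence.
    correct, history = 0, []
    for _ in range(n_trials):
        flip = random.randint(0, 1)  # no structure to exploit
        correct += (predictor(history) == flip)
        history.append(flip)
    return correct / n_trials

print(score(majority_predictor))  # close to 0.5
print(score(constant_predictor))  # close to 0.5
```

Both strategies (and any other) land near 0.5, because with no structure in the source there is nothing for an inductive bias to latch onto.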

[+] nextos|2 years ago|reply
Yes, most no free lunch theorems and results of this kind, which make overly general assumptions, tend to be too pessimistic.

For example, many people naively think that static program analysis is infeasible due to the halting problem, Rice's theorem, etc.

[+] bonoboTP|2 years ago|reply
It's analogous to the problem of induction (Hume). Without assumptions, it's impossible to connect past observations to predictions about the future. Observing that the sun has risen a thousand mornings does not automatically make it any more or less likely that it will rise tomorrow, unless we make some assumptions, for example that events tend to be similar over time. And those assumptions can't be extracted from the data. Maybe things tended to be similar over time in the past, but that says nothing about the future. The pattern might break tomorrow, and to say that it likely won't is simply an assumption on a higher level.

But it seems like science is "doing fine" despite this problem. Similarly machine learning chugs along fine, because people use their experience and prior knowledge when designing the algorithms. These assumptions are also called inductive biases. They are biasing the learning towards certain patterns (like "things tend to be similar locally").

[+] magnio|2 years ago|reply
I don't see how your disclaimer applies. My interpretation of no free lunch theorems is that no single algorithm works well for all classes of problems, not that some problems are unlearnable. The example in its proof might be contrived, but in actuality, additional assumptions can and do lead to different algorithms being picked, no?
[+] jabowery|2 years ago|reply
Learning theory is the attempt to formalize natural science up to decision. Natural science's unstated assumption is that a sufficiently sophisticated algorithmic world model can be used to predict future observations from past observations. Since this is the same assumption as Solomonoff's assumption in his proof of inductive inference, you have to start there: with Turing complete coding rather than Rissanen's so-called "universal" coding.

It's ok* to depart from that starting point in creating subtheories but if you don't start there you'll end up with garbage like the last 50 years of confusion over what "The Minimum Description Length Principle" really means.

*It is, however, _not_ "ok" if what you are trying to do is come up with causal models. You can't get away from Turing-complete codes if you're trying to model dynamical systems, even though dynamical systems can be thought of as finite state machines with very large numbers of states. In order to make optimally compact codes, you need Turing-complete semantics that execute on a finite state machine that just so happens to have a really large but finite number of flipflops, or some other directed cyclic graph of universal gates (e.g. NOR, NAND).
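The link between compact codes and structure can be illustrated with an off-the-shelf compressor. This is a loose sketch, not jabowery's proposal: zlib is merely a computable stand-in for a Turing-complete code, so it only upper-bounds description length, but it already separates algorithmically simple data from coin-flip data.

```python
import random
import zlib

random.seed(0)

# Two sequences of equal raw length: one is the output of a tiny program,
# the other is "typical" under a uniform distribution.
structured = ("01" * 5000).encode()
noise = bytes(random.getrandbits(8) for _ in range(10000))

len_structured = len(zlib.compress(structured, 9))
len_noise = len(zlib.compress(noise, 9))

# The structured sequence compresses to a tiny fraction of its raw length;
# the random bytes barely compress at all.
print(len(structured), len_structured)
print(len(noise), len_noise)
```

A true Turing-complete code would do strictly better on any computable structure zlib misses, which is exactly why Solomonoff-style arguments start there.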

[+] canjobear|2 years ago|reply
Have they figured out what causes double descent yet?
[+] tel|2 years ago|reply
I don't know if it's a generalized result, but the Circuits team at Anthropic has a very compelling thesis: the first phase of descent corresponds to the model memorizing data points, the second phase corresponds to it shifting geometrically toward learning "features".

Here a "feature" might be seen as a direction in an abstract, very high-dimensional vector space. The team is pretty deep in investigating the idea of superposition, where individual neurons encode multiple concepts. They experiment with a toy model and toy data set where the latent features are represented explicitly and then compressed into a small set of data dimensions. This forces superposition. Then they show how that superposition looks under varying sizes of training data.

It's obviously a toy model, but it's a compelling idea. At least for any model which might suffer from superposition.

https://transformer-circuits.pub/2023/toy-double-descent/ind...
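The geometric side of superposition can be sketched without training anything. This is a hypothetical illustration, not the trained toy model from the linked paper: packing more random unit "feature" vectors than there are dimensions forces large overlaps (interference) between them.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_interference(n_features, n_dims):
    # Pack n_features random unit vectors into an n_dims-dimensional space
    # and measure the worst-case overlap between distinct features.
    W = rng.normal(size=(n_features, n_dims))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    G = W @ W.T                  # Gram matrix of pairwise dot products
    np.fill_diagonal(G, 0.0)     # ignore each feature's overlap with itself
    return float(np.abs(G).max())

print(max_interference(5, 100))   # plenty of room: overlaps stay small
print(max_interference(100, 5))   # superposition: large overlaps are unavoidable
```

With fewer features than dimensions the features can be nearly orthogonal; with many more features than dimensions, some pairs must interfere strongly, which is the pressure the toy model's training has to trade off.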

[+] iaseiadit|2 years ago|reply
Not an expert, but this paper explores double descent with simple models. The interpretation there: extending into the overparameterized regime permits optimization towards small-norm weights, which generalize well again. Does that explain DD generally? Does it apply to other models (e.g. DNNs)?

https://arxiv.org/pdf/2303.14151.pdf

[+] a_wild_dandan|2 years ago|reply
No. We don't know. My favorite hypothesis: SGD is... well, stochastic. Meaning you're not optimizing w.r.t. the training corpus but w.r.t. a tiny subset, so your gradient isn't quite right. Over-training allows you to bulldoze over local optima and converge toward the true distribution rather than drive around a local over-fitting basin.
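The "gradient isn't quite right" part is easy to see numerically. A hypothetical sketch, assuming a plain least-squares loss: the minibatch gradient is a noisy estimate of the full-batch gradient, and the noise shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)  # some current parameter value

def grad(idx):
    # Gradient of the mean squared error over the rows in idx.
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(n))
for batch in [10, 100, 1000]:
    errs = [np.linalg.norm(grad(rng.choice(n, batch, replace=False)) - full)
            for _ in range(200)]
    # average deviation from the full gradient shrinks roughly as 1/sqrt(batch)
    print(batch, float(np.mean(errs)))
```

That gradient noise is exactly what the hypothesis credits with kicking the iterate out of narrow over-fitting basins.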
[+] da39a3ee|2 years ago|reply
There are so many great mathematical PDFs available for free on the Internet, written by academics/educators/engineers. A problem is that there is a huge amount of overlap. I wonder if an AI model could be developed that would do a really good job of synthesizing an overlapping collection into a coherent single PDF without duplication.
[+] hef19898|2 years ago|reply
One could also just pick the books used in the corresponding university courses.
[+] nerdponx|2 years ago|reply
No need for an AI model. Probabilistic Machine Learning by Murphy is an excellent reference and resource.
[+] ogogmad|2 years ago|reply
Minor nitpick: The title is confusing. The first word should be substituted with "Machine-Learning" or "Statistical-Learning". Ideally, the author of the piece should do that at some point.
[+] throwaway81523|2 years ago|reply
This is pretty hard to read. For example, on the first page of chapter 1, it talks about "minimization of quadratic forms" and shows what looks like the formula for linear least squares. Is that right? It doesn't say anything about this. Some more exposition would help.

I do like that there are lots of exercises.

[+] guimplen|2 years ago|reply
I think the text is geared towards people with some mathematical background who want to understand learning theory. Besides, it is clearly stated that this chapter is a review (so it's assumed that you learned, or will learn, these things elsewhere).
[+] dawnofdusk|2 years ago|reply
First principles doesn't mean easy to read, unfortunately.
[+] nerdponx|2 years ago|reply
The sibling comment is right in that this is clearly not intended for first timers.

But your instincts are correct here. When you write out the objective function for ordinary least squares, it turns out to be a quadratic form. The choice of the word "quadratic" here is not a coincidence: it is the generalization of quadratic functions to matrices. That section covers the vector equivalent of minimizing quadratic functions.
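This can be made concrete in a few lines. A small sketch (the dimensions and data are arbitrary): the OLS objective expands into a quadratic form, and setting its gradient to zero gives the normal equations, whose solution matches an off-the-shelf least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# The OLS objective ||y - Xw||^2 expands to the quadratic form
#   w^T (X^T X) w - 2 (X^T y)^T w + y^T y,
# the vector analogue of minimizing a*w^2 + b*w + c in one variable.
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

A = X.T @ X   # plays the role of the scalar coefficient "a"
b = X.T @ y

w_normal = np.linalg.solve(A, b)                  # gradient = 0: A w = b
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]    # direct least-squares solve

print(np.allclose(w_normal, w_lstsq))             # same minimizer either way
```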

[+] garydevenay|2 years ago|reply
Certainly doesn't seem like first principles...
[+] wenc|2 years ago|reply
Least squares is quadratic.

Quadratic means square terms.

[+] noonething|2 years ago|reply
Fascinating, does anyone have any good books on this?
[+] p1esk|2 years ago|reply
I can’t wait until I can tell GPT-5: “I have this idea I want to try; read this book and tell me if there’s anything relevant there to make it work better.”
[+] ralphist|2 years ago|reply
I've never heard of an LLM making up a new idea. Shouldn't this only work if your thing has already been tried before?
[+] erbdex|2 years ago|reply
I am realising that passing context for $this is the tricky part, as:

1. It is very difficult for me to tell you about my context as a user within low-dimensional variables.

2. I do not understand my own situation in the universe well enough to be able to tell the AI.

3. I don't have a shared vocabulary with AI. The internet, I feel, aced this with the shared HTTP protocol to consistently share agreed-upon state. For example, within Uber I am in a very narrow request-response universe with.. POST phone, car, gps(a,b,c,d), now, payment.

But as a student wanting to learn algorithms, how do I pass that I'm $age, $internet-type, from $place, prefer graphical explanations of algorithms, have tried but gotten scared of that thick book, have hit these $milestones-cs50, and know $python up to $proficiency (which again is a fractal variable, with research papers on how to define it for learning)?

Similarly, how do I help you understand what stage my startup idea is at beyond "low traction": that I want to know which $networks (VC, devs, sales) APIs I have, which $partnerships have succeeded, with evidence like $attendance and $sales. Who should I speak to? Could you please write the needed emails and engage in partnerships with other bots under $budget?

Even in the real world this vocabulary is in smaller pockets as our contexts are too different.

4. Learning assumes knowledge exists as a global forever variable in a wider-than-we-understand universe. $meteor being a non-maskable interrupt to the power supply at unicorn temperatures in a decade. Similarly, one-time trends in disposable $companies that $ecosystem uses to learn. My being in a desert village with absent electricity might mean those machines never reach me, and perhaps most people in the world don't have even a basic phone to be able to share state. Their local power-mafia politics and absent governance might mean the PDF the AI recommends I read might or might not help.

I don't know how this will evolve, but thinking of the possibilities has been so interesting. It's like computers can talk to us easily and they're such smart babies on day 1, and "folks, we aren't able to put right, enough, cheap data in" is perhaps the real bottleneck to how much usefulness we are able to uncover.

[+] hef19898|2 years ago|reply
Your idea would work even better if you read the book yourself... You might even learn something new along the way.