top | item 46996857

hodgehog11 | 17 days ago

Which part was heuristic? This format doesn't lend itself to providing proofs, it isn't exactly a LaTeX environment. Also why does the proof need to be constructive? That seems like an arbitrarily high bar to me. It suggests that you are not even remotely open to the possibility of evidence either.

I also don't think you understand my point of view, and you mistake me for a grifter. Keeping the possibility open is not profitable for me, and it would be much more beneficial to believe what you do.

measurablefunc | 17 days ago

I didn't think you were a grifter, but you only presented heuristics. If you have formal references, you can share them, and people can decide for themselves what to believe based on the evidence presented.

hodgehog11 | 17 days ago

Fine, that's fair. I believe the statement that you made is countered by my claim, which is:

Theorem. For any tolerance epsilon > 0, there exists a transformer neural network of sufficient size whose policy is within epsilon of the policy that optimally achieves arbitrary goals in arbitrary stochastic environments.

Proof (sketch). For any stochastic environment with a given goal, there exists a model that maximizes expected return under this goal (not necessarily unique, but it exists). By Solomonoff's convergence theorem (Theorem 3.19 in [1]), Bayes-optimal predictors under the universal Kolmogorov prior converge to this model as context increases. Consequently, there exists an agent (the AIXI agent) that is Pareto-optimal for arbitrary goals (Theorem 5.23 in [1]). This agent is a sequence-to-sequence map with some mild regularity, so it satisfies the conditions of Theorem 3 in [2]. By that universal approximation theorem (itself proven in Appendices B and C of [2]), there exists a transformer neural network of sufficient size that approximates the AIXI agent to within epsilon.
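To pin down what the final approximation step asserts, here is one hedged way to write it out (the notation is mine, not taken from [1] or [2]; d and H are placeholders for whatever metric and history space the regularity conditions of [2] require):

```latex
% Informal formalization of the approximation claim, under assumed notation:
% pi^* is the AIXI policy, T_theta a transformer with parameters theta,
% H a set of finite interaction histories, d a metric on action distributions.
\[
  \forall \varepsilon > 0 \;\; \exists \theta \; : \;
  \sup_{h \in \mathcal{H}} \; d\bigl( T_{\theta}(h), \, \pi^{*}(h) \bigr)
  < \varepsilon .
\]
```

This is a pure existence statement over theta, which is exactly why the constructive objection below has any force.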

This is effectively the argument made in [3], although I'm not fond of their presentation. Now, practitioners still cry foul because existence doesn't guarantee a procedure to find this particular architecture (this is the constructive bit). This is where the neural scaling laws come in. The trick is to work with a linearization of the network, called the neural tangent kernel; its existence is guaranteed by Theorem 7.2 of [4]. The NTK predictors are also universal and are a subset of the random feature models treated in [5], which derives the neural scaling laws for these models. Extrapolating these laws as per [6] for specific tasks shows that the "floor" is always below human error rates, but this step is still empirical, because it relies on the ill-defined notion of superintelligence as "better than humans in all contexts".
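The extrapolation step can be sketched numerically. This is a toy illustration only: the functional form (a power law in parameter count N plus an irreducible floor) is the shape used in [5]/[6]-style fits, but the data below is synthetic and every constant is made up, not taken from those papers.

```python
# Toy sketch of scaling-law extrapolation: fit L(N) = A * N^(-alpha) + L_inf
# to synthetic loss-vs-parameter-count data and read off the floor L_inf.
# All constants are hypothetical; nothing here is measured from real models.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, A, alpha, L_inf):
    # Power law in model size N with an irreducible floor L_inf.
    return A * N ** (-alpha) + L_inf

rng = np.random.default_rng(0)
N = np.logspace(6, 9, 12)  # hypothetical model sizes: 1e6 .. 1e9 parameters
loss = scaling_law(N, A=11.5, alpha=0.076, L_inf=1.69)  # made-up "truth"
loss += rng.normal(0.0, 0.01, size=N.shape)             # measurement noise

popt, _ = curve_fit(scaling_law, N, loss, p0=(10.0, 0.1, 1.0), maxfev=20000)
A_hat, alpha_hat, floor_hat = popt
print(f"fitted irreducible floor L_inf ~ {floor_hat:.2f}")
```

The argument in the thread then compares the fitted floor against a human error rate for the same task; the comparison, not the fit, is where the empirical (and definitional) uncertainty lives.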

[1] Hutter, M. (2005). Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media.

[2] https://arxiv.org/abs/1912.10077

[3] https://openreview.net/pdf?id=Vib3KtwoWs

[4] https://arxiv.org/abs/2006.14548

[5] https://arxiv.org/abs/2210.16859

[6] https://arxiv.org/abs/2001.08361