cgadski's comments
cgadski | 6 days ago | on: I Simulated 38,612 Countryle Games to Find the Best Strategy
cgadski | 2 months ago | on: Ask HN: Share your personal website
cgadski | 6 months ago | on: Language models pack billions of concepts into 12k dimensions
C is meant to be the smallest constant so that, for each (N, k, epsilon) with k > C epsilon^-2 log N and epsilon > 0, there exists some arrangement of N unit vectors in R^k with all pairwise inner products smaller than epsilon in absolute value. For each (N, k), the author optimized epsilon and reported the empirical value k epsilon^2 / log N, which is the smallest value of C for which our condition holds restricted to the given values of (N, k). (Of course, this is my good-faith interpretation---the article introduces C in the context of a JL distortion bound, and it takes an extra step to turn that into a bound on inner products.)
It can be shown that C = 4 satisfies this condition, when log is the natural log. See [1], for example. Based on the graph, the article claims to do better: "for sufficiently large spaces," it says we can put C = 0.2. This would be a very significant improvement.
For k = 2, the graph shows that C will be lower than 0.2 for sufficiently large N. (The color scheme is confusing! The line for k = 2 is the one that starts at just under 0.8 when N = 1.) Already for k = 3, the graph doesn't give us reason to believe it will be lower than 0.2---you correctly observed it gets to around 0.3. For larger values of k, the graph doesn't seem to show what we can expect for large N: the curves go up, but do not come down. This is what I meant with my comment: the conclusion that C <= 0.2 as N -> infinity is only justified by the behavior at k = 2.
Now, do these results make sense? In the case k = 2, we're trying to put a large number (N) of vectors on the unit circle, and thinking about how small the maximum inner product (epsilon) between any pair of vectors can be. As N -> infinity, the vectors will be packed very tightly and the maximum inner product epsilon will come close to 1. Overall, C = k epsilon^2 / log N will become arbitrarily small. In fact, the same happens for every k.
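To make that concrete, here's a minimal sketch of how the reported value behaves for k = 2 (my own construction---vectors spread evenly over a half circle---not the article's experiment):

    import numpy as np

    # N unit vectors in R^2, spread evenly over a half circle so that no two
    # are parallel or antiparallel; the largest |inner product| is cos(pi/N).
    for N in [10, 100, 30_000, 1_000_000]:
        eps = np.cos(np.pi / N)
        C = 2 * eps**2 / np.log(N)   # k = 2
        print(N, round(eps, 4), round(C, 4))
    # C crosses 0.2 around N ~ 30,000 and keeps drifting toward 0, even though eps -> 1.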
So, just in connection to this graph, the article makes three mistakes:
1) The article's interpretation of its experiment is wrong: the graph alone doesn't show that C < 0.2 for "large spaces".
2) However, it should be obvious a priori that, for all values of k, the reported values of C should converge to 0 for large N (albeit very slowly, at a rate of 1/log N).
3) Unfortunately, this doesn't tell us anything about the minimum value of k / log(N) for a given epsilon, and so it doesn't support the conclusion of the article.
The problem with this kind of LLM-driven article is that it gives uncareful work the _appearance_ of careful work but none of the other qualities that usually come with care.
cgadski | 6 months ago | on: Language models pack billions of concepts into 12k dimensions
I think the first chapter of [1] is a good introduction to general facts about high-dimensional stuff. I think this is where I first learned about "high-dimensional oranges" and so on.
For something more specifically about the problem of "packing data into a vector" in the context of deep learning, last year I wrote a blog post meant to give some exposition [2].
One really nice approach to this general subject is to think in terms of information theory. For example, take the fact that, for a fixed epsilon > 0, we can find exp(C d) vectors in R^d with all pairwise inner products smaller than epsilon in absolute value. (Here C is some constant depending on epsilon.) People usually find this surprising geometrically. But now, say you want to communicate a symbol by transmitting d numbers through a Gaussian channel. Information theory says that, on average, I should be able to use these d numbers to transmit C d nats of information. (C is called the channel capacity, and depends on the magnitude of the noise and e.g. the range of values I can transmit.) The statement that there exist exp(C d) vectors with small inner products is related to a certain simple protocol to transmit a symbol from an alphabet of size exp(C d) with small error rate. (I'm being quite informal with the constants C.)
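If you'd rather see the geometric fact numerically than through coding theory, a quick Monte Carlo sketch (random unit vectors, not an optimized arrangement) already shows the flavor:

    import numpy as np

    # Pairwise inner products of random unit vectors in R^d concentrate around 0
    # with standard deviation about 1/sqrt(d), so very many of them fit under a
    # fixed threshold epsilon.
    rng = np.random.default_rng(0)
    d, N = 512, 2000
    V = rng.standard_normal((N, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)
    print(G.max())  # typically around 0.25 for these sizes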
[1] https://people.math.ethz.ch/~abandeira//BandeiraSingerStrohm... [2] https://cgad.ski/blog/when-numbers-are-bits.html
cgadski | 6 months ago | on: Language models pack billions of concepts into 12k dimensions
If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2---that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,288 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
cgadski | 6 months ago | on: SpikingBrain 7B – More efficient than classic LLMs
> Our architectural choices are closely aligned with principles observed in biological brains.
How? They point out three design choices: linear attention, MoE layers, and spike coding.
Apparently linear attention is brain-inspired because it can be viewed as a "simplified abstraction of dendritic dynamics with multi-branch morphology." Who knows what that means exactly [1]. They don't discuss it further. MoE layers apparently reflect "a principle of modular specialization." Fine, whatever.
Now, using a dozen attention variants + MoE is bog standard. The real novelty would be spike coding. Page 11 is dedicated to the different ways they could turn signals into spike trains, including such biologically-inspired mechanisms as using two's complement. However, they don't actually do spike coding in a time domain. In their implementation, "spike coding" apparently means to turn activations into integers. Section 3.3.3 claims that this lets us simulate an underlying spiking neural network, so we can validate the spiking approach without using special hardware. But if your SNN can be simulated faithfully on a GPU by turning things into integers, isn't that a bit of a depressing SNN?
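Here's what I understand the scheme to amount to, as a hedged sketch (my reading, not their code; the scale factor is made up):

    import numpy as np

    # My reading: "spike coding" an activation just means rounding it to an
    # integer count, and a count of k contributes to a matmul exactly like k.
    def spike_code(x, scale=8):
        return np.round(x * scale).astype(int)

    x = np.array([0.13, -0.7, 1.02])
    w = np.array([0.5, -0.2, 0.9])
    counts = spike_code(x)
    print(counts @ w / 8, x @ w)  # the "spiking" product tracks the ordinary one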
Either I'm missing something, or this is just dressing standard techniques with loads of meaningless jargon. Of course that’s a very popular way to operate in deep learning nowadays.
[1] Like, attention can draw from multiple tokens, sort of like how different spines of a dendrite can draw from multiple axons? Can’t make this stuff up.
cgadski | 6 months ago | on: Important machine learning equations
What makes me angry about LLM slop is imagining how this looks to a student learning this stuff. Putting a post like this on your personal blog is implicitly saying: as long as you know some "equations" and remember the keywords, a language model can do the rest of the thinking for you! It's encouraging people to forgo learning.
cgadski | 6 months ago | on: Important machine learning equations
It makes me sad to see LLM slop on the front page.
cgadski | 7 months ago | on: What is the average length of a queue of cars? (2023)
Conditional on the value of the first draw, N is geometrically distributed. If we're drawing from an absolutely continuous distribution on the real line, then of course the details of our distribution don't matter: N is a draw from a geometric distribution with rate lambda, where lambda in turn is drawn uniformly from [0, 1]. It follows that N has a thick tail; for example, the expected value of N is the expected value of 1/lambda, which is infinite. In fact, N turns out to have a power law tail.
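A quick simulation of this mixture-of-geometrics picture (my sketch, not the post's code):

    import numpy as np

    # The lead car's speed quantile u is uniform on [0, 1]; the queue it leads
    # grows until a slower car shows up, which happens with probability u per car.
    rng = np.random.default_rng(0)
    u = rng.uniform(size=500_000)
    N = rng.geometric(u)  # queue length, lead car included
    for n in [1, 2, 5, 10]:
        print(n, (N == n).mean(), 1 / (n * (n + 1)))  # matches P(N = n) = 1/(n(n+1))
    print(N.mean())  # keeps growing as you simulate more queues: E[N] is infinite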
However, this isn't true if we're drawing from a distribution that's not absolutely continuous. If you coarse-grain into just "fast" and "slow" cars, then N again has a thin (geometric) tail. More to the point, if we imagine that our queues of cars need to be formed within a finite amount of time, then a car is only added to the queue in front of it if its velocity is epsilon larger than the velocity of the queue, and the problematic situation where lambda -> 0 goes away. In this idealized scenario, I guess you could relate the rate of the exponential tail of N to how long the cars have been travelling for.
Finally, it's worth remembering the old "waiting-time paradox": the variable N we're talking about is not the same as the length of the queue that a randomly selected driver finds themself in. What's the distribution of the latter---the distribution of "experienced" queue lengths? In this post the author computed that P(N = n) = 1/(n(n + 1)). It stands to reason that to get the distribution of experienced lengths we need to multiply by n and divide by a normalizing constant. Unfortunately, you can't multiply 1/(n + 1) by any constant to get a probability distribution, since the sum over n diverges.
What does it mean that the distribution of experienced queue lengths doesn't exist? If you did a huge numerical simulation, you'd find that almost all drivers experience incredibly large queues, and that this concentration towards large queues only becomes more pronounced as you simulate more drivers. If anything, you could argue that the experienced queue length is "concentrated at infinity," although of course in practice all queues are finite.
cgadski | 8 months ago | on: Entropy of a Mixture
cgadski | 11 months ago | on: Ask HN: What are you working on? (April 2025)
So far all my work has gone into the technical side of setting up the game (a Java app written in 2010) to work as a reinforcement learning environment. The developers were nice enough to maintain the source and open it to the community, so I patched the client/server to be controllable through protobuf messages. So far, I can:
- Record games between humans. I also wrote a kind of janky replay viewer [1] that probably only makes sense to people who play the game already. (Before, the game didn't have any recording feature.)
- Define bots with pytorch/python and run them in offline training mode. (The game runs relatively quickly, like 8 gameplay minutes / realtime second.)
- Run my python-defined bots online versus human players. (Just managed to get this working today.)
It took a bunch of messing around with the Java source to get this far, and I haven't even really started on the reinforcement learning part yet. Hopefully I can start on that soon.
This game (https://planeball.com) is really unique, and I'm excited to produce a reinforcement learning environment that other people can play with easily. Thinking about how you might build bots for this game was one of the problems that made me interested in artificial intelligence 8 years ago. The controls/mechanics are pretty simple and it's relatively easy to make bots that beat new players---basically just don't crash into obstacles, don't stall out, conserve your energy, and shoot when you will deal damage---but good human players do a lot of complicated intuitive decision-making.
[1] http://altistats.com/viewer/?f=4b020f28-af0b-4aa0-96be-a73f0... (Press h for help on controls. Planes will "jump around" when they're not close to the objective---the server sends limited information on planes that are outside the field of vision of the client, but my recording viewer displays the whole map.)
cgadski | 11 months ago | on: Neural Graffiti – Liquid Memory Layer for LLMs
Except, the linear map W is just set to a random initialization, so it won't work for obvious reasons in its current form. (I guess this is why there is no example of its output. I'm guessing it was vibe-coded?) Also, since the intervention is only happening at the last hidden layer, I can't imagine this would really change how the model "thinks" in an interesting way. Like, yeah, you can absolutely make a model talk about dogs by adding in a control vector for "dogness" somewhere.
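For comparison, the generic steering-vector trick looks roughly like this (a minimal sketch, all names hypothetical, not this repo's code):

    import torch

    # Add a fixed "control vector" to the last hidden states before the LM head.
    def steer(hidden, control, alpha=4.0):
        # hidden: (batch, seq, d_model); control: (d_model,)
        return hidden + alpha * control / control.norm()

    h = torch.randn(1, 5, 16)
    dog_vector = torch.randn(16)  # in practice, a direction extracted for "dogness"
    print(steer(h, dog_vector).shape)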
Basically, this method is "inspired by graffiti art of tagging and the neuroplastic nature of living brains" in the same way that taking an exponential moving average of a time series would be "informed by state-space dynamics techniques utilized in deep learning, reservoir computing, and quantum mechanics." Really tired of the amount of insincere/pointless language in deep learning nowadays.
cgadski | 1 year ago | on: ARC-AGI without pretraining
One example that comes to mind (I don't know much/haven't thought about it much) is how AlphaFold apparently dropped rotational equivariance of the model in favor of what amounts to data augmentation---opting to "hammer in" the symmetry rather than using these fancy equivariant-by-design architectures. Apparently it's a common finding that hard-coded equivariance can hurt performance in practice when you have enough data.
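By "hammer in" I just mean augmentation, something like this sketch (my illustration, not AlphaFold's actual pipeline):

    import numpy as np

    # Apply an independent random rotation to each training example, so the model
    # has to learn rotational symmetry from data instead of having it built in.
    def random_rotation(points, rng):
        Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
        Q *= np.sign(np.diag(R))   # random orthogonal matrix
        if np.linalg.det(Q) < 0:
            Q[:, 0] *= -1          # flip a column so det(Q) = +1: a proper rotation
        return points @ Q.T

    rng = np.random.default_rng(0)
    print(random_rotation(rng.standard_normal((5, 3)), rng).shape)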
cgadski | 1 year ago | on: The Lost Art of Logarithms
When X is an exponential variable and c is a constant, X + c has the same distribution as X after conditioning on large outcomes. In other words, these two variables have the same "tail." This is true exactly for exponential distributions. (Sometimes this is called "memorylessness.")
Similarly, when U has a uniform distribution on [0, 1] and c is a constant, cU has the same distribution as U after conditioning on small outcomes.
If cU is distributed like U near 0, then -ln(c U) is distributed like -ln(U) near infinity. And since -ln(c U) = -ln(c) - ln(U), the tail of -ln(U) doesn't change when we add a constant, meaning it must have an exponential distribution.
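A quick numerical check of that last step (just a sketch):

    import numpy as np

    # If U is uniform on [0, 1], then -ln(U) should satisfy P(-ln U > t) = exp(-t).
    rng = np.random.default_rng(0)
    x = -np.log(rng.uniform(size=1_000_000))
    for t in [1.0, 2.0, 4.0]:
        print(t, (x > t).mean(), np.exp(-t))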
cgadski | 1 year ago | on: ARC-AGI without pretraining
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
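For reference, the two "information costs" I have in mind look like this in standard VAE bookkeeping (my sketch, not CompressARC's loss):

    import torch
    import torch.nn.functional as F

    # Description length ~= nats to encode the data given z (prediction loss)
    #                     + nats to encode z relative to the N(0, 1) prior (KL).
    def description_length(logits, targets, mu, logvar):
        rec = F.cross_entropy(logits, targets, reduction="sum")
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1.0)
        return rec + kl

    logits, targets = torch.randn(10, 5), torch.randint(0, 5, (10,))
    mu, logvar = torch.zeros(4), torch.zeros(4)  # a z that carries no information
    print(description_length(logits, targets, mu, logvar))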
Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
cgadski | 1 year ago | on: Complex Systems and Quantitative Mereology
In this article, we fix a mereology and a kind of quantity Q that "decomposes" over it---in the sense that Q(p) = sum_{r <= p} q(r) for some function q(r)---and then see that Mobius inversion lets us solve for q in terms of Q. In terms of incidence algebras, we're saying: assume Q = zeta q, as a product of elements in an incidence algebra. Then zeta has an inverse mu, so q = mu Q.
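A toy example on the poset of subsets, in case the incidence-algebra language feels abstract (my sketch):

    from itertools import combinations

    # If Q(S) = sum of q(R) over R <= S, Mobius inversion on the subset lattice
    # recovers q by inclusion-exclusion: q(S) = sum over R <= S of (-1)^(|S|-|R|) Q(R).
    def subsets(s):
        s = list(s)
        return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

    q = {frozenset(): 1, frozenset("a"): 2, frozenset("b"): 3, frozenset("ab"): 5}
    Q = {S: sum(q[R] for R in subsets(S)) for S in q}
    recovered = {S: sum((-1) ** (len(S) - len(R)) * Q[R] for R in subsets(S)) for S in Q}
    print(recovered == q)  # True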
In other situations, we might want to "solve for" a quantity Q that decomposes over some class of mereologies while respecting some properties. The "simpler" and more "homogeneous" the parts of your mereology, the less you can express, but the easier it becomes to reason about Q. A mereology that breaks me up into the empty set, singleton sets with each of my atoms, and the set of all my atoms admits no "decomposing quantities" besides a histogram of my atoms. An attempt to measure "how healthy I am" in terms of that mereology can't do much. On the other hand, if I choose the mereology that breaks me up into the empty set and my whole, all quantities decompose but I have no tools to reason about them.
I guess Euler characteristic could be an example of how the requirement of respecting a certain kind of mereology can "bend" a hard-to-decompose quantity into a weirder but "nicer" quantity. For example, say we're interested in defining a Q that attempts to "count the number of connected regions" of some object, and we insist on using a mereology that lets us divide regions up into "cells". Of course this is impossible, as we can see in the problem of counting connected components of a graph-like object: we can't get the answer just as a function of the number of vertices and edges. However, if we insist on assigning a value of 1 to "blobs" of any dimension, the "compositionality requirement" forces us to define the Euler characteristic. This doesn't help us much with graph algorithms in general, but gives us an unexpectedly easy way to, say, count the number of blob-shaped islands on a map.
I wonder if there are other examples of this?
cgadski | 1 year ago | on: Complex Systems and Quantitative Mereology
cgadski | 1 year ago | on: Minimum bipartite matching via Riemann optimization (2023)
I thought the formulas for updates in this case were pretty opaque, so let's think of a simpler example. Say we want to optimize a function f(x_1, ..., x_n) over the simplex
\Delta = \{ (x_1, ..., x_n) : x_1 + ... + x_n = 1, x_i >= 0 \}.
We can imagine that we're allocating a limited resource into n different bins. At each optimization step, say we have numbers g_i telling us partials of f with respect to the x_i. (Somewhat more generally, these numbers could tell us how much we "regret" having allocated our resource into the ith bin.) How can we use this information to change our choices of x_i?
One option would be to take vanilla gradient descent steps. That is, we'd choose some small eta, set x_i := x_i - eta g_i, and somehow adjust these assignments to be a valid allocation. Another method, which tends to be more natural and do better in practice, is to set x_i := x_i exp(-eta g_i) and then divide each x_i by (x_1 + ... + x_n). (This is called the Hedge method, and "mysteriously" also looks just like a Bayesian update for a distribution of belief over n possibilities.) This is the flavor of method being used in the post to optimize over doubly stochastic matrices.
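In code, one Hedge step is just (a minimal sketch):

    import numpy as np

    def hedge_step(x, g, eta=0.1):
        x = x * np.exp(-eta * g)  # downweight bins we "regret" having used
        return x / x.sum()        # renormalize back onto the simplex

    x = np.full(4, 0.25)
    g = np.array([1.0, 0.0, -0.5, 2.0])  # made-up gradient / regret signal
    print(hedge_step(x, g))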
I've studied differential geometry but still have the impression that the proximal operator/mirror descent point of view is a little more natural in optimization/learning. Gesticulating wildly about Levi-Civita connections is fun, but e.g. to optimize over the simplex it makes more sense to just ask: how should we regularize our updates to not be too "imprudent"? (Of course this depends on what problem you're solving!) One common story is that you want to avoid neglecting bins until you're very sure they're worthless. To do this, you can use a metric that makes your manifold look larger when you get near a vertex of the simplex (and then make sure you can do the necessary computations and find a good retraction map!), or you can think about what proximal operator to use/what Bregman divergence to use to regularize your updates. The paper behind this post uses the Fisher metric, which is the infinitesimal version of the KL divergence, and regularizing your updates by KL gives you things like the Hedge method/Bayesian updates.
cgadski | 1 year ago | on: Rnote – Sketch and take handwritten notes
Besides the usual rectangle/lasso/intersection selections, handwritten text that sits roughly level with the page rules can be deleted or moved by using the stylus as a "cursor." It understands punctuation and descenders and works surprisingly well. Text that is pushed too far to the right even "reflows" to the next line. The site [1] has a good explanation of how this works.
Rnote looks nice too though. I wonder if it'll be able to match the UX of Write.
cgadski | 1 year ago | on: A new type of neural network is more interpretable