
A new type of neural network is more interpretable

337 points | pseudolus | 1 year ago | spectrum.ieee.org | reply

86 comments

[+] Ameo|1 year ago|reply
I've tried out and written about[1] KANs on some small-scale modeling, comparing them to vanilla neural networks, as previously discussed here: https://news.ycombinator.com/item?id=40855028.

My main finding was that KANs are very tricky to train compared to NNs. It's usually possible to get per-parameter loss roughly on par with NNs, but it requires a lot of hyperparameter tuning and extra tricks in the KAN architecture. In comparison, vanilla NNs were much easier to train and worked well under a much broader set of conditions.

Some people commented that we've invested an incredible amount of effort into getting really good at training NNs efficiently, and many of the things in ML libraries (optimizers like Adam, for example) are designed and optimized specifically for NNs. For that reason, it's not really a good apples-to-apples comparison.

I think there's definitely potential in KANs, but they aren't a magic bullet. I'm also a bit dubious about interpretability claims; the splines that are usually used for KANs don't really offer much more insight to me than just analyzing the output of a neuron in a lower layer of a NN.

[1] https://cprimozic.net/blog/trying-out-kans/

[+] Lerc|1 year ago|reply
This is sort of my view as well: most of the hype and most of the criticism of KANs seem fairly unfounded.

I do think they have a lot of potential, but what has been published so far does not represent a panacea. Perhaps they will have an impact like transformers, perhaps they will only serve in a little niche. You can't really tell immediately how refinements will alter the usability.

Finding out what those refinements are and how they change things is what research is all about. I have been quite enjoying following https://github.com/mintisan/awesome-kan progress and seeing the variety of things being tried. I have a few ideas of my own I might try at some point.

Between KANs and fixed activation function networks there is an entire continuum of activation function tuning available for research.

Buckets of simple single-parameter activation functions, something like x*sigmoid(mx) (ReLU as m grows large, GELU at m≈1.7, SiLU at m=1). This adds a small number of parameters for presumably some gain.

Single activation functions as above per neuron.

Multi parameterizable activation functions, in batches, or per neuron.

Many parameter function approximators, in batches, or per neuron.

Full KANs without weights.
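A minimal sketch of the first rung of that continuum, the single-parameter activation x*sigmoid(mx), in plain Python (the function name and tolerances are mine, not from any library):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def act(x: float, m: float) -> float:
    """x * sigmoid(m*x): SiLU at m=1, roughly GELU near m=1.7,
    and approaching ReLU as m grows large."""
    return x * sigmoid(m * x)

# At m=1 this is exactly SiLU (a.k.a. swish):
print(round(act(1.0, 1.0), 4))   # 0.7311

# With a large slope m it behaves like ReLU: positives pass, negatives vanish.
print(round(act(2.0, 50.0), 4))  # 2.0
print(abs(act(-2.0, 50.0)) < 1e-6)  # True
```

Making m a learnable parameter (per layer, per batch of neurons, or per neuron) gives exactly the knob described above, at a cost of one extra parameter per bucket.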

I can see some significant acclaim being awarded to the person who can calculate a unified formula for determining where additional parameters should go for the largest impact.

[+] smus|1 year ago|reply
Not just the optimizers, but the initialization schemes for neural networks have been explicitly tuned for stable training of neural nets with traditional activation functions. I'm not sure as much work has gone into initialization for KANs.

I 100% agree with the idea that these won't be any more interpretable and I've never understood the argument that they would be. Sure, if the NN was a single neuron I can see it, but as soon as you start composing these things you lose all interpretability imo

[+] alexnewman|1 year ago|reply
I’m very happy to hear someone else say the quiet part out loud. Everyone claims NNs aren’t interpretable, but that’s never been my experience. Quite the contrary.
[+] thomasahle|1 year ago|reply
KANs can be modeled as just another activation architecture in normal MLPs, which is of course not surprising, since MLPs are very flexible. I made a chart of different types of architectures here: https://x.com/thomasahle/status/1796902311765434694

Curiously, KANs are not very efficient when implemented with normal matrix multiplications in, say, PyTorch. But with a custom CUDA kernel, or using torch.compile, they can be very fast: https://x.com/thomasahle/status/1798408687981297844
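A rough sketch of how a KAN layer can reduce to one ordinary matmul once each input is expanded in a fixed basis (the basis here is a made-up 3-function stand-in for the B-splines of the paper, and the shapes are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def kan_layer(x, W):
    """One KAN-style layer expressed as a single matrix multiplication.

    x : (batch, d_in) inputs
    W : (d_out, d_in * K) learned coefficients, K basis functions per edge

    Each edge's learned 1-D function is a linear combination of K fixed
    basis functions; stacking the basis expansions of every input turns
    the whole layer into one matmul over the expanded features.
    """
    # K = 3 toy basis functions per input (splines in a real KAN):
    basis = np.concatenate([x, x**2, np.sin(x)], axis=1)  # (batch, d_in*K)
    return basis @ W.T                                    # (batch, d_out)

x = rng.normal(size=(4, 5))       # batch of 4 examples, five inputs
W = rng.normal(size=(2, 5 * 3))   # two outputs, 3 basis coefficients per edge
print(kan_layer(x, W).shape)      # (4, 2)
```

Written this way, the per-edge function evaluations become a dense matmul that GPUs handle well; the inefficiency shows up when the basis expansion itself is done with many small, irregular ops, which is presumably what the custom kernel / torch.compile versions avoid.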

[+] byteknight|1 year ago|reply
Side question:

Can people this deep in the field read that visualization with all the formulas and actually grok what's going on? I'm trying to understand just how far behind I am from the average math person (obviously very very very far, but quantifiable lol)

[+] kherud|1 year ago|reply
Interesting, thanks for sharing! Do you have an explanation or idea why compilation slows some architectures down?
[+] smusamashah|1 year ago|reply
> One downside of KANs is that they take longer per parameter to train—in part because they can’t take advantage of GPUs. But they need fewer parameters. Liu notes that even if KANs don’t replace giant CNNs and transformers for processing images and language, training time won’t be an issue at the smaller scale of many physics problems.

They don't even say that it might be possible to take advantage of GPUs in future. Reads like a fundamental problem with these.

[+] nickpsecurity|1 year ago|reply
I’ve seen neural nets combined with decision trees. There’s a few ways to do such hybrids. One style essentially uses the accurate, GPU-trained networks to push the interpretable networks to higher accuracy.

Do any of you think that can be done cost-effectively with KANs? Especially using pre-trained language models like Llama-3 to train the interpretable models?

[+] scotty79|1 year ago|reply
I wonder what's the issue ... GPUs can do very complex stuff
[+] xg15|1 year ago|reply
> Then they could summarize the entire KAN in an intuitive one-line function (including all the component activation functions), in some cases perfectly reconstructing the physics function that created the dataset.

The idea of KANs sounds really exciting, but just to nitpick, you could also write any traditional NN as a closed-form "one line" expression - the line will just become very very long. I don't see how the expression itself would become less complex if you used splines instead of weights (even if this resulted in fewer neurons for the same decision boundary).

[+] rsfern|1 year ago|reply
In the original KAN paper, they do two things to address this: first they have some sparsity-inducing regularization, and second they have a symbolification step so that you can ideally find a compact symbolic model after learning a sparse computation graph of splines.

I guess in principle you could do something similar with MLPs but since MLP representations are sort of delocalized they might be harder to sparsify and symbolify
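As a toy illustration of that prune-then-symbolify step (the basis, coefficients, and threshold below are all invented for illustration; the real paper learns spline coefficients and fits symbolic forms to them):

```python
import numpy as np

# Suppose one edge's learned function is a combination of a fixed basis.
basis_names = ["x", "x**2", "sin(x)", "cos(x)"]
coeffs = np.array([0.02, -0.01, 0.98, 0.03])  # hypothetical learned values

# Sparsity step: L1 regularization during training pushes unneeded
# coefficients toward zero; here we just prune the near-zero ones post hoc.
pruned = np.where(np.abs(coeffs) < 0.1, 0.0, coeffs)

# "Symbolification": read a compact formula off the surviving terms.
formula = " + ".join(
    f"{c:.2f}*{name}" for c, name in zip(pruned, basis_names) if c != 0.0
)
print(formula)  # 0.98*sin(x)
```

The point is that the compact formula only exists because the regularizer drove most coefficients to (near) zero first; a dense, delocalized MLP representation gives the pruning step nothing to grab onto.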

[+] asdfman123|1 year ago|reply
Can someone ELIF this for me?

I understand how neural networks try to reduce their loss function to get the best result. But what's actually different about the KANs?

[+] svachalek|1 year ago|reply
I'm not an ML person and am just learning from this article, but I understand a little bit about ML, and the key thing I get out of it is the footnote in the diagram.

A regular neural network (MLP) has matrices full of floating point numbers that act as weights. A weight is a linear function y=wx, meaning if I plot the input x and output y on cartesian coordinates, it will generate a straight line. Increasing or decreasing the input also increases or decreases the output by consistent amounts. There are no points where a further increase in the input suddenly has more or less effect on the output than the previous increase did, or starts sending the output in the other direction. So we train the network by having it learn multiple layers of these weights, and also connect them with some magic glue functions (activation functions) that are part of the design, not something that is trained up. The end result is that the output can have a complex relationship with the input by being passed through all these layers.

In contrast, in a KAN rather than weights (acting as linear functions) we let the network learn other kinds of functions. These are nonlinear so it's possible that as we increase the input, the output keeps rising in an accelerating fashion, or turns around and starts decreasing. We can learn much more complex relationships between input and output, but lose some of the computational efficiency of the MLP approach (huge matrix operations are what GPUs are built for, while you need a CPU to do arbitrary math).

So with the KAN we end up with fewer but more complex "neurons", made up of complex functions. And if I understand what they're getting at here, the appeal of this is that you can inspect one of those neurons and get a clear formula that describes what it is doing, because all the complexity is distilled into a formula in the neuron. While with an MLP you have to track what is happening through multiple layers of weights and do more work to figure out how it all works.

Again I'm not in the space, but I imagine the functions that come out of a KAN still aren't super intuitive formulas that look like something out of Isaac Newton's notebooks, they're probably full of bizarre constants and unintuitive factors that cancel each other out.

[+] Lerc|1 year ago|reply
I'm not sure if this counts as ELIF, but here's a gross simplification.

A perceptron layer is

output = simple_function( sum(many_inputs*many_weights) + extra_weight_for_bias)

a KAN layer is

output = sum(fancy_functions(many_inputs))

but I could be wrong, it's been a day.
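The two formulas above can be written out as runnable Python (the weights and the "fancy functions" here are arbitrary stand-ins, not learned values; in a real KAN the fancy functions would be splines):

```python
import math

def perceptron_unit(inputs, weights, bias):
    """One MLP unit: a fixed nonlinearity applied to a weighted sum."""
    simple_function = lambda z: max(0.0, z)  # e.g. ReLU
    return simple_function(sum(x * w for x, w in zip(inputs, weights)) + bias)

def kan_unit(inputs, fancy_functions):
    """One KAN unit: a learned function per input edge, then a plain sum."""
    return sum(f(x) for f, x in zip(fancy_functions, inputs))

inputs = [0.5, -1.0, 2.0]
print(perceptron_unit(inputs, [1.0, 1.0, 1.0], 0.0))  # ReLU(1.5) = 1.5

# Stand-in "learned" edge functions:
fns = [math.sin, math.cos, lambda x: x**2]
print(kan_unit(inputs, fns))
```

Note where the learning happens in each case: the perceptron learns the weights and keeps the function fixed, while the KAN unit learns the functions themselves and just sums them.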

[+] yobbo|1 year ago|reply
The output of an MLP is a black-box function f(x, y).

The output of a KAN is a nice formula like exp(0.3sin(x) + 4cos(y)). This is what is meant by interpretable.

[+] Grimblewald|1 year ago|reply
A KAN is, in a way, like a network of networks, each edge representing its own little network of sorts. I could be very wrong; I am still digesting the article myself, but that is my superficial take.
[+] BenoitP|1 year ago|reply
I wonder if a set of learned functions (can|does) reproduce the truth tables from first-order logic.

I think it'd be easy to check.

----

Anyways that's great news for differentiability. For now 'if' conditions expressed in JAX are tricky (at least for me), and are de facto an optimization barrier. If they're learnable and already into the network, I'd say that's a great thing.

[+] zeknife|1 year ago|reply
It is easy to construct an MLP that implements any basic logic function. But XOR requires at least one hidden layer.
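That classic construction can be written out in a few lines. This is a hand-set sketch, not a trained network: the hidden layer computes OR and AND, and the output fires when OR is true but AND is not.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum exceeds zero."""
    return 1 if z > 0 else 0

def xor_mlp(a, b):
    """XOR with one hidden layer of two units (hand-picked weights)."""
    h1 = step(a + b - 0.5)      # hidden unit 1: OR(a, b)
    h2 = step(a + b - 1.5)      # hidden unit 2: AND(a, b)
    return step(h1 - h2 - 0.5)  # output: OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

With no hidden layer the output is a threshold of a single weighted sum, and no line through the plane separates {(0,1), (1,0)} from {(0,0), (1,1)}, which is why the hidden layer is required.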
[+] novaRom|1 year ago|reply
I am a bit skeptical. There were a lot of papers and experiments in the 80s and 90s on ANN architectures alternative to f(x*w+b). The reality is that today all practical SOTA models are still multiply-accumulate-threshold based, simply because of its speed and simplicity.
[+] theptip|1 year ago|reply
> One downside of KANs is that they take longer per parameter to train—in part because they can’t take advantage of GPUs.

This seems like a big gap. Anyone know if this is a fundamental architecture mismatch, or just no one has written the required CUDA kernels yet?

[+] jcims|1 year ago|reply
I can find descriptions at one level or another (eg RNN vs CNN) but is there a deeper kingdom/phylum/class type taxonomy of neural network architectures that can help a layman understand how they differ and how they align, ideally with specific references to contemporary ones in use or being researched?

I don't know why I'm interested because I'm not planning to actually do any work in the space, but I always struggle to understand when some new architecture is announced if it's a fundamental shift or if it's an optimization.

[+] noduerme|1 year ago|reply
This sounds a bit like allowing each neuron's function to perform its own symbolic regression? But for predicting physical phenomena, you might get better performance per cycle from an A-Life swarm of competing symbolic-regression cells than from trying to harness them as a single organism. Why do you need a NN to model what's basically a deterministic result set, and why is that a good test?
[+] zygy|1 year ago|reply
Naive question: what's the intuition for how this is different from increasing the number of learnable parameters on a regular MLP?
[+] slashdave|1 year ago|reply
Orthogonality ensures that each weight has its own, individual importance. In a regular MLP, the weights are naturally correlated.
[+] Bluestein|1 year ago|reply
(I am wondering if there might not be a perverse incentive not to improve on interpretability for major incumbents ...

... given how, what you can "see" (ie. have visibility into) is something that regulatory stakeholders can ask you to exercise control over, or for oversight or information about ...

... whereas a "black box" they have trained and control - but few understand - can perhaps give you "plausible deniability" of the "we don't know how it works either" type.-