In the paper they use spline interpolation to represent the 1D functions that they sum. Their code seemed aimed at smaller sizes. Instead I chose a different representation: Fourier coefficients, used to interpolate the functions of the individual coordinates.
It should give an idea of Kolmogorov-Arnold networks' representation power. It should probably converge more easily than their spline version, but the spline version has fewer operations.
Of course, if my code doesn't work, it doesn't mean theirs doesn't.
Feel free to experiment and publish a paper if you want.
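For concreteness, here is a minimal sketch of the idea (toy code of my own, not the linked repo's; the class and parameter names are made up): each edge carries a 1D function stored as a truncated Fourier series, and each output simply sums the edge functions of its inputs.

    import torch
    import torch.nn as nn

    class NaiveFourierKANLayer(nn.Module):
        """One KAN layer: every (input, output) edge holds a learnable 1D
        function, here a truncated Fourier series; outputs sum over inputs."""
        def __init__(self, in_dim, out_dim, num_frequencies=8):
            super().__init__()
            self.num_frequencies = num_frequencies
            # Cosine and sine coefficients for every edge and frequency.
            self.coeffs = nn.Parameter(
                torch.randn(2, out_dim, in_dim, num_frequencies)
                / (in_dim * num_frequencies) ** 0.5)

        def forward(self, x):                  # x: (batch, in_dim)
            k = torch.arange(1, self.num_frequencies + 1, device=x.device)
            angles = x.unsqueeze(-1) * k       # (batch, in_dim, num_frequencies)
            # Evaluate all edge functions and sum over the input dimension.
            return (torch.einsum('bif,oif->bo', torch.cos(angles), self.coeffs[0])
                    + torch.einsum('bif,oif->bo', torch.sin(angles), self.coeffs[1]))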
When I played around with implementing this last night, I found that using radial basis functions instead of Fourier coefficients (I tried those too - nice and parallel and easy to write) was better behaved when training networks of depth greater than 2.
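The same toy layout with Gaussian radial basis functions instead (again a sketch with made-up names; centers sit on a fixed grid, widths tied to the grid spacing):

    import torch
    import torch.nn as nn

    class RBFKANLayer(nn.Module):
        """Same layout as the Fourier toy above, but each edge function is a
        sum of Gaussian bumps on a fixed grid of centers."""
        def __init__(self, in_dim, out_dim, num_centers=8, limit=2.0):
            super().__init__()
            self.register_buffer('centers', torch.linspace(-limit, limit, num_centers))
            self.gamma = num_centers / (2.0 * limit)  # width ~ grid spacing
            self.coeffs = nn.Parameter(
                torch.randn(out_dim, in_dim, num_centers)
                / (in_dim * num_centers) ** 0.5)

        def forward(self, x):                       # x: (batch, in_dim)
            d = x.unsqueeze(-1) - self.centers      # (batch, in_dim, num_centers)
            phi = torch.exp(-(self.gamma * d) ** 2)
            return torch.einsum('bic,oic->bo', phi, self.coeffs)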
Hi Noesis, I just noticed that your implementation, combined with the efficientKAN by Blealtan (https://github.com/Blealtan/efficient-kan), results in a structure very similar to Siren (an MLP with sine activations). efficientKAN first computes the common basis functions for all the edge activations, and the output can then be calculated as a linear combination of the basis. If the basis functions are Fourier, then a KAN layer can be viewed as a linear layer with fixed weights + a sine activation + a linear layer with learnable weights, which is a special form of Siren.
I think this may show some connection between KAN and MLP.
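To make that factorization concrete, a sketch (my own code, not efficientKAN's): the first linear layer is fixed and produces every k·x_i argument twice, the second copy phase-shifted by π/2 so the sine activation also yields the cosine terms; the final learnable linear layer holds the Fourier coefficients.

    import torch
    import torch.nn as nn

    class Sin(nn.Module):
        def forward(self, x):
            return torch.sin(x)

    def fourier_kan_as_siren(in_dim, out_dim, num_frequencies=8):
        """A Fourier-basis KAN layer as: fixed linear -> sin -> learnable linear."""
        rows = 2 * in_dim * num_frequencies
        W, b = torch.zeros(rows, in_dim), torch.zeros(rows)
        for i in range(in_dim):
            for j in range(num_frequencies):
                W[i * num_frequencies + j, i] = j + 1       # sin((j+1) * x_i)
                r = in_dim * num_frequencies + i * num_frequencies + j
                W[r, i] = j + 1
                b[r] = torch.pi / 2                         # sin(kx + pi/2) = cos(kx)
        fixed = nn.Linear(in_dim, rows)
        fixed.weight = nn.Parameter(W, requires_grad=False)  # frozen first layer
        fixed.bias = nn.Parameter(b, requires_grad=False)
        return nn.Sequential(fixed, Sin(), nn.Linear(rows, out_dim))  # last layer learns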
I've spent some time playing with their Jupyter notebooks. The most useful (to me, anyway) is their Example_3_classfication.ipynb ([1]).
It works as advertised with the parameters selected by the authors, but if we modify the network shape in the second half of the tutorial (Classification formulation) from (2, 2) to (2, 2, 2), it fails to generalize: the training loss gets down to 1e-9, while the test loss stays around 3e-1. Moving to larger network sizes does not help either.
I would really like to see a bigger example, with many more parameters and more data complexity, to check whether it can be trained at all. MNIST would be a good start.
Update: I increased the training dataset size 100x, and that helps with the overfitting, but now I can't get training loss below 1e-2. Still iterating on it; GPU acceleration would really help - right now, my progress is limited by the speed of my CPU.
1. https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
Update2: got it to 100% training accuracy, 99% test accuracy with (2, 2, 2) shape.
Changes:
1. Increased the training set from 1000 to 100k samples. This solved overfitting.
2. In the dataset generation, slightly reduced noise (0.1 -> 0.07) so that classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.
3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (and initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in the Example_1_function_fitting.ipynb: https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
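For reference, that grid-refinement loop looks roughly like this; a sketch assuming the early pykan API (`KAN`, `train`, `initialize_from_another_model`) as used in the tutorials:

    from kan import KAN

    # dataset: dict with 'train_input', 'train_label', 'test_input', 'test_label'
    model = KAN(width=[2, 2, 2], grid=5, k=3)
    model.train(dataset, opt="LBFGS", steps=30)

    for grid in (10, 20):
        # Build a finer model and warm-start it from the coarser one.
        model = KAN(width=[2, 2, 2], grid=grid, k=3).initialize_from_another_model(
            model, dataset['train_input'])
        model.train(dataset, opt="LBFGS", steps=30)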
Overall, my impressions are:
- it works!
- the reference implementation is very slow. A GPU implementation is dearly needed.
- it feels like it's a bit too non-linear, and training is not as stable as it is with MLP + ReLU.
- scaling is not guaranteed to work well. We really need to see whether MNIST can be solved with this approach.
I will definitely keep an eye on this development.
This. I don't think toy examples are useful for modern ML techniques. If you tested the big ideas in ML (transformers, LSTMs, Adam) on a training dataset of 50 numbers trying to fit a y=sin(x) curve, I think you'd wrongly throw those ideas out.
It's possible to run it on CUDA. One of their examples shows how. But I found it's slower than on the CPU. I'm actually not really surprised, since running something on a GPU is no guarantee that it's going to be fast, especially when lots of branching is involved.
Unfortunately, I had to modify KAN.py and KANLayer.py to make it work, as not all the relevant tensors are put on the correct device. In some places the formatting even suggests that there was previously a device argument.
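The shape of the fix, for anyone hitting the same thing (illustrative only; this is not the actual pykan code):

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Every tensor materialized inside KAN.py / KANLayer.py needs the same
    # treatment: anything built with torch.linspace / torch.ones / torch.eye
    # has to be created on (or moved to) the model's device, e.g.:
    grid = torch.linspace(-1.0, 1.0, steps=11, device=device)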
GLMs in turn generalize logistic, linear, and other popular regression models.
Neural GAMs with learned basis functions have already been proposed, so I'm a bit surprised that the prior art is not mentioned in this new paper. Previous applications focused more on interpretability.
The success we're seeing with neural networks is tightly coupled with the ability to scale: the algorithm itself works at scale (more layers), and it also scales well with hardware (neural nets mostly consist of matrix multiplications, and GPUs have specialised matrix-multiplication acceleration). One of the most impactful neural network papers, AlexNet, was impactful precisely because it showed that NNs could be put on the GPU, scaled, and accelerated, to great effect.
It's not clear from the paper how well this algorithm will scale, both in terms of the algorithm itself (does it still train well with more layers?) and in its ability to make use of hardware acceleration (e.g., it's not clear to me that the structure, with its per-weight activation functions, can make use of fast matmul acceleration).
It's an interesting idea that seems to work well and have nice properties at smaller scale; but whether it's a good architecture for ImageNet, LLMs, etc. is not clear at this stage.
Sounds like something which could be approximated by a DCT (discrete cosine transform). JPEG compression does this, and there are hardware accelerations for it.
> can make use of fast matmul acceleration
Maybe not, but matmul acceleration was done in hardware because it's useful for some problems (graphics initially).
So if these per-weight activation functions really work, people will be quick to figure out how to run them in hardware.
It's so refreshing to come across new AI research different from the usual "we modified a transformer in this and that way and got slightly better results on this and that benchmark." All those new papers proposing incremental improvements are important, but... everyone is getting a bit tired of them. Also, anecdotal evidence and recent work suggest we're starting to run into fundamental limits inherent to transformers, so we may well need new alternatives.[a]
The best thing about this new work is that it's not an either/or proposition. The proposed "learnable spline interpolations as activation functions" can be used in conventional DNNs to improve their expressivity. Now we just have to test the stuff to see if it really works better.
Very nice. Thank you for sharing this work here!
---
[a] https://news.ycombinator.com/item?id=40179232
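As a sketch of the drop-in idea above (toy code of mine, not from the paper): a learnable piecewise-linear activation that can replace ReLU in an ordinary DNN. KANs take this further by learning one such function per edge rather than one shared function.

    import torch
    import torch.nn as nn

    class LearnableSplineActivation(nn.Module):
        """Drop-in activation: a learnable piecewise-linear function on a
        fixed grid of knots, shared across all units."""
        def __init__(self, num_knots=16, limit=3.0):
            super().__init__()
            self.register_buffer('knots', torch.linspace(-limit, limit, num_knots))
            self.values = nn.Parameter(torch.relu(self.knots.clone()))  # init ~ ReLU

        def forward(self, x):
            x = x.clamp(self.knots[0], self.knots[-1])
            idx = torch.bucketize(x, self.knots[1:-1])     # which segment x is in
            x0, x1 = self.knots[idx], self.knots[idx + 1]
            y0, y1 = self.values[idx], self.values[idx + 1]
            t = (x - x0) / (x1 - x0)                       # linear interpolation
            return y0 + t * (y1 - y0)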
There's a ton, actually. It's just that they tend to go through extra rounds of review (or never make it...) and never make it to HN unless there are special circumstances (this one is MIT and CIT). Unfortunately, we've let PR become a very powerful force (it's always been a thing, but it seems more influential now). We can fight against this by upvoting things like this and, if you're a reviewer, by not focusing on SOTA (it's clearly been gamed and is clearly leading us in the wrong direction).
I read a book on NNs by Robert Hecht-Nielsen in 1989, during the NN hype of the time (I believe it was the 2nd hype cycle, the first beginning with Rosenblatt's original hardware perceptron and dying with Minsky and Papert's "Perceptrons" manuscript a decade or two earlier).
Everything described was laughably basic by modern standards, but the motivation given in that book was the Kolmogorov representation theorem: a modest 3-layer network with the right activation function can represent any continuous m-to-n function.
Most research back then focused on 3-layer networks, possibly for that reason. Sigmoid activation was king, and vanishing gradients were the main issue. It took two decades until AlexNet brought NN research back from the AI winter of the 1990s.
This is science as is :)
95% will produce mediocre-to-nice improvements to what we already have, so that there are researchers who eventually grow up and do something really exciting.
From the preprint - 100 input dimensions is considered "high", and most problems considered have 5 or fewer input dimensions. This is typical of the physics-inspired settings I've seen considered in ML. The next step would be demonstrating them on MNIST, which, at 784 dimensions, is tiny by modern standards.
In actual business processes there are lots of ML problems with fewer than 100 input dimensions. But for most of them decision trees are still competitive with neural networks or even outperform them.
Very interesting! Kolmogorov neural networks can represent discontinuous functions [1], but I've wondered how practically applicable they are. This repo seems to show that they have some use after all.
[1]: https://arxiv.org/abs/2311.00049
Not for discontinuous functions. As your paper explains, we know that g exists for discontinuous bounded functions, but we have nothing to find it with.
> A practical construction of g in cases with discontinuous bounded and unbounded functions is not yet known. For such cases Theorem 2.1 gives only a theoretical understanding of the representation problem. This is because for the representation of discontinuous bounded functions we have derived (2.1) from the fact that the range of the operator Z∗ is the whole space of bounded functions B(Id). This fact directly gives us a formula (2.1) but does not tell how the bounded one-variable function g is attained. For the representation of unbounded functions we have used a linear extension of the functional F, existence of which is based on Zorn's lemma (see, e.g., [19, Ch. 3]). Application of Zorn's lemma provides no mechanism for practical construction of such an extension. Zorn's lemma helps to assert only its existence.
If you look at the OP post's arxiv link, you will see they are using splines.
https://arxiv.org/abs/2404.19756
Still interesting and potentially useful, but not useful for discontinuous functions without further discoveries.
If I am wrong, please provide a link; it is of great interest to me.
Perhaps a hasty comment, but linear combinations of B-splines are yet another (higher-degree) B-spline. Isn't this simply fitting high-degree B-splines to functions?
That would be true for a single node / single layer. But once the output of one layer is fed into the input of the next, it is not just a linear combination of splines anymore.
2. Another advantage of this approach is that it has only one class of parameters (the coefficients of the local activation functions), as opposed to an MLP, which has three classes of parameters (weights, biases, and the globally uniform activation function).
3. Everybody is talking transformers. I want to see diffusion models with this approach.
To your 3rd point: most diffusion models already use a transformer-based architecture (U-Net with self-attention and cross-attention, Vision Transformer, Diffusion Transformer, etc.).
Very cool stuff! Exciting to see so many people sharing their works on KANs. Seeing as the authors claim that KANs are able to reduce the issues of catastrophic forgetting that we see in MLPs, I thought "Wouldn't it be nice if there was an LLM that substituted MLPs with KANs?". I looked around and didn't find one, so I built one!
- PyTorch Module of the KAN GPT
- Deployed to PyPI
- MIT Licence
- Test Cases to ensure forward-backward passes work as expected
- Training script
I am currently working on training it on the WebText dataset to compare it to the original gpt2. Facing a few out-of-memory issues at the moment. Perhaps the vocab size (50257) is too large?
I'm open to contributions and would love to hear your thoughts!
https://github.com/AdityaNG/kan-gpt
https://pypi.org/project/kan-gpt/
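A back-of-envelope guess at the OOM, with assumed gpt2-small-ish sizes: a per-edge-parameterized readout over a 50257-token vocabulary multiplies the head's parameter count by the number of basis coefficients per edge.

    # Assumed sizes, purely illustrative.
    d_model, vocab, coeffs_per_edge = 768, 50257, 8   # coeffs ~ grid size + spline order
    linear_head = d_model * vocab                     # ~ 38.6M parameters
    kan_head = d_model * vocab * coeffs_per_edge      # ~ 308.8M parameters
    print(f"linear head: {linear_head / 1e6:.1f}M, KAN head: {kan_head / 1e6:.1f}M")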
At the end of this example, they recover the symbolic formula that generated their training set: exp(x₂² + sin(3.14x₁)).
It's like a computation graph with a library of "activation functions" that is optimised and then pruned. You can recover good symbolic formulas from the pruned graph.
Maybe not meaningful for MNIST.
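The recovery flow in pykan's notebooks looks roughly like this (a sketch; I'm assuming the `prune` / `auto_symbolic` / `symbolic_formula` methods as they appear in the tutorials):

    # model: a trained pykan KAN
    model = model.prune()                                 # drop low-importance edges/nodes
    model.auto_symbolic(lib=['x', 'x^2', 'exp', 'sin'])   # snap each spline to a symbol
    print(model.symbolic_formula())                       # e.g. exp(x_2**2 + sin(3.14*x_1))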
After trying this out with the Fourier implementation above, swapping MLP/attention Linear layers for KANs (all, or even a few layers) produces diverging loss. KANs don't require normalization for good forward-pass dynamics, but they may be trickier to train in a deep net.
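The swap itself is mechanical; roughly this (a sketch, reusing the NaiveFourierKANLayer toy from the top of the thread):

    import torch.nn as nn

    def swap_linears_for_kan(module: nn.Module):
        """Recursively replace every nn.Linear with a KAN layer of the same shape."""
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name,
                        NaiveFourierKANLayer(child.in_features, child.out_features))
            else:
                swap_linears_for_kan(child)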
Page 6 explains how they did backpropagation (https://arxiv.org/pdf/2404.19756), and page 2 says that previous efforts to leverage the Kolmogorov-Arnold representation failed to use backpropagation, so maybe using backpropagation to train multilayer networks with this architecture is their main contribution?
> Unsurprisingly, the possibility of using Kolmogorov-Arnold representation theorem to build neural networks has been studied [8, 9, 10, 11, 12, 13]. However, most work has stuck with the original depth-2 width-(2n + 1) representation, and did not have the chance to leverage more modern techniques (e.g., back propagation) to train the networks. Our contribution lies in generalizing the original Kolmogorov-Arnold representation to arbitrary widths and depths, revitalizing and contextualizing it in today's deep learning world, as well as using extensive empirical experiments to highlight its potential role as a foundation model for AI + Science due to its accuracy and interpretability.
No, the activations are a combination of the basis function and the spline function. It's still a little unclear to me how the grid works, but it seems like this shouldn't suffer any more than a generic ReLU MLP.
I can't assess this, but I do worry that overnight some algorithmic advance will enhance LLMs by orders of magnitude and the next big model to get trained is suddenly 10,000x better than GPT-4 and nobody's ready for it.
What is there to be worried about? Technical progress will happen, sometimes in sudden jumps. Some company will become a leader; competitors will catch up after a while.
I think this is unlikely. There has never (in the visible fossil record) been a mutation that suddenly made tigers an order of magnitude stronger and faster, or humans an order of magnitude more intelligent. It's been a long time (if ever?) since chip transistor density made a multiple-order-of-magnitude leap. Any complex optimized system has many limiting factors and it's unlikely that all of them would leap forward at once. The current generation of LLMs are not as complex or optimized as tigers or humans, but they're far enough along that changing one thing is unlikely to result in a giant leap.
If and when something radically better comes along, say an alternative to back-propagation that is more like the way our brains learn, it will need a lot of scaling and refinement to catch up with the then-current LLM.
In the article "Distilling free-form natural laws from experimental data", Schmidt and Lipson introduced the idea that free-form natural laws can be learned from experimental measurements in a physical system using symbolic (genetic) regression algorithms. An important claim in this work is that the algorithm finds laws in data without having incorporated any prior knowledge of physics. Upon close inspection, however, we show that their method implicitly incorporates Hamilton's equations of motions and Newton's second law, demystifying how they are able to find Hamiltonians and special classes of Lagrangians from data.
I think this is hilarious.
I cannot get a PDF of your article, so instead I will read a commentary on it, which appears to be very interesting.
Seconded. I’m guessing you could create an implementation that is able to do that and then write optimised triton/cuda kernels to accelerate them but need to investigate further
I was under the impression that graph neural nets already trained learnable functions on graph edges rather than nodes, albeit typically on a fully connected graph. Is there any comparison to just a basic GNN here?
https://en.wikipedia.org/wiki/Bayesian_network#Graphical_mod...
> Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if m parent nodes represent m Boolean variables, then the probability function could be represented by a table of 2^m entries, one entry for each of the 2^m possible parent combinations.
So, a new type of neural network that has been proven to work well on regression tasks common in physics? And tested in practice to fit well on elementary algebra and compositions of complex functions. But there is no evidence at all yet on whether it works on the most basic machine learning tasks like MNIST, not to mention language models.
I mean, it's great, but at the current state it seems better suited for tasks where an explicit formula exists (though not known) and the goal is to predict it on unknown points (and to be able to interpret the formula as a side effect). Deep learning tasks are more of a statistical nature (think models with a cross-entropy loss - statistically predicting the frequency of different choices of the class/next token); they require a specialized training procedure, whereas this is designed to fit 100% rather than somewhat close (think linear algebra - it won't be good at it). It would very likely take a radically different idea to apply it to deep learning tasks. The recently updated "Author's note" also mentions this: "KANs are designed for applications where one cares about high accuracy and/or interpretability."
It's great but let's be patient before we see this improve LLM accuracy or be used elsewhere.
They help with interpretability, which is a step brother. See for example "From Shapley Values to Generalized Additive Models and back" https://arxiv.org/abs/2209.04012
doesn't KA representation require continuous univariate functions? do B-splines actually cover the space of all continuous functions? wouldn't... MLPs be better for the learnable activation functions?
The paper [0] is pretty good for handling questions like those.
> doesn't KA representation require continuous univariate functions?
All multivariate continuous functions (on a bounded domain) can be represented as compositions of addition and univariate continuous functions. Much like an MLP, you can also approximate discontinuous functions well on most of the domain (learning a nearby continuous function instead).
> do B-splines actually cover the space of all continuous functions
Much like an MLP, you can hit your favorite accuracy bound with more control points.
> wouldn't... MLPs be better for the learnable activation functions
Perhaps. A B-spline is comparatively very fast to compute. Also, local training examples have global impacts on an MLP's weights. That's good and bad. One property you would expect while training a KAN in limited data regimes is that some control points are never updated, leading to poor generalization due to something like a phase shift as you cross over control points (I think the entropy-based regularizer they have in the paper probably solves that, but YMMV). The positive side of that coin is that you neatly side-step catastrophic forgetting.
[0] https://arxiv.org/abs/2404.19756
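On the locality point: only degree+1 B-spline basis functions are nonzero at any given input, which is why control points far from the data never receive a gradient. A small sketch of the Cox-de Boor recursion:

    import numpy as np

    def bspline_basis(x, knots, degree):
        """Values of all B-spline basis functions at a scalar x (Cox-de Boor).
        At most degree+1 entries are nonzero, so data near x updates only the
        nearby coefficients - the local-training behavior described above."""
        B = np.array([float(knots[i] <= x < knots[i + 1])
                      for i in range(len(knots) - 1)])
        for d in range(1, degree + 1):
            for i in range(len(knots) - d - 1):
                left = right = 0.0
                if knots[i + d] > knots[i]:
                    left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
                if knots[i + d + 1] > knots[i + 1]:
                    right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1]) * B[i + 1]
                B[i] = left + right
        return B[:len(knots) - degree - 1]

    print(bspline_basis(0.35, np.linspace(0, 1, 11), 3))  # mostly zeros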
This really reminds me of Petri nets, but an analog version: instead of places and discrete tokens we have activation functions and signals. You can only trigger a transition if an activation function (place) has the right signal (tokens).
Looks very interesting, but my guess would be that this would run into the problem of exploding/vanishing gradients at larger depths, just like tanh or sigmoid networks do.
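That would be easy to probe empirically; something like this (a sketch, reusing the NaiveFourierKANLayer toy from the top of the thread):

    import torch
    import torch.nn as nn

    depth = 8
    net = nn.Sequential(*[NaiveFourierKANLayer(64, 64) for _ in range(depth)])
    x = torch.randn(32, 64)
    net(x).pow(2).mean().backward()
    for i, layer in enumerate(net):
        # Norms shrinking (or blowing up) toward layer 0 would indicate
        # vanishing (or exploding) gradients with depth.
        print(i, layer.coeffs.grad.norm().item())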
GistNoesis|1 year ago
https://github.com/GistNoesis/FourierKAN/
The core is really just a few lines.
AbrahamParangi|1 year ago
droidlyx|1 year ago
sevagh|1 year ago
>Of course, if my code doesn't work, it doesn't mean theirs doesn't.
But, _does_ it work?
agnosticmantis|1 year ago
itsthecourier|1 year ago
krasin|1 year ago
krasin|1 year ago
londons_explore|1 year ago
samus|1 year ago
esafak|1 year ago
talegari|1 year ago
montebicyclelo|1 year ago
dist-epoch|1 year ago
cs702|1 year ago
godelski|1 year ago
beagle3|1 year ago
glebnovikov|1 year ago
mxwsn|1 year ago
wongarsu|1 year ago
ubj|1 year ago
nyrikki|1 year ago
reynoldss|1 year ago
Lichtso|1 year ago
Lichtso|1 year ago
1957: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...
1958: https://en.wikipedia.org/wiki/Multilayer_perceptron
trwm|1 year ago
There isn't much difference between weights of a linear sum and coefficients of a spline.
tripplyons|1 year ago
xpe|1 year ago
One might argue this via parsimony (Occam’s razor). Is this your thinking? / Anything else?
kolinko|1 year ago
cbsmith|1 year ago
baq|1 year ago
xpe|1 year ago
I'm not seeing decision trees, though. Am I missing something?
> "KANs’ nodes simply sum incoming signals without applying any non-linearities." (page 2 of the PDF)
adityang5|1 year ago
cloudhan|1 year ago
apolar|1 year ago
Seminar 2021: https://warwick.ac.uk/fac/sci/maths/research/events/seminars...
Article on arXiv 2023: https://arxiv.org/abs/2305.08194
Video 2021: https://www.youtube.com/watch?v=eS_k6L638k0
Extension to stochastic models where KAN builds the distribution 2023: https://www.youtube.com/watch?v=0hhJIpzxPR0
apolar|1 year ago
yobbo|1 year ago
beagle3|1 year ago
It will easily recover this formula, because it is separable under the log transformation (which ACE recovers as well).
But ACE doesn’t work well on unseparable problems - not sure how well KAN will.
diwank|1 year ago
bart1ett|1 year ago
gautam5669|1 year ago
Given it's sparse, will this just be a replacement for MoE?
mp187|1 year ago
SpaceManNabs|1 year ago
nextaccountic|1 year ago
goggy_googy|1 year ago
phpkar|1 year ago
ALittleLight|1 year ago
6mian|1 year ago
snewman|1 year ago
DeathArrow|1 year ago
I would worry if I'd own Nvidia shares.
erwincoumans|1 year ago
mipt98|1 year ago
thesz|1 year ago
kevmo314|1 year ago
syassami|1 year ago
Maro|1 year ago
Would this approach (with non-linear learning) still be able to utilize GPUs to speed up training?
diwank|1 year ago
coderenegade|1 year ago
vindex10|1 year ago
renonce|1 year ago
nico|1 year ago
I wonder how many more new architectures are going to be found in the next few years
ComplexSystems|1 year ago
nu91|1 year ago
esafak|1 year ago
fermisea|1 year ago
brrrrrm|1 year ago
hansvm|1 year ago
arianvanp|1 year ago
keynesyoudigit|1 year ago
OisinMoran|1 year ago
yza|1 year ago
admin1231111|1 year ago
[deleted]
admin1231111|1 year ago
[deleted]
WithinReason|1 year ago