nurettin|7 months ago
One interesting thing to notice is how you can remodel XOR into a linear function by using u + v as input 1 and u * v as input 2, which means it can be represented in a NN without a hidden layer. And not only XOR: it keeps all the other logic gates simple too. So just by transforming the inputs one can reduce network complexity. Perhaps a field ripe for research.
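A quick check of that claim in plain Python (the particular weights a - 2b are one choice that works over {0, 1}, not something from the comment):

    # xor(u, v) = (u + v) - 2*(u * v) for u, v in {0, 1},
    # i.e. a linear function of the features a = u + v, b = u * v
    for u in (0, 1):
        for v in (0, 1):
            a, b = u + v, u * v      # transformed inputs
            linear_xor = a - 2 * b   # one linear layer, no hidden units
            assert linear_xor == (u ^ v)
            print(u, v, "->", linear_xor)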
mlnomadpy|6 months ago
Indeed, there is extensive, fascinating work on kernel learning. One application that still uses these transformations is satellite/multispectral imagery: you can get more information just by calculating the NDVI from the different bands of your image, which makes it easier for your models to make decisions.
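For reference, NDVI is just a fixed per-pixel band transform; this is the standard formula, not code from this project:

    import numpy as np

    def ndvi(nir, red, eps=1e-8):
        # Normalized Difference Vegetation Index, in [-1, 1];
        # eps guards against division by zero on dark pixels
        return (nir - red) / (nir + red + eps)

    # toy 2x2 tile: near-infrared and red bands
    nir = np.array([[0.6, 0.7], [0.1, 0.5]])
    red = np.array([[0.2, 0.1], [0.3, 0.4]])
    print(ndvi(nir, red))  # high values suggest vegetation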
nikolayasdf123|7 months ago
I misread this as claiming "there is no non-linearity". There is still non-linearity; it is just renamed and reshuffled into new operators. Basically renaming apples into oranges.
imtringued|7 months ago
Well, it's more like fruits and vegetables. The author proposed a normalized inner product as a replacement for the standard inner product.
It's not an activation function, because it combines the learnable weights of a linear projection (matrix-vector multiplication) and the clamping properties of an activation function in one operation.
My personal issue with the proposal is that it essentially doubles the amount of memory needed on-chip.
A yat-product GEMV now needs to store both the running total of the inner product and the norm of the input vector. That's a big cost increase for something that might not improve performance all that much.
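A rough sketch of where the extra state comes from, assuming the yat-product has the squared-dot-over-squared-distance form suggested by the descriptions downthread (plain Python, not the actual kernel):

    import numpy as np

    def yat_gemv(W, x, eps=1e-6):
        w_sq = (W * W).sum(axis=1)  # per-row ||w||^2, precomputable
        y = np.empty(W.shape[0])
        for i in range(W.shape[0]):
            dot = 0.0   # accumulator 1: running inner product
            x_sq = 0.0  # accumulator 2: running ||x||^2
            for wj, xj in zip(W[i], x):
                dot += wj * xj
                x_sq += xj * xj
            sqdist = w_sq[i] + x_sq - 2.0 * dot  # ||w_i - x||^2
            y[i] = dot * dot / (sqdist + eps)
        return y

A plain GEMV carries one running accumulator per output (the dot product); here each tile also has to track the norm term, which is the doubling of on-chip state being described.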
mlnomadpy|6 months ago
Basically, the real "non-linearity" in deep learning has always been orthogonality. Squashing functions make it easy for neurons to tap into that orthogonality, while most activation functions "lie" about it by setting the dot-product score to 0, and a dot product of 0 between two vectors means they are orthogonal (linearly independent).
What I did was rely on both the angular information and the spatial information between the input x and the weight w to measure how "similar" they are.
The lower bound of the yat-product is 0, and it is achieved only when the two vectors are orthogonal and far apart.
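A minimal sketch of those properties, assuming the yat-product is the squared inner product divided by the squared distance (my reading of the descriptions here, not the author's code):

    import numpy as np

    def yat(x, w, eps=1e-6):
        # angular term (squared dot) over spatial term (squared distance);
        # zero exactly when <x, w> = 0
        return np.dot(x, w) ** 2 / (np.sum((x - w) ** 2) + eps)

    x = np.array([1.0, 0.0])
    print(yat(x, np.array([0.0, 2.0])))   # orthogonal and apart -> 0.0
    print(yat(x, np.array([2.0, 0.0])))   # aligned              -> ~4.0
    print(yat(x, np.array([1.0, 1e-3])))  # nearly identical     -> very large

Unlike a ReLU, which outputs 0 for any non-positive pre-activation, this score hits 0 only at genuine orthogonality, which seems to be the "no lying" point above.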
unknown|7 months ago
[deleted]
mlnomadpy|7 months ago
I was able to create a new kernel that lets you learn non-linearity without using activation functions, making the models white-box and avoiding information loss.
MiniGPT with huggingface datasets streaming: https://www.kaggle.com/code/skywolfmo/yat-nnx-minigpt-finewe...
rytill|7 months ago
To my knowledge, activation functions are a negligible portion of the total compute during training or inference, and they work well to provide non-linearity.
Very open to learning more.