mlnomadpy | 7 months ago
I created a new kernel that lets a model learn non-linearity without activation functions, which makes the model white-box and avoids information loss.
MiniGPT with huggingface datasets streaming: https://www.kaggle.com/code/skywolfmo/yat-nnx-minigpt-finewe...
rytill|7 months ago
To my knowledge, activation functions are a negligible portion of the total compute during training or inference, and they work well to provide non-linearity.
Very open to learning more.
russfink|7 months ago
mlnomadpy|6 months ago
This preprint isn't about optimizing inference or compute. It's about creating models that we can interpret and control in the future.
julius|7 months ago
"The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f(x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other negative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."
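To make the two claims in that quote concrete, here is a tiny sketch in plain Python. The weight vector `w` and the four inputs are made-up illustration values, not anything from the preprint, and this only demonstrates the dot product and ReLU behavior described in the quote, not the author's kernel.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the direction-only part of the dot product."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def relu(z):
    """ReLU as given in the quote: f(x) = max(0, x)."""
    return max(0.0, z)

# Hypothetical weight vector and inputs, chosen only for illustration.
w = [1.0, 0.0]

# 1) Magnitude is conflated with direction: a short, perfectly aligned input
#    and a long, poorly aligned one give the same dot product.
x_aligned = [2.0, 0.0]
x_long = [2.0, 10.0]
print(dot(w, x_aligned), cosine(w, x_aligned))
print(dot(w, x_long), cosine(w, x_long))

# 2) ReLU collapses every negative pre-activation to the same output, so
#    weak dissimilarity and strong anti-alignment become indistinguishable.
x_weak = [-0.1, 1.0]
x_anti = [-5.0, 0.0]
print(relu(dot(w, x_weak)), relu(dot(w, x_anti)))
```

The first pair has identical pre-activations despite very different cosine similarity, and the second pair has very different pre-activations that both come out of ReLU as zero.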