They are one of the reasons neural networks are a black box: we lose information about the data manifold the deeper we go into the network, making it impossible to trace the output back.
This preprint is not coming from a standpoint of optimizing inference/compute, but from trying to create models that we can interpret and control in the future.
Less information loss -> fewer params? Please correct me if I got this wrong. The Intro claims:
"The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f(x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other negative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."
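The ReLU collapse the quoted passage describes is easy to see directly. A minimal sketch (the input values are made up for illustration):

```python
def relu(x: float) -> float:
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

# Two very different pre-activations: a weakly negative one (input nearly
# orthogonal to the weight vector) and a strongly negative one (input
# strongly anti-aligned with it).
weak_dissimilarity = -0.01
strong_anti_alignment = -5.0

# ReLU maps both to exactly the same output, discarding the degree of
# anti-alignment for all downstream layers.
print(relu(weak_dissimilarity))      # 0.0
print(relu(strong_anti_alignment))   # 0.0
print(relu(2.0))                     # 2.0 (positive values pass through)
```

So from the next layer's perspective, a near-orthogonal input and a strongly anti-aligned input are indistinguishable, which is the information loss being argued.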
Yes, since you can learn to represent the same problem with fewer parameters. However, most architectures are optimized for the linear (dot) product, so we have to figure out a new architecture for it.
russfink|7 months ago
mlnomadpy|6 months ago
julius|7 months ago
mlnomadpy|6 months ago