(no title)
anamax | 1 month ago
Most deep learning systems are learned matrices that are multiplied by "problem-instance" data matrices to produce a prediction matrix. The time to do said matrix-multiplication is data-independent (assuming that the time to do multiply-adds is data-independent).
If you multiply both sides by the inverse of the learned matrix, you get an equation in which finding the prediction matrix becomes a solving problem (a linear system to be solved), and the time to solve it is data-dependent.
Interestingly enough, that time is sort-of proportional to the difficulty of the problem for said data.
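A minimal NumPy sketch of the contrast (the sizes, tolerances, and the gradient-descent solver here are my own illustrative choices, not the actual method behind these observations): the forward pass costs the same for every input, while solving the inverted system takes a data-dependent number of iterations.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    W = rng.standard_normal((d, d))   # stand-in for a learned weight matrix
    x = rng.standard_normal(d)        # one "problem-instance" input

    # Feed-forward view: the prediction is a matrix-vector product,
    # so the cost is the same for every input.
    y_forward = W @ x

    # "Solving" view: with A = inv(W), the prediction y satisfies A @ y = x,
    # so y can be recovered by an iterative solver whose iteration count
    # (and hence runtime) depends on the particular input.
    A = np.linalg.inv(W)

    def solve_iteratively(A, x, tol=1e-6, max_iters=100_000):
        """Solve A @ y = x by gradient descent on ||A @ y - x||^2 / 2.
        The iteration count is a crude, data-dependent 'difficulty' proxy."""
        lr = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size from the spectral norm
        y = np.zeros_like(x)
        for i in range(max_iters):
            r = A @ y - x
            if np.linalg.norm(r) < tol:
                return y, i
            y -= lr * (A.T @ r)
        return y, max_iters

    y_solved, iters = solve_iteratively(A, x)
    print("max |forward - solved| :", np.max(np.abs(y_forward - y_solved)))
    print("iterations needed      :", iters)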
Perhaps more interesting is that the inverse matrix seems to have row artifacts that look like things in the training data.
These observations are due to Tsvi Achler.
srean | 1 month ago
There are layers upon layers of nonlinearity, be it softmax or sigmoid, so the network is not just a learned matrix. In the tangent kernel view, though, it does linearize.
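For what it's worth, a tiny NumPy sketch of what "linearizes in the tangent kernel view" means (the one-hidden-layer sigmoid net and parameter names are just an illustration, not how NTK analysis is actually carried out): expand the network to first order in its parameters around an initialization; that expansion is exactly linear in the parameters even though the network itself is nonlinear.

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def net(W1, W2, x):
        # one sigmoid hidden layer: not just a single learned matrix
        return W2 @ sigmoid(W1 @ x)

    def net_linearized(W1, W2, x, W1_0, W2_0):
        # first-order (tangent) expansion of the network in its *parameters*
        # around (W1_0, W2_0); this approximation is linear in the parameters
        h0 = sigmoid(W1_0 @ x)
        dW1, dW2 = W1 - W1_0, W2 - W2_0
        return W2_0 @ h0 + dW2 @ h0 + W2_0 @ (h0 * (1 - h0) * (dW1 @ x))

    W1_0 = rng.standard_normal((16, 4))
    W2_0 = rng.standard_normal((3, 16))
    x = rng.standard_normal(4)

    # For a small parameter perturbation the exact and linearized outputs
    # nearly agree, which is the sense in which the tangent-kernel view
    # treats the network as linear.
    W1 = W1_0 + 1e-3 * rng.standard_normal(W1_0.shape)
    W2 = W2_0 + 1e-3 * rng.standard_normal(W2_0.shape)
    print(net(W1, W2, x) - net_linearized(W1, W2, x, W1_0, W2_0))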