(no title)
grph123dot | 2 years ago
It seems that the initial weight matrix W has a low-rank approximation A, which implies that the difference E = W - A is small. It also seems that PCA fails when E is sparse, because PCA is designed to be optimal when the error is Gaussian.
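To make the W, A, E notation concrete, here is a minimal sketch using a truncated SVD (the computation behind PCA, minus the centering step). The matrix sizes, the rank, the noise scale, and the helper name `low_rank_approx` are all assumptions chosen for illustration, not anything taken from an actual model.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Best rank-`rank` approximation of W in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top `rank` singular triplets.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # hypothetical sizes and rank

# Synthetic "weights": an exactly rank-r matrix plus small dense Gaussian noise.
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W = L + 0.01 * rng.standard_normal((m, n))

A = low_rank_approx(W, r)  # low-rank approximation
E = W - A                  # residual
rel_err = np.linalg.norm(E) / np.linalg.norm(W)
print(f"relative Frobenius error: {rel_err:.4f}")
```

With small dense Gaussian noise the residual stays small. If the noise were instead a few large, sparse entries, the truncated SVD would spread them across the leading components, which is the failure case described above.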
stu2b50 | 2 years ago
Since the weights are derived from gradient descent, yeah, we don't really know what the distributions would be.
A random projection empirically works quite well for very high dimensions, and is of course very cheap computationally.
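As a rough illustration of why a random projection is cheap and still useful, here is a minimal sketch of a Gaussian random projection. The dimensions and the helper name `random_projection` are assumptions for the example; the 1/sqrt(k) scaling is the usual choice that keeps squared distances unbiased in expectation.

```python
import numpy as np

def random_projection(X, k, rng):
    """Project the rows of X from d dimensions down to k with a Gaussian matrix."""
    d = X.shape[1]
    # Entries are N(0, 1/k), so squared norms are preserved in expectation.
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2000))   # 20 hypothetical points in 2000 dimensions
Y = random_projection(X, 500, rng)    # same points in 500 dimensions

# Compare one pairwise distance before and after projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
ratio = d_proj / d_orig
print(f"distance ratio after projection: {ratio:.3f}")
```

The only cost is one matrix multiply, and nothing is fitted to the data, so no assumption about the distribution of the weights is needed.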