Summary: In this thesis we develop a spectral approach to large kernel matrices, graphs, and the Hessians of neural networks. This approach paves the way for efficient and improved inference schemes, novel algorithms, and theoretical results. The first contribution of the thesis is a novel Maximum Entropy algorithm applied to the problems of log-determinant estimation (required for Bayesian model selection), cluster counting, and graph similarity. The method produces strong empirical results compared with established methods such as the Lanczos algorithm. The second contribution is a field-dependent, spiked random matrix theory characterising the effect of sub-sampling, due to mini-batching, on the loss surface of deep neural networks. The theory we develop predicts that, up to a threshold, the learning rate of non-adaptive gradient methods should scale linearly with batch size, while that of adaptive methods should scale with the square root of the batch size. We verify these predictions empirically. As final contributions, we develop an open-source software package for neural network curvature computations, show that spectral sharpness is not relevant to solution generalisation, and discuss extensions to previous theoretical work. Building on these theoretical contributions, we develop an adaptive algorithm (Gadam) which achieves strong test-set performance across a variety of image and text classification problems, regularly beating finely tuned non-adaptive methods.