Please use this identifier to cite or link to this item:
https://hdl.handle.net/2440/136360
Type: Conference paper
Title: A Chaos Theory Approach to Understand Neural Network Optimization
Author: Sasdelli, M.; Ajanthan, T.; Chin, T.J.; Carneiro, G.
Citation: Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA 2021), 2021, pp.1-10
Publisher: IEEE
Publisher Place: Online
Issue Date: 2021
ISBN: 9781665417099
Conference Name: Digital Image Computing: Techniques and Applications (DICTA) (29 Nov 2021 - 1 Dec 2021 : Gold Coast, Australia)
Statement of Responsibility: Michele Sasdelli, Thalaiyasingam Ajanthan, Tat-Jun Chin, and Gustavo Carneiro
Abstract: Despite the complicated structure of modern deep neural network architectures, they are still optimized with algorithms based on Stochastic Gradient Descent (SGD). However, the reason behind the effectiveness of SGD is not well understood, making its study an active research area. In this paper, we formulate deep neural network optimization as a dynamical system and show that the rigorous theory developed to study chaotic systems can be useful for understanding SGD and its variants. In particular, we first observe that the inverse of the instability timescale of SGD optimization, represented by the largest Lyapunov exponent, corresponds to the most negative eigenvalue of the Hessian of the loss. This observation enables the introduction of an efficient method to estimate the largest eigenvalue of the Hessian. Then, we empirically show that, for a large range of learning rates, SGD traverses the loss landscape across regions where the largest eigenvalue of the Hessian is similar to the inverse of the learning rate. This explains why effective learning rates can be found within a large range of values, and shows that SGD implicitly uses the largest eigenvalue of the Hessian while traversing the loss landscape. This sheds some light on the effectiveness of SGD over more sophisticated second-order methods. We also propose a quasi-Newton method that dynamically estimates an optimal learning rate for the optimization of deep learning models. We demonstrate that our observations and methods are robust across different architectures and loss functions on the CIFAR-10 dataset.
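The abstract's central relationship — that the learning rate a loss landscape tolerates is tied to the largest eigenvalue of its Hessian — can be illustrated independently of the paper's Lyapunov-exponent method. The sketch below (not the authors' algorithm; all names and the toy quadratic loss are this example's own assumptions) estimates the largest Hessian eigenvalue via power iteration on Hessian-vector products, then shows that gradient descent is stable just below the classical threshold 2/lambda_max and diverges just above it:

```python
import numpy as np

# Illustrative sketch only (not the paper's method): use a toy quadratic loss
# L(w) = 0.5 * w @ A @ w, whose Hessian is exactly A, so the eigenvalue
# estimate can be checked against a known value (lambda_max = 10).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([10.0, 4.0, 1.0, 0.5, 0.1]) @ Q.T  # SPD Hessian

def grad(w):
    return A @ w  # gradient of the quadratic loss

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via a finite difference of gradients — the
    # standard trick when the Hessian is too large to form explicitly.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def largest_eig(w, iters=100):
    # Power iteration on Hessian-vector products.
    v = rng.standard_normal(w.shape[0])
    for _ in range(iters):
        hv = hvp(w, v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(w, v))  # Rayleigh quotient

w = rng.standard_normal(5)
lam_max = largest_eig(w)

def final_norm(lr, steps=200):
    # Plain gradient descent on the quadratic; diverges iff lr > 2/lambda_max.
    x = rng.standard_normal(5)
    for _ in range(steps):
        x = x - lr * grad(x)
    return float(np.linalg.norm(x))

print(f"lambda_max estimate:     {lam_max:.3f}")
print(f"||w|| at lr just below 2/lambda_max: {final_norm(1.9 / lam_max):.2e}")
print(f"||w|| at lr just above 2/lambda_max: {final_norm(2.1 / lam_max):.2e}")
```

On this toy problem the estimate matches the known top eigenvalue, and the iterate norm shrinks for the sub-threshold learning rate while blowing up for the super-threshold one — the same 1/lambda_max scale the abstract reports SGD implicitly tracking on real loss landscapes.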
Rights: ©2021 IEEE
DOI: 10.1109/DICTA52665.2021.9647143
Grant ID: http://purl.org/au-research/grants/arc/DP180103232; http://purl.org/au-research/grants/arc/FT190100525
Published version: https://ieeexplore.ieee.org/xpl/conhome/9647036/proceeding
Appears in Collections: Computer Science publications
Files in This Item:
There are no files associated with this item.
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.