Regularization caused by the optimization algorithm. "we investigate the transformations under which the function computed by a network remains the same and therefore argue for complexity measures and optimization algorithms that have similar invariances. We find complexity measures that have similar invariances to neural networks and optimization algorithms that implicitly regularize them". Is the path-norm a metric on parameter space that is closer to a distance between models (functions) than the raw distance between weights? Sloppy systems?
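A hedged sketch of the ℓ2 path-norm from the path-regularization / path-SGD line of work (notation mine): summing over all directed paths from an input node to an output node,

$$\phi_2(w) \;=\; \Bigg(\sum_{p:\, v_{\mathrm{in}} \to v_{\mathrm{out}}} \;\prod_{e \in p} w_e^{2}\Bigg)^{1/2}.$$

Since every path through a hidden unit uses exactly one of its incoming and one of its outgoing edges, rescaling those by 1/c and c leaves each per-path product, and hence the path-norm, unchanged; in that sense it measures the model rather than the raw weights.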
[28] proved that the Rademacher complexity of fully connected feedforward networks on a set S can be bounded in terms of the ℓ1 norm of the incoming weights of the hidden units in each layer. In Chapter 5 we show how the capacity can be controlled for a large family of norms. See Learning real-valued functions.
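Schematically, such a bound has the following shape (a hedged sketch; constants and the exact depth factor differ in [28]): for depth-d networks whose hidden units have incoming weights with ℓ1 norm at most B_k in layer k,

$$\mathcal{R}_S(\mathcal{F}_d) \;\lesssim\; \frac{2^{d}\,\Big(\prod_{k=1}^{d} B_k\Big)\, \max_i \|x_i\|_\infty\, \sqrt{\log D}}{\sqrt{m}},$$

where m = |S| and D is the input dimension. The point is that capacity is controlled by a product of per-layer norm bounds and shrinks as $1/\sqrt{m}$, independently of the number of hidden units.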
Algorithmic robustness, similar to Algorithmic stability
Sharpness. We advocate viewing a related notion of expected sharpness in the context of the PAC-Bayesian framework. Viewed this way, it becomes clear that sharpness controls only one of two relevant terms and must be balanced with some other measure such as a norm. Together, sharpness and norm do provide capacity control and can explain many of the observed phenomena. This connection between sharpness and the PAC-Bayes framework was also recently noted by Dziugaite and Roy [32].
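A hedged sketch of that decomposition (constants elided; u is a random perturbation of the weights w and P is the PAC-Bayes prior):

$$\mathbb{E}_{u}\big[L(f_{w+u})\big] \;\le\; \hat{L}(f_w) \;+\; \underbrace{\Big(\mathbb{E}_{u}\big[\hat{L}(f_{w+u})\big] - \hat{L}(f_w)\Big)}_{\text{expected sharpness}} \;+\; \underbrace{O\!\left(\sqrt{\tfrac{\mathrm{KL}\big(w+u\,\|\,P\big) + \ln\tfrac{m}{\delta}}{m}}\right)}_{\text{KL term}}.$$

Sharpness alone bounds only the middle term; the KL term, which for Gaussian P and u reduces to a scaled ℓ2 norm of the weights, has to be controlled as well.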
On the Role of Implicit Regularization in Generalization – we draw an analogy to matrix factorization and understand dimensionality versus norm control there. Based on this analogy we suggest that implicit norm regularization might be central to deep learning as well, and that here too we should think of bounded-norm models whose capacity is independent of the number of hidden units. See also Deep learning theory
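One way to write the analogy (a sketch in my own notation, not the paper's statements): in matrix factorization $X = UV^\top$, capacity can be limited either by dimensionality or by norm,

$$\mathrm{rank}(X) \le r \qquad \text{vs.} \qquad \|X\|_{*} \;=\; \min_{X = UV^\top} \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big) \;\le\; B,$$

and only the trace-norm constraint is independent of the inner dimension of the factorization. The suggestion is that the number of hidden units plays the role of the inner dimension, while some weight norm plays the role of the trace norm.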
Focusing on networks with ReLU activations in this section, we observe that scaling down the incoming edges to a hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing the same function. Since predictions are invariant to such rescalings, it is natural to seek a geometry, and corresponding optimization method, that is similarly invariant. In this chapter, we study invariances in feedforward networks with shared weights. – Sloppy systems!
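A minimal numpy sketch of this invariance (hypothetical two-layer net; the names and the scaling factor c are mine): scaling a hidden unit's incoming weights and bias by 1/c and its outgoing weights by c leaves the computed function unchanged, because ReLU is positively homogeneous, i.e. ReLU(z/c) = ReLU(z)/c for c > 0.

import numpy as np

def relu_net(x, W1, b1, W2):
    # Two-layer ReLU network: x -> W2 @ relu(W1 @ x + b1)
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # hidden layer: 5 units, 3 inputs
b1 = rng.normal(size=5)
W2 = rng.normal(size=(2, 5))   # output layer: 2 outputs
x = rng.normal(size=3)

# Rescale hidden unit j: incoming edges (row j of W1 and bias j) down by c,
# outgoing edges (column j of W2) up by c.
c, j = 7.3, 2
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[j] /= c
b1s[j] /= c
W2s[:, j] *= c

# ReLU((W1 @ x + b1) / c) = ReLU(W1 @ x + b1) / c for c > 0, so the two
# rescalings cancel and the rescaled network computes the same function.
assert np.allclose(relu_net(x, W1, b1, W2), relu_net(x, W1s, b1s, W2s))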
Revisiting the choice of gradient descent, we recall that optimization is also inherently tied to a choice of geometry, i.e. a measure of distance, norm, or divergence. Gradient descent, for example, is tied to the ℓ2 norm, as it is steepest descent with respect to the ℓ2 norm in parameter space; coordinate descent corresponds to steepest descent with respect to the ℓ1 norm; and exponentiated-gradient (multiplicative weight) updates are tied to an entropic divergence.
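A hedged sketch of that correspondence (standard steepest-descent / mirror-descent facts, written in my own notation): each method minimizes a linearized loss plus a proximity term,

$$w_{t+1} \;=\; \arg\min_{w}\;\Big[\,\langle \nabla L(w_t),\, w \rangle \;+\; \tfrac{1}{\eta}\, D(w, w_t)\,\Big].$$

With $D(w, w') = \tfrac{1}{2}\|w - w'\|_2^2$ this is exactly the gradient step $w_{t+1} = w_t - \eta \nabla L(w_t)$; with $D(w, w') = \tfrac{1}{2}\|w - w'\|_1^2$ the minimizer moves only the coordinate with the largest gradient magnitude, a coordinate-descent step; and with the unnormalized relative entropy $D(w, w') = \sum_i \big(w_i \ln\tfrac{w_i}{w'_i} - w_i + w'_i\big)$ it gives the multiplicative update $w_{t+1,i} = w_{t,i}\, e^{-\eta \nabla_i L(w_t)}$.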
Sensitivity-based bounds
Three Factors Influencing Minima in SGD – characterizing the relation between learning rate, batch size, and the properties of the final minima, such as their width or generalization (see the noise-scale sketch after this list of papers)
Subgradient Descent Learns Orthogonal Dictionaries
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization
The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects
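For the "Three Factors" item above, a hedged sketch of the usual SDE heuristic behind that relation (notation mine): modelling the minibatch gradient noise as Gaussian with per-example covariance C(w), SGD with learning rate η and batch size B behaves like

$$dw_t \;=\; -\nabla L(w_t)\,dt \;+\; \sqrt{\tfrac{\eta}{B}}\; C(w_t)^{1/2}\, dW_t,$$

so the effective noise scale, and with it the preference for wider minima, is governed by the ratio η/B rather than by the learning rate or batch size separately.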