Alex Cloninger, University of California, San Diego, United States
This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Neural networks (NN) provide state-of-the-art performance in many problem domains. They can accommodate a vast number of parameters and still perform well when classic machine learning techniques provided with the same number of parameters would tend to overfit. To further the understanding of such incongruities, we develop a metric called the expected spanning dimension (ESD), which allows one to measure the intrinsic flexibility of an NN. We analyze NNs from the small, in which the ESD can be computed exactly, to large real-world networks with millions of parameters, in which we demonstrate how the ESD can be numerically approximated efficiently. The small NNs we study can be understood in detail, their ESD can be computed analytically, and they provide opportunities for understanding their performance from a theoretical perspective. On the other hand, applying the ESD to large-scale NNs sheds light on their relative generalization performance and provides suggestions as to how such NNs may be improved.
Neural networks (NN) have wide applications in machine learning [
NNs are quite popular in many applied problems. The understanding of their performance is often measured in relation to training data and specific learning algorithms [
In particular, our work differs from this body of work in a number of ways. First, the main computational effort required by our analysis is a natural by-product of the classic backpropagation procedure widely used in the optimization of NNs. We can leverage already existing and highly optimized routines for constructing the Jacobians that we require, and our analysis can be accomplished with a cost that is roughly the same as that required by a few hundred backpropagation steps. Second, our analysis applies to NNs ranging from the smallest NNs of which we can conceive to large-scale real-world networks used for practical tasks such as image understanding. Third, and perhaps most importantly, our proposed ESD metric is highly correlated with the generalization performance on both small NNs, in which all details can be theoretically justified, and large real-world networks. As we will demonstrate, considering ESD provides insights into the performance of NNs and provides suggestions on how this performance can be improved.
Consider the simple multilayer perceptron
This simple NN can be written as
One can then compare
Similar to the line of thought in [
To measure just how restricted the polynomial form of
We can compute this dimension of the range of
The foundational idea is that, while the simple polynomial NN,
Inspired by the simple NN in
From a theoretical point of view, Definition 1 can have interesting implications. For example, here is one theorem that naturally arises.
The proof can be found in the Supplementary Material.
One can imagine that many theorems of a similar flavor could easily be posed and proven.
In Definition 1, we made three assumptions about the NN
1) We assume that
2) We assume that the activation functions are smooth. Common activation functions like ReLU [
3) The NN has only one input and one output.
In this section, we will address and relax each of these assumptions in turn and generalize our definition of PSD to apply to a much larger class of NNs, including those found in real-world applications.
In Definition 1, we are representing a polynomial as the coefficients of the polynomial basis
When we computed the PSD of polynomial NNs, we considered the coefficients of the polynomial in terms of the polynomial basis
With a polynomial of the form presented in
In the case of polynomials with one input and output, we will now show that an equivalent way to compute the PSD is to determine the rank of the Jacobian of the function from
Consider the NN,
We will now show that these two definitions are equivalent for all polynomial NNs. Given a polynomial NN, we have the following relationship,
Why does computing the rank of
Harkening back to
computing
we have a natural extension of the definition of PSD to non-polynomial NNs, which we will simply call the spanning dimension (SD), since we do not need to actually pick a basis representation of our NN. Rather, this definition is based on the dimension of the range of the function from the NN evaluated at our selected input points to its outputs, and it is a necessary, but not sufficient, condition for being able to fit those selected input points to arbitrary output points exactly.
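As a concrete illustration of computing an SD this way, the following sketch builds the Jacobian of the parameters-to-outputs map for a tiny one-hidden-layer network by finite differences and takes its numerical rank. The network, its size, and all helper names here are our own illustrative choices, not the paper's exact construction.

```python
import numpy as np

# Tiny one-hidden-layer tanh network with scalar input and output.
def tiny_net(params, x):
    w1, b1, w2, b2 = params[:3], params[3:6], params[6:9], params[9]
    h = np.tanh(np.outer(x, w1) + b1)   # shape (n_points, 3)
    return h @ w2 + b2                  # shape (n_points,)

# Central-difference Jacobian of the map params -> outputs at fixed inputs x.
def num_jacobian(f, params, x, eps=1e-6):
    J = np.zeros((x.size, params.size))
    for j in range(params.size):
        d = np.zeros_like(params)
        d[j] = eps
        J[:, j] = (f(params + d, x) - f(params - d, x)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
params = rng.standard_normal(10)
x = np.linspace(-1.0, 1.0, 8)           # 8 selected input points
J = num_jacobian(tiny_net, params, x)
sd = np.linalg.matrix_rank(J)           # the SD at these inputs and weights
print(sd)                                # at most min(8, 10)
```

In practice, as discussed later in the article, the Jacobian would be obtained from backpropagation rather than finite differences; the finite-difference version above is only for self-containedness.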
Our second assumption above was that the NN under consideration had smooth activation functions, but that is not the case for real-world NNs, which use activation functions like ReLU [
First, such an activation gives rise to a “faceted manifold” on which the Jacobian of interest is piecewise constant. We observe that if the NN is evaluated exactly at a point at which an activation is non-smooth, then the Jacobian of interest is not uniquely defined there. However, for all standard activation functions, such as ReLU, the set of non-smooth points of the activation function has measure zero, so the Jacobian we require is uniquely defined almost everywhere.
Second, the ReLU can be thought of as the limit of a SoftPlus (i.e., a smoothed ReLU) [
Furthermore, with polynomial NNs, we selected integer inputs to create a Jacobian that consisted of integer entries. For non-polynomial NNs with many commonly used activation functions, this is not possible. For example, the sigmoid activation function contains an exponential, whose values are generally irrational. To estimate the rank of a matrix with floating-point entries, and thereby the SD, we can consider the singular values of that matrix. One classic method to compute the rank of a matrix from its singular values is simply to count the number of singular values that are greater than a certain tolerance based on machine epsilon. Of course, there are more sophisticated approaches, and one that we use here is the idea of the
Let the
2)
A unitary transformation on
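The effective rank used above can be sketched in a few lines. This is a minimal sketch assuming the common entropy-based definition (normalize the singular values to a probability vector and exponentiate its Shannon entropy); the function name is illustrative.

```python
import numpy as np

# Entropy-based effective rank of a matrix: exp of the Shannon entropy of
# the normalized singular value distribution. For a matrix with k equal
# nonzero singular values this recovers the ordinary rank k, but it varies
# continuously as the singular values spread out.
def effective_rank(A, tol=1e-12):
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol]                        # discard numerically zero values
    p = s / s.sum()                       # normalized singular values
    return np.exp(-np.sum(p * np.log(p)))

print(effective_rank(np.diag([3.0, 3.0, 3.0, 0.0])))  # 3.0, the exact rank
```

Unequal singular values give a non-integer value, e.g., a matrix with singular values 4 and 1 has effective rank strictly between 1 and 2.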
Using effective rank, the SD is no longer confined to the set of integers, and the SD can vary with the values of the inputs and weights even for smooth activation functions. To account for this, and to handle activation functions that are not smooth, like ReLU [
Consider the distribution of SDs for a given NN over all possible inputs within some compact space. We can estimate this distribution by choosing a number
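The sampling procedure just described can be sketched as follows. The tiny tanh network, the box [−1, 1], the number of draws, and all helper names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Monte Carlo estimate of the expected spanning dimension (ESD): repeatedly
# draw sets of input points from a compact box, compute the SD (here, the
# effective rank of a finite-difference parameters-to-outputs Jacobian) for
# each draw, and average over the draws.
def net(params, x):
    w1, b1, w2 = params[:4], params[4:8], params[8:12]
    return np.tanh(np.outer(x, w1) + b1) @ w2

def jacobian(params, x, eps=1e-6):
    J = np.zeros((x.size, params.size))
    for j in range(params.size):
        d = np.zeros_like(params)
        d[j] = eps
        J[:, j] = (net(params + d, x) - net(params - d, x)) / (2 * eps)
    return J

def effective_rank(J, tol=1e-12):
    s = np.linalg.svd(J, compute_uv=False)
    s = s[s > tol]
    p = s / s.sum()
    return np.exp(-np.sum(p * np.log(p)))

rng = np.random.default_rng(1)
params = rng.standard_normal(12)
sds = [effective_rank(jacobian(params, rng.uniform(-1, 1, size=6)))
       for _ in range(200)]              # 200 random draws of 6 input points
esd = float(np.mean(sds))                # estimated ESD
print(round(esd, 2))
```

The list `sds` is also exactly the estimated spanning dimension distribution whose histogram is shown in the figures.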
Estimated spanning distribution for an MLP with
Note that, in the case of non-smooth activation functions, the SD can change if the NN is evaluated at points that bracket a singularity in an activation function such as ReLU. However, as we show in
Here, we address our third assumption above, that
However, this approach does require that there is only one scalar output. To compute the ESD of an NN with multiple outputs, we recommend one of the following approaches.
Compute the ESD of each output and then average the results. This is the most computationally expensive approach for larger networks.
Compute the ESD of the first output. This approach is the least expensive computationally.
Compute the ESD of the sum of squares of the outputs. This approach closely resembles the type of squared loss function used in regression problems.
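The three reductions above can be made concrete for a toy two-output network. The toy network is an illustrative assumption; the names `pick_1st` and `sum_square` mirror the labels used in the result tables later in the article.

```python
import numpy as np

# Toy vector-valued net: 2 outputs per input point, weights in a 2x3 matrix.
def net(params, x):
    W = params.reshape(2, 3)
    feats = np.stack([x, x**2, np.ones_like(x)], axis=1)  # (n, 3)
    return feats @ W.T                                     # (n, 2)

# Approach 2: keep only the first output (computationally cheapest).
def pick_1st(y):
    return y[:, 0]

# Approach 3: sum of squares of the outputs, resembling a squared loss.
def sum_square(y):
    return (y ** 2).sum(axis=1)

# Approach 1 would compute an ESD per output column and average the results.

params = np.arange(6, dtype=float)
y = net(params, np.array([0.5, -1.0]))
print(pick_1st(y), sum_square(y))
```

Either scalarization turns the multi-output network into a scalar-output one, after which the Jacobian-based ESD computation proceeds exactly as in the single-output case.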
In
Previously, when considering exact rank, we did not need to determine the actual magnitude of the singular values. However, when considering effective rank, these singular values are important.
Looking at
Deriving such ideas in the Hilbert space setting would take us too far afield for the current context. However, there is one idea that is inspired by such considerations that we will leverage here. In particular, it is important to note that the expansion
The fact that there are many decompositions of
We are now interested in computing the eigenstructure of
Fortunately, we have a trick that we can leverage. In the decomposition
Since we do not have access to the decomposition
So, in our calculations, we
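One standard way to realize such a trick, assuming the decomposition in question is the singular value decomposition of the Jacobian, is the following: for a Jacobian J of shape (n, P) with P in the millions, the nonzero singular values of J are the square roots of the eigenvalues of the small n × n Gram matrix J Jᵀ, so the full decomposition of J is never needed. This is a hedged sketch with illustrative shapes, not the paper's exact computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, P = 5, 10_000
J = rng.standard_normal((n, P))          # stand-in for a very wide Jacobian

# Eigendecompose the small n x n Gram matrix instead of factoring J itself.
gram = J @ J.T
eigvals = np.linalg.eigvalsh(gram)       # ascending eigenvalues of J J^T
sv_from_gram = np.sqrt(np.sort(eigvals)[::-1])

# Verify against a direct SVD of J.
sv_direct = np.linalg.svd(J, compute_uv=False)
print(np.allclose(sv_from_gram, sv_direct))  # True
```

In a real network, J Jᵀ can be assembled from Jacobian-vector products supplied by backpropagation, without ever materializing J.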
We are now ready to present the full algorithm for computing the ESD of
Estimated spanning dimension distribution (effective rank) with 2,000 samples of a neural network with
In this section, we explore the relationship between ESD and learning capacity, in terms of training error, for multilayer perceptrons. ESD measures the expressiveness of an NN, that is, its ability to fit input data points. An experimental framework was created to investigate this relationship.
For these experiments, a set of eight random input and output points was generated
Training error vs.
The ESD for all MLPs with two hidden layers and nodes per hidden layer less than or equal to 40 with activation function ReLU, sigmoid, or tanh.
In this section, we describe the results of an exhaustive set of experiments that were run to investigate the ESD of 4,800 multilayer perceptrons. For each activation function (sigmoid, ReLU, and tanh), we compute the ESD of all multilayer perceptrons with two hidden layers where the first and second hidden layers have
These results show that, in general, ReLU has a much higher ESD than sigmoid and tanh, which may partly explain why ReLU tends to perform well in deep NNs. Sigmoid, in contrast, has a very low ESD and appears to plateau earlier. In particular, both sigmoid and tanh, unlike ReLU, do not appear to gain a significantly higher ESD when the first hidden layer has more nodes; for them, the number of nodes in the second hidden layer has a greater impact on ESD.
Consider the simple multilayer perceptron in
Perhaps we wish to have a more flexible NN that can represent a larger family of functions. Interestingly, the ESD can be increased without adding weights, merely by adding links similar to those found in ResNets [
If adding these shortcuts increases the ESD considerably, a natural extension of this idea is to replace the smaller shortcuts with larger ones that skip over more of the hidden neurons. In fact, inspired by DenseNets [
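To make the two modifications concrete, here is a minimal numpy sketch. The layer widths, the placement of the skips, and all function names are our illustrative assumptions rather than the exact architectures used in the experiments; in particular, we read "add-inputs" as re-injecting the raw input at every hidden layer, which adds links but no new weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Plain multilayer perceptron: a stack of dense ReLU layers.
def mlp(x, Ws, bs):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b)
    return h @ Ws[-1] + bs[-1]

# "Add-shortcuts": identity skip over each hidden layer (ResNet-style).
# Requires equal widths so the skip can be an addition.
def add_shortcuts(x, Ws, bs):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b) + h
    return h @ Ws[-1] + bs[-1]

# "Add-inputs": re-inject the raw input at every hidden layer
# (DenseNet-inspired); no new weights are introduced.
def add_inputs(x, Ws, bs):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b) + x
    return h @ Ws[-1] + bs[-1]

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((2, d))          # a batch of 2 inputs of width d
Ws = [rng.standard_normal((d, d)) for _ in range(3)]
bs = [rng.standard_normal(d) for _ in range(3)]
print(mlp(x, Ws, bs).shape)              # (2, 4) for all three variants
```

All three variants share the same parameter count; only the link structure, and hence the ESD, differs.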
It is interesting to note that, from the perspective of ESD, the modifications to standard NN architectures used by high-performance networks such as ResNets [
In this section, we modify larger multilayer perceptrons by adding shortcuts as shown in
To measure the fitting power of an NN, we test its ability to fit random noise given a certain number of inputs. One thousand images of size
Indeed, that is the case, and
Fitting a multilayer perceptron, an add-shortcuts network, and an add-inputs network to random noise images with ten arbitrary classes. The add-inputs model is able to fit the noise images while the other two models are not. Accordingly, this NN has more flexibility than the other networks and can overfit.
Fitting a multilayer perceptron, a residual network, and an add-inputs network to
These three NN architectures were then trained on fashion MNIST images to determine the effect of increased ESD on performance in a practical setting [
Fashion MNIST training and test accuracy, average of three runs, comparing a multilayer perceptron (MLP) to our proposed architectures.
         Training accuracy                      Test accuracy
Epochs   MLP     Add-shortcuts   Add-inputs     MLP     Add-shortcuts   Add-inputs
10       0.369   0.846           0.906          0.340   0.785           0.880
20       0.693   0.892           0.919          0.658   0.865           0.886
30       0.831   0.908           0.924          0.768   0.880           0.893
40       0.867   0.9153          0.9284         0.822   0.890           0.898
50       0.888   0.9228          0.9328         0.856   0.891           0.897
The testing accuracy for these architectures is higher than that of both the multilayer perceptron and residual models in every test case. Each experiment was run three times, and the average training and test accuracies were computed.
Adding shortcuts that skip over individual nodes in a multilayer perceptron was inspired by deep residual networks [
To see how well these NNs perform on a low number of training inputs, these two networks were trained on 3,000 images from the CIFAR-10 dataset for various numbers of epochs using stochastic gradient descent [
Fitting the original ResNet architecture and the add-inputs model to 3,000 images from the CIFAR-10 dataset. The add-inputs model learns faster and has a higher final test accuracy than the ResNet network.
CIFAR-10 training and test accuracy, average of three runs.
         Training accuracy          Test accuracy
Epoch    ResNet   Add-inputs        ResNet   Add-inputs
10       0.153    0.230             0.147    0.228
20       0.263    0.304             0.274    0.315
30       0.329    0.369             0.316    0.357
40       0.371    0.411             0.364    0.384
50       0.402    0.436             0.388    0.404
In previous sections, we have demonstrated the ESD on NNs of our own construction and have focused largely on training error as a measure of the flexibility of an NN. However, these ideas can be applied to arbitrary NNs. In this section, we demonstrate the application of the ESD to
VGG (versions 11, 13, and 19) [
ResNet (versions 18, 34, 50, 101, and 152) [
SqueezeNet (versions 1.0 and 1.1) [
DenseNet (versions 121, 161, 169, 201) [
ShuffleNet (version 2, 1.0) [
MobileNet (version 2) [
ResNext (versions 50 and 101) [
Each NN
To demonstrate the effectiveness of our methodology, we choose NNs with a number of parameters that ranges over several orders of magnitude. We choose networks that range in size from just over one million parameters to networks with over 100 million parameters. We will denote by
To handle the multiple outputs of each NN, we compute the ESD using two of the approaches outlined in
With the above notation in mind, our experiments proceed as follows.
We choose, at random, a set of
We generate these
In an optimization of
Plots of the best-1
Plots of the best-5 testing error of NNs against their ESD. As in
Plots of the best-1 testing error of NNs against their ESD. As in
Plots of the best-5 testing error of NNs against their ESD. As in
This provides one set of singular values. We then iterate this procedure 10 times to get 10 sets of 500 singular values for each NN. Note that the choice of 500 in the above algorithm is not essential to the performance of the algorithm, and many different choices were tried as part of our experiments; 500 was chosen in this case because, for all networks, all singular values past this point are small.
Fortunately, PyTorch makes the computation of
Our results for realworld NNs can be found in Figures
Plots of the best-1 and best-5 testing errors of NNs against the number of parameters. Again, the testing errors are as provided by [
Scalar network   Error type   Data type   Max
pick_1st         best-1                   4.84414e−05   5.45863e−05
pick_1st         best-5                   3.56313e−05   4.06148e−05
sum_square       best-1                   3.29885e−06   3.87184e−06
sum_square       best-5                   3.18170e−06   3.74138e−06
pick_1st         best-1       ImageNet    1.84975e−03   2.53296e−03
pick_1st         best-5       ImageNet    1.07219e−03   1.65698e−03
sum_square       best-1       ImageNet    2.64016e−05   8.80155e−05
sum_square       best-5       ImageNet    1.35040e−05   4.81841e−05
The “max
Comparison between the
Data type                        Error type
ImageNet removed bottom third    best-1       3.23e−02   9.71e−01
ImageNet removed bottom third    best-5       3.24e−02   9.81e−01
ImageNet removed middle third    best-1       1.26e−04   5.00e−01
ImageNet removed middle third    best-5       6.78e−05   4.47e−01
ImageNet removed top third       best-1       1.15e−04   5.54e−01
ImageNet removed top third       best-5       6.48e−05   4.79e−01
                                 best-1       2.11e−02   9.71e−01
                                 best-5       1.86e−02   9.81e−01
                                 best-1       7.99e−06   5.00e−01
                                 best-5       8.67e−06   4.47e−01
                                 best-1       1.79e−05   5.54e−01
                                 best-5       1.77e−05   4.79e−01
We divide the NNs into three groups of six, based on their testing accuracies. We then do the
A concise summary of the results for analyzing NNs with ESDs can be found in Tables
In this article, we have defined the idea of an ESD for NNs and have demonstrated how maximizing this ESD improves the performance of real-world NNs. In many ways, a small ESD can be thought of as a regularization of the NN, and increasing the ESD is an example of the bias–variance trade-off in machine learning. ESD can be controlled by judiciously including additional links in the NN and does not require changing the number of parameters in the NN. Perhaps most importantly, an ESD analysis provides a mathematical foundation for the superior performance of ResNet-type NNs. There are many possible directions for augmenting the results described here. For example, Theorem 2.1 only scratches the surface of the theoretical progress that can be made using ESD-type ideas for the understanding of NNs, and the bounding of the ESD is certainly a worthwhile direction for future work. However, such results appear to be nontrivial, except in the smallest of cases. In particular, the example can be analyzed in terms of algebraic geometry using Gröbner bases [
As a final point, the interpretation of the ESD in the context of the recent work on “double descent” is a delicate and quite interesting direction for future work. In particular, several of our results demonstrate that the testing accuracy of NNs improves as the ESD increases. Two explanations present themselves. First, large-scale NNs are actually somewhat underparameterized or, more precisely, have access to far fewer degrees of freedom than their number of unknowns would indicate. Their performance is therefore improved by the additional flexibility that a larger ESD provides. Second, large-scale NNs are already in the “double descent” regime, and increasing the ESD takes advantage of this fact. Resolving this question would likely require a delicate computational study involving the development of a technique for slowly increasing the ESD in a large-scale NN.
Publicly available datasets were analyzed in this study. These data can be found here:
The core of the writing of this article was performed by DB and RP. The numerical results included in the article were generated by DB and RP, with additional numerical results that inspired some of the approaches presented in the article being generated by LG. The key ideas in the article were generated through discussions between DB and RP, and between RP and LG.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This research was performed using computational resources supported by the Academic and Research Computing group at Worcester Polytechnic Institute. The authors would also like to acknowledge the Academic and Research Computing group at Worcester Polytechnic Institute for providing consulting support that contributed to the results reported within this article.
All codes for experiments are available upon request.
In the notation of [