
On Neural Network Accuracy and Validation

By Jonathan Barzilai

Following up on my previous article [1], I note that the procedures commonly used to train deep learning’s neural networks can have negative effects that should concern people beyond the field’s researchers and practitioners.

For example, consider a neural network that detects cancer in a medical image, and suppose that one trains this network on a large set of images via the mini-batch training method. In this case, the resulting weights depend on the order in which the images are presented to the network during training.
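
To make this order dependence concrete, here is a minimal sketch, assuming a toy linear least-squares model and plain mini-batch gradient descent rather than an actual image classifier; the data, learning rate, and batch size are illustrative only.

# Minimal sketch (illustrative, not from the article): the same data, model,
# learning rate, and starting point, but two different mini-batch orderings,
# produce different final weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 hypothetical training examples
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def train(batch_starts, lr=0.1, batch=10):
    w = np.zeros(5)                                # same starting point for both runs
    for i in batch_starts:                         # one pass over the mini-batches
        Xb, yb = X[i:i + batch], y[i:i + batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch    # least-squares gradient on this mini-batch
        w -= lr * grad
    return w

starts = np.arange(0, 100, 10)
w1 = train(starts)                                 # mini-batches in one order
w2 = train(starts[::-1])                           # the same mini-batches, reversed
print(np.max(np.abs(w1 - w2)))                     # nonzero: the order changes the result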

Regardless of whether the input is split into mini-batches, a network’s weights are a solution of a large-scale minimization problem for a cost function \(c(x)\) of the unknown weights. In this setting, the Hessian of the cost function is almost certainly singular everywhere (see the discussion of overfitting in [3]); at any rate, it is impossible to determine in practice whether the Hessian is singular. Singularity of the Hessian at a minimum indicates directions of zero curvature, and in this situation there are infinitely many \(x\)-vectors for which \(c(x)\) attains its minimal value. Which of these \(x\)-vectors will be the output (i.e., the network’s weights) when an algorithm terminates? The answer depends on the algorithm’s physical environment, starting point, and stopping criterion. It follows that, in the presence of singularity, the properties of a given solution do not characterize the algorithm’s effectiveness.
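
A two-variable example (mine, not the article’s) makes the role of singularity concrete. The cost function

\[c(x) = (x_1 + x_2 - 1)^2, \qquad \nabla^2 c(x) = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix},\]

has a Hessian that is singular everywhere, and every point on the line \(x_1 + x_2 = 1\) is a minimizer with \(c(x) = 0\). Gradient descent started from different points, or stopped by different criteria, terminates at different points on this line, all of which minimize the cost equally well.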

Furthermore, one cannot conclude that an algorithm is (or is not) effective by testing its output’s fit (termed “accuracy”) on a so-called validation set. The fit of a minimizer of \(c(x)\) may vary widely across different validation sets, and different minimizers of \(c(x)\) may fit a given validation set to different degrees. To emphasize, “accuracy” is a property of a specific minimizer of the cost function relative to a specific validation set; it is not a property of an algorithm. A validation set does not validate an algorithm; it only “validates” the accuracy of a single specific minimizer of the cost function relative to that particular set. If the input is partitioned so that the larger part serves as the training set and the smaller part serves as the validation set, different partitions result in different “accuracies.” One must bear this in mind when assessing deep learning accuracy claims.
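
The partition dependence is easy to reproduce. The following minimal sketch (illustrative synthetic data and a linear classifier fit by least squares, not a deep network) applies the same training procedure to several random partitions of the same input and reports a different “accuracy” for each.

# Minimal sketch (illustrative, not from the article): the same procedure on
# different train/validation partitions of the same input gives different numbers.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                        # 200 hypothetical labeled examples
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.5 * rng.normal(size=200) > 0

def accuracy_for_partition(seed):
    idx = np.random.default_rng(seed).permutation(200)
    train, val = idx[:160], idx[160:]                # larger part trains, smaller part validates
    w, *_ = np.linalg.lstsq(X[train], y[train].astype(float), rcond=None)
    return np.mean((X[val] @ w > 0.5) == y[val])     # fit of this minimizer on this validation set

print([round(accuracy_for_partition(s), 2) for s in range(5)])  # one "accuracy" per partition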

Not surprisingly, optimization theory’s indicator of an algorithm’s effectiveness is its rate of convergence, which depends only on the problem’s mathematical structure and is independent of the problem’s size [2]. For instance, the rate of convergence of Newton’s method is independent of problem size. The convergence rate is also independent of the application area in which the problem arises, be it engineering, business, image recognition, or speech recognition; this is true of all minimization algorithms. In practice, an easy-to-observe and valid indicator of an iterative algorithm’s performance is the number of iterations required to reduce the error by a fixed factor. One can compare the performance of two algorithms in this manner provided that they expend roughly the same computational effort per iteration (e.g., one computation of the cost function’s gradient). For example, if algorithm \(\textrm{A}\) applies one step of “Adam” per mini-batch while algorithm \(\textrm{B}\) applies two “Adam” steps per mini-batch, the comparison of error-reduction rates must be adjusted for the difference in computational effort.
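
In symbols (my notation, offered as a sketch rather than the article’s): if \(e_k\) denotes the error after \(k\) iterations and an algorithm satisfies \(e_{k+1} \le r\, e_k\) with a factor \(0 < r < 1\), then reducing the error by a fixed factor \(F\) requires about \(\log F / \log(1/r)\) iterations. If algorithm \(\textrm{B}\) performs twice the computational work of algorithm \(\textrm{A}\) per iteration, as in the two-step “Adam” example above, then \(\textrm{B}\)’s per-iteration factor \(r_B\) should be compared with \(r_A^2\) rather than with \(r_A\).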


References
[1] Barzilai, J. (2021, October 27). A note on neural network mini-batches. SIAM News Online. Retrieved from https://sinews.siam.org/Details-Page/a-note-on-neural-network-mini-batches.
[2] Barzilai, J., & Dempster, M.A.H. (1993). Measuring rates of convergence of numerical algorithms. J. Opt. Theory Appl., 78(1), 109-125.
[3] Thompson, N.C., Greenewald, K., Lee, K., & Manso, G.F. (2021, September 24). Deep learning’s diminishing returns. IEEE Spectrum. Retrieved from https://spectrum.ieee.org/deep-learning-computational-cost.

Jonathan Barzilai earned B.Sc., M.Sc., and D.Sc. degrees in applied mathematics from the Technion – Israel Institute of Technology. He is currently a professor in the Department of Industrial Engineering at Dalhousie University and has previously held positions at the University of Texas at Austin, York University, and the Technical University of Nova Scotia.
