
A Note on Neural Network Mini-Batches

By Jonathan Barzilai

Researchers formulate the neural network training problem as the minimization of an objective function that represents a total “cost” \(c(x)=\sum_ic_i(x)\), where \(x\) is the vector of unknown weights and \(c_i(x)\) is a nonnegative partial cost. General notes on this subject are available in [1]. To facilitate large-scale minimization of this sum, the total cost is broken into batches of partial costs: \(B_j(x)=\sum_{i=m_j}^{n_j}c_i(x)\). These batches are then cyclically minimized so that the weights that are produced by minimizing \(B_j(x)\) serve as the starting point for the minimization of \(B_{j+1}(x)\) [2].
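
A minimal sketch of this cyclic structure, with illustrative names and no commitment to a particular inner optimizer, might look as follows in Python: the partial costs are grouped into batches by index set, and each batch's minimization is warm-started at the previous batch's result.

```python
from typing import Callable, Sequence
import numpy as np

CostFn = Callable[[np.ndarray], float]

def cyclic_batch_minimization(
    partial_costs: Sequence[CostFn],
    batches: Sequence[Sequence[int]],
    minimize_batch: Callable[[CostFn, np.ndarray], np.ndarray],
    x0: np.ndarray,
    passes: int = 1,
) -> np.ndarray:
    """Cyclically minimize B_j(x) = sum of c_i(x) over the j-th index set.

    The point returned for batch j becomes the starting point for batch j + 1;
    `minimize_batch(B, x)` may be any optimizer (gradient descent, Adam, ...).
    """
    x = x0.copy()
    for _ in range(passes):
        for idx in batches:
            B = lambda x, idx=idx: sum(partial_costs[i](x) for i in idx)
            x = minimize_batch(B, x)   # warm-start the next batch here
    return x
```

Note that no batch ever evaluates the total cost \(c(x)\); each sees only its own partial costs.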

If \(B_j(x^*)=0\) for all of the batches, then \(c(x^*)=0\) and \(x^*\) is a global minimizer of the total cost function. But in finite-precision computation, a point \(x_j\) is accepted as a minimizer of \(B_j(x)\) when \(B_j(x_j)<t\), where \(t\) is a specified tolerance. A computed point \(\hat{x}\) may then satisfy \(B_j(\hat{x})<t\) for all \(j\) while \(c(\hat{x})=\sum\limits_j B_j(\hat{x}) \gg t\), so that a solution for every batch is not a solution for the entire problem. Since typical neural network training problems are singular and involve large data sets, there is no practical way to recognize when this happens. Furthermore, splitting the input into batches is the source of less obvious and more severe problems, even in the simplest case of small, non-singular quadratic cost functions.
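
A back-of-the-envelope bound makes the gap concrete. Suppose, hypothetically, that the cost is split into \(m\) batches and every batch residual at a computed point \(\hat{x}\) sits just below the tolerance. Then

\[B_j(\hat{x})<t \:\: \textrm{for all} \:\: j \quad \Longrightarrow \quad c(\hat{x})=\sum_{j=1}^m B_j(\hat{x})<mt,\]

so with \(m=10^6\) batches and \(t=10^{-12}\), every batch can pass its test while the total cost approaches \(10^{-6}\), six orders of magnitude above the tolerance.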

Consider the following well-conditioned quadratic cost that consists of six unknowns and six partial costs:

\[c_1(x)=(2x_1-x_2)^2, \: c_2(x)=[2x_2-(x_1+x_3)]^2, \: c_3(x)=[2x_3-(x_2+x_4)]^2\]

\[c_4(x)=[2x_4-(x_3+x_5)]^2, \: c_5(x)=[2x_5-(x_4+x_6)]^2, \: c_6(x)=(2x_6-x_5-7)^2.\]

The total cost function is strictly convex with a minimum at \(x^*=(1, 2, 3, 4, 5, 6)\). Application of 1,000 steps of the “Adam” algorithm [3] with tolerance \(t=10^{-12}\) from the starting point \(x_0=(6, 5, 4, 3, 2, 1)\) with batches

\[B_1(x)=c_1(x)+c_2(x)+c_3(x) \:\:\: \textrm{and} \:\:\: B_2(x)=c_4(x)+c_5(x)+c_6(x)\]

yields the solution

\[ x_1 = \begin{bmatrix} 3.3125\\ 6.2240 \\ 8.4513 \\ 9.3573 \\ 9.2253 \\ 8.3131 \end{bmatrix} \]

with error \(1.0011\).

The same parameters with the order of batches changed to

\[B_1(x)=c_4(x)+c_5(x)+c_6(x) \:\:\: \textrm{and} \:\:\: B_2(x) = c_1(x)+c_2(x)+c_3(x)\]

yield the solution

\[ x_2 = \begin{bmatrix} -1.3131\\ -2.2253 \\ -2.3573 \\ -1.4513 \\ 0.7760 \\ 3.6875 \end{bmatrix} \]

with error \(1.0011\).
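
A NumPy sketch of this experiment appears below. The article does not specify Adam's step size or how the 1,000 steps are distributed across the batches, so the hyperparameters here (a step size of 0.01 and up to 1,000 steps per batch) are assumptions; the numbers the script prints will therefore not match the figures above exactly. What it is meant to show is the qualitative behavior: the two batch orders terminate at different points, and the total cost at either point remains far above the tolerance.

```python
import numpy as np

# Each partial cost is c_i(x) = (a_i . x - b_i)^2; the a_i are the rows of A.
A = np.array([[ 2, -1,  0,  0,  0,  0],
              [-1,  2, -1,  0,  0,  0],
              [ 0, -1,  2, -1,  0,  0],
              [ 0,  0, -1,  2, -1,  0],
              [ 0,  0,  0, -1,  2, -1],
              [ 0,  0,  0,  0, -1,  2]], dtype=float)
b = np.array([0, 0, 0, 0, 0, 7], dtype=float)

def batch_cost_grad(x, rows):
    """Value and gradient of B(x) = sum over `rows` of (a_i . x - b_i)^2."""
    r = A[rows] @ x - b[rows]
    return r @ r, 2.0 * A[rows].T @ r

def adam_on_batch(x, rows, steps=1000, lr=0.01, tol=1e-12,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam updates on one batch; stops early once the batch cost < tol."""
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, steps + 1):
        cost, g = batch_cost_grad(x, rows)
        if cost < tol:
            break
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        x = x - lr * (m / (1.0 - beta1**k)) / (np.sqrt(v / (1.0 - beta2**k)) + eps)
    return x

for order in ([[0, 1, 2], [3, 4, 5]], [[3, 4, 5], [0, 1, 2]]):
    x = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
    for rows in order:                      # each batch starts from the previous result
        x = adam_on_batch(x, rows)
    total = batch_cost_grad(x, list(range(6)))[0]
    print("order", order, "-> x =", np.round(x, 4), " total cost =", total)
```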

The batch problems are singular: each batch imposes only three conditions on the six unknowns. Their general solutions are given by \(y_1=x_1+v_1\) and \(y_2=x_2+v_2\), where \(x_1\) and \(x_2\) are particular solutions and \(v_1\) and \(v_2\) are arbitrary vectors in different subspaces. Under the cyclical procedure, the arbitrary vectors that are generated by the other batches affect each batch’s solution, and the last batch’s arbitrary vector strongly affects the overall solution. Since the batches’ elements depend on the order of the problem’s input, the solution depends on this order as well; reordering the input produces different outputs, i.e., different solutions. Note that a proper solution of a problem cannot depend on its description.
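
The singularity is easy to verify numerically. The short check below reuses the matrix form of the example (each \(c_i(x)=(a_i^{\mathsf T}x-b_i)^2\), with the \(a_i\) as rows of a matrix \(A\)) and computes the rank of each batch's Hessian: each is 3 rather than 6, so every batch leaves a three-dimensional subspace of the weights unconstrained, whereas the Hessian of the total cost has full rank.

```python
import numpy as np

# Matrix form of the six partial costs c_i(x) = (a_i . x - b_i)^2 from the example.
A = np.array([[ 2, -1,  0,  0,  0,  0],
              [-1,  2, -1,  0,  0,  0],
              [ 0, -1,  2, -1,  0,  0],
              [ 0,  0, -1,  2, -1,  0],
              [ 0,  0,  0, -1,  2, -1],
              [ 0,  0,  0,  0, -1,  2]], dtype=float)

# The Hessian of a batch over `rows` is 2 * A_rows^T A_rows; its rank is the number
# of directions in weight space that the batch actually determines.
for name, rows in (("B1 = c1+c2+c3", [0, 1, 2]), ("B2 = c4+c5+c6", [3, 4, 5])):
    H = 2.0 * A[rows].T @ A[rows]
    print(name, "Hessian rank:", np.linalg.matrix_rank(H))        # prints 3
print("total cost Hessian rank:", np.linalg.matrix_rank(2.0 * A.T @ A))  # prints 6
```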

Finally, output-based metrics do not characterize an algorithm’s performance when the output depends on the order of the problem’s input. In this case, one can instead use the number of steps required to reduce the error by a fixed factor, which depends on the algorithm’s convergence rate but not on the problem’s size.
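
As an illustration of such a metric, the hypothetical helper below counts the steps an arbitrary update rule needs to cut the error \(\|x-x^*\|\) by a fixed factor; the commented usage applies it to plain gradient descent on the full quadratic above (the step size 0.05 is an arbitrary choice for the illustration).

```python
import numpy as np

def steps_to_reduce_error(step_fn, x0, x_star, factor=10.0, max_steps=100_000):
    """Count optimizer steps until the error ||x - x*|| shrinks by `factor`.

    `step_fn(x)` performs one update of whatever algorithm is being measured.
    """
    x = np.asarray(x0, dtype=float).copy()
    target = np.linalg.norm(x - x_star) / factor
    for k in range(1, max_steps + 1):
        x = step_fn(x)
        if np.linalg.norm(x - x_star) <= target:
            return k
    return None  # target reduction not reached within max_steps

# Example usage with the A, b, and x^* = (1, ..., 6) defined in the sketches above:
# gd_step = lambda x: x - 0.05 * 2.0 * A.T @ (A @ x - b)
# print(steps_to_reduce_error(gd_step, np.full(6, 3.0), np.arange(1.0, 7.0)))
```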


References
[1] Barzilai, J. (2020). On neural-network training algorithms. In W. Lawless, R. Mittu, & D. Sofge (Eds.), Human-machine shared contexts (pp. 307-313). Cambridge, MA: Elsevier.
[2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.
[3] Kingma, D.P., & Ba, J.L. (2014). Adam: A method for stochastic optimization. Preprint, arXiv:1412.6980.

Jonathan Barzilai earned B.Sc., M.Sc., and D.Sc. degrees in applied mathematics from the Technion – Israel Institute of Technology. He is currently a professor in the Department of Industrial Engineering at Dalhousie University and has previously held positions at the University of Texas at Austin, York University, Dalhousie University, and the Technical University of Nova Scotia.
