According to an answer on Quora, batch size affects the training of an ANN in mainly two ways:
- It reduces the variance of the stochastic gradient updates. By averaging the per-example gradients over a batch, mini-batches efficiently reduce the noise in our estimate of the best direction to take. SGD needs that estimate to take a step of steepest descent, and the gradient of a single data point is a lot noisier than the gradient averaged over a batch of 100, so with tiny batches we won't necessarily be moving down the error function in the direction of steepest descent. With a relatively large batch size we can reduce this variance greatly at the cost of a little extra time per step (which is insignificant with parallelization or on a GPU); the sketch after this list shows how the noise shrinks with batch size.
- This in turn allows us to take bigger step sizes, which means the optimization algorithm makes progress faster.
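To make the first point concrete, here is a minimal NumPy sketch (the linear-regression data and every name in it are my own assumptions, not something from the referenced posts). It samples many mini-batch gradients of a squared-error loss at a fixed parameter vector and prints their spread, which shrinks roughly as 1/sqrt(batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear targets
w = np.zeros(d)                             # fixed point at which we probe the gradient

def minibatch_grad(batch_size):
    """Gradient of the mean squared error on one random mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

for bs in (1, 10, 100, 1000):
    grads = np.stack([minibatch_grad(bs) for _ in range(500)])
    # Spread of the 500 gradient estimates around their mean
    print(f"batch size {bs:4d}: gradient std ≈ {grads.std(axis=0).mean():.3f}")
```

The shrinking spread is exactly what lets us get away with the larger step sizes mentioned in the second point.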
However, everything has its dark side. When considering a large mini-batch size, we need to weigh the trade-off between a faster-converging network and the risk of getting stuck in a poor local optimum. Variance and noise are not always bad; sometimes we need them to help us jump out of the shallow valleys of our loss function. If we used the entire training set to compute each gradient, our model would get stuck in the first valley it fell into (since it would register a gradient of 0 at that point). If we use smaller mini-batches, on the other hand, we get more noise in our estimate of the gradient, and this noise can be enough to push us out of some of the shallow valleys in the error function. To draw a conclusion: small batches tend to pull gradient descent toward wide basins, while large batches tend to pull it toward narrow basins, which leads to higher error when we try to generalize, because missing the mark in a narrow basin causes a larger change in error.

So, when we have reduced the learning rate to a relatively small value but the network still does not improve as we expect, making the batch size a little smaller can be a good solution. The toy sketch below shows how gradient noise can carry the parameters out of a shallow valley into a deeper one.
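As a rough illustration of that escape effect, here is a toy 1-D sketch (the double-well loss and the Gaussian noise standing in for small-batch gradient noise are my own assumptions, not taken from the references). Noise-free descent, which plays the role of full-batch training, settles in the shallow valley it starts in, while noisy updates typically end up in the deeper, wider one:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # Toy 1-D loss: shallow valley near x ≈ -0.9, deeper valley near x ≈ 2.1
    return 0.05 * (x + 1) ** 2 * (x - 2) ** 2 - 0.1 * x

def grad(x):
    # Derivative of the loss above
    return 0.1 * (x + 1) * (x - 2) * (2 * x - 1) - 0.1

def run(noise_std, lr=0.05, noisy_steps=4000, settle_steps=1000):
    """Noisy descent, then a noise-free phase so x settles into the nearest minimum."""
    x = -1.2                                   # start inside the shallow valley
    for _ in range(noisy_steps):
        x -= lr * (grad(x) + noise_std * rng.normal())
    for _ in range(settle_steps):
        x -= lr * grad(x)
    return x

for sigma in (0.0, 1.8):                       # 0.0 ~ full batch, 1.8 ~ small-batch noise
    finals = [run(sigma) for _ in range(20)]
    escaped = sum(x > 0.5 for x in finals)     # the barrier between the valleys sits near x ≈ 0.3
    print(f"gradient noise {sigma:.1f}: {escaped}/20 runs settled in the deeper valley")
```

The explicit Gaussian term here is only a stand-in for what a smaller batch size would produce; in real training the noise comes from subsampling the data rather than from an added noise term.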
References:
1. Quora: Intuitively, how does mini-batch size affect the performance of (stochastic) gradient descent?
2. Reddit: Learning Rate and Accuracy vs Batch Size
3. The Effects of Hyperparameters on SGD Training of Neural Networks
4. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
5. One weird trick for parallelizing convolutional neural networks
6. Systematic evaluation of CNN advances on the ImageNet