Basics of Gradient Descent

Reading paper: An overview of gradient descent optimization algorithms

1. Introduction

This article aims to provide the reader with intuitions about the behaviour of different algorithms for optimizing gradient descent.

2. Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

2.1 Batch gradient descent

Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset.

for i in range(n_epochs):
    # Compute the gradient of the loss over the entire training set
    params_grad = evaluate_gradient(loss_function, data, params)
    # Take a step in the direction of the negative gradient
    params = params - learning_rate * params_grad

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

2.2 Stochastic gradient descent (SGD)

Stochastic gradient descent performs a parameter update for each training example x(i) and label y(i).

import numpy as np

for i in range(n_epochs):
    # Shuffle the training data at the start of each epoch
    np.random.shuffle(data)
    for example in data:
        # Compute the gradient for a single training example
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily. SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting, so we slowly decrease the learning rate over time.
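To make "slowly decrease the learning rate" concrete, below is a minimal sketch of time-based decay. The names initial_lr and decay_rate and their values are illustrative assumptions, not something prescribed by the paper.

# Minimal sketch of time-based learning-rate decay (illustrative values)
initial_lr = 0.1   # assumed starting learning rate
decay_rate = 0.01  # assumed decay constant

def decayed_learning_rate(epoch):
    # Learning rate shrinks as 1 / (1 + decay_rate * epoch)
    return initial_lr / (1.0 + decay_rate * epoch)

for epoch in range(n_epochs):
    learning_rate = decayed_learning_rate(epoch)
    # ... perform the SGD updates for this epoch with the decayed learning rate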

2.3 Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

for i in range(n_epochs):
    # Shuffle the training data at the start of each epoch
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        # Compute the gradient for one mini-batch of training examples
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

  • Reduces the variance of the parameter updates, which can lead to more stable convergence.
  • Can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256. The term SGD is usually employed also when mini-batches are used.
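The mini-batch snippet above assumes a get_batches helper that slices the shuffled data into consecutive mini-batches. A minimal sketch of such a helper, where the name and signature follow the snippet but the implementation is an assumption:

def get_batches(data, batch_size=50):
    # Yield consecutive mini-batches of batch_size examples from the shuffled data
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]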

3. Challenges

Vanilla mini-batch gradient descent does not guarantee good convergence and poses a few challenges that need to be addressed.

  • Choosing a proper learning rate can be difficult.
  • Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold (see the sketch after this list). However, these schedules and thresholds have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.
  • The same learning rate applies to all parameter updates, while if our data is sparse and our features have very different frequencies, we might want to perform a larger update for rarely occurring features.
  • Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. The difficulty in fact arises not from local minima but from saddle points…
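To make the schedule idea in the list above concrete, here is a minimal sketch of a threshold-based schedule that halves the learning rate when the improvement in the objective between epochs falls below a threshold. The names and values (threshold, factor) and the train_one_epoch helper are illustrative assumptions, not part of the paper.

# Minimal sketch of a threshold-based learning-rate schedule (illustrative values)
learning_rate = 0.1  # assumed initial learning rate
threshold = 1e-4     # minimum required improvement in the objective between epochs
factor = 0.5         # multiplier applied to the learning rate when improvement stalls

previous_loss = float('inf')
for epoch in range(n_epochs):
    # train_one_epoch is a hypothetical helper that runs one epoch and returns the objective value
    current_loss = train_one_epoch(learning_rate)
    if previous_loss - current_loss < threshold:
        # Improvement stalled: reduce the learning rate
        learning_rate *= factor
    previous_loss = current_loss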

4. Gradient descent optimization algorithms

We will not discuss algorithms that are infeasible to compute in practice for high-dimensional data, e.g. second-order methods such as Newton's method.

4.1 Momentum

4.2 Nesterov accelerated gradient

4.3 Adagrad

4.4 Adadelta

4.5 RMSprop

4.6 Adam

4.7 AdaMax

4.8 Nadam

4.9 Visualization of algorithms

4.10 Which algorithm to choose?

5. Parallelizing and distributing SGD