What’s Gradient Descent with Momentum?




We have studied Gradient Descent in the past, and we know how quickly it converges to a minimum. But ordinary GD often fails when we are dealing with local minima: picture a loss curve with several dips, only one of which is the lowest.


Do you see the problem? When GD finds a minimum point, it settles down there, without knowing whether it is really the lowest point.

Hence, to tackle this problem, we have Gradient Descent with Momentum.

Gradient Descent with Momentum

It's the original Gradient Descent with a slight change: we add a little spice of Exponentially Weighted Averages. Specifically:

On iteration t:
Compute dW and db on the current mini-batch.
—> VdW = β * VdW + (1 − β) * dW
—> Vdb = β * Vdb + (1 − β) * db
—> W = W − α * VdW, b = b − α * Vdb

These updates come from the general Exponentially Weighted Average formula:
—> Vθ = β * Vθ + (1 − β) * θt
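To make this concrete, here is a minimal NumPy sketch of the loop above on a toy linear-regression problem. The data, the model, and the hyperparameter values are my own illustrative assumptions, not something from the article:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3  # noiseless targets from a known line

W, b = np.zeros(3), 0.0                   # parameters
VdW, Vdb = np.zeros(3), 0.0               # velocities start at zero
alpha, beta = 0.1, 0.9                    # learning rate and momentum

for t in range(200):
    err = X @ W + b - y
    dW = 2 * X.T @ err / len(y)           # gradients of the mean squared error
    db = 2 * err.mean()
    VdW = beta * VdW + (1 - beta) * dW    # exponentially weighted averages
    Vdb = beta * Vdb + (1 - beta) * db
    W, b = W - alpha * VdW, b - alpha * Vdb  # step along the averaged gradients

print(W, b)  # approaches [1.0, -2.0, 0.5] and 0.3

Note that the parameter step uses the smoothed velocities VdW and Vdb rather than the raw gradients; that is the only difference from plain Gradient Descent.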

Now we have two hyperparameters, α and β. In practice, β = 0.9 works well most of the time.

Sampling a value of β:

import numpy as np

r = np.random.rand()         # r is uniform in [0, 1)
beta = 1 - 10 ** (-r - 1)    # beta lands in [0.9, 0.99)

This samples 1 − β on a log scale, so the search explores values of β near 0.99 with as much resolution as values near 0.9.
[Image by Author: blue line shows GD without momentum, red line shows GD with momentum]

In the image, the blue line is without momentum and the red line is with it. The difference is clearly visible: with momentum, the algorithm converges on the global minimum noticeably faster.

How does it skip local minima?

Alright, let's start with the definition of 'momentum'.

Momentum is a physics term; it refers to the quantity of motion that an object has. A sports team that is on the move has the momentum. If an object is in motion (on the move) then it has momentum. [1]

From this we can conclude that a moving object has momentum. For an example, have you ever tried to stop a bowling ball? If so, you will have noticed that it takes quite some force to do so, and that force is needed because of the ball's momentum, which insists on keeping it going.

In a similar manner, when the algorithm starts descending, the updates gain momentum. So when a small bump appears at the edge of a local minimum, the ball keeps moving despite the opposing slope, because its accumulated momentum pushes it onward. In this way, it does not get stuck in local minima.

You might be thinking that, by the same logic, we could also overshoot the global minimum and roll right past it; you are partly right. But the momentum changes with every iteration: while the ball moves downhill the momentum builds up, and while it moves uphill it decays. So when we reach the global minimum, the ball goes a little past it, but because the momentum keeps decaying, it soon zeroes out and the ball settles in the correct place.
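To see that settling behavior concretely, here is a tiny 1D sketch of my own (an illustration, not the author's code) that minimizes f(x) = x² with momentum:

x, v = 5.0, 0.0                     # start away from the minimum at x = 0
alpha, beta = 0.1, 0.9              # learning rate and momentum coefficient

for t in range(101):
    g = 2 * x                       # gradient of f(x) = x^2
    v = beta * v + (1 - beta) * g   # velocity builds downhill, decays uphill
    x = x - alpha * v               # parameter update
    if t % 20 == 0:
        print(t, round(x, 4))       # x shoots past 0, oscillates, and settles

Each time x crosses the minimum, the gradient flips sign and starts shrinking v, so every overshoot is smaller than the previous one.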


Conclusion

In this article, we discussed how GD with Momentum can make our model perform better than ordinary GD.

In this way, you can make your models converge better!

References

[1] → Link.


Contacts

If you want to keep up to date with my latest articles and projects, follow me on Medium.

Happy Learning. 🙂
