In neural networks, each layer has a set of weights which, along with the activation function, act as gates for the output decision/classifier. In the process of learning, we try to adapt the weights so that they minimise the error between the actual and predicted values. One popular method for this is the Gradient Descent algorithm.
In the above graph, let f(x) represent the error between the actual and predicted values (this is the case for a linear regression model rather than a classifier; we use it because it is easy to visualize).
In linear regression the error is the squared L2 norm, i.e. (x − [x])^2, where [x] is the predicted value. The error function is therefore parabolic, and when the actual x is close to the predicted [x] we are at the bottom of the valley. This is called Convergence.
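To make the shape of this error concrete, here is a minimal sketch; the actual value 3.0 and the candidate predictions are made-up numbers for illustration only:

```python
def squared_error(x, x_hat):
    """Squared (L2) error between the actual x and the predicted x_hat."""
    return (x - x_hat) ** 2

# The error shrinks as the prediction approaches the actual value.
for x_hat in [0.0, 1.0, 2.5, 3.0]:
    print(x_hat, squared_error(3.0, x_hat))
# 0.0 -> 9.0, 1.0 -> 4.0, 2.5 -> 0.25, 3.0 -> 0.0 (the bottom of the valley)
```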
x(t+1) = x(t) + ∆x(t)
The weights (x(t)) are chosen at random initially, so in the first iteration we are at the top of the valley. At this point we calculate the gradient (g(t)), or slope. It gives the direction in which the curve is increasing, so we move in the opposite direction, i.e. the negative of the gradient. This gives us the direction.
How far we move depends on the step value (∆x), which is scaled by the learning rate η. If we take smaller steps (small η) we may iterate a lot before we converge. If we take larger steps (large η) we may overshoot the Convergence Value or Final Value.
∆x(t) = −η * g(t)
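Putting the pieces together, here is a minimal sketch of these updates on the parabolic error described above; x_hat, the starting point and η are illustrative values, not tuned ones:

```python
# Gradient descent on f(x) = (x - x_hat)^2, whose gradient is 2 * (x - x_hat).
x_hat = 3.0   # the value we want to converge to
x = 10.0      # initial weight: top of the valley
eta = 0.1     # hand-picked step size (learning rate)

for t in range(50):
    g = 2 * (x - x_hat)   # gradient g(t) of the error at the current x
    x = x - eta * g       # x(t+1) = x(t) - eta * g(t): move against the gradient

print(x)                  # close to 3.0 after enough iterations
```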
In theory, the best step is ∆x(t) = −H(t)^(-1) * g(t), based on Newton's Method, where H(t) is the Hessian matrix, i.e. the matrix of second derivatives, and H(t)^(-1) is its inverse. The cost of computing, storing and inverting the Hessian grows rapidly with the number of weights (quadratically in storage and roughly cubically in inversion), which is exactly the situation with deep networks, so it is generally not used.
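For the one-dimensional parabola this is easy to see in a sketch (same illustrative numbers as above); for a real network the Hessian would be an n × n matrix over all weights, which is what makes the approach impractical:

```python
# A single Newton step dx = -H^(-1) * g on the same parabola.
# For f(x) = (x - x_hat)^2 the second derivative (Hessian) is the constant 2,
# so one Newton step lands exactly on the minimum.
x_hat = 3.0
x = 10.0
g = 2 * (x - x_hat)   # first derivative at x
H = 2.0               # second derivative (a 1x1 "Hessian")
x = x - g / H         # Newton step: x becomes exactly 3.0
print(x)
```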
The eta (η) used here is called a hyperparameter, and usually it is hand-picked. In practice it is more of an art than a science.
Generally, we want a large eta when the weights are way off, so that we take larger steps towards convergence, and a smaller eta as we approach the minimum. This can be done manually, or automatically as a function of the number of iterations [epochs] through the data set, by using past gradient information, or by approximating the Hessian matrix / second-order information.
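As one concrete example of shrinking eta automatically with the epoch count, here is a small sketch of simple time-based decay; the initial rate and the decay constant are made-up values, not recommendations:

```python
eta0 = 0.5      # large initial learning rate while the weights are way off
decay = 0.1     # hypothetical decay constant

def eta_at(epoch):
    """Learning rate after a given number of passes (epochs) over the data set."""
    return eta0 / (1.0 + decay * epoch)

for epoch in [0, 10, 50, 100]:
    print(epoch, round(eta_at(epoch), 4))
# 0 -> 0.5, 10 -> 0.25, 50 -> 0.0833, 100 -> 0.0455
```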