Learning Algorithm Basics

In neural networks, each layer has a set of weights which, along with the activation, act as gates for the output decision/classifier. In the process of learning, we adapt the weights so that they minimise the error between the actual and predicted values. One popular method for doing this is the Gradient Descent algorithm.

In the above graph, let f(x) represent the error between the actual and predicted values (this is the case for a linear regression model rather than a classifier; we take this case since it is easy to visualize).

In linear regression, the error is the squared (L2) error, i.e. (x − x̂)^2, where x̂ is the predicted value. So the error function is parabolic, and when the actual x is close to the predicted x̂ we are at the bottom of the valley. This is called convergence.

x(t+1) = x(t) + ∆x(t)

Weights (x(t)) are chosen at random initially, so at the first iteration we would be at the top of the valley. At this point we calculate the gradient (g(t)), or slope, which gives the direction in which the curve is increasing. So we move in the opposite direction, i.e. the negative of the gradient. This gives us the direction of the update.

How much we move depends on the step value (∆x). If we take small steps (small η) we may iterate a lot before we converge. If we take large steps (large η) we may overshoot the convergence value, or final value.

∆x(t) = −η * g(t)
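To make this concrete, here is a rough sketch of the two update rules above in Python, on a made-up parabolic error f(x) = (x − 3)^2 (the target value 3, the starting point, and the eta below are illustrative, not from any particular model):

    # Minimal gradient descent sketch on f(x) = (x - 3)^2.
    # g(t) is the derivative f'(x) = 2 * (x - 3); eta is a hand-picked step size.
    eta = 0.1
    x = 10.0                      # random starting point, up on the valley wall
    for t in range(50):
        g = 2.0 * (x - 3.0)       # gradient at the current x
        delta_x = -eta * g        # delta_x(t) = -eta * g(t)
        x = x + delta_x           # x(t+1) = x(t) + delta_x(t)
    print(x)                      # approaches 3.0, the bottom of the valley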

In theory the best value for the step is ∆x(t) = −H(t)^(-1) * g(t), based on Newton's method. Here H(t) is the Hessian matrix, i.e. the matrix of second derivatives, and we need its inverse. The cost of computing and inverting the Hessian grows very quickly with the number of weights, which is exactly the situation with deep networks, so it is generally not used.
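For intuition only, here is what the Newton step looks like on the same kind of one-dimensional parabola; in 1-D the "Hessian" is just the second derivative, and the step lands on the minimum in a single update (the numbers are again made up):

    # Newton's method on f(x) = (x - 3)^2: f'(x) = 2 * (x - 3), f''(x) = 2.
    x = 10.0
    g = 2.0 * (x - 3.0)           # first derivative (the gradient)
    h = 2.0                       # second derivative (the 1-D "Hessian")
    x = x - g / h                 # delta_x(t) = -H(t)^(-1) * g(t); jumps straight to 3.0
    print(x)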

The eta (η) used is called a hyperparameter and is usually hand-picked. In practice, it is more of an art than a science.

Generally, we want a large eta when the weights are way off, so that we take larger steps toward convergence, and a smaller eta as we approach the minimum. This can be done manually, or automatically as a function of the number of iterations (epochs) through the data set, by using past gradient information, or by approximating the Hessian matrix/second-order information.
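One simple automatic option is to shrink eta as a function of the epoch count. The schedule below is just one illustrative choice; the constants are made up:

    # Illustrative step-decay schedule: start with a large eta and shrink it over epochs.
    eta0 = 0.5
    def eta_for_epoch(epoch, decay=0.9):
        return eta0 * (decay ** epoch)   # eta shrinks geometrically as training progresses

    for epoch in range(5):
        print(epoch, eta_for_epoch(epoch))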

Basic Modules of a Neural Network

Consider the below a bird's-eye-view / glossary-of-terms / draw-the-diagram-and-name-the-parts kind of explanation of neural networks.

Input Data: For an NN to work, we require a lot of data to learn from. We take this data, send it through the network, and check whether the result matches the observed/expected output. Based on how different the output is from the expected, we correct the network (weights) and pass the data through again until the output matches the expected.

Neuron: A basic neuron is said to be based on the activation principle of a neuron in the brain.

As shown above, it takes a weighted sum of all the inputs and, if it crosses a threshold, it triggers an output. It also has a recursive structure, i.e. the outputs of a set of neurons are connected as inputs to another neuron.


This contributes to the non-linear decision boundaries for classification.
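As a minimal sketch, here is one such neuron in Python, using a sigmoid as a smooth stand-in for the threshold (the weights, bias, and inputs are arbitrary illustration values):

    import math

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs, then a squashing activation (sigmoid here).
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-z))   # output near 1 when z is well above the threshold

    print(neuron([0.5, 1.0], [2.0, -1.0], 0.1))   # a single neuron firing on two inputs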

Neural Net Layers: As shown in the above structure, we take an array of neurons and arrange them in a layer, which takes inputs from the layer above and passes its decisions on to the layers below. This maps, to some extent, to a multi-resolution approach to learning: say in image recognition, we start by identifying edges, lines, curves etc., which are then combined into shapes like circles, squares, boxes etc., which in turn are combined into faces, trees etc. It may not be exact, but it is a useful analogy.

Forward Propagation: The process of passing the input data through the network, through every neuron (weights and activations), to produce an output is defined as forward propagation. Thus every layer has a forward propagation step.
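A sketch of forward propagation through two fully connected layers, assuming sigmoid activations and NumPy (the layer sizes and random values are purely illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, W2, b2):
        h = sigmoid(W1 @ x + b1)      # first layer: weighted sums + activation
        y = sigmoid(W2 @ h + b2)      # second layer takes the first layer's output as input
        return y

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                           # one input example with 3 features
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # layer 1: 3 inputs -> 4 neurons
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)    # layer 2: 4 inputs -> 2 outputs
    print(forward(x, W1, b1, W2, b2))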

Backward Propagation: When the actual output does not match the expected output, we adapt the network (weights) so that its output moves closer to the expected. It uses the gradient descent algorithm to correct the weights.
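For a single sigmoid neuron with a squared-error loss, the backward pass is just the chain rule. A minimal sketch, with made-up numbers:

    import numpy as np

    x = np.array([0.5, 1.0])           # inputs
    w = np.array([0.2, -0.3])          # current weights
    target = 1.0                       # expected output

    z = w @ x                          # forward pass: weighted sum
    y = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
    error = y - target                 # mismatch between actual and expected

    # Chain rule: dLoss/dw = dLoss/dy * dy/dz * dz/dw
    grad_w = error * y * (1.0 - y) * x
    w = w - 0.1 * grad_w               # gradient descent step on the weights
    print(w)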

Decision Criteria: Alternatively called the loss layer. It is usually the last layer of the NN. Here, after forward propagation, we decide how well the output matched the expected. The criterion varies based on what we want the output of the network to be, e.g. a classifier or a regression.
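Two common choices of loss, depending on whether the network output is a regression value or a classifier probability (a sketch with toy values):

    import numpy as np

    def mse_loss(predicted, actual):
        # Squared-error loss, typical for regression outputs.
        return np.mean((predicted - actual) ** 2)

    def cross_entropy_loss(predicted_probs, actual_labels):
        # Cross-entropy, typical for a classifier whose last layer outputs probabilities.
        eps = 1e-12                                # avoid log(0)
        p = np.clip(predicted_probs, eps, 1 - eps)
        return -np.mean(actual_labels * np.log(p) + (1 - actual_labels) * np.log(1 - p))

    print(mse_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))
    print(cross_entropy_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))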

Training: A cycle of forward and backward propagation of the data moves the weights toward the desired result. Training is the process of iterating the data through the network again and again until the expected and the actual outputs are within the desired error.
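Putting the pieces together, here is a toy training loop for a one-weight linear model: each epoch does a forward pass, measures the error, computes the gradient, and updates the weight (the data and eta are made up):

    # Toy training loop: learn w so that y = w * x fits the data y = 2 * x.
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    w, eta = 0.0, 0.05

    for epoch in range(200):
        grad = 0.0
        for x, y in zip(xs, ys):
            pred = w * x                   # forward propagation
            grad += 2.0 * (pred - y) * x   # gradient of the squared error w.r.t. w
        w -= eta * grad / len(xs)          # backward step: adjust the weight
    print(w)                               # close to 2.0 once training has converged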

Other than the above, there are some other basic terms which are not actually part of the NN but play a significant part in the result you obtain:

Pre-processing: The process of preparing the input data by normalisation, weeding out unlabeled data, finding holes in the input tables (i.e. rows without entries), etc., so that the data sent through the NN is clean and valid.
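A typical pre-processing step, assuming the data sits in a NumPy table where missing entries show up as NaN (the table below is made up):

    import numpy as np

    data = np.array([[1.0, 200.0],
                     [2.0, np.nan],        # a "hole" in the table
                     [3.0, 600.0]])

    clean = data[~np.isnan(data).any(axis=1)]                        # weed out rows with missing entries
    normalised = (clean - clean.mean(axis=0)) / clean.std(axis=0)    # zero mean, unit variance per column
    print(normalised)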

Regularisation: This is the process of adding extra terms to the decision criteria, based on the weights, which prevent the NN from overfitting. When the NN is overfit, it may perform very well in training but performs poorly on actual test data.
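A sketch of one common form of regularisation, an L2 (weight-decay) penalty, where a term based on the size of the weights is added to the loss; the lambda value is illustrative:

    import numpy as np

    def regularised_loss(predicted, actual, weights, lam=0.01):
        data_loss = np.mean((predicted - actual) ** 2)   # the usual decision-criteria term
        penalty = lam * np.sum(weights ** 2)             # extra term based on the weights (L2)
        return data_loss + penalty

    w = np.array([0.5, -1.2])
    print(regularised_loss(np.array([0.8]), np.array([1.0]), w))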

Where to start ….

Coursera offers a very basic course on machine learning by Professor Andrew Ng. There is also a course on Artificial Intelligence offered by Sebastian Thrun and Peter Norvig. Another course on Coursera, which I have not gone through, is Neural Networks for Machine Learning by Hinton.

To learn about deep learning, Nando de Freitas offered a course at Oxford which he has uploaded to his YouTube channel. It starts from a very basic approach and then goes on to recent trends in machine learning like RNNs, CNNs, autoencoders etc. I was able to follow the theory in the first 10 lectures or so and am still in the process of understanding the rest.

Along with the lectures, it is best to follow the exercises and practicals that he suggests. He uses Torch7 in his lectures.