Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Instead of setting the derivatives to zero and solving for the critical point directly, it repeatedly uses the derivatives to take small steps downhill. It requires (ideally) a convex objective to guarantee reaching the global minimum; for example, we can use gradient descent to find the minimum of an MSE loss.
A function $f$ is convex if for any $x_1, x_2$ in the domain of $f$, $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$ for all $\lambda \in [0, 1]$.
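For example (a standard check added here for illustration), $f(x) = x^2$ is convex: $\lambda x_1^2 + (1 - \lambda) x_2^2 - (\lambda x_1 + (1 - \lambda) x_2)^2 = \lambda (1 - \lambda)(x_1 - x_2)^2 \ge 0$ for all $\lambda \in [0, 1]$, so the defining inequality holds.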
Let $\boldsymbol{\theta}$ denote the parameters and $J(\boldsymbol{\theta})$ the cost function. We initialize $\boldsymbol{\theta}$ to something reasonable and repeatedly adjust it in the direction of steepest descent. The direction of greatest increase of $J$ at $\boldsymbol{\theta}$ is its gradient $\nabla J(\boldsymbol{\theta})$ (convexity is what guarantees that following it downhill leads to the global minimum), so we update $\boldsymbol{\theta}$ in the opposite direction of the gradient.
- In general the update rule is $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla J(\boldsymbol{\theta})$, where $\alpha$ is the learning rate (a code sketch appears after this list).
- The larger the learning rate, the faster the algorithm converges, but too large a step may overshoot the minimum. A common strategy is to start with a relatively large learning rate and gradually decrease it.
- The objective values stop changing: $|J(\boldsymbol{\theta}^{(t)}) - J(\boldsymbol{\theta}^{(t-1)})| < \epsilon$ for some small tolerance $\epsilon$.
- The parameters stop changing: $\|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^{(t-1)}\| < \epsilon$ or $\|\nabla J(\boldsymbol{\theta}^{(t)})\| < \epsilon$.
- We define the training cost as a function of the number of iterations.
- A plot of training cost vs. iterations shows how the learning rate affects convergence.
- BUT these criteria only check whether the optimization has reached a certain notion of convergence. They do not tell us whether we have reached the minimum, and they say nothing about the performance of the model.
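As an illustration, here is a minimal sketch of batch gradient descent on the linear-regression MSE; the function name, learning rate, tolerance, and synthetic data are my own assumptions, not part of the notes.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, eps=1e-6, max_iters=10_000):
    """Minimize J(theta) = (1/n) * ||X @ theta - y||^2 by batch gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)                        # reasonable initialization
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = X @ theta - y
        cost = residual @ residual / n         # current MSE
        grad = (2 / n) * X.T @ residual        # gradient of the MSE
        theta -= alpha * grad                  # step opposite to the gradient
        if abs(prev_cost - cost) < eps:        # objective stopped changing
            break
        prev_cost = cost
    return theta

# Usage on synthetic data: recover the true weights [1.0, -2.0, 0.5].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta_hat = gradient_descent(X, y)
```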
Stochastic Gradient Descent
SGD updates the parameters based on the gradient of the loss for a single training example; that is, we update the parameters after each training example. The advantage of SGD is that each update is much cheaper than a full gradient descent step. The disadvantage is that it is much less stable than gradient descent and the gradient estimate has larger variance. The steps are as follows (a code sketch follows the list):
- Choose an index $i \in \{1, \dots, n\}$ uniformly at random.
- Update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla J_i(\boldsymbol{\theta})$ using only the $i$-th training example.
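A minimal sketch of these steps for the same MSE objective; again the function name and hyperparameters are assumptions.

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, seed=0):
    """Stochastic gradient descent: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs * n):
        i = rng.integers(n)               # choose i uniformly at random
        residual = X[i] @ theta - y[i]
        grad_i = 2 * residual * X[i]      # gradient from the i-th example only
        theta -= alpha * grad_i           # noisy but cheap update
    return theta
```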
An improved variant of SGD is mini-batch SGD. It uses a small batch of training examples to update the parameters. It is more stable than plain SGD, the variance of the gradient estimate shrinks as the mini-batch grows, and thanks to vectorization it is often faster in practice.
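A sketch of the mini-batch variant, assuming a `batch_size` hyperparameter:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD: average the gradient over a small random batch per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs * max(1, n // batch_size)):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)  # sample a mini-batch
        residual = X[idx] @ theta - y[idx]
        grad = (2 / len(idx)) * X[idx].T @ residual                  # averaged gradient
        theta -= alpha * grad
    return theta
```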
Pros and Cons
Although we sometimes have a closed-form formula for the answer, think about what happens when the number of parameters is large: gradient descent becomes more efficient than the direct formula (see the example after the list).
- The closed-form linear regression solution is $\boldsymbol{\theta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$; inverting the $d \times d$ matrix costs $O(d^3)$, whereas one gradient descent iteration costs $O(nd)$.
- The difference is huge when $d$ is large.
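As a rough, purely illustrative comparison (the sizes here are hypothetical, not from the notes): with $d = 10^4$ features the inversion costs on the order of $d^3 = 10^{12}$ operations, while with $n = 10^5$ examples one gradient descent iteration costs on the order of $nd = 10^9$ operations, so even a few hundred GD iterations remain far cheaper than the closed-form solve.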