Generally speaking, the goal of the optimizer of a DNN is to find the parameters that minimize the loss function. Gradient descent (GD) is historically the most frequently used optimization method in machine learning.

Gradient Descent (GD) is an optimization algorithm used to minimize a loss function \(J(\theta)\), where \(\theta\) represents the parameters of a machine learning model. The goal of GD is to find the values of \(\theta\) that minimize the loss.

The algorithm iteratively updates the parameters in the opposite direction of the gradient of the loss function with respect to \(\theta\), because the gradient points towards the steepest increase of the function.

Here is how GD works, using the update formula for \(\theta\):

1. **Initialization**:
   - Initialize the parameters \(\theta\) with some random values.
   - Choose a learning rate \(\alpha\), which is a hyperparameter that determines the size of the steps taken in the parameter space.

2. **Update Rule**:
   - At each iteration \(t\), the parameters are updated using the following formula (a runnable sketch of this loop appears after the list):
     \[
     \theta_{t+1} = \theta_{t} - \alpha \nabla J(\theta_{t})
     \]
   where:
     - \(\theta_{t}\) represents the parameter values at iteration \(t\).
     - \(\nabla J(\theta_{t})\) is the gradient of the loss function \(J\) with respect to \(\theta\) evaluated at \(\theta_{t}\).
     - \(\alpha\) is the learning rate, which controls the size of the steps taken during each update.

3. **Convergence**:
   - Repeat the update process until a stopping criterion is met, such as a predefined number of iterations or until the change in \(J(\theta)\) falls below a certain threshold.

4. **Interpretation**:
   - The gradient \(\nabla J(\theta)\) points in the direction of the steepest increase of the loss function. Stepping along the negative gradient therefore moves \(\theta\) in the direction of steepest decrease, reducing the loss.

5. **Loss Reduction**:
   - As the algorithm progresses through iterations, the loss function \(J(\theta)\) typically decreases. Eventually, it converges to a local minimum (or, in convex problems, the global minimum) where the gradient is zero.

6. **Learning Rate Selection**:
   - The choice of learning rate \(\alpha\) is crucial. If \(\alpha\) is too small, convergence is slow; if it is too large, the updates can overshoot the minimum and oscillate or even diverge.

7. **Stochastic Gradient Descent (SGD)**:
   - In practice, a variant called Stochastic Gradient Descent (SGD) is often used. Instead of the full training set, it computes the gradient on a randomly sampled subset (mini-batch) of the training data at each step, which makes each update far cheaper for large datasets (see the mini-batch sketch after this list).
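
To make steps 1–6 concrete, here is a minimal NumPy sketch of batch gradient descent on a synthetic least-squares loss. The data, the loss \(J(\theta) = \frac{1}{2n}\lVert X\theta - y\rVert^2\), the helper `grad_J`, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

# Minimal sketch of batch gradient descent on a least-squares loss
# J(theta) = (1/2n) * ||X @ theta - y||^2. Data and hyperparameters
# below are illustrative assumptions.

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=n)

def grad_J(theta):
    """Gradient of J with respect to theta."""
    return X.T @ (X @ theta - y) / n

theta = rng.normal(size=d)   # 1. random initialization
alpha = 0.1                  # learning rate (hyperparameter)
tol = 1e-8                   # stop when the update is this small

for t in range(10_000):
    g = grad_J(theta)
    theta_next = theta - alpha * g                # 2. theta_{t+1} = theta_t - alpha * grad J(theta_t)
    if np.linalg.norm(theta_next - theta) < tol:  # 3. convergence check
        theta = theta_next
        break
    theta = theta_next

print("estimated theta:", theta)
```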

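For step 7, a mini-batch SGD version of the same loop might look like the sketch below; the batch size, number of epochs, and learning rate are again illustrative assumptions. The only change is that each gradient is estimated on a small random subset of the data rather than the full dataset.

```python
import numpy as np

# Minimal mini-batch SGD sketch on the same kind of least-squares
# problem as above; batch size, epochs, and learning rate are
# illustrative assumptions.

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=n)

theta = rng.normal(size=d)
alpha, batch_size, epochs = 0.05, 16, 200

for epoch in range(epochs):
    perm = rng.permutation(n)                 # reshuffle each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient estimated on the mini-batch only, not the full dataset.
        g = Xb.T @ (Xb @ theta - yb) / len(idx)
        theta = theta - alpha * g             # same update rule, noisier gradient

print("estimated theta:", theta)
```
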
Overall, GD is a fundamental optimization algorithm widely used in machine learning and deep learning for training models. It is the basis for many advanced optimization algorithms.
