A loss function is a crucial component in training an Artificial Neural Network (ANN). Its purpose is to quantify the difference between the predicted output of the network and the actual target values. This quantification allows the optimizer to adjust the model's parameters (weights and biases) during the training process.
Here's how the process works:
1. **Forward Pass**:
- During the forward pass, the input data is propagated through the layers of the neural network to generate predictions.
2. **Compute Loss**:
- The loss function takes the network's predictions and the true target values as inputs. It calculates a single scalar value that represents how far off the predictions are from the actual values.
3. **Loss Interpretation**:
- A low loss value indicates that the predictions are close to the actual targets, while a high loss value indicates a significant discrepancy.
4. **Backward Pass (Backpropagation)**:
- The optimizer's goal is to minimize the loss. To do this, the gradients of the loss with respect to each parameter in the network are computed using backpropagation, which applies the chain rule backward through the layers.
5. **Gradient Descent (or Other Optimizer)**:
- The gradients computed in the previous step guide the optimizer in adjusting the weights and biases in the direction that reduces the loss. This is done to reach a configuration where the loss is minimized.
6. **Update Parameters**:
- The optimizer updates the parameters (weights and biases) based on the gradients and the learning rate, which determines the step size of the updates. The update rule for a parameter \(\theta\) is typically of the form:
\[
\theta_{t+1} = \theta_{t} - \alpha \nabla J(\theta_{t})
\]
where \(\alpha\) is the learning rate and \(\nabla J(\theta_{t})\) is the gradient of the loss with respect to \(\theta\) at iteration \(t\).
7. **Iterative Process**:
- Steps 1-6 are repeated for a specified number of iterations (epochs) or until the loss converges to a satisfactory level.
8. **Convergence**:
- As the training progresses, the loss typically decreases, indicating that the network's predictions are improving.
9. **Validation and Testing**:
- During and after training, the model's performance is evaluated on a separate validation set to check that it generalizes to unseen data, and a held-out test set provides an unbiased final estimate of its performance.
10. **Fine-Tuning**:
- Based on the validation performance, further adjustments can be made, such as fine-tuning hyperparameters or modifying the architecture.
The loss function serves as a measure of the discrepancy between predicted and actual values. This discrepancy, or "error," is used by the optimizer to update the model's parameters iteratively, aiming to minimize the loss. This process of minimizing the loss through gradient-based optimization is the essence of training a neural network.
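To make steps 1 through 6 concrete, here is a minimal sketch of a training loop for a single linear neuron fit with MSE and plain gradient descent. It uses only NumPy, the toy data and learning rate are invented for illustration, and the gradients are written out by hand (a deep-learning framework would compute them automatically via backpropagation).

```python
import numpy as np

# Toy regression data (invented for illustration): y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * X + 2.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0   # parameters (theta) of a single linear neuron
lr = 0.1          # learning rate (alpha in the update rule)

for epoch in range(200):
    # 1. Forward pass: propagate inputs through the "network" to get predictions.
    y_hat = w * X + b

    # 2-3. Compute the loss (MSE): a single scalar measuring the discrepancy.
    loss = np.mean((y - y_hat) ** 2)

    # 4. Backward pass: gradients of the loss w.r.t. each parameter,
    #    written out by hand here (frameworks derive these via backpropagation).
    grad_w = np.mean(-2.0 * (y - y_hat) * X)
    grad_b = np.mean(-2.0 * (y - y_hat))

    # 5-6. Gradient-descent update: theta <- theta - lr * gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}, final MSE={loss:.4f}")
```

As training proceeds, the loss shrinks and \(w\) and \(b\) approach the values used to generate the data, which is the convergence behavior described in step 8.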
Loss functions for regression and classification can be quite different. Since our problem is a regression problem, let's look at some important loss functions for that application; a short sketch implementing each of them follows the list.
1. **Mean Squared Error (L2 Loss)**:
- **Definition**:
- Mean Squared Error (MSE) is the most common loss function for regression problems. It measures the average of the squared differences between predicted and actual values.
- **Mathematical Formulation**:
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
where \(n\) is the number of samples, \(y_i\) is the true value, and \(\hat{y}_i\) is the predicted value for sample \(i\).
2. **Huber Loss**:
- **Definition**:
- Huber Loss is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than MSE. It is commonly used in regression problems.
- **Mathematical Formulation**:
\[
L_{\delta}(y, \hat{y}) = \begin{cases}
\frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\
\delta(|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise}
\end{cases}
\]
where \(y\) is the true value, \(\hat{y}\) is the predicted value, and \(\delta\) is a hyperparameter that controls the threshold for switching between quadratic and linear loss.
3. **Pseudo-Huber Loss**:
- **Definition**:
- Pseudo-Huber Loss is a smooth approximation of Huber Loss. It smoothly transitions from quadratic to linear as the absolute error increases.
- **Mathematical Formulation**:
\[
L_{\delta}(y, \hat{y}) = \delta^2 \left[ \sqrt{1 + \left(\frac{y - \hat{y}}{\delta}\right)^2} - 1 \right]
\]
where \(y\) is the true value, \(\hat{y}\) is the predicted value, and \(\delta\) is a hyperparameter that controls the smoothness of the transition.
4. **Welsch (Leclerc) Loss**:
- **Definition**:
- Welsch (Leclerc) Loss is designed to be less sensitive to outliers than MSE. It is based on a negated Gaussian of the error, so the loss saturates at \(\beta^2\) as the absolute error grows.
- **Mathematical Formulation**:
\[
L_{\beta}(y, \hat{y}) = \beta^2 \left(1 - e^{-(y - \hat{y})^2 / \beta^2}\right)
\]
where \(y\) is the true value, \(\hat{y}\) is the predicted value, and \(\beta\) is a hyperparameter that controls the sensitivity to large errors.
5. **Geman-McClure Loss**:
- **Definition**:
- Geman-McClure Loss is another loss function that is robust to outliers. It is a rational function of the squared error that saturates for large errors, so outliers have bounded influence.
- **Mathematical Formulation**:
\[
L_{\alpha}(y, \hat{y}) = \frac{(y - \hat{y})^2}{(y - \hat{y})^2 + \alpha^2}
\]
where \(y\) is the true value, \(\hat{y}\) is the predicted value, and \(\alpha\) is a hyperparameter that controls the sensitivity to large errors.
6. **Cauchy Loss**:
- **Definition**:
- Cauchy Loss (also known as the Lorentzian loss) is robust to outliers; it grows only logarithmically with the squared error and corresponds to assuming Cauchy-distributed residuals.
- **Mathematical Formulation**:
\[
L_{\gamma}(y, \hat{y}) = \gamma^2 \log\left(1 + \left(\frac{y - \hat{y}}{\gamma}\right)^2\right)
\]
where \(y\) is the true value, \(\hat{y}\) is the predicted value, and \(\gamma\) is a hyperparameter that controls the sensitivity to large errors.
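To make the formulas above easier to compare, here is a rough sketch of each loss as a small NumPy function returning the mean loss over a batch (the function names and the default values of \(\delta\), \(\beta\), \(\alpha\), and \(\gamma\) are illustrative choices, not fixed conventions):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    # quadratic for small residuals, linear beyond delta
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

def pseudo_huber(y, y_hat, delta=1.0):
    return np.mean(delta ** 2 * (np.sqrt(1.0 + ((y - y_hat) / delta) ** 2) - 1.0))

def welsch(y, y_hat, beta=1.0):
    return np.mean(beta ** 2 * (1.0 - np.exp(-((y - y_hat) ** 2) / beta ** 2)))

def geman_mcclure(y, y_hat, alpha=1.0):
    r2 = (y - y_hat) ** 2
    return np.mean(r2 / (r2 + alpha ** 2))

def cauchy(y, y_hat, gamma=1.0):
    return np.mean(gamma ** 2 * np.log1p(((y - y_hat) / gamma) ** 2))
```

Evaluating these on a batch that contains one large outlier shows the intended behavior: MSE is dominated by the outlier, Huber and pseudo-Huber penalize it only linearly, and Welsch, Geman-McClure, and Cauchy saturate so that the outlier's contribution is bounded (or grows only logarithmically in Cauchy's case).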
We can generalize the loss to a function of the form:
\[
\rho(x, \alpha) = \frac{|2-\alpha|}{\alpha}\left( \left(\frac{x^2}{|2-\alpha|}+1\right)^{\alpha/2} - 1\right)
\]
Here \(\alpha\) is a shape parameter that selects the loss: \(\alpha \to 2\) gives the L2 (MSE) loss, \(\alpha = 1\) the pseudo-Huber (smoothed L1) loss, \(\alpha \to 0\) the Cauchy loss, \(\alpha = -2\) the Geman-McClure loss, and \(\alpha \to -\infty\) the Welsch loss. Rather than fixing \(\alpha\) by hand, it can be learned during training to produce an adaptive loss function via maximum likelihood estimation: the loss is interpreted as the negative log-likelihood of a probability distribution, and we minimize that negative log-likelihood.
\[
\mathrm{NLL}(x, \alpha) = \rho(x, \alpha) + \log Z(\alpha)
\]
where \(Z(\alpha)\) is the normalization constant (partition function) of the corresponding distribution. The network parameters \(\theta\) and the shape parameter \(\alpha\) are then fit jointly by minimizing \(\sum_i \mathrm{NLL}(x_i, \alpha)\) over \(\theta\) and \(\alpha\); the \(\log Z(\alpha)\) term keeps the objective a proper negative log-likelihood, so the optimizer cannot trivially shrink the loss by making \(\alpha\) ever more robust.
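A minimal sketch of this general loss in NumPy, assuming the scale is fixed to 1 as in the formula above (the function name is hypothetical, and because the closed form is singular at \(\alpha = 2\) and \(\alpha = 0\), those shapes use their analytic limits):

```python
import numpy as np

def general_robust_loss(x, alpha):
    """General robust loss rho(x, alpha) with the scale fixed to 1.

    x is the residual y - y_hat; alpha selects the shape of the loss.
    """
    x = np.asarray(x, dtype=float)
    if alpha == 2.0:
        return 0.5 * x ** 2                 # limit alpha -> 2: L2 / MSE
    if alpha == 0.0:
        return np.log1p(0.5 * x ** 2)       # limit alpha -> 0: Cauchy
    b = abs(2.0 - alpha)
    return (b / alpha) * ((x ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)
```

For example, `general_robust_loss(x, 1.0)` reproduces a pseudo-Huber-style loss and `general_robust_loss(x, -2.0)` a Geman-McClure-style loss. In the fully adaptive setting, \(\alpha\) becomes a trainable parameter and the \(\log Z(\alpha)\) term is added to the objective; \(Z(\alpha)\) generally has no simple closed form and is approximated numerically in the paper, so this sketch covers only the fixed-\(\alpha\) case.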
RESOURCES:
Paper on adaptive loss function: https://arxiv.org/abs/1701.03077
CVPR paper presentation: "A General and Adaptive Robust Loss F...
Regression Loss Functions: https://alexisalulema.com/2017/12/07/...
ML cheat sheet for loss functions: https://ml-cheatsheet.readthedocs.io/...
Modeling the Huber loss: https://www.textbook.ds100.org/ch/10/...
Notes on Subgradients: https://see.stanford.edu/materials/ls...
Code to get up to speed: https://scikit-learn.org/stable/modul...