10 Most Common (and Must-Know) Loss Functions in ML

...depicted in a single frame.

Jun 22, 2023

Loss functions are a key component of ML algorithms.

They specify the objective an algorithm should aim to optimize during its training. In other words, loss functions tell the algorithm what it should be trying to minimize or maximize in order to improve its performance.

Model training with an objective function

Therefore, knowing the most common loss functions in machine learning is crucial.

The above visual depicts the most commonly used loss functions for regression and classification tasks.

Regression

Mean Bias Error
1. Captures the average bias in the prediction.
2. However, it is rarely used in training ML models.
3. This is because negative errors may cancel positive errors—leading to zero loss and consequently, no weight updates.
4. Mean Bias Error is foundational to the more advanced regression losses discussed below.
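
The cancellation problem is easy to see with a few numbers. Below is a minimal sketch in plain Python; the function name `mean_bias_error`, the sign convention (prediction minus target), and the sample values are assumptions made for illustration.

```python
def mean_bias_error(y_true, y_pred):
    """Average signed error (prediction minus target)."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
y_pred = [5.0, 5.0, 5.0]  # signed errors: +2, 0, -2

mbe = mean_bias_error(y_true, y_pred)
print(mbe)  # 0.0, even though two of the three predictions are off by 2
```

A loss of zero here would signal "nothing to fix" to an optimizer, which is why the absolute or squared error is preferred for training.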
Mean Absolute Error
1. Measures the average absolute difference between predicted and actual values.
2. Also called L1 Loss.
3. Positive errors and negative errors don’t cancel out.
4. One caveat is that the gradient has the same magnitude for every nonzero error, so small errors drive the weight updates as strongly as big ones.
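
To make the gradient point concrete, here is a small sketch that computes MAE and its (sub)gradient with respect to each prediction. The helper names `mae` and `mae_grad` and the sample values are hypothetical, chosen only for illustration.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error (L1 loss)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae_grad(y_true, y_pred):
    """Subgradient of MAE w.r.t. each prediction: sign(error) / n."""
    n = len(y_true)
    def sign(x):
        return (x > 0) - (x < 0)
    return [sign(p - t) / n for t, p in zip(y_true, y_pred)]

y_true = [0.0, 0.0]
y_pred = [0.1, 10.0]  # one tiny error, one huge error

loss = mae(y_true, y_pred)
grads = mae_grad(y_true, y_pred)
print(loss)   # about 5.05
print(grads)  # [0.5, 0.5]: same gradient magnitude for both errors
```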
Mean Squared Error
1. Measures the average squared difference between predicted and actual values.
2. Also called L2 Loss.
3. Larger errors contribute more significantly than smaller errors.
4. The above point is also a caveat: squaring the errors makes the loss sensitive to outliers.
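
A quick way to see the outlier sensitivity is to compare MSE and MAE on the same data with and without one bad prediction. This is a sketch with made-up numbers; the function names are illustrative.

```python
def mse(y_true, y_pred):
    """Mean Squared Error (L2 loss)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error, for comparison."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]      # every error is 0.5
outlier = [1.5, 2.5, 3.5, 14.0]   # last error is 10

mse_clean, mse_outlier = mse(y_true, clean), mse(y_true, outlier)
mae_clean, mae_outlier = mae(y_true, clean), mae(y_true, outlier)

print(mse_clean, mse_outlier)  # 0.25 25.1875
print(mae_clean, mae_outlier)  # 0.5 2.875
```

One outlier inflates MSE by roughly a factor of 100 here, while MAE grows by less than a factor of 6.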
Root Mean Squared Error
1. Mean Squared Error with a square root.
2. Loss and the dependent variable (y) have the same units.
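
As a small illustration of the units point, suppose the targets are prices in dollars (an assumed example, not from the post). MSE then comes out in dollars squared, while RMSE is back in dollars.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error."""
    mse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [100.0, 200.0]  # dollars
y_pred = [110.0, 180.0]  # errors: +10 and -20 dollars

# MSE = (100 + 400) / 2 = 250 (dollars squared)
value = rmse(y_true, y_pred)
print(value)  # sqrt(250), roughly 15.81 dollars
```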
Huber Loss
1. It is a combination of Mean Absolute Error and Mean Squared Error.
2. For smaller errors → Mean Squared Error.
3. For larger errors → Mean Absolute Error.
4. Offers advantages of both.
5. For smaller errors, Mean Squared Error is used, which is differentiable throughout (unlike MAE, which is non-differentiable at x=0).
6. For larger errors, Mean Absolute Error is used, which is less sensitive to outliers.
7. One caveat is that it is parameterized by a threshold (commonly written δ) that separates "small" from "large" errors, which adds another hyperparameter to tune.
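
The piecewise behaviour can be sketched as follows, using the common formulation of Huber loss. The threshold name `delta` and its default of 1.0 are assumptions for this example.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value.

    Quadratic (MSE-like) when |error| <= delta,
    linear (MAE-like) when |error| > delta.
    """
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

small = huber(0.5)   # quadratic branch: 0.5 * 0.25
large = huber(3.0)   # linear branch: 1.0 * (3.0 - 0.5)
print(small, large)  # 0.125 2.5
```

The two branches meet at `|error| == delta` with the same value and slope, which is what keeps the loss smooth at the switch.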
Log Cosh Loss
1. For small errors, log cosh loss is approximately:
  \(\frac{x^{2}}{2}\)
2. For large errors, log cosh loss is approximately:
  \(|x| - \log(2)\)
3. Thus, it is very similar to Huber loss.
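4. Also, unlike Huber loss, it has no hyperparameter to tune.
5. The only caveat is that it is a bit computationally expensive, since it evaluates a logarithm and a hyperbolic cosine for every error.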
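
Both approximations can be checked numerically. The sketch below uses a rewritten form of log(cosh(x)) that avoids overflow for large inputs; that rewrite is an implementation choice made here, not something from the post.

```python
import math

def log_cosh(x):
    """log(cosh(x)), computed as |x| + log(1 + exp(-2|x|)) - log(2).

    This is algebraically equal to log(cosh(x)) but does not overflow
    for large |x|, unlike evaluating cosh(x) directly.
    """
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

small_x, large_x = 0.1, 10.0

# Small error: close to x^2 / 2
print(log_cosh(small_x), small_x ** 2 / 2)

# Large error: close to |x| - log(2)
print(log_cosh(large_x), abs(large_x) - math.log(2.0))
```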

Classification

Binary Cross Entropy (BCE)
1. A loss function used for binary classification tasks.
2. Measures the dissimilarity between predicted probabilities and true binary labels, through the logarithmic loss.
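
Here is a minimal sketch of the log loss for binary labels. The clipping constant `eps` is an assumption added to avoid taking log(0); the sample probabilities are made up.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy for labels in {0, 1} and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1], [0.9])  # -log(0.9)
confident_wrong = binary_cross_entropy([1], [0.1])  # -log(0.1)
print(confident_right, confident_wrong)
```

A confident wrong prediction is penalized far more heavily than a confident correct one, which is the behaviour the logarithm provides.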
Hinge Loss
1. Penalizes both wrong and right (but less confident) predictions.
2. It is based on the concept of margin, which represents the distance between a data point and the decision boundary.
3. The larger the margin, the more confident the classifier is about its prediction.
4. Primarily used to train Support Vector Machines (SVMs).
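
The three cases (confident and right, right but not confident, wrong) can be sketched with the standard hinge formulation. It assumes labels in {-1, +1} and a raw classifier score; the margin of 1 is the usual convention.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw classifier score."""
    return max(0.0, 1.0 - y * score)

confident_right = hinge(+1, 2.5)   # beyond the margin: no penalty
unsure_right = hinge(+1, 0.3)      # correct side, but inside the margin
wrong = hinge(+1, -1.0)            # wrong side of the decision boundary

print(confident_right, unsure_right, wrong)
```

Only predictions that are both correct and outside the margin get zero loss; everything else is penalized linearly.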
Cross-Entropy Loss
1. An extension of Binary Cross Entropy loss to multi-class classification tasks.
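
With one-hot labels, the multi-class loss reduces to the negative log of the probability assigned to the true class. The sketch below assumes that setting; the function signature and `eps` guard are illustrative choices.

```python
import math

def cross_entropy(true_class, probs, eps=1e-12):
    """Cross entropy for one sample with a one-hot label.

    true_class: index of the correct class
    probs: predicted probabilities over all classes (assumed to sum to 1)
    """
    return -math.log(max(probs[true_class], eps))

three_class = cross_entropy(0, [0.7, 0.2, 0.1])  # -log(0.7)

# With two classes this matches binary cross entropy for y=1, p=0.8
two_class = cross_entropy(1, [0.2, 0.8])          # -log(0.8)
print(three_class, two_class)
```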
KL Divergence
1. Measures the divergence between the predicted and true probability distributions.
2. For classification, minimizing KL divergence is the same as minimizing cross entropy, because the two differ only by the entropy of the true distribution, which does not depend on the model.
3. Thus, it is recommended to use cross-entropy loss because it is simpler to compute.
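
The relationship between the two losses is KL(q || p) = H(q, p) - H(q), where q is the true distribution and p the predicted one. A short numerical check, with made-up distributions and illustrative function names:

```python
import math

def entropy(q):
    """H(q) = -sum q_i log q_i, skipping zero-probability entries."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """H(q, p) = -sum q_i log p_i."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl_divergence(q, p):
    """KL(q || p) = sum q_i log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.7, 0.2, 0.1]   # true distribution
p = [0.5, 0.3, 0.2]   # predicted distribution
gap = kl_divergence(q, p) - (cross_entropy(q, p) - entropy(q))
print(gap)  # zero up to floating-point rounding

# With a one-hot label the entropy term is 0, so the two losses coincide
one_hot = [1.0, 0.0, 0.0]
kl_one_hot = kl_divergence(one_hot, p)
ce_one_hot = cross_entropy(one_hot, p)
print(kl_one_hot, ce_one_hot)
```

Since H(q) is fixed by the labels, both losses have the same gradients with respect to the model's predictions.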

👉 Over to you: What other common loss functions have I missed?
