Review

A Survey and Taxonomy of Loss Functions in Machine Learning

lastminute.com Group, Vicolo de’ Calvi 2, 6830 Chiasso, Switzerland
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
AI 2026, 7(4), 128; https://doi.org/10.3390/ai7040128
Submission received: 3 February 2026 / Revised: 11 March 2026 / Accepted: 24 March 2026 / Published: 1 April 2026
(This article belongs to the Special Issue Advances and Applications in Graph Neural Networks (GNNs))

Abstract

Most state-of-the-art machine learning techniques revolve around the optimization of loss functions, making the choice of an objective critical to model performance and reliability. Although recent reviews discuss loss functions in specific domains or in deep learning settings, there is still no single reference that presents widely used losses across major task families within a unified formal setting and with consistent optimization-relevant property annotations. In this survey, we compile and systematize the most widely adopted loss functions for regression, classification, generative modeling, ranking, energy-based modeling, and relational learning. Our selection procedure combines seeding from foundational textbooks and prior surveys with cross-checking of highly cited literature and common implementations in mainstream machine learning frameworks. We introduce 52 loss functions and organize them into an intuitive taxonomy, summarizing their theoretical motivation, key mathematical properties, and typical application contexts, with compact appendix tables for quick lookup. This survey is intended as a resource for undergraduate, graduate, and Ph.D. students, as well as researchers seeking a structured reference for selecting and comparing loss functions.

1. Introduction

In the last few decades, there has been an explosion in interest in machine learning [1,2]. This field focuses on the definition and application of algorithms that can be trained on data to model underlying patterns [3,4,5,6]. Machine learning approaches can be applied to many different research fields, including biomedical science [7,8,9,10], natural language understanding [11,12,13], anomaly detection [14], image classification [15], database knowledge discovery [16], robot learning [17], online advertising [18], time series forecasting [19], brain–computer interfacing [20], and many more [21]. To train these algorithms, it is necessary to define an objective function, which gives a scalar measure of the algorithm’s performance [3,22]. They can then be trained by optimizing the value of the objective function.
Within the machine learning literature, such objective functions are usually defined in the form of loss functions, which are optimal when they are minimized. The exact form of the loss function depends on the nature of the problem to be solved, the data available, and the type of machine learning algorithm being optimized. Finding appropriate loss functions is, therefore, one of the most important research endeavors in machine learning.
As machine learning has progressed, many loss functions have been introduced to address diverse tasks and application settings. Several surveys now provide systematic coverage of loss functions, particularly in deep learning contexts. For example, Terven et al. [23] survey loss functions together with evaluation metrics across modern deep learning pipelines, while Li et al. [24] provide a deep-learning-focused review with an explicit screening procedure and a dedicated category for metric losses. Earlier broad surveys (e.g., Wang et al. [25]) and domain-focused reviews (e.g., in face recognition or semantic segmentation [26,27]) also offer valuable summaries, but they differ in scope by either combining losses with evaluation metrics, focusing on narrower application domains, or predating several now-common objectives in modern generative and representation learning settings.
In this context, there remains a need for a single reference that focuses only on loss functions, presents them within a unified formal setting with consistent notation, and connects them through an interpretable taxonomy spanning both classical and modern machine learning objectives. The increasing fragmentation and specialization of the field means that new losses are often developed within specific communities, for example, generative modeling, ranking and retrieval, energy-based modeling, or graph representation learning, and are not always presented within a shared framework.
To improve transparency, we followed a structured selection procedure. We first seeded candidate losses from foundational textbooks and prior surveys, then expanded the list by inspecting highly cited papers associated with each task family and by checking which objectives are implemented and commonly used in major machine learning frameworks (e.g., scikit-learn [v 1.7], PyTorch [v 2.7], and TensorFlow [v 2.16]). We focused primarily on widely adopted objectives in the modern literature, while retaining seminal earlier works when they define canonical losses. This process resulted in 52 loss functions included in the main text and summarized in Appendix A (Table A1).
We included a loss function if it (i) has a clear mathematical definition, (ii) is widely used across multiple works or appears as a standard objective in common machine learning frameworks, and (iii) plays a central role in at least one major task family in our taxonomy (regression, classification, generative modeling, ranking, energy-based modeling, and relational learning). Highly specialized or very recent proposals with limited adoption were excluded unless they instantiate a general pattern that transfers across tasks.
For this reason, we have worked to build a taxonomy of loss functions that highlights the advantages and disadvantages of each technique. We hope this will be useful for new users who want to familiarize themselves with common loss functions and identify those most suitable for a problem they are trying to solve. We also hope this summary will serve as a comprehensive reference for advanced users, allowing them to quickly compare alternatives without having to search broadly across the literature. More generally, the survey is intended to help researchers contextualize newly proposed objectives by showing whether they fit within an existing category or instead represent a genuinely new direction.
Overall, we include 52 widely used loss functions. In each section, we organize the losses according to the broad class of tasks they address, define each loss mathematically, and discuss its most common applications, advantages, and drawbacks. Rather than isolating common machine learning challenges into a separate section, we discuss them directly within the presentation of each loss function whenever they motivate its design or use. In particular, issues such as class imbalance, sensitivity to outliers, and robustness to noisy data are addressed contextually alongside the mathematical definition and practical interpretation of the corresponding losses.
The main contribution of this work is the proposed taxonomy depicted in Figure 1. Each loss function is first divided according to the task on which it is exploited: regression, classification, ranking, generative modeling, energy-based modeling, and relational learning. Finally, we classify each loss function by its underlying strategy, such as error minimization, probabilistic formalization, or margin maximization.
This work is organized as follows: In Section 2, we provide a formal definition of a loss function and introduce our taxonomy. In Section 3, we describe the most common regularization methods used to reduce model complexity. In Section 4, we describe the regression task and the key loss functions used to train regression models. In Section 5, we introduce the classification problem and the associated loss functions. In Section 6, we present generative models and their losses. Ranking problems and their loss functions are introduced in Section 7, and energy-based models and their losses are described in Section 8. (While energy-based models (EBMs) are often grouped with generative models, we treat them separately because they learn unnormalized energy functions (rather than normalized likelihoods), use distinct losses (often contrastive or margin-based), and are applied in both generative and discriminative settings.) Relational learning losses are presented in Section 9. Finally, we draw conclusions in Section 10.

2. Definition of the Loss Function Taxonomy

In a general machine learning problem, the aim is to learn a function f that transforms an input, defined by the input space Φ , into a desirable output, defined by the output space Y :
f : \Phi \rightarrow Y
where f is a function that can be approximated by a model, f Θ , parameterized by the parameters Θ .
Given a set of inputs { x_0 , …, x_N } ⊆ Φ, they are used to train the model with reference to target variables in the output space, { y_0 , …, y_N } ⊆ Y. Notice that, in some cases (such as autoencoders), Y = Φ.
A loss function, L, is defined as a mapping of f(x_i) with its corresponding y_i to a real number l ∈ ℝ, which captures the similarity between f(x_i) and y_i. Aggregating over all the points of the dataset, we find the overall loss, ℒ:
\mathcal{L}(f \mid \{x_0, \ldots, x_N\}, \{y_0, \ldots, y_N\}) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)
The optimization problem to be solved is defined as:
\min_{f} \; \mathcal{L}(f \mid \{x_0, \ldots, x_N\}, \{y_0, \ldots, y_N\})
Notice that it is often convenient to explicitly introduce a regularization term (R) that maps f to a real number r ∈ ℝ. This term is usually used for penalizing the complexity of the model in the optimization [3]:
\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i) + R(f)
In practice, the family of functions chosen for the optimization can be parameterized by a parameter vector Θ , which allows the minimization to be defined as an exploration in the parameter space:
\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N} L(f_{\Theta}(x_i), y_i) + R(\Theta)
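As a minimal numerical illustration of this parameterized objective, the following sketch (plain NumPy with a linear model; the helper names are ours, not from any framework) evaluates the averaged data loss plus a weighted regularization term:

```python
import numpy as np

def regularized_loss(theta, X, y, loss_fn, reg_fn, lam=0.1):
    """Average per-example loss plus a weighted regularization term,
    mirroring (1/N) * sum_i L(f_Theta(x_i), y_i) + R(Theta)."""
    preds = X @ theta  # f_Theta(x_i) for a linear model
    data_term = np.mean([loss_fn(p, t) for p, t in zip(preds, y)])
    return data_term + lam * reg_fn(theta)

# Squared-error loss and an L2 regularizer as concrete choices
squared_error = lambda p, t: (p - t) ** 2
l2_penalty = lambda th: np.sum(th ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])

# Perfect fit on this toy data: the data term is 0, only the penalty remains
val = regularized_loss(theta, X, y, squared_error, l2_penalty, lam=0.1)
```

With a perfect fit, the objective reduces to the penalty λ‖Θ‖₂² = 0.1 · (1 + 4) = 0.5, making the fit/complexity trade-off explicit.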

2.1. Optimization Techniques for Loss Functions

2.1.1. Loss Functions and Optimization Methods

In this section, we list the most common mathematical properties that a loss may or may not satisfy, and then we briefly discuss the main optimization methods employed to minimize them. For the sake of simplicity, visualization, and understanding, we define such properties in a two-dimensional space, but they can be easily generalized to a d-dimensional one.
  • Continuity (CONT): A real-valued function, that is a function from a subset of the real numbers to the real numbers, can be represented by a graph in the Cartesian plane; such a function is continuous if the graph is a single unbroken curve belonging to the real domain. A more mathematically rigorous definition can be given in terms of limits: a function f with variable x is continuous at the point c if
    \lim_{x \to c} f(x) = f(c).
  • Differentiability (DIFF): A differentiable function f on a real variable is a function derivable at each point of its domain. A differentiable function is smooth, in the sense that it is locally well approximated by a linear function, and does not contain any break, angle, or cusp. A continuous function is not necessarily differentiable, but a differentiable function is necessarily continuous. More formally, in one dimension, f is differentiable at c if the following limit exists:
    f'(c) = \lim_{h \to 0} \frac{f(c+h) - f(c)}{h}.
  • Lipschitz Continuity (L-CONT): A Lipschitz continuous function is limited in how fast it can change. More formally, there exists a real constant L ≥ 0 such that, for every pair of points in the domain,
    |f(x) - f(y)| \le L \, \|x - y\|, \quad \forall x, y \in \mathrm{dom}(f),
    where L is called the Lipschitz constant of the function.
    To understand the robustness of a model, such as a neural network, some research papers [28,29] have tried to train the underlying model by defining an input–output map with a small Lipschitz constant. The intuition is that if a model is robust, it should not be too affected by perturbations in the input, f(x + δx) ≈ f(x), and this would be ensured by having f be L-Lipschitz with a small constant L [30].
  • Convexity (CONVEX): A real-valued function f is convex if each segment between any two points on the graph of the function lies above the graph between the two points. More formally, f is convex if for all x , y in its domain and all λ [ 0 , 1 ] ,
    f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).
    Convexity is a key feature since any local minimum of a convex function is also a global minimum. Whenever the second derivative exists, convexity is easy to check, since the Hessian of the function must be positive semi-definite.
  • Strict Convexity (S-CONV): A real-valued function is strictly convex if the segment between any two points on the graph of the function lies strictly above the graph between the two points, except at the intersection points between the straight line and the curve. More formally, for all distinct x , y in the domain and all λ ( 0 , 1 ) ,
    f(\lambda x + (1 - \lambda) y) < \lambda f(x) + (1 - \lambda) f(y).
    Strict convexity implies that, if a minimizer exists, it is unique. If the Hessian exists and is positive definite, this is a sufficient condition for strict convexity.

2.1.2. Relevant Optimization Methods

An optimization method is a technique that, given a formalized optimization problem with an objective function, returns a solution attaining (or approximating) the optimal value of that problem. Most of the optimization methods presented in this work rely on algorithms that do not guarantee the optimality of the solution but instead provide a degree of approximation.
For each optimization method, we specify the mathematical properties that the loss function must satisfy, such as continuity, differentiability, or convexity. These requirements are listed in the headings of each method, using blue to indicate the necessary properties for the method’s usability and red for properties that provide optimality guarantees.
  • Closed-Form Solutions (DIFF, S-CONV): These are systems of equations solvable analytically, where values of Θ make the derivative of the loss function equal to zero. To guarantee a unique closed-form solution, the loss function must be differentiable (DIFF) and strictly convex (S-CONV), ensuring a single global minimum. Closed-form solutions are highly efficient and desirable where feasible; however, they are often impractical for complex models or high-dimensional parameter spaces. Therefore, closed-form solutions are primarily used in simpler, linear models or settings where the loss is quadratic or log-likelihood based, as in linear regression or Gaussian MLE problems.
  • Gradient Descent (DIFF, CONVEX): Gradient descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. The loss function must be at least differentiable (DIFF) to compute gradients, and if the loss is convex (CONVEX), the local minimum is also the global minimum. Lipschitz continuity (L-CONT) can improve convergence guarantees, as it limits how quickly the function can change, but it is not strictly necessary for gradient descent. For non-differentiable losses, techniques like subgradients or gradient approximations can be employed [31,32]. The algorithm for gradient descent is formalized in Algorithm 1. In each iteration, it calculates the gradient of the loss function with respect to the current parameters, ∇ℒ(Θ(t)), and updates those parameters by taking a step of size α (the learning rate) in the direction opposite to the gradient, descending toward the minimum.
Algorithm 1 Gradient Descent
Input: initial parameters Θ(0), number of iterations T, learning rate α
Output: final parameters Θ(T)
1: t = 0
2: while t < T do
3:     estimate ∇ℒ(Θ(t))
4:     compute ΔΘ(t) = −∇ℒ(Θ(t))
5:     Θ(t+1) := Θ(t) + α ΔΘ(t)
6:     t := t + 1
7: end while
8: return Θ(T)
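Algorithm 1 can be sketched in a few lines of Python (a minimal version, assuming access to the gradient ∇ℒ as a callable; the function names are ours):

```python
import numpy as np

def gradient_descent(grad, theta0, T=100, alpha=0.1):
    """Plain gradient descent: theta_{t+1} = theta_t - alpha * grad(theta_t),
    i.e., a step of size alpha in the direction opposite to the gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta - alpha * grad(theta)
    return theta

# Example: minimize the convex quadratic f(theta) = ||theta - 3||^2,
# whose gradient is 2 * (theta - 3); the unique minimizer is theta = 3.
grad = lambda th: 2.0 * (th - 3.0)
theta_star = gradient_descent(grad, theta0=[0.0], T=200, alpha=0.1)
```

Because the example objective is convex, the iterates converge to the global minimizer, consistent with the CONVEX guarantee discussed above.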
  • Stochastic Gradient Descent (SGD) (DIFF, CONVEX): SGD [3] is a stochastic approximation of gradient descent that computes the gradient from a randomly selected subset of the data instead of the entire dataset. This reduces the computational cost in high-dimensional problems, such as neural networks, and helps avoid local minima due to the stochastic nature of the updates. Like gradient descent, SGD requires the loss function to be differentiable (DIFF). Convexity (CONVEX) ensures that the global minimum is reachable, but even for non-convex functions, SGD can often find useful minima in practice. Lipschitz continuity (L-CONT) can improve the convergence rate, but is not required.
  • Derivative-Free Optimization: In some cases, the derivative of the objective function may not exist or be difficult to compute. Derivative-free optimization methods, such as simulated annealing, genetic algorithms, and particle swarm optimization, can be employed [33,34]. While these methods do not strictly require continuity (CONT), having a continuous function typically improves the stability of the optimization process. Derivative-free methods can handle non-differentiable and non-convex losses, but they may struggle to scale to high-dimensional problems and can be computationally expensive.
  • Zeroth-Order Optimization (ZOO): ZOO optimization is a subset of derivative-free optimization that approximates gradients using function evaluations rather than direct computation of derivatives [35]. These methods are useful in black-box scenarios where the gradient is not accessible but can be estimated through perturbations. While continuity (CONT) is not required, it improves the accuracy of gradient approximations and helps achieve better convergence rates. ZOO methods are effective for non-differentiable and non-convex losses, and they have been applied in adversarial attack generation, model-agnostic explanations, and other black-box scenarios [36,37].
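As an illustration of the ZOO idea, the following sketch (our own minimal helper, not a specific library's API) estimates the gradient of a black-box function via central finite differences, using only function evaluations:

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-4):
    """Zeroth-order (two-point) gradient estimate: approximates each partial
    derivative of f from function values via central finite differences."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = mu  # perturb only coordinate j
        g[j] = (f(theta + e) - f(theta - e)) / (2 * mu)
    return g

# Black-box quadratic: the true gradient at theta is 2 * theta
f = lambda th: float(np.sum(th ** 2))
g = zo_gradient(f, [1.0, -2.0])  # close to [2.0, -4.0]
```

The resulting estimate can then be plugged into any first-order update rule, which is exactly how ZOO methods operate in black-box scenarios.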

2.2. The Proposed Taxonomy

Our taxonomy is summarized in Figure 1. To define it, we started by categorizing the losses depending on which machine learning problem they are best suited to solve. We have identified the following categories:
  • Regression;
  • Classification;
  • Generative modeling;
  • Ranking;
  • Energy-based modeling;
  • Relational learning.
We also made a distinction based on the mathematical concepts used to define the loss, obtaining the following sub-categories:
  • Error-based;
  • Probabilistic;
  • Margin-based.
Using this approach, we developed a compact and intuitive taxonomy. We employed well-established terminology to ensure that users can intuitively navigate and understand the taxonomy.

3. Regularization Methods

Regularization methods can be applied to almost all loss functions. They are employed to reduce model complexity, simplifying the trained model and reducing its propensity to overfit the training data [38,39,40]. Model complexity is usually measured by the number of parameters and their magnitude [3,40,41]. Many techniques fall under the umbrella of regularization methods, and a significant number of them are based on the augmentation of the loss function [3,38]. An intuitive justification for regularization is that it imposes Occam’s razor on the complexity of the final model. More theoretically, many loss-based regularization techniques are equivalent to imposing certain prior distributions on the model parameters.

3.1. Regularization by Loss Augmentation

One can design the loss function to penalize the magnitude of model parameters, thus learning the best trade-off between bias and variance of the model and reducing the generalization error without affecting the training error too much. This prevents overfitting, while avoiding underfitting, and can be done by augmenting the loss function with a term that explicitly controls the magnitude of the parameters, or implicitly reduces the number of them. The general way of augmenting a loss function to regularize the result is formalized in the following equation:
\hat{L}(f(x_i), y_i) = L(f(x_i), y_i) + \lambda \, \rho(\Theta)
where ρ ( Θ ) is called the regularization function and λ defines the amount of regularization (the trade-off between fit and generalization).
This general definition makes it clear that we can employ regularization on any of the losses proposed in this paper.
We are now going to describe the most common regularization methods based on loss augmentation.

3.1.1. L2-Norm Regularization

In L2 regularization, the loss is augmented to include the weighted L2 norm of the weights [3,42], so the regularization function is ρ(Θ) = ‖Θ‖₂²:
\hat{L}(f(x_i), y_i) = L(f(x_i), y_i) + \lambda \|\Theta\|_2^2
When this is employed in regression problems, it is also known as ridge regression [3,43].

3.1.2. L1-Norm Regularization

In L1 regularization, the loss is augmented to include the weighted L1 norm of the weights [3,42], so the regularization function is ρ(Θ) = ‖Θ‖₁:
\hat{L}(f(x_i), y_i) = L(f(x_i), y_i) + \lambda \|\Theta\|_1
When this is employed in regression problems, it is also known as lasso regression [3,44].

3.2. Comparison Between L 2 and L 1 Norm Regularizations

L 1 and L 2 regularizations are both based on the same concept of penalizing the magnitude of the weights composing the models. Despite this, the two methods have important differences in their employability and their effects on the result.
One of the most crucial differences is that L1, when optimized, can shrink weights exactly to 0, while L2 results in non-zero (smoothed) values [3,4,42,45,46]. This allows L1 to reduce the dimensions of a model’s parameter space and perform an implicit feature selection. Indeed, it has been shown in [45] that, by employing L1 regularization on logistic regression, the sample complexity (i.e., the number of training examples required to learn “well”) grows logarithmically in the number of irrelevant features. On the contrary, the authors show that any rotationally invariant algorithm (including logistic regression) with L2 regularization has a worst-case sample complexity that grows at least linearly in the number of irrelevant features. Moreover, L2 is more sensitive to outliers than the L1 norm, since it squares the error.
Both penalties are continuous, but the L1 norm is piecewise linear and non-differentiable at 0, whereas the L2 norm is differentiable everywhere, which has some strong implications. In particular, the L2 norm can be easily trained with gradient descent, while L1 sometimes cannot be efficiently optimized. The first problem is the inefficiency of applying the L1 penalty to the weights of all the features, especially when the dimension of the feature space is very large [47], producing a significant slowdown of the weight-updating process. Finally, the naive application of the L1 penalty in SGD does not always lead to compact models, because the approximate gradient used at each update can be very noisy, so the weights of the features can easily be moved away from zero by those fluctuations, and L1 loses its main advantage over L2 [47].
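The sparsity effect can be illustrated with the proximal (shrinkage) operators associated with the two penalties, as used in proximal gradient methods: the L1 step (soft-thresholding) sets small weights exactly to zero, while the L2 step only rescales them. A minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def prox_l1(theta, step):
    """Soft-thresholding: proximal operator of step * ||theta||_1.
    Any weight with |theta_j| <= step is set exactly to zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - step, 0.0)

def prox_l2(theta, step):
    """Proximal operator of (step/2) * ||theta||_2^2: uniform shrinkage
    toward zero, never reaching it for non-zero weights."""
    return theta / (1.0 + step)

theta = np.array([0.05, -0.3, 1.5])
l1_out = prox_l1(theta, step=0.1)  # small weight driven exactly to 0
l2_out = prox_l2(theta, step=0.1)  # all weights shrunk, none exactly 0
```

This mirrors the discussion above: the L1 update performs implicit feature selection, while the L2 update smoothly distributes shrinkage across all weights.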

4. Regression Losses

4.1. Problem Formulation and Notation

A regression model aims to predict the value of a continuous target variable y (the dependent variable) from one or more predictor variables x (the independent variables). More precisely, let f Θ be a generic model parameterized by Θ , which maps an input x i R D to a real-valued output,
f_{\Theta} : \mathbb{R}^D \rightarrow \mathbb{R}.
Given a dataset of input–output pairs { ( x i , y i ) } i = 1 N , the goal is to estimate the parameters Θ that best fit the data by minimizing a loss function L .
All the losses considered for regression are based on functions of the residuals, i.e., the difference between the observed value y i and the predicted value f Θ ( x i ) . In the following, we denote by f Θ ( x i ) the prediction corresponding to the input x i , and by y i the associated ground-truth target.
As highlighted by Figure 2, the mean bias error ( M B E ) can be regarded as a basic reference point for regression losses, from which several important variants can be derived. Among the most widely used are the mean absolute error ( M A E ), mean squared error ( M S E ), their regularized counterparts in linear modeling (lasso and ridge), and the root mean squared error ( R M S E ). In this section, we also introduce the Huber loss and the Smooth L 1 loss, which combine properties of both M A E and M S E . Finally, we present the log-cosh loss and the root mean squared logarithmic error loss.

4.2. Error-Based Losses for Regression

Since regression models aim to minimize the discrepancy between actual and predicted values, the associated loss functions are typically classified as error-based. These losses directly measure the magnitude of the residuals in order to improve predictive accuracy.

4.2.1. Mean Bias Error Loss (CONT, DIFF, CONVEX)

The most straightforward regression loss is the mean bias error (MBE), defined below. It measures the average signed residual and therefore captures whether a model tends, on average, to overestimate or underestimate the target variable. For this reason, MBE is mainly used as an evaluation measure rather than as a training objective. Indeed, positive and negative residuals may cancel each other out, leading to a deceptively small loss even when individual prediction errors are large [48,49,50].
The MBE loss is defined as
L_{MBE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right).
Several standard regression losses can be viewed as modifications of MBE that avoid error cancellation by transforming the residuals before aggregation. In particular, MAE, MSE, and log-cosh differ from MBE in how they penalize the signed prediction error.

4.2.2. Mean Absolute Error Loss (L-CONT, CONVEX)

The mean absolute error (MAE) loss, or L1 loss, is one of the most basic loss functions for regression. It measures the average absolute bias in the prediction. The absolute value overcomes the problem of the MBE, ensuring that positive errors do not cancel out negative ones. Therefore, each error contributes to the MAE in proportion to its absolute value. Notice that the contribution of the errors follows a linear behavior, meaning that many small errors are as important as one big error. This implies that the gradient magnitude does not depend on the error size, which may lead to convergence problems when the error is small. A model trained to minimize the MAE is most effective when the distribution of the target conditioned on the input is symmetric. It is important to highlight that the derivative of the absolute value at zero is not defined.
As with the MBE, the MAE is also used to evaluate the performance of models [51,52].
L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - f(x_i) \right|

4.2.3. Mean Squared Error Loss (CONT, DIFF, CONVEX)

The mean squared error (MSE) loss, or L2 loss, is the average of the squared differences between observed values y and predicted values y ^ . It is widely used in regression tasks. The squared term ensures that all errors are positive and amplifies the impact of outliers, making it suitable for problems where the noise in observations follows a normal distribution.
The MSE loss is defined as:
L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2
One key drawback of MSE is its sensitivity to outliers, as large errors have a disproportionately high influence due to the squaring of residuals.
Interpretation as Maximum Likelihood Estimation (MLE)
From a probabilistic viewpoint, MSE can be derived as a form of maximum likelihood estimation (MLE) under the assumption that the errors between the predicted and observed values follow a Gaussian (normal) distribution with constant variance [42,53]. Minimizing MSE is equivalent to maximizing the likelihood of the observed data under this Gaussian noise assumption. This probabilistic interpretation explains why MSE is commonly used when the residuals are expected to follow a Gaussian distribution and highlights its role as a standard regression loss.
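Concretely, assuming y_i = f_Θ(x_i) + ε_i with i.i.d. Gaussian noise ε_i ~ N(0, σ²), the negative log-likelihood is a scaled sum of squared residuals plus a constant:

```latex
-\log p(y \mid X, \Theta)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - f_{\Theta}(x_i) \right)^2
  + \frac{N}{2} \log\!\left( 2\pi\sigma^2 \right)
```

Since the second term does not depend on Θ, minimizing the negative log-likelihood over Θ coincides with minimizing the MSE.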

4.2.4. Lasso Regression ( L 1 Regularization)

Lasso regression is derived from augmenting the MSE loss with an L 1 regularization term, as detailed in Section 3.1.2. The regularized loss function penalizes the absolute magnitude of the model parameters, leading to sparsity in the learned model by driving some of the weights to zero. This makes Lasso particularly useful for feature selection in high-dimensional datasets [3,44].
The loss function for Lasso regression is:
L_{Lasso}(f(x_i), y_i) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \|\Theta\|_1
The L1 term encourages sparsity, shrinking irrelevant model weights to zero, which implicitly performs feature selection. However, as noted in Section 3.1.2, lasso can struggle with correlated features, often selecting only one from a group of correlated variables. Additionally, the L1 penalty is non-differentiable at zero, which may pose challenges in optimization, particularly when using gradient-based methods.

4.2.5. Ridge Regression ( L 2 Regularization)

Ridge regression is an extension of MSE with an L 2 regularization term, as described in Section 3.1.1. The L 2 term penalizes the square of the model parameters, discouraging large coefficients and helping to mitigate overfitting in regression tasks, especially when there is multicollinearity (high correlation between features) [43].
The loss function for ridge regression is:
L_{Ridge}(f(x_i), y_i) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \|\Theta\|_2^2
Interpretation as Maximum A Posteriori Estimation (MAP)
From a Bayesian perspective, ridge regression can be interpreted as maximum a posteriori estimation (MAP) [53]. In this context, the L 2 regularization term corresponds to a Gaussian prior on the model parameters, with the objective being to maximize the posterior distribution of the parameters given the data. This probabilistic interpretation shows that ridge regression shrinks the model coefficients towards zero without eliminating them entirely, as occurs in lasso.
Ridge regression is particularly effective when dealing with correlated features, as it distributes the coefficient weights more smoothly across them. However, unlike lasso, ridge does not perform feature selection, retaining all input features with non-zero weights.
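Concretely, with Gaussian noise y_i = f_Θ(x_i) + ε_i, ε_i ~ N(0, σ²), and a Gaussian prior Θ ~ N(0, τ²I), a standard derivation gives

```latex
\hat{\Theta}_{\mathrm{MAP}}
  = \arg\max_{\Theta} \; \log p(y \mid X, \Theta) + \log p(\Theta)
  = \arg\min_{\Theta} \; \sum_{i=1}^{N} \left( y_i - f_{\Theta}(x_i) \right)^2
  + \frac{\sigma^2}{\tau^2} \, \|\Theta\|_2^2
```

so the ridge coefficient corresponds to λ = σ²/τ² (up to the 1/N scaling of the empirical loss): a tighter prior (smaller τ²) yields stronger shrinkage.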

4.2.6. Root Mean Squared Error Loss (CONT, DIFF, CONVEX)

The root mean squared error (RMSE) loss is directly related to the MSE and differs from it only by the presence of the square root term. Its main advantage is that it preserves the same units and scale as the target variable, which often makes its value easier to interpret in practical applications. Since the square root is a monotonic transformation, minimizing RMSE yields the same minimizer as minimizing MSE. However, the two objectives may induce different optimization dynamics because their gradients are scaled differently.
As with the previously presented losses, RMSE is also widely used as an evaluation metric for regression models [50,52], and it inherits the main limitation of MSE, namely its sensitivity to large residuals and outliers.
L_{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 }
Strictly speaking, RMSE is not differentiable when all residuals are exactly zero, although this is usually a negligible corner case in practice.
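To make the differences among these error-based losses concrete, the following sketch (plain NumPy, with an artificial outlier; the helper names are ours) computes MBE, MAE, MSE, and RMSE on the same residuals:

```python
import numpy as np

def mbe(y, pred):  return np.mean(y - pred)          # signed errors can cancel
def mae(y, pred):  return np.mean(np.abs(y - pred))  # linear penalty
def mse(y, pred):  return np.mean((y - pred) ** 2)   # quadratic penalty
def rmse(y, pred): return np.sqrt(mse(y, pred))      # same units as the target

y    = np.array([1.0, 2.0, 3.0, 10.0])
pred = np.array([2.0, 1.0, 3.0, 2.0])  # residuals: -1, 1, 0, 8

# The single large residual (8) dominates MSE/RMSE much more than MAE,
# while MBE understates the error because -1 and +1 cancel out.
errors = {"MBE": mbe(y, pred), "MAE": mae(y, pred),
          "MSE": mse(y, pred), "RMSE": rmse(y, pred)}
```

On this toy data, MBE = 2.0 despite two non-zero residuals canceling, MAE = 2.5, and MSE = 16.5 is dominated by the squared outlier, illustrating the sensitivity discussed above.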

4.2.7. Huber Loss and Smooth L 1 Loss (L-CONT, DIFF, CONVEX)

The Huber loss [54] is a regression loss that combines the main advantages of the MAE and the MSE. It behaves quadratically for small residuals and linearly for large residuals, making it less sensitive to outliers than the MSE while remaining differentiable at zero, unlike the MAE. It is parameterized by δ , which determines the transition point between the quadratic and linear regimes.
More precisely, when $|y_i - f(x_i)| \le \delta$, the Huber loss behaves like the MSE, whereas for larger residuals it behaves like the MAE. This makes it particularly useful in settings where robustness to outliers is needed without completely discarding the smooth optimization properties of squared-error losses.
$$L_{Huber} = \begin{cases} \frac{1}{2}\left(y_i - f(x_i)\right)^2 & \text{if } |y_i - f(x_i)| \le \delta, \\ \delta\,|y_i - f(x_i)| - \frac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$
The choice of δ is crucial, since it controls the balance between sensitivity to small residuals and robustness to large ones. The main limitation of the Huber loss is therefore the presence of this additional hyperparameter, which must be chosen according to the scale of the noise or the notion of outlier relevant to the application.
A specific case of the Huber loss is the smooth L 1 loss, obtained when δ = 1 . This loss has been shown to be particularly useful in tasks where a balance is needed between sensitivity to small residuals and robustness to outliers, such as object detection [55] and bounding-box regression in frameworks such as faster R-CNN [56].
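The piecewise definition above can be sketched per-residual in plain Python (the function name is illustrative; with the default `delta=1.0` this coincides with the smooth $L_1$ loss):

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear (slope delta) beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r ** 2
    # linear regime, shifted so the two pieces join continuously at r = delta
    return delta * r - 0.5 * delta ** 2
```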

4.2.8. Log-Cosh Loss (L-CONT, DIFF, S-CONV)

The log-cosh loss is defined as the logarithm of the hyperbolic cosine of the residual between the observed value y i and the predicted value f ( x i ) . It behaves similarly to the MSE for small residuals and to the MAE for large ones. Indeed, for small values of r,
$$\log(\cosh(r)) \approx \frac{r^2}{2},$$
whereas for large values of r,
$$\log(\cosh(r)) \approx |r| - \log(2).$$
This makes the log-cosh loss a smooth alternative that combines the local quadratic behavior of the MSE with the increased robustness of the MAE for large residuals.
$$L_{logcosh} = \frac{1}{N} \sum_{i=1}^{N} \log\left(\cosh\left(y_i - f(x_i)\right)\right)$$
The log-cosh loss shares many of the advantages of the Huber loss, but without requiring the manual selection of a threshold hyperparameter. On the other hand, it is computationally more expensive because of the hyperbolic cosine and logarithmic terms, and it is less flexible than the Huber loss since it does not allow explicit control of the transition between the quadratic and linear regimes.
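A direct evaluation of $\log(\cosh(r))$ overflows for large residuals, since $\cosh$ grows exponentially. The sketch below (illustrative, not from a specific library) uses the algebraically equivalent form $\log(\cosh(r)) = |r| + \log(1 + e^{-2|r|}) - \log 2$, which is numerically stable:

```python
import math

def log_cosh(r):
    """log(cosh(r)) computed in a numerically stable way.

    Uses log(cosh(r)) = |r| + log(1 + exp(-2|r|)) - log(2),
    which avoids overflow of cosh for large residuals.
    """
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)
```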

4.2.9. Root Mean Squared Logarithmic Error Loss (CONT, DIFF)

The root mean squared logarithmic error (RMSLE) loss, formalized in Equation (22), is the root mean squared error computed on the log-transformed observed values and predictions. By applying the logarithm to both y i and f ( x i ) , this loss emphasizes relative rather than absolute discrepancies, making it particularly useful when the target variable spans several orders of magnitude. The addition of 1 inside the logarithm ensures that zero-valued targets and predictions can be handled.
The RMSLE loss is defined as
$$L_{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log(y_i + 1) - \log(f(x_i) + 1) \right)^2}$$
Compared with RMSE, RMSLE dampens the effect of large residuals when both the predicted and observed values are large, due to the compressive effect of the logarithm. For this reason, it is often preferred when relative errors are more important than absolute ones, or when the target values exhibit a large dynamic range. However, RMSLE is not suitable when target values may be negative, since the logarithm is undefined in that case.
This loss is particularly useful in regression problems where proportional differences matter more than exact magnitude matching. On the other hand, it is less appropriate when accurate prediction of large absolute values is critical, since the logarithmic transformation reduces the influence of large deviations [57,58].
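A minimal sketch of RMSLE (illustrative names; `log1p` computes $\log(1+x)$ accurately near zero) makes the relative-error behavior easy to verify: a 2x miss costs roughly the same whether the target is 10 or 100.

```python
import math

def rmsle(y_true, y_pred):
    """RMSE on log(1 + x)-transformed targets and predictions.

    Requires non-negative values, since the logarithm is undefined otherwise.
    """
    n = len(y_true)
    s = sum((math.log1p(y) - math.log1p(p)) ** 2
            for y, p in zip(y_true, y_pred)) / n
    return math.sqrt(s)
```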

5. Classification Losses

5.1. Problem Formulation and Notation

Classification is a class of supervised learning problems in which the goal is to assign an input x to one or more discrete classes. This can be achieved by training a model f Θ with parameters Θ through the minimization of a suitable loss function L.
A first, general way to represent classification is to consider models that return a discrete label vector. In this case, the output space can be written as
$$f : \Phi \to \Lambda^K, \qquad \Lambda = \{0, 1\}.$$
This formulation also covers multi-label classification, since more than one label may be associated with the same sample, e.g., f ( x ) = [ 0 , 1 , 0 , 1 , 0 , 0 ] . To restrict the setting to single-label classification, it is sufficient to impose that exactly one label is active, i.e.,
$$\sum_{k=1}^{K} \lambda_k = 1.$$
We can also consider models with continuous outputs, where for each class k { 1 , , K } , the model returns a confidence score or probability p k ( x ) [ 0 , 1 ] :
$$f : \Phi \to P^K, \qquad P = [0, 1].$$
As before, to move from the multi-label to the single-label setting, the outputs must satisfy
$$\sum_{k=1}^{K} p_k(x) = 1,$$
so that the prediction can be interpreted as a probability distribution over the classes.
A more specific notation can be introduced for binary classification problems. This is useful here because many classical margin-based losses are originally defined in the binary setting. In this case, the label space is
$$Y = \{-1, +1\},$$
while the predictor is typically a real-valued scoring function
$$f : \Phi \to \mathbb{R},$$
with prediction given by y ^ = sign ( f ( x ) ) . This formulation is natural because margin-based losses depend on the quantity y f ( x ) .
Although these losses are most naturally introduced in the binary setting, they can be generalized to multi-class classification either by decomposing the problem into multiple binary tasks (e.g., one-vs.-rest or one-vs.-one) or by adopting direct multiclass formulations [59,60,61]. Multi-label classification is instead usually handled by learning one binary predictor per label or by using probabilistic objectives independently across labels.
In this survey, we divide classification losses into two main macro-categories according to the underlying optimization strategy, namely margin-based losses and probabilistic losses, as illustrated in Figure 3. We first introduce the margin-based family, beginning with the most basic and intuitive formulation, the zero-one loss. We then discuss the hinge loss and its variants, including the smoothed hinge and quadratically smoothed hinge losses, followed by the modified Huber loss, the ramp loss, and the cosine similarity loss.
We then present the probabilistic losses, starting with cross-entropy and negative log-likelihood, which coincide from a mathematical point of view. Next, we introduce the Kullback–Leibler (KL) divergence, followed by additional losses commonly used in classification and segmentation. These include focal loss, which addresses class imbalance by focusing on hard examples; Dice loss, which measures overlap between predicted and target segmentations; and Tversky loss, which extends Dice loss by explicitly controlling the trade-off between false positives and false negatives in highly imbalanced settings.

5.2. Margin-Based Loss Functions

In this section, we introduce the best-known margin-based loss functions.

5.2.1. Zero-One Loss

The most basic and intuitive margin-based classification loss is the zero-one loss, which assigns a value of 1 to a misclassified observation and 0 to a correctly classified one:
$$L_{ZeroOne}(f(x), y) = \begin{cases} 1 & \text{if } f(x) \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$
Zero-one loss is not directly usable since it lacks convexity and differentiability. However, it is possible to derive employable surrogate losses that are classification calibrated, meaning they are a relaxation of L ZeroOne , an upper bound, or an approximation of this loss. A significant achievement of the recent literature on binary classification has been the identification of necessary and sufficient conditions under which such relaxations yield Fisher consistency (Fisher consistency, in this context, means that minimizing the surrogate loss also minimizes the expected zero-one loss under the true data distribution, ensuring reliable classification outcomes) [62,63,64,65,66,67]. All the following losses in this section satisfy such conditions.
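Although unusable as a training objective, the zero-one loss is trivial to evaluate, which is why it underlies accuracy as a metric. A sketch (illustrative function name; labels in $\{-1, +1\}$):

```python
def zero_one(score, label):
    """Zero-one loss: 1 when score and label disagree in sign, else 0."""
    return 1.0 if score * label < 0 else 0.0
```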

5.2.2. Hinge Loss and Perceptron Loss (L-CONT, CONVEX)

The most famous surrogate loss is the hinge loss [68], which linearly penalizes every prediction whose margin $f(x) \cdot y$ is smaller than 1.
$$L_{Hinge}(f(x), y) = \max\left(0,\, 1 - f(x) \cdot y\right)$$
The hinge loss is not strictly convex, but it is Lipschitz continuous and convex, so many of the usual convex optimizers used in machine learning can work with it. The hinge loss is commonly employed to optimize the support vector machine (SVM [69,70]).
To train the perceptron [71], a variation of this loss, the perceptron loss, is employed. It slightly differs from the hinge loss in that it does not penalize samples inside the margin surrounding the separating hyperplane, but only those mislabeled by the hyperplane, with the same linear penalization.
$$L_{Perceptron}(f(x), y) = \max\left(0,\, -f(x) \cdot y\right)$$
There are two main drawbacks to using the hinge loss. Firstly, it makes the model sensitive to outliers in the training data. Secondly, due to the discontinuity of its derivative at $f(x) \cdot y = 1$, i.e., the fact that it is not continuously differentiable, the hinge loss tends to be difficult to optimize.
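The two losses differ only in where the penalty starts, as the following sketch shows (illustrative names; labels in $\{-1, +1\}$): the hinge penalizes any margin below 1, while the perceptron loss penalizes only actual misclassifications.

```python
def hinge(score, label):
    """Hinge loss: penalizes any margin label * score below 1."""
    return max(0.0, 1.0 - label * score)

def perceptron(score, label):
    """Perceptron loss: penalizes only misclassified samples."""
    return max(0.0, -label * score)
```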

5.2.3. Smoothed Hinge Loss (L-CONT, CONVEX, DIFF)

A smoothed version of the hinge loss was defined in [72] with the goal of obtaining a function that is easier to optimize, as shown by the following equation:
$$L_{SmoothedHinge}(f(x), y) = \begin{cases} \frac{1}{2} - f(x) \cdot y & f(x) \cdot y \le 0 \\ \frac{1}{2}\left(1 - f(x) \cdot y\right)^2 & 0 < f(x) \cdot y < 1 \\ 0 & f(x) \cdot y \ge 1 \end{cases}$$
This smoothed version of the hinge loss is differentiable. Clearly, it is not the only possible smooth version of the hinge loss, but it is a canonical one. Writing $z = f(x) \cdot y$, it has the important property of being zero for $z \ge 1$ and of having a constant (negative) slope for $z \le 0$; for $0 < z < 1$, the loss smoothly transitions from a zero slope to a constant negative one. This loss inherits sensitivity to outliers from the original hinge loss.

5.2.4. Quadratically Smoothed Hinge Loss (L-CONT, CONVEX, DIFF)

With the same goal as the smoothed hinge loss, a quadratically smoothed version was introduced in [73] to make optimization easier while preserving the main properties of the original hinge loss. Let z = f ( x ) · y . A common formulation is:
$$L_{QSmoothedHinge}(f(x), y) = \begin{cases} 0 & z \ge 1 \\ \dfrac{(1 - z)^2}{2\gamma} & 1 - \gamma \le z < 1 \\ 1 - z - \dfrac{\gamma}{2} & z < 1 - \gamma \end{cases}$$
where the hyperparameter γ > 0 determines the degree of smoothing. As γ 0 , the loss approaches the original hinge loss. In contrast to the hinge loss, this formulation is differentiable over the whole domain, while remaining convex and Lipschitz continuous. The quadratic region around the margin makes the optimization smoother, although the loss still inherits some sensitivity to outliers from the original hinge loss.
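A sketch of this piecewise formulation on the margin $z = f(x) \cdot y$ (the function name is illustrative). Note that, up to the stated scaling, the $\gamma = 2$ case coincides with the modified Huber loss discussed next:

```python
def quad_smoothed_hinge(z, gamma=1.0):
    """Quadratically smoothed hinge on the margin z = f(x) * y."""
    if z >= 1.0:
        return 0.0
    if z >= 1.0 - gamma:
        # quadratic transition region of width gamma
        return (1.0 - z) ** 2 / (2.0 * gamma)
    # linear regime, joined continuously at z = 1 - gamma
    return 1.0 - z - gamma / 2.0
```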

5.2.5. Modified Huber Loss (L-CONT, DIFF, CONVEX)

The modified Huber loss is a slight variation of the Huber loss for regression and a special case of the quadratic smoothed hinge loss with γ = 2 (for more details refer to Section 4.2.7):
$$L_{ModHuber}(f(x), y) = \begin{cases} \frac{1}{4} \max\left(0,\, 1 - f(x) \cdot y\right)^2 & f(x) \cdot y \ge -1 \\ -f(x) \cdot y & \text{otherwise} \end{cases}$$

5.2.6. Ramp Loss (CONT)

The ramp loss, also known as the truncated hinge loss, is a piecewise linear and continuous loss obtained by truncating the hinge loss [74]. Unlike the hinge loss, the ramp loss is bounded, which makes it more robust to outliers and mislabeled examples. This robustness comes at the price of non-convexity, which makes the optimization problem more challenging.
A common formulation is:
$$L_{Ramp}(f(x), y) = \begin{cases} 1 & \text{if } f(x) \cdot y \le 0 \\ 1 - f(x) \cdot y & \text{if } 0 < f(x) \cdot y < 1 \\ 0 & \text{if } f(x) \cdot y \ge 1 \end{cases}$$
Equivalently, the ramp loss can be written as
$$L_{Ramp}(f(x), y) = \min\left(1,\, L_{Hinge}(f(x), y)\right).$$
When employed in SVM-like classifiers, truncated-hinge objectives have been shown to yield more robust classifiers and, in some cases, smaller and more stable support-vector sets than standard hinge loss formulations [74]. Extensions to multicategory settings have also been studied in the literature [61].
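The equivalent $\min(1, L_{Hinge})$ form gives a one-line sketch (illustrative name; labels in $\{-1, +1\}$), which makes the boundedness, and hence the robustness to outliers, immediate:

```python
def ramp(score, label):
    """Ramp (truncated hinge) loss: the hinge loss clipped at 1, hence bounded."""
    return min(1.0, max(0.0, 1.0 - label * score))
```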

5.2.7. Cosine Similarity Loss (CONT, DIFF)

Cosine similarity is commonly used to compare vectors when their orientation is more important than their magnitude [3,42]. A typical example arises in text representation, where samples are encoded through word counts or embeddings and the angular agreement between vectors is more informative than their norm. When both the target and the model output can be interpreted as vectors, cosine similarity can be adapted into a loss function as follows:
$$L_{cos\text{-}sim}(f(x), y) = 1 - \frac{y \cdot f(x)}{\|y\|\,\|f(x)\|}.$$
This loss is differentiable whenever the norms are non-zero, and it is useful when the objective is to align directions rather than magnitudes. It is important to note that cosine similarity itself lies in the interval [ 1 , 1 ] , whereas the corresponding loss defined above lies in the interval [ 0 , 2 ] . This normalization makes the loss insensitive to scale, which can be advantageous in some applications but may be undesirable when vector magnitude also carries meaningful information.
Although cosine similarity loss is not a classical margin-based surrogate in the SVM sense, it can be interpreted as a geometric objective that promotes angular separation and alignment in representation space, and is therefore included here as a related non-probabilistic classification loss.
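A sketch over plain Python lists (illustrative names; both vectors assumed non-zero) shows the scale invariance and the $[0, 2]$ range:

```python
import math

def cosine_loss(y, f):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(y, f))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_f = math.sqrt(sum(b * b for b in f))
    return 1.0 - dot / (norm_y * norm_f)
```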

5.3. Probabilistic Loss Functions

Let q be the probability distribution underlying the dataset and f Θ the function generating the output. Probabilistic loss functions provide some distance function between q and f Θ . By minimizing that distance, the model output distribution converges to the ground-truth one. Usually, models trained with probabilistic loss functions can provide a measure of how likely a sample is labeled with one class instead of another [3,42,75], providing richer information compared to margin-based options.

5.3.1. Cross-Entropy Loss and Negative Log-Likelihood Loss (CONT, DIFF, CONVEX)

Maximum likelihood estimation (MLE) is a method to estimate the parameters of a probability distribution by maximizing the likelihood [3,42,76]. From the point of view of Bayesian inference, MLE can be considered a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution over the parameters.
Formally, in binary classification, assuming y i { 0 , 1 } and that f Θ ( x i ) [ 0 , 1 ] is the predicted probability of class 1, given a dataset of samples D , we maximize the following quantity:
$$P(\mathcal{D} \mid \Theta) = \prod_{i=1}^{N} f_\Theta(x_i)^{y_i} \cdot \left(1 - f_\Theta(x_i)\right)^{1 - y_i}.$$
The aim is to find the maximum likelihood estimate by minimizing a loss function. To maximize Equation (31), we can turn it into a minimization problem by employing the negative log-likelihood. To achieve this goal, we define:
$$\log\left(P(\mathcal{D} \mid \Theta)\right) = \sum_{i=1}^{N} y_i \log(f_\Theta(x_i)) + (1 - y_i) \log(1 - f_\Theta(x_i))$$
Then, we obtain the loss function by taking the negative of the log:
$$L_{NLL} = -\sum_{i=1}^{N} \left[ y_i \log(f_\Theta(x_i)) + (1 - y_i) \log(1 - f_\Theta(x_i)) \right].$$
Often, the above loss is also called the cross-entropy loss, because it can be derived by minimizing the cross-entropy between the target distribution q and the model distribution f Θ . In the continuous case, cross-entropy is defined as
$$H(q, f_\Theta) = -\int q(x) \log(f_\Theta(x))\, dx,$$
while in the discrete case (which is the one of interest here), it becomes
$$H(q, f_\Theta) = -\sum_{i=1}^{N} q(x_i) \log(f_\Theta(x_i)).$$
Moreover, cross-entropy and Kullback–Leibler divergence are related through
$$H(q, f_\Theta) = H(q) + L_{KL}(q \,\|\, f_\Theta),$$
where H ( q ) is the entropy of the target distribution and does not depend on the model parameters. Therefore, minimizing the cross-entropy with respect to Θ is equivalent to minimizing the negative log-likelihood in Equation (33) and, more generally, to minimizing the Kullback–Leibler divergence.
The classical approach to extending this loss to the multi-class scenario is to add a softmax function as the final activation of the model, defined over the $K$ classes considered. Given a score $f_k(x)$ for each class, the outputs can be squashed to sum to 1 by means of a softmax function $f_S$, obtaining
$$\hat{f}_k(x_i) = f_S(f_k(x_i)),$$
where the softmax is defined as follows:
$$f_S(s_i) = \frac{e^{s_i}}{\sum_{j=1}^{K} e^{s_j}}.$$
The final loss, usually called categorical cross-entropy, is
$$L_{CCE} = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{f}_k(x_i)),$$
where y i k is the one-hot encoded target for sample i and class k.
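The softmax and the categorical cross-entropy for a single sample can be sketched as follows (illustrative names; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(scores):
    """Softmax over raw class scores; shift by the max score for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, scores):
    """Categorical cross-entropy for a single sample with a one-hot target."""
    probs = softmax(scores)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))
```

With uniform scores over three classes, every class gets probability $1/3$, so the loss is $\log 3$ regardless of which class is the true one.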

5.3.2. Kullback–Leibler Divergence (CONT, CONVEX, DIFF)

The Kullback–Leibler ($L_{KL}$) divergence is an information-based measure of disparity between probability distributions. Precisely, it is a non-symmetric measure of how one probability distribution differs from another [3,42,77]. Technically speaking, the $L_{KL}$ divergence is not a distance metric because it satisfies neither symmetry, i.e., $L_{KL}(q \,\|\, f_\Theta) \neq L_{KL}(f_\Theta \,\|\, q)$, nor the triangle inequality. It is important to note that, in the classification use case, minimizing the $L_{KL}$ divergence is equivalent to minimizing the cross-entropy.
Precisely, the L KL divergence between two continuous distributions is defined as:
$$L_{KL}(q \,\|\, f_\Theta) = \int q(x) \log\frac{q(x)}{f_\Theta(x)}\, dx = -\int q(x) \log(f_\Theta(x))\, dx + \int q(x) \log(q(x))\, dx.$$
If we want to minimize L KL with respect to the parameter Θ , since the second integral is independent of Θ , we obtain:
$$\min_\Theta L_{KL}(q \,\|\, f_\Theta) = \min_\Theta \left( -\int q(x) \log(f_\Theta(x))\, dx \right) = \min_\Theta H(q, f_\Theta).$$
For this reason, in classification problems, cross-entropy is often used in practice instead of explicitly writing the Kullback–Leibler divergence: the two objectives differ only by an additive term independent of the model parameters. However, L KL divergence remains particularly useful when the target is itself a soft distribution rather than a one-hot label, as in knowledge distillation, variational inference, and probabilistic regularization.
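For discrete distributions, the KL divergence reduces to a weighted sum of log-ratios, as in the sketch below (illustrative name; it assumes $p_i > 0$ wherever $q_i > 0$, the usual absolute-continuity condition). The asymmetry noted above is easy to observe numerically:

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; assumes p_i > 0 wherever q_i > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
```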

5.3.3. Focal Loss (L-CONT)

Focal loss [78] is a modification of the cross-entropy loss designed to address class imbalance by down-weighting the contribution of easy examples and focusing more on hard, misclassified examples. This is particularly effective in tasks such as object detection, where imbalanced datasets are common.
The focal loss is defined as:
$$L_{Focal} = -\left(1 - f_\Theta(x_i)\right)^\gamma \log(f_\Theta(x_i)),$$
where f Θ ( x i ) is the predicted probability for the true class, and γ 0 is the focusing parameter that controls the strength of the modulation. When γ = 0 , focal loss reduces to the standard cross-entropy loss in the binary scenario.
Focal loss is Lipschitz continuous (L-CONT) for the predicted probabilities f Θ ( x i ) .
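The modulation effect is visible in a per-sample sketch (illustrative name; `p_true` is the predicted probability of the true class): at $\gamma = 0$ the loss equals the cross-entropy, while $\gamma > 0$ sharply down-weights well-classified (easy) examples.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for the predicted probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```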

5.3.4. Dice Loss (CONT, DIFF)

Dice loss [79] is widely used in image segmentation tasks to handle class imbalance. It is based on the Dice coefficient, a measure of overlap between the predicted and true segmentation. Dice loss is particularly effective when there is a significant imbalance between foreground and background classes.
The Dice loss is defined as:
$$L_{Dice} = 1 - \frac{2 \sum_i f_\Theta(x_i)\, y_i}{\sum_i f_\Theta(x_i) + \sum_i y_i},$$
where f Θ ( x i ) is the predicted probability for pixel i, and y i is the corresponding ground truth.
Dice loss is continuous (CONT) and differentiable (DIFF), but not convex. It is commonly employed in medical image segmentation tasks, where accurately measuring the overlap between the predicted and true segmentation is crucial.
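A sketch over flattened pixel probabilities (illustrative name; the small `eps` smoothing constant is a common implementation convenience to avoid division by zero on empty masks, not part of the formula above):

```python
def dice_loss(probs, targets, eps=1e-7):
    """Dice loss over flattened pixel probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - 2.0 * intersection / (denom + eps)
```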

5.3.5. Tversky Loss (CONT, DIFF)

Tversky loss [80] extends the Dice loss by introducing parameters that control the trade-off between false positives and false negatives. This makes Tversky loss especially useful in segmentation tasks with high class imbalance, where one class dominates the other.
The Tversky loss is defined as:
$$L_{Tversky} = 1 - \frac{\sum_i f_\Theta(x_i)\, y_i}{\sum_i f_\Theta(x_i)\, y_i + \alpha \sum_i f_\Theta(x_i)(1 - y_i) + \beta \sum_i (1 - f_\Theta(x_i))\, y_i},$$
where f Θ ( x i ) is the predicted probability for pixel i, y i is the corresponding ground truth, and α and β are parameters that balance false positives and false negatives.
Tversky loss is continuous (CONT) and differentiable (DIFF), and by adjusting α and β , it can emphasize reducing false positives or false negatives, depending on the task’s requirements. This flexibility makes it a powerful tool in imbalanced segmentation problems.
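The trade-off between false positives and false negatives can be sketched as follows (illustrative name; `eps` is again a smoothing constant, not part of the formula). With $\alpha = \beta = 0.5$ the expression reduces to the Dice loss, since $2\,TP / (2\,TP + FP + FN) = TP / (TP + 0.5\,FP + 0.5\,FN)$.

```python
def tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss; alpha weights false positives, beta weights false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1.0 - t) for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)
```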

6. Generative Losses

In recent years, generative models have become particularly valuable for modeling complex data distributions and regenerating realistic samples from them [81,82]. In this section, as illustrated in Figure 4, we describe the primary losses associated with variational autoencoders (VAEs) (Section 6.1), generative adversarial networks (GANs) (Section 6.2), diffusion models (Section 6.3), and Transformers with a focus on LLMs (Section 6.4).
While this survey focuses on the core generative models, other architectures like pixel RNNs [83], realNVP [84], flow-based models [85], and WaveNet [86] are also impactful but fall beyond the scope of this work.

6.1. Variational Autoencoders (VAEs)

Variational autoencoders (VAEs) [87,88] are generative models that learn a latent representation of data through probabilistic encoding and decoding. By modeling the underlying structure of data, VAEs aim to produce a latent space that follows a known prior distribution, typically Gaussian.
A VAE consists of two main components: an encoder and a decoder. The encoder maps the input data x to a latent variable z , capturing key data characteristics in a compact form through the approximate posterior distribution q ( z | x ) . The decoder then reconstructs the original data x from z , modeling the conditional distribution p ( x | z ) . This probabilistic framework allows VAEs to generate new samples by sampling from the learned latent space.
VAEs find applications in image generation, semi-supervised learning, and anomaly detection. In image generation, they facilitate smooth latent representations, useful for denoising and sample synthesis [89,90]. In semi-supervised learning, VAEs leverage both labeled and unlabeled data to improve classification [91]. For anomaly detection, VAEs model normal data distributions, identifying anomalies via high reconstruction errors for out-of-distribution samples [92]. Despite successes, VAEs face challenges such as image blurriness, often addressed by discrete latent spaces like those in VQ-VAE [93].

6.1.1. VAE Loss (ELBO) (CONT, DIFF, L-CONT)

The VAE loss function is derived from the evidence lower bound (ELBO), which provides a lower bound on the data log-likelihood. It comprises two main components: the reconstruction loss L recon and the Kullback–Leibler (KL) divergence L KL (see Section 5.3.2).
The reconstruction loss encourages the decoder to accurately reproduce the input data x from the latent variable z by minimizing the difference between the original data and its reconstruction. Depending on the data type, L recon can be the mean squared error (MSE, see Section 4.2.3) for real-valued data or the binary cross-entropy (negative log-likelihood, see Section 5.3.1) for binary data. In a unified expression, we can represent L recon as the expected log-likelihood with respect to the approximate posterior distribution of the latent variable:
$$L_{recon} = -\mathbb{E}_{q(z|x)}\left[\log f_\Theta(x \mid z)\right].$$
The full VAE loss, often referred to as the negative ELBO, is defined as:
$$L_{VAE} = -\mathbb{E}_{q(z|x)}\left[\log f_\Theta(x \mid z)\right] + L_{KL}\left(q(z \mid x) \,\|\, f_\Theta(z)\right),$$
where the first term is the reconstruction loss and the second term, the KL divergence, regularizes the latent space to follow the prior distribution.
While L recon and L KL may individually exhibit convexity under certain conditions, their combined interaction, particularly with neural network parameterization, results in an overall non-convex loss function [42,87,88].
The reconstruction loss alone focuses on achieving high-fidelity reconstructions of the input data. However, without the KL divergence term, the model does not impose any structure on the latent space, leading to unstructured representations that generalize poorly to generating new samples [87,88].
In contrast, the ELBO, which combines the reconstruction loss with the KL divergence term, introduces a trade-off between accurate reconstruction and regularization of the latent space. The KL divergence term encourages the latent space to follow a smooth prior distribution, facilitating sample generation by ensuring structured, regularized representations. However, if the KL divergence term dominates, it can lead to poorer reconstructions as the model prioritizes matching the prior over precise data recovery [94]. Excessive weight on the KL term can also cause posterior collapse, where the model ignores the latent variables entirely, resulting in suboptimal generative performance [95,96].
In summary, while L recon alone enhances data fidelity, combining it with L KL (in the ELBO) enables proper generative modeling. Achieving realistic data generation requires balancing accurate reconstruction with a regularized latent space [97].
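As a sketch of the negative ELBO, the snippet below assumes the standard setting of a diagonal-Gaussian approximate posterior $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior, for which the KL term has a well-known closed form; the function names and interface are illustrative, and the reconstruction term (`recon_nll`) is assumed to be computed elsewhere (e.g., MSE or binary cross-entropy).

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.

    Standard closed form when both posterior and prior are Gaussian:
    -0.5 * sum(1 + logvar - mu^2 - exp(logvar)).
    """
    return sum(-0.5 * (1.0 + lv - m * m - math.exp(lv))
               for m, lv in zip(mu, logvar))

def vae_loss(recon_nll, mu, logvar):
    """Negative ELBO: reconstruction NLL plus the latent KL regularizer."""
    return recon_nll + gaussian_kl(mu, logvar)
```

When the posterior exactly matches the prior ($\mu = 0$, $\log\sigma^2 = 0$), the KL term vanishes and only the reconstruction term remains.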

6.1.2. Extensions of VAE Losses

Several extensions of the VAE loss have been proposed to improve the model’s performance or flexibility. The most notable variants include:
  • Beta-VAE (CONT, DIFF, L-CONT): The beta-VAE [94] introduces a hyperparameter β to weight the KL divergence term. The loss function becomes:
    $$L_{\beta\text{-}VAE} = L_{recon} + \beta\, L_{KL}.$$
    By adjusting β , the balance between reconstruction accuracy and latent space regularization can be controlled [97]. This variant retains the continuity, differentiability, and Lipschitz continuity but it is non-convex due to the interaction between terms.
    Beta-VAE improves interpretability by encouraging the model to learn more disentangled representations. This is particularly useful in applications where distinct, independent latent factors are beneficial, as in unsupervised learning tasks or when a well-structured latent space is desired [94]. The introduction of β can lead to a trade-off where increasing regularization may harm reconstruction quality. Excessively high values of β can also cause posterior collapse, where the model ignores the latent variables, resulting in reduced generative performance [95,96,97].
  • VQ-VAE (CONT): In vector quantized VAEs [93], the latent space is discrete, rather than continuous, and a codebook is used to quantize the latent variables. The key difference in VQ-VAE compared to traditional VAEs is the use of a discrete latent representation, which introduces the following steps:
    Given an input $x$, the encoder produces a continuous latent variable $z_e(x) \in \mathbb{R}^D$. However, instead of passing $z_e(x)$ directly to the decoder, VQ-VAE performs a vector quantization by mapping $z_e(x)$ to the nearest vector in a learned codebook $E = \{e_k\}_{k=1}^{K}$, where each $e_k \in \mathbb{R}^D$ is a learned embedding vector. This process can be written as:
    $$z_q(x) = e_k, \quad \text{where } k = \arg\min_j \|z_e(x) - e_j\|_2.$$
    The quantized latent variable z q ( x ) is then passed to the decoder, which reconstructs the input as x ^ = p ( x | z q ) .
    The VQ-VAE loss function consists of three terms:
    $$L_{VQ\text{-}VAE} = L_{recon} + \left\|\, \mathrm{sg}[z_e(x)] - e_k \,\right\|_2^2 + \beta \left\|\, z_e(x) - \mathrm{sg}[e_k] \,\right\|_2^2,$$
    where:
    L recon is the reconstruction loss (typically mean squared error or binary cross-entropy).
    The second term, $\|\,\mathrm{sg}[z_e(x)] - e_k\,\|_2^2$ (the codebook loss), ensures that the codebook vector $e_k$ is close to the encoder output $z_e(x)$.
    The third term, $\|\,z_e(x) - \mathrm{sg}[e_k]\,\|_2^2$ (the commitment loss), encourages the encoder to commit to a particular codebook vector, where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, ensuring gradients only flow through the appropriate part of the network.
    β is a hyperparameter controlling the weight of the commitment loss.
    While VQ-VAE remains continuous (CONT) overall, the discrete quantization step introduces non-differentiability in the loss function, as the mapping from encoder outputs to codebook vectors is non-differentiable.
    The key advantage of VQ-VAE is that the use of a discrete latent space can produce sharper and higher-quality generated samples, addressing some of the issues with blurry outputs often observed in continuous VAEs [93]. VQ-VAE is particularly beneficial in tasks where the data have inherently discrete characteristics, such as in audio and image generation [98,99].
  • Conditional VAE (CVAE) (CONT, DIFF, L-CONT): Conditional VAEs [100] condition both the encoder and decoder on auxiliary information (e.g., class labels), modifying the ELBO to learn the conditional distribution p ( x | y ) . The CVAE loss is:
    $$L_{CVAE} = -\mathbb{E}_{q(z|x,y)}\left[\log f_\Theta(x \mid z, y)\right] + L_{KL}\left(q(z \mid x, y) \,\|\, f_\Theta(z \mid y)\right)$$
    The CVAE loss is continuous, differentiable, and Lipschitz continuous, and the KL divergence remains convex, but the overall loss is still non-convex due to the interaction between terms.
    The advantage of using CVAE is its ability to generate conditional outputs based on specific attributes or labels. This makes CVAE particularly useful in scenarios where controlled generation is required, such as in image generation conditioned on class labels or text generation based on input attributes [91,101]. By incorporating auxiliary information, CVAEs allow for more structured and interpretable latent spaces, improving the ability to generate targeted samples in a variety of applications [91].

6.2. Generative Adversarial Networks

Generative adversarial networks (GANs) are used to create new data instances that are sampled from the training data. GANs have two main components:
  • The generator, referred to as G ( { z 0 , , z N } ) , which generates data starting from random noise and tries to replicate real data distributions.
  • The discriminator, referred to as D ( { x 0 , , x N } ) , learns to distinguish the generator’s fake data from the real one. It applies penalties in the generator loss for producing distinguishable fake data compared with real data.
The GAN architecture is relatively straightforward, although one aspect remains challenging: GAN loss functions. Precisely, the discriminator is trained to provide the loss function for the generator. If generator training goes well, the discriminator gets worse at telling the difference between real and fake samples. It starts to classify fake data as real, and its accuracy decreases.
Both the generator and the discriminator components are typically neural networks, where the generator output is connected directly to the discriminator input. The discriminator’s classification provides a signal that the generator uses to update its weights through back-propagation.
As GANs try to replicate a probability distribution, they should use loss functions that reflect the distance between the distribution of the data generated by the GAN and the distribution of the real data.
Two common GAN loss functions are typically used: minimax loss [82] and Wasserstein loss [102]. The generator and discriminator losses derive from a single distance measure between the two aforementioned probability distributions. The generator can only affect one term in the distance measure: the term that reflects the distribution of the fake data. During generator training, we drop the other term, which reflects the real data distribution. The generator and discriminator losses look different, even though they derive from a single formula.
In the following, both minimax and Wasserstein losses are written in a general form. The properties of the loss function (CONT, DIFF, etc.) are identified based on the function chosen for the generator or discriminator.

6.2.1. Minimax Loss

The generative model G learns the data distributions and is trained simultaneously with the discriminative model D. The latter estimates the probability that a given sample is identical to the training data rather than G. G is trained to maximize the likelihood of tricking D [82]. In other words, the generator tries to minimize the following function while the discriminator tries to maximize it:
L_minimax(D, G) = E_{x_0,…,x_N}[log(D({x_0,…,x_N}))] + E_{z_0,…,z_N}[log(1 − D(G({z_0,…,z_N})))],
where:
  • D({x_0,…,x_N}) is the discriminator’s estimate of the probability that the real data instance {x_0,…,x_N} is real;
  • E_{x_0,…,x_N} is the expected value over all real data instances;
  • G({z_0,…,z_N}) is the generator’s output when given noise {z_0,…,z_N};
  • D(G({z_0,…,z_N})) is the discriminator’s estimate of the probability that a fake instance is real;
  • E_{z_0,…,z_N} is the expected value over all random inputs to the generator (in effect, the expected value over all generated fake instances G({z_0,…,z_N})).
The loss function above is the binary cross-entropy between the discriminator’s predictions and the real/fake labels. The generator cannot directly affect the log(D({x_0,…,x_N})) term, so it only minimizes the term log(1 − D(G({z_0,…,z_N}))). A disadvantage of this formulation is that it can cause the GAN to get stuck in the early stages of training, when the discriminator’s task is trivially easy. A suggested modification to the loss [82] is therefore to let the generator maximize log(D(G({z_0,…,z_N}))).
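As an illustration, the minimax objective and its non-saturating modification can be sketched as follows; `d_real` and `d_fake` are hypothetical arrays of discriminator outputs D(x) and D(G(z)) on a batch (a minimal NumPy sketch, not tied to any particular framework):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator ascends L_minimax; in practice this is
    # implemented as minimizing the negated objective.
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss_saturating(d_fake):
    # Original generator objective: minimize log(1 - D(G(z))).
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    # Modified objective [82]: maximize log D(G(z)),
    # implemented as minimizing its negation.
    return -np.mean(np.log(d_fake))
```

When the discriminator confidently rejects fakes (D(G(z)) near 0), the non-saturating variant produces a large loss and therefore a stronger training signal, which is precisely the motivation for the modification.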

6.2.2. Wasserstein Loss

The Wasserstein distance provides an alternative way of training the generator to better approximate the distribution of the training dataset. In this setup, generator training directly minimizes the distance between the distributions of the training and generated datasets. Candidate distance measures include the Kullback–Leibler (KL) divergence, the Jensen–Shannon (JS) divergence, and the earth-mover (EM) distance (also called the Wasserstein distance). The main advantage of the Wasserstein distance is that it remains continuous and differentiable almost everywhere, providing usable gradients even where the JS divergence saturates [102].
A GAN that uses a Wasserstein loss, known as a WGAN, does not discriminate between real and generated distributions in the same way as other GANs. Instead, the WGAN discriminator is called a “critic,” and it scores each instance with a real-valued score rather than predicting the probability that it is fake. This score is calculated so that the distance between scores for real and fake data is maximized.
The advantage of the WGAN is that the training procedure is more stable and less sensitive to model architecture and selection of hyperparameters.
The two loss functions can be written as:
L_critic = D({x_0,…,x_N}) − D(G({z_0,…,z_N}))
L_generator = D(G({z_0,…,z_N}))
The critic tries to maximize L_critic; in other words, it tries to maximize the difference between its output on real instances and its output on fake instances. The generator tries to maximize L_generator; in other words, it tries to maximize the critic’s output on its fake instances.
The benefit of the Wasserstein loss is that it provides a useful gradient almost everywhere, allowing for the continued training of the models. It also means that a lower Wasserstein loss correlates with better generator output quality, so minimizing the generator loss is explicitly meaningful. Finally, it is less vulnerable to getting stuck in a local minimum than minimax-based GANs [102]. However, accurately estimating the Wasserstein distance from mini-batches requires prohibitively large batch sizes, which significantly increases the amount of data needed [103].
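The two WGAN objectives above can be sketched directly; `scores_real` and `scores_fake` are hypothetical arrays of real-valued critic scores for a batch (a minimal NumPy sketch under those assumptions):

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    # The critic maximizes D(real) - D(fake); implemented here
    # as minimizing the negated difference.
    return -(np.mean(scores_real) - np.mean(scores_fake))

def generator_loss(scores_fake):
    # The generator maximizes D(G(z)), i.e., minimizes -D(G(z)).
    return -np.mean(scores_fake)
```

Note that, unlike the minimax losses, no logarithms appear: the critic outputs unbounded scores rather than probabilities.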

6.3. Diffusion Models

Diffusion models are generative models that rely on probabilistic likelihood estimation to generate data samples. Originally inspired by the physical phenomenon of diffusion, these models systematically add noise to the data and then learn to reverse this process. The forward process corrupts the data by progressively adding noise, and the reverse process reconstructs the data by removing the noise, a procedure learned through a neural network that models conditional probability densities [104,105,106].
The forward diffusion process adds Gaussian noise to the data over T time steps, transforming it into pure noise. The reverse diffusion process then uses a neural network to learn the conditional probabilities of each time step, gradually denoising the data to reconstruct the original input.
  • Forward Diffusion Process
Given a data point x_0 sampled from a real data distribution q(x), the forward process adds small amounts of Gaussian noise iteratively T times, resulting in a sequence of noisy samples x_1, …, x_T. The noise distribution is Gaussian and, because the probability density at time t depends only on the previous state x_{t−1}, the conditional probability density is:
q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I),
with β_t ∈ (0, 1), a hyperparameter that can be constant or variable, and t ∈ [1, T]. As T → ∞, x_T converges to pure Gaussian noise. The forward process does not require neural network training but involves iteratively applying Gaussian noise to the data.
  • Reverse Diffusion Process
In the reverse process, given the noisy state x_T, the goal is to estimate the probability density of the original data by gradually removing the noise. The neural network, parameterized by Θ, is trained to approximate f_Θ(x_{t−1} | x_t), enabling the reconstruction of x_0 from x_T. The reverse diffusion process is modeled as:
f_Θ(x_{t−1} | x_t) = N(x_{t−1}; μ_Θ(x_t, t), Σ_Θ(x_t, t)),
where μ Θ ( x t , t ) is a learned mean function, and Σ Θ ( x t , t ) is a learned variance term. Sampling from this distribution allows the model to reconstruct the data by successively denoising the noisy sample from time step T down to t = 0 [106].
Recent work has demonstrated that diffusion models can outperform other generative models, such as GANs, in certain tasks [107,108], though they tend to be computationally more expensive due to the large number of forward and reverse steps involved.

6.3.1. Diffusion Model Loss Function (CONT, DIFF)

The objective of diffusion models is to minimize the variational lower bound (VLB) on the data likelihood, which is formulated as:
L_diffusion = E_q[ −log p(x_T) − Σ_{t≥1} log( f_Θ(x_{t−1} | x_t) / q(x_t | x_{t−1}) ) ].
A simplified version of this loss is often used to train denoising diffusion probabilistic models (DDPMs) [106]:
L_diffusion^simple = E_{t, x_0, ε_t}[ ‖ ε_t − ε_Θ( √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε_t, t ) ‖² ],
where the model learns to predict the noise ϵ Θ added at each step. This simplified loss connects closely to score matching, as the model learns to predict the noise, which is essential for reconstructing the data.
The simplified loss function (also called noise prediction loss) in DDPMs is computationally efficient and relatively straightforward to implement. It uses a mean squared error (MSE) objective, which is cheap to compute. Each training iteration requires the model to predict the added noise and minimize the difference between the actual and predicted noise. This formulation is easy to optimize with standard gradient-based methods, leading to faster convergence and more stable training compared to more complex alternatives. As a result, DDPM’s simplified loss provides a good trade-off between computational efficiency and sample quality, making it suitable for a wide range of generative tasks [106].
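The noise-prediction objective reduces to an MSE between the injected noise and the network’s prediction; a minimal NumPy sketch (with `alpha_bar_t` standing for ᾱ_t and the network call replaced by a plain array `eps_pred`):

```python
import numpy as np

def noisy_sample(x0, eps, alpha_bar_t):
    # Forward-process shortcut: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps.
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def ddpm_simple_loss(eps_true, eps_pred):
    # MSE between the injected noise and the model's noise prediction.
    return np.mean((eps_true - eps_pred) ** 2)
```

In an actual training loop, `eps_pred` would be the output of the denoising network ε_Θ evaluated at the noisy sample and time step t.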

6.3.2. Other MSE-Based Losses in Diffusion Models (CONT, DIFF)

While the noise prediction loss is central to diffusion models, other MSE-based losses are also employed to optimize different aspects of the generation process. These losses extend the basic MSE formulation to handle higher-level tasks and representations:
Perceptual Loss
In tasks such as super-resolution and image synthesis, where pixel-level differences may fail to capture perceptual quality, perceptual loss [109] is applied. This loss leverages high-level feature representations extracted from a pre-trained convolutional neural network (e.g., VGG). The MSE objective is then applied to these feature maps, rather than pixel values, ensuring that the generated images align more closely with human perception.
Latent Space Regularization
In latent diffusion models [109], MSE is applied to regularize the latent space during the diffusion process. This regularization ensures that the latent representations of the input data remain smooth and structured, improving the stability and consistency of the generation process. Latent space regularization helps to prevent issues such as mode collapse and ensures higher-quality outputs.

6.3.3. Score-Based Generative Model Loss (CONT, DIFF)

In score-based generative models (SBGMs) [110], the diffusion process is generalized to a continuous-time framework using stochastic differential equations (SDEs). The model learns the score function, i.e., the gradient of the log probability density at different noise levels.
The loss function for learning the score function is derived from denoising score matching:
L_score = E_{x_0, σ}[ λ(σ) ‖ ∇_x log f_Θ(x_t | σ) + (x_t − x_0)/σ² ‖² ],
where x_t is the noisy sample, ∇_x log f_Θ(x_t | σ) is the predicted score function, and λ(σ) is a weighting function for the different noise levels.
In contrast with the simplified loss function in DDPMs, the score-based loss used in score-based generative models (SBGMs) is computationally more intensive. The model must learn the gradient of the log probability (the score function), which involves higher-order derivatives and sampling from various noise scales. Each iteration requires calculating and backpropagating through the score function, significantly increasing the computational cost. While this approach can provide richer generative performance and flexibility, particularly in continuous-time frameworks, the added complexity often results in slower training and higher resource requirements [110].

6.3.4. Cosine Similarity in Multimodal Context (CONT, DIFF)

Cosine similarity plays a crucial role in aligning representations across different modalities, particularly in models like CLIP (contrastive language-image pretraining [111]), which embeds both text and images in a shared feature space. In multimodal tasks, such as text-to-image generation, this approach ensures that representations from different domains are semantically consistent.
In text-to-image diffusion models, a common use of cosine similarity is the CLIP guidance loss [109], where the goal is to align the generated image with a given text prompt. The loss leverages a pre-trained CLIP model that computes embeddings for both the text and the generated image. By maximizing the cosine similarity between the text and image embeddings, the model enhances the semantic coherence of the generated images with respect to the input text.
The CLIP guidance loss is defined as:
L_CLIP = 1 − cos( φ_text(y), φ_image(x̂) ),
where ϕ text ( y ) is the CLIP embedding of the input text y , and ϕ image ( x ^ ) is the CLIP embedding of the generated image x ^ .
This loss is continuous (CONT) and differentiable (DIFF) with respect to the generated image, as both the CLIP encoders and cosine similarity function are composed of standard neural network operations that are differentiable. Typically, the pre-trained CLIP model is frozen during the training of diffusion models, allowing gradients to flow back through the generated image while keeping the CLIP parameters unchanged.
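Given two embedding vectors, the CLIP guidance loss is simply one minus their cosine similarity; a minimal NumPy sketch in which `text_emb` and `image_emb` stand in for the frozen CLIP encoder outputs:

```python
import numpy as np

def clip_guidance_loss(text_emb, image_emb):
    # 1 - cosine similarity between the two (hypothetical) CLIP embeddings.
    cos = np.dot(text_emb, image_emb) / (
        np.linalg.norm(text_emb) * np.linalg.norm(image_emb)
    )
    return 1.0 - cos
```

The loss is 0 when the embeddings point in the same direction and grows as they become less aligned, reaching 1 for orthogonal embeddings.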

6.4. Transformers and LLM Loss Functions

The Transformer has revolutionized machine learning, particularly natural language processing (NLP), by enabling the parallel processing of sequential data using self-attention mechanisms [112]. This architecture has been the foundation of many powerful models, including large language models (LLMs) such as GPT [113] and BERT [114], which excel at generative tasks like text generation, translation, and summarization.
The Transformer model is built around an encoder–decoder structure, with the encoder processing the input data and the decoder generating the output sequence. The attention mechanism allows the model to selectively assign varying importance to different elements of the sequence, enabling it to capture long-range dependencies and context effectively. Within each encoder and decoder layer, attention assigns scores to tokens based on relevance, ensuring that the model attends to the most pertinent information for each output. After attention, feedforward layers further refine these representations, capturing complex patterns and allowing for high parallelization, which enables Transformers to efficiently handle long sequences and excel at tasks involving sequential data.
Transformer-based models, particularly LLMs, are trained using a variety of loss functions depending on the task. This section explores the primary loss functions used in Transformer and LLM training.

6.4.1. Probabilistic Losses in LLMs

Probabilistic losses are central to training LLMs, as they help models learn from probability distributions over sequences, tokens, and classes. These losses are primarily extensions or modifications of cross-entropy and KL divergence, as introduced in Section 5.3.1 and Section 5.3.2. Even though the original losses are generally convex, the overall optimization landscape becomes non-convex due to the complex architecture of Transformers [115,116]. Below, we outline key probabilistic losses used in training popular LLM architectures.
Autoregressive Language Modeling Loss (CONT, DIFF, CONVEX)
In autoregressive language models, such as GPT [113], the model is trained to predict the next token in a sequence based on the preceding tokens. This approach is crucial for generative tasks like text generation, where the model generates text one token at a time.
The loss function used in autoregressive models is typically the cross-entropy loss, which measures the negative log-likelihood of the correct token given the model’s predictions. For a sequence of tokens ( x 1 , x 2 , , x T ) , the loss is defined as:
L_AR = −Σ_{t=1}^{T} log p_θ( x_t | x_{<t} ),
where p θ ( x t | x < t ) is the predicted probability of token x t conditioned on previous tokens x < t .
Autoregressive cross-entropy loss is continuous (CONT), differentiable (DIFF), and convex (CONVEX) when considered in isolation. In LLMs, it effectively aligns model outputs with target sequences but, due to architectural complexity, contributes to an overall non-convex optimization landscape.
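The summed negative log-likelihood can be sketched directly from the formula above; `probs` is a hypothetical (T, V) array of predicted next-token distributions and `targets` the ground-truth token ids (a minimal NumPy sketch):

```python
import numpy as np

def autoregressive_loss(probs, targets):
    # Negative log-likelihood of each target token under the
    # model's predicted distribution, summed over the sequence.
    token_probs = probs[np.arange(len(targets)), targets]
    return -np.sum(np.log(token_probs))
```

A perfectly confident correct prediction contributes zero loss; low-probability target tokens contribute large positive terms.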
Masked Language Modeling (MLM) Loss (CONT, DIFF, CONVEX)
Masked language modeling, used in models like BERT [114], trains the model to predict randomly masked tokens within a sequence. This allows the model to leverage bidirectional context, as it relies on both left and right tokens for prediction.
The loss function for masked language modeling is also cross-entropy. For an input sequence x = ( x 1 , x 2 , , x T ) with masked tokens x m , the MLM loss is:
L_MLM = −Σ_{m ∈ Masked} log p_θ( x_m | x_{∖m} ),
where p_θ(x_m | x_{∖m}) denotes the predicted probability of a masked token given the unmasked context x_{∖m}.
The cross-entropy loss applied here is continuous (CONT), differentiable (DIFF), and convex (CONVEX), which suits MLM tasks well for aligning token predictions with target distributions.
Label Smoothing Loss (CONT, DIFF, CONVEX)
Label smoothing [112] is a regularization technique that distributes a small probability mass over incorrect classes, thus preventing the model from becoming overconfident. It is particularly helpful in LLMs for generative tasks like machine translation and summarization.
The adjusted target labels y′_i in label smoothing are given by:
y′_i = (1 − ϵ) · y_i + ϵ/K,
where ϵ is the smoothing parameter and K is the number of classes. The loss can then be expressed as:
L_label_smoothing = −Σ_{i=1}^{K} y′_i log f_Θ( x_i ).
The label-smoothing cross-entropy remains continuous (CONT), differentiable (DIFF), and convex (CONVEX) in itself. By moderating confidence, it helps improve model generalization in tasks prone to overfitting.
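Label smoothing and the resulting cross-entropy can be sketched as follows; `y_onehot` is a hypothetical one-hot target vector and `probs` the model’s predicted class distribution (a minimal NumPy sketch):

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    # y'_i = (1 - eps) * y_i + eps / K
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

def label_smoothing_loss(y_onehot, probs, eps):
    # Cross-entropy against the smoothed target distribution.
    return -np.sum(smooth_labels(y_onehot, eps) * np.log(probs))
```

The smoothed targets still sum to 1, so they remain a valid probability distribution; the model is simply never asked to assign probability exactly 1 to the correct class.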
KL Divergence Loss for Knowledge Distillation (CONT, DIFF)
Knowledge distillation, commonly applied in Transformer-based models like DistilBERT [117], transfers knowledge from a larger teacher model to a smaller student model. The loss function combines standard cross-entropy with a KL divergence term that minimizes the difference between the output distributions of the teacher and student models:
L_KD = (1 − α) L_CE( p_student, y ) + α L_KL( p_student, p_teacher ),
where L CE is the cross-entropy loss, L KL is the KL divergence between the student and teacher outputs, and α controls the balance between the two terms.
The KL divergence loss is continuous (CONT) and differentiable (DIFF) in itself. In knowledge distillation, it allows smaller models to approximate the behavior of larger models, improving efficiency while maintaining performance.
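The combined distillation objective can be sketched as follows; `p_student` and `p_teacher` are hypothetical discrete output distributions with full support, and, since the survey leaves the KL argument order implicit, this sketch uses KL(teacher ‖ student), a common convention in distillation:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions with full support.
    return np.sum(p * np.log(p / q))

def distillation_loss(p_student, p_teacher, y_onehot, alpha):
    # Weighted sum of the hard-label cross-entropy and the
    # teacher-student KL divergence.
    ce = -np.sum(y_onehot * np.log(p_student))
    return (1.0 - alpha) * ce + alpha * kl_divergence(p_teacher, p_student)
```

When the student exactly matches the teacher, the KL term vanishes and only the cross-entropy term remains.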

6.4.2. Ranking Losses in LLM

In large language models (LLMs), ranking losses such as triplet loss and contrastive loss are widely used to structure the embedding space for tasks like semantic search, question answering, and sentence clustering. These losses aim to pull semantically similar sentences closer together while pushing dissimilar sentences farther apart, thus learning meaningful sentence embeddings. They will be discussed in Section 7 (in particular, in Section 7.2 and Section 7.4).

6.4.3. Alignment and Preference Optimization Losses

As LLMs have grown in capabilities and popularity among everyday users, aligning their outputs with human preferences has become a critical step, often replacing or supplementing complex reinforcement learning from human feedback (RLHF) pipelines. Recent advancements have translated this alignment directly as a loss optimization problem over preference data, where the dataset consists of a prompt x , a chosen response y w , and a rejected response y l .
Direct Preference Optimization (DPO) Loss (CONT, DIFF)
DPO [118] bypasses the need for a separate reward model by reparameterizing the reward implicitly through the language model policy. The DPO loss optimizes the policy network f Θ relative to a frozen reference model f ref :
L_DPO = −E_{(x, y_w, y_l)}[ log σ( β log( f_Θ(y_w | x) / f_ref(y_w | x) ) − β log( f_Θ(y_l | x) / f_ref(y_l | x) ) ) ],
where σ is the logistic function and β controls the deviation from the reference model. DPO operates as a continuous (CONT) and differentiable (DIFF) contrastive ranking loss for generative text.
Simple Preference Optimization (SimPO) Loss (CONT, DIFF)
While DPO is highly effective, it requires maintaining a frozen reference model in memory. Recent advancements like SimPO [119] and ORPO [120] eliminate this requirement. SimPO, for example, uses the length-normalized log probability as an implicit reward and introduces a target reward margin γ > 0 . The SimPO loss is defined as:
L_SimPO = −E_{(x, y_w, y_l)}[ log σ( (β/|y_w|) log f_Θ(y_w | x) − (β/|y_l|) log f_Θ(y_l | x) − γ ) ].
SimPO is continuous (CONT) and differentiable (DIFF). By incorporating a margin-based approach directly into the log-likelihoods without a reference model, it significantly reduces memory overhead and has been shown to outperform DPO in generating high-quality, aligned text [119].
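SimPO’s reference-free objective can be sketched in the same way; `logp_w`/`logp_l` are hypothetical sequence log-probabilities and `len_w`/`len_l` the response lengths used for normalization (a minimal NumPy sketch):

```python
import numpy as np

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    # Length-normalized implicit rewards with target margin gamma;
    # note that no reference model appears anywhere.
    margin = beta * (logp_w / len_w) - beta * (logp_l / len_l) - gamma
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

Even when the two responses have equal normalized log-probability, the margin term γ keeps the loss strictly positive, pushing the model to separate chosen from rejected responses.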

7. Ranking Losses

Machine learning can be employed to solve ranking problems, which have important industrial applications, especially in information retrieval systems. These problems can be typically solved by employing supervised, semi-supervised, or reinforcement learning [121,122].
The goal of ranking losses, in contrast to other loss functions like the cross-entropy loss or MSE loss, is to anticipate the relative distances between inputs rather than learning to predict a label, a value, or a set of values given an input. This is also sometimes called metric learning. Nevertheless, the cross-entropy loss can be used in the top-one probability ranking. In this scenario, given the scores of all the objects, the top-one probability of an object in this model indicates the likelihood that it will be ranked first [123].
Ranking loss functions can be highly customizable because they require only a method to measure the similarity between two data points, i.e., a similarity score. For example, consider a face verification dataset: pairs of photographs that belong to the same person will have a high similarity score, whereas those that do not will have a low score [124]. Different tasks, applications, and neural network configurations (like Siamese nets or triplet nets) use ranking losses. Because of this, various losses can be used, including contrastive loss, margin loss, hinge loss, and triplet loss.
In general, ranking loss functions require a feature extraction for two (or three) data instances, which returns an embedded representation for each of them. A metric function can then be defined to measure the similarity between those representations, such as the Euclidean distance. Finally, the feature extractors are trained to produce similar representations for both inputs in case the inputs are similar or distant representations in case of dissimilarity.
Similar to Section 6, both pairwise and triplet ranking losses are presented in a general form, as shown in Figure 5. The properties of the loss function (CONT, DIFF, etc.) are identified based on the metric function chosen.

7.1. Pairwise Ranking Loss

In the context of pairwise ranking loss, positive and negative pairs of training data points are used [121,125,126,127]. Positive pairs are composed of an anchor sample x a and a positive sample x p , which is similar to x a in the metric. Negative pairs are composed of an anchor sample x a and a negative sample x n , which is dissimilar to x a in that metric. The objective is to learn representations with a small distance d between them for positive pairs and a greater distance than some margin value m for negative pairs. Pairwise ranking loss forces representations to have a 0 distance for positive pairs and a distance greater than a margin for negative pairs.
Given r_a, r_p, and r_n, the embedded representations (the outputs of a feature extractor) of the input samples x_a, x_p, and x_n, respectively, and a distance function d, the loss function can be written as:
L_pairwise = d(r_a, r_p),  if positive pair
L_pairwise = max(0, m − d(r_a, r_n)),  if negative pair
For positive pairs, the loss will vanish if the distance between the embedding representations of the two elements in the pair is 0; instead, the loss will increase as the distance between the two representations increases. For negative pairs, the loss will vanish if the distance between the embedding representations of the two elements is greater than the margin m. However, if the distance is less than m, the loss will be positive, and the model parameters will be updated to provide representations for the two items that are farther apart. When the distance between r a and r n is 0, the loss value will be at most m. The purpose of the margin is to create representations for negative pairs that are far enough, thus implicitly stopping the training on these pairs and allowing the model to focus on more challenging ones. If r 0 and r 1 are the pair elements representations, y is a binary flag equal to 0 for a negative pair and to 1 for a positive pair, and the distance d is the Euclidean distance:
L_pairwise(r_0, r_1, y) = y ‖r_0 − r_1‖ + (1 − y) max(0, m − ‖r_0 − r_1‖)
Unlike typical classification learning, this loss requires more training data and time, since the number of candidate pairs grows quadratically with the dataset size and training needs access to all potential pairs. Additionally, because training is pairwise, the model only learns a binary similar/dissimilar signal for each pair, which makes correcting misclassified pairs more computationally expensive [128].
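The unified pairwise formula can be sketched directly; `r0` and `r1` are hypothetical embedding vectors and `y` the binary pair label (a minimal NumPy sketch with Euclidean distance):

```python
import numpy as np

def pairwise_ranking_loss(r0, r1, y, margin=1.0):
    # y = 1 for a positive pair, y = 0 for a negative pair.
    d = np.linalg.norm(r0 - r1)
    return y * d + (1 - y) * max(0.0, margin - d)
```

Identical embeddings give zero loss for a positive pair but the full margin for a negative pair; negative pairs already separated by more than the margin contribute nothing.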

7.2. Triplet Loss

Employing triplets of training data instances instead of pairs can produce better performance [122,124,129]. This approach is called triplet ranking loss. A triplet consists of an anchor sample x a , a positive sample x p , and a negative sample x n . The objective is for the distance between the anchor sample and the negative sample representations, d ( r a , r n ) , to be greater (by a margin m) than the distance between the anchor and positive representations d ( r a , r p ) . Here, d typically represents the Euclidean distance, but other distance metrics can be applied.
The triplet ranking loss is defined as:
L_triplet(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) − d(r_a, r_n))
This loss function considers three types of triplets based on the values of r a , r p , r n , and m:
  • Easy Triplets: d ( r a , r n ) > d ( r a , r p ) + m . Here, the distance between the negative sample and the anchor sample is already large enough. The model parameters are not updated, and the loss is 0.
  • Hard Triplets: d ( r a , r n ) < d ( r a , r p ) . In this case, the negative sample is closer to the anchor than the positive sample. The loss is positive (and > m ), leading to updates in the model’s parameters.
  • Semi-Hard Triplets: d ( r a , r p ) < d ( r a , r n ) < d ( r a , r p ) + m . Here, the negative sample is farther away from the anchor than the positive sample, but the margin constraint is not yet satisfied. The loss remains positive (and < m ), prompting parameter updates.
Triplet ranking loss is sensitive to small changes in the input samples, making it less generalizable across datasets [128]. This sensitivity arises because the model learns relative distances based on the specific sample distribution in the training data, meaning that transferring these learned distances to new data may not always yield the same effectiveness.
Triplet loss is also used in models like Sentence-BERT (SBERT) [130] to learn sentence embeddings that preserve semantic similarity. A triplet consists of an anchor sentence, a positive sentence (semantically similar to the anchor), and a negative sentence (semantically dissimilar). The objective is for the anchor’s embedding to be closer to the positive sentence than to the negative sentence by a margin m. This loss improves the alignment of sentence embeddings with semantic meaning, making it particularly useful for sentence similarity tasks.
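The triplet loss above can be sketched as follows; `r_a`, `r_p`, and `r_n` are hypothetical anchor, positive, and negative embeddings (a minimal NumPy sketch with Euclidean distance):

```python
import numpy as np

def triplet_loss(r_a, r_p, r_n, margin=1.0):
    # Hinge on the margin between anchor-positive and
    # anchor-negative distances.
    d_ap = np.linalg.norm(r_a - r_p)
    d_an = np.linalg.norm(r_a - r_n)
    return max(0.0, margin + d_ap - d_an)
```

An easy triplet (negative already far enough away) yields zero loss; a hard triplet (negative closer than the positive) yields a loss greater than the margin, matching the three cases listed above.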

7.3. Listwise Ranking Loss (CONT, DIFF)

Unlike pairwise and triplet ranking losses, which operate on pairs or triplets of data points, listwise ranking losses focus on ranking the entire list of items. One common listwise ranking loss is based on the softmax cross-entropy loss, which is used to predict the top-one ranking probability [123].
Given a list of items with scores, the softmax cross-entropy loss aims to maximize the probability of ranking the correct item at the top:
L_listwise = −Σ_{i=1}^{N} y_i log( e^{s_i} / Σ_{j=1}^{N} e^{s_j} ),
where y i is the ground-truth relevance for item i, and s i is the predicted score for item i in a list of N items.
This loss is continuous (CONT) and differentiable (DIFF), and it is particularly useful in information retrieval tasks, where the quality of the entire ranking list matters.
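The listwise softmax cross-entropy can be sketched as follows; `scores` and `relevance` are hypothetical arrays of predicted scores and ground-truth relevance labels for one list (a minimal NumPy sketch):

```python
import numpy as np

def listwise_softmax_loss(scores, relevance):
    # Softmax over the predicted scores (shifted by the max for
    # numerical stability), then cross-entropy against relevance.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.sum(relevance * np.log(p))
```

Raising the score of the relevant item relative to the rest lowers the loss, which is exactly the top-one probability objective described above.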

7.4. Contrastive Ranking Loss: NT-Xent (CONT, DIFF, L-CONT)

The normalized temperature-scaled cross-entropy loss (NT-Xent) is a generalization of pairwise ranking losses, typically used in self-supervised learning tasks [131]. This loss enables efficient learning by comparing the similarity between an anchor and multiple negative samples in a batch.
Given an anchor sample x a and a positive sample x p , along with N negative samples x n i , the NT-Xent loss is computed as:
L_NT-Xent = −log( e^{sim(x_a, x_p)/τ} / Σ_{i=1}^{N} e^{sim(x_a, x_{n_i})/τ} ),
where sim ( · , · ) is a similarity measure (e.g., cosine similarity), and τ is a temperature scaling parameter.
This loss is continuous (CONT), differentiable (DIFF), and Lipschitz continuous (L-CONT), and it is often used in contrastive learning tasks where learning representations from large-scale data with many negatives is necessary.
This loss has also been used in language models like SimCSE [132] to maximize the similarity between an anchor and a positive sample while minimizing the similarity with negative samples. One of the primary advantages of contrastive losses in the context of language modelling is their ability to improve the generalization of sentence embeddings by providing a strong supervision signal, particularly in tasks requiring semantic similarity [132]. Additionally, using multiple negatives in the mini-batch, as with NT-Xent, improves training efficiency by leveraging more negative examples [130], leading to faster convergence and more stable training compared to methods with fewer negatives.
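The NT-Xent computation can be sketched as follows; `anchor`, `positive`, and `negatives` are hypothetical embedding vectors, and, following the equation above, the denominator sums over the negatives (a minimal NumPy sketch with cosine similarity):

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent_loss(anchor, positive, negatives, tau=0.5):
    # Temperature-scaled similarities: positive in the numerator,
    # negatives summed in the denominator.
    pos = np.exp(cosine_sim(anchor, positive) / tau)
    neg = sum(np.exp(cosine_sim(anchor, n) / tau) for n in negatives)
    return -np.log(pos / neg)
```

Lowering the temperature τ sharpens the distribution, penalizing hard negatives (those most similar to the anchor) more strongly.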

LambdaLoss (CONT, DIFF)

LambdaLoss [133,134] is a specialized loss function designed for learning-to-rank tasks, where the objective is to optimize ranking metrics such as normalized discounted cumulative gain (NDCG) or mean reciprocal rank (MRR). It is particularly effective in applications like search engines and recommendation systems, where errors in top-ranked items are more costly.
LambdaLoss differs from traditional pairwise losses by directly optimizing ranking metrics like NDCG, and weighting errors by their impact on the final ranking. The loss function considers the relative change in the ranking metric when swapping the predicted rankings of two items.
Given two documents i and j, with ground truth labels y i and y j , and predicted scores f ( x i ) and f ( x j ) , LambdaLoss is defined as:
L_Lambda = | ΔNDCG_{ij} | · σ( f(x_i) − f(x_j) ),
where:
  • Δ NDCG i j represents the change in NDCG if the documents i and j are swapped in the ranking.
  • σ( f(x_i) − f(x_j) ) is the pairwise ranking loss, typically modeled with a logistic function σ(z) = 1/(1 + e^{−z}) or a hinge loss.
The total LambdaLoss L Lambda for a dataset is computed as the sum of all pairwise losses across all document pairs:
L_Lambda = Σ_{i<j} L_Lambda(i, j),
where i and j index the document pairs in the dataset.
L Lambda is continuous (CONT) and differentiable (DIFF), making it suitable for gradient-based optimization. The loss function directly optimizes ranking metrics such as NDCG, ensuring that the model’s training objective aligns with the final evaluation metric. LambdaLoss emphasizes higher-ranked items, as errors in top positions are penalized more heavily due to their greater impact on the overall ranking.
LambdaLoss is commonly used in systems that require high-quality rankings, such as search engines, ad placement, and recommendation systems.
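The NDCG-swap weighting at the heart of LambdaLoss can be sketched as follows; `relevances` and `scores` are hypothetical arrays of graded relevance labels and predicted scores, with documents listed in their current predicted order (a minimal NumPy sketch of the per-pair term, following the formula above):

```python
import numpy as np

def dcg(relevances):
    # Discounted cumulative gain of a ranked list of relevance labels.
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2.0 ** relevances - 1.0) / np.log2(ranks + 1.0))

def delta_ndcg(relevances, i, j):
    # |change in NDCG| when the items at positions i and j are swapped.
    ideal = dcg(np.sort(relevances)[::-1])
    swapped = relevances.copy()
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(relevances) - dcg(swapped)) / ideal

def lambda_pair_loss(relevances, scores, i, j):
    # Pairwise logistic term weighted by the NDCG impact of the swap.
    sigma = 1.0 / (1.0 + np.exp(-(scores[i] - scores[j])))
    return delta_ndcg(relevances, i, j) * sigma
```

Pairs whose swap would not change NDCG (e.g., equally relevant documents) receive zero weight, so the optimization effort concentrates on the pairs that matter for the final ranking metric.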

8. Energy-Based Losses

An energy-based model (EBM) is a probabilistic model that uses a scalar energy function to describe the dependencies of the model variables [135,136,137,138,139,140,141]. An EBM can be formalized as F : X × Y R , where F ( x , y ) stands for the relationship between the ( x , y ) pairings.
Given an energy function and the input x , the best fitting value of y is computed with the following inference procedure:
ỹ = argmin_{y} F(x, {y_0, …, y_N})
Energy-based models provide fully generative models that can be used as an alternative to probabilistic estimation for prediction, classification, or decision-making tasks [135,139,140,142,143].
The energy function E(x, y) ≡ F(x, y) can be explicitly evaluated for all values of y ∈ Y = {y_0, …, y_N} if and only if the size of the set Y is small enough. In contrast, when the space Y is sufficiently large, a specific strategy, known as the inference procedure, must be employed to find the y that minimizes E(x, {y_0, …, y_N}).
In many real situations, the inference procedure can produce an approximate result, which may or may not be the global minimum of E ( x , { y 0 , , y N } ) for a given x . Moreover, it is possible that E ( x , { y 0 , , y N } ) has several equivalent minima. The best inference procedure to use often depends on the internal structure of the model. For example, if Y is continuous and E ( x , { y 0 , , y N } ) is smooth and differentiable everywhere concerning y , a gradient-based optimization algorithm can be employed [144].
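For a small discrete Y, the inference procedure reduces to an exhaustive search over candidates; a minimal sketch, where `energy_fn` is a hypothetical callable E(x, y):

```python
import numpy as np

def ebm_predict(energy_fn, x, y_candidates):
    # Exhaustive inference for a small discrete Y:
    # return the candidate y with the lowest energy E(x, y).
    energies = [energy_fn(x, y) for y in y_candidates]
    return y_candidates[int(np.argmin(energies))]
```

For continuous, differentiable energies, this exhaustive search would be replaced by a gradient-based minimization over y, as noted above.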
In general, any probability density function p ( x ) for x R D can be rewritten as an EBM:
$p_\theta(x) = \dfrac{\exp(-E_\theta(x))}{\int_{x'} \exp(-E_\theta(x')) \, dx'},$
where the energy function ( E θ ) can be any function parameterized by θ Θ (such as a neural network). In these models, a prediction (e.g., finding p ( x 0 | x 1 , x 2 , ) ) is made by fixing the values of the conditional variables and estimating the remaining variables, (e.g., x 0 ), by minimizing the energy function [145,146,147].
An EBM is trained by finding an energy function that associates low energies to values of x drawn from the underlying data distribution, p θ ( x ) p D ( x ) , and high energies for values of x not close to the underlying distribution.
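When the domain is finite, the normalizing integral above reduces to a sum, and the Gibbs density can be computed directly. The NumPy sketch below illustrates this (the helper name `gibbs_probs` is ours, not a standard API); the max-subtraction is the usual numerical-stability trick for the softmax:

```python
import numpy as np

def gibbs_probs(energies):
    """Map a vector of energies E_theta(x) over a finite domain to the
    Gibbs distribution p(x) = exp(-E(x)) / sum_x' exp(-E(x')).
    Subtracting the max exponent keeps the softmax numerically stable."""
    z = -np.asarray(energies, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

# Three candidate values of x with hand-picked energies:
# lower energy maps to higher probability.
p = gibbs_probs([0.5, 2.0, 4.0])
```

This is exactly the sense in which training "pushes energies down" on data: lowering an energy raises the corresponding probability mass at the expense of all other values.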

8.1. Training EBM

Given the aforementioned conceptual framework, training can be thought of as finding the model parameters that produce a good match between the output $y \in \mathcal{Y}$ and the input $x \in \mathcal{X}$. This is achieved by selecting the best energy function from the family $\mathcal{E} = \{F(\Theta, X, Y) : \Theta \in \mathcal{W}\}$ by searching over the model parameters $\Theta$ [148,149]. A loss function should mirror the behavior of the energy function described in Equation (72): the low energy of a correct answer must translate into a low loss, and the higher energies of incorrect answers into a higher loss.
Considering a set of training samples S = { ( x i , y i ) : i = 1 , , N } , during the training procedure, the loss function should have the effect of pushing down E ( Θ , y i , x i ) and pulling up E ( Θ , y ˜ i , x i ) , i.e., finding the parameters Θ that minimize the loss:
$\Theta^* = \operatorname*{argmin}_{\Theta \in \mathcal{W}} \mathcal{L}_{ebm}(\Theta, S)$
The general form of the loss function L e b m is defined as:
$\mathcal{L}_{ebm}(\Theta, S) = \frac{1}{N} \sum_{i=1}^{N} L_{ebm}\big(y_i, E(\Theta, \mathcal{Y}, x_i)\big) + \text{regularizer term}$
where:
  • $L_{ebm}(y_i, E(\Theta, \mathcal{Y}, x_i))$ is the per-sample loss;
  • $y_i$ is the desired output;
  • $E(\Theta, \mathcal{Y}, x_i)$ is the energy surface for a given $x_i$ as $y \in \mathcal{Y}$ varies.
This is an average of a per-sample loss function, denoted as $L_{ebm}(y_i, E(\Theta, \mathcal{Y}, x_i))$, over the training set. The per-sample loss depends on the desired output $y_i$ and on the energies obtained by holding the input sample $x_i$ fixed while scanning the output $y$ over $\mathcal{Y}$. With this definition, the loss is invariant to permutations of the training samples and to repetitions of the training set [135]. As the size of the training set grows, the model is less likely to overfit [150].

8.2. Loss Functions for EBMs

Following Figure 6, in this section, we introduce the energy loss first since it is the most straightforward. Then, we present common losses in machine learning that can be adapted to the energy-based models, such as the negative log-likelihood loss, the hinge loss, and the log loss. Subsequently, we introduce more sophisticated losses like the generalized perceptron loss and the generalized margin loss. Finally, the minimum classification error loss, the square-square loss, and its variation square-exponential loss are presented.

8.2.1. Energy Loss

The energy loss is the most straightforward option: it uses the energy function itself as the per-sample loss:
$L_{energy}\big(y_i, E(\Theta, \mathcal{Y}, x_i)\big) = E(\Theta, x_i, y_i)$
This loss is often used in regression tasks. According to its definition, it pulls the energy function down for values that are close to the correct data distribution. However, the energy function is not pulled up for incorrect values. The assumption is that, by lowering the energy function in the correct location, the energy for incorrect values is left higher as a result. Due to this assumption, the training is sensitive to the model design and may result in energy collapse, leading to a largely flat energy function.

8.2.2. Generalized Perceptron Loss (L-CONT, CONVEX)

The generalized perceptron loss is defined as:
$L_{perceptron}\big(y_i, E(\Theta, \mathcal{Y}, x_i)\big) = E(\Theta, y_i, x_i) - \min_{y \in \mathcal{Y}} E(\Theta, y, x_i)$
This loss is non-negative, since the second term is a lower bound of the first, i.e., $E(\Theta, y_i, x_i) - \min_{y \in \mathcal{Y}} E(\Theta, y, x_i) \geq 0$. Minimizing this loss pushes down the energy of the correct answer and pulls up the energy of the model prediction. Although it is widely used [151,152], this loss is suboptimal: it does not enforce a gap between the correct output and the incorrect ones, it does not prevent the model from assigning the same energy to every incorrect output, and it may produce flat energy surfaces [135].
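For a finite output space, the generalized perceptron loss can be computed by scanning the energies of all candidate outputs. A minimal sketch (the function name is ours, for illustration only):

```python
import numpy as np

def perceptron_loss(energies, correct_idx):
    """Generalized perceptron loss for a single sample.
    energies[k] holds E(Theta, y_k, x_i) for every candidate output;
    correct_idx selects the desired output y_i. The result is always
    non-negative because min(energies) lower-bounds any entry."""
    energies = np.asarray(energies, dtype=float)
    return energies[correct_idx] - energies.min()

loss = perceptron_loss([1.5, 0.2, 3.0], correct_idx=0)  # 1.5 - 0.2 = 1.3
```

Note that when the correct output already has the lowest energy, the loss is exactly zero, which is why this objective cannot enforce a margin.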

8.2.3. Negative Log-Likelihood Loss (CONT, DIFF, CONVEX)

In analogy with the description in Section 5.3.1, the negative log-likelihood loss (NLL) in the energy-based context is defined as:
$\mathcal{L}_{NLL}(\Theta, S) = \frac{1}{N} \sum_{i=1}^{N} \left( E(\Theta, y_i, x_i) + \frac{1}{\beta} \log \int_{y' \in \mathcal{Y}} e^{-\beta E(\Theta, y', x_i)} \, dy' \right)$
where S is the training set.
This loss reduces to the perceptron loss as $\beta \to \infty$ and to the log loss when $\mathcal{Y}$ contains only two labels (i.e., binary classification). Since the integral above is generally intractable, considerable effort has been devoted to approximation methods, including Monte Carlo sampling [153] and variational methods [154]. While these methods were devised as approximate ways of minimizing the NLL loss, they can be viewed in the energy-based framework as different strategies for choosing the outputs $y$ whose energies will be pulled up [135,155]. The NLL is also known as the cross-entropy loss [156] and is widely used in many applications, including energy-based models [157,158]. This formulation is subject to the same limitations listed in Section 5.3.1.
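For a finite output space the partition term becomes a log-sum-exp, which is tractable and can be stabilized numerically. The sketch below (an illustrative helper, not a library API) also exhibits the limiting behavior mentioned above, where large $\beta$ recovers the perceptron loss:

```python
import numpy as np

def nll_loss(energies, correct_idx, beta=1.0):
    """Per-sample NLL over a finite output space:
    E(correct) + (1/beta) * log sum_y exp(-beta * E(y)).
    The log-sum-exp is stabilized by factoring out the max exponent."""
    E = np.asarray(energies, dtype=float)
    z = -beta * E
    lse = z.max() + np.log(np.exp(z - z.max()).sum())
    return E[correct_idx] + lse / beta
```

As `beta` grows, the second term approaches `-min(E)`, so the loss approaches the generalized perceptron loss.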

8.2.4. Generalized Margin Loss

The generalized margin loss is a more reliable version of the generalized perceptron loss. The general form of the generalized margin loss in the context of energy-based training is defined as:
$L_{margin}(\Theta, x_i, y_i) = Q_m\big(E(\Theta, x_i, y_i), \, E(\Theta, x_i, \bar{y}_i)\big)$
where $\bar{y}_i$ is the so-called “most offending incorrect output”, i.e., the incorrect output with the lowest energy among all possible outputs [135]; $m$ is a positive margin parameter; and $Q_m$ is a convex function that is increasing in its first argument and decreasing in its second, so that the loss is low when $E(\Theta, x_i, y_i)$ is low and $E(\Theta, x_i, \bar{y}_i)$ is high. In other words, the loss function ensures that the energy of the most offending incorrect output exceeds the energy of the correct output by some margin.
This loss function is written in the general form, and a wide variety of losses that use specific margin function Q m to produce a gap between the correct output and the wrong output are formalized in the following part of the section.
Hinge Loss (L-CONT, CONVEX)
Already explained in Section 5.2.2, the hinge loss can be rewritten as:
$L_{hinge}(\Theta, x_i, y_i) = \max\big(0, \, m + E(\Theta, x_i, y_i) - E(\Theta, x_i, \bar{y}_i)\big)$
This loss enforces that the difference between the correct answer and the most offending incorrect answer be at least m [159,160]. Individual energies are not required to take a specific value because the hinge loss depends on energy differences. This loss function shares limitations with the original hinge loss defined in Equation (24).
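Because the hinge loss depends only on the energy gap, it can be computed from two scalars. A minimal sketch (illustrative helper name):

```python
def ebm_hinge_loss(e_correct, e_offending, m=1.0):
    """Hinge loss on the energy gap: zero once the most offending
    incorrect output's energy exceeds the correct one by at least m."""
    return max(0.0, m + e_correct - e_offending)
```

For example, `ebm_hinge_loss(0.2, 1.5, m=1.0)` is zero, because the gap of 1.3 already exceeds the margin.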
Log Loss (DIFF, CONT, CONVEX)
This loss is similar to the hinge loss, but it sets a softer margin between the correct output and the most offending outputs. The log loss is defined as:
$L_{log}(\Theta, x_i, y_i) = \log\big(1 + e^{E(\Theta, x_i, y_i) - E(\Theta, x_i, \bar{y}_i)}\big)$
This loss is also called soft hinge and it may produce overfitting on high-dimensional datasets [161].
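The log loss is the softplus of the energy gap. A naive `log(1 + exp(gap))` overflows for large positive gaps, so the sketch below (illustrative helper name) uses the standard stable form:

```python
import math

def ebm_log_loss(e_correct, e_offending):
    """Soft-hinge (log) loss on the energy gap; this is softplus(gap),
    written in a numerically stable branch form."""
    gap = e_correct - e_offending
    if gap > 0:
        # softplus(x) = x + log1p(exp(-x)) for x > 0 avoids overflow
        return gap + math.log1p(math.exp(-gap))
    return math.log1p(math.exp(gap))
```

Unlike the hard hinge, this loss is strictly positive everywhere, so it keeps pushing the offending energy up even past the soft margin.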
Minimum Classification Error Loss (CONT, DIFF)
A straightforward function that roughly counts the total number of classification errors while being smooth and differentiable is known as the minimum classification error (MCE) loss [162]. The MCE is written as a sigmoid function:
$L_{mce}(\Theta, x_i, y_i) = \sigma\big(E(\Theta, x_i, y_i) - E(\Theta, x_i, \bar{y}_i)\big)$
where σ is defined as σ ( x ) = ( 1 + e x ) 1 . While this function lacks an explicit margin, it nevertheless produces an energy difference between the most offending incorrect output and the correct output.
Square-Square Loss (CONT, CONVEX)
The square-square loss treats the energy of the correct output $E(\Theta, x_i, y_i)$ and the energy of the most offending incorrect output $E(\Theta, x_i, \bar{y}_i)$ differently:
$L_{sq\text{-}sq}(\Theta, x_i, y_i) = E(\Theta, x_i, y_i)^2 + \big(\max(0, \, m - E(\Theta, x_i, \bar{y}_i))\big)^2$
The combination aims to minimize the energy of the correct output while enforcing a margin of at least $m$ on the most offending incorrect output. This loss is a modified version of the margin loss and can only be used when the energy function is bounded below [126,155].
Square-Exponential Loss (CONT, DIFF, CONVEX)
This loss is similar to the square-square loss function, and it only differs in the second term:
$L_{sq\text{-}exp}(\Theta, x_i, y_i) = E(\Theta, x_i, y_i)^2 + \gamma \, e^{-E(\Theta, x_i, \bar{y}_i)}$
Here, $\gamma$ is a positive constant. The combination aims to minimize the energy of the correct output while pushing the energy of the most offending incorrect output toward an infinite margin [125,142,155]. This loss can be considered a regularized version of the square-square loss and, like it, can only be used when the energy function is bounded below [126,155].
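Both margin variants are straightforward to compute per sample. A minimal sketch of the two (illustrative helper names; we assume non-negative energies, so the lower-bound requirement holds):

```python
import math

def square_square_loss(e_correct, e_offending, m=1.0):
    """Quadratic pull-down on the correct energy plus a quadratic
    push-up on the offending energy while it lies inside the margin m.
    Assumes the energy function is bounded below (here, non-negative)."""
    return e_correct ** 2 + max(0.0, m - e_offending) ** 2

def square_exp_loss(e_correct, e_offending, gamma=1.0):
    """Square-exponential variant: the exponential term keeps pushing
    the offending energy up toward an infinite margin."""
    return e_correct ** 2 + gamma * math.exp(-e_offending)
```

Once the offending energy clears the margin `m`, the square-square penalty vanishes, whereas the exponential term decays but never reaches zero.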

9. Relational Learning

While regression and classification losses focus on point-wise predictions, and ranking losses focus mainly on pairwise relationships, relational learning losses aim to capture the topological dependencies and global properties of graph-structured data. These losses are predominant in graph neural networks (GNNs) and manifold learning methods, where the goal is to embed nodes such that the geometric and relational structure of the original graph is preserved in the latent space. A graph G = ( V , E ) consists of a set of nodes V and edges E representing connections (e.g., in social networks, molecular graphs, knowledge graphs, or recommendation systems). Unlike traditional neural networks that assume Euclidean input structure (grids or sequences), GNNs directly incorporate graph topology and node features to model complex relational dependencies [163].
Common loss functions in graph learning include probabilistic losses like cross-entropy (see Section 5.3.1), which is well suited for supervised node and graph classification as well as link prediction tasks [164,165], and hinge-based, used in ranking scenarios, such as knowledge graph embedding, where relative distances between correct and incorrect edges are optimized [166,167]. In generative settings (see Section 6), reconstruction losses are employed by graph autoencoders (GAEs) and variational GAEs to learn node embeddings that can accurately reconstruct the graph’s adjacency matrix, effectively modeling the underlying graph distribution [165,168]. Relational learning losses can thus be categorized by their underlying strategy: probabilistic losses and error-based losses.
Many graph learning objectives, especially in self-supervised settings, explicitly leverage the structure and topology of the graph beyond what standard pointwise supervision provides. In this section, we describe several relational learning losses that encode distinct inductive biases. Some enforce smoothness over local neighborhoods, others preserve higher-order structural relationships or role-based similarities, and yet others use contrastive objectives to learn embeddings invariant to graph perturbations. These losses can be used standalone or as regularizers alongside supervised objectives, guiding GNNs to capture both local and global patterns and yielding embeddings that are structurally meaningful and generalize well across downstream tasks.

9.1. Graph Reconstruction Loss (CONT, DIFF, L-CONT)

Graph reconstruction losses aim to rebuild the graph from learned embeddings, typically treating the presence or absence of edges as a prediction target. A common choice is a binary cross-entropy loss over the adjacency matrix. For an unweighted graph with adjacency matrix A, the model (e.g., a GAE) produces an estimated probability A ^ i j for each potential edge ( i , j ) . The loss sums over all node pairs:
$\mathcal{L}_{recon} = -\sum_{i,j} \Big[ A_{ij} \log \hat{A}_{ij} + (1 - A_{ij}) \log\big(1 - \hat{A}_{ij}\big) \Big],$
often with negative sampling or weighting to handle class imbalance (since sparse graphs have far more absent edges than present ones). By directly optimizing the likelihood of observed links, this loss encourages embeddings to encode information sufficient to predict graph connectivity. Graph reconstruction loss is widely used in unsupervised network embedding and link prediction scenarios, providing a simple yet effective way to train on graph structure when explicit labels are unavailable. It underpins approaches like VGAE [165] and ARGA (adversarially regularized GAE) [168], which have demonstrated strong performance on link prediction benchmarks.
One advantage of the reconstruction objective is its directness and interpretability: low reconstruction error implies that important connections in the original graph are preserved in the embedding space. However, reconstructing every edge can lead to overfitting to noisy or uninformative links and over-emphasizing high-degree nodes; regularization techniques (e.g., adversarial training [168]) and negative sampling are often necessary to mitigate these issues. Moreover, this loss focuses on immediate local structure (adjacent relationships), so it may be complemented by additional terms that capture broader graph properties for a more holistic representation.
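A dense NumPy sketch of the reconstruction objective (illustrative helper name; practical implementations add negative sampling or edge weighting rather than summing all pairs):

```python
import numpy as np

def graph_reconstruction_loss(A, A_hat, eps=1e-9):
    """Binary cross-entropy between the observed adjacency matrix A
    and predicted edge probabilities A_hat, summed over node pairs.
    eps clips the log arguments away from zero."""
    A_hat = np.clip(A_hat, eps, 1.0 - eps)
    return -np.sum(A * np.log(A_hat) + (1 - A) * np.log(1 - A_hat))

A = np.array([[0.0, 1.0], [1.0, 0.0]])
perfect = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = graph_reconstruction_loss(A, perfect)  # ~0 for a perfect reconstruction
```

A model such as a GAE would produce `A_hat` from node embeddings, e.g. via `sigmoid(H @ H.T)`.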

9.2. Random Walk-Based Loss (CONT, DIFF, L-CONT)

Random walk-based methods, such as DeepWalk [169] and node2vec [170], learn node embeddings by leveraging the co-occurrence statistics of nodes in short random walks over the graph. The key idea is analogous to word embeddings in NLP: nodes that frequently appear together within a random walk window should have similar vector representations. The loss is typically formulated as a skip-gram objective with negative sampling. For a node v, let N w ( v ) denote the multiset of nodes encountered within a window of length w in random walks starting from v. Let h u denote the embedding of node u. A common objective is:
$\mathcal{L}_{RW} = -\sum_{v \in V} \sum_{u \in N_w(v)} \Big[ \log \sigma(h_v^\top h_u) + \sum_{u' \sim P_n} \log\big(1 - \sigma(h_v^\top h_{u'})\big) \Big],$
where σ is the sigmoid function, P n is a noise distribution over nodes for drawing negative samples u , and h v h u represents the similarity (inner product) between node embeddings. This loss pushes co-occurring nodes to have high similarity, while pulling randomly sampled node pairs apart, effectively capturing both local and some higher-order proximity patterns in the graph. Random-walk-based losses are unsupervised and have been widely used for node representation learning, link prediction, and even as a pretraining mechanism for GNNs. They can incorporate vertex proximity information without requiring labels.
A notable advantage of this approach is its scalability and simplicity, as it reduces graph mining to repeated local context generation and word2vec-style optimization. However, these methods rely purely on graph structure (ignoring node attributes) and are typically transductive; embeddings for new nodes cannot be obtained without retraining. Subsequent inductive methods like GraphSAGE [171] address this limitation by learning a mapping function that can generate embeddings for unseen nodes. Hyperparameters such as walk length, context window size, and the number of negative samples must be carefully tuned, as they significantly influence the quality of the learned embeddings.
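A minimal NumPy sketch of the skip-gram objective with negative sampling (illustrative helper; real DeepWalk/node2vec implementations use a unigram noise distribution and optimize with SGD rather than evaluating the loss exhaustively, and here negatives are drawn uniformly over nodes):

```python
import numpy as np

def skipgram_loss(H, walks, window=2, num_neg=2, seed=0):
    """Skip-gram objective with negative sampling over random walks.
    H[v] is the embedding of node v; walks is a list of node sequences.
    Co-occurring pairs are pulled together; pairs drawn from a uniform
    noise distribution over nodes are pushed apart."""
    rng = np.random.default_rng(seed)
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    n, loss = len(H), 0.0
    for walk in walks:
        for t, v in enumerate(walk):
            # context window around position t, excluding v itself
            for u in walk[max(0, t - window): t + window + 1]:
                if u == v:
                    continue
                loss -= np.log(np.clip(sig(H[v] @ H[u]), 1e-12, 1.0))
                for u_neg in rng.integers(0, n, size=num_neg):
                    loss -= np.log(np.clip(1.0 - sig(H[v] @ H[u_neg]), 1e-12, 1.0))
    return loss
```

With all-zero embeddings every sigmoid evaluates to 0.5, so the loss is simply (number of terms) × log 2, which makes the sketch easy to sanity-check.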

9.3. Motif-Based Loss (CONT, DIFF, L-CONT)

Motif-based losses aim to capture higher-order graph structures by encouraging node embeddings to preserve specific recurrent subgraph patterns (motifs) such as triangles, cliques, or cycles. Motifs often carry meaningful functional information—for instance, triangles in social networks may indicate strong friend groups, while specific graphlets in protein interaction networks signify biological complexes. A standard formulation, introduced by Benson et al. [172], involves constructing a motif-adjacency matrix A M , where A i j M denotes the count of motifs containing both nodes i and j. The objective is to learn embeddings that can reconstruct this higher-order proximity matrix, effectively treating motif co-occurrence as a stronger form of connectivity. Formally, one can minimize the reconstruction error between the motif adjacency and the embedding similarity:
$\mathcal{L}_{motif} = \sum_{i,j} A^M_{ij} \, \| h_i - h_j \|^2,$
where h i , h j are the node embeddings. By optimizing this loss, the model forces nodes that frequently participate in the same motifs to have highly similar representations. This approach provides a powerful inductive bias, as demonstrated in works on higher-order spectral clustering [172] and motif-based graph convolution [173], which show that incorporating motif structures significantly improves performance on tasks sensitive to local topology. It has been applied in social networks (to enhance community detection or model friend groups) [174,175], in biological networks (to capture functional substructures like regulatory loops or protein complexes) [176,177], and in knowledge graphs (to incorporate common relational subgraph patterns) [178,179]. A practical consideration is the computational cost of motif enumeration, which can be high for large graphs; thus, one must carefully select motifs relevant to the specific downstream task. Nonetheless, when appropriate motifs are identified, this loss provides a powerful inductive bias, as demonstrated in motif-based graph convolutions [173], significantly improving performance on tasks sensitive to local topology.
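Given a precomputed motif-adjacency matrix, the loss is a motif-weighted smoothness penalty. A dense NumPy sketch (illustrative helper; the broadcasting uses O(n²) memory, so large graphs would sample pairs instead):

```python
import numpy as np

def motif_loss(A_motif, H):
    """Motif-weighted smoothness: node pairs that co-occur in many
    motifs (large A_motif[i, j]) are pulled together in embedding space."""
    diff = H[:, None, :] - H[None, :, :]   # pairwise differences h_i - h_j
    sq = np.sum(diff ** 2, axis=-1)        # ||h_i - h_j||^2 for all pairs
    return np.sum(A_motif * sq)
```

Identical embeddings give zero loss regardless of the motif counts, and larger counts amplify the penalty on the corresponding pair.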

9.4. Graph Contrastive Loss (CONT, DIFF, L-CONT)

Graph contrastive loss is a self-supervised objective designed to learn informative node or graph embeddings by comparing different augmented views of a graph. The central idea is that the representation of an entity (a node, subgraph, or entire graph) should be similar under different data augmentations, while remaining distinct from representations of other entities. In practice, one constructs two views of the graph (for instance, by randomly dropping or perturbing edges, features, or nodes) and obtains embeddings for the same node in both views. A contrastive loss then maximizes agreement between embeddings of corresponding entities (positives) and minimizes agreement with embeddings of different entities (negatives). A typical formulation of the contrastive loss for node embeddings uses an InfoNCE-based objective [131,180]:
$\mathcal{L}_{CL} = -\sum_{i \in V} \log \dfrac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{e^{\mathrm{sim}(h_i, h_i^+)/\tau} + \sum_{j \neq i} e^{\mathrm{sim}(h_i, h_j)/\tau}},$
where h i is the embedding of node i in the original graph, h i + is the embedding of the same node in an augmented version of the graph (a positive pair), h j are embeddings of negative samples (other nodes or nodes from different graphs), sim ( · , · ) is a similarity function (e.g., cosine similarity), and τ is a temperature hyperparameter. By minimizing this loss (equivalently, maximizing the InfoNCE objective), the model learns embeddings that are invariant to the chosen graph perturbations yet discriminative among different nodes. Graph contrastive learning has proven effective for self-supervised representation learning on graphs, yielding embeddings that perform well on downstream tasks such as node classification, graph classification, link prediction, and transfer learning without requiring any labels. Notable examples include methods like GraphCL [181] and GRACE [182], which differ in the types of augmentations used (edge removal, attribute masking, node dropout, subgraph sampling, etc.), and deep graph infomax (DGI) [183], which employs a global–local contrastive scheme.
A major advantage of contrastive losses is their ability to extract rich features by maximizing mutual information between different views of the data [183]. However, their performance can be sensitive to the choice of graph augmentations and negative sampling strategy: inappropriate or overly strong augmentations might destroy important structural information, whereas trivial augmentations can lead to degenerate solutions (e.g., collapsed embeddings that lack useful information). Careful design of augmentations and tuning of the temperature and negative sample pool are required to ensure that the contrastive task provides a meaningful learning signal. Despite these challenges, contrastive objectives offer a flexible framework to leverage abundant unlabeled graph data for representation learning, and they have quickly become a cornerstone of state-of-the-art GNN pretraining techniques.
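A vectorized NumPy sketch of the InfoNCE objective above, using cosine similarity and, as the negatives, the other nodes of the same view (illustrative helper; GRACE-style methods additionally use cross-view negatives):

```python
import numpy as np

def info_nce_loss(H, H_pos, tau=0.5):
    """InfoNCE over two augmented views: H[i] and H_pos[i] are the two
    embeddings of node i; every other row of H serves as a negative.
    Cosine similarity is used, as is common in contrastive setups."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    Pn = H_pos / np.linalg.norm(H_pos, axis=1, keepdims=True)
    pos = np.exp(np.sum(Hn * Pn, axis=1) / tau)   # exp(sim(h_i, h_i+)/tau)
    intra = np.exp(Hn @ Hn.T / tau)               # exp(sim(h_i, h_j)/tau)
    neg = intra.sum(axis=1) - np.diag(intra)      # sum over j != i
    return -np.sum(np.log(pos / (pos + neg)))
```

The loss is smallest when each node matches its own augmented counterpart; shuffling the positives increases it.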

9.5. Graph Laplacian (Smoothness) Loss (CONT, DIFF, L-CONT)

The graph Laplacian loss (also known as a smoothness or homophily regularization) is a fundamental graph-specific loss that encourages connected nodes to have similar embeddings. The intuition is that in many real-world graphs, adjacent nodes are likely to share properties (the homophily assumption). Given the adjacency matrix A of the graph, the Laplacian smoothness loss is typically defined as:
$\mathcal{L}_{lap} = \frac{1}{2} \sum_{i,j} A_{ij} \, \| h_i - h_j \|^2,$
where h i and h j are the embeddings of nodes i and j respectively. This objective penalizes large differences between the representations of any two connected nodes, encouraging the learned embedding function to vary smoothly across edges. Graph Laplacian-based losses are widely used in semi-supervised learning on graphs (where only a small subset of nodes have labels) as a regularizer to complement a supervised loss, as well as in purely unsupervised manifold learning and spectral embedding methods [184]. By enforcing local smoothness, this loss helps models generalize from limited labeled data and preserves local structural patterns in the learned embeddings. For example, graph convolutional networks (GCNs) implicitly incorporate this principle in their message-passing layers and often benefit from an explicit Laplacian regularization term during training to further align neighboring node representations [165].
A benefit of L lap is that it is convex and easy to compute. However, if over-emphasized, a strong smoothness constraint can lead to the well-known over-smoothing effect, wherein node embeddings become nearly indistinguishable across large regions of the graph after repeated averaging [185]. This is especially problematic in graphs with heterophily, where connected nodes have disparate labels or attributes, and forcing them to be similar can harm performance. In practice, the Laplacian loss term is often given a relatively small weight or applied only to certain layers, to avoid washing out important differences while still promoting consistency among neighbors. Despite this caveat, Laplacian smoothing is a key component in many graph learning approaches, as it exploits network connectivity to impose a useful inductive bias of local consistency.
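The smoothness penalty admits a compact matrix form via the identity $\frac{1}{2} \sum_{i,j} A_{ij} \|h_i - h_j\|^2 = \mathrm{tr}(H^\top L H)$ with $L = D - A$. A NumPy sketch (illustrative helper name):

```python
import numpy as np

def laplacian_loss(A, H):
    """Smoothness penalty 1/2 * sum_ij A_ij ||h_i - h_j||^2, computed
    through the graph Laplacian identity trace(H^T L H), L = D - A."""
    L = np.diag(A.sum(axis=1)) - A   # combinatorial Laplacian
    return np.trace(H.T @ L @ H)
```

For two connected nodes with scalar embeddings 0 and 1, the penalty is exactly 1, and it vanishes whenever connected nodes share the same embedding.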

9.6. Mutual Information Maximization Loss (CONT, DIFF)

Another paradigm in self-supervised graph learning is to maximize the mutual information (MI) between local and global representations of a graph. The principle is that a good node embedding should capture not only its local neighborhood features but also be informative about the broader subgraph or graph it resides in. Deep graph infomax (DGI) [183] is a representative approach that introduces a global summary vector for the graph and trains the model to distinguish between embeddings from the true graph and those from a corrupted graph. Concretely, one computes a readout or summary representation s of the entire graph (e.g., by average or pooling over all node embeddings), and uses a discriminator to classify node embeddings as either coming from the original graph (positive) or from a “negative” graph where the input features or structure have been shuffled. The loss can be written as:
$\mathcal{L}_{MI} = -\sum_{i \in V} \log \sigma(h_i^\top s) - \sum_{j \in V} \log\big(1 - \sigma(\tilde{h}_j^\top s)\big),$
where h i is the embedding of node i in the real graph, h ˜ j is the embedding of node j in a corrupted version of the graph (the negative sample), s is the summary embedding of the real graph, and σ is the sigmoid function. By maximizing the mutual information between { h i } and the global descriptor s , the model is encouraged to produce node representations that are globally coherent and sensitive to the graph’s overall structure. This enables effective self-supervised training of GNNs, allowing them to generate high-quality embeddings without any labels and to achieve strong performance on tasks like node classification, clustering, and transfer learning [183].
The advantage of MI-based losses like DGI is that they force the encoder to capture not just immediate neighbor information but also higher-level structural patterns that correlate with global graph properties. A potential drawback is the need to design a good graph corruption strategy (e.g., feature shuffling or graph sampling) to generate challenging negatives. If the negative examples are too trivial (e.g., completely random graphs), the discriminator’s task becomes too easy and the learned embeddings may not be very informative; if negatives are too similar to the real graph, the task may be overly difficult. Despite these considerations, mutual information maximization loss has proven to be a powerful driver for unsupervised graph representation learning and is complementary to purely local objectives: it encourages the preservation of global graph information that might otherwise be overlooked.
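A simplified NumPy sketch of the DGI-style objective (illustrative helper; DGI proper uses a learned bilinear discriminator and a sigmoid readout, whereas here the discriminator is a plain inner product with a mean-pooled summary):

```python
import numpy as np

def dgi_loss(H_real, H_corrupt):
    """DGI-style binary objective: embeddings from the real graph should
    score high against the summary vector s; embeddings from a corrupted
    graph should score low. Scores are clipped before the logs."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    s = H_real.mean(axis=0)                       # readout / summary vector
    pos = np.clip(sig(H_real @ s), 1e-12, 1.0)
    neg = np.clip(1.0 - sig(H_corrupt @ s), 1e-12, 1.0)
    return -np.sum(np.log(pos)) - np.sum(np.log(neg))
```

Corrupted embeddings that anti-align with the summary yield a low loss, while corruptions indistinguishable from the real graph are heavily penalized.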

9.7. Distance/Structural Preservation Loss (CONT, DIFF, L-CONT)

Distance-preserving losses are designed to maintain certain graph-theoretic distances or structural similarities between nodes in the learned embedding space. Unlike local adjacency-based losses (e.g., reconstruction or Laplacian smoothing), these losses aim to preserve more global and mesoscopic structure—such as community structure, centrality, or role equivalence—by embedding nodes such that graph-derived distances are reflected as geometric distances. The general idea is to ensure that if two nodes are structurally similar in the original graph (not necessarily directly connected, but e.g., having similar connectivity patterns or occupying analogous positions), their embeddings should be close in vector space. A typical formulation uses a penalty on differences between graph distances and Euclidean distances of embeddings. For instance:
$\mathcal{L}_{dist} = \sum_{i,j} \big( d_G(i, j) - \| h_i - h_j \| \big)^2,$
where $d_G(i, j)$ is a chosen graph-based distance or similarity measure between nodes $i$ and $j$ (e.g., shortest-path distance, random-walk distance, or a graph-kernel similarity), and $\| h_i - h_j \|$ is the Euclidean distance between the corresponding node embeddings. This objective penalizes discrepancies between distances in the graph and distances in the embedding space, ensuring that nodes with similar structural roles or positions remain close in latent space. Distance-preservation losses have been applied to role discovery and structural node embedding (capturing nodes with similar graph roles even if they are far apart) [186,187], anomaly detection in networks (where structurally unusual nodes should stand out in the embedding space) [188], and improving link prediction by accounting for long-range dependencies beyond immediate neighbors [170,186].
A benefit of these losses is that they incorporate global information that purely local losses might miss; for example, they can group together nodes that are analogously positioned in different parts of the graph (like hubs in different communities). On the downside, computing and enforcing all-pairs distance preservation can be computationally expensive for large graphs (scaling roughly with | V | 2 pairs). In practice, methods often rely on sampling strategies or truncated distance measures to remain tractable. Additionally, if the chosen distance metric d G does not align with the target task (for instance, preserving exact shortest-path distances might be less useful for a classification task that only depends on local structure), this loss can introduce noise. Therefore, distance-preserving objectives are most beneficial when the structural patterns themselves are of interest (as in understanding network roles or anomalies), or as complementary losses to ensure that an embedding captures multi-scale structure alongside other task-specific objectives.
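Given a precomputed matrix of graph distances, the loss compares it pairwise with embedding distances. A dense NumPy sketch (illustrative helper; the all-pairs computation is O(|V|²), so large graphs would sample pairs, as noted above):

```python
import numpy as np

def distance_preservation_loss(D_graph, H):
    """Penalize discrepancies between graph distances D_graph[i, j]
    (e.g., shortest-path lengths) and Euclidean embedding distances."""
    diff = H[:, None, :] - H[None, :, :]
    D_emb = np.sqrt(np.sum(diff ** 2, axis=-1))   # all-pairs ||h_i - h_j||
    return np.sum((D_graph - D_emb) ** 2)
```

An embedding that reproduces the graph metric exactly, such as placing the two endpoints of an edge one unit apart, achieves zero loss.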

10. Conclusions

The choice of an appropriate loss function is fundamental to the success of machine learning applications. In this survey, we have comprehensively reviewed and detailed 52 of the most commonly used loss functions (summarized in Appendix A), covering a diverse range of machine learning tasks, including classification, regression, generative modeling, ranking, energy-based modeling, and relational learning. Each loss function has been contextualized within its domain of application, highlighting both its advantages and limitations to guide informed selection based on specific problem requirements. This survey is intended as a resource for machine learning practitioners at all levels, including undergraduate, graduate, and Ph.D. students, as well as researchers working to deepen their understanding of loss functions or to develop new ones. By structuring each loss function within a novel taxonomy, presented in Section 2, we have created an intuitive framework that categorizes these functions based on their underlying principles, problem domains, and methodological approaches. This taxonomy provides readers with a clear view of the relationships among different techniques, enabling them to quickly identify the most suitable loss function for their task while also facilitating comparison across methods. We hope that this structured view of loss functions will not only streamline the selection process for practitioners but also support the conceptual development of this field.
Looking ahead, current research increasingly treats the loss function not as a fixed component, but as a design variable that can be adapted to the task, the data, and the training objective. Important trends include automated objective design through meta-learning and search-based approaches [189,190,191], adaptive and composite objectives for multi-task and self-supervised learning [192,193], and robustness-oriented objectives that explicitly address noisy labels, class imbalance, calibration, adversarial robustness, and distribution shift [194,195,196,197,198,199,200]. In addition, large language models are increasingly trained with preference-based and alignment-oriented objectives that extend beyond next-token prediction, including RLHF-style pipelines and direct preference optimization methods [118,201,202]. These developments suggest that the study of loss functions is progressively shifting from cataloguing standard objectives toward the broader problem of objective design.
These trends also indicate several natural directions for extending the present survey. A first improvement would be to include a more systematic analysis of the computational cost of different loss functions, including time and memory requirements, optimization overhead, and training stability. A second extension would be to enrich the appendix and section summaries with more detailed comparative tables, including typical frameworks and deployment considerations. Future versions of this survey could also expand the coverage of domain-specific dense-prediction losses, multi-loss weighting strategies, loss design under concept drift and non-stationary data, and automatically discovered or data-adaptive objectives in modern machine learning pipelines [203,204]. In this sense, the present work can also serve as a basis for future iterations tracking how loss function design continues to evolve across machine learning.

Author Contributions

Conceptualization: L.C., A.E., M.L., A.M. and A.R.; Formal analysis: L.C., M.L. and A.R.; Investigation: L.C., M.L. and A.R.; Writing—original draft preparation: L.C., A.E., M.L., A.M. and A.R.; Writing—review and editing: L.C., M.L. and A.R.; Visualization: L.C. and M.L.; Supervision: A.R.; Project administration: A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

All authors were employed by lastminute.com at the time this research was conducted. The authors declare that the research was carried out in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Summary of Loss Functions and Their Properties

Table A1. Summary of loss functions, classification, and mathematical properties.
Loss Function | Task | Taxonomy | Conv. | Diff. | Key Properties/Notes
Mean Bias Error (MBE) | Regression | Error-based | Yes | Yes | Errors may cancel out; used for evaluation.
Mean Absolute Error (MAE) | Regression | Error-based | Yes | No (at 0) | Robust to outliers; median-oriented estimate.
Mean Squared Error (MSE) | Regression | Error-based | Yes | Yes | Sensitive to outliers; MLE for Gaussian noise.
Root Mean Sq. Error (RMSE) | Regression | Error-based | Yes | Yes | Same units as target; widely used.
Huber Loss | Regression | Error-based | Yes | Yes | Robust hybrid of MAE and MSE; requires threshold δ.
Log-cosh Loss | Regression | Error-based | Yes | Yes | Smooth approximation of MAE; no threshold hyperparameter.
RMSLE | Regression | Error-based | No | Yes | Works on log-transformed targets; emphasizes relative errors.
Zero-One Loss | Classification | Margin-based | No | No | Direct classification error; intractable.
Hinge Loss | Classification | Margin-based | Yes | No | Basis for SVMs; max-margin principle.
Perceptron Loss | Classification | Margin-based | Yes | No | No margin enforcement; strictly error-driven.
Smoothed Hinge | Classification | Margin-based | Yes | Yes | Differentiable variant of Hinge.
Quad. Smoothed Hinge | Classification | Margin-based | Yes | Yes | Piece-wise quadratic smoothing.
Modified Huber | Classification | Margin-based | Yes | Yes | Smooth; used for robust classification.
Ramp Loss | Classification | Margin-based | No | No | Capped Hinge; robust to outliers.
Cosine Similarity | Classification | Margin-based | No | Yes | Measures orientation; ignores magnitude.
Cross-Entropy (NLL) | Classification | Probabilistic | Yes | Yes | Standard for classification; MLE-based.
KL Divergence | Classification | Probabilistic | Yes | Yes | Measures information loss; asymmetric.
Focal Loss | Classification | Probabilistic | No | Yes | Focuses on hard examples; handles imbalance.
Dice Loss | Classification | Probabilistic | No | Yes | Overlap metric; for imbalance/segmentation.
Tversky Loss | Classification | Probabilistic | No | Yes | Generalizes Dice; tunable FP/FN balance.
VAE Loss (ELBO) | Generative | Probabilistic | No | Yes | Reconstruction + KL regularization.
Beta-VAE | Generative | Probabilistic | No | Yes | Trade-off for disentangled representations.
VQ-VAE | Generative | Probabilistic | No | No | Discrete latent codes; non-diff. quantization.
Conditional VAE | Generative | Probabilistic | No | Yes | Conditioned generation (e.g., on labels).
Minimax Loss | Generative | Probabilistic | No | Yes | Original GAN loss; saddle-point problem.
Wasserstein Loss | Generative | Probabilistic | No | Yes | Earth-Mover dist.; stable GAN training.
Diffusion (Simple) | Generative | Probabilistic | No | Yes | Noise-prediction MSE; stable training.
Score-based Loss | Generative | Probabilistic | No | Yes | Fits score function (gradient of log-density).
CLIP Guidance | Generative | Margin-based | No | Yes | Aligns text/image embeddings (cosine sim.).
Autoregressive LM | Generative | Probabilistic | Yes | Yes | Standard causal masking (GPT style).
Masked LM (MLM) | Generative | Probabilistic | Yes | Yes | Bidirectional context (BERT style).
Label Smoothing | Generative | Probabilistic | Yes | Yes | Prevents overconfidence; regularization.
Knwl. Distillation (KL) | Generative | Probabilistic | Yes | Yes | Compress teacher model info to student.
DPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reparameterizes reward via LLM policy; requires frozen reference model.
SimPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reference-free; uses length-normalized log-prob as reward with margin.
ORPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reference-free alignment; optimizes odds ratio to penalize rejected outputs.
Pairwise Ranking | Ranking | Margin-based | Yes | No | Contrastive; minimizes pos/neg distance.
Triplet Loss | Ranking | Margin-based | Yes | No | Anchor-Pos-Neg structure; relative distance.
Listwise (Softmax) | Ranking | Probabilistic | No | Yes | Optimizes entire list order (top-1 prob.).
Contrastive (NT-Xent) | Ranking | Margin-based | No | Yes | Self-supervised; uses multiple negatives.
LambdaLoss | Ranking | Margin-based | Yes | Yes | Directly optimizes IR metrics (NDCG).
Energy Loss | EBM | Margin-based | No | No | Direct mapping; prone to collapse.
Generalized Perceptron | EBM | Margin-based | Yes | No | Pushes down correct energy; no margin.
Energy NLL | EBM | Margin-based | Yes | Yes | Log-partition approx.; probabilistic link.
Energy Hinge | EBM | Margin-based | Yes | No | Enforces margin between correct/incorrect.
Energy Log | EBM | Margin-based | Yes | Yes | Soft margin; smooth differentiable hinge.
MCE Loss | EBM | Margin-based | No | Yes | Approx. error count using sigmoid.
Square-square | EBM | Margin-based | Yes | Yes | Quadratically penalizes energy margins.
Square-exponential | EBM | Margin-based | Yes | Yes | Exponential penalty on incorrect energies.
Graph Reconstruction | Relational | Probabilistic | No | Yes | Link prediction; rebuilds adjacency.
Random Walk Loss | Relational | Probabilistic | No | Yes | Skip-gram for graphs (DeepWalk/Node2Vec).
Motif-based Loss | Relational | Error-based | No | Yes | Preserves higher-order substructures.
Graph Contrastive | Relational | Probabilistic | No | Yes | Invariance to graph augmentations.
Graph Laplacian | Relational | Error-based | Yes | Yes | Enforces smoothness among neighbors.
Mutual Info (DGI) | Relational | Probabilistic | No | Yes | Maximize local-global info agreement.
Distance Preservation | Relational | Error-based | No | Yes | Preserves structural/geodesic distances.
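A few of the Table A1 entries can be made concrete with reference implementations. The following is a minimal NumPy sketch (function names are ours, not taken from any particular framework), with comments restating the convexity/differentiability annotations from the table.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: convex, differentiable, sensitive to outliers.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: convex, non-differentiable at 0, robust to outliers.
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Huber loss: quadratic for |e| <= delta, linear beyond.
    # Convex and differentiable; a robust hybrid of MSE and MAE.
    e = np.abs(y_true - y_pred)
    quadratic = 0.5 * e ** 2
    linear = delta * (e - 0.5 * delta)
    return np.mean(np.where(e <= delta, quadratic, linear))

def hinge(margins):
    # Hinge loss on margins m = y * f(x): convex, non-differentiable at m = 1.
    return np.mean(np.maximum(0.0, 1.0 - margins))

# The last target is an outlier: MSE is dominated by it, while MAE and
# Huber remain close to the typical per-sample error.
y = np.array([0.0, 1.0, 2.0, 10.0])
p = np.array([0.1, 0.9, 2.2, 3.0])
print(mse(y, p), mae(y, p), huber(y, p))  # approx. 12.27, 1.85, 1.63
```

Running the snippet shows the outlier-sensitivity ordering discussed in the table: one large residual inflates MSE by an order of magnitude while the robust losses degrade gracefully.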

References

  1. Mitchell, T.; Buchanan, B.; DeJong, G.; Dietterich, T.; Rosenbloom, P.; Waibel, A. Machine Learning. Annu. Rev. Comput. Sci. 1990, 4, 417–433. [Google Scholar] [CrossRef]
  2. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
  3. Mitchell, T.M. Machine Learning; McGraw-Hill Education: New York, NY, USA, 1997; Volume 1. [Google Scholar]
  4. Mahesh, B. Machine learning algorithms—A review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar] [CrossRef]
  5. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  6. Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995. [Google Scholar]
  7. Shally, H. Survey for Mining Biomedical data from HTTP Documents. Int. J. Eng. Sci. Res. Technol. 2013, 2, 165–169. [Google Scholar]
  8. Patil, S.; Patil, K.R.; Patil, C.R.; Patil, S.S. Performance overview of an artificial intelligence in biomedics: A systematic approach. Int. J. Inf. Technol. 2020, 12, 963–973. [Google Scholar] [CrossRef]
  9. Zhang, X.; Yao, L.; Wang, X.; Monaghan, J.; Mcalpine, D.; Zhang, Y. A survey on deep learning based brain computer interface: Recent advances and new frontiers. arXiv 2019, arXiv:1905.04149. [Google Scholar]
  10. Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17. [Google Scholar] [CrossRef]
  11. Chowdhary, K. Natural language processing. In Fundamentals of Artificial Intelligence; Springer: New Delhi, India, 2020; pp. 603–649. [Google Scholar]
  12. Otter, D.W.; Medina, J.R.; Kalita, J.K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 604–624. [Google Scholar] [CrossRef]
  13. Shaukat, K.; Luo, S.; Varadharajan, V.; Hameed, I.A.; Xu, M. A survey on machine learning techniques for cyber security in the last decade. IEEE Access 2020, 8, 222310–222354. [Google Scholar] [CrossRef]
  14. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
  15. Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 2007, 28, 823–870. [Google Scholar] [CrossRef]
  16. Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. Knowledge discovery in databases: An overview. AI Mag. 1992, 13, 57–70. [Google Scholar]
  17. Argall, B.D.; Chernova, S.; Veloso, M.; Browning, B. A survey of robot learning from demonstration. Robot. Auton. Syst. 2009, 57, 469–483. [Google Scholar] [CrossRef]
  18. Perlich, C.; Dalessandro, B.; Raeder, T.; Stitelman, O.; Provost, F. Machine learning for targeted display advertising: Transfer learning in action. Mach. Learn. 2014, 95, 103–127. [Google Scholar] [CrossRef]
  19. Bontempi, G.; Ben Taieb, S.; Borgne, Y.A.L. Machine learning strategies for time series forecasting. In Proceedings of the European Business Intelligence Summer School; Springer: Berlin/Heidelberg, Germany, 2012; pp. 62–77. [Google Scholar]
  20. Müller, K.R.; Krauledat, M.; Dornhege, G.; Curio, G.; Blankertz, B. Machine learning techniques for brain-computer interfaces. Biomed. Tech. 2004, 49, 11–22. [Google Scholar]
  21. Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA); IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  22. Von Neumann, J.; Morgenstern, O. Theory of Games and Economic Behavior; Princeton University Press: Princeton, NJ, USA, 2007. [Google Scholar]
  23. Terven, J.; Cordova-Esparza, D.M.; Romero-Gonzalez, J.A.; Ramirez-Pedraza, A.; Chavez-Urbiola, E.A. A comprehensive survey of loss functions and metrics in deep learning. Artif. Intell. Rev. 2025, 58, 195. [Google Scholar] [CrossRef]
  24. Li, C.; Liu, K.; Liu, S. A Survey of Loss Functions in Deep Learning. Mathematics 2025, 13, 2417. [Google Scholar] [CrossRef]
  25. Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 2020, 9, 187–212. [Google Scholar] [CrossRef]
  26. Wang, J.; Feng, S.; Cheng, Y.; Al-Nabhan, N. Survey on the Loss Function of Deep Learning in Face Recognition. J. Inf. Hiding Priv. Prot. 2021, 3, 29. [Google Scholar] [CrossRef]
  27. Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB); IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar]
  28. Virmaux, A.; Scaman, K. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  29. Gouk, H.; Frank, E.; Pfahringer, B.; Cree, M.J. Regularisation of neural networks by enforcing lipschitz continuity. Mach. Learn. 2021, 110, 393–416. [Google Scholar] [CrossRef]
  30. Pauli, P.; Koch, A.; Berberich, J.; Kohler, P.; Allgöwer, F. Training robust neural networks using Lipschitz bounds. IEEE Control Syst. Lett. 2021, 6, 121–126. [Google Scholar] [CrossRef]
  31. Kiwiel, K.C. Methods of Descent for Nondifferentiable Optimization; Springer: Berlin/Heidelberg, Germany, 2006; Volume 1133. [Google Scholar]
  32. Shor, N.Z. Minimization Methods for Non-Differentiable Functions; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 3. [Google Scholar]
  33. Conn, A.R.; Scheinberg, K.; Vicente, L.N. Introduction to Derivative-Free Optimization; SIAM: Philadelphia, PA, USA, 2009. [Google Scholar]
  34. Rios, L.M.; Sahinidis, N.V. Derivative-free optimization: A review of algorithms and comparison of software implementations. J. Glob. Optim. 2013, 56, 1247–1293. [Google Scholar] [CrossRef]
  35. Liu, S.; Chen, P.Y.; Kailkhura, B.; Zhang, G.; Hero, A.O., III; Varshney, P.K. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Process. Mag. 2020, 37, 43–54. [Google Scholar] [CrossRef]
  36. Chen, P.Y.; Zhang, H.; Sharma, Y.; Yi, J.; Hsieh, C.J. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security; Association for Computing Machinery: New York, NY, USA, 2017; pp. 15–26. [Google Scholar]
  37. Dhurandhar, A.; Pedapati, T.; Balakrishnan, A.; Chen, P.Y.; Shanmugam, K.; Puri, R. Model agnostic contrastive explanations for structured data. arXiv 2019, arXiv:1906.00117. [Google Scholar] [CrossRef]
  38. Efron, B.; Hastie, T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science; Institute of Mathematical Statistics Monographs, Cambridge University Press: Cambridge, UK, 2016. [Google Scholar] [CrossRef]
  39. Kukačka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar] [CrossRef]
  40. Bartlett, P.; Boucheron, S.; Lugosi, G. Model Selection and Error Estimation. Mach. Learn. 2002, 48, 85–113. [Google Scholar] [CrossRef]
  41. Myung, I.J. The Importance of Complexity in Model Selection. J. Math. Psychol. 2000, 44, 190–204. [Google Scholar] [CrossRef]
  42. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
  43. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 2000, 42, 80–86. [Google Scholar] [CrossRef]
  44. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  45. Ng, A.Y. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, USA, 2004, ICML ’04; Association for Computing Machinery: New York, NY, USA, 2004; p. 78. [Google Scholar] [CrossRef]
  46. Bektaş, S.; Şişman, Y. The comparison of L1 and L2-norm minimization methods. Int. J. Phys. Sci. 2010, 5, 1721–1727. [Google Scholar]
  47. Tsuruoka, Y.; Tsujii, J.; Ananiadou, S. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty. In ACL ’09: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; Volume 1, pp. 477–485. [Google Scholar] [CrossRef]
  48. Ullah, F.U.M.; Ullah, A.; Haq, I.U.; Rho, S.; Baik, S.W. Short-term prediction of residential power energy consumption via CNN and multi-layer bi-directional LSTM networks. IEEE Access 2019, 8, 123369–123380. [Google Scholar] [CrossRef]
  49. Krishnaiah, T.; Rao, S.S.; Madhumurthy, K.; Reddy, K. Neural network approach for modelling global solar radiation. J. Appl. Sci. Res. 2007, 3, 1105–1111. [Google Scholar]
  50. Valipour, M.; Banihabib, M.E.; Behbahani, S.M.R. Comparison of the ARMA, ARIMA, and the autoregressive artificial neural network models in forecasting the monthly inflow of Dez dam reservoir. J. Hydrol. 2013, 476, 433–441. [Google Scholar] [CrossRef]
  51. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  52. Li, G.; Shi, J. On comparing three artificial neural networks for wind speed forecasting. Appl. Energy 2010, 87, 2313–2320. [Google Scholar] [CrossRef]
  53. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2013. [Google Scholar]
  54. Huber, P.J. A robust version of the probability ratio test. Ann. Math. Stat. 1965, 36, 1753–1758. [Google Scholar] [CrossRef]
  55. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2016; pp. 1440–1448. [Google Scholar]
  56. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 91–99. [Google Scholar]
  57. Semeniuta, A. A handy approximation of the RMSLE loss function. In Proceedings of the European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2017; pp. 114–125. [Google Scholar]
  58. Semeniuta, A. Handy approximation of the RMSLE loss. arXiv 2017, arXiv:1711.04077. [Google Scholar]
  59. Crammer, K.; Singer, Y. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. J. Mach. Learn. Res. 2001, 2, 265–292. [Google Scholar]
  60. Weston, J.; Watkins, C. Support Vector Machines for Multi-Class Pattern Recognition. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN); D-Facto Public: Bruges, Belgium, 1999; pp. 219–224. [Google Scholar]
  61. Lee, Y.; Lin, Y.; Wahba, G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. J. Am. Stat. Assoc. 2002, 99, 1–37. [Google Scholar]
  62. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, Classification, and Risk Bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef]
  63. Jiang, W. Process consistency for AdaBoost. Ann. Stat. 2004, 32, 13–29. [Google Scholar] [CrossRef]
  64. Lugosi, G.; Vayatis, N. On the Bayes-risk consistency of regularized boosting methods. Ann. Stat. 2003, 32, 30–55. [Google Scholar] [CrossRef]
  65. Mannor, S.; Meir, R.; Zhang, T. Greedy Algorithms for Classification—Consistency, Convergence Rates, and Adaptivity. J. Mach. Learn. Res. 2003, 4, 713–741. [Google Scholar]
  66. Steinwart, I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Theory 2005, 51, 128–142. [Google Scholar] [CrossRef]
  67. Zhang, T. Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization. Ann. Stat. 2001, 32, 56–85. [Google Scholar] [CrossRef]
  68. Gentile, C.; Warmuth, M.K.K. Linear Hinge Loss and Average Margin. In Proceedings of the Advances in Neural Information Processing Systems; Kearns, M., Solla, S., Cohn, D., Eds.; MIT Press: Cambridge, MA, USA, 1998; Volume 11. [Google Scholar]
  69. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory; Association for Computing Machinery: New York, NY, USA, 1992; pp. 144–152. [Google Scholar]
  70. Mathur, A.; Foody, G.M. Multiclass and binary SVM classification: Implications for training and classification users. IEEE Geosci. Remote Sens. Lett. 2008, 5, 241–245. [Google Scholar] [CrossRef]
  71. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386. [Google Scholar] [CrossRef]
  72. Rennie, J.D.M. Smooth Hinge Classification; Update on 2013; Massachusetts Institute of Technology: Cambridge, MA, USA, 2005. [Google Scholar]
  73. Zhang, T. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04; Association for Computing Machinery: New York, NY, USA, 2004; pp. 919–926. [Google Scholar]
  74. Wu, Y.; Liu, Y. Robust Truncated Hinge Loss Support Vector Machines. J. Am. Stat. Assoc. 2007, 102, 974–983. [Google Scholar] [CrossRef]
  75. Harshvardhan, G.; Gourisaria, M.K.; Pandey, M.; Rautaray, S.S. A comprehensive survey and analysis of generative models in machine learning. Comput. Sci. Rev. 2020, 38, 100285. [Google Scholar] [CrossRef]
  76. Myung, I.J. Tutorial on maximum likelihood estimation. J. Math. Psychol. 2003, 47, 90–100. [Google Scholar] [CrossRef]
  77. Joyce, J.M. Kullback-leibler divergence. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 720–722. [Google Scholar]
  78. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  79. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248. [Google Scholar]
  80. Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In Proceedings of the International Workshop on Machine Learning in Medical Imaging; Springer: Berlin/Heidelberg, Germany, 2017; pp. 379–387. [Google Scholar]
  81. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  82. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  83. Van Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2016; pp. 1747–1756. [Google Scholar]
  84. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  85. Rezende, D.J.; Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2015; pp. 1530–1538. [Google Scholar]
  86. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  87. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  88. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2014; pp. 1278–1286. [Google Scholar]
  89. Hou, X.; Shen, L.; Sun, K.; Qiu, G. Deep Feature Consistent Variational Autoencoder. arXiv 2016, arXiv:1610.00291. [Google Scholar] [CrossRef]
  90. Bengio, Y.; Courville, A.C.; Vincent, P. Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives. arXiv 2012, arXiv:1206.5538. [Google Scholar] [CrossRef]
  91. Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv 2014, arXiv:1406.5298. [Google Scholar] [CrossRef]
  92. An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
  93. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  94. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.P.; Glorot, X.; Botvinick, M.M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR (Poster); OpenReview: Amherst, MA, USA, 2017; Volume 3. [Google Scholar]
  95. Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating sentences from a continuous space. arXiv 2015, arXiv:1511.06349. [Google Scholar]
  96. Alemi, A.; Poole, B.; Fischer, I.; Dillon, J.; Saurous, R.A.; Murphy, K. Fixing a broken ELBO. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2018; pp. 159–168. [Google Scholar]
  97. Burgess, C.P.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; Lerchner, A. Understanding disentangling in Beta-VAE. arXiv 2018, arXiv:1804.03599. [Google Scholar]
  98. Dhariwal, P.; Payne, H.; Kim, J.W.; Radford, A.; Sutskever, I. Jukebox: A generative model for music. arXiv 2020, arXiv:2005.00341. [Google Scholar] [CrossRef]
  99. Razavi, A.; van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 14866–14876. [Google Scholar]
  100. Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28. [Google Scholar]
  101. Yan, X.; Yang, J.; Sohn, K.; Lee, H.; Yang, M.H. Attribute2Image: Conditional image generation from visual attributes. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2016; pp. 776–791. [Google Scholar]
  102. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar] [CrossRef]
  103. Stanczuk, J.; Etmann, C.; Kreusser, L.M.; Schönlieb, C.B. Wasserstein GANs work because they fail (to approximate the Wasserstein distance). arXiv 2021, arXiv:2103.01678. [Google Scholar] [CrossRef]
  104. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2015; pp. 2256–2265. [Google Scholar]
  105. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  106. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  107. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar]
  108. Song, Y.; Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 12438–12448. [Google Scholar]
  109. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695. [Google Scholar]
  110. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  111. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  112. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  113. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  114. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  115. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  116. Choi, Y.; Uh, Y.; Yoo, J.; Ha, J.W.; Chun, S. Fine-grained image-to-image transformation towards visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 3626–3635. [Google Scholar]
  117. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
  118. Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
  119. Meng, Y.; Xia, M.; Chen, D. SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv 2024, arXiv:2405.14734. [Google Scholar]
  120. Hong, J.; Lee, N.; Thorne, J. ORPO: Monolithic Preference Optimization without Reference Model. arXiv 2024, arXiv:2403.07691. [Google Scholar] [CrossRef]
  121. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993; Volume 6. [Google Scholar]
  122. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2015; pp. 84–92. [Google Scholar]
  123. Cao, Z.; Qin, T.; Liu, T.Y.; Tsai, M.F.; Li, H. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2007; pp. 129–136. [Google Scholar]
  124. Wang, J.; Song, Y.; Leung, T.; Rosenberg, C.; Wang, J.; Philbin, J.; Chen, B.; Wu, Y. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2014; pp. 1386–1393. [Google Scholar]
  125. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05); IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 539–546. [Google Scholar]
  126. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06); IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  127. Li, Y.; Song, Y.; Luo, J. Improving pairwise ranking for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 3617–3625. [Google Scholar]
  128. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille; JMLR: Norfolk, MA, USA, 2015; Volume 37, pp. 1–8. [Google Scholar]
  129. Chechik, G.; Sharma, V.; Shalit, U.; Bengio, S. Large Scale Online Learning of Image Similarity Through Ranking. J. Mach. Learn. Res. 2010, 11, 1109–1135. [Google Scholar]
  130. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 11. [Google Scholar]
  131. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2020; pp. 1597–1607. [Google Scholar]
  132. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
  133. Burges, C.J. From RankNet to LambdaRank to LambdaMART: An overview. Learning 2010, 11, 81. [Google Scholar]
  134. Burges, C.; Ragno, R.; Le, Q. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2006; Volume 19. [Google Scholar]
  135. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. Predict. Struct. Data 2006, 1, 1–59. [Google Scholar]
  136. Friston, K.; Kilner, J.; Harrison, L. A free energy principle for the brain. J. Physiol.-Paris 2006, 100, 70–87. [Google Scholar] [CrossRef]
  137. Friston, K. The free-energy principle: A rough guide to the brain? Trends Cogn. Sci. 2009, 13, 293–301. [Google Scholar] [CrossRef]
  138. Finn, C.; Christiano, P.; Abbeel, P.; Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv 2016, arXiv:1611.03852. [Google Scholar] [CrossRef]
  139. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2017; pp. 1352–1361. [Google Scholar]
  140. Grathwohl, W.; Wang, K.C.; Jacobsen, J.H.; Duvenaud, D.; Norouzi, M.; Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. arXiv 2019, arXiv:1912.03263. [Google Scholar]
  141. Du, Y.; Lin, T.; Mordatch, I. Model Based Planning with Energy Based Models. arXiv 2019, arXiv:1909.06878. [Google Scholar] [CrossRef]
  142. Osadchy, M.; Miller, M.; LeCun, Y. Synergistic face detection and pose estimation with energy-based models. J. Mach. Learn. Res. 2007, 8, 1197–1215. [Google Scholar]
  143. Du, Y.; Mordatch, I. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  144. Bengio, Y. Gradient-based optimization of hyperparameters. Neural Comput. 2000, 12, 1889–1900. [Google Scholar] [CrossRef]
  145. Teh, Y.W.; Welling, M.; Osindero, S.; Hinton, G.E. Energy-based models for sparse overcomplete representations. J. Mach. Learn. Res. 2003, 4, 1235–1260. [Google Scholar]
  146. Swersky, K.; Ranzato, M.; Buchman, D.; Freitas, N.D.; Marlin, B.M. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11); Omnipress: Madison, WI, USA, 2011; pp. 1201–1208. [Google Scholar]
  147. Zhai, S.; Cheng, Y.; Lu, W.; Zhang, Z. Deep structured energy based models for anomaly detection. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2016; pp. 1100–1109. [Google Scholar]
  148. Kumar, R.; Ozair, S.; Goyal, A.; Courville, A.; Bengio, Y. Maximum entropy generators for energy-based models. arXiv 2019, arXiv:1901.08508. [Google Scholar] [CrossRef]
  149. Song, Y.; Kingma, D.P. How to train your energy-based models. arXiv 2021, arXiv:2101.03288. [Google Scholar]
  150. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
  151. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  152. Collins, M. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002); Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–8. [Google Scholar]
  153. Shapiro, A. Monte Carlo sampling methods. Handbooks Oper. Res. Manag. Sci. 2003, 10, 353–425. [Google Scholar]
  154. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233. [Google Scholar] [CrossRef]
  155. LeCun, Y.; Huang, F.J. Loss functions for discriminative training of energy-based models. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, PMLR; JMLR: Norfolk, MA, USA, 2005; pp. 206–213. [Google Scholar]
  156. Levin, E.; Fleisher, M. Accelerated learning in layered neural networks. Complex Syst. 1988, 2, 3. [Google Scholar]
  157. Bengio, Y.; Ducharme, R.; Vincent, P. A neural probabilistic language model. J. Mach. Learn. Res. 2003, 3, 1137–1155. [Google Scholar]
  158. Bengio, Y.; De Mori, R.; Flammia, G.; Kompe, R. Global optimization of a neural network-hidden Markov model hybrid. IEEE Trans. Neural Networks 1992, 3, 252–259. [Google Scholar] [CrossRef]
  159. Taskar, B.; Guestrin, C.; Koller, D. Max-margin Markov networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 16. [Google Scholar]
  160. Altun, Y.; Tsochantaridis, I.; Hofmann, T. Hidden Markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03); AAAI Press: Menlo Park, CA, USA, 2003; pp. 3–10. [Google Scholar]
  161. Kleinbaum, D.G.; Dietz, K.; Gail, M.; Klein, M.; Klein, M. Logistic Regression; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  162. Juang, B.H.; Hou, W.; Lee, C.H. Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 1997, 5, 257–265. [Google Scholar] [CrossRef]
  163. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
  164. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations (ICLR); OpenReview: Amherst, MA, USA, 2019. [Google Scholar]
  165. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2017. [Google Scholar]
  166. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2013; pp. 2787–2795. [Google Scholar]
  167. Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex embeddings for simple link prediction. In Proceedings of the International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2016; pp. 2071–2080. [Google Scholar]
  168. Pan, S.; Hu, R.; Long, G.; Jiang, J.; Zhang, C.; Yao, L. Adversarially regularized graph autoencoder for graph embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); AAAI Press: Menlo Park, CA, USA, 2018; pp. 2609–2615. [Google Scholar]
  169. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2014; pp. 701–710. [Google Scholar]
  170. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 855–864. [Google Scholar]
  171. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the NeurIPS; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  172. Benson, A.R.; Gleich, D.F.; Leskovec, J. Higher-order organization of complex networks. Science 2016, 353, 163–166. [Google Scholar] [CrossRef]
  173. Lee, J.B.; Rossi, R.A.; Kim, S.; Ahmed, N.K.; Koh, E. Attention Models in Graphs: A Multi-View Approach. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 402–412. [Google Scholar]
  174. Ugander, J.; Backstrom, L.; Marlow, C. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd International Conference on World Wide Web; ACM: New York, NY, USA, 2013; pp. 1307–1318. [Google Scholar]
  175. Chitwood, D.H.; Otoni, W.C. Motif-based analysis of biological networks. Nat. Commun. 2018, 9, 1–12. [Google Scholar]
  176. Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; Alon, U. Network motifs: Simple building blocks of complex networks. Science 2002, 298, 824–827. [Google Scholar] [CrossRef]
  177. Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 2007, 23, e177–e183. [Google Scholar] [CrossRef]
  178. Yin, H.; Li, W.; Cao, Y. Graph neural network and motif-based knowledge graph embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); AAAI Press: Menlo Park, CA, USA, 2018. [Google Scholar]
  179. Wang, Q.; Mao, Z.; Wang, B.; Guo, L. Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 2017, 29, 2724–2743. [Google Scholar] [CrossRef]
  180. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  181. You, Y.; Chen, T.; Wang, X.; Shen, Z.; Huang, Z. Graph contrastive learning with augmentations. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  182. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep Graph Contrastive Representation Learning. In Proceedings of the ICML Workshop on Graph Representation Learning and Beyond (GRL+); JMLR: Norfolk, MA, USA, 2020. [Google Scholar]
  183. Veličković, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. In Proceedings of the ICLR; OpenReview: Amherst, MA, USA, 2019. [Google Scholar]
  184. Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396. [Google Scholar] [CrossRef]
  185. Li, Q.; Han, Z.; Wu, X. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI); AAAI Press: Menlo Park, CA, USA, 2018; pp. 3538–3545. [Google Scholar]
  186. Ribeiro, L.F.; Saverese, P.H.; Figueiredo, D.R. struc2vec: Learning Node Representations from Structural Identity. In Proceedings of the KDD; Association for Computing Machinery: New York, NY, USA, 2017; pp. 385–394. [Google Scholar]
  187. Donnat, C.; Zitnik, M.; Hallac, D.; Leskovec, J. Learning Structural Node Embeddings via Graph Wavelets. In Proceedings of the KDD; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1320–1329. [Google Scholar]
  188. Liu, F.; Li, X.; Zhang, W. Structural anomaly detection in graphs using node embeddings. Knowl.-Based Syst. 2021, 227, 107208. [Google Scholar]
  189. Bechtle, S.; Molchanov, A.; Chebotar, Y.; Grefenstette, E.; Righetti, L.; Sukhatme, G.S. Meta Learning via Learned Loss. arXiv 2019, arXiv:1906.05374. [Google Scholar]
  190. Wu, L.; Tian, F.; Xia, Y.; Fan, Y.; Qin, T.; Jian-Huang, L.; Liu, T.Y. Learning to Teach with Dynamic Loss Functions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
  191. Li, H.; Fu, T.; Dai, J.; Li, H.; Huang, G.; Zhu, X. AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
  192. Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
  193. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In Proceedings of the 35th International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2018. [Google Scholar]
  194. Zhang, Z.; Sabuncu, M.R. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
  195. Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; Bailey, J. Symmetric Cross Entropy for Robust Learning with Noisy Labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  196. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
  197. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR); OpenReview: Amherst, MA, USA, 2018. [Google Scholar]
  198. Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.P.; El Ghaoui, L.; Jordan, M.I. Theoretically Principled Trade-off between Robustness and Accuracy. In Proceedings of the 36th International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2019. [Google Scholar]
  199. Sagawa, S.; Koh, P.W.; Hashimoto, T.B.; Liang, P. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. arXiv 2019, arXiv:1911.08731. [Google Scholar]
  200. Kuhn, D.; Shafiee, S.; Wiesemann, W. Distributionally Robust Optimization. arXiv 2024, arXiv:2411.02549. [Google Scholar] [CrossRef]
  201. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
  202. Ethayarajh, K.; Xu, W.; Muennighoff, N.; Jurafsky, D.; Kiela, D. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv 2024, arXiv:2402.01306. [Google Scholar] [CrossRef]
  203. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 2014, 46, 44. [Google Scholar] [CrossRef]
  204. Hinder, F.; Vaquet, V.; Hammer, B. One or two things we know about concept drift—A survey on monitoring in evolving environments. Part B: Locating and explaining concept drift. Front. Artif. Intell. 2024, 7, 1330258. [Google Scholar] [CrossRef]
Figure 1. A comprehensive, hierarchical taxonomy of the 52 machine learning loss functions reviewed in this survey. The diagram is structured across three logical dimensions to facilitate intuitive navigation and comparison. Task-level categorization (horizontal): At the highest level, the taxonomy divides the loss functions according to their primary machine learning application: regression, classification, ranking, generative modeling, energy-based modeling, and relational learning. Theoretical strategy (sub-groups): Within each major task, losses are further grouped by their underlying mathematical optimization strategy, primarily categorized into error-based, probabilistic, and margin-based approaches. Conceptual evolution (vertical): The vertical layout illustrates the genealogical relationship between functions. Fundamental root loss functions (e.g., mean bias error, zero-one loss) are positioned at the top of their respective branches. More advanced, smoothed, or specialized variants are positioned below, connected by lines to indicate direct mathematical derivations or conceptual extensions, allowing readers to trace how complex losses are built upon simpler foundational concepts.
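The first two dimensions of the taxonomy (task family, then optimization strategy) can be mirrored as a nested mapping. The sketch below is illustrative only: the loss names shown are a small, hypothetical subset of the 52 surveyed, and `losses_for` is not a function from the paper.

```python
# Illustrative sketch of the taxonomy's task-level and strategy-level
# dimensions; the loss names here are a hypothetical subset of the survey.
TAXONOMY = {
    "regression": {
        "error-based": ["mean bias error", "mean squared error", "Huber"],
    },
    "classification": {
        "margin-based": ["zero-one", "hinge"],
        "probabilistic": ["cross-entropy", "focal"],
    },
    "ranking": {
        "margin-based": ["contrastive", "triplet"],
    },
}

def losses_for(task, strategy):
    """Look up the example losses filed under a task/strategy branch."""
    return TAXONOMY.get(task, {}).get(strategy, [])

print(losses_for("classification", "margin-based"))  # ['zero-one', 'hinge']
```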
Figure 2. Schematic overview of the main regression losses and their relationships.
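As a concrete companion to the figure, the relationship between three core regression losses can be sketched in a few lines. This is a minimal illustration, not the paper's code; `delta`, the Huber transition point, is a hypothetical default.

```python
# Minimal sketch: MSE penalizes residuals quadratically, MAE linearly,
# and Huber interpolates between the two at the threshold `delta`.
def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):
    def per_point(r):
        # Quadratic for small residuals, linear beyond delta (outlier-robust).
        return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return sum(per_point(abs(a - b)) for a, b in zip(y, y_hat)) / len(y)
```
Below `delta`, Huber matches the quadratic branch of MSE; above it, Huber grows linearly like MAE, which is the robustness trade-off the figure depicts.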
Figure 3. Overview of the classification losses divided into two major groups: margin-based losses and probabilistic ones.
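The margin-based versus probabilistic split can be illustrated with the two archetypal binary losses, written for labels y in {-1, +1} and a real-valued score. This is a hedged sketch for orientation, not an excerpt from the survey.

```python
import math

def hinge(y, score):
    """Margin-based: exactly zero once the margin y*score exceeds 1."""
    return max(0.0, 1.0 - y * score)

def logistic(y, score):
    """Probabilistic: smooth negative log-likelihood, never exactly zero."""
    return math.log1p(math.exp(-y * score))

# A confidently correct prediction incurs zero hinge loss but a small,
# nonzero logistic loss.
print(hinge(1, 2.0))                  # 0.0
print(round(logistic(1, 2.0), 3))     # 0.127
```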
Figure 4. Overview of the generative losses.
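As one representative member of this family, the per-sample terms of the non-saturating GAN objective can be written out directly. The sketch below is illustrative; `d_real` and `d_fake` stand for discriminator outputs on a real and a generated sample.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Discriminator term: -[log D(x) + log(1 - D(G(z)))]."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator term: -log D(G(z))."""
    return -math.log(d_fake)

# An undecided discriminator (outputs 0.5 everywhere) yields 2*ln(2).
print(round(discriminator_loss(0.5, 0.5), 3))  # 1.386
```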
Figure 5. Overview of the ranking losses.
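Many of the ranking losses in the figure share one pairwise building block: the relevant item's score should exceed the irrelevant item's score by a margin. A minimal sketch (parameter names are illustrative, not from the survey):

```python
def pairwise_margin_loss(score_pos, score_neg, margin=1.0):
    """Zero when score_pos beats score_neg by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

print(pairwise_margin_loss(3.0, 1.5))  # 0.0  (correctly ordered with margin)
print(pairwise_margin_loss(1.0, 1.2))  # ~1.2 (misordered pair is penalized)
```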
Figure 6. Schematic overview of the energy-based losses and their connections.
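In the same spirit, a simple member of this family is the margin-based energy loss discussed in the energy-based learning tutorial [135]: push the energy of the correct answer below that of the most offending incorrect answer by a margin. A hedged sketch, with placeholder energy values:

```python
def energy_margin_loss(e_correct, e_wrong, margin=1.0):
    """Zero once E(correct) is below E(wrong) by at least `margin`."""
    return max(0.0, margin + e_correct - e_wrong)

print(energy_margin_loss(0.2, 2.0))  # 0.0  (energies already well separated)
print(energy_margin_loss(1.5, 1.0))  # 1.5  (correct answer has higher energy)
```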
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ciampiconi, L.; Elwood, A.; Leonardi, M.; Mohamed, A.; Rozza, A. A Survey and Taxonomy of Loss Functions in Machine Learning. AI 2026, 7, 128. https://doi.org/10.3390/ai7040128
