A Survey and Taxonomy of Loss Functions in Machine Learning
Abstract
1. Introduction
2. Definition of the Loss Function Taxonomy
2.1. Optimization Techniques for Loss Functions
2.1.1. Loss Functions and Optimization Methods
- Continuity (CONT): A real-valued function, that is, a function from a subset of the real numbers to the real numbers, can be represented by a graph in the Cartesian plane; such a function is continuous if the graph is a single unbroken curve over the real domain. A more mathematically rigorous definition can be given in terms of limits: a function $f$ of a variable $x$ is continuous at the point $c$ if $\lim_{x \to c} f(x) = f(c)$.
- Differentiability (DIFF): A differentiable function $f$ of a real variable is a function whose derivative exists at each point of its domain. A differentiable function is smooth, in the sense that it is locally well approximated by a linear function, and does not contain any break, angle, or cusp. A continuous function is not necessarily differentiable, but a differentiable function is necessarily continuous. More formally, in one dimension, $f$ is differentiable at $c$ if the following limit exists: $f'(c) = \lim_{h \to 0} \frac{f(c+h) - f(c)}{h}$.
- Lipschitz Continuity (L-CONT): A Lipschitz continuous function is limited in how fast it can change. More formally, there exists a real constant $L \ge 0$ such that, for every pair of points $x_1, x_2$ in the domain, $|f(x_1) - f(x_2)| \le L\,|x_1 - x_2|$, where $L$ is called the Lipschitz constant of the function. To understand the robustness of a model, such as a neural network, some research papers [28,29] have tried to train the underlying model by defining an input–output map with a small Lipschitz constant. The intuition is that if a model is robust, it should not be strongly affected by perturbations in the input, and this can be ensured by having $f$ be $\ell$-Lipschitz where $\ell$ is small [30].
- Convexity (CONVEX): A real-valued function $f$ is convex if the segment between any two points on the graph of the function lies on or above the graph between those two points. More formally, $f$ is convex if, for all $x_1, x_2$ in its domain and all $t \in [0,1]$, $f\big(t x_1 + (1-t) x_2\big) \le t f(x_1) + (1-t) f(x_2)$. Convexity is a key feature since any local minimum of a convex function is also a global minimum. Whenever the second derivative exists, convexity is easy to check, since the Hessian of the function must be positive semi-definite.
- Strict Convexity (S-CONV): A real-valued function is strictly convex if the segment between any two points on the graph of the function lies strictly above the graph between the two points, except at the intersection points between the straight line and the curve. More formally, for all distinct $x_1 \ne x_2$ in the domain and all $t \in (0,1)$, $f\big(t x_1 + (1-t) x_2\big) < t f(x_1) + (1-t) f(x_2)$. Strict convexity implies that, if a minimizer exists, it is unique. If the Hessian exists and is positive definite, this is a sufficient condition for strict convexity. (A numerical check of these properties for two common losses is sketched below.)
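The following Python sketch (ours, not from the survey) illustrates these properties numerically for two common per-residual losses: it estimates a Lipschitz constant from random point pairs and checks the convexity inequality defined above. The function names, sampling range, and tolerance are arbitrary choices.

```python
import numpy as np

# Illustrative sketch: numerically probing the properties defined above
# for two 1-D losses, using randomly sampled points.
rng = np.random.default_rng(0)

def mse(r):  # squared error on a residual r
    return r ** 2

def mae(r):  # absolute error on a residual r
    return np.abs(r)

def estimate_lipschitz(f, n=10_000, scale=5.0):
    """Lower bound on the Lipschitz constant from random pairs of points."""
    x1, x2 = rng.uniform(-scale, scale, n), rng.uniform(-scale, scale, n)
    return np.max(np.abs(f(x1) - f(x2)) / np.abs(x1 - x2))

def convexity_violations(f, n=10_000, scale=5.0):
    """Count violations of f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)."""
    x1, x2 = rng.uniform(-scale, scale, n), rng.uniform(-scale, scale, n)
    t = rng.uniform(0.0, 1.0, n)
    lhs = f(t * x1 + (1 - t) * x2)
    rhs = t * f(x1) + (1 - t) * f(x2)
    return int(np.sum(lhs > rhs + 1e-12))

print("MAE Lipschitz estimate:", estimate_lipschitz(mae))    # ~1 (L-CONT)
print("MSE Lipschitz estimate:", estimate_lipschitz(mse))    # grows with scale (not globally L-CONT)
print("MAE convexity violations:", convexity_violations(mae))  # 0
print("MSE convexity violations:", convexity_violations(mse))  # 0
```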
2.1.2. Relevant Optimization Methods
- Closed-Form Solutions (DIFF, S-CONV): These are obtained by analytically solving the system of equations in which the derivative of the loss function with respect to the parameters is set to zero. To guarantee a unique closed-form solution, the loss function must be differentiable (DIFF) and strictly convex (S-CONV), ensuring a single global minimum. Closed-form solutions are highly efficient and desirable where feasible; however, they are often impractical for complex models or high-dimensional parameter spaces. Therefore, closed-form solutions are primarily used in simpler, linear models or settings where the loss is quadratic or log-likelihood based, as in linear regression or Gaussian MLE problems.
- Gradient Descent (DIFF, CONVEX): Gradient descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. The loss function must be at least differentiable (DIFF) to compute gradients, and if the loss is convex (CONVEX), the local minimum is also the global minimum. Lipschitz continuity (L-CONT) can improve convergence guarantees, as it limits how quickly the function can change, but it is not strictly necessary for gradient descent. For non-differentiable losses, techniques like subgradients or gradient approximations can be employed [31,32]. The algorithm for gradient descent is formalized in Algorithm 1. In each iteration, it calculates the gradient of the loss function with respect to the current parameters $\theta_t$ and updates those parameters by taking a step of size $\eta$ (the learning rate) in the opposite direction of the gradient to descend toward the minimum (a minimal implementation is sketched after this list).
| Algorithm 1 Gradient Descent |
| Input: initial parameters $\theta_0$, number of iterations T, learning rate $\eta$ |
| Output: final learned parameters $\theta_T$ |
| for $t = 0, \dots, T-1$: compute $g_t = \nabla_\theta \mathcal{L}(\theta_t)$; update $\theta_{t+1} \leftarrow \theta_t - \eta\, g_t$ |
- Stochastic Gradient Descent (SGD) (DIFF, CONVEX): SGD [3] is a stochastic approximation of gradient descent that computes the gradient from a randomly selected subset of the data instead of the entire dataset. This reduces the computational cost in high-dimensional problems, such as neural networks, and helps avoid local minima due to the stochastic nature of the updates. Like gradient descent, SGD requires the loss function to be differentiable (DIFF). Convexity (CONVEX) ensures that the global minimum is reachable, but even for non-convex functions, SGD can often find useful minima in practice. Lipschitz continuity (L-CONT) can improve the convergence rate, but is not required.
- Derivative-Free Optimization: In some cases, the derivative of the objective function may not exist or be difficult to compute. Derivative-free optimization methods, such as simulated annealing, genetic algorithms, and particle swarm optimization, can be employed [33,34]. While these methods do not strictly require continuity (CONT), having a continuous function typically improves the stability of the optimization process. Derivative-free methods can handle non-differentiable and non-convex losses, but they may struggle to scale to high-dimensional problems and can be computationally expensive.
- Zeroth-Order Optimization (ZOO): Zeroth-order optimization is a subset of derivative-free optimization that approximates gradients using function evaluations rather than direct computation of derivatives [35]. These methods are useful in black-box scenarios where the gradient is not accessible but can be estimated through perturbations. While continuity (CONT) is not required, it improves the accuracy of gradient approximations and helps achieve better convergence rates. ZOO methods are effective for non-differentiable and non-convex losses, and they have been applied to adversarial attack generation, model-agnostic explanations, and other black-box scenarios [36,37].
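As a minimal illustration of the methods above, the sketch below (ours) applies Algorithm 1 to the mean squared error of a linear model and compares the result with the closed-form (normal-equation) solution; it also uses a finite-difference gradient estimate as a simple stand-in for the zeroth-order approach. The data, step size, and iteration count are arbitrary assumptions.

```python
import numpy as np

# Synthetic linear-regression data: y = X w + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def mse_loss(w):
    return np.mean((X @ w - y) ** 2)

def mse_grad(w):
    return 2.0 * X.T @ (X @ w - y) / len(y)

def gradient_descent(grad, theta0, T=500, lr=0.1):
    """Algorithm 1: repeatedly step against the gradient of the loss."""
    theta = theta0.copy()
    for _ in range(T):
        theta -= lr * grad(theta)
    return theta

def zeroth_order_grad(f, w, eps=1e-5):
    """Central finite-difference gradient estimate (a simple zeroth-order surrogate)."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w_gd = gradient_descent(mse_grad, np.zeros(3))
w_zo = gradient_descent(lambda w: zeroth_order_grad(mse_loss, w), np.zeros(3))
w_closed = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form minimizer of the MSE
print(np.allclose(w_gd, w_closed, atol=1e-3), np.allclose(w_zo, w_closed, atol=1e-3))
```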
2.2. The Proposed Taxonomy
- Error-based;
- Probabilistic;
- Margin-based.
3. Regularization Methods
3.1. Regularization by Loss Augmentation
3.1.1. L2-Norm Regularization
3.1.2. L1-Norm Regularization
3.2. Comparison Between L1- and L2-Norm Regularizations
4. Regression Losses
4.1. Problem Formulation and Notation
4.2. Error-Based Losses for Regression
4.2.1. Mean Bias Error Loss (CONT, DIFF, CONVEX)
4.2.2. Mean Absolute Error Loss (L-CONT, CONVEX)
4.2.3. Mean Squared Error Loss (CONT, DIFF, CONVEX)
Interpretation as Maximum Likelihood Estimation (MLE)
4.2.4. Lasso Regression (L1 Regularization)
4.2.5. Ridge Regression (L2 Regularization)
Interpretation as Maximum A Posteriori Estimation (MAP)
4.2.6. Root Mean Squared Error Loss (CONT, DIFF, CONVEX)
4.2.7. Huber Loss and Smooth L1 Loss (L-CONT, DIFF, CONVEX)
4.2.8. Log-Cosh Loss (L-CONT, DIFF, S-CONV)
4.2.9. Root Mean Squared Logarithmic Error Loss (CONT, DIFF)
5. Classification Losses
5.1. Problem Formulation and Notation
5.2. Margin-Based Loss Functions
5.2.1. Zero-One Loss
5.2.2. Hinge Loss and Perceptron Loss (L-CONT, CONVEX)
5.2.3. Smoothed Hinge Loss (L-CONT, CONVEX, DIFF)
5.2.4. Quadratically Smoothed Hinge Loss (L-CONT, CONVEX, DIFF)
5.2.5. Modified Huber Loss (L-CONT, DIFF, CONVEX)
5.2.6. Ramp Loss (CONT)
5.2.7. Cosine Similarity Loss (CONT, DIFF)
5.3. Probabilistic Loss Functions
5.3.1. Cross-Entropy Loss and Negative Log-Likelihood Loss (CONT, DIFF, CONVEX)
5.3.2. Kullback–Leibler Divergence (CONT, CONVEX, DIFF)
5.3.3. Focal Loss (L-CONT)
5.3.4. Dice Loss (CONT, DIFF)
5.3.5. Tversky Loss (CONT, DIFF)
6. Generative Losses
6.1. Variational Autoencoders (VAEs)
6.1.1. VAE Loss (ELBO) (CONT, DIFF, L-CONT)
6.1.2. Extensions of VAE Losses
- Beta-VAE (CONT, DIFF, L-CONT): The beta-VAE [94] introduces a hyperparameter $\beta$ to weight the KL divergence term. The loss function becomes: $\mathcal{L}_{\beta\text{-VAE}} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$. By adjusting $\beta$, the balance between reconstruction accuracy and latent space regularization can be controlled [97]. This variant retains the continuity, differentiability, and Lipschitz continuity, but it is non-convex due to the interaction between terms. Beta-VAE improves interpretability by encouraging the model to learn more disentangled representations. This is particularly useful in applications where distinct, independent latent factors are beneficial, as in unsupervised learning tasks or when a well-structured latent space is desired [94]. The introduction of $\beta$ can lead to a trade-off where increasing regularization may harm reconstruction quality. Excessively high values of $\beta$ can also cause posterior collapse, where the model ignores the latent variables, resulting in reduced generative performance [95,96,97]. (A sketch of this loss appears after this list.)
- VQ-VAE (CONT): In vector quantized VAEs [93], the latent space is discrete, rather than continuous, and a codebook is used to quantize the latent variables. The key difference in VQ-VAE compared to traditional VAEs is the use of a discrete latent representation, which introduces the following steps. Given an input $x$, the encoder produces a continuous latent variable $z_e(x)$. However, instead of passing $z_e(x)$ directly to the decoder, VQ-VAE performs a vector quantization by mapping $z_e(x)$ to the nearest vector in a learned codebook $\{e_1, \dots, e_K\}$, where each $e_k$ is a learned embedding vector. This process can be written as: $z_q(x) = e_k$, with $k = \arg\min_j \| z_e(x) - e_j \|_2$. The quantized latent variable $z_q(x)$ is then passed to the decoder, which reconstructs the input as $\hat{x}$. The VQ-VAE loss function consists of three terms: $\mathcal{L}_{\text{VQ-VAE}} = \mathcal{L}_{\mathrm{rec}}(x, \hat{x}) + \big\|\,\mathrm{sg}[z_e(x)] - e_k\,\big\|_2^2 + \beta\,\big\|\,z_e(x) - \mathrm{sg}[e_k]\,\big\|_2^2$, where:
  - $\mathcal{L}_{\mathrm{rec}}(x, \hat{x})$ is the reconstruction loss (typically mean squared error or binary cross-entropy);
  - the second term (the codebook loss) ensures that the codebook vector $e_k$ is close to the encoder output $z_e(x)$;
  - the third term (the commitment loss) encourages the encoder to commit to a particular codebook vector, where $\mathrm{sg}[\cdot]$ denotes the stop gradient operator, ensuring gradients only flow through the appropriate part of the network;
  - $\beta$ is a hyperparameter controlling the weight of the commitment loss.
While VQ-VAE remains continuous (CONT) overall, the discrete quantization step introduces non-differentiability in the loss function, as the mapping from encoder outputs to codebook vectors is non-differentiable. The key advantage of VQ-VAE is that the use of a discrete latent space can produce sharper and higher-quality generated samples, addressing some of the issues with blurry outputs often observed in continuous VAEs [93]. VQ-VAE is particularly beneficial in tasks where the data have inherently discrete characteristics, such as in audio and image generation [98,99].
- Conditional VAE (CVAE) (CONT, DIFF, L-CONT): Conditional VAEs [100] condition both the encoder and decoder on auxiliary information $c$ (e.g., class labels), modifying the ELBO to learn the conditional distribution $p_\theta(x \mid c)$. The CVAE loss is: $\mathcal{L}_{\mathrm{CVAE}} = -\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)$. The CVAE loss is continuous, differentiable, and Lipschitz continuous, and the KL divergence remains convex, but the overall loss is still non-convex due to the interaction between terms. The advantage of using CVAE is its ability to generate conditional outputs based on specific attributes or labels. This makes CVAE particularly useful in scenarios where controlled generation is required, such as in image generation conditioned on class labels or text generation based on input attributes [91,101]. By incorporating auxiliary information, CVAEs allow for more structured and interpretable latent spaces, improving the ability to generate targeted samples in a variety of applications [91].
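As a concrete illustration of the beta-VAE objective above (not the authors' code), the following PyTorch sketch computes the reconstruction term plus the β-weighted KL divergence for a diagonal Gaussian posterior and a standard normal prior; the tensor names and shapes are placeholders, and β = 1 recovers the plain VAE loss.

```python
import torch
import torch.nn.functional as F

# Beta-VAE objective for q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I).
# `x_hat`, `x`, `mu`, `logvar` are assumed to come from an encoder/decoder pair.
def beta_vae_loss(x_hat, x, mu, logvar, beta=4.0):
    # Reconstruction term: here a mean squared error (binary cross-entropy is
    # another common choice, depending on the data).
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # KL(q(z|x) || N(0, I)) has a closed form for diagonal Gaussians.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # beta = 1 gives the standard (negative) ELBO; beta > 1 strengthens the
    # latent-space regularization and the pressure toward disentanglement.
    return recon + beta * kl

# Toy usage with random tensors standing in for model outputs.
x = torch.rand(8, 32)
x_hat = torch.rand(8, 32, requires_grad=True)
mu = torch.zeros(8, 4, requires_grad=True)
logvar = torch.zeros(8, 4, requires_grad=True)
loss = beta_vae_loss(x_hat, x, mu, logvar, beta=4.0)
loss.backward()   # the loss is differentiable end-to-end
```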
6.2. Generative Adversarial Networks
- The generator, referred to as $G$, generates data starting from random noise and tries to replicate the real data distribution.
- The discriminator, referred to as $D$, learns to distinguish the generator’s fake data from the real data. Its feedback penalizes the generator for producing samples that are distinguishable from real data.
6.2.1. Minimax Loss
- $D(x)$ is the discriminator’s estimate of the probability that the real data instance $x$ is real;
- $\mathbb{E}_{x \sim p_{\text{data}}}$ is the expected value over all real data instances;
- $G(z)$ is the generator’s output when given noise $z$;
- $D(G(z))$ is the discriminator’s estimate of the probability that a fake instance is real;
- $\mathbb{E}_{z \sim p_z}$ is the expected value over all random inputs to the generator (in effect, the expected value over all generated fake instances $G(z)$). A sketch of the resulting losses is given below.
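For reference, the minimax objective these terms belong to is $\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. The sketch below (ours) shows how the two sides are typically computed from discriminator logits; the non-saturating generator loss used here is a common practical variant rather than the literal minimax form, and the variable names are placeholders.

```python
import torch
import torch.nn.functional as F

# `d_real_logits` and `d_fake_logits` are assumed to be D's raw outputs on real
# samples x and on generated samples G(z), respectively.
def discriminator_loss(d_real_logits, d_fake_logits):
    # D maximizes log D(x) + log(1 - D(G(z))), i.e., minimizes this BCE.
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    # Non-saturating variant: G maximizes log D(G(z)).
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

# Toy usage with random logits standing in for discriminator outputs.
d_real = torch.randn(16, 1)
d_fake = torch.randn(16, 1)
print(discriminator_loss(d_real, d_fake).item(), generator_loss(d_fake).item())
```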
6.2.2. Wasserstein Loss
6.3. Diffusion Models
- Forward Diffusion Process
- Reverse Diffusion Process
6.3.1. Diffusion Model Loss Function (CONT, DIFF)
6.3.2. Other MSE-Based Losses in Diffusion Models (CONT, DIFF)
Perceptual Loss
Latent Space Regularization
6.3.3. Score-Based Generative Model Loss (CONT, DIFF)
6.3.4. Cosine Similarity in Multimodal Context (CONT, DIFF)
6.4. Transformers and LLM Loss Functions
6.4.1. Probabilistic Losses in LLMs
Autoregressive Language Modeling Loss (CONT, DIFF, CONVEX)
Masked Language Modeling (MLM) Loss (CONT, DIFF, CONVEX)
Label Smoothing Loss (CONT, DIFF, CONVEX)
KL Divergence Loss for Knowledge Distillation (CONT, DIFF)
6.4.2. Ranking Losses in LLM
6.4.3. Alignment and Preference Optimization Losses
Direct Preference Optimization (DPO) Loss (CONT, DIFF)
Simple Preference Optimization (SimPO) Loss (CONT, DIFF)
7. Ranking Losses
7.1. Pairwise Ranking Loss
7.2. Triplet Loss
- Easy Triplets: $d(a, n) > d(a, p) + m$, where $d(a, p)$ and $d(a, n)$ denote the anchor–positive and anchor–negative distances and $m$ the margin. Here, the distance between the negative sample and the anchor sample is already large enough. The model parameters are not updated, and the loss is 0.
- Hard Triplets: $d(a, n) < d(a, p)$. In this case, the negative sample is closer to the anchor than the positive sample. The loss is positive (and greater than the margin $m$), leading to updates in the model’s parameters.
- Semi-Hard Triplets: $d(a, p) < d(a, n) < d(a, p) + m$. Here, the negative sample is farther away from the anchor than the positive sample, but the margin constraint is not yet satisfied. The loss remains positive (and smaller than $m$), prompting parameter updates. (A sketch of this categorization is given below.)
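The following sketch (ours) computes a per-triplet loss and the easy/hard/semi-hard categorization described above, using Euclidean distances and an arbitrary margin; the embedding vectors are random placeholders.

```python
import numpy as np

# Per-triplet loss max(d(a,p) - d(a,n) + m, 0) and the category of the triplet,
# for embedding vectors a (anchor), p (positive), n (negative) and margin m.
def triplet_loss(a, p, n, m=0.2):
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(d_ap - d_an + m, 0.0), d_ap, d_an

def triplet_category(d_ap, d_an, m=0.2):
    if d_an >= d_ap + m:
        return "easy"       # loss = 0, no update
    if d_an < d_ap:
        return "hard"       # loss > m
    return "semi-hard"      # 0 < loss < m

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 8))
loss, d_ap, d_an = triplet_loss(a, p, n)
print(loss, triplet_category(d_ap, d_an))
```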
7.3. Listwise Ranking Loss (CONT, DIFF)
7.4. Contrastive Ranking Loss: NT-Xent (CONT, DIFF, L-CONT)
LambdaLoss (CONT, DIFF)
- $|\Delta \mathrm{NDCG}_{ij}|$ represents the change in NDCG if the documents $i$ and $j$ are swapped in the ranking.
- The remaining factor is the pairwise ranking loss on the scores of documents $i$ and $j$, typically modeled as a logistic function or a hinge loss. (A sketch of the resulting weighting is given below.)
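The sketch below (ours) illustrates this LambdaLoss-style weighting in a simplified form: it computes the |ΔNDCG| obtained by swapping two documents in the current ranking and multiplies it by a logistic pairwise loss on their scores. The graded relevances, ranking, and scores are toy values, and the helper names are our own.

```python
import numpy as np

def dcg_gains(relevances):
    return 2.0 ** relevances - 1.0

def delta_ndcg(relevances, ranks, i, j):
    """|Change in NDCG| if the documents at positions ranks[i] and ranks[j] are swapped."""
    discounts = 1.0 / np.log2(1.0 + ranks)                 # ranks are 1-based positions
    ideal = np.sum(np.sort(dcg_gains(relevances))[::-1]
                   / np.log2(2.0 + np.arange(len(relevances))))
    gain_diff = dcg_gains(relevances[i]) - dcg_gains(relevances[j])
    disc_diff = discounts[i] - discounts[j]
    return abs(gain_diff * disc_diff) / ideal

def lambda_pair_loss(score_i, score_j, weight):
    # Logistic pairwise loss weighted by |ΔNDCG| (document i assumed more relevant).
    return weight * np.log1p(np.exp(-(score_i - score_j)))

relevances = np.array([3.0, 2.0, 0.0, 1.0])   # graded relevance labels
ranks = np.array([1, 2, 3, 4])                # current ranking positions
w = delta_ndcg(relevances, ranks, 0, 2)
print(w, lambda_pair_loss(score_i=1.2, score_j=0.4, weight=w))
```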
8. Energy-Based Losses
8.1. Training EBM
- $L\big(Y^i, E(W, \mathcal{Y}, X^i)\big)$ is the per-sample loss;
- $Y^i$ is the desired output;
- $E(W, \mathcal{Y}, X^i)$ is the energy surface for a given $X^i$, as $Y$ varies.
8.2. Loss Functions for EBMs
8.2.1. Energy Loss
8.2.2. Generalized Perceptron Loss (L-CONT, CONVEX)
8.2.3. Negative Log-Likelihood Loss (CONT, DIFF, CONVEX)
8.2.4. Generalized Margin Loss
Hinge Loss (L-CONT, CONVEX)
Log Loss (DIFF, CONT, CONVEX)
Minimum Classification Error Loss (CONT, DIFF, CONVEX)
Square-Square Loss (CONT, CONVEX)
Square-Exponential Loss (CONT, DIFF, CONVEX)
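To make the loss forms named in the preceding subsections concrete, the following sketch (ours) writes several of them as functions of the energy of the correct answer and of the most offending incorrect answer, following the general shapes described in the energy-based learning literature cited above; the margin m and coefficient gamma are illustrative hyperparameters.

```python
import numpy as np

# e_correct: energy of the desired answer; e_incorrect: energy of the most
# offending incorrect answer; e_min_over_answers: minimum energy over all answers.
def energy_loss(e_correct):
    return e_correct                                   # push down the correct energy only

def perceptron_loss(e_correct, e_min_over_answers):
    return e_correct - e_min_over_answers              # no margin enforcement

def hinge_loss(e_correct, e_incorrect, m=1.0):
    return max(0.0, m + e_correct - e_incorrect)       # enforce a margin m

def log_loss(e_correct, e_incorrect):
    return np.log1p(np.exp(e_correct - e_incorrect))   # smooth ("soft hinge") margin

def square_square_loss(e_correct, e_incorrect, m=1.0):
    return e_correct ** 2 + max(0.0, m - e_incorrect) ** 2

def square_exponential_loss(e_correct, e_incorrect, gamma=1.0):
    return e_correct ** 2 + gamma * np.exp(-e_incorrect)

# Toy usage: the correct answer has low energy, the most offending incorrect one is higher.
e_c, e_i = 0.3, 1.5
print(hinge_loss(e_c, e_i), log_loss(e_c, e_i), square_square_loss(e_c, e_i))
```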
9. Relational Learning
9.1. Graph Reconstruction Loss (CONT, DIFF, L-CONT)
9.2. Random Walk-Based Loss (CONT, DIFF, L-CONT)
9.3. Motif-Based Loss (CONT, DIFF, L-CONT)
9.4. Graph Contrastive Loss (CONT, DIFF, L-CONT)
9.5. Graph Laplacian (Smoothness) Loss (CONT, DIFF, L-CONT)
9.6. Mutual Information Maximization Loss (CONT, DIFF)
9.7. Distance/Structural Preservation Loss (CONT, DIFF, L-CONT)
10. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Summary of Loss Functions and Their Properties
| Loss Function | Task | Taxonomy | Conv. | Diff. | Key Properties/Notes |
|---|---|---|---|---|---|
| Mean Bias Error (MBE) | Regression | Error-based | Yes | Yes | Errors may cancel out; used for evaluation. |
| Mean Absolute Error (MAE) | Regression | Error-based | Yes | No (at 0) | Robust to outliers; median-oriented estimate. |
| Mean Squared Error (MSE) | Regression | Error-based | Yes | Yes | Sensitive to outliers; MLE for Gaussian noise. |
| Root Mean Sq. Error (RMSE) | Regression | Error-based | Yes | Yes | Same units as target; widely used. |
| Huber Loss | Regression | Error-based | Yes | Yes | Robust hybrid of MAE and MSE; requires threshold δ. |
| Log-cosh Loss | Regression | Error-based | Yes | Yes | Smooth approximation of MAE; no threshold hyperparameter. |
| RMSLE | Regression | Error-based | No | Yes | Works on log-transformed targets; emphasizes relative errors. |
| Zero-One Loss | Classification | Margin-based | No | No | Direct classification error; intractable. |
| Hinge Loss | Classification | Margin-based | Yes | No | Basis for SVMs; max-margin principle. |
| Perceptron Loss | Classification | Margin-based | Yes | No | No margin enforcement; strictly error-driven. |
| Smoothed Hinge | Classification | Margin-based | Yes | Yes | Differentiable variant of Hinge. |
| Quad. Smoothed Hinge | Classification | Margin-based | Yes | Yes | Piece-wise quadratic smoothing. |
| Modified Huber | Classification | Margin-based | Yes | Yes | Smooth; used for robust classification. |
| Ramp Loss | Classification | Margin-based | No | No | Capped Hinge; robust to outliers. |
| Cosine Similarity | Classification | Margin-based | No | Yes | Measures orientation; ignores magnitude. |
| Cross-Entropy (NLL) | Classification | Probabilistic | Yes | Yes | Standard for classification; MLE-based. |
| KL Divergence | Classification | Probabilistic | Yes | Yes | Measures information loss; asymmetric. |
| Focal Loss | Classification | Probabilistic | No | Yes | Focuses on hard examples; handles imbalance. |
| Dice Loss | Classification | Probabilistic | No | Yes | Overlap metric; for imbalance/segmentation. |
| Tversky Loss | Classification | Probabilistic | No | Yes | Generalizes Dice; tunable FP/FN balance. |
| VAE Loss (ELBO) | Generative | Probabilistic | No | Yes | Reconstruction + KL regularization. |
| Beta-VAE | Generative | Probabilistic | No | Yes | Trade-off for disentangled representations. |
| VQ-VAE | Generative | Probabilistic | No | No | Discrete latent codes; non-diff quantization. |
| Conditional VAE | Generative | Probabilistic | No | Yes | Conditioned generation (e.g., on labels). |
| Minimax Loss | Generative | Probabilistic | No | Yes | Original GAN loss; saddle point problem. |
| Wasserstein Loss | Generative | Probabilistic | No | Yes | Earth-Mover dist.; stable GAN training. |
| Diffusion (Simple) | Generative | Probabilistic | No | Yes | Noise prediction MSE; stable training. |
| Score-based Loss | Generative | Probabilistic | No | Yes | Fits score function (gradient of log-density). |
| CLIP Guidance | Generative | Margin-based | No | Yes | Aligns text/image embeddings (cosine sim). |
| Autoregressive LM | Generative | Probabilistic | Yes | Yes | Standard causal masking (GPT style). |
| Masked LM (MLM) | Generative | Probabilistic | Yes | Yes | Bidirectional context (BERT style). |
| Label Smoothing | Generative | Probabilistic | Yes | Yes | Prevents overconfidence; regularization. |
| Knwl. Distillation (KL) | Generative | Probabilistic | Yes | Yes | Compress teacher model info to student. |
| DPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reparameterizes reward via LLM policy; requires frozen reference model. |
| SimPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reference-free; uses length-normalized log-prob as reward with margin. |
| ORPO Loss | NLP (LLM) | Probabilistic | No | Yes | Reference-free alignment; optimizes odds ratio to penalize rejected outputs. |
| Pairwise Ranking | Ranking | Margin-based | Yes | No | Contrastive; minimizes pos/neg distance. |
| Triplet Loss | Ranking | Margin-based | Yes | No | Anchor-Pos-Neg structure; relative distance. |
| Listwise (Softmax) | Ranking | Probabilistic | No | Yes | Optimizes entire list order (top-1 prob). |
| Contrastive (NT-Xent) | Ranking | Margin-based | No | Yes | Self-supervised; uses multiple negatives. |
| LambdaLoss | Ranking | Margin-based | Yes | Yes | Directly optimizes IR metrics (NDCG). |
| Energy Loss | EBM | Margin-based | No | No | Direct mapping; prone to collapse. |
| Generalized Perceptron | EBM | Margin-based | Yes | No | Pushes down correct energy; no margin. |
| Energy NLL | EBM | Margin-based | Yes | Yes | Log-partition approx; probabilistic link. |
| Energy Hinge | EBM | Margin-based | Yes | No | Enforces margin between correct/incorrect. |
| Energy Log | EBM | Margin-based | Yes | Yes | Soft margin; smooth differentiable hinge. |
| MCE Loss | EBM | Margin-based | No | Yes | Approx. error count using sigmoid. |
| Square-square | EBM | Margin-based | Yes | Yes | Quadratically penalizes energy margins. |
| Square-exponential | EBM | Margin-based | Yes | Yes | Exponential penalty on incorrect energies. |
| Graph Reconstruction | Relational | Probabilistic | No | Yes | Link prediction; rebuilds adjacency. |
| Random Walk Loss | Relational | Probabilistic | No | Yes | Skip-gram for graphs (DeepWalk/Node2Vec). |
| Motif-based Loss | Relational | Error-based | No | Yes | Preserves higher-order substructures. |
| Graph Contrastive | Relational | Probabilistic | No | Yes | Invariance to graph augmentations. |
| Graph Laplacian | Relational | Error-based | Yes | Yes | Enforces smoothness among neighbors. |
| Mutual Info (DGI) | Relational | Probabilistic | No | Yes | Maximize local-global info agreement. |
| Distance Preservation | Relational | Error-based | No | Yes | Preserves structural/geodesic distances. |
References
- Mitchell, T.; Buchanan, B.; DeJong, G.; Dietterich, T.; Rosenbloom, P.; Waibel, A. Machine Learning. Annu. Rev. Comput. Sci. 1990, 4, 417–433. [Google Scholar] [CrossRef]
- Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
- Mitchell, T.M. Machine Learning; McGraw-Hill Education: New York, NY, USA, 1997; Volume 1. [Google Scholar]
- Mahesh, B. Machine learning algorithms—A review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar] [CrossRef]
- Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
- Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995. [Google Scholar]
- Shally, H. Survey for Mining Biomedical data from HTTP Documents. Int. J. Eng. Sci. Res. Technol. 2013, 2, 165–169. [Google Scholar]
- Patil, S.; Patil, K.R.; Patil, C.R.; Patil, S.S. Performance overview of an artificial intelligence in biomedics: A systematic approach. Int. J. Inf. Technol. 2020, 12, 963–973. [Google Scholar] [CrossRef]
- Zhang, X.; Yao, L.; Wang, X.; Monaghan, J.; Mcalpine, D.; Zhang, Y. A survey on deep learning based brain computer interface: Recent advances and new frontiers. arXiv 2019, arXiv:1905.04149. [Google Scholar]
- Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17. [Google Scholar] [CrossRef]
- Chowdhary, K. Natural language processing. In Fundamentals of Artificial Intelligence; Springer: New Delhi, India, 2020; pp. 603–649. [Google Scholar]
- Otter, D.W.; Medina, J.R.; Kalita, J.K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 604–624. [Google Scholar] [CrossRef]
- Shaukat, K.; Luo, S.; Varadharajan, V.; Hameed, I.A.; Xu, M. A survey on machine learning techniques for cyber security in the last decade. IEEE Access 2020, 8, 222310–222354. [Google Scholar] [CrossRef]
- Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
- Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 2007, 28, 823–870. [Google Scholar] [CrossRef]
- Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. Knowledge discovery in databases: An overview. AI Mag. 1992, 13, 57–70. [Google Scholar]
- Argall, B.D.; Chernova, S.; Veloso, M.; Browning, B. A survey of robot learning from demonstration. Robot. Auton. Syst. 2009, 57, 469–483. [Google Scholar] [CrossRef]
- Perlich, C.; Dalessandro, B.; Raeder, T.; Stitelman, O.; Provost, F. Machine learning for targeted display advertising: Transfer learning in action. Mach. Learn. 2014, 95, 103–127. [Google Scholar] [CrossRef]
- Bontempi, G.; Ben Taieb, S.; Borgne, Y.A.L. Machine learning strategies for time series forecasting. In Proceedings of the European Business Intelligence Summer School; Springer: Berlin/Heidelberg, Germany, 2012; pp. 62–77. [Google Scholar]
- Müller, K.R.; Krauledat, M.; Dornhege, G.; Curio, G.; Blankertz, B. Machine learning techniques for brain-computer interfaces. Biomed. Tech. 2004, 49, 11–22. [Google Scholar]
- Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA); IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
- Von Neumann, J.; Morgenstern, O. Theory of Games and Economic Behavior; Princeton University Press: Princeton, NJ, USA, 2007. [Google Scholar]
- Terven, J.; Cordova-Esparza, D.M.; Romero-Gonzalez, J.A.; Ramirez-Pedraza, A.; Chavez-Urbiola, E.A. A comprehensive survey of loss functions and metrics in deep learning. Artif. Intell. Rev. 2025, 58, 195. [Google Scholar] [CrossRef]
- Li, C.; Liu, K.; Liu, S. A Survey of Loss Functions in Deep Learning. Mathematics 2025, 13, 2417. [Google Scholar] [CrossRef]
- Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 2020, 9, 187–212. [Google Scholar] [CrossRef]
- Wang, J.; Feng, S.; Cheng, Y.; Al-Nabhan, N. Survey on the Loss Function of Deep Learning in Face Recognition. J. Inf. Hiding Priv. Prot. 2021, 3, 29. [Google Scholar] [CrossRef]
- Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB); IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar]
- Virmaux, A.; Scaman, K. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
- Gouk, H.; Frank, E.; Pfahringer, B.; Cree, M.J. Regularisation of neural networks by enforcing lipschitz continuity. Mach. Learn. 2021, 110, 393–416. [Google Scholar] [CrossRef]
- Pauli, P.; Koch, A.; Berberich, J.; Kohler, P.; Allgöwer, F. Training robust neural networks using Lipschitz bounds. IEEE Control Syst. Lett. 2021, 6, 121–126. [Google Scholar] [CrossRef]
- Kiwiel, K.C. Methods of Descent for Nondifferentiable Optimization; Springer: Berlin/Heidelberg, Germany, 2006; Volume 1133. [Google Scholar]
- Shor, N.Z. Minimization Methods for Non-Differentiable Functions; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 3. [Google Scholar]
- Conn, A.R.; Scheinberg, K.; Vicente, L.N. Introduction to Derivative-Free Optimization; SIAM: Philadelphia, PA, USA, 2009. [Google Scholar]
- Rios, L.M.; Sahinidis, N.V. Derivative-free optimization: A review of algorithms and comparison of software implementations. J. Glob. Optim. 2013, 56, 1247–1293. [Google Scholar] [CrossRef]
- Liu, S.; Chen, P.Y.; Kailkhura, B.; Zhang, G.; Hero, A.O., III; Varshney, P.K. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Process. Mag. 2020, 37, 43–54. [Google Scholar] [CrossRef]
- Chen, P.Y.; Zhang, H.; Sharma, Y.; Yi, J.; Hsieh, C.J. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security; Association for Computing Machinery: New York, NY, USA, 2017; pp. 15–26. [Google Scholar]
- Dhurandhar, A.; Pedapati, T.; Balakrishnan, A.; Chen, P.Y.; Shanmugam, K.; Puri, R. Model agnostic contrastive explanations for structured data. arXiv 2019, arXiv:1906.00117. [Google Scholar] [CrossRef]
- Efron, B.; Hastie, T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science; Institute of Mathematical Statistics Monographs, Cambridge University Press: Cambridge, UK, 2016. [Google Scholar] [CrossRef]
- Kukačka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar] [CrossRef]
- Bartlett, P.; Boucheron, S.; Lugosi, G. Model Selection and Error Estimation. Mach. Learn. 2002, 48, 85–113. [Google Scholar] [CrossRef]
- Myung, I.J. The Importance of Complexity in Model Selection. J. Math. Psychol. 2000, 44, 190–204. [Google Scholar] [CrossRef]
- Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 2000, 42, 80–86. [Google Scholar] [CrossRef]
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Ng, A.Y. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, USA, 2004, ICML ’04; Association for Computing Machinery: New York, NY, USA, 2004; p. 78. [Google Scholar] [CrossRef]
- Bektaş, S.; Şişman, Y. The comparison of L1 and L2-norm minimization methods. Int. J. Phys. Sci. 2010, 5, 1721–1727. [Google Scholar]
- Tsuruoka, Y.; Tsujii, J.; Ananiadou, S. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty. In ACL ’09: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; Volume 1, pp. 477–485. [Google Scholar] [CrossRef]
- Ullah, F.U.M.; Ullah, A.; Haq, I.U.; Rho, S.; Baik, S.W. Short-term prediction of residential power energy consumption via CNN and multi-layer bi-directional LSTM networks. IEEE Access 2019, 8, 123369–123380. [Google Scholar] [CrossRef]
- Krishnaiah, T.; Rao, S.S.; Madhumurthy, K.; Reddy, K. Neural network approach for modelling global solar radiation. J. Appl. Sci. Res. 2007, 3, 1105–1111. [Google Scholar]
- Valipour, M.; Banihabib, M.E.; Behbahani, S.M.R. Comparison of the ARMA, ARIMA, and the autoregressive artificial neural network models in forecasting the monthly inflow of Dez dam reservoir. J. Hydrol. 2013, 476, 433–441. [Google Scholar] [CrossRef]
- Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
- Li, G.; Shi, J. On comparing three artificial neural networks for wind speed forecasting. Appl. Energy 2010, 87, 2313–2320. [Google Scholar] [CrossRef]
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2013. [Google Scholar]
- Huber, P.J. A robust version of the probability ratio test. Ann. Math. Stat. 1965, 36, 1753–1758. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2016; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 91–99. [Google Scholar]
- Semeniuta, A. A handy approximation of the RMSLE loss function. In Proceedings of the European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2017; pp. 114–125. [Google Scholar]
- Semeniuta, A. Handy approximation of the RMSLE loss. arXiv 2017, arXiv:1711.04077. [Google Scholar]
- Crammer, K.; Singer, Y. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. J. Mach. Learn. Res. 2001, 2, 265–292. [Google Scholar]
- Weston, J.; Watkins, C. Support Vector Machines for Multi-Class Pattern Recognition. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN); D-Facto Public: Bruges, Belgium, 1999; pp. 219–224. [Google Scholar]
- Lee, Y.; Lin, Y.; Wahba, G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. J. Am. Stat. Assoc. 2002, 99, 1–37. [Google Scholar]
- Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, Classification, and Risk Bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef]
- Jiang, W. Process consistency for AdaBoost. Ann. Stat. 2004, 32, 13–29. [Google Scholar] [CrossRef]
- Lugosi, G.; Vayatis, N. On the Bayes-risk consistency of regularized boosting methods. Ann. Stat. 2003, 32, 30–55. [Google Scholar] [CrossRef]
- Mannor, S.; Meir, R.; Zhang, T. Greedy Algorithms for Classification—Consistency, Convergence Rates, and Adaptivity. J. Mach. Learn. Res. 2003, 4, 713–741. [Google Scholar]
- Steinwart, I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Theory 2005, 51, 128–142. [Google Scholar] [CrossRef]
- Zhang, T. Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization. Ann. Stat. 2001, 32, 56–85. [Google Scholar] [CrossRef]
- Gentile, C.; Warmuth, M.K.K. Linear Hinge Loss and Average Margin. In Proceedings of the Advances in Neural Information Processing Systems; Kearns, M., Solla, S., Cohn, D., Eds.; MIT Press: Cambridge, MA, USA, 1998; Volume 11. [Google Scholar]
- Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory; Association for Computing Machinery: New York, NY, USA, 1992; pp. 144–152. [Google Scholar]
- Mathur, A.; Foody, G.M. Multiclass and binary SVM classification: Implications for training and classification users. IEEE Geosci. Remote Sens. Lett. 2008, 5, 241–245. [Google Scholar] [CrossRef]
- Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386. [Google Scholar] [CrossRef]
- Rennie, J.D.M. Smooth Hinge Classification; Update on 2013; Massachusetts Institute of Technology: Cambridge, MA, USA, 2005. [Google Scholar]
- Zhang, T. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04; Association for Computing Machinery: New York, NY, USA, 2004; pp. 919–926. [Google Scholar]
- Wu, Y.; Liu, Y. Robust Truncated Hinge Loss Support Vector Machines. J. Am. Stat. Assoc. 2007, 102, 974–983. [Google Scholar] [CrossRef]
- Harshvardhan, G.; Gourisaria, M.K.; Pandey, M.; Rautaray, S.S. A comprehensive survey and analysis of generative models in machine learning. Comput. Sci. Rev. 2020, 38, 100285. [Google Scholar] [CrossRef]
- Myung, I.J. Tutorial on maximum likelihood estimation. J. Math. Psychol. 2003, 47, 90–100. [Google Scholar] [CrossRef]
- Joyce, J.M. Kullback-leibler divergence. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 720–722. [Google Scholar]
- Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
- Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248. [Google Scholar]
- Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In Proceedings of the International Workshop on Machine Learning in Medical Imaging; Springer: Berlin/Heidelberg, Germany, 2017; pp. 379–387. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
- Van Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2016; pp. 1747–1756. [Google Scholar]
- Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar]
- Rezende, D.J.; Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2015; pp. 1530–1538. [Google Scholar]
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2014; pp. 1278–1286. [Google Scholar]
- Hou, X.; Shen, L.; Sun, K.; Qiu, G. Deep Feature Consistent Variational Autoencoder. arXiv 2016, arXiv:1610.00291. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.C.; Vincent, P. Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives. arXiv 2012, arXiv:1206.5538. [Google Scholar] [CrossRef]
- Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv 2014, arXiv:1406.5298. [Google Scholar] [CrossRef]
- An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
- Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.P.; Glorot, X.; Botvinick, M.M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR (Poster); OpenReview: Amherst, MA, USA, 2017; Volume 3. [Google Scholar]
- Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating sentences from a continuous space. arXiv 2015, arXiv:1511.06349. [Google Scholar]
- Alemi, A.; Poole, B.; Fischer, I.; Dillon, J.; Saurous, R.A.; Murphy, K. Fixing a broken ELBO. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2018; pp. 159–168. [Google Scholar]
- Burgess, C.P.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; Lerchner, A. Understanding disentangling in Beta-VAE. arXiv 2018, arXiv:1804.03599. [Google Scholar]
- Dhariwal, P.; Payne, H.; Kim, J.W.; Radford, A.; Sutskever, I. Jukebox: A generative model for music. arXiv 2020, arXiv:2005.00341. [Google Scholar] [CrossRef]
- Razavi, A.; van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 14866–14876. [Google Scholar]
- Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28. [Google Scholar]
- Yan, X.; Yang, J.; Sohn, K.; Lee, H.; Yang, M.H. Attribute2Image: Conditional image generation from visual attributes. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2016; pp. 776–791. [Google Scholar]
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar] [CrossRef]
- Stanczuk, J.; Etmann, C.; Kreusser, L.M.; Schönlieb, C.B. Wasserstein GANs work because they fail (to approximate the Wasserstein distance). arXiv 2021, arXiv:2103.01678. [Google Scholar] [CrossRef]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2015; pp. 2256–2265. [Google Scholar]
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar]
- Song, Y.; Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 12438–12448. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695. [Google Scholar]
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Choi, Y.; Uh, Y.; Yoo, J.; Ha, J.W.; Chun, S. Fine-grained image-to-image transformation towards visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 3626–3635. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
- Meng, Y.; Xia, M.; Chen, D. SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv 2024, arXiv:2405.14734. [Google Scholar]
- Hong, J.; Lee, N.; Thorne, J. ORPO: Monolithic Preference Optimization without Reference Model. arXiv 2024, arXiv:2403.07691. [Google Scholar] [CrossRef]
- Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993; Volume 6. [Google Scholar]
- Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2015; pp. 84–92. [Google Scholar]
- Cao, Z.; Qin, T.; Liu, T.Y.; Tsai, M.F.; Li, H. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2007; pp. 129–136. [Google Scholar]
- Wang, J.; Song, Y.; Leung, T.; Rosenberg, C.; Wang, J.; Philbin, J.; Chen, B.; Wu, Y. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2014; pp. 1386–1393. [Google Scholar]
- Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05); IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 539–546. [Google Scholar]
- Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06); IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1735–1742. [Google Scholar]
- Li, Y.; Song, Y.; Luo, J. Improving pairwise ranking for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 3617–3625. [Google Scholar]
- Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille; JMLR: Norfolk, MA, USA, 2015; Volume 37, pp. 1–8. [Google Scholar]
- Chechik, G.; Sharma, V.; Shalit, U.; Bengio, S. Large Scale Online Learning of Image Similarity Through Ranking. J. Mach. Learn. Res. 2010, 11, 1109–1135. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 11. [Google Scholar]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2020; pp. 1597–1607. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
- Burges, C.J. From ranknet to lambdarank to lambdamart: An overview. Learning 2010, 11, 81. [Google Scholar]
- Burges, C.; Ragno, R.; Le, Q. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2006; Volume 19. [Google Scholar]
- LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. Predict. Struct. Data 2006, 1, 1–59. [Google Scholar]
- Friston, K.; Kilner, J.; Harrison, L. A free energy principle for the brain. J. Physiol.-Paris 2006, 100, 70–87. [Google Scholar] [CrossRef]
- Friston, K. The free-energy principle: A rough guide to the brain? Trends Cogn. Sci. 2009, 13, 293–301. [Google Scholar] [CrossRef]
- Finn, C.; Christiano, P.; Abbeel, P.; Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv 2016, arXiv:1611.03852. [Google Scholar] [CrossRef]
- Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2017; pp. 1352–1361. [Google Scholar]
- Grathwohl, W.; Wang, K.C.; Jacobsen, J.H.; Duvenaud, D.; Norouzi, M.; Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. arXiv 2019, arXiv:1912.03263. [Google Scholar]
- Du, Y.; Lin, T.; Mordatch, I. Model Based Planning with Energy Based Models. arXiv 2019, arXiv:1909.06878. [Google Scholar] [CrossRef]
- Osadchy, M.; Miller, M.; Cun, Y. Synergistic face detection and pose estimation with energy-based models. J. Mach. Learn. Res. 2004, 17, 1197–1215. [Google Scholar]
- Du, Y.; Mordatch, I. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Bengio, Y. Gradient-based optimization of hyperparameters. Neural Comput. 2000, 12, 1889–1900. [Google Scholar] [CrossRef]
- Teh, Y.W.; Welling, M.; Osindero, S.; Hinton, G.E. Energy-based models for sparse overcomplete representations. J. Mach. Learn. Res. 2003, 4, 1235–1260. [Google Scholar]
- Swersky, K.; Ranzato, M.; Buchman, D.; Freitas, N.D.; Marlin, B.M. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11); Omnipress: Madison, WI, USA, 2011; pp. 1201–1208. [Google Scholar]
- Zhai, S.; Cheng, Y.; Lu, W.; Zhang, Z. Deep structured energy based models for anomaly detection. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: Norfolk, MA, USA, 2016; pp. 1100–1109. [Google Scholar]
- Kumar, R.; Ozair, S.; Goyal, A.; Courville, A.; Bengio, Y. Maximum entropy generators for energy-based models. arXiv 2019, arXiv:1901.08508. [Google Scholar] [CrossRef]
- Song, Y.; Kingma, D.P. How to train your energy-based models. arXiv 2021, arXiv:2101.03288. [Google Scholar]
- Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Collins, M. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002); Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–8. [Google Scholar]
- Shapiro, A. Monte Carlo sampling methods. Handbooks Oper. Res. Manag. Sci. 2003, 10, 353–425. [Google Scholar]
- Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233. [Google Scholar] [CrossRef]
- LeCun, Y.; Huang, F.J. Loss functions for discriminative training of energy-based models. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, PMLR; JMLR: Norfolk, MA, USA, 2005; pp. 206–213. [Google Scholar]
- Levin, E.; Fleisher, M. Accelerated learning in layered neural networks. Complex Syst. 1988, 2, 3. [Google Scholar]
- Bengio, Y.; Ducharme, R.; Vincent, P. A neural probabilistic language model. J. Mach. Learn. Res. 2000, 13, 1137–1155. [Google Scholar]
- Bengio, Y.; De Mori, R.; Flammia, G.; Kompe, R. Global optimization of a neural network-hidden Markov model hybrid. IEEE Trans. Neural Networks 1992, 3, 252–259. [Google Scholar] [CrossRef]
- Taskar, B.; Guestrin, C.; Koller, D. Max-margin Markov networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 16. [Google Scholar]
- Altun, Y.; Tsochantaridis, I.; Hofmann, T. Hidden markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03); AAAI Press: Menlo Park, CA, USA, 2003; pp. 3–10. [Google Scholar]
- Kleinbaum, D.G.; Dietz, K.; Gail, M.; Klein, M.; Klein, M. Logistic Regression; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
- Juang, B.H.; Hou, W.; Lee, C.H. Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 1997, 5, 257–265. [Google Scholar] [CrossRef]
- Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
- Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations (ICLR); OpenReview: Amherst, MA, USA, 2019. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2017. [Google Scholar]
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2013; pp. 2787–2795. [Google Scholar]
- Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex embeddings for simple link prediction. In Proceedings of the International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2016; pp. 2071–2080. [Google Scholar]
- Pan, S.; Hu, R.; Long, G.; Jiang, J.; Zhang, C.; Yao, L. Adversarially regularized graph autoencoder for graph embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); AAAI Press: Menlo Park, CA, USA, 2018; pp. 2609–2615. [Google Scholar]
- Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2014; pp. 701–710. [Google Scholar]
- Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 855–864. [Google Scholar]
- Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the NeurIPS; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
- Benson, A.R.; Gleich, D.F.; Leskovec, J. Higher-order organization of complex networks. Science 2016, 353, 163–166. [Google Scholar] [CrossRef]
- Lee, J.B.; Rossi, R.A.; Kim, S.; Ahmed, N.K.; Koh, E. Attention Models in Graphs: A Multi-View Approach. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 402–412. [Google Scholar]
- Ugander, J.; Backstrom, L.; Marlow, C. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd International Conference on World Wide Web; ACM: New York, NY, USA, 2013; pp. 1307–1318. [Google Scholar]
- Chitwood, D.H.; Otoni, W.C. Motif-based analysis of biological networks. Nat. Commun. 2018, 9, 1–12. [Google Scholar]
- Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; Alon, U. Network motifs: Simple building blocks of complex networks. Science 2002, 298, 824–827. [Google Scholar] [CrossRef]
- Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 2007, 23, e177–e183. [Google Scholar] [CrossRef]
- Yin, H.; Li, W.; Cao, Y. Graph neural network and motif-based knowledge graph embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); AAAI Press: Menlo Park, CA, USA, 2018. [Google Scholar]
- Wang, Q.; Mao, Z.; Wang, B.; Guo, L. Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 2017, 29, 2724–2743. [Google Scholar] [CrossRef]
- van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- You, Y.; Chen, T.; Wang, X.; Shen, Z.; Huang, Z. Graph contrastive learning with augmentations. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
- Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep Graph Contrastive Representation Learning. In Proceedings of the ICML Workshop on Graph Representation Learning and Beyond (GRL+); JMLR: Norfolk, MA, USA, 2020. [Google Scholar]
- Veličković, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. In Proceedings of the ICLR; OpenReview: Amherst, MA, USA, 2019. [Google Scholar]
- Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2002, 15, 1373–1396. [Google Scholar] [CrossRef]
- Li, Q.; Han, Z.; Wu, X. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI); AAAI Press: Menlo Park, CA, USA, 2018; pp. 3538–3545. [Google Scholar]
- Ribeiro, L.F.; Saverese, P.H.; Figueiredo, D.R. struc2vec: Learning Node Representations from Structural Identity. In Proceedings of the KDD; Association for Computing Machinery: New York, NY, USA, 2017; pp. 385–394. [Google Scholar]
- Donnat, C.; Zitnik, M.; Hallac, D.; Leskovec, J. Learning Structural Node Embeddings via Graph Wavelets. In Proceedings of the KDD; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1320–1329. [Google Scholar]
- Liu, F.; Li, X.; Zhang, W. Structural anomaly detection in graphs using node embeddings. Knowl.-Based Syst. 2021, 227, 107208. [Google Scholar]
- Bechtle, S.; Molchanov, A.; Chebotar, Y.; Grefenstette, E.; Righetti, L.; Sukhatme, G.S. Meta Learning via Learned Loss. arXiv 2019, arXiv:1906.05374. [Google Scholar]
- Wu, L.; Tian, F.; Xia, Y.; Fan, Y.; Qin, T.; Jian-Huang, L.; Liu, T.Y. Learning to Teach with Dynamic Loss Functions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
- Li, H.; Fu, T.; Dai, J.; Li, H.; Huang, G.; Zhu, X. AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
- Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
- Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In Proceedings of the 35th International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2018. [Google Scholar]
- Zhang, Z.; Sabuncu, M.R. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
- Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; Bailey, J. Symmetric Cross Entropy for Robust Learning with Noisy Labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR); OpenReview: Amherst, MA, USA, 2018. [Google Scholar]
- Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.P.; El Ghaoui, L.; Jordan, M.I. Theoretically Principled Trade-off between Robustness and Accuracy. In Proceedings of the 36th International Conference on Machine Learning (ICML); JMLR: Norfolk, MA, USA, 2019. [Google Scholar]
- Sagawa, S.; Koh, P.W.; Hashimoto, T.B.; Liang, P. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. arXiv 2019, arXiv:1911.08731. [Google Scholar]
- Kuhn, D.; Shafiee, S.; Wiesemann, W. Distributionally Robust Optimization. arXiv 2024, arXiv:2411.02549. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
- Ethayarajh, K.; Xu, W.; Muennighoff, N.; Jurafsky, D.; Kiela, D. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv 2024, arXiv:2402.01306. [Google Scholar] [CrossRef]
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 2014, 46, 44. [Google Scholar] [CrossRef]
- Hinder, F.; Vaquet, V.; Hammer, B. One or two things we know about concept drift—A survey on monitoring in evolving environments. Part B: Locating and explaining concept drift. Front. Artif. Intell. 2024, 7, 1330258. [Google Scholar] [CrossRef]