3. Materials and Methods
This study evaluates five loss functions—L1, L2, Cross-Entropy (CE), KL Divergence (KL), and the proposed DLITE—within the context of NER using transformer-based models. All experiments used the BERT-base-uncased pretrained checkpoint originally released by Google LLC (Mountain View, CA, USA), accessed through the HuggingFace Transformers library (version 4.37; Hugging Face, Inc., New York, NY, USA). Training and inference were implemented with the PyTorch backend (version 2.1.0; PyTorch Foundation, The Linux Foundation, San Francisco, CA, USA). Model training and computation were performed on an NVIDIA A100 GPU with 40 GB of memory (NVIDIA Corporation, Santa Clara, CA, USA). Unless otherwise noted, tokenization used the WordPiece tokenizer bundled with the BERT checkpoint (BertTokenizerFast from Transformers v4.37, backed by tokenizers v0.15.2; Hugging Face, Inc., New York, NY, USA), and optimization used the Adam implementation provided by PyTorch v2.1.0 (torch.optim.Adam).
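As a concrete reference, the following minimal sketch shows how this stack is typically assembled. The label count, learning rate, and device handling are illustrative assumptions, since the exact hyperparameters are not specified in this section.

```python
# Minimal setup sketch for the experimental stack described above.
# ASSUMPTIONS: NUM_LABELS and the learning rate are placeholders, not values reported here.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

NUM_LABELS = 9  # assumption: e.g., a BIO tag set similar to CoNLL-2003

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# WordPiece tokenizer and BERT-base-uncased checkpoint via the Transformers stack
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LABELS
).to(device)

# Adam optimizer as provided by PyTorch (learning rate is an assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
```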
3.3. DLITE Loss Formulation
DLITE (Discounted Least Information Theory of Entropy) Loss is a novel entropy-based metric designed to address the shortcomings of existing loss functions in classification tasks. Whereas popular metrics like CE or KL Divergence focus on statistical divergence without scaling for interpretability or boundedness, DLITE introduces a concept of ‘discounted’ entropy that adjusts for the redundancy or over-penalization of uncertain predictions.
In terms of boundedness, DLITE Loss is bounded within the range [0, 1] after entropy discounting. This transformation ensures the result lies within a stable and interpretable range, avoiding the scale explosion often seen in CE or KL, which are unbounded. For instance, when the probability assigned to the true class approaches zero, CE Loss tends toward infinity (−log p → ∞ as p → 0), making the value hard to interpret and prone to instability during training. In contrast, DLITE reduces this impact through its entropy-aware penalty and discount factor, then compresses the output using a cube root to maintain metric properties and boundedness.
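To make the contrast concrete, the short sketch below evaluates CE for a single token whose true class receives progressively smaller probability; the blow-up of −log p is a general property of CE, not a result specific to this study.

```python
import math

# Cross-Entropy for one example where the true class receives probability p:
# CE = -log(p), which grows without bound as p approaches zero.
for p in [0.5, 0.1, 1e-3, 1e-6, 1e-12]:
    print(f"p(true class) = {p:.0e}  ->  CE = {-math.log(p):.2f} nats")
# Yields roughly 0.69, 2.30, 6.91, 13.82, and 27.63 nats, respectively;
# a loss bounded in [0, 1] cannot exhibit this kind of scale explosion.
```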
For interpretability, DLITE is based on Least Information Theory (LIT), which quantifies the amount of information needed to explain “changes in probability distributions”, or, in our setting, the change (delta) required to transform one probability distribution into another. This allows DLITE scores to be interpreted as an information distance—unlike CE or KL, which are divergence measures and do not satisfy the properties of a metric (e.g., symmetry, triangle inequality). DLITE’s design allows one to interpret a lower score as being “closer” in an information-theoretic sense, and its properties further support geometric and intuitive reasoning.
In terms of discounting, DLITE Loss applies a discount factor when measuring how different two probability distributions are (i.e., predicted vs. actual). This discounting reduces the penalty for predictions that are uncertain in a justified way—especially when both the predicted and true labels have low confidence or ambiguity.
This matters because popular loss functions like CE treat all confidently wrong predictions as equally bad, even if the model was justifiably uncertain (e.g., two very similar classes or noisy input). DLITE takes a different approach: instead of over-penalizing these uncertain or ambiguous cases, it subtracts a value (the entropy discount) from the total loss, where this value represents how uncertain the prediction already was. DLITE is built on the foundation of Least Information Theory (LIT) [26], which quantifies how much information is needed to transition between two probability distributions; it calculates the total information distance between prediction and truth.
This is defined as follows:
where P and Q are probability distributions over a set of outcomes X. The intuition is to measure the change in entropy-weighted probabilities for each class.
However, LIT alone is sensitive to scale and may overestimate differences between distributions due to local fluctuations. To mitigate this, DLITE introduces an entropy discount component, ∆H (Delta H), which captures redundancy or “expected noise”. It is defined as follows:
where P and Q are probability distributions over a set of outcomes X. This entropy discount quantifies the extra uncertainty introduced when comparing the two distributions and subtracts it from LIT to obtain DLITE, written as follows:
The final DLITE value, when cube-rooted, satisfies metric properties: non-negativity, symmetry, identity of indiscernibles, and the triangle inequality. This means it behaves as a true distance measure in information space. The cube root also gives DLITE a volumetric interpretation—analogous to how volume operates in 3D geometry.
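The Python sketch below illustrates only the overall structure just described: a per-class least-information term, an entropy discount subtracted from it, and a cube root applied to the result. The least-information term follows the integral form from LIT [26]; the entropy discount shown here, the absolute difference of the two Shannon entropies, is an illustrative assumption rather than the paper’s exact ∆H.

```python
import numpy as np

EPS = 1e-12

def g(p):
    # Antiderivative of -ln(t): g(p) = p - p*ln(p), with g(0) = 0.
    p = np.clip(p, EPS, 1.0)
    return p - p * np.log(p)

def lit(p, q):
    # Least information: sum over classes of |integral from p_x to q_x of -ln(t) dt|,
    # following the least-information formulation in [26].
    return float(np.sum(np.abs(g(q) - g(p))))

def entropy(p):
    p = np.clip(p, EPS, 1.0)
    return float(-np.sum(p * np.log(p)))

def dlite(p, q):
    # Structure only: LIT minus an entropy discount, then a cube root.
    # ASSUMPTION: |H(P) - H(Q)| stands in for the paper's exact DELTA-H; it keeps
    # the value non-negative (LIT >= |H(P) - H(Q)|) and symmetric, but the exact
    # discount should be taken from the DLITE definition in Section 3.3.
    discount = abs(entropy(p) - entropy(q))
    return max(lit(p, q) - discount, 0.0) ** (1.0 / 3.0)

# Example: predicted class probabilities vs. a one-hot gold label
pred = np.array([0.7, 0.2, 0.1])
gold = np.array([1.0, 0.0, 0.0])
print(f"LIT = {lit(pred, gold):.3f}, DLITE (cube-rooted) = {dlite(pred, gold):.3f}")
```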
The key features of DLITE Loss Function are as follows:
Volume-Based Metric: DLITE Loss is designed as a volume-based loss function that evaluates the amount of information in a system and reduces entropy in a structured way.
Metric Properties: The cube root of DLITE Loss satisfies the properties of non-negativity, identity of indiscernibles, and symmetry—key aspects of a well-defined metric.
Robustness: DLITE Loss is robust in scenarios where probabilistic inferences are involved, ensuring that when more equiprobable inferences are reduced to certainty, the function increases accordingly.
Information Aggregation: It is particularly useful in measuring and aggregating information accurately, making it suitable for complex tasks where information aggregation is critical.
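Building on the formulation and properties above, a brief sketch of how a DLITE-style distance can stand in for CE in a BERT token-classification training step is given below. The masking of padding and extra sub-word positions with the label −100, the conversion of gold tags to one-hot distributions, and the ∆H stand-in are assumptions made for illustration, not details reported in this section.

```python
import torch
import torch.nn.functional as F

def dlite_loss(logits, labels, num_labels, eps=1e-12):
    # Token-classification wrapper around the DLITE-style distance sketched in
    # Section 3.3. Positions labeled -100 (padding / extra sub-word pieces) are
    # ignored, mirroring common HuggingFace practice (an assumption here).
    mask = labels != -100
    p = F.softmax(logits[mask], dim=-1).clamp_min(eps)               # predictions P
    q = F.one_hot(labels[mask], num_labels).float().clamp_min(eps)   # gold one-hot Q
    g = lambda t: t - t * t.log()                                    # antiderivative of -ln(t)
    lit = (g(p) - g(q)).abs().sum(dim=-1)                            # least information per token
    discount = ((p * p.log()).sum(-1) - (q * q.log()).sum(-1)).abs() # assumed DELTA-H stand-in
    return ((lit - discount).clamp_min(0.0) + eps).pow(1.0 / 3.0).mean()

# One training step, reusing the model and optimizer from the setup sketch:
# loss = dlite_loss(model(**inputs).logits, labels, NUM_LABELS)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```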
7. Conclusions
This study presented a comparative analysis of five loss functions—L1, L2, Cross-Entropy (CE), KL Divergence, and DLITE Loss—for transformer-based Named Entity Recognition (NER) tasks using BERT across three benchmark datasets: Basic NER, CoNLL, and the Broad Twitter Corpus. The findings reinforce that the choice of loss function plays a pivotal role in shaping model performance, especially in domains characterized by noise, class imbalance, or precision-critical applications.
CE and KL Divergence consistently demonstrated robust and balanced performance across datasets, achieving high precision, recall, and F1-scores. Their effectiveness in both structured (CoNLL) and unstructured (Twitter) data underscores their continued relevance as reliable default choices for general-purpose NER. L2 Norm, while less competitive, still delivered stable results under controlled conditions.
DLITE, the novel entropy-based loss function introduced in this study, emerged as a precision-centric alternative, particularly effective in minimizing false positives. However, its performance was notably impacted by low recall, suggesting limitations in recall-sensitive applications. Despite this trade-off, DLITE holds promise for domains such as biomedical or legal NER, where precision and confidence in prediction are paramount.
Conversely, L1 Norm underperformed across all datasets, especially in macro-averaging, indicating a lack of adaptability to class imbalance and real-world data complexity.
The study further highlights the need to treat loss function selection as a domain-driven decision, particularly when deploying AI models in critical, regulated, or resource-constrained environments. Selecting the appropriate loss function is not only a matter of optimization but also of aligning model behavior with application goals such as interpretability, fairness, or robustness.
The following summarizes the findings in this paper:
Cross-Entropy and KL Divergence remain strong, general-purpose loss functions suitable for most NER applications.
DLITE is a promising candidate for high-precision tasks but requires further development to address recall limitations.
Loss function choice should be treated as a first-class design parameter, documented as metadata, and aligned with use-case priorities.
Future research should explore hybrid or adaptive loss strategies [38] that blend the strengths of DLITE and standard loss functions. Expanding evaluations to include more domain-specific datasets, such as clinical or legal corpora, will also be essential to further understand the applicability of these functions in diverse real-world settings [39].
Author Contributions
Conceptualization, S.P., M.P. and W.K.; Methodology, S.P., M.P. and W.K.; Software, M.P.; Validation, S.P., M.P. and W.K.; Formal analysis, S.P., M.P. and W.K.; Investigation, S.P., M.P. and W.K.; Resources, S.P. and M.P.; Data curation, M.P.; Writing—original draft, S.P.; Writing—review & editing, S.P., M.P. and W.K.; Visualization, M.P.; Supervision, W.K.; Project administration, W.K.; Funding acquisition, W.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The source code, experiment configurations, and run logs that support the findings of this study are openly available on GitHub (GitHub, Inc., San Francisco, CA, USA) [27]; the repository is version-controlled with Git v2.0.1 (The Git Project).
Acknowledgments
Portions of this manuscript’s writing and section formatting benefited from iterative editing and organization suggestions provided by OpenAI’s ChatGPT-4. All content was verified, rewritten, and curated by the authors, who take full responsibility for the scientific integrity of the work.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
AI | Artificial Intelligence |
CE | Cross-Entropy |
BERT | Bidirectional Encoder Representations from Transformers |
DLITE | Discounted Least Information Theory of Entropy |
KL Divergence | Kullback–Leibler Divergence |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
References
- De Boer, P.-T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A Tutorial on the Cross-Entropy Method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1. [Google Scholar]
- Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 2022, 9, 187–212. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE ICCV, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Zhang, Z.; Sabuncu, M. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. Adv. Neural Inf. Process. Syst. 2018, 31, 8778–8788. [Google Scholar]
- Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; Silva Santos, L.B.D.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
- Ke, W. Alternatives to classic BM25-IDF based on a new information theoretical framework. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 36–44. [Google Scholar] [CrossRef]
- Winograd, T. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language; MIT AI Technical Report: Cambridge, MA, USA, 1971. [Google Scholar]
- Chinchor, N.; Hirschman, L.; Lewis, D.D. Evaluating message understanding systems: An analysis of the Message Understanding Conference (MUC) results. In Proceedings of the Workshop on Human Language Technology, Stroudsburg, PA, USA, 21–24 March 1993; pp. 22–29. [Google Scholar]
- Eddy, S.R. Hidden Markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef]
- Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
- Mandic, D.P.; Chambers, J. Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures, and Stability; John Wiley & Sons, Inc: Hoboken, NJ, USA, 2001. [Google Scholar]
- Egan, S.; Fedorko, W.; Lister, A.; Pearkes, J.; Gay, C. Long short-term memory (LSTM) networks with jet constituents for boosted top tagging at the LHC. arXiv 2017, arXiv:1711.09059. [Google Scholar] [CrossRef]
- Rogers, A.; Gardner, M.; Augenstein, I. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. ACM Comput. Surv. 2023, 55, 1–45. [Google Scholar] [CrossRef]
- Derczynski, L.; Bontcheva, K.; Roberts, I. Broad Twitter corpus: A diverse named entity recognition resource. In Proceedings of the COLING 2016, The 26th International Conference on Computational Linguistics, Osaka, Japan, 11–16 December 2016; pp. 1169–1179. [Google Scholar]
- Jaswani, N. NER Dataset. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/namanj27/ner-dataset (accessed on 29 June 2025).
- Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Stroudsburg, PA, USA, 31 May 2003; pp. 142–147. Available online: https://aclanthology.org/W03-0419 (accessed on 29 June 2025).
- Weisstein, E.W. “Norm”. MathWorld. 2020. Available online: http://mathworld.wolfram.com/Norm.html (accessed on 29 June 2025).
- Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2001; p. 18. [Google Scholar]
- Dhinakaran, A. Understanding KL Divergence. Towards Data Science. 2020. Available online: https://towardsdatascience.com/understanding-kl-divergence-f3ddc8dff254 (accessed on 29 June 2025).
- Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
- Ke, W. Least information modeling for information retrieval. arXiv 2012, arXiv:1205.0312. Available online: https://arxiv.org/pdf/1205.0312 (accessed on 30 June 2025). [CrossRef]
- Ke, W. Beyond Cross-Entropy: DLITE Loss and the Impact of Loss Functions on AI-Driven Named Entity Recognition. [Computer Software]. GitHub. 2024. Available online: https://github.com/keweimao/DeepDelight/tree/main/Thread4/Beyond%20Cross-Entropy%3A%20DLITE%20Loss%20and%20the%20Impact%20of%20Loss%20Functions%20on%20AI-Driven%20Named%20Entity%20Recognition (accessed on 29 June 2025).
- OpenAI. GPT-4 Technical Report (Tech. Rep.). arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- OpenAI. ChatGPT [Large Language Model]. 2023. Available online: https://chat.openai.com/ (accessed on 29 June 2025).
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. Available online: https://arxiv.org/abs/1609.08144 (accessed on 29 June 2025).
- Krallinger, M.; Leitner, F.; Rabal, O.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D.M.; et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform. 2015, 7, S2. [Google Scholar] [CrossRef]
- Sheikhalishahi, S.; Miotto, R.; Dudley, J.T.; Lavelli, A.; Rinaldi, F.; Osmani, V. Natural language processing of clinical notes on chronic diseases: Systematic review. JMIR Med. Inform. 2019, 7, e12239. [Google Scholar] [CrossRef]
- Ghosh, A.; Kumar, H.; Sastry, P.S. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Kim, J.-D.; Ohta, T.; Tateisi, Y.; Tsujii, J. GENIA corpus—A semantically annotated corpus for bio-textmining. Bioinformatics 2003, 19 (Suppl. S1), i180–i182. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv 2021, arXiv:2103.06268. Available online: https://arxiv.org/abs/2103.06268 (accessed on 29 June 2025). [CrossRef]
- Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022; pp. 4310–4330. [Google Scholar] [CrossRef]
- Wiegreffe, S.; Pinter, Y. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019; pp. 11–20. [Google Scholar] [CrossRef]
- Maldonado, S.; Vairetti, C.; Jara, K.; Carrasco, M.; López, J. OWAdapt: An adaptive loss function for deep learning using OWA operators. Knowl.-Based Syst. 2023, 280, 111022. [Google Scholar] [CrossRef]
- Janocha, K.; Czarnecki, W.M. On loss functions for deep neural networks in classification. arXiv 2017, arXiv:1702.05659. Available online: https://arxiv.org/abs/1702.05659 (accessed on 29 June 2025). [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).