# Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices


(Special Issue: *Remote Sensing* 15th Anniversary)

## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Accuracy Assessment Metrics

For a three-class problem with classes A, B, and C (Table 1), OA is calculated as $(P_{AA} + P_{BB} + P_{CC})/P_{++}$, where $P_{++}$ represents summation over both the rows and columns, and thus, $P_{++} = 1.0$. Alongside OA, it is common to calculate the class-level assessment metrics, PA and UA (see Section 1, above). PA represents 1 – omission error, while UA represents 1 – commission error [1].
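As a concrete illustration of these definitions, the sketch below computes OA, UA, and PA from a small proportion-based confusion matrix. The numbers are invented for illustration and are not taken from this study:

```python
import numpy as np

# Hypothetical 3-class population confusion matrix of proportions P_ij
# (rows = classification, columns = reference); entries sum to P_++ = 1.0.
P = np.array([
    [0.30, 0.02, 0.03],
    [0.05, 0.25, 0.02],
    [0.05, 0.03, 0.25],
])

oa = np.trace(P) / P.sum()        # OA = (P_AA + P_BB + P_CC) / P_++
ua = np.diag(P) / P.sum(axis=1)   # UA_i = P_ii / P_i+  (1 - commission error)
pa = np.diag(P) / P.sum(axis=0)   # PA_j = P_jj / P_+j  (1 - omission error)

print(round(oa, 2))  # 0.8
```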

#### 2.2. Averaged Multiclass Accuracy Measures

Class-aggregated metrics can also be calculated as a weighted macro-average using user-defined class weights (w_{j}) (Equations (5) and (6)) [7,8]. This allows the user to specify the relative weighting of each class in the aggregated metric, in contrast to traditional macro-averaging, where all classes are equally weighted, or micro-averaging, in which the values in the confusion matrix are aggregated prior to calculating the accuracy metric, so that the relative proportion of each class in the testing or validation dataset is preserved [3,7,8]. An important caveat is that when the class prevalences are used as the weights, the weighted macro-averaged class metrics are equivalent to OA.
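The caveat that prevalence-weighted macro-averaging collapses to OA can be checked numerically. The confusion matrix below is hypothetical, chosen only to exhibit a strong class imbalance:

```python
import numpy as np

# Count-based confusion matrix (rows = predicted, cols = reference) for a
# hypothetical imbalanced 3-class problem (class 1 dominates).
cm = np.array([
    [90,  5,  5],
    [ 2,  8,  0],
    [ 8,  2, 10],
])

tp = np.diag(cm)
recall_per_class = tp / cm.sum(axis=0)   # class-level PA
oa = tp.sum() / cm.sum()

macro_recall = recall_per_class.mean()   # equal class weights
prevalence = cm.sum(axis=0) / cm.sum()   # class share of reference samples
weighted_recall = (prevalence * recall_per_class).sum()

# Prevalence-weighted macro recall collapses to OA, while the equally
# weighted macro average gives rare classes more influence.
print(np.isclose(weighted_recall, oa))  # True
```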

#### 2.3. Loss Metrics

Weighted cross-entropy (CE) loss incorporates a weight, w_{j}, for each class in the calculation. The values of w_{j} are commonly based on the inverse of the abundance of the class in the training dataset, in an attempt to offset the impact of sample size. It is possible to increase the impact of difficult-to-classify samples by including a γ parameter; this is known as focal loss [3].
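A minimal numpy sketch of weighted CE and one common form of its focal variant, assuming inverse-abundance class weights as described above. The counts, labels, and probabilities are invented for illustration:

```python
import numpy as np

def weighted_ce(y_onehot, p_hat, w):
    # Weighted CE: -(1/n) * sum_i sum_j w_j * y_ij * log(p_hat_ij)
    return -np.mean(np.sum(w * y_onehot * np.log(p_hat), axis=1))

def focal_loss(y_onehot, p_hat, w, gamma=2.0):
    # Focal variant: down-weights easy samples by (1 - p_hat)^gamma [3].
    return -np.mean(np.sum(w * (1 - p_hat) ** gamma * y_onehot * np.log(p_hat), axis=1))

# Hypothetical class counts; weights from inverse abundance, then normalized.
counts = np.array([1400, 140, 1400])
w = 1 / counts
w = w / w.sum() * len(counts)

y = np.eye(3)[[0, 1, 2]]            # three one-hot labels
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.1, 0.7]])     # softmax-like predicted probabilities

print(weighted_ce(y, p, w) > focal_loss(y, p, w))  # True: focal shrinks easy-sample loss
```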

## 3. Methods

#### 3.1. Data

#### 3.2. Classification Experiments

#### 3.2.1. CNN Scene Classification

#### 3.2.2. Experiments Exploring the Effect of Changing Class Prevalences

## 4. Results, Discussion, and Recommendations

#### 4.1. Micro- and Macro-Averaged Accuracy Assessment Metrics

#### 4.2. Accuracy Assessment Metrics and Class Imbalance

#### 4.3. Impact of Class Imbalance on the Training Process

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Congalton, R.; Green, K. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2019.
- Warner, T.A.; Nellis, M.D.; Foody, G.M. The SAGE Handbook of Remote Sensing; SAGE Publications, Inc.: Thousand Oaks, CA, USA, 2009.
- Ma, J.; Chen, J.; Ng, M.; Huang, R.; Li, Y.; Li, C.; Yang, X.; Martel, A.L. Loss odyssey in medical image segmentation. Med. Image Anal. 2021, 71, 102035.
- Yeung, M.; Sala, E.; Schönlieb, C.-B.; Rundo, L. Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput. Med. Imaging Graph. 2022, 95, 102026.
- Stehman, S.V.; Foody, G.M. Key issues in rigorous accuracy assessment of land cover products. Remote Sens. Environ. 2019, 231, 111199.
- Maxwell, A.E.; Warner, T.A.; Guillén, L.A. Accuracy Assessment in Convolutional Neural Network-Based Deep Learning Remote Sensing Studies—Part 1: Literature Review. Remote Sens. 2021, 13, 2450.
- Gowda, T.; You, W.; Lignos, C.; May, J. Macro-Average: Rare Types Are Important Too. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 12 June 2021; pp. 1138–1157.
- Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756.
- Stehman, S.V.; Czaplewski, R.L. Design and Analysis for Thematic Map Accuracy Assessment: Fundamental Principles. Remote Sens. Environ. 1998, 64, 331–344.
- Stehman, S. Statistical Rigor and Practical Utility in Thematic Map Accuracy Assessment. Photogramm. Eng. Remote Sens. 2001, 67, 727–734.
- Stehman, S.V. Impact of sample size allocation when using stratified random sampling to estimate accuracy and area of land-cover change. Remote Sens. Lett. 2012, 3, 111–120.
- Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2021, 17, 168–192.
- Congalton, R.G.; Oderwald, R.G.; Mead, R.A. Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogramm. Eng. Remote Sens. 1983, 49, 1671–1678.
- Foody, G.M. Explaining the unsuitability of the kappa coefficient in the assessment and comparison of the accuracy of thematic maps obtained by image classification. Remote Sens. Environ. 2020, 239, 111630.
- Pontius, R.G.; Millones, M. Death to Kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. Int. J. Remote Sens. 2011, 32, 4407–4429.
- Xiao, R.; Zhong, C.; Zeng, W.; Cheng, M.; Wang, C. Novel Convolutions for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5907313.
- Singh, A.; Kalke, H.; Loewen, M.; Ray, N. River Ice Segmentation with Deep Learning. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7570–7579.
- Zeng, Q.; Zhou, J.; Niu, X. Cross-Scale Feature Propagation Network for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6008305.
- Subramanian, V. Deep Learning with PyTorch: A Practical Approach to Building Neural Network Models Using PyTorch; Packt Publishing: Birmingham, UK, 2018.
- Antiga, L.P.G.; Stevens, E.; Viehmann, T. Deep Learning with PyTorch; Manning: Shelter Island, NY, USA, 2020.
- Zhao, R.; Qian, B.; Zhang, X.; Li, Y.; Wei, R.; Liu, Y.; Pan, Y. Rethinking Dice Loss for Medical Image Segmentation. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 851–860.
- Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Québec City, QC, Canada, 14 September 2017; pp. 240–248.
- Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice Loss for Data-imbalanced NLP Tasks. arXiv 2020, arXiv:1911.02855.
- Bertels, J.; Eelbode, T.; Berman, M.; Vandermeulen, D.; Maes, F.; Bisschops, R.; Blaschko, M. Optimizing the Dice Score and Jaccard Index for Medical Image Segmentation: Theory & Practice. arXiv 2019, arXiv:1911.01685.
- Wang, P.; Chung, A.C.S. Focal Dice Loss and Image Dilation for Brain Tumor Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; pp. 119–127.
- Salehi, S.S.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. In Proceedings of the 8th International Workshop, MLMI 2017, Quebec City, QC, Canada, 10 September 2017.
- Abraham, N.; Khan, N.M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 683–687.
- Helber, P.; Bischke, B.; Dengel, A.; Borth, D. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226.
- Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s Optical High-Resolution Mission for GMES Operational Services. Remote Sens. Environ. 2012, 120, 25–36.
- PyTorch. Available online: https://www.pytorch.org (accessed on 31 December 2020).
- Python. Available online: https://www.python.org/ (accessed on 5 January 2021).
- Bjorck, J.; Gomes, C.; Selman, B.; Weinberger, K.Q. Understanding Batch Normalization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
- Kuhn, M.; Vaughan, D.; Hvitfeldt, E. yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8; 2021.
- Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008, 28, 1–26.
- Evans, J.S.; Murphy, M.A. rfUtilities; R package; 2018.
- Pontius, R.G.; Santacruz, A. diffeR: Metrics of Difference for Comparing Pairs of Maps or Pairs of Variables; R package; 2023.
- Stehman, S.V. Estimating area and map accuracy for stratified random sampling when the strata are different from the map classes. Int. J. Remote Sens. 2014, 35, 4923–4939.
- Stehman, S.V. Sampling designs for accuracy assessment of land cover. Int. J. Remote Sens. 2009, 30, 5243–5272.
- Stehman, S. A Critical Evaluation of the Normalized Error Matrix in Map Accuracy Assessment. Photogramm. Eng. Remote Sens. 2004, 70, 743–751.
- Stehman, S.V. Basic probability sampling designs for thematic map accuracy assessment. Int. J. Remote Sens. 1999, 20, 2423–2441.
- Stehman, S.V. Comparison of systematic and random sampling for estimating the accuracy of maps generated from remotely sensed data. Photogramm. Eng. Remote Sens. 1992, 58, 1343–1350.
- Stehman, S.V.; Wagner, J.E. Choosing a sample size allocation to strata based on trade-offs in precision when estimating accuracy and area of a rare class from a stratified sample. Remote Sens. Environ. 2024, 300, 113881.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Ghosh, K.; Bellinger, C.; Corizzo, R.; Branco, P.; Krawczyk, B.; Japkowicz, N. The class imbalance problem in deep learning. Mach. Learn. 2022.
- Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
- Ding, W.; Huang, D.Y.; Chen, Z.; Yu, X.; Lin, W. Facial action recognition using very deep networks for highly imbalanced class distribution. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 1368–1372.

**Figure 1.** Example image chips from the EuroSAT dataset. Chips are displayed as false color composites, with the NIR, red, and green bands displayed, respectively, as red, green, and blue. Each chip is 64 by 64 pixels in size with a spatial resolution of 10 m.

**Figure 2.** Boxplots of class PAs (**a**) and UAs (**b**) for 1000 randomly generated class proportions that sum to 1, selected from a uniform distribution.

**Figure 3.** Boxplots of OA and class-aggregated, macro-averaged assessment metrics using 1000 varying class proportions that sum to 1, selected from a uniform distribution.

**Figure 4.** Loss curves for training on the EuroSAT dataset for 50 epochs using four different losses and balanced and imbalanced training sets.

**Figure 5.** Overall accuracy for training on the EuroSAT dataset for 50 epochs using four different losses and balanced and imbalanced training sets.

**Figure 6.** Macro-averaged, class-aggregated F1-score for training on the EuroSAT dataset for 50 epochs using four different losses and balanced and imbalanced training sets.

**Figure 7.** Class-level UA and PA metrics of CNN models trained using cross-entropy, weighted cross-entropy, and macro- and micro-averaged Dice loss metrics. Results using balanced and imbalanced training sets are differentiated by color. The names of rare classes, with a reduced number of samples in the training set, are printed in green.

**Table 1.** Confusion matrix conceptualization where three classes, A, B, and C, are differentiated. P_{ij} represents the proportion of samples classified as class i, but known to belong to class j. The + symbol is used to represent summation; when the + symbol occurs in the first subscript position, the rows are summed; when the + symbol is in the second subscript position, the columns are summed. Gray cells represent correct classifications. UA = user’s accuracy (1 – commission error) and PA = producer’s accuracy (1 – omission error).

| | | Reference | | | | |
|---|---|---|---|---|---|---|
| | | **A** | **B** | **C** | **Row total** | **UA** |
| **Classification** | **A** | P_{AA} | P_{AB} | P_{AC} | P_{A+} | P_{AA}/P_{A+} |
| | **B** | P_{BA} | P_{BB} | P_{BC} | P_{B+} | P_{BB}/P_{B+} |
| | **C** | P_{CA} | P_{CB} | P_{CC} | P_{C+} | P_{CC}/P_{C+} |
| | **Column total** | P_{+A} | P_{+B} | P_{+C} | | |
| | **PA** | P_{AA}/P_{+A} | P_{BB}/P_{+B} | P_{CC}/P_{+C} | | |

**Table 2.** Conceptualization of a binary confusion matrix and associated terminology. TP = true positive, TN = true negative, FN = false negative, FP = false positive, and NPV = negative predictive value. See Table 3 for equations for recall, precision, specificity, and NPV.

| | | Reference data | | | |
|---|---|---|---|---|---|
| | | **Positive** | **Negative** | **1 – commission error** | |
| **Classification result** | **Positive** | TP | FP | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$ | Precision |
| | **Negative** | FN | TN | $\frac{\mathrm{TN}}{\mathrm{FN}+\mathrm{TN}}$ | NPV |
| | **1 – omission error** | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$ | $\frac{\mathrm{TN}}{\mathrm{FP}+\mathrm{TN}}$ | | |
| | | Recall | Specificity | | |

**Table 3.** Multiclass and binary metrics commonly calculated from the confusion matrix. TP = true positive, TN = true negative, FN = false negative, FP = false positive.

| Type of Classification | Metric | Equation | Comments |
|---|---|---|---|
| Binary and multiclass | Overall accuracy (OA) | $\frac{\text{Count of correct samples}}{\text{Count of total samples}}$ or $\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$ | |
| Multiclass | User’s accuracy (UA) | $\frac{\text{Count of correctly labeled samples in class}}{\text{Total count of samples predicted to class}}$ | 1 – commission error |
| Multiclass | Producer’s accuracy (PA) | $\frac{\text{Count of correctly labeled samples in class}}{\text{Total count of samples actually in class}}$ | 1 – omission error |
| Binary | Recall | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$ | PA for positives (1 – positive case omission error) |
| Binary | Precision | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$ | UA for positives (1 – positive case commission error) |
| Binary | Specificity | $\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$ | PA for negatives (1 – negative case omission error) |
| Binary | Negative predictive value (NPV) | $\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FN}}$ | UA for negatives (1 – negative case commission error) |
| Binary | F1-score (Dice score) | $\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ or $\frac{2\times \mathrm{TP}}{2\times \mathrm{TP}+\mathrm{FN}+\mathrm{FP}}$ | |
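The binary formulas in Table 3 can be checked with hypothetical counts; in particular, the two forms of the F1-score are algebraically identical:

```python
# Hypothetical binary confusion-matrix counts (not from the paper).
tp, tn, fp, fn = 40, 45, 5, 10

recall = tp / (tp + fn)                # PA for positives
precision = tp / (tp + fp)             # UA for positives
specificity = tn / (tn + fp)           # PA for negatives
npv = tn / (tn + fn)                   # UA for negatives

f1 = 2 * precision * recall / (precision + recall)
dice = 2 * tp / (2 * tp + fn + fp)     # equivalent count-based form

print(round(f1, 3) == round(dice, 3))  # True
```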

**Table 4.** Cross-entropy-based loss metrics, where $y$ is the reference label, $\widehat{p}$ is the predicted probability, $n$ is the number of samples, $C$ is the number of classes, and $w_j$ is the class weight.

| Loss | Equation |
|---|---|
| Binary cross-entropy (BCE) loss | $-\frac{1}{n}\sum_{i=1}^{n}\left[{y}_{i}\cdot \mathrm{log}\left(\widehat{{p}_{i}}\right)+\left(1-{y}_{i}\right)\cdot \mathrm{log}\left(1-\widehat{{p}_{i}}\right)\right]$ |
| Cross-entropy (CE) loss | $-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{C}{y}_{ij}\cdot \mathrm{log}\left(\widehat{{p}_{ij}}\right)$ |
| Weighted CE loss | $-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{C}{w}_{j}\cdot {y}_{ij}\cdot \mathrm{log}\left(\widehat{{p}_{ij}}\right)$ |
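The experiments also compare micro- and macro-averaged Dice losses. A minimal numpy sketch of soft Dice aggregation is shown below; this is my own formulation of the standard soft Dice, not necessarily the authors' exact implementation:

```python
import numpy as np

def soft_dice_per_class(y_onehot, p_hat, eps=1e-7):
    # Soft Dice for class j: 2 * sum(y * p_hat) / (sum(y) + sum(p_hat)).
    inter = (y_onehot * p_hat).sum(axis=0)
    denom = y_onehot.sum(axis=0) + p_hat.sum(axis=0)
    return (2 * inter + eps) / (denom + eps)

def macro_dice_loss(y_onehot, p_hat):
    # Macro averaging: mean of per-class Dice; all classes weighted equally.
    return 1 - soft_dice_per_class(y_onehot, p_hat).mean()

def micro_dice_loss(y_onehot, p_hat, eps=1e-7):
    # Micro averaging: pool intersections and denominators across classes
    # first. For softmax outputs this reduces to the mean predicted
    # probability of the true class, i.e., a "soft" OA.
    inter = (y_onehot * p_hat).sum()
    denom = y_onehot.sum() + p_hat.sum()
    return 1 - (2 * inter + eps) / (denom + eps)
```

The micro form's equivalence to a soft OA is consistent with the "Micro-Dice (OA)" labeling used in Table 9.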

**Table 5.** Sentinel-2 (MSI) bands. RE = red edge, NIR = near infrared, SWIR = shortwave infrared. Wavelength units are in nanometers (nm). Y indicates bands used in the experiment.

| Band | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B8a | B9 | B10 | B11 | B12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Central wavelength (nm) | 443 | 490 | 560 | 665 | 705 | 740 | 783 | 842 | 865 | 940 | 1375 | 1610 | 2190 |
| Description | Ultra blue | Blue | Green | Red | RE1 | RE2 | RE3 | NIR | NIR (narrow) | Water vapor | Cirrus cloud | SWIR 1 | SWIR 2 |
| Used in our experiment | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | | | |

**Table 6.** Number of samples in each class for the balanced and imbalanced training and validation sets. The testing set proportions used the samples in the original dataset, and the results were used to estimate balanced and imbalanced error matrices.

| Class | Training (Balanced) | Training (Imbalanced) | Validation (Balanced) | Validation (Imbalanced) | Testing (Both) |
|---|---|---|---|---|---|
| Annual crop | 1400 | 140 | 400 | 40 | 300 |
| Forest | 1400 | 1400 | 400 | 400 | 300 |
| Herb veg | 1400 | 140 | 400 | 40 | 300 |
| Highway | 1400 | 1400 | 400 | 400 | 250 |
| Industrial | 1400 | 140 | 400 | 40 | 250 |
| Pasture | 1400 | 1400 | 400 | 400 | 200 |
| Perm crop | 1400 | 1400 | 400 | 400 | 250 |
| Residential | 1400 | 140 | 400 | 40 | 300 |
| River | 1400 | 1400 | 400 | 400 | 250 |
| Sea/Lake | 1400 | 140 | 400 | 40 | 359 |
| Total | 14,000 | 7700 | 4000 | 2200 | 2509 |

**Table 7.** Example confusion matrix for classification of an imbalanced dataset using an imbalanced training set and cross-entropy loss. The table is an estimate of the population confusion matrix, and thus, the numbers in the table represent the percentage of the imbalanced data classified as class i, but known to belong to class j.

| | Reference | Annual Crop | Forest | Herb Veg | Highway | Industrial | Pasture | Perm Crop | Residential | River | Sea/Lake | Row Total | UA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Classification** | Annual crop | 15.45 | 0.00 | 0.30 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 15.79 | 0.979 |
| | Forest | 0.24 | 1.79 | 0.12 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 2.18 | 0.825 |
| | Herb veg | 0.06 | 0.00 | 13.09 | 0.01 | 0.07 | 0.03 | 0.01 | 0.06 | 0.00 | 0.00 | 13.33 | 0.982 |
| | Highway | 0.30 | 0.00 | 0.91 | 1.75 | 4.73 | 0.01 | 0.01 | 1.58 | 0.04 | 0.00 | 9.33 | 0.188 |
| | Industrial | 0.00 | 0.00 | 0.00 | 0.00 | 12.51 | 0.00 | 0.00 | 0.24 | 0.00 | 0.00 | 12.75 | 0.981 |
| | Pasture | 0.61 | 0.02 | 0.85 | 0.00 | 0.00 | 1.76 | 0.01 | 0.00 | 0.01 | 0.00 | 3.27 | 0.539 |
| | Perm crop | 1.52 | 0.00 | 2.67 | 0.03 | 0.00 | 0.00 | 1.72 | 0.30 | 0.00 | 0.00 | 6.24 | 0.276 |
| | Residential | 0.00 | 0.00 | 0.00 | 0.01 | 0.65 | 0.00 | 0.01 | 16.00 | 0.00 | 0.00 | 16.68 | 0.959 |
| | River | 0.00 | 0.00 | 0.24 | 0.01 | 0.22 | 0.00 | 0.01 | 0.00 | 1.77 | 0.56 | 2.81 | 0.630 |
| | Sea/lake | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 17.62 | 17.62 | 1.000 |
| | Column total | 18.18 | 1.82 | 18.18 | 1.82 | 18.18 | 1.82 | 1.82 | 18.18 | 1.82 | 18.18 | | |
| | PA | 0.850 | 0.987 | 0.720 | 0.964 | 0.688 | 0.970 | 0.948 | 0.880 | 0.972 | 0.969 | | |

**Table 8.** Class-aggregated accuracy metrics for the classification reported in Table 7 (i.e., using an imbalanced training set and a cross-entropy loss).

| OA | Micro UA (Precision) | Micro PA (Recall) | Micro F1-Score | Macro UA (Precision) | Macro PA (Recall) | Macro F1-Score |
|---|---|---|---|---|---|---|
| 0.835 | 0.835 | 0.835 | 0.835 | 0.736 | 0.895 | 0.755 |
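The pattern of identical micro-averaged values is not a coincidence: when TP, FP, and FN are pooled across all classes, every off-diagonal count is simultaneously a false positive for its row's class and a false negative for its column's class, so micro-averaged precision, recall, and F1 all reduce to OA. A quick numerical check with an invented matrix:

```python
import numpy as np

# Hypothetical multiclass confusion matrix (rows = predicted, cols = reference).
cm = np.array([[50,  4,  1],
               [ 3, 30,  2],
               [ 2,  1,  7]])

tp = np.diag(cm).sum()
total = cm.sum()
fp = total - tp   # pooled FP: every off-diagonal count, by row class
fn = total - tp   # pooled FN: the same counts, by column class

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
oa = tp / total

print(micro_precision, micro_recall, micro_f1, oa)  # all equal OA
```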

**Table 9.** Comparison of assessment metrics for models trained using different loss functions and with balanced or imbalanced training sets.

| Loss Metric | Training Set | Test Prevalences | OA | Macro-F1 | Macro-UA (Precision) | Macro-PA (Recall) |
|---|---|---|---|---|---|---|
| CE | Balanced | Balanced | 0.959 | 0.958 | 0.959 | 0.959 |
| CE | Balanced | Imbalanced | 0.970 | 0.929 | 0.907 | 0.958 |
| CE | Imbalanced | Balanced | 0.895 | 0.895 | 0.911 | 0.895 |
| CE | Imbalanced | Imbalanced | 0.835 | 0.755 | 0.736 | 0.894 |
| Weighted CE | Imbalanced | Balanced | 0.912 | 0.912 | 0.916 | 0.912 |
| Weighted CE | Imbalanced | Imbalanced | 0.890 | 0.804 | 0.765 | 0.911 |
| Micro-Dice (OA) | Balanced | Balanced | 0.954 | 0.954 | 0.955 | 0.954 |
| Micro-Dice (OA) | Balanced | Imbalanced | 0.956 | 0.913 | 0.883 | 0.954 |
| Micro-Dice (OA) | Imbalanced | Balanced | 0.581 | 0.466 | 0.452 | 0.581 |
| Micro-Dice (OA) | Imbalanced | Imbalanced | 0.261 | 0.264 | 0.251 | 0.581 |
| Macro-Dice | Balanced | Balanced | 0.956 | 0.956 | 0.957 | 0.956 |
| Macro-Dice | Balanced | Imbalanced | 0.971 | 0.927 | 0.904 | 0.956 |
| Macro-Dice | Imbalanced | Balanced | 0.873 | 0.872 | 0.894 | 0.873 |
| Macro-Dice | Imbalanced | Imbalanced | 0.805 | 0.712 | 0.697 | 0.872 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Farhadpour, S.; Warner, T.A.; Maxwell, A.E.
Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. *Remote Sens.* **2024**, *16*, 533.
https://doi.org/10.3390/rs16030533
