# Enhancing Medical Image Segmentation: Ground Truth Optimization through Evaluating Uncertainty in Expert Annotations


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Materials and Methods

#### 3.1. Dataset

#### 3.2. Architecture

#### 3.2.1. Segmentation Network

#### 3.2.2. Regularization Network

#### 3.3. Loss Function and Evaluation Metrics

#### 3.4. Methodology

#### 3.4.1. Jointly Learning

**Coupled CNN training:** The initial approach was to implement a local CM model in which each image pixel is assigned its own CM. This is an H × W × 2 × 2 dimensional problem, where H is the image height, W is the image width, and each pixel carries a 2 × 2 confusion matrix (Figure 3). Given this high dimensionality, we revised the strategy: the per-pixel CMs were reduced to a single global CM that aims to capture each annotator's behavior across the entire image, condensing the problem from H × W × 2 × 2 parameters to a single, more manageable 2 × 2 CM per annotator (Figure 4).

**Coupled CNN training with transfer learning:** To assist the initial models, further options had to be examined. We adopted the method proposed by Athanasiou et al. [6] and trained a CNN to segment the COC area proficiently. The weights of this model were then saved to serve as a starting point for disentangling the segmentation from the ground truth, removing the need to train both CNNs from scratch simultaneously and providing a promising initialization. Two variants of this approach stood out. The first entailed training from the pre-trained weights while jointly optimizing both CNNs. The second involved training from the pre-trained weights with the segmentation CNN frozen, so that only the annotating CNN is trained and thus learns the CMs.
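The global-CM formulation above can be illustrated with a small sketch. This is not the authors' implementation; it only shows, under assumed toy shapes and an assumed annotator matrix `A`, how one global 2 × 2 confusion matrix maps the segmentation network's per-pixel class probabilities to the distribution over an annotator's observed labels (`p_ann = p_true @ A`, with `A[i, j]` the probability that the annotator writes class `j` when the true class is `i`).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4  # toy image size for illustration; real microscopy images are far larger

# Hypothetical segmentation-network output: per-pixel probabilities
# over the two classes [background, COC], normalized to sum to 1.
p_true = rng.random((H, W, 2))
p_true /= p_true.sum(axis=-1, keepdims=True)

# One assumed global CM for a single annotator (rows sum to 1):
# labels true background correctly 90% of the time, true COC 80%.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Predicted distribution over the annotator's labels, for all pixels
# at once: a single 2x2 matrix instead of one CM per pixel.
p_ann = p_true @ A
assert np.allclose(p_ann.sum(axis=-1), 1.0)  # still a valid distribution
```

In the local-CM model of Figure 3, `A` would instead have shape `(H, W, 2, 2)`, which is the dimensionality blow-up that motivated the global simplification.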

#### 3.4.2. Confusion Matrices on Uncertainty

#### 3.4.3. Maximum Likelihood Ground Truth

## 4. Results and Discussion

#### 4.1. Performance Coupled CNNs

#### 4.2. Performance on CMs—Learning

#### 4.3. Ground Truth

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| CM | Confusion Matrix |
| CMs | Confusion Matrices |
| GT | Ground Truth |
| DL | Deep Learning |
| COC | Cumulus Oocyte Complex |
| COCs | Cumulus Oocyte Complexes |
| ART | Assisted Reproductive Technology |

## References

- Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging **2015**, 34, 1993–2024.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv **2015**, arXiv:1505.04597.
- Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical Image Segmentation Review: The success of U-Net. arXiv **2022**, arXiv:2211.14830.
- Harvey, H.; Glocker, B. A Standardised Approach for Preparing Imaging Data for Machine Learning Tasks in Radiology. In Artificial Intelligence in Medical Imaging; Springer: Cham, Switzerland, 2019.
- Nguyen, N.T.T.; Le, P.B. Topological Voting Method for Image Segmentation. J. Imaging **2022**, 8, 16.
- Athanasiou, G.; Cerquides, J.; Arcos, J.L. Detecting the Area of Bovine Cumulus Oocyte Complexes Using Deep Learning and Semantic Segmentation. In Proceedings of the CCIA 2022: 24th International Conference of the Catalan Association for Artificial Intelligence, Sitges, Spain, 19–21 October 2022; pp. 249–258.
- Warfield, S.K.; Zou, K.H.; Wells, W.M. Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging **2004**, 23, 903–921.
- Iglesias, J.E.; Sabuncu, M.R.; Leemput, K.V. A unified framework for cross-modality multi-atlas segmentation of brain MRI. Med. Image Anal. **2013**, 17, 1181–1191.
- Cardoso, M.J.; Leung, K.; Modat, M.; Keihaninejad, S.; Cash, D.; Barnes, J.; Fox, N.C.; Ourselin, S.; for the Alzheimer’s Disease Neuroimaging Initiative. STEPS: Similarity and Truth Estimation for Propagated Segmentations and its application to hippocampal segmentation and brain parcelation. Med. Image Anal. **2013**, 17, 671–684.
- Asman, A.J.; Landman, B.A. Non-local statistical label fusion for multi-atlas segmentation. Med. Image Anal. **2013**, 17, 194–208.
- Tanno, R.; Saeedi, A.; Sankaranarayanan, S.; Alexander, D.C.; Silberman, N. Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 11244–11253.
- Zhang, L.; Tanno, R.; Xu, M.C.; Jin, C.; Jacob, J.; Ciccarelli, O.; Barkhof, F.; Alexander, D.C. Disentangling human error from the ground truth in segmentation of medical images. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 6–12 December 2020; pp. 15750–15762.
- Zhang, J.; Zheng, Y.; Hou, W.; Jiao, W. Leveraging non-expert crowdsourcing to segment the optic cup and disc of multicolor fundus images. Biomed. Opt. Express **2022**, 13, 3967–3982.
- Hashmi, A.A.; Agafonov, A.; Zhumabayeva, A.; Yaqub, M.; Takáč, M. In Quest of Ground Truth: Learning Confident Models and Estimating Uncertainty in the Presence of Annotator Noise. arXiv **2023**, arXiv:2301.00524.
- Warfield, S.K.; Zou, K.H.; Wells, W.M. Validation of image segmentation and expert quality with an expectation-maximization algorithm. In Proceedings of the Fifth International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Part I, Tokyo, Japan, 25–28 September 2002.
- Asman, A.J.; Landman, B.A. Formulating spatially varying performance in the statistical fusion framework. IEEE Trans. Med. Imaging **2012**, 31, 1326–1336.
- Commowick, O.; Akhondi-Asl, A.; Warfield, S.K. Estimating a reference standard segmentation with spatially varying performance parameters: Local MAP STAPLE. IEEE Trans. Med. Imaging **2012**, 31, 1593–1606.
- Liu, S.; Liu, K.; Zhu, W.; Shen, Y.; Fernandez-Granda, C. Adaptive Early-Learning Correction for Segmentation From Noisy Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2606–2616.
- Wang, C.; Gao, Y.; Fan, C.; Hu, J.; Lam, T.L.; Lane, N.D.; Bianchi-Berthouze, N. AgreementLearning: An End-to-End Framework for Learning with Multiple Annotators without Groundtruth. arXiv **2021**, arXiv:2109.03596.
- Rottmann, M.; Reese, M. Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification. arXiv **2023**, arXiv:2207.06104.
- Hamzaoui, D.; Montagne, S.; Renard-Penna, R.; Ayache, N.; Delingette, H. MOrphologically-Aware Jaccard-Based ITerative Optimization (MOJITO) for Consensus Segmentation. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging; Sudre, C.H., Baumgartner, C.F., Dalca, A., Qin, C., Tanno, R., Van Leemput, K., Wells, W.M., III, Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; pp. 3–13.
- Carass, A.; Roy, S.; Jog, A.; Cuzzocreo, J.L.; Magrath, E.; Gherman, A.; Button, J.; Nguyen, J.; Prados, F.; Sudre, C.H.; et al. Longitudinal multiple sclerosis lesion segmentation: Resource and challenge. NeuroImage **2017**, 148, 77–102.
- Guo, X.; Lu, S.; Yang, Y.; Shi, P.; Ye, C.; Xiang, Y.; Ma, T. Modeling Annotator Variation and Annotator Preference for Multiple Annotations Medical Image Segmentation. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 977–984.
- Prados, F.; Ashburner, J.; Blaiotta, C.; Brosch, T.; Carballido-Gamio, J.; Cardoso, M.J.; Conrad, B.N.; Datta, E.; Dávid, G.; Leener, B.D.; et al. Spinal cord grey matter segmentation challenge. NeuroImage **2017**, 152, 312–329.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
- Dice, L.R. Measures of the Amount of Ecologic Association between Species. Ecology **1945**, 26, 297–302.
- Dawid, A.P.; Skene, A.M. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) **1979**, 28, 20–28.

**Figure 1.** A sample of the dataset. The first column contains a COC in both immature and mature stages. The subsequent three columns show the corresponding masks provided by different experts. It is evident that there is no complete agreement among the experts for each case.

**Figure 2.** The structure of the segmentation network, which consists of a UNet architecture parameterized by $\theta $ and the regularization networks parameterized by $\varphi $. The UNet has a depth of 5 layers, with the number of channels moving progressively from 32 to 64, 128, 256, and finally, to 512. The maxpooling layer has a padding and stride of 2, while the upsampling layer has a kernel size and a stride of 2. The regularization networks contain a simple network to compute the global confusion matrices and a CNN to compute the local confusion matrices.

**Figure 3.** The architecture consists of two components: (a) a segmentation network, characterized by the parameter $\theta $, which produces a probability distribution ${p}_{\theta}$ for segmentation; and (b) a regularization module consisting of a CNN, parameterized by $\varphi $, which utilizes the input image to generate three pixel-wise confusion matrices ${A}_{\varphi}$ at the local (pixel) level. During training, the parameters $(\theta ,\varphi )$ are learned simultaneously by optimizing the overall loss function.

**Figure 4.** The architecture consists of two components: (a) a segmentation network, characterized by the parameter $\theta $, which produces a probability distribution ${p}_{\theta}$ for segmentation; and (b) a regularization module consisting of a CNN, parameterized by $\varphi $, which utilizes the input image to generate three image-wise confusion matrices ${A}_{\varphi}$ at the global (image) level. During training, the parameters $(\theta ,\varphi )$ are learned simultaneously by optimizing the overall loss function.

**Figure 5.** Visualization of uncertainty regions in the segmentation process. On the left-hand side, a sample of the original microscopy images of COC is shown. On the right-hand side, the corresponding uncertainty regions are displayed. As is evident, areas of high uncertainty are concentrated along the borders of the cumulus oocyte complexes.

**Figure 6.** Visualization of the confusion matrices for each annotator, focusing on the areas of high uncertainty. This representation shows the distinct behavior of each expert in the areas that are most difficult to identify.

**Figure 7.** A comparison between the majority-vote ground truth and the maximum likelihood ground truth, focusing on the areas of uncertainty. In (**a**), the majority-vote mask, with a gray zone on the borders for the pixels of disagreement. In (**b**), the maximum likelihood mask, which can vary within the range (0.0–1.0), since it is calculated using the confusion matrix identity of each expert. In (**c**), the zone of disagreement and alterations between cases (**a**) and (**b**).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Athanasiou, G.; Arcos, J.L.; Cerquides, J.
Enhancing Medical Image Segmentation: Ground Truth Optimization through Evaluating Uncertainty in Expert Annotations. *Mathematics* **2023**, *11*, 3771.
https://doi.org/10.3390/math11173771
