Deep Learning Based Cardiac MRI Segmentation: Do We Need Experts?

Deep learning methods are the de-facto solutions to a multitude of medical image analysis tasks. Cardiac MRI segmentation is one such application which, like many others, requires a large amount of annotated data so that a trained network can generalize well. Unfortunately, having a large number of images manually curated by medical experts is both slow and prohibitively expensive. In this paper, we set out to explore whether expert knowledge is a strict requirement for the creation of annotated datasets on which machine learning can successfully be trained. To do so, we gauged the performance of three segmentation models, namely U-Net, Attention U-Net, and ENet, trained with different loss functions on expert and non-expert groundtruth for cardiac cine-MRI segmentation. Evaluation was done with classic segmentation metrics (Dice index and Hausdorff distance) as well as clinical measurements, such as the ventricular ejection fractions and the myocardial mass. Results reveal that the generalization performance of a segmentation neural network trained on non-expert groundtruth data is, for all practical purposes, as good as on expert groundtruth data, in particular when the non-expert receives a decent level of training, highlighting an opportunity for the efficient and cheap creation of annotations for cardiac datasets.


Introduction
Deep neural networks (and more specifically convolutional neural networks) have deeply percolated through healthcare R&D, addressing various problems such as survival prediction, disease diagnosis, image registration, anomaly detection, and segmentation of images, be they Magnetic Resonance Imaging (MRI), Computed Tomography (CT) or ultrasound (US), to name a few [1]. The roaring success of deep learning methods is rightly attributed to the unprecedented amount of annotated data available across domains. But ironically, while solutions to decades-old medical problems are at hand [2], the use of neural networks in day-to-day practice is still pending. This can be explained in part by the following two observations. First, while accurate on average, neural networks can nonetheless sometimes be wrong [3] as they provide no strict clinical guarantees. In other words, a neural network within the intra-expert variability is excellent on average but not immune to sparse erroneous (yet degenerate) results, which is problematic in clinical practice [2]. Second, machine learning methods are known to suffer from domain adaptation problems, one of the most glaring medical imaging issues of our times [4]. As such, clinically accurate machine learning methods trained on a specific set of data almost always see their performance drop when tested on a dataset acquired following a different protocol. These problems derive in good part from the fact that current datasets are still relatively small. According to Maier-Hein et al. [5], most medical imaging challenges organized so far contain fewer than 100 training and testing cases. This shows that medical applications cannot yet rely on very large medical datasets encompassing tens of thousands of annotated data acquired in various conditions, with machines from various vendors, and showing clinical conditions and anatomical configurations of all kinds.
This is unlike non-medical computer vision problems, which have long had access to large and varied datasets such as ImageNet, Coco, PascalVOC, ADE20k, and Youtube-8M, to name a few [6]. The annotation of these datasets relies on non-experts, often through online services like Mechanical Turk [7]. Unfortunately, obtaining similarly large annotated datasets in medical imaging is difficult. The challenge stems from the nature of the data, which is sensitive and requires navigating a complicated regulatory framework and privacy safeguards. Furthermore, labeling medical datasets is quite resource intensive and prohibitively costly, as it requires domain expertise.
For these reasons, the medical imaging literature has seen an increasing number of publications whose goal is to compensate for the lack of expert annotations [8]. While some methods leverage partly-annotated datasets [9], others use domain adaptation strategies to compensate for small training datasets [10]. Other approaches artificially increase the number of annotated data with Generative Adversarial Networks (GANs) [11,12], and yet others use third-party neural networks to help experts annotate images more rapidly [13].
While these methods have been shown effective for their specific test cases, it is widely accepted that large manually-annotated datasets bring indisputable benefits [14]. In this work, we depart from trying to improve the segmentation methods and focus instead on the datasets, as we challenge the idea that medical data, cardiac cine MRI specifically, needs to be labeled by experts only, and explore the consequences of using non-expert annotations on the generalization capabilities of a neural network. Non-expert here refers to a non-physician who could not be regarded as a reference in the field. Because non-expert annotations are easier and cheaper to obtain, they could be used to build larger datasets faster and at a reduced cost.
This idea was tested on cardiac cine-MRI segmentation. To this end, we had two non-experts label cardiac cine-MRI images and compared the performance of neural networks trained on non-expert and expert data. The evaluation of both approaches was done with geometric metrics (Dice index and Hausdorff distance) as well as clinical parameters, namely the ejection fraction for the left and right ventricles and the myocardial mass.
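These clinical parameters follow directly from the segmentation masks. The sketch below is an illustration only, not the evaluation code used in this study; the voxel volume is an assumed placeholder (it depends on the pixel and slice spacing of each exam) and the tissue density is the commonly used value of 1.05 g/mL:

```python
import numpy as np

VOXEL_VOLUME_ML = 0.1  # assumed voxel volume in mL (depends on pixel/slice spacing)
MYO_DENSITY_G_PER_ML = 1.05  # commonly used myocardial tissue density

def volume_ml(mask, voxel_volume_ml=VOXEL_VOLUME_ML):
    """Volume of a binary segmentation mask in millilitres."""
    return float(mask.sum()) * voxel_volume_ml

def ejection_fraction(ed_mask, es_mask, voxel_volume_ml=VOXEL_VOLUME_ML):
    """EF (%) = (EDV - ESV) / EDV * 100, from end-diastolic and end-systolic cavity masks."""
    edv = volume_ml(ed_mask, voxel_volume_ml)
    esv = volume_ml(es_mask, voxel_volume_ml)
    return 100.0 * (edv - esv) / edv

def myocardial_mass_g(myo_mask, voxel_volume_ml=VOXEL_VOLUME_ML):
    """Myocardial mass (g) from the end-diastolic myocardium mask."""
    return volume_ml(myo_mask, voxel_volume_ml) * MYO_DENSITY_G_PER_ML
```

Because these measurements compound every voxel of the mask, small systematic annotation biases (e.g. consistently over-segmenting the cavity) translate directly into EF and mass errors, which is why they complement the purely geometric metrics.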

Methods and Data
As mentioned before, medical data annotation requires genuine expertise so that the labels can be used with full confidence. Expert annotators are typically medical doctors or medical specialists whose training and experience make them reliable sources of truth for the problem to solve. These experts often have close collaborators working daily with medical data, typically computer scientists, technicians, biophysicists, etc. While their understanding of the data is real, these non-experts are typically not considered a reliable source of truth. Non-experts are thus people who can manually label data but whose annotations are assumed biased and/or noisy and thus unreliable.
In this study, two non-experts were asked to label 1902 cardiac images. We defined a non-expert as someone with no professional expertise in cardiac anatomy or cine-MR images. Non-Expert 1 is a technician in biotechnology who received a 30-minute training session from a medical expert on how to recognize and outline cardiac structures. The training was done on a few examples where the expert showed what the regions of interest in the image look like and where their boundaries lie. The training also came with an introduction to cardiac anatomy and its temporal dynamics. Non-Expert 2 is a computer scientist with 4 years of active research in cardiac cine-MRI and several months of training. In the case of Non-Expert 2, the training spanned several months during which directions about the imaging modality as well as the anatomy and pathologies were thoroughly explained. In addition, fine delineation guidelines were provided to disambiguate good from poor annotations. In this study, we both gauge the effect of training a neural network on non-expert data and verify how the level of training of the non-experts impacts the overall results.
The non-experts were asked to delineate three structures of the heart namely, the left ventricular cavity (endocardial border), the left ventricle myocardium (epicardial border) and the endocardial border of the right ventricle. No further quality control was done to validate the non-expert annotations. Segmentations were used as-is for the subsequent tasks.
We used U-Net [15], the gold standard for medical image segmentation, as the baseline network. In addition, the well-known Attention U-Net [20] and ENet [21] networks were trained in order to ensure that the results are driven by the differences in annotations and not by the network architecture. We first trained the segmentation models (U-Net, Attention U-Net and ENet) on the original ACDC dataset (Automated Cardiac Diagnosis Challenge) [2] with its associated groundtruth using a classical supervised training scheme, with a combined cross-entropy and Dice loss:

\mathcal{L}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{3} y_{ki}\log\hat{y}_{ki} + 1 - \frac{2\sum_{i=1}^{N}\sum_{k=1}^{3} y_{ki}\,\hat{y}_{ki}}{\sum_{i=1}^{N}\sum_{k=1}^{3} y_{ki} + \sum_{i=1}^{N}\sum_{k=1}^{3}\hat{y}_{ki}}

where \hat{y}_{ki} is the probabilistic output of the network for image i ∈ {1, ..., N} (N is the number of images in the batch) and class k ∈ {1, 2, 3} (3 being the number of classes), and y is a one-hot encoding of the groundtruth segmentation map.
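As a minimal sketch of what such a combined cross-entropy and soft-Dice loss can look like in practice (assuming PyTorch; this is an illustration, not the implementation used in the study):

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits, target, eps=1e-6):
    """Combined cross-entropy + soft-Dice loss.

    logits: (N, K, H, W) raw network outputs; target: (N, H, W) integer class labels.
    """
    # Cross-entropy term, averaged over batch and pixels.
    ce = F.cross_entropy(logits, target)
    # Soft-Dice term computed on the softmax probabilities.
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)  # sum over batch and spatial dimensions, per class
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return ce + (1.0 - dice.mean())
```

For a near-perfect prediction both terms approach zero, so the total loss does as well; the Dice term counteracts the class imbalance between the small cardiac structures and the background.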
We then re-trained the neural networks with the non-expert labels using the same training configuration. Furthermore, considering that the non-expert annotations can be seen as noisy versions of the true annotation (i.e. \tilde{y} = y + \epsilon, where \tilde{y} is the non-expert annotation, y is the groundtruth and \epsilon a random variable), we also trained the networks with a mean absolute error loss which, as shown by Ghosh et al. [16], has the property of compensating for labeling inaccuracies.
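Under the same assumptions as above (PyTorch, illustrative sketch only), the noise-robust variant simply swaps the cross-entropy term for an L1 term on the softmax probabilities:

```python
import torch
import torch.nn.functional as F

def mae_dice_loss(logits, target, eps=1e-6):
    """MAE (L1) on softmax probabilities + soft-Dice loss.

    The bounded L1 term is less sensitive to mislabeled pixels than
    cross-entropy, whose log penalty grows unboundedly on confident errors.
    """
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    # Mean absolute error between predicted probabilities and one-hot labels.
    mae = (probs - one_hot).abs().mean()
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return mae + (1.0 - dice.mean())
```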

Experimental Setup
To test whether non-expert annotated datasets hold any value for cardiac MRI segmentation, the following two cardiac cine-MRI datasets were used:
• Automated Cardiac Diagnosis Challenge (ACDC) dataset [2]. This dataset comprises 150 exams acquired at the University Hospital of Dijon (all from different patients). It is divided into 5 evenly distributed subgroups (4 pathological plus 1 healthy subject group) and split into 100 exams for training and a held-out set of 50 for testing. The exams were acquired using two MRI scanners with different magnetic field strengths (1.5T and 3T). The pixel spacing varies from 0.7mm to 1.9mm with a slice spacing varying between 5mm and 10mm. An example of images with the different expert and non-expert annotations is shown in Figure 1.
• M&Ms dataset [17]. This dataset consists of 375 cases from 3 different countries (Spain, Germany and Canada), totaling 6 different centres with 4 different MRI manufacturers (Siemens, General Electric, Philips and Canon). The cohort is composed of patients with hypertrophic and dilated cardiomyopathies as well as healthy subjects. The cine MR images were annotated by experienced clinicians from the respective centres.
We trained the segmentation models on the 100 ACDC training subjects on either the expert or the non-expert groundtruth data. Training was done with a fixed set of hyperparameters, chosen through a cross-validated hyperparameter search to best fit the 3 annotators, without further tuning. The networks were trained three times each in order to reduce the effect of the stochastic nature of the training process on the results.
As mentioned before, we first trained the neural networks on non-expert data with exactly the same setup as for the expert annotations. Then, we retrained the neural networks (U-Net, Attention U-Net and ENet) from scratch with an L1 loss, which was shown to be robust to noisy labels [16].
We then tested in turn on the 50 ACDC test subjects and the 150 M&Ms training cases. The M&Ms dataset constitutes data whose groundtruth is not biased towards any of the annotators of the training set, be it the expert or the non-experts. Moreover, testing on a different dataset provides an inter-expert variability range as well as a domain generalization setup.

Results and Discussion
The first set of results is laid out in Tables 1 to 3. It corresponds to standard geometrical metrics, i.e. the Dice score and the Hausdorff distance (HD), for the left ventricular (LV) cavity (Table 1), the myocardium (MYO) (considering the endocardial and epicardial borders of the left ventricle) (Table 2) and the cavity of the right ventricle (RV) (Table 3). It also contains the end-diastolic volume (EDV) as well as the ejection fraction (EF) for the LV and the RV, and the myocardial mass error. For all three tables, the networks (U-Net, Attention U-Net and ENet) were trained on the ACDC training set and tested on the ACDC testing set and the M&Ms training set.
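For reference, the two geometric metrics reported in these tables can be computed from binary masks with a few lines of NumPy/SciPy. This is an illustrative sketch, not the evaluation code of the study, and it assumes both masks are non-empty for the Hausdorff distance:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice_score(a, b):
    """Dice overlap between two binary masks (defined as 1.0 when both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff_distance(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between two non-empty binary masks.

    Uses Euclidean distance transforms; `spacing` gives the physical
    size of a pixel per axis (e.g. in mm) to handle anisotropic images.
    """
    a, b = a.astype(bool), b.astype(bool)
    dist_to_a = distance_transform_edt(~a, sampling=spacing)  # distance to nearest pixel of a
    dist_to_b = distance_transform_edt(~b, sampling=spacing)  # distance to nearest pixel of b
    return max(dist_to_b[a].max(), dist_to_a[b].max())
```

Dice rewards bulk overlap while the Hausdorff distance penalizes the single worst boundary error, which is why the two can disagree on the same segmentation.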
Results for the ACDC testing set reveal that, for the LV (Table 1), the networks trained on the non-expert annotations (Non-Expert 1 as well as Non-Expert 2) manage to achieve performances that are statistically indistinguishable from those of the expert. This holds regardless of the training loss (CE+Dice vs MAE+Dice) and the metric (Dice, HD, and EF). The only exception is the EF error for Non-Expert 1 with the MAE+Dice loss.
The situation, however, is less clear-cut for the MYO and the RV. In both cases, we can see that results for Non-Expert 1 are almost always worse than those of the expert, especially for the CE+Dice loss. For example, there is a Dice score drop of 12% on the myocardium. Also, the clinical results on the RV (Table 3) show a clear gap between Non-Expert 1 and the other annotators. However, we can appreciate how the MAE+Dice loss improves results for both non-experts. Overall, for the MYO and the RV, results for Non-Expert 2 are very close to (if not better than) those of the expert. This is obvious when considering the average myocardial mass error in Table 2. One recurrent result from our experiments, though, is the hit-and-miss performance of all the evaluated networks on the M&Ms dataset, where in a number of cases the output segmentation was completely degenerate, as shown in Figure 2. Moreover, the difference in segmentation performance between the annotators is similar regardless of the segmentation network (U-Net, Attention U-Net or ENet) used, although Attention U-Net shows the best performance overall, which is to be expected given its larger capacity.
Further analysis of the segmentation performance on the different sections of the heart (Figure 3), namely the base, the middle and the apex, shows that the differences between the non-expert and expert annotations concentrate at the two ends of the heart. The performance gap is more pronounced at the apex for the three anatomical structures. In parallel, when we look at the performance across disease groups (Figure 4), we can distinguish a relative similarity in the Dice score between the different annotators and disease groups. Our experiments also reveal some interesting results on the M&Ms dataset, a dataset with different acquisition settings than ACDC. In that case, we see the gap in performance between the expert and the non-experts decrease substantially. For example, when comparing the Non-Expert 1 results with the MAE+Dice loss to those from the expert annotation, the Dice difference for the RV went from 6% on the ACDC dataset to a mere 4% on the M&Ms dataset. But overall, here again, results for Non-Expert 2 are similar to (and sometimes better than) those of the expert.
Throughout our experiments, the performance of the three neural networks (U-Net, Attention U-Net and ENet) trained on the Non-Expert 2 annotations with the MAE+Dice loss has been roughly on par with, if not better than, that of the networks trained on the expert annotations. This is especially true for the LV. For Non-Expert 1, most likely due to a lack of proper training, results on both test sets and most MYO and RV metrics are worse than those of the expert. In fact, a statistical test reveals that results from Non-Expert 1 are almost always statistically different from those of the expert. We also evaluated the statistical difference between the CE+Dice and the MAE+Dice losses and observed that the MAE+Dice loss provides overall better results for both non-experts.
Overall on M&Ms, while the expert got a better MYO mass error and a better RV EF error, the MYO HD is lower for Non-Expert 2, and the Dice scores and the RV HD of Non-Expert 2 are statistically similar. These results support the idea that well-trained non-expert and expert annotations could be used interchangeably to build reliable annotated cardiac datasets. Admittedly, the small number of non-experts we evaluated might be considered a limitation of our study; however, it still provides encouraging results for settings where experts are not readily available to annotate whole datasets but could train a non-expert to effectively annotate in their stead. We leave the investigation of more datasets to future works that could transpose the setup to more difficult problems and a larger number of non-experts. Our work supplements previous endeavours that rely on non-experts to annotate medical datasets. Heim et al. [18] showcased the ability of crowdsourced expertise to reliably annotate a liver dataset, although their approach proposes initial segmentations to the non-experts, which might bias their decisions. Likewise, Ganz et al. [19] proposed to use non-experts in a crowdsourced error detection framework. In contrast, our approach evaluated the effectiveness of non-expert knowledge without any prior input. This further reinforces the idea that crowdsourced medical annotations are a viable solution to the lack of data.

Conclusion
In this work, we studied the usefulness of training deep learning models with non-expert annotations for the segmentation of cardiac MR images. The need for medical experts was probed in a comparative study with non-physician-sourced labels. By framing the reliance on non-expert annotations as a noisy-label problem, we managed to obtain good performance on two public datasets, one of which was used to emulate an out-of-distribution dataset. We found that training a deep neural network, regardless of its capacity (U-Net, Attention U-Net or ENet), with data labeled by a well-trained non-expert achieves performance comparable to training on expert data. Moreover, the performance gap between the networks trained with non-expert and expert annotations was less pronounced on the out-of-distribution dataset than on the training dataset. Future endeavours could focus on crowdsourcing large-scale medical datasets and tailoring approaches that take their noisiness into account.
Funding: This research received no external funding.