1. Introduction
Prostate cancer (PCa) has shown a substantial decline in the past 5 years, between 5% and 16%, but it continues to be the most common cancer among men [1]. Our study focuses on the analysis of MR images acquired in the context of PCa. PCa remains one of the most commonly diagnosed solid tumour types in men, and MRI is one of the most efficient imaging modalities for detecting it early in its course [2]. Collaborative work is a growing field, and understanding how groups learn effectively is critical [3]. The role of radiology in the diagnostic process has been explored in previous works, with a focus on key concepts of information and communication as well as on the key interpersonal interactions of teamwork, collaboration, and collegiality, all based on trust [4].
The annotation of medical images is subject to an inherent variability between experts (inter-variability), and in some cases, there are also significant differences between annotations made by the same expert (intra-variability) [5]. The difficulty in annotating medical findings has several causes, including the quality of the images themselves (difficult to understand, low resolution, and/or subtle changes, etc.), the expert performing the annotations (experience, tiredness, etc.), and the working conditions (monitor, annotating device, illuminance, etc.). It is commonly accepted that one way to reduce this variability is to overlap the annotations made by different experts, each working blindly with respect to the others. In this paper, we show that, by using a collaborative approach, the variability between experts can be reduced considerably.
Among the techniques used to detect PCa, MRI allows the non-invasive analysis of the anatomy and the metabolism of the entire prostate gland. MRI has been established as the best imaging modality for the detection, localization, and staging of PCa on account of its high resolution, excellent spontaneous contrast of soft tissues, and the possibilities of multi-planar and multi-parameter scanning [6]. Previous work on manual annotation and its evaluation was presented by Meyer et al. [7]. In recent literature, a large-scale annotation of biomedical data and expert label synthesis were presented by Chen et al. [8]. That work discusses the state of the art in imaging, treatment, and computer-assisted intervention in the field of endovascular intervention. More specific works focus on the volumetric measurement of hepatic tumours by studying the accuracy of manual contouring using computed tomography (CT) [9]. From the same perspective, Bø et al. [10] investigated the intra-observer variability in low-grade glioma (LGG) segmentation for a radiologist without prior segmentation experience. Overall, collaborative work between radiologists and medical experts is gaining importance.
The principal problem encountered in the diagnosis of prostate cancer is the localization of a region of interest (ROI) containing tumour tissue. Normally, experts establish their diagnoses using several different software tools and record many annotations in different files [11]. This is not a practical way to manage abundant medical data. A dedicated tool allows experts to analyze the prostate gland on T2-weighted imaging (T2WI), diffusion weighted imaging (DWI), perfusion based on dynamic contrast enhancement (DCE), and magnetic resonance spectroscopy (MRS) panels within the same application [12]. One of the most evident advantages of this kind of tool is that it allows simultaneous analysis of the prostate using different image modalities and, if available, MRS. More recent works confirm that this interaction between MRI techniques facilitates, for radiologists and medical experts, the evaluation of the prostate using the PI-RADS v2 classification [13].
In this paper, we compare the delimitation of different ROIs on images of the prostate gland between an independent evaluation and collaborative work by different experts. The idea behind this study is to show that collaborative work allows a real consensus between experts and potentially decreases the variability of their evaluations. To this end, the evaluation procedure, the evaluation parameters, and a discussion of the data analysis and the obtained results are presented in this work.
3. Results
Two examples of PCa analysis are presented in Figure 4 and Figure 5. The left image in Figure 4 corresponds to the drawing by the first expert and the right image to the drawing by the second expert. Three ROIs were drawn in the images, corresponding to CZ (white area), PZ (blue area), and tumour (red area). When visually comparing the two drawings, a very good concordance between the CZ and PZ areas can be observed. For the tumour area, a small deviation is seen, but the contours can be considered relatively close between the two experiments.
However, not all the prostate studies were evaluated with such good concordance between experiments. An example of discordance is seen in
Figure 5. CZ and PZ have a good approximation between
and
but an important discordance is seen for the tumour area. A new evaluation was carried out for
in
Figure 5c. In this example, we can see the real advantage of collaborative work. After collaboration, the tumour areas are approximately the same.
3.1. Anatomic Parameters
From Table 1, it is clear that the number of cases where either expert does not include a particular zone decreases significantly after collaboration. Indeed, for CZ, this percentage is equal to 12% between
vs.
and 3% between
vs.
. For PZ, this percentage is equal to 9% between
vs.
and 3% between
vs.
. Finally, for the tumour, it is 13% between
vs.
and 0% between
vs.
.
The correlation coefficient
, regression line, Bland–Altman, and two-sample
t-test calculated for the area of the three prostate gland zones are depicted in
Table 2 and
Table 3. In general, the results are improved between
vs.
compared with
vs.
. More precisely, the correlation coefficient of the area is improved for
vs.
whatever the considered area.
The Bland–Altman test shows a better agreement between vs. than vs. . Incidentally, a Bland–Altman test has also been calculated for the volume evaluation. According to the two-sample t-test, there is no significant difference between vs. whatever the considered area, while there are always significant differences in the results between vs. . For CZ, it is between vs. and between vs. . For PZ, it is between vs. and between vs. . Finally for tumour, it is between vs. and between vs .
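As an illustration of the two-sample t-test mentioned above, the following minimal sketch computes Student's t statistic with pooled variance for two lists of area measurements. It assumes the pooled-variance form of the test, which the text does not specify, and the function name `pooled_t_statistic` and the numeric values are hypothetical, not data from this study. The p-value would then be read from a t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for Student's two-sample t-test.

    Assumes equal variances in both groups (pooled variance); this is an
    assumption of the sketch, not a statement about the study's method.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, df


# Hypothetical area measurements from two annotation sessions.
areas_session_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
areas_session_2 = [2.0, 3.0, 4.0, 5.0, 6.0]
t_value, dof = pooled_t_statistic(areas_session_1, areas_session_2)
print(f"t = {t_value:.3f}, df = {dof}")
```

A large absolute value of t relative to the critical value for the given degrees of freedom indicates a significant difference between the two sets of measurements.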
Figure 6a,b detail the linear regression analysis for the evaluation of the tumour area. The tumour area was chosen because of its importance and because it is more difficult to analyze and shows more variation among experts. When comparing the two regression lines, an improvement is noted in
Figure 6b with a slope of
compared with
Figure 6a with a slope of
.
Figure 6c,d detail the corresponding Bland–Altman plots. In
Figure 6d, it can be seen that the mean of the difference between
vs.
is close to zero, meaning that there is little bias between the two measurements.
3.2. Contour Evaluation
The Hausdorff distance and the Dice index between the different annotations are presented in
Table 4. Again, between
vs.
, an improvement is observed with respect to the results obtained between
vs.
. The mean Hausdorff distance is reduced in all the cases. In the same way, the analysis of the Dice index is around
between
vs.
, whatever the area, whereas it is no higher than 0.7 for
vs.
. The differences between
vs.
and
vs.
are always significant, whatever the considered parameter.
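For readers unfamiliar with the two contour metrics, the following minimal sketch shows how a Dice index and a Hausdorff distance can be computed for two ROIs. It assumes each ROI is represented as a non-empty set of (row, column) pixel coordinates and computes the Hausdorff distance over all region pixels, in pixel units, rather than over contour points only. The helper names and the two square ROIs are hypothetical and are not taken from the study data.

```python
import math


def dice_index(mask_a, mask_b):
    """Dice index 2|A n B| / (|A| + |B|) for two sets of pixel coordinates."""
    if not mask_a and not mask_b:
        return 1.0  # Two empty masks are treated as identical.
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


def square(top, left, size):
    """Hypothetical ROI: a filled square of pixel coordinates."""
    return {(r, c) for r in range(top, top + size)
            for c in range(left, left + size)}


roi_a = square(0, 0, 4)  # 4 x 4 region
roi_b = square(0, 2, 4)  # same region shifted two pixels to the right
dice = dice_index(roi_a, roi_b)
hd = hausdorff_distance(roi_a, roi_b)
print(f"Dice = {dice:.2f}, Hausdorff = {hd:.2f} px")
```

The brute-force Hausdorff computation is quadratic in the number of points, which is acceptable for small ROIs; larger masks would call for a spatial index or a distance transform.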
4. Discussion
Currently, the ground-truth is often obtained from evaluations by different experts who perform their tasks independently; only afterwards are their results objectively (or subjectively) merged. Ghose et al. proposed a set of open problems mainly related to the evaluation procedure [24]. These can be summarised as (1) variabilities in the ground-truth, (2) unavailability of public prostate datasets, and (3) lack of standardised metrics for evaluation.
Although interpersonal interactions are difficult to specify and quantify, they are critical to the effective flow of information, which itself is critical to the diagnostic process as it is increasingly performed within and among professional teams [4]. It is well known that medicine is becoming increasingly specialised, that care teams have less time to care for patients, and that interprofessional collaboration in healthcare is more important than ever. Because collaborative work is difficult and time-consuming, medical tools are needed to facilitate the decision-making process. However, there is no universal tool that can solve all the shortcomings in healthcare decision-making tasks [25,26]. We believe the work presented in this paper lays the groundwork for designing an adequate process for prostate evaluation. Moreover, the collaborative work presented in this study is a first step towards obtaining a reliable ground-truth, without expert variabilities, against which automatic algorithms could be compared.
In our paper, we studied the usefulness of collaborative work, where the ground-truth is obtained by two experts, but with the second expert having prior knowledge of the other expert’s work. Exhaustive evaluations of the medical findings in different regions of the prostate gland from T2WI were performed. We asked two experts to make these drawings independently on several MR examinations, and as a second step, one expert repeated the drawings with the knowledge of the evaluation of the other expert.
The novelty of our study is the evaluation of the variability between experts concerning medical findings in prostate gland regions. The localization of a lesion in a specific area is crucial, and the segmentation process remains a challenge. We studied the differences in the delimitation of the different ROIs between independent evaluation and collaborative work by different users. These segmentations remain difficult, and their delineations are fundamental for assigning a PI-RADS score. Differences in the obtained results (e.g., in the volume calculations) were compared in order to verify whether a significant improvement in consensus had been obtained through collaborative work. The idea behind this study is to show that collaborative work allows a real consensus between experts and potentially decreases the variability of their evaluations.
A main limitation of this work is the small sample size. However, it was a select sample dataset that was within reach of only a few clinical cases. The included cases fulfilled very specific criteria and may be considered as main impacts in terms of incidence according to the ground truth. Nevertheless, this work is a proof of concept, and we think that enlarging the data set would not considerably alter the results or the conclusion, as there is already a significant difference between the approaches even with this small data set. A potential bias in our study could be the fact that the second expert participated in the consensus (Experiment ). However, the time interval of greater than one month between and will considerably limit such a bias. Moreover, the lack of histopathology data is also a limitation in studies concerning prostate cancer. However, our patient data set was extracted from a pool of patients destined to receive radiotherapy treatment as opposed to radical prostatectomy. Therefore, in this context, surgery was out of the question. In addition, it is difficult to obtain accurate knowledge of the spatial distribution of the tumour within the gland from the biopsy because it is a relatively random procedure.
Another novelty of our study was to show that the evaluation of medical examinations with collaborative work drastically reduces the differences between processing, even though this result was expected. In particular, significant differences between the two experts virtually disappeared when there was collaborative work. We probably cannot conclude that the diagnosis was improved, but we do observe vastly improved consensus between experts. There may be human errors in the evaluation that could lower the correctness of the results, and increasing the number of experts could diminish this bias. This could be done with a tool that allows experts to work online [11]. However, in general, a consensus of two experts still involves an increase in the quality of the diagnosis.
An alternative point of view would be to affirm that the experiment
is biased. While it is true that inter-rater agreement for prostate MRI is not outstanding and that we have not presented such a study in our work, it must be emphasised that our paper focuses on how to counter this heterogeneity in expert responses. Indeed, our study shows that, starting from a relatively dispersed set of initial data, our approach is capable of reducing such inherent differences. An additional experiment could be performed: the second expert could repeat the process knowing what had been done by the first expert. However, in our opinion, this is not necessary because the main objective of this study was to objectively show that collaborative work in current clinical practice can provide a real consensus between experts, even if there is potentially a bias in the evaluation process. Two experts could also work at the same time; the scenario would then be slightly different because there would be no specific order between the experts. Furthermore, our evaluation protocol is perfectly in line with the recommendations of PI-RADS v2 in that 3D T2-weighted sequences are expected in place of the 2D version. Moreover, recent articles at 3T have shown that 3D T2 images are equivalent to the 2D T2 image [27].
Although artificial intelligence (AI) shows promise across many aspects of radiology, the use of AI to create differential diagnoses for rare and common diseases has not been demonstrated [28]. Recent advances in deep learning have greatly extended the scope of automatic detection and recognition, with very high accuracy, in prostate cancer. Automated deep learning systems have delivered promising results, from the analysis of histopathological images to the accurate grading of PCa. Many studies have shown that deep learning strategies can achieve better outcomes than simpler systems that make use of pathology samples [29]. There are other examples of algorithms based on artificial intelligence and machine learning in PCa that could be an excellent addition to our work [30,31,32]. Finally, given the difficulty of segmenting the prostate gland regions, a solution based on AI was proposed by Bardis et al. [33]. The purpose of their study was to build upon these prior efforts by using a larger data set and two parallel neural networks that were specialised in localization and classification for both TZ and PZ segmentation. More generally, collaborative work in healthcare has benefits such as improving patient care and outcomes, reducing medical errors, and even improving staff relationships and job satisfaction.
5. Conclusions
In this paper, the value of collaborative work in the evaluation of prostate cancer from MRI is presented. Even if improved results had been expected, this study shows that evaluating medical examinations with knowledge of the work of another expert drastically reduces the differences between processing. In particular, significant differences between experts become non-significant when there is collaborative work. We cannot conclude that the diagnosis was improved, but only that there is improved consensus between the experts (although, in general, this did involve an increase in the quality of the diagnosis). An alternative point of view is to affirm that the results from collaborative work are biased. This question is outside the scope of our study because the main objective was to objectively show that collaborative work in current clinical practice can provide a consensus between experts even if there is potentially a bias in the evaluation process.
In conclusion, although collaborative work requires more time, it can improve the management of patients with prostate cancer by providing a consensual diagnosis, in particular in complex cases.