A Reproducible Deep-Learning-Based Computer-Aided Diagnosis Tool for Frontotemporal Dementia Using MONAI and Clinica Frameworks

Despite Artificial Intelligence (AI) being a leading technology in biomedical research, real-life implementation of AI-based Computer-Aided Diagnosis (CAD) tools into the clinical setting is still remote due to unstandardized practices during development. However, few or no attempts have been made to propose a reproducible CAD development workflow for 3D MRI data. In this paper, we present the development of an easily reproducible and reliable CAD tool using the Clinica and MONAI frameworks that were developed to introduce standardized practices in medical imaging. A Deep Learning (DL) algorithm was trained to detect frontotemporal dementia (FTD) on data from the NIFD database to ensure reproducibility. The DL model yielded 0.80 accuracy (95% confidence intervals: 0.64, 0.91), 1 sensitivity, 0.6 specificity, 0.83 F1-score, and 0.86 AUC, achieving a comparable performance with other FTD classification approaches. Explainable AI methods were applied to understand AI behavior and to identify regions of the images where the DL model misbehaves. Attention maps highlighted that its decision was driven by hallmarking brain areas for FTD and helped us to understand how to improve FTD detection. The proposed standardized methodology could be useful for benchmark comparison in FTD classification. AI-based CAD tools should be developed with the goal of standardizing pipelines, as varying pre-processing and training methods, along with the absence of model behavior explanations, negatively impact regulators’ attitudes towards CAD. The adoption of common best practices for neuroimaging data analysis is a step toward fast evaluation of efficacy and safety of CAD and may accelerate the adoption of AI products in the healthcare system.


Introduction
Computer-Aided Diagnosis (CAD) tools aim to help improve physicians' performance in disease detection, with the main objective of detecting early pathological signs that humans may fail to find [1]. CAD applications have been developed for numerous medical imaging modalities and diseases [1,2] and they provide diagnosis probabilities by analyzing the data. As such, CAD tools would help the radiologists to draw their conclusions, supporting their interpretation and decision-making processes [3]. Interestingly, it has been shown that the performance of the radiologist can be equalized by CADs and even improved by joining human collective reasoning with machine predictions [2,4]. ments of ventricular volumetric changes in FTD patients showed that their size increases with time, and this correlates with cognitive impairment severity [34]. In particular, ventricular expansion has been found to represent an informative marker to discriminate the behavioral variant of FTD from other FTD variants and from other forms of dementia [35].
The development of a reliable CAD tool needs the data to be carefully annotated, organized, and managed, especially when it is based on Magnetic Resonance Imaging (MRI). Classification systems exploiting brain imaging have been successfully used to capture structural changes in the human brain [36,37] and detect dementia up to flawless performance [38]. Extensive datasets improve the solidity of AI algorithms' training and many data-sharing initiatives have grown in the field of neurodegenerative disease research in the last 20 years [39]. Such initiatives foster neurodegenerative disease research, by putting aside the need for years-long data collection and providing reliable data. This speeds up both hypothesis testing and data-driven research that exploits AI for data analysis. A key element that enables the use of such techniques is the adoption of academy and industry-wide data standards. In order to make the shared data productive, the FAIR (Findable, Accessible, Interoperable, Reusable) principles have been proposed to promote Open Science practices for data sharing initiatives [40,41]. FAIR data can be highly precious, especially when hundreds of subjects are available and the methods for data acquisition are standardized and reliable. The spread of research-useful data is aided by database management initiatives such as the Image and Data Archive (IDA), hosted by the Laboratory of NeuroImaging (LONI), that contains data from more than 160 studies. IDA shares the Neuroimaging in Frontotemporal Dementia (NIFD) database, which is one the biggest data sharing initiatives on FTD to date.
To date, few attempts have been made to discriminate FTD from Normal Controls (NC) using AI methods [36,37,[42][43][44][45][46][47]. Most of the studies used MRI-derived numerical features, such as gray/white matter volume or cortical thickness quantification, with traditional ML algorithms such as Support Vector Machine (SVM) and logistic regression. To the best of our knowledge, there is only one published study that used 3D MRI data for FTD classification with a pretrained Convolutional Neural Network (CNN) achieving high accuracy [42]. Given that only a few of the previous classification attempts used NIFD data, sample sizes usually ranged from 12 to~450 subjects, and preprocessing and data augmentation methods, as well as AI algorithms choice and cross-validation strategies, were highly heterogeneous. This heterogeneity in methodology resulted in unstable and potentially unreliable CAD performances, achieving accuracies ranging from 0.66 to 1.
To deepen our understanding of neurodegenerative diseases in general, and FTD in particular, standardized practices are needed when preprocessing and analyzing MRI data with DL, in order to make studies easier and more reproducible. Several initiatives have approached this problem. The Nipype (Neuroimaging in Python-Pipelines and Interfaces) package, an open-source Python project, provides functions where the interaction with neuroimaging software tools or algorithms can be performed in a single workflow, partly handling the heterogeneity of image processing [48]. NiftyTorch is a DL framework for neuroimaging in Python, giving users an interface for PyTorch modeling with the aim of providing a package to easily perform AI-based operations on neuroimaging data [49]. In the same fashion, MedicalTorch is an open-source framework for PyTorch, implementing an extensive set of loaders, pre-processors and datasets for medical imaging [50]. In 2016, efforts to increase data standardization and experiment reproducibility led to the proposal to the neuroimaging community of the Brain Imaging Data Structure (BIDS) as an organization format for clinical and imaging data [51]. Despite the scientific community's acceptance of the BIDS standard, not all public neuroimaging datasets provide BIDS versions of their data.
Clinica, a set of automatic pipelines for the management of multimodal neuroimaging data, pursues the community effort of reproducibility and aims to make clinical research studies easier [52]. This project is developed and maintained from the Aramis Lab and consists of a set of automatic pipelines for neuroimaging data processing and analysis. The aim of Clinica is to make clinical neuroimaging studies easier and reproducible by providing standardized methods for data processing. Clinica includes converters of public neuroimaging datasets to BIDS, along with processing pipelines, and organization for processed files, statistical analyses, and Machine Learning algorithms.
One of the latest and most extensive initiatives for the standardization of AI applications to medical imaging is "project MONAI", that originally started in 2020 by NVIDIA and King's College London and brought to the MONAI framework [53]. It is an opensource PyTorch-based framework for DL in healthcare imaging. The aim of the project is developing and sharing best practices for AI in healthcare imaging, creating state-ofthe-art, end-to-end training workflows for healthcare imaging and providing researchers with the optimized and standardized way to create and evaluate Deep Learning models. The MONAI framework provides workflows for using domain optimized networks, loss functions, metrics, and optimizers.
Considering the high heterogeneity in CAD tools development methodology, there is a need to foster standardized practices to spread the benefits of AI strategies adoption in the diagnosis pipeline. Here, we present a proof-of-concept of Clinica and MONAI application on the NIFD database to train and test a CAD tool on FTD data with the application of shared practices. Our aim was testing the efficacy of only using standardized frameworks for neuroimaging data preprocessing and DL modeling on medical imaging as main steps in CAD development. The achieved performance is comparable to that of other FTD classification systems, showing the appropriateness of this methodology. Explainable AI methods reveal that the model mimics human behavior when making its decision, mainly relying on morphological changes in hallmarking brain areas for FTD. The adoption of academy and industry-wide data standards coupled with standardized practices for neuroimaging data management will provide more reliable results upon the application of less biased procedures.

Neuroimaging in Frontotemporal Dementia Database
This study was performed on data from the NIFD database, hosted by the Laboratory of NeuroImaging from the University of Southern California. NIFD is the nickname for the frontotemporal lobar degeneration neuroimaging initiative (FTLDNI). FTLDNI was funded through the National Institute of Aging and started in 2010. The primary goals of FTLDNI were to identify neuroimaging modalities and methods of analysis for tracking frontotemporal lobar degeneration (FTLD) and to assess the value of imaging versus other biomarkers in diagnostic roles. The Principal Investigator of NIFD was Dr. Howard Rosen, MD, at the University of California, San Francisco. The data are the result of collaborative efforts at three sites in North America. For up-to-date information on participation and protocol, please visit http://memory.ucsf.edu/research/studies/ nifd (accessed on 25 May 2022). NIFD includes data from 346 subjects followed over time including FTD and NC all with a careful assessment through interviews, physical examinations, cognitive testing and blood and/or CSF testing, along with brain MRI and PET acquisition. FTD patients included in this database are diagnosed with one of the following disease variants: behavioral variant, semantic variant, progressive non-fluent aphasia, progressive supranuclear palsy, or cortico-basal syndrome.

Clinica: An Open-Source Software Platform for Reproducible Clinical Neuroscience Studies
We used Clinica [52] functions to manage NIFD data. In particular, we applied the nifdto-bids function to convert NIFD data to BIDS format, in order to be ready for processing. Next, we used the t1-linear pipeline to affinely align T1-weighted MR images to the MNI space. With this standardized preprocessing applied, we were ready for model training.

MONAI: Medical Open Network for Artificial Intelligence
We followed the MONAI workflow for 3D classification based on DenseNet. MONAI proposes to use the DenseNet121 architecture stored in PyTorch, a model from [54]. MONAI made importing and transforming the images easy, and provided useful functions to train and test the DL model. Such standardized and community-based practice makes data and model management more solid, increasing the robustness and reproducibility of CAD building practice.

Workflow Overview
The next sections describe the methodology for the proposed workflow, which is shown in Figure 1. NIFD data was downloaded from the IDA and filtered as specified in Section 2.5. Data preprocessing was performed with the standardized pipeline from Clinica, as specified in Section 2.6. Data augmentation was applied only on the train set and preceded model training and test phases, which were all performed in the MONAI framework. These two are described in Sections 2.7 and 2.8, respectively. Finally, the behavior of resulting CAD is evaluated by extracting attention maps as explained in Section 2.9.

Clinica: An Open-Source Software Platform for Reproducible Clinical Neuroscience Studies
We used Clinica [52] functions to manage NIFD data. In particular, we applied the nifd-to-bids function to convert NIFD data to BIDS format, in order to be ready for processing. Next, we used the t1-linear pipeline to affinely align T1-weighted MR images to the MNI space. With this standardized preprocessing applied, we were ready for model training.

MONAI: Medical Open Network for Artificial Intelligence
We followed the MONAI workflow for 3D classification based on DenseNet. MONAI proposes to use the DenseNet121 architecture stored in PyTorch, a model from [54]. MONAI made importing and transforming the images easy, and provided useful functions to train and test the DL model. Such standardized and community-based practice makes data and model management more solid, increasing the robustness and reproducibility of CAD building practice.

Workflow Overview
The next sections describe the methodology for the proposed workflow, which is shown in Figure 1. NIFD data was downloaded from the IDA and filtered as specified in Section 2.5. Data preprocessing was performed with the standardized pipeline from Clinica, as specified in Section 2.6. Data augmentation was applied only on the train set and preceded model training and test phases, which were all performed in the MONAI framework. These two are described in Sections 2.7 and 2.8, respectively. Finally, the behavior of resulting CAD is evaluated by extracting attention maps as explained in Section 2.9.

Data Collection
To perform the present study, we filtered the data to include only 3D T1-weighted Magnetization-Prepared Rapid Acquisition with Gradient Echo (MPRAGE) MRI scans at

Data Collection
To perform the present study, we filtered the data to include only 3D T1-weighted Magnetization-Prepared Rapid Acquisition with Gradient Echo (MPRAGE) MRI scans at the first visit for cases and controls, as this was the most frequent acquisition modality among many others available in NIFD. Subjects were labeled as cases or controls following the diagnosis reported in the clinical data table. When the diagnosis was missing in the clinical data table, the "patient/control" label available in the MRI metadata was used. This resulted in a final dataset of 182 FTD and 130 NC.

Preprocessing Pipeline
Subjects were randomly assigned to train, validation, and test sets with a 70/30 proportion calculated on the group with the highest n (Table 1). This splitting proportion is most indicated for small sample sizes and it required~40 subjects for validation and test [55]. Random sampling without replacement resulted in each participant being uniquely assigned to one of the three sets, avoiding data leakage. The 3D T1 MPRAGE MRI scans underwent BIDS formatting with the application of the nifd-to-bids converter from Clinica [52]. Data preprocessing for the affine registration of T1w images to the MNI standard space was performed using the t1-linear pipeline of Clinica [52,56]. More precisely, bias field correction was applied using the N4ITK method [57]. Next, an affine registration was performed using the SyN algorithm [58] from ANTs [59] to align each image to the MNI space with the ICBM 2009c nonlinear symmetric template [60,61]. The registered images were further cropped to remove the background resulting in images of size 169 × 208 × 179, with 1 mm isotropic voxels.

Data Augmentation Pipeline
MRI scans from subjects in the train set underwent augmentation procedures to generate new observations and enlarge the training set. Data augmentation was performed using MONAI randomized data augmentation transforms. The specs of the applied transformations are described in Table 2. Each image was augmented 5 times.  This process enlarged the train set to 1170 images (715 FTD, 455 NC). In order to build the final train set, all the original images were kept for both groups and the augmented images were sampled to set the n of each group to 400. A summary for the composition of the final train set is shown in Table 3.

Deep Learning Pipeline
A train set of 440 images per group was used to train a classifier following the indications reported in MONAI [53]. A DenseNet121 was trained with default parameters for 3D images for 10 epochs and using a batch size of 2 ( Figure 2). The Cross-Entropy loss was used for training with Adam optimizer, setting the learning rate to 1 × 10 −5 . DenseNet121 is a Dense Convolutional Network, where each layer is connected to every other layer and uses all preceding layers' feature-maps as inputs [54]. When the images were imported in the PyTorch environment they underwent intensity scaling between 0 and 1, a channel was added to make the image in the channel-first format, then it was resized with scaling to 150 the longest dimension, keeping the aspect ratio of the initial image. The model with the best performance on the validation set was saved. Predictions on the test set were performed following MONAI indications and results were evaluated by computing the following evaluation metrics: accuracy, sensitivity, specificity, F1-score, and Area Under the Curve (AUC).

Deep Learning Pipeline
A train set of 440 images per group was used to train a classifier following the indications reported in MONAI [53]. A DenseNet121 was trained with default parameters for 3D images for 10 epochs and using a batch size of 2 ( Figure 2). The Cross-Entropy loss was used for training with Adam optimizer, setting the learning rate to 1e-5. DenseNet121 is a Dense Convolutional Network, where each layer is connected to every other layer and uses all preceding layers' feature-maps as inputs [54]. When the images were imported in the PyTorch environment they underwent intensity scaling between 0 and 1, a channel was added to make the image in the channel-first format, then it was resized with scaling to 150 the longest dimension, keeping the aspect ratio of the initial image. The model with the best performance on the validation set was saved. Predictions on the test set were performed following MONAI indications and results were evaluated by computing the following evaluation metrics: accuracy, sensitivity, specificity, F1-score, and Area Under the Curve (AUC).

Explainable AI Using the Attention Map Method
In order to investigate where the DL model focused to make its prediction, we applied the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) algorithm using the M3d-CAM tool [62]. M3d-CAM simplifies the interpretability of PyTorch-based models by providing an easy-to-use library for generating attention maps. The application of the Guided Grad-CAM algorithm generated attention maps as new images to visualize important voxels, finally providing insights into model's behavior [63]. In particular, the neuron importance weights ( ) are computed as the global average pooling of the gradients via backpropagation, before the softmax layer of the DL network. The gradient of the score for the output is computed with respect to feature map activations of a convolutional layer, as shown in Equation (1).

1
(1) where y is the model output before sigmoid function application, c is the class of interest, A is the feature map activation, i, j, and k are the width, height, and depth dimensions, respectively. To obtain a tensor heatmap of feature importance ( ) representing the activation map of the network Guided Grad-CAM executes a weighted combination of the obtained with using the Rectified Linear Unit (ReLU) function, represented in Equation (2)   Schematic representation of DenseNet121. Its structure is similar to a classical convolutional neural network, yet DenseNet121 features dense blocks that concatenate outputs of multiple connected layers. In fact, within a dense block each layer is directly connected to every other layer in a feed-forward fashion.

Explainable AI Using the Attention Map Method
In order to investigate where the DL model focused to make its prediction, we applied the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) algorithm using the M3d-CAM tool [62]. M3d-CAM simplifies the interpretability of PyTorch-based models by providing an easy-to-use library for generating attention maps. The application of the Guided Grad-CAM algorithm generated attention maps as new images to visualize important voxels, finally providing insights into model's behavior [63]. In particular, the neuron importance weights (α c n ) are computed as the global average pooling of the gradients via backpropagation, before the softmax layer of the DL network. The gradient of the score for the output is computed with respect to feature map activations of a convolutional layer, as shown in Equation (1).
where y is the model output before sigmoid function application, c is the class of interest, A is the feature map activation, i, j, and k are the width, height, and depth dimensions, respectively. To obtain a tensor heatmap of feature importance (L Grad−CAM ) representing the activation map of the network Guided Grad-CAM executes a weighted combination of the obtained α c n with A n using the Rectified Linear Unit (ReLU) function, represented in Equation (2) [62]. The obtained tensor heatmap was then rendered as a pseudocolor image with grayscale colormapping. The areas contributing the most to the model's output emerged with the greatest variation in grayscale.

CAD Train and Test
A DenseNet121 architecture in PyTorch was trained to discriminate between FTD and NC T1 3D MRI following the MONAI framework guidelines. To ensure the model was not performing well only on the data it was trained on, we measured its performance on a separated validation set. In fact, the model was tested on the validation set after each epoch and the model with the best accuracy (0.92) was saved. After training, this saved model was tested on an independent test set to assess prediction performance. The model achieved 0.80 accuracy (95% confidence intervals: 0.64, 0.91), 1 sensitivity, 0.6 specificity, 0.83 F1score, and an AUC of 0.86. Testing if the accuracy was higher than the no information rate achieved a p-value < 0.0001. The confusion matrix is reported in Figure 3, showing that the model only misclassified NC samples, achieving max sensitivity.
The obtained tensor heatmap was then rendered as a pseudocolor image with grayscale colormapping. The areas contributing the most to the model's output emerged with the greatest variation in grayscale.

CAD Train and Test
A DenseNet121 architecture in PyTorch was trained to discriminate between FTD and NC T1 3D MRI following the MONAI framework guidelines. To ensure the model was not performing well only on the data it was trained on, we measured its performance on a separated validation set. In fact, the model was tested on the validation set after each epoch and the model with the best accuracy (0.92) was saved. After training, this saved model was tested on an independent test set to assess prediction performance. The model achieved 0.80 accuracy (95% confidence intervals: 0.64, 0.91), 1 sensitivity, 0.6 specificity, 0.83 F1-score, and an AUC of 0.86. Testing if the accuracy was higher than the no information rate achieved a p-value < 0.0001. The confusion matrix is reported in Figure 3, showing that the model only misclassified NC samples, achieving max sensitivity.

Comparison with Previous FTD Classification Approaches
We collected the performance metrics of the studies attempting to discriminate FTD from NC or AD to demonstrate that a reproducible DL powered CAD tool performs similarly (Table 4). Our application achieved 0.80 accuracy, which is in line with the results obtained by other research groups in the FTD vs. NC classification (Accuracymean = 0.84,

Comparison with Previous FTD Classification Approaches
We collected the performance metrics of the studies attempting to discriminate FTD from NC or AD to demonstrate that a reproducible DL powered CAD tool performs similarly (Table 4). Our application achieved 0.80 accuracy, which is in line with the results obtained by other research groups in the FTD vs. NC classification (Accuracy mean = 0.84, sd = 0.08) while showing high reproducibility. Notably, only a few of the previous classification attempts used NIFD data, and their sample size, AI algorithms, and cross-validation strategies were highly heterogeneous. Most of the published DL applications to FTD were Life 2022, 12, 947 9 of 16 designed to discriminate FTD from AD, making it difficult to compare their performance with ours.

Attention Maps
Attention maps help visually explain DL models for image processing, showing which parts of the image contributed to the classification. We generated attention maps using the Guided Grad-CAM algorithm that provided an image with the same size as the test image, where relevant points have a highly perturbed value, thus emerging from the background and showing where the model focused to perform its prediction. Attention maps for the FTD and the NC subjects with the most accurate prediction are shown in Figure 3. Emerging areas concentrate around the ventricles, where the difference is clearly noticeable, as the FTD subject has expanded ventricles. It also seems that the model partly focuses on the skull. Figure 4 shows a single slice where the difference is evident, yet an animation of the full brain scans and attention maps is available in the Supplementary Materials.

Discussion
In this work, we showed the use of a standardized workflow to build a CAD tool for FTD. Data preprocessing was made easily reproducible by using the Clinica library [40], while an optimized state-of-the-art DL model was trained within the standardized MONAI framework [53]. The proposed workflow resulted in a CAD with max sensitivity, correctly identifying all FTD samples. Some NC samples were missed, leading to 0.8 accuracy, yet this result is in line with the previous FTD classification approaches based on MRI, whose accuracy in discriminating FTD from controls is around 0.8 [47,68]. Notwithstanding, most of the available papers on FTD classification with AI-based methods used Machine Learning on quantitative variables that, although from MRI brain scans (e.g., cortical thickness), have to be managed differently from 3D MRI data (Table 4). A few works were published where DL is used to detect FTD from 3D imaging, but their focus was on discriminating it from AD, making their results incomparable to ours (Table 4) [42,69]. There is only one published work really comparable to ours [42], but their trained DL model takes raw images as input and uses a non-standard architecture. On such a basis, we found this method built on standardized frameworks to be new in FTD research. Moreover, our work is one of few based on the NIFD database. We believe that our standardized methodology makes our work useful for benchmark comparison in FTD classification. Experiments based on neurodegenerative disease classification need large sample sizes and to the best of our knowledge NIFD offers the largest FTD cohort to date with

Discussion
In this work, we showed the use of a standardized workflow to build a CAD tool for FTD. Data preprocessing was made easily reproducible by using the Clinica library [40], while an optimized state-of-the-art DL model was trained within the standardized MONAI framework [53]. The proposed workflow resulted in a CAD with max sensitivity, correctly identifying all FTD samples. Some NC samples were missed, leading to 0.8 accuracy, yet this result is in line with the previous FTD classification approaches based on MRI, whose accuracy in discriminating FTD from controls is around 0.8 [47,68]. Notwithstanding, most of the available papers on FTD classification with AI-based methods used Machine Learning on quantitative variables that, although from MRI brain scans (e.g., cortical thickness), have to be managed differently from 3D MRI data (Table 4). A few works were published where DL is used to detect FTD from 3D imaging, but their focus was on discriminating it from AD, making their results incomparable to ours (Table 4) [42,69]. There is only one published work really comparable to ours [42], but their trained DL model takes raw images as input and uses a non-standard architecture. On such a basis, we found this method built on standardized frameworks to be new in FTD research. Moreover, our work is one of few based on the NIFD database. We believe that our standardized methodology makes our work useful for benchmark comparison in FTD classification. Experiments based on neurodegenerative disease classification need large sample sizes and to the best of our knowledge NIFD offers the largest FTD cohort to date with neuroimaging and clinical data available. Finally, we found that several studies were underpowered, with few participants for each experimental group. Small sample sizes usually lead to unstable results and/or inflated performances of the models used for classification. This bias is worsened when coupled with misused Cross-Validation strategies such as k-fold Cross-Validation that are unreliable with~20 subjects per group. A standardized workflow following data science best practices protects from data leakage or overfitting, ensuring reproducibility and reliability.
The performance of the CAD presented here could be further enhanced by training the model with a higher number of epochs, as other works show convergence after 100 epochs [42]. Additionally, tuning the model hyper-parameters such as learning rate and batch size could have yielded higher accuracy, along with choosing a different DL architecture, finally identifying the best performing combination. Moreover, we could have taken a data-centric approach to boost the effectiveness of the training phase, possibly improving test performance. As it has been shown, a data-centric approach counters the challenges of training with a small dataset and improves accuracy [70]. Nonetheless, our approach aimed at simplicity and reproducibility, showing that a CAD tool can be set up by using open-source and easy-to-use software platforms providing state-of-the-art methods for data preprocessing and analysis.
One of the biggest flaws in DL-based CAD tools is their lack of interpretability. In fact, DL neural networks are black boxes, meaning that highly complex data processing makes it unintelligible how the model comes to its final output (a class probability). To the extent of making black boxes more interpretable, a few methods for explaining model behavior have been developed, contributing to the realization of the explainable AI [12,71]. Neurodegenerative disease research is only recently approaching explainable AI [72][73][74][75] and there are only a few works available where Guided Grad-CAM has been used to generate attention maps for DL neural networks [76][77][78][79]. To the best of our knowledge, this is the first work where a DL model for FTD detection is studied with explainable AI methods. As reported in the results section, ventricular spaces were the most influential areas for the model output, and it has also been observed that the model was influenced by skull parts too. The AI presented in this work has a good performance, and attention maps showed that it seems to rely mostly on gross characteristics accounting for large differences in the images, as a human might do. To this extent, this model seems to be reliable as it shows to mimic human behavior when choosing, and this might make this CAD trustable by groups of physicians aiming to find aids from automated analysis methods when evaluating patients. We argue that the model might have misclassified the NC samples due to noise in the signal introduced by the skull parts. Evaluating the influence of skull parts in the model decision was out of the scope of this work; it would take specific experiments in order to determine how influential they were in the model decision.
This paper presents how the use of Clinica and MONAI for data preprocessing and analysis facilitates reproducibility in developing a CAD tool for FTD detection. We believe that having standardized, reproducible and trustable CAD tools would ease their inclusion in patient's clinical management practices. In fact, although thousands of AI-based tools have been developed with the potential of smoothing disease detection or outcome prediction, their real practical application is lagging [11,80,81]. The stability of Clinica methods for data preprocessing well couples with the flexibility of the procedures proposed by MONAI, setting up a highly versatile system for CAD development. We believe that such practices should be applied on data adhering to the FAIR principles when developing an AI model to study a neurodegenerative disease. Additionally, we believe that explainable AI practices such as using and interpreting attention maps should be necessary steps in building a CAD tool, in order to bring the AI realm into standard clinical practice.

Limitations
This work has some limitations. First, we used only a subset of all the images available in NIFD, in essence the T1w MPRAGE MRI at T0 (patient's first visit). NIFD includes many more MRI modalities and other timepoints. Using all images with appropriate data management strategies could improve classification performance. Second, our CAD was built on DenseNet121, the architecture that MONAI proposes for 3D classification, without hyper-parameter tuning. It is probable that a better performance could have been yielded with different parameters and it may also be that a different architecture among those available in MONAI could have performed better on our data, yet it was outside of the scope of this work to yield max performance on the test set, as performing as good as others was enough for this application. Third, images were resized when imported in the MONAI environment; thus, possibly reducing signal quality within the MRI brain scans and finally partially hindering the model discriminative performance. Nonetheless, attention maps evaluation revealed that the model focused on gross characteristics, probably disqualifying this limitation.

Conclusions
Although AI is now a leading technology in medical research, the real-life implementation of AI-based CAD tools in daily clinical practice is still facing obstacles. To be approved by regulators, AI-based decision support systems must be able to consistently reproduce their results on multiple sites or cohorts, while integrating with electronic health record systems. Similarly, the DL models powering the statistical engine of the CAD should be consistently updated over time to keep up with evolving clinical standards [11,80,81]. In particular, the high complexity of neurodegenerative diseases poses tough challenges for the development of reliable CAD tools, since complex diseases such as FTD, AD, and PD are characterized by a strong heterogeneity in their manifestation and underlying pathological mechanisms. Here, we presented a workflow for FTD detection based on a standardized preprocessing framework of 3D brain images, coupled with a reproducible protocol of data augmentation and Deep Learning model training and evaluation. Moreover, we used explainable AI methods to demonstrate how AI behavior can be understood by regulators and physicians. We found out that our standardized workflow for AI-based CAD tool development is comparable to other classification approaches in FTD, without compromising on reproducibility and interpretability of the DL model. Interestingly, we observed that methodological heterogeneity in FTD classification is not limited to development practices but also extends to data sources and cross-validation strategies with the latter being potentially harmful for generalizability of the results (Table 4).
Thus, we believe that health informaticians should develop AI-based CAD tools with pipeline standardization in mind, as these objectives cannot be achieved without it. In particular, we need standardized data management strategies during collection, preprocessing, and sharing, especially in case of a cross-site contribution to a database. The adherence to FAIR principles for shared data is pivotal to enhance their reusability by researchers worldwide. We believe that the widespread adoption of stronger standardization principles would foster the stability of the techniques and the reliability of findings in AI research. Within this setting, the application of explainable AI methods is required to overcome the issue posed by black box models, in particular about physicians' trust in CAD. Increasing the understanding of AI behavior would weaken the hindering for CAD tools to be applied in real-world clinical settings, finally bringing us closer to a fruitful human-machine interaction in the biomedical field.