3.1. General Framework of the Proposed Methods
In this section, we present the proposed framework for ASD diagnosis using sMRI and fMRI modalities. This study focuses on analyzing and exploring various image generation techniques applied to a standard dataset in order to classify subjects as either ASD or typical control (TC). Because this work investigates the effectiveness of multi-slice generation, vision transformers, and 3D-CNNs, we divided our research framework into two parts: the first covering the methods used with sMRI modalities, and the second covering those implemented with fMRI modalities.
Figure 1 shows the proposed methodology for ASD classification using sMRI modalities. To avoid data leakage between training and testing examples [47], the raw MRI dataset was first split into 80% training and 20% validation sets at the subject level before applying any further data generation. Then, to prepare the sMRI modalities, the raw 3D data was preprocessed and normalized using Fuzzy C-means (FCM)-based tissue-based mean normalization [48]. In this method, the FCM algorithm clusters the image voxels into tissue groups based on their intensity values, and the mean intensity of each tissue cluster is then calculated. These mean values serve as reference points for normalization: by scaling voxel intensities against the respective tissue means, the method adjusts image intensities to a consistent scale, which benefits downstream ASD image analysis and classification. The resulting 3D sMRI images were used as input to a 3D-CNN to perform ASD diagnosis. The proposed 3D-CNN architecture employed four sequential 3D convolutional layers (3DConv), each followed by max pooling and batch normalization (BN). This design aimed to leverage the spatial information present in the 3D sMRI data for accurate diagnosis. To implement TL, 2D sMRI images were generated from the 3D sMRI data by slicing along the three anatomical planes, namely axial, coronal, and sagittal. The number of extracted slices was set by selecting the middle 1, 10, or 50 slices per plane. We used the constructed slice representations as input to four vision transformers and trained these models separately on each plane. The performance of each model was evaluated based on its loss, accuracy, and F1 score. The subsequent sections give more details about TL and the 3D-CNN.
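To make the normalization step concrete, the following is a minimal sketch of FCM-based tissue-based mean normalization for a single 3D volume, assuming three tissue clusters (e.g., CSF, GM, WM) and the scikit-fuzzy implementation of fuzzy C-means; the function name, its parameters, and the choice of reference tissue are illustrative assumptions rather than the exact code used in this study.

```python
import numpy as np
import skfuzzy as fuzz  # scikit-fuzzy's fuzzy C-means implementation

def fcm_tissue_mean_normalize(volume, n_clusters=3):
    """Normalize a 3D MRI volume by the mean intensity of an FCM tissue cluster.

    volume: 3D NumPy array of voxel intensities.
    n_clusters: number of tissue classes (e.g., CSF, GM, WM).
    """
    # Cluster only non-background voxels on their intensity values.
    mask = volume > 0
    intensities = volume[mask].reshape(1, -1).astype(np.float64)  # (features=1, N)

    # Fuzzy C-means clustering of voxel intensities.
    centers, u, *_ = fuzz.cmeans(intensities, c=n_clusters, m=2.0,
                                 error=1e-5, maxiter=200, seed=0)

    # Hard-assign each voxel to its highest-membership cluster and compute
    # the mean intensity of each tissue cluster as a reference point.
    labels = np.argmax(u, axis=0)
    tissue_means = np.array([intensities[0, labels == k].mean()
                             for k in range(n_clusters)])

    # Normalize by the brightest tissue's mean (e.g., WM in T1-weighted sMRI);
    # the choice of reference tissue is an assumption.
    return volume / tissue_means.max()
```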
As illustrated in Figure 2, the fMRI modalities are processed differently. Since fMRI scans are 4D images representing simple time-series or multi-volume data, different numbers of slices were generated based on the fourth dimension (i.e., time), whereby the first three dimensions represent the axial, sagittal, and coronal planes. Specifically, five temporal selections were considered: 10, 30, and 50 slices; all slices; and all slices except the 10 starting and 10 ending slices. The selected images were then averaged over time (the fourth dimension); note that if a list of 4D images is provided, the mean of each individual 4D fMRI image is computed separately, and the resulting means are then averaged together, yielding 3D fMRI images. These 3D images were then normalized using FCM-based tissue-based mean normalization [48] and used as input to the proposed 3D-CNN, following a similar architecture to that of the sMRI experiment. The objective of using the same 3D-CNN model was to assess the performance of ASD diagnosis using the generated sMRI and fMRI datasets.
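The temporal averaging described above matches the behavior of nilearn's mean_img utility, which the sketch below uses; the helper name and selection labels are illustrative assumptions, and the counted selections are assumed here to take the first k volumes.

```python
from nilearn import image

def temporal_mean(fmri_path, selection="all"):
    """Reduce a 4D fMRI image to a 3D image by averaging over time.

    selection: "first10", "first30", "first50", "all", or "trimmed"
    (all volumes except the 10 starting and 10 ending ones).
    """
    img = image.load_img(fmri_path)          # 4D NIfTI image
    n_vols = img.shape[3]

    if selection == "all":
        window = img
    elif selection == "trimmed":
        window = image.index_img(img, slice(10, n_vols - 10))
    else:                                     # counted selections assumed to
        k = int(selection.replace("first", ""))  # take the first k volumes
        window = image.index_img(img, slice(0, k))

    # mean_img averages over the 4th (time) dimension; given a list of 4D
    # images, it averages each one separately and then averages the results.
    return image.mean_img(window)
```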
3.2. 3D-CNN Architecture for ASD Diagnosis
This study proposed a 3D-CNN model architecture that is simple yet effective in analyzing 3D neuroimaging data. The model uses raw 3D images from sMRI and fMRI neuroimages collected from the NYU dataset. Before training, the neuroimages were normalized and resized to enhance comparability and standardize spatial dimensions. The CNN architecture consisted of an input layer, followed by a 3D convolutional layer with 64 filters of size 3 × 3 × 3 and a ReLU activation function, and then a max pooling layer with a pool size of 2 × 2 × 2. Batch normalization (BN) was applied after each convolutional and max pooling layer to enhance the stability and efficiency of the model. This block was repeated for an additional three convolutional layers with progressively more filters: 64, 128, and 256. After the convolutions, a global average pooling layer produced a global representation of the extracted features, followed by a dense layer with 512 units and ReLU activation. To prevent overfitting, dropout with a rate of 0.3 was applied to the dense layer. Finally, the output classification layer consisted of a single unit with a sigmoid activation function for binary classification, distinguishing between ASD and TC subjects. The model was trained for 50 epochs using a binary cross-entropy loss function and the Adam optimizer with a learning rate of 0.001, and was evaluated using loss, accuracy, and F1 metrics.
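A minimal Keras sketch of this architecture follows, assuming an illustrative input shape and reading "BN after each convolutional and max pooling layer" literally; the F1 score would be computed post hoc from predictions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_3dcnn(input_shape=(121, 145, 121, 1)):  # input shape is an illustrative assumption
    inputs = keras.Input(shape=input_shape)
    x = inputs
    # Four Conv3D blocks with progressively more filters.
    for filters in (64, 64, 128, 256):
        x = layers.Conv3D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)      # BN after the convolutional layer
        x = layers.MaxPooling3D(pool_size=2)(x)
        x = layers.BatchNormalization()(x)      # BN after the max pooling layer
    x = layers.GlobalAveragePooling3D()(x)       # global representation of features
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)                   # dropout to prevent overfitting
    outputs = layers.Dense(1, activation="sigmoid")(x)  # binary output: ASD vs. TC

    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_3dcnn()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```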
3.3. TL Vision Transformers for ASD Diagnosis
Following the success of transformers in NLP, many researchers have begun to explore transformer architectures for computer vision tasks [
41,
43,
44,
49,
50], and very recently for medical image analysis [
51,
52,
53]. CNNs and their variants provide state-of-the-art performance, partially due to their expanding receptive fields, which lead to learning hierarchies of structured image representations. Generally, capturing the visual meaning of an image is considered the foundation of successful computer vision networks. However, conventional CNNs have the limitation of ignoring long-range dependencies between objects in an image [
52]. Studies have shown that adding the attention mechanism, which has been successful in NLP, to CNNs can help capture these long-range dependencies and improve image classification accuracy by treating each image as a sequence of patches [
43]. In this context, this study employed four vision transformers, namely ConvNeXt [
41], MobileNet [
42], Swin [
43], and ViT [
44] in the context of ASD diagnosis using sMRI modalities.
The ConvNeXt transformer family, developed by Liu, Mao, Wu, Feichtenhofer, Darrell, and Xie [
41], was specifically designed for image classification tasks. It incorporates elements from the ResNet architecture and has undergone pre-training on a large dataset of images. This pre-trained ConvNeXt model serves as a robust foundation, as it has acquired the ability to extract meaningful features from diverse image data. The versatility of ConvNeXt extends beyond image classification, making it applicable to a wide range of image-related tasks, including object detection, image segmentation, and recognition. The pre-trained models offer customization options with varying layer sizes, input sizes, and training datasets. Notably, ConvNeXt benefits from residual connections, which enhance accuracy through efficient information propagation throughout the network. From this family, we chose the ConvNeXtXLarge variant for ASD classification.
The MobileNet transformer was designed by Sandler, Howard, Zhu, Zhmoginov, and Chen [
42] for mobile devices and low-power applications. It incorporates depth-wise separable convolutions, in which a depth-wise convolution is followed by a point-wise convolution. This design choice significantly reduces the number of parameters in the network, resulting in lower computational requirements for both training and inference. MobileNet was pre-trained on large-scale datasets such as ImageNet and is commonly utilized in computer vision tasks such as image classification, object detection, face recognition, and scene understanding. MobileNetV2 is the variant used in this study.
The Swin transformer was proposed by Liu, Lin, Cao, Hu, Wei, Zhang, Lin, and Guo [
43] as a variant of the vision transformer architecture that utilizes a shifted-window attention scheme. It achieves a hierarchical representation by starting with small patches and gradually merging neighboring patches, while reducing the computation needed to calculate attention over high-resolution images. Swin was pre-trained on ImageNet-1K, ImageNet-22K, COCO 2017, and ADE20K, and has shown remarkable performance in image classification, object detection, and semantic segmentation. Among the multiple Swin variants, this work employed SwinV2Tiny256.
The ViT is an image classification transformer developed by Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, and Houlsby [
44], which provides new insights for vision-related tasks that are markedly distinct from the current state-of-the-art approaches based on CNNs. Although the original transformer model combines both encoders and decoders, the ViT model uses only encoders, pre-trained through supervised learning on vast collections of images, including the ImageNet-1K, ImageNet-21k, and JFT datasets. This work employed the ViT_base16 model.
Furthermore, this study focused on the downstream task of ASD diagnosis using these vision transformers, utilizing pre-built modules offered by the Keras library. To accomplish this objective, we collected 3D sMRI neuroimages from the NYU dataset in the ABIDE-I repository. Before proceeding with the data generation process, we divided the subjects’ images into train and test sets, allocating 80% for training and 20% for testing. The vision transformers employed in this study require 2D input images with 3 channels. Thus, we followed a specific multi-slice generation process, outlined in the experimental setup section, to prepare the 2D sMRI data. After generating the 2D images, we normalized them and resized them to a resolution of 224 × 224 pixels. Where the images were grayscale, we duplicated the single channel to create three input channels. To further enhance the diversity and size of the dataset, we applied a data augmentation pipeline consisting of two augmentation layers, random flip and random rotation; a sketch of this preparation pipeline is given below. The visualization of augmented data is presented in the experimental setup section.
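The sketch below summarizes the slice extraction, preprocessing, and augmentation steps, assuming NumPy volumes and the Keras preprocessing layers named in the text; the helper names, plane-axis convention, and rotation factor are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def middle_slices(volume, plane=0, n_slices=10):
    """Extract the middle n slices of a 3D volume along one anatomical plane
    (0 = sagittal, 1 = coronal, 2 = axial; the axis order is an assumption)."""
    size = volume.shape[plane]
    start = size // 2 - n_slices // 2
    return np.take(volume, range(start, start + n_slices), axis=plane)

def to_model_input(slice_2d):
    """Min-max normalize a 2D slice, resize it to 224 x 224, and duplicate
    the grayscale channel to obtain the 3 input channels the models expect."""
    x = (slice_2d - slice_2d.min()) / (np.ptp(slice_2d) + 1e-8)
    x = tf.image.resize(x[..., None], (224, 224))   # add channel axis, resize
    return tf.repeat(x, 3, axis=-1)                  # grayscale -> 3 channels

# Augmentation pipeline with the two layers named in the text; the rotation
# factor is an illustrative assumption.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
])
```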
Our approach to utilizing the vision transformers involved excluding the top layer from the original architecture and adding a classification layer, as sketched below. All TL model architectures were trained with identical hyperparameter settings to maintain consistency. The training process was configured to run for 50 epochs, utilizing a binary cross-entropy loss function and the Adam optimizer with a learning rate of 0.001. Upon completion of the training and evaluation process, we reported the classification loss, accuracy, and F1 score for all transformers.
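The following sketch illustrates this head-replacement setup with the ConvNeXtXLarge backbone bundled in keras.applications; Swin and ViT are not part of core Keras, so their loading would differ, and the hyperparameters mirror those stated above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_tl_classifier():
    # Pre-trained backbone with the original top (classification) layer excluded.
    backbone = keras.applications.ConvNeXtXLarge(
        include_top=False,
        weights="imagenet",
        input_shape=(224, 224, 3),
        pooling="avg",               # global average pooling of backbone features
    )
    inputs = keras.Input(shape=(224, 224, 3))
    x = backbone(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # added binary head: ASD vs. TC

    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_tl_classifier()
# model.fit(train_ds, validation_data=val_ds, epochs=50)
```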