A Convolutional Deep Neural Network Approach to Predict Autism Spectrum Disorder Based on Eye-Tracking Scan Paths

Abstract: Autism spectrum disorder (ASD) is a developmental disorder that encompasses difficulties in communication (both verbal and non-verbal), social skills, and repetitive behaviors. The diagnosis of autism spectrum disorder typically involves specialized procedures and techniques, which can be time-consuming and expensive. The accuracy and efficiency of the diagnosis depend on the expertise of the specialists and the diagnostic methods employed. To address the growing need for early, rapid, cost-effective, and accurate diagnosis of autism spectrum disorder, there has been a search for advanced smart methods that can automatically classify the disorder. Machine learning offers sophisticated techniques for building automated classifiers that can be utilized by users and clinicians to enhance accuracy and efficiency in diagnosis. Eye-tracking scan paths have emerged as a tool increasingly used in autism spectrum disorder clinics. This methodology examines attentional processes by quantitatively measuring eye movements. Its precision, ease of use, and cost-effectiveness make it a promising platform for developing biomarkers for use in clinical trials for autism spectrum disorder. The detection of autism spectrum disorder can be achieved by observing the atypical visual attention patterns of children with the disorder compared to typically developing children. This study proposes a deep learning model, known as T-CNN-Autism Spectrum Disorder (T-CNN-ASD), that utilizes eye-tracking scans to classify participants into ASD and typical development (TD) groups. The proposed model consists of two hidden layers with 300 and 150 neurons, respectively, and underwent 10 rounds of cross-validation with a dropout rate of 20%. In the testing phase, the model achieved an accuracy of 95.59%, surpassing the accuracy of other machine learning algorithms such as random forest (RF), decision tree (DT), K-Nearest Neighbors (KNN), and multi-layer perceptron (MLP). Furthermore, the proposed model demonstrated superior performance when compared to the findings reported in previous studies. The results demonstrate that the proposed model can accurately classify children with ASD from those with TD without human intervention.


Introduction
The prevalence of autism spectrum disorder (ASD) is on the rise globally. Currently, approximately one in 68 children has been diagnosed with ASD [1]. ASD is a neurodevelopmental disorder that impacts the behavioral and cognitive abilities of children, including communication and social interaction [2]. People with ASD have difficulties in social communication and interaction, such as struggles with social-emotional reciprocity, non-verbal communication, and building and understanding relationships [3]. Previous research on ASD has examined the unique obstacles in understanding both spoken and non-spoken signals of communication. These obstacles encompass challenges in monitoring eye movements, recognizing and replicating gestures and facial expressions, and grasping language pragmatics and verbal interactions. Diagnosing ASD presents a complex and rigorous challenge, focusing on examining interactive behaviors in diagnostic conversations. This examination involves evaluating how the person being assessed aligns their verbal and non-verbal actions with their conversation partner (interpersonal coordination), and also observing the frequency and characteristics of their repetitive behaviors (intrapersonal coordination). The expert needs to interact with the child, be mindful of their own conversational style, and make diagnostic notes according to their observations. Examining the level of interactional coordination in various conversation settings during autism diagnosis could offer valuable information about an individual's ability to adjust behavior across different conversational contexts and its impact on diagnostic results. However, many studies have concentrated on behaviors at the individual utterance level or used overall statistics at the conversation level. As a result, the real-time dynamic elements of coordination in conversations are not adequately captured [4].
In numerous countries globally, parents often request a formal assessment and diagnosis primarily due to delayed initiation of communication and language or insufficient linguistic abilities in comparison to children of a similar age. This emphasizes that a considerable number of children with ASD encounter difficulties in developing language. Proficiency in language consistently proves to be a reliable indicator of social and academic achievements, emphasizing the significance of giving priority to language assistance when formulating intervention objectives for children with ASD [5]. As children diagnosed with ASD mature, some may make efforts to understand the anticipated behaviors and suitable language usage, although this may not be uniform across all contexts. Nevertheless, they frequently face difficulties in grasping the deeper implications in dialogues, resulting in literal understandings and struggles in recognizing indirect messages like sarcasm or minor deceptions. Clear and specific instructions are necessary for their understanding, as individuals with ASD might find it hard to deduce meanings without explicit guidance, despite possessing good verbal communication abilities. Nonetheless, they commonly experience difficulties in utilizing language appropriately in social environments [3].
The severity of symptoms and the impact of ASD differ from one case to another. In recent times, numerous research studies have been carried out with the aim of enhancing the quality of life for individuals with ASD. These studies suggest that early detection and intervention can have a significant impact on the diagnosis of ASD, leading to reduced costs for both individuals and society as a whole [6]. Children with autism typically display symptoms early in life, but their diagnosis is often delayed until they reach school age. Several research studies have highlighted various factors that contribute to this delay, such as limited resources in rural areas and low-income families, the effectiveness of ongoing pediatric care, a shortage of specialized referrals, and parents' awareness of ASD [7]. Diagnosing ASD is a difficult task because of its intricate characteristics and diverse range of symptoms [2]. Moreover, the identification of ASD poses challenges due to the necessity of specialized professionals utilizing methods such as observation, interviews, and questionnaires [1]. Specialist physicians in a clinical setting are usually responsible for the official diagnosis of ASD. They utilize a clinical judgment procedure and rely on observable and measurable behavioral indicators. However, the current diagnostic tools are time-consuming due to the extensive number of items that need to be checked by the specialist. These tools also rely on static human-embedded rules. As a result, there is a need to modify the coding and behavior of ASD clinical tools in order to classify cases more efficiently [1]. Despite the significant financial investment in the advancement of detection methods, clinical treatments have not yielded successful results thus far. The primary challenges contributing to these issues include clinicians' limited access to adequate data and their inability to conduct thorough data analysis [8]. Furthermore, autism is a diverse condition with various manifestations. A standardized diagnosis approach that encompasses all cases and allows for self-treatment by specialists is lacking. Each specialist relies on their own expertise and experience to diagnose ASD, resulting in inconsistencies and a lack of reliability in the diagnostic process [9]. Consequently, there is a demand for advanced intelligent techniques that can swiftly and automatically classify ASD [1].
Eye-tracking technology plays a significant role in the diagnosis of ASD by analyzing visual patterns [10]. Typically, an eye-tracking device consists of a high-resolution digital camera and an intelligent algorithm that precisely detects eye gaze coordinates while individuals view videos or images. The eye gaze data obtained from this technology provide valuable insights into the challenges faced by individuals with ASD in social interactions, enabling interventions to be customized according to their unique requirements [7]. One way to enhance our knowledge about the use of eye-tracking biomarkers in distinguishing different subsets of individuals with ASD is to examine the potential impact of highly comorbid psychiatric conditions such as Attention Deficit Hyperactivity Disorder (ADHD), anxiety, and mood disorder. By doing so, we can gain a better understanding of how these conditions may affect the ability to identify specific populations within a clinical setting. For instance, research has shown that children who have both ASD and ADHD tend to display shorter fixation durations on faces when viewing static low-complexity social stimuli, as compared to children with ASD only and those with typical development (TD) [11].
Based on research findings, eye-tracking data have been found to serve as a biomarker for the early detection of ASD in children [7]. Biomarkers, also known as biological markers, are measurable and objective indicators that provide information about a patient's observable biological condition. Bodily fluids or soft tissue biopsies are commonly utilized to evaluate the effectiveness of treatment for a disease or medical condition. By examining eye movements, researchers have discovered that an atypical preference for geometric images can serve as an early indicator of a specific subtype of ASD, which is linked to more severe symptoms [11]. Compared to other types of data, the analysis of biomarkers on images is often more challenging. This is because biomarkers are not only defined by numerical values but also by their spatial relationship to one another. The precision and reliability of biomarker analysis are significantly impacted by the resolution of the image. This is due to the fact that the numerical values representing biomarker indicators in the image are crucial. In cases where there is spatial variation among these indicators within tissues or organisms, the distribution of digital values in the image will differ. Therefore, specific methods are required to accurately capture and measure the spatial correlations among these indicators. However, the complex nature of biological images often necessitates the involvement of human experts in annotating these biomarkers. Unfortunately, this process is time-consuming, labor-intensive, subjective, and prone to errors. To address these limitations, various software applications have been developed to automate data classification and assist healthcare professionals. The advent of machine learning techniques, particularly neural networks, has revolutionized this field [8]. Eye tracking has the potential to be utilized in routine clinical settings for the purpose of aiding specialists in the diagnosis of ASD [12]. The results of a study indicate that eye-tracking data can significantly assist clinicians in efficiently and accurately screening for autism [7].
The CNN-based architecture was selected for this problem due to its impressive performance and accuracy in computer vision problems and image classification compared to other deep neural network architectures. It can autonomously and adaptively learn features from signals and images. CNNs are widely used in image-processing tasks including object detection, object recognition, anomaly detection, and segmentation. One of the main advantages of CNNs over previous models is their ability to extract relevant features from input images without human intervention [7]. In this study, we introduce a model called Tuned-Convolutional Neural Network-Autism Spectrum Disorder (T-CNN-ASD) that can classify eye-tracking scan path images of children with ASD and typical development (TD) without the need for human intervention. Additionally, various image-processing techniques are employed to extract the most significant features from eye-tracking scan path images of individuals with ASD. The dataset used in this research was collected from 59 children with an average age of around 8 years [13]. A total of 547 images were used in the study by Carette et al. (2019) [10]. We then constructed the T-CNN-ASD model for the classification of eye-tracking scan path images. To improve the model's performance, various architectures were fine-tuned using different optimizers to determine the optimal optimizer and model parameters. These parameters included kernel sizes, stride, dropout, number of batches, number of epochs, and pooling parameter, all aimed at improving the accuracy of the model. The performance of the T-CNN-ASD model is compared to that of popular classification techniques such as random forest, decision tree, K-Nearest Neighbors (KNN), and multi-layer perceptron (MLP). Furthermore, the proposed model is compared to previous studies in the literature. Figure 1 provides a summary of the research methodology.
This paper is structured as follows. In Section 1, an overview of ASD and the commonly employed methods for classifying ASD is provided. Section 2 presents the previous research conducted in this area. The proposed methodology is discussed in Section 3. Section 4 presents the experimental results. Finally, Section 5 concludes the findings of this research and provides suggestions for future work.

Related Works
Numerous investigations have been conducted on the utilization of machine learning (ML) techniques for detecting autism through eye tracking. One such study, described in [6], involved the presentation of a speaking face and the analysis of children's gaze patterns. The researchers discovered that individuals with ASD tend to have shorter periods of fixation on the eyes, mouth, nose, body, person, face, and outer person in comparison to those without ASD. The authors utilized SVM as a classification method to distinguish between individuals with ASD and typically developing (TD) individuals. They based their classification on the ratios of fixation times for all areas of interest (AOIs). Their model achieved an accuracy of 85.1%, a sensitivity of 86.5%, and a specificity of 83.8%. In [14], the researchers introduced a technique for categorizing various degrees of ASD by utilizing eye gaze patterns and demographic descriptors like age. They used multiple ML classifiers to classify ASD. The researchers examined the utilization of two different sets of attributes. The initial collection includes features that were created manually and unprocessed gaze data. This combination achieved an accuracy rate of 90.83% when using both the PART and C4.5 classifiers. The second group consists of gaze pattern-based features that were manually created. These features achieved an accuracy rate of 93.45% when a random forest was used. Furthermore, the researchers calculated the Pearson correlation coefficient to determine the correlation between each feature and the classification output. It was observed that the average classification accuracy decreased when age was excluded as a feature in all tested classifiers. In addition, Vabalas et al. [15] utilized ML techniques along with eye-tracking and motion data to distinguish between autistic and non-autistic adults. They integrated both motion and eye-tracking data to improve the accuracy of their classification.
The authors discovered that the kinematic and eye movement datasets contained more features than the samples. To address this issue, they employed the 'wrapped t-test' method, which combines features from both the filter and wrapper methods. By selecting a subset of features that yielded the highest classification performance, they eliminated certain portions of the features. Their model achieved an accuracy of 73% in predicting the diagnosis based on kinematic characteristics, 70% accuracy using features of eye movement, and 78% accuracy when combining both sets of features.
Several investigations have employed artificial neural networks (ANNs) to categorize instances of ASD. For instance, Carette et al. [10] explored the fusion of eye-tracking technology with ANNs to aid in the identification of ASD. At first, non-neural network methods were employed. The accuracy attained by this group of models was satisfactory. Following that, the model was tested with different ANN architectures. Based on the findings, the single-layer model consisting of 200 neurons produced the highest accuracy. The authors of [16] conducted research on the visual attention of children with ASD when observing human faces. They utilized deep neural networks (DNNs) to extract semantic characteristics. The feature maps for individuals with ASD exhibit distinct characteristics when observing human faces in comparison to individuals without ASD. These feature maps are subsequently combined with the features obtained from CASNet. In their study, the researchers compared CASNet with six other deep learning models and discovered that CASNet outperformed all of them in every scenario. In their work [17], the authors introduced a method for classifying children with TD and ASD using the eye gaze scan paths. They utilized a combination of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The CNN-LSTM model was employed to extract features from the saliency maps and the sequence of fixation locations in the scan path. Moreover, the pretrained prediction model SalGAN was utilized for data preprocessing and to generate the network input. The proposed model achieved an accuracy of 74.22% on the validation dataset.
Our research stands out from prior studies in the following aspects: First, we addressed the issue of working with a dataset containing a small number of images without utilizing augmentation. Whereas many earlier studies depended on augmenting image data to improve outcomes, we maintained the original image count and achieved superior performance. Additionally, while numerous prior research works tested different machine learning algorithms to identify the most effective one, our study utilized CNNs. We carefully adjusted the parameters of the CNN model and offered explanations for the superior outcomes. This methodology enabled us to attain peak performance without resorting to image augmentation.

Proposed Methodology
Clinicians often face challenges and time constraints when diagnosing ASD due to the wide range of symptoms associated with the condition. The heterogeneity of ASD symptoms leads to distinct variations within specific subgroups. Unfortunately, there is currently no reliable method to measure the diversity of autism symptoms in children, making it difficult to plan targeted interventions for smaller groups. However, research has shown a strong correlation between ASD heterogeneity and the complexity of eye-tracking data. Therefore, by utilizing advanced algorithms based on convolutional neural networks (CNNs), machine learning models could be used to quantify the variations in ASD symptoms, enabling accurate classification of ASD severity based on eye-tracking data [7]. The aim of this study is to develop an enhanced convolutional deep neural network for the identification of autism using eye-tracking scan path images. The methodology consists of several stages, including image processing, cross-validation, the creation of a T-CNN-ASD model, fine-tuning of model parameters, and evaluation of the model. The architecture of the proposed methodology is depicted in Figure 2.

Dataset Collection
The study made use of the public dataset mentioned in [13]. This dataset was employed to carry out experiments assessing the efficacy of eye-tracking technology within the realm of ASD. It comprises data from a study involving 59 children, with 29 diagnosed with ASD and the remaining 30 being typically developing children averaging around 8 years of age. A cohort of typically developing (TD) children was enlisted from various French schools in the Hauts-de-France region. All procedures involving human participants were conducted following the ethical standards set by the institutional and/or national research committee, as well as in compliance with the 1964 Helsinki Declaration and its subsequent amendments or similar ethical standards. The children with ASD were evaluated in specialized clinics, and the level of autism in each child was gauged using the Childhood Autism Rating Scale (CARS). Table 1 presents a summary of the participants' statistics [10]. The dataset underwent several stages of processing. At first, participants viewed a collection of videos containing particular scenarios aimed at prompting eye movement on the screen. Participants could be seated on their own or on their parent's lap at approximately 60 cm distance from the display screen. The experiments were conducted in a quiet room at the University premises. Additionally, physical white barriers were used to reduce visual distractions. These videos were carefully created to include visually attractive components, especially appealing to children, like colorful balloons and animated characters. Moreover, the positioning of these components changed during the study. Some videos also included a live host who interacted verbally with the child, trying to guide their focus towards both seen and hidden elements. It is important to mention that all interactions and dialogues took place in French, the participants' mother tongue.
The eye tracking function was performed using the SMI Experiment Center Software remote eye tracker (Red-m 250 Hz). The motion path of the eye was then transformed into images. The resulting visualizations comprised a dataset of 547 images, with 328 images corresponding to participants without ASD and 219 images corresponding to participants with ASD. Furthermore, all images had a default size of 640 × 480 pixels.
As mentioned earlier, the field of ASD has placed particular emphasis on the utilization of eye tracking technology due to the consistent identification of gaze abnormalities as a characteristic feature of autism in general [10]. The divergence in the eye tracking patterns between children with ASD and typically developing (TD) children is illustrated in Figure 3.

Stage of Image Processing
Image processing refers to the application of various operations on an image in order to improve its quality or extract valuable information from it, prior to the implementation of other algorithms. Numerous image-processing techniques can be utilized, including but not limited to image resizing, geometric transformation, and conversion of color images to grayscale images [18].
The computer analyzes the image by treating it as a collection of pixels. The size of the image is described by its height, width, and dimension, which are represented in the format h × w × d (where h is the height, w is the width, and d is the dimension, i.e., the number of color channels). For instance, an image of size 3 × 3 × 3 corresponds to a 3 × 3 matrix of Red, Green, and Blue (RGB) values. Conversely, an image of size 3 × 3 × 1 represents a grayscale image in the form of a matrix.
The CNN model can process the pixel representation of image matrices. By doing so, it can detect edges, colors, and shapes in the images, enabling image recognition. The model is capable of performing tasks such as classification and object detection within an image. However, training the CNN model using raw images may lead to lower accuracy in classification.
Furthermore, preprocessing is regarded as a crucial step in accelerating the training of models [18]. As a result, various processing techniques were employed to improve the robustness and performance of the T-CNN-ASD. The image-processing techniques applied to the images in this paper are depicted in Figure 4.

Reducing the Size of Images
Image data are fed in batches to train T-CNN-ASD. A batch requires that all of its images be of the same size, so all images fed into the model must have the same height and width. Rescaling is therefore needed to fit the image size to the feature maps that are used [18]. All images were scaled down to 100 × 100 while retaining their RGB colors. This resizing does not cause much loss of information, since most images contained a large blank space of unused pixels [10]. Thus, the dimensions of each image become 100 × 100 × 3. Figure 5 shows the scaling of a sample image.
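The rescaling step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the interpolation method is not stated in the text, so nearest-neighbour sampling is assumed, and the function name resize_nearest and the blank synthetic input are hypothetical. Pure Python is used so that the sketch runs without dependencies; in practice an imaging library would normally be used.

```python
def resize_nearest(image, new_h, new_w):
    """Resize an H x W x C image (nested lists) by nearest-neighbour sampling.

    Assumption: the paper does not state which interpolation was used.
    """
    old_h, old_w = len(image), len(image[0])
    resized = []
    for y in range(new_h):
        # Map each target row back to the closest source row.
        src_y = min(old_h - 1, y * old_h // new_h)
        row = []
        for x in range(new_w):
            src_x = min(old_w - 1, x * old_w // new_w)
            row.append(list(image[src_y][src_x]))
        resized.append(row)
    return resized


# Synthetic blank 480 x 640 RGB image standing in for a scan path image
# (the dataset's default size is 640 x 480 pixels).
original = [[[255, 255, 255] for _ in range(640)] for _ in range(480)]
small = resize_nearest(original, 100, 100)
shape = (len(small), len(small[0]), len(small[0][0]))
print(shape)
```

The resulting shape corresponds to the 100 × 100 × 3 dimensions described above.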

Grayscale Image Conversion
In this stage, the images are transformed into grayscale format to simplify the computations and reduce space usage. Color images contain more intricate information that may not be necessary and occupy more space compared to grayscale images. Figure 6 illustrates the process of converting the image to grayscale [19].
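A grayscale conversion of this kind can be sketched as a weighted sum of the three color channels. The weights below (0.299, 0.587, 0.114) are the common luminance coefficients and are an assumption; the text does not state which conversion formula was used. The function name rgb_to_gray is hypothetical.

```python
def rgb_to_gray(image):
    """Convert an H x W x 3 RGB image (nested lists) to H x W x 1 grayscale.

    Assumption: standard luminance weights; the paper does not give the formula.
    """
    return [
        [[round(0.299 * r + 0.587 * g + 0.114 * b)] for r, g, b in row]
        for row in image
    ]


# Tiny 2 x 2 example: pure red, pure green, pure blue, and white pixels.
rgb = [[[255, 0, 0], [0, 255, 0]],
       [[0, 0, 255], [255, 255, 255]]]
gray = rgb_to_gray(rgb)
print(gray)
```

Each pixel is reduced from three values to one, so a 100 × 100 × 3 image becomes 100 × 100 × 1.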

Reduction in Dimensions
Dimensionality reduction methods aim to convert a high-dimensional dataset into a lower-dimensional representation, while preserving the essential information of the original data. This process facilitates the analysis, processing, and visualization of the data [20]. This research utilizes large images with high-dimensional features. To handle this, dimensionality reduction is performed prior to using the images for training the T-CNN-ASD. The images are compressed by converting them from a 3D representation to a 1D representation. This reduction in dimensionality aids in extracting important latent features from the data, enabling the T-CNN-ASD to focus on relevant features and avoid dealing with irrelevant ones. Figure 7 illustrates the process of converting the image to 1D.
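The 3D-to-1D conversion can be illustrated as a simple row-major flattening. This is a sketch under the assumption that the conversion is a plain reshape; the text does not specify the exact ordering of the values, and the function name flatten is hypothetical.

```python
def flatten(image):
    """Flatten an H x W x C image (nested lists) into a 1D list, row by row."""
    return [value for row in image for pixel in row for value in pixel]


# A 2 x 2 x 1 grayscale example becomes a vector of 4 values.
gray = [[[0], [128]],
        [[255], [64]]]
vector = flatten(gray)
print(vector)
```

Applied to a 100 × 100 × 1 grayscale image, the same operation yields a vector of 10,000 values.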

Cross-Validation
Cross-validation is a method used to assess the performance of predictive models. It involves splitting the dataset into training and testing sets in order to evaluate the model, and it is mainly used to assess ML models [21]. Model generalization is evaluated through K-fold cross-validation. In this approach, the dataset is divided into K groups (folds). During each iteration, K-1 groups are used for training the model, while the remaining group is used for testing. This process is repeated K times, so that every group serves once as the test set, which helps to detect overfitting during training [22]. Typically, the value of K is set to 5 or 10, as recommended in various studies in the field of applied machine learning [23].
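The K-fold procedure described above can be sketched as follows, using the 547 images of the dataset and K = 10. The shuffling, the fixed seed, and the function name k_fold_indices are assumptions made for illustration; the split is also not stratified by class, which the text does not specify either way.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k nearly equal groups
    splits = []
    for i in range(k):
        test_idx = folds[i]
        # The remaining k-1 groups form the training set for this iteration.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(547, 10)
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Every image appears in exactly one test fold, so each sample is used for testing once and for training K-1 times.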

T-CNN-ASD Model
Image classification is a task within the field of computer vision, as mentioned in the study by Filchev et al. (2020) [24]. There are various methods available for image classification, with the most common ones being CNNs. CNNs are well-suited for image classification due to their ability to effectively reduce the dimensionality of the image, which is necessary given the large number of parameters involved. The process of using CNNs involves extracting features from the image in order to identify patterns within the dataset [25].
The T-CNN-ASD model is composed of different layers, including the input layer, the output layer, and multiple hidden layers. These hidden layers are made up of convolutional layers, the ReLU activation function, pooling layers, and fully connected layers. The model can have a combination of convolutional and pooling layers, with the number of hidden layers varying depending on the specific problem being addressed. In some cases, a single hidden layer is sufficient, while in others, multiple layers are needed, followed by one or more fully connected layers [7]. The process of going from input data to output through these layers is referred to as forward propagation [26]. The T-CNN-ASD structure, as depicted in Figure 8, is designed to implement this process. In this structure, the input image undergoes convolution layers and ReLU activation to extract features, followed by downsampling through pooling layers. Subsequently, the extracted features are fed into a fully connected layer for the classification of the image as ASD or TD. The primary components of the T-CNN-ASD architecture are as follows:

• The initial layer of T-CNN-ASD is referred to as the input layer. It accepts images as input, resizes them, and subsequently forwards them to subsequent layers for the purpose of extracting features.

• The convolution layer is responsible for extracting spatial and temporal features from an input image through the process of convolution. It consists of multiple filters or kernels that perform convolution on the entire image. To configure the convolution layer, important parameters such as the number and size of the kernels need to be set.
In the T-CNN-ASD model, each image is represented as an array of pixel values. These pixel values are then passed to the convolutional layer. Within this layer, the kernels move across receptive fields in the input image, searching for specific cues like edges, colors, and curves. Figure 9 provides a visual representation of this process [27]. T-CNN-ASD is composed of multiple convolutional layers, generating numerous feature maps from the input. These maps are then utilized as the input for subsequent layers to detect a wide range of features for learning purposes.

• The Rectified Linear Unit (ReLU) is a piecewise linear function that passes the input through directly if it is positive and outputs 0 if it is negative. This activation function has been widely used in various neural network models due to its improved performance and ease of training. It can be represented mathematically as f(x) = max(0, x) [28]. In the context of T-CNN-ASD, the ReLU helps maintain mathematical stability by preventing learned values from getting stuck near 0 or approaching infinity. The process of the ReLU is illustrated in Figure 10.

• Pooling Layer: The feature sets obtained from the previous layer are forwarded to the pooling layer. The purpose of this layer is to decrease the size of large images while still preserving the crucial information. In this study, both maximum and average pooling techniques were employed. These techniques involve extracting patches from the input feature maps. Max pooling selects the maximum value within each area, which represents the most prominent feature, while disregarding the remaining values. For this research, a maximum pooling method with a filter size of 2 × 2 and a stride of 2 was utilized. The max-pooling layer is illustrated in Figure 11. Average pooling computes the average value of each patch in the feature map. In other words, a 2 × 2 square in the feature map is reduced to its average value [6]. Figure 12 illustrates the operation of average pooling. A pooling layer provides a downsampling operation, which reduces the in-plane dimensionality of the feature maps. It is of note that there is no learnable parameter in any of the pooling layers, whereas filter size, stride, and padding are hyperparameters in pooling operations. Unlike height and width, the depth dimension of the feature maps remains unchanged [26].

• The Fully Connected Layer is considered the last layer in a network. Its purpose is to take the high-level filtered images and convert them into categories or classes with corresponding labels. The output feature maps from the final convolution or pooling layer are typically flattened, meaning they are transformed into a one-dimensional array of numbers. These flattened feature maps are then connected to one or more fully connected layers. In these layers, each input is connected to every output through a learnable weight [29]. Once the features extracted by the convolution layers and downsampled by the pooling layers are generated, a subset of fully connected layers is used to map them to the final output of the network [29].
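The components listed above can be chained into a small forward-propagation sketch: convolution, ReLU, 2 × 2 pooling with a stride of 2, flattening, and one fully connected output. This is a toy illustration and not the T-CNN-ASD model itself. The 6 × 6 input, the vertical-edge kernel, the output weights and bias, and the sigmoid output are all assumptions chosen so that the arithmetic is easy to follow; the function names are hypothetical.

```python
import math


def conv2d(image, kernel, stride=1):
    """Valid (unpadded) 2D convolution of a single-channel image, as used in CNNs."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Apply f(x) = max(0, x) element-wise."""
    return [[max(0, v) for v in row] for row in feature_map]


def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size patches."""
    reducer = max if mode == "max" else (lambda patch: sum(patch) / len(patch))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[reducer([feature_map[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)])
             for j in range(out_w)]
            for i in range(out_h)]


def dense_sigmoid(features, weights, bias):
    """One fully connected output unit; the sigmoid is an assumption."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy 6 x 6 single-channel image with a bright vertical band.
image = [[0, 0, 1, 1, 0, 0] for _ in range(6)]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]  # hypothetical vertical-edge filter

feature_map = conv2d(image, kernel)            # 4 x 4 feature map
activated = relu(feature_map)                  # negative responses set to 0
pooled = pool2d(activated, size=2, stride=2)   # 2 x 2 after max pooling
flat = [v for row in pooled for v in row]      # flatten for the dense layer
prob = dense_sigmoid(flat, [0.5, -0.5, 0.5, -0.5], bias=-3.0)
print(feature_map[0], activated[0], pooled, prob)
```

Note that pooling has no learnable parameters in this sketch, whereas the kernel values and the dense-layer weights are the quantities that training would adjust.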

Tuning the Hyperparameters of T-CNN-ASD
Despite the impressive performance of deep network methods in noise reduction tasks, many of these methods encounter issues with vanishing or exploding gradients when the network architecture is excessively deep [30]. Consequently, parameter adjustments are needed to achieve optimal performance. Additionally, accurate prediction in different datasets necessitates the use of distinct sets of hyperparameters.
The presence of a multitude of hyperparameters creates a challenge for users in selecting the most appropriate ones. Determining the optimal number of layers, neurons, or optimizer for all datasets is not a straightforward task. It is crucial to fine-tune these parameters to identify the most suitable set for constructing a model based on a particular dataset. The following items describe the hyperparameters that were tuned.

• Number of Neurons and Layers: The architecture of the T-CNN-ASD model can be optimized by tuning its hyperparameters. One important hyperparameter to consider is the number of neurons in each hidden layer. The optimal number of neurons should be chosen based on the complexity of the problem being solved. In this study, we experiment with different numbers of neurons to identify the value that yields the highest accuracy. It is worth noting that the accuracy of the neural network can also be influenced by the number of layers. The selection of layers plays a crucial role in determining the performance of the prediction model. Using too few layers may lead to underfitting, while employing too many layers may result in overfitting. Hence, tuning the number of layers can lead to improved results.

1. Kernel/Filter Size: The size of the kernel refers to the width × height of the filter mask. Various kernel sizes were tested to determine the one that yielded the highest accuracy.

2. Stride and Padding: The stride is the number of pixels by which the filter shifts at each step during convolution. It determines the number of pixels that are skipped while traversing the input horizontally and vertically after each elementwise multiplication of the input weights with those of the filter [26]. Figure 13 illustrates the stride parameter. Without padding, the output of a convolution is smaller than its input. Padding appends zero-filled rows and columns around the input so that the spatial dimensions are preserved and the output dimension matches the input dimension [26]. This can potentially enhance performance by preserving information at the boundaries.

3. Dropout: When the dataset is small in size, it is possible for overfitting to occur. This means that while the training results may be satisfactory, the test results are not as good. The reason for this is that the model is not able to accurately generalize to data that it has not encountered during training. If the model is unable to effectively generalize to unseen data, it will struggle to perform the intended classification or prediction tasks. To address this issue, dropout is employed as a means of reducing overfitting. Dropout is a regularization technique that decreases the likelihood of overfitting by randomly dropping out neurons during each epoch (or during each batch when using a batch approach). When a unit is dropped, the corresponding neuron is temporarily removed from the network, along with all of its incoming and outgoing connections [31], as illustrated in Figure 14.

4. Batch Size (BS): The batch size refers to the number of samples that the network will process simultaneously. This parameter plays a crucial role in enhancing the training process, particularly when fine-tuning is involved [32].

5. Epochs: Epochs refer to the number of times the learning algorithm processes the entire training dataset. It is a hyperparameter that determines how many passes the algorithm will make over the data. Each epoch allows every sample in the training dataset to update the internal model parameters. An epoch consists of one or more batches. The rationale behind using epochs is to ensure that the network has the opportunity to observe previous data and adjust the model parameters, preventing bias towards the most recent data points during training [33].
6. Optimizers: Optimizers are responsible for modifying the learning rate and weights of neurons in a neural network in order to minimize the loss function or maximize the accuracy. During the training process, the weights of the neural network are randomly initialized and then updated in each epoch to improve the overall accuracy. The loss function is used to calculate the error by comparing the output of the training data with the actual data at each epoch, and the weights are adjusted accordingly [34].
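The interaction between kernel size, stride, and padding, and between batch size and epochs, can be made concrete with a short sketch. It uses the standard output-size relation for a sliding window along one axis; the function names are hypothetical. The 100-pixel input, the kernel size of 3, and the batch size of 256 are taken from the experiments reported later in the paper. Using the 410-image training split as the per-epoch sample count is an assumption.

```python
import math


def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window.

    Standard relation: floor((n + 2 * padding - kernel) / stride) + 1.
    """
    return (n + 2 * padding - kernel) // stride + 1


def batches_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to pass once over the training data."""
    return math.ceil(num_samples / batch_size)


# A 100-pixel axis with a kernel of size 3.
no_pad = conv_output_size(100, kernel=3, stride=1, padding=0)    # output shrinks
same_pad = conv_output_size(100, kernel=3, stride=1, padding=1)  # output preserved
strided = conv_output_size(100, kernel=3, stride=2, padding=0)   # stride 2 halves it

# A 2 x 2 pooling window with stride 2 applied to the unpadded output.
pooled = conv_output_size(no_pad, kernel=2, stride=2)

# Assumed sample count: the 410-image training split with a batch size of 256.
steps = batches_per_epoch(410, 256)
```

With these values the unpadded convolution yields 98 pixels per axis, one pixel of zero padding restores 100, a stride of 2 yields 49, and one epoch consists of 2 batches.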

Model Evaluation
Once the image processing and the implementation of the T-CNN-ASD model for image classification have been completed, the next step is to determine the evaluation method for the autism classification task. Various performance metrics are utilized to evaluate different machine learning algorithms. One commonly used measure is accuracy, which is computed from the confusion matrix and allows different machine learning classifiers to be compared. Given that the dataset is balanced, accuracy is chosen as the metric to assess the model's performance. Accuracy is calculated as the ratio of correctly classified examples to all examples, as shown in Equation (1). Given that the research pertains to the medical field, it is essential to incorporate the following metrics: sensitivity, indicated by Equation (2), specificity, represented by Equation (3), and the F1 score, illustrated in Equation (4).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Furthermore, to assess the model's performance, the mean accuracy is compared to that of other models such as MLP, random forest, decision tree, and KNN.
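As a minimal sketch, the four metrics can be computed from the confusion-matrix counts as follows. Accuracy follows Equation (1). Sensitivity, specificity, and the F1 score use their standard textbook definitions, which is an assumption about the exact form of Equations (2) to (4). The function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Equation (1)
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Sensitivity measures how many ASD cases are detected, and specificity measures how many TD cases are correctly ruled out. In a screening setting, reporting both alongside accuracy shows which kind of error the model makes.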

Experiments and Results
This section provides a discussion of the experiments conducted and the corresponding results.

Environment Setup
The experiments were carried out using a machine with the following specifications: Intel Core i7 10th GEN processor, 64-bit operating system, X64-based processor, 16 GB RAM, NVIDIA GeForce MX230 graphics card, and 1 TB secondary storage. The machine ran on Windows 10 Pro. Preprocessing steps and T-CNN-ASD were carried out using Python IDLE 3.6.8. The main Python packages used were Keras (based on TensorFlow), OpenCV, Sklearn, and Matplotlib. To standardize the images, they were scaled down to 100 × 100 pixels and converted to grayscale using various image-processing techniques. Additionally, the images were compressed by converting them from 3D to 1D. The results were obtained without using augmented images. The training dataset consisted of 410 images (75% for training), while the test dataset had 137 images (25% for testing).
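The split sizes and the flattening step can be checked with a short sketch. It assumes a total of 547 images (410 + 137) and interprets the conversion to 1D as flattening each 100 × 100 grayscale image into a single vector. Rounding the test share up reproduces the reported counts, but whether the authors split the data this way is an assumption, and the function names are hypothetical.

```python
import math


def split_sizes(n_total, test_fraction=0.25):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_total * test_fraction)
    return n_total - n_test, n_test


def flatten_image(gray_image):
    """Flatten a 2D grayscale image (list of rows) into a 1D list of pixels."""
    return [pixel for row in gray_image for pixel in row]


n_train, n_test = split_sizes(547)

# Placeholder 100 x 100 grayscale image filled with zeros.
blank = [[0] * 100 for _ in range(100)]
vector = flatten_image(blank)
```

Each flattened image has 10,000 entries, one per pixel.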

Sensitivity Analysis
In order to achieve optimal performance and obtain the desired outcomes, the proposed model necessitates the adjustment of multiple hyperparameters.
In this research, different configurations of neurons and layers were utilized, alongside the tuning of hyperparameters, to find the best accuracy. Various architectures were examined by adjusting different model parameters, with the goal of identifying the most efficient architecture with the optimal parameters.
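The experiments that follow appear to vary one hyperparameter at a time while holding the others at a reference configuration. A minimal sketch of such a sweep is given below. The baseline values are those of the best configuration reported in this paper. The kernel size of 5 is a hypothetical candidate added for illustration, and the function and variable names are assumptions.

```python
def one_at_a_time(baseline, candidates):
    """Build configurations that change a single hyperparameter from the baseline."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            # Skip duplicates, e.g. when a candidate equals the baseline value.
            if config not in configs:
                configs.append(config)
    return configs


baseline = {
    "hidden_layers": (300, 150),
    "kernel_size": 3,
    "stride": 1,
    "dropout": 0.2,
    "batch_size": 256,
    "epochs": 100,
}

candidates = {
    "kernel_size": [3, 5],   # 5 is a hypothetical alternative
    "stride": [1, 2],
    "dropout": [0.2, 0.5],
}

configs = one_at_a_time(baseline, candidates)
```

Each configuration would then be trained and scored with cross-validation, and the best value for each hyperparameter kept.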

Results of Different Kernel/Filter Size:
The training of the model involved 10 rounds of cross-validation, with each round consisting of 100 epochs and a dropout rate of 20%. Table 2 displays the accuracy achieved for various kernel sizes. The T-CNN-ASD model reached a peak accuracy of 95.59% by employing two layers with 300 and 150 neurons, respectively. The kernel size for these layers was set to 3.

Results of Different Stride:
The model was trained using 10 rounds of cross-validation, with 100 epochs and a 20% dropout rate. The kernel size was 3, and the stride was initially set to 1 and then changed to 2. The results can be seen in Table 3. With two layers of 300 and 150 neurons and a stride of 1, the T-CNN-ASD model achieved its best accuracy of 95.59%.

• Pooling Layers:
The training of the model was conducted using 10 rounds of cross-validation. Each round consisted of 100 epochs with a dropout rate of 20%. The kernel size used was 3. After applying max pooling, average pooling was applied to a single depth slice with a stride of 1. The details are presented in Table 4. It is worth noting that the T-CNN-ASD model attained its peak performance at 95.59% accuracy by utilizing two layers comprising 300 and 150 neurons, respectively, in addition to employing max pooling.
• Dropout: In this study, three layers were utilized with varying neuron counts to identify the optimal architecture and performance. The accuracy results for the T-CNN-ASD model, with different layer and neuron configurations, along with 5 and 10 rounds of cross-validation, are presented in Table 5. The dropout rate was initially set at 20% and was later increased to 50%. Additionally, a kernel size of 3 was employed throughout the experiment. The T-CNN-ASD model with two hidden layers of 300 and 150 neurons, 10 rounds of cross-validation, and a dropout rate of 20% achieved the highest performance, with an accuracy of 95.59%.
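The two dropout rates compared above differ only in the fraction of units that are silenced during training. The sketch below uses the common "inverted dropout" formulation, in which the surviving activations are rescaled so that their expected value is unchanged. It is an illustrative assumption rather than the exact Keras layer used in the experiments, and the function name, the seed, and the 300 unit activations are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        # At inference time every unit is kept and nothing is rescaled.
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Kept units are scaled by 1 / keep to preserve the expected activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)              # fixed seed so the sketch is repeatable
layer_out = [1.0] * 300             # hypothetical outputs of a 300-neuron layer
dropped = dropout(layer_out, rate=0.2, rng=rng)
at_inference = dropout(layer_out, rate=0.2, training=False)
```

With a rate of 20%, roughly one unit in five is zeroed on each call; with 50%, roughly half are, which removes more capacity from a small network during training.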

Results
In the first experiment, the effect of batch size on T-CNN-ASD was investigated. The model was trained using 10 rounds of cross-validation, with 100 epochs, a dropout rate of 20%, and a kernel size of 3. The results are shown in Table 6. The best performance was 95.59%, obtained when the T-CNN-ASD model included two layers of 300 and 150 neurons with a batch size of 256.
The second experiment aimed to examine the impact of varying numbers of epochs on the training of the proposed model. The model was trained using 10 rounds of cross-validation. The batch size was set to 256, with a dropout rate of 20% and a kernel size of 3. The results for different epochs are presented in Table 7. The T-CNN-ASD model achieved the highest performance of 95.59% with a configuration of two layers consisting of 300 and 150 neurons, trained over 100 epochs.
In the third experiment, a total of six optimizers were used, each with a unique approach. The Adam optimizer was used in all previous findings. Table 8 presents the accuracy of T-CNN-ASD for various optimizers, employing 10 rounds of cross-validation. The experiment incorporated a dropout rate of 20%, a batch size of 256, and a kernel size of 3. Excluding Adam, the best performance was 95.07%, obtained when the T-CNN-ASD model included two layers of 100 and 50 neurons with the Nadam optimizer. Therefore, according to the previous results, the Adam optimizer gave the best accuracy of 95.59%.
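The 10 rounds of cross-validation used in these experiments can be sketched as follows. The sketch partitions sample indices into folds without shuffling or stratification, which a practical pipeline would normally add. Applying it to all 547 images is an assumption, since the text does not state which subset was cross-validated, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single score."""
    return sum(fold_accuracies) / len(fold_accuracies)


folds = list(k_fold_indices(547, k=10))
validation_sizes = [len(validation) for _, validation in folds]
```

Every sample is used for validation exactly once, and the reported score for a configuration is the mean of the fold accuracies.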

Comparison of T-CNN-ASD with ML Models
To showcase the effectiveness of our model, we performed a comparison with various machine learning models such as K-Nearest Neighbors (KNN), decision tree, random forest, and MLP. Additionally, the research focuses on analyzing eye-tracking scan paths for children with ASD, which is relevant to the medical field. Hence, the study incorporates the evaluation metrics of accuracy, sensitivity, specificity, and F1 score. The results of this comparison are presented in Tables 9 and 10. According to the findings, the model achieves the highest accuracy of 95.59%, a sensitivity of 77.60, a specificity of 79.91, and an F1 score of 78.73 when using grayscale images. These results indicate the superior performance of our model compared to the others.

Comparison of T-CNN-ASD with Other Studies in the Literature
The researchers in [10] conducted a study on the combination of eye-tracking technology and machine learning (ML) to aid in the diagnosis of autism spectrum disorder (ASD). At first, they tested non-neural network methods such as Naive Bayes, Logistic Regression, Support Vector Machines (SVM), and random forests. Later on, they experimented with different artificial neural network (ANN) architectures.
To compare the models, the same ANN architectures were used in this paper. The models were then trained using 10 rounds of cross-validation, with 100 epochs and 20% dropout. The results are presented without augmented images because the models train faster and the results are more reliable. In addition, both color and grayscale images were used to find the best accuracy. Table 11 illustrates the comparison between their findings and our approach. From Table 9, the T-CNN-ASD model with grayscale images and two layers of 80 and 40 neurons has the best accuracy of 94.69%. The proposed model outperforms the other approaches because the flexible architectures of deep convolutional neural networks (CNNs) have been used successfully for image denoising, and their built-in convolutional layers reduce the high dimensionality of images without losing information [30].
Table 12 presents a comparison of the results of previous research work with the best results obtained for the ML and T-CNN-ASD models.

Discussion
The proposed T-CNN-ASD model was constructed using various architectures. Among the comparison results, the optimal outcome was achieved when the model architecture comprised two layers with 300 and 150 neurons, respectively. Subsequently, the crucial parameters influencing the performance were fine-tuned in the model, including:

1. Various kernel sizes were employed, and it was observed that smaller kernel sizes tend to outperform larger ones. One advantage of favoring small kernel sizes over fully connected networks is the reduction in computational costs and weight sharing, resulting in fewer weights for back-propagation. Conversely, larger kernel sizes are avoided due to their excessively long training time and associated cost, and they may lead to the loss of image details [23]. On the other hand, using small kernels aids in detecting small features and capturing image details. It is worth noting that odd kernel sizes are preferred over even ones, such as 2 by 2 or 4 by 4. This preference arises because, with an odd kernel, the pixels of the previous layer are arranged symmetrically around the output pixel. The absence of this symmetry necessitates additional calculations in the layers when employing even kernel sizes [6]. Nonetheless, some studies have successfully utilized even kernel sizes and achieved satisfactory results.

2. With the kernel size fixed at its optimal value of 3, the kernel was moved with a stride of 1 and then with a stride of 2. Upon comparing the outcomes, it was discovered that moving the kernel by 1 step yielded superior results. This can be attributed to the fact that incrementally shifting the kernel aids in more effectively extracting the features.

3. The pooling layer is employed to decrease the spatial size of the input image following convolution. It is positioned between two convolution layers. Applying a fully connected layer directly after the convolution layer without pooling would be computationally intensive. Hence, max pooling and average pooling are employed to reduce the spatial size of the input image [26]. Average pooling smooths the image, so sharp features may be lost. Max pooling, on the other hand, retains the brightest pixel in each patch. This is particularly useful when the background of the image is dark, as in the images utilized in this research.

4. Dropout is a technique commonly employed to mitigate overfitting in neural networks. It involves randomly deactivating neurons, so that the remaining neurons must compensate and make predictions in place of the deactivated ones. Consequently, the neurons within the layers learn their weights independently and do not rely on collaboration with neighboring neurons, thereby reducing inter-neuron dependency. This reduction in dependency helps to alleviate overfitting. In our study, we experimented with different dropout rates, specifically the lowest rate of 20% and the highest rate of 50%, to determine the optimal value. Based on the obtained results, a dropout rate of 20% yielded the best performance for the T-CNN-ASD model.

5. Additionally, the most optimal outcomes were noted when the batch size was set to 256. Increasing the batch size allows our model to complete each epoch more rapidly during training. However, it is important to consider that using excessively large batches may compromise the quality of the model and hinder its ability to effectively generalize to unseen data [32]. Consequently, the batch size is a hyperparameter that requires careful testing and adjustment based on the model's performance during training.

6. Furthermore, it is important to note that selecting the correct number of epochs is crucial. Using a limited number of epochs can lead to underfitting as the neural network does not have sufficient learning time. Conversely, using an excessive number of epochs may result in the model performing well on the training data but struggling to accurately predict new, unseen data. In this study, the optimal number of epochs was determined to be 100.

7. In terms of optimizers, the model was tested to find the best one, and a total of six optimizers were employed. Each optimizer has its own unique approach to handling the dataset. Based on the previous findings, the Adam optimizer yielded the highest accuracy. Additionally, it was noted that the learning process was more consistent when using the Adam optimizer.
Based on the findings of this study, T-CNN-ASD demonstrated superior performance compared to other algorithms. Specifically, a two-layer architecture consisting of 300 and 150 neurons, 10 rounds of cross-validation, a dropout rate of 20%, and a kernel size of 3 achieved the highest accuracy.
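The symmetry argument for odd kernel sizes made in the discussion of kernel sizes can be illustrated with a small sketch. For a stride of 1, keeping the output the same size as the input requires a total zero padding of kernel size minus 1 along each axis. The sketch splits this total between the two sides; placing any leftover pixel on the trailing side is a convention assumed here, and the function names are hypothetical.

```python
def same_padding(kernel_size):
    """Zero padding (before, after) that preserves the input size at stride 1."""
    total = kernel_size - 1
    before = total // 2
    return before, total - before


def is_symmetric(kernel_size):
    """True when the padding is equal on both sides of the axis."""
    before, after = same_padding(kernel_size)
    return before == after


paddings = {k: same_padding(k) for k in (2, 3, 4, 5)}
```

Odd kernels such as 3 and 5 can be centered on the output pixel with equal padding on both sides, whereas even kernels such as 2 and 4 need one more padded pixel on one side than on the other.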

Conclusions and Future Work
ASD is commonly regarded as a childhood disorder, making it crucial to identify ASD in children at an early stage in order to provide early intervention and enhance developmental milestones. However, diagnosing this psychological disorder is challenging and time-consuming, particularly when observing the child's behavior. Therefore, there is a need to enhance the tools used for early detection of ASD. For these detection tools to be effective, they should be brief, easily understandable for parents, and easy to administer by healthcare professionals. The accuracy of these tools is also crucial for early detection. One tool that is increasingly being utilized in ASD clinics and research is eye tracking, which is a non-invasive and convenient method of measurement. Due to the atypical visual attention of children with ASD, particularly towards human faces, there are significant differences in visual attention between children with ASD and typically developing (TD) children.
This study introduces a deep learning model, known as T-CNN-ASD, that utilizes eye-tracking scans to classify participants into ASD and TD groups. Convolutional neural networks (CNNs) have shown excellent performance in applications involving image data. The model is applied to a publicly available dataset. The research demonstrates that the T-CNN-ASD architecture, along with fine-tuning hyperparameters, effectively extracts features from eye-tracking scan path images. In the testing phase, the model achieves an accuracy of 95.59%, surpassing the accuracy of other machine learning algorithms such as random forest, decision tree, KNN, and MLP. Furthermore, it outperforms previous work conducted on the same dataset. Research suggests that ASD symptoms can be reflected in eye-tracking data in children as young as 6 to 24 months old, well before behavioral traits related to ASD become apparent. Therefore, accurately recording eye-tracking data from toddlers can reveal early indicators of ASD using the proposed model.
Future work can focus on implementing various improvements to enhance the performance of the models. One limitation of this research is the small number of participants in the dataset. Utilizing a larger dataset in the future can help build a more robust model and improve accuracy while also considering the severity of autism. Additionally, transfer learning, a popular approach in deep learning, can be employed to increase accuracy by reusing a model developed for one task as the starting point for another task.

Figure 3. Visualization of eye-tracking scan paths. The first image represents a participant without ASD, while the second image is for an ASD-diagnosed participant.

Figure 4. The image is processed by reducing the image size from 640 × 680 to 100 × 100, then the image is converted to grayscale, and after that, the image dimensions are reduced.

Figure 9. Convolution of the image with kernel and stride 1 (weights in the kernel are the parameters to be trained).

Figure 12. Average pooling with filter 2 × 2 and stride 2. Calculates the average value for each area on the feature map.

Figure 13. The stride moves the filter one step.

Figure 14. Dropout drops out neurons temporarily at random, and other neurons intervene to make predictions.

Table 1. Information about the participants.

Table 2. T-CNN-ASD accuracy with different kernel size.

Table 3. T-CNN-ASD accuracy with different stride.

Table 4. T-CNN-ASD accuracy with different pooling.

Table 5. T-CNN-ASD accuracy with different dropout and folds.

Table 6. Accuracy with different BS.

Table 7. The accuracy at various epochs.

Table 9. The outcomes of the T-CNN-ASD method for grayscale images and alternative machine learning algorithms.

Table 10. The T-CNN-ASD approach's results for color images are contrasted with those of other frequently used machine learning techniques.

Table 12. The accuracy of previous work, MLP, and T-CNN-ASD.