A Mixed Statistical and Machine Learning Approach for the Analysis of Multimodal Trail Making Test Data

Abstract: Eye-tracking can offer clinical practice a novel, non-invasive tool to detect neuropathological syndromes. In this paper, we present an analysis of data obtained from the visual sequential search test. Such a test can be used to evaluate the capacity to look at objects in a specific order, and its successful execution requires the optimization of the perceptual resources of foveal and extrafoveal vision. The main objective of this work is to detect whether patterns can be found within the data that discern among people with chronic pain, extrapyramidal patients and healthy controls. We employed statistical tests to evaluate differences among groups, considering three novel indicators: blinking rate, average blinking duration and maximum pupil size variation. Additionally, to divide the three patient groups based on scan-path images (which appear very noisy and all similar to each other), we applied deep learning techniques to embed them into a larger transformed space. We then applied a clustering approach to correctly detect and classify the three cohorts. Preliminary experiments show promising results.


Introduction and Related Work
Eye-tracking offers a fundamental tool to process and analyse human brain behaviour by detecting eye position and movement speed [1]. Moreover, eye movements could in principle be used to highlight the presence of pathological states, and considerable research has recently been performed in this direction [2,3]. In the last decades, Machine Learning (ML) has been widely applied to many different research fields [4][5][6][7] and, in particular, some examples of its use for Trail Making Test (TMT) data analysis can be found in the literature. For instance, in [8], an approach based on random forests, decision trees and Long Short-Term Memories (LSTMs) was proposed to detect the presence of a pathological state in the tested subjects. In particular, 60 patients were recruited in that study, 24 of whom presented brain injury and 36 vertigo episodes. Similarly, in [9], the eye-tracking test was used to analyse children diagnosed with autism spectrum disorder (ASD), in order to establish a quantitative relationship between their gaze performance and their ability in social communication. Indeed, in the same study, eye gaze-tracking was proposed as a possible non-invasive, quantitative biomarker for children with ASD. Finally, a vast literature exists on applications of eye-tracking tests to detect depression syndromes [10][11][12][13], and eye-tracking studies have proved their efficacy in the diagnosis of other common neurological pathologies, such as Parkinson's disease, brain trauma and neglect phenomena [14][15][16][17], while ML techniques have recently been applied to process TMT data for the detection of Alzheimer's disease [18,19].
In [20], a new experiment based on the TMT was proposed, called the Visual Sequential Search Test (VSST). In a standard TMT experiment, a subject is presented with a sheet of numbers and letters arranged in a random manner and is asked to connect, using a pen, numbers and letters in progressive and alternating order. In the VSST setting, the patients are required to carry out the same task based only on eye movements. Human visual search [20,21] is, in fact, a common activity that enables humans to explore the external environment to make everyday life decisions. Indeed, sequential visual search should use a peripheral spatial scene classification technique to locate the next target of the sequence in the correct order, a strategy which, as a byproduct, could also improve the discriminatory ability of human peripheral vision and save the neural resources associated with foveation. With respect to the cohorts of patients under examination, data were collected from people with chronic pain, extrapyramidal patients and healthy controls. In particular, individuals affected by extrapyramidal symptoms suffer from tremors, rigidity, spasms, decline in cognitive functions (dementia), affective disorders, depression, amnesia, involuntary and hyperkinetic jerky movements, slowing of voluntary movements such as walking (bradykinesia), and postural abnormalities. Conversely, there are several mechanisms underlying chronic pain: most often, an excessive and persistent stimulation of the "nociceptors" or a lesion of the peripheral or central nervous system; however, there are also forms of chronic pain that do not seem to have a real, well-identified cause (neuropathic pain). Therefore, chronic pain can be related to a variety of diseases of very different severity, from depression to chronic migraine and cancer.
In [22], an algorithmic approach for the analysis of the VSST, based on the episode matching method, is proposed. In this paper, instead, we analyse the VSST data from a different perspective, examining both the blinking behaviour and the pupil size of the subjects and the frozen images of the scan-paths captured during the test, to gain an insight into the patients' condition and to offer support for clinical practice. For this purpose, we compared several indicators to distinguish among classes of patients. A first preliminary analysis was performed to highlight statistical differences based on pupil-derived measures. Such analysis showed the presence of statistically diversified behaviours among healthy, chronic pain and extrapyramidal subjects. Moreover, we implemented a Deep Learning (DL) autoencoder architecture, with a U-Net backbone [23], to reconstruct the trajectory images of the three groups of individuals. Subsequently, as a proof of concept, we analysed the latent embedding representations using the K-means clustering algorithm, to verify the presence of clusters corresponding to the three cohorts of patients. Preliminary experiments actually evidence well-defined phenotypical groups in the latent space.
The paper is organised as follows. In Section 2, the VSST and the dataset used are described, together with the statistical methodologies and the DL approach employed for analysing pupils and image data. In Section 3, we summarise and discuss the obtained results. Finally, Section 4 collects some conclusions and traces future work perspectives.

Visual Sequential Search Test
The Trail Making Test is used in clinical practice as a neuropsychological assessment of visual attention and task switching. The test investigates the subject's attentive abilities and the capability to quickly switch from a numerical to an alphabetical visual stimulus. Successful performance of the TMT requires a variety of mental abilities, including letter and number recognition, mental flexibility, visual scanning and motor function [24].
The research described in this paper used an oculomotor-based methodology, called eye-tracking, to study cognitive impairments in patients affected by chronic pain and extrapyramidal syndrome. Eye-tracking is in fact a promising way to carry out this kind of cognitive tests, allowing the recording of eye movements, to determine where a person is looking, what the person is looking at and for how long the gaze remains in a particular spot. More precisely, an eye-tracker uses invisible near-infrared light and high-definition cameras to project the light into the eye and record the direction in which it is reflected by the cornea [25]. Advanced algorithms are then used to calculate the position of the eye and determine exactly where it is in focus. This makes it possible to measure the visual behaviour and fine eye movements and allows for a more subtle exploration of cognitive dysfunction in a range of neurological disorders.
Several different eye-tracking devices exist, for example, the screen-based eye-tracker [26]. This type of test requires respondents to sit in front of a monitor and to interact with screen-based content. In the experiments described in this paper, we made use of a special type of TMT experiment, namely the Visual Sequential Search Test. Such a test was created to study top-down visual search, which can be summarised as a series of saccades and fixations. In particular, the VSST consists of a repeated search task, in which patients are asked to connect, with their gaze, a logical sequence of numbers and letters. Here, the required task is to follow, with the movement of the eyes, the alphanumeric sequence 1-A, 2-B, 3-C, 4-D, 5-E, as shown in Figure 1.

VSST Experimental Dataset Description
Three types of individuals were recruited for the experiments, namely: 46 patients with extrapyramidal syndrome, 284 patients affected by chronic pain and 46 healthy controls. For each person, the eye-tracker provided the gaze position and the pupil size over time. Regular eye movements alternate between saccades and visual fixations. A fixation consists in maintaining the visual gaze on a single location. A saccade, instead, is a quick, simultaneous movement of both eyes, in the same direction, between two or more fixation phases. In case of blinking, the device loses the signal, which results in NaN values being recorded in our dataset, both for the position (x, y) on the screen and for the pupil size.
Data preprocessing was necessary before proceeding with the analysis. In particular, we deleted the part of the experiment not referring to the image labelled "TMT stimulus" (Figure 1), we made the timing uniform so as to have timestamps exactly every 4 ms, and we removed all artefacts and noisy information from the dataset (e.g., repeated rows).
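As an illustration, the duplicate removal and timestamp regularisation can be sketched with pandas; note that the column names (`t_ms`, `x`) and the function itself are hypothetical, not the actual device output or the code used in the study:

```python
import numpy as np
import pandas as pd

def preprocess_gaze(df: pd.DataFrame) -> pd.DataFrame:
    """Drop repeated rows and resample a recording onto a uniform
    4 ms grid. Column names (t_ms, x) are illustrative only."""
    df = df.drop_duplicates()                           # remove repeated rows
    df = df.set_index(pd.to_timedelta(df["t_ms"], unit="ms"))
    df = df.resample("4ms").mean()                      # timestamps every 4 ms
    return df.reset_index(drop=True)
```

Bins with no recorded sample simply contain NaN values, mirroring how blinks appear in the raw signal.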

ETT Image Dataset
To generate the 2D images of the gaze trajectories, the size of the left pupil and the average positions of the gaze along the horizontal and vertical axes were extracted from the preprocessed numerical data acquired by the eye-tracker during the experiments. In this context, pupil size values equal to NaN correspond to eye blinks and to movements recorded while the eye was closed. Therefore, these data were removed from the trajectories, as shown in Figure 2.
The ETT (Eye-Tracking Trajectory) dataset is composed of images of dimension 1920 × 1080 pixels, with a single colour channel. Each binary image in the dataset consists of a black background (value 0) on which the pixels corresponding to the gaze trajectory appear in white (value 255). In other words, each image is a binarised single-channel (greyscale) image, with intensity values belonging to {0, 255}. No smoothing operation was performed on the trajectories, in order to preserve the original data information. ETT includes 376 images (46 healthy controls, 46 extrapyramidal patients and 284 chronic pain patients). To generate the dataset, we made use of the MATLAB 2021a software [27].
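A minimal NumPy sketch of how such a binary trajectory image can be rendered (the function name and its arguments are our own illustration, not the MATLAB code used for ETT):

```python
import numpy as np

def trajectory_image(xs, ys, pupil, w=1920, h=1080):
    """Render gaze samples as a single-channel binary image:
    background 0, trajectory pixels 255. Samples whose pupil
    size is NaN (blinks, eye closed) are discarded, as in ETT."""
    img = np.zeros((h, w), dtype=np.uint8)
    valid = ~np.isnan(np.asarray(pupil, dtype=float))
    cols = np.clip(np.round(np.asarray(xs)[valid]).astype(int), 0, w - 1)
    rows = np.clip(np.round(np.asarray(ys)[valid]).astype(int), 0, h - 1)
    img[rows, cols] = 255          # no smoothing: raw gaze pixels only
    return img
```

Clipping keeps occasional off-screen gaze estimates inside the image frame.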

Statistical and Deep Learning Methods for the VSST Data Analysis
In the following subsections, we describe how the two sources of information from the VSST can be processed. On the one hand, the behaviour of the population of our patients is analysed with statistical methods applied to the morphological characteristics of the pupil and to the blinking frequency. On the other hand, the images belonging to the ETT dataset were preprocessed on the basis of a DL method, to obtain a latent representation that allows us to adequately group the three cohorts of examined individuals.

Statistical Methods
The following markers were extracted for each patient: the difference between the maximum and the minimum value of the pupil size (averaged over the right and left eye), the blinking rate (i.e., the number of blinks per second) and the average blinking duration. For each of these continuous variables, its distribution over the three classes of patients was computed. Afterwards, a Kruskal-Wallis test [28] was performed. This nonparametric test is used to verify whether samples originate from the same population (or from populations with equal medians). The test has been extensively used in several statistical applications, and has proved to be a very powerful alternative to parametric tests [29]. It can compare two or more independent samples of different sizes, testing the null hypothesis H0: λ1 = λ2 = … = λk, where λi is the median of the ith sample distribution.
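With SciPy, the test reads as follows; the three samples below are synthetic stand-ins with the cohort sizes of the study, not the actual measurements:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic "blinking rate" samples (values are illustrative only)
healthy  = rng.normal(0.20, 0.05, 46)
chronic  = rng.normal(0.35, 0.05, 284)
extrapyr = rng.normal(0.21, 0.05, 46)

# kruskal accepts two or more independent samples of different sizes
H, p = kruskal(healthy, chronic, extrapyr)
```

A p-value below the chosen significance level leads to rejecting H0, i.e., to concluding that at least one group's median differs.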
In order to detect and remove the outliers of each distribution, we applied the unimodal Chebyshev theorem with γ = 3 [30], which results in keeping at least 89% of the values around the mean. We then repeated the statistical analysis with the cleaned samples. Finally, a bootstrapping method was performed to avoid potential effects on the results due to the difference between the sample size of the chronic pain cohort and that of the other two groups.
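The outlier removal step can be sketched as follows (a minimal numeric sketch; the function name `chebyshev_clean` is our own):

```python
import numpy as np

def chebyshev_clean(x, gamma=3.0):
    """Keep samples within gamma standard deviations of the mean.
    Chebyshev's inequality guarantees that at least 1 - 1/gamma**2
    of any distribution lies in this band (about 89% for gamma = 3)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return x[np.abs(x - mu) <= gamma * sigma]
```

Values outside the μ ± γσ band are treated as outliers and dropped before re-running the test.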

Deep Learning Modelling: Autoencoders and the U-Net Architecture
Deep learning has reached state-of-the-art performance in image processing and analysis for a wide range of applications; in particular, it gives excellent results in tasks such as image classification, detection and segmentation [31][32][33]. In the present work, we made use of a U-Net based architecture, a DL model well known for its very good performance in image processing tasks [34]. This model was originally proposed as an efficient and fast way to perform biomedical image segmentation [34]. The architecture is composed of several convolutional layers, which take the original image as input and produce its segmentation maps. It is based on an encoder-decoder structure and can, therefore, also be successfully used to perform image reconstruction. As a matter of fact, in our paper, we trained the U-Net based architecture to reproduce the original image at the output, obtaining a network capable of reconstructing the input images (see Figure 3). The architecture used in the present work can thus be viewed as a deep learning, self-supervised autoencoder, made of a downsampling stage (encoder) and an upsampling stage (decoder). The overall scheme of the proposed deep learning architecture is depicted in Figure 4. In the encoder stage of the network, the spatial dimension is reduced by convolutional blocks followed by a maxpool downsampling layer, while the channel dimension is increased, so as to encode the input image into hidden representations at multiple levels, by means of a series of convolutional layers whose filters produce the so-called feature maps. A single feature map provides an insight into the internal representation of the specific input at each convolutional layer of the model, capturing specific information from the input data, such as curves, straight lines or combinations of them.
The decoder stage, instead, increases the latent spatial dimension while reducing the number of channels, using convolutional blocks followed by an upsampling layer. Generally, the U-Net architecture implies a series of concatenation operations between the output of a layer of the encoder and the input of the corresponding layer in the decoder, by means of residual connections. As the model used here acts as an autoencoder, the residual connections have been eliminated, so that the decoder can use only the output of the encoding stage, without including in the reconstruction phase the additional information carried by this type of connection. This also allows us to avoid overfitting of the reconstruction network, given the small number of images available. More specifically, 1024 feature maps, each of size 67 × 120 pixels, are obtained from a binarised image of shape 1920 × 1080 × 1, as shown in Figure 5. As a single feature map captures certain aspects of the input data, all of the aforementioned 1024 feature maps were flattened and concatenated to obtain the image embedding representation, of shape 1 × 8,232,960. Some examples of the intermediate representations obtained for a random image of each class are shown in Figure 6.
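The shape bookkeeping above can be checked directly, assuming four 2 × 2 maxpool stages with floor division (a sketch of the arithmetic only, not of the model itself):

```python
# Four 2x2 maxpool stages shrink the 1080 x 1920 input spatially
# (with floor division), while the channel count grows to 1024
# at the bottleneck of the encoder.
h, w = 1080, 1920
for _ in range(4):
    h, w = h // 2, w // 2      # 1080 -> 540 -> 270 -> 135 -> 67
                               # 1920 -> 960 -> 480 -> 240 -> 120
feature_maps = 1024
embedding_dim = feature_maps * h * w   # flattened, concatenated maps
```

The final spatial size is 67 × 120 and the flattened embedding has 1024 × 67 × 120 = 8,232,960 components, matching the figures quoted in the text.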
The model was developed in Python version 3.9.5 with Tensorflow 2.4.0 (Keras backend), and trained using the Adam optimizer with an initial learning rate of 10⁻⁴. All the experiments were performed on a Linux-based machine equipped with an Intel Core i9-10920X CPU, 128 GB of DDR4 RAM and a Titan RTX GPU with 24 GB of GDDR6 VRAM.

K-Means Clustering
As a proof of concept, we performed clustering both in the original space and in the latent space obtained with our U-Net based model. This in fact allows us to show the ability of the image reconstruction architecture to efficiently compress and preserve the information contained in the original images, producing latent representations in which the three different cohorts are easier to distinguish [35]. With this intent, we used the K-means algorithm [36], one of the best-known and most widely used partitional clustering methods. The algorithm is based on an optimisation process whose aim is to minimise the intra-cluster variance. The number of clusters, K, needs to be specified in advance. On the first iteration, K clusters are created; thereafter, the representatives of each cluster are recalculated iteratively, until convergence. We used the Scikit-learn Python implementation (Version 3.1).
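A minimal Scikit-learn sketch of this step, run on synthetic, well-separated "embeddings" (the data below are illustrative, not the actual latent representations):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic groups of 46 "embeddings" each,
# mimicking the three balanced cohorts
X = np.vstack([rng.normal(c, 0.1, size=(46, 8)) for c in (0.0, 1.0, 2.0)])

# K must be fixed in advance; here K = 3, one cluster per cohort
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

When the representations are well separated, each group of 46 points falls entirely within one of the three clusters.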
The K-means algorithm was applied to the input data (belonging to the ETT dataset) as well as to the reconstructed data: in both settings, all the examples were grouped into a single cluster, except for only three examples, all belonging to the chronic pain category, which were assigned to the other two clusters. A different strategy was then applied, based on the latent space representations and K-means, to determine whether collecting the different feature maps resulting from image compression could lead to a correct grouping of the three categories of patients.

Statistical Analysis of Pupil and Blinking Data
First, the Kruskal-Wallis test was applied to the distributions of the blinking rate, the maximum pupil size variation and the mean blinking duration, to compare the three groups. In Figure 7, we show the distributions of the three indicators for healthy, chronic and extrapyramidal individuals. Similarly, in Figure 8, we show boxplots for the three indicators, comparing the three groups. The chosen level of statistical significance is p = 0.05.
As shown in Table 1, no significant differences were found between healthy subjects and patients affected by extrapyramidal syndrome for any of the three indicators. Conversely, a statistically significant difference between healthy controls and chronic pain patients was found for the blinking rate and the pupil size variation. Concerning the comparison between patients affected by chronic pain and by extrapyramidal syndrome, a significant difference was detected both in the maximum pupil size variation and in the average blinking duration. Table 1 also reports the value of H, the Kruskal-Wallis test statistic, whose distribution under the null hypothesis is approximated by a χ² (chi-square) distribution.

Outliers Detection and Kruskall-Wallis Test
A further step of the analysis, as described in Section 2.4.1, consisted in repeating the Kruskal-Wallis test on the distributions without outliers. The Chebyshev outlier detection method uses the Chebyshev inequality to calculate the upper and lower outlier detection limits; data values outside this range are considered outliers. Outliers could be due to an incorrect acquisition procedure, or they could indicate that the data are correct but highly unusual. The results of the Kruskal-Wallis test on the cleaned distributions are reported in Table 2.
The analysis based on the cleaned samples confirmed the previous results: all the significant p-values remained significant and, in general, even decreased. As an effect of this reduction, the difference between Healthy and Chronic subjects in the average blinking duration became significant. Although the Kruskal-Wallis test is designed for groups of different sample sizes, the larger number of chronic patients compared to the other two classes may affect the results to some extent. To avoid this kind of bias, we performed the following analysis. A sample of chronic patients with a size equal to that of the other groups (46 patients) was randomly selected from the original distribution, and the Kruskal-Wallis test was then applied. This resampling operation was repeated 10,000 times. Table 3 reports the percentage of p-values less than 0.05 over the 10,000 runs. The bootstrapping procedure allows us to validate the results in Table 1. Indeed, only the comparison between Healthy and Chronic patients with respect to the pupil size variation has a percentage of significant p-values below 50% (in particular, 48.06%), while this indicator had shown to be significant in the previous experiments. Therefore, we can conclude that, based on the considered indicators, healthy and extrapyramidal subjects look indistinguishable, while chronic pain patients behave significantly differently. This is not an astonishing result, as neurophysiological studies [37] suggested that painful electrical stimulation is associated with consistent alterations in eye muscle activity. Moreover, altered results of the Blink Reflex (BR) test normally indicate a dysfunction in the brain stem and trigeminovascular connections of patients with migraine headache, supporting the trigeminovascular theory of migraine [38].
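The resampling scheme can be sketched as follows; the samples are synthetic stand-ins, and we run 200 iterations instead of 10,000 to keep the sketch fast:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic indicator values (illustrative only)
healthy  = rng.normal(0.20, 0.05, 46)
extrapyr = rng.normal(0.21, 0.05, 46)
chronic  = rng.normal(0.35, 0.05, 284)   # over-represented cohort

n_runs, n_sig = 200, 0                   # the paper uses 10,000 runs
for _ in range(n_runs):
    # Balanced subsample of 46 chronic patients, drawn without replacement
    sub = rng.choice(chronic, size=46, replace=False)
    _, p = kruskal(healthy, sub, extrapyr)
    n_sig += p < 0.05
pct_significant = 100.0 * n_sig / n_runs
```

The fraction of runs with p < 0.05 plays the role of the percentages reported in Table 3.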

Mapping Latent Space Representations of ETT Images to Phenotypic Groups
Concerning the analysis of the ETT dataset, three U-Net based autoencoders (one for each group, all sharing the same architecture and the same set of initial random weights) were trained for image reconstruction. In particular, the generic U-Net i is trained only on the data describing the ith class of individuals, which means that U-Net H has been trained to reconstruct input images from the healthy class only, while U-Net E and U-Net C are trained on the extrapyramidal and chronic classes, respectively. The workflow of the analysis is depicted in Figure 9. The experiments were carried out as follows. The three U-Net architectures were originally trained on 46 healthy, 46 extrapyramidal and 284 chronic pain patients, respectively, i.e., on the entire ETT dataset. Moreover, to obtain a balanced training set, experiments were also performed with only 46 randomly sampled chronic pain patients. The three encoder outputs were concatenated in a unique matrix, whose "entries" (latent representations of ETT images) were then clustered using the K-means algorithm, with K = 3.
The pipeline of the procedure is depicted in Figure 10. The number of clusters is empirically defined by the structure of the dataset itself, as it contains three types of individuals known a priori. The K-means algorithm is not used for classification purposes, but with the intent of evaluating whether the latent embeddings carry useful information that allows the three groups to be properly discerned. In each of the three clusters, only subjects belonging to the same cohort are present, showing the possibility of properly dividing the patients into groups using the latent space embeddings. Considering such preliminary results, we decided to implement a procedure to test the generalisation capability of the models. Therefore, we trained the three U-Nets on only 41 samples per class of individuals. The test set, consisting of five healthy, five chronic and five extrapyramidal samples, was fed as input to the three architectures separately. Subsequently, we computed the mean values along the embedding dimension of the test embeddings obtained at the previous step, and ran the K-means algorithm (K = 3) on the resulting matrix of mean values, checking whether the three detected clusters corresponded to the three groups. In particular, the mean healthy embeddings obtained with the three architectures ended up in the same cluster, with 67% accuracy. On the other hand, there was no remarkable distinction for the chronic and extrapyramidal patients. Moreover, as a further proof of concept, we averaged the embedding representations for each group of individuals and for each model, obtaining the vectors of mean values for the reconstructed embeddings, and clustered the corresponding matrix with K-means (K = 3), to verify whether the three averaged embeddings of each class could give an insight into the relationships between the input image classes (healthy, chronic and extrapyramidal) in the embedding space.
All three resulting mean "Healthy" reconstructed embedding vectors were clustered in the same community. Instead, extrapyramidal and chronic patients were not distinctly divided into their respective groups. This shows a behaviour similar to what we detected with the testing procedure: the trajectories of healthy individuals are, in fact, more characterisable than those of the other two subject categories.
Nonetheless, classifying the three groups of individuals based on DL techniques applied to ETT images remains a very difficult task, especially because of the scarcity of data and the complexity of the task itself. Indeed, human experts are unable to recognise the different types of patients by looking at the "frozen" trajectories they follow when approaching the VSST, both because such trajectories are not so different to the naked eye and because, in the freezing process, the important temporal information on the way in which each trajectory is travelled is irremediably lost. Taking into account the sequential nature of the data with an ad hoc preprocessing and, most of all, enlarging the training dataset will surely allow better results.

Conclusions
In this paper, we presented some preliminary results on the analysis of VSST data, performed on three groups of individuals: patients affected by extrapyramidal syndrome, patients affected by chronic pain symptoms and healthy subjects. Starting from the idea that the problem to be solved is multifaceted, i.e., that the data collected in a VSST are of different natures and can be analysed from different viewpoints [22], the goal of the present study is to detect whether some regularities can be found within the data that allow them to be properly grouped. Such detected differences could potentially be used in clinical practice, and therefore play an important role in evidencing possible neurological syndromes. The three-stage statistical analysis was carried out on the basis of three metrics: the blinking rate, the maximum pupil size variation and the average blinking duration. The analysis showed the presence of some statistically significant differences between the analysed groups. In particular, the relevant difference in blinking rate between healthy and chronic patients is confirmed by each step of the analysis. Moreover, a statistical difference was detected between extrapyramidal and chronic patients concerning the maximum pupil size variation and the average blinking duration. On the other hand, based on the ETT (Eye-Tracking Trajectory) image dataset, a U-Net ensemble architecture was trained to reconstruct the input images, using their latent representations to appropriately cluster the visual data. The embeddings were, in fact, clearly divided into three separated groups. We performed preliminary testing, showing promising generalisation capabilities. The limitations of this work are mainly due to the small dataset available. Moreover, variations of the VSST could be implemented and standardised, to avoid biases due to the fact that no instructions were given concerning the number of times the patients should complete the sequence during the data acquisition time.
Therefore, future research and extensions will concern new standardised data collection for further testing and a more extensive validation of the employed approaches, based on a wider experimentation. For example, a possible extension of the present study could consider more than three mutually exclusive classes, so as to include co-morbidities, i.e., cases in which additional conditions are concurrent with the primary one.