Multi-Layer Picture of Neurodegenerative Diseases: Lessons from the Use of Big Data through Artificial Intelligence

In the big data era, artificial intelligence techniques have been applied to tackle traditional issues in the study of neurodegenerative diseases. Despite the progress made in understanding the complex (epi)genetic signatures underlying neurodegenerative disorders, performing early diagnosis and developing drug repurposing strategies remain serious challenges for such conditions. In this context, the integration of multi-omics, neuroimaging, and electronic health records data can be exploited using deep learning methods to provide the most accurate representation of patients possible. Deep learning allows researchers to find multi-modal biomarkers to develop more effective and personalized treatments and early diagnosis tools, and to gather useful information for drug discovery and repurposing in neurodegenerative pathologies. In this review, we describe how relevant studies have demonstrated the potential of deep learning to enhance the knowledge of neurodegenerative disorders such as Alzheimer’s and Parkinson’s diseases through the integration of all sources of biomedical data.


Introduction
Neuronal degeneration is a common cause of morbidity and cognitive impairment in the elderly [1]. Neurodegenerative Diseases (ND) are a large group of neurological disorders with heterogeneous clinical and pathological expressions, affecting specific subsets of neurons in specific functional anatomic systems, placing a considerable burden on an increasingly aging society [2]. ND are broadly identified as proteinopathies due to conformational changes affecting protein functionality, thereby causing toxicity or losing their physiological function: misfolded proteins start to aggregate resulting in neurotoxicity [1,3]. ND are characterized by a high level of heterogeneity and complexity in terms of clinical presentation and etiology because of the interaction of genetic, lifestyle, and environmental factors [3][4][5][6]. Notably, the heterogeneity of ND is a key confounding factor that complicates the understanding of disease mechanisms and the identification of treatments. Case-control cohorts often include multiple phenotypes on distinct disease trajectories or rely on models that only account for a few features of the central nervous system at a time, which has been reductive for complex diseases [7][8][9]. Alzheimer's (AD) and Parkinson's (PD) diseases are two of the most frequent and heterogeneous pathologies among all the complex neurodegenerative proteinopathies, affecting 24 and 6.1 million people worldwide, respectively [3,7,10]. Both disorders include hereditary Mendelian forms, caused by mutations in single genes and complex sporadic forms characterized by polymorphisms in multiple genes that interact with environmental, epigenetic, and transcriptomic signatures in determining the heterogeneity and the differential susceptibility to disease [4,11]. 
To date, the identification of AD and PD therapeutic targets and in vivo biomarkers for early diagnosis is still challenging because of the existence of different disease subtypes (phenotypic heterogeneity) and stages of disease (temporal heterogeneity) [8]. Driven first by genomic studies and more recently by transcriptomic and epigenomic studies, a large volume of data has been rapidly produced to tackle this heterogeneity. From the perspective of ND as a big data issue, such diverse observations could be pulled together to provide a personalized, multi-layer representation of patients, which considers the complex heterogeneity of the disease and the availability of effective diagnostic criteria and drug development deliverables. In this context, computational modeling and simulation represent key components of the scientific method, in which reductionist and holistic approaches are not treated as separate fields but as convergent and cross-supportive paths [7][8][9][12]. Therefore, this review aims to analyze the rapidly evolving techniques for the integration of multi-omics, clinical, and neuroimaging data, discussing their role in a precision medicine framework [4,13,14]. Deep Learning (DL) techniques will be discussed with relevant examples concerning the identification of biomarkers for prognosis, early diagnosis, and assessment of symptoms, considering observations on handwriting, speech, and movement dynamics. A specific focus will be given to articles building and analyzing a multi-layer representation of subjects, showing the advantages offered by big data integration. Finally, publicly available databases collecting multiple sources of biomedical information for ND will be reviewed.

Literature Research
Relevant applications of Artificial Intelligence (AI) techniques to ND were selected through specific research queries on bibliographic search engines such as PubMed, Google Scholar, and Dimensions.ai. "Artificial Intelligence", "Deep Learning", and "Machine Learning" were used as keywords to identify AI-related articles, in combination with "neurodegenerative", "Alzheimer", or "Parkinson" to address the pathology. Ultimately, these were combined with "speech", "segmentation", "handwriting", "voice", "movement", "multi-omics", "EHR", or "data integration" to retrieve publications exploiting the related data types. Titles and abstracts were checked to identify relevant articles, which were finally included in this review. Notably, we decided to include experiments with reported accuracy below the 95% threshold, which is the cut-off to meet minimum Medical Diagnosis Treatment (MDT) standards and pass a 'medical Turing Test' [15], because we wanted to represent the state of the art of DL and ML applications in the field of neurodegenerative disease data integration.
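The keyword combination described above can be sketched as follows. This is only an illustration of the combinatorial structure: the text does not state the exact query syntax or whether every combination was submitted, so the quoting, the AND operator, and the function name `build_queries` are assumptions.

```python
from itertools import product

# Keyword groups as listed in the text above.
AI_TERMS = ["Artificial Intelligence", "Deep Learning", "Machine Learning"]
PATHOLOGY_TERMS = ["neurodegenerative", "Alzheimer", "Parkinson"]
DATA_TERMS = ["speech", "segmentation", "handwriting", "voice",
              "movement", "multi-omics", "EHR", "data integration"]


def build_queries(ai_terms, pathology_terms, data_terms):
    """Return one AND-combined query string per keyword triple."""
    queries = []
    for ai, pathology, data in product(ai_terms, pathology_terms, data_terms):
        # Quoting and the AND operator are assumptions about the query syntax.
        queries.append(f'"{ai}" AND "{pathology}" AND "{data}"')
    return queries


queries = build_queries(AI_TERMS, PATHOLOGY_TERMS, DATA_TERMS)
print(len(queries))   # number of keyword triples
print(queries[0])     # first generated query
```

With three AI terms, three pathology terms, and eight data-type terms, the full cross product contains 72 candidate queries.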

Basics of Machine Learning and Deep Learning
Machine Learning (ML) encompasses a collection of data analysis techniques aiming to generate predictive models from multi-dimensional datasets [16,17]. The advantages of ML come from its ability to learn from previous data to make accurate predictions on new data in both supervised and unsupervised contexts, with reduced or absent assumptions [17]. The focus of unsupervised methods is to learn patterns in the features of unlabeled data, while supervised methods aim to discover the relationship between input features and a target attribute, e.g., an MRI brain scan from a patient labeled as Alzheimer's [16].
DL differs from the traditional ML algorithms applied in biomedical classification tasks, such as linear or logistic regression, Support Vector Machine (SVM), and naive Bayes classifier, in its ability to cope with the complexity and volume of multi-layer data (Figure 1) [16,18]. DL models are based on Artificial Neural Networks (ANN), which are loosely inspired by human brain networks, and a typical DL architecture is organized in layers of computational units known as "neurons" [16]. Traditional ML algorithms and basic ANN are considered shallow learners, learning from data described by pre-defined features or by expert-based descriptors. These shallow learners produced significant progress in both medicine and multi-omics fields and led to the identification of multigene signatures potentially involved in disease onset and progression in ND [18]. However, Deep Neural Networks (DNNs) have since outperformed shallow learners, as DNNs can combine multiple hidden layers to provide a deeper and more comprehensive representation of data and allow the exploration of complex interrelationships between genetics, biochemistry, histology, and disease status. Notably, these DL methods can extract features automatically from raw data with little or no preprocessing, overcoming manual feature engineering (Table 1) [16,18].

Table 1. Summary of influential DL architectures and approaches for multi-layer big data analysis.

Architecture and description

Deep Neural Network (DNN)
The basic network is made of multiple hidden layers. It is capable of modeling complex non-linear relationships by learning input data representation to be matched with a specific output [19].

Autoencoder (AE)
It allows detecting patterns in the data in an unsupervised fashion. The model is made of an encoder and a decoder, transforming input data to generate its own representation, aiming to minimize the difference between the input and its output representation [20].

Restricted Boltzmann Machine (RBM)
This model is made of two layers, where nodes are bidirectionally connected but there are no connections within one layer. It is trained to learn a probability distribution for the input data and can be used as a building block for deep probabilistic models, where multiple RBMs can be stacked to build a deeper network [21].

Convolutional Neural Network (CNN)
Most used for image processing in computer vision applications. The network uses convolution and pooling operations to extract relevant features from data, useful for image classification. This architecture is inspired by the organization of the visual cortex [22].

Recurrent Neural Network (RNN)
Best suited to process sequential data and used to predict the future from the past. The network can give an output for every timestep and takes the previous inputs into account to determine the output. Long-Short Term Memory (LSTM) and Gated Recurrent Units (GRUs) are RNN architectures [19].
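
The convolution and pooling operations named in the CNN entry of Table 1 can be illustrated with a minimal, dependency-free sketch. The toy image and the hand-written edge kernel are hypothetical; in a real CNN the kernel values are learned from data, and optimized library routines replace these explicit loops.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    As in most DL libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 5x5 image: dark on the left, bright from column 2 onward.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 map, peaks at the edge
pooled = max_pool2d(feature_map)            # 2x2 summary of the map
print(feature_map)
print(pooled)
```

The feature map is non-zero only where the kernel straddles the edge, and pooling keeps that response while shrinking the map, which is the sense in which these operations "extract relevant features".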

Artificial Intelligence in Neurology
AI allows for automated data interpretation and decision-making. The peculiarity of AI is its ability to learn from data to acquire knowledge and to represent and process information related to the task it has to perform, thereby overcoming the difficulty of assimilating and extracting valuable information from large datasets. Thus, AI can be used as a powerful tool in the elaboration of biomedical data for the development of predictive models. One of the most relevant data sources for AI comes from the biomedical field, and the ability of DL (one of AI's most important branches, alongside ML) to automatically learn complex representations from data is proving particularly promising for ND research and clinical management [18,23]. The number of publications in the ND research area employing DL techniques (Table 1) and other ML algorithms is constantly increasing (Figure 2). Classification and segmentation of neuroimaging data is a traditional subdomain of DL application, given that the high-dimensional nature of neuroimaging data is highly suitable for AI intervention, and relevant application examples are presented below. Afterward, it will be shown how observations on handwriting, speech, and movement dynamics can be used to support symptom and diagnostic assessment. In the subsequent section, we discuss the usefulness of merging multiple data types, including multi-omics, clinical, and neuroimaging data, to obtain a holistic representation of subjects. In Figure 2, results were limited to "article" as Publication Type.

Neuroimaging Classification and Segmentation
Biomedical imaging is a traditional field of application for DL architectures. To date, classification and segmentation tasks on neuroimaging data have been greatly improved by employing AI techniques [18,23]. DL models can be applied to classify ND stages or sub-phenotypes. As a representative application in AD, a CNN-based approach was implemented by Ramzan and colleagues on resting-state fMRI of 138 AD subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The final model achieved an average accuracy of 97.92% on the test set, classifying subjects among six different stages of AD: Cognitively Normal (CN), Significant Memory Concern (SMC), Early Mild Cognitive Impairment (EMCI), Mild Cognitive Impairment (MCI), Late Mild Cognitive Impairment (LMCI), and AD [24]. A noteworthy study focused on the detection of PD from volumetric T1-weighted MRI scans used a 3D CNN to distinguish patients from control subjects (CS). The authors used data from the PPMI database [25] (described in Section 5) and obtained an average recall, precision, and F1-score of 0.94, 0.93, and 0.94, respectively. Their model did not misclassify any PD subject [26]. CNNs can also be applied to the segmentation task to quantify structural changes in brain shape, volume, and thickness that may be related to neurodegeneration [18,27]. As the assessment of brainstem and hippocampal volumes is known to be crucial in the study of ND, a 2D CNN was recently used to predict the number of voxels attributed to the hippocampus [28]. Meanwhile, an automated sub-cortical brain structure segmentation approach based on a CNN architecture outperformed state-of-the-art techniques such as FreeSurfer on the Internet Brain Segmentation Repository (IBSR 18) dataset [29].
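For readers less familiar with the metrics quoted above, the sketch below shows how precision, recall, and F1-score are computed for one class from true and predicted labels. The labels are made up for illustration and do not reproduce any study's data; the cited work reports averaged values, and the averaging scheme is not specified here.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="PD"):
    """Compute precision, recall, and F1-score for the positive class."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1          # correctly detected patient
        elif predicted == positive:
            fp += 1          # control flagged as patient
        elif truth == positive:
            fn += 1          # missed patient
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: four PD patients and four control subjects (CS).
true_labels = ["PD", "PD", "PD", "PD", "CS", "CS", "CS", "CS"]
predicted_labels = ["PD", "PD", "PD", "CS", "PD", "CS", "CS", "CS"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
print(precision, recall, f1)
```

Recall is the metric that reflects missed patients: a model that misclassifies no PD subject has a recall of 1.0 for the PD class, whatever its precision.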
A DL-based hippocampus segmentation framework embedding the statistical shape of the hippocampus as "context information" into a DNN was proposed and tested on image data of AD, MCI, and CN subjects from two cohorts, ADNI and AddNeuroMed, leading to improved segmentation accuracy in cross-cohort validation [30]. Notably, DL can be used as a feature extractor before classification tasks, reducing the need for rigid segmentation in preprocessing: a multiple dense CNN was used on an ADNI dataset including 199 AD patients, 403 MCI, and 229 CN. Experimental results showed that the proposed method achieved an accuracy of 89.5% for AD vs. CN classification and an accuracy of 73.8% for MCI vs. CN classification [31]. Moreover, another CNN model based on transfer learning was used as a feature extractor in a multi-class discrimination task on the ADNI database, achieving an overall accuracy of 95.73% on the validation set [32]. Transfer learning is defined as the ability of a system to recognize and apply the knowledge learned in a previous source domain to a novel task, and it can be implemented in segmentation to reduce the need for many annotated samples during training [27]. Transfer learning has some limitations: because objects in biomedical images may have very different appearances and sizes, transfer learning from models with huge variations in organ appearance may not improve the segmentation result [27]. Overall, AI flexibility in learning complex and abstract representations of neuroanatomical data through nonlinear transformations is particularly promising, since it can greatly improve the knowledge of the aging brain and its response to several concurrent pathological processes.

Clinical Records Investigation
In addition to widespread research on DL applications for image classification and segmentation, researchers have applied AI to several types of neurological and general medical data. ML and DL techniques have been exploited to support clinical expertise by analyzing handwriting, voice recordings, and movement recordings. Handwriting deterioration is one of the most typical clinical hallmarks of PD, and the identification of distinctive handwriting features can help to build a predictive model for PD classification [33]. Drotár and colleagues [34] collected handwriting samples from 37 Czech PD patients on medication and 38 matched controls. They extracted relevant features from the data using statistical methods and fed them to an SVM with a Radial Basis Function kernel, achieving 88.1% as the highest accuracy in classifying PD patients [34]. Another interesting use of patients' handwriting is shown in a recent study by Pereira and colleagues [33]. Using an electronic pen to map the handwriting dynamics of PD patients into computer images, the researchers collected data to be analyzed by a CNN. The authors obtained a final accuracy of about 95% in classifying PD patients and healthy controls, supporting the use of a DL-based approach to aid PD diagnosis. Interestingly, they showed that the model performed well in distinguishing healthy controls from patients with early-stage PD. Their CNN was challenged with classifying data from eight manually selected patients whose traces were very similar to those of healthy individuals. The accuracy rate above 94% proved it to be robust enough to detect almost imperceptible differences between the two groups' handwriting (Figure 3) [33]. Convolution and pooling operations process input data to extract relevant features from the images, allowing detection of group differences. Spiral images were taken from the NewHandPD dataset [35], available at http://wwwp.fc.unesp.br/~papa/pub/datasets/Handpd/, accessed on 5 January 2021.
These approaches can be considered as alternative or complementary to others, such as speech or movement-based discriminant analyses. Various methods have been presented for analyzing patients' speech and movement recordings. As an example, Berus and colleagues exploited speech recordings data from 20 PD and 20 CS [36]. Recordings were taken during a medical examination while subjects were reading or saying certain numbers or words, for a total of 26 recordings per subject. A fine-tuned ANNs ensemble algorithm was trained to classify each voice sample for each subject. A class was finally attributed by the majority voting of each ANN constituting the ensemble. Their algorithm achieved a test accuracy, sensitivity, and specificity of 86.47%, 88.91%, and 84.02%, respectively [36]. Another possible use of voice recordings is presented in a very recent paper by Al-Hameed and colleagues [37], where the authors showed how it is possible to discriminate between patients reporting cognitive concerns attributable to ND or Functional Memory Disorder (FMD, i.e., subjective memory concerns unassociated with objective cognitive deficits or risk of progression) by analyzing acoustic features extracted from speech recordings. Recordings data from subjects' clinical conversations with the neurologist during the diagnosis assessment were processed for feature extraction and selection and then used to train five different ML classifiers to differentiate between the two classes. This method achieved an average accuracy of 96.2%, proving that the discriminative power of purely acoustic approaches could be integrated into diagnostic workflows for patients with memory concerns. Interestingly, this method does not require automatic speech recognition and understanding because it relies only on acoustic features obtainable from recordings processing [37].
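The majority-voting step and the sensitivity and specificity figures mentioned above can be sketched as follows. The votes and diagnoses are invented for illustration; the sketch assumes each ensemble member emits one label per subject, which simplifies the per-recording classification described in the cited work.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def sensitivity_specificity(true_labels, predicted_labels, positive="PD"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical votes from a three-member ensemble, one list per subject.
ensemble_votes = {
    "s1": ["PD", "PD", "CS"],
    "s2": ["CS", "CS", "PD"],
    "s3": ["PD", "CS", "CS"],
    "s4": ["PD", "PD", "PD"],
}
# Hypothetical clinical diagnoses for the same subjects.
diagnoses = {"s1": "PD", "s2": "CS", "s3": "PD", "s4": "PD"}

predictions = {subject: majority_vote(votes)
               for subject, votes in ensemble_votes.items()}
subjects = sorted(diagnoses)
sensitivity, specificity = sensitivity_specificity(
    [diagnoses[s] for s in subjects],
    [predictions[s] for s in subjects],
)
print(predictions)
print(sensitivity, specificity)
```

Using an odd number of ensemble members avoids ties in a two-class problem; with an even number, the tie-breaking rule becomes a design decision that should be stated.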
PD patients manifest motor symptoms such as bradykinesia, tremor, and posture alteration, and clinical observations can be taken from their characteristic gait. Gait disorders in PD are characterized by spatial and temporal dysfunctions, and Freezing Of Gait (FOG) is one of the most debilitating motor symptoms in PD. DL algorithms can be implemented in automatic FOG detection systems, as recently demonstrated [38]. In this paper, the researchers analyzed wearable sensor data with a CNN to automatically detect when a FOG episode would occur, achieving 89% accuracy. This study presents the first DL-based method of FOG detection in home environments, outperforming previous automatic methods and possibly improving the medical monitoring of FOG's evolution in PD patients. Finally, this tool can also be beneficial for evaluating the effects of drugs during clinical trials [38].

Big Data Integration
As 21st-century biomedicine goes through the big data era, the production of a wide variety of biomedical data is becoming simpler and faster [7,23]. To face the increase in data volume and heterogeneity, data sharing initiatives were encouraged by funding agencies and scientific journals, and publicly available repositories and databases were established [9,39]. However, standardized protocols for cross-platform interoperability, data management strategies, and common workflows for data sharing and analysis have lagged behind increasingly fast data production, hurting model deployment and insight generation [7]. The isolation of multi-omics and EHRs data still poses considerable challenges for researchers' ability to access, integrate, and model often noisy, complex, and high-dimensional data [7,8,17,23,39]. In the next section, data access and integration strategies for both data management and analytics will be discussed, introducing multi-omics and EHRs data. Finally, a list of curated databases for ND will be presented, and local or international consortia initiatives aiming to maximize both sample collection and data generation will be reviewed.

Multi-Omics
Biological systems consist of several molecular features such as genes and proteins, as well as the interactions between those components. Omics refers to the comprehensive characterization and quantification of these molecules, grouped according to their structural or functional similarities [17,40]. Multi-omics data integration combines information from different layers of omics data to understand how different biological systems interact at a molecular level [17,23]. This is relevant in ND such as AD and PD, where a multifactorial etiology is usually combined with heterogeneous clinical pictures and mixed pathologies [12]. Multi-omics data can be classified as (1) multi-feature data, when the same set of samples presents several distinct feature sets, or (2) multi-relational data, with different features and different sample sets in the same phenomenon or system. However, some variation in data architecture is possible, such as (3) multi-class data, with different groups of samples measured by the same feature set, and (4) tensor data, measuring the same set of objects by the same set of features in different conditions [41]. Data-driven analysis of multi-omics data in ND can be performed to screen for potential biomarkers and druggable targets or to identify sub-phenotypes through clustering methods. Furthermore, the interactions among different sets of features could be crucial to understand the pathogenic pathways underlying different disease phenotypes, each one defined by its biomarkers as a phenotypic subtype with its own suitable personalized treatment [42]. Nevertheless, the integration of multi-omics data is still a major challenge in precision medicine, since omics analyses are impeded by high analytical variance and limitations in experimental design, resulting in a low signal-to-noise ratio [23].
Moreover, the complex presentation of ND is also subject to temporal heterogeneity and individual variance in terms of biological measures and technical error [7,8,12,23]. For this reason, different strategies have been proposed to produce trustworthy results and insights and to manage single- and multi-omics experimental design and analysis issues. Integration algorithms can be organized in workflows for both integrated and orthogonal omics datasets [7]. Dimensionality reduction methods are a set of ML multivariate techniques for feature extraction based on matrix factorization. While it is often challenging to combine the features of multiple omics datasets directly, the new features generated by these methods can easily be combined for every class of multi-omics data (Figure 4) [23,41].
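The idea of combining reduced features across omics layers can be sketched for the multi-feature case (same samples, different feature sets). Principal component analysis computed by power iteration is used here as a stand-in, since the text does not name a specific factorization method; the two toy layers and all function names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each feature's mean so that columns have zero mean."""
    n_samples = len(matrix)
    means = [sum(column) / n_samples for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)]
            for row in matrix]


def first_component(matrix, iterations=100):
    """Leading principal direction and sample scores via power iteration.

    Caveats: the sign of the direction is arbitrary, and the all-ones start
    vector fails if it is orthogonal to the leading direction.
    """
    centered = center_columns(matrix)
    n_features = len(centered[0])
    direction = [1.0 / math.sqrt(n_features)] * n_features
    for _ in range(iterations):
        # Multiply by X^T X (proportional to the covariance matrix).
        scores = [sum(x * d for x, d in zip(row, direction))
                  for row in centered]
        updated = [sum(row[j] * s for row, s in zip(centered, scores))
                   for j in range(n_features)]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0.0:
            break  # no variance along the current direction
        direction = [u / norm for u in updated]
    scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]
    return direction, scores


def integrate_layers(layers):
    """Concatenate per-layer component scores; rows must share sample order."""
    per_layer_scores = [first_component(layer)[1] for layer in layers]
    return [list(sample) for sample in zip(*per_layer_scores)]


# Hypothetical layers measured on the same three samples.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
methylation = [[0.2, 0.5], [0.4, 0.5], [0.6, 0.5]]

integrated = integrate_layers([expression, methylation])
print(integrated)  # one row per sample, one reduced feature per layer
```

Each layer is reduced separately, so layers with very different numbers of raw features contribute the same number of derived features to the integrated matrix. In practice several components per layer would be kept, and a tested linear algebra library would replace the hand-written iteration.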

Electronic Health Records (EHRs)
Data isolation represents one of the major issues in big data analytics and for healthcare entities trying to construct EHR protocols and databases. Healthcare data are typically dispersed across various medical systems located at multiple sites, and many of these systems are not interconnected, constraining the data into isolated silos and contributing to the increase in the expenses of institutions [43]. EHRs contain patients' demographics along with clinical measurements, interventions, clinical laboratory tests, and medical data, thereby constituting one of the pillars of big data in the biomedical field [44]. EHRs data are both structured and unstructured, the former being represented by diagnostic codes and laboratory test outputs, the latter by physician annotations about patients' status. Analysis of this kind of data is not feasible using classical statistical methods, and more sophisticated techniques (such as DL) are required. To fully exploit the potential of big data, all data sources must be considered, to avoid discarding data because they are unstructured. Free-text clinical notes in the EHRs, which can only be analyzed with a DL approach, can give useful information about the patients and can improve the accuracy of analytical results [23,45]. Data isolation prevents healthcare organizations from leveraging the latest Information Technologies (IT) innovations (such as data processing and cloud computing), which can help to improve care and significantly reduce costs [43]. Similar to what happened in multi-omics data management, data standards have been developed to overcome healthcare information sharing and interoperability issues across different healthcare systems [39,43]. Fast Healthcare Interoperability Resources (FHIR) is a modern healthcare data format and exchange standard widely used to encode EHRs data [46].
FHIR implements an application programming interface with HTTP-based RESTful protocols and enables interoperable communication and information sharing between various healthcare systems, allowing their integration with mobile devices and cloud platforms. FHIR data have a well-defined structure, covering a variety of healthcare aspects including clinical, administration, financial, and reporting studies. These data are called "resources", and they are easily extensible to cover non-standard use cases. FHIR's features and flexibility make it well suited to generating EHR datasets to be integrated with omics data [23,43]. FHIR-coded data, images, and other features processed with different standards can be integrated with cloud platforms, such as Google Health API or Amazon Comprehend Medical. Successful and standardized integration of big data in the healthcare system can be applied to real-time healthcare analytics to improve care service quality and costs [47,48]. Such approaches, which continuously feed newly generated data into ML applications, would also be of interest in other contexts, such as pandemic situations.
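To make the notions of FHIR "resources" and RESTful access more concrete, the sketch below parses a minimal Patient resource and builds the URLs for the standard read (`[base]/[type]/[id]`) and search (`[base]/[type]?parameter=value`) interactions. No request is sent. The server base URL, the patient values, and the helper names `read_url` and `search_url` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Hypothetical server base URL; replace with a real FHIR endpoint.
FHIR_BASE = "https://fhir.example.org/r4"

# A minimal Patient resource; every value here is made up.
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-001",
  "gender": "female",
  "birthDate": "1948-03-12"
}
"""


def read_url(base, resource_type, resource_id):
    """URL for the FHIR read interaction: GET [base]/[type]/[id]."""
    return f"{base}/{resource_type}/{resource_id}"


def search_url(base, resource_type, **params):
    """URL for the FHIR search interaction: GET [base]/[type]?name=value."""
    return f"{base}/{resource_type}?{urlencode(params)}"


patient = json.loads(patient_json)
patient_url = read_url(FHIR_BASE, patient["resourceType"], patient["id"])
# Observations (e.g., laboratory results) that refer to this patient.
observations_url = search_url(FHIR_BASE, "Observation", patient=patient["id"])

print(patient_url)
print(observations_url)
```

Because every resource declares its own `resourceType` and is addressed in the same way, a data pipeline can collect demographics, observations, and other record types through one uniform interface before joining them with omics data on a shared patient identifier.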

Artificial Intelligence Applications on ND Multi-Omics and Clinical Data Integration
Researchers exploiting biomedical big data for ND aim to improve clinical efficiency by combining various sources of information such as multi-omics, EHRs, and medical imaging (e.g., MRI) data, building a holistic representation of patients. DL models can be used as a cutting-edge data analysis technique to find patterns in a broad-scope view of the patient. This kind of approach can be hypothesis-free, exploring data in search of explanations for differences between groups, instead of being hypothesis-driven like classical experiments [49,50]. By building the most accurate representation of patients possible through the integration of all sources of biomedical data, DL allows researchers to find multi-modal biomarkers to develop more effective and personalized treatments and early diagnosis tools, and to gather useful information for drug discovery and repurposing [51]. Along with neuroimaging data, EHRs can provide useful information for AI analysis. De-identified data from the PPMI database were used for the identification of PD subtypes [52]. The authors used a Long Short-Term Memory (LSTM) network to analyze patient data covering six years of measurements on potential PD progression markers, including clinical features, imaging, bio-specimen measures, and demographics. LSTM can analyze time series data, allowing the authors to represent patients by considering value progression for the available features. The analysis identified three PD subtypes with distinct patterns of progression, demonstrating heterogeneous characteristics within PD patients' features. The integration of biomarkers and clinical data for DL application showed that disease progression rates and baseline severities are not necessarily associated and that motor and non-motor symptoms are not necessarily correlated [52]. This experiment is a good example of how DL techniques enable the management of integrated multi-domain data.
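How an LSTM carries information across a patient's successive visits can be sketched with the standard LSTM cell equations, written here without external libraries. This is not the architecture of the cited study: the two features per visit, the hidden size of one, and all weight values are hypothetical, and real models learn the weights from data.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, inputs):
    """Return W * inputs + bias, with one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; each gate acts on the concatenation [x, h_prev]."""
    z = list(x) + list(h_prev)
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # what to write
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # what to keep
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # what to expose
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # new content
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def run_sequence(visits, params, hidden_size):
    """Feed the visits in time order and return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in visits:
        h, c = lstm_step(x, h, c, params)
    return h


# Hypothetical weights: each gate has a (weights, bias) pair; every weight
# row spans two input features plus one hidden unit.
params = {
    "input": ([[0.5, -0.3, 0.1]], [0.0]),
    "forget": ([[0.2, 0.4, -0.1]], [1.0]),
    "output": ([[-0.4, 0.3, 0.2]], [0.0]),
    "candidate": ([[0.6, 0.1, -0.2]], [0.0]),
}
# Hypothetical normalized measurements at three successive visits.
visits = [[0.1, 0.9], [0.3, 0.7], [0.6, 0.4]]

patient_embedding = run_sequence(visits, params, hidden_size=1)
print(patient_embedding)
```

The final hidden state summarizes the whole trajectory rather than a single visit; representations of this kind, computed per patient, are what a subsequent clustering step could group into progression subtypes.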
Another multi-modal DL approach was used to predict MCI to AD progression [53]. ADNI longitudinal data from cerebrospinal fluid biomarkers, neuroimaging, cognitive performance, and demographics were integrated and analyzed through a multimodal Recurrent Neural Network (RNN). This method allows the integration of data from multiple domains at multiple time points. The results show that DL models perform better on integrated data than on separate single-modality data, achieving a higher prediction accuracy. This approach could potentially identify people who might benefit the most from a clinical trial and support risk stratification within clinical trials [53]. Integration of heterogeneous multi-omics data was used to predict AD diagnosis [54]. The authors implemented a DNN to predict AD using large-scale gene expression and DNA methylation data from prefrontal region tissue of different individuals diagnosed with late-onset AD. Results showed higher accuracy in predicting AD with integrated multi-omics data than with single-omics data. The authors also compared the accuracy of conventional ML methods with that of their proposed DL method, observing an improved predictive performance [54]. Currently, the use of DL methods on integrated multi-omics data is far more common in cancer research than in ND research, where fewer studies report the use of these methods [55]. Overall, data integration yields better classification and prediction results in almost every field where it is applied and is emerging as the next step in biomedical research [23,41,56].

Databases
The adoption of academic and industry-wide data standards is a key element to enable large-scale experimental data integration opportunities [23]. Public availability of datasets is growing in all disciplines, and the Findable, Accessible, Interoperable, Reusable (FAIR) principles have been proposed to promote good scientific practices for data sharing initiatives, while database aggregators such as OmicsDI have started to monitor repositories to facilitate the discovery and linking of public omics datasets [39,57]. To have a comprehensive overview of complex ND and trace their underlying pathogenesis mechanisms and progression, different biomedical data need to be integrated for modeling and pattern recognition. A list of major available databases where researchers can retrieve data to test their hypotheses and generate novel insights is reported in Table 2. The Parkinson's Progression Markers Initiative (PPMI) is an international and multi-center study that collects data from PD patients for future biomarker discovery and personalized PD therapy. Interested researchers can download de-identified clinical, biomarker, and imaging data, including raw and processed MRI and SPECT images [25]. Data on AD and related pathologies can be found in the NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS). It is funded by the National Institute on Aging and provides access to multi-omics data from AD genetics projects [58]. One of the most interesting initiatives for ND data sharing is the Global Alzheimer's Association Interactive Network (GAAIN), which federates more than 50 data partners and gathers data from more than 450,000 subjects to improve the understanding, treatment, and prevention of AD [59]. Other databases such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) have made AD data publicly available upon standardization of data acquisition protocols, so that researchers can retrieve clinical, imaging, and omics data [60].
This initiative removes the need for years-long data collection, facilitating and speeding up hypothesis testing. Nevertheless, data access is restricted by data use agreements requiring ADNI to be cited in manuscripts and prohibiting data redistribution [61]. GAAIN is instead a virtual community for sharing AD data, which are stored in independently operated repositories around the world, aiming to offer a data homogenization service to the scientific community [59]. GAAIN offers the possibility to download data mapped to its data-sharing schema, saving the time otherwise needed to interpret the different terminologies and nomenclatures used by each data repository [61]. Another interesting data source is the Swedish study BioFINDER, which aims to discover the key pathological mechanisms in ND by analyzing various sources of data such as neuroimaging, biospecimens, and clinical examinations. Its data are not publicly available but can be requested for download. Moreover, non-specific databases that include ND data are Gene Expression Omnibus (GEO) and UK Biobank, which contain clinical and omics data for a wide range of health-related outcomes [62,63]. Another novel initiative with the main goal of providing a multi-layer picture of ND patients is the Italian IRCCS Network of Neuroscience and Neurorehabilitation, which encourages scientific research and translational technologies for improving diagnosis, treatment, rehabilitation, and prevention of neurodegenerative disorders [4,64]. In addition, the network is also working on providing remote motor and cognitive neuro-telerehabilitation treatments aimed at facilitating patients' access to personalized healthcare approaches and at providing continuity of care and adequate monitoring strategies [64]. Interested researchers can query the websites to find datasets fulfilling their needs. With many available databases providing digital data from ND patients, it is possible to collect big biomedical datasets.
Studies that integrate data from various sources aim to obtain a holistic description of ND patients' characteristics, and analyzing such data with the best-suited techniques may lead to the identification of novel patterns in disease mechanisms.
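As a rough illustration of what integrating records from separate sources can look like in practice, the following minimal sketch concatenates per-subject feature vectors from several modalities keyed by a de-identified subject ID. The function name `integrate_modalities`, the modality names, and the NaN-padding rule are hypothetical choices made for this example; they are not taken from any of the databases described above, whose actual schemas differ.

```python
from typing import Dict, List


def integrate_modalities(
    sources: Dict[str, Dict[str, List[float]]],
    require_all: bool = True,
) -> Dict[str, List[float]]:
    """Concatenate per-subject feature vectors from several data sources.

    `sources` maps a modality name (hypothetical, e.g. "imaging") to a dict
    of subject_id -> feature vector. Modalities are concatenated in sorted
    name order so the layout of the fused vector is deterministic.
    """
    modality_names = sorted(sources)
    if not modality_names:
        return {}

    id_sets = [set(sources[name]) for name in modality_names]
    if require_all:
        # Keep only subjects observed in every modality.
        subject_ids = set.intersection(*id_sets)
    else:
        # Keep every subject seen in at least one modality.
        subject_ids = set().union(*id_sets)

    fused: Dict[str, List[float]] = {}
    for subject_id in sorted(subject_ids):
        vector: List[float] = []
        for name in modality_names:
            features = sources[name].get(subject_id)
            if features is None:
                # Assumption: pad a missing modality with NaN, using the
                # width of any vector already stored for that modality.
                width = len(next(iter(sources[name].values()), []))
                features = [float("nan")] * width
            vector.extend(features)
        fused[subject_id] = vector
    return fused


# Toy data, invented for illustration only.
example_sources = {
    "imaging": {"s1": [0.1, 0.2], "s2": [0.3, 0.4]},
    "omics": {"s1": [1.0], "s3": [2.0]},
}
complete_cases = integrate_modalities(example_sources)
all_subjects = integrate_modalities(example_sources, require_all=False)
```

Restricting to complete cases keeps the fused vectors free of missing values at the cost of sample size, whereas padding retains every subject but leaves the handling of missing modalities to the downstream model.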

Mixed
Alzheimer's Disease Neuroimaging Initiative (ADNI): a multisite study for the prevention and treatment of AD. Its database stores a collection of validated study data to define the progression of AD, including mild cognitive impairment subjects and elderly controls [60].
Global Alzheimer's Association Interactive Network (GAAIN): an online integrated research platform affiliated with partners all over the world, providing resources and data enabling comparative data analysis and cohort discovery [59].

Challenges and Limitations for AI Techniques in ND Research
In the era of big data, the availability of biomedical information has increased exponentially, leading to technical and theoretical advances in data management, standardization, and analysis [66][67][68]. High-throughput technologies for genomic, transcriptomic, proteomic, and metabolomic analyses have been accommodated in a network medicine framework focused on molecular and genetic interactions, biomarkers of disease, and therapeutic target discovery [40,69]. However, developing a comprehensive, holistic representation of patients with ND may require omics data to be merged with many other sources of information, such as EHRs, medical imaging, and wearable sensor data [23,50]. Therefore, multi-layer data integration is necessary to achieve a precision medicine approach, which is a unique opportunity to greatly improve healthcare quality and research outcomes in neurodegenerative pathologies through the identification of personalized treatments (Figure 5) [41,56,70]. As previously discussed in this review (Sections 4 and 4.2), updated health informatics and data science workflows with a renewed data management policy are required to condense biomedical data vectors into an easily interpretable and translationally relevant form [7]. Data isolation in silos of non-communicating medical systems was discussed for EHRs, as it represents one of the major issues of the big data era and also affects ND research. Only a few consortium initiatives have the resources to start collecting data with a multi-omics or personalized medicine approach in mind, which has led to a multitude of isolated, poorly interoperable datasets [7,9]. 
The adoption of FAIR principles and other standardization and monitoring processes such as OmicsDI will help to develop common ontologies and uniform data labels [39,57], while novel data-sharing initiatives designed with a defined big data architecture in mind, such as the National Virtual Institute for the investigation of Parkinson Disease in the Italian IRCCS Network of Neuroscience and Neurorehabilitation, are starting to collect data in ND [4,64]. These new data sharing and encoding protocols are starting to shape a new direction in the biomedical field, and many authors suggest that these initiatives will be increasingly used as data volume and variety rapidly increase [7]. The implementation of a precision medicine approach in ND requires complementing classical case-control studies on less frequent diseases with community-based studies, which are ideal for common neuropathologies [12]. Community design studies produce data that can be repurposed in multiple ways to look at specific outcomes, to derive new outcome measures, or to assess the interaction between many biological systems. As we progressively approach a holistic representation of patients through an increasing volume, velocity, and variety of data generation, DL methods are being used to integrate and model these high-dimensional datasets [23,41,50]. Neural network architectures are flexible instruments that can process and analyze both labeled and unlabeled data. They can be used in the data integration phase as dimensionality reduction or feature extraction tools, and they are especially suited to leveraging large amounts of data from high-throughput omics studies or medical imaging. Notably, only DL has the potential to integrate the entire medical record, including physicians' free-text notes [23]. Several limitations to DL implementation in personalized medicine research are being addressed, such as reduced sample size and reproducibility issues [50]. 
As an example, Semi-Supervised Learning (SSL) algorithms work with both labeled and unlabeled data points, sometimes achieving better performance than a fully supervised approach because the model can learn from a much larger set [17]. Another relevant issue in this field is the difficulty of reproducing other studies and implementing others' AI models. This is due to the lack of open-source implementations provided by authors and the difficulty of re-implementing a network in a different library. Automated code extraction from published papers is a scraping method enabled by DLPaper2Code to address reproducibility issues for DL architectures, and it can be integrated into well-known DL frameworks [71]. Traditional DL issues, such as overfitting and interpretability, represent common challenges for the development of reliable models. A model overfits the training data when it describes features that arise from noise or variance in the data, rather than from the underlying distribution from which the data were drawn. Overfitting usually leads to a loss of accuracy on out-of-sample data [72]. It is usually addressed using regularization methods or implicit/explicit feature selection techniques [73,74]. Cross-validation (CV) is a process for creating a distribution of pairs of training and test sets out of a single dataset. CV techniques such as hold-out and k-fold cross-validation have become industry standards for reducing the risk of overtraining. In k-fold CV, the data are partitioned into k subsets, each called a fold. The learning algorithm is then applied k times, each time training on the union of all folds except one, which is held out and used as the test set [72]. Moreover, DL models are commonly characterized by interpretability issues, which reduce their potential as insight generators for clinicians and researchers [75]. 
To address this issue, several methods have been developed to understand how a DL architecture solves a regression or a classification problem [76][77][78]. Finally, data sparseness in computer-aided medical diagnosis and treatment still represents an unresolved challenge for machine diagnosticians, undermining AI diagnostic efficiency [15]. Calculations showed that the sparseness of actual symptom-treatment sets based on ICD-10 in the space of all possible sets is astronomical, so AI must be provided with more "functional" information, such as domain-specific medical reasoning processes and policies based on heuristic-driven search methods derived from the methods of human diagnosticians [15].

Figure 5. Multi-layer picture of neurodegenerative diseases. Separated data can be integrated to obtain a holistic representation of patients. The application of artificial intelligence techniques to data processing leads to useful findings in ND research, clinical management, and personalized treatment development.
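The k-fold cross-validation procedure described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the protocol of any cited study: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `accuracy`) are hypothetical, the folds are built by a simple seeded shuffle without stratification, and the majority-class "model" is only a placeholder for a real learner.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all samples."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = folds[held_out]
        # Train on the union of all folds except the held-out one.
        train_idx = [
            j for f, fold in enumerate(folds) if f != held_out for j in fold
        ]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(
    X: Sequence[Any],
    y: Sequence[Any],
    fit: Callable[[list, list], Any],
    score: Callable[[Any, list, list], float],
    k: int = 5,
    seed: int = 0,
) -> List[float]:
    """Fit on each training split and score on the matching held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return scores


def fit_majority(X: list, y: list) -> Any:
    """Placeholder learner: predict the most frequent training label."""
    return max(set(y), key=y.count)


def accuracy(model: Any, X: list, y: list) -> float:
    """Fraction of held-out labels equal to the predicted majority label."""
    return sum(1 for label in y if label == model) / len(y)


# Toy usage with invented data: ten samples that all share one label.
toy_scores = cross_validate([[0.0]] * 10, [1] * 10, fit_majority, accuracy, k=5)
```

Averaging the per-fold scores gives an out-of-sample performance estimate; a large gap between training and held-out scores is one practical sign of overfitting. For imbalanced clinical cohorts, a stratified split would usually be preferred over the plain shuffle assumed here.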

Conclusions and Future Directions
In this work, we reviewed how AI can be applied to biomedical big data for ND research. After a brief introduction to ML and DL basics, we surveyed some notable AI applications to the most important kinds of biomedical data. We have seen how neuroimaging, EHRs, and multi-omics data yield better classification results when integrated into a unified representation of patients. Databases offering large-scale experimental data integration opportunities have been reviewed. Ultimately, big data integration is proving to be the next step in biomedical research, offering many advantages despite the limitations of such an approach, discussed in Section 6. Creating straightforward and interpretable DL models is a challenge for AI research in the healthcare field, and several authors have attempted to address it [50]. A very interesting model for AD big data analytics is BHARAT, an application for integrated data manipulation, storage, and processing. BHARAT integrates brain structural, neurochemical, and behavioral data from magnetic resonance imaging, magnetic resonance spectroscopy, and neuropsychological testing, providing feature selection and ensemble-based classification. This framework focuses not only on AD classification through DL methods, but also on extracting relevant information from the analysis of multi-modal integrated data, such as early diagnostic biomarkers for AD pathogenesis [79]. Most biomedical research fields will benefit from advanced health informatics applications involving DL. Despite astonishing advances in biomedical data analysis through ML and DL applications for the identification of novel biomarkers and therapeutic targets, much work remains to be done to develop more effective and personalized treatments through the exploitation of integrated data [51]. 
Big data analytics in the biomedical field, especially in ND research, is providing promising opportunities, as shown by the growing number of data-sharing initiatives and the standardized integration of multiple sources of information described in Sections 5 and 6. DL can be used in a precision medicine framework and will be crucial to identify novel therapeutic targets and early biomarkers for diagnosis, and to improve clinical management for patients with complex and heterogeneous ND.