Machine Learning Applied to Diagnosis of Human Diseases: A Systematic Review

Abstract: Human healthcare is one of the most important topics for society. It seeks effective and robust disease detection as early as possible so that patients receive appropriate care. Because this detection is often a difficult task, the medical field needs support from other fields such as statistics and computer science. These disciplines face the challenge of exploring new techniques that go beyond the traditional ones. The large number of emerging techniques makes it necessary to provide a comprehensive overview that avoids overly specific aspects. To this end, we propose a systematic review of Machine Learning applied to the diagnosis of human diseases. This review focuses on modern techniques related to the development of Machine Learning applied to the diagnosis of human diseases in the medical field, with the aim of discovering interesting patterns, making non-trivial predictions, and supporting decision-making. In this way, this work can help researchers to discover and, if necessary, determine the applicability of machine learning techniques in their particular specialties. We provide some examples of the algorithms used in medicine and analyse trends in terms of the goal pursued, the algorithm used, and the area of application. We detail the advantages and disadvantages of each technique, as reported by several authors, to help choose the most appropriate one in each real-life situation. The authors searched the Scopus, Journal Citation Reports (JCR), Google Scholar, and MedLine databases from recent decades (from approximately the 1980s) up to the present, with English language restrictions, for studies matching the objectives mentioned above. Based on a protocol for data extraction defined and evaluated by all authors using the PRISMA methodology, 141 papers were included in this advanced review.


Introduction
Healthcare is one of the most urgent matters in human societies, as citizens' quality of life depends directly on it [1]. However, the healthcare sector is highly heterogeneous, widely distributed, and fragmented. From the clinical perspective, delivering appropriate patient care requires access to relevant patient information, which is seldom available where and when it is needed [2]. Additionally, the wide variation in test-ordering for diagnostic purposes suggests the need for a sufficient and appropriate set of tests [3,4]. Smellie et al. [5] extended this argument by suggesting that the large differences observed in general practice pathology requesting result mostly from individual variation in clinical practice and are, therefore, potentially amenable to change through more consistent and better-informed decision-making by doctors [6].
Hence, medical data often consist of a large set of heterogeneous variables, collected from different sources, such as demographics, disease history, medication, allergies, biomarkers, medical images, or genetic markers, each of which offers a different partial view of a patient's state. Moreover, the statistical properties of these sources are inherently different. When researchers and practitioners analyse such data, they are confronted with two problems: the curse of dimensionality (the feature space grows exponentially with the number of dimensions, and so does the number of samples needed to cover it), and the heterogeneity in feature sources and statistical properties [7]. These factors cause delays and inaccuracy in disease detection and, consequently, patients may not receive appropriate care [8]. Thus, there is a clear need for an effective and robust methodology that allows early disease detection and that doctors can use as a decision-making aid [9]. Therefore, the medical, computational, and statistical fields face the challenge of exploring new techniques for modelling the prognosis and diagnosis of diseases, since traditional paradigms fail to handle all this information [10]. This requirement is closely related to developments in other domains, such as Big Data (BD), Data Mining (DM), or Artificial Intelligence (AI).
As the amount of medical data being digitally collected and stored is vast and expanding quickly, the science of data management and analysis is also advancing in order to convert this vast resource into information and knowledge that help its users achieve their objectives. The term BD is used to describe this evolving technology [11]. BD starts with large-volume, heterogeneous, autonomous sources with distributed and decentralised control, and it seeks to explore complex and evolving relationships among data [12].
BD is a subset of DM, also known as knowledge discovery in databases. DM is the process of discovering interesting patterns in databases that are meaningful in decision-making and which lead to some advantage. Useful patterns allow us to make non-trivial predictions on new data and they help to explain something about the data [13]. However, finding these patterns is not an easy task. To this end, it is necessary to use advanced techniques for describing these structural patterns in data. Most of these techniques were developed within a field that is known as AI.
AI is a part of computer science that aims to make computers more intelligent. One of the basic requirements for any intelligent behaviour is learning. Currently, most researchers agree that there is no intelligence without learning. A learning problem can be defined as the problem of improving some performance measure when executing some task, through some type of training experience. In turn, within AI, Machine Learning (ML) emerged as a method of choice for developing algorithms to analyse datasets [14].
Today, ML provides several indispensable tools for intelligent data analysis. Additionally, its technology is currently well suited for analysing medical data and, in particular, there is a wide range of work on medical diagnosis in small, specialised diagnostic problems [15], where the initial applications of ML were pointed out. For example, ML classifiers have been successfully applied to distinguish between healthy subjects and patients with Parkinson's disease [16], which is a useful tool in clinical diagnosis. Indeed, most ML algorithms work very well on a wide variety of important problems. However, they have not succeeded in solving the main problems in AI when these become exceedingly difficult and the number of dimensions in the data is high (the curse of dimensionality). In these cases, BD technology is necessary. Thus, Deep Learning (DL) arose as a specific kind of ML: its development was motivated by, and designed to overcome, the failure of traditional algorithms to work with high-dimensional data and to learn complex functions in high-dimensional spaces [17].

Data Sources
Based on the methodology of a systematic review [18], we systematically searched literature databases, such as Scopus, Journal Citation Reports (JCR), Google Scholar, and MedLine, from recent decades (from approximately the 1980s) up to the present, with English language restrictions, for studies within the scope of AI, BD, DL, DM, and ML applied to the diagnosis of human diseases in the medical field. As a result, we swept the whole medical field, trying to maximise the range of applications and techniques covered, in order to obtain a representative systematic review.

Data Extraction
A protocol for data extraction was defined and evaluated by all authors. nor with a tangential, not central, interest for the systematic review aims. Finally, some studies that were not between 2008 and 2018 were added to the outcome, as they contained relevant topics with a high citation index.
One researcher (NCC) extracted the data, which were checked independently by the others (JLCS, JAGP, JMGP, and MLPL). In case of uncertainty, the full text was retrieved. In case of disagreement about the application of the selection criteria, the case was discussed with reference to the protocol criteria and, if needed, the full paper was retrieved.
To determine the estimates, each study contributed only one estimate. Whenever a study generated multiple estimates according to different diagnostic or methodological definitions, one estimate was selected according to a previously defined protocol. This protocol ensured that data on each subject were extracted only once. When a study had been published on several dates, the most recent version was extracted. The quality of the papers included is endorsed by the exhaustive review process followed by the journals from the selected literature databases.

Data Analyses
This paper is a systematic review, not a meta-analysis, of the state of the art in intelligent data analysis in the medical field. Consequently, it does not go into the details of the results obtained in each case study. Hence, data analysis techniques are not applicable in this case.

Results
Of 4303 citations, we included 177 papers that met the inclusion criteria. Figure 1 shows our search, the selection process, and the reasons for exclusion. This Section provides a summary of the 177 studies reviewed in this paper, which are necessary for understanding the intelligent data analysis performed in the medical field.

Machine Learning Principles
Every learning process, deep or not, consists of two phases: the estimation of unknown dependencies in a system from a given data set (input) and the use of estimated dependencies to predict new outputs of the system. In this Subsection, we analyse the most common techniques used in both phases.

Input. Definition and Methods
The input of an ML process is a set of instances. These instances are the things that can be classified, associated, or clustered. Each instance is an individual, independent example of the concept that must be learned. Additionally, each one is characterised by the values of a set of predetermined attributes. Each dataset is represented as a matrix of instances versus attributes, which, in database terms, is a flat file (single relation) [13]. Kourou et al. [19] defined the main common learning methods as unsupervised learning and supervised learning.
In unsupervised learning, non-labelled examples are provided and there is no notion of the output during the learning process. The aim of this kind of learning is to explore the data in order to find different categories or clusters that allow us to organise them. Representative unsupervised clustering algorithms are K-means Clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Self-Organized Maps (SOMS), Similarity Network Fusion (SNF), Perturbation Clustering for Data Integration and Disease Subtyping (PINS), and Cancer Integration via Multikernel Learning (CIMLR). The K-means Clustering algorithm is a classic technique that consists in dividing M points in N dimensions into K clusters such that the within-cluster sum of squares is minimised [20]. The main problem of the K-means Clustering algorithm is that K must be known in advance. The DBSCAN and SOMS algorithms solve this problem. In the DBSCAN algorithm, the density associated with a point is obtained by counting the number of points in a region of specified radius around the point. Points with a density above a specified threshold are grouped into clusters [21]. The SOMS algorithm, first proposed by Kohonen [22], uses lateral interaction within a given neighbourhood to cause similar input patterns to cluster to adjacent (relative to the neighbourhood) data [23]. SNF combines diverse types of genome-wide data to create a comprehensive view of a given disease, by constructing networks of samples (e.g., patients) for each available data type and then efficiently fusing these into one network that represents the full spectrum of underlying data [24]. PINS addresses two challenges: the meaningful integration of several different data types and the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival [25]. Finally, CIMLR is a new cancer subtyping method that integrates multi-omic data to reveal molecular subtypes of cancer [26].
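As an illustration of the K-means idea described above, the following minimal Python sketch alternates an assignment step and a centroid update step until the partition stops changing. It is a Lloyd-style iteration written for this example, not necessarily the procedure of [20]; the function names, the choice of the first K points as initial centroids, and the toy data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Partition `points` into k clusters with Lloyd-style iterations.

    Simplification (assumption): the first k points are the initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the partition has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def within_cluster_sum_of_squares(points, labels, centroids):
    """The quantity that K-means tries to minimise."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical two-dimensional toy data forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels, within_cluster_sum_of_squares(points, labels, centroids))
```

Note that K is passed explicitly, which reflects the limitation mentioned above: the number of clusters must be fixed before the algorithm runs.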
In supervised learning, a labelled set of training data is used to estimate or map the desired output. However, labelling a full data set requires considerable effort from medical experts. This burden can be reduced through Active Learning (AL), which learns incrementally by starting with a few examples and then, in each iteration, asking the medical expert to label only the instance that the algorithm determines to be the most informative. AL techniques have been successfully used in the medical field. Experiments have shown that this can reduce the number of labelled instances needed to reach maximal accuracy by 30-40% compared to standard methods that start with the fully labelled data set (see, e.g., [27,28]).
On the other hand, in supervised learning, classification and regression are two common tasks. In a classification task, the learning process categorises data into a finite set of classes. Based on this process, each new sample can be categorised into one of the existing classes. In a regression task, the learning process maps data into a real variable. Based on this process, the value of the predicted variable can be estimated for each new sample.
The most common algorithms in supervised learning are Support Vector Machine (SVM), Iterative Dichotomiser 3 (ID3), K-Nearest-Neighbour (KNN), Naïve Bayes, Bayesian Networks, linear regression, and logistic regression. SVM algorithms use linear models to implement nonlinear class boundaries [29]. The ID3 algorithm, invented by Quinlan [30], is the precursor of algorithms such as C4.5 and J4.8 [31]. These algorithms are used to generate decision trees from a dataset. The KNN algorithm was first used by Fix and Hodges [32], and it consists in assigning a class to a new instance based on a distance metric to the existing ones. Naïve Bayes algorithms are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. This kind of algorithm was analysed by McCallum and Nigam [33]. The Naïve Bayes algorithm assumes conditional independence. If dependencies between features exist, Bayesian Network algorithms must be used instead, since they allow the joint probability distributions to be modelled explicitly [34,35]. Finally, regression is described in most standard statistical texts, and a particularly comprehensive treatment can be found in [36]. Some of the most interesting types of regression are linear and logistic regression. Linear regression algorithms model the relationship between a scalar dependent variable and one or more scalar explanatory variables. Logistic regression algorithms model the relationship between a categorical dependent variable y and one or more scalar or categorical explanatory variables [37].
Regardless of whether the learning method is unsupervised or supervised, the procedure is always the same. Figure 2 shows this process. Based on this diagram, the best algorithm is chosen iteratively, going from the input to an output. This output can serve either to organise data (unsupervised learning) or to infer the result for a new sample (supervised learning). The different ways of representing it are detailed below.
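To make the KNN description more concrete, here is a minimal sketch in Python: a new instance receives the majority class among its k nearest training instances under the Euclidean distance. The function name, the choice of k = 3, and the two-feature toy data with its class labels are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Assign to `query` the majority class among its k nearest training instances."""
    # Sort the labelled instances by Euclidean distance to the query point.
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two numeric attributes per instance.
train_X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["healthy", "healthy", "healthy", "patient", "patient", "patient"]

print(knn_predict(train_X, train_y, (1.5, 1.5)))
print(knn_predict(train_X, train_y, (8.5, 8.5)))
```

The distance metric and the value of k are design choices; other metrics can be substituted for `math.dist` without changing the structure of the sketch.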

Output. Learning Representation
There are many different ways of representing the patterns in data. Each one dictates the kind of technique that can be used to infer the output structure from data [13] (in supervised learning) or simply represents different clusters of data (in unsupervised learning). The most common representations are as follows.
• Decision tables are one type of information table with a decision attribute giving the decision classes for all objects [38]. Table 1 shows a simple example of a decision table, where the last column represents the final decision. In decision tables, the output is represented in the same way as the input. Because of this, decision tables are one of the simplest ways of representing the output of an ML classification model.
• Decision trees (DT) are predictive representations that can be used for both classification and regression models. Decision trees are a hierarchical way of partitioning the space, where the goal is to create a model that predicts the value of a target variable based on several input variables. A DT learns by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive way, called recursive partitioning. When a DT is used for classification tasks, it is more appropriately referred to as a classification tree. On the other hand, when it is used for regression tasks, it is called a regression tree [39]. Breiman et al. [40] provided a simple example of a classification tree with medical data. In this example, patients were classified as being at high or low risk of not surviving at least 30 days, based on the initial 24 h. Serrano et al. [41] proposed an analysis through a regression tree for studying successful weight loss among users of a commercial health app for three distinct subgroups: the occasional users, the basic users, and the power users of the app. In both examples, the outcome for a new sample can be inferred from the corresponding representation.
• Regression lines are the most common representations for linear regression. The regression line is the one that best fits the data point cloud. Yoon et al. [42] analysed the influence of the pulse pressure on the systolic blood pressure. For a new sample, the outcome can be inferred from the regression line once the value of the pulse pressure is known.
• Hyper-plane diagrams are a specific type of representation for SVM algorithms. The basic idea is to find the maximum margin hyper-plane that separates the different classes clearly and maximises the distance between them [43]. Tomar and Agarwal [44] applied the hyper-plane diagram to classify diabetic patients based on the blood glucose level and the body mass index. The class of a new sample can be inferred from the known values of the blood glucose level and the body mass index.
• Clusters are a specific type of representation for clustering algorithms. In this case, the output takes the form of a diagram showing how the instances fall into clusters. In the simplest case, this involves associating a cluster number with each instance, which might be depicted by laying the instances out in two dimensions and partitioning the space [13]. Rawte and Anuradha [45] clustered patients depending on whether they suffered from heart diseases, arthritis, or Parkinson's disease. In this case, the outcome for a new instance cannot be inferred, since clustering tasks are only designed for organising data.
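Among the representations listed above, the regression line lends itself to a short worked example. The sketch below fits a line by ordinary least squares and uses it to infer the outcome for a new sample. The function names and the numbers are hypothetical; they are synthetic values chosen for readability and are not taken from any of the cited studies.

```python
def fit_regression_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x_new):
    """Infer the dependent variable for a new sample from the fitted line."""
    return slope * x_new + intercept


# Synthetic explanatory (xs) and dependent (ys) values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_regression_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 5.0))
```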
In pattern recognition, one tries to find a solution that succeeds in discriminating between points of different classes or in mapping the data into another variable [46]. Additionally, the learning model must be able to generalise to new and previously unseen inputs. To this end, the model must be trained. Because unsupervised learning only seeks to organise data, it does not require prior training [47]. Next, the training and the subsequent process of testing are detailed.

Training and Testing
One of the main challenges of ML is to estimate the likely future performance on new data. The initial discovery of the predictive relationships in data is often done with a training set. The training set is a set of randomly chosen data points representing the inputs of the model and their corresponding outputs [48]. Usually, a portion of the original and non-labelled data set, called the testing set, is used for assessing the quality of the model [49,50].
Naturally, the training process is subject to some measurement error. Consequently, how well a learning algorithm works depends on its ability to minimise this training error and to minimise the difference between the training and testing errors. These two factors correspond to two basic concepts in ML, underfitting and overfitting, respectively, both of which reduce the performance of the algorithm.
Overfitting occurs when the learning algorithm describes the random error or noise instead of the underlying data relationship. Underfitting occurs when the learning algorithm cannot find a solution that fits the observed data well enough [51]. Thus, overfitting and underfitting are minimal when a model fits both the training and testing sets, since, although training and testing data are independent, they follow the same underlying distribution. In this case, the capacity of the method (the percentage of training samples that the algorithm can fit) is almost perfect.
There are different methods for avoiding overfitting and underfitting. In the case of overfitting, the most common way is to restrict the model complexity, for instance, by reducing the number of features. In underfitting, the most common way is to increase the number of features.
On the other hand, training and testing samples should be sufficiently large in order to obtain reliable results from a model. However, that is not always possible. Next, different methods for evaluating the performance of the algorithm both for large and small samples are detailed.

Credibility. Algorithm Evaluation
Among the most common methods used for analysing the performance of the algorithm by splitting the initial labelled data set are the holdout, the random sampling, the cross-validation, and the bootstrap methods.
The holdout method splits the data into two mutually exclusive subsets, the training set and the testing set. The model is built from the training set and its output is evaluated on the testing set [52]. As a consequence, this method needs datasets with a large amount of data. In practice, it is common to use one third of the data for testing and two-thirds for training.
The random sampling method, also called the repeated holdout method, is similar to the holdout method. It repeats the holdout method several times with the aim of improving the accuracy of the algorithm [53,54]. As with the holdout method, large data sets are also required in this method.
Both the holdout and random sampling methods are simple procedures for evaluating the performance of an algorithm, but they raise an important issue. Statistically speaking, the sample size required by both procedures must be as large as possible to obtain a good accuracy of the algorithm. This is because, if a database is small, the sample selected for training may not be representative of the dataset. However, as noted above, it is not always possible to have a large database [55]. If the amount of data is limited and small, other evaluation methods must be applied to analyse the performance of the algorithm.
One of them is the cross-validation method, where a dataset is split into a number of partitions. Each partition in turn is held out and used for testing, while the remainder is used for training. There are different types of cross-validation methods. Some of them are K-fold cross-validation, Leave-one-out (LOO) cross-validation, and Leave-p-out (LPO) cross-validation. In K-fold cross-validation, the dataset is split into exactly k partitions (the folds). The holdout procedure is then repeated k times, once per partition, so that each partition is used exactly once for testing [56]. In general, k is not a fixed parameter, but 10-fold (k = 10) cross-validation is commonly used. In LOO cross-validation, the size of the training set is fixed as n − 1, where n is the database size. In this way, each data point is successively "left out" of the sample and used for validation. In LPO cross-validation, the size of the training set is fixed as n − p, with p ∈ {1, 2, . . . , n − 1}. Thus, every possible subset of p data points is successively "left out" of the sample and used for validation [57].
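The holdout and repeated holdout procedures can be sketched in a few lines of Python. The split below works on instance indices and reserves one third of them for testing, as mentioned above. The function names, the seed handling, and the `evaluate` callback (a user-supplied function returning a score for a given split) are assumptions made for this illustration.

```python
import random


def holdout_split(n_samples, test_fraction=1 / 3, rng=None):
    """Return (train_indices, test_indices) as two mutually exclusive lists."""
    rng = rng or random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)  # random assignment of instances to the two subsets
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_holdout(n_samples, evaluate, repeats=10, test_fraction=1 / 3, seed=0):
    """Average the score of `evaluate(train_idx, test_idx)` over several random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = holdout_split(n_samples, test_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Example with 30 instances: 20 are used for training and 10 for testing.
train_idx, test_idx = holdout_split(30, rng=random.Random(1))
print(len(train_idx), len(test_idx))
```

In a real study, `evaluate` would train a model on the instances in `train_idx` and return, for example, its accuracy on the instances in `test_idx`.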
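The three cross-validation schemes differ only in how the testing indices are chosen, which the following Python sketch tries to make explicit. The generator names are hypothetical, and for simplicity the folds are taken in the original order of the instances, without the shuffling that is often applied first.

```python
from itertools import combinations


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices); each instance is tested exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_p_out_splits(n_samples, p):
    """Yield one split for every possible subset of p instances left out."""
    for test in combinations(range(n_samples), p):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, list(test)


def leave_one_out_splits(n_samples):
    """LOO is the special case of LPO with p = 1 (training size n - 1)."""
    return leave_p_out_splits(n_samples, 1)


for train, test in k_fold_splits(10, 5):
    print(train, test)
```

The number of LPO splits grows combinatorially with n and p, so in this sketch it is only practical for small data sets.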
Generally, the prediction error of cross-validation methods is nearly unbiased, but it can be highly variable. Thus, the bootstrap method emerged as a smoother version of cross-validation methods [58].
The previous methods use sampling without replacement, i.e., once a data point is selected from the dataset to form the training or testing set, it cannot be chosen to form the other set. On the other hand, the bootstrap method is based on the statistical procedure of sampling with replacement. In this method, the idea is to sample the data set with replacement to form the training set. Thus, when an input is chosen to form the training set, it is placed back into the entire data set and it can be selected again [59]. The bootstrap is a fairly good procedure for estimating the error for very small data sets. However, the training set can represent a special and artificial situation, since an input can be selected an unlimited number of times.
Once the basic principles of ML are known, DL can be understood. The next Subsection details the basic DL principles.
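A minimal Python sketch of bootstrap sampling is given below. The training set is drawn with replacement, so the same instance may appear several times. Using the instances that were never drawn as the testing set is a common convention that we assume here for illustration; the function name is likewise hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples training indices with replacement.

    Assumption: instances never drawn are used as the testing set.
    """
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


train_idx, test_idx = bootstrap_split(100, random.Random(0))
# The training set has n entries, but usually fewer than n distinct instances.
print(len(train_idx), len(set(train_idx)), len(test_idx))
```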

Deep Learning Principles
DL provides a very powerful framework for supervised learning. Traditional ML models, such as SVM, ID3, KNN, Naïve Bayes, Bayesian Networks, or linear and logistic regression models, are considered to have shallow architectures. DL changes this. DL is considered to be an improvement of artificial neural networks, consisting of more layers that permit higher levels of abstraction and improved predictions from data [60]. A DL model can be trained in various ways, with different approaches or algorithms [61]. Thus, a DL architecture is a multilayer stack of simple modules that are subject to learning, many of which compute non-linear input-output mappings or classifications. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. By adding more layers and more units within the layers, a deep network can represent functions of increasing complexity.
A neural network is defined as a computational model that consists of many simple, connected processors, called neurons, each one producing a sequence of real-valued activations. During neural network training, the aim is to learn to map or classify a fixed-size input to a fixed-size output. To go from one layer to the next, a set of units computes a weighted sum of its inputs from the previous layer and passes the result through a non-linear function. The final layer of a neural network is called the output layer [62].
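The step from one layer to the next can be sketched as follows in Python: every unit computes a weighted sum of the outputs of the previous layer and passes it through a non-linear function. The sigmoid is used here only as one possible non-linearity, and the function names, layer sizes, and weight values are arbitrary assumptions made for this example.

```python
import math


def sigmoid(z):
    """A common non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Each unit: weighted sum of the previous layer's outputs, then a non-linearity."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Propagate a fixed-size input through a stack of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations  # activations of the output layer


# Hypothetical network: 2 inputs, one layer of 3 units, and an output layer of 1 unit.
example_layers = [
    ([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]], [0.0, 0.1, -0.1]),
    ([[0.6, -0.3, 0.8]], [0.05]),
]
print(network_forward([0.5, -1.0], example_layers))
```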
The behaviour of the layers lying between the input and the output layers is not directly specified by the training data. In this case, the learning algorithm must decide how to use these layers to obtain the best approximation of the output. For instance, for classification tasks, higher representation layers amplify the aspects of the input that are important for discrimination and suppress irrelevant variations. Because the training data do not show the desired output for each of these layers, they are called hidden layers. On the other hand, linear models, such as logistic regression and linear regression, are very appealing because they can be fitted efficiently and reliably, either in closed form or with convex optimisation [17].
In many real situations, there is a huge quantity of possible training solutions for a network architecture. Due to its properties, backpropagation is one of the most widely used procedures in the training of neural networks. This popularity primarily revolves around the ability of backpropagation to learn complicated multidimensional mappings [63]. To the best of our knowledge, backpropagation was first proposed by Werbos [64] and subsequently developed in depth by Rumelhart and McClelland [65].
The term backpropagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. However, backpropagation actually refers only to the method for computing the gradient of the error with respect to the network weights, while another algorithm is used to perform learning using this gradient [17].
When backpropagation is used to fit a network with known input and output datasets, the input neurons are activated through sensors perceiving the environment, and the other neurons are activated through weighted connections from previously active neurons [66]. During the training phase of this network, the differences between the real outputs and the outputs predicted by the model are propagated back through the structure of the network. Thus, the connection weights are chosen by minimising the error produced by the predicted outputs. Once the weight values minimise the error of the network, the evaluation of the learning algorithm starts in order to analyse its credibility [67].
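The following Python sketch illustrates this training phase on a very small network with one hidden layer and a single output unit. The error between the predicted and the real output is propagated back to obtain the gradient for every weight, and a plain gradient descent update then adjusts the weights. The squared-error loss, the sigmoid units, the learning rate, the initial weights, and the logical OR toy data are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the hidden activations and the output of the network."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def backprop(params, x, target):
    """Gradients of E = 0.5 * (output - target)**2 with respect to all weights."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    # Error signal at the output unit (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    # Error signal propagated back to each hidden unit.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def train(params, data, learning_rate=0.5, epochs=2000):
    """Gradient descent: move every weight against its gradient, sample by sample."""
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        for x, target in data:
            gW1, gb1, gW2, gb2 = backprop((W1, b1, W2, b2), x, target)
            W1 = [[w - learning_rate * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
            b1 = [b - learning_rate * g for b, g in zip(b1, gb1)]
            W2 = [w - learning_rate * g for w, g in zip(W2, gW2)]
            b2 = b2 - learning_rate * gb2
    return W1, b1, W2, b2


def total_error(params, data):
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)


# Toy data (logical OR) and arbitrary initial weights, for illustration only.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
initial = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.2, -0.1], 0.0)
trained = train(initial, data)
print(total_error(initial, data), total_error(trained, data))
```

Keeping `backprop` (gradient computation) separate from `train` (weight update) is a deliberate design choice in this sketch, since the two steps are conceptually distinct.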
Next, we analyse some application examples of these ML and DL methods in the medical field.

Applications
Through the application of the ML and DL basic methods described in the previous section, different authors discovered interesting patterns in the medical field, helping them, in some cases, in the decision-making.
As particular case of interest nowadays, ML means an useful approach in developing diagnostic models supported by clinical data in the context of current Corona Virus Disease 2019 (COVID-19) pandemic. Alimadadi et al. [68] and Arga [69] discuss about this opportunity. Many examples of ML and DM applications support this approach, as Albahri et al. [70] show in a deep systematic review [70]. We point out just few representative cases. Sujath et al. [71] present a model that is able to predict the spread of COVID-19 by performing linear regression, multilayer perceptron, and vector autoregression methods. Randhawa et al. [72] apply a supervised ML-based alignment approach on an intrinsic COVID-19 virus genomic signature for classifying the whole COVID-19 virus genomes. Even in the war against misinformation about COVID-19 content, ML is able to quantify the online opponents of establishment health guidance, as Sear et al. shows [73].
Next, we analyse some examples of ML applications in other heterogeneous medical fields. We start with the unsupervised ML algorithms K-means Clustering, DBSCAN, SOMs, SNF, PINS, and CIMLR. Based on K-means Clustering, Maas et al. [74] identified gene clusters overlapping between immune and autoimmune human diseases. Phillips et al. [75] assigned tumour classes to subclasses to recognise the dominant feature of the gene list characterising each subclass. Ng et al. [76] segmented 2-D magnetic resonance images of the head to extract important information. Aarsland et al. [77] analysed the inter-relationship between neuropsychiatric symptoms in patients with Parkinson's disease and dementia. Sun et al. [78] screened for unknown and unexpected infectious diseases. Seok et al. [79] classified the genomic response between human inflammatory diseases. Manickavasagam et al. [80] classified the Plasmodium species in thin blood smear images to control malarial infection. Antony and Ravi [81] classified mammographic images, reducing the false positive results generated by other methods. Sari [82] identified characteristics of patients suffering from tuberculosis, and Kumar et al. [83] segmented magnetic resonance images to diagnose different brain disorders, such as Alzheimer's disease. Based on DBSCAN, Chauhan et al. [84] clustered lung, kidney, throat, stomach, and liver cancer in databases. Çelik et al. [85] detected anomalies in temperature data. Radhakrishnan and Kuttiannan [86] diagnosed prostate cancer through transrectal ultrasound images. Antonelli et al. [87] identified the examination pathways commonly followed by patients with diabetes. Sriram [16,88] diagnosed Parkinson's disease from a voice dataset. Plant et al. [89] and Aidos and Fred [90] discriminated Alzheimer's disease using magnetic resonance imaging data features and longitudinal information, respectively. Based on SOMs, Stebbins et al. [91] identified Parkinson's disease. Lyketsos et al. [92], Harold et al. [93], and Lambert et al. [94] clustered Alzheimer's disease. Based on SNF, Wang et al. [24] combined mRNA expression, DNA methylation, and microRNA (miRNA) expression data for five cancer data sets; the approach outperformed single data type analysis and established integrative approaches when identifying cancer subtypes, and it was effective for predicting survival. Based on PINS, Nguyen et al. identified known cancer subtypes and novel subgroups of patients with significantly different survival profiles [25]. Based on CIMLR, Ramazzotti et al. extracted biologically meaningful cancer subtypes from multi-omic data from 36 cancer types [26].
Within the supervised algorithms of ML, we provide some application examples of SVM, ID3, KNN, Naïve Bayes, Bayesian Networks, and linear and logistic regression algorithms. Based on the SVM algorithm, Huang et al. [95] created a strategy with feature selection to discriminate between breast cancer and fibroadenoma in order to find some important risk factors for breast cancer. Avci [96] classified the Doppler signals of heart valve diseases, combining feature extraction and classification from Doppler signal waveforms measured at the heart valve using Doppler ultrasound. Fei [97] diagnosed arrhythmia cordis. Yao et al. [98] identified and measured pulmonary abnormalities on chest computed tomographic imaging in cases of infection. Sartakhti et al. [99] diagnosed hepatitis, predicting its presence or absence from the results of various medical tests carried out on a patient. Abdi and Giveki [100] diagnosed erythemato-squamous dermatological diseases. Berna et al. [101] diagnosed Plasmodium falciparum (malaria) infection through the analysis of breath specimens. Kesorn et al. [102] exploited the infection rate in the Aedes aegypti mosquito to forecast the dengue morbidity rate. Khan et al. [103] classified suspected dengue cases based on human blood sera, and Meystre et al. [104] classified the presence or absence of pneumonia in children using chest imaging reports. Hernandez et al. [105] inferred the infection risk using pathology data.
Based on the ID3 algorithm, Forsström et al. [106] identified discrepant diagnoses in laboratory databases of thyroid patients. Tanner et al. [107] predicted the diagnosis and outcome of dengue fever in the early phase of illness. Ture et al. [108] predicted risk factors for recurrence in determining recurrence-free survival of breast cancer patients. Bashir et al. [109] classified, diagnosed, and predicted diabetes. Thenmozhi and Deepika [110] classified and predicted heart diseases based on different attribute selection measures, such as information gain, gain ratio, Gini index, and distance measure. Buczak et al. [111] classified and predicted malaria in South Korea by extracting relationships between epidemiological, meteorological, climatic, and socio-economic data. Subasi et al. [112] diagnosed chronic kidney disease, achieving near-optimal performance in identifying this illness. Abdar et al. [113] addressed the early detection of liver disease, identifying its risk factors and finding that females are more likely to have liver disease than males. Finally, Singh and ManjotKaur [114] analysed the use of these algorithms for diagnosis in angioplasty and stents for heart disease treatment.
Based on the KNN algorithm, Jen et al. [115] adopted a preventative perspective and ascertained the impact of important physiological indicators and clinical test values for various chronic illnesses, such as hypertension, diabetes, cardiovascular disease, liver disease, and renal disease. Liu et al. [116] diagnosed thyroid disease with a computer-aided diagnostic system. Zuo et al. [117] proposed an adaptive fuzzy KNN approach for efficient Parkinson's disease diagnosis. Wisittipanit et al. [118] analysed length heterogeneity polymerase chain reaction (LH-PCR) data associated with inflammatory bowel disease, studying the relationships between some microbial communities within the human gut and their roles in disease. Papakostas et al. [119] diagnosed Alzheimer's disease based on magnetic resonance imaging data features, applying a lattice computing scheme. Chandel et al. [120] detected and classified thyroid diseases using the RapidMiner tool. Mahajan et al. [121] discriminated between febrile infants aged 60 days or younger with and without bacterial infections. Biswas and Acharyya [122] identified disease-critical genes causing Duchenne muscular dystrophy. Nelson et al. [123] identified DNA methylation sites related to HIV infection and advanced epigenetic aging in HIV-positive patients. Vargas et al. [124] identified the possible genetic causes of some neurodegenerative diseases based on phenotype prediction. Finally, Mabrouk et al. considered a wide set of ML techniques, including KNN, to classify images in order to detect Parkinson's disease and dopaminergic deficit [125].
Based on the Naïve-Bayes algorithm, Soni et al. [126], Pattekari and Parveen [127], and Chaurasia and Pal [128] analysed prediction systems to detect heart diseases. Vijayarani and Dhayanand [129] classified liver diseases, such as cirrhosis, bile duct, chronic hepatitis, liver cancer, and acute hepatitis, from a liver function test dataset. Thangaraju and Mehala [130] predicted lung cancer at an early stage by using generic patient attributes and lung cancer symptoms, such as age, sex, wheezing, shortness of breath, and pain in shoulder, chest, and arm. Vijayarani and Dhayanand [131] predicted kidney disease based on blood and urine tests as well as tests on a sample of kidney tissue. Zhou et al. [132] classified normal and pathological brains based on magnetic resonance imaging scans. Trihartati and Adi [133] identified tuberculosis. Ferreira et al. [134] proposed a system based on this algorithm to predict Parkinson's disease. Finally, Stern et al. [135] predicted the responsiveness to lithium of bipolar disorder patients.
Based on Bayesian Networks, Gevaert et al. [136] predicted the prognosis of breast cancer by integrating clinical and microarray data. Vázquez-Castellanos et al. [137] analysed interactions between the bacterial community, its altered metabolic pathways, and systemic markers of immune dysfunction in HIV-infected individuals. Castro et al. [138], Wu et al. [139], and Zhang et al. [140] diagnosed Alzheimer's disease, and Rowe et al. [141] diagnosed Parkinson's disease. Sciarretta et al. [142] studied heart failure and high cardiovascular risk in patients with hypertension. Wu et al. [143] compared the effectiveness of renin-angiotensin system blockers and other antihypertensive drugs in patients with diabetes.
Based on linear and logistic regression, Raggi et al. [144] analysed coronary artery calcification in adults with end-stage renal disease receiving hemodialysis. Lanzkron et al. [145] studied the mortality rates for children and adults with sickle-cell disease. Zhou et al. [146] diagnosed lymph node metastasis in gastric cancer. Smith et al. [147] analysed the global rise in human infectious disease outbreaks. Althoff et al. [148] compared risk and age at diagnosis of myocardial infarction, end-stage renal disease, and non-AIDS-defining cancer in HIV-infected versus uninfected adults. Williams et al. [149], Nowak et al. [150], and Horvath et al. [151] predicted HIV-1 disease progression. Gjoneska et al. [152] analysed transcriptional and chromatin state dynamics to characterise Parkinson's disease. Fischer et al. [153] evaluated the stability of Ebola virus on surfaces and in fluids. Finally, Ly et al. [154] predicted hepatitis C virus disease progression based on generic hepatitis C symptoms.
Additionally, some authors combined supervised algorithms with neural networks trained by backpropagation. Some examples are Chaplot et al. [155] and Saritha et al. [156], who combined neural networks and the SVM algorithm, and El-Dahshan et al. [157], who combined neural networks and the KNN algorithm, to classify magnetic resonance images of the brain. Chen et al. [158] combined neural networks and the KNN algorithm, and Hariharan et al. [159] combined neural networks and the SVM algorithm, in order to detect Parkinson's disease. Fan et al. [160] combined neural networks and the ID3 algorithm, and Onan [161] combined neural networks and the KNN algorithm, to detect cancer. Marateb et al. [162] combined neural networks with both the Naïve-Bayes and SVM algorithms, and Norouzi et al. combined neural networks and the SVM algorithm [163], to predict renal diseases. Das et al. tackled diabetic retinopathy and age-related macular degeneration by means of convolutional neural networks [164].
Many other ML methods are applied for classification, prediction, or clustering purposes in all kinds of diseases. For example, Obolski et al. apply Random Forest (RF) to identify genes associated with invasive disease in Streptococcus pneumoniae [165]. This technique was also applied by McDonnell et al. to identify the incidence of hyperglycaemia in steroid-treated hospitalised inflammatory bowel disease patients [166]. Das et al. apply an ML model called Sparse High-Order Interaction Model with Rejection Option (SHIMR) for the diagnosis of Alzheimer's disease [167]. Patrick et al. predict drug repurposing for immune-mediated cutaneous diseases by applying Global Vectors (GloVe), an unsupervised ML algorithm for obtaining vector representations of words [168]. Xi et al. apply Proper Orthogonal Decomposition (POD), Principal Component Analysis (PCA), Dynamic Mode Decomposition (DMD), and DMD with Control (DMDC) to diagnose obstructive lung diseases using exhaled aerosol images, and classify the image data using both the SVM and RF algorithms [169]. Finally, Konerman et al. consider cross-sectional (CS) and longitudinal models as predictors in chronic hepatitis C virus [170]. Table 2 summarises all these works, focusing on the goal searched, the algorithm used, and the area of application within the medical field. Figure 3 shows the same information as Table 2, although it counts the number of references instead of identifying them.

Table 2. Application summary of representative techniques.

Discussion
In this paper, we have described different tools and approaches that are widely used in the medical and healthcare fields. These tools belong to AI and serve the main aim of ML: finding useful patterns in databases that help to explain the data and to make non-trivial predictions about them.

Advantages and Disadvantages
Analysing Table 2, we observe that, in general, the most common task performed in the medical field is classification, in all the application areas exemplified. Regression is another common task in the infectious disease area, but it is little used in areas such as Alzheimer's disease, Parkinson's disease, and hepatic diseases. Clustering, in turn, is rarely considered in hepatic and heart diseases, although it is often used in Alzheimer's disease and Parkinson's disease. Finally, the combination of neural networks and other supervised algorithms is widely used in the cancer, Alzheimer's disease, Parkinson's disease, and renal disease application areas, but is seldom applied in the metabolic, hepatic, infectious, and heart disease application areas.
Each author's choice of technique was motivated by the pros and cons of each tool in the particular area of application and by their own experimental conditions. The following tables analyse some advantages and disadvantages of the techniques and approaches previously described. Based on [190,191], Table 3 shows some advantages and disadvantages of ML and DL. Based on [192], Table 4 analyses some advantages and disadvantages of unsupervised learning and supervised learning. Finally, based on [44,193], Table 5 details some advantages and disadvantages of each algorithm described previously. In any case, it should be noted that the various ML techniques cited in this review have been successful in their particular area of application. In this sense, no technique can be ruled out a priori, but its applicability must be carefully analysed according to the purpose of the investigation (classification, prediction), scope, data typology and their relationship, experimental resources, etc.

Table 3. Advantages and disadvantages of Machine Learning (ML) and Deep Learning (DL).

Method: ML
Advantages: Algorithms are often easy to implement. Algorithms are flexible enough to handle complex problems with multiple interacting variables. Input and output are not necessarily fixed.
Disadvantages: Complex relationships between dependent and independent variables are not identified easily in high-dimensional databases. High computational cost.

Method: DL
Advantages: Complex relationships between dependent and independent variables are identified easily in high-dimensional databases. Ability to handle databases with high noise.
Disadvantages: Input and output are fixed. Overfitting problem possibility is high. Implementation is not as easy as in ML. Training process requires a higher computational cost than ML.

Table 4. Advantages and disadvantages of unsupervised learning and supervised learning.

Method: Unsupervised learning
Advantages: It does not require the training data to be labelled. The automatic labelling of the training data set saves the time spent in hand classification. Classification task is fast.
Disadvantages: There are no notions of the output along the learning process. It does not allow estimating or mapping the results of a new sample. Results vary considerably in the presence of outliers. It only performs classification tasks.

Method: Supervised learning
Advantages: There are notions of the output along the learning process. It performs classification and regression tasks. It allows estimating or mapping the results to a new sample.
Disadvantages: It requires a labelled data set. It requires a training process.

Table 5. Advantages and disadvantages of described algorithms.

Method: K-means Clustering
Advantages: Simple clustering approach. Efficient clustering method. Method is easy to implement.
Disadvantages: Requires a number of clusters in advance. Handling categorical attributes causes problems. Results vary considerably in the presence of outliers.

Method: (method label missing)
Advantages: A number of clusters in advance is not required. Efficient clustering method.
Disadvantages: Handling categorical attributes causes problems. Results vary considerably in the presence of outliers.

Method: SVM
Advantages: Better accuracy compared to other classifiers. Overfitting problem is not so great as in other methods.
Disadvantages: High computational cost. Training process requires more time than other methods.

Method: ID3
Advantages: There are no domain requirements. Exact value results are provided for various actions, minimising the ambiguity of complex decisions. High-dimensional databases are processed easily. Classifier and output are easy to interpret.
Disadvantages: Results are restricted to one output attribute. Only categorical output is generated. Classifier performance depends on the type of dataset, making it unstable.

Method: KNN
Advantages: Method is easy to implement. Training process requires low computational cost.
Disadvantages: Large storage space is required. Sensitivity to databases with high noise. Testing process requires high computational cost.

Method: Naïve Bayes, Bayesian Networks
Advantages: Method is easy to implement. Method is faster and provides more accuracy in high-dimensional databases than other methods.
Disadvantages: Low accuracy is provided in cases where dependence exists between variables.

Method: Linear regression
Advantages: Better accuracy compared to other classifiers. Complex relationships between dependent and independent variables are identified easily.
Disadvantages: Results vary considerably in the presence of outliers. Training process requires more time than other methods. Classifier performance depends on the type of dataset, making it unstable. Only numerical output is generated.

Method: Logistic regression
Advantages: Better accuracy compared to other classifiers. Complex relationships between dependent and independent variables are identified easily.
Disadvantages: Results vary considerably in the presence of outliers. Training process requires more time than other methods. Classifier performance depends on the type of dataset, making it unstable. Only categorical output is generated.

Method: Neural network
Advantages: Complex relationships between dependent and independent variables are identified easily. Ability to handle databases with high noise. A previous feature extraction task is not required.
Disadvantages: High possibility of local minima. High possibility of overfitting problem. Classifier is difficult to interpret. High computational time is required if there is a large number of layers. No explanation or justification of decisions can be given, i.e., a "black box" characteristic.
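To make one of the disadvantages listed in Table 5 concrete, the following minimal sketch illustrates how a single outlier can change a K-means result. It is an illustration under stated assumptions, not code from any of the reviewed works: the function name `kmeans_1d`, the toy values, the restriction to two clusters on one-dimensional data, and the initialisation of the centroids at the minimum and maximum are all hypothetical choices made for brevity.

```python
from statistics import mean


def kmeans_1d(points, max_iter=100):
    """Lloyd's algorithm for k = 2 on 1-D data.

    Assumption for illustration: centroids start at the minimum and
    maximum of the data, which makes the run deterministic.
    """
    centroids = [min(points), max(points)]
    clusters = [[], []]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy measurements forming two well-separated groups.
clean = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
clean_centroids, clean_clusters = kmeans_1d(clean)

# The same data plus one extreme value (an outlier).
noisy = clean + [100.0]
noisy_centroids, noisy_clusters = kmeans_1d(noisy)

print(clean_clusters)  # the two groups are recovered
print(noisy_clusters)  # the outlier takes a cluster of its own
```

In this toy run the outlier occupies one of the two clusters by itself, so the two genuine groups are merged. This is the behaviour that the entry "Results vary considerably in the presence of outliers" refers to.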

Applicability of ML to Clinical Practice
Machine learning is a valuable tool for medical professionals in the prevention, diagnosis, and treatment of human diseases. However, there are currently few examples of the successful application of these techniques in the particular area of clinical practice, despite the fact that the various ML techniques generate good results. It is generally said that an ML algorithm has learned a new task, instead of saying that it has simply extracted a set of statistical patterns from a set of training data, where these data are manually selected and labelled under the direct supervision of someone who chose which algorithms, parameters, and workflows would be used for its development. It is also said that a neural network correctly distinguishes, for example, pictures of dogs and cats by learning the characteristics of those animals, when it simply associates specific groups of colours and textures in the pictures. Thus, if an image deviates too far from the examples that the neural network has seen, the prediction will fail, with serious consequences when the task is the detection of cancer or a neurodegenerative disease.
ML algorithms, through their representation, evaluation, and optimization components [194], benefit from the availability of large amounts of data and powerful hardware architectures to represent more complex statistical phenomena than traditional approaches, while DL allows for identifying previously hidden patterns, extrapolating trends, and predicting results in a wide spectrum of problems by trying to "learn" an approximation to some function.
ML techniques are currently applied to medical records in clinical practice to predict, for example, which patients are at greatest risk of readmission to hospital or which are unlikely to follow prescribed treatments. The potential applications span diagnosis, research, drug development, and clinical trials. Despite the large amount of digitized data, predictive models that are built from medical records are mainly based on traditional linear models and rarely consider more than 20 or 30 variables. However, a key advantage of ML is that researchers do not need to specify which potential predictive variables to consider and in which combinations [195].
An important issue to consider when applying ML in clinical practice is the consistency of data from heterogeneous sources. Each health system may collect patient data differently for similar purposes. For this reason, before applying ML, it is necessary to standardise the data. This would reduce overfitting to a single data source and ease the application of the same technique to other data sets. The problem of bias is also important. It arises when the training data have poor coverage, leading to errors when the model is applied to minority groups. In general, in clinical practice it is desirable to have varied and large data sources in order to capture the specific features of each group of patients.
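As a minimal sketch of what such a preparation step might look like, the example below first converts glucose readings recorded in different units to a common unit and then rescales them to z-scores. The helper names `to_mgdl` and `zscores`, the choice of glucose, and the four readings are hypothetical and serve only as an illustration; the conversion factor of about 18.016 mg/dL per mmol/L follows from the molar mass of glucose.

```python
from statistics import mean, pstdev

# Glucose: 1 mmol/L corresponds to about 18.016 mg/dL.
MGDL_PER_MMOLL = 18.016


def to_mgdl(value, unit):
    """Convert a glucose reading to mg/dL (hypothetical helper)."""
    if unit == "mg/dL":
        return value
    if unit == "mmol/L":
        return value * MGDL_PER_MMOLL
    raise ValueError(f"unknown unit: {unit}")


def zscores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical
    return [(v - mu) / sigma for v in values]


# Hypothetical readings from two sources that record different units.
records = [(90.0, "mg/dL"), (5.5, "mmol/L"), (126.0, "mg/dL"), (7.8, "mmol/L")]

harmonised = [to_mgdl(value, unit) for value, unit in records]
standardised = zscores(harmonised)

print(harmonised)
print(standardised)
```

Unit harmonisation has to come before rescaling: computing z-scores on the raw mixed-unit values would treat 5.5 mmol/L as an extremely low reading instead of a normal one.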
Finally, the comprehensibility of the algorithms is another key element. A compromise between performance and interpretability has to be established. Models with better performance (e.g., DL) are often difficult to explain, while models such as linear regression or decision trees are more understandable.
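The following sketch suggests why a linear model counts as understandable: once fitted, it is fully described by two numbers that can be read directly. The function `fit_line` and the toy data (a marker value against age) are hypothetical and chosen only for illustration; the fit is ordinary least squares on a single predictor.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data: a clinical marker measured at different ages.
ages = [40.0, 50.0, 60.0, 70.0, 80.0]
marker = [81.0, 99.0, 121.0, 139.0, 160.0]

slope, intercept = fit_line(ages, marker)

# The whole model is these two numbers: the slope is the expected change
# in the marker per additional year of age.
print(f"marker = {intercept:.2f} + {slope:.2f} * age")
```

A clinician can check such a coefficient against domain knowledge, which is not possible in the same direct way for the weights of a deep network.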

Conclusions
Intelligent data analysis has emerged as a requirement of society for finding effective and robust disease detection methods, so that patients receive appropriate care within the shortest possible time. In recent decades, this detection has been performed through the process of discovering interesting patterns in databases. This process of knowledge discovery in databases is called Data Mining. However, discovering these patterns is not an easy task. Hence, many techniques were developed within Artificial Intelligence, where Machine Learning appears as a method for providing tools for intelligent data analysis.
On the other hand, medical datasets are often high-dimensional. In these cases, Machine Learning techniques become unsuccessful and Big Data technology is necessary. Thus, Deep Learning arose as a specific kind of Machine Learning that allows us to deal with this type of database.
In this paper, a systematic review of the intelligent data analysis tools in the medical field is provided. We also provide examples of some algorithms used in these areas of the medical field, analysing some possible trends focused on the goal searched, the technique used, and the application area. Additionally, we detail the advantages and disadvantages of each technique described, to help establish which technique is most suitable for each real-life situation addressed by other authors. Finally, Figure 4 shows the relationships between all techniques as well as all supervised and unsupervised learning algorithms detailed in this paper.
A systematic review such as the one that we have just presented may become outdated in a short time, given the speed with which new works appear in this emerging area. For this reason, we consider that Table 2 (and therefore Figure 3) is the element that should mainly be updated, after a careful search for new scientific literature. In the short term, new studies are more likely to apply existing techniques in this area than to propose techniques that are genuinely novel rather than a mere improvement or modification of existing ones.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: