Donoho articulated well the idea that, given the complexity of data, effective and efficient data analysis can no longer ignore a robust knowledge of mathematical approaches (both old and new).
In the remainder of this section, some mathematical tools and approaches are briefly introduced as strategies for solving problems that are highly relevant to data-based clinical research: the imputation of missing data, the reduction of data dimensionality, and the building of predictive models from data.
2.3.1. Handling Incomplete Data
As mentioned above, although the problem of missing data is very widespread in medical research, it also affects many other contexts: see, for example, the long list available in [25], which, in addition to many problems related to medicine and health, highlights applications in management sciences, politics, psychology and sociology.
For all these applications, a multiple imputation approach to the missing data problem is proposed, since it has proven effective in multiple contexts [17]. In particular, some form of the so-called “chained equation approach” is proposed.
Here, we describe and use a version of that approach known as “Multivariate Imputation by Chained Equations (MICE)” [20,25,26]. Given the variables used in the imputation process, MICE operates on the assumption that missing data are “Missing At Random (MAR)”, meaning that the probability that a value is missing depends only on observed values and not on unobserved values [26]. Although using MICE when data are not MAR may lead to biased results, some research indicates that, even in these circumstances, MICE produces less biased estimates than naive approaches to managing the same missing values [17].
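Definition 1 (The Multivariate Imputation by Chained Equations (MICE) method). Let $Y$ be a multivariate variable based on $p$ univariate variables, among which $k$ are incomplete. The variable $Y$ is represented by $n$ observations. Then,

$$Y = (Y_1, \dots, Y_k, Y_{k+1}, \dots, Y_p),$$

where $\{Y_1, \dots, Y_k\}$ denotes the set of the incomplete variables and $\{Y_{k+1}, \dots, Y_p\}$ denotes the set of complete variables. Let the observed and missing parts of $Y_j$ be denoted by $Y_j^{obs}$ and $Y_j^{mis}$, respectively, so $Y^{obs} = (Y_1^{obs}, \dots, Y_p^{obs})$ and $Y^{mis} = (Y_1^{mis}, \dots, Y_k^{mis})$ stand for the observed and missing data in $Y$. Let

$$Y_{-j} = (Y_1, \dots, Y_{j-1}, Y_{j+1}, \dots, Y_p)$$

denote the multivariate variable depending on all the variables in $Y$ except $Y_j$. Suppose that $Y$ is a partially observed random sample from the $p$-variate multivariate distribution $P(Y \mid \theta)$. We assume that the multivariate distribution of $Y$ is completely specified by $\theta$, a vector of unknown parameters.

The MICE method aims to obtain the multivariate distribution of $\theta$, either explicitly or implicitly. Given the parameters $\theta_1, \dots, \theta_k$, which are specific to the conditional distributions of the respective variables $Y_1, \dots, Y_k$, the MICE algorithm (see Algorithm A1 in Appendix A.1) implementing such method obtains the posterior distribution of $\theta$ by sampling iteratively from conditional distributions of the form

$$P(Y_1 \mid Y_{-1}, \theta_1), \; \dots, \; P(Y_k \mid Y_{-k}, \theta_k).$$

Given a model $M_j$, defined by the parameter $\theta_j$, to impute $Y_j$ from observed (and already imputed) marginal distributions, the iterative procedure can be considered a Gibbs sampler that is used $m$ times to calculate a number $m$ of independent samples $\tilde{Y}_1, \dots, \tilde{Y}_m$ of the complete variable $Y$, where each of the $\tilde{Y}_i$ inherits the observed part of $Y$ and fills the missing part of $Y$ with “imputed” observations. That is, if $Y_j^{(t)} = (Y_j^{obs}, Y_j^{*(t)})$ is the $j$-th imputed variable at iteration $t$,

$$\theta_j^{*(t)} \sim P\left(\theta_j \mid Y_j^{obs}, Y_1^{(t)}, \dots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \dots, Y_p^{(t-1)}\right),$$

$$Y_j^{*(t)} \sim P\left(Y_j \mid Y_j^{obs}, Y_1^{(t)}, \dots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \dots, Y_p^{(t-1)}, \theta_j^{*(t)}\right).$$

All the $m$ independent samples can be collected in the set $X = \{\tilde{Y}_1, \dots, \tilde{Y}_m\}$ for further analysis. Regarding the selection of the number of imputations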
$m$, the desired “relative efficiency” $RE$ of MI estimates [27] should be considered. $RE$ can be evaluated by using the following formula [17]:

$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1},$$

where $\lambda$ is the rate of missing data. See Table A1 in Appendix A.1 for a listing of some values of $RE$.
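As a small illustration of how the number of imputations might be chosen, the following sketch tabulates the relative efficiency, assuming the commonly cited form $RE = (1 + \lambda/m)^{-1}$. The function name `relative_efficiency` and the sample values of $\lambda$ and $m$ are hypothetical choices made here for illustration; they are not taken from Table A1.

```python
def relative_efficiency(missing_rate: float, m: int) -> float:
    """Relative efficiency RE = 1 / (1 + lambda / m) of m imputations."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    return 1.0 / (1.0 + missing_rate / m)


# Illustrative (hypothetical) rates of missing data and numbers of imputations.
for lam in (0.1, 0.3, 0.5):
    row = [round(relative_efficiency(lam, m), 4) for m in (3, 5, 10, 20)]
    print(f"lambda={lam}: {row}")
```

Under this formula the gain from additional imputations flattens quickly, which is why a small $m$ is often considered sufficient when the rate of missing data is moderate.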
To overcome problems related to skewed data or data not described by linear models, the Predictive Mean Matching (PMM) method [28] is proposed as a flexible model for imputation. As described in [29], PMM is based on the following steps:
Model Building: A predictive model is built using complete cases to estimate the relationship between the target variable $Y_j$ (to be imputed) and a set of predictor variables represented by $X$.
Prediction Generation: This model is employed to predict values for both complete and incomplete cases. Let $\hat{y}_i$ be the predicted value for the $i$-th observation of $Y_j$.
Matching Process: For each incomplete case, PMM identifies one or several complete cases whose predicted values are closest to the predicted value of the incomplete case. For each missing instance $i$, calculate

$$d^{*} = \arg\min_{d \in D} \left| \hat{y}_i - \hat{y}_d \right|,$$

where $D$ is the list of indexes of all complete observations.
Imputation by Matching: Replace the missing value $y_i$ with the observed value $y_{d^{*}}$ from the best-matching candidate. Optionally, multiple matches can be pooled to inject further randomness and variability into the imputation process.
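The four PMM steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the predictive model is an ordinary least-squares regression (full implementations typically also draw the regression parameters at random), and the names `pmm_impute` and `n_donors` are hypothetical.

```python
import numpy as np


def pmm_impute(X, y, n_donors=5, rng=None):
    """Impute NaNs in y by predictive mean matching on predictors X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing
    # Step 1: fit a linear model (with intercept) on the complete cases.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    # Step 2: predict for every case, complete or incomplete.
    y_hat = design @ beta
    y_imputed = y.copy()
    donors = np.flatnonzero(observed)
    for i in np.flatnonzero(missing):
        # Step 3: distance between the prediction for case i and each donor.
        dist = np.abs(y_hat[donors] - y_hat[i])
        k = min(n_donors, donors.size)
        nearest = donors[np.argsort(dist)[:k]]
        # Step 4: copy the observed value of one randomly chosen close donor.
        y_imputed[i] = y[rng.choice(nearest)]
    return y_imputed
```

Because the imputed value is always copied from an observed case, the imputations stay within the range of the observed data, which is the property that makes PMM attractive for skewed variables.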
2.3.2. Interpreting Multidimensional Data by Principal Components Analysis
In several fields, large datasets are becoming more and more common. To be interpreted, such datasets must be considerably reduced in dimensionality, in an interpretable manner, while preserving most of the information they contain. Although many methods have been developed for this purpose, one of the oldest and most widely used is “Principal Component Analysis (PCA)”. Its concept is straightforward: reduce a dataset’s dimensionality while preserving as much “variability” (i.e., statistical information) as possible [30].
PCA’s goal is to extract the important information from a set of observed data, described by several inter-correlated variables, to represent it as a set of new orthogonal variables called “principal components”, and to display the pattern of similarity of the observations and of the variables as points in appropriate maps called BiPlots [31]. The main uses of PCA are descriptive, rather than inferential [30]. “PCA allows us to simultaneously describe the association between variables, as well as the resemblance among individuals. PCA can also be regarded to as a dimension reduction technique of quantitative variables, often employed as an intermediate step towards a subsequent model building phase [32]”. Mathematically, PCA depends upon the solution to an eigenproblem or, alternatively, upon the singular value decomposition (SVD) of the (centered) data matrix. A complete description of the PCA approach can be found in [30,33]. For an informal but more descriptive introduction to PCA, we also suggest reading the paper by Aluja et al. [32]. Here, we propose some definitions and descriptions that are useful for our discussion of the proposed results.
Definition 2 (Principal Components Analysis (PCA) of a dataset Correlation Matrix). Let a dataset consist of observations on $p$ numerical variables for each of $n$ entities or individuals. These data values define an $n \times p$ data matrix $X$, whose $j$-th column is the vector $x_j$ of observations on the $j$-th variable. Let $Z$ be the normalized data matrix of $X$; then, its generic element $z_{ij}$ is given by

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ is the mean value of the $n$ observations of variable $j$, where $i = 1, \dots, n$ and $j = 1, \dots, p$, and where $s_j$ is the standard deviation of the elements $x_{ij}$, i.e., $s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}$.

Let $S = \frac{1}{n-1} Z^{T} Z$ be the sample covariance matrix related to $Z$; $S$ is a positive semi-definite matrix. Then, it has an eigen-decomposition of the form

$$S = V \Lambda V^{T}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p), \quad \lambda_1 \ge \dots \ge \lambda_p \ge 0,$$

where the $p$ column vectors $v_j$ of matrix $V$ are the $p$ linearly independent eigenvectors of $S$ defining an orthonormal set of vectors, i.e., $v_i^{T} v_j = \delta_{ij}$. Let us define the set of vectors $c_j$ as

$$c_j = Z v_j, \qquad j = 1, \dots, p. \tag{3}$$

Since the matrix $S$ coincides with the correlation matrix of $X$, the vectors defined in (3) are called Principal Components (PCs) of the Correlation Matrix of the dataset, and the PC approach based on those vectors is called a “Normalized PCA”. It can be shown [30,31] that $V_r = [v_1, \dots, v_r]$ is a solution of the following optimization problem:

$$\min_{P \in \mathbb{R}^{p \times r},\; P^{T} P = I_r} \left\| Z - Z P P^{T} \right\|_F^2. \tag{4}$$

Equation (4) expresses the essence of Principal Component Analysis, namely the identification of a “projection matrix with orthonormal columns” that transforms the “cloud of points” representing the observations $z_i$ in a $p$-dimensional space in such a way that the new configuration, represented in an $r$-dimensional space ($r < p$), is as close as possible to the original configuration (i.e., the distances among all different points are preserved as much as possible; see Figure 1). Let us call the new space a “Factorial Space”, or a “Factorial Plane” if $r = 2$.
When the variables in a dataset have different units of measurement, it is common practice to begin by standardizing them as in Definition 2. Correlation matrix PCs are invariant to linear changes in the units of measurement and are therefore the appropriate choice for datasets in which the scale of each variable may change. For these reasons, some statistical software assumes by default that a PCA means a normalized PCA [30].
In standard PCA terminology, the elements of the eigenvectors $v_j$ are commonly called “PC loadings”, while the elements of the vectors $c_j$ defined in (3) are called “PC scores”, as they are the values that each individual would score on a given PC [30].
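A normalized PCA can be sketched directly from its definition: standardize the columns, form the correlation matrix, and take its eigen-decomposition. The sketch below is illustrative only; the function name `normalized_pca` is hypothetical, and the sample standard deviation (divisor $n-1$) is assumed for the standardization.

```python
import numpy as np


def normalized_pca(X):
    """PCA of the correlation matrix: returns eigenvalues, loadings, scores."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize each column (sample standard deviation, divisor n - 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The covariance matrix of Z is the correlation matrix of X.
    S = Z.T @ Z / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Z @ loadings                     # one column of scores per PC
    return eigvals, loadings, scores
```

The columns of `loadings` are the orthonormal eigenvectors, and each column of `scores` has a sample variance equal to the corresponding eigenvalue.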
It is noteworthy that PCA is related to a Singular Value Decomposition (SVD) of the matrix $Z$ (see Proposition A1 in Appendix A.2).
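The diagonal elements of the matrix $\Lambda$ of the eigenvalues $\lambda_1, \dots, \lambda_p$ can be used to evaluate the variance of the projection $Z v_j$ of the column vectors $z_j$ of $Z$ (i.e., the variables) on the computed components $v_j$. In particular,

$$\mathrm{Var}(Z v_j) = \lambda_j, \qquad j = 1, \dots, p,$$

where, when a normalized PCA is considered,

$$\sum_{j=1}^{p} \lambda_j = \mathrm{tr}(S) = p.$$

It is also possible to define the relevance $\rho_j$ of the component $j$, $j = 1, \dots, p$, in representing the total variance $\sum_{k=1}^{p} \lambda_k$, as well as the cumulative relevance $\rho_j^{c}$ of the first $j$ components, as

$$\rho_j = \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k}, \qquad \rho_j^{c} = \sum_{k=1}^{j} \rho_k.$$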
Thanks to the reformulation (A2) of PCA, one could assert that PCA “… is at heart a dimensionality-reduction method, whereby a set of p original variables can be replaced by an optimal set of q derived variables, the PCs.” [30], where $q < p$. In fact, consider the “Reduced-Rank” approximation $Z_q$ of matrix $Z$ (where $q$ is smaller than the rank of $Z$) (see Proposition A2 in Appendix A.2).
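To make the dimensionality-reduction reading concrete, the sketch below computes the share of total variance carried by each component and a reduced-rank approximation obtained by truncating the SVD. The helper names `relevance` and `rank_q_approximation` are hypothetical, and the truncated SVD is used here as the standard way to build a best rank-$q$ approximation in the Frobenius norm; it is not a transcription of Proposition A2.

```python
import numpy as np


def relevance(eigvals):
    """Relevance of each component and cumulative relevance."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    rel = eigvals / eigvals.sum()
    return rel, np.cumsum(rel)


def rank_q_approximation(Z, q):
    """Rank-q approximation of Z obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(Z, dtype=float), full_matrices=False)
    return (U[:, :q] * s[:q]) @ Vt[:q]
```

A common rule of thumb is to keep the smallest $q$ whose cumulative relevance exceeds a chosen threshold; the threshold itself is a modeling choice.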
The BiPlot is a helpful tool for data analysis that makes it possible to visually evaluate the structure of data matrices. It is particularly useful in Principal Component Analysis, where the BiPlot may exhibit variances and correlations of the variables, as well as inter-unit distances and unit clustering [34]. This graphing method also exploits the opportunity offered by PCA to approximate the data matrix by a matrix product of dimension 2 [32]. In a BiPlot, the individuals $i = 1, \dots, n$ and the variables $j = 1, \dots, p$ of a normalized data matrix $Z$, defined as in Definition 2, are graphically represented, respectively, as points and as vectors (i.e., arrows) in a bidimensional Cartesian system (see Proposition A3 in Appendix A.2 for its description). The BiPlot is based on an approximation $\tilde{Z}$ of $Z$ defined by the product

$$\tilde{Z} = G H^{T},$$

where $G \in \mathbb{R}^{n \times 2}$ and $H \in \mathbb{R}^{p \times 2}$, and where the rows of $G$ and of $H$ represent, respectively, the individuals and the variables.
As suggested in [30], and thanks to the definition of the bidimensional Cartesian system on which the BiPlot is based (see Proposition A3 in Appendix A.2), the BiPlot has the following properties:
The cosine of the angle between any two vectors representing variables is the coefficient of correlation between those variables.
Similarly, the cosine of the angle between any vector representing a variable and the axis representing a given PC is the coefficient of correlation between that variable and that PC.
The inner product between the markers for individual $i$ and variable $j$ gives the value of individual $i$ on variable $j$. The practical implication of this result is that orthogonally projecting the point representing individual $i$ onto the vector representing variable $j$ recovers the value $z_{ij}$.
The Euclidean distance between the markers for individuals $i$ and $i'$ is proportional to the “Mahalanobis distance” [35] between them (see [33] for more details).
Roughly speaking, a BiPlot can be read as follows:
Interpreting Points: The relative location of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.
Interpreting Vectors: Both the direction and the length of the vectors can be interpreted. Vectors point away from the origin in some direction. A vector points in the direction that is most highly correlated with the variable, given the principal components under consideration. The length of the vector is determined by the squared multiple correlation between the projected variable and the components under consideration. As a result, variables with similar response profiles, and hence similar meanings within the context of the data, are represented by vectors pointing in the same direction. The observations with the largest values of a variable are those whose points project farthest in the direction in which its vector points; those with the smallest values project at the opposite end, and those projecting in the middle have average values.
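The marker coordinates behind such a plot can be sketched from the SVD of the normalized data matrix. The scaling used below (individuals scaled by $\sqrt{n-1}$, variables by the singular values divided by $\sqrt{n-1}$) is one common convention and is an assumption made here for illustration; the function name `biplot_markers` is hypothetical, and the construction in Proposition A3 may differ.

```python
import numpy as np


def biplot_markers(Z):
    """Rank-2 markers G (individuals) and H (variables) with Z ~ G @ H.T."""
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    G = np.sqrt(n - 1) * U[:, :2]               # points for the individuals
    H = (Vt[:2].T * s[:2]) / np.sqrt(n - 1)     # arrows for the variables
    return G, H
```

With this scaling, `H @ H.T` approximates the correlation matrix of the variables, so the inner products (and hence the angles) between arrows reflect correlations, while the rows of `G` have unit sample variance on each axis.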
2.3.3. Building Predictive Model from Data by Neural Networks
Neural networks, a cornerstone of artificial intelligence (AI) and machine learning, are computational models, built from data, that are inspired by the structure and function of the human brain [36]. The concept of neural networks dates back to the mid-20th century, but it was not until the advent of powerful computers and the availability of large datasets in the 21st century that neural networks truly flourished.
A neural network consists of layers of neurons (nodes), where each neuron is a function that takes inputs, processes them using weights, biases, and an activation function, and produces an output. A deep neural network (DNN) can be considered the result of stacking more than one layer of neurons one after another.
Figure 2 shows an example of a DNN composed of 3 layers. The numbers $N_1$, $N_2$ and $N_3$ of “neurons” at each of the three layers are, respectively, 3, 4 and 2. If the number $L$ of layers of the DNN is equal to three, the DNN is called a “shallow neural network”.
We recall that the term “Data fitting” denotes the process of constructing a mathematical function $g$ (the model) that has “the best fit” to a series of data points $(x_i, y_i)$, $i = 1, \dots, m$. Curve fitting can involve either interpolation, where an exact fit to the data is required (i.e., $g(x_i) = y_i$), or smoothing, in which a “smooth” function $g$ is constructed that approximately fits the data (i.e., $\|g(x_i) - y_i\| \le \varepsilon$) for some small value of $\varepsilon$ and some norm $\|\cdot\|$ defined on the space containing the values $y_i$. The most widely used approach to building and using a data-driven model, especially in the field of machine learning based on neural networks, is a “Data fitting” smoothing process, where a function $f(\cdot\,; w)$, defined through a set of $k$ parameters $w_1, \dots, w_k$, is determined by a “learning” process on known information and is subsequently used to “predict/describe” new information. In detail, this looks as follows:
Learning phase: Given a set of $m$ data points $(x_i, y_i)$, $i = 1, \dots, m$, let $f(\cdot\,; w)$ be a function defined by a set of $k$ parameters $w_1, \dots, w_k$, organized in the vector $w = (w_1, \dots, w_k)$. The aim of the learning process, given a loss (or “cost function”) $C$, is to compute the following minimum:

$$w^{*} = \arg\min_{w \in \mathbb{R}^{k}} C(w). \tag{8}$$

Predict phase: After learning from known information, the best fit function $f(\cdot\,; w^{*})$, where $w^{*}$ is the solution of problem (8), can be used to “predict/describe” unknown information $y = f(x; w^{*})$ about new data $x$.
Definition 3 defines the form of the fitting function related to a DNN composed of $L$ layers.
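Definition 3 (Fitting function of a DNN). Let $\mathcal{N}$ be a DNN composed of $L$ layers and let $N_l$ be the number of “neurons” in the $l$-th layer of $\mathcal{N}$. The fitting function $f$ of $\mathcal{N}$ has the form of the following composition of functions:

$$f = f_L \circ f_{L-1} \circ \dots \circ f_2.$$

Here, given two functions $f: A \to B$ and $g: B \to C$, the composition of $f$ and $g$, denoted by $g \circ f$, is defined as the function $g \circ f: A \to C$ given by $(g \circ f)(x) = g(f(x))$. Given a set of $M$ functions $f_1, \dots, f_M$ such that the codomain of $f_i$ is contained in the domain of $f_{i+1}$, the symbol $\bigcirc_{i=1}^{M} f_i$ denotes the composition $f_M \circ \dots \circ f_1$.

Each function $f_l: \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$ is defined as the composed function

$$f_l(x) = \sigma_l\left(W^{l} x + b^{l}\right),$$

where $\sigma_l$ is the so-called “activation function” (applied componentwise), where $b^{l} \in \mathbb{R}^{N_l}$ is the vector of biases, and where $W^{l} = (w^{l}_{jk}) \in \mathbb{R}^{N_l \times N_{l-1}}$, with $w^{l}_{jk}$ being the so-called “weight” from the $k$-th “neuron” in the $(l-1)$-th layer to the $j$-th “neuron” in the $l$-th layer [37]. The value $N = \sum_{l=2}^{L-1} N_l$ of the total number of nodes in the hidden layers ($l = 2, \dots, L-1$) is called the “Complexity of DNN $\mathcal{N}$”.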
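The layer-by-layer composition of a fitting function can be illustrated with a short forward pass. The sketch assumes that the first layer is the input layer (so a network with layer sizes 3, 4 and 2 has a single hidden layer of 4 neurons), uses the logistic sigmoid as activation in every layer, and draws weights at random; all names (`sigmoid`, `dnn_forward`, `sizes`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, applied componentwise."""
    return 1.0 / (1.0 + np.exp(-z))


def dnn_forward(x, weights, biases, activation=sigmoid):
    """Evaluate the composition of layer maps a -> activation(W @ a + b)."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = activation(W @ a + b)
    return a


# Layer sizes as in a 3-4-2 network: input, one hidden layer, output.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

y = dnn_forward([0.5, -1.0, 2.0], weights, biases)
complexity = sum(sizes[1:-1])   # total number of hidden nodes
```

Each weight matrix has one row per neuron of the receiving layer and one column per neuron of the preceding layer, so the loop is exactly a chain of function compositions.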
The success of such models is also due to their approximation capabilities, for which neural networks (NNs) are known as “Universal Approximators” [38,39,40,41,42], in the sense that they can approximate arbitrarily well any continuous function of $n$ variables on a compact domain [42]. Theorem A1 (see Appendix A.3) expresses the universal approximation concept in a more formal way [42,43].
If just the “shallow neural networks” are considered, some results about the computational complexity of computing an $\varepsilon$-approximation of a function $g$ by a fitting function defined on such a network are available from [42] and are stated by Theorem A2 in Appendix A.3.
Theorem A2 (see Appendix A.3) states that “shallow neural networks” are able to give any desirable approximation $\varepsilon$ to any function $g$ at a computational cost that grows exponentially with $n$ unless the smoothness of the approximant is increased.
In [42], Poggio et al. answer questions such as “Which classes of functions can it approximate and learn well?”, giving the message that
… deep networks have the theoretical guarantee, which shallow networks do not have, that they can avoid the “curse of dimensionality” for an important class of problems, corresponding to … hierarchically local compositional functions where all the constituent functions are local in the sense of bounded small dimensionality.
Since an optimization problem such as (8) must be solved in the context of machine learning based on DNNs to find a good approximation $f$ of a function $g$ that “models data”, particular attention should be paid to the conditions that guarantee the existence of a solution of the optimization problem and to the effectiveness of the algorithms used to compute such a solution numerically [44,45].
In the context of machine learning [46,47], the most commonly used algorithm to compute the solution of problem (8) is the gradient descent algorithm [48], based on the Steepest Descent method [49]. Gradient descent (see Algorithm A2 in Appendix A.3) is an iterative algorithm for unconstrained mathematical optimization, used for finding a local minimum of a differentiable multivariate function. It is based on the idea of taking repeated steps in the opposite direction of the gradient $\nabla C$ (or approximate gradient) of the function $C$ at the current point, as this is the direction of steepest descent.
The gradient $\nabla C$ is a vector in $\mathbb{R}^{k}$ defined as

$$\nabla C(w) = \left( \frac{\partial C}{\partial w_1}(w), \dots, \frac{\partial C}{\partial w_k}(w) \right)^{T}.$$

It is possible to guarantee convergence to a local minimum under certain assumptions on the function $C$ (for example, $C$ is a convex function and $\nabla C$ is a Lipschitz function) and for particular choices of the step size.
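As a minimal illustration of the iteration (not a transcription of Algorithm A2), the sketch below applies gradient descent with a fixed step size to a convex quadratic cost, for which the gradient is Lipschitz and convergence is guaranteed when the step is small enough. The function name `gradient_descent`, the matrix `A`, the vector `b`, and the step size are all hypothetical choices.

```python
import numpy as np


def gradient_descent(grad, w0, step=0.1, tol=1e-8, max_iter=10_000):
    """Repeat w <- w - step * grad(w) until the gradient norm is below tol."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        w = w - step * g
    return w


# Convex quadratic cost C(w) = 0.5 * w^T A w - b^T w, with gradient A w - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = gradient_descent(lambda w: A @ w - b, w0=[0.0, 0.0], step=0.2)
```

Here the minimizer solves the linear system $A w = b$, so the result can be checked against `np.linalg.solve(A, b)`. For this cost, a fixed step must stay below $2$ divided by the largest eigenvalue of `A` for the iteration to converge.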
The “Learning phase” in the context of a DNN then aims to compute “the best values” for the “weights” $w$, given a “cost function” $C$, by using Algorithm A2 and taking into consideration the form of the “fitting function” $f$ defined in Definition 3. Due to the nature of the cost function $C$ and of the fitting function $f$, the gradient estimation needed at line 6 of Algorithm A2 can be computed by the “backpropagation method” [50].
Some studies aimed at “illuminating the NN black box” contribute to the literature on this matter [51,52]. Among them, we mention those that use the set of weights $w$ to interpret predictor variable contributions in neural networks, such as the Garson and Olden methods, where the second is an evolution of the first [52,53]. The Olden method is an alternative and more adaptable way of assessing variable relevance. Relevance is determined by multiplying the raw input-hidden and hidden-output connection weights for each input and output neuron and then summing the products across all hidden neurons. Unlike Garson’s technique, which only takes into account the absolute magnitude, this method preserves the relative contribution of every connection weight in terms of both magnitude and sign. For instance, connection weights that change sign (e.g., from positive to negative) between the input-hidden and hidden-output layers could have a canceling effect, which Garson’s approach, being based on absolute magnitudes, may misrepresent. The ability of Olden’s algorithm to evaluate neural networks with multiple hidden layers and response variables is an additional advantage. In the case of a shallow neural network, the Olden method calculates the relevance coefficient $R_{iq}$ of the $i$-th input factor to the $q$-th output as the sum of the products of the connection weights among the input layer neurons, hidden layer neurons and output layer neurons of the neural network, via the following formula:

$$R_{iq} = \sum_{h=1}^{N} w_{ih}\, w_{hq},$$

where $w_{ih}$ is the weight of the connection between the $i$-th input neuron and the $h$-th hidden neuron, $w_{hq}$ is the weight of the connection between the $h$-th hidden neuron and the $q$-th output neuron, and $N$ is the number of hidden neurons.
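For a network with a single hidden layer, the connection-weight computation described above reduces to a matrix product. The sketch below assumes that the input-hidden weights are stored with one row per input and one column per hidden neuron, and the hidden-output weights with one row per hidden neuron and one column per output; the function name `olden_relevance` and the toy weights are hypothetical.

```python
import numpy as np


def olden_relevance(w_input_hidden, w_hidden_output):
    """Sum over hidden neurons of (input-hidden weight) x (hidden-output weight)."""
    w_ih = np.asarray(w_input_hidden, dtype=float)    # (n_inputs, n_hidden)
    w_ho = np.asarray(w_hidden_output, dtype=float)   # (n_hidden, n_outputs)
    if w_ih.shape[1] != w_ho.shape[0]:
        raise ValueError("hidden dimensions do not match")
    return w_ih @ w_ho                                # (n_inputs, n_outputs)


# Toy weights: the first input reaches the output through two paths that cancel.
w_ih = np.array([[1.0, -1.0],
                 [2.0, 0.5]])
w_ho = np.array([[1.0],
                 [1.0]])
R = olden_relevance(w_ih, w_ho)
```

In this toy case the first input obtains a relevance of zero because its two paths have opposite signs, which is the canceling effect that a purely magnitude-based measure would not show.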
Agreeing with Brownlee [54], who stated
… The objective of a neural network is to have a final model that performs well both on the data that we used to train it (e.g., the training dataset) and the new data on which the model will be used to make predictions …,
the main aim in building such models is to solve the challenging problem of defining models that can “generalize well” to new data [54]. Two cases indicate that a model fails in “generalization”: “Overfit” and “Underfit” models.
Underfit Model: A model that performs poorly on the training dataset and does not predict future observations reliably.
Overfit Model: A model that corresponds too closely or exactly to a particular set of data and may therefore fail to fit additional data or predict future observations reliably [55].
Underfitting is addressed by increasing the “capacity of the model”. Such capacity is easily increased by changing the structure of the model, for example by adding more layers and/or more nodes to layers [54]. While an underfitting problem can be easily solved, this is not true for overfitting. Nonetheless, two ways can be considered to address an overfit NN-based model: training the network on more examples and changing the capacity of the network. As summarized by Brownlee [54], when the amount of data available for training is limited, only the option of changing the network’s capacity remains, in one of the following two ways:
By changing the network structure (number of weights). This is called “structural stabilization” [54].
By changing the network parameters (values of weights) through the use of “regularization techniques”, which add a penalty term to the “Cost function” with the aim of constraining the values of the weights [54].
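The effect of a penalty term can be illustrated on a toy cost. The sketch below adds an L2 penalty (often called weight decay) to a cost and to its gradient; the choice of the L2 penalty, the function name `l2_penalized`, the coefficient `lam`, and the toy quadratic cost are assumptions made here for illustration, since several other penalties are in use.

```python
import numpy as np


def l2_penalized(cost_fn, grad_fn, lam):
    """Wrap a cost and its gradient with the penalty lam * ||w||^2."""
    def cost(w):
        w = np.asarray(w, dtype=float)
        return cost_fn(w) + lam * np.dot(w, w)

    def grad(w):
        w = np.asarray(w, dtype=float)
        return grad_fn(w) + 2.0 * lam * w

    return cost, grad


# Toy cost 0.5 * ||w - target||^2, whose unpenalized minimizer is `target`.
target = np.array([3.0, -2.0])
cost, grad = l2_penalized(
    lambda w: 0.5 * np.sum((w - target) ** 2),
    lambda w: w - target,
    lam=0.5,
)
# Setting the penalized gradient to zero gives w = target / (1 + 2 * lam).
w_reg = target / (1.0 + 2.0 * 0.5)
```

The penalized minimizer is pulled toward zero relative to the unpenalized one, which is the sense in which regularization constrains the values of the weights.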
The term “regularization” is borrowed from the context of numerical analysis, where “regularization approaches” are used to transform an ill-posed problem into a more “stable” one. A problem is said to be ill-posed if small changes in the given information cause large changes in the solution. This instability with respect to the data makes solutions unreliable, because small measurement errors or uncertainties in parameters may be greatly magnified and lead to wildly different responses. For an introductory description of the main regularization techniques in the context of NN-based models, we suggest reading Brownlee [54].
In summary, despite all the limitations related to the stability, convergence and interpretability of the algorithms on which NNs are based, they can be considered very powerful tools for modeling (and approximating well) general phenomena, thanks to their ability to express non-linear models and their nature as “Universal Approximators”. In this study, we therefore evaluate the applicability of a shallow neural network of complexity $N$ to build models for “supervised classification” (see Section 3.5) for the following reasons:
From Theorem A2, shallow networks can be considered able to give any desirable approximation of order $\varepsilon$ to any function.
No information about the observed data is available that would allow the model to be reformulated in terms of hierarchically local compositional functions, which could justify a deep NN structure.
The choice of the value of the complexity $N$ is guided by a compromise (see Equation (A13)) between the computational cost, which grows exponentially with the dimension $n$ of the observations’ space, and the smoothness of the approximant, which conditions model generalization.