2.1. Principal Component Analysis
Data dimensionality reduction techniques are divided into feature selection and feature extraction techniques. Feature selection techniques, such as random forest or grid search algorithms, select a subset of the original features in order to reduce the complexity of the model and improve its computational efficiency. Conversely, feature extraction techniques extract information from the original feature set and create a new feature subspace. While feature selection techniques aim to select the most significant features, discarding the less significant ones from the set of original features, feature extraction techniques construct a new, reduced set of features, starting from the existing ones, able to synthesize most of the information contained in the original set of features.
The use of feature selection techniques is preferable when the explainability of the model and the semantic meaning of the features are required; feature extraction techniques are used to reduce the model complexity, improving its predictive performance.
PCA is one of the most widely used feature extraction techniques in data analysis. Its strength is its ability to reduce the dimensionality of the data while preserving their information content.
Let D be the original dataset with s features X1, …, Xs and N instances. The ith instance is characterized by a vector (xi1, xi2,…, xis)T, where xij is the value of the ith instance for the jth feature.
Let mj be the mean value of the jth feature, given by:
m_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}
and stj the standard deviation, given by:
st_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - m_j\right)^2}
The PCA procedure can be broken down into the following eight steps:
- 1.
The aim of this phase is to standardize the range of the initial variables so that each one of them contributes equally to the analysis. For each value xij, its normalized value zij is computed, given by:
z_{ij} = \frac{x_{ij} - m_j}{st_j}
- 2.
The relationships among features are analyzed by computing the symmetric covariance matrix C = ZTZ, where ZT is the transpose of the normalized matrix Z. The components of C are given by
c_{jk} = \sum_{i=1}^{N} z_{ij}\, z_{ik}, \quad j, k = 1, \dots, s
- 3.
In this phase, the s eigenvalues and the s eigenvectors of the covariance matrix are extracted. The eigendecomposition of C is C = VDV−1, where V is the matrix of the eigenvectors and D is a diagonal matrix whose diagonal components are the eigenvalues λi, i = 1,…, s, and whose other elements are equal to 0.
- 4.
The eigenvalues on the diagonal of D are associated with the corresponding columns of V: the first element of D is λ1 and the corresponding eigenvector is the first column of V. This holds for all the elements of D and their corresponding eigenvectors in V, so that C can always be written as VDV−1 in this way.
- 5.
In this phase the s eigenvalues are sorted in descending order; in the same way the corresponding eigenvectors in the matrix V are ordered, obtaining the matrix V′, whose columns correspond to the ordered eigenvectors.
- 6.
The normalized data matrix Z is transformed into the matrix of the principal components Z′ by multiplying Z by the ordered matrix of the eigenvectors V′: Z′ = ZV′.
- 7.
The significant principal components are selected by analyzing the eigenvalues, sorted in descending order. Three heuristic criteria are generally used for the choice of the number of components:
- -
Select only the principal components corresponding to the eigenvalues whose sum, divided by the sum of all the eigenvalues, is greater than or equal to a specific threshold, for example, 80% or 90%.
- -
Adopt the Kaiser criterion, in which only those components are selected which correspond to an eigenvalue greater than or equal to 1, or, equivalently, the components that have variance greater than the average.
- -
Build the graph of the eigenvalues, called the scree plot, and select the number of components corresponding to the elbow point beyond which the graph of the eigenvalues stabilizes.
- 8.
The reduced dataset is constructed considering only the significant principal components.
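As an illustration of these eight steps, the following minimal NumPy sketch standardizes a data matrix, extracts and orders the eigendecomposition of the covariance matrix, and keeps the components whose cumulative explained variance reaches a given threshold (the first heuristic criterion above). The function name, the threshold value, and the synthetic data are our own choices for illustration.

```python
import numpy as np

def pca_reduce(X, threshold=0.90):
    """Reduce the dataset X (N instances x s features) following steps 1-8."""
    # Step 1: standardize each feature (z-scores).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix C = Z^T Z.
    C = Z.T @ Z
    # Steps 3-5: eigendecomposition; sort the eigenvalues in descending order.
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, V_ordered = eigvals[order], eigvecs[:, order]
    # Step 6: project the standardized data onto the ordered eigenvectors.
    Z_prime = Z @ V_ordered
    # Steps 7-8: keep the first components reaching the explained-variance threshold.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, threshold)) + 1
    return Z_prime[:, :k]

# Example on synthetic data: 100 instances, 6 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 6))
print(pca_reduce(X).shape)
```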
2.2. Multidimensional F-Transform
Let f: X ⊆ Rn → Y⊆ R be a continuous n-dimensional function defined in a closed interval X = [a1,b1] × [a2,b2] ×…× [an,bn] ⊆ Rn and known in a discrete set of N points P = {(p11, p12, …, p1n), (p21, p22, …, p2n),…, (pN1, pN2, …,pNn)}.
For each i = 1,…,n let xi1, xi2, …, ximi with mi ≥ 2 be a set of mi points of [ai,bi], called nodes, such that xi1 = ai < xi2 <…< ximi = bi.
For each i = 1,…,n let Ai1, Ai2,…, Aimi: [ai, bi] → [0,1] be a family of fuzzy sets forming a fuzzy partition of [ai,bi], where:
- 1.
Aih(xih) = 1 for every h = 1, 2,…, mi;
- 2.
Aih(x) = 0 if x does not belong to (xih−1, xih+1), where we set xi0 = xi1 = ai and xi,mi+1 = xi,mi = bi for convenience of presentation;
- 3.
Aih(x) strictly increases on [xih−1, xih] for h = 2,…, mi and strictly decreases on [xih, xih+1] for h = 1,…, mi − 1;
- 4.
Ai1(x) + Ai2(x) + … + Aimi(x) = 1 for every x ∊ [ai, bi].
The fuzzy sets Ai1, Ai2,…, Aimi are called basic functions.
Let ci = (bi − ai)/(mi − 1). The basic functions Ai1, Ai2,…, Aimi form a uniform fuzzy partition of [ai,bi] if:
- 5.
mi ≥ 3 and the nodes are equidistant, i.e., xih = ai + ci ∙ (h − 1) for h = 1, 2, …, mi;
- 6.
Aih(xih − x) = Aih(xih + x) ∀ x ∊ [0, ci] and ∀ h = 2,…, mi − 1;
- 7.
Aih+1(x) = Aih(x − ci) ∀ x ∊ [xih, xih+1] and ∀ h = 1, 2,…, mi − 1.
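Triangular basic functions are one common choice satisfying these conditions. The following sketch (the helper name is ours, chosen for illustration) builds a uniform triangular partition and checks numerically that the basic functions sum to one at every point, as required by condition 4.

```python
import numpy as np

def uniform_triangular_partition(a, b, m):
    """Nodes x_1,...,x_m and triangular basic functions A_1,...,A_m forming
    a uniform fuzzy partition of [a, b] (m >= 3 assumed)."""
    c = (b - a) / (m - 1)                     # distance between consecutive nodes
    nodes = a + c * np.arange(m)              # x_h = a + c * (h - 1)
    def A(h, x):
        # 1 at x_h, decreasing linearly to 0 at x_{h-1} and x_{h+1}, 0 elsewhere.
        return np.clip(1.0 - np.abs(np.asarray(x) - nodes[h]) / c, 0.0, 1.0)
    return nodes, A

nodes, A = uniform_triangular_partition(1.1, 4.9, 3)
print(nodes)                                  # [1.1, 3.0, 4.9], so c = 1.9
xs = np.linspace(1.1, 4.9, 7)
print(sum(A(h, xs) for h in range(3)))        # condition 4: all ones
```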
We say that the set P = {(p11, p12, …, p1n), (p21, p22, …, p2n),…, (pN1, pN2, …, pNn)} is sufficiently dense w.r.t. the set of fuzzy partitions {A11,…, A1m1},…, {An1,…, Anmn} if for each combination (h1, h2,…, hn), with hi ∊ {1,…, mi}, there exists at least a point pj = (pj1, pj2,…, pjn) ∊ P such that A1h1(pj1)·A2h2(pj2)·…·Anhn(pjn) > 0. In this case, we can define the direct multidimensional F-transform of f, with the (h1, h2,…, hn)th component given by
F_{h_1 h_2 \dots h_n} = \frac{\sum_{j=1}^{N} f(p_j)\, A_{1h_1}(p_{j1})\, A_{2h_2}(p_{j2}) \cdots A_{nh_n}(p_{jn})}{\sum_{j=1}^{N} A_{1h_1}(p_{j1})\, A_{2h_2}(p_{j2}) \cdots A_{nh_n}(p_{jn})}   (5)
The multidimensional inverse F-transform, calculated in the point pj, is given by:
f^{F}_{m_1 m_2 \dots m_n}(p_j) = \sum_{h_1=1}^{m_1} \sum_{h_2=1}^{m_2} \cdots \sum_{h_n=1}^{m_n} F_{h_1 h_2 \dots h_n}\, A_{1h_1}(p_{j1})\, A_{2h_2}(p_{j2}) \cdots A_{nh_n}(p_{jn})   (6)
It approximates the function f in the point pj. In [11,12] the multidimensional inverse F-transform (6) is applied in regression analysis to find dependencies between attributes in datasets.
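In the discrete case, Equations (5) and (6) can be computed directly from the data points. The sketch below is a minimal implementation under our own naming; `parts` is assumed to be a list with one basic-function family per dimension, each callable as A(h, x), such as the triangular partition built in the previous sketch.

```python
import numpy as np
from itertools import product

def direct_F(P, y, parts, ms):
    """Components F[h1,...,hn] of the discrete direct F-transform, Equation (5).
    P: (N, n) array of points, y: values f(p_j), ms: number of basic functions
    per dimension, parts: one basic-function family A_i(h, x) per dimension."""
    F = np.zeros(ms)
    for h in product(*[range(m) for m in ms]):
        w = np.ones(len(P))
        for i, hi in enumerate(h):             # weight A_1h1(p_j1) * ... * A_nhn(p_jn)
            w *= parts[i](hi, P[:, i])
        if w.sum() == 0:                       # the sufficient-density condition fails
            raise ValueError("P is not sufficiently dense w.r.t. the fuzzy partitions")
        F[h] = (w * y).sum() / w.sum()
    return F

def inverse_F(x, F, parts, ms):
    """Inverse F-transform at a point x, Equation (6): approximates f(x)."""
    return sum(F[h] * np.prod([parts[i](hi, x[i]) for i, hi in enumerate(h)])
               for h in product(*[range(m) for m in ms]))
```

Since each partition satisfies condition 4, the products of basic functions in Equation (6) sum to one over all combinations (h1,…, hn), so no further normalization is needed.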
To highlight the use of the multidimensional F-transform, consider, as an example, a dataset given by two input features defined in the closed intervals [1.1, 4.9] and [0.1, 1.0], respectively.
Suppose we create for each of the two input variables a fuzzy partition consisting of three basic functions, setting m1 = 3 and m2 = 3. We obtain c1 = 1.9 and c2 = 0.45.
Table 1 shows the values of the three nodes for the two input variables.
Figure 1 shows the points in the input variable plane. The four rectangles are drawn to show that the dataset is sufficiently dense with respect to the set of the two fuzzy partitions {A11, A12, A13} and {A21, A22, A23}. In fact, in each rectangle in the figure there is at least one point; this implies that for every combination of basic functions A1h1, A2h2, h1, h2 = 1, 2, 3, there exists at least one point pj = (pj1, pj2) such that A1h1(pj1)·A2h2(pj2) ≠ 0.
Since the data are sufficiently dense with respect to the set of fuzzy partitions, it is possible to apply Equation (5) to calculate the components Fh1h2, h1, h2 = 1, 2, 3, of the multidimensional direct F-transform. Finally, Equation (6) can be applied to calculate the multidimensional inverse F-transform in a point p; it approximates the function f in that point.
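The node values of this example follow directly from the definition of a uniform partition (xih = ai + ci ∙ (h − 1)). The short self-contained check below computes them and verifies the sufficient-density condition for a synthetic point set; the points are our own illustration, not the dataset of Figure 1.

```python
import numpy as np
from itertools import product

a, b, m = np.array([1.1, 0.1]), np.array([4.9, 1.0]), 3
c = (b - a) / (m - 1)                                    # c1 = 1.9, c2 = 0.45
nodes = [a[i] + c[i] * np.arange(m) for i in range(2)]   # [1.1, 3.0, 4.9], [0.1, 0.55, 1.0]

def A(i, h, x):                                          # triangular basic function A_ih
    return np.clip(1.0 - np.abs(x - nodes[i][h]) / c[i], 0.0, 1.0)

# Sufficient density: every combination (h1, h2) needs a point p_j with
# A_1h1(p_j1) * A_2h2(p_j2) > 0.
rng = np.random.default_rng(0)
P = np.column_stack([rng.uniform(a[i], b[i], 30) for i in range(2)])
print(all(np.any(A(0, h1, P[:, 0]) * A(1, h2, P[:, 1]) > 0)
          for h1, h2 in product(range(m), repeat=2)))    # True
```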
2.3. High Degree F-Transform
This subsection introduces the concept of the higher-degree fuzzy transform, or Fr-transform. One-dimensional square-integrable functions are considered first.
Let Ah, h = 1,…, n, be the hth basic function defined on [a,b] and L2([xh−1, xh+1]) be the Hilbert space of square-integrable functions f, g: [xh−1, xh+1] ⟶ R with the inner product:
\langle f, g \rangle_h = \int_{x_{h-1}}^{x_{h+1}} f(x)\, g(x)\, A_h(x)\, dx
Let L2r([xh−1, xh+1]), with r a positive integer, be the linear subspace of the Hilbert space L2([xh−1, xh+1]) with orthogonal basis given by the polynomials {P0h, P1h,…, Prh}, obtained by applying the Gram-Schmidt orthogonalization to the linearly independent system of polynomials {1, x, x2,…, xr} defined in the interval [xh−1, xh+1]. We have ⟨Pih, Pjh⟩h = 0 for i ≠ j, i, j = 0, 1,…, r.
The following lemma holds (cf. Perfilieva et al., 2011 [3], Lemma 1):
Lemma 1. Let Frh be the orthogonal projection of the function f on L2r([xh−1, xh+1]). Then:
F^r_h(x) = \sum_{i=0}^{r} c_{h,i}\, P_{ih}(x), \qquad c_{h,i} = \frac{\langle f, P_{ih} \rangle_h}{\langle P_{ih}, P_{ih} \rangle_h}, \quad i = 0, 1, \dots, r,
where Frh is the h-th component of the direct Fr-transform of f, Fr[f] = (Fr1, Fr2,…, Frn).
The inverse Fr-transform of f in a point x ∊ [a,b] is:
f^r_n(x) = \sum_{h=1}^{n} F^r_h(x)\, A_h(x)
For r = 0 we have P0h = 1 and the F0-transform is given by the F-transform in one variable (F0h(x) = ch,0).
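As a small numerical check of the orthogonality underlying this construction, the weighted inner product of P0h = 1 and P1h = x − xh vanishes; the interval, the node, and the triangular shape of Ah below are our own example choices.

```python
import numpy as np

# Triangular A_h on [x_{h-1}, x_{h+1}] = [0, 2] with node x_h = 1 (example values).
x_h = 1.0
xs = np.linspace(0.0, 2.0, 20001)
A_h = np.clip(1.0 - np.abs(xs - x_h), 0.0, 1.0)

# <P0_h, P1_h>_h = integral of 1 * (x - x_h) * A_h(x) dx over [0, 2] (Riemann sum).
dx = xs[1] - xs[0]
inner = ((xs - x_h) * A_h * dx).sum()
print(abs(inner) < 1e-9)          # True: P0_h and P1_h are orthogonal w.r.t. the weight A_h
```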
For r = 1 we have P1h = (x − xh) and the h-th component of the F1-transform is given by the formula:
F^1_h(x) = c_{h,0} + c_{h,1}\,(x - x_h)
where
c_{h,0} = \frac{\int_{x_{h-1}}^{x_{h+1}} f(x)\, A_h(x)\, dx}{\int_{x_{h-1}}^{x_{h+1}} A_h(x)\, dx}, \qquad c_{h,1} = \frac{\int_{x_{h-1}}^{x_{h+1}} f(x)\,(x - x_h)\, A_h(x)\, dx}{\int_{x_{h-1}}^{x_{h+1}} (x - x_h)^2\, A_h(x)\, dx}
If the function f is known in a set of N data points p1,…, pN, ch,0 and ch,1 can be discretized in the form:
c_{h,0} = \frac{\sum_{j=1}^{N} f(p_j)\, A_h(p_j)}{\sum_{j=1}^{N} A_h(p_j)}, \qquad c_{h,1} = \frac{\sum_{j=1}^{N} f(p_j)\,(p_j - x_h)\, A_h(p_j)}{\sum_{j=1}^{N} (p_j - x_h)^2\, A_h(p_j)}
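These discretized coefficients lend themselves to a direct implementation. The sketch below uses a uniform triangular partition of [0, 1] and a sine test function, both of which are our own illustrative choices; it computes ch,0 and ch,1 from scattered samples and evaluates the corresponding inverse F1-transform.

```python
import numpy as np

# Uniform triangular partition of [a, b] with n basic functions (as in Section 2.2).
a, b, n = 0.0, 1.0, 5
c = (b - a) / (n - 1)
nodes = a + c * np.arange(n)
A = [lambda x, xh=xh: np.clip(1.0 - np.abs(x - xh) / c, 0.0, 1.0) for xh in nodes]

# Sample data: f known only at N scattered points p_1, ..., p_N.
rng = np.random.default_rng(1)
p = rng.uniform(a, b, 200)
y = np.sin(2 * np.pi * p)

# Discrete F1-transform components F1_h(x) = c_h0 + c_h1 * (x - x_h).
coeffs = []
for h, x_h in enumerate(nodes):
    w = A[h](p)                                            # A_h(p_j)
    c_h0 = (w * y).sum() / w.sum()
    c_h1 = (w * y * (p - x_h)).sum() / (w * (p - x_h) ** 2).sum()
    coeffs.append((c_h0, c_h1))

# Inverse F1-transform: the local linear models blended by the basic functions.
def inverse_F1(x):
    return sum((c0 + c1 * (x - x_h)) * A[h](x)
               for h, (x_h, (c0, c1)) in enumerate(zip(nodes, coeffs)))

print(inverse_F1(0.3), np.sin(2 * np.pi * 0.3))            # approximation vs. true value
```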
Likewise, let L2([xk1−1, xk1+1] × [xk2−1, xk2+1] ×…× [xkn−1, xkn+1]) be the Hilbert space of square-integrable functions of n variables f, g: [xk1−1, xk1+1] × [xk2−1, xk2+1] ×…× [xkn−1, xkn+1] → R with the weighted inner product:
\langle f, g \rangle = \int_{x_{k_1-1}}^{x_{k_1+1}} \cdots \int_{x_{k_n-1}}^{x_{k_n+1}} f(x_1,\dots,x_n)\, g(x_1,\dots,x_n)\, A_{1k_1}(x_1) \cdots A_{nk_n}(x_n)\, dx_1 \cdots dx_n
Two functions of L2([xk1−1, xk1+1] × [xk2−1, xk2+1] ×…× [xkn−1, xkn+1]) are orthogonal if ⟨f, g⟩ = 0.
Let f: X ⊆ Rn → Y ⊆ R be a continuous n-dimensional function defined in a closed set [a1,b1] × [a2,b2] ×…× [an,bn]. Let Ahk, k = 1,…, mh, be the kth basic function defined on the interval [ah,bh] and L2([xh,k−1, xh,k+1]) be the Hilbert space of square-integrable functions f, g: [xh,k−1, xh,k+1] ⟶ R.
The inverse F1-transform of f in a point x = (x1, x2,…, xn) ∊ [a1,b1] × [a2,b2] ×…× [an,bn] is:
f^1(x_1, x_2, \dots, x_n) = \sum_{h_1=1}^{m_1} \sum_{h_2=1}^{m_2} \cdots \sum_{h_n=1}^{m_n} F^1_{h_1 h_2 \dots h_n}(x_1, \dots, x_n)\, A_{1h_1}(x_1)\, A_{2h_2}(x_2) \cdots A_{nh_n}(x_n)
where F1h1h2…hn is the (h1, h2,…, hn)th component of the direct F1-transform, given by the formula:
F^1_{h_1 h_2 \dots h_n}(x_1, \dots, x_n) = c^0_{h_1 h_2 \dots h_n} + \sum_{i=1}^{n} c^i_{h_1 h_2 \dots h_n}\,(x_i - x_{i h_i})
If f is known in a set of N n-dimensional data points p1,…, pN, where pj = (pj1, pj2,…, pjn), we obtain:
c^0_{h_1 h_2 \dots h_n} = F_{h_1 h_2 \dots h_n}, \qquad c^i_{h_1 h_2 \dots h_n} = \frac{\sum_{j=1}^{N} f(p_j)\,(p_{ji} - x_{i h_i})\, A_{1h_1}(p_{j1}) \cdots A_{nh_n}(p_{jn})}{\sum_{j=1}^{N} (p_{ji} - x_{i h_i})^2\, A_{1h_1}(p_{j1}) \cdots A_{nh_n}(p_{jn})}, \quad i = 1, \dots, n,
where c0h1h2…hn is the component Fh1h2…hn of the multidimensional discrete direct F-transform of f, given by (5).
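Putting the pieces together, the following minimal sketch implements the discrete multidimensional direct and inverse F1-transform following the formulas above; the helper names, the triangular basic functions, and the test function f(x1, x2) = x1·x2 are our own illustrative choices.

```python
import numpy as np
from itertools import product

def triangular_partition(a, b, m):
    """Uniform triangular fuzzy partition of [a, b] with m basic functions."""
    c = (b - a) / (m - 1)
    nodes = a + c * np.arange(m)
    return nodes, lambda x, h: np.clip(1.0 - np.abs(x - nodes[h]) / c, 0.0, 1.0)

def direct_F1(P, y, parts):
    """Coefficients (c0, [c1,...,cn]) of every component F1_{h1...hn}."""
    ms = [len(nodes) for nodes, _ in parts]
    comps = {}
    for h in product(*[range(m) for m in ms]):
        w = np.ones(len(P))
        for i, hi in enumerate(h):                 # weight A_1h1(p_j1) * ... * A_nhn(p_jn)
            w *= parts[i][1](P[:, i], hi)
        c0 = (w * y).sum() / w.sum()               # the discrete F-transform component (5)
        cs = [(w * y * (P[:, i] - parts[i][0][hi])).sum()
              / (w * (P[:, i] - parts[i][0][hi]) ** 2).sum()
              for i, hi in enumerate(h)]
        comps[h] = (c0, cs)
    return comps

def inverse_F1(x, parts, comps):
    """Inverse F1-transform at x: blends the local linear models with the basic functions."""
    total = 0.0
    for h, (c0, cs) in comps.items():
        w = np.prod([parts[i][1](x[i], hi) for i, hi in enumerate(h)])
        lin = c0 + sum(ci * (x[i] - parts[i][0][hi]) for i, (ci, hi) in enumerate(zip(cs, h)))
        total += lin * w
    return total

# Example: approximate f(x1, x2) = x1 * x2 from 300 scattered samples.
rng = np.random.default_rng(0)
P = np.column_stack([rng.uniform(1.1, 4.9, 300), rng.uniform(0.1, 1.0, 300)])
y = P[:, 0] * P[:, 1]
parts = [triangular_partition(1.1, 4.9, 3), triangular_partition(0.1, 1.0, 3)]
comps = direct_F1(P, y, parts)
print(inverse_F1([3.0, 0.5], parts, comps), 3.0 * 0.5)     # approximation vs. true value
```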