Exploratory Analysis of Distributional Data Using the Quantile Method

Abstract: The quantile method transforms each complex object, described by histogram values with differing numbers of bins, into a common number of quantile vectors. This paper retraces the authors' research, including principal component analysis, unsupervised feature selection using hierarchical conceptual clustering, and a lookup table regression model. The purpose is to show that this research is essentially based on the monotone property of quantile vectors and that the proposed methods work cooperatively in the exploratory analysis of the given distributional data.


Introduction
Various statistical methods have been extended to complex data types, including histogram-valued symbolic data [1][2][3][4]. This paper considers the following three research categories.

Principal Component Analysis (PCA)
The main purpose of traditional PCA is to transform a number of possibly correlated variables into a small number of uncorrelated variables, called principal components. In generalizing PCA to complex data types, mainstream research follows Pearson's approach. For example, a summary of various generalized PCA methods for interval data is given in [5]. The authors proposed a general method of PCA based on the quantification method using generalized Minkowski metrics [6,7] and the quantile method of PCA for general distributional data [8,9].

Clustering and Unsupervised Feature Selection
In the generalization of hierarchical clustering to histogram-valued data, the main problem is how to define an appropriate similarity or dissimilarity measure for the given objects and/or clusters. A hierarchical clustering method based on the Wasserstein distance [10] and a nonhierarchical method based on the dynamical clustering method [11] are typical examples. The authors also proposed a hierarchical conceptual clustering method based on the quantile method [12].
In unsupervised feature selection, clustering is a useful tool for searching for informative feature subsets. By combining existing clustering methods with an appropriate wrapper method, for example, we can achieve unsupervised feature selection. The authors proposed an unsupervised feature selection method for general distributional data using hierarchical conceptual clustering based on compactness [13]. Compactness plays multiple roles, i.e., it serves as the measure of similarity between objects and/or clusters, of cluster quality, and of feature effectiveness. This property greatly simplifies the task of feature selection.

Regression Models
The extension of linear regression models to histogram-valued variables was developed in [14][15][16][17][18][19][20]. In these studies, functional forms between the response variable and the explanatory variable(s) have been proposed under appropriately defined optimality criteria. As a very different alternative, the authors proposed the lookup table regression model (LTRM) for histogram data using the quantile method [21,22]. This paper retraces the aforementioned studies based on the quantile method and describes the proposed methods working cooperatively in an exploratory analysis of the given distributional data.
Section 2 describes the representation of objects by quantile vectors and bin rectangles. Sections 2.1-2.3 describe the quantile method, which transforms each object with p distributional feature variables into a description using a series of (m + 1) p-dimensional quantile vectors. It further describes these objects using a series of m p-dimensional bin rectangles, each spanned by adjacent quantile vectors, where m is a predetermined integer. Sections 2.4 and 2.5 define the concept size of bin rectangles and the concept size of the Cartesian join of objects. The Cartesian join generates a generalized concept for the two given objects. Section 2.6 defines the measure of compactness for two given objects and/or clusters under the assumption of equal bin probabilities. Compactness plays the central role in our unsupervised feature selection using hierarchical conceptual clustering.
Section 3 discusses the results of an exploratory analysis of two distributional datasets: oil data and hardwood data. Section 3.1 summarizes the quantile method of PCA and dual PCA using rank order correlation coefficients under the monotone property of quantile vectors. Section 3.2 describes the unsupervised feature selection method using hierarchical conceptual clustering based on compactness.
Section 3.3 proposes the lookup table regression model (LTRM) for distributional data based on monotone blocks segmentation (MBS).
Section 4 includes a concluding summary.

Quantile Vectors, Bin Rectangles, and Compactness
Let U = {ωi, i = 1, 2, ..., N} be the set of given objects, and let feature variables Fj, j = 1, 2, ..., p, describe each object. Let Dj be the domain of feature Fj, j = 1, 2, ..., p. Then, the feature space is defined by the following:
D(p) = D1 × D2 × ··· × Dp. (1)
Each element of D(p) is represented by:
E = E1 × E2 × ··· × Ep, (2)
where Ej is the feature value of Fj, j = 1, 2, ..., p.

Histogram-Valued Feature
For each object ωi, let each feature Fj be represented by a histogram value as follows:
Eij = {[aij1, aij2), pij1; [aij2, aij3), pij2; ...; [aijnij, aij(nij+1)], pijnij}, (3)
where pij1 + pij2 + ··· + pijnij = 1 and nij is the number of bins that compose the histogram Eij. Therefore, the Cartesian product of p histogram values represents the object ωi:
Ei = Ei1 × Ei2 × ··· × Eip. (4)
Because an interval-valued feature is a special case of a histogram feature with nij = 1 and pij1 = 1, the representation in (3) reduces to an interval, as follows:
Eij = [aij1, aij2]. (5)
It should be noted that the histogram representation is also possible for other feature types, such as categorical multivalued and modal multivalued features [12,13].
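To make the representation concrete, the following Python sketch models a histogram value as bin boundaries plus bin probabilities; the `HistogramValue` class and the sample feature values are our own illustrative assumptions, not data from the paper.

```python
from dataclasses import dataclass

@dataclass
class HistogramValue:
    """One histogram value Eij: nij bins [aij1, aij2), ..., [aijnij, aij(nij+1)]
    with probabilities pij1 + ... + pijnij = 1 (an interval value when nij = 1)."""
    edges: list   # nij + 1 increasing bin boundaries
    probs: list   # nij bin probabilities summing to 1

    def n_bins(self):
        return len(self.probs)

# An object ωi is described by the Cartesian product of p histogram values,
# one per feature (hypothetical numbers, for illustration only).
obj = [HistogramValue([0.91, 0.94], [1.0]),          # interval = single-bin histogram
       HistogramValue([100, 130, 180], [0.4, 0.6])]  # two-bin histogram
```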

Representation of Histograms by a Common Number of Quantiles
Let ωi ∈ U be the given object, and let Eij be a histogram value in (3) for feature Fj. Then, under the assumption that the nij bins have uniform distributions, we define the cumulative distribution function Fij(x) of the histogram (3) as follows:
Fij(x) = 0 for x ≤ aij1,
Fij(x) = pij1 (x − aij1)/(aij2 − aij1) for aij1 ≤ x < aij2,
Fij(x) = Fij(aij2) + pij2 (x − aij2)/(aij3 − aij2) for aij2 ≤ x < aij3,
and so on for the remaining bins. (6)
Figure 1 illustrates a cumulative distribution function for a histogram feature value, where c1, c2, and c3 are cut points for the case m = 4, and q1, q2, and q3 are the corresponding quantile values.
Our general procedure to obtain a common representation for histogram-valued data is as follows.
(1) We choose a common number m of quantiles.
(2) Let c1, c2, ..., cm−1 be preselected cut points dividing the range of the distribution function Fij(x) into continuous intervals, i.e., bins with preselected probabilities associated with the m − 1 cut points.
(3) For the given cut points c1, c2, ..., cm−1, we calculate the corresponding quantiles xij1, xij2, ..., xij(m−1) by solving the equations Fij(xijk) = ck, k = 1, 2, ..., m − 1, with Fij(xij0) = 0 (i.e., xij0 = aij1) and Fij(xijm) = 1 (i.e., xijm = aij(nij+1)).
Therefore, we describe each object ωi ∈ U for each feature Fj using an (m + 1)-tuple:
(xij0, xij1, ..., xijm), (7)
and the corresponding histogram using:
Eij = {[xij0, xij1), c1 − c0; [xij1, xij2), c2 − c1; ...; [xij(m−1), xijm], cm − cm−1},
where we assume that c0 = 0 and cm = 1. In (7), (ck+1 − ck), k = 0, 1, ..., m − 1, denote the bin probabilities given by the preselected cut point probabilities c1, c2, ..., cm−1. In the quartile case, m = 4, c1 = 1/4, c2 = 2/4, c3 = 3/4, and the four bins [xij0, xij1), [xij1, xij2), [xij2, xij3), and [xij3, xij4] have the same bin probability: 1/4. The number of bins of the given histograms may be mutually different in general. However, we can obtain (m + 1)-tuples as the common representation for all histograms by selecting an integer m and a set of cut points.
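Steps (1)-(3) can be sketched in Python under the uniform-within-bin assumption; the function name `histogram_quantiles` is ours, and the piecewise-linear inversion mirrors the cumulative distribution function defined above.

```python
import numpy as np

def histogram_quantiles(edges, probs, cuts):
    """Return the (m + 1)-tuple (xij0, xij1, ..., xijm) for one histogram value.
    edges: bin boundaries aij1 < ... < aij(n+1); probs: n bin probabilities
    summing to 1; cuts: cut points c1 < ... < c(m-1) in (0, 1)."""
    edges = np.asarray(edges, float)
    probs = np.asarray(probs, float)
    cdf = np.concatenate(([0.0], np.cumsum(probs)))  # CDF value at each edge
    qs = [float(edges[0])]                           # xij0 = aij1
    for c in cuts:
        k = min(np.searchsorted(cdf, c, side="right") - 1, len(probs) - 1)
        # invert the linear CDF segment inside bin k
        qs.append(float(edges[k] + (c - cdf[k]) / probs[k] * (edges[k + 1] - edges[k])))
    qs.append(float(edges[-1]))                      # xijm = aij(n+1)
    return qs
```

For example, a single-bin (interval) value [0, 1] with quartile cut points yields the quantiles 0, 0.25, 0.5, 0.75, 1.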

Quantile Vectors and Bin Rectangles
For each object ωi ∈ U, we define (m + 1) p-dimensional numerical vectors, called the quantile vectors, as follows:
xik = (xi1k, xi2k, ..., xipk), k = 0, 1, ..., m. (8)
We call xi0 and xim the minimum quantile vector and the maximum quantile vector, respectively. Therefore, the m + 1 quantile vectors {xi0, xi1, ..., xim} in R^p describe each object ωi ∈ U together with the cut point probabilities.
The components of the m + 1 quantile vectors in (8) for object ωi ∈ U satisfy the inequalities
xij0 ≤ xij1 ≤ ··· ≤ xijm, j = 1, 2, ..., p.
Therefore, the m + 1 quantile vectors in (8) for object ωi ∈ U satisfy the monotone property
xi0 ≤ xi1 ≤ ··· ≤ xim (componentwise).
For the series of quantile vectors xi0, xi1, ..., xim of object ωi ∈ U, we define a series of m p-dimensional rectangles, called bin rectangles, spanned by adjacent quantile vectors xik and xi(k+1), k = 0, 1, ..., m − 1, as follows:
B(xik, xi(k+1)) = xik ⊞ xi(k+1),
where xik ⊞ xi(k+1) is the Cartesian join [6,7] of xik and xi(k+1), obtained using the componentwise Cartesian join xijk ⊞ xij(k+1) = [xijk, xij(k+1)], j = 1, 2, ..., p.
Figure 2 illustrates two objects using two-dimensional bin rectangles in the quartile case. Because a bin rectangle is regarded as a conjunctive logical expression, we also use the term concept. Therefore, four bin rectangles describe each of the objects ωi and ωl as a concept.

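A minimal sketch of these definitions: stacking the m + 1 quantile vectors row-wise, the monotone property is a componentwise non-decreasing check, and each bin rectangle is the per-feature interval spanned by adjacent rows (the helper name `bin_rectangles` is ours).

```python
import numpy as np

def bin_rectangles(quantile_vectors):
    """quantile_vectors: (m + 1) x p array, row k = xik.
    Returns the m bin rectangles B(xik, xi(k+1)), each as a p x 2 array of
    per-feature intervals [xijk, xij(k+1)] (the componentwise Cartesian join)."""
    q = np.asarray(quantile_vectors, float)
    # monotone property: each component is non-decreasing in k
    assert (np.diff(q, axis=0) >= 0).all(), "monotone property violated"
    return [np.stack([q[k], q[k + 1]], axis=1) for k in range(len(q) - 1)]
```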

Concept Size of Bin Rectangles
For each feature F j , j = 1, 2,…, p, let the domain D j of feature values be the following interval:

Definition 1. Let object ωi ∈ U be described using the set of histograms Eij in (7). We define the average concept size P(Eij) of the m bins for histogram Eij as follows:
P(Eij) = Σ_{k=0}^{m−1} (ck+1 − ck)(xij(k+1) − xijk)/|Dj|. (12)
The average concept size P(Eij) satisfies the inequality:
0 ≤ P(Eij) ≤ 1. (13)
Proposition 1.
(1) When m bin probabilities are the same, the average concept size of m bins is reduced to the form: (2) When m bin widths are the same size w ij , we have: (3) It is clear that: This proposition asserts that both extremes yield the same conclusion.
Definition 2. We define the concept size P(Ei) of Ei using the arithmetic mean of the p average concept sizes:
P(Ei) = (1/p) Σ_{j=1}^{p} P(Eij).
From (13), it is clear that 0 ≤ P(Ei) ≤ 1.
Definition 3. Let P(B(xik, xi(k+1))), k = 0, 1, ..., m − 1, be the concept sizes of the m bin rectangles, defined by the average of the p normalized bin widths:
P(B(xik, xi(k+1))) = (1/p) Σ_{j=1}^{p} (xij(k+1) − xijk)/|Dj|. (19)
Then, (12) and (19) lead to the following proposition.
Proposition 2. The concept size P(Ei) is equivalent to the probability-weighted average of the m concept sizes of bin rectangles:
P(Ei) = Σ_{k=0}^{m−1} (ck+1 − ck) P(B(xik, xi(k+1))),
where c0 = 0 and cm = 1.
In Figure 2, two objects, ω i and ω l , are represented by four bin rectangles with the same probability: 1/4.According to Proposition 2, object ω i has a smaller concept size than object ω l .
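Proposition 2 suggests the following computation of P(Ei): average the p normalized bin widths within each bin rectangle, then weight by the bin probabilities (ck+1 − ck). This Python sketch assumes equal bin probabilities by default; the helper name and argument layout are our assumptions.

```python
import numpy as np

def concept_size(quantile_vectors, domain_widths, cuts=None):
    """Concept size P(Ei) as the probability-weighted average of the m
    bin-rectangle sizes, each the mean of the p normalized bin widths.
    cuts: optional cut points c1..c(m-1); equal bin probabilities if None."""
    q = np.asarray(quantile_vectors, float)
    m = len(q) - 1
    c = np.concatenate(([0.0], np.arange(1, m) / m if cuts is None else cuts, [1.0]))
    # P(B(xik, xi(k+1))): mean normalized width over the p features
    rect_sizes = (np.diff(q, axis=0) / np.asarray(domain_widths, float)).mean(axis=1)
    return float(np.dot(np.diff(c), rect_sizes))
```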

Concept Size of the Cartesian Join of Objects
Let Eij and Elj be two histogram values of objects ωi, ωl ∈ U with respect to the j-th feature. We represent a generalized histogram value of Eij and Elj, called the Cartesian join of Eij and Elj, using Eij ⊞ Elj. Let FEij(x) and FElj(x) be the cumulative distribution functions associated with the histograms Eij and Elj, respectively.
Definition 4. We define the cumulative distribution function for the Cartesian join Eij ⊞ Elj as follows:
Then, by applying the same integer m and the same set of cut point probabilities c1, c2, ..., cm−1 used for Eij and Elj, we define the histogram of the Cartesian join Eij ⊞ Elj for the j-th feature as:
Eij ⊞ Elj = {[x(i+l)j0, x(i+l)j1), c1 − c0; ...; [x(i+l)j(m−1), x(i+l)jm], cm − cm−1},
where we assume that c0 = 0 and cm = 1 and that the suffix (i + l) denotes the quantile values for the Cartesian join Eij ⊞ Elj. We should note that x(i+l)j0 = min(xij0, xlj0) and x(i+l)jm = max(xijm, xljm).
Definition 5. We define the average concept size P(Eij ⊞ Elj) of the m bins for the Cartesian join of Eij and Elj under the j-th feature as follows:
P(Eij ⊞ Elj) = Σ_{k=0}^{m−1} (ck+1 − ck)(x(i+l)j(k+1) − x(i+l)jk)/|Dj|.
The average concept size satisfies the inequality:
0 ≤ P(Eij ⊞ Elj) ≤ 1.
Proposition 3. When the m bin probabilities are the same or the m bin widths are the same, we have the following monotone property:
P(Eij), P(Elj) ≤ P(Eij ⊞ Elj).
Definition 6. Let Ei = Ei1 × Ei2 × ··· × Eip and El = El1 × El2 × ··· × Elp be the descriptions of p histograms in R^p for ωi and ωl, respectively. Then, we define the concept size P(Ei ⊞ El) for the Cartesian join of Ei and El using the arithmetic mean, as follows:
P(Ei ⊞ El) = (1/p) Σ_{j=1}^{p} P(Eij ⊞ Elj). (24)
From (24), it is clear that 0 ≤ P(Ei ⊞ El) ≤ 1.
Definition 7. Let x(i+l)k, k = 0, 1, ..., m, be the quantile vectors for the Cartesian join Ei ⊞ El, and let P(B(x(i+l)k, x(i+l)(k+1))), k = 0, 1, ..., m − 1, be the concept sizes of the m bin rectangles defined by the average of the p normalized bin widths. Then, we have the following result.
Proposition 4. The concept size P(Ei ⊞ El) is equivalent to the probability-weighted average of the m concept sizes of bin rectangles:
P(Ei ⊞ El) = Σ_{k=0}^{m−1} (ck+1 − ck) P(B(x(i+l)k, x(i+l)(k+1))),
where c0 = 0 and cm = 1.
We have the following monotone property from Proposition 3 and Definition 6.
Proposition 5. When the m bin probabilities are the same or the m bin widths are the same for all features, we have the monotone property:
P(Ei), P(El) ≤ P(Ei ⊞ El). (30)
This property plays a very important role in our hierarchical conceptual clustering in Section 3.2.

Compactness and Its Properties
In the following section, we assume that the given distributional data have the same representation using m quantile values with the same bin probabilities.

Definition 8. Under the assumption of equal bin probabilities, we define the compactness of the generalized concept of ωi and ωl as follows:
C(ωi, ωl) = P(Ei ⊞ El).
The compactness satisfies the following properties:
Proposition 6.
(2) C(ωi, ωl) = 0 iff Ei ≡ El and Ei has null size (P(Ei) = 0).
Figure 3 illustrates the Cartesian join for interval-valued objects. We should note that the compactness values C(ω1, ω2) = P(E1 ⊞ E2) and C(ω3, ω4) = P(E3 ⊞ E4) take the same value as the corresponding concept sizes. On the other hand, any (dis)similarity measure for distributional data should take different values for the pairs (E1, E2) and (E3, E4). Therefore, small compactness requires that the pair of objects under consideration be similar to each other, but the converse is not true.
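For the interval-valued setting of Figure 3 (single-bin histograms), the compactness reduces to the concept size of the componentwise interval hull of the two objects. A minimal sketch under that restriction (the function name is ours):

```python
import numpy as np

def interval_compactness(a, b, domain_widths):
    """C(ωi, ωl) = P(Ei ⊞ El) for interval-valued objects: the mean
    normalized width of the componentwise interval hull.
    a, b: p x 2 arrays of [low, high] per feature."""
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    lo = np.minimum(a[:, 0], b[:, 0])   # hull lower bounds
    hi = np.maximum(a[:, 1], b[:, 1])   # hull upper bounds
    return float(((hi - lo) / np.asarray(domain_widths, float)).mean())
```

Note that different pairs with the same hull get the same compactness, which matches the discussion of the pairs (E1, E2) and (E3, E4) above: small compactness implies similarity, but equal compactness does not imply equal pairs.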


Principal Component Analysis (PCA)
In standard numerical data of the size N objects by p variables, we capture macroscopic properties of the data on the factor planes using the principal components obtained from the factorization of a p × p covariance matrix or a correlation matrix. In this paper, for the given N objects by p distributional variables, we used the following procedure (the quantile method of PCA [8,9]):
1. We transformed the given data of the size N objects by p distributional variables into N × (m + 1) quantile vectors in the space R^p, where m is a preselected common number of quantiles describing each histogram value. For each object, the essential property of the (m + 1) quantile vectors is that they satisfy the monotone property in the space R^p.
2. We evaluated the covariate relations between each pair of the p variables using the Spearman or Kendall rank order correlation coefficient and obtained the correlation matrix S. If the N × (m + 1) quantile vectors follow a monotone structure, many off-diagonal elements of S take large absolute values. Then, we expect the existence of a large eigenvalue of S, and the corresponding eigenvector reproduces the original monotone property of the N × (m + 1) quantile vectors in the space R^p.
3. With the factorization of the correlation matrix S, we obtained factor planes using the principal components, on which each of the N objects is represented by m series of connected arrow lines from the minimum quantile vector to the maximum quantile vector.
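The three steps can be sketched as follows. Q stacks the N × (m + 1) quantile vectors as rows; Spearman's correlation is computed from column ranks (ties are not handled in this simple ranking); eigendecomposition of S gives the principal components. The function name and return layout are our assumptions, not the authors' code.

```python
import numpy as np

def quantile_pca(Q):
    """Quantile method of PCA (sketch): factorize Spearman's rank-order
    correlation matrix of the stacked quantile vectors Q (rows = N*(m+1)
    vectors, columns = p variables) and project the rows onto the PCs."""
    Q = np.asarray(Q, float)
    ranks = np.argsort(np.argsort(Q, axis=0), axis=0).astype(float)  # no tie handling
    S = np.corrcoef(ranks, rowvar=False)          # Spearman correlation matrix
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]                # eigenvalues, descending
    vals, vecs = vals[order], vecs[:, order]
    Z = (Q - Q.mean(axis=0)) / Q.std(axis=0)      # standardized quantile vectors
    return vals, vecs, Z @ vecs                   # scores: rows on factor planes
```

When the stacked quantile vectors follow a common monotone structure, S is close to an all-ones matrix and its first eigenvalue approaches p, as described in step 2.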

PCA of Oil Data
The oil data in Table 1 are composed of six plant oils and two animal fats described using four interval-valued features and one nominal multivalued feature. Here, we used the composition table in Table 2 for the major acids. Each object is composed of acids from the acids ordered by molecular weight. For each object, we assumed a unit interval for each component acid, assuming a uniform distribution. Figure 4a shows the obtained cumulative distribution functions, and Figure 4b contains the corresponding quantile functions. The last column of Table 2 lists the seven quantiles calculated for each object. Table 3 shows the oil data described using five interval values. For major acids, we cut the 0% and 100% quantiles to clarify the distinctions between objects, and we regarded the 10% and 90% quantiles as the new 0% and 100% quantiles. Table 4 lists the first two principal components for the oil data in Table 3. The two principal components have very high contribution ratios. In this example, the first principal component is not the size factor. Specific gravity and iodine value have very large positive weights. Figure 5a shows the mutual positions of the five features; specific gravity and iodine value are highly covariate. In Figure 5b, each object is represented by an arrow line connecting the minimum quantile vector and the maximum quantile vector. Beef and hog are isolated from the plant oils. On the other hand, linseed and perilla have larger concept sizes and are separated from the other four plant oils. Table 5 demonstrates part of the oil data in quartile representation. We obtained quartiles for the four interval feature values assuming uniform distributions. Table 6 lists the first two principal components for the quartile case. The two principal components, again, have very high contribution ratios and are very similar to the results of Table 4. Figure 6 shows the eight objects represented by four connected arrow lines from the minimum quantile vector to the maximum quantile vector. The quartile representation affects the shapes of objects, as indicated by the arrow lines.

Dual PCA of Oil Data
In the oil data by quartile representation, we used data in the form of (8 × 5 quantile values) × (5 variables). We replaced the positions of the eight objects and five variables to obtain (5 × 5 quantile values) × (8 objects). Using the factorization of Spearman's 8 × 8 rank order correlation matrix, we obtained the results in Table 7. The sum of the contribution ratios is large, and the first principal component is the size factor in the dual PCA. The scatter plot of Figure 7a is consistent with the results in Figures 5b and 6. In Figure 7b, specific gravity and iodine value have small concept sizes and are mutually covariate. Similarly, freezing point is covariate with saponification value. In between these two groups, major acids shows the largest concept size.


PCA of Hardwood Data
The data were extracted from the US Geological Survey (Climate-Vegetation Atlas of North America) [23].The number of objects is ten, and the number of features is eight.Table 8 shows quantile values for the selected ten hardwoods under the variable mean annual temperature (ANNT).For example, the existence probability of Acer East is 0% under −2.3 °C and 10% in the range −2.3 ~ 0.6 °C, etc.We selected the following eight variables to describe the objects (hardwood).The data formats for other variables F2 ~ F8 are the same as those in Table 8.

The hardwood data are numerical data of the size {(10 objects) × (7 quantile values)} × (8 variables). Using the factorization of Spearman's 8 × 8 rank order correlation matrix, we obtained the results in Table 9. Figure 8 shows the mutual positions of the eight variables by two eigenvectors. We have two groups: {ANNP, JANP, JULP, and MOISTURE} and {ANNT, JANT, JULT, and GDC5}. Figure 9 shows the mutual positions of the ten objects in the first factor plane. Each hardwood is represented by six arrow lines connecting the minimum quantile vector to the maximum quantile vector. We should note the following facts for the PCA results:
1. The first principal component is the size factor and the second is the shape factor, and the sum of their contribution ratios is very high.
2. East hardwoods show similar line graphs, and the maximum quantile vectors take mutually near positions.
3. West hardwoods are separated into two groups: {ACER WEST and ALNUS WEST} and {FRAXINUS WEST, JUGLANS WEST, and QUERCUS WEST}. The last arrow lines are very long, especially for ACER WEST and ALNUS WEST.

Dual PCA of Hardwood Data
We changed the places of objects and variables in the hardwood data. Then, we applied the quantile method of PCA to the dual data in the form of {(8 variables) × (7 quantile values)} × (10 objects). Table 10 contains the first two principal components for the dual data, and Figure 10 shows the mutual positions of the ten hardwoods by two eigenvectors. West hardwoods are separated, again, into two different groups. Figure 11 shows the mutual positions of the eight variables in the first factor plane. Each variable is represented by a series of six line segments connecting the minimum quantile vector to the maximum quantile vector. We should note the following facts for the result of dual PCA:
1. The first principal component is the size factor and the second is the shape factor, and the sum of their contribution ratios is very high.
2. We have two groups: {ANNP, JANP, JULP, and MITM} and {ANNT, JANT, JULT, and GDC5}. MITM and GDC5 have very long line graphs compared with the other members in each group.

Unsupervised Feature Selection Using Hierarchical Conceptual Clustering
This section describes our algorithm of hierarchical conceptual clustering and an exploratory method for unsupervised feature selection based on compactness.
Let U = {ω1, ω2, ..., ωN} be the given set of objects, and let each object ωi be described using a set of histograms Ei = Ei1 × Ei2 × ··· × Eip in the feature space R^p. We assumed that all histogram values for all objects have the same number, m, of quantiles. We also assumed the same bin probabilities for all histogram values to keep the monotone property in Proposition 5 and Proposition 6 (3).
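Restricting to interval-valued data (single-bin histograms), the clustering step can be sketched as follows: at each step, merge the pair of clusters whose Cartesian join has the smallest compactness, and describe the merged cluster by that join. This is our simplified reading with hypothetical helper names; the algorithm of [13] handles general histogram values.

```python
import numpy as np

def hcc(intervals, domain_widths):
    """Hierarchical conceptual clustering sketch for interval-valued objects.
    intervals: list of p x 2 arrays of [low, high]; returns the merge history
    as (cluster_id_1, cluster_id_2, compactness) triples."""
    widths = np.asarray(domain_widths, float)
    clusters = {i: np.asarray(iv, float) for i, iv in enumerate(intervals)}
    history, next_id = [], len(clusters)
    while len(clusters) > 1:
        best = None
        for i in clusters:
            for l in clusters:
                if i >= l:
                    continue
                lo = np.minimum(clusters[i][:, 0], clusters[l][:, 0])
                hi = np.maximum(clusters[i][:, 1], clusters[l][:, 1])
                comp = float(((hi - lo) / widths).mean())  # compactness of the join
                if best is None or comp < best[0]:
                    best = (comp, i, l, np.stack([lo, hi], axis=1))
        comp, i, l, joined = best
        history.append((i, l, comp))
        del clusters[i], clusters[l]
        clusters[next_id] = joined   # generalized concept of the merged cluster
        next_id += 1
    return history
```

Because the join's concept size can only grow as clusters merge (Proposition 5), the sequence of merge compactness values is non-decreasing, which is what makes compactness usable as a dendrogram height.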


Analysis of Oil Data
We applied the hierarchical conceptual clustering (HCC) algorithm [13] to the oil data in Table 3. In these data, each object is described using interval values, i.e., histograms having a single bin. The dendrogram in Figure 12 shows three explicit clusters: (linseed, perilla), (cotton, sesame, olive, camellia), and (beef, hog). Table 11 summarizes the values of the average compactness for each feature in each clustering step. As clarified by the bold numbers, the most robustly informative features are specific gravity and iodine value until step 6. Figure 13 shows the scatter diagram of the oil data for the two selected robustly informative features. This figure, again, shows three distinct clusters: (linseed, perilla), (cotton, sesame, camellia, olive), and (beef, hog). They exist in locally limited regions, and they are organized in a geometrically thin structure with respect to the selected features. Figure 14 shows the dendrogram with concept descriptions of clusters with respect to specific gravity and iodine value. This dendrogram clarifies two major clusters, plant oils and fats, in addition to the three distinct clusters, and the compactness takes smaller values compared with the dendrogram in Figure 12. We should note that compactness plays the role of the similarity measure between objects and/or clusters, the role of the cluster quality measure, and the role of the feature effectiveness criterion.

Analysis of Hardwood Data
To maintain the monotone property in Propositions 5 and 6 (3), we assumed quartile representation for the hardwood data. Figure 15 shows the result of PCA for the quartile case. After the removal of the 10% and 90% quantiles, the lengths of the first and the last line segments greatly increased compared with the result in Figure 9, especially for the west hardwoods.
Figure 16 shows the result of our HCC using compactness. In this dendrogram, HCC generated a cluster of east hardwoods in the order ((((AcE, JE), FE), QE), AlE), and a cluster ((JW, FW), QW). Then, AcW was merged into the cluster of east hardwoods with a compactness of 0.847, and AlW was merged further with a compactness of 0.935. Because the compactness of the east hardwoods is 0.671, AcW and AlW are mutually similar compared with the east hardwoods. As a result, we have three clusters: (AcW, AlW), (AcE, AlE, FE, JE, QE), and (FW, JW, QW). The PCA result in Figure 15 also supports these clusters. Table 12 shows the average compactness for each feature and clustering step. The most robustly informative feature is ANNP, followed by JULP. However, we should note that JANT is also important in steps 7 and 8. Figure 17a,b show the scatter diagrams of the ten hardwoods by informative feature. Figure 17b is very similar to the PCA result in Figure 15. We should note, again, that the compactness contributed to the selection of the important features. We should also note that the minimum quantile vectors and the maximum quantile vectors describe the differences between objects and/or clusters in the scatter diagrams under the selected informative features.

Lookup Table Regression Model
This section describes the lookup table regression model (LTRM) for histogram-valued symbolic data [21,22]. For the given symbolic data table of size (N objects) × (p variables), we represented each object using (m + 1) p-dimensional quantile vectors, where m is a preselected integer. To the new numerical data table of size {N × (m + 1) quantile values} × (p variables), we applied the monotone blocks segmentation (MBS) algorithm. MBS interchanges the N × (m + 1) rows so that they are organized according to the values of the selected response variable, from smallest to largest. For each of the remaining p − 1 explanatory variables, i.e., columns, MBS executes the segmentation of variable values into blocks so that the generated blocks, i.e., interval values, satisfy the monotone property. MBS discards columns that have only a single block. Therefore, MBS detects monotone covariate relations existing between the response variable and explanatory variable(s). Finally, MBS obtains a lookup table of size N′ × p′, where N′ < N × (m + 1) and p′ < p. Each element of the table is an interval value corresponding to a segmented block. We realized the interval value estimation rule for the response variable by searching for the nearest element in the lookup table.
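A rough Python sketch of MBS may help fix ideas. The block-building rule below is our assumption about how the monotone property is enforced: scanning values in response order, each value opens a new block unless its range would overlap the previous block, in which case blocks are merged backwards until the resulting intervals are non-overlapping and non-decreasing. The names `monotone_blocks` and `mbs` are illustrative, not the authors' implementation.

```python
def monotone_blocks(values):
    """Segment a sequence (already ordered by the response variable)
    into interval blocks whose [min, max] ranges are non-overlapping
    and non-decreasing (the monotone property)."""
    blocks = []
    for v in values:
        blocks.append([v, v])
        # merge backwards while the newest block overlaps its predecessor
        while len(blocks) > 1 and blocks[-1][0] < blocks[-2][1]:
            last = blocks.pop()
            blocks[-1][0] = min(blocks[-1][0], last[0])
            blocks[-1][1] = max(blocks[-1][1], last[1])
    return blocks

def mbs(rows, response, explanatory):
    """Order rows by the response variable, segment each explanatory
    column into monotone blocks, and discard single-block columns,
    which carry no monotone covariate relation."""
    ordered = sorted(rows, key=lambda r: r[response])
    table = {}
    for col in explanatory:
        blocks = monotone_blocks([r[col] for r in ordered])
        if len(blocks) > 1:
            table[col] = blocks
    return table
```

Under this reading, a column like saponification value in the oil example collapses to a single block and is dropped, matching the behavior described above.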

Illustration by Oil Data
We used the oil data in Table 3 to describe the basic ideas of MBS and LTRM. In these data, each of the eight objects is described using five interval values. Because an interval is a special histogram composed of one bin, we split each object into two sub-objects, the minimum sub-object and the maximum sub-object, described using five-dimensional quantile vectors, i.e., the minimum quantile vector and the maximum quantile vector. Table 13 contains the obtained quantile representation of our numerical data of size (8 × 2 quantile values) × (5 variables). In this example, we selected iodine value as the response variable and the remaining four as explanatory variables. In Table 14, we interchanged the given sixteen quantile vectors according to iodine value, from a minimum value of 40 to a maximum value of 208. Then, we segmented each column into blocks to satisfy the monotone property. Because the saponification value is composed of a single block, we omitted it from the explanatory variables. Specific gravity is most strongly connected to the response variable. In the previous section, we obtained the data in Figure 14 using the unsupervised feature selection method. MBS also has a feature selection capability under monotone covariate relations between the response and explanatory variables. Then, we determine whether the minimum response value is c1 or c2 according to whether a1 is nearer to b1 or b2. Similarly, we determine whether the maximum response value is c3 or c4 according to whether a2 is nearer to b3 or b4.
For example, the specific gravity of cotton is [0.916, 0.918] and is included in [0.916, 0.920]. Hence, the estimated iodine value is [80, 113]. On the other hand, the specific gravity of sesame is [0.920, 0.926]. The minimum value of 0.920 suggests the maximum value 113 of the response interval [80, 113], while the maximum value of 0.926 suggests the value 116 of the response interval [116, 116]. As a result, the estimated iodine value is [113, 116]. Table 16 summarizes our estimated results.
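The estimation rule illustrated above can be sketched as follows, using the two specific-gravity blocks quoted in the text ([0.916, 0.920] → [80, 113] and [0.926, 0.926] → [116, 116]) as a toy lookup table. The nearest-endpoint matching below is our reading of the "nearest element" search, not the authors' code.

```python
def nearest_bound(a, table):
    """Return the response endpoint whose explanatory endpoint is
    nearest to the single value a (assumed tie-breaking: smallest
    distance, then smallest response value)."""
    candidates = []
    for (e_lo, e_hi), (c_lo, c_hi) in table:
        candidates.append((abs(a - e_lo), c_lo))
        candidates.append((abs(a - e_hi), c_hi))
    return min(candidates)[1]

def estimate(a_lo, a_hi, table):
    """Estimate the response interval for an explanatory interval
    [a_lo, a_hi] against a lookup table of (explanatory interval,
    response interval) pairs."""
    for (e_lo, e_hi), resp in table:
        if e_lo <= a_lo and a_hi <= e_hi:   # inclusion case, as for cotton
            return resp
    return (nearest_bound(a_lo, table), nearest_bound(a_hi, table))

# toy lookup table built from the two blocks quoted in the text
lookup = [((0.916, 0.920), (80, 113)),
          ((0.926, 0.926), (116, 116))]
```

With this sketch, `estimate(0.916, 0.918, lookup)` reproduces cotton's [80, 113] via the inclusion case, and `estimate(0.920, 0.926, lookup)` reproduces sesame's [113, 116] via the nearest-endpoint case.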

In this table, ANNT shows the strongest connection to the response variable GDC5. We used the test data in Table 18 to check the estimation ability of our lookup table. Table 19 summarizes the estimated results for the test data. In the range [0.1, 2.5] of GDC5, the result requires further improvement, and the PCA result in Figure 15 suggests the use of clustering. Under the assumption of quartiles, we applied HCC to the hardwood data and obtained the dendrogram in Figure 16. From the results in Figures 15 and 16, we supposed three clusters, C1 = (AcW, AlW), C2 = (AcE, AlE, FE, JE, QE), and C3 = (FW, JW, QW), in the following discussion.
We applied MBS to each of the three clusters, C1, C2, and C3. Tables 20–22 present the lookup tables for these three clusters. In Table 20, JULT contributes to the range [0.1, 1.1] of GDC5. On the other hand, in Tables 21 and 22, ANNT is strongly connected to the whole range of GDC5.
Figure 18 shows the scatter diagram of the hardwood data for ANNT and GDC5, in which all hardwoods exist in a narrow region. We used the estimation of GDC5 by ANNT for cluster C2 because the lookup table for C2 covers the widest range of ANNT compared with the other lookup tables. Figure 19 shows the graph of GDC5 against ANNT under cluster C2, and Table 23 presents the estimation result for the test data. We obtained a better estimation result compared with the result in Table 19.

Conclusive Summary
The quantile method is a unified quantification method for histogram-valued symbolic data. We retraced and summarized three research categories: principal component analysis using the monotone property of quantile values, hierarchical conceptual clustering and unsupervised feature selection using the compactness measure, and the lookup table regression model using monotone blocks segmentation (MBS). In the following sections, we summarize our results.

PCA and Dual PCA
For each object, the (m + 1) quantile values of each variable satisfy the monotone property. Based on this property, PCA was realized using the eigenvalue problem of the Spearman correlation matrix.
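A compact sketch of this step, under the assumption of simple ordinal ranks (no averaging of ties): rank-transform each column of the stacked quantile-value table, form the Spearman correlation matrix from the standardized ranks, and project onto its leading eigenvectors.

```python
import numpy as np

def spearman_pca(X, k=2):
    """PCA on the Spearman correlation matrix of X, whose rows are the
    stacked quantile vectors and whose columns are the p variables.
    Ties get arbitrary (not averaged) ranks in this sketch."""
    R = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    R = (R - R.mean(axis=0)) / R.std(axis=0)      # standardized ranks
    C = (R.T @ R) / len(R)                        # Spearman correlations
    w, V = np.linalg.eigh(C)                      # ascending eigenvalues
    top = np.argsort(w)[::-1][:k]
    # principal component scores and contribution ratios
    return R @ V[:, top], w[top] / w.sum()
```

Because Spearman correlation depends only on ranks, any two variables related by a monotone transformation are perfectly correlated here, which is exactly why the monotone property of quantile values makes this formulation natural.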

Analysis of Oil Data
In the PCA of the oil data, three explicit clusters, (beef, hog), (olive, camellia, cotton, sesame), and (linseed, perilla), were obtained in the factor plane using the first two principal components with a high contribution ratio. Linseed and perilla have larger line graphs compared with the other objects. The quartile representation affected the shapes of the objects, especially linseed. In the dual PCA using quartile representation, three groups, (freezing point, saponification value), (major acids), and (specific gravity, iodine value), were placed in different positions on the factor plane using the first two principal components with a high contribution ratio. Specific gravity and iodine value have very small concept sizes and are in mutually near positions. Major acids has a very large line graph and is located between the two other groups.

Analysis of Hardwood Data
In the PCA of the hardwood data, three clusters, (AcW, AlW), (AcE, AlE, FE, JE, QE), and (FW, JW, QW), were obtained on the factor plane using the first two principal components with a high contribution ratio. East hardwoods have similar shapes in a narrow region. On the other hand, west hardwoods, especially their maximum quantile vectors, were spread over a wide range on the factor plane. In the dual PCA, two groups, (ANNP, JANP, JULP, MITM) and (ANNT, JANT, JULT, GDC5), were obtained on the factor plane using the first two principal components with a high contribution ratio. MITM and GDC5 have very large line graphs in their respective groups.

Hierarchical Conceptual Clustering (HCC) and Unsupervised Feature Selection
The HCC algorithm is based on compactness under the assumption of equal bin probabilities. Compactness is the concept size of the merged concept of two objects and/or clusters. Compactness takes a 0–1 normalized value and satisfies the monotone property, i.e., the merged concept size is larger than the concept sizes of the two given objects and/or clusters. In each step of our hierarchical conceptual clustering, two objects and/or clusters were merged so as to minimize compactness. This requires the two merged objects and/or clusters to be mutually similar and to have small concept sizes. In this sense, compactness plays the role of a similarity measure between objects and/or clusters. On the other hand, minimizing the merged concept size is equivalent to maximizing the dissimilarity of the merged concept from the whole concept. Therefore, compactness plays the role of a cluster quality measure. In each clustering step, we evaluated the average compactness of objects and/or clusters for each variable. The informative features then take smaller values through the successive clustering steps. Therefore, compactness also plays the role of a feature effectiveness criterion. This fact greatly simplifies the task of unsupervised feature selection.
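The clustering loop can be sketched as follows, assuming interval (single-bin) concepts and a range-normalized hull length as the compactness; the paper's exact normalization for general histogram concepts may differ, so treat this as an illustration of the merge criterion rather than the authors' implementation.

```python
def compactness(c1, c2, ranges):
    """Concept size of the merged concept of two interval concepts,
    averaged over features and normalized by each feature's full range
    (an assumed normalization yielding a 0-1 value)."""
    total = 0.0
    for (lo1, hi1), (lo2, hi2), (r_lo, r_hi) in zip(c1, c2, ranges):
        total += (max(hi1, hi2) - min(lo1, lo2)) / (r_hi - r_lo)
    return total / len(ranges)

def hcc(objects, ranges):
    """Agglomerative loop: repeatedly merge the pair of clusters with
    minimum compactness; returns the merge history as (member ids,
    compactness) pairs, i.e., the dendrogram levels."""
    clusters = [([i], obj) for i, obj in enumerate(objects)]
    history = []
    while len(clusters) > 1:
        comp, i, j = min(
            (compactness(a, b, ranges), ia, ib)
            for ia, (_, a) in enumerate(clusters)
            for ib, (_, b) in enumerate(clusters) if ia < ib)
        ids_i, con_i = clusters[i]
        ids_j, con_j = clusters[j]
        merged = (ids_i + ids_j,
                  [(min(l1, l2), max(h1, h2))
                   for (l1, h1), (l2, h2) in zip(con_i, con_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((merged[0], comp))
    return history
```

The monotone property is visible in the history: each merge level is at least as large as the concept sizes it combines, so the dendrogram heights never decrease within a branch.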

Analysis of Oil Data
In the dendrogram of the oil data using the HCC algorithm, the three explicit clusters (beef, hog), (olive, camellia, cotton, sesame), and (linseed, perilla) are recognized again, as in the PCA results. However, (linseed, perilla) is isolated from the other plant oils in the obtained dendrogram. From the results of average compactness evaluated for each variable and each clustering step, the most robustly informative variables are specific gravity and iodine value. The dual PCA and the scatter diagram of the eight oils on the plane of these informative variables also support the obtained result. In the dendrogram of the oil data using the two informative features, the three clusters have smaller concept sizes than those in the dendrogram using five variables, and they have mutually similar concept sizes. Furthermore, the cluster (linseed, perilla) is merged with the cluster of the other plant oils.

Analysis of Hardwood Data
The dendrogram of the hardwood data using the HCC algorithm shows two large clusters, (FW, JW, QW) and (AcE, AlE, FE, JE, QE, AcW, AlW), at step 8. The results of average compactness for each variable and each clustering step show that ANNP is informative during steps 1–8, JULP is important during steps 3–7, and JANT is important in steps 7–8. The five east hardwoods exist in a narrow region on the plane of ANNP and JULP, and the west hardwoods spread out widely on the same plane. On the other hand, on the plane of ANNP and JANT, we have three clusters, (AcW, AlW), (AcE, AlE, FE, JE, QE), and (FW, JW, QW), and the scatter diagram is very similar to the result of PCA on the factor plane using the first two principal components.

Lookup Table Regression Model (LTRM)
In the LTRM, we used the monotone blocks segmentation (MBS) algorithm. When each of the N objects is represented by (m + 1) p-dimensional quantile vectors, MBS interchanges the N × (m + 1) rows of the data table according to the values of the selected response variable, from smallest to largest. For each of the remaining p − 1 explanatory variables, i.e., columns, MBS executes the segmentation of variable values into blocks so that the generated blocks, i.e., interval values, satisfy the monotone property. MBS discards single-block columns and obtains a lookup table of size N′ × p′, where N′ < N × (m + 1) and p′ < p. We realized the interval estimation rule for the response variable by searching for the nearest element in the lookup table.

Lookup Table of Oil Data
When each object was represented by the minimum and maximum quantile vectors, we applied MBS to the data under the assumption that iodine value was the response variable and the remaining four were explanatory variables. As a result, we obtained a lookup table composed of three explanatory variables: specific gravity, freezing point, and major acids. Specific gravity is the most important variable for explaining iodine value, and this result is supported by the unsupervised feature selection for the oil data.

Lookup Table of the Hardwood Data
We applied MBS to the hardwood data described using seven quantile values under the assumption that GDC5 was the response variable and the remaining seven were explanatory variables. We obtained a lookup table composed of three explanatory variables: ANNT, JANT, and JULT. Among these, ANNT has the strongest connection to the response variable. This result is supported by the dual PCA for the hardwood data. We applied the test data, composed of six hardwoods, to the obtained lookup table. In the range [0.1, 2.5] of GDC5, the result required further improvement. The result of the PCA for the hardwood data also suggested the use of clustering. We applied MBS to each of the three clusters, i.e., two west hardwood clusters and one east hardwood cluster. The three obtained lookup tables and the scatter diagram of the hardwoods using GDC5 and ANNT suggested the use of the lookup table for the east hardwood cluster because this lookup table covers the widest range of GDC5. In fact, we obtained improved estimation results for our test data using the lookup table for the east hardwood cluster.
As a concluding remark, we should note that the three research categories using the quantile method are mutually cooperative in analyzing the given distributional data under the common monotone property of quantiles.

Figure 1
Figure 1 illustrates a cumulative distribution function for a histogram feature value, where c1, c2, and c3 are cut points for the case m = 4, and q1, q2, and q3 are the corresponding quantile values.
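The construction in Figure 1 can be sketched in code: given a histogram's bin edges and bin probabilities, the (m + 1) quantile values q_0, ..., q_m are read off the cumulative distribution at the probabilities i/m. Linear interpolation inside each bin is our assumption about how the inverse CDF is evaluated.

```python
def quantile_vector(edges, probs, m=4):
    """Return the (m + 1) quantile values of a histogram with bin
    boundaries `edges` (length k + 1) and bin probabilities `probs`
    (length k), interpolating the cumulative distribution linearly
    inside each bin."""
    cum = [0.0]
    for p in probs:
        cum.append(cum[-1] + p)
    qs = []
    for i in range(m + 1):
        t = i / m                       # target cumulative probability
        for b in range(len(probs)):
            if t <= cum[b + 1] or b == len(probs) - 1:
                frac = 0.0 if probs[b] == 0 else (t - cum[b]) / probs[b]
                qs.append(edges[b] + frac * (edges[b + 1] - edges[b]))
                break
    return qs
```

The returned vector is non-decreasing by construction, which is the monotone property that the paper's PCA, clustering, and regression methods all rely on.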

Figure 1 .
Figure 1. Cumulative distribution function and cut point probabilities.

Figure 2 .
Figure 2. Representation of objects and bin rectangles in the quartile case.

Figure 3 .
Figure 3. A property of compactness.

Figure 4 .
Figure 4. Cumulative distribution functions and their corresponding quantile functions.

(a) Scatter plot of five features. (b) Result in the first factor plane.

Figure 5 .
Figure 5. Result of PCA for the interval-valued oil data.

Figure 6 .
Figure 6. Result of PCA for quartile case.
(a) Scatter plot of eight objects. (b) Result in the first factor plane.

Figure 7 .
Figure 7. Result of dual PCA for the oil data.

Figure 8 .
Figure 8. Scatter plot of eight features by two eigenvectors.

Figure 9 .
Figure 9. Result of PCA for the hardwood data.

Figure 11 .
Figure 11. Result of dual PCA for the hardwood data.

Figure 12 .
Figure 12. Result of HCC for oil data (five features).

Figure 13 .
Figure 13. Scatter diagram using two informative features.

Figure 14 .
Figure 14. Result of HCC for oil data using iodine value and specific gravity.

Figure 15 .
Figure 15. Result of PCA for hardwood data (quartile case).

Figure 16 .
Figure 16. Result of HCC for hardwood data (eight features).

Figure 17 .
Figure 17. Scatter diagrams for the selected informative features.

Table 2 .
Composition table of major acids.

Table 3 .
Oil data described using five interval values.

Table 4 .
The first two principal components for the oil data in Table 3.

Table 5 .
Part of the oil data by quartile representation.

Table 6 .
The first two principal components for the quartile case of the oil data.

Table 7 .
The first two principal components for dual PCA of the oil data.

Table 8 .
The original quantile values for ANNT.

Table 9 .
The first two principal components of the hardwood data.

Table 10 .
The first two principal components of the dual hardwood data.

Table 11 .
Average compactness of each feature in each clustering step.

Table 12 .
Average compactness of each feature in each clustering step.

Table 13 .
Quantile representation of oil data.
Table 15 contains the obtained lookup table, in which several intervals are composed of reduced interval values. Based on this lookup table, we can estimate the iodine value for each object by using specific gravity and freezing point. The estimation rule used here is as follows: let [a1, a2] be the value of an explanatory variable of the given object.

Table 15 .
Lookup table for oil data.

Table 18 .
Test data for the lookup table of hardwood data.

Table 19 .
Estimated result for the test data.

Table 23 .
Estimated result for the test data by lookup table for cluster C2.
