
The Lookup Table Regression Model for Histogram-Valued Symbolic Data

School of Science and Engineering, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan
Stats 2022, 5(4), 1271-1293; https://doi.org/10.3390/stats5040077
Submission received: 2 November 2022 / Revised: 23 November 2022 / Accepted: 26 November 2022 / Published: 4 December 2022

Abstract

This paper presents the Lookup Table Regression Model (LTRM) for histogram-valued symbolic data. We first transform the given symbolic data into a numerical data table by the quantile method. Then, under the selected response variable, we apply the Monotone Blocks Segmentation (MBS) to the obtained numerical data table. If the selected response variable and some remaining explanatory variable(s) organize a monotone structure, the MBS generates a Lookup Table composed of interval values. For a given object, we search for the nearest value of an explanatory variable, and the corresponding value of the response variable becomes the estimated value. If the response variable and the explanatory variable(s) are covariate but follow a non-monotonic structure, we need to divide the given data into several monotone substructures. For this purpose, we apply hierarchical conceptual clustering to the given data, and we obtain Multiple Lookup Tables by applying the MBS to each of the substructures. We show the usefulness of the proposed method using an artificial data set and real data sets.

1. Introduction

Regression models have been extended to various complex data types. Bock and Diday [1], Billard and Diday [2,3], and Diday [4] present various methods of Symbolic Data Analysis (SDA) for complex data types, including regression models for histogram-valued data. Additional linear regression models for histogram-valued variables were developed by Irpino and Verde [5,6], Lima Neto and De Carvalho [7,8,9], and Dias and Brito [10,11]. In these studies, several types of functional forms between the response variable and the explanatory variable(s) have been proposed under appropriately defined optimality criteria.
The author and colleagues previously proposed a generalized Minkowski metric to analyze mixed feature-type data [12] and a feature selection method to detect geometrically thin covariate structures embedded in a given data set [13]. This paper describes the Lookup Table Regression Model (LTRM) for histogram-valued symbolic data (Ichino [14]). We apply the quantile method (Ichino [15]) to the given symbolic data table of size (N objects) × (p variables), and we represent each object by (m + 1) p-dimensional numerical vectors, called the quantile vectors, for a preselected integer m. The integer m controls the granularity of the descriptions of symbolic objects. To the new numerical data table of size {N × (m + 1)} × p, we apply the algorithm called the Monotone Blocks Segmentation (MBS). The MBS interchanges the N × (m + 1) rows according to the values of the selected response variable, from the smallest to the largest. For each of the remaining p − 1 explanatory variables, i.e., columns, the MBS segments the feature values into blocks so that the generated blocks, i.e., interval values, satisfy the monotone property. The MBS discards columns that have only a single block. Therefore, the MBS detects monotone covariate relations existing between the response variable and the explanatory variable(s). Finally, the MBS obtains the Lookup Table of size N’ × p’, where N’ < N × (m + 1) and p’ < p. Each element of the table is an interval value corresponding to a segmented block. We realize the interval-valued estimation rule for the response variable by searching for the “nearest element” in the Lookup Table.
The structure of this paper is as follows: Section 2 describes the quantile method to represent multi-dimensional histogram-valued data. Section 3 proposes the basic procedure of the MBS and the LTRM using the Fats and Oils data [12,14,15,16]. In Section 4, we illustrate that the MBS does not work well for non-monotonic data structures using artificially generated data, and we introduce hierarchical conceptual clustering to obtain the Multi-Lookup Table Regression Model (M-LTRM); we then show the usefulness of the M-LTRM on a real data set. Section 5 discusses the obtained results, and Section 6 summarizes concluding remarks.

2. Representation of Objects by Quantile Vectors and Bin Rectangles

Let U = {ωi, i = 1, 2, ..., N} be the set of given objects, and let the feature variables Fj, j = 1, 2, ..., p, describe each object. Let Dj be the domain of feature Fj, j = 1, 2, ..., p. Then, the feature space is defined by
D(p)= D1 × D2 × ··· × Dp.
Each element of D(p) is represented by
E = E1 × E2 × ··· × Ep,
where Ej is the feature value of Fj, j = 1, 2, ..., p.

2.1. Histogram-Valued Feature

For each object ωi, let each feature Fj be represented by a histogram value:
Eij = {[aijk, aij(k+1)), pijk; k = 1, 2, ..., nij},
where pij1 + pij2 + … + pijnij = 1 and nij is the number of bins that compose the histogram Eij.
Therefore, the Cartesian product of p histogram values represents object ωi:
Ei = Ei1 × Ei2 × ··· × Eip.
Since the interval-valued feature is a special case of the histogram feature with nij = 1 and pij1 = 1, the representation (3) reduces to an interval:
Eij = [aij1, aij2).
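As a running representation for the sketches in this paper, the following minimal Python fragment fixes one way to hold a histogram value Eij in code; the class name HistogramValue and its fields are illustrative choices, not notation from the paper.

```python
from dataclasses import dataclass

@dataclass
class HistogramValue:
    """One histogram feature value E_ij: bins [a_k, a_{k+1}) with probabilities p_k."""
    edges: list[float]   # n + 1 increasing bin edges a_1, ..., a_{n+1}
    probs: list[float]   # n bin probabilities summing to 1

    def __post_init__(self):
        assert len(self.edges) == len(self.probs) + 1
        assert abs(sum(self.probs) - 1.0) < 1e-9

# An interval value is the one-bin special case, e.g. the Iodine value of Linseed:
iodine_linseed = HistogramValue(edges=[170.0, 204.0], probs=[1.0])
```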

2.2. Representation of Histograms by Common Number of Quantiles

Let ωi ∈ U be the given object, and let Eij be a histogram value in (3) for a feature Fj. Then, under the assumption that the nij bins have uniform distributions, we define the cumulative distribution function Fij(x) of the histogram (3) as:
Fij(x) = 0 for x ≤ aij1;
Fij(x) = pij1(x − aij1)/(aij2 − aij1) for aij1 ≤ x < aij2;
Fij(x) = Fij(aij2) + pij2(x − aij2)/(aij3 − aij2) for aij2 ≤ x < aij3;
······
Fij(x) = Fij(aijnij) + pijnij(x − aijnij)/(aij(nij+1) − aijnij) for aijnij ≤ x < aij(nij+1);
Fij(x) = 1 for aij(nij+1) ≤ x.
Figure 1 illustrates a cumulative distribution function for a histogram feature value, where c1, c2, and c3 are cut points for the case m = 4, and q1, q2, and q3 are the corresponding quantile values.
Our general procedure to obtain a common representation for histogram-valued data is as follows.
(1) We choose a common number m of quantiles.
(2) Let c1, c2, ..., cm−1 be preselected cut points dividing the range of the distribution function Fij(x) into continuous intervals, i.e., bins, with the probabilities preselected for the cut points.
(3) For the given cut points c1, c2, ..., cm−1, we obtain the corresponding quantiles by solving the following equations:
Fij(xij0) = 0, (i.e., xij0 = aij1)
Fij(xij1) = c1, Fij(xij2) = c2, ..., Fij(xij(m−1)) = cm−1, and
Fij(xijm) = 1, (i.e., xijm = aijnij+1).
Therefore, we describe each object ωi ∈ U for each feature Fj using an (m + 1)-tuple:
(xij0, xij1, xij2, …, xij(m−1), xijm), j = 1, 2, …, p,
and the corresponding histogram using:
Eij = {[xijk, xij(k+1)), (ck+1ck); k = 0, 1, ..., m−1}, j = 1, 2, ..., p,
where we assume that c0 = 0 and cm = 1. In (7), (ck+1ck), k = 0, 1, ..., m−1, denote the bin probabilities given by the preselected cut point probabilities c1, c2, ..., cm−1. In the quartile case again, m = 4 and c1 = 1/4, c2 = 2/4, and c3 = 3/4, and the four bins, [xij0, xij1), [xij1, xij2), [xij2, xij3), and [xij3, xij4), have the same probability 1/4.
It should be noted that the numbers of bins of the given histograms are mutually different in general. However, we can obtain (m + 1)-tuples as a common representation for all histograms by selecting an integer m and a set of cut points.
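The quantile computation above amounts to inverting the piecewise-linear CDF. A minimal sketch, reusing the illustrative HistogramValue class from Section 2.1:

```python
def histogram_quantiles(h: HistogramValue, cuts: list[float]) -> list[float]:
    """Return the (m + 1)-tuple (x_0, x_1, ..., x_m) for cut points
    0 = c_0 < c_1 < ... < c_{m-1} < c_m = 1, inverting the piecewise-linear
    CDF built from the bins (uniform distribution assumed inside each bin)."""
    quantiles = []
    for c in [0.0] + cuts + [1.0]:
        cum = 0.0
        x = h.edges[-1]                      # c = 1 maps to the last edge
        for k, p in enumerate(h.probs):
            if cum + p >= c:                 # the quantile falls in bin k
                frac = 0.0 if p == 0 else (c - cum) / p
                x = h.edges[k] + frac * (h.edges[k + 1] - h.edges[k])
                break
            cum += p
        quantiles.append(x)
    return quantiles

# Quartile representation (m = 4): c_1 = 1/4, c_2 = 2/4, c_3 = 3/4
print(histogram_quantiles(iodine_linseed, [0.25, 0.5, 0.75]))
# -> [170.0, 178.5, 187.0, 195.5, 204.0]
```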

2.3. Quantile Vectors and Bin-Rectangles

For each object ωi ∈ U, we define (m + 1) p-dimensional numerical vectors, called the quantile vectors, as follows:
xik = (xi1k, xi2k, …, xipk), k = 0, 1, ..., m.
We call xi0 and xim the minimum quantile vector and the maximum quantile vector, respectively. Therefore, the m + 1 quantile vectors {xi0, xi1, ..., xim} in Rp, together with the cut point probabilities, describe each object ωi ∈ U.
The components of the m + 1 quantile vectors in (8) for object ωi ∈ U satisfy the inequalities:
xij0 ≤ xij1 ≤ xij2 ≤ ··· ≤ xij(m−1) ≤ xijm, j = 1, 2, …, p.
Therefore, the m + 1 quantile vectors in (8) for object ωi ∈ U satisfy the monotone property:
xi0 ≤ xi1 ≤ ··· ≤ xim.
For the series of quantile vectors xi0, xi1, ..., xim of object ωi ∈ U, we define a series of m p-dimensional rectangles, called bin-rectangles, spanned by the adjacent quantile vectors xik and xi(k+1), k = 0, 1, ..., m−1, as follows:
B(xik, xi(k+1)) = [xi1k, xi1(k+1)] × [xi2k, xi2(k+1)] × ··· × [xipk, xip(k+1)], k = 0, 1, ..., m−1.
Figure 2 illustrates two objects represented by two-dimensional bin-rectangles in the quartile case. Since a p-dimensional rectangle in Rp is equivalent to a conjunctive logical expression, we also use the term concept for a rectangular expression in the space Rp.
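Building the quantile vectors and bin-rectangles from the per-feature quantiles is a transpose-and-pair operation; a short sketch under the same assumptions as above:

```python
def quantile_vectors(obj: list[HistogramValue], cuts: list[float]) -> list[list[float]]:
    """Stack per-feature quantiles into m + 1 p-dimensional quantile vectors
    x_i0 <= x_i1 <= ... <= x_im."""
    per_feature = [histogram_quantiles(h, cuts) for h in obj]   # p rows of m + 1 values
    return [list(col) for col in zip(*per_feature)]             # transpose -> m + 1 vectors

def bin_rectangles(qvecs: list[list[float]]) -> list:
    """The k-th bin-rectangle B(x_ik, x_i(k+1)) is the box spanned coordinate-wise
    by the adjacent quantile vectors x_ik and x_i(k+1)."""
    return [list(zip(qvecs[k], qvecs[k + 1])) for k in range(len(qvecs) - 1)]
```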

3. Monotone Blocks Segmentation (MBS) and Lookup Table Regression Model (LTRM)

We use the Fats and Oils data in Table 1 to describe the basic ideas of the MBS and the LTRM. Each object is described by four interval-valued features, F1, F2, F3, and F4. For each object, the Major acids feature, F5, takes a set of acids from nine possible acids. We assigned a rank value to each acid according to its occurrence frequency over the eight objects as follows.
{Lu: 1, A: 2, C: 2, Ln: 4, M: 5, S: 6, P: 7, L: 8, O: 9}
For each object, we assign an interval value given by the minimum rank value and the maximum rank value according to (12). For example, Linseed takes the minimum rank value 4 and the maximum rank value 9. Hence, the interval value is [4, 9].
Since an interval is a special histogram composed of one bin, we split each object into two sub-objects, the minimum sub-object and the maximum sub-object, described by five-dimensional quantile vectors, i.e., the minimum quantile vector and the maximum quantile vector. Table 2 shows the obtained quantile representation of our numerical data of size {(8 objects) × 2} × (5 features).
In this example, we select the Iodine value as the response variable, and the remaining four features are the explanatory variables. In Table 3, we interchanged the given sixteen quantile vectors according to the Iodine value, from the minimum value 40 to the maximum value 208. Then, we segment each column into blocks so that the generated blocks satisfy the monotone property. We use colors to show different blocks. A sketch of this segmentation step follows.
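The exact block-forming rule of the MBS is given in prose above; the following minimal Python sketch implements one consistent reading of it, under which a column may be segmented in either the increasing or the decreasing direction (the Freezing point column of Table 3, for instance, decreases as the Iodine value grows). The function names are ours.

```python
import numpy as np

def monotone_blocks(column: np.ndarray) -> list:
    """Segment a column (rows already ordered by the response) at every index
    where all earlier values lie strictly below all later values (increasing
    direction) or strictly above them (decreasing direction); the direction
    giving more blocks is kept."""
    n = len(column)
    pre_max = np.maximum.accumulate(column)
    pre_min = np.minimum.accumulate(column)
    suf_min = np.minimum.accumulate(column[::-1])[::-1]
    suf_max = np.maximum.accumulate(column[::-1])[::-1]
    up = [i + 1 for i in range(n - 1) if pre_max[i] < suf_min[i + 1]]
    down = [i + 1 for i in range(n - 1) if pre_min[i] > suf_max[i + 1]]
    cuts = up if len(up) >= len(down) else down
    bounds = [0] + cuts + [n]
    return list(zip(bounds[:-1], bounds[1:]))

def mbs(table: np.ndarray, response: int) -> dict:
    """Monotone Blocks Segmentation: sort the quantile rows by the response
    column, segment every other column into monotone blocks, and drop columns
    that yield only a single block. Each kept block is stored as
    (explanatory min, explanatory max, response min, response max)."""
    rows = table[np.argsort(table[:, response])]
    lookup = {}
    for j in range(table.shape[1]):
        if j == response:
            continue
        blocks = monotone_blocks(rows[:, j])
        if len(blocks) > 1:          # single-block columns carry no information
            lookup[j] = [(rows[a:b, j].min(), rows[a:b, j].max(),
                          rows[a:b, response].min(), rows[a:b, response].max())
                         for a, b in blocks]
    return lookup
```

On the Fats and Oils rows, this reading reproduces the Specific gravity blocks of Table 4 (e.g., [0.858, 0.870] for Iodine value [40, 77]) and the two decreasing Freezing point blocks.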
The feature most covariate with the response variable is the Specific gravity. The Saponification value and Major acids consist of single blocks, so we omit these features from the explanatory variables. Figure 3 is the scatter plot of the eight interval-valued objects for the most covariate feature variables: Iodine value and Specific gravity. The given eight objects are clearly placed along a monotonic curve.
Table 4 is the obtained Lookup Table, where several elements are reduced (merged) interval values. Based on this Lookup Table, we can estimate the Iodine value for each object by using the Specific gravity and the Freezing point. For example, the Specific gravity of Linseed is [0.930, 0.935]. The minimum value 0.930 suggests the minimum value 170 of the response value [170, 192]. On the other hand, the maximum value 0.935 suggests the maximum value 204 of the response value [204, 204]. As a result, the estimated Iodine value is [170, 204]. The Freezing point of Linseed is [−27, −18] and is included in the interval [−27, 6]. Therefore, the Lookup Table suggests the estimated value [79, 208].
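The nearest-element estimation rule can be sketched as follows, on the lookup entries produced by the mbs sketch above; the tie-breaking inside min is an implementation choice:

```python
def estimate(lookup_col: list, x: float) -> tuple:
    """Interval estimate of the response from one quantile value x of an
    explanatory variable: take the block containing x, or the nearest block,
    and return that block's response interval."""
    def distance(block):
        lo, hi, _, _ = block
        return 0.0 if lo <= x <= hi else min(abs(x - lo), abs(x - hi))
    _, _, rlo, rhi = min(lookup_col, key=distance)
    return rlo, rhi

# Linseed's Specific gravity [0.930, 0.935] against Table 4:
# estimate(col, 0.930) -> (170, 192) and estimate(col, 0.935) -> (204, 204),
# which combine to the interval estimate [170, 204] described in the text.
```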
Table 5 shows the estimated values for the Fats and Oils data. We should note that most of the estimated interval values include the given Iodine values; only the estimated interval for Sesame is shorter than the actual value. The estimation by Specific gravity yields finer results than the estimation by Freezing point.
In order to check the ability of our LTRM, we use the test data of further fats and oils in Table 6. Table 7 is the estimated result based on our Lookup Table in Table 4. The estimation accuracy for plant oils is better than that for animal fats. These results suggest the following possibilities.
  • Our Fats and Oils data is composed of six plant oils and only two animal fats. By increasing the number of sample objects, we may obtain a more accurate Lookup Table.
  • On the other hand, increasing the number of sample objects may complicate the covariate relations between the response variable and the explanatory variables. For example, if we separate the plant oils and the animal fats, we may have better Lookup Tables for the respective categories.
Based on these possibilities, we discuss the Multi-Lookup Table Regression Model (M-LTRM) using hierarchical conceptual clustering in the next section.

4. Multi-Lookup Table Regression Model (M-LTRM)

4.1. Illustration by Oval Data

We use the artificial data in Figure 4 and Table 8. This data set was used to check the capability of the unsupervised feature selection method using hierarchical conceptual clustering in [16]. Sixteen small rectangles organize an oval structure in the first two features, F1 and F2, as shown in Figure 4. In this figure, the oval structure is organized by four differently colored monotone substructures. For each of the sixteen objects, we transform the feature values of F1 and F2 to 0–1 normalized interval values. Then, as the feature values of F3, F4, and F5, we assign three randomly generated interval values in the unit interval [0, 1] to each of the sixteen objects. Table 8 summarizes the sixteen objects described by five interval-valued features.
We rewrote Table 8 in the form {(16 objects) × (2 quantile values)} × (5 features), and then applied the MBS to the data table of size (32 quantile values) × (5 features) under the response variable F1. Table 9 shows the result of our MBS for the Oval data, where each explanatory variable has only one block. This result suggests dividing the oval structure into several monotone segments. For this purpose, we use hierarchical conceptual clustering in the next subsection.

4.2. Hierarchical Conceptual Clustering

As the measure of similarity between objects and/or clusters described by histogram-valued features, we use the compactness proposed in [16]. Under the assumption of a common number of quantiles m and equal bin probabilities, the compactness C(ωi, ωl) defines the concept size spanned by two objects ωi and ωl in the feature space Rp and has the following properties:
(1) 0 ≤ C(ωi, ωl) ≤ 1;
(2) C(ωi, ωi), C(ωl, ωl) ≤ C(ωi, ωl);
(3) C(ωi, ωl) = C(ωl, ωi);
(4) C(ωi, ωr) ≤ C(ωi, ωl) + C(ωl, ωr) may not hold in general.
In the Oval data in Figure 4, C(1, 1) defines the concept size of rectangle 1, and C(1, 2) defines the concept size of the minimum rectangle including rectangles 1 and 2. Therefore, the monotone property (2) is clear.
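For interval-valued descriptions, one simple realization of the compactness is the 0–1 normalized average side length of the Cartesian join rectangle. The exact definition in [16] works bin-wise on the quantile representation, so the sketch below should be read as an assumption-laden simplification, not the definitive formula:

```python
def cartesian_join(e1: list, e2: list) -> list:
    """Smallest interval (per feature) covering both descriptions: the join concept."""
    return [(min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(e1, e2)]

def compactness(e1: list, e2: list, domains: list) -> float:
    """Concept size of the join of two interval descriptions, realized here as
    the average side length of the join rectangle, each side normalized by its
    feature domain. Properties (1)-(3) hold; the triangle inequality (4) may
    fail, as noted above."""
    join = cartesian_join(e1, e2)
    spans = [(b - a) / (hi - lo) for (a, b), (lo, hi) in zip(join, domains)]
    return sum(spans) / len(spans)
```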
Let U = {ω1, ω2, ..., ωN} be the given set of objects, and let each object ωi be described using p histograms in the feature space Rp as Ei = Ei1 × Ei2 × ··· × Eip. We assume that all histogram values for all objects have the same number of quantiles m and the same bin probabilities (Algorithm 1).
Algorithm 1 (Hierarchical Conceptual Clustering (HCC) [16])
Step 1: For each pair of objects ωi and ωl in U, evaluate the compactness C(ωi, ωl) and find the pair ωq and ωr that minimizes the compactness.
Step 2: Add the merged concept ωqr = {ωq, ωr} to U and delete ωq and ωr from U. The merged concept ωqr is again described using p histograms as Eqr = Eqr1 × Eqr2 × ··· × Eqrp by the Cartesian join operation defined in [16], under the assumption of m quantiles and equal bin probabilities.
Step 3: Repeat Step 1 and Step 2 until U includes only one concept, i.e., the whole concept.
It should be noted that minimizing the compactness between objects is equivalent to maximizing the dissimilarity between the merged concept of the objects and the whole concept. In this sense, the compactness plays not only the role of a similarity measure between objects and/or clusters but also the role of a cluster quality measure.
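Algorithm 1 then reduces to a short agglomerative loop over the compactness sketched above; again a sketch under the interval-valued simplification:

```python
def hcc(objects: list, domains: list) -> list:
    """Algorithm 1 as a sketch: repeatedly merge the pair of current concepts
    with the smallest compactness; returns the merge history as
    (member set, concept size) pairs, from which a dendrogram can be drawn."""
    concepts = [({i + 1}, e) for i, e in enumerate(objects)]
    history = []
    while len(concepts) > 1:
        pairs = [(i, j) for i in range(len(concepts))
                 for j in range(i + 1, len(concepts))]
        i, j = min(pairs, key=lambda ij: compactness(concepts[ij[0]][1],
                                                     concepts[ij[1]][1], domains))
        members = concepts[i][0] | concepts[j][0]
        merged = cartesian_join(concepts[i][1], concepts[j][1])
        history.append((members, compactness(merged, merged, domains)))
        concepts = [c for k, c in enumerate(concepts) if k not in (i, j)] + [(members, merged)]
    return history
```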
The Oval data is a special histogram-valued data set in which each feature value is an interval taken with probability one. By applying the HCC to the Oval data, we obtain the dendrogram in Figure 5. We cut the dendrogram at the concept size 0.8 to find the least number of monotone substructures. As a result, we obtained four clusters: (a) = (1, 2, 3), (b) = (4, 5, 6), (c) = (7, 8, 9, 10, 11), and (d) = (12, 13, 14, 15, 16). Table 10 shows the results of the MBS for these four clusters, and Table 11 shows the final Lookup Tables for the Oval data. In this example, the resolution of the Lookup Tables for clusters (b) and (d) is better than that for (a) and (c). We can estimate the value of the response variable F1 for each object by finding the nearest values of the explanatory variable F2 in these four Lookup Tables, as sketched below.
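Cutting the merge history at a concept size (0.8 here) and running the MBS per cluster yields the Multiple Lookup Tables; a sketch building on the functions above:

```python
def cut(history: list, n_objects: int, threshold: float = 0.8) -> list:
    """Read the clusters off the HCC merge history: every merge whose concept
    size stays below the threshold is kept; singletons that never took part in
    a kept merge remain clusters of their own."""
    clusters = [{i + 1} for i in range(n_objects)]
    for members, size in history:
        if size > threshold:
            break
        clusters = [c for c in clusters if not c <= members] + [members]
    return clusters

def multi_lookup_tables(cluster_rows: dict, table: np.ndarray, response: int) -> dict:
    """M-LTRM: one Lookup Table per monotone substructure; `cluster_rows` maps
    a cluster label to the row indices of its quantile vectors in `table`."""
    return {name: mbs(table[rows], response) for name, rows in cluster_rows.items()}
```

For the Oval data, cut(history, 16, 0.8) would return the four member sets (a)–(d), after which multi_lookup_tables produces the four Lookup Tables of Table 11 (each object contributing its two quantile rows).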
From the results for the Fats and Oils data and the Oval artificial data, we observe the following facts.
  • The MBS is able to detect a covariate relation between the response variable and explanatory variable(s) if the covariate relation has a monotone structure. In other words, the MBS has a feature selection capability when the target covariate relation has a monotone structure.
  • On the other hand, the unsupervised feature selection method using hierarchical conceptual clustering in [16] can detect “geometrically thin structures” embedded in the given histogram-valued data. The covariation of F1 and F2 is found by evaluating the compactness for each of the five features in each step of clustering [16]. Therefore, the compactness also plays the role of a feature effectiveness criterion.

4.3. M-LTRM for the Hardwood Data

The data is selected from the US Geological Survey (Climate-Vegetation Atlas of North America) [17]. The number of objects is ten and the number of features is eight.
Table 12 shows the quantile values for the selected ten hardwoods under the feature (Mean) Annual Temperature (ANNT). We selected the following eight features to describe the objects (hardwoods). The data formats for the other features F2–F8 are the same as in Table 12.
  • F1: Annual Temperature (ANNT) (°C);
  • F2: January Temperature (JANT) (°C);
  • F3: July Temperature (JULT) (°C);
  • F4: Annual Precipitation (ANNP) (mm);
  • F5: January Precipitation (JANP) (mm);
  • F6: July Precipitation (JULP) (mm);
  • F7: Growing Degree Days on 5 °C base ×1000 (GDC5);
  • F8: Moisture Index (MITM).
Our hardwoods data is a numerical data table of size {(10 objects) × (7 quantile values)} × (8 features). We first apply the quantile method of principal component analysis (PCA) in [15] to our data. Table 13 shows the obtained first two principal components, and Figure 6 shows the mutual positions of the eight features by two eigenvectors. We have two groups, {ANNP, JANP, JULP, and MITM} and {ANNT, JANT, JULT, and GDC5}. Figure 7 shows the mutual positions of the ten objects in the first factor plane. Each hardwood is represented by six line segments connecting the minimum quantile vector to the maximum quantile vector.
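The quantile method of PCA itself is developed in [15]; the following is only a schematic reconstruction under our reading, assuming the stacked quantile-vector matrix and a Spearman rank correlation, to show how the factor-plane plots of Figures 6 and 7 arise:

```python
import numpy as np

def quantile_pca(data: np.ndarray):
    """Sketch: rank-transform each column (Spearman), eigendecompose the rank
    correlation matrix, and project every quantile vector onto the first
    factor plane. `data` is the {N x (m + 1)} x p matrix of quantile vectors."""
    ranks = np.argsort(np.argsort(data, axis=0), axis=0).astype(float)
    z = (ranks - ranks.mean(axis=0)) / ranks.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)          # Spearman correlation of the data
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = z @ eigvec[:, :2]                   # coordinates in the first factor plane
    return scores, eigvec[:, :2], eigval[:2] / eigval.sum()

# Plotting the m + 1 score points of one object in order and joining consecutive
# points gives the connected line segments of Figures 7 and 9 (six segments, m = 6).
```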
We should note the following facts for the results of PCA.
  • The first principal component is the size factor and the second is the shape factor, and the sum of their contribution ratios is very high.
  • East hardwoods show similar line graphs, and their maximum quantile vectors occupy mutually near positions.
  • West hardwoods are separated into two groups: {ACER WEST and ALNUS WEST} and {FRAXINUS WEST, JUGLANS WEST, and QUERCUS WEST}. The last line segments are very long, especially for ACER WEST and ALNUS WEST.
We exchange the roles of “objects” and “features” in our hardwoods data. Then, we apply the quantile method of PCA to the dual data of the form {(8 features) × (7 quantile values)} × (10 objects). Table 14 shows the first two principal components for the dual data, and Figure 8 shows the mutual positions of the ten hardwoods by two eigenvectors. West hardwoods are again separated into two different groups. Figure 9 shows the mutual positions of the eight features in the first factor plane. Each feature is represented by a series of six line segments connecting the minimum quantile vector to the maximum quantile vector.
We should note the following facts for the results of our dual-PCA.
  • The first principal component is the size factor and the second is the shape factor, and the sum of their contribution ratios is very high.
  • We have two groups, {ANNP, JANP, JULP, and MITM} and {ANNT, JANT, JULT, and GDC5}. MITM and GDC5 have very long line graphs compared to the other members of each group.
We assume that GDC5 is the response variable and the seven other variables are explanatory variables. Then, we applied the MBS to the data of size (10 × 7 quantile values) × (8 features), and we obtained the result in Table 15. We used colors to show different blocks. The MBS selected only ANNT, JANT, and JULT as explanatory variables.
Table 16 shows the obtained Lookup Table for the Hardwood data. In this table, ANNT shows the strongest connection to the response variable GDC5. We use the test data in Table 17 to check the estimation ability of our Lookup Table. Table 18 summarizes the estimated results for our test data. In the range [0.1, 2.5] of GDC5, the result requires further improvement; the PCA result in Figure 7 suggests the use of clustering.
Under the assumption of quartiles, we applied the HCC to the Hardwood data and obtained the result in Figure 10. By cutting the dendrogram at the concept size 0.8, we have four clusters: (FW, JW, QW), (AcE, AlE, FE, JE, QE), (AcW), and (AlW). The concept size for the East Hardwood cluster is 0.671. By the addition of AcW to the East Hardwood cluster, the concept size increases substantially, to 0.847. However, the further addition of AlW shows only a small increase, to 0.935. This fact suggests that AcW and AlW are mutually similar. The PCA result in Figure 7 also supports the cluster (AcW, AlW). Therefore, we suppose three clusters, C1 = (AcW, AlW), C2 = (AcE, AlE, FE, JE, QE), and C3 = (FW, JW, QW), in the following discussions.
From the viewpoint of the unsupervised feature selection in [16], the most informative features are ANNP and then JULP in clustering steps 1–7, and the feature JANT is important to separate the cluster (FW, JW, QW) from the other large cluster in step 8. Figure 11 shows the mutual positions of the ten hardwoods in the plane of ANNP and JANT, and it is very similar to the result in Figure 7.
We applied the MBS to each of three clusters C1 = (AcW, AlW), C2 = (AcE, AlE, FE, JE, QE) and C3 = (FW, JW, QW). Table 19, Table 20 and Table 21 are the Lookup Tables for these three clusters. In Table 19, JULT contributes in the range [0.1, 1.1] of GDC5. On the other hand, in Table 20 and Table 21, ANNT is strongly connected to the whole range of GDC5.
Figure 12 shows the scatter diagram of the Hardwood data for ANNT and GDC5, where all hardwoods lie in a narrow region. We use the estimation of GDC5 by ANNT for cluster C2, since the Lookup Table for C2 covers the widest range of ANNT among the three Lookup Tables. Figure 13 shows the graph of GDC5 against ANNT under cluster C2, and Table 22 shows the estimation result for the test data. We obtained a better estimation result than that in Table 18.

5. Discussion

The quantile method is a unified quantification method for histogram-valued symbolic data. When each of N objects is described by p histogram-valued features, we select a common integer m and represent each of the p histogram-valued features by (m + 1) quantile values. As a result, we have a numerical data table of size {N × (m + 1) quantiles} × (p features). Based on this type of numerical data table, we proposed the Lookup Table Regression Model (LTRM) using the Monotone Blocks Segmentation (MBS). Then, we extended the LTRM to the Multi-Lookup Table Regression Model (M-LTRM) using the hierarchical conceptual clustering in [16]. In the following, we discuss our results.
  • As a mixed feature-type symbolic data set, we used the Fats and Oils data. This data set consists of eight objects, two fats and six plant oils, described by four interval-valued features and one multinomial feature. By the quantile method, we transformed it into a numerical data table of size (8 × 2 quantiles) × (5 features). Then, we applied our MBS to this data under the assumption that the response variable is the Iodine value. The MBS selected Specific gravity as the feature most covariate with the Iodine value, followed by the Freezing point. Using the obtained Lookup Table, we checked the estimation of the Iodine value of each given object using Specific gravity and Freezing point. The estimated results are reasonable for the given fats and oils. We also checked our Lookup Table with a set of independent fats and oils. The result for the test samples suggests the use of clustering and the Multi-Lookup Table Regression Model (M-LTRM) to improve the estimation accuracy.
  • The MBS works well to generate a meaningful Lookup Table when the response variable and the explanatory variable(s) follow a monotone structure. Therefore, if the response variable and the explanatory variable(s) follow a non-monotonic data structure, we have to divide the given data structure into several monotone substructures. We applied the hierarchical conceptual clustering in [16] to the Oval artificial data, and we could obtain four monotone substructures and the corresponding Lookup Tables.
  • As a general histogram-valued data set, we used the Hardwood data of size {(10 objects) × (7 quantiles)} × (8 features). We applied the quantile method of Spearman PCA to this data. As a monotone structure, the first factor plane draws three streams, C1 = (AcW, AlW), C2 = (AcE, AlE, FE, JE, QE), and C3 = (FW, JW, QW), with a very high contribution ratio. We also applied the Spearman PCA to the dual data of size {(8 features) × (7 quantiles)} × (10 objects), and we obtained a monotone structure composed of two groups, (ANNP, JANP, JULP, MITM) and (ANNT, JANT, JULT, GDC5), in the first factor plane with a very high contribution ratio. We applied the MBS to the Hardwood data under the assumption that GDC5 is the response variable, and we obtained the Lookup Table with the explanatory variables ANNT, JANT, and JULT. Therefore, our MBS has the ability of supervised feature selection.
  • For a further improvement of the Lookup Table, we applied the hierarchical conceptual clustering to the Hardwood data and obtained again the three clusters C1, C2, and C3. From the viewpoint of unsupervised feature selection by the hierarchical conceptual clustering, the features ANNP and then JULP are informative during clustering steps 1–7, and JANT is important to separate cluster C3 from clusters C1 and C2. In fact, the scatter plot of the ten hardwoods in the plane of ANNP and JANT is very similar to the result of the PCA for the Hardwood data. We applied the MBS again to each of the clusters C1, C2, and C3, and we obtained three different Lookup Tables. As a result, the Lookup Table for C2 has the highest resolution for estimating GDC5 using ANNT, and it achieves a better estimation result for our test data.

6. Conclusions

This paper proposed the Lookup Table Regression Model (LTRM) and Multi-Lookup Table Regression Model (M-LTRM) for histogram-valued symbolic data. The proposed models are very different from the traditional functional models developed for histogram-valued symbolic data.
The Monotone Blocks Segmentation (MBS) is simple but effective for detecting covariate explanatory variable(s) for the selected response variable and for obtaining the Lookup Table. For a given object, the LTRM estimates each quantile value of the selected response variable by finding the nearest quantile value of an explanatory variable in the Lookup Table. We also showed that the quantile method of symbolic PCA is useful for detecting monotone structures embedded in multidimensional histogram-valued symbolic data. Furthermore, the dual-PCA is useful for detecting covariate explanatory variables for the selected response variable.
When the quantile method of symbolic PCA does not work well for the given histogram-valued symbolic data, the MBS also fails to produce a useful Lookup Table. In such general cases, hierarchical conceptual clustering (HCC) becomes useful. By the HCC, we may divide the given data set into several sub-data sets that have monotone structures. Then, by applying the MBS to each monotone substructure, we obtain the Multi-Lookup Table Regression Model, which is more flexible than the LTRM.

Funding

This work was supported by JSPS KAKENHI (Grants-in-Aid for Scientific Research) Grant Number 25330268.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

Not Applicable.

Acknowledgments

The author thanks Kadri Umbleja for her collaboration.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Bock, H.-H.; Diday, E. Analysis of Symbolic Data; Springer: Berlin/Heidelberg, Germany, 2000.
  2. Billard, L.; Diday, E. Symbolic Data Analysis: Conceptual Statistics and Data Mining; Wiley: Chichester, UK, 2007.
  3. Billard, L.; Diday, E. Regression analysis for interval-valued data. In Data Analysis, Classification and Related Methods, Proceedings of the Conference of the International Federation of Classification Societies (IFCS’00); Springer: Berlin/Heidelberg, Germany, 2000; pp. 347–369.
  4. Diday, E. Thinking by classes in data science: The symbolic data analysis paradigm. WIREs Comput. Stat. 2016, 8, 172–205.
  5. Verde, R.; Irpino, A. Ordinary least squares for histogram data based on Wasserstein distance. In Proceedings of COMPSTAT’2010, Paris, France, 22–27 August 2010; Lechevallier, Y., Saporta, G., Eds.; Physica-Verlag: Heidelberg, Germany, 2010; pp. 581–589.
  6. Irpino, A.; Verde, R. Linear regression for numeric symbolic variables: Ordinary least squares approach based on Wasserstein distance. Adv. Data Anal. Classif. 2015, 9, 81–106.
  7. Lima Neto, E.A.; De Carvalho, F.A.T. Center and range method for fitting a linear regression model for symbolic interval data. Comput. Stat. Data Anal. 2008, 52, 1500–1515.
  8. Lima Neto, E.A.; De Carvalho, F.A.T. Constrained linear regression models for symbolic interval-valued variables. Comput. Stat. Data Anal. 2010, 54, 333–347.
  9. Lima Neto, E.A.; Cordeiro, M.; De Carvalho, F.A.T. Bivariate symbolic regression models for interval-valued variables. J. Stat. Comput. Simul. 2011, 81, 1727–1744.
  10. Dias, S.; Brito, P. Linear regression model with histogram-valued variables. Stat. Anal. Data Min. 2015, 8, 75–113.
  11. Dias, S.; Brito, P. (Eds.) Analysis of Distributional Data; CRC Press: Boca Raton, FL, USA, 2022.
  12. Ichino, M.; Yaguchi, H. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst. Man Cybern. 1994, 24, 698–708.
  13. Ono, Y.; Ichino, M. A new feature selection method based on geometrical thickness. In Proceedings of KESDA’98, Luxembourg, 27–28 April 1998; Volume 1, pp. 19–38.
  14. Ichino, M. The lookup table regression model for symbolic data. In Proceedings of the Data Sciences Workshop, Paris-Dauphine University, Paris, France, 12–13 November 2015.
  15. Ichino, M. The quantile method of symbolic principal component analysis. Stat. Anal. Data Min. 2011, 4, 184–198.
  16. Ichino, M.; Umbleja, K.; Yaguchi, H. Unsupervised feature selection for histogram-valued symbolic data using hierarchical conceptual clustering. Stats 2021, 4, 359–384.
  17. Histogram Data by the U.S. Geological Survey, Climate-Vegetation Atlas of North America. Available online: http://pubs.usgs.gov/pp/p1650-b/ (accessed on 11 November 2010).
Figure 1. Cumulative distribution function and cut point probabilities.
Figure 2. Representation of objects by bin-rectangle in the quartile case.
Figure 3. Scatter plot of eight objects for the most covariate variables.
Figure 4. Oval data.
Figure 5. Dendrogram using HCC for Oval artificial data in Table 8.
Figure 6. Scatter plot of eight features by two eigen vectors.
Figure 7. Results of PCA for hardwoods data.
Figure 8. Scatter plot of ten hardwoods by two eigen vectors.
Figure 9. Results of dual-PCA for hardwoods data.
Figure 10. Dendrogram using HCC for Hardwood data.
Figure 11. Scatter diagram of Hardwood data by ANNP and JANT.
Figure 12. Scatter diagram of Hardwood data for ANNT and GDC5.
Figure 13. Estimation of GDC5 by ANNT for cluster C2.
Table 1. Fats and Oils data (Ichino and Yaguchi, 1994 [12]).
Fats and Oils | Specific Gravity (g/cm³): F1 | Freezing Point (°C): F2 | Iodine Value: F3 | Saponification Value: F4 | Major Acids: F5
Linseed | [0.930, 0.935] | [−27, −18] | [170, 204] | [118, 196] | L, Ln, O, P, M
Perilla | [0.930, 0.937] | [−5, −4] | [192, 208] | [188, 197] | L, Ln, O, P, S
Cotton | [0.916, 0.918] | [−6, −1] | [99, 113] | [189, 198] | L, O, P, M, S
Sesame | [0.920, 0.926] | [−6, −4] | [104, 116] | [187, 196] | L, O, P, S, A
Camellia | [0.916, 0.917] | [−21, −15] | [80, 82] | [189, 193] | L, O
Olive | [0.914, 0.919] | [0, 6] | [79, 90] | [190, 199] | L, O, P, S
Beef | [0.860, 0.870] | [30, 38] | [40, 48] | [190, 199] | O, P, M, S, C
Hog | [0.858, 0.864] | [22, 32] | [53, 77] | [190, 202] | L, O, P, M, S, Lu
L: Linoleic acid; Ln: Linolenic acid; O: Oleic acid; P: Palmitic acid; M: Myristic acid; S: Stearic acid; A: Arachidic acid; C: Capric acid; Lu: Lauric acid.
Table 2. Fats and Oils data by quantile vectors.
Fats and Oils | Specific Gravity | Freezing Point | Iodine Value | Saponification Value | Major Acids
Linseed 1 | 0.930 | −27 | 170 | 118 | 4
Linseed 2 | 0.935 | −18 | 204 | 196 | 9
Perilla 1 | 0.930 | −5 | 192 | 188 | 4
Perilla 2 | 0.937 | −4 | 208 | 197 | 9
Cotton 1 | 0.916 | −6 | 99 | 189 | 5
Cotton 2 | 0.918 | −1 | 113 | 198 | 9
Sesame 1 | 0.920 | −6 | 104 | 187 | 2
Sesame 2 | 0.926 | −4 | 116 | 193 | 9
Camellia 1 | 0.916 | −21 | 80 | 189 | 8
Camellia 2 | 0.917 | −15 | 82 | 193 | 9
Olive 1 | 0.914 | 0 | 79 | 187 | 6
Olive 2 | 0.919 | 6 | 90 | 196 | 9
Beef 1 | 0.860 | 30 | 40 | 190 | 2
Beef 2 | 0.870 | 38 | 48 | 199 | 9
Hog 1 | 0.858 | 22 | 53 | 190 | 1
Hog 2 | 0.864 | 32 | 77 | 202 | 9
Table 3. Monotone Blocks Segmentation (MBS) for Fats and Oils data.
Fats and Oils | Iodine Value | Specific Gravity | Freezing Point | Saponification Value | Major Acids
Beef 1 | 40 | 0.860 | 30 | 190 | 3
Beef 2 | 48 | 0.870 | 38 | 199 | 9
Hog 1 | 53 | 0.858 | 22 | 190 | 1
Hog 2 | 77 | 0.864 | 32 | 202 | 9
Olive 1 | 79 | 0.914 | 0 | 187 | 6
Camellia 1 | 80 | 0.916 | −21 | 189 | 8
Camellia 2 | 82 | 0.917 | −15 | 193 | 9
Olive 2 | 90 | 0.919 | 6 | 196 | 9
Cotton 1 | 99 | 0.916 | −6 | 189 | 5
Sesame 1 | 104 | 0.920 | −6 | 187 | 2
Cotton 2 | 113 | 0.918 | −1 | 198 | 9
Sesame 2 | 116 | 0.926 | −4 | 193 | 9
Linseed 1 | 170 | 0.930 | −27 | 118 | 4
Perilla 1 | 192 | 0.930 | −5 | 188 | 4
Linseed 2 | 204 | 0.935 | −18 | 196 | 9
Perilla 2 | 208 | 0.937 | −4 | 197 | 9
Table 4. The Lookup Table for the Fats and Oils data.
Iodine Value | Specific Gravity | Freezing Point
[40, 77] | [0.858, 0.870] | [22, 38]
[79, 79] | [0.914, 0.914] |
[79, 208] | | [−27, 6]
[80, 113] | [0.916, 0.920] |
[116, 116] | [0.926, 0.926] |
[170, 192] | [0.930, 0.930] |
[204, 204] | [0.935, 0.935] |
[208, 208] | [0.937, 0.937] |
Table 5. Estimated result by the LTRM for the Fats and Oils data.
Fats and Oils | Estimated by Specific Gravity | Estimated by Freezing Point | Actual Value
Linseed | [170, 204] | [79, 208] | [170, 204]
Perilla | [170, 208] | [79, 208] | [188, 197]
Cotton | [80, 113] | [79, 208] | [99, 113]
Sesame | [113, 116] | [79, 208] | [104, 116]
Camellia | [80, 113] | [79, 208] | [80, 82]
Olive | [79, 113] | [79, 208] | [79, 90]
Beef | [40, 77] | [40, 77] | [40, 48]
Hog | [40, 77] | [40, 77] | [55, 77]
Table 6. Test data for the LTRM.
Fats and Oils | Specific Gravity | Freezing Point | Iodine Value
Corn | [0.920, 0.928] | [−18, −10] | [88, 147]
Soybean | [0.922, 0.934] | [−8, −7] | [114, 138]
Rice bran | [0.916, 0.922] | [−10, −5] | [92, 115]
Horse fat | [0.90, 0.95] | [30, 35] | [65, 95]
Sheep tallow | [0.89, 0.90] | [30, 35] | [35, 46]
Chicken fat | [0.91, 0.92] | [30, 32] | [76, 80]
Table 7. Estimated result for the Test data.
Fats and Oils | Estimated by Specific Gravity | Estimated by Freezing Point | Actual Value
Corn | [113, 170] | [79, 208] | [88, 147]
Soybean | [113, 204] | [79, 208] | [114, 138]
Rice bran | [80, 113] | [79, 208] | [92, 115]
Horse fat | [79, 208] | [40, 77] | [65, 95]
Sheep tallow | [77, 79] | [40, 77] | [35, 46]
Chicken fat | [79, 113] | [40, 77] | [76, 80]
Table 8. Oval artificial data.
Object | F1 | F2 | F3 | F4 | F5
1 | [0.629, 0.798] | [0.905, 0.986] | [0.000, 0.982] | [0.002, 0.883] | [0.360, 0.380]
2 | [0.854, 0.955] | [0.797, 0.905] | [0.002, 0.421] | [0.573, 1.000] | [0.754, 0.761]
3 | [0.921, 1.000] | [0.527, 0.716] | [0.193, 0.934] | [0.035, 0.477] | [0.406, 0.587]
4 | [0.865, 0.933] | [0.378, 0.500] | [0.452, 0.854] | [0.213, 0.604] | [0.000, 0.074]
5 | [0.775, 0.876] | [0.257, 0.338] | [0.300, 0.614] | [0.425, 0.979] | [0.217, 0.568]
6 | [0.663, 0.764] | [0.135, 0.216] | [0.712, 1.000] | [0.904, 0.968] | [0.103, 0.950]
7 | [0.494, 0.596] | [0.041, 0.122] | [0.293, 0.470] | [0.023, 0.086] | [0.765, 0.902]
8 | [0.225, 0.427] | [0.000, 0.081] | [0.633, 0.872] | [0.000, 0.582] | [0.719, 0.852]
9 | [0.112, 0.213] | [0.041, 0.149] | [0.167, 0.802] | [0.056, 0.129] | [0.124, 0.642]
10 | [0.022, 0.112] | [0.162, 0.270] | [0.026, 0.718] | [0.418, 0.851] | [0.549, 0.853]
11 | [0.000, 0.090] | [0.297, 0.392] | [0.096, 0.759] | [0.438, 0.938] | [0.495, 0.760]
12 | [0.045, 0.112] | [0.446, 0.554] | [0.826, 0.962] | [0.230, 0.755] | [0.104, 0.189]
13 | [0.101, 0.202] | [0.608, 0.676] | [0.367, 0.570] | [0.236, 0.684] | [0.683, 0.930]
14 | [0.213, 0.292] | [0.676, 0.811] | [0.371, 0.381] | [0.086, 0.305] | [0.009, 1.000]
15 | [0.315, 0.438] | [0.811, 0.919] | [0.049, 0.585] | [0.056, 0.891] | [0.528, 0.881]
16 | [0.483, 0.562] | [0.878, 1.000] | [0.402, 0.609] | [0.150, 0.769] | [0.207, 0.732]
Table 9. The result of MBS for the Oval data.
Object | F1 | F2 | F3 | F4 | F5
11 1 | 0.000 | 0.297 | 0.096 | 0.438 | 0.495
10 1 | 0.022 | 0.162 | 0.026 | 0.418 | 0.549
12 1 | 0.045 | 0.446 | 0.826 | 0.230 | 0.104
11 2 | 0.090 | 0.392 | 0.759 | 0.938 | 0.760
13 1 | 0.101 | 0.608 | 0.367 | 0.236 | 0.683
9 1 | 0.112 | 0.041 | 0.167 | 0.056 | 0.124
10 2 | 0.112 | 0.270 | 0.718 | 0.851 | 0.853
12 2 | 0.112 | 0.554 | 0.962 | 0.755 | 0.189
13 2 | 0.202 | 0.676 | 0.570 | 0.684 | 0.930
9 2 | 0.213 | 0.149 | 0.802 | 0.129 | 0.642
14 1 | 0.213 | 0.676 | 0.371 | 0.086 | 0.009
8 1 | 0.225 | 0.000 | 0.633 | 0.000 | 0.719
14 2 | 0.292 | 0.811 | 0.381 | 0.305 | 1.000
15 1 | 0.315 | 0.811 | 0.049 | 0.056 | 0.528
8 2 | 0.427 | 0.081 | 0.872 | 0.582 | 0.852
15 2 | 0.438 | 0.919 | 0.585 | 0.891 | 0.881
16 1 | 0.483 | 0.878 | 0.402 | 0.150 | 0.207
7 1 | 0.494 | 0.041 | 0.293 | 0.023 | 0.765
16 2 | 0.562 | 1.000 | 0.609 | 0.769 | 0.732
7 2 | 0.596 | 0.122 | 0.470 | 0.086 | 0.902
1 1 | 0.629 | 0.905 | 0.000 | 0.002 | 0.360
6 1 | 0.663 | 0.135 | 0.712 | 0.904 | 0.103
6 2 | 0.764 | 0.216 | 1.000 | 0.968 | 0.950
5 1 | 0.775 | 0.257 | 0.300 | 0.425 | 0.217
1 2 | 0.798 | 0.986 | 0.982 | 0.883 | 0.380
2 1 | 0.854 | 0.797 | 0.002 | 0.673 | 0.754
4 1 | 0.865 | 0.378 | 0.452 | 0.213 | 0.000
5 2 | 0.876 | 0.338 | 0.614 | 0.979 | 0.568
3 1 | 0.921 | 0.527 | 0.193 | 0.035 | 0.406
4 2 | 0.933 | 0.500 | 0.854 | 0.604 | 0.074
2 2 | 0.955 | 0.905 | 0.421 | 1.000 | 0.761
3 2 | 1.000 | 0.716 | 0.934 | 0.477 | 0.587
Table 10. The results of MBS for four clusters.
(a)
Object | F1 | F2
1 1 | 0.629 | 0.905
1 2 | 0.798 | 0.986
2 1 | 0.854 | 0.797
3 1 | 0.921 | 0.527
2 2 | 0.955 | 0.905
3 2 | 1.000 | 0.716
(b)
Object | F1 | F2
6 1 | 0.663 | 0.135
6 2 | 0.764 | 0.216
5 1 | 0.775 | 0.257
4 1 | 0.865 | 0.378
5 2 | 0.876 | 0.338
4 2 | 0.933 | 0.500
(c)
Object | F1 | F2
11 1 | 0.000 | 0.297
10 1 | 0.022 | 0.162
11 2 | 0.090 | 0.392
10 2 | 0.112 | 0.270
9 1 | 0.112 | 0.041
9 2 | 0.213 | 0.149
8 1 | 0.225 | 0.000
8 2 | 0.427 | 0.081
7 1 | 0.494 | 0.041
7 2 | 0.596 | 0.122
(d)
Object | F1 | F2
12 1 | 0.045 | 0.446
13 1 | 0.101 | 0.608
12 2 | 0.112 | 0.554
13 2 | 0.202 | 0.676
14 1 | 0.213 | 0.676
14 2 | 0.292 | 0.811
15 1 | 0.315 | 0.811
15 2 | 0.438 | 0.919
16 1 | 0.483 | 0.878
16 2 | 0.562 | 1.000
Table 11. Look up tables for the Oval data.
(a)
F1 | F2
[0.629, 0.798] | [0.905, 0.986]
[0.854, 1.000] | [0.527, 0.905]
(b)
F1 | F2
[0.663, 0.663] | [0.135, 0.135]
[0.764, 0.764] | [0.216, 0.216]
[0.775, 0.775] | [0.257, 0.257]
[0.865, 0.876] | [0.338, 0.378]
[0.933, 0.933] | [0.500, 0.500]
(c)
F1 | F2
[0.000, 0.112] | [0.162, 0.392]
[0.112, 0.596] | [0.000, 0.149]
(d)
F1 | F2
[0.045, 0.112] | [0.446, 0.608]
[0.202, 0.213] | [0.676, 0.676]
[0.292, 0.315] | [0.811, 0.811]
[0.438, 0.483] | [0.878, 0.919]
[0.562, 0.562] | [1.000, 1.000]
Table 12. The quantile values for ANNT.
Taxon Name | Mean Annual Temperature (°C): 0% | 10% | 25% | 50% | 75% | 90% | 100%
ACER EAST | −2.3 | 0.6 | 3.8 | 9.2 | 14.4 | 17.9 | 24
ACER WEST | −3.9 | 0.2 | 1.9 | 4.2 | 7.5 | 10.3 | 21
ALNUS EAST | −10 | −4.4 | −2.3 | 0.6 | 6.1 | 15.0 | 21
ALNUS WEST | −12 | −4.6 | −3.0 | 0.3 | 3.2 | 7.6 | 19
FRAXINUS EAST | −2.3 | 1.4 | 4.3 | 8.6 | 14.1 | 17.9 | 23
FRAXINUS WEST | 2.6 | 9.4 | 11.5 | 17.2 | 21.2 | 22.7 | 24
JUGLANS EAST | 1.3 | 6.9 | 9.1 | 12.4 | 15.5 | 17.6 | 21
JUGLANS WEST | 7.3 | 12.6 | 14.1 | 16.3 | 19.4 | 22.7 | 27
QUERCUS EAST | −1.5 | 3.4 | 6.3 | 11.2 | 16.4 | 19.1 | 24
QUERCUS WEST | −1.5 | 6.0 | 9.5 | 14.6 | 17.9 | 19.9 | 27
Table 13. The first two principal components for hardwood data.
Spearman | Pc1 | Pc2
Eigen values | 6.691 | 0.909
Contribution (%) | 83.635 | 11.357
Eigen vector | Pc1 | Pc2
ANNT | 0.362 | −0.363
JANT | 0.346 | −0.427
JULT | 0.372 | −0.208
ANNP | 0.359 | 0.369
JANP | 0.337 | 0.365
JULP | 0.352 | 0.170
GDC5 | 0.365 | −0.331
MITM | 0.335 | 0.484
Table 14. The first two principal components for hardwoods data (dual).
Spearman | Pc1 | Pc2
Eigen values | 8.79 | 0.54
Contribution (%) | 87.89 | 5.40
Eigen vectors | Pc1 | Pc2
AcE | 0.323 | 0.156
AcW | 0.305 | 0.308
AlE | 0.317 | 0.354
AlW | 0.303 | 0.496
FE | 0.331 | 0.008
FW | 0.305 | −0.436
JE | 0.318 | −0.071
JW | 0.309 | −0.497
QE | 0.331 | −0.056
QW | 0.320 | −0.253
Table 15. The result of MBS for Hardwood data.
Q.V. | GDC5 | ANNT | JANT | JULT | ANNP | JANP | JULP | MITM
AcW10.1−3.9−23.87.1105500.14
AlE10.1−10.2−30.97.12209280.22
AlW10.1−12.2−30.57.1170400.22
QuW10.3−1.5−129.785100.08
AcE10.5−2.3−24.611.541510560.62
AcW20.50.2−11.811.33802880.49
AlW20.5−4.6−25.711.533518210.49
AlE20.6−4.4−26.513.238019580.53
AcW30.71.9−10.112.850554230.61
AlW30.7−3−21.612.841023410.59
AlE30.8−2.3−22.714.847523740.69
FrE10.8−2.3−23.813.52706180.39
QuE10.8−1.5−22.713.52407320.21
AlW40.90.3−15.114.451037570.72
FrW10.92.6−7.412.585500.09
JuE111.3−14.615.25259410.63
AcW41.14.2−6.914.975092380.75
AlE41.10.6−18.116.577046910.93
AlW51.13.2−7.615.679093740.87
AcE21.20.6−18.316.672023770.89
FrE21.31.4−1817.441012540.6
QuW21.46−5.416.22951020.35
AcE31.53.8−12.318.283540890.94
QuE21.53.4−14.518.450514560.66
AcW51.67.5−1.317.61175176520.91
AlW61.67.6−0.817.51385199870.97
FrE31.64.3−13.11965521740.83
JuW11.67.3−1.317.1235100.2
AlE51.96.1-819.81060801080.99
FrW229.4−0.2182551220.27
JuE226.9−9.120.378522770.88
QuE326.3−9.720.574525770.88
QuW329.50.218.938513190.48
AcW62.210.33.319.91860267710.98
FrE42.48.6−622.291055940.95
AcE42.59.2−5.122.21010691000.97
JuE32.59.1−5.422.189040910.93
FrW32.711.53.521.236019120.38
QuE42.911.2−2.823.996061970.95
JuW2312.63.3203559510.42
JuE43.112.4−124.71030711010.96
FrE53.514.11.725.71130851080.98
JuW33.514.15.620.944511760.57
AcE53.614.42.325.81200961130.99
QuW43.614.66.821.154025540.63
AlE63.7153.725.712351061260.99
JuE53.915.53.826.41190961120.97
QuE54.216.4526.91175901100.98
JuW44.316.38.822.7625171600.69
FrW44.517.29.124.348528430.49
JuE64.717.6727.713501271240.99
AcE64.817.97.927.313551271350.99
AlW74.818.710.828.346856674521
FrE64.817.97.527.413201181270.99
QuW54.817.911.324.2815631500.77
QuE65.219.19.52813451221330.99
JuW55.419.412.525.3790242000.78
QuW65.519.915.327.411601632010.88
AcW75.620.61129.243706161601
AlE75.920.914.129.116501662121
FrW55.921.21328.970577600.64
JuE7621.412.429.415601502041
FrW66.522.714.730.41155217850.78
JuW66.522.718.427.7905352240.89
FrE76.723.218.129.516301662181
AcE76.823.818.928.816301662221
FrW76.924.416.933.125554142060.97
QuE7724.219.631.816301612221
JuW78.526.626.231.312451663280.94
QuW78.527.226.233.825554003500.99
Table 16. Lookup table of Hardwood data.
GDC5 | ANNT | JANT | JULT
[0.1, 0.1] | | | [7.1, 7.1]
[0.1, 2.5] | [−12.2, 10.3] | |
[0.1, 4.2] | | [−30.9, 6.8] |
[0.3, 0.5] | | | [9.7, 11.5]
[0.6, 0.9] | | | [12.5, 14.8]
[1.0, 1.1] | | | [14.9, 15.2]
[1.0, 6.8] | | | [15.6, 30.4]
[2.7, 3.1] | [11.2, 12.6] | |
[3.5, 3.6] | [14.1, 14.6] | |
[3.7, 4.8] | [15.0, 18.7] | |
[4.3, 6.5] | | [7.0, 15.3] |
[4.5, 4.8] | [17.2, 18.7] | |
[5.2, 5.5] | [19.1, 19.9] | |
[5.6, 5.9] | [20.6, 21.2] | |
[6.0, 6.5] | [21.4, 22.7] | |
[6.5, 6.9] | | [16.9, 18.9] |
[6.7, 7.0] | [23.2, 24.4] | |
[6.9, 8.5] | | | [31.3, 33.8]
[7.0, 7.0] | | [19.6, 19.6] |
[8.5, 8.5] | [26.6, 27.2] | [26.2, 26.2] |
Table 17. Test data for the Lookup table of Hardwood data.
Taxon Name | Quantiles (%) | 0 | 10 | 25 | 50 | 75 | 90 | 100
BETULA | GDC5 | 0.0 | 0.3 | 0.6 | 0.9 | 1.5 | 3.2 | 5.7
BETULA | ANNT | −13.4 | −8.4 | −5.1 | −1.0 | 3.9 | 12.6 | 20.3
CARYA | GDC5 | 1.4 | 2.1 | 2.6 | 3.4 | 4.5 | 5.2 | 6.7
CARYA | ANNT | 3.6 | 7.5 | 10.0 | 13.6 | 17.2 | 19.4 | 23.5
CASTANEA | GDC5 | 1.4 | 2.2 | 2.8 | 3.7 | 4.6 | 5.2 | 6
CASTANEA | ANNT | 4.4 | 8.6 | 11.3 | 14.9 | 17.5 | 19.2 | 21.5
CARPINUS | GDC5 | 1 | 1.6 | 2 | 2.9 | 4.1 | 5.2 | 8.6
CARPINUS | ANNT | 1.2 | 4.4 | 7 | 11.4 | 16 | 19.2 | 28
TILIA | GDC5 | 1.0 | 1.6 | 1.9 | 2.4 | 3.0 | 3.6 | 5.4
TILIA | ANNT | 1.1 | 3.8 | 5.8 | 8.8 | 12.0 | 14.4 | 19.9
ULMUS | GDC5 | 0.8 | 1.3 | 1.7 | 2.6 | 3.9 | 5 | 6.8
ULMUS | ANNT | −2.3 | 1.7 | 4.9 | 9.7 | 15.3 | 18.6 | 23.8
Table 18. Estimated result for the test data.
Taxon Name | Quantiles (%) | 0 | 10 | 25 | 50 | 75 | 90 | 100
BETULA | GDC5 | 0.0 | 0.3 | 0.6 | 0.9 | 1.5 | 3.2 | 5.7
 | Estimated | <0.1 | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | 3.1 | 5.6
CARYA | GDC5 | 1.4 | 2.1 | 2.6 | 3.4 | 4.5 | 5.2 | 6.7
 | Estimated | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [3.1, 3.6] | 4.5 | [5.2, 5.5] | [6.7, 7.0]
CASTANEA | GDC5 | 1.4 | 2.2 | 2.8 | 3.7 | 4.6 | 5.2 | 6
 | Estimated | [0.1, 2.5] | [0.1, 2.5] | [2.7, 3.1] | 3.7 | [4.5, 4.8] | [5.2, 5.5] | [6.0, 6.5]
CARPINUS | GDC5 | 1 | 1.6 | 2 | 2.9 | 4.1 | 5.2 | 8.6
 | Estimated | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [2.7, 3.1] | [3.7, 4.3] | [5.2, 5.5] | 8.5<
TILIA | GDC5 | 1.0 | 1.6 | 1.9 | 2.4 | 3.0 | 3.6 | 5.4
 | Estimated | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [2.7, 3.1] | [3.5, 3.6] | [5.2, 5.5]
ULMUS | GDC5 | 0.8 | 1.3 | 1.7 | 2.6 | 3.9 | 5 | 6.8
 | Estimated | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [0.1, 2.5] | [3.7, 4.3] | [4.5, 4.8] | [6.7, 7.0]
Table 19. Lookup Table for Cluster C1 = (AcW, AlW).
GDC5 | ANNT | JANT | JULT
[0.1, 0.1] | | | [7.1, 7.1]
[0.1, 0.9] | [−12.2, 1.9] | [−30.5, −10.1] |
[0.5, 0.5] | | | [11.3, 11.5]
[0.7, 0.7] | | | [11.8, 12.8]
[0.9, 1.1] | | | [14.4, 15.6]
[1.1, 1.1] | [3.2, 4.2] | [−7.6, −6.9] |
[1.6, 1.6] | [7.5, 7.6] | [−1.3, −0.8] | [17.5, 17.6]
[2.2, 2.2] | [10.3, 10.3] | [3.3, 3.3] | [19.9, 19.9]
[4.8, 4.8] | [18.7, 18.7] | [10.8, 10.8] | [28.3, 28.3]
[5.6, 5.6] | [20.5, 20.6] | [11.0, 11.0] | [29.2, 29.2]
Table 20. Lookup Table for Cluster C2 = (AcE, AlE, FE, JE, QE).
GDC5 | ANNT | JANT | JULT
[0.1, 0.1] | [−10.2, −10.2] | | [7.1, 7.1]
[0.1, 0.6] | | [−30.9, −24.6] |
[0.5, 0.5] | | | [11.5, 11.5]
[0.5, 0.8] | [−4.4, −1.5] | |
[0.6, 0.6] | | | [13.2, 13.2]
[0.8, 0.8] | | [−23.8, −22.7] | [13.5, 14.8]
[1.0, 1.0] | | | [15.2, 15.2]
[1.0, 1.2] | [0.6, 1.3] | |
[1.0, 1.3] | | [−18.3, −14.6] |
[1.1, 1.1] | | | [16.5, 16.5]
[1.2, 1.2] | | | [16.6, 16.6]
[1.3, 1.3] | [1.4, 1.4] | | [17.4, 17.4]
[1.5, 1.5] | [3.4, 3.8] | | [18.2, 18.4]
[1.5, 1.6] | | [−14.5, −12.3] |
[1.6, 1.6] | [4.3, 4.3] | | [19.0, 19.0]
[1.9, 1.9] | [6.1, 6.1] | | [19.8, 19.8]
[1.9, 2.0] | | [−9.7, −8.0] |
[2.0, 2.0] | [6.3, 6.9] | | [20.3, 20.5]
[2.4, 2.4] | [8.6, 8.6] | [−6.0, −6.0] |
[2.4, 2.5] | | | [22.1, 22.2]
[2.5, 2.5] | [9.1, 9.2] | [−5.4, −5.1] |
[2.9, 2.9] | [11.2, 11.2] | [−2.8, −2.8] | [23.9, 23.9]
[3.1, 3.1] | [12.4, 12.4] | [−1.0, −1.0] | [24.7, 24.7]
[3.5, 3.5] | [14.1, 14.1] | [1.7, 1.7] |
[3.5, 3.7] | | | [25.7, 25.8]
[3.6, 3.6] | [14.4, 14.4] | [2.3, 2.3] |
[3.7, 3.7] | [15.0, 15.0] | [3.7, 3.7] |
[3.9, 3.9] | [15.5, 15.5] | [3.8, 3.8] | [26.4, 26.4]
[4.2, 4.2] | [16.4, 16.4] | [5.0, 5.0] | [26.9, 26.9]
[4.7, 4.7] | [17.6, 17.6] | [7.0, 7.0] |
[4.7, 4.8] | | | [27.3, 27.7]
[4.8, 4.8] | [17.9, 17.9] | [7.5, 7.9] |
[5.2, 5.2] | [19.1, 19.1] | [9.5, 9.5] |
[5.2, 6.8] | | | [28.0, 29.5]
[5.9, 5.9] | [20.9, 20.9] | |
[5.9, 6.0] | | [12.4, 14.1] |
[6.0, 6.0] | [21.4, 21.4] | |
[6.7, 6.7] | [23.2, 23.2] | [18.1, 18.1] |
[6.8, 6.8] | [23.8, 23.8] | [18.9, 18.9] |
[7.0, 7.0] | [24.2, 24.2] | [19.6, 19.6] | [31.8, 31.8]
Table 21. Lookup Table for Cluster C3 = (FW, JW, QW).
GDC5 | ANNT | JANT | JULT
[0.3, 0.3] | [−1.5, −1.5] | [−12.0, −12.0] | [9.7, 9.7]
[0.9, 0.9] | [2.6, 2.6] | [−7.4, −7.4] | [12.5, 12.5]
[1.4, 1.4] | [6.0, 6.0] | [−5.4, −5.4] | [16.2, 16.2]
[1.6, 1.6] | [7.3, 7.3] | [−1.3, −1.3] | [17.1, 17.1]
[2.0, 2.0] | [9.4, 9.5] | [−0.2, 0.2] | [18.0, 18.9]
[2.7, 2.7] | [11.5, 11.5] | |
[2.7, 3.0] | | [3.3, 3.5] |
[2.7, 3.6] | | | [20.0, 21.2]
[3.0, 3.0] | [12.6, 12.6] | |
[3.5, 3.5] | [14.1, 14.1] | [5.6, 5.6] |
[3.6, 3.6] | [14.6, 14.6] | [6.8, 6.8] |
[4.3, 4.3] | [16.3, 16.3] | [8.8, 8.8] | [22.7, 22.7]
[4.5, 4.5] | [17.2, 17.2] | [9.1, 9.1] |
[4.5, 4.8] | | | [24.2, 24.3]
[4.8, 4.8] | [17.9, 17.9] | [11.3, 11.3] |
[5.4, 5.4] | [19.4, 19.4] | [12.5, 12.5] | [25.3, 25.3]
[5.5, 5.5] | [19.9, 19.9] | |
[5.5, 6.5] | | [14.7, 15.3] | [27.4, 30.4]
[6.5, 6.5] | [22.7, 22.7] | [18.4, 18.4] |
[8.5, 8.5] | [26.6, 27.2] | [26.2, 26.2] | [31.3, 33.8]
Table 22. Estimated result for the test data by Lookup Table for cluster C2.
Taxon Name | Quantiles (%) | 0 | 10 | 25 | 50 | 75 | 90 | 100
BETULA | GDC5 | 0.0 | 0.3 | 0.6 | 0.9 | 1.5 | 3.2 | 5.7
 | Estimated | <0.1 | [0.1, 0.5] | 0.5 | [0.8, 1.0] | [1.5, 1.6] | [3.1, 3.5] | [5.2, 5.9]
CARYA | GDC5 | 1.4 | 2.1 | 2.6 | 3.4 | 4.5 | 5.2 | 6.7
 | Estimated | 1.5 | [2.0, 2.4] | [2.5, 2.9] | [3.1, 3.5] | [4.2, 4.7] | [5.2, 5.9] | [6.7, 6.8]
CASTANEA | GDC5 | 1.4 | 2.2 | 2.8 | 3.7 | 4.6 | 5.2 | 6.0
 | Estimated | 1.6 | 2.4 | 2.9 | 3.7 | 4.7 | 5.2 | 6.0
CARPINUS | GDC5 | 1.0 | 1.6 | 2.0 | 2.9 | 4.1 | 5.2 | 8.6
 | Estimated | [1.0, 1.2] | 1.6 | 2.0 | 2.9 | [3.9, 4.2] | 5.2 | 7.0<
TILIA | GDC5 | 1.0 | 1.6 | 1.9 | 2.4 | 3.0 | 3.6 | 5.4
 | Estimated | [1.0, 1.2] | 1.5 | 1.9 | 2.4 | 3.1 | 3.6 | [5.2, 5.9]
ULMUS | GDC5 | 0.8 | 1.3 | 1.7 | 2.6 | 3.9 | 5 | 6.8
 | Estimated | [0.5, 0.8] | [1.3, 1.5] | [1.6, 1.9] | [2.5, 2.9] | [3.7, 3.9] | [4.8, 5.2] | 6.8
