Article

Reliability of Partitioning Metric Space Data

Department of Industrial Engineering and Management, Braude College of Engineering, Karmiel 2161002, Israel
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(4), 603; https://doi.org/10.3390/math12040603
Submission received: 26 December 2023 / Revised: 5 February 2024 / Accepted: 14 February 2024 / Published: 18 February 2024

Abstract

The process of sorting or categorizing objects, or information about these objects, into clusters according to certain criteria is a fundamental procedure in data analysis. When a distance metric can be determined for any pair of objects, the significance and reliability of the resulting separation can be evaluated by calculating the separation/segregation power (SP) index proposed herein. The latter index is the ratio of the average inter-group distance to the average intra-group distance and is independent of the scale parameter. The calculated SP value is compared to its statistical distribution, obtained by a simulation study for the given partition under the homogeneity null hypothesis, in order to draw a conclusion using standard statistical procedures. The proposed concept is illustrated using three examples representing different types of objects under study. Some general considerations are given regarding the nature of the SP distribution under the null hypothesis and its dependence on the number of groups and the amount of data within them. A detailed modus operandi (working method) for analyzing a metric data partition is also offered.

1. Introduction

When attempting to understand, take control of, or resolve a particular quality or reliability problem, we often begin by classifying the objects under study (OUS) into relatively homogeneous groups. For example, when analyzing the problem of accumulating a large inventory of faulty products of a certain type, we will likely try to divide them into groups provided by different suppliers and, if such a division makes sense, we will use methods that will force these suppliers to compete to supply higher quality products in the hope of being the supplier of choice—divide et impera!
In recent decades, new types of quality and reliability data have been appearing at a pace that sometimes exceeds our ability to comprehend and interpret them [1,2,3,4,5]. The partition of data arrays into clusters, in accordance with some criterion, is a necessary step in the study of a particular phenomenon. The subsequent investigation must confirm or refute the expediency of such a division. If confirmed, the criterion’s discriminatory power must be assessed (or, in other words, the influencing power of a factor in accordance with the levels of which the data were partitioned must be evaluated). If the data come from a metric space, then for any pair of data, a distance characterizing the dissimilarity between them is defined. Choosing the appropriate distance metric is a fundamental problem in quality control, pattern recognition, machine learning, cluster analysis, etc.
Data do not necessarily mean numbers; data can be information of any kind about the OUS, obtained as a result of tests, measurements, observations, inquiries, etc. The distance between data, however, indicating how far apart the studied objects are (i.e., dissimilar), is represented by a scalar/number. Notwithstanding all the shortcomings of such a simplification of the representation of the distinction between complex objects, this idea has one undeniable advantage: simplicity.
In a metric space, the distance d satisfies the following four axioms:
  • The distance from a data point to itself is zero: d(x, x) = 0.
  • The distance between two distinct points x and y is always positive: d(x, y) > 0.
  • The distance from x to y is always the same as the distance from y to x: d(x, y) = d(y, x).
  • The triangle inequality holds: d(x, z) ≤ d(x, y) + d(y, z).
The OUS, as well as the data characterizing them, can be very diverse. In our recent article [1], we considered some new types of quality data: categorical, preference chains, strings, shapes, images, tree structured, and product/process distributions. In the short time since the paper was published, not surprisingly, more and more data types have emerged. The continually expanding spectrum of distance metrics used in quality and reliability engineering is also diverse (Figure 1); e.g., see [6] for the use of Wasserstein and Hausdorff metrics for quality control and cyber-attack detection.
Here we assume that, in accordance with a selected metric, for a data set $\{x_i\}_{i=1}^{N}$ of size $N$ related to the phenomenon under study, a matrix $d_{ij} = d(x_i, x_j)$ of mutual distances can be determined for each pair of data $x_i$ and $x_j$. This is a symmetric square matrix with non-negative entries and zeros on the main diagonal. From the triangle inequality (axiom 4), it follows that $d_{ik} + d_{kj} \geq d_{ij}$ for any triad of i, j, and k.
To ascertain the influencing factors, all the data are divided into $m$ groups/segments of sizes $n_1, n_2, \ldots, n_k, \ldots, n_m$ (with $\sum_{k=1}^{m} n_k = N$) according to a criterion associated with the levels of this factor (or factors). Respectively, all $\binom{N}{2}$ distances are split into two types (Figure 2): those that refer to data pairs belonging to the same group (denoted here by the prefix intra) and those that describe the distances between pairs of data belonging to different groups (denoted here by the prefix inter).
To ascertain whether this partitioning is effective and, if so, to what extent, the degree of distinction achieved, henceforth called segregation power—SP [7]—must be evaluated. If the discrimination turns out to be weak (statistically insignificant), the hypothesis that the partitioning criterion was chosen incorrectly could be accepted. If, on the other hand, the discrimination is not weak (statistically significant), this evaluation should be compared to other partitioning criteria. This article is devoted to the development of a measure suitable for this purpose. The attentive reader will certainly find some analogies with ANOVA; however, the proposed approach differs both in general and in specific details from the latter.

2. Preliminary Materials: Some Definitions and Separation Power (SP) Calculation Method

Before explaining our method, we define some of the concepts used herein.

2.1. Some Definitions

  • Intra degrees of connection $dc_{intra}$: The number of data pairs belonging to the same group, i.e., $dc_{intra} = \sum_{i=1}^{m} \binom{n_i}{2} = \binom{n_1}{2} + \binom{n_2}{2} + \ldots + \binom{n_m}{2}$.
  • Inter degrees of connection $dc_{inter}$: The number of data pairs belonging to different groups, i.e., $dc_{inter} = \sum_{i=1}^{m} \sum_{j>i}^{m} n_i n_j = \frac{1}{2}\sum_{i \neq j} n_i n_j$.
  • Total degrees of connection $dc_{total}$: The number of all data pairs, i.e., $dc_{total} = \binom{N}{2}$. Obviously, $dc_{total} = dc_{intra} + dc_{inter}$.
  • Intra sum of distances $SD_{intra}$: The sum of distances between data pairs belonging to the same group.
  • Inter sum of distances $SD_{inter}$: The sum of distances between data pairs belonging to different groups.
  • Total sum of distances $SD_{total}$: The sum of distances between all data pairs. Obviously (see Figure 3 for an example), $SD_{total} = SD_{intra} + SD_{inter}$.
  • Intra mean distance $MD_{intra}$: The intra sum of distances $SD_{intra}$ divided by the intra degrees of connection $dc_{intra}$, i.e., $MD_{intra} = SD_{intra}/dc_{intra}$.
  • Inter mean distance $MD_{inter}$: The inter sum of distances $SD_{inter}$ divided by the inter degrees of connection $dc_{inter}$, i.e., $MD_{inter} = SD_{inter}/dc_{inter}$.
  • Separation/segregation power SP: $MD_{inter}$ divided by $MD_{intra}$ (a computational sketch follows this list), i.e.,
$$SP = \frac{MD_{inter}}{MD_{intra}} = \frac{SD_{inter}/dc_{inter}}{SD_{intra}/dc_{intra}}$$
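For readers who prefer code to formulas, the following is a minimal sketch (ours, not the authors' implementation) of the SP computation from a precomputed distance matrix and a vector of group labels; the function name separation_power and the NumPy-based layout are our own choices.

```python
# Minimal sketch of the SP index: ratio of the mean inter-group distance to the
# mean intra-group distance, computed from a symmetric distance matrix D
# (zero diagonal) and one group label per datum. Not the authors' code.
import numpy as np

def separation_power(D, labels):
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    iu, ju = np.triu_indices(len(labels), k=1)   # all C(N, 2) unordered pairs
    d = D[iu, ju]
    same = labels[iu] == labels[ju]              # mask of intra-group pairs
    md_intra = d[same].mean()                    # SD_intra / dc_intra
    md_inter = d[~same].mean()                   # SD_inter / dc_inter
    return md_inter / md_intra
```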

2.2. Some Illustrative Examples of SP Calculation for the Different Kinds of Data

We start with the simplest example and continue with more complex ones.

2.2.1. Data Represented by Real Numbers

Times to failure (TTFs) of two products (A and B) randomly selected from the batch supplied by supplier I were 24,000 and 30,000 h, respectively, while the TTFs of two other products (C and D) randomly selected from the batch supplied by supplier II were 17,000 and 19,000 h, respectively. In other words, XA = 24,000 and XB = 30,000, whereas XC = 17,000 and XD = 19,000. Given this information, let us divide these data into two groups as in Figure 3. The first is A, B (supplier I), and the second is C, D (supplier II). Choosing the range between the TTFs as the distance measure (Euclidean distance), we obtain the matrix of mutual distances shown in Table 1.
Hence, in line with the definitions provided above and in Figure 3:
$$SD_{intra} = d_{A,B} + d_{C,D} = 6000 + 2000 = 8000, \qquad dc_{intra} = 2$$
$$SD_{inter} = d_{A,C} + d_{A,D} + d_{B,C} + d_{B,D} = 7000 + 5000 + 13{,}000 + 11{,}000 = 36{,}000, \qquad dc_{inter} = 4$$
Accordingly,
$$MD_{intra} = \frac{8000}{2} = 4000; \qquad MD_{inter} = \frac{36{,}000}{4} = 9000$$
and finally:
$$SP = \frac{MD_{inter}}{MD_{intra}} = \frac{9000}{4000} = 2.25$$
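As a quick check, a few lines of self-contained Python (ours, not part of the paper) reproduce the SP value of 2.25 for this example; the variable names are arbitrary.

```python
# Worked check of Section 2.2.1: two suppliers, two TTFs each, SP = 2.25.
ttf = {"A": 24000, "B": 30000, "C": 17000, "D": 19000}        # hours
supplier = {"A": "I", "B": "I", "C": "II", "D": "II"}

intra, inter = [], []
names = list(ttf)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        d = abs(ttf[a] - ttf[b])                              # 1-D Euclidean distance
        (intra if supplier[a] == supplier[b] else inter).append(d)

sp = (sum(inter) / len(inter)) / (sum(intra) / len(intra))
print(sp)                                                     # 9000 / 4000 = 2.25
```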

2.2.2. Each Datum Is a Discrete Distribution over Categories (as in a Pie Chart)

Below are real data on the distribution of quality cost proportions across the four categories defined by [8] for eight companies engaged in residential construction in Israel [9] (see Table 2 and Figure 4). Although all the surveyed companies were certified to the international quality standard, it was striking that companies 1, 2, 3, and 6 spent a relatively much larger share on external failures than the other companies (4, 5, 7, and 8). Let us now calculate the SP between these two clusters. For each company, four proportions reflecting the distribution of quality costs across the whole spectrum of possible costs are known.
The most appropriate distance measure for comparing two distributions $P = (p_1, \ldots, p_i, \ldots, p_r)$ and $Q = (q_1, \ldots, q_i, \ldots, q_r)$ on a nominal scale is the Hellinger distance, defined as [10]:
$$H(P, Q) = \sqrt{\frac{1}{2}\sum_{i=1}^{r}\left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
In our case, because we have four quality cost categories, r = 4. Using this distance measure, we obtain the matrix of mutual distances shown in Table 3.
The total degrees of connection are dctotal = 28 (“two out of eight” combinations) split into dcintra = 12 (twice “two out of four”) and dcinter = 16 (combinations of each of the four proportions in the first cluster with each of the four proportions in the second cluster). Skipping the routine of calculating the sums of the intra and inter distances, in Table 4 we present only the results.
And finally, $SP = \frac{MD_{inter}}{MD_{intra}} = 1.773$.
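The same result can be reproduced with a short script (ours, not part of the original paper) that builds the Hellinger distance matrix of Table 3 directly from the proportions of Table 2 and then forms the SP for the partition (1, 2, 3, 6) versus (4, 5, 7, 8).

```python
# Sketch reproducing Table 3 and the SP of Section 2.2.2 from Table 2.
import numpy as np

# Rows: companies 1..8; columns: prevention, appraisal, internal, external.
P = np.array([
    [0.19, 0.22, 0.14, 0.45], [0.05, 0.15, 0.28, 0.52],
    [0.07, 0.13, 0.18, 0.62], [0.19, 0.35, 0.27, 0.19],
    [0.11, 0.47, 0.19, 0.23], [0.31, 0.12, 0.14, 0.43],
    [0.19, 0.32, 0.27, 0.22], [0.23, 0.23, 0.27, 0.27],
])
cluster = np.array([1, 1, 1, 2, 2, 1, 2, 2])      # companies 1,2,3,6 vs 4,5,7,8

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

n = len(P)
D = np.array([[hellinger(P[i], P[j]) for j in range(n)] for i in range(n)])

iu, ju = np.triu_indices(n, k=1)
same = cluster[iu] == cluster[ju]
sp = D[iu, ju][~same].mean() / D[iu, ju][same].mean()
print(round(sp, 3))                                # approximately 1.773
```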

2.2.3. Each Datum Is a Preference Chain of Alternatives

Preference/prioritization chains (PC), along with other new types of structured data, are widely used in engineering, quality management, risk management, genetics, healthcare, customer research, decision making, etc. Let the symbol “>” depict the relationships between two alternatives, i.e., A1 > A2 means that A1 is preferable to A2. A set of n predetermined alternatives arranged as a string by this symbol (e.g., A1 > A2 > A3> … > An) forms a strict preference chain. Obviously, there are n! such chains obtained by permutation of the alternatives. The construction of a chain is based only on relationships among the predetermined alternatives without necessarily being related to the evaluation of the property under study.
According to [11,12], all feasible PCs are scattered on the surface of an n(n − 1)/2-dimensional sphere. If one of them (for example, the naturally ordered chain A1 > A2 > A3 > … > An) is considered as a base, the north pole [N], then the reverse chain is located at the south pole [S] of this sphere, and all the remaining (n! − 2) PCs are located on [n(n − 1)/2] − 1 parallels formed by flat disks that cut the N–S axis equidistantly. The so-called geodesic distance between two PCs is proportional to the length of the geodesic arc connecting them on the surface of this multidimensional globe. The radius of the globe, for convenience, is chosen so that the maximum possible distance (e.g., from [N] to [S]) is equal to 1. For details regarding calculation of the geodesic distance, we refer readers to [11].
In one of the experiments described in [11], five experts/judges prioritized five alternatives. Judges two and three are women, while judges one, four, and five are men. Table 5 shows the mutual distances between the respective preference chains.
To check the discrimination power of gender (if such exists), we divide the five judges into two clusters (2, 3) and (1, 4, 5) according to gender; see Figure 5.
Then,
$$MD_{intra} = \frac{d_{2,3} + d_{1,4} + d_{1,5} + d_{4,5}}{4} = 0.385$$
$$MD_{inter} = \frac{(d_{2,1} + d_{2,4} + d_{2,5}) + (d_{3,1} + d_{3,4} + d_{3,5})}{6} = 0.578$$
This implies $SP = \frac{MD_{inter}}{MD_{intra}} = 1.502$.
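A short script (ours) confirms these numbers starting directly from the geodesic distance matrix of Table 5; computing the geodesic distances themselves requires the machinery of [11] and is not reproduced here.

```python
# Gender-partition SP of Section 2.2.3 from the distance matrix of Table 5.
import numpy as np

D = np.array([                        # judges 1..5
    [0.00, 0.59, 0.73, 0.33, 0.38],
    [0.59, 0.00, 0.46, 0.45, 0.44],
    [0.73, 0.46, 0.00, 0.65, 0.61],
    [0.33, 0.45, 0.65, 0.00, 0.37],
    [0.38, 0.44, 0.61, 0.37, 0.00],
])
gender = np.array(["M", "F", "F", "M", "M"])   # judges 2 and 3 are women

iu, ju = np.triu_indices(5, k=1)
same = gender[iu] == gender[ju]
md_intra = D[iu, ju][same].mean()      # (d23 + d14 + d15 + d45) / 4 = 0.385
md_inter = D[iu, ju][~same].mean()     # six cross-gender distances / 6 = 0.578
print(round(md_inter / md_intra, 3))   # 1.502
```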

2.3. Checking the Homogeneity Hypothesis H0

In its most general form, the conservative (null) hypothesis of homogeneity H0 means that the data being studied in all groups are drawn from the same original population distribution. In other words, the scatter of the data reflects the natural scatter within the population itself and in no way indicates an influence, on the data being studied, of the level of the factor according to which the partitioning into groups was made. Thus, the division of the data into groups/segments does not make any sense, and the differences in the data are due to noise factors only. For example, when analyzing the academic achievements of students, the assumption H0 can mean that achievement is independent of a characteristic/factor such as gender or hair color.
As in ANOVA, we assume that if SP exceeds a certain threshold, determined for a given risk by the distribution of SP under H0, it can serve as an indicator of the influence of a discriminating/segregating factor. The p-value can also serve as an indicator of discrimination/segregation: the smaller it is, the greater the influence of the segregating factor. The considerations given in Section 3.1 and Appendix A, Appendix B and Appendix C support the proposition that, for a given H0, the distribution of SP depends only on the type of partition, i.e., on the vector $(n_1, n_2, \ldots, n_k, \ldots, n_m)$. Some general conclusions about the SP distribution can be made on the basis of an analytical analysis supported by simulation (see Section 3 and Appendix A, Appendix B and Appendix C).

2.4. Some Simple Examples of Distance Metric Distribution

Suppose we have a pair of data points randomly and independently drawn from the same distribution, and the distance d is defined as the range (absolute difference) between them.

2.4.1. Normal Distribution

If the distribution is normal, then d is distributed according to (2a) (see Figure 6a [13,14]):
$$f(d) = \frac{1}{\sigma\sqrt{\pi}}\, e^{-\frac{d^2}{4\sigma^2}} \qquad (d \geq 0),$$
where σ is the standard deviation of the original normal distribution $N(\mu, \sigma^2)$ and f(d) denotes the probability density function (PDF).
Clearly, the mean distance, as well as its variance, depends only on the scale parameter σ of the native normal distribution:
$$E(d) = \frac{2}{\sqrt{\pi}}\,\sigma$$
$$VAR(d) = \left(2 - \frac{4}{\pi}\right)\sigma^2$$

2.4.2. Uniform Distribution

Now suppose a pair of data points is randomly and independently drawn from the same uniform distribution U(a, b); then the distance d between them follows the triangular distribution (2b) (see Figure 6b [15,16]):
$$f(d) = \frac{2}{b-a}\left(1 - \frac{d}{b-a}\right) \qquad (0 \leq d \leq b-a)$$
with
$$E(d) = \frac{2}{\sqrt{3}}\,\sigma$$
$$VAR(d) = \frac{2}{3}\,\sigma^2$$
where $\sigma^2 = \frac{(b-a)^2}{12}$ denotes the variance of the uniform distribution U(a, b). Both Equations (3b) and (4b) are independent of the location parameter (a + b)/2 of U(a, b).

2.4.3. Exponential Distribution

Finally, in the case of an exponential distribution $Exp(x_0, \lambda)$, anchored at the "starting" value $x_0$ (location parameter) and with rate parameter λ (the reciprocal of the spread), the distance d is distributed according to Equation (2c) (see Figure 6c [15,16]):
$$f(d) = \lambda e^{-\lambda d} \qquad (d \geq 0)$$
with
$$E(d) = \sigma$$
$$VAR(d) = \sigma^2$$
where $\sigma = 1/\lambda$ denotes the standard deviation of the exponential distribution $Exp(x_0, \lambda)$.
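These closed-form moments are easy to confirm numerically; the following Monte Carlo sketch (ours, with arbitrary illustrative parameter values) checks E(d) and VAR(d) for the three cases.

```python
# Monte Carlo check of the pairwise-distance moments of Sections 2.4.1-2.4.3.
import numpy as np

rng = np.random.default_rng(0)
M = 1_000_000

# Normal N(mu, sigma^2): E(d) = 2*sigma/sqrt(pi), VAR(d) = (2 - 4/pi)*sigma^2.
mu, sigma = 5.0, 2.0
d = np.abs(rng.normal(mu, sigma, M) - rng.normal(mu, sigma, M))
print(d.mean(), 2 * sigma / np.sqrt(np.pi))       # ~2.26 vs 2.26
print(d.var(), (2 - 4 / np.pi) * sigma ** 2)      # ~2.91 vs 2.91

# Uniform U(a, b): E(d) = 2*sigma/sqrt(3), VAR(d) = (2/3)*sigma^2.
a, b = 1.0, 7.0
sig_u = (b - a) / np.sqrt(12)
d = np.abs(rng.uniform(a, b, M) - rng.uniform(a, b, M))
print(d.mean(), 2 * sig_u / np.sqrt(3))           # ~2.0 vs 2.0
print(d.var(), 2 * sig_u ** 2 / 3)                # ~2.0 vs 2.0

# Exponential with rate lam: E(d) = 1/lam, VAR(d) = 1/lam^2.
lam = 0.5
d = np.abs(rng.exponential(1 / lam, M) - rng.exponential(1 / lam, M))
print(d.mean(), 1 / lam)                          # ~2.0 vs 2.0
print(d.var(), 1 / lam ** 2)                      # ~4.0 vs 4.0
```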

2.4.4. Conclusions Derived from the above Examples

Naturally, the independence of the pairwise distance distribution from the location parameter of the original (native) data distribution holds not only for the distributions mentioned above; it holds for any data distribution whose location and scale parameters can be determined independently. In this sense, we can speak of the translational invariance of the distance distribution. Consequently, under H0, both E(SDintra) and E(SDinter), for a given partitioning, do not depend on the location parameter and are proportional to the scale parameter σ. Thus, the ratio E($MD_{inter}$)/E($MD_{intra}$) depends on neither the location nor the scale parameter and equals one. The latter gives us reason to assume that under H0, the distribution of SP = $MD_{inter}/MD_{intra}$ also does not depend on these parameters but is determined only by the method of partitioning and the type of original data distribution. A detailed proof of this statement is given in Appendix A, and simulation studies carried out by the authors for normal, uniform, and exponential original distributions confirm this assumption. Figure 7 illustrates this universal SP distribution for the partition of four data points into two segments shown in Figure 3. The authors experimented with many different location and scale parameters, and the results were always reproducible. The same held for other types of partitioning, different from the one shown in Figure 3 (e.g., A-BCD, or with more data).

3. Results of the Theoretical and Simulation Studies

3.1. Some General Considerations Regarding SP Distribution under H0

As noted in Section 2.4.4, expectations of the numerator and denominator of SP under H0 are equal: E( M D i n t e r ) = E( M D i n t r a ). Simple conclusions could be drawn regarding the variances of the numerator and denominator of SP, if not for the fact that the terms in both the numerator and the denominator are distributed identically, but not independently; additionally, two distances from a common vertex datum are correlated, not independent (see Appendix B). The specific value of the correlation coefficient ρ (the same for all pairs of correlated distances) depends on the type of original data distribution. In the case of a normal original data distribution, for example, it equals 0.224. It should be noted that correlations exist not only between pairs of distances that are terms in the numerator (or denominator), but also between distances, one of which belongs to the inter connection and the other to the intra connection, if they come from a common vertex datum (e.g., d(A,C) and d(A,B) with common vertex A in Figure 3). It is not difficult to prove that:
$$VAR(SD_{inter}) = dc_{inter}\cdot VAR(d) + 2\,cov\sum_{i=1}^{m} n_i\binom{N-n_i}{2}$$
$$VAR(SD_{intra}) = dc_{intra}\cdot VAR(d) + 6\,cov\sum_{i=1}^{m}\binom{n_i}{3}$$
$$COV(SD_{intra}, SD_{inter}) = 2\,cov\sum_{i\neq j} n_i\binom{n_j}{2}$$
where VAR(d) denotes the variance of d, e.g., $\left(2-\frac{4}{\pi}\right)\sigma^2$ for distribution (2a), and cov denotes the covariance between two distances with a common vertex, e.g., $cov = \rho\,VAR(d) \approx 0.163\,\sigma^2$ for distribution (2a).
Accordingly,
$$VAR(MD_{inter}) = \frac{VAR(d)}{dc_{inter}} + \frac{2\,cov\sum_{i=1}^{m} n_i\binom{N-n_i}{2}}{dc_{inter}^2}$$
$$VAR(MD_{intra}) = \frac{VAR(d)}{dc_{intra}} + \frac{6\,cov\sum_{i=1}^{m}\binom{n_i}{3}}{dc_{intra}^2}$$
$$COV(MD_{intra}, MD_{inter}) = \frac{2\,cov\sum_{i\neq j} n_i\binom{n_j}{2}}{dc_{intra}\cdot dc_{inter}}$$
If there is only a small number of OUS, it is impossible to draw general theoretical conclusions about the shape of the SP distribution under H0 using Equations (8)–(10) alone. In such cases, only multiple simulations under the given partitioning, such as the one discussed in Section 2.4.4 and Figure 7, can help. The situation, however, is greatly simplified when the number of OUS (data) in the groups or the number of groups increases (see Appendix C). The more OUS there are, the closer E(SP) is to 1, and the narrower the distribution becomes. In the limiting case, $E(SP) \to 1$ and $VAR(SP) \to 0$. Figure 8a,b illustrate the effect of the number of partition groups m on the cumulative distribution function (CDF) and the PDF of the SP distribution (under H0, the original data follow a normal distribution, with n = 10 per group). Figure 9 illustrates the effect of the amount of data in each of m = 10 equally sized groups.
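The narrowing of the SP distribution with growing m and n is easy to observe by simulation. The sketch below (ours; the choice of normal data and of the particular (m, n) pairs is illustrative only) mimics the settings of Figures 8 and 9.

```python
# Under H0 (one homogeneous normal sample), the SP distribution concentrates
# around 1 as the number of groups m or the group size n grows.
import numpy as np

rng = np.random.default_rng(1)

def simulate_sp(m, n, reps=5_000):
    labels = np.repeat(np.arange(m), n)
    iu, ju = np.triu_indices(m * n, k=1)
    same = labels[iu] == labels[ju]
    sps = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=m * n)                   # H0: i.i.d. data, no group effect
        d = np.abs(x[:, None] - x[None, :])[iu, ju]
        sps[r] = d[~same].mean() / d[same].mean()
    return sps

for m, n in [(2, 10), (5, 10), (10, 10), (2, 50)]:
    s = simulate_sp(m, n)
    print(m, n, round(s.mean(), 3), round(s.std(), 3), round(np.quantile(s, 0.95), 3))
```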

3.2. Some Remarks on Deriving the SP Distribution from a Simulation Process under H0

Since different null hypotheses are possible for different types of OUS, there is no universal SP distribution for a certain partition. Nevertheless, under a certain null hypothesis, such a distribution can be obtained by repeating the data simulation over and over and doing the subsequent SP calculations. The data structure significantly affects the simulation model. This is demonstrated in Section 3.2.1, Section 3.2.2 and Section 3.2.3 using the examples described in Section 2.2.1, Section 2.2.2 and Section 2.2.3.

3.2.1. The Case Described in Section 2.2.1

In general, the process is clear. We repeatedly simulate four pieces of data according to the normal, uniform, or other assumed distribution, obtain the SP distribution under H0, and compare the calculated SP with the $SP_{1-\alpha}$ percentile or, alternatively, compute the corresponding p-value. In our case, the $SP_{0.95}$ percentile equals 3.43 (assuming a normal, uniform, or exponential distribution; see also Figure 7) and the p-value is 10.6%. Accordingly, the conclusion is that the available data are insufficient to establish that the products supplied by the two suppliers differ in their reliability level.
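The simulation itself takes only a few lines. The sketch below (ours) draws the four data points from a standard normal distribution under H0, which suffices because, as shown above, the SP distribution does not depend on the location or scale parameters; the printed values should be close to the reported SP0.95 ≈ 3.43 and p-value ≈ 10.6%.

```python
# H0 simulation of the SP distribution for the partition (2, 2) of Section 2.2.1.
import numpy as np

rng = np.random.default_rng(2)
labels = np.array([0, 0, 1, 1])
iu, ju = np.triu_indices(4, k=1)
same = labels[iu] == labels[ju]

reps = 200_000
sps = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=4)                    # four i.i.d. data points under H0
    d = np.abs(x[:, None] - x[None, :])[iu, ju]
    sps[r] = d[~same].mean() / d[same].mean()

print(np.quantile(sps, 0.95))                 # critical SP_0.95, near 3.43
print(np.mean(sps >= 2.25))                   # p-value of the observed SP, near 0.106
```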

3.2.2. The Case Described in Section 2.2.2

The H0 hypothesis determines the distribution of the categories’ proportions. In the simplest, binary case, for example, this may be the expected proportions of satisfactory and faulty products in a supplied batch of a certain size N. Let us assume that all batches are the same size and that the null hypothesis H0 assumes the same level of quality from all suppliers. In this case, one can use the binomial (and in the general case, the multinomial) distribution for simulating the number/proportion of satisfactory and faulty products in a batch.
Let us further assume that batch sizes can vary from supplier to supplier. In this case, for the simulation, we need the Beta distribution of the proportions of faulty (or good) items or, in the general multicategory case, the Dirichlet distribution Dir(γ). The latter is a continuous multivariate probability distribution parameterized by a vector $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_r)$ of positive reals and is known as the multivariate generalization of the Beta distribution, which describes the bivariate case only. The procedure for determining the parameters of these distributions (out of the scope of this article) is based on preliminary information (both theoretical and experimental) and is described in detail in [17]. We restrict ourselves to saying that the more accurate the preliminary information, the smaller the scale parameter that determines the variance of the simulated data. Unfortunately, in [9], which is where the example given in Section 2.2.2 is taken from, such preliminary information (the total cost of quality at each company) is missing, thus making it impossible to formulate H0.
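Had such preliminary information been available, the simulation would be straightforward; the sketch below (ours) shows how it could look, with a purely hypothetical Dirichlet parameter vector gamma that is not estimated from [9].

```python
# Hypothetical H0 simulation for Section 2.2.2: eight "companies" whose four
# cost proportions are drawn from the same Dirichlet distribution; the SP is
# computed for a (4, 4) partition using the Hellinger distance.
import numpy as np

rng = np.random.default_rng(3)
gamma = np.array([4.0, 5.0, 4.0, 7.0])        # illustrative Dir(gamma), not from [9]
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # any split into two groups of four
iu, ju = np.triu_indices(8, k=1)
same = labels[iu] == labels[ju]

def hellinger_matrix(P):
    S = np.sqrt(P)
    return np.sqrt(0.5 * ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1))

reps = 50_000
sps = np.empty(reps)
for r in range(reps):
    P = rng.dirichlet(gamma, size=8)          # simulated cost proportions under H0
    d = hellinger_matrix(P)[iu, ju]
    sps[r] = d[~same].mean() / d[same].mean()

print(np.quantile(sps, 0.95))                 # critical SP_0.95 for this H0
print(np.mean(sps >= 1.773))                  # p-value of the observed SP = 1.773
```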

3.2.3. The Case Described in Section 2.2.3

In this example, gender equality and the absence of real preferences between alternatives were chosen as the null hypothesis H0.
The SP distribution under this assumption for the (2, 3) partition (see Figure 5) is shown in Figure 10. For α = 5%, the critical $SP_{0.95}$ = 1.476, and the p-value for the calculated SP = 1.502 equals 4.39%. Thus, since the calculated SP slightly exceeds the critical value, we can conclude that gender has a modest but statistically significant effect on preferences.
Another possible assumed distribution is the so-called Mallows' distribution, according to which preference chains are dispersed (spread) around a certain dominant preference chain serving as a "gold standard". We refer the readers to [12] for details.

3.3. General Methodology: Modus Operandi for Analyzing a Metric Data Partition (10 Steps)

  • Decide on the OUS population.
  • Make an assumption about the type of the expected distribution of these objects within a homogeneous population.
  • Choose a distance metric suitable for this distribution (see Figure 1).
  • Decide on the factor that, in your opinion, can discriminate/distinguish between the OUSs (the heterogeneity hypothesis) and whose levels serve as the basis for dividing/separating the objects into groups (partitioning).
  • Provide a corresponding data partitioning/division.
  • Calculate the SP (as in Section 2.2, for example).
  • Simulate the SP distribution under H0 in accordance with the vector of the partition just made $(n_1, n_2, \ldots, n_k, \ldots, n_m)$ and the chosen distance metric (a condensed sketch of steps 6–10 follows this list). Every simulation process cycle includes:
    (a) Random generation of N data from a population of OUS (as per step 1) characterized by the assumed distribution (as per step 2).
    (b) Distance matrix calculation (as per step 3).
    (c) Partitioning these distances into their inter and intra components (as per steps 4 and 5).
    (d) SP calculation, which ends the cycle and returns us to (a).
  • Determine the alpha risk α of homogeneity hypothesis H0 rejection.
  • Find the (1 − α) percentile of the simulated SP distribution, or alternatively, the p–value of the calculated SP.
  • Make a final decision according to the results of step 9.
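The whole procedure condenses into a short routine; the sketch below (ours, not the authors' software) wires steps 6–10 together, with the H0 sampler and the distance metric passed in as user-supplied callables (steps 1–3). The normal H0 parameters in the usage example are arbitrary illustrative values.

```python
# Condensed sketch of steps 6-10 of the modus operandi.
import numpy as np

def sp(D, labels):
    iu, ju = np.triu_indices(len(labels), k=1)
    same = labels[iu] == labels[ju]
    d = D[iu, ju]
    return d[~same].mean() / d[same].mean()

def analyze_partition(data, labels, sample_h0, distance_matrix,
                      alpha=0.05, reps=10_000, seed=0):
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    sp_obs = sp(distance_matrix(data), labels)                        # step 6
    sims = np.array([sp(distance_matrix(sample_h0(rng, len(labels))), labels)
                     for _ in range(reps)])                           # step 7
    crit = np.quantile(sims, 1 - alpha)                               # step 9
    p_value = float(np.mean(sims >= sp_obs))
    return sp_obs, crit, p_value                                      # step 10: compare

# Usage: the TTF example of Section 2.2.1 under an assumed normal H0.
ttf = np.array([24000.0, 30000.0, 17000.0, 19000.0])
print(analyze_partition(
    ttf, [0, 0, 1, 1],
    sample_h0=lambda rng, n: rng.normal(22500.0, 5000.0, n),          # assumed mu, sigma
    distance_matrix=lambda x: np.abs(np.asarray(x)[:, None] - np.asarray(x)[None, :]),
))
```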

4. Discussion

This article continues the theme previously raised by a number of authors about processing new types of quality and reliability data and related problems [1,2,3,4,5].
The main goal of clustering OUS is to divide them into distinctively dissimilar but internally homogeneous groups. Such partitioning makes sense only if inter group differences significantly exceed intra group differences. This task becomes markedly easier when the difference between OUS can be expressed using a distance metric. In this case, we suggest that the verification of the correctness of the partition according to a certain criterion (for example, the level of a potentially influencing factor) can be carried out by comparing the average inter group and intra group distances. More precisely, the authors propose to use a ratio between these distances they call separation/segregation power (SP) as an indicator of such a comparison.
It is well known that even in a homogeneous population, objects are not absolutely identical, but differ due to random, noisy disturbances/perturbations (the so-called null hypothesis H0). This, in turn, means that even for a homogeneous population, the distances between objects are neither zero nor constant; rather, they are characterized by a certain distribution, the type and parameters of which can be very different depending on the OUS. Usually, these distributions are characterized by a so-called scale parameter, to which the mean distance is proportional. The SP distribution, as shown in the paper, however, is insensitive to this as well as to any location parameter.
For a given H0 about the behavior of a homogeneous set of OUS, the SP distribution depends only on the kind of partitioning (partition vector). Its universal theoretical analysis is barely possible because of, among other reasons, the fact that two distances with a common datum are correlated. Nevertheless, it can be calculated by simulation methods (step 7 in Section 3.3).
Three examples of different OUS and the corresponding SP distributions under H0 are discussed in the article. As the amount of data increases, the mean SP tends to 1, and the distribution itself narrows.
The location of the calculated SP value compared to its distribution under the H0 makes it possible to draw a conclusion about the expediency of the generated partition and its discrimination/separation/segregation power using standard statistical methods (comparing SPcalculated to SP1−α or the p-value to the α risk).
Though the proposed approach (see the general methodology in Section 3.3) is similar in spirit to ANOVA, it is innovative in that it is applicable to any type of object whose dissimilarity can be described by means of a distance metric: pie charts, prioritization chains, strings, tree-structured data, etc. A certain disadvantage is the need to conduct a simulation study of the SP distribution for each partitioning of different kinds of OUS and each type of their spread in a homogeneous population, as described in Section 3.3 (step 7). One way to circumvent this issue would be to create a bank of such calculators along the lines of those given in [18].
We hope that the potential inherent in the analysis of metric space quality/reliability data provided here will inspire reliability engineers to explore other territories such as data-collecting sensor systems [19], clustering, discriminant analysis, experimental design, and so on. We hope that this work will serve as a catalyst for the development of new methodologies where data-driven conclusions will become the driving force of the investigation.

Author Contributions

Conceptualization, Y.N.M. and E.B.; methodology, Y.N.M. and E.B.; software, Y.N.M.; validation, Y.N.M.; formal analysis, Y.N.M.; investigation, Y.N.M. and E.B.; resources, E.B.; data curation, Y.N.M.; writing—original draft preparation, Y.N.M. and E.B.; writing—review and editing, Y.N.M. and E.B.; visualization, Y.N.M.; supervision, Y.N.M. and E.B.; project administration, Y.N.M. and E.B.; funding acquisition, Y.N.M. and E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

Our great appreciation goes to our colleague T. Gadrich for her valuable statistical support, especially in the early stage of the research. The authors are very grateful to the four anonymous referees for their fruitful comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. How Does the Scale Parameter Influence the Sum of Distances (SD) and the SP Distributions?

Let us start with a definition. If we can write the probability density function f(d) in terms of d/s as follows:
$$f(d) = \frac{\varphi\!\left(\frac{d}{s}\right)}{s},$$
then s is called a scale parameter, and we call $d_{st} = \frac{d}{s}$ the standardized, non-dimensional random distance. For example, in Equation (2a), s means σ and
$$\varphi(d_{st}) = \frac{1}{\sqrt{\pi}}\, e^{-\frac{d_{st}^2}{4}}, \qquad d_{st} \geq 0$$
is a scale-free function.

Appendix A.2. How Does the Scale Parameter Influence the SD Distribution?

When considering the SD between data pairs, we must take into account that two pairs having a common datum are not independent, so the SD may include both independent and dependent random distances. Consider first the sum of only two distances, $d_1$ and $d_2$, under the H0 hypothesis. If, in addition, they have no common datum, they are independent, and the joint density function is inversely proportional to the squared scale parameter s:
$$f(d_1, d_2) = \frac{\varphi(d_{1,st})}{s}\times\frac{\varphi(d_{2,st})}{s} = \frac{\varphi(d_{1,st})\,\varphi(d_{2,st})}{s^2} = \frac{\varphi(d_{1,st}, d_{2,st})}{s^2}$$
In this case, the sum of the two distances is distributed in such a way that s is also a scale parameter for the SD (here $SD_{st}$ means SD/s):
$$f(SD) = \int f(d_1)\, f(SD - d_1)\, dd_1 = \frac{\int \varphi(d_{1,st})\,\varphi(SD_{st} - d_{1,st})\, dd_{1,st}}{s}$$
If the two distances contain one common datum, the proof is more complicated. Imagine that three pieces of data, $x_i$, $x_k$, and $x_j$, are randomly and independently selected from the same distribution, and we are interested in the joint distribution of $d_1 = |x_i - x_k|$ and $d_2 = |x_k - x_j|$. It is obvious that the marginal distributions of $d_1$ and $d_2$ are the same, but the joint distribution $f(d_1, d_2) \neq f(d_1)\cdot f(d_2)$. Nevertheless,
$$\iint f(d_1, d_2)\, dd_1\, dd_2 = s^2 \iint f(d_1, d_2)\, dd_{1,st}\, dd_{2,st} = 1,$$
from which it follows that $f(d_1, d_2) = \varphi(d_{1,st}, d_{2,st})/s^2$, where $d_{1,st}$, $d_{2,st}$, and φ are dimensionless quantities. That is why
$$f(SD) = \int f(d_1, SD - d_1)\, dd_1 \sim \frac{1}{s} \qquad (d_{st} \geq 0)$$
It is easy to show that under H0, f(SD) ~ 1/s can be generalized to any number of terms.
To summarize, we can conclude that the distribution function of both the numerator and the denominator of SP are inversely proportional to the scale parameter. The latter implies that the distribution of SP does not depend on the scale parameter at all, but is determined only by the type of the initial data distribution and the method of their partitioning into groups.

Appendix A.3. How Does the Scale Parameter Influence the SP Distribution?

First, let us note that the two-dimensional distribution f(SDinter, SDintra), a result of the same normalization considerations that were used in Appendix A.2, is inversely proportional to the squared scale parameter.
The ratio r = SDinter/SDintra has the following density function:
$$f(r) = \int SD_{intra}\, f(r\cdot SD_{intra},\, SD_{intra})\, dSD_{intra} = \int SD_{intra}\,\frac{\varphi\!\left(\frac{r\cdot SD_{intra}}{s}, \frac{SD_{intra}}{s}\right)}{s^2}\, dSD_{intra} = \int \frac{SD_{intra}}{s}\,\varphi\!\left(\frac{r\cdot SD_{intra}}{s}, \frac{SD_{intra}}{s}\right) d\!\left(\frac{SD_{intra}}{s}\right)$$
where $\frac{SD_{intra}}{s}$ and φ are dimensionless and scale-free. Accordingly, the distribution of SP, which differs from r only by the constant factor $dc_{intra}/dc_{inter}$, is also dimensionless and scale-free.

Appendix B

Why Is There a Correlation between Two Distances with a Common Vertex Datum?

Let us consider three arbitrarily drawn random data points (A, B, C) for which the triangle inequality between the distances $d_{A,B}$, $d_{B,C}$, and $d_{C,A}$ holds. Consider also the circle circumscribing this triangle, consisting of three arcs based on the three chords (see Figure A1). It is well known from trigonometry that, according to the sine theorem (r denotes the radius of the circle, proportional to the scale factor):
$$d_{B,C} = 2r\sin\alpha; \qquad d_{C,A} = 2r\sin\beta; \qquad d_{A,B} = 2r\sin\gamma$$
whereas
$$\mathrm{arc}(BC) = 2r\alpha; \qquad \mathrm{arc}(CA) = 2r\beta; \qquad \mathrm{arc}(AB) = 2r\gamma$$
The three angles α, β, and γ (the inscribed angles of the triangle) are not independent but are connected by
$$\alpha + \beta + \gamma = \pi$$
and, therefore, both the arcs and the chords turn out to be mutually correlated.
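The size of this correlation is easy to estimate numerically; the sketch below (ours) does so for standard normal data, for which Section 3.1 quotes ρ ≈ 0.224 and cov ≈ 0.163σ².

```python
# Correlation between two distances sharing the vertex A, for i.i.d. N(0, 1) data.
import numpy as np

rng = np.random.default_rng(4)
M = 2_000_000
a, b, c = rng.normal(size=(3, M))                 # three independent data points, M times
d_ab = np.abs(a - b)
d_ac = np.abs(a - c)
print(round(np.corrcoef(d_ab, d_ac)[0, 1], 3))    # about 0.224
print(round(np.cov(d_ab, d_ac)[0, 1], 3))         # about 0.163 (here sigma = 1)
```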
Figure A1. Graphical illustration of correlations between distances with common vertex.

Appendix C

Asymptotic Behavior of the SP Distribution

To this end, consider Equations (7)–(9) in the main text, assuming for simplicity that all groups are equal in size (a balanced design), i.e., $n_i = n$. Then,
$$dc_{inter} = \binom{m}{2}\cdot n^2$$
$$dc_{intra} = m\binom{n}{2} = \frac{m\cdot n\cdot(n-1)}{2}$$
$$VAR(MD_{inter}) = \frac{VAR(d)}{\binom{m}{2} n^2} + \frac{2\,cov\cdot m\,n\binom{(m-1)n}{2}}{\binom{m}{2}^2 n^4} = \frac{2\,VAR(d)}{m(m-1)n^2} + \frac{4\,cov\,[n(m-1) - 1]}{m(m-1)n^2}$$
$$VAR(MD_{intra}) = \frac{2\,VAR(d)}{m\,n\,(n-1)} + \frac{4\,cov\,(n-2)}{m\,n\,(n-1)}$$
$$COV(MD_{intra}, MD_{inter}) = \frac{4\,cov}{m\cdot n}$$
or, in the asymptotic approximation for large n values,
$$VAR(MD_{inter}) \approx \frac{4\,cov}{m\cdot n}$$
$$VAR(MD_{intra}) \approx \frac{4\,cov}{m\cdot n}$$
$$COV(MD_{intra}, MD_{inter}) = \frac{4\,cov}{m\cdot n}$$
Since the standard deviations of the SP numerator and denominator, according to (A16) and (A17), decrease as n (as well as m) increases, for sufficiently large n one can use the Taylor approximation [20], according to which:
$$E\!\left(\frac{R}{S}\right) \approx \frac{\mu_R}{\mu_S} - \frac{COV(R, S)}{\mu_S^2} + \frac{VAR(S)\cdot\mu_R}{\mu_S^3}$$
$$VAR\!\left(\frac{R}{S}\right) \approx \frac{\mu_R^2}{\mu_S^2}\left[\frac{\sigma_R^2}{\mu_R^2} - \frac{2\,COV(R, S)}{\mu_R\cdot\mu_S} + \frac{\sigma_S^2}{\mu_S^2}\right]$$
and, therefore,
$$E(SP) \approx 1 + \frac{VAR(MD_{intra}) - COV(MD_{inter}, MD_{intra})}{E^2(d)} \approx 1 + \frac{2\,VAR(d) - 4\,cov}{m\cdot n\cdot(n-1)\cdot E^2(d)}$$
$$VAR(SP) \approx \frac{1}{E^2(d)}\cdot\frac{2\,VAR(d) - 4\,cov}{m\cdot n}\cdot\left[\frac{1}{(m-1)\,n} + \frac{1}{n-1}\right]$$
For m = 2, it follows from the last expression that $VAR(SP) < \frac{2\,VAR(d)}{E^2(d)\,(n-1)^2}$.
Since VAR(d), cov, and E²(d) are all proportional to the square of the scale factor of the original distribution, the expectation E(SP) and the variance VAR(SP), as expected, depend on neither the location nor the scale factor of this distribution. For sufficiently large n (or m): $E(SP) \to 1$, $VAR(SP) \to 0$.
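A small simulation (ours; the balanced design m = 4, n = 20 is an arbitrary illustration) confirms that these approximations are already accurate for moderate group sizes, using the normal-case values VAR(d) = 2 − 4/π, cov ≈ 0.163, and E(d) = 2/√π with σ = 1.

```python
# Checking the Appendix C approximations for E(SP) and VAR(SP), normal data.
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 20
labels = np.repeat(np.arange(m), n)
iu, ju = np.triu_indices(m * n, k=1)
same = labels[iu] == labels[ju]

reps = 5_000
sps = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=m * n)                      # H0: one homogeneous sample
    d = np.abs(x[:, None] - x[None, :])[iu, ju]
    sps[r] = d[~same].mean() / d[same].mean()

var_d, cov, e_d = 2 - 4 / np.pi, 0.163, 2 / np.sqrt(np.pi)
e_sp = 1 + (2 * var_d - 4 * cov) / (m * n * (n - 1) * e_d ** 2)
var_sp = (2 * var_d - 4 * cov) / (e_d ** 2 * m * n) * (1 / ((m - 1) * n) + 1 / (n - 1))
print(sps.mean(), e_sp)                             # both close to 1
print(sps.var(), var_sp)                            # both small, same order of magnitude
```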

References

  1. Marmor, Y.N.; Bashkansky, E. Processing new types of quality data. Qual. Reliab. Eng. Int. 2020, 36, 2621–2638. [Google Scholar] [CrossRef]
  2. Song, W.; Zheng, J. A new approach to risk assessment in failure mode and effect analysis based on engineering textual data. Qual. Eng. 2024. [Google Scholar] [CrossRef]
  3. González del Pozo, R.; Dias, L.C.; García-Lapresta, J.L. Using Different Qualitative Scales in a Multi-Criteria Decision-Making Procedure. Mathematics 2020, 8, 458. [Google Scholar] [CrossRef]
  4. Weiß, C.H. On some measures of ordinal variation. J. Appl. Stat. 2019, 46, 2905–2926. [Google Scholar] [CrossRef]
  5. Grzybowski, A.Z.; Starczewski, T. New look at the inconsistency analysis in the pairwise-comparisons-based prioritization problems. Expert. Syst. Appl. 2020, 159, 113549. [Google Scholar] [CrossRef]
  6. Yang, W.; Chen, J.; Zhang, C.; Paynabar, K. Online detection of cyber-incidents in additive manufacturing systems via analyzing multimedia signals. Qual. Reliab. Eng. Int. 2022, 38, 1340–1356. [Google Scholar] [CrossRef]
  7. Gadrich, T.; Bashkansky, E.; Zitikis, R. Assessing variation: A unifying approach for all scales of measurement. Qual. Quant. 2015, 49, 1145–1167. [Google Scholar] [CrossRef]
  8. Feigenbaum, A.V. Total Quality Control, 3rd ed.; McGraw Hill: New York, NY, USA, 1991. [Google Scholar]
  9. Rosenfeld, Y.; Jabrin, H.; Baum, H. Costs of Non-Quality in Residential Construction in Israel; National Institute for Construction Research: Haifa, Israel, 2019. Available online: https://www.gov.il/BlobFolder/reports/research_1077/he/r1077.pdf (accessed on 16 July 2023). (In Hebrew)
  10. Le Cam, L.M.; Yang, G.I. Asymptotics in Statistics: Some Basic Concepts; Springer Science & Business Media: New York, NY, USA, 2000. [Google Scholar] [CrossRef]
  11. Vanacore, A.; Marmor, Y.N.; Bashkansky, E. Some metrological aspects of preferences expressed by prioritization of alternatives. Measurement 2019, 135, 520–526. [Google Scholar] [CrossRef]
  12. Marmor, Y.N.; Gadrich, T.; Bashkansky, E. Accuracy of multiexperts’ prioritization under Mallows’ model of errors creation. Qual. Eng. 2021, 33, 286–299. [Google Scholar] [CrossRef]
  13. McKay, A.T.; Pearson, E.S. A note on the distribution of range in samples of n. Biometrika 1933, 25, 415–420. [Google Scholar] [CrossRef]
  14. Hartley, H.O. The range in random samples. Biometrika 1942, 32, 334–348. [Google Scholar] [CrossRef]
  15. Crooks, G.E. Field Guide to Continuous Probability Distributions; Berkeley Institute for Theoretical Science: Berkeley, CA, USA, 2019; Available online: https://threeplusone.com/pubs/FieldGuide.pdf (accessed on 16 July 2023).
  16. Johnson, H.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1994. [Google Scholar]
  17. Gadrich, T.; Bashkansky, E. A Bayesian approach to evaluating uncertainty of inaccurate categorical measurements. Measurement 2016, 91, 186–193. [Google Scholar] [CrossRef]
  18. Gadrich, T.; Marmor, Y.N. Two-way ORDANOVA: Analyzing ordinal variation in a cross-balanced design. J. Stat. Plan. Inference 2021, 215, 330–343. [Google Scholar]
  19. Kumar, P.; Kumar, A. Quantifying Reliability Indices of Garbage Data Collection IOT-based Sensor Systems using Markov Birth-death Process. Int. J. Math. Eng. Manag. Sci. 2023, 8, 1255–1274. [Google Scholar] [CrossRef]
  20. Seltman, H. Approximations for Mean and Variance of a Ratio. Available online: https://www.stat.cmu.edu/~hseltman/files/ratio.pdf (accessed on 19 July 2019).
Figure 1. Examples of distance metrics.
Figure 2. Intra (dashed lines) and inter (solid lines) connections for two clusters.
Figure 3. Example of four data sets segmented into two segments.
Figure 4. Distribution of quality cost proportions in four categories.
Figure 5. Judges divided according to gender.
Figure 6. (a) Distance distribution for normal distribution of the original data; (b) Distance distribution for uniform distribution of the original data; (c) Distance distribution for exponential distribution of the original data.
Figure 7. Simulation study of SP distribution when partitioning four data points into two segments of equal size.
Figure 8. (a) The influence of the number of partition groups m on the CDF of the SP distribution. (b) The influence of the number of partition groups m on the PDF of the SP.
Figure 9. The impact of the amount of data in every one of m = 10 groups on the SP distribution.
Figure 10. The distribution of SP under H0.
Table 1. The matrix of mutual distances between TTFs.

        A        B        C        D
A       0        6000     7000     5000
B       6000     0        13,000   11,000
C       7000     13,000   0        2000
D       5000     11,000   2000     0
Table 2. Distribution of proportions of quality costs in four categories ([9], p. 109).

                         Co. 1   Co. 2   Co. 3   Co. 4   Co. 5   Co. 6   Co. 7   Co. 8
Prevention costs         0.19    0.05    0.07    0.19    0.11    0.31    0.19    0.23
Appraisal costs          0.22    0.15    0.13    0.35    0.47    0.12    0.32    0.23
Internal failure costs   0.14    0.28    0.18    0.27    0.19    0.14    0.27    0.27
External failure costs   0.45    0.52    0.62    0.19    0.23    0.43    0.22    0.27
Total                    1       1       1       1       1       1       1       1
Table 3. Matrix of mutual distances between companies.

         Co. 1    Co. 2    Co. 3    Co. 4    Co. 5    Co. 6    Co. 7    Co. 8
Co. 1    0.000    0.198    0.169    0.214    0.222    0.122    0.189    0.152
Co. 2    0.198    0.000    0.094    0.290    0.290    0.265    0.265    0.240
Co. 3    0.169    0.094    0.000    0.328    0.320    0.230    0.302    0.266
Co. 4    0.214    0.290    0.328    0.000    0.120    0.269    0.030    0.104
Co. 5    0.222    0.290    0.320    0.120    0.000    0.317    0.127    0.191
Co. 6    0.122    0.265    0.230    0.269    0.317    0.000    0.244    0.178
Co. 7    0.189    0.265    0.302    0.030    0.127    0.244    0.000    0.077
Co. 8    0.152    0.240    0.266    0.104    0.191    0.178    0.077    0.000
Table 4. The total degree of connection split.

SD_intra = 1.727    dc_intra = 12    MD_intra = SD_intra / dc_intra = 0.144
SD_inter = 4.082    dc_inter = 16    MD_inter = SD_inter / dc_inter = 0.255
SD_total = 5.809    dc_total = 28
Table 5. Distance matrix between each pair of judges.

           Judge 1   Judge 2   Judge 3   Judge 4   Judge 5
Judge 1    0         0.59      0.73      0.33      0.38
Judge 2    0.59      0         0.46      0.45      0.44
Judge 3    0.73      0.46      0         0.65      0.61
Judge 4    0.33      0.45      0.65      0         0.37
Judge 5    0.38      0.44      0.61      0.37      0