Unraveling Hidden Major Factors by Breaking Heterogeneity into Homogeneous Parts within Many-System Problems

For a large ensemble of complex systems, a Many-System Problem (MSP) studies how heterogeneity constrains and hides structural mechanisms, and how to uncover and reveal hidden major factors from homogeneous parts. All member systems in an MSP share common governing principles of dynamics, but differ in idiosyncratic characteristics. A typical dynamic is found underlying response features with respect to covariate features of quantitative or qualitative data types. Neither all-system-as-one-whole nor individual system-specific functional structures are assumed in such response-vs-covariate (Re–Co) dynamics. We developed a computational protocol for identifying various collections of major factors of various orders underlying Re–Co dynamics. We first demonstrate the immanent effects of heterogeneity among member systems, which constrain compositions of major factors and even hide essential ones. Secondly, we show that fuller collections of major factors are discovered by breaking heterogeneity into many homogeneous parts. This process further realizes Anderson’s “More is Different” phenomenon. We employ the categorical nature of all features and develop a Categorical Exploratory Data Analysis (CEDA)-based major factor selection protocol. Information theoretical measurements—conditional mutual information and entropy—are heavily used in two selection criteria: C1—confirmable and C2—irreplaceable. All conditional entropies are evaluated through contingency tables with algorithmically computed reliability against the finite sample phenomenon. We study one artificially designed MSP and then two real collectives of Major League Baseball (MLB) pitching dynamics with 62 slider pitchers and 199 fastball pitchers, respectively. Finally, our MSP data analyzing techniques are applied to resolve a scientific issue related to the Rosenberg Self-Esteem Scale.


Introduction
A Many-System Problem (MSP) is a scientific study involving many complex systems [1][2][3]. Such systems are basically governed by the same dynamic principles with many complex tuning parameters that embrace hardly measured idiosyncratic characteristics. This nature of MSPs can be seen in almost all scientific areas and real-world research involving any sort of heterogeneity, such as study subjects of different ages and genders in psychology and neuroscience [4], many distinct households in microeconomics [5], and the well-known many-body problem in physics [6], just to name a few. It is notable that the name MSP is inspired by the many-body problem.
Data observed from each individual system, in general, might not be large enough in size to sustain an insightful study into its individual governing principles and characteristics. Therefore, we need to aggregate all data from all systems in order to understand the systems' fundamental dynamics. However, as we pool all individual ID-marked data sets into a huge data set, this data pooling generates many confounding effects due to existing diverse aspects of dissimilarity across all systems. Such confounding effects are collectively termed as heterogeneity. That is, the effects of governing principles are expected to interact with the effects of the heterogeneity contained in the data. This is the dilemma facing every MSP, which is the topic of this paper.
In the literature of complex system study [3,6], as well as statistical physics [7], we could not find effective approaches to coherently resolve such a dilemma in any MSPs. One chief difficulty underlying any MSP rests on the categorical nature of data in both response and covariate sides of system dynamics. Certainly, the system-ID is categorical. Another chief difficulty is heterogeneity among all member systems. These two difficulties are linked. This categorical feature naturally gives rise to multiscale issues that are closely tied to the heterogeneity potentially existing among all member systems. First, systems' individual and distinct characteristics as fine-scale manifestations of heterogeneity can hardly be accommodated into any analytic equation or modeling structure in a systematic fashion. Secondly, the effects of interactions between the system-ID and all other features as mid-scale manifestations of heterogeneity are nearly impossible to capture precisely in man-made system descriptions. Thirdly, some individual systems are similar, while being dissimilar from other systems. Such similarity and dissimilarity among all involving systems can cause large-scale manifestations of heterogeneity.
In this paper, we first design an artificial MSP that captures the above multiscale structures and then reveals permeating effects of heterogeneity. In this example, we clearly show how heterogeneity constrains and hides essential mechanisms, and then how homogeneity can open up and reveal the true mechanisms. Then we turn to the real MSPs of interest in this paper; we consider physical pitching dynamics in Major League Baseball (MLB) in the US. A healthy starting pitcher throughout the entire season could pitch up to or even over 3000 pitches of several pitch types. Therefore, there are only several hundred pitches belonging to each pitch type for each pitcher. This amount of data is unlikely to sustain a detailed study on pitch-type specific pitching dynamics. In fact, pitching dynamics are known to be governed by the biomechanics of musculoskeletal construction and the Magnus effect of spinning baseballs in aerodynamics [8]. There are many parameters for each of these two governing principles.
Even though the biomechanics derived from musculoskeletal construction is more or less the same across all MLB pitchers, slight differences via many tuning parameters collectively make up large differences. Likewise, the Magnus effect is a universal physical phenomenon when spinning a baseball, but there are fine-scale differences in creating spins that will result in significantly distinct Magnus effects. This is why all pitchers are different. Some pitchers' pitching dynamics are closer to those of some pitchers than others. Further, different pitch types require slightly distinct biomechanical gestures and spinning mechanisms. As such, studying a collective of pitchers' pitching dynamics for biomechanical and physical governing principles is rather complicated. In this paper, we only focus on pitch-type specific MSPs in MLB.
In this paper, we demonstrate a computational approach for effectively studying an ensemble of pitchers' pitch-type specific dynamic systems through two stages. In the first stage, we resolve the following essential questions. Where are the effects of heterogeneity in this MSP example? How do these effects come about? How do they interfere with the effects of governing Magnus and biomechanical effects? We explicitly explore and confirm the answers to these questions with explicit details. In a nutshell, all answers are tied to the categorical feature of system-IDs because it is highly associative with almost all response and covariate features.
In the second stage, after exploring and confirming the effects of heterogeneity, it is essential to mitigate all aforementioned effects in order to unravel critical pattern information on governing principles. To achieve this goal, ideally, we need to partition the ensemble of pitchers by grouping similar pitchers into a collection of homogeneous groups, only within which we are able to retain a large enough data set to study group-specific pitching dynamics. Consequently, the differences among these homogeneous groups are surely large-scale effects of heterogeneity.
With the resulting manifestations in the two stages, we can see and appreciate the effects of heterogeneity. These effects weave magical fabrics contained in data's information content. This magic phenomenon mirrors what was described by Nobel laureate physicist P.W. Anderson [9] in his 1972 Science paper titled "More is Different": from more data to more relevant scales of patterns to more surprising discoveries. In this Big Data era, MSPs are and will become even more ubiquitous in the internet. The merits of such two-stage data analysis endeavors could be critical when studying a large collection of many complex systems. We demonstrate such merits by resolving an interesting scientific problem related to the Rosenberg Self-Esteem Scale in psychology and sociology [10] by analyzing a real data set collected online.

Structural Formation and Major Factors in MSP
In this section, we postulate a generic structural formation for any MSP setting. From each single complex system member of an MSP, a data point is measured and collected in a L + KD vector format. Let the first L(= 2) components be measurements or categories of the response features denoted as Y = (Y 1 , . . . , Y L ) and the rest of K(= 10) components be measurements or categories of K one-dimensional covariate features denoted as {V 1 , . . . , V K }. It is essential to note that one covariate feature, say V K , is the categorical feature of system labels or IDs.
An unspecified complex structural relation between Y and {V 1 , . . . , V K } consists of a collection of M unknown constituent mechanisms defined by M major factors {F m {A * m }|m = 1, . . . , M}, in the following fashion: Here, we do not have any prior knowledge or assumptions of M, functional forms of F m {·}, the random noise ε, nor the governing structural function G(·). We only focus on identifying feature memberships of each A * m (⊂ {V 1 , . . . , V K }) with m = 1, . . . , M. By acquiring these memberships, the layouts or patterns of constituent mechanisms within dynamics underlying Y should be visible and explainable through contingency tables of A * m -vs-Y and m∈S A * m -vs-Y. Therefore, our computational task can be simply described as: to discover the collection of major factors {A * m |m = 1, . . . , M}. This computational task of major factor selection primarily relies on information theoretical measures, as will be seen in the next section.

Methods
In this section, we first briefly review the selection protocol for the major factors underlying Y as proposed and illustrated in [11]. The basic foundation is the recently developed computational paradigm called Categorical Exploratory Data Analysis (CEDA) [12,13]. The name EDA was originally coined by John Tukey [14]. The fundamental idea behind CEDA is to let all features' natural categories assemble freely in order to shed light on the true pattern information contained in data. Then, we present a new efficient algorithm for reliability checking and a generic plan for studying any Many-System Problem.

Developments of CEDA
The first step of CEDA is to categorize each response and covariate feature via its histogram, which can be properly built using an effective algorithm developed in [15]. This step serves to reduce the noise inherent in all measurements in order to reveal the intrinsic categorical structure of the histogram . The second step employs a contingency table for  all developments involving information theoretical measures. The contingency table is  used as a platform for coupling multiple categorized features together to form and define  a new composite variable. This contingency table platform also serves as a platform for  visualizing and evaluating possibly non-linear associations between any two variables. The two directional associations are numerically evaluated through conditional (Shannon) entropy. By properly rescaling with respect to corresponding marginal (Shannon) entropy, a Mutual Conditional Entropy (MCE) [16] is calculated. This association measure is the correct one when considering categorical features.
The third step of CEDA computes the conditional entropies of Y given all possible covariate feature combinations or feature sets, since the structural categories of features are allowed to reassemble freely in these response-vs-covariate contingency tables without being subject to man-made constraints. This collection of all possible directional associations should ideally contain all vital associative patterns that indicate all constituent mechanisms underlying the designated response variable Y within an MSP. Our selection protocol identifies exactly such a collection of major factors. Each major factor is subject to reliability checks.

Major Factor Selection Based on Information Theoretical Measurements
In this subsection, we briefly review the approach to major factor selection based on the information theoretical measurements proposed in [11].
All features in Y and all covariate features, denoted as V k with k = 1, .  (A, B). For a contingency table, information theoretical measurements are natural tools for discovering associative patterns. Take (Y, A) as an example. Let all categories of Y be arranged along the column-axis, while all categories of A are arranged along the row-axis. Then, the resultant contingency table, denoted as < Y, A >, is constructed as a rectangle array of cell counts. With suitable permutations on column-and row-axes, by aggregating unoccupied zero cells as much as possible, associative patterns and relations between Y and A become graphically visible. All information theoretical measurements used here are invariant with respect to row and column permutations. Using a contingency table, we can still visualize the global and large-scale pattern formations contained in < Y, A >.
The aforementioned associative patterns can in fact be numerically evaluated via various versions of conditional entropies (CEs) by basically treating < Y, A > as a 2D histogram of bivariate (Y, A). Given a column, say Y = y, we define a discrete conditional variable. Its Shannon entropy is calculated on this column's vector of proportions, i.e., cell counts divided by its column sum, and is denoted as Therefore, it is natural to select the major factor based on the CE drop. It is worth emphasizing the fact that the conditional entropy drop indicates the shared amount information between A and Y; see also the review paper [17]:  acts like the so-called "ecological effect"; the whole is larger than the sum of its parts. The ecological effects are essential in the process of identifying a vital collection of major factors of Y, which offers an avenue for understanding mechanisms underlying Y.
On the other hand, this difference I[(A, B)|Y ] − I[A; B] could be nearly zero or even negative when A and B are highly associative. In this case, at most, either A or B may be a candidate of major factors of Y, but not both. This choice of major factor is a conservative way of decision-making.
Based on the concepts above, the following two criteria, "confirmable" and "irreplaceable", are proposed to identify a major factor of Y: As for criterion [C2: irreplaceable], the condition (a) ensures that some kind of structural dependency among all subsets of A is embraced under the constraints imposed by Y, not just the occurrence of ecological effects. This condition allows the following to happen: This selection of A * m is realistic only when the CE drop of A * m minus the CE drop of A * m is many times of B's CE drop. That is, A * m and B have to form some "strong" bonds under the conditioning of Y in order to jointly become a higher-order major factor.
For the convenience of checking this condition, the difference between the CE drop of A and the top ranked CE drop of its feature-subsets is routinely calculated. It is called "successive CE drop" or denoted as "SCE drop" in tables.
The condition (b) again ensures the ecological effect among identified major factors. Though obvious, it is worth mentioning the difference between conditions (a) and (b). Condition (a) sets a very high bar for building up any high-order major factors, such as order-3 or higher, while condition (b) only requires the fulfillment of an ecological effect for any two major factors to coexist. That is, there still exists potential for a union of two identified major factors to become a high-order major factor, but condition (a) is rather hard to fulfill.

Reliability Checking Algorithm
To check the condition [C1: Confirmable], Algorithm 1 simulates a contingency table by expanding the covariate feature-set by including an extra random-noise feature. However, when the size of the covariate feature-set is large, the number of rows of the contingency table becomes very large. This largeness renders the algorithm inefficient. In this subsection, we propose a new algorithm to resolve this computing issue. This algorithm directly estimates the mean and variance of the conditional entropies (CEs) when expanding a covariate feature-set with an extra random-noise feature.
To achieve the goal of Algorithm 1, it suffices to estimate a CE in the case of breaking a single row into χ rows, which is the number of bins of the random-noise feature. The final CE estimate would be the weighted sum of the CE estimates across all the rows in the expanded contingency table. Consider a contingency table n ij χ×P as a random matrix. Each column is a multinomial random variable with a uniform probability {1/χ, . . . , 1/χ}. Let n ·1 , . . . , n·P be the vector of original row sums. We break each of them into χ row sums. We have Let sample size n n = P ∑ j=1 n ·j and row sums n i· The goal is to obtain the distribution of the empirical conditional entropyĤ(Y|R): with some calculations below: The first term of the last equation, ∑ P The last term ∑ χ i=1 n i· n log( n i· n ) is the empirical conditional entropy −Ĥ(multi(n, [ 1 χ , . . . , 1 χ ])). In practice, it is easier to use (2) to understandĤ(Y|R). The mean ofĤ(Y|R) equals the mean of where each term can be numerically computed separately. For a positive integer N, let x = (x 1 , . . . , x χ ) be the random variable from multi(N, We can numerically compute the mean and variance of h(x). Denote m(N) and v(N) as the mean of h(x) for x from multi(N, [ 1 χ , . . . , 1 χ ]). From (2), we have For the variance, there is no exact and simple form, since the row sums n i· 's are dependent on columns. However, since the entropy of the row sums is positively correlated to the sum of the entropies of the columns, it is safe to have ∑ P j=1 v(n ·j )/n 2 as the upper bound of the variance ofĤ(Y|R): In practice, when there are many rows in a contingency table, the number in each cell of the contingency table will not be large. That is, the unique numbers in the contingency table are relatively few. Therefore, it will become efficient to estimate the mean and variance using (3) and (4). Here we remark that the estimate of the mean by (3) is in fact more accurate than that from Algorithm 2, and the variance estimation is usually within 1.2 times the true value from our experimental experience.

Algorithm 2:
Estimating the mean and the variance of the conditional entropy with a random noise feature included.
Input: The number of category of a random noise h, and the contingency table CT 0 = n ij p×q , where p is the number of outcomes of the feature-set and q is the number of categories of the response variable Y Output: The mean µ and the variance σ 2 of the conditional entropy with a random noise feature included. Compute the row sums: n i· = ∑ q j=1 n ij and the total sample size: n = ∑ v(x). end

A Generic Plan for Studying Any Many-System Problem
How can we effectively deal with complexity and heterogeneity embedded within any MSP? We propose the following study plan for MSPs under a framework with a designated response variable Y and a collection of covariate features {V 1 , . . . , V K−1 } across all the individual systems.

MSP-4:
Apply the hierarchical clustering algorithm to partition the collection of systems into homogeneous system-clusters. This clustering is undertaken based on measurement vectors of major factors without including V K .
Here, a homogeneous system-cluster means that we find no major factors containing V K when a system-cluster is treated and studied as a small version of MSP by going through the steps MSP-1 to MSP-3. The collection of homogeneous system-clusters explicitly reveals the systemwise heterogeneity. Through computational results of MSP-2 and MSP-3, we will evidently see the interacting relationships between complexity and heterogeneity through memberships of major factors within the original MSP under study.
The above study plan for MSPs is proposed based on experience. As expected, and as will be demonstrated in the MLB examples, the complexity and heterogeneity of any MSP interact with each other in many unknown ways across multiple scales. Recognizing this leads us to adopt Equation (1) by having all the functional structures be completely unspecified and unknown. Without knowing the global functional structure G(.) and the component-wise structures {F m (.)|m = 1, . . . , M}, our feature selection protocol can freely discover "major factors of various orders" by being free from man-made structural constraints. It is worth reiterating that this study plan works equally well for all data types.
Being free from structural and data-type constraints is critical and essential to understand and resolve real-world problems reliably. That is, we can not only build understanding of any complex system with a large amount of available data, but also realistically resolve MSPs in their original forms with reliability. Especially in this Big Data era, nearly all researchers admit that no modeling can stand true when faced with large amounts of data. Although the phrase: "All models are wrong, but some are more useful than others", still rings true, we do have one viable alternative: CEDA-based selection for major factors.
Certainly, such freedom of functional structures and universal validity for all data types come at the cost of facing the "finite sample phenomenon", since reliable contingency tables can only be constructed with a limited number of covariate features, which mirror the "curse of dimensionality". Such a cost could result in missing high-order major factors in MSPs. However, all data sets are finite. On the other hand, high-order major factors are harder to confirm. Nevertheless, vital and reliable collections of low-and median-order major factors constitute authentic intelligence when they are fully supported by visible and explainable patterns using contingency table platforms.

A Designed MSP Example
In this section, we design an MSP to demonstrate the complicated effects of heterogeneity. It is intuitive that such effects are especially profound when many complex systems are involved within an MSP, since a large number of constituent systems means that data points within each category of response features are spread thinner across all system-IDs, coupled with other covariate features. Consequently, all conditional entropy computations within such an MSP study are prone to the finite sample phenomenon. With such intuition in mind, we design our experimental MSP as follows.
Let the V 1 be a categorical feature with 80 categorical-IDs digital coded as: 1 to 80. These 80 ID-codes are equally divided into eight homogeneous groups defined as follows: For each ID via V 1 , we simulate 500 (Y, V 2 , . . . , V 12 ) data points. Each group contains 10 IDs, so there are 5000 data points for each of the eight groups. Based on the whole set of 40, 000 data points, we compute all conditional entropy of Y given all possible feature subsets of {V 1 , . . . , V 12 }. In particular, we also compute and report the successive conditional entropy (SCE) drops of such feature subsets. Here we reiterate once more, for a feature-set A, its SCE drop is defined as: ) minus the maximum of CE drops of all subsets of A. We report SCEs in Table 1 across one-feature to three-feature settings. All feature-sets within all settings with more than three features do not meet criterion [C1:confirmable]. We summarize the results derived from our major factor selection protocol as follows.

Selection of Major Factors in the Designed MSP
1 In the one-feature setting of Table 1, V 2 achieves the smallest CE with the largest CE drop: 0.3332. We surely recognize V 2 as the potential candidate for order-1 major factor. In contrast, V 1 achieves the second ranked CE with a CE drop of 0.0106, which is significantly less than that of V 2 , but it is almost 10 times the CE drop of any of the rest of the 10 features, including the four features {V 3 , . . . , V 6 } involving G 1 through G 4 , all of which achieve tiny CE drops around 0.001. In fact, V 1 does not meet criterion [C1: confirmable], since its CE, 2.4231, is beyond the center of CE distribution of ε (having 80 categories) with mean 2.4229 and sd = 5.2654 × 10 −4 . Therefore, V 1 is not a candidate of order-1 major factor. Furthermore, no individual feature of {V 3 , . . . , V 12 } meets the criterion [C1: confirmable].
2 In the two-feature setting, we can see that the first 11 pairs of the 12 ranked SCEs are made of V 1 coupling with {V 2 , . . . , V 12 }, and their SCEs are all many times larger than their corresponding individual CE drops. Therefore, all these feature-pairs satisfy the criterion [C2: irreplaceable]. However, even the top four pairs, V 1 _V 3 , V 1 _V 4 , V 1 _V 5 and V 1 _V 6 , do not meet criterion [C1: confirmable]. Through Algorithm 2, the distribution of CEs of (V 1 , ε) is found to have a mean of 2.2894 and an SD of 0.0019. The CE of V 1 _V 3 , 2.2895, is slightly larger than the mean. This is partly due to the fact that there are 80 categories of V 1 . Thus, there are no confirmed candidates of order-2 major factors. It is notable that the pair (V 1 , V 2 ) achieves a SCE of 0.1035, which is about 10 times as large as the CE drop achieved by V 1 . That is, by satisfying the ecological effect, V 2 and V 1 can be order-1 major factors simultaneously.
3 In the three-feature setting, the top six feature triplets are V 1 coupling any pairs of {V 3 , V 4 , V 5 , V 6 }. They achieve SCEs that satisfy criterion [C2: irreplaceable], but they do not fulfill the criterion [C1: confirmable]. The CEs of these triplets, which range between 1.2146 and 1.2198, are all larger than the mean (1.2051) of the distribution of the CEs of (V 1 , V 3 , ε). Therefore, there are no confirmed candidates for order-3 major factors.
Next, we proceed with the decomposition of the whole data set's heterogeneity into homogeneous parts. Indeed, as shown in Figure 1, G1 to G8 are correctly identified on a heatmap clustering 80IDs' vectors of means pertaining to the 11 features. When the exclusive data of G 1 through G 8 are analyzed separately, we can precisely confirm major factors of various orders underlying each of the eight groups. The order-1 major factor V 2 is confirmed, but no candidates of order-2 major factors are confirmed across the eight groups. As for potential candidate order-3 major factors, for instance in G 1 , the feature-triplet (V 4 , V 5 , V 6 ) achieves a CE of 1.0432, satisfying the criterion [C2: irreplaceable], and meets [C1: confirmable] in the following fashion. By applying Algorithm 2, this triplet's CE is more than 10 SD below the CE distribution of (V 4 , V 5 , ε) with mean 1.1094 and SD 0.0058, where ε is a uniform noise feature. Likewise, we confirm the order-3 major factors in G 2 through G 4 , respectively. We summarize the results of our major factor selection within this designed MSP in Table 2. It is noted that the results of G 6 , G 7 , and G 8 are identical to those of G 5 .

MLB's Two MSPs: Sliders and Fastballs
In this section, we study Major League Baseball (MLB) pitching dynamics from an MSP perspective.
MLB owns two databases, namely PITCHf/x and Statcast, that record every pitch delivered in all MLB games since 2006. In this paper, we study two large real-world MSPs of slider and fastball pitching dynamics in the 2017 season. From the PITCHf/x database, we select all pitchers who have pitched more than 500 sliders or fastballs, respectively. There are 62 slider pitchers and 199 fastball pitchers that meet this 500 pitches criterion. The pitch-ID is denoted as pitN.
In a structured data format, biomechanical and physical features include a pitched baseball's releasing coordinates {x0, z0}, speeds {vX0, vY0, vZ0, }, and accelerations {aX, aY, aZ} along the horizontal (X−), vertical (Z−), and pitcher-to-catcher (Y−) directions. The feature of starting speed denoted by startSp is very close to vY0. Two features are measured for each pitch's spin information: spin direction spinD and spin rate spinR. The aforementioned features are the covariate features of pitching dynamics measured around the pitcher mound, while the two features measured around the home plate, horizontal and vertical movements denoted by p f x x and p f x z , are designated as a 2D response variable Y = (p f x x , p f x z ). In both pitch-type specific MSPs, each pitching dynamic of (p f x x , p f x z ) is found to be dominated by (aX, aZ) under the presence of pitN. Therefore, we use (aX, aZ) = Y as a 2D response variable against the remaining covariate features to further explore the effects of pitN. All aforementioned features, except pitN, are quantitative and undergo categorization. All categorized features retain the same names throughout this paper.
MLB pitching dynamics are basically governed by biomechanical forces in the X, Y, and Z directions, coupled with the Magnus effect of the "spinning baseball" created by the friction between the pitcher's fingers and the baseball surface, and the gravity of earth. The Magnus effect is a 3D force constantly acting on the baseball with a 2D direction on the 2D disk that is perpendicular to the baseball's trajectory at all time. Its 2D direction is measured by spin direction (spinD), while its stability is determined by spin rate (spinR) and various directional speeds, accelerations, and other elements. It is not an easy task to clearly separate the force due to the Magnus effect from the biomechanical ones in pitching dynamics.
Different pitch types require different kinds of spinning. Back-spinning in fastballs persistently shows four or two seams rotating backward when viewed from the catcher or batter's perspective. This is a natural type of spin that can be easily created. Its Magnus effect points upward against gravity. The top-spinning in curveballs shows seams rotating forward instead. Its Magnus effect points downward to the ground and adds to gravity. Sliders are slower than fastballs, but much faster than curveballs in speed. Their spinning direction has a much wider range than fastballs and the curveballs. That is, the Magnus effect makes sliders have wider ranges of vertical movement (measured and denoted by p f x Z ) and horizontal movement (measured and denoted by p f x X ) than that of fastballs and curveballs.
These two measurements measured around the home plate are critical to pitchers when trying to effectively deal with batters. Therefore, we first designate (p f x Z , p f x X ) as the 2D response variable. Then, the MSP of interest can be very precisely depicted as the fundamental question: what are the major factors underlying (p f x Z , p f x X ) among many pitchers? This description does not mean that the identified collection of major factors would manifest and separate Magnus effects and biomechanical forces in a clear-cut fashion. On the contrary, the collective of major factors only reveals major factors that most directly cause the designated response variable. That is, major factors are very sensitive to the choice of response variables.
For the two focal pitch types in the 2017 MLB season, there are 62 slider pitchers and 199 fastball pitchers who pass the 500 pitches threshold. We have a 2D response variable Y = (p f x Z , p f x X ) and 12 covariate features that all are measured around the pitcher's mound.

Sliders MSP
A heatmap and a network of MCE-based associative patterns of these 14 features, responses, and covariates are given in panels (A) and (B) of Figure 2. It is clear that featurepairs (p f x X , aX) and (p f x Z , aZ) are highly associative, while aX and aZ are not. Based on the heatmap and network, we clearly see diverse associative patterns among these 14 features.
We computed conditional entropies for all possible combinations of 12 covariate features. The resultant table is too large to be included entirely. Thus, we report only topranked CEs up to five-feature setting in Table 3 and SCE drops up to four-feature setting in Table 4. These two formats of tables are used throughout the three MLB-related MSPs for expositional efficiency of showing potential candidates of major factors. However, our search for feature selection of major factors is conducted based on all possible feature-sets. We summarize patterns from our feature selection protocol and list our reasoning based on

1:
In the one-feature setting, we see four features-{spinD, aX, pitN, aZ}-that achieve CE drops more than 1.1000. They are potential candidates of order-1 major factors. Among these four features, {spinD, aX, pitN} are moderately associated based on Figure 2A,B. This fact would significantly reduce their potentials to simultaneously be order-1 major factors.

2:
Considering the two-feature setting, the feature-pair (aX, aZ) achieves the lowest CE, 0.9264, and meets the criterion [C1: confirmable], see Figure 3A. However, its CE drop, 3.7626, is almost equal to the sum of individual CE drops of aX and aZ: 3.8032. Even though, based on Figure 2A Figure 3. However, it is far from meeting the criterion [C2:irreplaceable].
As for triplet (aY, vZ0, pitN), it achieves the largest SCE drop and seemingly has potential to be order-3 major factor. However, it is not, due to the criterion [C2:irreplaceable].
The reason is this is that triplet (aY, vZ0, pitN) is a union of (aY, pitN) and (vZ0, pitN) with individual CE drops of 1.5820 and 1.5908, respectively. The sum of both CE drops, 3.1728, is much larger than the CE drop of the triplet: 2.7441(= 4.6890 − 1.9449). Despite this, as indicated above, the collection {pitN, (aY, vZ0)} is a valid one. That is, it takes much more to become a high order major factor.
In conclusion, there are no order-3 major factors confirmed.
From the above feature-selection, we identify {aZ, (vY0, z0)} as the chief collection of major factors of Y = (p f x X , p f x Z ), and two alternative collections: {aX, (vZ0, z0)} and {aX, (vZ0, aY)}. These three collections contain one order-1 and one order-2 major factor. We found no order-3 or higher orders. Each order-1 major factor, either aZ or aX, within any of these three collections contributes significantly more than its corresponding order-2 major factor. If the fourth alternative collection is needed, then it would be {aZ, (spinR, z0)}.
It is also evident that the effect of heterogeneity is not explicitly present because of pitN is not selected as an order-1 major factor, while the Magnus effect is also not visible, since spinD is not selected as one of order-1 major factors in the three selected collections. However, collections such as {spinD, (vZ0, z0)} and {spinD, (spinR, z0)} are not selected because their CE values are much higher than the selected ones.
Finally, all these collections of major factors of Y = (p f x X , p f x Z ) involve two features of four aspects of the pitching dynamics: (1) the acceleration in the Y-direction (aY), (2) the vertical releasing speed (vZ0), (3) spin rate spinR, (4) the vertical coordinate of the releasing point z0. This fact collectively indicates that our understanding of the underlying dynamics of Y = (p f x X , p f x Z ) should be derived from multiple perspectives. It is also evident that the effect of spinR is not ignorable; nonetheless, it is definitely not the primary effect.
On the other hand, aX and aZ seem to play key roles in Y = (p f x X , p f x Z ). This pair achieves a CE drop of 3.7626 from a CE of Y: 4.6890. In contrast, the chief and alternative collections of major factors of Y = (p f x X , p f x Z ) have CE drops of about or less than 2.6890. That is, we need to further investigate by using Y = (aX, aZ) for the 25% unexplained CE beyond our collections of major factors.
Another motivation for looking at the dynamics of Y = (aX, aZ) is the fact that the significance of the Magnus effect is not evidently reflected in the collections of identified major factors of Y = (p f x X , p f x Z ). In fact, the spin direction (spinD) is not a primary member of the identified collections of major factors. Obvious reasons might be that this 2D response variable is highly and directly associated with aX, aZ and the heterogeneous effect via pitN. That is, the Magnus effect and the heterogeneity effect have likely been overshadowed by aX and aZ. To further investigate this speculation, we need to conduct another study via feature selection for major factors of response variable Y = (aX, aZ).
There are 10 covariate features for this investigation in the same MSP of 62 slider pitchers. Likewise we compute all CEs for all possible combinations of 10 features. We report up top 10 CE values across five feature-settings in Table 5 and the top 10 SCE drops in Table 6 across four feature-settings. Computational results of feature selection for major factors are summarized below starting from one-feature setting to five-feature setting. The CE of Y is 4.7213.  1 In the one-feature setting, spinD and pitN achieve the top two lowest CEs. Their CE drops, 1.6912 and 1.1127, respectively, are many time larger than the rest of eight features' CE drops. However, we do not expect them to become two separate order-1 major factors due to their moderate mutual association. This can be seen through the fact that the feature-pair (spinD, pitN) achieves a SCE drop of 0.5410 from spinD, which is less than half the CE drop of pitN. Again, the selection of an order-1 major factor is highly dependent on the selection of order-2 major factors.  , z0), that fulfill the two criteria. They are potential candidates for order-2 major factors. Any of them could become an order-2 major factors depending on which order-1 major factor is selected.
With respect to all order-1 and order-2 candidates for major factors, we select the collection {spinD, (vZ0, z0)} as the chief collection of major factors of Y = (aX, aZ).
3 In the three-feature setting, all top 10 triplets of CEs involve pitN, while seven out of ten involve the spinD. These 10 triplets are not the top 10 of CE drops and they achieve rather uniform CEs. According to Algorithm 2, they all fail to meet the criterion [C1:confirmable]. For example, the triplet (vZ0, pitN, aY) achieves the lowest CE of 1.9577 in this three-feature setting. This CE value is far beyond the range of the simulated CE distribution of (vZ0, pitN, ξ), with mean 1.7702 and SD 1.9173 × 10 −3 .
Our conclusion in this investigation where Y = (ax, az) is that the fact that the three collections have spinD as the order-1 major factor is rather natural based on general knowledge of pitching dynamics. The three members of order-2 major factors are pairs from {z0, aY, vZ0}. That is, the biomechanical features do play some important roles underlying the dynamics of Y = (ax, az). Further, the effects of heterogeneity within this collective of 62 slider pitchers systems seem evidently overshadowed by the finite sample phenomenon in the three-feature setting and beyond, though these effects are reflected through presences of multiple candidates of order-2 major factors.

Fastballs: 199 Pitchers
The MSP of fastballs in the 2017 MLB season consists of 199 pitchers with 14 features. In the two panels of Figure 4, the heatmap and the network based on the 14 × 14 MCE matrix reveal again that feature-pairs (p f x X , aX) and (p f x Z , aZ) are highly associative, but that the pair (aX, aZ) is not.
Again, we first consider the 2D bivariate response variable Y = (p f x X , p f x Z ). The CE of Y is calculated as 4.7167. According to the two criteria [C1: confirmable] and [C2: irreplaceable], we summarize our feature selection for major factors of Y = (p f x X , p f x Z ) based on the CEs of all the possible feature-sets from the 12 covariate features. Only the top ranked CEs and SCE drops are reported in Tables 7 and 8.

1:
With regard to the one-feature setting, we see four features, {spinD, aX, pitN, aZ}, that achieve CE drops larger than 1.2000. They are clearly potential candidates for order-1 major factors, since spinD and aX are highly associated and both are moderately associated with pitN. Such associative relations among these three features keeps them from being order-1 major factors simultaneously. In contrast, aZ is a strong potential candidate for an order-1 major factor.

2:
In the two-feature setting, the feature-pair (aX, aZ) achieves the lowest CE of 1.7424, with a CE drop of 2.9763, which is almost equal to the sum of the individual CE drops of aX and aZ: 2.9197; that is, I[aX; aZ|Y ] I[aX; aZ] . Together with the fact that, based on Figure 4A,B, aX and aZ are independent; only one of them can be an order-1 major factor in a collection of major factors. We also identify six feature-pairs, (aZ, vY0), (aY, vY0), (z0, vY0), (vZ0, pitN)
Likewise we report top ranked CEs and SCE drops in two tables, Tables 9 and 10, respectively: that is, our results of feature selection based on CEs of all possible feature-sets among the 10 covariate features.

1:
In the one-feature setting, spinD and pitN achieve top two ranked CE values. It is not surprising that spinD plays the sole role of order-1 major factor in the identified collections of major factors. It is surprising that, as seen below, we need to bring in aY and vZ0, two separate order-1 major factors, for the sake of improving the CE performances beyond spinD, to construct collections of major factors.

2:
In the two-feature setting, the top nine feature pairs achieving the lowest CEs involve spinD. However, none of these nine pairs fulfill the criterion [C2: irreplaceable].
On the other hand, the feature-pair (pitN, aY) and (pitN, vZ0) ranked second and third on the list of top 10 SCE drops, both fail to meet the criterion [C2: irreplaceable], while (pitN, spinR) satisfies both criteria. We also report no candidates for order-2 major factors that do not include pitN. This is a rather surprising observation.

3:
Considering the three-feature setting, the top 10 feature triplets on CEs are primarily extended from feature pair (spinD, pitN). This is not surprising due to the dominant effect of spinD. They satisfy the criterion [C1: confirmable] (see panel (C) of Figure 5) but not [C2: irreplaceable]. In contrast, the top 10 feature triplets on SCE drops are all involved pitN. They satisfy the criterion [C2: irreplaceable], but not [C1: confirmable].

Comparisons of Summarized Results from Slider and Fastball
We collect and compare major identified factors of various orders via our feature selection protocol on both pitch types, sliders and fastballs, with respect to two different kinds of response variables, Y: (p f x X , p f x Z ) and (aX, aZ), in Table 11. Such comparisons reveal fundamental differences of these two pitching dynamics, and at the same time, shed light on the effects of heterogeneity within both MSPs. Table 11. Comparison of major factors of sliders and fastballs with respect to Y = (p f x X , p f x Z ) and Y = (aX, aZ).

Pitch Type
Y Order-1 MF Order-2 MF Alternative MFs Alternative MFs When considering Y = (p f x X , p f x Z ) as the response variable, both pitch types select three collections of triplets in an identical format: one order-1 major factor, either aX or aZ, and one biomechanical order-2 major factor. Newton's second law of force dictates that aX and aZ as directional forces govern the two directional horizontal and vertical movements: Y = (p f x X , p f x Z ). In both pitch types, we also see the heterogeneity effects via pitN commonly interacting with biomechanical features in a format of candidates of order-2 major factors such as, (aY, pitN), (z0, pitN), (spinR, pitN), and (vZ0, pitN). From a biomechanical perspective, we find that slight differences rest on the candidates for order-2 major factors: (spinR, vY0) for sliders, and (aZ, vY0) and (pitN, vY0) for fastballs.
In other words, their differences are primarily tied to "speed" (vY0 or startSp). The coupling effect of (spinR, vY0) might be closely related to the capacity of control of slider pitches. As for fastballs, the speed (vY0 or startSp) is a basic requirement. It is surprising to see that the coupling effects of speed vY0 and vertical acceleration aZ on Y = (p f x X , p f x Z ) are somehow critical in this MSP of fastball pitching dynamics. The candidacy of order-2 major factor (pitN, vY0) indicates that there are significant differences among the 199 fastball pitchers, while this is not the case in sliders.
When Y = (aX, aZ) is the response variable in both pitch types, we found three triplet-based collections of major factors in the sliders, but only two pair-based and one singleton-based collections of major factors in the fastballs. The feature spinD is apparently an order-1 major factor for Y = (aX, aZ) in both pitch types.This clearly points to the fact that the Magnus effect is primarily contained in horizontal and vertical accelerations (aX, aZ).
The effects of heterogeneity among 62 slider pitchers and 199 fastball pitchers are manifested through the same four candidates of order-2 major factors: pitN coupled with one biomechanical member of feature-set {aY, vZ0, spinR, z0}. The impacts of such heterogeneity effects via pitN seem much more severe in fastballs than in sliders.
Further, the presences of (spinR, vX0), (spinR, vY0) as candidates for order-2 major factors in sliders, but not in fastballs, imply characteristic differences between these two pitch types. These two pairs clearly reveal the effects of spinR by coupling the two speed features vX0 and vY0, which are commonly shared by all 62 slider pitchers. Therefore, spinR is most likely linked to better control and reliable stability of slider pitches. In contrast, spinR only plays a minor role in fastballs through one aspect of interacting relations with pitN.
Here, we reiterate that pitN appears in candidates of order-2 major factors by coupling wirh primary biomechanical features under Y = (p f x X , p f x Z ) and Y = (aX, aZ). This evidence strongly indicates that the heterogeneity in these two MSPs must be accommodated before performing any inferences on Y = (p f x X , p f x Z ). This is one of the chief characteristics of MSPs.

Taking Care of Heterogeneity
From the results reported above, it is essential to recognize that pitN neither ever stands alone as an order-1, nor is a member of identified order-2 major factors. Its effects are always revealed through interacting relations with biomechanical and physical features due to it being a candidates for an order-2 major factor. Such interacting relations severely constrain our understanding of the pitching dynamics of both pitch types. In fact, such interacting relations reflect pitN's diverse associations with a majority of the biomechanical and physical features. As such, these relations are pieces of information on heterogeneity revealed in many feature-specific aspects. What about global information on heterogeneity regarding similarity and dissimilarity among pitchers in both MSPs?
With such global information in mind, it becomes necessary to correspondingly partition the whole collection of pitchers into homogeneous pitcher-groups. By doing so, we want to discover more diverse aspects of pitching dynamics, and at the same time bring out the global distinctions across various homogeneous groups.
For the two MSPs of slider pitchers and fastball pitchers, we propose here to perform hierarchical clustering (HC) on the two collections of pitchers with respect to contingency tables of triplet (aX, aZ, vY0) against pitN. These three features in the triplet involve order-1 and order-2 major factors of Y = (p f x X , p f x Z ). The two resulting HC-trees are given in Figure 6. Each HC-tree is marked with various scales of composition of pitcher-groups.
For the sliders, we first consider partitioning the 62 pitchers into five branch-based groups indexed A to E. We designate each branch-based group as one MSP and check whether it embraces homogeneity or not. For such a goal in each branch-based MSP, we conduct feature selection for major factors of the response variable Y = (aX, aZ). We summarize our findings of major factors of these five MSPs in Table 12.  In this table, when checking the criterion [C2: irreplaceable], we make sure that the SCE drop is at least three times that of Condition (a) for any candidate feature-sets. To test for the criterion [C1:confirmable], we use Algorithm 2. The selected collections across these ive MSPs share spinD as the chief order-1 major factor. Then spinD is coupled with various biomechanical features as order-1 major factors with relatively minor effects.
We see that pitN appears as one member of the candidates of order-2 major factors of Y = (aX, aZ) in MSPs defined by branches B, D and E, but not by branches A and C. That is, the branch A-and C-based MSPs embrace homogeneity among pitcher-systems, while branch B-, D-, and E-based MSPs still embrace heterogeneity to a much less extent because the number of such candidates is just two. It is also noted that throughout these five MSPs, pitN achieves the top 2 ranked individual CE drops, which are significantly smaller than the CE drops of the order-1 major factor spinD. We also observe that the SCE drop of (spinD, pitN) is rather small. Therefore, pitN is ruled out as a candidate for an order-1 major factor.
Another essential visible pattern in Table 12 is that the candidates for order-2 major factors of Y = (aX, aZ) become abundant and diverse by involving only biomechanical features in these 5 MSPs. This phenomenon is different from the interacting relations between pitN and biomechanical features in the original MSP consisting of 62 slider pitchers. This fact reflects P.W. Anderson's "More is Different" perspective of MSPs; its homogeneous components likely reveal more diverse and detailed group-idiosyncratic characteristics.
For fastballs, we first partition the 199 pitchers into three large groups with respect to the three major branches marked and indexed by A, B, and C, in panel (B) of Figure 6. These three large branches are unlikely to embrace homogeneity. Hence, we further partition the C branch into six sub branches marked and indexed by {C1, C2, . . . , C6}. These six sub branch-based MSPs seemingly embrace homogeneity, as seen in Table 13. That is, all their major factors of Y = (aX, aZ) in the six MSPs do not explicitly contain pitN. Throughout these six MSPs, pitN achieves an individual CE drop that is significantly smaller than the CE drop of spinD. The MSP C1 does not support any triplets because of the criterion [C1:confirmable]. Only two collections of order-1 major factors are identified within this MSP. Three collections consist of spinD coupling with one of the members of {vZ0, aY, vX0}. The fourth collection is {spinD, spinR}.
The MSP C2 does support one collection of triplets in the format of one order-1 major factor and one order-2 major factor. The other two collections consist of spinD coupled with another less effective order-1 major factor such as vZ0 and aY. There are 13 candidates for order-2 major factors identified, but not a single pair contains pitN as a member. Likewise, in MSP C3, there are two collections of triplets: one order-1 major factor and one order-2 major factor. The alternative is a collection of two order-1 major factors: {spinD, aY}. There are also 12 candidates of order-2 major factors identified without pitN as a member. In a similar fashion, in MSP C5 there are three collections of triplets supported by the data and there are 13 candidates for order-2 that do not include pitN. In MSP C6, identified collections are similar to those found in MSP C3.
In MSP C4, the first two selected collections of triplets share an unusual format: two order-2 major factors. The members of these two triplets are all biomechanical features without spinD. The third collection is {spinD, aY}. There are no other candidates for order-2 major factors. Overall, we identify abundant candidates for order-2 major factors that do not involve pitN at all across the six MSPs. In other words, the effect of heterogeneity seems to have disappeared.
Up to this point, our slider and fastball examples strongly suggest that a rigorous way for analyzing any MSPs containing a large number of complex systems must take the following steps. Firstly, we explore and understand how the complexity of the underlying dynamics interacts with heterogeneity. Secondly, this exploration leads us to find a proper clustering and grouping scheme, such that similar members of the original collection of systems are grouped together in order to embrace homogeneity. Thirdly, when such homogeneity is achieved across all subdivided groups, a more diverse and finer scale group-specific complexity can be discovered. This is because a group of similar systems will collectively retain a much larger number of data points than any of its single member systems. This approach for analyzing any large MSPs by going from the heterogeneity of the whole to the diverse homogeneity of its constituents is a way of seeing and understanding "More is Different" in MSPs.

Self-Esteem Across Gender and Age Human Complex Systems
In the very last section of this paper, we explore a potential merit of MSP study in a research area where psychology, psychiatry, and sociology intersect. The scientific question can be simply stated as: do people of different genders and ages respond to the Rosenberg Self-Esteem Scale differently? Prof. Morris Rosenberg studied dynamic changes of late adolescents' self-image in his well known 1965 book [10]. Boys and girls of 15 to 18 years of age, the so-called late adolescence, must make drastic self-image changes in response to drastic physiological and psychological developments due to changes in their own bodies, as well as the different societal potentials being discovered and the many career decisions that must be made. In order to study this topic, he designed the popular Rosenberg Self-Esteem Scale with 10 questions for his study subjects.
More than half of a century has passed since the publication of his book. Intuitively speaking, human self-image might be shaped, for example, by the abundant information available on the Internet. In fact, nowadays, all age and gender groups are exposed to and share a huge amount of information in regard to our world's drastic environmental and technological changes. Do such visible world-wide changes in our living conditions motivate and affect people's self-image differently or similarly?
Nowadays, the Rosenberg Self-Esteem Scale has become a popular on-line test. Many persons who are far outside the original domain of application of this test have taken the test and their results are recorded. A Rosenberg Self-Esteem Scale data set from Kaggle can be found via the following link: https://www.kaggle.com/yamqwe/rosenberg-selfesteemscale (accessed on 3 January 2022). When addressing such data, it is reasonable to first ask whether seemingly diverse systems defined across gender and age axes are equal through the perspective of the Rosenberg Self-Esteem Scale, This is one version of an MSP without specifically targeted Re-Co dynamics. Only by arriving at an answer to this MSP question can the second question be formulated accordingly.
Under the MSP setting, the first question can be stated as: does potential heterogeneity exist across spans of the age-axis and gender-axis or not? This is a rather interesting issue from the perspective of psychiatry, psychology, and sociology. We want to shed some light on this issue via the computational major factor selection method developed and illustrated in the previous sections.
This data set consists of the results of 47,974 subjects who took the online Rosenberg Self-Esteem Scale. We deleted subjects younger than 10 or older than 70. Subjects who did not indicate their gender information are collected into the "none was chosen" category of the gender variable. There are 10 questions in the Rosenberg Self-Esteem Test. Each subject can respond on the following scale: 1 = strongly disagree, 2 = disagree, 3 = agree, and 4 = strongly agree (0 = no answer).         Here we propose one fundamental way of resolving the aforementioned issue by looking into all possible response-to-covariate (Re-Co) dynamics and checking the effects of heterogeneity. For expositional simplicity, we use only one instance as an example. Given Q 1 , it is taken as a 1D response feature Y = Q 1 , while the rest of the nine questions Q 2 -Q 10 are taken as covariate features. These 10 features are ordinal. We also take Age and Gender as two additional covariate features. Specifically, Age takes categorical-membership values among the ordinal age-groups [10,14], [15,18], [19,22] 70], while V 11 = Gender takes categorical values of 1 ="male", 2 ="female", 3 = "other"; 0 ="none was chosen". That is, the MSP under study here is specified by the collection of complex systems encoded with the bivariate categories of (Age, Gender).
The pairwise associations among these 12 categorical features are presented through an MCE-based heatmap and its network in Figure 7. A glimpse of an associative map of these 12 features is seen through the three small communities in the network. When showing heterogeneity effects via our major factor selection, Age and Gender will be treated as two separate covariate features, as is demonstrated in Tables 14 and 15.     2 In the two-feature setting, we can see that the SCEs are all significantly lower than corresponding CE drops among features from Q 3 to Q 10 . That is, all feature-pairs fail the criterion [C2: unreplacable], including pairs involving Age and Gender. Therefore, no order-2 major factors can be seen. The primary reason for this phenomenon is that Q 2 is associated with all other covariate features to the degree that there are no pairs satisfying the ecological effects.
We conclude that Q 2 is the solo order-1 major factor of the Q 1 -based Re-Co dynamics and that the effects of heterogeneity of Age and Gender are absent. In other words, from the perspective of the Q 1 -based Re-Co dynamics, all age-vs-gender defined systems are not heterogeneous. To fully resolve this issue, we need all possible Re-Co dynamics. This task will be undertaken in a separate report.
To further address this issue in depth, we go into the phase of decomposing the collection of complex systems into homogeneous groups; that is, the categories of the bivariate (Age, Gender) are clustered into homogeneous groups, as shown in Figure 8. Then, we apply our major factor selection protocol to further check whether the effects of Age or Gender exist within any of these homogeneous groups of systems.  Figure 8. Decomposing the collection of systems defined by Age and Gender into homogeneous groups in the Rosenberg Self-Esteem Scale. Bivariate-encoded Age-vs-Gender systems are arranged along the column-axis, while the bivariate-encoded Q2-vs-Q6 categories are arranged along the row-axis. An HC-tree is superimposed on the column-axis for identifying homogeneous groups of systems.
Based on Figure 8, we select three obvious cluster-based groups: R 1 = {(2, 2), (3, 2)}, R 2 = {(4, 2), (3, 1)}, and R 3 = {(2, 1), (4, 1), (5, 1), (5, 2)}. The sample size is 16,800 for R 1 , 11,721 for R 2 , and 15,312 for R 3 . The single major factor collection {Q 2 } is selected for R 3 and R 3 , while a collection of two order-1 major factors {Q 2 , Age} is selected for R 2 , since Q 2 and Age jointly achieve a SCE drop of (0.0046) against the CE drop of Age of (0.0024). Thus, the ecological effect is satisfied. In addition, Age also meets the criterion [C1:confirmable], since its CE of 1.1713 is below the CE distribution of ε (having the same number of categories of Age) with a mean of 1.1736 and SD of 1.2193 × 10 −4 . Therefore, Q 2 and Age can both be order-1 major factors. Nevertheless, Q 2 is still the dominant order-1 major factor. Hence, we still conclude with a coherent result: systems within these three homogeneous groups are not heterogeneous from the perspective of Q 1 dynamics.
At the end of this example, we remark that the widely used Rosenberg Self-Esteem Scale was conceptualized by its author as a single-factor scale with scores ranging along a continuum of low self-esteem to high self-esteem [10]. From this perspective, our identification of Q 2 as the single major factor is seemingly coherent. Further, when three versions of rewording on Rosenberg Self-Esteem Scale were devised and a much smaller data set was collected and analyzed [18], the results of factor analysis [19] indeed indicated that the original version fits a two-factor model, while positively and negativey reworded versions fit single-factor models. It is noted that these results based on factor analysis are fundamentally irrelevant to the issue under study here.

Conclusions
We demonstrated the versatile merits of a CEDA-based protocol for selecting major factors of various orders as a proper way of studying any MSP. First, it is essential to reveal patter -information of "complexity interacting with heterogeneity" embedded within MSPs. Secondly, the heterogeneity embedded within the whole ensemble of complex systems needs to be broken down into homogeneity embraced by constituent groups of similar complex systems. Finally, similarity-based homogeneity and largeness of the data set will enable each constituent group to manifest detailed multiscale characteristics via major factors of various orders. This is taken as the true implication of P. W. Anderson's "More is Different" under a structured data setting. This MSP implication of breaking heterogeneity for the sake of unraveling possibly distinct collections of hidden major factors across homogeneous parts may be fundamental and critical in this Big Data era.
With an understanding of such an MSP implication, we further demonstrate its merits in formulating resolutions to a scientific problem related to the Rosenberg Self-Esteem Scale in psychology and sociology. This demonstration helps us recognize the degree of the spread of MSPs in the real world and simultaneously indicates the huge potential of our computational major factor selection techniques.
As scientists, we want to prevent data-driven understandings being twisted and information content being compromised. Nowadays, MSPs can be ubiquitously found in diverse research fields and real-world businesses. Proper discovery of data's authentic information content and true understanding of it should greatly help our societies. In this paper, we demonstrate that data-driven understandings can be supported by visible and explainable relational patterns found on the simple platform of contingency tables. This is essential and important for data analysis in this Big Data era.
Furthermore, we emphasize that our CEDA computations work for all data types. This is an essential virtue of data analysis, since its categorical nature involves features of any data types. By employing such a categorical nature in data, the contingency table platform becomes natural, as do information theoretical measurements. Consequently, the patternformation brought out by conditional entropy and mutual information is natural, so it reveals and facilitates understanding pertaining to MSPs of interest. In other words, this CEDA-based data analysis is an effective way of studying an ensemble of complex systems.
By visualizing how complexity interacts with heterogeneity and how homogeneity unravels many more major factors in the two MLB-related MSPs, we recognize the fact that a group of similar pitchers' collective pitching dynamics is indeed much more involved in many aspects than the whole collection of pitchers, as well as a single pitcher's pitching dynamics. This recognition allows us to fundamentally resolve the issue of how to properly compare the pitching dynamics of a group of pitchers.
By resolving these real-world MSPs, we also demonstrated how to successfully implement the two criteria [C1: confirmable] and [C2: irreplaceable] together with reliability checks. By recognizing existential heterogeneity and exploring its complicated effects and then teasing out diverse versions of homogeneity in all constituent groups, our various collections of major factors bring out the common governing mechanisms shared by all involved complex systems and, at the same time, they manifest distinctive mechanisms that characterize each of the identified groups of similar complex systems. The reliability check using Algorithm 2 is critical when subject to the finite sample phenomenon, even in the Big Data era.
Finally, we summarize the chief concept underlying our CEDA-based feature selection for major factors: "Let data's categorical nature assemble freely and naturally to shed light on complexity, heterogeneity, and homogeneity in MSPs".