Segmentation in Structural Equation Modeling Using a Combination of Partial Least Squares and Modified Fuzzy Clustering

The application of structural equation modeling (SEM) assumes that all data follow a single model. This assumption may be inaccurate in certain cases because individuals tend to differ in their responses, and failure to consider heterogeneity may threaten the validity of the SEM results. This study focuses on unobservable heterogeneity, where the difference between two or more data sets does not depend on observable characteristics. We propose a new method for estimating SEM parameters from data containing unobserved heterogeneity, assuming that the heterogeneity arises from both the outer model and the inner model. The method combines partial least squares (PLS) and modified fuzzy clustering. Initially, each observation is randomly assigned a weight in each selected segment. These weights are then iteratively updated so as to minimize an objective function, namely the sum of the weighted squared residuals from the outer and inner models of PLS-SEM. We then conducted a simulation study to evaluate the performance of the method by considering various factors, including the number of segments, model specifications, residual variance of endogenous latent variables, residual variance of indicators, population size, and distribution of latent variables. From the simulation study and an application to actual data, we conclude that the proposed method can classify observations into the correct segments and precisely predict the SEM parameters in each segment.

There are two different approaches for estimating the parameters of SEM, namely the covariance structure-based approach [18] and the component-based approach, which is also known as partial least squares (PLS) modeling [19]. In the covariance-based approach, the maximum likelihood method is normally used for parameter estimation, whereby the indicators are assumed to follow a multivariate normal distribution [2,18]. The PLS approach to SEM becomes an alternative if the multivariate normality assumption for the indicators is not fulfilled. Unlike the covariance-based approach, PLS focuses on estimating the latent variable scores [20].
In applying SEM, it is assumed that the data are homogeneous and follow a single model. This assumption may be inaccurate because individuals tend to vary in their responses. If the parameters of specific segments differ significantly from those of other segments, the use of a single model for the aggregated data can be very misleading [21]. Failure to consider such heterogeneity can threaten the validity of the SEM results and lead to wrong conclusions. Discovering segments in the data that are characterized by different SEM models provides richer information on the results of the analysis, because the information is derived not only from the aggregated data but also from each segment. There are two types of heterogeneity: observed and unobserved. Unobserved heterogeneity arises when the difference between two or more data groups does not depend on observable characteristics [22]. It may be due to differences in the measurement and structural models. Unobserved heterogeneity is one of the problems faced by social researchers [23]. When unobserved heterogeneity exists, the researcher cannot generalize the results from the aggregate-level data analysis, but must account for differences in model relationships by establishing adequate segments of observations [24]. Uncovering this unobserved heterogeneity is a requirement for obtaining valid results when using SEM. Conventional segmentation methods usually fail in SEM because they only consider data from indicators and ignore the relationships between latent variables [25].
Within the context of SEM, several researchers have proposed segmentation techniques to overcome this heterogeneity, such as finite mixture SEM [23], finite mixture PLS [21,22,24], PLS typological regression [26], response-based unit segmentation PLS (REBUS-PLS) [27], and hierarchical Bayesian SEM [28]. In addition, a segmentation method based on the PLS genetic algorithm (PLS GAS), which uses guided random searching to find an optimal solution in a complex search space, was proposed by Ringle et al. [29]. There is also the PLS iterative reweighted regression segmentation (PLS IRRS) method, which is inspired by the M estimator used to reduce the effect of outliers in regression models [25].
Reviewing these segmentation techniques reveals weaknesses that limit their application in many research situations. For example, finite mixture PLS (FIMIX-PLS) [21] only considers heterogeneity in the structural model and imposes distributional assumptions on endogenous latent variables, which is contrary to the nonparametric character of PLS path modeling. As an extension of PLS typological path modeling, the REBUS-PLS response-based unit segmentation procedure overcomes some of the limitations of FIMIX-PLS [27]. REBUS-PLS, however, has weaknesses of its own. For example, it can only be used for reflective measurement models. Moreover, REBUS-PLS determines the initial partition and the number of segments by applying hierarchical clustering to the residuals of the structural and measurement models at the aggregate data level. This initial step is problematic because, for large amounts of data, it is not easy to read the number of segments from the resulting dendrogram [30]. In addition, PLS GAS has weaknesses regarding the time it takes to run the algorithm: it takes more than one hour for calculations on simple models, and it can take longer for more complex models [25]. Meanwhile, PLS IRRS assumes, similarly to FIMIX-PLS, that unobserved heterogeneity arises only in the structural model.
We introduce a new method for SEM modeling of data containing unobserved heterogeneity by combining PLS and a modified fuzzy clustering method in an integrated framework. At the initial stage, each object is assigned a random initial weight in each segment. Through an iterative process, the weights are updated until convergence is reached. We have developed a new formula for this weighting, which considers heterogeneity in both the structural and measurement models. Fuzzy clustering is an overlapping method that allows an object to be part of several segments [31–33]. The reasons for using fuzzy clustering to reveal data heterogeneity within the SEM framework can be found in [34]. The success of fuzzy clustering in grouping data motivates us to study its use in PLS-SEM modeling. The use of fuzzy clustering to find data groups in PLS-SEM modeling has not previously been reported. We therefore investigated how to combine fuzzy clustering and PLS so that heterogeneity can be revealed within the SEM framework. Fuzzy clustering was adapted to PLS-SEM modeling by modifying the objective function and the procedure for updating the fuzzy membership values. Fuzzy clustering continues to be developed, which provides an opportunity to extend this method with more recent fuzzy clustering techniques. Examples of recent studies on fuzzy clustering can be found in [30,35–38].
This paper is structured as follows: in Section 2, we provide a brief background on the PLS approach to estimate SEM parameters and the algebraic notation in which PLS-SEM is defined. In Section 3, the basic formulas of the PLSMFC model and a summary of the PLSMFC algorithm are presented. Section 4 reviews the results and discusses the performance of the PLSMFC, which was evaluated in a detailed simulation study and empirical application. Finally, Section 5 provides conclusions regarding the feasibility of our proposed method.

Partial Least Squares Method for Estimating SEM Parameters
The PLS approach to SEM has been proposed as a component-based estimation procedure that differs from the classical covariance-based approach [27]. It is an iterative algorithm that first solves the blocks of the measurement model separately and then, in a second step, estimates the path coefficients in the structural model. Therefore, PLS-SEM is claimed to explain the residual variance of the latent variables and indicator variables in each regression run in the model, which is why PLS path modeling is considered more of an exploratory than a confirmatory approach. In contrast to the classical covariance-based approach, PLS-SEM does not aim to reproduce the sample covariance matrix.
PLS-SEM is considered a soft modeling approach, where no strict assumptions are required. This is a desirable feature, especially in applied studies where such assumptions are difficult to fulfill [39]. The PLS method is built from a system of interdependent equations based on simple or multiple regression. Such a system estimates the network of relationships between latent variables and the relationships between indicator variables and their associated latent variables. Lohmöller simplified the notation to facilitate the formulation of the PLS algorithm [19]. Exogenous and endogenous latent variables are given the same symbol, $\eta$. The indicators of the endogenous and exogenous latent variables are also given the same symbol, $x$. Let $X$ be a data matrix of size $N \times p$, where $N$ is the number of observations and $p$ is the number of indicator variables. The indicators are partitioned into $J$ non-overlapping subsets, called blocks, namely $X_1, X_2, \ldots, X_J$. Each block represents a latent variable $\eta_j$, $j = 1, 2, \ldots, J$, and has $K_j$ indicators $x_{k_j}$ with $k_j = 1, 2, \ldots, K_j$. Latent variables are assumed to be connected by one or more linear relations. All variables, both latent and indicator, are treated as standardized variables. In the context of PLS, the structural model is often called the inner model. The structural relationship associated with the $j^*$-th endogenous latent variable is written as

$$\eta_{j^*} = \sum_{q=1}^{Q_{j^*}} \beta_{q j^*}\, \eta_q + \zeta_{j^*}, \tag{1}$$

where $Q_{j^*}$ is the number of latent variables related to the $j^*$-th endogenous latent variable. In vector and matrix notation, Equation (1) can be expressed as

$$\eta_{j^*} = \eta_{\rightarrow j^*}\, \beta_{j^*} + \zeta_{j^*}. \tag{2}$$

The coefficient $\beta_{j^*}$ is a path coefficient vector that represents the strength and direction of the relationship between the response $\eta_{j^*}$ and the predictors $\eta_{\rightarrow j^*}$, and $\zeta_{j^*}$ is the residual term of the inner model. The measurement model in PLS is often called the outer model, which includes two measurement models: the reflective and the formative.
The reflective measurement model is the most commonly used. In this case, the latent variable is considered the indicator's cause. It is called reflective because the indicators 'reflect' the latent variable. The formative measurement model assumes that the indicator variable is the cause of the latent variable.
Next, we describe the PLS algorithm for estimating SEM parameters. As a tool for estimating the model parameters, the latent variable score $Y_j$ is estimated first as a weighted sum of the indicator variables:

$$Y_{jn} = \sum_{k_j=1}^{K_j} \tilde{w}_{k_j}\, x_{k_j n}, \tag{3}$$

where $\tilde{w}_{k_j}$ is the outer weight related to indicator $x_{k_j}$. The weights are estimated using the least squares method, and there are two versions of this estimation process. In the first version, called mode A, each indicator variable is regressed on the instrumental variable $Y_j$:

$$x_{k_j n} = w_{k_j}\, Y_{jn} + \varepsilon_{k_j n}.$$

The estimated weight is obtained by minimizing the residuals. In the second version, called mode B, the instrumental variable $Y_j$ is regressed on the indicator variables:

$$Y_{jn} = \sum_{k_j=1}^{K_j} w_{k_j}\, x_{k_j n} + \delta_{jn}.$$

The weights $\tilde{w}_{k_j}$ in Equation (3) are rescaled from $w_{k_j}$ so that the latent variable score has a variance of 1. The following is a summary of the basic PLS algorithm (Algorithm 1) [19].
Algorithm 1 PLS Algorithm [19]
Step 1: Estimate the weights and scores of the latent variables using the following process. The inputs of this algorithm are the indicators and the initial values of $w_{k_j}$. Steps 1-4 below are repeated until the weights of the indicators converge.

1. Outside approximation
2. Inner weights
3. Inside approximation
4. Outer weight: $Y_{jn} = \sum_{k_j=1}^{K_j} w_{k_j} x_{k_j n} + \delta_{jn}$ (mode B); $x_{k_j n} = w_{k_j} Y_{jn} + \varepsilon_{k_j n}$ (mode A)

Step 2: Estimate the path and loading coefficients using the ordinary least squares method.
According to [15], there are three options for calculating the inner weight $v_{ji}$. The first is the centroid scheme. This scheme only takes into account the sign of the correlation between adjacent latent variables and does not consider the path strength. The inner weight $v_{ji}$ is the sign of the correlation between $Y_j$ and $Y_i$, i.e., $v_{ji} = \operatorname{sign}(\operatorname{cor}(Y_j, Y_i))$. The second is the factor scheme. This scheme considers not only the sign but also the path strength in the structural model. The inner weight $v_{ji}$ is the correlation between $Y_j$ and $Y_i$, i.e., $v_{ji} = \operatorname{cor}(Y_j, Y_i)$. The third is the path scheme. A latent variable may be positioned as an independent or dependent variable depending on the cause-and-effect relationship: it acts as a dependent variable when it is influenced by other latent variables, and as a predictor when it affects other latent variables. If the latent variable $Y_i$ is a dependent variable of the latent variable $Y_j$, then the inner weight is the correlation between $Y_i$ and $Y_j$. On the other hand, if $Y_i$ is a predictor of $Y_j$, then the inner weight is the regression coefficient of $Y_i$ in the multiple regression of $Y_j$ on its predictors.
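The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`standardize`, `inner_weights`, `latent_scores`, `pls_mode_a`), the `adjacency` encoding of the inner model, the all-ones starting weights, and the stopping tolerance are all hypothetical choices, and only mode A is shown.

```python
import numpy as np

def standardize(a):
    """Center each column (or a 1-D vector) and scale it to unit variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def inner_weights(Y, adjacency, scheme="centroid"):
    """Return V with V[j, i] = v_ji, the weight of Y_i when approximating Y_j.

    adjacency[i, j] = 1 means Y_i is a predictor of Y_j (assumed encoding).
    """
    R = np.corrcoef(Y, rowvar=False)
    linked = (adjacency + adjacency.T) > 0
    if scheme == "centroid":            # sign of the correlation only
        return np.where(linked, np.sign(R), 0.0)
    if scheme == "factor":              # the correlation itself
        return np.where(linked, R, 0.0)
    V = np.zeros_like(R)                # path scheme
    for j in range(Y.shape[1]):
        successors = np.flatnonzero(adjacency[j])       # Y_i depends on Y_j
        predecessors = np.flatnonzero(adjacency[:, j])  # Y_i predicts Y_j
        V[j, successors] = R[j, successors]
        if predecessors.size:
            coef, *_ = np.linalg.lstsq(Y[:, predecessors], Y[:, j], rcond=None)
            V[j, predecessors] = coef
    return V

def latent_scores(blocks, weights):
    """Outside approximation: weighted sums of indicators, rescaled to variance 1."""
    return np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])

def pls_mode_a(blocks, adjacency, scheme="centroid", tol=1e-8, max_iter=300):
    blocks = [standardize(np.asarray(X, dtype=float)) for X in blocks]
    weights = [np.ones(X.shape[1]) for X in blocks]     # assumed starting values
    n = blocks[0].shape[0]
    for _ in range(max_iter):
        Y = latent_scores(blocks, weights)              # 1. outside approximation
        V = inner_weights(Y, adjacency, scheme)         # 2. inner weights
        Z = standardize(Y @ V.T)                        # 3. inside approximation
        # 4. mode A: slope of each standardized indicator regressed on Z_j
        new = [X.T @ Z[:, j] / n for j, X in enumerate(blocks)]
        change = max(np.max(np.abs(a - b)) for a, b in zip(new, weights))
        weights = new
        if change < tol:
            break
    return latent_scores(blocks, weights), weights
```

With two blocks and `adjacency = np.array([[0, 1], [0, 0]])`, calling `pls_mode_a([X1, X2], adjacency)` returns the unit-variance scores and the outer weights. In that two-variable case the path and factor schemes give the same inner weights, because the regression slope between standardized scores equals their correlation.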

Segmentation in SEM Using a Combination of Partial Least Squares and Modified Fuzzy Clustering
In this study, we developed a new method for determining a structural equation model based on data containing unobserved heterogeneity. Our method, called PLSMFC, is a combination of the PLS method and modified fuzzy clustering. The PLS method is used to estimate the SEM parameters, and the modified fuzzy clustering method is used to find the segments of the objects. We set the fuzzifier $m$ to 2. This choice is based on a previous study indicating that $m = 2$ performs better in fuzzy clustering [34].
In the context of PLS, the structural model is often called the inner model. The structural relationship associated with the $j^*$-th endogenous latent variable is given by Equation (1). Furthermore, the reflective measurement model is represented by

$$x_{k_j n} = \lambda_{k_j}\, \eta_{jn} + \varepsilon_{k_j n}.$$

In vector and matrix notation, the reflective measurement model can be rewritten as

$$x_{k_j} = \lambda_{k_j}\, \eta_j + \varepsilon_{k_j}.$$

The formative measurement model can be expressed by

$$\eta_{jn} = \lambda_1 x_{1_j n} + \lambda_2 x_{2_j n} + \ldots + \lambda_{K_j} x_{K_j n} + \delta_{jn}.$$

In matrix and vector notation, the formative measurement model can be rewritten as

$$\eta_j = X_j \Lambda_j + \delta_j.$$

The PLSMFC method finds the data segments after estimating the latent variable scores. After the first stage and before the second stage of the PLS algorithm, the number of segments is chosen, and the initial weights are randomly determined. The number of segments can be set to 2, 3, 4, and so on. For any given number of segments, the weights of an object across all segments sum to 1. This weighting changes how the loadings and path coefficients of the SEM are estimated.
The criterion used by the PLSMFC method to find the number of segments and the SEM in each segment is the minimization of the sum of the weighted squared residuals obtained from the outer and inner models. The use of residual distance as a substitute for Euclidean distance in conventional fuzzy clustering has been investigated in [40,41]. Mathematically, the objective function used in this process is given by Equation (13), where $J^*$ is the number of endogenous latent variables, $J_R$ is the number of latent variables measured using the reflective model, and $J_F$ is the number of latent variables measured using the formative measurement model. $u_{cn}$ is the weight of object $n$ in segment $c$, with $\sum_{c=1}^{C} u_{cn} = 1$ for every $n$. $\zeta^2_{nj^*c}$ is the squared residual of the inner model for the $n$-th observation and the $j^*$-th endogenous latent variable in segment $c$. $\varepsilon^2_{nkjc}$ is the squared residual of the outer model for the reflective measurement for the $n$-th observation, the $k$-th indicator, and the $j$-th latent variable in segment $c$. $\delta^2_{nkjc}$ is the squared residual of the outer model for the formative measurement model for the $n$-th observation in segment $c$ associated with the $j$-th latent variable. Equation (13) can be rewritten in vector and matrix notation as Equation (14), which can in turn be rewritten as Equation (15), where $y_{j^*}$ is the vector of scores of the $j^*$-th endogenous latent variable, $Y_{\rightarrow j^*}$ is the matrix of scores of the latent variables related to the $j^*$-th endogenous latent variable, and $\beta_{j^*c}$ is the vector of path coefficients associated with the $j^*$-th latent variable in segment $c$. $x_{jk}$ is the vector of the $k$-th indicator of the $j$-th latent variable, $y_j$ is the vector of scores of the $j$-th latent variable, and $\lambda_{jkc}$ is the loading coefficient (reflective measurement model) of the $j$-th latent variable for the $k$-th indicator in segment $c$. $X_j$ is the matrix of indicators that affect the $j$-th latent variable, and $\Lambda_{jc}$ is the vector of loading coefficients (formative measurement model) associated with the $j$-th latent variable in segment $c$.
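To make the criterion concrete, the following sketch evaluates a weighted sum of squared residuals of the kind described above. It assumes that the membership enters the objective raised to the fuzzifier $m = 2$, as in conventional fuzzy clustering; this exponent, the function name `plsmfc_objective`, and the `(C, N)` array layout are assumptions made for illustration, since the exact form of the equation is not shown here.

```python
import numpy as np

def plsmfc_objective(U, sq_inner, sq_reflective, sq_formative, m=2.0):
    """Weighted sum of squared residuals over segments and observations.

    U             : (C, N) memberships u_cn, each column summing to 1
    sq_inner      : (C, N) squared inner residuals summed over endogenous variables
    sq_reflective : (C, N) squared reflective residuals summed over indicators
    sq_formative  : (C, N) squared formative residuals summed over latent variables
    m             : fuzzifier (assumed exponent on the membership)
    """
    total_sq = sq_inner + sq_reflective + sq_formative
    return float(np.sum(U ** m * total_sq))
```

For example, with `U = [[0.5, 1.0], [0.5, 0.0]]`, inner squared residuals `[[1, 2], [3, 4]]`, and no outer residuals, the value is 0.25·1 + 1·2 + 0.25·3 + 0·4 = 3.0.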

Parameter Estimation of the Inner and Outer Model
The parameter estimators of the inner model (2) were determined using the Lagrange multiplier method by minimizing the function in Equation (15) subject to $\sum_{c=1}^{C} u_{cn} = 1$ for every $n$. Equation (16) is the Lagrange function, where $l$ is the Lagrange multiplier.
The parameters of the inner model all appear in the first group of terms; therefore, only this group needs to be considered, and the second, third, and fourth groups are left unchanged. The estimator of the inner model parameters is obtained by setting the first derivative of $F^*$ with respect to $\beta_{j^*c}$ to zero, i.e., $\partial F^* / \partial \beta_{j^*c} = 0$ for specific $j^*$ and $c$.
The parameter estimators of the outer model for the reflective measurement are determined in the same way. They are obtained by setting the first derivative of $F^*$ with respect to $\lambda$ to zero, namely $\partial F^* / \partial \lambda_{jkc} = 0$ for specific $j$, $k$, and $c$. The estimators of the formative measurement model parameters are obtained using the same method. The parameters of the outer model for the formative measurement all appear in the third group. The parameter estimators are obtained from Equation (16) by setting the first derivative of $F^*$ with respect to $\Lambda$ to zero, namely $\partial F^* / \partial \Lambda_{jc} = 0$ for specific $j$ and $c$. Furthermore, the residuals of the inner and outer models can be calculated. The residual of the inner model associated with the $j^*$-th endogenous latent variable in the $c$-th segment is

$$\hat{\zeta}_{j^*c} = y_{j^*} - Y_{\rightarrow j^*}\, \hat{\beta}_{j^*c}. \tag{20}$$

The residual of the outer model associated with the reflective measurement of indicator $x_{jk}$ in the $c$-th segment is

$$\hat{\varepsilon}_{jkc} = x_{jk} - \hat{\lambda}_{jkc}\, y_j. \tag{21}$$

The residual of the outer model associated with the formative measurement of the $j$-th latent variable in segment $c$ is

$$\hat{\delta}_{jc} = y_j - X_j\, \hat{\Lambda}_{jc}. \tag{22}$$
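Setting such derivatives to zero for a weighted sum of squared residuals leads to a weighted least squares problem in each segment. The sketch below illustrates this for one segment, assuming the membership enters the objective as $u_{cn}^m$ with $m = 2$. The helper names `weighted_ls` and `segment_fit` are hypothetical, and the closed-form estimators of the original derivation are not reproduced here.

```python
import numpy as np

def weighted_ls(A, b, u, m=2.0):
    """Minimise sum_n u_n**m * (b_n - A_n @ coef)**2 for one segment."""
    sw = u ** (m / 2.0)                       # square root of the weight u**m
    coef, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return coef

def segment_fit(y_end, Y_pred, x_ind, y_lat, u):
    """Fit one inner equation and one reflective indicator in a segment.

    y_end  : (N,) scores of the endogenous latent variable
    Y_pred : (N, Q) scores of its predictor latent variables
    x_ind  : (N,) one reflective indicator of the latent variable y_lat
    y_lat  : (N,) scores of that latent variable
    u      : (N,) memberships of the observations in this segment
    """
    beta = weighted_ls(Y_pred, y_end, u)             # path coefficients
    lam = weighted_ls(y_lat[:, None], x_ind, u)[0]   # reflective loading
    zeta = y_end - Y_pred @ beta                     # inner residuals
    eps = x_ind - lam * y_lat                        # reflective outer residuals
    return beta, lam, zeta, eps
```

A formative block can be handled with the same helper by regressing the latent variable scores on the indicator matrix, `weighted_ls(X_j, y_j, u)`.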

Fuzzy Membership and PLSMFC Algorithm
In this section, we explain how the fuzzy membership formula is obtained. This formula is one of the important components that characterize the fuzzy clustering method. It was obtained by applying the Lagrange multiplier method to Equation (13), which gives Equation (23). Because $\sum_{c=1}^{C} u_{cn} = 1$ for each specific $n$, the value of the Lagrange multiplier $l$ can be obtained. Finally, by substituting the value of $l$ into Equation (23), we get the formula for updating the fuzzy membership of the $n$-th object, Equation (24). To characterize the heterogeneity of the SEM, the objective function in Equation (13) is minimized using the following algorithm (Algorithm 2).
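As an illustration of this derivation, suppose the objective weights the total squared residual $d_{cn}$ of observation $n$ in segment $c$ by $u_{cn}^2$ (an assumption consistent with $m = 2$). Setting the derivative of $\sum_c u_{cn}^2 d_{cn} - l(\sum_c u_{cn} - 1)$ with respect to $u_{cn}$ to zero gives $u_{cn} = l / (2 d_{cn})$, and the sum-to-one constraint then yields $u_{cn} = (1/d_{cn}) / \sum_{k} (1/d_{kn})$. The sketch below implements this update; the function name `update_membership` and the small floor `eps` that guards against zero residuals are hypothetical.

```python
import numpy as np

def update_membership(d, m=2.0, eps=1e-12):
    """Fuzzy membership update from total squared residuals.

    d : (C, N) array, d[c, n] = sum of squared inner and outer residuals
        of observation n under the model of segment c.
    Returns a (C, N) array whose columns sum to 1.
    """
    inv = np.maximum(d, eps) ** (-1.0 / (m - 1.0))   # equals 1/d when m = 2
    return inv / inv.sum(axis=0, keepdims=True)
```

An observation whose residuals are small in one segment and large in the others receives a membership close to 1 in that segment, while equal residuals give equal memberships.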

Algorithm 2 PLSMFC Algorithm (author's own contribution)
Step 1: Estimate the weights and scores of the latent variables using all data. The inputs of this algorithm are the indicator data and the initial values of $w_{k_j}$. Steps 1-4 below are repeated until the weights of the indicators converge.

1. Outside approximation
2. Inner weights
3. Inside approximation
4. Outer weight

Step 2. Set the number of segments $C$, the initial fuzzy membership value of the $n$-th object in segment $c$ ($u_{cn}$), and $\Delta$.
Step 4. Calculate the residual of the inner model and outer model in the c-th segment using Equations (20)- (22).
Step 5. Update the fuzzy membership value for the n-th observation in the c-segment using Equation (24).
Step 8. Repeat stages 1 through 7 for a different number of segments.
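The segment-level loop of Algorithm 2 can be sketched as follows for a fixed number of segments. This is a deliberately reduced illustration, not the authors' algorithm: it uses only the inner-model residual of a single endogenous latent variable, takes the latent variable scores as given, and assumes both a weighted least squares estimation step and a stopping rule based on the largest change in membership compared with $\Delta$. The function name `plsmfc_inner_only` and its arguments are hypothetical.

```python
import numpy as np

def plsmfc_inner_only(Y_pred, y_end, C=2, delta=1e-6, max_iter=200, seed=0):
    """Fuzzy clusterwise regression of y_end on Y_pred with fuzzifier m = 2."""
    rng = np.random.default_rng(seed)
    N = len(y_end)
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)       # random memberships summing to 1
    betas = np.zeros((C, Y_pred.shape[1]))
    for _ in range(max_iter):
        d = np.empty((C, N))
        for c in range(C):
            sw = U[c]                        # square root of the weight u**2
            # assumed estimation step: weighted least squares in segment c
            betas[c], *_ = np.linalg.lstsq(Y_pred * sw[:, None], y_end * sw,
                                           rcond=None)
            d[c] = (y_end - Y_pred @ betas[c]) ** 2   # squared residuals
        inv = 1.0 / np.maximum(d, 1e-12)
        U_new = inv / inv.sum(axis=0, keepdims=True)  # membership update
        change = np.max(np.abs(U_new - U))            # assumed stopping rule
        U = U_new
        if change < delta:
            break
    return U, betas
```

Running the function for several values of `C` and comparing a validity measure across the runs corresponds to the final step of the algorithm. Because the starting memberships are random, several seeds would normally be tried.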
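In this paper, the segment validity measures used are the fuzziness performance index (FPI) and the normalized classification entropy (NCE) [34,42]. The FPI formula is as follows:

$$\mathrm{FPI} = 1 - \frac{C \cdot \mathrm{PC} - 1}{C - 1},$$

where PC is the partition coefficient defined by

$$\mathrm{PC} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} u_{cn}^2,$$

and $C$ is the number of segments. The NCE formula is as follows:

$$\mathrm{NCE} = \frac{\mathrm{PE}}{\log C},$$

where PE is the partition entropy defined by

$$\mathrm{PE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} u_{cn} \log u_{cn}.$$

The smaller the FPI and NCE values, the better the segments separate the objects from one another.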
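Both indices are simple functions of the membership matrix. The sketch below uses the commonly cited definitions based on the partition coefficient and the partition entropy; the function name `fpi_nce`, the `(C, N)` layout, and the floor `eps` inside the logarithm are assumptions.

```python
import numpy as np

def fpi_nce(U, eps=1e-12):
    """Fuzziness performance index and normalized classification entropy.

    U : (C, N) membership matrix whose columns sum to 1.
    """
    C, N = U.shape
    pc = np.sum(U ** 2) / N                                # partition coefficient
    pe = -np.sum(U * np.log(np.maximum(U, eps))) / N       # partition entropy
    fpi = 1.0 - (C * pc - 1.0) / (C - 1.0)
    nce = pe / np.log(C)
    return fpi, nce
```

A crisp partition, where every membership is 0 or 1, gives FPI = NCE = 0, whereas completely uniform memberships of $1/C$ give FPI = NCE = 1.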

Results and Discussions
A simulation study and applications on real data were conducted to evaluate the performance of the PLSMFC method. Specifically, this simulation aimed to determine how effective this method is in reallocating segment membership and reestimating SEM parameters in different population sizes. The simulation design considers various factors, including the number of segments, model specifications, distribution of latent variables, residual variance of endogenous latent variables, variance of residual indicator variables, and population size.

Design of Simulation and Data Generating Process
The number of segments used in the simulation has two levels, 2 and 3. The SEM model specifications consist of Model 1 and Model 2. Model 1 is an SEM that contains only reflective measurement models, whereas Model 2 is an SEM that contains both reflective and formative measurement models. The SEM models used in this simulation are represented in Figures 1 and 2. The distribution of the latent variable scores and indicators has two levels: N(0,1), which is a symmetric distribution, and beta B(4,9), which is an asymmetric distribution. The residuals of the endogenous latent variables follow a normal distribution N(0, σ²), where the variance is set at 5%, 10%, and 20%. Each measurement model has the same number of indicators, namely three indicator variables. The loading coefficients for 2 and 3 segments are set the same in both Models 1 and 2; these coefficients can be seen in Table 1. The residuals of the indicator variables are normally distributed N(0, σ²), where the variance is set at 5%, 10%, and 20%. The population size is set at three levels, namely 50 (small), 200 (medium), and 1000 (large), and the segments have balanced sizes. Overall, this study involved 2 × 2 × 2 × 3 × 3 × 3 = 216 combinations. Each combination was replicated 100 times using different initial weights to avoid convergence to a local minimum.
Table 1 shows the SEM parameters in each segment. If the number of segments is two, the SEM parameters in each segment are given in columns 1 and 2; if the number of segments is three, they are given in columns 1, 2, and 3. Simulation data were obtained by following a two-step procedure. As an illustration, suppose we generate data from two segments using Model 1. The first step is to produce the scores of the latent variables. The scores of the exogenous latent variables $\eta_1$ and $\eta_2$ were generated from one of two distributions, namely the standard normal distribution N(0,1) or the asymmetric beta distribution B(4,9). The scores of the latent variable $\eta_3$ were obtained by following the inner model specification of Model 1, $\eta_3 = \beta_1 \eta_1 + \beta_2 \eta_2 + \zeta_3$. After $\eta_1$, $\eta_2$, and $\zeta_3$ are generated, the scores $\eta_3$ can be obtained. The second step is to generate the data for the indicators $x_{31}$, $x_{32}$, and $x_{33}$ by first generating $\varepsilon_{31}$, $\varepsilon_{32}$, and $\varepsilon_{33}$ from the distribution N(0, σ²). Using the reflective outer model specification $x_{3k} = \lambda_{3k}\, \eta_3 + \varepsilon_{3k}$, $k = 1, 2, 3$, the values $x_{31}$, $x_{32}$, and $x_{33}$ can then be obtained.
The data in Model 2 were obtained by first generating $x_{11}$, $x_{12}$, $x_{13}$, $x_{21}$, $x_{22}$, and $x_{23}$ from the standard normal distribution N(0,1) or the beta distribution B(4,9). Then, $\delta_1$ and $\delta_2$ were generated from the distribution N(0, σ²). The scores of the exogenous latent variables $\eta_1$ and $\eta_2$ were generated by following the specification of the formative outer model, $\eta_j = \sum_{k=1}^{3} \lambda_{jk}\, x_{jk} + \delta_j$, $j = 1, 2$. Furthermore, the data for the indicators $x_{31}$, $x_{32}$, and $x_{33}$ were generated by first drawing $\varepsilon_{31}$, $\varepsilon_{32}$, and $\varepsilon_{33}$ from the distribution N(0, σ²) and then using the specification of the reflective outer model, i.e., $x_{3k} = \lambda_{3k}\, \eta_3 + \varepsilon_{3k}$, $k = 1, 2, 3$.
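The two-step procedure for Model 1 can be sketched as follows. The coefficient values passed to the function are placeholders, since the Table 1 values are not reproduced in this text, and reading "5%" as a residual variance of 0.05 is an assumption. The function name `generate_model1` and its arguments are hypothetical.

```python
import numpy as np

def generate_model1(n_per_segment, betas, loadings, var_zeta=0.05,
                    var_eps=0.05, dist="normal", seed=0):
    """Generate Model 1 data (all blocks reflective, three indicators each).

    betas    : one (beta_1, beta_2) pair per segment
    loadings : one (3, 3) array per segment; row j holds the loadings
               of the three indicators of latent variable j + 1
    Returns the (N, 9) indicator matrix and the true segment labels.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for c, beta in enumerate(betas):
        n = n_per_segment
        # Step 1: latent variable scores
        if dist == "normal":
            exo = rng.normal(size=(n, 2))            # eta_1, eta_2 ~ N(0, 1)
        else:
            exo = rng.beta(4, 9, size=(n, 2))        # eta_1, eta_2 ~ B(4, 9)
        zeta = rng.normal(0.0, np.sqrt(var_zeta), size=n)
        eta3 = exo @ np.asarray(beta, dtype=float) + zeta
        eta = np.column_stack([exo, eta3])           # (n, 3)
        # Step 2: reflective indicators x_jk = lambda_jk * eta_j + eps_jk
        lam = np.asarray(loadings[c], dtype=float)   # (3, 3)
        eps = rng.normal(0.0, np.sqrt(var_eps), size=(n, 3, 3))
        X = eta[:, :, None] * lam[None, :, :] + eps
        blocks.append(X.reshape(n, 9))
        labels.append(np.full(n, c))
    return np.vstack(blocks), np.concatenate(labels)
```

For instance, `generate_model1(100, [(0.9, 0.1), (0.1, 0.9)], [np.full((3, 3), 0.9)] * 2)` produces 200 observations in two balanced segments that differ only in their path coefficients.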

Simulation Results
This simulation study examines the quality of the segmentation solution generated by the PLSMFC method in terms of how many observations are correctly reallocated by the PLSMFC method. This study also analyzes how well the parameters of the SEM in each segment are estimated by the PLSMFC method. The reallocation process is carried out based on the largest weight obtained by each observation. If an observation is initially in segment one and then reallocated to segment one by the PLSMFC method, this means that the observation has been correctly reallocated by the method. The hit ratio statistic is used to measure the proportion of data generated in a particular segment, and it is assigned again in that segment by the PLSMFC method. Figure 3 below is the hit ratio of various factors used in this study in both Models 1 and 2. The figure explains that the trend of hit ratio values for Models 1 and 2 have the same pattern. In the number of segments, the hit ratio value in the number of segments 2 tends to be higher than the number of segments 3. In the population size, the larger the population size, the greater the hit ratio value. Furthermore, in the variance of the endogenous latent residual variable, the greater the variance value, the smaller the hit ratio. Likewise, for the variance of the indicator residual, the greater the variance, the smaller the hit ratio value. The performance of the latent variable distribution factor also has the same trend. The hit ratio value obtained from the latent distribution variable N(0,1) generally tends to be larger than the distribution B (4,9). In addition, it can also be shown that at all levels of each factor, Model 2 tends to obtain a higher hit ratio value than Model 1. However, the performance of the PLSMFC method is determined by the interaction of the factors used in this study. used to measure the proportion of data generated in a particular segment, and it is assigned again in that segment by the PLSMFC method. 
Figure 3 shows the hit ratio for each level of the factors used in this study, for both Models 1 and 2. The hit ratio follows the same pattern in the two models. With respect to the number of segments, the hit ratio tends to be higher with two segments than with three. With respect to population size, the larger the population, the higher the hit ratio. The greater the residual variance of the endogenous latent variables, the lower the hit ratio; likewise, the greater the residual variance of the indicators, the lower the hit ratio. The distribution of the latent variables shows a similar trend: the hit ratio obtained under N(0,1) generally tends to be higher than that obtained under B(4,9). In addition, at all levels of each factor, Model 2 tends to attain a higher hit ratio than Model 1. However, the performance of the PLSMFC method is determined by the interaction of the factors used in this study.
[...] N(0,1). The combination of these levels produces a hit ratio of 57.73%. Figure 5 shows the hit ratio for each combination of factor levels in Model 2. The highest hit ratio in Model 2 is achieved with three segments, a population size of 1000, a residual variance of the endogenous variables of 5%, a residual variance of the indicators of 5%, and the distribution of indicators B(4,9). This combination produces a hit ratio of 99.88%. The lowest hit ratio in Model 2 is obtained with three segments, a population size of 50, a residual variance of the endogenous variables of 20%, a residual variance of the indicators of 20%, and the distribution of the latent variable B(4,9). This combination produces a hit ratio of 64.42%. We conclude that our proposed method shows adequate performance.
Table 2 reports the mean parameter estimates of Model 1 obtained when the latent variables follow N(0,1), the residual variance of the endogenous latent variables is 5%, the residual variance of the indicators is 5%, and the number of segments is 2. Table 3 reports the mean parameter estimates of Model 2 under the same conditions. The means of the estimated parameter values of Models 1 and 2 were computed over 100 repetitions. The tables show that the PLSMFC method generally recovers the parameters of Models 1 and 2 well. In Model 1 all measurement models of the constructs are reflective, whereas Model 2 mixes reflective and formative measurement models. This shows that the method works satisfactorily for both specifications of the SEM model. The standard error of the estimated parameter values decreases as the sample size increases. In addition, the PLSMFC method works well even with a small population size.
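The hit ratio discussed above can be illustrated with a minimal Python sketch. It assumes that the hit ratio is the share of observations whose highest-membership segment equals the true segment, and that segment labels may be permuted before comparison because clustering labels are arbitrary; both points are assumptions for illustration, and the function name `hit_ratio` and the toy numbers are hypothetical.

```python
from itertools import permutations

import numpy as np


def hit_ratio(true_labels, memberships):
    """Share of observations whose hard assignment matches the true segment.

    memberships is an (n, K) array of fuzzy membership degrees. Because the
    numbering of estimated segments is arbitrary, the score is maximised over
    all relabelings of the K segments (an assumption of this sketch).
    """
    true_labels = np.asarray(true_labels)
    memberships = np.asarray(memberships, dtype=float)
    predicted = memberships.argmax(axis=1)  # hard assignment per observation
    k = memberships.shape[1]
    best = 0.0
    for perm in permutations(range(k)):
        relabeled = np.array(perm)[predicted]
        best = max(best, float(np.mean(relabeled == true_labels)))
    return best


# Toy example with made-up memberships: 6 observations, 2 segments.
u = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7],
              [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
truth = np.array([1, 1, 0, 0, 0, 0])
print(hit_ratio(truth, u))
```

Enumerating all permutations is only practical for a small number of segments such as the two or three considered in the simulation.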

Application on Real Data
This section describes the use of the PLSMFC method to detect heterogeneity in job performance data. The first step in applying PLSMFC is to estimate the latent variable scores. The inner weighting scheme we use is the centroid scheme, as in Equation (6), although other schemes can also be used. The results are compared with those of the REBUS-PLS method. REBUS-PLS was chosen for comparison because it shares the perspective of the PLSMFC method, in which unobserved heterogeneity arises from both the inner and outer models. This differs from the FIMIX PLS and PLS IRRS methods, in which the unobserved heterogeneity is attributed entirely to the inner model [25,27]. The data were obtained from the R documentation and can be loaded with the R code read.csv('https://articledatas3.s3.eu-central-1.amazonaws.com/StructuralEquationModelingData.csv') (accessed 15 July 2022). In this study, Job_Performance was measured by three indicators: Client_Sat, the satisfaction of the main client on a scale of 1 to 100; Super_Sat, the rating of job performance by superiors on a scale of 1 to 100; and Proj_Compl, the percentage of completed projects. The hypothesis of this study is that work performance is strongly influenced by three other latent variables, namely employee social skills, intellectual skills, and motivation. None of these variables can be measured directly; therefore, indicators must be specified for each. The social skills construct is based on two measurable variables: Psych_Test1, a psychological test score with a range of 1-100, and Psych_Test2, which also has a score range of 1-100. The intellectual skills construct is based on two measurable variables: Years_Edu, the number of years of higher education, and IQ, the score on an IQ test.
The motivation construct is based on two measurable variables: Hrs_Train, the number of hours spent on training, and Hrs_Work, the mean number of hours worked per week. All constructs were modeled using reflective measurements. Figure 6 is a path diagram that illustrates the relationships among the latent variables.
As with other fuzzy clustering methods, PLSMFC requires the number of segments to be specified in advance; the appropriate number was then evaluated using the FPI and NCE indices, whose formulas are given in Equations (21) and (22), respectively. Figure 7 shows the FPI and NCE values obtained after running the PLSMFC method with various numbers of segments. The optimum number of segments is the one for which the FPI and NCE values are lowest. Applying the PLSMFC algorithm to the job performance data, we identified two as the optimal number of segments, with an FPI value of 1.9338 and an NCE value of 0.9502, as shown in Figure 7. Therefore, heterogeneity in the SEM model of the job performance data is analyzed on the basis of two segments.
Table 4 shows the parameter estimates in each segment. To evaluate their significance, we applied the bootstrap with 500 resamples. The standard errors and critical ratios also appear in Table 4. All path coefficients in the table are significant at the 5% significance level. Work motivation positively influences work performance in both segments, and its effect is much larger than that of social and intellectual skills. However, the magnitude of this effect differs between segments: motivation contributes more to work performance in segment 1, where its path coefficient is 0.8179, than in segment 2, where it is 0.7917.
Furthermore, the coefficient of determination in segment 1 is 92.07%, which indicates that 92.07% of the variability in the work performance construct can be explained by the constructs of social skills, intellectual skills, and work motivation. The coefficient of determination in segment 2 is 92.02%, which indicates that 92.02% of the variability in work performance can be explained by social skills, intellectual skills, and work motivation, while the remaining 7.98% is explained by other constructs not considered in this study. Almost all loading coefficients are significant for every indicator in every segment; the exception is the loading of the client satisfaction indicator in segment 1, which is not significant. In segment 1, therefore, client satisfaction is not a suitable indicator of work performance. The loading coefficient for client satisfaction in segment 1 is only 0.3760 with a standard error of 0.2693, whereas in segment 2 it is 0.9142 with a standard error of 0.2693. This shows that heterogeneity in the data can arise not only from the structural model but also from differing measurement models. This result differs from those of researchers who assumed that unobserved heterogeneity stems only from the structural model [21,22,25].
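The significance assessment described above can be sketched in Python. The sketch assumes that the critical ratio is the estimate divided by its bootstrap standard error and that it is compared with 1.96 for a two-sided test at the 5% level; the section does not state the exact rule used, so both are assumptions. The helper names `bootstrap_critical_ratio` and `is_significant` are hypothetical, and `estimator` stands in for any function that re-estimates one parameter from resampled rows.

```python
import numpy as np


def bootstrap_critical_ratio(data, estimator, n_boot=500, seed=0):
    """Return (estimate, bootstrap standard error, critical ratio)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    estimate = estimator(data)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)  # resample rows with replacement
        replicates[b] = estimator(data[rows])
    se = replicates.std(ddof=1)
    return estimate, se, estimate / se


def is_significant(critical_ratio, threshold=1.96):
    """Two-sided check at the 5% level under a normal approximation."""
    return abs(critical_ratio) > threshold


# Client satisfaction loadings and standard errors as reported in the text.
print(is_significant(0.3760 / 0.2693))  # segment 1
print(is_significant(0.9142 / 0.2693))  # segment 2
```

With the reported values, the critical ratio is about 1.40 in segment 1 and about 3.39 in segment 2, which agrees with the statement that the loading is not significant only in segment 1.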
The REBUS-PLS method was also used to find segments in the same data. Using the same number of segments as the PLSMFC method (i.e., 2), unobserved heterogeneity in the job performance data was identified. Table 5 reports the parameter estimates in each segment. To evaluate their significance, we applied the bootstrap with 500 resamples. Table 5 shows that the loading and path coefficients in segments 1 and 2 are numerically different. All indicator loadings in the measurement model are significant at the 5% significance level in both segments. The path coefficients of the inner model in segments 1 and 2 also show a significant effect of the exogenous latent variables on job performance at the 5% significance level. The coefficient of determination of job performance in segment 1 is 91.34%, which means that 91.34% of the variation in the endogenous latent variable job performance can be explained by the social, intellectual, and motivational variables. In segment 2, the coefficient of determination of job performance is 91.56%. Thus 91.56% of the variability in job performance can be explained by the exogenous latent variables used in this study, while the remaining 8.44% is explained by other latent variables that were not considered.
Tables 4 and 5 summarize the results obtained after applying the PLSMFC and REBUS-PLS methods to the job performance data. The PLSMFC method is more sensitive in detecting the significance of the model parameters. In segment 1, PLSMFC indicates that client satisfaction is not a good measure of the endogenous latent variable job performance, whereas REBUS-PLS shows the opposite result. In the evaluation of the inner model, the PLSMFC method also produces better results than the REBUS-PLS method: its coefficient of determination is higher than that of REBUS-PLS.
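The selection of the number of segments with the FPI and NCE indices, used earlier in this section, can be illustrated with a short Python sketch. Equations (21) and (22) are not reproduced in this section, so the code uses one common textbook form of the fuzziness performance index and the normalized classification entropy; this is an assumption, the paper's formulas may be normalized differently, and the values produced here should not be compared directly with those in Figure 7. The function names `fpi` and `nce` are hypothetical.

```python
import numpy as np


def fpi(u):
    """Fuzziness performance index: 0 for a crisp partition, 1 for a uniform one."""
    u = np.asarray(u, dtype=float)
    n, k = u.shape
    partition_coefficient = np.sum(u ** 2) / n
    return 1.0 - (k * partition_coefficient - 1.0) / (k - 1.0)


def nce(u, eps=1e-12):
    """Partition entropy divided by log(K), so that it lies between 0 and 1."""
    u = np.asarray(u, dtype=float)
    n, k = u.shape
    # Clip before taking the log so that zero memberships contribute zero.
    entropy = -np.sum(u * np.log(np.clip(u, eps, 1.0))) / n
    return entropy / np.log(k)


crisp = np.array([[1.0, 0.0], [0.0, 1.0]])  # every observation fully assigned
uniform = np.full((2, 2), 0.5)              # maximally fuzzy memberships
print(fpi(crisp), nce(crisp))
print(fpi(uniform), nce(uniform))
```

In practice the membership matrix would be computed for each candidate number of segments, and the number with the lowest index values would be retained, as described in the text.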

Future Research
This study introduces PLSMFC, a new segmentation method for PLS SEM that makes it possible to reveal unobserved heterogeneity within the data. The method finds the segments based on the residuals obtained from the measurement and structural models. The objective function to be minimized is the sum of the weighted squared residuals. This choice is natural because the SEM model in each segment is obtained by minimizing the residuals using weighted least squares.
The PLSMFC method offers flexibility. If using both sets of residuals does not perform well, the objective function can be reduced by removing the residuals of the measurement model, so that it becomes a function of the residuals of the structural model only. Removing this residual component from the objective function results in a change in the fuzzy membership formula. This perspective is the same as that of FIMIX PLS and PLS IRRS, where the heterogeneity of the data is considered to be influenced only by the structural model [16,20]. The PLSMFC method does not depend on distributional assumptions and can reveal heterogeneity in both reflective and formative measurement models.
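The role of the residuals in the membership update can be sketched in Python. The membership formula of PLSMFC is not given in this section, so the sketch substitutes the classical fuzzy c-means update, with the summed squared residuals taking the place of the squared Euclidean distance; this substitution is an assumption for illustration. The function `update_memberships`, its `use_outer` flag, and the toy residuals are hypothetical.

```python
import numpy as np


def update_memberships(outer_sq_resid, inner_sq_resid, m=2.0,
                       use_outer=True, eps=1e-12):
    """Classical fuzzy c-means membership update driven by residuals.

    outer_sq_resid, inner_sq_resid: (n, K) arrays holding, for each
    observation, the summed squared residuals under each segment's
    measurement (outer) and structural (inner) model.
    m: fuzzy parameter, which must be greater than 1.
    use_outer: when False, only the structural residuals drive the update.
    """
    distance = np.asarray(inner_sq_resid, dtype=float).copy()
    if use_outer:
        distance = distance + np.asarray(outer_sq_resid, dtype=float)
    distance = np.maximum(distance, eps)      # guard against division by zero
    inverse = distance ** (-1.0 / (m - 1.0))  # small residual, large weight
    return inverse / inverse.sum(axis=1, keepdims=True)


# Toy residuals (made-up numbers): 2 observations, 2 segments.
outer = np.array([[1.0, 3.0], [4.0, 1.0]])
inner = np.array([[1.0, 1.0], [2.0, 1.0]])
print(update_memberships(outer, inner))                   # both residual sets
print(update_memberships(outer, inner, use_outer=False))  # structural only
```

Dropping the measurement residuals changes the memberships, and in the toy data the first observation becomes equally attached to both segments because its structural residuals are identical.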
Despite its advantages, the PLSMFC method has several limitations that warrant further investigation. As with other latent class methods such as REBUS-PLS, FIMIX PLS, PLS GAS, and PLS IRRS, the exact number of segments for a data set is initially unknown. The appropriate number can be determined only after running the PLSMFC method with different numbers of segments, which makes the method inefficient because the algorithm has to be run many times. Next, the simulation design in this paper does not include an SEM specification in which all measurement models are formative. If the endogenous latent variable is measured formatively, the scenario for generating the data becomes more complicated, and data generation under such a specification requires further research. Furthermore, the segmentation process in the PLSMFC method uses modified fuzzy clustering with the fuzzy parameter set equal to 2. The effect of a fuzzy parameter value greater than two on the quality of segmentation is worth investigating. Moreover, fuzzy clustering methods continue to develop, which gives researchers an opportunity to combine the PLS method with more recent fuzzy clustering methods such as those in [30,31].

Conclusions
In this study, a new method for estimating SEM parameters based on heterogeneous data was proposed. This method is a combination of PLS and modified fuzzy clustering. We used the sum of the weighted squared residuals generated by the outer and inner models as a substitute for the Euclidean distance in classical fuzzy clustering.
To evaluate the algorithm, we simulated 432 scenarios from 5 factors and used 100 replications per scenario. Of these, 216 scenarios came from Model 1, in which all measurements are reflective, and 216 from Model 2, in which the latent variables have a mix of reflective and formative measurements. We generated simulation data with small (50), medium (200), and large (1000) population sizes. The simulation results show that accuracy varies across scenarios. In general, the PLSMFC method demonstrates good accuracy even for small population sizes. The greater the residual variance of the endogenous constructs and of the indicators, the lower the hit ratio. In addition, the quality of the re-estimated model parameters is proportional to the hit ratio: the greater the hit ratio, the more accurately the model parameters are re-estimated. This holds both in Model 1, a reflective measurement model, and in Model 2, a model with a mixture of reflective and formative measurements.
We applied the PLSMFC method to job performance data to examine the relationship between job performance, social skills, intellectual skills, and motivation, in which all constructs were reflectively measured. The results of the application show that the appropriate number of segments is two, with all path coefficients significant at the 5% significance level in both segments. However, the measurement models differ slightly: all loading coefficients in segment 2 are significant, whereas the loading coefficient of client satisfaction is not significant in segment 1.
We recommend the application of PLSMFC when data contain unobserved heterogeneity because of its ability to allocate objects to the correct segment. At the same time, the method can appropriately estimate the parameters of the structural equation model for each segment.

Acknowledgments: The authors thank the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia for its support. The authors also thank the editor and the referees for their helpful comments.

Conflicts of Interest:
The authors declare no conflict of interest.