Latent Class Regression Utilizing Fuzzy Clusterwise Generalized Structured Component Analysis

Abstract: Latent class analysis (LCA) has been applied in many research areas to disentangle the heterogeneity of a population. Despite its popularity, its estimation method has been limited to maximum likelihood estimation (MLE), which requires large samples to satisfy both the multivariate normality and local independence assumptions. Although many suggestions regarding adequate sample sizes have been proposed, researchers continue to apply LCA to relatively small samples. When covariates are involved, estimation issues become even more pronounced. In this study, we suggest a different estimation approach for LCA with covariates, also known as latent class regression (LCR), using a fuzzy clustering method and generalized structured component analysis (GSCA). This new approach is free from the distributional assumption and is stable in estimating parameters. Parallel to the three-step approach used in MLE-based LCA, we extend an algorithm of fuzzy clusterwise GSCA to LCR. The proposed algorithm is demonstrated with empirical data including both categorical and continuous covariates. Because the proposed algorithm can be used with relatively small samples in LCR without requiring the multivariate normality assumption, it is broadly applicable in the social, behavioral, and health sciences.


Introduction
Latent class analysis (LCA [1][2][3]) is a popular statistical tool to identify the relationship between a categorical latent variable and observed categorical variables in a variety of research areas such as education [4], psychology [5], sociology [6], medicine [7,8], and public health [9]. LCA has been used to classify mutually exclusive heterogeneous subpopulations, also known as latent classes, based on participants' responses collected as a set of observed categorical variables. In other words, LCA enumerates the latent classes in which sample units respond in similar patterns in terms of the observed categorical variables. Model specification of LCA includes two sets of parameters: class membership probabilities and item-response probabilities within each class. The characteristics of each class are identified using the item-response probabilities, and each participant's likelihood of belonging to each class is predicted based on the parameter estimates. Owing to the advent of maximum likelihood estimation methods (MLE [10]) and the development of software packages including Mplus [11], Proc LCA [12], Latent Gold [13], and poLCA [14], LCA has recently become more popular in a variety of research areas.
Although LCA has become more popular, there are still concerns regarding its estimation method. The aforementioned MLE of LCA using the expectation-maximization (EM) algorithm [15] can cause several estimation issues [16]. The most common issue relates to model identification. When a model is underidentified, i.e., the amount of observed information from the response data is smaller than the number of unknown parameters, a unique solution may not be easily obtained because of the occurrence of multiple local maxima. To avoid multiple local maxima and obtain the global maximum, most statistical programs for LCA use multiple sets of initial/starting values. However, not all cases yield a unique maximum solution because multiple sets of initial values do not guarantee convergence to a definite, unique solution.
Another identification issue may occur when the number of unknown parameters in the model is large. As the number of parameters increases (e.g., with an increase in the number of classes), more data are required in the contingency table formed by crossing the item response categories and latent classes. However, a large contingency table often faces the issue of sparseness because not all cells have a large enough count. Increasing the sample size may alleviate sparseness, since a larger sample tends to fill cells that previously had small counts [16]. In practice, however, a larger sample size does not always mitigate sparseness when there are too many cells. Lastly, MLE-based estimation assumes multivariate normality of the set of parameters. Multivariate normality is a strong assumption that is hard to confirm and easy to violate. As the number of parameters increases, the estimation is more likely to produce biases due to deviations from the normality assumption. In sum, the MLE-based estimation approach requires a large sample size to avoid estimation issues, although this is not a panacea. Moreover, large samples are often not achievable in empirical studies, and what counts as "large" depends on many factors in the sample size calculation [17,18].
To resolve these estimation issues, the authors of [19] proposed an alternative estimation approach for LCA using fuzzy clusterwise generalized structured component analysis (gscaLCA). Although that study provided the conceptual and computational algorithm of gscaLCA, its model was limited to a simple LCA focusing only on enumerating classes without considering covariates. Meanwhile, modeling covariates in LCA, also known as latent class regression (LCR), has been widely used because it can demonstrate how covariates affect membership prevalence as well as membership probabilities [16]. Considering the popularity of LCR, the current study aims to propose an algorithm to estimate LCR by updating the gscaLCA algorithm. The study contributes a new, more flexible method for LCR that is not bound by the normality assumption and yields stable and reliable estimates. This paper is organized as follows. First, we review the theoretical framework of generalized structured component analysis (GSCA [20]), the procedure for gscaLCA, and MLE-based LCA with covariates, which are the underlying concepts used to build the algorithm of gscaLCA with covariates, hereinafter referred to as gscaLCR. Second, we provide a detailed algorithm for gscaLCR. These are followed by an illustration with empirical data from the National Longitudinal Study of Adolescent to Adult Health (Add Health [21]).

Generalized Structured Component Analysis (GSCA)
GSCA is a component-based approach to structural equation modeling (SEM) that encompasses three sub-models: the measurement, structural, and weighted relation sub-models [22]. The measurement sub-model describes the relationship between latent variables and indicators, and the structural sub-model describes the relationships among latent variables and/or between a latent variable and observed variables other than indicators, as in factor-based SEM. Lastly, the weighted relation sub-model defines a latent variable as a weighted composite, or component, of indicators, which is a unique part of GSCA. This weighted relation sub-model eases estimation by allowing latent variable scores to be calculated as component scores, which also facilitates estimating the parameters of the other two sub-models through alternating least squares estimation (LSE [23]). Because the GSCA algorithm minimizes a single global least-squares criterion via alternating LSE and is compatible with bootstrap methods, it is often advantageous over factor-based SEM (for example, for a complicated model such as a dynamic system using fMRI [24]).
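To make the alternating least squares idea concrete, the following sketch evaluates the GSCA least-squares criterion SS(ZV − ZWA) and performs one closed-form update of A for fixed W. The data, the weight pattern, and all dimensions are invented for illustration; this is not the gscaLCA package API, only a minimal sketch following the V = [I, W] convention of the GSCA literature.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, P = 100, 6, 2                      # subjects, indicators, components

Z = rng.normal(size=(N, J))
Z = (Z - Z.mean(0)) / Z.std(0)           # standardize indicators

W = np.zeros((J, P))                     # weighted relation sub-model
W[:3, 0] = W[3:, 1] = 1.0                # fixed weight pattern (assumption)

V = np.hstack([np.eye(J), W])            # V = [I, W]: observed + latent parts

def update_A(Z, V, W):
    # Least-squares update of A for fixed W: minimize SS(ZV - ZWA) over A
    G = Z @ W                            # component scores
    return np.linalg.lstsq(G, Z @ V, rcond=None)[0]

A = update_A(Z, V, W)
phi = np.sum((Z @ V - Z @ W @ A) ** 2)   # GSCA least-squares criterion
```

In the full algorithm, analogous conditional least-squares updates of W (and V) alternate with this update of A until the criterion stops decreasing.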

Generalized Structured Component Analysis for Latent Class Analysis (gscaLCA)
The gscaLCA algorithm was recently proposed by the authors of [19]. It was developed by combining fuzzy clusterwise GSCA [25] and optimal scaling in GSCA [22], which allows the algorithm to fit latent class analysis (LCA) within a component-based SEM framework. Fuzzy clusterwise GSCA updates membership probabilities according to the distance from centroids, which determines the latent classes, while simultaneously estimating the parameters of the GSCA model. Through this alternating estimation procedure, the membership probabilities are estimated so that the latent classes assigned to each sample unit take into account the relationships between observed and latent variables. Prior to the gscaLCA algorithm, fuzzy clusterwise GSCA [25] was limited to cases where the outcome variables are continuous. To extend fuzzy clusterwise GSCA from continuous to discrete outcome variables, the gscaLCA algorithm [19] incorporated the optimal scaling technique [26], also known as optimal data transformation [22].
The algorithm of gscaLCA estimates the parameters of the three sub-models with optimal scaling, followed by fuzzy clustering [27][28][29][30][31][32][33]. The fuzzy clustering step updates the individuals' memberships based on the optimally scaled data at each iteration. Updating the parameters with optimal scaling and updating memberships with fuzzy clustering are applied alternately, with the aim of minimizing the model residuals in the LSE of gscaLCA [19].

Latent Class Regression
Introducing covariates into MLE-based LCA, i.e., latent class regression (LCR), has been developed by many researchers, and the analytic method has been widely used [34][35][36][37][38]. The covariates in LCA play the role of predicting or explaining class memberships [3,16]. Two approaches have typically been used to estimate the parameters of LCR: a one-step and a three-step approach. The one-step approach [39,40] fits the LCA model and the multinomial logistic regression of latent class membership on covariates simultaneously, which most software packages employ (e.g., poLCA [14]). The three-step approach, in contrast, consists of estimating the parameters of the LCA model (step 1), assigning a latent class to each subject based on the estimated membership probabilities (step 2), and fitting a structural model with the latent class scores and covariates (step 3). The last step is, for example, a multinomial logistic regression using the latent class scores as the dependent variable and the covariates as the independent variables. The three-step approach follows factor score regression [41] or the latent structure model [42], i.e., a sequential ad hoc approach based on estimated latent variable scores obtained after the latent variable model is estimated.
The one-step approach estimates the parameters of the LCA model and fits the structural model simultaneously, which often causes serious bias due to model misspecification. In addition, LCA with covariates is hard to estimate because of the large number of parameters, which often exceeds the available information [3,42]. Compared to the one-step approach, the three-step approach is less likely to suffer identification issues or bias due to model misspecification. On the other hand, the three-step approach in MLE-based LCA often estimates the relationship between covariates and latent class membership incorrectly because of classification error. Classification error refers to the difference between the predicted latent class scores, used as consistent estimates, and the true latent class scores, and it produces downward bias in parameter estimates [42]. The three-step approach has been improved to diminish the effect of classification error: the improved three-step approach corrects the bias and incorporates the effect of covariates in the structural model of the latent structure model separately from estimating the LCA parameters [3,42,43]. This improved procedure has become the standard for LCA with covariates in MLE-based LCA. However, the updated three-step approach in LCR is not entirely free from estimation issues associated with the multivariate normality assumption. In contrast, in gscaLCA with covariates (gscaLCR), such bias does not occur because the estimation procedure does not require the normality assumption that causes classification error in MLE-based LCA. In this study, we discuss the three-step approach for LCA using GSCA with both categorical and continuous covariates.

Generalized Structured Component Analysis for Latent Class Regression (gscaLCR)
Exploiting the advantages of gscaLCA, such as avoiding the estimation burden and relaxing the normality assumption, the current study demonstrates a new algorithm, gscaLCR, for implementing LCA with covariates using GSCA. The proposed algorithm is based on fuzzy clusterwise GSCA and aligns with the aforementioned three-step approach in MLE-based LCA. The gscaLCR algorithm avoids unnecessary re-estimation of the LCA when adding or removing covariates, and it also enables researchers to model the influence of covariates in determining memberships. The gscaLCR algorithm consists of three main steps: (STEP 1) estimating the parameters of gscaLCA, (STEP 2) assigning the memberships, and (STEP 3) estimating the effects of covariates on membership. The flowchart of the algorithm is shown in Figure 1.

Step 1: Estimate parameters of gscaLCA
To fit gscaLCA to the data, the three sub-models of GSCA must be specified: the measurement sub-model (loading matrix C), the structural sub-model (path matrix B), and the weighted relation sub-model (weight matrix W). Covariates affecting the relationships between observed variables can be involved in the three sub-models. The inclusion of covariates in the GSCA model reflects the relationships among observed variables and affects the clustering results. However, this inclusion is not for examining the effects of covariates but for controlling them when enumerating latent classes. Whether to include covariate effects in the GSCA model is at the researchers' discretion, which is beyond the scope of this study. An example of the model specification with covariates in this step is featured in Figure 2. The sub-models of GSCA are integrated into a single model equation as follows [22]:

V′s_i = A′W′s_i + e_i,

where V = [I, W] for I an identity matrix and ′ for matrix transpose, A = [C, B] is the combined measurement (C) and structural (B) model matrix, e_i is a vector of residuals of the observed and latent variables, and s_i is the optimally scaled data of subject i (i.e., s_i = h(z_i), where h refers to a transformation of the original categorical variables to their optimally scaled counterparts and z_i is the original observed categorical data of subject i). The parameters of GSCA are estimated by minimizing

φ = Σ_{i=1}^{N} SS(V′s_i − A′W′s_i) (1)

with respect to V, W, and A, under the constraint that each optimally scaled variable is standardized (s_j′s_j = 1).
W_k, A_k, and V_k are the weighted relation model matrix, the combined measurement and structural model matrix, and the combined identity and weighted relation model matrix for latent class (cluster) k, respectively, and the number of latent classes in Equation (2) is K:

φ = Σ_{k=1}^{K} Σ_{i=1}^{N} u_ik^m SS(V_k′s_i − A_k′W_k′s_i) = Σ_{k=1}^{K} SS(U_k^(1/2)(S V_k − S W_k A_k)), (2)

Equation (2) becomes equivalent to Equation (1) when K = 1. The matrix S = [s_1, …, s_N]′ is an N by J optimally scaled data matrix of N subjects and J observed variables. Lastly, u_ik refers to the fuzzy membership, or membership probability, of subject i in latent class k, and its power m is the fuzzifier, ranging from 1 to infinity, which is determined in advance; in practice, m = 2 is the most popular choice in fuzzy clustering [27,30,44,45]. In the second line of Equation (2), U_k denotes a diagonal matrix of fuzzy memberships (i.e., U_k = diag(u_1k^m, …, u_Nk^m)), and SS(M) is equal to trace(M′M). To minimize the criterion in Equation (2), the following two steps are alternated until convergence.
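As a minimal numerical sketch of the criterion in Equation (2), the code below sums the membership-weighted per-class residual sums of squares. The toy data and per-class matrices are invented for illustration; this is not the gscaLCA implementation.

```python
import numpy as np

def fuzzy_gsca_criterion(S, V_list, W_list, A_list, U, m=2.0):
    """Evaluate the fuzzy clusterwise GSCA least-squares criterion (sketch).

    S      : (N, J) optimally scaled data matrix
    *_list : per-class model matrices V_k, W_k, A_k
    U      : (N, K) fuzzy memberships u_ik; m is the fuzzifier
    """
    phi = 0.0
    for k in range(U.shape[1]):
        resid = S @ V_list[k] - S @ W_list[k] @ A_list[k]  # class-k residuals
        phi += np.sum(U[:, k] ** m * np.sum(resid ** 2, axis=1))
    return phi

rng = np.random.default_rng(0)
N, J, P = 5, 3, 1
S = rng.normal(size=(N, J))
W = rng.normal(size=(J, P))
A = rng.normal(size=(P, J + P))
V = np.hstack([np.eye(J), W])            # V = [I, W]
# Two identical classes with equal memberships, for a checkable value
phi = fuzzy_gsca_criterion(S, [V, V], [W, W], [A, A],
                           U=np.full((N, 2), 0.5))
```

With identical classes and u_ik = 0.5, the criterion reduces to 2 × 0.5² = 0.5 times the single-class residual sum of squares, which makes the weighting easy to verify by hand.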
Step 1.1. Update the parameters of generalized structured component analysis (W_k, A_k, and V_k) in each cluster for fixed u_ik

This step locates the parameters W_k, A_k, and V_k that minimize

φ_k = SS(U_k^(1/2)(S V_k − S W_k A_k)), (3)

where U_k = diag(u_1k^m, …, u_Nk^m).
This minimizes the residuals in Equation (1) for each latent class (cluster), meaning that the parameters W_k, A_k, and V_k are updated within each latent class (see the details in [22]). By integrating the criterion over the multiple latent classes, we can minimize the criterion in Equation (3). Another task in this step is the transformation of the original categorical variables to their optimally scaled counterparts, because the indicators in LCA are categorical. This optimal scaling procedure is executed so as to maintain the measurement characteristics of the observed data. The optimally scaled data for a nominal variable are restricted to take an identical value for observations that fall in the same category, and the optimally scaled data for ordinal variables are required to preserve the observed order after transformation. This optimal scale transformation is executed at each iteration until convergence.
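The nominal optimal-scaling update described above can be sketched as follows: each category receives the mean of the model-reconstructed values of its members (the least-squares quantification under the nominal restriction), and the scaled variable is then standardized. The function name and toy data are illustrative assumptions, not part of the gscaLCA package.

```python
import numpy as np

def nominal_optimal_scaling(z, m):
    """One optimal-scaling update for a nominal variable (illustrative sketch).

    z : integer category codes of one observed variable
    m : the model's current reconstruction of that variable
    """
    s = np.empty_like(m, dtype=float)
    for c in np.unique(z):
        s[z == c] = m[z == c].mean()      # identical value per category
    s = (s - s.mean()) / s.std()          # standardization constraint
    return s

z = np.array([0, 0, 1, 1, 2, 2, 2])                       # category codes
m = np.array([0.2, 0.4, 1.1, 0.9, -0.8, -1.0, -1.2])      # model values
s = nominal_optimal_scaling(z, m)
```

For an ordinal variable, the category-mean step would be replaced by a monotone (isotonic) regression so the observed order is preserved.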
Step 1.2. Update the membership parameter u_ik for fixed W_k, A_k, and V_k

Finding u_ik by minimizing Equation (2) with respect to u_ik subject to Σ_{k=1}^{K} u_ik = 1 is equivalent to solving the system of equations obtained from the partial derivatives of a Lagrange function, φ*, defined by

φ* = φ + Σ_{i=1}^{N} λ_i (Σ_{k=1}^{K} u_ik − 1), (4)

where λ_i is a Lagrangian multiplier.
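For fixed GSCA parameters, this constrained minimization has the standard fuzzy-clustering closed-form solution u_ik = 1 / Σ_l (d_ik/d_il)^(1/(m−1)), where d_ik is the residual sum of squares of subject i under class k. A minimal sketch, with toy distances invented for illustration:

```python
import numpy as np

def update_memberships(D, m=2.0):
    """Standard fuzzy-clustering membership update (sketch).

    D : (N, K) matrix of per-class residual sums of squares d_ik,
        i.e., SS(V_k' s_i - A_k' W_k' s_i) for subject i and class k.
    Returns u_ik = 1 / sum_l (d_ik / d_il)^(1/(m-1)); rows sum to 1.
    """
    D = np.asarray(D, dtype=float)
    ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Subject 1 fits class 1 best, subject 2 fits class 2 best
D = np.array([[1.0, 4.0],
              [4.0, 1.0]])
U = update_memberships(D)
```

With m = 2 and residuals 1 vs. 4, the closer class receives a membership of 0.8, illustrating how smaller residuals yield larger memberships.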

Step 2: Assign each subject to a latent class based on the parameter estimates from the previous step

Two types of assignment are considered in this step: hard partitioning and soft partitioning [3,46]. Hard partitioning results from modal or random assignment. The membership of a subject is determined by a latent indicator function, D_ik, selecting the latent class for which the subject's membership probability (u_ik) is the largest; that is, D_ik = 1 if u_ik is the largest, and D_ik = 0 otherwise. Soft partitioning, on the other hand, uses proportional assignment: a subject belongs to each latent class proportionally to its estimated membership probabilities (u_ik), rather than being assigned a single membership.
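The two assignment rules can be sketched as follows, starting from an invented membership-probability matrix: modal (hard) assignment picks the column with the largest probability, while soft partitioning keeps the probabilities themselves as weights.

```python
import numpy as np

# Membership probabilities for 4 subjects across K = 3 classes (invented)
U = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.2, 0.3, 0.5]])

# Hard (modal) partitioning: one-hot indicator of the largest probability
hard = np.zeros_like(U)
hard[np.arange(len(U)), U.argmax(axis=1)] = 1.0

# Soft partitioning: the probabilities themselves serve as weights in Step 3
soft = U
```

Note that for subjects with nearly equal probabilities (such as the third row), hard partitioning discards the uncertainty that soft partitioning retains.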

Step 3: Fit a regression model with covariates
Regardless of which partitioning assignment is selected in Step 2, two procedures are possible. With hard partitioning, a multinomial or binomial logistic regression on the covariates can be used, treating the assigned latent classes as the dependent variable. When c_i refers to the categorical membership of subject i obtained via hard partitioning, it takes one of K categories; for example, c_i can be 1, 2, or 3 when the number of latent classes, K, is 3. The multinomial logistic regression on the covariates can be presented as

log(P(c_i = k)/P(c_i = K)) = β_0,k + β_1,k x_i1 + ⋯ + β_P,k x_iP, (5)

where K is the reference category, x_ip indicates the covariates, and β_p,k is the regression coefficient for class k and covariate p = 1, …, P, where the number of covariates is P. The other possible regression model is a binomial logistic regression with dummy variables (D_k), i.e., class k vs. the others.
With a three-latent-class model, we have three dummy variables. For each latent class, we can fit the binomial regression

logit P(D_k = 1) = log(p_k/(1 − p_k)) = β_0,k + β_1,k x_1 + ⋯ + β_P,k x_P, (6)

where p_k is the probability of being in the kth class. When K = 2, fitting a multinomial or a binomial logistic regression yields the same results. The choice between multinomial and binomial logistic regression is at the researcher's discretion; from a methodological point of view, a comparison between the two regressions is unnecessary. Multinomial logistic regression can be used when a comparison between two particular latent classes with respect to the covariate effects is required; when the change in membership in each latent class is the main interest, binomial logistic regression can be used.
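As an illustration of Step 3 under hard partitioning, the sketch below fits one binomial logistic regression of a class dummy on a single covariate via Newton-Raphson. The data-generating coefficients (0.5, 1.0) and all names are invented; in practice, one such regression would be fitted per latent class dummy, using the Step 2 assignments as the response.

```python
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Binomial logistic regression fitted by Newton-Raphson (sketch)."""
    X = np.column_stack([np.ones(len(X)), X])   # add intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities
        Wd = p * (1.0 - p)                      # IRLS weights
        H = X.T @ (X * Wd[:, None])             # Hessian of the log-likelihood
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
x = rng.normal(size=500)                        # one covariate (invented)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
d1 = (rng.uniform(size=500) < p_true).astype(float)  # dummy: class k vs rest
beta = logistic_newton(x, d1)                   # estimates near (0.5, 1.0)
```

The exponentiated slope, exp(beta[1]), gives the odds ratio for class-k membership per unit change in the covariate, which is how the coefficients in this section are interpreted.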
When soft partitioning is used, we can still fit multinomial and binomial logistic regressions as in Step 3, but the membership probabilities (u_ik) need to enter the regression models as weights. The weights adjust each unit's contribution to the regression. Although these regression models are easy to fit, this may not be what researchers are looking for in LCA in some sense, because LCA mainly purports to identify heterogeneous subgroups rather than subjects' likelihoods of membership; however, this is beyond the scope of this study. The effects of the covariates are examined via hypothesis tests using bootstrap methods, which provide interval estimates under LSE for the multinomial and binomial logistic regressions.
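Under soft partitioning, the same regression can be fitted with membership probabilities as observation weights. In the sketch below (an illustrative setup, not the gscaLCA implementation), each subject contributes an "in class k" row weighted by u_i and a "not in class k" row weighted by 1 − u_i, so the weighted score for a subject depends on u_i − p_i:

```python
import numpy as np

def weighted_logistic_newton(X, y, w, n_iter=25):
    """Binomial logistic regression with observation weights (sketch)."""
    X = np.column_stack([np.ones(len(X)), X])   # add intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        Wd = w * p * (1.0 - p)                  # weighted IRLS weights
        H = X.T @ (X * Wd[:, None])
        beta += np.linalg.solve(H, X.T @ (w * (y - p)))
    return beta

rng = np.random.default_rng(2)
x = rng.normal(size=300)                        # one covariate (invented)
u = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))      # soft memberships for class k
# Each subject appears twice: once as "in class k", once as "not in class k"
X2 = np.concatenate([x, x])
y2 = np.concatenate([np.ones(300), np.zeros(300)])
w2 = np.concatenate([u, 1.0 - u])
beta = weighted_logistic_newton(X2, y2, w2)
```

Because the memberships here are generated from the same logistic model, the weighted estimates recover the generating coefficients (0.3, 0.8) exactly; with real memberships, the estimates reflect the soft class composition rather than a single modal assignment.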

Empirical Data
Data from the National Longitudinal Study of Adolescent to Adult Health (Add Health) were used in this illustration. The Add Health study was conducted to investigate how adolescents' health trajectories influence the adult life course. As a longitudinal study, Add Health has included five waves to date. In this study, we focused on Wave IV data, which represent relatively active substance use. The data consisted of 5144 subjects aged 24 to 32. Seven variables concerning substance use and demographics were considered for this illustration of fitting gscaLCR. Five are dichotomous variables regarding the use of cigarettes, alcohol, marijuana, cocaine, and other illegal drugs. Two additional variables served as covariates indicating gender and education level. A total of 65.2% of subjects had smoked an entire cigarette, 80.3% had drunk beer, wine, or liquor more than two or three times, 54.7% had used marijuana, 19.1% reported cocaine use, and 21.5% reported use of other types of illegal drugs [21]. The gender covariate is coded as a dichotomous variable, and 54.0% of participants were female. The education level covariate is treated as a continuous variable ranging from 1 to 8: (1) did not graduate high school (8%), (2) graduated high school (16%), (3) vocational training after high school (10%), (4) college (33%), (5) completed college (20%), (6) graduate school (4%), (7) completed a master's degree (5%), and (8) beyond a master's degree (4%). The mean education level was 3.923, which on this scale is close to the college level.

Method
To focus on demonstrating the gscaLCR algorithm, this study used the three-class solution for the Add Health data found in the previous study [19]. It is worth noting that we aim to introduce the procedure of fitting covariates into an existing LCA model rather than enumerating the number of latent classes. For the purpose of illustrating the gscaLCR algorithm, two different GSCA models were considered: (Model 1) a gscaLCA model without any covariates and (Model 2) a gscaLCA model with the gender covariate in Step 1 when enumerating latent classes. The latter, Model 2, is illustrated in Figure 2; the specified model assumes that each observed variable is defined by a phantom latent variable and that those phantom variables are associated with each other.
Both the hard and soft partitioning approaches described in Step 2 were employed. In Step 3, two methods of fitting regressions (multinomial or binomial logistic) of latent classes on covariates were considered. Thus, we demonstrate eight sets of gscaLCR results (Models 1 and 2, hard and soft partitioning, and multinomial and binomial logistic regressions), following the flowchart of the gscaLCR algorithm in Figure 1. All analyses were executed with the gscaLCA package in R [47], in which we implemented the gscaLCR functions.

Results
The overall relationship between latent class membership prevalence and the five indicators is displayed in Figure 3. The results were consistent with those of our previous study, and we used the same names for the latent classes in the Add Health data as in [19]: (Class 1) the smoking and drinking group, (Class 2) heavy smoking and binge drinking, and (Class 3) heavy substance users. Depending on the inclusion of the gender covariate in the GSCA model (Model 1 vs. Model 2), the results differed slightly with respect to membership prevalence. The membership prevalence of the three classes is 47.73%, 20.25%, and 32.02% in Model 1, whereas the prevalence with the gender covariate in Model 2 is 54.32%, 20.29%, and 25.38%. When the gender covariate was added to the GSCA model in Step 1, more subjects were placed in the smoking and drinking class (Class 1) and fewer in the heavy substance user class (Class 3). On the other hand, the item response probabilities showed similar patterns in both models.

Multinomial logistic regression: Hard partitioning
The estimated coefficients from the multinomial logistic regressions examining the effects of gender and education level are presented in Table 1, which includes the results from both hard partitioning and soft partitioning with weights. For hard partitioning, the estimated coefficients of Model 1 for latent class 2, with latent class 1 as the reference category, included an intercept of −0.836; the full set of coefficients is given in Table 1. We interpret that, for the male group with average education levels, the odds of being in latent class 2 relative to latent class 1 are 0.4857. The corresponding odds for the female group with average education levels (3.923) are 0.382. Similarly, we found that the gender effect was statistically significant while the effect of education level was not. We also calculated the odds for latent class 3 relative to latent class 1; the computed odds are presented in Table 2.
For Model 2, which encompassed the gender covariate when fitting gscaLCA in Step 1, the results were similar but the effect of education level at the comparison between latent class 3 and latent class 1 was different from the results in Model 1. As shown in Table 1, the education level was a significant predictor of being in latent class 3 instead of latent class 1. In sum, the results of estimated coefficients and odds showed that the gender effect was statistically significant for latent classes 2 and 3 in comparison to reference latent class 1 regardless of whether the gender covariate was included in the GSCA model in Step 1. On the contrary, the education levels were not significantly influential on the prevalence of latent classes 2 and 3 over latent class 1 except for the estimated coefficient for latent class 3 relative to latent class 1 in Model 2, which was significant.

Multinomial logistic regression: Soft partitioning
When weights were taken into account (i.e., soft partitioning was applied), the estimated coefficients were similar to those obtained with hard partitioning in Model 1 but slightly different from those of hard partitioning in Model 2. In Model 1, the estimated covariate coefficients were slightly smaller than those based on hard partitioning. Nevertheless, the statistical test results for the covariate effects in Model 1 were consistent with the results before applying the weights: the gender effect was statistically significant while the education level effect was not.
In Model 2, by contrast, we found one inconsistency with the hard partitioning results. For the comparison between latent class 2 and latent class 1 (the reference group), neither the gender effect nor the education level effect was significant: −0.167 (p = 0.090) and 0.004 (p = 0.878), respectively. For the comparison between latent class 3 and latent class 1, both the gender and education level effects were significant: −0.646 (p < 0.001) and −0.061 (p = 0.016), in line with the statistical test results of hard partitioning.
In summary, the results in terms of statistical significance were the same as in the hard/soft partitioning in Model 1. However, the results between Model 1 and Model 2 were different. Model 2 indicated different results over hard/soft partitioning regarding the effect of gender.

Binomial Logistic Regression: Hard Partitioning
Binomial logistic regressions were fitted for each latent class separately, with the class variable coded into three latent class dummy variables based on modal assignment. Interpretations of binomial logistic regression are more straightforward than those of multinomial logistic regression. Equation (6) can be rewritten as

p_k = exp(β_0,k + β_1,k x_1 + ⋯ + β_P,k x_P) / (1 + exp(β_0,k + β_1,k x_1 + ⋯ + β_P,k x_P)),

which is the probability that a subject is in latent class k. For example, the probability of being in latent class 1, the smoking and drinking class, in Model 1 can be computed by plugging the coefficients in Table 3 into this expression. When the gender covariate was not associated with enumerating the latent classes in Step 1 (Model 1), the probability that a male subject with an average educational level (Edu = 3.923) belongs to latent class 1 is 42.98%, and the probability that a female subject with an average educational level belongs to latent class 1 is 52.58%. Similarly, the probabilities for the other two latent classes were estimated; these are presented in Table 3. The results showed an approximately 10% difference in membership prevalence between males and females, and the difference was statistically significant for latent class 1 (β_gender,1 = 0.427, p < 0.001) and latent class 3 (β_gender,3 = −0.473, p < 0.001). There were more females in latent class 1 (smoking and drinking) and more males in latent class 3 (heavy substance users). On the other hand, educational level did not show a significant influence on membership prevalence for any of the three latent classes. With the same approach, we can interpret the estimated regression coefficients of Model 2 based on the results in Tables 2 and 3. The results showed an approximately 10% difference in membership prevalence in latent class 1 (β_gender,1 = 0.426, p < 0.001; male = 48.59% and female = 59.13%) and latent class 3 (β_gender,3 = −0.540, p < 0.001; male = 30.83% and female = 20.62%), and the difference was statistically significant.
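The inverse-logit computation above can be sketched as follows. The intercept (−1.5) and education coefficient (0.05) are hypothetical placeholders, not the Table 3 estimates; only the gender coefficient (0.427) is taken from the text. The sketch simply shows how the class-membership probability is obtained for males and females at the mean education level.

```python
import math

def class_prob(beta0, b_gender, b_edu, female, edu):
    """Inverse-logit of the binomial model: P(subject in class k)."""
    eta = beta0 + b_gender * female + b_edu * edu
    return 1.0 / (1.0 + math.exp(-eta))

# Intercept and education coefficient are hypothetical, for illustration only
p_male = class_prob(-1.5, 0.427, 0.05, female=0, edu=3.923)
p_female = class_prob(-1.5, 0.427, 0.05, female=1, edu=3.923)
```

With a positive gender coefficient (female coded 1), the female probability exceeds the male probability at any fixed education level, mirroring the direction of the latent class 1 results reported above.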
Although the absolute magnitudes of the prevalence differed between Model 1 and Model 2, the differences between males and females in latent classes 1 and 3 were similar (around 10%). Another noticeable feature of Model 2 is that the education level was significant for latent classes 1 and 3, whereas it was not significant in Model 1. This implies that the model specification influences not only the membership prevalence but also the estimated relationship between membership and covariates.

Binomial Logistic Regression: Soft Partitioning
After applying the weights, the results of the binomial logistic regressions are presented in the bottom part of Table 3. The results for gender were consistent regardless of whether the gender variable was included in Step 1: more females were in latent class 1 and more males were in latent class 3, in line with the results before adjusting for gender in Step 1. On the other hand, the effects of education level changed between Model 1 and Model 2. Specifically, in Model 1, education level had a statistically significant effect on the prevalence of latent class 2 and latent class 3 when the weights were applied, although the magnitude was small. Conversely, the education level effects on latent classes 1 and 3 became non-influential after the adjustment for the gender variable in Step 1 (Model 2). These results show that the weighting can produce slightly different results in Model 1 and Model 2, although the main trend is maintained. In sum, the results in terms of statistical significance were the same for the gender effect across hard/soft partitioning in Model 1, whereas the results for the education level differed across hard/soft partitioning as well as between Model 1 and Model 2.

Discussion
The new algorithm for latent class analysis with covariates utilizing fuzzy clustering within GSCA (i.e., gscaLCR) was discussed with a real-world example in this study. More importantly, the specific algorithm of gscaLCR was established, and functions applying the three-step approach to gscaLCR are now available in the R package gscaLCA [19]. Because it is based on the LSE approach, the proposed algorithm diminishes the effect of non-normality both in enumerating the latent classes and in examining covariate effects on latent class prevalence. This means that researchers are now able to examine the effects of covariates using gscaLCR in parallel to maximum-likelihood-based LCR. Although gscaLCR and MLE-based LCR address research questions in LCA similarly, they are neither comparable nor competing approaches in an exploratory sense; rather, one of them should be selected prior to running the analysis in a confirmatory sense [48]. In this study, we focused on demonstrating that gscaLCR functions well in identifying homogeneous subgroups when the response variables are categorical.
Whether to include covariates in Step 1 of LCR remains controversial. Our regression analyses in Step 3 indicated some discrepancy between LCA models with and without covariates in Step 1. However, directly comparing the two approaches may be of limited value: the question is not whether including covariates in Step 1 is optional, but rather which covariates should be included, with a theoretical rationale for each. On the other hand, in this study, parameter estimates were more stable across hard and soft partitioning in the LCA model without covariates in Step 1 than in the model with covariates.
Although the gscaLCR algorithm is sound and promising for LCR, it has several limitations. First, in the current study the algorithm was applied to indicators with two response options, i.e., dichotomous indicators. However, the optimal scaling method applies to ordered categorical variables, so the gscaLCR algorithm can easily be extended to them. Although we could not include the results as an example in this study, we have verified that gscaLCR works well with indicators that have more than two response options and have implemented this functionality in the gscaLCA package. Second, the efficiency of parameter estimation was not examined here, not because such a study is difficult to conduct, but because we focused on proposing the new gscaLCR algorithm; our future research will examine the efficiency of parameter recovery. Third, we did not specify how small a sample size is sufficient to run gscaLCR. This may depend on the study design, but adequate sample sizes should be suggested, as was done for MLE-based LCR; this will be studied as a follow-up research topic. The last limitation concerns tools for model evaluation. GSCA offers several evaluation tools, including FIT, Adjusted FIT (AFIT), the Goodness of Fit Index (GFI) [22], and confirmatory tetrad analysis [49]. However, few established criteria of good fit are available for these tools. In LCR, as an extension of LCA, it is necessary to identify the number of latent classes, which requires comparing LCR models objectively. We did not address this issue in this study, but such tools should be provided along with gscaLCR.
It should also be noted that this study does not propose a new model in the LCR literature; rather, it provides a new enumeration approach, fuzzy clustering, and a new parameter estimation method, alternating least squares, for LCR using GSCA. Discussing the model-building procedure in detail is therefore beyond the scope of this research. We also extended the testing of covariate effects to soft and hard partitioning in gscaLCR using multinomial and binary logistic regressions, as has been done in MLE-based LCA.
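The hard/soft partitioning distinction used in these regressions can be illustrated as follows. This minimal Python sketch (hypothetical membership values; not code from the gscaLCA package) contrasts modal (hard) assignment, where each unit joins its highest-probability class, with soft partitioning, where the membership probabilities are retained and class prevalence is their average across units.

```python
def hard_partition(memberships):
    """Modal (hard) assignment: each unit joins its highest-probability class."""
    return [max(range(len(m)), key=m.__getitem__) for m in memberships]

def soft_prevalence(memberships):
    """Class prevalence under soft partitioning: the mean membership
    probability of each class across all units."""
    n, K = len(memberships), len(memberships[0])
    return [sum(m[k] for m in memberships) / n for k in range(K)]

# Hypothetical membership probabilities for three units and three classes.
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.4, 0.3, 0.3]]

labels = hard_partition(P)   # -> [0, 1, 0]
prev = soft_prevalence(P)    # -> [0.4, 0.4333..., 0.1666...]
```

Note how the third unit, whose membership is nearly uniform, is forced entirely into class 0 under hard partitioning, whereas soft partitioning preserves its uncertainty; this is exactly the information the weighted Step 3 regressions exploit.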
As noted, the bias from classification error does not occur in gscaLCR because gscaLCR does not require multivariate normality. Therefore, the discussion of classification error raised by Bolck et al. [42] is not necessary for gscaLCR. Nevertheless, there is still room to develop the gscaLCA package and the current gscaLCR estimation method. First, only listwise deletion is available for handling missing data in the current version of the gscaLCA package; implementing model-based multiple imputation is a future research topic. Second, the current gscaLCR does not provide analytical tools for multilevel modeling or multiple groups, which are available in MLE-based LCA. These limitations are not inherent drawbacks but directions for further development. Despite them, it is necessary to continue developing such new approaches to latent class analysis, because MLE-based LCR exhibits many estimation and identification issues and requires a relatively large sample size, problems that the gscaLCR algorithm can address.
We believe that the gscaLCR algorithm provides a new framework for fitting LCR for researchers in mixture modeling, and it can also be developed into structural equation mixture modeling, which combines factor analysis with mixture modeling.