A Symmetrical Analysis of Decision Making: Introducing the Gaussian Negative Binomial Mixture with a Latent Class Choice Model

Abstract: This research presents the 'Gaussian negative binomial mixture with a latent class choice model', a robust and efficient tool for analyzing decisions across different areas. The model combines elements of mixture models, negative binomial distributions, and latent class choice modeling to capture the complexities of decision-making processes. We explain how the model is formulated and estimated, and we demonstrate its performance in analyzing and predicting choices on a real-world dataset, marking an advancement in choice modeling. Our results highlight the applications of this model and point towards promising directions for future research, especially in exploring symmetrical patterns and structures within decision-making processes.


Introduction
The discrete choice model (DCM) has rapidly become popular owing to its practical applicability and its theoretical grounding in individual preferences. Interest in the DCM first arose in the context of transportation, and the contributions reviewed in [1] played a vital role from the 1970s to 1990. Since then, more behaviorally realistic forms of the DCM have been developed, including closed-form models such as nested logit, open-form models such as mixed logit, and latent class models.

Study | Data Analysis | Findings
[2] | Pearson correlation | Satisfaction with the indoor environment was associated with satisfaction regarding acoustics, thermal conditions, visual aspects, and air quality.
[3] | Principal component analysis, Pearson correlation, and linear regression | The overall fulfillment was influenced by satisfaction with the thermal, acoustic, and lighting conditions, air quality, control over the indoor environment, level of privacy, as well as the office layout, decor, and sanitation.
[4] | Pearson correlation | Satisfaction with the indoor environment was positively associated with satisfaction regarding air quality, thermal conditions, lighting, acoustics, and spatial conditions.
[5] | Multiple linear regression | Satisfaction with the warmth, air quality, air circulation, noise, humidity, and lighting had an impact on overall workplace comfort.
[6] | Multivariate logistic regression | The suitability of the overall indoor environment was influenced by the acceptability of the thermal conditions, acoustics, lighting, and air quality.
[7] | Pearson correlation | Satisfaction with the workspace showed correlations with several factors, including lighting, noise levels, air quality, heating, drafts, available space, furniture quality, privacy, and the color and layout of walls and partitions.
[8] | Squared multiple correlations (SMCs) | CLS can advance the working environment of hospital staff employed in a neuro-ICU or PACU.
[9] | Exploratory and confirmatory factor analysis and structural equation modeling | Satisfaction with the indoor workstation environment was determined by factors such as noise, air circulation, air quality, temperature, lighting, privacy, the view to the outside, as well as workspace size, esthetics, and the level of enclosure.
[10] | Multivariate logistic regression | The acceptability of the overall indoor environment was influenced by the acceptability of the thermal conditions, air quality, noise level, and illumination level.
[11] | Correspondence analysis and principal component analysis with ideal scaling | Workspace satisfaction was impacted by contentment with temperature, lighting conditions, air quality, acoustics, spatial aspects (including privacy and workspace individualization), office furniture, and office layout.
[12] | Literature review | The review of the existing literature highlights the dual advantages of a favorable indoor environmental quality (IEQ), encompassing both economic and health-related benefits. It underscores the substantial influence of the IEQ on occupant well-being and work efficiency.
[13] | Cross-sectional study design amongst objective measurements and subjective assessments | The study offered a positive suggestion for green buildings with qualitatively and quantitatively measured performance in terms of the IEQ.
[14] | Non-parametric techniques | Satisfaction with the overall environment was influenced by satisfaction with the thermal, acoustic, lighting, and air quality.
[15] | Non-parametric statistical tests | The degree of improvement was more pronounced when moving from a traditional building to one certified under the WELL standard, while it was less significant or even negligible when transitioning from buildings certified under BREEAM to WELL certification.
[16] | Literature review | The recovery of the indoor environment of a healthcare facility for both patients and, more significantly, medical staff.
The studies did not provide a consistent definition of occupant satisfaction. However, all of them approached occupant satisfaction broadly, linking it to either their satisfaction and well-being with the indoor environmental quality or their fulfillment and comfort with their workspace. Specifically, some studies [2,5,6,10] concentrated solely on how the indoor environmental quality influenced the satisfaction of building occupants.
Their findings indicated that the thermal, visual, and acoustic atmosphere and superior air quality influenced the building occupants' sense of fulfillment. While the significance of various indoor environmental factors in building occupants' satisfaction showed slight variations across studies, the thermal environment was consistently ranked as slightly more important than the air quality and the acoustic environment, and significantly more important than the visual environment.
A literature review conducted by [17] highlighted that various non-environmental factors significantly influence building occupants' satisfaction alongside indoor environmental factors. Factors such as occupants' control over the indoor environment, view satisfaction, privacy levels, and office layout have been identified as crucial determinants [3,4,8,18]. Moreover, recent work proposes applying machine learning techniques to the examination of medical data, potentially surpassing traditional methods [19]. The complexity of human decision making has recently highlighted the need to understand choices between risky options. One study, for example, used the choices13k dataset to train neural networks in a way that revealed information about decision noise and dataset bias [20]. For various other applications, see [16,21-26]. Further, the authors of [27] promote the development of the agent decision model and provide a new way to solve complex decision problems.
This research aims to present a latent class choice model applied to participants' environmental feedback data in an authentic setting. To overcome several issues with the conventional approaches, we developed a novel hybrid latent class choice model that combines a Gaussian negative binomial mixture model [28] with a latent class choice model. To evaluate the performance of the suggested model against more conventional models, we employed micro-ecological momentary assessments (micro-EMA) as secondary feedback data in this investigation.
Micro-EMA is a technique that utilizes a smartwatch interface to elicit and gather immediate, in-the-moment subjective feedback from an individual over several weeks [16]. We gained insight into that person's comfort preference patterns by obtaining an extensive volume of feedback from a single individual in various environments and comfort conditions. It is suggested that these behavioral patterns can be employed to categorize individuals into clusters based on their environmental perceptions. Consequently, grouping individuals with comparable comfort preferences could enhance the precision of forecasting where a person will feel comfortable and how the system can respond without extra sensors. Moreover, accumulating substantial quantities of subjective preference data from numerous individuals in a specific area can help define the comfort-related characteristics of that space to complement the data obtained from the installed sensors.
If it is theoretically feasible and minimally disruptive to the occupants, incorporating humans as sensors within buildings can revolutionize post-occupancy evaluations, building and system design, and controls and automation procedures. This opens up opportunities for individuals to contribute feedback in various scenarios, whether for short-term episodic purposes (spanning days or weeks), building commissioning extending to long-term assessments (over months or years), or continuous system control and management. This research aligns with the growing interest in other fields that leverage human input as sensors, such as cybersecurity [18], event detection using social media data [24], ecological momentary assessment [29], and emergency detection [30].
In the context of environmental measurements, previous efforts have involved the use of specific sensors mounted on mobile carts [31,32]. However, these sensors were not cost-effective for many building operation scenarios, and the affordability issue led to the development of low-cost continuous-sensing sensors, albeit with the requirement for frequent calibration [1,33,34]. Nonetheless, the placement of these sensors within buildings and the interpretation of their readings remained challenging in the literature, primarily due to the heterogeneous nature of indoor spaces [34]. Conversely, surveys introduced their own set of challenges, such as determining the appropriate questions to ask, selecting the right respondents, and interpreting the survey results [1]. Furthermore, [35] explored the concept of 'survey fatigue', wherein survey participants become overwhelmed by the volume of questions, potentially resulting in misrepresentations in responses and decreased response rates.
The literature on choice modeling in the context of environmental perspectives is estimated to exceed 14,200 publications. Some of the articles related to occupant preferences and satisfaction are listed, together with their findings, in Table 1.
Attention to discrete choice models as a theoretically sound and practical tool for investigating choice behavior, particularly behavioral outcomes such as willingness to pay, has grown rapidly. This development initially took place in the context of transportation, where McFadden made his initial contributions. The authors of [1] offer a historical overview of these contributions from the 1970s to the early 1990s. Since then, research on the various aspects of choice modeling has expanded significantly. This includes the development of more behaviorally realistic discrete choice model forms, such as closed-form models like nested logit and open-form models like mixed logit, latent class, and generalized mixed logit. New data paradigms have also emerged, including mixed data approaches and stated preference and choice studies. Additionally, researchers have incorporated process heuristics into choice models, such as hybrid logit models, to handle attribute endogeneity and account for attribute non-attendance [36].

Model Framework
This section presents the subject model, namely the Gaussian negative binomial mixture with a latent class choice model (GNBM-LCCM). We then compare it extensively with benchmark models, i.e., the multinomial logit (MNL), mixed logit, and latent class choice models, and observe that it outperforms them. By including the negative binomial distribution, the subject model effectively addresses overdispersion. Additionally, the latent class choice component makes it more reliable for decision making under heterogeneity. The subject model performed better under heterogeneity across classes and variability in the data.

Latent Class Choice Model
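The latent class choice model (LCCM) consists of two sub-models: a class membership model and a class-specific choice model. The class membership model is defined as a function of the features of the decision-makers associated with a particular class. The utility $\omega_{ml}$ of decision-maker $m$ associated with class $l$ is stated as follows:

$$\omega_{ml} = C_m^{\top}\varphi_l + \nu_{ml} \tag{1}$$

where $C_m$ is a vector of the features of decision-maker $m$, $\varphi_l$ is the corresponding vector of unknown parameters, and $\nu_{ml}$ is an error term that is assumed to be i.i.d. Extreme Value Type-I over decision-makers and classes. The probability of decision-maker $m$ being associated with class $l$ is specified as follows:

$$P(r_{ml} = 1 \mid C_m, \varphi) = \frac{\exp\left(C_m^{\top}\varphi_l\right)}{\sum_{l'=1}^{L}\exp\left(C_m^{\top}\varphi_{l'}\right)} \tag{2}$$

where $r_{ml} = 1$ indicates that decision-maker $m$ belongs to class $l$ and $L$ is the number of classes.

The second model, namely the class-specific choice model, is defined as the probability of selecting a particular option as a function of the observed exogenous features of the options, conditioned on the decision-maker belonging to class $l$. The utility of decision-maker $m$ selecting option $k$ at time $\tau$, denoted here by $U_{mk\tau \mid l}$, is expressed as follows:

$$U_{mk\tau \mid l} = E_{mk\tau}^{\top}\delta_l + \nu_{mk\tau \mid l} \tag{3}$$

where $E_{mk\tau}$ is a vector of the observed features of option $k$ at time $\tau$, $\delta_l$ is the corresponding vector of unknown parameters, and $\nu_{mk\tau \mid l}$ is an error term that is assumed to be i.i.d. Extreme Value Type-I. The conditional probability of decision-maker $m$ selecting option $k$ at time $\tau$ is given as follows:

$$P(t_{mk\tau} = 1 \mid E_{mk\tau}, r_{ml} = 1, \delta_l) = \frac{\exp\left(E_{mk\tau}^{\top}\delta_l\right)}{\sum_{k'=1}^{K}\exp\left(E_{mk'\tau}^{\top}\delta_l\right)} \tag{4}$$

where $K$ is the number of available options.

Symmetry 2024, 16, 908

Let $t_m$ be the $(K \times \tau_m)$ matrix of all the choices of decision-maker $m$ and $E_m$ the corresponding $(K \times \tau_m)$ matrix of option features, where $t_{mk\tau}$ equals 1 if decision-maker $m$ selects option $k$ at time $\tau$ and 0 otherwise. The conditional probability of observing $t_m$, given that decision-maker $m$ belongs to class $l$, is expressed as follows:

$$P(t_m \mid E_m, r_{ml} = 1, \delta_l) = \prod_{\tau=1}^{\tau_m}\prod_{k=1}^{K} P(t_{mk\tau} = 1 \mid E_{mk\tau}, r_{ml} = 1, \delta_l)^{t_{mk\tau}} \tag{5}$$

The likelihood of the choices of decision-maker $m$ is obtained by combining the conditional choice probability with the probability of the decision-maker belonging to class $l$:

$$P(t_m \mid C_m, E_m, \varphi, \delta) = \sum_{l=1}^{L} P(r_{ml} = 1 \mid C_m, \varphi)\, P(t_m \mid E_m, r_{ml} = 1, \delta_l) \tag{6}$$

The resulting likelihood over all $M$ decision-makers is as follows:

$$\mathcal{L} = \prod_{m=1}^{M}\sum_{l=1}^{L} P(r_{ml} = 1 \mid C_m, \varphi)\, P(t_m \mid E_m, r_{ml} = 1, \delta_l) \tag{7}$$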
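The two-part structure of the LCCM can be illustrated with a minimal pure-Python sketch. It assumes linear-in-parameters utilities with logit (softmax) probabilities, which is what i.i.d. Extreme Value Type-I errors imply; the function names and the toy numbers are hypothetical and are not taken from the study.

```python
import math

def softmax(scores):
    """Numerically stable logit probabilities for a list of utilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def class_probs(C_m, phi):
    """Class membership probabilities of one decision-maker (one entry per class)."""
    return softmax([dot(C_m, phi_l) for phi_l in phi])

def choice_probs(E_mt, delta_l):
    """Class-specific choice probabilities over the K options of one choice task."""
    return softmax([dot(E_mkt, delta_l) for E_mkt in E_mt])

def individual_likelihood(C_m, E_m, chosen, phi, delta):
    """Class-weighted likelihood of the observed choice sequence of one decision-maker."""
    total = 0.0
    for weight, delta_l in zip(class_probs(C_m, phi), delta):
        seq = 1.0
        for E_mt, k in zip(E_m, chosen):  # product over choice occasions
            seq *= choice_probs(E_mt, delta_l)[k]
        total += weight * seq
    return total

def log_likelihood(data, phi, delta):
    """Sample log-likelihood: sum of the log individual likelihoods."""
    return sum(math.log(individual_likelihood(C, E, t, phi, delta)) for C, E, t in data)

# Hypothetical toy inputs: L = 2 classes, K = 3 options, 2 choice occasions.
phi = [[0.0, 0.0], [0.5, -0.2]]
delta = [[1.0, 0.0], [-1.0, 0.5]]
C_m = [1.0, 2.0]
E_m = [[[0.2, 1.0], [0.8, 0.0], [0.5, 0.5]],
       [[0.1, 0.3], [0.4, 0.9], [0.7, 0.2]]]
chosen = [1, 2]  # index of the selected option at each occasion
print(log_likelihood([(C_m, E_m, chosen)], phi, delta))
```

The sum over classes sits inside the logarithm, which is why direct maximization of this log-likelihood is awkward and why the latent class indicator is usually handled with an iterative scheme.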

Proposed Model
The proposed model is obtained by replacing the class membership probability with the Gaussian negative binomial mixture model. The subject model is a hybrid machine learning approach, as it combines two types of models (continuous and discrete) into one: a Gaussian mixture model (GMM) is used for the continuous variables and a negative binomial mixture model (NBMM) for the discrete variables. The vector $C_m$ of the attributes of decision-maker $m$ is split into two sub-vectors, $C_{cm}$ and $C_{dm}$, which hold the continuous and discrete attributes and have dimensions $\eta_c$ and $\eta_d$, respectively.
Gaussian Negative Binomial Mixture Model

The GMM is a collection of $L$ Gaussian densities $\mathcal{N}(C_{cm} \mid \lambda_{cl}, \Pi_{cl})$, where each density is a component of the mixture with its own mean vector $\lambda_{cl}$ and covariance matrix $\Pi_{cl}$. The mixing coefficient $\Lambda_l$ represents the overall probability that an observation comes from component $l$.

The negative binomial distribution is a useful and reliable distribution for count data, and it has become a popular tool for handling count data with overdispersion. Because the data contain multiple latent classes, the model assumes that the observed counts are generated from a mixture of negative binomial distributions. Estimating the component parameters and the mixing proportions of the negative binomial mixture model allows it to capture the heterogeneity and excess variation in the data.

The marginal and posterior probabilities are estimated using Bayes' theorem, assuming that the Gaussian and negative binomial distributions of the continuous and discrete attributes are independent given the class, in line with the conditional independence properties of the graphical structure of the proposed model. The joint probability is the product of four terms: the first is the class (label) probability, the second and third are the conditional probabilities of $C_{cm}$ and $C_{dm}$, and the fourth is the choice probability conditional on the class. In these expressions, $R_{dm_i}$ denotes the $r$-th vector of favorable features, $C_{dm_i}$ denotes the discrete characteristics of decision-maker $m$, and $\lambda_{cl}$ and $\lambda_{dl}$ are the corresponding mean vectors of the continuous and discrete distributions.
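The four-term product can be sketched in log form as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Gaussian uses a diagonal covariance (one of several possible covariance structures), the negative binomial is parameterized by its mean and a fixed dispersion `r` (a hypothetical choice), and all function and argument names are invented for the example.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed structure)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def nb_logpmf(y, mean, r):
    """Log pmf of a negative binomial with the given mean and dispersion r.

    Assumed parameterization: success probability p = r / (r + mean),
    so the variance is mean + mean**2 / r (overdispersed relative to Poisson).
    """
    p = r / (r + mean)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log(1 - p))

def log_joint(C_c, C_d, log_choice, mix, mean_c, var_c, mean_d, r):
    """Log of the four-term product for one decision-maker and one class:
    class probability x Gaussian density x negative binomial pmf x choice probability."""
    return (math.log(mix)
            + diag_gaussian_logpdf(C_c, mean_c, var_c)
            + sum(nb_logpmf(y, m, r) for y, m in zip(C_d, mean_d))
            + log_choice)

# Hypothetical values for one decision-maker and one class.
value = log_joint(C_c=[21.5, 40.0], C_d=[3], log_choice=math.log(0.6),
                  mix=0.25, mean_c=[22.0, 45.0], var_c=[4.0, 25.0],
                  mean_d=[2.0], r=1.5)
print(value)
```

Working in logs keeps the product of small densities numerically stable, and treating the discrete attributes as independent given the class mirrors the conditional independence assumption described above.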

Joint Probability
The joint probability of $C_{cm}$, $C_{dm}$, and $t_m$ is obtained by marginalizing Equation (8) over all components $l$. Here, $\lambda_c$ and $\lambda_d$ are matrices containing the $L$ mean vectors of the continuous and discrete variables, $\Pi_c$ is a matrix containing the $L$ covariance matrices $\Pi_{cl}$, and $\delta$ is a matrix containing the $L$ vectors $\delta_l$. The dependencies on the left-hand side of the equation are omitted to keep the notation compact.

The overall joint probability can be estimated by different methods (e.g., maximum likelihood estimation, Hessian-based procedures, and the Expectation-Maximization (EM) algorithm). Traditional maximum likelihood estimation is intractable for both the LCCM and the GNBM-LCCM because of the summation over the $L$ classes that appears inside the logarithm of the likelihood. Moreover, as the number of parameters in the model increases, the MLE becomes more burdensome and time-consuming. In addition, empirical singularity problems may arise during the Hessian matrix computations, which become numerically challenging [37]. The EM algorithm is therefore an effective way to overcome these problems, and it is a powerful technique for estimating parameters in the presence of latent variables.

EM Algorithm
The EM algorithm is divided into two steps, the expectation (E) step and the maximization (M) step.

E-step: This step starts from the joint likelihood function. Taking the logarithm of the likelihood breaks the function into two separate terms, i.e., the class membership model and the class-specific choice model. The expected value of the latent variable $r_{ml}$ is then found using Bayes' theorem. It is important to note that $\Lambda_l$ in Equation (9) and $\gamma_{r_{ml}}$ in Equation (16) represent the prior probability and the corresponding posterior probability, respectively.

M-step: In this step, the unknown parameters are estimated. Because of the latent variable $r_{ml}$, Equation (16) cannot be maximized directly; instead, Equations (16) and (19) are combined, and the derivatives with respect to the unknown parameters are set to zero.

Overall, the EM algorithm alternates between the E-step and the M-step until convergence is attained. First, the unknown parameters are initialized. Second, the latent variable (Equation (19)) is estimated by taking the expectation using Bayes' theorem. Third, the parameters are updated: Equations (21) to (24) give closed-form solutions for the mixing coefficient, the Gaussian mean matrix, the negative binomial mean matrix, and the Gaussian covariance matrix, respectively, whereas no closed-form solution exists for the parameter $\delta_l$ in Equation (25), which is therefore estimated with a gradient-based numerical optimization method. Finally, the log-likelihood is evaluated with the updated parameter values and checked for convergence. If the convergence criterion is not met, we return to the E-step.
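One E-step and the closed-form part of one M-step can be sketched as below. The sketch assumes that the per-class log joint terms have already been computed, that the Gaussian covariance is diagonal, and that the negative binomial dispersion is held fixed so that its mean update reduces to a weighted average; the variance floor and all names are hypothetical. The update of $\delta_l$ is left out because it requires a numerical optimizer.

```python
import math

VAR_FLOOR = 1e-6  # assumption: small floor that keeps variances strictly positive

def e_step(log_joint):
    """Posterior class responsibilities from per-class log joint terms.

    log_joint[m][l] is the log joint probability of decision-maker m under class l.
    """
    gamma = []
    for row in log_joint:
        top = max(row)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in row]
        total = sum(weights)
        gamma.append([w / total for w in weights])
    return gamma

def weighted_mean(g, X, n_l):
    """Responsibility-weighted mean of the rows of X."""
    return [sum(g_m * x[j] for g_m, x in zip(g, X)) / n_l for j in range(len(X[0]))]

def m_step(gamma, C_c, C_d):
    """Closed-form updates of the mixing coefficients, Gaussian means and
    variances (diagonal), and negative binomial means.

    Assumes every class receives a nonzero total responsibility.
    """
    M, L = len(gamma), len(gamma[0])
    mix, mean_c, var_c, mean_d = [], [], [], []
    for l in range(L):
        g = [row[l] for row in gamma]
        n_l = sum(g)  # effective number of decision-makers in class l
        mu = weighted_mean(g, C_c, n_l)
        var = [max(sum(g_m * (x[j] - mu[j]) ** 2 for g_m, x in zip(g, C_c)) / n_l,
                   VAR_FLOOR)
               for j in range(len(mu))]
        mix.append(n_l / M)
        mean_c.append(mu)
        var_c.append(var)
        mean_d.append(weighted_mean(g, C_d, n_l))
    return mix, mean_c, var_c, mean_d

# Hypothetical example: 3 decision-makers, 2 classes.
gamma = e_step([[-1.0, -3.0], [-2.0, -2.5], [-4.0, -0.5]])
print(m_step(gamma, C_c=[[21.0], [23.0], [28.0]], C_d=[[1], [3], [7]]))
```

In a full estimation loop these two functions would be called repeatedly, with the log joint terms recomputed from the updated parameters and the log-likelihood monitored for convergence.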

Final Likelihood
After attaining convergence, the marginal probability of observing the vector of choices $t$ of all $M$ decision-makers is evaluated as follows:

$$P(t) = \prod_{m=1}^{M}\sum_{l=1}^{L} P(r_{ml} = 1 \mid C_{cm}, C_{dm}, \lambda_{cl}, \Pi_{cl}, \lambda_{dl}, \Lambda_l) \prod_{\tau=1}^{\tau_m}\prod_{k=1}^{K} P(t_{mk\tau} = 1 \mid E_{mk\tau}, r_{ml} = 1, \delta_l)^{t_{mk\tau}} \tag{26}$$

where $P(r_{ml} = 1 \mid C_{cm}, C_{dm}, \lambda_{cl}, \Pi_{cl}, \lambda_{dl}, \Lambda_l)$ is the posterior probability of the vector $C_m = \{C_{cm}, C_{dm}\}$ being generated by cluster $l$.
This posterior probability is obtained from Bayes' theorem (Equation (27)) and can be used to compare the GNBM-LCCM with the traditional LCCM of Equation (7). It is also used to compute the out-of-sample prediction accuracy.
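As a rough illustration of how the posterior class probabilities enter prediction, the sketch below mixes class-specific choice probabilities with posterior weights and evaluates the resulting log-likelihood of observed choices. It assumes a single choice occasion per decision-maker to keep the example short; the function names and numbers are hypothetical.

```python
import math

def predict_choice_probs(posterior, class_choice_probs):
    """Marginal choice probabilities: class-specific probabilities weighted by
    the posterior class probabilities of one decision-maker."""
    K = len(class_choice_probs[0])
    return [sum(w * probs[k] for w, probs in zip(posterior, class_choice_probs))
            for k in range(K)]

def marginal_log_likelihood(posteriors, class_choice_probs, chosen):
    """Log of the marginal probability of the observed choices, assuming one
    choice occasion per decision-maker (a simplification for illustration)."""
    total = 0.0
    for post, probs, k in zip(posteriors, class_choice_probs, chosen):
        total += math.log(predict_choice_probs(post, probs)[k])
    return total

# Hypothetical example: one decision-maker, 2 classes, 2 options.
posterior = [0.25, 0.75]
probs_by_class = [[0.8, 0.2], [0.4, 0.6]]
print(predict_choice_probs(posterior, probs_by_class))
```

For out-of-sample use, the posterior weights would be computed from the attributes of a new decision-maker with the estimated mixture parameters, and the option with the highest marginal probability would be taken as the prediction.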

Real-Life Application with Discussion
Dataset Overview: This dataset comprises data collected from the BUDS lab deployments of the Cozie Fitbit smartwatch platform. It involves collecting intensive longitudinal subjective feedback regarding comfort-based preferences through micro-ecological momentary assessments on a smartwatch platform. In an experiment conducted over two weeks with 30 occupants, a total of 4378 field-based surveys were generated to assess thermal, noise, and acoustic preferences.
Throughout the entire study, the environmental variables (such as temperature and relative humidity) in three different buildings were observed.The participants used an open-source application called Cozie on their smartwatches to complete comfort surveys.
Additionally, a custom-designed smartphone application constantly tracked their indoor locations.This location data allowed us to accurately synchronize the timing and spatial aspects of environmental measurements with the thermal preference responses provided by the participants.
To extract insights from the dataset, we began by carefully reviewing the features and their corresponding descriptions (refer to Table 2). This preliminary examination provided a holistic understanding of the dataset's contents and facilitated the identification of the essential variables that shape user choice behavior. These key variables were subsequently selected for more in-depth analysis and model development.

Table 3 presents the mean matrix of the class membership model for the subject data. This matrix shows how the occupants are distributed among the different latent classes within the dataset. By examining it, we can gauge the likelihood of users belonging to each class and identify the underlying class structure within our hybrid model.

Figure 1 graphically represents the majority of the data attributes. The distribution of the environmental light values already exhibits a noticeable overlap across the different visual feedback responses (top-middle distribution in Figure 2).
The variable 'time' was generated by applying feature engineering techniques to the timestamps at which the occupants provided feedback. This engineered feature represents time cyclically, taking into account both the hour of the day and the day of the week. This straightforward feature was integrated into all the scenarios to identify potential cyclical patterns or factors influencing preference prediction. The attribute 'time' is used to divide the class membership model into distinct classes, as follows:

Class L = 1: This initial latent class corresponds to environmental data recorded from September 28th to October 10th. These observations represent a specific group characterized by an early time frame.
Class L = 2: The second class is associated with observations recorded from October 11th to October 22nd.
Class L = 3: The third class is related to the period spanning from October 23rd to November 3rd.
Class L = 4: The fourth and final class comprises observations with a time frame from November 4th to November 15th, representing the last observation period for the occupants.

These class assignments help delineate the temporal structure of the dataset and provide a meaningful segmentation of the occupant observations. The overlap observed in Figure 2 strengthens the argument that relying solely on environmental measurements is insufficient for characterizing an individual's preferences and thus leads to less accurate predictions, as has been observed in earlier research [7].

Table 4 provides an overview of the parameter estimates of the class-specific choice model for each latent class within the hybrid model. These estimates enable us to quantitatively assess the influence of different variables on the choice behavior of users within each class. The proposed model facilitated the division and categorization of the building zones based on the various comfort preferences exhibited by the occupants in those areas. This outcome primarily offered facility managers an overview of the office spaces they oversee, equipping them with insights to enhance comfort and take necessary actions.

In Table 5, we compare the proposed Gaussian negative binomial mixture with a latent class choice model (GNBM-LCCM) and traditional models. Our evaluation involves the use of various metrics, including AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), HQIC (Hannan-Quinn Information Criterion), LL (log-likelihood), Joint LL (joint log-likelihood), and Pred LL (predictive log-likelihood). This comparative assessment allows us to determine the superiority of our GNBM-LCCM over traditional models in terms of model fit, complexity, and predictive accuracy. The visual representations of these benchmark comparisons are also provided in Figures 3-5.

Because the preference feedback in this methodology occurred at a notably higher frequency than typical surveys or occupants' interactions with thermostats, this study possessed preference data of a relatively diverse temporal and spatial nature. In the initial observation, it is apparent that the office space generally provided a comfortable environment, whereas outdoor seating areas exhibited an overall higher preference for cooling. The study also captured time-dependent fluctuations, revealing the model's capability to predict comfort preferences that varied across different times of the day or days of the week. Notably, within the office environment, there was a peak in warmer preference around mid-day. However, the model sometimes produced inaccurate predictions for periods when no data were available. The square peaks observed in the office area for aural and visual prediction, particularly between the hours of 22:00 and 7:00, were a result of the absence of data to make accurate predictions during those times.
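The information criteria used in the model comparison follow directly from a fitted model's log-likelihood, its number of free parameters, and the number of observations. The sketch below uses the standard definitions; the function name and the input values are placeholders for illustration only and are not results from this study.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Standard information criteria; lower values indicate a better trade-off
    between goodness of fit and model complexity."""
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * math.log(n_obs) - 2 * log_lik
    hqic = 2 * n_params * math.log(math.log(n_obs)) - 2 * log_lik
    return {"AIC": aic, "BIC": bic, "HQIC": hqic}

# Placeholder inputs, not values from the paper.
print(information_criteria(log_lik=-100.0, n_params=5, n_obs=1000))
```

All three criteria penalize the same fit term, so they differ only in how strongly additional parameters are charged: BIC penalizes most heavily for large samples, AIC least, and HQIC lies in between.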
Figures 6 and 7 provide a comprehensive comparison of the performance metrics for four models (MNL, mixed logit, LCCM, and GNBM-LCCM), evaluated both in-sample and out-of-sample. Table 6 reports the in-sample evaluation criteria, where the GNBM-LCCM outperforms the benchmark models on most metrics. Specifically, the GNBM-LCCM demonstrates superior accuracy (0.9246), recall (0.9204), precision (0.8693), and AUC (0.8971), with only a slight dip in the F1 score (0.8249) compared to the LCCM (0.8396). These results indicate that the GNBM-LCCM not only enhances the predictive accuracy but also effectively captures the intricacies of decision-making processes within the sample data.

... might oversimplify the input data, especially with sparse or noisy datasets. Additionally, in large-scale applications, parameter estimation may lead to false predictions, which requires more consideration of computational complexity. Therefore, future research could focus on the scalability and performance of the GNBM-LCCM on large datasets. The applicability and robustness of the subject model could be investigated in different areas using different datasets. Finally, the subject model could be enhanced by adding an alternative regularization technique for sparse or noisy data.

Conflicts of Interest:
The authors declare no competing interests.

Nomenclature

$\omega_{ml}$
The utility of decision-maker $m$ associated with class $l$.

$E_{mk\tau}$
The vector of observed characteristics of individual $m$.

$C_{cm}$
The continuous attributes of decision-maker $m$.

$C_{dm}$
The discrete attributes of decision-maker $m$.

Figure 1. Visual illustration of some of the attributes of environmental data.

Figure 2. Visual illustrations of dispersion of sensor data based on preference voting.

Figure 3. Visual illustration of traditional models with proposed model (diagonal covariance) using different criteria.

Figure 4. Visual illustration of traditional models with proposed model (spherical covariance) using different criteria.

Figure 5. Visual illustration of traditional models with proposed model (full covariance) using different criteria.

Figure 6. Visual illustration of traditional models with proposed model using different evaluation criteria (in-sample).

Figure 7. Visual illustration of traditional models with proposed model using different evaluation criteria.

$\eta_c$
The number of elements in $C_{cm}$.

$\eta_d$
The number of elements in $C_{dm}$.

$\delta_l$
The unknown parameter of utility belonging to class $l$ at the time $\tau$.

$\varphi_l$
The unknown parameter of utility belonging to class $l$.

$\lambda_{cl}$ and $\lambda_{dl}$
The means of the Gaussian and negative binomial mixture models.

$\Pi_{cl}$
The covariance of the Gaussian mixture model.

$\Lambda_l$
Mixing probability associated with class $l$.

Table 1. Research on workspace satisfaction factors.

Table 2. List of features and their descriptions in the initial dataset.

Table 3. Mean matrix of class membership model.

Table 4. Parameter estimation of the class-specific choice model (GNBM).

Table 5. Model comparison of proposed model with benchmark models using different criteria.

Table 6. Evaluation criteria of benchmark models and proposed model (in-sample).