Smart Cities and Awareness of Sustainable Communities Related to Demand Response Programs: Data Processing with First-Order and Hierarchical Confirmatory Factor Analyses

Abstract: The mentality of electricity consumers is one of the most important aspects that must be addressed when dealing with issues in the operation of power systems. Consumers are used to being completely passive, but this has recently changed as significant progress in Information and Communication Technologies (ICT) and the Internet of Things (IoT) has gained momentum. In this paper, we propose a statistical measurement model using a covariance structure, specifically a first-order confirmatory factor analysis (CFA) using the SAS CALIS procedure, to identify the factors that could contribute to the change of attitude within energy communities. Furthermore, this research identifies latent constructs and indicates which observed variables load on or measure them. For the simulation, two complex data sets of questionnaires created by the Irish Commission for Energy Regulation (CER) were analyzed, demonstrating the influence of some exogenous variables on the items of the questionnaires. The results revealed that there is a relevant relationship between the social–economic and the behavioral factors and the observed variables. Furthermore, the models provided a good fit to the data, as measured by the performance indicators.


Introduction
The implementation of demand response (DR) programs is sensitive to consumers' perception and mentality in relation to several factors that are not always measured directly using tools such as surveys or questionnaires. Confirmatory factor analysis (CFA) is a statistical instrument that allows the testing of hypotheses that are supported by theory. It requires a good knowledge of the field to identify the latent factors and group the observed variables by those latent factors.
For this investigation, we extracted observed variables from two complex and large questionnaires that targeted Irish electricity consumers during the pre-trial and post-trial periods of installing smart metering systems (SMS), in which their consumption and opinions were monitored. The data sets can be accessed (https://www.ucd.ie/issda/data/commissionforenergyregulationcer/ (accessed on 25 January 2022)) from CER, which initiated a project that aimed to evaluate the performance of several DR programs using SMS and tested the opportunity to install more SMS. The project consists of two of the largest and most comprehensive trials in the world that are aimed at both residential consumers and small and medium enterprises.
The two 143-item and 243-item questionnaires were administered to residential electricity consumers that were subject to a pre- and post-trial SMS implementation carried out by the Irish CER to test the electricity consumers' behavior. The questions accounted for numerous latent factors such as demographics (sex, age, education, employment status), attitude, expectation, relation with supplier, perceived impact of their own actions, pro-environmental measures, appliances, heating usage, and perceived implementation of the DR programs. In this article, the research is focused on the residential consumer respondents to the questionnaires, who accounted for 4232 observations (pre-trial) and 3423 observations (post-trial). Thus, our study uses the real-life experience of the Irish residential consumers.
Our purpose was to test the structure of the factors underlying the questionnaire data sets. Thus, we empirically test the theory that describes the structure of factors that underlies the two data sets [1,2]. Furthermore, this study verifies whether the measurement model created with CFA shows an acceptable fit to the data and shows how it can be modified to be even better. The contribution of this paper consists of the following:
• the creation of two measurement models (first-order CFA) that test the structure of factors that are behind the observed variables of the two large and complex questionnaires;
• the creation of several hierarchical CFA models, such as second-order and bi-factor models, to reflect the relation between the items of the questionnaires and the awareness of electricity consumers;
• the testing of the models to prove that they do not capitalize on chance characteristics of the data, proving that the data-model fit is not random, and the model can generalize on different data sets;
• drawing interesting conclusions and providing information on the relationship between respondents' answers and their awareness of environmental issues and implementation of DR programs.
This paper is structured into six sections. In Section 1, an introduction to the general context, motivation, and contribution of this paper is provided. In Section 2, a comprehensive literature review is performed, while in Section 3, the research methodology is presented. We provide an exploratory data analysis (EDA) of the input data in Section 4. The results and simulations are presented in Section 5 along with a discussion of the main findings of this study. The conclusion is drawn in Section 6.

Review of the Literature
Numerous studies use CFA in investigations related to consumer awareness, behavior, education, psychology, economy, etc., in social science research to analyze questionnaire data and to extract valuable insights that are not easy to measure otherwise [3][4][5][6][7].
A first wave of studies related to attitudes towards energy consumption and conservation using CFA emerged in the United States in the 1990s [8] and focused on identifying the dimensions underlying energy usage attitudes, concepts, and beliefs. A total of 308 observations taken from the respondents in Macau were analyzed using CFA and structural equation modelling (SEM) to show that environmental concerns and financial benefits were the factors that influenced the perception of full electric vehicles (FEV) and the intention to purchase FEV. It revealed that the perception of economic benefit was one of the key factors influencing the adoption of FEV [9].
Furthermore, the authors of [10] analyzed the responses of 246 electricity consumers from Pakistan using CFA and SEM to measure consumers' awareness about electricity conservation in developing countries. It measured the influence of some factors such as beliefs, attitudes, and intentions on energy conservation, showing that awareness [11], perceived value, resistance to change, and benefits were predictors of behavioral changes towards efficient energy conservation measures.
An investigation of the effect of digitalization, climate change and energy consumption of 282 Indonesian respondents using CFA and regression was proposed in [12]. Awareness of smart technologies increased the experience of consumers.
The factors that impacted the awareness of the residential consumers, perceived benefits, price and attitudes were investigated in [13] considering 516 respondents from Jordan. Both exploratory and confirmatory factor analyses were carried out to identify the latent factors and to check the performance indicators. The relationship among these factors was also analyzed using SEM. The study found that awareness positively impacted acquisition intentions, perceived gain, and consumption behavior.
Energy saving practices and participation towards renewable energy development were studied in Malaysia using CFA and AMOS software [14,15] or with both Exploratory Factor Analysis (EFA) and CFA [16]. The latter study contained a factor structure that consisted of the degree of knowledge on RES, environmental concern, public awareness on RES, attitude towards RES usage, and willingness to adopt RES technologies and aimed to predict the willingness to pay for green energy.
Furthermore, several case studies were deployed in Tanzania [17] assessing indicators of energy access in rural areas using both EFA and CFA and concluding that these indicators related to the energy access were significant in improving the social economic progress and the living standards of the people in rural areas of Tanzania.
Another software package, LISREL, was used in [18] to study the factors underlying the waste-to-energy facilities in Thailand. A sample of 361 individuals' responses was analyzed with CFA and the Structural Equation Model (SEM), showing that all factors had a positive influence on the waste-to-energy facilities.
More studies on influencing factors related to energy management in industries [19], assessing the employees' engagement in energy-saving [20], residential consumers' lifestyle and energy saving [21], and experience in green energy learning [22] were recently performed using CFA and sometimes SEM to identify future trends. The authors of [19] aimed to analyze the influencing factors on energy management in industries from several perspectives with CFA. They analyzed the responses to surveys applied to different industrial sectors in Brazil showing a positive correlation among the factors.
Although numerous studies have been conducted with consumption data and questionnaires deployed by the Irish CER, to our knowledge, the two large questionnaires with 3000-4000 respondents and 150-250 items have not yet been analyzed with CFA.

Research Methodology
The research methodology implemented in this article consisted of processing large and challenging data sets for CFA. First, we split the questions of each questionnaire into significant groups that revealed specific characteristics of electricity consumers and their consumption. The pre-trial questionnaire consisted of 4232 observations and 143 questions that were grouped by 7 variables summing up data on: q1 Demographics, q2 Positive attitude, q3 Negative attitude, q4 Pro-environmental measures, q5 Expectations, q6 Relation with supplier, and q7 Appliances. These variables were further grouped by 2 latent factors: F1 social-economic and F2 behavioral factors.
On the other hand, the post-trial questionnaire had 3423 observations and 234 questions that were grouped by 9 variables that summed up data about: q1 DR program, q2 Demographics, q3 Positive attitude, q4 Negative attitude, q5 Heating, q6 Pro-environmental measures, q7 Positive perception on price-based DR implementation, q8 Negative perception on price-based DR implementation, and q9 Perception of incentive-based DR. Each observed variable was influenced by one latent factor and a measurement error. For CFA, the SAS CALIS procedure was implemented using the lineqs statement that allowed us to define the observed variables as linear equations. Similar results could be obtained with a factor statement that is similar to lineqs.
In the lineqs statement, we defined a linear equation for each observed variable $q_i$. It is equal to the factor loading $p_i$ multiplied by the latent factor $F_k$ plus a measurement error or residual term $e_i$:

$$q_i = p_i F_k + e_i$$

where $i$ indexes the observed variables and $k$ indexes the latent factors. Using the variance and covariance (cov) statements, we defined the variances and covariances that were calculated using the CALIS procedure. The variances of the latent factors were set to 1, whereas the factors were allowed to covary, and their covariance was defined in the cov statement.
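As a minimal sketch of what these linear equations imply, the Python snippet below builds the covariance matrix implied by a two-factor model for the pre-trial grouping (q1, q4, q6, q7 on F1; q2, q3, q5 on F2), with factor variances fixed to 1 and the two factors allowed to covary. All loading, error-variance, and factor-covariance values are hypothetical placeholders chosen for illustration; they are not estimates from this study.

```python
# Hypothetical parameter values for illustration only (not the paper's estimates).
FACTOR_OF = {"q1": "F1", "q2": "F2", "q3": "F2", "q4": "F1",
             "q5": "F2", "q6": "F1", "q7": "F1"}
LOADING = {"q1": 0.6, "q2": 0.7, "q3": 0.5, "q4": 0.2,
           "q5": 0.8, "q6": 0.4, "q7": 0.5}
# Assume standardized items, so each error variance is 1 - loading^2.
ERROR_VAR = {q: 1.0 - p * p for q, p in LOADING.items()}
FACTOR_COV = 0.3  # assumed cov(F1, F2); var(F1) = var(F2) = 1


def implied_cov(a: str, b: str) -> float:
    """Covariance of items a and b implied by q_i = p_i * F_k + e_i."""
    # Same factor: factor variance 1; different factors: factor covariance.
    phi = 1.0 if FACTOR_OF[a] == FACTOR_OF[b] else FACTOR_COV
    cov = LOADING[a] * phi * LOADING[b]
    if a == b:
        cov += ERROR_VAR[a]  # residual variance only enters the diagonal
    return cov


items = sorted(FACTOR_OF)
sigma = [[implied_cov(a, b) for b in items] for a in items]

for name, row in zip(items, sigma):
    print(name, " ".join(f"{v:6.3f}" for v in row))
```

Fitting the CFA amounts to choosing the loadings, error variances, and factor covariance so that this implied matrix is as close as possible to the sample covariance matrix.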
The pathdiagram statement allowed us to display the CFA diagram along with the main performance indicators. When using the factor statement, the relationships are instead written per latent factor: each factor $F_c$ is listed together with the $j$ observed variables that load on it, where $j$ is the number of variables that load on a certain latent factor $F_c$. We switched on the residual robust and modification options of the CALIS procedure. Because we summed up the answers for a series of questions, the problem of missing values disappeared, but outliers still existed and lowered the performance indicators of the model. The outliers are printed with the plots = caseresid option of the CALIS procedure. The residual robust option treats outliers without simply erasing or removing them, since removing some outliers may lead to a masking effect. It is an alternative way to handle outliers, which are downweighted simultaneously with the estimation: the observations are iteratively reweighted based on the updated parameter estimates. With the residual robust option, the performance of the model improved from Comparative Fit Index (CFI) = 0.9, Standardized Root Mean Squared Residual (SRMR) = 0.05 and Root Mean Square Error of Approximation (RMSEA) = 0.08 to CFI = 0.91, SRMR = 0.04 and RMSEA = 0.07. The modification option is also important, as it may provide meaningful suggestions to improve the model.
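To make the three performance indicators concrete, the following Python sketch computes them from their common textbook definitions. It is an illustration under stated assumptions, not the CALIS implementation: the function names are hypothetical, the RMSEA uses the N - 1 convention, and the SRMR is taken over the lower triangle of correlation matrices including the diagonal. Software packages may differ in these details, so values need not match SAS output exactly.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2: float, df: int, chi2_base: float, df_base: int) -> float:
    """Comparative fit index relative to the baseline (independence) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0.0 else 1.0 - d_model / d_base


def srmr(sample_corr, implied_corr) -> float:
    """Standardized root mean square residual between two correlation matrices."""
    p = len(sample_corr)
    squared = [(sample_corr[i][j] - implied_corr[i][j]) ** 2
               for i in range(p) for j in range(i + 1)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical inputs, chosen only to exercise the functions.
print(round(rmsea(chi2=100.0, df=13, n=4232), 4))
print(round(cfi(chi2=100.0, df=13, chi2_base=3000.0, df_base=21), 4))
print(round(srmr([[1.0, 0.5], [0.5, 1.0]], [[1.0, 0.3], [0.3, 1.0]]), 4))
```

Higher CFI and lower SRMR and RMSEA indicate a closer fit, which is the direction of the change reported for the residual robust option.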
The methodology of this research study consisted of the following steps (as in Figure 1):
(1) first, an exploratory data analysis (EDA) was performed to visualize the data distribution and identify outliers to further treat them and help improve the model performance;
(2) second, the first-order CFA was initially performed with the modification option on to check whether there was additional room for improvement. The Lagrange Multiplier and Wald tests were included in the modification option. However, these modifications were applied with moderation and verified against the theory;
(3) third, more complex CFA models were created to verify if the performance indicators could be further improved. Moreover, these models (second-order CFA and bi-factor models) allowed us to consider one or more latent factors that were not easy to measure in questionnaires;
(4) fourth, more tests were performed to prove that the models did not capitalize on chance in a data set and that they could provide good results with other data sets as well.
Electronics 2022, 11, x FOR PEER REVIEW 5 of 17

EDA of Input Data
Both data sets are analyzed in this section. A similar EDA was performed for each of them. The pre-trial questionnaire data set consisted of seven derived variables, whereas the post-trial questionnaire data set had nine variables; these were displayed as scatter plots to show their distribution (as in Figures 2 and 3). The pre-trial variables were closer to the normal distribution than the post-trial variables. Thus, the logarithm function could be applied to the post-trial variables.
Outliers generally decrease the performance of CFA. Therefore, the values were grouped into normal values and outliers, as in Figure 4. Probability plots are displayed to assess whether each data set was approximately normally distributed. The data were plotted against a theoretical normal distribution so that the points should form a straight line. Departures from this line indicate departures from normality, which were more evident for the post-trial data set (as in Figures 5 and 6). The distributions of the residuals are provided in Figure 7. The p-p plots provide a higher resolution in the center of the distribution, whereas the q-q plots provide a higher resolution at the tails. We were more concerned about the tails of the distributions, which impacted the model.
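The difference between the two kinds of probability plot can be sketched in a few lines of Python. The snippet below is only an illustration: the function names, the plotting positions (i - 0.5)/n, and the sample scores are assumptions, and the study's own plots were produced in SAS.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal q-q plot."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    m, s = mean(xs), stdev(xs)
    observed = [(x - m) / s for x in xs]  # standardized, so the reference line is y = x
    return list(zip(theoretical, observed))


def pp_points(values):
    """Return (theoretical, empirical) cumulative probability pairs for a p-p plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    return [(fitted.cdf(x), (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


# Hypothetical summed questionnaire scores with one large value in the right tail.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 21, 40]
for theo, obs in qq_points(scores):
    print(f"{theo:7.3f} {obs:7.3f}")
```

In the q-q pairs, the largest score lies well above its theoretical quantile, which is how a heavy right tail shows up; in the p-p pairs both coordinates are bounded by 0 and 1, so the same tail point is compressed toward the corner of the plot.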

Results
The numerous items of the pre-trial questionnaire were aggregated into seven observed variables: q1 for Demographics, q2 for Positive attitude, q3 for Negative attitude, q4 for Measures, q5 for Expectations, q6 for Relation with supplier, and q7 for Appliances. Two latent factors were considered: F1 as the social-economic factor and F2 as the behavioral factor. Variables q1, q4, q6, and q7 from the pre-trial questionnaire loaded on the social-economic factor (F1), while q2, q3, and q5 loaded on the behavioral factor (F2). The SAS CALIS procedure was implemented for the pre-trial questionnaire as in Table 1. The lineqs statement was used to define the variables as linear equations. Each observed variable was influenced by its latent factor and a measurement error.
Nine variables were extracted from the post-trial questionnaire: q1 for the DR program, q2 for Demographics, q3 for Positive attitude, q4 for Negative attitude, q5 for Heating, q6 for Measures, q7 for Positive perception on price-based DR implementation, q8 for Negative perception on price-based DR implementation, and q9 for Perception on incentive-based DR. The same two latent factors were considered: F1 as the social-economic factor and F2 as the behavioral factor. Variables q2, q5, and q6 from the post-trial questionnaire loaded on the social-economic factor (F1), while q1, q3, q4, and q7-q9 loaded on the behavioral factor (F2). Furthermore, the CALIS procedure was implemented for the data set of the post-trial questionnaire as in Table 2, using the lineqs statement to define the nine variables.
The results related to the modelling information and variables are presented in Tables 3 and 4. The mean and standard deviation for the observed variables are presented in Table 5. After 12 iterations, convergence was reached for the first-order CFA model run on the pre-trial data set (as in Table 6), whereas only 6 iterations were required for the post-trial data set (as in Table 7).
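A quick way to check that such a specification is over-identified is to count its degrees of freedom. The sketch below assumes the setup described in the methodology: every observed variable loads on exactly one factor, factor variances are fixed to 1, and the factors covary freely. The function name is hypothetical, and the count would change if further constraints or correlated errors were added.

```python
def first_order_cfa_df(items_per_factor):
    """Degrees of freedom of a first-order CFA with one loading per item,
    factor variances fixed to 1, and freely covarying factors."""
    n_items = sum(items_per_factor)
    n_factors = len(items_per_factor)
    moments = n_items * (n_items + 1) // 2          # unique variances and covariances
    free_params = (
        n_items                                      # factor loadings p_i
        + n_items                                    # variances of the errors e_i
        + n_factors * (n_factors - 1) // 2           # covariances between factors
    )
    return moments - free_params


print("pre-trial :", first_order_cfa_df([4, 3]))   # q1, q4, q6, q7 | q2, q3, q5
print("post-trial:", first_order_cfa_df([3, 6]))   # q2, q5, q6 | q1, q3, q4, q7-q9
```

Under these assumptions, a model with three items on each of the two factors has 8 degrees of freedom.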
The factor loadings for the pre-trial questionnaire and the post-trial questionnaire are presented in Tables 8 and 9. In each case, there was one factor loading that was not statistically significant: p4 for the pre-trial questionnaire, for which the t value was below the 2.58 threshold, and p2 for the post-trial questionnaire, for which the t value was 1.289, which was also less than 2.58.

Table 8. Factor loadings for the pre-trial questionnaire data set.

Table 9. Factor loadings for the post-trial questionnaire data set.

Based on the Wald test (as in Table 10), we removed the two factor loadings that were also indicated in Tables 8 and 9 as not statistically significant. Therefore, the q4 variable was removed from the pre-trial data set and the simulation was repeated. In Figure 8, the two path diagrams for the pre-trial data set are displayed before and after applying the modification. Once q4 no longer loaded on F1 (social-economic factor), CFI improved from 0.94 to 0.97, which indicates a better fit.
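The significance rule used above can be written down directly. In the sketch below, the t value is the estimate divided by its standard error, and 2.58 is the two-sided 1% critical value of the standard normal distribution. For a single parameter, the Wald statistic equals the squared t value. The function names and the estimate and standard error in the second example are hypothetical; only the t value of 1.289 comes from the text.

```python
CRITICAL_T = 2.58  # two-sided 1% critical value of the standard normal distribution


def t_value(estimate: float, std_error: float) -> float:
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_error


def is_significant(t: float, critical: float = CRITICAL_T) -> bool:
    """A loading is treated as significant when |t| exceeds the critical value."""
    return abs(t) > critical


def wald_statistic(t: float) -> float:
    """Single-parameter Wald statistic, i.e., the squared t value."""
    return t * t


# t value reported for p2 in the post-trial model.
print(is_significant(1.289), round(wald_statistic(1.289), 3))

# Hypothetical loading estimate and standard error, for contrast.
t = t_value(estimate=0.45, std_error=0.05)
print(is_significant(t), round(wald_statistic(t), 3))
```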

Standardized Effects in Linear Equations for Post-Trial
Furthermore, from the above path diagrams we may conclude that there was a weak correlation between the two latent factors. A synthesis of the performance indicators is shown in each diagram; it reveals that, apart from CFI, which increased, and the chi-square, which decreased, the other indicators remained unchanged. In Figure 9, the two path diagrams for the post-trial data set are displayed before and after modification. At first sight, we noticed that the chi-square was much larger than in the previous model. However, when q2 was removed from the post-trial data set, the performance indicators did not improve. On the contrary, RMSEA and SRMR increased by 0.01. In addition to the performance indicators that are displayed along with the path diagrams, all indicators are displayed in Table 11 for the two models with and without the modification indicated by the Wald test. Some other modifications were suggested by the Lagrange Multiplier test, but they were not reasonable from a theoretical point of view.
Furthermore, two hierarchical models were created for the two data sets. These complex models allowed us to insert a new latent variable 'Awareness' that could be affected by social-economic and behavioral factors (as in Figures 10 and 11). In the second-order CFA model, the social-economic and behavioral factors were the first-order latent factors, while awareness was the second-order latent factor. The first-order factors acted as an intermediary layer between the second-order factor and the observed variables. In contrast, in the case of the bi-factor model, the observed variables loaded on two factors (one group factor and one general factor). Comparable results were obtained with the second-order CFA for the pre-trial data set, whereas better results were obtained with the bi-factor model for the post-trial data set, improving CFI from 0.91 to 0.97, SRMR from 0.04 to 0.03, and RMSEA from 0.07 to 0.05.
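The structural difference between the two hierarchical models can be sketched in the same notation as the first-order equations. The symbols $A$ (the awareness factor), $g_k$, $d_k$, and $a_i$ are introduced here only for illustration and are not taken from the original model specification. The bi-factor form shown assumes the usual convention that the general and group factors are uncorrelated.

```latex
% Second-order CFA: Awareness (A) acts on the items only through the first-order factors
q_i = p_i F_k + e_i, \qquad F_k = g_k A + d_k

% Bi-factor model: each item loads directly on the general factor A and on one group factor F_k
q_i = a_i A + p_i F_k + e_i
```

In the second-order form, the effect of awareness on an item is the product $p_i g_k$, so it is constrained by the first-order loadings. In the bi-factor form, $a_i$ is estimated freely for every item, which gives the model more flexibility.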
The challenge with CFA models and small data sets is that the model may perform well only for a single data set. However, when the data set is large, it can be divided into subsets, and the CFA models can be run on each of them to check whether the model is capable of generalizing to different data sets or whether it just capitalizes on chance in a data set. Each subset must be large enough to provide sufficient statistical power, which can be tested using the degrees of freedom of the model and the sample size. With 4232 observations and DF = 8, the statistical power was 0.99, while with 1058 observations, the statistical power was 0.83, which was higher than 0.8, indicating the reliable statistical power of the data sample. Thus, the 4232 observations of the pre-trial data set were divided into four data subsets of 1058 observations each, creating the first, second, third and fourth data subsets as in Table 12. We noticed that with smaller data sets, the chi-square was smaller, and the probability was even larger than 0.05, which meant that the null hypothesis that the theoretical model fit the data was not rejected. This demonstrates that the chi-square is sensitive to the data sample size and is not always reliable as an indicator of adequate fit to the data, especially when the data set is large (as is the pre-trial data set).
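The dependence of the chi-square test on sample size can be illustrated with a short Python sketch. It assumes the usual form of the test statistic, (N - 1) multiplied by the minimized discrepancy value, and it holds that discrepancy fixed at a hypothetical value of 0.01 so that only the sample size changes. The closed-form tail probability used here is valid for even degrees of freedom only. The helper names and the random split are illustrative assumptions, not the procedure used to build Table 12.

```python
import math
import random


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total


def model_chi2(f_min: float, n: int) -> float:
    """Test statistic (N - 1) * F_min for a fixed discrepancy value F_min."""
    return (n - 1) * f_min


def split_indices(n: int, parts: int, seed: int = 0):
    """Shuffle row indices and cut them into equally sized subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size = n // parts
    return [idx[k * size:(k + 1) * size] for k in range(parts)]


F_MIN = 0.01  # hypothetical minimized discrepancy, identical for every sample size
DF = 8
for n in (4232, 1058):
    chi2 = model_chi2(F_MIN, n)
    print(n, round(chi2, 2), chi2_sf_even_df(chi2, DF))

subsets = split_indices(4232, 4)
print([len(s) for s in subsets])
```

With the same discrepancy, the full sample of 4232 yields a probability below 0.05, whereas a subset of 1058 yields a probability above 0.05, even though the model misfit per observation is identical.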

Conclusions
Very good results in terms of performance indicators from each category (absolute, parsimony and incremental indexes) were obtained for the pre-trial data set with over 4000 respondents, especially when the q4 variable was removed (CFI = 0.97, SRMR = 0.02, RMSEA = 0.03), whereas the results for the post-trial data set were acceptable without excluding variable q2 (CFI = 0.91, SRMR = 0.04 and RMSEA = 0.07). Thus, the modification indications must be tested, and should be in line with the theory that underlies the model. By considering smaller data subsets, we demonstrated that the chi-square is sensitive to sample size. Furthermore, applying CFA to smaller data sets showed that the model is able to generalize and provide a reliable solution for various data sets.
Implementing more complex hierarchical models brought better or comparable results for both the pre-trial data set (second-order CFA) and the post-trial data set (bi-factor model).
The analyses will be continued as future work with tests of more complex models such as the structural equation model to make further predictions and to understand the relationship among the latent factors.