Nursing Profession Self-Efficacy Scale—Version 2: A Stepwise Validation with Three Cross-Sectional Data Collections

Background: The Nursing Profession Self-Efficacy Scale (NPSES) is one of the most used self-report tools for assessing nursing self-efficacy. Its psychometric structure has been described differently in several national contexts. This study aimed to develop and validate version 2 of the NPSES (NPSES2), a brief version of the original scale that retains the items contributing to stably detecting attributes of care delivery and professionalism as descriptors of salient aspects of the nursing profession. Methods: Three consecutive cross-sectional data collections were employed to reduce the number of items, generate the NPSES2, and validate its emerging dimensionality. The first (June 2019–January 2020) involved 550 nurses and was used to reduce the number of the original scale items by using a Mokken scale analysis (MSA) to ensure that the selected items were consistent with the invariant item ordering properties. The second data collection (September 2020–January 2021) involved 309 nurses and was used to conduct an exploratory factor analysis (EFA). The last data collection (n = 249; June 2021–February 2022) was used to cross-validate, with a confirmatory factor analysis (CFA), the most plausible dimensionality derived from the EFA. Results: The MSA led to the removal of twelve items and retention of seven items (Hs = 0.407, standard error = 0.023), which showed adequate reliability (rho reliability = 0.817). The EFA showed a two-factor solution as the most plausible structure (factor loadings ranged from 0.673 to 0.903; explained variance = 38.2%), which was cross-validated by the CFA that showed adequate fit indices: χ2 (13, N = 249) = 44.521, p < 0.001; CFI = 0.946; TLI = 0.912; RMSEA = 0.069 (90% CI = 0.048–0.084); SRMR = 0.041. The factors were labeled as care delivery (four items) and professionalism (three items).
Conclusions: NPSES2 is recommended to allow researchers and educators to assess nursing self-efficacy and inform interventions and policies.


Introduction
In social cognitive theory, self-efficacy is described as an individual's belief in their ability to succeed in a specific task or accomplish a specific goal [1]. In other words, self-efficacy is the belief in one's capabilities to organize and execute the courses of action required to manage challenging situations. Self-efficacy is a psychological construct closely [...] of colleagues" in the domain of professionalism. Thus far, the psychometric structure of the NPSES has proven weak: several different factor structures have been needed to explain its dimensionality in different cultural contexts. A weak factor structure undermines the possibility of comparing results across studies using the same scale in different languages. For this reason, removing items that are interpreted differently across national contexts could be an adequate strategy to develop an updated version of the NPSES, strengthening its psychometric structure and its capacity to adequately detect nursing self-efficacy. In addition, eliminating ambiguous items yields a short version of the NPSES that might help educators and researchers measure nursing self-efficacy when multiple theoretical constructs must be measured at once (e.g., studies or educational initiatives including several assessments). Therefore, this study aimed to develop and validate version 2 of the NPSES (NPSES2), a brief version of the original scale with selected items that contribute to stably detecting care delivery and professionalism as descriptors of salient aspects of the nursing profession.

Design
This study had an observational design with three cross-sectional data collections and it was approved by the authors of the original NPSES. The first data collection was aimed at reducing the number of items of the NPSES by employing a Mokken scale analysis (MSA) focused on identifying items that do not fit well with the underlying construct being measured [27]. Therefore, no changes in the wording of the items were performed, and the NPSES2 was the result of the items retained by the MSA. When NPSES2 was developed, a second data collection was needed for testing, with exploratory factor analysis (EFA), the underlying domains that may represent the most plausible dimensionality of the scale [28]. Finally, the third data collection was aimed at corroborating the dimensionality previously hypothesized by cross-validating it with confirmatory factor analysis (CFA) [29].

Instrument
The NPSES is a self-report scale that assesses nurses' self-efficacy in their professional practice. The original scale included 19 items that measure care characteristics and professional situations. The respondents rate their level of agreement with each item of the NPSES on a 5-point Likert scale, with scores ranging from 1 (completely disagree) to 5 (completely agree). The final score is standardized to a 0-100 range following the scoring procedure indicated in the original scale [17], where higher scores indicate higher self-efficacy. Cronbach's alpha was generally adequate in several previous studies: 0.830 for the original NPSES scale [17], 0.930 for the Korean version [24], and 0.910 for the Albanian version [22]. Its original dimensionality encompassed two factors labeled as attributes of caring situations (twelve items) and professionalism situations (seven items) [17], while four factors were detected in the Albanian version [22] and five factors in the Korean one [24]. Each translated version of the NPSES was cross-culturally adapted in the validation studies [22][23][24][25]. In this study, no cross-cultural adaptation was required because the study was performed in the same language as the original one (Italian).
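The exact scoring procedure is defined in the original scale [17] and is not reproduced here. As a minimal sketch, assuming a plain min-max rescaling of the summed 1-5 item ratings (an assumption, not necessarily the published procedure), a 0-100 score could be obtained as follows. The function name `standardized_score` is hypothetical.

```python
def standardized_score(ratings, low=1, high=5):
    """Rescale summed Likert ratings to a 0-100 range.

    Assumption: min-max rescaling of the raw sum; the published
    NPSES scoring procedure [17] may differ.
    """
    if not ratings:
        raise ValueError("at least one item rating is required")
    if any(r < low or r > high for r in ratings):
        raise ValueError("ratings must lie between low and high")
    n_items = len(ratings)
    raw_sum = sum(ratings)
    # Lowest possible sum maps to 0, highest possible sum maps to 100.
    return 100 * (raw_sum - n_items * low) / (n_items * (high - low))


# Hypothetical respondent answering "3" to all seven retained items.
midpoint_score = standardized_score([3] * 7)
```

Under this assumption, the same function applies to the 19-item scale and to the seven-item NPSES2, since the rescaling depends only on the number of items answered.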

Samples, Sample Sizes, and Procedures
A previous simulation study on the sample size required for MSA [30] specified six fixed characteristics similar to those applying to the analysis of the NPSES. First, the distribution of the latent variables was set to bivariate standard normal. Second, two latent variables were used in the analysis. Third, the number of answer categories for each item was set at 5 (the NPSES also uses a 5-point Likert scale). Fourth, the lower bound (c) was set to the default value of 3. Fifth, 100 replications were conducted for each design cell. Finally, the location parameters of the J items were equidistantly spaced, with the location parameters of each item defined according to its position in the set, so that different items had different sets of location parameters. According to that simulation [30], between 250 and 600 respondents are needed to ensure per-item accuracy in an MSA.
For this reason, the first data collection (Sample A) involved a hospital in northern Italy (Lombardy) and one in southern Italy (Campania), with roughly 900 eligible nurses. A preliminary screening of the institutional records of the involved hospitals helped researchers identify eligible nurses with characteristics in line with the inclusion/exclusion criteria. Eligible nurses had to fulfill inclusion criteria consistent with previous studies aimed at developing self-report measurements [18,21,31–33]: full-time work contracts and more than six months of experience in the same context, to increase the population homogeneity. The data collection was performed between June 2019 and January 2020 by sending eligible nurses an invitation to enroll in the study. A study information sheet, which was sent to their institutional email, included information regarding the aim, method, and data protection policy. After reading it, nurses willing to be involved had to sign an electronic informed consent form before filling out an online form containing socio-demographic and professional information and the NPSES. The validity of this approach to collecting data is supported by its consistency with previous research and with the analytical method employed in this study [17,18,21,31–33]. The following socio-demographic and professional variables were collected: sex (male, female, other), marital status (married, unmarried), education (equal to or higher than a Bachelor of Nursing Science or equivalent titles, higher than bachelor's degree), clinical area of practice (medical wards, surgical wards, critical area, outpatient services, other), age (years), and work experience (years). The analyses on Sample A to reduce the number of items of the NPSES were performed between February and August 2020.
Given a preliminary exploration of the behavior of the selected items in Sample A, it was possible to use a Monte Carlo simulation to estimate the sample size required for exploratory structural equation modeling, as it allows researchers to take into account the complexity of the data, the number of factors to be extracted, and the desired level of precision and power. The power in the Monte Carlo simulation was defined as the power needed to reject the null hypothesis that a parameter equals 0, estimated as the proportion of replications in which the parameter was statistically significant. Power ≥ 0.80 was desirable. The Monte Carlo simulation was performed in Mplus version 8.1 (Los Angeles, CA: Muthén & Muthén) by employing 1000 replications (seed = 45,335; the residual variances of the factor indicators were 0.36; factor variances were fixed to one; factor correlation set to 0.70). A sample size of 200 was needed for a power of 0.81, or 350 for a power of 0.91, including 15% of missing data under the framework of missing at random (MAR). For this reason, eligible nurses (n = 440) from a hospital in northern Italy (Lombardy), different from the one previously involved, were invited to participate in the second round of data collection (Sample B) with a procedure and inclusion criteria similar to those described for the first data collection. The data collection was performed between September 2020 and January 2021, while data analysis in Sample B lasted until July 2021.
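The population model behind such a simulation can be illustrated outside Mplus. The sketch below is not the authors' Mplus input: it only generates one replication of data from a two-factor model with the stated residual variance (0.36), unit factor variances, and factor correlation (0.70). The loadings of 0.80 and the split into four and three indicators are assumptions made for illustration (0.80 follows from 1 − 0.36 = 0.64 only if the indicators are standardized). Fitting the model in each replication and counting how often a parameter is significant, which is the step that yields the power estimate, is deliberately left out, and Python's random number generator will not reproduce the Mplus draws even with the same seed.

```python
import math
import random


def simulate_two_factor(n, loadings_f1, loadings_f2, factor_corr, resid_var, seed):
    """Generate n rows from a two-factor model with unit factor variances."""
    rng = random.Random(seed)
    resid_sd = math.sqrt(resid_var)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Two standard-normal factors correlated at factor_corr.
        f1 = z1
        f2 = factor_corr * z1 + math.sqrt(1 - factor_corr ** 2) * z2
        row = [lam * f1 + rng.gauss(0, resid_sd) for lam in loadings_f1]
        row += [lam * f2 + rng.gauss(0, resid_sd) for lam in loadings_f2]
        rows.append(row)
    return rows


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# One replication at the smaller sample size mentioned in the text.
# Loadings and the 4 + 3 indicator split are assumed, not reported.
sample = simulate_two_factor(200, [0.8] * 4, [0.8] * 3, 0.70, 0.36, seed=45335)
```

With these assumed values, two indicators of the same factor correlate at about 0.64 in the population and two indicators of different factors at about 0.45, which a large simulated sample should approximate.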
Considering the factor loadings derived from the EFA performed on Sample B, a second Monte Carlo simulation was used to estimate the sample size required for the CFA in Sample C, which was needed to cross-validate the most plausible factor structure that emerged from Sample B. The simulation was performed in Mplus version 8.1 (Los Angeles, CA: Muthén & Muthén) by employing 1000 replications (seed = 45,335; the residual variances of the factor indicators were 0.32; factor variances were fixed to one; factor correlation set to 0.72; the mean factor loading was 0.79 for the first factor and 0.70 for the second factor). A sample size of 140 was needed for a power of 0.81 (without missing data) and 280 for a power of 0.91, including 15% of missing data under the hypothesis of MAR. For this reason, a third hospital in Milan (Lombardy) was involved in the data collection (from June 2021 to February 2022) by employing the same procedure and inclusion criteria previously described (eligible nurses = 385).
Overall, the median time to complete the data collection in the three samples was 8 min.

Ethical Statement
This study was conducted following the principles of the Declaration of Helsinki in performing research involving human subjects. The institutional review board of the Italian Association of Cancer Nurses approved the protocol of the study (n.pr/2/2019), and each center provided approval to be involved. The chief nursing officers of each participating hospital received specific education to be responsible for collecting data properly in their settings. Before data collection, participants were informed about the study's objectives, methods, research design, the confidentiality of the data, and the option to participate.

Statistical Analysis
For each data collection, descriptive statistics were used to summarize the study information. The characteristics of each sample were inferentially compared to ascertain the level of homogeneity related to the socio-demographic and professional variables of the respondents in each sample. The chi-square test (χ2) was used to compare qualitative variables, and one-way analysis of variance (ANOVA) was used to compare quantitative variables.
In Sample A, an MSA was performed using the package "mokken" from the statistical program R (R Foundation for Statistical Computing, Vienna, Austria). MSA is a nonparametric method for evaluating the unidimensionality of a set of items in relation to their latent trait (that is, one underlying theoretical construct; in this case, self-efficacy). One of the main advantages of the MSA is that it allows authors to evaluate each item hierarchically and choose the most pertinent ones to evaluate the single latent trait. After having determined with the automated item selection procedure (AISP) that the items of the NPSES referred to a single scale, each item was evaluated in terms of its ability to discriminate between nurses who scored high and low on the scale as a whole. This evaluation was performed by calculating the coefficient of scalability (H) for each item (Loevinger's coefficient; coefH function), which ranges from 0 to 1, followed by an assessment of the violations of monotonicity (a maximum of 80 violations was considered acceptable). A value of H close to 1 indicates that the item is highly discriminatory and therefore contributes strongly to the unidimensionality of the scale, while H values equal to or greater than 0.30, 0.40, or 0.50 indicate weak, moderate, or strong scales, respectively. The results of the analysis also provided lower and upper bounds for the H values. Loevinger's coefficients (H values) were calculated for the scale level (Hs), individual items (Hi), and pairs of items (Hij). Items with inadequate H values and/or violations of invariant item ordering (IIO) were deleted from the scale during the IIO procedure performed to reduce the number of items. Molenaar-Sijtsma rho reliability was used to assess the internal consistency.
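To make the scalability coefficients concrete, the following sketch computes Loevinger's H from the covariance-based definition: the observed covariance between two items divided by the maximum covariance attainable given the items' marginal distributions (obtained here by sorting both items). It is an illustration only and does not reproduce the output of the mokken package: it provides no standard errors and no monotonicity or IIO checks. The function names and the response data are hypothetical.

```python
from itertools import combinations


def cov(x, y):
    """Population covariance of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def max_cov(x, y):
    """Largest covariance attainable with the observed marginals."""
    return cov(sorted(x), sorted(y))


def scalability(items):
    """Return (Hs, [Hi, ...]) for a list of item score lists.

    Assumes every item shows some variation; a constant item
    would make the denominators zero.
    """
    k = len(items)
    observed = [[0.0] * k for _ in range(k)]
    maximum = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        observed[i][j] = observed[j][i] = cov(items[i], items[j])
        maximum[i][j] = maximum[j][i] = max_cov(items[i], items[j])
    h_items = [sum(observed[i]) / sum(maximum[i]) for i in range(k)]
    h_scale = sum(map(sum, observed)) / sum(map(sum, maximum))
    return h_scale, h_items


def label_scalability(h):
    """Apply the 0.30 / 0.40 / 0.50 thresholds quoted in the text."""
    if h >= 0.50:
        return "strong"
    if h >= 0.40:
        return "moderate"
    if h >= 0.30:
        return "weak"
    return "below the weak threshold"


# Hypothetical 1-5 ratings: three items (rows) answered by six nurses.
responses = [
    [1, 2, 3, 4, 5, 5],
    [2, 1, 3, 3, 4, 5],
    [1, 3, 2, 4, 4, 5],
]
hs, hi = scalability(responses)
```

Applied to the scale-level coefficients reported in this study, `label_scalability` returns "moderate" for both the 19-item value (0.432) and the seven-item value (0.407).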
In Sample B, analyses were performed by using IBM SPSS ® Statistics for Windows version 28 (IBM Corp., Armonk, NY, USA). The Kaiser-Meyer-Olkin test (KMO) and Bartlett's sphericity test were employed to assess whether the collected data were appropriate for EFA. EFA was performed from the covariance matrix of the items, extracting factors by considering a previous study [1], the interpretation of the scree plot, and factors with eigenvalues ≥ 1. Factors were rotated using the Promax rotation to obtain an interpretable solution. Parallel analysis with Monte Carlo simulations was used to confirm the number of factors that most likely represents the dimensionality of the NPSES2. McDonald's ω was used to assess the internal consistency.
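For readers less familiar with McDonald's ω, a minimal sketch of its usual formula for a single factor is given below, assuming standardized loadings and uncorrelated residuals, so that each residual variance equals one minus the squared loading. This is a simplification: the values reported in this study come from the statistical software and an obliquely rotated solution, and the loadings used here are hypothetical, chosen only to fall within the reported range.

```python
def mcdonald_omega(loadings):
    """Omega for one factor: (sum of loadings)^2 over total variance.

    Assumes standardized loadings and uncorrelated residuals.
    """
    common = sum(loadings) ** 2
    unique = sum(1 - lam ** 2 for lam in loadings)
    return common / (common + unique)


# Hypothetical loadings for a four-item factor (not the Table 3 values).
omega_example = mcdonald_omega([0.90, 0.80, 0.75, 0.68])
```

Higher and more homogeneous loadings push ω toward 1, which is why a short factor with strong loadings can still reach adequate internal consistency.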
In Sample C, the CFA was performed in Mplus version 8.1 (Los Angeles, CA: Muthén & Muthén) to cross-validate the dimensionality explored with EFA. The χ2, the ratio between χ2 and degrees of freedom, the comparative fit index (CFI), the Tucker-Lewis index (TLI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR) were used to determine whether the model fitted the sample data. The fit was considered adequate with CFI ≥ 0.90, TLI ≥ 0.90, RMSEA lower than 0.080, and SRMR lower than 0.1.
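The decision rule above can be written down explicitly. The sketch below applies the stated thresholds to the fit indices reported for Sample C in the abstract; the function name `fit_is_adequate` is hypothetical, and the code only checks the thresholds rather than estimating any model.

```python
def fit_is_adequate(cfi, tli, rmsea, srmr):
    """Apply the adequacy thresholds stated in the statistical analysis."""
    return cfi >= 0.90 and tli >= 0.90 and rmsea < 0.080 and srmr < 0.1


# Values reported for the CFA in Sample C.
chi2, df = 44.521, 13
chi2_df_ratio = chi2 / df  # roughly 3.42
adequate = fit_is_adequate(cfi=0.946, tli=0.912, rmsea=0.069, srmr=0.041)
```

All four indices meet their thresholds, which corresponds to the "adequate fit" conclusion reported for the two-factor model.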
The scores of the NPSES2 were standardized to a 0-100 range. All the analyses were performed with two-sided null hypotheses and alpha = 5%. Missing data, which amounted to less than 5% in each of the three data collections, were managed with an available-case analysis approach.

MSA in Sample A
The AISP selection did not show items that had to be removed for inconsistencies with the original scale. The assessment of Loevinger's coefficient of homogeneity of the 19 items showed that Hi values ranged from 0.357 (item 13) to 0.500 (items 3 and 9) (Hs = 0.432, standard error (SE) = 0.020), indicating moderate scalability. The initial assessment of the monotonicity violations showed that all the items reported fewer than 80 violations, ranging from 0 (item 8) to 69 (items 13 and 17). Therefore, no items were removed from the scale at this stage.
The IIO procedure is shown in Table 2. If a scale is invariant, it means that it is stable and the items are measuring the same underlying construct regardless of their order of presentation; therefore, in this stage, items presenting violations of invariant item ordering were removed, and the scale without these items was re-tested for its invariant item ordering in a stepwise process until no violations were detected. Overall, two steps were necessary. In the first step (H^T = 0.199), items were hierarchically ordered based on their mean scores, and items with significant violations of the invariant item ordering were items 1, 5, 6, 7, 9, 10, 11, 12, 13, 15, 18

Item 17
Report any abuse or unethical behavior of colleagues to the appropriate Regulatory Authority

EFA in Sample B
The KMO was equal to 0.807, and Bartlett's test was significant (χ2 (21, N = 309) = 897.57, p < 0.001), indicating the factorability of the matrix of covariances obtained from the responses. The model with a two-factor structure was supported by the analysis of the eigenvalues (also confirmed by a parallel analysis with random data eigenvalues employing a Monte Carlo simulation), the scree plot, and the clarity of the interpretation of the item-factor relationships. Therefore, the two-factor solution was considered the most plausible dimensionality of the NPSES2. Its factor loadings are shown in Table 3 and ranged from 0.673 (item 8) to 0.903 (item 3). The factors were labeled as care delivery (former items 2, 3, 4, and 8; explained variance = 38.2%) and professionalism (former items 14, 16, and 17; explained variance = 29.02%). McDonald's ω values for care delivery and professionalism were, respectively, 0.842 and 0.815.

Discussion
This study showed that the NPSES2 is a brief, valid, and reliable self-report measure for assessing nursing profession self-efficacy. The previous tool was based on 19 items [17] with a dimensionality described to be context-specific in several studies [22][23][24][25]. More precisely, the original dimensionality was based on a two-factor structure, which was not confirmed when the NPSES was used in international contexts [22][23][24][25]. For this reason, the results derived from this study that removed ambiguous items are important for at least five reasons that can be summarized in aspects related to validity, reliability, significance, research reasons, and practicality.
This study improved the validity of the scale compared to the previous version. A valid self-efficacy scale is essential to ensure that it measures what it intends to measure, leading to more accurate conclusions and a better understanding of the construct. The deleted items from the NPSES to develop the NPSES2 were those violating the invariant item ordering. The invariant item ordering is pivotal because it ensures that the factor structure that is obtained is truly representative of the underlying construct being measured rather than being an artifact of the order in which the items were presented [34]. No evidence of invariant item ordering was previously described in relation to the NPSES. This approach in developing the NPSES2 ensured that the factor structure was easily interpretable, meaning that the factors that were extracted in the subsequent EFA were meaningful and represented the underlying constructs being measured. In addition, the MSA and the item selection based on removing those violating the invariant item ordering ensured higher stability than the previous version [35] and more generalizability because evidence of invariant item ordering is associated with high external validity [36].
The employed approach, based on a stepwise method of three separate data collections to refine the scale from the item selection procedure (MSA in Sample A) to the cross validation of the dimensionality (CFA in Sample C), also provides increased reliability of the NPSES2 compared to the previous version. While invariant item ordering refers to the property of a scale that the order of items does not affect the factor structure that is extracted, reliability refers to the consistency of the results of a scale; however, the two properties are associated because the invariant item ordering procedure ensured that the factor structure was consistent, regardless of the order in which the items are presented. This consistency is important for the scale's reliability, as it ensures that the factors extracted are accurate and not affected by the specific order of the items. In addition, the MSA ensured that the factor structure could be easily replicable, meaning that the factors that were extracted could be replicated in different samples or settings, overcoming the issues detected by previous studies [22,24]. This replicability is also important for the scale's reliability because it ensures that other researchers can replicate the factors extracted and that the results may be more generalizable than the ones obtained with the previous version of the scale.
Self-efficacy is a cognitive-motivational construct that has been linked to a wide range of outcomes, such as academic and career success, mental and physical health, and wellbeing [2,3]. In the NPSES2, the invariant item ordering ensured by the MSA provides increased confidence in the conclusions and findings of studies that will use this tool, which means that the significance of measuring self-efficacy by using NPSES2 has been improved compared with the significance that a scale with uncertain dimensionality might have in research settings. In other words, the improved validity and reliability characteristics of the NPSES2 compared to the former version might also contribute to sustaining the significance of assessing nursing profession self-efficacy.
In this study, the levels of nursing profession self-efficacy in the three samples seemed to be limited overall (see Figure 1). However, this study cannot determine which score discriminates a situation of inadequate self-efficacy that could be associated with worsened outcomes (e.g., lower work performance, lower well-being); further research is needed. Future studies should determine cutoff scores of the NPSES2 by using two possible approaches: the Receiver Operating Characteristic (ROC) curve or a criterion-related cutoff identification. The first method involves plotting the true positive rate (sensitivity) against the false positive rate (1-specificity) for different cutoff scores in relation to a dichotomous outcome (e.g., intention to leave the profession) [37]. The area under the ROC curve (AUC) summarizes how well the scale discriminates overall, and the optimal cutoff score is commonly taken as the point on the curve that maximizes a criterion such as Youden's index (sensitivity + specificity − 1). The second possible method involves determining cutoff scores based on their correlation with an external criterion, such as a behavioral assessment. This method is useful when the scale is used to predict a specific outcome, such as a behavior.
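As an illustration of the ROC-based approach, the sketch below scans candidate cutoffs and selects the one with the highest Youden's index (sensitivity + specificity − 1), which is one common criterion among several and is named here as an assumption rather than as the procedure future studies must follow. Respondents are flagged when their score is at or below the cutoff, on the assumption that low self-efficacy is the risk condition. The scores and outcomes are invented for the example and are not study data.

```python
def youden_cutoff(scores, outcome):
    """Return (cutoff, J, sensitivity, specificity) maximizing Youden's J.

    scores: 0-100 scale scores; outcome: 1 = adverse outcome present.
    A respondent is flagged when score <= cutoff.
    Assumes both outcome classes are present in the data.
    """
    positives = sum(outcome)
    negatives = len(outcome) - positives
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, outcome) if s <= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, outcome) if s <= cutoff and y == 0)
        sensitivity = tp / positives
        specificity = 1 - fp / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best


# Invented example: ten respondents, outcome = intention to leave.
example_scores = [20, 35, 40, 45, 55, 60, 70, 80, 85, 90]
example_outcome = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
best_cutoff = youden_cutoff(example_scores, example_outcome)
```

In this invented dataset the selected cutoff is 55, where all four respondents with the outcome are flagged and five of the six without it are not.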

Limitations and Strengths
This study has several limitations that must be acknowledged. First, the cross-sectional nature of the data collection rounds did not allow researchers to take into account the relationship between factor structure and time. Second, cross-sectional studies are susceptible to selection bias, meaning that the sample of participants may not be fully representative of the population being studied. In fact, in this study, we detected small variations between samples in relation to age, sex, and work context, although the characteristics of the respondents are generally similar to the ones reported for the general population of Italian nurses [38]. Third, no information is thus far available in relation to the cutoff scores that may be used to detect inadequate self-efficacy. This aspect limits the current interpretation of the scores derived from the NPSES2. Fourth, the limited number of institutions involved in the study (four hospitals) might limit the generalizability, even if the sample size for each analysis was adequate. In this regard, generalizability might also be undermined by the differences in the characteristics of the respondents in the three samples, even if these differences might reflect "real-life" differences between hospitals. Lastly, multi-group invariance between the three samples was not tested, to avoid performing multiple and different psychometric tests on the same sample, which can undermine the validity of the results.
This study has several strengths and innovations. Using multiple data collection rounds in different samples analyzed with different analytic methods (MSA, EFA, and CFA) increases the rigor of the study and provides a more comprehensive assessment of the psychometric properties of the instrument. This approach allows for a more accurate and reliable evaluation of the reliability and validity of the NPSES2. In addition, the fewer items in the NPSES2, which is a short form of the NPSES, can be useful in various settings, including academic and clinical contexts, and can help reduce respondent burden without sacrificing the measurement quality. Finally, a clearer psychometric structure derived from the several analytical steps can improve its utility and validity by providing a more consistent and coherent assessment of nursing professional self-efficacy. The clearer structure can also improve the interpretability of the scores and help identify areas where nurses may need further development or support.

Study Implications
This study produces a brief, valid, and reliable self-report scale of self-efficacy, allowing researchers and educators to make accurate and reliable conclusions about the construct of self-efficacy in a specific population, which can improve the understanding of the construct and help to inform interventions and policies. In relation to the practicality and research reasons, a brief self-report scale such as the NPSES2 is more practical to use in research and practice because it is less time-consuming for participants to complete and easier for researchers to administer [39]. Therefore, NPSES2 utilization could increase the response rate and reduce participant burden in research initiatives. This aspect is particularly meaningful concerning the need for several measurements when self-efficacy is studied with its antecedents and outcomes. This hypothesis is rooted in the findings of a previously performed meta-analysis showing a strong correlation between response rate and questionnaire length, where response rates were lower for longer scales or questionnaires [39]. For these reasons, we recommend using NPSES2 when it is needed to assess professional self-efficacy among nurses, and cross-national validation studies of NPSES2 are required before using it in contexts different from the one where it was developed.

Conclusions
This study developed and validated the NPSES2, a seven-item self-report scale measuring two aspects of nursing profession self-efficacy: care delivery (four items) and professionalism (three items). Future research should determine which score of NPSES2 can identify risks for nurses or inadequate areas of self-efficacy with a likelihood of being associated with negative professional or health-related outcomes. It is recommended to test the NPSES2 validity and reliability in contexts different from the one where the scale was developed (Italy), and the wording of the items must be culturally adapted for target languages other than Italian.

Data Availability Statement: Raw data are available from the corresponding author upon reasonable request.