A Big Five-Based Multimethod Social and Emotional Skills Assessment: The Mosaic™ by ACT® Social Emotional Learning Assessment

A focus on implementing social and emotional (SE) learning into curricula continues to gain popularity in K-12 educational contexts at the policy and practitioner levels. As it continues to be elevated in educational discourse, it becomes increasingly clear that it is important to have reliable, validated measures of students’ SE skills. Here we argue that framework and design are additional important considerations for the development and selection of SE skill assessments. We report the reliability and validity evidence for The Mosaic™ by ACT® Social Emotional Learning Assessment, an assessment designed to measure SE skills in middle and high school students that makes use of a research-based framework (the Big Five) and a multi-method approach (three item types including Likert, forced choice, and situational judgment tests). Here, we provide the results from data collected from more than 33,000 students who completed the assessment and for whom we have data on various outcome measures. We examined the validity evidence for the individual item types and the aggregate scores based on those three. Our findings support the contribution of multi-method assessment and an aggregate score. We discuss the ways the field can benefit from this or similarly designed assessments and discuss how the assessment results can be used by practitioners to promote programs aimed at stimulating students’ personal growth.


Introduction
In the past 25 years, social and emotional learning (SEL) has grown tremendously and continues to gain popularity. There is growing consensus that a number of factors outside of cognitive ability may be nearly as important as, or just as important as, cognitive ability for educational and workplace success (e.g., Aspen Institute 2018; Burrus et al. 2017; Durlak et al. 2015; Kautz et al. 2014). These factors are often termed social and emotional (SE) skills, which can be defined as "individual capacities that (a) are manifested in consistent patterns of thoughts, feelings, and behaviours, (b) can be developed through formal and informal learning experiences, and (c) influence important socioeconomic outcomes throughout the individual's life" (OECD 2015, p. 34). SE skills predict a variety of important outcomes, which include, but are not limited to, health (Bogg and Roberts 2004), academic performance (Chernyshenko et al. 2018; Mammadov 2021; Poropat 2009), and job performance (Barrick et al. 2001; Zell and Lesick 2021).
As a result of this growing consensus, the implementation of SEL curricula outpaces the development of assessments through which SE skills can be measured (e.g., Denham 2015). Clearly, reliable and valid measures of students' SE skills are key for tracking student growth and for studying intervention efficacy. Beyond the evidence for reliability and validity, we encourage developers and users to consider two additional features of SE skill measures: the framework on which the measure is based and the nature of the items that comprise the measure. Though a vast number of SE skill frameworks exist (see Jones et al. 2016), some frameworks enjoy more empirical backing than others. Likewise, some item types may be more or less advantageous in certain situations. Here, we discuss one particular SE skills assessment, The Mosaic™ by ACT® Social Emotional Learning Assessment (formerly known as ACT® Tessera® and referred to as Mosaic in the remainder of this paper), as an example of an assessment that makes use of an empirically backed framework and a multimethod approach.

The Big Five as an Organizing Framework for SE Skills
The Big Five personality trait model is guided by the lexical hypothesis, which assumes that important individual differences will become encoded into language as single terms (Goldberg 1993). The model was born after researchers turned to an English language dictionary and used factor analytic techniques (e.g., Cattell 1943; Tupes and Christal 1961) to uncover five replicable factors. These factors are commonly labeled Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience, and are often referred to as the Big Five (de Raad and Mlačić 2015). Subsequently, the same five factors have been found in other languages using various methods (e.g., McCrae et al. 2005; Schmitt et al. 2007).
Conceptually, the Big Five can be considered a "Rosetta Stone" for understanding SE skills (Martin et al. 2019). Using the Big Five, we can take constructs expressed as time management in one framework and grit in another, and understand their connectedness as components of Conscientiousness. Soto and colleagues (2022) provide a comprehensive model of SE skills that resembles the Big Five and have constructed a psychometrically sound SE skill inventory based on this model. Additional recent work speaks to the empirical merit of the Big Five-SE skill links. For example, Walton and colleagues (2021) carried out two studies using different methodologies to evaluate this. In one, they asked subject matter experts from the fields of personality psychology and SEL to rate the degree of overlap between the Big Five traits and popular SE skills. As anticipated, the traits and skills were deemed to overlap to a significant degree. In another study, they factor analyzed items from a Big Five measure along with items from a CASEL (the most prominent SEL framework in the US) competency-based measure, and they determined that the best fitting model was one with five factors that aligned with the Big Five traits.
The Big Five is also an ideal candidate for SE skills assessments because of its consistency with SE skills definitions. Recall that the OECD's (2015) definition states that SE skills influence important outcomes. There is a vast body of research linking the Big Five with many critical outcomes. For example, a meta-analysis by Poropat (2009) found that, during the primary school years, cognitive ability's impact on academic performance exceeds that of any Big Five trait, but by secondary education, Conscientiousness is nearly as important for academic performance as cognitive ability. Another recent meta-analysis also found that Conscientiousness was a strong predictor of academic performance across all levels of schooling and predicted performance incrementally above and beyond cognitive ability (Mammadov 2021).
As another example of why the five-factor model is a good choice of framework for SE skills, the OECD's (2015) definition states that such skills can be developed. In their meta-analysis of mean-level personality change, Roberts et al. (2006) found that individuals exhibit development throughout their lifespan. Furthermore, meta-analyses demonstrate that interventions can alter personality traits significantly even when the interventions last for short periods of time (Hudson et al. 2019; Roberts et al. 2017). Similarly, meta-analyses speak to the effectiveness of SEL interventions. Students who participate in SEL programs see greater gains in SE skills and academic performance than students who do not participate, and both short- and long-term benefits have since been documented (e.g., Corcoran et al. 2018; Mahoney et al. 2018).
Due to its utility in conceptualizing SE skills and growing empirical support in the field, the Big Five has been adopted as an organizing framework for SE skills in prior literature (Abrahams et al. 2019; Casillas et al. 2015; Soto et al. 2022) and other work. Notably, the OECD's Study on Social and Emotional Skills, which is a large-scale international study, uses the framework (Chernyshenko et al. 2018; Kankaraš and Suarez-Alvarez 2019).

Methodological Approaches to the Assessment of Social and Emotional Skills
Several item types have been used to assess SE skills. For example, Likert items have been used in SEL research for decades and are known for their efficiency, reliability, and validity. Individuals indicate their agreement with several statements (e.g., "I work hard"). This item type is preferred when stakes are low and faking is not expected (Lipnevich et al. 2013). However, respondents may have various motives to fake, such as appearing more attractive to a school admissions officer (Zickar et al. 2004). Furthermore, Likert items might be particularly susceptible to reference effects. That is, people often answer such items by asking the question, "compared with whom?" For example, it could be the case that students from very high achieving schools might rate themselves lower on their SE skills than students from low achieving schools simply because they are using a different reference group, not because they are truly lower on these skills (e.g., Marsh and Hau 2003).
Forced choice (FC) items have been used to assess individuals' tendencies and capabilities as well. With FC items, statements are grouped in blocks and respondents select, within each block, the statements that describe them best. There are several variations of the FC methodology (Hontangas et al. 2015). For example, respondents may read multidimensional triads and choose, of the three items (e.g., "I enjoy leading class discussions", "I work hard in school to achieve my goals", and "I like working in a team"), which is most like them and which is least like them. Reference bias should be minimized with FC because respondents conduct an internal (self vs. self) rather than an external (self vs. other) comparison on these items. Furthermore, there are compelling studies (Christiansen et al. 2005; Jackson et al. 2000; Walton et al. 2021) and meta-analytic data (Cao and Drasgow 2019) suggesting that FC items cannot be faked as easily as Likert items. For example, Christiansen et al. (2005) studied responses across two conditions, one in which participants were instructed to answer honestly and one in which participants were instructed to answer as if they were applying for a job and trying to make a positive impression. The effect size (i.e., standardized mean difference in the responses across the honest and positive impression conditions) was just .24 for the FC items, but the effect reached .96 for the single stimulus items.
A third item type used in this domain is the situational judgment test (SJT). In this approach, students imagine themselves in a scenario and then read various behavioral responses to that scenario. They then either indicate how likely they would be to respond in that manner or select how a person should respond in that scenario. SJTs have several advantages. First, SJTs may be developed to reflect more complex judgment processes than what is possible with Likert scales. Second, SJTs have the advantage of face validity (Lievens et al. 2008). The situations presented to students have the look and feel of those that would be encountered in real life. Third, there is evidence suggesting that SJTs are less prone to faking than Likert items (Hooper et al. 2006), and some have argued for using them in high-stakes settings for this reason (Whetzel and McDaniel 2016). Fourth, SJTs should be less susceptible to reference effects than Likert items because behavioral responses to scenarios are not comparative in nature. They have been used for decades for employee selection, and now even the Association of American Medical Colleges (2022) has developed an SJT measure to assess medical students' knowledge about effective behaviors.
Given that all item types have different strengths and weaknesses, an assessment employing multiple methods (i.e., one that conforms to a multi-trait multi-method [MTMM] design) ought to minimize the effects of any biases or shortcomings. According to Kenny and Kashy (1992), "The underlying view of measurement in the MTMM analysis is that to measure a theoretical construct, different measures, each with its own bias, are selected. Bias that is due to method effects is reduced through a triangulation process" (p. 170). To our knowledge, there is no research specifically examining the validity evidence for an MTMM SE skills assessment. It is worth examining whether the inclusion of varied item types contributes to the prediction of important outcomes.

Current Study
In this study, we briefly describe Mosaic, a multi-method SE skills assessment based on the Big Five framework. In addition to determining the extent to which the Big Five can account for the variance in important school outcomes, we sought to examine whether Mosaic's MTMM approach leads to stronger validity evidence than what would be available from the individual item types. We anticipated our findings could help inform framework and methodological considerations for future SE skills assessments.
The students were enrolled in schools that purchased the assessment as part of their SEL programming; that is, the students completed the assessment as part of their regular school activities and were not solicited for participation in the research. Students gave their assent before completing the assessment by checking "yes" to an item worded "By clicking YES, I assent to participate in this activity". The schools were spread across the U.S. and were a mixture of public, private, and charter schools. Schools allotted one class period for completion and most students were able to complete the assessment during that time but were able to take longer than one class period if necessary.

Mosaic™ by ACT® Social Emotional Learning Assessment
Constructs. The SE skills assessed by Mosaic (ACT 2021) can be aligned with the Big Five constructs on a one-to-one basis (Table 1). This alignment was conducted rationally by subject matter experts comparing the Big Five factors' definitions and Mosaic item content and skill definitions. The Mosaic skill labels are different from the Big Five trait labels in accordance with educator preference. For example, focus group feedback indicated that the term Social Connection is preferred over Extraversion. The different labels notwithstanding, the construct breadth is similar to the Big Five. For example, the Social Connection items measure more than that single facet of Extraversion; they also tap into assertiveness, leadership, persuasiveness, etc.
Item Types. Mosaic employs the MTMM approach discussed above with Likert, FC, and SJT items. An example Likert item measuring Getting Along with Others is: I offer to help those who need assistance. Students indicate their level of agreement with this statement on a six-point scale from Strongly Disagree to Strongly Agree. There are 40 Likert items with eight per construct. The descriptive statistics and reliability estimates, all of which exceed .70 for MS and HS forms, can be found in Table 2. Below is an example of a multidimensional FC triad. Students are instructed to select the statement that is most like them and the one that is least like them. In the example below, the statements measure, respectively, Keeping an Open Mind, Sustaining Effort, and Getting Along with Others.
• I perform well on assignments that require me to use my imagination to find the answers.
• If I tell my teacher I will do something, I do it.
• I ignore classmates who are being left out of class discussions.
There are $\binom{5}{3} = 10$ FC triads. Each possible triad occurred once and only once, so there are a total of 30 items with six per construct. Cronbach's alpha values (see Table 2) were lower than what is typically found with Likert items, which is to be expected given their ipsative nature (Meade 2004). Mosaic includes two SJTs per construct yielding 10 total items. Again, Cronbach's alpha values (see Table 2) were lower than what is typically found with Likert items, given that they are often multidimensional (Whetzel and McDaniel 2009). Students read each scenario then are presented with five behavioral responses, all of which measure a single construct, and they are asked to indicate how likely they are to enact each response on a five-point scale from Very Unlikely to Very Likely. An example of a partial SJT measuring Sustaining Effort reads: After studying very hard for a math test, the test results are disappointing, and you have yet to do as well as expected. While you are currently proficient, you would like to move up to the next level.
Two of the behavioral responses read:
• Look over the test to see what questions you got wrong and work on those.
• Decide there's no point to studying so hard if you don't get the results you want.
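The triad design described above can be checked with a short enumeration. The following is a minimal Python sketch for illustration only, not the procedure used to construct Mosaic; the variable names are our own. It lists every combination of three of the five skills and counts how often each skill appears.

```python
from collections import Counter
from itertools import combinations

SKILLS = [
    "Sustaining Effort",
    "Getting Along with Others",
    "Maintaining Composure",
    "Keeping an Open Mind",
    "Social Connection",
]

# Every unordered set of three distinct skills defines one triad.
triads = list(combinations(SKILLS, 3))

# Each triad contributes one statement per skill it contains.
statements_per_skill = Counter(skill for triad in triads for skill in triad)

print(len(triads))                           # 10 triads
print(sum(len(triad) for triad in triads))   # 30 statements in total
print(set(statements_per_skill.values()))    # {6}: six statements per skill
```

Because each skill is paired with every pair drawn from the remaining four skills, it appears in six triads, which matches the six FC items per construct.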
Scoring. Negatively worded items across the three item types were reverse scored. The Likert and SJT items were scored in the typical manner by averaging across all responses to yield a total score per construct where higher scores are aligned with higher skill levels. The FC items were scored such that the statement that is selected as 'most like me' was scored with a 3, the statement not selected was scored with a 2, and the statement selected as 'least like me' was scored with a 1, then the values were averaged across the responses to yield a total score per construct where higher scores were aligned with higher skill levels.
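The scoring rules above might be sketched as follows. This is a hedged illustration rather than the operational scoring code: the function names are hypothetical, and the reversal rule (low + high minus the response) is the conventional one, which we assume here because the text does not spell it out.

```python
from statistics import mean

LIKERT_POINTS = 6  # six-point agreement scale described in the text
FC_SCORES = {"most": 3, "unselected": 2, "least": 1}


def reverse(score, low, high):
    """Reverse-score a response on a low..high scale (assumed convention)."""
    return low + high - score


def score_likert(responses, negatively_worded):
    """Average 1-6 Likert responses after reversing negatively worded items."""
    adjusted = [
        reverse(r, 1, LIKERT_POINTS) if neg else r
        for r, neg in zip(responses, negatively_worded)
    ]
    return mean(adjusted)


def score_fc(choices, negatively_worded):
    """Average FC statement scores for one skill.

    `choices` holds "most", "unselected", or "least" for each statement.
    """
    adjusted = []
    for choice, neg in zip(choices, negatively_worded):
        value = FC_SCORES[choice]
        adjusted.append(reverse(value, 1, 3) if neg else value)
    return mean(adjusted)
```

Under these assumptions, choosing a negatively worded statement as "least like me" earns the same 3 points as choosing a positively worded statement as "most like me". SJT responses would be handled like the Likert items, with a five-point scale in place of the six-point one.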
To create aggregate scores across the three item types, the Likert, FC, and SJT scores were standardized then averaged. It is important to note that, for practical purposes, deriving a single score per skill is strongly preferred. When reporting SE skill scores to students, parents, and educators, a single score is more parsimonious than three scores per skill. Thus, it is not our goal to examine the incremental validity of the item types; that is, below we compare the validity evidence for the single item types and aggregate scores, rather than building a hierarchical model and examining the incremental validity.
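As an illustration of the standardize-then-average step, consider the minimal sketch below. The data are hypothetical, and the use of the sample standard deviation within the analyzed sample is an assumption; the text does not state which reference distribution was used for standardization.

```python
from statistics import mean, stdev


def standardize(scores):
    """Convert raw scores to z-scores using the sample mean and SD (assumed)."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def aggregate(likert, fc, sjt):
    """Average the three standardized scores for each student."""
    z_columns = [standardize(column) for column in (likert, fc, sjt)]
    return [mean(student) for student in zip(*z_columns)]


# Hypothetical raw scores for four students on a single skill.
likert = [4.5, 3.0, 5.5, 2.0]
fc = [2.5, 2.0, 2.8, 1.7]
sjt = [3.8, 3.1, 4.4, 2.9]
composite = aggregate(likert, fc, sjt)
```

Standardizing first places the three item types on a common metric, so the 1-3 FC scale does not carry less weight than the 1-6 Likert scale simply because of its narrower range.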

Test-Criterion Data
Below we describe each of the outcome variables that we obtained from the subsamples of the 33,512 students completing either Mosaic MS or HS.
School Climate. To measure school climate, we used the Relationships with School Personnel and School Safety Climate (hereafter referred to as Relationships and Safety, respectively) scales from ACT Engage®. The scales are supported by strong evidence of reliability and validity (ACT 2016) and have high internal consistency estimates. There were 12 Relationships climate items (Cronbach's alpha MS = .86 and HS = .88) and 11 Safety climate items (Cronbach's alpha MS = .84 and HS = .85), all of which are Likert items rated on a six-point agreement scale. These items were identical across the two forms. All Relationships items make mention of how a student perceives his or her relationships with adults at school (e.g., There are adults at my school who care about me). The Safety items primarily focus on students' perceptions of their physical safety in school (e.g., I feel safe at school). The climate scores used in the current study reflect each student's individual perceptions. Prior research has shown that SE skills and favorable perceptions of school climate are positively related to one another (Allen et al. 2019; Osher and Berg 2017).
Academic Performance. GPA was self-reported on a 12-point scale (e.g., A+, 97-100%; A, 93-96%; etc.). Meta-analytic data (Mammadov 2021; Poropat 2009) indicate that the Sustaining Effort skill is expected to have the strongest relationship with GPA.
Attendance. Four school districts provided additional MS student data, including absences and disciplinary infractions, and 12 districts reported these data for HS students. We examined the association between Mosaic scores and absenteeism. We used the sum of unexcused and excused absences as a continuous variable and, when appropriate, also split the sample into three groups representing students with chronic absenteeism (defined as 18 or more missed days), habitual absenteeism (defined as at least 10 missed days but fewer than 18), or acceptable absenteeism (fewer than 10 missed days) and examined group mean differences. Most states describe chronic absenteeism as missing 10 percent or more days within a year (Attendance Works n.d.), which equates to 18 or more days, while some states consider missing 10 or more days within a school year as being habitually truant (Colorado Department of Education n.d.). Based on the findings reported by Lounsbury et al. (2004), we expected Keeping an Open Mind to have the strongest association with absenteeism and that they would be negatively related. However, the opposite effects have been found in college students (Chamorro-Premuzic and Furnham 2003; Credé et al. 2010), so this is somewhat of an exploratory analysis.
Discipline. Finally, we considered the number of disciplinary infractions as recorded in school records. For some schools, we had a count of the actual number of disciplinary infractions, but in some cases, schools only provided a binary response (i.e., no infractions vs. at least one). For some analyses, we dichotomized the continuous responses that some schools had provided and combined them with the binary responses. Research on children and adolescents (Tackett 2006; Tackett et al. 2013) led us to anticipate negative associations between disciplinary infractions and Sustaining Effort, Getting Along with Others, and Maintaining Composure. We anticipated positive, yet smaller, associations between the disciplinary infractions and Keeping an Open Mind and Social Connection.
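The three absenteeism groups defined above can be expressed as a simple classification rule. The sketch below is illustrative; the function name is hypothetical, and the 180-day school year used to reproduce the 18-day threshold is an assumption that is consistent with, but not stated in, the text.

```python
def absence_group(days_missed):
    """Classify total (excused + unexcused) absences using the study's cutoffs."""
    if days_missed >= 18:
        return "chronic"
    if days_missed >= 10:
        return "habitual"
    return "acceptable"


# Assumption: a 180-day school year, so 10 percent corresponds to 18 days.
SCHOOL_YEAR_DAYS = 180
chronic_cutoff = SCHOOL_YEAR_DAYS * 10 // 100

print(chronic_cutoff)                        # 18
print(absence_group(17), absence_group(18))  # habitual chronic
```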
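The dichotomization step described above might look like the following minimal sketch. The records are hypothetical and the function names are our own; the point is only that counts are collapsed to the same 0/1 coding that the binary-only schools supplied before the two sources are pooled.

```python
def to_binary(infraction_count):
    """Collapse a count of infractions to 0 (none) or 1 (at least one)."""
    return 1 if infraction_count > 0 else 0


def combine_records(count_records, binary_records):
    """Pool schools reporting counts with schools reporting only a 0/1 flag."""
    return [to_binary(count) for count in count_records] + list(binary_records)


# Hypothetical records: one school reports counts, another only a flag.
counts_school = [0, 3, 1, 0]
binary_school = [1, 0, 0]
combined = combine_records(counts_school, binary_school)
print(combined)  # [0, 1, 1, 0, 1, 0, 0]
```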

Convergent and Discriminant Validity
The complete MTMM correlation matrix can be found in Table 2. There is evidence for convergent validity with the Likert scales generally correlating most highly with their respective SJT and FC scales and the SJT scales correlating most highly with their respective FC scales. For example, in MS, the correlation between the Likert Maintaining Composure scores and the SJT Maintaining Composure scores (r = .54) was higher than between the Likert Maintaining Composure scores and any of the other four SJT scores (r = .28-.47). However, evidence for discriminant validity was not as strong, as there were several instances of off-trait correlations being higher than expected. For example, in MS, the correlation between the Likert Sustaining Effort scores and the SJT Maintaining Composure scores reached .59. There were fewer instances of this occurring in HS than in MS.
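The comparison logic behind these statements can be made explicit with a small check. This is a simplified sketch of one common MTMM heuristic, not the full analysis; the function name is hypothetical, and only the correlations quoted in the text are used.

```python
def same_trait_is_highest(same_trait_r, other_trait_rs):
    """True when the same-trait correlation exceeds every off-trait correlation."""
    return same_trait_r > max(other_trait_rs)


# MS, Likert Maintaining Composure vs. the five SJT scales: the same-trait
# correlation is .54, and the four off-trait values range from .28 to .47.
# Only the endpoints of that range are reported, so only they are used here.
print(same_trait_is_highest(0.54, [0.28, 0.47]))  # True

# MS, SJT Maintaining Composure vs. Likert scales: the off-trait correlation
# with Likert Sustaining Effort (.59) exceeds the same-trait value (.54).
print(same_trait_is_highest(0.54, [0.59]))  # False
```

The first comparison illustrates the convergent pattern, while the second shows the kind of off-trait correlation that weakens the discriminant evidence.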

Test-Criterion Validity
For the MS and HS forms, we examined the test-criterion validity evidence by correlating the SE skills with the outcome measures. We also regressed the outcomes on the SE skill scales. We did this for each item type individually and compared the findings to those for the aggregate skill scores.

School Climate
As expected, all correlations between SE skills and school climate were positive (Table 3). MS associations were slightly stronger than those observed in HS, and SE skills had slightly higher correlations with Relationships than with Safety. For both MS and HS forms and both climate scales, either the Likert or the aggregate scores had the strongest correlations. The Likert scales (MS Relationships R² = .38; MS Safety R² = .24; HS Relationships R² = .28; HS Safety R² = .16) accounted for slightly more variance than the aggregate scores (MS Relationships R² = .36; MS Safety R² = .22; HS Relationships R² = .25; HS Safety R² = .14).
Note. a n = 24,400. b n = 9,112. c n = 22,783. d n = 8,882. e n = 294. f n = 890. g n = 350. h n = 721. * p < .05.

Academic Performance
For both forms, Sustaining Effort had the strongest relationship with GPA (Table 3). However, we should note that all skills were related to academic performance, a finding not typically reported in prior research (Mammadov 2021; Poropat 2009). We suspect this is because many of our items are contextualized to a school setting, which may lead to higher correlations than what is found with "generic" Big Five measures. Across skills, the aggregate scores generally had the strongest correlations with GPA. In MS and HS, the Likert (MS R² = .17; HS R² = .23) and aggregate scores (MS R² = .18; HS R² = .23) were on par with one another in terms of variance in GPA accounted for.

Attendance
We examined correlations between the number of absences and the five SE skills (Table 3). In MS, the correlations were all near zero and there was no clear pattern in terms of which item type exhibited the strongest association. In HS, the correlations were generally small and negative, and the SJTs or aggregate scores had the strongest associations. The individual item types and the aggregates did not account for much variance in absenteeism (i.e., no more than 4%). To evaluate whether the effects varied across item types and aggregate scores, we split the HS sample into three groups: students with chronic, habitual, or acceptable absentee records. All effect sizes were small, though the SJT and aggregate effect sizes were slightly higher overall (Table 4).
Note. a n = 636. b n = 168. c n = 86. F df = 2, 887. * p < .05.
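The F test with df = 2, 887 reported in the note above is consistent with a one-way analysis of variance across the three absenteeism groups. The following minimal sketch shows how such an F statistic and its degrees of freedom are obtained; it is a generic textbook computation under that assumption, not the authors' analysis script, and the function name is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of per-group score lists."""
    scores = [x for group in groups for x in group]
    grand_mean = mean(scores)
    group_means = [mean(group) for group in groups]

    # Between-groups sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2
        for group, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: squared deviations from each group's mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Degrees of freedom implied by the three group sizes reported in the note.
group_sizes = [636, 168, 86]
df_between = len(group_sizes) - 1
df_within = sum(group_sizes) - len(group_sizes)
print(df_between, df_within)  # 2 887
```

The three group sizes sum to 890, the HS attendance subsample, and the resulting degrees of freedom match those reported.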

Discipline
We examined the correlations between the number of reported disciplinary infractions and the five SE skills (Table 3). In MS, the correlations were in the expected direction. In HS, the Social Connection correlation was higher than expected, relative to those for Sustaining Effort, Getting Along with Others, and Maintaining Composure. The SJT and aggregate associations were generally the strongest. In MS, the SJTs accounted for 12% of the variance in disciplinary infractions, while the aggregate scores accounted for 9%. In HS, however, the individual item types and the aggregate scores were fairly even in terms of variance accounted for, with values ranging from 4% to 6%.
As previously stated, some schools only provided a binary response (i.e., no infractions vs. at least one). We dichotomized the continuous responses that some schools had provided, combined them with the binary responses, and carried out independent samples t-tests to compare students with and without infractions and to determine whether results differed across item type. Results are reported in Table 5. Any differences in the direction of the effect across the binary and continuous analyses are due to non-identical samples used across the analyses. In MS, the SJT and aggregate scores had the greatest effect sizes. The results were more varied in HS, but generally the SJTs had the greatest effect sizes.
Note. a n = 335. b n = 169. c n = 982. d n = 508. * p < .05.
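The text does not name the effect size statistic that accompanies the t-tests. A common choice for two independent groups is Cohen's d with a pooled standard deviation, and the sketch below assumes that choice purely for illustration; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample SD (assumed metric)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical skill scores for students without and with an infraction.
no_infraction = [2, 4, 6]
infraction = [1, 3, 5]
print(cohens_d(no_infraction, infraction))  # 0.5
```

A positive value indicates a higher mean skill score in the first group, so the sign depends on the order in which the groups are passed.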

Discussion
The MS and HS Mosaic forms were developed using the Big Five as their assessment framework and a multi-method approach to measure SE skills. We primarily focused on the validity evidence for three item types (Likert, FC, and SJT) and an aggregate score based on these three. Overall, the associations between the five SE skills, regardless of item type, and the outcome measures were as expected. For example, all skills were positively associated with school climate dimensions. We found support for the use of an aggregate score and determined that, depending on the particular outcome of interest, certain item types may provide stronger test-criterion validity evidence.

Methodological Considerations
The Likert item scores and aggregate scores were the best predictors of perceptions of school climate and GPA. The SJT-based and aggregate scores showed the strongest effects in terms of attendance and discipline, although effects overall were smaller than what we observed for climate and GPA. The FC items failed to prevail as the strongest predictor for any of the outcomes we examined, though they were significant predictors of all outcomes except for attendance in MS. Note that unlike HS, there was no evidence of relationships between SE skills and absenteeism in MS, which likely reflects a greater autonomy among HS students relative to MS students.
It is not surprising that Likert items accounted for more variance in perceptions of school climate. The climate items were also Likert items and therefore the shared method variance is likely at play here. We can conjecture as to why SJTs had the strongest link with school attendance and discipline. As discussed above, SJTs have the "look and feel" of situations encountered in real life. Of the five outcomes we assessed, attendance and discipline are the most behavioral. That is, you attend school or you do not, and you get in trouble or you do not. These types of behaviors are written into SJTs. For example, one Mosaic SJT provides the possible behavior response of "Angrily confront your coach . . . " This is an example of a behavior that would likely lead to some disciplinary action. GPA is not as explicitly behavioral. There are multiple processes involved that lead to an academic measure and it's a more distal outcome. That is, GPA stems from a combination of behavior, cognitive ability, course taking, and teacher judgment. As such, it's not as pure as a behavioral measure. This may explain why the Likert items did a better job explaining the variation in student academic performance. As mentioned above, although significant predictors of outcomes, the FC items were not "the best" in terms of predicting the outcomes. This may be due to the low reliability of the scores, which, again, is typical for ipsative FC scores (Meade 2004). Despite this artefact of FC scores, FC items have other advantages, such as making it difficult for respondents to engage in impression management (Cao and Drasgow 2019), which might be particularly attractive in high-stakes settings.
In addition to the possibility of reducing the impression management with the FC items and the strong face validity of SJTs, there are other benefits to using a multi-method approach. As discussed above, reference effects and other response tendencies, such as acquiescent responding, might be minimized by using alternatives to Likert items. Another factor that has gone unmentioned until this point is the increased level of engagement respondents may experience when encountering more than one item type in an assessment. All of these factors, along with the evidence for improved validity from the current study, reinforce the notion that a multi-method approach to SE skills assessments has merit.
Many other assessments rely solely on Likert items to obtain scores (e.g., Davidson et al. 2018; LeBuffe et al. 2018). We believe the review and findings presented here provide compelling evidence supporting the inclusion of multiple item types and aggregate scores. This approach is novel, but it is slowly being adopted by others in the field. For example, the OECD's Study on Social and Emotional Skills also uses an MTMM approach to generate SE skill scores. In contrast to our approach, which involves multiple student-report item types, the OECD study combined Likert scales from parents, teachers, and students.

Framework Considerations
Mosaic differs from many SE skills assessments in terms of its organizing framework. This assessment makes use of the Big Five framework, which is advantageous because it is evidence-based, has cross-cultural generalizability, and consistently replicates across age groups. In addition, it can be used as a framework through which to organize the various terms used to describe SE skills (Walton et al. 2021). One issue plaguing the field of SEL is the abundance of frameworks currently used to organize skills, with a recent study identifying 136 frameworks. Moving toward a single assessment framework can help to unify the field and integrate research findings across research groups. A unified framework should be comprehensive yet parsimonious, consider developmental implications, and be evidence based and data driven, rather than derived from theory or expert consensus alone. The Big Five stands out as the organizing framework that fits each of these recommendations.
Despite support for using the Big Five framework to organize SE skills, it is imperative to recognize that personality traits and SE skills are not synonymous with one another. Once again, consider the OECD's (2015) definition; it refers to SE skills as "individual capacities." Capacities and tendencies are distinct, and herein lies the key differentiator between SE skills and personality traits. Soto and colleagues (e.g., 2021) are careful to make this point. While they use the Big Five framework for their assessment of social, emotional, and behavioral skills, the BESSI, they reframe the item stems to ask respondents to report how well they are able to do certain things rather than how likely or how often they do those things (e.g., How well can you perform in a leadership role? vs. How likely are you to take on a leadership role?; Soto et al. 2022).

Applications for Educational Contexts
Research suggests that SE skills predict an array of desirable outcomes including success in school. Obtaining measures of these skills is key to fostering school-wide SEL initiatives. Practitioners can use data from SE skill assessments as part of a formative assessment system. Assessments can be considered formative if the feedback from the assessment is used to guide practices or make changes that increase student learning (Black and Wiliam 2009). An assessment itself is not necessarily formative on its own, but the process of a formative assessment entails multiple learning components working together to improve student learning. Mosaic was designed as part of a larger assessment system, with the purpose of using assessment results to drive intervention and aim instruction at areas of need. Assessments can help schools drive universal interventions and targeted SEL initiatives. The Mosaic system provides individual student reports, following Hattie and Timperley's (2007) recommendations for delivering formative feedback to students; reports inform students of their current levels on each skill, provide examples of the target skill level, and include recommendations on how students can improve. Mosaic also provides aggregate school-level reports so educators and administrators can be informed of broad areas of need in their schools and districts.

Limitations
This study was limited in a few ways. First, the students were not equally distributed across grades, with an oversampling of 7th grade students in MS and 9th grade students in HS. This was an artefact of how schools opted to use Mosaic (i.e., to which students they administered Mosaic) and was not in our control. In addition, school-reported outcomes were limited. We relied on self-reported GPA, and the subsamples with school-reported discipline and attendance data were relatively small. Therefore, generalizations are limited. Additional data could be collected, particularly school-reported outcomes, to provide further validity evidence for the assessments. Moreover, while we focused primarily on the test-criterion validity evidence, additional validity evidence (e.g., construct validity) is also important. Assessment validation should be considered an ongoing process, with additional sources of evidence needed to strengthen the validity argument.
Another limitation was that we did not have access to participants' SES or race/ethnicity status for most students. The exclusion of race/ethnicity data was not intentional, but rather a limitation induced from the assessment's administration procedure during the 2018-2019 school year. During this administration window, schools were asked to provide students' race/ethnicity information during the registration process. Reporting student race/ethnicity was optional, not required. We found that most schools chose not to provide students' race/ethnicity information. Following that school year, we changed the assessment process to have students self-report their race/ethnicity during the test administration rather than relying on schools to do so during the registration process. This change in administration has resulted in more complete student race/ethnicity information beginning in the 2019-2020 school year.
However, we did have race/ethnicity data for 2% (n = 503) of the MS students and 18% (n = 1649) of the HS students, and we compared white students with underrepresented minority students. The groups scored significantly differently from one another on just one SE skill, Keeping an Open Mind, with the underrepresented students scoring higher in both MS and HS. We also note that race/ethnicity information was available during earlier phases of the assessment development, and the parameter estimates for scoring were extracted from a diverse and representative sample. Furthermore, we continue to gather assessment data complete with student-reported race/ethnicity information. With more recent assessment data (N = 3720), we have demonstrated that race/ethnicity subgroup differences are consistently small in magnitude across the three item types, suggesting fairness for all test-takers (Walton and Burrus 2022). DIF analyses across subgroups are also a component of our ongoing test validation process as our body of available assessment data continues to grow.
A final area for future work pertains to the discriminant validity issue raised above; that is, particularly in MS, there were several instances of weak discriminant validity. This could be due to a number of reasons. First, as discussed previously, SJT measures are often multidimensional. In line with this, the instances of weak discriminant validity typically involved correlations with SJT scores. Second, we observed this more among the MS sample than the HS sample, and this is consistent with prior work showing that self-reports become more coherent within a skill domain and better differentiated across domains from late childhood to early adulthood (Soto et al. 2008). Soto and colleagues noted that the domains related to Getting Along with Others and Sustaining Effort showed large gains in differentiation with age but small gains in coherence, and they conjectured that this differentiation could be the result of complex notions of right and wrong or good and bad. One possible implication of this pertains to impression management; that is, as children age, they should be more adept with impression management and able to present a coherent picture of a "good" respondent. This might suggest that the inclusion of the FC items in particular would be recommended for older students. Finally, this research (Soto et al. 2008) has shown that there is high variability in acquiescent responding among younger study participants. This may further explain the lack of differentiation among the younger students in the current sample and further point to the need for item types other than the Likert items for younger students.
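As an illustration of what weak discriminant validity looks like in a multitrait-multimethod matrix, the sketch below applies one common check from the Campbell and Fiske tradition: the correlation between two different skills measured by two different methods should be lower than the correlation of each skill with itself across those methods. This is not necessarily the procedure used in the present analyses; the function names are hypothetical, the data are invented, and the skill labels are borrowed only for readability.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def weak_discriminant_pairs(scores):
    """Flag different-skill, different-method pairs that fail the check.

    `scores` maps (skill, method) to a list of student scores and must
    contain every skill under every method.
    """
    flagged = []
    for (s1, m1), (s2, m2) in combinations(scores, 2):
        if s1 == s2 or m1 == m2:
            continue  # keep only different-skill, different-method pairs
        r_hetero = pearson(scores[(s1, m1)], scores[(s2, m2)])
        # Convergent correlations: the same skill across the two methods.
        r_conv_1 = pearson(scores[(s1, m1)], scores[(s1, m2)])
        r_conv_2 = pearson(scores[(s2, m1)], scores[(s2, m2)])
        if r_hetero >= min(r_conv_1, r_conv_2):
            flagged.append(((s1, m1), (s2, m2), round(r_hetero, 2)))
    return flagged


# Invented scores for six students; not real Mosaic data.
scores = {
    ("Getting Along with Others", "likert"): [1, 2, 3, 4, 5, 6],
    ("Getting Along with Others", "sjt"): [2, 1, 4, 3, 6, 5],
    ("Sustaining Effort", "likert"): [1, 6, 2, 5, 3, 4],
    ("Sustaining Effort", "sjt"): [1, 2, 3, 4, 6, 5],
}

flagged = weak_discriminant_pairs(scores)
```

In this invented example, the SJT score for Sustaining Effort tracks the Likert score for Getting Along with Others more closely than it tracks the Likert score for its own skill, so that pair is flagged.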

Conclusions
Reliable and valid assessments are paramount for the growth of the field, as assessments enable practitioners to measure progress and maximize student SE skill development.
Here, we argue that the assessment framework and response method are also important considerations for the development or selection of an SE skill assessment. Mosaic is one of a growing number of SE skills assessments designed to measure the skills that crosswalk to the Big Five personality framework. The current study offers additional support for this practice given that the skills accounted for a statistically significant amount of variance in important school-related constructs and outcomes. Moreover, the findings reported here highlight the importance of shying away from automatically using Likert-based assessments and considering more innovative item types either alone or in combination. Our findings suggest that the aggregate scores derived from multiple item types are often superior to single item types in terms of predictive validity, and that the single item type with the best validity evidence varies according to the outcome in question. It is often said that we measure what matters, and it is unequivocal that these five SE skills matter. By purposefully leveraging a multi-method SE skill assessment in schools, we can equip students with the skills they need to thrive in school, in the workforce, and beyond.

Informed Consent Statement:
Student assent was acquired prior to students completing the assessment.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to contractual and privacy issues.

Conflicts of Interest:
The authors declare no conflict of interest. The views, opinions, and interpretation of the findings contained in this article are solely those of the authors and do not purport to represent the views of the researchers' host institutions.