Article

Validation Is a Methodology! Guideposts for Assessment Development and Validation

by
Jonathan David Bostic
School of Inclusive Teacher Education, Bowling Green State University, Bowling Green, OH 43403, USA
Educ. Sci. 2026, 16(4), 565; https://doi.org/10.3390/educsci16040565
Submission received: 29 January 2026 / Revised: 17 March 2026 / Accepted: 19 March 2026 / Published: 2 April 2026

Abstract

Measurement and assessment in Science, Technology, Engineering, and Mathematics (STEM) education is a central topic within STEM education scholarship. While there has been an increase in validation-related scholarship within STEM education, there are few guides for users conducting validation work. Providing guidance for a broad readership, not just methodologists, offers potential for scholars from more backgrounds to engage in validation. To that end, the purpose of this paper is to build upon past scholarship and both articulate and situate validation as a methodology. Guideposts are provided to support readers as they engage in validation scholarship, along with a strategy they can adapt for their own validation studies. One key outcome of this paper is foundational work that scholars can leverage to extend, challenge, and generate new validation-related work, which in turn moves assessment practice and scholarship forward.

1. Introduction

Validation is essential work in the practice and scholarship of quantitative measurement. Without validity and validation, there is no way of knowing whether an assessment measures what it intends or whether it can be used for a given situation. For decades, discussions related to instrument reliability and the validity of results and interpretations were noticeably absent from the literature (Ziebarth et al., 2014). It is only recently that an uptick in validity scholarship has been seen (e.g., McMillan, 2026; S. G. Sireci et al., 2024; Wilson & Mari, 2026). These modern works highlight how validation has moved forward. Yet much of this work is located primarily within the assessment and psychometric space, which can be difficult to access for the broad audience of STEM scholars who engage with testing and assessment as part of their wider research agendas. Worse yet, validity evidence for tests and assessments is underreported (E. Krupa et al., 2024). This gap offers a fruitful opportunity for researchers rather than a problem, especially for those engaging in large-scale STEM education testing. It is also an opportunity to engage in critical debates about the instruments that are developed and used within STEM education, which makes such discussions worthy and necessary for dissemination in peer-reviewed publications. It is an opening to engage in necessary and invigorating scholarship that builds a deep argument around the importance of what assessment developers and users agree to be true. Broadly speaking, such scholarship can also better inform what tests are used in classrooms, research, and evaluation.
Using tests and assessments without sufficient validity evidence has potential to cause undue negative outcomes for research and the participants of that research (Bostic, 2023). For example, administering a test without valid results and interpretations may lead to placing students in remedial courses unnecessarily. We can look at one instance illustrating the lack of validity evidence associated with measures (E. Krupa et al., 2024) as an example of the wider problem across STEM education. A team of more than 40 researchers examined mathematics and statistics measures found in published peer-reviewed journal articles between 2000 and 2020, following the Standards for Educational and Psychological Testing ([Standards], American Educational Research Association et al., 1999, 2014). The scholarly team located 1304 unique measures and then searched for any available validity evidence associated with them. They located at least one form of validity evidence for 74% of those measures, which means 26% of the measures used in peer-reviewed published journal articles had no evidence located in publicly available conference proceedings, journal articles, book chapters, or books. This is gravely problematic because it suggests that results and interpretations from those 339 measures are questionable and potentially unfounded. Such negative outcomes may be due to a few issues: (a) no unifying methodology or methodological framework, (b) few examples describing how to do validation studies, and (c) concerns that validation work is limited to psychometricians. This is a pervasive problem in modern STEM education research (Bostic, 2023; Edelen et al., accepted). This paper seeks to remedy these issues and help a broad readership by characterizing validation as a methodology.
Jacobson and Borowski (2019) suggested that “validation is not a routine exercise but rather it has important, underused potential as a research methodology” (p. 41). They did not unpack that idea beyond naming validation as a possible methodology. Many of the ideas in this paper have been part of the scholarly literature for decades. Validation and validity are discussed in detail in the last two editions of the Standards (1999, 2014). However, validation has not been framed as a methodology, much less synthesized effectively; it is often portrayed as a loose set of ideas rather than a coherent theory around validity. Validation takes on unusual language usage because it refers both to the verb form of the word (i.e., to validate) and to a noun (i.e., validation), both of which tie back to validity. Messick (1989) grounds validity as a unitary concept, which has largely been taken up, as evidenced in scholarship from the last 40 years (American Educational Research Association et al., 1999, 2014). Validity “always refers to the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores” (Messick, 1989, p. 6). Unfortunately, there has been little practical support for scholars in the ways they can engage in validation. That lack of methodological framing and practical support has been a hindrance to the scholarly community because it prevents others from easily taking up validation work. A goal for this work is to support the broader STEM education community to consider validation as a scholarly endeavor, to promote access to these ideas in digestible ways, and to reflect on how validation can improve the lives of all those administering and taking tests.
This manuscript provides evidence for the following claim: Validation is a theoretically grounded methodology that draws on a variety of methods available to users and teams, depending on their theoretical perspectives and aims. There are three objectives in this theoretically grounded paper. First, I situate validation as a methodology, ground it in theory, and differentiate it from methods. Second, guideposts are provided to describe validation as a methodology. These guideposts serve as opportunities to question what is being done and why. A third and final objective is to provide readers with an example of using validation as a methodology. An intended outcome from this work is to guide readers through the nuances of validation as a burgeoning methodology within STEM education scholarship and to give them access points to engage with it. This is presented as a strategy for others to draw from for inspiration, much like how a recipe in a cookbook is a starting point for cooks and chefs at home. The use of STEM education contexts is not intended to limit validation as a methodology to those contexts but instead responds to calls for better assessment practices and materials in STEM education (e.g., Hill & Shih, 2009; National Research Council, 2014). For those reasons, I focus on STEM education, but there is no reason these ideas could not transfer to the humanities or other contexts (e.g., medicine and business). The next section begins with important language distinctions to guide readers through some nuances.

2. Related Literature

2.1. Language Matters

Words can have different meanings, even within one field. For instance, variables in mathematics can take on six different mathematical meanings (Schoenfeld & Arcavi, 1988). The word ‘table’ has a connotation of a mathematical representation (Lesh & Doerr, 2003) as well as of a physical object like one in an office or home. Thus, this paper begins with some language clarification to assist readers in making sense of validation as a methodology.
Words and phrases such as assessment, measure, quantitative instrument, and test are used synonymously in this paper, following recent practice (e.g., Lawson & Bostic, 2024). They describe tools used to gather data that can be acted on with quantitative approaches and allow for the use of inferential or descriptive statistics (American Educational Research Association et al., 2014). Some readers may note, and rightly so, that there are slight differences between these terms (e.g., test and assessment); yet here, the terms are used synonymously for a broad readership. Those interested in the differences might start with the Standards (American Educational Research Association et al., 2014).
Validation is a methodology, not a method. A study’s methods are “the techniques or procedures used to gather and analyze data related to some research question or hypothesis” (Crotty, 1998, p. 3). Rasch analyses, regression analysis, and the constant-comparative method (Glaser, 1965) are three distinct methods used in validation studies. Crotty (1998) defines a methodology as “the strategy, plan of action, process or design lying behind the choice and use of particular methods and linking the choice and use of methods to the desired outcomes” (p. 3). A methodology informs the methods used within a study, which is why empirical studies should have ‘methods’ sections rather than ‘methodology’ sections (American Psychological Association, 2019). The methodology is guided by a theoretical perspective or combination of perspectives (Herber et al., 2025). A theoretical perspective is a “philosophical stance informing the methodology and thus providing a context for the process and grounding its logic and criteria” (Crotty, 1998, p. 3), and it influences the methodology of a study (Herber et al., 2025). An epistemology and/or ontology frames the theoretical perspective (Crotty, 1998; Herber et al., 2025). In summary: an epistemology or ontology guides the theoretical perspectives, which inform a chosen methodology and drive the methods. Figure 1 illustrates this relationship. Because validation encompasses several methods, validation scholarship is not a method. Rather, it is a plan of action or process by which a user or team of scholars engages in this work.
Readers may wonder: Why is validation not considered a method? Validation scholarship can seek to build a test, which is a tool, and that sort of work can situate itself within a design-based research methodology. Some prior validation scholarship has positioned itself within design-based research, yet that approach treats validation as a procedure. Positioning validation as a procedure obscures a key facet: validation is a strategy that involves a course of action with a goal of associating desired (i.e., hypothesized) outcomes with methods and the choices behind them. It involves thinking deeply about what the aim of the research is as well as why each choice is appropriate. In short, validation as a methodology helps users derive choices in methods—validation is not the methods themselves.

2.2. Validity and Validation: A Brief Primer

In 1999, the Standards were published by a joint committee of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The current Standards were adopted in 2014 (American Educational Research Association et al., 2014), and a new edition is set for release in the next 12 months. While these groups reflect a North American perspective, the Standards seek to clarify validity and validation practices. There is no evidence suggesting that the Standards cannot be taken up outside of North America; they are grounded in work that comes from global scholarship (B. Zumbo, 2014). For that reason, this paper is grounded in the Standards.
Validity is “the degree to which evidence and theory support interpretations of test scores for proposed uses of tests” (American Educational Research Association et al., 2014, p. 11). It is a singular idea as opposed to something that takes on multiple forms (American Educational Research Association et al., 2014). Prior to 1999, the idea of multidimensional validity was more pervasive; today, validity as a unitary construct is well accepted (Kane, 2016; S. G. Sireci, 2013). Validity helps readers feel more certain about the question: How do I know what I know? It is a misnomer to say that a test is valid; rather, the results of an assessment, and the interpretations of those results, are valid. Reliability is related to validity; however, it is neither a form of validity nor a source of validity (American Educational Research Association et al., 2014).
The Standards describe five sources of validity evidence: test content, response processes, relations to other variables, internal structure, and consequences of testing (see Table 1 for descriptions; American Educational Research Association et al., 2014; Folger et al., 2023). Those sources are a means to organize the validity evidence gathered in relation to the intended interpretation(s) of an instrument’s results.
Some readers might see these five sources as discrete buckets to fill with evidence rather than as intertwined. That is one perspective, though it diminishes the interconnectedness of the five sources within validity as a unitary idea. Instead, let us imagine a rope twisted from five different strands (see Figure 2). This rope serves as a metaphor to help readers evaluate the ways in which validity sources support more robust interpretations and results from quantitative instruments. A single rope strand provides some strength, yet it may not be strong enough for some needs. Similarly, one piece of test content validity evidence may be a good start for convincing others that the test’s claims are valid, but it does not necessarily support a strong validity argument that can stand up to substantive criticism. Multiple rope strands, much like multiple validity sources connected through claims and evidence, have potential to be much stronger due to their combined strength. Moreover, numerous pieces of evidence within one or more validity sources further strengthen an argument so that it can simply and effectively communicate that the test produces valid results and interpretations. More complex validity claims related to a test’s uses and interpretations require qualitatively stronger and quantitatively more validity evidence.
Recent literature syntheses within STEM education research have demonstrated that these Standards have not been taken up consistently or used effectively in the last two decades (e.g., Bostic et al., 2021a; Folger et al., 2023; Gallagher et al., 2025; Ing et al., 2024; Lavery et al., 2020). In cases where validity is considered, an issue that comes up frequently is a heavy reliance on test content or internal structure evidence (Arjoon et al., 2013; Bostic et al., 2022; Cruz et al., 2020; Decker & McGill, 2019; Lavery et al., 2020; Sondergeld, 2020). While it is helpful that papers describe one piece of validity evidence and reliability, one piece of evidence does not justify a robust validity argument (Bostic, 2023; Kane, 2013, 2016; Reeves & Marbach-Ad, 2016; S. G. Sireci, 2013). Syntheses of assessments as well as recent publications about assessments highlight this issue further (Bostic et al., 2019a; Hill & Shih, 2009; Ing et al., 2024; Lavery et al., 2020; Maric et al., 2023; Minner et al., 2012).
One area that was not fully clarified in the Standards was the methods to communicate what an instrument does. “Validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself” (American Educational Research Association et al., 2014, p. 11). Any claims about an instrument’s results or interpretations must be grounded in appropriate evidence. There have been scholarly discussions around different approaches to convey that information (Bostic & Sondergeld, 2015; Lavery et al., 2020; Pellegrino et al., 2016; Schilling & Hill, 2007; S. Sireci & Faulkner-Bond, 2014; S. G. Sireci, 2013; Wilson, 2023). A feature shared across these different approaches is the goal of providing an argument that has claim(s) and evidence. Most modern approaches use an argument-based approach (Kane, 2021) that has been adopted widely across assessment and discipline-based STEM education literature (M. Carney et al., 2022; Wilson, 2023). “An argument-based approach to validation imposes two basic requirements: state the claims that are being made, and evaluate the credibility of these claims” (Kane, 2021, p. 34). A validity claim is a statement supported with validity evidence, such as: The content of test items is aligned to the construct. This claim can be supported with evidence that might come from convening an expert panel, asking panelists to review items using their expertise, and gathering their feedback about construct alignment. While it may seem obvious that experts’ recommendations can suggest that the items and construct are aligned, it should be stated for readers rather than inferred. Combining the claim and evidence allows a reader to take up the idea, consider the merits of the evidence, and decide whether the claim is appropriately supported. This claim–evidence approach is akin to a deductive proof in mathematics or logic.
Mathematical proofs and justification may be perceived as logical and true given an agreed-upon set of principles, axioms, and structure within a culture (A. J. Stylianides, 2007; G. Stylianides et al., 2019). An idea is presented as a claim, and then a sequence of reasons, evidence, and justification is leveraged to move towards that claim in some capacity. The reader decides whether a claim is supported as they judge the quality of each piece of evidence. This claim–evidence approach is like Toulmin’s (2003) claims–warrants approach for argumentation. Validation presents an opportunity for a reader to actively engage in reasoning about the quality of a validation argument rather than accept it blindly as truth. Why should a reader or listener agree with the claims and evidence? What theoretical perspectives or epistemologies are woven into the validation argument? Reflecting on these questions and others presents an opportunity to meaningfully question what is done within a validation study. Cronbach (1988) exhorted assessment scholars to ask such questions, as doing so is both essential and necessary for civil public discourse. In that spirit, validation is work done in public view rather than in a scholarly vacuum.
While some claims are reflected naturally in the validity evidence collected to support score interpretation and use (e.g., psychometric results), other claims extend from the results and interpretations. A use claim provides information about how to use the instrument’s results and/or the measure itself. A metaphor for a use claim is the directions for over-the-counter medicine, which include dosage directions and guidelines for taking the medicine (E. Krupa, personal communication, 7 November 2024). An interpretation claim conveys how to make sense of the results within a context. A metaphor for an interpretation claim is the way a doctor might interpret the data they see on an X-ray and conclude that there is a fracture; that is, the doctor reads a visible break in the bone as indicating that a fracture has occurred. The ways that accumulated claims are presented form a validity argument.
Collecting validity evidence and communicating its meaning through claims can take on many structures (e.g., Bostic et al., 2017; Kane, 2013, 2016; Pellegrino et al., 2016; Schilling & Hill, 2007; Walkowiak et al., 2019; Wilson, 2023; Wilson & Wilmot, 2019). These claim statements and claims about validity evidence also require a variety of methods (E. Krupa et al., 2024). A collective framing of these methods is not present, much less any mechanism that might constellate such methods. Thus, I argue that validation is a methodology, framed as a process or strategy to gather and evaluate the claims and evidence associated with the uses and interpretations of a test’s results. It also includes the accumulation of methods to support those claims.

3. Validation as a Methodology

Validation includes quantitative, qualitative, and mixed-methods approaches (see E. Krupa et al., 2024 for examples). It is a means to evaluate the power of an argument as perceived by the user. It involves argumentative writing whose strength comes from an accumulation of evidence and claims, with communication that conveys the qualities of the validity argument. Argumentative writing contrasts with persuasive writing, which seeks to convince a reader that one position is better than another. The next section of this paper outlines core tenets and some guideposts of validation as a methodology, much like how Wittgenstein (1958) provided signposts for logical deductions. As this is a nascent methodology, additions and revisions are likely to come forth.
  • Core tenet: Validation works with many theoretical perspectives.
Validation is nuanced and not wedded to a single theoretical perspective. The assessment developers engaging in validation bring their own perspectives to their assessment scholarship. Historically, assessment scholarship has drawn on a positivist epistemology or a behaviorist theoretical perspective because of tradition (Braat et al., 2020; Delandshere, 2002). In the last ten years, assessment scholarship has drawn from a variety of theoretical perspectives, including social cognitive, embodied learning, and quantitative critical (quantcrit) perspectives. Let us reify the idea with an example: Consider three assessment development teams working on a mathematics anxiety instrument. One development team might draw on a cognitive theoretical perspective (Mayer, 2024) for their work. A second team might draw on a sociocultural theoretical perspective (Vygotsky, 1997) for their validation. A third team might leverage a quantitative critical perspective (Garcia et al., 2018; Sablan, 2019) as they engage in validation work.
Each instrument development team might view the other teams’ products as unsuitable for an intended need because the theoretical perspective guiding each one may lead to different results and ensuing interpretations. On the other hand, all three teams have the capacity to create a high-quality mathematics anxiety instrument and develop a set of logical claims backed with suitable evidence. The key is that each is framed by a different theoretical perspective, which in turn may entail different epistemological assumptions. In summary, validation, as a methodology, can be associated with multiple theoretical perspectives.
  • Core tenet: Validation scholarship seeks fairness and to minimize bias.
The Standards frame fairness this way: “A test that is fair minimizes construct-irrelevant variance associated with individual characteristics and testing contexts that otherwise compromise the validity of scores for some individuals” (American Educational Research Association et al., 2014, p. 219). Those engaging in validation scholarship seek to accurately and fairly measure a construct while simultaneously considering the needs of the population being measured. High-quality measurement includes both a high degree of precision and an acceptable level of reliability, which is influenced by error and variance that can arise through fairness and bias issues. Fairness extends beyond a lack of measurement bias to considerations such as (a) access to the construct and (b) individual test-score interpretation, including the consequences of legitimate score interpretations (American Educational Research Association et al., 2014; B. D. Zumbo & Hubley, 2016). These actions are equity-forward, which positions validation as equity-forward scholarship (Bostic, 2023).
Merriam-Webster defines equity as “fairness or justice in the way people are treated” (Merriam-Webster, n.d.), which is how equity is used here. That is, validation scholarship seeks to minimize bias, promote fairness and justice, and lead to equitable outcomes for all. Unfair measurement is likely to have elements of high or unnecessary variance. This is further evident from the addition of consequences of testing as a validity source in the 2014 edition of the Standards and the attention to fairness and bias (Jonson & Geisinger, 2020). Bias may be wrapped into any of the validity sources. As an example, here are three possible questions that link to consequences of testing, test content, and internal structure claims: (1) Is there any bias related to the consequences of using a test for a given need? (2) To what degree is content biased against or in favor of a group of respondents? (3) What level of Differential Item Functioning (DIF) is reported for a given test (see the sketch below for one common DIF screen)? This idea of assessment scholarship as equity-forward is not new; Cronbach (1988) argued that assessment developers must work to create fair tests with just outcomes for society through their quantitative instruments, as part of their ethical and civic responsibilities.
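To make the DIF question concrete, below is a minimal sketch of a Mantel–Haenszel DIF screen, one widely used way to flag potential item bias after matching respondents on overall ability. The data are simulated placeholders, and the flagging rule is a simplified version of the ETS delta classification, so treat it as an illustration rather than a complete DIF analysis.

```python
# A minimal, illustrative Mantel-Haenszel DIF screen (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)           # 0 = reference group, 1 = focal group
total = rng.integers(0, 21, n)          # total test score used for matching
# Hypothetical item responses; a group effect is built in for illustration
p = 0.2 + 0.03 * total + 0.10 * group
item = rng.random(n) < np.clip(p, 0, 1)

# Stratify on total score and accumulate Mantel-Haenszel components
num, den = 0.0, 0.0
for s in np.unique(total):
    m = total == s
    a = np.sum(item[m] & (group[m] == 0))    # reference, correct
    b = np.sum(~item[m] & (group[m] == 0))   # reference, incorrect
    c = np.sum(item[m] & (group[m] == 1))    # focal, correct
    d = np.sum(~item[m] & (group[m] == 1))   # focal, incorrect
    nk = a + b + c + d
    if nk > 0:
        num += a * d / nk
        den += b * c / nk

alpha_mh = num / den                     # MH common odds ratio
delta = -2.35 * np.log(alpha_mh)         # ETS delta scale
# Simplified flag; the full ETS rules also require significance testing
flag = "A (negligible DIF)" if abs(delta) < 1.0 else "B/C (review this item)"
print(f"alpha_MH = {alpha_mh:.2f}, delta = {delta:.2f}, flag: {flag}")
```

Comparable screens exist for polytomous items and for model-based approaches (e.g., DIF within item response theory); the choice among them is itself a validation decision worth documenting.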
  • Core tenet: Validation work is scholarship that should be shared publicly.
The ways in which a test is used and its results are interpreted can have powerful outcomes. This highlights not only the importance of validation but also how the processes and products of validation scholarship must be scrutinized. That is, discussing the validation process is just as important as the product of that process. Too often, the focus is on the product, which leaves readers without a deep understanding of the choices made during assessment development that inform the quality of the validity claims. Tests, or at least information about them, should be disseminated through conference presentations and proceedings, journal articles, book chapters, and other peer-reviewed, publicly available sources so that potential users can consider them before making a new instrument. Outlets that target several types of potential users and administrators, including but not limited to academics, industry professionals, community members, and policy makers, are appropriate, as is material suitable for the public to become aware of the work. The peer-review process provides users with an opportunity to gather feedback on what is shared and the degree to which a validation argument is supported with evidence. Academic journals must also provide spaces for validation scholarship if there is any chance for quantitative measurement-focused research to be grounded in strong testing standards. Additionally, readers should be able to access materials regardless of their institutional affiliation and/or ability to pay for access to the publication.
Jacobson and Borowski (2019) called upon “authors, reviewers, and editors alike” to change the culture of assessment scholarship by considering a broad view of what counts as scholarship and to welcome it in their journal publications. Validation scholarship has potential to advance theory or methods, and thick, rich documentation about the processes and products advances assessment scholarship. Theory-advancing work by Zolfaghari and colleagues (Zolfaghari et al., 2021, 2024) as well as Kosko (2017, 2019) has pushed new theories about teachers’ pedagogical content knowledge around fractions. Similarly, Carney’s collaborative work (e.g., M. B. Carney et al., 2017) has helped to unpack new insight into how teachers attend to students’ thinking. In methods-focused work, Confrey and colleagues (Confrey et al., 2019) shared a method to analyze students’ response patterns on learning trajectories. Bostic et al. (2021b) describe a method for collecting response process validity evidence from learners called Whole-Class Think Alouds. Validation scholarship must communicate what was done (process) and the outcome(s) from that work (product) in scholarship and in view of the greater public. Sharing the process in detail creates a greater chance for understanding and uptake across more individuals.
  • Core tenet: Validation involves trustworthiness and open communication.
If validation is a methodology, then there is an expectation for trustworthiness, akin to the notion frequently used in qualitative research (Lincoln & Guba, 1985). Trustworthiness is both an idea and a process that grounds work and provides credence for the claims and evidence (Lincoln & Guba, 1985). That is: How do we know what we know? This lens of trustworthiness suggests the discursive nature of validation scholarship. It involves communication between people, especially respondents, administrators, and developers. Advisory boards, test content and bias review panels, respondent interviews and focus groups, as well as potential user and administrator interviews and surveys, are some possible ways to build trustworthiness in the claims. Dissemination through presentations and publications can also build trustworthiness through the peer-review process. It is essential to position assessment scholarship using a validation methodology as seeking to imbue trustworthiness.
  • Core tenet: Validation is collaborative.
Test development requires numerous bodies of knowledge including, but not limited to, content knowledge and measurement/assessment/psychometric knowledge, as well as perspectives from potential test takers/respondents and administrators (Bostic, 2023). The breadth and depth of expertise needed cannot be filled by one person. Collaboration also entails working closely with respondents and test administrators. Validation is a collaborative opportunity to bring together numerous individuals to facilitate test development. As readers reflect on tests they hope to create, it is important to pause and consider what knowledge and experiences are needed, what resources are available, and what communication strategies will be used across the validation team. Experience can be as important as knowledge. For example, imagine a team of developers designing an instrument measuring the science literacy of elementary students, yet the team members lack any kindergarten through twelfth grade (K-12) science teaching experience. This may result in missing nuances that come from being in the classroom; as such, communicating experiences and contexts matters.
  • Core tenet: Validation is not limited to one form of testing—it is something for all quantitative measurement.
Validation must be a part of any quantitative measurement process. Surveys, multiple-choice tests, constructed-response assessments, observation protocols, and other measures producing quantitative data are all included. Moreover, creating a measure for a small study or dissertation project falls under the same expectation as creating a large-scale instrument—validation should happen. Without exploring validation, or asserting validity evidence or claims, issues might arise during data collection, data analysis, or reporting results. A few possible issues that may come up without validation: (a) It is uncertain whether the results measure the intended construct. (b) It is unknown how to interpret the results in light of the test. (c) It is unclear to what degree participants might experience negative consequences from testing and, in turn, whether the benefits of testing outweigh the negative consequences. (d) It is questionable whether a test will function with one population if it was designed and piloted with another population. All these issues impact quantitative assessment and, in particular, large-scale testing within STEM education contexts.
  • Conclusion
These are some core tenets related to validation as a methodology. It may be necessary to adapt and revisit them as the validation methodology takes on greater use. Some readers may not be certain how to plan and execute a validation study. Validation includes opportunities for choices to be made, which impact the process and outcome. Guideposts are provided to facilitate decision points that can be documented and communicated. These guideposts are much like signage along a hiking trail: they allow the hiker to choose what path to follow. They are not an algorithm to follow exactly in every case. They offer opportunities to build trustworthiness and transparency in the work done during assessment scholarship.

4. Guideposts for Validation Scholarship

A purpose of this manuscript is to assist readers in planning and executing a validation study. These guideposts serve as decision points for a validation study and may help researchers reflect on and document their choices. There are some commonalities across validation studies, which are framed as questions that readers might take up for their validation research. The questions are intended to prompt the user to reflect and make choices (see Figure 3). As with any methodology, there is not a single approach or method to reach a desired research goal; rather, there are numerous decisions to make that inform the outcome. There are numerous methods suitable for conducting a validation study, and the decisions made during the process are critically important. It is essential to document consistently throughout the validation study.
  • Guidepost: What theoretical perspective(s) guide the validation study?
A theoretical perspective informs what is done during a validation study, including how data are collected and analyzed. For example, those adopting a cognitive or learning sciences-informed perspective may be more interested in one type of data compared to another team that adopts an embodied learning or critical perspective. Discuss with involved team members what perspective(s) they bring and what perspective(s) are fitting for the validation study.
  • Guidepost: What experiences and knowledge does your team have?
Validation is a collaborative experience that involves numerous people with various funds of knowledge including, but not limited to, the scholars executing the validation study, individuals who grant access to participants/respondents, the participants/respondents themselves, and possibly mentors or advisory members. With so many people involved, readers considering a validation study should take stock of who is involved and what they bring to the team. As one example, a smaller team might include a content expert with a limited quantitative research background and a methodologist who brings a strong quantitative lens. In this scenario, the team may not be able to engage effectively with qualitative data, thereby limiting what claims they make and the data they collect. On the other hand, consider a team consisting of a school district partner, two content experts, and two methodologists: one methodologist with mixed-methods experience and a second with programming and quantitative experience. This team has a broad range of knowledge, wisdom, and experiences that can be leveraged for an array of data collection and analysis methods. E. Krupa et al. (2024) provide a comprehensive synthesis of expert-derived methods for each source of validity evidence, and a few are shared here in Figure 4.
  • Guidepost: Claims can be shared a priori (in the beginning) or a posteriori (afterwards).
Claims are the ideas supported by validity evidence. These claims may be intended to support many topics including, but not limited to, a test’s use, a test’s results, and interpretations of those results. Folger et al. (2023) conducted a study to determine whether assessment experts perceive that claims must be shared before investigations are conducted (a priori) or whether they may result from the evidence that has been collected (a posteriori). Experts agreed that both are reasonable approaches and that preferences may be influenced by theoretical perspective and/or experience (Folger et al., 2023). Having claims prior to or following data collection is akin to two mathematical reasoning approaches: a priori functions like deductive reasoning, whereas a posteriori is like inductive reasoning.
  • Guidepost: There are numerous means for communicating a validity argument.
Validation scholarship has offered several means for communicating a validity argument. Some are shared here, though readers might dive into the cited literature for more examples. Much like the pathway to solving a mathematics problem, there are numerous viable means. All of them are rooted in argumentative writing and communicating ideas clearly and precisely; a key difference among them is how they are framed. Those published after the Standards’ (2014) release are shared here for relevancy and consistency with this paper’s message. Kane’s (2013, 2016) ideas draw heavily on a Toulmin (2003) style of argumentation: claims are stated a priori and supported with evidence and ideas, much like a deductive mathematical proof (G. Stylianides et al., 2019). Pellegrino et al. (2016) conceptualized communicating validity components as (i) cognitive, (ii) instructional, and (iii) inferential. This framework sprang from work on instructionally relevant assessments. Questions might be framed within each component and, in turn, be linked to various data collection sources. The BEAR Assessment Model (Wilson, 2023) came from an item-response theory driven angle and can be used broadly. It aligns well with other work that Wilson has co-published focusing on STEM education contexts (e.g., National Research Council, 2001, 2014; Wilson, 2009). A fourth approach tends to draw more heavily on the Standards (American Educational Research Association et al., 2014). This Standards-based approach frames one set of claims around the Standards’ five sources of validity evidence, then conveys claims and evidence separately for uses and interpretations of results. Examples of this work can be found in my team’s work (e.g., Bostic & Sondergeld, 2015; Bostic et al., 2017, 2019b). Again, all these approaches are satisfactory. A core feature of each is that it is vitally important to ground what is learned in robust, defensible evidence; a key difference among them is how the claims and evidence are organized (one way to record such an organization is sketched below).
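As a small illustration of that organizational point, the sketch below shows one way a team might record claims and evidence keyed to the Standards’ five sources. The structure, names, and entries are hypothetical, offered only to show what an organized claims–evidence registry might look like; none of the approaches above prescribes this format.

```python
# A minimal, hypothetical sketch of recording a Standards-based validity
# argument: claims tagged by validity-evidence source, each listing the
# evidence gathered to support it.
from dataclasses import dataclass, field

SOURCES = {"test content", "response processes", "internal structure",
           "relations to other variables", "consequences of testing"}

@dataclass
class Claim:
    statement: str            # e.g., a use or interpretation claim
    source: str               # one of the five Standards sources
    evidence: list = field(default_factory=list)

    def add_evidence(self, description: str):
        assert self.source in SOURCES, f"unknown source: {self.source}"
        self.evidence.append(description)

argument = [
    Claim("Item content aligns with the construct.", "test content"),
    Claim("Scores function similarly across groups.", "internal structure"),
]
argument[0].add_evidence("Expert panel review (5 raters, alignment ratings)")
argument[1].add_evidence("DIF screen on pilot data (no items flagged)")

for c in argument:
    status = "supported" if c.evidence else "needs evidence"
    print(f"[{c.source}] {c.statement} -> {status} ({len(c.evidence)} piece(s))")
```

Whatever the container, the value lies in forcing each claim to name its evidence, which mirrors the claim–evidence pairing described earlier.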
  • Guidepost: Test users, test administrators, and respondents must be at the core of assessment development and validation.
The experiences of those using, administering, and completing the test matter. Multi-directional communication between users, administrators, respondents, and test developers is essential. Communication among these groups facilitates understanding about the benefits and possible negative outcomes involved with testing. As an example, testing during K-12 instructional time means less time for teaching, so there must be substantive merit to administering an assessment that reduces students’ time spent learning. Most importantly, assessment is not something done to a person or group without agreement/consent from everyone involved.
If a tool is developed for use by a group of individuals, then assuredly those individuals should be included in its design. Empathic approaches (Postma et al., 2012; Zoltowski et al., 2012) sprang from the STEM and business literature as being critical to responding to client needs. Empathic approaches have been used in validation studies and place a user/client/respondent and test administrator at the forefront of test development (e.g., Bostic et al., 2024). Such approaches have potential to effectively measure a construct and, at the same time, limit negative consequences from testing. One key in test development is to maintain a positive testing experience for all those involved, which comes from minimizing negative consequences from testing. Similarly, test administrators must be looped into test development and validation for similar reasons. Test administrators play a key role in test use and interpretations, which means working with them during test validation to create a user/administrator manual with clear, understandable language.
  • Guidepost: Test abstracts and test names are necessary.
M. Carney et al. (2022) as well as E. Krupa et al. (2024) provide clear guidance for test developers about what information should be communicated in an easily digestible format. That includes information like the construct measured, time for administration, and costs. After completing a synthesis of measures for mathematics and statistics education contexts between 2000 and 2020, E. Krupa et al. (2024) noticed that few tests had names, much less recognizable ones. Test developers must offer this information so that potential users and administrators can find and evaluate their tests, particularly if they intend for tests to be usable by a broad public.
  • Guidepost: Technology is only as useful as the individuals using it during validation.
It is common for those learning quantitative methods to use a computer program and take the initial results as correct, without pausing to reflect on questions like: What does the output tell me? Was the approach used reflective of the data? What assumptions are baked into the program and/or procedure that might be uncertain from the output? What does the resulting output mean for the data’s contexts? Many of these questions remain relevant in the age of artificial intelligence (AI). Human-centered computing is not new (Bannon, 2011). In the modern era, there is continued advocacy for human-centered AI (Riedl, 2019). Such approaches place users in control of what information is fed into an AI model, feedback on the output, and refinement of the output until a desired outcome is reached. Hence, test development and validation team members must think carefully about technology use and clearly communicate what was used and how it was used.
  • Guidepost: Balance a need for a new test with the resources necessary to develop a robust measure.
I have heard the following statement and ensuing question more than 100 times: I want to do a study of <insert phenomenon>. I cannot find a test that measures exactly what I want. Should I make a new test or use one that measures something proximal but not exactly what I want? The struggle over whether to make a new test or to use something available that does not exactly measure the desired variable is a real challenge. Scholars must reflect and consider resources, needs, and test quality. Making a new, robust test with valid results is not something that can be done easily or quickly, especially while also making a new curriculum or professional development program.
Let us consider a team of researchers that developed a unique intervention and is weighing whether to develop a new test aligned with that intervention. The choice to create a new test should lead to discussions about the available resources (e.g., people, places, time, and finances). Is there sufficient time to create a measure that produces valid results and interpretations? Does the current research team have appropriate knowledge, wisdom, and experiences for a validation study? Would it be possible to revise or update a current test for a desired need? Will the test be used by others in the field in multiple contexts? In my experience, validation and building a brand-new test typically take one or more years and cannot be done well concomitantly with designing an intervention by the same people. AI may expedite some parts of this process (May et al., 2025a), but that depends deeply on the items and constructs. Expedited validation studies have led to measures with one form of validity evidence (i.e., test content) and a reliability statistic, which is a good start but not a robust validity argument. Planning to make robust researchable claims from a test with one form of validity evidence is like trying to pull a car with a piece of twine: the twine, much like the argument, is not well suited for the goal. It is best practice to gather multiple forms of validity evidence, as well as numerous pieces of evidence related to each source, to create a robust validity argument that effectively supports validity claims.

4.1. A Recipe for a Validation Study

This section provides a recipe—or strategy—for a validation study (see Figure 5 for an illustration). It is offered to readers as one way validation research might be done and is shared primarily to give scholars an idea of the number of resources and decisions that go into a validation study. It seeks to put validation into the hands of the broader STEM education community, using language that is accessible to a variety of scholars. I share examples from my validation work over the last 25 years to give readers insight into the recipe. The word recipe is used here to mean a set of procedures that might lead to a desired outcome, much like in cooking. In cooking, someone might add a bit more of one spice or ingredient and omit others to suit their taste preferences after trying a recipe for the first time. This is similar to how validation scholars work in practice. A first-time reader might follow this recipe exactly, yet the second time make some revisions, which is normal and expected. Validation scholars learn skills and gain wisdom over time and through experience, much like a chef develops habits of mind and inherent skills as they cook more frequently and grow. They are also influenced by their knowledge and beliefs. Validation scholars assuredly grow, develop a sense of the fluidity within validation scholarship, and settle on their preferred base recipe. Again, this recipe is intended to assist readers after they consider whether an available test (i.e., ‘off the shelf’) meets the project’s needs.
These steps are written for teams creating new tests and not adapting or modifying tests—that is a different process. One key idea to keep in mind at each step is the team’s resources, including but not limited to, time, money, technology, and people. Each step includes some questions to consider during the validation process. It is critically important to document the choices made, outcomes from those choices, and to seek feedback on processes. Those are the sorts of things to share broadly and reflect on internally.
This process may seem linear, but it is not. There are opportunities to return to prior steps or to redo a step due to test results or issues during test development. The validation process has many iterative moments within each step and from one step back to previous ones. Dilemmas can arise at many steps. For instance, what might your team do if your intended sample is halved because one participating school district decides at the last minute that it cannot afford the instructional time needed to participate in your test, and thus you lose quantitative data and response process data from interviews? How might the team move a test forward if pilot data analysis is taking longer than anticipated? What will your team do if there is reasonable test content evidence, yet the internal structure results are poor according to modern guidelines? These dilemmas challenge test developers to return to earlier steps in the process and to work across test development teams and stakeholders.
Validation can also seem like a juggling act due to the numerous people involved (e.g., researchers, test takers, test administrators). In my experience, I have switched from working with the test development team on how to make the best test in one moment to managing partnerships with participating school districts that face numerous requests for teachers’ and students’ time an hour later. It becomes essential as a validation leader and team member to code switch carefully when necessary, to value all people and aspects of the project, and at the same time to keep careful records, notes, and reminders that foster a keen eye on all aspects of validation. Validation is a balancing act that will test the mettle of any scholar.
  • Step 1: Thoughtfully consider partners that may be involved in your validation study. Determine available resources in the partnership.
Interdisciplinary teams tend to be better positioned for validation than teams that bring a singular perspective to a topic. An interdisciplinary team is not a requirement, but it is strongly encouraged. Team members whose backgrounds differ from others on the team bring strength to your project. For example, a psychometrician, content expert, and teacher bring different strengths to creating a school-based assessment compared to a psychometrician and content expert alone. Explore potential partners’ interest in joining the validation study, including a sample of respondents and/or their administrators/supervisors. It may also be helpful to discuss how disagreements might be handled and to reflect on each person’s role on the team. Disagreements may be frustrating at times, but they strengthen the final product.
Some individuals might be used to serving as a project leader but in this capacity need to fulfill a supporting role. As the team becomes more solidified, reassess where your team has deep knowledge and where it is lacking. It may be necessary to have an advisory board member or external consultant advise on certain areas. The team may wonder about external or internal funding to support the validation study; it may be helpful and, in some cases, necessary. Thus, grant writing may become a part of the project too. Finally, no data collection can begin until it is certain respondents’ rights are protected. Prepare any documents needed for research on human subjects and gain approval after a review of the ethical considerations of your study. Several questions come to mind, which I organize into two phases. In the first phase, reflect on team creation. The second phase is framed around executing the project once the team is in place.
Phase 1: What skills, knowledge, wisdom, and experiences does each person lend to your validation team? Why will each member of the team be important and necessary for this validation study? How will your proposed team balance diverse views on the same issue? What power dynamics might come up within the study? How will all voices from the team be heard and valued? In what ways might all data be valued rather than demonstrating preference to one form of data (e.g., valuing quantitative internal structure data over qualitative response process data)?
Phase 2: Do you have additional funding to compensate team members, partners, and/or respondents? What technologies will you use? How much time will the project take? What data collection sites might you use as part of validation study?
  • Step 2: Describe the construct to assess, then thoroughly review the available literature.
Make every effort to carefully articulate the construct in a measurable manner that others comprehend. Share an initial draft with your team and seek to fully articulate it. Describe what it is, what it is not, and what it constitutes. How can it be clearly and accurately defined in a way that is accessible to your entire team? Once the construct is defined, it is necessary to systematically and thoroughly review relevant literature. It is recommended to take a three-part approach to the literature: (a) historical and foundational literature, (b) relevant multidisciplinary literature, and (c) modern literature (i.e., publications within the last 5 years). Historical and foundational literature captures works that ground your assessment, its associated construct, and assessment development. As an example, a team using Rasch modeling might hearken back to seminal works from George Rasch, Benjamin Wright, or John Linacre. While some of these works are more than 50 years old, they are viewed as foundational to Rasch modeling. Relevant interdisciplinary literature encapsulates ideas from multiple disciplines and their associated journals. For instance, assessment-focused research about physics education topics might be found in scholarship including, but not limited to, science education, assessment and research, cognitive and educational psychology, and special education. Choices to bound (i.e., delimit) the literature should be purposeful and communicated across team members. Finally, assessment teams should intentionally seek out modern literature so that assessment development draws on current scholarly thinking. As a rule of thumb, literature published within the last five years is relatively modern. Drawing on modern scholarship supports an argument that assessment development is current. There is no guideline on the amount of scholarship from each area; the team is best poised to make that decision.
Meta-analyses, literature syntheses, and similar materials are central for this step. These sorts of publications essentially bundle relevant work for assessment teams and can expedite the literature search process. It can be helpful at this stage to partner with an academic librarian who is skilled in assisting with these types of searches. One caveat is that the availability of published materials may vary from one institution to the next because of different collections. Let us consider a case where a team wants to create a self-efficacy assessment for 11–12-year-old students studying physical science. It may be helpful to look at literature related to the self-efficacy of older or younger students in other STEM classes, or of children of the same age in reading contexts. What literature informs your validation study? Why did your team review some literature and not other literature? Put simply, what counts? Will your team consider historical trends or only materials from the last 20 years?
  • Step 3: Define the desired test format and test administration processes. Draft an associated validity argument and ensuing claims, as well as necessary validity evidence to support the claims and argument.
Multiple-choice, Likert-scale, and constructed-response items bring different opportunities and challenges to testing. Static- and dynamic-form tests have strengths and limitations. Testing over the last 25 years has shifted towards technology-driven administration, which presents benefits and limitations for development teams. In 2026, the rise of AI has spurred test development as well as raised new ethical concerns. How will data be gathered? What test form might be used, and what grounds that decision? Who will administer tests and maintain fidelity in the testing process? How might tests be scored in a reliable manner?
Once a testing format and administration process have been solidified, it is appropriate to develop a validity argument that links the test’s results and interpretations with claims and evidence. This is likely the first time that validation claims might be considered. The validation argument is a way to convey how users know that the test does what it intends. An argument contains claims and evidence. Both a priori and a posteriori claims are acceptable; therefore, discuss with your team how they want to generate claims. Results can be very useful for informing test revisions and making appropriate changes, which can be a reason to state claims a priori at this stage. On the other hand, those claims may be conjectures/hypotheses at this step and become claims a posteriori once the results can be synthesized into a supportive claim. If your team chooses an a posteriori approach, then map out an organized plan to gather multiple sources of validity evidence that can be leveraged for ensuing claims. While there is no minimum or maximum number of claims or pieces of evidence to gather, stronger claims and more robust tests come from those with more evidence across a range of validity sources. What validity claims might be made? What validity evidence will be gathered during this validation study? Who will gather it and in what capacity? How will the data be leveraged as feedback during the validation process? When will validity evidence be gathered during the process?
  • Step 4: Develop a test item development process and implement it.
Create an item-development process and critically question it across team members. Think intentionally about how technologies (e.g., AI, machine learning, natural language processing, and computer-adaptive testing software) might be used during item development. Said differently: lay out your process, try to break it, and improve it where there are flaws. This is one reason that multiple perspectives from a diverse team strengthen validation. Once a process has been agreed upon and seems appropriate, implement it during a pilot project and revise as needed. For some smaller projects, it might be appropriate to create all items for a measure at this time. For most projects, it is better to implement the process for a purposeful, representative sample of items, then revise the items against the process until the team agrees that it is solid. The remainder of the items under development can then follow the finalized item-development process. A smooth item-development process leads to better items and should not be rushed or given short shrift.
It is inappropriate to engage in test item development ad hoc, much less leave it entirely to technology (Paul et al., 2026). Modern evidence from health and medicine examined one AI model’s ability to construct single-best-answer multiple-choice items for a medical exam (see Al-Lawama et al., in press). An expert panel reviewed the items. On the one hand, items were clear and fairly accurate, and distractors were appropriate in 77% of the 100 items. On the other hand, 23% of the items contained an implausible or nonsensical distractor, which reflects poor practice. Moreover, the expert panel and the AI showed low agreement on item difficulty, suggesting that, at the very least, AI can have very different ideas about item difficulty than experts do. In summary, AI models can support item writing but should be used judiciously and purposefully, with attention to the ethics of their use.
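To make the agreement point concrete, the sketch below shows one way a team might quantify expert-versus-AI agreement on item difficulty using Cohen’s kappa. This is not the analysis from the study cited above; the ratings are invented for illustration, and scikit-learn’s cohen_kappa_score is used only as a convenient, widely available implementation.

from sklearn.metrics import cohen_kappa_score

# Invented difficulty ratings for eight items; not data from the cited study.
expert = ["easy", "medium", "hard", "medium", "easy", "hard", "medium", "easy"]
ai     = ["medium", "medium", "easy", "hard", "easy", "medium", "medium", "hard"]

kappa = cohen_kappa_score(expert, ai)
# Values near 0 indicate chance-level agreement; values near 1, strong agreement.
print(f"Cohen's kappa = {kappa:.2f}")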
An item-development process can and should be revised until the team is confident that it is appropriately robust. A documented process clearly communicates what will be done and how, so that it might be replicated or revised later. Who will lead and who will assist item development? What does the team learn during item writing that can be improved? What is the item-development timeline?
  • Step 5: Align intended data collection practices to claims and validity evidence. Perform an alpha test by collecting and then analyzing data.
At a minimum, two sources of evidence and two claims are necessary to lay the foundation for a reasonable validity argument. Historically, test content or internal structure evidence has been provided for STEM education assessments (e.g., Bostic et al., 2021a; E. Krupa et al., 2024). This is a good start; however, it omits valuable evidence that often comes from other data sources, including qualitative data. Respondents’ feedback (via response process or consequences of testing evidence) is critically important and has historically been rarely reported (E. Krupa et al., 2024). In my experience and from observing other validation teams, successful teams often start with test content and consequences of testing, followed by response process, then internal structure, and finally evidence of relations to other variables. This order allows the best items to be developed and gives respondents a voice early and often in the process, which in turn promotes feedback and dissemination opportunities. Remember, this is an initial study—alpha testing—and part of a process. There will be future opportunities to collect more data after item and assessment refinement. To be clear, this approach (i.e., starting with test content evidence) is not a rule; it comes from experience. Gather the evidence needed to support the claims being made, and do so in a way that makes sense. What validity evidence might be gathered in support of potential claims? How will it be gathered? What sampling processes will be used to gather these data? Who will gather these data and report on them? What feedback will be sought to make revisions? Does the order of gathering validity evidence matter, or can it be done concurrently?
  • Step 6: Refine items and the assessment based upon alpha testing results. Collect then analyze data as part of beta testing.
Return to the item-development process with your alpha testing results in hand and re-engage with the items and assessment. Assuredly, some items worked and some did not. Changing a single word may be all that is needed for some items, and that change might have large or minor implications; other items may need major revisions. Think about the threshold for retaining items; shortening the intended test may be necessary if items are beyond revision and new items may not work. An item or set of items might seem ideal because they address an important part of the construct, but for unknown reasons the items may not function as intended with respondents or may misfit. Gathering response process data is vital because these data provide evidence for how respondents engage with items. What processes are in place to ensure that the item-development process is rigorous (to the degree that another person could follow the procedures)? What frameworks ground item development and refinement in the test blueprint? How is bias mitigated throughout the item-development process? What evidence supports a priori claims or leads to reasonable a posteriori claims?
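As one minimal illustration of turning alpha-testing results into revision decisions, the sketch below computes classical item difficulty and corrected item-total (point-biserial) discrimination and flags items for review. The simulated data and the cutoffs (difficulty between 0.20 and 0.90; discrimination of at least 0.20) are common rules of thumb used here as assumptions; teams should justify their own thresholds from relevant literature.

import numpy as np

rng = np.random.default_rng(0)

# Simulate Rasch-style 0/1 responses: 200 respondents by 10 items (invented data).
ability = rng.normal(size=(200, 1))                # respondent abilities
b = rng.uniform(-1.0, 1.0, size=10)                # item difficulties
p = 1.0 / (1.0 + np.exp(-(ability - b)))           # probability of a correct response
scores = (rng.random((200, 10)) < p).astype(int)

difficulty = scores.mean(axis=0)                   # proportion correct per item
total = scores.sum(axis=1)
flagged = []
for j in range(scores.shape[1]):
    rest = total - scores[:, j]                    # corrected (rest) total score
    disc = np.corrcoef(scores[:, j], rest)[0, 1]   # point-biserial discrimination
    if not (0.20 <= difficulty[j] <= 0.90) or disc < 0.20:
        flagged.append(j)
    print(f"item {j}: difficulty={difficulty[j]:.2f}, discrimination={disc:.2f}")
print("items flagged for review:", flagged)

Flagged items are candidates for discussion, not automatic removal; content importance and response process evidence should weigh into the final decision.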
With a revised assessment in hand, it is time for beta testing as part of a pilot testing phase. Pilot testing is essential and must not be skipped to save time, costs, or other resources. Beta testing creates opportunities for rich data collection that aligns with validity claims and/or sources. Draw on appropriate literature for guidelines that inform minimal expectations for item and test functionality. It is strongly encouraged to gather multiple forms of validity evidence, including test content, consequences of testing/bias, response process, and/or internal structure. What guidelines or standards are helpful for determining the sample size of this pilot study? How does your team plan to conduct the pilot study? What data do you want to gather? How has your team checked in with respondents, or those who give access to respondents, to confirm the process remains positive?
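Where internal consistency is among the expectations checked at this stage, a team might compute Cronbach’s alpha as one commonly reported index. The sketch below is a minimal illustration with simulated data, not a full psychometric analysis; acceptable alpha values depend on the construct and intended use, so benchmarks should come from the literature.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Simulated pilot data (invented): 150 respondents by 12 dichotomous items.
rng = np.random.default_rng(1)
ability = rng.normal(size=(150, 1))
b = rng.uniform(-1.0, 1.0, size=12)
pilot = (rng.random((150, 12)) < 1.0 / (1.0 + np.exp(-(ability - b)))).astype(int)
print(f"alpha = {cronbach_alpha(pilot):.2f}")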
  • Step 7: Analyze pilot test data. Share results from the analysis.
Beta testing results can be shared publicly for feedback from reviewers, partners, and potential test administrators and users. Analyze available data and seek feedback in some capacity from others. This may come from an advisory or evaluation team, peer review, or team members. Make decisions about next steps using the test blueprint and other associated documentation. It may also be helpful to reflect on how current results converge with and diverge from relevant scholarship. Test-related decisions can be difficult for many reasons. For example, consider an item that is important but not functioning well. I have had items with strong content links to a test blueprint that, after numerous revisions, still did not work well psychometrically or did not function with respondents. Some of these items became sample items released to the public. In another instance, I was working with a team to develop a survey. One item was deemed central to a construct by some members of the content development team, but evidence from respondents gathered through cognitive interviews (i.e., response process) and psychometric findings (i.e., internal structure) suggested the item was not functioning well. Respondents did not understand the item and responded in nonsensical ways. That item was ultimately dropped from the survey after discussion across the team, which improved the survey’s qualities. These examples illustrate that such decisions are normal and part of validation. Finally, reflect on the test directions for users, administrators, and scorers. Could they be clearer? What does an ideal test and administration guide look like? How will pilot test results be shared publicly for consideration and scrutiny?
  • Step 8: Create a ‘final’ test and administer it. Collect and analyze data. Disseminate results publicly through appropriate venues.
‘Final’ is meant to connote that this is the best test your team can produce given the prior work. Your team must decide whether the test is good enough for broad distribution with a large sample. The work to disseminate the test and analyze the data is no small feat and should not be taken lightly. Your team may elect to do another round of pilot testing before this step, which is normal. For large-scale testing, this step usually involves large, diverse samples that represent the test respondent population. It can also require bringing in new partners to help with testing. Be prepared to provide clear testing guidelines and directions to test administrators; it may be useful to pilot these guidelines and directions with a sample before releasing them for broad use. As your team prepares for this round of testing, reflect on guidelines and standards to determine an appropriate sample size for your test. Who is eligible for testing, and do administrators understand eligibility? To what degree are test directions clear to the respondent and the test administrator? What is the timeline for data collection? How will tests and their data be collected? Are there any technologies, like firewalls or spam filters, that might prevent broad dissemination to all respondents? Who has access to those tests and the data?
The results from this latest test administration inform a priori claims or help derive a posteriori claims. It is permissible to make slight revisions to the ‘final’ test from the previous stage so long as they are minor and/or improve test quality. Any changes should be discussed among the team for consistency and summarized for an evaluation or advisory team (if applicable). Appropriate venues for dissemination are not purely scholarly and include practitioner audiences, supervisors, and community members. Moreover, a test might be valuable to multiple scholarly communities (e.g., content experts, psychometricians, and policy leaders). This is a key reason to build an interdisciplinary team that can have broad impact across numerous fields and take the work to different communities that may be interested. To what degree will the test or its items be shared publicly? What was learned from the results that can be shared with respondents and administrators? Who would be interested in learning more about the test and its validity claims and evidence?
  • Step 9: Return to any step to gather further validity evidence or make adjustments to the test.
Some validation teams might choose not to gather more evidence than initially planned, which is acceptable. It is normal to gather evidence of relations to other variables after a test is deemed psychometrically satisfactory, given the burden of collecting those data and assembling a robust data set. Evidence of relations to other variables may also be gathered concurrently during the previous administration to a broad sample. Consequences of testing evidence is similar in that it can be collected during large-scale test administration or as a follow-up study weeks or months after the test has been administered. This step is also an appropriate time to reflect on who is eligible to take the test. For example, might the test be appropriate for multilingual populations or struggling readers, or require adaptations? As one example, a test created for an English-speaking sample may need work for a Mandarin-speaking sample learning English. Some test developers might pause to create a multilingual test (i.e., the same test in different languages) until the test has been used in real settings and needs are further explored. Translating or adapting a test is another process, which goes beyond the scope of an initial validation study.
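For teams gathering relations-to-other-variables evidence at this stage, a minimal convergent-evidence check might correlate scores on the new test with an established measure of a related construct. The data below are simulated and the established measure is hypothetical; what counts as a supportive coefficient should be argued from discipline-specific guidance.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Invented scores: the new test and a hypothetical established related measure.
new_test = rng.normal(50, 10, size=120)
established = 0.6 * new_test + rng.normal(0, 8, size=120)

r, p_value = pearsonr(new_test, established)
# A moderate-to-strong r can support a convergent claim; report it in context.
print(f"r = {r:.2f}, p = {p_value:.3f}")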
Cronbach (1988) reminds readers that “Validation is never finished” (p. 5), and it is with that zeal that validation work keeps going. This is the time to make last-minute adjustments to the test, its guidelines and directions, or its answer key prior to making all of the materials available for broad use beyond your team. What further validity evidence and claims could be made for this test? Is it currently robust enough for the desired claims?
  • Step 10: Make the test or test development process and outcomes accessible to others.
There is a stark difference between sharing the results of your team’s work and sharing the test and its associated administration materials. Your team may need to work with the institution(s) or business involved with the test because of ownership rights. Releasing an entire test may compromise the test’s properties, which can make it preferable to share only the test blueprint and/or sample items, along with administration guidance. This practice gives others a better idea of the test before using it and provides potential users with the information needed to determine whether your test is appropriate for their needs. Consider uploading your materials to a public-facing website or repository. There are repositories that contain information about tests but not the tests themselves, such as the VM2ED repository (https://mathedmeasures.org/; E. E. Krupa et al., 2024; E. Krupa et al., 2024). It is a freely accessible catalog of tests for mathematics education contexts that welcomes submissions of tests and their accompanying information, although the test itself is not required.
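As a sketch of what “information about the test, not the test itself” might look like, the hypothetical metadata record below illustrates fields a team could prepare when submitting to a repository. Every field name and value is invented for illustration and does not reflect the actual VM2ED submission schema; consult the target repository’s submission guidance.

import json

# Hypothetical metadata record for a repository submission (all values invented).
record = {
    "test_name": "Physical Science Self-Efficacy Survey",
    "construct": "self-efficacy for physical science, ages 11-12",
    "intended_population": "students ages 11-12",
    "validity_evidence": [
        "test content",
        "response process",
        "internal structure",
    ],
    "items_released": "blueprint and sample items only",
    "administration": "online, approximately 20 minutes",
}
print(json.dumps(record, indent=2))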
Your test partners (e.g., preK-12 schools, colleges, and universities) may want to continue using the test or create opportunities for others to use it. They may want the results shared with them so they can act on them. Similarly, others may be interested in using your test. Thoughtful validation leads to a simpler process of communicating what the test does and its appropriateness for different contexts. How will the test be made available beyond your research team? Who owns it? What technology is involved in the test that might change over time?
Validation is a transparent methodology that welcomes everyone’s participation, and little should be hidden from the public eye. “Validation was once a priestly mystery, a ritual performed behind the scenes, with the professional elite as witness and judge. Today it is a public spectacle combining the attractions of chess and mud wrestling.” (Cronbach, 1988, p. 3). As a public-facing process, there can be numerous problems, dilemmas, and pitfalls. Four common pitfalls are shared in the next section to help readers avoid these mistakes in their own work.

4.2. Common Pitfalls in Validation Work

There are numerous pitfalls and challenges that can arise during validation work. One pitfall is the decision to ‘validate a test’ at the same time as designing or conducting an intervention. It has been a practice by some to design a test while constructing and refining an intervention, because doing so suggests a high degree of correspondence between test and intervention. In this case, the test may not be constructed robustly, which can lead to invalid results or interpretations, as well as negative outcomes for participants. Such multi-tasking does not put sufficient resources toward either the intervention or the test, and can lead to poorly designed measures, insufficient validity evidence, and interventions that are not robustly developed. It is important to treat assessment development and validation, as well as intervention design, with the time, respect, and resources needed for robust work. Put simply, the care and concern surrounding an intervention must also be extended to any testing or measurement during the intervention.
A second pitfall, related to the first, is immediately deciding to build a measure that aligns to a desired intervention. Designing a measure and validating its interpretations and results is no small accomplishment. Moreover, the uniqueness of an intervention does not immediately necessitate designing a new test simply because no existing test aligns directly to the construct. For example, imagine a team conducting a study on the self-efficacy of elementary-aged students in science classes. They find a self-efficacy measure for high school-aged students in science classes, as well as a self-efficacy measure for elementary-aged students in mathematics classes. Both have reasonable validity arguments that draw from multiple sources of evidence. In this case, it is likely far easier to modify one of these measures for their needs than to start anew. When advising researchers with this question, I encourage them to consider: (a) Have you conducted a thorough literature search and found nothing related to your needs? (b) What time and resources are available to devote to measure construction and validation? Do you have a year or more? (c) How likely is it that this measure will be re-used?
A third common pitfall is neglecting some forms of validity evidence (e.g., consequences of testing and response process) while focusing on others (e.g., test content and internal structure). All forms of validity evidence are valuable and lead to important validity claims that can support validity arguments. The synthesis of mathematics and statistics education measures by E. Krupa et al. (2024) skews heavily towards test content and internal structure evidence, with few tests having response process or consequences of testing evidence associated with them. No one validity source is greater than another, and stopping with these two sources can obscure a broader validity argument drawn from participants’ experiences. That is, a test might have strong agreement from experts that the items align to a construct, plus acceptable internal structure evidence, such as that from a factor analysis or Rasch analysis. Yet that same test could have substantive consequences for test takers that lead to negative effects during or after testing. Similarly, a group of test takers might struggle to read the items and understand their meaning. Validity evidence and arguments should be as robust as possible and include the voices of test takers and administrators, much as Cronbach (1988) warned assessment scholars nearly 40 years ago.
A fourth common pitfall relates to validity arguments themselves. In some validation studies, validity evidence is presented as fact without communicating how that evidence supports an argument that the test’s results and interpretations have validity. Using validation as a methodology requires users to clearly state what is known from the results. An example drawing from the internal structure validity source might help. Imagine a researcher who completed a factor analysis on some test data and reported the results. These results are helpful only if they are contextualized: Are they within acceptable ranges drawn from appropriate extant literature? To what degree was variance handled sufficiently? What is the key contextualized takeaway from the work? All of these questions can be handled readily and are part of good research reporting practices (American Psychological Association, 2025). This example is intended to help readers recognize the importance of (a) contextualizing their results or findings from validation studies and (b) providing a clear statement of the claim being made and its association with the validity source.
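To illustrate the difference between bare output and a contextualized claim, the sketch below runs an exploratory factor analysis on simulated data and reports which items load saliently on which factor. The data, the two-factor structure, and the 0.40 salience heuristic are assumptions for illustration; in a real study, acceptable ranges and fit criteria should be drawn from extant literature, and the takeaway should be stated as a claim tied to internal structure evidence.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)

# Simulated responses (invented): 300 respondents, 10 items, two latent factors.
latent = rng.normal(size=(300, 2))
true_loadings = np.zeros((2, 10))
true_loadings[0, :5] = rng.uniform(0.5, 0.9, 5)   # items 0-4 load on factor 1
true_loadings[1, 5:] = rng.uniform(0.5, 0.9, 5)   # items 5-9 load on factor 2
X = latent @ true_loadings + rng.normal(0, 0.5, size=(300, 10))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
for j, item_loadings in enumerate(fa.components_.T):   # one row of loadings per item
    salient = [f"F{i + 1}" for i, v in enumerate(item_loadings) if abs(v) >= 0.40]
    print(f"item {j}: loadings={np.round(item_loadings, 2)}, salient on {salient or 'none'}")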

4.3. Remarks Following Test Development

Here are some final notes after creating the test: (i) This multi-step process was presented as linear but is not necessarily linear; it is normal to return to previous stages until the research team is convinced the test is as good as it can be for the necessary purposes. (ii) Give the test an easily identifiable name that is connected to the construct and suited to a broad audience. (iii) Create a description of the test that states what it measures and who it is for; this description should be no longer than a few sentences. (iv) Generate an instrument use abstract (see M. Carney et al., 2022) for the test; this abstract is typically no longer than one page.
Test development has multiple steps and should not be entered into quickly or without just cause. Unfortunately, too many tests have been created that are never used again (E. Krupa et al., 2024) and/or have limited validity evidence, calling into question whether the test actually measures the desired construct. This is one reason for more publicly accessible repositories of tests, or of information about tests, so that potential users and administrators are appropriately informed. This process is meant to be a guide, like a base recipe, that can be tweaked and revised.

4.4. Critiques to Come and Known Concerns

Kelly (2004) critiqued design-based research when it was initially proposed, opening with the observation that “it can be difficult to write commentary on emerging methods”. He pointed out that a methodology should be well-grounded in a few things: (a) argumentative grammar; (b) dealing with the problems of demarcation and meaningfulness; (c) drawing generalizations over actors (i.e., participants), behaviors, and contexts; (d) generalization of a conceptual framework to strategies; (e) technical language; and (f) managing bias. In this paper, argumentative grammar, problems of demarcation and meaningfulness, generalizations, technical language, and bias are all discussed, yet more is needed to unpack each of them through this lens. For example, how might validation manage technical language like validity, which is used in a number of contexts (i.e., validity, construct validity, and structural validity)? In what ways might researcher bias and power dynamics within research teams problematize validation outcomes? Similarly, Kelly notes the importance of valuing the contribution of a nascent methodology before contrasting it with established ones. This paper seeks to describe validation as a methodology, situate it, and then unpack how it differs from other methodologies. Taken collectively, more work is necessary to broaden, deepen, and elucidate validation as a methodology.
There are some concerns with this idea of validation as a methodology. First, validation has traditionally been the work of a small group of scholars, and it is unclear how it will look as more scholars take up this work beyond psychologists, psychometricians, and some content experts. It will be important that language, especially technical language, be clearly defined for users. More importantly, it will be necessary for validation scholars to consistently articulate their definitions and frameworks. Language such as ‘coding’ can take on various meanings across quantitative and qualitative methods. Future validation scholarship should seek to promote a broad, easily accessible understanding of such language. A second concern is that this work falls short of providing a full-scale case study for readers. Such a detailed case study is beyond the scope of this paper, and examples of working through the recipe, as presented here, are warranted. For now, readers might consult recent works cited in this manuscript as examples of how validation has been approached. Many of these citations draw from mathematics contexts due to my experience and interest in mathematics education; that said, readers may find examples from engineering (e.g., Ghosh & May, 2025), science (May et al., 2025b), and technology (Fegely et al., 2023). A third concern is that there can be difficulties collaborating across team members, much less finding willing team members. Teams might experience imbalance and power dynamics, which should be openly discussed and professionally handled to reach equilibrium. Individuals working in isolation or at smaller institutions might need to reach out broadly beyond their institution and ‘cold call’ published validation scholars. The validation community (see VM2ED as an example) is a vibrant, welcoming community. It has been my experience that researchers engaging in validation work are excited to meet new people, welcome them into the community, and build partnerships for the future.

4.5. Final Remarks: A Call to Action and Collaboration

A goal of this piece is to issue a call to action and change current assessment practices for the better. STEM education, which includes testing within STEM education contexts, has been and continues to be a priority among federal governments and multi-national groups. One common way to know whether priorities are being implemented, whether they are having a positive effect, and the degree to which those effects impact varying groups is to examine testing results. Those quantitative testing results and ensuing interpretations are only as good as the validity argument supporting them. As such, it is critically important to have high-quality tests that effectively and robustly measure desirable outcomes with as little bias as possible.
In the USA, recent Executive Order 14277 integrates greater AI usage into STEM education, starting when children are young (U.S. President, 2025). In 2022, during the prior administration, the CHIPS and Science Act authorized $13 billion for STEM education and workforce development (H.R.4346, 2022). Two different administrations seem to agree that STEM education, which includes STEM education assessment, is critically important. Such a focus is not unique to the USA, and there are numerous calls worldwide. China put policies in place more than a decade ago that give greater focus to STEM education (Zhang et al., 2026), including an emphasis on standardized testing in STEM education contexts. The European Institute of Innovation and Technology (EIT) recently launched a call for proposals across European Union countries that seeks to boost STEM education, foster innovation, and support Europe’s workforce through an available €70 million budget (European Institute of Innovation & Technology, 2025). Thus, STEM education continues to be an important, politically motivated topic worldwide that involves STEM education testing. This gives everyone involved with STEM education, globally speaking, an opportunity to collaborate around testing.
The assessment community has shown the importance of making changes through better adoption of the Standards (American Educational Research Association et al., 2014) and increased practices involving fairness and transparency. Moreover, scholars have potential to find community and build scholarship across boundaries through validation. Validation provides opportunities to create bonds around a shared, vested interest. Test development may once have been the work of a few, but now it can and should be part of a welcoming community, regardless of experience, knowledge, or culture. Large-scale testing is influenced to some degree by test uses, how results might be interpreted, and policies (e.g., federal legislation), which further connects to using validation as a methodology. Validation is STEM education research, regardless of whether the test is designed for large-scale administrations or small-scale uses. Validation scholars are a welcoming community that includes individuals using qualitative, quantitative, and mixed methods, as well as content experts, special education scholars, policy researchers, research methodologists, evaluators, psychometricians, community members, and many more. They are found in universities and colleges, businesses, government spaces, and non-profit sectors. That diversity enables validation teams to divide the investigative and educative burden according to their talents, motives, and political ideals. One potential outcome from taking up validation as a methodology is an end to the methods wars, which have tried to position quantitative and qualitative research as superior to one another. Validation welcomes and requires both, as well as mixed-methods research.
STEM education test development is both wide and deep, and still there are areas needing better assessments. Edelen and colleagues (accepted) highlight the gap within integrated STEM education. Much STEM education scholarship has been conducted through a siloed view of STEM education, which differs from the more unified, integrated STEM education perspective (Kelley & Knowles, 2016; Roberts et al., 2018, 2022). Siloed STEM education, also known as discipline-based STEM education, unpacks each component (i.e., science, technology, engineering, and mathematics) and views each discipline as unique from the others while falling under a broad perspective of STEM. Integrated STEM education scholarship, on the other hand, is much more nascent than discipline-based scholarship in areas such as mathematics education and science education (Jackson et al., 2021; Kelley & Knowles, 2016). Research drawing on an integrated STEM education perspective often utilizes qualitative approaches (e.g., Edelen et al., 2024; Roberts et al., 2018), and there are few measures for integrated STEM contexts. Thus, validation scholarship has potential to promote quantitative research within integrated STEM education contexts, which can in turn lead to large-scale testing of integrated STEM education topics. Such a process must start with strong, robust, and modern assessments for integrated STEM education contexts.
As a final thought, “validation will progress in proportion as we collectively do our damnedest—no holds barred—with our minds and our hearts” (Cronbach, 1988, p. 14). Validation scholarship is driven by the community that is invested in promoting better lives and experiences for local communities, national groups, and global society. Cronbach’s statement, nearly 40 years ago, was a foundation that led to articulating validation as a methodology, guideposts for validation scholarship, and a recipe for new validation scholarship. Validation adds value to scholarship and practice through a commitment to doing better and welcomes others into it who also see value in a better future for all.

Funding

Some of this research was funded by the National Science Foundation, grant numbers 1920621, 1920619, 2100988, and 2101026.

Institutional Review Board Statement

Not applicable because this study did not involve human subjects.

Data Availability Statement

There are no data available with this study.

Acknowledgments

I am grateful to the numerous colleagues who reviewed earlier versions of this manuscript, including but not limited to Daniel Edelen, Timothy Folger, and Erin Krupa. During the preparation of this manuscript, the author used ChatGPT (4.0) to create Figure 2 and Napkin.ai to create Figure 1, Figure 3, and Figure 5. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

At the time of submission, the author was a National Science Foundation employee; however, this publication stems from data collected under grants prior to that employment.

Abbreviations

The following abbreviations are used in this manuscript:
K-12: Kindergarten-12th grade
STEM: Science, Technology, Engineering, and Mathematics
AI: Artificial Intelligence
VM2ED: Validity and Measurement in Mathematics Education

References

  1. Al-Lawama, M., Altamimi, O., & Altamimi, E. (in press). Evaluating the ability of AI models to generate level-specific medical MCQs with variable difficulty. BMC Research Notes. [CrossRef]
  2. American Educational Research Association, American Psychological Association & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  3. American Educational Research Association, American Psychological Association, National Council on Measurement in Education & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (1999). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  4. American Psychological Association. (2019). Understanding context of women’s and girls’ lives key to providing good psychological care, according to updated practice guidelines. Available online: https://www.apa.org/news/press/releases/2019/05/women-girls-psychological-care (accessed on 2 March 2026).
  5. American Psychological Association. (2025). Journal Article Reporting Standards (JARS). Available online: https://apastyle.apa.org/jars (accessed on 2 March 2026).
  6. Arjoon, J. A., Xu, X., & Lewis, J. E. (2013). Understanding the state of the art for measurement in chemistry education research: Examining the psychometric evidence. Journal of Chemical Education, 90(5), 536–545. [Google Scholar] [CrossRef]
  7. Bannon, L. (2011). Reimagining HCI: Toward a more human-centered perspective. Interactions, 18(4), 50–57. [Google Scholar] [CrossRef]
  8. Bostic, J. (2023). Engaging hearts and minds in assessment research. School Science and Mathematics Journal, 123(6), 217–219. [Google Scholar] [CrossRef]
  9. Bostic, J., Krupa, E., Carney, M., & Shih, J. (2019a). Reflecting on the past and thinking ahead in the measurement of students’ outcomes. In J. Bostic, E. Krupa, & J. Shih (Eds.), Quantitative measures of mathematical knowledge: Researching instruments and perspectives (pp. 205–229). Routledge. [Google Scholar]
  10. Bostic, J., Krupa, E., Folger, T., Bentley, B., & Stokes, D. (2022). Gathering validity evidence to support mathematics education scholarship. In A. Lischka, E. Dyer, R. Jones, J. Lovett, J. Strayer, & S. Drown (Eds.), Proceedings of the forty-fourth annual meeting of the North American chapter of the international group for the psychology of mathematics education (pp. 100–104). Middle Tennessee State University. [Google Scholar]
  11. Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2021a). Classroom observation and mathematics education research. Journal of Mathematics Teacher Education, 24, 5–31. [Google Scholar] [CrossRef]
  12. Bostic, J., Matney, G., & Sondergeld, T. (2019b). A lens on teachers’ promotion of the Standards for Mathematical Practice. Investigations in Mathematics Learning, 11(1), 69–82. [Google Scholar] [CrossRef]
  13. Bostic, J., May, T., Folger, T., Matney, G., Koskey, K., & Stone, G. (2024, July 7–14). Applying empathic principles to mathematics education assessment. International Council on Mathematics Education, Sydney, Australia. [Google Scholar]
  14. Bostic, J., & Sondergeld, T. (2015). Measuring sixth-grade students’ problem solving: Validating an instrument addressing the mathematics common core. School Science and Mathematics Journal, 115, 281–291. [Google Scholar] [CrossRef]
  15. Bostic, J., Sondergeld, T., Folger, T., & Kruse, L. (2017). PSM7 and PSM8: Validating two problem-solving measures. Journal of Applied Measurement, 18(2), 151–162. [Google Scholar]
  16. Bostic, J., Sondergeld, T., Matney, G., Stone, G., & Hicks, T. (2021b). Gathering response process data for a problem-solving measure through whole-class think alouds. Applied Measurement in Education, 34(1), 46–60. [Google Scholar] [CrossRef]
  17. Braat, M., Engelen, J., van Gemert, T., & Verhaegh, S. (2020). The rise and fall of behaviorism: The narrative and the numbers. History of Psychology, 23(3), 252. [Google Scholar] [CrossRef]
  18. Carney, M., Bostic, J., Krupa, E., & Shih, J. (2022). Interpretation and use statements for instruments in mathematics education. Journal for Research in Mathematics Education, 53(4), 334–340. [Google Scholar] [CrossRef]
  19. Carney, M. B., Cavey, L., & Hughes, G. (2017). Assessing teacher attentiveness to student mathematical thinking: Validity claims and evidence. The Elementary School Journal, 118(2), 281–309. [Google Scholar] [CrossRef]
  20. Confrey, J., Toutkoushian, E., & Shah, M. (2019). A validation argument from soup to nuts: Assessing progress on learning trajectories for middle-school mathematics. Applied Measurement in Education, 32(1), 23–42. [Google Scholar] [CrossRef]
  21. Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer, & H. Braun (Eds.), Test validity (pp. 3–17). Erlbaum. [Google Scholar]
  22. Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research process. Sage. [Google Scholar]
  23. Cruz, M. L., Saunders-Smits, G. N., & Groen, P. (2020). Evaluation of competency methods in engineering education. European Journal of Engineering Education, 45(5), 729–757. [Google Scholar] [CrossRef]
  24. Decker, A., & McGill, M. M. (2019). A topical review of evaluation instruments for computing education. In Proceedings of the 50th ACM technical symposium on computer science education (pp. 558–564). ACM. [Google Scholar]
  25. Delandshere, G. (2002). Assessment as inquiry. Teachers College Record, 104(7), 1461–1484. [Google Scholar] [CrossRef]
  26. Edelen, D., Cook, K., Tripp, L. O., Jackson, C., Bush, S. B., Mohr-Schroeder, M. J., Schroeder, D. C., Roberts, T., Maiorca, C., Ivy, J., Burton, M., & Perrin, A. (2024). “No, this is not my boyfriend’s computer”: Elevating the voices of youth in STEM education research leveraging photo-elicitation. Journal for STEM Education Research, 7(3), 444–462. [Google Scholar] [CrossRef]
  27. Edelen, D., Roberts, O., Bostic, J., Young, J., Lin, K.-Y., & Roberts, A. (accepted). Methodological approaches that capture the nuances of STEM Education Research. In C. Johnson, M. Mohr-Schroeder, T. Moore, L. English, & C. Jackson (Eds.), Handbook of research on STEM education. Routledge. [Google Scholar]
  28. European Institute of Innovation & Technology. (2025). EIT launches €70 million call to boost STEM innovation and strengthen university cooperation. Available online: https://www.eit.europa.eu/news-events/news/eit-launches-eu70-million-call-boost-stem-innovation-and-strengthen-university (accessed on 2 March 2026).
  29. Fegely, A., Winslow, J., Lee, C. Y., & Setari, A. P. (2023). EdTech Align: A valid and reliable instrument for measuring teachers’ EdTech competencies aligned to professional standards. TechTrends. [Google Scholar] [CrossRef]
  30. Folger, T., Bostic, J., & Krupa, E. (2023). Defining test-score interpretation, use, and claims: Delphi study for the validity argument. Educational Measurement: Issues & Practice, 42(3), 22–38. [Google Scholar] [CrossRef]
  31. Gallagher, M. A., Folger, T. D., Walkowiak, T. A., Wilhelm, A. G., & Zelkowski, J. (2025). Measuring mathematics teaching quality: The state of the field and a call for the future. Education Sciences, 15(9), 1158. [Google Scholar] [CrossRef]
  32. Garcia, N. M., López, N., & Vélez, V. N. (2018). QuantCrit: Rectifying quantitative methods through critical race theory. Race Ethnicity and Education, 21(2), 149–157. [Google Scholar] [CrossRef]
  33. Ghosh, R., & May, T. A. (2025). Validation of identity-based mentoring scales for undergraduate minoritized students in engineering. Journal of College Student Development, 66(4), 454–462. [Google Scholar] [CrossRef]
  34. Glaser, B. G. (1965). The constant comparative method of qualitative analysis. Social Problems, 12(4), 436–445. [Google Scholar] [CrossRef]
  35. Herber, O. R., Bradbury-Jones, C., Okpokiri, C., & Taylor, J. (2025). Epistemologies, methodologies and theories used in qualitative global north health and social care research: A scoping review protocol. BMJ Open, 15, e100494. [Google Scholar] [CrossRef]
  36. Hill, H. C., & Shih, J. C. (2009). Research commentary: Examining the quality of statistical mathematics education research. Journal for Research in Mathematics Education, 40(3), 241–250. [Google Scholar] [CrossRef]
  37. H.R.4346—117th congress (2021–2022): CHIPS and science act. (2022). Available online: https://www.congress.gov/bill/117th-congress/house-bill/4346 (accessed on 17 March 2026).
  38. Ing, M., Kosko, K. W., Jong, C., & Shih, J. C. (2024). Validity evidence of the use of quantitative measures of students in elementary mathematics education. School Science and Mathematics, 124(6), 411–423. [Google Scholar] [CrossRef]
  39. Jackson, C., Mohr-Schroeder, M. J., Bush, S. B., Maiorca, C., Roberts, T., Yost, C., & Fowler, A. (2021). Equity-oriented conceptual framework for k-12 stem literacy. International Journal of STEM Education, 8, 38. [Google Scholar] [CrossRef]
  40. Jacobson, E., & Borowski, R. (2019). Measure validation as a research methodology for mathematics education. In J. Bostic, E. Krupa, & J. Shih (Eds.), Assessment in mathematics education contexts: Theoretical frameworks and new directions (pp. 40–62). Routledge. [Google Scholar]
  41. Jonson, J. L., & Geisinger, K. F. (2020). Fairness in educational and psychological testing. Examining theoretical, research, practice, and policy implications of the 2014 Standards. American Educational Research Association. [Google Scholar]
  42. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. [Google Scholar] [CrossRef]
  43. Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. [Google Scholar]
  44. Kane, M. T. (2021). Articulating a validity argument. In G. Fulcher, & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed., pp. 32–47). Routledge. [Google Scholar]
  45. Kelley, T. R., & Knowles, J. G. (2016). A conceptual framework for integrated STEM education. International Journal of STEM Education, 3(1), 11. [Google Scholar] [CrossRef]
  46. Kelly, A. E. (2004). Yes, but is it methodological? Journal of the Learning Sciences, 13(1), 115–128. [Google Scholar] [CrossRef]
  47. Kosko, K. W. (2017). Reconsidering the role of disembedding in multiplicative concepts: Extending theory from the process of developing a quantitative measure. Investigations in Mathematics Learning, 10(1), 54–65. [Google Scholar] [CrossRef]
  48. Kosko, K. W. (2019). A multiplicative reasoning assessment for fourth and fifth grade students. Studies in Educational Evaluation, 60, 32–42. [Google Scholar] [CrossRef]
  49. Krupa, E., Bostic, J., Folger, T., & Burkett, K. (2024). Introducing a repository of quantitative measures used in mathematics education. In K. Kosko, J. Caniglia, S. Courtney, M. Zolfaghari, & G. Morris (Eds.), Proceedings of the 46th annual meeting of the North American chapter of the international group for the psychology of mathematics education (pp. 55–64). International Group for the Psychology of Mathematics Education. [Google Scholar]
  50. Krupa, E. E., Bostic, J. D., Bentley, B., Folger, T., Burkett, K. E., & VM²ED Community. (2024). Search. VM²ED Repository. [Google Scholar]
  51. Lavery, M. R., Bostic, J. D., Kruse, L., Krupa, E. E., & Carney, M. B. (2020). Argumentation surrounding argument-based validation: A systematic review of validation methodology in peer-reviewed articles. Educational Measurement: Issues and Practice, 39(4), 116–130. [Google Scholar] [CrossRef]
  52. Lawson, B., & Bostic, J. (2024). An investigation into two mathematics score reports: Problem solving measure (PSM) and measurement of academic progress (MAP). Mid-Western Educational Researcher, 36(1), 12. [Google Scholar] [CrossRef]
  53. Lesh, R., & Doerr, H. M. (2003). Foundations of a models and modeling perspective on mathematics teaching, learning, and problem solving. In Beyond constructivism (pp. 3–33). Routledge. [Google Scholar]
  54. Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Sage. [Google Scholar]
  55. Maric, M., Glibetic, M., & Milinkovic, D. (2023). Measurement in STEM education research: A systematic literature review of trends in the psychometric evidence of scales. International Journal of STEM Education, 10(1), 39. [Google Scholar] [CrossRef]
  56. May, T. A., Fan, Y. K., Stone, G. E., Koskey, K. L. K., Sondergeld, C. J., Folger, T. D., Archer, J. N., Provinzano, K., & Johnson, C. C. (2025a). An effectiveness study of generative artificial intelligence tools used to develop multiple-choice test items. Education Sciences, 15(2), 144. [Google Scholar] [CrossRef]
  57. May, T. A., Johnson, C. C., Harold, S., & Walton, J. B. (2025b). The development and validation of a K-12 STEM engagement participant outcome instrument. Education Sciences, 15(3), 377. [Google Scholar] [CrossRef]
  58. Mayer, R. E. (2024). The past, present, and future of the cognitive theory of multimedia learning. Educational Psychology Review, 36(1), 8. [Google Scholar] [CrossRef]
  59. McMillan, J. H. (2026). Classroom assessment validation: Proficiency claims and uses. Educational Measurement: Issues and Practice, 45, E70014. [Google Scholar] [CrossRef]
  60. Merriam-Webster. (n.d.). Fairness. In Merriam-webster.com dictionary. Merriam-Webster. Available online: https://www.merriam-webster.com/dictionary/equity (accessed on 15 October 2025).
  61. Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. [Google Scholar] [CrossRef]
  62. Minner, D., Martinez, A., & Freeman, B. (2012). Compendium of research instruments for STEM education. Part 1 & 2. Abt Associates; Education Development Center; CADRE website. [Google Scholar]
  63. National Research Council. (2001). Adding it up: Helping children learn mathematics. The National Academies Press. [Google Scholar]
  64. National Research Council. (2014). Developing assessments for the next generation science standards. The National Academies Press. [Google Scholar]
  65. Paul, A., Fakiyesi, V., Tahsin, M. U., Arinze, L. C., Moyaki, D., & Dunmoye, I. (2026). Exploring the impact of generative AI in engineering education: A scoping review of applications and innovations. In D. May, & M. E. Auer (Eds.), 2025 Yearbook emerging technologies in learning. Learning and analytics in intelligent systems (Vol. 59). Springer. [Google Scholar]
  66. Pellegrino, J., DiBello, L., & Goldman, S. (2016). A framework for conceptualizing and evaluating the validity of instructionally relevant assessments. Educational Psychologist, 51(1), 59–81. [Google Scholar] [CrossRef]
  67. Postma, C. E., Zwartkruis-Pelgrim, E., Daemen, E., & Du, J. (2012). Challenges of doing empathic design: Experiences from industry. International Journal of Design, 6(1), 59–70. [Google Scholar]
  68. Reeves, T. D., & Marbach-Ad, G. (2016). Contemporary test validity in theory and practice: A primer for discipline-based education researchers. CBE—Life Sciences Education, 15(1), rm1. [Google Scholar] [CrossRef]
  69. Riedl, M. O. (2019). Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1), 33–36. [Google Scholar] [CrossRef]
  70. Roberts, T., Jackson, C., Mohr-Schroeder, M. J., Bush, S. B., Maiorca, C., Cavalcanti, M., Schroeder, D. C., Delaney, A., Putnam, L., & Cremeans, C. (2018). Students’ perceptions of STEM learning after participating in a summer informal learning experience. International Journal of STEM Education, 5(1), 35. [Google Scholar] [CrossRef]
  71. Roberts, T., Maiorca, C., Jackson, C., & Mohr-Schroeder, M. (2022). Integrated STEM as problem-solving practices. Investigations in Mathematics Learning, 14(1), 1–13. [Google Scholar] [CrossRef]
  72. Sablan, J. R. (2019). Can you really measure that? Combining critical race theory and quantitative methods. American Educational Research Journal, 56(1), 178–203. [Google Scholar] [CrossRef]
  73. Schilling, S. G., & Hill, H. C. (2007). Assessing measures of mathematical knowledge for teaching: A validity argument approach. Measurement: Interdisciplinary Research and Perspectives, 5(2–3), 70–80. [Google Scholar] [CrossRef]
  74. Schoenfeld, A. H., & Arcavi, A. (1988). On the meaning of variable. Mathematics Teacher, 81(6), 420–427. [Google Scholar] [CrossRef]
  75. Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. [Google Scholar] [CrossRef]
  76. Sireci, S. G. (2013). Agreeing on validity arguments. Journal of Educational Measurement, 50(1), 99–104. [Google Scholar] [CrossRef]
  77. Sireci, S. G., Suárez-Álvarez, J., Zenisky, A. L., & Oliveri, M. E. (2024). Evolving educational testing to meet students’ needs: Design-in-real-time assessment. Educational Measurement: Issues and Practice, 43, 112–118. [Google Scholar] [CrossRef]
  78. Sondergeld, T. A. (2020). Shifting sights on STEM education quantitative instrumentation development: The importance of moving validity evidence to the forefront rather than a footnote. School Science and Mathematics Journal, 120, 259–261. [Google Scholar] [CrossRef]
  79. Stylianides, A. J. (2007). Proof and proving in school mathematics. Journal for Research in Mathematics Education, 38(3), 289–321. [Google Scholar]
  80. Stylianides, G., Stylianides, A. J., & Weber, K. (2019). Research on the teaching and learning of proof: Taking stock and moving forward. In J. Cai (Ed.), Handbook for research in mathematics education (pp. 237–266). National Council of Teachers of Mathematics. [Google Scholar]
  81. Toulmin, S. E. (2003). The uses of argument. Cambridge University Press. [Google Scholar]
  82. U.S. President. (2025). Advancing artificial intelligence education for American youth (EO 14277). Federal register, 90 FR 17519. Available online: https://www.govinfo.gov/app/details/FR-2025-04-28/2025-07368 (accessed on 2 March 2026).
  83. Vygotsky, L. (1997). The collected works of L.S. Vygotsky, Vol. 4: The history of the development of higher mental functions (M. Hall, Trans.; R. Rieber, Ed.). Plenum. [Google Scholar]
  84. Walkowiak, T., Adams, E. R., & Berry, R. Q. (2019). Validity arguments for instruments that measure mathematics teaching practice: Comparing the M-SCAN and IPL-M. In J. Bostic, E. Krupa, & J. Shih (Eds.), Assessment in mathematics education contexts: Theoretical frameworks and new directions (pp. 90–119). Routledge. [Google Scholar]
  85. Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716–730. [Google Scholar] [CrossRef]
  86. Wilson, M. (2023). Constructing measures: An item response modeling approach (2nd ed.). Routledge. [Google Scholar] [CrossRef]
  87. Wilson, M., & Mari, L. (2026). Mapping out the hexagon measurement framework as a blueprint underlying measurement in the human sciences. Journal of Educational Measurement, 63(1), e70036. [Google Scholar] [CrossRef]
  88. Wilson, M., & Wilmot, D. (2019). Gathering validity evidence using the BEAR assessment system (BAS): A mathematics assessment perspective. In J. Bostic, E. Krupa, & J. Shih (Eds.), Assessment in mathematics education contexts: Theoretical frameworks and new directions (pp. 63–89). Routledge. [Google Scholar]
  89. Wittgenstein, L. (1958). Philosophical investigations. Basil Blackwell, Ltd. [Google Scholar]
  90. Zhang, Y., Chen, Y., Kan, Z. C., & Xia, S. Y. (2026). The landscape of STEM education in China: Policies, practices, and pathways for integration. Research in Integrated STEM Education, 1, 1–43. [Google Scholar]
  91. Ziebarth, S., Fonger, N., & Kratky, J. (2014). Instruments for studying the enacted mathematics curriculum. In D. Thompson, & Z. Usiskin (Eds.), Enacted mathematics curriculum: A conceptual framework and needs (pp. 97–120). Information Age Publishing. [Google Scholar]
  92. Zolfaghari, M., Austin, C. K., & Kosko, K. W. (2021). Exploring teachers’ pedagogical content knowledge of teaching fractions. Investigations in Mathematics Learning, 13(3), 230–248. [Google Scholar] [CrossRef]
  93. Zolfaghari, M., Kosko, K., & Austin, C. (2024). Toward a better understanding of the nature of pedagogical content knowledge for fractions: The role of experience. Investigations in Mathematics Learning, 16(4), 263–280. [Google Scholar] [CrossRef]
  94. Zoltowski, C. B., Oakes, W. C., & Cardella, M. E. (2012). Students’ ways of experiencing human-centered design. Journal of Engineering Education, 101(1), 28–59. [Google Scholar] [CrossRef]
  95. Zumbo, B. (2014). What role does, and should, the test standards play outside of the United States of America? Educational Measurement: Issues and Practice, 33(4), 31–33. [Google Scholar] [CrossRef]
  96. Zumbo, B. D., & Hubley, A. M. (2016). Bringing consequences and side effects of testing and assessment to the foreground. Assessment in Education: Principles, Policy & Practice, 23(2), 299–303. [Google Scholar] [CrossRef]
Figure 1. Relationships influencing a study’s methods.
Figure 2. Rope as a metaphor for five sources of validity.
Figure 3. Overview of guideposts for validation as a methodology.
Figure 4. Common methods and approaches for gathering validity evidence.
Figure 5. Comprehensive test validation process.
Table 1. Description of the five sources of validity.

Test Content: Test content includes the wording and format of test items or tasks. Validity evidence based on test content would indicate that test items, or test content, align to the construct a test intends to measure.
Response Processes: Response processes describes the alignment between test takers’ performance or behavior and the construct a test intends to measure. In cases when a test relies on observers or judges to evaluate test takers, evidence may include “the extent to which the processes of observers or judges are consistent with the intended interpretation of scores” (American Educational Research Association et al., 2014, p. 15).
Internal Structure: Internal structure may indicate the degree to which test items conform to the construct a test intends to measure. Such evidence may be collected through analysis of test dimensionality and item interrelationships.
Relations to Other Variables: Relations to other variables examines the degree to which test scores are, or are not, related to some ancillary variable. The Standards describe several examples when relations to other variables may be of interest, such as: (a) hypothesized differences in group performance, (b) the degree to which test scores predict future performance, and (c) whether test scores from different tests measuring a similar construct produce a convergent association.
Consequences of Testing: Consequences of testing presents the intended and unintended consequences following the interpretation and use of test scores. Consequential evidence evaluates “the soundness of [test score] interpretations for their intended uses” (American Educational Research Association et al., 2014, p. 19). Unintended consequences warrant close examination, and consequential evidence may anticipate and proactively address unintended consequences.