Article

A Usability Analysis and Consequences of Testing Exploration of the Problem-Solving Measures–Computer-Adaptive Test

by Sophie Grace King 1, Jonathan David Bostic 1,*, Toni A. May 2 and Gregory E. Stone 3

1 School of Inclusive Teacher Education, Bowling Green State University, Bowling Green, OH 43403, USA
2 College of Community and Public Affairs, Binghamton University, Binghamton, NY 13902, USA
3 Clarity Assessment Systems Ltd., Oswego, NY 13126, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(6), 680; https://doi.org/10.3390/educsci15060680
Submission received: 28 February 2025 / Revised: 15 May 2025 / Accepted: 16 May 2025 / Published: 30 May 2025

Abstract

Testing is a part of education around the world; however, there are concerns that consequences of testing are underexplored within current educational scholarship. Moreover, usability studies are rare within education. One aim of the present study was to explore the usability of a mathematics problem-solving test, the Problem-Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades six to eight (ages 11–14). The second aim of this mixed-methods research was to unpack consequences of testing validity evidence related to the results and test interpretations, leveraging the voices of participants. A purposeful, representative sample of over 1000 students from rural, suburban, and urban districts across the USA was administered the PSM-CAT followed by a survey. Approximately 100 of those students were interviewed following test administration. Findings indicated that (1) participants engaged with the PSM-CAT as desired and found it highly usable (e.g., most respondents were able to locate and use the calculator, and several students commented that they engaged with the test as intended) and (2) the benefits from testing largely outweighed any negative outcomes (e.g., 92% of students interviewed had positive attitudes towards the testing experience), which in turn supports consequences of testing validity evidence for the PSM-CAT. This study provides an example of a usability study for educational testing and builds upon previous calls for greater consequences of testing research.

Graphical Abstract

1. Introduction

Educators need access to comprehensive, valid information about their students’ mathematics learning. In turn, educators should make data-based decisions using valid results from high-quality assessments (Lawson & Bostic, 2024; Fennell et al., 2023). Assessment in this study refers to the activities that teachers and others use to gather student data, as well as the activities that provide teachers with feedback for modifying their teaching and student learning (Fennell et al., 2023). For instance, in-the-moment questioning can draw out students’ knowledge during classroom instruction, which informs teachers about what and how students are learning (Fennell et al., 2023). More formal assessments such as quizzes and tests also have the potential to provide necessary information about students’ learning (Fennell et al., 2023). A distinction is sometimes made between the terms test and assessment because an assessment can encompass broader sources of information than a single test instrument (AERA et al., 2014); however, in this manuscript, both terms will be used interchangeably to promote readability for a broad audience. Readers interested in the language differences and nuances are encouraged to consult the Standards for Educational and Psychological Testing ([Standards], AERA et al., 2014).
Schools around the world use tests to gather information about students’ mathematics performance. However, a test’s usability, that is, the degree to which a respondent engages with it in an intended manner, can affect a respondent’s results (Estrada-Molina et al., 2022). Further, negative consequences from engaging with a test may also unnecessarily impact the respondent and produce greater test score variance (AERA et al., 2014; Lane, 2020).
The present study explores the consequences of testing related to a mathematics problem-solving test as well as its usability with the intended population: students in grades six to eight (ages 11–14). A purposeful, representative sample (Creswell, 2014) of students from the population participated in surveys and interviews immediately following test administration. One goal of this study was to disseminate findings about the consequences of testing and the usability of a mathematics problem-solving test that is grounded in classroom standards. A second goal was to provide educational researchers with a model of how to conduct a usability study.

2. Literature Review

2.1. Educational Policies: An Overview

The No Child Left Behind Act of 2001 (NCLB) was designed to improve the performance of students enrolled in public schools in the United States of America (USA). One key feature of NCLB was accountability for student achievement, which initially required each state to develop standardized tests in reading, mathematics, and later in science, to be administered annually in grades three–eight and once in high school. This act reflected broader USA policies on educational standards and the use of standardized test data. While standardized testing is commonplace in the USA, other countries also implement standardized tests: Japan’s National Assessment of Academic Ability (Hino & Ginshima, 2019), South Africa’s Annual National Assessment (Maphalala & Khumalo, 2018), Germany’s Abitur examination (Bruder, 2021), Mexico’s Plan Nacional para la Evaluación de los Aprendizajes (Céspedes-González et al., 2023), and Israel’s Bagrut Matriculation Exams (Naveh, 2004). Beyond standardized tests, another typical classroom assessment is the end-of-unit exam, administered by teachers at the close of a unit or course to measure the degree to which students have mastered the material taught. These are two types of summative assessments. Summative assessments are tools to gather data about how much has been learned and/or whether an individual has reached a desired level of proficiency (A. H. Schoenfeld, 2015). Formative assessment includes “all those activities undertaken by teachers and/or by their students, which provide information to be used as feedback to modify the teaching and learning activities in which they are engaged” (Black & Wiliam, 2010, p. 7). One type of formative classroom assessment is a progress monitoring test. Progress monitoring tests communicate what students know and may provide teachers with information they can use to plan further instruction. Taken collectively, these types of assessments can be framed around the notions of summative and formative assessment.
All assessments have consequences and effects on teachers and students (McGatha & Bush, 2013). Positive consequences of both formative and summative assessments include but are not limited to improvements in teacher and student motivation and effort; better content and format of assessments; and advancements in the use and nature of test preparation activities (Lane & Stone, 2002). There can also be negative consequences: narrowing of curricula and instruction; use of unethical test preparation materials; unfair test score use; reassignment of teachers or students based on a single data point; and decreases in student and teacher confidence, motivation, and/or self-esteem (Lane & Stone, 2002). Evidence related to consequences of testing should be explored for any test, including both formative and summative assessments (AERA et al., 2014; Bostic, 2023; Kane, 2006; Lane, 2014).

2.2. Computer-Adaptive Tests (CATs): The Transition from PSM to PSM-CAT

The PSM test measures middle school (i.e., grades six through eight; ages 11–14) students’ mathematical problem-solving performance in ways that leverage their understanding of grade-level mathematics content, as derived from classroom standards. Classroom content standards may differ to some degree across states within the USA; however, 36 states maintain features originally found in the Common Core State Standards for Mathematics (CCSSM) that were implemented in 2011 (CCSSI, 2010). This PSM test functions primarily as a formative assessment. It may be used to gather information about students’ problem-solving performance while drawing on their mathematics knowledge related to classroom content standards, and in turn inform teachers’ future mathematics instruction.
Computer-adaptive tests (CATs) are designed to improve measurement efficiency by using a smaller number of ability-targeted items than traditional paper-pencil tests or other static tests (Martin & Lazendic, 2018). Static tests require all students to engage with a defined set of identical or nearly identical items (Wainer & Lewis, 1990). In order to effectively measure all students with a fixed set of items, test makers generally include a range of items, running from less to more difficult. Because the items are consistent across all students, regardless of ability, the desired level of measurement error (precision) is only achievable after a large number of items are administered. Items whose difficulty closely matches a student’s ability reduce measurement error quickly and are considered “well-targeted”. Items that are easier or more difficult than a student’s ability level reduce measurement error more slowly and are considered “less well-targeted”. A substantial number of items on static tests are therefore required, not necessarily to ensure content coverage but rather, psychometrically, to ensure that no matter what ability a student may possess, the test administration can generate an accurate measurement of student content mastery (Martin & Lazendic, 2018).
CATs accomplish the task of reaching measurement precision with far fewer items than traditional static paper-pencil tests because items are typically well-targeted to a student’s specific level of ability (Martin & Lazendic, 2018). In a CAT environment, students are first administered an item of moderate difficulty. Subsequent items are then selected based on a student’s response pattern (Davey, 2011). If a student answers correctly, then the next item delivered is more difficult (see Figure 1). Conversely, when a student answers incorrectly, the next item delivered is less difficult. Therefore, when effectively developed, CATs can efficiently zero in on (target) a student’s particular capacity (ability) in a content area by delivering items that offer increased information about each student (Lane, 2020).
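To make this adaptive logic concrete, the sketch below simulates a basic item-selection loop under a simple Rasch (one-parameter logistic) model. It is a minimal illustration of the general CAT principle described above, not the PSM-CAT’s operational algorithm; the item bank, starting ability, and fixed step-size update rule are assumptions made for exposition.

```python
import math
import random

def p_correct(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def next_item(item_bank, ability, administered):
    """Select the unadministered item whose difficulty is closest to the
    current ability estimate, i.e., the best-targeted item."""
    candidates = [name for name in item_bank if name not in administered]
    return min(candidates, key=lambda name: abs(item_bank[name] - ability))

def run_cat(item_bank, true_ability, max_items=10, step=0.7):
    ability = 0.0          # begin near an item of moderate difficulty
    administered = set()
    for _ in range(max_items):
        item = next_item(item_bank, ability, administered)
        administered.add(item)
        correct = random.random() < p_correct(true_ability, item_bank[item])
        # Harder items follow correct answers; easier items follow incorrect ones.
        ability += step if correct else -step
    return ability

# Hypothetical item bank: item name -> difficulty (in logits).
bank = {f"item{k}": d for k, d in
        enumerate([-2.0, -1.2, -0.5, 0.0, 0.4, 0.9, 1.5, 2.1, 2.8, 3.5])}
print(round(run_cat(bank, true_ability=1.0), 2))
```

Operational CATs typically replace the fixed step with a maximum-likelihood or Bayesian ability update and add content-balancing and item-exposure controls; the up-and-down movement shown here is simply the behavior a respondent experiences.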
CATs have been used in mathematics contexts since the 1990s (Davey, 2011). They provide a mechanism to better target students’ mathematics abilities with shorter testing durations compared to static tests. There are numerous examples of CATs used in mathematics contexts with many coming between 1990 and 2010 during a period of rapid development. We share a recent example of mathematics-focused CAT use to demonstrate its continued appropriateness during a period of time when artificial intelligence, machine learning, and natural language processing are becoming more popular. Uko et al. (2024) created a mathematics and science CAT for Nigerian secondary students. One conclusion was that their CAT for mathematics and science contexts provided an accurate and efficient means to assess secondary students’ abilities. Thus, they recommend that “CAT should be introduced in assessment of learning…of the students [and] to be tested with sufficient accuracy” (p. 85).
In the case of the PSM-CAT, delivery was made via a web-based browser directly in the classroom. Students were able to use an online calculator and formula sheet, which were located on the testing platform. In addition, they could use scratch paper and a writing utensil. The test had a time limit (30 min), and students could work at their own pace with no maximum number of items to complete. Thus, there was some freedom in the testing experience that might differ from a fixed-length test. Classroom tests like the PSM-CAT are no different from other educational tests in the sense that it is essential that test scores are interpreted and used in an appropriate way (AERA et al., 2014), which implicates test developers, users, and administrators. Much like other CATs, respondents are provided more difficult items as they answer items correctly. Conversely, they see less difficult items after answering an item incorrectly. It is plausible, though rare in practice, for a student engaged with the PSM-CAT to see an item from a different grade level based upon their response pattern.

2.3. Validity

Validity concerns the degree to which an assessment accurately captures the intended construct or phenomenon being studied (Bostic, 2023; Kane, 2013). Validity is defined as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores” (Messick, 1989, p. 13). Validation “can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use” (AERA et al., 2014, p. 11). It is a process by which scholars gather evidence related to an assessment to better assist others in evaluating the degree to which it measures what it purports to measure (Bostic, 2023; Carney et al., 2022; Kane, 2013).
Five sources of validity evidence are described in the Standards (AERA et al., 2014): test content, response processes, internal structure, relations to other variables, and consequences of testing. Figure 2 provides a description of each source of validity evidence. The metaphor of a rope with intertwined strands helps to convey that validity is a unitary concept: each strand represents a different source of evidence, and the strands work together to ground a validity argument within the construct. Past research (e.g., Krupa et al., 2024) has demonstrated that consequences of testing evidence is underexplored and rarely reported, especially in mathematics education assessment research. Stated more concretely, Krupa et al.’s (2024) research synthesis found that of the 1206 articles describing mathematics assessments between 2000 and 2020, only 61 reported consequences of testing validity evidence, approximately 5% of the reviewed sample.

2.4. Consequences of Testing

Consequences of testing include both positive and negative outcomes and should be purposefully evaluated for any test (Lane, 2020; Sireci & Benitez, 2023). Some positive consequences include, but are not restricted to, improved student, teacher, and administrator motivation and effort; the use of assessment results to inform teacher instruction and student learning; and improvements or changes to school courses (Lane, 2020). There are also potentially negative effects that may arise from administering an assessment, including narrowing of curricula, unfair use of questions, and decreased confidence and/or affect (Lane, 2020). Taken collectively, consequences of testing validity evidence helps others to understand potential outcomes that may arise from administering a test and using its results.
Consequences of a test can include substantive outcomes such as student placement into a class or advancement to a grade level (AERA et al., 2014; Lee, 2020). Lee (2020) focused on the test consequences of an English reading course placement test (e.g., advanced, remedial, general) that was administered at a large university. Lee talked to students about the effects this assessment had on their learning and attitudes. One-on-one interviews were conducted to draw out perceptions about the test and respondents’ experiences. The results showed a nearly even split between positive and negative perceptions of the test. There were good attributes related to the test, but nearly half of the respondents felt there were negative attributes to the test’s results and ensuing interpretations. Similarly, three-fourths of the respondents communicated that the test questions and reading passages were too complex, conveying that a consequence of using the results may be linked with students’ negative perceptions of their testing experience. This is a high proportion of respondents concerned with negative consequences. One outcome of Lee’s work was a call for further study of that test to ensure that its results and their interpretations are justified by robust consequences of testing validity evidence. Lee’s study also provides a guide for data collection through the use of interviews. A second outcome is that Lee’s work demonstrates the importance of investigating the degree to which a test has negative and positive outcomes. More positive outcomes than negative outcomes, or at least a balance, is desirable (AERA et al., 2014). Greater negative outcomes than positive outcomes warrant concern and should be considered before test administration (AERA et al., 2014).
Consequences of a test can also be explored with content-focused tests administered by teachers. Heissel et al. (2021) designed a study with third- to eighth-grade students in which they measured cortisol levels at various times during the day, including during a test. Cortisol is a hormone released by the human body in response to stress, such as during a test. Samples of cortisol were collected to gauge students’ physiological stress response during test periods and how it correlated with their test performance (Heissel et al., 2021). Researchers found that during the test, students had higher cortisol levels than during the rest of the school day. Students’ negative testing experiences were linked with their test scores. That is, higher cortisol levels, like those during the test, were associated with lower test performance. If a goal is to effectively and fairly assess students’ knowledge, then it is critical that testing situations limit anxiety or stress that might contribute variance (error) to a test score. In addition to consequences of testing, evaluating assessment usability features is critically important.

2.5. Assessment Usability Features

Any K-12 assessment designed for classroom use should be usable by students and teachers. While that might sound simple, a usability study can explore the degree to which respondents understand the questions on the assessment, the ways respondents engage with the test, and whether test administrators (e.g., teachers and staff) perceive that the test can produce robust information usable for data-based decisions (AERA et al., 2014; Estrada-Molina et al., 2022). Interpretation and use statements for quantitative instruments are helpful when considering an assessment (Carney et al., 2022; Kane, 2013). Carney et al. (2022) explain that usability features are centrally important for test administrators and developers because they communicate necessary information. The degree to which users perceive features of the test as easy to locate and understand ultimately impacts test usability. For example, if respondents are supposed to use a calculator embedded in a CAT but they cannot easily locate it on the website, then there may be usability issues that negatively impact student performance and/or consequences of testing. Understanding the usability features of a test before wide administration also provides structure and support for deciding whether the test has the potential to represent student knowledge accurately. Additionally, usability testing helps to identify areas where the assessment or its delivery may need modification. Accordingly, exploring assessment usability during pilot testing is essential to minimize the impact of conditions that may threaten the validity of test results (AERA et al., 2014).
Research related to usability is often conducted in the healthcare field (e.g., Denecke et al., 2021; Hudson et al., 2012; Saad et al., 2022; Thielemans et al., 2018) but far less frequently within educational research. A thorough literature review on usability revealed no such studies in mathematics education and only a limited number in education related to the usability of educational tests. This outcome led our team to draw from the healthcare literature, where usability studies are abundant. One healthcare usability study created a mobile mental health chatbot for regulating emotions. This chatbot was operated by multiple users, and researchers conducted a usability test (i.e., the User Experience Questionnaire) to study users’ experiences with it (Denecke et al., 2021). A 26-question survey with closed-ended and open-ended items was administered to gather participants’ experiences with the chatbot. Denecke and colleagues found that participants rated the chatbot as understandable, easy to learn, and clear. However, attractiveness, novelty, and dependability were scored as below average. Usability results, like those from a survey, allow for judgments, comparisons, and evidence-informed modifications to be made to the tool under study.
Another healthcare assessment usability study employed direct observation, focus groups, and questionnaires to understand a test’s usability (Thielemans et al., 2018). Thielemans and colleagues’ work outlines how to study the usability of a healthcare assessment through focus group discussions (FGDs) and a self-administered questionnaire. Participants used a device and were asked to critique features. They engaged in a usability assessment with a mixed-methods approach including observations, surveys, and focus groups. The present study employed a convergent mixed-methods approach (Creswell & Plano Clark, 2018) similar to Thielemans et al. (2018), and builds upon usability studies harnessing interviews and surveys (Denecke et al., 2021).

3. Materials and Methods

3.1. The Present Survey

Usability studies can be conducted to explore how students understand a test’s directions and questions, and whether students perceive the features of the test as easy to use and as fostering positive outcomes after testing (AERA et al., 2014; Estrada-Molina et al., 2022). If students cannot easily access the resources, or if the questions they see are not appropriate for them, then the assessment’s results will not accurately characterize students’ performance. Similarly, it is necessary to evaluate consequences of testing validity evidence to confirm that the benefits outweigh the negative outcomes.
The original version of the PSM is paper-pencil and delivered in a static format (see Bostic & Sondergeld, 2015, 2018; Bostic et al., 2017); however, the version investigated in this study is delivered in a computer-adaptive format (i.e., the PSM-CAT; Bostic et al., 2024). Administration of the paper-pencil version of the PSM test has been ongoing since 2019; meanwhile, PSM-CAT development started in 2021 and the first administration was in 2024.
The PSM-CAT is grounded in mathematical problems and problem-solving frameworks. The items align with A. Schoenfeld’s (2011) framework characterizing mathematical problems as tasks for which the number of solutions is uncertain, the pathway to a solution (i.e., strategy) is unclear, and there are multiple pathways to a solution. These are word problems; therefore, we also address Verschaffel et al.’s (1999) framework for mathematical word problems. That is, PSM-CAT items are also grounded as being open, complex, and realistic. Open tasks can be solved using multiple developmentally appropriate strategies. Complex tasks engage problem solvers in ways that cause them to think, pause, and reflect. Realistic tasks draw upon experienced or believable situational knowledge as a key part of the task. Finally, the PSM-CAT is grounded in Lesh and Zawojewski’s (2007) framing of mathematical problem solving. Taken collectively, these frameworks ground the PSM and PSM-CAT.
A concern with any test is its content validity evidence. The PSM-CAT uses the CCSSM’s Standards for Mathematics Content as a content blueprint. Content domains within grades six–eight vary somewhat; however, they include domains such as Geometry, Number Sense, Expressions and Equations, and Statistics and Probability. Some PSM-CAT items have a primary and secondary standard. Readers interested in test content validity evidence should consult Bostic et al. (2024). Three sample items have been released from the test, which are shared in Figure 3 below to assist readers. The grade six item aligns to Ratio and Proportions and Expressions and Equations content standards. The grade seven item aligns to an Expressions and Equations content standard, whereas the grade eight item aligns to Number Sense content standards. In summary, all items were deemed by multiple expert panels to align with the desired Standards for Mathematics Content, and additionally, each item connected to one or more Standards for Mathematical Practice (see CCSSI, 2010, for more information).
This study used a convergent mixed-methods research approach to collect data about PSM-CAT, specifically (a) its usability among potential users and (b) validity evidence related to consequences of testing. One intended study outcome is to broadly distribute findings for potential test users and administrators (e.g., school personnel, researchers, and evaluators) to consider when selecting a mathematics problem-solving test. A second outcome is to provide readers with a model of a study that combines exploring usability and consequences from testing for a K-12 test. The research questions for this study are
RQ1: 
To what degree is the PSM-CAT usable for students?
RQ2: 
What consequences of testing evidence exist for the PSM-CAT?

3.2. Research Design

This convergent mixed-methods research (MMR) design (Creswell & Plano Clark, 2018) utilized a survey and 1–1 interviews (QUAN + QUAL). There are aspects from the survey and interviews that address each of the two research questions. In an MMR study, both quantitative and qualitative data are collected and then analyzed (Creswell & Plano Clark, 2018). We compared the quantitative and qualitative results to build a common understanding (Creamer, 2017; Creswell, 2014). Data were gathered immediately following test completion to accurately measure students’ experiences with the test. Students were more likely to recall their experiences with just minutes between test completion, survey, and interview.
The present study is part of a larger grant-funded project (National Science Foundation—2100988; 2101026) that works with representatively and purposefully sampled school districts across the USA. These participating districts were purposefully sampled to represent different regions and contexts of the USA, including one urban district in the Pacific region, one large suburban district in the Mountain West region, and multiple school districts inclusive of suburban and rural contexts in the Midwest. This study met exempt IRB status; students who wished not to participate in data collection did not participate. All names used in this study are pseudonyms that participants selected. We describe methods for the survey followed by methods for the interviews.

3.3. Survey Methods

3.3.1. Participants

A survey was administered to 1010 students in grades six through eight to capture their perceptions of the usability and consequences of taking the PSM-CAT. Participant demographic information is shown in Table 1. We used purposeful, representative sampling (Creswell, 2014) because our team sought to generate findings that may reflect the diversity of middle school students across three regions of the USA. Participant selection was performed with a goal to have a broad pool of students with respect to sex, racial/ethnic diversity, and geographic location (both region and context). Students self-identified their sex and race/ethnicity. As in prior work from this project, participants chose (a) male, (b) female, or (c) prefer not to say for gender. This approach was used after students in earlier studies with this project recommended a third option (see Bostic et al., 2024). Students were offered multiple options for race/ethnicity that followed the USA census’ approach.

3.3.2. Instrument and Data Collection

The survey was constructed with the intent of gathering data related to both research questions. It was modeled after prior usability studies as well as surveys about consequences of testing (Denecke et al., 2021; Thielemans et al., 2018). In Denecke and coauthors’ study, survey questions resulted in binary responses, such as “confusing/clear”, “not understandable/understandable”, and “cluttered/organized”. Thielemans et al.’s (2018) survey addressed a tool’s ease of use and readability/comprehension, which was mirrored in the current study’s focus on the usability of tools associated with the test (i.e., calculator and formula sheet), the readability of items and test directions, and overall test usability. Survey questions are presented in Figure 4.
Classroom teachers distributed the survey to students through a hyperlink immediately following completion of the PSM-CAT. Students were asked to provide demographic information and then began responding to survey questions related to the test. In total, the survey had four focal questions, which addressed whether students understood the test and could find the appropriate materials (i.e., calculator and formula sheet): (1) Did you understand the test? (2) Did you use a handheld or online calculator? (3) Did you use the formula sheet? (4) Did you experience any issues during the test? The survey used ‘skip logic’, which parallels past research (e.g., Ifeachor et al., 2016; O’Regan et al., 2020). An additional five questions branched from the focal questions depending on student responses. For example, if students answered ‘no’ to a question, then they would be given a different follow-up question than if they responded ‘yes’ (see Figure 4 for more information). Most students completed the survey within four minutes.
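As an illustration of how such skip logic can be organized, the sketch below maps each focal question and answer to a follow-up prompt. The four focal questions are taken from the text above; the branch map and follow-up wording are hypothetical placeholders, since the actual branching structure appears in Figure 4 rather than here.

```python
# A minimal sketch of dictionary-driven skip logic for the survey. The focal
# questions come from the text above; the follow-up prompts and branch map are
# hypothetical placeholders (the actual wording appears in Figure 4).

FOCAL_QUESTIONS = {
    "understood_test": "Did you understand the test?",
    "calculator": "Did you use a handheld or online calculator?",
    "formula_sheet": "Did you use the formula sheet?",
    "issues": "Did you experience any issues during the test?",
}

# (question id, answer) -> follow-up prompt shown only when that branch is taken.
FOLLOW_UPS = {
    ("understood_test", "no"): "What was unclear about the test? (placeholder)",
    ("calculator", "yes"): "Could you easily find the calculator? (placeholder)",
    ("formula_sheet", "yes"): "Could you easily find the formula sheet? (placeholder)",
    ("issues", "yes"): "Please describe the issue you experienced. (placeholder)",
}

def follow_up_prompts(responses):
    """Return the follow-up prompts a student sees, given yes/no answers to
    the four focal questions."""
    prompts = []
    for question_id in FOCAL_QUESTIONS:
        answer = responses.get(question_id, "").strip().lower()
        prompt = FOLLOW_UPS.get((question_id, answer))
        if prompt:
            prompts.append(prompt)
    return prompts

# Example: this student sees two follow-up prompts (calculator and issues).
print(follow_up_prompts({"understood_test": "yes", "calculator": "yes",
                         "formula_sheet": "no", "issues": "yes"}))
```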

3.3.3. Data Analysis

Descriptive statistical analyses were performed on the closed-ended survey data to examine how students responded to each binary choice (yes/no). Open-ended survey questions were analyzed through an inductive, multi-stage thematic analysis approach (Creswell, 2014; Miles et al., 2014). This five-segment approach included multiple steps (see Figure 5). A goal of this approach was to have multiple opportunities to review the qualitative data, create observations, take notes, and generate themes from the data. Two researchers (i.e., King and Bostic) collaborated throughout this process to evaluate and critique the procedures and the outcomes they generated. In addition, results during qualitative data analysis were shared with co-authors (i.e., May and Stone) to promote triangulation (Creamer, 2017; Miles et al., 2014).
The analysis consisted of five segments (A through E), each with a series of steps (see Figure 5). Two researchers conferred at the end of each segment. Segment one was used to become intimately familiar with the data and to prepare the data for analysis. Segment two consisted of making generalizable notes on a spreadsheet. Segment three drew the notes together into memos. Segment four synthesized the memos into codes. The fifth and final segment consisted of generating plausible themes from the codes in relation to the research questions.
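Returning briefly to the closed-ended items, the sketch below illustrates the kind of descriptive tally used for the binary (yes/no) responses; the records and field names are invented for exposition and do not reproduce the study’s data.

```python
from collections import Counter

# Illustrative records only; the actual data set contained 1010 students.
responses = [
    {"understood_test": "yes", "used_calculator": "yes", "used_formula_sheet": "no"},
    {"understood_test": "yes", "used_calculator": "no",  "used_formula_sheet": "yes"},
    {"understood_test": "no",  "used_calculator": "yes", "used_formula_sheet": "yes"},
]

def yes_no_percentages(records, question):
    """Return the percentage of 'yes' and 'no' answers for one binary item."""
    counts = Counter(record[question] for record in records)
    total = sum(counts.values())
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}

for question in ("understood_test", "used_calculator", "used_formula_sheet"):
    print(question, yes_no_percentages(responses, question))
```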

3.4. Interviews: Data Collection and Analysis

Students were purposefully and representatively chosen for an interview and asked to confirm willingness to participate as volunteers. An aim for sampling interviewees was to reflect the demographics of the surveyed participants. In total, 121 students participated in 1–1 interviews with an intent to gather data on students’ perceptions of the PSM-CAT (see Table 1). The purpose of this interview was to gather data for both research questions. Lee’s interview questions (Lee, 2020) were used as a model for the interview protocol in this study, which focused on aspects related to testing consequences and usability. Interview questions are found in the Appendix A. Data collection started when two researchers (i.e., King and Bostic) asked teachers whether any students volunteered for interviews. Teachers provided a list of volunteers. As students were selected, they were asked to confirm participation in the interview on their way to the interview space. Students were escorted one-at-a-time to a quiet space for the interview.
The researchers used a structured protocol that asked questions about students’ perceptions and experiences with the PSM-CAT. First, they were asked to confirm their participation in the interview. After confirming their participation, they were asked to provide information related to their demographic data and to choose a pseudonym. Next, researchers provided an overview of the interview. Participants were handed a paper with the purpose of the PSM-CAT and interview questions (see Appendix A). A researcher confirmed whether the participant understood the purpose of the PSM-CAT and could read the questions. The interview started after students communicated their understanding. If students had any questions, then they were answered. Researchers redirected participants when necessary. Interviews took approximately five minutes.
Similar to the open-ended survey data, a thematic data analysis approach (Creswell, 2014; Miles et al., 2014) was used to analyze the data from the 1–1 interviews (see Figure 5). The same process described earlier was applied to the interview data. We connect students’ responses to interview questions and communicate when students’ ideas were prompted by the final question, “Is there anything that you want to share with me about your experience during the test?” We frame responses to that final question as ‘unprompted’ because they do not necessarily result from a targeted interview question.

4. Results

Data analysis led to themes and ideas for RQ1 and RQ2.
RQ1: 
To what degree is the PSM-CAT usable for students?
RQ2: 
What consequences of testing evidence exist for the PSM-CAT?
To summarize the findings, the theme for RQ1 was that students perceived the PSM-CAT with a high degree of usability. There were two ways that usability was framed: resources and test items. Data informing RQ2 led to two themes. The first theme was that students experienced positive outcomes from taking the PSM-CAT. These positive outcomes are grounded in two ways: student learning and student attitudes. The second theme was that students believed their teachers might gain information as a consequence of the PSM-CAT. This second theme was grounded in two ways. First, students felt that teachers might learn what students understand mathematically from the test results. Second, students felt their teachers might be better equipped to help students grow mathematically. We display these themes and ideas behind the themes in Figure 6. Quotations are intentionally selected to demonstrate consistency and coherence across grade levels and are shared in greater detail in the sections below. As ideas are discussed, the pseudonyms that participants chose are used. However, an individual’s demographic information (e.g., sex and race/ethnicity) is not shared to (a) protect anonymity and (b) follow current Executive Order 14151 https://www.whitehouse.gov/presidential-actions/2025/01/ending-radical-and-wasteful-government-dei-programs-and-preferencing/ (accessed on 23 February 2025). This executive order directs the federal government to eliminate diversity, equity, and inclusion programs and policies that it deems to promote discrimination (White House, 2025), which includes scholarship stemming from federally grant-funded research.

4.1. RQ1: Usability

Data analysis led to a single theme for RQ1: students’ perceptions of the PSM-CAT demonstrated a high degree of usability. This theme was seen through two ideas: (a) resources and (b) items. Results for RQ1 were found through quantitative survey data related to usability features, integrated with qualitative data about usability (e.g., students could have responded to the open-ended interview questions related to usability or consequences of testing). Several students also conveyed evidence of test usability in a broad sense. Cornetheus, a seventh-grade student, communicated that he preferred the PSM-CAT over other tests he has taken recently, “I definitely do prefer that [CAT] test over our usual paper tests, because, you can see that it definitely makes an impact based on the way that it was set up and the way the time limit is, and the questions on the screen worked well. I mean, like, I knew what to do.” Test usability was also noted by Michael, an eighth-grade student, who shared his thoughts about the usability of the time limit, “I didn’t get many questions done, but that is okay because the time that I needed was enough for every question. It was actually content that I should know and knew. That’s different from other tests where I feel I have to rush to answer every problem”. Kennedy, a sixth-grade student, added comments about the directions and flow of the test: “the test [directions] explained most stuff and there were pretty good explanations of the problems…it was easy to get through the test and I knew what to do to get to the next problem.” In summary, students conveyed a high degree of test usability, from the directions to the navigation to the timer and time limit.

4.1.1. Resources

A finding for RQ1 was that the resources (i.e., online calculator and formula sheet) provided in the PSM-CAT system were accessible and easy to use. This is also evidenced through students who perceived the resources on the PSM-CAT as helpful. With regard to accessibility, the majority of students easily found the online calculator and formula sheet and could use them (see Figure 7). In total, 856 of the 1010 students used a calculator: 662 students used the online calculator and 194 students used a handheld calculator. Of those 662 students who used the online calculator, 97% (n = 642) indicated they were able to locate it. Similarly, 694 of the 1010 students wanted to use the formula sheet. Of those 694 students, 82% (n = 569) were able to access it with ease. Interview data complemented this finding of resources being easily accessible and usable. John, a seventh-grade student, was one of many who, in an unprompted response, shared that he thought the test had more benefits: “I liked how when I used the calculator on the test, it didn’t take up the whole screen. I could see the question while also seeing the calculator instead of having to memorize what I need to type in and then go back to the calculator. I also think that the resources were just organized well because I could find them”. Data were consistent across interviews; however, there was a small group of students who wanted to use resources but could not locate them (1.6%, n = 16). The survey data also included information about whether students could understand the directions as a usability feature and whether they had issues with the testing system. Results showed that 98% of students surveyed could clearly read and understand the test directions, and less than 3% of students experienced issues with the testing system. The most common issue (i.e., 1.7% of students) was being unable to locate the resources while taking the PSM-CAT.
The finding that the PSM-CAT was usable was also described by students who expressed that the resources were helpful to them while test-taking, often at the end of the interview. We found seven such comments from interviewed students, which were connected to the other data from the survey. Four of these comments were captured from the interview questions themselves, and three of the comments about the resources emerged when students were asked if they had anything else they would like to share. Lucy, a sixth-grade student, conveyed this feeling when asked if she had anything else to share about the test overall, “There were good resources that you could use while taking the test which, helped me while problem solving”. Similarly, Peter, a seventh-grade student, offered this when asked if he had anything else to share, “On the test I got to use a calculator. It was helpful to focus on the actual question because I had a calculator there. It would be more work if you did not have a calculator”. Peter and others were able to focus on the test items because of the provided calculator. In summary, students expressed that the resources were usable because they were easy to access and use and because they were valuable tools supporting students while testing (approximately 6% of interviewed students commented on the resources, and survey data showed 97% could easily locate the calculator and 82% the formula sheet).

4.1.2. Test Items

Our finding that the test was highly usable was grounded in a second way, related to the test items themselves rather than the resources. Students perceived PSM-CAT items as readable and the item contexts as relatable. Charlie, a seventh-grade student, offered the following at the end of the interview when asked for any final thoughts, “The problems were just fun and they were about fun things like driving and finding the cost of things. I understood the questions because they were about actual things you could use in the world”. Charlie and another participant, John, emphasized that the items’ contexts connected with their real-world experience or familiarity with a context. John, a sixth-grade student, said in his interview when asked about the perceived benefits of the test (i.e., interview question one) that “The questions were clear and easy to understand, it was very simple. I find that other test’s questions are hard to understand or find the information”. Faith, an eighth-grade student, shared this sense when asked the same question as John, “I think this test had benefits because the questions were worded in a way that I was able to understand”. This idea of the test items being easy to read and understand was expressed consistently across participants. Interview questions were open-ended and therefore also led students to discuss the usability of the test unprompted, contributing to these results. This consistency led to the finding that supported a high degree of usability by intended respondents.

4.2. RQ2: Test Consequences

The first theme related to test consequences was that students perceived positive outcomes from the PSM-CAT. A second theme was that students believed their teachers might gain information about their mathematics proficiency from the PSM-CAT. We unpack theme one, then theme two.

4.2.1. Outcomes: Student Learning

Our first theme for RQ2 was that there were positive outcomes from taking the PSM-CAT. Theme one was reified through two ideas. The first idea was that students felt they were able to demonstrate their mathematics learning through PSM-CAT completion. That is, they perceived their test score as reflective of their mathematics learning. Students viewed the test as a coherent map of their mathematics learning. While they did not specifically reference mathematical content in their ideas, their responses were a result of thinking about the mathematics content that they came across in the testing system. Roman, a seventh-grade student, shared, like many others, that the PSM-CAT gave an opportunity to reflect on what mathematical content he knew and areas for growth: “The good from this test is that you experience more things in math, and you see more what you need help with so you can do better”. In Roman’s case, he was asked whether the test had more benefits or negatives. His response is evidence of the first idea of theme one, that students were able to demonstrate mathematics learning. As a reminder, the PSM-CAT seeks to measure students’ abilities accurately and efficiently, which requires administering items that may be somewhat beyond what they have learned. Brielle, a seventh-grade student, shared an experience with the test when also asked about the benefits: “This test has more benefits than issues because you could see what you already know how to do in math”. Her comments unpack the idea that the test can remind students what mathematics they know how to do, also evidence of the first idea. Saje, an eighth-grade student, supported this idea: “Good came from the test because it helped me to reflect on how we (Saje and peers) figure out math problems and if I am able to solve them correctly”. Statements like these were consistent across the sample and portrayed how students perceived the PSM-CAT to have positive outcomes, such as gaining information about their current content knowledge and content that they may need to work on. There were rare instances where students had negative experiences beyond the control of the PSM-CAT (1.1%, n = 11). Those experiences included the test not loading or an item failing to open after answering another item. Such experiences may have been due to the school’s internet or computer quality rather than the test, and they were limited to a small number of students across the entire testing sample. Moreover, most students experienced positive outcomes from taking the PSM-CAT, and this was grounded through evidence related to students reflecting on what content they understood and what content they needed to work on.

4.2.2. Outcomes: Student Attitudes

A second idea related to this theme (i.e., students experienced positive outcomes from taking the test) was that the PSM-CAT led to maintaining or promoting positive attitudes. Participating students either maintained a positive attitude from start to finish with the PSM-CAT, or their attitude became more positive compared to when they started. Of the 121 interviewed students, twenty-six (21%) felt positive about the testing experience both before and after the PSM-CAT, whereas eighty-six (71%) described a positive increase in their attitude towards the PSM-CAT after completing it. Taken collectively, 92% of students interviewed had positive attitudes towards the testing experience. This finding was a result of interview questions about how students felt before and after the testing experience. Students shared a variety of information characterizing why they felt positive. A seventh-grade student, Jane, told the interviewer: “I felt positive towards the test because as you answered a question, they got harder and harder to match your level. I liked that and it made me feel good”. This student felt positive about the experience because the test was computer-adaptive; she was able to answer the PSM-CAT items she could, and some items were beyond her current abilities. Alternately, Trinity, a sixth-grade student, simply expressed: “I got a lot of questions done, and I felt good after the test”. Overall, most students expressed a positive attitude after completing the PSM-CAT.

4.2.3. Teachers Gain Knowledge: What Students Know

RQ2 had a second theme: students believed their teachers might gain information about their mathematics proficiency as a result of the PSM-CAT. This theme was grounded through the first idea, reflecting that teachers could gain information about what content their students currently understand. Bianca, an eighth-grade student, told an interviewer when asked about the benefits of the test, “What the test is trying to do is help teachers understand what’s going on. You know—through a student’s head, and how they (students) think. Personally, I think that’s it’s important for my teacher to know”. Participants expressed that teachers may gain information about their students’ thinking as a result of completing the PSM-CAT. Harper, a sixth-grade student, also answered this question, “There are benefits and positives to this test because your teacher can find out how you learn and what your math level is at. I want my teacher to know what I know”. Interviewed students consistently communicated a strong desire for their teachers to gain information about what content they knew and understood.

4.2.4. Teachers Gain Knowledge: How to Help Students

The second theme for RQ2 was supported by a second idea: students believed the PSM-CAT would give teachers information about how to help with students’ mathematics learning, that is, with content students had encountered previously. Students were reflecting on how their teachers could take the results from the PSM-CAT and help them with content that they may not have understood in that moment. Collin, a seventh-grade student, communicated that “This test helps my teacher realize what I’m struggling in (with math). If they (my teacher) can understand that, then they can see and figure out what I need help with. I hope my teacher uses the test results to help me where I need it”. Rachel, an eighth-grade student, said something similar: “The test will help teachers, like mine, figure out what to help students, like me, with math I’m learning, and that they (me and my peers) need help with some things—like those problems I got wrong”. Neither of these students was asked interview questions specifically related to their teachers gaining knowledge; however, unprompted, these students and others responded in such ways in relation to the benefits of testing or how they felt during the test. In summary, the findings from interview data characterized positive consequences, such as the PSM-CAT allowing teachers to learn what students can do in mathematics and areas where they may need help.

5. Discussion

One goal of this manuscript was to explore usability outcomes related to the PSM-CAT. A second goal was to share validity evidence related to the consequences of testing for the PSM-CAT. Reporting on these two goals provides potential users—researchers, evaluators, and school personnel—the necessary information to make informed choices related to a CAT measure of mathematical problem-solving. Evidence from this mixed-methods study provides a strong foundation for findings related to the two goals. That is, (1) the PSM-CAT had a high degree of usability by potential users; (2) the positive outcomes related to consequences of testing outweigh negative outcomes related to the PSM-CAT.

5.1. Connecting to Consequences of Testing

Accountability and computer-based testing are ubiquitous phenomena in the modern education landscape. Students take tests that have outcomes; hence, it is important that validity evidence is examined from a broad perspective, as described in the Standards (AERA et al., 2014). Test results and their interpretations that lack robust validity evidence can lead to spurious findings or, worse, harmful outcomes (Bostic, 2023). Krupa et al. (2024) have shown that consequences of testing information related to mathematics tests is rarely shared in the literature, which makes this study a helpful contributor to the discussion about consequences of testing. This study also extends prior research on consequences of testing (e.g., Lee, 2020). The findings from the present study indicated that there was an overall positive experience from testing and its outcomes with the PSM-CAT. Students communicated that they developed or maintained a positive attitude while testing and that the test has the capacity to give teachers information about their learning. Previous research findings suggested substantive negative consequences from testing are possible (see Lee, 2020) and should be sufficiently explored. Lee’s (2020) students conveyed concerns about how test results might be used, and roughly half felt the consequences were positive. Lee’s findings contrast with our findings, in which less than 3% of students communicated any negative experiences or outcomes from the test. Tests should have more positive consequences than negative ones.
Narratives from test respondents provide potential test users with greater confidence that the benefits of testing with the PSM-CAT outweigh the negatives associated with it (e.g., issues while testing). There were several unprompted responses from participants during the interviews regarding consequences, suggesting that consequences of testing was something they considered and felt important enough to convey to interviewers. If teachers are expected to administer a test, then the time taken during testing should be buttressed with reasonable evidence that the testing time is worthwhile. Conclusions like this one are called for in research (e.g., Carney et al., 2022) and are warranted to support robust validity claims about consequences of testing, which have been rare in the literature (Krupa et al., 2024), and to highlight cases, unlike Lee (2020), where positives outweigh negatives. Findings from this study provide readers and potential PSM-CAT users with confidence that the results from these tests can be used appropriately. This study also provided an example of researching consequences of testing for test results and, in turn, it is a case where the positive consequences from a test were greater than the negative consequences.

5.2. Usability Studies in Education Research

The present study extends past usability studies from other areas into education. Denecke et al.’s (2021) work, as well as Thielemans et al.’s (2018) usability study, provided a foundation for the present study. The Standards (AERA et al., 2014) indicate that test developers should explore usability. Scholars (e.g., Carney et al., 2022; Estrada-Molina et al., 2022) have expanded on these guidelines and recommend that studies clearly communicate usability-related information about tests. One outcome from the present study is an example of a convergent mixed-methods project that highlighted areas of strength and areas for improvement related to a mathematical problem-solving test. A second, and important, outcome is that this study serves as an example for other usability studies within educational testing research. Usability investigations from medicine and technology can be reasonably applied, with some modifications, to educational testing situations, and this study may serve as one way to translate work from other research areas. Educational scholars might find the present study useful for conducting their own usability studies. Past usability studies described acceptable usability features but also characterized potential improvements (Thielemans et al., 2018). Our study likewise found that the PSM-CAT had a high degree of usability due to the nature of the resource accessibility and test design; however, some students (less than 2%) expressed struggles locating the formula sheet. This finding led the development team to move the formula sheet’s location and convey its location more clearly in the directions. In effect, usability studies present a form of design research around validation work that can shape better outcomes for future users.

5.3. Limitations and Future Directions

We share some limitations of the study, which could be addressed in future work. First, the survey did not include items generating quantitative data about consequences of testing, and the interview protocol did not include questions specifically targeting usability, although this study drew upon both data sources (i.e., interviews and surveys) to convey a broader narrative around the test. Thus, a limitation is a lack of quantitative data related to consequences of testing and a lack of targeted qualitative data related to usability. Our research team made those methodological choices by drawing from past research. A future study might develop a survey with items related to consequences of testing and interview questions related to usability. Second, the data set was drawn from districts across three diverse contexts (i.e., rural, suburban, and urban). There are some areas where the data could be strengthened, including greater breadth in student backgrounds, such as increasing numbers of students from urban and rural populations. A future investigation might better study outcomes with these populations and explore the degree to which the findings from the present investigation extend. A third limitation was related to the usability results. Our findings show that the test demonstrated a high degree of usability with the features that were studied. Additional research including more usability features could have the capacity to improve the PSM-CAT. More usability studies need to be performed in education, particularly in CAT environments. Our hope is that this study might serve as a model for more usability studies in education.

Author Contributions

Conceptualization, S.G.K., J.D.B., T.A.M. and G.E.S.; Methodology, S.G.K. and J.D.B.; Validation, J.D.B.; Formal analysis, S.G.K. and J.D.B.; Investigation, S.G.K. and J.D.B.; Resources, S.G.K., J.D.B. and T.A.M.; Data curation, S.G.K.; Writing—original draft, S.G.K. and J.D.B.; Writing—review & editing, J.D.B., T.A.M. and G.E.S.; Visualization, S.G.K.; Supervision, J.D.B.; Project administration, J.D.B.; Funding acquisition, J.D.B., T.A.M. and G.E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science Foundation grant numbers 2100988, 2101026, 1920621, and 1920619. The APC was funded by grant number 1920621.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Bowling Green State University (protocol code #1749616-7, 11 October 2024). The Bowling Green State University Institutional Review Board determined that this research is exempt according to federal regulations and that the research met the principles outlined in the Belmont Report.

Informed Consent Statement

Consent/assent from participants was waived because the research met exempt status according to federal regulations. All participants were provided information about the study beforehand and had the option to have their data excluded from the study.

Data Availability Statement

Data are not publicly available.

Conflicts of Interest

Author Gregory E. Stone was employed by the company Clarity Assessment Systems Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CAT: Computer-adaptive test
FGD: Focus group discussions
IRB: Institutional review board
MMR: Mixed methods research
USA: United States of America

Appendix A

Purpose Statement and 1–1 Interview Questions

References

  1. American Educational Research Association (AERA), American Psychological Association (APA) & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  2. Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 92(1), 81–90. [Google Scholar] [CrossRef]
  3. Bostic, J. (2023). Engaging hearts and minds in assessment research. School Science and Mathematics Journal, 123(6), 217–219. [Google Scholar] [CrossRef]
  4. Bostic, J., May, T., Matney, G., Koskey, K., Stone, G., & Folger, T. (2024, March 6–8). Computer adaptive mathematical problem-solving measure: A brief validation report. 51st Annual Meeting of the Research Council on Mathematics Learning (pp. 102–110), Columbia, SC, USA. [Google Scholar]
  5. Bostic, J., & Sondergeld, T. (2015). Measuring sixth-grade students’ problem solving: Validating an instrument addressing the mathematics Common Core. School Science and Mathematics Journal, 115, 281–291. [Google Scholar] [CrossRef]
  6. Bostic, J., & Sondergeld, T. (2018). Validating and vertically equating problem-solving measures. In D. Thompson, M. Burton, A. Cusi, & D. Wright (Eds.), Classroom assessment in mathematics: Perspectives from around the globe (pp. 139–155). Springer. [Google Scholar]
  7. Bostic, J., Sondergeld, T., Folger, T., & Kruse, L. (2017). PSM7 and PSM8: Validating two problem-solving measures. Journal of Applied Measurement, 18(2), 151–162. [Google Scholar]
  8. Bruder, R. (2021). Comparison of the Abitur examination in mathematics in Germany before and after reunification in 1990. ZDM Mathematics Education, 53, 1515–1527. [Google Scholar] [CrossRef]
  9. Carney, M., Bostic, J., Krupa, E., & Shih, J. (2022). Interpretation and use statements for instruments in mathematics education. Journal for Research in Mathematics Education, 53(4), 334–340. [Google Scholar] [CrossRef]
  10. Céspedes-González, Y., Otero Escobar, A. D., Ricárdez Jiménez, J. D., & Molero Castillo, G. (2023, November 6–10). Academic achievement in mathematics of higher-middle education students in Veracruz: An approach based on computational intelligence. 11th International Conference in Software Engineering Research and Innovation (CONISOFT) (pp. 177–185), León, Mexico. [Google Scholar] [CrossRef]
  11. Common Core State Standards Initiative (CCSSI). (2010). Common core standards for mathematics. Available online: http://www.corestandards.org/Math/ (accessed on 5 September 2024).
  12. Creamer, E. G. (2017). An introduction to fully integrated mixed methods research. SAGE Publications. [Google Scholar]
  13. Creswell, J. W. (2014). Research design: Quantitative, qualitative, and mixed method approaches (4th ed.). SAGE Publications, Inc. [Google Scholar]
  14. Creswell, J. W., & Plano Clark, V. L. (2018). Designing and conducting mixed methods research (3rd ed.). SAGE. [Google Scholar]
  15. Davey, T. (2011). A guide to computer adaptive testing systems. Council of Chief State School Officers. [Google Scholar]
  16. Denecke, K., Vaaheesan, S., & Arulnathan, A. (2021). A mental health chatbot for regulating emotions (SERMO)—Concept and usability test. IEEE Transactions on Emerging Topics in Computing, 9(3), 1170–1182. [Google Scholar] [CrossRef]
  17. Estrada-Molina, O., Fuentes-Cancell, D. R., & Morales, A. A. (2022). The assessment of the usability of digital educational resources: An interdisciplinary analysis from two systematic reviews. Education and Information Technologies, 27, 4037–4063. [Google Scholar] [CrossRef]
  18. Fennell, F., Kobett, B., & Wray, J. (2023). The formative 5 in action, grades K-12. Updated and expanded from the formative 5: Everyday assessment techniques for every math classroom. Corwin. [Google Scholar]
  19. Heissel, J. A., Adam, E. K., Doleac, J. L., Figlio, D. N., & Meer, J. (2021). Testing, stress, and performance: How students respond physiologically to high-stakes testing. Education Finance and Policy, 16(2), 183–208. [Google Scholar] [CrossRef]
  20. Hino, K., & Ginshima, F. (2019). Incorporating national assessment into curriculum design and instruction: An approach in Japan. In C. P. Vistro-Yu, C. P. Vistro-Yu, T. L. Toh, & T. L. Toh (Eds.), School mathematics curricula (pp. 81–103). Springer Singapore Pte. Limited. [Google Scholar] [CrossRef]
  21. Hudson, J., Nguku, S. M., Sleiman, J., Karlen, W., Dumont, G. A., Petersen, C. L., Warriner, C. B., & Ansermino, J. M. (2012). Usability testing of a prototype phone oximeter with healthcare providers in high-and low-medical resource environments. Anaesthesia, 67(9), 957–967. [Google Scholar] [CrossRef]
  22. Ifeachor, A. P., Ramsey, D. C., Kania, D. S., & White, C. A. (2016). Survey of pharmacy schools to determine methods of preparation and promotion of postgraduate residency training. Currents in Pharmacy Teaching and Learning, 8(1), 24–30. [Google Scholar] [CrossRef]
  23. Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education/Praeger. [Google Scholar]
  24. Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. [Google Scholar] [CrossRef]
  25. Krupa, E., Bostic, J., Folger, T., & Burkett, K. (2024, November 7–10). Introducing a repository of quantitative measures used in mathematics education. 46th Annual Meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education (pp. 55–64), Cleveland, OH, USA. [Google Scholar]
  26. Lane, S. (2014). Validity evidence based on testing consequences. Psicothema, 26(1), 127–135. [Google Scholar] [CrossRef]
  27. Lane, S. (2020). Test-based accountability systems: The importance of paying attention to consequences. ETS Research Report Series, 2020(1), 1–22. [Google Scholar] [CrossRef]
  28. Lane, S., & Stone, C. A. (2002). Strategies for examining the consequences of assessment and accountability programs. Educational Measurement: Issues and Practice, 21(1), 23–30. [Google Scholar] [CrossRef]
  29. Lawson, B., & Bostic, J. (2024). An investigation into two mathematics score reports: What do K-12 teachers and staff want? Mid-Western Educational Researcher, 36(1), 12. Available online: https://scholarworks.bgsu.edu/mwer/vol36/iss1/12 (accessed on 18 December 2024).
  30. Lee, E. (2020). Evaluating test consequences based on ESL students’ perceptions: An appraisal analysis. Columbia University Libraries and Applied Linguistics and TESOL at Teachers College, 20(1), 1–22. [Google Scholar] [CrossRef]
  31. Lesh, R., & Zawojewski, J. S. (2007). Problem solving and modeling. In F. Lester (Ed.), Second handbook of research on mathematics teaching and learning (pp. 763–804). Information Age Publishing. [Google Scholar]
  32. Maphalala, M. C., & Khumalo, N. (2018). Standardised testing in South Africa: The annual national assessments under the microscope. ResearchGate. Available online: https://www.researchgate.net/publication/321951815 (accessed on 12 May 2025).
  33. Martin, A. J., & Lazendic, G. (2018). Computer-adaptive testing: Implications for students’ achievement, motivation, engagement, and subjective test experience. Journal of Educational Psychology, 110(1), 27. [Google Scholar] [CrossRef]
  34. McGatha, M. B., & Bush, W. S. (2013). Classroom assessment in mathematics. In SAGE handbook of research on classroom assessment (pp. 448–460). SAGE. [Google Scholar]
  35. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan. [Google Scholar]
  36. Miles, M., Huberman, A. M., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook (3rd ed.). Sage. [Google Scholar]
  37. Naveh, K. (2004). Matriculation in a new millennium: Analysis of a constructivist educational reform in Israeli high-schools [Doctoral dissertation, University of Leicester and ProQuest Dissertations Publishing]. Available online: https://www.proquest.com/docview/U190040 (accessed on 2 December 2024).
  38. O’Regan, S., Molloy, E., Watterson, L., & Nestel, D. (2020). ‘It is a different type of learning’. A survey-based study on how simulation educators see and construct observer roles. BMJ Simulation & Technology Enhanced Learning, 7(4), 230–238. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  39. Saad, M., Zia, A., Raza, M., Kundi, M., & Haleem, M. (2022). A comprehensive analysis of healthcare websites usability features, testing techniques and issues. Institute of Electrical and Electronics Engineers, 10, 97701–97718. [Google Scholar] [CrossRef]
  40. Schoenfeld, A. (2011). How we think: A theory of goal-oriented decision making and its educational applications. Routledge. [Google Scholar]
  41. Schoenfeld, A. H. (2015). Summative and formative assessments in mathematics supporting the goals of the Common Core Standards. Theory Into Practice, 54(3), 183–194. [Google Scholar] [CrossRef]
  42. Sireci, S., & Benitez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226. [Google Scholar] [CrossRef] [PubMed]
  43. Thielemans, L., Hashmi, A., Priscilla, D. D., Paw, M. K., Pimolsorntong, T., Ngerseng, T., Overmeire, B. V., Proux, S., Nosten, F., McGready, R., Carrara, V. I., & Bancone, G. (2018). Laboratory validation and field usability assessment of a point-of-care test for serum bilirubin levels in neonates in a tropical setting. Wellcome Open Research, 3, 110. [Google Scholar] [CrossRef]
  44. Uko, M. P., Eluwa, I., & Uko, P. J. (2024). Assessing the potentials of computerized adaptive testing to enhance mathematics and science students’ achievement in secondary schools. European Journal of Theoretical and Applied Sciences, 2(4), 85–100. [Google Scholar] [CrossRef] [PubMed]
  45. Verschaffel, L., De Corte, E., Lasure, S., Van Vaerenbergh, G., Bogaerts, H., & Ratinckx, E. (1999). Learning to solve mathematical application problems: A design experiment with fifth graders. Mathematical Thinking and Learning, 1, 195–229. [Google Scholar] [CrossRef]
  46. Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27(1), 1–14. [Google Scholar] [CrossRef]
  47. White House. (2025). Ending radical and wasteful government DEI programs and preferencing [Presidential executive order]. The White House. [Google Scholar]
Figure 1. Example of Item Difficulty Measures for CATs. Numbers in this example represent item difficulties as measured using Rasch analysis, grounded in logits.
Figure 2. Metaphorical Knotted Rope Demonstrating Sources of Validity Evidence within a Concept. For more information, please reference (AERA et al., 2014).
Figure 3. Sample of PSM-CAT items.
Figure 4. Survey Questions and Skip Logic.
Figure 5. Multistage Qualitative Analysis Process.
Figure 6. Themes Responding to Research Questions.
Figure 7. Survey Responses: (a) results for understanding the directions; (b) results for using and finding the calculator; (c) results for experiencing issues; (d) results for using and finding the formula sheet.
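
Figure 1, above, describes item difficulties estimated through Rasch analysis and expressed in logits. As context for readers unfamiliar with that metric, a standard formulation of the dichotomous Rasch model is sketched below; the symbols (person ability θ_n and item difficulty δ_i) are generic notation and are not drawn from the PSM-CAT’s specific calibration.

$$P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}$$

Because θ_n and δ_i share the same logit scale, an examinee has a 0.5 probability of answering an item correctly when the item’s difficulty equals the examinee’s ability; a CAT typically selects upcoming items whose difficulties lie near the current ability estimate.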
Table 1. Demographics of Participants for Survey and Interviews.

Demographic Type | Demographic Information | Participants in the Survey | Participants in the Interviews
Sex | Male | 467 (46%) | 60 (49.6%)
Sex | Female | 515 (51%) | 59 (48.7%)
Sex | Prefer not to say | 28 (3%) | 2 (1.7%)
Race/Ethnicity | White/Caucasian | 606 (60%) | 75 (62%)
Race/Ethnicity | Hispanic/Latino | 140 (13.8%) | 16 (13.2%)
Race/Ethnicity | Asian/Pacific Islander/Hawaiian | 71 (7%) | 7 (5.8%)
Race/Ethnicity | Black/African American | 54 (5.4%) | 4 (3.3%)
Race/Ethnicity | Mixed/Biracial | 117 (11.6%) | 14 (11.6%)
Race/Ethnicity | Other | 22 (2.2%) | 5 (4.1%)
Grade Level | Sixth Grade | 264 (26.1%) | 25 (20.6%)
Grade Level | Seventh Grade | 351 (34.8%) | 52 (43%)
Grade Level | Eighth Grade | 395 (39.1%) | 44 (36.4%)
Location | Pacific | 282 (28%) | 44 (36.4%)
Location | Mountain West | 477 (47.2%) | 38 (31.4%)
Location | Midwest | 251 (24.8%) | 39 (32.2%)

