The profession of podiatric medicine has long recognized the need to improve its educational standards and outcomes [1, 2]. In the early 1990s, Stritter and Becker [3] emphasized the role of educational research in podiatric medical education and provided a framework for the profession. One of the research categories emphasized by these authors was student assessment [3]. Assessment is a system whereby domain-based knowledge acquisition and other professional accomplishments are evaluated by using defined criteria, and it often includes measurement with a numeric value [4]. The purposes of assessment include measuring levels of student knowledge and changes in those levels over time, evaluating strengths and weaknesses, ranking students, and motivating them [4]. Many basic science courses in podiatric medical education use multiple-choice questions (items) to assess learning and acquisition of domain-based knowledge. Because depth of learning is largely driven by the quality of these items [5, 6], the ability to create valid and reliable items that reflect critical thinking is an important skill for podiatric medical educators [7].
Figure 1.
General anatomy one-best-answer (A-type) item related to the back. Note the clinical vignette, lead-in question, and answer choices, of which there are four distractors and one best choice (D).
Figure 2.
General anatomy extended-matching (R-type) item related to the head. Note the theme, option list, lead-in question, and a clinical item stem. On an assessment, there would be several item stems related to this list. The one best choice for the stem shown is (E).
Critical thinking is a metacognitive, nonlinear process of purposeful judgment that includes self-directed learning and self-assessment [8–10]. Critical thinking can be facilitated during an assessment by using items that are linked to higher levels of Bloom’s taxonomy [11, 12]. Palmer and Devitt [12] compared modified essay and multiple-choice items to assess their effectiveness in measuring higher-order cognitive skills. They found that, when constructed well, multiple-choice items measured higher levels of critical thinking [12]. The evolution of the American Podiatric Medical Licensing Examination (APMLE) into an assessment that measures higher-order thinking [7] further requires that multiple-choice items in basic and clinical science courses be created to help students prepare for their licensing examinations and facilitate critical thinking.
Figure 3.
Assessment dates during a semester for the GA 2010 cohort. The mock quiz and mock practical were given to familiarize students with the style and depth of the items; these assessments were not calculated as part of the final course grade. Five hundred items (324 multiple-choice items and 176 practical items) were administered throughout the course. Note that the time between assessments was the same for the GA 2011 cohort, although the assessment dates differed.
Many different types of multiple-choice items are used in medical education, including true/false (C-, K-, and X-type), one-best-answer (A-type), and extended-matching (R-type) items. True/false items have fallen out of favor on many licensing examinations because of ambiguities among experts as to their correct answers and the fact that they often assess isolated facts [13]. In contrast, one-best-answer items consisting of a stem (usually a clinical vignette) and a lead-in question followed by a series of choices are preferred because they can be constructed on a continuum from least correct to most correct [13]. When constructed well, these items have demonstrated high validity and reliability [14, 15]. Figure 1 is an example of a one-best-answer item that was used in the general anatomy (GA) course. An extended-matching item is a multiple-choice item that consists of a theme, a list of options (typically ≤20), a lead-in question, and several stems [16]. Figure 2 is an example of an extended-matching item that was used in the GA course. These items have also demonstrated high validity and reliability [17, 18]. Bhakta et al [5] used Rasch analysis to investigate the psychometric properties of extended-matching items on medical student examinations and found that the items demonstrated high internal construct validity and minimal bias. These findings have been supported by other studies [19].
In 2010, the GA course given to first-year students at the New York College of Podiatric Medicine was redesigned to emphasize clinically oriented anatomy. As part of this change, we decided to primarily use multiple-choice items (single-best-answer and extended-matching items) modeled after those used on the United States Medical Licensing Examination (USMLE) Step 1 [20]. Our decision to create USMLE-style items stemmed from a belief that anatomy items should correlate with higher levels of Bloom’s taxonomy and promote critical thinking. We qualitatively analyzed the item-writing manuals of three medical professions (podiatric [7], osteopathic [21], and allopathic [20] medicine) and judged that the one created by the National Board of Medical Examiners [20] provided the most in-depth information. Furthermore, the National Board of Medical Examiners visited the New York College of Podiatric Medicine several years ago and provided an in-house item-writing workshop for the faculty.
Figure 4.
Mean grades for the GA 2010 and GA 2011 cohorts. Note the positive trend in mean grades over time and the similarity in grades between the two student cohorts.
Table 1.
Analysis of Variance of Assessments Between the GA 2010 and GA 2011 Cohorts
The purpose of this longitudinal 2-year study was to investigate the psychometric properties of the USMLE-style items used in the GA course and describe podiatric medical student perceptions of the experience. Our intention is to provide a framework for other podiatric medical educators who may want to implement similar items in their courses.
Table 2.
Reliability of Lecture Assessments
Figure 5.
Correlations between final course grades (FCGs) and lecture examinations. A, GA 2010 cohort. B, GA 2011 cohort.
Figure 6.
Final course grade distributions for the GA 2010 (A) and GA 2011 (B) cohorts. Note the similarities between the histograms and the superimposed best-fit trend lines. There was no significant difference in final course grades between cohorts (P = .15).
Figure 7.
A, The GA 2010 cohort post-course survey data (values in parentheses are the number of respondents). B, Bar chart of the survey results.
Methods
Two hundred first-year podiatric medical students taking GA in two cohorts (GA 2010 and GA 2011) were each administered 13 assessments (nine lecture quizzes/examinations and four laboratory practicals) during a semester (Fig. 3). Data included in this study were derived from the quizzes/examinations, which contained 324 multiple-choice, USMLE-style items correlated with levels 3, 4, and 5 of Bloom’s taxonomy. The items were predominantly clinical vignettes and either single-best-answer or extended-matching items. The psychometric properties of the items and lecture examinations were obtained and analyzed. Statistics were calculated with a commercially available software program (SPSS Statistics, version 19.0; IBM SPSS, Armonk, New York). A P value of less than .05 for between-cohort differences was considered statistically significant. Mean grades for each assessment were recorded over time, and intercohort comparisons were made. Pearson r correlations were used to investigate the relationship between final course grades (FCGs) and lecture examinations. An anonymous post-course survey containing Likert-style items was administered electronically to collect data on student perceptions of the assessments.
Results
Psychometric Properties of Items
Most USMLE-style items demonstrated strong psychometric properties, with point biserial (PB) correlations of at least 0.20 and 25% to 75% of students answering the items correctly. The PB correlation for the one-best-answer item in Figure 1 was 0.42 (GA 2011 cohort), which indicates very good discriminative ability; 60.58% of students answered the item correctly, and this percentage represents the difficulty of the item [22].
Figure 8.
A, The GA 2011 cohort post-course survey data (values in parentheses are the number of respondents). B, Bar chart of the survey results.
The PB correlation for the extended-matching item in Figure 2 was 0.53 (GA 2011 cohort), which indicates excellent discriminative ability. The item was considered easy because 88.35% of students answered it correctly.
Assessments Between Cohorts
The mean grades of all of the assessments administered to both cohorts are shown in Figure 4. Analysis of variance of grades between the two cohorts revealed statistically significant differences on the following assessments: mock quiz, quiz 3, quiz 4, quiz 5, practical 1, and practical 2 (Table 1). The differences in mean grades between cohorts on the three lecture examinations were not statistically significant (P > .05).
Psychometric Properties of Lecture Examinations
The lecture examinations demonstrated high reliability, with Kuder-Richardson 20 (KR-20) coefficients of 0.71 to 0.76. The KR-20 is a correlation statistic with a range from 0 to 1.00 [23]. The KR-20 coefficients suggest that the lecture examinations had very high internal consistency reliability and, therefore, measured a single cognitive factor (ie, anatomy) (Table 2). In addition, we are confident that the student grades on these examinations were reliable (consistent or reproducible) and represented their true grades.
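As an illustration of how a KR-20 coefficient of the kind reported in Table 2 can be computed, the following Python sketch applies the standard KR-20 formula to a matrix of dichotomously scored (0/1) responses. The function name `kr20`, the small response matrix, and the choice of the population variance of total scores are assumptions made for illustration only; the study reports using SPSS, and the data below are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson 20 for a students-by-items matrix of 0/1 scores.

    KR-20 = k / (k - 1) * (1 - sum(p_j * q_j) / var(total scores)),
    where k is the number of items, p_j the proportion of students
    answering item j correctly, and q_j = 1 - p_j.
    Assumption: population variance is used for the total scores.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * q across all items.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


# Hypothetical data: 4 students x 3 items (1 = correct, 0 = incorrect).
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
coefficient = kr20(example)
```

Some software uses the sample variance (n − 1 denominator) instead, which yields slightly different coefficients on small samples, so the convention should be checked before comparing values across tools.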
Relationship Between FCGs and Lecture Examinations
The relationships between FCGs and lecture examinations for the GA 2010 and GA 2011 cohorts are shown in Figure 5. The relationships strengthened from examination 1 to examination 3, a finding observed in both cohorts.
FCG Distributions
The FCG distributions were almost identical and conformed to a Gaussian distribution (Fig. 6). The mean (SD) FCGs for the GA 2010 and GA 2011 cohorts were 78.41 (6.26) and 79.76 (6.96), respectively.
Student Perceptions
An anonymous survey containing Likert-style items was administered to both cohorts at the completion of the course (Figs. 7 and 8). The response rates for the GA 2010 and GA 2011 cohorts were 71 of 97 (73%) and 81 of 103 (79%), respectively.
Discussion
The PB correlation is an item discrimination index that falls between −1.0 and 1.0. The closer the PB is to 1.0, the better the item is able to discriminate between students who are doing well on an assessment (know the material) and those who are not (do not know the material) [22]. For medical school assessments, some authors recommend that items should have PB correlations of at least 0.50, whereas others recommend a minimum of 0.20 [23]. We analyzed the items post hoc, screened all of them against the PB benchmark of 0.20, and then reviewed every item that fell below it. An item with a negative PB correlation suggests that there may be a serious problem with the item (eg, the correct answer was not keyed correctly) [23]. Other items that fell below 0.20 were evaluated for validity, clinical importance, and wording. Additionally, all of the items were evaluated for test-wiseness, which is operationally defined as flaws in items that make them easier for some students to answer correctly by using only their test-taking skills [24].
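The post hoc screening described above can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the study (which reports using SPSS): the function names `item_stats` and `flag_items`, the response matrix, and the use of the uncorrected total score (the item itself is included in the total) are all assumptions, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def item_stats(item_scores, total_scores):
    """Return (difficulty, point_biserial) for one dichotomous item.

    difficulty is the proportion of students answering correctly.
    point_biserial = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the
    mean total scores of students who answered correctly and incorrectly,
    and s is the population SD of the total scores. It is None when the
    statistic is undefined (everyone right, everyone wrong, or s == 0).
    """
    p = mean(item_scores)
    right = [t for x, t in zip(item_scores, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
    s = pstdev(total_scores)
    if not right or not wrong or s == 0:
        return p, None
    r_pb = (mean(right) - mean(wrong)) / s * sqrt(p * (1 - p))
    return p, r_pb


def flag_items(responses, benchmark=0.20):
    """Return indices of items whose PB is undefined or below benchmark."""
    totals = [sum(row) for row in responses]
    flagged = []
    for j in range(len(responses[0])):
        column = [row[j] for row in responses]
        _, r_pb = item_stats(column, totals)
        if r_pb is None or r_pb < benchmark:
            flagged.append(j)
    return flagged


# Hypothetical data: 4 students x 3 items (1 = correct, 0 = incorrect).
example = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
items_to_review = flag_items(example)  # the third item has a negative PB
```

Flagged items are only candidates for review; as noted above, a negative value points first to a possible keying error, whereas low positive values call for a check of validity, clinical importance, and wording.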
The PB correlation for the one-best-answer item in Figure 1 was 0.42. Of 104 students, 60.58% answered this item correctly. The item, therefore, was considered difficult because the percentage of students who answered it correctly fell within the range of 50% to 75%. Items are considered low to moderately difficult if the percentage of students who answer them correctly is 70% to 85% [22]. A percentage greater than 85% indicates an easy item. The difficulty of the item in Figure 1 can be explained by the fact that it required the student to apply information (higher levels of Bloom’s taxonomy) as follows: first, the student had to envision the location of the alar ligament; second, the student had to know the function of the ligament; and third, the student had to hypothesize what would most likely result if the ligament were completely torn. Although this ligament was discussed in lecture and also described in the anatomy textbook, the textbook does not emphasize the clinical relevance of the ligament or the manifestations that result when it is completely torn. Thus, students who studied the structure and function of this ligament and were also able to apply this information could answer this item correctly.
The PB correlation for the extended-matching item in Figure 2 was 0.53, which indicates excellent discriminative ability. Of 104 students, 88.35% answered this item correctly. Therefore, this item was considered easy. Similar to the item in Figure 1, this item also required the student to apply information. However, more students may have answered this item correctly because the pterion and its relationship to underlying blood vessels were emphasized in the lecture and in the textbook. In fact, the textbook contained a clinical correlate description of the anatomy and clinical relevance of pterion fractures. Most students recognized that this was an important (and testable) piece of information and effectively studied it before the assessment.
As demonstrated in Figure 4, a mock quiz that contained USMLE-style items was administered 1 week after the start of the course (after 6 hours of lecture), and a mock laboratory practical was administered 2 weeks after the start of the course (after 8 hours of dissection). These mock assessments did not count as part of the FCG, but they had an important role in student learning. They allowed students to experience the depth of the USMLE-style items that they would face in the GA course and enabled them to adjust their study strategies so that they could effectively answer these items. The mean grades of the mock quiz and mock laboratory practical for both cohorts were well below 70%. Qualitatively, many students said that the mock assessments were their “wake-up” calls. As shown in Figure 4, for students in both cohorts, this wake-up call represented an approximately 30-point mean increase from the mock assessment to quiz 1. Overall, the trend was positive for both cohorts, with the highest mean grades achieved on the last assessment (laboratory practical 3). There was more stability in the range of mean grades for the GA 2011 cohort, and this may be explained by the fact that students in this cohort were able to query students in the GA 2010 cohort about their experiences taking GA. Students in the GA 2010 cohort did not have this luxury because they were the first group to experience the redesigned course. There were no statistically significant differences in intercohort mean grades on any of the lecture examinations, suggesting that they were highly reliable (Table 1). This reliability was further demonstrated by their similar KR-20 coefficients (Table 2). Essentially, although the items on the lecture examinations differed between cohorts, they consistently differentiated students who knew the material from those who did not across two independent cohorts of students. The KR-20 coefficients should be 0.60 or better for medical school assessments [22].
This reliability is also reflected in the similar FCG distributions of both cohorts (Fig. 6). The mean (SD) FCGs for the GA 2010 and GA 2011 cohorts were 78.41 (6.26) and 79.76 (6.96), respectively (P = .15). Correlational analyses were used to investigate the relationship between FCGs and the three lecture assessments. As shown in Figure 5, all of the correlations were significant and had positive slopes. The correlations strengthened with each successive lecture assessment (ie, from lecture examination 1 to 2 to 3), and this was demonstrated in both cohorts. The lecture assessments were cumulative, and lecture examination 3 was a cumulative final examination, which explains why the relationship between the FCG and lecture examination 3 was the strongest for both cohorts.
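For readers who wish to reproduce this kind of analysis, the sketch below computes a Pearson r between final course grades and the scores on one lecture examination. The function name `pearson_r` and the grade lists are hypothetical and serve only to show the calculation; they are not data from either cohort.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")

    return sxy / sqrt(sxx * syy)


# Hypothetical grades for five students (not study data).
final_course_grades = [72.0, 78.5, 81.0, 85.5, 90.0]
lecture_exam_scores = [65.0, 74.0, 80.0, 83.0, 92.0]
r = pearson_r(final_course_grades, lecture_exam_scores)
```

Repeating the calculation for each lecture examination within a cohort would allow the coefficients to be compared across examinations, in the manner summarized in Figure 5.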
The post-course survey contained Likert-style items, and response rates of 71 of 97 (73%) for the GA 2010 cohort and 81 of 103 (79%) for the GA 2011 cohort were obtained. Most students (>80%) reported that single-best-answer items were easier than extended-matching items. This perception may reflect students' inexperience in answering extended-matching items. Most students (>74%) reported that quizzes/examinations contained items that correlated with the lecture objectives. Most students (>76%) believed that the items on the quizzes/examinations were similar to those found on USMLE Step 1. During the course, students were encouraged to answer test items in anatomy review books tailored for USMLE Step 1 preparation so that they could identify their weak areas and, consequently, design more effective study strategies. Wenger et al [25] found that medical students who used practice items as a study strategy during a pathology course scored significantly higher on examinations. Although the podiatric medical students did not take the USMLE, they became familiar with the style of the items on GA assessments and compared them with those found in the anatomy review books. Therefore, they were able to judge whether the items on the GA assessments were similar to those found on USMLE Step 1. Most students (>84%) believed that they would do well on the anatomy portion of their boards (APMLE Part I) because of the items they had to answer in the GA course. We believe that this is an important finding because part of doing well on any assessment is being confident that you will do well. Students in the GA 2010 and GA 2011 cohorts took APMLE Part I, and their mean scores on GA were higher than the national means.
Implications for Podiatric Medical Education
These data suggest that USMLE-style items can be successfully incorporated into a basic science course in podiatric medical education. Furthermore, podiatric medical students largely value assessments that use these items. Our experience in this endeavor can help other podiatric medical educators create more valid and reliable items in basic and clinical science courses, with the goal of training podiatric physicians to think critically, ultimately improving patient care.
Assessment Guidelines for Podiatric Medical Educators
Items should assess relevant, clinically important information and not mundane or trivial facts [21].
To ensure validity, items should be reviewed by multiple experts in the academic discipline.
Items should be correlated with lecture objectives.
Items should be prospectively vetted with other health professional students or podiatric medical residents.
Items should be psychometrically analyzed after an assessment to document how they performed and then should be modified accordingly.
Financial Disclosure: None reported.
Conflict of Interest: None reported.