1. Introduction
Graduate admissions in psychology-related fields, as in many other fields, depends heavily on the use of the Graduate Record Examination (GRE) [1]. Tests such as the GRE, however, measure a relatively narrow band of analytical thinking skills [2,3,4,5]. Is it possible to expand measures to encompass a broader range of skills that might be relevant to graduate and later career success?
Much of the demand in graduate school is on students’ scientific reasoning skills—for example, generating hypotheses, generating experiments, and drawing conclusions from data. The goal of the present research project was to create a test of skill in scientific reasoning that could be used for graduate admissions in psychology or other behavioral and brain sciences.
Our test is based on Sternberg’s [6,7,8,9] theory of successful intelligence. The theory holds that intelligence can be understood in terms of information-processing components that combine to produce intelligent behavior.
Metacomponents, or higher-order executive processes, are used to plan what one is going to do, monitor it while it is being done, and evaluate it after it is done. An example of a metacomponent would be formulating a strategy to solve a problem, such as how to design an experiment to test a particular hypothesis.
Performance components are used to execute the strategies formulated by the metacomponents. An example would be inferring relations, as when one infers how the results of a particular test of statistical significance should be interpreted. Knowledge-acquisition components are used to learn how to solve the problem in the first place. An example would be selective comparison, where one tries to ascertain what information stored in long-term memory is relevant to solving a problem, such as retrieving knowledge about how to do an appropriate test of statistical significance in a particular experiment.
The various components can be used in problem-solving in a variety of ways. When they are applied to fairly abstract but relatively familiar kinds of material, they are used analytically. When they are applied to relatively novel tasks or situations, they are applied creatively. And when they are applied to concrete everyday situations, they are applied practically.
According to the theory, individuals can be strong in general abstract analytical skills but not necessarily strong in applying those skills to any one particular domain of practice. That is, analytical intelligence for relatively abstract kinds of problems is largely distinct from the practical intelligence that applies cognitive skills to particular domains of practice. For example, someone might be adept at solving number or letter series, or at solving general mathematical problems, but not be adept at applying the same inductive reasoning skills to a domain of practice such as legal, medical, or scientific problem solving [7,10,11]. The basic argument as it applies here is that the cognitive skills needed to succeed in actual scientific research are in part different from the abstract analytical skills measured by tests such as the GRE. In particular, based on past research (see, e.g., [5,12,13,14,15,16,17]), we expected our measures of scientific reasoning to yield scores that are only weakly related, statistically, to scores on tests of academic ability, such as the SAT (formerly the Scholastic Aptitude Test but now named only by its acronym), the GRE (Graduate Record Examination), or any conventional psychometric test of intelligence.
Sternberg and Williams [17] examined the validity of the GRE for predicting graduate performance in the Yale Psychology Department, specifically (a) first- and second-year course grades, (b) professors’ ratings of student dissertations, and (c) professors’ ratings of students’ analytical, creative, practical, and teaching abilities. The test was found to be predictive only of first-year graduate grades, except that the analytical section was predictive of other criteria, but only for men. In particular, they found the predictive validity of the GRE for first-year grades to be 0.18 for the verbal test, 0.14 for the quantitative test, 0.17 for the analytical test, and 0.37 for the subject-matter test. Sternberg and Williams, however, did not take the next step of proposing an alternative test.
Wilson [18] found the predictive validity of the GRE in predicting first-year grades in psychology to be 0.18 for the verbal test, 0.19 for the quantitative test, and 0.32 for the analytical test. Schneider and Briel [19] found the predictive validity of the GRE in predicting first-year graduate psychology grades to be 0.26 for the verbal test, 0.25 for the quantitative test, 0.24 for the analytical test, and 0.36 for the subject-matter test.
Kuncel, Hezlett, and Ones [20] performed a meta-analysis of the predictive validity of the GRE across fields. They found correlations of 0.34 for the GRE verbal, 0.38 for the GRE quantitative, 0.36 for the GRE analytical, and 0.45 for the GRE subject-matter tests. Note that these correlations are corrected for both restriction of range and attenuation. These correlations, therefore, reflect not the correlations for the actual testing, but rather the correlations that in theory would have been obtained for an idealized test of perfect reliability administered to participants with a full range of skill levels. Correlations from the Kuncel et al. study with other criteria were lower, for example, 0.09 for the GRE verbal, 0.11 for the GRE quantitative, and 0.21 for the GRE subject-matter test with research productivity (corrected for restriction of range). Correlations with time to complete the degree were 0.28, −0.12, and −0.02, respectively. Kuncel and Hezlett [21] and Kuncel, Wee, Serafin, and Hezlett [22] further found that standardized tests predict graduate students’ success.
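For readers unfamiliar with these psychometric adjustments, the standard textbook formulas give a rough sense of what the corrections do (a sketch in conventional notation; the exact procedures used by Kuncel et al. may differ in detail):

\[
\hat{r}_{xy} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}
\qquad \text{(correction for attenuation)}
\]

\[
\hat{r} \;=\; \frac{r\,(S_x/s_x)}{\sqrt{1 - r^2 + r^2\,(S_x/s_x)^2}}
\qquad \text{(Thorndike Case II correction for direct range restriction)}
\]

Here \(r_{xx}\) and \(r_{yy}\) are the reliabilities of the predictor and criterion, \(s_x\) is the predictor's standard deviation in the selected (restricted) sample, and \(S_x\) is its standard deviation in the unrestricted applicant population. Because reliabilities are below 1 and \(S_x\) exceeds \(s_x\) in selected samples, both corrections raise the observed correlation, which is why corrected values are best read as theoretical upper-bound estimates rather than as correlations one would observe in practice.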
A compendium of studies relevant to the preparation and publication of the “new” GRE has been prepared by the Educational Testing Service [23]. The new GRE, like the old GRE, has verbal reasoning and quantitative reasoning sections, but also a writing section. It appears psychometrically to measure constructs similar to those of earlier versions of the test, including general intelligence plus more specific verbal and quantitative knowledge and skills.
The GRE appears to be, at best, a modest predictor of success in graduate school, with its best results for first-year performance, although correlations can be raised by corrections for restriction of range and attenuation (thereby yielding theoretical rather than actual correlations).
The best predictor of future success appears to be the subject-matter (achievement) test, suggesting that knowledge of the subject-matter domain may be more important than abstract analytical thinking, which is somewhat domain-general and not necessarily directly relevant to success in a given academic discipline.
An additional factor to take into account is that data from the Educational Testing Service, which produces the test, show wide gaps between males and females on both the verbal and mathematical tests, favoring males [24]. These differences hold up across racial/ethnic groups. But there is no reason to believe that males do better than females in graduate school. The gap in performance therefore is an issue of some concern. There also are racial/ethnic group differences in scores [25]. The existence of such differences might lead one to be at least somewhat concerned about adverse impact in terms of equalizing opportunities for students to pursue graduate education in selective institutions.
New methods of conceptualizing skills for post-baccalaureate training are relevant for business school as well [26,27]. The Graduate Management Admission Test (GMAT) is the most widely used measure of managerial potential in MBA admissions. GMAT scores, although predictive of grades in business school, leave much of the variance in graduate business-school performance unexplained. The GMAT also produces disparities in test scores between groups, generating the potential for adverse impact in the admissions process. Hedlund and colleagues sought to compensate for these limitations by adding measures of practical intelligence to the admissions process in an MBA program. They developed two situational-judgment-test (SJT) approaches to measuring practical intelligence [28,29], a short form with relatively simple problems and a long form with relatively complex problems.
Hedlund et al. [26] administered the resulting measures to two samples of incoming MBA students (total N = 792). Across the two studies, they found that scores on both measures predicted success inside and outside the classroom and provided small, yet significant, increments beyond GMAT scores and undergraduate GPA in the prediction of variance in MBA performance. They further found that these measures exhibited less disparity across gender and racial/ethnic groups than did the GMAT.
In particular, the researchers first performed correlational analyses to determine the predictive validity of practical-intelligence scores relative to other predictors and the various performance criteria. Scores on both practical-intelligence measures were predictive of academic success. Students with higher scores on the short-form items had significantly higher first-year and final GPAs (r = 0.18 and 0.21, respectively), and also received higher grades on the team-consulting project (r = 0.17). Similarly, students with higher scores on the long-form items had significantly higher first-year and final GPAs (r = 0.21 and 0.30, respectively) and higher consulting-project grades (r = 0.17).
Both GMAT scores and undergraduate GPA also were significant predictors of first-year GPA (r = 0.44 and 0.30, respectively) and final GPA (r = 0.40 and 0.32, respectively). However, GMAT scores did not correlate significantly with the consulting-project grade (r = 0.06, ns). Prior work experience did not correlate with MBA grades, and actually exhibited modest negative correlations with the practical-intelligence measures (r = −0.12) and undergraduate GPA (r = −0.22).
Short- and long-form practical-intelligence scores exhibited modest but significant correlations with involvement in extracurricular activities, for which success typically requires some measure of practical intelligence in relating to other people. Students who scored higher on the short form participated in more student clubs (r = 0.15) and held more leadership positions (r = 0.11). Students with higher long-form scores held more leadership positions (r = 0.18).
Situational judgment tasks also can be useful in medical-school admissions. Lievens, Buyse, and Sackett [30] developed an SJT as a possible supplement to cognitive predictors for predicting success in a medical and dental curriculum in Belgium. They found that traditional cognitive predictors were useful, but also that a video-based SJT added significantly to prediction of performance in courses that involved interpersonal aspects of patient care, but not to other courses. Lievens and Sackett [31] subsequently showed that SJT performance was relevant to predicting quality of performance in later internship and professional practice (see also Sternberg [32]).
Shultz and Zedeck [33,34] explored predictors of success in law school beyond the Law School Admission Test (LSAT). They used a broad battery of assessments measuring personality constructs, interests, values, and judgment. Their assessments predicted competency in accomplishing the tasks of a lawyer but at the same time had virtually no adverse impact on underrepresented minority applicants. The authors suggested that their measures, combined with the LSAT and undergraduate GPA, could assess law-school applicants for their predicted professional competence as well as academic performance in law school.
Approaches that go beyond conventional standardized testing have also been proposed at the undergraduate level and below. For example, Oswald, Schmitt, Kim, Ramsay, and Gillespie [35] (see also Schmitt et al. [36]) have found biographical data and situational-judgment tests (the latter of which we also used) to provide incremental validity beyond the SAT. Sedlacek [37] has developed non-cognitive measures that appear to have had success in enhancing the university-admissions process.
Sternberg and his colleagues (see [5,14,38,39,40,41,42,43,44,45,46,47]) have proposed measures that assess analytical, creative, practical, and wise thinking for undergraduate admissions purposes. They found that they could improve prediction of academic and extracurricular performance in the first year of college and at the same time substantially decrease ethnic-group differences on their assessments (see also Kaufman [48] for the use of creativity measures to reduce ethnic bias).
Stemler, Grigorenko, Jarvin, and Sternberg [49] and Stemler, Sternberg, Grigorenko, Jarvin, and Sharpes [50] found that including creative and practical items in augmented physics, psychology, and statistics AP (Advanced Placement) Examinations reduced ethnic-group differences on augmented test scores relative to the original tests. For other new approaches to college admissions, see Sternberg, Gabora, and Bonney [51].
Grigorenko et al. [52] found that it was possible to improve prediction of private high school (prep school) performance beyond scores attained on the SSAT (Secondary School Admissions Test). And the same principles have been employed in a test for identification of gifted students in elementary school [53].
These findings, although preliminary, suggest the potential value of considering a broader range of abilities in admissions testing at the graduate and other levels. But how would such tests be constructed? Our rationale in the present research was that it should be possible to create better predictive assessments of graduate (and later) success by measuring analytical reasoning as practiced within the discipline, that is, scientific reasoning as practiced within the field of psychological science. That said, our study is not predictive but rather a correlational study that seeks to explore the potential usefulness of a new assessment of practical scientific reasoning skills relevant to success in the psychological sciences.
Based on the theory of successful intelligence, such assessments within a broad-scope scientific reasoning test should correlate with each other (convergent validity) but not correlate as much with tests of abstract analytical reasoning (i.e., g) (discriminant validity). Such tests should also have higher content validity and face validity than do the GRE verbal, quantitative, and analytical tests.
The purpose of this report is to describe the results of two studies developing an assessment for graduate admissions purposes. Our basic prediction was that our multiple scientific reasoning assessments would be positively and at least moderately correlated with each other (convergent validity) but only weakly correlated with measures of fluid and crystallized intelligence as well as with the SAT, which correlates highly with tests of intelligence [54].
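As a concrete illustration (not the analysis code used in the studies), the following minimal sketch shows how such a convergent/discriminant pattern can be checked: it compares the average inter-correlation among hypothetical scientific-reasoning subtest scores with their average correlation with hypothetical ability measures. All variable names and the simulated data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical score table: one row per participant. Column names are
# illustrative placeholders, not the variables used in the actual studies.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gen_hypotheses": rng.normal(size=n),
    "gen_experiments": rng.normal(size=n),
    "draw_conclusions": rng.normal(size=n),
    "fluid_ability": rng.normal(size=n),
    "crystallized_ability": rng.normal(size=n),
    "sat": rng.normal(size=n),
})

reasoning = ["gen_hypotheses", "gen_experiments", "draw_conclusions"]
ability = ["fluid_ability", "crystallized_ability", "sat"]

corr = df.corr()

# Convergent evidence: mean of the off-diagonal correlations among the
# scientific-reasoning subtests.
block = corr.loc[reasoning, reasoning].to_numpy()
convergent = block[np.triu_indices_from(block, k=1)].mean()

# Discriminant evidence: mean correlation between the reasoning subtests
# and the conventional ability measures.
discriminant = corr.loc[reasoning, ability].to_numpy().mean()

print(f"mean within-battery r: {convergent:.2f}")
print(f"mean cross-battery r:  {discriminant:.2f}")
```

With real scores, support for the prediction would show up as a within-battery mean clearly larger than the cross-battery mean; with the random data simulated here, both values hover near zero.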
The convergent measures were chosen based on the processes of psychological-scientific reasoning presented in virtually all introductory-psychology textbooks as well as experimental-methods books. (The reasoning is almost certainly relevant to other sciences as well.) For example, Breedlove [55] describes three of these processes as “come up with alternative hypotheses,” “design an experiment,” and “see which outcome you get, and therefore which hypotheses survived” (p. 41). Although the exact wording differs across texts, in virtually all of them, including our own [56], three of the processes correspond to what we call in our studies “generating hypotheses,” “generating experiments,” and “drawing conclusions.” In terms of the theory of successful intelligence, generating hypotheses and generating experiments fall primarily into the creative domain and drawing conclusions into the analytical domain, but as applied to psychological-scientific research, they also fall into the practical domain—specifically, the practice of doing science. Thus, the use of these analytical and creative skills is contextualized to the psychological-science research domain.
4. General Discussion
We have proposed new measures that could potentially be used to assess prospective students’ future success in graduate training and in their chosen research careers. The tests measure various aspects of scientific reasoning: generating hypotheses, generating experiments, drawing conclusions, reviewing, and editing. The editing task appears to have been quite novel for our subjects and may thus have tapped into fluid intelligence skills as well as whatever else it is that our scientific reasoning tasks measure. These tests are not viewed as a replacement for existing measures, but rather as a supplement to them [5,14]. Of course, graduate success in psychological science has many aspects: Our focus is on the aspect that we believe is most important—scientific reasoning—which conventional tests used for admission fail to measure directly.
We believe that the data we obtained are promising, and that they are conservative. Our subjects were Cornell University undergraduates, who are well above average with regard to academic skills. Our correlations were not corrected for restriction of range. We could not correct for restriction of range because, although we know our students were well above average on conventional standardized tests, we have no way of assessing how they differed from a typical psychology graduate-applicant population in scientific reasoning skills. Moreover, our scientific reasoning measures did not show the kinds of correlations with conventional admissions measures that would suggest that high scores on conventional measures lead to high scores on our assessments of scientific reasoning; quite the contrary. Correcting for restriction of range, therefore, might have inflated the correlations. All that said, our subjects were above average for college students and probably for applicants to graduate school, so we cannot be certain how well our results would generalize to academically more diverse populations.
The research described here is obviously a beginning, not an end, to the attempt to devise measures that supplement standard assessments of the cognitive skills needed for success in graduate school and in scientific careers.
First, although we know that the assessments correlate substantially with each other and, in general, less strongly with measures of general intelligence, it remains to be seen whether the assessments add to the prediction of success in graduate school. We have reason to believe that there is more to success in graduate school than what the GRE measures [17], but we do not know for sure that what remains to be predicted is what our assessments measure.
In the ideal case, we would have predictive-validity data for graduate-school or career success. In practice, such data are hard but not impossible to obtain. If one uses first-year graduate students as participants, one can follow their careers through graduate school. However, in any reasonably competitive graduate program, the matriculated students already represent a severely restricted range of applicants, so any correlations obtained for them will be suspect. It thus may make sense to report both the original correlations and the correlations corrected for restriction of range, if one knows the range of the relevant population. Moreover, such participants are not at the level of educational advancement one is really interested in, namely, undergraduates who are or will be applying to graduate school. If one instead uses college students, and especially seniors, obtaining predictive-validity data is difficult. Some of the students ultimately will go to graduate school and others will not, so there will be a (possibly severe) drop-out effect, one skewed toward those with stronger credentials. Moreover, even those who go to graduate school will go to different programs, so it will be hard to obtain data, and the data obtained will be very difficult to compare from one graduate institution to another.
Second, the assessments we have now measure scientific reasoning skills, but they do not measure all the skills relevant to success in graduate school and in careers. For example, we have not included problems requiring reasoning with concepts from probability and statistics. The work of Kahneman and Tversky [63,64] suggests that such reasoning is important in understanding one’s data. Stanovich [65,66] has argued that rational thinking, of the kind measured by the Kahneman and Tversky problems and others like them, constitutes a construct that is important but largely missed by conventional standardized tests. Motivation also is important for success, on tests but more importantly in graduate school [67,68,69]. Arguably an even more important skill for success in graduate school and in a career is creativity, or taste in problems [70,71,72]. In the end, creativity in the field is at least as important as scientific reasoning where the problem is a given.
Third, the predictive value of our assessment, like that of any assessment, will depend on the culture in which it is used [73,74,75]. This assertion applies not only to cultures across nations, but also to departmental cultures. The assessment as it now exists emphasizes research skills. But to have a really good predictive test, one needs a strong sense of exactly what it is that one wants to predict [76]. In a teaching-based institution—perhaps a small college—research skills may matter less and teaching skills more. Current research with an assessment designed to measure teaching skills will investigate such measures (see [77], in preparation).
Fourth, other theories of intelligence and related constructs might suggest further kinds of assessments relevant to the prediction of graduate-school success in psychology. For example, Carroll’s [57] theory includes many abilities potentially relevant to graduate study that are not assessed by conventional measures. Ceci’s [78] bioecological approach might suggest contextual measures relevant to graduate study. Gardner’s [79] theory might suggest additional measures of what he refers to as interpersonal and intrapersonal intelligence. Mayer and Salovey’s [80] concept of emotional intelligence also might be relevant; it probably overlaps with Gardner’s interpersonal and intrapersonal intelligences and with Sternberg’s practical-intelligence construct. It remains to be seen what these theories might add in terms of measures.
Fifth, there is today, as always, the question of whether the kinds of tests we use are equitable across different racial, ethnic, and other groups (e.g., [81,82,83,84]). There are many different views on whether the tests are fair across groups. In our own work, we have found that the kinds of tests we used tend to reduce racial/ethnic differences while increasing prediction [14]. We did not have enough subjects from different racial/ethnic groups to make any kind of determination in the present studies. This issue remains for further studies to sort out.
Finally, our experience is that it is extremely difficult to get universities to change their admissions procedures [1,5]. There is a great deal of entrenchment and reluctance to change, resulting in admissions systems that remain more or less constant decade after decade.
Nevertheless, psychology departments and departments in related fields (education, human development, cognitive science) probably can do better in selecting those students who will not only, or even necessarily, get the best grades, but rather who will be the best scientists. A test of scientific reasoning as part of the admissions process seems like a good first step in that direction. Any such test measures only a particular set of skills at a given point in time. One can argue about whether general intelligence is modifiable (e.g., Herrnstein and Murray [83], taking a negative position; Feuerstein [85], taking a positive position). However, scientific reasoning skills can certainly be improved [86,87,88,89], at least in part by exercising them and reflecting on how one can make them better. For example, one learns how to review articles by reviewing them and by reading other people’s reviews. So our kind of assessment can be used diagnostically to aid improvement as well as to evaluate mastery of skills. In the end, using assessments for diagnostic purposes to improve skills may be more valuable to the science of psychology than merely using them to test for mastery.