1. Introduction
Assessment is a necessary part of academic studies at all levels. With few exceptions, an in-class, closed-book, invigilated pen-and-paper exam is the traditional assessment method [1]. There are certainly other assessment methods in use, but the main assessment method at prominent universities is still a proctored, in-class exam (ICE). ICEs are typically characterized by hard time limits (2–6 h) and the stress this imposes on the students. The main reason for advocating ICEs seems to be that they minimize the risk of the exam being compromised by unethical student behavior [1,2], but they have been criticized for several reasons: they lure students into superficial learning [1], they do not promote students' 'generic skills' [3], they impose an unnatural pressure on the students that has an adverse impact on their performance [4], they are not consonant with the prevailing theory of 'constructive alignment' in higher education [1,5,6] and they are not suitable for assessing students' performance on the higher levels of Bloom's taxonomy scale [7,8,9]. Bloom's taxonomy scale [9] is a hierarchical description of students' learning (revised by Anderson et al. in 2001 [7]). This taxonomy comprises all learning domains (cognitive, affective and sensory), but in this context (as in most higher education contexts) only the cognitive domain is considered. At the lowest level, students' learning is characterized by rote learning ('remember'). The succeeding levels are 'understanding', 'applying', 'analyzing', 'evaluating' and 'creating'; students move from rote learners to true scholars who create new knowledge. The idea of Bloom's taxonomy is that it describes the phases that learning undergoes, and as teachers it is paramount to understand at what level the present students are, since this has direct consequences for the curriculum (objectives and activities) and, most of all, for the design of the exam. An in-class, multiple-choice question (MCQ) test may be justified on the lowest levels, but may not be appropriate on the highest levels; the higher levels require that students can define problems, predict, hypothesize, experiment, analyze, conclude and are capable of reflective thinking [10], and they also indicate an "intrinsic creativity or an ability to express ideas in their own words" [11] (p. 57). These skills can only (or preferably) be tested with ill-defined/ill-structured or open-ended questions [12], which require far more time to answer than ICEs allow. These are compelling reasons for researching/reviewing take-home exams (THEs). Further, there appears to be no academic literature to support the widespread use of ICEs as the predominant assessment method [1].
There are also other reasons to dispute the use of ICEs. During the last couple of decades, two dramatic events have fundamentally changed the basis on which universities conduct tertiary education: the massification of higher education [13] and the emergence of the Internet. The consequence of the former is that classes grow (both in size and heterogeneity); the latter has created a wealth of massive open online courses (MOOCs) and e-universities, and almost any information is conveniently available from a computer keyboard. Both these events have forced universities to make comprehensive readjustments. The new cohort of students has different study habits and a much wider spread in academic ability and skills when they enter tertiary education. MOOCs and e-universities usually cannot facilitate ICEs for practical reasons because their students are geographically scattered across the globe [2]. As early as 2003, Aggarwal stated that "technological advancements and student demands" have necessitated a shift from a "brick and mortar synchronous environment" to a "click and learn asynchronous environment" [14] (p. 1). This implies a shift from learning on the universities' terms to learning on the students' terms, or as Hall phrased it: "take-home exams fit the new millennium student's lifestyle" [15] (p. 56). Universities also need to meet the stakeholders' (i.e., the employers') expectations and demands: what knowledge and skills are they really looking for in their next recruitment? Someone who can solve a problem under pressure in a couple of hours, or someone who can retrieve, apply and synthesize information from the Internet?
There are also some issues with ICEs when it comes to the testing of learning objectives. A tertiary course at a university is scaffolded by a formal syllabus that constitutes the framework of the course. The heart of the syllabus is the section that lists the learning objectives, and the most fundamental idea of conducting an exam is to confirm that students have met the learning objectives; a 'pass' grade should indicate that the student has reached all of the course's objectives. The problem with an ICE, at least in 'hard' disciplines like STEM (Science, Technology, Engineering and Mathematics), is that due to the imposed time constraint, items that test all objectives cannot be included. Even if they are, a student may only need 50% of the maximum score to get a 'pass' grade. Hence, a student who is awarded only a pass grade has most likely not reached all the objectives in the syllabus. This could be effectively neutralized by introducing THEs. Due to the elongated time-span and the unlimited access to information, there is no longer any reason not to include test items that cover all of the syllabus' learning objectives, and to pass the exam, students must prove that they have met all objectives. This would be a reasonable consequence of introducing THEs in STEM disciplines; it would benefit students' learning and would be a weighty argument against incredulous stakeholders.
This work emerged from a need to find alternative assessment methods consonant with the 2020 students' study habits, the e-universities' conditions and future stakeholders' demands, and to see if a general introduction of THEs in STEM disciplines could be justified. For these reasons, the research concerning the traditional take-home exam has been systematically reviewed. In this context, a THE is an exam that the students can do at any location of their choice; it is non-proctored and the time limit is extended to days (rather than hours, as is typical for an ICE). As opposed to 'home assignments', THEs are always 'high-stake', i.e., they have a decisive impact on the students' grade. THEs have been used for decades, but mostly in certain disciplines; they are well established in 'soft' disciplines (like psychology) but less common in 'hard' disciplines (like medicine and engineering). One of the aims of this work was to review the research conducted on THEs to date and to understand why they are prevalent in some disciplines while uncommon in others. THEs have some apparent advantages and disadvantages. The extended time limit implies less stress on the students, and more complex and open-ended questions can be used, which would increase the test validity. On the disadvantage side, there is of course the apparent risk of cheating when the exam is not proctored. One objective of this review was to map the community's consensus concerning non-proctored take-home exams, contrast the advantages with the disadvantages and thereby provide a basis for deciding if, when and how to implement THEs in higher education. This should be of utmost interest to tertiary educators in general, and to those who conduct their courses online (where in-class exams may be inconvenient or impractical) in particular. Another objective was to find scientific arguments that could justify a spread of THEs into STEM disciplines as well.
To that end, the following questions were targeted:
- Q1: What advantages and disadvantages of THEs are contended by the community?
- Q2: What are the risks of THEs and can they be mitigated enough to warrant widespread use?
- Q3: Are THEs only appropriate for certain levels on Bloom's taxonomy scale?
- Q4: THEs are non-proctored, students have access to the Internet and the time-span is (typically) extended. How does that affect the question items on the THE?
- Q5: How do THEs affect the students' study habits during the weeks preceding the exam and how does that affect their long-term retention of knowledge?
- Q6: Do THEs promote students' higher-order cognitive skills (HOCS)?
By HOCS, we refer to the ability to find, validate, select, integrate, synthesize, comprehend, communicate and present information; the ability to function in teams; critical thinking; ethical responsibility; sustainability and social commitment; and the ability to provide a holistic perspective [3,16]. 'Retention of knowledge' is defined as the knowledge that students retain (some) weeks after the initial testing [17].
This work is closely related to a previous review conducted by Durning et al. [18], who reviewed the research to date on open-book exams (OBEs) and closed-book exams (CBEs). The work conducted here differs from Durning et al.'s in that their OBEs were still proctored (unethical behavior was not considered) and their focus was mainly medical students. This work focused specifically on non-proctored THEs and their implications/issues (and included all disciplines), but THEs are closely related to open-book exams; THEs are just an extension of OBEs [15], and some results from OBE research will be pertinent to this review. For example, despite the fact that several works report a clear correlation between high performance on ICEs and better practice outcomes [19], Durning et al. were not able to conclude that ICEs should be the preferred examination method in the health care professions [18].
This work was, for the most part, a qualitative review, focusing on the occurrence of themes and conclusions rather than on quantitative analysis.
2. Methods
This systematic review was conducted according to Gough's nine-phase process [20], as outlined by Bearman et al. [21]. This process stipulates that database searches are preceded by elaborate inclusion/exclusion criteria as well as articulated search and screening strategies. Potential works were primarily identified by searching five databases (Education Database, ERC, ERIC, Scopus and Web of Science) with characteristic search phrases and by pursuing cited references in these works (both forwards and backwards). These searches were finally complemented with a search on Google Scholar.
2.1. Keywords
The primary keyword was 'take-home exam', restricted to 'higher education'. The databases' thesauruses suggested that 'test' and 'assessment' are valid synonyms for 'exam', and that 'tertiary education' is a valid synonym for 'higher education'. Hence, in the five primary databases, the following search condition was used:

('take-home exam*' OR 'take-home test*' OR 'take-home assessment*') AND ('higher education' OR 'tertiary education')

In Google Scholar, only 'take-home exam*' AND 'higher education' was used in all fields, but patents and citations were excluded.
Table 1 illustrates the number of hits produced by each database.
2.2. Screening Algorithm
The screening and inclusion algorithm is illustrated in Figure 1.
No screening for subjects or students’ level was applied; both undergraduates and postgraduates were included (colleges, universities, vocational schools etc.). There was also no screening for type of tertiary education and no geographic preferences were applied.
The huge number of hits in Education Database and Google Scholar called for a pre-screening process where only titles were used to determine the relevance to this review. This was followed by a redundancy screening where duplicates were removed. Next, data samples were screened by abstracts and the remaining samples were perused full-text. The full-text perusal also included forward and backward ‘snowball sampling’ for additional items.
2.3. Inclusion/Exclusion Criteria
In the pre-screening process that was applied to the Education Database and Google Scholar samples, the main inclusion criteria were that the title, or the first page excerpt, should include the phrases ‘take-home’ and ‘exam/test/assessment’ or otherwise give a clear indication of being related to question items Q1–Q6. For example, if the title included keywords like ‘HOCS’, ‘higher-order cognitive skills’, ‘non-proctored’, ‘open-book exam’, ‘retention test’ or ‘Bloom taxonomy’ they were passed forward to the abstract screening stage.
Abstracts from 168 samples were screened by exclusion criteria; if it was obvious that they were not concerned with question items Q1–Q6, they were excluded. Sixty-eight samples remained at the full-text perusal stage. In order to organize and code the works, a matrix was designed with one column for each question (Q1–Q6), and any content in the works that was identified as relevant to any of the questions was copied into the matrix. This provided a convenient basis for the subsequent analysis of all works, question by question. After the full-text perusal of all 68 works, the matrix contained 35 works that were considered to contribute significantly to answering question items Q1–Q6. Seven works were considered particularly interesting because they supported their conclusions with retention tests.
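To make the coding scheme concrete, the following minimal Python sketch shows how such a works-by-questions matrix could be represented. It is a hypothetical illustration only; the review does not specify any implementation, and the work names and excerpts below are placeholders.

```python
# Illustrative sketch of the coding matrix: one row per reviewed work,
# one column per research question (Q1-Q6). Names are placeholders.
from collections import defaultdict

QUESTIONS = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]

# matrix[work][question] -> excerpts from that work relevant to that question
matrix: dict = defaultdict(lambda: {q: [] for q in QUESTIONS})

matrix["Work A"]["Q1"].append("claims THEs reduce exam anxiety ...")
matrix["Work B"]["Q5"].append("ran a delayed retention test after the THE ...")

# Works contributing to at least one question survive to the analysis
# (35 of the 68 full-text works in this review).
contributing = [w for w, cols in matrix.items() if any(cols.values())]
print(contributing)  # ['Work A', 'Work B']
```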
These 35 works are presented in Table 2, where their contributions to each question item are indicated. The methods used in all these works are summarized in Table 3, where the validation of each work has also been assessed. The main validation criteria were that controlled experiments had been performed on random groups, followed by sound statistical analyses (p-values or other quantitative numbers accounted for) (= 'Yes'). 'Personal reflections' and 'Personal comments' indicate that no experiments were performed; reasoning and conclusions are based on the authors' personal experience only. 'Hypothesis testing' means that at least a null hypothesis was formulated and properly tested with statistical tools (paired t-tests, ANOVA, etc.) and that p-values are properly accounted for ('Yes'). 'Case study' indicates that take-home exams were implemented but no hypothesis was formulated, and results and conclusions were based on non-quantitative analyses (mostly interviews). 'Survey' indicates that anonymous questionnaires were used to probe students' opinions/attitudes, and 'Review' means that others' works have been summarized. 'Pilot study' refers to an investigation where participation was voluntary, and 'Synthesizing from others' means that data from several other experiments were analyzed and new conclusions were reported. In order to be classified as 'Validated' ('Yes'), the experiments must have been properly designed (randomized groups, clear hypothesis/research question) and properly analyzed with established quantitative methods.
3. Results
The results are presented as a summary of the conducted research associated with each one of the posed questions.
Q1: What advantages and disadvantages of THEs are contended by the community?
Proposed advantages and disadvantages are listed in Table 4 and Table 5, respectively, in order of the most frequently cited advantage/disadvantage.
The most cited advantages are the reduction of students’ anxiety, the opportunity to test HOCS and conservation of classroom time. The main disadvantage of THEs, purported repeatedly, is the apparent risk of unethical student behavior.
Q2: What are the risks of THEs and can they be mitigated enough to warrant widespread use?
By far the most frequently cited risk of THEs is unethical student behavior, and considerable effort has been devoted to preventing or diminishing cheating. Table 6 summarizes the remedies proposed by the community.
The cheating issue seems to dominate all published risk analyses, but other concerns have been voiced. There is some concern that students will not read the whole course material and will only hunt for answers, but this can easily be mitigated by including questions about all the material [17,32,38] (see also Q5 below).
Q3: Are THEs only appropriate for certain levels on Bloom’s taxonomy scale?
No research has been found that specifically addresses this issue, but it has been commented on in some works. Several researchers assert that THEs are more appropriate on the higher taxonomy levels and do not recommend them on the lower levels, because answers can easily be retrieved from the Internet/textbook or copied from peers [3,4,8,17,24]. Hagström and Scheja [39] took it one step further and showed that introducing a meta-reflection on THEs stimulated a deep approach to learning (students were asked to describe and motivate their strategies and the literature used).
Q4: THEs are non-proctored, students have access to the Internet and the time-span is (typically) extended. How does that affect the question items on the THE?
There seems to be an almost unanimous opinion in the community that THE question items should be designed to test the higher taxonomy levels of understanding and/or generic higher-order cognitive skills (HOCS). Questions should be open-ended and require prose or essay answers rather than multiple-choice questions (MCQs), because it is hard to design MCQs that go beyond the memorization level [4,8,10,11,26,31,38,42]. Due to the extended time-span implied by a THE and the fact that students are "freed from the toil of memorization", "the instructor can get tough for good reasons" [22] (pp. 344–345). A THE allows the instructor to include questions covering all the course material [17,32,38]. Questions should force students to higher-level thinking, to apply knowledge to novel situations and to synthesize material [17]. Bredon [11] exemplifies this by suggesting that graphs could be used with accompanying questions that force students to draw interpolating and extrapolating conclusions. In politics, Svoboda [10] (p. 231) exemplifies this by contrasting a typical ICE question, such as "What are the three major branches of the federal government?", with a THE question such as "Should we have governments?", which elicits/assesses students' higher-level thinking skills.
Q5: How do THEs affect the students’ study habits during the weeks preceding the exam and how does that affect their long-term retention of knowledge?
This appears to be the most polarized question in the community, concerning both the study habits and the long-term retention. Some researchers assert that students who know that they will be assessed by a THE tend to study less than if they are assessed by an ICE [3,15,23,25,32,35]. Others take the opposite standpoint and argue that students study more if they know they will have a THE [4,10,22,24,33,40]. The main disagreement seems to be whether THEs tempt students not to read the entire course material and instead only hunt for answers once the THE is available.
It is interesting to note that the group of researchers who claim that THEs promote long-term retention better than ICEs is (almost) identical to the group who claim that students study more for THEs. However, only seven works were found where actual retention tests were conducted (as unannounced tests some time after the THE) to support these claims [4,6,17,23,25,29,32].
Q6: Do THEs promote students' higher-order cognitive skills (HOCS)?
The community seems to agree that THEs are an excellent tool for promoting HOCS; THEs both cultivate HOCS [8,10,22,27,41,42] and facilitate a more accurate assessment of HOCS [3]. By including questions on material that is not covered in class, HOCS are cultivated and a shift from algorithmic problem solving to conceptual understanding is fostered [33]. THEs are also recommended for promoting, cultivating and assessing team work [4,10,30,42]. THEs have also been shown to foster an understanding of the learning process [1].
The extended time-span allows students to meta-reflect on their answers, which has been shown to have a positive impact on their scores [39]. However, designing appropriate question items to assess students' HOCS is a challenging task even for experienced instructors (ibid.).
5. Conclusions
This review indicates that THEs may not be appropriate on the lowest levels of Bloom's taxonomy scale. The opportunity for cheating is simply too tantalizing, because test items on the 'knowledge' level are usually fact-oriented and the answers are too easily available on the Internet. There is a dispute in the community about whether ICEs or THEs best facilitate deep learning, but this review indicates that for the higher taxonomy levels, THEs are preferred by the community because higher-order thinking and reflection require more time (and less stress imposed on the students). The community also seems to agree that on the higher taxonomy levels, cheating on THEs is a minor problem that can be mitigated/prohibited/detected (or is non-existent).
If THEs are used on the lower taxonomy levels, a combination of ICEs and THEs might be worth considering. This was first suggested by Ebel in 1972 [55] as a means to base the assessment on both the "quickness of intellect" (captured by the ICE) and the "persistence of effort" (captured by the THE). This idea was the basis of an assessment method developed by John Bailey at Clark State University [28]. Bailey used THEs as "a second chance to learn": an ICE is complemented with a THE, and when the ICE was handed in, it was exchanged for the THE. The total score was calculated according to an elaborate formula. Assume that the maximum score of the ICE is T points and that a student scores IC points on the ICE and TH percent on the THE. The total score is then [28]

Total score = IC + 0.5 × (T − IC) × (TH/100)

If T = 60 points and a student scores 36 points on the ICE and 75% on the THE, the total score is 45 points (36 + 0.5 × 24 × 0.75). According to Bailey [28], this offers students a second chance without "giving away" grades. Note that the score on the ICE was not disclosed until the THE was returned; this "forces" the students to work hard also on the THE and to turn it in on time.
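To make Bailey's formula concrete, here is a minimal Python sketch. The function name and the range checks are our own; only the formula itself is taken from [28].

```python
def bailey_total_score(t_max: float, ic_points: float, the_percent: float) -> float:
    """Combined ICE/THE score according to Bailey's formula [28].

    t_max       -- maximum score of the ICE (T)
    ic_points   -- points scored on the ICE (IC)
    the_percent -- score on the THE in percent (TH)
    """
    if not 0 <= ic_points <= t_max or not 0 <= the_percent <= 100:
        raise ValueError("scores out of range")
    # The student recovers half of the points missed on the ICE,
    # weighted by the THE percentage.
    return ic_points + 0.5 * (t_max - ic_points) * the_percent / 100


# Worked example from the text: T = 60, IC = 36, TH = 75% gives 45 points.
assert bailey_total_score(60, 36, 75) == 45.0
```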
Foley [46] suggested a similar approach, where the ICE is first returned with only the total score marked. The students were invited (but not required) to take home the ICE and redo it using any aid; students who could identify wrong answers and provide a convincing rationale gained credit.
This may be a very good way to force students to study hard in the weeks preceding the exam (the ICE) and also to turn the exam into a learning activity (the THE). The dispute about whether the ICE or the THE best promotes retention learning would be defused, because students do both. This method would probably not affect the trust of our stakeholders (although it would quite likely require more teacher hours for grading two exams), and it would also smoothly introduce students to the change in assessment methods awaiting them later as they climb Bloom's taxonomy ladder.
Recommendations
If THEs are used, consider using them on the higher levels of Bloom's taxonomy only and, if possible, do not disclose the exam method in advance (Weber et al. [24] announced the exam method two days in advance and recommended not to disclose it "until last possible moment"). If there are any concerns about the validity of THEs, Bailey's assessment design with a combination of an ICE and a THE is recommended [28].
This review has identified some missing research concerning THEs. Some of this research might be urgent due to the emergence of MOOCs and online e-universities where non-proctored exams prevail.
First, more tests should be conducted where ICE/THE comparisons are supported with delayed retention tests. We suggest that experiments also investigate whether students should know in advance if they will have an ICE or a THE. If so, when should that be disclosed? And exactly when should the retention test be performed?
Second, the gain in students' HOCS has been widely purported as one of the major advantages of THEs, but hard evidence is scarce. An experiment that contrasts the gains in HOCS between THEs and ICEs is desirable to provide scientific evidence to support the endorsement of THEs.
Third, what exactly is the prevailing attitude towards THEs among faculty staff? Many published works indicate a strong endorsement among students [1,27,29], but the (assumed) resistance among faculty professors has mostly been insinuated [1], and it would enrich the discourse if we could put a 'number on the resistance'. Notably, the published works advocating THEs far outnumber the works dissuading from THEs, which contradicts the general opinion that most professors are against them. It could be inferred that the 'against' group is underrepresented in the scientific literature, and interviewing a (large) number of (random) professors (at different faculties) would facilitate a deeper understanding of the problems attributed to THEs.
There is also no work that has considered the 'secondary' stakeholders' (i.e., the employers') opinion on this matter; their opinions about THEs would be important input to the discourse.
As early as 1984, Marsh [25] (p. 111) concluded that "there is a paucity of specific literature comparing take-home exams and in-class exams", and Weber et al. [24] (p. 474) came to the same conclusion: "virtually no research is available on take-home examinations". Haynie drew the same conclusion in 1991 [17]. This review reveals that this is still true 28 years later. However, in light of the emergence of the Internet and its implications for higher education (MOOCs and online e-universities), the need for more research is even more urgent now.
The virtues of THEs have been widely extolled: they promote HOCS while simultaneously providing a means to assess students and constituting an additional learning activity [3,17,29]. Everybody recognizes the risk of unethical student behavior, but there is a salient difference in attitudes towards the occurrence of cheating and the need for countermeasures. Some claim that the cheating cohort is relatively small and should be more or less ignored: "the main priority should be to focus on the higher quality learning outcomes of the majority, rather than set up an entire system to stop a small minority" [1] (p. 234). This may underestimate the problem; Lancaster and Clarke collected 30,000 contract cheating requests from students in a recent survey [45]. Careless use of THEs may compromise and devalue higher education degrees. On the other hand, proper use has the potential to both promote and assess the highest taxonomy levels and foster higher-order cognitive skills.
Finally, it should be pointed out that the 35 works reviewed here (Table 2) stem predominantly from Anglo-Saxon contexts, which may introduce a bias; the conclusions may very well be applicable to that context only.