Abstract
Jurors and other legal decision-makers are often required to make judgments about the likely memory accuracy of another person. Legal systems tend to presume that decision-makers are well-placed to make such judgments (at least in the majority of cases) as a result of their own experiences with memory. However, existing research highlights weaknesses in our abilities to assess the memories of others and suggests that these weaknesses are not easily ameliorated through the provision of information. In this work we examine the accuracy of layperson assessments of “real” eyewitness identifications following observation of a mock crime. We examine whether novel instructions, characteristics and beliefs of assessors, and underlying reasoning strategies are associated with improved or impaired judgment accuracy. The results support prior research in demonstrating a tendency towards over-belief in the accuracy of identifications. They suggest that reliance on what witnesses have said rather than attempts to make inferences from their statements (e.g., in relation to the level of detail provided or non-verbal cues in testimony) is associated with greater accuracy in assessments and that some individual differences and beliefs about memory are also associated with greater accuracy. However, there was no evidence that the instructions tested were effective. We discuss the implications of results for procedure surrounding the evaluation of memory in the legal context.
1. Introduction
Eyewitness accounts are often important evidence in the legal context, and identification evidence specifically has been described as a ‘staple form of evidence’ used to identify criminal offenders (Wells et al., 2020). As part of their role as legal fact-finders, jurors often have to assess the likely reliability of identification evidence (a form of ‘social’ metamemory judgment). Legal systems often operate under the presumption that laypeople (and therefore jurors) can make these judgments effectively as the result of their own experiences with memory, and therefore leave jurors with the task of making such assessments with relatively little assistance (for a discussion of the way that courts expect jurors to evaluate eyewitness testimony in the US, the UK, and Australia, see Semmler et al., 2012). However, a relatively significant body of work examining lay decision making in this area suggests that actually laypeople are susceptible to perform poorly when making such judgments, particularly through over-confidence in the accuracy of identifications. Incorrect eyewitness identification evidence remains a leading cause of miscarriages of justice across multiple jurisdictions. In May 2025, the US National Registry of Exonerations categorized 995 of the 3676 post-1989 exonerations in their database as having involved a mistaken witness identification (National Registry of Exonerations, n.d. For similar information in relation to other jurisdictions, see Evidence Based Justice Lab Miscarriages of Justice Registry, n.d.; Canadian Registry of Wrongful Convictions, n.d.; European Registry of Exonerations, n.d.; see also Helm, 2021b, 2022; Howe et al., 2017). In many of these miscarriages of justice, jurors were willing to convict a defendant based on eyewitness identification evidence (sometimes with clear flaws) that later turned out to be incorrect. In one Scottish case, the appeals court even came to the “clear conclusion” that the jury (who had convicted the defendant based primarily on identification evidence) had come to a conclusion that “no reasonable jury” could have reached (John Jenkins v Her Majesty’s Advocate, 2011). In this context, it is helpful to closely scrutinize how laypeople make assessments of identifications made by others, how the accuracy of those assessments may vary in different conditions, and how legal infrastructure can promote more effective assessments and, relatedly, more effective administration of justice. In this paper, we examine layperson (assessor) assessments of “real” identifications made by research participants following observation of a mock crime. We examine differences in the reasoning underlying inaccurate and accurate assessments and the influences of instructions and assessor characteristics on assessment accuracy.
1.1. Social Metamemory Judgments in the Identification Context
Metamemory judgments have been most extensively studied in the context of judgments of learning. Research in this context suggests that assessments of the memory of others (social metamemory judgments), as with assessments of own memory, are influenced by two broad types of cues—experience-based cues, which come from experience of a stimulus at hand (e.g., Alter & Oppenheimer, 2009), and belief-based cues, which come from beliefs or theories about memory (e.g., Frank & Kuhlmann, 2017). In the context of own memory, when someone is assessing whether they believe that they will remember a face, they may be influenced by how memorable that face seems to them upon viewing it (an experience-based cue) and their beliefs as to the general memorability of faces and how often they tend to remember faces (a belief-based cue). In the context of others’ memory, both of these types of cues are particularly susceptible to inaccuracy (see Helm & Growns, 2023). Experience-based cues involve translating own experience to the likely memory of another who experienced the stimulus differently and may find different things memorable. Belief-based cues involve applying beliefs and intuitions built on experience of own memory to assess the memory of someone else, who may experience memory differently. Therefore, while the ‘intuitions’ relating to memory that are often relied on by legal systems may be informative in assessing own memory (e.g., Wixted & Wells, 2017), they are particularly susceptible to inaccuracy in relation to assessments of the memory of others. It is therefore important to examine the ability of decision-makers to assess the memories of others as well as cues associated with more and less accurate assessments in order to examine the likely ability of jurors in this area and to design intervention if appropriate. Such an examination reveals areas of weakness in the ability of decision-makers to make such judgments.
Research suggests that laypeople’s views about memory are inconsistent with expert opinion based on scientific research (e.g., Benton et al., 2006; Kassin et al., 2001; Simons & Chabris, 2011, 2012). Perhaps most importantly in the legal context, a significant body of evidence shows that people generally over-estimate the reliability of memory. In one US-based study reported in 2011, examining survey responses from 1500 members of the public and 2 expert samples (N = 16 and N = 73) found that, while no experts agreed (strongly or mostly) that “Human memory works like a video camera, accurately recording the events we see and hear so that we can review and inspect them later,” 63% of the public sample strongly agreed or mostly agreed with this statement (Simons & Chabris, 2011; see also Simons & Chabris, 2012 for similar results in an different sample of participants). This research also highlights the potential for laypeople to be misled by particular cues in some contexts. For example, the research suggests that laypeople may be prone to misinterpreting witness confidence. Benton et al. (2006) surveyed 111 jurors (individuals who had responded to a jury summons) and found that only 50% believed that an eyewitness’s confidence can be influenced by factors that are unrelated to identification accuracy (a statement endorsed by 95% of experts in a prior study; see Kassin et al., 2001). Experimental work has confirmed the likely importance of this over-estimation of the accuracy of memory in assessments of identification evidence. That work has typically asked research participants to assess the likely accuracy of an identification given by another person who has witnessed an event (e.g., a mock crime), and has found that people are biased towards believing that identifications are accurate (Helm & Spearing, 2025; Lindsay et al., 2000). Insight into the ability of people to distinguish between accurate and inaccurate identification evidence generally can also be gleaned from this work, although accuracy levels have varied by study. Studies utilizing these paradigms have found assessors to be accurate 63.6% of the time (Martire & Kemp, 2009, using an Australia-based undergraduate psychology participant pool), accurate 73% of the time in assessments of true identifications and 51% of the time in assessments of false identifications (Lindsay et al., 2000, using a US-based undergraduate participant pool), and between 42% and 70% of the time (depending on the race of the target and witness; see Helm & Spearing, 2025, using a UK-based non-student participant pool). Thus, research suggests that while people are potentially picking up on some probative cues, intuitions as to the reliability of identifications are far from reliable themselves, and careful intervention is needed to guide these judgments in the legal process.
Existing research also provides insight into cues that laypeople rely on in assessing the memory of others, providing insight into potential sources of accuracy and inaccuracy in assessments. This research has tended to use controlled mock trial materials and to experimentally manipulate factors of interest in the trial materials. Factors that have been found to make an eyewitness account less believable in this work include a lower degree of detail (Bell & Loftus, 1988), (even minor) inconsistencies (Bruer & Pozzulo, 2014; Fraser et al., 2022; O’Neill & Pozzulo, 2012; Lavis & Brewer, 2017; although see Lindsay et al., 1986, where such an effect was not found), and lower confidence (e.g., Cutler et al., 1990, although the effect size appears to be modest; see Slane & Dodson, 2022) in an identification or account more broadly. However, these cues may be misleading in predictable contexts (on the relationship between level of detail and accuracy, see Weber and Brewer (2008)). Even inconsistencies, which are sometimes viewed as clear indications of either poor memory or deception, are now thought to tell us “little or nothing” about the accuracy of witness testimony, apart from that of the specific inconsistent statement (Fisher et al., 2012, p. 178; see also Hudson et al., 2020). The relationship between confidence and accuracy is more complex. Although highly confident memories can be false (e.g., Loftus & Greenspan, 2017), a recent analysis of research relating to confidence and eyewitness identification accuracy suggests that, at least under ‘pristine’ conditions, confidence is a relatively good predictor of accuracy and high confidence identifications are more likely to be accurate (Wixted & Wells, 2017). That work has provided rough indications (based on averages from 15 studies) of likely accuracy for different bands of confidence (0–20% confidence ≈ 65% correct; 30–40% confidence ≈ 75% correct; 50–60% confidence ≈ 80% correct; 70–80% confidence ≈ 90% correct, 90–100% confidence > 95% correct) (Wixted & Wells, 2017; and note the probability of an identification of a suspect being correct is higher than this since an inaccurate identification could be of a filler). It is therefore possible that confidence could be a helpful cue for people to rely on in assessing eyewitness identification evidence (particularly since even experts have struggled to identify more reliable cues as to memory accuracy, e.g., Bernstein & Loftus, 2009), provided that it is not treated as dispositive and the possibility of confident false memory is considered in each case, as well as the potential for non-pristine conditions underlying the identification.
1.2. The Utilization of Instructions
One way to improve assessments of identifications made by others may be to utilize judicial instructions through which relevant information, for example, information on how to utilize confidence as a cue in assessing identification accuracy, is provided to assessors. Instructions are already provided to jurors in a number of jurisdictions; however, these instructions largely focus on providing general warnings to jurors relating to the unreliability of identification evidence (e.g., Telfaire instructions (United States v Telfaire, 1972), Henderson instructions (New Jersey v Henderson, 2011), Turnbull instructions (R v Turnbull, 1977)). A significant number of miscarriages of justice resulting from mistaken identifications have persisted in spite of these warnings, and experimental research suggests that the instructions may be ineffective both in causing assessors to update their beliefs (Helm, 2021a) and in influencing their decision making (Helm, 2021a; Dillon et al., 2017; Jones et al., 2017). Some research provides insight into why these instructions might be ineffective. For example, some research suggests that the instructions communicate information to jurors in a verbatim form that they struggle to engage with, and that instructions may be improved by making content more meaningful to jurors, allowing them to apply the instructions in a concrete way in practice (Helm, 2021a). In line with this idea, the instructions that have been shown to be effective in increasing sensitivity (rather than just skepticism) in experimental work utilizing scripted materials have all focused on educating or training potential jurors rather than providing general warnings or information (Jones & Penrod, 2018; Ramirez et al., 1996; Jones et al., 2020; Helm, 2021a).
Existing experimental research found that providing information relating to witness confidence to assessors (specifically in the form of expert evidence either describing confidence as a good indicator of accuracy or stating that there is no relationship between confidence and accuracy) had no appreciable effect on the accuracy of assessments of identifications made by witnesses to a mock crime (Martire & Kemp, 2009). However, in line with the research discussed above, providing this information in a more meaningful way that can be applied in practice may be more effective. One way to make information relating to confidence meaningful for jurors may be to provide meaningful quantitative benchmarks to assist them in using confidence as a cue to assess accuracy (without overriding consideration of other cues). Research in other contexts has shown that meaningful anchors can help to orient (rather than bias) juror decisions (see, for e.g., Hans et al., 2022). The rough benchmarks identified by Wixted and Wells (2017) may be helpful in this regard in guiding jurors through the general utility of confidence, alongside warning them to consider other contextual factors that may impact the reliability of an identification. In this work, we examine the impact of both traditional instructions warning assessors that a confident identification may be wrong and instructions incorporating benchmarks from Wixted and Wells (2017) on the way that assessors interpret confidence, and on the accuracy of assessor judgments in relation to identification evidence.
1.3. Assessor Characteristics and Beliefs
A less explored area in assessing factors that may promote the accuracy of assessments of identification evidence is examining characteristics and beliefs of assessors themselves that have the potential to make them better or worse at making social metamemory judgments. Although this information may be less easy to utilize directly to improve legal systems, it provides new insight into the nature of assessments of the memory of others, including into psychological processes underlying more accurate and less accurate judgment, and therefore can be drawn on in order to identify ways in which such assessments may be improved (either via juror selection or via interventions to target factors that make particular individuals prone to poorer assessments). In this work, we consider four sets of characteristics and beliefs of assessors that may influence their assessments of the memory of others: experiences with memory, beliefs about memory, gist vs. verbatim processing, and beliefs about the criminal justice system.
First, assessors’ own experiences of memory (e.g., as reliable or unreliable) may influence what they expect from a witness, with those with experience of reliable memory being more likely to classify a memory as true (and therefore being more accurate in the identification of correct identifications) and those with experience of unreliable memory being more likely to classify a memory as false (and therefore being more accurate in the identification of false identifications).
Second, people’s beliefs about memory (and about the cues that are probative in assessing memory) may influence the accuracy of identification assessments. People with beliefs about memory that are consistent with evidence and expert opinion may be more effective at assessing the memory of others, and therefore more likely to make accurate judgments in relation to the memory of others.
Third, the way that people process information in decision making, including the tendency to rely on more gist-based processing (focused on meaningful and contextual interpretation of a stimuli) or verbatim-based processing (focused on precise and superficial details of a stimuli; see Reyna, 2012) may impact their effectiveness in evaluating memories of others. Reliance on meaningful information rather than picking up on superficial misleading cues (such as those discussed above) could potentially have a positive effect on assessment accuracy. In this work, we indirectly examine the relationships between people’s cognitive processing and their assessments of the memory of others, utilizing a scale designed to measure autistic traits in the general population. The link between autistic traits and reliance on gist has been discussed in prior work (Reyna & Brainerd, 2011) and stems from the fact that individuals with autism tend to focus on parts rather than “global aspects” of objects and have difficulty integrating information into a meaningful whole (e.g., Frith & Happé, 1994; for discussion, see Reyna & Brainerd, 2011). As a result, researchers have suggested that those with autism (and those showing relevant autistic traits) will experience disadvantages in utilizing gist-based intuition (which requires integrating elements in a meaningful and sometimes abstract way) and advantages in utilizing verbatim analytical processing (which requires attention to precise and sometimes superficial details) (see Reyna & Brainerd, 2011).
Finally, underlying beliefs relating to the criminal justice system and, more specifically, to the reliability of evidence and trustworthiness of testimony may influence assessments of memories of others. For example, people who have high confidence that the justice system tends to lead to correct outcomes may have an inherent trust in memory in the eyewitness context and therefore may have a tendency to classify identifications as correct, leading to higher accuracy in assessments of correct identifications but lower accuracy in assessments of false identifications.
In this work, we examine how each of these four sets of characteristics and beliefs are associated with accuracy in assessments of identifications. We also conduct a linguistic analysis to examine the reasoning underlying memory assessments and the extent to which different types of reasoning (relating to these characteristics and beliefs and more generally) are associated with accurate or inaccurate assessments.
1.4. The Current Study
This study is relatively exploratory and is intended to build on existing work to begin to provide information that can be drawn on in seeking to better understand assessments of identifications made by others and to develop intervention to improve these assessments in the legal context. Participants in the study (members of the public in the USA) examined “real” correct and false identifications made by witnesses to a mock crime, and their assessments were examined to provide new insight in relation to three key research questions.
First, how effective are lay assessors at distinguishing between accurate and inaccurate identification evidence? To answer this question, we examine overall levels of accuracy in assessments of identifications, and also the extent to which participants’ confidence in their assessments of identifications is related to the accuracy of those assessments (something with the potential to be important in jury deliberations).
Second, can instructions utilizing anchors informed by psycho-legal research improve the way that assessors utilize confidence as a cue in assessing identification evidence and ultimately improve the accuracy of assessments? To answer this question, we compare the associations between witness confidence and judgments of identification accuracy in participants who have received no instruction, a basic instruction (warning jurors that confident identifications may not be accurate), and an anchor instruction (utilizing benchmarks outlined in Wixted & Wells, 2017 relating to the relationship between confidence and accuracy generally in witness identifications).
Third, which characteristics and beliefs of assessors facilitate more (or less) accurate assessments of identification evidence? To answer this question, we examine associations between characteristics and beliefs and the accuracy of assessments of identifications, and conduct a linguistic analysis to examine reasoning associated with higher and lower accuracy in assessments.
2. Materials and Methods
2.1. Design
Each participant viewed and made assessments in relation to four identifications (and accompanying testimony), selected from a pool of identifications from a mock crime study in which “witnesses” saw a staged video of a person stealing a laptop and, approximately five minutes later, sought to identify the person who they had seen take the laptop from a six-person lineup with a not present option (see below for more information in relation to the underlying mock crime study). The specific identifications that each participant saw were randomly selected from groups in the larger pool of identifications, such that the four identifications seen by each participant included a randomly selected true identification and a false identification from each of two stimulus sets (although participants were not given any information in relation to the number of true and false identifications that they would see). The instructions that participants saw prior to viewing the identifications varied between subjects. Participants either saw no instructions, saw a traditional instruction in which they received warnings about potential weaknesses in identification evidence, or saw an ‘anchor’ instruction in which they received information about the proportion of witnesses who tend to be accurate at different levels of confidence.
The traditional instruction stated: “This information is designed to assist you in making your assessments. Please read this information very carefully and consider it in evaluating the identifications. A witness who is honest and convinced in his or her own mind may be wrong. A witness who is convincing may be wrong. A witness who is able to recognise the defendant, even when the witness knows the defendant very well, may be wrong.” The anchor instruction stated: “This information is designed to assist you in making your assessments. Please read this information very carefully and consider it in evaluating the identifications. Generally, under fair conditions, eyewitness confidence is predictive of accuracy. On average: when eyewitnesses have low confidence that their identification is correct, identifications turn out to be incorrect around 35% of the time (and correct around 65% of the time)—when eyewitnesses have moderate confidence that their identification is correct, identifications turn out to be incorrect around 25% of the time (and correct around 75% of the time)—when eyewitnesses have high confidence that their identification is correct, identifications turn out to be incorrect around 5% of the time (and correct around 95% of the time). Note that these are average figures, rather than figures that apply to the specific identifications in this study. They are intended to provide context for you to consider in assessing the likelihood that an identification is accurate in this particular case where the probability of a particular identification being inaccurate may be higher or lower than average.”
2.2. Participants
Participants were 498 adults recruited from the Prolific survey platform, who received GBP 6.80 (approximately USD 9) for completing the experiment. We had initially sought to recruit 500 participants so that each identification in our study would be viewed by at least 50 participants, but under-recruited very slightly since some participants on Prolific did not provide complete responses.
To be eligible for the survey, participants were required to live in the United States, to have a Prolific approval rating of at least 90%, to have completed at least 10 previous Prolific submissions, and could not have participated in the underlying mock crime work. Two participants were excluded for failing two or more of the three attention check questions in the survey. Nine participants were excluded due to verbal responses that were either nonsensical, repetitive (e.g., giving exactly the same answer multiple times), or identical to responses provided by other participants. The final sample therefore consisted of 487 participants (229 male, 251 female, 7 other gender/prefer not to say, 331 white ethnicity, 122 black/African/Caribbean ethnicity, 11 Asian ethnicity, 22 mixed/other ethnicity/prefer not to say, Mage = 38.73, SD = 12.26, range = 18–83). The Faculty of Humanities and Social Sciences Research Ethics Committee at the University of Exeter approved this research.
2.3. Materials and Procedure
Identifications: Stimuli Generation. Identifications shown to individual participants were selected from a set of 40 identifications, including 10 true and 10 false identifications from each of two stimulus sets (stimulus set A, involving a white target, and stimulus set B, involving a black target). Each set of 10 identifications were randomly selected from the corresponding categories in a larger set of 110 identification decisions from a mock crime study (the overall set of identification decisions from that set was larger [N = 485] and also included non-identifications and incorrect identifications from a target present lineup, both of which were excluded as potential stimuli for the purposes of this study). In the mock crime study, witnesses viewed a video lasting approximately two minutes in which a target stole a laptop while the owner of the laptop stepped out of the room to answer a phone call. The target in the video was clearly visible for approximately one minute. After a five-minute buffer task, they were then asked to identify the person that they had seen take the laptop from a six-person lineup including a “not present” option. They then answered a series of questions in relation to their identification (or non-identification), including: “How confident are you in your identification?” “What led to your decision?” “How well did you see the person who took the laptop?” “How difficult was it for you to figure out which person in the lineup was the person who took the laptop?” “How much attention were you paying to the face of the person who took the laptop?” And “How good is your memory generally for the faces of strangers?” (broadly mirroring the information requested in Wells & Bradfield, 1998). Participants were instructed to record themselves (on video) while making the initial identification and while answering each of these questions. In the overall data for stimulus set A, there were 27 false identifications from target absent lineups and 31 correct identifications. In the overall data for stimulus set B, there were 16 false identifications from target absent lineups and 36 correct identifications. Participants were also asked to provide their confidence in their identification on an 11-point written scale from 0 (not at all confident) to 100 (very confident), increasing in increments of 10. Levels of confidence in the overall set of 110 identifications from the mock crime work, and the sub-set of 40 identifications used in this study, are outlined in Table 1.
Table 1.
Descriptive statistics for identifications.
Identifications and Assessment Questions. Each participant in this study first received information about the event that the witness in the mock crime study observed. Specifically, they were told that, in a previous experiment, participants (“witnesses”) watched a video in which they saw a person taking a laptop while its owner was out of the room. They were informed that the video was approximately two minutes long, that the person who took the laptop was visible for about a minute, that approximately five minutes after viewing the video, the witness was asked to make an identification of the person who they saw, and that the identification was made from a lineup containing six photographs of faces and with a not present option. They were then told that they would see four identifications made by participants who saw different videos, and therefore different people taking the laptop, and that they would be asked to assess the likely accuracy of each identification.
Participants then viewed each of the four identification and testimony sets allocated to them in a randomized order. After viewing each identification, they answered four questions in relation to it. First, they were asked to indicate how likely they thought it was that the person the witness identified was the person who took the laptop in the video they saw on an 11-point scale from 0 (not at all likely) to 100 (very likely), in increments of 10. Next, they were asked to provide a dichotomous judgment of whether the identification they saw was correct or incorrect, and to indicate how confident they were in that judgment on an 11-point scale from 0 (not at all confident) to 100 (very confident), in increments of 10. Finally, they were asked to explain in as much detail as they could why they made their assessment of the witness identification. After they had watched all four video sets and answered all four sets of assessment questions, they moved on to complete the individual difference measures.
2.4. Individual Difference Measures
Participants then completed several individual difference measures in a set order.
First, they completed the Eyewitness Metamemory Scale, a 23-item questionnaire made up of three scales measuring peoples’ perceptions of their own memory. These three scales measure memory contentment (satisfaction with their ability to remember faces; α = 0.84), memory discontentment (doubt in their ability to memory faces; α = 0.89), and memory strategies (their use of strategies to remember faces; α = 0.69) (Saraiva et al., 2019). Second, they completed the Autism Quotient, a 50-item scale developed to assess the presence of autism spectrum traits in adults (Baron-Cohen et al., 2001; α = 0.70), with five subscales—communication (α = 0.64), social (α = 0.72), imagination (α = 0.52), details (α = 0.65), and attention switching (α = 0.40). This measure was included not with a view to diagnosing autism or to assessing autism, but as an indirect measure of gist (meaning focused) vs. verbatim (detail focused) cognitive processing (see discussion above, and Reyna & Brainerd, 2011). Next, they completed a memory questionnaire, examining their beliefs about the factors that do (and do not) affect memory performance (Benton et al., 2006). The questionnaire consists of 30 items, with three response options (“Generally True,” “Generally False,” and “I don’t know”). Finally, they completed the Pretrial Juror Attitudes Questionnaire (PJAQ; α = 0.88), a 29-item scale developed to measure individual differences in attitudes relevant to juror decision making, with six subscales: confidence in the criminal justice system (α = 0.76), conviction proneness (α = 0.73), cynicism to the defense (α = 0.74), racial bias (α = 0.42), social justice attitudes (α = 0.28), and belief in innate criminality (α = 0.65) (Lecci & Myers, 2008).
3. Results
In our analyses we examined overall accuracy and, in addition, accuracy in assessments of true identifications and false identifications separately. This approach was taken since factors associated with assessments of identifications may have a differential impact on the accuracy of assessments of true and false identifications (e.g., confidence in the accuracy of eyewitness memory may lead to greater accuracy in the assessment of true identifications but lower accuracy in the assessment of false identifications).
3.1. Overall Accuracy (And the Impact of Instructions on Overall Accuracy)
Overall, four participants (0.8%) were incorrect in all of their assessments (i.e., characterized the correct identification decisions that they saw as false identifications, and the false identifications that they saw as correct identifications), 37 participants (7.6%) were correct in one assessment and incorrect in three assessments, 137 participants (28.1%) were correct in two assessments and incorrect in two assessments, 182 participants (37.4%) were correct in three assessments and incorrect in one assessment, and 127 participants (26.1%) were correct in all of their assessments. In stimulus set A, higher witness confidence was predictive of lower accuracy in the assessment of correct identifications (B = −0.027, SE = 0.013, Wald = 4.508, p = 0.034, OR = 0.973, 95% CI [0.949, 0.998]) and in the assessment of false identifications (B = 0.032, SE = 0.005, Wald = 35.89, p < 0.001, OR = 0.969, 95% CI [0.958, 0.979]). In stimulus set B, higher witness confidence was predictive of greater accuracy in the assessment of correct identifications (B = 0.051, SE = 0.005, Wald = 94.33, p < 0.001, OR = 1.052, 95% CI [1.041, 1.063]) but higher witness confidence was not predictive of accuracy in the assessment of false identifications (B = −0.002, SE = 0.005, Wald = 0.155, p = 0.694, OR = 0.998, 95% CI [0.989, 1.007]).
In order to further probe accuracy and the conditions conducive to more accurate judgment, a generalized estimating equation model with an independent correlation structure was used in which the instructions (no instructions, basic instructions, anchor instructions), stimulus set (stimulus set A, stimulus set B), and underlying accuracy (correct identification, false identification) were used to predict the accuracy of assessor judgments. This analysis revealed a number of significant main effects and interactions. First (and consistent with prior work, e.g., Lindsay et al., 2000), there was a significant main effect of identification accuracy (Wald χ2(1) = 52.86, p < 0.001), such that participants were significantly more likely to be accurate in their assessments of correct identifications than in their assessments of false identifications (Mtrue = 0.78, SE = 0.02; Mfalse = 0.63, SE = 0.01). Second, there was a significant main effect of stimulus set (Wald χ2(1) = 25.91, p < 0.001), such that assessments of identifications from stimulus set A were more accurate than assessments of identifications from stimulus set B (MsetA = 0.75, SE = 0.01; MsetB = 0.65, SE = 0.02). These effects were moderated by a significant interaction between identification accuracy and stimulus set (Wald χ2(1) = 7.03, p = 0.008). Identification accuracy showed the same pattern over both stimulus sets (with assessments of correct identifications more accurate than assessments of false identifications), but this pattern was more pronounced in stimulus set A (see Figure 1).
Figure 1.
Interaction between identification accuracy and stimulus set in predicting assessment accuracy. * p < 0.05.
Finally, there was a significant interaction between instruction condition and identification accuracy (Wald χ2(2) = 6.73, p = 0.035; Figure 2). Differences in assessment accuracy were not significant between instruction conditions for correct identifications or for false identifications. However, a visual inspection of the data suggests that the basic instruction decreased accuracy in assessments of correct identifications and increased accuracy in assessments of false identifications when compared to no instruction, while the anchor instruction increased accuracy in assessments of correct identifications and did not change accuracy in assessments of false identifications.
Figure 2.
Interaction between identification accuracy and instruction condition in predicting assessment accuracy.
Finally, we used a series of logistic regression models to examine whether participants’ own confidence in the accuracy of their identification assessments significantly predicted the actual accuracy of those assessments. In both stimulus sets participant confidence in their assessments was only predictive of accuracy (such that people were more confident in accurate assessments than incorrect assessments) for assessments relating to correct identifications (stimulus set A false identifications B = 0.002, SE = 0.005, Wald = 0.131, p = 0.718, OR = 1.002, 95% CI [0.993, 1.011]; stimulus set B false identifications B = 0.004, SE = 0.004, Wald = 0.996, p = 0.318, OR = 1.004, 95% CI [0.996, 1.013]; stimulus set A correct identifications B = 0.049, SE = 0.007, Wald = 48.73, p < 0.001, OR = 1.050, 95% CI [1.036, 1.065]; stimulus set B correct identifications B = 0.051, SE = 0.006, Wald = 76.84, p < 0.001, OR = 1.053, 95% CI [1.041, 1.065]).
3.2. Witness Confidence and the Relationship Between Instructions and Witness Confidence
We examined the overall relationship between witness confidence and assessment accuracy using a series of four logistic regression models. The results were inconsistent between our stimulus sets, and, in some instances, not in line with expectations. In stimulus set A, higher witness confidence was predictive of lower accuracy in the assessment of correct identifications (B = −0.027, SE = 0.013, Wald = 4.508, p = 0.034, OR = 0.973, 95% CI [0.949, 0.998]) and in the assessment of false identifications (B = 0.032, SE = 0.005, Wald = 35.89, p < 0.001, OR = 0.969, 95% CI [0.958, 0.979]). In stimulus set B, higher witness confidence was predictive of greater accuracy in the assessment of correct identifications (B = 0.051, SE = 0.005, Wald = 94.33, p < 0.001, OR = 1.052, 95% CI [1.041, 1.063]) but higher witness confidence was not predictive of accuracy in the assessment of false identifications (B = −0.002, SE = 0.005, Wald = 0.155, p = 0.694, OR = 0.998, 95% CI [0.989, 1.007]).
We also conducted the same regressions (reported in the same order), including instruction condition and the interaction between instruction condition and witness confidence as predictors of accuracy in each of the four logistic regressions. The interaction term was not significant in any of these regressions (p = 0.496; p = 0.742, p = 0.927; p = 0.337, respectively), indicating that instructions did not alter the relationship between confidence and accuracy in participants assessments.
3.3. Associations Between Assessor Characteristics and Beliefs and Accuracy
Assessor experiences with memory were not significantly associated with assessment accuracy; however, we found significant relationships between our other measures of assessor characteristics and beliefs and assessment accuracy, as detailed below.
3.3.1. Beliefs About Memory
To assess associations between memory beliefs and accuracy assessments, we coded percentage agreement with memory statements, combining “generally false” and “don’t know” responses in one category (as in Benton et al., 2006). The percentage of our participants who agreed with each statement is provided in Table 2, alongside equivalent percentages for the experts surveyed by Kassin et al. (2001) (and utilized as a comparison group in Benton et al., 2006). The table also indicates where endorsement of a particular statement as being “generally true” was associated with higher accuracy in assessments of memory accuracy in our data.
Table 2.
Agreement rates of participants compared to agreement rates of experts from Kassin et al. (2001), and relationship between endorsement and accuracy.
3.3.2. Gist vs. Verbatim Processing (Autistic Traits)
Overall assessment accuracy significantly correlated with autistic traits, such that those who showed higher levels of autistic traits tended to make less accurate assessments (r(487) = −0.10, p = 0.041; although correlations between accuracy and AQ subscales were not significant, this correlation appeared to be driven largely by items in the attention to detail and imagination subscales since correlations between these subscales and accuracy approached significance [p = 0.072, and p = 0.078, respectively]). Accuracy in classifying false identifications as false correlated with the AQ imagination subscale (such that those with limitations in imagination were less likely to correctly classify false identifications as false; r [487] = −0.11, p = 0.02). Accuracy in classifying true identifications as true did not correlate significantly with the overall AQ score, or scores on any AQ subscales.
3.3.3. Beliefs About the Criminal Justice System
The overall assessment accuracy significantly correlated with beliefs in innate criminality, such that those who had stronger beliefs in innate criminality tended to make less accurate assessments (r(487) = −0.10, p = 0.027). Accuracy in classifying false identifications as false correlated with the overall PJAQ score (such that those with higher scores on the PJAQ were less likely to correctly classify false identifications as false; r [487] = −0.14, p = 0.002), as well as with scores on the PJAQ system confidence, conviction proneness, racial bias, and innate criminality subscales (all in the same direction as the overall scale correlation; r [487] = −0.12, p = 0.007, r [487] = −0.11, p = 0.013, r [487] = −0.13, p = 0.005 and r [487] = −0.15, p = 0.001, respectively). Accuracy in classifying true identifications as true correlated with PJAQ system confidence and PJAQ conviction proneness, such that higher scores on each of these subscales was associated with greater accuracy in classifying true identifications as true (r [487] = −0.112, p = 0.014 and r [487] = −0.09, p = 0.049, respectively).
3.3.4. Reasoning Underlying Assessments
To examine reasoning associated with higher and lower accuracy in assessments, we conducted a corpus linguistic analysis of the participants’ responses to the question: “Please explain in as much detail as you can why you made this assessment of the witness identification.” We compiled a corpus with those responses (corpus ACBC). In a further analysis, we divided the responses into two groups based on the overall accuracy of their four assessments: the above-chance group (AC; n = 309), with participants scoring three (n = 182) and four (n = 127), and the chance + below-chance group (CBC; n = 178), with participants scoring zero (n = 4), one (n = 37), or two (n = 137).
We used the corpus software AntConc (Anthony, 2024) to conduct a keyword analysis to identify key topics and features in the different corpora. A keyword is defined as “a word that occurs statistically more frequently in one file or corpus when compared against another comparable or reference corpus” (Baker, 2010, p. 104). The reference corpus is Corpus AmE06, a one-million-word reference corpus of general written American English built in 2011 with most of its texts publicized in 2006 (Potts & Baker, 2012). The keywords meet the threshold of p < 0.05 and were sorted by likelihood, which is measured by log-likelihood (4-term).
We first conducted a keyword analysis by comparing corpora ACBC and AmE06, examining the top 100 keywords (Table A1 in Appendix A) to identify the main themes in participants’ reasoning, regardless of their overall scores. The keywords indicate that the reasoning centers around witness identification related to someone who stole a laptop. Several factors are important in their assessment of witness identification, such as confidence (confident, sure, confidence, unsure, certain), memory (memory, good, strangers), attention (paying, paid, attention), view (camera, face, faces), and difficulty in identification (difficult).
Further analysis focused on the comparison between the corpora AC and CBC. Differences have been identified in their top 100 keywords: they share 80 words but differ in 20 (Table A2). All 80 shared words are included in Table A1.
As the keyword list was generated based on keyness (the degree to which each appeared important), calculated using log-likelihood tests, it includes words that are not frequently used in the two corpora. We therefore also examined word lists, which are generated based on frequency. The comparison of the top 100 words in AC and CBC reveals that they share 89 words, with each having 11 words that are not shared (Table 3). The shared words include some listed in Table A1, as well as many functional words like “the”, “of”, “in”, etc.
Table 3.
Top words not shared between corpora AC and CBC.
Four words that were identified as important in AC responses (but not CBC responses), which provide insight into the information that participants who performed above chance relied on are “says,” “difficult,” “really,” and “strangers” (the last three of these words are also in Table A2). An analysis of how these words were used provides further insight into participant reasoning. An analysis of the words frequently used alongside “says” shows that AC participants were frequently referring to what witnesses said about their view, difficulty, confidence, attention, and memory (Table 4), which can also be seen more broadly in the context in which “says” was used (Figure 3). We found similar results for the word “really”.
Table 4.
Top collocates on the right of “says”.
Figure 3.
The context of “says” in corpus AC.
AC participants’ attention to difficulty and the witness’s memory experience are also reflected in the keyness and frequency of “difficult” and “stranger,” which appear in both Table 3 and Table A2. An examination of the context of “difficult” reveals that it is mainly used to refer to the witness’s statement about the difficulty level that they experienced in identification (Figure 4). AC participants deem this an important standard in witness evaluation. For example, one AC participant stated: “She says it was difficult to identify the face and is not sure that the face is in the lineup. She also says that she does not have a good facial memory.”
Figure 4.
The context of “difficult” in corpus AC.
The word “stranger” occurs frequently in corpus AC because it is related to witnesses’ statements about their memory of strangers’ faces (Figure 5), which clearly shows that they paid attention to and utilized the responses in relation to the general ability to remember the faces of strangers.
Figure 5.
The context of “stranger” in corpus AC.
AC participants not only pay attention to what the witness “says” in response to those questions but also impose high standards when evaluating these responses to decide whether to accept their identification. This is demonstrated by the keyness and/or frequency of negation marker “t,” (Table A2) “only,” (Table 3) and “elimination” (Table A2). For AC responses, the most salient aspect shared by Table 3 and Table A2 is the negation markers, including “t,” and “didn” in Table A2 and “wasn” in Table 3. An examination of the collocates and context of “t” (Table 5 and Figure 6) shows that the main aspects of testimony being negatively evaluated include confidence, attention, and memory.
Table 5.
Top collocates on the right side of “t”.
Figure 6.
The context of “t” in corpus AC.
”Only” also indicates AC participants’ high standards in evaluation. Its context reveals the participants’ dissatisfaction with the witnesses’ confidence levels and limited visual information. One representative response is “The witness was only 80% sure this was the criminal that took the laptop. I personally need a 100% sure from witness to really believe the suspect is the criminal.”. Similarly, another participant responds, “This person is only 85% confident and basically judging one from his or her facial appearance is not making sense at all”. Another word that indicates AC participants’ high and rigorous standards is “elimination” (18 occurrences), which, as an identification method, is criticized by AC participants. This indicates attention to the process of the identification. For example, “She used a process of ‘elimination’ rather than a process of ‘identification’ and was unsure of herself. She lacked confidence in her answer and did not seem reliable”.
Unlike AC participants’ attention to what witnesses said about confidence, memory, difficulty, view, and attention levels, CBC participants seemed to focus on “details,” primarily in the witness’s description of the subject. This can be seen in the frequency of “details” (Table 3). When its singular form is also investigated, we find that the lemma’s combined occurrences amount to 79 times in total in corpus CBC. This frequency would rank it among the top 55 words on the word list, which includes all functional words. This highlights the importance of detail in CBC participants’ evaluations.
The collocates (shown in Table 6) show that “details” mainly occurs in contexts where the witnesses provided details about the suspect. Such actions are deemed important by CBC participants, making their evaluation of the witness lean towards the positive. An exemplary response is “The witness appeared consistent and specific in their reasoning. They referred to clear details about the suspect’s facial structure and features that matched the defendant in the lineup.” On the contrary, when a witness fails to provide details, participants tend to give a negative evaluation. For example, “However, some aspects warrant scrutiny. The witness’s basis for identification—resemblance to the person who took the laptop—is somewhat vague, lacking specific details about distinguishing features.” The keyness of “explanation”, “shape”, “match(ed)”, and “specific” (Table A2) and the frequency of “able” (Table 3) also point to CBC participants’ attention to details as shown in the following two examples.
Table 6.
Top collocates on both sides of “details” in corpus CBC.
“The witness gave a detailed and clear explanation for their identification, mentioning specific features like the distinct red jacket and the way the person moved. These specific details suggest that the witness paid close attention to the suspect’s characteristics, which supports the likelihood of accuracy.”
“…he was able to grab the shape of his head and a view detail even though he is not 100 percent confident, but he was able to identify number 5 to be the culprit. that tells a lot”
The above examples regarding attention to details also show CBC participants’ tendency to make inferences, even though clear statements have been explicitly given by the witnesses about their confidence and attention levels. For these CBC participants, the witnesses’ own descriptions of their levels of confidence and attention are not as important as the participants’ inferences based on their observations of the witnesses’ verbal and non-verbal cues. This is further supported by the keyness of “hesitation” in the CBC corpus (Table A2). The following two are examples of how they evaluate “hesitation”.
“I think he seems somewhat uncertain but he does say he has a generally good memory for faces of people he meets. I think it’s quite possible he picked the right one but there is also a possibility that he was wrong, His hesitation made me wonder about how sure he really was.”
“I believe the witness made the correct identification because she appeared confident in her decision. She showed no signs of hesitation at all. Given the clarity and the speed of her choice, along with the absence of any cues or influence, I feel confident that their identification was accurate.”
Consistency is also an important factor in CBC participants’ evaluations, as “consistent” is a keyword that stands out in Table A2. This suggests that, unlike AC participants, who focus on an identification itself, CBC participants also attach importance to the witness’s credibility or, more specifically, honesty, which again involves inferences. For example, the following are two responses from a participant with an overall score of one:
“He seemed consistent, but admitted to missing key features and being unable to say with confidence which was the person who stole the laptop. Admitted to not being good with faces as well! When he said he was 60% confident in addition to not being good with faces, I had my mind made up to not fully rely on his testimony! He was honest, which is vital!”
“Inconsistent markers in the first few sets of answers. He claimed he seen a direct shot of the person who stole the laptop, then after that had said he looked similar to the picture he chose from. That choice should have been more direct, instead of seeming more like a choice or opinion. Regardless of the circumstance! It should have been more direct! Based on the disorganized thought process/speaking (mixing up words & numbers slightly) I would also say this is a little bit of a red flag, since we are seeking honesty here. I personally find these two things inconsistent when determining if someone is correct or incorrect in what they say.”
4. Discussion
4.1. How Effective Are Lay Assessors at Distinguishing Between Accurate and Inaccurate Identification Evidence?
Results support prior research (e.g., Lindsay et al., 2000) by demonstrating a general tendency to over-estimate the reliability of identification evidence. While the majority of our sample performed better than chance in their four assessments, there remained a clear risk of inaccuracy in assessments, particularly in the assessment of false identifications, where participants believed that false identifications were correct 37% of the time. Importantly, our results also showed that participants’ own confidence in the accuracy of their assessments of false identifications was not predictive of the actual accuracy of those assessments. This reality has the potential to be problematic in the jury context, where more confident jurors are likely to be more persuasive than less confident jurors (despite our results suggesting that they are not more likely than less confident jurors to be correct) (on the role of confidence in jury deliberations, see Devine, 2012; Helm, 2024). There is therefore a risk that in some cases jurors who accurately classify a memory as false are persuaded that the memory is true by jurors who have inaccurately classified the memory as true. Future work should further probe our metacognitive ability in evaluating the accuracy of our assessments relating to others.
Another interesting finding from this study in relation to accuracy was that assessments of witness accuracy were more accurate in cases relating to one stimulus set than in cases relating to another stimulus set. This finding suggests that there may be specific categories of case in which assessors are particularly prone to mistakes in the assessments of identification evidence. It may be important that the identifications that participants performed more poorly at assessing were identifications of ethnically black targets. It is possible that participants overall had less experience of memorability and cues associated with memorability in relation to those targets and that this led to poorer performance as assessors of memory evidence. This study was not set up to test predictions relating to race and this explanation is only one of many possible explanations; however, prior work has found that ethnically black assessors perform better in assessing identifications of black targets than ethnically white assessors (Helm & Spearing, 2025). The relationship between target race and overall accuracy of juror evaluations of witness memory should be examined further in future work.
4.2. Can Instructions Utilizing Anchors Informed by Psycho-Legal Research Improve the Way That Assessors Utilize Confidence as a Cue in Assessing Identification Evidence?
This study did not find any evidence to support the effectiveness of our instructions to improve the utilization of confidence as a cue in assessing identification evidence or to improve the accuracy of assessments, although we did find limited evidence that traditional instructions may increase skepticism in assessments (essentially resulting in slightly more accurate assessments of false identifications and slightly less accurate assessments of correct identifications). One possible reason that instructions were ineffective may be the number of cues other than confidence that participants relied on in making their assessments of identifications (identified in our linguistic analysis), meaning that, even if their views on how to utilize confidence had been effectively changed by the instructions, this would not have had a significant overall impact on assessments. Another reason that instructions could have been ineffective is that they were relatively brief and not sufficiently detailed to alter participant intuitions relating to confidence in a meaningful way that they could apply in context (e.g., Helm, 2021a). It is possible that more detailed “training-based” instructions utilizing a similar approach may be more effective. Moving forward, the design of new instructions may also be informed by the results of this work, which provide new insight into characteristics and reasoning associated with assessment accuracy (as discussed below).
4.3. Which Characteristics and Beliefs of Assessors Facilitate More Accurate Assessments of Identification Evidence?
This study provides some important early insight into what might make assessors (including jurors) better or worse at assessing identification evidence in the legal context. The relationship between autistic traits (used in our study as a proxy for more verbatim-based reasoning) and accuracy, specifically that higher scores on the AQ were associated with lower overall accuracy, supports the suggestion that there may be certain cognitive differences that make us better or worse at making assessments of the likely accuracy of the memory of others (rather than just biasing judgments in a particular direction). A possible explanation for the relationship, linked to our predictions relating to reliance on gist, is that those with more verbatim processing (indicated here by higher autistic traits) focus more on details and less on a meaningful evaluation of the situation of the witness. This explanation is somewhat supported by the subscales of the AQ that seem to be driving the identified relationship (attention to detail and lack of imagination), and, also, the finding from our linguistic analysis that focus on detail in this case appears to be related to lower levels of accuracy. However, further research should examine this relationship in order to understand it further and to test these potential explanations. The fact that beliefs in innate criminality were related to accuracy, such that greater beliefs in innate criminality were associated with lower overall accuracy, may be driven by decisions being influenced by bias (specifically towards classifying identifications as true) rather than more precise assessment. However, an interesting note in this regard is that beliefs in innate criminality were not significantly associated with increased accuracy in the classification of true identifications.
Endorsement of a number of beliefs (all of which are generally endorsed by experts), specifically “eyewitnesses sometimes identify as a culprit someone they have seen in another situation or context,” “An eyewitness’s confidence can be influenced by factors that are unrelated to identification accuracy,” and “Young children are more vulnerable than adults to interviewer suggestion, peer pressures and other social influences,” were also associated with greater accuracy. These statements are all related to a relatively cautious and nuanced approach towards the evaluation of memory, recognizing ways in which memory can be flawed. The greater accuracy may therefore be a result of a broad cautious and nuanced approach rather than the result of specific endorsement of statements themselves. However, further research should examine whether these beliefs in particular are predictive of accuracy and, if they are, could build on this insight in designing instructions to educate the jury in a way that improves the accuracy of assessments.
Results also provide new and important insight into the reasoning underlying more accurate and less accurate assessments of identification evidence. Overall, results suggest that a focus on what witnesses say, particularly in relation to factors that are likely to be probative, including self-reported confidence, memory experience, view, and difficulty level, has the potential to increase accuracy while a reliance on inferences from how a witness explains their identification, including the detail that they provide about the suspect or non-verbal cues, has the potential to decrease accuracy. In relation to non-verbal cues, these findings are consistent with research in the literature on deception detection, which shows that the non-verbal cues that people tend to pick up on in making assessments in relation to the honesty of others have little to no probative value and that reliance on them can decrease the quality of assessments (see, for e.g., DePaulo et al., 2003; Vrij et al., 2019; Mann et al., 2004 [in the context of police assessments]; McKimmie et al., 2014).
4.4. Limitations, Future Directions, and Policy Implications
As noted in the introduction, this study is relatively exploratory and is intended to build on existing work to begin to provide information that can be drawn on in seeking to better understand assessments of identifications made by others and to develop an intervention to improve these assessments in the legal context. The results should be interpreted with the exploratory nature of this study in mind and should be viewed as early evidence providing insight that can be built on in future work to inform policy rather than as conclusive findings that will generalize more broadly. The study also has several specific limitations in relation to generalizability to the legal context in practice. First, all participants in our study were viewing identifications made in relation to the same mock crime. Although we used two stimulus sets (one of which included an ethnically white target and one of which included an ethnically black target) and 40 individual identifications, all witnesses viewed a video of a target stealing a laptop in which the target was clearly visible for approximately one minute. Identifications following viewing this crime may have particular characteristics that make them different from identifications in different contexts (including events viewed in person rather than on camera, although see Pozzulo et al., 2008; Rubínová et al., 2021), and future work should examine the extent to which results generalize to assessments of identifications made after viewing other crimes (including crimes involving factors such as violence, with the potential to influence memory; see, for e.g., Bogliacino et al., 2017). In addition, in this study participants viewed witnesses providing information in response to questions about their identification in a neutral context, and in video form. In real adversarial trials evidence would be provided through direct and cross examination, and this form of eliciting information could change the cues that assessors rely on and, relatedly, alter the accuracy of their assessments. In addition, witnesses would typically appear in person, meaning that assessors may pick up on different cues, particularly in the case of non-verbal cues, which may appear differently in person.
Despite these limitations and the need for further research to (1) replicate findings, (2) generalize findings to new contexts, and (3) examine findings in more ecologically valid contexts, this study provides important and novel insight about the general ability of non-experts to evaluate the memory of others that has the potential to be important not only in legal contexts but in any other contexts in which we make judgments about the memory of others (e.g., in education). In the legal context, the findings, when extended in future work, have the potential to inform more effective practice and procedure surrounding the evaluation of memory evidence, including to inform the design of effective instructions to improve the abilities of jurors in this role. For example, the work suggests that it may be helpful to instruct jurors to rely on what eyewitnesses say rather than to make inferences from their verbal and non-verbal behavior and that, if such instructions are ineffective, providing evidence in written form rather than verbal form may be helpful to consider. The findings may also be applied in the context of voir dire in cases in which eyewitness identification testimony is particularly important. For now, the findings add to an existing body of work that demonstrates that humans are not inherently effective evaluators of the memories of others, and that legal systems should not rely on our ability to assess the likely accuracy of the memory of others in making important legal determinations.
Author Contributions
Conceptualization, R.K.H.; methodology, R.K.H.; formal analysis, R.K.H. and Y.C.; investigation, R.K.H. and Y.C.; resources, R.K.H.; data curation, R.K.H. and Y.C.; writing—original draft preparation, R.K.H. and Y.C.; writing—review and editing, R.K.H. and Y.C.; visualization, R.K.H. and Y.C.; project administration, R.K.H.; funding acquisition, R.K.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by UK Research and Innovation, Grant/Award Number: MR/T02027X/1.
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Faculty of Humanities and Social Science Ethics Committee at the University of Exeter (protocol code 6021296, date of approval 25 April 2024).
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data from this study are available on OSF at osf.io/7wy2c/.
Acknowledgments
We would like to thank Emily Spearing for her work in generating the underlying materials used in this study.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Table A1.
Top 100 keywords in corpus ACBC.
Table A1.
Top 100 keywords in corpus ACBC.
| Rank | Type | Freq_Tar | Range_Tar | Keyness (Likelihood) |
|---|---|---|---|---|
| 1 | witness | 1140 | 306 | 6301.6 |
| 2 | person | 927 | 332 | 4192.761 |
| 3 | confident | 721 | 343 | 3893.981 |
| 4 | she | 1831 | 367 | 3476.409 |
| 5 | identification | 433 | 164 | 2280.037 |
| 6 | memory | 462 | 244 | 2135.866 |
| 7 | sure | 526 | 293 | 2005.005 |
| 8 | faces | 399 | 224 | 1989.449 |
| 9 | laptop | 339 | 190 | 1848.386 |
| 10 | he | 1647 | 366 | 1792.416 |
| 11 | face | 508 | 254 | 1648.942 |
| 12 | video | 298 | 183 | 1430.945 |
| 13 | confidence | 294 | 142 | 1398.197 |
| 14 | saw | 398 | 226 | 1355.317 |
| 15 | suspect | 267 | 121 | 1292.155 |
| 16 | very | 466 | 266 | 1218.169 |
| 17 | good | 480 | 271 | 1157.195 |
| 18 | not | 964 | 386 | 1123.588 |
| 19 | attention | 297 | 192 | 1052.662 |
| 20 | was | 1359 | 405 | 926.82 |
| 21 | lineup | 146 | 80 | 819.119 |
| 22 | facial | 145 | 104 | 791.056 |
| 23 | correct | 161 | 101 | 706.731 |
| 24 | remembering | 118 | 84 | 637.672 |
| 25 | took | 260 | 155 | 617.297 |
| 26 | also | 388 | 213 | 612.347 |
| 27 | features | 145 | 104 | 581.984 |
| 28 | the | 5086 | 473 | 555.762 |
| 29 | unsure | 97 | 87 | 507.574 |
| 30 | identify | 131 | 94 | 504.008 |
| 31 | strangers | 106 | 76 | 491.567 |
| 32 | paying | 113 | 97 | 459.511 |
| 33 | seemed | 180 | 121 | 442.097 |
| 34 | mentioned | 98 | 63 | 430.707 |
| 35 | choice | 141 | 91 | 425.531 |
| 36 | identified | 114 | 85 | 421.374 |
| 37 | details | 115 | 81 | 413.689 |
| 38 | stated | 92 | 62 | 402.145 |
| 39 | decision | 129 | 93 | 400.775 |
| 40 | clearly | 124 | 97 | 398.493 |
| 41 | clear | 147 | 105 | 371.77 |
| 42 | pretty | 120 | 97 | 369.089 |
| 43 | because | 278 | 162 | 367.048 |
| 44 | camera | 95 | 78 | 332.956 |
| 45 | wasn | 147 | 94 | 312.302 |
| 46 | made | 225 | 140 | 308.554 |
| 47 | is | 960 | 342 | 307.28 |
| 48 | paid | 100 | 78 | 307.129 |
| 49 | admitted | 65 | 45 | 298.451 |
| 49 | seems | 123 | 86 | 296.628 |
| 51 | thief | 59 | 33 | 295.572 |
| 52 | assessment | 90 | 59 | 289.209 |
| 53 | description | 73 | 55 | 285.835 |
| 54 | accuracy | 60 | 34 | 274.993 |
| 55 | t | 497 | 201 | 271.48 |
| 56 | difficult | 98 | 75 | 265.305 |
| 56 | they | 514 | 139 | 257.582 |
| 58 | did | 215 | 147 | 250.928 |
| 59 | stole | 52 | 40 | 245.505 |
| 60 | able | 106 | 79 | 238.902 |
| 61 | identifying | 56 | 40 | 227.187 |
| 62 | based | 123 | 81 | 225.163 |
| 63 | incorrect | 46 | 33 | 224.049 |
| 64 | said | 382 | 199 | 222.622 |
| 65 | testimony | 53 | 26 | 216.667 |
| 66 | guessing | 43 | 35 | 215.853 |
| 67 | defendant | 42 | 29 | 214.881 |
| 68 | that | 1176 | 349 | 213.46 |
| 69 | guy | 83 | 47 | 213.079 |
| 70 | number | 131 | 87 | 210.386 |
| 71 | certain | 98 | 76 | 197.113 |
| 72 | seem | 90 | 73 | 197.02 |
| 73 | believe | 102 | 68 | 194.899 |
| 74 | answers | 52 | 47 | 182.579 |
| 75 | accurate | 50 | 38 | 179.893 |
| 76 | culprit | 30 | 18 | 172.564 |
| 77 | uncertainty | 44 | 27 | 169.584 |
| 78 | really | 110 | 82 | 163.531 |
| 79 | remember | 78 | 64 | 162.83 |
| 80 | explanation | 43 | 29 | 160.804 |
| 81 | hesitant | 28 | 24 | 152.476 |
| 82 | look | 113 | 92 | 147.377 |
| 83 | shape | 54 | 46 | 146.891 |
| 84 | detail | 49 | 44 | 145.586 |
| 85 | picked | 54 | 45 | 141.077 |
| 86 | about | 298 | 194 | 140.435 |
| 87 | saying | 68 | 59 | 139.892 |
| 88 | specific | 68 | 43 | 139.25 |
| 89 | doubt | 55 | 42 | 138.642 |
| 90 | somewhat | 47 | 37 | 138.24 |
| 91 | angle | 33 | 32 | 136.553 |
| 92 | didn | 135 | 89 | 134.857 |
| 93 | close | 81 | 67 | 134.459 |
| 94 | hesitation | 28 | 21 | 130.067 |
| 95 | who | 310 | 191 | 128.851 |
| 96 | recall | 41 | 33 | 128.206 |
| 97 | think | 125 | 93 | 125.346 |
| 98 | her | 495 | 233 | 124.539 |
| 99 | elimination | 26 | 23 | 122.742 |
| 100 | acne | 22 | 18 | 118.434 |
Table A2.
Keywords not shared between corpora AC and CBC.
Table A2.
Keywords not shared between corpora AC and CBC.
| AC (39,086 Tokens, 309 Files) | CBC (21,718 Tokens, 178 Files) | ||||||
|---|---|---|---|---|---|---|---|
| Type | Freq_Tar | Range_Tar | Keyness (Likelihood) | Type | Freq_Tar | Range_Tar | Keyness (Likelihood) |
| t | 367 | 147 | 264.814 | explanation | 27 | 16 | 137.973 |
| difficult | 77 | 59 | 242.784 | consistent | 30 | 16 | 112.259 |
| guy | 61 | 32 | 177.129 | matched | 24 | 14 | 111.666 |
| really | 77 | 60 | 128.419 | shape | 29 | 25 | 106.333 |
| angle | 26 | 25 | 122.812 | specific | 34 | 20 | 94.055 |
| didn | 100 | 66 | 122.368 | somewhat | 23 | 18 | 87.02 |
| picked | 40 | 33 | 118.579 | recollection | 12 | 9 | 85.84 |
| recall | 30 | 23 | 105.271 | attentive | 12 | 8 | 77.964 |
| acne | 17 | 14 | 104.47 | videos | 13 | 10 | 77.142 |
| look | 74 | 61 | 101.962 | individual | 31 | 19 | 76.301 |
| her | 334 | 159 | 99.937 | hesitation | 12 | 8 | 70.188 |
| perpetrator | 16 | 7 | 97.992 | likely | 34 | 18 | 68.179 |
| stranger | 25 | 19 | 96.162 | appeared | 24 | 14 | 66.558 |
| elimination | 18 | 16 | 95.006 | who | 121 | 71 | 64.216 |
| think | 84 | 65 | 91.988 | gave | 33 | 24 | 62.499 |
| close | 52 | 44 | 89.728 | their | 131 | 34 | 59.38 |
| suspects | 18 | 15 | 82.969 | level | 31 | 17 | 57.677 |
| picture | 36 | 32 | 81.66 | average | 24 | 19 | 57.04 |
| lot | 50 | 43 | 80.641 | witnesses | 11 | 10 | 56.469 |
| judgement | 14 | 13 | 80.426 | match | 15 | 13 | 55.145 |
References
- Alter, A. L., & Oppenheimer, D. M. (2009). Uniting the tribes of fluency to form a metacognitive nation. Personality and Social Psychology Review, 13(3), 219–235. [Google Scholar] [CrossRef] [PubMed]
- Anthony, L. (2024). AntConc (Version 4.3.1) [Computer software]. Waseda University. Available online: https://www.laurenceanthony.net/software (accessed on 20 June 2025).
- Baker, P. (2010). Corpus methods in linguistics. In L. Litosseliti (Ed.), Research methods in linguistics (pp. 93–113). Continuum International Publishing Group. [Google Scholar]
- Baron-Cohen, S., Wheelwright, S., Skinner, R., Martin, J., & Clubley, E. (2001). The autism-spectrum quotient (AQ): Evidence from asperger syndrome/high-functioning autism, malesand females, scientists and mathematicians. Journal of Autism and Developmental Disorders, 31, 5–17. [Google Scholar] [CrossRef]
- Bell, B. E., & Loftus, E. F. (1988). Degree of detail of eyewitness testimony and mock juror judgments 1. Journal of Applied Social Psychology, 18(14), 1171–1192. [Google Scholar] [CrossRef]
- Benton, T. R., Ross, D. F., Bradshaw, E., Thomas, W. N., & Bradshaw, G. S. (2006). Eyewitness memory is still not common sense: Comparing jurors, judges and law enforcement to eyewitness experts. Applied Cognitive Psychology, 20(1), 115–129. [Google Scholar] [CrossRef]
- Bernstein, D. M., & Loftus, E. F. (2009). How to tell if a particular memory is true or false. Perspectives on Psychological Science, 4(4), 370–374. [Google Scholar] [CrossRef] [PubMed]
- Bogliacino, F., Grimalda, G., Ortoleva, P., & Ring, P. (2017). Exposure to and recall of violence reduce short-term memory and cognitive control. Proceedings of the National Academy of Sciences of the United States of America, 114(32), 8505–8510. [Google Scholar] [CrossRef]
- Bruer, K., & Pozzulo, J. D. (2014). Influence of eyewitness age and recall error on mock juror decision-making. Legal and Criminological Psychology, 19(2), 332–348. [Google Scholar] [CrossRef]
- Canadian Registry of Wrongful Convictions. (n.d.). Available online: https://www.wrongfulconvictions.ca/ (accessed on 20 June 2025).
- Cutler, B. L., Penrod, S. D., & Dexter, H. R. (1990). Juror sensitivity to eyewitness identification evidence. Law and Human Behavior, 14(2), 185–191. [Google Scholar] [CrossRef]
- DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129(1), 74–118. [Google Scholar] [CrossRef]
- Devine, D. J. (2012). Jury decision making: The state of the science. NYU Press. [Google Scholar]
- Dillon, M. K., Jones, A. M., Bergold, A. N., Hui, C. Y., & Penrod, S. D. (2017). Henderson instructions: Do they enhance evidence evaluation? Journal of Forensic Psychology Research and Practice, 17(1), 1–24. [Google Scholar] [CrossRef]
- European Registry of Exonerations. (n.d.). Available online: https://www.registryofexonerations.eu/ (accessed on 20 June 2025).
- Evidence Based Justice Lab Miscarriages of Justice Registry. (n.d.). Available online: https://evidencebasedjustice.exeter.ac.uk/miscarriages-of-justice-registry/ (accessed on 20 June 2025).
- Fisher, R. P., Vrij, A., & Leins, D. A. (2012). Does testimonial inconsistency indicate memory inaccuracy and deception? Beliefs, empirical research, and theory. In Applied issues in investigative interviewing, eyewitness memory, and credibility assessment (pp. 173–189). Springer. [Google Scholar]
- Frank, D. J., & Kuhlmann, B. G. (2017). More than just beliefs: Experience and beliefs jointly contribute to volume effects on metacognitive judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(5), 680–693. [Google Scholar] [CrossRef] [PubMed]
- Fraser, B. M., Mackovichova, S., Thompson, L. E., Pozzulo, J. D., Hanna, H. R., & Furat, H. (2022). The influence of inconsistency in eyewitness reports, eyewitness age and crime type on mock juror decision-making. Journal of Police and Criminal Psychology, 37(2), 351–364. [Google Scholar] [CrossRef]
- Frith, U., & Happé, F. (1994). Autism: Beyond “theory of mind”. Cognition, 50(1–3), 115–132. [Google Scholar] [CrossRef] [PubMed]
- Hans, V. P., Reed, K., Reyna, V. F., Garavito, D., & Helm, R. K. (2022). Guiding jurors’ damage award decisions: Experimental investigations of approaches based on theory and practice. Psychology, Public Policy, and Law, 28(2), 188–212. [Google Scholar] [CrossRef]
- Helm, R. K. (2021a). Evaluating witness testimony: Juror knowledge, false memory, and the utility of evidence-based directions. The International Journal of Evidence & Proof, 25(4), 264–285. [Google Scholar]
- Helm, R. K. (2021b). The anatomy of ‘factual error’ misscarriages of justice in England and Wales: A fifty-year review. Criminal Law Review, 5, 351–373. [Google Scholar]
- Helm, R. K. (2022). Wrongful conviction in England and Wales: An assessment of successful appeals and key contributors. The Wrongful Conviction Law Review, 3(3), 196–217. [Google Scholar] [CrossRef]
- Helm, R. K. (2024). How juries work: And how they could work better. Oxford University Press. [Google Scholar]
- Helm, R. K., & Growns, B. (2023). Predicting and projecting memory: Error and bias in metacognitive judgements underlying testimony evaluation. Legal and Criminological Psychology, 28(1), 15–33. [Google Scholar] [CrossRef]
- Helm, R. K., & Spearing, E. R. (2025). The impact of race on assessor ability to differentiate accurate and inaccurate witness identifications: Areas of vulnerability, bias, and discriminatory outcomes. Psychology, Public Policy, and Law. [Google Scholar] [CrossRef]
- Howe, M. L., Knott, L. M., & Conway, M. A. (2017). Memory and miscarriages of justice. Psychology Press. [Google Scholar]
- Hudson, C. A., Vrij, A., Akehurst, L., Hope, L., & Satchell, L. P. (2020). Veracity is in the eye of the beholder: A lens model examination of consistency and deception. Applied Cognitive Psychology, 34(5), 996–1004. [Google Scholar] [CrossRef]
- John Jenkins v Her Majesty’s Advocate. (2011). HCJAC 86.
- Jones, A. M., Bergold, A. N., Dillon, M. K., & Penrod, S. D. (2017). Comparing the effectiveness of Henderson instructions and expert testimony: Which safeguard improves jurors’ evaluations of eyewitness evidence? Journal of Experimental Criminology, 13, 29–52. [Google Scholar] [CrossRef]
- Jones, A. M., Bergold, A. N., & Penrod, S. (2020). Improving juror sensitivity to specific eyewitness factors: Judicial instructions fail the test. Psychiatry, Psychology and Law, 27(3), 366–385. [Google Scholar] [CrossRef]
- Jones, A. M., & Penrod, S. (2018). Improving the effectiveness of the Henderson instruction safeguard against unreliable eyewitness identification. Psychology, Crime & Law, 24(2), 177–193. [Google Scholar]
- Kassin, S. M., Tubb, V. A., Hosch, H. M., & Memon, A. (2001). On the “general acceptance” of eyewitness testimony research: A new survey of the experts. American Psychologist, 56(5), 405–416. [Google Scholar] [CrossRef]
- Lavis, T., & Brewer, N. (2017). Effects of a proven error on evaluations of witness testimony. Law and Human Behavior, 41(3), 314–323. [Google Scholar] [CrossRef]
- Lecci, L., & Myers, B. (2008). Individual differences in attitudes relevant to juror decision making: Development and validation of the pretrial juror attitude questionnaire (PJAQ). Journal of Applied Social Psychology, 38(8), 2010–2038. [Google Scholar] [CrossRef]
- Lindsay, D. S., Nilsen, E., & Read, J. D. (2000). Witnessing-condition heterogeneity and witnesses’ versus investigators’ confidence in the accuracy of witnesses’ identification decisions. Law and Human Behavior, 24, 685–697. [Google Scholar] [CrossRef]
- Lindsay, R. C. L., Lim, R., Marando, L., & Cully, D. (1986). Mock-juror evaluations of eyewitness testimony: A test of metamemory hypotheses. Journal of Applied Social Psychology, 16(5), 447–459. [Google Scholar] [CrossRef]
- Loftus, E. F., & Greenspan, R. L. (2017). If I’m certain, is it true? Accuracy and confidence in eyewitness memory. Psychological Science in the Public Interest, 18(1), 1–2. [Google Scholar] [CrossRef]
- Mann, S., Vrij, A., & Bull, R. (2004). Detecting true lies: Police officers’ ability to detect suspects’ lies. Journal of Applied Psychology, 89(1), 137–149. [Google Scholar] [CrossRef]
- Martire, K. A., & Kemp, R. I. (2009). The impact of eyewitness expert evidence and judicial instruction on juror ability to evaluate eyewitness testimony. Law and Human Behavior, 33, 225–236. [Google Scholar] [CrossRef]
- McKimmie, B. M., Masser, B. M., & Bongiorno, R. (2014). Looking shifty but telling the truth: The effect of witness demeanour on mock jurors’ perceptions. Psychiatry, Psychology and Law, 21(2), 297–310. [Google Scholar] [CrossRef]
- National Registry of Exonerations. (n.d.). Available online: https://exonerationregistry.org/ (accessed on 20 June 2025).
- New Jersey v Henderson. (2011). 27 A.3d 872.
- O’Neill, M. C., & Pozzulo, J. D. (2012). Jurors’ judgments across multiple identifications and descriptions and descriptor inconsistencies. American Journal of Forensic Psychology, 30(2), 39–66. [Google Scholar]
- Potts, A., & Baker, P. (2012). Does semantic tagging identify cultural change in British and American English? International Journal of Corpus Linguistics, 17(3), 295–324. [Google Scholar] [CrossRef]
- Pozzulo, J. D., Crescini, C., & Panton, T. (2008). Does methodology matter in eyewitness identification research?: The effect of live versus video exposure on eyewitness identification accuracy. International Journal of Law and Psychiatry, 31(5), 430–437. [Google Scholar] [CrossRef]
- Ramirez, G., Zemba, D., & Geiselman, R. E. (1996). Judges’ cautionary instructions on eyewitness testimony. American Journal of Forensic Psychology, 14(1), 31–66. [Google Scholar]
- Reyna, V. F. (2012). A new intuitionism: Meaning, memory, and development in fuzzy-trace theory. Judgment and Decision Making, 7(3), 332–359. [Google Scholar] [CrossRef]
- Reyna, V. F., & Brainerd, C. J. (2011). Dual processes in decision making and developmental neuroscience: A fuzzy-trace model. Developmental Review, 31(2–3), 180–206. [Google Scholar] [CrossRef]
- Rubínová, E., Fitzgerald, R. J., Juncu, S., Ribbers, E., Hope, L., & Sauer, J. D. (2021). Live presentation for eyewitness identification is not superior to photo or video presentation. Journal of Applied Research in Memory and Cognition, 10(1), 167–176. [Google Scholar] [CrossRef]
- R v Turnbull. (1977). Q.B. 224.
- Saraiva, R. B., van Boeijen, I. M., Hope, L., Horselenberg, R., Sauerland, M., & van Koppen, P. J. (2019). Development and validation of the Eyewitness Metamemory Scale. Applied Cognitive Psychology, 33(5), 964–973. [Google Scholar] [CrossRef]
- Semmler, C., Brewer, N., & Douglass, A. B. (2012). Jurors believe eyewitnesses. In B. L. Cutler (Ed.), Conviction of the innocent: Lessons from psychological research (pp. 185–209). American Psychological Association. [Google Scholar]
- Simons, D. J., & Chabris, C. F. (2011). What people believe about how memory works: A representative survey of the US population. PLoS ONE, 6(8), e22757. [Google Scholar] [CrossRef] [PubMed]
- Simons, D. J., & Chabris, C. F. (2012). Common (mis) beliefs about memory: A replication and comparison of telephone and Mechanical Turk survey methods. PLoS ONE, 7(12), e51876. [Google Scholar] [CrossRef]
- Slane, C. R., & Dodson, C. S. (2022). Eyewitness confidence and mock juror decisions of guilt: A meta-analytic review. Law and Human Behavior, 46(1), 45–66. [Google Scholar] [CrossRef]
- United States v Telfaire. (1972). 469 F.2d 552.
- Vrij, A., Hartwig, M., & Granhag, P. A. (2019). Reading lies: Nonverbal communication and deception. Annual Review of Psychology, 70(1), 295–317. [Google Scholar] [CrossRef] [PubMed]
- Weber, N., & Brewer, N. (2008). Eyewitness recall: Regulation of grain size and the role of confidence. Journal of Experimental Psychology: Applied, 14(1), 50–60. [Google Scholar] [CrossRef]
- Wells, G. L., & Bradfield, A. L. (1998). “Good, you identified the suspect”: Feedback to eyewitnesses distorts their reports of the witnessing experience. Journal of Applied Psychology, 83(3), 360–376. [Google Scholar] [CrossRef]
- Wells, G. L., Kovera, M. B., Douglass, A. B., Brewer, N., Meissner, C. A., & Wixted, J. T. (2020). Policy and procedure recommendations for the collection and preservation of eyewitness identification evidence. Law and Human Behavior, 44(1), 3–36. [Google Scholar] [CrossRef]
- Wixted, J. T., & Wells, G. L. (2017). The relationship between eyewitness confidence and identification accuracy: A new synthesis. Psychological Science in the Public Interest, 18(1), 10–65. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).