Advancing the Selection of Neurodevelopmental Measures in Epidemiological Studies of Environmental Chemical Exposure and Health Effects

With research suggesting increasing incidence of pediatric neurodevelopmental disorders, questions regarding etiology continue to be raised. Neurodevelopmental function tests have been used in epidemiology studies to evaluate relationships between environmental chemical exposures and neurodevelopmental deficits. Limitations of currently used tests and difficulties with their interpretation have been described, but a comprehensive critical examination of tests commonly used in studies of environmental chemicals and pediatric neurodevelopmental disorders has not been conducted. We provide here a listing and critical evaluation of commonly used neurodevelopmental tests in studies exploring effects from chemical exposures and recommend measures that are not often used, but should be considered. We also discuss important considerations in selecting appropriate tests and provide a case study by reviewing the literature on polychlorinated biphenyls.


Introduction
Many underlying causes for childhood neurodevelopmental disorders have been explored, including early (e.g., fetal, perinatal) exposures to environmental chemicals [1]. Methods for assessing adverse effects on neurodevelopment are broadening to include fetal neuroimaging (including functional magnetic resonance imaging, or fMRI), and toxicogenomics. Nevertheless, in environmental epidemiology studies, neurodevelopmental function tests form the basis for evaluations of associations between chemical exposure and human health effects.
The uses of neurodevelopmental tests in studies of environmental chemicals and pediatric neurodevelopmental disorders have been reviewed [2][3][4] and limitations of currently used tests and the difficulties with their interpretation have been described [5,6], for example in relation to long-term consistency of test outcomes. However, a comprehensive critical examination of commonly used tests in environmental epidemiology has not been conducted. In addition, many commonly used measures in other research areas (e.g., neuropsychology) have not gained wide use in the environmental chemical study arena and deserve attention.
In this paper, we seek to advance the science of neurodevelopmental function testing in environmental epidemiology studies by identifying central issues that should inform the choice of assessment devices for inclusion in future studies. These include general issues such as the relative merits of measures that capture broad versus narrow neurodevelopmental processes or domains (i.e., the function/neurodevelopmental process being assessed; for example, IQ is a broad cognitive measure, processing speed is a narrow cognitive measure), as well as technical concerns that arise when attempting to use new measurement strategies while maintaining connections with prior literature. We also make recommendations about guiding principles that can facilitate the design of neurodevelopmental studies, as well as specific suggestions about choices of measures and domains to provide a prototype (not a rigid template) for successful future investigations. Specifically, the following are reviewed: (i) commonly used neurodevelopmental measures (i.e., test or instrument) and measures that are not often used, but should be considered, by environmental epidemiologists, (ii) methodological issues that influence study findings, and (iii) methods for measuring other risk and protective factors that impact findings.
Although most environmental chemicals have not undergone extensive evaluations for their effects on neurodevelopment, a few chemicals (e.g., lead, methylmercury, polychlorinated biphenyls [PCBs]) have been studied by multiple research groups over many years. We selected PCBs as a case study for critically reviewing commonly used neurodevelopmental tests in environmental epidemiology studies because it offered a sufficient number of studies to provide a meaningful basis for evaluation of assessment methodology without requiring review of a prohibitive number of articles. We do not discuss specific outcomes reported in the individual studies, nor do we weigh in on the potential merits or weaknesses of past studies. Rather, we use the list of neurodevelopmental function tests employed in assessments of childhood neurodevelopment and PCBs as the foundation for a discussion of key aspects of test selection that must be considered when designing these types of studies. We then give recommendations for a path forward that might strengthen the use of these tests to support risk assessment. It is hoped that this exercise will serve as the foundation for multi-disciplinary discussions regarding best practices in the field of neurodevelopmental environmental epidemiology. A template for best practices is essential as these epidemiological studies (in conjunction with toxicological studies) form the foundation for risk assessment and regulation of many environmental chemicals.

Experimental Section
Our strategy was to identify key primary and review articles for a selected chemical class and review them to build an initial list of measures and domains [7,8]. We used PCBs as our chemical class as several epidemiological studies of neurodevelopment have been conducted and have included a wide range of measures [9]. We then searched for updates, revisions, and competing versions of those measures. We identified "incumbents," or the measures most frequently used across studies; the most frequently used in each domain are evaluated in Tables 1 and 2. For each measure, two independent raters (LA, LK) nominated an additional measure that would improve upon the incumbent. When the raters disagreed (which happened for three of the measures), the evidence base related to the measures was discussed and a consensus reached.
The measures most commonly used in epidemiological studies of PCBs are shown in Table 1. For the purposes of this research, each version of a measure was treated as a discrete entity and each distinct component of each measure was evaluated as a distinct entity. In reviewing the measures, we noted the domain labels assigned by the test developer, by the epidemiological investigators, by reviewers of the literature (e.g., [9]) and also according to current practice in neuropsychology. When the labels for domains were inconsistent, we organized Tables 1 and 2 around current practice, rather than historical or study-specific assignments.

Results and Discussion
The primary goal of this study was to identify and evaluate measures that have been commonly used in epidemiological studies examining environmental chemical effects on neurocognitive development. To be compelling, new research projects that are not designed solely for hypothesis generation need to build on prior research by including additive, incremental advances and newer components that reflect current advances in theory and technique. A project's neurodevelopmental assessment battery (typically composed of several measures) must be broad enough to capture relevant domains, but focused enough to be feasible. The measures themselves need to balance developmental appropriateness against the competing virtue of maintaining comparability across a wide age range. Additionally, measures have different strengths and weaknesses in terms of their psychometric properties (i.e., reliability, validity, population samples upon which the measure is normed). Viewed through the lens of designing an optimal neurodevelopmental study, not all psychometric features are equally important.
Measures used in epidemiological studies of PCBs and alternative measures suggested for future studies are shown in Table 1. Most of the PCBs studies used versions of tests that were current at the time of the study, but the majority of the commercially-distributed measures have been updated since the completion of the cohort studies under review here. Table 2, which is designed to serve as a resource for environmental epidemiologists, gives detailed information on various properties of neurodevelopmental measures. Together, Tables 1 and 2 should provide sufficient information for researchers to select the best neurodevelopmental measures that cover their domain of interest. We hope the comprehensive list will also inspire researchers to use different tests than those used in previous studies, thus building upon past studies by including more sensitive measures or new areas of interest.
During our review of the test batteries used in prior research and of subsequent developments with measures, we identified a set of cross-cutting themes and methodological issues pertinent to the design of new studies as well as the evaluation of published studies; these are described in the following subsections. Examples from Tables 1 and 2 are used to highlight these issues. As is clear from these tables, a large number of measures have been used to assess potential effects of PCBs (which is presumably only a subset of a much larger list if additional toxicants are considered). The complete set of tests included in Table 2 is too large for any single cohort study to include or for future studies to fully incorporate. Reasonable principles or guidelines are needed to help investigators select measures that connect with prior research and also take advantage of any improved assessment tools; we provide recommendations on this topic as well.

Table 1. Examples of tests used in PCB epidemiology literature and alternative recommended measure(s) for each domain. There were three possible bases for the recommended alternative measure: (1) the recommended measure has more advantages and fewer disadvantages (as enumerated in Table 2), (2) the recommended measure addresses an important domain that had been unexplored in past studies, or (3) the recommended version is a newer measure with updated norms.

Table 2, note. Norm quality was rated on a four-point scale: **** = Exemplary, with nationally representative demographics and good sample size across relevant age spans; *** = Good, with some shortcomings (such as dated norms, coarsely clustered sampling, or omission of an important group); ** = Suboptimal (e.g., badly out of date, or a convenience sample that was not nationally representative); * = Flawed.

The review of the PCB literature and associated neurodevelopmental tests, as well as the exploration of alternative recommended tests, brought to light several important methodological issues to consider when designing a study and choosing assessment measures. Each issue is outlined below, followed by recommendations for future environmental epidemiology research.

Neurodevelopmental Measures and Domains
Evaluations of results of neurodevelopmental studies as part of a weight-of-evidence assessment (the process used in hazard evaluation to evaluate the degree of certainty regarding the adverse health effects of a chemical) necessarily include a review of the domains studied. This evaluative process, crucial to risk assessment, would be aided by consistent interpretations regarding the domain that a measure examines. However, the review of the PCBs literature revealed variation in the ways that neurodevelopment was parsed into domains, and also variations in how tests were categorized as measures of particular domains. A further complication is that different fields of study do not always use the same domain definitions, making interdisciplinary communication difficult (e.g., see differences in how domains are categorized in Table 1 versus categorization used by Boucher et al. [9]). This is not surprising, as it reflects the evolution of domain definitions that do not have distinct boundaries. The fact that many tasks have multiple components or involve coordination between multiple systems of functioning adds to the challenge. For example, the Arithmetic subtest from the Wechsler versions of the intelligence tests for children and adolescents asks the subject to listen to a story problem and then perform arithmetic operations in their head before producing an answer. As a result, the task includes auditory processing (listening to the passage), verbal processing (identifying the quantities and operations required), working memory components (maintaining the key elements in working memory and performing operations on them), an achievement component (having been exposed to and learning the necessary arithmetic operations), plus the nonverbal general ability component that would be expected based on the content and the subtest name (Sattler, 2001). 
Because of the task complexity, the Arithmetic subtest has been found to statistically relate more to the Verbal IQ and the Freedom from Distractibility Composite Index, but never significantly to the Nonverbal IQ or Perceptual Organization Composite Index (or later analogs). This illustrates the point that tests can be difficult to categorize even using quantitative and objective methods, let alone rational or theory-driven models.
Recommendations: It is clear that there have been changes over time and across studies in how assessment tests are categorized. A consistent rubric should be developed and adopted, even though it would necessarily be imperfect, provisional, and subject to periodic revision.

Broad versus Narrow Measures
Most of the neurodevelopmental studies of PCBs used a combination of broad and narrow measures. "Broad" in the neurocognitive sense refers to measures that use composite scores to summarize performance across multiple tasks, with the composite score acting as an indicator of a complex underlying domain. Examples include the composite index scores or full scale summary score from intelligence tests. "Narrow" refers to measures that assess a more focal process or construct. Examples include visual-motor, articulation, or spelling abilities.
Broad and narrow measures both have advantages and disadvantages. In general, the reliability, validity, and predictive value of a measure increase with the length of the test [44] (Figure 1). An advantage of broad measures (e.g., IQ) is that they are typically measured with greater reliability because they integrate information from multiple components, resulting in a longer test less influenced by error affecting any one component. This is a fact of psychometrics: The longer and more thorough the test, the more precise the estimate of the "true score" (the person's level of the ability or trait, uncontaminated by error or other factors not related to the construct of interest). A second advantage of broad measures is that they tend to be based on factor analysis, which provides the important conceptual advantage that measurement is organized around the underlying domain of interest, not just observed performance on a test. Scores on a vocabulary test, for instance, can be influenced by educational opportunity, personality factors, language development, and a variety of other factors in addition to intelligence; whereas a verbal composite index focuses on the underlying ability that is shared across a vocabulary test as well as analogies, measures of general knowledge, and other tasks. Broad measures are thus more reliable and potentially more "pure" measures of some domains. A third major advantage of broad measures is that they have the greatest predictive value in terms of relating to educational, occupational, and health outcomes. General cognitive ability has consistently proven to be one of the most robust predictors of functional and vocational attainment [45,46] and has a surprisingly powerful association with health, longevity, and other important outcomes [47].
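The relationship between test length and reliability described above is commonly formalized by the Spearman-Brown prophecy formula. The sketch below is illustrative only: it assumes that added (or removed) components are parallel to the existing ones, which real subtests only approximate, and the starting reliability of 0.70 is a hypothetical value rather than one taken from Table 2.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added or removed parts are parallel to the existing parts
    (same true-score variance and error variance), an idealization.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Hypothetical single-task reliability of 0.70, rescaled to other lengths.
for factor in (0.5, 1, 2, 4):
    predicted = spearman_brown(0.70, factor)
    print(f"length x{factor}: predicted reliability {predicted:.3f}")
```

Under these assumptions, doubling the length raises the predicted reliability and halving it lowers the predicted reliability, which is the trade-off between precision and burden that the discussion of broad measures turns on.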
A fourth potential advantage is that more broad, global measures of performance may be sensitive to the cumulative effects of multiple decrements across a set of underlying, more focal processes (as a hypothetical example, a chemical could negatively impact working memory and processing speed; the broader measure could capture the confluence of these impacts, which more closely mirrors what one would observe in the child's everyday life). The disadvantages of broad measures are in many ways the converse of the strengths. Estimating a broad score requires that the test sample from a variety of different domains, creating pressure for longer test length and greater expense and burden. Within the cognitive ability literature, the tension between the competing aims of precise estimation of global abilities versus minimizing burden has been partially solved in two ways: choosing the most important subtests and choosing the most predictive items. An approach to shortening battery length without compromising the estimate of overall cognitive ability is to concentrate on an abbreviated battery that includes only the tasks most correlated with the underlying factor. This is the method guiding the use of two-subtest brief batteries (typically a vocabulary and a matrix or block design task), and it also is the rationale for the development of several four subtest measures of ability (i.e., designed and validated specifically as four subtest instruments) (e.g., Wide Range Intelligence Test, or WRIT; [48]; and the Wechsler Abbreviated Scales of Intelligence, or WASI; [22]). A second, more technical approach is to use a family of statistical methods known as "item response theory" (IRT) to guide the selection of test items so that the tests provide the most precise estimate of ability possible with the minimal number of items [49].

[Figure 1: axes labeled "Time Required to Administer" and "Reliability of Measure".]
IRT methods have been incorporated into the selection of items for the instruments designed to be brief batteries (e.g., WASI and WRIT). IRT methods also can be used in an "adaptive testing" framework, where computer administration makes it possible to select subsequent items based on individual performance on earlier items. Adaptive testing makes it possible to achieve equally precise estimates with roughly 30% fewer items administered, but it requires computer administration. Adaptive testing will become increasingly feasible to add to epidemiological studies as computer administration of other performance tests becomes more commonplace.
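To make the adaptive testing idea concrete, the following is a minimal sketch of one common approach: a two-parameter logistic (2PL) IRT model in which the next item is the one with the greatest Fisher information at the current ability estimate. It is not the procedure used by any published instrument. The item bank, the parameter values, the standard normal prior, and the function names are all hypothetical choices made for illustration.

```python
import math

# Hypothetical item bank: (discrimination a, difficulty b) for each item.
ITEM_BANK = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta, items, administered):
    """Index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


def eap_estimate(responses, items):
    """Expected a posteriori ability estimate on a grid, N(0, 1) prior.

    responses is a list of (item_index, answered_correctly) pairs.
    """
    grid = [g / 10.0 for g in range(-40, 41)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for idx, correct in responses:
            p = p_correct(theta, *items[idx])
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


def run_adaptive_test(answer_fn, items, max_items=3):
    """Administer up to max_items, re-estimating ability after each one."""
    responses, administered, theta = [], set(), 0.0
    for _ in range(min(max_items, len(items))):
        idx = next_item(theta, items, administered)
        administered.add(idx)
        responses.append((idx, answer_fn(idx)))
        theta = eap_estimate(responses, items)
    return theta, responses


# Example: a simulated examinee who answers every item correctly.
theta_hat, log = run_adaptive_test(lambda idx: True, ITEM_BANK)
print(f"estimated ability {theta_hat:.2f} after items {[i for i, _ in log]}")
```

Because each item is chosen where it is most informative, fewer items are spent on questions that are far too easy or too hard for the examinee, which is the source of the efficiency gain described above.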
The advantages of narrow measures (e.g., Beery Test of Visual-Motor Integration) include greater brevity and a more direct connection to a specific neurocognitive process or brain region. There is also the potential for narrow measures to be more sensitive to neurotoxic effects on specific systems or areas of the brain [4]. However, detection of effects on narrow tasks is made harder by the lower reliability and sometimes unknown but often lower validity of task performance as a measure of an underlying domain. A major issue is that performance on a single task can be influenced by multiple variables. Sattler [50], for example, lists between nine and two dozen variables that can affect performance on each of the subtests comprising a Wechsler intelligence test. When multiple subtests are available, it is possible to use techniques like factor analysis to uncover the underlying domains of interest; but with an individual test it is not possible to disentangle the potential sources of error and variation. Some tasks, such as the Wisconsin Card Sorting Test (see Table 1) [17], are now recognized to be intrinsically complex and involve multiple neurocognitive processes for the person taking the test. At the same time, some narrow measures relate to an underlying function or domain that may truly stand alone.
In the educational assessment literature, there has been much discussion of "cross-battery assessment" as a means of improving the measurement of specific domains. The main concept in cross-battery assessment relies on choosing several different tests that are supposed to measure the same domain, though often drawn from different published tests. For example, to provide good measurement of working memory, the three subtests from the WISC-IV might be supplemented with two more tests from the Wide Range Assessment of Memory and Learning. There are a variety of technical obstacles to the implementation of this cross-battery assessment strategy, some of which would be tractable in a large-group epidemiological study because it would be possible to redo factor analyses on the measures in question within the epidemiological study [51].
Recommendations: Given the largely complementary strengths and weaknesses of broad versus narrow measures, an optimal strategy for future environmental epidemiology studies would be to include a mix of both broad and narrow measures. Broad measures are best at estimating real world functioning and provide the most reliable and valid measurement options. Narrow measures are still important, however, because they may identify specific neurocognitive impacts that may not be observed with the broad measures. The choice of narrow measures should be tailored to each study based on prior evidence and specific hypotheses or questions about neurodevelopmental vulnerabilities potentially linked to the toxicant. However, studies that include a large number of narrow tasks without a priori motivation based on the literature or theory will create more problems than they solve. Increasing the number of batteries incurs costs of greater expense, increased burden, more missing data, inflated Type I errors or false positive results, less parsimony and more potential redundancy in findings. There is also the potential for Type II or false negative errors if psychometrically weak measures fail to detect true neurodevelopmental effects.
It is possible to use newer, brief, well-validated measures to provide precise estimates of global functioning. For example, using a four subtest battery provides equally precise estimates of general cognitive ability and verbal or nonverbal functioning as would be obtained using a corresponding ten or twelve subtest battery. The choices of narrow tests should be informed in part by prior research, making sure to include domains that previously have been found to be affected by exposures to toxicants. The battery can also be supplemented by some narrow measures chosen for conceptual reasons.
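The warning in these recommendations about inflated Type I errors can be illustrated with a short calculation. This is a sketch under a simplifying assumption that the outcome tests are statistically independent; scores from the same children are usually correlated, so the true inflation will differ. The function names are illustrative, and the Bonferroni adjustment is shown only as one familiar correction, not as a method prescribed by the source.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# How the false-positive risk grows as narrow measures are added.
for n in (1, 5, 10, 20):
    rate = familywise_error_rate(n)
    print(f"{n:2d} outcomes: familywise error {rate:.3f}, "
          f"Bonferroni per-test alpha {bonferroni_alpha(n):.4f}")
```

The stricter per-test threshold also reduces power, which connects to the Type II error concern raised in the same recommendations.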

Old versus New Versions of a Measure
An important issue is the basis for choosing between using newer versus older versions of measures. The ethical guidelines of the American Psychological Association and other professional organizations clearly state that practitioners should use the most current version available for each measure [52]. The most appropriate measure for a practitioner may differ from that of a researcher. However, benefits for the researcher using the current version of a measure include: (a) enhanced generalizability of findings from the research cohort into clinical practice, at least until the measure in question is updated again; (b) congruence with ethical guidelines for practice; (c) gaining any theoretical or psychometric advantages built into the revision of the measure; and (d) avoidance of problems due to differences in the older standardization sample versus the population to which the investigator or others wish to generalize results.
However, there are costs associated with adopting newer versions of measures, especially in the context of conducting repeated assessments on a cohort of interest. If a cohort completed a particular version of a measure at study inception, then it would simplify the research design to continue administering the same version of the measure at follow-up periods (ignoring the constraints of developmental appropriateness and of practice effects, i.e., improvement on a test simply due to repeated administration). Using the Wechsler Intelligence Scales for Children (WISC) as an illustrative example, if at the start of the study the WISC-III (Wechsler Intelligence Scales for Children, 3rd edition) was used, but the WISC-IV is the current version available, then it is not a simple matter to switch to the new version of the measure and compare the scores. Each revision from WISC to WISC-R to WISC-III and WISC-IV has involved the addition or the subtraction of subtests. Each revision has changed the underlying factor structure of the battery [50], with some subtests (e.g., Arithmetic) migrating from one composite index into a different composite index. As a result, comparisons of two composite scores with the same name (e.g., Verbal IQ) are complicated by the fact that they might not be based on the same underlying set of tasks, and newer batteries may omit composite scores that were included on previous versions of the measure (e.g., the WISC-IV no longer provides Verbal IQ and Performance IQ estimates). Adding to the complexity are changes in names for composite scores, which are usually intended to reflect theoretical models or reconceptualizations, but nonetheless add to the challenge of describing results (as when "Freedom from Distractibility" changes into "Working Memory", sometimes with an additional subtest added to the composite score).
There are other issues involved in changing versions of measures. One is the change in standardization samples. Most measures are interpreted by comparing the raw score to the average score for peers of the same age or demography (i.e., the standardization sample). Standardized scores are created by comparing individual performance to the standardization sample. The methods for constructing the standardization sample vary widely, from local convenience samples of cases in a single clinic or community to stratified samples that are designed to be nationally representative. At present, the best normative samples typically are available for intelligence tests and measures that are co-normed in the same sample with them. However, these samples typically involve aggregating many smaller convenience samples distributed throughout the country of interest. Table 2 includes scored evaluations of the type and quality of the standardization samples in the measures used in PCBs studies, revealing a full range from small clinical convenience samples to population-level studies.
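As a simplified illustration of how a standardized score depends on the standardization sample, the sketch below converts a raw score to the familiar mean-100, SD-15 metric using a linear z-score transformation. Published tests generally use age-banded lookup tables rather than this formula, and the raw score and norm values here are hypothetical.

```python
def standard_score(raw: float, norm_mean: float, norm_sd: float,
                   scale_mean: float = 100.0, scale_sd: float = 15.0) -> float:
    """Linear standard score of a raw score relative to a norm group."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return scale_mean + scale_sd * z


# The same hypothetical raw score judged against two different norm groups.
raw_score = 32.0
print(standard_score(raw_score, norm_mean=30.0, norm_sd=5.0))  # older norms
print(standard_score(raw_score, norm_mean=32.0, norm_sd=5.0))  # newer norms
```

The child's performance is identical in both cases; only the comparison group changes, which is why the composition of the standardization sample matters when scores are interpreted.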
When conducting studies on effects of toxicants, researchers selecting a battery need to be cognizant of the composition of the standardization sample and how it compares to the sample included in their study. The discrepancy between the standardization versus participant samples causes problems when the norms are based on a US sample and the participants come from other countries (e.g., differences in language, culture). An obvious example is on the WPPSI test, which includes a picture of a child kicking an American football; the test requires accurate identification of this activity as "football" to earn full credit; this would be an unfair question to most of the rest of the world.
Discrepancies can also be meaningful within the same country. A standardization sample that was matched to national demography in 1970 will under-represent Latino Americans if the study sample was collected in 2009. Similarly, a test with nationally representative norms based on the year 2009 could still under-represent Latino Americans if the sample gathered for the environmental epidemiology study was drawn primarily from a heavily Latino region such as Texas. All of the cohorts studied in the PCBs literature were drawn from relatively geographically circumscribed regions, not from stratified nationally representative samples. This suggests that for epidemiological studies of toxicants the more common practice will be to gather samples from subsets of the population. Researchers should carefully consider whether the sample of participants differs from the demography included in the standardization sample. If there are differences, then the researchers should review the literature to determine whether these factors are associated with differences in performance on the neurodevelopmental test in question. If so, then the analytic plan of the study needs to address the potential confounding variables, at a minimum by including the potential confounders as correlates. Failure to do so could result in the appearance of seeming deficits that actually are due to cultural or demographic factors, and not due to the environmental exposure. These differences need not be limited to effects of culture or language on measures of academic knowledge or intelligence [44]; there also will be regional differences in diet or prevalence of genes that may be associated with performance on more narrow measures as well as potentially conferring differences in susceptibility to environmental exposures. 
For example, there are sizeable epidemiological differences in the distribution of the DRD4 alleles that are associated with sensation-seeking and impulsivity [53], and it is likely that there will be other differences in distribution of genes that influence performance on narrow measures.
Another potential confound related to changes in standardization samples is the possibility of temporal trends that alter the performance of the sample on the tasks. The most critical example of this is the "Flynn Effect," where performance on tests of general cognitive ability has been found to increase by an average of roughly three points per decade [54]. This pattern has been observed across multiple measures and multiple samples from different countries around the world. Thus it appears to be a general trend, although there is no clear explanation for why performance would be improving globally [55]. For the purposes of an epidemiological researcher, the practical consequence is that observed scores will appear lower on newer versions of tests (because the scores are being compared to the new, higher average level of performance). If a study is conducted such that a cohort first gets an older version of a measure, such as a WISC-III, and then the cohort is followed up with a WISC-IV, scores might be expected to drop 3 to 5 points at the later assessment due to the change in the norms, and not due to any actual change in performance. It would be a mistake to attribute this effect to long-term sequelae of the environmental exposure. Although the Flynn Effect represents a small effect size, this could generate spuriously large differences in the percentage of cases with extreme scores (see section on Clinical Significance below). Unfortunately, there is no easy solution to the confound introduced by the change in norms. For example, analyzing the raw scores would not be workable because (a) average performance changes rapidly with age, hence the need for age-based norms; (b) the actual item content of the subtests will change between versions; (c) sometimes entire subtests change between re-standardizations of the battery.
If there is a linking sample of cases that took both the old and the new versions of the test (which is often done as part of the updating process for new versions of measures), then it may be possible to estimate the size of the Flynn effect and the extent to which it might influence performance on particular measures.
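The point that a small average shift can produce a disproportionately large change in the share of extreme scores can be checked with a short calculation. The sketch assumes normally distributed scores with a standard deviation of 15 and uses the rate of roughly three points per decade quoted above. The 12-year gap between norming dates and the cutoff of 70 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

POINTS_PER_DECADE = 3.0  # approximate Flynn Effect rate quoted in the text


def expected_norm_shift(years_between_norms: float,
                        points_per_decade: float = POINTS_PER_DECADE) -> float:
    """Expected drop in standard scores attributable to newer norms alone."""
    return points_per_decade * years_between_norms / 10.0


def fraction_below(cutoff: float, mean: float, sd: float = 15.0) -> float:
    """Share of a normal score distribution falling below the cutoff."""
    return NormalDist(mu=mean, sigma=sd).cdf(cutoff)


shift = expected_norm_shift(12)              # hypothetical 12-year norm gap
before = fraction_below(70, mean=100.0)      # scored against the older norms
after = fraction_below(70, mean=100.0 - shift)  # same cohort, newer norms

print(f"expected shift: {shift:.1f} points")
print(f"share scoring below 70: {before:.2%} -> {after:.2%}")
```

With no change in the cohort's actual performance, the mean moves by only a few points while the proportion falling under the cutoff rises by a much larger relative amount, which is why re-norming can masquerade as an exposure effect in analyses of extreme scores.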
Recommendations: Researchers will almost always want to use the newest available versions of measures at the beginning of a study. They will want to become familiar with the differences between the new version and older versions that may have been used in prior published studies. Differences in subtest composition, factor structure, and constitution of the standardization sample all become confounding variables and would be rival hypotheses for any differences in patterns of findings. If repeated assessments are performed on the same cohort, then consideration needs to be given to the benefits of using consistent measures versus switching to newer tests when an older version might still be viable. If the primary purpose is within-subjects comparisons looking at trajectories over time within the cohort, then a good case could be made for retaining the older test even though a different version becomes available. Some of the technical issues with changes in version and norms will be unavoidable when the cohort ages across the boundaries between different versions of tests, such as the transition from preschool to school-aged, or adolescence to adulthood. Interestingly, many of the brief four-subtest versions of intelligence measures have broader age norms (e.g., 6 to 80 years versus 6 to 16 years), and they also may be less prone to changes in subtest content or factor structure than the larger batteries. These attributes may make them attractive candidates for many epidemiological studies. Researchers should also bear in mind that these factors affect comparisons between samples more than they affect correlations within the same sample: Using a particular version of a test may provide an accurate estimate of the association between toxicant exposure and neurocognitive functioning, even though the test may provide biased estimates of average functioning compared to the normative sample.

Psychometrics: Conventional and Relevant Metrics
Test publishers provide information about the psychometric properties of instruments, including various measures of reliability (referring to the reproducibility of scores) and validity (referring to evidence that the instrument actually measures what it is designed to measure) [44,56,57]. It is crucial for investigations into environmental impacts on neurodevelopment to include consideration of the psychometric properties of the measures selected when designing the study. There are many different ways of measuring both reliability and validity. Information on these issues is discussed in the following subsections and included in Table 2.

Reliability
One form of reliability is internal consistency, indicating the extent to which different parts of a test are measuring the same domain. Internal consistency is the single most widely reported measure of reliability, due to the fact that it is the least expensive type of reliability data to gather, not because it is intrinsically superior to other forms of reliability. For the purposes of epidemiological studies of toxicants, internal consistency often may be the least relevant of the major forms of reliability coefficients in guiding the selection of measures. It is also possible for internal consistency to be “too high” in some circumstances. Most indices of internal consistency are influenced by scale length, such that longer scales tend to be more internally consistent. Two items with very similar content will also correlate more highly than two items measuring different aspects of the same domain. For example, responses to items asking whether the participant “feels down” and “feels blue” would show greater internal consistency than would “feels down” and “insomnia,” even though all three items are relevant to the domain of depression. As a result, concentrating on maximizing internal consistency may paradoxically result in selecting scales that are longer than necessary, and may favor more redundancy or narrowness of domain representation rather than broad coverage with less internal consistency [57].
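The two points above (redundant items inflate internal consistency, and longer scales tend to be more internally consistent) can be illustrated with a minimal sketch. It assumes Cronbach's alpha as the index of internal consistency and the Spearman-Brown formula for the effect of scale length; the source does not name a specific index, and the item responses below are invented purely for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item; position j in every list
    belongs to the same respondent.
    """
    k = len(items)
    # Total score per respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_variances = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical 0-3 ratings from five participants (illustrative only).
feels_down = [0, 1, 2, 3, 3]
feels_blue = [0, 1, 2, 3, 2]  # near-duplicate content
insomnia = [1, 0, 3, 1, 2]    # same domain, different aspect

alpha_redundant = cronbach_alpha([feels_down, feels_blue])
alpha_broad = cronbach_alpha([feels_down, insomnia])
print(round(alpha_redundant, 2), round(alpha_broad, 2))

# Doubling the length of a scale whose reliability is 0.70.
print(round(spearman_brown(0.70, 2), 2))
```

In this toy example the near-duplicate pair yields the higher alpha even though the second pair covers more of the domain, which is the trade-off described in the text.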
Another form of reliability is inter-rater reliability, which refers to the extent to which scores are reproducible when the same test is administered or scored by different individuals (“raters”) [57]. Many tests involve scoring decisions, including judging the quality of verbal responses and assigning them to zero, one or two-point categories on vocabulary tests, or timing the speed at which block patterns are duplicated and making decisions about what constitutes an acceptable degree of rotation in the orientation of the pattern. These decisions introduce opportunities for human error and also for a degree of subjectivity in the decision-making; thus, it is important to evaluate the degree of reproducibility of scores across raters [50]. This issue also applies to giving neurological assessments, reading x-ray or MRI images, and many other classification decisions [58]. Cicchetti et al. [59] provide a review of different benchmarks for describing inter-rater reliability and some thoughts about selection of measures in terms of trade-off between reliability and validity.
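As a hedged illustration of how agreement between two raters on categorical scoring decisions might be quantified, the sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. Kappa is our choice for the example; the source does not prescribe a particular coefficient, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same responses."""
    n = len(rater_a)
    # Proportion of responses given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if the raters scored independently, given their
    # own marginal rates of using each category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical 0/1/2-point scores assigned to eight vocabulary responses.
rater_a = [0, 1, 2, 2, 1, 0, 1, 2]
rater_b = [0, 1, 2, 1, 1, 0, 2, 2]
print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.619
```

Here the raters agree on 6 of 8 responses (75%), but kappa is lower (about 0.62) because some of that agreement would be expected by chance alone.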
Retest stability refers to the extent to which individuals tend to maintain the same scores upon repeated administrations, such that high scorers on the initial assessment also tend to be the highest scorers when taking the test again. Retest stability is usually indexed as a correlation between the two sets of scores, thus ignoring overall changes in the level of scores. Retest stability tends to diminish as a function of time between administrations, such that two-week stabilities would be higher than two-year stabilities. Stability also varies as a function of the domain being assessed. As per the state versus trait distinction in psychology, some individual differences are expected to vary substantially across time and situation (state variables, e.g., sleep deprivation), whereas others are expected to show greater temporal and situational stability (trait variables, e.g., IQ). Stability also increases with age. For instance, the two-year stability of performance on a cognitive variable is likely to be much greater in the period between 24 and 26 years of age than would be found for the same dimension between 4 and 6 years of age. In the context of environmental epidemiological studies, retest stability can be informative by suggesting which tasks might be expected to show greater spontaneous recovery (or regression to the mean in the event of low stability) [60]. If the study design includes a low-exposed comparison cohort, then between-group comparisons provide a way of examining change in the effects of exposure over time; and if three or more administrations are available, then we recommend growth-curve modeling techniques as a way of comparing group differences in developmental trajectories (Figure 2) [61,62]. Where available, the basic information on reliability is included in Table 2.
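The point that a retest correlation ignores overall changes in the level of scores can be shown with a small sketch. The scores are invented: every hypothetical child gains exactly five points at retest, so rank order is perfectly preserved while the mean shifts.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


# Hypothetical standard scores at two administrations (illustrative only).
time_1 = [85, 92, 100, 108, 115]
time_2 = [score + 5 for score in time_1]  # uniform 5-point gain

stability = pearson_r(time_1, time_2)
level_change = mean(time_2) - mean(time_1)
print(round(stability, 3), level_change)
```

The stability coefficient is 1.0 even though every score changed, which is why mean-level change needs to be examined separately (for example, through between-group comparisons or growth-curve models as recommended above).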

Validity
There are several types of psychometric validity. The most important to the environmental epidemiology literature are construct validity, predictive validity, exposure sensitivity, and ecological validity.
Construct validity: Construct validity describes the extent to which a measure satisfies multiple underlying forms of validity (e.g., the extent to which the measure includes appropriate content, correlates with other established measures of the same domain, correlates with measures of different but related domains, and discriminates among diagnostic groups) [44,63]. Where available, the basic information on construct validity is included in Table 2.
Predictive validity: Predictive validity refers to concurrent or prospective predictions and was used by Davidson et al. [4] to evaluate tests for environmental epidemiology studies. The value of longitudinal prediction in a neurodevelopmental framework is clear. Concurrent predictive validity can also be called diagnostic efficiency when the measure is demonstrating validity in terms of assigning children into categories such as clinical diagnosis. Diagnostic efficiency is most commonly reported in terms of sensitivity and specificity, where sensitivity refers to the percentage of children who truly have the target condition and are classified correctly, and specificity refers to the percentage of children who do not have the target condition and are classified correctly [64]. A challenge in using diagnostic efficiency is that there needs to be a gold standard indicator of “true” status against which the assessment tools can be evaluated. For environmental epidemiological studies, the choice of criterion diagnoses could include definitions such as presence/absence of mental retardation, presence/absence of clinically significant impairment, or other definitions. For diagnostic efficiency statistics to be readily interpretable, the criterion needs to be dichotomous. However, this raises important questions about whether taking a criterion that could be measured continuously (such as cognitive ability) and converting it to a category (such as mental retardation versus within normal limits) loses information and reduces statistical power to detect effects [65]. There are considerable communication and policy advantages to using a dichotomous definition [66]; however, it must be recognized that important information is lost in this process, especially in terms of clinical significance.
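The definitions of sensitivity and specificity given above can be made concrete with a short sketch. The function name and the ten hypothetical children are our own illustration; `has_condition` stands in for the gold standard criterion and `test_positive` for the classification made by the assessment tool.

```python
def diagnostic_efficiency(has_condition, test_positive):
    """Return (sensitivity, specificity) against a dichotomous criterion."""
    pairs = list(zip(has_condition, test_positive))
    true_pos = sum(1 for truth, flag in pairs if truth and flag)
    false_neg = sum(1 for truth, flag in pairs if truth and not flag)
    true_neg = sum(1 for truth, flag in pairs if not truth and not flag)
    false_pos = sum(1 for truth, flag in pairs if not truth and flag)
    # Sensitivity: share of children with the condition who are flagged.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: share of children without the condition who are not flagged.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical data: 4 children with the condition, 6 without.
has_condition = [True] * 4 + [False] * 6
test_positive = [True, True, True, False,
                 True, False, False, False, False, False]

sens, spec = diagnostic_efficiency(has_condition, test_positive)
print(sens, round(spec, 3))  # 0.75 0.833
```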
Adopting the framework of diagnostic efficiency would also provide methods for dividing individuals into the dichotomous groups based on costs and benefits attached to correct identification and avoidance of errors [66][67][68]. Some groups have already used the diagnostic efficiency framework to evaluate the performance of candidate tests at discriminating between known groups, such as low birth weight versus normal birth weight, or learning disabled versus not [4]. This approach is an approximation, in that known categories (low birth weight, learning disability) are being substituted for an unknown category (effect of toxicant), and the specific effects of a toxicant may be different from the signature effects of low birth weight. However, the results demonstrated that the majority of the assessments investigated could not discriminate to a statistically significant degree between known groups, raising serious concerns about their assay sensitivity if used in epidemiological studies. Where available, the basic information on the predictive validity of measures used in PCBs neurodevelopmental research is included in Table 2.
Exposure sensitivity: This is similar to the concept of “treatment sensitivity” in the clinical trials literature: Has a measure demonstrated an ability to pick up the signal of a treatment effect when there is other evidence that the effect is present? This type of validity information is almost completely absent from the technical manuals or primary publications describing the psychometric properties of tests reviewed for this research. There were some exceptions, including the Bayley-III (see Tables 1 and 2) technical manual's presentation of scores for children who were exposed to alcohol in utero per mother report, resulting in small effect sizes for decrements in gross and fine motor ability (d ~ 0.3), moderate deficits in cognitive ability (d ~ 0.6) and large deficits on language ability and socioemotional functioning (d ~ 0.8). The same manual also provided information about average scores for a sample of infants that suffered asphyxia at birth (again per maternal report), with associated average deficits in the moderate range across all scales (d = 0.3 to 0.7) [18].
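The effect sizes (d) quoted above are standardized mean differences. A minimal sketch of one common way to compute them follows, assuming Cohen's d with a pooled standard deviation; the manuals cited may have used a different variant, and the scores below are hypothetical rather than taken from any test manual.

```python
from statistics import mean, variance


def cohens_d(comparison, exposed):
    """Standardized mean difference; positive values indicate a deficit
    in the exposed group relative to the comparison group."""
    n1, n2 = len(comparison), len(exposed)
    pooled_variance = (
        (n1 - 1) * variance(comparison) + (n2 - 1) * variance(exposed)
    ) / (n1 + n2 - 2)
    return (mean(comparison) - mean(exposed)) / pooled_variance ** 0.5


# Hypothetical standard scores (illustrative only).
comparison_scores = [100, 110, 90, 105, 95]
exposed_scores = [92, 102, 82, 97, 87]  # each 8 points lower

d = cohens_d(comparison_scores, exposed_scores)
print(round(d, 2))  # 1.01
```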
The advantages of the “exposure sensitivity” approach are that the statistical methods will be familiar to the scientific community, and it is often easier to assign people to groups based upon exposure status instead of outcome status (although there has also been concern about the heterogeneity and imprecision of definitions of exposure in the literature) [69]. Demonstration of sensitivity to exposure effects offers evidence that a measure can overcome the problems of imperfect reliability and validity to detect a measurable outcome. Even when found, exposure effects need to be interpreted with caution; for example, studies that use a large number of tests or statistical comparisons increase the risk of false discovery (meaning detecting a statistically significant result by chance; this risk can be reduced by applying a false discovery rate correction to the p values used to determine significance). Prior success at detecting exposure effects provides a method for streamlining batteries by eliminating instruments that have failed to detect effects, and also concentrates more attention and resources on tools that detect larger effects.
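The correction for multiple comparisons mentioned above is commonly implemented as a false discovery rate procedure. The sketch below assumes the Benjamini-Hochberg step-up procedure, which is one widely used option; the source does not specify which procedure is intended, and the p values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which tests remain significant
    after Benjamini-Hochberg control of the false discovery rate at q."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p value falls under its stepped threshold.
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p values from five outcome measures in one battery.
p_values = [0.041, 0.001, 0.200, 0.008, 0.039]
print(benjamini_hochberg(p_values))
# [False, True, False, True, False]
```

In this example four of the five comparisons fall below an uncorrected threshold of 0.05, but only two survive the correction.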
Ecological validity: Ecological validity is the ability of a measure to relate to real world functioning [63,70]. Many past environmental epidemiology studies have not included measures that focus specifically on everyday functioning. In Tables 1 and 2, we include measures that have been shown to have improved ecological validity. For example, epidemiology studies have used continuous performance tasks (CPT, a computerized test of attention). However, research in the field of ADHD shows that parent and teacher rating scales are better at detecting clinically significant differences in attention functioning. We therefore recommend the Conners Rating Scale if the goal is to identify meaningful behavioral effects, whereas the CPT might be a better “narrow” measure of attention processes (Table 1).
Recommendations: For the purposes of detecting the effects of toxicant exposure, conventional psychometric properties will not be equally important. Nor does the frequency with which psychometric characteristics are reported align with the degree of importance for epidemiological studies. Internal consistency is probably less useful for appraising candidate tests than inter-rater reliability or retest stability, but internal consistency is far more commonly reported in the primary publications and technical manuals of the assessment tools reviewed (Table 2). Using computer-assisted testing increases the standardization of administration and scoring for complex tasks, reducing a source of inter-rater reliability error and potentially enhancing the power of research designs to detect exposure effects (e.g., see Table 2, the CTONI, WCST or CPT). For applied purposes, higher inter-rater reliability is always desirable; but when comparing measures it is important to recognize that different designs can produce different reliability estimates. Inter-rater reliability will generally be much higher when judges are given the same audiotape or transcript to rate versus conducting separate interviews with the participant (adding variability due to administration as well as variability in scoring). We recommend evaluating the psychometric properties of each measure used with the study's sample, when possible, and comparing those properties to those found in the standardization sample. It is probably most important to evaluate inter-rater reliability in the test administrators/scorers regularly during the course of a study. We recommend growth-curve modeling techniques as a way of comparing group differences in developmental trajectories. Similarly, predictive validity and exposure sensitivity are two highly relevant but rarely reported parameters.
We recommend increased emphasis on reporting the relevant parameters, both in technical manuals and in research reports, to facilitate improving test selection. We also recommend a multitiered approach to test selection, where tests that have demonstrated exposure sensitivity may be supplemented by a second tier of other tests chosen on a theoretical basis, and perhaps a third tier of exploratory measures if resources permit.

Cultural Effects
Cultural effects are a major consideration in test selection. Most tests only have a standardization sample and normative data available in one language, even if the instrument has been translated into multiple languages. Translation is a complex process, and even with fluent translators and blinded “back-translation” into the original language for review, there can be important cultural differences in the way concepts are expressed. There can also be differences in the behaviors of interest on which the measure focuses. For example, there might be differences in the way that cultures experience depression. There might also be culture-dependent differences in the relationship between an item asking if the person “cries a lot” and their underlying level of depression. In addition, there may be differences in the amount of crying that is typical in a culture, independent of the underlying level of depression. These issues can be formally investigated using both qualitative techniques (ethnographic interviews and focus groups) as well as quantitative methods. However, with regard to neurodevelopmental tasks, most of the research about cultural effects is in its infancy.
The current shortcomings of research on cultural effects leave limited options for environmental epidemiologists. If the battery is constructed to avoid verbal or culturally loaded tasks, then the range of measures is constrained, and many of the tests with the strongest relationships to functional outcomes or behavior would be excluded. If only tests with thorough cultural adaptation and separate norms are used, then only a few instruments are added to the available pool. Reliance on tools that have been translated but not validated introduces potential confounds that should at a minimum be acknowledged as a potential limitation. Ideally, if the sample size is large enough and analytic resources are available, then examining the stability of the psychometrics using multi-group statistical methods would become a valuable secondary aim for the research [71].
Recommendations: We recommend increased resources be dedicated to research on cultural effects. Few of the tests we reviewed have been translated, and even fewer have normative data available for the translated version. We recommend that researchers use measures with similar levels of translation and validation, report them accurately in the measures sections of papers, and discuss the potential limitations in their reports. When selecting measures, it will be important to include some tests that have minimal verbal components. We do not recommend avoiding verbal tests, though, particularly if a goal of the investigation is to generalize to functioning in everyday settings. A secondary aim of projects with adequate resources would be to use qualitative and statistical methods to evaluate the degree of measurement equivalence when tests are transported into different languages and cultures.

Measuring Other Risk and Protective Factors
Most of the environmental epidemiological studies under review recognized the importance of measuring other factors besides toxicant exposure that could affect the individual's outcome. In addition to measuring comprehensive demographics (place of residence, parental age, race, marital status, etc.), medical status of the child and mother during pregnancy and birth, birth order of the child measured, age at exposure, severity of exposure, exposure to other important toxicants (e.g., smoking in the home, prenatal alcohol exposure, lead) and route of exposure, there are several other important factors that could either increase or decrease the severity of the effects. For example, nutrition has been measured in some studies and found to act as an important moderator [72]. Breastfeeding has also been shown to act as a protective factor. Socio-economic status (family income, parent education and parent occupation) is known to have profound effects on neurocognitive development and should be measured in every study. Studies have also used the Home Observation for Measurement of the Environment (HOME; [73]) to measure quality of home environment in a standardized manner as it is also known to have profound effects on development. Parental verbal ability/IQ is often reported as a covariate, though the most commonly used measure (Peabody Picture Vocabulary Test, or PPVT) [74] is not a culture-free test and should, therefore, be used with caution. Additionally, a child's overall cognitive ability acts as a protective factor regardless of the endpoint of interest. Such influential factors as cognitive ability should be included statistically as covariates.
Recommendations: It would be useful for investigators from multiple disciplines to pre-determine a set of variables that should be considered as covariates for every study, and suggest a systematic way of measuring those variables to increase the ability to make direct comparisons among studies and cohorts. For example, when measuring socioeconomic status, some investigators in the studies we reviewed used the Hollingshead Scale [75], some used education and income separately, and others created a unique approach using combined percentiles. A consistent method that could be used cross-culturally would be preferable. The demographic variables that are routinely described as features for standardization samples should typically be included as covariates, especially if the group exposed to the toxicant might differ on any of these features from the comparison group. If the research design includes different levels of exposure to the toxicant (e.g., exposed versus unexposed, unexposed versus single exposure versus multiple exposure, or more commonly, different amounts of exposure), then including interaction terms between the covariate and the exposure variable in the statistical approach will markedly reduce bias in the estimates of effects for the toxicant [76]. Another struggle relates to balancing the importance of measuring the possible covariates with the time required to measure some of them well. If investigators are looking for a more culture-free but still quick estimate of parental IQ, they might consider a measure such as the Test of Nonverbal Intelligence, 3rd Edition (TONI-3) [77], which is similar to the Raven's Progressive Matrices [78], but with much more recent norms.

Statistical Significance versus Clinical Significance
A recurring theme in the clinical literature is the distinction between statistical significance versus clinical significance; this issue has also been raised in the context of environmental epidemiological studies [79]. This distinction has proven challenging to use in practice, but it is also highly relevant to discussions of measuring the effects of toxicants on neurodevelopment.
Statistical significance most commonly refers to situations where the observed results (i.e., the study's findings) fall outside of a confidence interval (range of scores) around the result that would have been expected under a null hypothesis (i.e., a finding of no difference between groups); or similarly, when a test statistic evaluating an observed finding exceeds a critical value for the desired level of significance. In a study of toxicants, a statistically significant result would mean that the differences between the exposed and unexposed groups (or high exposure versus low exposure groups) were large enough that they would only have been observed by chance “rarely,” with “rarely” typically being defined as less than 5% of the time. Sometimes results are presented as an estimate of the effect size of exposure (see below) with a confidence interval that indicates the upper and lower bounds of the estimate. If the confidence interval is set at 95%, then this is conceptually equivalent to testing against a null hypothesis with an alpha of 0.05.
Statistical significance thus establishes a crucial filter for evaluating the potential effects of toxicants. Significant results indicate that the toxicant has an effect on the measure, or else might be a false positive result (i.e., a result obtained by chance alone; if a study makes 20 comparisons, one of those comparisons, or 5%, might be a “rare” difference observed by chance alone). Nonsignificant results indicate that the toxicant is weak or inert with regard to that particular measure, or else that the study might have produced a false negative error (i.e., there is a true difference that could not be detected by the study because of other factors such as poor measurement, poor inter-rater reliability, cultural effects, or not enough children in the study to detect a result). For the purposes of toxicant research, false positives are costly: They can lead to unnecessary increases in concern about exposure, and perhaps unnecessary regulatory action, management and/or treatment. False negatives are at least equally worrisome, as they can lead to the erroneous conclusion that the compound is safe (at least in regard to that particular measure), thus perhaps resulting in less regulation and potentially greater exposure. Using psychometrically weak measures increases the risk of failing to detect effects that are actually present (false negatives). Running a large number of significance tests on a battery containing multiple measures increases the risk of false positive results. Both errors should ideally be avoided, but research study design balances them against each other.
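The 20-comparison example above can be worked through numerically. The sketch below assumes that all comparisons are independent and that no true effects exist, which is a simplification; tests within a neurodevelopmental battery are usually correlated, so the figures are only a rough guide.

```python
def expected_false_positives(alpha, n_tests):
    """Average number of chance 'significant' results when no true
    effects exist."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha, n_tests = 0.05, 20
print(expected_false_positives(alpha, n_tests))         # about 1 of 20
print(round(familywise_error_rate(alpha, n_tests), 3))  # 0.642
```

Under these assumptions, a battery of 20 comparisons is expected to yield about one chance finding on average, and the probability of obtaining at least one is roughly 64%.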
Methods for increasing statistical power (i.e., the ability to detect a “true” difference) in environmental epidemiological studies include: (1) using a more liberal definition of significance (i.e., adopting a more lenient alpha level), (2) increasing the size of the effect, and (3) decreasing the size of the error in estimating the effect. The first option, using more liberal definitions of significance, directly increases the risk of false positive errors. The other options, increasing the effect size and reducing error, are methods that can increase power without inflating the risk of false positives, so they are clearly preferable.
Methods for increasing the size of the effect include increasing the exposure level (in human studies, this would translate into identifying and including subjects known to be highly exposed) and focusing on the neurocognitive areas that are maximally affected by the exposure. Increasing exposure may be acceptable in animal models but raises obvious ethical issues in human models. Thus, the most effective approach for increasing power in studies of toxicants is reduction of error.
Techniques for reducing error include increasing the size of the sample, increasing the precision of the measurement of effects (e.g., choosing one's measures wisely as outlined above), eliminating variance due to extraneous sources (e.g., confounders and covariates), and using repeated measures designs (ideally combining pre-exposure and post-exposure measurements on the same individuals). Pre-post designs are again often difficult to conduct with humans and toxicants, as ethical values will dictate relying on accidental exposure and other “natural experiments” which make it difficult to collect pre-exposure levels (though the large, prospective National Children's Study may contain pre-post components [80]). However, the other two approaches appear promising as ways of increasing statistical power in many studies of toxicants. Adopting measures with better psychometric properties will improve measurement precision, thus reducing error and improving power. The alternative measures in Table 1 are recommended because of their strong psychometric properties.
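To illustrate how sample size and measurement precision each affect power, the sketch below uses a normal approximation for a two-group comparison of means. Several simplifying assumptions are ours, not the source's: equal group sizes, a two-sided test, and classical test theory, under which an outcome measured with reliability r attenuates a true standardized effect d to d times the square root of r. The function name and all input values are hypothetical.

```python
from statistics import NormalDist


def approx_power(true_d, n_per_group, alpha=0.05, reliability=1.0):
    """Approximate power of a two-sided, two-group comparison of means."""
    standard_normal = NormalDist()
    # Unreliable measurement inflates the observed SD, shrinking observed d.
    observed_d = true_d * reliability ** 0.5
    # Expected value of the z statistic for the group difference.
    expected_z = observed_d * (n_per_group / 2) ** 0.5
    critical_z = standard_normal.inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return standard_normal.cdf(expected_z - critical_z)


# Larger samples increase power for the same effect.
print(round(approx_power(0.3, 100), 2), round(approx_power(0.3, 200), 2))

# A more reliable measure increases power for the same sample size.
print(round(approx_power(0.3, 100, reliability=0.70), 2),
      round(approx_power(0.3, 100, reliability=0.95), 2))
```

The same function can be used to compare the likely payoff of recruiting more children against that of switching to a more reliable instrument.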
Statistical significance is a necessary but not sufficient condition for evaluating the effects of a toxicant. If a toxicant effect does not achieve statistical significance in well-designed studies with adequate statistical power, and especially if it remains nonsignificant across multiple studies, then the interpretation would be that the toxicant does not have a meaningful effect on that outcome measure. On the other hand, it is possible to achieve statistical significance with effects that are too small to be clinically meaningful or to have policy implications (e.g., attaining statistical significance with small effects if they are measured with great accuracy or with large samples). A readily understood example of this is as follows: Measured with enough precision, most people have one foot that is longer than the other (leading to a correct rejection of the null hypothesis of equal foot length); but the difference is rarely large enough to justify buying a different sized shoe for each foot (requiring a change in shoe purchasing policy). Conversely, there are examples where even a small effect should result in a response (e.g., reducing heart attack risk with preventive treatment with low-dose aspirin).
The precision of a measure suggests a natural benchmark for comparison of observed effects. “Accuracy” typically is reported as the standard error of the measure, or the precision with which observed scores estimate the true score. If IQ tests are typically accurate to +/-3 points, and change scores on IQ tests are only accurate to +/-4.5 points at the individual level, then effect sizes that are smaller than 3-4 points are not impressive considering the precision of the tool. Although the standard errors are rarely reported in articles (they are more common in technical manuals), they can be estimated based on the standard deviation and the reliability of the instrument (see Table 2 for information on measures' standard errors). When the same test is given more than once, the precision of the difference between the two scores is lowered by the imprecision in both the first and second testing. This “standard error of the difference” is about 41% larger than the standard error of the measure (that is, larger by a factor of the square root of two).
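The estimation of standard errors from the standard deviation and reliability, as described above, can be sketched with the classical test theory formulas. The standard deviation of 15 is the conventional IQ metric; the reliability of 0.96 is a hypothetical value chosen only because it yields a standard error of about 3 points, and the difference formula assumes equal precision at both testings.

```python
def standard_error_of_measure(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * (1 - reliability) ** 0.5


def standard_error_of_difference(sem):
    """Error of a difference between two scores that share the same SEM."""
    return sem * 2 ** 0.5


sem = standard_error_of_measure(sd=15, reliability=0.96)
se_diff = standard_error_of_difference(sem)
print(round(sem, 2), round(se_diff, 2))  # 3.0 4.24
```

A lower reliability widens both values; for example, a reliability of 0.90 on the same metric gives a standard error of the measure of roughly 4.7 points.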
Another way to assess clinical significance is by comparison with benchmarks established by normative data for the measure. The most important benchmark is located two standard deviations away from the average score for the standardization (or in the case of toxicants, the unexposed) sample. This definition establishes a meaningful and consistent threshold that could be applied with any test that has normative data. However, this would capture only the most extreme or frank effects. This method sets a much higher threshold compared to using the standard error of the measure and the difference between groups.
The use of a consistent definition of clinical significance would be valuable, as many test manuals and interpretive systems advocate for the use of more idiosyncratic thresholds (e.g., [37]), and investigators also adopt different definitions across studies.
Recommendations: Statistical significance testing provides a first filter to separate effects that will probably be reproducible from those that are so small that any observed effects could be attributed to sampling variation rather than exposure to a toxicant. The chance of detecting an effect when it is present in the population (statistical power) can be enhanced in several ways. However, many of the conventional methods for improving power are problematic for epidemiological studies of toxicants. Methods that could be used to further enhance power include using factor analysis or covariance structure modeling to better assess underlying domains and remove the effects of measurement error, or inclusion of covariates chosen because they can control for variance in the outcome measures that is not dependent on exposure to the toxicant.
Statistical significance in and of itself is not necessarily equated with clinical or policy significance. Interpretation of findings from studies of toxicants would benefit from adopting some of the reporting techniques developed in the clinical significance or evidence-based medicine literatures. However, not all of the concepts and techniques will be conceptually relevant, and some will often not be feasible given the practical constraints of doing large-scale studies of exposure to toxicants in humans.

Developmental Effects on Neurocognitive Functioning and Consequent Changes in Assessment Stability and Validity
Developmental brain changes can influence the domain of functioning tested by different instruments, and development also affects the stability and predictive validity of test scores. Brain functioning is less differentiated, and expression is less specific at an early age. As speech, abstract abilities, and meta-cognitive processes develop, different brain regions and processes are recruited in the performance of tasks. These developmental changes imply some instability of outcome with increasing age, especially at younger ages. Thus instability of outcome does not automatically imply that the measurement at early age has been invalid, especially with regard to evaluating contemporary functioning. At the same time, the lower predictive validity associated with measures administered at young ages could be attributable to resilience or to difficulty assessing the construct at a younger age (e.g., it may be impossible to evaluate impaired reading ability in a preverbal child).

Conclusions
We reviewed the measures used to assess neurodevelopmental effects of toxicants, concentrating on those measures previously used in the PCBs epidemiology literature. We found that:
• there are a large number of measures that have been used, including both global and more narrowly-focused measures;
• there have been continued revisions and changes to many of the core measures, which necessitate changes in the selection of tests for new research protocols;
• entirely new measures are available that warrant consideration for inclusion in new studies of toxicants due to their superior psychometric properties;
• entirely new domains should be explored in new studies of toxicants due to their importance in real world functioning and/or the possibility that they would be sensitive to toxicants' effects (e.g., adaptive functioning, executive functioning, articulation);
• the most commonly documented psychometric properties for measures (such as internal consistency reliability estimates or concurrent validity correlations) are only indirectly relevant to the main objectives of epidemiological studies of toxicants;
• the most relevant psychometric features for measures used in toxicant studies (such as retest stability or sensitivity to exposure effects) have been reported only rarely;
• the selection of covariates in environmental studies has been largely focused on demographics and confounders, whereas the inclusion of other covariates (e.g., IQ) that are highly correlated with the dependent variable (e.g., language) would further improve estimation of the effects of toxicants;
• the field of environmental epidemiology may be nearing a stage where a formal set of reporting guidelines could be developed to help the design of future studies, as has been done with clinical trials, studies of diagnostic assessment tools, and medical epidemiological studies;
• in terms of domains, it is clear that there have been changes over time and across studies in how assessment measures are categorized. A consistent rubric should be developed and adopted, even though it would necessarily be imperfect, provisional, and subject to periodic revision;
• predictive validity and exposure sensitivity are two highly relevant but rarely reported parameters.
We recommend increased emphasis on reporting the relevant parameters, both in technical manuals and in research reports, to facilitate improving measure selection. We also recommend a multi-tiered approach to measure selection, where measures that have demonstrated exposure sensitivity may be supplemented by a second tier of other measures chosen on a theoretical basis, and perhaps a third tier of exploratory measures if resources permit. Our comprehensive list of measures can be used by researchers to build upon past studies by including more sensitive measures or new areas of interest.