Towards the Validation of Executive Functioning Assessments: A Clinical Study

Neuropsychological assessment needs a more profound grounding in psychometric theory. Specifically, psychometrically reliable and valid tools are required, both in patient care and in scientific research. The present study examined the convergent and discriminant validity of some of the most popular indicators of executive functioning (EF). A sample of 96 neurological inpatients (aged 18–68 years) completed a battery of standardized cognitive tests (Raven’s matrices, vocabulary test, Wisconsin Card Sorting Test, verbal fluency test, figural fluency test). Convergent validity was calculated for indicators of intelligence (Raven’s matrices, vocabulary test) and for indicators of EF (Wisconsin Card Sorting Test, verbal fluency test, figural fluency test). Discriminant validity of indicators of EF against indicators of intelligence was also calculated. Convergent validity of the indicators of intelligence (Raven’s matrices, vocabulary test) was good (rxtyt = 0.727; R2 = 0.53). Convergent validity of fluency indicators of EF against executive cognition, as indicated by performance on the Wisconsin Card Sorting Test, was poor (0.087 ≤ rxtyt ≤ 0.304; 0.008 ≤ R2 ≤ 0.092). Discriminant validity of indicators of EF against indicators of intelligence was good (0.106 ≤ rxtyt ≤ 0.548; 0.011 ≤ R2 ≤ 0.300). Our conclusions from these data are clear-cut: apparently dissimilar indicators of intelligence converge on general intellectual ability, whereas apparently dissimilar indicators of EF (mental fluency, executive cognition) do not converge on a general executive ability. Executive abilities, although non-unitary, can be reasonably well distinguished from intellectual ability. The present data contribute to the hitherto meager evidence base regarding the validity of popular indicators of EF.


Introduction
Executive functioning (EF) is a construct of fundamental importance for cognitive neuropsychology, although no settled definition of which cognitive abilities the term EF denotes is yet available. There seems to be a consensus that EF encompasses 'higher' cognitive functions, usually defined as a set of domain-general cognitive control mechanisms supporting goal-directed behavior (e.g., [1]), but their exact nature remains a matter of debate [2-5].
Many cognitive neuropsychologists share the widely held, yet barely evidence-based, belief that EF represents a cognitive construct that is separable from general intellectual abilities, in particular from intelligence (e.g., [1]). David Wechsler once defined intelligence as "the global capacity of the individual to act purposefully, to think rationally and to deal effectively with his environment" [6] (p. 3). From Wechsler's definition, it becomes evident that intelligence and EF may share substantial conceptual overlap.
Psychological science in the 20th century witnessed a controversy about the most reasonable theoretical model of intelligence [7]. Spearman initially identified a single general intellectual ability, for which he coined the term g (for "general factor"; [8,9], but see [10]). Meanwhile, a consensus regarding the dimensionality of intelligence has only been achieved insofar as most researchers agree with the assumption that the cognitive abilities underlying intelligence are organized in a hierarchical structure, with g at its highest level. In a revision of Spearman's concept of g, Cattell [11] distinguished two types of cognitive abilities that are relevant for general intelligence. Cattell hypothesized fluid intelligence (gf) as the ability to solve novel problems by using reasoning, while he hypothesized crystallized intelligence (gc) as a knowledge-based ability that is heavily dependent on education. Horn [12] identified a number of additional broad cognitive abilities in a revision of the gf-gc theory, and Carroll [13] proposed a hierarchical model of intelligence with three levels, which is now known as the CHC (Cattell-Horn-Carroll) model [14]. The bottom level of the CHC model consists of highly specialized, task-specific cognitive abilities. The middle level consists of Horn's broad cognitive abilities, including, but not limited to, gf and gc. Carroll accepted Spearman's concept of g as representing the highest level of intellectual abilities, but as affecting performance on any particular test solely via its influence on the identified broad cognitive abilities such as gf and gc [14].
The present study focuses on validating putative indicators of EF. Cronbach once characterized the problem of validity in the following words: "To defend the proposition that a test measures a certain variable defined by a theory, one looks basically for two things. The first is convergence of indicators. There need to be two or more different kinds of data that are regarded as suitable evidence that a person is high or low on the variable. If these indicators agree, despite their surface dissimilarity, we place greater faith in the proposed theoretical interpretation. [...] The second kind of evidence is divergence of indicators that are supposed to represent different constructs. If a test is said to measure "ability to reason with numbers," it should not rank pupils in the order a test of sheer computation gives, because the computation test cannot reasonably be interpreted as a reasoning test. The test interpretation should also be challenged if the correlation with a test of verbal reasoning is very high, because this would suggest that general reasoning ability accounts for the ranking, so that specialized ability to reason with numbers is an unnecessary concept." ([15], p. 144; italics in the original text).
Cronbach's approach to validity requires two or more different theoretical constructs (needed for discriminant validation), with two or more different kinds of data (indicators) per construct (needed for convergent validation). The design of the present study therefore included two constructs (i.e., intelligence and EF), each of which was represented by two or more indicators. Measures of gf and gc served as indicators of intelligence, and measures of executive cognition (also known as cognitive flexibility) and of verbal and figural fluency served as indicators of EF. The study aimed at evaluating the convergent validity of the named indicators of EF, and it also aimed at evaluating the discriminant validity of the indicators of EF against the indicators of gf and gc.
Some intelligence tests rather directly target the assessment of gf and gc. The National Adult Reading Test [16] (NART) is often used to assess gc in clinical neuropsychology, under the assumption that this education-dependent facet of intelligence is relatively insensitive to brain disease and can thus serve as a reasonable indicator of premorbid crystallized intelligence [1]. Raven's Progressive Matrices [17] (RPM) are often considered a quintessential indicator of gf (e.g., [9]). Because the NART is not suitable for German-speaking patients, we considered a German analogue of the NART as an indicator of gc, and a recently standardized variant of the RPM as an indicator of gf. The Wisconsin card sorting task [18,19] currently provides one of the most popular assessment techniques for EF [20]. Its purpose is to evaluate the ability to form abstract concepts and to maintain or shift the mental set in response to verifying or falsifying feedback, respectively. Multiple standardized variants of the Wisconsin card sorting task are now in use in clinical neuropsychology; we prefer the Modified Wisconsin Card Sorting Test (M-WCST; [21]) for reasons that have been outlined elsewhere [22]. The M-WCST provides three standardized scores, i.e., 'number of categories correct', 'number of perseveration errors', and their linear combination, which is referred to as the 'executive functioning composite'. These M-WCST scores are thought to provide indicators of essential aspects of executive cognition/cognitive flexibility, namely the ability to abstract (categories) and to remain flexible in response to falsifying feedback (perseveration errors; [20,23]).
Verbal fluency tasks evaluate the spontaneous oral production of words; they have a long history of use in psychology, dating from the work of Thurstone [24]. The most common versions of verbal fluency tasks are lexical (a.k.a. letter or phonemic) fluency (here, the task is to produce as many words as possible with a specified initial letter) and semantic (a.k.a. category) fluency (here, the task is to produce as many words as possible from a specified semantic category). Moderately high correlations between intellectual abilities and verbal fluency have been reported in the literature (for a review, see [20]).
Design (a.k.a. figural) fluency tasks measure the spontaneous graphical production of novel designs. Design fluency tasks [25] were developed as non-verbal analogues to verbal fluency tasks. Five-point tasks [26] present five dots arranged as on a die and request the production of as many unique figures as possible by connecting neighboring dots. Ruff developed a standardized variant of the five-point task, the Ruff Figural Fluency Test (RFFT; [27,28]). As is the case with verbal fluency, higher intelligence is known to be associated, to some degree, with better figural productivity on the RFFT (for a review, see [20]).
The relationships between intelligence and EF remain under debate in the neuropsychological literature. Some colleagues have emphasized the discriminability of intelligence and EF (e.g., [1,29]), while other authors have claimed that the available indicators of intelligence and of EF merely provide convergent measures of g (e.g., [9,30]). More detailed discussions about putative relationships between intelligence and EF can be found in [9,31-33].
Attempts towards evidence-based validation are of crucial importance for the further advancement of cognitive neuropsychology [34]. The design of the present study allowed us to examine multiple validity-related research questions that are relevant for the neuropsychological EF construct. Convergent validity could be examined because each of the two relevant constructs (intelligence, EF) was assessed by multiple indicators (indicators of intelligence included proxies for gc and gf; indicators of EF included executive cognition/cognitive flexibility, verbal fluency, and figural fluency). Discriminant validity could likewise be examined. Discriminant validation of EF against intelligence is essential because EF, unlike the formerly popular construct of 'frontal lobe functions' (which is no longer conventional in clinical neuropsychology), is a purely cognitive construct, without any reference to potential neuroanatomical substrates. Failures of discriminant validation of EF against intelligence would suggest that EF might be an untenable neuropsychological construct and that g might actually account for inter-individual differences in indicators of EF. Despite the far-reaching implications that validity studies might have, discriminant validation of EF against intelligence has been a relatively neglected topic in the literature [35]. Most previous validation studies had their methodological grounding in factor-analytic methods (e.g., [31,36-39]) or regression-based methods (e.g., [40]).
Here we deliberately chose an easily applicable, correlative methodology, in order to encourage clinical neuropsychologists to contribute to the evidence-based validation of the EF construct through the proliferation of future studies.
The present study serves to add to the hitherto rather meager evidence base regarding the validation of EF. It is based on analyses of correlative data from a clinical sample of neurological inpatients (see also [29,30]). The sample mainly consisted of patients who were referred to our university-based neurological department for diagnosis. Such clinical samples offer the advantage that the full spectrum of cognitive abilities comes under scrutiny, especially when non-selected, consecutive samples of patients are studied. Such samples display huge heterogeneity of individual cognitive abilities across the full ability spectrum because they include patients with severe and with less severe diseases of the central nervous system, patients who suffer from a peripheral nervous system disease only, and patients who suffer from a non-neurological disease or no disease at all.

Participants
We analyzed data that were obtained from 96 consecutively admitted neurological inpatients. A sample size of n = 96 is sufficient to detect a correlation coefficient of r = 0.2825 as different from zero (α = 0.05, two-sided; β = 0.20; see https://sample-size.net/correlation-sample-size/, accessed on 25 July 2022). The patient sample mainly consisted of patients who were referred to our university-based neurological department for potential neurological diagnosis during the period January to July 2022. They were referred for neuropsychological assessment by the collaborating neurologists (GMG, MK, SP, PS, KWS). Participants had to be between 18 and 69 years old. The choice of this age range allowed the transfer of raw scores to standard scores on all cognitive tests that were conducted. Only patients with German as their native language were included as participants. Exclusion criteria were severe visual or motor dysfunction, dementia, and inadequate vigilance, because these symptoms precluded in-depth cognitive testing.
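The cited sample size calculation can be approximated with the standard Fisher z method for correlation tests; the following is an illustrative Python sketch (the online calculator may round slightly differently):

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r != 0
    (two-sided test), via the Fisher z transformation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = .80
    c = 0.5 * math.log((1 + r) / (1 - r))          # Fisher z of r
    return ((z_alpha + z_beta) / c) ** 2 + 3

# Parameters of the present study: r = 0.2825, alpha = .05 (two-sided), beta = .20
print(round(n_for_correlation(0.2825)))  # → 96
```

Note that the achievable effect size shrinks slowly with n, which is why detecting small correlations requires disproportionately large samples.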
Our sample consisted of 35 male and 60 female patients (plus one patient who preferred not to say). Table 1 summarizes the sociodemographic characteristics of the sample, divided into subsamples of 57 patients who were diagnosed with various brain diseases and 39 patients without brain disease. Brain diseases included vascular diseases, autoimmune/inflammatory diseases, and neurodegenerative diseases. Peripheral nervous system diseases (e.g., polyneuropathy, myopathy) and non-neurological (e.g., functional) diseases were subsumed under the second subsample. Age and education are presented in years. a One patient preferred not to say.

Materials and Design
Cognitive testing lasted about 1.5 h per participant. The study design comprised two intelligence tests (a proxy for fluid intelligence, gf, and a proxy for crystallized intelligence, gc) and three tests of EF (executive cognition/cognitive flexibility, verbal fluency, figural fluency), as well as a self-report questionnaire regarding non-somatic depressive symptoms. Patients did not receive any additional reward.

Intelligence
Fluid Intelligence (gf): Raven's Matrices
Raven's matrices are commonly regarded as a suitable measure of fluid intelligence (e.g., [9]). We utilized the recently published German version of Raven's Progressive Matrices 2 Clinical Edition [41] in the paper-and-pencil format; we refer to this test simply as Raven 2 throughout the article. The Raven 2 consists of five sets of items (Sets A-E, containing twelve items each, ordered by complexity). Sets A-C are typically administered to children (aged 4-8 years), but they can also be administered to patients exhibiting intellectual/mental disabilities. We decided to administer Sets A-C to all participants because our sample comprised individuals with neurological diagnoses. Sets B-E are typically used in the age range 9-69 years. Following completion of Sets A-C, we decided whether or not a given patient would continue with the remaining Sets D-E. This decision was based upon the patient's A-C score: patients completed Sets D-E only if they achieved at least 18/36 correct items on Sets A-C. Similar procedures were applied in older versions of Raven's matrices, namely the Standard Progressive Matrices [42] and the Colored Progressive Matrices [43]. Time limits were 30 min for Sets A-C and 45 min for Sets B-E, respectively; to monitor the 45 min limit for Sets B-E, a second timer was always started at the beginning of Set B. Thus, there were two Raven 2 scores, i.e., the number of correct items on Sets A-C (obtained from all participants) and the number of correct items on Sets B-E (obtained only from those participants who had at least 18 correct items on Sets A-C). The examiner, rather than the examinee, marked the answers on the answer sheet in an attempt to exclude any influence of potential psychomotor slowing or visual-constructive disabilities.
The Raven 2 [41] standardization sample comprised n = 1200 healthy individuals across six European countries (n = 200 per country). The age range was from 4 to 69 years.
Crystallized Intelligence (gc): Vocabulary
Verbal knowledge is commonly regarded as a suitable measure of crystallized intelligence (e.g., [44]). We utilized the Wortschatztest (acronym: WST; literal translation: 'vocabulary test'; [45]). The WST is a German-language vocabulary test that measures word recognition across 42 rows, each composed of one valid German word embedded among five nonwords. The rows are ordered by difficulty (i.e., higher row numbers contain less frequently used words), and subjects have to mark the recognized word. Guessing was discouraged. There was no time limit for task completion. The WST score simply reflects the number of correctly identified words.
The WST standardization sample comprised n = 573 healthy participants (M = 40 years of age). The age range was from 16 to 90 years.

Executive Functioning
Executive Cognition/Cognitive Flexibility: Wisconsin Card Sorting Task
The Modified Wisconsin Card Sorting Test (acronym: M-WCST; [21]) is a commercially available version of the Wisconsin card sorting task. The M-WCST consists of four stimulus cards, which are placed in front of the subject. They depict a red triangle, two green stars, three yellow crosses, and four blue circles, respectively. The subject receives 48 response cards (=48 trials), which can be categorized according to their color, shape, or number. The subject is asked to match each of the 48 response cards to one of the four stimulus cards. After each trial, verbal feedback is given by the examiner ('correct' or 'incorrect'). After six consecutive correct card sorts, the task rules change. The exact test administration followed the arrangements in [22]. To analyze M-WCST performance, we utilized three scores, i.e., number of correct categories (i.e., six consecutive correct rule matches), number of perseveration errors (i.e., rule repetitions following negative feedback), and their linear combination, referred to as 'executive function composite' in the manual.
The M-WCST [21] standardization sample comprised n = 323 healthy participants (M = 54.69 years of age) recruited through random sampling. The age range was from 18 to >85 years.

Verbal Fluency
Verbal fluency was assessed with the Regensburger Wortflüssigkeits-Test (acronym: RWT; literal translation: 'word fluency test'; [46]). Four two-minute sub-tests of the RWT were selected, namely lexical fluency (producing words with the initial letter S), lexical switching (producing words with the initial letters G and R in alternating order), semantic fluency (producing words from the semantic category animals), and semantic switching (producing words from the semantic categories sports and fruits in alternating order). RWT scores simply reflect the number of valid words produced on each of these four sub-tests.

Figural Fluency
Figural fluency was assessed with the German version of the Ruff Figural Fluency Test (acronym: RFFT; [47]). The RFFT consists of five parts, each consisting of a page with 35 five-dot patterns. Subjects have to produce as many unique designs as possible by connecting two or more of the five dots in unique ways. The time limit for each of the five parts was one minute, and subjects were instructed to avoid design repetitions. If a subject repeated a design on one of the pages, this was considered a perseverative error. The score was the sum of unique designs produced across all five parts. An error ratio was also calculated by dividing the number of perseverative errors by the number of unique designs.
The RFFT standardization sample comprised n = 358 healthy participants. The age range was from 16 to 70 years.

Depressive Mood
The German version of the Beck Depression Inventory-Fast Screen was used (acronym: BDI-FS; [48]). The BDI-FS is a self-report questionnaire that consists of seven items, which stem from the full 21-item Beck Depression Inventory [49]. The BDI-FS items are intended to measure non-somatic depressive symptoms such as feelings of sadness, pessimism, past failure, loss of pleasure, self-dislike, self-criticism, and suicidal thoughts. The score was simply the sum of the single item scores (ranging from 0 to 21).
The BDI-FS manual [48] reports no standardization sample, but it summarizes multiple studies conducted to estimate the reliability and validity of this self-report instrument.

Reliability Estimates
The available reliability estimates of all test scores considered in the present study are presented in Table 2. Consistency reliability estimates (split-half reliability (rSB) or Cronbach's α) were preferred when available, and test-retest reliability estimates (rtt) were utilized otherwise. As a note of caution, the reader is reminded that reliability generalization should not be taken for granted, especially when reliability estimates are transferred from studies of healthy participants to clinical samples [22]. Therefore, the available reliability estimates can only be considered approximations, and uncertainty regarding the actual reliability of the considered scores must be acknowledged. Table 2. Reliability estimates of all test scores. When achievable, estimates are preferentially in terms of consistency reliability (rSB or Cronbach's α); when consistency reliability was not achievable, estimates are in terms of test-retest reliability (rtt). RFFT = Ruff Figural Fluency Test (German version) [47]; UD = unique designs; ER = error ratio; BDI-FS = Beck Depression Inventory-Fast Screen (German version) [48]. Reliability estimates are displayed as Spearman-Brown split-half reliability coefficients (rSB), unless indicated otherwise. a Cronbach's α. b Test-retest reliability (rtt).

Intelligence Reliability Raven 2 Reliability
The Raven 2 manual [41] provides estimates of rSB for both sets, i.e., A-C and B-E, with 0.80 < rSB < 0.90, which would typically be considered reasonably good internal consistencies. However, it should be kept in mind that an influential psychometric textbook recommended that "a reliability of 0.90 is the bare minimum, and a reliability of 0.95 should be considered the desirable standard" [50]. Viewed from the psychometric perspective, Set A-C should be preferred over Set B-E.

RWT Reliability
We identified one study [51] that reported consistency reliability (Cronbach's α = 0.90) for RWT lexical fluency, but not for the remaining RWT fluency scores. For those remaining scores, we had to rely on the RWT manual [46], which provides test-retest reliability estimates (rtt) from a sample of n = 90 university students (M = 22 years of age) across a relatively short test-retest interval (three weeks). Test-retest reliability estimates ranged between 0.72 ≤ rtt ≤ 0.85, akin to results obtained from related reliability studies in English-speaking countries [52,53]. Taken together, the two RWT fluency scores (lexical, semantic) achieved somewhat higher reliability estimates (0.85 ≤ rel ≤ 0.90) than the two RWT switching scores (lexical, semantic; 0.72 ≤ rtt ≤ 0.77).

RFFT Reliability
The RFFT manual [47] does not present estimates of consistency reliability. Ruff et al. [28] utilized a relatively small sub-sample (n = 95) of the original standardization sample and provided test-retest reliability estimates (rtt) across a relatively long test-retest interval (six months). Fernandez et al. [54] provided an estimate of consistency reliability for the unique designs score of the original Five-Point Test [26], based on a non-clinical sample of healthy participants (n = 209), with rSB = 0.80. Here, we assume that the consistency reliability reported for Five-Point Test UD scores may be generalizable to RFFT UD scores. The consistency reliability of the RFFT ER score remains obscure. One study [55] examined its test-retest reliability in a sample of n = 90 college undergraduates, with a somewhat disappointing result, i.e., rtt = 0.64.

BDI-FS Reliability
Poole et al. [56] examined the consistency reliability (Cronbach's α) of the BDI-FS in a sample of n = 1227 chronic pain patients.

The Importance of Reliability
At this point, the distinction between measured (observed) scores and their respective true scores in terms of classical psychometric test theory [50] comes into play. Basically, the almost universally less-than-perfect measurement reliability of psychological measures attenuates correlations between observed scores relative to correlations between true scores [57,58]. As an example, consider a correlation between two observed measures (e.g., rxy = 0.50) and (consistency) reliabilities of rxx = 0.80 and ryy = 0.70. Through application of Spearman's disattenuation equation (Equation (1)), we obtain rxtyt = 0.67, i.e., a considerably higher correlation between the true score variables xt and yt than between the observed score variables x and y, simply through consideration of their less-than-perfect measurement reliability.

rxtyt = rxy / √(rxx × ryy)    (1)
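Equation (1) is straightforward to apply in practice; the following Python sketch (illustrative only) reproduces the worked example above:

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction for attenuation (Equation (1)): estimate the
    true score correlation from the observed correlation and the two
    reliability estimates."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Worked example from the text: r_xy = 0.50, r_xx = 0.80, r_yy = 0.70
print(round(disattenuate(0.50, 0.80, 0.70), 2))  # → 0.67
```

With perfect reliabilities (rxx = ryy = 1), the correction leaves the observed correlation unchanged, which matches the intuition that attenuation stems entirely from measurement error.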

Rationale for Preferring Standard Scores over Raw Scores
Throughout this article, we considered standard scores rather than raw scores. Consideration of standard scores (such as z scores; Gaussian distribution with central tendency M = 0 and variability SD = 1) has some virtues compared to raw scores. Most importantly, a standard score reveals individual abilities after accounting for sociodemographic variables such as age, education, and sex, which are known to exert strong effects on cognitive abilities. As an example, imagine a raw score of 18 out of 36 on Raven 2 Sets A-C. This raw score would probably indicate low individual ability if obtained from a young, well-educated female, whereas the same raw score would probably indicate higher individual ability if obtained from an old, poorly educated male. Thus, standard scores offer age-, education-, and sex-adjusted estimates of individual abilities, through adequate socio-economic stratification of sufficiently large standardization samples. This advantage of standard scores, however, does not come without potential caveats, especially when the standardization of the considered assessment instruments follows very heterogeneous procedures (in terms of sample size, socio-economic stratification, etc.).
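The logic of norm-referenced standard scores can be illustrated with a short sketch (Python; the normative means and SDs below are invented for illustration and are not taken from any test manual):

```python
def z_score(raw, norm_mean, norm_sd):
    """Convert a raw score into a z score relative to a normative subgroup."""
    return (raw - norm_mean) / norm_sd

# Hypothetical norms: the same raw score of 18/36 maps onto very different
# z scores depending on the normative subgroup it is compared against.
print(z_score(18, norm_mean=27.0, norm_sd=4.5))  # young, well-educated → -2.0
print(z_score(18, norm_mean=13.5, norm_sd=4.5))  # old, poorly educated → +1.0
```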

Resolution Limitations of Standard Scores
Another typical problem of standard scores is their limited resolution. As an example, a test manual may reveal that a particular raw score falls below the first percentile, without exact resolution (i.e., the observed standard score equals 'less than' one percentile). We applied the following interpolation rules (Equation (2)) in such cases:

'Less than' interpolation rule (<percentile): interpolated percentile = (0 + 'less than' percentile) / 2    (2)
'More than' interpolation rule (>percentile): interpolated percentile = (100 + 'more than' percentile) / 2

The 'less than' interpolation described by Equation (2) had to be applied 51 times, as follows: 47 times for 'percentile < 1' (interpolated percentile = 0.5), two times for 'percentile < 1.1' (interpolated percentile = 0.55), and two times for 'percentile < 2' (interpolated percentile = 1). The 'more than' interpolation described by Equation (2) [...]

Table 3 shows analyses of the simplest and perhaps most commonly calculated standard score, i.e., the z score, obtained from n = 96 neurological inpatients. The median standard (z) scores (Mdn(z)) from all performance assessments were negative or very close to zero (−1.16 < Mdn(z) < +0.03), yet all their interquartile ranges included z = 0, with the exception of the RWT Lex Switch assessment. Minima and maxima indicated that we were considering broadly distributed individual standard scores, falling between −2.58 (lowest minimum) and +3.09 (highest maximum) at the extremes. Conventional t-tests of statistical significance (one-sided, µ(z) < 0) yielded rejection of the null hypothesis for nearly all variables, indicating that the study patients performed below the average of the standardization samples. Exceptions were the WST (the indicator of gc) and the RFFT ER (error ratio on the figural fluency assessment), for which average performance in the standardization samples predicted the patients' performance reasonably well.
RFFT = Ruff Figural Fluency Test (German version) [47]; UD = unique designs; ER = error ratio; BDI-FS = Beck Depression Inventory-Fast Screen (German version) [48]. Student's t-tests; p-values refer to statistical significance of H1 (one-sided): µ < 0. a n = 89. b df = 88. * α < 0.05/13 = 0.0038 (significance level after Bonferroni correction across 13 tests, based on an initial significance level of α < 0.05).
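The two interpolation rules in Equation (2) amount to taking the midpoint between the truncation threshold and the nearer endpoint of the percentile scale; a minimal Python sketch:

```python
def interpolate_percentile(threshold, direction):
    """Midpoint interpolation for truncated percentile entries (Equation (2)).
    'less': halfway between 0 and the threshold;
    'more': halfway between the threshold and 100."""
    if direction == "less":
        return (0 + threshold) / 2
    if direction == "more":
        return (100 + threshold) / 2
    raise ValueError("direction must be 'less' or 'more'")

print(interpolate_percentile(1, "less"))    # 'percentile < 1'   → 0.5
print(interpolate_percentile(1.1, "less"))  # 'percentile < 1.1' → 0.55
print(interpolate_percentile(2, "less"))    # 'percentile < 2'   → 1.0
```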

Neuropsychological Findings
With regard to self-reported depression on the BDI-FS, the median, Mdn(z) = +1.16, fell more than one standard deviation above the population average, the interquartile range did not include z = 0, and the distribution of individual standard scores was generally shifted toward positive values, falling between −0.15 (minimum) and +3.09 (maximum) at the extremes. A conventional t-test of statistical significance (one-sided, µ(z) > 0) yielded rejection of the null hypothesis, indicating that the study patients reported more depressive symptoms than expected based on the average mood in the standardization sample.

Table 4 shows all Spearman rank correlations between standard scores based on observed scores (i.e., rxy; below diagonal) as well as based on true scores (i.e., rxtyt; above diagonal). Details about the calculation of the observed standard scores (z) can be found in the Materials and Design section. True score correlations arise from their respective observed score correlations through application of Spearman's disattenuation correction for imperfect reliabilities, as described there in detail (Equation (1)). By construction, true score correlations are higher than their corresponding observed score correlations, due to the applied disattenuation for imperfect reliabilities. Table 4. Spearman rank correlations between z scores based on observed scores (i.e., rxy; below diagonal) and based on true scores (i.e., rxtyt; above diagonal). BDI-FS = Beck Depression Inventory-Fast Screen (German version) [48]. Italicized areas of the matrix refer to measures of convergent validity (black area of the matrix: measures of intelligence; grey area of the matrix: measures of executive functioning). Non-italicized areas of the matrix refer to measures of discriminant validity (black area of the matrix: measures of intelligence against measures of executive functioning; grey area of the matrix: measures of cognition against measures of depression). a n = 89. b Value truncated in response to rounding errors.

Intelligence (Italicized Black Area of the Matrix)
The observed score correlation between Raven 2 A-C and WST amounts to rS = 0.653, and the corresponding true score correlation amounts to rS = 0.727 after correction for imperfect reliabilities of the Raven 2 A-C and WST scores (see Table 4). The magnitude of these correlations provides an estimate of the convergent validity of these two measures of intelligence, with Raven 2 A-C serving as a proxy for fluid intelligence (gf) and WST serving as a proxy for crystallized intelligence (gc). Their relatively strong convergence supports the idea, as predicted by the CHC model of intelligence (see Introduction for details), that broad cognitive abilities, including gf and gc, jointly contribute to g.

Executive Functioning (Italicized Grey Area of the Matrix)
Correlations between scores obtained from the EF assessments (M-WCST, RWT, and RFFT) were also calculated. Observed score (true score in brackets) correlations between the EF scores ranged from rS = −0.008 (rS = −0.011) for RFFT UD/RFFT ER to rS = 0.917 (rS = 0.981) for M-WCST EFc/M-WCST Persev. Inspection of these coefficients suggests considerable heterogeneity in the magnitude of the associations between EF scores. Correlations between the two independent M-WCST scores (Categ and Persev; rS = 0.700, observed score; rS = 0.753, true score) as well as those between the four RWT scores (Lex Fluency, Lex Switch, Sem Fluency, and Sem Switch; rS ≥ 0.522, observed score; rS ≥ 0.647, true score) were substantial. Correlations between scores obtained from different tests (e.g., correlations between M-WCST and RWT scores) were weak or moderate at best (rS ≤ 0.318, observed score; rS ≤ 0.419, true score). We examine this topic in more detail below.

Discriminant Validity of Cognition against Depression (Non-Italicized Grey Area of the Matrix)
Observed score (true score in brackets) correlations between all performance scores (intelligence; EF) and self-reported depression (BDI-FS) ranged from rS = −0.302 (rS = −0.388; RWT Sem Switch) to rS = 0.096 (rS = 0.108; M-WCST Categ). Tests of statistical significance of these coefficients indicate minor (Raven 2 A-C; RWT) or negligible (WST; M-WCST; RFFT) associations between cognitive performance and self-reported depressive mood.

Discriminant Validity of Executive Functioning against Intelligence (Non-Italicized Black Area of the Matrix)
Correlations between the EF assessments (M-WCST, RWT, and RFFT) and the measures of intelligence (Raven 2 A-C as a proxy for gf and WST as a proxy for gc) were the focus of the study. Observed score (true score in brackets) correlations between the EF scores and Raven 2 A-C (gf) ranged from rS = 0.237 (M-WCST Categ; rS = 0.265, M-WCST Categ) to rS = 0.451 (RFFT UD; the highest true score correlation was rS = 0.566, RWT Sem Switch). Observed score (true score in brackets) correlations between the EF scores and WST (gc) ranged from rS = 0.100 (M-WCST Categ; rS = 0.106, M-WCST Categ) to rS = 0.485 (RWT Lex Fluency; the highest true score correlation was rS = 0.548, RWT Sem Switch). Inspection of these coefficients suggests that moderate associations may exist between indicators of EF and intelligence. We examine this topic in more detail below.

Reprise: Convergent Validity of Executive Functioning
Statistical significance tests were conducted that tested the equality of the correlations between the (verbal, figural) fluency scores (obtained from RWT and RFFT) and the executive cognition scores (obtained from M-WCST) against the correlations among the executive cognition scores (obtained from M-WCST), i.e., against the convergent validity coefficients of the M-WCST scores. These analyses were conducted with the online tool provided by psychometrica.de [59] for testing whether correlation coefficients differ from arbitrarily chosen values. The test is approximate, and it proceeds via Fisher's Z-transformation, as described by Eid et al. [60].
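The logic of such a comparison can be sketched as follows; this is a simplified one-sample version of the Fisher Z-test (testing an observed correlation against a fixed benchmark value) under the usual normal approximation, not the exact implementation of the psychometrica.de tool:

```python
import math
from statistics import NormalDist

def fisher_z_test(r: float, rho0: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho == rho0, obtained by
    Fisher-Z-transforming both correlations and referring the
    standardized difference to the standard normal distribution."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# E.g., an observed correlation of 0.30 in a sample of n = 96 differs
# reliably from a benchmark convergent validity coefficient of 0.70.
p = fisher_z_test(0.30, 0.70, 96)
```

The Z-transformation stabilizes the sampling variance of r, so that the standardized difference is approximately normal for moderate n.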
As previously stated, the convergent validity of M-WCST Categ (abstraction) and M-WCST Persev (cognitive flexibility) amounted to rS = 0.700 (observed score; rS = 0.753, true score). These two coefficients served as benchmarks for evaluating the convergence of the various fluency scores against these M-WCST scores (Table 5). Inspection of Table 5 reveals that all correlations between the fluency scores and the M-WCST scores fell below their respective coefficients of convergent validity. In addition, all estimates of variance shared between the M-WCST scores and the fluency scores were very low (<7% for observed scores, <10% for true scores).

Statistical significance tests were conducted that tested the equality of the correlations between the EF scores (obtained from M-WCST, RWT, and RFFT) and the intelligence scores against the correlation between the intelligence scores (obtained from Raven 2 A-C and WST), i.e., against the convergent validity coefficient of the intelligence scores. These analyses were also conducted with the online tool provided by psychometrica.de [59] for testing whether correlation coefficients differ from arbitrarily chosen values [60].
As previously stated, the convergent validity of the Raven 2 A-C and the WST amounted to rS = 0.653 (observed score; rS = 0.727, true score). These two coefficients served as benchmarks for evaluating the divergence of the various EF scores from these intelligence scores (Table 6). Inspection of Table 6 reveals that all correlations between the EF scores and the intelligence scores fell below their respective coefficients of convergent validity, both for the observed scores and for the true scores.

Table 6 notes [47]: UD = unique designs; ER = error ratio. p-values represent the statistical significance of the comparison of two correlations with each other. Correlations were calculated as Spearman rank-order correlations (rxy) for observed scores, and after application of Spearman's correction for attenuation (rxtyt) for true scores. * α < α/n(Tests), i.e., α < 0.05/9 = 0.0056 (α equals the significance level after Bonferroni correction based on an initial significance level of α < 0.05).
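The Bonferroni adjustment referred to in the table notes simply divides the initial significance level by the number of tests; as a one-line sketch:

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

# Nine comparisons at an initial alpha of 0.05 yield a per-test
# threshold of about 0.0056, as reported in the table notes.
alpha_corrected = bonferroni_alpha(0.05, 9)
```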
In addition, all estimates of variance shared between the EF scores and the intelligence scores were low. For observed scores, M-WCST scores shared less than 12%, RWT scores shared less than 20%, and RFFT scores shared less than 21% variance with the Raven 2 A-C scores. M-WCST scores shared less than 6%, RWT scores shared less than 24%, and RFFT scores shared less than 9% variance with the WST scores. For true scores, M-WCST scores shared less than 16%, RWT scores shared less than 33%, and RFFT scores shared less than 30% variance with the Raven 2 A-C scores. M-WCST scores shared less than 7%, RWT scores shared less than 31%, and RFFT scores shared less than 12% variance with the WST scores.
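The shared-variance estimates quoted above are simply squared correlation coefficients; a minimal sketch:

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures (R^2), i.e., the
    squared correlation coefficient."""
    return r ** 2

# The true score correlation of 0.727 between Raven 2 A-C and WST
# corresponds to about 53% shared variance.
r2 = shared_variance(0.727)
```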

Discriminant Validity of Executive Functioning against Facets of Intelligence
A secondary research question was whether indicators of EF (here denoted as EFx) converged preferentially on the indicator of gf or on the indicator of gc. For example, the correlation between M-WCST EFc scores and Raven 2 A-C scores (a proxy for gf) may equal the correlation between M-WCST EFc scores and WST scores (a proxy for gc). Statistical significance tests were conducted that tested the equality of the correlations between the EF scores (obtained from M-WCST, RWT, and RFFT) and gf against the correlations between these EF scores and gc (Table 7). These analyses were conducted with the online tool provided by psychometrica.de [59]. Table 7 notes [47]: UD = unique designs; ER = error ratio.
p-values represent the statistical significance of the comparison of two correlations with each other. Correlations were calculated as Spearman rank-order correlations (rxy) for observed scores, and after application of Spearman's correction for attenuation (rxtyt) for true scores. * α < α/n(Tests), i.e., α < 0.05/2 = 0.025 (α equals the significance level after Bonferroni correction based on an initial significance level of α < 0.05). Table 7 reveals that the null hypothesis of equal correlations was rejected in one single comparison only, i.e., for RFFT-based figural fluency for true scores. The overall picture, however, is that differential correlations EFx-gf and EFx-gc were not discernible in the present study. Perhaps with the exception of productivity on figural fluency assessments, these data do not support the idea that gf might account better for variability in EFx than does gc.

Discriminant Validity of Facets of Executive Functioning against Intelligence
Another secondary research question was whether distinct indicators of EF (here denoted as EFx and EFy, respectively) converged preferentially on the indicator of gf (or likewise on the indicator of gc). For example, the correlation between M-WCST EFc scores and Raven 2 A-C scores (a proxy for gf) may equal the correlation between RWT Lex Fluency scores and Raven 2 A-C scores. In order to keep this analysis manageable, we selected (a priori) the following three EF scores for processing: M-WCST EFc (executive cognition) scores, RWT Lex Fluency (lexical fluency) scores, and RFFT UD (figural fluency) scores. Statistical significance tests were conducted that tested the equality of the correlations between an EFx score and g· (i.e., either gf or gc) against the correlations between an EFy score and g· (Table 8). These analyses were conducted with the online tool provided by psychometrica.de [59].

Table 8. Are correlations between executive functioning (EFx) and g· (i.e., either gf or gc) equal to correlations between EFy and g·? Correlations were calculated as Spearman rank-order correlations (rxy) for observed scores, and after application of Spearman's correction for attenuation (rxtyt) for true scores. * α < α/n(Tests), i.e., α < 0.05/9 = 0.0056 (α equals the significance level after Bonferroni correction based on an initial significance level of α < 0.05).

Table 8 reveals that the null hypothesis of equal correlations was rejected in two comparisons only, i.e., the comparisons between M-WCST-based executive cognition and RWT-based lexical fluency with regard to gc, for observed scores as well as for true scores, suggesting a stronger association of gc with lexical fluency than with executive cognition. All remaining comparisons between the correlations EFx-g· and EFy-g· proved non-significant.
These data support a somewhat higher correlation between lexical fluency and g c compared to the correlation between executive cognition and g c . However, differential correlations between the three examined indicators of EF and g f were not discernible in the present study.

Discussion
The present data suggest that various indicators of EF can be discriminated from indicators of intelligence. Specifically, the M-WCST scores can be accounted for neither by indicators of gf nor by those of gc, given that the estimates of shared variance in no case exceeded 12% for observed scores and 15% for true scores. The verbal fluency (RWT) scores can likewise not be fully accounted for by indicators of gf or gc, albeit the estimates of shared variance were around twice as high for fluency scores as for M-WCST scores (i.e., they amounted to around 25% for observed scores and 33% for true scores). Finally, neither figural fluency (RFFT) productivity nor error proneness seemed to be fully accounted for by indicators of gf or gc, although productivity scores showed somewhat higher estimates of shared variance with gf (around 20% for observed scores and 30% for true scores) than with gc (around 8% for observed scores and 11% for true scores).
The study results suggest that general intellectual abilities do not fully account for EF abilities. In particular, executive cognition/cognitive flexibility seems to represent a cognitive ability that is, by and large, independent of general intellectual abilities. Fluid and crystallized facets of intelligence are both moderately associated with verbal fluency, whereas solely fluid facets of intelligence are moderately associated with figural fluency. We conclude that EF, especially one of its core elements, i.e., executive cognition/cognitive flexibility, remains a psychometrically defensible neuropsychological construct.
We also observed good convergent validity of indicators of general intelligence (Raven's matrices, vocabulary test), but poor convergent validity of executive cognition/cognitive flexibility (as indicated by performance on the M-WCST) against fluency indicators of EF. These data support the assumption of general intellectual abilities, but they clearly do not support the assumption of general executive abilities.

General Intellectual Abilities
With regard to intelligence, the good convergent validity of Raven's matrices and a vocabulary test (rxtyt = 0.727; R2 = 0.53) is a remarkable result. Firstly, solving Raven's matrices, which mainly taps fluid facets of intelligence (gf), and retrieving verbal knowledge, which mainly taps crystallized facets of intelligence (gc), represent quite dissimilar cognitive demands. Secondly, Raven's matrices is a timed test, whereas the retrieval of verbal knowledge imposes no time constraints. The high correlation between these two apparently dissimilar indicators of intelligence suggests that general intellectual ability accounts for the concordant rankings on both indicators of intelligence. Spearman's construct of general intelligence, g (see [9] for discussion), may be a reasonable assumption, although this topic was not the focus of the present study.

General Executive Abilities
With regard to EF, the poor convergent validity of executive cognition/cognitive flexibility and multiple indicators of fluency (0.087 ≤ rxtyt ≤ 0.304; 0.008 ≤ R2 ≤ 0.092) is a surprising result. Conceptually, maintaining cognitive flexibility and fluent mental productivity may be considered as resulting from similar cognitive abilities. Specifically, the presence of generalized perseverative tendencies should affect the rankings on both indicators of EF in similarly detrimental ways. However, the present data did not support the assumption of generalized perseverative tendencies. We conclude that EF should be considered a non-unitary neuropsychological construct.
Future research should address the largely unknown factorial structure of EF, preferably through latent variable modeling. Such studies should examine the assumption of general (sometimes also referred to as 'central') executive abilities (e.g., [61][62][63]), which may be localized in a single neural system (presumably a focal locus in the prefrontal cortex, such as the dorsolateral prefrontal cortex; e.g., [64]). Alternatively, multiple independent executive abilities may exist, as suggested by alternative proposals [65][66][67].

Mental Fluency and Intelligence
Mental productivity concerning verbal fluency showed moderately strong correlations with fluid and crystallized facets of intelligence. Henry & Crawford [68] provided a meta-analysis of the sensitivity of verbal fluency to the presence of focal cortical lesions. Relative to healthy controls, participants with focal frontal injuries had large and comparable deficits in lexical and semantic fluency, which could not simply be accounted for by intellectual (dis-)abilities. The authors reported that lexical fluency was more strongly related to the presence of frontal lobe lesions than the WCST scores were, and that temporal lobe lesions were associated with a lesser deficit in lexical fluency but a larger deficit in semantic fluency. The dissociation between lexical-frontal and semantic-temporal verbal fluency was also corroborated by more recent behavior-lesion mapping studies of left hemisphere stroke patients [69,70].
Intellectual abilities may be involved in the formation of verbal retrieval strategies, such as clustering. Clustering refers to generating words within self-induced subcategories (e.g., the subcategory 'farm animals' from the semantic category 'animals'), and self-induced clustering and switching between clusters may be dissociable components of verbal fluency [71]. Troyer et al. [72] reported that lexical fluency switching was impaired only in patients with frontal lobe lesions, and that semantic fluency clustering was impaired only in patients with temporal lobe lesions.
Taken together, tests of verbal fluency provide easily applicable assessment techniques, yet the nature of the cognitive processes involved, and their neural substrates, are still under debate. Fluid and crystallized intellectual abilities are both involved, in a manner yet to be determined, but their variability cannot fully account for the variability observed in verbal productivity. Future studies should examine the hypothesis, based on Troyer et al.'s [72] finding, that the essential EF component of verbal fluency is related to the ability to switch between self-induced clusters.
Mental productivity concerning figural fluency showed moderately strong correlations with fluid, but not with crystallized, facets of intelligence. Five-point tasks are practiced much less often in daily life than verbal retrieval tasks, such that the novelty of five-point tasks may place higher demands on structuring abilities than on retrieval abilities. Tests of figural fluency provide easily applicable assessment techniques, yet the nature of the cognitive processes involved in figural productivity, and their neural substrates, are still quite enigmatic. Fluid intellectual abilities are involved, but they cannot fully account for the variability observed in figural productivity. Future studies should examine the hypothesis that the essential EF component of figural fluency is related to the ability to switch between self-induced clusters of eligible figural designs.

Executive Cognition/Cognitive Flexibility and Intelligence
Some previous studies have addressed the relationships between WCST-based indicators of EF and intelligence [30,31,36-40,73-75]. Notably, Jewsbury et al. [31] showed, based on factor-analytic methods, that Wisconsin perseveration errors were subsumable under CHC (cf. Introduction) broad cognitive abilities. Specifically, Wisconsin perseveration errors were found to be related to visuospatial (gv) and fluid (gf) facets of intelligence (see also [36]). A patient study also corroborated the conclusion that aspects of Wisconsin card sorting performance and fluid intelligence may be highly correlated. Notably, Roca et al. [30] compared patients who suffered from frontal lobe lesions with controls. The authors reported that the frequency of Wisconsin total errors no longer differed between the two groups once these were matched with regard to a widely used indicator of gf (i.e., the Culture Fair Test [76]). The authors suggested that the between-group variance in Wisconsin error proneness was negligible once the between-group variance in fluid intelligence was accounted for. This conclusion stands in obvious conflict with Milner's [29] assertion that measuring Wisconsin perseveration errors allows the detection of executive dysfunction in patients with frontal brain lesions in the absence of noticeable declines in intelligence.
A recent meta-analysis examined the associations between WCST scores and intelligence [33]. Across all studies included in the meta-analysis, indicators of gc correlated at indistinguishably low levels (in terms of absolute values) with WCST categories (r = 0.33, confidence interval 0.26 . . . 0.39) and with WCST perseveration errors (r = −0.31, confidence interval −0.36 . . . −0.26). Likewise, indicators of gf correlated at indistinguishably low levels (in terms of absolute values) with WCST categories (r = 0.34, confidence interval 0.27 . . . 0.39) and with WCST perseveration errors (r = −0.29, confidence interval −0.34 . . . −0.24). These meta-analytic central-tendency estimates imply marginal estimates of variance shared between the two cognitive domains (intelligence, executive cognition/cognitive flexibility) of between 8 and 12%. The upper limits of the present results (12% for observed scores and 15% for true scores) are a little higher than these central-tendency estimates, but they agree well with the shared-variance estimates derived from the upper limits of the meta-analytic confidence intervals (between 12 and 15%).
Taken together, the present data and the available evidence in the literature suggest that the usually assessed WCST-based indicators of EF (i.e., categories, perseveration errors) can be well discriminated from indicators of intelligence. This conclusion implies that the assessment of categories and perseveration errors on the WCST as indicators of EF remains a psychometrically defensible practice in clinical neuropsychology [21], which cannot be replaced by the assessment of intellectual abilities.
Our conclusion stands in contrast to the idea of a 'multiple-demand' system, which is supposed to subserve general intellectual abilities, i.e., g. The idea of a 'multiple-demand' system conceptualizes a general problem solver that comes into play whenever the complexity of cognitive demands is sufficiently high [9]. However, our data do not support a unitary view of cognitive abilities. A distinction between intellectual and executive abilities seems warranted, and WCST-based scores are among the most promising candidate indicators of EF.

Study Limitations
One obvious limitation of the present study lies in the trustworthiness of the reliability estimates of the test scores (cf. Table 2). Not all reliability coefficients were available as estimates of internal consistency, so that some estimates had to be based on test-retest reliability. Moreover, the assumption of reliability generalization may be misleading, especially when reliability estimates are transferred from studies of healthy participants to clinical samples [22]. For these reasons, the applied disattenuation (Equation (1)) may represent some degree of under- or overcorrection, respectively.
Another obvious limitation of the present study lies in the trustworthiness of the standardized scores. We argued that utilizing standardized scores is superior to utilizing raw scores (cf. Section 2.5.1). However, one problem that arises with standardized scores is that test standardization procedures are themselves not standardized. In fact, test standardization follows very heterogeneous paths, as is revealed by inspecting neuropsychological test manuals. Standardized scores of different neuropsychological tests may differ strongly in quality with regard to the appropriateness of their socio-economic stratification. The assurance of quality standards concerning the standardization of neuropsychological tests is imperative; national and international neuropsychological societies should prioritize the definition of such standards.
EF is often considered an umbrella term for a diverse bundle of cognitive abilities. Our study was concerned with indicators of executive cognition/cognitive flexibility and mental fluency. Many more indicators of EF have been suggested in the neuropsychological [77-82] and in the cognitive [83,84] literature. Our study merely examined a narrow selection among the many potential indicators of EF, and additional validity studies should address the broad landscape of potential indicators of EF.
Finally, although construct validity is an important building block of validity, criterion (including ecological) validity remains to be determined in future studies. Notably, well-designed studies exploring the relationships between successful M-WCST performance and successful social integration, educational and professional achievement, and self-care in everyday life are needed [85,86].
Our study sample included small subsamples of patients with and without brain diseases. Due to the small sizes of the patient subgroups (see Table 1), it was neither possible nor intended to compare their neuropsychological performance. The sole rationale for including such a large variety of patient subgroups was that the complete sample should display large heterogeneity of individual cognitive abilities across the full ability spectrum (see Table 3). In other words, our study was not concerned with a better understanding of the cognitive disorders occurring in these subgroups of patients, but solely with questions of test score validity as studied in a sample with heterogeneous individual cognitive abilities.
Our test battery included a test (Regensburger Wortflüssigkeits-Test, [46]) that assesses verbal fluency in the German language. Therefore, our findings concerning verbal fluency should be treated with some caution, because the generalizability of psychometric data concerning language-specific neuropsychological tests may be limited.

Conclusions
Our conclusions from these data are quite clear-cut. Apparently dissimilar indicators of intelligence converge on general intellectual abilities. Apparently dissimilar indicators of EF (executive cognition/cognitive flexibility, mental fluency) do not converge on general executive abilities. Executive abilities, although non-unitary, can be reasonably well distinguished from general intellectual abilities. Our conclusion that a distinction can be drawn between intellectual and executive abilities holds particularly true for executive cognition/cognitive flexibility as assessed by the Wisconsin card sorting task, thereby supporting the utilization of the EF construct in clinical neuropsychology. The reported data do not support the assumption of a 'multiple-demand' system, according to which general intellectual ability represents a domain-general capacity for solving all kinds of complex cognitive problems.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement:
The datasets used and/or analyzed for the study are available from the corresponding author upon reasonable request.