3.1. Inadequate Handling of the Tailorshop
One of the major criticisms in both commentaries refers to the application, scoring, and interpretation of the Tailorshop in Greiff et al. (2015). Kretzschmar and Funke et al. were concerned about the seeming lack of an exploration phase and the omission of a knowledge test, both of which they claim are essential parts of the Tailorshop. In addition, Funke et al. highlighted the problem of potential interdependences among the separate performance indicators of the Tailorshop and the need to consider the resulting autocorrelation. Regarding the interpretation of the resulting scores, Funke et al. further noted that using only a single microworld run compared with three MCS tasks with a total of 24 runs may introduce a reliability problem at the level of manifest test scores and would not allow for a fair comparison between the MCS tasks and the Tailorshop.
Unfortunately, and probably at the very source of the above-mentioned points, there is still no consistent recommendation for the application or the scoring of the Tailorshop despite its 30-year research history [11]. More specifically, a publication that lays out the exact structure of the Tailorshop for independent replications or that recommends any specific knowledge-test items is still missing (except for a highly extensive mathematical solution that has, to the best of our knowledge, not been used in any empirical study so far; see [17]). Rather, different publications feature quite diverse versions of the Tailorshop, including inconsistent descriptions of how to score and interpret the results, even though all of them refer to their version as “the Tailorshop.” Two rather recent studies by Danner and colleagues [1,18] (partially co-authored by the authors of the Funke et al. commentary) on the reliability and validity of the Tailorshop do, however, provide several recommendations, which were explicitly used as a guideline for the handling of the Tailorshop in Greiff et al. (2015).
In fact, except for the knowledge test, we ran an almost identical version of the Tailorshop to the one used by Danner and colleagues, which included an exploration phase that preceded the control phase. We admit that the application of a standard exploration phase was not explicitly stated in Greiff et al. (2015), but it seems deducible from the frequent and very specific references to the two studies. Of note, there are no version numbers or other ways to differentiate between the different versions of the Tailorshop across empirical studies, and different applications are used under the same label. The version of the Tailorshop used in our study (a very similar version, also without a knowledge test, is currently recommended by Funke and Holt at psychologie.uni-heidelberg.de/ae/allg/tools/tailorshop/) did not include a knowledge test because Danner and colleagues specifically recommended scoring only the control phase, owing to reliability issues with the knowledge test. Thus, we disagree with the notion that Greiff et al. (2015) employed some kind of exotic version of the Tailorshop; very much to the contrary, we used a version recommended by some of the authors of the commentaries.
Regarding the interdependences among the separate performance indicators of the Tailorshop, Funke et al. argued in their commentary that by ignoring the autocorrelation of the performance indicators, we artificially increased the measure’s reliability, which in turn led to an underestimation of the validity coefficients for the Tailorshop due to a reduced correction for attenuation. We agree that this would be a plausible argument if the autocorrelation had actually been ignored. However, Danner and colleagues ([18], p. 228) provided specific recommendations for how to best score the Tailorshop and stated that “the changes of the company values after each simulated month may be taken as performance indicators for the Tailorshop simulation” because “there is no [autocorrelation] between the changes of the company values” ([18], p. 227), as already described previously by Funke [19]. The appropriateness of this approach was also demonstrated empirically in the same paper [18]. As these are exactly the performance indicators we used in Greiff et al. (2015), it seems peculiar that Funke et al. would criticize us for following their own recommendations. Their criticism is thus either the result of a misunderstanding or should be directed at their own work, which would then have failed to provide an acceptable scoring procedure for the Tailorshop in more than 30 years of research.
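To make the scoring logic explicit, the following minimal sketch illustrates the recommended use of month-to-month changes, rather than raw company values, as performance indicators. The numbers are invented for illustration and do not stem from the Tailorshop or from our data.

```python
import numpy as np

# Hypothetical end-of-month company values for one participant across the
# 12 simulated months (illustrative numbers only, not actual Tailorshop output).
company_value = np.array([100000, 98000, 99500, 103000, 101000, 104500,
                          108000, 107000, 110500, 113000, 112000, 115000])

# Scoring as recommended by Danner and colleagues [18]: the month-to-month
# *changes* in company value, not the raw (cumulative) values, serve as the
# performance indicators.
changes = np.diff(company_value)

def lag1_autocorrelation(x):
    """Pearson correlation between a series and its lag-1 shifted copy."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Raw values are strongly autocorrelated because each month carries over the
# previous month's value; the change scores are largely free of this dependency.
print("lag-1 autocorrelation, raw values:   ", round(lag1_autocorrelation(company_value), 2))
print("lag-1 autocorrelation, change scores:", round(lag1_autocorrelation(changes), 2))
```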
In addition to their criticism of the scoring of the Tailorshop, Funke et al. seem to imply that an aggregated score from multiple runs of the Tailorshop might provide a better comparison with the MCS tests, which consist of multiple small microworlds (p. 2). Of note, multiple runs of the Tailorshop are impracticable due to the unrealistically high testing time (1 h, if not considerably more), are implausible due to learning effects that would severely reduce the measure’s reliability (for a potential use of parallel forms that reduces this problem, see [20]), and have little to no theoretical or empirical foundation. On the contrary, the single run of one very large complex problem is at the heart of the construction rationale of the classical CPS measures [3]. Any suggestions about potential results of multiple runs of the Tailorshop must therefore remain purely speculative.
This is exactly the advantage and the main rationale behind the MCS measures. In response to some of the obvious psychometric problems of classical CPS measures (see previous point), MCS measures were designed to be short enough to allow for the assessment of multiple tasks, which are independent of each other and are, thus, less prone to learning effects [3]. The resulting score thus contains the information of multiple independent measures and is therefore considerably more reliable than a score that results from a single large complex system. Funke et al. do not specify how they would suggest addressing this issue but imply that not addressing it represents an unfair treatment of the Tailorshop. We do agree with Funke et al., however, that comparing an aggregated measure of all three MCS measures with the Tailorshop would be inadequate. Aggregated into one factor, the three MCS measures would represent a higher level of abstraction than the Tailorshop (i.e., a second-order factor), which is why we refrained from doing so in Greiff et al. (2015). Rather, we chose to compare the validity of each MCS measure with the Tailorshop separately.
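The reliability gain from aggregating several independent tasks can be illustrated with the Spearman-Brown prophecy formula; the figures below are purely hypothetical and are not taken from our data. If each of $k$ approximately parallel tasks has reliability $\rho$, the reliability of their aggregate is

$$\rho_{k} = \frac{k\,\rho}{1 + (k - 1)\,\rho}.$$

For a hypothetical single-task reliability of $\rho = .30$, an aggregate of $k = 10$ tasks would reach $\rho_{10} = \frac{10 \times .30}{1 + 9 \times .30} \approx .81$, whereas a measure based on a single run of one large system is confined to whatever reliability that one run provides.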
Finally, Funke et al.’s criticism that the resulting MicroDYN factor (we assume they combined MicroDYN and the Genetics Lab in this criticism) might be a “bloated specific” due to the very high similarity between the tasks would apply to every construct that is measured by applying a well-defined construction rationale (e.g., also to figural matrices tasks, which appear quite similar but rely on different sets of construction rules [21]). Only the variation and addition of such rules leads to varying task difficulties. However, tasks such as matrices are among the best-justified psychometric measures and are rather unlikely to constitute a bloated specific. Following this line of thought, Stadler, Niepel, and Greiff [22] demonstrated that tasks within one of the MCS measures, MicroDYN, can be constructed based on a set of six well-defined rules that fully determine their difficulty. Therefore, without further empirical evidence addressing this issue, it seems implausible that MicroDYN tasks constitute a bloated specific.
In summary, we respectfully disagree with the very general criticisms of our handling of the Tailorshop expressed in both commentaries. Our application, scoring, and interpretation procedures were based on the latest publications on the Tailorshop, which may be at odds with other publications that used the Tailorshop. However, this seems more like a general issue with the Tailorshop than with our paper.
3.2. Inadequate Inclusion of MicroFIN
Both commentaries argued that the specific version of MicroFIN we included was inadequate, mainly due to the reduced number of only two MicroFIN tasks. We wholeheartedly agree and consider this a shortcoming of Greiff et al. (2015) and also of Greiff et al. (2014) [23], which relied on the same data but addressed a different research question. Of note, both papers acknowledged this shortcoming, discussed it in detail, and provided a number of arguments for the inclusion of MicroFIN, albeit in a limited and rather brief version (cf. [9,23]).
However, the two commentaries drew quite contrasting conclusions about MicroFIN. Funke et al. declared the 2-task version of MicroFIN the “surprise winner” (p. 4) with respect to predictive validity. They did so on the basis of absolute differences in correlation coefficients with school grades of less than 0.05 (e.g., 0.22 vs. 0.19 or 0.33 vs. 0.31; cf. Table 2 in Greiff et al., 2015; see [24] for the variability of correlation coefficients). From this, the authors concluded that the entire idea of MCS, which posits the need for several independent CPS tasks, might be unnecessary, as only two MicroFIN tasks were enough to outperform MicroDYN (10 tasks) and the Genetics Lab (12 tasks). Kretzschmar, on the other hand, stressed the need for a longer (i.e., more tasks) and presumably more reliable version of MicroFIN. He then argued that this would allegedly lead to a higher correlation between the Tailorshop and MicroFIN and thus questioned Greiff et al.’s (2015) argument of higher convergent coefficients between the MCS tasks than between any MCS task and the Tailorshop. In his argument, however, Kretzschmar did not consider that, in the envisaged scenario, the relations between MicroFIN and the two other MCS measures would also increase, leading to even more consistency between the three MCS tests (MicroFIN had the lowest relations with MicroDYN and the Genetics Lab; cf. Table 2 in Greiff et al., 2015), and, further, that reliability is unlikely to be a driving factor of correlations at the level of latent modeling to begin with.
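The sampling variability of correlation coefficients of this magnitude can be illustrated with a simple Fisher-z confidence interval. The sketch below assumes a purely illustrative sample size rather than the actual sample size of Greiff et al. (2015); its point is only that the intervals are far wider than the differences at issue.

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = np.arctanh(r)                       # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)               # standard error of z
    crit = stats.norm.ppf(0.5 + level / 2)  # e.g., 1.96 for a 95% interval
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

# Illustrative sample size only; intervals around correlations of .20-.35
# are far wider than the .02-.03 differences discussed in the commentary.
n = 300
for r in (0.19, 0.22, 0.31, 0.33):
    lo, hi = fisher_ci(r, n)
    print(f"r = {r:.2f}: 95% CI [{lo:.2f}, {hi:.2f}]")
```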
Although we acknowledge the limitations associated with the two MicroFIN tasks, we note that both arguments are based on speculation and cannot be confirmed or rebutted without further empirical study. We invite the authors of the two commentaries to conduct empirical investigations of their propositions. In the absence of such investigations, however, we need to rely on the little empirical evidence there is. Neubert, Kretzschmar, Wüstenberg, and Greiff [25] published the only article that employed a longer version of MicroFIN with five tasks. In correlated trait-correlated method minus one (CTC(M-1); [26]) models, the average specificity across all five MicroFIN tasks is reported as 0.58 (cf. Table A1 in Neubert et al.) in a model with MicroDYN as the reference method. In a similar model, again with MicroDYN as the reference method, the two MicroFIN tasks employed in Greiff et al. (2015) had an average specificity of 0.53 (cf. Table 2 in Greiff et al., 2014). Put differently, the latent MicroFIN factor showed a comparable overlap with (and distinction from) MicroDYN in both data sets, irrespective of the number of MicroFIN tasks.
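For readers unfamiliar with the CTC(M-1) approach, the specificity coefficient of an indicator $Y_{ik}$ of a non-reference method $k$ is commonly defined (following [26]) as the share of its true-score variance that is due to the method factor rather than to the trait factor established by the reference method:

$$\mathrm{SP}(Y_{ik}) = \frac{\lambda_{M_{ik}}^{2}\,\mathrm{Var}(M_k)}{\lambda_{ik}^{2}\,\mathrm{Var}(T_i) + \lambda_{M_{ik}}^{2}\,\mathrm{Var}(M_k)},$$

with consistency defined as its complement. Average specificities of 0.58 and 0.53 therefore indicate that, in both data sets, roughly half of the reliable variance of the MicroFIN tasks was not shared with MicroDYN.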
Thus, we cannot see how either the data reported in Greiff et al. (2015) or the (admittedly scarce) other empirical evidence suggests that the reliability of MicroFIN or the overall pattern of results might be distorted to an extent that would render our original conclusions invalid. Nor, in our view, did the commentaries offer clear and sufficiently reasoned conclusions on this point.
3.3. Inadequate Statistical Analyses and Interpretations
Finally, Kretzschmar in particular discussed issues with Greiff et al.’s (2015) statistical analyses and offered several alternative interpretations of the results. We acknowledge that there are various ways to analyze any data, some of which certainly provide interesting new insights. However, we insist that the analyses we chose were adequate insofar as they allowed us to answer the research questions outlined in our study.
To answer Research Question 1 (RQ1), whether the correlations between the different MCS tests were higher than those between the MCS tests and the classical CPS measure, we compared a restricted model in which all correlations between the CPS measures were constrained to be equal with a less restricted model in which only the correlations among the MCS tests were constrained to be equal while the correlations between the MCS tests and the Tailorshop were allowed to take different values. Kretzschmar argued that a lack of fit of the less restricted model (after adjusting for reasoning) might have affected our findings and questioned the validity of our approach. While we do not agree with his criticism, we gladly conducted additional analyses that provided further support for the interpretations made in Greiff et al. (2015). We based these analyses on the intercorrelations between all manifest indicators of the four CPS tests. If a lack of model fit had biased our original results, as Kretzschmar suggested, we should find a deviating pattern when using manifest variables instead of latent factors. We calculated heterotrait-monotrait ratios between all CPS tests [27]. That is, we related the strength of the correlations between indicators of two different CPS tests to the strength of the correlations between indicators of the same CPS test. We found that the ratios among the MCS tests (0.60 to 0.76) were higher than the ratios between the MCS tests and the Tailorshop (0.24 to 0.45), and these ratios remained similar when reasoning was controlled for (0.51 to 0.71 vs. 0.19 to 0.41). This indicates convergent validity for the MCS tests as compared with the Tailorshop. Even when the analytical model was free of any restricting assumptions, we found results that were similar to those from the original analyses and to the latent correlations reported in Greiff et al. (2015; Table 2).
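As a minimal sketch of this computation, the heterotrait-monotrait ratio [27] for a pair of tests can be obtained from a manifest correlation matrix as follows; the matrix and the assignment of indicators to tests are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np

def htmt(corr, idx_a, idx_b):
    """Heterotrait-monotrait ratio [27] for two tests, computed from a manifest
    correlation matrix; idx_a and idx_b index each test's indicators."""
    corr = np.asarray(corr)
    # Average correlation between indicators of the two *different* tests.
    hetero = corr[np.ix_(idx_a, idx_b)].mean()
    # Average correlation among indicators of the *same* test (off-diagonal only).
    def mono(idx):
        block = corr[np.ix_(idx, idx)]
        return block[~np.eye(len(idx), dtype=bool)].mean()
    # Heterotrait average relative to the geometric mean of the monotrait averages.
    return hetero / np.sqrt(mono(idx_a) * mono(idx_b))

# Hypothetical 5 x 5 correlation matrix: indicators 0-2 belong to test A,
# indicators 3-4 to test B (the actual analysis used far more indicators).
R = np.array([
    [1.00, 0.55, 0.50, 0.25, 0.20],
    [0.55, 1.00, 0.60, 0.22, 0.24],
    [0.50, 0.60, 1.00, 0.18, 0.21],
    [0.25, 0.22, 0.18, 1.00, 0.45],
    [0.20, 0.24, 0.21, 0.45, 1.00],
])
print(round(htmt(R, [0, 1, 2], [3, 4]), 2))  # approx. 0.44 for these numbers
```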
Funke et al. also questioned the analytical approach chosen to answer RQ1 because, in this comparison, there was no way the Tailorshop could have been more valid than the MCS tests since there was no other test in its group (classical microworlds) that it could correlate with. This appears to be the result of a misunderstanding. As outlined throughout the Greiff et al. paper, our hypothesis was that the MCS tests would be more closely related to each other than to the Tailorshop and would, thus, show higher convergent validity. If there is a CPS construct and the MCS tests capture it more reliably than the Tailorshop, we would expect higher and more consistent correlations between the MCS tests than between any MCS test and the Tailorshop. On the basis of our approach, this hypothesis could have been rejected by a nonsignificant difference between the two models, which would have suggested equal correlations among all CPS measures.
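In formal terms, and using abbreviations of our own (MD = MicroDYN, GL = Genetics Lab, MF = MicroFIN, TS = Tailorshop), the comparison can be summarized as follows:

Restricted model: $\phi_{MD,GL} = \phi_{MD,MF} = \phi_{GL,MF} = \phi_{MD,TS} = \phi_{GL,TS} = \phi_{MF,TS}$

Less restricted model: $\phi_{MD,GL} = \phi_{MD,MF} = \phi_{GL,MF}$, with the three correlations involving the Tailorshop allowed to differ (freely estimated in this sketch).

A $\chi^2$ difference test between these nested models then indicates whether the correlations among the MCS tests differ from those between the MCS tests and the Tailorshop; either pattern, including higher correlations between the MCS tests and the Tailorshop than among the MCS tests themselves, could have emerged from this comparison.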
Furthermore, both commentaries emphasized the importance of adjusting our results for reasoning (or, better, a comprehensive measure of intelligence). Concerning RQ2, comparing the validity of the MCS tests and the Tailorshop in predicting students’ GPAs before and after adjusting for reasoning, the authors of the commentaries pointed out how weak (statistically nonsignificant) our results were after reasoning was controlled for. They interpreted this finding in light of our conclusion that the construct validity of the MCS tests might somehow be superior to the construct validity of the Tailorshop. This reasoning is based on contradictory assumptions. First, they correctly assumed that intelligence is an important control variable because it predicts the same criteria as CPS measures do, the reason being that CPS tests, like reasoning, can be subsumed under the overarching construct of intelligence [28]. Second, they were surprised that, after reasoning was adjusted for, the differences between the MCS tests and the Tailorshop decreased. If the constructs of CPS and reasoning exhibit some overlap, and the MCS tests are valid measures of the construct of CPS, then it logically follows that the correlations between the MCS tests and reasoning should be particularly high (which is what we and [29] found). If the MCS tests are more valid measures of the construct of CPS than the Tailorshop is, then their overlap with reasoning should be higher than the overlap between the Tailorshop and reasoning (which is what we found). This means that adjusting the MCS tests and the Tailorshop for reasoning has a stronger effect on the better measure of CPS (here, presumably the MCS tests) precisely because it shares more variance with reasoning.
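This mechanism can be made explicit with the partial correlation formula; the numbers below are purely hypothetical and are not taken from our data. With $X$ denoting a CPS test, $G$ denoting GPA, and $Z$ denoting reasoning,

$$r_{XG \cdot Z} = \frac{r_{XG} - r_{XZ}\,r_{GZ}}{\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{GZ}^{2}\right)}}.$$

Assuming, for illustration, $r_{XG} = .30$ and $r_{GZ} = .40$ for both tests, a test that correlates $r_{XZ} = .60$ with reasoning drops to $r_{XG \cdot Z} \approx .08$ after partialling out reasoning, whereas a test that correlates only $r_{XZ} = .30$ with reasoning drops to $r_{XG \cdot Z} \approx .21$. The measure with the stronger overlap with reasoning is thus attenuated more, which mirrors the pattern described above.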
In summary, we agree that additional analyses leading to potentially different interpretations are possible, and we were thus careful about how we phrased our results. However, we still believe that the methods used and the interpretation of our results were adequate for answering the intended research questions and led to important and long-overdue progress in the field of CPS. If other researchers would like to explore different analyses or research questions with the data, we would be happy to provide the data for reanalyses.