Assessment of Motor Planning and Inhibition Performance in Non-Clinical Sample—Reliability and Factor Structure of the Tower of London and Go/No Go Computerized Tasks

In two studies, we examine the test-retest reliability and factor structure of the computerized Tower of London (TOL) and Go/No Go (GNG) tasks. Before analyses, raw results of variables that were not normally distributed were transformed. Study 1 examined the reliability of a broad spectrum of indicators (Initial Time Thinking, ITT; Execution Time, ET; Full Time, FT; Extra Moves, EM; No Go Errors, NGE; Reaction Time for Go Responses, RTGR) across an eight-week delay in a sample of 20 young adults. After correction for multiple comparisons and correlations, our results demonstrate that the tasks have ambiguous test-retest reliability coefficients (non-significant r for all indicators and non-significant intraclass correlation (ICC) for TOL; significant ICC for GNG; lack of reliable change over time for all indicators in both tasks); moreover, ITT exhibits strong practice effects. Study 2 investigated both tasks’ factor structure and conducted a more detailed analysis of indicators for each trial (ITT, ET, EM) in the TOL task in a group of 95 young adults. Results reveal a satisfactory 2-factor solution, with the first factor (planning/inhibition) defined by ITT, NGE, and RTGR, and the second factor (move efficiency) defined by EM and ET. The detailed analysis identified a 6-factor solution with the first factor defined by ITT for more difficult trials and the remaining five factors defined by EM and ET for each trial, reflecting move efficiency for each trial separately.


General Introduction
Adequate measurement of executive functions (EFs) in healthy and clinical individuals is still a matter of contention. According to Chan et al. [1], EFs are an umbrella term comprising a wide range of cognitive processes and behavioral competencies. They generally refer to higher-level cognitive functions involved in controlling and regulating lower-level cognitive processes and goal-directed, future-oriented behavior [2].
Two important aspects of EFs are planning and inhibition. Planning is the ability to identify and organize the steps toward a particular goal [3]. It is commonly measured by disk-transfer tests such as the Tower of London task (TOL) [4]. The different versions of the TOL task and the resulting differences in structural properties, problem space, measures, and administration lead to difficulty in reaching conclusions regarding its psychometric properties. It is unclear whether different versions of the task measure the same components of EFs [5,6]. For example, some versions of the task require the participant to find the shortest number of steps to solve the task and assess the time needed to plan and execute the task perfectly; other versions permit mistakes and measure the time and number of

Study 1

Introduction
The test-retest reliability of different indicators of TOL and GNG has been examined in a few previous studies, for a variety of versions of those tasks. A survey of studies, presented in Table 1 for TOL and Table 2 for GNG, revealed that most studies report only Pearson's r correlation, which is considered a weak measure of test-retest reliability because it is only a measure of correlation. Therefore, some authors also used the Intraclass Correlation (ICC), which better measures this type of reliability because it reflects both the degree of correlation and the agreement between measurements [18]. Furthermore, the survey revealed that most of the studies investigated a group of students over a short period without using an alternative version of the tests in a second measurement. Thus, research provides an ambiguous set of results due to the different populations involved and the different intervals used.   [26], we investigated the reliability of more indicators of TOL and GNG in a sample of healthy young adults (20-40 y.o.) over a 2-month interval, using more complex statistical analyses. The classical version of TOL proposed by Shallice [31] is a simple task for healthy adults; therefore, it is prone to a ceiling effect in the second measurement. To counteract this problem, we used a version of the TOL task that has a more complicated problem space. In the second measurement, we used an alternative version of the task with different trials and the same level of difficulty. In contrast to previous research, we analyzed a broader spectrum of indicators in this study: Initial Time Thinking (ITT), Execution Time (ET), Full Time (FT), Extra Moves (EM), No Go Errors (NGE), and Reaction Time for Go Responses (RTGR).
In addition to the simple correlation and ICC statistics reported in most studies, we have employed other methods for measuring stability over time, such as Reliable Change Index (RCI), which provides a more precise estimate of relative change and controls for test reliability [17]. Just like Köstering et al. [23], we performed ANOVA analyses for each level of difficulty in the TOL task in order to investigate changes over time more precisely.
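The difference between correlation and agreement discussed above can be illustrated with a small numerical sketch. The function below computes ICC(3,1) (two-way mixed, single measure, consistency) from the usual two-way ANOVA mean squares; the function name and the toy scores are hypothetical and are not data from this study.

```python
import numpy as np

def icc_3_1(scores):
    """ICC(3,1): two-way mixed model, single measure, consistency.

    scores: array of shape (n_subjects, k_sessions).
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for subjects (rows), sessions (columns), and residual.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Hypothetical scores: the retest is a perfect linear function of the test,
# but its spread is three times larger.
test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
retest = 3.0 * test

r = np.corrcoef(test, retest)[0, 1]               # Pearson's r
icc = icc_3_1(np.column_stack([test, retest]))    # ICC(3,1)
print(f"Pearson r = {r:.2f}, ICC(3,1) = {icc:.2f}")
```

In this toy case Pearson's r equals 1 while ICC(3,1) is about 0.6, because the two sessions are perfectly correlated but do not agree in scale. Note that the consistency form of the ICC is unaffected by a constant shift between sessions, so it does not by itself reveal a uniform practice effect.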

Participants
Thirty-five young adults participated in the study. They were recruited via the university website, posters, social networks, local radio, and adverts in newspapers. There was no financial compensation for participation. All participants met the following inclusion criteria: (1) being a native Polish speaker; (2) being between 20 and 40 years old; and (3) having normal or corrected-to-normal vision and hearing. Participants with (1) a history of neurological disorder; (2) any psychiatric disorder; (3) a history of drug addiction; (4) any head injury; or (5) education in psychology were excluded from data analysis using a self-report screening questionnaire.
Additionally, we used the General Health Questionnaire 30 (GHQ-30) to assess mental health problems. Crystallized and fluid intelligence were measured using the Information and Picture Completion tests from the WAIS-R battery, respectively, as proposed by Lezak (1995). One participant was excluded from the analysis due to an extreme result greater than 1 SD from the mean on the GHQ (the cut-off points are 99.13 for people aged less than 30, and 95.69 for people aged 30-40 [32]). No participants were excluded due to results on intelligence tests. One participant was excluded due to missing data. The Ethics Board of the Institute of Psychology of the University of Szczecin approved the research procedure (KB 9/2018). During the first assessment, each participant gave informed consent and completed computerized versions of the Tower of London (TOL) and Go/No-Go (GNG) tasks from the Psychology Experiment Building Language (PEBL).
The tests were administered and scored following the standard procedures by a group of six well-trained examiners under the leading investigators' supervision. The testing took place in a quiet setting at the University. After eight weeks, participants were invited again and tested with an alternative version of the same tasks in the same order.

Tasks and Measurements
We used computerized versions of the TOL and GNG tasks from the PEBL. The PEBL is open-source software, licensed under the GNU General Public License 2.0, which allows scientists to create and conduct neuropsychological tests, primarily devoted to experimental design [33]. Instructions for each task were translated from English into Polish and then evaluated by two expert judges whose advice was carefully considered and applied when relevant.
This study used Ward and Allport's TOL task [34], adapted by Phillips et al. [35] in research on a group of non-clinical adults. The task consisted of five colored disks that can be moved, one by one, on and off three pegs of equal height. Each trial's goal was to move all disks from an initial state to a goal state, which was also shown on the screen. All five disks can be stored on any one of the pegs. As shown in Figure 1, the color patterns of the initial and goal states determine the minimal number of moves needed to solve the problem in a given trial. Responses were made using a computer mouse. The computerized version of the task prevents participants from making illegal moves, eliminating one of the sources of variability previously discussed. Participants were instructed to solve the task as quickly as possible with the minimum number of moves. At the beginning of the task, we added three practice trials with three disks (no indirect moves) to ensure that participants understood the task. The main task contained eight trials with five disks. Across trials, the level of difficulty increased with the number of moves needed to solve the problem (3-10) and the number of indirect, counter-intuitive moves, which do not immediately bring the configuration closer to the goal state (0-6). The number of distinct optimal solutions was the same for each trial (only one solution; see Figure 2, Table 3). The number of moves in the optimal solutions in the version of Phillips et al. was the same on the test and retest versions for 5 of 8 trials; therefore, we analyzed only trials 1, 2, 3, 6, and 7. All participants saw the same sequence of problems on each test administration, but the problems differed across the two testing sessions. The beginning and goal states differed for each trial, but task parameters and difficulty remained the same.
We measured the total time spent planning, i.e., the total time until the first move (ITT); the total execution time, i.e., the total time from the first move to completion (ET); the total time for completion of all trials (FT); and the total number of extra moves, i.e., moves made minus 31 (EM). Additionally, ITT and ET were analyzed individually for each trial.
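As an illustration only, the four TOL indicators can be derived from per-trial records as sketched below. The `Trial` record, its field names, and the example values are hypothetical and do not reflect the PEBL output format. The sketch also assumes that FT equals ITT plus ET, and it subtracts the sum of per-trial minimum moves where the text subtracts a constant.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    first_move_s: float  # seconds from problem onset to the first move
    finish_s: float      # seconds from problem onset to the solution
    moves_made: int      # moves actually made by the participant
    min_moves: int       # moves in the optimal solution

def summarize(trials):
    """Aggregate ITT, ET, FT (seconds) and EM (count) over the given trials."""
    itt = sum(t.first_move_s for t in trials)   # total planning time
    ft = sum(t.finish_s for t in trials)        # total completion time
    et = ft - itt                               # total execution time (assumed FT - ITT)
    em = sum(t.moves_made - t.min_moves for t in trials)  # extra moves
    return {"ITT": itt, "ET": et, "FT": ft, "EM": em}

# Two hypothetical trials.
summary = summarize([
    Trial(first_move_s=4.0, finish_s=10.0, moves_made=4, min_moves=3),
    Trial(first_move_s=6.5, finish_s=20.5, moves_made=9, min_moves=7),
])
print(summary)
```

Per-trial ITT and ET, which are analyzed separately in the ANOVAs, correspond to `first_move_s` and `finish_s - first_move_s` for a single trial.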
As in the GNG study of Bezdjian et al. [15], participants were requested to follow a sequential presentation of letters and respond to the target letter (P) by pressing a button on the keyboard while withholding responses to the non-target letter (R). Letters were randomly generated and presented for 500 milliseconds in one of four squares, arranged in a 2 × 2 pattern with one star in each square. The interval between stimuli lasts 1500 milliseconds. The task consisted of 160 trials with a ratio of targets to non-targets of 80:20 (128:32). At the beginning of the task, a short practice session was administered to ensure that participants understood the instructions. Because of the random generation of letters,

Statistical Analyses
Statistical analysis of the data was conducted using the IBM SPSS 25 statistical package. We used the Box-Cox transformation for EM and NGE to achieve normality of the distribution for all analyzed variables [36]. We used paired t-tests with 1000 bias-corrected bootstrap samples to examine differences between baseline and retest performances. The size of the practice effects was assessed with Cohen's d. We used Pearson's r correlation with 1000 bias-corrected bootstrap samples. Since Pearson's r correlation is considered a weak measure of test-retest reliability, especially when group means are similar and coefficients are high [37], we also used the ICC, which measures correlations within a class of data rather than correlations between two different classes of data [38]. Here we used ICC(3,1): a two-way mixed model of ICC (consistency), with a 95% confidence interval [18,38]. For multiple correlations and comparisons, we used Holm-Bonferroni corrections [39]. We also calculated Reliable Change Indices (RCI) to assess whether a change between repeated assessments exceeds the probable range of measurement error [40]. The RCI estimates the probability that a given difference in a score is not due to measurement error but reflects a real change [41]. We used the RCI adjusted for practice effects, computed by the formula [42]:
$$RCI = \frac{(T_2 - T_1) - (M_2 - M_1)}{SED_I}$$
where $T_1$ is the score at the first assessment, $T_2$ is the score at the second assessment, $M_1$ is the mean at the first assessment, $M_2$ is the mean at the second assessment, and $SED_I$ is the standard error of the difference by Iverson.
We used the alternative calculation of the SED proposed by Iverson [43]:
$$SED_I = \sqrt{\left(S_1\sqrt{1 - r_{12}}\right)^2 + \left(S_2\sqrt{1 - r_{12}}\right)^2}$$
where $S_1$ is the standard deviation at the first assessment, $S_2$ is the standard deviation at the second assessment, and $r_{12}$ is the correlation coefficient between the first and second assessments. Additionally, repeated measures analysis of variance (RM-ANOVA) with the within-subject factors "Time" (Times 1 and 2) and "Trial" (Trials 1, 2, 3, 6, and 7 of TOL) was performed separately with time until the first move (ITT) and move time (ET) as the dependent variables. EM was excluded from this analysis because its distribution did not reach normality or acceptable skewness and kurtosis even after transformation. We used the Bonferroni post hoc test for both variables, and in the case of ET, we used the Greenhouse-Geisser correction for degrees of freedom.
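A minimal sketch of the practice-adjusted RCI follows. It assumes the commonly used forms of the two formulas: the individual change minus the mean group change, divided by Iverson's standard error of the difference, which combines the standard errors of measurement of both sessions. Function names and all numbers are hypothetical and are not values from this study.

```python
import math

def sed_iverson(s1, s2, r12):
    """Standard error of the difference from both sessions' SEMs (assumed form)."""
    sem1 = s1 * math.sqrt(1.0 - r12)  # standard error of measurement, session 1
    sem2 = s2 * math.sqrt(1.0 - r12)  # standard error of measurement, session 2
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def rci_practice_adjusted(t1, t2, m1, m2, s1, s2, r12):
    """Individual change minus mean group change, in SED units."""
    return ((t2 - t1) - (m2 - m1)) / sed_iverson(s1, s2, r12)

def reliable_change(rci, cutoff=1.96):
    """True if the change lies outside the 95% interval (|RCI| > 1.96)."""
    return abs(rci) > cutoff

# Hypothetical participant and group statistics.
sed = sed_iverson(s1=3.0, s2=4.0, r12=0.75)
rci = rci_practice_adjusted(t1=20.0, t2=14.0, m1=18.0, m2=16.0,
                            s1=3.0, s2=4.0, r12=0.75)
print(f"SED = {sed:.2f}, RCI = {rci:.2f}, reliable = {reliable_change(rci)}")
```

In this example the participant improves by 6 points, but 2 points of that are attributable to the mean group change, and the remaining 4 points fall within the ±1.96 band, so the change would not count as reliable.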

Results
Descriptive statistics, Student's t-test, and Cohen's d for all indicators on TOL and GNG tasks are shown in Table 4. After Holm-Bonferroni p-value corrections for multiple comparisons, the difference between the first and second time points was significant only for ITT of TOL, and the effect was very large. There were no significant differences between time points for indicators in GNG. Pearson's r, ICC, and RCI for all TOL and GNG indicators are shown in Table 5. After Holm-Bonferroni p-value corrections for multiple correlations, Pearson's correlations were not significant for any indicator of TOL or GNG. The ICCs were not significant for TOL indicators but were significant for both NGE and RTGR of GNG. For ET, FT, and EM, only 5% of participants fell outside of the RCI confidence intervals, indicating reliable change over time for only one person in the group. Interestingly, for ITT, scores did not change reliably for any of the participants. Similarly, in the GNG task, for RTGR, scores for only one person changed reliably across the two assessments, while for NGE, there was no reliable change over time for any participant.
Repeated measures ANOVAs were performed separately for each variable from TOL (Figure 3A,B) to examine the effects of within-session trials and testing sessions. For ITT, differences between both sessions (F(1, 19) = 55.95; p < 0.001; η² = 0.75) and trials (F(4, 76) = 18.34; p < 0.001; η² = 0.49) were significant. There were significant differences between the second trial and the remaining trials (all p < 0.003), the first and the sixth trials (p = 0.004), and the sixth and seventh trials (p = 0.038). The interaction between trial and session was also significant (F(4, 76) = 36.19; p < 0.001; η² = 0.66). Comparisons revealed that the first and second sessions did not differ in the first and second trials, while there were significant differences for the other trials. The greatest difference occurred in the seventh trial (difference in estimated marginal Ms = −0.33), the difference was smaller for the third trial (Ms = −0.30), and it was the smallest for the sixth trial (Ms = −0.21).

For ET, differences between trials were significant (F(2.12, 40.26) = 105.49; p < 0.001; η² = 0.85), but the differences between sessions were not (F(1, 19) = 1.11; p = 0.305; η² = 0.06). Significant differences occurred between the first and the remaining trials (p < 0.001), the second trial and the sixth and seventh trials (p < 0.001), and the third trial and the sixth and seventh trials (p < 0.001). The interaction between trial and session was non-significant (F(2.43, 46.09) = 1.34; p = 0.273; η² = 0.07).

Discussion
After the correction for multiple correlations (Pearson's r and ICC), results ceased to be significant for all TOL task indicators. After the correction for multiple correlations, Pearson's r ceased to be significant for both GNG indicators. Both NGE and RTGR yielded a fair level of ICC.
Novelty effects and practice effects are essential factors that should be considered when testing the executive functions, especially when taking repeated measurements over time. These effects are intertwined, but the literature is equivocal about their relationship [17]. We observed a practice effect for ITT: times were significantly shorter at the second time point, and the effect size was very large [44]. For FT, ET, and EM for TOL, and both GNG indicators, Student's t-test showed no significant differences between the two time points, suggesting the absence of practice effects. However, when controlled for practice effects, results showed that all participants fell within the 95% confidence interval for all TOL indicators, with a cut-off point of ±1.96, which indicates a lack of reliable change over time [17,40]. As with the TOL task, for both GNG measures, almost all participants fell within the 95% interval, which indicates no reliable change over time. In contrast to the research of Köstering et al. [23], our results suggest that both tests have ambiguous test-retest reliability coefficients. In our view, the reason for this may be the transformations of skewed raw variables (Box-Cox, recommended by Sakia [36]) and the corrections for multiple comparisons and correlations [39] that we applied in our study.
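To illustrate how a correction for multiple testing can render nominally significant results non-significant, the Holm-Bonferroni step-down procedure can be sketched as follows. The function name and the p-values are hypothetical and are not taken from the tables of this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    The smallest p-value is compared with alpha/m, the next with alpha/(m-1),
    and so on; testing stops at the first non-significant p-value.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject

# Hypothetical p-values for four tests.
p_values = [0.001, 0.04, 0.03, 0.20]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print(uncorrected, corrected)
```

In this toy example three of the four tests pass the uncorrected 0.05 threshold, but only the first survives the correction.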
With RM-ANOVA, we discovered more complex practice effects for the TOL task depending on the trial's difficulty level. A significant interaction effect for ITT suggests differentiated practice effects based on the difficulty level of the task (number of necessary direct moves and/or number of indirect moves; see Table 3). For the easy first and second trials, the practice effect was absent, whereas it occurred for the other, more complicated trials. Longer ITT for more difficult trials (3, 6, and 7) than for easier ones (1 and 2) at the first time point suggests that subjects developed more complex strategies for solving these problems. The lack of difference between the last three trials (3, 6, and 7), despite the increasing number of indirect moves, may suggest that the identified strategies were retained (see: Figure 3A and Tables S1 and S2). Contrary to the first time point, in the second time point, despite the growing difficulty, ITT was relatively stable over trials, which may suggest the emergence of practice effects for the applied strategies.
For ET, the interaction effect and the main effect of time are non-significant, which suggests a lack of practice effects for this measurement. The significant main effect of the trial shows that more time was taken on the later trials due to the increasing number of movements needed for a solution (click and drag).
According to Diamond's model [12], the fact that practice effects occur for planning (TOL), and not for motor inhibition (GNG) may result from the hierarchical structure of executive functions. Lower level executive functions like inhibition involve a lesser degree of complex cognitive operations, e.g., forming strategies for solving tasks. Higher-level executive functions, like planning, are based on this kind of operation, and therefore are more susceptible to practice effects. According to Duff [17], practice effects are stronger in tasks based on fluid abilities, where answers can be obtained in the setting, and where responses have not been met previously.
For such simple tasks like GNG used in this study, practice effects rarely occur. It would be worth verifying if the absence of practice effect would also be observed for more complex tasks which involve different stimuli (e.g., facial expression or semantic meaning [45][46][47]).

Study 2

Introduction
The factor structure of different tasks capturing various aspects of EF, including planning measured by the TOL task and motor inhibition measured by GNG, has been studied in past research. Levin et al. [48] investigated the structure of executive functions in head-injured children using seven tests, including the Shallice version of the TOL task and the GNG task. Indicators of TOL and GNG loaded on the following three factors: planning as a planning-execution dimension (TOL: percentage solved, three trials; and the number of broken rules), schema as a mental representation of the task (TOL: percentage solved, trial 1), and inhibition (TOL: initial thinking time, and GNG false alarms). Culbertson and Zillmer [49], in a group of children with ADHD, found that among different cognitive tests, all indicators in TOLDX (total move score, total time violation, total rule violation) fall under a single factor named executive planning/inhibition. Berg et al. [8] obtained three factors in their exploratory analysis of different measures of the TOL task in a group of students more and less experienced with the TOL task. The first factor, labelled move efficiency, was influenced mostly by the proportion of perfect solutions, optimal move score, and the number of extra moves, and also to some degree by total solution time. The second factor, labelled solution speed, was loaded by average time per move during the solution and total solution time. The third factor, identified as planning speed, was influenced by a single initial planning time measure.
Georgiou et al. [50] investigated the structure of planning functions in students' groups, using a computerized version of the TOL task and different planning tasks. Indicators of TOL were loaded in both obtained factors: action planning (total number correct) and operation planning (total number correct and initial time thinking). Miyake et al. [51], in the college students group, confirmed a three-factor model with inhibition, shifting, and updating for several tasks measuring simple executive functions. Inhibition contributed to performance on the Tower of Hanoi (TOH), as indicated by the total number of moves. Bender et al. [52] in research on different response selection and response inhibition measures in a student group, confirmed a two-factor model in which GNG (errors of commission) was part of the response inhibition factor.
There is little research on the factor structure of executive functions that includes the TOL and GNG tasks. In contrast to previous studies, we analyzed a broader spectrum of indicators in this study: ITT, ET, EM, NGE, and RTGR. As in the first study, we used a version of the TOL task with a more complicated problem space, which counteracts the ceiling effect. Additionally, we performed more detailed factor structure and ANOVA analyses for each level of difficulty in the TOL task, which, to our knowledge, has not been studied in previous research (for all peer-reviewed publications using the PEBL see: [26]).

Participants
One hundred and seven young adults participated in the second study. Recruitment and inclusion criteria follow the same principle as study 1. Similar to study 1, we used GHQ-30 to assess mental health problems, and Information and Picture Completion tests from the WAIS-R battery to assess crystallized and fluid intelligence. Three participants were excluded from the analysis due to a result greater than 1 SD from the norm on the GHQ (the cut-off points are 99.13 for people aged less than 30 and 95.69 for people aged 30-40 [32]). Two participants were excluded due to results in the intelligence tests. Seven participants were excluded due to missing data.
The Ethics Board of the Institute of Psychology of the University of Szczecin approved the research procedure, which followed the procedure from study 1. In study 2, participants took part only in single testing.

Tasks and Measurements
We used computerized versions of the TOL and GNG tasks from the PEBL. The description of both tasks is presented in Study 1. To make studies 1 and 2 comparable, we again analyzed only trials 1, 2, 3, 6, and 7 from the TOL task (for explanation see: point 2.2.2).

Statistical Analyses
Statistical analysis of the data was conducted using the IBM SPSS 25 Statistical package. Prior to the analyses, we used Box-Cox transformation for all variables to achieve the normality of the distribution [36]. For the investigation of the factor structure of both tasks, we used three TOL scores (ITT, ET, and EM) and two GNG scores (NGE and RTGR). Additionally, we performed detailed factor analyses for scores on each of five TOL trials (trials 1, 2, 3, 6, and 7) for ITT, ET, and EM. In both cases, we used principal components analysis with VARIMAX rotation. Factors with eigenvalues >1 (the Kaiser-Guttman criterion and scree plot) were retained, and factor loadings of 0.40 or greater were considered significant. For a more in-depth investigation of differences between trials, analysis of variance (ANOVA) was performed. We used pairwise comparison with Bonferroni correction and Greenhouse-Geisser correction for degrees of freedom.
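The extraction and rotation steps described above can be sketched in NumPy as follows. This is not the SPSS procedure used in the study: it is a plain principal components analysis of the correlation matrix with the eigenvalue > 1 criterion and a textbook varimax rotation without Kaiser normalization. Function names and the simulated data (95 cases, five variables driven by two latent dimensions) are hypothetical.

```python
import numpy as np

def pca_loadings(data):
    """Unrotated component loadings retained by the eigenvalue > 1 criterion."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # sort components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                         # Kaiser-Guttman criterion
    return eigvecs[:, keep] * np.sqrt(eigvals[keep]), eigvals

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation (no Kaiser normalization)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        LR = L @ R
        grad = L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                               # closest orthogonal rotation
        crit = s.sum()
        if crit_old != 0 and crit < crit_old * (1 + tol):
            break
        crit_old = crit
    return L @ R

# Simulated data: variables 0-2 follow one latent dimension, 3-4 another.
rng = np.random.default_rng(0)
latent = rng.normal(size=(95, 2))
data = latent[:, [0, 0, 0, 1, 1]] + 0.3 * rng.normal(size=(95, 5))

unrotated, eigvals = pca_loadings(data)
rotated = varimax(unrotated)
salient = np.abs(rotated) >= 0.40                # loading threshold from the text
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, it redistributes variance across components without changing each variable's communality; only the pattern of loadings, and therefore the interpretation of the factors, changes.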

Results
Descriptive statistics for performance in the first time point on TOL and GNG tasks are shown in Table 6. Exploratory factor analysis for both tasks showed a 2-factor solution (see scree plot: Figure S1), which explained 66.19% of the total variance. Rotated factor loading estimates are shown in Table 7. Factor 1, which accounted for 34.93% of the variance, was defined by ITT from TOL, NGE, and RTGR from GNG and labelled planning/inhibition. Factor 2, which accounted for 31.26% of the variance, was defined by EM and ET, and was labelled move efficiency.
Table 6. Descriptive statistics of performance on Tower of London (TOL) and Go/No Go (GNG) task in first timepoint.
Exploratory factor analysis for five TOL trials showed a 6-factor solution (see scree plot: Figure S2), which explained 77.00% of the total variance. Rotated factor loading estimates are shown in Table 8. Factor 1, which accounted for 16.34% of the variance, was defined by ITT for the third, sixth, and seventh trials, and was labelled strategic planning. Despite equivocal factor loading estimates, the remaining factors refer to separate trials. Factor 2, which accounted for 14.55% of the variance, was defined by EM and ET for the first trial and ITT for the second trial. Factor 3, which accounted for 13.27% of the variance, was defined by EM for the second trial, ITT for the first trial, and ET for the second and seventh trials. Factor 4, which accounted for 12.61% of the variance, was defined by EM and ET for the third trial. Factor 5, which accounted for 11.58% of the variance, was defined by EM and ET for the sixth trial. Factor 6, which accounted for 8.65% of the variance was defined by EM in the seventh trial ET. Factor loadings for all components are presented in Tables S5 and S6.

Discussion
The factor structure obtained for TOL and GNG measures had two factors that reflected planning/inhibition and move efficiency. The first factor grouped both GNG indicators, as well as ITT for TOL. Variable loadings were in opposite directions for RTGR and NGE, suggesting the occurrence of a speed-accuracy trade-off for the GNG task, which is a common effect in tasks performed under time pressure [53]. Initial thinking time is a measure of the time taken to plan all or part of the solution and can indicate both more thorough planning and ineffective planning [5]. Due to its shared variability with measures classically interpreted as indices of inhibition, we labelled the first factor planning/inhibition. Because of the correlation between ITT and ET, and the lack of correlation between ITT and EM (see: Table S3), in our view, a longer planning time could result either from problems with creating a plan [54] or from the use of a more perceptual strategy (simply making a next move that will bring the current state perceptually closer to the goal state) rather than a goal-recursion strategy (extensive goal management and setting up a series of subgoals to achieve the superordinate goal [51]). The second factor, which included EM and ET, represented the cognitive dimension that appeared to be relevant only for the TOL task and was labelled move efficiency. According to Berg and Byrd [5], extra moves measure the solution's efficiency, while execution time measures the speed of the solution and combines motor time taken for moving disks and cognitive time taken for additional on-line planning and error correction. In the version of the TOL task used in our research, correction of the errors required making additional moves; therefore, in our view, execution time grew with the number of extra moves due to both motor time and cognitive time.
In our study, different TOL measures represent different sources of shared variability, suggesting that they capture different executive functions [48]. Other researchers also found different factors for TOL and GNG [48][49][50], but direct comparison is limited due to differences in used versions of tasks, number of considered tasks, number of indicators for each task, and the type of sample.
More detailed factor structure analysis for different measures on each level of difficulty in the TOL task reveals a six-factor solution. The first factor is influenced by ITT for the three more difficult trials and represents strategic planning. For the first two trials, with no counter-intuitive moves involved, there is no need for a longer planning period before moving the first disk. Conversely, in more difficult trials, one needs more thorough planning at the beginning. Results of ANOVA also confirmed the difference between easier trials (first and second) and more difficult trials (third, sixth, and seventh) in terms of planning time. Besides the first, all remaining factors approximately capture ET and EM for each trial separately and can be interpreted as move efficiency for each level of difficulty. Results of ANOVA corroborate differences between all trials in terms of move efficiency. Our findings in both analyses show that strategic planning and move efficiency manifest in different ways depending on the level of trial difficulty (minimal number of moves to the solution and number of counter-intuitive moves [5]).

Limitations
It is essential to view these results in the context of their limitations. Future research should investigate other types of inhibition, such as proactive inhibition or other types of motor inhibition [13]. Although the research sample was not composed of students alone, but rather of young adults aged 20-40, results should not be generalized to older people, as much research shows that EFs decline with age [2]. Further research should examine the test-retest reliability and factor structure of the TOL and GNG tasks in older people, especially those in late adulthood. The present study concerns healthy individuals; therefore, results should not be directly generalized to clinical populations. There is a need for further investigation of this version of the TOL and GNG tasks in clinical samples, where test-retest reliability depends on a much greater number of factors (disease progression, fluctuation of neurological and psychiatric symptoms, and the treatments being used [17]) and where the factor structure may be somewhat different [55]. As with other research on the test-retest reliability of EF tasks [23,25], we must contend with small sample sizes, which limit the scope of generalization of the results. For the factor analysis of TOL and GNG indicators, the subject-to-variable ratio was generally within accepted limits (approximately 10 to 1 [56]), but for the factor analysis of indicators in different TOL trials, the sample was not large enough. Results obtained for smaller samples tend to be less stable and reliable than those for larger samples [57]. In both studies, we analyzed results of only five out of eight TOL trials, due to discrepancies in the number of moves in the optimal solutions between test and retest versions for three of the eight trials. Further research should use more trials, with systematic manipulation of different aspects of problem structure. We did not investigate the concurrent validity of the tasks in this research.
Although the convergent and differential validities of this version of the TOL task have been investigated in the context of other EF tests [26], there is a need for further research on the validity of different TOL and GNG indicators in the context of other planning tasks. Lastly, the TOL version in our research is a conventional test with low ecological validity; further research should investigate the concurrent validity of this version against naturalistic planning tasks [58].
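The subject-to-variable ratios mentioned among the limitations can be checked with simple arithmetic. The sketch below is illustrative only: the sample size of 95 comes from Study 2, but the variable counts are assumptions inferred from the indicators named in the text (five indicators for the combined TOL and GNG analysis; ITT, ET, and EM for each of five trials in the trial-level analysis), and the function name is hypothetical.

```python
def subjects_per_variable(n_subjects: int, n_variables: int) -> float:
    """Return the subject-to-variable ratio for a factor analysis."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return n_subjects / n_variables


N_SUBJECTS = 95        # Study 2 sample size, as reported in the text
RULE_OF_THUMB = 10.0   # approximate 10-to-1 guideline cited in the text

# Variable counts below are assumptions inferred from the indicators named
# in the text, not figures stated by the authors.
analyses = {
    "TOL + GNG indicators (ITT, ET, EM, NGE, RTGR)": 5,
    "TOL trial-level indicators (ITT, ET, EM x 5 trials)": 15,
}

for label, n_vars in analyses.items():
    ratio = subjects_per_variable(N_SUBJECTS, n_vars)
    verdict = "meets" if ratio >= RULE_OF_THUMB else "falls below"
    print(f"{label}: {ratio:.1f} subjects per variable ({verdict} the guideline)")
```

Under these assumed counts, the combined analysis sits well above the guideline while the trial-level analysis falls below it, which is consistent with the caution expressed above about the stability of the six-factor solution.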

Conclusions
We conducted two studies to verify the psychometric properties of commonly used tasks for planning and motor inhibition assessment. Knowledge of reliability over time and of the factor structure of cognitive tasks is important for practical application and is needed for adequate assessment. Overall, Study 1 shows that the investigated versions of the TOL and GNG tasks have ambiguous test-retest reliability coefficients. Moreover, ITT should be interpreted cautiously due to the occurrence of practice effects, whose strength can vary depending upon the trial difficulty level. Our results are in line with results obtained for various TOL and GNG tasks in similar samples over varying periods [23,24,26,27]. Study 2 shows the factor structure for TOL and GNG tasks with two factors: planning/inhibition and move efficiency. A more detailed factor structure analysis for TOL indicators in each trial shows a six-factor solution where the first factor, named strategic planning, grouped ITT for more difficult trials, while the remaining five factors, named move efficiency, grouped indicators for each trial separately. Similar to other research, TOL indicators were grouped in different factors [8], with planning time loading on the same factor as the GNG indicators [48].
TOL and GNG tasks are considered to capture planning and motor inhibition, respectively [59]. However, according to Lezak et al. [3], there are no pure measures of specific executive functions, and all functions, to different degrees, are involved in each task. Our results show that aspects of planning and motor inhibition appear in different ways in both tasks. Practice effects, suggesting strategy use, occur in more difficult TOL trials, but not in GNG or in less difficult TOL trials. Planning time in TOL loaded on the same factor as the GNG indices, which may suggest that inhibition plays an important role in thinking about how to solve the task. On the other hand, inhibition does not appear to be significant for the TOL task's execution, even though changing the strategy and additional on-line planning may occur during that period. Results of the analysis of factor structure and ANOVA for all trials in the TOL task suggest that the level of difficulty (easy vs. difficult trial) may moderate the degree to which ITT captures a specific aspect of executive functions.
Approaches treating executive functions as complex processes and as interrelated aspects are reflected in different theoretical models. Our results can be understood in the light of Diamond's theory, which assumes a hierarchical structure of executive functions [12]. Inhibition, being a more basic function, is involved in higher-level planning. According to Miyake et al. [51], the basic functions of inhibition, shifting, and updating are involved in solving complex tasks like TOL. The degree to which specific functions are involved in the task depends upon the chosen problem-solving strategy, which, according to Miyake et al. [51], can be influenced by the character of the instruction. In our view, the level of difficulty may also influence which strategy is chosen.
Our findings correspond with the general discussion about the interpretation of TOL indicators [7] and suggest that different indicators should be interpreted in connection with other methods (e.g., GNG), which gives a better chance of understanding the complexity of executive functions (planning, inhibition, effective performance). It is worth considering versions of the task that have proven reliability coefficients and that allow for systematic manipulation of problem structure (level of difficulty, i.e., the minimal number of moves to the solution and the number of counter-intuitive moves) and for the calculation of more indicators.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/brainsci11111420/s1, Figure S1: Exploratory factor analysis for both tasks showed a 2-factor solution; Figure S2: Exploratory factor analysis for five TOL trials showed a 6-factor solution; Table S1: Performance in Tower of London (TOL) task for eight trials, across repeated measurements: Initial Thinking Time (ITT) and Execution Time (ET); Table S2: Significance of p value for three indicators of Tower of London (TOL): post hoc for main effect of "Trial", and pairwise comparisons for main effect of "Time" and for interaction between "Time" and "Trial"; Table S3: Correlation matrix for Tower of London (TOL) and Go/No Go (GNG) task performance measures; Table S4: Performance in Tower of London (TOL) task for eight trials: Initial Thinking Time (ITT), Execution Time (ET), and Extra Moves (EM); Table S5: Factor loadings of Tower of London (TOL) and Go/No Go (GNG) task; Table S6: Factor loadings of five trials of Tower of London (TOL).
Author Contributions: Conceptualization, methodology, formal analysis, investigation, resources, data curation, writing-original draft preparation, visualization, project administration, E.T.; conceptualization, methodology, formal analysis, investigation, resources, data curation, writing-original draft preparation, visualization, project administration, M.K.; methodology, formal analysis, writing-review and editing, P.K.; formal analysis, writing-review and editing, S.R.; writing-review and editing, supervision, S.T.M. All authors have read and agreed to the published version of the manuscript.