2. Materials and Methods
In order to develop this tool, we first constructed original tasks referring to the basic CT dimensions. Robinson [
30] states that at least three questions are needed to measure a characteristic on a scale and recommends four as a safety margin; we aimed for five tasks per dimension in the final assessment tool for greater reliability. For scale construction, the initial pool of questions is recommended to be at least double the final number [
31]; we aimed at three times the desired final number for greater safety. Thus, the design of the tool included 90 tasks, corresponding to 15 per dimension.
After creating the tasks, all of which are original and were developed as part of this research, we first administered them in a pilot study to a small sample of students to identify any inaccuracies and flaws in the research design and tasks. After the necessary corrections, replacements, or modifications of tasks, the final bank of 90 questions was created and used in the main research.
As Boateng et al. [
31] point out, constructing a scale requires a large sample, which can be 10 participants per question or a total of 200–300 individuals for factor analysis. In the main survey, bearing in mind the targeted final 30 tasks out of the initial 90, we ended up with a sample of more than 500 students.
Our methodology centered on the reliability index (Cronbach’s α) of the entire questionnaire. Entering the responses of all participants into SPSS, we ran the reliability analysis (reliability statistics) repeatedly, each time requesting the recalculated value of the index if a task were deleted, removing one task at a time, and running the same test again. As a stopping criterion, we set the point at which we would be left with the minimum of five questions in each dimension, or the point at which we judged that further removal would make the planned tool unreliable (Cronbach’s α < 0.80).
Throughout the research, particular emphasis was placed on ethical issues: participants were thoroughly informed, the necessary permissions were obtained (e.g., access to schools, parental consent), and particular attention was paid to maintaining anonymity and informing participants about the possible context of presentation or publication of the results. At all stages, participants were free to decide whether to take part in the research, with the ultimate goal of achieving informed consent [
32].
The ethics of our research required complete anonymity of the participants, while at the same time we had to be able to link the research data collected in more than one meeting. For this reason, codes were used, and the matching was performed by the school teacher, so that the researchers could not identify individual participants from the results. Tablets and Google Forms were used for the research and for collecting the questionnaire data. This approach had advantages, such as avoiding errors and saving time (answers were immediately available in digital form, which also facilitated student participation).
2.1. Creation of the Initial Pool of Tasks
The tasks were constructed as part of this research, drawing on the researchers’ theoretical study of the literature at an academic level as well as on their work in the field of teaching and didactics of computer science. The first two steps of the four-step framework proposed by Li et al. [
33] were used, which concern the construction and attribution of each task to a CT concept, as well as a review of the proposed tasks by experts to confirm their correlation with the specific CT concepts.
The tasks’ construction took into consideration the basic literature from which the CT concepts referred to arose, as already mentioned in the theoretical part [
1,
15,
25,
27]. At this point, we should note that, in our opinion, there is never just one CT concept in a task, but several coexist, which is an argument that also appears in the literature [
34,
35]. Rowe et al. [
36] provide examples of logic tasks that sometimes also refer to algorithmic thinking, pattern recognition, or abstraction. In constructing the tasks, we used ideas and variations from Rowe et al. [
36] and Li et al. [
33]. Our tasks could not avoid the writing of algorithms, instructions, and commands in a specific order and form, mainly for the concept of algorithmic thinking, but also for some of the other dimensions of CT. For this reason, we decided to create an original “programming” environment in which shapes would move and which would contain basic objects. Based on the two-dimensional approach of Scratch- and Logo-like environments, we designed a micro-world consisting of the following: a two-dimensional maze with distinct and easily measurable steps; a male and a female figure (both abstract); a broom; a bucket of paint; four basic commands that move the figure right, left, up, and down; and programming structures representing the simple selection structure, the complex selection structure, and the repetition structure.
The process followed specific steps: initial recording of ideas for each CT concept; design and discussion between the two researchers on the categorization and form of each task; construction of the tasks and the required graphics; formatting in Microsoft Word and entry into a corresponding Google Form to make the tasks accessible on the internet; and review by two experts to confirm that each question does indeed belong to the concept invoked by the researchers who designed and constructed it. Thus, 90 tasks were finally constructed, which were divided into six categories, creating six distinct questionnaires.
2.2. Pilot Implementation of Initial Tasks’ Pool
A provincial secondary school with three classes and a total of 70 participants took part in the pilot study. Despite our best efforts, not all 70 participating students completed all of the questionnaires: to avoid the expected fatigue, the questionnaires were not all completed in consecutive hours, and some students were absent even on the days when the researchers returned.
Table 1 shows the participants. In the end, 41 students completed all six questionnaires and 23 students completed five out of the six questionnaires.
Table 1 shows that the 70 participants completed 383 questionnaires in total. With six questionnaires per participant, the maximum possible is 420 questionnaires, giving a completion rate of 383/420 ≈ 0.91; applied naively, this rate would suggest about 63 participants with all six questionnaires completed. However, as mentioned above, only 41 students completed all six questionnaires.
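The completion-rate arithmetic above can be checked with a few lines of Python (a trivial sketch of the calculation reported in the text):

```python
# Completion-rate check for the pilot study figures reported above.
participants = 70
completed_questionnaires = 383
max_questionnaires = participants * 6        # six questionnaires each -> 420

rate = completed_questionnaires / max_questionnaires
print(round(rate, 2))                        # → 0.91
# Naive expectation of students completing all six questionnaires:
print(int(rate * participants))              # → 63 (the observed number was 41)
```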
All questions on the Google Forms were mandatory, thus ensuring that no questions would remain unanswered, as can happen with printed questionnaires. To avoid hasty or random answers, all questions, without exception, included a fifth option, “NO ANSWER”, which we encouraged students to use whenever they could not or did not wish to answer a question.
We used a progress sheet for each class, which recorded details of the school and class, the dates and times we visited, the students’ codes, absences, and which questionnaire each student had completed. We also kept a free-form observation sheet recording the date, time, and class, so that if we needed any clarification, we could return to the same class at a later date. During the observations, we kept notes of the students’ questions and, through discussions with some of them, identified which tasks they found easy, difficult, or fun.
The data were stored in spreadsheets on Google Drive and downloaded immediately to ensure maximum data integrity. The data were processed in Excel, where an auto-correct function assigned one mark to each correct answer and zero marks to an incorrect answer or “NO ANSWER”.
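The automatic scoring step can be sketched as follows (a minimal Python illustration of the logic; the actual processing was done with Excel formulas, and the task names, answer key, and responses below are hypothetical):

```python
# Minimal sketch of the 1/0 auto-scoring applied to the exported responses.
# The answer key and the student's responses are hypothetical examples.

def score_response(answer: str, correct: str) -> int:
    """Return 1 for a correct answer, 0 for a wrong answer or 'NO ANSWER'."""
    if answer == "NO ANSWER":
        return 0
    return 1 if answer == correct else 0

def score_questionnaire(responses: dict, answer_key: dict) -> int:
    """Total score: one mark per correctly answered task."""
    return sum(score_response(responses.get(task, "NO ANSWER"), correct)
               for task, correct in answer_key.items())

answer_key = {"AT-1": "B", "AT-2": "D", "AT-3": "A"}
student = {"AT-1": "B", "AT-2": "NO ANSWER", "AT-3": "C"}
print(score_questionnaire(student, answer_key))  # → 1
```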
2.3. Main Research
Concerning the research sample, the initial target was a large number of students aged approximately 12–13 (first year of secondary school). We followed the convenience sampling method [
32], including schools from both urban centers and smaller towns, and even two regional schools. As seen in
Table 2, the initial number of participants was 521, but the survey was ultimately completed by 452 individuals (86.76%). This high completion rate was due to our persistence in returning to the schools, but also to the fact that the students themselves expressed their interest and wanted to participate.
The research design described in the pilot study was also used in this research, with the final questionnaires in Google Form format, student codes, and tablets for completing the tasks. The answers per student (code) from all questionnaires were collected, checked for errors, corrected, and statistically processed. The processing included statistics per question, to show whether it was too easy or too difficult, as well as statistics per student, to detect any unreliability in the answers (e.g., a questionnaire with 80% “NO ANSWER” responses would be deemed unsuitable for inclusion in our results). Once again, notes were kept on students’ questions and reactions in an attempt to evaluate the tasks (what was very difficult, what was incomprehensible, what was fun, etc.). A second researcher participated as an observer, helping to record questions asked in the classroom and any malfunctions, and also providing his own perspective.
The responses of the students included in the final sample were then statistically processed, following the removal of some entries. The statistical processing included the reliability of the questionnaire and, more specifically, its internal reliability. The following procedure was followed:
Reliability Analysis was performed repeatedly on the entire questionnaire, and in each analysis, SPSS was asked to display Cronbach’s alpha for all tasks, as well as the value of the index if a task was removed (“Scale if item deleted”).
In each iteration, one task was removed and we checked the index according to the previous step.
The process stopped when we reached the minimum number of five tasks that we had initially selected as the limit per CT concept, or if the proposed index fell below the threshold of 0.80, which is considered the threshold for acceptable internal reliability [
37].
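The iterative “alpha if item deleted” procedure can be sketched in Python (a minimal illustration of the logic, not the SPSS implementation; the tie-breaking by researcher judgment is omitted, and the task names and score columns below are hypothetical toy data):

```python
import statistics

def cronbach_alpha(columns):
    """Cronbach's alpha for a list of item-score columns (one list per task)."""
    k = len(columns)
    totals = [sum(row) for row in zip(*columns)]
    item_var = sum(statistics.variance(col) for col in columns)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

def greedy_item_removal(items, min_per_concept=5, alpha_floor=0.80):
    """Repeatedly remove the task whose deletion yields the highest alpha,
    stopping at the per-concept minimum or the reliability floor."""
    items = dict(items)  # task name -> (concept, score column)
    while True:
        best = None
        for name, (concept, _) in items.items():
            if sum(1 for c, _ in items.values() if c == concept) <= min_per_concept:
                continue  # removal would leave this concept with too few tasks
            trial = [col for n, (_, col) in items.items() if n != name]
            alpha = cronbach_alpha(trial)
            if best is None or alpha > best[1]:
                best = (name, alpha)
        if best is None or best[1] < alpha_floor:
            return items  # stopping criterion reached
        del items[best[0]]

# Hypothetical toy data: three tasks of one concept, minimum of two kept.
tasks = {
    "AT-1": ("algorithmic", [1, 0, 1, 0]),
    "AT-2": ("algorithmic", [1, 0, 1, 0]),
    "AT-3": ("algorithmic", [0, 1, 0, 0]),
}
kept = greedy_item_removal(tasks, min_per_concept=2)
print(sorted(kept))  # → ['AT-1', 'AT-2']
```

In this toy run, AT-3 is the task whose removal maximizes alpha, so it is dropped first, after which the per-concept minimum stops the process.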
Following the construction of the new DACT CT assessment tool, an exploratory factor analysis (EFA) was conducted to examine the dimensionality of the questionnaire, using the Jamovi v.2.6.44.0 tool. The EFA was conducted using the initial main research sample presented in
Table 2 (N = 452).
2.4. Validation
Following the initial development of the new CT assessment tool, we proceeded with the validation of the DACT CT assessment tool. For the validation process, we searched for independent CT assessment tools in Greek, but none were found in our review of CT assessment [
22]. Few attempts have been made to assess CT in Greek [
38,
39,
40,
41], but they either refer to a different age group from that of our research sample or are not autonomous CT assessment tools, instead proposing a CT assessment methodology within the context they analyze.
Thus, among the assessment tools studied, the Computational Thinking Test (CTt) [
42] was selected as the most suitable for validating the DACT CT assessment tool we created. The main reason for this selection is that the CTt has already undergone several validation checks, and its suitability for CT assessment has thus been established by research. Several studies have already been conducted on the CTt to describe and validate it [
43], such as content validity [
42], descriptive statistics, reliability, criterion validity [
44], convergent validity [
43], predictive validity [
45], and cross-cultural validity [
44]. Furthermore, research has been conducted on instructional validity and the sensitivity of measurements depending on whether the assessment is conducted before or after teaching, and it has been studied both in an unplugged teaching environment [
43] and in the Scratch programming teaching environment [
46,
47]. In addition, it has been studied in other programming environments such as Penguin Go [
48] and Bomberbot [
49]. Finally, CTt has undergone validation checks using the IRT approach [
50].
We can argue that the CTt belongs to the programming-based assessment tools and, in fact, uses a specific, existing programming environment. The CTt assesses CT mainly through the programming approach, and a comparison with the DACT tool would be expected to show a positive correlation. Of course, we should mention that the creators of the CTt state that they are theoretically based on Aho’s definition of CT [
51], discuss the fundamental role of algorithmic thinking in CT, and refer to the primary position of programming in relation to CT, while clarifying that unplugged environments for CT are also possible [
44]. The CTt tool does not require the use of a computer, as it can also be administered on paper; its questions (28 in total) are each based on an image, and the answers are all in multiple-choice format.
The CTt is an already validated tool for assessing CT, based on a smaller number of CT concepts than our approach in the DACT tool and referring mainly to programming concepts. However, since both tools aim to assess CT as a whole, their results can be compared on a common sample and conclusions can be drawn about their possible correlation.
2.4.1. Adaptation of the CTt Tool in Greek
For the purposes of this study, the CTt was translated and adapted into Greek so that it could be used to validate the DACT tool. Before using and adapting the CTt, we communicated with the research team that created it and received their approval for our research purposes. Our research objectives in this section are the translation of the CTt into Greek and its cultural adaptation.
For the translation into Greek and the cultural adaptation of the CTt tool, we followed a process of translating the tool from English into Greek (forward translation) and back from Greek into English (backward translation). The translation was performed with attention to correct language use, cultural elements, and scientific content, as recommended by Borsa et al. [
52]. A positive factor that facilitated this process is that CTt does not use a large amount of text, but mainly uses images from a Scratch-type programming environment.
Furthermore, as the programming environment from Code.org was used for the initial construction of the CTt, and this environment already has a translation available in Greek, we did not need to translate the images and commands into Greek. We only created the questions in the original environment in Greek from the outset, using the official translation of the block commands provided by Code.org.
As Hambleton [
53] points out, issues related to cultural, idiomatic, linguistic, and content differences should be taken into account during translation. He also states that translators should be bilingual [
52,
54]. For this reason, two translators who met the criteria were used, who initially translated CTt into Greek. No discrepancies were found, due to the fact that there is not a large volume of text, and the commands within the blocks had already been translated by the official programming environment.
Beaton et al. [
54] suggest that one translator should be familiar with the tool’s content area (terminology), while the other should not. They argue that, in this way, the familiar translator tends to use scientific terminology, while the unfamiliar translator uses terminology that the average user of the tool will understand. For this reason, a third translator was used, whose translation did not differ from those of the first two, familiar translators.
Before reaching the final form of the tool in Greek, a minor adjustment of cultural elements was made. The figure referred to as “Artist” in English, literally translated as “Καλλιτέχνης” in Greek, was ultimately rendered with the term “Μαθητής” (“Student”), as its involvement with the program’s pen and the drawing of mainly lines and rectangular shapes does not evoke an “Artist” in Greek culture.
For the selection of the sample, we followed the convenience sampling method [
32], choosing a school in the city of Heraklion. The translated CTt tool was administered in a pilot study to a class of 24 students.
For the administration of the translated CTt tool, we followed the instructions of the CTt creators, who recommend a predefined completion time and the familiarization of participants through three initial examples.
The Greek version of the CTt was coded in Google Form format so that it could be accessed remotely in electronic form. This format was also chosen by the creators of the original tool, as it facilitates administration and the electronic collection of responses. Participants completed the Greek version of the CTt within one teaching hour. Subsequently, the students were asked questions to assess their understanding of the tool’s items. The recording and analysis of the responses did not reveal any need for modifications, so the initial Greek version of the CTt used in this study also constituted the final version of the CTt tool in Greek used in the DACT validation research.
2.4.2. Validation Research
In the previous sections, we described the development of the DACT assessment tool, as well as the translation, cultural adaptation, and pilot implementation of the CTt tool. The CTt is an already validated tool for assessing CT, referring mainly to programming concepts and covering fewer CT concepts than the DACT. However, since both tools aim to assess CT as a whole, their results can be compared on a common sample and conclusions can be drawn about their possible correlation.
Validity is important when creating a measurement tool and is a key to effective research; there are several different kinds of validity [
32]. The main research questions of this section of our research are as follows:
Does the DACT tool have internal reliability?
Does the DACT tool cover the domain it purports to cover (content validity)?
Does the DACT tool seem to measure what it is designed for (face validity)?
Is there a correlation between the DACT tool and the CTt tool?
Does a high correlation coefficient exist between the scores on the DACT tool and the scores on other accepted tests of the same performance (criterion validity)?
Do the DACT tool results concur with results from other tests or instruments that assess CT (concurrent validity)?
Some of these research questions, as detailed below, have already been answered during the research procedures described in the previous sections.
The reliability of the DACT tool was tested using Cronbach’s α during the tool construction process, which we analyzed in a previous section. Through successive repetitions of the corresponding test in SPSS, guided by the “Cronbach’s Alpha if item deleted” column of the SPSS output, we arrived at the final DACT tool of 36 questions with an index value of 0.926. Thus, the reliability analysis has already shown that the DACT tool is reliable and, in fact, has a very high index value for its internal reliability.
Furthermore, during the construction of the DACT tool, and prior to data collection, its content validity was ensured, as mentioned in the respective chapters on task construction. The first two steps of the proposed four-step framework by Li et al. [
33], which concern the construction and attribution of each question to a CT concept, as well as a review of the proposed tasks by experts to confirm their correlation with the specific CT concepts, were applied. Furthermore, for the construction of the initial tasks, the relevant literature was taken into account, as well as the CT evaluation efforts already mentioned in the theoretical part. Finally, in the research process that led to the final DACT tool of 36 questions, the criterion was set that no CT concept should have fewer than five questions (out of the initial 15 corresponding to each one). This design, the expert review, the support from the literature, and the fact that no concept was left with fewer than five questions allow us to claim that the DACT tool has content validity.
Regarding the face validity of the DACT tool, an initial assessment was made, as mentioned above, both during the construction of the initial tasks and during the construction of the final DACT questionnaire, since at some points it was necessary to decide which of two or more equally plausible tasks would be removed from the tool.
In the following research, we describe the process of validating the DACT tool in terms of criterion validity, and, in particular, concurrent validity, which is based on a comparison with the already validated CTt tool. In other words, the research will provide answers to our last three research questions, while the first three have already been answered in the above paragraphs.
In order to study criterion validity and concurrent validity, two tools were administered simultaneously:
The DACT tool in its final form
The CTt tool in its Greek version, as described in the previous section
The two tools were administered to students, the results of the two tools were calculated, and the results were subjected to correlation tests using Pearson’s r coefficient. We followed the convenience sampling method [
32] for this research. Two secondary schools in the city of Heraklion were selected, and a total of 119 students from these schools participated in the study. Of these, 111 ultimately completed the CTt in Greek and 112 completed the DACT. Of the 111 individuals who completed the CTt, seven did not complete the DACT, while of the 112 individuals who completed the DACT, eight did not complete the CTt. These 15 individuals in total were therefore removed from the statistical analysis, resulting in a sample of 104 individuals, out of the initial 119, who had completed both questionnaires.
The design of this research followed the administration of the tools in a manner similar to that used in our previous studies. The assessment tools were available online in Google Forms, and each student had to use a code to start the process. The codes ensured the anonymity of the participants and were provided by the school teacher, but also allowed us to link the answers given by the same student in the two questionnaires. Tablets were used again, with the DACT homepage [
https://dact.pre.uth.gr/en/] as the start page, from which the two Google forms were accessed.
Once participation was complete, the data were available electronically to the researcher, and processing began. Automatic answer-correction functions in Excel were used, as both questionnaires consisted of multiple-choice questions, making them easy to correct automatically. The results were then entered into SPSS and, using Pearson’s r correlation coefficient, we reached the corresponding conclusions regarding the criterion validity and concurrent validity of the DACT tool.
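The correlation step can be illustrated with a small Python sketch (the paired scores below are hypothetical; the actual analysis was run in SPSS):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical DACT and CTt total scores for the same six students,
# matched by code (rows paired as in the merged worksheet):
dact = [30, 22, 18, 34, 25, 28]
ctt = [24, 18, 15, 26, 20, 23]
print(round(pearson_r(dact, ctt), 2))  # a strong positive correlation
```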
It took about two teaching hours to complete both questionnaires. Some students needed more time and others less. In most cases, the questionnaires were completed on the same day, while in some cases we had to come back to them. In any case, the completion of both questionnaires by those students who did not finish them on the same day took place within two to three days, so that the answers given were close in time.
Microsoft Excel was initially used to process the students’ responses, and the initial scoring of the assessment tools was performed using automated correction functions. The data were also entered into SPSS (IBM SPSS Statistics v. 29.0.0.0), where the analysis was performed and the results were extracted. Once the data collection was complete, we first examined the individual data files and compared them with the research recording sheets. As in the pilot study, we observed some minor errors in the codes, mainly due to typos (e.g., someone typed “HUVPHC” instead of “HVUPHC”). These errors were easily identified and corrected, as we had assigned specific codes to each school and class, noted the date and time of intervention in each school, and the computer files of the Google form responses had timestamps, so we could easily identify any questionnaire that had an incorrect code. Note that the code is the only means of identifying the participating student, and it is used to combine their answers in the two questionnaires.
After correcting any code errors, we merged the individual files of the students’ answers in Excel. To ensure that this process was carried out correctly, we sorted each of the two individual files by code and then copied each questionnaire into a new file, side by side on the same worksheet, so that each row corresponded to a specific participant. Where a code was missing (as not all students completed both questionnaires), we left the cells corresponding to that questionnaire blank. Since there were only two questionnaires per participant, this process was fairly quick and simple.
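The merge-by-code logic can be sketched in Python (a minimal illustration; the student codes and scores below are hypothetical, and the actual merging was done in Excel):

```python
# Minimal sketch of merging the two response files by student code.
# Codes and scores are hypothetical examples.
dact_scores = {"HVUPHC": 28, "KLMQRS": 31, "ABCDEF": 19}
ctt_scores = {"HVUPHC": 22, "ABCDEF": 17, "ZZTOPX": 25}

# One row per code seen in either file; None marks a missing questionnaire,
# mirroring the blank cells left in the Excel worksheet.
all_codes = sorted(set(dact_scores) | set(ctt_scores))
merged = {code: (dact_scores.get(code), ctt_scores.get(code))
          for code in all_codes}

# Only students with both questionnaires enter the correlation analysis.
complete = {c: v for c, v in merged.items() if None not in v}
print(sorted(complete))  # → ['ABCDEF', 'HVUPHC']
```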
Once all student entries had been correctly entered into Excel and the individual results for each assessment tool had been extracted, statistical processing was carried out using SPSS.
3. Results
3.1. Creation of an Initial Tasks’ Pool
At this stage of the research, the result was a basic pool of 90 tasks, which were divided into six categories, thus creating six distinct questionnaires. The six questionnaires correspond to the six main CT concepts, which we have analyzed in the Introduction: algorithmic thinking, abstraction, decomposition, generalization and patterns, evaluation, and logic. Each questionnaire consists of 15 tasks, making a total of 90 initial tasks, which were used in the pilot study.
3.2. Pilot Implementation of the Initial Tasks’ Pool
The pilot research helped us test the research design in practice and identify specific difficulties and shortcomings. It verified that electronic collection of responses via the internet is feasible and presented no problems, while being faster and less error-prone than paper administration.
The pilot study provided an opportunity to adapt many tasks and correct them so that students would not have difficulty understanding them. Visualization of the data proved to be good practice, while the use of organizational structures, such as tables or bullet points, also helped students understand the tasks more easily. The way the questions were written also seemed to play a significant role in how well each participant understood them, and for this reason several tasks were reworded. Some incorrect code entries were also observed, but these were mainly due to transpositions. Applying the above, this research resulted in a final pool of 90 tasks, divided into six individual questionnaires on algorithmic thinking, evaluation, decomposition, generalization and patterns, logic, and abstraction, which were used in the main research.
3.3. Main Research
In the main research, after collecting, correcting, and coding the data, they were transferred to SPSS. The dataset consists of 92 variables: the code, the school, and the 90 tasks answered by each participant.
The analysis began with the removal of two variables due to zero variance. In the first statistical analysis, we therefore have 88 questions, and Cronbach’s Alpha has a value of 0.933. Following the suggestion of “Cronbach’s α if item deleted”, we chose to remove the variable whose removal leaves the highest reliability index. When two or more variables showed the same value, different removal paths were possible: the options for removing one question at a time open up a tree of alternatives whose branches grow exponentially. Most of the time, we ended up back at the same point after a few steps (the same questions had been removed, but in a different order), so we continued from that point. At some points, where the decision was made based on the researchers’ experience, we also relied on our notes from the process (student questions and discussions) in order to select one question over another for removal. The process ended after 53 steps, triggered by the termination condition ensuring that no concept would be left with fewer than five tasks in the final questionnaire.
Table 3 shows some steps of this process at the beginning, in the middle, and at the end.
We thus concluded with the final questionnaire, which ultimately consists of 36 tasks corresponding to the six dimensions of CT: algorithmic thinking (8), evaluation (5), decomposition (5), generalization and patterns (7), logic (6), and abstraction (5). Cronbach’s Alpha coefficient is 0.926, which is a very high value for internal reliability.
Table 4 shows the final tasks per concept.
The final questionnaire is available in text format (.docx) and in portable document format (.pdf) for printing, while it has already been formatted appropriately for use via the internet and is available in Google Form format. The final questionnaire is accompanied by a corresponding administration protocol, and in its final form it will be available on the project’s website in all formats for use, after the appropriate permission has been obtained from its creators.
Finally, our notes of students’ questions and reactions, as well as discussions with some of them, allowed us to argue that students find questions involving movement and shapes, accompanied by images, easier and more entertaining, while they find questions with more text and no other means of representing the information more difficult.
Following the construction of the proposed DACT tool, an exploratory factor analysis (EFA) was conducted to examine the dimensionality of the questionnaire. The sample used in this analysis is the initial, main research sample (N = 452) due to its large number of participants. The results indicated a single-factor solution (unidimensional structure), as only one factor had an eigenvalue greater than one and the scree plot showed a clear inflection point.
Figure 1 provides the EFA scree plot.
Table 5 shows the factor loadings, in which we can see a single factor (Factor 1) and the relation of each variable to it. The principal axis factoring extraction method was used in combination with an “oblimin” rotation. Values greater than 0.4 (in absolute terms) usually indicate a strong relation, and tasks presenting such values are considered to define the factor. Three of the final tasks, ABS-3, ABS-6, and ABS-13, present low factor loadings (less than 0.30). As seen in
Table 3, ABS-13 was proposed for removal in the final step (its removal would have left the concept of Abstraction with fewer than five tasks), and ABS-3 and ABS-6 were equally proposed for removal one step before the end of the process.
Table 6,
Table 7 and
Table 8 show the values for KMO Measure of Sampling Adequacy, Model Fit Measures and Bartlett’s Test of Sphericity, respectively.
Table 6 gives detailed information on the KMO (Kaiser–Meyer–Olkin) Measure of Sampling Adequacy, a score indicating whether the data are suitable for factor analysis. Values from 0.5 to 0.6 are generally acceptable, 0.6 to 0.7 mediocre, 0.7 to 0.8 good, and values above 0.8 excellent. As Table 6 shows, all the per-item values fall in the excellent range, and the overall KMO is 0.902, so the data can be characterized as excellent for factor analysis.
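For readers who want to reproduce such a check outside SPSS, the overall KMO statistic compares squared correlations with squared partial (anti-image) correlations. A minimal sketch, on simulated data rather than the study’s:

```python
import numpy as np

def kmo_overall(scores: np.ndarray) -> float:
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    r = np.corrcoef(scores, rowvar=False)
    inv = np.linalg.inv(r)
    # Anti-image (partial) correlations from the inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    off = ~np.eye(r.shape[0], dtype=bool)
    r2 = (r[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

rng = np.random.default_rng(1)
latent = rng.normal(size=(452, 1))
items = 0.7 * latent + 0.7 * rng.normal(size=(452, 36))
print(round(kmo_overall(items), 3))  # close to 1 for strongly one-factor data
```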
Table 7 provides measures of how well the model explains the data. RMSEA (Root Mean Square Error of Approximation) measures discrepancy per degree of freedom; a value below 0.08 is usually considered good, while a value below 0.06, as in our case, is considered excellent. The chi-square test assesses whether the model fits the data exactly; although this test is sensitive to sample size, it is statistically significant in our case (
p < 0.001).
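The RMSEA cited here is a simple function of the model chi-square, its degrees of freedom, and the sample size. A sketch with purely illustrative numbers (not the study’s actual fit statistics), noting that a one-factor model on 36 items has df = [(36 − 1)² − (36 + 1)] / 2 = 594:

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root Mean Square Error of Approximation from the model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Illustrative values only: chi2 = 700 on df = 594 with N = 452
print(round(rmsea(700.0, 594, 452), 3))  # 0.02, below the 0.06 "excellent" cut-off
```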
Finally,
Table 8 reports Bartlett’s Test of Sphericity, a statistical test of whether the correlation matrix differs from an identity matrix and thus whether factor analysis is justified. A statistically significant result, as in our case (
p < 0.001), means the variables are correlated enough to proceed with dimension reduction, justifying the use of the EFA.
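Bartlett’s statistic can likewise be reproduced directly from the correlation matrix: χ² = −(n − 1 − (2p + 5)/6) · ln|R| on p(p − 1)/2 degrees of freedom. A sketch on simulated data (not the study’s):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(scores: np.ndarray):
    """Bartlett's test that the correlation matrix is an identity matrix."""
    n, p = scores.shape
    r = np.corrcoef(scores, rowvar=False)
    sign, logdet = np.linalg.slogdet(r)   # numerically stable log-determinant
    stat = -(n - 1 - (2 * p + 5) / 6) * logdet
    df = p * (p - 1) // 2
    return stat, df, chi2.sf(stat, df)

rng = np.random.default_rng(2)
latent = rng.normal(size=(452, 1))
items = 0.7 * latent + 0.7 * rng.normal(size=(452, 36))
stat, df, p_value = bartlett_sphericity(items)
print(df, p_value < 0.001)  # correlated variables justify dimension reduction
```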
3.4. Validation
Following the development of the DACT tool, we proceeded to its validation. In the previous paragraphs, we described the process of selecting a validated CTt assessment tool and translating and adapting it into Greek for use in validating the DACT tool. As a result, the CTt will now also be available in Greek: the results of our research will be delivered to its creators along with the Greek version of the tool, which can then be used by other Greek research efforts.
Thus, this work strengthens the effort to assess CT in the Greek context in yet another way. At this point, we should emphasize again that the two tools (CTt—DACT) do not refer to exactly the same CT concepts, so the use of one does not exclude the use of the other; but as both aim to assess CT as a whole, we can use them together and (as we will see in the next section) compare their results. Moreover, combining multiple assessment methods is currently considered the most appropriate and reliable approach to CT assessment [
22].
To investigate concurrent validity, we computed Pearson’s product–moment correlation coefficient between the total scores of the two assessment tools (DACT, range 0–36; CTt, range 0–28). Data from 104 students who completed both assessments were analyzed. Both variables met the assumptions for Pearson’s correlation.
A strong positive correlation was found between the two measures, r (102) = 0.80, p < 0.001, 95% CI [0.70, 0.86], indicating that higher scores on the DACT tool are associated with higher scores on the CTt. This result supports the concurrent validity of DACT, as it correlates highly with an established measure of computational thinking (CTt), consistent with criterion validity.
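The reported interval is a standard Fisher z confidence interval around r. A sketch of the computation, on simulated paired totals rather than the study’s data:

```python
import math
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, conf=0.95):
    """Pearson's r, its p-value, and a Fisher z confidence interval."""
    r, p = stats.pearsonr(x, y)
    z = math.atanh(r)                      # Fisher transform
    se = 1.0 / math.sqrt(len(x) - 3)
    crit = stats.norm.ppf(0.5 + conf / 2)  # about 1.96 for 95%
    lo, hi = math.tanh(z - crit * se), math.tanh(z + crit * se)
    return r, p, (lo, hi)

# Simulated DACT/CTt totals for 104 students sharing a common latent trait
rng = np.random.default_rng(3)
trait = rng.normal(size=104)
dact = trait + 0.6 * rng.normal(size=104)
ctt = trait + 0.6 * rng.normal(size=104)
r, p, (lo, hi) = pearson_with_ci(dact, ctt)
print(round(r, 2), p < 0.001, lo < r < hi)
```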
4. Discussion
In our research, we took the existing literature into account, including similar efforts and suggestions for the format of the tasks in each category, and produced an original pool of 90 tasks for assessing CT. These tasks formed the basis of the pilot study, from which, with minimal corrections, the final task pool emerged and was used in the main research. The main research yielded the final CT assessment questionnaire of 36 questions, with a very high Cronbach’s alpha internal reliability index, and we hope to make it available to the scientific community soon to fill the gap noted in the literature [
22].
The construction of the original CT assessment tasks of DACT is theoretically based on the six CT concepts we mentioned: algorithmic thinking, abstraction, decomposition, generalization and patterns, evaluation, and logic. However, the EFA results did not differentiate six distinct factors, but rather indicated a single-factor solution (unidimensional structure). So, in this phase of our research, the DACT assessment tool appears to be unidimensional: it does not assess distinct CT skills but treats CT as a single concept and skill. In the “Creation of the initial pool of tasks” section, we have already discussed that CT concepts coexist to a large extent and cannot be completely separated, as other researchers also point out and explain [
34,
35,
36]. We believe that the specific field of separating the concepts, as well as their temporal dependence (whether they always appear together, some before, some after, and in what way) should be the subject of extensive research in the future.
At the same time, DACT is detached from purely programming concepts without ignoring basic programming structures. It is not based on a specific programming environment and therefore requires no knowledge of programming or of a particular programming language; in tasks that use scenery, forms, and programming commands, these elements come not from an existing programming environment but from the DACT micro-world created for this research and tool. Besides sparing students from having to learn a programming environment in order to be assessed within it, this also avoids unfairly favoring students who may already be familiar with a specific environment over those using it for the first time.
Furthermore, as mentioned in the design of the questions, the setting and forms are deliberately abstract so as not to distract attention, while movement in two dimensions is considered sufficient and no attempt was made to involve students in the three-dimensional world, as such a need does not arise either intuitively or from our literature review.
The assessment tool is based on the theoretical analysis we have described. At the same time as our own research work, the scientific community has also been working on assessing CT. Thus, Wiebe et al. [
44] propose and validate the CTA-M tool, which is based on the existing CTt assessment tool and selected topics from the Bebras competition [
45,
55]. The tasks introduced by the authors from Bebras are similar to those we have designed and included in our tool.
In 2020, Tang et al. [
23] reviewed the progress made in CT assessment to date. Among their key conclusions, they recommended creating more tools for older students and university students, a need our tool addresses for the first grades of secondary education; placing greater emphasis on the theoretical grounding of assessment tools; and ensuring that tools can be administered independently of specific environments (cross-platform), both of which our tool adequately covers, as we have already presented. More recent proposals target primary school students [
33], making the need for assessment aimed at secondary education even more urgent, while assessment proposals continue to be based mainly on programming structures [
56]. The use of more general concepts than basic programming concepts in the assessment of CT, which is followed by our tool, is also emphasized by Lai [
57].
In 2022, El–Hamamsy et al. [
58] used the Competent Computational Thinking Test (cCTt) assessment tool to target older elementary school students, employing an unplugged approach and moving away from basic programming environments, but the gap in secondary education, which our tool aims to fill for the first grades, still remains. More generally, other approaches to assessing CT are emerging for younger ages, such as the proposals by Shen et al. [
59], Rowe et al. [
36], Sartor Hoffer [
60], and many others presented in the research of Ocampo et al. [
61], while more recent research continues to rely mainly on programming for the assessment of CT, such as that of Ghosh et al. [
62]. This is confirmed by recent research by Ukkonen et al. [
24] on teachers’ views on CT assessment, where they report that it is easier to assess basic programming concepts, but they consider CT to also include concepts more general than programming, which are not adequately assessed by existing assessment tools, thus showing that the research gap that prompted us to create the CT assessment tool continues to exist.
Furthermore, Román–González and Pérez–González [
29] analyze the dimensions of higher-order thinking expected to develop in different age groups, placing the consolidation of logical thinking (following Piaget) and its transfer to abstract schemas in middle school (typically ages 11–14): the use of abstract thinking, evaluation through argumentation and reflection, and approaching problem solving logically and methodically rather than by trial and error. This supports the targeting of our tool at this age group.
In this study, specific validity checks were performed on the DACT tool. In the future, additional checks could be performed, such as external validity, construct or structural validity (including convergent, discriminant, and predictive validity), and further criterion validity.
The DACT tool could also be tested with a wider age range, as we believe it could cover both the older grades of elementary school and the last grades of middle school (ages 10–15).
Furthermore, it would be worthwhile for future research to translate and adapt this tool into English: the text is not very long, it uses easily understandable images and symbols, and it could thus be used by the international community as a CT assessment tool.
To present the research results and support the research, a website was created on the site of the Department of Primary Education of University of Thessaly, Greece, with the acronym DACT, from the initials of the words Development and Assessment of Computational Thinking. This acronym also characterizes the tool created in the context of this research.
Moving on to the validation of the DACT tool, we searched the literature for a validated tool for CT assessment. At the time of designing the study, there were no validated tools for autonomous CT assessment. One tool that stood out due to the numerous validation checks it had undergone was the Computational Thinking test (CTt). Although the CTt has been validated by its creators and is a tool for CT assessment, it is mainly based on a programming environment for assessment. It consists of 28 questions and uses a ready-made programming environment (from code.org) for the forms, commands, and scenarios it contains. Its analysis is based on the recognition and use of programming patterns related to basic instructions and uses sequence structure, repetition structure, selection structure, and functions. In a sense, CTt’s orientation leans more towards algorithmic thinking and evaluation, two of the CT concepts, but still remains a validated tool for CT assessment.
CT is a cognitive process consisting of several dimensions, as we have already analyzed. This research analyzes CT into six (6) concepts, so our assessment tool refers to more CT concepts than the CTt. The two approaches, and hence the two assessment tools, are not identical. However, CT is a single construct: since the CTt assesses specific CT concepts, based mainly on the programming approach, a comparison with the DACT tool should show a positive correlation. It is important to note that the creators of the CTt state that it is theoretically based on Aho’s definition of CT [
51], whereby they talk about the fundamental role of algorithmic thinking and programming in CT, but clarify that we can also have unplugged environments for CT [
44]. Thus, although the CTt tool is based on an existing programming environment, it does not require the use of a computer to complete it, but can also be administered on paper, since each of its tasks is based on an image and the answers are all multiple choice.
Thus, we believe that since CTt is a validated tool for assessing CT, taking into account that it may use fewer CT concepts, it can be used to validate DACT. Even though the latter refers to more concepts, the results of the two tools should show a positive correlation as they assess the same general common concept of CT.
In order to use the CTt tool in Greek, we translated and adapted it, producing the Greek version of the CTt. With this version available, we proceeded to validate the DACT tool. The CTt was used to check concurrent validity: both tools (CTt and DACT) were administered to the same student population, and the statistical test yielded a statistically significant, high correlation between them, supporting concurrent validity, a form of criterion-related validity.
The process of constructing the DACT through continuous monitoring of Cronbach’s alpha provided statistical evidence of internal reliability, while the construction process itself supported content validity. In addition, the face validity of the DACT was checked during construction, completing a basic cycle of validation checks for the tool.
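The alpha-monitoring step described in the Methods can be reproduced outside SPSS. A minimal sketch of Cronbach’s alpha and the “alpha if item deleted” column, on simulated scores (all names and data are illustrative):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

def alpha_if_deleted(items: np.ndarray) -> np.ndarray:
    """Alpha recomputed with each item removed in turn, as in SPSS output."""
    k = items.shape[1]
    return np.array([cronbach_alpha(np.delete(items, j, axis=1))
                     for j in range(k)])

rng = np.random.default_rng(4)
latent = rng.normal(size=(500, 1))
scores = 0.7 * latent + 0.7 * rng.normal(size=(500, 36))
alpha = cronbach_alpha(scores)
print(round(alpha, 2), alpha > 0.80)  # 0.80 was the tool's reliability criterion
```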
Thus, we can now recommend DACT as a CT assessment tool, which is based on the analysis of CT in six main dimensions, namely algorithmic thinking, evaluation, generalization and patterns, abstraction, logic, and decomposition, but it provides a single result and does not evaluate each of these dimensions separately. The DACT tool is administered autonomously, without depending on a specific programming environment, and can be administered either online or printed and completed by students. The online version is recommended, as it has the advantage of automatically checking the answers to each question, and the results are immediately available electronically.
The license to use the tool will also include a spreadsheet with an automated scoring function for the DACT. Provision of the DACT tool will be accompanied by an administration protocol for its use.
A limitation of this research is that the Exploratory Factor Analysis (EFA) was not conducted on an even larger sample of participants completing all 90 tasks of the CT assessment task pool. Future research could address this and might also include a Confirmatory Factor Analysis (CFA). This work presents the first steps of the development and validation of the DACT assessment tool; more remains to be examined in future research, and a new, wider research cycle will probably be necessary in order to present the final assessment tool.
In future research, the questionnaire construction process could be repeated while addressing the aforementioned limitations, and it would be interesting to compare the results with the final CT assessment tool, DACT. It might also be interesting to target not whole classes but selected students with a proven interest in the subject, and possibly in programming or problem solving, to see what their participation alone would yield, on the assumption that their answers would contain no element of randomness. Furthermore, as there are different kinds of validity, more research on the validity of the DACT assessment tool could be conducted in the future.