On the Quality and Validity of Course Evaluation Questionnaires Used in Tertiary Education in Greece

In compliance with national legislation, Greek tertiary education institutions often assess educational quality using a standardized anonymous questionnaire completed by students. This questionnaire aims to independently evaluate various course components, including content organization, instructor quality, facilities, and infrastructure.


Introduction
Quality assurance in higher education is a self-evident goal for institutions, as its internationalization has led to an increasing call for accountability and the need to develop a culture of quality in order to meet the challenges of globalized higher education [1]. It refers to a systematic, organized, and ongoing commitment to quality. It implies the establishment of an internal system of principles, standards, and norms, the proper functioning of which is confirmed by periodic internal and external evaluation methods [2]. The major goal of implementing quality assurance methods throughout Europe was to instill trust in the quality of educational outcomes, provide assurance that academic standards are being protected and improved, and provide a good return on public investment in higher education [3].
An important aspect of an internal evaluation system is the evaluation of courses by students. Course evaluation serves as an important process through which higher education institutions receive feedback from students regarding courses [4]. Typically conducted anonymously at the end of a semester, student evaluations allow universities to assess how positively a course was perceived, the competency of teaching, and the materials provided, and to gain insights into potential areas for improvement. Analysis of student evaluations of courses can provide important information for institutional management and instructors on how to improve the student experience [5,6]. For students, it provides an opportunity to directly share their perspectives on which elements of a course were effective or which may have been lacking. By thoughtfully and honestly compiling student evaluation data across campuses, institutional leaders can identify areas for improvement and implement changes that increase student satisfaction and success [6].
There is a debate about the validity of student evaluations of teaching (SETs) in higher education [7]. Some studies have supported the validity of SETs, while others have challenged it [8][9][10]. Literature reviews that have examined this issue have been inconclusive [11,12]. Spooren et al. [11] found that validity was affected by the methods used and that, in many cases, tools developed by the organizations themselves were used rather than standardized scientific tools. In addition, scholars argue that SET ratings reflect students' assessments of teachers as persons rather than the quality of teaching [11], a view also supported by complementary studies [12,13]. Even students' assessments of course outcomes appeared to be biased [14].
More specifically, SETs have long been criticized and debated as a tool for evaluating instructional quality and effectiveness. Research has shown that SETs can lead to unacceptably high error rates and misclassify teachers' performance [15]. Furthermore, SETs have little correlation with learning, and groups such as women faculty and faculty of color tend to be disadvantaged in SET ratings [16]. Additional studies have found that students reward lenient grading and easy courses with higher SETs, and that instructors feel pressure to achieve good SET ratings [17]. Other studies have also demonstrated that biases related to race, gender, discipline, and other factors can negatively impact minority and female instructors' SET scores, even when course design and content are held constant [18]. Together, this body of research suggests that SETs alone are an imperfect and potentially biased measure of teaching effectiveness that should be supplemented with multiple qualitative and quantitative measures in evaluation processes. Given the mixed results and lack of consensus, the validity of SETs remains an open question.
In Greece, the process of student evaluation of courses has been a reality for several decades. The main tool is the "Student Course Evaluation Questionnaire", a tool recommended by the Hellenic Quality Assurance and Accreditation Agency (HQA) (https://www.ethaae.gr/en, accessed on 12 January 2024). This tool was developed in 2007 and is used by the institutions either without any adaptations or with minor modifications. This paper attempts to evaluate the structure of the questionnaire using data from the University of the Peloponnese, a regional university in the country, in order to assess whether the measures of the questionnaire are consistent with the understanding resulting from the separation of its individual dimensions. These dimensions are evident from the separation of the questions into categories, as shown in the questionnaire.
This study contributes to the evaluation of the institutions because demonstrating the existence of specific factors allows university administrations to examine different aspects of teaching. In this way, universities can examine the scores on the various factors to identify strengths and areas for improvement. Ensuring that measurement tools have a strong evidence base is critical for the purposes of quality assurance and fair assessment of teaching.

Data and Methods
The data used in this study originate from the internal evaluation process of the University of the Peloponnese and, more specifically, from the students' evaluation forms regarding various factors of the educational work. The questionnaires are distributed during the instruction period, between the 8th and 10th weeks of teaching, and are completed anonymously by the students. It is emphasized that the questionnaires are used only for quality assurance purposes within the institution, including the preparation of annual internal evaluation reports and the improvement of the teaching process and infrastructure, and they are not used for the evaluation of teaching staff.
The dataset utilized in this study included information about the department, the course, the academic year, and assessment data on the following aspects: the course, supporting teaching (if applicable), assignments (if applicable), teaching staff, the course laboratory (if applicable), and the student's self-assessment. The type of each general variable is shown in Table 1. Course evaluation variables were provided on a five-point Likert-type scale (1 = unacceptable, 2 = unsatisfactory, 3 = moderate, 4 = satisfactory, and 5 = very good). The variable 'Course category' indicated whether the course instruction included laboratory practice or not. The questionnaire statements were divided into sections, corresponding to the hypothesized dimensions (factors), as shown in Table 2. The dataset included 48,008 questionnaire submissions over eight years, from the academic year 2014-2015 to 2021-2022. The questionnaires span 68 undergraduate and postgraduate programs and 2380 courses. Due to the different types and requirements of the courses, there were missing entries in some variables, mainly in questions related to supporting teaching, labs, and assignments.

Data Preparation
As noted above, there were missing values in some questions concerning supporting teaching, assignments, and the course's laboratories. This was a systematic characteristic of the questionnaire and indeed a desired outcome, since students should not be expected to rate aspects of the course that are not actually present (e.g., labs for courses not involving laboratory practice). For this reason, the dataset was divided into eight subsets according to whether the above dimensions were present or not (Table 3). After splitting into subsets, the only remaining missing entries were in var8, the question concerning the prerequisites for the course. These missing values were filled in with the median of the corresponding course.
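The per-course median imputation described above can be sketched as follows. This is a minimal illustration, not the actual processing pipeline; the record layout, course codes, and column name `var8` are hypothetical stand-ins.

```python
from statistics import median

def impute_course_median(records, var="var8"):
    """Fill missing values of `var` with the median of the same course.

    `records` is a list of dicts with a 'course' key; a missing answer
    is represented by None.
    """
    # Collect the observed answers per course.
    by_course = {}
    for r in records:
        if r[var] is not None:
            by_course.setdefault(r["course"], []).append(r[var])
    # Replace each missing entry with its course's median.
    for r in records:
        if r[var] is None:
            r[var] = median(by_course[r["course"]])
    return records

# Hypothetical submissions: one missing answer per course.
rows = [
    {"course": "MATH101", "var8": 4},
    {"course": "MATH101", "var8": 2},
    {"course": "MATH101", "var8": None},
    {"course": "CS200", "var8": 5},
    {"course": "CS200", "var8": None},
]
impute_course_median(rows)
```

Imputing per course (rather than with a global median) respects the fact that prerequisite-related answers vary systematically between courses.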

Methods
To assess the fit of the model, a series of confirmatory factor analyses (CFAs) were first conducted on the eight sub-datasets. CFA was used to determine whether the data are consistent with the grouping of the questions into factors suggested by the questionnaire; in other words, the goal of confirmatory factor analysis is to determine whether the data fit a hypothesized measurement model. Each analysis related to a specific number of underlying dimensions (from 3 to 6), since the criterion for selecting the sub-datasets depended on the existence or absence of dimensions such as supporting teaching, according to the curriculum.
Next, a series of exploratory factor analyses were also carried out to test whether the collected data corresponded to a latent factor structure similar to the one arising from the separation of the questionnaire statements. In this way, the structure created by the questionnaire data alone was tested without any prior assumptions about the statements. Confirming the factorial structure of the questionnaire provides evidence for its validity and the significance of the results extracted from it.
The free and open-source statistical software "Jamovi", version 2.4 (https://www.jamovi.org/, accessed on 15 December 2023) [19] was utilized for conducting exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) in this study. "Jamovi" provides a user-friendly graphical interface for applying many statistical and machine learning techniques using the R language. The software streamlined factor extraction, rotation, and model testing in an accessible workflow for the performed factor analyses.

Confirmatory Factor Analysis
Firstly, the factor loadings of the variables were computed. Table 4 presents the results of this process for the dataset containing all factors (subset #1 in Table 3). For conciseness, only the CFA factor loadings for this subset are presented in this paper; the complete results for all subsets are available in the Supplementary Materials. The factor loadings were strong (above 0.6) for most variables, indicating a strong relationship between each variable and its factor and suggesting that the tool is a suitable representation of the underlying structure. Additionally, all factor loadings are highly statistically significant (p < 0.001) [20]. Similar results were obtained for the loadings in the other datasets. On the other hand, the results indicated a less-than-adequate fit for all datasets based on several fit indices. The χ² test was significant (p < 0.001) for all datasets, indicating a statistically significant divergence between the hypothesized model (as indicated by the questionnaire) and the observed data. Recognizing the disadvantages of the χ² test, such as its sensitivity to sample size (larger samples lead to smaller p-values), we also used additional fit measures such as the Tucker-Lewis index (TLI) [21], SRMR [22], and RMSEA [23] (Table 5). The values for the comparative fit index (CFI) [24] ranged from 0.859 to 0.894, falling below the recommended threshold of ≥0.95 for a good fit [25]. The TLI also failed to reach adequacy across datasets, ranging from 0.845 to 0.884, with values not meeting the ≥0.95 threshold. The standardized root mean square residual (SRMR) fell within the recommended range of ≤0.08 for most of the models (0.0518-0.0757), indicating adequate absolute fit on this index only. However, the root mean square error of approximation (RMSEA) exceeded the recommended cutoff of ≤0.06 for a good fit [26,27], with values ranging from 0.0809 to 0.0896. The 90% confidence intervals for RMSEA also confirmed poor fit across all datasets. Taken together, these results show inadequate model fit, suggesting that the theoretical model did not fit the empirical data well. Modifications to the hypothesized model are needed to improve fit across datasets before the findings can be interpreted. This was evident in all subsets of data and in almost all goodness-of-fit measures studied. Although the cutoff criteria for CFA are not absolute [27], the near-total agreement between the indexes across all data subsets supports this conclusion.
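The fit indices above can be computed from the model and baseline χ² statistics using their standard textbook definitions. The sketch below uses those standard formulas (not Jamovi's internal output) with illustrative, hypothetical input values.

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: improvement of the model over the
    baseline (independence) model."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    return 1.0 - num / den

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index (non-normed fit index)."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation for sample size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# Hypothetical values: model chi2=900 on 300 df, baseline chi2=10000
# on 325 df, n=5000 respondents.
print(round(cfi(900, 300, 10000, 325), 3))
print(round(tli(900, 300, 10000, 325), 3))
print(round(rmsea(900, 300, 5000), 3))
```

Comparing each index against its conventional cutoff (CFI/TLI ≥ 0.95, RMSEA ≤ 0.06) reproduces the kind of judgment reported in Table 5.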

Exploratory Factor Analysis
The next stage was to perform a series of exploratory factor analyses (EFAs), comparing the factors present in the questionnaire structure with those extracted from the data. Two distinct approaches were employed to this end. The first was to extract factors from the data without any constraint on the number of factors to be identified. The second was to extract factors while requesting that the variables within each analyzed subset be grouped into a number of factors equal to the number of dimensions of that subset. These two analyses are presented in the following paragraphs.
As noted above, the first approach was to extract the dimensions of the questionnaire as derived from the data, performing a series of EFAs without imposing any prerequisites concerning the grouping of the variables. All subsets of data were analyzed in the same way. Assuming that the resulting factors are correlated with each other, the promax rotation method was used in conjunction with the maximum likelihood extraction method. The expected factor structure, according to the questionnaire for the dataset with all dimensions, is shown in Table 6 below. The results of all analyses showed that, based on the eigenvalue-greater-than-one criterion, only two factors were extracted instead of the three to six suggested by the questionnaire (Course, Supporting teaching, Assignments, Teaching staff, Lab, Self-assessment), with no particular conceptual similarity. An example from the subset of data that does not include supporting teaching, assignments, and labs is shown in Table 7, where the expected structure based on the questionnaire is not confirmed. The same pattern is repeated in all datasets.
According to the literature, the Kaiser criterion [28] of eigenvalues greater than one is not absolute [29]; therefore, eigenvalue cutoffs below one, between 0.5 and 0.8, were also tested. In these cases, too, the expected number of dimensions was not obtained. The failure to identify a number of dimensions equal to what is expected in each dataset indicates a mismatch between the intended conceptual model and the one that actually emerges from the data.
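The Kaiser-style criterion amounts to counting eigenvalues of the item correlation matrix above a cutoff. A minimal sketch, using a toy 4-item correlation matrix (not the study's data) with two highly correlated pairs:

```python
import numpy as np

def kaiser_count(R, cutoff=1.0):
    """Number of factors retained: eigenvalues of the correlation
    matrix R that exceed `cutoff` (Kaiser criterion when cutoff=1)."""
    eig = np.linalg.eigvalsh(R)  # symmetric matrix -> real eigenvalues
    return int(np.sum(eig > cutoff))

# Toy correlation matrix: items 1-2 and items 3-4 form two blocks
# (within-block r = 0.8, between-block r = 0.1).
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
print(kaiser_count(R))          # two factors at the standard cutoff
print(kaiser_count(R, 0.8))     # lowering the cutoff, as tested above
```

For this matrix the eigenvalues are 2.0, 1.6, 0.2, and 0.2, so both the standard cutoff of 1 and the relaxed cutoffs between 0.5 and 0.8 retain the same two factors, mirroring how relaxing the cutoff did not change the extracted dimensionality in the study.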
The second approach was to perform a factor analysis with a specific number of factors, corresponding to the number of dimensions of the questionnaire for each subset of the data. It was found that the variables that, according to the questionnaire, corresponded to (i) the courses (var1 to var10) and (ii) the teaching staff (var22 to var28) were actually associated with a single factor, in a quite distorted fashion compared to the original conceptual grouping of the questionnaire (an example is shown in Table 8). In contrast, the questions related to students' self-assessment (var33-var37) loaded together on their own factor, distinct from the one to which the courses and teaching staff relate. The clustering of the other variables into dimensions was also biased compared to the theoretical model. The allocation of the courses and teaching staff to the same factor, given a specific number of factors to be extracted, has also been reported in other studies [15][16][17][18]. Overall, extracting the expected number of dimensions resulted in an unexpected structure for all data subsets.
From a statistical perspective, the previous finding is associated with the high correlation coefficients between the variables related to the course (var1 to var5) and the teachers (var22 to var28), as shown in Figure 1 in italics. In particular, the data show moderate to strong average correlation coefficients within each factor's statements in most cases. The correlation between each factor's questions is high, indicating that the questions meet the requirement for forming a factor. On the other hand, strong correlations between different factors' statements were also found, even though a proper structure requires these correlations to be smaller. The strongest average correlations were seen:
1. Between the course and teaching staff factors (avg 0.76), leading to their merging into a single factor.
2. Between the course and assignment factors (avg 0.67).
The high correlation between the course and teaching staff factors is the root cause of the factor structure differing from the one expected based on the tool under study. Correlations between supporting teaching and the other factors are low to moderate, ranging from 0.19 to 0.57. Correlations between the assignment and teaching staff factors are also strong, around 0.69, while the correlations between the assignment factor and the student factor are weak (avg 0.34). The teaching staff and laboratory factors show a moderate correlation of 0.53. Finally, the student factor shows low average correlations with the other groups, ranging from 0.27 to 0.34. In summary, the strongest inter-factor correlations on average are seen between the factors of (i) course, (ii) teaching staff, and (iii) assignment. The weakest average correlations are between the student factor and the other factors.
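The average between-factor correlations quoted above can be obtained by averaging the entries of the correlation matrix that lie in the rectangular block spanned by two groups of items. A small sketch on a toy 4-item matrix (the item groupings and values are illustrative, not the study's):

```python
import numpy as np

def avg_between(R, block_a, block_b):
    """Average correlation between two groups of items, e.g. the
    course items and the teaching-staff items."""
    return float(np.mean(R[np.ix_(block_a, block_b)]))

# Toy matrix: items 0-1 are one block, items 2-3 another.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
print(avg_between(R, [0, 1], [2, 3]))  # between-block average
```

When this between-block average approaches the within-block average (as with the observed 0.76 between course and teaching-staff items), the two groups become statistically indistinguishable and collapse into one factor.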
The goodness-of-fit indexes of the EFAs with a specific number of factors indicate an acceptable or less-than-good fit across all sub-datasets. The p-values for the χ² test are all less than 0.0001, suggesting a poor fit, but the χ² test is affected by sample size. The corrected χ² statistic based on the degrees of freedom shows an acceptable fit for the "Lab-No AS/NT-TS" subset and a marginally acceptable fit for the "Lab-No AS/NT-No TS" subset.
Regarding the root mean square error of approximation (RMSEA) values, the data subsets "Lab-AS/NT-TS" and "No Lab-AS/NT-TS" have the lowest values (0.0711 and 0.0741, respectively), while the "No Lab-AS/NT-No TS" data subset exhibits the highest value (0.0840). Considering that RMSEA values of 0.05 and 0.08 correspond to "good fit" and "mediocre fit" [20], we can conclude that the determined RMSEA indicates a model fit that is marginally acceptable to mediocre. The 90% confidence intervals of RMSEA are narrow, demonstrating precision around the point estimates, and lie below the 0.1 threshold. The Tucker-Lewis index (TLI) values exceed 0.90 for the "Lab-No AS/NT-TS" and "No Lab-AS/NT-TS" datasets, signifying a good fit relative to the conventional cutoffs. The TLI values for the rest of the subsets are between 0.857 and 0.895, indicating an adequate fit.
Regarding comparisons between datasets, the "Lab-AS/NT-TS" model exhibits the best fit, based on the combination of a low RMSEA value (0.0711) and a high TLI (0.917). The "No Lab-No AS/NT-No TS" dataset also shows a good fit (TLI 0.892, RMSEA 0.0772, RMSEA 90% CI (0.0734, 0.0812)). In contrast, the "Lab-No AS/NT-TS" dataset has a lower TLI (0.848) than the alternatives. Overall, the results provide some support for an acceptable fit across the sub-datasets, with some variability among them. However, the resulting factor structure is different from the expected one.
Finally, the variables were clustered based on the factor loadings determined by EFA, and this clustering was compared to the grouping expressed by the questionnaire structure using the adjusted Rand index [30]. The computed adjusted Rand index ranges from 0.2316 to 0.5128, indicating that the groupings observed in practice differ substantially from the theoretical ones.
The goodness-of-fit measures of the EFAs without a prescribed number of factors reveal varying degrees of model fit across the different datasets. The root mean square error of approximation (RMSEA) generally suggests a mediocre fit, with values ranging from approximately 0.0930 to 0.112; the Tucker-Lewis index (TLI) values range from about 0.776 to 0.841, falling short of the conventional cutoffs. Despite significant χ² values, likely influenced by the large sample sizes, the χ²/df ratios suggest a reasonable fit only in some datasets, ranging widely from 3.83 to 152.23. Additionally, all datasets exhibit p-values below 0.0001, indicating rejection of the null hypothesis of perfect fit. Overall, the EFA models show significant differences between the observed and hypothesized structures, while the study of multiple fit indexes shows a moderate fit. Again, the variables were clustered based on the factor loadings determined by EFA, and this clustering was compared to the grouping expressed by the questionnaire structure using the adjusted Rand index [30]. The computed adjusted Rand index ranges from 0.0549 to 0.3670 (Table 9), indicating that the groupings observed in practice differ substantially from the theoretical ones. This range is lower than that observed when a specific number of factors is requested (Table 10).

CFA and EFAs' Results
Both the confirmatory and exploratory factor analyses indicate a lack of agreement between the hypothesized structure of the research instrument and the students' responses. Based on the abovementioned findings, questions can be raised regarding the ability of students to distinguish the degree of satisfaction they receive from the course from that received from the teacher, a view also adopted in [15]. An alternative (or complementary) interpretation is to attribute the findings to a lack of validity of the questionnaire. Validity refers to the assessment of whether a tool actually measures what it claims to measure or whether there is a systematic error. When we examine the structure and layout of the questionnaire, we note its face validity, as the statement questions are located under the factors to which they are intended to relate [29,31]. However, the structural and layout aspects provide only a superficial assessment of the measurement tool in terms of the underlying dimensions, i.e., a rough initial estimate of whether the content of the question-statements is conceptually relevant to the intended conceptual structure. In contrast, other forms of validity, such as construct and content validity, require the implementation of quantitative and qualitative methods to assess them. The need to use additional quantitative and qualitative tools has been underscored in other studies [17,18].

Discussion
This study highlighted the weaknesses of an important tool for evaluating teaching work and collecting feedback from students, using an extensive set of data. Student evaluations of teaching (SETs) are commonly used in higher education as a measure of teaching effectiveness [8][9][10][11][12][13][14][15][16][17][18][19]. The present study is the first to evaluate the factorial structure of the specific questionnaire used at the University of the Peloponnese and other Greek universities. Confirmatory factor analyses were initially conducted to determine whether the data matched the hypothesized factor structure implied by the grouping of statements into categories on the questionnaire. The results indicated a mediocre-to-poor model fit across all datasets, based on the χ², CFI, TLI, SRMR, and RMSEA values. This suggests that the data did not align well with the proposed factor structure.
Several exploratory factor analyses were then performed in order to let the factors emerge directly from the data. From these analyses, only two or three factors were extracted, rather than the expected number of factors based on the questionnaire categories. Even when the theoretically expected number of factors was specified, the item loadings did not match the dimensions of the questionnaire. In particular, the course evaluation and teacher evaluation variables consistently loaded onto the same factor. These findings are in line with other studies and evaluation tools [15][16][17][18][19][31].
The high positive correlation between the course and teacher evaluation items explains their clustering on one factor. The average correlation between the course and teaching staff factors was 0.76, the highest inter-factor correlation. This raises questions about the ability of students to distinguish between the satisfaction they received from the courses and from the instructors. It may also point to issues with the discriminant power of the tool. Students may not be able to accurately assess the satisfaction they receive, possibly due to the wording of the questions. The face validity of the tool seems reasonable, but a more rigorous examination of its content and construct validity is required.
The EFA results lend some support to a low-to-acceptable fit when extracting the expected number of factors, but with a different structure than expected. The RMSEA and TLI values were marginally adequate for most datasets. This indicates that the data may have an underlying factor structure that does not match the questionnaire structure.
Overall, the CFA and EFA findings converge to suggest limitations of this evaluation tool, and similar findings exist in the literature. The lack of validity in the questionnaire has important consequences. The study shows difficulties in reaching precise conclusions about some facets of teaching, which limits the tool's usefulness as a means of assessing teaching factors. In order to improve the usefulness of the questionnaire for collecting information from students and evaluating teaching effectiveness in general, future research should focus on improving and revalidating it.
Modifications are needed to improve the model fit. Adding, removing, or revising items to increase the discrimination between the course and teaching staff factors is also suggested, as the high correlation between them indicates that students currently do not adequately distinguish these dimensions.
Restructuring the questionnaire sections and items, as in the supporting teaching section, could also help. Reducing the correlations between factors, by removing redundant items or differentiating their content, could help delineate the factors. Examining the validity of factor scores against variables such as grades or demographic characteristics of students and teachers may reveal other issues. Periodic re-validation, following improvements, will ensure that the factor structure continues to fit new students and curricula over time. Finally, improving content validity through expert review is also important. This collaborative approach not only strengthens content validity but also ensures the reliability and effectiveness of the tool.
Through these steps, the validity of the tool can be enhanced to better capture the key dimensions in student evaluations of teaching. Further reliability and validity testing on other datasets is recommended in order to corroborate the results obtained from these data. The effect of reducing the number of statements on the reliability of the tool and on the results obtained from it is another future field of study, as the current findings provide a first evaluation of the properties of this important student feedback tool.

Conclusions
The present study aimed to evaluate the validity of a student evaluation of teaching (SET) questionnaire utilized at the University of the Peloponnese. The analysis covered 48,008 records over eight academic years and all university courses. These records spanned 68 undergraduate and postgraduate programs and 2380 courses. The study indicated an insufficient fit to the theorized underlying structure. The existing literature emphasizes the necessity of empirically testing the instruments used, to ensure that they provide unbiased data to inform administrative and academic decision-making.
The lack of validity of the assessment tool highlights the importance of empirical evaluation of these instruments to ensure that they provide accurate and objective feedback. This study is useful for institutions worldwide that use student feedback for administrative and academic decision-making, as it highlights the need to validate the instruments in order to produce useful data. The findings may have implications for methods of improving the assessment tools used worldwide to more accurately assess teaching quality and collect insightful feedback from students.
The lack of validity of the questionnaire holds important implications at both theoretical and practical levels. Theoretically, refinement and revalidation of the questionnaire were highlighted as necessities for aligning the data it generates with the anticipated theoretical framework. Practically, the study identified difficulties in drawing specific conclusions pertaining to diverse facets of instruction, hampering its utility as an evaluation tool for teaching quality and highlighting the need for enhancement. Future research could explore refining and revalidating the questionnaire to bolster its usefulness for obtaining student feedback and assessing teaching effectiveness.

Table 2 .
Questionnaire items per factor.

Table 3 .
Data subsets according to the courses' structures and requirements.
√ indicates presence of answers in the dataset; x indicates absence.

Table 4 .
CFA factor loadings; dataset contains all factors.
No Lab-No AS/NT-No TS: datasets from courses without Laboratory, Assignment, or Supporting Teaching. Lab-AS/NT-TS: datasets from courses with Laboratory, Assignment, and Supporting Teaching.

Table 6 .
Theoretical association between variables and dimensions.

Table 8 .
EFA, factorial structure, No Lab-No AS/NT-No TS dataset, based on number of factors expected (3).

Table 10 .
EFA, goodness-of-fit indexes, specific number of factors.
Institutional Review Board Statement: The study was approved by the Research Ethics and Integrity Committee of the University of the Peloponnese (Decision number 2675/12 February 2024).