Article

Longitudinal Validation of EIP-Move for Assessing the Educational and Inclusive Potential of Physical Education and Sports Programs in Primary Schools

Department of Psychology, University of Campania “Luigi Vanvitelli”, Viale Abramo Lincoln n. 5, 81100 Caserta, Italy
Educ. Sci. 2026, 16(3), 374; https://doi.org/10.3390/educsci16030374
Submission received: 23 January 2026 / Revised: 25 February 2026 / Accepted: 26 February 2026 / Published: 1 March 2026

Abstract

In recent years, primary school physical education and sports have been acknowledged as crucial for children’s comprehensive development and skill enhancement. Consequently, there is a demand for validated instruments to evaluate both intervention outcomes and their educational quality and inclusivity. This study aimed to create and validate EIP-Move (Educational and Inclusive Potential of Motor Programmes), a standardized instrument designed to assess the pedagogical, inclusive, relational, and equitable potential of physical education and sports initiatives. The research employed a longitudinal framework comprising multiple phases (theoretical construction, pilot study, psychometric validation, and longitudinal validation), founded on a conceptual model with four dimensions: pedagogical quality, inclusion and participation, relational climate and safety, and equity and valorization of differences. Psychometric analyses confirmed the robustness of the four-factor model, demonstrating strong reliability, validity, and measurement invariance across gender, context, and the presence of special educational needs. Across a 12-month longitudinal framework, the instrument also showed sensitivity to change and predictive validity with respect to subsequent programme-level outcomes, including students’ active participation, relational climate quality, psychological safety, and programme continuity over time. In summary, EIP-Move emerges as a valid and reliable instrument, beneficial for both research and professional application, thereby significantly aiding the formative evaluation of motor programmes and fostering a culture of quality and inclusion in primary education.

1. Introduction

In recent years, physical education and sports in primary schools have been progressively recognized as an educational area of strategic importance for the overall development of children. International literature highlights how structured, intentionally designed, and age-appropriate physical experiences contribute significantly not only to the development of fundamental motor skills but also to the enhancement of cognitive functions, social-relational skills, and emotional competencies (Gallahue et al., 2012). In particular, motor activity represents a privileged context for supporting processes of self-regulation, attention, problem solving, and cooperation, which are central elements for educational success in early childhood (Tomporowski et al., 2015).
From this perspective, physical education goes beyond a purely performance- or training-based approach, taking on a clear educational and formative value. It is a pedagogical space in which movement becomes a vehicle for meaningful learning, personal identity building, and the development of psychophysical well-being (Bailey et al., 2009). Physical activity, if properly mediated by adults, also encourages active participation by students and the creation of a positive learning environment based on motivation, mutual respect, and the appreciation of individual differences.
At the same time, the paradigm of educational inclusion has increasingly guided school policies and teaching practices towards the construction of learning environments capable of embracing diversity as a resource rather than a limitation (UNESCO, 2017). In the field of motor skills and sports, inclusion takes on particular importance, as the body and movement are strongly linked to identity and potentially exposed to dynamics of exclusion or stigmatization. Designing inclusive physical education programs therefore means ensuring that all children—regardless of ability, gender, socio-cultural background, or special educational needs—have genuine opportunities for participation, involvement, and educational success (Haegele & Sutherland, 2015).
From a theoretical perspective, EIP-Move is grounded in, yet distinct from, existing frameworks for the evaluation of physical education and sport programmes. In particular, it is conceptually aligned with program evaluation models developed within Physical Education Teacher Education (PETE) research, which emphasize pedagogical quality, instructional coherence, and reflective practice as core dimensions of effective physical education programmes. At the same time, EIP-Move explicitly integrates the principles of inclusive education as articulated in the UNESCO (2017) framework, particularly the recognition of diversity as a resource, the promotion of participation for all learners, and the creation of safe and supportive learning environments.
However, while these frameworks provide important theoretical and normative guidance, they are not designed as standardized measurement tools for the systematic evaluation of programme quality. PETE-oriented models primarily address teacher education and instructional competencies, whereas the UNESCO framework offers a macro-level policy and values-based perspective rather than an operational instrument for programme assessment. EIP-Move builds upon these foundations by translating shared theoretical principles into a multidimensional, observation-based measurement tool specifically aimed at evaluating the educational and inclusive potential of physical education and sport programmes in primary school contexts.
With respect to existing instruments, several tools have been developed to assess related constructs such as teaching quality, motivational climate, inclusion, or programme effectiveness in physical education settings. However, these instruments typically focus on isolated dimensions (e.g., teaching behaviours, student perceptions, or inclusion practices) and are often designed for self-report or outcome-based evaluation. In contrast, EIP-Move adopts a programme-level perspective, focusing on structural, pedagogical, and relational characteristics that define the intrinsic quality and inclusive potential of the programme itself, rather than its distal outcomes.
The distinctive contribution of EIP-Move lies in its integrative approach: it combines pedagogical quality, inclusion and participation, relational climate and safety, and equity and valuing of differences within a single validated framework. Moreover, its longitudinal design, observer-based format, and demonstrated measurement invariance across key groups allow EIP-Move to be used not only for descriptive assessment, but also for monitoring change over time and supporting evidence-based improvement processes in both school and sport contexts, as can be seen in Table 1.
However, inclusive physical education cannot be reduced to sporadic interventions or occasional adaptations of activities. It requires a systematic and intentional approach that integrates solid pedagogical planning, organizational flexibility, specific professional skills, and constant attention to the relational climate and psychological safety of the group (Goodwin & Watkinson, 2000). From this perspective, the inclusive quality of a physical education program does not depend exclusively on the individual characteristics of the students, but above all on the educational, methodological, and organizational choices made by the adults involved. Despite growing theoretical recognition of the educational and inclusive value of physical education and sports programs, assessing their quality remains problematic. In practice, this assessment is frequently based on indirect indicators, such as participation levels, perceived satisfaction, or the declared presence of adaptations, or on ad hoc tools that often lack adequate psychometric validation (Kirk, 2010). This methodological heterogeneity limits the possibility of comparing different programs, hinders the longitudinal monitoring of interventions, and reduces the impact of scientific evidence on decision-making processes in education and sports.
In particular, there is a lack of standardized tools capable of systematically and comparably measuring the educational and inclusive potential of physical education and sports programs for primary school-aged children. This construct does not primarily refer to the immediately observable outcomes in children, but to the intrinsic quality of the program itself, understood as a set of pedagogical, organizational, and relational conditions that determine its potential capacity to generate meaningful learning, promote inclusion, and support the overall development of students (De Bosscher et al., 2015).
The availability of validated tools for program evaluation is a fundamental step in promoting evidence-based educational practices and supporting continuous improvement processes. In this sense, evaluation does not merely serve a certifying or punitive function, but is a reflective device that supports the professionalism of teachers and coaches, as well as a governance tool for schools and sports clubs (Patton, 2015).
In light of these considerations, there is a need for a tool that allows for the reliable and scientifically sound evaluation of the educational and inclusive quality of physical education and sports programs in primary schools. This study aims to respond to this need through the development and validation of EIP-Move (Educational and Inclusive Potential of Motor Programs), a test designed to measure, in a systematic and standardized manner, the characteristics of motor programs in terms of pedagogical design, inclusion, relational climate, and educational equity.
Through a longitudinal research design and a rigorous psychometric validation process, the study aims to contribute to the definition of shared and operational criteria for the evaluation of physical education and sports programs, promoting a culture of quality and inclusion based on solid theoretical foundations and empirical evidence.
In the present study, predictive validity is conceptualised in a programme-level and formative sense, rather than in terms of distal individual learning outcomes. Consistent with program evaluation theory, the educational and inclusive potential of a programme is expected to predict the quality of proximal educational processes that unfold over time, such as sustained student participation, a positive and psychologically safe relational climate, and the continuity and stability of programme implementation. From this perspective, EIP-Move is not intended to predict individual motor or academic achievement, but rather to anticipate the programme’s capacity to generate favourable educational conditions that are theoretically recognised as prerequisites for meaningful learning and inclusion. Accordingly, predictive validity was examined by testing whether baseline EIP-Move scores predicted relevant programme-level indicators measured longitudinally across the subsequent school year.

2. Study Objectives

The overall objective of this study is to develop and validate a standardized tool for evaluating the educational and inclusive potential of physical education and sports programs aimed at primary school children. In particular, the research aims to contribute to the construction of an evaluation model capable of systematically and scientifically assessing the intrinsic quality of physical activity programs, moving beyond an interpretation centred exclusively on immediately observable outcomes and attending instead to the pedagogical, organizational, and relational conditions that make such programs potentially effective in promoting learning, participation, and inclusion (Bailey et al., 2009; Kirk, 2010).
A first specific objective is to develop a theoretically grounded tool based on an integrated conceptual framework that combines contributions from sports pedagogy, inclusive education, motor sciences, and applied psychometrics (Gallahue et al., 2012; Haegele & Sutherland, 2015). At this stage, particular emphasis is placed on the operational definition of the construct of educational and inclusive potential, understood as a multidimensional set of program characteristics—such as the quality of pedagogical design, methodological flexibility, relational climate, and equity of participation opportunities—which the literature identifies as determinants of the quality of motor education contexts (Goodwin & Watkinson, 2000; UNESCO, 2017). The formulation of the items is geared towards ensuring observability, clarity, and educational relevance, in line with the methodological recommendations for the construction of assessment tools in the educational field (DeVellis, 2017).
A second objective concerns the in-depth analysis of the psychometric properties of the test, with the aim of verifying its scientific soundness and suitability for use in education and sport. In particular, the study aims to evaluate the content validity of the instrument through the involvement of a multidisciplinary panel of experts, as well as to analyze the factorial structure of the instrument using exploratory and confirmatory factor analysis procedures, in line with the main international standards for test validation (Brown, 2015; Kline, 2016). At the same time, the reliability of the test is examined in terms of internal consistency of the subscales, temporal stability of the measures (test-retest), and, where provided for in the research design, inter-rater reliability, in order to ensure the accuracy and reproducibility of the measurements (Streiner et al., 2015).
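The internal-consistency analysis mentioned above rests on a standard formula. The article does not report the authors' analysis code, so the following is only a minimal Python sketch of Cronbach's alpha, using invented ratings from six hypothetical evaluators on a four-item subscale:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                           # number of items
    item_vars = X.var(axis=0, ddof=1)        # per-item sample variances
    total_var = X.sum(axis=1).var(ddof=1)    # variance of the sum scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings (1-5 Likert) from six evaluators on four items
ratings = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [3, 4, 3, 3],
])
alpha = cronbach_alpha(ratings)   # roughly 0.91 for these illustrative data
```

Test–retest stability and inter-rater agreement would be assessed with separate indices (e.g., correlation or intraclass coefficients) rather than alpha.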
A further central objective of the research is to verify the invariance of measurement of the instrument with respect to different reference groups. In particular, the study aims to ascertain whether the EIP-Move measures the construct of educational and inclusive potential in an equivalent manner according to gender, context of application (primary school and sports clubs), and the presence or absence of special educational needs. This analysis is essential not only to allow valid comparisons between groups, but also to ensure that the instrument itself respects the principles of fairness and does not introduce evaluative bias, in line with the literature on inclusive assessment and measurement model invariance (Putnick & Bornstein, 2016).
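In practice, measurement invariance is usually tested by comparing nested multi-group CFA models (configural, metric, scalar) via a likelihood-ratio test, often alongside a |ΔCFI| ≤ .01 criterion. The study does not report its computational details, so the sketch below simply illustrates the Δχ² comparison with hypothetical fit statistics:

```python
from scipy.stats import chi2

def chi_square_difference(chisq_c, df_c, chisq_f, df_f):
    """Likelihood-ratio (delta chi-square) test between nested CFA models.

    chisq_c/df_c: the more constrained model (e.g., metric invariance);
    chisq_f/df_f: the freer model (e.g., configural invariance).
    A non-significant p (and a small |delta CFI|) supports the added
    equality constraints, i.e., invariance at that level.
    """
    d_chi = chisq_c - chisq_f
    d_df = df_c - df_f
    return d_chi, d_df, chi2.sf(d_chi, d_df)

# Hypothetical fit values: configural vs. metric invariance across gender
d_chi, d_df, p = chi_square_difference(428.9, 208, 412.3, 196)
```

Here a non-significant p would indicate that constraining factor loadings to equality across groups does not meaningfully worsen fit.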
A further objective, which cuts across the validation phases, concerns the examination of the tool’s sensitivity to change, i.e., its ability to detect variations over time in relation to the evolution of physical education and sports programs. Through a longitudinal research design, the study aims to verify whether EIP-Move is able to detect improvements or critical issues emerging as a result of the implementation of specific educational and inclusive practices, as recommended by the literature on program evaluation and the formative use of measurement tools (Patton, 2015; Singer & Willett, 2003).
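One common index of sensitivity to change is the standardized response mean (mean change divided by the standard deviation of change). The article does not specify which index was used, so this Python sketch, with invented programme-level scores, is purely illustrative:

```python
import numpy as np

def standardized_response_mean(t1, t2):
    """SRM = mean change / SD of change; an index of sensitivity to change."""
    change = np.asarray(t2, dtype=float) - np.asarray(t1, dtype=float)
    return change.mean() / change.std(ddof=1)

# Hypothetical programme-level EIP-Move means (1-5 scale) before and after
# a year of targeted inclusive-practice implementation in six programmes
baseline  = [3.1, 2.8, 3.4, 3.0, 2.6, 3.2]
follow_up = [3.5, 3.1, 3.6, 3.4, 3.0, 3.3]
srm = standardized_response_mean(baseline, follow_up)
```

By convention, SRM values around 0.2, 0.5, and 0.8 are read as small, moderate, and large responsiveness, respectively.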
Finally, the research aims to explore the predictive validity of EIP-Move by analyzing the relationship between the scores obtained from the tool and a series of relevant outcomes in the educational and sports fields, such as the level of active participation of students, the quality of the group’s relational climate, the perception of psychological safety, and the continuity of adherence to programs over time. In this sense, the objective is to evaluate the potential of the EIP-Move not only as a descriptive or diagnostic tool, but also as a decision-making support for teachers, coaches, school administrators, and sports club managers, promoting continuous improvement processes and practices based on empirical evidence (De Bosscher et al., 2015; Patton, 2015).
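Predictive validity of this kind is typically examined by regressing a later programme-level outcome on baseline instrument scores. The data and variable names below are entirely hypothetical; the sketch only shows the shape of such an analysis in plain NumPy:

```python
import numpy as np

# Hypothetical data: baseline EIP-Move totals (x) for seven programmes and
# the proportion of actively participating students one year later (y)
x = np.array([2.6, 2.9, 3.1, 3.4, 3.6, 3.9, 4.1])
y = np.array([0.55, 0.60, 0.66, 0.70, 0.74, 0.80, 0.83])

X = np.column_stack([np.ones_like(x), x])      # intercept + predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates [intercept, slope]
y_hat = X @ beta
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

A positive, non-trivial slope with substantial explained variance would be the pattern consistent with the predictive claims described above.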

3. Methods

3.1. Study Design

This study adopted a multi-phase longitudinal design lasting a total of 36 months, aimed at developing and psychometrically validating the EIP-Move. This design was chosen in order to ensure a high level of scientific and methodological rigor, allowing for the systematic integration of exploratory, confirmatory, and longitudinal phases, in line with the main international recommendations for the construction and validation of measurement tools in the educational and psychosocial fields (DeVellis, 2017; Streiner et al., 2015).
The multi-phase approach allowed us to proceed progressively, from the theoretical definition of the construct of educational and inclusive potential to the empirical verification of the psychometric properties of the instrument, to the analysis of its sensitivity to change and predictive validity over time. In particular, the longitudinal design allowed not only for the evaluation of the stability of the measures, but also for the examination of the EIP-Move’s ability to detect changes associated with the evolution of physical education and sports programs and the implementation of educational and inclusive practices.
The design was divided into four main phases:
  • Theoretical construction and content validity
  • Pilot study
  • Cross-sectional psychometric validation
  • Longitudinal and predictive validation

3.2. Phase 1: Construction of the Instrument and Content Validity

The construction of the EIP-Move was conducted according to a systematic and iterative approach, aimed at ensuring solid content validity and an adequate operational representation of the construct of educational and inclusive potential of physical education and sports programs in primary schools. Phase 1 included four main steps: (a) conceptual and operational definition of the construct; (b) generation of the initial pool of items; (c) preliminary qualitative review (clarity, language, and observability); (d) quantitative assessment of content validity through a panel of experts and calculation of the Content Validity Index (CVI), followed by review and refinement of the instrument.
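Step (d) can be made concrete with the conventional CVI computation (following the common Lynn / Polit-Beck approach, which the article does not detail): each expert rates each item's relevance on a 4-point scale, the item-level I-CVI is the proportion of experts rating it 3 or 4, and S-CVI/Ave is the mean of the I-CVIs. A minimal sketch with a hypothetical eight-expert panel:

```python
import numpy as np

def content_validity(ratings, threshold=3):
    """I-CVI per item and S-CVI/Ave from expert relevance ratings.

    ratings: (experts x items) matrix on a 4-point relevance scale.
    I-CVI = proportion of experts rating the item >= threshold;
    S-CVI/Ave = mean of the item-level I-CVIs.
    """
    relevant = np.asarray(ratings) >= threshold
    i_cvi = relevant.mean(axis=0)
    return i_cvi, i_cvi.mean()

# Hypothetical panel of 8 experts rating 5 items (4-point relevance scale)
ratings = np.array([
    [4, 4, 3, 2, 4],
    [3, 4, 4, 3, 4],
    [4, 3, 4, 2, 3],
    [4, 4, 3, 3, 4],
    [3, 4, 4, 2, 4],
    [4, 4, 4, 3, 3],
    [4, 3, 4, 2, 4],
    [3, 4, 3, 3, 4],
])
i_cvi, s_cvi = content_validity(ratings)
```

With the usual cutoffs (I-CVI ≥ .78 for panels of six or more experts; S-CVI/Ave ≥ .90), the fourth item in this fabricated example (I-CVI = .50) would be flagged for revision or removal.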

3.2.1. Definition of the Construct and Specification of Dimensions

In the initial phase of EIP-Move development, a systematic analysis of the international literature and the main theoretical frameworks related to sports pedagogy, physical education teaching, and educational inclusion models was conducted, with the aim of clearly and consensually defining the reference construct and translating it into empirically observable dimensions (Bailey et al., 2009; Kirk, 2010; UNESCO, 2017). This analysis made it possible to move beyond generic or exclusively normative definitions of inclusion and educational quality, guiding the process of constructing the tool towards an operational and measurable conceptualization, in line with the methodological recommendations for the development of assessment tools in education (DeVellis, 2017).
The educational and inclusive potential of physical education and sports programs has therefore been defined as the program’s ability to create and maintain pedagogical, organizational, and relational conditions conducive to learning and participation for all children, regardless of their individual characteristics. From this perspective, the construct does not concern the final outcomes of the intervention, but rather the structural and procedural characteristics of the program that determine its potential capacity to promote development, inclusion, and well-being (Patton, 2015; De Bosscher et al., 2015).
Based on this definition, four main theoretical dimensions have been identified, which are considered central to describing the educational and inclusive quality of motor programs in primary schools and consistent with the findings of the reference literature in the field of physical education, educational sports, and inclusion (Gallahue et al., 2012; Haegele & Sutherland, 2015).
The first dimension, Pedagogical Quality of the Program, refers to the degree to which the program is designed and implemented according to intentional teaching principles consistent with the developmental age of the children. It includes the clarity of educational objectives and teaching progressions, the variety and intentionality of the methodologies adopted, as well as the adult’s ability to provide formative feedback and adapt the intervention on an ongoing basis based on the group’s responses. This dimension also includes opportunities for children to actively participate in the learning process through meaningful, cooperative, and skill-building motor experiences, in line with physical education models oriented toward global development and meaningful learning (Bailey et al., 2009; Kirk, 2010).
The second dimension, Inclusion and Participation, concerns the program’s ability to ensure effective and meaningful participation for all children, avoiding explicit or implicit forms of exclusion. This includes the use of activity adaptation strategies (relating to tasks, rules, materials, spaces, and times), the management of heterogeneous skill levels, and attention to preventing phenomena of ‘silent’ exclusion, such as marginalization, the systematic assignment of passive roles, or reduced opportunities to experience success. Particular emphasis is placed on supporting the participation of pupils with special educational needs, understood as an integral part of programme design and not as an ancillary intervention, in accordance with the principles of inclusive education and universal design for learning (Goodwin & Watkinson, 2000; Haegele & Sutherland, 2015; UNESCO, 2017).
The third dimension, Relational Climate and Safety, refers to the quality of relationships that characterize the physical education and sports context and the conditions of safety, both physical and psychological, offered to participants. It includes the quality of interactions between adults and children and among peers, the promotion of a positive and respectful relational climate, as well as the constructive management of conflicts and group dynamics. A central aspect of this dimension is psychological safety, understood as the possibility for children to get involved, make mistakes, and experiment without fear of being judged or ridiculed, a condition recognized as fundamental for learning and active participation (Goodwin & Watkinson, 2000; Bailey et al., 2009). This dimension also includes elements related to the physical and organizational safety of the environment, such as clear routines, adequate supervision, and shared rules.
The fourth dimension, Equity and valuing differences, concerns the way in which the program recognizes and values individual differences, promoting equal opportunities for learning and participation. It includes the fair assignment of roles and responsibilities, the adoption of non-selective and non-discriminatory criteria, and a focus on individual progress rather than competitive comparison between peers. In this perspective, differences between children are not seen as obstacles to be compensated for, but as educational resources to be valued within the group, contributing to the construction of an inclusive and cooperative environment, as emphasized in the literature on equitable and inclusive educational contexts (UNESCO, 2017; De Bosscher et al., 2015).
Together, these four dimensions formed a content map of the construct of educational and inclusive potential, which systematically guided the subsequent generation of EIP-Move items. This map ensured balanced and consistent coverage of the different components of the construct, reducing the risk of over-representation or neglect of specific aspects relevant to the quality of physical education and sports programs in primary schools, in line with good practices for the development of valid and reliable measurement tools (DeVellis, 2017; Streiner et al., 2015).

3.2.2. Generation of the Initial Pool of Items

Starting from the content map defined in the previous phase, an initial pool of 62 items was generated, designed to represent the four theoretical dimensions of the construct of educational and inclusive potential in a comprehensive and balanced way. The items were generated following a systematic and documented process, aimed at ensuring close consistency between the theoretical framework of reference and the operational formulation of the indicators, as well as high applicability in real contexts of physical education and sport, both in schools and in grassroots sports clubs.
The drafting of the items was guided by psychometric and application criteria widely recognised in the literature on the construction of assessment tools in the educational and psychosocial fields (DeVellis, 2017; Streiner et al., 2015). In particular, each item was designed to represent a single relevant content or behaviour (principle of unidimensionality), avoiding the coexistence of multiple constructs within the same statement, so as to facilitate the interpretation of responses and subsequent factor analysis.
In line with the four theoretical dimensions identified, the initial pool was distributed as follows.
The ‘Pedagogical Quality of the Programme’ dimension was initially represented by 16 items, aimed at assessing the degree of educational intentionality of the motor programme (Table 2). The items in this dimension investigate aspects such as: the clarity and consistency of educational objectives; the presence of teaching progressions appropriate to the age and level of the pupils; the variety and significance of the motor experiences proposed; the use of active and participatory methodologies; the ability of the adult to provide formative feedback and to adapt the intervention on an ongoing basis based on the group’s responses. Particular attention was paid to the possibility for children to be actively involved in the motor learning process, in line with physical education models oriented towards global development and meaningful learning (Bailey et al., 2009; Kirk, 2010).
The ‘Inclusion and participation’ dimension was represented by 16 items, designed to assess the programme’s ability to ensure effective opportunities for participation for all children, avoiding explicit or implicit forms of exclusion (Table 3). The items refer to the systematic use of strategies for adapting activities (tasks, rules, materials, spaces and times), managing heterogeneous skill levels, preventing ‘silent’ exclusion (e.g., passive or marginal roles), and supporting the participation of pupils with special educational needs. These items have been formulated in such a way as to capture not the occasional presence of adaptations, but their structural and intentional inclusion in the programme design, in accordance with the principles of inclusive education and universal design for learning (Goodwin & Watkinson, 2000; Haegele & Sutherland, 2015; UNESCO, 2017).
The ‘Relational climate and safety’ dimension was initially represented by 15 items, aimed at assessing the quality of relationships and the physical and psychological safety conditions offered by the programme (Table 4). The items explore aspects such as: the quality of interactions between adults and children and between peers; the promotion of a positive, respectful and cooperative relational climate; constructive conflict management; the presence of clear routines and shared rules; adequate supervision of activities. A specific focus was placed on psychological safety, understood as the possibility for children to get involved, make mistakes and experiment without fear of negative judgements or ridicule, a condition recognised as central to learning and active participation (Goodwin & Watkinson, 2000; Bailey et al., 2009).
The ‘Equity and valuing differences’ dimension was represented by 15 items, aimed at investigating how the programme recognises, respects and values individual differences among children (Table 5). The items in this dimension concern the fair allocation of roles and responsibilities, the adoption of non-selective and non-discriminatory criteria, a focus on individual progress rather than competitive comparison, and the programme’s ability to transform differences in ability, experience or background into educational resources for the group. In this perspective, equity is not understood as uniformity of proposals, but as personalisation of learning opportunities, consistent with the literature on equitable and inclusive educational contexts (UNESCO, 2017; De Bosscher et al., 2015).
At this stage, particular attention was paid to linguistic formulation, favouring simple, clear and unambiguous vocabulary, with the aim of making the items understandable and interpretable in a consistent manner by teachers and coaches with different levels of training, professional experience and operating contexts.
A further criterion adopted in the construction of the items was to minimise technical terms and overly theoretical expressions in favour of formulations anchored to practices and behaviours that can be observed in concrete terms. The items were formulated with reference to realistic and recurring situations in physical education and sports contexts in primary schools, including both the school environment and sports clubs. This methodological choice was aimed at reducing the risk of subjective interpretations, increasing the ecological validity of the tool and promoting an assessment based on the actual and continuous experience of the programme.
Particular attention was also paid to avoiding overly inferential, evaluative or moralistic formulations, which could have led to socially desirable or difficult-to-verify responses. The items were therefore constructed with a focus on descriptions of observable actions, organisational methods or teaching choices, rather than abstract value judgements (e.g., ‘the programme is inclusive’ or ‘the environment is positive’). Where possible, such judgements were translated into explicit behavioural indicators (e.g., ‘the programme provides for systematic adaptations of activities to allow all children to participate’), in line with methodological guidelines for reducing social desirability bias.
As regards the response format, the items were initially drafted using a 5-point Likert scale, with progressive verbal anchors (e.g., 1 = ‘not at all’, 5 = ‘very much’). This choice was motivated by the need to balance the sensitivity of the measure with ease of completion, as well as by the proven reliability and widespread use of Likert scales in the evaluation of educational contexts. The completion instructions explicitly required evaluators to refer to a defined time period (e.g., ‘in the last four weeks of activity’) in order to reduce variability related to memory, limit responses based on general impressions, and encourage an assessment anchored to recent and observable behaviours.
During the initial pool generation phase, the items were intentionally distributed across the four theoretical dimensions, ensuring a balanced representation of the different areas of the construct and a sufficient number of items for each dimension. This choice was made with the aim of supporting subsequent content validity and factor structure analyses, reducing the risk of measurement model instability and allowing for the possible elimination of items in subsequent phases without compromising the conceptual coverage of the construct.
Finally, before submitting the pool of items to the panel of experts for content validity assessment, an initial systematic internal check was carried out to identify any redundant items, potentially ambiguous formulations or elements susceptible to multiple interpretations. This check included a verification of semantic consistency between items belonging to the same dimension and a preliminary analysis of the overall content balance of the instrument. This process allowed for further refinement of the initial pool, while maintaining a sufficient number of items to ensure adequate coverage of the construct and to support the subsequent stages of psychometric validation.

3.2.3. Preliminary Qualitative Review (Face Validity)

Prior to the quantitative assessment of content validity, the initial pool of 62 items (distributed across the four theoretical dimensions: Pedagogical quality of the programme [Items 1–16], Inclusion and participation [Items 17–32], Relational climate and safety [Items 33–47], Equity and valuing differences [Items 48–62]) was subjected to a preliminary qualitative review of face validity, conducted internally by the research group. The purpose of this phase was to verify the apparent adequacy of the items with respect to the theoretical construct, the clarity of the formulations and their relevance to the intended application contexts (primary school and basic sports), prior to the formal assessment by the multidisciplinary panel of experts.
The qualitative review was conducted through systematic reading and collegial comparison of the items, according to an iterative procedure divided into several steps: (a) verification of item-dimension correspondence based on the content map; (b) checking for redundancies and conceptual overlaps within each dimension; (c) linguistic review to increase readability and syntactic uniformity; (d) identification of potentially generic or ‘double’ formulations that could have reduced interpretative univocity; (e) final check of the content coverage of the dimensions and the numerical balance of the items.
Firstly, a review was carried out to identify and eliminate any duplications or conceptual redundancies between items belonging to the same dimension. In particular, clusters of items that could overlap in terms of content or educational implications were analysed (Table 6). For example, in the Pedagogical Quality of the Programme dimension, the formulations relating to design consistency and didactic progression (e.g., Items 2, 11 and 16) were compared, verifying that each expressed a distinct aspect (progression, continuity over time, intentionality of design) and not a semantic repetition. Similarly, in the Inclusion and Participation dimension, the distinction between items relating to adaptations (Items 18, 22), managing heterogeneity (Items 19, 28) and preventing implicit exclusion (Items 20, 26) was verified in order to avoid overlap and ensure more detailed coverage of the various inclusive mechanisms. In the dimensions of Relational climate and safety and Equity and valuing differences, the analysis focused on the possible contiguities between emotional climate and psychological safety (Items 36, 37, 42, 46) and between distributive equity and valuing differences (Items 50, 54, 56), ensuring that each item maintained a specific focus.
Secondly, as highlighted in Table 7, a lexical and syntactic consistency check was carried out with the aim of making the items readable and interpretable in a consistent manner by potential compilers with different levels of training and experience (primary school teachers and grassroots sports coaches). In this phase, uniform editorial rules were applied: predominant use of the present tense; affirmative construction; reduction in parenthetical expressions; choice of non-specialist vocabulary; avoidance of evaluative or moralistic terms. Recurring references (e.g., ‘programme’, ‘activity’, ‘children/pupils’, ‘adult’) were also standardised to reduce terminological ambiguity and promote consistency of understanding.
A third step involved checking for double-barrelled items or overly generic formulations that could have compromised the unambiguous interpretation of the response. In particular, when an item contained more than one potentially distinct content, it was reformulated to maintain the principle of unidimensionality. For example, formulations such as ‘the activities promote transversal skills (e.g., cooperation, self-regulation)’ (Item 14) were retained as a single item, but with examples placed in brackets to clarify the meaning without turning the item into a double question. Similarly, items that could be interpreted as ‘global judgements’ (e.g., ‘the relational climate is positive’) were checked to ensure that they were anchored to observable elements (cooperation, respect, conflict management), reducing the risk that the response would depend solely on general impressions.
Finally, it was verified that each theoretical dimension was represented by an adequate number of items, consistent with the complexity of the construct and with the subsequent psychometric analyses planned (exploratory and confirmatory factor analysis). Specifically, the distribution (16 items for Pedagogical Quality; 16 items for Inclusion and Participation; 15 items for Relational Climate and Safety; 15 items for Equity and Valorisation of Differences) was considered adequate to support the dimensional analyses while maintaining balanced content coverage, even in anticipation of possible item eliminations in subsequent phases.
During the preliminary review, linguistic and operational micro-revisions were carried out, mainly aimed at replacing abstract expressions with more concrete formulations, anchored to practices observable in the motor context. General expressions such as ‘promotes inclusion’ or ‘promotes autonomy’ were translated into more specific behavioural indicators (e.g., reference to flexibility of rules, systematic adaptations, the possibility of different levels of participation, the presence of formative feedback and formative assessment), while maintaining consistency with the theoretical definition of the construct.
In line with the objective of not prematurely reducing the conceptual coverage before external content evaluation, the total number of items was kept unchanged (62) at this stage. The review therefore served to refine and formally standardise the items, preparing the pool of items for a more robust and structured evaluation by the panel of experts in the next phase (CVI), with the aim of maximising the clarity, relevance and applicability of the tool in real-life physical education and sports contexts.

3.2.4. Expert Panel: Composition and Selection Criteria

The content validity of the EIP-Move was assessed by a multidisciplinary panel of 10 experts, selected with the aim of ensuring a plurality of theoretical and applied perspectives and reducing the risk of an overly sectoral reading of the construct. The use of a heterogeneous panel of experts is an established practice in the validation of measurement tools in the educational and psychosocial fields, as it allows for the integration of different disciplinary skills and increases the conceptual representativeness of the construct being evaluated (Polit & Beck, 2006; DeVellis, 2017).
The size of the panel was defined in accordance with methodological recommendations for calculating the Content Validity Index, which indicate that a panel of 8 to 12 experts is adequate to ensure a reliable assessment of content validity and to apply established statistical thresholds for item acceptance (Polit et al., 2007).
The panel included professionals and academics with documented expertise in at least one of the following areas:
  • Motor sciences and physical education teaching, with specific reference to primary school, in order to ensure that the items were consistent with the developmental characteristics of children and with practices that could actually be implemented in motor education contexts (Gallahue et al., 2012; Kirk, 2010).
  • Special education and inclusive education, with experience in the design and evaluation of accessible educational contexts geared towards the participation of pupils with special educational needs, in line with the principles of educational inclusion and universal design for learning (Goodwin & Watkinson, 2000; UNESCO, 2017).
  • Psychometrics and research methodology, with specific expertise in the construction, validation, and analysis of measurement tools, in order to ensure that the item evaluation process complied with the main methodological and psychometric standards (DeVellis, 2017; Streiner et al., 2015).
The criteria for inclusion in the panel were defined on the basis of the indications found in the literature and included:
(a) at least five years of professional or academic experience in the relevant field, or scientific output consistent with the topics of the study, a requirement considered essential to ensure informed judgments based on in-depth knowledge of the domain (Polit & Beck, 2006);
(b) direct knowledge of physical education and basic sports contexts, considered essential for assessing the feasibility, applicability, and observability of the items in the real contexts of use of the tool (Patton, 2015; Kirk, 2010);
(c) willingness to participate in one or more evaluation rounds, should critical issues arise that require further investigation or revision, in line with the iterative approach recommended in content validation processes (DeVellis, 2017; Streiner et al., 2015).
Taken together, these criteria made it possible to establish a balanced and methodologically sound panel capable of providing reliable assessments from both a theoretical and an applied perspective, contributing significantly to the quality and robustness of the EIP-Move content validation process.

3.2.5. Rating Procedure and Assessment Tools

Experts were sent a structured evaluation form, designed specifically to support a systematic and comparable evaluation of the items. The form contained: (a) a concise but detailed definition of the construct of educational and inclusive potential, aimed at ensuring a shared frame of reference among evaluators; (b) a description of the four theoretical dimensions and their respective purposes, indicating the areas of observation to which each dimension referred; (c) clear and standardized instructions for completion; (d) the complete list of items, organized by theoretical dimension in order to facilitate consistency of judgment.
Each item was evaluated in relation to two main criteria, considered central to content validation: relevance to the theoretical dimension to which it belonged and clarity/comprehensibility of the wording. The relevance criterion aimed to verify the degree to which the item adequately represented the theoretical content it intended to measure, while the clarity criterion aimed to assess the ease of understanding of the item by potential compilers and to identify any linguistic or interpretative ambiguities.
For each criterion, the experts expressed their judgment on a 4-point scale (1 = not relevant/not clear; 4 = very relevant/very clear), chosen because it had no neutral point and was consistent with standard procedures for calculating the Content Validity Index (CVI). The absence of a central option encouraged evaluators to take a clear position, reducing the risk of ambiguous responses or central tendency bias.
In addition to the quantitative assessment, there was space for optional qualitative comments for each item, where experts could provide suggestions for rewording, point out semantic or operational ambiguities, or indicate any content they considered missing or underrepresented. These qualitative comments were an essential part of the validation process, allowing the numerical assessment to be supplemented with interpretative information useful for the subsequent revision of the items and reinforcing the iterative approach adopted in the development of the tool.

3.2.6. Calculation of the Content Validity Index (CVI) and Decision Criteria

The content validity of the EIP-Move was quantified by calculating the Content Validity Index (CVI) both at the individual item level and at the overall scale level, in line with the procedures commonly adopted in the validation of measurement instruments in the educational and psychosocial fields (Polit & Beck, 2006). In particular, two main indicators were calculated.
The first indicator, the Item Content Validity Index (I-CVI), was defined as the proportion of experts who gave the item a score of 3 or 4 in terms of relevance to the theoretical dimension to which it belonged. This index made it possible to assess the degree of agreement among experts regarding the adequacy of each item in representing the reference construct.
The second indicator, the Scale Content Validity Index (S-CVI/Ave), was calculated as the average of the I-CVIs for the items belonging to each theoretical dimension and to the instrument as a whole. This index provided a summary measure of the overall representativeness of the content and consistency of the items with respect to the conceptual dimensions identified.
In line with the recommendations in the literature for panels composed of ten experts, an I-CVI ≥ 0.78 was adopted as the main threshold for considering an item adequately representative of the construct. This value corresponds to a level of agreement considered statistically acceptable and widely used in similar studies. Based on this quantitative criterion, the items were initially classified according to three decision categories:
  • Acceptance, for items with I-CVI ≥ 0.78 and no significant qualitative issues emerging from the experts’ comments;
  • Revision, for items with I-CVI ≥ 0.78 but accompanied by converging comments on aspects of clarity, formulation, or observability, or for items with values between 0.70 and 0.77 but characterized by strong theoretical consistency;
  • Elimination, for items with I-CVI < 0.78 associated with comments indicating poor relevance, conceptual overlap with other items, or difficulty in reliable assessment in application contexts.
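These indices and the quantitative part of the decision rules reduce to a few lines of arithmetic. The sketch below uses hypothetical ratings from a 10-expert panel, not the study’s data, and applies only the numerical thresholds; the qualitative review of converging expert comments described below is not modelled:

```python
# Compute the Item Content Validity Index (I-CVI) and the scale average
# (S-CVI/Ave) from a matrix of expert relevance ratings on the 4-point scale.
# Rows = items, columns = 10 experts; all ratings here are hypothetical.

ratings = [
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 4],  # item 1: unanimous agreement
    [3, 4, 4, 3, 4, 2, 4, 4, 3, 4],  # item 2: one dissenting rating
    [2, 3, 2, 4, 3, 2, 3, 2, 3, 2],  # item 3: weak agreement
]

def i_cvi(item_ratings):
    """Proportion of experts rating the item 3 or 4 on relevance."""
    return sum(1 for r in item_ratings if r >= 3) / len(item_ratings)

i_cvis = [i_cvi(item) for item in ratings]
s_cvi_ave = sum(i_cvis) / len(i_cvis)

# Numerical part of the decision rules only; in the study, converging
# qualitative comments could still send an above-threshold item to revision.
decisions = ["accept" if v >= 0.78
             else "revise" if v >= 0.70
             else "eliminate"
             for v in i_cvis]

print(i_cvis)               # [1.0, 0.9, 0.5]
print(round(s_cvi_ave, 2))  # 0.8
print(decisions)            # ['accept', 'accept', 'eliminate']
```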
Alongside the quantitative criteria, a supplementary qualitative assessment was adopted, based on the analysis of the convergence of the comments provided by the experts. In particular, when several evaluators reported the same critical issue—such as overly broad formulations, semantic ambiguities, or the risk of multiple interpretations—the item was reviewed even if its I-CVI value was above the threshold. This combined approach made it possible to balance statistical rigor and expert judgment, strengthening the overall validity of the decision-making process.

3.2.7. Item Review and Production of the Preliminary Version

Based on the results of the Content Validity Index and the qualitative analysis of the comments provided by the panel of experts, an iterative process of item revision was initiated, aimed at improving the clarity, interpretative univocity and theoretical relevance of the instrument, while maintaining adequate coverage of the construct of educational and inclusive potential.
At this stage, revision decisions were guided by three main criteria:
(a) elimination or merging of conceptually overlapping items;
(b) reformulation of items characterised by excessive generality or risk of double-barrelledness;
(c) maintaining a structural balance between the four theoretical dimensions.
Some items were merged when they were deemed redundant but theoretically relevant, favouring clearer wording that was more closely linked to observable behaviours. Other items were eliminated when they had low conceptual priority compared to other more specific indicators or when they were difficult to distinguish in the assessment practice. The reformulations mainly concerned the strengthening of the unidimensionality and observability of the items, in line with the indications that emerged during the face validity phase and in the CVI assessment.
The entire revision process was conducted with constant attention to the balance between the theoretical dimensions, so as to avoid a disproportionate reduction in the number of items in a specific area of the construct. At the end of this phase, the instrument was reduced from 62 to 48 items, maintaining the structure divided into four theoretical dimensions and ensuring a balanced distribution of items within each dimension (Table 8).
The preliminary 48-item version formed the basis for the next phase of the pilot study, which aimed to assess the feasibility of administration, comprehensibility in context and preliminary functioning of the items in real-life physical education and sports contexts. This step allowed for further verification of the suitability of the tool before its application to larger samples and subsequent psychometric validation (Table 9).

3.3. Phase 2: Pilot Study

The pilot study phase was aimed at testing the preliminary version of the EIP-Move (48 items) in real physical education and sports contexts, in order to verify: (a) the feasibility of administration in terms of time, cognitive load for respondents, and organizational sustainability; (b) the comprehensibility of the items and instructions; (c) the adequacy of the response format and Likert anchors; (d) the preliminary functioning of the items before the start of large-scale psychometric validation. In line with good practices for the development of measurement tools, the pilot study served to ‘refine’ the tool, allowing for the identification of operational problems and preliminary indicators of psychometric weakness (e.g., reduced response variability, redundancy, or observation difficulties) that are difficult to identify in the content validity phase alone (DeVellis, 2017; Streiner et al., 2015). The pilot phase also made it possible to assess the consistency between the theoretical construct and the actual practices observable in the contexts, reinforcing the ecological validity of the instrument.

3.3.1. Sample and Context

The preliminary version of the test was administered to a pilot sample of 120 physical education and sports programs aimed at primary school children. The sample was selected according to a criterion of intentional heterogeneity, with the aim of including adequate variability in terms of the context of application (primary school and sports clubs), organizational modalities (number of weekly sessions, duration of sessions, group size), and general characteristics of the class/team (heterogeneity of motor skills, presence of students with special educational needs, gender differences).
The choice of a heterogeneous sample was motivated by the need to test the tool in conditions that reflected the complexity and diversity of real-life physical education and sports contexts in primary schools, avoiding an excessively uniform pilot study that could have masked critical issues related to the comprehensibility or applicability of the items. At this stage, the objective was not the statistical representativeness of the sample, but rather to verify the robustness and flexibility of the tool with respect to different educational and organizational configurations.
The pilot sample included programs characterized by different methodological structures, such as predominantly recreational and motor activities, introductory sports courses, and mixed programs that integrated technical and coordination elements and cooperative games. This methodological diversification made it possible to assess the EIP-Move’s ability to maintain comprehensibility, interpretative consistency, and descriptive usefulness even in the presence of different teaching approaches, reinforcing the ecological validity of the tool and its potential transferability to a variety of educational and sporting contexts.

3.3.2. Administration and Data Collection Procedure

The EIP-Move was administered to program managers (teachers/coaches) and, where applicable, to trained internal evaluators (e.g., teaching coordinators or staff members) in order to test not only the comprehensibility of the items, but also the clarity of the instructions and the interpretative stability of the tool among potentially different compilers. This choice made it possible to verify whether the items could be interpreted in a sufficiently consistent manner by individuals with different roles and levels of involvement, which is important in view of the future use of the test in complex organizational contexts.
Before administration, a brief operational guide was provided, containing concise definitions of key concepts, explicit indications on the reference time period, and some examples of compilation. The aim of this guide was to reduce interpretative heterogeneity, limit responses based on generic assessments, and increase the standardization of the procedure, while maintaining an administration method compatible with the organizational constraints of educational and sports contexts.
The questionnaire was completed with reference to a standardized time period (e.g., “last 4 weeks”), in line with the design of the tool and with the aim of encouraging assessments based on recent and observable practices, rather than on general or retrospective impressions. At the end of the compilation, participants were also asked to provide qualitative feedback through short open-ended questions and/or margin notes, aimed at identifying:
  • items perceived as ambiguous, too generic, or difficult to interpret unambiguously;
  • items perceived as redundant or overlapping, especially within the same dimension;
  • content considered relevant by respondents but not adequately represented in the tool;
  • operational difficulties (e.g., need for additional examples, clarification on response anchors, or doubts about the reference period);
  • perceived average completion time and sustainability in the specific context (e.g., immediate completion at the end of the lesson or at a later time).
Finally, the presence of missing responses was systematically monitored, interpreting it as a possible indirect indicator of problems with the item (e.g., difficulty of observation or semantic ambiguity) or procedural issues (order of items, cognitive load, completion time). This information was incorporated into subsequent decisions to revise and reduce the instrument.

3.3.3. Preliminary Analyses: Descriptive and Item Functioning

Preliminary statistical analyses were conducted for exploratory and screening purposes, consistent with the nature of the pilot study, and included a combination of quantitative and qualitative indicators aimed at assessing the functioning of the items.
First, descriptive analyses of the items were performed (mean, standard deviation, distribution of responses and, where useful, skewness and kurtosis indices) in order to verify adequate variability of responses and identify items characterized by extreme or highly skewed distributions. Items with very low variability were considered potentially uninformative for distinguishing programs with different levels of educational and inclusive potential.
Subsequently, floor effects and ceiling effects, considered indicators of low discriminative power, were evaluated. Items characterized by a high concentration of responses at minimum or maximum values are in fact less sensitive in detecting differences between programs. At this stage, attention criteria based on thresholds frequently used in the literature (e.g., more than 15–20% of responses at the minimum or maximum) were adopted, with an integrated assessment that took into account both quantitative evidence and qualitative comments from the compilers.
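The variability and floor/ceiling screening described above can be sketched in a few lines; the response vectors and the 15% flagging threshold below are illustrative, not the pilot data:

```python
# Screening sketch for response variability and floor/ceiling effects on a
# 5-point item; the 15% attention threshold mirrors the range cited above.
# Response vectors are invented for illustration, not the pilot data.
from statistics import mean, stdev

responses = {
    "item_a": [3, 4, 2, 5, 3, 4, 2, 3, 4, 3],  # well-spread responses
    "item_b": [5, 5, 5, 4, 5, 5, 5, 5, 4, 5],  # ceiling-heavy responses
}

def screen(values, scale_min=1, scale_max=5, flag_prop=0.15):
    """Return descriptives plus floor/ceiling proportions and an attention flag."""
    n = len(values)
    floor = sum(v == scale_min for v in values) / n
    ceiling = sum(v == scale_max for v in values) / n
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "floor": floor,
        "ceiling": ceiling,
        "flagged": floor > flag_prop or ceiling > flag_prop,
    }

report = {item: screen(vals) for item, vals in responses.items()}
print(report["item_b"]["flagged"])  # True: 80% of responses at the maximum
```

In the study, a flag of this kind did not trigger automatic removal; it was weighed together with redundancy and the compilers’ qualitative comments.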
A preliminary check of item discrimination was also carried out by calculating the corrected item-total correlations (where applicable) and verifying the consistency of the items within the theoretical dimension to which they belonged. Items with very low or negative correlations were considered potentially problematic, as they indicated poor consistency with the conceptual domain or interpretative difficulties.
At the same time, the preliminary internal consistency of the dimensions was explored (e.g., using Cronbach’s α and/or McDonald’s ω), interpreted with caution since the purpose of the pilot study was not to confirm the factorial structure but to identify early signs of critical issues. In this sense, the analysis of internal consistency was used primarily as a diagnostic tool, useful for identifying items that, if removed, could improve the preliminary consistency of the subscale without compromising the conceptual coverage of the construct (Streiner et al., 2015).
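As a minimal illustration of these item-discrimination and internal-consistency checks, the sketch below computes corrected item-total correlations (each item against the total of the remaining items) and Cronbach’s α for one hypothetical subscale; the data matrix is invented for illustration:

```python
# Corrected item-total correlations and Cronbach's alpha for one subscale.
# "Corrected" = each item correlated with the total of the *other* items.
# The 5x4 matrix (respondents x items) is invented for illustration;
# item 4 is deliberately inconsistent with the rest of the subscale.
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

data = [
    [4, 4, 5, 3],
    [3, 3, 4, 4],
    [5, 4, 5, 3],
    [2, 2, 3, 4],
    [4, 5, 4, 3],
]
n_items = len(data[0])
items = [[row[j] for row in data] for j in range(n_items)]

corrected = []
for j in range(n_items):
    rest_total = [sum(row) - row[j] for row in data]
    corrected.append(pearson(items[j], rest_total))
# corrected[3] is negative: item 4 would be flagged as inconsistent.

item_vars = [pvariance(col) for col in items]
total_var = pvariance([sum(row) for row in data])
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
```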
Alongside quantitative indicators, particular importance was given to qualitative evidence emerging from the feedback provided by respondents. Items that showed statistically acceptable functioning but were repeatedly reported as unclear, too broad, or difficult to observe reliably were still considered candidates for revision or removal, in line with the integrated and iterative approach adopted throughout the EIP-Move development process.

3.3.4. Criteria for Revision and Reduction of the Instrument

Based on the results of the quantitative and qualitative analyses, the EIP-Move items were systematically evaluated using integrated decision-making criteria, consistent with the iterative approach adopted in the previous phases. The aim of this phase was not to further reduce the tool for reasons of brevity alone, but to optimise the set of 48 items in terms of accuracy, clarity and sustainability of use, while preserving the theoretical coverage of the four dimensions.
In particular, the following were considered problematic and therefore candidates for revision or removal (Table 10):
  • Low variability and poor sensitivity: items characterised by marked floor or ceiling effects and reduced dispersion of responses. These items, being uninformative, limit the instrument’s ability to distinguish between different programmes. The decision to remove these items was taken mainly when the effect was also associated with redundancy with conceptually similar items.
  • Interpretative ambiguity and generality: items perceived as too broad or interpretable in different ways (e.g., global judgements on climate or inclusion), or items that did not clearly indicate the observable object. In these cases, the priority was to reformulate them into more specific behavioural indicators.
  • Poor observability: items considered difficult to assess due to a lack of evidence accessible to the compiler or because they depended on internal/not directly observable processes. In such cases, the item was revised to anchor it to verifiable behaviours or organisational conditions (e.g., routines, rules, adaptations, feedback).
  • Intra-dimension redundancy: items with very high inter-item correlations (suggesting overlap) or reported as ‘repetitive’ by compilers. Redundancy was managed by retaining the item with the clearest, most observable and most discriminative wording, while still preserving coverage of the sub-aspects of the dimension.
  • Low consistency with the theoretical dimension: items with weak or negative item-total correlations with respect to their own subscale, especially when accompanied by convergent qualitative feedback indicating uncertainty about meaning or difficulty in conceptual placement.
During the review process, constant attention was paid to balancing the four dimensions (12 items each) in order to avoid imbalances in the representation of the construct. Decisions were therefore not based solely on statistical criteria: each proposal for removal or modification was verified against the content map to ensure that the essential sub-components (e.g., educational progression and formative feedback for the pedagogical dimension; systematic adaptations and prevention of implicit exclusion for the inclusive dimension; psychological safety and constructive conflict management for the climate; valuing differences and non-discrimination for equity) remained adequately represented.
Finally, in line with the pilot approach, decisions were treated as optimisation hypotheses rather than definitive confirmations: items that were retained, reformulated or replaced were considered in view of subsequent validation on larger samples, in which more robust procedures would be implemented (e.g., exploratory/confirmatory factor analyses and invariance tests). In this sense, the revision represented a key step in refining the instrument, improving its practical usefulness and preparing it for more rigorous psychometric validation.

3.3.5. Output of the Pilot Phase

Following the pilot study and the application of the review criteria described above, eight items were eliminated, resulting in a 40-item version of the EIP-Move (Table 11). The resulting version retained the conceptual structure divided into four theoretical dimensions, ensuring adequate and balanced coverage of the construct of educational and inclusive potential.
The 40-item version was considered more suitable for the subsequent large-scale validation phase because:
  • it offered greater operational and interpretative clarity of the items, reducing the risk of ambiguity in compilation;
  • it limited the risk of redundancy and conceptual overlap, improving the overall readability of the instrument;
  • it showed better preliminary discriminative power, thanks to the removal of items characterised by marked floor or ceiling effects;
  • it optimised application sustainability, reducing the compilation burden and making the instrument more easily usable in educational and sporting contexts.
The 40-item version was therefore adopted as the basis for Phase 3: cross-sectional psychometric validation, aimed at empirically testing the factorial structure of the EIP-Move, its reliability properties and the main forms of validity envisaged by the research design, with larger and more differentiated samples.

3.4. Phase 3: Sample and Validation Procedure

Phase 3 aimed to subject the 40-item version of the EIP-Move to large-scale psychometric validation, including both school and sports contexts, in order to test its factor structure, reliability, and the main forms of validity envisaged by the research design. The main sample involved 52 primary schools and 34 sports clubs, in order to ensure adequate ecological variability and to verify the applicability of the tool in different organizational contexts that share the educational purpose of physical activity.
The recruitment of contexts was geared toward including programs with reasonable differences in terms of organization, resources, and teaching approach, so as to increase the likelihood of observing significant variability in scores and testing the robustness of the tool under realistic conditions. Consistent with the validation objective, the primary goal was not to achieve national statistical representativeness, but rather to ensure broad coverage of situations typical of physical education and sports in primary school, with a balanced distribution between school and grassroots sports.

3.4.1. Predictive Validity: Criterion Definition and Time Horizon

Predictive validity was examined by analysing the extent to which EIP-Move scores obtained at baseline (T0) predicted relevant programme-level outcomes assessed at subsequent measurement points (T1 and T2). In line with the conceptualisation of EIP-Move as a tool for evaluating programme potential rather than individual outcomes, the external criteria selected were proximal indicators theoretically linked to the educational and inclusive quality of physical education and sport programmes.
Specifically, four external criteria were considered: (a) level of active student participation, operationalised as sustained engagement across sessions; (b) quality of the relational climate, assessed through structured observations of peer and adult–child interactions; (c) perceived psychological safety of the group, defined as the extent to which children could participate without fear of ridicule or exclusion; and (d) programme continuity, defined as the stability and persistence of programme implementation over time.
Predictive associations were tested over two longitudinal intervals: from T0 to T1 (approximately one school year) and from T0 to T2 (12-month follow-up), allowing the examination of both short-term and medium-term predictive effects.
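The predictive logic described above (baseline EIP-Move scores predicting later programme-level outcomes) can be sketched as a simple bivariate regression. The programme-level pairs below are illustrative, not the study’s data, and a full analysis would of course use models appropriate to the nested longitudinal design:

```python
# Minimal sketch of the predictive-validity check: regress a T1 outcome
# (here, a hypothetical "active participation" index) on baseline (T0)
# EIP-Move programme scores. All numbers are invented for illustration.
from statistics import mean

t0_scores = [2.8, 3.1, 3.5, 3.9, 4.2, 4.6]               # EIP-Move totals at T0
t1_participation = [0.55, 0.60, 0.68, 0.74, 0.80, 0.86]  # outcome index at T1

def ols(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

slope, intercept = ols(t0_scores, t1_participation)
# A positive slope indicates that higher baseline educational and inclusive
# potential is associated with higher participation at follow-up.
```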

3.4.2. Sample Justification, Sampling Procedure, and Longitudinal Framework

The sample size and longitudinal structure of the study were defined a priori in order to ensure the adequacy of the psychometric analyses planned, particularly confirmatory factor analysis (CFA) and longitudinal validation procedures.
Sample Size and Characteristics Across Measurement Waves
The large-scale validation phase (Phase 3) involved a total of 1248 primary school children aged 6–11 years, nested within 86 physical education and sports programmes (52 primary schools and 34 sports clubs), evaluated by 96 teachers/coaches responsible for programme implementation. The unit of analysis was the programme (class or sports group), assessed through observable pedagogical and organisational practices.
Data were collected at three measurement points within a longitudinal framework:
  • T0 (baseline): beginning of the school or sports year, once organisational routines had stabilised;
  • T1 (post-test): end of the school year or sports season;
  • T2 (follow-up): approximately 12 months after T1, to assess temporal stability and predictive validity.
At each wave, programmes that met the inclusion criteria (continuity over time, identifiable programme leader, sufficient observability of practices) were retained. Attrition across waves was monitored and remained within acceptable limits for longitudinal research in applied educational contexts, with no systematic loss associated with programme type, context, or presence of pupils with special educational needs.
Sampling Procedure
The sampling strategy was purposive and stratified, aimed at maximising ecological variability rather than national representativeness. Programmes were recruited to ensure heterogeneity with respect to:
  • context of implementation (primary schools vs. grassroots sports clubs);
  • organisational characteristics (group size, weekly frequency, session duration);
  • pedagogical approaches (play-oriented, mixed, or sport-specific programmes);
  • presence of pupils with special educational needs.
This approach was consistent with the objective of validating an instrument intended for use across diverse real-world educational and sports settings, and with methodological recommendations for psychometric validation in applied contexts.
Justification of Sample Size for CFA and Longitudinal Analyses
The adequacy of the sample size was evaluated in relation to the requirements of confirmatory factor analysis and measurement invariance testing. The final CFA model included 40 observed variables loading on four correlated latent factors. Methodological literature generally recommends a minimum of 200–300 observations for stable CFA estimation, particularly when models include multiple latent dimensions and ordinal indicators.
The sample size available at baseline (more than 300 programme-level evaluations; N children = 1248) largely exceeded these minimum recommendations, ensuring robust parameter estimation, sufficient statistical power, and stable fit indices. Moreover, the use of the WLSMV estimator, specifically recommended for ordinal data, further supports the adequacy of the sample for CFA and multi-group invariance analyses.
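As a rough illustration of this adequacy check, the free parameters implied by the hypothesised model can be counted and compared with the available sample. The parameterisation below (unit-variance factors, simple structure, four thresholds per 5-point item) is an assumption made for illustration and may differ from the exact setup used by the study's software.

```python
# Illustrative back-of-the-envelope count of free parameters for the
# hypothesised EIP-Move CFA: 40 ordinal (5-point) items loading on 4
# correlated latent factors, identified by fixing factor variances to 1.

def cfa_free_parameters(n_items: int, n_factors: int, n_categories: int) -> int:
    loadings = n_items                               # one loading per item (simple structure)
    factor_covs = n_factors * (n_factors - 1) // 2   # correlations among latent factors
    thresholds = n_items * (n_categories - 1)        # k-1 thresholds per ordinal item
    return loadings + factor_covs + thresholds

params = cfa_free_parameters(n_items=40, n_factors=4, n_categories=5)
ratio = 1248 / params  # children at baseline per free parameter
print(params, round(ratio, 1))
```

Under these assumptions the model has roughly 206 free parameters, i.e., about six baseline observations per parameter at the child level, alongside the 200–300 minimum-N guideline cited above.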
The longitudinal design, with three measurement points spaced across approximately one academic year and a subsequent follow-up, allowed for the integration of cross-sectional psychometric validation with analyses of temporal stability, sensitivity to change, and predictive validity, in line with best practices for the validation of educational measurement instruments.

3.4.3. Participants and Units of Analysis

The overall sample included 1248 children aged between 6 and 11, distributed across the various programs analyzed, and 96 teachers/coaches responsible for conducting the activities. The primary unit of analysis for EIP-Move was the program (or group of activities) as implemented in the specific context (class/sports group), evaluated through observable practices and the methodological choices adopted.
It is important to note that the data have a hierarchical (nested) structure: children are nested within groups/classes or teams, which in turn are nested within schools or sports clubs. This feature was taken into account when planning the subsequent analyses, both to avoid overestimating statistical precision and to assess the consistency of the construct at different organizational levels (program, context). The presence of primary school children made it possible to test the tool in the target group envisaged by the theoretical model, capturing wide variability linked to developmental differences between the early and late years of primary school. The participation of teachers/coaches with different profiles also made it possible to verify the robustness of the tool across different educational styles and operating methods (e.g., greater focus on play, more technical-coordinative approaches, emphasis on cooperation or regulated competition).

3.4.4. Administration Procedure, Assessor Profile, and Inter-Rater Reliability

The EIP-Move was designed as a programme-level evaluation tool, intended to assess the educational and inclusive characteristics of physical education and sports programmes through indicators anchored to observable pedagogical, organisational, and relational practices. Consequently, particular attention was paid to clearly defining the profile of the evaluator, the assessment procedure, and the procedures adopted to ensure inter-rater reliability.
Assessor Profile and Role
The instrument was completed by two categories of evaluators, depending on the phase and purpose of the assessment:
  • Teachers/coaches responsible for programme implementation, who completed the EIP-Move as an internal evaluation based on their continuous and direct involvement in the activities;
  • Trained external observers, involved for methodological control purposes and for the estimation of inter-rater reliability.
Teachers and coaches were considered suitable evaluators because the EIP-Move does not assess individual pupil outcomes, but rather structural and procedural characteristics of the programme (e.g., use of adaptations, management of heterogeneity, pedagogical intentionality, relational climate), which require prolonged familiarity with the programme and cannot be reliably captured through brief or isolated observations alone.
External observers were researchers or trained professionals with expertise in physical education and inclusive practices. Their role was to provide an independent evaluation of the same programmes, reducing the risk of bias related to self-evaluation or social desirability and allowing for the estimation of inter-rater agreement.
Assessment Procedure and Observability of Items
All evaluators were instructed to complete the instrument with reference to a clearly defined time window (e.g., the last four weeks of activity) and to base their judgments exclusively on observable practices, recurrent organisational choices, and documented teaching behaviours, avoiding abstract or purely inferential evaluations.
Although some items refer to constructs that may appear conceptually abstract (e.g., “The programme recognises individual differences as an educational resource”), these were intentionally formulated to capture operationalised manifestations of such constructs. In this example, recognition of differences as a resource was operationalised through observable indicators such as: differentiated task proposals, flexible rules, varied participation roles, explicit valuing of diverse competencies, and consistent non-selective practices across sessions. Evaluators were explicitly instructed to anchor their ratings to these concrete manifestations rather than to general impressions or value judgments.
This approach is consistent with the conceptualisation of EIP-Move as a tool for evaluating programme potential rather than internal attitudes or unobservable beliefs, and it reflects the methodological effort made during item construction and revision to maximise observability and reduce interpretative ambiguity.
Inter-Rater Reliability
Inter-rater reliability was assessed on a subsample of programmes evaluated independently by both teachers/coaches and trained external observers. Agreement between raters was estimated using the Intraclass Correlation Coefficient (ICC), which is considered appropriate for continuous or ordinal ratings and for assessing consistency between different evaluators.
The results indicated good inter-rater agreement, with ICC values exceeding the threshold commonly interpreted as satisfactory in applied educational research (ICC > 0.75). These findings suggest that, when supported by clear instructions and a shared reference framework, the EIP-Move can be applied consistently by evaluators with different roles, reinforcing the reliability and practical applicability of the instrument in real-world educational and sports contexts.
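The agreement analysis described above can be sketched in miniature. The pure-Python function below computes a two-way consistency ICC, ICC(C,1) in the McGraw and Wong convention, from the ANOVA sums of squares; this is one common formulation, offered as an illustration since the exact ICC variant and software used in the study are not specified here.

```python
# Minimal sketch of a two-way consistency ICC, ICC(C,1), as one common
# way to quantify agreement between teacher/coach and external-observer
# ratings of the same programmes.

def icc_consistency(ratings):
    """ratings: list of rows, one row per programme, one column per rater."""
    n = len(ratings)          # programmes (targets)
    k = len(ratings[0])       # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Two raters who rank five programmes identically but differ by a
# constant offset yield perfect consistency:
print(icc_consistency([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]))  # 1.0
```

A consistency ICC, unlike an absolute-agreement ICC, deliberately ignores such constant rater offsets, which is why the example above reaches 1.0 despite the one-point difference between raters.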

3.4.5. Program Inclusion and Exclusion Criteria

Programs that met the following criteria were included: (a) physical activity or sports with educational purposes aimed at children aged 6–11; (b) minimum duration sufficient to ensure the observability of practices (e.g., continuity of the program throughout the school or sports year); (c) presence of an identifiable person in charge (teacher/coach) available to complete the tool and participate in the planned measurements.
Occasional programs or single events were excluded, as were unstructured interventions or those lacking a minimum degree of continuity, as these were difficult to evaluate reliably with respect to the dimensions considered by EIP-Move. In addition, situations in which the program manager could not guarantee sufficient knowledge of the activities carried out during the reference period (e.g., frequent substitutions or undocumented staff rotations) were considered unsuitable, as this condition could have increased measurement error.
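As a sketch only, the inclusion and exclusion criteria above can be encoded as a screening predicate; every field name below is hypothetical and not taken from the study's actual codebook.

```python
# Hypothetical encoding of the stated programme inclusion criteria as a
# screening predicate (field names are illustrative, not the study's).

def programme_eligible(p: dict) -> bool:
    return (
        p["educational_purpose"]                       # (a) educational physical activity/sport
        and 6 <= p["min_age"] and p["max_age"] <= 11   # (a) target age range 6-11
        and p["runs_full_year"]                        # (b) continuity over the year
        and p["leader_identifiable"]                   # (c) identifiable teacher/coach
        and not p["frequent_substitutions"]            # leader knows the activities
    )

print(programme_eligible({
    "educational_purpose": True, "min_age": 6, "max_age": 11,
    "runs_full_year": True, "leader_identifiable": True,
    "frequent_substitutions": False,
}))  # True
```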

3.4.6. Administration Procedure and Role of Assessors

The EIP-Move was administered using a multi-informant approach, involving:
  • completion by teachers/coaches, as an internal evaluation of the program;
  • evaluation by trained external observers, for the purposes of methodological control and reduction of the risk of bias related to social desirability or self-evaluation.
External observers were trained using a standardized protocol that included: (a) presentation of operational definitions of dimensions and scoring rules; (b) practical examples of observation and guided discussion; (c) discussion of ambiguous cases and critical situations typical of the contexts; (d) exercises using simulated materials or pilot contexts. The training was aimed at reducing variability between observers and supporting, where appropriate, estimates of inter-rater reliability.
The tool was completed with reference to a defined and uniform time period, with instructions requiring evaluators to base their assessments on observable evidence and practices actually implemented, avoiding unsupported abstract judgments. When the assessment was administered in close proximity to field observations, observers were asked to anchor their scores to actual episodes and teaching choices, so as to increase the traceability of the assessment.

3.4.7. Timeline of the Surveys

The surveys were conducted at three distinct points in time, consistent with a longitudinal approach aimed at measuring both the stability of the measurement and the sensitivity to change of the EIP-Move in relation to the evolution of physical education and sports programs:
  • T0 (beginning of the year): the initial survey was conducted in the first weeks of the school or sports year, once the basic organization of activities had been stabilized (formation of groups, definition of schedules, general methodological approach). This survey served to describe the initial level of educational and inclusive potential of the program and to establish a baseline, useful both for cross-sectional analyses and as a reference point for subsequent change analyses.
  • T1 (end of year): the post-survey was conducted at the end of the school year or sports season, at a stage when the program had reached full operational maturity and educational routines were well established. This time frame was chosen to assess any changes in EIP-Move scores associated with the ongoing implementation of educational and inclusive practices, allowing for analysis of the tool’s sensitivity to change in relation to program improvement, adaptation, or stabilization processes throughout the year.
  • T2 (12-month follow-up): the follow-up survey was conducted approximately twelve months after the T1 survey, with the aim of verifying the temporal stability of the measure and exploring the predictive validity of EIP-Move with respect to relevant outcomes and the continuity of educational and inclusive practices over time. This allowed us to assess whether the levels of educational and inclusive potential previously identified were associated with the maintenance of quality programs, the consistency of pedagogical choices, and the persistence of conditions conducive to participation and inclusion.
For each measurement time, a standardized administration time window was set, consistent with the school and sports calendar, in order to reduce heterogeneity due to significantly different times of the year (e.g., start-up phases not yet stabilized, periods of suspension of activities, or phases characterized by atypical organizational loads). This methodological approach helped to improve the comparability of data between contexts and between survey times.
Taken together, the three time points made it possible to integrate a cross-sectional component (analysis of the factor structure and reliability of the EIP-Move on T0 and/or T1 data) with a longitudinal component, aimed at assessing the stability of the measures, the ability of the instrument to detect changes over time, and the relationship between EIP-Move scores and future indicators of educational and inclusive quality. This time frame also laid the foundations for longitudinal invariance analyses and the application of change models, in line with the advanced validation objectives of the instrument.

3.5. Statistical Analyses

Statistical analyses were conducted using a progressive and hierarchical approach, consistent with the psychometric validation of the EIP-Move and the ordinal nature of the responses (5-point Likert scale). Specifically, the analytical strategy was structured to: (a) explore the dimensionality of the instrument and the quality of item functioning; (b) confirm the hypothesized factor structure; (c) assess the reliability and stability of the measures; (d) verify the equivalence of the measure between groups and over time; (e) analyze longitudinal change and predictive validity in relation to relevant educational and organizational outcomes.

3.5.1. Data Preparation and Preliminary Checks

In the first phase, the data underwent systematic screening and quality-control procedures aimed at reducing the risk of response errors and verifying the adequacy of the dataset for the planned psychometric analyses. In particular, the completeness of responses was verified and potentially problematic patterns were flagged, such as invariant responses (straightlining), inconsistent sequences, or unusually rapid completion times, which may indicate limited respondent engagement and introduce noise into parameter estimation (DeVellis, 2017; Kline, 2016). At the same time, an exploratory analysis of the item distributions (response frequencies, means and standard deviations, and skewness and kurtosis indices where informative) was conducted to identify items characterized by low variability or extremely unbalanced distributions, conditions that can reduce discriminative power and hinder the recovery of the latent structure (Streiner et al., 2015; Brown, 2015).
Missing data were monitored both in terms of overall quantity and in relation to specific items and subgroups (e.g., school vs. sports club), in order to detect possible critical issues of comprehension, semantic ambiguity, or difficulty in applying the item in context. This preliminary analysis allowed us to interpret missing data not only as a technical problem, but also as a potential indicator of weakness in content or wording (DeVellis, 2017; Streiner et al., 2015). The management of missing data was planned in accordance with the ordinal nature of the variables and with the subsequent models, avoiding simplistic solutions (e.g., listwise deletion) that can reduce statistical power and introduce bias when data are not missing completely at random (Little & Rubin, 2002; Enders, 2010).

Given the ordinal nature of the items (5-point Likert scale), an analytical treatment consistent with the methodological recommendations for this type of data was adopted. In particular, a matrix of polychoric correlations was used for the factor analyses, as these are considered more appropriate than Pearson correlations when the observed variables represent ordered categories that approximate an underlying continuum (Holgado-Tello et al., 2010; Flora & Curran, 2004). The use of polychoric correlations reduces the risk of underestimating associations between items and distorting the factor structure, especially in the presence of asymmetric distributions or unevenly used categories (Brown, 2015; Kline, 2016). This preliminary work yielded a more reliable dataset consistent with the assumptions of the subsequent models, improving the overall robustness of the psychometric inferences.
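The screening steps above (straightlining detection, degenerate item distributions) can be sketched as follows; the 90% dominance cutoff used here is an illustrative choice, not a threshold reported by the study.

```python
# Illustrative screening pass over 5-point Likert responses: flags
# straightlined records (zero within-respondent variability) and items
# with degenerate distributions (one category absorbing most responses).

def screen_responses(records, dominance_cutoff=0.90):
    """records: list of equal-length lists of item responses (1..5)."""
    straightliners = [i for i, row in enumerate(records)
                      if len(set(row)) == 1]          # invariant response pattern
    n_items = len(records[0])
    low_variability_items = []
    for j in range(n_items):
        col = [row[j] for row in records]
        top_share = max(col.count(c) for c in set(col)) / len(col)
        if top_share >= dominance_cutoff:             # one category dominates
            low_variability_items.append(j)
    return straightliners, low_variability_items

records = [
    [3, 3, 3, 3],   # straightliner
    [4, 5, 3, 5],
    [2, 4, 3, 5],
    [5, 3, 3, 4],
]
print(screen_responses(records))  # ([0], [2])
```

Flagged respondents and items would then be inspected manually rather than dropped automatically, consistent with the diagnostic (not merely technical) reading of missingness described above.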

3.5.2. Exploratory Factor Analysis (EFA)

Exploratory factor analysis (EFA) was used as a first step to investigate the latent structure of the EIP-Move and to verify whether the empirical data supported the hypothesis of a multidimensional configuration consistent with the four theoretical areas identified during instrument construction. The EFA was conceived as an exploratory procedure aimed not only at identifying the number of underlying factors, but also at evaluating the behavior of the items in terms of loadings, structural clarity, and conceptual consistency (Brown, 2015). Given the ordinal nature of the items (5-point Likert scale) and the use of polychoric correlation matrices, factor extraction was performed with the WLSMV (Weighted Least Squares Mean and Variance adjusted) estimator, recommended in the literature for factor analysis with ordered categorical variables because it provides robust estimates even under non-normal distributions (Flora & Curran, 2004; Muthén & Muthén, 2017). This methodological choice reduced the risk of distortion in the estimation of factor loadings and latent relationships compared with methods based on assumptions of continuous normality. Rotation was oblique (e.g., geomin or oblimin), assuming a priori a plausible correlation between factors. This decision is consistent with both the EIP-Move theoretical model and the psychometric literature, which emphasizes that the educational and inclusive dimensions of learning contexts tend to be conceptually distinct but empirically interrelated (DeVellis, 2017). The adoption of an oblique rotation also allowed for a more realistic representation of the latent structure than orthogonal solutions, which artificially impose factor independence. The number of factors was determined by integrating several decision criteria, avoiding exclusive reliance on any single automatic indicator.
In particular, the following were considered: (a) the comparison between solutions with different numbers of factors; (b) the theoretical interpretability of the solutions that emerged; (c) the size and distribution of factor loadings; (d) the presence of relevant cross-loadings; (e) the overall consistency of the solution with the content map of the construct. This integrated approach is in line with methodological recommendations that emphasize the importance of combining statistical criteria and theoretical judgment when identifying the dimensionality of an instrument (Streiner et al., 2015).
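One statistical criterion that can complement the integrated judgment described above is Horn's parallel analysis, which retains factors whose observed eigenvalues exceed those of comparable random data. The sketch below uses Pearson correlations for brevity, whereas the study's analyses relied on polychoric matrices, which this toy version does not implement.

```python
# Rough sketch of Horn's parallel analysis as a factor-retention aid.
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 50, seed: int = 0) -> int:
    n, p = data.shape
    obs_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rng = np.random.default_rng(seed)
    sim = np.zeros(p)
    for _ in range(n_sims):
        r = rng.standard_normal((n, p))           # uncorrelated reference data
        sim += np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    sim /= n_sims
    return int(np.sum(obs_eig > sim))             # factors beating random-data eigenvalues

# Toy check: two tight item clusters should yield two retained factors.
rng = np.random.default_rng(1)
f1, f2 = rng.standard_normal((2, 500))
data = np.column_stack([f1, f1, f1, f2, f2, f2]) + 0.1 * rng.standard_normal((500, 6))
print(parallel_analysis(data))  # 2
```

In practice such an automatic count would be weighed against interpretability, loading patterns, and the content map, exactly as criteria (a)–(e) above prescribe.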

3.5.3. Confirmatory Factor Analysis (CFA)

Based on the results of the EFA and the underlying theoretical model, a confirmatory factor analysis (CFA) was subsequently conducted with the aim of explicitly testing the four-factor model of the EIP-Move. The CFA represented the central step in structural validation, allowing for formal verification of the adequacy of the theoretical model to the empirical data and assessment of the robustness of the relationships between items and latent factors (Brown, 2015; Kline, 2016). Consistent with the exploratory analyses and the ordinal nature of the variables, the WLSMV estimator was also used in the CFA, as it is considered the gold standard for structural equation models with Likert items and polychoric correlations (Flora & Curran, 2004; Li, 2016). The use of this estimator allowed us to obtain reliable parameter estimates and fit indices corrected for the non-normality of the observed variables. The model evaluation was based on a combination of global goodness-of-fit indices, in line with the main recommendations in the psychometric literature. In particular, the following were considered: the Comparative Fit Index (CFI) and the Tucker–Lewis Index (TLI) as indicators of comparative fit; the Root Mean Square Error of Approximation (RMSEA) as an index of the model’s approximation error to the population; and the Standardized Root Mean Square Residual (SRMR) as a measure of the standardized mean discrepancy between the observed matrix and the one reproduced by the model (Brown, 2015). The interpretation of the indices was carried out in a non-mechanical way, taking into account the number of items, the complexity of the model, and the characteristics of the sample, as recommended by the most recent literature (Marsh et al., 2004). In addition to the overall fit indices, an in-depth analysis of the model parameters was conducted, including an examination of standardized factor loadings, residuals, and correlations between latent factors. 
High loadings on the theoretically expected dimension were interpreted as an indicator of good item representativeness, while large residuals or systematic patterns of misfit were used as diagnostic signals of potential localized weaknesses (Brown, 2015; Kline, 2016). This analysis made it possible to evaluate not only the overall goodness of fit of the model, but also the structural quality of the instrument at the individual item level, in line with the objective of constructing a psychometrically sound and conceptually coherent measure.
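The fit indices listed above can be derived from model and baseline chi-square statistics as follows; note that the WLSMV-scaled versions reported by SEM software apply additional mean-and-variance corrections that this sketch omits.

```python
# Sketch of how CFI, TLI, and RMSEA are computed from a fitted model's
# chi-square (chi2_m, df_m), the baseline model's (chi2_b, df_b), and N.
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 - num / den if den > 0 else 1.0

def tli(chi2_m, df_m, chi2_b, df_b):
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

def rmsea(chi2_m, df_m, n):
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# A perfectly fitting model (chi-square equal to its degrees of freedom):
print(cfi(100, 100, 2000, 120), rmsea(100, 100, 500))  # 1.0 0.0
```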

3.5.4. Reliability: Internal Consistency, Temporal Stability, and Inter-Rater Agreement

The reliability of the EIP-Move was assessed in a structured manner, considering various sources of measurement error and adopting an approach consistent with the multidimensional and applied nature of the instrument. In particular, three complementary levels of reliability were examined: internal consistency, temporal stability, and inter-rater agreement. The internal consistency of the subscales was estimated using McDonald’s Omega (ω), considered a more appropriate indicator than Cronbach’s α in contexts where the assumption of tau-equivalence is difficult to sustain and in the presence of factor models with non-uniform loadings (Dunn et al., 2014). The use of ω is consistent with the latest psychometric recommendations, which suggest favoring estimates based on factor models over purely descriptive coefficients (Raykov & Marcoulides, 2011; Streiner et al., 2015). The ω values were interpreted in light of the thresholds commonly adopted in the literature, considering values ≥ 0.70 as indicative of adequate reliability for instruments used in education. The temporal stability (test–retest) of the scores was assessed using the Intraclass Correlation Coefficient (ICC), calculated on subsamples for which no significant structural changes in the program were expected between measurements. The ICC is a particularly suitable indicator for estimating the consistency of measurements over time, as it takes into account both inter-subject and intra-subject variability (Koo & Li, 2016). The ICC model was selected based on the design (repeated measurements on the same programs) and the theoretical interpretation of expected stability, distinguishing between construct stability and real change associated with the evolution of educational practices. Finally, inter-rater reliability was estimated in cases where parallel assessments were available (e.g., teacher/coach and external observer, or two independent observers). 
Again, the ICC was used, chosen for its ability to estimate absolute agreement or consistency between assessors in a more informative way than simple correlation coefficients (Hallgren, 2012). The analysis of inter-rater agreement made it possible to assess the reproducibility of the EIP-Move scores and the consistency between internal and external assessments, a crucial aspect for a tool intended for use in real educational contexts.
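The omega coefficient described above can be sketched directly from standardized factor loadings under a congeneric one-factor model per subscale; the loadings used below are invented for illustration and are not EIP-Move estimates.

```python
# Pure-Python sketch of McDonald's omega:
# omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses).

def mcdonald_omega(loadings):
    common = sum(loadings) ** 2
    unique = sum(1.0 - l ** 2 for l in loadings)   # standardized uniquenesses
    return common / (common + unique)

print(round(mcdonald_omega([0.7, 0.7, 0.7, 0.7]), 3))  # 0.794
```

Because omega is built from the factor model's own loadings, it does not require the tau-equivalence (equal loadings) that Cronbach's α implicitly assumes, which is precisely the rationale given above for preferring it.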

3.5.5. Measurement Invariance (Multi-Group and Longitudinal)

In order to ensure that the EIP-Move measured the construct of educational and inclusive potential in an equivalent manner across different reference groups, measurement invariance was tested using multi-group CFA. This procedure is considered a fundamental prerequisite for making valid comparisons between groups and for avoiding distorted interpretations due to measurement bias (Putnick & Bornstein, 2016).
In particular, the standard levels of invariance were examined progressively:
  • Configural invariance, to verify that the factor structure was identical across groups;
  • Metric invariance, imposing equality of factor loadings, a necessary condition for comparing relationships between latent variables;
  • Scalar invariance, imposing equality of thresholds/intercepts (in an appropriate form for ordinal items), an essential requirement for comparing latent mean scores between groups.
Decisions on the acceptance of invariance were based on changes in fit indices (ΔCFI, ΔRMSEA), in line with recommendations suggesting that χ2 tests, which are particularly sensitive to sample size, should not be relied upon exclusively (Cheung & Rensvold, 2002; Chen, 2007). The maintenance of invariance supported the possibility of meaningful comparisons between gender, context of application (school vs. sports club), and presence/absence of special educational needs, reinforcing the equitable validity of the instrument. At the same time, the availability of three measurement times also allowed for a longitudinal invariance test to be set up, aimed at ensuring that the measure retained the same meaning over time. This step proved useful for subsequent change analyses, allowing variations in scores to be interpreted as real changes in the construct and not as measurement artifacts (Widaman et al., 2010).
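The ΔCFI/ΔRMSEA decision rule referenced above can be sketched as a simple predicate; the cutoffs follow the conventions commonly attributed to Cheung and Rensvold (2002) and Chen (2007), and real decisions would also weigh theory and model complexity rather than apply the rule mechanically.

```python
# Sketch of the Delta-CFI / Delta-RMSEA rule for accepting a more
# constrained invariance model over a less constrained one.

def invariance_holds(cfi_constrained, cfi_free,
                     rmsea_constrained, rmsea_free,
                     max_cfi_drop=0.01, max_rmsea_rise=0.015):
    cfi_drop = cfi_free - cfi_constrained        # fit lost by adding constraints
    rmsea_rise = rmsea_constrained - rmsea_free  # approximation error gained
    return cfi_drop <= max_cfi_drop and rmsea_rise <= max_rmsea_rise

# Metric vs. configural comparison with a negligible fit change:
print(invariance_holds(0.948, 0.951, 0.046, 0.044))  # True
```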

3.5.6. Longitudinal Analyses and Multilevel Models

Given the nested structure of the data (children nested within groups/classes or teams, which in turn were nested within schools or sports clubs), longitudinal and predictive analyses were conducted using models capable of handling intraclass dependence and multilevel variability (Raudenbush & Bryk, 2002; Hox et al., 2018). First, mixed-effects models were used to estimate changes in EIP-Move scores between T0, T1, and T2, including random effects at the program and/or context level. This approach allowed for flexible modeling of trajectories of change even in the presence of unbalanced data or non-systematic missing data, a frequent feature of applied longitudinal studies (Singer & Willett, 2003). In parallel, Latent Growth Curve Models (LGCMs) were implemented within an SEM framework, with the aim of describing the latent growth trajectories of scores and examining differences in patterns of change as a function of program characteristics (e.g., the implementation of structured inclusive practices). LGCMs allowed us to estimate intercept and slope parameters at the latent level, providing a more theoretically informed representation of change over time than purely observational approaches (Bollen & Curran, 2006).
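Before fitting full mixed-effects models, the intraclass dependence that motivates them can be illustrated with a one-way variance decomposition, ICC(1), showing how much score variance lies between programmes versus within them. Balanced groups are assumed here for simplicity; real programme data need not satisfy this.

```python
# Minimal one-way variance decomposition (ICC(1)) for nested scores.

def icc1(groups):
    """groups: list of equal-sized lists of scores, one list per programme."""
    k = len(groups)
    n = len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    ms_between = n * sum((sum(g) / n - grand) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - sum(g) / n) ** 2 for g in groups for x in g) / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

# No within-programme variability -> all variance is between programmes:
print(icc1([[2, 2, 2], [5, 5, 5], [8, 8, 8]]))  # 1.0
```

A non-trivial ICC(1) is exactly the condition under which ignoring the nesting would overstate precision, hence the random effects at the program and context level described above.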

3.5.7. Predictive Validity

The predictive validity of the EIP-Move was examined using multilevel regression models, assessing the association between the instrument scores (total and dimensions) and a series of outcome indicators relevant to education and sport, such as active student participation, perceived quality of the relational climate, psychological safety, and continuity of program adherence over time. The use of multilevel models made it possible to distinguish the share of variability attributable to the individual program or group from that linked to the broader organizational context (school or sports club), reducing the risk of biased inferences and type I errors due to the violation of the assumption of independence of observations (Raudenbush & Bryk, 2002; Snijders & Bosker, 2012). From this perspective, the EIP-Move was evaluated not only as a descriptive tool but also as a predictive indicator of educational and inclusive quality, reinforcing its practical value for monitoring and improving programs.
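The predictive logic can be illustrated in deliberately simplified form by regressing a later outcome on baseline scores at the programme level. The flat OLS below ignores the school/club level that the study's multilevel models account for, and both variables are invented for illustration.

```python
# Toy sketch of programme-level predictive association: baseline EIP-Move
# scores (T0) regressed on a later outcome (e.g., participation rate at T1).

def ols_slope_intercept(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented programme-level data where T1 participation tracks T0 scores:
t0_scores = [2.0, 3.0, 3.5, 4.0, 4.5]
t1_participation = [0.50, 0.62, 0.70, 0.78, 0.86]
slope, intercept = ols_slope_intercept(t0_scores, t1_participation)
print(round(slope, 3))
```

A positive slope of this kind is the programme-level pattern the multilevel models test formally, while additionally separating programme-level from context-level variance as described above.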

4. Results

4.1. Factor Structure

Structural analyses provided convergent and solid support for the theoretical configuration of EIP-Move, which is divided into four dimensions. In the exploratory phase, the four-factor solution proved the most appropriate from both a statistical and an interpretative point of view, explaining 63% of the total variance, a value consistent with multidimensional instruments applied to complex constructs in the educational and psychosocial fields (Brown, 2015). The pattern of factor loadings showed a generally orderly structure: most items had high loadings on the theoretically predicted factor, while cross-loadings were limited and small in magnitude. This pattern suggests good discriminatory power of the items and clear differentiation between dimensions, despite the natural conceptual interrelationship between the pedagogical, inclusive, and relational aspects of the programs. This result is consistent with the literature describing educational quality as an intrinsically multidimensional construct, in which the different components tend to support each other rather than operate independently (Bailey et al., 2009; Kirk, 2010). The subsequent CFA, shown in Table 12, confirmed the adequacy of the four-factor model, showing a good overall fit to the data (CFI = 0.95; RMSEA = 0.044). These values fall within the thresholds commonly considered indicative of a satisfactory fit in SEM models applied to educational instruments with a large number of items (Marsh et al., 2004). Parameter analysis showed robust factor loadings overall, suggesting that each item contributes significantly to the measurement of its domain. Correlations between latent factors, while statistically significant, remained at levels compatible with the hypothesis of distinct but interdependent dimensions.
This result reinforces the theoretical validity of the model, in line with pedagogical frameworks that conceive inclusion, teaching quality, relational climate, and equity as complementary components of a single educational ecosystem, rather than as isolated attributes (Goodwin & Watkinson, 2000; UNESCO, 2017). Overall, the structural evidence suggests that the EIP-Move has an interpretable, consistent, and sufficiently stable factorial architecture, confirming the validity of the content map defined during the construction phase and providing a solid basis for subsequent analyses of reliability, invariance, and change.

4.2. Reliability

Reliability analyses showed good measurement accuracy for the EIP-Move, both in terms of internal consistency of the subscales and in relation to the stability of scores over time and between raters (Table 13). In particular, McDonald’s Omega (ω) values, ranging from 0.82 to 0.90, indicate high internal consistency, appropriate for use of the instrument in applied educational contexts, where a balance between statistical accuracy and operational sustainability is required (Dunn et al., 2014). These results suggest that the items within each dimension share a substantial amount of common variance without being overly redundant. This balance is particularly relevant for instruments designed to evaluate complex educational programs, in which a certain internal heterogeneity of indicators is functional to the articulated representation of the construct (Streiner et al., 2015). With regard to temporal stability, ICC test–retest coefficients above 0.75 indicate good consistency of scores under conditions of relative program stability. These findings support the interpretation of the EIP-Move as a tool capable of distinguishing between random variability and real change, an essential requirement for any measure used in longitudinal and program evaluation studies (Koo & Li, 2016). Inter-rater agreement, also characterized by ICC > 0.75, is a particularly significant result from an application perspective. It suggests that the EIP-Move can be reliably used both as a professional self-assessment tool and as an external assessment tool, reducing the risk that scores reflect only subjective perspectives or role bias. This is consistent with the literature that emphasizes the importance of assessment reproducibility in complex, multi-actor educational contexts (Hallgren, 2012; Patton, 2015).
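For readers unfamiliar with ω, the coefficient for a congeneric scale can be computed directly from standardized factor loadings, as in this minimal sketch (the loadings in the example are hypothetical, not the EIP-Move estimates):

```python
def mcdonald_omega(loadings):
    """McDonald's omega for a congeneric scale from standardized factor loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances),
    where each residual variance is 1 - loading^2 under standardization.
    """
    s = sum(loadings)
    error_var = sum(1.0 - l ** 2 for l in loadings)
    return s ** 2 / (s ** 2 + error_var)

# Hypothetical six-item subscale with uniform loadings of 0.70:
print(round(mcdonald_omega([0.70] * 6), 3))
```

Unlike Cronbach’s alpha, this formulation does not assume equal loadings across items, which is why Dunn et al. (2014) recommend it for multidimensional instruments of this kind.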

4.3. Measurement Invariance

The measurement invariance analyses provided robust evidence supporting the comparability of the EIP-Move between groups, confirming that the instrument measures the construct of educational and inclusive potential in an equivalent manner with respect to gender and the presence/absence of special educational needs (Table 14). In particular, the maintenance of configural, metric, and scalar invariance indicates that the factor structure, the contribution of items to factors, and response thresholds are substantially equivalent across groups. This result is methodologically relevant, as it allows any differences in scores to be interpreted as real differences in program characteristics rather than measurement artifacts (Putnick & Bornstein, 2016). From a substantive point of view, the evidence of invariance reinforces the consistency of the EIP-Move with the principles of inclusive education, suggesting that the tool does not implicitly incorporate normative criteria or performance standards that could penalize contexts characterized by greater heterogeneity. In this sense, the EIP-Move is a fair assessment tool, capable of supporting responsible and informed comparisons between different programs, in line with international guidelines on inclusive assessment and the ethical use of measurement tools in education (UNESCO, 2017; DeVellis, 2017).
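The decision logic for nested invariance models is commonly based on the change-in-CFI criterion (ΔCFI ≤ 0.01; Cheung & Rensvold, 2002), and can be sketched as follows; the CFI values in the example are hypothetical, not the Table 14 estimates:

```python
def invariance_decision(cfi_steps, delta_threshold=0.01):
    """Evaluate nested invariance models (configural -> metric -> scalar)
    using the change-in-CFI criterion (Cheung & Rensvold, 2002).

    cfi_steps: CFI values for the increasingly constrained models, in order.
    Returns a dict mapping each step to True if invariance is retained.
    """
    labels = ["configural", "metric", "scalar"]
    # Configural fit is judged on absolute indices, not on a CFI drop.
    results = {labels[0]: True}
    for prev, curr, label in zip(cfi_steps, cfi_steps[1:], labels[1:]):
        results[label] = (prev - curr) <= delta_threshold
    return results

# Hypothetical sequence in which all constraints are retained:
print(invariance_decision([0.952, 0.949, 0.946]))
```

A drop in CFI larger than the threshold at, say, the metric step would signal that item loadings differ across groups and that mean comparisons should not proceed.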

4.4. Sensitivity to Change

Consistent with the longitudinal design of the study, EIP-Move demonstrated good sensitivity to change, showing significant increases in scores in programs that had implemented structured inclusive practices (p < 0.01). This result suggests that the tool is capable of detecting systematic variations associated with intentional processes of educational and inclusive quality improvement, and not just static differences between programs. The trend in scores over time can be interpreted in light of theoretical models that describe the quality of educational contexts as a dynamic process, which is progressively built through the stabilization of routines, the refinement of teaching strategies, and a growing professional awareness of the adults involved (Singer & Willett, 2003; Patton, 2015). In particular, the increase in scores observed between T0 and T1, and their stabilization or further consolidation at T2, is consistent with the hypothesis that inclusive practices take time to be systematically integrated into the design and conduct of activities (Table 15). From an application point of view, the sensitivity to change of the EIP-Move reinforces its usefulness as a tool for monitoring and formative evaluation of programs. The results suggest that the instrument can be used not only to describe the status of a program at a given moment, but also to accompany processes of professional reflection, training, and organizational development, providing useful indicators to guide pedagogical and strategic decisions (Patton, 2015; De Bosscher et al., 2015).
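A paired comparison of programme-level scores between two time points is one standard way to test such change. The sketch below computes the paired t statistic in plain Python; the T0/T1 scores in the example are invented for illustration and are not the study data.

```python
from math import sqrt

def paired_t(t0, t1):
    """Paired t statistic and degrees of freedom for change from T0 to T1.

    t0, t1: programme scores at the two time points, same order and length.
    """
    diffs = [b - a for a, b in zip(t0, t1)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / sqrt(var_d / n), n - 1

# Hypothetical scores for four programmes before and after an inclusive intervention:
t_stat, df = paired_t([3.0, 3.2, 2.8, 3.1], [3.4, 3.5, 3.1, 3.6])
print(round(t_stat, 2), df)
```

The obtained t is then compared against the t distribution with n − 1 degrees of freedom; in the longitudinal analyses reported here, multilevel growth models (Singer & Willett, 2003) serve the analogous purpose across three waves.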
Baseline EIP-Move scores significantly predicted all selected programme-level outcomes at follow-up. Higher educational and inclusive potential at T0 was associated with greater student participation, more positive relational climate, higher psychological safety, and increased programme continuity over time. These findings support the predictive validity of EIP-Move within a longitudinal, programme-level evaluation framework.
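The predictive-validity logic (baseline programme quality predicting a follow-up outcome) reduces, in its simplest form, to a regression of the T1 outcome on the T0 score. The following self-contained least-squares sketch illustrates this; the data points are hypothetical, not the study’s estimates.

```python
def ols_slope(x, y):
    """Slope and intercept of a simple least-squares regression of y on x.

    x: baseline EIP-Move programme scores (T0).
    y: follow-up programme-level outcome (e.g., participation at T1).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical perfectly linear relation: each baseline point doubles the outcome.
print(ols_slope([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

A positive, significant slope corresponds to the pattern reported above: programmes with higher baseline educational and inclusive potential show better follow-up outcomes.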

5. Discussion

This study has provided detailed psychometric evidence supporting EIP-Move as a valid, reliable, and fair tool for assessing the educational and inclusive potential of physical education and sports programmes in primary schools. From the outset, it is important to clarify that EIP-Move is not designed to measure direct outcomes on pupils, but rather the potential quality of programmes, understood as a set of pedagogical, organisational, and relational conditions that can foster learning, participation, and inclusion over time. This conceptual distinction underpins the choice of a longitudinal validation design, which allows the examination of the stability and consistency of programme characteristics beyond cross-sectional snapshots.
In this sense, the results converge in outlining a measure capable of translating a multidimensional construct—pedagogical quality, inclusion and participation, relational climate and safety, and equity and valorisation of differences—into observable indicators. These dimensions are widely recognised in the literature as decisive for the quality of motor and physical education contexts in childhood (Bailey et al., 2009; Kirk, 2010; UNESCO, 2017). EIP-Move therefore responds to a critical issue frequently highlighted in educational and sport research: the scarcity of standardised, comparable, and psychometrically sound tools able to assess not only outcomes in children, but also the intrinsic quality of programmes as structured sets of conditions potentially conducive to learning and inclusion (Patton, 2015; De Bosscher et al., 2015).
Regarding structural validity, the confirmation of a four-factor solution consistent with the theoretical model indicates that educational and inclusive potential cannot be reduced to a single undifferentiated dimension, but rather emerges as a configuration of complementary and interdependent components. This finding is conceptually relevant. On the one hand, it supports the idea that pedagogical quality and inclusion are not separate “add-ons”, but are structurally intertwined in the design and implementation of activities. On the other hand, it suggests that interventions targeting one dimension—such as systematic adaptations to enhance participation—may also influence other components, including relational climate and psychological safety. This interdependence aligns with physical education models that emphasise the central role of context, interactions, and pedagogical intentionality in supporting participation, motivation, and learning, particularly in primary school settings (Gallahue et al., 2012; Goodwin & Watkinson, 2000).
From this perspective, the factorial structure of EIP-Move should be interpreted not merely as a statistical outcome, but as empirical support for an ecosystemic interpretation of motor programmes. An activity may be formally accessible, yet fail to be educationally effective if it lacks pedagogical intentionality, relational safety, or equitable opportunities for participation. Accordingly, the dimensions measured by EIP-Move describe characteristics of the educational context and the practices implemented, rather than immediate or isolated effects on individual children. This reinforces the interpretation of the instrument as a programme-level assessment tool rather than an individual diagnostic measure.
Evidence of reliability further supports the practical utility of EIP-Move. The high internal consistency of the subscales, together with satisfactory test–retest stability and inter-rater agreement, suggests that the instrument yields sufficiently accurate and reproducible scores for use in both research and applied contexts. In particular, inter-rater agreement supports the use of EIP-Move in multi-informant configurations, which is especially relevant in educational settings where assessments may otherwise be influenced by self-perception biases or social desirability. The possibility of integrating internal perspectives (teachers or coaches) with external observations makes the tool suitable not only for descriptive measurement, but also for formative monitoring, professional reflection, and programme improvement processes, consistent with the view of assessment as a resource for organisational learning rather than merely summative evaluation (Patton, 2015).
A particularly relevant finding, in light of the inclusive aims of the study, concerns measurement invariance across groups (gender; presence or absence of special educational needs). Evidence of structural and scalar invariance indicates that EIP-Move functions comparably across these groups, reducing the likelihood that observed score differences are attributable to measurement bias rather than to genuine differences in programme quality. From a substantive standpoint, this result is crucial: an instrument designed to assess the fairness and inclusiveness of educational contexts must itself operate equitably and consistently across diverse populations (Putnick & Bornstein, 2016). In applied terms, measurement invariance supports the use of EIP-Move for monitoring programmes serving heterogeneous groups, allowing meaningful comparisons over time and across settings.
The longitudinal analyses further support the interpretation of EIP-Move as a measure that is not only stable, but also sensitive to change. Importantly, this sensitivity refers to variations in programme quality and educational practices, rather than to direct changes in pupil outcomes. The increases observed in programmes that implemented structured inclusive practices suggest that EIP-Move is capable of capturing changes aligned with intentional programme development processes. This characteristic is particularly valuable in formative assessment and research–intervention contexts, where the goal is to document and guide improvement rather than merely to classify programmes as effective or ineffective (Singer & Willett, 2003; Patton, 2015).
At a pedagogical level, the interpretation of temporal trends is consistent with the notion that inclusive practices and relational quality require time to consolidate. The development of routines, the management of heterogeneity, and the promotion of psychological safety are progressive processes that depend on continuity, reflective practice, and educators’ professional competence (Goodwin & Watkinson, 2000; UNESCO, 2017). Longitudinal validation therefore strengthens the interpretation of EIP-Move as a tool capable of supporting sustained programme development rather than short-term evaluation.
Taken together, these findings suggest that EIP-Move can make a concrete contribution to promoting a culture of quality in primary physical education and sports. By offering a shared conceptual framework and operational indicators, the tool can support the planning and reflective self-assessment of teachers and coaches, serve as a monitoring device for schools and sports organisations committed to continuous improvement, and function as a research instrument to investigate which programme components are most closely associated with participation, well-being, and continuity of engagement. While further studies are warranted to examine its applicability in larger samples or different cultural contexts, the evidence presented here supports the adoption of EIP-Move as a psychometrically sound instrument that is theoretically coherent with the educational and inclusive goals guiding the overall project.

6. Practical Implications

The results of this study suggest that EIP-Move could be a particularly useful operational tool for a variety of actors involved in physical education and sports in primary schools. The psychometric robustness revealed by the validation analyses, together with the operational clarity of the dimensions assessed and the demonstrated sensitivity to change, allows EIP-Move to be positioned not only as a research tool but also as an application device to support reflective, systematic and continuous improvement-oriented educational practices.
Firstly, EIP-Move can be used as a self-assessment tool for professional development by primary school teachers and grassroots sports coaches. From this perspective, the periodic completion of the tool promotes a structured process of reflection on the educational practices adopted, allowing adults to systematically question aspects that are often taken for granted, such as the pedagogical consistency of the programme, the quality of the relational climate or the effective participation of all children. The instrument is divided into distinct but interconnected sections, allowing for the precise identification of areas of strength and areas for improvement, supporting an analytical reading of the quality of the programme and promoting a professional culture based on reflective self-analysis and educational responsibility, rather than on prescriptive or punitive evaluation logic (Patton, 2015).
Secondly, EIP-Move can serve as a longitudinal monitoring tool to support continuous improvement and organisational development processes within schools and sports clubs. Repeated use over time allows for the detection of changes in the educational and inclusive quality of programmes, providing useful indicators for assessing the impact of training interventions, methodological changes or educational reorganisations. In this sense, EIP-Move can be effectively integrated into in-service training courses, pedagogical supervision, communities of practice or research-intervention projects, helping to make the link between daily practices and medium- to long-term educational objectives explicit and monitorable.
A further implication concerns the potential use of EIP-Move as a support for the educational and inclusive design of motor and sports programmes. The profiles provided by the tool can guide educational planning, helping teachers and coaches to intentionally calibrate activities, methodologies and adaptations according to the specific needs of the group and the principles of equity and participation. From this perspective, EIP-Move is not a tool for assessing individuals, but a device for critically interpreting the educational context.
Finally, although not designed as a regulatory or selective tool, EIP-Move offers a set of standardised and validated indicators that could be used, with due caution, as a reference for accreditation, certification or self-certification initiatives for the educational and inclusive quality of programmes aimed at children. In this sense, the tool could help to strengthen the dialogue between research, practice and educational governance, promoting the adoption of shared criteria based on empirical evidence rather than on impressionistic or self-referential assessments.

7. Limitations and Future Developments

Despite the encouraging results emerging from this study, several limitations should be acknowledged, as they also point to relevant directions for future research. First, although the sample used for validation was numerically adequate and characterised by significant ecological variability, it remains embedded within a specific cultural and organisational context. Consequently, caution is required when generalising the findings to educational systems characterised by different pedagogical traditions or organisational models of physical education and sport. Future studies could therefore examine the applicability of EIP-Move in different national and cultural contexts, testing the stability of the factorial structure and measurement invariance across educational settings with distinct curricular frameworks and inclusion policies.
A second limitation concerns the predominantly observational and evaluative nature of the instrument. Although EIP-Move was intentionally designed to minimise social desirability bias by anchoring judgments to observable pedagogical and organisational practices, the use of self-assessment components may still be partially influenced by subjective perceptions. In this respect, future research could explore the systematic integration of EIP-Move with complementary data sources, such as structured observations, qualitative interviews, or indicators of children’s participation and well-being. Such triangulation would further strengthen the convergent validity of the instrument and enhance its interpretative depth, particularly in applied research and intervention-oriented studies.
Another direction for development concerns the potential adaptation of EIP-Move to contexts characterised by time or resource constraints. The development of short or context-specific versions of the instrument could facilitate its use in large-scale monitoring initiatives or routine professional practice, while preserving its theoretical coherence and psychometric robustness. Similarly, the digitisation of EIP-Move and its integration into educational monitoring platforms could increase its accessibility and dissemination, enabling more efficient longitudinal data collection and providing timely feedback to educators and organisations.
Finally, although the present study focused on the validation of EIP-Move as a programme-level assessment tool, future studies could more systematically investigate the relationship between EIP-Move scores and longer-term outcomes, such as continuity of participation in physical activity, motivation towards physical education, or the development of social-emotional competences. Clarifying these associations would further strengthen the predictive validity of the instrument and contribute to a deeper understanding of the mechanisms through which the educational and inclusive quality of physical activity programmes influences children’s overall development.

8. Conclusions

This study proposed and validated EIP-Move as an original and scientifically grounded tool for assessing the educational and inclusive potential of physical education and sports programmes in primary schools. Starting from a well-documented critical issue in the literature—the lack of standardised and psychometrically sound instruments capable of assessing programme quality beyond performance outcomes alone—the research offered a rigorous and theoretically consistent methodological response. Through a structured longitudinal design and a multi-stage process of construction and validation integrating theoretical, psychometric, and applicative criteria, the study demonstrated that complex dimensions such as pedagogical quality, inclusion, relational climate, and educational equity can be measured reliably and comparably (DeVellis, 2017; Streiner et al., 2015; Kirk, 2010).
The value of EIP-Move lies not only in the empirical results obtained, but also in the methodological approach adopted, which made it possible to translate a multidimensional and theoretically rich construct into observable, stable, and change-sensitive indicators. By shifting the focus of assessment from individual children to the educational context as a whole, EIP-Move aligns with an ecological and inclusive perspective on physical education and sport. Within this framework, quality is not conceptualised as a static attribute or a coincidental outcome, but as the result of intentional pedagogical choices, reflective practices, and organisational conditions that can be observed, monitored, and improved over time (UNESCO, 2017).
This approach highlights the central role of adults, institutions, and educational structures in creating meaningful opportunities for participation, learning, and well-being for all children, regardless of individual characteristics. Such a perspective is consistent with the literature that conceptualises physical education not merely as a performance-oriented domain, but as a formative educational context with social, relational, and ethical implications (Bailey et al., 2009). In this sense, EIP-Move contributes to reframing evaluation practices, encouraging a more holistic and context-sensitive understanding of programme quality.
A further contribution of EIP-Move lies in its ability to integrate psychometric rigour with practical utility. The multidimensional structure of the instrument, together with its demonstrated stability and sensitivity to change, makes it suitable for both educational research and professional practice contexts. By supporting formative evaluation processes oriented towards reflection and continuous improvement rather than judgement alone, EIP-Move promotes a closer dialogue between empirical evidence and decision-making in schools and sports organisations (Singer & Willett, 2003; Patton, 2015).
In conclusion, EIP-Move acts as a bridge between research and practice, offering a shared conceptual language and operational indicators to support a culture of quality and inclusion in primary school physical education contexts. The tool has the potential to function not only as a measurement instrument, but also as a catalyst for professional and organisational reflection, contributing to more intentional, equitable, and educationally meaningful physical education and sport programmes. In this perspective, EIP-Move supports responsible and unbiased comparisons between programmes and contexts, strengthening informed decision-making processes that are consistent with the principles of educational equity and inclusion (Putnick & Bornstein, 2016; UNESCO, 2017).

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Department of Medical Sciences, Human Movement and Wellbeing, University of Naples ‘Parthenope’ (protocol code DiSMMeB Prot. No. 88775/2025; date of approval: 11 November 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author, in order to ensure a form of dissemination that protects the newly created and validated evaluation tool and its use while, at the same time, satisfying the genuine and motivated interest of scholars and researchers in this specific scientific field.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Bailey, R., Armour, K., Kirk, D., Jess, M., Pickup, I., Sandford, R., & BERA Physical Education and Sport Pedagogy Special Interest Group. (2009). The educational benefits claimed for physical education and school sport: An academic review. Research Papers in Education, 24(1), 1–27. [Google Scholar] [CrossRef]
  2. Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Wiley. [Google Scholar]
  3. Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). Guilford Press. [Google Scholar]
  4. Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14(3), 464–504. [Google Scholar] [CrossRef]
  5. Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233–255. [Google Scholar] [CrossRef]
  6. De Bosscher, V., Shibli, S., Westerbeek, H., & Van Bottenburg, M. (2015). Successful elite sport policies: An international comparison of the sports policy factors leading to international sporting success (2nd ed.). Meyer & Meyer Sport. [Google Scholar]
  7. DeVellis, R. F. (2017). Scale development: Theory and applications (4th ed.). Sage. [Google Scholar]
  8. Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399–412. [Google Scholar] [CrossRef] [PubMed]
  9. Enders, C. K. (2010). Applied missing data analysis. Guilford Press. [Google Scholar]
  10. Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9(4), 466–491. [Google Scholar] [CrossRef] [PubMed]
  11. Gallahue, D. L., Ozmun, J. C., & Goodway, J. D. (2012). Understanding motor development: Infants, children, adolescents, adults (7th ed.). McGraw-Hill. [Google Scholar]
  12. Goodwin, D. L., & Watkinson, E. J. (2000). Inclusive physical education from the perspective of students with physical disabilities. Adapted Physical Activity Quarterly, 17(2), 144–160. [Google Scholar] [CrossRef]
  13. Haegele, J. A., & Sutherland, S. (2015). Perspectives of students with disabilities toward physical education: A qualitative inquiry review. Quest, 67(3), 255–273. [Google Scholar] [CrossRef]
  14. Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34. [Google Scholar] [CrossRef] [PubMed]
  15. Holgado-Tello, F. P., Chacón-Moscoso, S., Barbero-García, I., & Vila-Abad, E. (2010). Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Quality & Quantity, 44(1), 153–166. [Google Scholar]
  16. Hox, J. J., Moerbeek, M., & van de Schoot, R. (2018). Multilevel analysis: Techniques and applications (3rd ed.). Routledge. [Google Scholar]
  17. Kirk, D. (2010). Physical education futures. Routledge. [Google Scholar]
  18. Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). Guilford Press. [Google Scholar]
  19. Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. [Google Scholar] [CrossRef] [PubMed]
  20. Li, C. H. (2016). Confirmatory factor analysis with ordinal data: Comparing robust maximum likelihood and diagonally weighted least squares. Behavior Research Methods, 48(3), 936–949. [Google Scholar] [CrossRef] [PubMed]
  21. Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley. [Google Scholar]
  22. Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes. Structural Equation Modeling, 11(3), 320–341. [Google Scholar] [CrossRef]
  23. Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide (8th ed.). Muthén & Muthén. [Google Scholar]
  24. Patton, M. Q. (2015). Qualitative research & evaluation methods (4th ed.). Sage. [Google Scholar]
  25. Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. [Google Scholar]
  26. Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Research in Nursing & Health, 30(4), 459–467. [Google Scholar] [CrossRef] [PubMed]
  27. Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting. Developmental Review, 41, 71–90. [Google Scholar] [CrossRef] [PubMed]
  28. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage. [Google Scholar]
  29. Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. Routledge. [Google Scholar]
  30. Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press. [Google Scholar]
  31. Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage. [Google Scholar]
  32. Streiner, D. L., Norman, G. R., & Cairney, J. (2015). Health measurement scales: A practical guide to their development and use (5th ed.). Oxford University Press. [Google Scholar]
  33. Tomporowski, P. D., McCullick, B., Pendleton, D. M., & Pesce, C. (2015). Exercise and children’s cognition: The role of exercise characteristics and a place for metacognition. Journal of Sport and Health Science, 4(1), 47–55. [Google Scholar] [CrossRef]
  34. UNESCO. (2017). A guide for ensuring inclusion and equity in education. UNESCO Publishing. [Google Scholar]
  35. Widaman, K. F., Ferrer, E., & Conger, R. D. (2010). Factorial invariance within longitudinal structural equation models: Measuring the same construct across time. Child Development Perspectives, 4(1), 10–18. [Google Scholar] [CrossRef] [PubMed]
Table 1. Comparison between EIP-Move and existing frameworks/instruments for physical education programme evaluation.

| Framework/Instrument | Primary Focus | Level of Analysis | Inclusion Explicitly Assessed | Program-Level Evaluation | Standardized Psychometric Validation | Longitudinal Use |
| --- | --- | --- | --- | --- | --- | --- |
| PETE frameworks | Teacher education and instructional quality | Teacher/training | Partial (implicit) | No | No | No |
| UNESCO (2017) Inclusive Education Model | Policy principles and values | System/policy | Yes (normative) | No | No | No |
| Teaching quality scales in PE | Instructional behaviours | Teacher/lesson | Limited | No | Yes (often) | Rare |
| Inclusion-focused PE instruments | Inclusion practices | Student or teacher | Yes | Partial | Variable | No |
| EIP-Move (present study) | Educational and inclusive potential | Programme | Yes (core dimension) | Yes | Yes | Yes |
Table 2. Pedagogical quality of the programme.

Dimension 1—Pedagogical Quality of the Programme
1. The programme has clear educational objectives that are consistent with the age of the pupils.
2. The motor activities follow a structured didactic progression.
3. The programme offers a variety of meaningful motor experiences.
4. The activities are designed to promote learning and not just task completion.
5. The adult explains the educational significance of the proposed activities.
6. The programme includes moments of reflection on the motor experience carried out.
7. The methodologies used encourage active learning among pupils.
8. The programme encourages autonomy and initiative in children.
9. The activities are tailored to the skills of the group.
10. The feedback provided by the adult has an educational function.
11. The activities are consistent and interconnected over time.
12. The programme is adapted on an ongoing basis based on the pupils’ responses.
13. The learning process is valued more than the final result.
14. The activities promote the development of transversal skills (e.g., cooperation, self-regulation).
15. The programme includes forms of formative assessment.
16. Teaching choices are intentional and planned.
Table 3. Inclusion and participation.
Dimension 2—Inclusion and Participation
17. All children actively participate in the proposed activities.
18. The programme includes systematic adaptations to promote inclusion.
19. Differences in ability among pupils are managed effectively.
20. The programme avoids the systematic assignment of marginal roles.
21. All pupils have real opportunities for success.
22. The rules of the activities are flexible and adaptable.
23. The programme includes strategies to engage more hesitant children.
24. Pupils with special educational needs are adequately supported.
25. Activities encourage cooperation among peers.
26. The programme prevents forms of implicit exclusion.
27. Pupils participate continuously in physical activities.
28. The activities allow for different levels of participation.
29. The programme takes into account different learning styles.
30. The instructions are understandable for all pupils.
31. Participation time is distributed equally among the children.
32. The programme promotes a sense of belonging to the group.
Table 4. Relational climate and safety.
Dimension 3—Relational Climate and Safety
33. The relational climate of the group is positive and collaborative.
34. Relationships between peers are characterised by mutual respect.
35. The adult fosters a climate of trust during activities.
36. Children feel free to express themselves during physical activities.
37. Mistakes are accepted as part of the learning process.
38. The programme prevents episodes of mockery or exclusion.
39. Conflicts are handled constructively.
40. The group rules are clear and shared.
41. Activities take place in physically safe conditions.
42. Children perceive an emotionally safe environment.
43. Adults model positive relational behaviours.
44. The programme encourages mutual support among children.
45. Activities respect individual times and limits.
46. The emotional climate encourages active participation.
47. The group appears cohesive during physical activities.
Table 5. Equity and valuing differences.
Dimension 4—Equity and Valuing Differences
48. The programme recognises individual differences as an educational resource.
49. The activities enhance the strengths of each child.
50. The programme promotes equal learning opportunities.
51. The assessment criteria are equitable and transparent.
52. Competitive comparison is managed in an educational manner.
53. Individual progress is valued more than performance.
54. The roles assigned during activities are distributed fairly.
55. Gender differences are managed in an inclusive manner.
56. The programme avoids selective or discriminatory practices.
57. The pupils’ previous motor experiences are valued.
58. The activities allow each child to feel competent.
59. The programme supports pupils’ self-esteem.
60. Cultural differences are respected and integrated.
61. The programme promotes a positive self-image in children.
62. Educational choices promote equity and educational justice.
Table 6. Review of conceptual redundancies.
| Dimension | Items Involved | Initial Criticality | Before Revision (Example) | After Revision (Final Version) |
|---|---|---|---|---|
| Pedagogical quality of the programme | 2, 11, 16 | Possible overlap between consistency, progression and teaching planning | “The activities are coherent and well organised.” / “The programme is planned.” | 2. The motor activities follow a structured didactic progression. 11. The activities are consistent and interconnected over time. 16. Teaching choices are intentional and planned. |
| Inclusion and participation | 18–20, 22, 28 | Risk of overlap between adaptations, management of diversity and participation | “The programme is inclusive.” / “The activities are adapted to the children.” | 18. The programme includes systematic adaptations to promote inclusion. 19. Differences in ability among pupils are managed effectively. 20. The programme avoids the systematic assignment of marginal roles. 22. The rules of the activities are flexible and adaptable. 28. The activities allow for different levels of participation. |
| Relational climate and safety | 36, 37, 42, 46 | Conceptual contiguity between emotional climate and psychological safety | “The climate is positive.” / “The children feel safe.” | 36. Children feel free to express themselves during physical activities. 37. Mistakes are accepted as part of the learning process. 42. Children perceive an emotionally safe environment. 46. The emotional climate encourages active participation. |
| Equity and valuing differences | 50–52, 56 | Overlap between evaluative fairness, equal opportunities and non-discrimination | “The programme is fair.” / “There is no discrimination.” | 50. The programme promotes equal learning opportunities. 51. The assessment criteria are equitable and transparent. 52. Competitive comparison is managed in an educational manner. 56. The programme avoids selective or discriminatory practices. |
Table 7. Revision of items to check for unidimensionality and semantic specificity.
| Dimension | Item | Before Revision (Example) | Critical Issues Identified | After Revision (Final Version) |
|---|---|---|---|---|
| Pedagogical quality of the programme | 4 | The activities promote learning and improve pupils’ performance. | Double-barrelled item (learning + performance) | The activities are designed to promote learning and not just task completion. |
| Pedagogical quality of the programme | 10 | The adult provides useful and motivating feedback. | Dual content (educational + motivational function) | The feedback provided by the adult has an educational function. |
| Pedagogical quality of the programme | 13 | The programme values both the process and the result. | Conceptual ambiguity (process vs. result) | The learning process is valued more than the final result. |
| Pedagogical quality of the programme | 14 | The activities encourage cooperation and self-regulation. | Possible double-barrelled item | The activities promote the development of transversal skills (e.g., cooperation, self-regulation). |
| Inclusion and participation | 18 | The programme is inclusive and suitable for everyone. | Formulation too generic | The programme includes systematic adaptations to promote inclusion. |
| Inclusion and participation | 19 | Differences between children are well managed. | Ambiguity and overall assessment | Differences in ability among pupils are managed effectively. |
| Inclusion and participation | 21 | All children are able to participate and succeed. | Dual content (participation + success) | All pupils have real opportunities for success. |
| Inclusion and participation | 28 | Activities are adapted to children’s levels. | Vague wording | The activities allow for different levels of participation. |
| Relational climate and safety | 33 | The group atmosphere is positive. | Overall assessment, not observable | The relational climate of the group is positive and collaborative. |
| Relational climate and safety | 36 | The children feel safe and free to express themselves. | Double-barrelled (safety + expression) | Children feel free to express themselves during physical activities. |
| Relational climate and safety | 37 | Mistakes are not a problem. | Generic formulation | Mistakes are accepted as part of the learning process. |
| Relational climate and safety | 42 | The environment is safe. | Ambiguity (physical vs. emotional safety) | Children perceive an emotionally safe environment. |
| Equity and valuing differences | 50 | The programme is fair for everyone. | Overall assessment | The programme promotes equal learning opportunities. |
| Equity and valuing differences | 51 | The assessment is fair. | Vague wording | The assessment criteria are equitable and transparent. |
| Equity and valuing differences | 52 | Competition is not negative. | Ambiguity of interpretation | Competitive comparison is managed in an educational manner. |
| Equity and valuing differences | 56 | The programme does not discriminate. | Negative and generic wording | The programme avoids selective or discriminatory practices. |
Table 8. Summary of item revision (from 62 to 48).
| Dimension | Original Items (n) | Items Deleted/Merged | Revision Criterion | Final Items (n) |
|---|---|---|---|---|
| Pedagogical quality | 16 | 4 deleted/merged (e.g., 6, 11, 15, 16) | Redundancy between consistency, formative assessment and intentionality | 12 |
| Inclusion and participation | 16 | 4 deleted/merged (e.g., 21, 27, 31, 32) | Overlap between continuous participation, success and belonging | 12 |
| Relational climate and safety | 15 | 3 deleted/merged (e.g., 34, 46, 47) | Contiguity between emotional climate, cohesion and psychological safety | 12 |
| Equity and valuing differences | 15 | 3 deleted/merged (e.g., 49, 56, 61) | Redundancy between evaluative fairness, non-discrimination and self-perception | 12 |
| Total | 62 | 14 | | 48 |
Table 9. Final structure of the EIP-Move—Preliminary version (48 items).
Dimension 1—Pedagogical Quality of the Programme (12 Items)
1. The programme has clear educational objectives that are consistent with the age of the pupils.
2. The motor activities follow a structured didactic progression.
3. The programme offers a variety of meaningful motor experiences.
4. The activities are designed to promote learning and not just the execution of the task.
5. The adult explains the educational significance of the proposed activities.
6. The methodologies used promote active learning among pupils.
7. The programme encourages autonomy and initiative in children.
8. The activities are tailored to the skills of the group.
9. The feedback provided by the adult has an educational function.
10. The programme is adapted on an ongoing basis based on the pupils’ responses.
11. The learning process is valued more than the final result.
12. The activities promote the development of transversal skills (e.g., cooperation, self-regulation).
Dimension 2—Inclusion and Participation (12 Items)
13. All children actively participate in the proposed activities.
14. The programme includes systematic adaptations to promote inclusion.
15. Differences in ability among pupils are managed effectively.
16. The programme avoids the systematic assignment of marginal roles.
17. The rules of the activities are flexible and adaptable.
18. The programme includes strategies to engage more hesitant children.
19. Pupils with special educational needs are adequately supported.
20. Activities encourage peer cooperation.
21. The programme prevents forms of implicit exclusion.
22. The activities allow for different levels of participation.
23. The instructions are understandable for all pupils.
24. Participation time is distributed equally among the children.
Dimension 3—Relational Climate and Safety (12 Items)
25. The relational climate of the group is positive and collaborative.
26. Relationships between peers are characterised by mutual respect.
27. The adult fosters a climate of trust during activities.
28. Children feel free to express themselves during physical activities.
29. Mistakes are accepted as part of the learning process.
30. The programme prevents episodes of mockery or exclusion.
31. Conflicts are handled constructively.
32. The group rules are clear and shared.
33. Activities take place in physically safe conditions.
34. Children perceive an emotionally safe environment.
35. Adults model positive relational behaviours.
36. Activities respect individual times and limits.
Dimension 4—Equity and Valuing Differences (12 Items)
37. The programme recognises individual differences as an educational resource.
38. Activities value the strengths of each child.
39. The programme promotes equal learning opportunities.
40. Competitive comparison is managed in an educational manner.
41. Individual progress is valued more than performance.
42. Roles assigned during activities are distributed equally.
43. Gender differences are managed in an inclusive manner.
44. The programme avoids selective or discriminatory practices.
45. The pupils’ previous motor experiences are valued.
46. The activities allow each child to feel competent.
47. The programme supports pupils’ self-esteem.
48. Cultural differences are respected and integrated.
Table 10. Items modified or removed based on the review criteria (reduction to 40 items).
| Dim. | Item | Preliminary Version (48 Items) | Criterion Applied | Outcome |
|---|---|---|---|---|
| 1 | 3 | The programme offers a variety of meaningful motor experiences. | Ambiguity/generic nature | Reformulated → “The programme offers a variety of meaningful motor experiences for the development of pupils.” |
| 1 | 7 | The programme encourages autonomy and initiative in children. | Double-barrelled/observability | Rewritten → “The programme encourages children’s independence in carrying out activities.” |
| 1 | 11 | The learning process is valued more than the final result. | Low observability | Removed (redundant with Items 4 and 9, which are more observable) |
| 2 | 13 | All children actively participate in the proposed activities. | Ceiling effect | Reformulated → “Most children actively participate in the proposed activities.” |
| 2 | 15 | Differences in ability among pupils are managed effectively. | Generic | Reworded → “Differences in ability among pupils are managed through adaptations of activities.” |
| 2 | 21 | The programme prevents forms of implicit exclusion. | Poor observability | Reformulated → “The programme includes actions aimed at preventing forms of implicit exclusion.” |
| 2 | 24 | Participation time is distributed equally among children. | Intra-dimensional redundancy | Removed (overlaps with Items 22 and 16) |
| 3 | 25 | The relational climate of the group is positive and collaborative. | Overall assessment | Reformulated → “The group’s relational climate fosters collaboration and positive interactions.” |
| 3 | 30 | The programme prevents incidents of ridicule or exclusion. | Poor observability | Reformulated → “The programme adopts strategies to prevent episodes of ridicule or exclusion.” |
| 3 | 34 | Children perceive an emotionally safe environment. | Redundancy/low discriminative power | Removed (covered by Items 28 and 29) |
| 3 | 36 | Activities respect individual times and limits. | Intra-dimensional redundancy | Removed (overlaps with Items 8 and 22) |
| 4 | 40 | Competitive comparison is handled in an educational manner. | Ambiguity | Reformulated → “Competitive comparison, when present, is handled in an educational manner.” |
| 4 | 41 | Individual progress is valued more than performance. | Redundancy | Removed (conceptually overlapping with Items 38 and 46) |
| 4 | 46 | Activities enable each child to feel competent. | Genericity | Reformulated → “The activities allow each child to experience a sense of competence.” |
| 4 | 47 | The programme supports pupils’ self-esteem. | Low theoretical consistency | Removed (distal construct, not directly observable) |
Table 11. EIP-Move—Final version (40 items).
Dimension 1—Pedagogical Quality of the Programme
(Items 1–10)
1. The programme has clear educational objectives that are consistent with the age of the pupils.
2. The motor activities follow a structured and consistent didactic progression over time.
3. The programme offers a variety of motor experiences that are significant for pupils’ development.
4. The activities are designed to promote learning and not just the execution of the task.
5. Adults explain the educational significance of the proposed activities to the children.
6. The methodologies used encourage the active involvement of pupils in the activities.
7. The programme encourages children to work independently when carrying out the activities.
8. The activities are tailored to the skills of the group.
9. The feedback provided by the adult is aimed at improving the pupils’ learning.
10. The programme is adapted on an ongoing basis based on the pupils’ responses.
Dimension 2—Inclusion and Participation
(Items 11–20)
11. Most children actively participate in the proposed activities.
12. The programme includes systematic adaptations to promote inclusion.
13. Differences in ability among pupils are managed through adaptations to activities.
14. The programme avoids the systematic assignment of marginal roles.
15. The rules of the activities are flexible and adaptable.
16. The programme includes explicit strategies to engage more hesitant children.
17. Pupils with special educational needs are adequately supported.
18. Activities encourage peer cooperation.
19. The programme includes actions aimed at preventing forms of implicit exclusion.
20. The activities allow for different levels of participation.
Dimension 3—Relational Climate and Safety
(Items 21–30)
21. The relational climate of the group promotes collaboration and positive interactions.
22. Relationships between peers are characterised by mutual respect.
23. The adult promotes a climate of trust during activities.
24. Children feel free to express themselves during physical activities.
25. Mistakes are accepted as part of the learning process.
26. The programme adopts strategies to prevent episodes of mockery or exclusion.
27. Conflicts are handled constructively.
28. The group rules are clear and shared.
29. Activities take place in physically safe conditions.
30. Adults model positive relational behaviours.
Dimension 4—Equity and Valuing Differences
(Items 31–40)
31. The programme recognises individual differences as an educational resource.
32. Activities value the strengths of each child.
33. The programme promotes equal learning opportunities.
34. Competitive comparison, when present, is managed in an educational manner.
35. The roles assigned during activities are distributed equally.
36. Gender differences are managed in an inclusive manner.
37. The programme avoids selective or discriminatory practices.
38. The pupils’ previous motor experiences are valued.
39. The activities allow each child to experience a sense of competence.
40. Cultural differences are respected and integrated.
Table 12. Fit indices for the four-factor CFA model.
| χ² | df | χ²/df | CFI | TLI | RMSEA | 90% CI RMSEA | SRMR |
|---|---|---|---|---|---|---|---|
| 864.32 | 344 | 2.51 | 0.95 | 0.94 | 0.044 | [0.040–0.048] | 0.047 |
The four-factor model demonstrated good fit to the data (CFI = 0.95, RMSEA = 0.044, SRMR = 0.047).
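As a consistency check, the RMSEA in Table 12 can be recomputed from the reported χ² and degrees of freedom once a sample size is specified. The sketch below uses the standard population-based RMSEA formula; note that the value n = 782 is an assumed, purely illustrative sample size, since the study's n is not reported in this excerpt.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root Mean Square Error of Approximation from a chi-square statistic.

    rmsea = sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# chi-square and df taken from Table 12; n = 782 is an assumed illustration only.
print(round(864.32 / 344, 2))             # chi2/df → 2.51
print(round(rmsea(864.32, 344, 782), 3))  # → 0.044
```

The same function can be inverted to ask what sample sizes are compatible with a reported RMSEA, a useful sanity check when reading fit tables.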
Table 13. Internal consistency of EIP-Move dimensions.
| Dimension | Items | Cronbach’s α | McDonald’s ω |
|---|---|---|---|
| Pedagogical Quality | 10 | 0.91 | 0.93 |
| Inclusion & Participation | 10 | 0.89 | 0.91 |
| Relational Climate & Safety | 10 | 0.90 | 0.92 |
| Equity & Valuing Differences | 10 | 0.88 | 0.90 |
All dimensions showed strong internal consistency, with α and ω values of 0.88 or higher.
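Cronbach's α, as reported in Table 13, can be computed directly from a respondents-by-items score matrix. A minimal sketch follows; the 5 × 4 rating matrix is invented for illustration and does not reproduce the study's data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 observers scoring 4 items of one dimension (1-5 scale).
scores = np.array([
    [4, 4, 5, 4],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
], dtype=float)
print(round(cronbach_alpha(scores), 3))  # → 0.925
```

McDonald's ω additionally requires factor loadings from the CFA model, so it is not reproduced in this sketch.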
Table 14. Inter-rater reliability of EIP-Move dimensions.
| Dimension | ICC Model | ICC | 95% CI |
|---|---|---|---|
| Pedagogical Quality | ICC(2, 1) | 0.82 | [0.76–0.87] |
| Inclusion & Participation | ICC(2, 1) | 0.85 | [0.79–0.89] |
| Relational Climate & Safety | ICC(2, 1) | 0.80 | [0.73–0.85] |
| Equity & Valuing Differences | ICC(2, 1) | 0.77 | [0.69–0.83] |
Inter-rater reliability was acceptable to good across all dimensions, with ICC values ranging from 0.77 to 0.85.
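The ICC(2,1) model used in Table 14 (two-way random effects, absolute agreement, single rater) can be computed from the mean squares of a two-way ANOVA on an n-targets × k-raters matrix. A sketch with hypothetical ratings:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_targets x k_raters) matrix of observations.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target means
    col_means = ratings.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-target MS
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-rater MS
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))          # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two raters in perfect agreement across three programmes yield ICC = 1.0.
perfect = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(icc_2_1(perfect))  # → 1.0
```

Because ICC(2,1) penalises systematic rater differences (the MSC term), it is stricter than a consistency-based ICC, which suits observational instruments scored by independent raters.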
Table 15. Predictive validity of EIP-Move (baseline T0 → follow-up outcomes).
| Predictor (T0) | Criterion Outcome | Time Point | β | SE | p |
|---|---|---|---|---|---|
| Total EIP-Move score | Active student participation | T1 | 0.41 | 0.06 | <0.001 |
| Total EIP-Move score | Relational climate quality | T1 | 0.38 | 0.07 | <0.001 |
| Total EIP-Move score | Psychological safety | T1 | 0.35 | 0.06 | <0.001 |
| Total EIP-Move score | Programme continuity | T2 | 0.33 | 0.08 | <0.001 |
| Inclusion & Participation | Active participation | T1 | 0.44 | 0.05 | <0.001 |
| Relational Climate & Safety | Psychological safety | T1 | 0.47 | 0.06 | <0.001 |
Note. Standardised regression coefficients. All models controlled for context (school vs. sport) and programme size.
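Standardised coefficients of the kind reported in Table 15 are simply OLS estimates on z-scored variables; controlling for covariates such as context or programme size amounts to adding their columns to the predictor matrix. The sketch below is a generic illustration, not the article's actual models or data; the toy example is hypothetical.

```python
import numpy as np

def standardized_betas(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """OLS coefficients after z-scoring predictors and outcome (standardised betas).

    X: (n_obs x p_predictors) matrix; y: (n_obs,) outcome vector.
    """
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    design = np.column_stack([np.ones(len(yz)), Xz])  # intercept + predictors
    coef, *_ = np.linalg.lstsq(design, yz, rcond=None)
    return coef[1:]  # drop the intercept

# Hypothetical example: a perfectly linear predictor-outcome relation gives beta ≈ 1.
x = np.arange(10.0).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0
beta = standardized_betas(x, y)  # ≈ [1.0]
```

With a single predictor, the standardised beta equals the Pearson correlation, which is a quick way to sanity-check reported coefficients.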

Share and Cite

MDPI and ACS Style

Di Palma, D. Longitudinal Validation of EIP-Move for Assessing the Educational and Inclusive Potential of Physical Education and Sports Programs in Primary Schools. Educ. Sci. 2026, 16, 374. https://doi.org/10.3390/educsci16030374
