Time-Consistent Prediction in Higher Education: A Framework for Preventing Data Leakage in Longitudinal Models

Fauszt, Tibor

doi:10.3390/info17060581

by

Tibor Fauszt

Department of Informatics, University of Dunaújváros, 2400 Dunaújváros, Hungary

Information 2026, 17(6), 581; https://doi.org/10.3390/info17060581

Submission received: 30 April 2026 / Revised: 8 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue AI and Machine Learning in the Big Data Era: Advanced Algorithms and Real-World Applications)

Abstract

Data leakage is a major source of overoptimistic performance estimates in machine learning-based predictive modelling. In higher education, dropout models are increasingly used to support interventions, yet leakage may arise not only from technical mistakes but also from an underspecified prediction task. This article conceptualizes predictive modelling as a temporally specified decision configuration and proposes Time-Consistent Dropout Prediction (TCDP) as a diagnostic framework for longitudinal educational data. TCDP defines four core components: an explicit prediction cutoff, a pre-cutoff information set, a risk-set-consistent target population, and a temporally appropriate validation design. The validation component comprises two design conditions: entity-level train–test separation and cohort-level temporal consistency. A structured methodological screening of recent dropout-prediction studies shows that temporal anchoring is becoming more common, whereas risk-set definition and validation hierarchy remain less consistently formalized. The empirical demonstration uses a pseudonymized longitudinal student panel from a Hungarian higher education institution, accompanied by a controlled synthetic reproduction package. Results show that row-wise validation in data with repeated student observations generates increasing train–test entity overlap across prediction cutoffs and inflates AUC, especially for high-capacity or instance-based models. Enforcing entity-level separation removes this inflation and yields lower, more realistic performance ranges. The paper contributes an operational validation grammar linking prediction time, admissible information, population eligibility, and validation design.

Keywords:

learning analytics; dropout prediction; predictive modelling; data leakage; identity leakage; temporal modelling; longitudinal data; model evaluation; predictive validity

1. Introduction

Machine learning-based predictive models have become widely used in a range of application domains, including Learning Analytics and Educational Data Mining, particularly in applications targeting early dropout prediction [1,2,3,4]. Numerous studies report high predictive performance across diverse data structures, feature sets, and classification algorithms [5,6,7,8,9]. The interpretability of these performance indicators depends on whether the modelling environment accurately reflects the temporal and decision context in which predictions are intended to be deployed, particularly with respect to the definition of the prediction task and the information available at the time of prediction.

Over the past decade, methodological research has evolved from the early identification of data leakage [10] to broader concerns regarding reproducibility in machine learning [11], emphasizing that high performance metrics alone do not establish predictive validity [12]. Although internal validation via train–test splits or cross-validation is methodologically accepted, its validity depends on informational and temporal separation. When such separation is compromised, the modelling process may incorporate post-outcome signals or future information unavailable under the intended prediction setting [10], leading to data leakage—a systemic issue widely recognized in machine learning practice [10,11,12,13,14]. Such distortions may manifest as statistical optimism: models appear to perform well under evaluation conditions while exhibiting reduced effectiveness under operational deployment [12,15,16].

Existing studies typically address leakage as an isolated validation issue, often focusing on distortions arising from improper train–test splits [17]. Fewer contributions examine leakage in relation to the temporal interpretation of prediction tasks and the structural characteristics of longitudinal data representation [18,19].

Higher education dropout provides a clear example of a temporally extended, dynamic process that requires a longitudinal representation. A widely adopted practice in the literature, however, involves the cross-sectional “flattening” of longitudinal data without explicit temporal anchoring [20]. Although this approach is not inherently incorrect, the absence of a fixed prediction time point may turn the task into the classification of completed student trajectories rather than prediction from a defined temporal state [21,22,23]. Different data representations induce distinct forms of leakage: aggregated structures may incorporate predictors that implicitly encode future information, resulting in temporal leakage, whereas longitudinal panel designs are susceptible to repeated observations from the same individuals overlapping across partitions, resulting in identity leakage [24,25,26]. Consequently, rigorous treatment of panel data structures is essential to prevent models from exploiting individual-specific fingerprints rather than learning generalizable patterns [27,28,29].

The consequences of such methodological flaws extend beyond technical metrics and may affect the practical utility of educational decision-support systems. Overly optimistic models may misguide institutional interventions, contribute to the misallocation of scarce resources, and reduce trust in predictive systems [30]. Because the temporal orientation of a model determines whether it can support future-oriented decisions, explicit temporal formalization is a prerequisite for reliable data-informed intervention in educational settings [31,32].

The aim of this study is to address this methodological gap by formulating a diagnostic framework for Temporally Consistent Prediction. In the specific context of higher education dropout prediction, this framework is operationalized as Time-Consistent Dropout Prediction (TCDP). The framework systematizes key requirements that have appeared in a fragmented manner in the literature, organizing them into an explicitly time-formalized structure. TCDP specifies five validity conditions organized into four core components (prediction cutoff, information set, risk set, and validation hierarchy, the latter comprising entity-level separation and cohort-level temporal consistency): (i) explicit specification of the prediction time point, (ii) temporal restriction of the information set, (iii) consistent delineation of the at-risk population, and (iv) validation procedures that ensure entity-level and cohort-level consistency. Within this framework, data leakage is conceptualized not merely as an accidental coding flaw, but as a structural consequence of improperly specified prediction tasks.

The proposed framework does not claim that each individual condition is entirely new. Rather, its contribution lies in integrating concepts that are usually treated separately across adjacent methodological traditions—such as clinical time-series prediction, where the time of prediction and the prediction horizon are explicit components of the modelling task [33]; survival analysis and dynamic prediction, where predictions are conditioned on information observed up to a landmark or cutoff time and restricted to individuals still at risk [34,35]; longitudinal and repeated-measures validation, where observations from the same entity must not be split across training and test partitions [18,19]; and data leakage research, which emphasizes that information unavailable at prediction time must be excluded from model development and evaluation [23]—into a single minimal specification of a temporally valid educational prediction task.

Focusing on the methodological foundations of leakage control, this study presents a focused literature review and an empirical demonstration based on student-level administrative data from a Hungarian higher education institution. Although this framework is demonstrated through institutional dropout prediction, its underlying principles are transferable to other predictive tasks based on longitudinal panel data, particularly when decisions depend on temporally bounded information and repeated observations of the same entities. Indeed, the challenge of validation-induced identity leakage and performance inflation is actively discussed across diverse domains, including econometrics [28] and biomedical analytics, where row-wise splitting similarly compromises model generalizability [18]. By addressing these foundational conditions, the empirical component illustrates how a commonly applied record-level validation strategy induces identity leakage and inflates reported performance, while entity-level separation provides an operationally valid evaluation configuration [17].

The empirical dataset used for the reported results cannot be made publicly available because of data protection and institutional confidentiality constraints. To support transparency and procedural reproducibility, the study is instead accompanied by a reproducible experiment package: a fully synthetic dataset with the same longitudinal structure, feature schema, cutoff logic, and validation configurations as the empirical experiment, together with the Python code required to run the analysis.

Research Questions and Diagnostic Expectations

This study is guided by three methodological research questions rather than by causal hypotheses about student dropout behaviour.

RQ1. To what extent do recent dropout-prediction studies explicitly specify the four TCDP dimensions: prediction cutoff, temporally restricted information set, risk set, and validation hierarchy?
RQ2. To what extent does row-wise validation in longitudinal student panels generate train–test entity overlap across prediction cutoffs, and how does this overlap affect reported AUC relative to an entity-level group split?
RQ3. How do model class and the representational granularity of static predictors affect the empirical visibility of identity leakage?

Based on the TCDP framework, the empirical demonstration is expected to show that violating Condition (iv.a), Entity-Level Separation, inflates predictive performance relative to a valid entity-level group split. Because the number of repeated observations per student increases across later cutoffs, the magnitude of train–test entity overlap is expected to increase with the cutoff. Finally, leakage-induced performance inflation is expected to depend not only on algorithmic capacity but also on the quasi-identifying strength of the static feature representation; therefore, high-cardinality categorical predictors may allow even linear models to exploit leaked identity structure.

2. Conceptual and Methodological Framework

The proposed framework specifies the minimal temporal and structural conditions required for valid predictive modelling in longitudinal settings. In the context of dropout prediction, this framework is referred to as Time-Consistent Dropout Prediction (TCDP).

The interpretability of predictive models depends on the explicit specification of the prediction task. In longitudinal settings, this requires defining the time point at which prediction is made, the information available at that time, the population to which the prediction applies, and the validation procedure used to estimate performance [4,10,25].

The framework therefore treats prediction as a formally specified decision configuration rather than as a purely algorithmic task. Within this configuration, data leakage is interpreted as a violation of temporal or structural validity conditions [11,13].

The following subsection defines the four core components of this configuration: the prediction cutoff, the information set, the risk set, and the validation hierarchy.

2.1. Prediction as a Temporally Formalized Task

Let $i$ denote an individual student and let $t$ denote discrete observation time, such as semester, week, course phase, or another temporally ordered unit. A longitudinal educational dataset can be represented as a collection of observations of the form

$D = \{(i, t, X_{i}(t), Y_{i}(t))\},$

(1)

where $X_{i}(t)$ denotes the observed characteristics of student $i$ at time $t$, and $Y_{i}(t)$ denotes the outcome status observed at time $t$.

(i): Explicit Specification of the Prediction Cutoff ($t_{c}$)

A time-consistent prediction task is anchored at a prediction cutoff, denoted by $t_{c}$. The cutoff defines the time point at which prediction is issued and separates the information available for prediction from the future outcome to be predicted. All subsequent components of the prediction task are defined relative to this cutoff [34,35].

(ii): Information Set ($I(t_{c})$)

The information set $I(t_{c})$ contains the variables that are observed and available at or before the prediction cutoff. Formally, a predictor $X$ belongs to the valid information set only if its observation time does not exceed the cutoff:

$X \in I(t_{c}) \iff \mathrm{obs}(X) \leq t_{c}.$

(2)

Predictors observed after the cutoff violate the learn–predict separation and introduce temporal leakage into the prediction task [10].
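
Equation (2) can be applied mechanically once each predictor carries its observation time. The following is a minimal sketch under that assumption; the `Predictor` class, the variable names, and the semester indices are hypothetical and are not taken from the study's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictor:
    name: str
    obs_time: int  # hypothetical: semester index at which the value becomes observable


def admissible_predictors(predictors, cutoff):
    """Names of predictors with obs(X) <= t_c, i.e. members of I(t_c)."""
    return [p.name for p in predictors if p.obs_time <= cutoff]


def leaking_predictors(predictors, cutoff):
    """Names of predictors with obs(X) > t_c, which would cause temporal leakage."""
    return [p.name for p in predictors if p.obs_time > cutoff]


# Illustrative catalogue; names and observation times are invented for the example.
catalog = [
    Predictor("admission_score", 0),
    Predictor("credits_sem1", 1),
    Predictor("gpa_sem2", 2),
    Predictor("final_cumulative_gpa", 8),
]

admissible = admissible_predictors(catalog, cutoff=2)
leaking = leaking_predictors(catalog, cutoff=2)
```

With a cutoff at the end of the second semester, only the first three predictors are admissible. The sketch presupposes that observation times are documented per variable, which in practice is often the harder part of the task.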

(iii): Risk Set ($R(t_{c})$)

The risk set $R(t_{c})$ consists of individuals who are still under observation and for whom the outcome event has not yet occurred at the prediction cutoff [36]. Let $E_{i}$ denote the observed or operationally defined event time for student $i$. In dropout prediction, this corresponds to the time at which the student is first considered to have dropped out according to the study-specific dropout definition. Students who have already experienced the event before the prediction cutoff are excluded from the risk set:

$R(t_{c}) = \{i : E_{i} > t_{c} \text{ or } E_{i} \text{ is unobserved by } t_{c}\}.$

(3)

The risk set defines the population to which the predictive claim applies. TCDP does not require a full event-history or survival model; however, it requires the dropout event and the cutoff-specific risk set to be explicitly specified for the prediction task under study.
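
As an illustration of Equation (3), the risk set can be derived from a table of event times. This sketch assumes that each student's dropout time is stored as a semester index and that `None` marks students with no observed event. Both encodings are assumptions made for the example. Other exits from observation, such as graduation or transfer, would need their own exclusion rule.

```python
def risk_set(event_times, cutoff):
    """Students still at risk at the cutoff.

    event_times maps a student id to the dropout semester,
    or to None when no dropout event has been observed (assumed encoding).
    """
    return {
        student
        for student, event_time in event_times.items()
        if event_time is None or event_time > cutoff
    }


# Invented example data: s1 and s4 have already dropped out by semester 2.
event_times = {"s1": 1, "s2": 3, "s3": None, "s4": 2}
at_risk = risk_set(event_times, cutoff=2)
```

Only the students in `at_risk` are eligible for prediction at this cutoff. Training or evaluating on `s1` or `s4` would mean scoring cases whose outcome is already resolved.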

The first three components define the substantive prediction task: what is to be predicted, at what time, using which information, and for which population. Within this structure, the predictive claim can be expressed as a probability of a future outcome for members of the risk set, conditional on the information available at the cutoff:

$P(Y_{i}(t > t_{c}) \mid I_{i}(t_{c})), \quad i \in R(t_{c}).$

(4)

This formulation clarifies that prediction is defined only for individuals in the risk set and only on the basis of temporally bounded information. The fourth component, validation hierarchy, concerns the empirical estimation of this predictive claim during model training and evaluation.

(iv): Validation Design

The validation design specifies how model performance is estimated relative to the intended prediction task. In longitudinal settings, two forms of separation are required: entity-level separation and cohort-level temporal consistency.

(iv.a): Entity-Level Separation

Observations linked to the same individual must not overlap across training and test sets. In longitudinal datasets, repeated observations associated with the same entity may introduce hidden dependencies between partitions if row-wise splitting procedures are applied [17,18,19,37]. Valid entity-level separation requires

$ID_{train} \cap ID_{test} = \emptyset.$

(5)

(iv.b): Cohort-Level Temporal Consistency

Training and test partitions must preserve the chronological order of cohorts, calendar periods, or institutional presentations [38,39]. In a strict out-of-time validation design, this can be expressed as

$T_{train} \prec T_{test},$

(6)

where $\prec$ denotes chronological precedence. This condition ensures that later cohorts or institutional periods do not inform predictions evaluated for earlier decision settings.

This distinction also clarifies the positioning of TCDP relative to standard validation strategies. Conventional procedures often address only one dimension of the validation problem when applied in their default form. For example, group-based validation can enforce entity-level separation, but it does not by itself guarantee chronological or cohort-level consistency. Conversely, time-series validation preserves temporal ordering, but in longitudinal panel data it may still allow repeated observations of the same individual to appear across training and test partitions unless grouping is explicitly enforced. The TCDP framework therefore requires the simultaneous enforcement of both entity-level separation and cohort-level temporal consistency.
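
One simple way to satisfy both conditions at once is to partition students by entry cohort. The sketch below assumes that every student belongs to exactly one entry cohort, so a cohort-based partition also keeps each student's records together. The record layout and the function names are hypothetical, and this is not presented as the procedure used in the study.

```python
def cohort_entity_split(records, first_test_cohort):
    """Assign whole students to partitions by entry cohort.

    Students who entered before first_test_cohort form the training set;
    all later entrants form the test set.
    """
    train = [r for r in records if r["entry_cohort"] < first_test_cohort]
    test = [r for r in records if r["entry_cohort"] >= first_test_cohort]
    return train, test


def check_split(train, test):
    """Return (entity_ok, cohort_ok) for Equations (5) and (6)."""
    train_ids = {r["student_id"] for r in train}
    test_ids = {r["student_id"] for r in test}
    entity_ok = train_ids.isdisjoint(test_ids)
    cohort_ok = (
        max(r["entry_cohort"] for r in train)
        < min(r["entry_cohort"] for r in test)
    )
    return entity_ok, cohort_ok


# Invented panel: four students, two semester records each.
records = [
    {"student_id": sid, "entry_cohort": cohort, "semester": sem}
    for sid, cohort in [("a", 2018), ("b", 2018), ("c", 2019), ("d", 2020)]
    for sem in (1, 2)
]

train, test = cohort_entity_split(records, first_test_cohort=2020)
entity_ok, cohort_ok = check_split(train, test)
```

Library utilities such as scikit-learn's `GroupKFold` and `TimeSeriesSplit` each address one of the two conditions. An explicit check like `check_split` is a cheap safeguard whichever splitter is used.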

These components jointly define a valid decision configuration:

$C(t_{c}) = \{I(t_{c}), R(t_{c}), Y_{i}(t > t_{c}), V(t_{c})\},$

(7)

where $V(t_{c})$ denotes the validation design applied at the prediction cutoff, including the temporal and entity-level separation rules.

The first three components define the prediction task, while the validation hierarchy defines the admissible empirical evaluation of that task. A configuration is valid only if the implemented validation design is consistent with the specified task structure. The formal definitions are summarized in Table 1, and their structural relationships are illustrated in Figure 1.

The conditions summarized in Table 1 define the minimal requirements for time-consistent prediction in longitudinal settings. Figure 1 illustrates how these components operate within a unified prediction process by relating the cutoff, admissible information, risk population, and evaluation horizon.

2.2. Data Leakage as Violation of Temporal and Structural Conditions

Violations of the conditions defined above may manifest as data leakage during model construction or evaluation. Leakage occurs when model training or validation incorporates information that is not available within the specified decision configuration, or when the required separation between individuals, time periods, or cohorts is not preserved.

In the literature, data leakage is often described through technical categories such as target leakage, preprocessing leakage, or feature selection leakage [10,11,13]. The present framework reinterprets these manifestations as consequences of violated temporal or structural conditions rather than as isolated technical errors. In longitudinal educational data, two forms are especially relevant: temporal leakage and identity-based leakage.

2.2.1. Temporal Leakage

Temporal leakage occurs when a model uses information or temporal structure that would not be available at the specified prediction cutoff $t_{c}$ [10,23,33]. In the present framework, it has two relevant forms: predictor-level temporal leakage and cohort-level temporal leakage.

At the predictor level, this corresponds to the inclusion of variables observed after the cutoff:

$\exists X \in \mathcal{X} : \mathrm{obs}(X) > t_{c},$

(8)

where $\mathcal{X}$ denotes the set of predictors used by the model.

At the validation level, temporal leakage may also occur when chronological ordering across cohorts or institutional periods is not preserved, that is, when the condition in Equation (6) fails:

$T_{train} \nprec T_{test},$

(9)

where $\prec$ denotes chronological precedence and $\nprec$ its violation.

For example, a dropout model evaluated at the end of the second semester would be temporally invalid if it included cumulative GPA calculated over the entire study period. Similarly, random splitting across pre- and post-reform cohorts may allow the model to exploit institutional patterns that were not available in the earlier prediction setting.
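
The GPA example can be made concrete with a small calculation. The sketch below uses an unweighted mean of semester GPAs as a simplifying assumption, since real cumulative GPAs are usually credit-weighted. The grade values are invented.

```python
def cumulative_gpa(semester_gpas, cutoff=None):
    """Unweighted mean of semester GPAs, restricted to semesters <= cutoff if given."""
    used = [
        gpa
        for semester, gpa in semester_gpas.items()
        if cutoff is None or semester <= cutoff
    ]
    return sum(used) / len(used)


# Invented trajectory: performance collapses after the second semester.
semester_gpas = {1: 3.0, 2: 3.0, 3: 1.0, 4: 1.0}

gpa_at_cutoff = cumulative_gpa(semester_gpas, cutoff=2)  # admissible at t_c = 2
gpa_full_period = cumulative_gpa(semester_gpas)          # encodes semesters 3 and 4
```

Here `gpa_at_cutoff` is 3.0 while `gpa_full_period` is 2.0. The full-period value already reflects the later decline that the model is supposed to predict, so it is not admissible at a second-semester cutoff.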

2.2.2. Identity Leakage

Identity leakage occurs when repeated observations of the same individual are present in both the training and test partitions [18,37]. Formally, this corresponds to a violation of entity-level separation:

$ID_{train} \cap ID_{test} \neq \emptyset.$

(10)

In panel-structured datasets, such overlap typically results from record-level splitting, where rows are randomly assigned to training and test sets without grouping observations by individual. Under this configuration, the model may exploit stable individual-specific patterns across repeated records instead of learning relationships that generalize to unseen individuals.

In such a split, the test observations are no longer independent of the training observations, because the same entity contributes records to both partitions. Stable or slowly changing individual-level attributes may therefore act as implicit identifiers, allowing the model to approximate entity recognition rather than estimate relationships that generalize to unseen students. The resulting performance estimate is biased because the validation sample contains partially familiar entities rather than an independent decision population.

In student dropout prediction, this may occur when semester-level records of the same student are split across training and test sets. The model can then encounter earlier observations of a student during training and later observations of the same student during evaluation. Performance estimates obtained under this configuration are therefore affected by recurring entity-level information rather than by generalization to independent students or future cohorts.
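
The difference between the two splitting strategies can be measured directly as the share of test students who also appear in training. The following sketch uses a synthetic panel of student-semester rows and only the Python standard library. The panel size, the 25% test fraction, and the function names are illustrative assumptions, not the study's configuration.

```python
import random


def make_panel(n_students=50, n_semesters=4):
    """Synthetic panel: one (student_id, semester) row per student and semester."""
    return [
        (f"s{i:03d}", semester)
        for i in range(n_students)
        for semester in range(1, n_semesters + 1)
    ]


def row_wise_split(rows, test_fraction, seed):
    """Random split of individual rows, ignoring which student they belong to."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def entity_level_split(rows, test_fraction, seed):
    """Random split of students; all rows of a student share one partition."""
    ids = sorted({student for student, _ in rows})
    random.Random(seed).shuffle(ids)
    test_ids = set(ids[: int(len(ids) * test_fraction)])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test


def overlap_share(train, test):
    """Share of test students that also contribute rows to training."""
    train_ids = {student for student, _ in train}
    test_ids = {student for student, _ in test}
    return len(train_ids & test_ids) / len(test_ids)


panel = make_panel()
row_overlap = overlap_share(*row_wise_split(panel, 0.25, seed=0))
entity_overlap = overlap_share(*entity_level_split(panel, 0.25, seed=0))
```

The entity-level split yields an overlap of zero by construction. The row-wise split cannot reach zero in this setup, because 50 test rows cannot be made up solely of complete four-row students, so at least one test student also has rows in training.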

2.2.3. Leakage and Metric Interpretation

Data leakage and risk-set inconsistency can cause reported metrics, such as AUC or F1-score, to overstate model performance. Leakage is known to produce overly optimistic performance estimates when information unavailable in the intended prediction setting enters model development or evaluation [10]. Similar effects have been demonstrated empirically for repeated-subject leakage, where non-independent observations across partitions substantially inflate reported predictive performance [37]. This distortion may be particularly difficult to detect when the improvement is moderate and remains within a plausible performance range [11]. For this reason, performance metrics should be interpreted relative to the specified cutoff, information set, risk set, and validation design rather than as standalone indicators of predictive validity [33,40].

2.3. Operational Checklist for Time-Consistent Validation

The framework can be operationalized as a pre-analysis validation checklist. Before model fitting, the analyst should document the intended prediction cutoff, verify that every predictor is observable at or before that cutoff, and define the risk set by excluding students whose outcome status has already been resolved.

Before performance reporting, the analyst should also verify that the validation design prevents both repeated entity overlaps and reverse temporal ordering. In practice, this means that all records belonging to the same student must be assigned to a single partition, and that evaluation cohorts or calendar periods must not be informed by later institutional periods.

A model should therefore not be interpreted as a prospective dropout predictor unless all four components—cutoff, information set, risk set, and validation hierarchy—are explicitly reported and mutually consistent. This checklist translates the formal TCDP conditions into a practical diagnostic sequence for validation design.
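
The checklist could be encoded as a small audit function that is run before any metric is reported. This is a minimal sketch, assuming that observation times, event times, and entry cohorts are available as plain dictionaries. The function name, argument layout, and messages are hypothetical.

```python
def audit_tcdp(cutoff, predictor_obs_times, event_times, train_cohorts, test_cohorts):
    """Return the list of violated TCDP conditions (empty if none are detected).

    predictor_obs_times: {predictor name: time at which it becomes observable}
    event_times: {student id: dropout time, or None if no event was observed}
    train_cohorts / test_cohorts: {student id: entry cohort} per partition
    """
    if cutoff is None:
        return ["(i) no prediction cutoff specified"]

    violations = []

    late = sorted(n for n, t in predictor_obs_times.items() if t > cutoff)
    if late:
        violations.append(f"(ii) predictors observed after the cutoff: {late}")

    students = set(train_cohorts) | set(test_cohorts)
    resolved = sorted(
        s for s in students
        if event_times.get(s) is not None and event_times[s] <= cutoff
    )
    if resolved:
        violations.append(f"(iii) students outside the risk set: {resolved}")

    shared = sorted(set(train_cohorts) & set(test_cohorts))
    if shared:
        violations.append(f"(iv.a) students in both partitions: {shared}")

    if train_cohorts and test_cohorts and (
        max(train_cohorts.values()) >= min(test_cohorts.values())
    ):
        violations.append("(iv.b) training cohorts do not precede test cohorts")

    return violations


# Invented configurations: one that breaks every condition, one that passes.
flawed = audit_tcdp(
    cutoff=2,
    predictor_obs_times={"admission_score": 0, "final_cumulative_gpa": 8},
    event_times={"a": 1, "b": None, "c": None},
    train_cohorts={"a": 2018, "b": 2019},
    test_cohorts={"b": 2019, "c": 2019},
)
clean = audit_tcdp(
    cutoff=2,
    predictor_obs_times={"admission_score": 0, "credits_sem1": 1},
    event_times={"a": None, "b": 3, "c": None},
    train_cohorts={"a": 2018, "b": 2018},
    test_cohorts={"c": 2019},
)
```

An empty result only means that none of the encoded checks failed. It does not establish that the cutoff itself is substantively meaningful or that the dropout definition is appropriate.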

3. Methodological Patterns in Dropout Prediction: A Literature-Based Analysis

This section examines recurring methodological designs in recent empirical studies on higher education dropout prediction. The purpose is not to provide an exhaustive systematic review, but to conduct a structured methodological screening of studies in which the temporal structure of prediction can be assessed. The screening focuses on whether each study specifies the four TCDP dimensions: prediction cutoff, temporally restricted information set, risk-set definition, and validation hierarchy.

The screening was conducted in Scopus, Web of Science, Google Scholar, MDPI, SpringerLink, and ScienceDirect using the following search string: (“student dropout prediction” OR “higher education dropout” OR “student retention” OR “academic risk”) AND (“machine learning” OR “learning analytics” OR “predictive modelling”) AND (“longitudinal” OR “temporal” OR “early warning” OR “cohort validation”). The primary date range was January 2024 to May 2026. Earlier methodological studies were retained only when they provided reference points for leakage, validation, or temporal modelling.

Studies were included when they addressed student-level dropout, retention, enrolment-continuation, or academic-risk prediction in higher education, MOOCs, or online learning; used statistical or machine-learning predictive models; and provided sufficient methodological detail to code at least three of the four TCDP dimensions. Studies were excluded when they focused primarily on dashboards, tool usability, AutoML accessibility, conceptual policy design, or static algorithm benchmarking without enough information about prediction timing and validation design.

Each retained study was coded conservatively as explicit, partial, or absent for each dimension. A criterion was coded as explicit only when the reported design made the corresponding temporal or structural condition reproducible. It was coded as partial when the condition could be reasonably inferred but was not formally specified. It was coded as absent when the paper did not report the condition or when the reported configuration was inconsistent with prospective prediction.

3.1. Methodological Orientations in Recent Dropout-Prediction Studies

The retained studies suggest that recent dropout-prediction research is best understood through three partially overlapping methodological orientations rather than as a sequence of directly comparable algorithmic applications. These orientations differ primarily in how they define the target status, delimit the information available at prediction time, and specify the population for which a prediction is meaningful.

Outcome-status classification designs model dropout, retention, enrolment continuation, or academic failure using institutional, academic, financial, or course-level information, often aggregated over semester- or year-level periods [4,5]. They are useful for institutional profiling and retrospective risk stratification, but they do not always make the prediction moment, admissible information window, and eligible at-risk population explicit. From a TCDP perspective, this distinction is critical because aggregated end-of-period records may shift the task from prospective prediction toward the classification of trajectories that are already largely observed.

Learning-process early-warning designs move prediction closer to possible intervention by using information generated during an ongoing course or semester, such as LMS or Moodle activity logs, clickstream or inactivity indicators, course-progress measures, assessment behaviour, and accumulated engagement features [41,42]. Compared with static outcome-status classification, these studies make stronger use of temporal checkpoints. However, their time consistency still depends on whether each checkpoint is paired with a formally restricted feature set and an explicitly defined risk set.

Longitudinal and cohort-oriented designs extend temporal modelling across monthly, semester-based, rolling-window, cross-semester, or cohort-specific settings [7,43,44]. This orientation is especially relevant in higher education, where dropout often develops through repeated enrolment decisions, inactivity, delayed progression, interruption, or non-continuation rather than through a single isolated event. At the same time, repeated observations introduce additional validation risks unless entity-level or cohort-level separation is enforced.

Overall, the reviewed literature shows a clear movement toward greater temporal awareness through staged checkpoints, time-bounded features, semester snapshots, and cohort-based evaluation settings [7,41,42,43,44]. The remaining methodological gap is therefore not the absence of temporality as such, but the uneven joint specification of prediction time, admissible information, population eligibility, and validation design. The retained studies provide several partial implementations of time-consistent prediction, while the combined specification of all TCDP components remains uncommon.

3.2. Comparative Synthesis Across TCDP Dimensions

To make this comparison explicit, this subsection evaluates the retained studies according to the four TCDP dimensions: prediction cutoff, temporally restricted information set, risk-set definition, and validation hierarchy. The purpose of the coding is not to rank studies by overall quality, but to identify which elements of a time-consistent prediction configuration are explicitly specified, partially inferable, or absent. This distinction is necessary because temporal structure may enter a dropout-prediction design at different stages. A study may construct features at weekly, monthly, semester-based, or cross-semester checkpoints, while still leaving the risk set or validation hierarchy only partially formalized.

The coding results make these differences explicit. Of the eight retained studies, five explicitly define operational prediction cutoffs [7,41,42,43,44], two provide only implicit or partial temporal anchors [5,45], and one does not implement a prospective cutoff [4]. A similar pattern appears for information-set restriction: five studies construct predictors around the corresponding prediction point [7,41,42,43,44], two do so only partially or implicitly [5,45], and one does not define the predictor set in temporal terms [4].

The risk-set and validation dimensions are less consistently specified. Only two studies explicitly define or substantively implement the population still at risk at the prediction time, with Kwon [43] and Vaarma and Li [7] providing the clearest examples. Three studies address the risk set only indirectly [41,42,44]; among them, Cheng and Lin [44] partly imply such a population through the cross-semester setup. The remaining three provide insufficient information to verify the condition [4,5,45]. The strongest limitation appears in validation. Vaarma and Li [7] provide the clearest explicitly cohort-based test design combined with a longitudinal monthly prediction structure, whereas the remaining studies use random splits, mixed holdout/CV procedures, year-based separations with incomplete risk-set reporting, unclear validation strategies, or temporal checks only as robustness analyses.

Table 2 summarizes the methodological coding of the reviewed dropout-prediction studies according to the main TCDP dimensions.

Table 2. Methodological configurations of recent dropout-prediction studies coded by TCDP dimensions.

Study	Cutoff	Information Set	Risk Set	Validation Strategy	Validation Hierarchy
[5]	P	P	A	2016–2019 train; 2020 test	P
[41]	E	E	P	Random 80/20; SMOTE; GridSearch CV	P
[7]	E	E	E	2015 entrants as test cohort; monthly models	P
[42]	E	E	P	Weekly models with CV/holdout; all-weeks 30% random holdout	P
[4]	A	A	A	2018 train; 2019 test; 10-fold CV in training	P
[45]	P	P	A	Model comparison; validation details insufficient	A
[43]	E	E	E	Stratified 80/20 main split; temporal robustness split	P
[44]	E	E	P	AY108-112 train; AY112/2 features -> AY113/1 labels	P

Note: Coding labels: E = explicitly defined and operationalized; P = partially or implicitly satisfied; A = absent, unclear, or not supported by the reported methodology.
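
The per-dimension counts can be recomputed from the codes in Table 2. The short sketch below transcribes the E/P/A labels from the table and tallies them. The dictionary layout and dimension names are an illustrative choice, not part of the original coding protocol.

```python
from collections import Counter

# Codes transcribed from Table 2, in the order:
# (cutoff, information set, risk set, validation hierarchy)
coding = {
    "[5]": ("P", "P", "A", "P"),
    "[41]": ("E", "E", "P", "P"),
    "[7]": ("E", "E", "E", "P"),
    "[42]": ("E", "E", "P", "P"),
    "[4]": ("A", "A", "A", "P"),
    "[45]": ("P", "P", "A", "A"),
    "[43]": ("E", "E", "E", "P"),
    "[44]": ("E", "E", "P", "P"),
}

dimensions = ("cutoff", "information_set", "risk_set", "validation_hierarchy")

# Count E/P/A labels per dimension.
tally = {
    dimension: Counter(codes[k] for codes in coding.values())
    for k, dimension in enumerate(dimensions)
}

# Studies coded as explicit (E) on all four dimensions.
fully_explicit = [
    study for study, codes in coding.items() if all(c == "E" for c in codes)
]
```

The tally gives five explicit, two partial, and one absent coding for both the cutoff and the information set, and two explicit, three partial, and three absent codings for the risk set. No study is coded as explicit on the validation hierarchy, so `fully_explicit` is empty.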

The matrix shows that the integrated application of all four structural dimensions remains uncommon. A study may define a meaningful cutoff without a formal risk set, restrict the information set without enforcing temporal validation, or use longitudinal data while evaluating the model through random train–test partitions. Thus, temporal awareness at the feature-construction stage does not automatically imply time-consistent validation.

The analytical implication is that current dropout-prediction research increasingly recognizes the need for early or staged prediction, but the validation hierarchy often lags behind the substantive task definition. The proposed TCDP framework addresses this gap by requiring the cutoff, information set, risk set, and validation hierarchy to be specified jointly before predictive performance is interpreted.

Because the screening identifies validation hierarchy as the least consistently formalized dimension, the empirical demonstration below isolates one specific validation failure: entity-level overlap under row-wise splitting.

4. Empirical Demonstration: TCDP-Formalized Identity Leakage in Longitudinal Data with Static Predictors

This section provides an empirical demonstration of identity leakage in longitudinal prediction settings with repeated student observations. The aim is not to optimize dropout prediction, but to isolate a validation failure that can create the illusion of predictive improvement when semester-level records from the same student are allowed to cross the train–test boundary.

The empirical analysis is restricted to the longitudinal panel representation because the diagnostic target is the dependency structure created by repeated observations of the same student across successive cutoffs. The purpose of the experiment is therefore to examine the effect of violating and then restoring TCDP Condition (iv.a), Entity-Level Separation, in a panel-based validation procedure.

The experiment is organized explicitly as a TCDP specification. Thus, the empirical design is not defined by the algorithms and train–test split alone, but by the entity unit, prediction cutoff, admissible information set, risk set, feature representation, validation partition, and evaluation unit. This formalization makes the validation logic auditable and directly addresses the methodological concern that leakage control must be specified before performance metrics are interpreted.

The reported results compare an intentionally invalid row-wise split, which violates TCDP Condition (iv.a), Entity-Level Separation, with a valid entity-level group split that restores this condition, while holding the cutoff, preprocessing, feature representation, and algorithmic settings constant.

4.1. Longitudinal Data Representation and Prediction Task

The empirical setting is based on a longitudinal administrative dataset collected over multiple academic years at a Hungarian higher education institution. The institutional data were processed in pseudonymized form. The initial administrative extract contained 2929 students. From this population, the analytical sample was restricted to students whose terminal status could be unambiguously determined within the six-semester observation window. Students were retained if they either met the operational dropout definition, defined as no course enrolment in the subsequent semester, or obtained their degree by the end of the sixth semester. Students who remained enrolled beyond the sixth semester without a terminal outcome were excluded because their final status could not be determined within the defined observation horizon. After this filtering step, the empirical sample contained 1851 students from multiple entry cohorts observed between 2019 and 2024. Each retained student was represented by six consecutive semester-level records, yielding a balanced raw panel of 11,106 student–semester observations. The raw panel is balanced at the record-structure level: dropout does not remove students from the table; instead, subsequent semester records are retained with the activity/enrolment-count indicator equal to zero. In the cutoff-specific prediction tasks, however, these post-dropout records are excluded through the risk-set rule. Thus, for a given cutoff τ, only students still active at τ are retained, and only their records with time_index ≤ τ are used for model fitting and evaluation.

Table 3 describes the structure of the longitudinal panel dataset used in the empirical demonstration. Each row in the panel corresponds to one student–semester observation, identified by an entity_id and a time_index. The entity_id denotes the student, while time_index denotes the semester-level observation period. For reproducibility, the same experimental logic is implemented in a Python 3.11.1 pipeline using generic feature names and a structurally analogous synthetic dataset available in a public GitHub repository.

This dataset was selected because its semester-level panel structure contains repeated observations of the same students across successive cutoffs, making it suitable for isolating the validation-induced identity leakage mechanism examined in this study.

Table 4 summarizes the role of each variable in the experimental design, distinguishing predictors, risk-set indicators, target-related variables, and variables excluded from model fitting. The prediction target is a binary student-level event outcome. The risk set is constructed from feature_7, an activity/enrolment-count indicator. A value feature_7 > 0 denotes that the student is still active in the semester, whereas feature_7 = 0 denotes the absence of active course enrolment. Importantly, feature_7 is not included in the predictor matrix; it is used only to determine whether the student remains eligible for the event at a given cutoff.

4.2. TCDP-Based Formalization of the Experimental Design

The experiment can be written as a TCDP-compatible longitudinal prediction problem. Let i index students and t index semesters. The observed panel consists of repeated records for the same entity:

D = \{(i, t, S_{i}, A_{i, t}, Y_{i}) : i = 1, \dots, N, t = 1, \dots, 6\}

(11)

Here, S_i denotes the static background profile of student i, A_{i,t} denotes the activity indicator derived from feature_7, and Y_i denotes the binary event outcome. The prediction cutoffs are τ ∈ {1,2,3,4,5,6}.

The event time is defined as the first semester in which the activity indicator becomes zero after being positive in the previous semester. If no such transition is observed within the six-semester window, the event time is treated as censored or infinite:

E_{i} = \min \{t \in \{2, \dots, 6\} : A_{i, t} = 0 \land A_{i, t - 1} > 0\}, \quad E_{i} = \infty \text{ if no such transition is observed}.

(12)

The risk set at cutoff τ contains only students who are still active at the cutoff and have not experienced the event at or before it:

R_{τ} = \{i : A_{i, τ} > 0 \text{ and } E_{i} > τ\}

(13)

The cutoff-specific panel used for model fitting and testing is therefore

D_{τ} = \{(i, t) : i \in R_{τ}, t \leq τ\}

(14)
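Equations (12)–(14) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code of the public pipeline: the function names and the toy panel are hypothetical, and each student is represented simply by the list of feature_7 values for semesters 1 to 6.

```python
from math import inf


def event_time(activity):
    """Equation (12): first t >= 2 with A_t == 0 and A_{t-1} > 0, else inf.

    `activity[t - 1]` holds A_{i,t} for t = 1..len(activity).
    """
    for t in range(2, len(activity) + 1):
        if activity[t - 1] == 0 and activity[t - 2] > 0:
            return t
    return inf


def risk_set(panel, tau):
    """Equation (13): students active at tau whose event time exceeds tau."""
    return {
        i
        for i, activity in panel.items()
        if activity[tau - 1] > 0 and event_time(activity) > tau
    }


def cutoff_panel(panel, tau):
    """Equation (14): all (i, t) pairs with i in R_tau and t <= tau."""
    at_risk = risk_set(panel, tau)
    return [(i, t) for i in sorted(at_risk) for t in range(1, tau + 1)]


# Hypothetical toy panel: student -> feature_7 values for semesters 1..6.
toy_panel = {
    "s1": [5, 4, 6, 5, 4, 3],  # active throughout, so E = inf
    "s2": [5, 4, 0, 0, 0, 0],  # event at semester 3
    "s3": [6, 0, 0, 0, 0, 0],  # event at semester 2
}

print(risk_set(toy_panel, 2))
print(cutoff_panel(toy_panel, 2))
```

In the toy panel, student s3 leaves the risk set from cutoff 2 onward and student s2 from cutoff 3 onward, even though their zero-activity rows remain in the raw table.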

The feature matrix is static-only. The model receives a representation of the static profile, but not the student identifier, the semester index, or the risk-set activity variable:

X_{i, t}^{(τ)} = ϕ (S_{i}), \quad \{\text{entity\_id}, \text{time\_index}, \text{feature\_7}\} \cap \mathrm{cols}(X^{(τ)}) = \emptyset .

(15)

The representation map φ is a central part of the TCDP specification. In the coarse residence condition, φ maps the residence feature to a low-cardinality B/V code. In the full city-name condition, the same feature is mapped to high-cardinality city labels and then one-hot-encoded. This difference does not alter the validation rule, but it changes the strength of the quasi-identifying signal available under an invalid split.
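The effect of the representation map on the quasi-identifying signal can be illustrated with a simple count of how many students have a static profile that nobody else shares. The sketch below is only an illustration: the profiles, the city labels, the assignment of cities to the B/V codes, and the helper names are all hypothetical. One-hot encoding is omitted because it does not change which profiles are distinct.

```python
from collections import Counter


def coarse_phi(profile, coarse_map):
    """Replace the city label (first element) with a low-cardinality code."""
    city, *rest = profile
    return (coarse_map[city], *rest)


def unique_share(profiles):
    """Share of students whose static profile is shared with nobody else."""
    counts = Counter(profiles)
    return sum(1 for p in profiles if counts[p] == 1) / len(profiles)


# Hypothetical static profiles: (residence city, one further attribute).
profiles = [
    ("CityA", "x"), ("CityA", "x"), ("CityB", "x"),
    ("CityC", "y"), ("CityD", "y"), ("CityE", "x"),
]

# Hypothetical two-category recoding of the residence feature.
coarse_map = {"CityA": "B", "CityB": "V", "CityC": "V", "CityD": "V", "CityE": "V"}

full_share = unique_share(profiles)
coarse_share = unique_share([coarse_phi(p, coarse_map) for p in profiles])
print(full_share, coarse_share)
```

In this toy example, four of the six students are uniquely identified by the full city-name profile, whereas none is unique after the coarse recoding.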

TCDP Condition (iv.a), Entity-Level Separation, is expressed as a constraint on the entity sets present in the training and test partitions. The intentionally leaky row-wise split is designed to violate Condition (iv.a), because repeated semester-level records from the same student may be assigned to different partitions. Let ID(D) denote the set of students appearing in dataset D. A valid longitudinal validation split must satisfy the entity-level separation condition defined in Equation (5), applied to the cutoff-specific panel as follows:

\mathrm{ID}(D_{τ}^{\mathrm{train}}) \cap \mathrm{ID}(D_{τ}^{\mathrm{test}}) = \emptyset .

(16)

The intentionally leaky row-wise split violates this condition whenever repeated records from the same student are assigned to different partitions:

\mathrm{ID}(D_{τ}^{\mathrm{train}}) \cap \mathrm{ID}(D_{τ}^{\mathrm{test}}) \neq \emptyset .

(17)
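The contrast between Equations (16) and (17) can be checked directly by computing the intersection of the entity sets after each kind of split. The following standard-library sketch is an illustration only: the function names, the 25% test share, the seed, and the toy panel are assumptions, and libraries such as scikit-learn offer group-aware splitters that serve the same purpose.

```python
import random


def rowwise_split(rows, test_fraction, seed):
    """Invalid for panels: shuffles rows and ignores which student owns them."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def entity_split(rows, test_fraction, seed):
    """Satisfies Equation (16): all rows of a student land in one partition."""
    rng = random.Random(seed)
    ids = sorted({entity_id for entity_id, _ in rows})
    rng.shuffle(ids)
    test_ids = set(ids[: int(len(ids) * test_fraction)])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test


def entity_overlap(train, test):
    """ID(train) intersected with ID(test); empty under Equation (16)."""
    return {i for i, _ in train} & {i for i, _ in test}


# Hypothetical cutoff-6 panel: 50 at-risk students with 6 rows each.
rows = [(i, t) for i in range(50) for t in range(1, 7)]

leaky_train, leaky_test = rowwise_split(rows, test_fraction=0.25, seed=0)
valid_train, valid_test = entity_split(rows, test_fraction=0.25, seed=0)

print(len(entity_overlap(leaky_train, leaky_test)))
print(len(entity_overlap(valid_train, valid_test)))
```

The entity-level split yields an empty overlap by construction. In this toy setting the row-wise test partition contains 75 rows, which cannot be composed of complete six-row students, so at least one student necessarily appears on both sides.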

For panel evaluation, row-level probabilities for a test entity can be summarized at the entity level. In the identity-leaky branch, this aggregation is necessary because row-wise splitting can leave an unequal number of test rows per overlapping student:

\hat{p}_{i, τ} = \frac{1}{| T_{i, τ}^{\mathrm{test}} |} \sum_{t \in T_{i, τ}^{\mathrm{test}}} \hat{p}_{i, t}

(18)
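The aggregation in Equation (18) amounts to a per-student mean over whichever test rows that student happens to have. A minimal sketch, assuming the test predictions are available as (entity_id, time_index, probability) triples; the function name and the toy values are hypothetical:

```python
from collections import defaultdict


def entity_level_probabilities(test_rows):
    """Average the row-level probabilities of each student, as in Equation (18)."""
    per_entity = defaultdict(list)
    for entity_id, _time_index, p_hat in test_rows:
        per_entity[entity_id].append(p_hat)
    return {i: sum(ps) / len(ps) for i, ps in per_entity.items()}


# Hypothetical test rows with an unequal number of rows per student,
# as can happen after a row-wise split.
toy_test_rows = [
    ("s1", 2, 0.2), ("s1", 5, 0.4),
    ("s2", 1, 0.9),
    ("s3", 3, 0.5), ("s3", 4, 0.7), ("s3", 6, 0.6),
]

entity_probs = entity_level_probabilities(toy_test_rows)
print(entity_probs)
```

Each student then contributes exactly one score to the entity-level AUC, regardless of how many of that student's rows fell into the test partition.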

Table 5 summarizes how the empirical experiment operationalizes the TCDP components, including the prediction cutoffs, information set, risk-set construction, validation configurations, and evaluation metric. In the valid group-split branch, row-level and entity-level AUCs are equivalent in this static-only setup because all rows of a held-out student remain in the same partition, the same static feature vector is repeated within that student, and each cutoff contributes the same number of rows per at-risk entity. The reported AUC values therefore follow the same TCDP evaluation logic across the two validation configurations.

4.3. Validation Configurations and Reproducible Pipeline Parameters

In TCDP terms, the identity_leaky_rowwise_split is the deliberately invalid configuration because it violates Condition (iv.a), Entity-Level Separation, by allowing semester-level records from the same student to appear in both training and test partitions. Conversely, the good_valid_entity_level_group_split enforces this condition by assigning all records of each student to a single partition. In this diagnostic experiment, the term “valid” therefore refers specifically to validity with respect to entity-level separation. The experiment is not intended to constitute a full out-of-time cohort validation design; rather, it isolates the effect of restoring entity-level separation while holding the other experimental components fixed.

Table 6 reports the code-level experimental settings used to ensure that the two validation configurations differ only in their splitting logic. Both configurations use the same cutoff-specific panel, the same static predictors, the same preprocessing pipeline, the same random seed, and the same train–test proportion. Consequently, any systematic AUC difference between the two configurations is attributable to the validation design and its interaction with feature representation, not to a change in available dynamic information.

Table 7 lists the algorithms included in the diagnostic comparison, contrasting a linear baseline with nonlinear ensemble and instance-based models, and reports the fixed parameter settings used for each method.

In this diagnostic comparison, Gradient Boosting and Random Forest represent nonlinear tree-based ensemble models, while k-nearest neighbours represents an instance-based classifier that is particularly sensitive to repeated or near-duplicate entity profiles. These models are contrasted with Logistic Regression, which serves as a global linear baseline.

4.4. Empirical Results

4.4.1. Cutoff-Specific Panel Size and Train–Test Entity Overlap

Table 8 reports the cutoff-specific risk-set size, panel size, and train–test entity overlap used to quantify identity leakage under the row-wise split. The central leakage diagnostic is the number of students appearing in both training and test partitions. Under the valid entity-level group split, this count is zero by construction. Under the row-wise split, the overlap grows as the cutoff advances because the panel contains more repeated rows per student. At cutoff 1, each at-risk student contributes only one row, so row-wise splitting cannot place another semester of the same student in the opposite partition. From cutoff 2 onward, this becomes possible and progressively more likely. Figure 2 visualizes this cutoff-specific increase in train–test student overlap.
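Why the overlap grows with the cutoff can be made concrete with a back-of-the-envelope calculation. If each of a student's k rows were assigned to the test partition independently with probability q, the student would appear on both sides with probability 1 − q^k − (1 − q)^k. The sketch below evaluates this expression; the independence assumption is a simplification of a fixed-size random split, and the 20% test share is a hypothetical value rather than the setting reported for the experiment.

```python
def split_probability(k, test_fraction):
    """P(a student has rows in both partitions) under independent row assignment."""
    q = test_fraction
    return 1.0 - q ** k - (1.0 - q) ** k


# Hypothetical 20% test share; k equals the cutoff, since each at-risk
# student contributes one row per semester up to the cutoff.
for k in range(1, 7):
    print(k, round(split_probability(k, 0.2), 3))
```

Under these assumptions the probability is zero at cutoff 1, 0.32 at cutoff 2, and roughly 0.74 at cutoff 6, which mirrors the qualitative pattern described above.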

This pattern is the empirical manifestation of the violation of TCDP Condition (iv.a), Entity-Level Separation, formalized in Equation (5) and operationalized in the empirical design by the contrast between the valid split in Equation (16) and the leaky split in Equation (17). The leakage is not caused by directly including entity_id as a predictor. Rather, it arises because the validation partition allows repeated rows from the same entity to cross the train–test boundary, while static background variables create recurring quasi-identifying profiles.
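The mechanism can be reproduced in a deliberately extreme toy setting: every student has a unique static profile, the label is unrelated to the profile, and the model simply memorizes the profiles seen in training. The sketch below is a stylized illustration, not the experiment reported here; the lookup model, the data, and all names are hypothetical.

```python
def auc(labels, scores):
    """Mann-Whitney AUC; tied positive/negative pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_lookup(train_rows):
    """Memorize the mean label of every static profile seen in training."""
    seen = {}
    for profile, label in train_rows:
        seen.setdefault(profile, []).append(label)
    table = {p: sum(v) / len(v) for p, v in seen.items()}
    base_rate = sum(label for _, label in train_rows) / len(train_rows)
    # Unseen profiles fall back to the training base rate.
    return lambda profile: table.get(profile, base_rate)


def evaluate(train_rows, test_rows):
    predict = fit_lookup(train_rows)
    return auc([y for _, y in test_rows], [predict(p) for p, _ in test_rows])


# Hypothetical students: unique static profile, label unrelated to the profile.
students = [(("city_%d" % i, i % 3), i % 2) for i in range(40)]

# Leaky split: semester 1 of every student in train, semester 2 in test.
# Static features repeat across semesters, so both row sets look identical.
leaky_auc = evaluate(students, students)

# Entity-level split: the first 30 students in train, the other 10 in test.
valid_auc = evaluate(students[:30], students[30:])

print(leaky_auc, valid_auc)
```

The memorizing model reaches an AUC of 1.0 when every test student has a sibling row in training and falls to 0.5 when no test student has been seen, although no student identifier is ever used as a predictor.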

4.4.2. Coarse Two-Category Residence Encoding

Table 9 reports the AUC values obtained under the valid entity-level split and the identity-leaky row-wise split when the residence feature is encoded in a coarse two-category form. The first diagnostic run used this low-cardinality residence representation to examine whether identity leakage is already detectable under a relatively limited feature encoding. This setting produces strong leakage-induced inflation for nonlinear and instance-based algorithms, while Logistic Regression remains comparatively stable and increases only modestly under the leaky split. The resulting difference between the valid and identity-leaky validation configurations is visualized in Figure 3.

At cutoff 1, the valid and leaky configurations are nearly indistinguishable because no repeated rows from the same student are available. From cutoff 2 onward, the invalid row-wise split creates substantial train–test entity overlap. RF, kNN, and GB then approach near-perfect or highly inflated AUC values under the leaky split. Under the valid entity-level split, the same algorithms remain in substantially lower and more plausible ranges.

The LR result in this coarse setting should not be interpreted as evidence that linear models are immune to identity leakage. Its leaky AUC changes from 0.640 at cutoff 1 to 0.689 at cutoff 6, whereas RF reaches 1.000 and kNN and GB reach approximately 0.998. In TCDP terms, the validation condition is violated, but the low-granularity feature representation provides only a weak identity signature for the linear model.

4.4.3. Full City-Name Residence Encoding

Table 10 reports the AUC values obtained under the valid entity-level split and the identity-leaky row-wise split when feature_4 is represented by full city names rather than by a coarse two-category residence code. This second diagnostic run changes the representation map φ for feature_4 and increases the cardinality of the categorical feature space through one-hot encoding. Compared with the coarse residence encoding reported in Table 9, the higher-granularity representation strengthens the quasi-identifying signal available to the model when entity-level separation is violated, making the leakage effect visible for Logistic Regression as well. The corresponding difference between the valid and identity-leaky validation configurations is visualized in Figure 4.

The full city-name run changes the interpretation of the LR baseline. Under the valid entity-level split, LR declines from AUC = 0.615 at cutoff 1 to AUC = 0.556 at cutoff 6. Under the identity-leaky row-wise split, the same LR model increases from AUC = 0.613 to AUC = 0.812. The earlier apparent stability of LR was therefore not an inherent property of the algorithm; it reflected the limited granularity of the available static predictors.

The nonlinear and instance-based algorithms remain highly sensitive to leakage in the full city-name setting. RF converges to AUC = 1.000 by cutoff 6, and kNN reaches AUC = 0.994. GB also increases under the leaky split, though less dramatically than kNN and RF. The central result is that even a linear model can exploit identity leakage when the one-hot-encoded static feature space contains sufficiently granular quasi-identifiers.

Table 11 summarizes the leakage-induced AUC inflation at the final cutoff by comparing the coarse two-category residence encoding with the full city-name residence encoding. Figure 5 visualizes how the inflation differs across algorithms and residence granularities.

At the final cutoff, leakage-induced AUC inflation ranged from 0.111 to 0.540 in the coarse residence condition and from 0.196 to 0.450 in the full city-name condition, demonstrating that the magnitude of performance inflation is both algorithm- and representation-dependent.

4.5. TCDP Interpretation: Validation Design, Feature Granularity, and Model Class

The combined results show that identity leakage is a structural validation problem rather than a peculiarity of a single algorithm. In TCDP terms, the necessary failure is the violation of Condition (iv.a), Entity-Level Separation. This condition is defined in Equation (5) and empirically represented by the non-empty intersection between the train and test entity sets in Equation (17). Once this violation occurs, the empirical severity of the leakage depends on the interaction between the model class and the representation map φ applied to the static predictors.

The coarse residence experiment shows that high-capacity and instance-based models can exploit repeated static profiles even when the linear baseline appears relatively stable. The full city-name experiment shows that this stability disappears when the categorical representation becomes more granular. Logistic Regression is therefore not protected against identity leakage; the leaked structure becomes visible in its AUC once a sufficiently informative one-hot-encoded quasi-identifier is available.

Because all experiments use static predictors only, increasing AUC across cutoffs under the leaky split cannot be explained by genuine temporal learning or by accumulating dynamic behavioural information. The predictor content is intentionally held constant. What changes across cutoffs is the number of repeated rows per student and, consequently, the opportunity for row-wise splitting to place familiar entity profiles into the test set.

4.6. Reproducibility and Validation Diagnostics

The empirical design reports not only predictive performance but also the validation structure that produced it. The public pipeline records the cutoff, split type, algorithm, number of entities, number of panel rows, rows per entity, train–test entity overlap, feature metadata, date-feature mode, city-feature encoding, and the output pivot tables. These metadata are necessary because aggregate AUC values alone do not reveal whether a longitudinal validation design is structurally valid.
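A diagnostics record of this kind can be represented as a small immutable data structure that travels with every reported AUC. The sketch below is one possible layout, not the schema of the public pipeline: the class name, the field names, and the numeric values are hypothetical, while the two split-type labels are those used in Section 4.3.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SplitDiagnostics:
    """Validation metadata stored next to one reported AUC value."""

    cutoff: int
    split_type: str
    algorithm: str
    n_entities: int
    n_panel_rows: int
    train_test_entity_overlap: int
    auc: float

    @property
    def rows_per_entity(self) -> float:
        return self.n_panel_rows / self.n_entities

    def is_entity_separated(self) -> bool:
        """True when no student appears in both partitions."""
        return self.train_test_entity_overlap == 0


# Hypothetical values for illustration only.
leaky = SplitDiagnostics(3, "identity_leaky_rowwise_split", "RF", 100, 300, 40, 0.95)
valid = SplitDiagnostics(3, "good_valid_entity_level_group_split", "RF", 100, 300, 0, 0.60)

for record in (leaky, valid):
    print(asdict(record), record.rows_per_entity, record.is_entity_separated())
```

Keeping the overlap count in the same record as the AUC makes it possible to reject a result automatically whenever entity-level separation was not enforced.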

The experiment therefore provides a diagnostic template for evaluating longitudinal educational prediction models. A valid result requires not only time-consistent input data, but also a risk-set definition, an explicit feature representation, and strict entity-level separation. Without these components, the evaluation may measure familiarity with repeated entities rather than predictive performance on unseen students.

The substantive conclusion is not that one algorithm is inherently safe and another inherently unsafe. The conclusion is that leakage control must be specified at the TCDP level before algorithmic performance can be interpreted. Once entity-level separation is enforced, the inflated performance pattern observed under the row-wise split disappears and the models return to substantially lower and more realistic performance ranges.

4.7. Transferability Beyond Higher Education

Although the empirical demonstration is framed around student dropout prediction, the same validation logic applies to any longitudinal prediction task with repeated observations of the same entity. In medical prediction, the entity may be a patient, the cutoff may be a clinical visit, and the risk set may include patients who have not yet experienced the target event. In credit risk, insurance, fraud detection, public-sector forecasting, and panel econometrics, the entity may be a client, account, household, firm, region, or transaction stream.

Across these domains, the methodological requirement is the same: the information set must correspond to the decision time, the risk set must include only eligible entities, and validation must prevent entity-level overlap unless the intended deployment setting explicitly involves repeated predictions for already-seen entities. TCDP turns these requirements into an auditable decision-time validation specification.

4.8. Data Availability and Reproducible Experiment Package

Due to data protection and institutional confidentiality constraints, the original student-level longitudinal dataset cannot be made publicly available. The dataset contains educational records that, even after direct identifier removal, may remain sensitive under GDPR because of the combination of repeated observations, temporal structure, and demographic or administrative attributes.

To support reproducibility without disclosing protected student data, we provide the complete analysis code together with fully synthetic reproduction data generated from scratch. These public files are not anonymized, masked, or transformed exports of the original institutional dataset. They contain no real student records. The repository also includes the Python script used to generate the two synthetic Excel input files corresponding to the low-resolution and high-resolution feature-representation conditions.

The synthetic files preserve the variable schema, longitudinal entity-level organization, cutoff logic, risk-set construction, and feature-resolution conditions required to rerun the validation experiment. When the public pipeline is executed on these synthetic files, the same methodological phenomenon can be observed: the deliberately invalid row-wise split produces different performance estimates from the valid entity-level group split, and the magnitude of this difference depends on the granularity of the feature representation.

Therefore, the public repository allows readers to inspect and rerun the data-generation script, preprocessing pipeline, two validation configurations, model settings, and leakage diagnostics.

The reproducible experiment package is available at https://github.com/fauszt/identity-leakage-panel-experiment.git (accessed on 8 June 2026).

5. Conclusions

This study proposed the Time-Consistent Dropout Prediction (TCDP) framework as a formal structure for specifying and evaluating predictive models in longitudinal educational settings. The central contribution is the argument that dropout prediction should not be treated as a generic classification task detached from time, eligibility, and validation design. A prospective prediction task must define the prediction cutoff, the temporally admissible information set, the risk set, and the validation hierarchy before model performance can be meaningfully interpreted.

The research questions are addressed as follows. RQ1 is addressed by the literature-based analysis, which shows that recent dropout-prediction studies increasingly specify prediction cutoffs and time-bounded information sets, whereas risk-set definition and validation hierarchy remain less consistently formalized. RQ2 is answered by the empirical validation experiment, which shows that row-wise validation in longitudinal student panels generates increasing train–test entity overlap across later prediction cutoffs and leads to inflated AUC values relative to an entity-level group split. RQ3 is addressed by the comparison of model classes and residence-feature representations, showing that leakage visibility depends on both algorithmic capacity and feature granularity: nonlinear and instance-based models exploit the leaked structure strongly, but Logistic Regression can also be affected when static predictors contain sufficiently granular quasi-identifying information.

The empirical results therefore show that identity leakage is a structural validation problem rather than a peculiarity of a single algorithm. Under the intentionally invalid row-wise split, repeated semester-level records from the same student were allowed to cross the train–test boundary, producing substantially inflated AUC values. When entity-level separation was restored through group-based splitting, this inflated performance pattern disappeared and the models returned to lower and more realistic ranges. The full city-name residence experiment further showed that the apparent stability of Logistic Regression in the coarse setting was not an inherent protection against leakage, but a consequence of limited feature granularity.

These findings have practical implications for educational prediction and learning analytics. Removing explicit identifiers such as StudentID is not sufficient to prevent leakage in longitudinal panel data. Static demographic, administrative, or contextual variables may still act as implicit identity signatures when repeated observations are split across partitions. Reported performance metrics should therefore be accompanied by diagnostics documenting the cutoff, risk-set rule, feature representation, train–test entity overlap, and evaluation unit. Without these diagnostics, high AUC values may reflect familiarity with repeated entities rather than generalization to unseen students.

The study has several limitations. The empirical experiment was designed as a diagnostic demonstration rather than as a performance-optimization study. It intentionally used static predictors to isolate validation-induced identity leakage, and the reported AUC values should not be interpreted as institutionally generalizable estimates of dropout predictability. Future research should extend the TCDP framework to richer dynamic predictors, multi-institutional datasets, externally validated cohort designs, calibration analysis, fairness assessment, and intervention-oriented evaluation settings.

Overall, the results demonstrate that predictive performance in longitudinal dropout modelling cannot be interpreted independently of the temporal and structural definition of the prediction task. TCDP provides a practical and auditable framework for aligning model evaluation with the conditions under which predictions would be used in real decision contexts. More broadly, the same logic applies beyond higher education to any longitudinal prediction setting in which repeated observations, evolving eligibility, and entity-level validation structure determine whether reported performance reflects genuine predictive capability or leakage-induced artefacts.

Funding

This research received no external funding. The article processing charge (APC) was partially funded by the University of Dunaújváros.

Institutional Review Board Statement

Ethical review and approval were not required for this study because the analysis was based on retrospective, anonymized institutional records and did not involve any intervention, direct interaction with students, or collection of new personal data.

Informed Consent Statement

Individual informed consent was not required because the study was based on retrospective, anonymized institutional records and did not involve any intervention, direct interaction with students, or collection of new personal data.

Data Availability Statement

The original institutional, student-level longitudinal dataset analyzed in this article is not publicly available and cannot be shared by the author due to data protection, privacy, and institutional confidentiality restrictions. To support reproducibility, the analysis code and fully synthetic data files are available in a public GitHub repository: https://github.com/fauszt/identity-leakage-panel-experiment.git (accessed on 8 June 2026).

Conflicts of Interest

The author declares no conflict of interest.

References

Bognár, L.; Fauszt, T. Factors and Conditions That Affect the Goodness of Machine Learning Models for Predicting the Success of Learning. Comput. Educ. Artif. Intell. 2022, 3, 100100.
Cho, C.H.; Yu, Y.W.; Kim, H.G. A Study on Dropout Prediction for University Students Using Machine Learning. Appl. Sci. 2023, 13, 12004.
Okoye, K.; Nganji, J.T.; Escamilla, J.; Hosseini, S. Machine Learning Model (RG-DMML) and Ensemble Algorithm for Prediction of Students’ Retention and Graduation in Education. Comput. Educ. Artif. Intell. 2024, 6, 100205.
Rabelo, A.M.; Zárate, L.E. A Model for Predicting Dropout of Higher Education Students. Data Sci. Manag. 2025, 8, 72–85.
Bouihi, B.; Bousselham, A.; Aoula, E.; Ennibras, F.; Deraoui, A. Prediction of Higher Education Student Dropout Based on Regularized Regression Models. Eng. Technol. Appl. Sci. Res. 2024, 14, 17811–17815.
Hassan, M.A.; Muse, A.H.; Nadarajah, S. Predicting Student Dropout Rates Using Supervised Machine Learning: Insights from the 2022 National Education Accessibility Survey in Somaliland. Appl. Sci. 2024, 14, 7593.
Vaarma, M.; Li, H. Predicting Student Dropouts with Machine Learning: An Empirical Study in Finnish Higher Education. Technol. Soc. 2024, 76, 102474.
Villar, A.; De Andrade, C.R.V. Supervised Machine Learning Algorithms for Predicting Student Dropout and Academic Success: A Comparative Study. Discov. Artif. Intell. 2024, 4, 2.
Ouyang, F.; Zheng, L.; Jiao, P. Artificial Intelligence in Online Higher Education: A Systematic Review of Empirical Research from 2011 to 2020. Educ. Inf. Technol. 2022, 27, 7893–7925.
Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Trans. Knowl. Discov. Data 2012, 6, 1–21.
Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns 2023, 4, 100804.
Ebrahimi, A.; Luo, S.; Alzheimer’s Disease Neuroimaging Initiative. Convolutional Neural Networks for Alzheimer’s Disease Detection on MRI Images. J. Med. Imaging 2021, 8, 024503.
Apicella, A.; Isgrò, F.; Prevete, R. Don’t Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning. Artif. Intell. Rev. 2025, 58, 339.
Chiavegatto Filho, A.; Batista, A.F.D.M.; Dos Santos, H.G. Data Leakage in Health Outcomes Prediction with Machine Learning. Comment on “Prediction of Incident Hypertension Within the Next Year: Prospective Study Using Statewide Electronic Health Records and Machine Learning”. J. Med. Internet Res. 2021, 23, e10969.
Hand, D.J. Classifier Technology and the Illusion of Progress. Statist. Sci. 2006, 21, 1–14.
Abdullah, A.; Ali, R.H.; Koutaly, R.; Khan, T.A.; Ahmad, I. Enhancing Student Retention: Predictive Machine Learning Models for Identifying and Preventing University Dropout. In Proceedings of the 2025 International Conference on Innovation in Artificial Intelligence and Internet of Things (AIIT), Jeddah, Saudi Arabia, 7 May 2025; pp. 1–6.
Barros, B.M.; do Nascimento, H.A.D.; Guedes, R.; Monsueto, S.E. Evaluating Splitting Approaches in the Context of Student Dropout Prediction. arXiv 2023, arXiv:2305.08600.
Karbalaie, A.; Abtahi, F.; Häger, C.K. Participant-Aware Model Validation for Repeated-Measures Data: Comparative Cross-Validation Study. JMIR AI 2026, 5, e87728.
Rumala, D.J. How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis. In Clinical Image-Based Procedures, Fairness of AI in Medical Imaging, and Ethical and Philosophical Issues in Medical Imaging; Wesarg, S., Puyol Antón, E., Baxter, J.S.H., Erdt, M., Drechsler, K., Oyarzun Laura, C., Freiman, M., Chen, Y., Rekik, I., et al., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 14242, pp. 235–245. ISBN 978-3-031-45248-2.
Dake, D.K.; Buabeng-Andoh, C. Using Machine Learning Techniques to Predict Learner Drop-out Rate in Higher Educational Institutions. Mob. Inf. Syst. 2022, 2022, 1–9.
Gašević, D.; Dawson, S.; Siemens, G. Let’s Not Forget: Learning Analytics Are about Learning. TechTrends 2015, 59, 64–71.
Brooks, C.; Thompson, C. Predictive Modelling in Teaching and Learning. In Handbook of Learning Analytics; Lang, C., Siemens, G., Wise, A., Gašević, D., Eds.; Society for Learning Analytics Research: Vancouver, BC, Canada, 2017; pp. 61–68. ISBN 978-0-9952408-0-3.
Yuan, W.; Beaulieu-Jones, B.K.; Yu, K.-H.; Lipnick, S.L.; Palmer, N.; Loscalzo, J.; Cai, T.; Kohane, I.S. Temporal Bias in Case-Control Design: Preventing Reliable Predictions of the Future. Nat. Commun. 2021, 12, 1107.
Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure. Ecography 2017, 40, 913–929.
Arlot, S.; Celisse, A. A Survey of Cross-Validation Procedures for Model Selection. Statist. Surv. 2010, 4, 40–79.
Bergmeir, C.; Hyndman, R.J.; Koo, B. A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction. Comput. Stat. Data Anal. 2018, 120, 70–83.
Bollen, K.A.; Brand, J.E. A General Panel Model with Random and Fixed Effects: A Structural Equations Approach. Soc. Forces 2010, 89, 1–34. [Google Scholar] [CrossRef]
Cerqua, A.; Letta, M.; Pinto, G. On the (Mis)Use of Machine Learning with Panel Data. Oxf. Bull. Econ. Stat. 2026, 88, 506–518. [Google Scholar] [CrossRef]
Shmueli, G. To Explain or to Predict? Statist. Sci. 2010, 25, 289–310. [Google Scholar] [CrossRef]
Knight, S.; Buckingham Shum, S.; Littleton, K. Epistemology, Pedagogy, Assessment and Learning Analytics. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, Leuven, Belgium, 8 April 2013; pp. 75–84. [Google Scholar]
Singell, L.D.; Waddell, G.R. Modeling Retention at a Large Public University: Can at-Risk Students Be Identified Early Enough to Treat? Res. High. Educ. 2010, 51, 546–572. [Google Scholar] [CrossRef]
Knight, S.; Friend Wise, A.; Chen, B. Time for Change: Why Learning Analytics Needs Temporal Analysis. Learn. Anal. 2017, 4, 7–17. [Google Scholar] [CrossRef]
Sherman, E.; Gurm, H.; Balis, U.; Owens, S.; Wiens, J. Leveraging Clinical Time-Series Data for Prediction: A Cautionary Tale. AMIA Annu. Symp. Proc. Arch. 2018, 2017, 1571. [Google Scholar] [CrossRef]
Zheng, Y.; Heagerty, P.J. Partly Conditional Survival Models for Longitudinal Data. Biometrics 2005, 61, 379–391. [Google Scholar] [CrossRef]
Van Houwelingen, H.C. Dynamic Prediction by Landmarking in Event History Analysis. Scand. J. Stat. 2007, 34, 70–85. [Google Scholar] [CrossRef]
Kleinbaum, D.G.; Klein, M. Survival Analysis: A Self-Learning Text; Statistics for Biology and Health; Springer: New York, NY, USA, 2012; ISBN 978-1-4419-6645-2. [Google Scholar]
Rosenblatt, M.; Tejavibulya, L.; Jiang, R.; Noble, S.; Scheinost, D. Data Leakage Inflates Prediction Performance in Connectome-Based Machine Learning Models. Nat. Commun. 2024, 15, 1829. [Google Scholar] [CrossRef] [PubMed]
Ramspek, C.L.; Jager, K.J.; Dekker, F.W.; Zoccali, C.; Van Diepen, M. External Validation of Prognostic Models: What, Why, How, When and Where? Clin. Kidney J. 2021, 14, 49–58. [Google Scholar] [CrossRef] [PubMed]
De Hond, A.A.H.; Shah, V.B.; Kant, I.M.J.; Van Calster, B.; Steyerberg, E.W.; Hernandez-Boussard, T. Perspectives on Validation of Clinical Predictive Algorithms. npj Digit. Med. 2023, 6, 86. [Google Scholar] [CrossRef] [PubMed]
Steyerberg, E.W.; Vickers, A.J.; Cook, N.R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M.J.; Kattan, M.W. Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures. Epidemiology 2010, 21, 128–138. [Google Scholar] [CrossRef]
Goren, O.; Cohen, L.; Rubinstein, A. Early Prediction of Student Dropout in Higher Education Using Machine Learning Models. In Proceedings of the International Educational Data Mining Society, Atlanta, GA, USA, 14–17 July 2024. [Google Scholar] [CrossRef]
Rebelo Marcolino, M.; Reis Porto, T.; Thompsen Primo, T.; Targino, R.; Ramos, V.; Marques Queiroga, E.; Munoz, R.; Cechinel, C. Student Dropout Prediction through Machine Learning Optimization: Insights from Moodle Log Data. Sci. Rep. 2025, 15, 9840. [Google Scholar] [CrossRef]
Kwon, J.B. A Portable, Generalizable Machine Learning Framework for Long-Term Student Dropout Prediction. IEEE Access 2026, 14, 30830–30843. [Google Scholar] [CrossRef]
Cheng, Y.-H.; Lin, C.-E. Predicting University Student Dropout Risk Using Deep Learning and Ensemble Voting Mechanism. In Proceedings of the 8th International Conference on Knowledge Innovation and Invention, Fukuoka, Japan, 22–24 August 2025; p. 66. [Google Scholar]
Aouarib, H.E.; Henouda, S.E.; Laallam, F.Z. Predicting Student Dropout in MOOCs Using Genetic Algorithms and XGBoost. J. Inf. Syst. Eng. Manag. 2025, 10, 1081–1092. [Google Scholar]

Figure 1. Schematic representation of the core components of the time-consistent prediction framework.

Figure 2. Growth of train–test entity overlap under the identity-leaky row-wise split.

Figure 3. Comparison of AUC values by prediction cutoff and algorithm under two validation configurations using the coarse residence encoding: (a) valid entity-level group split, which prevents identity leakage by keeping all records of the same entity in a single partition; (b) identity-leaky row-wise split, in which repeated records of the same entity may occur in both training and test partitions.

Figure 4. AUC values by prediction cutoff and algorithm under the full city-name residence encoding.

Figure 5. AUC inflation by algorithm at cutoff 6 under two residence-feature granularity settings: (a) coarse residence encoding; (b) full city-name residence encoding. Inflation is calculated as $\Delta \mathrm{AUC} = \mathrm{AUC}_{\mathrm{leaky}} - \mathrm{AUC}_{\mathrm{valid}}$. The figure illustrates that performance inflation is not constant across algorithms and may become more visible for logistic regression when a higher-resolution encoding is used.

Table 1. Valid and invalid conditions in time-consistent prediction.

Component	Designation	Validity Condition	Typical Violation
(i) Cutoff	$t_{c}$	The prediction time point is explicitly specified.	The prediction has no explicit temporal anchor.
(ii) Information set	$I_{t_{c}}$	Only variables observed at or before $t_{c}$ are used: $\mathrm{obs}(X) \leq t_{c}$.	Post-cutoff variables or full-period aggregates enter the model: $\mathrm{obs}(X) > t_{c}$.
(iii) Risk set	$R_{t_{c}}$	Only individuals active at $t_{c}$, for whom the outcome has not yet occurred, are included.	Individuals with already determined outcome status are included.
(iv.a) Entity-level separation	$ID_{\mathrm{train}} \cap ID_{\mathrm{test}} = \emptyset$	The same individual cannot appear in both training and test partitions.	Repeated observations of the same individual overlap across partitions.
(iv.b) Cohort-level temporal consistency	$T_{\mathrm{train}} < T_{\mathrm{test}}$	Training data precede the evaluation cohort or period.	Later cohorts or calendar periods inform earlier prediction settings.
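
The conditions in Table 1 can be expressed as simple programmatic checks. The following is a minimal sketch and not the article's pipeline: the function names, the dictionary of observation times, and the example values are hypothetical and serve only to illustrate conditions (ii), (iv.a), and (iv.b).

```python
def information_set_ok(feature_obs_times, cutoff):
    """Condition (ii): every predictor is observed at or before the cutoff."""
    return all(t <= cutoff for t in feature_obs_times.values())


def entity_separation_ok(train_ids, test_ids):
    """Condition (iv.a): no entity appears in both partitions."""
    return set(train_ids).isdisjoint(test_ids)


def cohort_order_ok(train_periods, test_periods):
    """Condition (iv.b): all training periods precede all test periods."""
    return max(train_periods) < min(test_periods)


# Hypothetical predictors with the semester in which each becomes observable.
obs_times = {"admission_score": 0, "credits_sem1": 1, "final_gpa": 6}

print(information_set_ok(obs_times, cutoff=2))   # final_gpa is post-cutoff
print(entity_separation_ok([1, 2, 3], [4, 5]))   # disjoint identifiers
print(entity_separation_ok([1, 2, 3], [3, 4]))   # entity 3 is in both partitions
print(cohort_order_ok([2019, 2020], [2021]))     # training cohorts come first
```

Checks of this kind are cheap to run before model fitting, so a violated condition can be caught before any AUC is reported.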

Table 3. Structural schema of the longitudinal panel representation.

entity_id	time_index	Static Predictors	Risk-Set Indicator	Target
1	1	feature_1 … feature_6	feature_7 > 0	Y₁
1	2	feature_1 … feature_6	feature_7 > 0	Y₁
…	…	…	…	…
1	6	feature_1 … feature_6	feature_7 = 0 or > 0	Y₁
2	1	feature_1 … feature_6	feature_7 > 0	Y₂
2	2	feature_1 … feature_6	feature_7 > 0	Y₂
…	…	…	…	…
2	6	feature_1 … feature_6	feature_7 = 0 or > 0	Y₂

Note: Each row is a semester-level observation for one student. The static predictors are repeated across the student’s semester-level rows; feature_7 is used only to define risk-set membership.
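
How a cutoff-specific panel could be derived from this representation is sketched below. It assumes the rule stated in the note to Table 8 (keep entities whose feature_7 is positive at the cutoff, then retain rows with time_index ≤ τ); the function name `build_cutoff_panel` and the toy rows are hypothetical.

```python
def build_cutoff_panel(rows, cutoff):
    """Rows with time_index <= cutoff for entities that are at risk at the cutoff."""
    # Risk set: entities with a positive feature_7 in the row observed at the cutoff.
    at_risk = {
        r["entity_id"]
        for r in rows
        if r["time_index"] == cutoff and r["feature_7"] > 0
    }
    # Admissible information: only rows up to and including the cutoff.
    return [
        r for r in rows
        if r["entity_id"] in at_risk and r["time_index"] <= cutoff
    ]


# Toy panel: entity 2 has no activity in semester 3 and leaves the risk set there.
rows = [
    {"entity_id": 1, "time_index": 1, "feature_7": 5, "target": 0},
    {"entity_id": 1, "time_index": 2, "feature_7": 4, "target": 0},
    {"entity_id": 1, "time_index": 3, "feature_7": 6, "target": 0},
    {"entity_id": 2, "time_index": 1, "feature_7": 3, "target": 1},
    {"entity_id": 2, "time_index": 2, "feature_7": 2, "target": 1},
    {"entity_id": 2, "time_index": 3, "feature_7": 0, "target": 1},
]

summary = {}
for tau in (1, 2, 3):
    panel = build_cutoff_panel(rows, tau)
    n_entities = len({r["entity_id"] for r in panel})
    summary[tau] = (n_entities, len(panel))
    print(tau, n_entities, len(panel))
```

In this toy case each at-risk entity contributes τ rows, which is the same students × τ pattern visible in Table 8 (for example, 1569 × 2 = 3138 rows at cutoff 2).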

Table 4. Feature roles and model treatment.

Generic Variable	Interpretation in the Experiment	Model Treatment
entity_id	Pseudonymized student identifier	Excluded; used for grouping and overlap diagnostics
time_index	Semester index/cutoff coordinate	Excluded; used to construct cutoff-specific panels
feature_1	Static administrative or demographic predictor	Categorical; imputed and one-hot-encoded
feature_2	Birth-related Excel date feature	Converted to date-derived numeric components: year-only or year–month–day
feature_3	Static administrative or demographic predictor	Categorical; imputed and one-hot-encoded
feature_4	Residence/city feature	Categorical; either coarse B/V coding or full city-name coding
feature_5	Numeric static predictor, e.g., admission score	Numeric; median imputation
feature_6	Static administrative or funding-related code	Categorical; imputed and one-hot-encoded
feature_7	Activity/enrolment count used to define the risk set	Excluded from the predictor space
target	Binary event outcome	Prediction target

Note: Generic feature names are used to support anonymization and public reproducibility. Student identifiers are never used as predictors.
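
The date treatment of feature_2 can be illustrated as follows. This is a sketch under stated assumptions: it assumes that the value is stored as an Excel serial day number in the 1900 date system, and the mode name `year_only` is hypothetical (only `full_date` appears in Table 6). The actual storage format in the article's data may differ.

```python
from datetime import date, timedelta

# Day zero commonly used to convert Excel 1900-system serials (valid for serials > 60).
EXCEL_EPOCH = date(1899, 12, 30)


def date_components(serial, mode="full_date"):
    """Convert an Excel serial day number into numeric date components."""
    d = EXCEL_EPOCH + timedelta(days=int(serial))
    if mode == "year_only":
        return {"year": d.year}
    # full_date: year, month, and day as separate numeric features
    return {"year": d.year, "month": d.month, "day": d.day}


print(date_components(36526))                    # {'year': 2000, 'month': 1, 'day': 1}
print(date_components(36526, mode="year_only"))  # {'year': 2000}
```

The finer year–month–day representation carries more identifying detail per student than the year alone, which is relevant when repeated rows of the same student can land in both partitions.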

Table 5. TCDP specification of the empirical experiment.

TCDP Element	Formal Specification	Experimental Implementation
Entity	i ∈ {1, …, N}	Student/entity_id
Time index	t ∈ {1, …, 6}	Semester/time_index
Prediction cutoff	τ ∈ {1, …, 6}	One model is fitted for each cutoff
Admissible information set	t ≤ τ	Only rows available up to the cutoff are used
Risk set	$R_{τ} = \{i : E_{i} > τ\}$	Students not observed to have dropped out before τ
Predictor representation	$X_{i, t}^{(τ)} = ϕ (S_{i})$	Static-only features; date and city representation configurable
Excluded variables	entity_id, time_index, feature_7 ∉ predictors	Identifiers, cutoff coordinates, and risk-set variable are not predictors
Invalid validation	$ID(D_{τ}^{\mathrm{train}}) \cap ID(D_{τ}^{\mathrm{test}}) \neq \emptyset$	identity_leaky_rowwise_split / ShuffleSplit
Valid validation	$ID(D_{τ}^{\mathrm{train}}) \cap ID(D_{τ}^{\mathrm{test}}) = \emptyset$	good_valid_entity_level_group_split / GroupShuffleSplit
Evaluation unit	$\mathrm{AUC}(Y_{i}, \hat{p}_{i, τ})$	AUC with entity-level interpretation; leaky branch aggregates test rows by entity

Note: The table makes explicit which parts of the empirical design are held fixed and which part is intentionally violated in the leaky configuration.
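
The entity-level evaluation unit in the last row of Table 5 can be sketched as follows. The table states that test rows are aggregated by entity but does not name the aggregation function, so averaging the row-level scores is an assumption here; the function name and the toy data are hypothetical, and the AUC is computed directly from its pairwise-ranking definition rather than with a library call.

```python
from collections import defaultdict


def entity_level_auc(entity_ids, y_true, y_score):
    """AUC over entities after averaging row-level scores per entity."""
    scores = defaultdict(list)
    labels = {}
    for entity, label, score in zip(entity_ids, y_true, y_score):
        scores[entity].append(score)
        labels[entity] = label  # the target is constant within an entity

    # Assumption: one score per entity, taken as the mean of its row-level scores.
    mean_score = {e: sum(v) / len(v) for e, v in scores.items()}
    pos = [mean_score[e] for e, y in labels.items() if y == 1]
    neg = [mean_score[e] for e, y in labels.items() if y == 0]

    # AUC = share of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


ids = [1, 1, 2, 2, 3, 4]
y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = entity_level_auc(ids, y, p)
print(auc)  # 0.75
```

Aggregating to one score per entity keeps the evaluation unit comparable between the two split configurations, even though the leaky test partition contains several rows for some students.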

Table 6. Code-level settings used in the TCDP-formalized panel experiment.

Component	Setting Used in the Reproducible Pipeline
Programming environment	Python/scikit-learn pipeline components
Input columns	entity_id, time_index, feature_1 … feature_7, target
Cutoffs	CUTOFF_LIST = [1,2,3,4,5,6]
Train–test proportion	TEST_SIZE = 0.30; approximately 70% train and 30% test
Random seed	RANDOM_STATE = 42
Sampling	SAMPLE_ENTITIES = None; all available entities are used
Invalid split	ShuffleSplit(n_splits = 1, test_size = 0.30, random_state = 42)
Valid split	GroupShuffleSplit(n_splits = 1, test_size = 0.30, random_state = 42) with entity_id as group
Numeric preprocessing	SimpleImputer(strategy = "median")
Categorical preprocessing	SimpleImputer(strategy = "constant", fill_value = "__MISSING__") + OneHotEncoder(handle_unknown = "ignore")
Date handling	DATE_FEATURE_COLUMN = feature_2; DATE_FEATURE_MODE = full_date; DATE_DAYFIRST = False
City handling	CITY_FEATURE_MODE ∈ {coarse_residence, full_city_name}; feature_4 is encoded either as a low-cardinality residence code or as full city-name labels.
Output diagnostics	AUC, cutoff, algorithm, split type, panel size, rows per entity, train–test entity overlap, feature metadata

Note: The public pipeline stores the same metadata in the output files, making the validation design auditable rather than relying only on aggregate AUC values.
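
The difference between the two split settings in Table 6 can be reproduced on a toy panel. The sketch below uses the split classes and parameters listed in the table; the synthetic panel (100 entities with three rows each), the placeholder feature matrix, and the helper `overlap_ratio` are assumptions made for illustration and are not taken from the public pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

TEST_SIZE = 0.30
RANDOM_STATE = 42

# Synthetic panel: 100 entities, each observed in 3 rows.
entity_id = np.repeat(np.arange(100), 3)
X = np.zeros((entity_id.size, 1))  # placeholder features; only the indices matter


def overlap_ratio(train_idx, test_idx):
    """Share of test entities that also occur in the training partition."""
    train_ids = set(entity_id[train_idx])
    test_ids = set(entity_id[test_idx])
    return len(train_ids & test_ids) / len(test_ids)


# Identity-leaky configuration: rows are shuffled without regard to entity.
leaky = ShuffleSplit(n_splits=1, test_size=TEST_SIZE, random_state=RANDOM_STATE)
train_idx, test_idx = next(leaky.split(X))
leaky_ratio = overlap_ratio(train_idx, test_idx)

# Valid configuration: all rows of an entity stay in one partition.
valid = GroupShuffleSplit(n_splits=1, test_size=TEST_SIZE, random_state=RANDOM_STATE)
train_idx, test_idx = next(valid.split(X, groups=entity_id))
valid_ratio = overlap_ratio(train_idx, test_idx)

print(f"row-wise overlap ratio:     {leaky_ratio:.3f}")
print(f"entity-level overlap ratio: {valid_ratio:.3f}")
```

The group split yields an overlap ratio of zero by construction, whereas the row-wise split leaves most test entities with at least one row in the training partition once entities have repeated rows.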

Table 7. Algorithms and fixed model parameters.

Algorithm	Role in the Diagnostic Comparison	Fixed Parameter Settings
Logistic Regression (LR)	Global linear baseline	max_iter = 1000; random_state = 42
Gradient Boosting (GB)	Nonlinear tree-based ensemble	n_estimators = 100; random_state = 42
Random Forest (RF)	Bagged nonlinear tree ensemble	n_estimators = 100; random_state = 42; n_jobs = 1
k-nearest neighbours (kNN)	Local instance-based classifier	n_neighbors = 5

Note: No task-specific hyperparameter tuning was performed. The purpose was to conduct a controlled leakage diagnosis, not predictive optimization.
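
For readers who want to reuse the settings in Table 7, the four estimators could be instantiated in scikit-learn as sketched below. The dictionary name `MODELS` is hypothetical, and every parameter not listed in the table is left at the library default, which is an assumption about the original configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

RANDOM_STATE = 42

# Fixed settings from Table 7; all other parameters keep scikit-learn defaults.
MODELS = {
    "LR": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    "GB": GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE),
    "RF": RandomForestClassifier(
        n_estimators=100, random_state=RANDOM_STATE, n_jobs=1
    ),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in MODELS.items():
    print(name, type(model).__name__)
```

Keeping the estimators in one mapping makes it straightforward to loop over algorithms, cutoffs, and split types with identical settings in both validation configurations.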

Table 8. Cutoff-specific panel size and train–test entity overlap under the identity-leaky row-wise split.

Cutoff τ	At-Risk Students at τ	Cutoff-Specific Panel Rows	Test Entities	Train–Test Overlapping Students	Entity-Overlap Ratio
1	1840	1840	552	0	0
2	1569	3138	799	656	0.821
3	1365	4095	901	859	0.953
4	1256	5024	947	933	0.985
5	1139	5695	934	929	0.995
6	1087	6522	965	964	0.999

Note: The full analytical sample contained 1851 students before cutoff-specific risk-set filtering. The 1840 students reported at cutoff 1 are those who remained at risk after applying the activity/enrolment-count rule at the first prediction cutoff. At-risk students are students with a positive activity/enrolment-count indicator at the given cutoff. Cutoff-specific panel rows are obtained after applying the risk-set rule and retaining records with time_index ≤ τ. Train–test overlapping students are test-set entities that also appear in the training partition under the identity-leaky row-wise split. Entity-overlap ratio is calculated as overlapping test entities divided by all test entities. Under the valid entity-level group split, train–test entity overlap is zero by construction.
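
The derived columns of Table 8 can be recomputed from the counts it reports. The short check below copies the numbers from the table; the variable names are illustrative only.

```python
# (cutoff, at-risk students, panel rows, test entities, overlapping students),
# copied from Table 8.
TABLE_8 = [
    (1, 1840, 1840, 552, 0),
    (2, 1569, 3138, 799, 656),
    (3, 1365, 4095, 901, 859),
    (4, 1256, 5024, 947, 933),
    (5, 1139, 5695, 934, 929),
    (6, 1087, 6522, 965, 964),
]

ratios = {}
for cutoff, at_risk, panel_rows, n_test, n_overlap in TABLE_8:
    # Each at-risk student contributes one row per semester up to the cutoff.
    assert panel_rows == at_risk * cutoff
    # Entity-overlap ratio = overlapping test entities / all test entities.
    ratios[cutoff] = round(n_overlap / n_test, 3)

print(ratios)
# {1: 0.0, 2: 0.821, 3: 0.953, 4: 0.985, 5: 0.995, 6: 0.999}
```

At cutoff 1 every student has a single row, so a row-wise split cannot place the same student in both partitions; from cutoff 2 onward the number of rows per student grows and the overlap ratio approaches 1.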

Table 9. AUC under valid and identity-leaky validation with coarse two-category residence encoding.

Cutoff	Valid Entity-Level Group Split				Identity-Leaky Row-Wise Split
	GB	kNN	LR	RF	GB	kNN	LR	RF
1	0.567	0.527	0.640	0.610	0.567	0.527	0.640	0.593
2	0.566	0.510	0.640	0.592	0.773	0.686	0.633	0.980
3	0.559	0.510	0.532	0.550	0.858	0.799	0.645	0.999
4	0.572	0.484	0.559	0.579	0.924	0.917	0.629	1.000
5	0.505	0.525	0.558	0.670	0.976	0.990	0.679	1.000
6	0.674	0.458	0.578	0.569	0.998	0.998	0.689	1.000

Note: AUC values are rounded to three decimals. The two validation configurations use the same cutoff-specific panels and feature set; only the split mechanism differs.

Table 10. AUC under valid and identity-leaky validation with full city-name residence encoding.

Cutoff	Valid Entity-Level Group Split				Identity-Leaky Row-Wise Split
	GB	kNN	LR	RF	GB	kNN	LR	RF
1	0.687	0.551	0.615	0.658	0.687	0.549	0.613	0.668
2	0.688	0.559	0.616	0.644	0.746	0.690	0.698	0.971
3	0.633	0.506	0.607	0.612	0.783	0.821	0.742	0.998
4	0.620	0.514	0.594	0.576	0.797	0.924	0.772	0.999
5	0.588	0.553	0.591	0.598	0.821	0.983	0.812	1.000
6	0.601	0.545	0.556	0.572	0.798	0.994	0.812	1.000

Table 11. Leakage-induced AUC inflation at the final cutoff under the two residence granularities.

Residence Encoding	Algorithm	Valid AUC at Cutoff 6	Leaky AUC at Cutoff 6	Inflation
Coarse residence encoding	LR	0.578	0.689	0.111
	GB	0.674	0.998	0.324
	kNN	0.458	0.998	0.540
	RF	0.569	1.000	0.431
Full city-name residence encoding	LR	0.556	0.812	0.256
	GB	0.601	0.798	0.196
	kNN	0.545	0.994	0.450
	RF	0.572	1.000	0.428

Note: Inflation is defined as the difference between the AUC of the identity_leaky_rowwise_split configuration and the AUC of the good_valid_entity_level_group_split configuration at cutoff 6. LR inflation increases from 0.111 in the coarse residence encoding to 0.256 in the full city-name encoding.
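
The inflation column can be recomputed from the rounded AUC values in Table 11. In the sketch below, two entries (GB and kNN under the full city-name encoding) come out as 0.197 and 0.449 rather than the reported 0.196 and 0.450; a plausible explanation, stated here as an assumption, is that the reported inflation was computed from unrounded AUC values.

```python
# (encoding, algorithm, valid AUC, leaky AUC, reported inflation), copied from Table 11.
TABLE_11 = [
    ("coarse", "LR", 0.578, 0.689, 0.111),
    ("coarse", "GB", 0.674, 0.998, 0.324),
    ("coarse", "kNN", 0.458, 0.998, 0.540),
    ("coarse", "RF", 0.569, 1.000, 0.431),
    ("full", "LR", 0.556, 0.812, 0.256),
    ("full", "GB", 0.601, 0.798, 0.196),
    ("full", "kNN", 0.545, 0.994, 0.450),
    ("full", "RF", 0.572, 1.000, 0.428),
]

recomputed = {}
for encoding, algorithm, valid_auc, leaky_auc, reported in TABLE_11:
    # Inflation: leaky AUC minus valid AUC at cutoff 6.
    delta = round(leaky_auc - valid_auc, 3)
    recomputed[(encoding, algorithm)] = delta
    note = "" if delta == reported else "  (differs from reported value in the last digit)"
    print(f"{encoding:7s} {algorithm:4s} {delta:.3f}  reported {reported:.3f}{note}")
```

The recomputation confirms the pattern described in the note: LR inflation rises from 0.111 to 0.256 when the residence feature is encoded at full city-name resolution.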

Share and Cite

MDPI and ACS Style

Fauszt, T. Time-Consistent Prediction in Higher Education: A Framework for Preventing Data Leakage in Longitudinal Models. Information 2026, 17, 581. https://doi.org/10.3390/info17060581
