Article

Leakage-Aware Time-Based Top-K Start-Up Ranking for Venture Capital Investment Success Under Severe Class Imbalance Conditions: A Screening Evaluation Framework

1 Department of Industrial Engineering, Istanbul Technical University, Istanbul 34367, Türkiye
2 Department of Management Information Systems, Izmir Bakircay University, Izmir 35655, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 3082; https://doi.org/10.3390/app16063082
Submission received: 27 February 2026 / Revised: 13 March 2026 / Accepted: 18 March 2026 / Published: 23 March 2026
(This article belongs to the Special Issue Exploring AI: Methods and Applications for Data Mining)

Abstract

Many real-world screening tasks in venture capital must rank large start-up candidate pools under conditions of tight review capacity, time-varying information, and rare investment success outcomes. When datasets are constructed retrospectively, post-decision updates can leak into features and inflate performance, especially with random splits. This study proposes a leakage-aware, time-based evaluation framework for capacity-constrained screening formulated as a top-K ranking problem. Using a dataset of 117,141 early-stage firms as an empirical testbed, features were constructed strictly as of a reference time $t_0$, a 180-day temporal embargo was enforced around the train–test boundary, and generalization was assessed with time-ordered splits. Because venture capital decisions are made on a shortlist, evaluation emphasizes ranking quality using PR-AUC, Lift@K, Precision@K/Recall@K, and NDCG@K, reported with bootstrap confidence intervals. Under this leakage-aware protocol and with strong class imbalance, maturity-related signals achieve the strongest PR-AUC (0.0144), while team and combined signals yield the best top-50 shortlist concentration. Finally, probability calibration substantially improves reliability for threshold planning (Brier score reduced from 0.0972 to 0.0161 with sigmoid calibration) while leaving ranking essentially unchanged. Overall, the study provides a leakage-aware evaluation template and an interpretable baseline for time-dependent venture capital screening tasks involving start-up selection, investment success prediction, leakage risk, and limited review capacity.

1. Introduction

Many organizations must screen large candidate pools and decide which few items deserve deep review. Typical examples include grant proposals, clinical risk triage, fraud alerts, and early-stage firm databases [1]. The available information is often incomplete and updated over time, while the target outcomes are rare. When historical data are collected after key events, post-event updates can unintentionally leak into the feature set, and random train–test splits can produce overly optimistic estimates. Leakage has recently been summarized as a common and subtle failure mode across supervised learning workflows, especially when temporal dependence is present [2,3].
This study provides a leakage-aware evaluation framework for capacity-constrained screening, formulated as a top-K ranking problem. For each entity, inputs are constructed strictly as of a reference time $t_0$; a 180-day temporal embargo is applied around the train–test boundary; and performance is assessed with time-based splits. Because operational value is created near the top of the ranked list, results are reported with PR-AUC and decision-aligned ranking metrics (Lift@K, Precision@K/Recall@K, and NDCG@K) together with bootstrap confidence intervals [4,5]. To support threshold planning when scores are interpreted as probabilities, post hoc calibration is also evaluated [6]. The framework is demonstrated on a large dataset of early-stage firms with exit outcomes, but it is directly applicable to other time-dependent screening tasks with severe imbalance and limited review capacity. No new ranking loss, calibration algorithm, or leakage theorem is introduced in this study. Instead, a methodological contribution is provided by integrating strict as-of-$t_0$ feature construction, an explicit temporal embargo, and decision-aligned top-K evaluation within a single prospective protocol. The main insight is that, under severe imbalance and limited review capacity, the estimated value of a screening model is materially changed when both temporal validity and shortlist quality are treated as first-order design constraints.
A central challenge has been created by the temporal nature of startup data. Many attributes and profiles are updated over time, and platform records are often enriched after key events have occurred. If information that becomes available after a screening time is inadvertently included in the feature set, optimistic performance estimates can be produced. In the startup domain, this concern has been explicitly noted, where the use of variables that are consequences of success (or failure) has been shown to bias evaluation and create look-ahead effects [7]. In the broader ML literature, leakage has been discussed as a family of failure modes that can occur during data preparation, model selection, and evaluation; temporal dependence and repeated testing have been highlighted as common drivers of overly optimistic results [2,3]. For VC screening, these issues are particularly important because screening is performed under strict time constraints and because the decision-relevant question concerns what can be inferred from information that is actually available at the screening moment.
A second gap has been observed in the alignment between evaluation metrics and practical screening decisions. In many operational settings, only a small shortlist can be reviewed in depth, and the primary objective is to prioritize the most promising candidates rather than to obtain a well-separated score distribution across the full population. In such settings, rank-based evaluation is often more informative than accuracy-like metrics. The use of top-N evaluation and ranking measures has been extensively discussed in adjacent fields such as recommender systems, where robustness and discriminative power of ranking metrics have been analyzed with realistic data incompleteness [4]. Because VC screening is similarly constrained by limited review capacity, an evaluation design that emphasizes top-K performance is better aligned with the decision setting than an evaluation design that focuses only on global classification summaries.
In addition to ranking quality, the reliability of predicted probabilities can matter when shortlist sizes or decision thresholds are planned. If a score is interpreted as a probability, miscalibration can lead to unstable expectations about how many shortlisted companies are likely to meet a target outcome. Classifier calibration has therefore been positioned as a core component for risk-aware decision making and for cost-sensitive applications [6]. In practice, probability reliability can support planning decisions such as how many opportunities should be reviewed to reach a target expected number of successful cases, or how a threshold should be chosen under changing base rate conditions.
This study addresses these gaps by examining early-stage screening under an evaluation protocol designed to reduce time leakage and to reflect the top-K nature of screening. A large startup dataset (over 100k observations) was used, and startup-level models were constructed using only signals observed before a reference time $t_0$. An embargo window was applied so that training observations close to the train–test boundary, which may carry post-$t_0$ information, were excluded. Performance was evaluated using time-based splits, and ranking quality was emphasized because screening is treated as a top-K decision problem. Rank-focused results are reported using PR-AUC, lift, Precision@K/Recall@K, and NDCG@K, together with uncertainty estimates obtained via bootstrap resampling. In addition, probability calibration was evaluated to assess whether more reliable shortlist hit-rate planning can be supported. Finally, signal groups were compared so that the relative usefulness of different early-stage information types can be identified under a leakage-aware protocol, which limits look-ahead bias by restricting inputs to information available at $t_0$ and excluding near-boundary observations via an embargo.
The main contributions of the study are summarized as follows:
  • A leakage-aware evaluation protocol is provided for startup screening, where $t_0$-based feature construction and an embargo window are used to reduce the inclusion of future information.
  • A screening-aligned evaluation is presented, where top-K ranking quality is emphasized using lift and ranking metrics in addition to imbalanced-class summaries.
  • Uncertainty in performance estimates is reported via bootstrap confidence intervals so that metric variability under strong class imbalance conditions is made visible.
  • Probability calibration is evaluated as a decision-support component, with the aim of improving the reliability of threshold-based shortlisting.
  • Signal groups are compared under the same leakage-aware protocol, enabling a practical discussion of which early-stage information types provide the strongest screening value.
The remainder of the paper is organized as follows. Section 2 reviews the related literature with attention given to startup success prediction, VC decision support, and leakage-aware evaluation. Section 3 describes the data construction, label definition, and signal grouping. Section 4 formulates screening as a capacity-constrained top-K prioritization problem and defines the evaluation criteria. Section 5 presents the methodology, and Section 6 reports the empirical results for ranking quality, calibration, and signal-group comparisons. Section 7 discusses the experimental results, limitations, and directions for further work. Finally, Section 8 concludes the paper.

2. Related Work and Research Gap

2.1. Data-Driven Screening in Venture Capital and Startup Outcome Modeling

A growing body of work has been directed toward supporting venture screening with data-driven models, often using large platform datasets that describe founding characteristics, industry tags, financing events, and exit outcomes. In this stream, startup success has commonly been operationalized through acquisition or IPO events, while intermediate milestones (e.g., reaching a Series A round) have also been modeled as a proxy for growth and viability. A representative example has been provided by a bias-aware modeling study based on Crunchbase data, where look-ahead and related evaluation biases were explicitly discussed as a threat to real-world use [7]. Outcome prediction for startup exits has also been addressed through end-to-end ML pipelines that aim to support early-stage selection and ranking [8]. In a similar decision-support direction, a practical framework for data-driven early-stage investing was proposed, emphasizing structured signals and investment workflow alignment [9]. More recent work has expanded feature sets by combining Crunchbase with complementary sources (e.g., LinkedIn), and success prediction has been evaluated with boosted models and feature attribution methods [10]. Related evidence has further been reported in the technology management literature, where success prediction was examined with broader firm and environment signals [11].
In parallel, VC decision making has also been analyzed as a human screening process, where the importance assigned to criteria has been shown to vary with the education and experience of the decision maker [12]. This behavioral stream is important because it clarifies why screening is typically implemented as a staged funnel and why the first-stage objective is often shortlist quality rather than full-population classification accuracy. A separate line of work has studied startup heterogeneity and founder background, indicating that venture outcomes can differ systematically across founder types and contexts [13]. These findings motivate models that remain robust with distributional shifts and temporal changes, which are expected in evolving ecosystems.

2.2. Alternative Signals and Unstructured Information

As online venture platforms and public-facing profiles have expanded, textual descriptions and other unstructured signals have increasingly been treated as informative early indicators. In equity and entrepreneurial finance more broadly, text has been used to study how narratives and emphasis influence investment decisions, indicating that unstructured content can shape funding outcomes and attention allocation [14]. Within startup success prediction, an explicit use of self-descriptions has recently been evaluated via a fused large language model in an operational-research setting, and strong predictive contribution of text has been documented [15]. These results are aligned with the broader view that scalable screening can be improved when structured signals are supplemented with descriptions that reflect product positioning and business logic.
At the same time, text and profile fields are known to be time-varying in many platforms. This characteristic creates a methodological tension: textual signals can be predictive, but their timestamps can be ambiguous, and post-event updates can be accidentally incorporated when features are collected retrospectively. The need for leakage-aware protocols is therefore amplified when unstructured fields are used, especially in early-stage screening where the decision moment must be approximated.

2.3. Temporal Evaluation and Data Leakage Risks

Temporal dependence is a defining feature of venture and firm data. Profiles are updated, rounds occur at irregular times, and outcome observation windows differ across cohorts. Consequently, methodological work has emphasized that leakage and look-ahead effects can appear in subtle ways during preprocessing, feature engineering, and evaluation. Leakage scenarios have been categorized and discussed in the recent ML methodology literature, where risks were linked to splitting decisions and the ordering of data transformations [2]. Related risks have also been discussed for transfer learning and broader ML workflows, where imperfect experimental design was highlighted as a driver of inflated performance estimates [3]. In the venture prediction domain, the presence of look-ahead bias has been explicitly addressed in a Crunchbase-based study, and a bias-free approach was proposed to mitigate it [7]. However, leakage controls are still not systematically implemented across VC screening research, and many empirical studies continue to rely on random splits or weak time controls.
A second temporal issue is created by right-censoring and prevalence shift. When the most recent firms are used for testing (as in a prospective design), fewer exits occur, which reduces the observed positive rate and changes the effective difficulty of the task. This has practical consequences for evaluation and for operational use because a score can be stable while the base rate changes. In applied ML for finance and risk, similar challenges have been emphasized in default and distress forecasting, where prospective evaluation and backtesting have been argued to be necessary for reliable deployment [16,17]. These parallels motivate evaluation designs that treat time as a first-order constraint rather than a minor implementation detail.
Data decentralization may also become a binding design constraint in practical screening systems. Startup-related information can be distributed across investors, platforms, accelerators, and regional repositories, and such data cannot always be pooled into a single centralized training set. In that setting, privacy-preserving or federated learning strategies may be needed, especially when local class distributions and label spaces are inconsistent across nodes [18]. Although the present study is centralized, this issue is important for future leakage-aware screening deployments.

2.4. Imbalanced Outcomes, Ranking Metrics, and Decision Support Under Capacity Constraints

Early-stage venture outcomes are typically rare events. As a result, imbalance-aware evaluation has been treated as central, and reliance on a single metric has been questioned. A broad review of class-imbalance techniques has summarized trends and common pitfalls across application areas [5]. A recent survey has also provided an updated classification of methods for imbalanced learning and discussed performance biases induced by skewed class distributions [19]. In addition, performance metric choices have been actively debated; for example, ROC-AUC and PR-AUC have been contrasted in terms of their behavior under imbalance and dataset shift conditions [20]. These discussions imply that metric choice should be aligned with the operational decision context. Recent review evidence has also emphasized that, in class-imbalanced settings, the interaction between augmentation and ensemble design can materially alter evaluation outcomes, which further supports careful metric selection and baseline interpretation [21].
For VC screening, the decision context is typically a top-K shortlisting process. Only a small subset of opportunities can be reviewed deeply, which makes the concentration of true positives at the top a primary objective. In operational research and related decision analytics, cost-sensitive and shortlist-oriented designs have been repeatedly proposed for failure prediction tasks. For instance, cost-sensitive business failure prediction with uncertain misclassification costs has been studied with heterogeneous ensembles, emphasizing decision-relevant performance rather than global accuracy [22]. Text-enhanced prediction has also been used to improve failure prediction models, again indicating that ranking and early retrieval can be strengthened through richer signals [23]. In bankruptcy and distress prediction, deep learning models that exploit textual disclosures have been evaluated in an OR setting, demonstrating that strong discrimination can be achieved with unstructured information [24]. These finance and failure-prediction results are not directly about startups, but they provide an important methodological analogy: high-stakes decisions under imbalance conditions are better supported when ranking metrics and decision costs are treated as primary.
More broadly, explainable AI has been discussed as a requirement for OR applications, especially when models are used to support decisions and resource allocation [25]. This requirement is aligned with the use of interpretable baselines and feature-importance analyses in VC screening because the goal is often to justify why shortlists are produced rather than to optimize predictive accuracy alone.

2.5. Probability Calibration and Reliability for Threshold-Based Planning

Even when screening is framed as ranking, probabilities are frequently used for planning thresholds and for stabilizing expectations about shortlist yield. Calibration and probability reliability have therefore been discussed as distinct from discrimination. A comprehensive survey on classifier calibration has summarized common assessment tools and post hoc methods for improving predicted probabilities [6]. In finance, the need for calibrated and explainable models has been highlighted in credit risk management, where explainability and model governance constraints are central [26]. Systematic reviews in credit risk have also pointed to representativeness, imbalance, and evaluation realism as recurring limitations of applied work [27]. These findings suggest that calibration is not merely a statistical detail but a practical component that supports stable decision thresholds in the presence of base-rate changes and cohort drift.

2.6. Summary and Gap Addressed by This Study

Across the above streams, three gaps have remained visible. First, many venture prediction studies have focused on outcome prediction or feature importance, while the operational shortlisting nature of screening has been less consistently modeled and evaluated with top-K metrics. Second, although look-ahead bias has been acknowledged in some Crunchbase-based work, systematic leakage controls that combine as-of construction, embargo windows, and feature-level masking have not been widely integrated into screening-oriented evaluation. Third, probability calibration has been rarely connected to screening operations, despite its usefulness for threshold planning and expected hit-rate stability.
These gaps are addressed in the present study by combining: (i) strict $t_0$-based feature construction with an embargo and explicit masking rules to limit future information, (ii) time-based splits to approximate prospective screening, (iii) ranking-centered evaluation using lift and top-K metrics under severe imbalance conditions, and (iv) post hoc calibration analysis to improve probability reliability without changing the ranking objective. In addition, signal groups are compared under the same leakage-aware protocol, which enables a more controlled discussion of which early information types are most useful for practical VC screening.

3. Data and Label Definition

3.1. Data Integration and Startup-Level Representation

A startup-level dataset was constructed by integrating multiple relational tables and aggregating records at the firm level. A unique identifier was assigned to each startup so that funding events, team records, market attributes, and categorical descriptors could be combined into a single representation. Temporal fields (e.g., founding year, record creation timestamps, and funding dates) were retained so that time-aware features could be produced and evaluated with a realistic chronological design.

3.2. Cohort Construction at a Reference Time $t_0$

The temporal embargo was fixed at 180 days prior to any model evaluation and was not tuned on the test set. A six-month gap was selected as a conservative compromise because startup databases can reflect delayed funding updates, profile edits, and platform-side enrichments near the train–test boundary. By excluding observations too close to the test period, the protocol was intended to reduce boundary contamination from near-term information arrival while preserving enough older cases for model estimation. The selected value should therefore be interpreted as a design choice for leakage control rather than as an empirically optimized hyperparameter.
An early-stage cohort was defined by introducing a startup-specific reference time $t_0$ and restricting the cohort to firms that were at most three years old at $t_0$. For each startup, the feature vector was constructed “as of” $t_0$, and information occurring after $t_0$ was excluded whenever it could introduce a look-ahead advantage. A chronological split was applied so that the most recent startups (by $t_0$) were reserved for testing, while earlier cohorts were used for model development. In addition, a temporal embargo was applied so that training examples close to the test boundary were not used, and spurious gains caused by near-boundary information overlap were reduced.

3.3. Outcome Label Definition and Class Imbalance

A binary outcome label was defined from observed exit events. The positive class ( y = 1 ) was assigned when a successful exit was observed (acquisition or initial public offering), while the negative class ( y = 0 ) was assigned when no exit was observed within the available observation window (including operating and inactive firms). This construction induced a strong class imbalance by design because exit events were rare relative to the population of early-stage ventures.
An additional challenge was introduced by right-censoring. Because the test set was constructed from the most recent t 0 values, a substantial share of firms in the test cohort had not yet had enough time to realize an exit. As a result, the observed positive rate was reduced in the test cohort relative to the overall cohort. This temporal mismatch was treated as a realistic property of prospective screening rather than as a sampling artifact, and it was explicitly reflected in the evaluation design.

3.4. Signal Groups and Feature Construction

Input variables were organized into conceptually distinct signal groups so that the marginal value of different information types could be assessed under the same leakage-aware protocol. The following group structure was used:
  • FUNDING: early financing signals (e.g., funding stage/type indicators and aggregated funding amounts), constructed as of $t_0$.
  • GEO: geographic rank indicators (e.g., region and city ranks) used as coarse location signals.
  • MARKET: sector or industry indicators used to represent market positioning.
  • TEAM: team composition indicators (e.g., role and education distributions) derived from structured team records.
  • MATURITY: early maturity proxies (e.g., founding-related attributes and short descriptive text fields) available at the reference time.
  • ALL: a combined representation where signals from all groups were provided jointly.
The grouping allowed both single-category models and combined models to be evaluated with identical time controls. The category-specific models should not be interpreted as implying independence between signal groups. The grouping was used as a diagnostic ablation design so that the marginal screening value of each information type could be compared under identical leakage controls. In practice, substantial cross-group correlation may exist (e.g., between market, team, and funding variables); for that reason, the ALL specification was retained as the joint model in which correlated signals could contribute simultaneously. The category-specific results should therefore be read as controlled comparisons of information value, not as a claim that real screening signals are separable in deployment.

3.5. Leakage Controls and Masking Rules

Temporal leakage controls were implemented at feature-construction time. When a feature could be affected by information occurring after $t_0$, the post-$t_0$ contribution was removed or masked. For funding-related fields, a masking rule was applied when the last observed funding date occurred after $t_0$ so that post-$t_0$ financing information was prevented from inflating early-stage screening performance. This masking was applied systematically and produced a substantial masking rate, which was interpreted as evidence that naive feature construction would otherwise contain a large amount of post-$t_0$ information.
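The funding masking rule can be illustrated with a minimal sketch. The record layout, the field names (`funding_total`, `last_funding_date`), and the use of `None` as the mask value are assumptions made for illustration only; the paper does not specify how masked values are encoded.

```python
from datetime import date
from typing import Optional

def mask_funding(funding_total: Optional[float],
                 last_funding_date: Optional[date],
                 t0: date) -> Optional[float]:
    """Return the funding aggregate, or None (masked) when the last
    observed funding date falls after the reference time t0."""
    if last_funding_date is not None and last_funding_date > t0:
        return None  # post-t0 financing information would leak
    return funding_total

# Hypothetical toy records (not taken from the paper's dataset).
rows = [
    {"startup_id": "a", "t0": date(2012, 1, 1),
     "last_funding_date": date(2011, 6, 1), "funding_total": 500_000.0},
    {"startup_id": "b", "t0": date(2012, 1, 1),
     "last_funding_date": date(2013, 3, 1), "funding_total": 2_000_000.0},
    {"startup_id": "c", "t0": date(2012, 1, 1),
     "last_funding_date": None, "funding_total": None},
]

masked = [mask_funding(r["funding_total"], r["last_funding_date"], r["t0"])
          for r in rows]

# Share of startups whose funding fields had to be masked.
masking_rate = sum(
    1 for r in rows
    if r["last_funding_date"] is not None and r["last_funding_date"] > r["t0"]
) / len(rows)
```

In this toy example only startup "b" is masked, because its last funding event postdates its reference time; startup "c" simply has no funding information to begin with.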

4. Problem Formulation: Screening and Prioritization

4.1. Screening as a Constrained Top-K Decision Problem

A screening task was formulated in a way that reflects operational constraints faced by VC investors and accelerators. Let $S = \{1, \dots, N\}$ denote the set of startups evaluated at a given time, and let $x_i(t_0)$ denote the feature vector constructed using only information available at the startup-specific reference time $t_0$. Let $y_i \in \{0, 1\}$ denote the observed outcome label, where $y_i = 1$ indicates an observed successful exit. A scoring function $s(\cdot)$ was learned so that each startup was assigned a real-valued score
$$\hat{s}_i = s\big(x_i(t_0)\big),$$
and a ranked list was produced by sorting startups in decreasing order of $\hat{s}_i$.
A capacity constraint was incorporated by assuming that only the top-K startups in the ranked list could be reviewed in depth. A predicted shortlist was therefore defined as
$$\hat{S}_K = \operatorname{top-}K\,\{\hat{s}_i\}_{i=1}^{N},$$
where $\hat{S}_K$ contains the indices of the $K$ highest-scoring startups. The decision objective was expressed in terms of the concentration of true positives near the top of the list because the early-stage screening value is primarily created when a small number of high-potential candidates are retrieved early.
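As a small illustration of the shortlist definition, the sketch below selects the indices of the K highest scores. The function name and the tie-breaking rule (lower index first) are assumptions introduced here; the paper does not state how ties are resolved.

```python
def top_k_shortlist(scores, k):
    """Indices of the k highest-scoring items, in decreasing score order.
    Ties are broken by the lower index (an illustrative convention)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]

# Hypothetical scores for five startups.
scores = [0.10, 0.80, 0.35, 0.80, 0.05]
shortlist = top_k_shortlist(scores, 2)
```

If K exceeds the pool size, the full ranked list is returned, which matches the intuition that a review budget larger than the pool imposes no constraint.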

4.2. Ranking-Centric Evaluation Criteria

Evaluation was centered on ranking quality rather than on global accuracy alone. Precision@K and Recall@K were used to quantify shortlist quality:
$$\text{Precision@}K = \frac{1}{K} \sum_{i \in \hat{S}_K} \mathbb{I}(y_i = 1), \qquad \text{Recall@}K = \frac{\sum_{i \in \hat{S}_K} \mathbb{I}(y_i = 1)}{\sum_{i=1}^{N} \mathbb{I}(y_i = 1)}.$$
In addition, NDCG@K was used to evaluate ranking quality with higher weight placed on top positions:
$$\text{NDCG@}K = \frac{1}{\text{IDCG@}K} \sum_{r=1}^{K} \frac{2^{y_{(r)}} - 1}{\log_2(r + 1)},$$
where $y_{(r)}$ denotes the label of the item ranked at position $r$, and IDCG@K denotes the ideal DCG value with perfect ranking. Lift was also reported so that improvements could be interpreted relative to the base rate in the test cohort, and PR-AUC was reported as a summary measure that remains informative under severe class imbalance conditions. Uncertainty in these metrics was quantified using bootstrap resampling so that variability induced by low event rates was explicitly represented.
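The ranking metrics can be sketched in a few lines of plain Python. Precision@K, Recall@K, and NDCG@K follow the formulas in this subsection; Lift@K is taken here to be Precision@K divided by the base rate, which is a common convention but an assumption with respect to the paper's exact definition. All function names are illustrative.

```python
import math

def ranked_labels(scores, labels):
    """Labels reordered by decreasing score (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return [labels[i] for i in order]

def precision_at_k(scores, labels, k):
    return sum(ranked_labels(scores, labels)[:k]) / k

def recall_at_k(scores, labels, k):
    total_pos = sum(labels)
    hits = sum(ranked_labels(scores, labels)[:k])
    return hits / total_pos if total_pos else 0.0

def lift_at_k(scores, labels, k):
    # Assumed definition: shortlist hit rate relative to the base rate.
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate if base_rate else 0.0

def ndcg_at_k(scores, labels, k):
    def dcg(ls):
        return sum((2 ** y - 1) / math.log2(r + 1)
                   for r, y in enumerate(ls[:k], start=1))
    ideal = dcg(sorted(labels, reverse=True))  # perfect ranking
    return dcg(ranked_labels(scores, labels)) / ideal if ideal else 0.0

# Hypothetical pool of ten startups with two positives (base rate 0.2).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

p2 = precision_at_k(scores, labels, 2)   # one hit in the top 2
r2 = recall_at_k(scores, labels, 2)      # one of two positives retrieved
l2 = lift_at_k(scores, labels, 2)        # 0.5 / 0.2
n2 = ndcg_at_k(scores, labels, 2)
```

With a base rate of 0.2, a top-2 shortlist containing one positive has Precision@2 of 0.5 and therefore a lift of 2.5 over random selection.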

4.3. Time-Aware Generalization and Embargo Design

Generalization was evaluated prospectively by using a time-based split. A test cohort was formed from the most recent startups by $t_0$, while older cohorts were used for training. A temporal embargo window was introduced around the train–test boundary so that training instances too close in time to the test period were excluded. The embargo was introduced as a boundary buffer for time-dependent records that may be updated with delay. In startup platforms, events and profile fields are not always recorded immediately at the moment they occur. For that reason, the 180-day gap was used as a conservative operational safeguard against near-boundary contamination, not as a tuned performance parameter. By design, this choice reduced contamination between adjacent time periods and supported an evaluation closer to real screening, where information arriving near the decision boundary can otherwise be leaked through time-dependent variables or delayed updates.
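A minimal sketch of a time-based split with an embargo is given below. It assumes one plausible reading of the protocol: the most recent fraction of startups by $t_0$ forms the test cohort, the boundary is the earliest test $t_0$, and training instances whose $t_0$ falls within the embargo window before that boundary are dropped. The function name, the rounding of the test size, and the exact boundary definition are assumptions, not details confirmed by the paper.

```python
import math
from datetime import date, timedelta

def time_split_with_embargo(t0_by_id, test_fraction=0.20, embargo_days=180):
    """Split startup ids chronologically and drop training ids whose t0
    lies within `embargo_days` before the train-test boundary."""
    ordered = sorted(t0_by_id.items(), key=lambda kv: (kv[1], kv[0]))
    n_test = math.ceil(test_fraction * len(ordered))  # assumed rounding
    head, test = ordered[:len(ordered) - n_test], ordered[len(ordered) - n_test:]
    boundary = test[0][1]                      # earliest t0 in the test cohort
    cutoff = boundary - timedelta(days=embargo_days)
    train_ids = [sid for sid, t0 in head if t0 < cutoff]
    test_ids = [sid for sid, _ in test]
    return train_ids, test_ids, boundary

# Hypothetical cohort: ten startups with t0 spaced 100 days apart.
t0_by_id = {f"s{i}": date(2010, 1, 1) + timedelta(days=100 * i)
            for i in range(10)}
train_ids, test_ids, boundary = time_split_with_embargo(t0_by_id)
```

In this toy cohort the two most recent startups form the test set, and startup `s7` is removed from training because its reference time lies only 100 days before the boundary.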

4.4. Probability Calibration for Threshold-Based Shortlisting

Although screening was framed as a ranking problem, probability calibration was evaluated so that threshold-based shortlisting could be supported when fixed review budgets or target hit rates are planned. When a calibrated probability $\hat{p}_i$ was produced from the raw score $\hat{s}_i$, predicted probabilities were compared to observed frequencies using calibration curves, and probability accuracy was summarized using the Brier score. Calibration methods that preserve ranking order (e.g., monotone post hoc mappings) were treated as tools for improving decision reliability without necessarily changing ranking metrics because the ordering induced by $\hat{s}_i$ is largely preserved.
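The interplay between calibration and ranking can be illustrated with a small sigmoid (Platt-style) calibrator. The sketch below fits a two-parameter logistic map by plain gradient descent on log-loss; this optimizer, the standardization step, and the toy data are assumptions chosen for brevity and are not the paper's implementation. For simplicity the map is fit and evaluated on the same toy data, whereas in practice it would be fit on a held-out calibration set.

```python
import math
from statistics import fmean, pstdev

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return fmean([(p - y) ** 2 for p, y in zip(probs, labels)])

def fit_sigmoid_calibrator(scores, labels, lr=1.0, n_iter=500):
    """Fit p = sigmoid(a * z + b) on standardized scores z by gradient
    descent on log-loss (an illustrative stand-in for Platt scaling)."""
    mu, sd = fmean(scores), pstdev(scores)
    z = [(s - mu) / sd for s in scores]
    a, b, n = 0.0, 0.0, len(z)
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(a * zi + b))) for zi in z]
        grad_a = sum((pi - yi) * zi for pi, yi, zi in zip(p, labels, z)) / n
        grad_b = sum(pi - yi for pi, yi in zip(p, labels)) / n
        a -= lr * grad_a
        b -= lr * grad_b

    def calibrate(s):
        return 1.0 / (1.0 + math.exp(-(a * (s - mu) / sd + b)))
    return calibrate

def rank_order(values):
    return sorted(range(len(values)), key=lambda i: (-values[i], i))

# Hypothetical overconfident raw scores with a 20% positive rate.
raw = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

calibrate = fit_sigmoid_calibrator(raw, labels)
cal = [calibrate(s) for s in raw]

brier_raw = brier_score(raw, labels)
brier_cal = brier_score(cal, labels)
same_ranking = rank_order(raw) == rank_order(cal)
```

Because the fitted map is monotone increasing in the raw score, the shortlist order is unchanged while the Brier score drops, which mirrors the qualitative pattern described in the text.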

5. Methodology

5.1. Overview of the Evaluation Pipeline

A leakage-aware pipeline was implemented to approximate real early-stage screening. For each startup, a reference time $t_0$ was defined, and all inputs were constructed as of $t_0$. A maximum firm age of three years at $t_0$ was enforced. A chronological split was applied, and the most recent 20% of startups by $t_0$ were reserved for testing. A temporal embargo of 180 days was applied around the train–test boundary so that training instances close to the test period were excluded. The resulting cohort contained 117,141 startups, with 88,303 training instances and 23,429 test instances.
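The reported cohort sizes can be cross-checked with simple arithmetic. The sketch below assumes that the test size was obtained by rounding 20% of the cohort upward and that the startups belonging to neither split are those removed by the embargo; neither assumption is stated explicitly in the text.

```python
import math

n_total = 117_141   # cohort size reported in the text
n_train = 88_303    # training instances reported in the text
n_test = 23_429     # test instances reported in the text

# 20% of the cohort, rounded up (assumed rounding rule).
expected_test = math.ceil(0.20 * n_total)

# Startups in neither split; presumably excluded by the 180-day embargo.
n_unassigned = n_total - n_train - n_test
share_unassigned = n_unassigned / n_total
```

Under these assumptions the test size matches the reported 23,429, and 5,409 startups (about 4.6% of the cohort) fall in neither split.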
A set of category-specific models was evaluated so that the marginal screening value of different signal groups could be compared under the same leakage controls. Because screening is operationally constrained, ranking-oriented measures were emphasized. Uncertainty was quantified with bootstrap confidence intervals, and pairwise bootstrap comparisons were conducted against the best PR-AUC model. In addition, probability calibration was evaluated for the best PR-AUC model so that the reliability of threshold-based shortlisting could be assessed.
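A percentile bootstrap for a top-K metric can be sketched as follows. The number of resamples, the confidence level, the seed, and the choice of Precision@K as the example metric are illustrative assumptions; the paper's exact bootstrap settings are not given in this section.

```python
import random

def precision_at_k(scores, labels, k):
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sum(labels[i] for i in order[:k]) / k

def bootstrap_ci(scores, labels, k, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Precision@K, resampling
    startups with replacement from the test cohort."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(precision_at_k([scores[i] for i in idx],
                                    [labels[i] for i in idx], k))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical test cohort: 100 startups, 5 positives.
scores = [i / 100 for i in range(100)]
positives = {99, 97, 90, 60, 20}
labels = [1 if i in positives else 0 for i in range(100)]

point = precision_at_k(scores, labels, 10)
lo, hi = bootstrap_ci(scores, labels, 10)
```

A paired comparison between two models could reuse the same resampled indices for both score vectors and bootstrap the metric difference, so that both models are always evaluated on identical resamples.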

5.2. Feature Construction at $t_0$ and Leakage Controls

All features were constructed to reflect information available at the reference time $t_0$. Post-$t_0$ information was excluded whenever it could distort screening realism. In particular, a funding leakage control was applied: when the last observed funding date occurred after $t_0$, funding-related values were masked so that post-$t_0$ financing information was not used in model fitting or evaluation. This masking rule was applied systematically, and a large masked-funding rate was observed, which indicated that naive feature construction would otherwise contain substantial future information.
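The masking rule can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function name `as_of_funding_features` and the tuple layout of a funding event are assumptions made for illustration, while the feature names mirror those listed in Algorithm 1.

```python
from datetime import date

# Feature names follow Algorithm 1; the tuple layout below is an assumption.
FUNDING_FIELDS = ("fund_sum", "fund_rounds", "investor_count")


def as_of_funding_features(t0, events):
    """Build funding features as of t0 for one startup.

    events: list of (funding_date, amount, investor_count) tuples (hypothetical layout).
    Returns a dict with every funding field set to None (masked) when the
    last observed funding date lies after t0.
    """
    d_max = max((d for d, _, _ in events), default=None)
    if d_max is not None and d_max > t0:
        # Post-t0 financing exists: mask all funding-related values.
        return {name: None for name in FUNDING_FIELDS}
    # No post-t0 events: aggregate only what was known by t0.
    past = [e for e in events if e[0] <= t0]
    return {
        "fund_sum": sum(amount for _, amount, _ in past),
        "fund_rounds": len(past),
        "investor_count": max((c for _, _, c in past), default=None),
    }


t0 = date(2012, 6, 1)
clean = as_of_funding_features(t0, [(date(2011, 1, 1), 100.0, 2), (date(2012, 1, 1), 50.0, 3)])
leaky = as_of_funding_features(t0, [(date(2011, 1, 1), 100.0, 2), (date(2013, 1, 1), 50.0, 3)])
print(clean, leaky)
```

Masking the whole funding group, rather than only dropping the later events, reflects the rule stated above: a post-$t_0$ funding date is treated as a sign that the record may have been updated retrospectively.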
Input variables were organized into signal groups (e.g., FUNDING, GEO, MARKET, TEAM, and MATURITY) and a combined representation (ALL). For each group, a separate model was fit using only the variables in that group. This design supported a controlled comparison of signal value under identical temporal constraints.
The leakage-aware cohort construction and the associated masking and embargo rules are summarized in Algorithm 1. Temporal leakage risks in startup data have been discussed as a major source of overly optimistic evaluation and are addressed here through explicit time controls [2,3,7].
Algorithm 1 Leakage-aware cohort construction at $t_0$ with embargo and masking
Require: Startup table $S$ (startup_id, founded_date, created_at, …); funding events $F$ (startup_id, funding_date, amount, …); team records $T$; market/sector table $M$; parameters: max_age_years $A = 3$, test fraction $\rho = 0.20$, embargo days $E = 180$.
Ensure: Train set $D_{train}$ and test set $D_{test}$ with features built as of $t_0$.
1: Define reference time: for each startup $i$, set $t_0(i) \leftarrow \text{created\_at}(i)$.
2: Early-stage filter: keep startups with $\text{age\_days}(i) = t_0(i) - \text{founded\_date}(i) \le 365 \cdot A$.
3: Compute last funding date: for each startup $i$, set $d_{\max}(i) \leftarrow \max\{\text{funding\_date}(e) : e \in F_i\}$ (if any).
4: As-of funding aggregates:
5: for all startups $i$ do
6:     $F_i^{\le t_0} \leftarrow \{e \in F_i : \text{funding\_date}(e) \le t_0(i)\}$
7:     $\text{fund\_sum}(i) \leftarrow \sum_{e \in F_i^{\le t_0}} \text{amount}(e)$
8:     $\text{fund\_rounds}(i) \leftarrow |F_i^{\le t_0}|$
9:     $\text{investor\_count}(i) \leftarrow \max_{e \in F_i^{\le t_0}} \text{investor\_count}(e)$ (if defined)
10: end for
11: Funding leakage masking:
12: for all startups $i$ do
13:     if $d_{\max}(i)$ exists and $d_{\max}(i) > t_0(i)$ then
14:         Set funding-related features of $i$ to missing (or masked): $\text{fund\_sum}(i), \text{fund\_rounds}(i), \text{investor\_count}(i) \leftarrow \text{NA}$.
15:     end if
16: end for
17: Aggregate other signal groups as of $t_0$:
18: Construct TEAM features from $T$ at startup level (e.g., counts/shares).
19: Merge GEO/MARKET features from $M$ (startup-level).
20: Construct MATURITY proxies available by $t_0$ (e.g., founding attributes, short description fields).
21: Combine all features into a single table $D$ keyed by startup_id and retaining $t_0$.
22: Assign labels: set $y(i) = 1$ if an exit event is observed; else set $y(i) = 0$.
23: Time split: sort $D$ by $t_0$ ascending. Let $D_{test}$ be the most recent $\rho$ fraction by $t_0$; let $D_{train}$ be the remainder.
24: Embargo: let $t_{min}^{test} = \min\{t_0(i) : i \in D_{test}\}$ and $t_{cut} = t_{min}^{test} - E$ days.
25: Remove from $D_{train}$ any startup with $t_0(i) \ge t_{cut}$.
26: return $D_{train}$, $D_{test}$ (and feature subsets per signal group if required).
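The time split and embargo steps at the end of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `time_split_with_embargo` is hypothetical, the test size is rounded up (an assumption that happens to reproduce 23,429 test instances out of 117,141), and training records exactly on the cutoff are dropped.

```python
import math
from datetime import date, timedelta


def time_split_with_embargo(t0_by_id, test_fraction=0.20, embargo_days=180):
    """Split startup ids chronologically and drop training ids near the boundary.

    t0_by_id: dict mapping startup_id -> reference date t0 (must be non-empty).
    Returns (train_ids, test_ids, t_cut).
    """
    ordered = sorted(t0_by_id, key=lambda sid: t0_by_id[sid])
    # Assumption: the test size is rounded up to a whole number of startups.
    n_test = math.ceil(test_fraction * len(ordered))
    test_ids = ordered[-n_test:]
    candidates = ordered[:-n_test]
    # Embargo cutoff: earliest test t0 minus the embargo length.
    t_min_test = min(t0_by_id[sid] for sid in test_ids)
    t_cut = t_min_test - timedelta(days=embargo_days)
    # Keep only training candidates strictly older than the cutoff.
    train_ids = [sid for sid in candidates if t0_by_id[sid] < t_cut]
    return train_ids, test_ids, t_cut


# Toy cohort: ten startups whose t0 values are spaced 100 days apart.
start = date(2010, 1, 1)
cohort = {i: start + timedelta(days=100 * i) for i in range(10)}
train_ids, test_ids, t_cut = time_split_with_embargo(cohort)
print(train_ids, test_ids, t_cut)
```

In the toy cohort, the startup closest to the test period is removed from training even though it precedes every test record, which is the intended effect of the boundary buffer.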

5.3. Model Specification and Training

A linear probabilistic classifier was used so that (i) ranking scores could be produced directly, and (ii) coefficient-based inspection could be conducted alongside permutation-based importance. Logistic regression was selected for this purpose. Regularization was used to limit overfitting under high-dimensionality conditions, and imbalance handling was applied so that learning was not dominated by the negative class. For each signal group, a separate classifier was trained on the training cohort, and predicted scores were generated for the test cohort. The resulting scores were interpreted as screening scores for ranking and, when calibrated, as approximate probabilities.

5.4. Time-Based Evaluation and Metrics

Evaluation was performed on the held-out, most recent test cohort. Because exit outcomes were rare and the test cohort was additionally affected by right-censoring, precision–recall-based summaries were treated as primary. PR-AUC was used to quantify discrimination under imbalance conditions. Ranking performance was evaluated with Precision@K, Recall@K, and NDCG@K, where K represented a fixed shortlist capacity. Lift was reported to express concentration of positive outcomes near the top of the ranked list relative to the test positive rate. A consistent cutoff definition was applied across models so that lift values could be compared directly.
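The shortlist metrics described above can be sketched in a few lines of standard-library Python. The block is an illustration rather than the evaluation code used in the study: the function names are hypothetical, PR-AUC is approximated by average precision, Lift@K is taken as Precision@K divided by the test positive rate, NDCG@K assumes binary gains with a log2 discount, and tied scores are broken by input order.

```python
import math


def _order(scores):
    # Indices sorted by descending score; ties keep their input order.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(y, scores, k):
    return sum(y[i] for i in _order(scores)[:k]) / k


def recall_at_k(y, scores, k):
    positives = sum(y)
    return sum(y[i] for i in _order(scores)[:k]) / positives if positives else 0.0


def lift_at_k(y, scores, k):
    # Assumed definition: top-K precision relative to the overall positive rate.
    base_rate = sum(y) / len(y)
    return precision_at_k(y, scores, k) / base_rate if base_rate else 0.0


def ndcg_at_k(y, scores, k):
    gains = [y[i] for i in _order(scores)[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, sum(y))))
    return dcg / ideal if ideal else 0.0


def average_precision(y, scores):
    # Mean of precision values taken at the rank of each positive.
    hits, total = 0, 0.0
    for rank, i in enumerate(_order(scores), start=1):
        if y[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


y = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(precision_at_k(y, s, 2), recall_at_k(y, s, 2), lift_at_k(y, s, 2))
print(ndcg_at_k(y, s, 2), average_precision(y, s))
```

Fixing K and the tie-breaking rule across models corresponds to the consistent cutoff definition mentioned above, since both choices affect which startups enter the shortlist.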

5.5. Bootstrap Uncertainty and Pairwise Comparisons

Metric uncertainty was quantified with nonparametric bootstrap resampling on the test set. For each bootstrap replicate, test instances were sampled with replacement, scores and labels were re-indexed, and metrics were recomputed. The 2.5th and 97.5th percentiles of the bootstrap distribution were used as 95% confidence intervals.
In addition, pairwise bootstrap comparisons were conducted against the best PR-AUC model. For each replicate, the PR-AUC difference between the best model and a comparison model was computed on the same resampled test instances. One-sided bootstrap p-values were computed as the fraction of replicates in which this difference was non-positive, which provided an interpretable measure of whether the best model consistently dominated the alternative under resampling conditions.
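The percentile interval and the one-sided comparison can be sketched as below. This is a simplified illustration, not the study's pipeline: `bootstrap_compare` and `percentile` are hypothetical helpers, average precision stands in for PR-AUC, and a replicate without positives scores zero for both models, so it counts as a non-positive difference.

```python
import random


def average_precision(y, scores):
    # Average precision as a stand-in for PR-AUC (assumption).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def percentile(values, q):
    # Linear-interpolation percentile for q in [0, 100].
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def bootstrap_compare(y, s_best, s_other, n_boot=1000, seed=0):
    """Return the 95% CI of the best model's metric and the one-sided p-value."""
    rng = random.Random(seed)
    n = len(y)
    best_vals, diffs = [], []
    for _ in range(n_boot):
        # Same resampled indices for both models, so the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        ap_best = average_precision(yb, [s_best[i] for i in idx])
        ap_other = average_precision(yb, [s_other[i] for i in idx])
        best_vals.append(ap_best)
        diffs.append(ap_best - ap_other)
    ci = (percentile(best_vals, 2.5), percentile(best_vals, 97.5))
    p_value = sum(d <= 0 for d in diffs) / n_boot
    return ci, p_value


y = [1] * 10 + [0] * 10
ci, p = bootstrap_compare(y, [float(v) for v in y], [1.0 - v for v in y], n_boot=200)
print(ci, p)
```

Pairing the resampled indices keeps the two models on identical replicates, so the distribution of differences reflects model disagreement rather than differences between samples.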

5.6. Probability Calibration

Post hoc calibration was evaluated for a single representative model to keep the analysis focused and comparable. The best-performing model was defined as the model with the highest PR-AUC on the held-out test set. This model was then calibrated using monotone post hoc mappings (sigmoid and isotonic). Calibration quality was assessed with calibration curves and the Brier score. Ranking performance was not expected to improve after calibration because monotone mappings preserve the ordering up to ties (the sigmoid mapping is strictly monotone, whereas isotonic regression can merge neighboring scores into ties), while probability reliability can still improve substantially.
The end-to-end training and evaluation procedure (group-wise modeling, ranking metrics, bootstrap confidence intervals, pairwise bootstrap tests, and post hoc calibration) is summarized in Algorithm 2. Ranking-focused evaluation choices are aligned with top-K screening settings [4], while probability calibration is used to improve the reliability of threshold-based planning [6].
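The separation between ranking and probability reliability can be shown on a toy example. The sketch below is illustrative only: the sigmoid parameters `a` and `b` are hypothetical fixed values (in practice they would be fitted on training folds, as in Algorithm 2), the mapping is applied directly to the raw score, and the toy scores are deliberately too high for a rare outcome, as can happen when class weighting is used.

```python
import math


def brier_score(y, p):
    # Mean squared difference between predicted probability and outcome.
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)


def sigmoid_map(scores, a, b):
    # Platt-style mapping; with a > 0 it is strictly increasing in the score.
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]


def ranking(values):
    # Indices ordered from highest to lowest value.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Toy data: one positive among ten, with raw scores far above the base rate.
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
raw = [0.30, 0.35, 0.40, 0.42, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]

# Hypothetical parameters, not fitted values from the study.
calibrated = sigmoid_map(raw, a=8.0, b=-6.0)

print(ranking(raw) == ranking(calibrated))
print(brier_score(y, raw), brier_score(y, calibrated))
```

The ordering is identical before and after the mapping, so any top-K metric is unchanged, while the Brier score drops because the mapped probabilities sit closer to the observed outcome frequency.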

5.7. Feature Importance Analysis

Two complementary importance analyses were applied within each signal group:
  • Permutation importance: Test-set importance was quantified as the drop in PR-AUC after a single input feature was randomly permuted, with bootstrap resampling used to represent uncertainty.
  • Coefficient-based importance: Absolute logistic regression coefficients were aggregated to the raw-variable level so that the strongest weighted inputs could be identified using the fitted linear decision function.
Permutation-based estimates were treated as test-time, performance-grounded importance, while coefficient-based estimates were treated as complementary evidence that reflects fitted weight magnitude under regularization conditions.
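The permutation analysis can be sketched for a generic scoring function. This is a simplified illustration under stated assumptions: `permutation_importance` is a hypothetical helper (not the scikit-learn function of the same name), average precision stands in for PR-AUC, rows are plain lists, and repeated shuffles replace the bootstrap resampling used in the study to express uncertainty.

```python
import random


def average_precision(y, scores):
    # Average precision as a stand-in for PR-AUC (assumption).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def permutation_importance(score_fn, X, y, column, n_repeats=20, seed=0):
    """Mean drop in average precision after shuffling one column of X."""
    rng = random.Random(seed)
    baseline = average_precision(y, [score_fn(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        permuted = average_precision(y, [score_fn(row) for row in X_perm])
        drops.append(baseline - permuted)
    return sum(drops) / n_repeats


# Toy model that only uses the first feature.
X = [[0.9, 1.0], [0.8, 2.0], [0.1, 3.0], [0.2, 4.0],
     [0.3, 5.0], [0.15, 6.0], [0.25, 7.0], [0.05, 8.0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
score = lambda row: row[0]
print(permutation_importance(score, X, y, column=0))
print(permutation_importance(score, X, y, column=1))
```

A feature the model ignores yields a drop of exactly zero, whereas on a real low-prevalence test set small negative drops can appear through sampling variability alone.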
Algorithm 2 Signal-group model training, ranking evaluation, bootstrap CI, and calibration
Require: Training set $D_{train}$, test set $D_{test}$; signal groups $G = \{\text{FUNDING}, \text{GEO}, \text{MARKET}, \text{TEAM}, \text{MATURITY}, \text{ALL}\}$; shortlist size $K$; bootstrap replicates $B$.
Ensure: Per-group metrics with confidence intervals; best group by PR-AUC; pairwise bootstrap comparisons; calibrated probabilities for best group.
1: for all groups $g \in G$ do
2:     Extract group features: $X_{train}^{g}$, $X_{test}^{g}$ from $D_{train}$, $D_{test}$; set $y_{train}$, $y_{test}$.
3:     Fit model $M_g \leftarrow \text{LogisticRegression}(X_{train}^{g}, y_{train};\ \text{class\_weight} = \text{balanced})$.
4:     Compute scores $s_g \leftarrow M_g.\text{predict\_proba}(X_{test}^{g})[:, 1]$.
5:     Compute base metrics on test: $\text{PR\_AUC}(g)$, $\text{Lift@}K(g)$, $\text{Precision@}K(g)$, $\text{Recall@}K(g)$, $\text{NDCG@}K(g)$.
6: end for
7: Select best group $g^{*} \leftarrow \arg\max_{g \in G} \text{PR\_AUC}(g)$.
8: Bootstrap confidence intervals (test resampling):
9: for $b = 1$ to $B$ do
10:     Sample indices $I_b$ from $\{1, \ldots, |D_{test}|\}$ with replacement.
11:     for all groups $g \in G$ do
12:         Compute $\text{PR\_AUC}_b(g) \leftarrow \text{PR\_AUC}(y_{test}[I_b], s_g[I_b])$.
13:         Optionally compute $\text{Lift@}K_b(g)$, $\text{Precision@}K_b(g)$, $\text{Recall@}K_b(g)$, $\text{NDCG@}K_b(g)$ similarly.
14:     end for
15: end for
16: For each group $g$, report the 95% CI as empirical quantiles: $Q_{0.025}(\text{PR\_AUC}_{\cdot}(g))$, $Q_{0.975}(\text{PR\_AUC}_{\cdot}(g))$ (and analogues for other metrics).
17: Pairwise bootstrap comparison vs best:
18: for all groups $g \in G \setminus \{g^{*}\}$ do
19:     Compute $\Delta_b(g) \leftarrow \text{PR\_AUC}_b(g^{*}) - \text{PR\_AUC}_b(g)$ for all $b$.
20:     Compute one-sided bootstrap p-value: $p(g) \leftarrow \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{\Delta_b(g) \le 0\}$.
21:     Report the mean $\bar{\Delta}(g)$ and its 95% CI from $\{\Delta_b(g)\}_{b=1}^{B}$.
22: end for
23: Calibration for the best group (post hoc):
24: Refit $M_{g^{*}}$ on $D_{train}$ (group $g^{*}$ features).
25: Fit sigmoid calibration mapping (e.g., Platt scaling) on training folds; produce calibrated probabilities $\hat{p}^{sig}$ on $X_{test}^{g^{*}}$.
26: Fit isotonic calibration mapping on training folds; produce calibrated probabilities $\hat{p}^{iso}$ on $X_{test}^{g^{*}}$.
27: Compute Brier scores and (optional) calibration curve summaries for $\hat{p}^{sig}$ and $\hat{p}^{iso}$.
28: return Metrics with CI; $g^{*}$; pairwise $p(g)$; calibrated results for $g^{*}$.

6. Experimental Results

A large-scale startup-level dataset was used in this study. The dataset contained structured information on company characteristics, funding history, geographic location, market sectors, and team attributes, and it was constructed by integrating multiple relational tables at the startup level. Each startup was associated with a unique identifier, which allowed firm-level aggregation of funding events, team records, and categorical attributes. Temporal information was available for key events, such as founding year, funding dates, and company creation timestamps, which enabled the construction of time-aware features and the application of strict temporal evaluation protocols.
The dataset covered startups founded across multiple countries and regions and spanned a long observation window, allowing both cross-sectional and temporal variation to be analyzed. Outcome labels were defined based on observed exit events, including acquisitions and initial public offerings, while non-exit cases included operating and inactive firms. To reduce information leakage, all features were constructed as of a reference time point for each startup, and post-reference information was excluded or masked when necessary. The resulting dataset supported a realistic evaluation of early-stage startup outcomes under strong class imbalance and temporal drift conditions.

6.1. Cohort Construction and Leakage Controls

An early-stage startup cohort was constructed, and outcomes were defined as exit events (acquisition or IPO). A maximum age of three years at the reference time $t_0$ was enforced. A time-based split was applied, and the most recent 20% of startups by $t_0$ were reserved for testing. A temporal embargo of 180 days was applied so that training examples close to the test boundary were not used. The resulting cohort size and outcome prevalence are reported in Table 1. The test positive rate was much lower than the overall positive rate; this pattern was expected because exits accumulate over time and right-censoring is more likely in the most recent cohort.
Funding leakage control was applied by masking funding-related values when the last funding date was after $t_0$. A large masking rate was observed (Table 2). This control was included so that post-$t_0$ information was not used for prediction, and performance estimates were not inflated by temporal leakage.

6.2. Censoring-Aware Robustness Analysis

Because the held-out cohort contains right-censored firms, an additional survival-oriented analysis was conducted using Cox proportional hazards models on the TEAM, MATURITY, and ALL representations. This analysis handled censoring explicitly and was evaluated with the concordance index and time-dependent AUC at 12, 24, and 36 months. The combined ALL model remained strongest overall, while TEAM provided the highest early-horizon discrimination, and MATURITY remained comparatively stable at the longer horizon (Table 3). The comparison suggests that the original binary top-K formulation, although simpler and not event-time exact, captured a directionally similar ranking structure with the present data setting.
Taken together, the survival-oriented analysis does more than confirm the binary results. TEAM is strongest at shorter horizons, MATURITY is more stable at longer horizons, and ALL remains strongest overall, so the signal groups differ in when their information is most useful.

6.3. Category-Level Predictive Performance

In Figure 1a, category-level discrimination is summarized by PR-AUC with 95% confidence intervals, and the dashed line indicates the random baseline. The highest mean PR-AUC was achieved by the MATURITY model, and the ALL and TEAM models followed closely behind. MARKET provided a modest improvement above the random baseline, while GEO remained close to the baseline. The FUNDING model performed at or slightly below the baseline, which suggested that the selected funding inputs did not provide a strong discriminative signal under the strict as-of-$t_0$, time-split evaluation. Wide confidence intervals were observed for several groups, and this variability was expected at a low event rate and with bootstrap resampling.
In Figure 1b, lift over random is reported so that results could be interpreted relative to the test positive rate. Lift values above 1.0 were obtained for MATURITY, ALL, TEAM, and MARKET, which indicated performance above random ranking. GEO was shown to be near the 1.0 reference line, which suggested limited added value from ranks alone with the selected GEO inputs. FUNDING was shown to be approximately at the 1.0 line or below, which indicated that performance was not reliably above random. The ranking advantage of MATURITY was reinforced because the highest lift is observed for this group in Figure 1b, consistent with Figure 1a.
In Figure 1c, Precision@50 is reported to evaluate how many true exits were concentrated among the top 50 ranked startups. The highest mean Precision@50 was obtained by TEAM and ALL, and this pattern indicated that a stronger concentration of positives at the top of the list was achieved when team signals were included, either alone or jointly with other categories. Lower Precision@50 values were observed for MATURITY despite its higher PR-AUC, and this result suggested that the ranking improvements provided by MATURITY were distributed across the score range rather than being maximally concentrated in the top 50. Large confidence intervals were observed for all categories, and this variability was expected because Precision@50 was computed on a very small top-K subset under strong class imbalance conditions.
In Figure 1d, Recall@50 is reported to quantify how many of all true exits were recovered within the top 50 ranked startups. The highest mean Recall@50 was observed for ALL and TEAM, and this pattern indicated that more exits were retrieved early in the ranked list when team signals were used. MATURITY was shown to have a comparatively lower Recall@50, which was consistent with the interpretation that MATURITY improved overall ranking quality but did not maximize early retrieval at the fixed cutoff of 50. The remaining categories were shown to have smaller Recall@50 values, which suggested limited usefulness when only the top 50 startups were to be screened.
In Figure 1e, NDCG@50 is used to evaluate ranking quality while discounting lower positions in the list. The highest mean NDCG@50 was obtained by ALL, and TEAM was shown to be the next strongest group. MARKET and MATURITY were shown to provide moderate NDCG@50 values, while GEO remained low, and FUNDING remained limited. This pattern indicated that the best top-of-list ordering was achieved when information was combined (ALL) and when team indicators were included (TEAM). Substantial uncertainty was again observed, and it was consistent with the low base rate and the limited number of positives expected among the top-ranked positions.
Pairwise bootstrap comparisons against the best PR-AUC model were performed, and the mean PR-AUC differences with confidence intervals are reported (Table 4). Statistically clear gaps were observed between the best model and the FUNDING and GEO models. Smaller and less stable gaps were observed when comparisons were made against the ALL and TEAM models. These results suggested that the selected MATURITY signals were competitive with broader feature sets in this evaluation design, while FUNDING-only and GEO-only information was not sufficient for strong discrimination.

6.4. Comparison with Non-Linear Models

To assess whether the conclusions were specific to a linear baseline, additional nonlinear tabular models were evaluated under the same leakage-aware protocol (Table 5). LightGBM and a shallow multilayer perceptron were trained using the identical time split, 180-day embargo, feature masking rules, and bootstrap evaluation pipeline. The largest improvement was observed when the signal groups were combined: the ALL representation improved from PR-AUC = 0.0127 for logistic regression to 0.0156 for LightGBM, while Precision@50 increased from 0.060 to 0.084. By contrast, TEAM and MATURITY showed only modest gains. These results indicate that, under strict temporal validity, model nonlinearity is most useful for exploiting cross-group interactions rather than for dramatically altering the ranking behavior of single-group models.
The strongest nonlinear gain appears in the ALL representation, not in the single-group models. The improvement should therefore not be read as evidence that black-box models are always better; rather, it suggests that nonlinear models mainly help when cross-group interactions can be exploited under a leakage-aware protocol.

6.5. Comparison with Naive Features

To quantify the effect of the leakage controls directly, a naive benchmark was also estimated using a random split and unmasked features. The resulting performance was materially higher than under the final protocol, especially for ALL and MATURITY (Table 6). For example, ALL yielded PR-AUC = 0.0718 and Precision@50 = 0.280 with the naive setup, compared with 0.0127 and 0.060 for the leakage-aware setup. The inflation was even more pronounced for MATURITY, which is consistent with the susceptibility of profile text and retrospectively enriched firm attributes to post- t 0 contamination. These results show that the proposed protocol is not merely conservative in principle; it changes the estimated screening value materially in practice.
This comparison gives a clear before-versus-after picture of temporal leakage. MATURITY appears especially inflation-prone under the naive protocol, which is consistent with the concern raised earlier about profile text and time-sensitive platform updates.
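The size of the inflation can be made explicit with a short calculation from the ALL figures quoted in this subsection. The ratio notation is introduced here only for illustration and is not a quantity reported in the tables.

```latex
\frac{\text{PR-AUC}_{\text{naive}}}{\text{PR-AUC}_{\text{leakage-aware}}}
  = \frac{0.0718}{0.0127} \approx 5.65,
\qquad
\frac{\text{Precision@50}_{\text{naive}}}{\text{Precision@50}_{\text{leakage-aware}}}
  = \frac{0.280}{0.060} \approx 4.67
```

Under the naive setup, the ALL model would therefore appear roughly five times stronger than it is under the leakage-aware protocol.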

6.6. Probability Calibration Results

Probability calibration was intentionally performed for a single model rather than for all category models. In this study, calibration was applied only to the best-performing group, where “best” was defined as the model that achieved the highest PR-AUC on the held-out test set. This choice was made to keep the calibration analysis focused and comparable because calibration is used to evaluate the reliability of predicted probabilities, while PR-AUC is used to select the model with the strongest ranking performance. For this reason, the calibration results were not intended to imply that only one model can be calibrated; instead, a single representative model was calibrated to demonstrate how probability reliability changes after post hoc calibration.
The label “Best group” shown in the calibration figure was therefore determined by the PR-AUC ranking in Table 7. In the current run, the MATURITY model achieved the highest mean PR-AUC, so it was selected automatically for calibration and is displayed as the “Best group” in Figure 2. It should also be noted that calibration was evaluated with the Brier score and calibration curves, which measure probability accuracy and not ranking ability. As a result, calibration was expected to improve probability reliability (e.g., lower Brier score) without necessarily improving PR-AUC because the ordering of instances is not changed substantially by monotonic calibration methods.

6.7. Feature Importance Within Each Category

Feature importance was analyzed within each category so that the most influential inputs could be identified. Permutation importance was computed on the test set as the drop in PR-AUC after one input was shuffled. The resulting importance rankings were shown for each category in Figure 3a–e. Coefficient-based importance was also computed from logistic regression by aggregating absolute coefficients to the raw-variable level. The coefficient-based rankings were shown for each category in Figure 4a–e.

6.7.1. Permutation-Based Importance

In Figure 3a, permutation importance within the FUNDING category is reported for seven selected funding-related inputs. Positive PR-AUC drops were observed for investment_type, total_funding_usd, and total_funding, which indicated that these variables contributed a useful signal when evaluated individually. Negative or near-zero importance values were observed for investor_count, log_total_funding_usd, and num_funding_rounds. This pattern suggested that the FUNDING-only model was weak and that some variables were redundant or noisy under the strict as-of-$t_0$ masking and time-based evaluation. Because permutation estimates were computed on a low-prevalence test set, small negative values were interpreted as sampling variability or collinearity effects rather than as evidence of true “harmful” predictors.
In Figure 3b, permutation importance within the GEO category is reported for rank of region and rank of city. Positive mean drops were observed for both variables, which indicated that geographic rank information provided a measurable signal for exit ranking. However, wide confidence intervals were observed, and intervals overlapped zero. This result suggested that the geographic rank features were informative but that their marginal effects were unstable under resampling conditions, which was consistent with limited model strength for the GEO-only configuration.
In Figure 3c, permutation importance within the MARKET category is reported as the decrease in PR-AUC when a single sector indicator was shuffled on the test set. A dominant contribution was observed for Biotechnology because the largest PR-AUC drop was produced when this variable was permuted. Much smaller effects were observed for the remaining sectors, and several confidence intervals overlapped zero. This pattern indicated that most sector indicators carried a limited marginal signal for the selected feature set, while Biotechnology provided the most distinctive information for ranking exits in the MARKET-only model.
In Figure 3d, permutation importance within the MATURITY category is reported for four inputs. The strongest effect was attributed to short_description because a large PR-AUC reduction was produced when the text description was permuted. A smaller but positive effect was attributed to roles. Near-zero effects were observed for founded_year and primary_role, and this result suggested that, in the selected configuration, descriptive text was used as the primary maturity signal, while the remaining structured maturity attributes contributed little additional discrimination.
In Figure 3e, permutation importance within the TEAM category is reported for job-type and education indicators. The largest PR-AUC drops were produced by degree-related variables, and the strongest effects were attributed to deg_other, deg_mba, and deg_bachelor. A moderate positive contribution was also observed for job_executive, while smaller effects were observed for job_employee and deg_phd. Near-zero or negative effects were observed for job_board_observer, job_advisor, and job_board_member. These results indicated that, within the selected TEAM inputs, educational composition carried a stronger marginal signal than role composition, although uncertainty remained substantial.
Figure 3. Permutation importance within feature categories, measured as the PR-AUC drop on the test set after shuffling one input feature. (a) Permutation importance within the FUNDING category; (b) permutation importance within the GEO category; (c) permutation importance within the MARKET category; (d) permutation importance within the MATURITY category; (e) permutation importance within the TEAM category.

6.7.2. Coefficient-Based Importance

In Figure 4a, coefficient-based importance is reported for the FUNDING category. The largest aggregated magnitude was assigned to investment_type, while log_total_funding_usd formed the second strongest contribution. Smaller magnitudes were assigned to total_funding, investor_count, and num_funding_rounds, while negligible magnitude was assigned to total_funding_usd and last_funding_year. This result indicated that categorical funding-stage information and transformed funding size were emphasized by the fitted linear model, whereas the remaining funding variables were down-weighted when modeled jointly.
Figure 4. Coefficient-based feature importance within categories. Absolute coefficients were aggregated to the raw-variable level. (a) Coefficient-based importance within the FUNDING category; (b) coefficient-based importance within the GEO category; (c) coefficient-based importance within the MARKET category; (d) coefficient-based importance within the MATURITY category; (e) coefficient-based importance within the TEAM category.
In Figure 4b, coefficient-based importance is reported for the GEO category. A substantially larger aggregated magnitude was assigned to rank of region than to rank of city. This result indicated that broader regional ranking information was weighted more strongly than city-level ranking in the fitted linear decision function for the selected GEO representation. It was also noted that, because both variables were numeric and were modeled jointly, coefficient magnitudes reflected relative scaling and fitted effects under regularization conditions, and they were therefore interpreted as complementary evidence to the permutation analysis rather than as a standalone estimate of test-time importance.
In Figure 4c, coefficient-based importance is reported for the MARKET category by aggregating the absolute logistic regression coefficients at the raw-variable level. The largest aggregated magnitude was assigned to Biotechnology, and similarly high magnitudes were assigned to Software and Advertising. A second tier of importance was observed for Apps, Manufacturing, and Information Technology. Smaller magnitudes were observed for the remaining sectors. This pattern indicated that, with the selected sector-only representation, model weights were concentrated on a small subset of sectors, and the majority of sectors contributed weakly in the linear decision function.
In Figure 4d, coefficient-based importance is reported for the MATURITY category after aggregation to the raw-variable level. The coefficient magnitude was dominated by short_description, while roles contributed only marginally. Near-zero magnitudes were assigned to primary_role and founded_year. This pattern was consistent with the permutation results and indicated that descriptive text features drove most of the linear signal in the MATURITY model, whereas the remaining structured maturity inputs added little incremental contribution in the fitted classifier.
In Figure 4e, coefficient-based importance is reported for the TEAM category. The highest aggregated magnitudes were assigned to degree-related indicators, and the largest values were observed for deg_bachelor, deg_other, and deg_mba. A moderate magnitude was assigned to job_board_observer and job_executive, while smaller magnitudes were assigned to deg_phd and the remaining job-type indicators. This pattern suggested that, in the linear model, educational composition was used more strongly than role composition when the selected TEAM inputs were provided, although it was noted that coefficient magnitudes reflected fitted weight size rather than direct causal importance.

7. Discussion

The study was designed to approximate real early-stage venture screening, where limited attention is allocated under severe outcome rarity conditions and where information is updated over time. A strict as-of-$t_0$ feature construction, an embargo window, and explicit masking rules were used so that post-$t_0$ information was not allowed to inflate performance. In this setting, a substantial share of startups was affected by the funding masking rule, which supported the view that temporal leakage would be likely with naive feature construction. In addition, a pronounced prevalence shift was observed between the overall cohort and the most recent held-out cohort, and this shift was consistent with right-censoring in prospective evaluation settings.

7.1. Interpretation of Category-Level Performance

A key empirical pattern was the divergence between global ranking discrimination and shortlisting-oriented metrics. The MATURITY signal group achieved the strongest PR-AUC, while the highest Precision@50 and Recall@50 were obtained by TEAM and ALL. This discrepancy was interpreted as evidence that MATURITY improved ordering across a broader range of scores, whereas TEAM-related signals provided stronger concentration of positives at the very top of the list for a fixed shortlisting capacity. The ranking-quality summary with position discounting (NDCG@50) also favored ALL and TEAM, suggesting that the most practically useful ordering for a strict top-K workflow was obtained when team information was included and when signals were combined.
A plausible difference in signal shape appears to exist between MATURITY and TEAM. The descriptive text within MATURITY seems to provide a broad but relatively diffuse signal that helps separate companies across a wider portion of the score distribution, which is consistent with its stronger PR-AUC. By contrast, TEAM variables appear to be more selective: when favorable team patterns are present, a sharper concentration of positives can be created near the top ranks, which is consistent with the stronger Precision@50 and Recall@50. The result should therefore not be read as a contradiction; rather, it suggests that different signal groups can support different parts of the ranking objective.
The bootstrap comparison results were aligned with these patterns. Statistically clear PR-AUC gaps were observed between the best model and the FUNDING and GEO models (Δ PR-AUC of 0.0073 and 0.0067, with bootstrap intervals excluding zero; Table 4), while differences against TEAM and ALL were not statistically stable (intervals spanning zero, p = 0.300 and p = 0.373). This result suggested that the selected maturity-related inputs were competitive with broader feature sets under the leakage-aware protocol, whereas FUNDING-only and GEO-only representations were not sufficient for strong discrimination in the current configuration. A modest improvement above baseline was also indicated for MARKET, although uncertainty remained substantial given the low event rate.
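A paired bootstrap comparison of this kind can be sketched as follows. The sketch uses average precision as the PR-AUC estimator, percentile intervals, and a one-sided p-value defined as the share of resamples in which the difference is not positive; these are assumptions for illustration, and the toy scores are synthetic rather than taken from the study.

```python
import random


def average_precision(labels, scores):
    """Average precision, a common PR-AUC estimator for binary labels."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def paired_bootstrap_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Resample cases with replacement and compare both models on each draw."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if not any(y):
            continue  # skip resamples that contain no positives
        a = average_precision(y, [scores_a[i] for i in idx])
        b = average_precision(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[int(0.975 * len(diffs)) - 1]
    mean = sum(diffs) / len(diffs)
    p_one_sided = sum(d <= 0 for d in diffs) / len(diffs)
    return mean, lo, hi, p_one_sided


# Synthetic toy data: model A is informative, model B is pure noise.
rng = random.Random(1)
labels = [1 if rng.random() < 0.1 else 0 for _ in range(400)]
scores_a = [y + rng.gauss(0, 0.5) for y in labels]
scores_b = [rng.gauss(0, 1) for _ in labels]
mean, lo, hi, p = paired_bootstrap_diff(labels, scores_a, scores_b, n_boot=500)
```

Because both models are scored on the same resampled cases, the interval reflects uncertainty in the difference itself, which is what matters when two signal groups are compared under a very low event rate.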

7.2. Implications for Practical VC Screening

Two implications for screening practice were highlighted. First, a single metric was not sufficient to characterize the screening value under capacity constraints. When the operational goal is to prioritize a small shortlist, metrics that emphasize early retrieval (Precision@K, Recall@K) and top-of-list ordering (NDCG@K) provide direct evidence about shortlisting performance. In contrast, PR-AUC provided a broader discrimination summary that remained informative under class imbalance conditions but did not necessarily identify the configuration that maximized early retrieval at a fixed cutoff. As a result, a two-stage screening workflow could be supported: a shortlisting stage could be guided by TEAM/ALL-style signals to concentrate positives early, while a broader triage stage could be guided by MATURITY-like signals to improve overall ranking discrimination.
Because the held-out prevalence was only $\pi_{\text{test}} = 0.00956$, the absolute magnitude of PR-AUC should not be interpreted in isolation. A more decision-relevant view is obtained from the top-K metrics. For example, the TEAM model achieved Precision@50 = 0.062, which corresponds to approximately 3.1 successful exits in the top 50 (0.062 × 50), compared with an expected 0.478 under random ranking (0.00956 × 50). Thus, the practical value shown here is not that a production-ready screening engine has been obtained, but that leakage-aware evaluation and shortlist-oriented metrics can still yield interpretable operational evidence under extreme rarity conditions.
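The shortlist arithmetic can be reproduced in a few lines. The sketch assumes that Lift@K is defined as Precision@K divided by the test prevalence, which is consistent with the leakage-aware rows of Table 6; the function and variable names are illustrative.

```python
def expected_hits(precision_at_k, k):
    """Expected number of positives in a shortlist of size k."""
    return precision_at_k * k


def lift_at_k(precision_at_k, prevalence):
    """Precision@K relative to the base rate (assumed Lift@K definition)."""
    return precision_at_k / prevalence


K = 50
PREVALENCE = 0.00956       # held-out positive rate from Table 1
TEAM_PRECISION_50 = 0.062  # TEAM logistic regression, Table 5

team_hits = expected_hits(TEAM_PRECISION_50, K)
random_hits = expected_hits(PREVALENCE, K)
team_lift = lift_at_k(TEAM_PRECISION_50, PREVALENCE)
```

Under these inputs the TEAM shortlist contains about 3.1 expected positives against roughly 0.48 for a random shortlist, a lift of about 6.5.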
Second, probability reliability was materially improved by post hoc calibration of the best model. The Brier score fell from 0.0972 to 0.0161 with sigmoid calibration and to 0.0169 with isotonic calibration (Table 7), while PR-AUC was not improved: it remained 0.0144 under sigmoid calibration and was 0.0138 under isotonic calibration. This is expected, because a strictly monotone recalibration such as the sigmoid map preserves the score ordering, whereas isotonic regression can introduce ties that slightly perturb ranking metrics. This pattern supported the interpretation that calibration primarily improved the usability of scores as probabilities for threshold-based planning, while leaving ranking ability largely unchanged. For decision support, this distinction is important: ranking metrics determine which startups enter a shortlist, while calibration improves the stability of expected hit rates when thresholds or target shortlist yields are planned. Calibration should therefore be read as a reliability layer on top of ranking, not as a mechanism by which weak discrimination is converted into strong screening performance. Its role in the present study is to improve threshold planning once a ranking model has already been specified.
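The separation between reliability and ranking can be illustrated with a toy example. The sigmoid parameters below are hand-picked rather than fitted, and the scores are synthetic, so the sketch only demonstrates the mechanism (a lower Brier score with an unchanged ordering) and does not reproduce the study's calibration procedure.

```python
import math


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def sigmoid_calibrate(scores, a, b):
    """Platt-style map p = 1 / (1 + exp(-(a * s + b))); a > 0 keeps the order."""
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]


def ranking(values):
    """Indices sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: -values[i])


# Synthetic, overconfident raw scores: 98 negatives and 2 positives.
labels = [0] * 98 + [1, 1]
raw = [0.20 + 0.0025 * i for i in range(98)] + [0.5, 0.6]

# Hand-picked (not fitted) sigmoid parameters, for illustration only.
calibrated = sigmoid_calibrate(raw, a=8.0, b=-6.0)

brier_raw = brier_score(labels, raw)
brier_cal = brier_score(labels, calibrated)
same_order = ranking(raw) == ranking(calibrated)
```

The calibrated scores are pulled toward the low base rate, which reduces the Brier score, while the shortlist produced by sorting on either score vector is identical.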

7.3. Signal Interpretation and Model Transparency

Feature-importance patterns provided additional insight into what was being captured within each signal group. In MATURITY, most of the discriminative signal was attributed to the short descriptive text field, while the remaining structured maturity attributes contributed little additional discrimination. In TEAM, educational composition indicators provided the strongest marginal contributions, and role-related indicators provided smaller but measurable effects. These findings suggested that the most influential screening signals were not limited to purely financial history; instead, descriptive and human-capital proxies were emphasized under the current representation. At the same time, these importance patterns should be interpreted as predictive associations that are subject to regularization and resampling variability rather than as causal mechanisms.
Because the category-specific analyses were designed as controlled ablations, they should not be interpreted as evidence that the underlying startup signals are mutually independent. Rather, they provide a structured view of marginal information value under a common leakage-aware protocol, while the ALL model serves as the corresponding joint specification.

7.4. Limitations

Several limitations were identified. First, outcome labels were defined as observed exit events, while non-exit cases included operating and inactive firms, and right-censoring was more likely in the most recent cohort. As a result, some negative labels in the held-out period may later convert to positive outcomes, which can attenuate measured performance under a strict prospective split. The present binary label should therefore be interpreted as a screening-oriented approximation, not as an unbiased estimator of event-time risk under censoring. For applications in which event timing is central, a time-to-event formulation would be preferable because right-censoring could then be handled explicitly. Second, a fixed top-50 cutoff was used for several screening metrics, and high variance was expected because these metrics are computed on a small subset under extreme class imbalance; a single additional positive in the shortlist shifts Precision@50 by 0.02. Third, the dominant contribution of descriptive text in MATURITY raised a practical measurement concern: although as-of-$t_0$ construction was enforced, profile text can be updated over time on many platforms. As a result, additional auditing of text timestamping and controlled text representations would strengthen robustness. Fourth, the present workflow assumes that startup-level signals can be centralized before model development. In practice, relevant information may be distributed across multiple holders, in which case privacy, communication cost, and distribution mismatch can become important constraints. Future leakage-aware screening systems may therefore require federated or privacy-preserving learning designs rather than a single centralized pipeline [18]. Fifth, a direct naive-versus-leakage-aware benchmark was not included in the current study. Consequently, the degree of performance inflation that would be produced by random splitting or by unmasked post-$t_0$ features was not quantified explicitly.
Finally, the modeling choice was intentionally simple and interpretable so that the evaluation protocol could be isolated from model complexity. For that reason, the present results should be interpreted as a leakage-aware baseline rather than as the performance ceiling for startup screening. Stronger nonlinear tabular models, such as boosted tree ensembles or neural architectures, remain an important extension for future work.

8. Conclusions

Leakage-aware startup screening has been examined under a time-based evaluation protocol that approximates real early-stage VC and accelerator workflows. A reference time $t_0$ was used to construct features as of the screening moment, an embargo window was applied around the train–test boundary, and explicit masking rules were introduced to reduce the inclusion of post-$t_0$ information. A large proportion of startups were affected by the funding masking rule, which indicated that naive feature construction would likely contain substantial future information. In addition, a pronounced prevalence shift was observed between the overall cohort and the held-out, most recent cohort, which was consistent with right-censoring in prospective evaluation settings.
Screening performance has been reported with metrics aligned to capacity-constrained decision making. Ranking-oriented measures (Lift@50, Precision@50/Recall@50, and NDCG@50) were used alongside PR-AUC, and uncertainty was represented with bootstrap confidence intervals and pairwise bootstrap comparisons. Across signal groups, the strongest PR-AUC was achieved by the MATURITY representation, while the most favorable top-50 shortlisting metrics were obtained by TEAM and ALL. These results suggested that the screening-relevant notion of “best” can depend on the operational objective: broader discrimination under imbalance conditions was supported by MATURITY, while early retrieval at a fixed shortlist size was strengthened when team-related signals were included. Pairwise comparisons further indicated that the best PR-AUC model consistently outperformed FUNDING and GEO representations, while differences versus TEAM and ALL were not statistically stable under resampling conditions.
Probability calibration was also evaluated as a decision-support component. A large improvement in probability reliability was obtained for the best model, as reflected by the reduction in Brier score after sigmoid calibration, while PR-AUC was not materially altered. This pattern supported a practical distinction between ranking quality (which determines which startups enter the shortlist) and probability reliability (which supports threshold selection and stable expected hit-rate planning).
In summary, the contribution of this study should be read as methodological rather than algorithmic. A leakage-aware prospective protocol is specified for screening problems in which information evolves over time and only a small shortlist can be reviewed. The value of the framework is created by reducing look-ahead bias, aligning evaluation with top-K decisions, and improving probability reliability for threshold planning. The present study should therefore be interpreted as an evaluation-oriented baseline and a template for decision interpretation under severe rarity conditions, not as an attempt to optimize predictive performance across model classes or as a claim that the present baseline model is deployment-ready.
For future work, several extensions are suggested. A time-to-event formulation, such as Cox-type hazard modeling or Random Survival Forests, could be used so that censoring is handled explicitly rather than indirectly absorbed through prevalence shift [28,29]. A direct comparison against a leakage-permissive baseline, such as random splitting with unmasked future information, would also be valuable because the inflation caused by look-ahead bias could then be quantified more explicitly. Dynamic decision formulations could also be explored, where sequential updates to signals are treated as an evolving state and shortlisting is framed as a policy problem under capacity and cost constraints. In addition, controlled and timestamp-audited text representations could be introduced to mitigate concerns about post-$t_0$ profile updates. External validation across regions, periods, and data providers would further support generalization claims, and fairness-aware audits could be included to examine whether screening performance and error rates differ systematically across geographies or other groups under the same leakage-aware evaluation protocol.

Author Contributions

Conceptualization, M.K., U.C. and O.D.; methodology, O.D.; formal analysis, O.D.; investigation, M.K. and O.D.; writing—original draft preparation, M.K., U.C. and O.D.; writing—review and editing, M.K., U.C. and O.D.; supervision, U.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the conclusions of this article can be made available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kellekci, M.; Cebeci, U.; Dogan, O. A Methodological Framework for Venture Capital Investment Decision Making Using Hybrid MCDM and Machine Learning. In Innovative Approaches to AI-Supported Hybrid and Remote Workplaces; IGI Global Scientific Publishing: Hershey, PA, USA, 2026; pp. 93–124.
  2. Sasse, L.; Nicolaisen-Sobesky, E.; Dukart, J.; Eickhoff, S.B.; Götz, M.; Hamdan, S.; Komeyer, V.; Kulkarni, A.; Lahnakoski, J.M.; Love, B.C.; et al. Overview of leakage scenarios in supervised machine learning. J. Big Data 2025, 12, 135.
  3. Apicella, A.; Isgrò, F.; Prevete, R. Don’t push the button! Exploring data leakage risks in machine learning and transfer learning. Artif. Intell. Rev. 2025, 58, 339.
  4. Valcarce, D.; Bellogín, A.; Parapar, J.; Castells, P. Assessing ranking metrics in top-N recommendation. Inf. Retr. J. 2020, 23, 411–448.
  5. Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 141, 110415.
  6. Silva Filho, T.; Song, H.; Perello-Nieto, M.; Santos-Rodríguez, R.; Kull, M.; Flach, P. Classifier calibration: A survey on how to assess and improve predicted class probabilities. Mach. Learn. 2023, 112, 3211–3260.
  7. Żbikowski, K.; Antosiuk, P. A machine learning, bias-free approach for predicting business success using Crunchbase data. Inf. Process. Manag. 2021, 58, 102555.
  8. Ross, G.; Das, S.; Sciro, D.; Raza, H. CapitalVX: A machine learning model for startup selection and exit prediction. J. Financ. Data Sci. 2021, 7, 94–114.
  9. Corea, F.; Bertinetti, G.; Cervellati, E.M. Hacking the venture industry: An early-stage startups investment framework for data-driven investors. Mach. Learn. Appl. 2021, 5, 100062.
  10. Te, Y.F.; Wieland, M.; Frey, M.; Pyatigorskaya, A.; Schiffer, P.; Grabner, H. Making it into a successful series A funding: An analysis of Crunchbase and LinkedIn data. J. Financ. Data Sci. 2023, 9, 100099.
  11. Kim, J.; Kim, H.; Geum, Y. How to succeed in the market? Predicting startup success using a machine learning approach. Technol. Forecast. Soc. Change 2023, 193, 122614.
  12. Moritz, A.; Diegel, W.; Block, J.; Fisch, C. VC investors’ venture screening: The role of the decision maker’s education and experience. J. Bus. Econ. 2022, 92, 27–63.
  13. Roche, M.P.; Conti, A.; Rothaermel, F.T. Different founders, different venture outcomes: A comparative analysis of academic and non-academic startups. Res. Policy 2020, 49, 104062.
  14. Wang, W.; Chen, W.; Zhu, K.; Wang, H. Emphasizing the entrepreneur or the idea? The impact of text content emphasis on investment decisions in crowdfunding. Decis. Support Syst. 2020, 136, 113341.
  15. Maarouf, A.; Feuerriegel, S.; Pröllochs, N. A fused large language model for predicting startup success. Eur. J. Oper. Res. 2025, 322, 198–214.
  16. Climent, F.; Momparler, A.; Carmona, P. Anticipating bank distress in the Eurozone: An Extreme Gradient Boosting approach. J. Bus. Res. 2019, 101, 885–896.
  17. Moscatelli, M.; Parlapiano, F.; Narizzano, S.; Viggiano, G. Corporate default forecasting with machine learning. Expert Syst. Appl. 2020, 161, 113567.
  18. Yang, B.; Lei, Y.; Li, N.; Li, X.; Si, X.; Chen, C. Balance recovery and collaborative adaptation approach for federated fault diagnosis of inconsistent machine groups. Knowl.-Based Syst. 2025, 317, 113480.
  19. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
  20. Richardson, E.; Trevizani, R.; Greenbaum, J.A.; Carter, H.; Nielsen, M.; Peters, B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns 2024, 5, 100994.
  21. Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778.
  22. De Bock, K.W.; Coussement, K.; Lessmann, S. Cost-sensitive business failure prediction when misclassification costs are uncertain: A heterogeneous ensemble selection approach. Eur. J. Oper. Res. 2020, 285, 612–630.
  23. Borchert, P.; Coussement, K.; De Caigny, A.; De Weerdt, J. Extending business failure prediction models with textual website content using deep learning. Eur. J. Oper. Res. 2023, 306, 348–357.
  24. Mai, F.; Tian, S.; Lee, C.; Ma, L. Deep learning models for bankruptcy prediction using textual disclosures. Eur. J. Oper. Res. 2019, 274, 743–758.
  25. De Bock, K.W.; Coussement, K.; De Caigny, A.; Słowiński, R.; Baesens, B.; Boute, R.N.; Choi, T.-M.; Delen, D.; Kraus, M.; Lessmann, S.; et al. Explainable AI for Operational Research: A defining framework, methods, applications, and a research agenda. Eur. J. Oper. Res. 2024, 317, 249–272.
  26. Bussmann, N.; Giudici, P.; Marinelli, D.; Papenbrock, J. Explainable Machine Learning in Credit Risk Management. Comput. Econ. 2021, 57, 203–216.
  27. Noriega, J.P.; Rivera, L.A.; Herrera, J.A. Machine Learning for Credit Risk Prediction: A Systematic Literature Review. Data 2023, 8, 169.
  28. Rasool, A.; Tao, R.; Kamyab, M.; Hayat, S. GAWA—A feature selection method for hybrid sentiment classification. IEEE Access 2020, 8, 191850–191861.
  29. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. 2008, 2, 841–860.
Figure 1. Model performance across feature categories with the time-based split; (a) PR-AUC across feature categories with the time-based split; (b) lift across feature categories; (c) Precision@50 across feature categories; (d) Recall@50 across feature categories; (e) NDCG@50 across feature categories.
Figure 2. Calibration comparison for the best group (MATURITY). Observed positive rates were plotted against predicted probabilities. The y-axis was limited to the observed range so that the differences between methods were visible.
Table 1. Cohort summary for the time-based evaluation.
Number of startups | 117,141
Overall positive rate | 0.07683
Test positive rate | 0.00956
$t_0$ minimum | 25 May 2007
$t_0$ maximum | 28 February 2023
Maximum age at $t_0$ (years) | 3
Embargo (days) | 180
Train size | 88,303
Test size | 23,429
Table 2. Funding masking summary for leakage control.
Startups with funding fields | 117,141
Startups with masked funding | 68,368
Masked funding rate | 0.584
Table 3. Censoring-aware robustness analysis with Cox models.
Representation | C-Index | AUC@12m | AUC@24m | AUC@36m
TEAM | 0.668 | 0.721 | 0.690 | 0.661
MATURITY | 0.676 | 0.701 | 0.694 | 0.682
ALL | 0.701 | 0.734 | 0.718 | 0.689
Table 4. Pairwise PR-AUC differences against the best PR-AUC model (MATURITY). One-sided bootstrap p-values are reported. Positive values indicate that the best model achieved a higher PR-AUC.
Compared group | Δ PR-AUC (mean [lo, hi]) | p
FUNDING | 0.0073 [0.0021, 0.0157] | 0.003
GEO | 0.0067 [0.0019, 0.0153] | 0.000
MARKET | 0.0045 [−0.0008, 0.0116] | 0.040
TEAM | 0.0024 [−0.0076, 0.0144] | 0.300
ALL | 0.0017 [−0.0093, 0.0114] | 0.373
Table 5. Additional nonlinear baselines under the same leakage-aware protocol.
Representation | Model | PR-AUC | Precision@50 | Recall@50 | NDCG@50
MATURITY | Logistic regression | 0.0144 | 0.020 | 0.0045 | 0.066
MATURITY | LightGBM | 0.0151 | 0.028 | 0.0063 | 0.074
MATURITY | MLP | 0.0147 | 0.024 | 0.0054 | 0.070
TEAM | Logistic regression | 0.0120 | 0.062 | 0.0138 | 0.091
TEAM | LightGBM | 0.0133 | 0.070 | 0.0156 | 0.100
TEAM | MLP | 0.0129 | 0.068 | 0.0152 | 0.097
ALL | Logistic regression | 0.0127 | 0.060 | 0.0134 | 0.094
ALL | LightGBM | 0.0156 | 0.084 | 0.0188 | 0.115
ALL | MLP | 0.0150 | 0.078 | 0.0174 | 0.108
Table 6. Naive versus leakage-aware evaluation.
Representation | Protocol | PR-AUC | Lift@50 | Precision@50 | Recall@50 | NDCG@50
ALL | Naive random split + no masking | 0.0718 | 29.3 | 0.280 | 0.0625 | 0.247
ALL | Time split + embargo + masking | 0.0127 | 6.28 | 0.060 | 0.0134 | 0.094
MATURITY | Naive random split + no masking | 0.0649 | 23.0 | 0.220 | 0.0491 | 0.215
MATURITY | Time split + embargo + masking | 0.0144 | 2.09 | 0.020 | 0.0045 | 0.066
Table 7. Calibration results for the best group (MATURITY).
Method | PR-AUC | Brier score
Uncalibrated | 0.0144 | 0.0972
Sigmoid | 0.0144 | 0.0161
Isotonic | 0.0138 | 0.0169
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
