1. Introduction
Many organizations must screen large candidate pools and decide which few items deserve deep review. Typical examples include grant proposals, clinical risk triage, fraud alerts, and early-stage firm databases [1]. The available information is often incomplete and updated over time, while the target outcomes are rare. When historical data are collected after key events, post-event updates can unintentionally leak into the feature set, and random train–test splits can produce overly optimistic estimates. Leakage has recently been summarized as a common and subtle failure mode across supervised learning workflows, especially when temporal dependence is present [2,3].
This study provides a leakage-aware evaluation framework for capacity-constrained screening, formulated as a top-K ranking problem. For each entity, inputs are constructed strictly as of a reference time; a 180-day temporal embargo is applied around the train–test boundary; and performance is assessed with time-based splits. Because operational value is created near the top of the ranked list, results are reported with PR-AUC and decision-aligned ranking metrics (Lift@K, Precision@K/Recall@K, and NDCG@K) together with bootstrap confidence intervals [4,5]. To support threshold planning when scores are interpreted as probabilities, post hoc calibration is also evaluated [6]. The framework is demonstrated on a large dataset of early-stage firms with exit outcomes, but it is directly applicable to other time-dependent screening tasks with severe imbalance and limited review capacity. No new ranking loss, calibration algorithm, or leakage theorem is introduced in this study. Instead, the methodological contribution lies in integrating strict as-of feature construction, an explicit temporal embargo, and decision-aligned top-K evaluation within a single prospective protocol. The main insight is that, under severe imbalance and limited review capacity, the estimated value of a screening model changes materially when both temporal validity and shortlist quality are treated as first-order design constraints.
A central challenge arises from the temporal nature of startup data. Many attributes and profiles are updated over time, and platform records are often enriched after key events have occurred. If information that becomes available after the screening time is inadvertently included in the feature set, optimistic performance estimates can result. In the startup domain, this concern has been noted explicitly: the use of variables that are consequences of success (or failure) has been shown to bias evaluation and create look-ahead effects [7]. In the broader ML literature, leakage has been discussed as a family of failure modes that can occur during data preparation, model selection, and evaluation; temporal dependence and repeated testing have been highlighted as common drivers of overly optimistic results [2,3]. For VC screening, these issues are particularly important because screening is performed under strict time constraints and because the decision-relevant question concerns what can be inferred from information that is actually available at the screening moment.
A second gap has been observed in the alignment between evaluation metrics and practical screening decisions. In many operational settings, only a small shortlist can be reviewed in depth, and the primary objective is to prioritize the most promising candidates rather than to obtain a well-separated score distribution across the full population. In such settings, rank-based evaluation is often more informative than accuracy-like metrics. The use of top-N evaluation and ranking measures has been extensively discussed in adjacent fields such as recommender systems, where robustness and discriminative power of ranking metrics have been analyzed with realistic data incompleteness [4]. Because VC screening is similarly constrained by limited review capacity, an evaluation design that emphasizes top-K performance is better aligned with the decision setting than an evaluation design that focuses only on global classification summaries.
In addition to ranking quality, the reliability of predicted probabilities can matter when shortlist sizes or decision thresholds are planned. If a score is interpreted as a probability, miscalibration can lead to unstable expectations about how many shortlisted companies are likely to meet a target outcome. Classifier calibration has therefore been positioned as a core component for risk-aware decision making and for cost-sensitive applications [6]. In practice, probability reliability can support planning decisions such as how many opportunities should be reviewed to reach a target expected number of successful cases, or how a threshold should be chosen under changing base rate conditions.
This study addresses these gaps by examining early-stage screening under an evaluation protocol designed to reduce time leakage and to reflect the top-K nature of screening. A large startup dataset (over 100k observations) was used, and startup-level models were constructed using only signals that are observed before a reference time. An embargo window was applied so that features that may carry post-reference information are excluded. Performance was evaluated using time-based splits, and ranking quality was emphasized because screening is treated as a top-K decision problem. Rank-focused results are reported using PR-AUC, lift, Precision@K/Recall@K, and NDCG@K, together with uncertainty estimates obtained via bootstrap resampling. In addition, probability calibration was evaluated to assess whether more reliable shortlist hit-rate planning can be supported. Finally, signal groups were compared so that the relative usefulness of different early-stage information types can be identified under a leakage-aware protocol, which limits look-ahead bias by restricting inputs to information available at the reference time and excluding near-boundary observations via an embargo.
The main contributions of the study are summarized as follows:
A leakage-aware evaluation protocol is provided for startup screening, where reference-time-based feature construction and an embargo window are used to reduce the inclusion of future information.
A screening-aligned evaluation is presented, where top-K ranking quality is emphasized using lift and ranking metrics in addition to imbalanced-class summaries.
Uncertainty in performance estimates is reported via bootstrap confidence intervals so that metric variability under strong class imbalance conditions is made visible.
Probability calibration is evaluated as a decision-support component, with the aim of improving the reliability of threshold-based shortlisting.
Signal groups are compared under the same leakage-aware protocol, enabling a practical discussion of which early-stage information types provide the strongest screening value.
The remainder of the paper is organized as follows. Section 2 reviews the related literature with attention given to startup success prediction, VC decision support, and leakage-aware evaluation. Section 3 describes the data construction, label definition, and signal grouping, followed by the modeling and evaluation protocol in Section 4. Section 5 presents the methodology, and Section 6 reports the empirical results for ranking quality, calibration, and signal-group comparisons. Section 7 discusses the experimental results, limitations, and directions for further work. Finally, Section 8 concludes the paper.
3. Data and Label Definition
3.1. Data Integration and Startup-Level Representation
A startup-level dataset was constructed by integrating multiple relational tables and aggregating records at the firm level. A unique identifier was assigned to each startup so that funding events, team records, market attributes, and categorical descriptors could be combined into a single representation. Temporal fields (e.g., founding year, record creation timestamps, and funding dates) were retained so that time-aware features could be produced and evaluated with a realistic chronological design.
3.2. Cohort Construction at a Reference Time
The temporal embargo was fixed at 180 days before model evaluation and was not tuned on the test set. A six-month gap was selected as a conservative compromise because startup databases can reflect delayed funding updates, profile edits, and platform-side enrichments near the train–test boundary. By excluding observations too close to the test period, the protocol was intended to reduce boundary contamination from near-term information arrival while preserving enough older cases for model estimation. The selected value should therefore be interpreted as a design choice for leakage control rather than as an empirically optimized hyperparameter.
An early-stage cohort was defined by introducing a startup-specific reference time and restricting the cohort to firms that were at most three years old at that time. For each startup, the feature vector was constructed “as of” the reference time, and information occurring after it was excluded whenever it could introduce a look-ahead advantage. A chronological split was applied so that the most recent startups (by reference time) were reserved for testing, while earlier cohorts were used for model development. In addition, a temporal embargo was applied so that training examples close to the test boundary were not used, which reduced spurious gains caused by near-boundary information overlap.
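The chronological split and embargo described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `(startup_id, reference_date)` row layout, the function name, the choice of the earliest test reference date as the boundary, and the strict inequality at the cutoff are all assumptions made for illustration.

```python
from datetime import date, timedelta


def time_split_with_embargo(rows, test_fraction=0.2, embargo_days=180):
    """Chronological split with a one-sided embargo on the training side.

    rows: list of (startup_id, reference_date) tuples (hypothetical layout).
    Returns (train, test, boundary).
    """
    ordered = sorted(rows, key=lambda r: r[1])
    n_test = max(1, int(len(ordered) * test_fraction))
    test = ordered[-n_test:]
    # Assumption: the boundary is the earliest reference date in the test set.
    boundary = test[0][1]
    cutoff = boundary - timedelta(days=embargo_days)
    # Keep only training rows strictly older than the embargo cutoff.
    train = [r for r in ordered[:-n_test] if r[1] < cutoff]
    return train, test, boundary


# Ten toy startups whose reference dates are 100 days apart.
start = date(2010, 1, 1)
rows = [(i, start + timedelta(days=100 * i)) for i in range(10)]
train, test, boundary = time_split_with_embargo(rows)
print(len(train), len(test), boundary)
```

In this toy example the two most recent startups form the test set, and the one training candidate that falls within 180 days of the boundary is dropped rather than moved to the test set.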
3.3. Outcome Label Definition and Class Imbalance
A binary outcome label was defined from observed exit events. The positive class was assigned when a successful exit was observed (acquisition or initial public offering), while the negative class was assigned when no exit was observed within the available observation window (including operating and inactive firms). This construction induced a strong class imbalance by design because exit events were rare relative to the population of early-stage ventures.
An additional challenge was introduced by right-censoring. Because the test set was constructed from the most recent reference times, a substantial share of firms in the test cohort had not yet had enough time to realize an exit. As a result, the observed positive rate was reduced in the test cohort relative to the overall cohort. This temporal mismatch was treated as a realistic property of prospective screening rather than as a sampling artifact, and it was explicitly reflected in the evaluation design.
3.4. Signal Groups and Feature Construction
Input variables were organized into conceptually distinct signal groups so that the marginal value of different information types could be assessed under the same leakage-aware protocol. The following group structure was used:
FUNDING: early financing signals (e.g., funding stage/type indicators and aggregated funding amounts), constructed as of the reference time.
GEO: geographic rank indicators (e.g., region and city ranks) used as coarse location signals.
MARKET: sector or industry indicators used to represent market positioning.
TEAM: team composition indicators (e.g., role and education distributions) derived from structured team records.
MATURITY: early maturity proxies (e.g., founding-related attributes and short descriptive text fields) available at the reference time.
ALL: a combined representation where signals from all groups were provided jointly.
The grouping allowed both single-category models and combined models to be evaluated with identical time controls. The category-specific models should not be interpreted as implying independence between signal groups. The grouping was used as a diagnostic ablation design so that the marginal screening value of each information type could be compared under identical leakage controls. In practice, substantial cross-group correlation may exist (e.g., between market, team, and funding variables); for that reason, the ALL specification was retained as the joint model in which correlated signals could contribute simultaneously. The category-specific results should therefore be read as controlled comparisons of information value, not as a claim that real screening signals are separable in deployment.
3.5. Leakage Controls and Masking Rules
Temporal leakage controls were implemented at feature-construction time. When a feature could be affected by information occurring after the reference time, the post-reference contribution was removed or masked. For funding-related fields, a masking rule was applied when the last observed funding date occurred after the reference time so that post-reference financing information was prevented from inflating early-stage screening performance. This masking was applied systematically and produced a substantial masking rate, which was interpreted as evidence that naive feature construction would otherwise contain a large amount of post-reference information.
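A minimal sketch of the funding masking rule is given below. The record layout and the field names (`funding_total`, `funding_rounds`, `last_funding_date`, `reference_date`) are hypothetical and are used only to make the rule concrete.

```python
from datetime import date

# Hypothetical names for the funding-related feature columns.
FUNDING_FIELDS = ("funding_total", "funding_rounds")


def mask_funding(record):
    """Return a copy of the record with funding fields masked (None) when
    the last observed funding date is later than the reference date."""
    masked = dict(record)
    last = record.get("last_funding_date")
    if last is not None and last > record["reference_date"]:
        for field in FUNDING_FIELDS:
            masked[field] = None
    return masked


startups = [
    {"startup_id": 1, "reference_date": date(2012, 1, 1),
     "last_funding_date": date(2011, 6, 1),
     "funding_total": 500000.0, "funding_rounds": 1},
    {"startup_id": 2, "reference_date": date(2012, 1, 1),
     "last_funding_date": date(2013, 3, 1),
     "funding_total": 4000000.0, "funding_rounds": 3},
    {"startup_id": 3, "reference_date": date(2012, 1, 1),
     "last_funding_date": None,
     "funding_total": 0.0, "funding_rounds": 0},
]

masked = [mask_funding(s) for s in startups]
# Share of startups whose funding features were masked.
masking_rate = sum(m["funding_total"] is None for m in masked) / len(masked)
print(masking_rate)
```

Only the second toy startup is masked, because its last funding date falls after its reference date; startups with no funding record are left unchanged.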
5. Methodology
5.1. Overview of the Evaluation Pipeline
A leakage-aware pipeline was implemented to approximate real early-stage screening. For each startup, a reference time was defined, and all inputs were constructed as of that time. A maximum firm age of three years at the reference time was enforced. A chronological split was applied, and the most recent 20% of startups by reference time were reserved for testing. A temporal embargo of 180 days was applied around the train–test boundary so that training instances close to the test period were excluded. The resulting cohort contained 117,141 startups, with 88,303 training instances and 23,429 test instances.
A set of category-specific models was evaluated so that the marginal screening value of different signal groups could be compared under the same leakage controls. Because screening is operationally constrained, ranking-oriented measures were emphasized. Uncertainty was quantified with bootstrap confidence intervals, and pairwise bootstrap comparisons were conducted against the best PR-AUC model. In addition, probability calibration was evaluated for the best PR-AUC model so that the reliability of threshold-based shortlisting could be assessed.
5.2. Feature Construction at the Reference Time and Leakage Controls
All features were constructed to reflect information available at the reference time. Post-reference information was excluded whenever it could distort screening realism. In particular, a funding leakage control was applied as follows: when the last observed funding date occurred after the reference time, funding-related values were masked so that post-reference financing information was not used in model fitting or evaluation. This masking rule was applied systematically, and a large masked-funding rate was observed, which indicated that naive feature construction would otherwise contain substantial future information.
Input variables were organized into signal groups (e.g., FUNDING, GEO, MARKET, TEAM, and MATURITY) and a combined representation (ALL). For each group, a separate model was fit using only the variables in that group. This design supported a controlled comparison of signal value under identical temporal constraints.
The leakage-aware cohort construction and the associated masking and embargo rules are summarized in Algorithm 1. Temporal leakage risks in startup data have been discussed as a major source of overly optimistic evaluation and are addressed here through explicit time controls [2,3,7].
| Algorithm 1 Leakage-aware cohort construction at the reference time with embargo and masking |
- Require: Startup table (startup_id, founded_date, created_at, …); funding events (startup_id, funding_date, amount, …); team records; market/sector table; parameters: max_age_years, test fraction, embargo days.
- Ensure: Train set and test set with features built as of the reference time.
- 1: Define reference time: for each startup i, set its reference time.
- 2: Early-stage filter: keep startups whose age at the reference time does not exceed max_age_years.
- 3: Compute last funding date: for each startup i, set its last observed funding date (if any).
- 4: As-of funding aggregates:
- 5: for all startups i do
- 6–9: compute the funding aggregates of startup i as of its reference time (the last aggregate only if defined)
- 10: end for
- 11: Funding leakage masking:
- 12: for all startups i do
- 13: if the last funding date of i exists and is later than its reference time then
- 14: Set funding-related features of i to missing (or masked).
- 15: end if
- 16: end for
- 17: Aggregate other signal groups as of the reference time:
- 18: Construct TEAM features from the team records at startup level (e.g., counts/shares).
- 19: Merge GEO/MARKET features from the corresponding tables (startup-level).
- 20: Construct MATURITY proxies available by the reference time (e.g., founding attributes, short description fields).
- 21: Combine all features into a single table keyed by startup_id and retaining the reference time.
- 22: Assign labels: set the label to positive if an exit event is observed; else set it to negative.
- 23: Time split: sort by reference time ascending. Let the test set be the most recent test fraction by reference time; let the train set be the remainder.
- 24: Embargo: let the boundary be the train–test split time and let the embargo length be the embargo days parameter.
- 25: Remove from the train set any startup whose reference time lies within the embargo length of the boundary.
- 26: return the train and test sets (and feature subsets per signal group if required).
|
5.3. Model Specification and Training
A linear probabilistic classifier was used so that (i) ranking scores could be produced directly, and (ii) coefficient-based inspection could be conducted alongside permutation-based importance. Logistic regression was selected for this purpose. Regularization was used to limit overfitting under high-dimensionality conditions, and imbalance handling was applied so that learning was not dominated by the negative class. For each signal group, a separate classifier was trained on the training cohort, and predicted scores were generated for the test cohort. The resulting scores were interpreted as screening scores for ranking and, when calibrated, as approximate probabilities.
5.4. Time-Based Evaluation and Metrics
Evaluation was performed on the held-out, most recent test cohort. Because exit outcomes were rare and the test cohort was additionally affected by right-censoring, precision–recall-based summaries were treated as primary. PR-AUC was used to quantify discrimination under imbalance conditions. Ranking performance was evaluated with Precision@K, Recall@K, and NDCG@K, where K represented a fixed shortlist capacity. Lift was reported to express concentration of positive outcomes near the top of the ranked list relative to the test positive rate. A consistent cutoff definition was applied across models so that lift values could be compared directly.
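The top-K metrics named above can be written compactly. The sketch below uses common textbook definitions with binary relevance; the function names are illustrative, and the exact cutoff and tie-handling conventions used in the study are not specified here, so these should be read as assumptions.

```python
import math


def top_k_labels(scores, labels, k):
    """Labels of the k highest-scored instances (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order[:k]]


def precision_at_k(scores, labels, k):
    return sum(top_k_labels(scores, labels, k)) / k


def recall_at_k(scores, labels, k):
    positives = sum(labels)
    return sum(top_k_labels(scores, labels, k)) / positives if positives else 0.0


def lift_at_k(scores, labels, k):
    """Precision@K divided by the overall positive rate."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate if base_rate else 0.0


def ndcg_at_k(scores, labels, k):
    """Binary-relevance NDCG with a log2 position discount."""
    gains = top_k_labels(scores, labels, k)
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Toy ranking: ten candidates, two true exits, shortlist of two.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_at_k(scores, labels, 2))  # 0.5
print(recall_at_k(scores, labels, 2))     # 0.5
print(lift_at_k(scores, labels, 2))
print(ndcg_at_k(scores, labels, 2))
```

With a base rate of 0.2 and one exit among the top two, the lift is 2.5, which shows how lift expresses shortlist concentration relative to random selection.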
5.5. Bootstrap Uncertainty and Pairwise Comparisons
Metric uncertainty was quantified with nonparametric bootstrap resampling on the test set. For each bootstrap replicate, test instances were sampled with replacement, scores and labels were re-indexed, and metrics were recomputed. The 2.5th and 97.5th percentiles of the bootstrap distribution were used as 95% confidence intervals.
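As a sketch of the percentile bootstrap described above, the following code resamples test instances with replacement and takes the 2.5th and 97.5th percentiles of the recomputed metric. The toy data, the seed, the number of replicates, and the linear-interpolation percentile are illustrative assumptions.

```python
import random


def precision_at_k(scores, labels, k):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


def percentile(sorted_values, q):
    """Linearly interpolated percentile of an already sorted list (q in 0..100)."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def bootstrap_ci(scores, labels, metric, n_boot=1000, seed=0):
    """95% percentile interval of a metric under test-set resampling."""
    rng = random.Random(seed)
    n = len(labels)
    replicates = []
    for _ in range(n_boot):
        # Resample instances with replacement, keeping score/label pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(metric([scores[i] for i in idx],
                                 [labels[i] for i in idx]))
    replicates.sort()
    return percentile(replicates, 2.5), percentile(replicates, 97.5)


# Toy test set: 200 instances, 10 positives that tend to score higher.
data_rng = random.Random(1)
labels = [1] * 10 + [0] * 190
scores = [data_rng.random() + 0.5 * y for y in labels]
low, high = bootstrap_ci(scores, labels,
                         lambda s, y: precision_at_k(s, y, 20))
print(low, high)
```

Because so few positives are present, the resulting interval is typically wide, which mirrors the variability reported for top-K metrics under strong imbalance.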
In addition, pairwise bootstrap comparisons were conducted against the best PR-AUC model. For each replicate, the PR-AUC difference between the best model and a comparison model was computed. One-sided bootstrap p-values were computed as the fraction of replicates where the observed difference was non-positive, which provided an interpretable measure of whether the best model consistently dominated the alternative under resampling conditions.
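The pairwise comparison can be sketched as a paired bootstrap in which both models are scored on the same resampled instances. Average precision is used here as a common estimator of PR-AUC; that choice, the function names, and the toy scores are assumptions rather than details confirmed by the study.

```python
import random


def average_precision(scores, labels):
    """Average precision, one common estimator of PR-AUC."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def paired_bootstrap(best_scores, other_scores, labels, n_boot=500, seed=0):
    """Mean PR-AUC difference (best minus other) and one-sided p-value."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        # The same resampled indices are used for both models (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        best_ap = average_precision([best_scores[i] for i in idx], y)
        other_ap = average_precision([other_scores[i] for i in idx], y)
        diffs.append(best_ap - other_ap)
    # Fraction of replicates in which the best model does not win.
    p_value = sum(d <= 0 for d in diffs) / n_boot
    return sum(diffs) / n_boot, p_value


labels = [1] * 8 + [0] * 32
best_scores = [1.0 - i / 40 for i in range(40)]   # positives ranked first
other_scores = [i / 40 for i in range(40)]        # positives ranked last
mean_diff, p_value = paired_bootstrap(best_scores, other_scores, labels)
same_diff, same_p = paired_bootstrap(best_scores, best_scores, labels)
print(mean_diff, p_value, same_diff, same_p)
```

Comparing a model with itself yields a difference of zero in every replicate and therefore a p-value of 1, which is a useful sanity check on the non-positive convention.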
5.6. Probability Calibration
Post hoc calibration was evaluated for a single representative model to keep the analysis focused and comparable. The best-performing model was defined as the model with the highest PR-AUC on the held-out test set. This model was then calibrated using monotone post hoc mappings (sigmoid and isotonic). Calibration quality was assessed with calibration curves and the Brier score. Ranking performance was not expected to improve after calibration because a monotone mapping preserves the ordering of scores up to ties, while probability reliability can still improve substantially.
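The interaction between a monotone mapping, the Brier score, and the ranking can be seen in a small isotonic-style sketch based on the pool-adjacent-violators procedure. This is an illustrative stand-in, not the calibration code used in the study: the toy scores are invented, ties in the scores are not handled, and the mapping is fit and scored on the same data, so the improvement shown is in-sample only (the study fits the mapping on training folds and applies it to the test set).

```python
def brier_score(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns [(upper_score, value), ...].

    Assumes distinct scores (ties are not handled in this sketch).
    """
    blocks = []  # each block: [label_sum, count, upper_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge while the previous block's mean is not below the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]


def predict_isotonic(model, score):
    """Step-function lookup: value of the first block covering the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


cal_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
cal_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
model = fit_isotonic(cal_scores, cal_labels)
calibrated = [predict_isotonic(model, s) for s in cal_scores]
raw_brier = brier_score(cal_scores, cal_labels)
cal_brier = brier_score(calibrated, cal_labels)
print(model)
print(raw_brier, cal_brier)
```

The calibrated values are non-decreasing in the raw score, so no pair of instances is reordered; several instances become tied, which is why ranking metrics can change slightly even though the mapping is monotone.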
The end-to-end training and evaluation procedure (group-wise modeling, ranking metrics, bootstrap confidence intervals, pairwise bootstrap tests, and post hoc calibration) is summarized in Algorithm 2. Ranking-focused evaluation choices are aligned with top-K screening settings [4], while probability calibration is used to improve the reliability of threshold-based planning [6].
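One way calibrated probabilities can feed threshold-based planning is sketched below: if the scores are well calibrated, the expected number of successes in a shortlist is the sum of its probabilities, so the smallest shortlist that reaches a target can be read off the ranked list. The function name and the toy probabilities are hypothetical, and the calculation is only as reliable as the calibration itself.

```python
def shortlist_size_for_target(probabilities, target_hits):
    """Smallest top-ranked shortlist whose expected number of successes
    (sum of calibrated probabilities) reaches target_hits, or None."""
    ranked = sorted(probabilities, reverse=True)
    expected = 0.0
    for size, p in enumerate(ranked, start=1):
        expected += p
        if expected >= target_hits:
            return size
    return None


calibrated = [0.05, 0.5, 0.2, 0.4, 0.1, 0.3]
print(shortlist_size_for_target(calibrated, 1.0))  # 3
print(shortlist_size_for_target(calibrated, 2.0))  # None
```

In the toy example three candidates are needed for one expected success, and no shortlist reaches two because the probabilities sum to only 1.55.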
5.7. Feature Importance Analysis
Two complementary importance analyses were applied within each signal group:
Permutation importance: Test-set importance was quantified as the drop in PR-AUC after a single input feature was randomly permuted, with bootstrap resampling used to represent uncertainty.
Coefficient-based importance: Absolute logistic regression coefficients were aggregated to the raw-variable level so that the strongest weighted inputs could be identified using the fitted linear decision function.
Permutation-based estimates were treated as test-time, performance-grounded importance, while coefficient-based estimates were treated as complementary evidence that reflects fitted weight magnitude under regularization conditions.
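The permutation analysis can be sketched with a fixed linear scorer. Average precision stands in for PR-AUC, and repeated shuffles are used in place of the bootstrap resampling described above; the weights, the data, and the function names are illustrative assumptions.

```python
import random


def average_precision(scores, labels):
    """Average precision, one common estimator of PR-AUC."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def linear_scores(rows, weights):
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def permutation_importance(rows, labels, weights, feature,
                           n_repeats=20, seed=0):
    """Mean drop in average precision after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = average_precision(linear_scores(rows, weights), labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        # Rebuild the rows with only the chosen feature permuted.
        permuted = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline
                     - average_precision(linear_scores(permuted, weights), labels))
    return sum(drops) / n_repeats


labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Feature 0 separates the classes; feature 1 is ignored by the scorer.
rows = [[float(y), i / 10] for i, y in enumerate(labels)]
weights = [2.0, 0.0]
importance_0 = permutation_importance(rows, labels, weights, feature=0)
importance_1 = permutation_importance(rows, labels, weights, feature=1)
print(importance_0, importance_1)
```

A feature with zero weight has exactly zero permutation importance in this setup, whereas a large coefficient on a correlated or redundant feature would not guarantee a large performance drop, which is why the two analyses are treated as complementary.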
| Algorithm 2 Signal-group model training, ranking evaluation, bootstrap CI, and calibration |
- Require: Training set, test set; signal groups; shortlist size K; bootstrap replicates B.
- Ensure: Per-group metrics with confidence intervals; best group by PR-AUC; pairwise bootstrap comparisons; calibrated probabilities for best group.
- 1: for all groups g do
- 2: Extract group features: select the columns of group g from the training and test sets; set the corresponding labels.
- 3: Fit the model of group g on the training features and labels.
- 4: Compute scores for the test instances.
- 5: Compute base metrics on test: PR-AUC, lift, Precision@K, Recall@K, and NDCG@K.
- 6: end for
- 7: Select best group: the group with the highest test PR-AUC.
- 8: Bootstrap confidence intervals (test resampling):
- 9: for b = 1 to B do
- 10: Sample indices from the test set with replacement.
- 11: for all groups g do
- 12: Compute the PR-AUC of group g on the resampled scores and labels.
- 13: Optionally compute the other metrics similarly.
- 14: end for
- 15: end for
- 16: For each group g, report CI as empirical quantiles: the 2.5th and 97.5th percentiles of the bootstrap PR-AUC values (and analogues for other metrics).
- 17: Pairwise bootstrap comparison vs best:
- 18: for all comparison groups g do
- 19: Compute the PR-AUC difference between the best group and group g for all b.
- 20: Compute one-sided bootstrap p-value: the fraction of replicates in which this difference is non-positive.
- 21: Report the mean difference and its CI from the bootstrap differences.
- 22: end for
- 23: Calibration for the best group (post hoc):
- 24: Refit the best-group model on the training set (group features).
- 25: Fit sigmoid calibration mapping (e.g., Platt scaling) on training folds; produce calibrated probabilities on the test set.
- 26: Fit isotonic calibration mapping on training folds; produce calibrated probabilities on the test set.
- 27: Compute Brier scores and (optional) calibration curve summaries for each set of calibrated probabilities.
- 28: return Metrics with CI; the best group; pairwise comparison results; calibrated results for the best group.
|
6. Experimental Results
A large-scale startup-level dataset was used in this study. The dataset contained structured information on company characteristics, funding history, geographic location, market sectors, and team attributes, and it was constructed by integrating multiple relational tables at the startup level. Each startup was associated with a unique identifier, which allowed firm-level aggregation of funding events, team records, and categorical attributes. Temporal information was available for key events, such as founding year, funding dates, and company creation timestamps, which enabled the construction of time-aware features and the application of strict temporal evaluation protocols.
The dataset covered startups founded across multiple countries and regions and spanned a long observation window, allowing both cross-sectional and temporal variation to be analyzed. Outcome labels were defined based on observed exit events, including acquisitions and initial public offerings, while non-exit cases included operating and inactive firms. To reduce information leakage, all features were constructed as of a reference time point for each startup, and post-reference information was excluded or masked when necessary. The resulting dataset supported a realistic evaluation of early-stage startup outcomes under strong class imbalance and temporal drift conditions.
6.1. Cohort Construction and Leakage Controls
An early-stage startup cohort was constructed, and outcomes were defined as exit events (acquisition or IPO). A maximum age of three years at the reference time was enforced. A time-based split was applied, and the most recent 20% of startups by reference time were reserved for testing. A temporal embargo of 180 days was applied so that training examples close to the test boundary were not used. The resulting cohort size and outcome prevalence are reported in Table 1. The test positive rate was much lower than the overall positive rate; this pattern was expected because exits accumulate over time and right-censoring is more likely in the most recent cohort.
Funding leakage control was applied by masking funding-related values when the last funding date was after the reference time. A large masking rate was observed (Table 2). This control was included so that post-reference information was not used for prediction and performance estimates were not inflated by temporal leakage.
6.2. Censoring-Aware Robustness Analysis
Because the held-out cohort contains right-censored firms, an additional survival-oriented analysis was conducted using Cox proportional hazards models on the TEAM, MATURITY, and ALL representations. This analysis handled censoring explicitly and was evaluated with the concordance index and time-dependent AUC at 12, 24, and 36 months. The combined ALL model remained strongest overall, while TEAM provided the highest early-horizon discrimination, and MATURITY remained comparatively stable at the longer horizon (Table 3). The comparison suggests that the original binary top-K formulation, although simpler and not event-time exact, captured a directionally similar ranking structure in the present data setting.
The survival analysis therefore adds nuance beyond a simple confirmation of the binary results: TEAM was strongest at shorter horizons, MATURITY was more stable at longer horizons, and ALL remained strongest overall.
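For readers less familiar with censoring-aware evaluation, the concordance index can be sketched as follows. This is the standard Harrell-style pairwise definition, written here as an illustration rather than as the code used in the study; the toy times, event flags, and risk scores are invented, and pairs with tied times are simply skipped.

```python
def concordance_index(times, events, risks):
    """Harrell-style concordance index with right-censoring.

    A pair (i, j) is comparable when i has an observed event and a shorter
    time than j. It is concordant when i also has the higher risk score;
    tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored instances cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy data: months to exit or censoring, event flag, model risk score.
times = [5, 10, 15, 20]
events = [1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))        # 1.0
print(concordance_index(times, events, risks[::-1]))  # 0.0
```

Censored firms still contribute as the later member of a pair, which is how this measure uses partial follow-up information that a binary label discards.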
6.3. Category-Level Predictive Performance
In Figure 1a, category-level discrimination is summarized by PR-AUC with 95% confidence intervals, and the dashed line indicates the random baseline. The highest mean PR-AUC was achieved by the MATURITY model, and the ALL and TEAM models followed closely behind. MARKET provided a modest improvement above the random baseline, while GEO remained close to the baseline. The FUNDING model performed at or slightly below the baseline, which suggested that the selected funding inputs did not provide a strong discriminative signal under the strict as-of and time-split evaluation. Wide confidence intervals were observed for several groups, and this variability was expected at a low event rate and with bootstrap resampling.
In Figure 1b, lift over random is reported so that results could be interpreted relative to the test positive rate. Lift values above 1.0 were obtained for MATURITY, ALL, TEAM, and MARKET, which indicated performance above random ranking. GEO was shown to be near the 1.0 reference line, which suggested limited added value from ranks alone with the selected GEO inputs. FUNDING was shown to be approximately at the 1.0 line or below, which indicated that performance was not reliably above random. The ranking advantage of MATURITY was reinforced because the highest lift is observed for this group in Figure 1b, consistent with Figure 1a.
In Figure 1c, Precision@50 is reported to evaluate how many true exits were concentrated among the top 50 ranked startups. The highest mean Precision@50 was obtained by TEAM and ALL, and this pattern indicated that a stronger concentration of positives at the top of the list was achieved when team signals were included, either alone or jointly with other categories. Lower Precision@50 values were observed for MATURITY despite its higher PR-AUC, and this result suggested that the ranking improvements provided by MATURITY were distributed across the score range rather than being maximally concentrated in the top 50. Large confidence intervals were observed for all categories, and this variability was expected because Precision@50 was computed on a very small top-K subset under strong class imbalance conditions.
In Figure 1d, Recall@50 is reported to quantify how many of all true exits were recovered within the top 50 ranked startups. The highest mean Recall@50 was observed for ALL and TEAM, and this pattern indicated that more exits were retrieved early in the ranked list when team signals were used. MATURITY was shown to have a comparatively lower Recall@50, which was consistent with the interpretation that MATURITY improved overall ranking quality but did not maximize early retrieval at the fixed cutoff of 50. The remaining categories were shown to have smaller Recall@50 values, which suggested limited usefulness when only the top 50 startups were to be screened.
In Figure 1e, NDCG@50 is used to evaluate ranking quality while discounting lower positions in the list. The highest mean NDCG@50 was obtained by ALL, and TEAM was shown to be the next strongest group. MARKET and MATURITY were shown to provide moderate NDCG@50 values, while GEO remained low, and FUNDING remained limited. This pattern indicated that the best top-of-list ordering was achieved when information was combined (ALL) and when team indicators were included (TEAM). Substantial uncertainty was again observed, and it was consistent with the low base rate and the limited number of positives expected among the top-ranked positions.
Pairwise bootstrap comparisons against the best PR-AUC model were performed, and the mean PR-AUC differences with confidence intervals are reported (Table 4). Statistically clear gaps were observed between the best model and the FUNDING and GEO models. Smaller and less stable gaps were observed when comparisons were made against the ALL and TEAM models. These results suggested that the selected MATURITY signals were competitive with broader feature sets in this evaluation design, while FUNDING-only and GEO-only information was not sufficient for strong discrimination.
6.4. Comparison with Non-Linear Models
To assess whether the conclusions were specific to a linear baseline, additional nonlinear tabular models were evaluated under the same leakage-aware protocol (Table 5). LightGBM and a shallow multilayer perceptron were trained using the identical time split, 180-day embargo, feature masking rules, and bootstrap evaluation pipeline. The largest improvement was observed when the signal groups were combined: the
ALL representation improved from PR-AUC
for logistic regression to
for LightGBM, while Precision@50 increased from
to
. By contrast,
TEAM and MATURITY showed only modest gains. These results indicate that, under strict temporal validity, model nonlinearity is most useful for exploiting cross-group interactions rather than for dramatically altering the ranking behavior of single-group models.
The strongest nonlinear gain appeared in the ALL representation rather than in the single-group models. The improvement should therefore not be read as evidence that black-box models are always better; instead, it suggests that nonlinear models mainly help when cross-group interactions can be exploited under a leakage-aware protocol.
6.5. Comparison with Naive Features
To quantify the effect of the leakage controls directly, a naive benchmark was also estimated using a random split and unmasked features. The resulting performance was materially higher than under the final protocol, especially for ALL and MATURITY (Table 6). For example,
ALL yielded PR-AUC
and Precision@50
with the naive setup, compared with
and
for the leakage-aware setup. The inflation was even more pronounced for
MATURITY, which is consistent with the susceptibility of profile text and retrospectively enriched firm attributes to post-reference contamination. These results show that the proposed protocol is not merely conservative in principle; it changes the estimated screening value materially in practice.
This comparison gives a clear before-versus-after picture of temporal leakage. MATURITY appears especially inflation-prone under the naive protocol, which is consistent with the concern raised earlier about profile text and time-sensitive platform updates.
6.6. Probability Calibration Results
Probability calibration was intentionally performed for a single model rather than for all category models. In this study, calibration was applied only to the best-performing group, where “best” was defined as the model that achieved the highest PR-AUC on the held-out test set. This choice was made to keep the calibration analysis focused and comparable because calibration is used to evaluate the reliability of predicted probabilities, while PR-AUC is used to select the model with the strongest ranking performance. For this reason, the calibration results were not intended to imply that only one model can be calibrated; instead, a single representative model was calibrated to demonstrate how probability reliability changes after post hoc calibration.
The label “Best group” shown in the calibration figure was therefore determined by the PR-AUC ranking in
Table 7. In the current run, the
MATURITY model achieved the highest mean PR-AUC, so it was selected automatically for calibration and is displayed as the “Best group” in
Figure 2. It should also be noted that calibration was evaluated with the Brier score and calibration curves, which measure probability accuracy and not ranking ability. As a result, calibration was expected to improve probability reliability (e.g., lower Brier score) without necessarily improving PR-AUC because the ordering of instances is not changed substantially by monotonic calibration methods.
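The distinction between probability accuracy and ranking ability can be shown with a small sketch. The labels, raw scores, and the sigmoid parameters `a` and `b` below are hypothetical; in practice the parameters would be fitted on a separate calibration split rather than fixed by hand.

```python
import math

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

def sigmoid_calibrate(scores, a, b):
    """Sigmoid (Platt-style) map; strictly increasing whenever a > 0."""
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]

def ranking(scores):
    """Indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical overconfident scores on a rare-event sample.
y_true = [1, 0, 0, 0, 0]
raw = [0.9, 0.8, 0.7, 0.6, 0.2]
calibrated = sigmoid_calibrate(raw, a=4.0, b=-5.0)  # assumed parameters

print(brier_score(y_true, raw), brier_score(y_true, calibrated))
print(ranking(raw) == ranking(calibrated))
```

Because the map is strictly increasing, the ordering of instances is identical before and after calibration, so any rank-based metric is unchanged while the Brier score can fall. Isotonic calibration is only weakly monotone and can introduce ties, which is one reason the ordering may change slightly in that case.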
6.7. Feature Importance Within Each Category
Feature importance was analyzed within each category so that the most influential inputs could be identified. Permutation importance was computed on the test set as the drop in PR-AUC after one input was shuffled. The resulting importance rankings were shown for each category in
Figure 3a–e. Coefficient-based importance was also computed from logistic regression by aggregating absolute coefficients to the raw-variable level. The coefficient-based rankings were shown for each category in
Figure 4a–e.
6.7.1. Permutation-Based Importance
In
Figure 3a, permutation importance within the
FUNDING category is reported for seven selected funding-related inputs. Positive PR-AUC drops were observed for
,
, and
, which indicated that these variables contributed a useful signal when evaluated individually. Negative or near-zero importance values were observed for
,
, and
. This pattern suggested that the
FUNDING-only model was weak and that some variables were redundant or noisy under the strict as-of masking at the reference time and time-based evaluation. Because permutation estimates were computed on a low-prevalence test set, small negative values were interpreted as sampling variability or collinearity effects rather than as evidence of true “harmful” predictors.
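The permutation procedure described above can be sketched as follows. Average precision is used here as the PR-AUC estimate, which is an assumption; the scoring function, the toy data, and the number of repeats are likewise hypothetical.

```python
import random

def average_precision(y_true, scores):
    """Average precision, used here as a PR-AUC estimate (an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def permutation_importance(score_fn, X, y, col, n_repeats=20, seed=0):
    """Mean drop in average precision after shuffling one input column."""
    rng = random.Random(seed)
    base = average_precision(y, [score_fn(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        drops.append(base - average_precision(y, [score_fn(r) for r in X_perm]))
    return sum(drops) / n_repeats

# Toy data: column 0 carries the signal, column 1 is ignored by the model.
X = [[0.9, 1], [0.8, 0], [0.3, 1], [0.2, 0], [0.1, 1], [0.05, 0]]
y = [1, 1, 0, 0, 0, 0]
score_fn = lambda row: row[0]  # hypothetical stand-in for a fitted model

print(permutation_importance(score_fn, X, y, col=0))
print(permutation_importance(score_fn, X, y, col=1))
```

When the baseline ranking is imperfect, a shuffle can improve the metric by chance, which is how small negative importance values arise on a low-prevalence test set.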
In
Figure 3b, permutation importance within the
GEO category is reported for
rank of region and
rank of city. Positive mean drops were observed for both variables, which suggested that geographic rank information may provide a signal for exit ranking. However, the confidence intervals were wide and overlapped zero. The geographic rank features could therefore not be distinguished from noise with confidence, and their marginal effects were unstable under resampling, which was consistent with limited model strength for the
GEO-only configuration.
In
Figure 3c, permutation importance within the
MARKET category is reported as the decrease in PR-AUC when a single sector indicator was shuffled on the test set. A dominant contribution was observed for Biotechnology because the largest PR-AUC drop was produced when this variable was permuted. Much smaller effects were observed for the remaining sectors, and several confidence intervals overlapped zero. This pattern indicated that most sector indicators carried a limited marginal signal for the selected feature set, while Biotechnology provided the most distinctive information for ranking exits in the
MARKET-only model.
In
Figure 3d, permutation importance within the
MATURITY category is reported for four inputs. The strongest effect was attributed to
because a large PR-AUC reduction was produced when the text description was permuted. A smaller but positive effect was attributed to roles. Near-zero effects were observed for
and
, and this result suggested that, in the selected configuration, descriptive text was used as the primary maturity signal, while the remaining structured maturity attributes contributed little additional discrimination.
In
Figure 3e, permutation importance within the
TEAM category is reported for job-type and education indicators. The largest PR-AUC drops were produced by degree-related variables, and the strongest effects were attributed to
,
, and
. A moderate positive contribution was also observed for
, while smaller effects were observed for
and
. Near-zero or negative effects were observed for
,
, and
. These results indicated that, within the selected
TEAM inputs, educational composition carried a stronger marginal signal than role composition, although uncertainty remained substantial.
Figure 3.
Permutation importance within feature categories, measured as the PR-AUC drop on the test set after shuffling one input feature. (a) Permutation importance within the FUNDING category; (b) permutation importance within the GEO category; (c) permutation importance within the MARKET category; (d) permutation importance within the MATURITY category; (e) permutation importance within the TEAM category.
6.7.2. Coefficient-Based Importance
In
Figure 4a, coefficient-based importance is reported for the
FUNDING category. The largest aggregated magnitude was assigned to
investment_type, while
log_total_funding_usd formed the second strongest contribution. Smaller magnitudes were assigned to
total_funding,
investor_count, and
num_funding_rounds, while negligible magnitude was assigned to
total_funding_usd and
last_funding_year. This result indicated that categorical funding-stage information and transformed funding size were emphasized by the fitted linear model, whereas the remaining funding variables were down-weighted when modeled jointly.
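Aggregation of coefficients to the raw-variable level can be sketched as below. The coefficient values, the expanded column names (for example `investment_type=seed`), and the use of a sum rather than a mean or maximum are assumptions for illustration; only the raw variable names come from the text.

```python
from collections import defaultdict

def aggregate_abs_coefficients(coefs, column_to_raw):
    """Sum absolute coefficients of expanded columns per raw variable.

    Assumption: aggregation is a plain sum of absolute values.
    """
    totals = defaultdict(float)
    for column, weight in coefs.items():
        totals[column_to_raw[column]] += abs(weight)
    # Sort from largest to smallest aggregated magnitude.
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# Hypothetical fitted coefficients, for illustration only.
coefs = {
    "investment_type=seed": 0.5,
    "investment_type=series_a": -0.75,
    "log_total_funding_usd": 0.5,
    "investor_count": -0.25,
}
column_to_raw = {
    "investment_type=seed": "investment_type",
    "investment_type=series_a": "investment_type",
    "log_total_funding_usd": "log_total_funding_usd",
    "investor_count": "investor_count",
}
importance = aggregate_abs_coefficients(coefs, column_to_raw)
print(importance)
```

Under a sum, a categorical variable with many levels accumulates magnitude across its indicator columns, which is one reason such rankings are best read alongside the permutation results.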
Figure 4.
Coefficient-based feature importance within categories. Absolute coefficients were aggregated to the raw-variable level. (a) Coefficient-based importance within the FUNDING category; (b) coefficient-based importance within the GEO category; (c) coefficient-based importance within the MARKET category; (d) coefficient-based importance within the MATURITY category; (e) coefficient-based importance within the TEAM category.
In
Figure 4b, coefficient-based importance is reported for the
GEO category. A substantially larger aggregated magnitude was assigned to
rank of region than to
rank of city. This result indicated that broader regional ranking information was weighted more strongly than city-level ranking in the fitted linear decision function for the selected
GEO representation. Because both variables were numeric and were modeled jointly, coefficient magnitudes reflected relative scaling and fitted effects under regularization; they were therefore interpreted as complementary evidence to the permutation analysis rather than as a standalone estimate of test-time importance.
In
Figure 4c, coefficient-based importance is reported for the
MARKET category by aggregating the absolute logistic regression coefficients at the raw-variable level. The largest aggregated magnitude was assigned to
Biotechnology, and similarly high magnitudes were assigned to
Software and
Advertising. A second tier of importance was observed for
Apps,
Manufacturing, and
Information Technology. Smaller magnitudes were observed for the remaining sectors. This pattern indicated that, with the selected sector-only representation, model weights were concentrated on a small subset of sectors, and the majority of sectors contributed weakly in the linear decision function.
In
Figure 4d, coefficient-based importance is reported for the
MATURITY category after aggregation to the raw-variable level. The coefficient magnitude was dominated by
short_description, while
roles contributed only marginally. Near-zero magnitudes were assigned to
primary_role and
founded_year. This pattern was consistent with the permutation results and indicated that descriptive text features drove most of the linear signal in the
MATURITY model, whereas the remaining structured maturity inputs added little incremental contribution in the fitted classifier.
In
Figure 4e, coefficient-based importance is reported for the
TEAM category. The highest aggregated magnitudes were assigned to degree-related indicators, and the largest values were observed for
deg_bachelor,
deg_other, and
deg_mba. A moderate magnitude was assigned to
job_board_observer and
job_executive, while smaller magnitudes were assigned to
deg_phd and the remaining job-type indicators. This pattern suggested that, in the linear model, educational composition was used more strongly than role composition when the selected
TEAM inputs were provided, although it was noted that coefficient magnitudes reflected fitted weight size rather than direct causal importance.
7. Discussion
The study was designed to approximate real early-stage venture screening, where limited attention is allocated under severe outcome rarity and where information is updated over time. Strict as-of feature construction at the reference time, an embargo window, and explicit masking rules were used so that information recorded after the reference time was not allowed to inflate performance. In this setting, a substantial share of startups was affected by the funding masking rule, which supported the view that temporal leakage would be likely with naive feature construction. In addition, a pronounced prevalence shift was observed between the overall cohort and the most recent held-out cohort, and this shift was consistent with right-censoring in prospective evaluation settings.
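A minimal sketch of as-of masking for funding features is given below. The event record layout (`date`, `amount_usd`), the `was_masked` flag, and the rule itself (drop every event dated after the reference time) are assumptions for illustration and may not match the study's exact masking rules.

```python
from datetime import date

def mask_as_of(events, ref_time):
    """Keep only events dated on or before the reference time."""
    return [e for e in events if e["date"] <= ref_time]

def funding_features_as_of(events, ref_time):
    """Build funding features from information visible at ref_time only."""
    visible = mask_as_of(events, ref_time)
    return {
        "num_funding_rounds": len(visible),
        "total_funding_usd": sum(e["amount_usd"] for e in visible),
        # True when at least one later event had to be hidden.
        "was_masked": len(visible) < len(events),
    }

# Hypothetical funding history for one firm.
events = [
    {"date": date(2010, 5, 1), "amount_usd": 500_000},
    {"date": date(2012, 8, 1), "amount_usd": 2_000_000},
    {"date": date(2014, 3, 1), "amount_usd": 10_000_000},
]
features = funding_features_as_of(events, ref_time=date(2013, 1, 1))
print(features)
```

Counting how often a flag such as `was_masked` is true gives the share of entities affected by the masking rule.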
7.1. Interpretation of Category-Level Performance
A key empirical pattern was the divergence between global ranking discrimination and shortlisting-oriented metrics. The MATURITY signal group achieved the strongest PR-AUC, while the highest Precision@50 and Recall@50 were obtained by TEAM and ALL. This discrepancy was interpreted as evidence that MATURITY improved ordering across a broader range of scores, whereas TEAM-related signals provided stronger concentration of positives at the very top of the list for a fixed shortlisting capacity. The ranking-quality summary with position discounting (NDCG@50) also favored ALL and TEAM, suggesting that the most practically useful ordering for a strict top-K workflow was obtained when team information was included and when signals were combined.
A plausible difference in signal shape appears to exist between MATURITY and TEAM. The descriptive text within MATURITY seems to provide a broad but relatively diffuse signal that helps separate companies across a wider portion of the score distribution, which is consistent with its stronger PR-AUC. By contrast, TEAM variables appear to be more selective: when favorable team patterns are present, a sharper concentration of positives can be created near the top ranks, which is consistent with the stronger Precision@50 and Recall@50. The result should therefore not be read as a contradiction; rather, it suggests that different signal groups can support different parts of the ranking objective.
The bootstrap comparison results were aligned with these patterns. Statistically clear PR-AUC gaps were observed between the best model and the FUNDING and GEO models, while differences against TEAM and ALL were not statistically stable. This result suggested that the selected maturity-related inputs were competitive with broader feature sets under the leakage-aware protocol, whereas FUNDING-only and GEO-only representations were not sufficient for strong discrimination in the current configuration. A modest improvement above baseline was also indicated for MARKET, although uncertainty remained substantial with the low event rate.
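A paired bootstrap comparison of two models can be sketched as follows. The percentile interval, the number of resamples, and the toy data are assumptions; Precision@K is used as the metric only because it remains defined when a resample contains no positives.

```python
import random

def precision_at_k(y_true, scores, k):
    """Share of positives among the k highest-scored items."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(y_true[i] for i in order) / k

def paired_bootstrap_diff(y_true, scores_a, scores_b, metric,
                          n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for metric(A) - metric(B) on shared resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models
        y = [y_true[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(metric(y, a) - metric(y, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical labels and scores from two models on the same test set.
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores_a = [0.9, 0.2, 0.8, 0.1, 0.3, 0.2, 0.1, 0.7, 0.4, 0.3]
scores_b = [0.4, 0.6, 0.5, 0.7, 0.3, 0.2, 0.8, 0.1, 0.4, 0.3]
metric = lambda y, s: precision_at_k(y, s, k=3)
print(paired_bootstrap_diff(y_true, scores_a, scores_b, metric))
```

An interval that excludes zero corresponds to a statistically clear gap, while an interval that contains zero corresponds to a difference that is not stable under resampling.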
7.2. Implications for Practical VC Screening
Two implications for screening practice were highlighted. First, a single metric was not sufficient to characterize the screening value under capacity constraints. When the operational goal is to prioritize a small shortlist, metrics that emphasize early retrieval (Precision@K, Recall@K) and top-of-list ordering (NDCG@K) provide direct evidence about shortlisting performance. In contrast, PR-AUC provided a broader discrimination summary that remained informative under class imbalance conditions but did not necessarily identify the configuration that maximized early retrieval at a fixed cutoff. As a result, a two-stage screening workflow could be supported: a shortlisting stage could be guided by TEAM/ALL-style signals to concentrate positives early, while a broader triage stage could be guided by MATURITY-like signals to improve overall ranking discrimination.
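The shortlist-oriented metrics named above can be computed as in the following sketch. Standard definitions are assumed (binary relevance, log2 position discount for NDCG, lift as precision divided by prevalence); the function name and toy data are hypothetical, and `k` is assumed not to exceed the number of items.

```python
import math

def top_k_metrics(y_true, scores, k):
    """Precision@K, Recall@K, Lift@K and NDCG@K for binary labels."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top = [y_true[i] for i in order[:k]]
    n_pos = sum(y_true)
    prevalence = n_pos / len(y_true)
    precision = sum(top) / k
    recall = sum(top) / n_pos if n_pos else 0.0
    lift = precision / prevalence if prevalence else 0.0
    # Discounted gain: positives near the top of the list count more.
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, n_pos) + 1))
    ndcg = dcg / ideal if ideal else 0.0
    return {"precision": precision, "recall": recall, "lift": lift, "ndcg": ndcg}

# Toy example: 2 positives among 10 items, shortlist of size 2.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(top_k_metrics(y_true, scores, k=2))
```

Precision@K and Recall@K depend only on which items enter the shortlist, whereas NDCG@K also rewards placing positives earlier within it.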
Because the held-out prevalence was only , the absolute magnitude of PR-AUC should not be interpreted in isolation. A more decision-relevant view is obtained from the top-K metrics. For example, the TEAM model achieved , which corresponds to approximately expected successful exits in the top 50, compared with for random ranking. Thus, the practical value shown here is not that a production-ready screening engine has been obtained, but that leakage-aware evaluation and shortlist-oriented metrics can still yield interpretable operational evidence under extreme rarity conditions.
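The arithmetic behind this comparison can be written out explicitly. The symbol $\pi$ for the held-out prevalence is notation introduced here for illustration, and the standard definition of lift is assumed; no numerical values are implied.

```latex
\begin{align}
\text{hits in top } K &= K \cdot \mathrm{Precision@}K, \\
\mathbb{E}[\text{hits in top } K \text{ under random ranking}] &= K \cdot \pi, \\
\mathrm{Lift@}K &= \frac{\mathrm{Precision@}K}{\pi}.
\end{align}
```

With $K = 50$, the first line converts a reported Precision@50 into a count of successful exits on the shortlist, and the second line gives the corresponding count expected from a random ordering.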
Second, probability reliability was shown to be materially improved by post hoc calibration for the best model. A large reduction in Brier score was obtained for sigmoid and isotonic calibration, while PR-AUC was not improved and was expected to remain similar with monotone recalibration. This pattern supported the interpretation that calibration primarily improved the usability of scores as probabilities for threshold-based planning, while leaving ranking ability largely unchanged. For decision support, this distinction is important: ranking metrics determine which startups enter a shortlist, while calibration improves the stability of expected hit rates when thresholds or target shortlist yields are planned. Calibration should therefore be read as a reliability layer on top of ranking, not as a mechanism by which weak discrimination is converted into strong screening performance. Its role in the present study is to improve threshold planning once a ranking model has already been specified.
7.3. Signal Interpretation and Model Transparency
Feature-importance patterns provided additional insight into what was being captured within each signal group. In MATURITY, most of the discriminative signal was attributed to the short descriptive text field, while the remaining structured maturity attributes contributed little additional discrimination. In TEAM, educational composition indicators provided the strongest marginal contributions, and role-related indicators provided smaller but measurable effects. These findings suggested that the most influential screening signals were not limited to purely financial history; instead, descriptive and human-capital proxies were emphasized for the current representation. At the same time, these importance patterns should be interpreted as predictive associations subject to regularization and resampling variability rather than as causal mechanisms.
Because the category-specific analyses were designed as controlled ablations, they should not be interpreted as evidence that the underlying startup signals are mutually independent. Rather, they provide a structured view of marginal information value under a common leakage-aware protocol, while the ALL model serves as the corresponding joint specification.
7.4. Limitations
Several limitations were identified. First, outcome labels were defined as observed exit events, while non-exit cases included operating and inactive firms, and right-censoring was more likely in the most recent cohort. As a result, some negative labels in the held-out period may later convert to positive outcomes, which can attenuate measured performance with a strict prospective split. The present binary label should therefore be interpreted as a screening-oriented approximation, not as an unbiased estimator of event-time risk under censoring conditions. For applications in which event timing is central, a time-to-event formulation would be preferable because right-censoring could then be handled explicitly. Second, a fixed top-50 cutoff was used for several screening metrics, and high variance was expected because these metrics are computed on a small subset under extreme class imbalance conditions. Third, the dominant contribution of descriptive text in
MATURITY raised a practical measurement concern: although as-of construction at the reference time was enforced, profile text can be updated over time on many platforms. As a result, additional auditing of text timestamping and controlled text representations would strengthen robustness. Fourth, the present workflow assumes that startup-level signals can be centralized before model development. In practice, relevant information may be distributed across multiple holders, in which case privacy, communication cost, and distribution mismatch can become important constraints. Future leakage-aware screening systems may therefore require federated or privacy-preserving learning designs rather than a single centralized pipeline [
18]. Fifth, the naive benchmark in Section 6.5 combined a random split with unmasked features in a single configuration. Consequently, the separate contributions of random splitting and of unmasked post-reference-time features to the observed performance inflation were not quantified. Finally, the primary modeling choice was intentionally simple and interpretable so that the evaluation protocol could be isolated from model complexity. For that reason, the present results should be interpreted as a leakage-aware baseline rather than as the performance ceiling for startup screening. Beyond the LightGBM and shallow multilayer perceptron comparison, a broader evaluation of stronger nonlinear tabular models remains an important extension for future work.
8. Conclusions
Leakage-aware startup screening has been examined under a time-based evaluation protocol that approximates real early-stage VC and accelerator workflows. A reference time was used to construct features as of the screening moment, an embargo window was applied around the train–test boundary, and explicit masking rules were introduced to reduce the inclusion of information recorded after the reference time. A large proportion of startups were affected by the funding masking rule, which indicated that naive feature construction would likely contain substantial future information. In addition, a pronounced prevalence shift was observed between the overall cohort and the held-out, most recent cohort, which was consistent with right-censoring in prospective evaluation settings.
Screening performance has been reported with metrics aligned to capacity-constrained decision making. Ranking-oriented measures (Lift@50, Precision@50/Recall@50, and NDCG@50) were used alongside PR-AUC, and uncertainty was represented with bootstrap confidence intervals and pairwise bootstrap comparisons. Across signal groups, the strongest PR-AUC was achieved by the MATURITY representation, while the most favorable top-50 shortlisting metrics were obtained by TEAM and ALL. These results suggested that the screening-relevant notion of “best” can depend on the operational objective: broader discrimination under imbalance conditions was supported by MATURITY, while early retrieval at a fixed shortlist size was strengthened when team-related signals were included. Pairwise comparisons further indicated that the best PR-AUC model consistently outperformed FUNDING and GEO representations, while differences versus TEAM and ALL were not statistically stable under resampling conditions.
Probability calibration was also evaluated as a decision-support component. A large improvement in probability reliability was obtained for the best model, as reflected by the reduction in Brier score after sigmoid calibration, while PR-AUC was not materially altered. This pattern supported a practical distinction between ranking quality (which determines which startups enter the shortlist) and probability reliability (which supports threshold selection and stable expected hit-rate planning).
The contribution of this study, in summary, should be read as methodological rather than algorithmic. A leakage-aware prospective protocol is specified for screening problems in which information evolves over time and only a small shortlist can be reviewed. The value of the framework is created by reducing look-ahead bias, aligning evaluation with top-K decisions, and improving probability reliability for threshold planning. The present study should therefore be interpreted as an evaluation-oriented baseline and a methodological template for leakage-aware evaluation and decision interpretation under severe rarity, not as an attempt to optimize predictive performance across model classes or as a claim that the present baseline model is deployment-ready.
For future work, several extensions are suggested. A time-to-event formulation, such as Cox-type hazard modeling or Random Survival Forests, could be used so that censoring is handled explicitly rather than indirectly absorbed through prevalence shift [
28,
29]. A finer-grained comparison against leakage-permissive baselines, such as random splitting and unmasked future information evaluated separately, would also be valuable because the sources of inflation caused by look-ahead bias could then be quantified more explicitly. Dynamic decision formulations could also be explored, where sequential updates to signals are treated as an evolving state and shortlisting is framed as a policy problem under capacity and cost constraints. In addition, controlled and timestamp-audited text representations could be introduced to mitigate concerns about profile updates made after the reference time. External validation across regions, periods, and data providers would further support generalization claims, and fairness-aware audits could be included to examine whether screening performance and error rates differ systematically across geographies or other groups under the same leakage-aware evaluation protocol.