1. Introduction
As noted by Aldous [1], in any study that seeks to model the outcomes of a sports competition, the probability that A (a player or team) defeats B is assumed to be a specific function of the difference in their underlying ‘strength’. This premise naturally raises several questions, including the following: How should a competitor’s ‘strength’ be defined? How can this ‘strength’ be measured?
This line of reasoning is also pursued by McHale and Morton [2], who model tennis match outcomes under the assumption that the probability of victory is a function of each player’s latent abilities or ‘strength’. The authors explicitly remark that ‘It is of interest to use the player abilities… to produce an alternative ranking system and to compare the new rankings with the official ATP rankings’. Similarly, Ricard and Legarra [3] address this idea by developing models that interpret observed rankings as the empirical outcome of an underlying hierarchy of latent, normally distributed performances of horses in competition.
Within the field of horse racing, numerous studies provide evidence that betting markets yield reasonably accurate forecasts. According to Stekler et al. [4], empirical results indicate that these markets are generally able to discriminate among horses’ potential and quality, producing probability estimates that are well aligned with the observed frequencies of victory. However, the authors also confirm a systematic deviation at the extremes of the probability distribution, commonly referred to as the ‘longshot bias’. This bias manifests as excessive wagering on highly favored horses and disproportionate betting on low-probability outcomes, motivated by the prospect of achieving higher returns. Such behavioral patterns distort the implied probabilities and thus compromise the reliability of market-based predictions when betting data are used as explanatory or predictive variables. This adverse effect of betting behavior has been extensively documented in the literature (see Lo and Bacon-Shone [5] and Snyder [6]). Consequently, betting data cannot be considered suitable for estimating intrinsic performance.
In light of the preceding arguments, if the aim is to develop indicators of horse performance or ‘strength’ in races without incorporating the horses’ phenotypic or genetic characteristics, it is reasonable to restrict the analysis to the results of previous races. The use of competitive outcomes constitutes the basis of Elo-type rating systems and standard ranking procedures (e.g., in chess, tennis, car and motorcycle racing, etc.). However, for performance indicators to provide timely information, historical data must be temporally weighted so that more recent results carry greater importance. In particular, weighting schemes based on kernel-type functions offer an objective criterion for assigning weights and align with the principle that an athlete’s (or horse’s) current form has a temporal component that diminishes over time, analogously to the ‘decay factor’ used in dynamic Elo ratings. This weighting strategy has been employed in various studies aimed at constructing models for forecasting sports outcomes (see McHale and Morton [2]). Aldous [1] further reinforces this line of reasoning by arguing that, in sporting contexts, the estimates of a player’s or team’s strength or performance must be updated after every match or competition.
Furthermore, in many sports, competitions take place under heterogeneous conditions and environments that exert a substantial influence on outcomes. For example, in tennis, the surface on which each tournament is played (clay, grass, etc.) is a decisive factor. In horse racing, race distance or specific modalities can play an important role. In trotting, variables such as race distance or start type (autostart or handicap) can significantly affect the time per kilometer, and thus must be incorporated into the performance indicators. Clearly, such aspects must be taken into account when constructing measures of performance or strength. For instance, McHale and Morton [2] explicitly account for playing surface in the model they propose for forecasting tennis match outcomes.
On the other hand, the competition category must also be taken into account when defining an indicator of sporting strength or skill. Undoubtedly, in any sport—whether individual or team-based—the category of the tournament influences athletes’ performances and the outcomes they achieve. The quality of opponents, the psychological effects on each athlete, environmental and media pressure, the suitability of the conditions under which the event is held, and even the associated sporting and financial rewards are all relevant factors that shape results. This provides a clear justification for incorporating the competition category into the definition of sporting strength.
In many sports, events or tournaments are classified into categories that determine the points awarded for official rankings. For example, tennis tournaments are categorized as ATP Grand Slam, ATP 1000, ATP 500, ATP 250, and so on. In horse racing, events with a higher competitive level (e.g., those offering larger prize purses or greater prestige) tend to attract higher-quality competitors; consequently, the results obtained in such contexts provide greater discriminative information regarding a horse’s relative strength.
The objective of this study is to develop a statistical methodology for estimating quantitative strength indicators in trotter horses, based exclusively on historical race results that are temporally weighted and adjusted for event category and distance. The definition of such indicators must incorporate the aspects discussed above, namely, the time elapsed between the event and the moment at which performance is assessed, the category of the event, and the quality of the participating rivals. It is also important to emphasize that the goal is not to predict race outcomes, as a predictive approach would require additional information on the horse’s morphological characteristics as well as specific data on the driver. Betting data are likewise not considered. Although such data are used in many outcome-prediction models, they do not necessarily convey information about a horse’s inherent potential or performance. Moreover, as reported in the literature, they may introduce undesirable noise into the construction of animal strength indicators. Although the objective is not direct prediction, validation of the proposed indicators is carried out through their predictive validity. These indicators are suitable as inputs to an Elo-type system, analogous to those used in chess. Finally, the proposed framework and general procedure are adaptable to any sporting discipline that produces comparative results among competitors, provided that chronological ranking information is available.
The remainder of this paper is structured as follows. Section 2 describes the dataset used to address the objective of the study. In Section 3, we define and outline the parameters, measures, and statistical quantities that form the basis for constructing the performance or strength indicators. Section 4 introduces the proposed performance measures or indicators and presents an analysis of their theoretical underpinnings. In Section 5, we assess the proposed indicators using the dataset. Finally, Section 6 summarizes the main conclusions and outlines directions for future research.
2. Dataset Description
For this study, performance data from the Balearic Trotting Federation [7] collected between 1990 and 2023 were used. This database consists of a set of horse–event records, meaning that each entry contains:
Horse identifier;
Driver identifier;
Event identifier;
Associated characteristics for both the horse (sex, date of birth, time per kilometer, position, and earnings) and the event (date, racetrack, distance, type, and starting mode).
In this study, only harness (trotter) races were considered, as they constitute the majority of trotting competitions in Spain. The starting modes are either autostart or handicap; further details are provided in Gómez et al. [8]. Race distances were grouped following the same categorization used by Gómez et al. [8]:
Short: < 2000 m.
Medium: 2000–2200 m.
Long: > 2200 m.
Similarly, a new variable, race category, was created by classifying races into five groups based on the total prize money awarded:
Category A: prize money exceeding EUR 10,000
Category B: prize money between EUR 6000 and EUR 9999
Category C: prize money between EUR 3000 and EUR 5999
Category D: prize money between EUR 800 and EUR 2999
Category E: prize money less than EUR 800.
This classification follows the prize-money distribution observed in the database and reflects the practical competition levels recognized by the Balearic Trotting Federation.
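The two groupings described in this section can be sketched as simple threshold rules. The snippet below is a minimal illustration and not the authors' code: the function names are hypothetical, and treating a prize of exactly EUR 10,000 as Category A is an assumption, since the text leaves that boundary unspecified. The distance cut-offs follow the operational definitions restated in Section 3.2 (short < 2000 m; medium [2000, 2200]; long > 2200 m).

```python
def distance_category(distance_m: float) -> str:
    """Map a race distance in meters to 'short', 'medium' or 'long'."""
    if distance_m < 2000:
        return "short"
    if distance_m <= 2200:
        return "medium"
    return "long"


def race_category(prize_eur: float) -> str:
    """Map total prize money in EUR to a category from 'A' (top) to 'E'."""
    if prize_eur >= 10_000:  # assumption: exactly EUR 10,000 counts as A
        return "A"
    if prize_eur >= 6_000:
        return "B"
    if prize_eur >= 3_000:
        return "C"
    if prize_eur >= 800:
        return "D"
    return "E"
```

For example, a 2100 m race with a EUR 3500 purse would be labeled as a medium-distance Category C event under these rules.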
Table 1 provides a summary of the dataset records by distance and category.
Regarding race position, it should be noted that horses exceeding a time limit in the races are recorded as ‘A’, and their time per kilometer is not recorded. Furthermore, horses that were disqualified or withdrawn from the race were excluded from the analysis, as the interruption of their participation prevents an accurate measurement of their performance. These cases accounted for approximately of the total records and were excluded to ensure comparability.
It is important to emphasize that the aim of this study is to quantify the intrinsic performance or ‘strength’ of each horse, rather than to predict official finishing positions. In trotting competitions, particularly in handicap events, horses may cover different distances, meaning that the finishing order is not an objective or directly comparable measure of performance. For this reason, time per kilometer is used as the primary performance metric, in line with previous studies such as Ekiz and Kocak [9], Gómez et al. [8], and Ricard and Legarra [3].
This procedure follows the general principle that a win/loss outcome can be assigned only when the performance measures of both horses are objectively comparable.
Based on the available time-per-kilometer values, the pairwise outcome between horses i and j in a race is defined as follows:
Difficulties arise when one or both horses are assigned the position ‘A’, which indicates that the horse exceeded the time limit and, consequently, no time per kilometer is recorded. If both horses receive ‘A’, or if one horse receives ‘A’ in a handicap race (where distances differ), there is insufficient information to determine which horse performed better. These cases represent a limited subset of all pairwise comparisons, and treating them as draws prevents the introduction of arbitrary and unverifiable outcomes.
Therefore, all pairwise outcomes for which the relative performance could not be objectively established were classified as draws. Consequently, each race yields a confrontation matrix with three possible outcomes:
1 if i clearly outperforms j (i defeats j);
−1 if i is clearly outperformed by j (i loses to j);
0 when the comparison is not interpretable based on observable performance (‘draw’).
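The pairwise coding described above can be sketched as follows. This is a hedged illustration under several assumptions that are not stated explicitly in the text: the function name and data layout are hypothetical, outcomes are coded as 1, −1 and 0, a lower time per kilometer is taken to mean a better performance, equal times are treated as a draw, and in a non-handicap race a horse with a recorded time is taken to defeat a horse recorded as ‘A’ (represented here by `None`).

```python
from typing import Dict, Optional, Tuple


def pairwise_outcomes(
    times: Dict[str, Optional[float]], handicap: bool
) -> Dict[Tuple[str, str], int]:
    """Return the outcome of every ordered pair (i, j) in one race.

    times maps a horse id to its time per kilometer, or None when the
    horse exceeded the time limit (position 'A').
    """
    outcomes: Dict[Tuple[str, str], int] = {}
    for i, ti in times.items():
        for j, tj in times.items():
            if i == j:
                continue
            if ti is None and tj is None:
                result = 0  # both 'A': not comparable
            elif ti is None or tj is None:
                if handicap:
                    result = 0  # distances differ: not comparable
                else:
                    # assumption: the timed horse beats the 'A' horse
                    result = 1 if tj is None else -1
            elif ti < tj:
                result = 1
            elif ti > tj:
                result = -1
            else:
                result = 0  # assumption: identical times count as a draw
            outcomes[(i, j)] = result
    return outcomes
```

Each race then yields an antisymmetric table of outcomes, with every ordered pair present, which is the form assumed when victories and draws are accumulated across events.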
In this manner, after data cleaning and exclusion of incomplete records, the final dataset comprises 421,329 records, among which there are 9271 animals, 2140 drivers, and 56,882 races or events.
3. Notation and Definitions of Parameters and Quantities
The notation and statistical quantities required for constructing the strength indicators are defined below.
N: Number of horses in the database.
The collection of events contains H elements, where H represents the total number of events (or competitions), with distinct editions of the same tournament considered as separate events.
The available data cover a time interval denoted by . Thus, an event or tournament occurs at a specific time instant . In practice, corresponds to the calendar date of event t, expressed as the number of days elapsed since the first recorded race.
Given an event
, the collection of events held prior to it, that is, within the time interval
, is denoted by the following:
Each event or competition (for each distinct edition) is assigned a category based on an ordinal variable with five levels: A, B, C, D, and E. As discussed in the Introduction, victories or high placements in top-category events (A or B) are more significant than successes in lower-category events and should therefore contribute more to indicators of horse quality. Thus, given an event
with
, we consider the collection of events held prior to it,
. Each event in this collection can be assigned a category correction coefficient, such as
Thus, the application of these coefficients weights the outcome of a Category C event at of the value assigned to an identical outcome in a Category A event. The coefficients were assigned according to a monotonically decreasing scheme, ensuring that higher-level events contribute proportionally more to the final indicator. This design preserves ordinal consistency and allows for future calibration through data-driven optimization.
Each event (for each distinct edition) is associated with a distance (in meters) based on an ordinal categorical variable as follows: short, medium, and long, according to the categorization described in the preceding section. It is denoted by
Given that a horse’s performance potential may vary with distance, this factor must be considered in the quality indicators designed to characterize the horses. Accordingly, correction coefficients can be assigned based on the similarity to the distance of the event
under consideration, analogous to the approach used for event categories.
As in the preceding case, the selection of these coefficients follows a criterion consistent with the study objective, though they may be adjusted if necessary. Specifically, higher similarity in distance to the target event results in higher weights, reflecting a logical and coherent weighting scheme.
To define quantities or measures related to horse outcomes, they must be referenced either to a time instant or to an event or tournament . For any pair of horses , the notation ‘’ indicates that horse i defeated horse j in event t.
Thus, given an event
(or an instant
), we can define the
matrix of prior victories to that event or instant,
or
, as an
victory indicator matrix. Each element counts the number of victories of one horse over another, defined for each pair of horses
as follows:
and for
,
This matrix, , is not symmetric and does not account for possible draws between two horses. Furthermore, the following considerations can be made regarding it:
Given horse
i, the number of ‘victories over other horses’ prior to event
is given by
Given a pair of horses
and an event or tournament
, we denote
if the pair of horses participated in that event
t. Thus,
where
is the indicator function:
represents the number of events in which horse j defeated horse i prior to event , which account for the non-symmetry of . Obviously, the matrix is symmetric and represents the total number of pairwise confrontations (excluding ties) between each pair of horses over the time interval .
Analogously, the draw matrix
can be defined. A draw between horses
i and
j in an event
t is denoted by
, and the corresponding draw indicator function is defined as
Then,
and diagonal elements are null,
.
Finally, the confrontation matrix up to event
is defined as
which is to say,
where the element
of
represents the total number of confrontations between the pair of horses
prior to the instant
. Obviously, this is a symmetric matrix.
These matrices summarize the number of victories and confrontations between each pair of horses over the entire period preceding the construction of the matrix, i.e., prior to event . When estimating the likelihood of a future confrontation between two horses, particularly in event , using these matrices, one captures the historical performance of each animal but conflates its current form with its average performance over the entire past. To address this, a weighting scheme must be incorporated to assign greater importance to more recent results. One approach is to apply kernel-type functions, which assign higher weights to events closer in time to the prediction or adjustment point, with the weight gradually decreasing for events further away. In this context, temporal proximity is defined by the elapsed time until the event under consideration for prediction or adjustment.
A kernel function suitable for this purpose is the so-called Triweight Kernel (Gramacki [10]):
where
is the base period, or bandwidth, which must be fixed a priori. Note that when weighting with
, events occurring at a time earlier than
(measured in days) receive zero weight and are therefore excluded from the projection. Conversely, events held shortly before the target event receive a weight close to the kernel’s maximum. In the present study, the bandwidth is set to four years, with the unit of time being days. This window was selected based on expert knowledge regarding the competitive lifespan and performance maturation of trotting horses, which typically evolve over multi-year periods. As such, a four-year span provides a biologically and contextually meaningful scale for temporal weighting in this discipline.
Kernel weighting ensures a smooth temporal decay of past results. Among several possible choices (e.g., Gaussian, Epanechnikov, or Triweight), the Triweight kernel offers compact support and finite influence, making it particularly suitable for dynamic performance modeling (Gramacki [10]).
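As an illustration of the temporal weighting, the sketch below evaluates the Triweight kernel in its standard textbook form, K(u) = (35/32)(1 − u²)³ for |u| ≤ 1 and 0 otherwise, with u the elapsed time divided by the bandwidth. Two points are assumptions rather than statements from the text: the normalizing constant 35/32 (the paper's exact scaling may differ, and any constant cancels whenever a weighted count is divided by another weighted count), and the conversion of four years to 1460 days, which ignores leap days. The function names are hypothetical.

```python
def triweight(u: float) -> float:
    """Standard Triweight kernel, supported on [-1, 1]."""
    if abs(u) > 1.0:
        return 0.0
    return 35.0 / 32.0 * (1.0 - u * u) ** 3


def kernel_weight(elapsed_days: float, bandwidth_days: float = 1460.0) -> float:
    """Weight of a past event held elapsed_days before the target event.

    bandwidth_days = 1460 is an assumed conversion of four years to days.
    """
    return triweight(elapsed_days / bandwidth_days)
```

Under these assumptions, a race run two years before the target event receives a weight of about 0.46, while any race at least four years old receives weight zero.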
Thus, for an event
the elements of the time-weighted matrix of prior victories are defined as follows:
where
is the weight determined by the kernel function for event
t relative to event
, that is,
with
being the time instant at which event
t occurs. Thus,
represents the time elapsed between the two events.
Analogously, the
time-weighted draw matrices can be defined, with elements:
and the
time-weighted confrontation matrices.
Using the preceding matrices, the
time-weighted victory ratio matrix,
, can be defined, with the elements given by
That is to say, represents the ratio of horse i’s victories over horse j’s, time-weighted through the Triweight Kernel function.
These matrices collectively capture the historical head-to-head relationships among horses and provide the mathematical foundation for constructing time-weighted indicators of relative strength. This framework can be readily generalized to other sports that involve pairwise comparisons.
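The construction of the time-weighted matrices can be sketched in a few lines. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the event layout (a `day` field and an `outcomes` table holding every ordered pair coded 1, −1 or 0), the function names, the standard 35/32 kernel constant, and the default bandwidth of 1460 days are all hypothetical choices. The ratio is taken as weighted victories divided by weighted confrontations (victories in both directions plus draws), following the description of the denominator given in Section 3.1.

```python
from collections import defaultdict


def triweight(u):
    """Standard Triweight kernel on [-1, 1] (assumed normalization)."""
    return 35.0 / 32.0 * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0


def weighted_matrices(events, t0_day, bandwidth_days=1460.0):
    """Accumulate kernel-weighted victories V, draws D, confrontations C.

    events: list of dicts with "day" and "outcomes"; outcomes maps every
    ordered pair (i, j) to 1 (i beat j), -1 (j beat i) or 0 (draw).
    Only events strictly before t0_day contribute.
    """
    V = defaultdict(float)
    D = defaultdict(float)
    for ev in events:
        elapsed = t0_day - ev["day"]
        if elapsed <= 0:
            continue
        w = triweight(elapsed / bandwidth_days)
        for pair, result in ev["outcomes"].items():
            if result == 1:
                V[pair] += w
            elif result == 0:
                D[pair] += w
    C = defaultdict(float)
    for (i, j), v in V.items():
        C[(i, j)] += v  # victories of i over j
        C[(j, i)] += v  # the same meetings seen from j's side
    for pair, d in D.items():
        C[pair] += d
    return V, D, C


def victory_ratio(V, C):
    """Weighted victories over weighted confrontations, where defined."""
    return {pair: V.get(pair, 0.0) / c for pair, c in C.items() if c > 0.0}
```

Storing the matrices as dictionaries keyed by pairs keeps the sketch sparse: most pairs of horses in a large database never meet, so only observed confrontations take up memory.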
3.1. Corrections According to Event Category
As noted previously, when defining a strength indicator for each horse, it is important to account for the categories of the events in which the horse has participated. To this end, we construct analogous matrices that incorporate event category through the coefficients defined in Equation (2) and the prize-money levels (A–E) described in Section 2.
Each event or competition (for each distinct edition) is associated with a category represented by an ordinal variable with five levels: A, B, C, D, and E. The correction coefficients specified in Equation (2) are applied accordingly. Based on this information, the following matrices and derived quantities can be constructed:
Time-weighted and category-corrected victory matrix:
Time-weighted and category-corrected draw matrix:
Time-weighted and category-corrected confrontation matrix:
These matrices, respectively, represent victories, draws, and confrontations, all weighted by time and corrected according to the coefficients corresponding to event categories. Analogous to the previous procedure, the
time-weighted and category-corrected victory ratio matrices can be defined as follows:
The category coefficients follow a monotonically decreasing scheme (see Equation (2)), ensuring ordinal coherence: higher-level events contribute proportionally more to the indicator. This design is theoretically consistent and may be calibrated using data in future work, for instance, via constrained cross-validation under monotonicity.
Note that in defining the ratio associated with the pair , the corresponding element from the time-weighted confrontation matrix, , is retained as the denominator. In other words, the ratio quantifies the time-weighted and category-corrected victories relative to the total time-weighted confrontations between the two horses. By construction, and inherits the time-decay induced by . Using the uncorrected denominator preserves the head-to-head frequency scale.
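The category correction for a single pair of horses can be illustrated as follows. This is a sketch under assumptions: the function name and input layout are hypothetical, and the coefficient values for Categories B, C and D are illustrative placeholders chosen to be monotonically decreasing. Only the end points (1 for Category A and 0.2 for Category E) are implied by the discussion of the indicator's range in Section 4. As described above, only the numerator is category-corrected, while the denominator remains the uncorrected time-weighted confrontation total.

```python
# Assumed coefficients: A = 1.0 and E = 0.2 are implied by the text;
# the intermediate values are hypothetical placeholders.
CATEGORY_COEF = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4, "E": 0.2}


def category_corrected_ratio(meetings):
    """Category-corrected victory ratio of horse i over horse j.

    meetings: iterable of (kernel_weight, category, result) tuples for
    the pair, with result 1 (i beat j), -1 (j beat i) or 0 (draw).
    Returns None when the pair has no positively weighted meeting.
    """
    meetings = list(meetings)
    numerator = sum(w * CATEGORY_COEF[cat] for w, cat, r in meetings if r == 1)
    denominator = sum(w for w, _cat, _r in meetings)
    return numerator / denominator if denominator > 0 else None
```

With these placeholder values, a single win in a Category E race gives a ratio of 0.2, whereas the same win in a Category A race gives 1.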
3.2. Corrections According to Event Distance
Analogous matrices can be constructed by incorporating event distance. Each event or competition (for each distinct edition) has a distance, measured in meters, represented by an ordinal categorical variable: short, medium, and long distance (see Section 2 for operational definitions: short < 2000 m; medium [2000, 2200] and long > 2200 m). In Equation (4), correction coefficients are defined based on the similarity between the distance of the event under consideration and that of the event being adjusted. Accordingly, the following matrices and derived quantities are defined:
Time-weighted and distance-corrected victory matrix: if
with
, then
Time-weighted and distance-corrected draw matrix: if
, then
Time-weighted and distance-corrected confrontation matrix:
Analogous to the preceding procedure, the ratio matrices can be defined. That is, the
time-weighted and distance-corrected victory ratio matrix is given by
As in the definition of the time-weighted and category-corrected victory ratio matrix, in this case the uncorrected denominator is also used, preserving the head-to-head frequency scale.
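The distance correction works in the same way, except that the coefficient depends on how close a past event's distance class is to that of the target event. The sketch below computes the three distance-specific ratios for one pair at once. The similarity coefficients shown (1.0 for the same class, 0.5 for an adjacent class, 0.25 for classes two steps apart) are purely hypothetical, since the values of Equation (4) are not reproduced here; the function name and input layout are likewise assumptions.

```python
DISTANCE_ORDER = {"short": 0, "medium": 1, "long": 2}
# Hypothetical similarity coefficients, indexed by the number of steps
# between the past event's distance class and the target's class.
SIMILARITY_COEF = {0: 1.0, 1: 0.5, 2: 0.25}


def distance_corrected_ratios(meetings):
    """Distance-corrected victory ratios of horse i over horse j.

    meetings: iterable of (kernel_weight, distance_class, result) tuples,
    with result 1 (i beat j), -1 (j beat i) or 0 (draw).
    Returns a dict with one ratio per target distance class, or None
    when the pair has no positively weighted meeting.
    """
    meetings = list(meetings)
    denominator = sum(w for w, _d, _r in meetings)  # uncorrected
    if denominator <= 0:
        return None
    ratios = {}
    for target, pos in DISTANCE_ORDER.items():
        numerator = sum(
            w * SIMILARITY_COEF[abs(pos - DISTANCE_ORDER[d])]
            for w, d, r in meetings
            if r == 1
        )
        ratios[target] = numerator / denominator
    return ratios
```

A win over a short distance therefore counts fully toward the short-distance ratio and progressively less toward the medium- and long-distance ratios.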
4. Strength Indicators
Performance and strength indicators are fundamental metrics for assessing and evaluating athletic performance in sports science. These indicators provide quantifiable data that facilitate the evaluation of an athlete’s or team’s efficiency and effectiveness. They are essential for understanding how results vary as a function of different factors and can inform the adjustment of training programs to enhance performance. Such indicators can capture multiple dimensions, including biomechanical, psychological, technical–tactical, biological–functional, biochemical, and anthropometric–morphological aspects (see Urdampilleta et al. [11]). Extensive research has addressed this topic in team sports such as soccer (Herold et al. [12]), basketball (García et al. [13]), and baseball (Mercier et al. [14]), as well as in individual sports such as athletics (Johns et al. [15]) and cycling (Phillips and Hopkins [16]). However, when the objective is to construct an Elo-type system, indicators based on the physical or psychological characteristics of the athlete are less suitable, as such systems are primarily result-driven, as exemplified by the original Elo system in chess. Accordingly, in our study, we rely on the horse’s historical performance results to construct these indicators.
Clearly, a rigorous mathematical foundation and the application of appropriate statistical techniques and models are necessary, in line with the assertion of Lames and McGarry [17]: ‘performance analysis for purposes of theoretical advancement must make use of mathematical modeling and simulation techniques’.
Based on the matrices defined in the previous section, strength or performance measures can be constructed for each horse, incorporating information from their recent past performance.
First, we can consider the victory ratio matrix
, in which Kernel weighting is applied to assign greater weight to recent events. The
-th element of this matrix,
, represents the time-weighted victory ratio of horse
i over horse
j. Accordingly, the
i-th row contains the victory ratios of the
i-th horse against all other horses in the study. Thus, the
time-weighted mean victory ratio of the
i-th horse, defined as the mean of the values in the
i-th row of
, is given by
where
Obviously, this ratio represents a measure of the horse’s prior performance.
It is worth highlighting several key aspects of this strength indicator, as outlined in the following comments:
It takes values in the interval [0, 1] and represents an index of the horse’s overall winning ability at the instant under consideration, with more recent events (closer to that instant) carrying greater weight.
It only considers the victory ratios achieved against horses with which the horse has competed in at least one event prior to the instant under consideration. If the i-th horse has not participated in any event before that instant, the indicator is not defined. In other words, this indicator is assigned only to horses that have previously participated in at least one event.
The value 0 is assigned to any horse that, across all events in which it has competed during the considered time window, has never finished ahead of another horse.
The value 1 is assigned to any horse that has won all events in which it has competed during the considered time window, meaning it has not been defeated by any other horse.
Clearly, if a horse has participated in only a few events within the considered time window, the indicator may be unstable and provide limited information about its true strength. Consequently, for young horses at the beginning of their competitive careers, this measure may be unreliable. However, as the horse competes in more events, the indicator becomes increasingly stable and emerges as a reliable measure of its strength.
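The mean victory ratio and the properties listed in the comments above can be checked with a short sketch. It assumes, hypothetically, that the ratio matrix is stored as a dictionary keyed by ordered pairs and containing only pairs with at least one positively weighted confrontation, so that the mean runs over the opponents a horse has actually faced; the function name is an assumption as well.

```python
def mean_victory_ratio(R, horse):
    """Mean of the victory ratios of `horse` over the opponents it has faced.

    R maps ordered pairs (i, j) to the victory ratio of i over j and
    contains only pairs with a positive weighted confrontation total.
    Returns None when the horse has no recorded opponent.
    """
    values = [ratio for (i, _j), ratio in R.items() if i == horse]
    if not values:
        return None  # indicator undefined: no prior confrontation
    return sum(values) / len(values)
```

For instance, with R = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "a"): 0.0, ("c", "a"): 0.5}, horse a obtains 0.75, horse b obtains 0, and a horse absent from R has no defined value. With only one or two opponents the mean can swing sharply after a single race, which is the instability noted for horses with few recorded events.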
To illustrate the indicator’s adequacy, values were obtained for five randomly selected horses from the database, specifically among those that have participated in at least 80 races or events and have competed across all three race distance categories, ensuring that the following conditions are met:
Horse 1: Has finished in first position two or more times in category A races.
Horse 2: Has finished in first position two or more times in category B races, but has not won any category A event.
Horse 3: Has finished in first position two or more times in category C races, but has not won any events in categories A or B.
Horse 4: Has finished in first position two or more times in category D races, but has not won any events in categories A, B, or C.
Horse 5: Has finished in first position two or more times in category E races, but has not won any events in categories A, B, C, or D.
The evolution of the indicators for these horses over time is represented in Figure 1.
The behavior pattern of the mean victory ratio supports comment 5. A slight initial increase in the indicator can be observed, reflecting the horse’s accumulation of experience and gradual stabilization of performance, followed by a decrease in the final phase of its competitive lifespan. Additionally, the indicator generally preserves the intuitive ranking of horses’ potential based on the random selection criteria employed (e.g., Horse 1 outperforming Horse 2, Horse 2 outperforming Horse 3, etc.). However, differentiation based on performance in higher-category events is not explicitly captured by this indicator, as its definition does not account for event categories; this limitation is addressed by the category-corrected indicator described below.
Given the potential influence of event categories on the strength measure, an analogous indicator can be defined using the time-weighted and category-corrected victory ratio matrix
. Accordingly, the
time-weighted and category-corrected mean victory ratio for horse
i is defined as
where
was defined in Equation (
29). Only the numerator is category-modified (see Equation (
23)), preserving the head-to-head frequency scale in the denominator.
This quantity can be interpreted as a measure of the winning capacity or potential of horse i in race or event , or at the instant .
The definition of this indicator incorporates both the horse’s recent performance history (through kernel weighting within the considered time window) and the relevance of the events in which the horse has participated during that period (through the correction coefficients associated with event categories).
Consequently, it takes values in the interval [0, 1] and serves as an index of the horse’s overall strength or winning capacity at the instant under consideration, giving greater weight to more recent events and to results achieved in higher-category competitions. Furthermore, comments 2, 3, and 5 regarding the time-weighted mean victory ratio also apply to this category-corrected indicator.
With regard to comment 4, for this ratio, a value of 1 is assigned to any horse that has participated exclusively in Category A events during the considered time window and has won all of them. Conversely, a horse competing solely in Category E events, even if it wins all of them, will not exceed a value of 0.2 for this indicator.
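The bound of 0.2 for a horse competing only in Category E events follows directly from the structure of the ratio. The derivation below is a sketch in assumed notation, since the symbols are introduced here only for illustration: w_t is the kernel weight of event t, c_E = 0.2 is the Category E coefficient implied by the statement above, and T_{ij} is the set of prior events in which horses i and j met.

```latex
% Assumed notation: w_t kernel weight of event t; c_E = 0.2 coefficient of
% Category E; T_{ij} set of prior events in which horses i and j met.
\[
\frac{\sum_{t \in T_{ij}} c_E \, w_t \, \mathbb{1}\{ i \text{ beats } j \text{ in } t \}}
     {\sum_{t \in T_{ij}} w_t}
\;\le\;
c_E \, \frac{\sum_{t \in T_{ij}} w_t}{\sum_{t \in T_{ij}} w_t}
\;=\; c_E \;=\; 0.2 .
\]
```

Since every pairwise ratio is at most 0.2, their mean over all opponents is also at most 0.2, with equality only when the horse defeated every opponent in every positively weighted meeting. The same argument with a coefficient of 1 gives the upper value of 1 for a horse that wins every Category A event it enters.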
Following an approach analogous to that used for the time-weighted mean victory ratio, the values of the category-corrected indicator for the five selected horses are shown in Figure 2. It is noteworthy that the differentiation of horses based on performance in higher-category events is clearly captured, providing a more accurate representation of the intuitive ranking among the selected horses.
Finally, the effect of race distance must be taken into account. To this end, we utilize the time-weighted and distance-corrected victory ratio matrices. As defined in the previous section, prior to the instant
of an event
, three time-weighted and distance-corrected victory ratio matrices are constructed, corresponding to the distance associated with that event,
, with the weightings specified in Equation (
4):
For short distance (
), at the instant
, we consider the
time-weighted and distance-corrected victory ratio matrix for short distances:
For medium distance (
), at the instant
, we consider the
time-weighted and distance-corrected victory ratio matrix for medium distances:
For long distance (
), at the instant
, we consider the
time-weighted and distance-corrected victory ratio matrix for long distances:
In the preceding notation, each matrix can be considered as corresponding to the time instant
, or equivalently, the instant immediately preceding event
. Using these matrices, mean ratios can be calculated for each horse, serving as strength measures specific to each distance. Accordingly, the
time-weighted and distance-corrected mean victory ratio for short distance for horse
i is defined as
where
was defined in Equation (
29).
Analogously, the
time-weighted and distance-corrected mean victory ratio for medium distance and the
time-weighted and distance-corrected mean victory ratio for long distance for horse
i are defined, respectively, as:
and
For each horse, these three quantities can be interpreted as measures of the winning potential or capacity of the horse in the race or event under consideration, or at the corresponding instant. As with the previously defined indicators, their definitions incorporate both the horse’s recent performance history (via kernel weighting within the considered time window) and the distances of the events in which the horse has participated (via the correction coefficients associated with the three distance categories). Consequently, these measures take values in the interval [0, 1] and, for a given distance, serve as indices of the horse’s strength or overall winning capacity at that instant for events of that distance. These indices give greater weight to more recent events and to results achieved in competitions of similar distances. Additionally, comments 2, 3, and 5 regarding the time-weighted mean victory ratio also apply to these distance-corrected indicators, while comment 4 should be adapted analogously to the category-corrected ratio. Thus, each horse is assigned a separate ratio for each of the three distance modalities.
Analogous graphs were obtained for these indicators (see Figure 3), and comments similar to the preceding ones apply to the evolution of the horses’ performance shown in them.
In summary, five strength indicators have been proposed for each horse (see Table 2).
Time-weighted mean victory ratio: a general strength indicator that accounts solely for results within the recent past, as defined by the considered time window.
Time-weighted and category-corrected mean victory ratio: a general strength indicator that accounts for results in the recent past, based on the considered time window, and gives greater relevance to results obtained in high-category events.
Time-weighted and distance-corrected mean victory ratios for short, medium, and long distance: specific strength indicators for each of the three distance modalities, taking into account results in the recent past based on the considered time window.
5. Validation of the Strength Indicators
The previously defined indicators can be regarded as measures of a horse’s strength and performance, as they are fundamentally based on results obtained in prior events. While additional analysis is not strictly necessary to establish them as strength indicators, a validation study is recommended to confirm their appropriateness with greater certainty. Validation ensures that the proposed indicators accurately measure the intended construct and provide consistent results across different occasions. The reliability of these measures has already been illustrated graphically in Figure 1, Figure 2 and Figure 3, which show that, when evaluated over time for a given horse, the indicator values progressively stabilize, remain consistent, and generally do not exhibit abrupt or anomalous fluctuations.
There is no exact or universally accepted measure of a horse’s strength, and data on the horse’s physical or biological characteristics are not available to directly assess the indicators’ ability to capture that strength. Consequently, we focus on a form of criterion validation, evaluating the relationship between the indicator values and other related measures or proxies of the horse’s strength. In particular, we assess the indicators’ ability to predict each horse’s performance in events or tournaments, i.e., their ‘predictive validity’ (see Cronbach and Meehl [18] and Clemens et al. [19]). Specifically, for an event
t in which a collection of horses
participates, we use the strength indicators of each horse to predict the victory outcome for every pair
(
or
), and then validate these predictions against the real available data. All validation is performed under a strict out-of-time protocol: for each test event, the indicators and matrices are computed using only data available prior to that event.
Given a pair of horses
scheduled to run a specific event or tournament
, denote the probability that “
” let
. Throughout, we denote
. Let
(defined in Equation (
14)) and
. The objective is focused on estimating this probability using the strength indicators: the general strength indicators,
and
, the general strength indicators corrected by event category,
and
, and the strength indicators specific to the event distance
(
or
l),
(or
) and
(or
).
Using these indicators, the following differences are defined:
This probability is estimated through a logistic regression model (see McCullagh and Nelder [20]), with the three differences in strength indicators as predictor variables. That is, by considering the model,
or
Estimating the logistic regression model requires a random sample of mutually independent observations. Therefore, for each of the three event distances, a random sample of pairs was selected from the horses included in the database from 2005 onward, following the procedure described in Appendix A.
The pairs were selected randomly such that each horse appears at most once in the collection of pairs and at most one pair is taken from each race or event. Thus, the records included in the sample can be considered mutually independent and, consequently, can be used to fit the logistic regression model.
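The two constraints stated above (each horse at most once, at most one pair per event) can be illustrated with a minimal Python sketch. It is not the procedure of Appendix A, which is not reproduced here; the function name `sample_independent_pairs`, the input layout (a dictionary from event identifier to participating horses) and the greedy random strategy are all assumptions made for illustration.

```python
import random

def sample_independent_pairs(events, seed=0):
    """Draw at most one pair per event so that no horse is used twice.

    events: dict mapping an event id to the list of horses that ran in it
            (hypothetical layout, not the paper's data structure).
    Returns a list of (event_id, horse_i, horse_j) tuples.
    """
    rng = random.Random(seed)
    used = set()                     # horses already assigned to a pair
    pairs = []
    event_ids = list(events)
    rng.shuffle(event_ids)           # visit events in random order
    for ev in event_ids:
        free = [h for h in events[ev] if h not in used]
        if len(free) < 2:            # not enough unused horses in this event
            continue
        i, j = rng.sample(free, 2)   # random pair among the unused horses
        used.update((i, j))
        pairs.append((ev, i, j))
    return pairs
```

A greedy pass of this kind does not maximize the number of pairs, but every returned collection satisfies both independence constraints.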
The same procedure is applied for the other two distances, resulting in three samples, one per distance, each with its corresponding dataset. The sample sizes obtained are 1823, 2244, and 1995 records, respectively. The logistic regression model is then fitted to each of these datasets. Given the potential for multicollinearity among the explanatory variables, the ‘elastic net’ regularized logistic regression model (Zou and Hastie [21]) is applied, using the “glmnet” library in R (see Friedman et al. [22] and Tay et al. [23]). The regularization criterion implemented in this library is a mixture of ridge and LASSO regularization, controlled by a parameter $\alpha$ that takes values in the interval $[0,1]$ ($\alpha = 0$ for ridge regularization; $\alpha = 1$ for LASSO regularization). The regularization parameter $\lambda$ is selected through cross-validation using the ‘cv.glmnet()’ function of the library, so as to maximize the ‘area under the curve’ (AUC) criterion of the associated classification rule.
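To make the roles of the mixing and regularization parameters concrete, the following Python sketch evaluates the elastic net penalty in the form used by glmnet, $\lambda[(1-\alpha)\lVert\beta\rVert_2^2/2 + \alpha\lVert\beta\rVert_1]$, together with a simple pairwise-comparison computation of the AUC used as the selection criterion. The function names are illustrative, and the sketch does not reproduce the fitting or cross-validation carried out in R.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * ((1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)

def auc(scores, labels):
    """Area under the ROC curve as the share of (positive, negative) pairs
    ranked correctly, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With `alpha = 0` the penalty reduces to the ridge term and with `alpha = 1` to the LASSO term, which matches the two extreme cases described above.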
Thus, for each of the event distances (short, medium, long), models were fitted for a range of values of $\alpha$, with the $\lambda$ parameter determined via cross-validation in each case.
Table 3 presents the results obtained for LASSO regularization, ridge regularization, and the optimal $\alpha$ value according to the AUC criterion. The AUC values for all tested $\alpha$ levels across the three event distances are illustrated in Figure 4. In addition, the table includes the regularization parameters ($\lambda$ values) and the estimates of the model coefficients for each event distance.
To compare these results with those obtained using other classification methods, we applied the following techniques to each of the samples using the caret library (Kuhn [24]): Support Vector Machines with Linear Kernel, Support Vector Machines with Polynomial Kernel, Neural Networks, Bagging (Bagged CART), Random Forest, and eXtreme Gradient Boosting. The results, according to the AUC criterion, are comparable to those achieved through regularized logistic regression (see Appendix B). Since the latter approach enables a more straightforward interpretation of the role of each predictor based on the strength indicators (via the estimated odds ratios and the coefficients in the linear predictor), we opted to continue the validation of these indicators using the logistic regression model. Nevertheless, the validation process could be carried out using any of the aforementioned techniques, yielding virtually identical results.
Based on the coefficient estimates reported in
Table 3, the probabilities outlined in Equation (
38) can be estimated; that is:
with
.
Thus, the validation of each model is based on comparing the actual results recorded in the binary variable Y with the values fitted by the model: for each pair of horses in an event, the fitted value assigns the victory to the horse with the higher estimated probability of victory.
This classification rule is applied to all possible pairs of competing horses in each event from the year 2005 onward.
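As an illustration of how the rule can be applied to every pair in an event, the sketch below computes a logistic probability from the three differences in strength indicators and classifies each pair. The coefficient values, the optional intercept (set to zero by default), the 0.5 threshold with ties assigned to the second horse, and the function names are assumptions for this example; the fitted coefficients are those reported in Table 3 and are not reproduced here.

```python
import math
from itertools import combinations

def win_probability(diffs, coefs, intercept=0.0):
    """Logistic probability that the first horse beats the second, given the
    differences in their strength indicators (first minus second)."""
    eta = intercept + sum(b * d for b, d in zip(coefs, diffs))
    return 1.0 / (1.0 + math.exp(-eta))

def predict_event(indicators, coefs, intercept=0.0):
    """indicators: dict horse -> tuple of three strength indicators
    (general, category-corrected, distance-specific), all hypothetical values.
    Returns dict (i, j) -> (estimated probability, predicted outcome)."""
    predictions = {}
    for i, j in combinations(sorted(indicators), 2):
        diffs = [a - b for a, b in zip(indicators[i], indicators[j])]
        p = win_probability(diffs, coefs, intercept)
        # Predicted outcome: 1 if horse i is expected to beat horse j (assumed rule).
        predictions[(i, j)] = (p, 1 if p > 0.5 else 0)
    return predictions
```

For an event with $n$ horses this produces $n(n-1)/2$ predictions, one per unordered pair.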
The total number of records, or horse pairs, for which the probabilities were estimated was 676,047. This large dataset enables a comprehensive evaluation of the forecasting performance based on the horses’ strength indicators. For this assessment, we follow the framework proposed by Williams et al. [25] in the context of ATP tennis match forecasting. Specifically, the performance measures considered are prediction accuracy, calibration, model discrimination, and the Brier score [26]. The rationale for using these measures is that, while accuracy is often viewed as the most desirable property in predictive modeling, sensitivity to potential bias is also critical (Irons et al. [27]). Definitions of these performance measures are provided below:
Prediction accuracy measures how well a model’s predicted outcomes match reality; it is calculated as the ratio of correctly predicted pairs to the total number of predictions.
The calibration ratio
C is calculated as the sum of the victory probabilities of the horse with the highest probability, divided by the number of pairs or records in which this horse actually wins. For an event
, let the collections of pairs or records be:
Obviously,
. We denote the collections for all events considered in the validation set as
Thus, the calibration ratio
C is defined as
As noted by Williams et al. [25], the closer the value is to one, the better calibrated and less biased the prediction method is and, consequently, the more representative of reality the estimated probabilities are. A calibration ratio greater than one indicates that the model overestimates the victory probabilities of the horses with the highest estimated probability. Conversely, a ratio less than one means that the model underestimates them.
The discrimination
D metric is calculated as the mean of the estimated probabilities of the pairs where the horse with the higher probability won, minus the mean of the estimated probabilities of the pairs where that horse lost (surprises). For an event
, let the collections of pairs be
and
Respectively, let the unions for the collection of all events or tournaments considered be
and
with their mean estimated probabilities:
If
, we consider
. Thus, the discrimination metric is defined as
Obviously, high values of this measure reflect greater discriminatory power.
The Brier score [26] is defined as the mean of the squared differences between the predicted probability and the actual outcome over all matches between two horses,
$$\mathrm{BS} = \frac{1}{N} \sum_{k=1}^{N} \left(\hat{p}_k - Y_k\right)^2,$$
where $\hat{p}_k$ and $Y_k$ are the estimated probability and the observed binary outcome for the $k$-th pair, and the total number of pairs or records analyzed, $N$, is given by
$$N = \sum_{t} \frac{n_t (n_t - 1)}{2},$$
where $n_t$ is the number of horses that participated in event $t$. A Brier score of 0 means perfect accuracy, and a Brier score of 1 means perfect inaccuracy. If the decision rule is practically random, i.e., with victory (or defeat) probabilities close to $0.5$, the Brier score takes values close to $0.25$. Consequently, values less than $0.25$ indicate ‘good accuracy’.
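The four measures can be computed from a list of pairwise records, as in the following Python sketch. The record layout, the function names, the assumption that the two horses’ probabilities in a pair sum to one, and the convention of using zero for the mean over an empty set of pairs are choices made for this illustration and are not taken from the paper.

```python
def pairwise_metrics(records):
    """records: list of (p, y), with p the estimated probability that the first
    horse beats the second and y = 1 if it did, 0 otherwise (assumed layout)."""
    n = len(records)
    # Re-express each record from the favourite's point of view,
    # assuming the second horse's probability is 1 - p.
    favourite = [(p, y) if p >= 0.5 else (1.0 - p, 1 - y) for p, y in records]
    won = [p for p, w in favourite if w == 1]    # favourite won
    lost = [p for p, w in favourite if w == 0]   # favourite lost (surprises)
    accuracy = len(won) / n
    # Sum of favourites' probabilities over the number of favourite wins.
    calibration = sum(p for p, _ in favourite) / len(won) if won else float("inf")
    mean_won = sum(won) / len(won) if won else 0.0      # empty-set convention: 0
    mean_lost = sum(lost) / len(lost) if lost else 0.0  # empty-set convention: 0
    brier = sum((p - y) ** 2 for p, y in records) / n
    return {"accuracy": accuracy, "calibration": calibration,
            "discrimination": mean_won - mean_lost, "brier": brier}

def total_pairs(field_sizes):
    """Number of unordered pairs over all events, given each event's field size."""
    return sum(k * (k - 1) // 2 for k in field_sizes)
```

As a sanity check, a model that always outputs 0.5 yields a Brier score of exactly 0.25, the reference value for a practically random decision rule.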
Table 4 presents the results of the four performance measures for the elastic net regularized models at each race distance. For short distances, given the equivalent model fit according to the AUC criterion (see Table 3), the equal accuracy, and the better values of the C, D and Brier score metrics, the model with ridge-LASSO mixture regularization is selected, as highlighted in bold in the table. Similarly, for medium and long distances, ridge regularization ($\alpha = 0$) and LASSO regularization ($\alpha = 1$) are selected, respectively.
The last row of Table 4 reports the global validation measures across all records combined. Globally, the results show good accuracy and Brier scores, very good calibration, and low discriminatory capacity. Nevertheless, these findings indicate that the horses’ strength or performance indicators enable the prediction of sporting outcomes with acceptable accuracy. In other words, the association between a horse’s indicators at a given time and its results in an event occurring at that time is confirmed.
To analyze these results, it must be taken into account that the models based on the horse strength indicators estimate the victory probabilities (defined in Equation (38)) of one horse over another in each of the races or events. Furthermore, the models include as covariates neither characteristics describing the horses’ morphology or genetics nor variables describing the specific event or the conditions under which it takes place. Pre-race betting data, which are widely used in sports predictions, are not included either. Finally, no traits or characteristics associated with the drivers steering each horse have been considered. An alternative approach might have involved constructing strength indicators for the combined ‘horse–driver’ pair, but such metrics would not accurately represent the horse’s intrinsic strength and were therefore excluded from the analysis. Therefore, the reported performance should be considered a lower bound relative to models including betting or morpho-genetic covariates.
To establish a frame of reference for the values achieved by these metrics, we can consider some studies where prediction models are applied to sporting competitions.
In the work by McHale and Morton [
2], various models are proposed to study the outcomes of tennis matches in ATP tournaments. The accuracy or proportion of matches in which the higher ranked player won, for rankings derived from the official ATP rankings and five different models, takes values in the interval
. The prediction is based on the ATP rankings, which can be considered a measure of the tennis player’s ‘strength’.
Spann and Skiera [
28] compare the forecasting accuracy of different methods and evaluate their ability to systematically generate profits in a betting market. They report the results of an empirical study using match data from three seasons of the German Bundesliga (first division football). The accuracy or hit rate of the proposed models does not exceed 55%.
Gifford and Bayrak [29] develop predictive analytics models to forecast NFL game outcomes in a season using decision trees and logistic regression. Using 2002–2018 NFL data, their study focuses on constructing predictive models to quantify the influence of team statistics on 2018 NFL regular season wins. In addition to the teams’ numbers of previous wins and losses, the models include as predictors variables describing the course of the games (total yards gained by rushing, total yards gained by offense, team turnovers lost…). The presence of these variables makes it possible to predict the outcomes of NFL games with high accuracy: the misclassification rate is approximately 0.216 for the decision tree model and 0.169 for the logistic regression model.
In the previously cited work by Vaughan Williams et al. [25], the purpose is to examine the performance of different forecasting methodologies for both men’s and women’s professional tennis matches. The authors use various variables of prior athlete performance (the official men’s and women’s tennis rankings, the standard Elo ratings or the surface-specific Elo ratings). They apply their methodologies to several of the most relevant tournaments (Wimbledon, US Open, Australian Open and French Open), thus featuring the world’s best tennis players. The following measures of predictive capacity and fit for match outcome prediction were achieved across their models: accuracy,
; calibration ratio,
, discrimination,
; Brier score,
.
The analysis of the preceding references allows us to conclude that the results obtained by the models fitted in this work should be considered good, given the inherent difficulty of predicting outcomes in equine sporting events, and considering that the sole variables included in the models are indicators based purely on prior results. Consequently, the validation analysis, based on the principle of predictive validation, confirms that the defined indicators are valid and adequate for providing information on the strength of the horses; that is, they ‘measure what they intend to measure’ and offer consistent and reliable results.
6. Discussion and Conclusions
This study introduces a family of five purely result-based strength indicators for trotter horses, incorporating (i) time-decay via kernel weighting, (ii) competition level through category coefficients, and (iii) event-distance adjustments. These indicators are straightforward to compute, interpretable on the $[0,1]$ scale, and can be updated after each race; they provide the building blocks for an Elo-type ranking system.
It is important to emphasize that this work does not aim to address the prediction of race or sporting event outcomes. While this problem has attracted considerable attention from researchers and betting agencies, success has not always been achieved. Numerous studies have been published in this domain, and many more presumably remain unpublished. Wunderlich and Memmert [
30] provide an overview of key topics in sports outcome forecasting, highlighting the central role of ratings as an intermediate step in predictive models and discussing the challenges associated with evaluating the quality of ratings-based forecasts. Accordingly, the construction of strength or performance indicators can serve as a first step toward (i) facilitating the development of Elo-type ranking or classification systems and, potentially, (ii) enabling more accurate outcome prediction through these systems. The present study focuses exclusively on the first step: the construction of the indicators.
The proposed methodology relies on the horse’s historical outcomes, emphasizing the critical importance of up-to-date information and the need for performance or strength measures to be updated after each event (Aldous [
1]). To this end, a weighting system is applied to previous results, assigning greater importance to more recent outcomes through the use of kernel functions. Additionally, two other factors play a central role in the methodology: a categorization scheme for sporting events based on their sporting or economic significance, and the conditions under which the events take place, which can influence the performance of the athlete, team, or animal. In the context of this study, these factors correspond to the race category, determined by prize money, and the race distance.
Based on a collection of historical data on events or competitions and on athletes or teams, the general methodology proposed can be summarized as follows:
Phase 0 Prerequisites:
- Define the temporal weighting scheme.
- Define the categories of the events or competitions and the corresponding weighting scheme based on category.
- Define the collection of event development conditions (e.g., track type, race distance…).
Phase 1 Construction of Ratio Matrices.
- Construction of the time-weighted victory ratio matrices for each instant of interest, or for the instant immediately preceding an event or competition t.
- Construction of the time-weighted and category-corrected victory ratio matrices for each instant of interest, or for the instant immediately preceding an event or competition t.
- For each condition, construction of the time-weighted and condition-corrected victory ratio matrices for each instant of interest, or for the instant immediately preceding an event or competition t.
Phase 2 Construction of Individual Strength Indicators (Athlete, Team, Horse, etc.).
- For each individual i, construction of the general strength indicator: the time-weighted mean victory ratio for each instant of interest, or for the instant immediately preceding an event or competition t, defined as the average of the i-th row of the corresponding matrix.
- For each individual i, construction of the general strength indicator: the time-weighted and category-corrected mean victory ratio for each instant of interest, or for the instant immediately preceding an event or competition t, defined as the average of the i-th row of the corresponding matrix.
- For each individual i and for each condition, construction of the specific strength indicator: the time-weighted and condition-corrected mean victory ratio for each instant of interest, or for the instant immediately preceding an event or competition t, defined as the average of the i-th row of the corresponding matrix.
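Phase 2 can be sketched in Python as a row average over a victory ratio matrix. The matrix layout (a list of lists with `None` for pairs that have never met), the exclusion of the diagonal, and the choice to average only over available entries are assumptions of this example; the exact definition is the one given by the corresponding equations earlier in the paper.

```python
def row_mean_indicator(ratio_matrix, i):
    """Strength indicator of individual i as the mean of row i of a victory
    ratio matrix, skipping the diagonal and missing (None) entries.
    Returns None if individual i has no recorded opponents."""
    values = [v for j, v in enumerate(ratio_matrix[i])
              if j != i and v is not None]
    return sum(values) / len(values) if values else None

# Hypothetical 3 x 3 matrix: entry [i][j] is the weighted ratio of
# victories of i over j within the time window.
example_matrix = [
    [None, 0.75, 0.50],
    [0.25, None, None],
    [0.50, None, None],
]
example_strengths = [row_mean_indicator(example_matrix, i) for i in range(3)]
```

The same function applies unchanged to the category-corrected and condition-corrected matrices, since only the matrix entries differ between indicators.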
The procedure described above for developing and constructing the indicators must be complemented by a validation study. Such a step enables the use of these indicators as the foundation for a monitoring and evaluation system to track the progression of athletes’ strength and performance. The validation procedure may vary depending on the availability of relevant information, but it generally involves applying a criterion validation approach, which assesses whether the indicators correlate with a recognized “gold standard” or established outcome. More specifically, a predictive validity framework is employed, evaluating the extent to which the new indicators can predict future outcomes. This framework is adopted in the present study and can be summarized as follows:
For each sporting event t included in the historical data collection:
- Obtain the collection of strength indicators of the participants at the instant at which the event was held.
- Apply the prediction model or technique using these strength indicators, and record the measure of success or failure in that event.
The predictive capacity is then evaluated through the goodness-of-fit measures and performance metrics of the applied prediction model or technique.
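The out-of-time logic of this validation loop can be sketched in Python as follows. To keep the example self-contained, the strength indicator is replaced by a deliberately simple stand-in (the unweighted share of past pairwise wins), not the kernel-weighted indicators of the paper; the data layout and function names are likewise hypothetical.

```python
from itertools import combinations

def past_win_share(history, horse):
    """Stand-in indicator: share of past pairwise contests won by the horse.
    history: list of (winner, loser) pairs from earlier events only."""
    wins = sum(1 for w, _ in history if w == horse)
    losses = sum(1 for _, l in history if l == horse)
    total = wins + losses
    return wins / total if total else 0.5   # no history: neutral value

def out_of_time_accuracy(events):
    """events: list of (time, finishing_order), best-placed horse first.
    Indicators for each event use only results from strictly earlier events."""
    history, hits, total = [], 0, 0
    for _, order in sorted(events, key=lambda e: e[0]):
        strength = {h: past_win_share(history, h) for h in order}
        for ahead, behind in combinations(order, 2):   # 'ahead' beat 'behind'
            if strength[ahead] != strength[behind]:    # skip undecided pairs
                total += 1
                hits += strength[ahead] > strength[behind]
        # Only now does this event become part of the available history.
        history.extend(combinations(order, 2))
    return hits / total if total else None
```

The key design point is that the event's own results are appended to the history only after its predictions have been scored, so no information from the event leaks into its indicators.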
The results obtained by applying this methodology to horse racing can be considered positive, as five indicators of each horse’s strength or performance were constructed based solely on prior outcomes. These indicators are computationally straightforward, easily interpretable, useful for comparing performance among horses, and adequately stable, and thus constitute reliable measures of a horse’s strength. Furthermore, their predictive capacity is adequate (good accuracy, Brier score and calibration, although with low discriminatory capacity), which supports their validity according to the ‘predictive validity’ criterion.
In summary, the following practical implications, limitations, and future directions can be expressed:
Practical implications. The proposed indicators enable within-season monitoring of form, transparent comparisons across competition levels, and distance-specific profiling; they can also serve as inputs to Elo-type rankings and to simple selection/handicap rules.
Limitations. The indicators are purely result-based and ignore morphology/genetics, driver effects, track/weather and betting information; pairwise observations are event-clustered; and the category/distance weights were fixed a priori (monotone but not data-calibrated). Performance in Section 5 (see Table 4) should therefore be viewed as a transparent baseline.
Future work. We plan (i) data-driven calibration of the category and distance weights under monotonic constraints, (ii) hierarchical/Bayesian formulations to propagate uncertainty and separate horse/driver effects, (iii) sensitivity analyses of the kernel and bandwidth, (iv) integration of track/driver/betting covariates for full forecasting systems, and (v) external validation across other trotting circuits and gallop racing.