1. Introduction
Most Thoroughbred racehorses complete their racing careers and retire around the age of 5 years. However, given the long lifespan of horses (approximately 25 years), new uses must be found for many of them [1,2]. A common option is to employ them in equestrian competitions such as dressage and show jumping after transition training (retraining) [2,3]. To investigate the factors affecting the success of this transition, previous studies have examined how racing performance and behavioral characteristics immediately after retirement are associated with subsequent suitability for equestrian use [3,4]. However, the ability of retired racehorses to perform successfully in these equestrian competitions is also likely affected by a variety of factors, including age, sex, and the duration of retraining after retirement. Clarifying the extent to which each factor affects competitive performance could provide valuable insights that would help promote the effective use of retired racehorses through the development of more refined training policies and rider–horse pairing strategies.
Previous analyses of equestrian competition performance, including show jumping, have reported that stallions and geldings outperform mares or, alternatively, that stallions perform better than both mares and geldings [5,6,7], with this tendency being stronger in more difficult competitions [5,6]. Although these findings are useful in the development of training and competition strategies, most of the studied populations consisted of breeds produced for equestrian competitions rather than Thoroughbreds used for racing, and several previous reports have noted performance differences among breeds [7,8,9]. In addition, racehorses were originally bred and trained exclusively for racing; thus, such findings may not necessarily apply when retired racehorses are retrained for equestrian competition.
To clarify the factors affecting the successful transition of retired Thoroughbred racehorses to equestrian disciplines, performance data from competitions restricted to retired Thoroughbred racehorses need to be examined. In Japan, such data are available through the Retired Racehorse Cup (RRC), a nationwide equestrian competition established in 2018 to promote the use of retired Thoroughbred racehorses for equestrian purposes. In this study, we fitted separate multivariable Bayesian linear mixed models for rank, round time, and obstacle faults in RRC show-jumping competitions. Each model included horse sex, age, and the time between retirement from racing and competition entry (defined as the interval, assumed to represent the retraining period) as covariates, with rider, rider–horse pair, individual horse ability (horse), sire, affiliation after retirement, competition year, and venue as random effects.
2. Materials and Methods
2.1. Overview of the Competition and Dependent Variables
The RRC qualifying competitions followed the Japan Equestrian Federation “Small Obstacle B” (90 cm class) rules and consisted of a course with 11 obstacles, divided into six in round 1 and five in round 2. Rankings were primarily determined by the total number of obstacle faults accumulated across both rounds, and in the case of ties, the round time in round 2 was used to determine placement. Further details on the RRC competition format, rules, and eligibility criteria are available from the official RRC website (in Japanese): https://jouba.nrca.or.jp/rrc/ (accessed on 6 January 2026).
In this study, the dependent variables were ranking, round time in the first and second rounds (round time 1 and round time 2, respectively), and net obstacle faults in each round, excluding time penalties (obstacle faults 1 and obstacle faults 2, respectively). Rankings at each venue within each competition were transformed using the following formula: transformed rank = 1 − (rank − 1)/(max(rank) − 1), so that the best-placed pair received a value of 1 and the last-placed pair a value of 0. Each qualifying competition had at least two records.
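As a minimal sketch of this rescaling, the transformation can be applied to the ranks of one competition at one venue. The function name `transform_ranks` and the example ranks are illustrative only and do not come from the study.

```python
def transform_ranks(ranks):
    """Rescale finishing ranks within one competition to the range [0, 1].

    Implements 1 - (rank - 1) / (max(rank) - 1): rank 1 maps to 1.0 and the
    worst rank maps to 0.0.
    """
    worst = max(ranks)
    if worst <= 1:
        # The denominator would be zero; at least two distinct ranks are needed.
        raise ValueError("need a worst rank greater than 1")
    return [1 - (r - 1) / (worst - 1) for r in ranks]


# Hypothetical competition with five ranked rider-horse pairs.
print(transform_ranks([1, 2, 3, 4, 5]))  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

Because higher transformed values indicate better placement, an increase in transformed rank corresponds to improved performance.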
Round time values were log-transformed, and the resulting estimated marginal means (EMMs) were back-transformed using the exponential function.
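The effect of analyzing round time on the log scale can be illustrated with a simplified example. Exponentiating a mean of log-transformed values returns a geometric mean, which is never larger than the arithmetic mean. This is a sketch with hypothetical times; the study's EMMs come from the fitted model, not from a raw average.

```python
import math


def back_transformed_mean(times):
    """Average on the log scale, then return to the original scale with exp()."""
    mean_log = sum(math.log(t) for t in times) / len(times)
    return math.exp(mean_log)


times = [40.0, 90.0]  # hypothetical round times in seconds
geometric = back_transformed_mean(times)
arithmetic = sum(times) / len(times)
print(round(geometric, 6), arithmetic)
```

For these two values the back-transformed mean is about 60 s, whereas the arithmetic mean is 65 s.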
Early RRC competitions had no age restriction, and a two-year-old horse competed once; however, for animal welfare reasons, participation was later limited to horses aged three years or older.
2.2. Data Overview and Descriptive Statistics
Between 2018 and 2024, data were collected from a total of 76 qualifying RRC competitions held annually at a cumulative total of 20 venues across Japan. Records for rider–horse pairs that were withdrawn or eliminated due to falls or refusals, as well as 36 records from 15 horses whose sex changed from stallion to gelding during the study period, were excluded, resulting in 1951 records for analysis (Table S1).
The 1951 records analyzed in this study included 860 horses (38 stallions, 228 mares, and 594 geldings), 534 riders, and 1190 rider–horse pairs, of which 369 horses, 319 rider–horse pairs, and 376 riders participated more than once (Figure S1A,B). Figure S1C–F shows the number of rider–horse pairs per competition year, the mean age of horses in each competition year, the mean transformed rank in each competition year, and the number of rider–horse pairs in each interval category, respectively.
In the first year of the competition (2018), only one round was held under conditions comparable to those of round 2, which was therefore treated as the second round in this study. Accordingly, analyses of the first round were conducted using 1831 records, excluding those from the year 2018. For the obstacle-fault analyses, net obstacle faults were calculable only when time-penalty values were available (see Section 2.1); therefore, records lacking time-penalty information were excluded. As a result, 893 first-round records (2019, 2021, 2023, and 2024) and 1017 second-round records (2018, 2019, 2021, 2023, and 2024) were available for the obstacle-fault analyses (four and five competition years, respectively).
In the analyses of the sex × interval and sex × age interactions for transformed rank, the mean transformed rank ± SD was calculated for each sex, for each interval category and age group, and for each sex within each interval category and age group. Differences among groups were tested using Kruskal–Wallis tests, followed by pairwise Wilcoxon tests with Holm correction.
2.3. Bayesian Linear Mixed Model Analysis
Because the competition dataset includes many grouping levels (e.g., horse, rider, and rider–horse pair), with sparse observations per level, we used Bayesian linear mixed models. Bayesian estimation with appropriate priors provides stable variance estimates in such sparse hierarchical structures [10].
In the Bayesian linear mixed model analysis, horse sex was included as the primary effect of interest, with interval and age included as covariates. Sex was categorized as stallion, mare, or gelding. Interval was defined as the time between retirement from racing (i.e., the date of the final race) and competition entry. Because the RRC targeted only horses within 3 years of retirement (deregistration from the racing registry), interval was classified as ≤1 year, >1 and ≤2 years (2 years), and >2 and ≤3 years (3 years). Age was categorized as 2–5 years, 6–7 years, 8–9 years, and 10–15 years. The RRC also accepted entries from horses that had been registered as racehorses but retired without participating in any races. Because these horses had no final race date, their interval was undefined, and they were classified as the NI (non-interval) group. The ≤1 year, 2 years, 3 years, and NI groups included 754, 615, 488, and 94 records, respectively. The 2–5 years, 6–7 years, 8–9 years, and 10–15 years age groups included 705, 648, 427, and 171 records, respectively.
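The categorization rules above can be sketched as follows. The function names are hypothetical, and converting days to years by dividing by 365.25 is an assumption made here for illustration; the text does not state how the authors computed interval length.

```python
from datetime import date


def interval_category(final_race_date, entry_date):
    """Return the interval category, or "NI" for horses that never raced."""
    if final_race_date is None:
        return "NI"
    # Assumption: years are approximated as elapsed days / 365.25.
    years = (entry_date - final_race_date).days / 365.25
    if years < 0 or years > 3:
        raise ValueError("interval outside the 0-3 year range described in the text")
    if years <= 1:
        return "<=1 year"
    if years <= 2:
        return "2 years"
    return "3 years"


def age_category(age_years):
    """Return the age group used as a covariate level."""
    for low, high in [(2, 5), (6, 7), (8, 9), (10, 15)]:
        if low <= age_years <= high:
            return f"{low}-{high} years"
    raise ValueError("age outside the 2-15 year range described in the text")


# Hypothetical horse: final race in December 2020, competition entry in May 2022.
print(interval_category(date(2020, 12, 1), date(2022, 5, 1)), age_category(6))
```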
In the analysis of fixed effects, it was not feasible to simultaneously include interactions between sex and both covariates (age and interval) because of computational resource limitations. However, descriptive statistics suggested that sex-specific trends across interval categories diverged more than sex-specific trends across age categories, indicating that the sex × interval interaction was more salient than the sex × age interaction (Figure 1B). Furthermore, when rank was set as the dependent variable and models were preliminarily fitted with either a sex × age or a sex × interval interaction, statistically meaningful effects were observed only in the model including the sex × interval interaction. Therefore, the interaction between sex and interval was incorporated in the present analysis.
Random effects were specified for rider (534 levels), horse (860 levels), sire (210 levels), affiliation after retirement (defined as the facility responsible for housing the horses and providing their equestrian training after retirement from racing; 259 levels), competition year (2018–2024, 7 levels), and venue (20 venues across Japan; 20 levels). To account for the combined effect of rider and horse, rider–horse pair (1190 levels) was additionally included (Table S2). Because interval was assumed to correspond to the retraining period and could substantially influence changes in performance, random slopes for interval were specified for the horse and for the rider–horse pair. For the analyses of round time and obstacle faults, random slopes for interval were excluded for the horse and rider–horse pair to reduce computational burden. The full model formula for the transformed-rank analysis is provided below (where “Sex * Interval” denotes the main effects of sex and interval and their interaction). For round time and obstacle faults, the same fixed-effects structure was used, but without the interval random slopes for horse and rider–horse pair:
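As a reading aid, the model structure described above can be sketched in Wilkinson-style (brms/lme4) notation. This is a reconstruction from the prose, not the authors' exact formula: the variable names are placeholders assumed here, and the precise form of the random-slope terms for horse and rider–horse pair is an assumption.

```latex
\begin{aligned}
\text{TransformedRank} \sim{} & \text{Sex} * \text{Interval} + \text{Age} \\
 & + (1 + \text{Interval} \mid \text{Horse}) + (1 + \text{Interval} \mid \text{Pair}) \\
 & + (1 \mid \text{Rider}) + (1 \mid \text{Sire}) + (1 \mid \text{Affiliation}) \\
 & + (1 \mid \text{Year}) + (1 \mid \text{Venue})
\end{aligned}
```

Under the same assumptions, the round-time and obstacle-fault models would replace the two random-slope terms with random intercepts only, that is, $(1 \mid \text{Horse})$ and $(1 \mid \text{Pair})$.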
In this analysis, we constructed a model assuming a Student-t error distribution within a Bayesian framework. However, a Gaussian distribution was applied to the data for obstacle faults 1 and 2 because obstacle-fault scores reflect an underlying continuous process, and because the negative binomial and zero-inflated models failed to converge due to the large number of zeros [11].
The model was built using the brms package [12,13], and sampling was performed using the probabilistic programming language Stan [14]. Default priors provided by brms were used: a flat prior was set for the fixed-effects coefficients, a Student-t distribution (df = 3, mean = 0, scale = 10) for the intercept, a half-Student-t distribution (df = 3, scale = 10) for the residual scale and the SD of random effects, and a Gamma(2, 0.1) distribution for the degrees of freedom parameter of the Student-t distribution. These priors were automatically adjusted according to the distribution of the dependent variable.
Estimation was performed using four chains, each running for 5000 iterations (2500 for warm-up). Convergence was confirmed using the R-hat statistic [15] and effective sample size (Bulk ESS and Tail ESS) [16]. All parameters showed R-hat values < 1.01 and sufficiently large ESS values (typically >1000), with no divergent transitions or other sampler warnings observed. Model fit was assessed using posterior predictive interval coverage for all dependent variables [17,18]. The coverage values were close to, or slightly higher than, the nominal levels, suggesting mildly conservative predictive intervals and indicating no major lack of fit (Table S3). The model’s explanatory power was assessed using Bayesian R² (conditional and marginal) [19]. In addition, to compare the relative importance of random effects, approximate R² values were derived by distributing the difference between the conditional and marginal Bayesian R² in proportion to the posterior variance components of random effects that showed statistically meaningful contributions.
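The proportional allocation described above can be sketched as follows. The function name, factor names, and all numbers are hypothetical; they are chosen only to show the arithmetic and are not values from the study.

```python
def approximate_r2(conditional_r2, marginal_r2, variances):
    """Split (conditional - marginal) R2 across random effects.

    `variances` maps each retained random effect to its posterior variance
    component; each effect receives a share proportional to that variance.
    """
    gap = conditional_r2 - marginal_r2
    total = sum(variances.values())
    return {name: gap * v / total for name, v in variances.items()}


# Hypothetical inputs: conditional R2 = 0.50, marginal R2 = 0.10.
shares = approximate_r2(0.50, 0.10, {"factor_a": 2.0, "factor_b": 1.0, "factor_c": 1.0})
for name, value in shares.items():
    print(name, round(value, 3))
```

With these inputs, the gap of 0.40 is divided into shares of approximately 0.20, 0.10, and 0.10. By construction the shares sum to the gap, so they indicate relative importance only and are not independently estimated R² values.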
The Bayesian linear mixed model used in this study can be expressed as follows:
$$Y = X\beta + Zb + \varepsilon$$
where $Y$ represents the dependent variable, $X$ and $Z$ represent the design matrices for the fixed and random effects, respectively, $\beta$ represents the vector of fixed-effect coefficients, $b$ represents the vector of random effects assumed to follow $b \sim N(0, \Sigma_b)$, and $\varepsilon$ represents the residual error assumed to follow $\varepsilon \sim \text{Student-}t(0, \sigma, \nu)$.
For fixed effects, the effect of each factor was interpreted using the posterior mean of the EMMs with 95% credible intervals (CIs). EMM values were calculated using the “emmeans” package in R (version 4.2.2) according to the following equation:
$$\mathrm{EMM} = E(Y \mid X_{\mathrm{ref}}) = X_{\mathrm{ref}}\beta$$
Here, $Y$ denotes the dependent variable (e.g., rank), and the expression $E(Y \mid X_{\mathrm{ref}})$ represents the expected value of $Y$ under a specific combination of factor levels, which corresponds to the EMM. The vector $\beta$ consists of the regression coefficients of the fixed effects estimated by the model, and the linear predictor for the reference condition is obtained by multiplying $X_{\mathrm{ref}}$ (which defines the specified levels of the fixed factors) by $\beta$. EMMs were estimated for sex, age, and the interaction between sex and interval. Differences between EMMs were evaluated using the posterior distribution of their contrasts and were considered statistically meaningful when the 95% CI of the contrast did not include zero. For random effects, the estimated SDs were reported together with their 95% CIs. A level was considered to show a meaningful effect if the lower bound of the 95% CI was >0.
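The decision rule for contrasts can be illustrated with a small sketch operating on paired posterior draws of two EMMs. The function names and draws are hypothetical. An equal-tailed interval is used here for simplicity; this is an assumption, and the authors' software may have reported a different interval type (for example, a highest posterior density interval).

```python
def credible_interval(draws, level=0.95):
    """Equal-tailed interval from posterior draws, with linear interpolation."""
    xs = sorted(draws)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail)


def contrast_is_meaningful(emm_a_draws, emm_b_draws, level=0.95):
    """True when the interval of the draw-wise difference excludes zero."""
    contrast = [a - b for a, b in zip(emm_a_draws, emm_b_draws)]
    lower, upper = credible_interval(contrast, level)
    return lower > 0 or upper < 0


# Hypothetical draws: group A is consistently about 0.2 higher than group B.
group_a = [0.30 + 0.001 * i for i in range(200)]
group_b = [0.10 + 0.001 * i for i in range(200)]
print(contrast_is_meaningful(group_a, group_b))
```

The contrast is computed draw by draw, so that the interval reflects the joint posterior of the two EMMs instead of comparing their separate intervals.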
2.4. Analysis of Transformed-Rank Trends in Horses with Records at Multiple Interval Categories
In this study, an improvement in transformed rank was observed with increasing interval, possibly reflecting the effect of a longer retraining period. However, this improvement may only reflect selective withdrawal of poorly performing horses from subsequent competitions, with no effect of retraining. If this were the case, performance would not be expected to improve within the same horses over time. To address this possibility, changes in transformed rank across interval categories were examined in horses with records at both the ≤1-year and 2-year intervals (stallions: 1 horse, 8 records; mares: 47 horses, 253 records; geldings: 106 horses, 373 records), as well as in horses with records at all interval categories of ≤1 year, 2 years, and 3 years (mares: 28 horses, 255 records; geldings: 49 horses, 266 records).
The Bayesian linear mixed model used for this analysis included interval, sex, and age as fixed effects, with horse included as a random effect to account for repeated measurements within individuals. The model formula was as follows:
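As a reading aid, the structure described in this paragraph can be sketched in Wilkinson-style notation. This is a reconstruction from the prose with placeholder variable names, not the authors' exact formula.

```latex
\text{TransformedRank} \sim \text{Interval} + \text{Sex} + \text{Age} + (1 \mid \text{Horse})
```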
EMMs were obtained from the model, and differences between EMMs across intervals were evaluated using the posterior distribution of their contrasts. As in Section 2.3, a difference was considered statistically meaningful when the 95% CI of the contrast did not include zero.
No generative artificial intelligence (GenAI) tools were used in this study.
4. Discussion
In this study, we used a Bayesian linear mixed model to analyze factors influencing show-jumping performance in retired Thoroughbred racehorses, with a focus on horse sex and the interval between race retirement and competition entry, which may correspond to the retraining period for show jumping. Mares and geldings outperformed stallions at short intervals after race retirement, whereas performance improved across all sexes as the interval increased, resulting in no clear sex-related differences at later intervals, consistent with an effect of the prolonged retraining period. On the other hand, fixed effects, with sex and interval as the primary factors, accounted for only a small portion (7%) of the variance in ranking, whereas the full model including random effects accounted for 44%. The difference (37%) was attributable to random effects, with horse-specific ability, rider, sire, and affiliation after retirement as major contributors, highlighting the multifactorial nature of success in show jumping in retrained retired racehorses.
In the descriptive analysis of transformed rank, the overall mean transformed rank was highest in mares, followed by geldings, and lowest in stallions (Figure 1A), and the mean transformed rank increased with longer intervals in all sexes (Figure 1B). Mares performed better than geldings across all interval categories, whereas stallions showed poorer performance at the ≤1-year interval but improved thereafter, with a significant increase (Figure 1B). These descriptive trends were largely retained in the Bayesian model-based analysis, with stallions exhibiting markedly worse transformed ranks than the other sexes at the ≤1-year interval (Figure 1D). While temperament and/or behavioral difficulties of stallions have been considered a cause of poorer performance at the early stage of training (corresponding to the ≤1-year interval in our study) [5], several studies have reported results different from ours, showing that stallions and geldings perform better than mares or that stallions outperform the other sexes [5,6,7]. These findings have been explained in those previous studies by the superior speed and jumping ability of stallions, which have been suggested to be associated with higher aerobic capacity [20,21] and greater explosive power [22,23]. One possible explanation for this discrepancy is differences in horse usage between our study and previous studies. Specifically, those studies focused exclusively on horses originally bred for equestrian purposes. In such populations, stallions may receive more intensive training than other sexes for evaluation as potential sires [24], which could enhance their performance; this situation does not apply to retired Thoroughbred racehorses. On the other hand, fewer stallions were analyzed in our study compared with mares and geldings (Figure S1A), and this sample imbalance may have influenced the results. Furthermore, as stallions with performance problems are often gelded, the observed trend in stallions in our study may also reflect the characteristics of a subset of stallions that did not require gelding. However, with longer intervals, rankings improved in all sexes, with the most pronounced improvement observed in stallions, and the sex-related difference observed at the ≤1-year interval disappeared, highlighting the importance of transition training from racing to show jumping across sexes, particularly in stallions (Figure 1D).
Within the interval factor, the NI group comprised horses with no history of racing and therefore had no definable interval length. These horses were included in the model as a separate interval factor level for convenience. In the NI group, mares ranked higher than geldings by a statistically meaningful margin (Figure 1D, Table 1), and this may reflect factors specific to horses with no racing experience, such as the absence of racing-related injuries. However, although horses in the NI group shared the common feature of having no racing experience, their training history, management conditions, and ability levels were likely more diverse than those of horses with racing experience. Such heterogeneity in backgrounds could obscure the effects of sex; as a result, it is difficult to draw clear conclusions about the characteristics of the NI group from the present results.
For the analysis of ranking, the conditional R² was 0.44, whereas the marginal R² was 0.07 (Table 3), indicating a substantially greater contribution of random effects than fixed effects, including sex and interval. To facilitate comparison of the relative importance of fixed effects and individual random factors, approximate R² values were calculated for random effects with statistically significant contributions (Figure 1F) by distributing the difference between the conditional and marginal R² (0.37) according to the proportion of variance explained by each factor: 0.15 for horse at the ≤1-year interval, 0.15 for rider, 0.05 for sire, and 0.05 for affiliation after retirement. Each of these values was roughly comparable to the contribution of the fixed effects primarily attributable to the interval (R² = 0.07), potentially reflecting the retraining period. These results suggest that ranking was not determined by a single dominant factor but rather by multiple contributors, including the retraining period, individual horse ability, rider skill, genetic background, and post-retirement management environment.
The random-effects analysis of transformed rank showed that the horse-specific effect at the ≤1-year interval was the largest source of variation, suggesting that individual differences among horses strongly influenced ranking at the early stage of retraining (Figure 1F). However, beyond the first year after retirement, horse-specific effects were no longer prominent, indicating that with longer intervals, the relative contribution of individual horse effects decreased and that sufficient retraining time may attenuate the impact of initial individual differences. Moreover, rider effects were also substantial, indicating that rider skill contributed independently to variation in ranking. Sire and affiliation after retirement showed moderate but consistent effects, indicating that both genetic background and the post-retirement management environment contributed to ranking (Figure 1F).
In general, horses kept at equestrian facilities are routinely ridden and exercised on a daily basis as part of their regular management for competition. Therefore, the improvement in ranking observed with longer intervals in our study is primarily thought to reflect the effect of a longer retraining period. However, this improvement may only reflect selective withdrawal of poorly performing horses from subsequent competitions, with no effect of retraining. If this were the case, performance would not be expected to improve within the same horses over time. To evaluate this possibility, changes in transformed rank were examined using Bayesian linear mixed model analyses under two restricted conditions: (i) horses with records at both the ≤1-year and 2-year interval categories, and (ii) horses with records across all interval categories (≤1 year, 2 years, and 3 years). In both analyses, transformed rank increased significantly with increasing interval, suggesting that the observed improvement cannot be explained solely by selective withdrawal and is consistent with a substantial contribution of retraining (Figure 2).
For round time 1, interval-related trends within each sex were generally flat, and sex-related differences within interval categories were largely absent (Figure 3A; Table 1 and Table 2). This trend is consistent with the nature of round 1, in which only obstacle faults are considered for ranking, with no contribution from round time. The conditional R² was 0.45, whereas the marginal R² was 0.02, indicating that most of the model’s explanatory power could be attributed to the random effects rather than fixed effects, including sex and interval (Table 3). Among the random effects, significant contributions were observed for venue, competition year, horse, and rider; however, the approximate R² values were higher for venue and competition year (0.21 and 0.16, respectively), whereas those for horse and rider were only 0.03 (Figure 3C). These results suggest that round time 1 is more strongly influenced by competition-related environmental factors, such as weather conditions, than by rider skill and intrinsic horse ability, again consistent with the nature of round 1.
For round time 2, stallions performed significantly worse than mares and geldings at the ≤1-year interval. Performance then improved with longer intervals in all sexes, particularly in stallions, and the sex-related differences observed at the early stage were no longer evident at the 2- and 3-year intervals, suggesting an effect of a longer retraining period for all sexes (Figure 3D). However, despite these interval- and sex-related trends, the overall variation in round time 2 was driven predominantly by random effects rather than by fixed effects, including sex and interval: the conditional R² was large (0.65), whereas the marginal R² was only 0.04 (Table 3). The approximate R² values showed a dominant contribution of competition year (R² = 0.51), with only minor contributions from the other random effects, venue, horse, and rider (R² = 0.00–0.02). Unlike round time 1, round time 2 contributes directly to ranking; therefore, riders tend to attempt shortcuts to reduce time. In early RRC competitions, the course design allowed only highly skilled rider–horse pairs to use such shortcuts, whereas this option was later removed. This change likely resulted in a polarization of performance between pairs that could and could not exploit the shortcut, which may explain the strong effect of competition year.
For obstacle faults 1, stallions performed significantly worse than mares and geldings at the ≤1-year interval, but this disadvantage disappeared at longer intervals, suggesting a strong effect of retraining on stallions. Geldings also improved significantly with longer intervals, whereas mares showed only weak and non-significant improvement (Figure 4A; Table 1 and Table 2). This pattern could indicate that additional retraining has a limited effect on jumping performance in mares; however, it may also suggest that mares complete the necessary retraining at a very early stage within the ≤1-year interval category. Consistently, mares have been reported to perform better than stallions at the beginner level of show-jumping competitions [5,6]. On the other hand, despite these sex- and interval-related patterns, the overall variation in obstacle faults 1 was explained more by random effects than by fixed effects, including sex and interval: the conditional R² was 0.40, whereas the marginal R² was only 0.07 (Table 3). Among the random effects examined in this study, rider and sire showed clear contributions in the analysis of obstacle faults 1, with approximate R² values of 0.27 and 0.06, respectively (Figure 4C; Table 3). In the first round, ranking is determined solely by obstacle faults, and course time is not considered. Therefore, the random effects identified in the analysis of obstacle faults 1, rider and sire, are expected to be more directly related to jumping performance than those found in obstacle faults 2.
For obstacle faults 2, stallions again performed significantly worse than mares and geldings at the ≤1-year interval, but this difference diminished with longer intervals and was no longer evident at the 2- and 3-year intervals, suggesting a clear effect of a prolonged retraining period on stallions. As observed for obstacle faults 1, geldings also improved significantly with longer intervals, whereas mares showed only weak and non-significant improvement (Figure 4D; Table 1 and Table 2), which may be attributable to the earlier completion of retraining in mares, as discussed above. Despite the clear sex- and interval-related patterns observed in the fixed-effect analysis, most of the variance in obstacle faults 2 was accounted for by random effects: the conditional R² was 0.41, indicating moderate explanatory power, whereas the marginal R² for the fixed effects of sex and interval was only 0.06 (Table 3). All random effects showed clear contributions; however, each of the corresponding approximate R² values was relatively low (0.01–0.11) compared with those observed for obstacle faults 1 (rider = 0.27 and sire = 0.06), suggesting that a small but diverse array of factors contributes to obstacle faults 2. This pattern may be related to the fact that, unlike the first round, both obstacle faults and round time directly determine the ranking in the second round, resulting in a greater diversity of factors that can influence obstacle faults (Figure 4F).
In summary, to explore the factors affecting the transition of retired racehorses to equestrian disciplines, we analyzed show-jumping performance in retired Thoroughbred racehorses using a Bayesian linear mixed model, focusing on horse sex, age and the interval between retirement and competition entry. Our findings suggest that a sufficiently long interval, which may reflect the transition training period, is important for improving jumping performance regardless of sex, while factors other than the interval also contributed to performance.