Proceeding Paper

From Pose to Pitch: Classifying Baseball Pitch Types with Projection-Gated ST-GCN †

by Sergio Huesca-Flores 1,*, Gibran Benitez-Garcia 2, Oswaldo Juarez-Sandoval 1, Hiroki Takahashi 2,3, Hector Perez-Meana 1 and Mariko Nakano-Miyatake 1,*

1 ESIME Culhuacan, Instituto Politecnico Nacional, Mexico City 04440, Mexico
2 Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan
3 Artificial Intelligence eXploration Research Center, The University of Electro-Communications, Tokyo 182-8585, Japan
* Authors to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 3; https://doi.org/10.3390/engproc2026123003
Published: 29 January 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

We present a skeleton-based approach to baseball pitch type classification from broadcast video. We leverage human pose estimation and an ST-GCN architecture, augmented with a projection-gated temporal downsampler, to learn kinematic signatures of the pitcher’s body. By adaptively selecting the most informative frames, the model classifies pitch types without the need for ball tracking. On the MLB-YouTube dataset, our proposed method reaches ~62% six-class accuracy, highlighting body mechanics as a practical biometric cue.

1. Introduction

Biometrics in sport increasingly leverages kinematic signatures for recognition and assessment. In baseball, a pitcher’s coordinated joint timings, segment orientations, and inter-limb couplings form a movement biometric indicative of pitch intent and deception. While systems such as Statcast provide precise ball and arm trajectories, they require costly radar/multi-camera setups limited to elite venues [1]. A body-centric alternative is to infer a skeletal pose from broadcast video and classify pitch type directly from pitcher dynamics, avoiding explicit ball tracking.
Prior work follows two directions. Trajectory-centric methods analyze ball motion or pre-pitch context for discrimination and outcome prediction [2,3]. Movement-centric methods read the pitcher’s mechanics via motion capture, wearables, or vision: wearables show proximal kinematics (pelvis/trunk) are predictive of pitch categories and outcomes [4,5]; vision pipelines range from handcrafted features and 3D CNNs on full frames [6,7,8,9] to pose-guided models emphasizing the athlete [10]. Skeleton-based action recognition, modeling the body as a spatiotemporal graph, learns inter-joint relations directly [11,12].

2. Materials and Methods

Motivated by this biometrics perspective, we isolate the pitcher using OpenPose [13] and model body motion with an ST-GCN backbone [11]. To improve temporal selectivity, we replace uniform stride-based downsampling with a Projection-Gated Temporal Downsampler (PGTD) that halves the temporal rate by softly selecting and fusing non-overlapping frame pairs, sharpening informative kinematic events while suppressing noise. This body-centric, graph-based design aligns with evidence linking kinematic measurements to pitch categorization [4], while remaining practical in broadcast settings.
We insert PGTD between the spatial graph convolution (GCN) and the temporal convolution (TCN) (Figure 1). From broadcast video, we estimate and isolate pitcher poses exactly as in [12]; the pose stream feeds our Projection-Gated ST-GCN. In the baseline, an input $X \in \mathbb{R}^{N \times C \times T \times V}$ (batch, channels, frames, joints) is processed by a spatial aggregation and then a $(k_t, 1)$ temporal convolution with BN-ReLU-Dropout. Using normalized adjacency partitions $\{\hat{A}_k\}_{k=1}^{K}$ and weights $W_k \in \mathbb{R}^{C \times C'}$, the spatial operator is $\mathrm{GCN}(X) = \sum_{k=1}^{K} \hat{A}_k X W_k$. Temporal downsampling in the original ST-GCN uses stride 2 in selected blocks, blindly discarding every other frame.
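The spatial operator above can be sketched in a few lines; the following NumPy fragment is a minimal illustration, not the authors' implementation (tensor layouts and the identity adjacency partitions are placeholder assumptions):

```python
import numpy as np

def gcn(X, A_hat, W):
    """Spatial graph convolution: GCN(X) = sum_k A_hat_k X W_k.

    X: (N, C, T, V) input features; A_hat: (K, V, V) normalized
    adjacency partitions; W: (K, C, C_out) per-partition weights.
    """
    out = 0.0
    for k in range(A_hat.shape[0]):
        agg = np.einsum('nctv,vw->nctw', X, A_hat[k])      # aggregate over joints
        out = out + np.einsum('nctv,cd->ndtv', agg, W[k])  # mix channels
    return out

# Toy shapes: 2 clips, 3 channels, 8 frames, 18 joints, 3 partitions
N, C, T, V, K, C_out = 2, 3, 8, 18, 3, 4
X = np.random.randn(N, C, T, V)
A = np.stack([np.eye(V)] * K)   # identity partitions, for the sketch only
W = np.random.randn(K, C, C_out)
Y = gcn(X, A, W)
print(Y.shape)  # (2, 4, 8, 18)
```

With a channel change ($C \to C'$) as here, the block carries a projection residual, which is exactly the case where PGTD activates.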
PGTD activates only in blocks with projection residuals (channel changes). When active, the TCN keeps stride 1 and the gate performs the halving. Given the GCN output $X \in \mathbb{R}^{N \times C \times T \times V}$, PGTD yields $Y \in \mathbb{R}^{N \times C \times T/2 \times V}$ by fusing non-overlapping frame pairs $(2u, 2u+1)$. The value signal applies a pointwise $\tanh$ followed by per-time “cv” normalization $\mathcal{N}$ (layer normalization across $[C, V]$):
$$V = \mathcal{N}(\tanh(X)).$$
Gating logits come from a depthwise temporal convolution (groups $= C$) with kernel $k_t = 3$:
$$G = \mathrm{DWConv}_t(X).$$
For each window $(2u, 2u+1)$ we compute a two-way softmax with temperature $\tau = 1.0$,
$$W_u = \mathrm{softmax}\!\left(\frac{\bigl[\,G[:,:,2u,:],\; G[:,:,2u+1,:]\,\bigr]}{\tau}\right),$$
and mix the pair at half rate:
$$Y[:,:,u,:] = W_{u,0} \odot V[:,:,2u,:] + W_{u,1} \odot V[:,:,2u+1,:].$$
Finally, the block sums a stride-1 TCN on $Y$ with a temporally aligned residual $R$ obtained via average pooling (kernel/stride 2):
$$Z = \mathrm{TCN}_{\mathrm{stride}=1}(Y) + \mathrm{AvgPool}_t^{(k=2,\,s=2)}(R).$$
For odd $T$, the last frame of $X$ is dropped; residual pooling also yields $\lfloor T/2 \rfloor$ frames, avoiding padding. This placement (GCN → PGTD → TCN) replaces blind decimation with data-driven selection that emphasizes motion-salient frames before temporal filtering. Because the gate is depthwise and pointwise-free, its cost is modest, $O(C\,k_t\,T\,V)$, and it is applied only when channels change; identity blocks match the original ST-GCN. In practice, non-overlapping windows of 2, $\tanh$ values, cv normalization, $k_t = 3$, $\tau = 1.0$, and TCN stride 1 provide adaptive halving with minimal disruption and stable training.
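The gating equations above can be traced end to end in a small NumPy sketch. This is a reference illustration under stated assumptions (random kernels, “same” padding for the depthwise convolution); the learned TCN and residual path are omitted:

```python
import numpy as np

def layer_norm_cv(X, eps=1e-5):
    # Per-sample, per-frame normalization across the [C, V] dims ("cv" norm)
    mu = X.mean(axis=(1, 3), keepdims=True)
    var = X.var(axis=(1, 3), keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def depthwise_temporal_conv(X, w):
    # w: (C, k_t), one temporal kernel per channel (groups = C), "same" padding
    N, C, T, V = X.shape
    k = w.shape[1]
    pad = k // 2
    Xp = np.pad(X, ((0, 0), (0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(X)
    for c in range(C):
        for t in range(T):
            out[:, c, t, :] = np.tensordot(w[c], Xp[:, c, t:t + k, :], axes=(0, 1))
    return out

def pgtd(X, w, tau=1.0):
    """Projection-Gated Temporal Downsampler: halve T by gated pair fusion."""
    N, C, T, V = X.shape
    T2 = T // 2
    X = X[:, :, :2 * T2, :]                # drop the last frame when T is odd
    Vsig = layer_norm_cv(np.tanh(X))       # value signal V = N(tanh(X))
    G = depthwise_temporal_conv(X, w)      # gating logits G = DWConv_t(X)
    Ge, Go = G[:, :, 0::2, :], G[:, :, 1::2, :]
    m = np.maximum(Ge, Go)                 # numerically stable two-way softmax
    e0, e1 = np.exp((Ge - m) / tau), np.exp((Go - m) / tau)
    W0 = e0 / (e0 + e1)                    # W_{u,0}; W_{u,1} = 1 - W_{u,0}
    return W0 * Vsig[:, :, 0::2, :] + (1 - W0) * Vsig[:, :, 1::2, :]

X = np.random.randn(2, 3, 8, 18)           # (N, C, T, V)
kernel = np.random.randn(3, 3) * 0.1       # (C, k_t = 3)
Y = pgtd(X, kernel)
print(Y.shape)  # (2, 3, 4, 18)
```

Note that an all-zero gating kernel reduces PGTD to plain pair averaging ($W_{u,0} = W_{u,1} = 0.5$), which makes the role of the learned gate easy to check in isolation.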
We evaluate on MLB-YouTube [10], a fine-grained benchmark with 5759 segmented clips from 20 broadcasts of the 2017 MLB postseason. Of these, 5244 are pitches labeled as changeup, curveball, fastball, knuckle-curve, sinker, or slider, with class counts of 331, 338, 2974, 395, 213, and 993, respectively; the remainder are other events. Clips are ∼6 s at 60 fps and 1280 × 720 resolution. The split provides 4177 training and 1067 test clips (80/20). Challenges include subtle inter-class kinematic differences, severe class imbalance, and complex visuals. Experiments ran on an Intel Core i7-12700 CPU and a single NVIDIA RTX 4060 Ti (16 GB).

3. Results

Trained with Adam ($lr = 10^{-3}$) and cross-entropy loss on MLB-YouTube [10], the Projection-Gated ST-GCN attains 61.8% six-class accuracy at epoch 216. Class-wise accuracies show strong discrimination for fastball (78.41%) and knuckle-curve (71.26%), solid changeup (65.62%) and curveball (62.32%), and remaining difficulty on sinker (50.00%) and slider (43.22%). The gap among precision (0.58), recall (0.62), macro-F1 (0.59), and weighted-F1 (0.69) reflects class imbalance, with abundant fastballs lifting the weighted score. Higher recall but lower precision for curveball suggests confusion with related breaking pitches; slider remains most confounded, consistent with kinematic overlap. These findings corroborate Table 1: projection-gated processing improves recognition from body motion alone, with adaptive selection particularly aiding classes with distinctive kinematic signatures. We also report parameter count, FLOPs, and inference time, indicating a practical accuracy/efficiency balance: from 3.07 M, 2.51 G, 2.06 ms for ST-GCN to 3.09 M, 3.32 G, 2.53 ms for our model.
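The macro-versus-weighted gap can be reproduced from the reported per-class accuracies. Since test-set class proportions are not listed, the following sketch uses the overall dataset counts as a stand-in, so the weighted figure is illustrative only:

```python
import numpy as np

# Reported per-class accuracies (%) and overall class counts from the paper.
classes = ["changeup", "curveball", "fastball", "knuckle-curve", "sinker", "slider"]
acc = np.array([65.62, 62.32, 78.41, 71.26, 50.00, 43.22])
counts = np.array([331, 338, 2974, 395, 213, 993])  # overall counts, used as a
# proxy for the (unpublished) test-set class proportions

macro = acc.mean()                               # every class weighted equally
weighted = (acc * counts).sum() / counts.sum()   # abundant fastballs dominate

print(f"macro-averaged accuracy:    {macro:.1f}%")    # ≈ 61.8%
print(f"weighted-averaged accuracy: {weighted:.1f}%")  # ≈ 68.2%
```

The roughly 6-point lift of the weighted average over the macro average mirrors how the weighted-F1 (0.69) exceeds the macro-F1 (0.59): the majority fastball class pulls frequency-weighted scores upward.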

4. Conclusions

Treating a pitcher’s movement as a biometric signal, we show that skeleton-based modeling can classify pitch types from broadcast video without ball tracking. Our Projection-Gated variant of ST-GCN replaces stride-2 decimation with content-adaptive selection between GCN and TCN, yielding 61.8% accuracy on MLB-YouTube with a modest trade-off, adding ∼0.47 ms of average inference time per clip for a 3.3-point accuracy gain. Per-class trends indicate benefits for pitches with distinctive kinematic cues. Future work aims to mitigate class imbalance with different augmentation techniques, explore overlapping or learned windows to capture the different phases of the pitching movement, and use 3D poses when available, targeting further gains without sacrificing deployability.

Author Contributions

S.H.-F. is the principal author of this paper, responsible for the conception, development, and implementation of the methodology, as well as conducting the experiments and analyzing the results. G.B.-G., O.J.-S., H.T., H.P.-M., and M.N.-M. supervised the proposal, design of experiments, results analysis, and the writing process of the manuscript. All authors contributed to the revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset analyzed during the current study is available in the GitHub repository at https://github.com/piergiaj/mlb-youtube (accessed on 15 October 2025).

Acknowledgments

The authors would like to thank the National Polytechnic Institute of Mexico and The University of Electro-Communications of Tokyo for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Major League Baseball. Statcast Glossary. Available online: https://www.mlb.com/glossary/statcast (accessed on 13 August 2025).
  2. Hoang, P.; Hamilton, M.; Murray, J.; Stafford, C.; Tran, H. A Dynamic Feature Selection Based LDA Approach to Baseball Pitch Prediction. In Trends and Applications in Knowledge Discovery and Data Mining; Li, X.-L., Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 125–137. [Google Scholar] [CrossRef]
  3. Li, C.-C.; Lin, C.-W.; Yu, J.-Y. Statistical pitch type recognition in broadcast baseball videos. J. Comput. 2010, 21, 17–24. [Google Scholar]
  4. Gomaz, L.; Bouwmeester, C.; van der Graaff, E.; van Trigt, B.; Veeger, D. Machine Learning Approach for Pitch Type Classification Based on Pelvis and Trunk Kinematics Captured with Wearable Sensors. Sensors 2023, 23, 9373. [Google Scholar] [CrossRef] [PubMed]
  5. Manzi, J.E.; Dowling, B.; Krichevsky, S.; Roberts, N.L.S.; Sudah, S.Y.; Moran, J.; Chen, F.R.; Quan, T.; Morse, K.W.; Dines, J.S. Pitch-classifier model for professional pitchers utilizing 3D motion capture and machine learning algorithms. J. Orthop. 2024, 49, 140–147. [Google Scholar] [CrossRef] [PubMed]
  6. Takahashi, M.; Fujii, M.; Yagi, N. Automatic Pitch Type Recognition from Baseball Broadcast Videos. In Proceedings of the 2008 Tenth IEEE International Symposium on Multimedia, Berkeley, CA, USA, 15–17 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 15–22. [Google Scholar] [CrossRef]
  7. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  8. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 6299–6308. [Google Scholar] [CrossRef]
  9. Chen, R.; Siegler, D.; Fasko, M.; Yang, S.; Luo, X.; Zhao, W. Baseball Pitch Type Recognition Based on Broadcast Videos. In Cyberspace Data and Intelligence, and Cyber-Living, Syndrome, and Health; Ning, H., Ed.; Springer Singapore: Singapore, 2019; Volume 1138, pp. 328–344. [Google Scholar] [CrossRef]
  10. Piergiovanni, A.J.; Ryoo, M.S. Fine-Grained Activity Recognition in Baseball Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1821–1829. [Google Scholar] [CrossRef]
  11. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2018, 32, 12328. [Google Scholar] [CrossRef]
  12. Huesca-Flores, S.; Benitez-Garcia, G.; Nakano-Miyatake, M.; Takahashi, H. Skeleton-based baseball pitch classification on broadcast videos. In Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT), Douliu City, Taiwan, 6–8 January 2025; Volume 13510, pp. 158–163. [Google Scholar] [CrossRef]
  13. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Projection-Gated ST-GCN Architecture. Pitcher skeleton sequences enter the architecture to be modeled as spatiotemporal graphs to classify six baseball pitch types.
Table 1. Comparison with state-of-the-art methods on MLB-YouTube for six-class pitch classification.
| Method | Parameters | FLOPs | Inference Time | Accuracy (%) |
|---|---|---|---|---|
| Random | – | – | – | 17.0 |
| I3D [10] | 12.73 M | 528.9 G | 80.73 ms | 34.5 |
| InceptionV3 [10] | 27.19 M | 5.73 G | 25.42 ms | 36.4 |
| ST-GCN [12] | 3.07 M | 2.51 G | 2.06 ms | 58.5 |
| Projection-Gated ST-GCN [Ours] | 3.09 M | 3.32 G | 2.53 ms | 61.8 |
