by Gustav Durlind (1), Uriel Martinez-Hernandez (1,2) and Tareq Assaf (1,*)

Reviewer 1: Anonymous
Reviewer 2: Amr Radi
Reviewer 3: Konrad Wojciechowski

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a tennis-serve analysis method that combines motion capture and deep learning. A Stacked Bidirectional LSTM is used to classify serve outcomes (in / out / net), while a 3-D Convolutional Neural Network predicts the landing coordinates of successful serves. The manuscript is well structured and clearly written. Below are some suggestions to improve the paper.

  1. The dataset is extremely small: only 344 valid serves in total, and only a few dozen samples are actually used to train the Stacked Bidirectional LSTM (33 serves) and the 3-D CNN (20 serves). This is far below the usual requirements for deep learning and makes severe over-fitting almost inevitable. Consequently, the claim that the work serves as a “proof-of-concept” is not yet convincing. The authors should add a detailed over-fitting analysis, for example, by providing validation loss/accuracy curves and by discussing whether the model complexity (≈7 M parameters) is justified for such a small sample size.
  2. For the classification task, only 33 serves are used (24 for training and 9 for validation), and for the regression task, 20 serves are used (14 for training and 6 for validation). Please clarify exactly how these subsets were selected. Was the split random? Was k-fold cross-validation applied? Providing these details is essential for assessing the robustness of the reported accuracies.
  3. In Tennis Serve Prediction, the 75 markers × 3 coordinates are concatenated into a 225-dimensional vector without any joint-level features, kinematic parameters (joint angles, velocities, accelerations), or dimensionality reduction (e.g., PCA). Please justify why this “black-box” representation is preferable to physically interpretable features that are commonly used in biomechanics.
  4. The dataset was collected from only two right-handed male players on an indoor hard court, limiting demographic and environmental diversity. Have the authors validated the approach on other cohorts—e.g., female players, left-handed players, or different court surfaces—or on any public datasets? If not, please explicitly discuss these limitations and their impact on generalizability.
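The split-robustness question raised in point 2 above is easy to probe empirically. A minimal k-fold sketch in Python, using only the subset size reported in the paper (33 serves); the fold count is an illustrative assumption, not the authors' protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
n_serves = 33   # classification subset size reported in the paper
k = 5           # hypothetical number of folds, chosen for illustration

# Shuffle serve indices once, then partition into k near-equal folds.
indices = rng.permutation(n_serves)
folds = np.array_split(indices, k)

# Each fold serves once as validation while the rest train the model.
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={len(train_idx)}, val={len(val_idx)}")
```

Reporting the mean and spread of accuracy across such folds, rather than a single 24/9 split, would directly address the robustness concern.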

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Summary

This paper presents a two-stage system that uses marker-based motion capture and deep learning to analyze tennis serves.

Stage 1: classifies serve outcomes as “in,” “out,” or “net” using a stacked bidirectional LSTM.

Stage 2: predicts the (x, y) landing coordinates for successful serves using a 3D CNN.

The authors report 89% accuracy for the 3-class serve outcome classification and 63% of predictions within 1 m (98% within 2 m) for coordinate regression, with MAE = 0.59 m and RMSE = 0.68 m. The work also describes the data collection protocol with 75 body/racket markers captured at 180 Hz and a calibrated video for ground-truth landing coordinates.
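For readers unfamiliar with the metrics quoted above, the MAE, RMSE, and within-threshold percentages are all derived from per-serve Euclidean landing errors. A small numpy sketch on synthetic coordinates (the data here are random stand-ins, not the paper's results):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic true vs predicted landing points, shape (n, 2): (x, y) in metres.
true_xy = rng.uniform(0, 6, size=(50, 2))
pred_xy = true_xy + rng.normal(scale=0.5, size=(50, 2))

# Euclidean error per serve, then the summary statistics used in the paper.
err = np.linalg.norm(pred_xy - true_xy, axis=1)
mae = err.mean()
rmse = np.sqrt((err ** 2).mean())
within_1m = (err < 1.0).mean()

print(f"MAE={mae:.2f} m  RMSE={rmse:.2f} m  within 1 m: {within_1m:.0%}")
```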

The study stands out for precise, well-synced motion-capture data, a practical two-step approach (classify serve outcome, then predict landing spot), clear explanations of the AI methods used, results reported in meters for easy understanding, and responsible data sharing with participant consent.

Comments

  1. The paper states a total of 344 usable serves from two right-handed male participants. However, only a very small subset is used for modeling: 33 serves for classification (24 train, 9 validation) and 20 serves for regression (14 train, 6 validation).
  2. With only two participants, there is a high risk of subject-specific overfitting.
  3. There is no held-out test set, and no K-fold cross-validation. The reported “top twelve runs” versus “twenty runs” for the 3D CNN suggests instability and possible optimistic selection.
  4. Clarify how random seeds, weight initialization, and run-to-run variability were handled.
  5. Include stronger and fairer baselines on your dataset: feature-based classical models (e.g., logistic regression, random forest, gradient boosting) for classification and regression, plus simple sequence baselines (e.g., 1D CNN, unidirectional LSTM, GRU). Provide ablations (e.g., fewer layers, fewer markers, shorter time windows) to justify architectural choices.
  6. The classification model lists “Dense layers with 128 and 3 filters.” For Dense layers, “units” (neurons) is the conventional term, not “filters.” Also, L2 regularization “0.7” appears unusually high; confirm this value and units.
  7. Wording/typos: “generlization” → “generalization”; use consistent terminology “marker-based” rather than “markered.” Replace “filters” with “units” for Dense layers. “Two Dense Layers: With 128 and 3 filters respectively, 0.7 L2 regularisation” should be clarified (which layers have L2? magnitude?).
  8. Figures: Ensure all figures are legible and included with captions that fully describe axes, units, and experimental conditions. The confusion matrix should include normalized values per class.
  9. Table 1 presents “Total” counts but the modeling subsections use much smaller subsets—add an explicit rationale and a flow diagram of data inclusion/exclusion.
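As a trivial example of the kind of feature-based classical baseline suggested in point 5, a nearest-centroid classifier can be fitted in a few lines. Everything here is synthetic and illustrative (the feature vectors and class structure are invented, matching only the 24/9 split sizes reported in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for per-serve feature vectors (e.g., joint angles);
# 3 classes mimic the in / out / net outcomes, 24 train / 9 validation.
X_train = rng.normal(size=(24, 10)) + np.repeat(np.arange(3), 8)[:, None]
y_train = np.repeat(np.arange(3), 8)
X_val = rng.normal(size=(9, 10)) + np.repeat(np.arange(3), 3)[:, None]
y_val = np.repeat(np.arange(3), 3)

# Nearest-centroid: predict the class whose mean feature vector is closest.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in range(3)])
dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)
print("validation accuracy:", (y_pred == y_val).mean())
```

A baseline of this kind costs almost nothing to run and gives a floor against which the ≈7 M-parameter deep models can be judged.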

Recommendation

Minor revisions. The core idea and data acquisition are strong, and the topic is timely. However, the small modeling subsets lack subject independence. With expanded evaluation, clearer baselines/ablations, and strengthened reproducibility, this work could make a solid contribution.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In my view, assessing a scientific work only on the basis of numerical outcomes, such as classification accuracy compared with values reported in the literature, is not fully appropriate, and in fact becomes impossible if the code is not made available. For this reason, my comments below focus on the methodology of the reviewed work.

The manuscript is written with care and includes a solid literature review, although the emphasis is placed mainly on numerical results.

My main concern about the chosen methodology is the use of trajectories of all markers as input data. While this choice may increase accuracy relative to the state of the art, it does not provide insights that are interpretable, and it remains specific to the particular player whose data were analyzed. I believe a more informative approach would be to construct, using Qualisys software (as described in the tennis literature), a feature vector of the serve, and then to model the mapping between this vector and the ball impact coordinates on the court. Such an approach would also allow for a feature ranking, i.e., the identification, using known algorithms, of those features or groups of features most strongly correlated with the ball impact coordinates. This ranking could highlight individual differences, whether for a single player or across a group of players. Only in a subsequent step would it be valuable to use the full marker trajectories to verify, by comparing prediction accuracy, whether the raw data contain additional information that was not captured in the feature extraction process. If the Authors’ research plan was indeed intended in this way but presented in reverse order, I suggest making this clearer.

It may also be noted that using the full trajectories of all markers could be seen as an early analogue of markerless systems, where the input data consist of silhouettes or skeleton sequences.

In addition, when trajectories over a selected time window are used as inputs, information on instantaneous velocities and accelerations is lost. To address this, it would be advisable to derive sequences of velocity and acceleration vectors from the marker positions.
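The derivative features suggested here follow directly from finite differences of the marker positions. A minimal numpy sketch, using the 180 Hz capture rate stated in the paper's protocol and a synthetic single-marker trajectory:

```python
import numpy as np

fs = 180.0          # sampling rate from the paper's protocol (Hz)
dt = 1.0 / fs
t = np.arange(0, 1, dt)

# Synthetic trajectory of one marker, shape (frames, 3): x, y, z in metres.
pos = np.stack([np.sin(t), np.cos(t), 0.5 * t**2], axis=1)

# Central finite differences give instantaneous velocity and acceleration,
# which can be stacked alongside positions as additional input channels.
vel = np.gradient(pos, dt, axis=0)   # m/s
acc = np.gradient(vel, dt, axis=0)   # m/s^2

print(pos.shape, vel.shape, acc.shape)
```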

I also encourage the Authors to explain their choice of the 140 Hz sampling frequency. In my opinion, this seems relatively low. A simple way to evaluate this is to calculate the displacement of the racket marker between consecutive frames. By way of comparison, in the VICON laboratory I supervised, motion in combat scenes was typically recorded at 200 Hz, and special cases—such as a golf swing or the postural hesitation of an archer at the moment of release—were captured at 400 Hz.
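The quick check the reviewer proposes amounts to dividing marker speed by capture rate. A sketch with an assumed racket-tip speed (40 m/s is a hypothetical value for illustration, not a measurement from the paper):

```python
# Per-frame marker displacement for an assumed racket-tip speed near impact.
tip_speed = 40.0  # m/s, hypothetical illustrative value

# Capture rates discussed across the reviews (Hz).
for fs in (140, 180, 200, 400):
    print(f"{fs:>4} Hz -> {tip_speed / fs * 1000:.1f} mm between frames")
# e.g. at 200 Hz the marker travels 200.0 mm between frames
```

Displacements of hundreds of millimetres per frame at the fastest phase of the serve would support the reviewer's concern about temporal resolution.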

Further clarification of the learning process with LSTM would also be valuable, both the reasoning behind the chosen network variant and the way cross-validation was implemented given the temporal structure of the data.

Finally, to make the work more accessible for readers outside the motion capture community, it would be very helpful to describe the recording conditions (e.g., controlled lighting) and to include an image from a single camera.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed my concerns. The paper can be accepted.

Author Response

No comments to address.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a two-stage system that uses marker-based motion capture and deep learning to analyze tennis serves.

Stage 1: classifies serve outcomes as “in,” “out,” or “net” using a stacked bidirectional LSTM.

Stage 2: predicts the (x, y) landing coordinates for successful serves using a 3D CNN.

The authors report 89% accuracy for the 3-class serve outcome classification and 63% of predictions within 1 m (98% within 2 m) for coordinate regression, with MAE = 0.59 m and RMSE = 0.68 m. The work also describes the data collection protocol with 75 body/racket markers captured at 180 Hz and a calibrated video for ground-truth landing coordinates.

The study stands out for precise, well-synced motion-capture data, a practical two-step approach (classify serve outcome, then predict landing spot), clear explanations of the AI methods used, results reported in meters for easy understanding, and responsible data sharing with participant consent.

The authors addressed all requirements in the minor revisions.

Author Response

No comments to address.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for your comprehensive answers to the Round 1 questions. Unfortunately, the issue remains that there is no solid basis for comparing results obtained from such limited measurement data, with arbitrary choices in training and validation, against those obtained from the THETIS dataset. Likewise, despite the explanation provided, the problem of too-low sampling frequency also remains. The authors argue that increasing the sampling frequency would reduce resolution. A similar trade-off exists in the VICON system; however, what matters is not a high-resolution image of the marker itself, but the accuracy of estimating its centroid even from a lower-quality image.
The information-processing procedure is correct and has been described in great detail in the supplementary explanations, but it was carried out on a very small dataset, so generalisation of the results is not justified.
In summary, acceptance for publication is conditional on changing the title to “Exploratory Proof-of-Concept: Predicting the Outcome of Tennis Serves Using Motion Capture and Deep Learning” as well as removing comparisons to literature datasets—although the achieved numerical values may remain.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf