Article
Peer-Review Record

The Learning Style Decoder: FSLSM-Guided Behavior Mapping Meets Deep Neural Prediction in LMS Settings

Computers 2025, 14(9), 377; https://doi.org/10.3390/computers14090377
by Athanasios Angeioplastis 1, John Aliprantis 2, Markos Konstantakis 2,*, Dimitrios Varsamis 1 and Alkiviadis Tsimpiris 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 10 June 2025 / Revised: 4 September 2025 / Accepted: 5 September 2025 / Published: 8 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

On the positive side:

The presented work tries to leverage the data traces that students leave while working with Moodle. Different ML methods are applied to predict the success of individual students based on this data.

On the negative side:

It does not work very well. The quality of the ML predictions is pretty low. The only task that worked somewhat reliably is the binary classification. However, this task might be pretty easy depending on the distribution of students who failed or passed. Nevertheless, precision and recall are around 65%. That means that of all students who were predicted to pass, only 65% did so. Furthermore, of all students who passed overall, only 65% were classified as such. This is not much better than a coin toss.

Furthermore, the process is not fully automated. Manual tasks are still required. The paper does not explain how that has been done. Thus, replicating the work seems impossible based on the details given in the paper.

It would be interesting to know how the ML-based approach performs in comparison to an entirely manual approach. But I could not find such an evaluation in the paper. Does the ML provide any additional benefit at all, especially since the evaluation shows low values (accuracy, precision, recall, F1) and an R^2 close to 0 in the regression task?

Author Response

Firstly, we would like to thank you for your effort in reviewing our manuscript titled “The Learning Style Decoder: FSLSM-Guided Behavior Mapping Meets Deep Neural Prediction in LMS Settings”. We really appreciate the careful review and constructive suggestions. In what follows, we address all the major points raised in the review; the corresponding changes are marked in red inside the manuscript. We believe the manuscript is now substantially improved after making the suggested edits.

Comment1: It does not work very well. The quality of the ML predictions is pretty low. The only task that worked somewhat reliably is the binary classification. However, this task might be pretty easy depending on the distribution of students who failed or passed. Nevertheless, precision and recall are around 65%. That means that of all students who were predicted to pass, only 65% did so. Furthermore, of all students who passed overall, only 65% were classified as such. This is not much better than a coin toss.

Response1: We would like to sincerely thank the reviewer for this valuable observation. We acknowledge that the overall predictive performance of our models remains modest, with precision and recall values in the range of 60–65% for binary classification. However, we respectfully argue that these results should not be interpreted as near random. In our case, the prediction problem is highly imbalanced and complex, relying exclusively on behavioral interaction data derived from Moodle logs, without access to demographic, cognitive, or affective variables that are often used to boost performance in educational prediction models.

Specifically, regarding the binary classification task, while it may seem simplistic at first glance, it in fact reflects a non-trivial challenge in our datasets. The "pass/fail" boundary varies contextually and does not follow a perfectly balanced distribution. As reported in Section 6, class imbalance is evident across all datasets, and we employed class weighting during training to mitigate potential bias. Furthermore, the balanced F1-scores indicate that the models did not overfit the majority class. Thus, we interpret the 64–65% precision/recall as a reasonable baseline for a fully automated prediction system based solely on anonymized interaction sequences, without manual features or user-specific prior knowledge.
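For illustration, the class-weighting step mentioned above can be sketched as follows (a minimal sketch assuming scikit-learn and a Keras-style training call; the label values shown are dummy data, not our datasets):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Dummy binary pass/fail labels for the training split (illustrative only)
y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# "balanced" weights are inversely proportional to class frequencies,
# so the minority class contributes more to the loss during training
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(zip(np.unique(y_train), weights))

# passed to the training call, e.g.:
# model.fit(X_train, y_train, class_weight=class_weight, ...)
```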

We have now included additional discussion in Section 6.1 and Section 8 to better contextualize the limitations of our modeling approach and to clarify that our intent is not to propose a production-ready predictive system but rather to explore the feasibility of behavior-driven learner modeling as a scalable profiling strategy. We also revised the conclusion to more accurately reflect the scope and implications of our findings.

Comment2: Furthermore, the process is not fully automated. Manual tasks are still required. The paper does not explain how that has been done. Thus, replicating the work seems impossible based on the details given in the paper.

Response2: We thank the reviewer for this important remark. We agree that reproducibility is crucial, especially when manual steps are involved. However, we respectfully note that the personalization and event mapping process is in fact fully documented and reproducible, and this is explicitly included in Appendix A of the manuscript. Appendix A provides the complete Personalization Array, where over 200 Moodle interaction types are systematically mapped to FSLSM dimensions using a normalized scoring system. This array clearly indicates the pedagogical intent of each action (e.g., quiz submission, resource viewing, forum participation) and assigns scores (+1, 0, –1) across all four FSLSM dimensions, based on structured criteria grounded in prior research [2, 9, 29].

In the revised manuscript, we have now made this element more visible by:

  • Adding direct references to Appendix A from Sections 4.2 and 4.3.
  • Clarifying that the mapping was applied programmatically using the array as a lookup table, allowing the process to be replicated on any Moodle dataset with standard log fields (eventname, action, target).

Therefore, although the initial creation of the array involved expert pedagogical input, its application is fully automated, reproducible, and generalizable to other Moodle environments.
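For illustration, the lookup-based application of the array can be sketched as follows (a minimal sketch assuming pandas; the example rows, scores, and column names are hypothetical, and the complete array is given in Appendix A):

```python
import pandas as pd

# Hypothetical excerpt of the Personalization Array: each (eventname, action, target)
# combination carries a +1/0/-1 score on each FSLSM dimension
personalization_array = pd.DataFrame([
    {"eventname": r"\mod_quiz\event\attempt_submitted", "action": "submitted",
     "target": "attempt", "active": 1, "sensing": 0, "visual": 0, "sequential": 1},
    {"eventname": r"\mod_forum\event\post_created", "action": "created",
     "target": "post", "active": 1, "sensing": 0, "visual": 0, "sequential": -1},
    {"eventname": r"\mod_resource\event\course_module_viewed", "action": "viewed",
     "target": "course_module", "active": -1, "sensing": 1, "visual": 1, "sequential": 0},
])

def score_student_logs(logs: pd.DataFrame) -> pd.DataFrame:
    """Join raw Moodle log rows against the array and sum FSLSM scores per student."""
    merged = logs.merge(personalization_array,
                        on=["eventname", "action", "target"], how="left")
    dims = ["active", "sensing", "visual", "sequential"]
    merged[dims] = merged[dims].fillna(0)  # events not present in the array contribute nothing
    return merged.groupby("userid")[dims].sum()
```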

Importantly, the contribution is not limited to the methodology on paper: we have implemented a fully functional software tool, which allows any user to input their own data (in the specified format) and automatically generate learner profiles and prediction results. This makes the approach replicable and readily usable in practice, beyond the boundaries of our own datasets (we refer to this in the Discussion section).

Comment3: It would be interesting to know how the ML-based approach works in comparison to an entirely manual approach. But I could not find such an evaluation in the paper. Does the ML provide any additional benefit at all, especially since the evaluation shows low values (accuracy, precision, recall, F1) and an R^2 close to 0 in the regression task?

Response3: We appreciate the reviewer’s comment regarding the comparative value of the machine learning (ML) models employed, especially given the moderate performance metrics in some tasks. This is indeed a central concern when assessing whether ML-based approaches provide meaningful advantages over simpler or manual alternatives.

To address this, we would like to clarify that we did not rely solely on deep or complex models. In fact, during our testing phase we evaluated a range of ML architectures, including:

  • Traditional, interpretable models such as logistic regression and decision trees,
  • Shallow neural networks (Sequential Feedforward),
  • More advanced temporal models such as BiLSTM and MLSTM-FCN.

This multimodel evaluation helped us investigate the trade-off between predictive performance and computational efficiency. Simple models showed faster training and inference times but significantly lower accuracy and generalization. On the other hand, highly complex models (e.g., MLSTM-FCN) offered only marginal performance improvements but came at the cost of significantly increased inference time, which can limit scalability and responsiveness in real-world learning platforms.

As a result, we opted for the intermediate models (e.g., the Sequential and BiLSTM architectures) presented in the paper, as they struck the best balance between accuracy, robustness, and execution time. This decision was further supported by empirical evaluations across all datasets and tasks, as detailed in Section 6 and summarized in Table 14.
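For illustration, the kind of intermediate BiLSTM classifier referred to above can be sketched as follows (a minimal sketch assuming TensorFlow/Keras; layer sizes and hyperparameters are placeholders rather than the exact configuration used in the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm(n_timesteps: int, n_features: int, n_classes: int = 2) -> tf.keras.Model:
    """Compact bidirectional LSTM classifier over student behavioral sequences."""
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.Bidirectional(layers.LSTM(64)),      # sequence encoder
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. sequences of 50 time steps with 8 behavioral features each (illustrative shapes)
model = build_bilstm(n_timesteps=50, n_features=8)
```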

Ultimately, while ML-based approaches may not yet reach production-grade performance in all prediction tasks (particularly regression), our results show that even mid-level models can deliver nontrivial predictive insights using only anonymized behavioral data—insights that are difficult to extract via manual inspection or rule-based systems alone.

Reviewer 2 Report

Comments and Suggestions for Authors

This article introduces a hybrid profiling methodology that combines psychometric data from an extended Felder–Silverman Learning Style Model questionnaire with behavioral analytics derived from Moodle Learning Management System interaction logs. A structured mapping process is employed to associate over 200 unique log event types with FSLSM cognitive dimensions, enabling dynamic, behavior-driven learner profiles.

However, there are a few things that need to be addressed:

  1. Please better describe Figure 1.
  2. The paper contains only 17 references. This is a bit too few for a review paper. I think there should be at least 30 items.
  3. Reference to Table 12 is missing.

Once the changes have been made, the article is suitable for publication.

Author Response

Firstly, we would like to thank you for your effort in reviewing our manuscript titled “The Learning Style Decoder: FSLSM-Guided Behavior Mapping Meets Deep Neural Prediction in LMS Settings”. We really appreciate the careful review and constructive suggestions. In what follows, we address all the major points raised in the review; the corresponding changes are marked in red. We believe the manuscript is now substantially improved after making the suggested edits.

Comment1: Please better describe Figure 1.

Response1: We would like to thank the reviewer for this observation. Accordingly, we expanded the description of Figure 1 to provide additional detail.

Comment2: The paper contains only 17 references. This is a bit too few for a review paper. I think there should be at least 30 items.

Response2: We would like to thank the reviewer for this observation. We added more than 13 publications throughout the paper.

Comment3: Reference to Table 12 is missing.

Response3: We would like to thank the reviewer for this observation. We added a reference inside the document for Table 12.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript proposes a hybrid learner profiling framework that combines FSLSM-based psychometric data with behavioral analytics from Moodle logs to personalize learning. Deep learning models are used to predict student performance, with mixed success across regression and classification tasks. Below are my concerns.

The manuscript’s mapping of Moodle log events to FSLSM dimensions (Sec. 4.2–4.4) is described only at a high level without a clear algorithmic specification. Critical details are omitted. For example, Table 1 introduces a “Stochastic” profile without ever defining it in the FSLSM framework, and the “Reflective” dimension is missing entirely. The authors themselves note this manual mapping is “heuristic” and subjective.

There are inconsistencies between standard FSLSM terminology and what appears in the tables. In Table 1, the profile “Sensor” appears, and “Reflective” does not appear at all, while a non-FSLSM category “Stochastic” is listed. 

The experimental methodology (Sec. 5) lacks any description of how the data was partitioned into training/validation/test sets or whether cross-validation was used or not. Section 5.3 only mentions “stratified sampling” and early stopping during fine-tuning but gives no specifics on train/test splits or random seed. 

There is confusion between using static tabular features and time-series input. The createlogfeatures function is described as aggregating millions of log entries into fixed "behavioral metrics" per student, yet the model descriptions claim to process "multivariate time series representing student behavioral sequences". It is not clear how input sequences are constructed.

Section 5.1 mentions a “public dataset from Kaggle” with 1,018 records and ≈2.7M logs but provides no citation or context. 

The manuscript briefly notes "removal of missing or incomplete rows" and feature normalization, but it omits key statistics. For example, how many records were dropped? Were outliers clipped? How were categorical variables encoded? The grade normalization is not fully explained.

The study evaluated 3 deep learning models without evaluating any simple model; it is unclear whether the complex models provide any benefit.

Numerical details are absent about the models, such as hyperparameters.

Some conclusions about performance are not fully supported by the metrics. For instance, the abstract claims “high accuracy in binary classification”, but Table 3 shows ~64–61% accuracy on IHU data with balanced precision/recall, and Kaggle’s ~80% accuracy still has very low F1 on the minority class. Also, negative R² values for regression (e.g. –0.15) indicate the models often underperform a trivial mean predictor.

The manuscript does not mention whether the study received ethical clearance from an Institutional Review Board (IRB) or equivalent ethics committee. Any study involving student data must obtain ethical approval to comply with institutional and legal obligations.

There is no mention of how consent was obtained from students whose Moodle logs or questionnaire data were used.


Author Response

Firstly, we would like to thank you for your effort in reviewing our manuscript titled “The Learning Style Decoder: FSLSM-Guided Behavior Mapping Meets Deep Neural Prediction in LMS Settings”.

We really appreciate the careful review and constructive suggestions. In what follows, we address all the major points raised in the review; the corresponding changes are marked in red. We believe the manuscript is now substantially improved after making the suggested edits.

Comment1: The manuscript’s mapping of Moodle log events to FSLSM dimensions (Sec. 4.2–4.4) is described only at a high level without a clear algorithmic specification. Critical details are omitted. For example, Table 1 introduces a “Stochastic” profile without ever defining it in the FSLSM framework, and the “Reflective” dimension is missing entirely. The authors themselves note this manual mapping is “heuristic” and subjective.

Response1: We thank the reviewer for this valuable comment. We acknowledge that the terminology in Table 1 was inconsistent, as both Reflective and Stochastic appeared across different sections of the manuscript. This issue stems from variations in terminology found in prior FSLSM-related references, where the “Reflective” dimension has occasionally been referred to as “Stochastic”. In our submitted version we mistakenly used both, which created confusion.
In the revised manuscript, we have standardized the terminology and now use only the term Reflective consistently across all tables and sections, in alignment with the standard FSLSM framework. This correction improves clarity and avoids the impression of introducing a non-standard category.
Moreover, our event-to-dimension mapping is based on a structured scoring matrix, fully documented in Appendix A, and applied programmatically. However, we acknowledge that the original scoring matrix was constructed based on expert interpretation by a small team of researchers, and therefore lacks a formal inter-rater reliability assessment or data-driven validation. This is a valid limitation of the current study, and we now explicitly acknowledge it in Section 8 (Discussion).

Comment2: There are inconsistencies between standard FSLSM terminology and what appears in the tables. In Table 1, the profile “Sensor” appears, and “Reflective” does not appear at all, while a non-FSLSM category “Stochastic” is listed. 

Response2: We thank the reviewer for this comment, which directly relates to Comment 1. As noted there, we identified and corrected the terminology inconsistencies in Table 1 and throughout all relevant sections. In particular, we removed the duplicate use of “Stochastic” and now consistently use the standard FSLSM term Reflective. Similarly, the terminology for the other FSLSM dimensions (e.g., Sensor) has been reviewed and aligned with the standard framework. These corrections ensure that all categories are consistent with the established FSLSM model.

Comment3: The experimental methodology (Sec. 5) lacks any description of how the data was partitioned into training/validation/test sets or whether cross-validation was used or not. Section 5.3 only mentions “stratified sampling” and early stopping during fine-tuning but gives no specifics on train/test splits or random seed. 

Response3: We thank the reviewer for this comment. We acknowledge that the original manuscript did not provide sufficient detail on dataset partitioning. In fact, the data were split programmatically using an 80/20 stratified train–test split, ensuring that class proportions were preserved across subsets. The test split was used as the held-out validation set during training, with early stopping applied based on validation loss to mitigate overfitting. No k-fold cross-validation was applied in the reported experiments. We have now updated Section 5.3 to explicitly describe the splitting procedure and early stopping strategy, making the experimental methodology transparent and reproducible.
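For concreteness, the splitting and early-stopping procedure can be sketched as follows (a minimal sketch assuming scikit-learn and Keras; the feature matrix, labels, and seed are dummy placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

rng = np.random.default_rng(42)
X = rng.random((200, 16))             # dummy per-student feature matrix
y = rng.integers(0, 2, size=200)      # dummy binary pass/fail labels

# 80/20 stratified split: class proportions are preserved in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Early stopping on validation loss, with the held-out split used for validation
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           callbacks=[early_stop], ...)
```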

Comment4: There is confusion between using static tabular features and time-series input. The createlogfeatures function is described as aggregating millions of log entries into fixed "behavioral metrics" per student, yet the model descriptions claim to process "multivariate time series representing student behavioral sequences". It is not clear how input sequences are constructed.

Response4: We thank the reviewer for pointing out this ambiguity. As noted, our createlogfeatures function aggregates millions of Moodle log entries into per-student feature vectors, but the aggregation is not purely static. In particular, we incorporated temporal information by estimating the duration of each event as the time until the subsequent event. To avoid extreme outliers, any estimated duration exceeding one hour was truncated to exactly one hour. This procedure allowed us to encode temporal dynamics of student behavior within the tabular feature set.

While the features are ultimately stored in a tabular format, the presence of these derived temporal dimensions motivated our exploration of sequence-based models (e.g., BiLSTM, MLSTM-FCN) in addition to simpler baselines. We have now clarified this implementation detail in Section 5.1.
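A minimal sketch of this duration-estimation step is given below (assuming pandas, with hypothetical column names userid and timecreated holding Unix timestamps; it is not the exact implementation of createlogfeatures):

```python
import pandas as pd

def add_event_durations(logs: pd.DataFrame, cap_seconds: int = 3600) -> pd.DataFrame:
    """Estimate each event's duration as the gap to the student's next event, capped at one hour."""
    logs = logs.sort_values(["userid", "timecreated"]).copy()
    gap = logs.groupby("userid")["timecreated"].shift(-1) - logs["timecreated"]
    logs["duration"] = gap.clip(upper=cap_seconds).fillna(0)  # truncate long gaps; last event gets 0
    return logs
```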

Comment5: Section 5.1 mentions a “public dataset from Kaggle” with 1,018 records and ≈2.7M logs but provides no citation or context. 

Response5: We thank the reviewer for this observation. We acknowledge that the original version of the manuscript did not include a proper citation and context for the Kaggle dataset. In the revised version, we have added the appropriate reference in Section 5.1.

Comment6: The manuscript briefly notes "removal of missing or incomplete rows" and feature normalization, but it omits key statistics. For example, how many records were dropped? Were outliers clipped? How were categorical variables encoded? The grade normalization is not fully explained.

Response6: We thank the reviewer for highlighting this important point. In the revised manuscript we have expanded Section 5.1 to provide further details; a brief illustrative sketch is provided after the list below. Specifically:

      • Row removal: All rows with missing values in any of the features considered were dropped. In addition, for students with multiple course attempts, only the first attempt/grade per course was retained to avoid data duplication. Although we did not track the exact number of rows removed, these filters were applied consistently across datasets.
      • Outliers: For time-derived features, event durations exceeding one hour were truncated to one hour to prevent extreme outliers from dominating the feature distribution.
      • Grade normalization: Final course grades were normalized to the range [0,1] using min–max scaling, ensuring compatibility with regression outputs.
      • Categorical encoding: For binary classification, student outcomes were encoded as 0/1. For multi-class classification (Fail/Average/Excellent), one-hot encoding was applied.
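A minimal sketch illustrating these preprocessing steps (assuming pandas/scikit-learn, with hypothetical column names and an illustrative pass threshold; not the exact pipeline used in the study):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()                                               # drop rows with missing values
    df = (df.sort_values("attempt")
            .drop_duplicates(subset=["userid", "course"]))         # keep only the first attempt per course
    df["grade_norm"] = MinMaxScaler().fit_transform(df[["grade"]]).ravel()  # min-max scale grades to [0, 1]
    df["passed"] = (df["grade_norm"] >= 0.5).astype(int)           # 0/1 label for binary classification (threshold illustrative)
    outcome_onehot = pd.get_dummies(df["outcome"], prefix="outcome")  # one-hot Fail/Average/Excellent
    return pd.concat([df, outcome_onehot], axis=1)
```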

Comment7: The study evaluated 3 deep learning models without evaluating any simple model; it is unclear whether the complex models provide any benefit.

Response7: We thank the reviewer for raising this point. This concern overlaps with comments raised by other reviewers, in response to which we clarified that a wide range of models was in fact tested, including both simple baselines and more complex architectures. Simpler models (e.g., logistic regression, decision trees) produced faster results but significantly lower predictive performance, while more complex deep learning models yielded only marginal improvements at a much higher computational cost. In our analysis we reported the models that best balanced predictive accuracy with computational efficiency. This rationale has been further clarified in Section 5.3.

Comment8: Numerical details are absent about the models, such as hyperparameters.

Response8: We thank the reviewer for this observation. We acknowledge that the original submission did not provide sufficient detail about hyperparameters. However, the research is still ongoing, with more databases being tested to improve the models’ performance; thus, the final parameters may vary from the current ones. For full transparency and reproducibility, we plan to publish the complete codebase, including all model definitions and parameter configurations, in a public GitHub repository upon completion of this research. This ensures that all implementation details will be openly available for replication and further study.

Comment9: Some conclusions about performance are not fully supported by the metrics. For instance, the abstract claims “high accuracy in binary classification”, but Table 3 shows ~64–61% accuracy on IHU data with balanced precision/recall, and Kaggle’s ~80% accuracy still has very low F1 on the minority class. Also, negative R² values for regression (e.g. –0.15) indicate the models often underperform a trivial mean predictor.

Response9: We thank the reviewer for this comment. While it is correct that regression performance was weak (negative R²) and multi-class tasks proved highly challenging, binary classification consistently outperformed baseline predictors, offering a useful signal for personalization purposes. In our experiments we systematically compared simple and complex models. Simpler models achieved lower predictive power, whereas more complex ones yielded marginally better results at the cost of significantly longer runtimes. For this reason, we chose to highlight models that best balance accuracy with computational efficiency.

We have revised the abstract and conclusion sections to reflect this balanced perspective: our findings indicate that ML can provide meaningful predictive value for binary classification tasks, though regression and fine-grained classification remain open challenges.

Comment10: The manuscript does not mention whether the study received ethical clearance from an Institutional Review Board (IRB) or equivalent ethics committee. Any study involving student data must obtain ethical approval to comply with institutional and legal obligations.

Response10: We thank the reviewer for this important observation. As we clarify in Section 5.1, no personally identifiable information was used in this study. All student data were fully anonymized by the institutional system prior to our access, and we only received coded identifiers, course/grade information, and Moodle log data. As such, no names, emails, or other personal details were ever available to the research team. In line with institutional guidelines for handling anonymized datasets, explicit IRB approval was not required for this study.

Comment11: There is no mention of how consent was obtained from students whose Moodle logs or questionnaire data were used.

Response11: We thank the reviewer for raising this concern. As noted in our response to Comment 10, the data were anonymized prior to researcher access, and thus no direct student consent was necessary since no personally identifiable information was processed. The anonymization procedure ensured that all analyses were conducted on coded identifiers only, with no possibility of linking back to individual students.

Reviewer 4 Report

Comments and Suggestions for Authors

The paper proposes a hybrid learner modeling framework combining FSLSM-based questionnaire profiling with LMS log-based behavioral analytics, applying deep learning models for predicting student performance. Comments can be found below:

  1. The integration of FSLSM with behavioral log data is not novel. The approach mirrors prior work and the mapping of Moodle events to FSLSM dimensions follows established heuristics without advancing theoretical understanding or proposing new modeling paradigms.
  2. The behavioral-to-cognitive style mapping relies heavily on a manual, scoring-based system (+1, 0, –1) with no validation or empirical justification.
  3. There is no inter-rater agreement, reliability analysis, or data-driven grounding for how behavioral events are linked to FSLSM categories, introducing substantial bias and reducing reproducibility.
  4. Regression models perform poorly (R² near zero or negative across all datasets), indicating that the behavioral features engineered do not carry sufficient signal for grade prediction. Fine-grained classification (11-class) performs near chance, and even 3-class tasks suffer from imbalanced accuracy across classes, undermining claims of generalizability and robustness.
  5. The use of BiLSTM and MLSTM-FCN is not convincingly motivated, given the engineered feature set is tabular and lacks raw sequential structure (i.e., time steps are aggregated). The models may be unnecessarily complex and not well-suited for the problem formulation as implemented.
  6. The study would benefit from calibration metrics (e.g., Brier score) and confusion matrix analyses for all tasks, especially in imbalanced settings. ROC-AUC scores are absent but would add useful insight, particularly for binary classification.
  7. Numerous grammatical errors and awkward phrasing throughout the manuscript (e.g., "stochastic" is used inappropriately to describe a learning dimension).
  8. Figure 1 and Figure 2 are not adequately described or discussed in the main text.

Author Response

Firstly, we would like to thank you for your effort in reviewing our manuscript titled “The Learning Style Decoder: FSLSM-Guided Behavior Mapping Meets Deep Neural Prediction in LMS Settings”. We really appreciate the careful review and constructive suggestions. In what follows, we address all the major points raised in the review; the corresponding changes are marked in red. We believe the manuscript is now substantially improved after making the suggested edits.

Comment1: The integration of FSLSM with behavioral log data is not novel. The approach mirrors prior work and the mapping of Moodle events to FSLSM dimensions follows established heuristics without advancing theoretical understanding or proposing new modeling paradigms.

Response1: We thank the reviewer for this important comment regarding the originality of our approach. We agree that prior work has explored the use of FSLSM in educational settings, and that heuristic-based mappings from LMS data to FSLSM dimensions have been previously proposed.

However, we respectfully submit that our work makes a distinct methodological and applied contribution, even if it does not propose a novel theoretical framework. In particular:

  • We developed a fully operational and automatable mapping system, which systematically links over 200 Moodle log event types to FSLSM learner dimensions using a consistent scoring matrix (provided in Appendix A). This level of detail and scalability is not commonly found in prior studies.
  • Unlike many earlier works which rely on self-reported questionnaires, our method derives FSLSM learner profiles solely from behavioral log data, eliminating the need for direct learner input.
  • We validated the approach across three diverse datasets, each from different educational contexts, demonstrating the robustness and generalizability of the mapping methodology.

While we do not claim to advance FSLSM theory, we believe our contribution lies in making this model operational at scale, in a way that can support real-time personalization in learning environments such as Moodle.

Comment2: The behavioral-to-cognitive style mapping relies heavily on a manual, scoring-based system (+1, 0, –1) with no validation or empirical justification.

Response2: We thank the reviewer for raising this concern. This issue has already been addressed in detail in our response to Reviewer 1 – Comment 2, where we explain that the mapping system is not arbitrary but based on a structured and repeatable methodology that is fully documented in Appendix A of the manuscript.
In short, the behavioral-to-cognitive mapping relies on a scoring array derived from prior FSLSM-aligned research and pedagogical interpretation, applied programmatically to Moodle event data. We have updated Sections 4.2 and 4.3 to make this clearer in the manuscript.

Comment3: There is no inter-rater agreement, reliability analysis, or data-driven grounding for how behavioral events are linked to FSLSM categories, introducing substantial bias and reducing reproducibility.

Response3: We thank the reviewer for this insightful comment. This issue is related to our previous response to Comment 2, where we explain that our event-to-dimension mapping is based on a structured scoring matrix, fully documented in Appendix A, and applied programmatically.

However, we acknowledge that the original scoring matrix was constructed based on expert interpretation by a small team of researchers, and therefore lacks a formal inter-rater reliability assessment or data-driven validation. This is a valid limitation of the current study, and we now explicitly acknowledge it in Section 8 (Discussion).

As suggested, future work should include an inter-rater evaluation involving multiple annotators or instructors, and potentially a data-driven mapping refinement based on student outcomes or unsupervised clustering techniques. We believe this will further improve the generalizability and reproducibility of the FSLSM behavioral mapping methodology.

Comment4: Regression models perform poorly (R² near zero or negative across all datasets), indicating that the behavioral features engineered do not carry sufficient signal for grade prediction. Fine-grained classification (11-class) performs near chance, and even 3-class tasks suffer from imbalanced accuracy across classes, undermining claims of generalizability and robustness.

Response4: This issue has been thoroughly addressed in our response to Reviewer 1 – Comments 1 and 3. While we acknowledge the weak performance of the regression models (low or negative R²), our classification models—especially for binary and 3-class outcomes—showed consistently stronger results (e.g., F1-scores above 0.60 in key tasks).
These findings suggest that behavioral features do carry predictive information, although perhaps not granular enough for fine-scale grade estimation. We have updated the Discussion and Conclusion sections to clarify this point.

Comment5: The use of BiLSTM and MLSTM-FCN is not convincingly motivated, given the engineered feature set is tabular and lacks raw sequential structure (i.e., time steps are aggregated). The models may be unnecessarily complex and not well-suited for the problem formulation as implemented.

Response5: We thank the reviewer for the observation. We note that this point is closely related to Reviewer 1 – Comment 3, which concerns the justification and value of using machine learning models of varying complexity.
As previously stated, we tested a wide range of model types—from simple (logistic regression, decision trees) to more complex deep learning architectures. The selected BiLSTM model was chosen because it provided a good trade-off between classification accuracy and computational efficiency. Although the final input representation was tabular, certain engineered features preserved latent temporal structures, justifying the inclusion of a sequence-aware model such as BiLSTM.
We have added a brief clarification in Section 5.3 of the manuscript to explain this design decision.

Comment6: The study would benefit from calibration metrics (e.g., Brier score) and confusion matrix analyses for all tasks, especially in imbalanced settings. ROC-AUC scores are absent but would add useful insight, particularly for binary classification.

Response6: We thank the reviewer for this valuable suggestion. We agree that additional evaluation metrics are important, particularly in imbalanced classification tasks. In response, we have now incorporated confusion matrices for selected representative cases (binary and three-class settings) in Section 6, along with explanatory discussion.

Due to space constraints and the large number of models and datasets tested, it was not feasible to include confusion matrices for every single experiment. Instead, we selected those cases that best illustrate model behavior in terms of false positives and false negatives. We believe this focused presentation provides a clearer and more interpretable view of classifier performance without overwhelming the reader with excessive tables.

Comment7: Numerous grammatical errors and awkward phrasing throughout the manuscript (e.g., "stochastic" is used inappropriately to describe a learning dimension).

Response7: We thank the reviewer for pointing out areas of grammatical and terminological improvement. Regarding the specific concern about the use of the term “stochastic”, we would like to respectfully clarify that this terminology is intentionally used and is grounded in the established dimensions of the Felder-Silverman Learning Style Model (FSLSM).

More precisely, the Stochastic–Sequential axis corresponds to the standard Global–Sequential dimension, one of the four core dimensions used in FSLSM to categorize learners based on how they prefer to process information. Stochastic learners (referred to as global learners in the standard FSLSM literature) tend to grasp material in a non-linear, integrative manner, focusing on overall structure and interrelationships before details. This stands in contrast to sequential learners, who proceed in logical, step-by-step order.

In addition, we have conducted a detailed grammatical and stylistic review of the manuscript, revising awkward or unclear phrasing to enhance readability and scholarly tone.

Comment8: Figure 1 and Figure 2 are not adequately described or discussed in the main text.

Response8: We would like to thank the reviewer for this observation. Accordingly, we have removed Figure 2 and expanded the description of Figure 1 to provide additional detail.

Round 2

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have replied to the comments in a proper way.

Author Response

We would like to thank the Reviewer and the Editor for the detailed feedback and for recognizing the improvements made in the previous revision. We fully acknowledge the remaining concerns and have undertaken a substantial revision to address them. We really appreciate the careful review and constructive suggestions. In what follows, we address the major points raised; the corresponding changes are marked in red. We believe the manuscript is now substantially improved after making the suggested edits.

First, we have reframed our contribution more realistically: rather than claiming strong predictive performance across all tasks, we now highlight the framework as a proof-of-concept for hybrid profiling that combines FSLSM questionnaires with LMS behavioral logs in a transparent and reproducible way. The modest performance in regression and multi-class classification is discussed explicitly in the revised Discussion, where we provide a critical analysis of the limitations and identify directions for improvement, including multimodal data integration and fairness-aware evaluation.

Second, we have clarified the rationale for employing deep learning models: their use in this study was exploratory, to test the ability to capture sequential patterns, while acknowledging that simpler models often perform comparably on tabular data. We emphasize that our main contribution lies not in outperforming baselines, but in establishing a deployable software pipeline that can flexibly adapt to new datasets.

Third, we have recalibrated our claims of novelty by situating our work in relation to prior literature. The true contribution is the systematic mapping matrix and the generalizable implementation, rather than an entirely new theoretical model.

Finally, we have strengthened the Discussion and Conclusion by foregrounding limitations more directly—scalability, interpretability, risk of overfitting—and by tempering our language to avoid overstating the empirical findings.

Collectively, these revisions ensure that the manuscript presents a balanced, transparent, and realistic contribution to the field while retaining its practical relevance for educational technology research and practice.
