Review Reports
- Hakan Gunduz
Reviewer 1: Jianhua Zhu
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Stefanos Balaskas
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. The articulation of innovation remains relatively weak; the dimension of theoretical contribution should be further strengthened.
The current focus is largely on engineering implementation and performance improvement, but lacks discussion of methodological mechanisms or theoretical novelty. The authors are advised to include the following:
(1) A comparison with existing deep embedding and feature selection frameworks (e.g., BERT + PCA/SVD), highlighting their limitations;
(2) An explanation of how ARO outperforms other heuristic methods in terms of search efficiency in non-convex, high-dimensional spaces;
(3) A theoretical discussion or hypothesis on how semantic consistency is preserved after aggressive embedding compression.
2. Lack of robust evaluation for algorithm generalizability and robustness.
(1) Although the study employs 10-fold cross-validation and an independent test set, it does not assess model transferability across different types of corpora (e.g., product reviews, social media text).
(2) It is strongly recommended to add cross-domain generalization experiments or a dedicated discussion on domain adaptability to improve the credibility and practical relevance of the model.
3. The theoretical explanation of the ARO algorithm is overly brief and fails to justify its suitability for gender identification in textual data.
(1) The manuscript should offer a clearer interpretation of the "detour foraging" and "hiding" strategies within the context of textual embedding space.
(2) It is recommended to add intuitive visualizations (e.g., schematic feature selection path diagrams) to illustrate how ARO navigates the exploration–exploitation trade-off when selecting features from high-dimensional embeddings.
4. Absence of an ablation study.
The lack of ablation analysis limits the interpretability and rigor of the findings. The following ablation configurations are suggested:
(1) Using embeddings only without any feature selection;
(2) Applying each feature selection algorithm independently;
(3) Exploring the effect of different feature compression ratios on model performance;
(4) Presenting trade-off curves between compression rate and accuracy degradation in a visual format.
5. Ethical implications and bias control are not adequately addressed.
Given that gender prediction carries risks of ethical concerns and discriminatory outcomes, the claim of "ethical personalization" in the paper appears superficial. The authors should elaborate on:
(1) Whether the model exhibits gender prediction bias (e.g., sensitivity toward one gender over another);
(2) Whether the model might amplify stereotypical language cues, leading to algorithmic discrimination.
It is advised to include this analysis in the Discussion or Future Work section to emphasize ethical awareness.
6. Additional suggestions.
(1) Include pseudocode or flowcharts in the methodology section to improve clarity and reproducibility.
(2) Standardize model naming throughout the manuscript (e.g., “BERTurk+GA+LSTM”) to ensure consistency, and avoid using vague terms like “our model.”
(3) Add statistical significance indicators (e.g., p-values, confidence intervals) to tables instead of relying solely on mean ± standard deviation.
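As an illustration of the confidence intervals requested in the last suggestion, the following is a minimal sketch, assuming per-fold scores from cross-validation are available. It uses a percentile bootstrap, which is one of several reasonable choices and is not claimed to be the authors' method. The function name `bootstrap_ci` and the fold scores in the usage lines are hypothetical placeholders, not values from the manuscript.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-fold scores.

    Returns (sample_mean, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the folds with replacement and record the mean of each resample.
    means = sorted(
        mean([rng.choice(scores) for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resample means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper

# Hypothetical 10-fold accuracies, invented purely for illustration.
fold_accuracy = [0.90, 0.92, 0.88, 0.91, 0.89, 0.93, 0.90, 0.91, 0.92, 0.89]
m, lo, hi = bootstrap_ci(fold_accuracy)
print(f"mean={m:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

With only ten folds the interval is coarse, so it complements rather than replaces a paired significance test between models.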
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Please revise the manuscript by addressing the following concerns.
1. While the manuscript's title suggests identifying gender from digital content, the proposed methodology focuses solely on identifying gender from text. Although text certainly falls under the umbrella of digital content, it does not fully represent digital content. Other significant forms like video and image content are not addressed by this text-based method.
2. Dataset Representativeness. The manuscript validates the proposed method using news articles. However, in business contexts, consumer-generated content (CGC) or user-generated content (UGC) demonstrably differs from news articles. Key distinctions include text length, vocabulary usage, writing styles, punctuation patterns, and topic clarity, among others.
3. The manuscript needs to articulate the research motivation more clearly. Specifically, why is it necessary to develop a method for identifying gender specifically from consumer/user-generated digital content? It is acknowledged that gender is a crucial variable for personalized marketing. However, within business settings, especially e-commerce platforms, gender can usually be identified far more accurately and efficiently through alternative means than by analyzing potentially brief textual comments. For instance, e-commerce platforms can readily leverage data such as past purchasing categories/brands, product browsing histories, or even the gender information actively provided by consumers/users during registration to identify gender precisely and quickly. Furthermore, sharing of consumer/user details across different companies or platforms can also provide a highly reliable source for confirming gender information, making the text-based approach less critical in practice. The necessity and practical viability of the proposed method compared to these existing solutions require stronger justification.
4. Overall Writing Style. The manuscript's writing style, overall, should move beyond the characteristics of a purely technical report. It needs to more effectively demonstrate the necessity and applicability of developing this specific method from a business and management perspective. The value proposition and relevance to practitioners need clearer emphasis.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Comments
- The paper's title lacks clarity and demonstrates poor information flow. Authors may consider revising to more concisely convey the study's scope and contribution. The abstract also contains excessive abbreviations and lacks proper structure. Please ensure all essential elements (background, objective, methods, results, conclusions) are presented.
- Authors should clearly articulate the paper's key objectives and novel contributions beyond the general statement about bridging gaps in Turkish language gender identification.
- The current description of data processing procedures (section 3.2) is insufficient. Please provide detailed information about data preprocessing steps.
- The rationale for selecting the Jaya Algorithm is inadequately presented. Please provide: comparative analysis with alternative optimisation algorithms, clear definition of "best solution" criteria, justification for why this algorithm is optimal for your specific problem, and revision of lines 233 and 237-238 for clarity.
- Authors should elaborate on the Artificial Rabbit Optimisation algorithm by explicitly stating the key assumptions underlying the algorithm, its limitations and potential constraints, and a parameter sensitivity analysis.
- Please remove redundant content and ensure each paragraph contributes unique information in lines 36-42.
- Authors should separate the discussion of results from the conclusions section. Create a dedicated discussion section that interprets findings and provides comparative analysis.
- Authors should replace general statements with specific conclusions directly supported by their reported outcomes.
- Please address the significant limitation of using data from a single Turkish news platform. Discuss how this constraint may affect the validity and generalizability of your findings, and suggest strategies for future validation.
- While the research addresses an important gap in Turkish language processing, the paper requires substantial revision to meet publication standards.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Introduction
The Introduction outlines the applicability of gender profiling to tailor-made e-commerce systems and points out issues with the processing of Turkish language content. It refers to the use of deep embeddings as well as meta-heuristic feature selection without stating clearly the precise research gap and motive.
- Weak problem statement: the Introduction does not clearly indicate the exact research gap or the inadequacies of earlier methodologies. Add a paragraph that directly addresses the research gap.
- Defer the list of contributions until after the problem and objectives have been properly set out.
- Weak motivation for method selection: no strong motivation is given for the integration of BERTurk, FastText, and meta-heuristic FS techniques.
- Conclude with a paragraph describing the structure of the paper, facilitating reader orientation.
Literature Review
The literature review discusses gender classification classifiers, feature selection methods, and deep embeddings but lacks synthesis and organization.
- The writing takes the form of a linear listing of tools, with no thematic structure or conceptual grouping.
- There is no connection to the research design: the review does not discuss how the surveyed work led to the chosen pipeline. Integrate comparative commentary.
- Current gender classification studies in other languages, particularly those using deep learning, are missing. Include 2–3 newer studies from multilingual or low-resource settings for further context.
Methodology
The Methodology describes a six-step pipeline consisting of preprocessing, embedding, feature selection, classification, and evaluation. It is specified algorithmically, but the procedure is not well justified and is unclear to implement.
- No explanation is given of why GA, Jaya, ARO, LSTM, and GBM were selected over other options. Add a "Model Selection Rationale" section explaining these decisions.
- The hyperparameters of the models and the optimization algorithms are not reported. Add a hyperparameter summary table and describe the search strategy.
- The dataset description is absent: class balance, preprocessing steps, and data splits are not presented. Provide the data characteristics: gender label distribution, average text length, and preprocessing steps.
- The cross-validation step is susceptible to data leakage because FS may have been performed before splitting.
- Therefore, ensure FS is performed within each CV fold to prevent leakage.
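The leakage concern in the last two points can be made concrete with a minimal sketch, assuming binary labels 0/1 and a dense feature matrix stored as a list of rows. The selector here is a simple mean-difference filter used only as a stand-in for GA, Jaya, or ARO, and the classifier is a nearest-centroid rule rather than the LSTM or GBM of the manuscript. All function names are hypothetical. The point is the placement of the selection step: it is fitted on the training part of each fold and never sees the held-out rows.

```python
import random
from statistics import mean

def select_top_k(X, y, k):
    """Keep the k features with the largest |mean(class 1) - mean(class 0)|.
    Stand-in filter; a meta-heuristic selector would be called here instead."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def centroid_predict(X_train, y_train, X_test, cols):
    """Nearest-centroid classifier restricted to the selected columns."""
    centroids = {}
    for c in (0, 1):
        rows = [r for r, label in zip(X_train, y_train) if label == c]
        centroids[c] = [mean([r[j] for r in rows]) for j in cols]
    preds = []
    for r in X_test:
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def stratified_folds(y, n_folds, seed=0):
    """Deal the indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for c in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == c]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def cv_accuracy(X, y, n_folds=5, k=10, seed=0):
    """Cross-validated accuracy with feature selection refitted in every fold."""
    accuracies = []
    for test_idx in stratified_folds(y, n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        cols = select_top_k(X_tr, y_tr, k)  # fitted on training rows only
        preds = centroid_predict(X_tr, y_tr, X_te, cols)
        accuracies.append(sum(p == t for p, t in zip(preds, y_te)) / len(y_te))
    return accuracies
```

Calling `select_top_k` once on the full data before the loop would let label information from the held-out rows influence which features are kept, which is the leakage the reviewer warns about.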
The paper measures model performance by Accuracy, F1-score, and MCC and employs 10-fold cross-validation. No statistical tests or measures of variability are reported, however.
- No statistical testing is conducted to provide evidence for model superiority.
- No class distribution is reported, which is a validity issue for accuracy as a measure. Report class frequencies and employ stratified CV to preserve balance.
- No standard deviations, CIs, or plots are provided to report model variability or robustness. Report standard deviation or CI values for all metrics.
- Perform paired statistical tests (e.g., Wilcoxon signed-rank) across cross-validation folds and report p-values. Include plots of accuracy/F1 vs. number of features to illustrate FS trade-offs and a fitness-over-iterations plot for optimization convergence.
- The fitness function employed in FS optimization is explained, but no convergence behavior is illustrated.
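The paired test suggested above can be sketched as follows, assuming two models were scored on the same cross-validation folds. This is a self-contained exact Wilcoxon signed-rank test that enumerates all sign assignments, which is feasible for 10 folds (1,024 cases). The function names are hypothetical and the fold scores in the usage lines are invented for illustration, not taken from the manuscript.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def wilcoxon_exact(a, b):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (T, p) where T = min(W+, W-). Zero differences are dropped.
    Ties are detected with ==, so round float scores first if needed.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= t_obs + 1e-9:
            hits += 1
    return t_obs, hits / 2 ** len(ranks)

# Hypothetical per-fold F1 scores (in percent) for two configurations.
model_a = [91, 93, 90, 92, 94, 91, 92, 93, 90, 92]
model_b = [90, 91, 91, 90, 92, 90, 91, 91, 89, 91]
t_stat, p_value = wilcoxon_exact(model_a, model_b)
print(f"T={t_stat}, p={p_value:.4f}")
```

Because fold scores from the same cross-validation run share training data, the resulting p-value is best read as indicative rather than exact.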
Discussion and Conclusion
The Discussion briefly touches on which model worked best and on the potential for dimensionality reduction, but lacks proper interpretation, implications, and limitations.
- No discussion of misclassification, model weaknesses, or operational deployment issues. Expand the discussion to explain why some models worked better.
- No general discussion of ethical implications, e.g., algorithmic profiling and gender fairness. Mention the ethical implications of gender profiling in practical use.
- Give specific future applications, e.g., cross-domain Turkish text generation with the model, or sentiment/emotion markers.
- State practical limits, e.g., model size, inference time, and suitability for real-time applications.
- Future directions are not specific and lack a concrete technological or conceptual target.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have carefully reviewed the manuscript entitled “Scalable Gender Profiling from Turkish Text Using Deep Embeddings and Meta-Heuristic Feature Selection for E-Commerce Personalization.” While the study demonstrates competent engineering execution in model selection and experimental design, I regret to conclude that the manuscript does not meet the scholarly standards for publication in this journal. My recommendation is to reject the submission based on the following concerns:
1. The authors propose a hybrid architecture that combines BERTurk and FastText embeddings with three meta-heuristic feature selection algorithms (GA, Jaya, ARO) for gender classification. While technically sound, this represents a methodological combination of well-established components rather than a conceptual or algorithmic innovation. The paradigm of embedding-based classification with heuristic selection has been extensively explored in prior literature (e.g., PAN, CLEF, IEEE Access). As such, the study falls short of offering novelty beyond integration.
2. Although extensive engineering efforts are presented, the manuscript lacks theoretical justification for key design choices. For instance:
– No explanation is given as to why meta-heuristics are preferable to conventional dimensionality reduction methods such as PCA or autoencoders for high-dimensional linguistic features.
– The ARO algorithm, while effective in compression, is not analyzed for computational complexity or convergence stability.
– The complementary strengths of BERT and FastText are not formally modeled or quantified; their integration appears purely empirical and lacks linguistic grounding.
3. The study relies exclusively on the IAG-TNKU corpus, a dataset composed of journalist-authored Turkish news articles. This domain-specific data diverges considerably from user-generated content typically found in e-commerce contexts. Despite a brief acknowledgment of this limitation, the absence of domain adaptation experiments or cross-genre validation raises serious concerns regarding the model’s generalizability and practical utility.
4. While the authors briefly mention potential risks related to gender bias and stereotyping, their discussion remains general and lacks empirical analysis. Specifically, the study does not investigate embedding bias, nor does it employ interpretability techniques (e.g., SHAP, LIME) to assess fairness or transparency. Given the sensitive nature of gender inference in commercial applications, the manuscript falls short of current standards in AI ethics and responsible modeling.
5. Gender classification presumes that linguistic behavior reflects stable and binary gender traits. However, the manuscript fails to address foundational sociolinguistic questions, such as whether gendered language patterns are consistent or whether style is inherently linked to biological sex. These questions have significant implications for the validity of the task and should be clarified before modeling.
6. The study employs a conventional male/female binary, which no longer reflects contemporary understanding of gender in NLP. Recent work (e.g., PAN Author Profiling, ACL Gender-Fair NLP) has introduced concepts such as non-binary identity, self-identified gender, and degrees of genderedness. The absence of such perspectives renders the methodology outdated and normatively limited.
7. Though the authors perform cross-validation, this alone does not evaluate robustness. Leading research in this area now includes zero-shot or few-shot evaluations to test cross-domain adaptability (e.g., transfer from news to social media). The lack of such experiments undermines claims of scalability and real-world relevance.
8. While GA, Jaya, and ARO are applied for dimensionality reduction, the authors do not analyze whether selected features retain linguistic interpretability. In current NLP, attention has shifted toward understanding embedding dimensions (e.g., probing tasks), and treating them merely as numeric vectors without linguistic relevance limits the transparency of the system.
9. Although the manuscript reports standard deviations and partial confidence intervals, it does not employ statistical significance testing (e.g., McNemar's test, Wilcoxon signed-rank test) to validate performance differences between models. Given that reported gains are often within ±1%, this omission weakens claims about “optimal configurations.”
10. A disproportionate number of references are authored by the submitting author, primarily published in local or low-impact venues between 2023 and 2024. At the same time, the manuscript fails to cite recent work from premier NLP conferences (e.g., ACL, EMNLP, COLING) on gender prediction, embedding interpretability, and fairness-aware NLP. This creates a narrow and self-referential scholarly context.
11. While the dataset is publicly available, the paper does not provide access to source code, preprocessing scripts, or model configuration files. In light of the complexity of embedding models and optimization algorithms, the absence of reproducible artifacts contravenes established norms for computational reproducibility in NLP and machine learning research.
In summary, although the manuscript offers a competent engineering exercise, it lacks the theoretical insight, ethical rigor, and empirical breadth expected for publication. I therefore recommend rejection in its current form.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for revising the manuscript. However, the most significant remaining issue concerns the applicability of the proposed method across different application scenarios. The manuscript's title indicates that this 'gender profiling method' is intended for E-Commerce Personalization. I fully understand the challenges in obtaining suitable test datasets. Nevertheless, since the title explicitly states the method's application to E-Commerce Personalization, the manuscript should provide direct empirical evidence using relevant data rather than news articles. Alternatively, the authors could remove 'E-Commerce Personalization' from the title, though this might weaken the manuscript's alignment with the journal's scope. Please consider these points when revising.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have adequately addressed all my comments and concerns.
Author Response
Comments 1: "The authors have adequately addressed all my comments and concerns."
Response 1: We sincerely thank the reviewer for the positive assessment and for acknowledging that all previous comments and concerns have been addressed. We are grateful for the constructive feedback provided during the review process, which has significantly improved the clarity, methodological rigor, and overall quality of the manuscript.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
After carefully reading the manuscript and the authors’ responses, I acknowledge that the study represents a reasonably thorough engineering implementation for Turkish gender classification. However, I find that it falls short in terms of methodological originality, theoretical grounding, experimental breadth, cross-domain applicability, ethical treatment, and interpretability. Below, I detail my reasons for recommending rejection:
1. Lack of substantive methodological novelty. In my view, the core framework—BERTurk + FastText + three meta-heuristic feature selection algorithms (GA, Jaya, ARO)—relies entirely on well-established components already widely applied in the literature. This is primarily an engineering integration rather than a methodological innovation, with no new algorithmic principle, optimization strategy, or modeling mechanism introduced. Even with the emphasis on “Turkish adaptation” and comparative evaluation, these contributions remain within the scope of application and benchmarking rather than fundamental advances.
2. Insufficient theoretical justification for key design choices. Several critical design decisions—such as preferring meta-heuristics over PCA/autoencoders, the formal rationale for combining BERT and FastText, or the computational complexity and convergence analysis of ARO—are not rigorously justified. While the revised manuscript adds comparison tables and some linguistic motivation, it lacks mathematical derivation, complexity analysis, or formal hypothesis testing. As a result, the design choices appear empirically motivated but theoretically under-supported, limiting their transferability and generalizability.
3. Narrow dataset scope and lack of cross-domain validation. The study relies solely on the IAG-TNKU news corpus, yet the initial framing suggested applicability to domains such as e-commerce or social media, which differ markedly in genre and linguistic style. The absence of cross-domain, zero-shot, or few-shot transfer experiments undermines claims of scalability and real-world relevance. Simply removing “e-commerce” from the text does not address the core limitation.
4. Missing empirical fairness and interpretability analysis. Given the socially sensitive nature of gender inference, I find the treatment of fairness and transparency inadequate. There is no empirical bias detection for embeddings, no group-level performance analysis, and no use of explainability tools such as SHAP or LIME to examine feature contributions or model decisions. The authors defer such analyses to “future work,” but at present the study does not meet accepted ethical and responsible AI standards in NLP.
5. Outdated binary gender assumption. The work is strictly limited to male/female binary classification, without consideration of non-binary or fluid gender identities, which are increasingly addressed in contemporary gender-fair NLP research (e.g., PAN, ACL workshops). While dataset constraints may impose binary labels, it is still possible to explore or simulate more inclusive label frameworks, or at least reflect this in model design and discussion. The lack of such treatment is a notable limitation.
6. Limited robustness and adaptability evaluation. Relying solely on 10-fold cross-validation does not sufficiently demonstrate model robustness, especially when target applications may differ significantly from the training domain. The lack of cross-domain, zero-shot, or few-shot evaluation weakens the credibility of scalability and deployment claims.
7. Weak link between selected features and linguistic interpretability. The embedding dimensions retained by the meta-heuristics lack linguistic interpretability, and no probing tasks are used to link them to morphological patterns, syntactic structures, or gendered lexical usage. This limits both the academic contribution and the model’s trustworthiness in practical settings.
8. Narrow and self-referential citation scope. I note an over-reliance on the authors’ own recent publications in local or lower-impact venues, with insufficient engagement with high-impact, recent work from ACL, EMNLP, COLING, and other top-tier venues. This narrows the scholarly dialogue and diminishes the work’s integration with the state of the art.
The main contributions of this manuscript lie in engineering implementation and adaptation to a Turkish dataset, rather than in methodological innovation. The lack of theoretical rigor, the narrow experimental scope, insufficient fairness and interpretability analysis, outdated gender conceptualization, and limited robustness evaluation collectively prevent the paper from meeting the scholarly and ethical standards expected for publication in this journal.
I recommend rejection, with the following suggestions for substantial improvement prior to resubmission:
- Introduce genuinely novel algorithmic or optimization contributions beyond component integration.
- Provide rigorous theoretical analysis and complexity modeling for design choices.
- Add cross-domain, zero-shot, or few-shot evaluation to substantiate scalability claims.
- Conduct empirical fairness and interpretability assessments.
- Incorporate a more inclusive approach to gender labeling where feasible.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Please consider the following points and revise.
- The structure of Section 3 ("Methods") should be revised to more clearly demonstrate the methodological process. Specifically, subsection titles should align closely with the stages outlined in Figure 1 (Flowchart) to enhance the clarity of the proposed method. Additionally, it is suggested that Subsections 3.1 and 3.2 either be consolidated into an independent section or integrated into Section 2 of the manuscript.
- Spelling errors throughout the paper need to be addressed comprehensively.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf