
Review Reports

J. Theor. Appl. Electron. Commer. Res. 2025, 20(4), 343; https://doi.org/10.3390/jtaer20040343
by Nahed Alowidi

Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article "An Intelligent Multimodal Deep Learning Framework for Automated Website Usability Evaluation" addresses a highly topical issue in the context of the current technological revolution. The author correctly captures companies' need to offer an exceptional user experience, which drives customer satisfaction and, in turn, business success. The novelty of the research is thus established, and the results contribute to the advancement of current knowledge.
Methodologically, the research is complex and meticulously constructed, and the analysis is supported by the tables and figures presented in the article.
The article makes a valuable scientific contribution and, as the author also notes, the framework can be extended relatively easily to domains beyond the one presented (fashion industry websites).
From a linguistic perspective, the text is coherent and logical, but I recommend improving readability by using more concise phrases to ensure fluent and professional communication.
The results are presented comprehensibly and supported by valid statistical analyses, and the conclusions are consistent with the data-backed arguments.

Comments on the Quality of English Language

From a linguistic perspective, the text is coherent and logical, but I recommend improving readability by using more concise phrases to ensure fluent and professional communication.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Manuscript ID jtaer-3943142-peer-review-v1, entitled "An Intelligent Multimodal Deep Learning Framework for Automated Website Usability Evaluation", has been reviewed.

The following suggestions are provided for the author's reference, in the hope of improving the manuscript.

  1. It is recommended that "Fashion E-commerce Websites" be appropriately added to the title and keywords to better reference the research application area and enhance indexing effectiveness.

  2. The abstract adequately covers the study's purpose, methods, main results, and contributions. It is recommended that the conclusion be simplified, focusing on the core results and scope of application for quick understanding.

Additionally, quantitative information, such as sample size and experimental control conditions, can be provided.

  3. The introduction fully discusses the context of digitization and website diversity, the research motivation for user experience and commercial competition, and the limitations of previous methods. It then defines core concepts from an ISO 9241-11 and user-centric perspective, extending the discussion to the potential of automation and deep learning in usability evaluation.

It is recommended that a paragraph be added to clarify the rationale and representativeness of the statement "This study focuses on fashion websites."

  4. The literature review systematically reviews website usability evaluation methods across various fields, spanning heuristic evaluation, user testing, data envelopment analysis, and machine learning automation.

The paper emphasizes that existing solutions are often limited to a single data source or modality, lacking an integrated framework combining numerical and image-based (multimodal) deep learning.

It is recommended to more clearly summarize the commonalities and differences across research topics in different fields to strengthen the connection between research motivations.

The literature section cites recent authoritative journals to enhance credibility. It is recommended that methods, industry types, and core indicators be annotated in tabular format for easier reading.

  5. Materials and Methods / Methodology: The paper clarifies the specific criteria for website usability evaluation, data collection, and the detailed processes. The two fusion strategies (Early Fusion and Late Fusion) for the multimodal deep learning models are transparently structured, with parameter selection, training, and validation broken down, supporting high experimental reproducibility.

A clearer presentation of each model layer's architecture, including hierarchical and dimensional design, would facilitate replication. A flowchart combining data, model processes, and evaluation methods could be developed to enhance the visual presentation standards of artificial intelligence papers.

  6. Results: The paper objectively presents clear comparative performance data of different CNN backbone models under the fusion model, using metrics such as accuracy and F1 score. The results highlight the superior performance of early fusion.

  7. The discussion section explains the improvement of the multimodal model over the unimodal method, the effectiveness of the fusion strategy, and its impact on fashion e-commerce and other industries.

Suggestions: To enhance the response to the literature, it is recommended to directly compare the data of relevant papers in recent years in each section. Two subsections, "Theoretical Implications" and "Industrial Applications", can be added, each focusing on theoretical innovation and practical implementation.

  8. The conclusion section reiterates how this study solves the research problem and emphasizes the practical value of the proposed data-driven usability improvement framework. It also provides policy recommendations for commercial applications.

It is recommended that this section be more concise, focusing on results, limitations, and future work.

Overall suggestions: (1) Adhere to objective description and factual citation, avoiding subjective interpretation in the results section. (2) Supplement with charts or flowcharts, in line with MDPI style and the trend toward information visualization in international journals. (3) References should consistently include DOI, issue number, and page numbers to improve searchability.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

1. Insufficient Research Background and Literature Review
The paper states that "there is currently no DL-based multimodal framework for website usability evaluation", but its review of research on multimodal evaluation in related fields (such as human-computer interaction and web page quality evaluation) is incomplete. For instance, it fails to compare image-text multimodal evaluation (e.g., web page aesthetics combined with user reviews) and the application of cross-modal attention mechanisms in similar tasks, making it impossible to clearly define the innovative boundaries of this research. It is recommended to supplement comparisons of key literature in the field of multimodal evaluation over the past three years and clarify the differentiated contributions of this framework.
2. Lack of Representativeness and Screening Criteria for the Dataset
The dataset only includes 300 fashion websites, with a small sample size and unclear screening logic: it does not specify the source of the websites (e.g., regional distribution, market scale, traffic level) or whether it covers extreme cases of "high/low usability", which may lead to sample bias. It is suggested to supplement the specific screening process of the dataset and statistics on basic attributes of the websites (such as region and average daily traffic), and consider expanding the sample size to more than 500 to improve statistical significance.
3. Unverified Reliability of the Visual Modality Annotation Tool
The WebScore AI tool is used to generate visual modality scores during data annotation, but no evidence for the effectiveness of this tool is provided: there is no consistency comparison between this tool and manual annotation (e.g., Kappa coefficient), nor any citation of previous validation studies on this tool, making it impossible to ensure the credibility of visual modality annotation. It is recommended to supplement a comparative experiment between WebScore AI and annotations by 3 HCI experts, or provide a reliability report released by the tool developers.
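The tool-versus-expert agreement the reviewer asks for could be quantified with Cohen's kappa; a minimal sketch, with hypothetical usability labels standing in for real annotations:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    pe = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (po - pe) / (1 - pe)

# Hypothetical usability classes from the tool and one HCI expert:
tool   = ["high", "mid", "mid", "low", "high", "low"]
expert = ["high", "mid", "low", "low", "high", "mid"]
print(round(cohens_kappa(tool, expert), 3))
```

With three experts, pairwise kappas (or Fleiss' kappa) against the tool would give the consistency evidence requested.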
4. Lack of Support for the Rationality of Model Architecture Design
In early fusion, only 6 traditional architectures (e.g., ResNet152, EfficientNetB0) are selected as the CNN backbone, while modern architectures suitable for web page screenshots (e.g., Vision Transformer, ViT) are not considered; moreover, the selection of the 128-dimensional vector after feature extraction lacks a basis (e.g., no ablation experiment has been conducted to verify the performance differences among 64/128/256 dimensions). It is suggested to supplement comparative experiments on architectures such as ViT and an analysis of the rationality of feature dimension selection to enhance the persuasiveness of the architecture design.
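The dimension ablation requested here could follow a simple loop of this shape; `train_and_eval` and the accuracies it returns are placeholders for the actual fusion-model retraining runs, not measured results:

```python
# Sketch of a feature-dimension ablation loop; `train_and_eval` is a
# hypothetical stand-in for retraining the fusion model at each width.
def train_and_eval(feature_dim: int) -> float:
    # Placeholder returning illustrative (not measured) validation accuracies.
    illustrative = {64: 0.71, 128: 0.76, 256: 0.74}
    return illustrative[feature_dim]

results = {dim: train_and_eval(dim) for dim in (64, 128, 256)}
best_dim = max(results, key=results.get)
print(best_dim, results[best_dim])
```

Reporting the full `results` table, rather than only the chosen dimension, is what would justify the 128-dimensional design choice.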
5. Incomplete Baseline Setup for Comparative Experiments
The experiments only compare "unimodal models" and "the multimodal model in this study", without including the latest multimodal baseline models (e.g., Transformer-based multimodal evaluation frameworks, cross-modal fusion web page quality models) or comparing with commonly used industrial tools (e.g., the extended evaluation module of Google Lighthouse), making it impossible to fully demonstrate the superiority of this framework. It is recommended to supplement comparisons with the latest baseline models and clarify the position of this research in the performance ranking.
6. Lack of Mechanistic Explanation in Result Analysis
In early fusion, the accuracy (0.88) of EfficientNetB0 is significantly higher than that of the other architectures, but the internal mechanism behind this performance advantage is not analyzed: for example, whether its compound scaling of depth, width, and resolution is better suited to web page layouts (e.g., multi-module typesetting) and visual features such as color contrast. Presenting data without mechanistic analysis reduces the academic depth of the results. It is suggested to explain the suitability of EfficientNetB0 in relation to the specific characteristics of web page visual features, for example by visualizing feature heatmaps to show the key regions it focuses on.
7. Lack of Experimental Verification for Cross-Domain Generalization
The paper claims that the framework can be extended to fields such as e-learning and government websites, but it has only been verified in the fashion field, and no preliminary cross-domain experiments have been conducted: for example, using data from 20-30 e-learning websites to test "changes in model performance after adjusting domain-specific indicators (e.g., video loading time)", resulting in the generalization being only a theoretical speculation. It is recommended to supplement small-scale cross-domain verification experiments to prove the feasibility of framework adjustment.
8. Lack of Model Interpretability
As an evaluation framework for practical applications, it does not adopt explainable AI (XAI) technologies (e.g., SHAP, LIME) to analyze "key influencing factors": for example, the contribution weight of indicators such as loading time and layout coherence to usability scores, making it impossible to provide designers with specific optimization directions (e.g., "usability scores decrease by 15% when loading time > 0.5s"). It is recommended to supplement XAI analysis to quantify the influence degree of key features and enhance the application value.
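As one illustration of the idea behind such feature-attribution analyses (not the paper's method, and simpler than SHAP or LIME), feature influence can be estimated with permutation importance; in this sketch the predictor, metric, and feature names are all hypothetical:

```python
import random

def permutation_importance(metric_fn, rows, feature, trials=50, seed=0):
    """Mean drop in `metric_fn` when one feature column is shuffled.

    `metric_fn(rows)` scores predictions against labels; higher is better.
    """
    rng = random.Random(seed)
    baseline = metric_fn(rows)
    total_drop = 0.0
    for _ in range(trials):
        col = [r[feature] for r in rows]
        rng.shuffle(col)  # break the feature-label pairing for this column only
        shuffled = [{**r, feature: v} for r, v in zip(rows, col)]
        total_drop += baseline - metric_fn(shuffled)
    return total_drop / trials

# Hypothetical setup: the "model" predicts usability from load time only,
# and the metric is negative mean absolute error against true scores.
def predict(r):
    return 1.0 - r["load_time_s"]

def neg_mae(rows):
    return -sum(abs(predict(r) - r["true_score"]) for r in rows) / len(rows)

rng = random.Random(42)
rows = []
for _ in range(200):
    t = rng.random()
    rows.append({"load_time_s": t, "font_px": rng.choice([12, 14, 16]),
                 "true_score": 1.0 - t})

print(permutation_importance(neg_mae, rows, "load_time_s") > 0)  # informative feature
print(permutation_importance(neg_mae, rows, "font_px") == 0.0)   # ignored by this model
```

Ranking features by this drop would give designers the concrete optimization directions the reviewer asks for.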
9. Insufficient Analysis of Computational Efficiency
The paper mentions that "early fusion is computationally complex", but it does not quantitatively evaluate the computational cost of the model: there is no comparative data on training time (e.g., time consumption per epoch) and inference speed (e.g., time consumption for evaluating a single website) (early fusion vs. late fusion, this model vs. baseline models), while computational efficiency is a key indicator for real-time evaluation of large-scale websites. It is recommended to supplement computational efficiency experiments under the same hardware environment to provide a basis for practical deployment.
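A minimal per-website latency harness along these lines could look as follows, with `dummy_model` as a hypothetical stand-in for one model inference; the same harness run on early fusion, late fusion, and each baseline under identical hardware would yield the requested comparison:

```python
import time

def time_inference(predict_fn, inputs, warmup=3, repeats=20):
    """Median per-item latency of `predict_fn` over `inputs`, in milliseconds."""
    for _ in range(warmup):           # warm-up passes excluded from timing
        for x in inputs:
            predict_fn(x)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            predict_fn(x)
        elapsed = time.perf_counter() - start
        samples.append(1000.0 * elapsed / len(inputs))
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to scheduler noise

# Placeholder standing in for evaluating a single website:
def dummy_model(x):
    return sum(i * i for i in range(200))

latency_ms = time_inference(dummy_model, list(range(32)))
print(f"median per-website latency: {latency_ms:.4f} ms")
```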

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have revised the manuscript; I have no further concerns.

Author Response

Thank you for this helpful comment.

The manuscript has now been professionally edited to improve the English and to more clearly express the research.