Investigating the Role of Logistics Delivery Services in Shaping Customer Satisfaction: LLM-Aspect-Based Sentiment Analysis of Perceived Quality in Indonesian E-Commerce
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors study how perceived quality influences online customer satisfaction using review data from Tokopedia.com and Google Gemma 2. This is an interesting and timely paper, particularly in its identification of delivery as a key factor of customer satisfaction and in its use of large-scale online review data with LLM-based text analysis. Below I provide suggestions for strengthening the paper.
Theory:
- Clarify your contribution. At present, the paper mainly tests an existing theory in the Indonesian online market. Beyond this replication, what new theoretical contributions are you offering? For example, are you the first to identify delivery as a central component of quality perception? You also mention testing current theories in diverse cultural and economic environments. Did you test any cultural or economic factors that might interact with perceived quality? This would add substantial theoretical value.
- Factor selection. You propose eight aspects of quality. Why were these chosen over others? Why is delivery treated as a moderator? Why eight and not ten or twenty? A stronger theoretical rationale is needed. Consider adding a conceptual map or figure to visually summarize the relationships among factors.
Methods:
- Data scraping. Why did you choose Tokopedia.com specifically, and why restrict the analysis to five smartphone brands? The decision to limit reviews to 5-40 words seems arbitrary and should be justified. Similarly, you mention making “adjustments” during sample selection to “balance” across ratings. Please specify what these adjustments were. Finally, the time range of the scraped data is not described and should be reported. I’ve added one paper below that might be useful for you when reporting online data scraping.
- LLM ratings. Why were only the top two aspects analyzed instead of all eight? How did you prompt each aspect for Google Gemma 2, and how did you account for sensitivity to prompt framing, since small wording changes can shift results? Prompts are important for LLMs, so please describe them in detail. I’ve added one paper below that might be useful for you when reporting LLM analyses.
- Dependent variable. Product ratings ranged from 1–4 but were converted into a binary variable (satisfied vs. not satisfied) for logistic regression. Please justify this choice, as dichotomizing continuous variables can reduce precision.
- Aspect prevalence. Delivery services, as the most frequently mentioned aspect, were mentioned in only about 35% of reviews. What were the most frequently mentioned terms overall? Should they be included in your analysis for completeness?
Last, much of the important information is buried in dense text. A professional copy editor could help improve readability. Typo on page 7: “It that allows”. Good luck to the authors!
References for consideration
Boegershausen, J., Datta, H., Borah, A., & Stephen, A. T. (2022). Fields of gold: Scraping web data for marketing insights. Journal of Marketing, 86(5), 1-20.
Hewitt, L., Ashokkumar, A., Ghezae, I., & Willer, R. (2024). Predicting results of social science experiments using large language models [Working paper].
Author Response
Comments 1: Clarify your contribution. At present, the paper mainly tests an existing theory in the Indonesian online market. Beyond this replication, what new theoretical contributions are you offering? For example, are you the first to identify delivery as a central component of quality perception?
Response 1: Thank you for pointing this out. We agree with this comment and have added explanatory paragraphs to address it.
The manuscript has been updated on page 3, lines 122-141.
“While the Expectancy-Confirmation Theory (ECT) provides a robust framework for understanding satisfaction, its application in digital marketplaces, particularly in Indonesia as an emerging economy, remains underspecified. Traditional models of perceived quality, often developed in offline or Western contexts, have heavily emphasized intrinsic product attributes (e.g., functionality, durability) [20]. However, the e-commerce environment, characterized by physical separation between buyer and seller, necessitates a broader conceptualization.
This study posits that in online retail, perceived quality is a dual-dimensional construct comprising Perceived Product Quality (intrinsic attributes like Functionality and Originality) and Perceived Service Quality (extrinsic attributes related to the transaction and fulfillment, such as Logistics Delivery, Packaging, and Responsiveness). We argue that in emerging markets like Indonesia, where logistical infrastructure and trust in online transactions are still evolving, the Service-Centric dimension may carry disproportionate weight. Furthermore, we theorize that Logistics Delivery Service is not merely a direct antecedent of satisfaction but acts as a critical moderating 'gateway'. A positive delivery experience validates the entire online transaction, thereby amplifying the perceived value of both product and other service attributes. By testing this expanded model, we contribute to theory by refining the ECT framework for the digital age and identifying the unique structural relationships between quality dimensions in an emerging market context.”
---
Comments 2: You also mention testing current theories in diverse cultural and economic environments. Did you test any cultural or economic factors that might interact with perceived quality? This would add substantial theoretical value.
Response 2: Thank you for pointing this out. The reviewer is correct: we did not explicitly test cultural factors. However, we theorize that our findings might be particular to an emerging market like Indonesia, and we have added an explanation to the discussion to make this clear.
The manuscript has been updated on pages 21-22, lines 720-732.
“Our findings, particularly the dominance of Logistics Delivery Services and Packaging, may be uniquely pronounced within the Indonesian context, an archetypal emerging market. Culturally, the high-context and relationship-oriented nature of Indonesian society (gotong royong) may translate into a heightened expectation for reliable and personal service, even in digital transactions. Economically, challenges in logistics infrastructure across the archipelago make reliable delivery a salient and noteworthy achievement for consumers, unlike in mature markets where it is often a baseline expectation. Furthermore, the strong influence of Packaging could be linked to the importance of unboxing experiences and gift-giving in social culture, as well as a need for tangible reassurance against product damage during potentially longer and less predictable delivery journeys. While this study does not directly measure these cultural and economic variables, it lays the groundwork for future cross-cultural comparative research to formally test these contextual influences.”
---
Comments 3: Factor selection. You propose eight aspects of quality. Why were these chosen over others? Why is delivery treated as a moderator? Why eight and not ten or twenty? A stronger theoretical rationale is needed. Consider adding a conceptual map or figure to visually summarize the relationships among factors.
Response 3: Thank you for pointing this out. We agree with this comment. We have added an explanation to the hypothesis development section and added a conceptual map to the manuscript.
The manuscript has been updated on page 4, lines 165-196.
“Drawing from a preliminary analysis of Indonesian online review discourse, we identify eight distinct aspects of perceived quality. These are categorized into two groups to provide a clearer theoretical structure:
A. Perceived Product Quality:
a. Functionality: The core performance and features of the product.
b. Originality: The authenticity and brand assurance, critical in markets with counterfeit concerns.
c. Price: The perceived value and fairness of the cost.
B. Perceived Service Quality:
a. Logistics Delivery Service: The fulfillment process, including speed, reliability, and condition upon arrival.
b. Packaging: The protective and experiential element of product receipt.
c. Responsiveness: The seller's communication and customer service pre- and post-purchase.
d. Warranty: The post-purchase security and guarantee.
e. Promotion: The incentives and deals offered at the point of sale.
We posit Logistics Delivery Service as a moderator based on its unique position in the customer journey. It is the final and most tangible touchpoint that culminates the online transaction. A positive delivery experience can create a 'halo effect,' reinforcing the value of the product and other services. Conversely, a negative delivery experience can negate positive perceptions of product functionality or seller responsiveness, as the customer cannot fully enjoy the product until it is successfully delivered. Therefore, we hypothesize that Logistics Delivery Service moderates the relationship between other perceived quality aspects and satisfaction, serving as a crucial reinforcing or mitigating factor. The proposed conceptual model illustrating the direct effects of perceived product quality and perceived service quality aspects on customer satisfaction is presented in Figure 1, and the hypothesized moderating effects of Logistics Delivery Service are presented in Figure 2.”
Figure 1. Conceptual Model of Perceived Quality and Customer Satisfaction in E-commerce
Figure 2. Conceptual Model of Logistics Delivery Service Moderation on Perceived Quality and Customer Satisfaction in E-commerce
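For readers without access to the figures, the moderation hypothesis described in Response 3 can be sketched as a logistic model with interaction terms. This is only an illustrative form under assumed notation, not the authors' reported specification: x_k stands for the seven non-delivery aspects, d for Logistics Delivery Service, and the coefficients gamma_k carry the hypothesized moderating effects.

```latex
\operatorname{logit}\,\Pr(\text{satisfied} = 1)
  = \beta_0
  + \sum_{k=1}^{7} \beta_k x_k
  + \beta_D\, d
  + \sum_{k=1}^{7} \gamma_k \,(x_k \times d)
```

Under this form, a nonzero gamma_k would mean that the effect of aspect k on the log-odds of satisfaction depends on the delivery experience.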
---
Comments 4: Methods: Data scraping. Why did you choose Tokopedia.com specifically, and why restrict the analysis to five smartphone brands? The decision to limit reviews to 5-40 words seems arbitrary and should be justified. Similarly, you mention making “adjustments” during sample selection to “balance” across ratings. Please specify what these adjustments were. Finally, the time range of the scraped data is not described and should be reported. I’ve added one paper below that might be useful for you when reporting online data scraping.
Response 4: Thank you for pointing this out. We agree with this comment. We have refined the description of data collection in Subsection 2.2.1.
The manuscript has been updated on pages 8-9, lines 310-352.
2.2.1. Online Reviews Data Sources
“The study acquired data through web scraping from the official stores of various smartphone brands on Tokopedia.com. Tokopedia was selected as the data source for several reasons. First, it is one of the two largest e-commerce platforms in Indonesia by market share and user base [51], ensuring the data reflects a significant portion of the Indonesian online consumer population. Second, as a local Indonesian platform, it provides a more authentic view of domestic consumer behavior compared to global platforms.
Data collection focused on the five smartphone brands with the largest market share in Indonesia: Infinix, Oppo, Samsung, Vivo, and Xiaomi [52]. This selection was strategic, aiming to capture a representative spectrum of the market. It includes Samsung as the global premium leader; the dominant Chinese mid-range brands Oppo, Vivo, and Xiaomi, which collectively hold a majority market share in Indonesia; and the budget-oriented brand Infinix, representing the important entry-level segment. This mix ensures our analysis covers the primary price and brand perception tiers relevant to Indonesian consumers, rather than being limited to a single segment. Apple was excluded because it has no official store on Tokopedia and its own online store does not collect consumer reviews. Other brands were excluded because their official Tokopedia stores have too few reviews.
The raw data was initially filtered by word count. A minimum limit of 5 words was applied to exclude short and non-substantive reviews (e.g., 'good,' 'thanks,' 'okay') that lack the descriptive content needed for aspect-based sentiment analysis. A maximum limit of 40 words was set to focus the LLM analysis on concise, aspect-specific feedback, avoiding very long, narrative-style reviews that often contain multiple topics and are more complex to classify accurately. This range was chosen to capture reviews with sufficient detail while maintaining a focus on clear, primary customer assertions.
From this filtered population, a final sample of 5,000 reviews was selected. To ensure this sample was representative of the underlying review ecosystem and to prevent bias from an overrepresentation of any single group, we employed a stratified random sampling technique. The population was stratified based on three key variables:
a. Brand: To ensure proportional representation from each of the five selected brands.
b. Star Rating: To include a balanced mix of positive (4-5 stars), neutral (3 stars), and negative (1-2 stars) reviews, as an overabundance of positive reviews is common on e-commerce platforms.
c. Time of Review Submission: To cover a recent and relevant period, specifically reviews posted between August 2023 and November 2024. This window of roughly 16 months captures contemporary consumer sentiments while minimizing the impact of outdated product models or service practices.
The proportional allocation for each stratum (brand/rating combination) was calculated based on its share of the total filtered review population. This method ensures our sample is balanced and enhances the generalizability of our findings within the defined context of Indonesian smartphone purchases on Tokopedia.
Moreover, our data collection and reporting methodology aligns with best practices for scientific research using online review data, as outlined in studies like [53], emphasizing transparency in platform selection, time frames, and sampling procedures.”
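The filtering and sampling steps quoted above can be illustrated with a minimal Python sketch. It is not the authors' code. The record fields (`brand`, `stars`, `text`), the whitespace-based word count, and the function names are assumptions made for illustration; only the 5-40 word limits, the rating groups, and the proportional brand/rating allocation come from the quoted text.

```python
import random
from collections import defaultdict

MIN_WORDS, MAX_WORDS = 5, 40  # word-count limits stated in the quoted text


def rating_group(stars):
    """Map a 1-5 star rating to the groups named in the quoted text."""
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


def filter_by_length(reviews):
    """Keep reviews whose whitespace-delimited word count is within the limits."""
    return [r for r in reviews
            if MIN_WORDS <= len(r["text"].split()) <= MAX_WORDS]


def stratified_sample(reviews, n_total, seed=0):
    """Draw about n_total reviews, giving each brand/rating stratum a share
    proportional to its size in the filtered population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in reviews:
        strata[(r["brand"], rating_group(r["stars"]))].append(r)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can shift the total slightly.
        k = round(n_total * len(members) / len(reviews))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Because each stratum's allocation is rounded independently, the drawn total can differ from `n_total` by a few reviews. A production version would need a rule for distributing the remainder.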
---
Comments 5: LLM ratings. Why were only the top two aspects analyzed instead of all eight? How did you prompt each aspect for Google Gemma 2, and how did you account for sensitivity to prompt framing, since small wording changes can shift results? Prompts are important for LLMs, so please describe them in detail. I’ve added one paper below that might be useful for you when reporting LLM analyses.
Response 5: We thank the reviewer for this critical methodological point. We have now thoroughly revised Section 2.2.2 to provide a detailed account of our LLM-ABSA procedure. Specifically: We clarify that the model was prompted to identify all eight aspects, but we retained the two aspects with the highest model-assigned probability for analysis, a choice we have now justified based on aspect salience and noise reduction. We have included a detailed description of our structured prompting strategy.
As suggested, and to ensure the manuscript remains readable, we have included the complete, verbatim prompt in Appendix A for full transparency and reproducibility. The main text now references this appendix.
Additionally, we have added a crucial clarification in Section 2.2.2 regarding the nature of our data. We note that the average review length is 11 words, which empirically supports our decision to focus on the top two aspects, as the reviews are inherently concise and typically contain mentions of only one or two primary aspects. This demonstrates that our methodological choice was data-driven and appropriate for the text corpus we analyzed.
The manuscript has been updated on page 9, lines 354-382.
“2.2.2. LLM-ABSA: Text Classification Model for Perceived Quality Aspects
This study utilizes aspect-based sentiment analysis (ABSA) as a text classification method to understand the thematic content of consumer reviews. An LLM is employed for this purpose, specifically the Google Gemma 2 model [54]. Gemma 2 is designed to balance strong performance with cost-efficiency in LLM applications [55]. In this study, the Gemma 2 model is applied using the Scikit-LLM Python package, which integrates large language models with Scikit-learn to support text classification analysis [56,57].
We employed a structured, zero-shot prompting approach to mitigate the known sensitivity of LLM outputs to prompt framing [58]. The model was instructed to identify all aspects present in a review from our predefined list and to assign a sentiment and a probability score to each. The complete, verbatim prompt is provided in Appendix A to ensure full reproducibility.
The LLM-generated JSON output was parsed programmatically. While the model identified all applicable aspects, we implemented a selection rule to focus on the most salient customer feedback: for each review, we retained only the two aspects with the highest assigned probability scores. This decision is methodologically justified for several reasons:
• Data-Driven Salience: Our data consists of concise reviews, with an average length of 11 words. In such short texts, customers typically focus on one or two primary concerns. Our rule prioritizes these salient aspects, which are the most likely drivers of their satisfaction.
• Reduction of Noise: It minimizes the inclusion of minor, tangential, or weakly implied mentions that could add noise to the statistical model, a critical consideration with short-text data.
• Cognitive Plausibility: It aligns with the finding that consumers, especially in quick online reviews, focus on a limited number of key factors when evaluating a product experience.
The algorithm for this process is summarized in Figure 3, and Figure 4 provides an instance of the model’s output for a sample review.”
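The selection rule described in Section 2.2.2 (keep the two aspects with the highest probability scores) can be sketched as follows. The JSON layout used here, an `aspects` list with `aspect`, `sentiment`, and `probability` keys, is an assumption for illustration; the actual output schema is defined by the prompt in Appendix A, which is not reproduced in this record.

```python
import json


def top_two_aspects(llm_json, k=2):
    """Parse one review's model output and keep the k highest-probability aspects.

    The schema ({"aspects": [{"aspect", "sentiment", "probability"}, ...]})
    is hypothetical; adapt the keys to the real prompt output.
    """
    aspects = json.loads(llm_json)["aspects"]
    ranked = sorted(aspects, key=lambda a: a["probability"], reverse=True)
    return ranked[:k]


# Hypothetical model output for a single short review.
example_output = json.dumps({
    "aspects": [
        {"aspect": "Price", "sentiment": "neutral", "probability": 0.22},
        {"aspect": "Logistics Delivery Service", "sentiment": "positive", "probability": 0.91},
        {"aspect": "Packaging", "sentiment": "positive", "probability": 0.74},
    ]
})
kept = top_two_aspects(example_output)
```

In this example `kept` holds the Logistics Delivery Service and Packaging entries, and the low-probability Price mention is dropped.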
---
Comments 6: Dependent variable. Product ratings ranged from 1–4 but were converted into a binary variable (satisfied vs. not satisfied) for logistic regression. Please justify this choice, as dichotomizing continuous variables can reduce precision.
Response 6: We thank the reviewer for this important methodological observation. We have now added a detailed justification for dichotomizing the rating variable in Section 2.2.3. Our approach is grounded in:
• Theoretical Alignment: It directly mirrors the binary outcome (satisfaction/dissatisfaction) of the Expectancy-Confirmation Theory that underpins our study.
• Empirical Reality in E-commerce: Online rating distributions are highly non-linear. A rating of 3 is often used to signal a failed expectation, not neutral satisfaction. Therefore, the critical threshold lies between ratings 3 and 4, making a binary classification (4-5 vs. 1-3) a meaningful and interpretable representation of the satisfaction construct in this context.
• Interpretability: The binary logistic model yields odds ratios, which provide clear, actionable insights for practitioners on how perceived quality aspects influence the likelihood of a customer being satisfied.
We acknowledge the potential loss of precision and have noted this as a consideration for future research in the revised manuscript.
The manuscript has been updated on page 11, lines 404-420.
“The decision to dichotomize the 1–5-star rating scale was based on both theoretical and empirical considerations. Theoretically, it aligns with the fundamental tenet of Expectancy-Confirmation Theory, where the outcome is a binary state of satisfaction (positive disconfirmation) or dissatisfaction (negative disconfirmation). Empirically, the distribution of online ratings is often J-shaped, heavily skewed towards 5 stars, with 1-star and 4-star reviews being the most informative. In such a context, the psychological difference between a 4 (good) and a 5 (excellent) is less critical to our research question than the fundamental distinction between a positive experience (4,5) and a non-positive one (1,2,3). A rating of 3 is frequently considered "neutral" or "met minimum expectations" rather than "satisfied," and is often used by consumers to express mild disappointment. Therefore, grouping 1-3 as "not satisfied" captures the meaningful threshold between experiences that met or exceeded expectations versus those that did not.
We acknowledge that dichotomization can reduce statistical precision. However, for the purpose of this study, which aims to identify the key drivers that push a consumer across the critical threshold from a non-positive to a positive overall evaluation, a binary logistic model provides clear, interpretable results in the form of odds ratios, which are highly actionable for managers.”
The manuscript has been updated on page 24, lines 837-840.
“Furthermore, the dichotomization of the satisfaction variable, while theoretically and empirically justified, may have reduced some statistical precision. Future research could employ ordinal logistic regression or sentiment analysis of the review text itself to create a more nuanced continuous measure of satisfaction.”
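The dichotomization and the odds-ratio interpretation discussed in Response 6 can be made concrete with a small sketch. It is an illustration under simplifying assumptions, not the authors' model. It treats a single binary predictor (whether a review positively mentions an aspect) and computes the odds ratio from a 2x2 table. For one binary predictor and no covariates, this is the same quantity that exponentiating the logistic regression coefficient would give. The function names and input format are hypothetical.

```python
def is_satisfied(stars):
    """Ratings 4-5 count as satisfied (1); ratings 1-3 as not satisfied (0)."""
    return 1 if stars >= 4 else 0


def odds_ratio(rows):
    """Odds ratio of satisfaction for reviews with vs. without a positive
    mention of an aspect. rows is an iterable of (stars, mentioned) pairs."""
    counts = {(m, s): 0 for m in (0, 1) for s in (0, 1)}
    for stars, mentioned in rows:
        counts[(int(bool(mentioned)), is_satisfied(stars))] += 1
    a, b = counts[(1, 1)], counts[(1, 0)]  # mentioned: satisfied / not
    c, d = counts[(0, 1)], counts[(0, 0)]  # not mentioned: satisfied / not
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Synthetic counts for illustration only; these are not the study's data.
rows = ([(5, True)] * 30 + [(2, True)] * 10
        + [(4, False)] * 20 + [(3, False)] * 20)
example_or = odds_ratio(rows)
```

With these synthetic counts the odds ratio is (30 x 20) / (10 x 20) = 3, meaning the odds of a 4-5 star rating are three times higher when the aspect is positively mentioned. The authors' full model includes several aspects and interaction terms, so its odds ratios are adjusted for the other predictors.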
---
Comments 7: Aspect prevalence. Delivery services, as the most frequently mentioned aspect, were mentioned in only about 35% of reviews. What were the most frequently mentioned terms overall? Should they be included in your analysis for completeness?
Response 7: We thank the reviewer for this insightful comment, which has allowed us to better contextualize our findings.
We fully agree that the 34.18% mention rate for Logistics Delivery should be interpreted in the context of the overall distribution of aspects. As now clarified, the key insight from Table 3 is that our eight pre-defined aspects are highly comprehensive, accounting for 98.52% of all primary aspect mentions.
The remaining 1.48% categorized as "Other factors" confirms that no major, recurring theme was missing from our initial framework. Therefore, the 34.18% figure for DEL accurately reflects its status as the most salient single theme in a consumer discourse that is naturally distributed across multiple quality dimensions.
This distribution underscores the multifaceted nature of perceived quality, where no single aspect dominates a majority of conversations. Instead, consumer evaluation is built on a combination of pillars, with Logistics Delivery, Functionality, and Originality being the three most prominent.
The manuscript has been updated on page 12, lines 456-462.
“The 34.18% mention rate for DEL signifies its salient role as the single most discussed theme. The remaining reviews are dominated by discussions of core product attributes, with Functionality and Originality being the next most prominent aspects. The fact that no single aspect dominates a majority of conversations reflects the multifaceted nature of perceived quality; consumers evaluate their experience based on a combination of product-centric and service-centric attributes, with DEL, FUN, and ORI representing the three primary pillars of evaluation in this domain.”
---
Comments 8: Last, much of the important information is buried in dense text. A professional copy editor could help improve readability. Typo on page 7: “It that allows”. Good luck to the authors!
Response 8: We sincerely thank the reviewer for this constructive feedback and for the good wishes. We fully agree regarding readability. We have refined the paragraphs and undertaken a thorough revision of the manuscript with a focus on enhancing clarity.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper innovatively combines LLM-ABSA and logistic regression to reveal the direct and moderating effects of perceived quality aspects, particularly logistics delivery services and packaging, on customer satisfaction. The following points are provided for revision reference:
1. The study focuses only on five smartphone brands in the Indonesian market, which may limit the representativeness of the sample. The findings may not be generalizable to other product categories or markets with different cultural contexts (e.g., developed countries). Please add some explanations and clarifications.
2. Although the LLM (Gemma 2) performs well in terms of efficiency, AI models may not fully capture subtle contextual and emotional expressions in reviews. For example, sarcasm or culture-specific language could lead to classification biases. The paper suggests combining manual coding or hybrid methods to improve accuracy.
3. The research relies on subjective review data and does not incorporate objective service performance indicators (e.g., actual delivery time, return rates). This may result in a disconnect between perceived quality measurements and actual service quality. Future studies could integrate objective metrics such as logistics tracking data to validate the results.
4. While the moderating effect of logistics delivery services was examined, other potential moderating variables (e.g., consumer demographics or platform types) were not thoroughly explored. Additionally, Promotion was not significant in the standalone model but showed significance in the full model, indicating its complex role—a phenomenon that was not sufficiently explained in the paper.
5. The logistic regression model assumes a linear relationship, but in reality, the relationship between perceived quality and satisfaction may be non-linear. The paper did not attempt to compare other machine learning models (e.g., decision trees or neural networks), which may limit multi-angle validation of the findings.
6. Future research should expand the data sample, enhance contextual understanding in text analysis, and integrate multi-source data to improve reliability.
7. These findings provide theoretical support for e-commerce service strategy optimization but need to be validated in broader scenarios. Some explanations regarding this aspect can be added.
Author Response
Comment 1: The study focuses only on five smartphone brands in the Indonesian market, which may limit the representativeness of the sample. The findings may not be generalizable to other product categories or markets with different cultural contexts (e.g., developed countries). Please add some explanations and clarifications.
Response 1:
We thank the reviewer for this important comment regarding the generalizability of our findings. We fully agree that the specific results from our study of five smartphone brands in Indonesia may not directly translate to other product categories or cultural contexts.
In response, we have added a dedicated paragraph in the Discussion section (Section 4) that explicitly addresses this point. We now clarify that the salience of aspects like Logistics Delivery is likely category-dependent and context-specific, shaped by Indonesia's status as an emerging market.
We have reframed our primary contribution to emphasize the methodological innovation (the LLM-ABSA framework for perceived quality) and the theoretical insight that service-centric aspects can be as critical as product-centric ones in online retail, rather than claiming a universal hierarchy of aspects.
Furthermore, we have strengthened the Conclusion section (Section 5) to explicitly list this as a limitation and to propose future research that applies our methodology to other categories and countries, which would allow for valuable cross-cultural comparisons.
We believe these clarifications appropriately contextualize our findings while underscoring their significance within the studied domain and their value as a blueprint for future research.
The manuscript has been updated on page 21, lines 733-748:
“While this study provides detailed insights into the Indonesian smartphone market, the generalizability of the specific findings to other product categories or cultural contexts requires careful consideration. The prominence of aspects like Logistics Delivery and Packaging is likely amplified in the smartphone category, which consists of high-value, electronic items that are sensitive to shipping and handling and where authenticity is a paramount concern. Conversely, for low-involvement, commoditized products (e.g., office supplies, dry goods), these service aspects may be less salient than price. Furthermore, the consumer priorities identified here are shaped by the Indonesian context, an emerging market with specific logistical infrastructure and cultural norms. The relative importance of aspects might differ in developed economies where next-day delivery is standardized and trust in online transactions is higher. Therefore, we caution against directly extrapolating our results without further validation. The primary contribution of this study is not to present a universal hierarchy of quality aspects, but to demonstrate the methodology for uncovering such a hierarchy and to validate the critical, and often underestimated, role of service-centric qualities like logistics within a specific and important market segment.”
The manuscript has been updated on page 23, lines 810-816:
“Building upon the findings and limitations of this study, we propose a multi-faceted agenda for future research to advance the understanding of perceived quality in e-commerce. Future studies should expand the data sample to include a wider variety of product categories (e.g., fashion, groceries, durable goods) and geographic markets, including both emerging and developed economies. This would test the generalizability of our findings and allow for cross-cultural comparisons of perceived quality drivers.”
Comment 2:
Although the LLM (Gemma 2) performs well in terms of efficiency, AI models may not fully capture subtle contextual and emotional expressions in reviews. For example, sarcasm or culture-specific language could lead to classification biases. The paper suggests combining manual coding or hybrid methods to improve accuracy.
Response 2:
We thank the reviewer for this critical methodological insight. We completely agree that LLMs may struggle with sarcasm, cultural nuance, and other subtle linguistic features, which is a recognized challenge in the field of NLP.
We have explicitly acknowledged this as a key limitation in the Conclusion (Section 5). We provide a concrete example of how sarcasm in Indonesian reviews could lead to misclassification and directly cite the reviewer's suggestion of using hybrid methods (e.g., manual coding for fine-tuning) as a valuable direction for future research.
We believe these additions demonstrate a rigorous and self-critical approach to our methodology, acknowledging its boundaries while justifying its use for the scale and objectives of our study.
The manuscript has been updated on page 24, lines 817-828:
“To address the limitations of automated text analysis, researchers should employ hybrid methods. This could involve manual coding to create high-quality datasets for fine-tuning LLMs on domain-specific language or the development of more sophisticated models capable of better detecting nuance, sarcasm, and culture-specific expressions in reviews. A critical next step is to integrate subjective review data with objective performance metrics. Linking reviews to logistics API data (actual delivery times), seller response logs, and product return rates would allow researchers to triangulate findings and explore the crucial gap between perceived quality and actual service performance. Employing non-parametric machine learning models (e.g., Random Forests, Gradient Boosting) could help uncover non-linear relationships and complex interactions between quality aspects that are not captured by linear models, providing a more nuanced predictive framework.”
Comment 3:
The research relies on subjective review data and does not incorporate objective service performance indicators (e.g., actual delivery time, return rates). This may result in a disconnect between perceived quality measurements and actual service quality. Future studies could integrate objective metrics such as logistics tracking data to validate the results.
Response 3:
We thank the reviewer for this valuable suggestion. We fully agree that integrating objective performance data would be a logical and powerful extension of this research.
We have added a clear acknowledgment of this limitation in the Conclusion (Section 5). We now explicitly state that our study measures perceived quality, which may differ from objective reality, and that this is an inherent characteristic of using review data.
We have also directly incorporated the reviewer's excellent suggestion for future research. This provides a clear pathway for subsequent studies to build upon our work.
We believe this addition adds an important layer of methodological clarity and sets a compelling agenda for future investigation.
The manuscript has been updated on page 24, lines 817-828:
“To address the limitations of automated text analysis, researchers should employ hybrid methods. This could involve manual coding to create high-quality datasets for fine-tuning LLMs on domain-specific language or the development of more sophisticated models capable of better detecting nuance, sarcasm, and culture-specific expressions in reviews. A critical next step is to integrate subjective review data with objective performance metrics. Linking reviews to logistics API data (actual delivery times), seller response logs, and product return rates would allow researchers to triangulate findings and explore the crucial gap between perceived quality and actual service performance. Employing non-parametric machine learning models (e.g., Random Forests, Gradient Boosting) could help uncover non-linear relationships and complex interactions between quality aspects that are not captured by linear models, providing a more nuanced predictive framework”
Comment 4:
While the moderating effect of logistics delivery services was examined, other potential moderating variables (e.g., consumer demographics or platform types) were not thoroughly explored.
Response 4:
We thank the reviewer for this insightful observation. We agree that exploring other moderating variables could provide a more complete picture.
In response, we have added a paragraph to the Conclusion (Section 5) explicitly acknowledging that our study focused on a single, albeit critical, moderator (Logistics Delivery) and that other factors like consumer demographics and platform types represent fruitful avenues for future research.
We have framed this not as a flaw, but as a necessary focusing of scope for this initial investigation, which now provides a solid foundation for future studies to explore these additional complex interactions.
This addition enhances the scholarly contribution of our work by clearly delineating the boundaries of our study and providing a clear roadmap for subsequent research.
The manuscript has been updated on page 24, lines 829-836:
“This study focused specifically on the moderating role of Logistics Delivery Services. While this provides a focused and deep insight, it does not exhaust the list of potential moderators in the e-commerce ecosystem. Other important variables, such as consumer demographics (e.g., age, tech-savviness), purchase history, or platform-specific features (e.g., marketplace vs. brand-owned website), could also significantly influence the relationship between perceived quality and satisfaction. Future research could build upon our model by incorporating these variables to develop a more comprehensive understanding of the contextual factors that shape customer satisfaction.”
Comment 4 (continued):
Additionally, Promotion was not significant in the standalone model but showed significance in the full model, indicating its complex role—a phenomenon that was not sufficiently explained in the paper.
Response 4:
We thank the reviewer for highlighting this intriguing result, which we agree warranted a more thorough explanation.
In response, we have added a detailed explanation of this phenomenon in Section 3.2 (Results). We frame it as a classic suppressor effect, where the strong correlation between Promotion and other aspects (especially Price) masks its unique effect in a simple model.
We now posit that in the full model, once the effect of general price perception is controlled for, the unique psychological benefit of receiving a promotional "deal" or "bonus" emerges as a significant and positive contributor to satisfaction.
This provides a more nuanced and theoretically grounded interpretation of Promotion's role, moving beyond a simple direct effect to understanding its position within a network of perceived quality attributes.
The manuscript has been updated on page 16, lines 540-549:
“The finding that Promotion (PRO) was not significant in isolation (Model 5) but became significant in the full model (Model 7) suggests a complex, interdependent relationship with other quality aspects. This pattern is indicative of a suppressor effect. Promotion may be highly correlated with other variables, particularly Price (PRI). When evaluated alone, the variance in satisfaction explained by Promotion is confounded with the effect of price perception. However, when Price and other aspects are controlled for in the full model, the unique, positive effect of receiving a promotional benefit, such as a cashback or voucher, distinct from the product's base price, is revealed as a significant driver of satisfaction. This implies that consumers value promotions not merely as a price reduction, but as a separate positive event that enhances their overall shopping experience.”
The manuscript has been updated on pages 19-20, lines 627-631:
“The complex role of Promotion, which only showed a significant impact when considered alongside other aspects, underscores the nuanced nature of consumer decision-making. It appears that promotions are not a primary initial driver but act as a valuable enhancer, contributing to satisfaction after core expectations of product functionality, originality, and fair pricing are met.”
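The suppressor pattern described in this response can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' analysis: it uses ordinary least squares on a continuous satisfaction score rather than the logistic regression in the manuscript, and the negative Price-Promotion correlation, the coefficient values, and the function names `ols_slopes` and `simulate_suppression` are all assumptions chosen for illustration.

```python
import numpy as np


def ols_slopes(X, y):
    """Fit OLS with an intercept and return only the slope coefficients."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


def simulate_suppression(seed=0, n=20000):
    """Return (promotion slope alone, promotion slope controlling for price)."""
    rng = np.random.default_rng(seed)
    # Assumption: promotion perception is negatively correlated with price perception.
    price = rng.normal(size=n)
    promo = -price + rng.normal(size=n)
    # Assumption: both aspects raise satisfaction (true slopes 1.0 and 0.5).
    satisfaction = 1.0 * price + 0.5 * promo + rng.normal(size=n)

    # Promotion on its own: its effect is masked by the omitted price variable.
    marginal = ols_slopes(promo, satisfaction)[0]
    # Promotion with price controlled for: its unique effect is recovered.
    full = ols_slopes(np.column_stack([price, promo]), satisfaction)[1]
    return marginal, full


marginal, full = simulate_suppression()
print(f"promotion alone: {marginal:.3f}, promotion with price controlled: {full:.3f}")
```

With these assumed values, the standalone slope is 0.5 + 1.0 × (−1/2) = 0 in expectation, while the slope in the full model is close to the true 0.5, which mirrors the not-significant-alone, significant-in-full-model pattern.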
Comment 5:
The logistic regression model assumes a linear relationship, but in reality, the relationship between perceived quality and satisfaction may be non-linear. The paper did not attempt to compare other machine learning models (e.g., decision trees or neural networks), which may limit multi-angle validation of the findings.
Response:
We thank the reviewer for this insightful methodological point. We agree that exploring non-linear relationships and comparing model performance could be a valuable extension.
We have added a dedicated paragraph in the Conclusion (Section 5) to acknowledge this limitation. We explicitly state that logistic regression assumes linearity in the log-odds and may not capture more complex, non-linear relationships.
We have incorporated the reviewer's specific suggestion to mention "decision trees or neural networks" as potential models for future research to provide multi-angle validation.
At the same time, we have justified our choice of logistic regression by reiterating the explanatory goal of our study. The primary objective was hypothesis testing and obtaining interpretable parameters (odds ratios) to understand the effect of each aspect, which aligns perfectly with the generalized linear model framework. More complex "black-box" models, while potentially offering higher predictive accuracy, would not have provided the same level of clear, actionable insight into the specific relationships we set out to investigate.
We believe this clarification strengthens the methodological rationale of our paper while thoughtfully outlining a path for future analytical work.
The manuscript has been updated on pages 20-21, lines 817-828:
“To address the limitations of automated text analysis, researchers should employ hybrid methods. This could involve manual coding to create high-quality datasets for fine-tuning LLMs on domain-specific language or the development of more sophisticated models capable of better detecting nuance, sarcasm, and culture-specific expressions in reviews. A critical next step is to integrate subjective review data with objective performance metrics. Linking reviews to logistics API data (actual delivery times), seller response logs, and product return rates would allow researchers to triangulate findings and explore the crucial gap between perceived quality and actual service performance. Employing non-parametric machine learning models (e.g., Decision Trees, Neural Networks, Random Forests, or Gradient Boosting) could help uncover non-linear relationships and complex interactions between quality aspects that are not captured by linear models, providing a more nuanced predictive framework.”
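The "linearity in the log-odds" assumption and the odds-ratio interpretation mentioned in this response can be written out as follows. This is a sketch in generic notation; the symbols (SAT for the satisfaction indicator, X for a single quality aspect, DEL for logistics delivery) are illustrative and may not match the manuscript's own equations.

```latex
\ln\frac{P(\mathrm{SAT}=1)}{1-P(\mathrm{SAT}=1)}
  = \beta_0 + \beta_1 X + \beta_2\,\mathrm{DEL} + \beta_3\,(X \times \mathrm{DEL}),
\qquad
\mathrm{OR}_{X \mid \mathrm{DEL}=d} = e^{\beta_1 + \beta_3 d}
```

Under this form, a one-unit increase in X multiplies the odds of satisfaction by a constant factor at any given level of DEL, and a non-zero interaction coefficient means that factor changes with DEL. A non-linear relationship of the kind the reviewer describes would violate the first property.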
Comment 7:
These findings provide theoretical support for e-commerce service strategy optimization but need to be validated in broader scenarios. Some explanations regarding this aspect can be added.
Response:
We thank the reviewer for this important nuance regarding the application of our findings.
We have added a clarification in the Conclusion (Section 5) that explicitly acknowledges the reviewer's point. We now state that while our findings provide strong theoretical support for strategy optimization in contexts similar to our study (the Indonesian smartphone market), their direct application to "broader scenarios" requires validation.
We specify that the universal contribution lies in the methodological framework and the principle of integrating service-centric qualities into perceived quality models, while the specific strategic priorities (e.g., the high rank of Packaging) are context-dependent.
This addition ensures a more precise and academically rigorous claim for the generalizability of our work.
The manuscript has been updated on page 23, lines 793-805:
“The findings of this study provide a theoretical foundation and empirical support for optimizing e-commerce service strategies, particularly in the smartphone sector and similar high-involvement product categories within emerging markets. The identified hierarchy of perceived quality aspects, with Packaging, Logistics Delivery, and Functionality being paramount, offers a clear framework for resource allocation. However, as the reviewer rightly notes, the direct application of these findings to other contexts requires careful consideration. The specific weight of each aspect is likely contingent on product type (e.g., the importance of packaging and originality would differ for commodity goods like groceries) and market maturity (e.g., logistics may be a baseline expectation in developed economies). Therefore, while the methodology and the demonstrated importance of a dual-dimensional (product and service) quality framework are universally valuable, the specific strategic priorities identified here should be validated and calibrated in broader scenarios before generalized application.”
Reviewer 3 Report
Comments and Suggestions for Authors
General Comment
The study uses Google's Gemma 2 LLM to identify perceived quality aspects in online reviews and investigates the relationship between perceived quality and customer satisfaction, incorporating interaction variables to evaluate logistics delivery services.
- Discussions of the hypotheses are misleading. There is a mistake in discussing hypotheses. The text under hypothesis 1 must be under hypothesis 2. Likewise, the text under hypothesis 2 must be under hypothesis 3... and so on. The paragraph before hypothesis 1 should come after the same hypothesis.
The study examines perceived product quality in online consumer reviews, highlighting its significant impact on customer satisfaction. Factors such as functionality, originality, price, logistics, packaging, promotion, responsiveness, and warranty directly affect these evaluations, thereby building brand trust and satisfaction.
- However, there are two concerns. The first regards the validity of the results: how did the authors validate the solutions? The second is the applicability and practical implications of the solutions.
Specific Comments
- The title is relatively long, so it's better to shorten its length (optional).
- The research problem and objectives should be briefly defined in the abstract.
- English errors: " …chasing habits globally, including in Indonesia…". "This study constructs the following (should be this) hypothesis and explores the originality effect on customer satisfaction" (this applies to all hypotheses). "It that allows an LLM to perform ABSA". "The algorithm for this purpose present in Figure 1."…..
- While the gaps and research questions are clearly defined, the introduction should clearly state the problem definition, research objectives and summary of the tools/ solutions used in the analysis.
- There is a mistake in discussing hypotheses. The text under hypothesis 1 must be under hypothesis 2. Likewise, the text under hypothesis 2 must be under hypothesis 3... And so on. The paragraph before hypothesis 1 should come after the same hypothesis.
- The method and tools used are discussed appropriately and in a suitable sequence, and the equations and algorithms are clearly presented.
- It is not clear how the values and results in Table 2 were calculated.
- The results were presented in a structured and sequential manner and discussed clearly.
- It is better to have a table summarizing the results in the discussion section to summarize the outcomes and facilitate their understanding.
- The conclusion should include the applicability and practical implications of the results and solutions.
There are some English comments that should be revised as shown in the comments.
Author Response
Comment 1:
The research problem and objectives should be briefly defined in the abstract.
Response:
We thank the reviewer for this constructive suggestion to improve the clarity and impact of our abstract.
In direct response to your comment, we have revised the abstract to front-load the research problem and objectives.
The new opening sentences now clearly state the core challenge in e-commerce and the specific research gaps our study addresses, immediately orienting the reader.
This revision creates a more logical flow: Problem -> Objectives -> Methodology -> Key Findings -> Implications, ensuring the abstract is a more comprehensive and self-contained summary of our work.
We believe these edits have significantly strengthened the abstract and thank the reviewer for the valuable feedback.
Revised abstract:
“A significant challenge in e-commerce is the inability of consumers to physically inspect products, forcing them to rely on perceived quality derived from other consumers' experiences. However, gaps remain in understanding which dimensions of perceived quality are most frequently mentioned and influential for customer satisfaction, particularly in emerging markets like Indonesia. This study investigates these gaps by identifying key perceived quality aspects and examining their impact on satisfaction, with a specific focus on the moderating role of logistics delivery services. Using a large language model (LLM), specifically Google’s Gemma 2, we performed aspect-based sentiment analysis on 5,000 smartphone reviews from Indonesian e-commerce. Logistic regression models incorporating interaction variables were employed to evaluate the relationships. The results identify the most frequently mentioned aspects of perceived quality: Logistics delivery services, Functionality, Originality, Responsiveness, and Packaging. While Logistics delivery services was the most mentioned aspect, Packaging had the most significant direct influence on satisfaction. Notably, Logistics delivery services also play a significant moderating role, enhancing the positive effect of other perceived quality aspects on satisfaction. These findings suggest that Logistics delivery services contribute directly to satisfaction and amplify other aspects, resulting in greater customer satisfaction. The study contributes to the literature by demonstrating LLM-driven aspect-based sentiment analysis methods and expanding the concept of perceived quality to include service aspects, thus promoting a more complete consideration of perceived quality in e-commerce.”
Comment 2:
While the gaps and research questions are clearly defined, the introduction should clearly state the problem definition, research objectives and summary of the tools/ solutions used in the analysis.
Response:
We thank the reviewer for this excellent suggestion to enhance the clarity and structure of our introduction.
In response, we have revised the introduction to include a dedicated paragraph that explicitly states the problem definition, formalizes the research objectives, and summarizes the analytical tools and solutions used in the study.
This new paragraph provides a clear and concise roadmap of the paper, allowing readers to immediately understand the study's purpose, methodology, and contribution.
We believe this addition significantly improves the narrative flow of the introduction and provides a stronger foundation for the rest of the paper.
The manuscript has been updated on pages 2-3, lines 80-92:
“To address these research questions, this study is designed with the following objectives: (1) to identify and quantify the key aspects of perceived quality from online consumer reviews in the Indonesian smartphone market; (2) to measure the direct impact of these aspects on customer satisfaction; and (3) to investigate the moderating role of logistics delivery services in the relationship between perceived quality and satisfaction. The primary problem we tackle is the lack of a nuanced, data-driven understanding of how both product and service-related quality dimensions collectively shape customer satisfaction in a rapidly growing, yet understudied, e-commerce market. To solve this, we employ a novel methodological framework combining a LLM and statistical modeling. Specifically, we utilize Google's Gemma 2 for ABSA to automatically classify aspects and sentiment in Indonesian reviews, followed by logistic regression analysis with interaction effects to test our hypotheses and quantify the moderating influence of logistics delivery.”
Comment 3:
There is a mistake in discussing hypotheses. The text under hypothesis 1 must be under hypothesis 2. Likewise, the text under hypothesis 2 must be under hypothesis 3... And so on. The paragraph before hypothesis 1 should come after the same hypothesis.
Response:
We sincerely thank the reviewer for their meticulous attention to detail in identifying this significant error in the presentation of our hypotheses.
We have thoroughly reviewed the Hypotheses section and corrected the misalignment between the hypotheses and their corresponding explanatory text.
We have carefully checked the entire sequence to ensure that each hypothesis is now correctly paired with its intended justification and literature support.
We apologize for this error and are grateful for the correction, which has improved the clarity and logical flow of our theoretical framework.
Comment 4:
The method and tools used are discussed appropriately and in a suitable sequence, and the equations and algorithms are clearly presented.
Response:
We thank the reviewer for their positive feedback on the presentation of our methodology. We are pleased that the description of our methods, tools, and the sequence of our analytical approach was found to be clear and appropriate.
Comment 5:
It is not clear how the values and results in Table 2 were calculated.
Response:
We thank the reviewer for pointing out this lack of clarity. We have now added an explanation of the model evaluation procedure, which explicitly describes the creation of a manually annotated test set and how it was used to calculate the performance metrics in Table 2.
This addition ensures full transparency and reproducibility of our LLM-ABSA classification results.
The manuscript has been updated on page 12, lines 434-436:
“To evaluate the performance of the LLM-ABSA model, a manually annotated test set was created. The recruited annotator independently labeled a random sample of 500 reviews from the dataset, identifying the aspects and its associated sentiment.”
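As an illustration of how metrics could be derived from such a manually annotated test set, the following is a minimal sketch and not the authors' evaluation code. The label format (a set of (aspect, sentiment) pairs per review), the choice of micro-averaged precision, recall, and F1, the function name `micro_prf`, and the toy data are all assumptions, since the contents of Table 2 are not reproduced in this response.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over (aspect, sentiment) pairs.

    gold, pred: lists of sets, one set of (aspect, sentiment) tuples per review.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # pairs the model and the annotator agree on
        fp += len(p - g)   # pairs predicted by the model only
        fn += len(g - p)   # pairs found by the annotator only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical annotations for three reviews (aspect codes follow Table 8).
gold = [{("DEL", "pos"), ("PAC", "pos")}, {("FUN", "neg")}, {("PRI", "pos")}]
pred = [{("DEL", "pos")}, {("FUN", "neg"), ("PAC", "pos")}, {("PRI", "neg")}]

precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

In this toy case a pair counts as correct only when both the aspect and its sentiment match the annotator's label, so a right aspect with the wrong polarity is penalized as both a false positive and a false negative.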
Comment 6:
The results were presented in a structured and sequential manner and discussed clearly.
Response:
We thank the reviewer for their positive feedback on the clarity and structure of our results and discussion. We are pleased that the presentation of our findings was found to be effective.
Comment 7:
It is better to have a table summarizing the results in the discussion section to summarize the outcomes and facilitate their understanding.
Response:
We thank the reviewer for this excellent suggestion to enhance the clarity of our discussion. We have added a new table (Table 8) to the Discussion section that succinctly summarizes the key findings regarding hypothesis testing and the role of each perceived quality aspect. This provides a clear and concise overview for the reader and strengthens the connection between our results and their interpretation.
The manuscript has been updated on page 22, lines 749-755:
To facilitate a comprehensive understanding of our results, Table 8 provides a summary of the hypothesis tests and the definitive roles of each perceived quality aspect, both in terms of their direct influence on satisfaction and their interaction with logistics delivery services. As summarized in Table 8, the results strongly support the direct influence of all perceived quality aspects on customer satisfaction and the moderating role of logistics delivery was significant for most aspects.
Table 8. The Summary of the Hypotheses Results
| Hypothesis | Aspect | Direct Effect on Satisfaction | Moderating Role of Logistics Delivery |
| --- | --- | --- | --- |
| H1 | FUN | Supported | Supported |
| H2 | ORI | Supported | Supported |
| H3 | PRI | Supported | Rejected |
| H4 | DEL | Supported | - |
| H5 | PAC | Supported | Supported |
| H6 | PRO | Supported | Rejected |
| H7 | RES | Supported | Supported |
| H8 | WAR | Supported | Supported |
| H9 | DEL as Moderator | Supported | Supported |
Comment:
The conclusion should include the applicability and practical implications of the results and solutions.
Response:
We thank the reviewer for this essential suggestion to enhance the impact of our conclusion. We have thoroughly revised the Conclusion section to explicitly outline the practical implications of our findings in Subsection 5.1.
