1. Introduction
In a rapidly changing business environment, a customer-centered product improvement strategy has become an important means for enterprises to gain insights into customer evaluations and thereby improve customer satisfaction (CS) [1,2]. The effective implementation of this strategy depends on the accurate identification of customer sentiment feedback and scientifically grounded guidance for product improvement [3]. However, traditional data collection methods, such as questionnaires, cannot meet the need for real-time mining of massive amounts of unstructured, authentic customer feedback in the big data era [4].
By contrast, the widespread adoption of e-commerce and social platforms has made online reviews a major medium through which customers express their usage experiences and opinions, as well as an important factor influencing product sales and corporate reputation [5,6,7]. Online review data not only reveal customers’ preferences, behaviors, and subjective perceptions [8], but also provide a dynamic, rich, and real-time source of customer insight into products [9,10], thereby supporting customer-centered targeting and agile product improvement.
Customers’ experiences with products or services often involve multiple dimensions. Therefore, identifying sentiment only at the overall review level may obscure their attitudes toward specific product attributes [11,12,13]. In fact, customers may express distinct or even opposing sentiments toward different attributes within the same review [14,15]. Accurately extracting these focal points and the fine-grained sentiments embedded in reviews is of great importance for revealing customers’ actual concerns and sentiment tendencies [16,17]. However, in real-world review contexts, textual expressions are often characterized by grammatical irregularities, semantic ambiguity, and complex rhetorical devices, which can interfere with the accurate identification of such information [8]. Therefore, improving fine-grained sentiment identification at the attribute level has become a key issue in understanding customer attitudes [18]. This capability not only helps reveal product shortcomings and improvement directions in greater detail, but also provides data-driven, precise support for enterprise product iteration and decision optimization in e-commerce contexts [19]. Although existing studies can automatically identify key attributes and their sentiment polarity in review texts, there remains substantial room for improvement in modeling fine-grained, multi-attribute sentiment relationships [20].
To translate attribute sentiment (AS) into actionable product improvement insights, the Kano model provides a useful analytical framework. The Kano model has been widely applied in product and service contexts, including hotels [3], tourism services [21], fresh products [22], and shopping platforms [23], because it can reveal the asymmetric relationship between AS and CS. When constructing Kano models from online reviews, prior research has mainly relied on binary sentiment (positive or negative) to investigate the impact of AS on CS. However, this binary treatment generally overlooks the independent value of neutral sentiment and often categorizes it as either positive or negative. This approach may lead to the loss of critical sentiment information [24], weaken the analytical value of objective factual information contained in review texts, and reduce the explanatory power of the mechanism through which AS influences CS, thereby distorting the true relationship between AS and CS [25].
Identifying attribute priorities is a critical step in resource allocation decisions [26]. However, existing studies have largely stopped at Kano category classification and pay insufficient attention to the relative priority of attributes within the same category [27], which weakens the decision-support value of the research findings for managerial practice [28,29]. Even if multiple attributes belong to the same Kano category, their perceived importance among customers may differ significantly [30]. Therefore, ranking attributes within the same category in a reasonable manner has become a critical issue in current research.
Accordingly, we propose the following research questions in this paper:
How can AS in online reviews be accurately recognized?
How can neutral sentiment be incorporated into the Kano classification method?
How can attributes within the same category be ranked?
To address these research questions, this study develops a systematic decision-support framework to mine CS from online reviews and guide product improvement. This paper makes three main contributions. First, an attribute sentiment analysis method, termed BERT-A-Conv, is developed based on Bidirectional Encoder Representations from Transformers (BERT) by integrating an attribute-aware mechanism with convolutional feature extraction, thereby enabling accurate identification of sentiments associated with multiple attributes in online reviews. Second, we propose the marginal contribution difference-based Kano model (MCD-Kano) to incorporate neutral sentiment into Kano classification, thereby addressing the limitations of traditional binary sentiment. Third, the Attribute Improvement Priority Score (AIPS) is developed by integrating attribute MCDs with their improvement potential to rank attributes within a category, thereby providing quantitative support for enterprise resource allocation and product improvement. Overall, this study supports a data-driven decision-support logic that transforms product-attribute-level information into interpretable product-improvement insights.
The remainder of this paper proceeds as follows. Section 2 reviews the relevant literature; Section 3 presents the methodological framework; Section 4 reports the case study and comparative analysis; and Section 5 discusses the theoretical and practical contributions, as well as the study’s limitations and future research directions.
3. Methodology
As shown in Figure 2, this paper proposes an explainable Kano-based decision-support framework for mining CS from online reviews and guiding product improvement. The framework aligns with the research stream of data-driven CS analysis and product improvement informed by online reviews. Rather than treating product attributes, customer sentiment, and satisfaction outcomes as separate analytical objects, the framework connects them within a unified decision-support process that transforms product-attribute-level information into interpretable product-improvement insights. The proposed method consists of five key steps: (1) Data Acquisition and Pre-processing; (2) Product Attribute Extraction; (3) Attribute Sentiment Analysis; (4) Construction of the MCD-Kano model; and (5) Attribute Improvement Priority Assessment Based on MCD. Together, these steps establish a coherent logic for converting online review data into actionable product improvement priorities.
3.1. Data Acquisition and Pre-Processing
During the data collection stage, large-scale customer review data were obtained from e-commerce platforms using a self-developed web crawler, covering key fields such as customer ID, review time, review content, and ratings. To ensure the authenticity and validity of the review data, we implemented a hierarchical quality control procedure during the pre-processing stage. First, the raw reviews were cleaned, including removing duplicate records, excluding reviews with missing text or invalid content, retaining only Chinese-language text, and filtering out excessively short reviews with fewer than 5 words, thereby reducing the proportion of noisy, information-insufficient samples. Second, to address potential interference from fake reviews and spamming, we performed additional consistency and anomaly screening, including verifying the correspondence between ratings and review texts, identifying abnormally active users who posted large numbers of reviews within a short period, and removing the associated samples. For text processing, we used Jieba 0.42.1 for Chinese word segmentation, constructed a synonym dictionary to enhance semantic recognition, and introduced a stop-word dictionary to reduce interference from noisy information and irrelevant terms.
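The cleaning rules described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the field names (`user_id`, `text`), the per-user cut-off `max_reviews_per_user`, and the character-level default tokenizer are assumptions made for illustration, and the activity check ignores the time window that the actual screening considers.

```python
import re
from collections import Counter

# Matches at least one CJK unified ideograph (a rough "contains Chinese" check).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def clean_reviews(reviews, tokenize=list, min_tokens=5, max_reviews_per_user=20):
    """Filter raw reviews given as dicts with 'user_id' and 'text' keys.

    tokenize: callable returning a list of tokens. The default splits the text
        into characters; a word segmenter could be passed instead (assumption).
    max_reviews_per_user: hypothetical cut-off for abnormally active users.
    """
    reviews_per_user = Counter(r["user_id"] for r in reviews)
    seen_texts = set()
    kept = []
    for review in reviews:
        text = (review.get("text") or "").strip()
        if not text:                                   # missing or empty text
            continue
        if not CJK_PATTERN.search(text):               # keep Chinese-language text only
            continue
        if len(tokenize(text)) < min_tokens:           # excessively short review
            continue
        if reviews_per_user[review["user_id"]] > max_reviews_per_user:
            continue                                   # abnormally active user
        if text in seen_texts:                         # duplicate record
            continue
        seen_texts.add(text)
        kept.append(review)
    return kept


raw = [
    {"user_id": "u1", "text": "电池续航很好用"},
    {"user_id": "u1", "text": "电池续航很好用"},      # duplicate
    {"user_id": "u2", "text": "good watch overall"},  # not Chinese
    {"user_id": "u3", "text": "不错"},                # too short
    {"user_id": "u4", "text": ""},                    # empty
]
cleaned = clean_reviews(raw)
print([r["user_id"] for r in cleaned])
```

Passing the tokenizer as an argument keeps the filter independent of any particular segmentation tool.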
3.2. Product Attribute Extraction
Because a single review may contain multiple attributes, direct modeling may easily lead to topic mixing. To improve identification accuracy in multi-attribute expressions, this paper adopts a sentence-level segmentation strategy that divides each review into multiple sentences based on punctuation. This study then combines all sentences into a corpus and uses the bitermplus toolkit in Python 3.11.5 to train a Biterm Topic Model (BTM) for identifying potential product attributes. BTM represents the topic structure by capturing the co-occurrence patterns of word pairs (biterms) across the global corpus and is particularly suitable for short-text scenarios [65,66].
To determine the optimal number of topics, this paper adopts the coherence score for evaluation [31]. This metric measures the degree of semantic cohesion among keywords within the same topic. A higher score indicates that the semantics within a topic are more concentrated and the expression is clearer, thereby reflecting higher topic quality.
Figure 3 illustrates the generative process of BTM and its role in attribute extraction in the present study.
(1) The model generates the global topic distribution θ according to the hyperparameter σ, which controls the topic sampling of biterms in the corpus;
(2) For each biterm, the topic variable z is sampled from θ to determine the topic membership of the word pair;
(3) The word distribution of each topic, $P(w \mid k)$, $k \in \{1, 2, \ldots, K\}$, is generated according to the hyperparameter τ, and two words, $w_p$ and $w_q$, are then separately sampled from the corresponding $P(w \mid k)$ (K denotes the predefined number of topics in BTM);
(4) After model training, the set of high-probability keywords for each topic is output;
(5) Through manual inspection and synonym merging, semantically similar topics are identified as specific product attributes, and this study constructs an attribute lexicon for subsequent attribute context extraction and sentiment prediction.
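The notion of a biterm that underlies the steps above can be made concrete with a short sketch. It only shows how unordered word pairs are extracted from segmented sentences and counted over the corpus; it does not implement BTM inference, and the English tokens are illustrative stand-ins for segmented review text.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair that co-occurs in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(sentences):
    """Count biterms over the whole corpus (one token list per sentence)."""
    counts = Counter()
    for tokens in sentences:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus: each inner list is one punctuation-delimited sentence.
sentences = [
    ["battery", "life", "long"],
    ["battery", "charge", "fast"],
    ["battery", "life", "short"],
]
counts = corpus_biterm_counts(sentences)
print(counts.most_common(3))
```

Because biterms are pooled over the global corpus rather than per document, frequently co-occurring pairs such as ("battery", "life") provide topic evidence even when individual sentences are very short.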
3.3. Attribute Sentiment Analysis
Based on the attributes extracted in Section 3.2, this section further incorporates a sentiment analysis model to achieve attribute-level sentiment analysis. This study proposes a BERT-A-Conv model. In this model, the attribute-aware mechanism guides the model to focus on semantic segments that are highly relevant to the target attribute, thereby alleviating sentiment feature confusion in multi-attribute reviews. The convolutional module captures local feature patterns, making it suitable for fine-grained sentiment analysis. This method takes the “attribute–review” combination as the model input. Accordingly, this study decomposes each review involving multiple attributes into several “attribute–review” samples, and Algorithm A1 in Appendix A presents the matching algorithm. This design enables the model to learn sentiment expression features independently for each attribute and to explicitly focus on review segments semantically related to that attribute.
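The decomposition of a review into “attribute–review” samples can be sketched as follows. This is a simplified keyword-matching illustration rather than Algorithm A1 itself; the lexicon entries are hypothetical, and the `[CLS] … [SEP] … [SEP]` layout is the standard BERT sentence-pair format, assumed here to correspond to the “special separators” mentioned in the text.

```python
def build_attribute_review_pairs(review, lexicon):
    """Pair a review with every attribute whose keywords it mentions."""
    pairs = []
    for attribute, keywords in lexicon.items():
        if any(keyword in review for keyword in keywords):
            pairs.append((attribute, review))
    return pairs


def to_bert_input(attribute, review):
    """Format one pair as a single sequence with special separator tokens."""
    return f"[CLS] {attribute} [SEP] {review} [SEP]"


# Hypothetical attribute lexicon (attribute -> representative keywords).
lexicon = {
    "Battery": ["电池", "续航"],
    "Design": ["外观", "颜值"],
    "Price": ["价格", "性价比"],
}
review = "电池续航很给力，但是外观一般"

pairs = build_attribute_review_pairs(review, lexicon)
for attribute, text in pairs:
    print(to_bert_input(attribute, text))
```

A review that mentions two attributes therefore yields two samples, each of which can receive its own sentiment label.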
During the feature encoding stage, pre-trained BERT is used to obtain the contextual representation of the input text, where the target attribute and the review text are concatenated into a single input sequence with special separators added, thereby forming an embedding representation that incorporates both attribute semantics and review context [67]. Through a multi-layer Transformer architecture, BERT captures global dependencies, and the resulting contextual vectors preserve both review semantics and attribute context. Unlike conventional approaches that feed the entire review sentence directly into a sentiment model, this study introduces an attribute-aware strategy at the encoding stage: the weights of words in the contextual vectors are dynamically adjusted through attribute-aware attention, enabling the model to prioritize the sentiment signals most relevant to the current attribute when processing multi-attribute reviews and limiting interference from information associated with other attributes. On this basis, this study employs a convolutional neural network (CNN) to extract n-gram-level local sentiment features. Convolutional kernels slide over the review text representations to capture phrase-level sentiment patterns, and max pooling is applied to retain the most salient sentiment cues, thereby enhancing the model’s capability for fine-grained sentiment identification. Subsequently, this study fuses the output vectors from the attribute-aware attention mechanism and the CNN module and performs sentiment classification for the target attribute using a fully connected layer. The output layer uses the Softmax activation function to generate a probability distribution over three sentiment categories (positive, negative, and neutral) and identifies the category with the highest probability as the final sentiment class for the target attribute.
Figure 4 illustrates the structure of the BERT-A-Conv model and the functions of its modules.
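To make the convolution, max-pooling, and Softmax steps more tangible, the following toy sketch applies a single kernel to a short sequence of token vectors in plain Python. It is not the BERT-A-Conv implementation: the two-dimensional vectors, the kernel weights, the ReLU activation, and the logits are all illustrative assumptions.

```python
import math


def conv1d_max_pool(token_vectors, kernel, bias=0.0):
    """Slide one kernel (k rows of d weights) over d-dimensional token vectors,
    apply ReLU to each window response, and keep the maximum (max pooling)."""
    k = len(kernel)
    if len(token_vectors) < k:
        raise ValueError("sequence is shorter than the kernel size")
    responses = []
    for start in range(len(token_vectors) - k + 1):
        window = token_vectors[start:start + k]
        score = bias + sum(
            w * x
            for kernel_row, token in zip(kernel, window)
            for w, x in zip(kernel_row, token)
        )
        responses.append(max(0.0, score))  # ReLU (assumed activation)
    return max(responses)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Four tokens with 2-dimensional representations and one kernel of size 3.
tokens = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1], [1, 1]]
pooled = conv1d_max_pool(tokens, kernel)

labels = ["positive", "negative", "neutral"]
probabilities = softmax([2.0, 0.5, 1.0])  # hypothetical classifier logits
predicted = labels[probabilities.index(max(probabilities))]
print(pooled, predicted)
```

In a full model, many kernels would be applied in parallel and their pooled outputs concatenated with the attention output before the fully connected layer.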
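To validate model performance, this study adopted commonly used classification metrics, including Precision, Recall, and F1-score, to evaluate the sentiment analysis model. For a given sentiment class, Precision measures the proportion of correctly predicted samples among all samples predicted as that class, while Recall measures the proportion of correctly predicted samples among all samples that actually belong to that class. The F1-score is the harmonic mean of Precision and Recall. Their calculation formulas are shown in Equation (1), where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{1}$$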
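For a three-class task, these metrics are typically computed per class and then averaged. The sketch below shows one common way to obtain macro-averaged values (the unweighted mean of per-class scores); the labels and predictions are made-up examples, not data from this study.

```python
def macro_precision_recall_f1(y_true, y_pred, labels):
    """Compute macro-averaged Precision, Recall, and F1 over the given labels."""
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


y_true = ["pos", "pos", "neu", "neg", "neg", "neu"]
y_pred = ["pos", "neu", "neu", "neg", "pos", "neu"]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred, ["pos", "neu", "neg"])
print(round(precision, 3), round(recall, 3), round(f1, 3))
```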
3.4. Construction of the MCD-Kano Model
In this paper, we propose the MCD-Kano model to accurately map ASs to the Kano model’s categories by quantifying each AS’s marginal contribution to changes in overall CS, thereby providing effective decision support for product improvement in the e-commerce environment.
In the CS modeling stage, this study adopted the LightGBM model, a gradient-boosting decision tree [67]. The model has demonstrated superior computational efficiency and predictive performance in large-scale and high-dimensional data analysis [68,69], and is particularly suitable for processing online review data characterized by complex structures and substantial feature redundancy. Empirical comparisons indicate that LightGBM outperforms XGBoost and the Bagging-based Random Forest model in terms of prediction accuracy, training efficiency, and feature importance interpretation [61,70]. Specifically, the model’s input features were the sentiment categories extracted during the sentiment analysis stage, and the target variable was the review rating (R), thereby establishing a nonlinear mapping between AS and CS. Each attribute was encoded over the three sentiment categories: a category expressed for the attribute in a review was coded as 1 and the remaining categories as 0, and if the review did not involve the attribute, it was recorded as {0, 0, 0}. During model training, this study constructed a unique index from timestamps and user IDs to ensure precise correspondence between AS and R and performed hyperparameter tuning using a hierarchical optimization strategy. To further ensure the robustness and generalizability of the model, five-fold cross-validation was used during training. This study evaluated the model’s predictive performance using Precision, Recall, and F1-score.
Table 1 shows the data structure of the model input.
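The encoding of attribute sentiments into model inputs can be sketched as follows. The column naming scheme (`<attribute>_<sentiment>`) is an assumption for illustration and may differ from the layout of Table 1.

```python
SENTIMENTS = ("positive", "neutral", "negative")


def encode_review(attribute_sentiments, attributes):
    """Encode one review as 0/1 features.

    attribute_sentiments: dict mapping each attribute mentioned in the review
        to its predicted sentiment. Attributes that are not mentioned receive
        0 in all three sentiment columns, i.e. {0, 0, 0}.
    """
    row = {}
    for attribute in attributes:
        sentiment = attribute_sentiments.get(attribute)
        for category in SENTIMENTS:
            row[f"{attribute}_{category}"] = 1 if sentiment == category else 0
    return row


attributes = ["Battery", "Price"]
row = encode_review({"Battery": "positive"}, attributes)
print(row)
```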
This study used the Shapley additive explanations (SHAP) method to explain how AS affects CS. SHAP builds on the game-theoretic feature attribution framework proposed by Shapley [
71] and quantifies the average marginal contribution of a feature to the model output through the prediction change induced by adding that feature across all possible feature combinations [
72]. By accounting for all possible feature combinations in the attribution process, SHAP is suitable for interpreting the nonlinear relationship between AS and CS captured by the LightGBM-based model, which may not be adequately represented by traditional linear models [
73]. In tree models, the TreeSHAP method adopted in this study retains the theoretical accuracy and interpretability of SHAP values while leveraging tree structure information to reduce computational complexity from exponential to polynomial, making it an efficient and precise implementation [
74]. SHAP values represent the marginal contribution of an attribute to the rating: positive values indicate that the attribute increases CS, whereas negative values indicate that the attribute reduces CS. Since the contribution of the same attribute may be opposite in direction across reviews, direct averaging may conceal these differences; therefore, this study separately computed the mean SHAP value for each sentiment category to represent its marginal contribution to CS. Let the sentiment category be denoted as $s$, with a total of $N_s$ reviews in that category. For the $j$-th review, the SHAP value of attribute $i$ is $\phi_{ij}^{s}$, and the marginal contribution $MC_i^{s}$ of attribute $i$ for this sentiment category is computed as shown in Equation (2).

$$MC_i^{s} = \frac{1}{N_s} \sum_{j=1}^{N_s} \phi_{ij}^{s} \tag{2}$$
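The idea of averaging a feature’s marginal contribution over all feature combinations can be illustrated with an exact, brute-force Shapley computation on a toy value function. This is not TreeSHAP and is feasible only for a handful of features; the feature names and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                total += weight * gain
        phi[feature] = total
    return phi


def toy_satisfaction(active):
    """Hypothetical model output given the set of active features."""
    score = 0.0
    if "battery_positive" in active:
        score += 2.0
    if "price_negative" in active:
        score -= 1.0
    if "battery_positive" in active and "price_negative" in active:
        score += 1.0  # interaction term, split equally by the Shapley rule
    return score


phi = shapley_values(toy_satisfaction, ["battery_positive", "price_negative"])
print(phi)
```

The two values sum to the difference between the output with all features and the output with none, which is the efficiency property that makes such attributions additive.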
On this basis, drawing on reference-dependent utility theory, this study treats the neutral state as the reference point and calculates the MCDs between adjacent sentiment categories. Specifically, $MCD_i^{pos-neu}$ measures the change in the marginal contribution of attribute $i$ when sentiment shifts from neutral to positive, whereas $MCD_i^{neu-neg}$ measures the change in the marginal contribution when sentiment shifts from negative to neutral. Equation (3) presents the detailed expression, where $MC_i^{s}$ denotes the marginal contribution of attribute $i$ under sentiment category $s$ (Equation (2)).

$$MCD_i^{pos-neu} = MC_i^{pos} - MC_i^{neu}, \quad MCD_i^{neu-neg} = MC_i^{neu} - MC_i^{neg} \tag{3}$$

These two measures represent changes in the satisfaction response curve across the “negative–neutral–positive” sentiment states. A positive $MCD_i^{pos-neu}$ indicates that positive sentiment contributes more to CS than neutral sentiment, whereas a negative value indicates its contribution is weaker than that of neutral sentiment. Similarly, a positive $MCD_i^{neu-neg}$ indicates that neutral sentiment contributes more to CS than negative sentiment, whereas a negative value indicates that the contribution of neutral sentiment is weaker than that of negative sentiment.
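The per-category averaging and the two differences between adjacent sentiment states can be sketched in a few lines. The SHAP values below are invented for illustration, and the record layout `(attribute, sentiment, shap_value)` is an assumption.

```python
from collections import defaultdict


def marginal_contributions(records):
    """Mean SHAP value per attribute and sentiment category.

    records: iterable of (attribute, sentiment, shap_value) tuples, one per
    review in which the attribute appears with that sentiment.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for attribute, sentiment, value in records:
        sums[attribute][sentiment] += value
        counts[attribute][sentiment] += 1
    return {
        attribute: {s: sums[attribute][s] / counts[attribute][s]
                    for s in sums[attribute]}
        for attribute in sums
    }


def marginal_contribution_differences(mc_attribute):
    """Return (pos - neu, neu - neg) with neutral as the reference point."""
    return (mc_attribute["pos"] - mc_attribute["neu"],
            mc_attribute["neu"] - mc_attribute["neg"])


records = [
    ("Battery", "pos", 0.30), ("Battery", "pos", 0.50),
    ("Battery", "neu", 0.10), ("Battery", "neu", 0.10),
    ("Battery", "neg", -0.20), ("Battery", "neg", 0.00),
]
mc = marginal_contributions(records)
mcd_pos_neu, mcd_neu_neg = marginal_contribution_differences(mc["Battery"])
print(round(mcd_pos_neu, 2), round(mcd_neu_neg, 2))
```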
Based on the above two indicators, an MCD-Kano model was constructed in this study (as shown in Figure 5), and the classification rules are described as follows. First, if both the pos−neu and neu−neg MCD values fall within the near-zero threshold ς, the attribute is classified as I-type, where experts determine ς based on domain knowledge and experience to exclude the influence of weak effects and random fluctuations. In this case, the marginal contributions of the attribute in the two sentiment transition intervals, namely from neutral to positive and from negative to neutral, are both close to zero. Therefore, it neither significantly increases CS nor effectively reduces customer dissatisfaction, and its overall effect on sentiment change is weak. Second, if both the pos−neu and neu−neg MCD values are negative, the attribute is classified as R-type. This indicates that the attribute exerts negative effects in both sentiment transition directions. It not only fails to move customer sentiment from neutral to positive, but also fails to restore sentiment from negative to neutral, and may even intensify negative experiences. For the remaining attributes, the relative magnitudes of the pos−neu and neu−neg MCD values are further compared to identify the shape of the contribution curve. Here, the threshold δ determines whether the relative difference between the pos−neu and neu−neg intervals meets the predefined criterion. Since this boundary is typically highly context-dependent and cannot be determined solely from sample data, it must be judged and specified by domain experts in light of the specific research context. If the pos−neu MCD is the larger of the two, the attribute shows a bigger difference in the pos−neu interval than in the neu−neg interval. Furthermore, if the relative difference in strength between the two intervals exceeds the threshold δ, the attribute is classified as A-type (with a V-shaped curve). In other words, the primary role of this attribute is to generate additional satisfaction, whereas its absence does not necessarily lead to obvious dissatisfaction. Otherwise, the attribute is classified as O-type (with a linearly increasing curve), indicating that the attribute’s effect intensity is relatively balanced across the two adjacent intervals. Conversely, if the neu−neg MCD is the larger of the two, the attribute shows a bigger difference in the neu−neg interval than in the pos−neu interval. Furthermore, if the relative difference in strength between the two intervals exceeds the threshold δ, the attribute is classified as M-type (with an inverted V-shaped curve). Otherwise, the attribute is classified as O-type.
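The classification logic can be summarized in a short function. This is a sketch under stated assumptions rather than the exact rule set: it assumes that “close to zero” means an absolute value no greater than ς, that magnitudes are compared in absolute terms, and that the relative difference is measured as the ratio of the larger to the smaller magnitude. The default values `zeta=0.05` and `delta=2.0` are placeholders chosen to match the settings reported in the case study.

```python
def classify_mcd_kano(mcd_pos_neu, mcd_neu_neg, zeta=0.05, delta=2.0):
    """Map the two MCD values of an attribute to a Kano category.

    Assumed rules (illustrative only):
      I: both values lie within [-zeta, zeta]
      R: both values are negative
      A: pos-neu magnitude dominates by a ratio greater than delta
      M: neu-neg magnitude dominates by a ratio greater than delta
      O: otherwise (balanced effect across the two intervals)
    """
    if abs(mcd_pos_neu) <= zeta and abs(mcd_neu_neg) <= zeta:
        return "I"
    if mcd_pos_neu < 0 and mcd_neu_neg < 0:
        return "R"
    gain, loss = abs(mcd_pos_neu), abs(mcd_neu_neg)
    if gain > loss:
        return "A" if loss == 0 or gain / loss > delta else "O"
    if loss > gain:
        return "M" if gain == 0 or loss / gain > delta else "O"
    return "O"


examples = {
    "attractive-like": (0.30, 0.05),
    "one-dimensional-like": (0.20, 0.15),
    "must-be-like": (-0.02, 0.30),
    "indifferent-like": (0.01, -0.02),
    "reverse-like": (-0.20, -0.30),
}
for name, (pos_neu, neu_neg) in examples.items():
    print(name, classify_mcd_kano(pos_neu, neu_neg))
```

Keeping ς and δ as explicit parameters reflects the point made in the text that both thresholds are set by domain experts rather than estimated from the sample.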
3.5. Attribute Improvement Priority Assessment Based on MCD
In this study, the performance value of attribute $i$ is defined as the global arithmetic mean marginal contribution of attribute $i$ across all sentiment categories. It is calculated by averaging the marginal contributions of attribute $i$ across all its occurrences in positive, neutral, and negative reviews, as shown in Equation (4), where $s$ denotes the sentiment category (positive, neutral, or negative). The performance value was then normalized using Equation (5) to obtain the normalized performance value.
Because different categories of attributes function differently in enhancing satisfaction and alleviating dissatisfaction, adopting a uniform evaluation criterion may obscure their actual improvement value [75,76].
For A-type attributes, according to the two-factor theory, the core mechanism is that fulfillment brings additional CS enhancement, whereas non-fulfillment does not necessarily lead to obvious dissatisfaction. Therefore, the neutral-to-positive MCD ($MCD_i^{pos-neu}$) is more suitable for characterizing their satisfaction enhancement effect. For M-type attributes, the core mechanism is that fulfillment may not significantly improve CS, whereas non-fulfillment leads to dissatisfaction. Therefore, the negative-to-neutral MCD ($MCD_i^{neu-neg}$) is more suitable for capturing their dissatisfaction-prevention effect. For O-type attributes, attribute performance generally exhibits a relatively linear or monotonic relationship with CS; that is, higher fulfillment improves CS, whereas non-fulfillment or poor performance leads to dissatisfaction. Therefore, O-type attributes involve both satisfaction enhancement and dissatisfaction prevention, and this study uses both $MCD_i^{pos-neu}$ and $MCD_i^{neu-neg}$ to characterize these two components [77,78]. The preference direction of R-type attributes is opposite to that in conventional improvement logic, whereas I-type attributes exert no significant effects on either satisfaction or dissatisfaction. Therefore, firms should reduce resource allocation to both categories to avoid blind investment, and this study excludes both categories from the scope of the priority discussion. This category-specific specification allows attributes within the same Kano category to be ranked according to the impact dimension most relevant to their improvement value.
On this basis, considering the important role of improvement potential in product improvement [79], this study incorporates improvement potential into the MCD of attributes to evaluate intra-class priorities and construct AIPS. This method simultaneously captures the impact intensity of attributes and their actual room for improvement, thereby providing a more targeted basis for identifying priority improvement targets under constrained resources.
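In Equation (6), two binary indicator coefficients, whose values are defined in Equation (7), specify which directional component enters AIPS, while the corresponding MCD value determines the effect magnitude. Consistent with the category-specific specification above, only the neutral-to-positive component enters for A-type attributes, only the negative-to-neutral component enters for M-type attributes, and both components enter for O-type attributes.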
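The role of the binary indicator coefficients can be illustrated with a hedged sketch. The indicator values per category follow the category-specific reasoning in this section, but the remaining choices are assumptions made only for illustration: min-max scaling stands in for the normalization step, improvement potential is taken as one minus the normalized performance, and the two parts are combined by multiplication. None of these should be read as the exact form of Equations (5) and (6).

```python
# Indicator coefficients per Kano category: (pos-neu component, neu-neg component).
INDICATORS = {"A": (1, 0), "M": (0, 1), "O": (1, 1)}


def min_max_normalize(values):
    """Scale a dict of values to [0, 1] (assumed normalization)."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {key: 0.0 for key in values}
    return {key: (value - low) / (high - low) for key, value in values.items()}


def priority_score(category, mcd_pos_neu, mcd_neu_neg, normalized_performance):
    """Hypothetical AIPS-style score: directional impact times room to improve."""
    alpha, beta = INDICATORS[category]
    impact = alpha * mcd_pos_neu + beta * mcd_neu_neg
    improvement_potential = 1.0 - normalized_performance  # assumption
    return impact * improvement_potential


# Invented numbers for two A-type attributes.
performance = min_max_normalize({"Design": 0.10, "Battery": 0.30, "Service": 0.50})
scores = {
    "Design": priority_score("A", 0.30, 0.05, performance["Design"]),
    "Battery": priority_score("A", 0.40, 0.02, performance["Battery"]),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumptions, an attribute with a smaller MCD can still rank first if its current performance leaves more room for improvement.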
4. Case Study
4.1. Data Description and Pre-Processing Results
JD.com (http://www.jd.com) and Taobao.com (https://www.taobao.com) both have large, active customer bases, particularly in electronics and smart devices [8]. With authentic transaction contexts, these two platforms provide abundant online review data for research on product performance analysis and customer demand mining. Therefore, this study regards them as ideal data sources for identifying customer needs in the product domain.
To validate the effectiveness of the proposed method, this study selected smartwatches as the research case. The market for this type of product is relatively mature, and customer cognition is relatively stable, which helps reduce data interference and improve the accuracy and consistency of Kano category classification. In addition, smartwatches typically generate a large volume of high-quality customer reviews, providing a rich data foundation for attribute extraction, sentiment identification, and model validation. We selected eight representative mainstream smartwatch brands with high customer recognition as the research objects. Using web-crawling technology, this study collected customer reviews from self-operated flagship stores on the two aforementioned platforms from June 2024 to March 2025, yielding a total of 38,027 original reviews, including 17,316 from Taobao.com and 20,711 from JD.com. After text cleaning and pre-processing, a total of 35,640 valid reviews were retained. Specifically, this study strictly cleaned the data according to predefined rules by removing duplicate reviews, blank or garbled reviews, obviously meaningless reviews, and reviews irrelevant to the research object, thereby minimizing the effects of noise and subjective selection bias on the analysis results. To ensure the reliability of the cleaning results, this study further compared the samples before and after cleaning with respect to major characteristics, such as review time distribution and platform source, and found no significant shift in the overall distribution.
Table 2 provides detailed information for each product.
4.2. Product Attribute Extraction Results
In this study, the BTM was implemented in Python after data pre-processing. Drawing on previous studies, this study set the expected number of topics to
K,
σ, and
τ to 50/
K, 0.01, and 1, respectively, and the number of iterations and the random state to 1000 and 1, respectively. By comparing topic coherence scores across different topic numbers, 19 was selected as the expected number of topics because it achieved a high coherence score and semantically interpretable topics; after manual inspection and synonym merging of semantically similar topics, 11 topics were ultimately determined, with a maximum coherence score of 0.646. On this basis, domain experts semantically named the keywords under each topic.
Table 3 presents the 11 identified attributes and the five most representative keywords for each.
4.3. Attribute Sentiment Analysis Results
To enable attribute-level sentiment analysis, this study matched review texts to the attribute lexicon and constructed input samples in the form of “attribute–review” pairs. Specifically, when a representative keyword of a given attribute appeared in a review, that attribute was paired with the entire review to generate a new input record; if the same review involved multiple attributes, multiple corresponding “attribute–review” samples were constructed. After attribute matching, 11,532 attribute–review pairs were generated as supervised samples. All pairs were manually annotated as positive, neutral, or negative by two trained research team members over two months using a unified annotation guideline. Positive labels indicated explicit approval, support, optimism, or other favorable attitudes; negative labels indicated explicit criticism, concern, rejection, pessimism, or other unfavorable attitudes; and neutral labels indicated factual, descriptive, ambiguous, balanced, or non-dominant evaluative content. Borderline cases were labeled as neutral when the sentiment orientation was weak, implicit, mixed, or insufficiently supported by textual evidence. To assess annotation reliability, Cohen’s κ was calculated based on the two annotators’ independent labels, yielding κ = 0.82, indicating strong inter-annotator agreement. A domain expert resolved disagreements or ambiguous cases to determine the final labels.
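Cohen’s κ compares the observed agreement between two annotators with the agreement expected by chance. A minimal sketch of the computation is given below; the two label sequences are invented examples and do not reproduce the annotation data of this study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pos", "pos", "neu", "neg", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "pos", "neu", "neg", "neu", "neu", "pos", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```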
The finalized dataset was split into training, validation, and test sets at an 8:1:1 ratio. The BERT-A-Conv model was initialized with the Google Chinese BERT-Base pretrained checkpoint. Hyperparameter tuning was performed empirically based on the model’s performance on the validation set. Key hyperparameters, including the learning rate, batch size, dropout rate, convolution kernel size, and number of convolution filters, were adjusted and selected according to the validation macro-F1 score. The maximum sequence length, embedding dimension, number of attributes, convolution kernel size, number of convolution filters, dropout rate, and batch size were set to 512, 768, 11, 3, 128, 0.3, and 16, respectively. The model was optimized using Adam with a learning rate of and trained for up to 5 epochs. Early stopping was applied with a patience of 3 based on the validation macro-F1 score, and the best-performing checkpoint on the validation set was retained for testing. The experiments were conducted on an NVIDIA GeForce RTX 4060 Laptop GPU, and each BERT-A-Conv training run took approximately 8 h under the above parameter settings. Across five independent runs with different random seeds, the model achieved macro-averaged Precision, Recall, and F1-score values of 0.926 ± 0.004, 0.941 ± 0.005, and 0.933 ± 0.004, respectively, on the test set, indicating stable performance in attribute-level sentiment identification. To avoid relying on an unusually favorable single run, the downstream attribute-level sentiment distribution analysis was conducted using the median-performing run, defined as the third-ranked run by validation macro-F1 among the five independent runs.
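The early-stopping rule and the selection of the median-performing run can be sketched without any training code. The function below consumes a precomputed list of validation macro-F1 scores as a stand-in for real epochs; all scores are hypothetical.

```python
def run_with_early_stopping(validation_scores, patience=3):
    """Return (best_epoch, best_score, epochs_run) for a sequence of
    per-epoch validation scores, stopping after `patience` epochs
    without improvement."""
    best_score, best_epoch, stale_epochs, epochs_run = float("-inf"), None, 0, 0
    for epoch, score in enumerate(validation_scores, start=1):
        epochs_run = epoch
        if score > best_score:
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_epoch, best_score, epochs_run


def median_run_index(validation_f1_per_run):
    """Index of the third-ranked run among five, ranked by validation macro-F1."""
    ranked = sorted(range(len(validation_f1_per_run)),
                    key=lambda i: validation_f1_per_run[i], reverse=True)
    return ranked[2]


result = run_with_early_stopping([0.80, 0.91, 0.90, 0.905, 0.90])
chosen_run = median_run_index([0.931, 0.936, 0.929, 0.934, 0.938])
print(result, chosen_run)
```

With at most 5 epochs and a patience of 3, early stopping can only shorten a run whose best checkpoint occurs in the first two epochs, so the checkpoint-retention step carries most of the practical effect.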
4.4. MCD-Kano Classification Results
To reduce model bias and ensure the reliability of the analytical procedure, we divided the dataset into training, validation, and test sets at a ratio of 6.5:1:2.5. The original review ratings were binarized into satisfaction labels: ratings of 4 and 5 were coded as satisfied reviews, denoted as class 1, whereas ratings of 1–3 were coded as not satisfied reviews, denoted as class 0. After binarization, class 0 accounted for 38.22% of the samples, while class 1 accounted for 61.78%. In the LightGBM model, we set the initial learning rate to 0.05 and the maximum number of iterations (n_estimators) to 1000. We also introduced an early stopping mechanism (early_stopping_rounds = 50) to prevent overfitting. Using the training set, we tuned the model hyperparameters through five-fold cross-validation, while the validation set was used to monitor early stopping. The optimal configuration is reported in Table A1 of Appendix A. After hyperparameter tuning based on the validation macro-F1 score, the model was retrained on the combined training and validation sets and evaluated on an independent test set.
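The label binarization and the 6.5:1:2.5 split can be sketched as follows. A plain shuffled split with a fixed seed is assumed here; whether the split in this study was stratified by class is not specified, and the ratings below are invented.

```python
import random


def binarize_rating(rating):
    """Ratings of 4 and 5 become class 1 (satisfied); ratings of 1-3 become class 0."""
    return 1 if rating >= 4 else 0


def split_indices(n_samples, ratios=(0.65, 0.10, 0.25), seed=1):
    """Shuffle sample indices and split them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * ratios[0])
    n_val = round(n_samples * ratios[1])
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


ratings = [5, 4, 3, 2, 1, 5, 4, 4]
labels = [binarize_rating(r) for r in ratings]
train_idx, val_idx, test_idx = split_indices(1000)
print(labels, len(train_idx), len(val_idx), len(test_idx))
```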
Under these settings, the LightGBM model was evaluated across five independent runs with different random seeds to assess its prediction stability. The model achieved mean ± SD Precision, Recall, and F1-score values of 0.915 ± 0.004, 0.944 ± 0.006, and 0.929 ± 0.005, respectively, on the test set. Using the same dataset and optimization strategy, we also evaluated XGBoost and Bagging–Random Forest, and Table A2 of Appendix A presents the complete comparative results. Based on the final LightGBM model, SHAP values were computed on the test set to examine the contribution of each sentiment attribute to CS predictions.
Table 4 presents the SHAP-derived MCD values. Most attributes show positive MCD values for the neutral-to-positive transition, indicating that positive sentiment generally contributes to CS improvement. Price, Battery, Quality, Practicality, Design, Fitness tracking, and Operate present relatively high values for this transition, suggesting stronger satisfaction gains when customer perceptions shift from neutral to positive. By contrast, Service and Wearing show negative values for this transition, implying limited additional satisfaction gains from positive sentiment for these attributes.
The results further reveal the role of attributes in dissatisfaction reduction. Practicality, Operate, Fitness tracking, Wearing, Service, and Connectivity present relatively high positive MCD values for the negative-to-neutral transition, indicating that moving from negative to neutral sentiment substantially improves CS. In contrast, Design and Price show negative values for this transition, while App support, Quality, and Battery show only limited positive values, suggesting weaker effects in reducing dissatisfaction.
These differences explain the resulting Kano classifications. Based on interviews with product development engineers and cost analysis experts, we set ς to 0.05 and selected δ = 2 to distinguish between different curve shapes. A-type attributes, including App support, Design, Battery, Price, and Quality, are mainly characterized by stronger gains from neutral-to-positive sentiment shifts. O-type attributes, including Connectivity, Practicality, Fitness tracking, and Operate, are characterized by relatively balanced marginal contribution differences across adjacent sentiment transitions, indicating a more consistent association between sentiment shifts and CS changes. M-type attributes, including Service and Wearing, are characterized by larger MCD values for the negative-to-neutral transition, suggesting that their primary role is to prevent dissatisfaction rather than to generate additional satisfaction gains.
4.5. Attribute Improvement Priority Assessment Results Based on MCD
The performance values, calculated using the equations, along with their normalized results, are presented in Table 5. Practicality has the highest performance, indicating that this attribute is currently highly recognized; by contrast, Wearing has the lowest performance, suggesting that it still has considerable room for improvement. The remaining attributes fall between these two extremes.
By integrating the MCD and the improvement potential of attributes,
Figure 6 presents the AIPS-based priority results. The results show that the improvement priority is not determined solely by the Kano category. Attributes within the same Kano category may have different priority levels because they differ in both current performance and marginal contribution to CS or dissatisfaction. For example, although Wearing and Service are both classified as M-type attributes, Wearing receives a higher AIPS value because it has a lower current performance and a stronger negative impact on CS. Similarly, among A-type attributes, Battery ranks higher than App support because it has greater improvement potential and a higher positive marginal contribution. Among O-type attributes, Operate receives the highest priority because its current performance remains relatively low while its total contribution is high. These results indicate that Kano-based prioritization should account for intra-category heterogeneity rather than assuming that attributes within the same category have equivalent improvement value. By integrating MCD with improvement potential, AIPS extends Kano analysis from requirement classification to fine-grained priority assessment, thereby providing a more nuanced basis for data-driven product improvement decisions.
In summary, the definition of AIPS provides a clear quantitative basis for enterprises to optimize resource allocation and develop improvement strategies. When optimizing products or services, enterprises should adopt differentiated improvement strategies. For M-type attributes, deficiencies should be addressed as a priority to avoid customer dissatisfaction caused by inadequate basic functions. For O-type attributes, firms should prioritize improvements in core performance to achieve steady CS growth. A-type attributes can serve as important leverage points for enhancing product competitiveness and building differentiated advantages. By directing limited resources toward attributes with high AIPS values, enterprises can not only enhance CS more effectively but also maximize the benefits of improvement under cost constraints.
4.6. Sensitivity Analysis and Comparative Evaluation
4.6.1. Sensitivity Analysis of Threshold Parameters
To assess the appropriateness of the threshold settings in the MCD-Kano model, this study conducted a sensitivity analysis as a robustness check. Specifically, ς was varied across {0.02, 0.05, 0.10}, and δ across {1.5, 2.0, 3.0} to evaluate the stability of the MCD-Kano classification results under alternative threshold settings.
Table 6 shows that the MCD-Kano classification is insensitive to changes in ς, as all 11 attributes retain the same Kano category across the tested values. For δ, the classification results remain identical to the baseline when δ is decreased from 2.0 to 1.5. Only when δ is increased to 3.0 do two attributes, Design and Wearing, change category, while the remaining 9 of the 11 attributes retain their original Kano categories. This variation can be explained by the role of δ in the classification rules. In the proposed model, δ serves as the cutoff ratio for distinguishing asymmetric from one-dimensional effects. Increasing δ from 2.0 to 3.0 imposes a stricter requirement for identifying A-type or M-type attributes. As a result, attributes whose asymmetric marginal contributions are not strong enough to satisfy the higher cutoff may be reclassified as O-type. This explains why Design shifts from A-type to O-type and Wearing shifts from M-type to O-type under δ = 3.0.
These changes are consistent with the MCD-Kano classification rules and suggest that Design and Wearing are located near the decision boundary between asymmetric and one-dimensional effects, rather than being arbitrarily classified. Moreover, the changes follow the expected direction from asymmetric categories to the one-dimensional category, without irregular or contradictory shifts. Therefore, the sensitivity analysis does not undermine the rationality of the baseline classification. Instead, it shows that most attributes remain stable under reasonable threshold settings, while the few observed changes are rule-consistent and interpretable.
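The threshold sweep can be sketched in a few lines. The rule in `classify` is an assumption made for illustration, not the paper's exact rule set: it treats ς as a magnitude floor below which an attribute is indifferent and δ as a cutoff on the ratio of absolute MCD values, and it omits R-type handling. The two MCD pairs are the Service and Wearing values reported in this paper (0.898 and 0.952 against −0.144 and −0.455), read here as negative-to-neutral first and neutral-to-positive second.

```python
def classify(d_neg, d_pos, sigma=0.05, delta=2.0):
    """Assumed rule: d_neg is the negative-to-neutral MCD,
    d_pos is the neutral-to-positive MCD."""
    a, b = abs(d_neg), abs(d_pos)
    if a < sigma and b < sigma:
        return "I"  # neither transition has a noticeable effect
    if a >= delta * b:
        return "M"  # dissatisfaction reduction dominates
    if b >= delta * a:
        return "A"  # satisfaction gain dominates
    return "O"      # effects of comparable magnitude on both sides


# MCD values reported for the two M-type attributes.
mcd = {"Service": (0.898, -0.144), "Wearing": (0.952, -0.455)}

for delta in (1.5, 2.0, 3.0):
    for sigma in (0.02, 0.05, 0.10):
        result = {name: classify(dn, dp, sigma, delta)
                  for name, (dn, dp) in mcd.items()}
        print(delta, sigma, result)
```

Under this assumed rule, Wearing sits close to the boundary because its magnitude ratio is 0.952 / 0.455 ≈ 2.09, which clears a cutoff of 2.0 but not 3.0, whereas the ratio for Service (0.898 / 0.144 ≈ 6.24) clears all three tested cutoffs.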
4.6.2. Comparison of Product Attribute Extraction Models
In the product attribute extraction stage, two topic modeling methods, LDA and BTM, were employed. To ensure a fair comparison, the analysis applied both models to the same product review dataset, fixed the number of topics at 19, and then calculated topic coherence scores to assess model quality. As shown in
Figure 7, BTM achieved a significantly higher score than LDA, demonstrating stronger topic coherence and structural identification capability.
Further manual interpretation and semantic analysis showed that the high-frequency words generated by BTM had higher semantic concentration and fewer redundant or meaningless words, indicating that BTM can more effectively capture the core attribute information in the text and improve the accuracy and interpretability of attribute extraction. Combined with the quantitative results and manual analysis, the findings of this study are consistent with those of Zhang et al. [
31], namely that when processing short or highly sparse texts, BTM can significantly improve topic coherence and semantic aggregation by modeling word-pair co-occurrence patterns while reducing noise interference.
4.6.3. Comparison of Sentiment Analysis Models
To evaluate the contribution of the main components of the proposed model, this study conducted an ablation analysis by comparing the full BERT-A-Conv model with four baseline models: BERT, BERT-CNN, CNN, and BERT-attention. These comparisons were designed to examine the effects of contextual semantic representation, convolutional feature extraction, and attention-based attribute modeling on attribute-level sentiment classification.
As shown in
Table 7, the proposed model achieved the highest macro-average Precision, Recall, and F1-Score, with low standard deviations of 0.004, 0.005, and 0.004, respectively, indicating stable performance across five independent runs. To further assess the consistency of these performance differences, pairwise one-sided Wilcoxon signed-rank tests were conducted using macro-F1 scores from the five runs, showing that BERT-A-Conv significantly outperformed the compared models at the 0.05 significance level.
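For readers interpreting a signed-rank test on only five paired runs, the following sketch computes the exact one-sided p-value by enumerating all sign patterns. It assumes the exact form of the test with no ties or zero differences; the function name and the paired differences are hypothetical and are not the experimental values.

```python
from itertools import product


def wilcoxon_one_sided_exact(diffs):
    """Exact P(W+ >= observed) under H0, for small samples without ties or zeros."""
    # Rank absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely: enumerate all 2**n of them.
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(diffs))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return hits / 2 ** len(diffs)


# Hypothetical macro-F1 differences (proposed minus baseline) over five seeds.
diffs = [0.012, 0.009, 0.015, 0.007, 0.011]
p = wilcoxon_one_sided_exact(diffs)
print(p)  # 0.03125
```

With five pairs there are 2^5 = 32 sign patterns, so the smallest attainable exact one-sided p-value is 1/32 = 0.03125. Significance at the 0.05 level therefore requires the proposed model to outperform the baseline in every one of the five runs.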
4.6.4. Comparison of Kano-Based Attribute Classification and Prioritization
To examine whether the attribute classification results derived from online reviews are consistent with consumers’ needs, a standard Kano questionnaire was adopted for external validation. A total of 350 questionnaires were distributed. After excluding invalid responses that failed the attention check, selected the same option for all items, or had abnormally short completion times, 311 valid responses were obtained. The questionnaire design followed the standard Kano method, with one pair of functional and dysfunctional questions designed for each product attribute. The functional question asked how consumers would feel if the attribute existed or performed well, whereas the dysfunctional question asked how consumers would feel if the attribute were absent or performed poorly. Each question was measured using a five-point scale ranging from “Like” to “Dislike”. The attributes were then classified as A, O, M, I, R, or questionable (Q-type) according to the Kano evaluation table, using the responses to the functional and dysfunctional questions. The descriptive statistics of the respondents are reported in
Table A3 in
Appendix A.
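The questionnaire-based classification can be sketched with the conventional Kano evaluation table. Two points are assumptions: the three intermediate answer labels (the text names only the endpoints "Like" and "Dislike"), and the aggregation of respondents by the most frequent category. The responses are hypothetical.

```python
from collections import Counter

# Conventional five answer options, ordered from "Like" to "Dislike".
ANSWERS = ["like", "must-be", "neutral", "live-with", "dislike"]

# Rows: functional answer; columns: dysfunctional answer (order as in ANSWERS).
TABLE = [
    ["Q", "A", "A", "A", "O"],
    ["R", "I", "I", "I", "M"],
    ["R", "I", "I", "I", "M"],
    ["R", "I", "I", "I", "M"],
    ["R", "R", "R", "R", "Q"],
]


def kano_category(functional, dysfunctional):
    """Look up one respondent's category for one attribute."""
    return TABLE[ANSWERS.index(functional)][ANSWERS.index(dysfunctional)]


def classify_attribute(responses):
    """Return the most frequent category among (functional, dysfunctional) pairs."""
    counts = Counter(kano_category(f, d) for f, d in responses)
    return counts.most_common(1)[0][0]


# Hypothetical answers from four respondents for a single attribute.
responses = [
    ("like", "dislike"),
    ("neutral", "dislike"),
    ("must-be", "dislike"),
    ("like", "neutral"),
]
print(classify_attribute(responses))
```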
On this basis, the classification results obtained by the proposed method were compared with those of the standard Kano questionnaire and the method of Joung and Kim [80]. To enable comparison with binary-sentiment-based Kano classification, the three sentiment categories were transformed into binary sentiment by merging neutral sentiment into the corresponding positive or negative category under the Joung and Kim setting. The reliability of the attribute-level sentiment classification and CS prediction models was ensured before the subsequent marginal-contribution calculation and Kano classification were performed.
Table 8 summarizes the comparison of all 11 attributes across the three classification methods. Taking the standard Kano questionnaire results as the external validation benchmark, the proposed method was consistent with the questionnaire results for all attributes. In contrast, the Joung and Kim method produced two inconsistent classifications, with Service and Wearing identified as O-type attributes. Further analysis shows that the Joung and Kim method identifies Kano types mainly based on the sign direction of the marginal contributions of positive and negative sentiments; therefore, Service and Wearing were classified as O-type attributes. However, relying solely on positive and negative sentiments captures only marginal effects at the two endpoints and cannot reveal differences in attribute effects during transitions between sentiment states. After incorporating neutral sentiment, the proposed method shows that the negative-to-neutral MCD values for Service and Wearing are 0.898 and 0.952, respectively, which are much higher than their neutral-to-positive MCD values (−0.144 and −0.455, respectively). This pattern indicates that the satisfaction response curves for these two attributes increase markedly from negative to neutral but show no further increase from neutral to positive, overall presenting an inverted V-shaped trend with neutral sentiment as the peak, which is consistent with the characteristics of M-type attributes. In addition, although the classification of A-type attributes is consistent, the Joung and Kim method cannot reflect the characteristics of A-type curves. For App support, Design, Battery, Price, and Quality, the negative-to-neutral MCD values are mostly close to 0 or negative, whereas all neutral-to-positive MCD values are clearly positive, indicating that satisfaction changes little, or even declines slightly, when sentiment shifts from negative to neutral but increases significantly when it shifts from neutral to positive. Thus, these attributes exhibit an overall V-shaped pattern, with neutral sentiment as the low point and the increase on the right side substantially greater than the change on the left. For Connectivity, Practicality, Fitness tracking, and Operate, both MCD values are positive, indicating that satisfaction increases continuously from negative to neutral and then to positive, conforming to a monotonically increasing pattern and consistent with the characteristics of O-type attributes.
In summary, treating only positive and negative sentiments as endpoints is equivalent to assuming a two-endpoint shape for the satisfaction response curve. By incorporating neutral sentiment, the model introduces an intermediate reference point and captures the marginal changes across the three states of “negative–neutral–positive,” thereby distinguishing O, A, M, R, and I-type attributes more effectively. This result also supports Tang et al.’s [
81] view that neutral sentiment is not noise but should be separately identified and modeled. Therefore, incorporating neutral sentiment helps reduce classification bias and improves the rationality and reliability of Kano classification results.
To further validate the practical relevance of the AIPS-based within-category rankings, we adopted an expert-judgment procedure. Experts with experience in product development and cost evaluation rated AIPS-derived within-category pairwise priority relationships using a seven-point Likert scale, combining pairwise comparison for relative-priority judgment with expert-based validation practices [
82,
83].
After evaluating 17 pairwise priority relationships, the experts showed strong agreement with the AIPS-derived rankings. Specifically, 16 comparisons received mean scores between 5.60 and 6.80, indicating broad support for the within-category priority relationships. The only exception was Design over Quality, which received a lower score and greater dispersion (Mean = 4.00, SD = 1.22). This divergence may be due to the small AIPS difference between the two attributes and to potential biases in online review data, in which customers may more readily express visible design impressions than quality-related concerns unless failures occur. It may also reflect experts’ greater emphasis on reliability, durability, failure risk, and long-term brand trust. Thus, closely ranked attributes should be interpreted with caution when online review signals and practical development considerations differ.
5. Discussion and Conclusions
The MCD-Kano model proposed in this study aims to decode CS for e-commerce products and accurately classify and prioritize the attributes that influence CS, thereby providing decision support for product improvement in review-rich e-commerce contexts. This method comprises key stages, including attribute extraction, attribute-level sentiment analysis, construction of the MCD-Kano model, and within-category attribute priority determination, forming an analytical process from semantic modeling to optimization decision-making. This systematic framework helps address several limitations of previous studies, including the neglect of neutral sentiment, insufficient accuracy in Kano classification, and ambiguous priority determination, thereby providing a basis for product iteration and improvement.
In the case study, 11 key attributes of best-selling smartwatches were systematically analyzed and classified into 4 O-type, 5 A-type, and 2 M-type attributes using the Kano model. The limited number of M-type attributes suggests that only a small proportion of the identified attributes are perceived as basic requirements. Nevertheless, these M-type attributes remain critical, as poor performance in them is more likely to cause customer dissatisfaction and adversely affect overall product evaluations. The relatively high proportion of A-type attributes suggests that more attributes are closely associated with enhancing CS and creating differentiated customer experiences. At the same time, this pattern may also stem from the relatively short time since some products were launched, reflecting the strong market enthusiasm commonly observed in early-adopter feedback.
The results show that Wearing was identified as a key M-type attribute. This classification is highly consistent with the physical characteristics of smartwatches. Smartwatches usually need to remain in close contact with the customer’s body for extended periods and are often worn throughout the day to support continuous health monitoring, activity tracking, and sleep monitoring. Therefore, even if the product performs well in other functional dimensions, discomfort during wear, excessive device weight, or unsuitable strap materials may directly undermine the customer experience and significantly reduce customers’ overall satisfaction with the product. Beyond basic requirements, firms should also make full use of differentiated attributes to enhance the perceived value of their products. Notably, the model results indicate that Battery was identified as the highest-priority A-type attribute. This finding reflects the tension between the current technological bottlenecks in the wearable device industry and consumers’ expectations. Compared with traditional watches, smartwatches integrate multiple functions, including health monitoring, message notifications, location services, and app-based interactions. As a result, they have higher levels of power consumption, and battery life has long been a core concern for customers. When a product achieves significant breakthroughs in battery stability, battery life, or charging efficiency, this attribute can become a powerful “delighter,” substantially exceeding customer expectations and further serving as an important driver of brand differentiation. Finally, regarding O-type attributes, the results suggest that operational experience, including response speed, ease of operation, and system fluency, is a key area for achieving linear improvements in CS. 
Given the inherent limitations of smartwatch interactions on small-screen interfaces, a smooth, efficient operational experience can directly translate into higher-quality perceived use. Therefore, firms should adopt a phased iterative strategy in product optimization: first, they should strictly ensure the performance of M-type attributes to prevent customer loss caused by unmet basic requirements; second, they should continuously improve O-type attributes to steadily enhance product usability and operational efficiency; and finally, they should prioritize investment in A-type attribute innovation to strengthen product differentiation and further consolidate market competitiveness.
This paper provides theoretical support and managerial insights for evaluating product satisfaction and improving enterprise products in the e-commerce context.
5.1. Theoretical Contribution
This study makes three main theoretical contributions. First, this study extends research in information systems on user-generated content mining and the intelligent analysis of customer feedback. This paper regards attribute-level sentiment information in online reviews as the key analytical unit linking the identification of customer feedback with the construction of demand knowledge, and develops an interpretable Kano analysis framework. This framework integrates attribute extraction, attribute-level sentiment analysis, Kano requirement classification, and interpretive analysis into a unified logic, thereby providing a clearer theoretical pathway for understanding how user-generated content can be systematically transformed into product requirement knowledge.
Second, this study advances the application of the Kano model and the theory of asymmetric CS in the context of online reviews. Traditional online-review-based customer requirement analysis often adopts a binary positive–negative sentiment classification, thereby neglecting the large number of neutral or weak sentiment expressions in customer evaluations. This paper introduces a neutral sentiment state and, drawing on game-theoretic logic, employs the SHAP method to measure differences in the marginal contributions of product attributes to CS and dissatisfaction under different sentiment states. On this basis, it constructs a Kano classification method and extends customer feedback from the traditional positive–negative sentiment classification to multi-dimensional satisfaction signals that include positive, negative, and neutral evaluations. This method helps reveal, in greater detail, the differentiated effects of sentiment states on CS and dissatisfaction, and provides a more interpretable theoretical basis for explaining the asymmetric role of product attributes in the structure of customer requirements.
Third, this study deepens research on decision support for product improvement. Traditional Kano analysis mainly focuses on the classification of attribute categories but gives insufficient attention to differences in attribute values within the same category and to the order of improvement. Drawing on two-factor theory, this paper combines differences in marginal contribution with improvement potential and distinguishes the evaluative dimensions most relevant to different Kano categories, thereby constructing an evaluation mechanism for prioritizing attribute improvement. This enables Kano analysis to extend from requirement category identification to attribute-level improvement ranking. Accordingly, this study provides a more refined theoretical–analytical perspective on product iteration and data-driven decision support in complex product contexts.
5.2. Practical Contribution
First, this study offers more precise and actionable guidance for enterprise product development and resource allocation. It enables product managers to move beyond the tendency to add features indiscriminately and instead direct limited budgets toward core attributes with greater satisfaction-conversion efficiency, thereby improving the effectiveness of product investment and maximizing returns.
Second, compared with traditional questionnaire-based approaches that are often constrained by delayed feedback, the automated analytical framework developed in this study helps e-commerce firms establish an agile market-monitoring mechanism. In practice, the framework can be embedded into regular reporting systems or decision dashboards to generate attribute priorities, satisfaction drivers, and emerging pain-point signals. For firms with limited technical resources, a simplified implementation focusing on attribute extraction and sentiment tracking can still provide useful support for early-stage product optimization. Moreover, the framework may be further extended to other consumer electronics products with sufficient online review data, thereby providing scalable support for customer need monitoring and managerial decision-making.
Third, this methodological framework is not limited to supporting product R&D; it also provides direct managerial value to marketing and customer service functions. By generating fine-grained attribute priorities, it enables marketing departments to identify differentiated selling points and helps customer service teams accurately detect pain-point attributes associated with customer dissatisfaction. As a result, firms can formulate proactive service recovery strategies to improve customer retention and business conversion in the e-commerce context.
5.3. Limitations and Future Research Directions
Although the proposed method and case study have been rigorously validated, future research should further address the following limitations. First, the current data are mainly derived from online reviews collected from two Chinese e-commerce platforms and focus on a single product category; therefore, the generalizability of the findings should be interpreted with caution. In addition, the measurement of satisfaction in this study is constrained by the indicators available on e-commerce platforms. Following previous studies, we used review ratings as a proxy for CS; however, ratings may not fully capture the complexity of customer perceptions. Therefore, future research could incorporate other satisfaction-related indicators, such as repurchase behavior, return records, or follow-up survey data, to further validate the proposed framework. Second, although expert- and consumer-based validation provides preliminary evidence of the method’s effectiveness, the systematic approach proposed in this study has not yet been applied or tested in actual product improvement processes. Therefore, future research should extend this approach to field implementation in relevant departments and establish a closed-loop verification mechanism through before-and-after comparisons of product improvement decisions, thereby bridging the gap between theoretical deduction and practical application. In addition, the relatively short evaluation periods of some products may introduce early-adopter bias, potentially skewing sentiment evaluations toward the positive. Future longitudinal studies could further examine how these Kano classifications evolve as products mature.