2.1. Text Sentiment Analysis
Developments in the field of affective computing have provided technical means to analyze consumer emotions [14]. Among them, text sentiment analysis is one of the most widely used affective computing methods [15]. Based on natural language processing, emotional tendencies (positive/negative, etc.) and their intensities can be automatically identified from consumer comments and social media posts. Commonly used methods for Chinese sentiment analysis fall into two categories: sentiment dictionary-based computation and machine learning-based classification [16]. Lexicon methods use pre-constructed sentiment dictionaries to calculate sentiment scores from the frequency of positive and negative sentiment words in a text, while machine learning methods train classification models (e.g., SVM, deep learning) to directly predict text sentiment polarity. In recent years, pre-trained language models (e.g., BERT and its Chinese variants) have achieved significant improvements on sentiment analysis tasks [17]. For example, Chinese RoBERTa-wwm-ext, pre-trained with whole-word masking to better model context, can capture subtle sentiment features of review texts and thus improve classification accuracy. Sentiment analysis has also been applied to consumer behavior research [18], such as analyzing e-commerce reviews to assess user satisfaction or predicting product sales from social opinion. In this study, we combine RoBERTa-wwm-ext with a sentiment lexicon to determine the sentiment polarity and intensity of evaluative social media texts, and we incorporate these measures into the consumption prediction model as influencing factors.
First, we fine-tune the pre-trained language model RoBERTa-wwm-ext for sentiment classification; this is a RoBERTa model trained with whole-word masking (WWM) on a Chinese corpus and has performed well on several Chinese NLP tasks [19]. We use it to obtain contextual semantic representations of the text and add a fully connected layer to output sentiment polarity (positive, negative, or neutral). Compared with traditional TF-IDF plus machine-learning approaches, the pre-trained model better captures implicit emotional meaning, e.g., the emotional connotations of slang and Internet buzzwords, which is especially important for the informal expressions commonly used by Generation Z. A total of 856 customer reviews were scraped from Jellycat’s official website; the texts were lower-cased, tokenized with Jieba, and padded or truncated to 128 tokens. The model’s final hidden state is passed through a fully connected layer that outputs the class-probability vector:
$$
(P_{\mathrm{neg}},\, P_{\mathrm{neu}},\, P_{\mathrm{pos}}) = \mathrm{softmax}(W h + b),
$$

where $h$ is the final hidden representation of the text, $W$ and $b$ are the parameters of the classification layer, and $P_{\mathrm{neg}}$, $P_{\mathrm{neu}}$, and $P_{\mathrm{pos}}$ denote the model’s belief that the text is negative, neutral, or positive, respectively; by construction, $P_{\mathrm{neg}} + P_{\mathrm{neu}} + P_{\mathrm{pos}} = 1$. Fine-tuning uses a learning rate of $2 \times 10^{-5}$, a batch size of 32, and five training epochs to avoid over-fitting.
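For concreteness, the following is a minimal sketch of this fine-tuning setup, assuming the publicly released checkpoint hfl/chinese-roberta-wwm-ext and the Hugging Face transformers Trainer API; the three in-line reviews are placeholders for the scraped data, and the transformer input uses the checkpoint’s own tokenizer.

```python
# Minimal sketch: fine-tuning RoBERTa-wwm-ext for 3-class sentiment classification.
# Assumes the public checkpoint "hfl/chinese-roberta-wwm-ext"; the tiny dataset is a placeholder.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

class ReviewDataset(Dataset):
    """Wraps raw review strings and integer labels (0 = negative, 1 = neutral, 2 = positive)."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.labels[idx]
        return item

# Placeholder reviews; the real training set is the 856 scraped customer reviews.
train_texts = ["这个玩偶很治愈", "还可以吧", "有点失望"]
train_labels = [2, 1, 0]

# Hyperparameters follow the settings reported above.
args = TrainingArguments(output_dir="roberta_sentiment", learning_rate=2e-5,
                         per_device_train_batch_size=32, num_train_epochs=5)
trainer = Trainer(model=model, args=args,
                  train_dataset=ReviewDataset(train_texts, train_labels))
trainer.train()

# At inference time, a softmax over the logits yields (P_neg, P_neu, P_pos).
inputs = tokenizer(["超级治愈，很喜欢！"], return_tensors="pt", truncation=True, max_length=128)
probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)
```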
Due to the specific characteristics of Chinese texts, we combine the deep learning approach based on pre-trained models with a traditional dictionary-based approach to improve analysis accuracy and robustness [20]. If a text $T$ contains $c_i^{+}$ occurrences of the $i$-th positive token and $c_j^{-}$ occurrences of the $j$-th negative token, its raw lexical polarity is

$$
S_{\mathrm{lex}}(T) = \sum_{i} w_i^{+}\, c_i^{+} - \sum_{j} w_j^{-}\, c_j^{-},
$$

where $w_t$ is the lexicon weight of token $t$ and the superscripts “+” and “−” indicate positive and negative polarity, respectively. Degree adverbs, such as “very” or “slightly”, multiply $w_t$ by an intensity factor, while an in-sentence negator (e.g., “not”) flips its sign. The scalar $S_{\mathrm{lex}}$ is mapped to $[0, 1]$ through the logistic squashing function

$$
\sigma(S_{\mathrm{lex}}) = \frac{1}{1 + e^{-S_{\mathrm{lex}}}},
$$

where $1 - \sigma(S_{\mathrm{lex}})$ and $\sigma(S_{\mathrm{lex}})$ serve as pseudo-probabilities of negative and positive sentiment; the neutral slot is left empty because a dictionary alone cannot establish neutrality.
The specific approach is as follows: we use the HowNet (KnowledgeNet) sentiment dictionary together with a customized thesaurus of popular Internet slang to compute sentiment scores after word segmentation of the text. The frequencies of positive and negative sentiment words are counted and combined with the weight corrections of degree adverbs (e.g., “very”, “a little”) and the effect of negation words to obtain a sentiment score for each text. This score reflects the direction and strength of the text’s sentiment. For example, when a user comment contains words such as “like”, “healing”, or “happy” modified by the adverb “especially”, the text is judged to be strongly positive; if negative words such as “disappointment” or “boredom” appear, the text is judged to be negative. The lexical approach is relatively intuitive and explains the contribution of each word to the sentiment value, but it may miss sentences whose sentiment is only implied.
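The fragment below is a minimal sketch of this dictionary-based scoring step; the tiny in-memory word lists are illustrative stand-ins for the HowNet dictionary and the custom slang thesaurus, which would be loaded from files in practice.

```python
# Minimal sketch of lexicon-based sentiment scoring with degree adverbs and negation.
import math
import jieba

POS_WORDS = {"喜欢": 1.0, "治愈": 1.0, "开心": 1.0}      # positive words and base weights
NEG_WORDS = {"失望": 1.0, "无聊": 1.0}                    # negative words and base weights
DEGREE = {"特别": 2.0, "很": 1.5, "有点": 0.5}            # degree adverbs: intensity multipliers
NEGATORS = {"不", "没", "没有"}                           # negators flip polarity

def lexicon_score(text: str) -> float:
    """Return sigma(S_lex) in [0, 1]; values above 0.5 lean positive, below 0.5 lean negative."""
    tokens = list(jieba.cut(text))
    s_lex = 0.0
    for i, tok in enumerate(tokens):
        if tok in POS_WORDS or tok in NEG_WORDS:
            weight = POS_WORDS.get(tok, 0.0) - NEG_WORDS.get(tok, 0.0)
            # Look back a short window for degree adverbs and negators modifying this word.
            for prev in tokens[max(0, i - 3):i]:
                if prev in DEGREE:
                    weight *= DEGREE[prev]
                if prev in NEGATORS:
                    weight *= -1.0
            s_lex += weight
    return 1.0 / (1.0 + math.exp(-s_lex))   # logistic squashing to [0, 1]

print(lexicon_score("这个玩具特别治愈，我很喜欢"))   # strongly positive, close to 1
print(lexicon_score("有点失望，不喜欢"))             # negative, below 0.5
```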
To integrate the advantages of the two approaches, we compare and combine the RoBERTa model’s predictions with the lexicon sentiment scores. If the two results are consistent (e.g., both judged as positive), the confidence of the label is increased; if they disagree, we manually inspect the text to identify the cause of the disagreement, such as the lexicon method misjudging negated or ironic statements, and then adjust the model or the lexicon accordingly. Ultimately, each text record is assigned a sentiment polarity label (positive/neutral/negative) as well as a sentiment intensity score (normalized from the RoBERTa positive probability or the lexicon sentiment value). These sentiment metrics are used as feature inputs for the subsequent modeling.
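As a concrete illustration, the fragment below sketches one possible fusion rule consistent with the description above; it simplifies the comparison to the positive/negative axis, and the averaging of the two scores and the disagreement flag are assumptions rather than the exact rule used in this study.

```python
# Illustrative fusion of the model probabilities and the lexicon score into a label + intensity.
def fuse(p_pos: float, p_neg: float, lex: float):
    """p_pos / p_neg: RoBERTa class probabilities; lex: sigma(S_lex) in [0, 1]."""
    model_label = "positive" if p_pos > p_neg else "negative"
    lex_label = "positive" if lex > 0.5 else "negative"
    if model_label == lex_label:                      # agreement: keep label, average intensities
        intensity = (max(p_pos, p_neg) + abs(lex - 0.5) * 2) / 2
        return model_label, round(intensity, 3)
    return "review_manually", None                    # disagreement: flag for manual inspection

print(fuse(p_pos=0.91, p_neg=0.03, lex=0.82))   # ('positive', ...)
print(fuse(p_pos=0.20, p_neg=0.70, lex=0.75))   # ('review_manually', None)
```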
2.2. Feature Construction and Selection
After obtaining the sentiment indicators, we integrate other information from the questionnaire to construct a complete feature set. Feature construction consists of two parts: one is the direct use or processing of questionnaire items to form features based on theoretical assumptions, and the other is the extraction of data-derived features.
First, based on the questionnaire design, we obtained each respondent’s characteristics by combining or converting multiple questions into scores. The questionnaire covers demographic characteristics such as gender, age, and income level; consumer behavioral characteristics such as purchase experience with and knowledge of therapeutic toys; and psychological attitude scales such as a stress scale, a mood scale, and a product advantage scale.
Each multi-item attitude scale is condensed into a single latent score by simple averaging:

$$
\bar{x}_k = \frac{1}{m_k} \sum_{i=1}^{m_k} x_{ik},
$$

where $x_{ik}$ is respondent $k$’s response to the $i$-th item and $m_k$ is the number of items in that scale. Internal consistency is checked with Cronbach’s $\alpha$, ensuring that each aggregated score behaves unidimensionally. Demographic fields (gender, age, income) and observed behavioral markers (prior purchases, product knowledge) are included unchanged.
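As a minimal sketch (assuming a pandas DataFrame whose columns stress_1 ... stress_4 hold one hypothetical four-item scale), the condensed scale score and Cronbach’s α can be computed as follows.

```python
# Sketch: mean scale score per respondent and Cronbach's alpha for one attitude scale.
# Column names (stress_1 ... stress_4) and the toy data are hypothetical placeholders.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Classic formula: alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

df = pd.DataFrame({                      # toy data: 5 respondents x 4 items on a 1-5 scale
    "stress_1": [4, 2, 5, 3, 4],
    "stress_2": [5, 2, 4, 3, 5],
    "stress_3": [4, 1, 5, 2, 4],
    "stress_4": [5, 2, 4, 3, 4],
})
df["stress_score"] = df.mean(axis=1)     # condensed latent score per respondent
print("Cronbach's alpha:",
      round(cronbach_alpha(df[["stress_1", "stress_2", "stress_3", "stress_4"]]), 3))
```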
For the attitude-scale items, we conducted reliability and validity tests (see the next section for details) and used the mean score of each scale as the measure of the corresponding latent variable [21].
Second, we used the questionnaire information to construct some derived variables. For example, based on the respondents’ ratings of different toy attributes (cute appearance, brand IP, social topics, etc.), we calculated a “healing attribute recognition index”, which represents the overall recognition of the concept of therapeutic toys by users.
Drawing on the responses that rate individual toy attributes $r_{ia} \in [1, 5]$ (e.g., cute appearance, brand IP), we define the “healing attribute recognition index”

$$
\mathrm{HARI}_i = \frac{1}{A} \sum_{a=1}^{A} r_{ia},
$$

with $A$ being the number of rated attributes; larger values indicate broader endorsement of the therapeutic toy concept. In addition, each respondent carries a sentiment intensity score $s_i \in [0, 1]$, a text-derived affective signal that complements the self-reported mood scales. Altogether, we create $p_0$ candidate features, a number that substantially exceeds the sample size $n$, so dimensionality reduction is essential.
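A minimal pandas sketch of this derived feature follows; the attribute column names and ratings are hypothetical placeholders for the questionnaire items.

```python
# Sketch: healing attribute recognition index = mean of the attribute ratings (1-5 scale).
import pandas as pd

ratings = pd.DataFrame({
    "cute_appearance": [5, 3, 4],
    "brand_ip":        [4, 2, 5],
    "social_topic":    [5, 3, 3],
})
ratings["healing_attribute_index"] = ratings.mean(axis=1)  # HARI per respondent
print(ratings)
```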
In addition, the emotional intensity obtained from text sentiment analysis is added to the feature matrix, so that each user has a corresponding emotional tendency value [22]. After this consolidation, we obtain dozens of candidate features, whose number exceeds the sample size and which are at risk of multicollinearity, so they need to be screened.
To avoid redundancy and overfitting, we use the Boruta–SHAP algorithm for feature selection. Boruta–SHAP first identifies potentially relevant features using Boruta [23] and then evaluates and refines feature importance using SHAP values [24]. This framework goes beyond a single traditional feature importance measure by allowing an arbitrary tree model as the base model, providing more efficient, accurate, and interpretable feature selection results [25]. In market segmentation and consumer behavior research, Boruta–SHAP can effectively identify the key factors that drive consumer behavior and help optimize marketing strategies.
In the specific implementation, the target variable of the consumer behavior prediction is used as the supervisory signal, a random forest is chosen as the base model, and all candidate features together with a set of “shadow features” are fed into the Boruta algorithm for iteration [26]. After several rounds, the algorithm rates each feature as “important”, “unimportant”, or “marginal” (tentative). Next, we calculate SHAP values for the potentially important features screened by Boruta to measure the magnitude and direction of their influence on the model output [27], thereby further eliminating variables with minimal or unstable contributions. The final set of retained features contains traditional questionnaire variables, sentiment analysis variables, and derived variables. This process ensures that we do not omit factors with a significant impact on the target while keeping the feature dimensionality within reasonable limits [28]. After screening, the number of features used for modeling in this study is 10, about 30% of the initial candidate set. These include basic user attributes as well as psychological attitude indicators and emotional tendency indicators, covering the main influencing factors hypothesized in this study.
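The following is a minimal sketch of this two-stage screening, assuming the open-source boruta (BorutaPy) and shap packages; the synthetic feature matrix, iteration counts, and SHAP cutoff are illustrative stand-ins for the actual candidate set and thresholds.

```python
# Sketch: Boruta screening with a random-forest base model, followed by SHAP-based refinement.
import numpy as np
import pandas as pd
import shap
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the real candidate feature matrix and target (purchase-intention score).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 12)), columns=[f"feat_{i}" for i in range(12)])
y = pd.Series(2 * X["feat_0"] - X["feat_1"] + rng.normal(scale=0.5, size=200))

rf = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42)

# Stage 1: Boruta pits each real feature against shuffled "shadow" copies over many iterations.
boruta = BorutaPy(estimator=rf, n_estimators="auto", max_iter=50, random_state=42)
boruta.fit(X.values, y.values)
candidates = X.columns[boruta.support_ | boruta.support_weak_]   # confirmed + tentative features

# Stage 2: SHAP values on the screened features measure the size and stability of each contribution.
rf.fit(X[candidates], y)
shap_values = shap.TreeExplainer(rf).shap_values(X[candidates])
mean_abs = np.abs(shap_values).mean(axis=0)
retained = [f for f, s in zip(candidates, mean_abs) if s >= 0.01 * mean_abs.sum()]
print("Retained features:", retained)
```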
2.3. Modeling and Feature Importance Analysis
Random forest (RF) is an ensemble learning algorithm based on decision trees that excels in user behavior prediction due to its robustness and nonlinear fitting ability. Compared with linear or logistic regression, random forest can handle high-dimensional features and complex interactions, and it is more tolerant of the nonlinear relationships and noise common in consumption data [29]. In emotion-driven consumption contexts, where influencing factors are diverse and relationships are nonlinear, RF models are well suited as prediction tools. In addition to predictive performance, RF provides a built-in feature importance assessment, which helps to screen important variables at an early stage. However, it should be noted that RF importance tends to favor features with high cardinality or many split points, which may introduce bias. We therefore use RF prediction in conjunction with k-means clustering analysis for a more comprehensive assessment of feature impact.
After completing the feature engineering, we built a random forest model to predict the continuous dependent variable. In this paper, the dependent variable is consumers’ willingness to buy therapeutic toys, or the intensity of their demand, derived from the score of the questionnaire item “Buying therapeutic toys makes me feel happy” or similar items. This variable is continuously distributed and can be regarded as a scale of demand. The random forest regression model was chosen because, on the one hand, it can handle many input features and capture the nonlinear relationships examined in this study, and its built-in out-of-bag estimation can alleviate overfitting [30,31]; on the other hand, compared with neural networks, random forests retain a certain degree of interpretability and robustness, which makes them easy to combine with the subsequent correlation analysis [32].
We divided the sample data into training and test sets at a ratio of 8:2, used k-fold cross-validation on the training set, and tuned the hyperparameters. The main hyperparameters tuned include the number of decision trees, the maximum depth, and the minimum number of samples per leaf node. Considering the limited sample size, we selected a relatively conservative number of trees ($n = 100$) by grid search and limited the maximum depth to prevent single-tree overfitting [33]. The average $R^2$ and mean squared error (MSE) from cross-validation were used to assess model performance during the training phase and to guide parameter selection. After finalizing the parameters, the model was retrained on the entire training set and its prediction performance was evaluated on an independent test set using $R^2$ and MSE [34].
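A minimal sketch of this training and evaluation loop follows, using scikit-learn; the synthetic data stands in for the retained feature matrix and the purchase-intention target, and the grid mirrors the hyperparameters named above with illustrative search values.

```python
# Sketch: 80/20 split, k-fold grid search over the main RF hyperparameters, then test-set evaluation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the 10 retained features (X_sel) and the purchase-intention score (y).
rng = np.random.default_rng(0)
X_sel = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"x{i}" for i in range(10)])
y = 1.5 * X_sel["x0"] - 0.8 * X_sel["x1"] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [100],              # conservative number of trees, as reported above
    "max_depth": [3, 5, 8],             # limit depth to prevent single-tree overfitting
    "min_samples_leaf": [2, 5, 10],     # minimum samples per leaf node
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_rf = search.best_estimator_
pred = best_rf.predict(X_test)
print("Best params:", search.best_params_)
print("Test R2:", r2_score(y_test, pred), " Test MSE:", mean_squared_error(y_test, pred))

# Built-in importances (mean decrease in impurity) give a first ranking of the inputs.
importances = pd.Series(best_rf.feature_importances_, index=X_sel.columns).sort_values(ascending=False)
print(importances.head())
```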
In addition to prediction performance, we also extracted feature importances from the random forest. Based on the mean decrease in mean squared error (or in Gini impurity for classification trees), we obtain a ranking of each input feature’s contribution to the model’s predictions. This provides a list of important factors and facilitates a more in-depth explanation of the influences captured by the model.
2.4. K-Means Cluster Analysis
Finally, we apply k-means clustering to segment the surveyed Generation Z consumers and portray different types of user profiles, which helps to understand differences between groups and to implement precision marketing. K-means clustering is a classic unsupervised learning method widely used for consumer segmentation because of its simplicity and efficiency [35]. K-means iteratively optimizes cluster assignments so that the similarity of objects within a cluster is maximized and the similarity between clusters is minimized [36]. Given the standardized feature matrix $\{x_1, \ldots, x_n\}$ (where each $x_i$ stacks the ten retained predictors and the relevant demographic dummies), K-means partitions the sample into $K$ non-overlapping clusters $\{C_1, \ldots, C_K\}$ by minimizing the within-cluster sum of squared Euclidean distances:

$$
\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 ,
$$

where $\mu_k$ denotes the centroid of cluster $C_k$. The iterative updates alternate between assigning each respondent to the nearest centroid and recalculating the centroids, which jointly minimize this objective.
Based on the questionnaire data, clustering groups individuals with similar consumption perceptions and behavioral characteristics according to their responses on multiple attributes. For example, users who emphasize emotional value and those who focus on cost-effectiveness can be assigned to different clusters, yielding very different consumption portraits.
In this study, we select the key variables identified by the model, as well as the demographic characteristics of the users, etc., and perform k-means clustering analysis, expecting to discover the segmentation of Generation Z therapeutic toy consumers. Through clustering, we can distill three types of user portraits and compare the differences between clusters in terms of emotional needs, purchase intention, etc.
The data used for clustering consist of the key features screened as described above, together with key demographic variables. Prior to clustering, we standardized continuous features (mean 0, variance 1) to prevent differences in magnitude from affecting the distance calculations, and categorical variables were converted to dummy variables. To determine the number of clusters K, we considered the elbow method, the silhouette coefficient, and other indicators, running trials over K = 2–6 to find the value that balances intra-cluster tightness and inter-cluster separation.
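A minimal sketch of this segmentation step follows, assuming a profile matrix built from the retained features plus demographic dummies (here replaced by synthetic data); the K range and metrics mirror the procedure described above, and the final K = 3 reflects the three user portraits discussed earlier.

```python
# Sketch: standardize features, scan K = 2..6 with inertia (elbow) and silhouette, then fit the chosen K.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the clustering matrix (retained predictors + demographic dummies).
rng = np.random.default_rng(7)
profile = pd.DataFrame(rng.normal(size=(150, 8)), columns=[f"v{i}" for i in range(8)])

X_std = StandardScaler().fit_transform(profile)          # mean 0, variance 1

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std)
    print(f"K={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X_std, km.labels_):.3f}")

# Fit the chosen K (e.g., K = 3 user portraits) and attach cluster labels for profiling.
final = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_std)
profile["cluster"] = final.labels_
print(profile.groupby("cluster").mean().round(2))        # compare cluster centroids on each feature
```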