4.1. Study I: Measurement Foundation and Benchmark Validation
Study I establishes the measurement foundation of WEPA through anchor-word development, anchor-word validation, benchmark-based evaluation against human annotations, and cross-temporal stability assessment. Because WEPA generates a single construct score for each user–week observation and the focal constructs are expected to vary over time, classical reliability measures such as internal consistency and test–retest reliability are not directly applicable. We therefore examine whether the anchor-word dictionaries exhibit the expected semantic structure, whether WEPA scores align with human judgments and outperform a dictionary-based baseline, and whether semantic axes retain stable orientation across the observation period.
4.1.1. Anchor-Word Development and Validation
We identified focal psychological constructs from Goal-Setting Theory and Social Cognitive Theory. The goal-setting constructs are goal commitment, goal specificity, and goal difficulty. We also examine self-efficacy and its four source dimensions: mastery experience, vicarious experience, social persuasion, and physiological states. These constructs cover both unidimensional and multidimensional structures, are theoretically central to exercise behavior, and can be meaningfully represented through bipolar semantic dimensions.
In the Chinese fitness domain, anchor words were selected to capture theoretically meaningful semantic contrasts. Goal specificity contrasts quantified goal indicators with vague outcome expectations. Goal difficulty contrasts high-effort or challenge-related expressions with low-intensity or easy activities. Goal commitment contrasts determination and follow-through with hesitation, delay, or disengagement. For self-efficacy, mastery experience contrasts success with failure-related expressions, vicarious experience contrasts learning from others with unguided trial-and-error, social persuasion contrasts supportive with discouraging feedback, and physiological states contrast energetic or positive conditions with fatigued or depleted ones.
Some dimensions require context-specific interpretation. The goal difficulty axis captures difficulty-related expressions in online fitness discourse, including perceived obstacles, fatigue, and challenge-related language, and should not be treated as a direct measure of objective goal difficulty in the original Goal-Setting Theory. Similarly, the social persuasion axis captures platform expressions related to encouragement or discouragement, whose behavioral meaning may depend on whether they reflect stable support or compensatory help-seeking.
Table 6 presents representative English glosses of selected Chinese anchor words to help international readers understand their semantic content and polarity. The empirical analysis uses the original Chinese anchor words.
The anchor-word dictionaries were developed in three stages. First, theory pooling drew on Goal-Setting Theory, Social Cognitive Theory, and fitness-community language to generate 260 candidate words. After screening against the vocabulary of the 200-dimensional GloVe model trained on the 816 MB corpus, 251 valid candidates remained. Second, three eHealth scholars independently evaluated the fit between candidate words and construct definitions. Words were retained when at least two experts agreed on their assigned category. Inter-rater agreement, measured with Krippendorff’s α, was high, and 232 theoretically relevant anchor words were retained.
Finally, we conducted semantic-space diagnostics. For each construct, we calculated average cosine similarity within the positive pole, within the negative pole, and between poles. This diagnostic assessed whether words within the same pole were, on average, closer to one another than to words from the opposite pole.
Table 7 shows that all constructs follow this expected pattern. Although the absolute cosine values are moderate, this is plausible because anchor words are not intended to be strict synonyms. They function as theoretically related expressions aligned with the same pole, supporting subsequent semantic-axis construction.
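The pole-level diagnostic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy three-dimensional vectors are assumptions, and the actual analysis would use the 200-dimensional embeddings of the Chinese anchor words.

```python
import numpy as np

def mean_pairwise_cosine(a, b=None):
    """Mean cosine similarity within one set (b is None) or between two sets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    if b is None:
        sims = a @ a.T
        iu = np.triu_indices(len(a), k=1)  # distinct pairs only, no self-similarity
        return float(sims[iu].mean())
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

def pole_diagnostics(pos, neg):
    """Within-positive, within-negative, and between-pole average similarity."""
    return {
        "within_pos": mean_pairwise_cosine(pos),
        "within_neg": mean_pairwise_cosine(neg),
        "between": mean_pairwise_cosine(pos, neg),
    }

# Toy vectors standing in for anchor-word embeddings (hypothetical, not real data).
pos = np.array([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.8, 0.3, 0.0]])
neg = np.array([[-1.0, 0.1, 0.2], [-0.8, 0.2, 0.1], [-0.9, 0.0, 0.3]])
diag = pole_diagnostics(pos, neg)

# The expected pattern: both within-pole averages exceed the between-pole average.
expected_pattern = min(diag["within_pos"], diag["within_neg"]) > diag["between"]
```

A construct passes this check when `expected_pattern` is true, regardless of how large the absolute cosine values are.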
4.1.2. Human Benchmark and Algorithm Comparison
To evaluate WEPA against human benchmarks, we selected two theoretically distinct constructs. Goal specificity represents a relatively concrete cognitive–behavioral dimension of Goal-Setting Theory and tests whether WEPA can recover quantified goal statements. Physiological states represent an affective–somatic source of self-efficacy and provide a more difficult test because expressions such as energized or exhausted are often implicit and fragmented in short social-media text.
We constructed two annotated datasets from the Keep corpus. The goal-specificity benchmark includes 518 posts with quantifiable goals, and the physiological-states benchmark includes 536 posts with identifiable bodily or emotional descriptions. All texts contain at least 10 characters. The same three eHealth scholars annotated both datasets. They identified quantifiable behavioral indicators for goal specificity and physiological or affective indicators for physiological states without inferring broader motivational meanings. Inter-annotator agreement, measured with Cronbach’s α, was high for both constructs (0.922 for physiological states).
We use a dictionary-based method as a transparent closed-vocabulary baseline. Both methods rely on the same anchor-word sets, but their scoring logic differs. The dictionary method relies on exact keyword matching, whereas WEPA uses distributed semantic aggregation. This baseline is not intended to represent all possible dictionary systems. It provides a conservative reference for evaluating whether semantic projection improves coverage and agreement with human judgments under short-text conditions.
The dictionary-based method computes scores using the normalized difference in anchor-word frequencies (Eichstaedt et al., 2020). The score is computed from the set of words in the focal text and the positive and negative anchor-word sets. Stronger agreement with human annotations indicates better capture of the intended construct as expressed in the text.
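A closed-vocabulary score of this kind can be sketched as below. The exact normalization is an assumption: here the difference between positive and negative anchor matches is divided by the total number of matches, so a text with no anchor match receives no score. The English tokens and anchor sets are hypothetical placeholders for the Chinese originals.

```python
def dictionary_score(tokens, pos_anchors, neg_anchors):
    """Normalized difference in anchor-word counts; None when no anchor matches."""
    n_pos = sum(1 for tok in tokens if tok in pos_anchors)
    n_neg = sum(1 for tok in tokens if tok in neg_anchors)
    total = n_pos + n_neg
    if total == 0:
        return None  # the text is not covered by the dictionary
    return (n_pos - n_neg) / total

def coverage(scores):
    """Share of texts that received a score."""
    return sum(s is not None for s in scores) / len(scores)

# Hypothetical anchors and tokenized posts, for illustration only.
pos_anchors = {"energized", "refreshed"}
neg_anchors = {"exhausted", "sore"}
texts = [
    ["felt", "energized", "and", "refreshed", "today"],
    ["so", "exhausted", "but", "energized"],
    ["legs", "like", "jelly"],  # implicit fatigue, no exact anchor match
]
scores = [dictionary_score(t, pos_anchors, neg_anchors) for t in texts]
```

The third post illustrates the coverage problem: it expresses fatigue without using any anchor word, so exact matching leaves it unscored.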
Table 8 reports the benchmark comparison. For goal specificity, WEPA achieves a higher Spearman rank correlation with human annotations than the dictionary baseline. For physiological states, the contrast is larger. The dictionary method covers only 52 of 536 texts (9.7%) and shows weak agreement with human judgments, whereas WEPA scores all texts and achieves strong agreement.
These results suggest that semantic projection improves both coverage and benchmark agreement under realistic short-text conditions. The advantage is especially clear for constructs expressed through implicit, affective, or fragmented language, where exact lexical matching often fails. At the same time, the benchmark should not be interpreted as direct human-annotation validation for all seven constructs. It provides evidence across two theoretically distinct cases, one relatively concrete cognitive–behavioral construct and one affective–somatic construct.
4.1.3. Cross-Temporal Stability of Semantic Axes
The benchmark comparison indicates that WEPA captures the intended constructs in human-annotated texts. For longitudinal measurement, it is also necessary to examine whether semantic axes retain a stable orientation over time. If axis orientation shifts because of linguistic evolution or corpus-specific estimation variance, observed score changes may partly reflect measurement artifacts.
To assess cross-temporal robustness, we distinguish between axis orientation and semantic structure. We partition the full corpus by calendar year from 2015 to 2021 and train separate yearly embedding models using the same hyperparameters as the main specification. Each yearly embedding space is aligned to the full-period reference space using a Procrustes transformation, so that all yearly models share a common coordinate system (Yao et al., 2018).
The full-period embedding is used as the reference because it provides a more stable estimate of the semantic space than yearly embeddings, especially in early sparse periods, and avoids anchoring the analysis to any single-year realization (Cassani et al., 2021). This choice does not introduce forward-looking bias because the analysis does not involve prediction. The full-period embedding serves only as a technical reference frame for comparing semantic representations across time.
For each construct, we compute two indicators. Cosine similarity between the yearly semantic axis and the full-period axis captures axis stability, or whether the overall direction of the construct axis remains stable. Spearman rank correlation captures rank-order consistency, or whether the relative ordering of anchor words along the construct dimension is preserved. We place less emphasis on 2015 because the Keep platform was launched in February 2015 and the first-year corpus is sparse, approximately 23 MB, compared with 134 MB in 2016.
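The alignment and the two stability indicators can be sketched as follows. Several details are assumptions for illustration: the axis is taken as the difference between the mean positive and mean negative anchor vectors, the alignment is an orthogonal Procrustes rotation solved by SVD, rank-order consistency compares anchor-word projections on the yearly and full-period axes, and the embeddings are synthetic.

```python
import numpy as np

def procrustes_align(yearly, reference):
    """Rotate `yearly` (n_words x d) onto `reference` with an orthogonal map."""
    u, _, vt = np.linalg.svd(yearly.T @ reference)
    return yearly @ (u @ vt)

def semantic_axis(emb, pos_idx, neg_idx):
    """Assumed axis definition: mean positive anchor minus mean negative anchor."""
    return emb[pos_idx].mean(axis=0) - emb[neg_idx].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman correlation for tie-free inputs: Pearson correlation of ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Synthetic data: a yearly space that is a rotated, slightly noisy copy of the reference.
rng = np.random.default_rng(0)
reference = rng.normal(size=(8, 4))            # 8 anchor words, 4 dimensions
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # arbitrary orthogonal transformation
yearly = reference @ q + 0.001 * rng.normal(size=(8, 4))

aligned = procrustes_align(yearly, reference)
pos_idx, neg_idx = [0, 1, 2, 3], [4, 5, 6, 7]
ref_axis = semantic_axis(reference, pos_idx, neg_idx)
year_axis = semantic_axis(aligned, pos_idx, neg_idx)

axis_stability = cosine(year_axis, ref_axis)
rank_consistency = spearman(aligned @ year_axis, reference @ ref_axis)
```

Without the alignment step, the cosine between yearly and full-period axes would be uninterpretable, because independently trained embedding spaces differ by an arbitrary rotation.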
Figure 3 presents the results. Panel (a) reports semantic axis stability based on cosine similarity, and Panel (b) reports rank-order consistency based on Spearman correlations.
These two indicators capture complementary aspects of temporal stability. Axis stability reflects whether the overall geometric direction is preserved, while rank-order consistency reflects whether the internal semantic structure of anchor words remains stable. The results show that axis orientation varies across constructs, but rank-order consistency remains consistently high after 2016. This pattern suggests that WEPA provides a stable semantic measurement structure over time, especially at the level of internal semantic ordering. The analysis supports the longitudinal interpretability of WEPA scores, although it should not be treated as a full proof of measurement invariance across platforms or contexts.
4.2. Study II: Criterion Validity of Goal-Setting Constructs
Study II evaluates whether WEPA-derived scores for goal commitment, goal specificity, and goal difficulty are associated with subsequent exercise behavior in theoretically consistent directions.
4.2.1. Hypotheses
Goal-Setting Theory argues that clear, specific, and appropriately challenging goals, when combined with feedback, can improve individual performance (Locke & Latham, 2019). Building on this logic, we develop expectations for three goal-setting constructs. Because WEPA measures these constructs through users’ own expressions in platform text, the hypotheses concern text-based indicators of goal-related psychological states rather than experimentally assigned goal conditions.
Goal commitment reflects persistence, motivational investment, and determination. Users with higher commitment are more likely to sustain exercise behavior (Brandstätter & Bernecker, 2022). Goal specificity refers to quantified and actionable goal statements, such as “run 5 km per week”. Specific goals reduce behavioral ambiguity and provide clearer implementation paths (Wirth et al., 2009). Accordingly, WEPA goal commitment and goal specificity scores are expected to be positively associated with future exercise duration.
Goal difficulty requires a context-specific interpretation. In classical Goal-Setting Theory, difficult goals can promote performance when they are accepted and supported by sufficient ability and resources (Senko & Harackiewicz, 2005). In online fitness discourse, however, expressions such as “too hard”, “I cannot keep going”, or “I am exhausted” may signal perceived obstacles, fatigue, or frustration. The WEPA goal difficulty score is therefore interpreted as expressed difficulty in platform text, and we expect it to be negatively associated with subsequent exercise duration.
Based on these arguments, we propose the following hypotheses:
Hypothesis 1. The WEPA goal commitment score is positively associated with future exercise duration.
Hypothesis 2. The WEPA goal specificity score is positively associated with future exercise duration.
Hypothesis 3. The WEPA goal difficulty score is negatively associated with future exercise duration.
4.2.2. Model Specification
The three goal-setting constructs are derived from users’ weekly text using WEPA and measured as continuous construct scores. Because goal commitment and goal difficulty are conceptually related and empirically correlated (see Table 5), estimating all three constructs in a joint model would complicate interpretation. Joint estimation would yield partial effects conditional on other constructs, which is less aligned with the goal of assessing criterion validity for each construct individually. We therefore estimate separate individual fixed-effects panel models for each construct.
Multicollinearity diagnostics indicate that joint estimation is statistically feasible, with maximum condition indices below 2.07 and variance inflation factors below 1.57. The use of separate models is therefore motivated by interpretability rather than by multicollinearity concerns.
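Using individual fixed-effects models, we estimate the following regression for each construct:
$$\ln(\text{Duration}_{i,t+1}) = \beta\,\text{Construct}_{i,t} + \boldsymbol{\gamma}^{\top}\mathbf{Controls}_{i,t} + \alpha_i + \varepsilon_{i,t+1}$$
where $\ln(\text{Duration}_{i,t+1})$ is log-transformed exercise duration of user $i$ at time $t+1$, and $\text{Construct}_{i,t}$ denotes goal commitment, goal specificity, or goal difficulty at time $t$. $\alpha_i$ represents individual fixed effects, and standard errors are clustered at the user level. The time-varying controls, $\mathbf{Controls}_{i,t}$, include age, weekly text length, exercise-duration change, social engagement, and social feedback.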
Criterion validity is assessed through temporally ordered predictive associations between construct scores at time t and exercise behavior at time t+1. Prediction refers to cross-temporal statistical association, not causal prediction or intervention effects. Weekly aggregation limits temporal precision because users may exercise or post at any point within a week. Directional consistency between construct scores and the external behavioral criterion provides evidence for criterion validity within this observational framework.
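The temporal ordering and the fixed-effects logic can be illustrated with a minimal within-estimator sketch. It is not the authors' estimation code: controls and user-clustered standard errors are omitted, the function names are hypothetical, and the panel is simulated with a known coefficient.

```python
import numpy as np

def within_demean(values, groups):
    """Subtract each user's mean from that user's observations."""
    out = values.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        out[mask] -= out[mask].mean(axis=0)
    return out

def fe_lead_regression(user, week, y, x):
    """Regress y at week t+1 on x at week t with individual fixed effects."""
    order = np.lexsort((week, user))  # sort by user, then by week
    user, week, y, x = user[order], week[order], y[order], x[order]
    # Keep rows whose next row is the same user's immediately following week.
    ok = (user[:-1] == user[1:]) & (week[1:] == week[:-1] + 1)
    y_lead, x_now, u = y[1:][ok], x[:-1][ok], user[:-1][ok]
    beta, *_ = np.linalg.lstsq(
        within_demean(x_now, u), within_demean(y_lead, u), rcond=None
    )
    return beta

# Simulated balanced panel: y at week t depends on x at week t-1 plus a user effect.
rng = np.random.default_rng(1)
n_users, n_weeks, true_beta = 50, 10, 0.3
user = np.repeat(np.arange(n_users), n_weeks)
week = np.tile(np.arange(n_weeks), n_users)
x = rng.normal(size=n_users * n_weeks)
alpha = np.repeat(rng.normal(size=n_users), n_weeks)   # user fixed effects
x_lag = np.where(week == 0, 0.0, np.roll(x, 1))        # previous week's x within user
y = alpha + true_beta * x_lag + 0.05 * rng.normal(size=x.size)

beta_hat = fe_lead_regression(user, week, y, x[:, None])
```

Demeaning within user removes the time-invariant user effect, so the estimate reflects how within-user deviations in the construct score relate to within-user deviations in next-week behavior.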
4.2.3. Results
Table 9 reports the criterion-validity results. Goal commitment is positively associated with exercise duration at time t+1, supporting H1. Goal specificity is also positively associated with exercise duration, supporting H2. Goal difficulty is negatively associated with exercise duration, supporting H3.
Because the dependent variable is log-transformed, the coefficients can be interpreted approximately as semi-elasticities. A one-standard-deviation increase in goal commitment, goal specificity, and goal difficulty corresponds to changes of about +7.1%, +1.9%, and −5.2%, respectively, in exercise duration at t+1. Using the raw mean of weekly exercise duration before log transformation as a reference (15.61 min/week), these effects translate into about +1.12, +0.30, and −0.81 min/week. These modest raw-minute equivalents should be interpreted as statistically reliable behavioral associations, not as large individual-level intervention effects.
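The conversion from percentage changes to raw-minute equivalents is simple arithmetic, sketched below under the assumption that each percentage is applied directly to the raw mean. The helper names are hypothetical; the percentages and the 15.61 min/week reference are the values reported above.

```python
import math

RAW_MEAN_MINUTES = 15.61  # raw weekly mean reported in the text

def percent_change(beta, exact=False):
    """Percent change in the outcome implied by a log-linear coefficient.

    The approximate form is 100 * beta; the exact form is 100 * (exp(beta) - 1).
    """
    return 100 * (math.exp(beta) - 1) if exact else 100 * beta

def minutes_equivalent(pct, raw_mean=RAW_MEAN_MINUTES):
    """Translate a percent change into minutes per week at the raw mean."""
    return raw_mean * pct / 100

reported_pct = {
    "goal_commitment": 7.1,
    "goal_specificity": 1.9,
    "goal_difficulty": -5.2,
}
minutes = {name: round(minutes_equivalent(p), 2) for name, p in reported_pct.items()}
```

Applied to the rounded percentages, this yields values within about 0.01 min/week of those reported; the small gap for goal commitment presumably reflects rounding of the underlying coefficient. For coefficients of this size, the approximate and exact percentage forms differ only slightly.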
These findings provide criterion-validity evidence for WEPA in measuring unidimensional psychological constructs.
4.3. Study III: Criterion Validity of Self-Efficacy Dimensions
Study III extends the criterion-validity assessment to multidimensional measurement. It examines whether WEPA-derived scores for overall self-efficacy and its four source dimensions are associated with subsequent exercise behavior in theoretically expected directions.
4.3.1. Hypotheses
According to Social Cognitive Theory, self-efficacy is shaped by four sources of information: mastery experience, vicarious experience, social persuasion, and physiological states (Bandura, 1977). This four-dimensional structure makes self-efficacy a useful case for testing whether WEPA can distinguish behavioral associations among dimensions within the same construct. Because these dimensions are measured through linguistic expressions in platform text, the hypotheses concern WEPA-derived textual indicators rather than a direct experimental manipulation of self-efficacy sources.
Mastery experience reflects successful task completion and is widely regarded as the strongest source of self-efficacy. Vicarious experience arises from observing similar others succeed and provides behavioral models for action. Expressions of prior accomplishment or learning from others should therefore be positively associated with future exercise duration. Physiological states reflect subjective evaluations of physical condition and emotional arousal. Positive bodily and emotional states may support behavioral maintenance, while fatigue or negative affect may weaken efficacy beliefs. Thus, the WEPA physiological states score is also expected to be positively associated with future exercise duration.
Social persuasion requires a context-specific interpretation in digital fitness communities. In Bandura’s original theory, persuasive encouragement from others can strengthen efficacy beliefs. In platform discourse, however, references to external affirmation, such as “I need likes”, “please encourage me”, or “I need support to continue”, may appear as compensatory expressions during motivational blockage. We therefore expect the WEPA social persuasion score to be negatively associated with subsequent exercise duration in this platform context. This hypothesis reflects the user-expressed salience of persuasion- or support-related language and should not be interpreted as a general claim that social persuasion weakens self-efficacy.
Based on these arguments, we propose the following hypotheses.
Hypothesis 4. The WEPA self-efficacy score is positively associated with future exercise duration.
Hypothesis 5. The WEPA mastery experience score is positively associated with future exercise duration.
Hypothesis 6. The WEPA vicarious experience score is positively associated with future exercise duration.
Hypothesis 7. The WEPA physiological states score is positively associated with future exercise duration.
Hypothesis 8. The WEPA social persuasion score is negatively associated with future exercise duration.
4.3.2. Model Specification
Study III adopts the joint model as the primary specification because the four source dimensions represent theoretically distinct components of the same multidimensional construct. The goal is to assess each dimension’s association with subsequent behavior and to examine its relative contribution after accounting for the other sources. This logic is consistent with the theoretical role of self-efficacy sources in Social Cognitive Theory.
Confirmatory factor analysis results support this multidimensional structure (CFI = 0.999, TLI = 0.998, RMSEA = 0.038). Although the four dimensions are highly correlated, the joint model remains informative because it captures conditional associations among theoretically related components. Multicollinearity diagnostics indicate that the model is estimable. The VIFs for MastExp and SocPer are 13.09 and 10.31, above the conventional threshold of 10, while the maximum condition index is 8.56, well below the critical threshold of 30 (Kalnins & Hill, 2023; O’Brien, 2007). These diagnostics call for a cautious interpretation of coefficient magnitudes without invalidating the joint specification.
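For readers less familiar with these diagnostics, the sketch below shows one common way to compute them. It is an illustration on simulated predictors, not a reproduction of the reported values: VIFs are read from the diagonal of the inverse correlation matrix, and the condition index is computed from the eigenvalues of the correlation matrix, which is one of several conventions and is assumed here.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def max_condition_index(X):
    """sqrt(largest / smallest eigenvalue) of the predictor correlation matrix."""
    eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return float(np.sqrt(eig.max() / eig.min()))

# Four simulated predictors that share a strong common component.
rng = np.random.default_rng(2)
common = rng.normal(size=2000)
X = np.column_stack([common + 0.3 * rng.normal(size=2000) for _ in range(4)])

vifs = vif(X)
cond = max_condition_index(X)
```

High VIFs with a moderate condition index, as in this simulation, indicate that coefficients are estimable but that their standard errors are inflated by shared variance among predictors.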
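Using an individual fixed-effects model, the joint specification is written as follows:
$$\ln(\text{Duration}_{i,t+1}) = \sum_{k=1}^{4} \beta_k\,\text{Source}^{k}_{i,t} + \boldsymbol{\gamma}^{\top}\mathbf{Controls}_{i,t} + \alpha_i + \varepsilon_{i,t+1}$$
where $\text{Source}^{k}_{i,t}$ denotes mastery experience, vicarious experience, social persuasion, or physiological states at time $t$, $\mathbf{Controls}_{i,t}$ collects the time-varying controls, and $\alpha_i$ represents individual fixed effects. The joint model estimates each source dimension conditional on the other three dimensions.
As a sensitivity check, we also estimate separate single-dimension models:
$$\ln(\text{Duration}_{i,t+1}) = \beta_k\,\text{Source}^{k}_{i,t} + \boldsymbol{\gamma}^{\top}\mathbf{Controls}_{i,t} + \alpha_i + \varepsilon_{i,t+1}, \quad k = 1, \dots, 4$$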
The separate models estimate gross associations between each source expression and subsequent exercise behavior, while the joint model estimates conditional associations. This comparison helps to assess whether the joint estimates are sensitive to intercorrelations among the four dimensions. The controls, fixed effects, and clustered standard errors follow Study II.
4.3.3. Results
Table 10 reports the criterion-validity results for overall self-efficacy and its four source dimensions. All models use the same analysis sample. The overall self-efficacy score is positively associated with exercise duration at time t+1, supporting H4.
The separate models show positive gross associations for mastery experience, vicarious experience, social persuasion, and physiological states (Table 10). The joint model reveals a more differentiated pattern. Mastery experience, vicarious experience, and physiological states remain positively associated with exercise duration, supporting H5, H6, and H7. Social persuasion becomes negatively associated with exercise duration in the joint model, supporting H8.
This contrast is substantively informative. Social persuasion is weakly positive when entered alone but negative after controlling for the other three sources, suggesting that its residual component may capture support-seeking, motivational vulnerability, or difficulty-related appeals for encouragement after mastery, vicarious, and physiological-affective signals are accounted for.
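A sign change of this kind is a familiar property of regressions with highly correlated predictors. The simulation below is purely illustrative: the variable names, coefficients, and data are invented, and it shows only that a predictor can have a positive gross slope and a negative conditional slope at the same time.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
mastery = rng.normal(size=n)
# Persuasion-related language is simulated as highly correlated with mastery.
persuasion = mastery + np.sqrt(0.1) * rng.normal(size=n)
# The outcome rises with mastery and falls with the persuasion component.
y = 1.0 * mastery - 0.5 * persuasion + 0.5 * rng.normal(size=n)

def ols_slopes(X, y):
    """OLS slopes (intercept dropped) of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[1:]

gross = ols_slopes(persuasion[:, None], y)[0]                        # entered alone
joint = ols_slopes(np.column_stack([mastery, persuasion]), y)[1]     # with mastery
```

Entered alone, the simulated persuasion score inherits the positive association of the mastery signal it overlaps with; once mastery is held constant, only its residual, negatively related component remains.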
Because the joint model is the primary specification, we interpret its coefficients as approximate semi-elasticities. A one-standard-deviation increase in mastery experience, vicarious experience, physiological states, and social persuasion corresponds to changes of about +12.4%, +13.0%, +2.5%, and −20.6% in exercise duration at t+1, or approximately +1.94, +2.03, +0.40, and −3.22 min/week using the same raw-mean reference. These modest raw-minute equivalents support criterion validity for text-based measurement but should not be interpreted as clinically meaningful changes in exercise behavior.
Overall, these findings provide criterion-validity evidence for WEPA in multidimensional psychological measurement. They also suggest that self-efficacy expressions in online fitness communities should be interpreted as contextually embedded linguistic indicators, not as direct one-to-one replications of classical self-efficacy sources.
4.4. Study IV: Exploratory Relative-Week Dynamics
Study IV provides an exploratory descriptive analysis of relative-week dynamics in a naturalistic digital setting. It is not intended as a formal test of ecological validity, a hypothesis test, or evidence of individual developmental trajectories. Instead, it examines whether WEPA-derived construct scores reveal broad and interpretable aggregate patterns when observed users are aligned by time since platform registration.
Because users differ in activity duration and participation continuity, directly tracing individual trajectories would be misleading. Some users disengage quickly, some return after interruptions, and later relative weeks contain a more selected subset of retained users. We therefore align users by registration week, aggregate construct scores at each relative week, and interpret the results as patterns among observable users at each stage rather than as trajectories of the original registration cohort.
For each construct, this procedure produces an aggregate relative-week trajectory based on the user–week records observed during that relative week. To smooth short-term fluctuations, we apply a three-week centered moving average. For comparability across constructs, each raw WEPA score is standardized using its full-sample mean and standard deviation before aggregation. Figure 4 therefore reports mean standardized WEPA scores rather than raw projection values.
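The aggregation procedure can be sketched as follows. The record layout, the function name, the use of the population standard deviation, and the shrinking window at the series edges are all assumptions made for illustration; the sketch also assumes that every relative week in the range has at least one observation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def relative_week_profile(records, window=3):
    """records: iterable of (user_id, calendar_week, registration_week, raw_score)."""
    records = list(records)
    scores = [r[3] for r in records]
    mu, sigma = mean(scores), pstdev(scores)  # full-sample standardization
    by_week = defaultdict(list)
    for _, week, reg_week, score in records:
        # Relative week = weeks elapsed since the user's registration week.
        by_week[week - reg_week].append((score - mu) / sigma)
    weeks = sorted(by_week)
    weekly_mean = [mean(by_week[w]) for w in weeks]
    half = window // 2
    smoothed = []
    for i in range(len(weeks)):
        lo, hi = max(0, i - half), min(len(weeks), i + half + 1)
        smoothed.append(mean(weekly_mean[lo:hi]))  # centered; shorter at the edges
    return weeks, weekly_mean, smoothed

# Two hypothetical users who registered in different calendar weeks.
records = [
    ("a", 10, 10, 1.0), ("a", 11, 10, 2.0), ("a", 12, 10, 3.0),
    ("b", 20, 20, 3.0), ("b", 21, 20, 4.0), ("b", 22, 20, 5.0),
]
weeks, weekly_mean, smoothed = relative_week_profile(records)
```

Because each relative week averages over whichever users are observed in that week, the resulting series describes the retained population at each stage, not the path of any individual user.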
When all seven trajectories are considered together, three descriptive patterns emerge. First, mastery experience, vicarious experience, and social persuasion show a pronounced early decline that becomes much flatter after approximately Week 20, with all three trajectories later fluctuating close to the zero line. Mastery experience generally remains the highest among the three, although their differences become smaller after the early stage.
Second, goal difficulty and goal specificity both move upward, but with different temporal profiles. Goal difficulty converges rapidly toward the zero line and remains close to it after approximately Week 20. Goal specificity starts from the lowest level and increases more gradually across the observation window. Although it remains below zero even in later weeks, its sustained rise is the clearest long-term directional pattern in the figure.
Third, physiological states and goal commitment display weaker recovery-like patterns. Both decline in the early stage and later move partially toward the zero line. Compared with goal specificity and goal difficulty, these two trajectories show smaller changes and should be interpreted as mild aggregate adjustments.
Taken together, the trajectories suggest an approximate descriptive transition around Week 20. This point is a visual reference, not a statistically estimated breakpoint. Before Week 20, several trajectories change relatively quickly; after Week 20, most constructs stabilize near the neutral range or fluctuate within a narrower band. The main long-term pattern is therefore not a uniform increase in motivational intensity, but gradual improvement in goal-related specificity alongside a broader stabilization of other construct-related expressions among retained observable users.