1. Introduction
Preventive healthcare increasingly relies on the ability to monitor physiological status continuously and non-invasively outside clinical environments [1,2]. Among health indicators associated with long-term cardiovascular and metabolic well-being, cardiorespiratory fitness (CRF) is one of the most informative, as reduced CRF has been consistently linked to elevated cardiovascular risk, diminished functional capacity, and increased all-cause mortality [3,4]. Consequently, reliable assessment of CRF is valuable not only for athletic performance evaluation, but also for early detection of health deterioration and long-term wellness monitoring in the general population [5].
The most widely accepted indicator of CRF is maximal oxygen uptake, VO2max, defined as the maximum rate at which oxygen can be utilized during intense exercise [3,6]. Clinically, VO2max is commonly measured through cardiopulmonary exercise testing (CPET), a supervised laboratory procedure requiring specialized equipment, trained personnel, controlled testing conditions, and substantial physical effort from participants. Although CPET is highly accurate, these requirements limit large-scale deployment and make repeated assessment impractical for many populations, particularly older adults or individuals with reduced exercise tolerance [7].
Recent advances in wearable sensing technologies have created new opportunities for accessible cardiorespiratory monitoring. Modern wearable devices can continuously acquire physiological and biomechanical signals such as heart rate, blood oxygen saturation (SpO2), acceleration, angular velocity, and body orientation during normal daily life [8]. Recent developments in wearable sensor design have also focused on creating more flexible, lightweight, and unobtrusive sensing platforms that improve user comfort and facilitate continuous physiological monitoring, further expanding the capabilities of wearable health technologies [9,10]. By enabling the integration of physiological and movement-derived information outside controlled laboratory settings, wearable sensing opens the door to more practical and scalable approaches for estimating cardiorespiratory fitness. This is conceptually illustrated in Figure 1, which contrasts the resource-intensive nature of conventional CPET with a wearable-based framework capable of estimating VO2max from short-duration daily life measurements.
Despite this promise, wearable-based CRF estimation remains limited by three major challenges: many methods require prolonged monitoring over days or weeks [11,12]; others depend on structured exercise protocols or predefined movement sequences [13,14,15]; and some remain strongly activity-dependent, requiring either specific exercises or explicit activity recognition labels [16,17]. In addition, validation is often performed on relatively narrow cohorts, limiting confidence in generalization across heterogeneous populations [18,19,20].
This work asks a central question: can cardiorespiratory fitness be estimated from less than one hour of wearable signals collected during semi-structured daily activities, without exercise testing, structured protocols, or explicit activity recognition?
To address this question, this work proposes an activity-independent framework for estimating VO2max from short-duration wearable-derived physiological and biomechanical signals collected during everyday behavior. The ground-truth VO2max values were obtained using the Queens College Step Test, an established and validated submaximal exercise protocol commonly used for indirect, population-level assessment of cardiorespiratory fitness [21,22,23].
Rather than modeling discrete activity categories, the proposed approach learns generalized relationships between movement intensity, physiological response, and subject-level cardiorespiratory capacity.
The framework adopts a two-stage strategy. In the first stage, a regression model estimates the metabolic equivalent of task (MET), a standardized physiological measure that quantifies energy expenditure relative to resting metabolic rate and has been extensively adopted in exercise physiology, epidemiological studies, and wearable sensing research as a practical and physiologically meaningful representation of physical activity intensity [24,25,26,27].
Although MET values derived from standardized compendia represent population-level estimates rather than individualized measurements, they provide a validated framework for characterizing the energetic demands of daily activities and have been widely used to standardize activity intensity across diverse research settings [28,29,30]. By expressing physical effort on a continuous physiological scale rather than through discrete activity categories, MET provides an interpretable intermediate representation that enables the proposed framework to model generalized relationships between movement intensity, physiological response, and cardiorespiratory fitness independently of the specific activity being performed.
The estimated MET values are subsequently used as a continuous representation of activity intensity. In the second stage, this intensity representation is combined with physiological biomarkers, movement descriptors, and demographic information to predict subject-level VO2max. By decoupling physiological demand from activity semantics, the proposed framework aims to improve robustness to unseen activities while preserving physiological interpretability.
The main contributions of this work are summarized as follows:
An activity-independent framework for wearable-based VO2max estimation that reduces reliance on explicit activity recognition and structured exercise protocols.
A two-stage modeling strategy that first estimates movement intensity through MET regression and subsequently integrates intensity, physiological signals, and demographic descriptors for subject-level cardiorespiratory fitness prediction.
Validation on a heterogeneous participant cohort spanning variations in sex, age, and fitness level, supporting the generalization capacity of the proposed approach under realistic daily life conditions.
A scalable alternative to conventional laboratory-based fitness assessment that may enable more accessible and frequent cardiorespiratory monitoring in non-clinical settings.
The remainder of this paper is organized as follows. Section 2 describes the study protocol, wearable sensing setup, preprocessing procedures, feature engineering pipeline, and machine learning models used for MET and VO2max estimation. Section 3 presents the experimental results. Section 4 discusses the implications, limitations, and future research directions of the proposed framework. Finally, Section 5 concludes the paper.
2. Materials and Methods
This section describes the experimental protocol, data collection procedures, wearable setup, preprocessing steps, and artificial intelligence architectures used in the study. It outlines how participants were recruited and monitored using wearable devices during a structured protocol involving real-world physical activities. Two separate pipelines were established: one for MET regression and another for estimating the VO2max health indicator.
An overview of the proposed framework is presented in Figure 2. The methodology follows a two-stage modeling pipeline for wearable-based cardiorespiratory fitness estimation. In the first stage, synchronized movement signals from inertial measurement unit (IMU) sensors are segmented into short-duration windows to estimate continuous MET values, providing a representation of physical intensity over time. The resulting MET estimates are then evaluated using a stability criterion based on the accumulated stability duration, denoted by $T_{\mathrm{stab}}$, which represents the cumulative duration (in seconds) over which consecutive MET estimates satisfy the stability condition. Only periods satisfying $T_{\mathrm{stab}} \geq 60$ s are selected for further analysis. For each valid stable segment, the corresponding movement, physiological, and demographic signals are organized into 1-min windows from which movement-derived, physiological, intensity-related, and demographic features are extracted. The resulting feature representations are subsequently used in the second stage to estimate subject-level VO2max through a regression model.
2.1. Study Design and Data Collection Protocol
This study was approved by the Institutional Review Board (IRB) of the University of Puerto Rico and conducted in accordance with the ethical standards of the Collaborative Institutional Training Initiative (CITI) Program. All participants provided written informed consent prior to participation and completed a Physical Activity Readiness Questionnaire (PAR-Q) [31] to ensure safe involvement in the protocol.
A total of 67 individuals were recruited, of whom 60 were included in the final analysis after excluding incomplete or low-quality recordings. Demographic information, including age, sex, height, weight, body fat percentage, and body mass index (BMI), was collected for each participant and later incorporated as input features in the proposed models. Further details on the dataset collection methodology and sensing framework are available in our previously published work [32].
Figure 3 illustrates the distribution of demographic and fitness characteristics across the participant pool, highlighting the heterogeneous nature of the recruited cohort. Recruitment was designed to encourage broad participation, with eligibility criteria kept deliberately wide and limited primarily to adulthood and safe participation requirements. As expected in a general population sample, most participants were concentrated within the average fitness range, while fewer individuals were observed at the lower and higher ends of the spectrum. This distribution reflects realistic population variability, although it introduces additional challenges for modeling extreme physiological responses.
Data collection was conducted in a controlled indoor environment to reduce external variability. As a reference measure of cardiorespiratory fitness, each participant completed the Queens College Step Test, a standardized submaximal exercise protocol from which VO2max was estimated using sex-specific predictive equations [22]. The test consisted of stepping at a fixed cadence (24 steps/min for men, 22 steps/min for women) for three minutes, followed by manual heart rate measurement.
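As an illustration of how a reference value can be derived from the step test, the sketch below uses the commonly cited sex-specific Queens College equations (111.33 − 0.42·HR for men and 65.81 − 0.1847·HR for women, with HR in beats/min and VO2max in mL/kg/min). These coefficients are an assumption taken from the general literature; the exact equations used in this study are those of [22]. The function name is hypothetical.

```python
def queens_college_vo2max(recovery_hr_bpm: float, sex: str) -> float:
    """Estimate VO2max (mL/kg/min) from the post-test heart rate (beats/min).

    Coefficients are the commonly cited Queens College Step Test equations
    (assumed here; verify against the reference used by the study).
    """
    label = sex.strip().lower()
    if label in ("m", "male"):
        return 111.33 - 0.42 * recovery_hr_bpm
    if label in ("f", "female"):
        return 65.81 - 0.1847 * recovery_hr_bpm
    raise ValueError(f"unsupported sex label: {sex!r}")


if __name__ == "__main__":
    # Example: the same recovery heart rate maps to different estimates by sex.
    print(round(queens_college_vo2max(150, "male"), 2))
    print(round(queens_college_vo2max(150, "female"), 2))
```

Because both equations are linear in heart rate, any error in the manual heart rate measurement propagates directly into the reference VO2max value.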
Following the step test, participants completed a structured activity protocol designed to capture a broad range of physical intensities representative of daily living. The protocol was developed in accordance with American National Standards Institute/Consumer Technology Association (ANSI/CTA) guidelines for real-world wearable evaluation [33,34], ensuring standardized activity execution and controlled intensity levels while preserving ecological relevance to natural daily conditions. This balance between protocol consistency and real-world representativeness is essential for developing models that generalize reliably in wearable-based health monitoring applications. The protocol lasted approximately 37 min and included alternating periods of rest, light, moderate, and vigorous activities defined according to MET-based categorizations [35]. The activity transitions were guided by an automated timing system to ensure consistent execution among the participants. The sequence of activities, their duration, and the corresponding MET values, illustrated in Figure 4, correspond to the standardized protocol performed by all participants during data acquisition and are part of the multimodal dataset previously presented in [32].
This protocol enabled the collection of synchronized motion and physiological signals across a wide range of controlled yet representative activity intensities.
2.2. Wearable Sensor Configuration
To capture both biomechanical and physiological signals, participants were instrumented with a combination of inertial and biomarker sensing devices, as illustrated in Figure 5.
Motion data were collected using five MetaMotionRL Inertial Measurement Units (IMUs) (MbientLab Inc., San Jose, CA, USA), placed on the chest, left hand, right hand, left knee, and right knee. The sensors operated at a sampling frequency of 50 Hz and recorded a combination of tri-axial acceleration, gyroscope, and quaternion signals depending on their placement. The chest sensor provided the most comprehensive set of measurements, serving as a central reference for whole-body motion [36], while limb-mounted sensors captured complementary upper and lower extremity dynamics.
This configuration was designed to provide balanced coverage of both linear and rotational movement while minimizing sensor burden and preserving user comfort, supporting potential real-world deployment scenarios.
Physiological signals were acquired using two wrist-worn devices: a Garmin Venu 3 smartwatch (Garmin Ltd., Olathe, KS, USA) and a CheckMe™ oximeter (Viatom Technology Co., Ltd., Shenzhen, China), worn on opposite wrists. Both devices recorded heart rate (HR) and peripheral SpO2 at a sampling frequency of 0.5 Hz, providing redundant measurements to improve robustness under motion conditions.
2.3. Signal Preprocessing
The raw signals collected from IMUs and biomarker devices were processed through a two-stage pipeline consisting of temporal synchronization and signal-specific filtering, corresponding to the preprocessing stage illustrated in Figure 2.
To ensure consistency across heterogeneous data sources, all signals were first aligned in time. Timestamps from both IMU and biomarker streams were parsed and converted into a unified high-resolution datetime representation using an ISO-like format (YYYY-MM-DD HH:MM:SS.ssssss), enabling consistent temporal alignment across modalities.
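For each subject, a common temporal interval was defined as:
$$t_{\mathrm{start}} = \max\left(t_{\mathrm{IMU}}^{\mathrm{start}},\, t_{\mathrm{BIO}}^{\mathrm{start}}\right), \qquad t_{\mathrm{end}} = \min\left(t_{\mathrm{IMU}}^{\mathrm{end}},\, t_{\mathrm{BIO}}^{\mathrm{end}}\right) \tag{1}$$
where $t_{\mathrm{IMU}}^{\mathrm{start}}$ and $t_{\mathrm{IMU}}^{\mathrm{end}}$ denote the initial and final timestamps of the IMU recordings, respectively; $t_{\mathrm{BIO}}^{\mathrm{start}}$ and $t_{\mathrm{BIO}}^{\mathrm{end}}$ denote the initial and final timestamps of the biomarker recordings, respectively; and $t_{\mathrm{start}}$ and $t_{\mathrm{end}}$ represent the beginning and end of the common temporal interval shared by both sensing modalities.
All signals were then trimmed to this shared interval defined by Equation (1), ensuring strict temporal alignment between motion and physiological measurements.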
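The synchronization step can be sketched as follows: the shared interval starts at the later of the two first timestamps and ends at the earlier of the two last timestamps, and both streams are trimmed to it. This is a minimal sketch using only the Python standard library; the function names and the `(timestamp, value)` sample layout are assumptions made for illustration, not the authors' implementation.

```python
from datetime import datetime

# Timestamp layout described in the text: YYYY-MM-DD HH:MM:SS.ssssss
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(text):
    """Parse one timestamp string into a datetime object."""
    return datetime.strptime(text, TS_FORMAT)


def common_interval(imu_times, bio_times):
    """Return (start, end) of the interval covered by both sorted streams."""
    start = max(imu_times[0], bio_times[0])
    end = min(imu_times[-1], bio_times[-1])
    if start > end:
        raise ValueError("the two streams do not overlap in time")
    return start, end


def trim(samples, start, end):
    """Keep only (timestamp, value) pairs inside the closed interval."""
    return [(t, v) for t, v in samples if start <= t <= end]


if __name__ == "__main__":
    imu = [parse_ts("2024-01-01 10:00:00.000000"), parse_ts("2024-01-01 10:00:10.000000")]
    bio = [parse_ts("2024-01-01 10:00:02.000000"), parse_ts("2024-01-01 10:00:12.000000")]
    print(common_interval(imu, bio))
```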
Following synchronization, signal-specific preprocessing was applied to account for the distinct characteristics of motion and biomarker data.
Biomarker signals, including Heart Rate (HR) and SpO2, were sampled at a low frequency of 0.5 Hz. Due to the low sampling frequency, conventional filtering provides limited benefit and may distort the physiological signal [37]. Therefore, biomarker signals were left unfiltered and unnormalized, preserving their original scale and temporal dynamics.
In contrast, IMU signals were filtered to remove noise and irrelevant frequency components. Each channel was processed using a sequential Butterworth filtering approach consisting of two filters applied one after the other [38].
Both filters were applied using zero-phase filtering to prevent phase distortion. After filtering, each signal was normalized using Z-score normalization [39]:
$$z = \frac{x - \mu}{\sigma} \tag{2}$$
where $z$ denotes the normalized sample value, $x$ is the original IMU sample, and $\mu$ and $\sigma$ correspond to the mean and standard deviation, respectively, computed for the corresponding IMU channel of each subject.
Normalization defined by Equation (2) was performed independently for each IMU channel on a per-subject basis, preventing information leakage across subjects while preserving the temporal structure of each recording.
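A minimal sketch of the per-subject, per-channel Z-score normalization is given below. The use of the population standard deviation and the handling of constant channels are assumptions made here, and the function and channel names are hypothetical.

```python
from statistics import fmean, pstdev


def zscore_channel(samples):
    """Z-score one IMU channel using its own mean and standard deviation."""
    mu = fmean(samples)
    sigma = pstdev(samples, mu)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant channel carries no variation; map it to zeros (assumption).
        return [0.0 for _ in samples]
    return [(x - mu) / sigma for x in samples]


def normalize_subject(channels):
    """Normalize every channel of a single subject independently.

    `channels` maps a channel name to its list of samples; statistics are
    never shared across channels or across subjects.
    """
    return {name: zscore_channel(values) for name, values in channels.items()}


if __name__ == "__main__":
    subject = {"chest_acc_x": [1.0, 2.0, 3.0], "chest_acc_y": [0.5, 0.5, 0.5]}
    print(normalize_subject(subject))
```

Because the statistics are computed separately for each subject, no information from a held-out subject enters the normalization of the training subjects.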
This preprocessing strategy reflects the complementary roles of the signals: IMU data require noise reduction and normalization, whereas biomarker signals are preserved in their original form to maintain physiological interpretability.
2.4. Stage 1: Intensity Representation for Activity-Independent Modeling
To enable activity-independent modeling, an intermediate representation of movement intensity was introduced based on MET. Unlike discrete activity labels, which may not generalize to unseen activities, MET provides a continuous measure of physiological effort. In this framework, MET serves as an intermediate variable that captures the intensity of movement and provides contextual information to the downstream model. This stage corresponds to the MET estimation block shown in Figure 2.
This representation allows the model to interpret signals in terms of energetic demand rather than activity type. By decoupling intensity from activity semantics, the proposed approach supports the learning of generalized relationships between movement, physiological response, and cardiorespiratory outcomes, enabling the model to operate even when the user performs activities that were not included in the training dataset.
As illustrated in greater detail in Figure 6, Stage 1 transforms wearable motion signals into a continuous MET representation through a sequence of preprocessing, segmentation, feature extraction, and regression steps. Movement signals acquired from wearable IMUs are first synchronized and filtered, after which non-overlapping 5-s windows are generated and used to extract representative motion descriptors characterizing movement intensity and temporal dynamics. These features are then processed by a multilayer perceptron (MLP) regression model to estimate continuous MET values over time. The predicted MET signal is subsequently smoothed using median and moving average filters to improve temporal consistency and is then evaluated using a stability criterion based on the accumulated stable duration, denoted by $T_{\mathrm{stab}}$, which represents the amount of consecutive time during which the MET signal remains stable. If the stability condition $T_{\mathrm{stab}} \geq 60$ s is not satisfied, the framework continues processing subsequent 5-s windows and reevaluates the stability criterion. Once the condition is met, the corresponding stable segment is transferred as the output of Stage 1 and forwarded to Stage 2 of the framework. This stability filtering step ensures that the downstream VO2max model primarily receives windows corresponding to sustained activity conditions, reducing the influence of transitional periods that may introduce variability in physiological response patterns.
2.4.1. Feature Engineering for MET Regression
Prior to feature extraction, the synchronized signals were segmented into fixed-length windows to enable localized analysis of movement patterns. To capture the short-term dynamics of human movement, IMU signals were segmented into non-overlapping windows of 5 s. At a sampling rate of 50 Hz, each window contains 250 samples per channel, providing sufficient temporal resolution to characterize variations in motion intensity while remaining responsive to rapid changes in activity.
Within each window, movement intensity was represented through features derived from both accelerometer and gyroscope signals. Rather than relying on raw axes, tri-axial measurements were transformed into magnitude signals, enabling an orientation-invariant description of motion and focusing the representation on the overall level of physical effort.
Feature extraction was designed to capture complementary aspects of movement intensity [40]. Measures such as root mean square, mean absolute value, and signal power quantify the overall energy of the motion, while peak-related descriptors (e.g., maximum value and peak-to-peak range) capture bursts of activity. Statistical features, including variability and dispersion, reflect the consistency or irregularity of movement patterns.
To further capture how intensity evolves within each window, the signal was partitioned into four temporal segments, and the energy of each segment was computed. This provides a compact representation of intra-window dynamics, allowing the model to distinguish between steady motion and transient changes in effort.
Together, these features provide a structured representation of movement intensity that links raw motion signals to their underlying energetic demand, forming the basis for MET estimation.
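The windowing and feature extraction described in this subsection can be sketched as follows. This is an illustrative sketch only: the feature names, the exact feature set, and the treatment of samples left over when a window does not divide evenly into four segments are assumptions, not the authors' implementation.

```python
import math


def magnitude(ax, ay, az):
    """Orientation-invariant magnitude of a tri-axial signal."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def split_windows(signal, fs=50, seconds=5):
    """Non-overlapping fixed-length windows; an incomplete tail is dropped."""
    size = fs * seconds
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def window_features(mag, n_segments=4):
    """Intensity descriptors for one window of a magnitude signal."""
    n = len(mag)
    mean = sum(mag) / n
    power = sum(v * v for v in mag) / n
    feats = {
        "rms": math.sqrt(power),
        "mav": sum(abs(v) for v in mag) / n,
        "power": power,
        "max": max(mag),
        "p2p": max(mag) - min(mag),
        "std": math.sqrt(sum((v - mean) ** 2 for v in mag) / n),
    }
    # Energy of consecutive temporal segments; samples beyond
    # n_segments * seg_len are ignored (assumption).
    seg_len = n // n_segments
    for k in range(n_segments):
        segment = mag[k * seg_len:(k + 1) * seg_len]
        feats[f"energy_seg{k}"] = sum(v * v for v in segment)
    return feats


if __name__ == "__main__":
    mag = magnitude([3.0] * 250, [4.0] * 250, [0.0] * 250)
    print(window_features(mag))
```

Comparing the four segment energies within a window gives a simple indication of whether effort was steady or changed during the window.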
2.4.2. MET-Based Intensity Estimation
To obtain a continuous representation of movement intensity, a regression model was trained to estimate the MET from the structured intensity features extracted from 5-s IMU windows.
The proposed model is a fully connected feedforward neural network designed to learn the nonlinear relationship between wearable motion descriptors and their associated energetic cost. The architecture consists of two hidden layers with GELU activation functions and dropout regularization, followed by a linear output layer that produces a scalar MET estimate for each input window. Ground-truth MET labels were assigned according to the experimental protocol described in Figure 4, allowing the model to learn a direct mapping between movement-derived features and physiological intensity.
This formulation enables movement intensity to be represented as a continuous variable rather than as a discrete activity label. Consequently, transitions between effort levels can be naturally captured over time, providing a smooth and physiologically meaningful description of motion intensity that can be integrated into the downstream VO2max prediction pipeline.
2.5. Integration into the Prediction Pipeline
The estimated MET values were incorporated into the prediction pipeline as a continuous representation of movement intensity. Unlike discrete activity labels, this signal provides a direct measure of physiological effort over time.
To improve robustness, the predicted MET signal was refined using a median filter followed by a moving average filter, reducing outliers and capturing the underlying intensity trend [41].
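The two-step smoothing can be illustrated with the sketch below, in which a median filter removes isolated outliers and a moving average then recovers the trend. The kernel sizes and the truncated-window edge handling are assumptions chosen for illustration; the text does not specify them.

```python
from statistics import median


def median_filter(values, kernel=5):
    """Centered running median; windows are truncated at the signal edges."""
    half = kernel // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def moving_average(values, kernel=5):
    """Centered running mean; windows are truncated at the signal edges."""
    half = kernel // 2
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def smooth_met(met, median_kernel=5, average_kernel=5):
    """Median filter followed by a moving average, as described in the text."""
    return moving_average(median_filter(met, median_kernel), average_kernel)


if __name__ == "__main__":
    # A single spurious spike in an otherwise constant MET signal.
    met = [3.0] * 5 + [9.0] + [3.0] * 5
    print(smooth_met(met))
```

Applying the median filter first keeps an isolated spike from being spread over neighbouring predictions by the averaging step.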
From the smoothed MET signal, features describing both intensity magnitude and temporal behavior were extracted. In addition, a stability metric was defined to quantify the consistency of the intensity signal using the coefficient of variation (CV) [42]:
$$\mathrm{CV} = \frac{\sigma_{\mathrm{MET}}}{\mu_{\mathrm{MET}}} \tag{3}$$
where $\sigma_{\mathrm{MET}}$ and $\mu_{\mathrm{MET}}$ denote the standard deviation and mean of the MET signal within a sliding window, respectively.
To account for temporal persistence, stability was accumulated over time. Let $\Delta t = 5$ s denote the resolution of the MET predictions, and let $i$ indicate the index of the current MET prediction window. A counter $c_i$ was used to track the number of consecutive prediction windows satisfying the MET stability condition:
$$c_i = \begin{cases} c_{i-1} + 1, & \text{if } \mathrm{CV}_i < \tau \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $c_i$ denotes the stability counter at window $i$, $c_{i-1}$ is the counter value from the previous window, $\mathrm{CV}_i$ is the coefficient of variation of the predicted MET signal evaluated at window $i$, and $\tau$ is the predefined stability threshold. When $\mathrm{CV}_i$ is below $\tau$, the current window is considered stable and the counter is increased by one. Otherwise, the counter is reset to zero.
The corresponding stability duration, $T_{\mathrm{stab}}$, was then computed as:
$$T_{\mathrm{stab}} = c_i \, \Delta t \tag{5}$$
where $T_{\mathrm{stab}}$ represents the cumulative duration, in seconds, over which consecutive MET prediction windows have satisfied the stability condition, and $\Delta t$ is the duration of each MET prediction window.
This metric enables the identification of sustained activity periods, which are later used as a gating mechanism for VO2max modeling.
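The stability counter and the accumulated stability duration can be sketched as follows. The coefficient of variation is computed over a trailing window of recent MET predictions, the counter grows while it stays below the threshold, and the duration is the counter multiplied by the prediction resolution. The trailing-window length and the threshold value used here are hypothetical, since the text does not state them.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean of the given values."""
    mu = fmean(values)
    if mu == 0:
        return float("inf")  # undefined for a zero mean; treated as unstable
    return pstdev(values, mu) / mu


def stability_durations(met, window_s=5.0, cv_window=6, cv_threshold=0.1):
    """Accumulated stable duration (seconds) after each MET prediction.

    cv_window and cv_threshold are illustrative values (assumptions).
    """
    counter = 0
    durations = []
    for i in range(len(met)):
        recent = met[max(0, i - cv_window + 1):i + 1]
        cv = coefficient_of_variation(recent)
        # Increment while stable, reset to zero otherwise.
        counter = counter + 1 if cv < cv_threshold else 0
        durations.append(counter * window_s)
    return durations


if __name__ == "__main__":
    steady = [2.0] * 12
    print(stability_durations(steady)[-1])  # twelve stable 5-s predictions
```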
2.6. Stage 2: Modeling Framework for VO2max Estimation
Building upon the stable MET representation obtained in Stage 1, the second stage maps multimodal wearable features to a subject-level VO2max estimate. Since short-term fluctuations in the predicted MET signal are frequently associated with activity transitions and delayed physiological adaptation, only temporally stable segments are considered for downstream modeling.
A segment is considered valid when the accumulated stability duration satisfies
$$T_{\mathrm{stab}} \geq 60\ \mathrm{s} \tag{6}$$
where $T_{\mathrm{stab}}$ is the accumulated stability duration defined in Equation (5). Thus, only segments for which the predicted MET signal remains stable for at least 60 consecutive seconds are forwarded to Stage 2.
The 60-s stability requirement was introduced to prioritize sustained physiological responses over transient activity transitions. Heart rate kinetics studies have shown that cardiovascular adaptation to changes in exercise intensity occurs progressively, with delayed responses before a new steady state is reached [43,44]. Consequently, stable periods were selected to improve feature reliability and reduce the influence of transitional segments. While this criterion was appropriate for the activity durations considered in the present protocol, future studies involving more heterogeneous free-living behavior may benefit from evaluating alternative stability thresholds.
Once the stability criterion is satisfied, the corresponding multimodal signals, including IMU data, biomarker signals (HR and SpO2), and MET estimates, are segmented into fixed-length windows of 60 s, which constitute the fundamental units for feature extraction in Stage 2.
As illustrated in Figure 7, each stable window is transformed into a multimodal feature vector combining movement-derived, physiological, intensity-related, and demographic information. These feature vectors are subsequently processed by the XGBoost regression model to generate window-level VO2max estimates. Since cardiorespiratory fitness is defined at the subject level rather than the window level, predictions obtained across all stable windows corresponding to a participant are aggregated to produce a final subject-level estimate. This aggregation step reduces the influence of local variability between individual windows and enables the model to capture more consistent subject-specific physiological patterns across different activity conditions.
2.6.1. Feature Engineering for VO2max Regression
Feature engineering was designed to capture the relationship between movement intensity and physiological response in an activity-independent manner. Features were extracted from synchronized accelerometry, HR, SpO2, and MET signals using 60-s windows.
Movement features were derived from acceleration magnitude signals obtained from sensors placed on the chest, knee, and hand. Time-domain and frequency-domain descriptors were used to characterize movement intensity, variability, and temporal dynamics, while inter-segment correlations were included to capture coordination patterns across body regions [45].
Physiological features were extracted from HR and SpO2 signals, summarizing their level, variability, and temporal evolution within each window. To explicitly link movement and physiological response, cross-modal features were also computed between HR and motion signals [46].
The MET signal, estimated from the Stage 1 regression model, was incorporated as a continuous representation of activity intensity. Descriptors capturing intensity level and temporal variation were extracted. In addition, combined HR and MET features were included to reflect cardiovascular efficiency under different workload conditions.
Finally, demographic and baseline physiological variables, including age, sex, height, weight, BMI, resting HR, and resting SpO2, were incorporated to provide subject-specific context [47]. This allowed the model to account not only for the instantaneous physiological response to activity, but also for individual characteristics that influence cardiorespiratory fitness.
Figure 8 provides an illustrative example of representative engineered features for Subject 5 across the experimental protocol. The selected signals include physiological features (Heart Rate and SpO2), movement-derived features (Chest Motion RMS), and intensity-related features (MET and HR-MET coupling), each represented by distinct color groupings for visual clarity. The figure is intended to provide a qualitative view of how these feature categories evolve throughout the protocol and respond to changes in activity intensity.
Periods of rest are characterized by relatively stable and low-amplitude responses across most modalities, whereas transitions to more demanding activities, particularly walking and box carrying, are accompanied by increases in MET estimates, heart rate, and chest motion RMS. The figure also illustrates that the different modalities do not evolve identically. For example, activities associated with similar MET levels may exhibit distinct movement and physiological responses, as evidenced by differences in chest motion RMS and HR-MET coupling. These observations highlight the complementary information captured by multimodal wearable signals and motivate the integration of movement-derived, physiological, and intensity-related features for activity-independent VO2max estimation.
2.6.2. Subject-Level Prediction Strategy
The regression model produces predictions at the window level, while the target variable, VO2max, is defined at the subject level. Each subject contributes multiple windows derived from 60-s segments of synchronized data.
To obtain a subject-level estimate, window-level predictions are grouped into non-overlapping chunks of fixed size. For each chunk, the median prediction is computed to reduce the influence of noisy or unrepresentative windows. The final VO2max estimate for the subject is then obtained as the median of all chunk-level predictions.
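Formally, given a set of window-level predictions $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\}$, these are partitioned into $K$ chunks, and the subject-level prediction is defined as:
$$\hat{y}_{\mathrm{subj}} = \operatorname{median}\left\{ \operatorname{median}(C_k) \right\}_{k=1}^{K} \tag{7}$$
where $C_k$ denotes the set of predictions within the $k$-th chunk.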
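The median-of-medians aggregation can be sketched in a few lines. The chunk size used below is a hypothetical value, and allowing a shorter final chunk is an assumption; the text only states that chunks are non-overlapping and of fixed size.

```python
from statistics import median


def subject_level_estimate(window_preds, chunk_size=5):
    """Median of chunk-level medians over one subject's window predictions."""
    if not window_preds:
        raise ValueError("at least one window-level prediction is required")
    # Non-overlapping chunks; the last chunk may be shorter (assumption).
    chunks = [window_preds[i:i + chunk_size]
              for i in range(0, len(window_preds), chunk_size)]
    chunk_medians = [median(chunk) for chunk in chunks]
    return median(chunk_medians)


if __name__ == "__main__":
    # Two outlying windows (60 and 20) barely affect the final estimate.
    preds = [40, 41, 39, 60, 40, 42, 43, 41, 20, 42]
    print(subject_level_estimate(preds))
```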
2.6.3. Model Architecture
VO2max estimation was formulated as a supervised regression task using extreme gradient boosting (XGBoost) [48]. This model was selected due to its strong performance on structured tabular data and its ability to capture nonlinear relationships among heterogeneous feature types, including physiological, movement-derived, intensity-related, and demographic variables.
Training was performed at the window level, where each feature vector represented a 60-s segment and was assigned the VO2max label of the corresponding subject. This strategy allowed the model to learn consistent subject-level physiological patterns from multiple observations collected across different activities and conditions.
To improve sensitivity across the full fitness spectrum, sample weighting was incorporated during training, assigning greater importance to subjects at the lower and upper ends of the VO2max distribution. In addition, the target variable was standardized using statistics computed from the training set and transformed back to the original scale for evaluation.
Model hyperparameters were selected based on validation performance to balance predictive accuracy and generalization.
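The target standardization and tail-emphasizing sample weighting described above could take a form such as the sketch below. The weighting rule shown (one plus a multiple of the absolute standardized target) is purely an assumption for illustration; the text does not specify the weighting function, and all function names and the `alpha` parameter are hypothetical.

```python
from statistics import fmean, pstdev


def fit_target_scaler(train_targets):
    """Mean and standard deviation computed from training targets only."""
    mu = fmean(train_targets)
    sigma = pstdev(train_targets, mu)
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(y, mu, sigma):
    """Map a VO2max value to the standardized training scale."""
    return (y - mu) / sigma


def destandardize(z, mu, sigma):
    """Map a standardized prediction back to the original VO2max scale."""
    return z * sigma + mu


def tail_weights(train_targets, mu, sigma, alpha=1.0):
    """Weights that grow with distance from the training mean (assumed rule)."""
    return [1.0 + alpha * abs((y - mu) / sigma) for y in train_targets]


if __name__ == "__main__":
    targets = [30.0, 40.0, 50.0]
    mu, sigma = fit_target_scaler(targets)
    print(tail_weights(targets, mu, sigma))
```

Weights of this kind can be passed to a gradient boosting regressor through a per-sample weight argument during fitting, so that errors on low- and high-fitness subjects contribute more to the training loss.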
2.7. Evaluation Protocol
Model performance was evaluated using the Leave-One-Subject-Out (LOSO) [49] cross-validation strategy. In this setup, data from one subject was held out for testing, while the model was trained using data from all remaining subjects. This process was repeated for each subject, ensuring that all evaluations were performed on unseen individuals.
Performance was assessed at the subject level. Although the model produces predictions at the window level, these were aggregated into a single estimate per subject, as described in Section 2.6.2. This ensures consistency between the prediction and the ground truth, which is defined at the subject level.
The primary evaluation metric was the root mean squared error (RMSE), computed between the predicted and true VO2max values across all subjects. In addition, the coefficient of determination ($R^2$) and the Pearson correlation coefficient (r) were reported to assess the strength and linearity of the relationship between predicted and true values.
All reported metrics correspond to the aggregation of predictions from the held-out subjects across all folds.
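The evaluation protocol can be sketched as a subject-wise split generator together with the three reported metrics, computed on the pooled held-out predictions. This is a minimal standard-library sketch; the function names are hypothetical and the model fitting step is omitted.

```python
import math
from statistics import fmean


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r_squared(y_true, y_pred):
    """Coefficient of determination relative to the mean of the true values."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    mean_true, mean_pred = fmean(y_true), fmean(y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / math.sqrt(var_true * var_pred)


if __name__ == "__main__":
    y_true = [30.0, 40.0, 50.0]
    y_pred = [32.0, 40.0, 48.0]
    print(rmse(y_true, y_pred), r_squared(y_true, y_pred), pearson_r(y_true, y_pred))
```

Splitting by subject identifier rather than by window ensures that no window from the held-out subject is seen during training.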
3. Results
This section presents the results of the proposed framework, including both the MET regression and the VO2max estimation models. The evaluation focuses on generalization across subjects under a LOSO scheme.
Performance is assessed using RMSE, $R^2$, and r, providing a concise view of prediction accuracy and agreement with reference values.
3.1. Stage 1 Performance: MET Regression
The first stage of the proposed framework estimates a continuous MET signal from wearable motion features, providing an intermediate representation of movement intensity for downstream physiological modeling. Prior to evaluation, the model hyperparameters were optimized using Optuna (v3.6.2) [50] under a LOSO validation scheme. The selected configuration consisted of 87 training epochs, a learning rate of
, weight decay of
, a batch size of 64, a hidden dimension of 128 neurons, and a dropout rate of 0.44.
In addition to LOSO, model performance was evaluated using a Leave-One-Activity-Out (LOAO) strategy, which provides a stricter test of generalization by assessing the model’s ability to estimate intensity levels for activities that were entirely excluded during training. To enforce this separation, LOAO partitioning was defined using the ground-truth MET labels assigned by the experimental protocol, ensuring that all samples corresponding to the held-out intensity level were omitted from training. This design prevents exposure to that activity intensity during model fitting and provides a rigorous evaluation of generalized intensity estimation across unseen movement conditions.
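The LOAO partitioning by protocol-assigned MET label can be sketched as follows. The function name and the intermediate MET levels (3.5 and 5.0) are hypothetical; only the endpoints 1.0 and 6.8 are taken from the protocol described in this paper. The sketch also flags folds in which the held-out level lies outside the range seen during training, since those folds require extrapolation rather than interpolation.

```python
def loao_splits(met_labels):
    """Yield (held_out_level, train_idx, test_idx, is_extrapolation) per MET level.

    met_labels holds the protocol-assigned MET value of each window.
    """
    for held_out in sorted(set(met_labels)):
        train_idx = [i for i, m in enumerate(met_labels) if m != held_out]
        test_idx = [i for i, m in enumerate(met_labels) if m == held_out]
        train_levels = [met_labels[i] for i in train_idx]
        # True when the held-out intensity is below or above every training level.
        is_extrapolation = (held_out < min(train_levels)
                            or held_out > max(train_levels))
        yield held_out, train_idx, test_idx, is_extrapolation

# Seven windows; 3.5 and 5.0 are placeholder intermediate intensities.
window_mets = [1.0, 1.0, 3.5, 3.5, 5.0, 6.8, 6.8]
loao_folds = list(loao_splits(window_mets))
extrapolation_levels = [lvl for lvl, _, _, extrap in loao_folds if extrap]
```

Here only the lowest and highest levels are flagged, because every other held-out level is bracketed by intensities that remain in the training set.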
As shown in Figure 9, the model achieved low prediction error under LOSO (RMSE = 0.885 MET) with near-zero bias, indicating reliable intensity estimation across unseen subjects. Under the more challenging LOAO setting, performance remained strong for intermediate-intensity activities, where both RMSE and prediction bias remained relatively small. Larger deviations were observed at the extremes of the MET range, particularly for held-out activities corresponding to rest (MET = 1.0), which was consistently overestimated, and cycling (MET = 6.8), which was systematically underestimated. This pattern suggests that the regression model primarily learns an interpolative mapping within the range of intensities observed during training, while extrapolation beyond that range remains more challenging. Nevertheless, the model preserved a coherent progression of estimated intensity across unseen activity conditions, supporting its use as a meaningful intermediate representation of physical effort within the proposed framework.
3.2. Stage 2 Performance: VO2max Estimation
The second stage of the proposed framework evaluates the ability of the aggregated wearable-derived features to estimate subject-level VO2max under a LOSO validation scheme. The proposed model achieved a mean fold RMSE of 5.48 mL·kg−1·min−1, a global RMSE of 6.82 mL·kg−1·min−1, an R² value of 0.40, and a Pearson correlation coefficient of , indicating moderate predictive agreement with the reference VO2max values and meaningful capture of inter-subject physiological variability.
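The gap between the mean fold RMSE and the global RMSE is expected on general grounds. The text does not state exactly how the per-fold RMSE was computed, so the following is a sketch under the assumption that each fold i yields a mean squared error MSE_i and that the global figure pools squared errors across the N folds with equal weight. Because the square root is concave, Jensen's inequality gives:

```latex
% MSE_i : mean squared error of fold i (one held-out subject), N folds in total
\[
\overline{\mathrm{RMSE}}_{\mathrm{fold}}
  = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\mathrm{MSE}_i}
  \;\le\;
  \sqrt{\frac{1}{N}\sum_{i=1}^{N}\mathrm{MSE}_i}
  = \mathrm{RMSE}_{\mathrm{global}}
\]
% Special case: one subject-level estimate per fold, with error e_i,
% gives MSE_i = e_i^2, so the left-hand side is the mean absolute error.
\[
\frac{1}{N}\sum_{i=1}^{N}\lvert e_i\rvert
  \;\le\;
  \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^{2}}
\]
```

Under these assumptions, the reported values (5.48 versus 6.82 mL·kg−1·min−1) are consistent with a few subjects carrying large errors, which the global RMSE penalizes more heavily than the fold average does.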
Figure 10 presents the relationship between predicted and reference VO2max values across all LOSO folds. A clear positive association can be observed, confirming that the model successfully preserves the relative ranking of subjects according to cardiorespiratory fitness. The fitted calibration line exhibited a slope of 0.466 and an intercept of 24.204 mL·kg−1·min−1, indicating compression of the predicted VO2max range relative to the reference values. As a result, lower VO2max values tended to be overestimated, whereas higher VO2max values were generally underestimated. This behavior is likely influenced by the concentration of participants within the mid-range of the fitness distribution, which encourages regression toward the population mean and reduces predictive sensitivity at the extremes of the physiological spectrum.
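The reported slope and intercept imply a crossover point below which the trend line overestimates and above which it underestimates. The short calculation below makes this explicit. It assumes the calibration line regresses predicted on reference VO2max (which the slope below 1 and the described compression suggest), and the two probe values of 30 and 60 mL·kg−1·min−1 are arbitrary illustrations, not data from the study.

```python
# Reported calibration line: predicted = SLOPE * reference + INTERCEPT
SLOPE = 0.466
INTERCEPT = 24.204  # mL/kg/min

def trend_prediction(reference_vo2):
    """Value on the fitted calibration line for a given reference VO2max."""
    return SLOPE * reference_vo2 + INTERCEPT

def trend_bias(reference_vo2):
    """Predicted minus reference along the trend line (positive = overestimate)."""
    return trend_prediction(reference_vo2) - reference_vo2

# The line crosses the identity line where SLOPE * x + INTERCEPT = x.
crossover = INTERCEPT / (1.0 - SLOPE)   # about 45.3 mL/kg/min

bias_low = trend_bias(30.0)    # positive: low-fitness subjects are overestimated
bias_high = trend_bias(60.0)   # negative: high-fitness subjects are underestimated
```

These figures describe the average trend only; individual subjects scatter around the line.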
This prediction pattern is further illustrated in Figure 11, which shows the residual error as a function of the true VO2max value. Positive residuals are more frequent at lower fitness levels, while increasingly negative residuals appear at higher VO2max values, confirming a compression effect in the predicted range. This behavior suggests that subjects at the physiological extremes remain the most challenging cases for the model, whereas prediction errors remain comparatively balanced within the mid-range values, where most participants are concentrated. Overall, the observed trend indicates that the proposed framework captures meaningful cardiorespiratory patterns from wearable-derived signals while highlighting the need for greater representation of extreme fitness profiles in future datasets.
To further evaluate the robustness of the proposed activity-independent representation, an additional LOAO validation was performed, in which all samples corresponding to one activity category were excluded during training and used exclusively for testing. This protocol provides a stricter assessment of generalization, as the model must infer cardiorespiratory fitness from movement intensities that were not observed during optimization. As shown in Figure 12, the average RMSE remained close to the LOSO baseline across all held-out activities, with all LOAO conditions remaining within 0.13 mL·kg−1·min−1 of the LOSO average RMSE. This indicates only modest performance variations between LOAO conditions. Slightly larger errors were observed when box carrying was excluded, suggesting that this activity may provide a richer combination of biomechanical and physiological responses that contributes useful information for characterizing cardiorespiratory fitness. Nevertheless, the overall differences remained limited, supporting the robustness of the proposed activity-independent representation. These results indicate that the proposed framework does not rely on activity-specific patterns, but instead learns generalized relationships between physiological response, biomechanical behavior, and underlying cardiorespiratory fitness.
Collectively, these results demonstrate that short-duration wearable-derived physiological and biomechanical signals encode sufficient information to recover meaningful inter-subject variability in cardiorespiratory fitness. Moreover, the limited variation observed across LOAO conditions indicates that the proposed framework does not depend heavily on any single activity and generalizes well to unseen activities within the experimental protocol, supporting the central premise of activity-independent cardiorespiratory health estimation from wearable data. Validation under fully unconstrained free-living behavior remains an important direction for future work.
To assess the propagation of Stage 1 errors into the final prediction task, we computed the correlation between subject-level MET prediction error (MAE) and the absolute VO2max prediction residual. The resulting correlation was weak (Pearson , ), indicating that larger MET estimation errors do not systematically correspond to larger VO2max prediction errors.
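The error-propagation check can be sketched as follows: compute one MET MAE per subject from window-level Stage 1 predictions, pair it with that subject's absolute VO2max residual, and correlate the two. All data values, subject names, and function names below are made up for illustration, and the sketch reports only the correlation coefficient, not the accompanying significance test.

```python
from math import sqrt
from statistics import mean

def subject_met_mae(pred_mets, label_mets):
    """Mean absolute error of the Stage 1 MET estimates for one subject."""
    return mean([abs(p - t) for p, t in zip(pred_mets, label_mets)])

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical per-subject data: (predicted METs, label METs, VO2max residual).
subjects = {
    "a": ([1.5, 3.5], [1.0, 3.0], -4.0),
    "b": ([2.0, 4.0], [1.0, 3.0], 2.0),
    "c": ([2.5, 4.5], [1.0, 3.0], 6.0),
    "d": ([3.0, 5.0], [1.0, 3.0], -3.0),
}

met_errors = [subject_met_mae(p, t) for p, t, _ in subjects.values()]
abs_residuals = [abs(res) for _, _, res in subjects.values()]
propagation_r = pearson_r(met_errors, abs_residuals)
```

A coefficient near zero, as in this toy example, means that subjects with poorer intensity estimates are not systematically the ones with poorer fitness estimates.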
4. Discussion and Future Work
This study demonstrates the feasibility of estimating cardiorespiratory fitness from wearable-derived physiological and motion signals collected during less than one hour of daily activities. Unlike traditional assessment methods that require controlled laboratory protocols, specialized equipment, and clinical supervision, the proposed framework suggests that meaningful fitness-related information can be inferred rapidly under semi-structured real-world activity conditions. This represents an important step toward accessible cardiorespiratory health monitoring using wearable devices. However, further validation under fully unconstrained free-living conditions is needed before conclusions regarding long-term everyday monitoring can be drawn.
A central contribution of this work is its activity-independent modeling strategy. By learning generalized relationships between movement intensity, physiological response, and cardiorespiratory capacity, the proposed framework eliminates the need for specific exercise protocols or prolonged data collection, thereby improving practicality for deployment in semi-structured and potentially real-world monitoring scenarios. Although the proposed framework is not intended to replace clinical CPET evaluation, it may serve as an accessible screening or longitudinal monitoring tool capable of identifying changes in cardiorespiratory fitness outside laboratory environments.
The diversity of the collected cohort, including participants of different ages, sexes, and fitness levels, strengthens the validity of the proposed framework and supports generalization across users. However, because most individuals naturally fall within average fitness ranges, the resulting target distribution is centered around the population mean. This imbalance is reflected in the calibration behavior observed in Figure 10, where the fitted regression line exhibits a compressed prediction range relative to the reference values. This effect is likely driven primarily by the limited representation of subjects at the extremes of the fitness spectrum, which encourages regression toward the population mean.
Another limitation arises from the first-stage MET estimation model and the reference intensity labels used for its development. The protocol-assigned MET values were derived from standardized resources, including the Compendium of Physical Activities and ANSI/CTA guidelines, and therefore provide validated population-level estimates of activity intensity. However, these labels do not fully capture inter-individual variability in energy expenditure, and the model was trained within a relatively limited intensity range (approximately 1 to 6.8 METs), restricting its ability to extrapolate beyond the activities represented in the dataset. Since estimated intensity serves as an important contextual feature for downstream VO2max prediction, both subject-specific variations in metabolic cost and limited representation of higher or lower effort levels may influence model robustness. Nevertheless, the weak association observed between subject-level MET errors and VO2max residuals suggests that the second-stage model is able to leverage complementary physiological, biomechanical, and demographic information beyond the intensity representation alone. Future work should incorporate individualized MET measurements obtained through indirect calorimetry or portable metabolic systems and extend the range of activity intensities to further improve generalization.
Future work should focus on expanding the dataset, particularly by increasing representation at the extremes of the fitness spectrum, where prediction remains most challenging. Incorporating additional biomarkers, such as respiration rate, electrocardiographic signals, skin temperature, and heart rate variability, may further improve physiological characterization. Finally, validation against direct CPET, the gold-standard measure of cardiorespiratory fitness, and the use of directly measured VO2max values for model development would provide stronger physiological and clinical validation of the proposed framework and allow quantification of the impact of reference-label uncertainty introduced by indirect submaximal fitness assessments. Future work will therefore focus on collecting an independent cohort with simultaneous CPET and multimodal wearable measurements to evaluate the proposed activity-independent pipeline against directly measured VO2max values and further assess its generalization capacity under gold-standard testing conditions.
5. Conclusions
This work presented an activity-independent machine learning framework for estimating cardiorespiratory fitness from short-duration multimodal wearable signals acquired during daily life activities. Unlike conventional approaches that rely on structured exercise protocols, prolonged monitoring periods, or explicit activity recognition, the proposed framework was designed to learn generalized relationships between movement intensity, physiological response, and subject-level fitness capacity. To achieve this, a two-stage modeling strategy was introduced in which MET was first estimated as a continuous representation of activity intensity and subsequently integrated with physiological biomarkers, biomechanical descriptors, and demographic information to predict VO2max.
Experimental evaluation under the LOSO validation protocol demonstrated that short-duration wearable-derived signals contain meaningful information related to inter-subject variability in cardiorespiratory fitness. The proposed framework achieved moderate predictive agreement with reference VO2max values, with a mean fold RMSE of 5.48 mL·kg−1·min−1, a global RMSE of 6.82 mL·kg−1·min−1, an R² of 0.40, and a Pearson correlation coefficient of . These results indicate that clinically relevant fitness-related physiological patterns can be captured from less than one hour of wearable monitoring, even without explicit knowledge of the performed activity.
From a broader perspective, these findings support the feasibility of scalable, unobtrusive cardiorespiratory fitness assessment using wearable devices in semi-structured activity settings outside traditional laboratory environments. By reducing dependence on specialized testing equipment, supervised exercise protocols, and activity-specific models, the proposed framework represents a practical step toward accessible preventive health monitoring in everyday life. Future work will focus on validation against direct cardiopulmonary exercise testing measurements, evaluation under fully unconstrained free-living conditions, expanding population diversity, and improving predictive performance at the extremes of the fitness spectrum.