Perspective

From Sensing to Sense-Making: A Framework for On-Person Intelligence with Wearable Biosensors and Edge LLMs

by Tad T. Brunyé 1,2,*, Mitchell V. Petrimoulx 1,2 and Julie A. Cantelon 1,2
1 Center for Applied Brain and Cognitive Sciences, Tufts University, Medford, MA 02155, USA
2 U.S. Army DEVCOM Soldier Center, Natick, MA 01760, USA
* Author to whom correspondence should be addressed.
Sensors 2026, 26(7), 2034; https://doi.org/10.3390/s26072034
Submission received: 21 February 2026 / Revised: 16 March 2026 / Accepted: 23 March 2026 / Published: 25 March 2026
(This article belongs to the Special Issue Sensors in 2026)

Highlights

What are the main findings?
  • A key bottleneck in real-world wearable sensing is the transformation of noisy physiological signals into actionable decisions; a cognitive co-pilot architecture is proposed linking sensing, probabilistic state estimation, LLM-based contextual reasoning, and attention-aware intervention.
  • Local, uncertainty-aware, edge-deployed reasoning is a necessary architectural condition for trustworthy decision support in high-stakes environments, and several research gaps must be closed to achieve this goal.
What are the implications of the main findings?
  • Effective wearable-AI systems will require integrated sociotechnical design combining sensor validation, uncertainty-calibrated inference, grounded LLM reasoning, and cognitive-engineering-driven interface policies.
  • Beyond model accuracy, progress should be evaluated in downstream human outcomes (trust calibration, workload, decision quality, long-term reliance), reframing success as improved human–system integration.

Abstract

Wearable biosensors increasingly stream multi-channel physiological and behavioral data outside the laboratory, yet most deployments still end in dashboards or threshold alarms that leave interpretation open to the user. In high-stakes domains, such as military, emergency response, aviation, industry, and elite sport, the constraint is rarely data availability but the cognitive effort required to convert noisy signals into timely, actionable decisions. We argue for on-person cognitive co-pilots: systems that integrate multimodal sensing, compute probabilistic state estimates on devices, synthesize those states with task and environmental context using locally hosted large language models (LLMs), and deliver recommendations through attention-appropriate cues that preserve autonomy. Enabling conditions include mature wearable sensing, edge artificial intelligence (AI) accelerators, tiny machine learning (TinyML) pipelines, privacy-preserving learning, and open-weight LLMs capable of local deployment with retrieval and guardrails. However, critical research gaps remain across layers: sensor validity under real-world conditions, uncertainty calibration and fusion under distribution shift, verification of LLM-mediated reasoning, interaction design that avoids alarm fatigue and automation bias, and governance models that protect privacy and consent in constrained settings. We propose a layered technical framework and research agenda grounded in cognitive engineering and human–automation interaction. Our core claim is that local, uncertainty-aware reasoning is an architectural prerequisite for trustworthy, low-latency augmentation in isolated, confined, and extreme environments.

1. Introduction

Over the last fifteen years, personal biosensing has progressed from sporadic measurement to continuous high-volume digital exhaust [1,2,3]. Early work in mobile and wearable sensing established the procedures for continuous collection, opportunistic sampling, on-device preprocessing, and context inference, which now underpin commercial wearables and occupational monitoring programs [4]. Yet a persistent mismatch remains between what our systems can measure and what users and decision-makers may need in the moment [5,6]. Heart rate variability (HRV), photoplethysmography (PPG), electrodermal activity (EDA), accelerometry, skin temperature, respiration, sleep staging, and (in some contexts) EEG and eye tracking are all increasingly feasible at the edge. But high-stakes failures rarely occur because decision-makers lacked access to raw time series data; they more often occur because decision-makers were forced to interpret ambiguous signals under time pressure, distraction, uncertainty, and competing operational goals.
In a typical biosensing deployment, sensors stream to a hub, features are computed, classifiers label discrete states, and outputs are visualized as readiness scores, traffic lights, body batteries, gauges, flags, or other alerts. The final step, translating from state to action, is often left under-specified [7,8]. Even when a model correctly classifies and alerts to high stress or fatigue, the user still faces a gap that limits actionability [9]: How does this state interact with the present phase of a task or goal? Is it transient or trending? Is it risky now, or merely informative? What intervention is feasible given operational or contextual constraints? What is the cost of intervening incorrectly?
Cognitive engineering has long emphasized that the human is not a peripheral user, but a coupled component in a sociotechnical system [10]. Classic constructs such as limited working memory, attentional bottlenecks, situational awareness, and mental workload help determine whether information is noticed, trusted, integrated, and acted upon [5,9,11,12]. These theories converge on a practical implication: a high-stakes intervention system must act as a decision-support partner that manages attention, uncertainty, and timing. We argue that this is where locally hosted LLMs become potentially useful: namely, as computational mechanisms for contextual synthesis, explanation, and dialog, provided they are constrained, grounded, and audited [13,14,15]. Our paper is motivated by recent advances in edge artificial intelligence (AI) and local language-model deployment, including runtime-efficient inference for edge LLMs, improved post-training quantization and compression methods, and a growing literature on human–AI collaboration and wearable sensing pipelines [7,15,16,17,18].
In this opinion paper, we do not present a new algorithm or empirical system evaluation; rather, we synthesize existing technical and cognitive science studies into a conceptual framework and research agenda. We organize this agenda around a four-layer framework: (1) multimodal sensing, (2) probabilistic state estimation, (3) contextual reasoning, and (4) attentional action, clarifying where technical and human-factor risks emerge (Figure 1).
The novelty of this Perspective article is not the introduction of a new sensor, classifier, or standalone LLM application. Rather, it is the explicit integration of four elements that are often treated separately in the literature: (i) multimodal, quality-aware biosensing, (ii) uncertainty-calibrated probabilistic state estimation, (iii) grounded local reasoning that translates state estimates into context-sensitive options and explanations, and (iv) attention-aware interventions designed to preserve human autonomy and calibrated trust. Existing work has often focused on one of these components in isolation, such as wearable sensing, edge inference, dashboarding, threshold alarms, or LLM-enabled explanation. Our contribution is to position these as parts of a single sociotechnical architecture for on-person cognitive co-pilots and identify the technical and human-factor dependencies that must be solved jointly for trusted deployment in high-stakes environments.

2. Edge Computing as an Enabler

Edge computing is often motivated by latency and bandwidth optimization. In high-stakes biosensing, it may be more accurately framed as both necessary (i.e., given network availability and/or security constraints) and trust enabling. Satyanarayanan argued that edge computing emphasizes proximity to the data source, reduced latency, and improved robustness when connectivity is limited, properties that map directly onto isolated, confined, and extreme (ICE) environments [19]. Shi and colleagues likewise highlighted the technical imperatives: real-time processing near sensors, resilience to network disruption, and the need to manage heterogeneity and mobility [20].
These imperatives matter because many occupational contexts either cannot tolerate a cloud dependency (submarines, mines, disaster zones, contested environments) or cannot accept the privacy and governance implications of streaming raw biometrics off-person [21,22,23]. Local inference changes the model: sensitive data can be reduced to features or probabilistic state estimates, stored briefly, and protected via device security primitives. It also changes the interaction model: if inference must wait on a network round trip, the system will either be ignored (too slow) or will push users back to conservative alarms (too brittle).
In parallel, hardware and software ecosystems have matured to make on-device inference practical. Tiny machine learning (TinyML) and embedded inference pipelines have demonstrated that compressed neural networks can run within tight power envelopes when the full pipeline (i.e., sampling, preprocessing, model, postprocessing) is integrated [24]. Open-weight LLMs now provide a controllable substrate for on-device reasoning, especially when combined with retrieval, tool use, and strict output schemas [25,26]. Below, we describe key layers of a system that transforms sensing into sense-making.

2.1. Layer 1: Multimodal Sensing

The sensing layer is critical, but motion, sweat, temperature extremes, poor contact, and electromagnetic noise can all degrade signal quality and induce systematic bias [27,28]. Photoplethysmography (PPG) is a canonical example: while it is attractive for heart rate (HR) and peripheral capillary oxygen saturation (SpO2) estimation, motion artifacts and contact changes can dominate the waveform during physical activity, making accuracy highly context dependent [29,30]. Similar challenges exist for electrodermal activity (EDA; sweat and temperature sensitivity) [31], skin temperature (environmental coupling) [32], and electroencephalography (EEG; electrode stability and artifact contamination) [33].
This is why the most credible architectures embrace redundancy and fusion. Human activity recognition and context inference research has repeatedly shown that multi-sensor combinations, such as IMU + location + physiological channels, outperform any single stream and provide robustness under partial failure [34,35,36,37]. The same logic applies to readiness and risk monitoring: a plausible “fatigue” estimate might integrate sleep history, HRV dynamics, movement micro-patterns, and behavioral markers (e.g., response variability); it would not treat any one feature as decisive [38].
Wearable form factors are expanding the design space. E-textiles and smart garments promise distributed sensing that may be more comfortable and less obtrusive than point devices, while also enabling multi-site measurement (e.g., respiration via chest expansion, EMG at muscle groups). Reviews have detailed both the opportunity and the practical limitations: washability, durability, signal stability, and integration with power and data pathways [39,40,41]. Likewise, noninvasive biochemical sensing, particularly with sweat-based analytes, has advanced rapidly, with work surveying electrochemical and microfluidic approaches for electrolytes, metabolites, and stress-relevant markers [42]. The near-term point is not that a single biomarker (e.g., cortisol) will be perfectly measured in the field tomorrow; rather, that the sensing layer is increasingly capable of providing multi-modal proxies for state inference.
A key technical implication is that Layer 1 should not merely collect data from sensors. It should produce metadata outputs related to signal quality, such as contact confidence, motion intensity, ambient interference, and missingness and artifact likelihood [43,44]. Without explicit quality modeling, downstream classifiers and LLM reasoning will conflate physiology with artifact and degrade trust when the system fails in what might have been predictable, repeatable ways.
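The quality-aware output described above can be made concrete with a minimal sketch. The schema below is hypothetical (field names and thresholds are our illustrative assumptions, not a proposed standard): each sensor window carries explicit quality metadata, and a coarse gate decides whether the window should feed state estimation at all.

```python
from dataclasses import dataclass

# Hypothetical Layer 1 output: a sensor window paired with explicit
# quality metadata, so downstream layers never see "bare" data.
@dataclass
class SensorWindow:
    channel: str                 # e.g., "ppg", "eda", "imu"
    samples: list[float]         # raw or lightly filtered window
    contact_confidence: float    # 0..1, from contact/impedance checks
    motion_intensity: float      # e.g., normalized IMU energy
    missing_fraction: float      # fraction of dropped samples
    artifact_likelihood: float   # 0..1, from an artifact detector

    def usable(self, min_contact: float = 0.6,
               max_artifact: float = 0.5) -> bool:
        """Coarse gate: should this window feed state estimation?
        Thresholds are illustrative placeholders."""
        return (self.contact_confidence >= min_contact
                and self.artifact_likelihood <= max_artifact
                and self.missing_fraction < 0.2)

window = SensorWindow("ppg", [0.1, 0.4, 0.2], contact_confidence=0.9,
                      motion_intensity=0.3, missing_fraction=0.0,
                      artifact_likelihood=0.1)
print(window.usable())  # a clean window passes the gate
```

The design point is that quality metadata travel with the data, so a downstream classifier (or LLM) can distinguish physiology from artifact rather than conflating them.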

2.2. Layer 2: Probabilistic State Estimates

Layer 2 is where raw streams become interpretable state variables, and where many current systems stop. In occupational settings, it is not sufficient to label stress as high. The system must quantify uncertainty and detect drift. Activity recognition research has long wrestled with these issues, including non-independent and identically distributed (non-IID) data across individuals, context-dependent movement signatures, and sensor placement variability [35,45]. In real sensor deployments in occupational contexts, these are not edge cases—they are often the default.
Historically, signal-to-state pipelines relied on hand-engineered features (e.g., time-domain HRV metrics, frequency-domain power, EDA peaks, IMU statistics) paired with classical models [46,47,48]. Those methods remain valuable on-device because they are interpretable and computationally efficient; however, deep learning models (e.g., temporal convolutional neural networks/CNNs, long short-term memory/LSTMs, gated recurrent units/GRUs, and increasingly lightweight transformers) have demonstrated superior performance when trained on large, diverse datasets and when they can learn invariant features across contexts. The challenge is that field physiology is heterogeneous, and accuracy on a benchmark may not reflect real-world performance. Three technical priorities deserve emphasis in an expert-facing architecture:
  • Fusion as Estimation. Feature concatenation can fail silently when one channel is corrupted. Fusion should be treated as a state estimation problem: combine multiple noisy observations into a posterior belief over latent states. Classical filters (e.g., Kalman variants, particle filters) and modern neural Bayesian approximations can serve this role; the critical design point is to produce a distribution (or at least calibrated confidence) in addition to a discrete label [49,50].
  • Calibration and Abstention. A classifier that always outputs a label is dangerous in high-stakes contexts. Layer 2 should support abstention (i.e., insufficient confidence) and graceful degradation (i.e., automated fallback to simpler models when signals are poor) [51,52]. Providing users with uncertainty information can support calibrated trust in a system and organization.
  • Continual Improvement. Federated learning provides a principled mechanism to update models across multiple devices while keeping training data local, aggregating parameter updates [53,54]. In practice, federated pipelines must still handle adversarial updates, heterogeneity, and auditability; but as an architectural concept, they align naturally with occupational privacy constraints.
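The fusion-as-estimation idea above can be sketched with the simplest possible case: precision-weighted (inverse-variance) fusion of independent Gaussian observations of one latent state, plus an abstention check. The fatigue-proxy framing, numbers, and abstention threshold are hypothetical illustrations.

```python
import math

def fuse_gaussian(estimates):
    """Precision-weighted (inverse-variance) fusion of independent
    Gaussian observations of the same latent state.
    estimates: list of (mean, variance) pairs; corrupted channels
    should arrive with inflated variance from Layer 1 quality flags."""
    precision = sum(1.0 / v for _, v in estimates)
    mean = sum(m / v for m, v in estimates) / precision
    return mean, 1.0 / precision  # posterior mean, posterior variance

# Two hypothetical fatigue proxies on a 0..1 scale: an HRV-derived
# estimate (clean signal, low variance) and a movement-derived estimate
# (motion-corrupted, so its variance was inflated upstream).
mean, var = fuse_gaussian([(0.70, 0.01), (0.40, 0.09)])

# Abstain when the fused posterior is still too uncertain to act on.
ABSTAIN_STD = 0.25  # illustrative placeholder
decision = "abstain" if math.sqrt(var) > ABSTAIN_STD else f"fatigue ~ {mean:.2f}"
print(decision)
```

Note how the corrupted channel is down-weighted rather than silently concatenated, and how the output is a distribution (mean plus variance), not a bare label.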
Personalization is necessary because physiological baselines, behavioral signatures, and task responses vary substantially across users; of course, personalization must be bounded to avoid overfitting and silent model drift. In practice, safer personalization should rely on conservative mechanisms such as baseline normalization, threshold adjustment within predefined limits, calibration to recent within-user history, and periodic recalibration windows rather than unrestricted online updating of all model parameters. Drift detection should likewise be treated as a first-class function of Layer 2, using indicators such as persistent degradation in signal quality, shifts in feature distributions, instability in posterior estimates, or repeated disagreement between model expectations and observed data patterns [45,55]. When drift is detected, the system should not continue to produce recommendations in the same manner it would with confident inputs; instead, it should respond gracefully by lowering confidence, widening uncertainty bounds, increasing abstention frequency, falling back to a simpler or better-calibrated model when available, and, when necessary, requesting sensor adjustment, recalibration, or renewed baseline collection [51,52,56]. In high-stakes deployments, this type of bounded personalization and explicit drift response is preferable to maximizing short-term predictive fit at the cost of reduced transparency and trust.
By the end of Layer 2, the system should output a compact, semantically meaningful state vector with uncertainty. Concretely, the Layer 2 output should be something the system can treat as a best estimate of the user’s current state, along with how confident it is and whether conditions look outside what the model was trained for. For example, fatigue could be represented as an individualized fatigue estimate with a confidence range relative to baseline; heat strain as a risk probability with a confidence score and a drift flag; and abstention as an explicit “no recommendation” outcome when confidence is low and/or signal quality is poor. Uncertainty may reflect both measurement noise and model uncertainty under distribution shift, both of which are critical in field deployments. These state estimates may include hydration, fatigue, stress, heat strain, cognitive load, or motor stability, ideally with uncertainty estimates [57,58]. Here, state refers to a probabilistic latent variable inferred from multiple signals, not a direct measurement. Without uncertainty-aware state representations, downstream reasoning systems are forced to treat estimates as facts, amplifying error and overconfidence at higher layers. This state vector is the handoff point to Layer 3. Practically speaking, this handoff should be treated as a structured interface specification so that uncertainty, drift status, and abstention are preserved (and not collapsed into a single point estimate).
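One way to treat this handoff as a structured interface, rather than a collapsed point estimate, is a typed record like the sketch below. The field names are illustrative assumptions on our part, not a proposed standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class StateEstimate:
    """Hypothetical Layer 2 -> Layer 3 handoff record; field names
    are illustrative, not a proposed standard."""
    name: str                  # e.g., "fatigue", "heat_strain"
    value: Optional[float]     # point estimate; None when abstaining
    ci_low: Optional[float]    # lower bound of the uncertainty interval
    ci_high: Optional[float]   # upper bound
    quality: float             # aggregated signal-quality score, 0..1
    drift_flag: bool           # out-of-distribution / drift indicator
    abstained: bool            # True means "no recommendation"

fatigue = StateEstimate("fatigue", value=0.67, ci_low=0.55, ci_high=0.79,
                        quality=0.9, drift_flag=False, abstained=False)
payload = asdict(fatigue)  # a structured interface, not a bare label
print(payload["value"], payload["abstained"])
```

Because uncertainty, drift status, and abstention are explicit fields, Layer 3 cannot accidentally treat a low-confidence estimate as a fact.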

2.3. Layer 3: From States to Insights

The promise of Layer 3 is that an LLM can serve as a contextual synthesis engine, a mechanism for turning a state vector plus task/environment constraints into a structured recommendation and explanation, in natural language when needed, with traceable data provenance [58,59,60]. Importantly, we do not propose that the LLM directly infers physiology or replaces the underlying state estimation. Layer 2 produces the core truth outputs (state estimates, uncertainty, and quality/drift indicators). Layer 3 is limited to structured synthesis, linking recommendations to trusted guidance, and dialog for clarification, while preserving uncertainty and supporting abstention when evidence is weak.
The role of the LLM is three-fold. First, for context integration, it binds the Layer 2 state vector to task phase, role constraints, environmental hazards, and historical patterns [57,61,62]. The result is not fatigue is high, but rather fatigue is high relative to baseline, with rising trend and low recovery likelihood in the next 45 min. For example, a fatigue estimate with high uncertainty during a high-stakes occupational exercise should yield a recommendation to monitor and re-assess (i.e., rather than to intervene immediately); an unconstrained language model might instead overconfidently narrate an urgent risk and prompt unnecessary disruption.
Second, for decision structuring, an LLM can translate latent estimates into action options (i.e., candidate courses of action), explicitly noting tradeoffs: intervene now (cost: time, mission impact), delay (cost: rising risk), or gather more information (cost: attention) [63,64]. A key design choice is to present options, not commands, unless immediate safety thresholds are crossed.
Third, explanations can support calibrated trust. Interpretability research is clear that an explanation must match the stakeholder’s needs, context, and timing [65,66,67,68]. In a high-stakes moment, a one-sentence rationale plus a confidence statement may be best; later, a detailed audit trail may be essential for learning (after-action review) and accountability.
These roles are only feasible for a local LLM if the system enforces evidence grounding, uses structured outputs, and treats the LLM as one component in a verifiable pipeline.
While promising, there are also inherent risks associated with LLM application. LLMs are prone to hallucination and overconfident narrative [69,70,71]. The technical response is to ground generation in retrieval and tools. Retrieval-Augmented Generation (RAG) is the canonical pattern: retrieve relevant, vetted documents (protocols, doctrine, medical guidelines, SOPs, user baselines) and constrain the model to synthesize from those sources [64,72,73,74]. In a cognitive co-pilot, RAG should be paired with strict output schemas (e.g., recommendation, rationale, confidence, data sources used, what would change my mind) and with refusal behaviors when evidence is insufficient.
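A minimal sketch of such schema enforcement, assuming a JSON-producing local model, is shown below; the required field names are illustrative, not a standard. Every generation is validated before display, and ungrounded or malformed outputs are refused rather than shown to the user.

```python
import json

# Illustrative schema fields, mirroring the pattern described in the text.
REQUIRED = {"recommendation", "rationale", "confidence", "sources"}

def validate_copilot_output(raw: str):
    """Validate a local model's generation against the output schema;
    reject free-form or ungrounded text before it reaches the user."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return None, "reject: non-JSON output"
    if not isinstance(out, dict) or not REQUIRED.issubset(out):
        return None, "reject: missing schema fields"
    if not out["sources"]:
        return None, "reject: no retrieved evidence cited"
    if not (0.0 <= out["confidence"] <= 1.0):
        return None, "reject: confidence out of range"
    return out, "ok"

good = json.dumps({"recommendation": "monitor and re-assess in 15 min",
                   "rationale": "fatigue high vs. baseline, uncertainty wide",
                   "confidence": 0.6, "sources": ["SOP-7.2"]})
out, status = validate_copilot_output(good)
print(status)  # ok
```

The refusal paths are the point: an empty `sources` list means the generation was not grounded in retrieved documents, so it is rejected regardless of how fluent the text is.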
Local deployment of LLMs is only feasible with compression. Open-weight model families (e.g., Llama, Gemma, Qwen) provide strong baselines for local reasoning and summarization, but real edge use typically requires quantization and efficient runtimes [25,75,76,77,78,79]. Tooling ecosystems such as llama.cpp and local runners make this practical on commodity devices, and quantization methods such as GPTQ and AWQ show how weights can be reduced with limited quality loss, though the impact on reliability under uncertainty remains an open research question for high-stakes applications [80,81]. For local reasoning modules, evaluation should include at minimum schema adherence, consistency across repeated identical queries, hallucination rate under ambiguous or conflicting retrieved inputs, abstention/refusal behavior when evidence is insufficient, and latency/energy cost per reasoning call.

2.4. Layer 4: From Insight to Action

Layer 3 is where insights are formed, and Layer 4 is where systems frequently succeed or fail in practice. Alarm fatigue in healthcare illustrates the hazard: even when alarms are technically correct, high frequency and low specificity can cause users to ignore them, silence them, or work around them [82,83]. In high-stakes work, frequent false positives erode trust and add workload; in contrast, false negatives create unjustified reassurance, and poorly timed prompts can disrupt attention during critical phases [84].
This is where cognitive engineering should dominate design decisions. Cognitive Load Theory predicts that interventions that add extraneous processing at the wrong time will degrade performance even when information is accurate [85]. Multiple Resource Theory predicts that modality choices matter: visual overlays can be costly during visually demanding tasks, whereas brief haptic signals or concise audio might be less interfering [86]. Situation Awareness theory suggests that displays should support projection and prioritization, not merely reporting [9,87].
Human–automation interaction research adds another set of hazards: automation bias (i.e., over-reliance on system outputs) and complacency (i.e., reduced monitoring when automation is present). Parasuraman and colleagues framed levels of automation as a spectrum that changes the human’s role from active controller to passive monitor; the risk is not simply too much automation, but poorly calibrated automation that shifts responsibility without maintaining engagement [5,88]. Empirical work has specifically linked decision aids to complacency and bias through attentional mechanisms [89,90]. That is, people look less, verify less, and accept more when automation appears authoritative [91]. Trust research further emphasizes that the goal is not maximal trust, but appropriate reliance, a dynamic alignment between system competence and user expectations [92,93].
Taken together, Layer 4 should be designed as a control policy for attention. It should regulate when the system speaks, how it speaks, and what it asks the human to do. Three design principles should be used. First, event gating reserves intrusive alerts for high-confidence, high-consequence conditions, and treats low-confidence states as monitoring or queryable. Second, modality matching maps message types to channels (e.g., haptic for urgent binary signals, audio for short directives, visual for reflective analysis). Finally, trust calibration couples recommendations with uncertainty, rationale, and (when feasible) counterfactuals (e.g., I am concerned because X; I would relax this concern if Y improved.).
To make this layer operational, the intervention policy can be framed as a mapping from system state and task context to an intervention decision. Relevant inputs include the posterior state estimate and associated uncertainty from Layer 2, signal-quality indicators, drift or out-of-distribution flags, task phase or operational tempo, workload proxies (e.g., movement intensity, task-event markers, or historical workload patterns), recent alert history, and modality availability. A simple baseline policy family can be defined in terms of gated intervention rules. For example, high-confidence and high-consequence conditions may trigger immediate alerts when interruption cost is low; moderate-confidence states may generate deferred or monitoring actions that surface information only at natural task boundaries; and low-confidence or drift-flagged conditions should increase abstention or request additional data rather than producing actionable recommendations. We return to this idea in Section 4.3. More sophisticated implementations could treat intervention timing as a utility optimization problem that balances expected benefit against interruption cost, or as a learnable policy (e.g., contextual bandit or reinforcement learning framework) that adapts cue timing over repeated deployments. Importantly, regardless of implementation method, the policy should preserve transparency by exposing the variables that influenced the intervention decision and by logging intervention outcomes for later audit and refinement [94].
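The baseline gated policy family described above can be sketched as a small rule function; the thresholds and category labels are illustrative assumptions, not validated values.

```python
def intervention_policy(confidence: float, consequence: str,
                        interruption_cost: str, drift: bool) -> str:
    """Baseline gated intervention policy sketch: intrusive alerts only
    for high-confidence, high-consequence states; defer or abstain
    otherwise. Thresholds (0.5, 0.8) are illustrative placeholders."""
    if drift or confidence < 0.5:
        # Out-of-distribution or low confidence: do not recommend;
        # widen uncertainty, request recalibration or more data.
        return "abstain_or_request_data"
    if confidence >= 0.8 and consequence == "high":
        # High-confidence, high-consequence: alert, but respect
        # interruption cost by deferring to a task boundary if needed.
        return "alert_now" if interruption_cost == "low" else "alert_at_boundary"
    # Moderate confidence: surface information at a natural break.
    return "defer_to_task_boundary"

print(intervention_policy(0.9, "high", "low", drift=False))   # alert_now
print(intervention_policy(0.6, "high", "low", drift=False))   # defer_to_task_boundary
print(intervention_policy(0.9, "high", "low", drift=True))    # abstain_or_request_data
```

A production system would replace these hand-set rules with the utility-optimization or learned-policy variants discussed above, but even this rule table makes the policy's inputs explicit and therefore auditable.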

2.5. A Formal Reference Architecture

To make the proposed framework more explicit, we define a compact reference architecture that links Layers 1–4 (Table 1). Let x_t denote the multimodal sensor observations acquired at time t, including physiological, behavioral, and contextual streams (e.g., HR/HRV, EDA, skin temperature, IMU-derived movement features, task events, and environmental measurements). Let q_t denote signal-quality metadata associated with these observations, such as contact confidence, motion artifact likelihood, ambient interference, missingness, and sensor uptime. The purpose of Layer 1 is therefore not only to collect raw data, but to produce paired outputs (x_t, q_t), where data quality is treated as an explicit model input.
Layer 2 maps these observations into a latent state estimate z_t, where z_t represents the user’s probabilistic physiological or cognitive state (e.g., fatigue, heat strain, stress, cognitive load, or motor instability). Formally, the goal is not a deterministic label, but an estimate of the posterior p(z_t | x_{1:t}, q_{1:t}, c_t), where c_t denotes contextual variables such as task phase, environmental conditions, role requirements, and individualized baseline information. In this framing, uncertainty arises from at least two sources: measurement uncertainty (e.g., noisy or incomplete sensing) and model uncertainty (e.g., distribution shift or limited support from the training data). The Layer 2 output should therefore include: (i) a state estimate, (ii) an uncertainty value or interval, (iii) signal-quality and drift indicators, and (iv) an abstention option when confidence is insufficient for safe recommendation.
Layer 3 receives a structured state representation rather than raw sensor data. Let s_t = {z_t, u_t, d_t, a_t, c_t}, where u_t denotes uncertainty, d_t denotes a drift or out-of-distribution flag, a_t denotes model availability/abstention status, and c_t denotes relevant contextual constraints. The role of Layer 3 is to transform s_t into a bounded set of candidate interpretations or actions r_t using grounded reasoning over trusted local knowledge sources, operating procedures, and task constraints. Importantly, this layer should be schema constrained: outputs should specify the recommended option(s), rationale, confidence/uncertainty statement, and source provenance, rather than unrestricted free-form text.
Layer 4 governs whether, when, and how the output of Layer 3 should be presented to the user. Let π_t denote an attention-control policy that maps the current system state, user state, task demands, and recommendation urgency onto an intervention decision y_t. This decision may include whether to alert, defer, query for clarification, log silently, or abstain from action. The central design objective is not simply to maximize detection sensitivity, but to balance expected utility against cognitive cost, interruption burden, and the risk of over-reliance.
Several assumptions underlie this reference architecture. First, sensor observations in field settings are expected to be noisy, incomplete, and context dependent. Second, user baselines and task environments are heterogeneous, necessitating personalization and shift detection. Third, hardware constraints matter: feasible deployment depends on bounded latency, memory, and power budgets, which may require compression, quantization, duty cycling, or fallback to simpler models. Fourth, because deployment may occur in high-stakes settings, the system must preserve uncertainty, support abstention, and remain auditable across all layers. We emphasize that this formulation is not intended as a finalized implementation specification, but as a common scaffold for comparing candidate designs and identifying where technical and human-factor risks enter the pipeline.
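In compact form, the layer interfaces amount to the following mapping; the function g and knowledge base \mathcal{K} (our shorthand for Layer 3's grounded-reasoning step over trusted local sources) are notation we introduce here for clarity, not symbols used elsewhere in the text.

```latex
% Layer interfaces, using the symbols defined in the text;
% g and \mathcal{K} are our shorthand for the grounded-reasoning step.
\begin{aligned}
\text{Layer 1:}\quad & (x_t,\; q_t) \\
\text{Layer 2:}\quad & p(z_t \mid x_{1:t}, q_{1:t}, c_t)
    \;\rightarrow\; s_t = \{z_t, u_t, d_t, a_t, c_t\} \\
\text{Layer 3:}\quad & r_t = g(s_t, \mathcal{K}) \\
\text{Layer 4:}\quad & y_t = \pi_t(s_t, r_t)
\end{aligned}
```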

3. On-Person Cognitive Co-Pilots: A Framework

By cognitive co-pilot, we refer to an on-person, context-aware decision-support system that continuously integrates multimodal sensing, probabilistic state estimation, and grounded reasoning to help a user interpret their current condition, receive contextualized recommendations, and select appropriate actions. Traditional status monitoring dashboards often display signals, summary scores, and threshold alarms. A cognitive co-pilot is designed to reduce interpretive burden by translating uncertain physiological and contextual signals into structured, actionable guidance while preserving human autonomy. As such, a cognitive co-pilot minimally includes: (a) quality-aware sensing, (b) probabilistic state estimation with uncertainty, (c) drift and out-of-distribution detection, (d) grounded reasoning over trusted rules or references, and (e) an explicit intervention policy governing when and how to advise the user and provide recommendations. Optional capabilities include natural language dialog (i.e., a method to prompt and chat), adaptive personalization, user-tunable thresholds, retrieval over historical personal data, and an audit mode that enables after-action explanation.
Such a system does not replace human judgment; rather, it functions as a bounded reasoning partner that manages uncertainty, timing, and attention in support of safe and effective performance. This is realized via a layered loop that couples sensing, inference, reasoning, and presentation with explicit human roles. This framework is depicted in Figure 1.
In addition to Layers 1–4, the framework emphasizes a human-in-the-loop feedback cycle. The user injects context (e.g., task phase, constraints, goals), queries rationale, and provides feedback that can tune thresholds (but within a bounded range), modalities, and (in some cases) local personalization. This is the mechanism by which trust is calibrated and autonomy is preserved.
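Under stated assumptions, the layered loop described above can be sketched compactly in code. The sketch below is purely illustrative: every function, field name, and threshold is a hypothetical placeholder, not a validated policy value or a claim about any specific implementation.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    features: dict            # short-window physiological features
    quality: float            # 0..1 signal-quality index from Layer 1

@dataclass
class State:
    fatigue_prob: float = 0.5  # posterior probability of fatigue
    confidence: float = 0.0    # calibrated confidence in the posterior
    drift: bool = False        # out-of-distribution / drift indicator

def layer1_sense(raw: dict) -> Observation:
    # Quality-aware sensing: attach a quality index to every window.
    quality = 0.0 if raw.get("contact") == "poor" else 0.9
    return Observation(features=raw, quality=quality)

def layer2_estimate(obs: Observation, state: State) -> State:
    # Probabilistic state estimation; confidence degrades with signal quality.
    hr = obs.features.get("hr", 60)
    fatigue = min(1.0, max(0.0, (hr - 60) / 60))
    return State(fatigue_prob=fatigue, confidence=obs.quality, drift=state.drift)

def layer3_reason(state: State) -> dict:
    # Bounded reasoning: structured output only; abstain on weak evidence.
    if state.confidence < 0.5 or state.drift:
        return {"status": "insufficient confidence for recommendation"}
    if state.fatigue_prob > 0.7:
        return {"status": "ok", "recommendation": "rest at next task boundary"}
    return {"status": "ok", "recommendation": "continue monitoring"}

def layer4_present(msg: dict, interruption_cost: float) -> str:
    # Attention-aware delivery: defer overt cues during critical intervals.
    if msg["status"] != "ok":
        return "abstain"
    if msg["recommendation"] != "continue monitoring" and interruption_cost > 0.8:
        return "defer"
    return "alert" if msg["recommendation"] != "continue monitoring" else "monitor"

def copilot_cycle(raw: dict, state: State, interruption_cost: float = 0.0) -> str:
    obs = layer1_sense(raw)
    new_state = layer2_estimate(obs, state)
    return layer4_present(layer3_reason(new_state), interruption_cost)
```

Note that the human-in-the-loop feedback described above would enter this sketch as tunable parameters (thresholds, interruption cost, preferred modality) rather than as a separate layer.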

4. Research Agenda and Future Directions

At the sensing layer, the frontier is the deployable validity of sensors. Sweat and tear sensing are advancing quickly, yet field reliability depends on sampling, contamination control, and the interpretability of analytes in context [32,95]. Some sensing modalities, such as heart rate, inertial motion sensing, and skin temperature, are already routinely deployed in ICE environments, including military training, industrial safety, and endurance operations; for these sensors, the research emphasis has shifted from basic capability to ruggedization, long-duration reliability, and integration with quality-aware inference pipelines. Energy autonomy is similarly a gating issue: batteries constrain wear time and user acceptance, motivating hybrid harvesting approaches (triboelectric, piezoelectric, thermoelectric) and energy-aware duty cycling [96,97,98]. A realistic path forward is to treat biochemical sensing as opportunistic and complementary in the near term, valuable when available but not required for baseline operation, while investing in signal-quality detection and artifact modeling across all modalities.
At the signal-to-state layer, the central problem is generalization under shift [45,99,100]. Physiological baselines vary widely, and operational contexts induce systematic deviations (heat, dehydration, altitude, stimulants, sleep debt). Models must therefore be personalized without overfitting, and they must detect when deployment conditions fall outside the training envelope. Federated learning offers one route to improvement without centralizing raw data, but production systems will also need calibration monitoring, adversarial robustness, and audit trails that can withstand scrutiny after adverse events [54].
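Detecting when deployment conditions fall outside the training envelope can be prototyped with simple online statistics. The sketch below tracks a per-user running baseline with Welford's algorithm and flags observations beyond a z-score threshold; the threshold and minimum sample count are illustrative assumptions, not recommended values.

```python
import math

class BaselineDriftDetector:
    """Tracks a per-user running baseline (Welford's online algorithm)
    and flags observations that fall outside the personal envelope."""

    def __init__(self, z_threshold: float = 3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations
        self.z_threshold = z_threshold

    def update(self, x: float) -> None:
        # Welford's update keeps mean and variance numerically stable.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_out_of_distribution(self, x: float) -> bool:
        # Only flag once enough samples exist for a stable estimate.
        if self.n < 10:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return x != self.mean
        return abs(x - self.mean) / std > self.z_threshold
```

In a fielded system, this kind of detector would feed the drift flag consumed by the reasoning and intervention layers rather than triggering alerts directly.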
At the reasoning layer, the research problem is to make the LLM safe and grounded. Retrieval-augmented patterns can constrain generation to vetted sources, but only if retrieval itself is governed: which documents are authoritative, how they are updated, and how conflicts are resolved [72,73,101]. Quantization and compression enable local deployment but may alter failure modes in subtle ways, particularly for uncertainty communication and multi-step reasoning [81]. Methods like generative pre-trained transformer quantization (GPTQ) and activation-aware weight quantization (AWQ) provide practical compression tools, but safety-critical evaluation must move beyond benchmark accuracy to measure hallucination under ambiguous inputs, brittleness under adversarial prompts, and consistency across repeated queries [102,103].
At the presentation layer, the research challenge is when and how to intervene. We already know from alarm fatigue research that frequency and specificity shape compliance, and from automation bias research that authoritative phrasing can degrade verification [82,84]. What remains underexplored is a principled mapping from recommendation type × task phase × workload × modality constraints to an intervention policy that preserves performance. This is where simulation-based testing, human-subject experiments, and field trials must converge. The system should be evaluated not only on classification accuracy, but on downstream outcomes: error rates, response latency, workload, trust calibration, and skill retention over time.

4.1. Transitioning Research from Lab to Field

To support translation beyond this layered framework, cognitive co-pilot systems should be evaluated with a suite of criteria that extends beyond traditional model accuracy or classification performance, as depicted in Figure 2.
First, validation must include sensor performance under realistic operational conditions, explicitly testing robustness to motion, heat, sweat, poor contact, and environmental interference. Laboratory accuracy is insufficient if signal integrity collapses in the field; therefore, performance reporting should include quality-aware metrics, failure characterization, and the downstream consequences of degraded sensing on state estimation and recommendations.
Second, evaluation must examine robustness under changing conditions, including distribution shift, user variability, and evolving operational contexts. Systems should demonstrate not only stable performance across environments, but also the capacity to detect drift, flag out-of-distribution inputs, and appropriately adjust confidence or abstain. This moves evaluation from “does it work?” to “does it know when it might not work?”, a distinction that is central to trust.
Third, uncertainty calibration should be treated as a first-class outcome. A useful cognitive co-pilot is one that communicates confidence transparently and appropriately limits its own authority. Evaluation should therefore test whether confidence estimates align with true reliability, whether the system appropriately produces “no recommendation” states when evidence is weak, and whether uncertainty information actually improves downstream human decision quality [74,94].
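Testing whether confidence estimates align with true reliability can be operationalized with expected calibration error (ECE), a standard binned measure of the gap between stated confidence and observed accuracy. The sketch below is a minimal implementation; the bin count is an arbitrary assumption.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: per-bin |accuracy - mean confidence|, weighted by bin size.

    confidences: predicted probabilities in [0, 1]
    outcomes: 1 if the corresponding prediction was correct, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)   # assign to confidence bin
        bins[idx].append((c, y))
    total, ece = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece
```

A well-calibrated co-pilot would show low ECE; a system that reports 90% confidence but is correct only half the time would score poorly, which is exactly the failure mode that should suppress recommendations.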
Fourth, translational evaluation must include human outcomes, not just model outputs. This includes workload, response latency, alert fatigue, automation bias, trust calibration, and adherence to recommendations. A system that improves classification accuracy but increases cognitive burden or promotes over-reliance may degrade real-world performance [91]. Accordingly, experimental designs should pair technical validation with human-subject testing that measures how the system changes behavior, attention allocation, and decision processes during representative tasks.
Finally, longer-term effects must be examined. Continuous decision support can reshape user expectations, skill retention, and monitoring behavior over time. Evaluation should therefore extend beyond short trials to assess whether systems promote complacency, erode expertise, or instead support learning and adaptive self-regulation. Longitudinal deployment studies, after-action review analyses, and training-transfer assessments will be essential to determine whether cognitive co-pilots function as sustainable augmentation tools rather than as short-term performance crutches.
As an illustrative validation path, future implementations of this framework could be evaluated in a staged manner. A first phase should quantify sensing and inference performance under controlled manipulations of motion, heat, sweat, poor contact, missingness, and environmental interference, with repeated runs used to characterize variance, calibration, and failure modes. A second phase should evaluate robustness under distribution shift by varying user characteristics, task phases, environmental conditions, and operational tempo, with reporting that includes dispersion, confidence intervals, abstention rates, and drift-detection behavior. A third phase should test downstream human effects in representative scenarios, such as navigation, decision-making, resource allocation, monitoring, triage, or multitask simulations, using outcomes such as decision accuracy, response latency, workload, trust calibration, alert adherence, and evidence of automation bias or over-reliance. Especially for high-stakes use, validation should be designed to determine not only whether the system performs well on average, but whether it remains reliable, honest under uncertainty, and beneficial to human performance across repeated use and heterogeneous conditions. This broader evaluation approach reframes success as demonstrable improvement in human–system performance over time.

4.2. Illustrative Edge Deployment Scenario

To make the framework more concrete, consider an illustrative deployment in which a wearable device (e.g., a smartwatch or chest-worn node) performs continuous sensing and lightweight preprocessing, while a paired companion device (e.g., a smartphone, body-worn computer, or vehicle-mounted edge processor) performs state estimation and bounded reasoning. This distinction is important because the proposed framework is not equally implementable on all hardware classes today [77,78]. In near-term deployments, watch-class and sensor-node devices are well suited for continuous acquisition, short-window feature extraction, and signal-quality estimation, but are generally not the preferred location for persistent contextual reasoning with retrieval and language models. By contrast, phones and body-worn computers can more realistically support recurrent probabilistic inference, short-term history storage, retrieval over local reference materials, and event-triggered reasoning.
In such a configuration, Layer 1 would run continuously on the wearable and sample a modest multimodal set such as PPG-derived cardiovascular features, skin temperature, IMU-derived movement features, and task-event metadata, while also generating signal-quality indicators related to contact quality, motion artifact, and missingness. Rather than assuming continuous transmission of raw data, the wearable could compute short-window features and quality summaries locally and transmit those at a lower update rate when needed. This keeps battery and bandwidth demands within realistic limits for persistent use.
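The wearable-side computation described above, reducing raw windows to features plus a quality summary and transmitting only on schedule or on quality degradation, might look like the following sketch. The feature set, missingness threshold, and transmit rule are all illustrative assumptions.

```python
from statistics import mean, pstdev

def summarize_window(samples, missing_value=None):
    """Reduce a raw sensor window to features plus a quality summary."""
    valid = [s for s in samples if s is not missing_value]
    missingness = 1.0 - len(valid) / len(samples)
    features = {
        "mean": mean(valid) if valid else None,
        "std": pstdev(valid) if len(valid) > 1 else 0.0,
    }
    quality = {"missingness": missingness, "n_valid": len(valid)}
    return features, quality

def should_transmit(quality, prev_quality, period_elapsed):
    """Duty-cycled uplink: send on schedule, or early when quality degrades."""
    degraded = quality["missingness"] - prev_quality["missingness"] > 0.2
    return period_elapsed or degraded
```

Keeping this logic on the wearable means the companion device receives compact summaries at a low update rate, which is what keeps battery and bandwidth budgets realistic for persistent use.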
Layer 2 would run on the companion device using a compact personalized model that updates a state representation containing variables such as fatigue likelihood, thermal strain likelihood, workload estimate, uncertainty, and drift status. The practical requirement at this layer is not maximal model complexity, but stable recurrent inference with low enough latency to support decisions during ongoing activity. In many plausible deployments, this means sub-second to few-second update cycles, modest memory requirements, and explicit fallback behavior when computation, signal quality, or confidence is limited.
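A stable recurrent estimator at this layer need not be complex. One minimal sketch is a scalar Kalman-style update for a single state variable (e.g., a thermal-strain index), in which poor-quality or missing observations simply widen the posterior variance rather than corrupting the estimate; all noise parameters and the quality threshold below are assumptions for illustration.

```python
class ScalarStateEstimator:
    """1-D Gaussian state estimate with explicit fallback behavior.

    predict() widens uncertainty over time; update() narrows it only
    when a sufficiently high-quality observation arrives.
    """

    def __init__(self, mean=0.0, var=1.0, process_noise=0.05):
        self.mean, self.var = mean, var
        self.process_noise = process_noise

    def predict(self):
        # Uncertainty grows between usable observations.
        self.var += self.process_noise

    def update(self, obs, obs_var, quality, min_quality=0.5):
        self.predict()
        if obs is None or quality < min_quality:
            return  # fallback: keep predicting, do not trust the observation
        # Standard Kalman gain for a scalar state.
        gain = self.var / (self.var + obs_var)
        self.mean += gain * (obs - self.mean)
        self.var *= (1 - gain)
```

This structure directly encodes the requirement stated above: sub-second updates are trivial at this complexity, memory is a handful of floats per state variable, and degraded sensing produces widening uncertainty rather than silent failure.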
Layer 3 would also likely run on the companion device or another local edge processor, but unlike Layers 1 and 2, it would be invoked intermittently rather than continuously. In practice, this layer is most realistic today when it is bounded: triggered by a threshold crossing, user query, or change in task context; constrained by retrieval over trusted local documents or rules; and required to produce structured outputs rather than unconstrained free text. A compressed local LLM, a smaller reasoning model, or a hybrid rules-plus-retrieval pipeline may all be viable implementations depending on hardware resources and risk tolerance [64,77]. Full, always-on free-form language reasoning on watch-class hardware should not be treated as the default target architecture.
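The bounded, event-triggered character of Layer 3 can be made concrete with a simple invocation gate plus a fixed output schema. The trigger conditions, threshold, and schema fields below are hypothetical illustrations of the pattern, not a specification.

```python
def should_invoke_reasoner(state, prev_state, user_query, threshold=0.7):
    """Invoke Layer 3 only on a threshold crossing, an explicit user
    query, or a task-context change -- never continuously."""
    crossed = state["risk"] >= threshold > prev_state["risk"]
    context_changed = state["task_phase"] != prev_state["task_phase"]
    return bool(user_query) or crossed or context_changed

def structured_output(recommendation, confidence, sources):
    """Constrain the reasoner to a fixed schema instead of free text."""
    assert 0.0 <= confidence <= 1.0
    return {
        "recommendation": recommendation,
        "confidence": round(confidence, 2),
        "sources": list(sources),  # retrieval over trusted local documents
    }
```

Whether the reasoner behind this gate is a compressed LLM, a smaller model, or a rules-plus-retrieval pipeline, the gate and schema are what keep its invocation intermittent and its outputs auditable.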
The central architectural implication is that computational burden should be partitioned across layers according to operational role. Continuous sensing and preprocessing should remain extremely lightweight and power efficient. Recurrent state estimation must remain computationally stable enough for persistent use on mobile hardware. Contextual synthesis and explanation should be event-triggered, bounded, and reserved for cases in which they add value beyond simpler decision logic. In this manner, latency, memory, and energy budgets need not be identical across layers; instead, each layer should be designed according to both its temporal demands and the consequences of delay or failure.
From an implementation standpoint, the framework is best interpreted as a tiered architecture with different levels of near-term feasibility. Today, Layer 1 is clearly implementable on watch- and wearable-class devices; Layer 2 is realistically implementable on phones and other companion processors; and Layer 3 is implementable locally only when compressed, retrieval-bounded, and event-triggered. The most resource-intensive functions should therefore migrate to the most capable local device available, while preserving low-latency operation and minimizing off-person data transfer.
We emphasize that this scenario is illustrative rather than definitive. The purpose of this example is not to claim that a single hardware stack is universally sufficient, but to show that end-to-end feasibility depends on careful function partitioning, bounded local operation, and explicit fallback behavior when computing, energy, or confidence are limiting. As summarized in Table 2, the framework partitions tasks based on what is realistically implementable today, their temporal requirements, computational pressure, and appropriate fallback behavior. In practice, the layered framework should be interpreted as a distributed local architecture in which the most time-sensitive and power-constrained functions remain closest to the sensor, while the most memory-intensive and explanation-oriented functions are shifted to the most capable local processor available.

4.3. Illustrative End-to-End Walkthroughs

To clarify how the framework operates across layers, we illustrate several end-to-end examples. These examples are not intended as validated deployment claims or optimized policies. Rather, they are intended to show how the proposed architecture transforms sensed signals into probabilistic state estimates, integrates those estimates with local context and trusted guidance, and then applies an intervention policy that may alert, defer, or abstain depending on confidence and operational conditions.
In our first example (alert), consider a user engaged in sustained physical work in a hot environment while wearing a wrist- or chest-based sensing platform and carrying a paired local companion device. Layer 1 acquires multimodal signals including PPG-derived cardiovascular features, skin temperature, IMU-derived movement patterns, and task-event metadata, while also generating signal-quality indicators. Layer 2 integrates these observations with individualized baseline information and current task context to estimate an elevated posterior probability of thermal strain and fatigue, with acceptable signal quality, no drift flag, and sufficiently narrow uncertainty bounds. Layer 3 then retrieves trusted local guidance relevant to heat strain mitigation, task constraints, and available intervention options, and produces a structured recommendation such as: elevated heat strain likelihood relative to baseline, confidence moderate-to-high, recommend short cooling/rest opportunity at next feasible task boundary, reassess within 10 min. Layer 4 evaluates the urgency of the condition against task phase and interruption cost, and delivers a brief cue through a low-burden modality such as haptic feedback with a short visual or auditory explanation. In this example, the system functions not as a dashboard that merely reports physiological values, but as a bounded decision-support partner that links uncertain state estimates to context-sensitive options.
In our second example (defer), consider a user performing a visually demanding or time-sensitive task in which interruption itself may degrade performance, such as air traffic control, piloting, or targeting. Layer 1 detects physiological and behavioral changes consistent with increasing workload or stress, and Layer 2 outputs an elevated stress posterior with moderate confidence, but the inferred task phase indicates that the user is currently in a critical interval with high interruption cost. Layer 3 retrieves relevant operational guidance and produces a structured recommendation indicating that monitoring is warranted but immediate interruption is not preferred unless the state worsens or persists. Layer 4 therefore selects a deferred policy rather than an immediate alert, logging the state change and delaying any overt cue until workload proxies decrease or the task phase changes. In this example, the system uses an explicit attention policy to determine that the correct action is to defer.
In our third example (abstain), we illustrate why abstention is a necessary system behavior rather than a failure mode. Suppose a user is moving vigorously through complex terrain with poor sensor contact, high motion artifact, and physiological patterns that differ substantially from the data on which the model was trained. Layer 1 detects degraded signal quality and missingness. Layer 2 produces a weak or unstable posterior, widened uncertainty, and an out-of-distribution or drift flag. Under these conditions, Layer 3 should not generate a strong recommendation from ambiguous evidence. Instead, the reasoning layer either remains inactive or produces only a structured status such as insufficient confidence for recommendation, signal quality degraded, request sensor check or recalibration, continue monitoring. Layer 4 then withholds an actionable cue and instead logs the event, requests improved sensor fit, or falls back to a simpler monitoring policy. This example is important because a trustworthy cognitive co-pilot must preserve uncertainty and support explicit abstention when evidence is weak, rather than converting low-quality inputs into overconfident advice.
These walkthroughs illustrate three distinct but necessary outcomes of the framework: actionable recommendation, context-sensitive deferral, and explicit abstention. The architectural point is that all three outcomes can be appropriate depending on signal quality, posterior uncertainty, task context, and the cost of interruption. A useful cognitive co-pilot is therefore not defined by how often it generates recommendations, but by whether it generates the right level of intervention (if any) for the available evidence and operational conditions (Table 3).

4.4. Practical Limitations and Deployment Constraints

Several practical limitations should be considered prior to deployment. First, scalability may be limited by heterogeneity across users, contexts, and missions. Physiological baselines differ across individuals, and field conditions introduce systematic shifts through heat, dehydration, fatigue, sleep loss, altitude, stimulants, motion artifact, and evolving task demands [104,105]. As a result, a model that appears accurate in one context may degrade substantially in another unless calibration monitoring, shift and drift detection, and abstention are treated as core system functions.
Second, robustness to noise and missingness remains a central constraint. To combat signal degradation in the field, modern systems are moving toward “quality-aware” processing. Technical frameworks now utilize modulation spectral representations to generate signal-quality metadata, allowing the system to adjust enhancement strategies “on-the-fly” based on the detected noise level [106]. Additionally, novel neural network training methods, such as opportunistic teacher-forcing, allow models to maintain reasoning quality even when up to 80% of the physiological data is missing or contaminated [107].
Third, deployment on the edge requires sophisticated resource orchestration. Recent architectures like Synergy enable on-body AI by distributing model workloads across tiny hardware accelerators, optimizing for strict energy and latency budgets [108]. For more complex tasks, “offload shaping” strategies can be used to determine which functions should be processed locally and which should be sent to companion devices to minimize bandwidth and power consumption [109].
Fourth, in safety-critical or industrial environments, purely generative models are often deemed too risky due to non-deterministic outputs. Current state-of-the-art architectures employ a hybrid approach: deterministic rule-based agents enforce hard safety constraints and “interlocks,” while LLMs are reserved for higher-level diagnostic interpretation and natural language interaction with the user [110]. This knowledge-infused design ensures that AI behavior remains predictable and justifiable [111].
Fifth, the attentional action layer can be formalized more rigorously than a simple alerting heuristic. The decision of when to alert a user is increasingly modeled as a formal safety-control game. These frameworks use partially observable stochastic models to balance the utility of an intervention against the “cost” of auditing or human interruption [112]. Integrating generative AI with Instance-Based Learning (IBL) cognitive models further allows systems to better mirror human decision-making processes, enabling more symbiotic human-AI teaming [113].
Finally, the framework requires validation at the level of human–system performance, not only technical component accuracy; indeed, validation is shifting from technical accuracy toward human-centric metrics. SHAPE-AI was recently developed as a standardized tool to monitor how AI affects situational awareness, workload, and trust calibration in real-time clinical settings [114]. Comprehensive meta-analyses suggest that longitudinal validation and participatory design are essential to ensure that explainable AI (XAI) actually improves decision quality without introducing automation bias [115]. In this sense, the most important open question is not whether the system can generate recommendations, but under what conditions it improves human judgment safely and reliably [116].

5. Governance, Privacy, and Autonomy

Local processing reduces risk, but it does not eliminate it. Devices can be stolen, compromised, or coerced. Occupational deployments also raise governance questions that are qualitatively different from consumer wellness: who owns the data, who can access derived readiness estimates, and what constitutes meaningful consent when use is mandated [117,118,119]. For example, derived readiness estimates may be interpreted as objective medical facts, used to restrict assignments, or inadvertently incorporated into disciplinary or performance processes.
Two existing frameworks are immediately relevant. The National Institute of Standards and Technology (NIST) AI Risk Management Framework (AI RMF 1.0) emphasizes mapping and measuring risks across the AI lifecycle and aligns trustworthiness with characteristics such as validity, reliability, safety, transparency, accountability, and privacy [120]. The NIST Privacy Framework similarly provides a structure for identifying and managing privacy risk as an enterprise concern. These can inform technical architecture (data minimization, retention limits, access controls, audit logs) and interaction policy (what is shown to whom, when, and why) [121].
From a technical standpoint, privacy-preserving analytics may include secure enclaves and trusted execution environments (TEEs) for data-in-use protection and differential privacy for aggregated reporting [122]. But the ethical core remains sociotechnical: continuous monitoring can easily drift from safety enhancement into surveillance [123]. The only stable boundary is explicit governance: clear purpose limitation, role-based access, transparent user-facing explanations, and avenues for contesting or overriding system-driven inferences [124].
Finally, autonomy is preserved by designing the human-in-the-loop framework so that the human remains an agent in the interaction. That implies opt-out pathways when feasible, graded control over modalities and thresholds, and training that teaches users when the system is competent and when it is not; these are explicit responses to the well-documented risks of complacency and automation bias [125,126,127].

6. Conclusions

The convergence of pervasive biosensing, edge computing, and locally hosted LLMs makes a new class of systems plausible: on-person cognitive co-pilots that reduce cognitive burden by translating multimodal physiology into context-aware, actionable guidance. The technical novelty is not in any single sensor or model. Rather, it is in the coupling of probabilistic state estimation, grounded reasoning, and attentional interface design under real operational constraints. Edge deployment is central because it enables low-latency support in connectivity-poor settings and supports privacy-preserving governance by minimizing off-person data flow [19,20].
The research agenda is therefore necessarily multidisciplinary: sensor engineering for field validity, machine learning for calibrated inference under shift, LLM systems engineering for grounded reasoning under compression, and cognitive engineering for interventions that preserve situational awareness and autonomy. If that integration is performed well, cognitive co-pilots can become trusted partners rather than noisy dashboards; they will augment but not replace the human capacity to perform safely and effectively when conditions are at their most challenging.

Author Contributions

Conceptualization, T.T.B.; writing—original draft preparation, T.T.B.; writing—review and editing, M.V.P. and J.A.C.; visualization, T.T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest. The opinions or assertions contained herein are the private views of the authors and are not to be construed as official or reflecting the views, policies, or positions of the United States Army or the United States Department of War. Any citations of commercial organizations and trade names in this report do not constitute an official Department of the Army endorsement or approval of the products or services of these organizations.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AWQ: Activation-Aware Weight Quantization
CNNs: Convolutional Neural Networks
EDA: Electrodermal Activity
EEG: Electroencephalography
EMG: Electromyography
GRUs: Gated Recurrent Units
GPTQ: Generative Pre-trained Transformer Quantization
HR: Heart Rate
HRV: Heart Rate Variability
ICE: Isolated, Confined, and Extreme (environments)
IMU: Inertial Measurement Unit
LLM: Large Language Model
LSTMs: Long Short-Term Memory networks
NIST: National Institute of Standards and Technology
PPG: Photoplethysmography
RAG: Retrieval-Augmented Generation
RMF: Risk Management Framework
SOPs: Standard Operating Procedures
SpO2: Peripheral Capillary Oxygen Saturation
TEEs: Trusted Execution Environments
TinyML: Tiny Machine Learning

References

  1. Macias Alonso, A.K.; Hirt, J.; Woelfle, T.; Janiaud, P.; Hemkens, L.G. Definitions of Digital Biomarkers: A Systematic Mapping of the Biomedical Literature. BMJ Health Care Inform. 2024, 31, e100914. [Google Scholar] [CrossRef]
  2. Neef, D. Digital Exhaust: What Everyone Should Know about Big Data, Digitization and Digitally Driven Innovation; Pearson Education: London, UK, 2015; ISBN 978-0-13-383796-4. [Google Scholar]
  3. Schütz, N.; Knobel, S.E.J.; Botros, A.; Single, M.; Pais, B.; Santschi, V.; Gatica-Perez, D.; Buluschek, P.; Urwyler, P.; Gerber, S.M.; et al. A Systems Approach towards Remote Health-Monitoring in Older Adults: Introducing a Zero-Interaction Digital Exhaust. NPJ Digit. Med. 2022, 5, 116. [Google Scholar] [CrossRef]
  4. Lane, N.D.; Miluzzo, E.; Lu, H.; Peebles, D.; Choudhury, T.; Campbell, A.T. A Survey of Mobile Phone Sensing. IEEE Commun. Mag. 2010, 48, 140–150. [Google Scholar] [CrossRef]
  5. Parasuraman, R.; Sheridan, T.B.; Wickens, C.D. A Model for Types and Levels of Human Interaction with Automation. IEEE Trans. Syst. Man. Cybern. Part A Syst. Hum. 2000, 30, 286–297. [Google Scholar] [CrossRef]
  6. Brunyé, T.T.; Yau, K.; Okano, K.; Elliott, G.; Olenich, S.; Giles, G.E.; Navarro, E.; Elkin-Frankston, S.; Young, A.L.; Miller, E.L. Toward Predicting Human Performance Outcomes from Wearable Technologies: A Computational Modeling Approach. Front. Physiol. 2021, 12, 738973. [Google Scholar] [CrossRef]
  7. Trivedi, A.R.; Tayebati, S.; Kumawat, H.; Darabi, N.; Kumar, D.; Kosta, A.K.; Venkatesha, Y.; Jayasuriya, D.; Jayasinghe, N.; Panda, P.; et al. Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges. In Proceedings of the 2026 Design, Automation & Test in Europe Conference (DATE), Verona, Italy, 20–22 April 2026; pp. 1–10. [Google Scholar]
  8. Raymond, D.A.; Kumar, P.; Goureshettiwar, P. Intergration of Wearable Biosensors and Data Analytics for Remote Health Monitoring. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 28–30 August 2024; pp. 707–713. [Google Scholar]
  9. Endsley, M.R. Toward a Theory of Situation Awareness in Dynamic Systems. In Situational Awareness; Routledge: Abingdon, UK, 2011. [Google Scholar]
  10. Norman, D. Cognitive Engineering. In User Centered System Design; CRC Press: Boca Raton, FL, USA, 1986. [Google Scholar]
  11. Wickens, C.D. Multiple Resources and Mental Workload. Hum. Factors 2008, 50, 449–455. [Google Scholar] [CrossRef] [PubMed]
  12. Lee, J.D.; Kirlik, A.; Dainoff, M.J. The Oxford Handbook of Cognitive Engineering; OUP: New York, NY, USA, 2013; ISBN 978-0-19-975718-3. [Google Scholar]
  13. Miller, T. Explanation in Artificial Intelligence: Insights from the Social Sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  14. Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P.N.; Inkpen, K.; et al. Guidelines for Human-AI Interaction. In CHI ’19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–13. [Google Scholar]
  15. Ma, S.; Chen, Q.; Wang, X.; Zheng, C.; Peng, Z.; Yin, M.; Ma, X. Towards Human-AI Deliberation: Design and Evaluation of LLM-Empowered Deliberative AI for AI-Assisted Decision-Making. In CHI ’25: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1–23. [Google Scholar]
  16. Cai, G.; Tian, R.; Yang, L.; Jia, Y.; Li, L.; Wang, J. Efficient Inference for Edge Large Language Models: A Survey. Tsinghua Sci. Technol. 2026, 31, 1365–1380. [Google Scholar] [CrossRef]
  17. Husom, E.J.; Goknil, A.; Astekin, M.; Shar, L.K.; Kåsen, A.; Sen, S.; Mithassel, B.A.; Soylu, A. Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency. ACM Trans. Internet Things 2025, 6, 28. [Google Scholar] [CrossRef]
  18. Xianyong, S.; Mishra, D.K.; Özcan, Ö.Ö.; Karahan, M.; Panpan, T.; Zhongshan, X.; Vijayaraj Kumar, P. Wearable Smart Sensors Integration with AI and Machine Learning for Tracking Human Health. Biosens. Bioelectron. X 2025, 27, 100711. [Google Scholar] [CrossRef]
  19. Satyanarayanan, M. The Emergence of Edge Computing. Computer 2017, 50, 30–39. [Google Scholar] [CrossRef]
  20. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
  21. Chan, K.S.; Johnsen, F.T. Cybersecurity in Tactical Edge Networks. IEEE Secur. Priv. 2025, 23, 10–20. [Google Scholar] [CrossRef]
  22. He, P.; Wang, Y.; Zheng, G.; Zhou, H. Design and Implementation of an Edge Computing-Based Underground IoT Monitoring System. Mining 2025, 5, 54. [Google Scholar] [CrossRef]
  23. Haliassos, A.; Kasvis, D.; Karathanos, S. The Challenges of Data Privacy and Cybersecurity in Cloud Computing and Artificial Intelligence (AI) Applications for EQA Organizations. EJIFCC 2025, 36, 599–604. [Google Scholar]
  24. Warden, P.; Situnayake, D. TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019; ISBN 978-1-4920-5201-2. [Google Scholar]
  25. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  26. Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. GPT3.Int8(): 8-Bit Matrix Multiplication for Transformers at Scale. Adv. Neural Inf. Process. Syst. 2022, 35, 30318–30332. [Google Scholar]
  27. Clifford, G.D.; Behar, J.; Li, Q.; Rezek, I. Signal Quality Indices and Data Fusion for Determining Clinical Acceptability of Electrocardiograms. Physiol. Meas. 2012, 33, 1419. [Google Scholar] [CrossRef]
  28. Taskasaplidis, G.; Fotiadis, D.A.; Bamidis, P.D. Review of Stress Detection Methods Using Wearable Sensors. IEEE Access 2024, 12, 38219–38246. [Google Scholar] [CrossRef]
  29. Allen, J. Photoplethysmography and Its Application in Clinical Physiological Measurement. Physiol. Meas. 2007, 28, R1. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Song, S.; Vullings, R.; Biswas, D.; Simões-Capela, N.; van Helleputte, N.; van Hoof, C.; Groenendaal, W. Motion Artifact Reduction for Wrist-Worn Photoplethysmograph Sensors Based on Different Wavelengths. Sensors 2019, 19, 673. [Google Scholar] [CrossRef]
  31. Boucsein, W. Electrodermal Activity; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; ISBN 978-1-4614-1126-0. [Google Scholar]
  32. Heikenfeld, J.; Jajack, A.; Rogers, J.; Gutruf, P.; Tian, L.; Pan, T.; Li, R.; Khine, M.; Kim, J.; Wang, J.; et al. Wearable Sensors: Modalities, Challenges, and Prospects. Lab. A Chip 2018, 18, 217–248. [Google Scholar] [CrossRef] [PubMed]
  33. Urigüen, J.A.; Garcia-Zapirain, B. EEG Artifact Removal—State-of-the-Art and Guidelines. J. Neural Eng. 2015, 12, 031001. [Google Scholar] [CrossRef]
  34. Lara, O.D.; Labrador, M.A. A Survey on Human Activity Recognition Using Wearable Sensors. IEEE Commun. Surv. Tutor. 2013, 15, 1192–1209. [Google Scholar] [CrossRef]
  35. Bulling, A.; Blanke, U.; Schiele, B. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. ACM Comput. Surv. 2014, 46, 33. [Google Scholar] [CrossRef]
  36. Chung, S.; Lim, J.; Noh, K.J.; Kim, G.; Jeong, H. Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning. Sensors 2019, 19, 1716. [Google Scholar] [CrossRef]
  37. Mahato, K.; Saha, T.; Ding, S.; Sandhu, S.S.; Chang, A.-Y.; Wang, J. Hybrid Multimodal Wearable Sensors for Comprehensive Health Monitoring. Nat. Electron. 2024, 7, 735–750. [Google Scholar] [CrossRef]
  38. Stojchevska, M.; Steenwinckel, B.; Van Der Donckt, J.; De Brouwer, M.; Goris, A.; De Turck, F.; Van Hoecke, S.; Ongenae, F. Assessing the Added Value of Context during Stress Detection from Wearable Data. BMC Med. Inf. Decis. Mak. 2022, 22, 268. [Google Scholar] [CrossRef] [PubMed]
  39. Stoppa, M.; Chiolerio, A. Wearable Electronics and Smart Textiles: A Critical Review. Sensors 2014, 14, 11957–11992. [Google Scholar] [CrossRef]
  40. Yang, K.; McErlain-Naylor, S.A.; Isaia, B.; Callaway, A.; Beeby, S. E-Textiles for Sports and Fitness Sensing: Current State, Challenges, and Future Opportunities. Sensors 2024, 24, 1058. [Google Scholar] [CrossRef]
  41. Azeem, M.; Shahid, M.; Masin, I.; Petru, M. Design and Development of Textile-Based Wearable Sensors for Real-Time Biomedical Monitoring; a Review. J. Text. Inst. 2025, 116, 80–95. [Google Scholar] [CrossRef]
  42. Davis, N.; Heikenfeld, J.; Milla, C.; Javey, A. The Challenges and Promise of Sweat Sensing. Nat. Biotechnol. 2024, 42, 860–871. [Google Scholar] [CrossRef]
  43. Demirci Uzun, S. Machine Learning-Based Prediction and Interpretation of Electrochemical Biosensor Responses: A Comprehensive Framework. Microchem. J. 2025, 218, 115656. [Google Scholar] [CrossRef]
  44. Giorgi, G.; Tonello, S. Wearable Biosensor Standardization: How to Make Them Smarter. Standards 2022, 2, 366–384. [Google Scholar] [CrossRef]
  45. Subbaswamy, A.; Saria, S. From Development to Deployment: Dataset Shift, Causality, and Shift-Stable Models in Health AI. Biostatistics 2020, 21, 345–352. [Google Scholar] [CrossRef]
  46. Clifford, G.D.; Liu, C.; Moody, B.; Lehman, L.H.; Silva, I.; Li, Q.; Johnson, A.E.; Mark, R.G. AF Classification from a Short Single Lead ECG Recording: The PhysioNet/Computing in Cardiology Challenge 2017. In Proceedings of the 2017 Computing in Cardiology (CinC), Rennes, France, 24–27 September 2017; pp. 1–4. [Google Scholar]
  47. Ordóñez, F.J.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef] [PubMed]
  48. Hammerla, N.Y.; Halloran, S.; Ploetz, T. Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables. arXiv 2016, arXiv:1604.08880. [Google Scholar] [CrossRef]
  49. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic. Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  50. Lampinen, J.; Vehtari, A. Bayesian Approach for Neural Networks—Review and Case Studies. Neural Netw. 2001, 14, 257–274. [Google Scholar] [CrossRef] [PubMed]
  51. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  52. Begoli, E.; Bhattacharya, T.; Kusnezov, D. The Need for Uncertainty Quantification in Machine-Assisted Medical Decision Making. Nat. Mach. Intell. 2019, 1, 20–23. [Google Scholar] [CrossRef]
  53. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. y Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  54. Kairouz, P.; McMahan, H.B. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  55. Subbaswamy, A.; Adams, R.; Saria, S. Evaluating Model Robustness and Stability to Dataset Shift. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021. [Google Scholar]
  56. Sun, Y.; Li, S.; Sun, W.; Jin, T.; Guo, M.; Fu, M.; Liu, D.; Huang, X. A Lightweight Uncertainty Modeling Approach for Wearable Sensor Signals Based on Sample Overlap Estimation. Measurement 2026, 262, 120051. [Google Scholar] [CrossRef]
  57. Saria, S. Individualized Sepsis Treatment Using Reinforcement Learning. Nat. Med. 2018, 24, 1641–1642. [Google Scholar] [CrossRef]
  58. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges. Stat. Surv. 2022, 16, 1–85. [Google Scholar] [CrossRef]
  59. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
  60. Patil, A.; Jadon, A. Advancing Reasoning in Large Language Models: Promising Methods and Approaches. arXiv 2025, arXiv:2502.03671. [Google Scholar] [CrossRef]
  61. Zollner, D.; Vasiliev, R.; Castellano, B.; Molnar, G.; Janssen, P. A Technical Examination to Explore Conditional Multimodal Contextual Synthesis in Large Language Models. Res. Sq. 2024. [Google Scholar] [CrossRef]
  62. Barberini, S.; Everleigh, D.; Aldershaw, C.; Highcastle, A.; Bartholomew, F. Architecting Contextual Gradient Synthesis for Knowledge Representation in Large Language Models. TechRxiv 2024. [Google Scholar] [CrossRef]
  63. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 9118–9147. [Google Scholar]
  64. Yang, Q.; Farseev, A.; Ongpin, M.; Huang, A.; Chu-Farseeva, Y.-Y.; You, D.-M.; Lepikhin, K.; Nikolenko, S. Fusing Predictive and Large Language Models for Actionable Recommendations in Creative Marketing. ACM Trans. Inf. Syst. 2025, 43, 116. [Google Scholar] [CrossRef]
  65. ElShawi, R.; Sherif, Y.; Al-Mallah, M.; Sakr, S. Interpretability in Healthcare: A Comparative Study of Local Machine Learning Interpretability Techniques. Comput. Intell. 2021, 37, 1633–1650. [Google Scholar] [CrossRef]
  66. Zhang, L.; Chen, Z. Large Language Model-Based Interpretable Machine Learning Control in Building Energy Systems. Energy Build. 2024, 313, 114278. [Google Scholar] [CrossRef]
  67. Lyu, D.; Wang, X.; Chen, Y.; Wang, F. Language Model and Its Interpretability in Biomedicine: A Scoping Review. iScience 2024, 27, 109334. [Google Scholar] [CrossRef]
  68. Parisineni, S.R.A.; Pal, M. Enhancing Trust and Interpretability of Complex Machine Learning Models Using Local Interpretable Model Agnostic Shap Explanations. Int. J. Data Sci. Anal. 2024, 18, 457–466. [Google Scholar] [CrossRef]
  69. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of Multimodal Large Language Models: A Survey. arXiv 2025, arXiv:2404.18930. [Google Scholar]
  70. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42. [Google Scholar] [CrossRef]
  71. Wen, B.; Xu, C.; Han, B.; Wolfe, R.; Wang, L.L.; Howe, B. Mitigating Overconfidence in Large Language Models: A Behavioral Lens on Confidence Estimation and Calibration. In Proceedings of the NeurIPS 2024 Workshop on Behavioral Machine Learning, Vancouver, BC, Canada, 10 October 2024. [Google Scholar]
  72. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474. [Google Scholar]
  73. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
  74. Ke, Y.H.; Jin, L.; Elangovan, K.; Abdullah, H.R.; Liu, N.; Sia, A.T.H.; Soh, C.R.; Tung, J.Y.M.; Ong, J.C.L.; Kuo, C.-F.; et al. Retrieval Augmented Generation for 10 Large Language Models and Its Generalizability in Assessing Medical Fitness. npj Digit. Med. 2025, 8, 187. [Google Scholar] [CrossRef]
  75. Banbury, C.R.; Reddi, V.J.; Lam, M.; Fu, W.; Fazel, A.; Holleman, J.; Huang, X.; Hurtado, R.; Kanter, D.; Lokhmotov, A.; et al. Benchmarking TinyML Systems: Challenges and Direction. arXiv 2021, arXiv:2003.04821. [Google Scholar] [CrossRef]
  76. Team, G.; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118. [Google Scholar] [CrossRef]
  77. Zheng, Y.; Chen, Y.; Qian, B.; Shi, X.; Shu, Y.; Chen, J. A Review on Edge Large Language Models: Design, Execution, and Applications. ACM Comput. Surv. 2025, 57, 209. [Google Scholar] [CrossRef]
  78. Wang, X.; Tang, Z.; Guo, J.; Meng, T.; Wang, C.; Wang, T.; Jia, W. Empowering Edge Intelligence: A Comprehensive Survey on On-Device AI Models. ACM Comput. Surv. 2025, 57, 228. [Google Scholar] [CrossRef]
  79. Wang, R.; Gao, Z.; Zhang, L.; Yue, S.; Gao, Z. Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques. Comput. Sci. Rev. 2025, 57, 100755. [Google Scholar] [CrossRef]
  80. Sparrenberg, L.; Deußer, T.; Berger, A.; Sifa, R. Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in Llama. Cpp. In Proceedings of the 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), Birmingham, UK, 9–12 October 2025; pp. 1–10. [Google Scholar]
  81. Egashira, K.; Vero, M.; Staab, R.; He, J.; Vechev, M. Exploiting LLM Quantization. Adv. Neural Inf. Process. Syst. 2024, 37, 41709–41732. [Google Scholar] [CrossRef]
  82. Cvach, M. Monitor Alarm Fatigue: An Integrative Review. Biomed. Instrum. Technol. 2012, 46, 268–277. [Google Scholar] [CrossRef]
  83. Michels, E.A.M.; Gilbert, S.; Koval, I.; Wekenborg, M.K. Alarm Fatigue in Healthcare: A Scoping Review of Definitions, Influencing Factors, and Mitigation Strategies. BMC Nurs. 2025, 24, 664. [Google Scholar] [CrossRef] [PubMed]
  84. Bliss, J.P.; Dunn, M.C. Behavioural Implications of Alarm Mistrust as a Function of Task Workload. Ergonomics 2000, 43, 1283–1300. [Google Scholar] [CrossRef] [PubMed]
  85. Sweller, J. Cognitive Load during Problem Solving: Effects on Learning. Cogn. Sci. 1988, 12, 257–285. [Google Scholar] [CrossRef]
  86. Wickens, C.D. Multiple Resources and Performance Prediction. Theor. Issues Ergon. Sci. 2002, 3, 159–177. [Google Scholar] [CrossRef]
  87. Endsley, M.R.; Jones, D.G. Designing for Situation Awareness: An Approach to User-Centered Design, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2025; ISBN 978-1-003-38823-4. [Google Scholar]
  88. Parasuraman, R.; Riley, V. Humans and Automation: Use, Misuse, Disuse, Abuse. Hum. Factors 1997, 39, 230–253. [Google Scholar] [CrossRef]
  89. Mosier, K.L.; Skitka, L.J. Automation Use and Automation Bias. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 1999, 43, 344–348. [Google Scholar] [CrossRef]
  90. Parasuraman, R.; Manzey, D.H. Complacency and Bias in Human Use of Automation: An Attentional Integration. Hum. Factors 2010, 52, 381–410. [Google Scholar] [CrossRef]
  91. Romeo, G.; Conti, D. Exploring Automation Bias in Human–AI Collaboration: A Review and Implications for Explainable AI. AI Soc. 2026, 41, 259–278. [Google Scholar] [CrossRef]
  92. Dzindolet, M.T.; Peterson, S.A.; Pomranky, R.A.; Pierce, L.G.; Beck, H.P. The Role of Trust in Automation Reliance. Int. J. Hum. -Comput. Stud. 2003, 58, 697–718. [Google Scholar] [CrossRef]
  93. Lee, J.D.; See, K.A. Trust in Automation: Designing for Appropriate Reliance. Hum. Factors 2004, 46, 50–80. [Google Scholar] [CrossRef]
  94. Tatasciore, M.; Loft, S. Calibrating Reliance on Automated Advice: Transparency and Trust Calibration Feedback. Int. J. Hum. –Comput. Interact. 2025, 41, 14723–14733. [Google Scholar] [CrossRef]
  95. Bandodkar, A.J.; Wang, J. Non-Invasive Wearable Electrochemical Sensors: A Review. Trends Biotechnol. 2014, 32, 363–371. [Google Scholar] [CrossRef] [PubMed]
  96. Chong, Y.-W.; Ismail, W.; Ko, K.; Lee, C.-Y. Energy Harvesting for Wearable Devices: A Review. IEEE Sens. J. 2019, 19, 9047–9062. [Google Scholar] [CrossRef]
  97. Ali, A.; Shaukat, H.; Bibi, S.; Altabey, W.A.; Noori, M.; Kouritem, S.A. Recent Progress in Energy Harvesting Systems for Wearable Technology. Energy Strategy Rev. 2023, 49, 101124. [Google Scholar] [CrossRef]
  98. Perez, A.J.; Zeadally, S. Recent Advances in Wearable Sensing Technologies. Sensors 2021, 21, 6828. [Google Scholar] [CrossRef]
  99. Mahato, K. Introduction to Biosensors for Personalized Health. In Biosensors for Personalized Healthcare; Mahato, K., Chandra, P., Eds.; Springer Nature: Singapore, 2024; pp. 1–25. ISBN 978-981-97-5473-1. [Google Scholar]
  100. Qureshi, R.; Irfan, M.; Ali, H.; Khan, A.; Nittala, A.S.; Ali, S.; Shah, A.; Gondal, T.M.; Sadak, F.; Shah, Z.; et al. Artificial Intelligence and Biosensors in Healthcare and Its Clinical Relevance: A Review. IEEE Access 2023, 11, 61600–61620. [Google Scholar] [CrossRef]
  101. Prabhune, S.; Berndt, D.J. Deploying Large Language Models with Retrieval Augmented Generation. arXiv 2024, arXiv:2411.11895. [Google Scholar]
  102. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
  103. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. Proc. Mach. Learn. Syst. 2024, 6, 87–100. [Google Scholar] [CrossRef]
  104. Antikainen, E.; Njoum, H.; Kudelka, J.; Branco, D.; Rehman, R.Z.U.; Macrae, V.; Davies, K.; Hildesheim, H.; Emmert, K.; Reilmann, R.; et al. Assessing Fatigue and Sleep in Chronic Diseases Using Physiological Signals from Wearables: A Pilot Study. Front. Physiol. 2022, 13, 968185. [Google Scholar] [CrossRef]
  105. Brunyé, T.T.; Goring, S.A.; Cantelon, J.A.; Eddy, M.D.; Elkin-Frankston, S.; Elmore, W.R.; Giles, G.E.; Hancock, C.L.; Masud, S.B.; McIntyre, J.; et al. Trait-Level Predictors of Human Performance Outcomes in Personnel Engaged in Stressful Laboratory and Field Tasks. Front. Psychol. 2024, 15, 1449200. [Google Scholar] [CrossRef] [PubMed]
  106. Tiwari, A.; Cassani, R.; Kshirsagar, S.; Tobon, D.P.; Zhu, Y.; Falk, T.H. Modulation Spectral Signal Representation for Quality Measurement and Enhancement of Wearable Device Data: A Technical Note. Sensors 2022, 22, 4579. [Google Scholar] [CrossRef]
  107. Tomonaga, S.; Mizutani, H.; Doya, K. Training Recurrent Neural Networks with Inherent Missing Data for Wearable Device Applications. Proc. AAAI Conf. Artif. Intell. 2025, 39, 29512–29513. [Google Scholar] [CrossRef]
  108. Gong, T.; Jang, S.Y.; Acer, U.G.; Kawsar, F.; Min, C. Synergy: Towards On-Body AI via Tiny AI Accelerator Collaboration on Wearables. IEEE Trans. Mob. Comput. 2025, 24, 9319–9333. [Google Scholar] [CrossRef]
  109. Iyengar, R.; Dong, Q.; Nguyen, C.; Pillai, P.; Satyanarayanan, M. Offload Shaping for Wearable Cognitive Assistance. Electronics 2024, 13, 4083. [Google Scholar] [CrossRef]
  110. González-Potes, A.; Martínez-Castro, D.; Paredes, C.M.; Ochoa-Brust, A.; Mena, L.J.; Martínez-Peláez, R.; Félix, V.G.; Félix-Cuadras, R.A. Hybrid AI and LLM-Enabled Agent-Based Real-Time Decision Support Architecture for Industrial Batch Processes: A Clean-in-Place Case Study. AI 2026, 7, 51. [Google Scholar] [CrossRef]
  111. Roy, K. Process-Grounded Knowledge-Infused Learning and Decision Making. Ph.D. Thesis, University of South Carolina, Columbia, SC, USA, 2025. [Google Scholar]
  112. Griffin, C.; Thomson, L.; Shlegeris, B.; Abate, A. Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols. arXiv 2024, arXiv:2409.07985. [Google Scholar] [CrossRef]
  113. Malloy, T.; Gonzalez, C. Applying Generative Artificial Intelligence to Cognitive Models of Decision Making. Front. Psychol. 2024, 15, 1387948. [Google Scholar] [CrossRef] [PubMed]
  114. Mai, M.V.; Ozkaynak, M.; Marquard, J.L.; Simonson, R.J.; Holden, R.J.; Barton, H.J.; Catchpole, K.; Jamil, M.; Brauer, S.; Kandaswamy, S. SHAPE-AI: Development and Expert Validation of a Survey for Human—AI Performance Evaluation in Healthcare. medRxiv 2026. [Google Scholar] [CrossRef]
  115. Abbas, Q.; Jeong, W.; Lee, S.W. Explainable AI in Clinical Decision Support Systems: A Meta-Analysis of Methods, Applications, and Usability Challenges. Healthcare 2025, 13, 2154. [Google Scholar] [CrossRef]
  116. Brunyé, T.T.; Mitroff, S.R.; Elmore, J.G. Artificial Intelligence and Computer-Aided Diagnosis in Diagnostic Decisions: 5 Questions for Medical Informatics and Human-Computer Interface Research. J. Am. Med. Inf. Assoc. 2026, 33, 543–550. [Google Scholar] [CrossRef]
  117. Barocas, S.; Nissenbaum, H. Big Data’s End Run around Procedural Privacy Protections. Commun. ACM 2014, 57, 31–33. [Google Scholar] [CrossRef]
  118. Capulli, E.; Druda, Y.; Palmese, F.; Butt, A.H.; Domenicali, M.; Macchiarelli, A.G.; Silvani, A.; Bedogni, G.; Ingravallo, F. Ethical and Legal Implications of Health Monitoring Wearable Devices: A Scoping Review. Soc. Sci. Med. 2025, 370, 117685. [Google Scholar] [CrossRef]
  119. Radanliev, P. Privacy, Ethics, Transparency, and Accountability in AI Systems for Wearable Devices. Front. Digit. Health 2025, 7, 1431246. [Google Scholar] [CrossRef]
  120. Tabassi, E. Artificial Intelligence Risk Management Framework (AI RMF 1.0); NIST: Gaithersburg, MD, USA, 2023. [Google Scholar]
  121. Raji, I.D.; Smart, A.; White, R.N.; Mitchell, M.; Gebru, T.; Hutchinson, B.; Smith-Loud, J.; Theron, D.; Barnes, P. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency; Association for Computing Machinery: New York, NY, USA, 2020; pp. 33–44. [Google Scholar]
  122. Sabt, M.; Achemlal, M.; Bouabdallah, A. Trusted Execution Environment: What It Is, and What It Is Not. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Washington, DC, USA, 20–22 August 2015; Volume 1, pp. 57–64. [Google Scholar]
  123. Zuboff, S. The Age of Surveillance Capitalism. In Social Theory Re-Wired; Routledge: Abingdon, UK, 2023. [Google Scholar]
  124. Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency; Association for Computing Machinery: New York, NY, USA, 2019; pp. 59–68. [Google Scholar]
  125. Mehrotra, S.; Degachi, C.; Vereschak, O.; Jonker, C.M.; Tielman, M.L. A Systematic Review on Fostering Appropriate Trust in Human-AI Interaction: Trends, Opportunities and Challenges. ACM J. Responsible Comput. 2024, 1, 26. [Google Scholar] [CrossRef]
  126. Li, Y.; Wu, B.; Huang, Y.; Luan, S. Developing Trustworthy Artificial Intelligence: Insights from Research on Interpersonal, Human-Automation, and Human-AI Trust. Front. Psychol. 2024, 15, 1382693. [Google Scholar] [CrossRef]
  127. Montealegre-López, N. Exploring the Role of Trust in AI-Driven Decision-Making: A Systematic Literature Review. Manag. Rev. Q. 2025. [Google Scholar] [CrossRef]
Figure 1. Conceptual framework for an on-person cognitive co-pilot. Multimodal signals from the external environment and task context are acquired through a biosensing layer (wearables, e-textiles, eye tracking, and other sensors), processed on-device into probabilistic state estimates (signal-to-state), synthesized with task and historical context to generate actionable insights (state-to-insight), and delivered through modality-appropriate cues (insight-to-action). A continuous feedback and control loop supports context injection, natural-language querying, user override, and system calibration to promote appropriate reliance.
Figure 2. Over time, evaluation should extend from sensor validity in real conditions to robustness and drift awareness, calibrated uncertainty and abstention, human performance effects, and longer-term impacts on skill and reliance.
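One stage of the evaluation roadmap above, calibrated uncertainty, has a standard quantitative check: binned expected calibration error (ECE). The sketch below is a generic illustration of that metric, not an evaluation protocol from this framework; the bin count and the boolean correctness labels are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the occupancy-weighted average of |accuracy - mean
    confidence| per confidence bin. Lower values indicate that reported
    confidence better matches observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # map confidence in [0, 1] to a bin index, clamping 1.0 into the top bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated estimator (confidence equal to empirical accuracy in every bin) yields an ECE of zero; systematic overconfidence of the kind that drives alarm mistrust shows up directly as a larger value.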
Table 1. The four framework layers, their functional roles, and formal input variables and outputs.
| Layer | Functional Role | Input Variables | Output Formalization |
|---|---|---|---|
| Acquisition | Quality-aware data collection | x_t, q_t | Paired output (x_t, q_t), where signal quality is an explicit input |
| Estimation | Probabilistic state mapping | x_1:t, q_1:t, c_t | Posterior distribution p(z_t ∣ x_1:t, q_1:t, c_t) and uncertainty u_t |
| Reasoning | Grounded interpretation | s_t = {z_t, u_t, d_t, a_t, c_t} | r_t: schema-constrained actions or interpretations |
| Policy | Attention control and gatekeeping | s_t, r_t, user state, task demands | y_t: intervention decision (e.g., alert, defer, abstain) |
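The estimation layer's formalization in Table 1, a posterior p(z_t ∣ x_1:t, q_1:t, c_t) paired with an uncertainty term u_t, can be illustrated with a minimal sketch. Everything concrete below is an assumption for illustration: the discrete state set, the quality-weighted likelihood update, and the use of normalized posterior entropy as u_t are one possible instantiation, not the framework's specification.

```python
import math

# hypothetical discrete state space for z_t
STATES = ["nominal", "fatigued", "heat_strained"]

def estimate_state(prior, likelihoods, quality):
    """One Bayesian update toward p(z_t | x_1:t, q_1:t). The signal-quality
    weight q_t in [0, 1] tempers the likelihood (lik ** q), so poor-quality
    samples move the posterior less; q_t = 0 leaves the prior unchanged."""
    posterior = {z: prior[z] * likelihoods[z] ** quality for z in STATES}
    total = sum(posterior.values())
    posterior = {z: p / total for z, p in posterior.items()}
    # uncertainty u_t as Shannon entropy normalized to [0, 1]
    u = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    u /= math.log(len(STATES))
    return posterior, u

uniform_prior = {z: 1 / 3 for z in STATES}
post, u = estimate_state(
    uniform_prior,
    {"nominal": 0.2, "fatigued": 0.7, "heat_strained": 0.1},
    quality=0.9,
)
```

The same paired output (posterior plus u_t) is what the reasoning and policy layers consume downstream, which is why quality must be an explicit input rather than a hidden preprocessing step.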
Table 2. Illustrative computational responsibilities, feasibility, and resource considerations across framework layers.
| Metric | Layer 1 (Sensing) | Layer 2 (State Estimation) | Layer 3 (Reasoning) | Layer 4 (Intervention Policy) |
|---|---|---|---|---|
| Primary role | Signal acquisition, preprocessing, signal-quality flags | Probabilistic state estimation, uncertainty, drift detection, personalization within bounds | Contextual synthesis, explanation, option generation, retrieval-grounded recommendation | Decide whether, when, and how to cue and advise the user |
| Most realistic device location today | Watch, chest-worn node, textile node, or other wearable sensor | Smartphone, body-worn computer, vehicle-mounted edge processor | Smartphone, body-worn computer, or other local edge processor | Likely the same device as Layer 2 or 3 |
| Near-term implementation status | Clearly implementable today | Realistically implementable today on companion hardware | Implementable today only when bounded, compressed, and event-triggered | Implementable today as lightweight gating logic |
| Temporal profile | Continuous or high-frequency sampling | Recurrent low-latency updates | Intermittent and event-triggered | Near-immediate once upstream outputs are available |
| Typical latency target | Milliseconds to sub-second preprocessing windows | Sub-second to few-second state updates | Second-level latency acceptable in many non-safety-critical use cases; should be faster when urgency is high | Low-latency cue selection once recommendation state is available |
| Main memory/compute pressure | Very limited memory; lightweight DSP/feature extraction only | Compact predictive model, short-term history, calibration/drift checks | Highest memory burden: retrieval store, compressed model, schema enforcement, tool use | Minimal additional compute beyond policy evaluation |
| Main power pressure | Highest sensitivity to battery drain and duty cycle | Moderate; repeated on-device inference must remain stable over prolonged use | High per-call energy cost, reduced by intermittent use | Low, provided cueing logic remains simple |
| What should usually not run here | Full contextual reasoning, persistent retrieval, large-model inference | Heavy free-form generation if tight real-time inference is required | Continuous always-on sampling or raw-signal preprocessing | Complex interpretation without uncertainty and drift inputs |
| Preferred implementation style | Lightweight signal processing, feature extraction, compression, local quality estimation | Compact probabilistic model, bounded personalization, abstention and drift logic | Compressed local LLM, small reasoning model, rules + retrieval hybrid, strict schemas | Rule-based or utility-based gating policy |
| Fallback under constraint | Lower sampling rate, fewer channels, more buffering, reduced transmission | Lower update frequency, simpler model, wider uncertainty, increased abstention | Template-based logic, rule-based heuristics, deferred explanation, abstention | Silence/defer cue, request recalibration, or log only |
| Conceptual vs. implementable distinction | Largely implementable with current wearables | Largely implementable on current companion devices | Partly implementable now; strongest claims remain conditional on bounded local reasoning and compression | Implementable now if framed as a lightweight attention policy |
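The "fallback under constraint" row of Table 2 implies a small resource-aware controller that degrades each layer gracefully rather than failing outright. The sketch below is one hypothetical encoding of that mapping; the Budget fields, mode names, and numeric thresholds are all illustrative placeholders, not values prescribed by the framework.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    battery_pct: float  # remaining battery on the companion device, 0-100
    free_mem_mb: float  # available memory for models and retrieval store

def select_modes(b: Budget) -> dict:
    """Map a resource budget to per-layer operating modes following the
    fallback column of Table 2. Thresholds (30%, 10%, 512 MB) are
    illustrative and would be tuned per platform."""
    modes = {
        "sensing": "full_rate",
        "estimation": "compact_model",
        "reasoning": "local_llm",
        "policy": "utility_gating",
    }
    if b.battery_pct < 30:   # power pressure: duty-cycle sensing first
        modes["sensing"] = "reduced_rate"
    if b.free_mem_mb < 512:  # memory pressure hits Layer 3 hardest
        modes["reasoning"] = "template_rules"
    if b.battery_pct < 10:   # severe: widen uncertainty, log-only cueing
        modes["estimation"] = "low_frequency_wide_uncertainty"
        modes["policy"] = "log_only"
    return modes
```

Ordering the checks from sensing outward reflects the power-pressure row: the continuously sampling layer is throttled before the intermittently invoked reasoning layer is replaced by templates.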
Table 3. Illustrative end-to-end outcomes across framework layers.
| Example Outcome | Layer 1 Inputs | Layer 2 Outputs | Layer 3 Outputs | Layer 4 Action |
|---|---|---|---|---|
| Alert | PPG, skin temperature, IMU, task context, acceptable quality | Elevated fatigue and thermal strain posterior, moderate-to-high confidence, low drift | Retrieved local heat guidance, structured recommendation to rest and reassess | Brief haptic or audio cue at a feasible task boundary |
| Defer | Physiological arousal and task-demand indicators | Elevated stress and workload posterior, moderate confidence | Monitor-and-defer recommendation due to task phase | No immediate alert; delay cue until a lower-demand period |
| Abstain | Poor contact, motion artifact, missingness, unusual context | Wide uncertainty, unstable posterior, drift flag | No strong recommendation; request recalibration and sensor check | Abstain, log event, or fall back to simpler monitoring |
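The three outcomes in Table 3 suggest that the Layer 4 intervention policy can begin as simple threshold gating over Layer 2/3 outputs. The function below is a hypothetical minimal policy, not the framework's required implementation; the threshold parameters (u_max, conf_min, demand_max) are illustrative free parameters that would need per-user and per-task calibration.

```python
def intervention_policy(confidence, uncertainty, drift_flag, task_demand,
                        u_max=0.6, conf_min=0.7, demand_max=0.5):
    """Map upstream outputs to one of the three Table 3 outcomes.
    Abstention takes priority, so unreliable state estimates never
    reach the user as cues; deferral absorbs the remaining cases
    where confidence or task timing does not justify interruption."""
    if drift_flag or uncertainty > u_max:
        return "abstain"   # request recalibration / fall back to monitoring
    if confidence >= conf_min and task_demand <= demand_max:
        return "alert"     # cue at a feasible task boundary
    return "defer"         # keep monitoring; revisit at lower demand
```

Because abstention is evaluated first, a drifted sensor or a wide posterior silences the system regardless of how confident the reasoning layer's recommendation appears, which is the gatekeeping behavior the table describes.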
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Brunyé, T.T.; Petrimoulx, M.V.; Cantelon, J.A. From Sensing to Sense-Making: A Framework for On-Person Intelligence with Wearable Biosensors and Edge LLMs. Sensors 2026, 26, 2034. https://doi.org/10.3390/s26072034


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
