Article

Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection

1 Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu Ward, Kitakyushu 808-0135, Japan
2 Department of Informatics, Universitas 17 Agustus 1945 Surabaya, Semolowaru No. 45, Kota Surabaya 60118, Indonesia
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2025, 25(11), 3324; https://doi.org/10.3390/s25113324
Submission received: 16 April 2025 / Revised: 13 May 2025 / Accepted: 23 May 2025 / Published: 25 May 2025
(This article belongs to the Special Issue Sensors Technologies for Measurements and Signal Processing)

Abstract

In this paper, we propose Hybrid Euclidean Distance with Large Language Models (HED-LM), a novel few-shot optimization approach that improves example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.

1. Introduction

Few-shot prompting with large language models (LLMs) has attracted considerable interest as a groundbreaking method for tackling new tasks with limited training data. In contrast to conventional machine learning models, which typically depend on vast datasets for training, few-shot prompting enables LLMs to adapt to new tasks by leveraging only a small number of in-context examples. For example, GPT-3 has exhibited exceptional abilities in producing contextually appropriate and precise responses based on just a handful of instance prompts, emphasizing few-shot prompting’s potential to lessen the resource requirements of extensive training [1]. Key attributes of this approach include its flexibility across diverse tasks [2], its multilingual processing capabilities [3], and its superior reasoning performance in complex scenarios [4]. However, these advantages are counterbalanced by challenges such as limited generalization to unfamiliar domains [5], substantial computational resource requirements [2], and a pronounced dependence on effective prompt design [6,7].
The effectiveness of few-shot prompting largely hinges on selecting examples incorporated into the prompt. Studies have shown that using inappropriate examples, whether randomly selected or overly specific, can impede the model’s ability to generalize across tasks [8]. Moreover, thoughtful example selection is critical for optimizing resource utilization and reducing computational overhead while maintaining high performance [9]. Notably, prompts that integrate examples closely aligned with the target task have been observed to enhance model accuracy by as much as 30% compared to prompts with generic or mismatched examples [6]. These findings underscore the need for robust strategies in example selection to mitigate biases and improve generalization, especially in applications where data variability and complexity are high.
Real-world applications of few-shot prompting, particularly in domains involving sensor data, present additional complexities due to the inherent variability and noise in the data. Sensor data, such as accelerometer readings, often exhibit intricate patterns that are challenging to represent effectively in a few-shot prompting framework, as illustrated in Figure 1. Misrepresenting such data can significantly reduce task performance, emphasizing the importance of selecting representative and contextually relevant examples for the target task. For instance, Feng and Duarte [10] proposed a parameter transfer method for example selection in human activity detection using sensor data, achieving a 15% improvement in generalization. However, their approach was highly sensitive to data quality, with noisy inputs causing a performance drop of over 20%. Similarly, Ronando and Inoue [11] utilized a randomized sample selection method, transforming accelerometer data into graph representations for few-shot prompting. While this method improved task-specific performance by 22%, it lacked robustness in handling complex datasets compared to zero-shot learning setups.
These limitations highlight the necessity for structured and practical example selection strategies for few-shot prompting, particularly when dealing with high-variability and high-dimensional sensor data. Proximity-based measures have been explored as a potential solution, evaluating the similarity between examples and the target task. Two primary approaches are commonly considered: leveraging the internal reasoning capabilities of LLMs or employing distance-based metrics such as Euclidean distance. While LLMs excel in contextual evaluations, they struggle with numerical data, such as sensor readings, due to their text-centric architectures [11,12,13,14,15]. Conversely, distance-based methods provide straightforward numerical similarity assessments but fail to capture the contextual depth required for effective prompting.
Despite various approaches to few-shot prompting with sensor data, existing methods remain limited in adapting to complex and noisy real-world signals. Traditional machine learning models, such as Random Forests, rely on handcrafted features and require extensive labeled data for training. These models often struggle with intra-subject variability and overlapping patterns commonly found in fatigue detection scenarios [16]. While computationally simple, random selection frequently introduces irrelevant or misleading examples, especially when data complexity is high [17]. Distance-based approaches provide more structured numeric similarity but suffer from semantic ambiguity; examples may be numerically close yet contextually inappropriate, resulting in misclassification [18]. These shortcomings demonstrate the need for a hybrid selection mechanism combining quantitative precision and semantic relevance to support more robust and adaptable few-shot prompting.
As further illustrated in Figure 1, accelerometer signals from different users often exhibit visually overlapping patterns between fatigue and non-fatigue classes. Even when the true labels differ, these signals’ temporal structures and magnitudes may appear strikingly similar. This phenomenon exacerbates the abovementioned limitations, making it difficult for random sampling and distance-based methods to select representative examples reliably. These visual overlaps highlight the practical need for a more context-aware selection strategy to address semantic and numeric ambiguity in few-shot prompting tasks.
We adopt fatigue detection from accelerometer data as our case study to provide a rigorous benchmark for evaluating example selection strategies under real-world signal ambiguity. Fatigue detection is characterized by high intra-class variability, overlapping signal patterns, and subtle class boundaries, making it a suitable yet challenging scenario for evaluating the robustness of few-shot prompting. However, we emphasize that improving fatigue detection is not this research’s primary goal. Rather, fatigue detection serves as a representative task to assess the adaptability and effectiveness of our proposed selection framework in realistic sensor-based learning environments.
To overcome these challenges, we propose the Hybrid Euclidean Distance with Large Language Models (HED-LM) framework, which integrates numerical similarity filtering with LLM-based contextual evaluation to optimize example selection in sensor-based few-shot prompting. Our approach systematically integrates multiple stages to enhance the selection process. First, it employs numerical preprocessing to transform raw sensor signals into structured feature representations, ensuring that relevant features are retained while reducing noise. Next, Euclidean distance filtering is applied to identify the most numerically similar examples to the target task, serving as an initial selection mechanism. However, recognizing that numerical similarity alone is insufficient, HED-LM then incorporates LLM-based contextual relevance scoring, which refines the selected examples by evaluating their domain-specific semantic alignment. Finally, an optimized few-shot prompting strategy is applied to construct an effective in-context learning setup, ensuring that the chosen examples contribute to better model generalization.
This study aims to address the following research questions (RQs):
  • RQ1: How can the performance of few-shot prompting be improved for sensor-based classification tasks?
    Contribution 1: We explore the impact of optimized example selection strategies in few-shot prompting using wearable sensor data, highlighting challenges in generalization, variability, and signal ambiguity with a case study on fatigue detection.
  • RQ2: How can the proposed HED-LM framework enhance example selection in few-shot prompting for sensor data?
    Contribution 2: We introduce HED-LM, a novel hybrid framework that integrates Euclidean distance filtering and LLM-based contextual scoring to select semantically and numerically relevant examples.
  • RQ3: How does HED-LM compare to conventional example selection approaches such as random sampling and distance-based filtering in few-shot prompting?
    Contribution 3: We show that HED-LM improves few-shot prompting performance over conventional methods. Compared to random and distance-based selection, HED-LM achieves relative macro F1-score improvements of 16.6% and 2.3%, respectively, demonstrating the advantage of combining numerical similarity and contextual relevance in selecting examples.
This study introduces a novel strategy for few-shot prompting that leverages numerical proximity and domain-aware contextual reasoning using large language models (LLMs). While fatigue detection is adopted as a representative case due to its inherent signal ambiguity, the central contribution lies in developing a generalizable selection mechanism, Hybrid Euclidean Distance with LLM (HED-LM), that integrates distance-based filtering and LLM-driven relevance scoring. This dual-selection strategy addresses core limitations in existing few-shot approaches for sensor data, offering a scalable and adaptable prompting framework applicable beyond the fatigue domain, including other physiological, safety-critical, or activity recognition scenarios.
The remainder of this paper is structured as follows: Section 2 reviews the existing literature, identifying gaps this study seeks to address. Section 3 details the methodology and the HED-LM framework. Section 4 presents experimental results and discusses their implications. Section 5 explores the broader impacts and limitations of this research. Finally, Section 6 concludes with insights and potential directions for future work.

2. Related Work

Optimizing example selection in few-shot prompting remains a critical challenge, particularly in sensor-based data applications. This section reviews the literature on few-shot optimization and its broader application to sensor data, highlighting key challenges and opportunities. We employ physical fatigue detection as a representative test case to systematically evaluate these challenges. However, it is important to note that fatigue detection is not the primary goal of this research; instead, it serves as a benchmark to validate the effectiveness of HED-LM in optimizing example selection for high-variability sensor data. The insights gained from this case study are expected to generalize to other sensor-based prompting applications.

2.1. Few-Shot Optimization

Few-shot prompting is a method that allows large language models to generalize from a small set of labeled examples by leveraging in-context learning without additional training. This approach has gained increasing attention due to its efficiency across a wide range of domains, including action recognition [19], vision–language tasks [20], and text classification [21].
Although prompting-based models are highly flexible, their performance is heavily influenced by the quality of examples selected for inclusion in the prompt, especially when dealing with complex or specialized data. Several strategies have been proposed to optimize this process. Aguirre et al. [22] emphasized the importance of contextual evaluation in improving model performance, while Cegin et al. [23] examined the trade-offs between random and structured selection. Although random selection is computationally efficient and does not require additional heuristics, Ronando and Inoue [11] demonstrated that it often leads to significant performance degradation, particularly when applied to high-dimensional sensor data.
Alternative approaches have sought to impose more structure on example selection. Perez et al. [24] introduced a method combining cross-validation with a minimum description length criterion but found that it performed inconsistently across tasks. Similarly, Chang et al. [25] employed a K-means clustering approach to improve selection, yet this method relied on unlabeled data, limiting its applicability. A feature distribution analysis technique was proposed by Shin et al. [9], which identified informative samples based on data distribution properties, though it required extensive labeling and lacked adaptability across different prompting tasks.
Dynamic and similarity-based approaches have also been explored. Margatina et al. [26] proposed an active learning-based approach that effectively identified high-value examples but was limited by its single-iteration process, missing opportunities for iterative refinement. Similarly, Yao et al. [27] introduced in-context sampling (ICS), which aligned prompts based on data similarity but required substantial domain expertise. Methods like ACSESS [28] and IDS [29] provided more consistent improvements but faced scalability and computational constraints. Skill-KNN [30] further emphasized the difficulties in balancing domain generalization with robust example selection.
While these studies have made significant progress in example selection, there remains a persistent gap when applying these strategies to sensor-based few-shot prompting. This is especially critical in domains such as accelerometer analysis, where data variability, noise, and high dimensionality complicate similarity evaluation and example construction.

2.2. Few-Shot Prompting on Sensor Data

Applying few-shot prompting to sensor data presents additional complexities due to the fundamental differences between structured numerical signals and the text-centric architectures of large language models (LLMs). Unlike textual inputs, sensor readings are high-dimensional, noise-prone, and exhibit significant temporal dependencies, making them difficult to interpret using few-shot prompting frameworks for natural language tasks.

2.2.1. General Challenges in Sensor Data for Few-Shot Prompting

Sensor data inherently possess high dimensionality, significant variability, and noise. These complexities pose unique challenges for example selection in few-shot prompting. The high dimensionality and temporal dependencies in sensor signals complicate the representation and interpretation process, leading to potential misclassification if examples are not carefully selected. Additionally, noise and variability, common in real-world sensor data, further exacerbate the challenge of ensuring the representativeness and relevance of selected examples for accurate prompting.
Numerous research efforts have sought to tackle these difficulties. Liu et al. [31] integrated LLMs with physiological and behavioral time-series data but reported information loss when converting numerical signals into text representations. Similarly, Li et al. [15] explored LLM applications in human activity recognition but faced difficulties capturing subtle variations in sensor readings, which were poorly represented in text-based models. Advanced preprocessing pipelines, such as those developed by Siraj et al. [32], have been introduced to handle noisy sensor inputs. However, their high computational requirements have made real-time implementation impractical.

2.2.2. Fatigue Detection as a Representative Case Study

Fatigue detection was chosen as a representative case study due to its practical importance and inherent complexity. Accurately detecting fatigue has substantial implications for health, safety, and productivity across numerous healthcare and occupational safety domains. Moreover, accelerometer-based fatigue detection is characterized by subtle and overlapping sensor patterns, which make example selection critically important [11,15]. The subtle differences between fatigue and non-fatigue states, coupled with temporal dynamics and sensor noise, provide an ideal scenario to rigorously test and validate the effectiveness of example selection optimization methods like HED-LM.
A follow-up study [11] evaluated fatigue detection using accelerometer data as a case study for few-shot prompting, comparing list-based and graph-based representations. Their results indicated that zero-shot learning often outperformed few-shot prompting when random example selection was used, highlighting a significant limitation of current selection strategies. This underscores the need for a more structured and informed approach to example selection, particularly in sensor-based few-shot prompting applications.

2.3. Challenges and Motivations

Although the approaches discussed in the previous literature have significantly contributed to developing few-shot prompting, critical challenges still have not been fully resolved, especially in complex and highly variable sensor data. Firstly, example selection remains a weak point that significantly affects model performance under few-shot conditions. Random selection methods often produce irrelevant or even misleading examples, especially when the variability in the data is high. In contrast, numerical distance-based approaches (such as Euclidean distance) can provide clear similarity metrics but often neglect crucial contextual aspects in the learning process.
Secondly, the main challenge in using large language models (LLMs) for few-shot prompting on sensor data is to bridge the gap between numerical data representation and LLMs’ ability to perform context-based semantic reasoning. LLMs have been proven effective in deep context understanding of textual data, but they tend to have difficulties handling numerical representations of sensor signals, leading to suboptimal example selection.
Thirdly, the fatigue detection case study used in this research was chosen because it presents a unique challenge suitable as a benchmark for evaluating sample selection methods. Sensor data patterns for fatigue conditions are often highly similar to those of non-fatigue conditions, which makes the challenge of sample selection even more complicated and important to solve appropriately. Previous studies have shown that zero-shot approaches can surpass poorly optimized few-shot approaches, suggesting that structured and contextualized sample selection is critical to ensure optimal model performance.
Given these challenges, this study is motivated to develop a new approach that is more effective in example selection for few-shot prompting on sensor data. Therefore, we propose a Hybrid Euclidean Distance with Large Language Models (HED-LM) approach specifically designed to integrate numerical similarity evaluation with domain-based contextual reasoning performed by LLMs. With this approach, we aim to significantly improve the effectiveness of example selection, particularly in complex sensor-based application scenarios, such as fatigue detection. More broadly, this research also lays the foundation for an approach that can be applied to various scenarios in sensor data-driven learning.

3. Proposed Method

We propose a hybrid framework called Hybrid Euclidean Distance with Large Language Models (HED-LM) to address the challenge of selecting high-quality examples in few-shot prompting for sensor-based classification tasks. As shown in Figure 2, the structure is made up of two primary phases: the example selection stage, which identifies the most relevant labeled instances based on both numerical similarity and contextual relevance, and the inference stage, where a large language model uses a prompt constructed from the selected examples to predict the label of a new input. The overall design combines the strengths of numerical distance-based filtering, such as Euclidean distance, with LLM-driven semantic evaluation to ensure that the chosen examples are quantitatively similar and semantically aligned with the target input. This dual-filtering approach improves the accuracy, generalizability, and robustness of few-shot prompting, particularly in high-dimensional and noisy sensor data scenarios. The technical details of each component, including data preprocessing, feature extraction, distance-based filtering, LLM-based scoring, re-ranking, prompt construction, and label inference, are elaborated on in the following subsections.

3.1. Data Acquisition

In this study, we utilize a publicly available dataset comprising accelerometer magnitude signals collected from 19 participants during running activities, with a sampling frequency of 256 Hz [33]. Each subject’s session contains 180 consecutive samples representing raw acceleration magnitudes calculated from the x, y, and z axes, as defined in Equation (1).
$|a| = \sqrt{a_x^2 + a_y^2 + a_z^2}$
where $a_x$, $a_y$, and $a_z$ are the accelerometer values of the x, y, and z axes, respectively. Ground-truth labels were assigned as “fatigue” or “non-fatigue” based on physiological indicators and expert annotation during the activity session. The dataset consists of 6006 labeled instances, with each instance represented as a single 1D vector of 180 values. This dataset was selected due to its temporal complexity, class imbalance, and high intra-subject variability, making it an appropriate benchmark for evaluating few-shot prompting strategies in sensor-based fatigue detection.
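As a concrete illustration, the following minimal Python sketch applies Equation (1) to one session of tri-axial samples; the random input values are placeholders and not part of the dataset.

```python
import numpy as np

def acceleration_magnitude(ax: np.ndarray, ay: np.ndarray, az: np.ndarray) -> np.ndarray:
    """Acceleration magnitude |a| from tri-axial accelerometer samples (Equation (1))."""
    return np.sqrt(ax ** 2 + ay ** 2 + az ** 2)

# Placeholder session: 180 consecutive samples per axis, as in the dataset description.
rng = np.random.default_rng(0)
ax, ay, az = rng.normal(size=(3, 180))
magnitude = acceleration_magnitude(ax, ay, az)
print(magnitude.shape)  # -> (180,)
```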
However, these raw data cannot be directly utilized in the framework due to the LLM’s limited ability to interpret lengthy signal sequences effectively, often leading to misinterpretations [11]. To address this, the data undergo a preprocessing pipeline to enhance interpretability and align them with the LLM’s requirements.

3.2. Preprocessing

Preprocessing transforms the raw accelerometer signals into structured data that capture key characteristics while preserving temporal and spectral information.
Windowing. The accelerometer signal of each subject, comprising 180 samples, is segmented into three equal windows (0–60, 60–120, and 120–180 samples), representing the start, activity, and finishing phases, respectively. With a sampling frequency of 256 Hz, segment 0–60 represents a time duration of 0–234 milliseconds (ms), segment 60–120 represents 234–468 milliseconds (ms), and segment 120–180 represents 468–702 milliseconds (ms). Each phase encapsulates distinct characteristics, such as stabilization, consistent motion, and fatigue-related changes. This segmentation retains critical temporal variations, enabling the system to extract meaningful phase-specific features and detect subtle patterns indicative of fatigue.
By preserving this granularity, windowing supports detailed segment-level analysis, facilitating the extraction of robust time- and frequency-domain features. For each segment, metrics such as mean, standard deviation, min and max value, peak-to-peak, root mean square (RMS), skewness, kurtosis, dominant frequency, and low-frequency energy are computed, resulting in a 30-dimensional feature vector per subject. This detailed representation ensures that downstream processes can capture signal behaviors crucial for differentiating fatigue from non-fatigue states.
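The segmentation into start, activity, and finishing phases can be expressed as a short sketch; the slicing boundaries follow the windows described above, while the per-segment feature computation appears in the sketch under Feature Extraction below.

```python
import numpy as np

def window_signal(signal: np.ndarray) -> list[np.ndarray]:
    """Split one 180-sample magnitude signal into the start (0-60), activity (60-120),
    and finishing (120-180) phases described above."""
    assert signal.shape[0] == 180, "expected a 180-sample session"
    return [signal[0:60], signal[60:120], signal[120:180]]
```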
Low-Pass Filtering. The proposed method’s low-pass filtering stage is crucial for removing high-frequency noise from accelerometer signals while preserving the low-frequency components that carry meaningful information about physical movement. This stage guarantees that the signal quality is adequate for identifying features essential for fatigue detection.
The filtering process is implemented using a Butterworth filter [34], chosen for its smooth frequency response and minimal distortion in the passband. The core idea of the filter is to attenuate frequency components above a defined cutoff frequency ($f_{\text{cutoff}} = 30$ Hz), as physical activities like running or walking typically occur at frequencies below this threshold. The relationship between the cutoff frequency and the sampling rate ($f_s = 256$ Hz) in the dataset is defined in terms of the Nyquist frequency in Equation (2):
$f_{\text{Nyquist}} = \frac{f_s}{2}$
The normalized cutoff frequency is then calculated in Equation (3) as:
$f_{\text{normalized}} = \frac{f_{\text{cutoff}}}{f_{\text{Nyquist}}}$
The Butterworth filter is designed with an order of 4, determining the frequency roll-off’s steepness. Using the normalized cutoff frequency, the filter coefficients B and A are computed to represent the transfer function in Equation (4):
$H(s) = \frac{B(s)}{A(s)}$
Here, $B(s)$ and $A(s)$ are polynomials of the filter coefficients. To preserve the temporal integrity of the signal, a zero-phase filtering technique is applied using forward–backward filtering, mathematically expressed in Equation (5) as:
$y[n] = \text{Reverse}\big(\text{Forward}(x[n], H(s))\big)$
where $x[n]$ is the input signal and $y[n]$ is the filtered output. This approach ensures no phase distortion, maintaining the alignment of signal features across time.
The low-pass filtering process refines the accelerometer signal by removing irrelevant high-frequency components, making it cleaner and more representative of the underlying physical activity. This filtered signal becomes the foundation for accurate feature extraction and subsequent classification, contributing to the robustness of the proposed fatigue detection framework.
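A minimal sketch of this filtering stage, assuming SciPy’s butter and filtfilt implement the coefficient design and the forward–backward (zero-phase) pass, is shown below.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_filter(signal: np.ndarray, fs: float = 256.0,
                   cutoff: float = 30.0, order: int = 4) -> np.ndarray:
    """Zero-phase Butterworth low-pass filter following Equations (2)-(5)."""
    nyquist = fs / 2.0                                     # Equation (2)
    normalized_cutoff = cutoff / nyquist                   # Equation (3)
    b, a = butter(order, normalized_cutoff, btype="low")   # coefficients of H(s), Equation (4)
    return filtfilt(b, a, signal)                          # forward-backward pass, Equation (5)
```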
Normalization. After filtering, the signal is normalized using min–max scaling. This maps each segment’s values into the range $[0, 1]$, preserving relative differences while ensuring uniform scaling across segments. Normalization mitigates disparities in amplitude caused by varying movement intensities, facilitating fair comparisons between features in subsequent analysis.
Feature Extraction. The filtered and normalized signals are transformed into numerical feature vectors. These features encapsulate both time- and frequency-domain characteristics [35]:
  • Time-Domain Features: include mean, standard deviation, max and min values, root mean square (RMS), skewness, kurtosis, and peak-to-peak values, highlighting statistical and energetic properties.
  • Frequency-Domain Features: computed using Fast Fourier Transform (FFT), these include dominant frequency and low-frequency energy, reflecting motion’s spectral characteristics.
Combining these features, the framework captures a comprehensive representation of the accelerometer signal, enabling robust analysis of fatigue-related patterns.
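A sketch of the per-segment feature computation is given below. The ten features match those listed above; the 5 Hz boundary used for the low-frequency energy band is an assumption for illustration, since the exact band is not specified here.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def segment_features(segment: np.ndarray, fs: float = 256.0) -> list[float]:
    """Ten time- and frequency-domain features for one 60-sample segment."""
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(segment.size, d=1.0 / fs)
    dominant_freq = float(freqs[np.argmax(spectrum[1:]) + 1])      # skip the DC bin
    low_freq_energy = float(np.sum(spectrum[freqs <= 5.0] ** 2))   # assumed 0-5 Hz band
    return [
        float(np.mean(segment)), float(np.std(segment)),
        float(np.max(segment)), float(np.min(segment)),
        float(np.ptp(segment)),                                    # peak-to-peak
        float(np.sqrt(np.mean(segment ** 2))),                     # RMS
        float(skew(segment)), float(kurtosis(segment)),
        dominant_freq, low_freq_energy,
    ]

def subject_feature_vector(segments: list[np.ndarray]) -> np.ndarray:
    """Concatenate the three segment feature lists into one 30-dimensional vector."""
    return np.asarray([value for seg in segments for value in segment_features(seg)])
```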

3.3. Distance-Based Filtering in Candidate Selection

The proposed method transforms each subject’s accelerometer data into a feature vector by extracting and concatenating features from three distinct signal segments. For example, a subject’s feature vector comprises 30 dimensions with 10 features per segment, drawn from statistical and frequency-based characteristics such as mean, standard deviation, max–min values, root mean square (RMS), skewness, kurtosis, peak-to-peak, dominant frequency, and low-frequency energy. These features encapsulate key patterns and dynamics of the accelerometer signal, providing a compact representation for comparing labeled and unlabeled subjects.
When a new subject (unlabeled) is introduced, its accelerometer data undergo the same preprocessing pipeline to generate a corresponding feature vector, denoted as $x_{\text{new}}$. The Euclidean distance is then calculated to measure the similarity between this new vector and the feature vectors of previously labeled subjects. The Euclidean distance $d(x_{\text{new}}, y)$ between $x_{\text{new}}$ and a labeled subject vector $y$ is computed in Equation (6) as:
$d(x_{\text{new}}, y) = \sqrt{\sum_{i=1}^{m} (x_{\text{new},i} - y_i)^2}$
where $m = 30$ represents the dimensionality of the feature vectors. Smaller distances indicate greater similarity in numerical patterns, such as RMS values and average accelerations, suggesting that the new subject may share the same fatigue label as the closest labeled subject. A subsequent LLM scoring stage evaluates label compatibility within a contextual framework to refine the prediction further.
After computing the distances, the labeled subjects are ranked in ascending order based on their proximity to the new subject. The system selects the K-nearest candidates (e.g., K = 5 or K = 10) for further evaluation. This step reduces the computational burden by narrowing the candidate pool, ensuring that the subsequent LLM scoring focuses only on the most relevant labeled subjects. Since LLM-based assessments can be resource-intensive, incorporating this distance-based filtering improves efficiency without compromising accuracy.
The preliminary filtering process identifies the most relevant labeled subjects and creates a structured foundation for the next evaluation stage. The system reduces complexity during the LLM scoring stage by leveraging this streamlined selection process. This approach enhances the model’s ability to classify fatigue versus non-fatigue states with improved precision and reliability, ultimately advancing the framework’s overall efficacy.
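The distance-based filtering step reduces, in essence, to computing Equation (6) against every labeled feature vector and keeping the distance-K closest candidates, as in the sketch below.

```python
import numpy as np

def nearest_candidates(x_new: np.ndarray, labeled_vectors: np.ndarray,
                       labels: list[str], distance_k: int = 5) -> list[tuple[int, float, str]]:
    """Rank labeled subjects by Euclidean distance (Equation (6)) and keep the distance-K closest."""
    distances = np.linalg.norm(labeled_vectors - x_new, axis=1)   # Equation (6) for every candidate
    order = np.argsort(distances)[:distance_k]                    # ascending order of proximity
    return [(int(i), float(distances[i]), labels[i]) for i in order]
```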

3.4. LLM Scoring for Contextual Candidate Evaluation

The proposed method incorporates an advanced LLM-based scoring stage following the initial distance-based screening. This stage evaluates the relevance of labeled candidates (fatigue or non-fatigue) to a new subject by integrating contextual reasoning beyond numerical similarity. By leveraging the LLM’s ability to interpret patterns and assess label appropriateness, the scoring process enriches candidate selection, aligning it more closely with the unique characteristics of the new subject’s signal data. The LLM scoring is executed automatically via an Application Programming Interface (API), ensuring that the relevance assessment process is fully automated and does not require manual intervention.
  • Structuring Prompts for Effective Contextual Reasoning. Structured prompts are created for each candidate–new subject pair to facilitate accurate evaluations. These prompts are designed with three essential components:
    • Numeric Features: Each prompt provides statistical metrics and frequency-based characteristics extracted from the three signal segments (e.g., segment 1, segment 2, and segment 3). These features represent the signal’s critical characteristics.
    • Labeled Subject (Old Subject Label): The candidate subject’s label (e.g., fatigue or non-fatigue) offers a basis for comparison with the new subject.
    • LLM Guidance: Specific instructions to direct the reasoning process, such as “Please compare the numeric data segment by segment. If the differences in Mean, Std, RMS etc. are very small ⇒ high relevance. Also check whether the labeled subject labels (Fatigue/Non-Fatigue) are aligned with the numeric pattern of the new subject”.
    This structured approach ensures that the LLM provides all necessary information for a detailed and comprehensive assessment, minimizing ambiguity in its evaluation process.
  • Assigning Relevance Scores with Justifications. The LLM assesses the relevance of each candidate–new subject pair by assigning a score ranging from 0 to 1, where higher scores reflect more substantial alignment between the candidate’s label and the new subject’s signal patterns. An explanation accompanies each score, adding transparency and interpretability to the process. The output format follows a consistent template:
    • Score: A numerical value indicating relevance (e.g., SCORE: 0.90).
    • Reason: A concise justification, e.g.,: “Segments 1 and 2 exhibit high similarity in RMS values; the label ‘fatigue’ is appropriate due to elevated RMS levels in these segments”.
    This dual output format ensures that each decision is well supported, combining numeric analysis with contextual reasoning. By leveraging the LLM’s chain-of-thought reasoning, the scoring mechanism provides a nuanced understanding of the relationships between signal features and labels.
  • Contextual Label Synergy for Accurate Evaluations. A critical aspect of the scoring process is the assessment of label synergy, where the LLM evaluates how well the candidate’s label aligns with the new subject’s signal characteristics. For instance:
    • If a labeled subject’s signal (e.g., “fatigue”) aligns with the numeric patterns of the new subject (e.g., high RMS values in segments 2 and 3), a high relevance score is assigned.
    • Conversely, if the labeled subject’s label is “non-fatigue” but the new subject exhibits fatigue-like patterns, the relevance score is reduced, even in high numeric similarity.
    This contextual evaluation is formalized in Equation (7) as follows:
    $\text{Relevance}_{\text{LLM}}(x_{\text{old}}, x_{\text{new}}, \text{label}_{\text{old}}) = f\big(\Delta(x_{\text{old}}, x_{\text{new}}),\ \text{synergy}(\text{label}_{\text{old}}, x_{\text{new}})\big)$
    where $x_{\text{old}}$ is the feature vector of the labeled subject; $x_{\text{new}}$ is the feature vector of the new (unlabeled) subject; $\text{label}_{\text{old}}$ is the label of the labeled subject; $\Delta(\cdot)$ measures numerical differences in features such as mean, RMS, etc.; $\text{synergy}(\cdot)$ evaluates the alignment of the old subject’s label with the new subject’s signal pattern; and $f(\cdot)$ represents the LLM’s reasoning process. The LLM scoring mechanism enhances the precision of candidate selection for few-shot prompting by combining numerical comparisons with label alignment.
The LLM scoring stage bridges numerical signal analysis and label semantics, refining the few-shot prompting pipeline. Candidates with high numeric similarity and label alignment are prioritized, ensuring the most contextually relevant examples are selected. To further enhance this contextual alignment, domain knowledge explicitly guides the scoring process by providing clearly defined numerical thresholds derived from domain expert analysis and prior empirical studies. For instance, specific rules such as ‘RMS values above 0.5 in segments 2 and 3 typically indicate fatigue’ or ‘mean acceleration values below 0.31 strongly suggest fatigue conditions’ significantly increase the precision of contextual evaluation, reducing ambiguity during candidate assessment. This dual evaluation framework mitigates the risk of misclassification by deprioritizing candidates with conflicting labels, even if they exhibit numerical similarity.
This intelligent scoring process strengthens the few-shot prompting methodology by ensuring that the selected examples effectively guide the classification task. Consequently, the proposed approach demonstrates increased reliability and precision in differentiating between fatigue and non-fatigue states, rendering it a strong solution for detecting physical fatigue. More comprehensive details regarding the domain-specific thresholds and their integration into LLM prompts are explained in Appendix A.1. Moreover, more details on how to generate domain knowledge, the form of domain knowledge examples, and their placement in the prompting design are explained in Appendix A.2.
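A sketch of how one candidate–new subject pair could be scored is shown below. The call_llm callable stands in for the GPT-4o-mini API wrapper, and the prompt wording paraphrases the guidance and output template described above; both are assumptions for illustration rather than the exact prompt used.

```python
import re
from typing import Callable

def score_candidate(candidate_summary: str, candidate_label: str, new_summary: str,
                    domain_knowledge: str, call_llm: Callable[[str], str]) -> tuple[float, str]:
    """Build one scoring prompt and parse the SCORE/REASON reply (Equation (7) in practice)."""
    prompt = (
        f"{domain_knowledge}\n\n"
        f"Labeled subject ({candidate_label}):\n{candidate_summary}\n\n"
        f"New subject (unlabeled):\n{new_summary}\n\n"
        "Please compare the numeric data segment by segment. If the differences in Mean, Std, "
        "RMS etc. are very small => high relevance. Also check whether the labeled subject's "
        "label (Fatigue/Non-Fatigue) is aligned with the numeric pattern of the new subject.\n"
        "Answer in the format:\nSCORE: <value between 0 and 1>\nREASON: <short justification>"
    )
    reply = call_llm(prompt)
    score_match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    reason_match = re.search(r"REASON:\s*(.+)", reply, re.DOTALL)
    score = float(score_match.group(1)) if score_match else 0.0
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason
```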

3.5. Re-Ranking

The re-ranking stage in the proposed method serves as a critical intermediary between the distance-based filtering and the LLM scoring processes. After identifying the distance-K closest subjects using Euclidean distance and computing their relevance scores $\text{LLMScore} \in [0, 1]$ through the LLM evaluation, this stage reorders the candidates based on their LLM scores. Re-ranking transcends simple numerical similarity by prioritizing contextually relevant examples, ensuring that label alignment and contextual reasoning are incorporated into the candidate selection process.
Euclidean distance provides an initial measure of proximity between the new subject’s feature vector $x^{(\text{new})}$ and the labeled candidate’s $x^{(\text{old})}$. However, distance alone cannot capture the nuanced relationships LLM scoring offers. While numerical similarity identifies potentially relevant subjects, the LLMScore considers numerical patterns and label synergy, providing a more robust criterion for evaluating relevance. Mathematically, the re-ranking process is represented in Equation (8) as:
$\text{Top-}K\ \text{Candidates}(x^{(\text{new})}) = \operatorname{argsort\_desc}_{\text{top-}K}\left\{\text{LLMScore}\big(x_j^{(\text{old})}, x^{(\text{new})}\big)\right\}_{j=1}^{\text{distance-}K}$
where:
  • $x^{(\text{new})}$: feature vector of the new subject.
  • $x^{(\text{old})}$: feature vector of a labeled candidate within the distance-K closest subjects.
  • $\text{LLMScore}(\cdot)$: relevance score assigned by the LLM, reflecting numeric similarity and label alignment.
  • $\operatorname{argsort\_desc}$: a function that indexes candidates in descending order of their LLM scores and selects the top-K candidates.
This approach ensures that candidates with high LLM scores indicative of numeric and contextual relevance are prioritized for inclusion in the few-shot prompt or the final label prediction.
Distance-K and top-K are pivotal in shaping the efficiency and accuracy of the proposed HED-LM framework. Specifically, distance-K defines how many of the closest labeled candidates (based on Euclidean distance) are initially retrieved for evaluation. At the same time, top-K controls how many of those are selected after LLM-based scoring for the final decision-making. This hierarchical filtering strategy helps reduce computational overhead by narrowing down the candidate set before invoking the more expensive LLM reasoning. From a system design perspective, this ensures a balance between numeric similarity and semantic alignment, allowing the framework to prioritize quantitative and contextually relevant candidates. The specific values for distance-K and top-K used in this work (e.g., 5 and 3 or 10 and 5) were tuned based on empirical observations, as discussed further in the experimental setup.
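Given the LLM scores for the distance-K candidates, the re-ranking of Equation (8) is a simple descending sort followed by truncation to top-K, as sketched below under the assumption that each candidate is an (index, llm_score, label) tuple.

```python
def rerank_by_llm_score(scored_candidates: list[tuple[int, float, str]],
                        top_k: int = 3) -> list[tuple[int, float, str]]:
    """Reorder distance-K candidates by descending LLMScore (Equation (8)) and keep the top-K."""
    return sorted(scored_candidates, key=lambda candidate: candidate[1], reverse=True)[:top_k]

# Example: keep the 3 most relevant of 5 scored candidates (placeholder values).
candidates = [(12, 0.62, "fatigue"), (7, 0.91, "non-fatigue"), (3, 0.55, "fatigue"),
              (40, 0.88, "fatigue"), (21, 0.47, "non-fatigue")]
print(rerank_by_llm_score(candidates, top_k=3))
```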

3.6. Few-Shot Prompt

The few-shot prompt includes two examples (2-shot) from the re-ranked results, ensuring representation from both label categories: fatigue and non-fatigue. Each example is designed to contain the following key components:
  • Numeric Data Summary: a concise representation of extracted features from the three signal segments (Segment 1, Segment 2, and Segment 3).
  • Relevance Score and Reason: the relevance score assigned during the LLM scoring stage and a brief explanation of why the example aligns contextually with the new subject.
  • Label Conclusion: a definitive statement summarizing the example’s label, such as “Conclusion: The label is fatigue” or “Conclusion: The label is non-fatigue”.
After presenting these examples, the numeric data of the new subject (unlabeled) are appended to the prompt, along with an instruction guiding the LLM to make a final decision: “Please compare the new data with [Example 1 and Example 2] and determine the final label: ‘fatigue’ or ‘non-fatigue’”. This structured design ensures that the LLM provides concrete examples illustrating how specific numerical features correlate with their respective labels, enhancing its ability to generalize and make a well-informed prediction.
Few-shot prompting can be represented in Equation (9) as:
$P = \underbrace{(\text{Example 1: numeric} + \text{label})}_{\text{Shot-1}} + \underbrace{(\text{Example 2: numeric} + \text{label})}_{\text{Shot-2}} + \underbrace{(\text{New Data: numeric})}_{\text{Unlabeled Input}} \;\rightarrow\; \text{LLM} \;\rightarrow\; (\text{Final Label})$
Here:
  • P is the prompt that combines two labeled examples and one unlabeled target.
  • Shot-1 and Shot-2 provide the LLM with explicit references to numerical patterns and their associated labels.
  • The new data serve as the input for the LLM’s final label prediction.
In our case, Shot-1 and Shot-2 should ideally provide one representative example of each label, “fatigue” and “non-fatigue”. If the top-K candidates do not contain both labels, Shot-1 and Shot-2 may instead share the same label, either “fatigue” or “non-fatigue”.
By combining these elements, the prompt ensures that the LLM has sufficient context to evaluate the new subject against the labeled examples and predict the appropriate label. More details of few-shot prompting are described in Appendix A.3.
The few-shot prompt’s use of two examples (2-shot) was carefully considered to balance performance, interpretability, and computational efficiency. Including one representative example from each label category, fatigue and non-fatigue, ensures that the prompt provides clear reference points for the LLM to compare against the unlabeled data. This binary-class representation structure is particularly effective for classification tasks with subtle label boundaries. Additionally, using only two examples reduces the prompt length and minimizes the computational cost associated with API-based LLM inference while preserving the context needed for accurate decision-making.
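The prompt assembly described above can be sketched as follows; the dictionary keys (numeric_summary, score, reason, label) are illustrative names, and the wording of the instruction block follows the template quoted earlier.

```python
def build_two_shot_prompt(examples: list[dict], new_numeric_summary: str) -> str:
    """Assemble the 2-shot prompt of Equation (9), preferring one example per label."""
    shots = []
    for label in ("fatigue", "non-fatigue"):
        match = next((e for e in examples if e["label"] == label), None)
        if match is not None:
            shots.append(match)
    # If both labels are not represented among the top-K candidates, fall back to the
    # highest-ranked remaining examples so that two shots are still provided.
    for example in examples:
        if len(shots) >= 2:
            break
        if example not in shots:
            shots.append(example)

    blocks = []
    for i, shot in enumerate(shots, start=1):
        blocks.append(
            f"Example {i}:\n"
            f"Numeric Data Summary: {shot['numeric_summary']}\n"
            f"Relevance Score: {shot['score']} ({shot['reason']})\n"
            f"Conclusion: The label is {shot['label']}."
        )
    blocks.append(
        f"New Data (unlabeled):\n{new_numeric_summary}\n"
        "Please compare the new data with Example 1 and Example 2 and determine the final "
        "label: 'fatigue' or 'non-fatigue'."
    )
    return "\n\n".join(blocks)
```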

3.7. Predict Label

Once the few-shot prompt is constructed, the LLM is invoked to classify the new subject, assigning it a label of “fatigue” or “non-fatigue”. This stage integrates all preceding processes, namely example selection, feature extraction, and prompt formulation, guiding the LLM to make a precise and contextually informed decision.
  • Calling the LLM for Label Prediction. The finalized prompt, including the two most contextually relevant examples (2-shot) and the new subject’s unlabeled numeric data, is submitted to the LLM with explicit instructions. The LLM’s task is to generate a concise, single-word output, either “fatigue” or “non-fatigue”, to ensure focus and eliminate unnecessary verbosity. The process can be represented in Equation (10) as:
    $\text{Reply} = \text{LLM}(P)$
    where P is the constructed prompt containing the selected examples and the new subject’s numeric data. This approach ensures that the LLM remains focused on label prediction, using the provided examples to make an informed decision based on numerical and contextual relevance.
  • Parsing the LLM’s Response. The output of the LLM is parsed to extract the final label:
    • If the response is “fatigue”, the new subject is labeled “fatigue”.
    • If the response is “non-fatigue”, the new subject is labeled “non-fatigue”.
    In cases where the LLM produces an ambiguous response (e.g., “I believe it is non-fatigue, but maybe fatigue?”), a fallback mechanism is employed. This involves analyzing the frequency of “fatigue” and “non-fatigue” in the LLM’s output. The label with the higher frequency is selected, ensuring systematic resolution of ambiguity without compromising the consistency of the prediction process. A minimal sketch of this parsing and fallback step is given after this list.
  • Mapping Numeric Data to Labels. The overall label prediction process can be formalized mathematically in Equation (11) as follows:
    $\text{label}(x^{(\text{new})}) = \operatorname{Argmax}\ \text{LLM}\big(\text{prompt}_{\text{shots}},\ x^{(\text{new})}\big)$
    where:
    • $x^{(\text{new})}$: Feature vector of the new subject.
    • $\text{prompt}_{\text{shots}}$: The labeled examples provided in the few-shot prompt.
    • $\operatorname{Argmax}$: Function selecting the label (“fatigue” or “non-fatigue”) with the highest confidence or frequency.
    This representation captures the relationship between the new subject’s numeric data, the prompt examples, and the LLM’s output. It formalizes the label prediction process as a function of both prompt design and the LLM’s reasoning capabilities.
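The reply parsing and frequency-based fallback can be sketched as below; the tie-breaking choice for fully ambiguous replies is an assumption of this sketch, since only the higher-frequency rule is specified above.

```python
import re

def parse_final_label(reply: str) -> str:
    """Extract the predicted label from the LLM reply, using the frequency-based fallback."""
    text = reply.lower()
    # Count "non-fatigue" first so that those occurrences are not also counted as "fatigue".
    non_fatigue = len(re.findall(r"non[-\s]?fatigue", text))
    fatigue = len(re.findall(r"fatigue", text)) - non_fatigue
    if fatigue > non_fatigue:
        return "fatigue"
    if non_fatigue > fatigue:
        return "non-fatigue"
    # Tie-breaking is not specified in the text; this sketch defaults to "non-fatigue".
    return "non-fatigue"
```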
Algorithm 1 outlines the end-to-end workflow of the HED-LM approach. The framework systematically integrates preprocessing, feature extraction, candidate selection, LLM evaluation, and few-shot prompting to deliver accurate fatigue detection. Performance is assessed using the macro F1-score, ensuring balanced evaluation across fatigue and non-fatigue classifications.
Algorithm 1 HED-LM Algorithm with k-shot Examples
Require: File Path (F_path), Distance (k_d), Top-K (k_t)
Require: Domain Knowledge (DK), Few-Shot Examples (n_shots = k)
1: Load Data:
2:   D ← Load dataset from F_path
3:   X ← Extract features: X = {x_1, x_2, ..., x_n}
4:   L ← Extract labels: L = {ℓ_1, ℓ_2, ..., ℓ_n}
5: Process Features from preprocessing stages:
6:   F ← Feature set: F = {f_1, f_2, ..., f_n} derived from X and L
7: Initialize LLM:
8:   LLM ← Language model initialized with specified parameters
9: Define Prompt and Create Chain:
10:   P ← Prompt generated using DK and task-specific questions
11:   Chain ← Combine P and LLM
12: Hybrid Euclidean Distance with LLM Scoring (HED-LM):
13:   R ← Score features: R = {r_1, r_2, ..., r_n} using F, Chain, k_d, k_t
14: Final Prediction:
15: for i = 1 to n do
16:   T_i ← k-shot examples from R[i]
17:   P_i ← Few-shot prompt generated for feature f_i using T_i
18:   pred_i ← Predicted label from Chain and P_i
19:   Results ← Results ∪ {i, pred_i}
20: end for
21: Evaluation Results:
22:   MacroF1 ← (1/C) Σ_{c=1}^{C} [2 · Precision_c · Recall_c / (Precision_c + Recall_c)]
23: Output: MacroF1

4. Experiments

In this section, we detail the experiments conducted to assess the performance of our approach (HED-LM), which combines a distance approach and LLM scoring for few-shot optimization in detecting physical activity fatigue based on accelerometer signal data. We divide our experimental description into two parts: the experimental setup and the experimental results.

4.1. Experimental Setup

Experiment Objectives. Our experiments have two main objectives:
  • Testing the effectiveness of our approach. We intend to examine whether the end-to-end framework of our approach can improve the classification of physical fatigue detection compared to three baselines: (a) traditional machine learning, (b) randomized approach, and (c) distance approach.
  • Assessing the role of domain knowledge. We also investigate how domain knowledge inserted into the LLM scoring and prediction process affects the relevance assessment between labeled and new subjects, and whether it leads to more precise final labels.
Datasets and Sensor Contexts. We utilized the publicly available dataset from Kathirgamanathan et al. [33], comprising accelerometer-based time series data collected from 19 recreational runners using lumbar-mounted Shimmer IMUs at a sampling rate of 256 Hz. Each subject completed three tasks on an outdoor running track: (1) a 400 m run under non-fatigued conditions, (2) a multistage beep test used to induce fatigue, and (3) a follow-up 400 m run in a fatigued state. The primary signal was the acceleration magnitude derived from the tri-axial accelerometer data. From the two 400-meter runs per subject, individual strides were segmented and resampled using a Soft-DTW barycenter smoothing technique, yielding approximately 6006 instances for binary classification (fatigue vs. non-fatigue). This protocol ensures well-controlled fatigue induction and allows for subject-specific, stride-level analysis.
Preprocessing and Feature Extraction. Each 180-sample signal was windowed into three segments of 60 samples each. We then applied a low-pass filter (30 Hz cutoff) to reduce noise, followed by min–max normalization per segment. The retrieved features include mean, standard deviation, max, min, peak-to-peak, RMS, skewness, kurtosis, dominant frequency, and low-frequency energy for each segment. Thus, each subject has 30 features (3 segments × 10 features).
Distance and LLM parameters. In our approach, we set the distance-K and top-K parameters governing LLM scoring as follows:
  • #ParamA uses Euclidean distance with distance-K = 5, meaning we only take the five closest subjects as candidates, and performs LLM scoring with top-K = 3, which keeps only the three best of those five candidates based on the relevance assessment integrated with domain knowledge.
  • #ParamB uses Euclidean distance with distance-K = 10, meaning we only take the ten closest subjects as candidates, and performs LLM scoring with top-K = 5, which keeps only the five best of those ten candidates based on the relevance assessment integrated with domain knowledge.
The comparison of the two parameter settings, #ParamA (distance-K = 5, top-K = 3) and #ParamB (distance-K = 10, top-K = 5), was performed to examine the effect of the number of candidates retrieved and selected by the LLM on the performance of the HED-LM method. When distance-K and top-K are smaller (as in #ParamA), the process becomes more efficient due to fewer LLM calls and fewer final examples selected, but the candidate coverage is also narrower. In contrast, #ParamB assesses and retrieves more candidates, increasing the chance of finding the most similar subjects. By comparing the two, we can assess the trade-off between coverage (and potentially higher performance) and cost when parameterizing the HED-LM method.
We further conducted sensitivity experiments to evaluate the impact of varying distance-K and top-K values on performance. Lower values of distance-K (e.g., 3 or 5) reduced LLM call costs. However, they often excluded contextually relevant candidates, while higher values (e.g., 15) increased processing time without consistent performance gains due to the inclusion of less relevant examples. Similarly, increasing top-K beyond 5 diluted the label relevance scoring, as weaker candidates were retained. Empirically, we found that distance-K = 10 and top-K = 5 yielded the best balance between coverage and LLM scoring accuracy, while distance-K = 5 and top-K = 3 offered faster computation with slightly reduced performance. These configurations reflect trade-offs between precision and efficiency, which are further analyzed in Section 4.2.1.
For both parameter settings, we use the same model, GPT-4o-mini (OpenAI, San Francisco, CA, USA; accessed in March 2025), with temperature = 0.3, which provides more controlled and consistent model behavior. We chose GPT-4o-mini because it performs well in understanding accelerometer sensor magnitude data [11].
Performance evaluation. In this study, we evaluate four approaches for fatigue detection from wearable sensor data:
  • Traditional Machine Learning (ML): We used a Random Forest classifier trained on 30 features, namely eight time-domain statistical features and two frequency-domain features, which were calculated across three equally sized segments per input trace from raw accelerometer data. It does not involve any prompting or interaction with large language models. The inference is conducted locally, making it an LLM-independent, fully offline baseline.
  • Random Approach: A random selection of examples is incorporated into few-shot prompting with an API GPT-4o-mini model (OpenAI, San Francisco, CA, USA; accessed in March 2025), temperature = 0.3, and added domain knowledge information generated with the GPT-4o model (OpenAI, San Francisco, CA, USA; accessed in March 2025).
  • Distance Approach: Using Euclidean distance only (KNN-based), subjects with numerically close distances are input examples into few-shot prompting with GPT-4o-mini model (OpenAI, San Francisco, CA, USA; accessed in March 2025) and temperature = 0.3, adding domain knowledge information generated with the GPT-4o model (OpenAI, San Francisco, CA, USA; accessed in March 2025).
  • Our proposed method, HED-LM (Hybrid Euclidean Distance with Large Language Models), filters distance-based similarity and then ranks the selected examples based on contextual relevance scored by an LLM before constructing the prompt.
To ensure clarity in the computational and architectural differences between these methods, Table 1 summarizes the core characteristics of each approach.
$\text{Macro F1-Score} = \frac{1}{k}\sum_{i=1}^{k} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$
where $k$ is the total number of classes; $\text{Precision}_i$ and $\text{Recall}_i$ are calculated for each class $i$.
All models are evaluated using macro F1-score, as shown in Equation (12), as the primary metric under a 2-shot configuration. A 2-shot configuration was used based on exploratory experiments comparing 2-shot prompting and full-shot. Our findings showed that performance gains beyond the two examples were marginal, while longer prompts increased LLM latency and cost. Furthermore, the 2-shot format provides a balanced and interpretable structure for binary classification by selecting one example per class, fatigue and non-fatigue. This setup offers a practical compromise between model effectiveness and computational efficiency, making it suitable for real-time or resource-constrained scenarios. The 2-shot usage evaluation details are explained in Section 4.2.2.
To ensure robust and fair assessment across all methods, especially given the inter-subject variability inherent in physiological data, each experimental run was conducted on an isolated per-user subset. This user-specific evaluation setup prevents cross-user information leakage, supports subject-wise generalization analysis, and ensures that all classification results reflect personalized model behavior rather than pooled training effects.
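For reference, the macro F1-score of Equation (12) corresponds to scikit-learn's f1_score with average="macro"; the label lists below are placeholder values, not experimental results.

```python
from sklearn.metrics import f1_score

# Placeholder per-instance labels for one user's test slice.
y_true = ["fatigue", "non-fatigue", "fatigue", "non-fatigue", "fatigue"]
y_pred = ["fatigue", "non-fatigue", "non-fatigue", "non-fatigue", "fatigue"]

# average="macro" takes the unweighted mean of per-class F1, matching Equation (12).
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1 = {macro_f1:.3f}")
```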
Evaluation Protocol and Prompt Construction. To ensure methodological consistency and fair evaluation across all approaches, we adopted a unified experimental protocol that applies equally to the traditional machine learning (ML) baseline and the three LLM-based prompting strategies. Although the full dataset comprises 6006 labeled sensor instances, all experiments were conducted independently on a per-user basis, using a fixed slice of the dataset corresponding to each subject (e.g., user10: samples 1736 to 2063, totaling 327 instances). For the ML baseline, we trained a Random Forest classifier implemented using the RandomForestClassifier from scikit-learn (version 1.6.1, scikit-learn developers, USA) with 100 estimators and random_state=42. A minimal few-shot configuration was used, where n = 2 examples, one fatigue and one non-fatigue sample, were selected for training, and the remaining instances served as the test set. The same test set was reused without modification across all LLM-based methods, including random selection, distance-only, and our proposed Hybrid Euclidean Distance with Large Language Models (HED-LM).
For each test instance, a prompt was constructed using exactly two support examples drawn from the same user’s data, strictly excluding the test instance itself to ensure no data leakage or label contamination. In the random approach, examples were sampled uniformly at random; in the distance-only method, we selected the two most similar examples based on Euclidean distance in the 30-dimensional feature space; and in the HED-LM method, we first filtered candidates based on distance and then ranked them using an LLM relevance scoring mechanism that evaluated numerical similarity and semantic label alignment. All test samples were predicted independently using these prompt structures. Using a consistent n = 2 across both ML and LLM paradigms ensures a balanced and realistic few-shot comparison, particularly under low-resource settings. The fixed-size prompting strategy was also chosen to remain within the token limits of the LLM model (GPT-4o-mini) (OpenAI, San Francisco, CA, USA; accessed in March 2025) and to reflect practical constraints in real-world deployment scenarios. To further clarify the consistency of the evaluation setup across all approaches, Table 2 summarizes the key components of the experimental design. The table outlines the data source, training configuration, test isolation strategy, selection methods, and prompt construction across the machine learning baseline and all LLM-based methods. All models were evaluated on the same per-user data slice using exactly two support examples per test instance, ensuring methodological alignment and eliminating the risk of data leakage.
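A sketch of the Random Forest baseline under this protocol is shown below; how the single fatigue and non-fatigue training examples are drawn from the user's slice is not specified here, so the random draw is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def evaluate_rf_baseline(features: np.ndarray, labels: np.ndarray, seed: int = 42) -> float:
    """Train on one fatigue and one non-fatigue example from a user's slice, test on the rest."""
    rng = np.random.default_rng(seed)  # assumption: the two support examples are drawn at random
    fatigue_idx = rng.choice(np.flatnonzero(labels == "fatigue"))
    non_fatigue_idx = rng.choice(np.flatnonzero(labels == "non-fatigue"))
    train_idx = np.array([fatigue_idx, non_fatigue_idx])
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(features[train_idx], labels[train_idx])
    predictions = clf.predict(features[test_idx])
    return f1_score(labels[test_idx], predictions, average="macro")
```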
Figure 3 illustrates the complete data processing and inference pipelines across all compared approaches, visually supporting the detailed evaluation setup described earlier. All methods begin with a per-user data slice, followed by preprocessing and feature extraction steps. For the machine learning baseline (Random Forest), two samples (one fatigue, one non-fatigue) are used to train the classifier, and the remaining instances are held out for testing. In contrast, the LLM-based approaches use the remaining pool (excluding the test sample) to construct prompts based on different support example selection strategies. The random approach selects support examples arbitrarily, the distance-only approach selects based on Euclidean similarity, and our proposed HED-LM method combines distance filtering with LLM-based scoring to select the most semantically aligned examples. All test predictions are made independently, and the same test set is used consistently across methods. This diagram emphasizes the strict isolation of test instances and ensures that all predictions are made without leakage from the evaluation set.

4.2. Experimental Results

4.2.1. Impact of Distance-K and Top-K Parameters

Before analyzing the final performance comparison with baseline methods, we first evaluated the effect of key parameters on the performance of HED-LM. In particular, we compared four configurations: #ParamA with distance-K = 5 and top-K = 3, #ParamB with distance-K = 10 and top-K = 5, #ParamC with distance-K = 15 and top-K = 5, and #ParamD with distance-K = 15 and top-K = 7. The results, illustrated in Figure 4, demonstrate a clear trade-off between classification performance and computational cost.
From a performance perspective, increasing distance-K and top-K generally improved macro F1-scores. Specifically, the configuration (15/7) achieved the highest performance at 71.70%, followed closely by (15/5) at 68.45% and (10/5) at 64.42%. However, this increase in performance comes with a steep rise in computational time. Computation time more than doubled from (5/3) to (15/5), reaching over 5000 s for the latter. Interestingly, although (15/7) slightly outperformed (15/5) in macro F1-score, it required comparable computation time, yielding only a marginal improvement of approximately 3%.
These findings suggest that while larger values of distance-K and top-K enhance the contextual richness of few-shot prompts by broadening the candidate pool and improving relevance selection, performance gains plateau while computational costs continue to rise. Therefore, configurations (10/5) and (5/3) offer more favorable trade-offs.
The configuration (10/5) balances accuracy and efficiency, achieving a solid macro F1-score of approximately 64–66% with a moderate computation time. Meanwhile, (5/3) is particularly attractive for resource-constrained environments, as it offers competitive accuracy (around 67–68%) with substantially lower computation time, less than half of that required by the (15/5) and (15/7) configurations.
In summary, while the (15/7) configuration yields the highest macro F1-score, its computational demands make it less practical for real-world deployment. We, therefore, recommend using (10/5) as the default configuration for general use cases requiring a balance between performance and latency and (5/3) for deployment scenarios with tighter computational or time constraints. For this reason, our research focuses on performance comparison using #ParamA and #ParamB.

4.2.2. Effect of the Number of Shots in Few-Shot Prompting

To assess the effect of the number of shots in few-shot prompting, we compared the performance and computation time of the HED-LM method under 2-shot and full-shot configurations. The 2-shot setting includes two representative labeled examples (one per class) as input prompts. In the full-shot setting, the baseline methods place all available support instances, excluding the test sample, into the prompt without filtering or semantic selection, while the full-shot variant of HED-LM uses all available top-K examples selected via LLM scoring.
As shown in Figure 5, HED-LM in the 2-shot setting achieved the highest macro F1-score of 67.70% for User ID 4, outperforming not only other 2-shot baselines (Random: 63.51%, Distance: 60.91%, ML: 53.33%) but also the full-shot variant of HED-LM itself, which yielded only 52.68%. This finding highlights that using more examples does not always translate into better performance; the added complexity in full-shot prompting may introduce noise or redundancy that hinders LLM inference.
It is also important to clarify that this comparison isolates the inference stage only. The computation times reported in Figure 6 do not include the time spent on example selection for HED-LM. This design decision ensured a fair and consistent comparison across all methods: since the baseline approaches (Random, Distance, and full-shot) do not involve any selection or ranking process, their execution begins directly at the inference phase. Accordingly, the inference time measured for HED-LM excludes preprocessing overhead, making the comparison focused solely on prompt execution efficiency.
Regarding computation cost in Figure 6, 2-shot HED-LM was significantly more efficient, requiring only 147.31 s, far below the full-shot variant, which required 1129.84 s. This makes the 2-shot configuration almost 7.7 times faster than the full-shot, demonstrating a substantial computational advantage. The traditional machine learning (ML) model demonstrated the fastest computation time across all approaches. ML models operate entirely locally and do not involve external calls to large language model APIs. In contrast, all prompt-based methods (HED-LM, Random, Distance) depend on interaction with LLM APIs during inference, which introduces additional latency. Although this overhead is expected, it reflects a realistic deployment scenario for LLM-integrated systems.
These results show that 2-shot prompting achieves better classification performance while dramatically reducing inference time. The 2-shot configuration balances efficiency and effectiveness and proves more scalable for real-time or resource-constrained deployment. For this reason, we adopt 2-shot prompting as the standard setup in all subsequent experiments.

4.2.3. Per-User Performance Evaluation

To better understand how our method performs across different individuals, we evaluated the macro F1-score per User ID in detail. This per-user analysis is critical in fatigue detection using accelerometer data, where physiological and behavioral variability across users can significantly affect signal patterns. Each subject may exhibit unique movement dynamics and fatigue responses, so evaluating the macro F1-score individually allows us to assess the adaptability and generalization capability of the model in real-world settings.
By presenting macro F1-scores per user, we aim to evaluate both the robustness and fairness of the HED-LM framework. This approach also helps identify whether the few-shot prompting mechanism effectively adapts to unseen users using only a small number of examples. Table 3 presents the macro F1-score comparison of four approaches, namely Traditional Machine Learning (ML), Random, Distance, and our proposed HED-LM with two parameter configurations: #ParamA (distance-K = 5, top-K = 3) and #ParamB (distance-K = 10, top-K = 5). The results are reported as percentages for each User ID, highlighting that HED-LM consistently outperforms the baseline methods.
One of the most prominent findings is observed for User ID 4, where HED-LM with 2-shot prompting (#ParamA) achieved a macro F1-score of 67.70%, outperforming all baseline approaches (Random: 59.82%, Distance: 62.89%, ML: 53.92%). This suggests that HED-LM can effectively align semantic relevance and numerical similarity even with a minimal prompt to enhance classification for users with moderately distinguishable fatigue patterns.
In contrast, User ID 10 presents an unusual scenario where the traditional ML approach drastically underperforms (only 19.10%), whereas all prompt-based methods achieve much higher scores (HED-LM: 89.88%, Distance: 90.51%). This highlights the LLM’s ability to generalize from a few examples, even in cases where classical models struggle due to data imbalance or noise sensitivity.
A significant gain is also seen in User ID 8, where HED-LM (#ParamA: 83.19%) substantially outperforms the ML approach (52.57%) and other baselines. This illustrates the benefit of semantic filtering for users with high inter-class confusion, where fatigue and non-fatigue signals may have overlapping statistical features but clearer contextual signals.
Meanwhile, User ID 17 shows a relatively stable yet modest improvement, with all methods clustered around mid-range scores (HED-LM: ∼59%). This suggests that model performance may plateau regardless of selection strategy in certain users with ambiguous or low-quality sensor signals, indicating a potential limitation of LLM-based prompting without contextual enrichment.
Interestingly, for User ID 13, all methods perform similarly poorly (∼50%), indicating that prompt optimization has a limited effect when the signal characteristics are inherently weak or indistinct. This case supports the idea that data quality remains a bottleneck that even LLM-based reasoning cannot fully overcome.
Finally, on average, HED-LM (#ParamA) achieved the highest macro F1-score across all users (69.13 ± 10.71%), outperforming Random (59.30 ± 10.13%), Distance (67.61 ± 11.39%), and ML (50.4 ± 17.13%). These findings further reinforce that the hybrid strategy in HED-LM, combining numerical filtering and semantic scoring, is consistent across users and resilient to inter-subject variability. Additionally, HED-LM exhibits a lower standard deviation than ML across users, suggesting improved stability and fairness in subject-level prediction. More details about the confusion matrix of this performance comparison are shown in Appendix A.4.

4.2.4. Contribution of Domain Knowledge in Prompting

In addition to comparing the performance of the HED-LM approach against the baseline methods (Random, Distance, and ML), we also analyze the effect of domain knowledge on our approach. In this experiment, we measured performance with and without domain knowledge; without domain knowledge, the LLM only assesses the numerical data without applying the relevant threshold rules. Figure 7 compares these results for User ID 4, where the largest increase (+14.6%) occurs for the random approach and the smallest (+2.1%) for HED-LM with #ParamB.
The results show that domain knowledge substantially improves the accuracy of the HED-LM approach. With domain knowledge, HED-LM achieves 67.70% for #ParamA and 64.42% for #ParamB; without it, performance drops by 8.08% and 2.08%, respectively. This indicates that domain knowledge helps the LLM interpret the numerical context more deeply, resulting in more accurate predictions. For the baselines, the Random approach scores 45.25% and the Distance approach 63.78% without domain knowledge: adding domain knowledge clearly benefits the Random approach, whereas it slightly degrades the Distance approach. Although the Distance approach with domain knowledge decreases only marginally (–0.89%) relative to its counterpart without domain knowledge, our approach (HED-LM) still outperforms the Distance approach without domain knowledge. Moreover, compared with the ML baseline, HED-LM with domain knowledge is far superior (67.70% vs. 53.92%). This indicates that relying solely on machine learning algorithms without the support of domain knowledge is not enough to achieve optimal results, especially on data with complex structures.
Using domain knowledge not only improves performance but also provides higher reliability across usage scenarios, as seen for User ID 4. This shows that combining LLMs with domain knowledge can be an effective way to improve performance on complex tasks. Appendix A.4 gives further details on the differences in the confusion matrices when comparing the influence of domain knowledge.

5. Discussion

5.1. Comparative Performance Overview

In this section, we discuss the experimental results in more detail. We show that HED-LM (Hybrid Euclidean Distance–Language Model) generally provides a higher macro F1-score than the three primary baselines, namely the Random approach, the Distance approach, and traditional machine learning (ML), as shown in Figure 8. This finding confirms that distance-based filtering followed by LLM scoring and few-shot prompting can exploit both numeric synergy (distance) and label synergy (domain-knowledge-based LLM reasoning) to improve the accuracy of fatigue detection.
First, we compare HED-LM with the random approach, which selects few-shot examples at random without considering numeric data similarity. The results show that the random approach is prone to providing examples with irrelevant labels, especially when the variation of the sensor data is large, resulting in a lower macro F1-score. Meanwhile, HED-LM uses Euclidean distance to select the closest subjects and then refines the selection through LLM scoring, which prioritizes subjects whose labels (fatigue/non-fatigue) align with the numeric pattern of the new subject. Based on our experiments, HED-LM improves performance by about (+1.17%)∼(+30.79%) compared to the random approach, with the lowest improvement for User ID 18 and the highest for User ID 19.
Second, compared to the distance approach, the HED-LM method stands out because it adds a layer of “label synergy” in the LLM scoring. The distance approach relies only on numeric similarity, so “similar” but mislabeled subjects can be included in the list of few-shot examples. HED-LM downgrades such subjects through the LLM relevance score, resulting in a higher macro F1-score. However, we found one case (User ID 10) where the distance approach was better than HED-LM. Our analysis shows that, for that subject, the numeric data of the new subject are ambiguous and the domain knowledge is not fully applicable (e.g., RMS and mean fall in the ambiguous boundary range). Hence, the LLM tends to assign a “middle” relevance score to candidate subjects that are numerically close but carry an incorrect label.
On the other hand, the distance approach successfully places the closest distance subject with the correct label, so the classification on User ID 10 performs better with the distance approach. Cases like this emphasize the importance of calibrating domain knowledge to translate numeric data “in the gray range” more decisively. Based on the experimental results, we see that HED-LM can improve the performance by about (+0.24%)∼(+8.08%), where the lowest improvement is for User ID 17 and the highest is for User ID 13.
Third, HED-LM provides more adaptive outcomes than traditional machine learning (ML). ML models (e.g., Random Forest) effectively extract global regularities in data features. However, the few-shot prompting in HED-LM allows the LLM to assess new subjects more personally with the most relevant examples, where domain knowledge emphasizes specific thresholds (e.g., “mean lower than 0.31 then fatigue-like”). This LLM reasoning adds a layer of interpretation that traditional training-based ML lacks, especially when movement intensity varies (beginning vs. end of the segment). As a result, HED-LM improves performance over the ML approach by about (+0.07%)∼(+70.78%), with the lowest gain for User ID 19 and the highest for User ID 10.
We also examined two parameter configurations, namely #ParamA (distance-K = 5, top-K = 3) versus #ParamB (distance-K = 10, top-K = 5). In #ParamA, the system only calls the LLM for the five closest candidates and then selects the best three after re-ranking. This approach saves overhead and is suitable if the dataset has relatively straightforward distances between subjects. However, if the dataset tends to be more varied and many subjects are “on the border”, #ParamB can increase the coverage by scoring ten candidates and finally selecting the top five, potentially resulting in a better macro F1-score, as subjects that are “not too close in proximity but have strong label synergy” can be accommodated. Of course, #ParamB also nearly doubles the overhead of LLM calls. In our evaluation, there are cases where #ParamA efficiently yields high results when the numeric features of the dataset are more stable, whereas #ParamB is useful when the data display various borderline patterns, so the wider coverage helps the LLM re-ranking filter out truly suitable label candidates. When comparing the two parameter settings, the highest increase (+9.78%) occurs with #ParamB for User ID 13 and the lowest (+0.35%) with #ParamB for User ID 21.
To test the significance of performance differences among the five methods (Random, Distance, ML, HED-LM #ParamA, HED-LM #ParamB), we performed the Friedman test on the macro F1-score results for 19 subjects. The test results show that the difference is significant (F-statistic = 54.55, p-value < 0.0001), indicating that at least one method performs significantly differently from the other methods. In order to find out which method pairs were significantly different, we proceeded with the Nemenyi post hoc. The p-value table of post hoc results shown in Table 4 indicates some important findings:
  • Random vs. Distance (p = 0.0076) and Random vs. HED-LM (p < 0.0001) show that the Random method has a significant difference compared to Distance, HED-LM #ParamA, and HED-LM #ParamB.
  • Random vs. ML (p = 0.9727) showed no significant difference, suggesting that the two methods fall into the same, relatively less accurate, group compared to the others.
  • Distance vs. ML (p = 0.0007) was significantly different, highlighting that Distance was clearly better than ML.
  • Distance vs. HED-LM #ParamA /HED-LM #ParamB (p > 0.05) was not significantly different; neither was HED-LM #ParamA vs. HED-LM #ParamB (p ≈ 1.0). This indicates that the three methods (Distance, HED-LM #ParamA, and HED-LM #ParamB) are potentially in roughly equal performance clusters.
  • ML vs. HED-LM #ParamA/HED-LM #ParamB (p < 0.0001) is significantly different, indicating HED-LM #ParamA and HED-LM #ParamB are statistically superior to ML.
From these post hoc results, it can be concluded that:
  • Random and ML belong to lower-performing or unstable clusters; they are not significantly different from each other but differ clearly from the better-performing methods.
  • Distance, HED-LM #ParamA, and HED-LM #ParamB all form high-performance clusters that are not significantly different from each other. A p-value > 0.05 for each pair indicates there is no statistical evidence that one is truly superior.
  • Overall, these results validate our initial findings—that both hybrid methods (HED-LM #ParamA and HED-LM #ParamB) and the Distance approach perform better than Random or ML.
Thus, the Friedman and post hoc Nemenyi tests confirm that the improved macro F1-score results in the Distance and HED-LM variants are not a statistical fluke but rather represent real performance differences.
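The statistical tests reported above can be reproduced with SciPy and the scikit-posthocs package (assumed available in the environment). In the sketch below, scores is a hypothetical 19 × 5 matrix of per-subject macro F1-scores for the five methods; the random values are placeholders for illustration only.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# scores: rows = 19 subjects; columns = Random, Distance, ML, HED-LM #ParamA, HED-LM #ParamB
scores = np.random.rand(19, 5)  # placeholder values for illustration only

# Friedman test across the five related samples (one column per method)
stat, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman statistic = {stat:.2f}, p = {p_value:.4g}")

# Nemenyi post hoc test on the subjects-by-methods matrix (returns a pairwise p-value table)
p_matrix = sp.posthoc_nemenyi_friedman(scores)
print(p_matrix.round(4))
```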
After confirming that there were significant differences between methods (Friedman test) and identifying which pairs of methods differed (post hoc Nemenyi), we calculated Cliff’s Delta (δ) to assess how large, in practical terms, the differences between two specific methods were. Cliff’s Delta is a non-parametric metric that does not require the assumption of normality, which matches our data (macro F1-scores over 19 subjects). In essence, δ lies in the range of −1 to +1:
  • δ > 0 means the method to the left of “vs.” (e.g., “X vs. Y”) tends to be superior.
  • δ < 0 means the method to the right of “vs.” is superior.
  • | δ | ≥ 0.47–0.50 is often interpreted as a large effect, implying substantial differences in practice.
Our pairwise results confirm the existence of two practically distinct performance clusters:
  • Superior Cluster: Distance, HED-LM #ParamA, and HED-LM #ParamB
    • Cliff’s Delta shows values close to zero when these three methods are compared against each other. For example, Distance vs. HED-LM #ParamA ( δ = −0.097) and Distance vs. HED-LM #ParamB ( δ = −0.053), which are each classified as negligible effects ( | δ | < 0.1). Similarly, HED-LM #ParamA vs. HED-LM #ParamB ( δ = 0.042), so there is no strong indication that one of the three stands out significantly from the others.
    • In other words, within this superior cluster, the performance of Distance, HED-LM #ParamA, and HED-LM #ParamB were relatively balanced according to effect size, in line with the post hoc results, which also stated that they were not significantly different.
  • Lower Cluster: Random and ML
    • The inter-method comparison in this cluster also shows relatively close results (Random vs. ML: δ = 0.274, small–medium effect), indicating Random is slightly better but not too far behind ML.
    • It is when these two methods are compared with the methods in the superior cluster that large effects are seen. For example, Random vs. HED-LM #ParamA ( δ = −0.524) and Random vs. HED-LM #ParamB ( δ = −0.485) confirm that HED-LM #ParamA and HED-LM #ParamB are far superior to Random. In addition, Distance vs. ML ( δ = 0.562) illustrates a significant difference with a large effect to support the superiority of Distance. A value of | δ | above 0.47 indicates a truly different performance in practical terms, not just statistically significant.
Based on the overall effect size, we can conclude that the Distance, HED-LM #ParamA, and HED-LM #ParamB methods form one high-performing and mutually equivalent group, while Random and ML are in the lower-performing group—with often “large” effect differences when compared against the superior group. This result is consistent with the previous statistical analysis (Friedman test and Nemenyi post hoc), which suggested that although globally there are significant differences, the “top cluster” methods (Distance, HED-LM #ParamA, HED-LM #ParamB) are difficult to distinguish from each other convincingly. Thus, Cliff’s Delta provides strong evidence that the performance gap between the top cluster and the lower clusters is substantial while affirming the stability of performance within the top cluster itself.
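Cliff’s Delta for a pair of methods can be computed directly from the two per-subject score vectors. The sketch below is a straightforward O(n²) implementation written for illustration (it is not taken from the study’s code base), and the example score vectors are hypothetical.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in the range [-1, 1]."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    smaller = sum(1 for xi in x for yi in y if xi < yi)
    return (greater - smaller) / (len(x) * len(y))

# Example: macro F1-scores of two hypothetical methods over five subjects
hed_lm_a   = [0.68, 0.83, 0.90, 0.59, 0.50]
random_sel = [0.60, 0.64, 0.75, 0.55, 0.50]
print(cliffs_delta(hed_lm_a, random_sel))   # a positive value favours the first method
```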
Meanwhile, based on the p-value table in Table 5 for the six pairwise comparisons between HED-LM #ParamA/B and the other three methods (Random, Distance, ML), HED-LM #ParamA is consistently significantly different from Random (p < 0.001 in both tests) and ML (p < 0.001) and also significant when compared to Distance (p < 0.05 for both the t-test and Wilcoxon). This shows that HED-LM #ParamA is statistically superior in all these comparisons. In contrast, HED-LM #ParamB was also clearly better than Random and ML (p < 0.001) but showed no significant difference when compared to Distance (t-test p = 0.09976; Wilcoxon p = 0.06021), both of which exceed the α = 0.05 threshold. Thus, these test results indicate that HED-LM #ParamB is statistically equivalent to Distance (there is no evidence that they differ), whereas HED-LM #ParamA is significantly different from, and likely superior to, Distance. Overall, both HED-LM variants were shown to be superior to Random and ML, but only HED-LM #ParamA displayed a significant difference relative to Distance according to these tests.
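The paired comparisons in Table 5 correspond to standard paired tests, which can be run with SciPy as sketched below; the score vectors are placeholder values, not the per-subject results of the study.

```python
from scipy.stats import ttest_rel, wilcoxon

# Paired per-subject macro F1-scores (placeholder values for illustration)
hed_lm_a = [0.68, 0.83, 0.90, 0.59, 0.50, 0.72]
distance = [0.63, 0.80, 0.91, 0.58, 0.50, 0.70]

t_stat, t_p = ttest_rel(hed_lm_a, distance)   # paired t-test
w_stat, w_p = wilcoxon(hed_lm_a, distance)    # Wilcoxon signed-rank test
print(f"t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```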
Embedding thresholds (mean, std, RMS) in the LLM system message increases the sensitivity of the relevance score. We observed that when domain knowledge is disabled, LLM scores tend to “blend” (around 0.4–0.6), decreasing re-ranking power. However, if domain knowledge is available and aligned with the data, the LLM does not hesitate to give extreme scores (e.g., 0.1 or 0.9). As a result, candidate subjects whose labels and features perfectly match the new subject jump to the top of the ranking. However, the case of User ID 10 shows that inaccurate domain knowledge, especially when the subject’s numeric data lie in the vulnerable zone, can negatively impact LLM scores.
The overall results of this study support our initial goal of optimizing few-shot prompting for fatigue detection in complex sensor data by combining numerical analysis (distance) and label synergy reasoning (LLM). Most subjects experienced significant macro F1-score improvement when domain knowledge was included, indicating that the HED-LM method has broad potential to detect fatigue vs. non-fatigue with precision. However, we also noticed that there are “anomalous” cases, such as User ID 10, that should be further investigated, both in terms of the validity of the knowledge domain and the adjustment of the distance parameter. Thus, although HED-LM is generally superior, there are still opportunities to update the domain knowledge and adapt the scoring mechanism to make the method more robust in various sensor data conditions.

5.2. Limitations and Future Work

Alternative distance metrics. Euclidean distance was chosen because it runs in only $O(d)$ time, is easy to interpret geometrically, and is still the most common baseline in recent surveys of time-series similarity measures [36]. Its main limitation is that it treats the feature space as isotropic, ignoring directional information and any correlation between features. Cosine similarity addresses the first issue by focusing on angular alignment, whereas Mahalanobis distance addresses the second by weighting each dimension with the inverse covariance matrix. Computing the Mahalanobis form, $(x-\mu)^{\top}\Sigma^{-1}(x-\mu)$, requires a matrix–vector multiplication and thus scales quadratically with the feature dimension d, making it noticeably slower than Euclidean. Even so, ref. [36] reports several biomedical applications where Mahalanobis outperforms Euclidean and cosine because the covariance structure carries important class information. We therefore plan a systematic ablation study of Euclidean, cosine, and Mahalanobis distances in future work. A longer-term goal is to develop an adaptive, data-aware metric-selection module that can trade off the potential accuracy gains of richer metrics against their higher computational cost.
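The three candidate metrics differ only in how they weight the feature space, as the short SciPy sketch below illustrates. The feature vectors and the candidate pool are placeholders; in practice the covariance matrix would be estimated from the per-user candidate pool.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, mahalanobis

x = np.array([0.31, 0.12, 1.05, 2.4])   # hypothetical segment features
y = np.array([0.28, 0.15, 0.98, 2.1])
pool = np.random.rand(50, 4)             # placeholder candidate pool

d_euc = euclidean(x, y)                           # isotropic, O(d)
d_cos = cosine(x, y)                              # 1 - angular (cosine) similarity
vi = np.linalg.inv(np.cov(pool, rowvar=False))    # inverse covariance of the pool
d_mah = mahalanobis(x, y, vi)                     # covariance-weighted distance
print(d_euc, d_cos, d_mah)
```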
Window-Segmentation Strategy. HED-LM currently divides each 180-sample magnitude trace into three fixed windows. A rigid scheme guarantees identical feature dimensions across subjects and keeps the numeric summary compact enough for the LLM token budget. The downside is that physiological transitions seldom align with rigid boundaries; class-specific cues can be diluted if fatigue onset falls inside a window. Adaptive segmentation has therefore become an active research topic. Truong et al. [37] provide a comprehensive review of recent change-point detection (CPD) algorithms that identify statistical shifts in multivariate biosignals. CPD-driven windowing can dynamically resize segments around behavioral changes and has been shown, across multiple studies cited in that review, to improve activity- and health-state recognition without excessive computational overhead. Dynamic time-warping alignment is another adaptive alternative, but its quadratic cost makes real-time deployment more challenging. To balance fidelity and efficiency, the next version of HED-LM will adopt a two-stage design: a lightweight CPD routine will first mark coarse break-points; very short segments will then be merged so that no more than two examples are passed to the LLM. This pipeline preserves temporal nuance without inflating prompt length or latency, remaining compatible with edge-deployment constraints.
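As a hedged illustration of the planned CPD-based windowing, the sketch below uses the ruptures library associated with [37]; the penalty value, minimum segment length, and merge threshold are assumptions for illustration rather than tuned settings from this study.

```python
import numpy as np
import ruptures as rpt

signal = np.abs(np.random.randn(180))     # placeholder 180-sample magnitude trace

# Stage 1: lightweight change-point detection marks coarse break-points
algo = rpt.Pelt(model="rbf", min_size=20).fit(signal.reshape(-1, 1))
breakpoints = algo.predict(pen=5)         # e.g., [63, 121, 180]

# Stage 2: merge very short segments so only a few windows reach the LLM
segments, start = [], 0
for end in breakpoints:
    if segments and end - start < 30:     # merge a short segment into the previous window
        segments[-1] = (segments[-1][0], end)
    else:
        segments.append((start, end))
    start = end
print(segments)
```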
Hyper-Parameter Sensitivity. Section 4.2.1 shows that the macro F1-score varies with the values of distance-K and top-K. Large swings caused by example selection are also reported for in-context learning [38]. A small K risks omitting informative neighbors, whereas a large K increases latency and may introduce noise. We therefore plan to add an adaptive-K controller that expands the candidate pool until the local distance entropy stabilizes, similar in spirit to the adaptive distance-weighted k-NN proposed by W. Xue et al. [39]. This mechanism should reduce manual tuning and lower the risk of over-fitting to individual subjects.
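A minimal sketch of the adaptive-K idea follows, assuming the candidate pool is expanded until the entropy of the normalized nearest-neighbor distances stops changing by more than a tolerance. The stopping rule and parameter values here are purely illustrative assumptions, not the controller of [39] or a component of the current system.

```python
import numpy as np

def adaptive_k(test_vec, pool_vecs, k_min=3, k_max=15, tol=0.01):
    """Grow K until the entropy of the normalized K-nearest distances stabilizes."""
    dists = np.sort(np.linalg.norm(pool_vecs - test_vec, axis=1))
    prev_entropy = None
    for k in range(k_min, min(k_max, len(dists)) + 1):
        p = dists[:k] / dists[:k].sum()                  # normalize the K nearest distances
        entropy = -np.sum(p * np.log(p + 1e-12))
        if prev_entropy is not None and abs(entropy - prev_entropy) < tol:
            return k                                     # entropy has stabilized
        prev_entropy = entropy
    return k_max
```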
Feature-Extraction Scope. Our present features (mean, RMS, skewness, and dominant frequency) are token-efficient and interpretable but may miss non-linear or multi-scale patterns linked to fatigue. Wang et al. [40] showed that wavelet-packet energy features boost wearable-sensor fatigue detection by 7% over raw statistics. Transformer-based embeddings capture long-range dependencies in multivariate time series [41]. Integrating such rich features, followed by dimensionality reduction to fit the LLM prompt budget, is a promising avenue for the next version of HED-LM.
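For completeness, the four statistics named above can be computed per window as sketched below with NumPy/SciPy; the sampling rate passed as fs is an assumption used only to convert the FFT bin into Hz and is not specified by this excerpt.

```python
import numpy as np
from scipy.stats import skew

def window_features(window, fs=50.0):
    """Mean, RMS, skewness, and dominant frequency of one magnitude window (NumPy array)."""
    mean = np.mean(window)
    rms = np.sqrt(np.mean(window ** 2))
    skewness = skew(window)
    spectrum = np.abs(np.fft.rfft(window - mean))        # remove the DC component before the FFT
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    dominant_freq = freqs[np.argmax(spectrum)]
    return mean, rms, skewness, dominant_freq
```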
Comparison with Other Few-Shot Paradigms. HED-LM is training-free and contrasts with meta-learning approaches such as MAML. Hospedales et al. [42] review these methods and note their computational footprint and need for labeled support sets, constraints that conflict with our edge-deployment scenario. Parameter-efficient prompt tuning [43] and retrieval-augmented generation [44] offer a middle ground, requiring only a small set of frozen LLM parameters or an external memory. Exploring these hybrids with HED-LM is an important direction for future work.
Broader Applicability and Domain Extension. While this study adopts fatigue detection from unimodal accelerometer data as a representative use case, the proposed HED-LM framework is designed to be modular and domain-agnostic. It can be extended to other temporal signal classification problems that rely on physiological sensors such as heart rate, skin temperature, and electromyography, enabling multimodal integration in healthcare and industrial settings. Beyond fatigue, this approach can support tasks such as stress detection, sleep stage recognition, or ergonomic risk classification, where numerical features and contextual reasoning are essential. Future work will involve applying HED-LM across diverse domains and signal modalities to evaluate its effectiveness under more complex, real-world conditions. These cross-domain explorations will help identify scenarios where semantic reasoning from LLMs provides significant gains over purely numerical similarity measures.

6. Conclusions

In this study, we have proposed HED-LM (Hybrid Euclidean Distance–Language Model) for fatigue detection in accelerometer data. Through comprehensive experiments, HED-LM consistently outperforms baselines such as the random approach, distance-only selection, and traditional machine learning (ML), especially in terms of the macro F1-score. The success of HED-LM is supported by two important mechanisms: (1) distance-based filtering that selects the numerically closest subjects, and (2) LLM scoring that incorporates domain knowledge (thresholds on mean, RMS, etc.) to assess the suitability of fatigue/non-fatigue labels more semantically. Finally, candidate re-ranking and few-shot prompting ensure new subjects acquire the most relevant examples, resulting in more accurate mapping of sensor data to final labels.
On the other hand, the results also highlighted some aspects that require further attention. For example, we observed one case (User ID 10) where the distance approach outperformed HED-LM; this emphasizes the need for more precise calibration of domain knowledge, especially when the subject’s numeric data lie in the gray zone (the range between “fatigue-like” and “non-fatigue-like”). In addition, the selection of the parameters distance-K and top-K significantly impacts the efficiency and coverage of the candidates assessed by the LLM, requiring a balance between the overhead of LLM calls and the potential performance improvement.
In addition to finding a consistent improvement in the macro F1-score, we also conducted statistical analysis to validate the performance differences between the methods. The Friedman test results (p-value < 0.01) confirmed that at least one method was significantly superior to the others. Post hoc Nemenyi showed that the HED-LM approach (both #ParamA and #ParamB) formed a superior group significantly different from Random and traditional ML methods. Effect size measurements (Cliff’s Delta or Kendall’s W) support these findings with large effect category values, indicating that the superiority of HED-LM is substantial in practical terms, not just a statistical fluke. Therefore, the performance improvement we obtained is superior in metrics (macro F1-score) and proves to have significant differences and noticeable effects in the context of applying few-shot prompting on accelerometer data. Going forward, the outcomes of this statistical analysis can provide a strong basis for additional development, such as integrating more adaptive domain knowledge and evaluation in other domains that require robust few-shot classification.
For future work, we see several development opportunities to make the HED-LM approach more effective and widely applicable. First, the domain knowledge can be enriched by adding more holistic rules, such as including scenarios of varying movement intensity or emphasizing the analysis of intermediate frequencies that may affect fatigue patterns. Second, evaluating more diverse datasets, ranging from light intensity to high load, can gauge the method’s adaptability under more extreme real-world conditions. Third, we are considering fine-tuning LLMs to handle numeric data more optimally, reducing reliance on manual prompts and deepening the analysis of quantitative feature differences (e.g., mean, RMS, kurtosis). Finally, future research could include an adaptive mechanism for the parameters distance-K and top-K so that the system automatically balances the candidate coverage and the computational overhead of the LLM according to the data characteristics. These efforts will improve the capabilities of HED-LM and expand its applicability in various fatigue detection scenarios on complex sensor data.
Overall, this study highlights the potential of HED-LM as a hybrid approach in the domain of few-shot prompting on feature-rich sensor data. By combining numerical analysis (distance) and semantic reasoning (LLM scoring), this method can be further improved in terms of efficiency, interpretability, and generalizability to support various fatigue vs. non-fatigue detection applications in the context of healthcare and other fields.

Author Contributions

Conceptualization, E.R. and S.I.; methodology, E.R.; software, E.R. and S.I.; validation, E.R. and S.I.; formal analysis, E.R.; investigation, E.R.; resources, E.R.; data curation, E.R.; writing—original draft preparation, E.R.; writing—review and editing, E.R. and S.I.; visualization, E.R.; supervision, S.I.; project administration, E.R.; funding acquisition, S.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the JST-Mirai Program, Creation of Care Weather Forecasting Services in the Nursing and Medical Field, Grant Number JPMJMI21H3 in Japan.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available using public datasets that can be accessed at https://zenodo.org/records/7997851, accessed on 1 January 2025.

Acknowledgments

The authors gratefully acknowledge the support provided by the Ministry of Education, Culture, Sports, Science, and Technology (MEXT), Japan, during the course of this study. We also sincerely thank the members of the Sozo Laboratory for their invaluable assistance and collaboration throughout this research. Further appreciation is extended to Universitas 17 Agustus 1945 (Untag) Surabaya and the Ministry of Higher Education, Science, and Technology of the Republic of Indonesia for their continuous support and encouragement. Special thanks go to Enny Indasyah and Ryuga Alfatih Ernando for their unwavering support and motivation in completing this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
SD: Standard Deviation
GPT: Generative Pre-trained Transformer
HED-LM: Hybrid Euclidean Distance–Language Model
KNN: K-Nearest Neighbors
LLM: Large Language Model
LLMs: Large Language Models
ML: Machine Learning
RMS: Root Mean Square

Appendix A

Appendix A.1

This section describes the process of example selection using LLM scoring through carefully designed prompting. The first step involves creating a context incorporating feature information from each segment, paired with its corresponding label, based on the candidate examples provided by the distance-K filter. Additionally, feature details from unlabeled subjects, predicted during the evaluation, are included in the context to enrich the scoring framework.
To ensure the scoring process is systematic and practical, we design prompts that guide the LLM in evaluating relevance at each stage. These prompts include detailed instructions and domain-specific knowledge to help the model follow a structured scoring methodology. The questions are presented progressively, enabling the LLM to generate thoughtful and well-aligned responses to the assessment criteria.
Throughout the LLM scoring process, we use the GPT-4o-mini model (OpenAI, San Francisco, CA, USA; accessed in March 2025) with a temperature setting of 0.3 to maintain consistency and reliability. The interplay of domain knowledge, instructions, context, and staged questioning enhances the overall effectiveness of the scoring mechanism. A detailed illustration of the prompting design for this process is provided in Figure A1.
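A hedged sketch of the scoring call is given below, assuming the official OpenAI Python client. The message fields shown here only stand in for the full prompt design in Figure A1, and the final line assumes the model returns a bare numeric score, which is an illustrative simplification.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_candidate(domain_knowledge, candidate_summary, target_summary):
    """Ask GPT-4o-mini for a 0-1 relevance score of one labeled candidate example."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        messages=[
            {"role": "system", "content": domain_knowledge},
            {"role": "user", "content": (
                "Candidate example (with label):\n" + candidate_summary +
                "\n\nUnlabeled target segment:\n" + target_summary +
                "\n\nReturn a single relevance score between 0 and 1."
            )},
        ],
    )
    # Illustrative parsing: assumes the reply contains only the numeric score.
    return float(response.choices[0].message.content.strip())
```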
Figure A1. Prompting design for LLM scoring. The design incorporates several key components: (1) domain knowledge to establish threshold values for performance enhancement, (2) instructions to guide the LLM’s focus, (3) context to provide relevant background, and (4) questions to elicit targeted responses.

Appendix A.2

We use feature analysis to integrate domain knowledge produced by an LLM to improve HED-LM’s performance. In particular, we employ the GPT-4o model (OpenAI, San Francisco, CA, USA; accessed in March 2025) in the feature analysis procedure to extract domain knowledge that closely corresponds to the context of the feature values in the sensor data. This guarantees that the insights gleaned are highly pertinent and customized to the dataset’s features.
Figure A2 thoroughly outlines the architecture of the prompting system utilized to acquire this domain knowledge. This design is set up to let the LLM analyze the features methodically and derive insightful information. Furthermore, Figure A3 shows an example of the LLM’s answer that demonstrates how domain knowledge is created through this approach. These figures show how the approach is applied and how LLM-generated domain knowledge is incorporated into the HED-LM framework.
Figure A2. Prompting design for acquiring domain knowledge from sensor data feature analysis. The design leverages sensor data features to extract domain-specific insights, guiding the LLM through targeted instructions, contextual information, and structured queries for effective knowledge acquisition.
Figure A3. Example of LLM responses generating domain knowledge. The responses are utilized to establish domain-specific insights, which are subsequently applied in LLM scoring and few-shot prompting to enhance performance in the target task.

Appendix A.3

In this section, we outline the design of a few-shot prompting strategy for fatigue prediction, leveraging the GPT-4o-mini model (OpenAI, San Francisco, CA, USA; accessed in March 2025) with a temperature setting of 0.3. Our approach integrates several key components: domain knowledge, instructions, context, and questions. Each element is critical in crafting effective prompts tailored to the task. For the context component, we employ a top-K selection process to identify the most relevant examples based on evaluation results from LLM scoring. These examples are then filtered down to two, forming a 2-shot prompting setup. To ensure balance and enhance the model’s interpretability, we carefully select one example labeled as “fatigue” and another labeled as “non-fatigue”. This approach provides a clear distinction between the two states, aiding the model in better understanding the task.
In cases where the top-K examples lack representation from both labels, meaning all examples belong to either “fatigue” or “non-fatigue”, we construct the 2-shot prompt using examples with the same label. This adaptive strategy ensures the robustness of our method, even in scenarios with label imbalance. Figure A4 illustrates additional details on the design of our few-shot prompting approach.
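The label-balancing rule described above can be summarized by the following sketch, where candidates is assumed to be the top-K list ordered by LLM relevance score and each entry carries its label; the data structure is an illustrative assumption.

```python
def build_two_shot(candidates):
    """Pick one 'fatigue' and one 'non-fatigue' example from the top-K list.

    candidates: list of (features, label) tuples, already ordered by LLM score.
    Falls back to the two highest-scored examples if only one label is present.
    """
    fatigue = next((c for c in candidates if c[1] == "fatigue"), None)
    non_fatigue = next((c for c in candidates if c[1] == "non-fatigue"), None)
    if fatigue and non_fatigue:
        return [fatigue, non_fatigue]
    return candidates[:2]   # label imbalance: use the two best same-label examples
```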
Figure A4. Prompting design for few-shot prompting. The framework consists of multiple components: (a) domain knowledge to define threshold values for optimizing performance, (b) instructions to streamline the LLM’s focus, (c) context to supply essential background information, and (d) questions to facilitate precise and relevant responses.

Appendix A.4

This section presents the confusion matrix results for the baseline methods and our proposed approach, providing a detailed model performance analysis. The confusion matrix is a critical tool for evaluating model behavior, offering insights into the distribution of predictions through metrics such as True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These values are integral to calculating the macro F1-score, a key evaluation metric that captures the balance between precision and recall across all classes.
By comparing the confusion matrices, we aim to identify whether our proposed approach effectively addresses the limitations of the baselines. Specifically, we assess improvements in handling minority classes, reducing misclassification errors, and achieving a more balanced distribution of predictions across different classes. Figure A5, Figure A6 and Figure A7 illustrate the confusion matrices for the random, distance-based, and traditional machine learning approaches for individual participants, while Figure A8 and Figure A9 show the confusion matrices for our two parameterized approaches, evaluated under the same conditions. These comparisons demonstrate the effectiveness of our method in overcoming baseline weaknesses and achieving a more robust and balanced predictive performance. Meanwhile, Figure A10 shows the confusion matrix of User ID 4 performance without domain knowledge.
Figure A5. Confusion matrix for random approach.
Figure A6. Confusion matrix for distance approach.
Figure A7. Confusion matrix for traditional ML approach.
Figure A8. Confusion matrix for HED-LM with #ParamA.
Figure A9. Confusion matrix for HED-LM with #ParamB.
Figure A10. Confusion matrix of User ID 4 without domain knowledge.

References

  1. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  2. Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv 2022, arXiv:2208.03299. [Google Scholar]
  3. Lin, X.V.; Mihaylov, T.; Artetxe, M.; Wang, T.; Chen, S.; Simig, D.; Ott, M.; Goyal, N.; Bhosale, S.; Du, J.; et al. Few-shot Learning with Multilingual Generative Language Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9019–9052. [Google Scholar]
  4. Cao, K.; Brbic, M.; Leskovec, J. Concept Learners for Few-Shot Learning. arXiv 2021, arXiv:2007.07375. [Google Scholar]
  5. Qin, X.; Song, X.; Jiang, S. Bi-Level Meta-Learning for Few-Shot Domain Generalization. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15900–15910. [Google Scholar] [CrossRef]
  6. Gao, T.; Fisch, A.; Chen, D. Making Pre-trained Language Models Better Few-shot Learners. arXiv 2021, arXiv:2012.15723. [Google Scholar]
  7. Sclar, M.; Choi, Y.; Tsvetkov, Y.; Suhr, A. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  8. Adiga, R.; Subramanian, L.; Chandrasekaran, V. Designing Informative Metrics for Few-Shot Example Selection. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10127–10135. [Google Scholar] [CrossRef]
  9. Shin, J.; Kang, Y.; Jung, S.; Choi, J. Active Instance Selection for Few-Shot Classification. IEEE Access 2022, 10, 133186–133195. [Google Scholar] [CrossRef]
  10. Feng, S.; Duarte, M.F. Few-Shot Learning-Based Human Activity Recognition. arXiv 2019, arXiv:1903.10416. [Google Scholar] [CrossRef]
  11. Ronando, E.; Inoue, S. Leveraging Large Language Models to Enhance Understanding of Accelerometer Data on Physical Fatigue Detection Question Answering. In Proceedings of the 112th Mobile Computing and New Social Systems, 83rd Ubiquitous Computing Systems, 41st Consumer Devices & Systems, 30th Aging Society Design Joint Research, Matsuyama, Japan, 26–27 September 2024; pp. 1–8. [Google Scholar]
  12. Fan, Y.; Yang, D.; He, X. CTYUN-AI at SemEval-2024 Task 7: Boosting Numerical Understanding with Limited Data Through Effective Data Alignment. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Mexico City, Mexico, 20–21 June 2024; Ojha, A.K., Doğruöz, A.S., Tayyar Madabushi, H., Da San Martino, G., Rosenthal, S., Rosá, A., Eds.; pp. 47–52. [Google Scholar] [CrossRef]
  13. Spathis, D.; Kawsar, F. The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models. J. Am. Med. Inform. Assoc. JAMIA 2023, 31, 2151–2158. [Google Scholar] [CrossRef]
  14. Hota, A.; Chatterjee, S.; Chakraborty, S. Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data. ACM Trans. Intell. Syst. Technol. 2024, 1–25. [Google Scholar] [CrossRef]
  15. Li, Z.; Deldari, S.; Chen, L.; Xue, H.; Salim, F.D. SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition. arXiv 2024, arXiv:2410.10624. [Google Scholar]
  16. Jeworutzki, A.; Schwarzer, J.; Von Luck, K.; Stelldinger, P.; Draheim, S.; Wang, Q. Small Data, Big Challenges: Pitfalls and Strategies for Machine Learning in Fatigue Detection. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Greece, 5–7 July 2023; PETRA ’23. pp. 364–373. [Google Scholar] [CrossRef]
  17. Zhao, T.Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-Shot Performance of Language Models. arXiv 2021, arXiv:2102.09690. [Google Scholar]
  18. Cai, W.; Louise, C. Sensorimotor distance: A grounded measure of semantic similarity for 800 million concept pairs. Behav. Res. Methods 2022, 55, 3416–3432. [Google Scholar] [CrossRef]
  19. Shi, Y.; Wu, X.; Lin, H. Knowledge Prompting for Few-shot Action Recognition. arXiv 2022, arXiv:2211.12030. [Google Scholar]
  20. Jin, W.; Cheng, Y.; Shen, Y.; Chen, W.; Ren, X. A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models. arXiv 2022, arXiv:2110.08484. [Google Scholar]
  21. Min, S.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Noisy Channel Language Model Prompting for Few-Shot Text Classification. arXiv 2022, arXiv:2108.04106. [Google Scholar]
  22. Aguirre, C.; Sasse, K.; Cachola, I.; Dredze, M. Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models. arXiv 2023, arXiv:2311.08472. [Google Scholar]
  23. Cegin, J.; Pecher, B.; Simko, J.; Srba, I.; Bielikova, M.; Brusilovsky, P. Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification. arXiv 2024, arXiv:2410.10756. [Google Scholar]
  24. Perez, E.; Kiela, D.; Cho, K. True Few-Shot Learning with Language Models. arXiv 2021, arXiv:2105.11447. [Google Scholar]
  25. Chang, E.; Shen, X.; Yeh, H.S.; Demberg, V. On Training Instance Selection for Few-Shot Neural Text Generation. arXiv 2021, arXiv:2107.03176. [Google Scholar]
  26. Margatina, K.; Schick, T.; Aletras, N.; Dwivedi-Yu, J. Active Learning Principles for In-Context Learning with Large Language Models. arXiv 2023, arXiv:2305.14264. [Google Scholar]
  27. Yao, B.; Chen, G.; Zou, R.; Lu, Y.; Li, J.; Zhang, S.; Sang, Y.; Liu, S.; Hendler, J.; Wang, D. More Samples or More Prompts? Exploring Effective In-Context Sampling for LLM Few-Shot Prompt Engineering. arXiv 2024, arXiv:2311.09782. [Google Scholar]
  28. Pecher, B.; Srba, I.; Bielikova, M.; Vanschoren, J. Automatic Combination of Sample Selection Strategies for Few-Shot Learning. arXiv 2024, arXiv:2402.03038. [Google Scholar]
  29. Qin, C.; Zhang, A.; Chen, C.; Dagar, A.; Ye, W. In-Context Learning with Iterative Demonstration Selection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; pp. 7441–7455. [Google Scholar]
  30. An, S.; Zhou, B.; Lin, Z.; Fu, Q.; Chen, B.; Zheng, N.; Chen, W.; Lou, J.G. Skill-Based Few-Shot Selection for In-Context Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; pp. 13472–13492. [Google Scholar] [CrossRef]
  31. Liu, X.; McDuff, D.; Kovacs, G.; Galatzer-Levy, I.; Sunshine, J.; Zhan, J.; Poh, M.Z.; Liao, S.; Achille, P.D.; Patel, S. Large Language Models are Few-Shot Health Learners. arXiv 2023, arXiv:2305.15525. [Google Scholar]
  32. Siraj, F.M.; Ayon, S.T.K.; Uddin, J. A Few-Shot Learning Based Fault Diagnosis Model Using Sensors Data from Industrial Machineries. Vibration 2023, 6, 1004–1029. [Google Scholar] [CrossRef]
  33. Kathirgamanathan, B.; Cunningham, P. Generating Explanations to understand Fatigue in Runners Using Time Series Data from Wearable Sensors. In Proceedings of the ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH), Honolulu, HI, USA, 28 July 2023; pp. 1–11. [Google Scholar]
  34. Zhang, X.; Jiang, S. Application of Fourier Transform and Butterworth Filter in Signal Denoising. In Proceedings of the 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; pp. 1277–1281. [Google Scholar] [CrossRef]
  35. de Souza, P.; Silva, D.; de Andrade, I.; Dias, J.; Lima, J.P.; Teichrieb, V.; Quintino, J.P.; da Silva, F.Q.B.; Santos, A.L.M. A Study on the Influence of Sensors in Frequency and Time Domains on Context Recognition. Sensors 2023, 23, 5756. [Google Scholar] [CrossRef]
  36. Chen, Z.; Li, H. A survey on distance and similarity measures for time-series data analysis. Inf. Sci. 2020, 546, 441–465. [Google Scholar] [CrossRef]
  37. Truong, C.; Oudre, L.; Vayatis, N. Selective review of offline change point detection methods. Signal Process. 2020, 167, 107299. [Google Scholar] [CrossRef]
  38. Min, S.; Lewis, M.; Zettlemoyer, L.; Hajishirzi, H. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11044–11064. [Google Scholar]
  39. Xue, W.; Duan, L.; Hong, X.; Zheng, X. Adaptive weighting and nearest neighbor-based area control for imbalanced data classification. Appl. Soft Comput. 2025, 177, 113171. [Google Scholar] [CrossRef]
  40. Wang, J.; Sun, S.; Sun, Y. A Muscle Fatigue Classification Model Based on LSTM and Improved Wavelet Packet Threshold. Sensors 2021, 21, 6369. [Google Scholar] [CrossRef]
  41. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI 2021, Online, 2–9 February 2021; pp. 11106–11115. [Google Scholar]
  42. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-Learning in Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5149–5169. [Google Scholar] [CrossRef]
  43. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
  44. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
Figure 1. Comparison of fatigue and non-fatigue accelerometer signals from three different users. Each subplot shows one representative sample per class, extracted from running sessions. The yellow-highlighted region (time index 40–100) indicates a temporal window where the waveform patterns between fatigue and non-fatigue signals exhibit visual similarity. To quantitatively support this observation, we report Cosine Similarity and Dynamic Time Warping (DTW) distance between the two signals. Cosine Similarity captures directional similarity (1.0 = identical), while DTW accounts for temporal misalignment (lower values = more similar). These metrics further emphasize the challenge of signal overlap, motivating the need for context-aware example selection in few-shot prompting.
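To make the two similarity metrics reported in Figure 1 concrete, the following minimal sketch computes cosine similarity and a basic dynamic time warping distance between two equal-length accelerometer windows. The signals and variable names (sig_fatigue, sig_nonfatigue) are illustrative placeholders, not data from the study.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Directional similarity between two 1-D signals (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping distance (lower = more similar)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Illustrative windows standing in for the highlighted region (time index 40-100).
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 60)
sig_fatigue = np.sin(3 * t) + 0.1 * rng.standard_normal(60)
sig_nonfatigue = np.sin(3 * t + 0.2) + 0.1 * rng.standard_normal(60)

print(f"Cosine similarity: {cosine_similarity(sig_fatigue, sig_nonfatigue):.3f}")
print(f"DTW distance:      {dtw_distance(sig_fatigue, sig_nonfatigue):.3f}")
```

For near-identical waveforms the cosine similarity approaches 1.0 while the DTW distance shrinks toward 0, matching the interpretation given in the caption.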
Figure 2. Proposed HED-LM framework for physical fatigue detection. (a) Example selection stage: the optimal examples are identified based on numerical proximity and contextual relevance using Euclidean distance with LLM. (b) Inference stage: few-shot prompting is conducted with LLM utilizing the selected examples to classify physical fatigue.
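As a rough illustration of the two-stage selection in Figure 2a, the sketch below shortlists candidate windows by Euclidean distance and then re-ranks the shortlist with an LLM-derived relevance score before assembling the few-shot prompt used at inference (Figure 2b). The llm_scorer callable, the prompt wording, and the default distance-K/top-K values are hypothetical placeholders rather than the exact implementation.

```python
import numpy as np

def select_examples(query, candidates, labels, llm_scorer, distance_k=10, top_k=2):
    """Two-stage example selection: Euclidean shortlist, then LLM-based re-ranking."""
    # Stage 1: keep the distance_k candidates closest to the query window.
    dists = np.linalg.norm(candidates - query, axis=1)
    shortlist = np.argsort(dists)[:distance_k]

    # Stage 2: re-rank the shortlist by a contextual relevance score from the LLM.
    # llm_scorer is a hypothetical callable returning a relevance score in [0, 1].
    scored = [(idx, llm_scorer(query, candidates[idx], labels[idx])) for idx in shortlist]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in scored[:top_k]]

def build_prompt(query, candidates, labels, chosen):
    """Assemble a minimal 2-shot classification prompt from the selected examples."""
    lines = ["Classify the accelerometer window as 'fatigue' or 'non-fatigue'.\n"]
    for idx in chosen:
        lines.append(f"Signal: {candidates[idx].round(2).tolist()}\nLabel: {labels[idx]}\n")
    lines.append(f"Signal: {query.round(2).tolist()}\nLabel:")
    return "\n".join(lines)
```

In practice, llm_scorer would wrap an API call that asks the LLM to rate how relevant a candidate example is to the query window.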
Figure 3. Comparison of evaluation pipelines between the ML baseline and LLM-based approaches.
Figure 4. Comparison of HED-LM performance and computation time across different parameter configurations of distance-K and top-K on User ID 4. The blue line (left axis) shows the macro F1-score (%) for each configuration, while the red dashed line (right axis) represents the total computation time in seconds. While higher parameter values such as (15/7) yield better accuracy, they come with substantially higher computational costs. Configurations (10/5) and (5/3) demonstrate a more favorable trade-off between performance and efficiency.
Figure 5. Macro F1-score comparison for different methods using 2-shot and full-shot setups on User ID 4. HED-LM (2-shot) delivers the best performance, outperforming even the full-shot variant, which suffered from reduced accuracy.
Figure 6. Computation time comparison between 2-shot and full-shot configurations for different methods on User ID 4. HED-LM (2-shot) achieves the lowest computation cost among LLM-based methods, with significantly lower runtime than the full-shot variant.
Figure 7. Macro F1-score performance comparison: influence of domain knowledge on User ID 4. The results show that incorporating domain knowledge into the prompt improves performance for User ID 4 compared with prompting without it.
Figure 8. Overall user comparison: HED-LM and its parameters versus baselines. The comparison evaluates the proposed HED-LM framework and its key parameter settings across all users, showing higher and more consistent performance than the baseline methods.
Table 1. Summary of characteristics across evaluated approaches. Only the ML model is LLM-independent and operates without prompt-based inference.
Approach | Uses LLM | Example Selection | API Call | Prompt-Based
ML (Random Forest) | No | No (offline learning) | No | No
Random Approach | Yes | Random selection | Yes | Yes
Distance Approach | Yes | Euclidean distance | Yes | Yes
HED-LM (Ours) | Yes | Distance + LLM semantic scoring | Yes | Yes
Table 2. Evaluation setup and prompting strategies across ML and LLM-based methods.
Component | ML (Random Forest) | Random Approach | Distance Approach | HED-LM (Ours)
Data Source | Per-user slice | Same | Same | Same
Data Range | Per-user slice | Same | Same | Same
Training Samples (n) | 2 | 2 | 2 | 2
Test Set | Held-out | Same | Same | Same
Selection Strategy | Random from class | Random | Euclidean distance | Distance + LLM scoring
Prompt Structure | N/A | 2-shot | 2-shot | 2-shot, label-balanced
Leakage Prevention | Yes | Yes | Yes | Yes
Table 3. Macro F1-score evaluation: comparison between the proposed HED-LM approach and baseline methods. The results illustrate the superior performance of HED-LM, demonstrating its effectiveness in achieving higher macro F1-scores compared to the baseline approaches. Bold values indicate the best performance for each comparison.
User ID | ML Approach (%) | Random Approach (%) | Distance Approach (%) | HED-LM (Ours) #ParamA (%) | HED-LM (Ours) #ParamB (%)
4 | 53.92 | 59.82 | 62.89 | 67.70 | 64.42
5 | 70.51 | 85.15 | 88.30 | 87.23 | 88.83
6 | 69.22 | 67.61 | 71.04 | 72.45 | 71.57
7 | 59.97 | 49.93 | 62.26 | 65.14 | 61.93
8 | 52.57 | 68.90 | 79.40 | 83.19 | 81.87
9 | 43.52 | 61.64 | 65.81 | 65.25 | 66.19
10 | 19.10 | 62.69 | 90.51 | 89.88 | 86.83
11 | 52.85 | 48.30 | 51.55 | 57.24 | 59.49
12 | 71.45 | 73.17 | 81.67 | 83.81 | 84.25
13 | 49.11 | 49.88 | 50.81 | 52.76 | 58.89
14 | 34.79 | 50.95 | 56.66 | 59.31 | 58.64
15 | 36.39 | 52.47 | 57.40 | 60.85 | 59.37
17 | 45.46 | 47.95 | 59.73 | 59.97 | 59.18
18 | 39.99 | 62.91 | 63.79 | 63.00 | 64.08
19 | 74.83 | 44.11 | 73.20 | 74.90 | 70.93
20 | 70.46 | 61.36 | 75.43 | 76.68 | 77.57
21 | 60.01 | 63.02 | 64.93 | 64.93 | 65.28
22 | 37.07 | 58.30 | 61.46 | 63.11 | 59.63
23 | 17.32 | 58.49 | 67.70 | 66.11 | 68.09
Mean ± SD | 50.4 ± 17.13 | 59.30 ± 10.13 | 67.61 ± 11.39 | 69.13 ± 10.71 | 68.79 ± 10.24
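The macro F1-scores in Table 3 follow the standard definition; as a brief sketch, the snippet below shows how a per-user score and the mean ± SD summary row would typically be computed with scikit-learn and NumPy. The label vectors and the three-value array are illustrative placeholders (the latter reuses a subset of the #ParamA column), not a recomputation of the study's results.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative per-user labels; in practice these come from the LLM responses.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = fatigue, 0 = non-fatigue
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro") * 100
print(f"Per-user macro F1: {macro_f1:.2f}%")

# Summary row: mean and standard deviation across evaluated users.
per_user_scores = np.array([67.70, 87.23, 72.45])  # placeholder subset of Table 3
print(f"Mean ± SD: {per_user_scores.mean():.2f} ± {per_user_scores.std(ddof=1):.2f}")
```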
Table 4. Nemenyi post hoc test results between the Random, Distance, ML, HED-LM #ParamA, and HED-LM #ParamB methods. Values are the exact pairwise p-values, written in decimal form rather than scientific notation.
Method | Random | Distance | ML | HED-LM #ParamA | HED-LM #ParamB
Random | 1.000000 | 0.007625 | 0.972672 | 0.00001085355 | 0.00002331567
Distance | 0.007625 | 1.000000 | 0.0007448186 | 0.5369701 | 0.6372963
ML | 0.972672 | 0.000745 | 1.000000 | 0.0000004028977 | 0.0000009506715
HED-LM #ParamA | 0.000011 | 0.536970 | 0.0000004028977 | 1.000000 | 0.9998743
HED-LM #ParamB | 0.000023 | 0.637296 | 0.0000009506715 | 0.9998743 | 1.000000
Table 5. Statistical test results (p-value) for performance comparison between HED-LM #ParamA and HED-LM #ParamB methods with Random, Distance, and ML methods.
Comparison | t-Test p-Value | Wilcoxon p-Value
HED-LM #ParamA vs. Random | 0.00004 | 0.00000
HED-LM #ParamA vs. Distance | 0.02716 | 0.00240
HED-LM #ParamA vs. ML | 0.00006 | 0.00002
HED-LM #ParamB vs. Random | 0.00005 | 0.00000
HED-LM #ParamB vs. Distance | 0.09976 | 0.06021
HED-LM #ParamB vs. ML | 0.00009 | 0.00004
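The p-values in Tables 4 and 5 correspond to standard paired-comparison procedures. The sketch below illustrates how such tests are typically run, assuming the scipy and scikit_posthocs packages are available; the score matrix reuses only the first four rows of Table 3 as a placeholder and therefore does not reproduce the reported p-values.

```python
import numpy as np
from scipy.stats import friedmanchisquare, ttest_rel, wilcoxon
import scikit_posthocs as sp

# Placeholder score matrix: rows = users, columns = methods
# (ML, Random, Distance, HED-LM #ParamA, HED-LM #ParamB).
scores = np.array([
    [53.92, 59.82, 62.89, 67.70, 64.42],
    [70.51, 85.15, 88.30, 87.23, 88.83],
    [69.22, 67.61, 71.04, 72.45, 71.57],
    [59.97, 49.93, 62.26, 65.14, 61.93],
])

# Omnibus Friedman test across methods, then Nemenyi post hoc (as in Table 4).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman: chi2 = {stat:.3f}, p = {p:.4f}")
print(sp.posthoc_nemenyi_friedman(scores))

# Pairwise paired t-test and Wilcoxon signed-rank test (as in Table 5).
hedlm_a, distance = scores[:, 3], scores[:, 2]
print("t-test   p =", ttest_rel(hedlm_a, distance).pvalue)
print("Wilcoxon p =", wilcoxon(hedlm_a, distance).pvalue)
```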