1. Introduction
Buildings represent a major source of global
emissions, making energy efficiency in the built environment a critical priority. While advanced automation and optimization techniques have driven significant efficiency gains in residential and large commercial buildings, these approaches are not readily applicable to Small and Medium Offices (SMOs). Unlike large commercial buildings, which increasingly deploy Building Energy Management Systems (BEMS) supported by dedicated facility teams, SMOs typically lack centralized energy-management infrastructure, professional energy engineers, and the financial capacity to adopt advanced sensing or control systems. Consequently, energy-related decisions in SMOs are often made manually or governed by fixed schedules, resulting in persistent inefficiencies such as unnecessary cooling, suboptimal temperature setpoints, and prolonged lighting operation during low-occupancy periods. Occupant behavior has been widely recognized as a key determinant of building energy consumption [
1,
2]. Addressing energy inefficiency in SMOs therefore requires solutions that are cost-effective, easy to deploy, and operable without dedicated on-site management.
In such environments, reliable occupancy information is fundamental to effective energy management, supporting both adaptive system operation and the delivery of personalized energy-saving feedback to occupants. However, acquiring reliable occupancy data in SMOs remains challenging: vision-based systems introduce significant privacy risks and are difficult to deploy in shared office environments [
3], while device-assisted approaches such as badge tracking or smartphone sensing depend on active user compliance and are therefore unsuitable for continuous passive monitoring [
4]. Conventional sensor modalities, such as Passive Infrared (PIR) detectors [5] and CO2-based inference [6], are constrained by limited spatial resolution, delayed response times, and poor robustness under varying environmental conditions.
Wi-Fi Channel State Information (CSI) has emerged as a promising modality for device-free occupancy sensing, owing to its fine-grained characterization of multipath propagation and its compatibility with commodity wireless infrastructure [
7]. By combining fine-grained channel measurements with machine learning techniques, CSI-based methods have demonstrated strong performance in crowd counting and human activity recognition [
8,
9,
10,
11]. Unlike camera-based systems, CSI sensing operates without capturing visual information, making it inherently suitable for privacy-sensitive deployment environments [
12]. For instance, Zou et al. [
13] proposed WiFree, achieving 99.1% binary occupancy detection accuracy and 92.8% crowd counting accuracy in controlled settings. Yang et al. [
11] demonstrated CSI-based human detection and activity recognition in smart-home environments, reporting accuracies of up to 98% using commodity Wi-Fi routers. However, these studies are predominantly evaluated in controlled laboratory settings and do not address the system-level constraints of practical SMO deployment, including hardware limitations, long-term operational stability, and integration with energy management workflows.
Beyond algorithmic limitations, practical deployment of CSI-based sensing is further constrained by hardware dependencies and the substantial computational overhead of high-dimensional CSI processing, which typically demands PC- or cloud-level computing resources [
7,
8,
9,
10]. Furthermore, only a small number of studies have established an explicit link between CSI-derived occupancy estimates and practical energy management actions. Natarajan et al. [
14] demonstrated an ESP32-based CSI Human Activity Recognition (HAR) system capable of driving smart LED control, reporting estimated energy savings under representative occupancy scenarios. However, such demonstrations remain appliance-level and rule-driven and fall short of addressing SMO-oriented system requirements, including long-term operational robustness, multi-sensor coordination, and feedback mechanisms capable of sustaining behavioral change.
In the absence of centralized BEMS and dedicated facility personnel, energy conservation in SMOs relies predominantly on occupants’ voluntary behavioral responses. Research in behavioral science has demonstrated that delivering actionable, occupancy-aware energy-saving recommendations can substantially raise occupants’ energy consciousness and promote sustained conservation behaviors [
15]. Accordingly, a deployable and reliable occupancy-sensing capability is a prerequisite for generating the context-aware, occupancy-driven feedback that underpins effective behavioral energy management in SMOs.
Despite these advances, translating CSI-based occupancy sensing into a practically deployable system for SMOs remains an open challenge [
16]. Beyond computational constraints, existing CSI systems rarely achieve the end-to-end integration required for long-term unattended operation under the practical constraints of SMO environments. As a result, despite the demonstrated potential of CSI sensing, no integrated system-level solution currently exists that jointly addresses deployable occupancy estimation, robust edge preprocessing, and feedback-oriented behavioral energy management under SMO constraints.
Recent deep learning-based methods have substantially advanced CSI-based sensing accuracy and robustness, yet their computational and memory demands remain prohibitive for low-cost edge deployment in SMOs. In our prior work [
17], four image-based deep learning architectures (Vision Transformer, ResNet50, DenseNet121, and EfficientNetB0) were systematically benchmarked for CSI-based occupancy classification, with DenseNet121 identified as the strongest-performing baseline. Nevertheless, even this best-performing candidate depends on high-dimensional time-frequency spectrograms and cloud-level computation, rendering it unsuitable for direct edge deployment in resource-constrained SMO environments. To bridge this gap, Tiny Machine Learning (TinyML) techniques have been increasingly adopted to compress and optimize deep learning models for deployment on microcontroller-class hardware [
18]. Armenta-Garcia et al. [
19] demonstrated the feasibility of CSI-based inference on ESP32-S3 microcontrollers by leveraging model quantization, Pseudo Static Random Access Memory (PSRAM) utilization, and optimized embedded toolchains, achieving 92.43% accuracy for activity recognition. However, these efforts focus primarily on model executability rather than the system-level requirements of end-to-end deployable sensing, including robust signal preprocessing under non-stationary conditions and stable long-term operation on resource-constrained devices.
Beyond runtime feasibility, CSI-based occupancy sensing is a temporal inference problem by nature, as occupancy states and human activities produce time-varying amplitude patterns in CSI streams. Transformer architectures have proven effective at capturing long-range temporal dependencies in building energy applications, including energy consumption forecasting [
20] and HVAC fault detection [
21]. These results indicate that attention-based sequence modeling is well suited to the temporal structure of CSI data. Nevertheless, standard Transformer architectures are computationally intensive, precluding their direct deployment on microcontroller-class edge devices for on-device occupancy inference. This gap motivates the design of lightweight Transformer variants that preserve sequence modeling capability while satisfying the memory, latency, and robustness requirements of SMO edge deployments.
The BI-Tech (Behavioral Insight × Technology) project was initiated by Japan’s Ministry of the Environment in 2019 to bridge the gap between occupancy-sensing technology and occupant-driven energy conservation in office environments. The BI-Tech approach integrates IoT-based environmental sensing with data-driven decision support to promote self-motivated energy-saving behaviors, shifting the focus from passive automated control to active occupant engagement. Prior work within this framework [
22] has shown that systems delivering context-aware feedback and personalized energy-saving suggestions can raise occupants’ energy awareness and yield measurable reductions in energy consumption. However, the effectiveness of such behavioral interventions depends on reliable, low-cost occupancy sensing integrated with environmental monitoring, a requirement that existing CSI research has not yet fulfilled for SMO environments.
Taken together, these observations identify a critical gap: despite the technical maturity of CSI sensing and the demonstrated effectiveness of behavioral energy management, few existing systems jointly address low-cost occupancy estimation, multi-modal environmental monitoring, and feedback-oriented energy management under SMO constraints. To address this gap, this study presents a deployable edge-AI occupancy-sensing framework for SMOs within the BI-Tech platform, with the following three contributions:
A dual-core ESP32-S3 edge platform with PSRAM support, enabling concurrent multi-stage CSI preprocessing and multimodal environmental sensing on a single low-cost node;
A lightweight Mini Transformer model designed for the compact 56 × 8 CSI feature space, achieving 98.86% accuracy in multi-level occupancy classification with 138K parameters and a per-window data footprint of 24 kB, corresponding to a data volume reduction exceeding 129× compared with the DenseNet121 baseline;
Integration of the occupancy-sensing module with multimodal environmental sensors and the BI-Tech behavioral intervention platform, enabling occupancy-aware energy-saving reminders for lighting and HVAC management in SMO environments.
3. CSI Feature Extraction
To enable real-time occupancy inference under the hardware and communication constraints of SMOs, we propose a lightweight CSI feature extraction pipeline that performs local compression and noise suppression on edge devices before cloud transmission. Unlike conventional approaches that transmit raw CSI streams or apply computationally intensive deep feature learning on cloud servers [
17,
25], our method extracts compact statistical descriptors directly on the ESP32-S3 device, achieving a 129× compression ratio while preserving occupancy-discriminative patterns.
As illustrated in
Figure 3, the BI-Tech system adopts a unified edge-side processing architecture in which raw CSI amplitude streams undergo multi-stage denoising, subcarrier selection, and statistical feature construction on the device. The resulting feature snapshots are encoded into structured JSON payloads and transmitted to the cloud via MQTT over TLS. This design ensures distributional consistency between offline training and online deployment, as both stages execute identical feature extraction operations.
3.1. CSI Acquisition and Windowing
CSI provides fine-grained characterization of wireless channels by capturing multipath-induced amplitude and phase variations. In this study, CSI measurements are collected using a single-input single-output (SISO) configuration, with one ESP32-S3 device as transmitter (BI-Tech-Tx) and another as receiver (BI-Tech-Rx), each equipped with a single omnidirectional antenna. CSI amplitude streams are continuously sampled at 100 Hz over a 2.4 GHz Wi-Fi channel with a 20 MHz bandwidth (IEEE 802.11n). Each CSI sample contains 384 bytes of raw I/Q measurements provided by the ESP-IDF CSI API under our packet configuration, comprising the Legacy Long Training Field (LLTF), High-Throughput Long Training Field (HT-LTF), and Space-Time Block Coding High-Throughput Long Training Field (STBC-HT-LTF) [
26]. After removing null entries and converting I/Q pairs to amplitudes, 166 valid amplitude channels are obtained. To reduce on-device computation and memory usage, we retain the 56 highest-energy channels for subsequent feature extraction.
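The I/Q-to-amplitude step described above can be sketched as follows. This is a minimal illustration, not the on-device implementation: it assumes each CSI entry is stored as two signed 8-bit values (imaginary part followed by real part), and the function names and the `null_indices` argument are hypothetical. The positions of the null entries depend on the packet configuration and are not specified here.

```python
import math


def iq_to_amplitudes(buf):
    """Convert interleaved signed 8-bit I/Q values into per-entry amplitudes."""
    if len(buf) % 2 != 0:
        raise ValueError("buffer must hold an even number of int8 values")
    amplitudes = []
    for k in range(0, len(buf), 2):
        # Assumed byte order: imaginary first, then real.
        # The amplitude itself does not depend on this order.
        imag, real = buf[k], buf[k + 1]
        amplitudes.append(math.hypot(real, imag))
    return amplitudes


def drop_null_entries(amplitudes, null_indices):
    """Remove entries at known null positions (hypothetical index set)."""
    skip = set(null_indices)
    return [a for k, a in enumerate(amplitudes) if k not in skip]
```

With a 384-byte buffer this yields 192 complex entries, so 26 null positions would have to be removed to arrive at the 166 valid amplitude channels reported above.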
The continuous CSI stream is segmented into fixed-length temporal windows of 6 s, corresponding to $T = 600$ samples per window at the 100 Hz sampling rate. Adjacent windows overlap by 50% (3 s stride), yielding approximately nine segments per 30-s recording. This windowing scheme is adopted throughout the study to generate consistent input segments for subsequent preprocessing and feature construction.
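The windowing scheme can be illustrated with a short sketch. The function name and its keyword arguments are hypothetical; the defaults follow the values stated above (100 Hz, 6 s windows, 50% overlap), and a trailing partial window is simply dropped, which is an assumption.

```python
def sliding_windows(samples, fs_hz=100, window_s=6.0, overlap=0.5):
    """Split a sample sequence into fixed-length, overlapping windows."""
    size = int(round(window_s * fs_hz))          # samples per window
    stride = int(round(size * (1.0 - overlap)))  # samples between window starts
    if size <= 0 or stride <= 0:
        raise ValueError("window size and stride must be positive")
    # A trailing segment shorter than one full window is discarded.
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, stride)]
```

For a 30-s recording (3000 samples) this produces nine windows of 600 samples each, with consecutive windows starting 300 samples apart.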
Let the complex CSI of the $i$-th subcarrier at time $t$ be expressed as
$$H_i(t) = |H_i(t)|\, e^{j \angle H_i(t)}$$
where $|H_i(t)|$ denotes the CSI amplitude, $\angle H_i(t)$ denotes the phase, $i$ is the subcarrier index, and $t$ denotes the discrete time index within the window. Due to phase synchronization challenges in low-cost ESP32-S3 devices, this study focuses exclusively on amplitude information. We define $A_i(t) = |H_i(t)|$ as the amplitude sequence within each window ($t = 1, \ldots, T$), where $T = 600$ is the total number of samples per 6-second window.
3.2. Multi-Stage Denoising Pipeline
Raw CSI amplitude sequences captured by the ESP32-S3 are subject to impulsive outliers, missing or corrupted samples, and short-term measurement noise. Following the signal processing workflow in
Figure 3, a four-stage preprocessing pipeline is applied to each subcarrier sequence, including Hampel outlier removal, linear interpolation, Kalman smoothing, and wavelet-based denoising, in order to suppress noise components while retaining meaningful temporal variations in the CSI stream.
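First, a Hampel filter with window size $w$ and threshold factor $n_\sigma$ is applied to suppress impulsive outliers. The filtered output is defined as
$$A_i^{H}(t) = \begin{cases} \operatorname{med}_{W_t}(A_i), & \left| A_i(t) - \operatorname{med}_{W_t}(A_i) \right| > n_\sigma \cdot \mathrm{MAD}_{W_t}(A_i) \\ A_i(t), & \text{otherwise} \end{cases}$$
where $A_i(t)$ is the raw amplitude and $A_i^{H}(t)$ is the Hampel-filtered amplitude of the $i$-th subcarrier at time $t$; $\operatorname{med}_{W_t}(\cdot)$ denotes the median operator computed over the local window $W_t = [t - w,\, t + w]$; $\mathrm{MAD}_{W_t}(\cdot)$ is the median absolute deviation of the same local window, computed with normalization constant $\kappa$; $w$ is the Hampel window half-width; and $n_\sigma$ is the outlier threshold factor. Samples exceeding $n_\sigma \cdot \mathrm{MAD}_{W_t}(A_i)$ from the local median are replaced by the median value. The window size $w$ (60 ms at 100 Hz sampling) was chosen to balance outlier detection sensitivity and computational overhead on the ESP32-S3 platform.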
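A Hampel filter of the kind described above can be sketched in a few lines. This is an illustrative pure-Python version, not the optimized ESP-IDF implementation; the default values of `half_width`, `n_sigma`, and `scale` are assumptions chosen for the example (1.4826 is the conventional Gaussian-consistency constant for the MAD) and are not claimed to be the settings used in this study.

```python
import statistics


def hampel_filter(x, half_width=3, n_sigma=3.0, scale=1.4826):
    """Replace samples that deviate strongly from the local median."""
    n = len(x)
    y = list(x)
    for t in range(n):
        # Local window, truncated at the sequence boundaries.
        lo, hi = max(0, t - half_width), min(n, t + half_width + 1)
        window = x[lo:hi]
        med = statistics.median(window)
        mad = scale * statistics.median([abs(v - med) for v in window])
        if abs(x[t] - med) > n_sigma * mad:
            y[t] = med  # outlier: substitute the local median
    return y
```

The filter always reads from the original sequence `x`, so a replaced sample does not influence the decision for its neighbours.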
Second, missing or corrupted samples flagged during CSI packet reception are reconstructed via piecewise linear interpolation using adjacent valid samples, preserving temporal continuity of the amplitude stream.
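The interpolation step can be sketched as below. The function name and the boolean `valid` mask are hypothetical; holding the nearest valid value at the window edges is an assumption, since the text only specifies interpolation between adjacent valid samples.

```python
def interpolate_missing(x, valid):
    """Fill samples flagged invalid by linear interpolation between valid neighbours."""
    idx = [t for t, ok in enumerate(valid) if ok]
    if not idx:
        raise ValueError("at least one valid sample is required")
    y = list(x)
    # Edges: hold the nearest valid value (assumed behaviour).
    for t in range(0, idx[0]):
        y[t] = x[idx[0]]
    for t in range(idx[-1] + 1, len(x)):
        y[t] = x[idx[-1]]
    # Interior gaps: straight line between the bracketing valid samples.
    for a, b in zip(idx, idx[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)
            y[t] = (1.0 - w) * x[a] + w * x[b]
    return y
```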
Third, a discrete Kalman filter is applied to attenuate Gaussian measurement noise while preserving signal dynamics. The observation noise variance $R$ and process noise variance $Q$ are adaptively set proportional to the variance of the input sequence:
$$R = \alpha_R\, \sigma_H^2, \qquad Q = \alpha_Q\, \sigma_H^2$$
where $R$ is the observation noise variance representing measurement uncertainty; $Q$ is the process noise variance representing the expected rate of signal variation; $\alpha_R$ and $\alpha_Q$ are the corresponding proportionality constants; and $\sigma_H^2$ denotes the variance of the Hampel-filtered amplitude sequence $A_i^{H}(t)$ within the current window. The ratio $Q/R$ was empirically tuned to prioritize smoothness over rapid tracking, as occupancy changes occur at timescales longer than individual samples.
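A scalar Kalman filter with variance-proportional noise settings can be sketched as follows. The random-walk state model and the default scale factors `r_scale` and `q_scale` are assumptions made for illustration; they are not the tuned values of this study.

```python
import statistics


def kalman_smooth(z, r_scale=1.0, q_scale=0.01):
    """Scalar random-walk Kalman filter with noise variances tied to the input variance."""
    var = statistics.pvariance(z)
    r = max(r_scale * var, 1e-12)  # observation noise variance (kept strictly positive)
    q = q_scale * var              # process noise variance
    x_est, p = z[0], r             # initialise from the first measurement
    out = [x_est]
    for meas in z[1:]:
        p = p + q                  # predict: uncertainty grows by Q
        k = p / (p + r)            # Kalman gain
        x_est = x_est + k * (meas - x_est)
        p = (1.0 - k) * p          # update posterior uncertainty
        out.append(x_est)
    return out
```

Choosing `q_scale` much smaller than `r_scale` yields a small steady-state gain, which favours smoothness over rapid tracking.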
Finally, a four-level Daubechies-4 (db4) discrete wavelet decomposition is applied to further suppress high-frequency components (>10 Hz) that lie above the frequency range of human motion. The db4 wavelet was selected for its compact support and good time-frequency localization properties. High-frequency detail coefficients are thresholded using soft thresholding with the universal threshold
$$\lambda = \hat{\sigma} \sqrt{2 \ln T}$$
where $\lambda$ is the soft-thresholding threshold applied to the wavelet detail coefficients; $\hat{\sigma}$ is the noise standard deviation estimated via the median absolute deviation of the finest-scale wavelet coefficients; and $T$ is the number of samples in the current window.
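The thresholding rule can be sketched independently of the wavelet transform itself. The snippet below assumes the detail coefficients have already been obtained from a db4 decomposition (for example with a wavelet library; that step is not shown), and it uses the conventional factor 0.6745 to convert the median absolute deviation into a Gaussian standard deviation, which is an assumption about the estimator.

```python
import math
import statistics


def universal_threshold(finest_detail, n_samples):
    """Universal threshold sigma * sqrt(2 ln T), sigma estimated from the finest details."""
    sigma = statistics.median([abs(d) for d in finest_detail]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n_samples))


def soft_threshold(coeffs, lam):
    """Shrink each coefficient towards zero by lam; small coefficients become zero."""
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]
```

Soft thresholding shrinks the surviving coefficients as well as zeroing the small ones, which avoids the discontinuities that hard thresholding can introduce in the reconstructed signal.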
3.3. Subcarrier Selection and Feature Construction
After preprocessing, null subcarrier entries are removed according to the ESP32-S3 specification, yielding 166 valid amplitude channels across the three LTF fields. These channels serve as candidates for subsequent energy-based selection. For each valid subcarrier
$i$, temporal energy is computed as
$$E_i = \sum_{t=1}^{T} \tilde{A}_i(t)^2$$
where $\tilde{A}_i(t)$ denotes the amplitude of the $i$-th subcarrier at time $t$ after the four-stage denoising pipeline; $T$ is the number of samples in a 6-second window ($T = 600$); and $E_i$ is the total temporal energy of subcarrier $i$ within the window, used to rank subcarriers for selection. Subcarriers are ranked by $E_i$
in descending order, and the top 56 are retained. The energy criterion is adopted because it preferentially retains channels with higher signal-to-noise ratio, where occupancy-induced perturbations in variance, rate-of-change, and zero-crossing rate are most detectable, and its O(T) computational complexity satisfies the real-time processing constraint of the ESP32-S3 without introducing the overhead of transform-based or supervised dimensionality reduction methods. The value k = 56 was chosen to retain sufficient spectral diversity for cross-subcarrier feature discrimination while keeping the feature matrix compact for edge inference; a systematic k-value ablation is identified as a direction for future work.
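Energy-based selection amounts to one pass over each channel followed by a sort. In the sketch below, the function name is hypothetical, and returning the selected indices in their original subcarrier order is an assumption; the text does not state how the retained channels are ordered.

```python
def select_top_energy(channels, k=56):
    """Return indices of the k highest-energy channels and all channel energies."""
    # One O(T) pass per channel: sum of squared amplitudes.
    energies = [sum(a * a for a in seq) for seq in channels]
    ranked = sorted(range(len(channels)), key=lambda i: energies[i], reverse=True)
    # Keep the top k, reported in original subcarrier order (assumed convention).
    return sorted(ranked[:k]), energies
```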
For each selected subcarrier i, we compute eight statistical descriptors that characterize amplitude distribution, temporal variation, and oscillatory behavior. These features are designed to capture occupancy-discriminative patterns while maintaining computational efficiency for real-time edge processing.
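Mean amplitude quantifies the average channel response:
$$\mu_i = \frac{1}{T} \sum_{t=1}^{T} \tilde{A}_i(t)$$
where $\mu_i$ is the mean amplitude of the $i$-th selected subcarrier within the current window, $\tilde{A}_i(t)$ is its denoised amplitude at time $t$, and $T$ is the number of samples per window.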
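Standard deviation characterizes temporal fluctuation intensity:
$$\sigma_i = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( \tilde{A}_i(t) - \mu_i \right)^2}$$
where $\sigma_i$ is the standard deviation of the denoised amplitude sequence $\tilde{A}_i(t)$ of the $i$-th selected subcarrier within the window, and $\mu_i$ is the within-window mean defined above. Variance is computed as $\sigma_i^2$ to provide an alternative dispersion metric.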
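Maximum and minimum amplitudes capture the range of channel responses within a window:
$$A_i^{\max} = \max_{1 \le t \le T} \tilde{A}_i(t), \qquad A_i^{\min} = \min_{1 \le t \le T} \tilde{A}_i(t)$$
where $A_i^{\max}$ and $A_i^{\min}$ are the maximum and minimum denoised amplitudes of the $i$-th selected subcarrier within the window, respectively.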
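Energy is defined as the sum of squared amplitudes:
$$E_i = \sum_{t=1}^{T} \tilde{A}_i(t)^2$$
where $E_i$ denotes the total signal energy of the $i$-th selected subcarrier within the current window and $\tilde{A}_i(t)$ is its denoised amplitude at time $t$.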
Rate of change quantifies short-term amplitude variations between adjacent samples:
$$\mathrm{ROC}_i = \frac{1}{T - 1} \sum_{t=2}^{T} \left| \tilde{A}_i(t) - \tilde{A}_i(t-1) \right|$$
where $\mathrm{ROC}_i$ is the mean absolute difference between adjacent samples of the denoised amplitude sequence $\tilde{A}_i(t)$ of the $i$-th selected subcarrier within the window. Larger $\mathrm{ROC}_i$ values indicate more rapid temporal fluctuations, which are typically associated with human movement.
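Zero-crossing rate measures the frequency of sign changes in the zero-mean amplitude sequence, capturing oscillatory patterns:
$$Z_i = \frac{1}{2(T - 1)} \sum_{t=2}^{T} \left| \operatorname{sgn}\!\left( \bar{A}_i(t) \right) - \operatorname{sgn}\!\left( \bar{A}_i(t-1) \right) \right|$$
where $\bar{A}_i(t) = \tilde{A}_i(t) - \mu_i$ denotes the zero-mean amplitude sequence of the $i$-th subcarrier, obtained by subtracting the within-window mean $\mu_i$ from the denoised amplitude $\tilde{A}_i(t)$; $\operatorname{sgn}(\cdot)$ is the sign function returning $+1$ for positive values and $-1$ for negative values; and $Z_i$ is the normalized count of sign changes per sample pair within the window.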
Each 6-s CSI window is thus represented as a compact feature matrix:
$$\mathbf{X} = \left[ \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{56} \right]^{\top} \in \mathbb{R}^{56 \times 8}$$
where each row $\mathbf{x}_i^{\top}$ corresponds to one selected subcarrier and each column corresponds to one of the eight statistical descriptors defined above. This matrix serves as the direct input to the downstream Mini Transformer encoder.
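The eight descriptors can be computed in a single function per subcarrier. The sketch below is illustrative rather than the on-device code: the column order, the population (divide-by-T) variance, and treating an exactly zero centred sample as positive are all assumptions.

```python
import math


def _sign(v):
    """Sign with zero mapped to +1 (assumed convention)."""
    return 1 if v >= 0 else -1


def window_features(a):
    """Eight statistical descriptors of one denoised amplitude window."""
    T = len(a)
    mean = sum(a) / T
    var = sum((v - mean) ** 2 for v in a) / T   # population variance
    std = math.sqrt(var)
    energy = sum(v * v for v in a)
    # Mean absolute difference between adjacent samples.
    roc = sum(abs(a[t] - a[t - 1]) for t in range(1, T)) / (T - 1)
    centred = [v - mean for v in a]
    # Each sign change contributes |(+1) - (-1)| = 2, hence the factor 2.
    zcr = sum(abs(_sign(centred[t]) - _sign(centred[t - 1]))
              for t in range(1, T)) / (2 * (T - 1))
    return [mean, std, var, max(a), min(a), energy, roc, zcr]


def feature_matrix(selected_channels):
    """Stack per-subcarrier descriptors into a (subcarriers x 8) matrix."""
    return [window_features(seq) for seq in selected_channels]
```

For 56 selected subcarriers the result has 56 rows and 8 columns, i.e. 448 values per window.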
3.4. Edge-Oriented Compression and Transmission
The proposed statistical feature representation significantly reduces the amount of data transmitted compared to raw CSI. Each 6-second window contains $T = 600$ CSI packets, with 384 bytes per packet under the ESP-IDF configuration, resulting in approximately 225 KB of raw data. After feature extraction, the compact 56 × 8 matrix requires only 1.75 KB (448 float32 values), corresponding to a compression ratio of approximately 129×.
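The stated figures can be checked with a few lines of arithmetic. The constant names are illustrative, and the sketch assumes that KB is used in the binary sense (1 KB = 1024 bytes), which is the interpretation under which the quoted values come out exactly.

```python
PACKETS_PER_WINDOW = 6 * 100              # 6 s at 100 Hz
RAW_BYTES = PACKETS_PER_WINDOW * 384      # raw I/Q bytes per window
FEATURE_BYTES = 56 * 8 * 4                # 448 float32 values

raw_kib = RAW_BYTES / 1024                # 225.0
feature_kib = FEATURE_BYTES / 1024        # 1.75
ratio = RAW_BYTES / FEATURE_BYTES         # about 128.6, reported as ~129x
```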
All feature extraction operations including the four-stage denoising pipeline, subcarrier selection, and statistical computation are executed locally on the ESP32-S3 edge device using an optimized ESP-IDF implementation. The resulting feature matrices are encoded into structured JSON payloads and transmitted to the cloud server via MQTT over TLS. This edge-first design reduces network bandwidth requirements, minimizes transmission latency, and ensures consistent feature distributions between offline training and online deployment.
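The payload encoding step can be sketched as follows. The field names (`node`, `t0_ms`, `shape`, `features`), the function name, and the rounding precision are all hypothetical, since the payload schema is not given here; publishing over MQTT with TLS is not shown.

```python
import json


def encode_feature_payload(node_id, window_start_ms, features, decimals=4):
    """Serialise one feature matrix into a compact JSON string (hypothetical schema)."""
    payload = {
        "node": node_id,
        "t0_ms": window_start_ms,
        "shape": [len(features), len(features[0])],
        # Rounding limits the text length of each value.
        "features": [[round(v, decimals) for v in row] for row in features],
    }
    # Compact separators avoid spending bytes on whitespace.
    return json.dumps(payload, separators=(",", ":"))
```

Note that a JSON text encoding is generally larger than the packed float32 representation of the same matrix, so the on-wire size depends on the chosen precision.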
Table 1 summarizes the data volume reduction achieved at each processing stage.
6. Discussion
The proposed framework achieves substantial reductions in both communication overhead and model complexity without compromising occupancy classification accuracy. Replacing high-dimensional time-frequency spectrograms with compact statistical feature matrices eliminates the MQTT bandwidth bottleneck that renders image-based pipelines unsuitable for on-device deployment in SMOs. At the system level, the dual-core implementation resolves a hardware conflict that algorithm-focused studies tend to overlook: CSI acquisition requires a dedicated Wi-Fi channel that cannot coexist with simultaneous network communication on a single-core device. Assigning CSI preprocessing to Core 0 and network operations to Core 1 under FreeRTOS ensures deterministic timing for both tasks and supports unattended long-term operation, a fundamental requirement in SMOs without dedicated maintenance personnel. At the application level, the integration of reliable occupancy estimates with the BI-Tech behavioral intervention platform closes the sensing-to-feedback loop that prior BI-Tech deployments lacked, enabling occupancy-aware energy-saving reminders grounded in accurate multi-level counting rather than coarse presence detection.
The ablation results clarify what drives model performance in this feature space. Lightweight attention-based architectures benefit more from data-level balancing via SMOTE than from increasing model capacity or strengthening regularization. Both over-parameterization and aggressive weight decay degraded generalization on the held-out test set, indicating that model size should be matched to the dimensionality of the input feature space rather than scaled up by default.
6.1. Comparison with Alternative Sensing Modalities
To situate the proposed approach within the broader occupancy sensing landscape,
Table 4 compares the major sensing modalities against the practical requirements of SMO deployment. Vision-based systems offer high counting accuracy but introduce privacy concerns that are difficult to reconcile with occupant acceptance in shared offices, and they require structured cabling, ceiling-level mounting, and ongoing maintenance [
3]. PIR sensors avoid these concerns but are limited to binary presence detection and cannot provide the per-person granularity required for proportional HVAC control [
5]. CO2-based inference is attractive for its dual role in air quality monitoring, but diffusion dynamics in ventilated spaces introduce response delays of several minutes that make real-time count estimation unreliable [
6]. Device-assisted approaches such as badge tracking and Wi-Fi probe monitoring require active user cooperation or probabilistic MAC address inference, both of which introduce systematic biases [
4].
As shown in
Table 4, CSI-based sensing covers a gap that no single existing modality addresses: it provides multi-level count estimation, requires no body-worn devices, raises no visual privacy concerns, and operates on commodity Wi-Fi hardware that many SMOs already possess. The proposed framework inherits these properties while resolving the computational feasibility barrier that has prevented CSI methods from running on microcontroller-class edge devices. We acknowledge that the comparison in
Table 4 is qualitative; a rigorous quantitative benchmark across modalities under identical experimental conditions would require a dedicated multi-sensor deployment study that accounts for room geometry, occupant behavior patterns, and ventilation conditions simultaneously, and is identified as an important direction for future work.
6.2. Comparison with TinyML-Based Edge Occupancy Sensing Approaches
Table 5 compares three representative edge-deployed human sensing systems against the proposed approach. Armenta-Garcia et al. [
19] achieved 92.43% accuracy for Wi-Fi CSI-based HAR on the ESP32-S3 using a quantised DenseNet with a 127 kB footprint, demonstrating that CSI inference is feasible on microcontroller-class hardware. However, HAR classifies the activity type of a single subject, whereas multi-level occupancy counting requires aggregating subtle channel responses from multiple static occupants across subcarriers, which represents a distinct inferential setting. Mach et al. [
30] reported 99.38% accuracy for UWB radar-based people counting on STM32 microcontrollers, but the system relies on dedicated radar hardware rather than commodity Wi-Fi infrastructure, and the evaluation involved freely walking participants rather than the seated-office conditions studied here. Pandkar et al. [
31] combined CO2, temperature, humidity, illuminance, and PIR sensors with a Random Forest model on ESP32 devices, achieving $R^2 = 0.923$, but the 1.426 MB model footprint is substantially larger than that of the proposed Mini Transformer, and CO2-based inference introduces the diffusion delays discussed above.
To the best of our knowledge, few existing works simultaneously combine device-free operation on commodity Wi-Fi CSI, multi-level occupancy counting under realistic seated-office conditions, and microcontroller-compatible model complexity. The proposed framework is designed to satisfy all three constraints jointly.
6.3. Robustness Considerations and Limitations
The 98.86% test accuracy was obtained under controlled conditions and does not fully represent the range of environments encountered in operational SMOs. Several deployment scenarios warrant discussion. Environments with frequent entry and exit events introduce transient CSI disturbances that may affect one or two consecutive 6-second windows. Because the model is trained on steady-state occupancy patterns, these transition windows may be temporarily misclassified. However, because the BI-Tech energy management strategy operates on occupancy session timescales rather than instantaneous presence events, isolated transition-level misclassifications are unlikely to affect the reliability of session-level decisions.
Two representative use cases illustrate this design. For lighting waste detection, sustained illuminance above threshold combined with persistently low occupancy estimates across consecutive 5-min cycles during evening hours triggers a reminder delivered the following morning. For HVAC waste detection, extended system operation during unoccupied periods, such as overnight or across lunch breaks, is identified through aggregated occupancy records and translated into behavioral feedback. These scenarios operate on timescales of tens of minutes to hours, and the 5-min sensing cycle therefore provides sufficient temporal resolution for effective intervention. Prior BI-Tech field deployments [
22] have shown that session-level feedback leads to measurable reductions in unnecessary energy use through sustained behavioral adaptation. Accordingly, the present framework focuses on feedback-oriented energy management rather than instantaneous closed-loop actuation. Real-time device-level switching, such as PIR-triggered light-off control, represents a different control paradigm requiring separate architectural and comfort considerations, and is identified as a complementary direction for future integration.
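The lighting waste rule described above can be expressed as a run-length check over 5-min cycles. This is a hedged sketch only: the function name and every threshold (`lux_threshold`, `max_occupancy`, `min_consecutive`, `evening_start_hour`) are hypothetical placeholders, not values used by the BI-Tech platform.

```python
def lighting_waste_flag(cycles, lux_threshold=300.0, max_occupancy=0,
                        min_consecutive=6, evening_start_hour=18):
    """Flag sustained bright-but-empty evening periods.

    cycles: iterable of (hour, illuminance_lux, occupancy_level), one per 5-min cycle.
    """
    run = 0
    for hour, lux, occupancy in cycles:
        wasteful = (hour >= evening_start_hour
                    and lux > lux_threshold
                    and occupancy <= max_occupancy)
        # Count consecutive wasteful cycles; any other cycle resets the run.
        run = run + 1 if wasteful else 0
        if run >= min_consecutive:
            return True
    return False
```

Because the rule requires several consecutive cycles, a single misclassified window does not trigger a reminder on its own, which mirrors the session-level reasoning above.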
Static human-shaped objects such as mannequins or coats left on chairs modify the multipath background but do not produce time-varying channel fluctuations. The discriminative features used here, including per-subcarrier variance, rate of change, and zero-crossing rate, respond to temporal dynamics rather than static channel structure. Real occupants generate persistent micro-movements such as postural shifts and respiration that activate these features continuously, whereas static objects do not. The four-stage denoising pipeline further suppresses fixed background contributions to the feature distribution. The same reasoning applies to furniture relocation: repositioned objects alter the static multipath background but not its temporal fluctuations, so the system’s reliance on temporal statistics provides inherent robustness to layout changes. Substantial rearrangements involving large reflective surfaces may nevertheless require a brief recalibration to realign the feature distribution.
Moving non-human objects such as equipment trolleys or pushed chairs introduce non-stationary channel perturbations not covered by the current training data. Human body movement tends to produce distributed fluctuations spanning many subcarriers simultaneously due to the large and irregular reflecting surface of the body, while moving objects typically affect a smaller and more spatially coherent subset of subcarriers. The cross-subcarrier attention mechanism in the Mini Transformer is designed to exploit these distributional differences, but whether this cross-subcarrier mechanism provides sufficient discrimination under realistic object movement conditions remains experimentally unverified. Moving metallic objects pose a particular challenge: their high conductivity produces stronger and more spatially concentrated reflections whose instantaneous amplitude on affected subcarriers can exceed that produced by human body movement. Whether the attention mechanism can reliably separate these two effects under heavy metallic interference remains an open question and is identified as a high-priority validation target in future work.
Regarding antenna configuration, the current implementation uses omnidirectional antennas on both nodes to support spatial diversity and flexible device placement. Directional antennas could improve signal-to-noise ratio along specific propagation paths but would require precise orientation during installation and may reduce the multipath diversity that the statistical feature extraction relies on. A systematic comparison under controlled occupancy conditions is identified for future work. The current evaluation was conducted in a single office room with a fixed floor area and rectangular geometry. Room geometry directly affects multipath propagation structure: rooms with irregular shapes, open-plan layouts, or significantly different dimensions may produce CSI feature distributions that differ systematically from those observed in the training environment. While the four deployment cases in this study cover distinct Tx/Rx propagation paths and partially simulate geometric diversity at the link level, cross-room generalization has not been validated. Deploying the framework in a new room geometry would likely require recollecting a modest amount of labeled data to recalibrate the feature distribution, consistent with the periodic recalibration strategy already discussed for long-term environmental drift.
More broadly, CSI-based features are sensitive to long-term environmental drift, device variability, and room layout changes, any of which can shift the feature distribution relative to the training data without producing an obvious system failure. Periodic recalibration partially addresses this but places a maintenance burden on SMO operators. Online adaptation mechanisms, such as incremental learning on recently collected samples or environment-aware feature normalization, represent a more durable solution and are not yet supported in the current system. The multimodal data already collected by the BI-Tech node, including CO2 concentration and illuminance measurements, provides a complementary basis for cross-validating occupancy estimates and detecting anomalous divergence from physically plausible values.
This study does not include long-term field validation of the proposed occupancy-aware control strategies. Prior BI-Tech work [
22] demonstrated that behavioral reminders tied to real-time occupancy data can yield measurable energy reductions over multi-month deployments; however, those results were obtained without CSI-based occupancy counting. Integrating the human counting framework into a longitudinal BI-Tech deployment would allow the energy-saving contribution of accurate occupancy estimation to be evaluated directly, and this remains the principal next step for the research program.
7. Conclusions
This study presents a deployable edge-AI occupancy-sensing framework for SMOs, addressing the system-level constraints that have limited the practical adoption of CSI-based methods. The primary contributions include a dual-core ESP32-S3 implementation that separates CSI preprocessing from network communication, a four-stage signal denoising pipeline that produces stable statistical features under multipath conditions, and a lightweight Mini Transformer that classifies six occupancy levels from a compact feature matrix with 98.86% test accuracy. At the data level, the edge-side feature extraction pipeline reduces per-window transmission volume from 225 kB of raw CSI to 1.75 kB of statistical features, a 129× reduction relative to raw CSI transmission that eliminates the MQTT bandwidth bottleneck. Compared with the DenseNet121 baseline, replacing high-dimensional spectrogram inputs with the compact representation achieves a further reduction exceeding 5000× in per-window input data volume, alongside an approximately 58× reduction in model parameter count, while maintaining competitive classification performance.
The ablation analysis showed that data-level balancing via SMOTE and the adoption of ELU activation contribute more to generalization than increasing model capacity or strengthening regularization, reinforcing the principle that model architecture should be matched to the dimensionality of the feature space. The framework integrates with the BI-Tech behavioral intervention platform, providing the reliable occupancy estimates required to generate occupancy-aware energy-saving reminders in SMO environments.
Several directions remain for future work. Controlled validation involving non-human moving objects, metallic reflectors, and directional antenna configurations is needed to characterize deployment boundaries more precisely. Online adaptation mechanisms to handle long-term environmental drift should be incorporated into the system. Integrating the human counting module into a longitudinal BI-Tech field deployment would allow the energy-saving contribution of accurate occupancy estimation to be evaluated directly under real-world operating conditions.