Translution: A Hybrid Transformer–Convolutional Architecture with Adaptive Gating for Occupancy Detection in Smart Buildings

Chaudhari, Pratiksha; Xiao, Yang; Li, Tieshan

doi:10.3390/electronics14163323



by Pratiksha Chaudhari ¹, Yang Xiao ¹,* and Tieshan Li ²

¹ Department of Computer Science, University of Alabama, Tuscaloosa, AL 35487-0290, USA

² School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(16), 3323; https://doi.org/10.3390/electronics14163323

Submission received: 19 July 2025 / Revised: 10 August 2025 / Accepted: 18 August 2025 / Published: 21 August 2025

(This article belongs to the Special Issue Machine/Deep Learning Applications and Intelligent Systems)


Abstract

Occupancy detection is vital for improving energy efficiency, automation, and security in smart buildings. Reliable detection systems enable dynamic control of lighting, heating, ventilation, air conditioning, and security operations, leading to substantial cost savings and enhanced occupant comfort. However, accurately detecting occupancy using environmental sensor data remains challenging. Existing machine learning and deep learning models, such as Random Forests, convolutional neural networks, and recurrent neural networks, often struggle to capture both fine-grained local patterns and long-range temporal dependencies, limiting their generalization to complex, real-world occupancy patterns. To address these challenges, we propose Translution, a novel hybrid Transformer-based architecture specifically designed for occupancy detection from multivariate sensor time-series data. Translution combines multi-scale convolutional encoding to extract local temporal features, self-attention mechanisms to model long-range dependencies, and an adaptive gating mechanism that dynamically selects relevant features to improve robustness and generalization. We trained Translution on 8143 samples and evaluated it on two distinct subsets of the University of California, Irvine (UCI) Occupancy Detection Dataset: one with shorter, more consistent time spans (2804 samples) and another covering longer, more varied occupancy cycles with abrupt changes and different lighting/ventilation conditions (9752 samples). Because these subsets represent both typical and challenging real-world scenarios, evaluating on both supports Translution’s generalizability claim by showing that it detects occupancy accurately across varied temporal patterns and environmental conditions. Our results demonstrate that Translution achieves 98.5% accuracy, a 97.3% F1-score, and a 98.55% area under the receiver operating characteristic curve, significantly outperforming traditional machine learning and deep learning baselines. These findings highlight Translution’s potential as a highly accurate and stable solution for real-time occupancy detection in diverse smart building environments.

Keywords:

occupancy detection; smart buildings; IoT sensors; devices; deep learning; transformers


1. Introduction

Accurate and reliable occupancy detection has become a cornerstone in the advancement of smart building technologies, providing critical capabilities for enhancing energy efficiency, enabling sophisticated automation, and strengthening security measures [1]. Buildings account for 32% of global energy consumption and 34% of global CO₂ emissions, with significant wastage attributed to inefficient operation of heating, ventilation, and air conditioning (HVAC) systems and lighting [2]. Traditional static schedules often lead to energy use in unoccupied spaces, resulting in unnecessary costs and environmental impact [3]. Occupancy-driven management enables the dynamic adjustment of building operations based on real-time presence information, ensuring that energy is expended only when necessary. This not only reduces operational expenses but also supports global efforts toward sustainability, positioning smart buildings as vital contributors to energy conservation goals [4].

Beyond energy efficiency, occupancy detection plays a pivotal role in delivering intelligent, adaptive building automation [5]. Modern smart buildings utilize occupancy data to optimize various subsystems, including dynamic lighting control, personalized temperature settings, and air quality management [6]. Real-time occupancy insights enable predictive maintenance strategies, space utilization optimization, and emergency preparedness systems that adapt evacuation routes based on actual occupant distribution [7]. Moreover, occupancy data integrated into access control and security systems enhances building safety by detecting anomalies and unauthorized presence [8]. The rapid growth of the smart building market, driven by urbanization, rising energy costs, advancements in IoT, and regulatory emphasis on green building practices, is expected to spur significant innovation and investment in occupancy detection technologies in the coming years.

However, achieving accurate occupancy detection in complex real-world environments presents formidable challenges. Environmental signals such as temperature, humidity, CO₂ concentration, and light intensity exhibit delayed, nonlinear, and noisy relationships with human presence [9]. Variations in external weather, building design, ventilation patterns, and occupant behavior further complicate the interpretation of sensor readings [10]. Traditional approaches, such as passive infrared (PIR) sensors [11] and rule-based methods, offer simplicity and cost-effectiveness but often fail to detect stationary occupants or generate high rates of false alarms [12]. More advanced camera- and audio-based systems improve detection fidelity but introduce significant privacy risks and require substantial computational resources, hindering their scalability for widespread deployment.

Furthermore, the integration of smart building technologies introduces additional complexities in occupancy detection. Modern environments comprise heterogeneous sensors and platforms, generating vast volumes of asynchronous, multimodal data [13,14]. Merging these disparate data streams into a unified, actionable understanding of occupancy states requires sophisticated data-processing pipelines capable of handling inconsistencies, missing data, and sensor drift over time [13]. Occupancy detection models must perform reliably under variable operational conditions, such as changing lighting levels, fluctuating indoor temperatures, and evolving occupant behaviors. Simultaneously, strict privacy regulations, such as the General Data Protection Regulation (GDPR), mandate non-intrusive, privacy-respecting detection mechanisms, emphasizing the need for solutions that avoid capturing visual, audio, or personally identifiable data [12].

These realities highlight the limitations of conventional machine learning (ML) and deep learning (DL) approaches in occupancy detection. Classical ML models, including Random Forests, Support Vector Machines, and Multi-layer Perceptrons, rely heavily on engineered features and often cannot model the complex temporal relationships inherent in dynamic building environments. While deep learning architectures, such as convolutional neural networks (CNNs) [15] and recurrent neural networks (RNNs), offer improved feature extraction and sequential modeling, they typically struggle to capture long-range temporal dependencies and adapt to non-stationary patterns over time [16]. Moreover, many existing models are not designed to cope with noisy, partially missing, or degraded sensor data, reducing their robustness in practical deployments [17].

Despite advancements, a pressing need exists for occupancy detection models capable of simultaneously capturing both fine-grained local variations and long-term temporal dependencies across multiple sensor modalities. Existing hybrid CNN-RNN models, which combine different architectures, often excel at local feature extraction or sequential modeling but fail to effectively integrate both temporal scales due to inherent limitations such as vanishing gradients or restricted receptive fields [18,19]. Pure Transformer models, on the other hand, are adept at modeling long-range dependencies but sometimes struggle with encoding fine-grained local occupancy events, especially under high-frequency sensor sampling rates where spatial patterns play an essential role, and can be sensitive to noisy sensor inputs. Addressing these diverse and complex requirements simultaneously is crucial for developing reliable, scalable, and socially acceptable smart building systems that can operate effectively in real-world conditions.

To address these fundamental limitations and fill this critical research gap, we propose Translution, a novel hybrid Transformer-based architecture specifically designed for robust, privacy-preserving occupancy detection from multivariate sensor time-series data. Unlike prior hybrid or Transformer models, which exhibit one or both of the aforementioned shortcomings, Translution integrates multi-scale local feature extraction and long-range dependency modeling with an adaptive gating mechanism for enhanced robustness and generalization. This comprehensive solution is tailored to the complex temporal dynamics of occupancy patterns, achieving high accuracy while respecting privacy constraints by relying solely on non-intrusive environmental sensor data.

Concretely, Translution fuses multi-scale convolutional encoding to extract localized temporal features with self-attention mechanisms that capture long-range dependencies across environmental sensor signals. An adaptive feature gating mechanism is incorporated to dynamically prioritize informative inputs and suppress irrelevant noise, enhancing the model’s resilience to environmental variability and sensor imperfections. Our contributions are as follows:

We propose Translution, a novel hybrid deep learning architecture that integrates self-attention mechanisms for long-range temporal modeling with multi-scale convolutional encoding to capture localized and diverse environmental patterns. This design enables the model to effectively learn both transient and sustained occupancy signals from sensor data.
A dedicated preprocessing pipeline transforms raw 2D sensor readings into sequential inputs, facilitating temporal reasoning. Within the model, an adaptive gating mechanism dynamically fuses convolutional and attention-derived features, allowing Translution to prioritize salient information while suppressing noise without relying on handcrafted features.
Translution exclusively uses non-intrusive environmental sensor modalities (temperature, humidity, CO₂, and light), avoiding cameras and microphones, and ensuring compliance with modern privacy standards. Its architecture is designed for generalization across varying occupancy conditions, making it well-suited for deployment in diverse smart building contexts.

The subsequent sections of the paper are organized as follows: Section 2 provides a review of related work in the area of occupancy detection for smart buildings. Section 3 presents the proposed Translution model, detailing its architecture and methodology. Section 4 describes the experimental setup, evaluation metrics, and performance comparison between traditional machine learning, deep learning approaches, and Translution. Finally, Section 5 concludes the paper by summarizing key challenges and outlining directions for future research.

2. Related Work

Accurate occupancy detection in smart buildings has been a key research focus over the past decade, driven by the critical need to improve energy efficiency, enable intelligent automation, and ensure occupant comfort. A wide range of ML and DL approaches have been explored, each with its advantages and limitations [20]. In this section, we categorize the prior work into traditional ML methods, deep learning architectures, Transformer-based models, and privacy-aware occupancy detection frameworks.

Traditional Machine Learning Approaches: Earlier efforts in occupancy detection relied heavily on classical ML algorithms trained on environmental sensor data. Decision trees, Support Vector Machines (SVMs), and Random Forests were commonly employed due to their simplicity and interpretability. For instance, the authors of [21] presented a comprehensive review highlighting that Random Forest classifiers using temperature, humidity, CO₂, and light-intensity readings could achieve moderate success in occupancy prediction. However, these methods often required extensive manual feature engineering and struggled with dynamic and non-linear occupancy patterns, limiting their robustness in real-world smart building environments.

Deep Learning-Based Models: The limitations of traditional approaches led researchers to explore deep learning models capable of learning complex, non-linear feature representations from raw sensor streams [22]. One such effort developed a hybrid CNN-LSTM model in which convolutional layers captured spatial features and LSTM layers learned temporal occupancy patterns, significantly improving detection accuracy. Similarly, the authors of [23,24] demonstrated that deep learning models, particularly recurrent neural networks (RNNs) and gated recurrent units (GRUs), are more capable of handling sequential dependencies than shallow models. Nevertheless, deep learning models still face challenges, such as overfitting on small datasets and sensitivity to noisy sensor inputs.

Transformer-Based Architectures: With the introduction of attention mechanisms, Transformer-based models have recently gained popularity in occupancy modeling. The authors in [25] proposed a distributed Transformer network that could dynamically predict energy consumption by modeling occupancy patterns across smart building clusters. They demonstrated that self-attention layers could effectively capture long-term dependencies that traditional RNNs failed to model. Similarly, the authors in [26] introduced the DMFF (Deep Multimodal Feature Fusion) framework, combining convolutional feature extractors with Transformer encoders. Their approach successfully integrated local and global features, resulting in faster convergence and improved generalization across heterogeneous occupancy patterns. However, pure Transformer models sometimes struggle with encoding fine-grained local occupancy events, especially when operating under high-frequency sensor sampling rates, where spatial patterns play a crucial role.

Multimodal and Privacy-Preserving Occupancy Detection: Recent research has increasingly emphasized the importance of privacy-respecting occupancy detection methods. The authors of [19,27] advocated for occupancy inference using only non-intrusive environmental sensors, such as temperature, humidity, light, and CO₂ measurements, to avoid the privacy concerns associated with camera- or audio-based approaches. Their studies demonstrated that, when combined with appropriate machine learning techniques, environmental data alone can achieve competitive occupancy detection performance while adhering to modern data protection standards, such as the General Data Protection Regulation (GDPR). Additionally, the authors of [28] explored lightweight CNN-LSTM frameworks for predicting occupancy and energy consumption on embedded platforms, such as the Jetson Nano, highlighting the growing trend toward efficient, real-time, and privacy-compliant smart building solutions.

To provide an overview and structured comparison of the different occupancy detection approaches discussed, Table 1 summarizes their key characteristics across dimensions such as algorithms, sensor modalities, privacy considerations, temporal modeling scope, typical performance, and limitations. This comparative analysis helps to highlight the specific challenges addressed by Translution and contextualize its novel contributions within the existing literature. We summarize the gaps identified in the literature as follows:

Existing models tend to specialize in capturing either short-term occupancy changes or long-term occupancy patterns, but often fail or struggle to capture both types of patterns together.
Deep learning models often overfit to specific datasets, limiting robustness across different buildings and occupancy patterns.
Many occupancy detection frameworks still rely on intrusive sensing modalities (e.g., cameras, microphones), which raises privacy concerns and limits their practical applicability in real-world smart building environments.

3. Proposed Methodology—Translution

Accurate occupancy detection in smart buildings presents unique challenges due to the complex and dynamic nature of human behavior and environmental fluctuations. Traditional machine learning models (e.g., Random Forests, SVMs) operate on static feature snapshots and struggle to capture the evolving temporal patterns inherent in occupancy data [1]. Similarly, deep learning models, particularly CNNs and RNNs, are optimized for either short-term local feature extraction or sequence modeling, but often fail to effectively integrate both temporal scales simultaneously [29]. CNNs excel at capturing localized short-term variations but have limited receptive fields [30], whereas RNNs and LSTMs can model sequential dependencies but suffer from vanishing gradients and become inefficient over long time horizons [31]. To overcome these limitations, we propose Translution, a novel hybrid Transformer-based architecture explicitly designed to model the multi-scale temporal nature of environmental sensor data. Translution combines multi-scale 1D convolutions for capturing short-term local variations with an adaptive multi-head self-attention mechanism capable of modeling long-range dependencies in occupancy patterns. Furthermore, a dynamic gating mechanism is introduced to adaptively prioritize critical features while suppressing noise, ensuring robustness under varying environmental conditions. The proposed architecture is lightweight enough for practical deployment in smart buildings while offering superior generalization across diverse occupancy contexts. As shown in Figure 1, the framework consists of three major stages: Data Collection, Data Preprocessing, and Hybrid Model Architecture with Translution Blocks.

3.1. Data Collection

The experimental evaluation in this study is based on a real-world occupancy detection dataset collected from smart building rooms under varying temporal and environmental conditions [32]. Each data sample records five key environmental attributes measured at regular intervals: temperature (°C), relative humidity (%), light intensity (Lux), carbon dioxide (CO₂) concentration (ppm), and humidity ratio (dimensionless) [32]. These features are chosen because they are non-intrusive, respect occupant privacy, and have demonstrated a strong correlation with human presence in indoor settings.

The dataset comprises over 8000 samples for training and more than 12,000 samples for testing and evaluation across different occupancy scenarios. Each record is associated with a binary occupancy label, where 0 represents an unoccupied state, and 1 indicates an occupied state [32].

$$\text{Occupancy Label} \in \{0\ (\text{Unoccupied}),\ 1\ (\text{Occupied})\}$$

This setup formulates a binary classification task that aims to predict occupancy status solely based on environmental sensor readings, ensuring a privacy-preserving and scalable solution for smart building management.

3.2. Data Preprocessing

To enable effective learning, a series of preprocessing steps is applied to the dataset before model training. First, raw environmental sensor readings are standardized using z-score normalization to ensure all input features are on a comparable scale. This facilitates faster convergence and improves training stability [33]. The normalization parameters (mean and standard deviation) are computed solely from the training data and then applied consistently across all dataset partitions to prevent data leakage [34]. Following normalization, the time-series data is segmented into overlapping sequences using a sliding-window approach. Specifically, with a window size of 10 time steps, each sequence comprises 10 consecutive readings, resulting in an input shape of $(T, F)$, where $T = 10$ is the number of time steps and $F = 5$ is the number of features. This transformation produces a three-dimensional tensor, where each sample encodes short-term temporal dynamics that are essential for modeling occupancy transitions. By structuring the data in this way, the model can capture temporal dependencies rather than relying solely on static feature values [35]. Additionally, any instances with missing or corrupted records are removed to preserve the integrity of the training process. While this approach keeps our benchmark evaluation clean, we acknowledge that real-world sensor data often contains missing values. For practical deployments, an imputation strategy would be a necessary preprocessing step to handle such incomplete data sequences. Overall, this preprocessing pipeline ensures that the input data is temporally coherent, numerically stable, and semantically informative, providing a solid foundation for learning occupancy patterns [36], as shown in Figure 2.
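The normalization and windowing steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the helper names `zscore_fit` and `make_windows` are hypothetical, the data are synthetic stand-ins rather than the UCI readings, the stride is taken to be one step, and each window is assumed to carry the label of its last time step (the text does not specify the labeling rule).

```python
import numpy as np

def zscore_fit(train):
    """Per-feature mean and std, computed from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # guard against constant features
    return mean, std

def make_windows(features, labels, window=10):
    """Slice an (N, F) array into overlapping (N - window + 1, window, F) sequences.

    Assumption: stride 1, and each window takes the label of its last step.
    """
    n = features.shape[0] - window + 1
    idx = np.arange(window)[None, :] + np.arange(n)[:, None]  # (n, window)
    return features[idx], labels[window - 1:]

# Synthetic stand-in data (not the UCI dataset): 5 features per reading.
rng = np.random.default_rng(0)
train_raw = rng.normal(loc=20.0, scale=3.0, size=(100, 5))
test_raw = rng.normal(loc=21.0, scale=3.0, size=(40, 5))
train_y = rng.integers(0, 2, size=100)

mean, std = zscore_fit(train_raw)
train_norm = (train_raw - mean) / std
test_norm = (test_raw - mean) / std  # reuse training statistics to avoid leakage

X_train, y_train = make_windows(train_norm, train_y, window=10)
print(X_train.shape, y_train.shape)  # (91, 10, 5) (91,)
```

With N readings and a window of T steps, this produces N - T + 1 sequences, which is where the three-dimensional (samples, T, F) tensor comes from.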

3.3. Translution Architecture

The Translution architecture is a deep learning model specifically designed for modeling temporal sensor data sequences in occupancy detection. Unlike static classifiers, Translution processes inputs as structured time series, where each input sequence consists of 10 consecutive sensor readings, each with five environmental features. These sequences are first projected into a latent space and enriched with positional encodings to retain temporal order [37]. The projected representations are then passed through a hybrid encoder composed of two main components: a multi-scale convolutional encoder for extracting short-range temporal patterns and an adaptive Transformer-based attention mechanism for capturing long-range dependencies across the sequence [38,39]. This layered design enables the model to learn both local and global occupancy cues, which are essential for robust prediction in variable indoor environments. Figure 3 illustrates the overall architecture, from sequence input to classification output.

3.3.1. Multi-Scale Convolutional Encoder

The first major component of the Translution block is the multi-scale convolutional encoder, designed to extract local patterns in the temporal dimension of sensor sequences [40]. This module applies parallel one-dimensional convolutions with kernel sizes of 3, 5, and 7, enabling the model to capture short-term fluctuations, mid-range trends, and broader local patterns simultaneously. Specifically, for each kernel size $k$, a 1D convolution operation is applied to the input sequence as follows:

$$h_{t}^{(k)} = \mathrm{GELU}\left(W^{(k)} * X_{t - \lfloor k/2 \rfloor \,:\, t + \lfloor k/2 \rfloor} + b^{(k)}\right) \tag{1}$$

where $X$ is the input sequence, $W^{(k)}$ and $b^{(k)}$ are the learnable weights and biases for the convolutional branch with kernel size $k$, $*$ denotes the convolution operation, and GELU is the Gaussian Error Linear Unit activation function. The outputs of the three convolutional branches are concatenated:

$$C_{multi} = \left[\, h^{(3)} \,\|\, h^{(5)} \,\|\, h^{(7)} \,\right] \tag{2}$$

This merged representation is passed through a dense projection layer to unify dimensionality before being fused with the attention output. By operating at multiple temporal scales, this module ensures that the model can attend to patterns such as sudden changes in light levels, gradual CO₂ build-up, or persistent humidity shifts—all of which may indicate changes in occupancy.
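As a rough illustration of Equations (1) and (2), the following NumPy sketch runs three parallel convolution branches over a single sequence and projects the concatenated result. It is not the authors' implementation: the function names, the zero padding at the sequence boundaries, the tanh approximation of GELU, the random weights, and the choice of `d_model = 8` are all assumptions made for the example.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv1d_same(x, w, b):
    """Sliding-window 1D convolution over time with zero padding.

    x: (T, d_in), w: (k, d_in, d_out) with odd k, b: (d_out,). Returns (T, d_out).
    As in most deep learning libraries, the kernel is not flipped.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the time axis only
    rows = [np.einsum("kd,kdo->o", xp[t:t + k], w) for t in range(x.shape[0])]
    return np.stack(rows) + b

def multi_scale_encoder(x, branches, w_proj, b_proj):
    """Eq. (1) per branch, Eq. (2) concatenation, then a dense projection."""
    h = [gelu(conv1d_same(x, w, b)) for w, b in branches]
    c_multi = np.concatenate(h, axis=-1)  # (T, 3 * d_out)
    return c_multi @ w_proj + b_proj      # back to (T, d_model)

# Illustrative sizes: T = 10 steps; d_model = 8 is an assumption.
rng = np.random.default_rng(0)
T, d_model = 10, 8
x = rng.normal(size=(T, d_model))
branches = [(0.1 * rng.normal(size=(k, d_model, d_model)), np.zeros(d_model))
            for k in (3, 5, 7)]
w_proj = 0.1 * rng.normal(size=(3 * d_model, d_model))
b_proj = np.zeros(d_model)

c = multi_scale_encoder(x, branches, w_proj, b_proj)
print(c.shape)  # (10, 8)
```

Padding by ⌊k/2⌋ on each side keeps every branch at length T, so the three outputs can be concatenated along the feature axis regardless of kernel size.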

3.3.2. Adaptive Multi-Head Attention

To capture long-range temporal relationships that may span the entire input sequence, Translution employs an adaptive multi-head self-attention mechanism [41]. This module enables the model to dynamically attend to any part of the input sequence, regardless of the time distance, which is particularly important for detecting subtle and delayed occupancy signals (e.g., gradually rising CO₂ or delayed light changes).

Let the input sequence be $X \in \mathbb{R}^{T \times d_{model}}$, where $T$ is the number of time steps and $d_{model}$ is the model dimensionality. For each attention head, the input is linearly projected into queries $Q$, keys $K$, and values $V$ using learnable weight matrices $W_{Q}$, $W_{K}$, and $W_{V}$:

$$Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V} \tag{3}$$

Each head computes attention weights using the scaled dot-product attention formula, extended here with a learnable scaling vector $\gamma$ that holds one entry per head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}} \cdot \gamma\right) V \tag{4}$$

where $d_{k}$ is the dimensionality of the keys and $\gamma \in \mathbb{R}^{\text{num\_heads}}$ is a trainable parameter that adaptively adjusts the sensitivity of each head to the input scale.

The outputs from all attention heads are concatenated and passed through a final dense layer:

$$Z = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h})\, W_{O} \tag{5}$$

where $W_{O}$ is the learnable output projection matrix and $h$ is the number of attention heads. This mechanism allows Translution to simultaneously capture multiple types of temporal interactions across the sequence, such as periodic occupancy cycles or lingering effects in air quality, with each head specializing in different attention patterns.
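Equations (3) to (5) can be traced with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the published code: the function name `adaptive_mha`, the random weights, and the sizes (`d_model = 8`, two heads) are hypothetical, and the per-head projections are implemented as one shared projection that is then split into heads, which is the usual equivalent formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_mha(x, w_q, w_k, w_v, w_o, gamma):
    """x: (T, d_model); w_*: (d_model, d_model); gamma: (h,), one scale per head."""
    T, d_model = x.shape
    h = gamma.shape[0]
    d_k = d_model // h

    def split_heads(m):
        # (T, d_model) -> (h, T, d_k)
        return m.reshape(T, h, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)  # Eq. (3)
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(d_k)   # (h, T, T)
    scores = scores * gamma[:, None, None]               # per-head scaling, Eq. (4)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ w_o, weights                          # Eq. (5)

# Illustrative sizes (assumptions): T = 10, d_model = 8, h = 2 heads.
rng = np.random.default_rng(0)
T, d_model, h = 10, 8, 2
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v, w_o = (0.3 * rng.normal(size=(d_model, d_model)) for _ in range(4))
gamma = np.ones(h)  # gamma = 1 recovers standard scaled dot-product attention

z, attn = adaptive_mha(x, w_q, w_k, w_v, w_o, gamma)
print(z.shape, attn.shape)  # (10, 8) (2, 10, 10)
```

In this reading, each entry of γ acts like an inverse temperature for its head: values above 1 sharpen that head's attention distribution, values near 0 flatten it toward a uniform average over time steps.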

3.3.3. Dynamic Gating Mechanism

The dynamic gating mechanism in Translution serves as an adaptive fusion layer, balancing the contributions of convolution-based local features and attention-derived global representations [42]. Instead of statically concatenating or averaging these outputs, the gating mechanism learns to dynamically emphasize or suppress each stream, depending on the input context. Let $C \in \mathbb{R}^{T \times d_{model}}$ be the output of the convolutional path, and $A \in \mathbb{R}^{T \times d_{model}}$ be the output of the attention path. A gating vector $G \in [0, 1]^{T \times d_{model}}$ is computed using a sigmoid-activated linear transformation:

$$G = \sigma\left(W_{g} \odot X + b_{g}\right) \tag{6}$$

where $\sigma$ is the sigmoid function, $W_{g}$ and $b_{g}$ are learnable parameters, and $X$ is the input to the gating layer (typically the original or normalized input sequence). This gating vector then modulates the contributions from both streams:

$$F_{fused} = G \odot C + (1 - G) \odot A \tag{7}$$

Here, $\odot$ denotes element-wise multiplication. The resulting feature $F_{fused}$ contains selectively weighted information from both local and global perspectives, dynamically adjusted for each timestep and feature dimension. This mechanism is particularly beneficial in noisy real-world sensor environments, where certain signals (e.g., CO₂) may dominate at times, while others (e.g., light) might offer sharper cues in different contexts.
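The gating step in Equations (6) and (7) amounts to a learned, element-wise convex combination of the two streams, which the short NumPy sketch below illustrates. The function name `gated_fusion`, the random inputs, and the choice to treat the gate weight as a per-feature vector that broadcasts over time are assumptions; the code follows Equation (6) literally with an element-wise product, whereas a dense-layer reading of "linear transformation" would use a matrix product instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(x, c, a, w_g, b_g):
    """x, c, a: (T, d_model); w_g, b_g: (d_model,), broadcast over time."""
    g = sigmoid(w_g * x + b_g)       # Eq. (6): gate values in (0, 1)
    fused = g * c + (1.0 - g) * a    # Eq. (7): per-element blend of both paths
    return fused, g

# Synthetic stand-ins for the gate input, convolutional path, and attention path.
rng = np.random.default_rng(0)
T, d_model = 10, 8
x = rng.normal(size=(T, d_model))
c = rng.normal(size=(T, d_model))
a = rng.normal(size=(T, d_model))
w_g = rng.normal(size=d_model)
b_g = np.zeros(d_model)

fused, g = gated_fusion(x, c, a, w_g, b_g)
print(fused.shape, float(g.min()) > 0.0, float(g.max()) < 1.0)
```

Because every gate value lies between 0 and 1, each fused element lies between the corresponding convolutional and attention values; a gate near 1 passes the local feature through, and a gate near 0 passes the attention feature.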

3.3.4. Feedforward Network and Normalization

Each Translution block concludes with a position-wise feedforward network (FFN) designed to transform and refine the fused temporal representation at each timestep [37]. This module applies two fully connected layers with a non-linear activation function in between:

$$\mathrm{FFN}(x) = W_{2} \cdot \mathrm{GELU}(W_{1} x + b_{1}) + b_{2} \tag{8}$$

Here, $W_{1}$ and $W_{2}$ are learnable weight matrices, $b_{1}$ and $b_{2}$ are the corresponding biases, and the GELU (Gaussian Error Linear Unit) activation function is used due to its smooth, non-linear behavior, which has demonstrated strong performance in Transformer-based models.

To improve gradient flow and training stability, the Translution block employs residual connections and layer normalization around both the attention and FFN sub-layers. Specifically, the block structure is as follows: (1) apply layer normalization to the input; (2) pass it through the attention or FFN module; (3) add the original input back via a residual connection; (4) normalize again.

This yields two residual equations per block:

$$Z_{1} = \mathrm{LayerNorm}(X + \mathrm{Attention}(X)) \tag{9}$$

$$Z_{2} = \mathrm{LayerNorm}(Z_{1} + \mathrm{FFN}(Z_{1})) \tag{10}$$

These residual pathways ensure that important low-level features are preserved across layers, while layer normalization reduces internal covariate shift and accelerates convergence.

By alternating between attention, convolution, and feedforward processing, each Translution block can flexibly capture diverse types of occupancy-related patterns—ranging from rapid environmental fluctuations to prolonged spatial trends.
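The residual structure of Equations (8) to (10) can be sketched as follows. This is an illustrative sketch only: the names `ffn` and `block_tail` are hypothetical, the layer normalization omits the learnable scale and shift for brevity, the attention sub-layer is replaced by a placeholder linear map, and the hidden width `d_ff = 32` is an assumption. The code uses the row-vector convention (`x @ W`), which is the transpose of the column-vector form written in Equation (8).

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each timestep over the feature axis (no learnable scale/shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, w1, b1, w2, b2):
    """Position-wise feedforward network, Eq. (8), applied to every timestep."""
    return gelu(x @ w1 + b1) @ w2 + b2

def block_tail(x, attention_fn, ffn_params):
    z1 = layer_norm(x + attention_fn(x))        # Eq. (9)
    z2 = layer_norm(z1 + ffn(z1, *ffn_params))  # Eq. (10)
    return z2

# Illustrative sizes (assumptions): T = 10, d_model = 8, d_ff = 32.
rng = np.random.default_rng(0)
T, d_model, d_ff = 10, 8, 32
x = rng.normal(size=(T, d_model))
w_mix = 0.1 * rng.normal(size=(d_model, d_model))
attention_fn = lambda z: z @ w_mix  # placeholder for the attention sub-layer
ffn_params = (0.1 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              0.1 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model))

z2 = block_tail(x, attention_fn, ffn_params)
print(z2.shape)  # (10, 8)
```

Since the sub-layer output is added to its input before normalization, a sub-layer that outputs values near zero leaves the representation essentially unchanged, which is what lets low-level features survive across stacked blocks.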

3.3.5. Temporal Pooling and Output Classification

Once the sensor sequence has passed through all Translution blocks, it is transformed into a rich temporal representation $H \in \mathbb{R}^{T \times d_{model}}$, where $T$ is the sequence length and $d_{model}$ is the latent feature size [43]. To convert this variable-length sequence into a fixed-size vector for classification, Translution applies Global Average Pooling (GAP) over the temporal dimension:

$$h_{agg} = \frac{1}{T} \sum_{t = 1}^{T} H_{t} \tag{11}$$

This operation computes the mean representation across all timesteps, capturing the overall temporal footprint of the sequence. GAP offers a lightweight, permutation-invariant aggregation mechanism that reduces overfitting compared to recurrent flattening or attention-based pooling.

The pooled vector $h_{agg} \in \mathbb{R}^{d_{model}}$ is then passed through a final fully connected output layer with a sigmoid activation:

$$\hat{y} = \sigma\left(W_{out} \cdot h_{agg} + b_{out}\right) \tag{12}$$

where $\hat{y} \in [0, 1]$ represents the model’s predicted probability of the input sequence corresponding to an “occupied” state. This design enables Translution to perform binary classification while preserving both global and local context extracted from the environmental data stream.
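A minimal NumPy sketch of the pooling and output steps in Equations (11) and (12) is given below. The function name `classify`, the random stand-in for the block output, and the 0.5 decision threshold are assumptions for illustration; the text does not state which threshold is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(h_seq, w_out, b_out):
    """h_seq: (T, d_model) block output; returns P(occupied) as a scalar."""
    h_agg = h_seq.mean(axis=0)             # Eq. (11): average over time
    return sigmoid(h_agg @ w_out + b_out)  # Eq. (12): sigmoid output layer

# Synthetic stand-in for the output of the last Translution block.
rng = np.random.default_rng(0)
T, d_model = 10, 8
h_seq = rng.normal(size=(T, d_model))
w_out = rng.normal(size=d_model)
b_out = 0.0

p = classify(h_seq, w_out, b_out)
p_shuffled = classify(h_seq[rng.permutation(T)], w_out, b_out)
label = int(p >= 0.5)  # the 0.5 threshold is an assumption

print(bool(np.isclose(p, p_shuffled)))  # True: averaging ignores timestep order
```

The shuffled-input check makes the permutation-invariance of the pooling step concrete: any temporal ordering information must already be encoded in the per-timestep features before pooling.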

4. Experiment

We evaluated our model’s performance through a five-step process: data collection, data analysis, model training, model evaluation, and comparison with traditional machine learning and deep learning models.

4.1. Data Collection

The data used in this study is derived from the University of California, Irvine (UCI) Occupancy Detection Dataset [32], a well-known benchmark for evaluating models in indoor presence recognition using ambient sensor signals. This dataset provides real-world, timestamped readings from a typical room environment in a building, monitored through non-intrusive environmental sensors. The sensors include temperature (°C), relative humidity (%), light intensity (Lux), CO₂ concentration (ppm), and humidity ratio. Each reading is accompanied by a ground-truth occupancy label (0 for unoccupied and 1 for occupied), verified using a PIR (Passive Infrared) motion sensor [32].

To ensure robust training and generalization, the dataset is segmented into three distinct parts covering different periods and varying occupancy patterns: a training subset of 8143 samples collected under normal working conditions; an initial testing subset of 2804 samples collected during a different period for generalization testing; and a second testing subset of 9752 samples, which contains more abrupt changes and varying lighting or ventilation conditions and thus simulates challenging deployment scenarios [32].

Each row in the dataset represents a single reading captured at one-minute intervals, resulting in a rich temporal structure suitable for sequence modeling. By leveraging this structure, we could create input windows of consecutive time steps to form sequences that preserve short- and medium-term temporal dependencies in occupancy behavior. These sequences are further normalized using z-score scaling to reduce variance between features and enable more stable convergence during training. The sensors involved in the data-collection process are entirely non-intrusive, requiring no visual, audio, or biometric data, thereby ensuring full compliance with privacy-preserving standards, such as GDPR. This choice supports our objective of developing occupancy detection models that are both effective and ethically sound for real-world smart building environments.

4.2. Data Assessment

To design a robust and efficient deep learning model for occupancy detection, a thorough analysis of the dataset was carried out to gain empirical insights into the temporal behavior and variability of environmental sensor signals. This analysis guided key decisions on preprocessing strategies, sequence structuring, and model input configurations. Two central questions drove this exploration: (1) What are the distributional characteristics of each environmental feature? (2) What is the optimal sequence length for capturing occupancy patterns without introducing noise or computational overhead?

The dataset, derived from real-world smart building environments, includes five non-intrusive features (temperature, relative humidity, light intensity, CO₂ concentration, and humidity ratio) sampled at regular intervals. Statistical profiling revealed that light and CO₂ had the most distinguishable ranges across occupied and unoccupied classes. For example, average light levels during occupancy peaked at over 600 Lux, while unoccupied periods consistently remained below 50 Lux. Likewise, CO₂ values showed a gradual accumulation during occupied sessions, due to human respiration, with averages exceeding 900 ppm, while unoccupied sessions remained around 400–500 ppm. Humidity and temperature demonstrated moderate variability but also contributed meaningful temporal trends when analyzed over time sequences.

We further evaluated the sequential nature of the data by computing moving averages and temporal autocorrelations to understand how quickly signals changed. This helped in selecting a fixed sequence length of 10 time steps, which provided a balanced window to capture both immediate fluctuations and longer occupancy cycles without overwhelming the model. For example, transient changes in light were observable within 2–3 time steps, while CO₂ trends required 8–10 intervals for meaningful shifts.
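The kind of temporal autocorrelation check described above can be sketched as follows. The two signals are synthetic stand-ins (a slowly varying curve and a fast on/off signal), not the UCI sensor readings, and the function name is hypothetical.

```python
import math

def autocorrelation(x, lag):
    """Sample autocorrelation of a 1-D signal at a positive lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    denom = sum(c * c for c in centered)
    if denom == 0.0:
        return 0.0  # constant signal: autocorrelation is undefined, report 0
    num = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return num / denom

# Synthetic stand-ins (assumptions, not measured data):
slow = [math.sin(2 * math.pi * i / 100) for i in range(200)]      # slow drift
fast = [1.0 if (i // 3) % 2 == 0 else 0.0 for i in range(200)]    # toggles every 3 steps

for lag in (1, 3, 10):
    print(lag, round(autocorrelation(slow, lag), 3), round(autocorrelation(fast, lag), 3))
```

A signal whose autocorrelation stays high over many lags changes slowly and needs a longer window to show a meaningful shift, whereas a signal whose autocorrelation collapses or turns negative within a few lags can be resolved with a short window.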

The analysis also informed feature normalization. Since environmental sensors operate on different scales (e.g., CO₂ in ppm vs. humidity as a ratio), standardization was applied using z-score normalization to ensure all features contributed equally during training. Additionally, we confirmed that the dataset had no missing values and minimal class imbalance, which was addressed using class weighting during model training to avoid bias toward the dominant class.
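A minimal sketch of these two preprocessing steps is given below. It assumes that the normalization statistics are estimated on the training split only and that class weights follow the common "balanced" heuristic n_samples / (n_classes × class_count); the paper does not state the exact weighting formula, so this choice and all function names are illustrative.

```python
import statistics
from collections import Counter

def fit_zscore(rows):
    """Per-feature mean and population standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant features
    return means, stds

def apply_zscore(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy data: two features on very different scales and a 3:1 class ratio
train = [[400.0, 0.004], [450.0, 0.005], [900.0, 0.006], [1000.0, 0.005]]
labels = [0, 0, 0, 1]

means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
weights = balanced_class_weights(labels)  # the minority class receives the larger weight
```

The same `means` and `stds` would then be reused for both test subsets, so that no test-set statistics influence training.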

4.3. Evaluation Metrics

To assess model performance, several standard evaluation metrics were used, ensuring a robust evaluation of the occupancy detection models. These metrics were chosen to measure the model’s ability to correctly classify occupancy status while addressing issues like class imbalance and prediction reliability [44].

Accuracy:

Measures the overall correctness of the model’s predictions. It is useful when class distribution is balanced, but can be misleading in imbalanced datasets [44].

\begin{matrix} \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{matrix}

(13)

Accuracy was used as a general performance indicator but was supplemented with other metrics to provide a more nuanced analysis.

Precision:

Evaluates how many of the predicted positive cases are correct. High precision is crucial in applications where false positives (incorrectly predicting occupancy) need to be minimized [44].

\begin{matrix} \text{Precision} = \frac{TP}{TP + FP} \end{matrix}

(14)

This metric was used to ensure that the model does not falsely predict occupancy too frequently, which is important for energy-efficient decision-making.

Recall:

Measures how well the model detects actual positive cases. High recall is essential when missing an occupied state could have significant consequences, such as suboptimal energy use in smart buildings [44].

\begin{matrix} \text{Recall} = \frac{TP}{TP + FN} \end{matrix}

(15)

This metric was particularly important in ensuring that occupied spaces were correctly identified and not mistakenly classified as unoccupied.

F1-Score:

The harmonic mean of precision and recall balances both metrics. It is particularly useful when the dataset is imbalanced, thereby preventing reliance on accuracy alone [44].

\begin{matrix} \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{matrix}

(16)

The F1-score was chosen as a primary metric to assess the trade-off between precision and recall, ensuring a well-balanced model.

Area Under the Receiver Operating Characteristic Curve (AUROC):

Evaluates the model’s ability to distinguish between classes across different probability thresholds. A higher AUROC indicates better discrimination between occupied and unoccupied states. This metric was useful for analyzing model robustness across varying classification thresholds, allowing for better decision-making in practical applications [44].

These metrics provided a comprehensive assessment of model effectiveness across different test datasets, enabling a detailed comparison of deep-learning models for occupancy detection.
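For concreteness, Equations (13)–(16) and the AUROC can be computed as in the sketch below. The labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the AUROC is obtained through its pairwise-ranking interpretation (the probability that a randomly chosen occupied sample receives a higher score than a randomly chosen unoccupied one).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score, Eqs. (13)-(16)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (not results from the paper)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)  # TP=3, TN=3, FP=1, FN=1
area = auroc(y_true, scores)                      # 15 of 16 pairs ranked correctly
```

In this toy case all four threshold-based metrics equal 0.75 while the AUROC is 0.9375, which illustrates why a threshold-free metric is reported alongside the others.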

4.4. Optimal Sequence Length Determination

To empirically validate the 10-timestep window selected during the data assessment (Section 4.2) and identify the optimal input sequence length for Translution, we conducted a series of experiments evaluating the model’s performance across sequence lengths of 5, 10, 15, and 20 time steps. The results, summarized in Table 2 and Table 3, demonstrate the impact of sequence length on Translution’s accuracy, F1-score, and AUROC.

As evidenced by these tables, Translution consistently achieved its highest performance across all metrics (Accuracy, F1-score, and AUROC) with a sequence length of 10 for both Test Dataset 1 and Test Dataset 2. Shorter sequences (e.g., 5) likely failed to capture sufficient temporal context, resulting in reduced recall and a lower overall F1-score. Conversely, while longer sequences (15 and 20) may contain more historical data, they often introduce irrelevant noise or dilute crucial short-term patterns, leading to a decline in performance. This empirical validation confirms that a 10-timestep window optimally balances the need to capture essential temporal dynamics with computational efficiency, supporting its selection as the standard input sequence length for Translution.

4.5. Hyperparameter Settings

To ensure effective learning and consistent benchmarking, the hyperparameters used in the Translution model were carefully selected and tuned, balancing model complexity, generalization ability, and computational efficiency. These settings, summarized in Table 4, were primarily determined through a systematic process involving preliminary experiments and iterative optimization based on the performance of the validation set [45].

An initial set of hyperparameter values was chosen, drawing upon common practices for Transformer-based architectures and deep learning models in time-series analysis. Subsequently, these parameters were refined by evaluating their impact on key performance metrics (F1-score and AUROC) on a dedicated validation set, which was held out from the training data. For instance, the internal representation dimension

d_{model}

of 64 and the use of 2 stacked Translution blocks were identified as providing an optimal balance between model capacity and computational demand, through iterative testing to observe their effect on model performance and training time.

The dropout rate of 0.2 was selected to mitigate overfitting, a common challenge in deep learning, by monitoring the divergence between training and validation loss curves. This helped ensure the model’s ability to generalize effectively to unseen data. The Adam optimizer, with a learning rate of 0.001 and a batch size of 32, was chosen for its proven efficiency and stability in deep learning training, and further fine-tuned for optimal convergence behavior during preliminary runs. Global Average Pooling was employed before the final classification layer due to its lightweight and permutation-invariant aggregation properties, which also contribute to reducing overfitting.
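The values quoted in this paper can be gathered into a single configuration object, as in the sketch below. Only settings stated in the text are included; the class and field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TranslutionConfig:
    """Hyperparameter values quoted in the text; names are illustrative."""
    seq_len: int = 10            # input window length in time steps
    n_features: int = 5          # temperature, humidity, light, CO2, humidity ratio
    d_model: int = 64            # internal representation dimension
    num_blocks: int = 2          # stacked Translution blocks
    dropout: float = 0.2
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 15

config = TranslutionConfig()
print(asdict(config))
```

Freezing the dataclass keeps the settings immutable during a run, which makes it easier to log exactly which configuration produced a given result.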

4.6. Model Training

The training phase of the Translution architecture was designed to strike a balance between model complexity, learning efficiency, and generalization performance. To begin, environmental sensor data—comprising temperature, humidity, light intensity, CO₂ concentration, and humidity ratio—was preprocessed through normalization and temporal sequence construction. Specifically, the data was reshaped into overlapping sequences of 10 time steps, allowing the model to observe short-term fluctuations and emerging occupancy trends across small time windows. This resulted in a structured 3D input format of the shape (samples, 10, 5).
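The sequence construction described above can be sketched as follows. The sketch assumes a stride of one time step and assigns each window the occupancy label of its final time step; the paper specifies neither detail, so both are assumptions, as is the function name.

```python
def make_windows(rows, labels, seq_len=10):
    """Build overlapping stride-1 windows of shape (samples, seq_len, n_features).

    Each window is paired with the label of its last time step (an assumption).
    """
    n = len(rows)
    if n < seq_len:
        raise ValueError("need at least seq_len rows to form one window")
    windows = [rows[i:i + seq_len] for i in range(n - seq_len + 1)]
    targets = list(labels[seq_len - 1:])
    return windows, targets

# Toy data: 100 one-minute readings with 5 features each
rows = [[float(5 * t + j) for j in range(5)] for t in range(100)]
labels = [1 if t >= 50 else 0 for t in range(100)]

windows, targets = make_windows(rows, labels, seq_len=10)
shape = (len(windows), len(windows[0]), len(windows[0][0]))
print(shape)  # (91, 10, 5)
```

With a stride of one, n readings yield n − 10 + 1 windows, so the 8143-sample training subset would produce 8134 sequences under these assumptions.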

Rather than training on static features, the model learned from evolving temporal patterns in a way that mimics real-world occupancy fluctuations. The training leveraged the Scikit-learn ‘Pipeline’ wrapper to tightly couple preprocessing and model fitting steps, ensuring consistency during validation and future deployment. The core model was trained using the Adam optimizer with binary cross-entropy loss, suitable for binary occupancy classification. A batch size of 32 and 15 training epochs were chosen to balance convergence speed and training stability.
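The binary cross-entropy objective, optionally combined with per-class weights, can be written as in the sketch below. The clipping constant and the way the weights enter the loss are assumptions made for illustration; the paper does not describe these implementation details.

```python
import math

def binary_cross_entropy(y_true, y_prob, class_weights=None, eps=1e-7):
    """Mean binary cross-entropy, optionally weighted per class.

    class_weights maps the labels 0 and 1 to a multiplicative weight (assumed form).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        if class_weights is not None:
            loss *= class_weights[y]
        total += loss
    return total / len(y_true)

# An uninformative prediction of 0.5 costs ln(2) per sample
baseline = binary_cross_entropy([1, 0], [0.5, 0.5])

# Confident, correct predictions yield a much smaller loss
confident = binary_cross_entropy([1, 0], [0.95, 0.05])

# Up-weighting the positive class makes errors on occupied samples more costly
weighted = binary_cross_entropy([1, 0], [0.5, 0.5], class_weights={0: 1.0, 1: 3.0})
```

In a deep learning framework this loss is averaged over each mini-batch of 32 windows and minimized with Adam; the sketch only shows the quantity being minimized.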

Although early stopping was not activated in this setup, the model incorporated built-in regularization mechanisms: dropout layers within attention and convolution modules, adaptive gating for dynamic noise suppression, and batch normalization for gradient stability. These collectively mitigated overfitting and helped the model generalize well across two separate test datasets.

4.7. Comparative Analysis of Different Machine and Deep Learning Models for Occupancy Detection

To rigorously assess the efficacy of the proposed Translution architecture, we designed a comparative experimental setup using the UCI Occupancy Detection Dataset [32]. The dataset was split into three parts: a training set and two distinct test sets—one representing stable occupancy conditions and the other comprising more variable and challenging scenarios. Each sample includes five environmental sensor readings (temperature, humidity, CO₂, light, and humidity ratio) and a binary occupancy label. We trained all baseline models using the same preprocessed input sequences, ensuring consistency in evaluation. Our objective was not just to report raw accuracy but to examine how well different models handle temporal variation, noise, and generalization across changing occupancy patterns.

We included five baseline models in the experiment: Random Forest, MLP, OCSVM, CNN, and RNN [1]. These models span a range of paradigms, from traditional supervised classifiers to temporal sequence learners. Random Forest and MLP were selected for their widespread use in sensor-based classification tasks. OCSVM provided a non-supervised baseline to examine model behavior under minimal label supervision. CNNs were chosen for their ability to learn spatial features from fixed-size windows, while RNNs represented conventional sequence models capable of capturing short-term temporal dependencies.

During the evaluation, all models were assessed using five standard metrics: accuracy, precision, recall, F1-score, and AUROC. The results, shown in Table 5 and Table 6, reveal distinct patterns in performance. In Test 1, Translution achieved an accuracy of 98.1%, the highest among all models. It also led in recall (99.8%) and F1-score (97.3%), outperforming RNN (F1 = 95.7%), CNN (F1 = 84.6%), and traditional models like Random Forest (F1 = 95.1%). In Test 2, which involved more irregular occupancy behavior, Translution maintained robust performance with an accuracy of 98.9% and an equally high F1-score of 97.3%. In contrast, competing models saw notable drops: for instance, MLP dropped to an F1-score of 90.2%, and CNN fell to just 76.3%.

OCSVM consistently underperformed in both test sets, particularly in terms of precision and F1-score, confirming its limited utility in structured occupancy tasks. CNNs, though precise, missed many true positives due to their inability to model temporal dependencies. RNNs fared better but still failed to generalize across test sets. Translution’s advantage lies in its multi-scale convolutional layers, which capture short- and mid-range feature patterns, and self-attention modules, which handle long-range dependencies without recency bias. Additionally, the adaptive gating mechanism enables dynamic feature weighting at each time step, making the model robust to noise and environmental variability.

Overall, the experimental findings demonstrate that Translution is not only highly accurate but also generalizes better across diverse conditions, maintains high recall (minimizing false negatives), and handles real-world complexities more effectively than conventional ML/DL approaches. Its consistent performance across both test sets establishes it as a reliable, scalable, and privacy-compliant model for real-time occupancy detection in smart building environments.

Table 7 presents the computation time analysis for training and testing various machine learning and deep learning models, including our proposed Translution architecture. As expected, simpler models such as Random Forest and OCSVM exhibited minimal computation time—training in under 0.1 s and completing inference on both test sets in fractions of a second. These models are highly efficient, but they come with significant trade-offs in terms of detection accuracy and generalization, as demonstrated in prior evaluations. In contrast, deep learning models such as CNN and RNN required longer training times (83.61 s and 85.96 s, respectively) and showed increased inference latency, especially on the more complex Test Set 2. Translution, being the most complex architecture, required the longest training time at 222.19 s and exhibited the highest inference time: 2.22 s on Test Set 1 and 11.81 s on Test Set 2. This increase is attributed to its multi-scale convolutional layers, attention mechanisms, and dynamic gating modules, which collectively model both short-term and long-term dependencies in the input data. This trade-off between performance and computational cost is a critical consideration for real-world deployment, especially in diverse scenarios such as edge computing or large-scale smart building infrastructures.
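To put the reported inference times in perspective, the short calculation below converts the total times quoted above into an average latency per sample. It divides by the raw number of readings in each test subset, which is an approximation because the number of 10-step windows is slightly smaller; the variable and function names are illustrative.

```python
# Total inference time (seconds) and subset sizes quoted in the text
inference_seconds = {"Test Set 1": 2.22, "Test Set 2": 11.81}
num_samples = {"Test Set 1": 2804, "Test Set 2": 9752}

def per_sample_latency_ms(total_seconds, n_samples):
    """Average inference time per sample in milliseconds."""
    return 1000.0 * total_seconds / n_samples

latency_ms = {
    name: per_sample_latency_ms(inference_seconds[name], num_samples[name])
    for name in inference_seconds
}

for name, ms in latency_ms.items():
    print(f"{name}: {ms:.2f} ms per sample")
```

This gives roughly 0.79 ms and 1.21 ms per sample, respectively. These averages reflect the hardware used in the experiments and batched evaluation, so they should not be read as the latency expected on a resource-constrained device.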

For edge deployment, where a model needs to run directly on a resource-constrained device (e.g., an IoT gateway or a microcontroller), Translution’s current latency may be too high for real-time applications, and its memory footprint could be prohibitive. In such cases, the priority shifts from maximum accuracy to a balanced solution that offers high enough performance with minimal latency and power consumption. For large-scale deployments across many smart buildings, the cumulative computational demand and energy consumption for training and inference become significant. While Translution’s performance justifies its use in a central server environment, its scalability is a key challenge. The quadratic complexity of the attention mechanism and the parallel multi-scale convolutions contribute to this overhead.

4.8. Ablation Study

To validate the necessity and individual contributions of Translution’s core components, we conducted a comprehensive ablation study. We systematically evaluated four model configurations: the full Translution architecture and three ablated versions, each with one of the key components removed (dynamic gating, multi-scale convolution, and adaptive attention). All models were trained and evaluated under identical conditions on both Test Dataset 1 and Test Dataset 2. The results are summarized in Table 8.

The results in Table 8 demonstrate the critical role of each component in achieving Translution’s superior performance. When any of the three main components is removed, the model’s performance on both test sets, particularly in terms of the F1-score and AUROC, experiences a significant degradation.

Specifically, the removal of the dynamic gating mechanism resulted in a notable drop in performance, with the F1-score falling to 88.0% and 91.0% on Test 1 and Test 2, respectively. This degradation, particularly in recall, validates the importance of the gating mechanism in dynamically prioritizing relevant sensor signals and suppressing irrelevant noise, which is essential for robust detection in variable environments.

The ablation of the multi-scale convolution module led to the most substantial performance drop. The F1-score plummeted to 78.3% on Test 1 and 85.2% on Test 2. This finding empirically confirms the necessity of the convolutional component for capturing fine-grained local temporal patterns and feature interactions, which are crucial for detecting transient occupancy events. The model’s inability to learn these localized features effectively without the multi-scale convolution severely impacts its overall accuracy.

Finally, removing the adaptive attention mechanism also resulted in a significant performance reduction, with the F1-score decreasing to 90.1% and 90.3% on Test 1 and Test 2, respectively. While still performing better than the no-convolution model, this outcome underscores the vital role of the self-attention mechanism in modeling long-range temporal dependencies. Its absence limits the model’s ability to contextualize occupancy events across the entire input sequence, which is essential for detecting subtle and delayed changes in environmental data.

The ablation study provides strong empirical evidence that all three components—the multi-scale convolution, adaptive attention, and dynamic gating—are indispensable. Their synergistic integration is what enables the full Translution model to achieve state-of-the-art performance and robust generalization across diverse real-world occupancy scenarios.

While Translution demonstrates superior predictive accuracy and robustness, its advanced architecture results in significantly higher training and inference times compared to simpler baselines. This computational overhead is a critical consideration for practical deployment, particularly in resource-constrained edge devices or large-scale smart building infrastructures that require real-time responses. For such scenarios, future optimizations, such as model pruning, quantization, or knowledge distillation, will be essential to reduce model complexity and inference latency while preserving performance.

5. Conclusions

In this study, we introduced Translution, a novel hybrid deep learning model for accurate and privacy-preserving occupancy detection in smart buildings. The proposed architecture integrates multi-scale convolutional layers to capture short-term temporal patterns, complemented by Transformer-based attention blocks that effectively model long-range dependencies in environmental sensor data. Additionally, dynamic feature fusion via adaptive gating allows Translution to prioritize relevant features and suppress noisy signals, improving robustness across diverse conditions.

We evaluated Translution using the publicly available UCI Occupancy Detection Dataset, which comprises real-world sensor readings from office environments. Extensive experiments have demonstrated that Translution significantly outperforms conventional models, such as Random Forest, MLP, CNN, RNN, and One-Class SVM, across multiple performance metrics, including accuracy, F1-score, AUROC, precision, and recall. Notably, Translution achieved superior generalization on test datasets with unseen temporal conditions, validating its ability to adapt to dynamic occupancy patterns.

Beyond predictive performance, our model upholds strong privacy standards by exclusively relying on non-intrusive environmental features—such as temperature, humidity, light, and CO₂—thereby avoiding the use of audio or video data that may compromise user privacy. This design aligns with regulatory frameworks, such as GDPR, making Translution a suitable solution for real-world deployments in privacy-sensitive smart environments.

Despite its strong performance, Translution’s architecture introduces significant computational complexity, particularly due to the layered attention and convolutional components. This can limit its scalability for real-time deployment on edge or resource-constrained devices. Additionally, the fixed sequence length design may reduce flexibility across varied sensor sampling rates.

In future work, we aim to optimize Translution for real-world use by exploring techniques such as model pruning to remove redundant connections and parameters, quantization to reduce the model’s memory footprint and accelerate inference, and knowledge distillation, where a smaller model is trained to mimic the behavior of the larger Translution architecture. These methods could allow for a significant reduction in model complexity and latency while preserving a substantial portion of Translution’s high performance, making it a more viable solution for diverse deployment scenarios. We also plan to evaluate its performance on larger, more diverse datasets and extend the framework to support variable-length sequences and online adaptation to dynamic building environments.

Author Contributions

Conceptualization, Y.X.; Methodology, P.C., Y.X. and T.L.; Validation, T.L.; Formal analysis, P.C.; Investigation, P.C.; Writing—original draft, P.C.; Writing—review and editing, Y.X. and T.L.; Supervision, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Acronym	Definition
AI	Artificial Intelligence
AUROC	Area Under the Receiver Operating Characteristic Curve
CNN	Convolutional Neural Network
CO₂	Carbon Dioxide
DL	Deep Learning
DMFF	Deep Multimodal Feature Fusion
FFN	Feedforward Network
FP	False Positive
FN	False Negative
GAP	Global Average Pooling
GDPR	General Data Protection Regulation
GELU	Gaussian Error Linear Unit
GRU	Gated Recurrent Unit
HVAC	Heating, Ventilation, Air Conditioning
IoT	Internet of Things
LSTM	Long Short-Term Memory
ML	Machine Learning
MLP	Multi-layer Perceptron
OCSVM	One-Class Support Vector Machine
PIR	Passive Infrared
RNN	Recurrent Neural Network
SVM	Support Vector Machine
TP	True Positive
TN	True Negative
UCI	University of California, Irvine

References

Chaudhari, P.; Xiao, Y.; Cheng, M.M.-C.; Li, T. Fundamentals, Algorithms, and Technologies of Occupancy Detection for Smart Buildings Using IoT Sensors. Sensors 2024, 24, 2123. [Google Scholar] [CrossRef]
Global Status Report for Buildings and Construction 2024/2025. Available online: https://www.unep.org/resources/report/global-status-report-buildings-and-construction-20242025 (accessed on 5 August 2025).
Candanedo, L.M.; Feldheim, V. Accurate occupancy detection of an office room from light, temperature, humidity, and CO₂ measurements using statistical learning models. Energy Build. 2016, 112, 28–39. [Google Scholar] [CrossRef]
Erickson, V.L.; Cerpa, A.E. Occupancy based demand response HVAC control strategy. In Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys’ 10), Zurich, Switzerland, 2 November 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 7–12. [Google Scholar] [CrossRef]
Labeodan, T.; Zeiler, W.; Boxem, G.; Zhao, Y. Occupancy measurement in commercial office buildings for demand-driven control applications—A survey and detection system evaluation. Energy Build. 2015, 93, 303–314. [Google Scholar] [CrossRef]
Soleimanijavid, A.; Konstantzos, I.; Liu, X. Challenges and opportunities of occupant-centric building controls in real-world implementation: A critical review. Energy Build. 2024, 308, 113958. [Google Scholar] [CrossRef]
Wagner, A.; O’Brien, W. Virtual special issue editorial—State-of-the-art in occupant-centric building design and operation: A collection of reviews. Build. Environ. 2020, 180, 107026. [Google Scholar] [CrossRef]
Sonta, A.; Dougherty, T.R.; Jain, R.K. Data-driven optimization of building layouts for energy efficiency. Energy Build. 2021, 238, 110815. [Google Scholar] [CrossRef]
Cicero, S.; Guarascio, M.; Guerrieri, A.; Mungari, S. A Deep Anomaly Detection System for IoT-Based Smart Buildings. Sensors 2023, 23, 9331. [Google Scholar] [CrossRef]
Szczurek, A.; Maciejewska, M.; Pietrucha, T. Occupancy determination based on time series of CO₂ concentration, temperature and relative humidity. Energy Build. 2017, 147, 142–154. [Google Scholar] [CrossRef]
Peng, M.; Xiao, Y.; Li, N.; Liang, X. Monitoring Space Segmentation in Deploying Sensor Arrays. IEEE Sens. J. 2014, 14, 197–209. [Google Scholar] [CrossRef]
Abuhussain, M.A.; Alotaibi, B.S.; Dodo, Y.A.; Maghrabi, A.; Aliero, M.S. Multimodal Framework for Smart Building Occupancy Detection. Sustainability 2024, 16, 4171. [Google Scholar] [CrossRef]
Beltran, A.; Erickson, V.L.; Cerpa, A.E. ThermoSense: Occupancy Thermal Based Sensing for HVAC Control. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings (BuildSys’ 13), Roma, Italy, 11–15 November 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 1–8. [Google Scholar] [CrossRef]
Verma, U.; Tyagi, P.; Kaur, M. Artificial intelligence in human activity recognition: A review. Int. J. Sens. Netw. 2023, 41, 1–22. [Google Scholar] [CrossRef]
Ahmad, N.; Ghosh, S.; Rout, J. CNN-based hybrid deep learning framework for human activity classification. Int. J. Sens. Netw. 2024, 44, 74–83. [Google Scholar] [CrossRef]
Eckhoff, D.; Wagner, I. Privacy in the Smart City—Applications, Technologies, Challenges, and Solutions. IEEE Commun. Surv. Tutor. 2018, 20, 489–516. [Google Scholar] [CrossRef]
Wang, W.; Chen, J.; Hong, T. Occupancy prediction through machine learning and data fusion of environmental sensing and Wi-Fi sensing in buildings. Autom. Constr. 2018, 94, 233–243. [Google Scholar] [CrossRef]
Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205. [Google Scholar] [CrossRef]
Khan, I.; Zedadra, O.; Guerrieri, A.; Spezzano, G. Occupancy Prediction in IoT-Enabled Smart Buildings: Technologies, Methods, and Future Directions. Sensors 2024, 24, 3276. [Google Scholar] [CrossRef] [PubMed]
Chen, H.; Zhang, H.; Yang, Y.; He, L. A text classification network model combining machine learning and deep learning. Int. J. Sens. Netw. 2024, 44, 182–192. [Google Scholar] [CrossRef]
Sayed, A.N.; Himeur, Y.; Bensaali, F. Deep and transfer learning for building occupancy detection: A review and comparative analysis. Eng. Appl. Artif. Intell. 2022, 115, 105254. [Google Scholar] [CrossRef]
Lu, C.; Gu, J.; Lu, W. An improved attention-based deep learning approach for robust cooling load prediction under diverse occupancy schedules. Sustain. Cities Soc. 2023, 96, 104679. [Google Scholar] [CrossRef]
Hitimana, E.; Bajpai, G.; Musabe, R.; Sibomana, L.; Kayalvizhi, J. Implementation of IoT Framework with Data Analysis Using Deep Learning Methods for Occupancy Prediction in a Building. Future Internet 2021, 13, 67. [Google Scholar] [CrossRef]
Cretu, G.; Stamatescu, I.; Stamatescu, G. Modeling and prediction of occupancy in buildings based on sensor data using deep learning methods. IEEE Access 2024, 12, 102994–103003. [Google Scholar] [CrossRef]
Dai, R.; Bai, G. Distributed context-aware Transformer enables dynamic energy consumption prediction for smart building networks. Digit. Commun. Netw. 2025; in press. [Google Scholar] [CrossRef]
Sun, K. DMFF: Deep multimodal feature fusion for building occupancy detection. Build. Environ. 2024, 253, 111355. [Google Scholar] [CrossRef]
Zhang, Z. Building Occupancy Analytics Based On Deep Learning Through the Use of Environmental Sensor Data. Master’s Thesis, Virginia Tech, Arlington, VA, USA, 2023. [Google Scholar]
Gozuoglu, A.; Ozgonenel, O.; Gezegin, C. CNN-LSTM based deep learning application on Jetson Nano: Estimating electrical energy consumption for future smart homes. Internet Things J. 2024, 26, 101148. [Google Scholar] [CrossRef]
Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges, and Opportunities. ACM Comput. Surv. 2021, 54, 1–40. [Google Scholar] [CrossRef]
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef]
Candanedo, L. Occupancy Detection [Dataset]. UCI Machine Learning Repository. 2016. Available online: https://archive.ics.uci.edu/dataset/357/occupancy+detection (accessed on 2 February 2025).
Yun, J. ZNorm: Z-Score Gradient Normalization Accelerating Skip-Connected Network Training without Architectural Modification. arXiv 2024, arXiv:2408.01215. [Google Scholar]
What is Data Leakage in Machine Learning? Available online: https://www.ibm.com/think/topics/data-leakage-machine-learning (accessed on 15 March 2025).
Balderas, L.; Lastra, M.; Benítez, J.M. An Efficient Green AI Approach to Time Series Forecasting Based on Deep Learning. Big Data Cogn. Comput. 2024, 8, 120. [Google Scholar] [CrossRef]
Feng, C.; Mehmani, A.; Zhang, J. Deep Learning-Based Real-Time Building Occupancy Detection Using AMI Data. IEEE Trans. Smart Grid 2020, 11, 4490–4501. [Google Scholar] [CrossRef]
Vasani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
Ismail Fawaz, H.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.A.; Petitjean, F. InceptionTime: Finding AlexNet for time series classification. Data Min. Knowl. Disc. 2020, 34, 1936–1962.
Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 2021, 379, 20200209.
Cui, Z.; Chen, W. Multi-Scale Convolutional Neural Networks for Time Series Classification. arXiv 2016, arXiv:1603.06995.
Gao, C.; Zhang, N.; Li, Y.; Bian, F.; Wan, H. Multi-scale adaptive attention-based time-variant neural networks for multi-step time series forecasting. Appl. Intell. 2023, 53, 28974–28993.
Li, F.; Li, G.; He, X.; Cheng, J. Dynamic Dual Gating Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 5310–5319.
Vaňuš, J.; Martinek, R.; Danys, L.; Nedoma, J.; Bilik, P. Occupancy Detection in Smart Home Space Using Interoperable Building Automation Technologies. Hum.-Centric Comput. Inf. Sci. 2022, 12, 616–632.
Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
Franceschi, L.; Donini, M.; Perrone, V.; Klein, A.; Archambeau, C.; Seeger, M.; Pontil, M.; Frasconi, P. Hyperparameter Optimization in Machine Learning. arXiv 2024, arXiv:2410.22854.

Figure 1. Proposed methodology framework.

Figure 2. Data-preprocessing flow diagram.

Figure 3. Translution architecture.

Table 1. Comparative analysis of occupancy detection models.

Model Type	Algorithms	Sensor Modality	Privacy	Key Performance	Limitations
Traditional ML	Random Forest, SVM, MLP	Environmental (Temp, Humid, CO₂, Light)	Low (non-intrusive sensors)	Moderate Accuracy (e.g., 90–95%)	Simple and cost-effective, but less robust for real-world dynamic patterns.
Deep Learning (RNNs)	LSTM, GRU	Environmental, Wi-Fi, Camera	Low to Moderate (privacy-invasive sensors)	Good Accuracy (e.g., 90–96%)	Face challenges like overfitting on small datasets; sensitivity to noisy sensor inputs.
Deep Learning (CNNs)	CNN	Environmental, Wi-Fi, Camera	Low to Moderate (privacy-invasive sensors)	Good Accuracy (e.g., 80–90%)	Struggle to capture long-range temporal dependencies; limited generalization to complex patterns.
Hybrid (CNN-RNN)	CNN-LSTM	Environmental, Wi-Fi, Camera	Low to Moderate (privacy-invasive sensors)	High Accuracy (e.g., 90–97%)	Improved feature extraction and sequential modeling over single architectures. Can still face generalization issues.
Pure Transformer-Based	Distributed Transformer, DMFF	Environmental, Multi-modal	Low to Moderate (privacy-invasive sensors)	High Accuracy (e.g., 95–98%)	Can achieve faster convergence and better generalization. Struggle with fine-grained local events.

Table 2. Translution performance on Test Set 1 across different sequence lengths.

Sequence Length	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	AUROC (%)
5	91.3	96.4	79.1	86.9	88.7
10	98.1	95.2	99.8	97.3	98.4
15	91.5	94.1	81.6	87.4	89.3
20	88.6	93.7	73.9	82.7	85.6

Table 3. Translution performance on Test Set 2 across different sequence lengths.

Sequence Length	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	AUROC (%)
5	96.9	91.9	93.3	92.6	95.5
10	98.9	96.2	98.5	97.3	98.7
15	93.5	96.5	71.8	82.3	85.5
20	94.6	98	75.6	85.4	87.6
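
Tables 2 and 3 vary the length of the input window. As a minimal sketch of how a time-ordered sensor log could be cut into fixed-length sequences, the pure-Python helper below builds overlapping windows of T consecutive readings. The function name `make_windows`, the stride of 1, and the choice to label each window with the occupancy value of its last reading are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List, Sequence, Tuple


def make_windows(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    seq_len: int = 10,
) -> Tuple[List[List[List[float]]], List[int]]:
    """Cut a time-ordered sensor log into overlapping windows.

    Assumptions (hypothetical, for illustration only): stride 1, and each
    window takes the label of its final time step.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")

    windows: List[List[List[float]]] = []
    window_labels: List[int] = []
    # `end` is the exclusive end index of each window.
    for end in range(seq_len, len(rows) + 1):
        windows.append([list(r) for r in rows[end - seq_len:end]])
        window_labels.append(labels[end - 1])
    return windows, window_labels


# Toy log: 12 time steps with 5 features each (placeholder values).
toy_rows = [[float(t)] * 5 for t in range(12)]
toy_labels = [0] * 6 + [1] * 6
X, y = make_windows(toy_rows, toy_labels, seq_len=10)
print(len(X), len(X[0]), len(X[0][0]))
```

Under these assumptions a log of N readings yields N - T + 1 windows, so longer sequences give the model more context per sample but slightly fewer samples.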

Table 4. Hyperparameter tuning.

Category	Parameter	Value/Setting
Model Input	Sequence Length (T)	10
Model Input	Number of Features (F)	5
Architecture	Model Dimension ($d_{model}$)	64
Architecture	Translution Blocks	2
Architecture	Attention Heads	4
Architecture	Kernel Sizes	[3, 5, 7]
Architecture	Convolution Filters	64
Architecture	Feedforward Units ($d_{ff}$)	128
Architecture	Dropout Rate	0.2
Architecture	Pooling Strategy	Global Average Pooling
Training	Epochs	100
Training	Batch Size	32
Training	Learning Rate	0.001
Training	Optimizer	Adam
Training	Loss Function	Binary Cross Entropy
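
For readers who want to reproduce the setup, the settings in Table 4 can be collected in a single configuration object. The sketch below is one possible way to do so. The class name `TranslutionConfig` and its field names are hypothetical, and the `head_dim` property assumes the conventional multi-head attention layout in which the model dimension is split evenly across heads, which the table does not state explicitly.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TranslutionConfig:
    """Values copied from Table 4; the class itself is a hypothetical helper."""

    seq_len: int = 10            # Sequence Length (T)
    num_features: int = 5        # Number of Features (F)
    d_model: int = 64            # Model Dimension
    num_blocks: int = 2          # Translution Blocks
    num_heads: int = 4           # Attention Heads
    kernel_sizes: Tuple[int, ...] = (3, 5, 7)
    conv_filters: int = 64
    d_ff: int = 128              # Feedforward Units
    dropout: float = 0.2
    epochs: int = 100
    batch_size: int = 32
    learning_rate: float = 1e-3  # used with the Adam optimizer

    def __post_init__(self) -> None:
        # Assumption: d_model is split evenly across attention heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def head_dim(self) -> int:
        """Per-head width under the even-split assumption."""
        return self.d_model // self.num_heads

    @property
    def input_shape(self) -> Tuple[int, int]:
        """Shape of one input sample as (time steps, features)."""
        return (self.seq_len, self.num_features)


cfg = TranslutionConfig()
print(cfg.input_shape, cfg.head_dim)
```

Keeping the values in one frozen object makes it harder for an ablation run to change a setting silently.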

Table 5. Comparison of the machine- and deep learning models on Test Dataset 1.

ML/DL Models	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	AUROC (%)
Random Forest	97.9	94.7	98.7	95.1	97
MLP	96.4	94.5	98.9	96.1	96
OCSVM	93.9	86.3	99	92.2	95
CNN	90	96.7	75.2	84.6	87
RNN	96.9	96.3	95.2	95.7	97
Translution (our model)	98.1	95.2	99.8	97.3	98.4

Table 6. Comparison of the machine- and deep learning models on Test Dataset 2.

ML/DL Models	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	AUROC (%)
Random Forest	97.3	95.4	97.4	96.4	97
MLP	95.4	82.4	99.6	90.2	97
OCSVM	84.4	57.4	99.8	72.8	90
CNN	89.3	71.6	81.6	76.3	87
RNN	91.2	74	89.8	81.2	91
Translution (our model)	98.9	96.2	98.5	97.3	98.7
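
The F1-score is the harmonic mean of precision and recall, so the F1 column can be cross-checked against the other two. The sketch below recomputes it for the rows of Table 6. It works from the rounded precision and recall printed in the table, so small deviations from the reported figures are to be expected. The helper name `f1_score` and the dictionary layout are illustrative choices.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Table 6.
table6 = {
    "Random Forest": (95.4, 97.4, 96.4),
    "MLP": (82.4, 99.6, 90.2),
    "OCSVM": (57.4, 99.8, 72.8),
    "CNN": (71.6, 81.6, 76.3),
    "RNN": (74.0, 89.8, 81.2),
    "Translution": (96.2, 98.5, 97.3),
}

for model, (p, r, reported) in table6.items():
    recomputed = f1_score(p, r)
    print(f"{model:<14} recomputed={recomputed:.2f} reported={reported:.1f}")
```

The check also shows why OCSVM scores poorly on F1 despite near-perfect recall: the harmonic mean is pulled toward the lower of the two inputs, here its 57.4% precision.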

Table 7. Computation time of machine learning models and Translution model for training and testing.

Model	Train Time	Test 1 Time	Test 2 Time
Random Forest	0.06 s	0.01 s	0.03 s
MLP	12.06 s	0.25 s	0.68 s
OCSVM	0.04 s	0.02 s	0.05 s
CNN	83.61 s	0.41 s	1.33 s
RNN	85.96 s	1.16 s	1.73 s
Translution	222.19 s	2.22 s	11.81 s

Table 8. Ablation study.

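Model Configuration	Metric	Test 1 (%)	Test 2 (%)
Translution	Accuracy	98.1	98.9
Translution	Precision	95.2	96.2
Translution	Recall	99.8	98.5
Translution	F1-Score	97.3	97.3
Translution	AUROC	98.4	98.7
No Dynamic Gating	Accuracy	92.0	96.6
No Dynamic Gating	Precision	96.7	98.2
No Dynamic Gating	Recall	80.7	84.8
No Dynamic Gating	F1-Score	88.0	91.0
No Dynamic Gating	AUROC	89.6	92.2
No Multi-scale Convolution	Accuracy	86.7	94.2
No Multi-scale Convolution	Precision	95.8	91.9
No Multi-scale Convolution	Recall	66.2	79.3
No Multi-scale Convolution	F1-Score	78.3	85.2
No Multi-scale Convolution	AUROC	82.3	88.7
No Adaptive Attention	Accuracy	93.2	96.1
No Adaptive Attention	Precision	94.73	94.82
No Adaptive Attention	Recall	85.8	86.1
No Adaptive Attention	F1-Score	90.1	90.3
No Adaptive Attention	AUROC	91.6	92.4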
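
One way to read the ablation results is to express each variant as a change in F1-score relative to the full model. The short sketch below does this arithmetic for the F1 rows of Table 8. The variable names are illustrative, and the differences are in percentage points.

```python
# F1-scores (%) copied from Table 8 as (Test 1, Test 2).
FULL_MODEL = "Translution"
f1_by_config = {
    "Translution": (97.3, 97.3),
    "No Dynamic Gating": (88.0, 91.0),
    "No Multi-scale Convolution": (78.3, 85.2),
    "No Adaptive Attention": (90.1, 90.3),
}

full_t1, full_t2 = f1_by_config[FULL_MODEL]

# Change in F1 (percentage points) when one component is removed.
f1_delta = {
    name: (round(t1 - full_t1, 1), round(t2 - full_t2, 1))
    for name, (t1, t2) in f1_by_config.items()
    if name != FULL_MODEL
}

# Rank variants by the size of the drop on Test 1 (most harmful first).
ranked = sorted(f1_delta, key=lambda name: f1_delta[name][0])

for name in ranked:
    d1, d2 = f1_delta[name]
    print(f"{name:<28} Test 1: {d1:+.1f}  Test 2: {d2:+.1f}")
```

On these figures, removing the multi-scale convolution costs the most F1 on both test sets, followed by dynamic gating and then adaptive attention.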

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

