Applied Sciences
  • Article
  • Open Access

16 November 2025

Multi-Property Infrared Sensor Array for Intelligent Human Tracking in Privacy-Preserving Ambient Assisted Living

Department of System Design, Tokyo Metropolitan University, Tokyo 192-0072, Japan
* Author to whom correspondence should be addressed.
This article belongs to the Section Applied Physics General

Abstract

This paper deals with a privacy-preserving human tracking system that uses multi-property infrared sensor arrays. In the growing field of intelligent elderly care, there is a critical need for monitoring systems that ensure safety without compromising personal privacy. While traditional camera-based systems offer detailed activity recognition, privacy-related concerns often limit their practical application and user acceptance. Consequently, approaches that protect privacy at the sensor level have gained increasing attention. The privacy-preserving human tracking system proposed in this paper protects privacy at the sensor level by fusing data from an ultra-low-resolution 8 × 8 (64-pixel) passive thermal infrared (IR) sensor array and a similarly low-resolution 8 × 8 active Time-of-Flight (ToF) sensor. The thermal sensor identifies human presence based on heat signature, while the ToF sensor provides a depth map of the environment. By integrating these complementary modalities through a convolutional neural network (CNN) enhanced with a cross-attention mechanism, our system achieves real-time three-dimensional human tracking. Compared to previous methods using ultra-low-resolution IR sensors, which mostly only obtained two-dimensional coordinates, the acquisition of the Z coordinate enables the system to analyze changes in a person’s vertical position. This allows for the detection and differentiation of critical events such as falls, sitting, and lying down, which are ambiguous to 2D systems. With a demonstrated mean absolute error (MAE) of 0.172 m in indoor tracking, our system provides the data required for privacy-preserving Ambient Assisted Living (AAL) applications.

1. Introduction

The advent of intelligent assistive systems has precipitated a paradigm shift in the realm of daily living support [], particularly among ageing demographics. Contemporary systems are progressively reliant on human tracking and activity recognition to facilitate critical applications such as fall detection, routine monitoring and emergency response [,]. However, smart technologies are rejected by some users due to privacy issues [,,]; a recent survey of eldercare technology reveals that 28% of seniors express reservations about using smart eldercare solutions due to privacy concerns []. This is especially true for cameras commonly used in AAL technology. According to a survey [], of the 50 older adults who participated, all 34 who had not installed surveillance systems expressed concerns about privacy. Among them, 29 were concerned about visual privacy, and 21 were concerned about behavioral privacy.
Current vision-centric approaches, while effective in activity recognition [,], inherently compromise user privacy through high-resolution visual data acquisition. A fundamental dichotomy between functionality and privacy engenders an adoption paradox: the very demographics most in need of assistive technologies—particularly privacy-conscious seniors—often exhibit the greatest reluctance to adopt them. Researchers have therefore sought alternative sensing modalities that reduce these privacy risks, such as thermal sensors or depth sensors. These sensors do not capture RGB feature data, but at high resolution they can still record facial details and recognizable body contours. Using thermal or depth sensors with ultra-low spatial resolution (such as 8 × 8 pixels) therefore becomes an option, since at that resolution the raw data can no longer reveal a person’s identifying physiological characteristics.
Previous methods employing ultra-low-resolution sensors, while offering privacy protection, mostly achieve human tracking only in a 2D plane (ground coordinates) or extract even less information []. For example, study [] obtained only the XY coordinates, and study [] could not obtain coordinates at all, recognizing only specific human body states. This lack of information about height changes makes it difficult for such systems to recognize actions involving height shifts, such as falling, sitting, or lying down. To address this issue, this paper proposes a multi-property infrared sensor approach, which is not a single device but a system that synergistically fuses data from two distinct sensors operating in the infrared (IR) spectrum. The aim of this paper is to develop and validate a privacy-preserving human tracking system capable of acquiring three-dimensional (3D) coordinates in real time. By fusing data from ultra-low-resolution thermal and Time-of-Flight (ToF) sensors, the system is designed to overcome the limitations of 2D tracking, enabling the detection of activities with vertical changes, such as falling, sitting, and lying down.
A Time-of-Flight (ToF) sensor operates as an active system, calculating distance by emitting a modulated light signal and measuring the time delay or phase shift in the light as it reflects off an object. In contrast, a passive infrared (IR) thermal sensor functions by detecting the long-wave infrared radiation naturally emitted by objects, generating a thermal map where signal intensity directly corresponds to the object’s surface temperature. The schematic diagram is shown in Figure 1.
Figure 1. Schematic diagram of Time-of-Flight (ToF) sensor and passive infrared (IR) thermal sensor.
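For reference, a ToF sensor’s distance estimate follows the standard round-trip relation (a textbook formulation rather than an equation stated in this paper):
$d = \frac{c\,\Delta t}{2}$
where $c$ is the speed of light and $\Delta t$ is the measured round-trip delay of the emitted light (or the equivalent delay recovered from the phase shift of the modulated signal).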
IR sensors can detect human bodies within their perception range, but cannot directly measure distance. In contrast, ToF sensors can detect the distance of objects, but cannot determine whether the detected object is a human body. Therefore, this paper proposes an intelligent system that integrates the above two sensors to achieve human body tracking while protecting privacy. Our approach leverages two ultra-low-resolution sensing modalities:
Infrared (IR) sensors (8 × 8 grid) capture a coarse thermal distribution, which eliminates identifiable visual features while retaining motion-related heat signatures. Alongside them, Time-of-Flight (ToF) sensors (8 × 8 grid) provide low-dimensional matrices that encode depth.
The two sensors have different fields of view, and we propose a cross-attention mechanism to map the human body identified by the IR sensor to the data of the ToF sensor to enhance the performance of the convolutional neural network.
A key aspect of our approach is the fusion of complementary data streams through a dedicated neural architecture. The utilization of IR data facilitates the direction of attention towards human-occupied regions, whilst ToF measurements provide depth for human tracking.
The main contributions of this paper are as follows:
  • Achieving privacy protection at the sensor level.
  • A multi-property sensor fusion architecture that integrates 8 × 8 IR and 8 × 8 ToF sensors for privacy-preserving 3D human tracking.
  • A CNN architecture featuring a cross-sensor attention mechanism, designed for real-time inference on low-cost, non-GPU devices.
  • Obtaining the XYZ three-dimensional coordinates of the human body in an indoor environment.
The remainder of this paper is organized as follows: Section 2 reviews related work in privacy-preserving monitoring and sensor-based tracking. Section 3 details our proposed method, including the system architecture, sensor fusion strategy, and network design. Section 4 presents the experimental setup and evaluation results. Section 5 discusses the findings, limitations, and future work. Finally, Section 6 concludes the paper.

3. Method

3.1. Proposed System Architecture

The proposed system operates in two distinct phases to achieve a balance between functionality and privacy. In the training phase, synchronized 8 × 8 IR thermal grids and ToF depth matrices are collected. As a supervised learning process, a dual-stream CNN is trained to map the IR/ToF inputs to 3D coordinates. The supervisory signal, which serves as the ground truth, is obtained using an RGB-D camera (ZED2, manufactured by Stereolabs, San Francisco, USA) and its accompanying human tracking algorithm. The RGB-D camera detects the coordinates of the person’s nose, which serve as the supervisory signal.
In the deployment phase, the RGB-D camera is physically removed to facilitate privacy-preserving inference, with the trained model processing only IR/ToF data to output 3D coordinates, thereby ensuring that no visual data is acquired. Data minimization is achieved by processing the raw 8 × 8 sensor matrices locally and discarding them immediately after coordinate extraction, in accordance with the GDPR’s “privacy by default” principle.
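To make this deployment-phase data flow concrete, the following is a minimal Python sketch. It assumes the trained network has been exported with torch.jit.script and that driver helpers returning raw 8 × 8 frames exist; the names read_ir_frame, read_tof_frame, and attention_cnn_scripted.pt are hypothetical placeholders, not part of the released implementation.

```python
# Deployment-phase sketch: raw 8x8 IR/ToF frames are processed locally and
# discarded as soon as the 3D coordinate has been extracted.
import torch

# Trained, scripted model (hypothetical file name); no RGB-D camera is involved.
model = torch.jit.load("attention_cnn_scripted.pt", map_location="cpu")
model.eval()

def infer_once(read_ir_frame, read_tof_frame):
    """Return one (x, y, z) estimate; the raw matrices never leave this function."""
    ir = torch.as_tensor(read_ir_frame(), dtype=torch.float32).reshape(1, 1, 8, 8)
    tof = torch.as_tensor(read_tof_frame(), dtype=torch.float32).reshape(1, 1, 8, 8)
    with torch.no_grad():
        xyz = model(ir, tof).squeeze(0).tolist()   # [x, y, z] in metres
    del ir, tof                                    # raw frames are discarded immediately
    return xyz
```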
Additionally, to reduce the system’s cost, the computations must be deployable on edge devices with low computational power. Therefore, the model should be simplified as much as possible to create a lightweight model. A lightweight model is one with a low parameter count and minimal computational complexity, ensuring fast inference speeds and a small memory footprint on such resource-constrained devices.
The architecture of this system is characterized by the following features: (1) bi-level isolation: complete decoupling of the privacy-sensitive RGB-D hardware after training; (2) lightweight model: designed for real-time inference on low-cost devices; (3) multi-sensor combination: IR sensors are combined with ToF sensors to improve the accuracy of the results. This system fundamentally eliminates the privacy risks inherent in traditional vision systems, and the low privacy risk can be expected to improve the acceptance of intelligent systems by older adults. The system overview is shown in Figure 2.
Figure 2. System overview.

3.2. Sensor Configuration

As discussed above, we used both a passive IR sensor and a ToF sensor, each with a resolution of 8 × 8. The AMG8833 is a widely used 8 × 8 IR sensor, and its main performance indicators are shown in Table 3. The ToF sensor used in this paper is the VL53L5X; its main performance indicators are shown in Table 4. Although both sensors have an 8 × 8 resolution, their fields of view differ, making a direct pixel-to-pixel alignment impossible. Therefore, we used machine learning to map the pixels belonging to the human figure in the IR data to the corresponding ToF measurements to obtain depth (distance) data. Through data fusion, we used the IR sensor’s thermal features (semantic information) to locate the human body’s 2D position within the 8 × 8 grid, and then used the depth data (spatial information) provided by the ToF sensor in the corresponding region to acquire the Z-axis coordinate. This “semantic–spatial” complementarity is necessary because, without this fusion, the system could either only find a 2D hotspot with no depth (IR only) or measure a 3D point cloud without the semantic context needed to differentiate human subjects from inanimate objects (ToF only).
Table 3. AMG8833 IR sensor performance specifications.
Table 4. VL53L5X ToF sensor performance specifications.
We designed a modular device that connects a ToF (Time-of-Flight) sensor and an IR (infrared) sensor to a single controller, ensuring the synchronous acquisition of data from both. The acquired data was then transmitted via Wi-Fi to an edge computing node for computation. The actual hardware composition is shown in Figure 3.
Figure 3. Hardware design of multi-property infrared sensor array.

3.3. Data Preprocessing

For the AMG8833 sensor, the highest output is about 30 °C at 0.02 m for a person with a standard body temperature (about 36.5 °C); the output temperature decreases with the distance between the person and the sensor, falling to a highest output of about 23 °C at 2 m. Within 2 m, the thermal contrast provided by the IR sensor is sufficient to segment the human body from the ambient background; at greater distances, the IR output can no longer distinguish the human body from the environment. The maximum measurement range of this system was therefore set at 2 m × 2 m. The relationship between distance and sensor output is shown in Figure 4.
Figure 4. (Above): AMG8833 output at 2 m. (Below): AMG8833 output at 2.5 m. As the distance increases, it becomes increasingly difficult to distinguish the person from the environment in the output.
For the Z coordinate of a person (the distance between the person and the sensor), the ToF sensor provides good output. However, ToF sensors, especially those with a resolution as low as the 8 × 8 used here, struggle to identify which part of the output belongs to the person. IR sensors can differentiate the person from the environment by temperature, but since the temperature output resolution of the AMG8833 is only 0.25 °C, estimating the distance between the person and the sensor from temperature changes alone is unreliable. Therefore, this paper uses both sensors: the IR sensor to find the human body and the ToF sensor to measure the distance.
The necessity of sensor fusion becomes evident when considering the inherent ambiguities of single-sensor systems in a cluttered indoor environment. Consider a common scenario, illustrated in Figure 5: a person with a stroller stands near a cabinet, upon which rests a cup of hot coffee.
Figure 5. Schematic diagram of the sensor results when a person with a stroller is standing near a cabinet that holds a cup of hot coffee. The color differences illustrate that the sensors perceive different data types: the IR sensor is primarily sensitive to temperature, while the TOF sensor is primarily sensitive to distance.
In this case, a Time-of-Flight (ToF) sensor, which is semantically blind, fails to differentiate the person from the adjacent stroller and cabinet. It will simply return an undifferentiated depth map of a single, complex “blob,” making it impossible to identify the human target within the cluster.
Conversely, an IR thermal sensor, while adept at segmenting heat signatures, faces two critical limitations: First, it is incapable of providing the essential distance (Z-axis) data that the ToF sensor captures. Second, and more fundamentally, it suffers from an unavoidable susceptibility to thermal interference. The sensor cannot semantically distinguish the target human from other thermal confounders; it will register both the person and the hot coffee as valid “hot spots,” creating critical ambiguity.
This vulnerability to thermal noise from common household items (e.g., hot beverages, ovens or radiant heaters) is an inherent challenge for all low-resolution IR systems, introducing a high risk of false positives. Therefore, to manage this well-known limitation and first establish a robust technical baseline, this study defines its operational scope. We focused on a primary use case for assistive technology: the indoor support of an elderly individual living alone.
Initial filtering of the IR sensor: In the context of sensor data interpretation, values below 23 °C are likely to represent ambient temperature, while values above 30 °C may correspond to other heat-emitting objects in daily life, such as a hot cup of coffee. To address this, a filter is applied, retaining a temperature range between 22 °C (as the lower limit) and 31 °C (as the upper limit) for initial filtering.
For each frame, the temperature data T from the sensor is processed as follows:
Since body temperature remains relatively stable, the filtered data is further processed by taking the highest temperature ($T_{\max}$) after the initial filtering and using $T_{\max} - 2$ °C as the lower threshold for final filtering. The data is processed as in Equation (1).
$T' = \begin{cases} T, & T \in [T_{\max} - 2,\ T_{\max}]\ \text{and}\ T_{\max} \in [22, 30] \\ 0, & \text{otherwise} \end{cases}$
where $T'$ is the processed temperature value for a pixel and $T_{\max}$ is the maximum temperature detected in the current sensor frame, constrained to the range [22, 30] °C. As mentioned above, the human body target falls within this temperature range.
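The two-stage filter can be sketched in a few lines of Python; this is an illustrative reimplementation of Equation (1) under the stated thresholds, not the authors’ released code, and the function name and default arguments are our own.

```python
# Sketch of the two-stage IR preprocessing in Equation (1): an initial window of
# 22-31 degC, then keep only pixels within 2 degC of the frame maximum, provided
# that maximum itself lies in the plausible human range [22, 30] degC.
import numpy as np

def filter_ir_frame(T, lower=22.0, upper=31.0, tmax_hi=30.0, band=2.0):
    """T: raw 8x8 temperature frame in degC. Returns the processed frame T'."""
    T = np.asarray(T, dtype=float)
    initial = np.where((T >= lower) & (T <= upper), T, 0.0)      # initial filtering
    t_max = initial.max()
    if not (lower <= t_max <= tmax_hi):                          # no plausible human pixel
        return np.zeros_like(T)
    return np.where((T >= t_max - band) & (T <= t_max), T, 0.0)  # final filtering
```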
The outputs of the two types of sensors and the corresponding RGB image are shown in Figure 6. It can be observed that virtually no privacy-sensitive information can be discerned from the two low-resolution sensor outputs alone; privacy is thus protected at the sensor level.
Figure 6. (A) Output of IR sensor. The temperature data corresponding to the human body is shown in yellow. (B) Output of ToF sensor. The distance data corresponding to the human body is shown in dark blue. (C) RGB image.
The final detection space is shown in Figure 7. The RGB-D camera detects the position of the body (the coordinates of the person’s nose) as teacher data. The outputs of the IR sensor and ToF sensor are recorded at a frame rate of 10 fps. A test subject walks in the indoor environment while the sensor system records the aforementioned data, which is compiled into a dataset for training the CNN. In the application phase, the RGB-D camera is removed and only the IR sensor and ToF sensor are used. Compared to previous methods that could only obtain 2D coordinates and were restricted to ceiling-mounted sensors [,], one of the advantages of the 3D system proposed in this paper is the ability to place the sensor arbitrarily. Therefore, in this work, we positioned the sensor on a wall, meaning the depth direction (Z-axis) is parallel to the floor.
Figure 7. Detection space. The camera is used only for training.

3.4. Proposed Network Architecture

In [], a low-resolution 16-pixel thermopile sensor array was used to obtain the 2D coordinates of a person in a 3 × 3 space. The authors tested several NN architectures, including multilayer perceptron, autoregressive, 1-D convolutional NN (1D-CNN), and long short-term memory. The 1D-CNN provided the best compromise between localization accuracy, movement tracking smoothness, and required resources. Based on this finding, we also selected a CNN to process the data and obtain the X, Y, and Z coordinates of a person. Moreover, our 8 × 8 sensor data is inherently spatial, and CNNs are specifically designed to process grid-like data and learn local patterns, which is crucial for interpreting our sensor inputs.
The basic process of the proposed CNN is as follows: the ToF and IR data are input together, an attention mechanism derived from the IR data identifies the part of the ToF data that belongs to a person, and the Z coordinate is output. The predicted Z coordinate is then combined with the IR data and processed by the CNN to output the X and Y coordinates.
All the experiments were conducted using the PyTorch (version 2.2.0+cu121) framework. Each input sample comprised two 8 × 8 matrices, one from a ToF (Time-of-Flight) sensor and one from an IR (infrared) sensor, with a 3-dimensional (x, y, z) coordinate vector as the corresponding label. We employed a custom dual-branch, attention-based CNN (AttentionCNN). An AttentionModule, consisting of two 2D convolutional layers (channels 1-16-1) with Sigmoid activation, processed the 8 × 8 IR input to produce a spatial attention mask. This mask was applied element-wise to the 8 × 8 ToF input. The resulting attended ToF data was fed into a Z-axis branch, consisting of three convolutional layers (channels 1-16-32-64) and a two-layer MLP (4096-128-1), to regress the Z coordinate. A parallel XY-axis branch, with an identical three-layer convolutional stack, processed the original IR input. Its flattened features (4096-dim) were concatenated with the predicted Z coordinate (1-dim) and fed into a final two-layer MLP (4097-128-2) to regress the X and Y coordinates. Training was performed with a batch size of 32 using the Adam optimizer with a learning rate of $1 \times 10^{-3}$. The loss function was nn.MSELoss(). The model was validated every 10 epochs using the Average Euclidean Distance as the primary metric, and training was configured to stop automatically when this validation error fell below 0.09. A minimal code sketch of this architecture is provided after the module-by-module breakdown below.
Figure 8. Network architecture. The visualization of input is for display purposes only. The system only processes the raw data as input.
Table 5. Training hyperparameters for the network.
  • Cross-Sensor Attention Module
    Dynamically enhances human-related depth features in ToF data using IR thermal patterns while suppressing background noise.
    • IR Feature Extraction:
      Input: IR data (after initial filtering).
      Convolutional Processing:
      Conv2D (1→16): 3 × 3 kernel, padding = 1, ReLU activation.
      Conv2D (16→1): 3 × 3 kernel, padding = 1, Sigmoid activation.
      Output: An 8 × 8 attention mask $M \in [0, 1]^{8 \times 8}$, where values near 1 indicate regions with a high likelihood of human activity.
    • ToF Feature Enhancement:
      Apply the attention mask M to raw ToF depth data D via element-wise multiplication, as shown in Equation (2).
    $D_{att} = D \odot M$
    where $D \in \mathbb{R}^{8 \times 8}$ is the raw ToF matrix and $\odot$ denotes element-wise multiplication. This amplifies depth signals in human-occupied areas and suppresses static background interference (e.g., furniture).
  • Dual-Stream Processing
    • ToF Stream (Z-axis Estimation):
      Objective: Estimate the Z coordinate (depth, i.e., the distance between the person and the sensor) from the attention-enhanced ToF data.
      Convolutional Feature Extraction:
      Three sequential convolutional blocks (1 → 16 → 32 → 64 channels) produce a 64-channel 8 × 8 feature map $F_z$.
      $F_z$ is flattened and passed through two fully connected (FC) layers:
      FC1: 64 × 8 × 8 → 128 dimensions (ReLU).
      FC2: 128 → 1 dimension (linear activation).
      The output is the predicted Z coordinate z.
    • IR Stream (XY-axis Estimation):
      Objective: Estimate the X and Y coordinates by fusing the IR thermal data with the predicted Z coordinate z.
      An identical structure to the ToF stream, with three 3 × 3 convolutional blocks (1 → 16 → 32 → 64 channels), produces a 64-channel 8 × 8 feature map $F_{xy}$.
      The flattened features are concatenated with the predicted Z coordinate z:
      $F_{fused} = [\mathrm{Flatten}(F_{xy});\ z]$, dimension $= 4096 + 1$
      where $F_{fused}$ is the final concatenated feature vector, $\mathrm{Flatten}(F_{xy})$ is the flattened 4096-dimensional feature vector from the IR stream, and z is the predicted Z coordinate from the ToF stream.
      Two FC layers are applied:
      FC1: 64 × 8 × 8 + 1 → 128 dimensions (ReLU).
      FC2: 128 → 2 dimensions (linear activation).
      The output is the predicted coordinates (x, y).
    • Loss Function:
      The network is trained using the Mean Squared Error (MSE) loss, a standard choice for regression tasks, as defined in the code by nn.MSELoss(); the equation is shown in Equation (4).
      $\mathcal{L} = \mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left[(x_{pred,i}-x_{real,i})^2 + (y_{pred,i}-y_{real,i})^2 + (z_{pred,i}-z_{real,i})^2\right]$
      where $(x_{pred,i}, y_{pred,i}, z_{pred,i})$ are the network’s predicted 3D coordinates for sample i and $(x_{real,i}, y_{real,i}, z_{real,i})$ are the corresponding ground-truth coordinates. This loss computes the average squared difference, applying equal importance to the errors in all three axes (x, y, and z) rather than weighting specific axes differently.
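To make the architecture description above concrete, the following is a minimal PyTorch sketch of the dual-branch AttentionCNN, consistent with the layer sizes listed in this section. It is an illustrative reconstruction rather than the released code (available in the repository cited in the Data Availability Statement); details such as activation placement and initialization may differ.

```python
# Minimal sketch of the dual-branch AttentionCNN: an IR-driven spatial attention
# mask gates the ToF input (Z branch), and the predicted z is concatenated with
# the IR features (XY branch).
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Generates an 8x8 spatial mask in [0, 1] from the IR frame (channels 1-16-1)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 1, kernel_size=3, padding=1)

    def forward(self, ir):                            # ir: (B, 1, 8, 8)
        return torch.sigmoid(self.conv2(torch.relu(self.conv1(ir))))

def conv_stack():
    """Three 3x3 convolutional blocks (1 -> 16 -> 32 -> 64 channels), 'same' padding."""
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    )

class AttentionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = AttentionModule()
        self.z_conv = conv_stack()                    # ToF stream
        self.z_head = nn.Sequential(nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Linear(128, 1))
        self.xy_conv = conv_stack()                   # IR stream
        self.xy_head = nn.Sequential(nn.Linear(64 * 8 * 8 + 1, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, ir, tof):                       # each: (B, 1, 8, 8)
        mask = self.attention(ir)
        tof_att = tof * mask                          # element-wise gating, Eq. (2)
        z = self.z_head(self.z_conv(tof_att).flatten(1))        # (B, 1)
        f_xy = self.xy_conv(ir).flatten(1)                       # (B, 4096)
        xy = self.xy_head(torch.cat([f_xy, z], dim=1))           # (B, 2)
        return torch.cat([xy, z], dim=1)              # (B, 3): x, y, z

# Training-step sketch: batch size 32, Adam, learning rate 1e-3, MSE loss.
model = AttentionCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
ir_b, tof_b = torch.rand(32, 1, 8, 8), torch.rand(32, 1, 8, 8)   # dummy batch
target = torch.rand(32, 3)                                       # (x, y, z) labels
optimizer.zero_grad()
loss = criterion(model(ir_b, tof_b), target)
loss.backward()
optimizer.step()
```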

3.5. Key Mechanism Parameter Explanation

The key mechanism in this model is the AttentionModule, which is implemented as a spatial attention generator rather than a traditional Transformer-based cross-attention. Its purpose is to generate a spatial weight mask using IR data (heat source) to guide the feature extraction of ToF data (depth).
The key parameter settings in this module and their rationale are as follows (a short sanity-check snippet follows this list):
  • kernel_size=3, padding=1: This configuration is applied to both convolutional layers to ensure the spatial dimensions of the feature maps are preserved (i.e., “same” padding). Maintaining the 8 × 8 resolution is a critical design requirement, as the output attention mask must retain a direct spatial correspondence with the 8 × 8 ToF sensor image. This alignment is necessary for the subsequent element-wise multiplication (gating) step.
  • conv1(in=1, out=16): The initial layer functions as a feature extractor, expanding the single-channel IR input into a 16-channel feature space. The rationale for this expansion is to compensate for the information-sparse nature of the raw 8 × 8 IR data. By increasing the feature dimensionality, the network is afforded the capacity to learn more complex and abstract representations, such as thermal gradients and edges, rather than being limited to raw temperature values.
  • conv2(in=16, out=1): Conversely, the final convolutional layer performs dimensionality reduction, squeezing the 16-channel feature maps back into a single-channel output. This is conceptually necessary as the final output must serve as a unified spatial attention map, where each pixel location is assigned a single scalar importance value. This layer effectively integrates the multi-dimensional abstract features learned by conv1 into this final, cohesive weight map.
  • Sigmoid() Activation: Sigmoid activation is applied to the final single-channel output. Its function is to normalize the logits from conv2 into a probabilistic range of [0, 1]. This normalization is essential for the map to function as a “soft gate” or mask. The resulting values act as coefficients for the subsequent element-wise multiplication: values approaching 1.0 signify high relevance (allowing the corresponding ToF data to pass through), while values approaching 0.0 signify low relevance (suppressing the ToF data).
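As a quick check of these two properties (shape preservation and the [0, 1] gating range), the following snippet can be run against the AttentionModule class from the sketch at the end of Section 3.4; it is illustrative only and assumes that class is already defined or imported.

```python
# Verify that the attention mask keeps the 8x8 spatial layout ("same" padding)
# and stays within [0, 1] (Sigmoid soft gate). Assumes AttentionModule from the
# Section 3.4 sketch is in scope.
import torch

att = AttentionModule()
ir_frame = torch.rand(1, 1, 8, 8)                 # one (filtered) IR frame
mask = att(ir_frame)

assert mask.shape == (1, 1, 8, 8)                 # spatial correspondence preserved
assert float(mask.min()) >= 0.0 and float(mask.max()) <= 1.0   # bounded soft gate
```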

4. Experiments and Results

4.1. Experimental Environment and Dataset

The main computing platform used in this paper was a low-cost tablet with an Intel(R) N97 CPU (2.00 GHz) and no GPU.
The data collection process yielded two distinct datasets. Each data point in both sets consisted of two 8 × 8 sensor matrices (one from a Time-of-Flight sensor, one from an IR thermal sensor) and a 3-dimensional ground truth position vector (X, Y, Z).
The training and validation dataset included over 13,000 data samples. Each data sample included 128 elements of data from the two 8 × 8 low-resolution sensors and the 3D coordinate data from an RGB-D camera, for a total of 131 elements. It was collected in an indoor environment with a small amount of furniture, where testers performed a wide variety of actions. The testers were healthy young adults. These actions included common movements such as walking, sitting, and standing, as well as motions with significant height changes, such as lying down, sitting down, and falling.
For model development, this entire 13,000+ set pool was randomly shuffled to prevent sequential bias. It was then split, with 15% (approx. 1950+ sets) allocated for validation and the remaining 85% (approx. 11,050+ sets) used for training.
The test dataset (final evaluation) was a separate set of 1500 data points recorded as a single, continuous time-series sequence. Because human motion is inherently continuous, this test set was explicitly not shuffled during evaluation. This method ensures a realistic assessment of the model’s performance on unseen, temporally coherent motion patterns.
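The following is a minimal sketch of this data handling, assuming the samples are stored as NumPy arrays; the file names and the random seed are illustrative, not taken from the released code.

```python
# Shuffle and split the pooled training/validation samples 85/15, but keep the
# separate 1500-frame test recording in its original temporal order.
import numpy as np

rng = np.random.default_rng(seed=0)               # illustrative seed

def split_train_val(samples, val_ratio=0.15):
    """samples: array of shape (N, 131) = 64 ToF + 64 IR values + (x, y, z) label."""
    idx = rng.permutation(len(samples))           # shuffle to remove sequential bias
    n_val = int(len(samples) * val_ratio)
    return samples[idx[n_val:]], samples[idx[:n_val]]   # (train, validation)

pool = np.load("train_val_pool.npy")              # 13,000+ pooled samples (hypothetical file)
train_set, val_set = split_train_val(pool)

test_seq = np.load("test_sequence.npy")           # 1500 samples, evaluated without shuffling
```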

4.2. Results and Analysis

On the platform used in this test, the average running time was less than 1 ms, and the MSE reached 0.025 m2. The maximum error occurred in the Z-axis direction, which is the most error-prone, at 0.543 m, which meets the requirements of the target application. The final results, predicted from the IR and ToF sensor data, are compared with the ground-truth coordinates provided by the RGB-D camera in Figure 9. As previously discussed, a primary advantage of our 3D coordinate acquisition system is its flexibility in sensor placement. To demonstrate this, the sensor module was mounted on a vertical wall for this experiment. This configuration establishes a coordinate system where the Z-axis (depth) runs parallel to the floor, representing the distance between the subject and the sensor array. Consequently, the X- and Z-axes map the subject’s 2D position on the ground plane, while the Y-axis directly corresponds to the subject’s height.
Figure 9. Comparison of the predicted X, Y, Z coordinates with ground truth coordinates provided by the RGB-D camera. (a) X-axis coordinate comparison; (b) Y-axis coordinate comparison; (c) Z-axis coordinate comparison.
An analysis of the results reveals distinct error characteristics for different axes. The X and Y coordinates are heavily informed by the AMG8833 IR sensor. This component is known to be susceptible to thermal drift and inherent temperature inaccuracies, which introduce a degree of signal instability. This instability is observable as high-frequency noise in the X-axis and Y-axis predictions. However, despite this noise, the model’s output demonstrates a strong correlation with the ground truth, accurately tracking the overall trajectory of the tester’s movements.
The most significant estimation errors were observed in the Z-axis (depth). We attribute this to the complexities of the sensor fusion process itself. While combining the IR and ToF data is essential for the 3D position, the fusion model must also learn to manage the combined and sometimes amplified noise from both inputs. Detailed error metrics are shown in Table 6.
Table 6. Error metrics of the proposed method.
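The metrics reported in Table 6 can be reproduced from paired prediction/ground-truth arrays along the lines of the sketch below. The exact aggregation over axes (errors summed per frame, as in Equation (4), versus averaged) is our assumption; the table values themselves come from the authors’ test sequence.

```python
# Metric definitions: overall MSE (m^2), MAE (m), mean Euclidean distance error (m),
# and the maximum absolute error per axis (m).
import numpy as np

def tracking_metrics(pred, truth):
    """pred, truth: (N, 3) arrays of (x, y, z) coordinates in metres."""
    err = pred - truth
    mse = (err ** 2).sum(axis=1).mean()               # summed over axes, averaged over frames
    mae = np.abs(err).sum(axis=1).mean()              # assumed analogous aggregation
    mean_euclid = np.linalg.norm(err, axis=1).mean()  # mean Euclidean distance error
    max_per_axis = np.abs(err).max(axis=0)            # maximum error for x, y, z
    return {"MSE": mse, "MAE": mae, "Euclidean": mean_euclid, "MaxPerAxis": max_per_axis}
```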
In conclusion, while the system exhibits predictable noise artifacts, the predicted 3D path successfully mirrors the subject’s continuous motion, including complex actions and height changes, achieving a mean Euclidean distance error of 0.124 m. The magnitude of these errors is considered acceptable for the target application. Therefore, this low-resolution, privacy-preserving fusion system demonstrates significant potential for practical deployment in Ambient Assisted Living (AAL) environments for tasks such as activity monitoring or fall detection.
A quantitative performance comparison was conducted between the Time-of-Flight (ToF) sensor alone, the thermal sensor alone, and our proposed fused ToF-IR approach. As detailed in Table 7, the results reveal critical distinctions in their tracking capabilities.
Table 7. Detailed error metrics for single-sensor and fused methods.
The ToF-only method yielded the poorest performance. This approach was highly unreliable, frequently misidentifying inanimate, room-temperature objects as human subjects. This fundamental weakness is starkly illustrated by its catastrophic Max X Error of 2.841 m and the highest overall Euclidean distance error ( 0.265 m).
Conversely, the thermal-only method provided 2D localization on par with the fusion method. Its Max X (0.264 m) and Max Y (0.391 m) errors were comparable to those of the fusion model. However, its performance was markedly inferior in depth estimation. As the thermal sensor lacks a direct mechanism for distance measurement, it produced a Max Z Error of 0.807 m, significantly higher than that of the fusion method.
Our fusion model demonstrably overcomes the limitations of both individual sensors. It achieved the best results on every metric, including the lowest overall MSE (0.025 m2), MAE (0.172 m), and a mean Euclidean distance error of only 0.124 m. This 3D tracking performance is achieved by synergistically combining the thermal sensor’s robust subject identification (which filters out the ToF’s false positives) with the ToF sensor’s depth data. The qualitative tracking results are visualized in Figure 10.
Figure 10. Comparison of tracking results for different methods. The plots show the predicted coordinates versus the ground truth: (a) Comparison on the X-axis; (b) Comparison on the Y-axis; (c) Comparison on the Z-axis.

4.3. Testing of Various Walking Patterns

To further validate the performance of the proposed method, we selected three motion types that may be encountered in daily life: circle shape, spiral shape, and a sit shape. In circle shape, the subject’s walking trajectory on the ground forms a circle, whereas in spiral shape, the trajectory is a spiral. These two motions provide a visual demonstration of the system’s ability to track a person’s motion on the ground (the depth direction of the system is parallel to the ground, meaning that the motion on the ground plane is the system’s X–Z plane). The sit shape involves the subject moving to a nearby chair, completing the actions of sitting down and standing up, and then departing. This motion primarily demonstrates the system’s ability to track changes in the person’s height. Each pattern is designed to test the system’s tracking capabilities under different motion dynamics and spatial configurations. The results for each motion pattern and their corresponding error metrics are presented below.
The results were produced by the CNN model and then smoothed using an exponentially weighted moving average (EWMA) filter, which works as follows:
$x(t) = (1 - \alpha)\, x(t-1) + \alpha\, y(t)$
where $x(t)$ is the filtered value at the current moment, $x(t-1)$ is the filtered value at the previous moment, $y(t)$ is the input value (the original signal) at the current moment, and $\alpha$ is the filter coefficient ($0 < \alpha < 1$), also known as the smoothing factor.
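A minimal sketch of this smoothing step, applied to the per-frame (x, y, z) predictions, is shown below; the value of α used in the experiments is not restated here, so 0.3 is only an illustrative choice.

```python
# Exponentially weighted moving average over a predicted trajectory.
import numpy as np

def ewma(signal, alpha=0.3):
    """signal: (N, 3) array of raw (x, y, z) predictions; returns the smoothed trajectory."""
    smoothed = np.empty_like(signal, dtype=float)
    smoothed[0] = signal[0]
    for t in range(1, len(signal)):
        smoothed[t] = (1 - alpha) * smoothed[t - 1] + alpha * signal[t]
    return smoothed
```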
  • Circle shape: The circle shape refers to the test subject moving in a circular trajectory in the test space, mainly testing the system’s tracking capability along the X–Z axes (i.e., the person’s coordinates on the ground). The result is shown in Figure 11, and the error metrics are shown in Table 8.
    Figure 11. Comparison of X–Z result (circle shape).
    Table 8. Error metrics for IR-ToF Fusion Network in circle shape walking pattern.
  • Spiral shape: The spiral shape refers to the test subject moving in a spiral trajectory in the test space, mainly testing the system’s tracking capability along the X–Z axes (i.e., the person’s coordinates on the ground). The result is shown in Figure 12, and the error metrics are shown in Table 9.
    Figure 12. Comparison of X–Z result (spiral shape).
    Table 9. Error metrics for IR-ToF Fusion Network in spiral shape walking pattern.
  • Sit shape: The sit shape test is when the tester walks halfway through the test space, completes the sitting and standing movements, and then continues walking. It mainly tests the system’s tracking ability on the Y-axis (i.e., the change in the person’s height). The result is shown in Figure 13, and the error metrics are shown in Table 10.
    Figure 13. Comparison of X–Y result (sit shape). The tester walks in a straight line, sits down in the middle (arrow), then stands up, and then continues walking.
    Table 10. Error metrics for IR-ToF Fusion Network in sit shape walking pattern.

5. Discussion

5.1. Summary of Findings

The experimental results demonstrate that the fusion of 8 × 8 Time-of-Flight (ToF) and 8 × 8 infrared (IR) sensors enables 3D human tracking with a mean Euclidean distance error of 0.124 m (Table 6) and a Mean Square Error (MSE) of 0.025 m2. The system operates with a sub-millisecond average inference time on a low-cost CPU platform, confirming its suitability for real-time applications.
A primary capability of this approach is the acquisition of 3D coordinates. This capability facilitates flexible sensor placement, such as the wall-mounting demonstrated in this study, distinguishing it from 2D systems that typically mandate ceiling installation. As previously discussed, some indoor localization methods using low-resolution IR sensors are often limited to 2D coordinates, necessitating a ceiling-mounted deployment. The ability of our system to acquire 3D coordinates grants deployment flexibility. To highlight this 3D capability and demonstrate that the sensor can be placed arbitrarily, we intentionally chose a wall-mounted configuration for our experiments—a setup that is impractical for previous 2D-only methods [,]. While this configuration successfully validates the system’s 3D potential, we acknowledge it introduces susceptibility to inter-person occlusion in multi-person scenarios. This, however, is a consequence of the chosen demonstration rather than an inherent model limitation. A clear path for multi-person tracking using our 3D model would simply involve a ceiling-mounted configuration, which would largely mitigate occlusion by projecting individuals as distinct thermal and depth signatures onto the 2D plane.
For Ambient Assisted Living (AAL), acquiring 3D coordinates provides a significant advantage over 2D-only systems. The detection of vertical (Z-axis) changes is particularly crucial, as it enables the recognition of key activities. The system’s ability to track such diverse motion patterns—including planar movement like walking and vertical movement like sitting—was validated, as shown in Figure 11, Figure 12 and Figure 13.

5.2. Privacy and Ethical Considerations

The system’s “privacy-by-design” foundation is the use of two ultra-low-resolution 64-pixel sensor grids. It uses both a ToF sensor and a passive IR sensor, not an RGB sensor. As demonstrated in Figure 6, this data is insufficient to capture identifiable biometric features, thus mitigating the risks associated with traditional visual surveillance.

5.3. Analysis of System Performance and Limitations

A significant challenge is environmental thermal interference. While the preprocessing filter (Equation (1)) is designed to isolate human body temperatures (22 °C to 31 °C), this heuristic can be compromised by other thermal sources. Direct sunlight, radiant heaters, or hot household appliances (e.g., ovens, hot beverages) can create false positives or mask the thermal signature of a person. Similarly, large pets may also be misidentified as human targets.
The tracking accuracy is lowest along the Z-axis (depth), which exhibited the highest Mean Absolute Error (0.095 m) and Max Error (0.543 m) (Table 6). This is an inherent consequence of the ultra-low 8 × 8 resolution, which makes precise distance estimation challenging, especially as the subject moves further from the sensor.

5.4. Future Work

Based on the system’s performance and limitations, several avenues for future research are identified:
  • Multi-Person Tracking: A primary direction is the development of multi-person tracking. This requires investigating optimal sensor placement (e.g., ceiling-mounting) to mitigate occlusion.
  • Temporal Modeling for Occlusion Resilience: To address temporary occlusions (e.g., by furniture or self-occlusion), the current frame-by-frame model could be enhanced. Incorporating temporal models, such as Recurrent Neural Networks (RNNs), LSTMs, or Transformers, would allow the system to predict a subject’s position based on motion history.
  • Environmental Robustness: Future models should be trained to be more robust to thermal interference. This could involve dynamically adjusting the temperature filters based on ambient room temperature or developing a more sophisticated fusion logic that can differentiate human thermal patterns from static (e.g., heater) or sudden (e.g., sunlight) thermal noise.
  • Generalizability via Federated Learning: The current model was trained in a specific environment. To improve generalizability across diverse home layouts and thermal conditions, Federated Learning is a promising approach. This would allow models to be collaboratively trained on datasets from multiple locations without centralizing raw sensor data, thus preserving user privacy.
  • Hardware and Resolution Trade-offs: Further exploration of hardware, such as 4 × 4 sensor grids, could further reduce identifiability risks. This must be balanced against the resulting loss in tracking accuracy, requiring a systematic study of the performance-privacy trade-off.
We also acknowledge that the deployment of this technology in practical applications (especially in human care) must be predicated on a thorough review of relevant ethical and legal issues, such as data privacy and user consent. This research currently focuses on exploring the technical feasibility of low-resolution sensors, while a comprehensive ethical and legal analysis will serve as a crucial next step before practical applications.

6. Conclusions

In conclusion, this study presented and validated a real-time 3D human tracking system that fuses data from 8 × 8 ToF and 8 × 8 IR sensors. The system achieves an average Euclidean tracking error of 0.124 m (Table 6) with a sub-millisecond inference time, demonstrating its viability for real-time assistive applications on CPU-based platforms. This result can be contextualized with prior art (Table 2). Much of the existing research—such as the PIR-based systems of Liu et al. [] and Mukhopadhyay et al. []—is confined to 2D localization and reports higher errors (approx. 0.64 m). Other 2D deep learning models (e.g., Tariq et al. []) have reported high planar accuracy (0.096 m RMSE). The proposed system’s 3D error of 0.124 m is achieved while solving a 3D problem. This 3D capability (X, Y, Z) addresses the limitations of 2D, ceiling-mounted IR-only systems and enables flexible wall-mounted deployment. As validated on diverse motion patterns (Figure 11, Figure 12 and Figure 13), the system demonstrated its capability to track common daily activities, with detailed error metrics presented for each pattern (Table 8, Table 9 and Table 10). A key aspect of this work is the demonstration of a system that provides tracking functionality while using hardware that offers inherent privacy protection. By achieving 3D tracking using non-identifiable 64-pixel raw data (Figure 6), this research shows that assistive monitoring can be performed without high-resolution visual data. This “privacy-by-design” approach, which achieves protection at the sensor level, is intended to address acceptance gaps among privacy-conscious user groups.

Author Contributions

Conceptualization, Q.S., M.K., and N.K.; data curation, Q.S. and T.O.; formal analysis, Q.S.; investigation, Q.S.; methodology, Q.S. and M.K.; project administration, N.K.; resources, Q.S. and M.K.; software, Q.S. and T.O.; supervision, M.K. and T.O.; validation, Q.S., M.K., and T.O.; visualization, Q.S.; writing—original draft, Q.S. and N.K.; writing—review and editing, Q.S. and T.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Tokyo Metropolitan University Hino Campus (R7-007) on [30 April 2025].

Data Availability Statement

The data that support the findings of this study are openly available on GitHub at https://github.com/sqwgithub/Multi-infrared-sensor-fusion (accessed on 26 October 2025). Key Software Dependencies: Python: 3.8.19; PyTorch: 2.2.0+cu121; CUDA Toolkit: 12.1; cuDNN: 8.8.1 (version 8801); NumPy: 1.24.4; Pandas: 2.0.3; scikit-learn: 1.3.0.

Acknowledgments

This work was partially supported by Japan Science and Technology Agency (JST), Moonshot R&D, with grant number JPMJMS2034 and TMU local 5G research support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AALAmbient Assisted Living
ADLActivities of Daily Living
AoAAngle of Arrival
AoDAngle of Departure
BLEBluetooth Low Energy
CNNConvolutional Neural Network
EWMAExponentially Weighted Moving Average
FCFully Connected
GDPRGeneral Data Protection Regulation
HARHuman Activity/State Recognition
IMUInertial Measurement Unit
IRInfrared
LSTMLong Short-Term Memory
MAEMean Absolute Error
MLPMultilayer Perceptron
MSEMean Square Error
NNNeural Network
PIRPassive Infrared
RGB-DRed Green Blue-Depth
RMSERoot Mean Square Error
RSSIReceived Signal Strength Indicator
SVRSupport Vector Regression
ToFTime-of-Flight

References

  1. Cicirelli, G.; Marani, R.; Petitti, A.; Milella, A.; D’Orazio, T. Ambient Assisted Living: A Review of Technologies, Methodologies and Future Perspectives for Healthy Aging of Population. Sensors 2021, 21, 3549. [Google Scholar] [CrossRef]
  2. Almutairi, M.; Gabralla, L.A.; Abubakar, S.; Chiroma, H. Detecting Elderly Behaviors Based on Deep Learning for Healthcare: Recent Advances, Methods, Real-World Applications and Challenges. IEEE Access 2022, 10, 69802–69821. [Google Scholar] [CrossRef]
  3. Nasr, M.; Islam, M.M.; Shehata, S.; Karray, F.; Quintana, Y. Smart Healthcare in the Age of AI: Recent Advances, Challenges, and Future Prospects. IEEE Access 2021, 9, 145248–145270. [Google Scholar] [CrossRef]
  4. Fournier, H.; Molyneaux, H.; Kondratova, I. Designing for Privacy and Technology Adoption by Older Adults. In Proceedings of the HCI International 2022 Posters, Virtual, 26 June–1 July 2022; Stephanidis, C., Antona, M., Ntoa, S., Eds.; Springer: Cham, Switzerland, 2022; pp. 506–515. [Google Scholar] [CrossRef]
  5. Garg, V.; Camp, L.J.; Lorenzen-Huber, L.; Shankar, K.; Connelly, K. Privacy concerns in assisted living technologies. Ann. Telecommun. 2014, 69, 75–88. [Google Scholar] [CrossRef]
  6. Tham, N.A.Q.; Brady, A.M.; Ziefle, M.; Dinsmore, J. Barriers and Facilitators to Older Adults’ Acceptance of Camera-Based Active and Assisted Living Technologies: A Scoping Review. Innov. Aging 2024, 9, igae100. [Google Scholar] [CrossRef]
  7. Rasche, P.; Wille, M.; Bröhl, C.; Theis, S.; Schäfer, K.; Knobe, M.; Mertens, A. Prevalence of Health App Use Among Older Adults in Germany: National Survey. JMIR mHealth uHealth 2018, 6, e26. [Google Scholar] [CrossRef]
  8. Wang, C.Y.; Lin, F.S. Exploring Older Adults’ Willingness to Install Home Surveil-Lance Systems in Taiwan: Factors and Privacy Concerns. Healthcare 2023, 11, 1616. [Google Scholar] [CrossRef]
  9. Zhang, H.B.; Zhang, Y.X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.X.; Chen, D.S. A Comprehensive Survey of Vision-Based Human Action Recognition Methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, S.; Wei, Z.; Nie, J.; Huang, L.; Wang, S.; Li, Z. A Review on Human Activity Recognition Using Vision-Based Method. J. Healthc. Eng. 2017, 2017, 3090343. [Google Scholar] [CrossRef]
  11. Sirmacek, B.; Riveiro, M. Occupancy Prediction Using Low-Cost and Low-Resolution Heat Sensors for Smart Offices. Sensors 2020, 20, 5497. [Google Scholar] [CrossRef]
  12. Tariq, O.B.; Lazarescu, M.T.; Lavagno, L. Neural Networks for Indoor Person Tracking with Infrared Sensors. IEEE Sens. Lett. 2021, 5, 1–4. [Google Scholar] [CrossRef]
  13. Newaz, N.T.; Hanada, E. A Low-Resolution Infrared Array for Unobtrusive Human Activity Recognition That Preserves Privacy. Sensors 2024, 24, 926. [Google Scholar] [CrossRef]
  14. Mortenson, W.B.; Sixsmith, A.; Beringer, R. No Place Like Home? Surveillance and What Home Means in Old Age. Can. J. Aging/Rev. Can. Vieil. 2016, 35, 103–114. [Google Scholar] [CrossRef]
  15. Xu, P.; Sulaiman, N.A.A.; Ding, Y.; Zhao, J.; Li, S. A study of falling behavior recognition of the elderly based on deep learning. Signal Image Video Process. 2024, 18, 7383–7394. [Google Scholar] [CrossRef]
  16. Dou, W.; Azhar, A.S.; Chin, W.; Kubota, N. Human Activity Recognition System Based on Continuous Learning with Human Skeleton Information. Sens. Mater. 2024, 36, 4713–4730. [Google Scholar] [CrossRef]
  17. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
  18. Ravi, S.; Climent-Pérez, P.; Florez-Revuelta, F. A Review on Visual Privacy Preservation Techniques for Active and Assisted Living. Multimed. Tools Appl. 2024, 83, 14715–14755. [Google Scholar] [CrossRef]
  19. Shajari, S.; Kuruvinashetti, K.; Komeili, A.; Sundararaj, U. The Emergence of AI-Based Wearable Sensors for Digital Health Technology: A Review. Sensors 2023, 23, 9498. [Google Scholar] [CrossRef] [PubMed]
  20. Moore, K.; O’Shea, E.; Kenny, L.; Barton, J.; Tedesco, S.; Sica, M.; Crowe, C.; Alamäki, A.; Condell, J.; Nordström, A.; et al. Older adults’ experiences with using wearable devices: Qualitative systematic review and meta-synthesis. JMIR mHealth uHealth 2021, 9, e23832. [Google Scholar] [CrossRef] [PubMed]
  21. Kristoffersson, A.; Lindén, M. A Systematic Review of Wearable Sensors for Monitoring Physical Activity. Sensors 2022, 22, 573. [Google Scholar] [CrossRef]
  22. Lian, C.; Li, W.J.; Kang, Y.; Li, W.; Zhou, D.; Zhan, Z.; Chen, M.; Suo, J.; Zhao, Y. Enhanced Human Lower-Limb Motion Recognition Using Flexible Sensor Array and Relative Position Image. Pattern Recognit. 2026, 171, 112142. [Google Scholar] [CrossRef]
  23. Wang, J.; Dhanapal, R.K.; Ramakrishnan, P.; Balasingam, B.; Souza, T.; Maev, R. Active RFID Based Indoor Localization. In Proceedings of the 2019 22th International Conference on Information Fusion (FUSION), Ottawa, ON, Canada, 2–5 July 2019; pp. 1–7. [Google Scholar] [CrossRef]
  24. Salman, A.; El-Tawab, S.; Yorio, Z.; Hilal, A. Indoor Localization Using 802.11 WiFi and IoT Edge Nodes. In Proceedings of the 2018 IEEE Global Conference on Internet of Things (GCIoT), Alexandria, Egypt, 5–7 December 2018; pp. 1–5. [Google Scholar] [CrossRef]
  25. Bouazizi, M.; Ye, C.; Ohtsuki, T. Low-Resolution Infrared Array Sensor for Counting and Localizing People Indoors: When Low End Technology Meets Cutting Edge Deep Learning Techniques. Information 2022, 13, 132. [Google Scholar] [CrossRef]
  26. Shao, S.; Kubota, N.; Hotta, K.; Sawayama, T. Behavior Estimation Based on Multiple Vibration Sensors for Elderly Monitoring Systems. J. Adv. Comput. Intell. Intell. Inform. 2021, 25, 489–497. [Google Scholar] [CrossRef]
  27. Yamamoto, K.; Shao, S.; Kubota, N. Heart Rate Measurement Using Air Pressure Sensor for Elderly Caring System. In Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019; pp. 1440–1444. [Google Scholar] [CrossRef]
  28. Liu, X.; Yang, T.; Tang, S.; Guo, P.; Niu, J. From Relative Azimuth to Absolute Location: Pushing the Limit of PIR Sensor Based Localization. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, MobiCom ’20, London, UK, 21–25 September 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
  29. Mukhopadhyay, B.; Sarangi, S.; Srirangarajan, S.; Kar, S. Indoor Localization Using Analog Output of Pyroelectric Infrared Sensors. In Proceedings of the 2018 IEEE Wireless Communications and Networking Conference (WCNC), Barcelona, Spain, 15–18 April 2018; pp. 1–6. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
