Sensors
  • Article
  • Open Access

12 January 2025

Multi-Person Localization Based on a Thermopile Array Sensor with Machine Learning and a Generative Data Model

Laboratory of Adaptive Lighting Systems and Visual Processing, Technical University of Darmstadt, Hochschulstr. 4a, 64289 Darmstadt, Germany
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence-Based Target Recognition and Remote Sensing Data Processing

Abstract

Thermopile sensor arrays offer a good trade-off between person detection and localization capability on the one hand and privacy preservation through low resolution on the other. The latter is especially important in the context of smart building automation applications. Current research highlights two machine learning-based algorithms as particularly prominent for general object detection: You Only Look Once (YOLOv5) and Detection Transformer (DETR). In this paper, both algorithms are adapted to localize people in 32 × 32-pixel thermal array images. The loss of precision caused by the sparse amount of labeled data was counteracted with a novel infrared image generator (IIG), which creates synthetic thermal frames from the small set of available labeled data. Multiple robustness tests were performed during the evaluation to determine the overall usability of both algorithms as well as the benefit of the image generator. Both algorithms provide a high mean average precision (mAP) exceeding 98%. They also prove to be robust against disturbances from warm air streams, sun radiation, the replacement of the sensor with one of the same type, new persons, cold objects, movement along the image frame border and people standing still. However, the precision decreases for persons wearing thick layers of clothing, such as winter clothing, or in scenarios where the number of present persons exceeds the number the algorithm was trained on. In summary, both algorithms are suitable for detection and localization purposes, although YOLOv5m has the advantage of real-time image processing, a smaller model size and slightly higher precision.

1. Introduction

Detecting the presence of a person, with or without movement, is essential in many aspects of building automation, such as smart lighting systems [], elderly fall detection [,,] or movement tracking [,]. A thermopile sensor array uses a grid of thermopiles to record a low-resolution infrared (IR) image representing a 2D thermal landscape of the area within the array’s field of view (FOV) []. The thermal landscape texture is formed by the interaction of IR sources, such as occupants or heaters, with the IR absorption and/or reflection characteristics of surrounding surfaces. Based on the difference between two consecutive images, properties like the number of occupants or their direction of movement can be detected and localized [,,]. In addition, such thermopile arrays preserve the occupant’s privacy due to their low resolution compared to other image-based systems like optical cameras [,,].
By leveraging the privacy advantage of a thermopile array in the previously discussed applications, such as smart lighting systems, elderly fall detection or movement tracking, only anonymized data are recorded. At the same time, valuable information can be obtained about the number of occupants and their speed and direction of movement. For these applications, no explicit information about the identity, gender or age of the people in the FOV is required. In industrial and office environments, such as meeting rooms with high security and privacy standards, optical RGB or monochrome cameras can therefore not be used. Based on this challenge and the new potential fields of application for low-resolution, privacy-preserving thermopile sensors, the following research aims to utilize state-of-the-art algorithms that achieve two goals: (1) to robustly distinguish between a person and a dynamic heat-radiating object and (2) to conceive an algorithm that is also robust to changing environments like different installation locations.

3. Experiment

This section highlights the experimental setup for collecting IR image data, a generative labeling and data augmentation model and the training procedure.

3.1. Data Collection

A commercially available sensor with a 32 × 32-pixel resolution is utilized. This sensor offers a good balance between a resolution low enough to preserve privacy and one sufficiently high to distinguish multiple objects and persons in the field of view (FOV). The sensor model HTPA32 × 32dR2L1.9/0.8HiC[CH] manufactured by Heimann Sensor GmbH (Eltville am Rhein, Germany) was arbitrarily selected for data capturing. This sensor series is advertised by the manufacturer as being designed for person detection. The specifications are listed in Table 1.
Table 1. Specifications of the thermopile array sensor HTPA32 × 32dR2L1.9/0.8HiC[Ch] from Heimann Sensor GmbH.
After a literature search for labeled thermal images, no dataset could be found with the following properties: multiple annotated persons, disturbing objects, a resolution equal to or greater than 32 × 32 pixels and an overhead installation position. Therefore, a suitable training dataset had to be created for this paper. The data collection environment is displayed in Figure 1 with its three possible entries/exits as well as the sensor position (red dot) and field of view (red dashed rectangle). This scenario mimics an indoor office context, with one workplace and an overhead view of the infrared array.
Figure 1. Schematic structure of the indoor office-like test environment with marked access points 1, 2 and 3. The red dot marks the position of the thermopile sensor and the red dashed lines frame the field of view of the sensor, which equals 2.5 m × 2.5 m (7.8 cm × 7.8 cm pixel size on the floor). The sensor was installed on the ceiling at a height of 2.3 m; the room height is up to 2.4 m.
According to the literature recommendations, the following parameters were varied to generate a diversified dataset: occupants [], clothing [,] and poses (standing, sitting, small movements, walking) []. In total, three scenarios were carried out by four male participants. The height and hair type of each participant are listed in Table 2.
Table 2. Height and hair type of four male participants used in three scenarios of the experiment.
The four participants wore either spring, summer, autumn or winter clothing for data collection. The following actions were performed to gather the data:
1.
Each person individually walked into the room, took a place at the working desk, typed something on the laptop, stood up and left through another exit. Each person accomplished this scenario six times with six different routes, resulting in about 1800 captured images.
2.
Two people entered from different entrances, passing each other and leaving the field of view (FoV) at different exits. In total, 920 images were taken.
3.
The area within the FoV was completely cleared of all objects and individual images of the participants were taken. Each person was captured standing, walking and sitting centrally under the sensor and in the four corners, as well as on the four borders of the test environment. The test subjects were encouraged to perform a variety of poses, such as sitting with legs stretched out, walking with long steps, or standing with their arms crossed. This scenario yielded roughly 350 images.
As a result, 3071 real-world images were collected and manually labeled. Of these images, 2150 (70%) were utilized as training data. Out of the remaining images, 614 (20%) were utilized as validation data and 307 (10%) as test data. Random selection determined the precise assignments.
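For illustration, a random 70/20/10 assignment of this kind could be reproduced as in the following minimal sketch; the directory name and file format are hypothetical and not taken from the paper.

```python
import random
from pathlib import Path

# Hypothetical location and format of the 3071 labeled frames; adjust to the actual dataset layout.
frames = sorted(Path("labeled_frames").glob("*.png"))
random.seed(0)            # fixed seed so the random assignment is reproducible
random.shuffle(frames)

n = len(frames)
train = frames[: round(0.7 * n)]                 # ~2150 training images
val = frames[round(0.7 * n): round(0.9 * n)]     # ~614 validation images
test = frames[round(0.9 * n):]                   # ~307 test images
```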

3.2. Generative Data Creation Model

Data-driven algorithms, such as YOLOv5 and DETR, usually require a large amount of labeled data to perform in a robust and accurate manner. Since data collection and labeling is a highly time-consuming process, a novel generative model for the automatic creation of labeled infrared image data has been implemented. The aim of the infrared image generator (IIG) is to generate realistic images with automatically labeled persons and heat-radiating objects based on a small set of pre-captured and pre-labeled real-world data. By passing parameters like the number of occupants and interfering IR sources to the IIG, it generates an entirely new IR image with a correct scene and IR source annotation. The image background and object temperature can be varied individually. Images created this way are then further augmented by methods like rotation or flipping. Combining real-world data collection, IR image construction with the proposed IIG and additional augmentation yields a sufficient dataset for data-driven algorithms like YOLOv5 or DETR.
Creating an image with the IIG follows three steps: generating a background, placing a specified number of occupants and finally adding noise or interfering thermal sources to the image. Each of these three phases is explained in greater detail below:
  • Background generation: The background is modeled as a stretched normal distribution. By determining the function parameters, arbitrary backgrounds can be created.
  • Inserting occupants: Crop occupants from labeled images and place a predefined number of persons in the image.
  • Noise and interfering thermal sources: Thermal sources representing noise are added to the image in two forms: (1) as stretched normal distributions or (2) as cropped image fragments from the labeled data. Finally, the borders of the inserted occupant snippets as well as of the noise fragments from the labeled data are blurred with a Gaussian filter to fit into the new background.
Figure 2 visualizes these parts in a real image.
Figure 2. Real example 32 × 32 px image of the infrared array with a highlighted person (green box), disturbances (blue circles) and background (red box).
The three parts (background, person and disturbances) were further analyzed with temperature histograms. The background resembles a stretched normal distribution. The obtained temperature distributions of the humans were highly complex and very diverse. Heat-radiating disturbances could be separated into two groups: (1) hot-spot-like disturbances that follow a stretched normal distribution and (2) a second group with a seemingly random distribution. With this in mind, backgrounds and some disturbances are modeled mathematically, while humans and complex disturbances are extracted from real images.
As a base dataset for the IIG algorithm, a total of 3400 images were used, divided into 2000 images with people and disturbances and 1400 images with only the background and minor environmental disturbances. The 2000 images correspond to the training dataset from Section 3.1; however, only images in which a person is completely visible in the FoV of the sensor were included. Images in which a person is entering or leaving the FoV were therefore excluded in order to work only with full representations of people. All 1400 background images were newly captured and are only used for a kernel density estimation (KDE) with bandwidth optimization to fit the given backgrounds in general. This leads to a non-parametric generative model for the backgrounds, from which new 32 × 32-pixel background images can be drawn. Figure 3 shows two real measured images and two artificial images. In the next step, each background was shifted to a random temperature between 18 °C and 25 °C, corresponding to a common office temperature, by adding the delta between the average temperature of the generated background and the desired temperature; this covers as many different environmental scenarios as possible.
Figure 3. Comparison of two randomly selected real measured images on the left side and two artificial backgrounds on the right side. The real and generated images have a similar representation in the histogram.
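A rough sketch of this background model is given below: it fits a kernel density estimate to the pooled pixel temperatures of the background frames, optimizes the bandwidth via cross-validated grid search and draws new 32 × 32 backgrounds shifted into the 18–25 °C office range. The pooled-pixel formulation, the file name, the subsampling and the bandwidth grid are assumptions; the exact KDE setup used in the paper may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Hypothetical file holding the 1400 background frames in °C, shape (1400, 32, 32).
backgrounds = np.load("background_frames.npy")
pixels = backgrounds.reshape(-1, 1)
subset = pixels[np.random.choice(len(pixels), 20000, replace=False)]  # subsample to keep the search fast

# Bandwidth optimization, as mentioned in the text, via cross-validated grid search.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.05, 1.0, 20)}, cv=5)
grid.fit(subset)
kde = grid.best_estimator_

def sample_background(temp_range=(18.0, 25.0)):
    """Draw a synthetic 32 x 32 background and shift it to a random office temperature."""
    frame = kde.sample(32 * 32).reshape(32, 32)
    target = np.random.uniform(*temp_range)
    # Add the delta between the frame's mean temperature and the desired temperature.
    return frame + (target - frame.mean())

synthetic_bg = sample_background()
```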
Occupants and heavy disturbances are inserted as extractions from the remaining 2000 captured and manually labeled images, in which disturbances and humans are tagged with bounding boxes. These boxes are extracted from the images with an additional four surrounding pixels. This value was chosen to include two pixels, the mean distance around an object that is influenced by its radiated heat, plus another two pixels for smoothly inserting the extracted snippet into the background. After insertion at a selected location, a Gaussian filter is used to blend the snippet seamlessly into the background and to avoid hard transitions at the edges. With this approach, real-life poses, like standing, sitting and variations in arm and leg positioning, as well as people’s clothing, are cropped and can be overlaid onto the generated background while the exact location is known. This insertion procedure can be used to insert an arbitrary number of persons with random poses as well as noise, and is summarized in Figure 4.
Figure 4. Schematic process of image creation with the generative image creation approach. The sequence is from left to right: generating a background, inserting one or more persons, adjusting the transition and inserting other disturbances.
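The crop-and-blend step can be sketched as follows, assuming the snippet fits entirely inside the 32 × 32 frame; the function name, the blending band and all parameter values are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def insert_snippet(background, source, box, top_left, margin=4, sigma=1.0):
    """Paste a labeled person/disturbance crop into a generated background.

    box:      (x_min, y_min, x_max, y_max) bounding box in the labeled source frame.
    top_left: (row, col) where the crop is placed in the background.
    margin:   2 px heat halo around the object plus 2 px for smooth blending.
    """
    x0, y0, x1, y1 = box
    crop = source[max(y0 - margin, 0): y1 + margin,
                  max(x0 - margin, 0): x1 + margin]

    out = background.copy()
    r, c = top_left
    h, w = crop.shape
    out[r: r + h, c: c + w] = crop  # assumes the crop fits inside the 32 x 32 frame

    # Blur a thin band along the pasted edges so the snippet blends into the new background.
    band = np.zeros_like(out, dtype=bool)
    band[r: r + h, c: c + w] = True
    band[r + 2: r + h - 2, c + 2: c + w - 2] = False
    out[band] = gaussian_filter(out, sigma=sigma)[band]

    # Ground-truth box of the inserted object in background coordinates (margin removed again).
    new_box = (c + margin, r + margin, c + w - margin, r + h - margin)
    return out, new_box
```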

3.3. Algorithm Implementation

The implementation of YOLOv5m [] and DETR [] follows the official documentation guidelines. Both are based on the PyTorch framework. YOLOv5m is trained with the provided training routine and adjusted parameters. For the DETR algorithm [], the officially provided evaluation metrics were used and the training and test procedures were rebuilt in the PyTorch Lightning framework [] for ease of use and faster training. The DETR model was taken from the Hugging Face Transformers implementation (https://huggingface.co/docs/transformers/v4.20.1/en/model_doc/detr#transformers.DetrForObjectDetection (accessed on 2 November 2024)).
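As an orientation, the sketch below shows how the Hugging Face DETR model can be instantiated with a single "person" class and applied to a thermal frame replicated to three channels; the base checkpoint and the single-channel handling are assumptions, since the paper does not state them explicitly.

```python
import torch
from transformers import DetrForObjectDetection

# Assumed base checkpoint; the COCO-pretrained 91-class head is replaced by a 1-class ("person") head.
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=1,
    ignore_mismatched_sizes=True,
)

# A thermal frame has a single channel; repeating it three times mimics the RGB input DETR expects.
frame = torch.rand(1, 1, 32, 32)  # hypothetical normalized 32 x 32 thermal image
outputs = model(pixel_values=frame.repeat(1, 3, 1, 1))
# outputs.logits:     (1, 100, 2)  class scores per object query ("person" + no-object)
# outputs.pred_boxes: (1, 100, 4)  normalized (cx, cy, w, h) boxes per query
```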

4. Results

The following sections examine the experimental results. First, the applied performance metrics and the test dataset are explained. Subsequently, multiple tests are performed to evaluate the advantage of the generative model and to test the general robustness as well as the robustness against disturbances, movement and position, winter clothing and multiple persons.

4.1. Model Training

The training of both algorithms took place on a machine with 12 CPU cores @ 2.6 GHz (Skylake), 29.3 GB RAM, Windows 10 Pro and an Nvidia P4000 GPU with 8 GB VRAM. The parameters that differ from the YOLOv5 defaults are as follows: batch size: 800, epochs: 200 and number of workers: 2. Within the training implementation for DETR, the aforementioned parameters like batch size and epochs are selected automatically by the PyTorch Lightning framework. The training utilized, in total, 17,508 images generated by the IIG algorithm, which are based on the 3400 captured and manually labeled images from Section 3.2. In total, DETR and YOLOv5m were trained for 200 epochs. Table 3 shows the training results and speed in greater detail. The main difference is the model size: at 41 MB, YOLOv5m is much smaller than the 485 MB DETR model and is able to run on edge computing devices with low memory. Furthermore, the YOLOv5 model took only 2.6 h to train, whereas DETR required 240 h. With an inference time of 1.3 ms per image, YOLOv5 is able to detect objects at the full speed of the sensor, whereas DETR processes an image in 300 ms, which leads to a reduced frame rate.
Table 3. Training speed for YOLOv5 and DETR.
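For orientation, the non-default parameters above map onto the Ultralytics YOLOv5 training entry point roughly as follows; the dataset YAML file name and the image size are assumptions, not values stated in the paper.

```python
# Assumes the ultralytics/yolov5 repository (v7.0) is checked out and on the Python path.
import train  # train.py from the YOLOv5 repository

train.run(
    data="thermal_persons.yaml",  # hypothetical dataset description for the 17,508 generated images
    weights="yolov5m.pt",         # YOLOv5m checkpoint as the starting point
    imgsz=32,                     # assumed input size matching the 32 x 32 thermal frames
    batch_size=800,               # non-default parameters listed in the text
    epochs=200,
    workers=2,
)
```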

4.2. Performance Metrics

As a suitable evaluation metric, the mean average precision (mAP) is applied, which accounts for both the correct classification and the correct location []. To evaluate the location of the bounding box, the average precision (AP) employs the intersection over union (IoU) metric. If the IoU between the predicted bounding box and the ground truth is below a user-specified threshold, the localization is counted as incorrect. The mAP is usually calculated in one of two ways [,,,,,]: either the mean over all class APs is taken at a single IoU threshold, as is done on the PASCAL VOC dataset, or the mAP is calculated for a range of IoU thresholds and then averaged, as is common practice on the MS COCO dataset. For PASCAL VOC, the mAP is calculated with a single IoU threshold of 0.5, whereas for MS COCO the mAP is the average of the APs for IoU thresholds between 0.5 and 0.95 with a step size of 0.05, which gives more weight to the exact localization of the object in the image []. These two values are referred to as mAP0.5 (PASCAL VOC style) and mAP>0.5 (COCO style) in this paper. Since, in the context of this work, the detection of persons is more important than their exact localization, mAP0.5 is applied as the main evaluation metric.
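To make the two reported metrics concrete, the following generic sketch shows the IoU of two axis-aligned boxes and the two threshold conventions; it is an illustration, not the evaluation code used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# mAP0.5 (PASCAL VOC style): a prediction is a true positive if iou(pred, gt) > 0.5.
# mAP>0.5 (MS COCO style): the AP is averaged over the IoU thresholds 0.50, 0.55, ..., 0.95.
coco_thresholds = [0.5 + 0.05 * k for k in range(10)]
```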

4.3. Test Dataset

For evaluation purposes, multiple datasets corresponding to the tests described below were collected and extended with the generative model. The exact test datasets are specified together with the respective tests in the following sections. The test scenarios include up to five persons: persons 1 and 2, known from the creation of the training data (see Table 2), two new male participants and one new female participant. To test the robustness and generalization, a different arrangement of the room objects was established to simulate a new room. This new room configuration is used in the following sections and is called Room2.

4.4. Generative Model

From the 3071 images captured in Room1, 10% (307) are randomly selected and used only for testing the infrared image generator (IIG). These test data contain the same people and room as the generated training dataset. For Room2, the same four male participants and one additional female participant generated another 425 labeled test images, following a procedure similar to that used for the training data (see Section 3.1). For training the IIG, the 3400 images described in Section 3.2 are utilized. For the augmented dataset, the 2150 training images from Section 3.1 are included together with rotated and flipped copies, resulting in 17,722 training images overall. With the IIG algorithm, 17,508 images are created for training. All training data were captured in Room1; Room2 is only represented in the 425 test images.
Table 4 compares the mAP values of YOLOv5m models trained on different datasets and of DETR. Training lasted for 200 epochs for both YOLO and DETR. Afterwards, the test datasets for Room1 and Room2 were applied for evaluation. The abbreviations in Table 4 have the following meanings:
Table 4. Evaluation results for the YOLOv5m models and different training datasets. Bold marked values represent the best values per column. The mAP values are calculated based on the 307 data from Room1 and 425 data from Room2.
  • orig: the real measured and manually labeled data;
  • gen: data generated with the generative model (IIG);
  • aug: data that are augmented by rotating and flipping the measured orig dataset.
Table 4 shows that the highest mAP values are achieved by training on the generated data (gen), either alone or combined with the original measured data (orig). Furthermore, the decline in mAP from Room1 to Room2 is small compared to the decline observed for the augmented or original dataset. Thus, the data generation yields an improvement in precision, higher robustness to different environments and persons, and better generalization. For the last entry, the gen+orig dataset is used such that the model trained on the generated data (gen) is additionally trained on the measured data (orig). This subsequent training could lead to a decline in generalization; therefore, this entry is slightly worse than the purely generated data for the newly presented Room2. In summary, both algorithms can localize and detect humans well, with an mAP0.5 of 0.975 for DETR and 0.985 for YOLOv5m. A minor reduction in the mAP score is revealed for the new scenario (Room2) with a new person.
It can be seen in Table 4 that YOLOv5mgen and DETRgen reach a high mAP after training solely on the generated dataset. These trained models are used for the evaluation in all following sections. Due to the long training duration of DETR, it was only trained on the generated data as a reference and revealed a slightly worse mAP score than YOLOv5mgen; it is therefore not considered for further training on the augmented and original data. The exact training statistics for YOLOv5m and DETR on the generated dataset in Room1 are presented in Section 4.1 and show a substantially faster training time for YOLO as well as a real-time capability exceeding the sensor's maximum output rate of 8.3 fps (see Table 1).

4.5. General Robustness

As a next step, the general robustness of both algorithms is investigated without disturbing heat-radiating objects but with shelves, a chair, a desk and walls in the field of view (FoV). For this purpose, five new test datasets were created:
  • Reference: test data from Room1 with persons and room as known from the training;
  • New Room: Room2 was used with the same four male persons as in the reference;
  • New person: a fifth female person was captured in Room1;
  • New Room + Person: gathering the new person in Room2;
  • Second sensor: another sensor of the same type (HTPA32 × 32dR2L1.9/0.8HiC[CH]) was applied;
  • Different sensor: a similar sensor, but with a lower thermal sensitivity resulting in a lower signal-to-noise ratio, was included: HTPA32 × 32dR2L2.1/0.8F5.0HiA[Si].
As a reference, the 307 test images of Room1 were applied; these contain the same environment (Room1) and the same four male occupants as the model training on the 17,508 images generated by the IIG algorithm. The change in environment causes fewer problems for both algorithms than a new person, as the mAP values in Table 5 indicate. Furthermore, the YOLOv5 model performs better than DETR for a new person and room, although DETR yielded a higher accuracy on the validation data during training. Both models prove to be robust against a change in occupants or a new environment, with mAP0.5 > 0.91 and mAP>0.5 > 0.5, respectively.
Table 5. General robustness of the two algorithms. Comparison of test data similar to the training, with a new person and in a new environment as well as the change to an equal sensor type and a different sensor type with poorer signal-to-noise ratio and the same resolution. Bold numbers indicate the better values.
DETR reveals fewer issues with the change to a sensor of the same type (second sensor) as far as the detection of a person is concerned, but the YOLOv5 model exhibits a superior ability in localizing persons. This contrasts with the change to a different sensor model with a poorer signal-to-noise ratio, where YOLOv5m still reaches a high mAP0.5 of 0.884 while DETR reaches only 0.28. In general, both models show significantly worse detection performance on images from the different sensor. DETR especially lacks robustness to the sensor change and cannot generalize to the poorer image quality. This could be because it has adapted too closely to the images seen during training (overfitting).

4.6. Robustness Against Disturbances

In order to make the algorithms robust against disturbances, 10,500 (60%) of the 17,508 generated training images contained disturbances. First, two datasets simulating extreme disturbances were created, one with hot and one with cold objects, to determine the respective influence of the temperatures. As disturbance sources, objects that can occur in an office and work environment were selected. For these tests, a dataset was captured with the following hot objects: a kettle, a coffee maker, a cup with a hot beverage, two radiators (one wall heater and one electric radiator), floor heating and two chargers for a laptop and a cell phone, in addition to the previously present laptop and monitor heat sources. As cold objects, cool packs, ice cream, cold drinks and an open fridge and freezer were selected as sources of interference. As the reference mAP value, the 425 IR images of Room2 with the new fifth person are applied. Table 6 reveals that the YOLO model is very robust against the mentioned interference sources and exceeds DETR. According to this test, the influence of heat also appears to be stronger than that of cold objects. This could be because a human body is itself a heat source, so distinguishing it from interfering heat sources is more challenging. The next two groups of data focus on the specific influence of thermal radiation, such as a warm air flow and sun rays through a window that heat the surfaces. In the first of the three warm air datasets, the hair dryer that simulates the warm air flow is simply directed into the test environment without directly targeting an object. In the second, it directly irradiates the test subject, and in the third it is aimed directly at the sensor. YOLO and DETR both yield an mAP comparable to the reference. Thus, it can be concluded that these disturbances have a minor influence on the detection and localization accuracy.
Table 6. Influence of heat and cold on the two algorithms. Bold numbers indicate the better values.

4.7. Robustness of Movement and Position

This section examines the influence of movement, posture and position within the image. For this purpose, in the first test scenario, datasets were created that contain standing and sitting people both at rest and in motion. The datasets were recorded centrally under the sensor, except for the datasets named "on the edge", which contain images of persons standing and walking at the edge of the sensor's field of view.
Motion has little influence on the detection performance of both algorithms, as Table 7 shows. Their mAP0.5 values are consistently very high, which shows that they reliably detect the people in the images. Only the exact localization, indicated by the lower mAP>0.5 values, appears to be more difficult for YOLOv5 when a person moves while sitting. DETR shows the worst localization results when sitting at rest. Detection at the edge of the image is also no issue for either approach.
Table 7. Influence of movement and rest on the two algorithms. Bold numbers indicate the better values.

4.8. Influence of Winter Clothing

As in Section 4.6, Room2 serves as the reference. The test subjects in the reference scenario were dressed in spring clothing, with shorts or long pants combined with a t-shirt. For the winter clothing dataset, the test subjects wear a thick winter jacket in addition to long pants in each case. The winter hat dataset adds a hat to the winter outfit. Table 8 shows that a reliable detection of individuals is possible despite winter clothing. Adding the winter hat, however, significantly decreases the detection performance. This shows the importance of the large amount of heat radiated from the head for the detection of a person. For this task, the YOLOv5 model is superior to the DETR model.
Table 8. Influence of winter clothing and a winter hat. Bold numbers indicate the better values.

4.9. Multiple Persons

As a final evaluation, the influence of multiple persons is examined in a room cleared of objects to obtain sufficient space for up to five people within the 250 cm × 250 cm field of view of the sensor. In the training dataset, only one or two persons are present per image. This test therefore determines how well the models have learned the representation and features of people and whether they can generalize to many people in an image, i.e., to a situation that is completely unknown to the models. The results of this last evaluation test are shown in Table 9. YOLOv5 and DETR both showed no problems in detecting one or two persons. For three and four people, YOLO still provides good detection precision. With five persons in the image, this algorithm encounters noticeable difficulties. In the case of DETR, the trained model already reaches its limits with the detection of three occupants. Figure 5 presents YOLO detections for two randomly selected images from the dataset with five persons. As the images reveal, the detected persons match the ground truth; however, the model is not capable of detecting all the people present in the image.
Table 9. Influence of multiple people on trained models with up to two persons. Bold numbers indicate the better values.
Figure 5. Visualization example of YOLOv5 detection of five people in one image. On the left is the ground truth and on the right are the labels provided by YOLOv5.

5. Discussion

The goal of this paper was to find and evaluate an algorithm that robustly detects and localizes people with a 32 × 32-pixel thermopile array sensor. Based on the conducted literature review, two single-stage detectors, YOLOv5m and DETR, were selected due to their potential to process the images of the infrared sensor in real time. YOLO reached 792 fps and is capable of processing the images at the full sensor speed, whereas DETR achieves 3 fps and is substantially slower than the sensor's default output rate of 8.3 fps. This observation is similar to the results of the object detection algorithm comparison by Zaidi et al. []. Given the 41 MB model size of YOLO, running this algorithm on an edge computing device is more feasible than running the 485 MB DETR model. Furthermore, YOLOv5 was trained for 200 epochs in 2.6 h, whereas the best epoch for DETR was 44 and took 53 h. For training, a set of 3400 images was manually labeled and extended by a novel thermal image generator (IIG) to a total of 17,508 images. This image generator is able to generate an arbitrary number of images from a few labeled samples and achieves a higher mAP than solely augmenting images by flipping and rotation. This is accomplished by creating various backgrounds from the density function of the input data and blending in real cropped and filtered persons and disturbances. The evaluation of the robustness of the two algorithms reveals that, after training, YOLOv5m reaches 99% mAP0.5 while DETR reaches 98% mAP0.5. Both YOLOv5 and DETR provide very accurate models that are well-suited for person detection in thermal images, with exceptionally high precision and applicability to untrained persons and environments. Replacing the sensor with one of the same type is also possible; however, a change to a different sensor model with a poorer signal-to-noise ratio affects the precision, and retraining is recommended. In this evaluation, both models achieved over 98% mAP0.5 on real data with known individuals and over 95% mAP0.5 on real data with untrained people in an unknown environment; YOLOv5 was about 1% more accurate than DETR in both categories. YOLO is also less affected by untrained interfering heat or cold objects, whereas the mAP0.5 of DETR drops to 50% for unknown hot objects such as kettles, hot beverages and heaters. A warm air stream and sun radiation present only a small effect, as do occupants moving or resting within the field of view and at its edges. A far larger decrease is caused by subjects wearing a winter hat and by more persons in the scene than were present during training. In summary, it can be stated that YOLOv5m is superior to DETR.
No clear statement can be made about the generalization of these results, since the training of the algorithms and the creation and selection of the training and test data may favor one of the two models. In addition, the choice of methods was made from a very narrow set, so a much more suitable model that is not considered in this work may exist, such as the Swin Transformer []. Also, the focus on mAP might bias the results of this study or might not represent them comprehensively enough. Therefore, these algorithms cannot be assumed to be the generally optimal option. In future work, the classification could be extended to multiple classes like sitting, standing or lying, and disturbances could be labeled. The localization results might also be used to create trajectories and obtain a better understanding of movement inside the field of view. Despite these limitations, it can be assumed that the created YOLOv5m model is a detector that delivers very good precision and fulfills the requirements of this work.

6. Conclusions

This paper evaluates two single-stage detectors, YOLOv5m and DETR, as person detection and localization approaches for 32 × 32-pixel thermal images. Across multiple scenarios and disturbances, both algorithms achieve a good precision of 99% for YOLOv5m and 98% for DETR and can therefore both be applied for person detection and localization tasks. Likewise, both algorithms have a high tolerance to disturbing sources, based on the training dataset created by the generative model. YOLOv5 has a much faster inference time and a smaller model, so it is more suitable for running on edge computing devices. Furthermore, YOLOv5m is more robust against multiple disturbances.

Author Contributions

Conceptualization, S.K. and J.L.; methodology, S.K. and J.L.; software, J.L.; validation, S.K., J.L. and S.B.; formal analysis, S.K. and S.B.; investigation, J.L.; resources, J.L.; data curation, J.L.; writing—original draft preparation, S.K.; writing—review and editing, J.L. and S.B.; visualization, J.L.; supervision, T.Q.K.; project administration, T.Q.K.; funding acquisition, S.K. and T.Q.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant no. 445336968 and by the Open Access Publishing Fund of the Technical University of Darmstadt.

Institutional Review Board Statement

Ethical review and approval were waived for this study since all human subjects were completely anonymized and signed a declaration agreeing to the anonymous recording by means of the low-resolution IR array and confirming that it did not cause undue stress to them in any way.

Data Availability Statement

The novel infrared image generator (IIG) algorithm, as well as the original data presented in this paper, are openly available on GitHub: https://github.com/KlirS/Multi-Person-Localization-based-on-a-Thermopile-Array-Sensor-with-a-Generative-Data-Model (accessed on 2 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Klir, S.; Fathia, R.; Benkner, S.; Babilin, S.; Khanh, T.Q. Preference Lighting Model: Generalization of lighting preferences for individual users. In Proceedings of the 2021 Joint Conference—11th International Conference on Energy Efficiency in Domestic Appliances and Lighting & 17th International Symposium on the Science and Technology of Lighting (EEDAL/LS:17), Toulouse, France, 6 January 2022; pp. 1–6. [Google Scholar]
  2. Fan, X.; Zhang, H.; Leung, C.; Shen, Z. Robust unobtrusive fall detection using infrared array sensors. In Proceedings of the 2017 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Daegu, Republic of Korea, 16–18 November 2017; pp. 194–199. [Google Scholar]
  3. Hayashida, A.; Moshnyaga, V.; Hashimoto, K. The use of thermal IR array sensor for indoor fall detection. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 594–599. [Google Scholar]
  4. Morawski, I.; Lie, W.N.; Chiang, J.C. Action Prediction Using Extremely Low-Resolution Thermopile Sensor Array for Elderly Monitoring. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 984–988. [Google Scholar]
  5. Shetty, A.D.; Disha; Shubha, B.; Suryanarayana, K. Detection and tracking of a human using the infrared thermopile array sensor—‘Grid-EYE’. In Proceedings of the 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kerala, India, 6–7 July 2017; pp. 1490–1495. [Google Scholar]
  6. Tariq, O.B.; Lazarescu, M.T.; Lavagno, L. Neural Networks for Indoor Person Tracking with Infrared Sensors. IEEE Sens. Lett. 2021, 5, 9–12. [Google Scholar] [CrossRef]
  7. Corsi, C. History highlights and future trends of infrared sensors. J. Mod. Opt. 2010, 57, 1663–1686. [Google Scholar] [CrossRef]
  8. Gu, N.; Yang, B.; Li, T. High-resolution Thermopile Array Sensor-based System for Human Detection and Tracking in Indoor Environment. In Proceedings of the 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, 9–13 November 2020; pp. 1926–1931. [Google Scholar]
  9. Chen, Z.; Wang, Y.; Liu, H. Unobtrusive sensor-based occupancy facing direction detection and tracking using advanced machine learning algorithms. IEEE Sens. J. 2018, 18, 6360–6368. [Google Scholar] [CrossRef]
  10. Gochoo, M.; Tan, T.-H.; Batjargal, T.; Seredin, O.; Huang, S.-C. Device-Free Non-Privacy Invasive Indoor Human Posture Recognition Using Low-Resolution Infrared Sensor-Based Wireless Sensor Networks and DCNN. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 2311–2316. [Google Scholar]
  11. Tateno, S.; Meng, F.; Qian, R.; Hachiya, Y. Privacy-Preserved Fall Detection Method with Three-Dimensional Convolutional Neural Network Using Low-Resolution Infrared Array Sensor. Sensors 2020, 20, 5957. [Google Scholar] [CrossRef] [PubMed]
  12. Petrova, G.; Spasov, G.; Iliev, I. A Review on Applications of Low-resolution IR Array Sensors in Ambient-Assisted Living. In Proceedings of the 2021 XXX International Scientific Conference Electronics (ET), Sozopol, Bulgaria, 15–17 September 2021; pp. 1–5. [Google Scholar]
  13. Parnin, S.; Rahman, M.M. Human location estimation using thermopile array sensor. IOP Conf. Ser. Mater. Sci. Eng. 2017, 260, 012007. [Google Scholar] [CrossRef]
  14. Shubha, B.; Shastrimath, V.V.D. Real-Time Occupancy Detection System Using Low-Resolution Thermopile Array Sensor for Indoor Environment. IEEE Access 2022, 10, 130981–130995. [Google Scholar] [CrossRef]
  15. Honorato, J.L.; Spiniak, I.; Torres-Torriti, M. Human detection using thermopiles. In Proceedings of the 2008 IEEE Latin American Robotic Symposium, Salvador, Brazil, 29–30 October 2008; pp. 151–157. [Google Scholar]
  16. Trofimova, A.A.; Masciadri, A.; Veronese, F.; Salice, F. Indoor Human Detection Based on Thermal Array Sensor Data and Adaptive Background Estimation. J. Comput. Commun. 2017, 5, 16–28. [Google Scholar] [CrossRef]
  17. Perra, C.; Kumar, A.; Losito, M.; Pirino, P.; Moradpour, M.; Gatto, G. Monitoring indoor people presence in buildings using low-cost infrared sensor array in doorways. Sensors 2021, 21, 4062. [Google Scholar] [CrossRef] [PubMed]
  18. Vandersteegen, M.; Reusen, W.; Beeck, K.V.; Goedemé, T. Person Detection Using an Ultra Low-resolution Thermal Imager on a Low-cost MCU. In Image and Vision Computing. IVCNZ 2022. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; pp. 1–10. [Google Scholar]
  19. Uijlings, J.R.R.; Sande, K.E.A.V.D.; Gevers, T.; Smeulders, A.W.M. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  20. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  21. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
  22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef]
  23. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  24. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. A Rev. J. 2022, 126, 1–18. [Google Scholar] [CrossRef]
  25. Girshick, R. Fast R-CNN. Proc. IEEE Int. Conf. Comput. Vis. 2015, 2015, 1440–1448. [Google Scholar]
  26. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 30th IEEE Conference Computer Vision Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  29. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  30. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016, 2016, 779–788. [Google Scholar]
  31. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1–15. [Google Scholar]
  32. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. [Google Scholar]
  33. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  35. Song, H.; Sun, D.; Chun, S.; Jampani, V.; Han, D.; Heo, B.; Kim, W.; Yang, M.H. An Extendable, Efficient and Effective Transformer-based Object Detector. arXiv 2022, arXiv:2204.07962. [Google Scholar]
  36. Anyanwu, G.O.; Nwakanma, C.I.; Putri, A.R.; Lee, J.M.; Kim, D.S.; Kim, J.; Hwang, G. Thermal Array Sensor Resolution-Aware Activity Recognition using Convolutional Neural Network. In Proceedings of the 2022 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 21–24 February 2022; pp. 372–376. [Google Scholar]
  37. Chen, W.H.; Ma, H.P. A fall detection system based on infrared array sensors with tracking capability for the elderly at home. In Proceedings of the 2015 17th International Conference on E-Health Networking, Application & Services (HealthCom), Boston, MA, USA, 14–17 October 2015; pp. 428–434. [Google Scholar]
  38. Mashiyama, S.; Hong, J.; Ohtsuki, T. Activity recognition using low resolution infrared array sensor. In Proceedings of the 2015 IEEE International Conference on Communications (ICC), London, UK, 8–12 June 2015; pp. 495–500. [Google Scholar]
  39. Muthukumar, K.A.; Bouazizi, M.; Ohtsuki, T. A Novel Hybrid Deep Learning Model for Activity Detection Using Wide-Angle Low-Resolution Infrared Array Sensor. IEEE Access 2021, 9, 82563–82576. [Google Scholar] [CrossRef]
  40. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
  41. Ahmad, M.; Ahmed, I.; Ullah, K.; Khan, I.; Khattak, A.; Adnan, A. Person detection from overhead view: A survey. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 567–577. [Google Scholar] [CrossRef]
  42. Taramasco, C.; Rodenas, T.; Martinez, F.; Fuentes, P.; Munoz, R.; Olivares, R.; De Albuquerque, V.H.; Demongeot, J. A novel monitoring system for fall detection in older people. IEEE Access 2018, 6, 43563–43574. [Google Scholar] [CrossRef]
  43. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. ultralytics/yolov5: V7.0-YOLOv5 SOTA Realtime Instance Segmentation (v7.0); Zenodo: Genève, Switzerland, 2022. [Google Scholar]
  44. Falcon, W.; The PyTorch Lightning Team. PyTorch Lightning (2.0.8); Zenodo: Genève, Switzerland, 2023. [Google Scholar]
  45. Padilla, R.; Netto, S.L.; Silva, E.A.B.D. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
