A Deep Learning-Enhanced Multi-Modal Sensing Platform for Robust Human Object Detection and Tracking in Challenging Environments

Abstract: In modern security situations, tracking multiple human objects in real-time within challenging urban environments is a critical capability for enhancing situational awareness, minimizing response time, and increasing overall operational effectiveness. Tracking multiple entities enables informed decision-making, risk mitigation, and the safeguarding of civil-military operations to ensure safety and mission success. This paper presents a multi-modal electro-optical/infrared (EO/IR) and radio frequency (RF) fused sensing (MEIRFS) platform for real-time human object detection, recognition, classification, and tracking in challenging environments. By utilizing different sensors in a complementary manner, the robustness of the sensing system is enhanced, enabling reliable detection and recognition results across various situations. Specifically designed radar tags and thermal tags can be used to discriminate between friendly and non-friendly objects. The system incorporates deep learning-based image fusion and human object recognition and tracking (HORT) algorithms to ensure accurate situation assessment. After integrating into an all-terrain robot, multiple ground tests were conducted to verify the consistency of the HORT in various environments. The MEIRFS sensor system has been designed to meet the Size, Weight, Power, and Cost (SWaP-C) requirements for installation on autonomous ground and aerial vehicles.


Introduction
Autonomous vehicles, including unmanned aerial vehicles (UAVs) [1][2][3] and unmanned ground vehicles (UGVs) [4], have found extensive applications in agriculture [5], data acquisition [6], and search and rescue due to their mobility and operational simplicity. One significant capability desired in these search and surveillance scenarios is the ability of autonomous vehicles to recognize human subjects' actions and respond accordingly. Electro-optical (EO) cameras have become essential tools on UAV and UGV platforms to enhance situational awareness, perform object detection, and enable efficient tracking capabilities. Cameras provide valuable visual information that aids in various applications, including search and rescue operations, surveillance missions, and security monitoring.
However, recognizing human objects from videos captured by a mobile platform presents several challenges. The articulated structure and range of possible poses of the human body make human object recognition and tracking (HORT) a complex task. Humans exhibit diverse movements and postures, making it difficult for an autonomous system to accurately recognize and track them in video footage. Additionally, the quality of the captured videos further complicates the recognition and classification process.
The major contributions of this paper are: (1) identifying the appropriate sensors that can provide the required information in different situations; (2) building the hardware prototype of the proposed sensor system with both hardware integration and software implementation; and (3) verifying the effectiveness of the sensor system with both indoor and outdoor experiments. By employing state-of-the-art sensors and well-tested DL-enhanced algorithms, a robust and reliable sensor system for real-time human target detection, identification, and tracking was successfully demonstrated.

Figure 1 illustrates the complete structure of the MEIRFS sensor system designed for human object detection, recognition, and tracking. The edge platform (UAV or UGV) is equipped with all the required sensors for detecting and continuously tracking human objects. These sensors include the ranging radar, EO/IR camera, laser range finder, differential barometer, and a pan/tilt platform. Additionally, the known friendly object (designated as blue in Figure 1) is equipped with an IR emitter and an RF transponder, enabling easy recognition by the MEIRFS system amidst all the detected human objects. Ideally, it would be desirable to have a solution that can accurately detect and identify friendly human objects without the need for additional tags or markers. However, in practical scenarios, it is challenging, if not impossible, to find a single method that can work effectively in all situations. For instance, visible light imaging can provide valuable color and feature patterns that can be used to differentiate between unknown friend and foe objects in well-illuminated environments. However, this approach may not be effective in low-light or dark environments. In such cases, additional measures are necessary to promptly and correctly identify friendly human objects.

System Architecture
To address this challenge, the use of tags or markers becomes essential. By incorporating such tags, the detection and recognition of friendly forces can be enhanced, facilitating effective communication and decision-making in challenging scenarios. Employing tags or markers for friendly human object detection can enhance security operations, enabling efficient coordination among personnel and ultimately improving the overall effectiveness and safety of challenging missions.

Radio Frequency (RF) Subsystem
The RF subsystem comprises an LFMCW ranging radar, with the transceiver located on the platform and the transponder situated on the friendly object. Additionally, a smart antenna is positioned on the platform side. The LFMCW transceiver, illustrated in Figure 2a, consists of an LFMCW transmitter, an LFMCW receiver with frequency/range scanning capability, and a signal processor. The RF system incorporates a smart antenna capable of estimating the angle between the platform and the friendly object. The smart antenna achieves a measurement accuracy of 0.8° and effectively suppresses multipath signals reflected from the ground, walls, and ceilings. Figure 2b displays the radar transponder situated on the friendly object side. The entire radar subsystem underwent testing in an indoor environment, as depicted in Figure 2c, which showcases the measured distance between the platform and the friendly object. The results demonstrate the consistent detection and accurate distance measurement capabilities of the self-developed MEIRFS radar subsystem.
To enhance the signal-to-noise ratio (SNR) and range detection, several techniques were implemented:
1. The RF signals were sampled multiple times (typically eight samples), and Fast Fourier Transform (FFT) calculations were performed on each sample. The results were then averaged, improving the SNR and extending the detection range;
2. Due to varying hardware gain responses across the baseband spectrum, it was necessary to determine the local signal noise floor as a reference. By comparing the real signal with the local noise floor instead of the entire baseband noise floor, accurate detection can be achieved;
3. Local averaging windows were utilized to establish the appropriate reference level, contributing to improved detection accuracy.
The current radar range cutoff stands at just over 27 m; if required, parameters can be adjusted to enable a longer detection range. The distance measurement update rate is set at 7 times per second, and at this refresh rate the average current draw is 700 mA at 6 V. The refresh rate can be increased if certain radar functions, which are normally turned off between updates to conserve power, are instead kept active. The capabilities of the MEIRFS RF subsystem were tested and verified in both outdoor open environments and wooded areas. Furthermore, it was confirmed that the RF subsystem consistently detects the distance of human objects equipped with radar transponders, even through multiple drywalls.
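For illustration, the averaging and local-noise-floor techniques listed above can be sketched in Python/NumPy. This is a minimal reimplementation sketch, not the radar firmware; the function name, sample layout, and 10 dB threshold are our assumptions:

```python
import numpy as np

def detect_beat_frequency(samples, fs, n_avg=8, window=16, threshold_db=10.0):
    """Average several chirps' spectra, then compare each bin against a
    local noise-floor estimate instead of the global floor (sketch)."""
    # 1. Average the magnitude spectra of n_avg chirps to raise the SNR.
    spectra = [np.abs(np.fft.rfft(s)) for s in samples[:n_avg]]
    power_db = 20 * np.log10(np.mean(spectra, axis=0) + 1e-12)
    # 2./3. A local averaging window serves as the reference noise floor,
    # compensating for gain variation across the baseband spectrum.
    kernel = np.ones(window) / window
    local_floor = np.convolve(power_db, kernel, mode="same")
    snr = power_db - local_floor
    peak = int(np.argmax(snr))
    if snr[peak] < threshold_db:
        return None  # no target above the local floor
    # Bin-to-frequency mapping; range then follows from the chirp slope.
    return peak * fs / len(samples[0])
```

Averaging the magnitude spectra before thresholding reduces the variance of the noise floor, which is what extends the usable detection range.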

EO/IR Subsystem
The EO/IR subsystem comprises an EO camera, an IR camera, a laser rangefinder situated on the platform side, a controllable IR emitter on the friendly object side, and a pan/tilt platform. Within the subsystem, the EO camera utilizes a 3D stereo camera for visible image acquisition and depth sensing, while the long-wavelength IR camera is employed for thermal detection. Two options for IR cameras are available, allowing for interchangeability to accommodate different detection ranges. Both options have undergone comprehensive testing and successful implementation.
Aligned with the viewing direction of the IR camera, the laser rangefinder is capable of measuring distances up to 100 m. The IR subsystem consistently distinguishes between LOS friendly and non-friendly objects by analyzing the IR signal emitted from the IR emitter carried by the friendly object.
The hardware arrangement of the IR subsystem is depicted in Figure 3a. Both the IR camera and the laser rangefinder are aligned to point in the same direction and are mounted on the pan/tilt platform, allowing for rotation in various directions. The laser rangefinder is utilized to measure the distance of the object located at the center of the IR image's field of view. As shown in Figure 3b, the process begins with the capture of the first image at time t1 from the IR camera, which detects the human object. The object's position within the IR image's field of view is then calculated. Subsequently, the lateral angle position α and the vertical angle position φ of the object relative to the IR camera's pointing direction can be determined. These calculated angle positions are then sent to the pan/tilt platform, which adjusts the IR subsystem's orientation to center the object within the IR camera's field of view. Thus, at time instant t2, the distance of the object can be measured using the laser rangefinder. Figure 3c presents the flowchart illustrating the working principle of the EO/IR subsystem, highlighting its functionality in detecting, tracking, and measuring the distance of the object of interest.
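The angle computation in this centering loop can be sketched with a simple pinhole-camera model. The function below is illustrative only; the name, the symmetric-FOV assumption, and the neglect of lens distortion are ours, not the system's actual code:

```python
import math

def pixel_to_angles(px, py, width, height, hfov_deg, vfov_deg):
    """Convert a detected object's pixel position into pan (alpha) and
    tilt (phi) offsets relative to the camera's optical axis."""
    # Focal lengths in pixels, derived from the horizontal/vertical FOV.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    # Pixel offsets from the image center (image-down is positive dy).
    dx = px - width / 2
    dy = py - height / 2
    # Lateral and vertical angle offsets, in degrees.
    alpha = math.degrees(math.atan2(dx, fx))
    phi = math.degrees(math.atan2(dy, fy))
    return alpha, phi
```

Commanding the pan/tilt stage by (alpha, phi) drives the detection toward the image center, where the collocated rangefinder can then measure its distance.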


Electro-Optical (EO) Camera
The 3D stereo camera from Stereolabs is used as the EO camera for both visible image acquisition and depth sensing. The camera offers advanced depth sensing capabilities and is widely used for applications such as robotics, virtual reality, autonomous navigation, and 3D mapping. Some key features of the Stereolabs 3D camera include a high-resolution (1920 × 1080 pixels) visible image, depth sensing, real-time 3D mapping, and a comprehensive software development kit (SDK).
In our specific application, we utilize the image captured by the left camera of the 3D stereo camera as the EO image. The left image serves as the basis for human object detection and tracking using visible light. By leveraging the visible light spectrum, one can benefit from the detailed texture information and visual cues present in the EO image, enabling accurate detection and tracking of human subjects.

Infrared (IR) Camera
The IR subsystem incorporates two different IR cameras for varying human object detection ranges: the 9640P IR camera from ICI and the Boson 320 IR camera from Teledyne. The selection and testing of these cameras were performed to adapt to different detection requirements.
The short-range Boson 320 IR camera boasts a compact size of 21 × 21 × 11 mm and weighs only 7.5 g. It is equipped with a 6.3 mm lens and offers a horizontal field of view (FOV) of 34°. This camera is capable of detecting human objects up to a range of 25 m. It features exceptional thermal sensitivity of 20 mK or less and an upgraded automatic gain control (AGC) filter that enhances scene contrast and sharpness in all environments. With a fast frame rate of up to 60 Hz, it enables real-time human object detection. The image resolution of this camera is 320 × 256 pixels, and the image stream is transferred in real-time from the camera to the host PC via a universal serial bus (USB) cable.
On the other hand, the long-range ICI 9640P is a high-quality thermal-grade IR camera with an image resolution of 640 × 512 pixels. It utilizes a 50 mm athermalized lens, providing a FOV of 12.4° × 9.3°, and has a total weight of 230 g. This ICI IR camera achieves a detection range exceeding 100 m. The maximum frame rate supported by this camera is 30 Hz.
By incorporating both the Boson 320 and the ICI 9640p cameras into the IR subsystem, the MEIRFS system can adjust to different detection ranges, ensuring flexibility and adaptability in various scenarios.
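To see how the two cameras' FOV/resolution trade-off translates into detection range, one can estimate the pixel extent of a target at a given distance. The following small-angle sketch (with an assumed 0.5 m target width) is our illustration, not part of the MEIRFS software:

```python
import math

def pixels_on_target(target_size_m, distance_m, hfov_deg, h_pixels):
    """Approximate pixel extent of a target at range, from the
    camera's angular resolution per pixel."""
    rad_per_pixel = math.radians(hfov_deg) / h_pixels
    # Angle subtended by the target at the given distance.
    angle = 2 * math.atan(target_size_m / (2 * distance_m))
    return angle / rad_per_pixel
```

With these assumed numbers, the 320-pixel, 34° Boson puts roughly ten pixels on a 0.5 m wide target at 25 m, while the narrow-FOV, higher-resolution ICI retains a comparable pixel count beyond 100 m, which is consistent with the stated detection ranges.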

Laser Rangefinder
To overcome the limitation of the IR camera in measuring the distance of detected objects, we integrated a laser rangefinder, the SF30/C, from Lightware into our system. The laser rangefinder is specifically designed to provide accurate distance measurements. It is aligned with the viewing direction of the IR camera, and both devices are mounted on a rotary stage. The collocated configuration ensures that the laser rangefinder is always directed towards the center of the IR camera's field of view (FOV).
When a human object of interest is detected in the FOV, the rotary stage automatically adjusts the orientation of the IR subsystem to the center of the object, affording the precise position of the object relative to the platform of the sensor system. By combining the information from the IR camera, which provides the location of the object, and the laser rangefinder, which provides the distance measurement, MEIRFS can accurately determine the spatial coordinates of the human object in real-time.
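Combining the pan/tilt pointing angles with the rangefinder distance amounts to a spherical-to-Cartesian conversion, sketched below. This is an idealized model (it ignores the sensor offsets on the platform), and the axis convention is our assumption:

```python
import math

def target_position(distance_m, pan_deg, tilt_deg):
    """Convert a rangefinder distance plus pan/tilt angles into
    Cartesian coordinates relative to the sensor platform."""
    pan = math.radians(pan_deg)
    tilt = math.radians(tilt_deg)
    x = distance_m * math.cos(tilt) * math.sin(pan)   # lateral
    y = distance_m * math.cos(tilt) * math.cos(pan)   # forward
    z = distance_m * math.sin(tilt)                   # vertical
    return x, y, z
```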

Sensor System Integration
The proposed MEIRFS system is designed to be versatile and applicable to both UAVs and UGVs for various tasks. In this paper, we demonstrate the successful integration and mounting of the MEIRFS system onto an all-terrain robot platform to conduct ground tests.
By deploying the MEIRFS system on a UGV, its performance and capabilities are evaluated in real-world scenarios encountered by ground-based robotic platforms. The all-terrain robot platform provides a suitable environment for testing the MEIRFS system's functionalities, such as human object detection, recognition, and tracking. These tests help validate the effectiveness and robustness of the MEIRFS system under different sensor, environment, and object conditions.
Integrating the MEIRFS system onto the all-terrain robot platform enables us to assess its performance in practical ground-based applications, paving the way for potential deployment on both UAVs and UGVs for diverse tasks such as surveillance, search and rescue, and security operations.
To ensure an organized and compact design, all the cables of the MEIRFS system are carefully managed and extended to the interior of the robot. Inside the robot, two 12 V batteries are utilized to generate a 24 V DC power supply, which is required for operating both the rotary stage and the robot's wheels.
In terms of connectivity, a single USB cable is all that is necessary to establish communication between the MEIRFS system and the host computer. The USB cable connects to a USB hub integrated into the robot, facilitating seamless communication between the host computer and all the sensors as well as the rotary stage. By consolidating the cables and employing a simplified connection scheme, the MEIRFS system ensures efficient and streamlined communication, minimizing clutter and simplifying the setup process. The organized arrangement enhances the overall functionality and practicality of the system during operation.

Software Package
To facilitate user control and provide a comprehensive display of the detection results, a graphical user interface (GUI) software package was developed. The MEIRFS GUI software serves as a centralized platform for communication and control between the host computer and all the hardware devices in the sensor system.
The GUI software, illustrated in Figure 5, enables seamless communication and data exchange with the various components of the sensor system. The GUI acts as a user-friendly interface for controlling and configuring the system, as well as displaying key data and detection results in a clear and organized manner. Through the GUI software, users can conveniently interact with the sensor system, adjusting settings, initiating detection processes, and monitoring real-time data. The software provides an intuitive and efficient means of accessing and managing the functionalities of the MEIRFS system. Specifically, the GUI software has been developed with the following capabilities: (1) display the image acquired from the EO/IR cameras; and (2) configure the machine learning model for human object detection.

The measurement results from the various sensors in the MEIRFS system are transmitted to the host computer at different data update rates. To ensure accurate tracking of the object, these measurements are synchronized within the GUI software to calculate the object's position. In the MEIRFS system, the IR camera plays a crucial role in human object detection, recognition, and tracking. Therefore, the measurements from other sensors are synchronized with the update rate of the IR camera.
During our testing, the real-time human object detection process achieved a continuous frame rate of approximately 35 frames per second (fps) when the laptop computer (equipped with an Intel Core i9-11900H CPU and Nvidia RTX-3060 laptop GPU) was connected to a power source. When the laptop computer operated solely on battery, the frame rate reduced to about 28 fps. Each time a new frame of the IR image is received in the image acquisition thread, the software updates the measured data from all the sensors. The synchronization ensures that the measurement results from different sensors are aligned with the latest IR image frame, providing accurate and up-to-date information for human object detection and tracking.
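The latest-value synchronization scheme described above, where each sensor updates at its own rate and a consistent snapshot is taken whenever a new IR frame arrives, can be sketched as follows. This is an illustrative pattern, not the GUI's actual code; all names are ours:

```python
import threading

class SensorSync:
    """Pair each incoming IR frame with the most recent reading
    from every other sensor (latest-value synchronization)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def update(self, sensor, value):
        # Called from each sensor's acquisition thread at its own rate.
        with self._lock:
            self._latest[sensor] = value

    def snapshot_on_ir_frame(self, ir_frame):
        # Called once per IR frame: take a consistent snapshot of all
        # sensor readings aligned with this frame.
        with self._lock:
            return {"ir_frame": ir_frame, **self._latest}
```

Because the IR camera drives detection and tracking, keying snapshots to its frame rate keeps all derived position estimates aligned with the latest image.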

Deep Learning-Based Algorithm for Human Object Detection
After evaluating various DL-based object detection algorithms suitable for real-time applications [25,26], we selected the open-source YOLOv4 (You Only Look Once) detector [7] as the tool for EO/IR image analysis in human object detection. The YOLOv4 detector is recognized as one of the most advanced DL algorithms for real-time object detection. It employs a single neural network to process the entire image, dividing it into regions and predicting bounding boxes and probabilities for each region. These bounding boxes are weighted based on the predicted probabilities.
The YOLOv4 model offers several advantages over classifier-based systems. It considers the entire image during testing, leveraging global context to enhance its predictions. Unlike systems such as the region-based convolutional neural network (R-CNN), which require thousands of network evaluations for a single image, YOLOv4 makes predictions in a single evaluation, making it remarkably fast. In fact, it is over 1000 times faster than R-CNN and 100 times faster than Fast R-CNN [7].
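YOLO-family detectors emit many overlapping candidate boxes per object, which are reduced to one detection each with non-maximum suppression (NMS). The sketch below shows the standard greedy procedure in plain Python; it is illustrative and independent of any specific YOLOv4 implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop any box that
    overlaps a kept box by more than iou_thresh, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```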
To ensure the YOLOv4 detector's effectiveness in different scenarios, we gathered more than 1000 IR images encompassing various cases, as depicted in Figure 6. Additionally, we considered scenarios where only a portion of the human body was within the IR camera's field of view, such as the lower body, upper body, right body, and left body. Once the raw IR image data was annotated, both the annotated IR images and their corresponding annotation files were used as input for training the YOLOv4 model. The pre-trained YOLOv4 model, initially trained on the Microsoft Common Objects in Context (COCO) dataset, served as the starting point for training with the annotated IR images.
Once the training of the YOLOv4 model was finalized, we evaluated its performance using IR images that were not included in the training process. Figure 7 showcases the effectiveness of the trained YOLOv4 model in accurately detecting human objects across various scenarios, including: (1) human object detection in indoor environments; (2) human object detection in outdoor environments; (3) detection of multiple human objects within the same IR image; (4) human object detection at different distances; and (5) human object detection regardless of human body gestures. The trained YOLOv4 model exhibited satisfactory performance in all these scenarios, demonstrating its ability to robustly detect human objects in diverse environments and under various conditions.

Sensor Fusion and Multi-Target Tracking
Although the IR image alone is effective for human object detection, it may not provide optimal performance in multiple human object tracking tasks due to its limited color and texture information compared to visible light images. To address this limitation and achieve accurate human object tracking in complex scenarios, images from both the IR camera and the EO camera were utilized. To enhance the features in these images, a DL-based image fusion algorithm was developed. Image fusion combines the information from the IR and EO images to create fused images that offer improved detection and tracking capabilities and enhance the tracking results in challenging situations.
This section presents the algorithms that are compatible with the MEIRFS hardware design for sensor fusion and multi-target tracking. In particular, U2Fusion [27], a unified unsupervised image fusion network, is adapted to fuse visible and infrared images and provide high-quality inputs for the downstream multi-target tracking (MTT) task, even in adversarial environments.

Sensor Fusion
Infrared cameras capture thermal radiation emitted by objects, while visible cameras capture the reflected or emitted light in the visible spectrum. Therefore, infrared cameras are useful for applications involving temperature detection, night vision, and identifying heat signatures [28,29]. Visible cameras, on the other hand, are commonly used for photography, computer vision, and surveillance in well-lit conditions. Both types of cameras serve distinct purposes and have their own specific applications based on the type of light they capture. Fusing these two modalities allows us to see the thermal characteristics of objects alongside their visual appearance, providing enhanced scene perception and improved object detection.
Image fusion has been an active field [30,31], and many algorithms have been developed. DL-based image fusion techniques are of particular interest to MEIRFS due to their superior performance and reduced effort for feature engineering and fusion rules. Zhang et al. [32] provide a comprehensive review of DL methods in different image fusion scenarios. In particular, DL for infrared and visible image fusion can be categorized into autoencoder (AE), convolutional neural network (CNN), and generative adversarial network (GAN)-based methods according to the deep neural network architecture. Since AEs are mostly used for feature extraction and image reconstruction, while GANs are often unstable and difficult to train, we consider CNN-based methods to facilitate the multi-object tracking task. To overcome the lack of a universal ground truth and of a no-reference metric, CNN-based fusion constrains the similarity between the fused image and the source images through designed loss functions. Specifically, we adapt U2Fusion [27] for the MEIRFS system, which provides a unified framework for multi-modal, multi-exposure, and multi-focal fusion. However, U2Fusion [27] did not consider image registration, which is the first step towards image fusion. Due to differences in camera parameters such as focal length and field of view, the images may not share the same coordinate system, and thus image registration is necessary to align the images before fusing them. We calibrate the IR and visible cameras and compute the transformation matrix offline to reduce the online effort for image registration. In our work, image registration is performed by cropping the RGB image to align its FOV with that of the IR image, based on the camera calibration for our hardware design, and achieves effective performance.
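The FOV-matching crop used for registration can be illustrated with idealized pinhole geometry (aligned optical axes, no lens distortion, centered crop); in practice the offline-calibrated transformation matrix described above is used instead, and the function and FOV values below are our assumptions:

```python
import math

def ir_crop_window(rgb_w, rgb_h, rgb_hfov_deg, rgb_vfov_deg,
                   ir_hfov_deg, ir_vfov_deg):
    """Centered crop of the RGB image whose angular extent matches
    the IR camera's FOV; returns (x0, y0, width, height)."""
    # A ratio of FOV half-angle tangents maps angular extent to pixels.
    crop_w = round(rgb_w * math.tan(math.radians(ir_hfov_deg) / 2)
                   / math.tan(math.radians(rgb_hfov_deg) / 2))
    crop_h = round(rgb_h * math.tan(math.radians(ir_vfov_deg) / 2)
                   / math.tan(math.radians(rgb_vfov_deg) / 2))
    x0 = (rgb_w - crop_w) // 2
    y0 = (rgb_h - crop_h) // 2
    return x0, y0, crop_w, crop_h
```

After cropping, the RGB window is resized to the IR resolution so the fused pair shares one pixel grid.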
It is noted that integrating image registration into the U2Fusion model and training the integrated model in an end-to-end manner can simplify the image registration process and improve image fusion performance [33], which will be investigated in future work. After image registration, the training pipeline of U2Fusion with aligned images is shown in Figure 8. To preserve the critical information of a pair of source images denoted as I_1 and I_2, U2Fusion [27] minimizes the loss function defined as follows:

L(θ, D) = L_sim(θ, D) + λ L_ewc(θ, D),

where θ denotes the parameters in DenseNet for generating the resulting fusion image I_f, and D is the training dataset; L_sim(θ, D) is the similarity loss between the result and source images; L_ewc(θ, D) is the elastic weight consolidation [34] term that prevents catastrophic forgetting in continual learning; and λ is the trade-off parameter that controls the relative importance of the two parts. Additionally,

L_sim(θ, D) = E[ω_1 (1 − S_{I_f,I_1}) + ω_2 (1 − S_{I_f,I_2})] + α E[ω_1 MSE_{I_f,I_1} + ω_2 MSE_{I_f,I_2}],

where α controls the trade-off; S_{I_f,I_i} (i = 1, 2) denotes the structural similarity index measure (SSIM) for constraining the structural similarity between the source images I_i and the fusion image I_f; MSE_{I_f,I_i} (i = 1, 2) denotes the mean square error (MSE) for constraining the difference in the intensity distributions; and ω_1 and ω_2 are adaptive weights estimated based on the information measurement of the feature maps of the source images.
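A minimal numeric sketch of the similarity term above, assuming a crude global SSIM in place of the windowed SSIM used in practice (α = 20 is an illustrative constant, not necessarily the value used in our training):

```python
import numpy as np

def similarity_loss(fused, src1, src2, w1, w2, alpha=20.0):
    """Sketch of the U2Fusion similarity term:
    L_sim = [w1 (1 - SSIM(I_f, I_1)) + w2 (1 - SSIM(I_f, I_2))]
            + alpha * [w1 MSE(I_f, I_1) + w2 MSE(I_f, I_2)]."""
    def ssim(a, b):
        # Crude global SSIM stand-in for the usual windowed SSIM.
        c1, c2 = 0.01 ** 2, 0.03 ** 2
        cov = ((a - a.mean()) * (b - b.mean())).mean()
        return ((2 * a.mean() * b.mean() + c1) * (2 * cov + c2) /
                ((a.mean() ** 2 + b.mean() ** 2 + c1) *
                 (a.var() + b.var() + c2)))
    def mse(a, b):
        return float(((a - b) ** 2).mean())
    l_ssim = w1 * (1 - ssim(fused, src1)) + w2 * (1 - ssim(fused, src2))
    l_mse = w1 * mse(fused, src1) + w2 * mse(fused, src2)
    return float(l_ssim + alpha * l_mse)

ir = np.random.rand(32, 32)
# A perfect "fusion" of two identical sources incurs (near) zero loss.
print(similarity_loss(ir, ir, ir, 0.5, 0.5))  # prints a value ≈ 0.0
```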
In particular, the information measurement g_I is defined as:

g_I = (1/c) Σ_{j=1}^{c} (1/(H_j W_j D_j)) ‖∇ φ_C^{k_j}(I)‖_F²,

where φ_C^{k_j}(I) is the feature map extracted by the convolutional layer of VGG16 before the j-th max-pooling layer; ∇ denotes the gradient operator and ‖·‖_F the Frobenius norm; c is the number of feature maps used; and H_j, W_j, and D_j denote the height, width, and channel of the feature map, respectively. Moreover, the elastic weight consolidation L_ewc is defined as

L_ewc(θ, D) = Σ_i (μ_i/2) (θ_i − θ_i*)²,

which penalizes the weighted squared distance between the parameter values of the current task θ and those of the previous task θ* to prevent forgetting what has been learned from old tasks, with μ_i weighting the importance of each parameter.
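The EWC term can be illustrated with a small numeric sketch; the per-parameter importance weights μ_i below are hypothetical stand-ins for an importance estimate such as the Fisher information:

```python
import numpy as np

def ewc_penalty(theta, theta_star, importance):
    """Elastic weight consolidation: a weighted squared distance between
    the current parameters and those retained from the previous task.
    `importance` plays the role of the per-parameter weights mu_i."""
    return float(sum((0.5 * mu * (t - ts) ** 2).sum()
                     for t, ts, mu in zip(theta, theta_star, importance)))

theta = [np.array([1.0, 2.0])]       # current task parameters
theta_star = [np.array([1.0, 1.0])]  # parameters from the previous task
importance = [np.array([1.0, 4.0])]  # hypothetical importance weights
print(ewc_penalty(theta, theta_star, importance))  # 2.0
```

The second parameter moved by 1.0 with weight 4.0, so the penalty is 0.5 × 4 × 1² = 2.0; parameters deemed important are pulled back harder toward their old values.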
To train a customized model for our system, we can fine-tune the learned U2Fusion model using transfer learning approaches [35] with data collected by our cameras to enhance learning efficiency. Furthermore, since IR or visible images alone can be sufficient for the object tracking task under certain environmental conditions, we designed a selector switch to skip image fusion when it is not needed for object detection. The mode selector is controlled manually, i.e., the operator selects the proper mode based on an assessment of the image quality of the infrared and visible images and the necessity of image fusion. In future work, we will incorporate mode selection into the U2Fusion model so that the mode is selected automatically. Figure 9 shows the complete pipeline of image fusion processing for object tracking.
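A minimal sketch of the manual mode selector described above; the mode names and interface are illustrative, not our actual control software:

```python
def select_frame(mode, ir_frame, rgb_frame, fuse_fn):
    """Manual mode selector: the operator chooses which stream feeds the
    tracker, skipping fusion when a single modality suffices."""
    if mode == "ir":          # e.g., night-time: thermal alone is enough
        return ir_frame
    if mode == "visible":     # e.g., daylight with good contrast
        return rgb_frame
    if mode == "fused":       # challenging scenes: run image fusion
        return fuse_fn(ir_frame, rgb_frame)
    raise ValueError(f"unknown mode: {mode}")
```

Because the check happens before any fusion work, the "ir" and "visible" modes avoid the fusion network's compute cost entirely.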

DL-Based Algorithm for Human Object Tracking
In certain scenarios, the human object may become lost due to inherent limitations in object detection algorithms as well as various challenging circumstances such as occlusions and fluctuations in lighting conditions. To effectively address these situations, the utilization of a human object tracking algorithm becomes necessary [36].
To optimize the tracking results, our system employs the "ByteTrack" object tracking model as the primary algorithm [37]. For effective performance, ByteTrack utilizes YOLOX as the underlying backbone for object detection [38]. Unlike traditional methods that discard detection results below a predetermined threshold, ByteTrack takes a different approach. It associates nearly all the detected boxes by initially separating them into two categories: high-score boxes, containing detections above the threshold, and low-score boxes, encompassing detections below the threshold. The high-score boxes are first linked to existing tracklets. Subsequently, ByteTrack computes the similarity between the low-score boxes and the established tracklets, facilitating the recovery of objects that may be occluded or blurred. The remaining unmatched low-score detections, which mostly correspond to background noise, are then removed. The ByteTrack methodology effectively restores precise object representations while eliminating spurious background detections.
In the MEIRFS system, the fusion of IR and visible image pairs is followed by the application of the YOLOX algorithm to the fused image. This algorithm performs human object detection and generates confidence scores for the detected objects. In the presence of occlusion, priority is given to high-confidence detections, which are initially matched with the tracklets generated by the Kalman filter. Subsequently, an intersection over union (IoU) similarity calculation is utilized to evaluate the remaining tracklets and low-confidence detections. This process facilitates the matching of low-confidence detections with tracklets, enabling the system to effectively handle occlusion scenarios.
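The two-stage association above can be sketched as follows, using greedy IoU matching as a stand-in for the Hungarian assignment used by ByteTrack in practice; the score and IoU thresholds are illustrative:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def byte_associate(tracklets, detections, score_thresh=0.6, min_iou=0.3):
    """Two-stage association in the spirit of ByteTrack: high-score
    detections are matched to tracklets first; the remaining tracklets
    then try to recover occluded objects from the low-score detections.
    Unmatched low-score detections are treated as background."""
    high = [d for d in detections if d["score"] >= score_thresh]
    low = [d for d in detections if d["score"] < score_thresh]
    matches, free = [], list(tracklets)
    for stage in (high, low):             # stage 1: high, stage 2: low
        for det in stage:
            best, best_iou = None, min_iou
            for trk in free:
                overlap = iou(trk["box"], det["box"])
                if overlap > best_iou:
                    best, best_iou = trk, overlap
            if best is not None:
                matches.append((best["id"], det["box"]))
                free.remove(best)
    return matches, free  # leftover tracklets may be kept or retired

tracks = [{"id": 1, "box": (0, 0, 10, 10)}, {"id": 2, "box": (20, 0, 30, 10)}]
dets = [{"score": 0.9, "box": (1, 0, 11, 10)},    # confident detection
        {"score": 0.2, "box": (21, 0, 31, 10)}]   # occluded, low score
matches, leftover = byte_associate(tracks, dets)
print([m[0] for m in matches])  # [1, 2]
```

Note how the low-score detection still recovers track 2 in the second stage, which is the mechanism that keeps occluded or blurred objects from being dropped.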

Experiments and Results
With the integrated sensors in the MEIRFS system, multiple ground tests have been performed in different environments to validate the performance of each individual component in the sensor system as well as the whole system's performance for human object detection, geolocation, and LOS-friendly human object recognition.

Indoor Experiments
In Figure 10a, we tested the MEIRFS sensor system's capability to detect and continuously track a single human object. When the human object appeared in the IR camera's field of view, it was immediately identified (marked with a red bounding box) and tracked by the sensor system.
Compared with the traditional EO camera, one advantage of the IR camera is that it can detect human objects when there is no illumination. The long-wavelength infrared (LWIR) camera detects the direct thermal energy emitted from the human body. Figure 10b shows that the MEIRFS system can function correctly even in a dark environment. Figure 10c demonstrates the measurement accuracy of the radar subsystem. When the friendly human object is detected by the MEIRFS system, the distance to the platform is measured by both the radar subsystem and the laser rangefinder. The measurement results verified that the radar subsystem can provide accurate distance information for the friendly object, with an error of less than 0.3 m when compared with the laser rangefinder.
In the last test, as shown in Figure 10d, there are two human objects. The one holding the IR emitter (a heat source) is the friendly object. The other is the non-friendly object. The system was configured to track only non-friendly objects. When both objects came into the IR camera's FOV, the sensor system immediately identified them and marked the friendly object with a green bounding box and the non-friendly object with a red box. Moreover, the sensor system immediately started to continuously track and follow the non-friendly object.

Outdoor Experiments
Extensive experiments were conducted to thoroughly validate the effectiveness of the MEIRFS system for multiple human object tracking in outdoor environments. These experiments were designed to assess the system's performance and capabilities across various scenarios and conditions encountered in real-world outdoor settings.
The tracking model employed has undergone pre-training on two datasets, namely CrowdHuman [39] and MOT20 [40]. The CrowdHuman dataset is characterized by its extensive size, rich annotations, and substantial diversity, encompassing a total of 470,000 human instances across the training and validation subsets. Notably, each image within the dataset contains an average of 22.6 people, thereby exhibiting a wide range of occlusions. The MOT20 dataset, on the other hand, comprises eight sequences extracted from three densely populated scenes, where the number of individuals per frame can reach up to 246. The pre-trained model's exposure to such varied and challenging conditions enables it to effectively handle a wide array of real-world scenarios, leading to enhanced object tracking capabilities and more reliable results. The original model used in our research was trained on a separate system consisting of eight NVIDIA Tesla V100 GPUs with a batch size of 48, following an 80-epoch training schedule for the MOT17 dataset that combines the MOT17, CrowdHuman, Cityperson, and ETHZ datasets. The image size is set to 1440 × 800, with the shortest side ranging from 576 to 1024 during multi-scale training. Data augmentation includes Mosaic and Mixup. The optimizer is SGD with a weight decay of 5 × 10⁻⁴ and a momentum of 0.9. The initial learning rate is 10⁻³ with a one-epoch warm-up and a cosine annealing schedule. For the inference stage, we performed the evaluations on an NVIDIA 2080 Ti GPU. With this configuration, we achieved 27.98 frames per second (FPS), which demonstrates the real-time capability of our hardware system. Figure 11 presents the evaluation of MEIRFS' tracking ability, revealing noteworthy insights from the top and bottom rows of the displayed results. In these scenarios, which involve the movement of multiple individuals amidst occlusion, the MEIRFS multimodal U2Fusion tracking algorithm exhibits exceptional performance.
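The learning-rate schedule described above (initial rate 10⁻³, one-epoch warm-up, cosine annealing over the 80-epoch run) corresponds roughly to the following sketch; the linear ramp shape of the warm-up is an assumption based on the common YOLOX/ByteTrack recipe:

```python
import math

def lr_at(epoch, total_epochs=80, warmup_epochs=1, base_lr=1e-3):
    """Learning rate for a given (zero-indexed) epoch: linear warm-up
    for the first epoch, then cosine annealing toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear ramp
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))
```

The cosine tail decays the rate smoothly, so the final epochs make only tiny parameter updates.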
Each individual is identified by a unique ID number and tracked using a distinct color, showcasing the algorithm's ability to accurately track different people without any instances of object loss. As shown in Figure 11, the continuous tracking results are represented by six key image frames, labeled with their key frame numbers in time sequence at the lower left corner of each frame. This outcome underscores the robustness and reliability of the MEIRFS tracking algorithm, particularly in challenging conditions where occlusion and the simultaneous presence of multiple objects present significant tracking difficulties.
Figure 11. Experiments demonstrating the capability of the MEIRFS sensor system for multiple human object tracking. Each identified human object is labeled with a unique ID number.

Figure 12 illustrates the performance of the MEIRFS tracking algorithm on images captured by an IR camera, images captured by a visible camera, and the fused images obtained by sensor fusion. Analysis of the top and middle rows reveals that both single-modality scenarios encounter challenges in tracking person #1: person #2 is incorrectly assigned as person #1, while person #1 is mistakenly considered a new individual, person #3. However, in the bottom row, following the fusion of IR and visible images, our tracking algorithm successfully tracks both person #1 and person #2, even in the presence of occlusions. This performance highlights the effectiveness of the sensor fusion, which combines information from both IR and visible images. As a result, the fusion process enriches the image features available to the tracking algorithm, leading to improved tracking performance in challenging scenarios.

Discussion
To demonstrate the effectiveness of our system in tracking human subjects, we conducted an evaluation using the videos that we collected from outdoor experiments.
The results of this experiment, as presented in Table 1, showcased a mean average precision (mAP) score of 0.98, calculated at an intersection over union (IoU) threshold of 0.50. With a high mAP of 0.98, the detection algorithm demonstrates its proficiency and precision in identifying objects accurately and reliably. This achievement provides strong evidence that the algorithm is well suited to the unique characteristics and complexities presented by our data. Consequently, this accuracy lays a solid foundation for the subsequent tracking evaluation, affirming the algorithm's competence in reliably detecting and localizing human subjects for the tracking phase. To assess the tracking algorithm's performance, we employed multiple object tracking accuracy (MOTA) as our evaluation metric. The MOTA metric considers three crucial aspects: the number of misses (m_t), the number of false positives (fp_t), and the number of mismatches (mme_t), with the total number of objects (g_t) in the denominator. This comprehensive evaluation provides valuable insights into the system's ability to accurately track human subjects over time.
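The MOTA metric described above, MOTA = 1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t, can be computed over a sequence as follows (the three-frame example is synthetic, not from our evaluation):

```python
def mota(misses, false_positives, mismatches, num_objects):
    """Multiple object tracking accuracy over a sequence:
    1 - sum_t (m_t + fp_t + mme_t) / sum_t g_t, given per-frame counts
    of misses, false positives, ID mismatches, and ground-truth objects."""
    errors = sum(m + fp + mme for m, fp, mme
                 in zip(misses, false_positives, mismatches))
    return 1.0 - errors / sum(num_objects)

# Toy 3-frame sequence with 10 ground-truth objects per frame:
# one miss in frame 2 and one false positive in frame 3.
print(mota([0, 1, 0], [0, 0, 1], [0, 0, 0], [10, 10, 10]))  # ≈ 0.933
```

A perfect tracker scores 1.0, and every miss, false positive, or identity switch subtracts 1/Σg_t from the score, which is why MOTA penalizes all three error types on an equal footing.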
The evaluation results of the tracking algorithm are presented in Table 2. Notably, the achieved MOTA score is 0.984, indicating a high level of accuracy and performance. With such encouraging results, we can confidently assert that the tracking algorithm is well suited for this specific application and has the potential to significantly enhance the overall capabilities of our system. Its performance in human tracking brings us closer to achieving our system's objectives with a high degree of precision and reliability.

Conclusions
This paper proposes and develops a multimodal EO/IR and RF-based sensor (MEIRFS) system for real-time human object detection, recognition, and tracking on autonomous vehicles. The integration of hardware and software components of the MEIRFS system was successfully accomplished and demonstrated in indoor and outdoor scenes with collected and common datasets. Prior to integration, thorough device functionality testing established communication between each device and the host computer. To enhance human object recognition and tracking (HORT), multimodal deep learning techniques were designed. Specifically, the "U2Fusion" sensor fusion algorithm and the "ByteTrack" object tracking model were utilized. These approaches significantly improved the performance of human object tracking, particularly in complex scenarios. Multiple ground tests were conducted to verify the consistent detection and recognition of human objects in various environments. The compact size and light weight of the MEIRFS system make it suitable for deployment on UGVs and UAVs, enabling real-time HORT tasks.
Future work includes deploying and testing the MEIRFS system on UAV platforms. Additionally, we aim to leverage the experience gained from ground tests to retrain the deep learning models using new images acquired from the EO/IR camera and a radar on the UAV. We anticipate that the MEIRFS system will be capable of performing the same tasks of human object detection, recognition, and tracking that have been validated during the ground tests.