3.1. Data Acquisition
In Step 1 (shown in Figure 2), considering both effectiveness and efficiency, we first conducted daytime image acquisition with a five-lens oblique camera, the SHARE PSDK 102S (https://www.shareuavtec.com/ProductDetail/6519312.html, accessed on 1 October 2022) (shown in Figure 3a), mounted on the DJI M300 RTK flight platform (https://www.dji.com/cn/matrice-300?site=brandsite&from=nav, accessed on 1 October 2022) (shown in Figure 3b). The SHARE PSDK 102S camera is composed of one vertical, downward-facing lens and four 45-degree oblique lenses, so five overlapping images are recorded simultaneously for each shot, as presented in Figure 4. Each lens captures more than 24 megapixels. Furthermore, in order to cover the survey area fully and evenly, all flight paths were pre-planned in a grid pattern and flown automatically by the flight control system of the DJI M300. The images captured in this step were mainly used to reconstruct a high-precision 3D model; sample images taken by the PSDK 102S are shown in Figure 4.
Secondly, the visible-light module of the DJI H20T (https://www.dji.com/cn/zenmuse-h20-series?site=brandsite&from=nav, accessed on 1 October 2022) (shown in Figure 3c) was used to acquire another set of visible images, which share the same field of view as the corresponding thermal images. These images act as a "bridge" between the thermal images and the reference 3D model, connecting each thermal image with the other cross-view visible images and the 3D point cloud. Similar to the previous operation, we mounted the H20T on the DJI M300 RTK and captured the same scenes again under well-lit daytime conditions. This time, in order to increase the diversity of the data, we collected images in two ways, along planned routes and along freely flown routes, and recorded the route information and the shooting angle at each image collection point.
Finally, we used the thermal module of the DJI H20T to obtain thermal images. This step was carried out under poor lighting conditions at night. We mainly relied on the track re-fly function of the DJI M300 RTK so that the drone repeated the routes along which the bridge images had been captured. During each trajectory, the H20T captured thermal images at the same positions and angles as the bridge images, thanks to the extremely small RTK error. Since each thermal image has a corresponding bridge image, subsequent manual annotation is greatly facilitated. Image samples are shown in Figure 5.
In conclusion, our dataset contains three kinds of images in two modalities, all of which retain their metadata, such as focal length, aperture, and exposure. The specifications of the cameras used are shown in Table 2. In addition, we recorded geo-referenced information (real-time kinematic positioning) and flight control information (heading, speed, pitch, etc.) during data acquisition via the UAV platform.
3.2. Three-Dimensional Model Reconstruction
After obtaining the original data, we used the open-source system COLMAP [7] to reconstruct a 3D model from all visible images, as shown in Step 2 of Figure 2. The reconstructed point cloud is shown in Figure 6. Specifically, for each scene, the input consisted of the visible-light images collected by the five-lens camera and the bridge images provided by the H20T visible module during the daytime. The main pipeline of COLMAP [7] consists of feature extraction, image retrieval, structure-from-motion (SfM), and multi-view stereo (MVS).
Three files were output after the whole process: Cameras, Images, and Points3D.

Cameras records the intrinsic parameters of each group of cameras. The format is as follows:

CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]

where MODEL denotes the camera model, which is PINHOLE for all cameras in our dataset, and the principal point coordinates (cx, cy) are stored in PARAMS. It should be noted that only the intrinsic parameters of the visible-light images are included, whereas the intrinsic parameters of the thermal images are provided by DJI through camera calibration.
The Images file contains the pose and the feature points of each image. The specific form is as follows:

IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
POINTS2D[] as (X, Y, POINT3D_ID)

where (QW, QX, QY, QZ) represents the orientation as a quaternion and (TX, TY, TZ) represents the translation. The pose composed of them is the absolute pose in the 3D reference model. The entry that needs further detail is POINTS2D. It contains the index and 2D coordinates of all of the 2D feature points in the image, as well as the id of the corresponding 3D point (POINT3D_ID) for each point. If the POINT3D_ID is −1, the point is not visible in the point cloud, i.e., it failed to triangulate.
The Points3D file is the 3D point cloud of the scene, representing the spatial structure of the entire scene. It stores the id, coordinates, color, and track of each 3D point as follows:

POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX)

Each track contains multiple (IMAGE_ID, POINT2D_IDX) pairs; each pair identifies an image that can observe this 3D point, together with the index of the corresponding 2D feature point in that image.
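As an illustration of how these files are consumed in the following steps, the sketch below parses COLMAP's documented text-format export (images.txt and points3D.txt) into simple Python dictionaries. This is a minimal reading of the file layout described above; the function names and the dictionary structure are our own assumptions for this example, not part of COLMAP.

# Minimal parser for COLMAP's text-format output.
# The dict-based layout returned here is illustrative.
from pathlib import Path

def read_images(path):
    """Return {image_name: {"qvec", "tvec", "points2d"}} from images.txt."""
    images = {}
    lines = [l for l in Path(path).read_text().splitlines()
             if l.strip() and not l.startswith("#")]
    # images.txt stores two lines per image: the pose header and the POINTS2D list.
    for header, points in zip(lines[0::2], lines[1::2]):
        elems = header.split()
        name = elems[9]
        qvec = [float(x) for x in elems[1:5]]        # QW, QX, QY, QZ
        tvec = [float(x) for x in elems[5:8]]        # TX, TY, TZ
        pts = points.split()
        # POINTS2D[] is a flat list of (X, Y, POINT3D_ID) triplets.
        points2d = [(float(pts[i]), float(pts[i + 1]), int(pts[i + 2]))
                    for i in range(0, len(pts), 3)]
        images[name] = {"qvec": qvec, "tvec": tvec, "points2d": points2d}
    return images

def read_points3d(path):
    """Return {point3d_id: (x, y, z)} from points3D.txt."""
    points = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        elems = line.split()
        points[int(elems[0])] = tuple(float(v) for v in elems[1:4])
    return points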
Although COLMAP [7] cannot register thermal images to the 3D reconstruction because of their disparate modality, the bridge images can help. Since each bridge image has the same field of view as the corresponding thermal image, the 3D points observed by the bridge image can also be observed by the thermal one. With this premise, we can easily establish 2D-3D structure correspondences between a thermal image and the pre-built point cloud.
3.3. Semi-Automatic Supervision Generation
In the first two steps, we obtained all of the raw images and the reference 3D models; afterwards, we performed semi-automatic supervision generation. Essentially, each thermal image was annotated to obtain its absolute pose. To obtain this pose, we first needed to establish a one-to-one correspondence between 2D points of the thermal image and 3D points in the reference model. Since the thermal images were not involved in the reconstruction of the 3D models, we accomplished this goal with the help of the bridge images. Specifically, as shown in Step 3 of Figure 2, we first retrieved the bridge image from Images (generated in Section 3.2) based on the image name, then read all of the POINTS2D data of this bridge image and selected all of the points whose POINT3D_ID was not −1, that is, the points with visibility. Finally, we marked the bridge image with a pink cross at the coordinates (X, Y) of each visible point, as shown in Figure 7a.
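The marking step can be scripted directly from the parsed reconstruction. The snippet below is a minimal sketch that draws a pink cross at every triangulated feature of a bridge image using OpenCV, reusing the images dictionary from the parsing sketch above; the file naming and marker size are illustrative assumptions.

# Mark every triangulated 2D feature of a bridge image with a pink cross
# to serve as a visual aid for the manual annotation described below.
import cv2

def draw_visible_points(bridge_name, images, image_dir, out_path):
    img = cv2.imread(f"{image_dir}/{bridge_name}")
    for x, y, point3d_id in images[bridge_name]["points2d"]:
        if point3d_id == -1:      # skip features that failed to triangulate
            continue
        cv2.drawMarker(img, (int(round(x)), int(round(y))),
                       color=(203, 192, 255),              # pink in BGR order
                       markerType=cv2.MARKER_CROSS,
                       markerSize=12, thickness=2)
    cv2.imwrite(out_path, img)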
Next, we selected markers that are easy for humans to recognize, such as the edge points of objects and the corner points of buildings. Using Hugin (https://github.com/ndevenish/Hugin, accessed on 10 October 2022), we marked the center of a pink cross as accurately as possible and recorded the coordinates of this point on the bridge image. Meanwhile, we marked the point with the closest appearance in the thermal image and recorded its coordinates as well. In short, Hugin was used only to mark tie points between thermal–visible image pairs and to record the coordinates of the manually labeled points.
Another key step is to find the 3D point corresponding to each pair of manual tie points. In the bridge image, we found the cross-marked point with the smallest pixel distance to the manually marked coordinate. According to the previous description, each cross mark denotes a 2D feature point, which corresponds to a POINT3D_ID in the 3D point cloud. We then obtained the coordinates of this 3D point by its id in Points3D (generated in Section 3.2). Through this labeling process, we matched the 3D point with the 2D point labeled on the thermal image. To ensure that the final pose could be computed successfully, we matched 15 to 30 points manually for each set of bridge–thermal images. The total number of annotated image pairs reached 898.
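A possible implementation of this association step is sketched below: each manually marked bridge-image coordinate is snapped to the nearest triangulated feature, whose POINT3D_ID then yields the 3D coordinates to pair with the thermal-image mark. Reading the tie points out of the Hugin project file is omitted, and the input and output formats are assumptions for this example.

# Turn manual tie points into 2D-3D matches for the thermal image.
# tie_points: list of ((x_bridge, y_bridge), (x_thermal, y_thermal)) pairs.
import numpy as np

def build_2d3d_matches(tie_points, bridge_name, images, points3d):
    feats = [(x, y, pid) for x, y, pid in images[bridge_name]["points2d"] if pid != -1]
    xy = np.array([(x, y) for x, y, _ in feats])
    obj_pts, img_pts = [], []
    for (xb, yb), (xt, yt) in tie_points:
        # snap the manual mark to the nearest cross-marked (triangulated) feature
        idx = int(np.argmin(np.linalg.norm(xy - np.array([xb, yb]), axis=1)))
        pid = feats[idx][2]
        obj_pts.append(points3d[pid])   # 3D point (X, Y, Z) in the reference model
        img_pts.append((xt, yt))        # corresponding 2D point on the thermal image
    return (np.asarray(obj_pts, dtype=np.float32),
            np.asarray(img_pts, dtype=np.float32))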
Finally, given the intrinsic parameters of the thermal camera and the 2D-3D matches, we could use a PnP solver [38,39] to compute the absolute pose of the thermal image.
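As an illustration, the pose could be recovered with a RANSAC-wrapped PnP solver as sketched below; OpenCV's solvePnPRansac is used here only as a stand-in for the solvers cited above, and the intrinsic matrix of the thermal camera comes from DJI's calibration as noted in Section 3.2. The reprojection-error threshold is an assumption.

# Estimate the absolute pose of the thermal image from the 2D-3D matches.
import cv2
import numpy as np

def solve_thermal_pose(obj_pts, img_pts, K_thermal):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K_thermal, None,
        reprojectionError=3.0, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed; check the 2D-3D matches")
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix (world-to-camera)
    return R, tvec               # absolute pose [R | t] of the thermal image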
In the above step, we only manually matched 898 bridge–thermal image pairs to obtain the poses of the thermal images. In order to generate a large amount of supervision for multi-view visible–thermal image pairs, enough to effectively train learning-based models, we needed to propagate the annotation information to all visible images through the bridge images, thereby realizing efficient semi-automatic supervision generation.
The premise of all of this is that the multi-view visible images associated with each bridge image were acquired during the 3D model reconstruction (Section 3.2 and Step 2 of Figure 2). Because SfM and MVS ensure that these multi-view visible images share part of the tracks and have sufficient overlap with the bridge images, each bridge image has tens or even hundreds of associated images, as shown in Figure 8a.
The key to our efficient semi-automatic supervision generation method, and the reason why we collect bridge images, is that each thermal image has the same field of view as its bridge image. In this way, the images associated with the bridge image are guaranteed to have sufficient overlap with the thermal image. In other words, the multi-view images related to the thermal image can be retrieved from the whole image gallery via the bridge image, as shown in Figure 8b.
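This retrieval can be derived directly from the reconstruction, since the associated images are exactly those that share tracks with the bridge image. The sketch below counts shared 3D points per image, reusing the images dictionary from the parsing sketch above; the minimum-overlap threshold is an illustrative assumption rather than a value from the paper.

# Retrieve the visible images that share 3D points (tracks) with a given bridge image,
# i.e., the multi-view images that also overlap the corresponding thermal image.
def covisible_images(bridge_name, images, min_shared=50):
    bridge_ids = {pid for _, _, pid in images[bridge_name]["points2d"] if pid != -1}
    related = {}
    for name, data in images.items():
        if name == bridge_name:
            continue
        shared = sum(1 for _, _, pid in data["points2d"] if pid in bridge_ids)
        if shared >= min_shared:
            related[name] = shared
    # most co-visible images first
    return sorted(related, key=related.get, reverse=True)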
After obtaining the above retrieval relationship between thermal and visible images, we can semi-automatically generate supervision from their poses, intrinsic parameters, and depth. Here, the pose is the absolute pose of each image in the scene, denoted by T, as shown in Equation (1). R and t stand for the rotation matrix and the translation, respectively. The intrinsic parameters can generally be represented by the matrix in Equation (2), where f is the focal length and (cx, cy) is the coordinate of the principal point. It should be noted that the camera model used in the whole process is PINHOLE.
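A conventional way to write the pose T and the intrinsic matrix K, consistent with the symbols defined above, is the following; the 4x4 homogeneous form of T is an assumption made here for concreteness.

% Standard homogeneous pose and PINHOLE intrinsic matrix,
% using R, t, f, c_x, c_y as defined in the text.
T = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix}, \qquad
K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}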
Assuming that we have obtained the poses of the thermal and visible images in the previous steps, denoted T_T and T_V as shown in Equation (3), and given the depth map of the visible image, which stores the depth value of each pixel coordinate, we aim to build dense correspondences. Figure 9 illustrates the basic matching process, where the subscripts V and T represent the visible and thermal image, respectively. Based on the PINHOLE projection function in Equation (4), we built the relationship between a 3D point in world coordinates and its 2D pixel coordinates.
For a certain point in the visible image with known depth value, we obtained its corresponding 3D point by the modified projection in Equation (5), where a transformation from nonhomogeneous to homogeneous coordinates is implied, since the result of the last three terms of the operation is nonhomogeneous. We then transformed this point into the camera coordinate system of the thermal image by Equation (6). The resulting point P was then projected onto the imaging plane by the intrinsic parameters in Equation (7). Finally, by eliminating the depth (i.e., dividing by the last homogeneous coordinate), we could obtain the pixel coordinates of the corresponding point in the thermal image.
In summary, we obtained the point-to-point correspondence between a visible-image pixel and its thermal-image pixel by Equation (8).
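The chain of Equations (4)-(8) can be applied densely over the visible image, as in the following sketch: every pixel is back-projected with its depth, transformed into the thermal camera frame, and re-projected with the thermal intrinsics. The world-to-camera pose convention and the variable names are assumptions for this example.

# Dense visible-to-thermal correspondence by back-projection and re-projection.
import numpy as np

def warp_visible_to_thermal(depth_v, K_v, T_v, K_t, T_t):
    """depth_v: HxW depth map of the visible image;
    T_v, T_t: 4x4 world-to-camera poses; K_v, K_t: 3x3 intrinsic matrices.
    Returns an HxWx2 array of corresponding pixel coordinates in the thermal image."""
    h, w = depth_v.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # 3 x N homogeneous pixels
    cam_v = np.linalg.inv(K_v) @ pix * depth_v.reshape(1, -1)            # back-project into the visible camera frame
    world = np.linalg.inv(T_v) @ np.vstack([cam_v, np.ones((1, cam_v.shape[1]))])  # lift to world coordinates
    cam_t = (T_t @ world)[:3]                                            # transform into the thermal camera frame
    proj = K_t @ cam_t                                                   # project with the thermal intrinsics
    uv_t = (proj[:2] / proj[2]).T.reshape(h, w, 2)                       # eliminate depth to get (x, y)
    return uv_t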
Therefore, tens or even hundreds of multi-view thermal–visible image pairs with pixel-wise correspondences can be obtained simply by manually labeling a single pair of bridge–thermal images. As a result, our method achieves the goal of efficiently generating a large amount of multi-view, cross-spectral image matching supervision in a semi-automatic manner.