# Cross-Spectral Local Descriptors via Quadruplet Network


## Abstract


## 1. Introduction

- We propose and evaluate three ways of using triplets for learning cross-spectral descriptors. Triplet networks were originally designed for visible imagery, so their performance on cross-spectral images is unknown.
- We propose a new CNN-based training architecture that outperforms the state of the art on a public Visible and Near-Infrared (VIS-NIR) cross-spectral image pair dataset. Additionally, our experiments show that our network is also useful for learning local feature descriptors in the visible domain.
- Fully trained networks and source code are publicly available at [5].
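Triplet-based descriptor learning pulls matching patches together and pushes non-matching patches apart in descriptor space. As a point of reference, a minimal sketch of a standard triplet margin loss on descriptor vectors (function name and margin value are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on descriptor vectors.

    Illustrative sketch only: encourages the anchor-positive distance
    to be at least `margin` smaller than the anchor-negative distance.
    """
    d_pos = np.linalg.norm(anchor - positive)   # distance to matching patch
    d_neg = np.linalg.norm(anchor - negative)   # distance to non-matching patch
    return max(0.0, margin + d_pos - d_neg)
```

When the negative is already far enough away, the loss is zero and the triplet contributes no gradient, which is why hard-example selection matters for networks of this kind.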

## 2. Background and Related Work

#### 2.1. Near-Infrared Band

#### 2.2. Dataset

#### 2.3. Cross-Spectral Descriptors

#### 2.4. CNN-Based Feature Descriptor Approaches

## 3. PN-Net (Triplet Network)

- As previously stated, our network is similar to the triplet network but specifically designed to learn cross-spectral local feature descriptors. A brief description of this network helps set the basis for our proposal in Section 4.
- We explain the motivation behind our proposal through several experiments. After training PN-Net to learn cross-spectral feature descriptors, we found that network performance improved when non-matching patches were randomly alternated between both spectra.
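The random-alternation trick described above can be sketched as follows. This is a hypothetical illustration of the sampling strategy (function and variable names are ours): the non-matching patch for each training example is drawn from either the visible or the NIR patch set with equal probability.

```python
import random

def sample_negative(vis_patches, nir_patches, idx):
    """Draw a non-matching patch, alternating spectra at random.

    Hypothetical sketch of the sampling trick: the negative for
    training example `idx` comes from the visible or NIR set with
    probability 0.5 each, excluding the matching patch itself.
    """
    pool = vis_patches if random.random() < 0.5 else nir_patches
    neg_idx = random.choice([i for i in range(len(pool)) if i != idx])
    return pool[neg_idx]
```

Exposing the network to negatives from both spectra prevents it from overfitting to the intensity statistics of a single band.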

#### 3.1. PN-Net Architecture

#### 3.2. PN-Net Loss

#### 3.3. Cross-Spectral PN-Net

## 4. Q-Net

#### 4.1. Q-Net Architecture

#### 4.2. Q-Net Loss

## 5. Experimental Evaluation

#### 5.1. VIS-NIR Scene Dataset

**Training:** Q-Net and PN-Net networks were trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.1, weight decay of 0.0001, batch size of 128, momentum of 0.9 and learning rate decay of 0.000001. Training data were shuffled at the beginning of each epoch and each input patch was normalized by its mean intensity. The training data were split into two sets: 95% for training and 5% for validation. Training was performed with and without data augmentation (DA), where the augmented data were obtained by flipping the images vertically and horizontally and rotating them by 90, 180 and 270 degrees. Each network was trained ten times to account for randomization effects in the initialization. Lastly, we used a grid search strategy to find the best parameters.
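For concreteness, a minimal sketch of one SGD-with-momentum update using the hyperparameters listed above (scalar weights and variable names are our simplification; the paper trained with Torch's optimizer):

```python
def sgd_step(w, grad, v, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay.

    Illustrative scalar version of the optimizer configuration
    described in the text; returns the updated weight and velocity.
    """
    g = grad + weight_decay * w   # weight decay acts as L2 regularization
    v = momentum * v - lr * g     # momentum accumulates past gradients
    return w + v, v
```

The stated learning rate decay (0.000001) would additionally shrink `lr` over iterations, which is omitted here for brevity.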

**Model details:** The model is described in Table 3. The layers and parameters are the same as in [4], which, after several experiments, proved suitable for describing cross-spectral patches. Note that shallow models are suitable for feature description, since lower layers are more general than upper ones. The descriptor size was determined experimentally: we tested performance with different descriptor sizes, and Figure 6 shows the results. The figure shows a gain from increasing the descriptor size up to 256; larger descriptor sizes did not perform better.

**Software and hardware:** All code was implemented using the Torch framework [25]. The GPU was an NVIDIA Titan X, and training took between five and ten hours when data augmentation was used.

#### 5.2. Multi-View Stereo Correspondence Dataset

**Training:** Quadruplet networks were trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.1, weight decay of 0.0001, batch size of 128, momentum of 0.9 and learning rate decay of 0.000001. Training data were shuffled at the beginning of each epoch and each input patch was normalized to zero mean and unit variance. We split each training sequence into two sets: 80% of the data for training and the remaining 20% for validation. We used the same software and hardware as in the previous experiment. As before, Q-Net and PN-Net networks were trained ten times to account for randomization effects in the initialization.
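The preprocessing and split described above can be sketched in a few lines (function names are ours; the split is shown with a fixed seed for reproducibility):

```python
import numpy as np

def normalize_patch(patch):
    """Zero-mean, unit-variance normalization of a single patch."""
    p = np.asarray(patch, dtype=np.float64)
    return (p - p.mean()) / (p.std() + 1e-8)  # epsilon guards flat patches

def split_sequence(patches, train_frac=0.8, seed=0):
    """Shuffle one training sequence and split it 80/20 (our sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    cut = int(train_frac * len(patches))
    return [patches[i] for i in idx[:cut]], [patches[i] for i in idx[cut:]]
```

Note the contrast with Section 5.1, where patches were normalized by mean intensity only.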

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Zhang, Z. Microsoft Kinect Sensor and Its Effect. IEEE MultiMedia **2012**, 19, 4–10.
- You, C.W.; Lane, N.D.; Chen, F.; Wang, R.; Chen, Z.; Bao, T.J.; Montes-de Oca, M.; Cheng, Y.; Lin, M.; Torresani, L.; et al. CarSafe app: Alerting drowsy and distracted drivers using dual cameras on smartphones. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, Taipei, Taiwan, 25–28 June 2013; pp. 13–26.
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. **2004**, 60, 91–110.
- Balntas, V.; Johns, E.; Tang, L.; Mikolajczyk, K. PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. arXiv **2016**, arXiv:1601.05030.
- GitHub. Qnet. Available online: http://github.com/ngunsu/qnet (accessed on 15 April 2017).
- Yi, D.; Lei, Z.; Li, S.Z. Shared representation learning for heterogeneous face recognition. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; Volume 1, pp. 1–7.
- Ring, E.; Ammer, K. The technique of infrared imaging in medicine. In Infrared Imaging; IOP Publishing: Bristol, UK, 2015.
- Klare, B.F.; Jain, A.K. Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1410–1422.
- Aguilera, C.A.; Aguilera, F.J.; Sappa, A.D.; Aguilera, C.; Toledo, R. Learning cross-spectral similarity measures with deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Las Vegas, NV, USA, 26 June–1 July 2016.
- Brown, M.; Susstrunk, S. Multi-spectral SIFT for scene category recognition. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 177–184.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In European Conference on Computer Vision; Springer: New York, NY, USA, 2006; pp. 404–417.
- Firmenichy, D.; Brown, M.; Süsstrunk, S. Multispectral interest points for RGB-NIR image registration. In Proceedings of the 2011 18th IEEE International Conference on Image Processing (ICIP), Brussels, Belgium, 11–14 September 2011; pp. 181–184.
- Pinggera, P.; Breckon, T.; Bischof, H. On Cross-Spectral Stereo Matching using Dense Gradient Features. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 526.1–526.12.
- Morris, N.J.W.; Avidan, S.; Matusik, W.; Pfister, H. Statistics of Infrared Images. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–7.
- Aguilera, C.; Barrera, F.; Lumbreras, F.; Sappa, A.; Toledo, R. Multispectral image feature points. Sensors **2012**, 12, 12661–12672.
- Mouats, T.; Aouf, N.; Sappa, A.D.; Aguilera, C.; Toledo, R. Multispectral Stereo Odometry. IEEE Trans. Intell. Transp. Syst. **2015**, 16, 1210–1224.
- Aguilera, C.A.; Sappa, A.D.; Toledo, R. LGHD: A feature descriptor for matching across non-linear intensity variations. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec, Canada, 27–30 September 2015; pp. 178–181.
- Ma, J.; Zhao, J.; Ma, Y.; Tian, J. Non-rigid visible and infrared face registration via regularized Gaussian fields criterion. Pattern Recognit. **2015**, 48, 772–784.
- Shen, X.; Xu, L.; Zhang, Q.; Jia, J. Multi-modal and Multi-spectral Registration for Natural Images. In European Conference on Computer Vision; Springer: New York, NY, USA, 2014; pp. 309–324.
- Zagoruyko, S.; Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4353–4361.
- Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015.
- Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. **2016**, 17, 1–32.
- Winder, S.; Hua, G.; Brown, M. Picking the best DAISY. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; pp. 178–185.
- Collobert, R.; Kavukcuoglu, K.; Farabet, C. Torch7: A MATLAB-like environment for machine learning. In Proceedings of the BigLearn NIPS Workshop, Granada, Spain, 12–17 December 2011.

**Figure 1.** The proposed network architecture. It consists of four copies of the same CNN that accept as input two correctly matched cross-spectral image pairs (MP1 and MP2). The network computes the loss from multiple $L_2$ distance comparisons between the outputs of the CNNs, selecting the matching pair with the largest $L_2$ distance and the non-matching pair with the smallest $L_2$ distance. Both are then used for backpropagation. This can be seen as positive and negative mining.
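Under our reading of Figure 1, the quadruplet objective can be sketched as follows. With four patches (two matched pairs), there are two matching and four non-matching combinations; the hardest of each enters a margin loss. The function name, margin value, and exact set of negative pairs are assumptions for illustration:

```python
import numpy as np

def quadruplet_loss(mp1_vis, mp1_nir, mp2_vis, mp2_nir, margin=1.0):
    """Sketch of the quadruplet objective described in Figure 1.

    Mines the matching pair with the largest L2 distance (hardest
    positive) and the non-matching pair with the smallest L2 distance
    (hardest negative), then applies a margin loss to that pair.
    """
    d = lambda a, b: np.linalg.norm(a - b)
    matching = [d(mp1_vis, mp1_nir), d(mp2_vis, mp2_nir)]
    non_matching = [d(mp1_vis, mp2_nir), d(mp2_vis, mp1_nir),
                    d(mp1_vis, mp2_vis), d(mp1_nir, mp2_nir)]
    hardest_pos = max(matching)       # worst-separated matching pair
    hardest_neg = min(non_matching)   # closest non-matching pair
    return max(0.0, margin + hardest_pos - hardest_neg)
```

Because only the hardest positive and hardest negative contribute to the gradient, each quadruplet performs its own hard-example mining, which is what distinguishes this formulation from plain triplet training.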

**Figure 2.** VIS-NIR cross-spectral image pairs; top images are from the visible spectrum and bottom images from the near-infrared spectrum.

**Figure 3.** Image patches from the VIS-NIR training set. The first row corresponds to grayscale images from the visible spectrum and the second row to NIR images. (**a**,**b**): non-matching pairs; (**c**,**d**): correctly matched pairs.

**Figure 6.** FPR95 performance on the VIS-NIR scene dataset for Q-Net 2P-4N using different descriptor sizes ((**a**) 64; (**b**) 128; (**c**) 256 and (**d**) 512). Shorter bars indicate better performance. Standard deviations are shown as segments on top of the bars.

**Figure 7.** ROC curves for the different descriptors evaluated on the VIS-NIR dataset. For Q-Net and PN-Net, we selected the network with the best performance. Each subfigure shows the result for one of the eight tested categories of the dataset.

**Table 1.** Number of cross-spectral image pairs per category in the VIS-NIR patch dataset used to train and evaluate our work.

| Category | # Cross-Spectral Pairs |
|---|---|
| country | 277,504 |
| field | 240,896 |
| forest | 376,832 |
| indoor | 60,672 |
| mountain | 151,296 |
| old building | 101,376 |
| street | 164,608 |
| urban | 147,712 |
| water | 143,104 |

| Train seq. | PN-Net Gray | PN-Net NIR | PN-Net Random |
|---|---|---|---|
| country | 11.79 | 11.63 | 10.65 |
| field | 17.84 | 16.56 | 16.10 |
| forest | 36.00 | 32.47 | 32.19 |
| indoor | 48.21 | 47.26 | 46.48 |
| mountain | 29.35 | 26.29 | 25.67 |
| old building | 29.22 | 27.25 | 27.69 |
| street | 18.23 | 16.71 | 16.73 |
| urban | 32.78 | 36.61 | 33.35 |
| water | 18.16 | 17.76 | 15.84 |
| average | 26.84 | 25.84 | 25.08 |

| Layer | Description | Kernel | Output Dim |
|---|---|---|---|
| 1 | Convolution | 7 × 7 | 32 × 26 × 26 |
| 2 | Tanh | - | 32 × 26 × 26 |
| 3 | MaxPooling | 2 × 2 | 32 × 13 × 13 |
| 4 | Convolution | 6 × 6 | 64 × 8 × 8 |
| 5 | Tanh | - | 64 × 8 × 8 |
| 6 | Linear | - | 256 |

**Table 4.** FPR95 performance on the VIS-NIR scene dataset. Each network (siamese-L2, PN-Net and Q-Net) was trained on the country sequence and tested on the other eight sequences, as in [9]. Lower values indicate better performance. Standard deviations are given in brackets.

| Descriptor/Network | Field | Forest | Indoor | Mountain | Old Building | Street | Urban | Water | Mean |
|---|---|---|---|---|---|---|---|---|---|
| EHD | 48.62 | 23.17 | 30.25 | 33.94 | 19.62 | 27.29 | 3.72 | 23.46 | 26.26 |
| LGHD | 18.80 | 3.73 | 8.16 | 11.34 | 8.17 | 6.66 | 7.39 | 13.90 | 9.77 |
| siamese-L2 | 38.47 | 12.46 | 7.94 | 22.36 | 15.70 | 16.85 | 11.06 | 29.18 | 15.50 |
| PN-Net RGB | 25.33 (1.08) | 4.41 (0.28) | 7.00 (0.32) | 19.37 (1.07) | 7.31 (0.32) | 10.21 (0.46) | 5.00 (0.27) | 17.79 (0.67) | 12.05 (0.40) |
| PN-Net NIR | 24.74 (0.98) | 4.45 (0.14) | 6.54 (0.25) | 15.75 (0.44) | 7.78 (0.19) | 10.82 (0.25) | 4.66 (0.14) | 16.49 (0.34) | 11.40 (0.15) |
| PN-Net Random | 24.56 (1.00) | 3.91 (0.20) | 6.56 (0.43) | 15.99 (0.60) | 6.84 (0.31) | 9.51 (0.36) | 4.40 (0.34) | 15.62 (0.61) | 10.92 (0.34) |
| Q-Net 2P-4N (ours) | 20.80 (0.81) | 3.12 (0.20) | **6.11** (0.27) | 12.32 (0.49) | 5.42 (0.13) | 6.57 (0.40) | 3.30 (0.11) | 11.24 (0.50) | 8.61 (0.14) |
| PN-Net Random DA | 20.09 (0.65) | 3.27 (0.27) | 6.36 (0.14) | 11.53 (0.57) | 5.19 (0.20) | 5.62 (0.20) | 3.31 (0.28) | 10.72 (0.36) | 8.26 (0.24) |
| Q-Net 2P-4N DA (ours) | **17.01** (0.33) | **2.70** (0.17) | 6.16 (0.18) | **9.61** (0.38) | **4.61** (0.18) | **3.99** (0.09) | **2.83** (0.13) | **8.44** (0.14) | **6.86** (0.09) |
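FPR95, the metric reported throughout Tables 4 and 5, is the false positive rate at 95% true positive recall. A minimal sketch of how it can be computed from descriptor distances and match labels (our reading of the metric; names are illustrative):

```python
import numpy as np

def fpr95(distances, labels):
    """False positive rate at 95% recall (FPR95).

    `distances`: descriptor L2 distances for evaluated patch pairs.
    `labels`: 1 for matching pairs, 0 for non-matching pairs.
    The threshold is the smallest distance accepting 95% of the
    matching pairs; the result is the fraction of non-matching
    pairs falling below that threshold.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    pos = np.sort(distances[labels == 1])
    thr = pos[int(np.ceil(0.95 * len(pos))) - 1]  # 95th-percentile positive
    neg = distances[labels == 0]
    return float(np.mean(neg <= thr))
```

Lower FPR95 means fewer false matches slip through at a fixed, high recall, which is why smaller table entries indicate better descriptors.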

**Table 5.** Matching results on the multi-view stereo correspondence dataset. Evaluations were made on the 100 K image-pair ground truth recommended by the authors. Results correspond to FPR95; lower values indicate better performance. Standard deviations are given in brackets.

Columns indicate Training → Testing sequences (ND = Notredame, Lib = Liberty, Yos = Yosemite).

| Descriptor | ND → Yos | Lib → Yos | ND → Lib | Yos → Lib | Yos → ND | Lib → ND | Mean |
|---|---|---|---|---|---|---|---|
| siamese-L2 | 15.15 | 20.09 | 12.46 | 8.38 | 18.83 | 6.04 | 13.49 |
| PN-Net (size = 128, patches = 2,560,000) | 8.47 (0.20) | 9.50 (0.48) | 9.17 (0.17) | 10.82 (0.49) | 4.47 (0.18) | 4.16 (0.10) | 7.77 (0.17) |
| PN-Net (size = 128, patches = 3,840,000) | 8.46 (0.46) | **8.77** (0.23) | 8.86 (0.11) | 10.78 (0.57) | 4.37 (0.14) | 3.98 (0.10) | 7.53 (0.16) |
| Q-Net 2P-4N (size = 128, patches = 2,560,000) | **7.69** (0.52) | 9.34 (0.71) | **7.64** (0.31) | **10.22** (0.60) | **4.07** (0.18) | **3.76** (0.13) | **7.12** (0.22) |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Aguilera, C.A.; Sappa, A.D.; Aguilera, C.; Toledo, R.
Cross-Spectral Local Descriptors via Quadruplet Network. *Sensors* **2017**, *17*, 873.
https://doi.org/10.3390/s17040873
