# On the Use of Normalized Compression Distances for Image Similarity Detection

## Abstract

## 1. Introduction

## 2. From Simple NCD to NCD Vectors

#### 2.1. Direct Information Distances between Images

#### 2.2. Vectors of Compression Distances

## 3. A Simple Experiment

#### Direct NCD Based Classification

## 4. Robustness: A Case Study

#### 4.1. Robustness against Histogram Equalization

- JPEG2000: 21/24 for horizontal concatenation with ${T}^{v}$, 20/24 for vertical concatenation with ${T}^{id}$, 18/24 for both horizontal and vertical concatenation with rotation, 17/24 for vertical concatenation with both horizontal and diagonal translation;
- GZIP: 8/24 for vertical concatenation with rotation;
- WinZIP: 19/24 for vertical concatenation with ${T}^{v}$, 15/24 for horizontal concatenation with ${T}^{v}$ and 8/24 for vertical concatenation with ${T}^{h}$;
- WinRAR: 14/24 for horizontal concatenation with rotation;
- 7Z: 8/24 for vertical concatenation with ${T}^{d}$, 6/24 for vertical concatenation with ${T}^{h}$;
- PAQ8: 19/24 for vertical concatenation with ${T}^{h}$, 14/24 for vertical concatenation with ${T}^{d}$ and ${T}^{id}$, and only 7/24 for horizontal concatenation with ${T}^{h}$;
- PAQ9: 20/24 for horizontal concatenation with ${T}^{v}$, 17/24 for vertical concatenation with ${T}^{v}$, 15/24 for horizontal concatenation with ${T}^{h}$, 8/24 for horizontal concatenation with ${T}^{h}$;
- LPAQ1: 17/24 for horizontal concatenation with ${T}^{h}$ and 15/24 for horizontal concatenation with ${T}^{v}$;
- FPAQ0f: 24/24 for horizontal concatenation with rotation; 23/24 for horizontal concatenation with ${T}^{v}$.
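
Each configuration listed above pairs a compressor with a concatenation direction and an image transform. The following is a minimal sketch of how one such configuration could be evaluated, using the standard definition $NCD(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$, where $C(\cdot)$ is the compressed size. Everything else is an assumption made for illustration: images are represented as lists of byte rows, `zlib` stands in for GZIP, ${T}^{h}$ and ${T}^{v}$ are modeled as circular shifts, and the helper names (`csize`, `concat`, `shift_rows`, `shift_cols`, `ncd_vector`) are hypothetical rather than taken from the paper.

```python
import zlib
from typing import Callable, List, Sequence

Image = List[bytes]  # assumption: one bytes object per row of 8-bit pixels


def csize(img: Image) -> int:
    """Compressed size in bytes of the row-major pixel stream (zlib, level 9)."""
    return len(zlib.compress(b"".join(img), 9))


def concat(x: Image, y: Image, horizontal: bool) -> Image:
    """Place y to the right of x (horizontal) or below x (vertical)."""
    if horizontal:
        return [rx + ry for rx, ry in zip(x, y)]
    return list(x) + list(y)


def ncd(x: Image, y: Image, horizontal: bool = True) -> float:
    """Normalized compression distance for the chosen concatenation."""
    cx, cy = csize(x), csize(y)
    cxy = csize(concat(x, y, horizontal))
    return (cxy - min(cx, cy)) / max(cx, cy)


def shift_rows(img: Image, k: int) -> Image:
    """Assumed model of a vertical translation: circular shift by k rows."""
    k %= len(img)
    return img[-k:] + img[:-k]


def shift_cols(img: Image, k: int) -> Image:
    """Assumed model of a horizontal translation: circular shift by k columns."""
    k %= len(img[0])
    return [row[-k:] + row[:-k] for row in img]


def ncd_vector(img: Image, shifts: Sequence[int],
               transform: Callable[[Image, int], Image],
               horizontal: bool = True) -> List[float]:
    """NCD between an image and each transformed copy of itself."""
    return [ncd(img, transform(img, k), horizontal) for k in shifts]
```

For example, `ncd_vector(img, range(1, 9), shift_rows, horizontal=True)` would give an eight-component vector for vertical translation with horizontal concatenation. How such vectors are compared to reach a classification decision is not covered by this sketch.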

#### 4.2. More Results on Robustness

#### 4.3. Robustness and Mathematical Complexity

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

**Figure 1.** GZIP based vectors for Lena and Barbara with horizontal translation (**left**) and vertical translation (**right**).

**Figure 3.** Details of test image Kodim5 corrupted with $20\%$ “salt and pepper” noise (**left**) and sequence of FPAQ0f signatures for the original and two noisy versions with $10\%$ and $20\%$ noise (**right**).

**Figure 5.** Four data sets: Overpass (rows 1–2); Denseresidential (rows 3–4); Harbor (rows 5–6); and Faces (rows 7–8).

Compressor | Horizontal ${T}^{h}$ | Horizontal ${T}^{v}$ | Horizontal ${T}^{d}$ | Horizontal ${T}^{id}$ | Horizontal ${T}^{\varphi}$ | Vertical ${T}^{h}$ | Vertical ${T}^{v}$ | Vertical ${T}^{d}$ | Vertical ${T}^{id}$ | Vertical ${T}^{\varphi}$ | NCD H | NCD V |
---|---|---|---|---|---|---|---|---|---|---|---|---|
JPEG2000 | L | P | L | L | P | P | L | P | P | P | B | B |
JPEG | L | L | L | L | L | L | L | L | L | L | B | B |
PNG | B | B | B | B | B | B | L | B | B | L | B | B |
TIFF | B | B | B | B | B | B | B | B | B | L | B | B |
GZIP | B | B | B | B | B | B | L | B | B | P | B | B |
WinZIP | L | P | L | L | B | P | P | L | L | B | B | B |
WinRAR | L | L | L | L | P | L | L | L | L | P | B | B |
7Z | L | L | L | L | L | L | P | L | P | B | B | B |
PAQ8 | P | L | L | L | B | P | L | P | P | L | B | B |
PAQ9 | P | P | L | L | B | P | P | L | L | B | B | B |
LPAQ1 | P | L | L | L | B | P | L | L | L | B | B | B |
FPAQ0f | L | P | L | L | P | L | L | L | L | L | B | B |

**Table 2.** Configurations with at least 20/24 correct classification results for histogram equalization.

Compressor | Transform | Concatenation | Results |
---|---|---|---|
FPAQ0f | Rotation | Horizontal | 24/24 |
FPAQ0f | Rotation | Vertical | 23/24 |
JPEG2000 | Vertical translation | Horizontal | 21/24 |
JPEG2000 | Inverse diagonal | Vertical | 20/24 |

Attack | Parameter | FPAQ0f | FPAQ0f | FPAQ0f | JPEG2000 | JPEG2000 | JPEG2000 |
---|---|---|---|---|---|---|---|
Moving average | Window size | $3\times 3$ | $5\times 5$ | $7\times 7$ | $3\times 3$ | $5\times 5$ | $7\times 7$ |
Moving average | Results | 22/24 | 18/24 | 17/24 | 21/24 | 21/24 | 19/24 |
Gaussian filtering | Window size | $3\times 3$ | $5\times 5$ | $7\times 7$ | $3\times 3$ | $5\times 5$ | $7\times 7$ |
Gaussian filtering | Results | 24/24 | 20/24 | 18/24 | 23/24 | 22/24 | 20/24 |
Median filtering | Window size | $3\times 3$ | $5\times 5$ | $7\times 7$ | $3\times 3$ | $5\times 5$ | $7\times 7$ |
Median filtering | Results | 23/24 | 21/24 | 18/24 | 22/24 | 18/24 | 16/24 |
Gaussian noise | ${\sigma}^{2}$ | 0.0001 | 0.0005 | 0.001 | 0.0001 | 0.0005 | 0.001 |
Gaussian noise | Results | 24/24 | 21/24 | 21/24 | 24/24 | 19/24 | 19/24 |
“Salt and pepper” noise | Density | $0.01\%$ | $0.05\%$ | $0.1\%$ | $0.01\%$ | $0.05\%$ | $0.1\%$ |
“Salt and pepper” noise | Results | 24/24 | 24/24 | 24/24 | 24/24 | 17/24 | 11/24 |
Lossy compression | Quality (QF) | 80 | 40 | 20 | 80 | 40 | 20 |
Lossy compression | Results | 24/22 | 22/24 | 18/24 | 24/24 | 22/24 | 20/24 |
Downscaling | Scale factor | 3/4 | 1/2 | 1/4 | 3/4 | 1/2 | 1/4 |
Downscaling | Results | 24/24 | 23/24 | 22/24 | 2/24 | 23/24 | 18/24 |
Upscaling | Scale factor | 3/2 | 7/4 | 2 | 3/2 | 7/4 | 2 |
Upscaling | Results | 24/24 | 24/24 | 24/24 | 13/24 | 7/24 | 22/24 |
Cropping | Rows & columns | 16 | 32 | 48 | 2 | 4 | - |
Cropping | Results | 24/24 | 23/24 | 18/24 | 11/24 | 4/24 | - |
Rotation | Degrees | ${2.5}^{\circ}$ | ${5}^{\circ}$ | ${7.5}^{\circ}$ | ${2.5}^{\circ}$ | - | - |
Rotation | Results | 24/24 | 23/24 | 20/24 | 9/24 | - | - |
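
Two of the distortions in the table above, “salt and pepper” noise and the moving average, are simple enough to sketch without any image library. The code below is only an illustration of what such test distortions might look like, not the procedure used to produce the reported results. The function names, the list-of-byte-rows image representation, the fixed random seed, and the edge handling (border pixels are replicated) are all assumptions; `density` is a fraction, so $0.1\%$ corresponds to `0.001`.

```python
import random
from typing import List

Image = List[bytes]  # assumption: one bytes object per row of 8-bit pixels


def salt_and_pepper(img: Image, density: float, seed: int = 0) -> Image:
    """Set roughly `density` of the pixels to 0 or 255 with equal probability."""
    rng = random.Random(seed)
    out = []
    for row in img:
        new = bytearray(row)
        for j in range(len(new)):
            if rng.random() < density:
                new[j] = 255 if rng.random() < 0.5 else 0
        out.append(bytes(new))
    return out


def moving_average(img: Image, window: int = 3) -> Image:
    """Mean filter over a window x window neighborhood (window must be odd).

    Coordinates outside the image are clamped to the nearest border pixel,
    and the mean is rounded down to an integer gray level.
    """
    height, width = len(img), len(img[0])
    r = window // 2
    out = []
    for i in range(height):
        new = bytearray(width)
        for j in range(width):
            total = 0
            for di in range(-r, r + 1):
                ii = min(max(i + di, 0), height - 1)
                for dj in range(-r, r + 1):
                    jj = min(max(j + dj, 0), width - 1)
                    total += img[ii][jj]
            new[j] = total // (window * window)
        out.append(bytes(new))
    return out
```

A distorted copy produced this way could then be compared against the original with whichever NCD-based signature is under test.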

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Coltuc, D.; Datcu, M.; Coltuc, D.
On the Use of Normalized Compression Distances for Image Similarity Detection. *Entropy* **2018**, *20*, 99.
https://doi.org/10.3390/e20020099