Multiset Lempel–Ziv Jaccard Distance

Aoki, Satoshi; Koga, Hisashi

doi:10.3390/info17050489



by Satoshi Aoki and Hisashi Koga *

Department of Computer and Network Engineering, University of Electro-Communications, Tokyo 182-8585, Japan

* Author to whom correspondence should be addressed.

Information 2026, 17(5), 489; https://doi.org/10.3390/info17050489

Submission received: 13 March 2026 / Revised: 27 April 2026 / Accepted: 13 May 2026 / Published: 16 May 2026

(This article belongs to the Section Information Theory and Methodology)


Abstract

Feature selection strongly affects the performance of pattern classification. In security applications, however, selecting proper features is difficult because malicious software continuously changes its characteristics. Compression-based pattern recognition has therefore attracted much attention, as it yields distance measures without explicit feature selection. In particular, LZJD (Lempel–Ziv Jaccard Distance) has proved useful for malware classification: it computes compression distances without actually compressing objects and is therefore suitable for large files such as malware. LZJD extracts a compression dictionary for every object in advance and estimates the similarity between two objects by comparing their compression dictionaries. However, LZJD ignores the similarity between words in a compression dictionary, so even when the dictionary contains many similar words, they are treated as entirely different words. To exploit the similarity between words, we propose removing the last characters of the words in the dictionary and unifying similar words that share the same prefix. This unification turns the compression dictionary into a multiset of words; hence, we name our compression distance MLZJD (Multiset LZJD). In addition, the unification reduces the number of distinct words in the compression dictionaries, which speeds up the distance computation. We show experimentally that MLZJD halves the execution time compared with LZJD while hardly degrading classification accuracy. Even when the compression distances are approximated with Min-Hash, MLZJD runs much faster than LZJD while retaining almost the same classification accuracy.

Keywords: compression-based pattern recognition; malware classification; multiset; min-hash; dictionary-based compression
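To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The dictionary construction follows the usual LZ78-style parse used by LZJD. The MLZJD part rests on assumptions the abstract does not settle: the hypothetical parameter `drop` (how many trailing bytes are removed, set to 1 here), the choice to skip words that become empty, and the use of the generalized Jaccard index (sum of minima over sum of maxima) as the multiset similarity. All function names are illustrative.

```python
from collections import Counter


def lz_dictionary(data: bytes) -> set:
    """LZ78-style parse: collect each shortest substring not seen before."""
    words = set()
    start = 0
    for end in range(1, len(data) + 1):
        word = data[start:end]
        if word not in words:
            words.add(word)
            start = end  # the next word begins right after the one just added
    return words


def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the two dictionaries, treated as plain sets."""
    da, db = lz_dictionary(a), lz_dictionary(b)
    union = da | db
    if not union:
        return 0.0
    return 1.0 - len(da & db) / len(union)


def prefix_multiset(words: set, drop: int = 1) -> Counter:
    """Assumed unification step: strip the last `drop` bytes of every word
    and count how many words collapse onto each remaining prefix.
    Words that would become empty are skipped (an assumption)."""
    return Counter(w[:-drop] for w in words if len(w) > drop)


def mlzjd(a: bytes, b: bytes, drop: int = 1) -> float:
    """Multiset variant, assuming the generalized Jaccard index."""
    ma = prefix_multiset(lz_dictionary(a), drop)
    mb = prefix_multiset(lz_dictionary(b), drop)
    inter = sum((ma & mb).values())  # Counter & takes element-wise minima
    union = sum((ma | mb).values())  # Counter | takes element-wise maxima
    if union == 0:
        return 0.0
    return 1.0 - inter / union


if __name__ == "__main__":
    x, y = b"abac", b"abad"
    print(lzjd(x, y), mlzjd(x, y))
```

In this toy input the dictionaries differ only in the words `ac` and `ad`. The set-based distance counts them as unrelated, whereas the prefix multiset maps both to `a`, so the multiset distance treats the inputs as identical. The multiset also has fewer distinct keys than the original dictionary, which is the source of the speed-up the abstract describes.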

