Similarity to a Single Set

Naish, Lee

doi:10.3390/bdcc10050164

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

by Lee Naish

Computing and Information Systems, The University of Melbourne, Melbourne 3010, Australia

Big Data Cogn. Comput. 2026, 10(5), 164; https://doi.org/10.3390/bdcc10050164

Submission received: 4 April 2026 / Revised: 8 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026

(This article belongs to the Section Data Mining and Machine Learning)

Abstract

Identifying similarities in data is fundamental to discovery in science. Measuring or ranking similarity is a key way of reducing the dimensionality of data, is at the heart of many data-intensive algorithms, and can also be used directly for some applications. This paper extends our understanding of a relatively simple similarity problem. Our primary application is spectral-based fault localisation (SBFL), in which a computer program is run with a large number of test cases and data is collected on which statements are executed in each test case. For each statement, the set of test cases in which it is executed is compared to the set of test cases that failed, and the result is used to rank the statements to help locate bugs. This is an instance of what we call the similarity to a single set (STASS) problem. This paper is primarily theoretical, but some contributions are validated with SBFL experiments. Set similarity is equivalent to similarity of binary vectors or two-by-two contingency tables. The problem is also equivalent to converting two-dimensional data with a “partial order”, such as points on a rectangular grid, to a one-dimensional total order. Even when the raw data is not binary, we are often interested in comparing binary classifiers for the data, such as diagnostic tests, and comparing binary classifiers is an instance of the STASS problem. More than a hundred set similarity measures have been proposed in the literature and hundreds of thousands have been evaluated for SBFL, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties and forms of symmetry that similarity measures can have. It refines previously identified properties so they are no longer incompatible, identifies new forms of symmetry, defines ordering relations over similarity measures, and proposes a new statistic that can be used to help choose a good similarity measure for a given domain.

Keywords: binary similarity measure; set similarity; data mining; clustering; classification; diagnostic test; spectral-based fault localization; STASS
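
The SBFL ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method. It assumes the Jaccard index as the similarity measure (one well-known set similarity measure, chosen here only for illustration), and the statement names, test names, and the helpers `contingency`, `jaccard`, and `rank_statements` are all hypothetical. Each statement's set of executing tests is reduced to a two-by-two contingency table against the single set of failed tests, and the statements are then ranked by similarity.

```python
from typing import Dict, List, Set, Tuple


def contingency(executed: Set[str], failed: Set[str], all_tests: Set[str]) -> Tuple[int, int, int, int]:
    """Return the 2x2 table (ef, ep, nf, np) for one statement.

    ef: failed tests that execute the statement
    ep: passed tests that execute the statement
    nf: failed tests that do not execute it
    np: passed tests that do not execute it
    """
    ef = len(executed & failed)
    ep = len(executed - failed)
    nf = len(failed - executed)
    np_ = len(all_tests - executed - failed)
    return ef, ep, nf, np_


def jaccard(ef: int, ep: int, nf: int) -> float:
    """Jaccard index |A & B| / |A | B| written in terms of the table counts."""
    denom = ef + ep + nf
    return ef / denom if denom else 0.0


def rank_statements(coverage: Dict[str, Set[str]], failed: Set[str],
                    all_tests: Set[str]) -> List[Tuple[str, float]]:
    """Score every statement against the single failed set, most similar first."""
    scored = []
    for stmt, executed in coverage.items():
        ef, ep, nf, _ = contingency(executed, failed, all_tests)
        scored.append((stmt, jaccard(ef, ep, nf)))
    # Sort by descending score; ties are broken by statement name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical coverage data: statement -> tests that execute it.
all_tests = {"t1", "t2", "t3", "t4"}
failed = {"t3", "t4"}
coverage = {
    "s1": {"t1", "t2", "t3", "t4"},
    "s2": {"t3", "t4"},
    "s3": {"t1", "t4"},
}

ranking = rank_statements(coverage, failed, all_tests)
for stmt, score in ranking:
    print(f"{stmt}: {score:.3f}")
```

In this made-up example, `s2` is executed in exactly the failed tests and so ranks first. Swapping `jaccard` for another measure over the same four counts may change the ranking, which is the choice the paper sets out to inform.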
