Article

Gene Selection Algorithms in a Single-Cell Gene Decision Space Based on Self-Information

1
Fujian Provincial Key Laboratory of Data-Intensive Computing, Fujian University Laboratory of Intelligent Computing and Information Processing, School of Mathematics and Computer Science, Quanzhou Normal University, Quanzhou 362000, China
2
Fujian Key Laboratory of Financial Information Processing, Key Laboratory of Applied Mathematics in Fujian Province University, Putian University, Putian 351100, China
3
Fujian Province University Key Laboratory of Computational Science, School of Mathematical Sciences, Huaqiao University, Quanzhou 362000, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1829; https://doi.org/10.3390/math13111829
Submission received: 4 April 2025 / Revised: 21 May 2025 / Accepted: 24 May 2025 / Published: 30 May 2025

Abstract

A critical step for gene selection algorithms using rough set theory is the establishment of a gene evaluation function to assess the classification ability of candidate gene subsets. The concept of dependency in a classic neighborhood rough set model plays the role of this evaluation function. This criterion only notes the information provided by the lower approximation and omits the upper approximation, which may result in the loss of some important information. This paper proposes gene selection algorithms within a single-cell gene decision space by employing self-information, taking into account both lower and upper approximations. Initially, the distance between gene expression values within each subspace is defined to establish the tolerance relation on the cell set. Subsequently, self-information is introduced through the lens of tolerance classes. The relationship between these measures and their respective properties is then examined in detail. For gene expression data, the proposed self-information metric demonstrates superiority over other measures by accounting for both lower and upper approximations, thereby facilitating the selection of optimal gene subsets. Finally, gene selection algorithms within a single-cell gene decision space are developed based on the proposed self-information metric, and experiments conducted on 10 publicly available single-cell datasets indicate that the classification performance of the proposed algorithms can be enhanced through the selection of genes pertinent to classification. The results demonstrate that Fi-SI achieves an average classification accuracy of 93.7% (KNN) while selecting 48.3% fewer genes than Fisher's score.
MSC:
68T09; 68T99; 68W40

1. Introduction

1.1. Research Background

Single-cell gene expression data provide detailed molecular insights that help us better understand the functions and characteristics of cells. Through these data, we can delve into the differences between various cell types, uncover the roles of cells in disease progression and physiological processes, and provide crucial insights for personalized medicine and precision therapies. Gene selection is an effective method for understanding the factors involved in disease using single-cell gene expression data.
Rough set theory (RS-theory) is an important mathematical tool for dealing with uncertainty, inconsistency, and incomplete knowledge, and it has been used in gene selection. RS-theory constitutes a pivotal methodology for addressing uncertainty; its fundamental premise is the approximate quantification of inaccurate or indefinite knowledge using known knowledge. For a classification problem, RS-theory uses attributes to induce binary relations under which objects are grouped into different information granules. The decision variable is then approximately characterized by these information granules, and the lower and the upper approximations of the decision are formulated. Furthermore, a feature evaluation function, called the dependency function, is defined. Different types of binary relations lead to different rough sets, such as classical rough sets, neighborhood rough sets, dominance rough sets, fuzzy rough sets [1,2], etc. An information system (IS) in the light of RS-theory, proposed by Pawlak [3], is a database that shows relationships between objects and attributes. Most applications of RS-theory are related to an IS.
Information entropy, introduced by Shannon [4], is an important tool for estimating uncertainty. Some scholars have measured the uncertainty of an IS by using information entropy. For example, Wang et al. [5] proposed a novel entropy measurement for general fuzzy relations. Zhang et al. [6] discussed uncertainty measurement in a categorical IS. Li et al. [7] investigated entropy measure of a fuzzy relation IS. Navarrete et al. [8] described an approach to smoothing RGB-D data based on information entropy. Hempelmann et al. [9] devised an evaluation method for a medical IS by using information entropy. Delgado et al. [10] raised entropy weight method-based analysis on environmental conflicts.
Feature selection is called attribute reduction in RS-theory. It is a basic data preprocessing technique in machine learning and pattern classification tasks. The aim is to eliminate redundant attributes in order to reduce the complexity of the classification model and to improve its generalization ability. To name a few, Zeng et al. [11] proposed incremental feature selection based on fuzzy rough sets with a Gaussian kernel; Kim et al. [12] gave a neighborhood rough set-based feature selection method; Wang et al. [13] considered attribute reduction via a fuzzy rough iterative computation model; Dai et al. [14] presented information entropy-based feature selection algorithms for an interval-valued IS; Singh et al. [15] gave feature selection approaches for set-valued data on the basis of RS-theory; Sang et al. [16] advanced an incremental feature selection method via conditional entropy; Huang et al. [17] studied a discernibility measure for a fuzzy β covering IS and gave feature selection algorithms; Jia et al. [18] raised similarity-based feature selection from the viewpoint of clustering; Wang et al. [5] considered neighborhood self-information-based feature selection; Li et al. [19] put forward heterogeneous feature selection via information entropy; Wang et al. [20] proposed a feature selection algorithm using local conditional entropy; Sang et al. [16] researched incremental feature selection by means of conditional entropy; Yuan et al. [21] investigated unsupervised heterogeneous feature selection based on fuzzy mutual information; Chen et al. [22] raised a random sampling accelerator for attribute reduction; Jiang et al. [23] introduced an accelerator for supervised attribute reduction; Chen et al. [24] presented attribute group for attribute reduction.

1.2. Related Work

Gene expression data show the abundance of gene-transcribed mRNA measured directly or indirectly in cells. The data can be used to analyze which genes have changed in expression, how genes are related to each other, and how gene activity is affected under different conditions. With the rapid developments in microarray and sequencing technology, a large amount of single-cell RNA-seq data (scRNA-seq data) has emerged [25,26]. ScRNA-seq data are important gene expression data from which relevant genes must be selected among a large number of candidates. Owing to technical and sampling reasons, it is difficult to identify such genes from gene expression data with high noise, high sparsity, high dimensionality, and uncertainty. Moreover, the selection of the best genes using traditional statistical analysis and machine learning methods is often ineffective due to the uncertainty of gene expression data. Consequently, gene selection theory based on RS-theory must be established to reduce the complexity of gene expression data. For instance, Bommert et al. [27] proposed a filter method for feature selection of high-dimensional gene expression data; Li et al. [28] researched an uncertainty measurement for scRNA-seq data using a Gaussian kernel and applied it to unsupervised gene selection; Sharma et al. [29] presented a gene selection method for cancer classification based on multi-objective meta-heuristic machine learning algorithms; and Zhang et al. [30] studied an uncertainty measurement for scRNA-seq data based on class-consistent technology and considered its applicability for semi-supervised gene selection. Entropy-based methods (Li [28] and Zhang [30]) quantify uncertainty but fail to leverage rough set granularity. Sun et al. [31] presented a joint neighborhood entropy-based gene selection method using Fisher's score for tumor classification. While effective for class-aware feature selection, it does not distinguish between certain and possible decisions. Sheng et al. [32] introduced a feature selection method for high-dimensional single-cell gene expression data based on unsupervised learning; Zhang et al. [33] considered feature selection in a neighborhood decision IS and applied it to scRNA-seq data classification; Li et al. [34] investigated gene selection in a single-cell gene decision space by means of a Gaussian kernel. Gaussian kernel techniques model data continuity but struggle with discrete decision boundaries. Both Zhang [33] and Li [34] use dependency functions that prioritize consistent decisions, neglecting inconsistent ones. This results in information loss and suboptimal gene subsets. Zhang et al. [35] proposed a gene selection algorithm based on class-consistent technology and a fuzzy rough iterative computation model, demonstrating its effectiveness over existing methods in handling high-dimensional gene expression data. Zhang et al. [36] studied an efficient parameter optimization method based on neighborhood entropy that replaces traditional grid search by integrating neighborhood decision classes and minimizing entropy values. Ma et al. [37] proposed robust fuzzy evidence theory-based feature selection algorithms with a noise-resistant distance metric and framework for hybrid data. Yu et al. [38] introduced a fast and robust feature selection algorithm based on cross-similarity derived from a robust fuzzy relation, enabling effective application to large-scale noisy gene datasets.

1.3. Motivation and Contributions

A real-valued decision information system (RVDIS) is one in which the information values are real numbers. If the samples, attributes, and information values in the RVDIS are cells, genes, and gene expression values, respectively, and the gene expression data are scRNA-seq data, then the RVDIS is referred to as a single-cell gene decision space (scgd-space).
However, the above methods cannot effectively manage the uncertainty of gene expression data. Although a number of feature selection or gene selection algorithms have been proposed with rough sets, most feature evaluation functions were constructed only considering consistent decisions of objects in the lower approximation of the decision, which may lead to some information loss. In fact, classification information is not only related to the consistent decisions but is also related to the inconsistent decisions of objects in the upper approximation. The classification information of the inconsistent decisions should not be ignored in constructing a feature evaluation function.
In light of the preceding research motivations, the primary contributions of this paper encompass the following points.
(1)
Given a subspace in the scgd-space, the tolerance relation on the cell set is defined by introducing a variable parameter to control the distance between two gene expression values, which leads to the tolerance class. Rough approximations of this subspace are constructed. This overcomes the shortcomings of the traditional rough set model.
(2)
Five types of decision self-information measures are proposed as feature evaluation functions. The three feature evaluation functions with superior performance, namely self-information, relative self-information, and integrated self-information, are chosen to design gene selection algorithms. They have superior performance because they consider the classification information provided by both the upper and the lower approximations of the decision.
(3)
Three gene selection algorithms in an scgd-space are put forward using the chosen self-information measures. These algorithms are demonstrated on several publicly available scRNA-seq datasets. The experimental results show that these algorithms can effectively select gene subsets and outperform the existing algorithms.

1.4. Organization and Structure

In this paper, we first analyze the shortcoming of the dependency function. Then, we present three kinds of uncertainty indices by using the lower and the upper approximations of the decision: (1) the decision index; (2) the certain decision index; (3) the possible decision index. Combined with self-information, we introduce five types of decision self-information as feature evaluation functions and obtain their related properties. Based on the self-information measures with superior performance, we design three gene selection algorithms in an scgd-space. Finally, a series of experiments are conducted to verify that the designed algorithms can select effective gene subsets.
Figure 1 shows the flowchart for this paper.
The remainder of this article is organized as follows. Section 2 reviews the rough set model on an scgd-space. Section 3 defines self-information in an scgd-space. Section 4 proposes self-information-based gene selection algorithms. Section 5 conducts numerical experiments to demonstrate the performance of the proposed algorithms. Section 6 summarizes this paper.

2. A Single-Cell Gene Decision Space and Rough Set Model

This part reviews the single-cell gene decision space and considers the rough set model in an scgd-space.
In this paper, $S$, $2^S$, and $|W|$ represent a finite set, the family of all subsets of $S$, and the cardinality of $W \in 2^S$, respectively. Put
$$S = \{s_1, s_2, \ldots, s_n\}.$$
$(S, C, d)$ is known as a real-valued decision information system (RVDIS) if $\forall a \in C$, $\forall s \in S$, $a(s)$ is a real number, where $C$ is the conditional attribute set and $d$ is a decision attribute.
Definition 1
([34]). For an RVDIS $(S, C, d)$, suppose that $S$ and $C$ are the sets of cells and genes describing the cells, respectively. If $\forall a \in C$, $\forall s \in S$, $a(s)$ expresses the gene expression value of the cell $s$ relative to the gene $a$ and the gene expression data in $(S, C, d)$ are single-cell RNA-seq data (scRNA-seq data), then $(S, C, d)$ is known as a single-cell gene decision space (scgd-space).
Definition 2
([34]). For an scgd-space $(S, C, d)$, $\forall a \in C$, $\forall s, s' \in S$, the distance between $a(s)$ and $a(s')$ is defined as
$$dis(a(s), a(s')) = \frac{|a(s) - a(s')|}{\max\{a(x) : x \in S\} - \min\{a(x) : x \in S\}}.$$
Definition 3.
For an scgd-space $(S, C, d)$, let $G \subseteq C$ and $\theta \in [0, 1]$, and put
$$G_\theta = \{(s, s') \in S \times S : \forall a \in G,\ dis(a(s), a(s')) \le \theta\},$$
$$R_d = \{(s, s') \in S \times S : d(s) = d(s')\}.$$
Clearly, $G_\theta$ and $R_d$ are tolerance (reflexive and symmetric) and equivalence relations on $S$, respectively.
Denote
$$a_\theta = \{a\}_\theta.$$
Put
$$G_\theta(s) = \{s' \in S : (s, s') \in G_\theta\},$$
$$R_d(s) = \{s' \in S : (s, s') \in R_d\}.$$
Then, $G_\theta(s)$ and $R_d(s)$ are known as the tolerance and decision classes of $s$, respectively.
Denote
$$S/R_d = \{R_d(s) : s \in S\} = \{D_1, D_2, \ldots, D_r\}.$$
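To make Definitions 2 and 3 concrete, the following is a minimal sketch (not code from the paper) of how the normalized distance and the tolerance classes $G_\theta(s)$ could be computed for a small expression matrix; the NumPy-based representation and the function name are illustrative assumptions.

```python
import numpy as np

def tolerance_classes(X, gene_idx, theta):
    """Tolerance class G_theta(s) of every cell s under the gene subset gene_idx.
    X: (n_cells, n_genes) matrix of gene expression values."""
    G = X[:, gene_idx].astype(float)
    rng = G.max(axis=0) - G.min(axis=0)      # per-gene range used in Definition 2
    rng[rng == 0] = 1.0                      # constant genes contribute zero distance
    classes = []
    for s in range(G.shape[0]):
        # dis(a(s), a(s')) <= theta must hold for every gene a in the subset
        dist = np.abs(G - G[s]) / rng
        classes.append({int(i) for i in np.where((dist <= theta).all(axis=1))[0]})
    return classes
```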
Proposition 1.
For an scgd-space $(S, C, d)$:
(1) If $G_1 \subseteq G_2$, then $\forall \theta \in [0, 1]$, $\forall s \in S$, $G_{2\theta}(s) \subseteq G_{1\theta}(s)$;
(2) If $0 \le \theta_1 \le \theta_2 \le 1$, then $\forall G \subseteq C$, $\forall s \in S$, $G_{\theta_1}(s) \subseteq G_{\theta_2}(s)$.
Proof. 
Obviously.    □
Definition 4.
For an scgd-space $(S, C, d)$, let $G \subseteq C$ and $\theta \in [0, 1]$. Define $\partial_{G_\theta} : S \to 2^{V_d}$ as follows:
$$\partial_{G_\theta}(s) = d(G_\theta(s)).$$
Then, $\partial_{G_\theta}$ is known as the generalized decision in $(S, G, d)$.
If $\forall s \in S$, $|\partial_{C_\theta}(s)| = 1$, then $(S, C, d)$ is known as θ-consistent; otherwise, it is known as θ-inconsistent.
Proposition 2.
Let $(S, C, d)$ be an scgd-space. Given $G \subseteq C$ and $\theta \in [0, 1]$. Then,
$$G_\theta \subseteq R_d \iff \forall s \in S,\ |\partial_{G_\theta}(s)| = 1.$$
Proof. 
($\Rightarrow$) Let $G_\theta \subseteq R_d$. Then, $\forall s \in S$, $G_\theta(s) \subseteq R_d(s)$. Suppose $t \in \partial_{G_\theta}(s)$. Then, $t = d(s')$ for some $s' \in G_\theta(s)$. $s' \in G_\theta(s)$ implies that $s' \in R_d(s)$. So, $t = d(s') = d(s)$. Thus, $\forall s \in S$, $|\partial_{G_\theta}(s)| = 1$.
($\Leftarrow$) Assume that $\forall s \in S$, $|\partial_{G_\theta}(s)| = 1$. Suppose $s' \in G_\theta(s)$. Then, $d(s') \in \partial_{G_\theta}(s)$. Since $d(s) \in \partial_{G_\theta}(s)$ and $|\partial_{G_\theta}(s)| = 1$, we have $d(s') = d(s)$. Then, $s' \in R_d(s)$. Thus, $G_\theta(s) \subseteq R_d(s)$.    □
Theorem 1.
Let $(S, C, d)$ be an scgd-space. Given $\theta \in [0, 1]$. Then, $(S, C, d)$ is θ-consistent $\iff$ $C_\theta \subseteq R_d$.
Proof. 
It can be proven using Proposition 2.    □
Define
$$\underline{G_\theta}(W) = \{s \in S : G_\theta(s) \subseteq W\},$$
$$\overline{G_\theta}(W) = \{s \in S : G_\theta(s) \cap W \ne \emptyset\}.$$
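A small sketch of the lower and upper approximations defined above, again as an illustrative assumption rather than the paper's code; it takes the tolerance classes (for example, from the previous sketch) and a set $W$ of cell indices as plain Python sets.

```python
def lower_approx(tol_classes, W):
    """Lower approximation: cells whose tolerance class is contained in W."""
    return {s for s, cls in enumerate(tol_classes) if cls <= W}

def upper_approx(tol_classes, W):
    """Upper approximation: cells whose tolerance class intersects W."""
    return {s for s, cls in enumerate(tol_classes) if cls & W}
```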
Proposition 3.
For an scgd-space $(S, C, d)$:
(1) If $W_1 \subseteq W_2 \subseteq S$, then $\forall G \subseteq C$, $\forall \theta \in [0, 1]$,
$$\underline{G_\theta}(W_1) \subseteq \underline{G_\theta}(W_2), \quad \overline{G_\theta}(W_1) \subseteq \overline{G_\theta}(W_2);$$
(2) If $G_1 \subseteq G_2 \subseteq C$, then $\forall \theta \in [0, 1]$, $\forall W \in 2^S$,
$$\underline{G_{1\theta}}(W) \subseteq \underline{G_{2\theta}}(W), \quad \overline{G_{2\theta}}(W) \subseteq \overline{G_{1\theta}}(W);$$
(3) If $0 \le \theta_1 \le \theta_2 \le 1$, then $\forall G \subseteq C$, $\forall W \in 2^S$,
$$\underline{G_{\theta_2}}(W) \subseteq \underline{G_{\theta_1}}(W), \quad \overline{G_{\theta_1}}(W) \subseteq \overline{G_{\theta_2}}(W).$$
Proof. 
It can be proven using Proposition 1.    □

3. Self-Information of a Subspace in an scgd-Space

Definition 5
([4]). Let $x$ be a random variable and let $p(x)$ denote the probability of $x$. Suppose that $I(x)$ is a metric of the uncertainty of $x$. Then, $I(x)$ is known as the self-information of $x$ if it meets the following conditions:
(1) Non-negativity: $I(x) \ge 0$;
(2) If $p(x) \to 0$, then $I(x) \to \infty$;
(3) If $p(x) = 1$, then $I(x) = 0$;
(4) Strict monotonicity: If $p(x) > p(y)$, then $I(x) < I(y)$.
Let $(S, C, d)$ be an scgd-space. Given $G \subseteq C$ and $\theta \in [0, 1]$.
(1) The θ-positive region of $G$ with respect to $d$ is known as
$$POS_{G_\theta}(d) = \bigcup_{i=1}^{r} \underline{G_\theta}(D_i);$$
(2) The θ-dependence of $G$ with respect to $d$ is known as
$$\Gamma_{G_\theta}(d) = \frac{1}{n}\,|POS_{G_\theta}(d)|.$$
$\Gamma_{G_\theta}(d)$ can be used as the criterion for gene selection. However, it only notes consistent cells in the lower approximation and ignores inconsistent cells that have important classification information. Inconsistent cells may be merged in the upper approximation. Thus, the criterion for gene selection should contain both the lower and the upper approximations of the decision.
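For later comparison with the self-information measures, a minimal sketch of the θ-dependence $\Gamma_{G_\theta}(d)$ built from the approximations above; it is an illustrative assumption in which the decision classes are given as a list of sets of cell indices.

```python
def dependency(tol_classes, decision_classes):
    """Theta-dependence: fraction of cells lying in the positive region."""
    pos = set()
    for D in decision_classes:               # union of the lower approximations
        pos |= {s for s, cls in enumerate(tol_classes) if cls <= D}
    return len(pos) / len(tol_classes)
```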
Let $(S, C, d)$ be an scgd-space. Given $G \subseteq C$ and $\theta \in [0, 1]$. $\forall i$, define
$$dec(D_i) = |D_i|, \quad cert_{G_\theta}(D_i) = |\underline{G_\theta}(D_i)|, \quad poss_{G_\theta}(D_i) = |\overline{G_\theta}(D_i)|.$$
Then, $dec(D_i)$, $cert_{G_\theta}(D_i)$, and $poss_{G_\theta}(D_i)$ are called the decision index, certain decision index, and possible decision index of $D_i$, respectively.
Combining these indices with the notion of self-information, the following definition is introduced.
Definition 6.
For an scgd-space $(S, C, d)$, let $G \subseteq C$ and $\theta \in [0, 1]$. Define
$$cSI_{G}^{\theta}(d) = -\sum_{i=1}^{r}\left(1 - \frac{cert_{G_\theta}(D_i)}{dec(D_i)}\right)\log_2\frac{cert_{G_\theta}(D_i)}{dec(D_i)};$$
$$pSI_{G}^{\theta}(d) = -\sum_{i=1}^{r}\left(1 - \frac{dec(D_i)}{poss_{G_\theta}(D_i)}\right)\log_2\frac{dec(D_i)}{poss_{G_\theta}(D_i)};$$
$$SI_{G}^{\theta}(d) = \frac{1}{2}\left(cSI_{G}^{\theta}(d) + pSI_{G}^{\theta}(d)\right);$$
$$rSI_{G}^{\theta}(d) = -\sum_{i=1}^{r}\left(1 - \frac{cert_{G_\theta}(D_i)}{poss_{G_\theta}(D_i)}\right)\log_2\frac{cert_{G_\theta}(D_i)}{poss_{G_\theta}(D_i)}.$$
Then, $cSI_{G}^{\theta}(d)$, $pSI_{G}^{\theta}(d)$, $SI_{G}^{\theta}(d)$, and $rSI_{G}^{\theta}(d)$ are called the certain decision θ-self-information, possible decision θ-self-information, θ-self-information, and relative θ-self-information of the subspace $(S, G, d)$, respectively.
When $G = \emptyset$, let
$$cSI_{G}^{\theta}(d) = pSI_{G}^{\theta}(d) = SI_{G}^{\theta}(d) = rSI_{G}^{\theta}(d) = -\sum_{i=1}^{r}\left(1 - \frac{dec(D_i)}{n}\right)\log_2\frac{dec(D_i)}{n}.$$
The four kinds of decision self-information measures in Definition 6 characterize the classification ability of the feature subset $G$ from different angles. Thus, they can be viewed as feature evaluation functions.
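The following sketch computes the four measures of Definition 6 from the decision indices; it is an illustration under the same assumed data structures as the earlier sketches, not the paper's implementation. A decision class with a zero certain decision index would make the logarithm diverge, so a practical implementation has to handle that case; the guard below is an assumption.

```python
import numpy as np

def self_information(tol_classes, decision_classes):
    """cSI, pSI, SI, and rSI of Definition 6 (base-2 logarithms)."""
    cSI = pSI = rSI = 0.0
    for D in decision_classes:
        dec = len(D)
        cert = sum(1 for cls in tol_classes if cls <= D)   # |lower approximation|
        poss = sum(1 for cls in tol_classes if cls & D)    # |upper approximation|
        pSI -= (1 - dec / poss) * np.log2(dec / poss)
        if cert > 0:                                       # guard: log2(0) diverges
            cSI -= (1 - cert / dec) * np.log2(cert / dec)
            rSI -= (1 - cert / poss) * np.log2(cert / poss)
    SI = 0.5 * (cSI + pSI)
    return cSI, pSI, SI, rSI
```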
Proposition 4.
For an scgd-space $(S, C, d)$:
(1) If $G_1 \subseteq G_2 \subseteq C$, then
$$cSI_{G_2}^{\theta}(d) \le cSI_{G_1}^{\theta}(d), \quad pSI_{G_2}^{\theta}(d) \le pSI_{G_1}^{\theta}(d),$$
$$SI_{G_2}^{\theta}(d) \le SI_{G_1}^{\theta}(d), \quad rSI_{G_2}^{\theta}(d) \le rSI_{G_1}^{\theta}(d).$$
(2) If $0 \le \theta_1 \le \theta_2 \le 1$, then
$$cSI_{G}^{\theta_1}(d) \le cSI_{G}^{\theta_2}(d), \quad pSI_{G}^{\theta_1}(d) \le pSI_{G}^{\theta_2}(d),$$
$$SI_{G}^{\theta_1}(d) \le SI_{G}^{\theta_2}(d), \quad rSI_{G}^{\theta_1}(d) \le rSI_{G}^{\theta_2}(d).$$
(3) $2\,SI_{G}^{\theta}(d) \le rSI_{G}^{\theta}(d)$.
Proof. 
(1) Suppose $G_1 \subseteq G_2 \subseteq C$. Then, from Proposition 3(2), we have
$$\forall i, \quad \underline{G_{1\theta}}(D_i) \subseteq \underline{G_{2\theta}}(D_i).$$
So, $\forall i$, $cert_{G_{1\theta}}(D_i) \le cert_{G_{2\theta}}(D_i)$. It follows that $\forall i$,
$$\frac{cert_{G_{1\theta}}(D_i)}{dec(D_i)} \le \frac{cert_{G_{2\theta}}(D_i)}{dec(D_i)}, \quad \log_2\frac{cert_{G_{1\theta}}(D_i)}{dec(D_i)} \le \log_2\frac{cert_{G_{2\theta}}(D_i)}{dec(D_i)}.$$
Thus,
$$0 \le 1 - \frac{cert_{G_{2\theta}}(D_i)}{dec(D_i)} \le 1 - \frac{cert_{G_{1\theta}}(D_i)}{dec(D_i)}, \quad 0 \le -\log_2\frac{cert_{G_{2\theta}}(D_i)}{dec(D_i)} \le -\log_2\frac{cert_{G_{1\theta}}(D_i)}{dec(D_i)}.$$
Hence,
$$cSI_{G_2}^{\theta}(d) \le cSI_{G_1}^{\theta}(d).$$
Similarly, we can prove that
$$G_1 \subseteq G_2 \Rightarrow pSI_{G_2}^{\theta}(d) \le pSI_{G_1}^{\theta}(d), \quad SI_{G_2}^{\theta}(d) \le SI_{G_1}^{\theta}(d), \quad rSI_{G_2}^{\theta}(d) \le rSI_{G_1}^{\theta}(d).$$
(2) Suppose $0 \le \theta_1 \le \theta_2 \le 1$. Then, from Proposition 3(3), we have
$$\forall i, \quad \underline{G_{\theta_2}}(D_i) \subseteq \underline{G_{\theta_1}}(D_i).$$
So, $\forall i$, $cert_{G_{\theta_2}}(D_i) \le cert_{G_{\theta_1}}(D_i)$. It follows that $\forall i$,
$$\frac{cert_{G_{\theta_2}}(D_i)}{dec(D_i)} \le \frac{cert_{G_{\theta_1}}(D_i)}{dec(D_i)}, \quad \log_2\frac{cert_{G_{\theta_2}}(D_i)}{dec(D_i)} \le \log_2\frac{cert_{G_{\theta_1}}(D_i)}{dec(D_i)}.$$
Thus,
$$0 \le 1 - \frac{cert_{G_{\theta_1}}(D_i)}{dec(D_i)} \le 1 - \frac{cert_{G_{\theta_2}}(D_i)}{dec(D_i)}, \quad 0 \le -\log_2\frac{cert_{G_{\theta_1}}(D_i)}{dec(D_i)} \le -\log_2\frac{cert_{G_{\theta_2}}(D_i)}{dec(D_i)}.$$
Hence,
$$cSI_{G}^{\theta_1}(d) \le cSI_{G}^{\theta_2}(d).$$
Similarly, we can prove that
$$\theta_1 \le \theta_2 \Rightarrow pSI_{G}^{\theta_1}(d) \le pSI_{G}^{\theta_2}(d), \quad SI_{G}^{\theta_1}(d) \le SI_{G}^{\theta_2}(d), \quad rSI_{G}^{\theta_1}(d) \le rSI_{G}^{\theta_2}(d).$$
(3) Put
$$m_i^{(1)} = \frac{cert_{G_\theta}(D_i)}{dec(D_i)}, \quad m_i^{(2)} = \frac{dec(D_i)}{poss_{G_\theta}(D_i)}, \quad m_i^{(3)} = \frac{cert_{G_\theta}(D_i)}{poss_{G_\theta}(D_i)};$$
$$n_i^{(1)} = 1 - m_i^{(1)}, \quad n_i^{(2)} = 1 - m_i^{(2)}, \quad n_i^{(3)} = 1 - m_i^{(3)}.$$
Then, $m_i^{(3)} = m_i^{(1)} m_i^{(2)}$ and
$$n_i^{(3)} = 1 - m_i^{(1)} m_i^{(2)} = 1 - (1 - n_i^{(1)})(1 - n_i^{(2)}) = n_i^{(1)} + n_i^{(2)} - n_i^{(1)} n_i^{(2)}.$$
$$\begin{aligned}
rSI_{G}^{\theta}(d) &= -\sum_{i=1}^{r} n_i^{(3)} \log_2 m_i^{(3)} = -\sum_{i=1}^{r} n_i^{(3)} \log_2 \big(m_i^{(1)} m_i^{(2)}\big) \\
&= -\sum_{i=1}^{r} n_i^{(3)} \log_2 m_i^{(1)} - \sum_{i=1}^{r} n_i^{(3)} \log_2 m_i^{(2)} \\
&= -\sum_{i=1}^{r} n_i^{(1)} \log_2 m_i^{(1)} - \sum_{i=1}^{r} n_i^{(2)} \log_2 m_i^{(2)} + \varepsilon \\
&= cSI_{G}^{\theta}(d) + pSI_{G}^{\theta}(d) + \varepsilon = 2\,SI_{G}^{\theta}(d) + \varepsilon,
\end{aligned}$$
where
$$\begin{aligned}
\varepsilon &= -\sum_{i=1}^{r} n_i^{(3)} \log_2 m_i^{(1)} - \sum_{i=1}^{r} n_i^{(3)} \log_2 m_i^{(2)} + \sum_{i=1}^{r} n_i^{(1)} \log_2 m_i^{(1)} + \sum_{i=1}^{r} n_i^{(2)} \log_2 m_i^{(2)} \\
&= -\sum_{i=1}^{r} \big(n_i^{(3)} - n_i^{(1)}\big) \log_2 m_i^{(1)} - \sum_{i=1}^{r} \big(n_i^{(3)} - n_i^{(2)}\big) \log_2 m_i^{(2)} \\
&= -\sum_{i=1}^{r} \big(n_i^{(2)} - n_i^{(1)} n_i^{(2)}\big) \log_2 m_i^{(1)} - \sum_{i=1}^{r} \big(n_i^{(1)} - n_i^{(1)} n_i^{(2)}\big) \log_2 m_i^{(2)} \\
&= -\sum_{i=1}^{r} n_i^{(2)} \big(1 - n_i^{(1)}\big) \log_2 m_i^{(1)} - \sum_{i=1}^{r} n_i^{(1)} \big(1 - n_i^{(2)}\big) \log_2 m_i^{(2)} \ge 0.
\end{aligned}$$
Thus, $2\,SI_{G}^{\theta}(d) \le rSI_{G}^{\theta}(d)$.    □
Definition 7.
For an scgd-space $(S, C, d)$, let $G \subseteq C$ and $\theta, \alpha \in [0, 1]$. Define
$$iSI_{G}^{\theta,\alpha}(d) = \alpha\, SI_{G}^{\theta}(d) + (1 - \alpha)\, pSI_{G}^{\theta}(d),$$
where $\alpha$ and $1 - \alpha$ are the weighting factors of $SI_{G}^{\theta}(d)$ and $pSI_{G}^{\theta}(d)$, respectively. Then, $iSI_{G}^{\theta,\alpha}(d)$ is called the integrated θ-self-information of the subspace $(S, G, d)$ with respect to α.
Example 1.
Consider the scgd-space $(S, C, d)$ listed in Table 1, where $S = \{x_1, x_2, x_3\}$ is a set of cells, $C = \{a_1, a_2, a_3\}$ is a set of genes, and $d$ is a decision attribute that divides the domain $S$ into two decision equivalence classes $D_1 = \{x_1, x_2\}$ and $D_2 = \{x_3\}$.
If the neighborhood granule size is set at $\theta = 0.3$, the relation matrices are as follows:
$$G_{a_1\theta} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad G_{a_2\theta} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \quad G_{a_3\theta} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix},$$
and the certain decision index and the possible decision index are listed in Table 2 and Table 3, respectively.
Then, the decision self-information is listed in Table 4 and Table 5.

4. Gene Selection Algorithms in an scgd-Space Based on Self-Information

4.1. Preliminaries

Definition 8.
For an scgd-space $(S, C, d)$, the normalization equation of an scgd-space is known as
$$x_{ij}^{*} = \frac{x_{ij} - x_{j}^{\min}}{(x_{j}^{\max} - x_{j}^{\min}) + \tau},$$
where $x_{ij}$ denotes the gene expression level of the cell $x_i$ versus the gene $a_j$, $x_{j}^{\max} = \max\{x_{ij} : i = 1, 2, \ldots, n\}$, $x_{j}^{\min} = \min\{x_{ij} : i = 1, 2, \ldots, n\}$, and, to avoid an overflow error when the denominator equals 0, the parameter $\tau$ denotes a very small number.
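A minimal sketch of the column-wise normalization in Definition 8, assuming a NumPy matrix with cells in rows and genes in columns; the constant `tau` is a placeholder for the small number mentioned above and is an assumption, not the paper's value.

```python
import numpy as np

def normalize(X, tau=1e-12):
    """Min-max normalize each gene (column) as in Definition 8."""
    X = X.astype(float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / ((x_max - x_min) + tau)   # tau avoids division by zero
```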
A preliminary subset of genes from an scgd-space is selected using Fisher's score.
Definition 9.
Let $(S, C, d)$ be an scgd-space and let $a \in C$ be a gene. Fisher's score $f(a)$ of $a$ is formulated as
$$f(a) = \frac{\sum_{j=1}^{n} m_j \big(\mu_{j}^{a} - \mu^{a}\big)^2}{\sum_{j=1}^{n} m_j \big(\theta_{j}^{a}\big)^2 + \tau},$$
in which $n$ stands for the number of ground-truth classes, $m_j$ stands for the number of cells of class $j$, $\mu_{j}^{a}$ and $\theta_{j}^{a}$ denote the average value and standard deviation of the cells of class $j$ in terms of gene $a$, and $\mu^{a}$ represents the average value of all cells in terms of gene $a$. Since some scRNA-seq data may contain values such that all $\theta_{j}^{a}$ are equal to zero, $\tau$ is a very small number used to prevent zeros from appearing in the denominator.
Based on Fisher's score defined in Formula (21), we propose an algorithm for selecting a preliminary subset of genes.
Let $k$ be the expected number of genes. In Algorithm 1, Fisher's score is used as the measure to select genes according to Definition 9.
Algorithm 1 has a "for" loop that determines its time complexity. Given that $|S| = n$ and $|C| = m$, Algorithm 1 has a time complexity of $O(m)$. This means that the time taken by Algorithm 1 to execute increases linearly with the size of $C$ and not $S$.
Algorithm 1: An algorithm for selecting genes in an scgd-space based on Fisher's score
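The pseudocode of Algorithm 1 is given as a figure in the original article. Purely as an illustration of the Fisher-score-based preliminary selection it performs, here is a minimal sketch assuming a NumPy cells-by-genes matrix `X` and integer class labels `y`; the helper names and the small constant `tau` are assumptions, not the paper's code.

```python
import numpy as np

def fisher_scores(X, y, tau=1e-12):
    """Fisher's score of every gene (column of X), as in Definition 9.
    X: (n_cells, n_genes) expression matrix; y: class label per cell."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)                      # mu^a for every gene
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        m_j = Xc.shape[0]                              # number of cells in class j
        num += m_j * (Xc.mean(axis=0) - overall_mean) ** 2
        den += m_j * Xc.std(axis=0) ** 2               # (theta_j^a)^2
    return num / (den + tau)                           # tau avoids a zero denominator

def select_top_k(X, y, k):
    """Preliminary selection: indices of the k genes with the highest score."""
    scores = fisher_scores(X, y)
    return np.argsort(scores)[::-1][:k]
```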

4.2. Gene Selection Algorithms

Based on θ-self-information, relative θ-self-information, and integrated θ-self-information, we propose three gene selection algorithms (i.e., Algorithms 2–4) in an scgd-space. To reduce the time complexities of the three algorithms, Algorithm 1 is first employed to select a preliminary gene subset. Afterwards, the proposed algorithms are applied to this gene subset to select a core gene subset. With this preprocessing step, the proposed algorithms take as input a much smaller gene subset compared to the original gene set and have a greatly reduced time complexity.
Considering that θ-self-information and relative θ-self-information can not only combine the classification information of the upper and lower approximations and reduce the possibility of misclassification but also have a better convergence effect, we choose them as feature evaluation functions to measure the classification ability of a feature subset. For this reason, we propose the following algorithms (i.e., Algorithms 2 and 3).
Integrated θ-self-information not only has the same efficacy as θ-self-information and relative θ-self-information, but also weights their contributions when combining them. For this reason, we propose Algorithm 4.
Given that $|S| = n$, $|C| = m$, and $|G^*| = m_{G^*}$, the outer "while" loop and the inner "for" loop of Algorithm 2 determine its time complexity. Because the variable start determines the outer "while" loop, in the worst case all elements in $G^*$ must be examined. So, Algorithm 2 has a best-case time complexity of $O(n^2 m_{G^*})$ and a worst-case time complexity of $O(n^2 m_{G^*}^2)$. Similarly, Algorithms 3 and 4 also have an approximate time complexity of $O(n^2 m_{G^*}^2)$. In particular, since $m_{G^*} \ll m$, the result of Algorithm 1 has a large impact on the time complexity of the proposed algorithms. Figure 2 shows the framework chart of Algorithm 4.
Algorithm 2: A gene selection algorithm based on θ-self-information (F-SI) in an scgd-space
Algorithm 3: A gene selection algorithm based on relative θ-self-information (Fr-SI) in an scgd-space
Algorithm 4: A gene selection algorithm based on integrated θ-self-information (Fi-SI) in an scgd-space
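The pseudocode of Algorithms 2–4 appears as figures in the original article. Purely to illustrate the general pattern they describe (a greedy forward search over the preliminary gene subset driven by one of the self-information measures), here is a minimal, assumed sketch; the stopping rule, tie-breaking, and exact evaluation order in the paper may differ. The `evaluate` callback is a placeholder that could, for instance, wrap the `self_information` sketch from Section 3 applied to the current gene subset.

```python
def greedy_selection(evaluate, candidate_genes, tol=1e-6):
    """Greedy forward selection: repeatedly add the gene that lowers the
    chosen self-information measure the most, until no gene improves it.
    evaluate(subset) -> value of SI / rSI / iSI for that gene subset."""
    selected = []
    best = evaluate(selected)                 # empty-subset baseline (G = empty set case)
    while True:
        gains = [(evaluate(selected + [g]), g)
                 for g in candidate_genes if g not in selected]
        if not gains:
            break
        value, gene = min(gains)              # smaller measure = stronger classification ability
        if best - value <= tol:               # no meaningful improvement: stop
            break
        selected.append(gene)
        best = value
    return selected
```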

5. Experimental Analysis

This section presents experiments to evaluate the performance of the proposed algorithms. The experimental flow is illustrated in Figure 3. We first employ Fisher's score to select the preliminary genes and then implement the proposed algorithms to select gene subsets on several publicly available single-cell sequencing datasets. To evaluate the performance of the proposed algorithms, the classification accuracy (ACC) of three classifiers, KNN (k-nearest neighbors), SVM (support vector machine), and DT (decision tree), is taken as the performance metric. Then, two common dimension reduction methods, PCA and tSNE, are performed, and the ACC of the three classifiers is compared with that of the three proposed algorithms. Moreover, we compare the proposed algorithms with three rough set-based algorithms. Finally, to eliminate randomness, two statistical tests are utilized to further verify the results.

5.1. Dataset and Preprocess

Several single-cell RNA sequencing (scRNA-seq) datasets are used to evaluate the performance of gene selection algorithms mentioned above. Detailed information on these datasets is shown in Table 6. These datasets can be downloaded from the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/, accessed on 20 September 2022).
Firstly, single-cell sequencing data consist of one file per sample, and it is necessary to combine these data into a matrix format. During the combination process, only genes that are present in all samples are selected to form the matrix data. Secondly, owing to the limited sensitivity and specificity of single-cell RNA sequencing technology and the considerable noise in RNA-seq data, many genes are not expressed in single-cell gene sequencing data, and their expression values are set to zero. Therefore, genes that have zero values in all the samples are removed.

5.2. Preliminary Number of Genes

Due to the large number of genes in single-cell datasets, as a preliminary step in this study, a subset selection method based on Fisher's score is employed to acquire a gene subset. This preliminary method calculates the Fisher's score of each gene and selects the top k genes with the highest scores, where the value of k is varied from 50 to 3000 in increments of 50. Afterwards, the selected genes are used as features in a KNN (k-nearest neighbors) classifier with the number of neighbors set to 5. To avoid sample randomness, ten-fold cross-validation is employed, and the accuracy is determined by the average of the classification accuracies.
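A minimal sketch of this sweep, assuming scikit-learn is available and that the genes have already been ranked by Fisher's score (for example, by the preliminary-selection sketch in Section 4.1); dataset loading and preprocessing are omitted, and the function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def accuracy_vs_k(X, y, ranked_genes, k_values=range(50, 3001, 50)):
    """Ten-fold CV accuracy of a 5-NN classifier on the top-k ranked genes.
    ranked_genes: gene indices sorted by decreasing Fisher's score."""
    results = {}
    for k in k_values:
        cols = ranked_genes[:k]
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                              X[:, cols], y, cv=10).mean()
        results[k] = acc
    return results
```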
Figure 4 depicts the ten-fold classification accuracy results for the above datasets. The accuracy varies with the number of genes selected by Algorithm 1. We found that the classification accuracy stabilizes as k increases. Considering that gene selection algorithms aim to maximize classification accuracy while minimizing the number of selected genes, the minimal k that reaches the maximal ten-fold classification accuracy is taken as the appropriate number of preliminary genes, and a summary of the results is shown in Table 7.

5.3. Benchmarking Compared with Raw Data and Fisher Data

In this section, the performance of the proposed algorithms is compared with that of the raw data and the Fisher data. First, Figure 5 and Table 7 describe the number of selected genes. The accuracy results obtained by the three classifiers (KNN, SVM, and DT) are shown in Figure 6 and Table 8, Table 9 and Table 10.
Figure 5 illustrates the number of selected genes across the various datasets using different methods: raw data, Fisher, F-SI, Fr-SI, and Fi-SI. In each dataset, the number of genes selected by each method is compared, with fewer genes indicating a better method. It can be observed that the number of genes selected by the proposed algorithms is significantly lower than the number retained by the raw data and the Fisher data. Table 7 presents the number of genes chosen by F-SI, Fr-SI, and Fi-SI. The proposed algorithms generally outperform the raw and Fisher methods by selecting fewer genes, with the F-SI and Fr-SI methods frequently showing the best performance across multiple datasets.
Meanwhile, reducing a large number of features did not result in a decrease in classification accuracy. We compared the classification performance of the proposed algorithms with that of the raw data and the Fisher data across all datasets. Detailed results can be seen in Table 8, Table 9 and Table 10 and Figure 6. The underlined ACC values in Table 8, Table 9 and Table 10 indicate the largest ACC values with respect to the raw data and the Fisher data. The radar charts illustrate the classification accuracy of the three classifiers applied to the various datasets. Each spoke of the radar represents a dataset, and the distance from the center indicates the accuracy percentage, with greater distance from the center indicating higher accuracy. We found that the performances of F-SI, Fr-SI, and Fi-SI are slightly higher than that of the Fisher data and far better than that of the raw data.
These results are significant because gene selection directly impacts the quality and interpretability of downstream analyses. Proper gene selection helps in identifying key genes that are relevant to the biological question being studied, reducing noise and improving the accuracy of downstream analyses. By selecting the right genes, researchers can focus on specific biological processes, cell types, or pathways of interest, leading to more meaningful insights and discoveries from single-cell sequencing data.

5.4. Performance Comparisons with PCA and tSNE

To evaluate the performance of the proposed algorithms, their results are compared with those of PCA and tSNE, two commonly used dimensionality reduction methods.
The overall classification accuracy results under the three classifiers (KNN, SVM, and DT) are presented in Figure 7. In comparison, the proposed Algorithms 2–4 maintain stable accuracy across different datasets. By contrast, the accuracy results for tSNE under the KNN and DT classifiers on dataset GSE51372 are only 55.60% and 55.66%, respectively, suggesting unstable performance across datasets.
Detailed results are presented in Table 8, Table 9 and Table 10. The underlined ACC value is the largest ACC among the algorithms. The comparison shows that the classification accuracy of the proposed algorithms is higher than that of PCA and tSNE. Under the three classifiers (KNN, SVM, and DT), the proposed algorithms attain the largest ACC in 10, 10, and 9 of the 10 datasets, respectively. It can be concluded that, under the three classifiers (KNN, SVM, and DT), the performance of the proposed algorithms is superior to that of tSNE and PCA in terms of ACC values.

5.5. Comparisons with Other Algorithms

To assess the performance of the algorithms, we use three rough set-based feature selection algorithms—a neighborhood mutual information-based algorithm [24] (NMI), a neighborhood discrimination index-based algorithm [8] (NDI), and a neighborhood rough set-based algorithm [32] (NRS)—to contrast the ACC of the three classifiers. Table 8, Table 9 and Table 10 present the detailed results, and the underlined ACC value again indicates the largest ACC among the algorithms. Under the three classifiers (KNN, SVM, and DT), the proposed algorithms attain the largest ACC in 10, 10, and 8 of the 10 datasets, respectively. Specifically, the Fi-SI algorithm attains the largest ACC in 10, 10, and 8 of the 10 datasets, F-SI attains the largest ACC in 5, 5, and 1 datasets, and Fr-SI attains the largest ACC in 5, 5, and 2 datasets. It can be found that the three proposed algorithms outperform the three contrasting algorithms on most of the datasets. The ranks of the ACC values are displayed in Table 11. The penultimate and final rows depict the Friedman mean rank (FMR) and the comprehensive rank of each algorithm, where the FMR is the average ranking of an algorithm. These statistical results lead to the same conclusion.

5.6. Statistical Analysis

In order to check whether there are significant differences between the proposed algorithms and the other algorithms, the Friedman test and the Bonferroni–Dunn test are utilized to analyze the ACC results together with those of NDI, NMI, and NRS.
The Friedman test is used to determine whether all algorithms have the same performance. The Friedman statistic is formulated as follows.
$$\chi_F^2 = \frac{12N}{k(k+1)}\left(\sum_{i=1}^{k} r_i^2 - \frac{k(k+1)^2}{4}\right),$$
where $k$ denotes the number of algorithms, $N$ denotes the number of datasets, and $r_i$ denotes the mean rank of the $i$-th algorithm. As the Friedman test is too conservative, it is replaced by the $F_F$ statistic:
$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}.$$
The $F_F$ statistic follows a Fisher distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom.
The calculation of the Friedman statistic is facilitated by the aforementioned ranks in Table 11. The FMR in Table 11 allows for the computation of the corresponding statistics $\chi_F^2$ and $F_F$ under the KNN, SVM, and DT classifiers using Formulas (22) and (23). Table 12 displays two sets of statistics that correspond to Table 11.
The null hypothesis of the Friedman test is that the classification performances of the six algorithms are identical. If the statistic $F_F > F_\alpha(k-1, (k-1)(N-1))$, then the algorithms have different performance. Table 11 presents the numbers of algorithms and datasets, which are 6 and 10, respectively; these correspond to $k = 6$ and $N = 10$ in Formulas (22) and (23). The critical value is $F_{0.1}(5, 45) = 1.980$ at the significance level $\alpha = 0.1$. From Table 12, it can be found that all three statistics are larger than this critical value. Accordingly, the performances of the six algorithms differ statistically significantly under the KNN, DT, and SVM classifiers.
To judge whether the performance of the proposed algorithms exceed those of NDI, NMI and NRS, the Bonferroni–Dunn test is employed as post hoc analysis. In order to determine which algorithm differs significantly from the control algorithm, pairwise comparisons are performed between the algorithms. Following is the formula for the critical distance statistic (CD):
$$CD_\alpha = q_\alpha \sqrt{\frac{k(k+1)}{6N}},$$
where $q_\alpha$ is the critical value of the Tukey distribution with $k$ degrees of freedom. Statistically, if the absolute FMR difference between two algorithms is greater than $CD_\alpha$, then one algorithm performs significantly better than the other.
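Purely as an illustration of Formulas (22)–(24), here is a minimal sketch assuming a NumPy array of per-dataset ranks; the value $q_{0.1} \approx 2.326$ used in the comment is the critical value implied by the paper's $CD_{0.1} = 1.94$ for six algorithms and ten datasets, and would normally be looked up in a Bonferroni–Dunn table.

```python
import numpy as np

def friedman_statistics(ranks):
    """ranks: (N datasets, k algorithms) array of per-dataset ranks.
    Returns chi_F^2 and F_F as in Formulas (22) and (23)."""
    N, k = ranks.shape
    r = ranks.mean(axis=0)                              # mean rank per algorithm (FMR)
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, F_F

def critical_distance(q_alpha, k, N):
    """Bonferroni-Dunn critical distance, Formula (24)."""
    return q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Example: six algorithms, ten datasets, q_0.1 ~ 2.326 gives CD ~ 1.94.
# cd = critical_distance(2.326, k=6, N=10)
```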
In terms of Formula (24), the critical value $CD_{0.1} = 1.94$ for $\alpha = 0.1$, $k = 6$, and $N = 10$. Figure 8, Figure 9 and Figure 10 show the critical diagrams ranking all six algorithms. The axis represents the average ranking of each algorithm across all tested datasets; lower ranks are generally better, indicating higher performance. The red horizontal line indicates the threshold of statistical significance: if the ranks of two algorithms differ by more than this CD, their performance difference is considered statistically significant. It can be concluded from these figures that, for the KNN and SVM classifiers, Fi-SI outperforms NMI, NRS, and NDI. For the DT classifier, Fi-SI does not perform significantly better than NDI, but it outperforms NMI and NRS.

5.7. Sensitivity Analysis

Finally, we conducted further experiments to examine how performance changes with the neighborhood parameter θ and the integration parameter α of Algorithm 4. The parameters θ and α were varied from 0.1 to 0.9 in steps of 0.1. Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 show the ACC curves for the different datasets with the SVM classifier. The results for KNN and DT are similar to those for SVM.
The results indicate that higher accuracy is achievable over an expanded area in the majority of cases. Concurrently, most datasets exhibit stability within their respective regions. Consequently, these figures facilitate the selection of appropriate values for each dataset.
It is observed that numerous datasets achieve optimal accuracy as α varies. By taking into account both the upper and lower approximations of the decision, the characterization of the knowledge can be enhanced, thereby leading to improved classification accuracy.
Specifically, in the results from GSE57249, the minimal variation in accuracy with respect to the parameter α indicates that $SI_{G}^{\theta}(d)$ and $pSI_{G}^{\theta}(d)$, which measure classification certainty (through the lower approximation) and likelihood (through the upper approximation), respectively, are highly correlated. This correlation arises when samples from different classes are fully separable within the attribute space, resulting in clear-cut partitions of neighborhoods or equivalence classes.

6. Conclusions

Gene selection is a valuable method for discovering gene subsets that are meaningful for understanding disease. This paper proposed three self-information-based gene selection algorithms that sufficiently consider the information contained in the upper and lower approximations, which facilitates the selection of optimal gene subsets. The main highlights of this paper are as follows: (1) By defining the tolerance relation on the cell set, the tolerance class and rough approximations are constructed, overcoming the shortcomings of the traditional rough set model. (2) Considering both lower and upper approximations, three uncertainty measures (self-information, relative self-information, and integrated self-information) are defined that further take advantage of the upper approximation and may result in more accurate gene selection. (3) Several open single-cell datasets downloaded from GEO were preprocessed, and experiments on them show that the classification performance of the proposed algorithms can be improved by selecting appropriate genes related to classification.
The experimental results reveal that the Fi-SI method attains an impressive average classification accuracy of 93.7% when utilizing the KNN classifier while simultaneously selecting a significantly reduced number of genes compared to the gene selection achieved by Fisher's score. There is still room for further optimization and improvement of the designed algorithms: (1) When processing datasets containing large numbers of cells and genes, improving efficiency and reducing execution time are needed. (2) Preliminary gene selection based on Fisher's score may discard meaningful genes that achieve a low Fisher score; whether the preliminary selection can be removed to achieve a more comprehensive result, if efficiency and time allow, needs to be investigated. In the future, we will investigate the application of the proposed algorithms to other types of data, such as biomedical data and image data. We will integrate ensemble methods like random forest to compare feature importance rankings with self-information metrics, further bridging uncertainty-aware selection and ensemble learning. We will also apply deep learning and mutual information bottleneck methods to gene selection.

Author Contributions

Y.F.: methodology, investigation, writing—original draft; Y.L.: methodology, editing; C.H.: editing, investigation; Z.L.: validation, editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Fujian Province (2022J01309).

Data Availability Statement

The data used or analyzed during the current study are available from the corresponding author after the paper is accepted for publication.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Ayub, S.; Shabir, M.; Riaz, M.; Karaaslan, F.; Marinkovic, D.; Vranjes, D. Linear Diophantine Fuzzy Rough Sets on Paired Universes with Multi Stage Decision Analysis. Axioms 2022, 11, 686.
2. Ayub, S.; Shabir, M.; Riaz, M.; Mahmood, W.; Bozanic, D.; Marinkovic, D. Linear Diophantine Fuzzy Rough Sets: A New Rough Set Approach with Decision Making. Symmetry 2022, 14, 525.
3. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
4. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
5. Wang, C.; Huang, Y.; Shao, M.; Hu, Q.; Chen, D. Feature selection based on neighborhood self-information. IEEE Trans. Cybern. 2019, 50, 4031–4042.
6. Zhang, Q.; Chen, Y.; Zhang, G.; Li, Z.; Chen, L.; Wen, C.F. New uncertainty measurement for categorical data based on fuzzy information structures: An application in attribute reduction. Inf. Sci. 2021, 580, 541–577.
7. Li, Z.; Zhang, P.; Ge, X.; Xie, N.; Zhang, G.; Wen, C.F. Uncertainty measurement for a fuzzy relation information system. IEEE Trans. Fuzzy Syst. 2019, 27, 2338–2352.
8. Navarrete, J.; Viejo, D.; Cazorla, M. Color smoothing for RGB-D data using entropy information. Appl. Soft Comput. 2016, 46, 361–380.
9. Hempelmann, C.F.; Sakoglu, U.; Gurupur, V.P.; Jampana, S. An entropy-based evaluation method for knowledge bases of medical information systems. Expert Syst. Appl. 2016, 46, 262–273.
10. Delgado, A.; Romero, I. Environmental conflict analysis using an integrated grey clustering and entropy-weight method: A case study of a mining project in Peru. Environ. Model. Softw. 2016, 77, 108–121.
11. Zeng, A.; Li, T.; Liu, D.; Zhang, J.; Chen, H. A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst. 2015, 258, 39–60.
12. Kim, K.J.; Jun, C.H. Rough set model based feature selection for mixed-type data with feature space decomposition. Expert Syst. Appl. 2018, 103, 196–205.
13. Wang, C.; Wang, Y.; Shao, M.; Qian, Y.; Chen, D. Fuzzy rough attribute reduction for categorical data. IEEE Trans. Fuzzy Syst. 2019, 28, 818–830.
14. Dai, J.H.; Hu, H.; Zheng, G.J.; Hu, Q.H.; Han, H.F.; Shi, H. Attribute reduction in interval-valued information systems based on information entropies. Front. Inf. Technol. Electron. Eng. 2016, 17, 919–928.
15. Singh, S.; Shreevastava, S.; Som, T.; Somani, G. A fuzzy similarity-based rough set approach for attribute selection in set-valued information systems. Soft Comput. 2020, 24, 4675–4691.
16. Sang, B.; Chen, H.; Yang, L.; Li, T.; Xu, W. Incremental feature selection using a conditional entropy based on fuzzy dominance neighborhood rough sets. IEEE Trans. Fuzzy Syst. 2021, 30, 1683–1697.
17. Huang, Z.; Li, J. Discernibility Measures for Fuzzy β Covering and Their Application. IEEE Trans. Cybern. 2022, 52, 9722–9735.
18. Jia, X.; Rao, Y.; Shang, L.; Li, T. Similarity-based attribute reduction in rough set theory: A clustering perspective. Int. J. Mach. Learn. Cybern. 2020, 11, 1047–1060.
19. Li, Z.; Qu, L.; Zhang, G.; Xie, N. Attribute selection for heterogeneous data based on information entropy. Int. J. Gen. Syst. 2021, 50, 548–566.
20. Wang, Y.; Chen, X.; Dong, K. Attribute reduction via local conditional entropy. Int. J. Mach. Learn. Cybern. 2019, 10, 3619–3634.
21. Yuan, Z.; Chen, H.; Zhang, P.; Wan, J.; Li, T. A novel unsupervised approach to heterogeneous feature selection based on fuzzy mutual information. IEEE Trans. Fuzzy Syst. 2021, 30, 3395–3409.
22. Chen, Z.; Liu, K.; Yang, X.; Fujita, H. Random sampling accelerator for attribute reduction. Int. J. Approx. Reason. 2022, 140, 75–91.
23. Jiang, Z.; Liu, K.; Yang, X.; Yu, H.; Fujita, H.; Qian, Y. Accelerator for supervised neighborhood based attribute reduction. Int. J. Approx. Reason. 2020, 119, 122–150.
24. Chen, Y.; Liu, K.; Song, J.; Fujita, H.; Yang, X.; Qian, Y. Attribute group for attribute reduction. Inf. Sci. 2020, 535, 64–80.
25. Buettner, F.; Natarajan, K.N.; Casale, F.P.; Proserpio, V.; Scialdone, A.; Theis, F.J.; Teichmann, S.A.; Marioni, J.C.; Stegle, O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015, 33, 155–160.
26. Chung, W.; Eum, H.H.; Lee, H.O.; Lee, K.M.; Lee, H.B.; Kim, K.T.; Ryu, H.S.; Kim, S.; Lee, J.E.; Park, Y.H.; et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat. Commun. 2017, 8, 15081.
27. Bommert, A.; Welchowski, T.; Schmid, M.; Rahnenführer, J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Briefings Bioinform. 2021, 23, bbab354.
28. Li, Z.; Zhang, J.; Liu, F.; Wen, C.F. Uncertainty measurement for single cell RNA-seq data via Gaussian kernel: Application to unsupervised gene selection. Eng. Appl. Artif. Intell. 2024, 130, 107707.
29. Sharma, A.; Rani, R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Comput. Methods Programs Biomed. 2019, 178, 219–235.
30. Zhang, Q.; Zhao, Z.; Liu, F.; Li, Z. Uncertainty measurement for single cell RNA-seq data based on class-consistent technology with application to semi-supervised gene selection. Appl. Soft Comput. 2023, 146, 110645.
31. Sun, L.; Zhang, X.Y.; Qian, Y.H.; Xu, J.C.; Zhang, S.G.; Tian, Y. Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Appl. Intell. 2019, 49, 1245–1259.
32. Sheng, J.; Li, W.V. Selecting gene features for unsupervised analysis of single-cell gene expression data. Briefings Bioinform. 2021, 22, bbab295.
33. Zhang, J.; Zhang, G.; Li, Z.; Qu, L.; Wen, C.F. Feature selection in a neighborhood decision information system with application to single cell RNA data classification. Appl. Soft Comput. 2021, 113, 107876.
34. Li, Z.; Feng, J.; Zhang, J.; Liu, F.; Wang, P.; Wen, C.F. Gaussian kernel based gene selection in a single cell gene decision space. Inf. Sci. 2022, 610, 1029–1057.
35. Zhang, J.; Yu, G.; Huang, D.; Wang, Y. Gene selection in a single cell gene decision space based on class-consistent technology and fuzzy rough iterative computation model. Appl. Intell. 2023, 53, 30113–30132.
36. Zhang, Q.; Wang, P.; Pedrycz, W.; Li, Z. Neighborhood entropy guided by a decision attribute and its applications in multi-source information fusion and attribute selection. Appl. Soft Comput. 2024, 167, 112380.
37. Ma, X.; Liu, J.; Wang, P.; Yu, W.; Hu, H. Feature selection for hybrid information systems based on fuzzy β covering and fuzzy evidence theory. J. Intell. Fuzzy Syst. 2024, 46, 4219–4242.
38. Yu, W.; Ma, X.; Zhang, Z.; Zhang, Q. A Method for Fast Feature Selection Utilizing Cross-Similarity within the Context of Fuzzy Relations. CMC-Comput. Mater. Contin. 2025, 83, 1195–1218.
39. Ting, D.T.; Wittner, B.S.; Ligorio, M.; Jordan, N.V.; Shah, A.M.; Miyamoto, D.T.; Aceto, N.; Bersani, F.; Brannigan, B.W.; Xega, K.; et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 2014, 8, 1905–1918.
40. Biase, F.H.; Cao, X.; Zhong, S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 2014, 24, 1787–1796.
41. Grover, A.; Sanjuan-Pla, A.; Thongjuea, S.; Carrelha, J.; Giustacchini, A.; Gambardella, A.; Macaulay, I.; Mancini, E.; Luis, T.C.; Mead, A.; et al. Single-cell RNA sequencing reveals molecular and functional platelet bias of aged haematopoietic stem cells. Nat. Commun. 2016, 7, 11075.
42. Hawkins, F.; Kramer, P.; Jacob, A.; Driver, I.; Thomas, D.C.; McCauley, K.B.; Skvir, N.; Crane, A.M.; Kurmann, A.A.; Hollenberg, A.N.; et al. Prospective isolation of NKX2-1–expressing human lung progenitors derived from pluripotent stem cells. J. Clin. Investig. 2017, 127, 2277–2294.
43. Chiang, D.; Chen, X.; Jones, S.M.; Wood, R.A.; Sicherer, S.H.; Burks, A.W.; Leung, D.Y.; Agashe, C.; Grishin, A.; Dawson, P.; et al. Single-cell profiling of peanut-responsive T cells in patients with peanut allergy reveals heterogeneous effector Th2 subsets. J. Allergy Clin. Immunol. 2018, 141, 2107–2120.
44. Chen, L.; Lee, J.W.; Chou, C.-L.; Nair, A.V.; Battistone, M.A.; Păunescu, T.G.; Merkulova, M.; Breton, S.; Verlander, J.W.; Wall, S.M.; et al. Transcriptomes of major renal collecting duct cell types in mouse identified by single-cell RNA-seq. Proc. Natl. Acad. Sci. USA 2017, 114, E9989–E9998.
45. Hook, P.W.; McClymont, S.A.; Cannon, G.H.; Law, W.D.; Morton, A.J.; Goff, L.A.; McCallion, A.S. Single-cell RNA-seq of dopaminergic neurons informs candidate gene selection for sporadic Parkinson's disease. bioRxiv 2017, 148049.
46. Donega, V.; Marcy, G.; Giudice, Q.L.; Zweifel, S.; Angonin, D.; Fiorelli, R.; Abrous, D.N.; Rival-Gervier, S.; Koehl, M.; Jabaudon, D.; et al. Transcriptional dysregulation in postnatal glutamatergic progenitors contributes to closure of the cortical neurogenic period. Cell Rep. 2018, 22, 2567–2574.
47. Lu, J.; Baccei, A.; da Rocha, E.L.; Guillermier, C.; McManus, S.; Finney, L.A.; Zhang, C.; Steinhauser, M.L.; Li, H.; Lerou, P.H. Single-cell RNA sequencing reveals metallothionein heterogeneity during hESC differentiation to definitive endoderm. Stem Cell Res. 2018, 28, 48–55.
Figure 1. Flowchart of this paper.
Figure 2. The framework of Algorithm 4.
Figure 3. Experimental flow.
Figure 4. Classification accuracy of KNN based on k genes with top Fisher's score.
Figure 5. Comparisons of number of selected genes.
Figure 6. Comparisons of classification accuracy (%) of three classifiers (KNN (a), SVM (b), DT (c)) with raw data and Fisher data.
Figure 7. Comparison of classification accuracy (%) for three classifiers (KNN (a), SVM (b), DT (c)) with PCA and tSNE.
Figure 8. KNN classifier critical diagram for Bonferroni–Dunn test ranking all algorithms.
Figure 9. SVM classifier critical diagram for Bonferroni–Dunn test ranking all algorithms.
Figure 10. DT classifier critical diagram for Bonferroni–Dunn test ranking all algorithms.
Figure 11. Accuracy varying with θ and α (GSE51372 and GSE57249).
Figure 12. Accuracy varying with θ and α (GSE70657 and GSE72612).
Figure 13. Accuracy varying with θ and α (GSE96106 and GSE98852).
Figure 14. Accuracy varying with θ and α (GSE99701 and GSE108020).
Figure 15. Accuracy varying with θ and α (GSE109556 and GSE109979).
Table 1. Decision table.

        a1       a2       a3       d
x1      0        0        0.1429   1
x2      0.2857   1.0000   1.0000   1
x3      1.0000   0.2500   0        0
Table 2. Certain decision index.

S/d                   D1    D2
cert_a1^θ(D_k)        1     2
cert_a2^θ(D_k)        0     1
cert_a3^θ(D_k)        0     1
Table 3. Possible decision index.

S/d                   D1    D2
poss_a1^θ(D_k)        1     2
poss_a2^θ(D_k)        2     3
poss_a3^θ(D_k)        2     3
Table 4. Decision self-information of D_i.

B     c-SI_B^θ(D1)  p-SI_B^θ(D1)  SI_B^θ(D1)  r-SI_B^θ(D1)  i-SI_B^θ(D1)  c-SI_B^θ(D2)  p-SI_B^θ(D2)  SI_B^θ(D2)  r-SI_B^θ(D2)  i-SI_B^θ(D2)
a1    0             0             0           0             0             0             0             0           0             0
a2    0             0.3466        0.1733      0.3466        0.2426        0.3466        0.1352        0.2409      0.7324        0.1986
a3    0             0.3466        0.1733      0.3466        0.2426        0.3466        0.1352        0.2409      0.7324        0.1986
Table 5. Decision self-information.

B     c-SI_B^θ(d)   p-SI_B^θ(d)   SI_B^θ(d)   r-SI_B^θ(d)   i-SI_B^θ(d)
a1    0             0             0           0             0
a2    0.3466        0.4818        0.4142      1.0808        0.4412
a3    0.3466        0.4818        0.4142      1.0808        0.4412
Table 6. Datasets from scRNA sequencing.

GSE ID       Contributor   # Cells   # Raw Genes   # Class   # Preliminary Genes   Reference
GSE51372     Chen          187       19,681        7         300                   Chen et al. (2014) [39]
GSE57249     Biase         56        25,680        4         100                   Biase et al. (2014) [40]
GSE72612     Watanabe      95        17,151        3         1500                  Watanabe et al. (2018)
GSE70657     Grover        135       15,173        2         50                    Grover et al. (2016) [41]
GSE96106     Hawkins       145       33,950        2         50                    Hawkins et al. (2017) [42]
GSE98852     Chiang        259       13,938        2         50                    Chiang et al. (2017) [43]
GSE99701     Chen          235       24,290        5         50                    Chen et al. (2017) [44]
GSE108020    Hook          473       20,930        2         50                    Hook et al. (2018) [45]
GSE109556    Donega        230       22,527        2         50                    Donega et al. (2018) [46]
GSE109979    Lu            329       19,685        4         850                   Lu et al. (2018) [47]
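The "# Preliminary Genes" column matches the "Fisher Data" column of Table 7, i.e., the gene pool retained after ranking the raw genes by Fisher's score (cf. Figure 4). A minimal Python sketch of this kind of pre-filtering is given below; the fisher_score and top_k_genes helpers and the exact score formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fisher_score(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One common formulation of Fisher's score per gene:
    between-class scatter of class means over pooled within-class variance."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        nc = Xc.shape[0]
        between += nc * (Xc.mean(axis=0) - overall_mean) ** 2
        within += nc * Xc.var(axis=0)
    return between / np.maximum(within, 1e-12)

def top_k_genes(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Column indices of the k genes with the largest Fisher's score."""
    return np.argsort(fisher_score(X, y))[::-1][:k]

# Hypothetical usage on a matrix shaped like GSE51372 (187 cells x 19,681 genes),
# keeping the 300 preliminary genes listed in Table 6:
# keep = top_k_genes(X, y, k=300); X_fisher = X[:, keep]
```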
Table 7. Comparisons of the number of selected genes.

Datasets     Raw Data   Fisher Data   F-SI   Fr-SI   Fi-SI
GSE51372     19,681     300           14     13      16
GSE57249     25,680     100           2      2       2
GSE72612     17,151     1500          18     19      9
GSE70657     15,173     50            2      2       2
GSE96106     33,950     50            8      8       6
GSE98852     13,938     50            13     13      14
GSE99701     24,290     50            9      9       22
GSE108020    20,930     50            3      5       3
GSE109556    22,527     50            13     16      29
GSE109979    19,685     850           9      9       9
Table 8. Comparisons of classification accuracy (%) on the KNN classifier (the best result on each dataset is marked with *).

Datasets     Raw     Fisher    PCA     tSNE    mRMR    ReliefF   NRS     NDI       NMI     F-SI      Fr-SI     Fi-SI
GSE51372     14.98   85.53     75.97   55.6    86.91   86.35     76.5    78.09     73.34   86.61     86.07     87.70*
GSE57249     85.76   100.00*   98.18   87.58   97.61   96.81     80.3    94.55     80.3    100.00*   100.00*   100.00*
GSE72612     67.37   86.32     76.84   71.58   91.65   89.71     66.32   83.16     77.89   90.53     89.47     94.74*
GSE70657     43.7    85.93     70.37   56.3    90.74   88.63     71.85   85.19     82.22   94.07*    94.07*    94.07*
GSE96106     55.17   100.00*   91.72   67.59   98.37   94.74     95.17   100.00*   97.24   100.00*   100.00*   100.00*
GSE98852     67.57   81.47     78.8    76.06   83.68   82.52     73.76   75.68     80.34   83.4      83.4      84.19*
GSE99701     42.13   91.49     77.02   63.83   90.85   90.31     73.19   90.64     80      91.91*    91.91*    91.91*
GSE108020    87.95   96.2      93.87   93.45   97.18   97.06     97.25   95.99     97.68   97.46     97.26     98.10*
GSE109556    58.26   99.57*    98.26   80.43   96.63   96.46     93.48   96.96     96.52   97.83     97.39     99.57*
GSE109979    61.08   99.09*    95.45   83.91   95.89   94.35     74.46   94.23     86.65   98.47     98.47     98.47
Table 9. Comparisons of classification accuracy (%) on the SVM classifier (the best result on each dataset is marked with *).

Datasets     Raw     Fisher    PCA     tSNE    mRMR    ReliefF   NRS     NDI       NMI     F-SI      Fr-SI     Fi-SI
GSE51372     57.31   84.47     69      63.68   83.25   81.33     73.27   84.98     73.83   88.22     86.63     88.75*
GSE57249     96.36   100.00*   98.18   91.06   98.91   97.18     75      96.36     75      100.00*   100.00*   100.00*
GSE72612     70.53   80        71.58   65.26   96.15   95.11     62.11   84.21     80      96.84     94.74     97.89*
GSE70657     68.89   89.63     78.52   57.78   91.58   91.39     74.81   91.85     85.93   92.59*    92.59*    92.59*
GSE96106     81.38   100.00*   92.41   69.66   95.35   95.01     95.86   100.00*   97.93   100.00*   100.00*   100.00*
GSE98852     69.5    91.13*    78.03   75.69   87.16   83.25     78.8    81.47     83.42   86.89     86.89     88.83
GSE99701     63.83   92.77     78.3    61.28   91.48   90.19     77.02   91.91     82.55   93.62*    93.62*    93.62*
GSE108020    93.45   97.05     90.92   94.08   97.29   96.37     97.47   96.41     97.46   97.68*    97.68*    97.68*
GSE109556    97.39   99.57*    98.7    75.65   97.52   96.26     93.48   96.96     94.78   98.7      98.7      99.57*
GSE109979    96.06   99.39*    95.45   80.25   97.93   98.31     78.41   95.74     87.56   98.79     97.86     99.09
Table 10. Comparisons of classification accuracy (%) on the DT classifier (the best result on each dataset is marked with *).

Datasets     Raw     Fisher    PCA      tSNE    mRMR    ReliefF   NRS     NDI       NMI     F-SI      Fr-SI     Fi-SI
GSE51372     86.67   84.52     72.77    55.66   87.67   85.46     79.7    78.66     86.66   82.9      83.37     90.91*
GSE57249     85.45   96.36     98.18    85.61   89.92   88.35     78.48   91.06     78.48   100.00*   100.00*   100.00*
GSE72612     89.47   76.84     76.84    81.05   91.08   89.63     69.47   82.11     77.89   89.47     91.58     92.63*
GSE70657     94.07   95.56     73.33    56.3    97.82   93.27     90.37   95.56     97.04   98.52     98.52     99.26*
GSE96106     97.93   96.55     93.1     61.38   97.95   96.83     94.48   98.62*    97.24   97.93     98.62*    98.62*
GSE98852     78.77   81.88     70.29    72.19   82.35   81.33     72.96   75.3      79.54   81.88     81.88     83.42*
GSE99701     81.28   88.51     71.06    57.87   91.83   90.07     73.62   93.19*    75.74   91.49     91.06     93.19*
GSE108020    95.35   96.41     93.24    94.3    96.35   97.13     97.68   98.10*    97.89   97.25     97.25     97.68
GSE109556    90.43   91.3      99.13*   76.09   93.38   93.61     91.74   95.22     91.3    92.61     92.61     94.35
GSE109979    90.55   91.17     92.72    91.49   93.49   92.57     71.42   88.75     82.39   94.54     93.92     95.14*
Table 11. Comparisons of the ranks of classification accuracy with NRS, NMI, and NDI.

             KNN                                       SVM                                       DT
Datasets     F-SI   Fr-SI  Fi-SI  NRS    NDI    NMI    F-SI   Fr-SI  Fi-SI  NRS    NDI    NMI    F-SI   Fr-SI  Fi-SI  NRS    NDI    NMI
GSE51372     2.0    3.0    1.0    5.0    4.0    6.0    2.0    3.0    1.0    6.0    4.0    5.0    4.0    3.0    1.0    5.0    6.0    2.0
GSE57249     2.0    2.0    2.0    5.5    4.0    5.5    2.0    2.0    2.0    5.5    4.0    5.5    2.0    2.0    2.0    5.5    4.0    5.5
GSE72612     2.0    3.0    1.0    6.0    4.0    5.0    2.0    3.0    1.0    6.0    4.0    5.0    3.0    2.0    1.0    6.0    4.0    5.0
GSE70657     2.0    2.0    2.0    6.0    4.0    5.0    2.0    2.0    2.0    6.0    4.0    5.0    2.5    2.5    1.0    6.0    5.0    4.0
GSE96106     2.5    2.5    2.5    6.0    2.5    5.0    2.5    2.5    2.5    6.0    2.5    5.0    4.0    2.0    2.0    6.0    2.0    5.0
GSE98852     2.5    2.5    1.0    6.0    5.0    4.0    2.5    2.5    1.0    6.0    5.0    4.0    2.5    2.5    1.0    6.0    5.0    4.0
GSE99701     2.0    2.0    2.0    6.0    4.0    5.0    2.0    2.0    2.0    6.0    4.0    5.0    3.0    4.0    1.5    6.0    1.5    5.0
GSE108020    3.0    4.0    1.0    5.0    6.0    2.0    2.0    2.0    2.0    4.0    6.0    5.0    5.5    5.5    3.5    3.5    1.0    2.0
GSE109556    2.0    3.0    1.0    6.0    4.0    5.0    2.5    2.5    1.0    6.0    4.0    5.0    4.0    3.0    2.0    5.0    1.0    6.0
GSE109979    2.0    2.0    2.0    6.0    4.0    5.0    2.0    3.0    1.0    6.0    4.0    5.0    2.0    3.0    1.0    6.0    4.0    5.0
FMR          2.2    2.6    1.55   5.75   4.15   4.75   2.15   2.45   1.55   5.75   4.15   4.95   3.25   2.95   1.6    5.5    3.35   4.35
Rank         2.0    3.0    1.0    6.0    4.0    5.0    2.0    3.0    1.0    6.0    4.0    5.0    3.0    2.0    1.0    6.0    4.0    5.0
Table 12. Comparison of the values of χ_F^2 and F_F for classification accuracy under 3 classifiers with NDI, NMI, and NRS.

Values    KNN      SVM      DT
χ_F^2     41.45    44.59    25.95
F_F       43.69    74.25    9.71
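The F_F values in Table 12 are consistent with the standard Iman–Davenport correction of the Friedman statistic for N = 10 datasets and k = 6 compared algorithms (F-SI, Fr-SI, Fi-SI, NRS, NDI, NMI). The short Python sketch below only checks that relationship under these assumptions; it is not the authors' code.

```python
def iman_davenport(chi2_f: float, n: int = 10, k: int = 6) -> float:
    """Iman-Davenport statistic F_F computed from the Friedman statistic chi^2_F,
    for n datasets and k compared algorithms."""
    return (n - 1) * chi2_f / (n * (k - 1) - chi2_f)

# chi^2_F values reported in Table 12 for the three classifiers.
for clf, chi2 in [("KNN", 41.45), ("SVM", 44.59), ("DT", 25.95)]:
    print(f"{clf}: F_F = {iman_davenport(chi2):.2f}")
# Prints roughly 43.63, 74.18, and 9.71, i.e., the F_F row of Table 12
# up to rounding of the reported chi^2_F values.
```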