Article

Establishing Two-Dimensional Dependencies for Multi-Label Image Classification

1
Science and Technology on Micro-System Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
2
School of Electronic, Electrical and Communication, University of Chinese Academy of Sciences, Beijing 100049, China
3
Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2845; https://doi.org/10.3390/app15052845
Submission received: 5 February 2025 / Revised: 3 March 2025 / Accepted: 3 March 2025 / Published: 6 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

As a fundamental upstream task, multi-label image classification (MLIC) has made substantial progress in recent years. Establishing dependencies between targets is crucial for MLIC, as targets in the real world frequently co-occur. However, due to the complex spatial and semantic relationships among targets, existing methods fail to establish these dependencies effectively. In this paper, we propose a Two-Dimensional Dependency Model (TDDM) for MLIC. The network consists of a Spatial Feature Dependency Module (SFDM) and a Label Semantic Dependency Module (LSDM), which establish effective dependencies in the dimensions of image spatial features and label semantics, respectively. Our method was tested on three publicly available multi-label image datasets, PASCAL VOC 2007, PASCAL VOC 2012, and MS-COCO, and it produced superior results compared to existing state-of-the-art methods, as demonstrated in our experiments.

1. Introduction

Multi-label image classification (MLIC) is a fundamental yet crucial task in the domain of computer vision, with profound implications in diverse fields such as target detection [1,2,3], medical image recognition [4,5,6], and human attribute recognition [7,8,9]. Since targets typically coexist in the real world, a key aspect of MLIC lies in establishing dependencies between targets.
In the literature, previous methods have primarily focused on modeling either spatial dependencies or label semantic dependencies. (1) For modeling spatial dependencies, convolution-based architectures [10,11] are the most commonly used approach. However, the limited kernel size constrains the receptive field of CNNs, so the establishment of spatial dependencies is strongly affected by the pixel distance between targets. It is therefore challenging to establish effective long-range spatial dependencies when the targets in an image are far apart. (2) For modeling label semantic dependencies, architectures based on Graph Convolutional Networks (GCNs) [12,13] are the most commonly used approach. However, the first-order neighborhood aggregation in the standard GCN formulation limits the expressive power of the GCN and prevents global information integration. For example, based on human experience, the probability of ‘dog’ and ‘fork’ appearing together is typically low, but if a ‘person’ is present, this probability increases significantly. Existing methods fail to consider the impact of the presence of ‘person’ on the co-occurrence probability of ‘dog’ and ‘fork’.
In multi-label image classification, existing methods face limitations in modeling long-range spatial dependencies and global semantic dependencies. Traditional CNNs are constrained by convolutional kernel sizes, making it difficult to capture long-range spatial relationships, while traditional GCNs rely only on first-order neighborhood aggregation and thus lack the ability to integrate global information. To address these challenges, we propose the TDDM, which incorporates the SFDM and the LSDM to simultaneously model spatial and semantic dependencies. The SFDM enhances cross-layer feature interactions through a Feature Fusion Module (FFM) and a Feature Enhancement Module (FEM), improving feature representation and effectively capturing long-range spatial dependencies. The LSDM introduces a Global Relationship Enhancement Module (GREM), which employs a multi-head max-attention mechanism to overcome the local aggregation limitation of traditional GCNs and strengthen global semantic modeling. Compared to conventional CNNs [14,15,16], GCNs [17,18], and existing multi-label classification methods [12,19], TDDM jointly models spatial and semantic dependencies, achieving excellent results on multiple datasets. These results validate TDDM’s effectiveness in modeling spatial and semantic dependencies, providing a more comprehensive solution for multi-label image classification.
In short, the field of MLIC currently faces two major challenges. Firstly, it is difficult to establish effective long-range spatial feature dependencies. Secondly, it is challenging to establish global label semantic dependencies. In this paper, inspired by the work of [12,19], we propose a Two-Dimensional Dependency Model (TDDM) to simultaneously address both issues. The TDDM consists of a Spatial Feature Dependency Module (SFDM), which establishes dependencies in the image spatial feature dimension, and a Label Semantic Dependency Module (LSDM), which establishes dependencies in the label semantic dimension. In the following, we elaborate on our proposed method.
A. 
Establishing effective long-range spatial feature dependencies.
The Spatial Feature Dependency Module (SFDM) is used to establish effective long-range spatial feature dependencies. The SFDM uses a ResNet backbone and extracts feature maps from different layers, preserving information about targets at different spatial positions. These feature maps are then fused by our Feature Fusion Module (FFM), establishing spatial dependencies among target features. For the highly abstract feature maps output by the last layer of ResNet, we designed a Feature Enhancement Module (FEM) to enhance and enrich the high-dimensional information. Through cross-layer fusion and feature enhancement, the spatial dependencies between targets are modeled, long-range dependencies are established, and the limitations of fixed convolutional kernel receptive fields are alleviated. By fusing features across different layers, we alleviate the limitation of local feature extraction and preserve detailed information as much as possible, thereby addressing the issues associated with stacking multiple CNN layers to establish spatial dependencies. In addition, our method adds only a few convolutional and fully connected layers on top of the backbone network, which does not unduly increase network complexity and does not lead to gradient vanishing or explosion problems; this is a further strength compared to stacking multiple CNN layers.
B. 
Establishing global label semantic dependencies.
The Label Semantic Dependency Module (LSDM) uses a GCN to establish global label semantic dependencies. We also designed the Global Relationship Enhancement Module (GREM) based on a multi-head max-attention mechanism, which incorporates a max pooling operation on the Query tensor in addition to the self-attention mechanism, resulting in improved global enhancement effects. The information extracted by the GCN is enhanced by GREM. Through these operations, we can address the issue of insufficient extraction of label relationships and establish global semantic dependencies among labels.
In summary, the main contributions of this article are as follows.
  • We propose a Two-Dimensional Dependency Model (TDDM) for MLIC, which can simultaneously establish effective long-range spatial feature dependencies and global label semantic dependencies. This approach addresses the challenges of capturing the feature dependencies between distant targets in image feature extraction and incomplete understanding of the semantic dependencies between labels. To the best of our knowledge, this is the first multi-label image classification network that considers both problems simultaneously.
  • We propose a Feature Fusion Module (FFM) and a Feature Enhancement Module (FEM), which effectively integrate image feature information from different spatial positions while enhancing and enriching high-dimensional abstract information. In terms of semantic extraction, we design a Global Relationship Enhancement Module (GREM) to enhance the fusion of global relationships.
  • We conducted experiments comparing our method with state-of-the-art methods on commonly used benchmark datasets. The results show that our method achieves superior classification performance. Specifically, our model achieves mAPs of 96.5% on PASCAL VOC 2007, 96.0% on PASCAL VOC 2012, and 85.2% on MS-COCO. The datasets and source code can be accessed at https://github.com/12pid/TDDM, accessed on 2 March 2025.

2. Related Work

The problem of MLIC has attracted increasing attention from researchers and was comprehensively reviewed in [20,21]. The most straightforward approach is to transform the multi-label classification problem into multiple single-label classification problems [22]. However, these methods overlook inter-label correlations, leading to low accuracy, and they also fail to address large-scale target recognition tasks. With the advancement of deep learning, neural networks have been able to learn rich information from datasets such as MS-COCO [23], PASCAL VOC [24], and ImageNet [25]. As a result, scholars have proposed various MLIC methods based on image features [26,27]. In recent years, there has been a growing body of research on exploring the dependencies among labels [16], with scholars simultaneously considering both image features and label semantics. In this paper, we will primarily review the relevant works from the following three perspectives.

2.1. Traditional Multi-Label Classification Methods

There are generally two directions for handling multi-label classification problems. The first involves problem transformation, while the second focuses on leveraging the characteristics of label relationships to propose adaptive algorithms. Early multi-label classification algorithms were based on the transformation approach, treating the problem as multiple independent single-label problems and ignoring the correlation between labels. Typical algorithms of this kind are One-vs-All [28] and Binary Relevance [29]. The Binary Relevance algorithm constructs a dataset for each label, in which examples containing the label are marked as positive, and trains a label-specific classifier for prediction. Many methods apply association rules to address the correlation among labels in multi-label classification problems. However, these methods may encounter the issue of rule explosion, where a large number of rules are generated, increasing computational complexity, especially when dealing with many labels. Representative algorithms in this category include Apriori [30] and FP-Growth [31]. Using Apriori, one can obtain frequent item sets from the dataset and subsequently extract association rules from these frequent item sets based on their definition. Other classical machine learning algorithms, such as decision trees [32], boosting [33], and neural networks [34], have also been used.

2.2. Deep Learning Methods for MLIC

With the advancement of CNNs [26,27,35,36,37], numerous models [14,15,16,38] suitable for MLIC have been proposed. To leverage the relationships among labels, the study of [39] utilized Recurrent Neural Networks (RNNs) to transform labels into embedded label vectors. In [40], Zhu et al. improved multi-label image classification performance by introducing spatial regularization and image-level supervisory information, focusing on constructing and classifying different labeled regions. However, the spatial regularization approach involves extensive parameter tuning; in practical applications, factors such as lighting, viewing angle, and scale can degrade the SRN’s spatial regularization, and the method’s robustness is limited. The study of [41] incorporated label reordering and handling into the overall network architecture using context gating strategies. Guo et al. [42] considered visual attention consistency in multi-label image classification. Furthermore, label balancing techniques [43] have also been applied to enhance multi-label image classification performance.

2.3. Graph Structure Methods for MLIC

A significant body of work has demonstrated that complex label relationships are well suited to graph structures. Li et al. [17] built a tree-structured graph in the label space using the maximum spanning tree algorithm and identified sets of informative label combinations to improve overall multi-label prediction performance. Li et al. [44] proposed a graphical lasso framework that models label correlations by jointly considering image features and labels. In state-of-the-art approaches [14,45], GCNs have been introduced for MLIC. The study of [12] used a GCN to model dependencies among labels: it constructs a directed graph over target labels, with each node represented by the word embedding of its label, and the GCN learns to map this label graph into a set of interdependent target classifiers. The model also employs a novel reweighting scheme to create an effective label correlation matrix, optimizing the propagation of information between GCN nodes. However, this method does not consider the global nature of image features and label semantics, and it employs only ResNet and standard GCN modules as the backbone. In this paper, when establishing image spatial feature dependencies, we add an FFM for feature fusion and an FEM for high-level feature enhancement on top of ResNet, compensating for the inability to establish long-range spatial dependencies caused by the limited size of convolutional kernels. Likewise, when modeling inter-label semantic dependencies, we add our GREM after the GCN module to enhance global inter-label semantics. Zhao et al. [19] proposed a transformer-based dual relation learning framework that builds complementary relationships by exploring the correlation between a structural relationship graph and a semantic relationship graph. The structural relationship graph captures long-term correlations by leveraging a cross-scale transformer architecture within the target context, while the semantic graph dynamically models the semantic significance of image objects using explicit semantic-aware constraints. However, this method introduces multiple Transformer modules and a complex structure, making the model too large for resource-constrained scenarios and harder to train; moreover, it uses only standard GCN modules and does not account for the global nature of inter-label semantics. In contrast, our FFM and FEM complete feature fusion and enhancement without adding much computational burden while achieving superior results, and our GREM is designed to address the global label semantic problem.
Motivated by the aforementioned works, we propose the Two-Dimensional Dependency Model (TDDM). Our model leverages a Spatial Feature Dependency Module (SFDM) to extract image information and establish dependencies in the image feature space, while a Label Semantic Dependency Module (LSDM) establishes global label semantic dependencies. The key aspect of TDDM lies in simultaneously considering both image spatial feature dependencies and global label semantic dependencies. We integrate the outputs of SFDM and LSDM to perform MLIC.

3. Method

In this section, we provide a detailed exposition of the Two-Dimensional Dependency Model (TDDM) framework for MLIC. We begin by presenting the preliminaries behind our approach, followed by a comprehensive description of the components of TDDM, i.e., SFDM and LSDM. Lastly, we provide an overview of the loss function used.

3.1. Preliminaries

We propose the TDDM framework, as shown in Figure 1. TDDM consists of two branches, SFDM and LSDM. SFDM is used to model the dependency of image features, where we employ ResNet as the backbone. During the downsampling process of ResNet, feature maps of different sizes are extracted from different layers and are then fused together. These feature maps contain information about targets at different spatial locations, thereby capturing the spatial dependency of targets in the fused feature map. Additionally, we enhanced the features from the last layer of ResNet to emphasize high-dimensional abstract information. LSDM was utilized to model the global label semantic dependency. We employed GCN to model the semantic dependency of the labels, and we utilized the Global Relationship Enhancement Module (GREM) to enhance the global semantic coherence among labels. Specifically, we devised a maximum attention mechanism to realize the functionality of GREM.

3.2. SFDM

In the SFDM, ResNet is used as the backbone. Given the input image $I \in \mathbb{R}^{448 \times 448}$, ResNet generates feature maps $X_3 \in \mathbb{R}^{1024 \times 28 \times 28}$ and $X_4 \in \mathbb{R}^{2048 \times 14 \times 14}$ from ‘Layer3’ and ‘Layer4’, respectively. The feature maps are processed as follows:
$$u_1 = \mathrm{Conv}(X_3), \quad u_2 = \mathrm{Conv}_1(X_4), \quad u_3 = \mathrm{Conv}_2(X_4),$$
where $\mathrm{Conv}_1(\cdot)$ and $\mathrm{Conv}_2(\cdot)$ are two convolutional operations applied to $X_4$. The resulting tensors $u_1$, $u_2$, and $u_3$ are passed into the Feature Fusion Module (FFM), while the output $X_4$ from the last layer of ResNet is input into the Feature Enhancement Module (FEM). The FFM integrates shallow and deep features along with spatial positional information, producing the following:
$$u_{FFM} = f_{FFM}(u_1, u_2, u_3).$$
The FEM focuses on enhancing the abstract feature $X_4$ to improve its representation as follows:
$$u_{FEM} = f_{FEM}(X_4).$$
Finally, the SFDM output is obtained by fusing the outputs of both FFM and FEM:
$$X_{SFDM} = f_{FFM}(u_1, u_2, u_3) + f_{FEM}(X_4).$$
This results in $X_{SFDM}$, which captures both individual target features and their spatial dependencies, effectively enhancing the model’s expressive power.
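To make this data flow concrete, the following PyTorch sketch (our illustration, not the authors’ released code) shows how the two intermediate feature maps can be taken from a torchvision ResNet-101 and routed to the two sub-modules; the FFM and FEM objects stand in for the modules detailed in Sections 3.2.1 and 3.2.2, and the exact wiring is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class SFDM(nn.Module):
    """Sketch of the SFDM branch: ResNet backbone feeding FFM and FEM (hypothetical wiring)."""
    def __init__(self, ffm: nn.Module, fem: nn.Module):
        super().__init__()
        backbone = resnet101(weights=None)          # ImageNet-pretrained weights assumed in the paper
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool,
                                  backbone.layer1, backbone.layer2)
        self.layer3 = backbone.layer3               # -> X3: [B, 1024, 28, 28] for a 448x448 input
        self.layer4 = backbone.layer4               # -> X4: [B, 2048, 14, 14]
        self.ffm, self.fem = ffm, fem

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.stem(img)
        x3 = self.layer3(x)
        x4 = self.layer4(x3)
        # X_SFDM = f_FFM(u1, u2, u3) + f_FEM(X4), both producing [B, C] class scores
        return self.ffm(x3, x4) + self.fem(x4)
```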

3.2.1. Feature Fusion Module

The Feature Fusion Module (FFM) performs cross-layer and cross-scale fusion, as shown in Figure 2. To enhance the model’s representation power, feature tensors $X_3 \in \mathbb{R}^{1024 \times 28 \times 28}$ and $X_4 \in \mathbb{R}^{2048 \times 14 \times 14}$ from ‘Layer3’ and ‘Layer4’ of ResNet are extracted and processed. Tensor $X_3$ passes through a convolutional layer to produce $u_1 \in \mathbb{R}^{512 \times 28 \times 28}$, and $X_4$ undergoes two convolutional layers to generate $u_2 \in \mathbb{R}^{512 \times 14 \times 14}$ and $u_3 \in \mathbb{R}^{512 \times 7 \times 7}$. These tensors are unified using bilinear operations, resulting in $u_1$, $u_2$, and $u_3$. The feature fusion factor $B_f$ is then computed as the sum of these unified tensors:
$$B_f = u_1 + u_2 + u_3.$$
Subsequently, the fusion factor $B_f$ is added to $u_1$, $u_2$, and $u_3$, respectively, to fuse the information among them:
$$u_1 = u_1 + B_f, \quad u_2 = u_2 + B_f, \quad u_3 = u_3 + B_f.$$
For the abstract tensor $u_3$, an information compensation step is applied to improve feature fusion. This involves element-wise multiplication of $u_1$, $u_2$, and $u_3$, followed by multiplication with a compensation coefficient $C_c$, resulting in the compensation factor $C_f$:
$$C_f = C_c \cdot (u_1 \odot u_2 \odot u_3),$$
where $\odot$ denotes element-wise multiplication. The tensor $C_f$ is then multiplied with $u_3$ to accomplish information compensation:
$$u_3 = C_f \odot u_3.$$
At this stage, cross-layer information fusion is complete. By interpolation, the tensors that have been fused with each other’s information are restored to their original dimensions and denoted as $U_1$, $U_2$, and $U_3$, respectively:
$$U_1 = \mathrm{Interp}(u_1), \quad U_2 = \mathrm{Interp}(u_2), \quad U_3 = \mathrm{Interp}(u_3),$$
where $\mathrm{Interp}(\cdot)$ represents the interpolation operation. Subsequently, a global max pooling operation is performed on each tensor to transform its dimensions to $[B, 512]$, where $B$ represents the batch size:
$$U_1^{pool} = \mathrm{MaxPool}(U_1), \quad U_2^{pool} = \mathrm{MaxPool}(U_2), \quad U_3^{pool} = \mathrm{MaxPool}(U_3).$$
The three feature tensors are then concatenated, resulting in a feature tensor of dimension $[B, 3 \times 512]$:
$$U_f = \mathrm{Concat}(U_1^{pool}, U_2^{pool}, U_3^{pool}).$$
Finally, the feature tensor passes through a fully connected layer to obtain a feature tensor of dimension $[B, C]$, where $C$ represents the desired number of output classes:
$$f_{FFM} = fc(U_f),$$
where $fc(\cdot)$ represents the fully connected layer. The computation of the FFM module can be summarized as
$$f_{FFM} = \xi_{GCL}(\Psi(X_3, X_4)),$$
where $\Psi(\cdot, \cdot)$ denotes the convolution, interpolation, and feature fusion applied to $X_3$ and $X_4$; and $\xi_{GCL}(\cdot)$ denotes global max pooling, concatenation of the fused tensors, and length adjustment through a final fully connected layer.
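The FFM computation can be summarized in the sketch below. The common spatial size used for the bilinear unification (14 × 14 here), the choice of 1 × 1 and strided 3 × 3 convolutions, and a scalar compensation coefficient $C_c$ are our assumptions; the paper does not pin these down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFM(nn.Module):
    """Feature Fusion Module sketch: cross-layer fusion of X3 and X4 (illustrative only)."""
    def __init__(self, num_classes: int, cc: float = 1.0):
        super().__init__()
        self.conv3 = nn.Conv2d(1024, 512, 1)                         # X3 -> u1: [B, 512, 28, 28]
        self.conv4a = nn.Conv2d(2048, 512, 1)                        # X4 -> u2: [B, 512, 14, 14]
        self.conv4b = nn.Conv2d(2048, 512, 3, stride=2, padding=1)   # X4 -> u3: [B, 512, 7, 7]
        self.fc = nn.Linear(3 * 512, num_classes)
        self.cc = cc                                                 # compensation coefficient C_c (assumed scalar)

    def forward(self, x3: torch.Tensor, x4: torch.Tensor) -> torch.Tensor:
        u1, u2, u3 = self.conv3(x3), self.conv4a(x4), self.conv4b(x4)
        sizes = [u.shape[-2:] for u in (u1, u2, u3)]
        # unify spatial sizes bilinearly (assumed target: 14 x 14), then fuse
        v = [F.interpolate(u, size=(14, 14), mode="bilinear", align_corners=False)
             for u in (u1, u2, u3)]
        bf = v[0] + v[1] + v[2]                                      # fusion factor B_f
        v = [u + bf for u in v]
        cf = self.cc * (v[0] * v[1] * v[2])                          # compensation factor C_f
        v[2] = cf * v[2]                                             # information compensation for u3
        # restore original resolutions, global max pool, concatenate, classify
        pooled = [F.adaptive_max_pool2d(
                      F.interpolate(u, size=s, mode="bilinear", align_corners=False), 1
                  ).flatten(1)
                  for u, s in zip(v, sizes)]
        return self.fc(torch.cat(pooled, dim=1))                     # [B, num_classes]
```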

3.2.2. Feature Enhancement Module

The output information from ‘Layer4’ in ResNet is highly abstract. To enhance these feature representations, we designed the Feature Enhancement Module (FEM). The output of ‘Layer4’, denoted as $X_4 \in \mathbb{R}^{2048 \times 14 \times 14}$, can be represented as $x_1, x_2, \ldots, x_{196}$ ($x_i \in \mathbb{R}^{2048}$). Firstly, we defined the scoring formula, which represents the attention weight coefficient for category $i$ and position $j$, as follows:
$$v_j^i = \frac{\mathrm{softmax}(T\, C_i(x_j))}{\sum_{k=1}^{196} \mathrm{softmax}(T\, C_i(x_k))},$$
where $C_i$ is the classifier of class $i$, and $T$ is the control proportion coefficient.
For general features, the global average attention is defined as follows:
$$G = \frac{1}{196} \sum_{n=1}^{196} x_n.$$
To enhance abstract features, we defined the computation of feature attention as follows (which can be regarded as a class-specific attention mechanism that focuses attention on the classification scores at different positions for different categories):
$$A_i = \sum_{n=1}^{196} v_n^i x_n.$$
To better integrate features, we combined feature attention with global average attention, referred to as fusion attention, and defined it as follows:
$$F_i = G \cdot A_i.$$
At this point, we obtain the global enhanced attention for the $i$-th class, which is defined as follows:
$$f_i = G + \alpha A_i + \beta F_i,$$
where $\alpha$ and $\beta$ represent weight proportion control coefficients.
The final output of the FEM is as follows:
$$f_{FEM} = (f_1, f_2, \ldots, f_C),$$
where $C$ represents the number of required output classes.
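A hedged sketch of the FEM follows. The reading of the softmax (taken over classes), the default value of $T$, and the final projection of each class-specific enhanced feature back through the classifier to obtain a per-class score are our interpretations of the formulas above rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    """Feature Enhancement Module sketch: class-specific attention over the 196 positions."""
    def __init__(self, num_classes: int, dim: int = 2048,
                 t: float = 1.0, alpha: float = 0.1, beta: float = 0.04):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)     # C_i: per-class scoring of each position
        self.t, self.alpha, self.beta = t, alpha, beta

    def forward(self, x4: torch.Tensor) -> torch.Tensor:
        # x4: [B, 2048, 14, 14] -> x_1 ... x_196, each of dimension 2048
        x = x4.flatten(2).transpose(1, 2)                          # [B, 196, 2048]
        s = torch.softmax(self.t * self.classifier(x), dim=-1)     # softmax(T * C_i(x_j)), taken over classes
        v = s / s.sum(dim=1, keepdim=True)                         # normalize over positions: v_j^i
        g = x.mean(dim=1)                                          # global average attention G: [B, 2048]
        a = torch.einsum("bnc,bnd->bcd", v, x)                     # A_i = sum_n v_n^i x_n: [B, C, 2048]
        f = g.unsqueeze(1) + self.alpha * a + self.beta * (g.unsqueeze(1) * a)   # f_i = G + aA_i + bF_i
        # project each class-specific enhanced feature to a score (our assumption about the head)
        w_cls = self.classifier.weight                             # [C, 2048]
        return (f * w_cls.unsqueeze(0)).sum(-1) + self.classifier.bias           # [B, C]
```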

3.3. LSDM

3.3.1. Overview of GCN

GCN is a rapidly developing research direction in recent years. Unlike CNN, which can only handle problems on Euclidean data, GCN is applicable to a wider range of general data structures. By considering nodes and the directed relationships between them, GCN can effectively capture complex data structures. Messages can be propagated between nodes, and the node representations are updated after message passing. The objective of GCN is to learn a function $f(\cdot, \cdot)$ on graph structures, which can update the representation of each node. The function $f(\cdot, \cdot)$ takes, as input, the feature descriptions $F^l \in \mathbb{R}^{n \times d}$ and the adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $n$ represents the number of nodes and $d$ denotes the dimension of node features. Each layer of nodes can be represented as follows:
$$F^{l+1} = f(F^l, A).$$
After applying the convolution operation, it can be represented as follows:
$$F^{l+1} = h(\hat{A} F^l W^l),$$
where $\hat{A} \in \mathbb{R}^{n \times n}$ is the normalized version of the adjacency matrix $A$, $F^l$ represents the node information state at layer $l$, $W^l \in \mathbb{R}^{d \times d}$ represents the learnable transformation parameters, and $h(\cdot)$ denotes the activation function. By stacking multiple GCN layers using the above equation, we can learn the complex relationships between nodes.
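For reference, a single GCN layer implementing $F^{l+1} = h(\hat{A} F^l W^l)$ can be sketched as follows; the choice of LeakyReLU for $h(\cdot)$ is an assumption borrowed from common practice in related work, not a detail fixed by the paper.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: F^{l+1} = h(A_hat @ F^l @ W^l)  (sketch)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)
        self.act = nn.LeakyReLU(0.2)          # activation h(.); LeakyReLU is a common choice here

    def forward(self, feats: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        # feats: [n, d] node features, adj_norm: [n, n] normalized adjacency A_hat
        return self.act(adj_norm @ feats @ self.weight)
```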

3.3.2. Construction of a Correlation Matrix

The adjacency matrix describes the correlations between nodes and guides the propagation of information between nodes in GCN. Typically, the adjacency matrix is predefined. However, in this study, we adopted a data-driven approach, following the method proposed by [12], to construct the adjacency matrix. Specifically, we constructed the adjacency matrix in the form of conditional probabilities. Firstly, we calculated the occurrences of label pairs in the training set to obtain the matrix $D \in \mathbb{R}^{C \times C}$, where $C$ represents the number of label categories and $D_{ij}$ denotes the number of occurrences of label $i$ and label $j$ appearing together. Then, based on the matrix $D$, we can derive the conditional probability matrix:
$$P_{ij} = D_{ij} / N_j,$$
where $N_j$ represents the total occurrences of label $j$ in the training set, and $P_{ij} = P(L_i \mid L_j)$ represents the probability of label $i$ occurring when label $j$ is present.
The above method for obtaining the adjacency matrix may encounter two issues. Firstly, the data may exhibit a long-tail distribution. Secondly, the adjacency matrix computed from the training set may suffer from overfitting, affecting its generalization ability. To address these issues, we propose to binarize the correlation matrix $P$ and filter out noise using a threshold value $\tau$. Therefore, the adjacency matrix can be represented as follows:
$$A_{ij} = \begin{cases} 0, & \text{if } P_{ij} < \tau \\ 1, & \text{if } P_{ij} \geq \tau \end{cases}.$$
However, binarization can introduce the issue of over-smoothing. To mitigate this, we can employ a reweighting scheme, which is represented as follows:
$$A'_{ij} = \begin{cases} p \Big/ \sum_{\substack{j=1 \\ j \neq i}}^{C} A_{ij}, & \text{if } i \neq j \\[4pt] 1 - p, & \text{if } i = j, \end{cases}$$
where $A'$ represents the reweighted adjacency matrix, and $p$ denotes the allocation weight to be set. By incorporating the weighting operation, the reweighted adjacency matrix retains both the information of the individual nodes and the information from the neighboring nodes, effectively addressing the issue of over-smoothing.
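The construction of the reweighted correlation matrix can be sketched as below; the helper name and the small epsilon guards are ours, while the default values of $\tau$ and $p$ follow those reported in Section 4.2.

```python
import numpy as np

def build_correlation_matrix(co_occur: np.ndarray, label_counts: np.ndarray,
                             tau: float = 0.4, p: float = 0.3) -> np.ndarray:
    """Data-driven adjacency: conditional probability -> binarization -> reweighting (sketch)."""
    # P_ij = D_ij / N_j: probability of label i given label j
    prob = co_occur / np.maximum(label_counts[None, :], 1)
    # binarize with threshold tau to suppress noisy, long-tailed co-occurrences
    adj = (prob >= tau).astype(np.float32)
    c = adj.shape[0]
    off_diag = adj * (1.0 - np.eye(c, dtype=np.float32))
    row_sum = np.maximum(off_diag.sum(axis=1, keepdims=True), 1e-6)
    # off-diagonal entries share weight p; the diagonal keeps 1 - p
    reweighted = p * off_diag / row_sum
    reweighted[np.eye(c, dtype=bool)] = 1.0 - p
    return reweighted
```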

3.3.3. Global Relationship Enhancement Module

The information representing the graph structure can be propagated among all of the nodes in GCN. The utilization of the standard GCN formula, which aggregates information from first-order neighborhoods, leads to limited expressive capabilities of GCN and a lack of global information integration. Furthermore, increasing the depth of GCN layers does not necessarily guarantee improved performance. This may result in the inability of GCN alone to establish global label semantic dependencies. To address this issue, we propose a Global Relationship Enhancement Module (GREM), as illustrated in Figure 3. In the GREM, the input tensor $X_f$ is first linearly transformed to generate the Query ($Q$), Key ($K$), and Value ($V$) tensors. These transformations can be expressed as follows:
$$Q = W_Q X_f + b_Q, \quad K = W_K X_f + b_K, \quad V = W_V X_f + b_V,$$
where $W_Q$, $W_K$, and $W_V$ are the weight matrices for the Query, Key, and Value tensors, respectively; and $b_Q$, $b_K$, and $b_V$ are the corresponding bias terms. Next, the Query tensor $Q$ undergoes a max pooling operation, which is followed by a repeat operation to match the original dimension of $Q$:
$$Q = \mathrm{MaxPool}(Q), \quad Q = \mathrm{Repeat}(Q),$$
where $\mathrm{Repeat}(\cdot)$ represents the repeat operation and $\mathrm{MaxPool}(\cdot)$ is the max pooling operation. The attention mechanism is applied by computing the dot product of the Query and Key tensors, which is followed by a Softmax operation to compute the attention weights:
$$\text{Max-Atten} = \mathrm{softmax}\!\left(\frac{Q K^{T} + Q}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the dimensionality of the Query tensor. The result is then multiplied by the Value tensor $V$. For each attention head, the output is denoted as $H_1, H_2, \ldots, H_h$, where $h$ is the number of attention heads:
$$H_i = \text{Max-Atten}_i \quad \text{for } i \in \{1, 2, \ldots, h\}.$$
After computing the attention outputs for each head, they are concatenated to form the final tensor $X_f$:
$$X_f = \mathrm{Concat}(H_1, H_2, \ldots, H_h).$$
Finally, the output of GREM, $f_{GREM}$, is obtained by applying a fully connected layer:
$$f_{GREM} = X_f W_o + b_o,$$
where $W_o$ is the weight matrix and $b_o$ is the bias term of the fully connected layer. In summary, GREM utilizes multiple attention heads to enhance the global relationships among nodes in the graph, with each head focusing on different parts of the graph and their interactions. These operations are designed to strengthen the global features and improve the model’s representation of distant label dependencies.
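The following sketch illustrates one possible realization of GREM’s multi-head max-attention. How the max-pooled Query re-enters the score computation is ambiguous in the formula above; here we add the pooled (and repeated) Query back to $Q$ before the dot product with $K$, and the number of heads is a placeholder, so the whole block should be treated as our assumption.

```python
import torch
import torch.nn as nn

class GREM(nn.Module):
    """Global Relationship Enhancement Module sketch: multi-head 'max-attention' (illustrative)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dk = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, n, dim] node features coming from the GCN
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.dk).transpose(1, 2)   # [B, h, n, dk]
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        q_max = q.max(dim=2, keepdim=True).values.expand_as(q)            # MaxPool + Repeat on Q
        scores = ((q + q_max) @ k.transpose(-2, -1)) / self.dk ** 0.5     # [B, h, n, n]
        heads = torch.softmax(scores, dim=-1) @ v                         # [B, h, n, dk]
        concat = heads.transpose(1, 2).reshape(b, n, -1)                  # concat H_1 ... H_h
        return self.out(concat)                                           # f_GREM
```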

4. Experiments

In this section, we begin by describing the evaluation metrics and experimental details, followed by introducing the datasets, PASCAL VOC 2007, PASCAL VOC 2012, and MS-COCO, and then we report the experimental results. Subsequently, we present ablation studies conducted from four perspectives and, finally, the results of the visualization analysis.

4.1. Evaluation Metrics

Based on previous work [12,46], we adopted several evaluation metrics for assessing the performance of our model, including average per-class precision (CP), recall (CR), F1 (CF1), average overall precision (OP), recall (OR), and F1 (OF1). CF1 comprehensively assesses the model’s accuracy and recall, making it particularly suitable for evaluating the model’s performance on the positive class and aiding in finding a balance between precision and recall. OF1 comprehensively considers the system’s accuracy and recall, and it is typically used to evaluate the overall performance of the system in detection or recognition tasks. For each image, if the confidence score for a label is greater than 0.5, the label is predicted as positive. Additionally, we present the results for the top three predicted labels, and we also calculated the mean average precision (mAP).
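For completeness, these metrics can be computed from the score matrix as in the sketch below; the helper is ours, using the standard definitions and scikit-learn’s average precision, and is not the authors’ evaluation script.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def multilabel_metrics(scores: np.ndarray, targets: np.ndarray, thr: float = 0.5) -> dict:
    """scores, targets: [num_images, num_classes]; targets are 0/1. Returns a metric dict (sketch)."""
    preds = (scores > thr).astype(int)                     # label is positive if confidence > 0.5
    tp = (preds * targets).sum(0).astype(float)            # per-class true positives
    cp = np.mean(tp / np.maximum(preds.sum(0), 1))         # per-class precision (CP)
    cr = np.mean(tp / np.maximum(targets.sum(0), 1))       # per-class recall (CR)
    cf1 = 2 * cp * cr / max(cp + cr, 1e-12)                # CF1
    op = tp.sum() / max(preds.sum(), 1)                    # overall precision (OP)
    orc = tp.sum() / max(targets.sum(), 1)                 # overall recall (OR)
    of1 = 2 * op * orc / max(op + orc, 1e-12)              # OF1
    mAP = np.mean([average_precision_score(targets[:, c], scores[:, c])
                   for c in range(targets.shape[1])])      # mean average precision
    return {"CP": cp, "CR": cr, "CF1": cf1, "OP": op, "OR": orc, "OF1": of1, "mAP": mAP}
```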

4.2. Implementation Details

In SFDM, we employed ResNet-101 as the backbone network, which was pre-trained on the ImageNet [25] dataset to leverage its feature extraction capabilities. In FEM, the scaling control coefficients α and β were set to 0.1 and 0.04, respectively, to control the relative weight of the feature enhancement mechanism. For LSDM, the Graph Convolutional Network (GCN) was constructed with two layers, where the output dimensions of the first and second layers were 1024 and 2048, respectively, to capture semantic dependencies. For the calculation of the correlation matrix, the parameters τ and p were set to 0.4 and 0.3, respectively. During the training process, the input images were randomly cropped and resized to a uniform resolution of 448 × 448 pixels, followed by random horizontal flipping to augment the data. For network optimization, we used Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of $10^{-4}$ to prevent overfitting. The initial learning rate was set to 0.01, and the network was trained for a total of 100 epochs to ensure convergence. Additionally, we incorporated a warm-up scheduler to gradually increase the learning rate during the initial two epochs of training to stabilize the optimization process. Our network implementation was based on the PyTorch framework, ensuring efficient and scalable training.
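The optimizer and warm-up schedule described above can be set up roughly as follows; the linear ramp shape of the warm-up and the absence of a later decay policy are our assumptions.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module, steps_per_epoch: int,
                    base_lr: float = 0.01, warmup_epochs: int = 2):
    """SGD + linear warm-up over the first two epochs, matching the reported settings (sketch)."""
    optimizer = SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=1e-4)
    warmup_steps = warmup_epochs * steps_per_epoch
    # ramp the learning rate from ~0 to base_lr, then keep it constant;
    # call scheduler.step() once per training iteration
    scale = lambda step: min(1.0, (step + 1) / warmup_steps)
    scheduler = LambdaLR(optimizer, lr_lambda=scale)
    return optimizer, scheduler
```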

4.3. Baseline Model Parameters

To help readers better evaluate the effectiveness and fairness of the comparative analysis, we explicitly present the parameters of some baseline models as follows:
MulCon [47]: This model uses a projector Proj() with two linear layers and ReLU activation. Pretraining employs an Adam optimizer with a 1-cycle policy (max learning rate = $2 \times 10^{-4}$), while fine tuning uses SGD with momentum = 0.9, weight decay = $10^{-4}$, and an initial learning rate of 0.01 (which is reduced by a factor of 10 every 20 epochs). Data augmentation includes resizing to 448 × 448, random flips, and random augmentation.
VSGCN [48]: The model is trained using the Adam optimizer with an initial learning rate of $10^{-4}$ for PASCAL VOC 2007 (decaying by a factor of 10 every 20 epochs) and $5 \times 10^{-5}$ for MS-COCO (decaying by a factor of 5 every 15 epochs). A LeakyReLU activation function (negative slope = 0.2) and a dropout layer (drop rate = 0.1) are applied in the Multi-Head GCN mechanism. The backbone network is ResNet-101, pre-trained on ImageNet, and 300-dimensional GloVe embeddings are used as semantic prototypes. Input images are resized to 448 × 448, and training is conducted on two TITAN Xp GPUs with a batch size of 32.
CFMIC [18]: The model is implemented using PyTorch. Input images are resized to 448 × 448, and the feature aggregation parameter is set to 0.5. For label co-occurrence embedding, each object is represented as a 300-dimensional GloVe vector. In the cross-modal fusion module, m = 358 is used for feature fusion and g = 2 is applied for group sum-pooling. The model is trained with a batch size of 32.
LGR [49]: The model is implemented using PyTorch with ResNet-101 (pre-trained on ImageNet) as the backbone network. GloVe embeddings are used for label representation, with multi-word labels averaged. Key hyperparameters include s = 0.3, 0.2, and 0.3; and p = 0.2, 0.2, and 0.3 for MS-COCO, PASCAL VOC 2007, and PASCAL VOC 2012, respectively. The GGNN steps are set to $T_l = 1$ and $T_f = 3$. During training, input images are resized to 448 × 448 with random cropping and flipping. The SGD optimizer is used with a batch size of 32, a momentum of 0.9, a weight decay of $10^{-5}$, and an initial learning rate of 0.001 (reduced by a factor of 10 on plateaus).
GCN-MS-SGA [13]: This model uses ResNet-101 (pre-trained on ImageNet) as the backbone network. Label embeddings are generated using 300-dimensional GloVe vectors, with multi-word labels represented as the average of their word embeddings. Input images are resized to 448 × 448 with random cropping and horizontal flipping for data augmentation. The SGD optimizer is used with a momentum of 0.9, a weight decay of $10^{-4}$, and an initial learning rate of 0.05, which decays by a factor of 10 every 50 epochs for a total of 100 epochs.
ML-GCN [12]: This model uses ResNet-101 (pre-trained on ImageNet) as the backbone. Label embeddings are 300-dimensional GloVe vectors, with multi-word labels averaged. LeakyReLU is used for activation. Input images are resized to 448 × 448 with random cropping and flipping. The SGD optimizer is used with a momentum of 0.9, a weight decay of $10^{-4}$, and an initial learning rate of 0.01, decaying by a factor of 10 every 40 epochs for 100 epochs.

4.4. Experimental Results

To validate the effectiveness of our model, we conducted experiments on the PASCAL VOC (2007 and 2012) and MS-COCO datasets, and we also compared the results with other state-of-the-art methods. For a fair comparison, we directly adopted the results reported in the corresponding literature for all baseline models, ensuring that each model was evaluated under its optimal settings, as provided by the original authors. For detailed parameter configurations, readers can refer to the original papers cited in this work. Additionally, we performed ablation studies to further analyze the components and evaluated their contributions. Finally, we conducted network visualization analysis to intuitively display the regions of interest identified by the network.

4.4.1. Comparisons with State-of-the-Art Methods

Comparisons on PASCAL VOC 2007. The VOC 2007 dataset serves as a standard benchmark for evaluating the capability of image classification and recognition. The dataset consists of a training set (5011 images) and a test set (4952 images), comprising a total of 9963 images across 20 categories. To ensure a fair comparison with other state-of-the-art methods, we report the results in terms of average precision (AP) and mean average precision (mAP). The experimental results on the VOC 2007 dataset are presented in Table 1, where we compared our method with 13 other state-of-the-art approaches. Thanks to the establishment of our two-dimensional dependencies, our approach yielded significantly better results compared to methods that solely rely on image feature extraction or those that build one-dimensional dependencies. Our proposed model achieved an mAP of 96.5%, indicating superior performance compared to the other methods.
Comparisons on PASCAL VOC 2012. The PASCAL VOC2012 dataset is a challenging and widely recognized benchmark in the field of computer vision, and it is commonly used for tasks such as object detection, image classification, and semantic segmentation. It consists of a diverse collection of images, each annotated with detailed labels that specify object boundaries, categories, and other relevant information. Our experimental results, as presented in Table 2, demonstrate that our network achieved an impressive 96.0% on the comprehensive evaluation metric mAP. This performance surpasses the current state-of-the-art by 1.2%. This improvement can be attributed to the effective two-dimensional dependencies we established, which simultaneously consider both image feature dependencies and label semantic dependencies. Additionally, global relationships were also taken into account when constructing these dependencies.
Comparisons on MS-COCO. The COCO dataset is a large and diverse dataset for object detection, segmentation, and captioning. It aims at scene understanding and is primarily composed of images captured from complex everyday scenes, with precise object localization achieved through accurate segmentation. The dataset consists of 91 object categories, 328,000 images, and 2,500,000 labels. It currently stands as the largest dataset for semantic segmentation—with 80 provided categories and over 330,000 images, including 200,000 annotated images—containing over 1.5 million instances in total. The experimental results, as shown in Table 3, demonstrate a comparison between our proposed method and 11 state-of-the-art approaches, including CNN-RNN [39], SRN [40], ResNet-101 [26], ML-GCN [12], GCN-MS-SGA [13], MCAR [10], CFMIC [18], LGR [49], FLNet [53], LGLM [56], and STMG [57]. Compared to the state-of-the-art methods, our approach exhibited superior performance, indicating the effectiveness of our proposed modeling of two-dimensional dependencies.

4.4.2. Statistical Significance Experiment

To demonstrate that our results were significantly superior to those of the other groups, we first used the Wilcoxon signed-rank test to check whether there was a statistically significant difference between our results and those of the other groups. The Wilcoxon signed-rank test is a non-parametric method that is used to compare two related samples. It involves calculating the differences between data pairs, ranking the absolute differences, assigning signs based on the direction, and then calculating the positive and negative rank sums. The p-value is derived from comparing the rank sums, and if it is below the significance level (typically 0.05), a significant difference is indicated. The PASCAL VOC 2007 dataset was utilized for validation, wherein the results of 13 comparison methods were assessed for significance against those of our method. The Wilcoxon signed-rank test was employed to analyze the significance, with the results shown in Table 4. Notably, all p-values were found to be less than the significance threshold (α = 0.05). This indicates that the results of our model showed a significant difference compared to the results of the other 13 models. Secondly, we further compared the mean values (i.e., mAP) of each group to determine which group performed better numerically. It is clear that our model has a higher mAP than the other 13 models (see Table 1), indicating that its performance is superior to that of the other groups. Taken together, these two points demonstrate that our method is statistically superior to the other methods, and that the advantages over the different comparison methods are not coincidental but stable and consistent. In addition, we calculated the 95% confidence intervals for the mean differences. The calculation of the confidence intervals was based on the results of multiple independent experiments, which more comprehensively reflects the stability of the model’s performance. The experimental results indicate that the differences were statistically significant and that the estimation precision was high. By introducing confidence intervals, we were able to assess the variability of the model’s performance more intuitively, thereby enhancing the credibility and scientific rigor of the conclusions.
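The test itself is readily reproduced with SciPy; the sketch below pairs the per-class AP vectors of two models (e.g., the 20 VOC classes from Table 1) and reports whether the difference is significant. The function and variable names are our illustration.

```python
from scipy.stats import wilcoxon

def compare_per_class_ap(ap_ours, ap_baseline, alpha: float = 0.05):
    """Paired Wilcoxon signed-rank test on two per-class AP vectors (e.g., the 20 VOC classes)."""
    stat, p_value = wilcoxon(ap_ours, ap_baseline)   # ranks the signed per-class differences
    return p_value, p_value < alpha                  # significant if p falls below alpha

# usage: p, significant = compare_per_class_ap(ap_tddm, ap_mlgcn)  # AP vectors taken from Table 1
```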

4.4.3. Ablation Studies

In this section, we present ablation studies conducted from four different aspects: the impact of the different modules in TDDM on classification results; the influence of the calculation methods for the fusion factor $B_f$ and compensation factor $C_f$ in FFM feature fusion; the effects of the weight proportion control coefficients α and β in FEM feature enhancement; and the influence of the threshold values τ and p in GCN for computing correlation matrices. We conducted these ablation experiments on the VOC dataset.
Effects of different modules. To evaluate the effectiveness of our model, we conducted ablation studies by reconstructing the model with different factors. To validate the necessity of modeling dependencies in both the image and semantic dimensions, we conducted ablation experiments by individually removing the SFDM and LSDM branches, as shown in the second and third rows of Table 5. The results indicate that using either the SFDM or LSDM branch alone yielded inferior performance compared to modeling dependencies in both dimensions. This strongly demonstrates the complementarity of the two branches and the effectiveness of modeling image feature spatial dependencies and semantic label dependencies. Furthermore, to further verify the effectiveness of each module in the dual branches, we conducted individual ablation experiments on each module. As the GCN module is concatenated before the GREM and serves as the foundation for LSDM, it cannot be individually ablated. Therefore, we selected FFM, FEM, and GREM as separate ablation modules. The results of the ablation experiments are presented in rows four to six of Table 5. The experimental results thoroughly validated the effectiveness of each module. We conducted statistical significance tests for the individual modules in the ablation study using Wilcoxon signed-rank tests. The results, as shown in Table 6, indicate that the p-values for all comparisons were below 0.05, confirming that the observed differences were statistically significant. Additionally, we further analyzed the impact of each module by examining the mAP values, and the results demonstrate that each module contributed significantly to the model’s performance.
Effects of different $B_f$ and $C_f$. Determining the calculation of the fusion factor $B_f$ and the compensation factor $C_f$ in FFM is a key aspect that warrants our attention. Additionally, we aimed to ensure that the computation method does not introduce any additional computational burden. Therefore, we explored relatively straightforward approaches, such as the element-wise addition (add) or multiplication (mul) of feature vectors, to perform these calculations. The combined methods and identification results are presented in Table 7. We explored four fusion methods in various combinations, and it can be observed that adding the feature vectors as the fusion factor $B_f$ and multiplying the feature vectors as the compensation factor $C_f$ yielded the best combination approach.
Effects of different α and β . To investigate the impact of the weight proportion control coefficients α and β on the results of the global enhanced attention in the FEM, we conducted experiments using different weighting coefficients, as shown in Figure 4. We changed the values of α in a set of {0.02, 0.05, 0.1, 0.15, 0.2, 0.25} and β in a set of {0.02, 0.03, 0.04, 0.05, 0.06, 0.07}. We can observe that the model achieved the best performance when α = 0.1 and β = 0.04. We hypothesized that when α and β become excessively large, they accentuate feature attention and fusion attention while neglecting the crucial global average attention, resulting in a decrease in mAP.
Effects of different thresholds τ and p. In order to investigate the impact of the threshold values τ and p on the accuracy of computing correlation matrices, we conducted experiments using different τ and p settings. Relevant literature [12,13,49] was reviewed to determine the approximate parameter ranges used in similar models, which provided preliminary guidance on the ranges to explore. We selected values of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1 for τ, and 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 for p, and we observed the impact of the different combinations on the model’s performance. As shown in Figure 5, the best performance was achieved when τ = 0.4 and p = 0.3. If the threshold τ is set too small, too many edges are retained and noise is not filtered out, which affects the model’s performance. On the other hand, if τ is set too large, too much information is discarded, preventing the network from capturing sufficient structural information and leading to a decline in model performance. Regarding the value of p, if p is too small, the nodes in the graph may not gather enough information from their related nodes. On the other hand, if p is too large, it can lead to excessive smoothing, causing the node features to become overly similar and losing important distinctions. Both scenarios can negatively impact the model’s performance.

4.4.4. Computational Cost Analysis

Inference Time Analysis. In this study, inference time serves as the metric for evaluating model computational complexity. It directly reflects the model’s computational speed in real-world applications, thereby demonstrating its performance under practical conditions. Image inputs were set to 224 × 224, and the inference tests were conducted on an RTX 3060 graphics card. The test results are shown in Table 8. The experimental findings indicate that the proposed method exhibits slower inference times than backbone networks such as ResNet-101 and the classical ML-GCN method; this can be attributed to the simpler structures and lower computational complexities of these base networks, which result in shorter inference times. Conversely, methods like GCN-MS-SGA and MCAR, despite demonstrating higher performance, exhibit longer inference times due to their complex network structures and extensive computational operations. Overall, the proposed method strikes a balance between performance and computational complexity, delivering superior performance while maintaining an advantage in inference time. As such, it is suitable for efficient inference in practical applications.
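A typical timing protocol consistent with the setup described above (224 × 224 inputs, single image, GPU) is sketched below; the warm-up and run counts are our choices, not values taken from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(model: torch.nn.Module, device: str = "cuda",
                           input_size: int = 224, warmup: int = 10, runs: int = 100) -> float:
    """Average single-image inference time in milliseconds (sketch of the timing protocol)."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, input_size, input_size, device=device)
    for _ in range(warmup):                      # warm-up iterations to stabilize CUDA kernels
        model(dummy)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```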
Spatial Complexity Analysis. In the experiments, the spatial complexity of the different models was compared, as shown in Table 9, which lists the parameter counts for each model. Specifically, the parameter counts for ResNet-101, ML-GCN, LGLM, and the proposed method were 44.50 M, 42.50 M, 44.04 M, and 45.21 M, respectively. The proposed method has a relatively higher parameter count, primarily due to the inclusion of FFM, FEM, and GREM. These modules were incorporated to better integrate and enhance feature information and to establish more effective dependencies, but they inevitably introduce additional computational overhead. Although the proposed model has a higher parameter count, trading a certain amount of parameters for accuracy improvements is considered worthwhile. Efforts have been made to balance performance and parameter count so that the model improves accuracy while keeping the parameter count under control. Additionally, the Conclusions and Future Work section provides a detailed discussion on how to further reduce model complexity in future research. Plans include optimizing the model structure and incorporating more efficient techniques to reduce computational overhead and the number of parameters, with the aim of achieving a better balance between performance and complexity.

4.5. Visualization

To further validate the effectiveness of our model, we visualized heatmaps, as shown in Figure 6. We used CAM (class activation mapping) to highlight the regions of interest in the images, as perceived by the network. CAM is a technique for visualizing the internal workings of a neural network; it aids in understanding which objects or features the network focuses on and in visualizing its decision-making process in image classification tasks. We compared ResNet-101, MobileNet, and our proposed model. The experimental results demonstrate that our model exhibits a stronger interest in the regions containing the labels, whereas ResNet-101 and MobileNet show a less focused and broader interest, resulting in inferior performance.
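For a classifier with a global-pooling head, the CAM for a given class can be computed as in the following sketch; the function and its arguments are our illustration rather than the exact visualization code used here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_activation_map(feature_map: torch.Tensor, fc_weight: torch.Tensor,
                         class_idx: int, out_size: int = 448) -> torch.Tensor:
    """CAM sketch for a global-pool + linear classifier head.

    feature_map: [1, D, H, W] output of the last conv stage (e.g., ResNet 'Layer4'),
    fc_weight:   [num_classes, D] weight of the final linear classifier.
    """
    w = fc_weight[class_idx].view(1, -1, 1, 1)                    # class-specific channel weights
    cam = (feature_map * w).sum(dim=1, keepdim=True)              # weighted sum over channels
    cam = F.relu(cam)                                             # keep positive evidence only
    cam = F.interpolate(cam, size=(out_size, out_size),
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1] for overlay
    return cam[0, 0]
```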

5. Conclusions and Future Work

In this paper, we propose the Two-Dimensional Dependency Model (TDDM) for MLIC. The dual branches of TDDM address two main issues: firstly, modeling object spatial feature dependencies to establish long-range dependencies and alleviate the limitations of fixed convolutional kernel receptive fields; secondly, modeling and enhancing global label semantic dependencies to address issues arising from the local semantics introduced by GCN itself. Our model outperforms previous methods on the PASCAL VOC 2007, PASCAL VOC 2012, and MS-COCO datasets. We believe that its excellent performance and clear architecture provide a strong foundation for future research in multi-label image classification. In future work, we will focus on addressing some of the limitations of the proposed method. Firstly, due to the use of cross-layer feature fusion in the proposed method, a certain degree of feature confusion occurs, making it more challenging to distinguish similar image features. To address this issue, transfer learning will be introduced. Transfer learning allows the model to leverage learned features and representations from related tasks or domains, thereby better distinguishing similar features and reducing confusion. This approach aims to enhance the model’s robustness and generalization capabilities, making it easier for the model to differentiate similar image features and address the limitations of the current method. Secondly, the method uses a fixed correlation matrix, which may limit the model’s ability to capture dynamic relationships in the data and reduce its performance across different tasks and datasets. Therefore, we plan to develop correlation matrices with adaptive algorithms in the future to address this limitation. These adaptive algorithms can dynamically adjust the correlation matrices based on the input data, further improving the model’s performance and adaptability across a broader range of computer vision applications. This improvement will not only enhance the model’s accuracy, but also expand its applicability, ensuring superior performance in various practical scenarios. Thirdly, the focus will be on model lightweighting to enhance computational efficiency and practical applicability. This includes optimizing the model structure, employing depthwise separable convolutions to reduce parameter count and computational complexity, and utilizing knowledge distillation techniques to transfer knowledge from a large model to a smaller one, thereby improving its performance and efficiency. Additionally, model pruning and quantization techniques will be applied to further reduce computational overhead and memory usage. These measures are expected to enhance the applicability of the model in resource-constrained environments while maintaining high performance levels.

Author Contributions

The authors confirm contribution to this paper as follows: study conception and design: J.W. and Y.Z.; data collection: J.W. and T.W.; analysis and interpretation of results: J.W. and B.L.; draft manuscript preparation: J.W. and H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

This work is supported by the project program of the Science and Technology on Micro-system Laboratory, No. 6142804230104.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, C.; Meng, Z. TBFF-DAC: Two-branch feature fusion based on deformable attention and convolution for object detection. Comput. Electr. Eng. 2024, 116, 109132. [Google Scholar] [CrossRef]
  2. Sirisha, M.; Sudha, S. TOD-Net: An end-to-end transformer-based object detection network. Comput. Electr. Eng. 2023, 108, 108695. [Google Scholar] [CrossRef]
  3. Zhou, J.; Hu, Y.; Lai, Z.; Wang, T. SEGANet: 3D object detection with shape-enhancement and geometry-aware network. Comput. Electr. Eng. 2023, 110, 108888. [Google Scholar] [CrossRef]
  4. Vankdothu, R.; Hameed, M.A. Adaptive features selection and EDNN based brain image recognition on the internet of medical things. Comput. Electr. Eng. 2022, 103, 108338. [Google Scholar] [CrossRef]
  5. Mouzai, M.; Mustapha, A.; Bousmina, Z.; Keskas, I.; Farhi, F. Xray-Net: Self-supervised pixel stretching approach to improve low-contrast medical imaging. Comput. Electr. Eng. 2023, 110, 108859. [Google Scholar] [CrossRef]
  6. Ke, J.; Wang, W.; Chen, X.; Gou, J.; Gao, Y.; Jin, S. Medical entity recognition and knowledge map relationship analysis of Chinese EMRs based on improved BiLSTM-CRF. Comput. Electr. Eng. 2023, 108, 108709. [Google Scholar] [CrossRef]
  7. Ding, I.J.; Liu, J.T. Three-layered hierarchical scheme with a Kinect sensor microphone array for audio-based human behavior recognition. Comput. Electr. Eng. 2016, 49, 173–183. [Google Scholar] [CrossRef]
  8. Saw, C.Y.; Wong, Y.C. Neuromorphic computing with hybrid CNN–Stochastic Reservoir for time series WiFi based human activity recognition. Comput. Electr. Eng. 2023, 111, 108917. [Google Scholar] [CrossRef]
  9. Bharathi, A.; Sridevi, M. Human action recognition in complex live videos using graph convolutional network. Comput. Electr. Eng. 2023, 110, 108844. [Google Scholar] [CrossRef]
  10. Gao, B.B.; Zhou, H.Y. Learning to Discover Multi-Class Attentional Regions for Multi-Label Image Recognition. IEEE Trans. Image Process. 2021, 30, 5920–5932. [Google Scholar] [CrossRef]
  11. Liu, L.; Guo, S.; Huang, W.; Scott, M.R. Decoupling category-wise independence and relevance with self-attention for multi-label image classification. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1682–1686. [Google Scholar]
  12. Chen, Z.M.; Wei, X.S.; Wang, P.; Guo, Y. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5177–5186. [Google Scholar]
  13. Liang, J.; Xu, F.; Yu, S. A multi-scale semantic attention representation for multi-label image recognition with graph networks. Neurocomputing 2022, 491, 14–23. [Google Scholar] [CrossRef]
  14. Chen, T.; Xu, M.; Hui, X.; Wu, H.; Lin, L. Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 522–531. [Google Scholar]
  15. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Wang, Z.; Chen, T.; Li, G.; Xu, R.; Lin, L. Multi-label Image Recognition by Recurrently Discovering Attentional Regions. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  17. Li, X.; Zhao, F.; Guo, Y. Multi-label Image Classification with A Probabilistic Label Enhancement Model. In Proceedings of the UAI, Quebec City, QC, Canada, 23–27 July 2014; Volume 1, pp. 1–10. [Google Scholar]
  18. Wang, Y.; Xie, Y.; Zeng, J.; Wang, H.; Fan, L.; Song, Y. Cross-modal fusion for multi-label image classification with attention mechanism. Comput. Electr. Eng. 2022, 101, 108002. [Google Scholar] [CrossRef]
19. Zhao, J.; Yan, K.; Zhao, Y.; Guo, X.; Huang, F.; Li, J. Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 163–172. [Google Scholar]
  20. Zhang, M.; Zhou, Z. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837. [Google Scholar] [CrossRef]
  21. Bogatinovski, J.; Todorovski, L.; Džeroski, S.; Kocev, D. Comprehensive comparative study of multi-label classification methods. Expert Syst. Appl. 2022, 203, 117215. [Google Scholar] [CrossRef]
  22. Tsoumakas, G.; Katakis, I. Multi-Label Classification: An Overview. Int. J. Data Warehous. Min. 2009, 3, 64–74. [Google Scholar]
  23. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
24. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
25. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 99, 7132–7141. [Google Scholar]
  28. Rifkin, R.; Klautau, A. In defense of one-vs-all classification. J. Mach. Learn. Res. 2004, 5, 101–141. [Google Scholar]
  29. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202. [Google Scholar] [CrossRef]
  30. Agrawal, R.; Srikant, R. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, 12–15 September 1994; Volume 1215, pp. 487–499. [Google Scholar]
  31. Han, J.; Pei, J.; Yin, Y. Mining frequent patterns without candidate generation. ACM Sigmod Rec. 2000, 29, 1–12. [Google Scholar] [CrossRef]
  32. Chen, Y.L.; Hsu, C.L.; Chou, S.C. Constructing a multi-valued and multi-labeled decision tree. Expert Syst. Appl. 2003, 25, 199–209. [Google Scholar] [CrossRef]
  33. Schapire, R.E.; Singer, Y. BoosTexter: A Boosting-based System for Text Categorization. Mach. Learn. 2000, 39, 135–168. [Google Scholar] [CrossRef]
  34. Zhang, M.L.; Zhou, Z.H. Multilabel Neural Networks with Applications to Functional Genomics and Text Categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351. [Google Scholar] [CrossRef]
  35. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
36. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  37. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  38. Cheng, M.M.; Zhang, Z.; Lin, W.Y.; Torr, P. BING: Binarized normed gradients for objectness estimation at 300 fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3286–3293. [Google Scholar]
  39. Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; Xu, W. Cnn-rnn: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2285–2294. [Google Scholar]
  40. Feng, Z.; Li, H.; Ouyang, W.; Yu, N.; Wang, X. Learning Spatial Regularization with Image-Level Supervisions for Multi-label Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  41. Lin, R.; Xiao, J.; Fan, J. Nextvlad: An efficient neural network to aggregate frame-level features for large-scale video classification. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 206–218. [Google Scholar]
  42. Guo, H.; Zheng, K.; Fan, X.; Yu, H.; Wang, S. Visual attention consistency under image transforms for multi-label image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 729–739. [Google Scholar]
  43. Hand, E.; Castillo, C.; Chellappa, R. Doing the best we can with what we have: Multi-label balancing with selective learning for attribute prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  44. Li, Q.; Qiao, M.; Bian, W.; Tao, D. Conditional graphical lasso for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2977–2986. [Google Scholar]
  45. Ye, J.; He, J.; Peng, X.; Wu, W.; Qiao, Y. Attention-driven dynamic graph convolutional network for multi-label image recognition. In Computer Vision–ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 649–665. [Google Scholar]
  46. Cao, P.; Chen, P.; Niu, Q. Multi-label image recognition with two-stream dynamic graph convolution networks. Image Vis. Comput. 2021, 113, 104238. [Google Scholar] [CrossRef]
  47. Dao, S.D.; Zhao, H.; Phung, D.; Cai, J. Contrastively enforcing distinctiveness for multi-label image classification. Neurocomputing 2023, 555, 126605. [Google Scholar] [CrossRef]
  48. Deng, X.; Feng, S.; Lyu, G.; Wang, T.; Lang, C. Beyond Word Embeddings: Heterogeneous Prior Knowledge Driven Multi-Label Image Classification. IEEE Trans. Multimed. 2023, 25, 4013–4025. [Google Scholar] [CrossRef]
  49. Chen, Y.; Zou, C.; Chen, J. Label-aware graph representation learning for multi-label image classification. Neurocomputing 2022, 492, 50–61. [Google Scholar] [CrossRef]
  50. Yang, H.; Zhou, T.; Zhang, Y.; Gao, B.B.; Wu, J.; Cai, J. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 280–288. [Google Scholar]
  51. Chen, T.; Wang, Z.; Li, G.; Lin, L. Recurrent attentional reinforcement learning for multi-label image recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  52. Nie, L.; Chen, T.; Wang, Z.; Kang, W.; Lin, L. Multi-label image recognition with attentive transformer-localizer module. Multimed. Tools Appl. 2022, 81, 7917–7940. [Google Scholar] [CrossRef]
  53. Sun, D.; Ma, L.; Ding, Z.; Luo, B. An attention-driven multi-label image classification with semantic embedding and graph convolutional networks. Cogn. Comput. 2022, 15, 1308–1319. [Google Scholar] [CrossRef]
  54. Wei, Y.; Xia, W.; Lin, M.; Huang, J.; Ni, B.; Dong, J.; Zhao, Y.; Yan, S. HCP: A flexible CNN framework for multi-label image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1901–1907. [Google Scholar] [CrossRef]
  55. Wang, M.; Luo, C.; Hong, R.; Tang, J.; Feng, J. Beyond Object Proposals: Random Crop Pooling for Multi-Label Image Recognition. IEEE Trans. Image Process. 2016, 25, 5678–5688. [Google Scholar] [CrossRef] [PubMed]
  56. Xie, Y.; Wang, Y.; Liu, Y.; Zhou, K. Label graph learning for multi-label image recognition with cross-modal fusion. Multimed. Tools Appl. 2022, 81, 25363–25381. [Google Scholar] [CrossRef]
  57. Wang, Y.; Xie, Y.; Fan, L.; Hu, G. STMG: Swin transformer for multi-label image recognition with graph convolution network. Neural Comput. Appl. 2022, 34, 10051–10063. [Google Scholar] [CrossRef]
Figure 1. The overall framework of our proposed Two-Dimensional Dependency Model (TDDM), which consists of two branches: the Spatial Feature Dependency Module (SFDM) for modeling image spatial dependencies, and the Label Semantic Dependency Module (LSDM) for modeling the global semantic relationships among labels. The framework also includes the Feature Fusion Module (FFM) for feature integration, the Feature Enhancement Module (FEM) for feature enhancement, and the Global Relationship Enhancement Module (GREM) for expanding the receptive field of label relationships.
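At a high level, Figure 1 describes two branches whose outputs are combined into a single set of label scores. The minimal PyTorch sketch below illustrates only that two-branch pattern; the module internals, tensor shapes, and the dot-product fusion are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the two-branch pattern in Figure 1 (PyTorch).
# All module internals here are placeholders/assumptions, not the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoBranchSketch(nn.Module):
    def __init__(self, num_classes=20, label_dim=300, feat_dim=2048):
        super().__init__()
        # SFDM-style branch (assumed): a CNN backbone supplies spatial features.
        backbone = models.resnet101(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # LSDM-style branch (assumed): label embeddings are mapped to per-class
        # classifier weights; a GCN/GREM would normally refine them first.
        self.label_proj = nn.Linear(label_dim, feat_dim)

    def forward(self, images, label_embeddings):
        # images: (B, 3, H, W); label_embeddings: (C, label_dim)
        f = self.pool(self.features(images)).flatten(1)   # (B, feat_dim)
        w = self.label_proj(label_embeddings)             # (C, feat_dim)
        return f @ w.t()                                  # (B, C) label logits

# Usage with random tensors, just to show the shapes involved.
model = TwoBranchSketch()
logits = model(torch.randn(2, 3, 448, 448), torch.randn(20, 300))
print(logits.shape)  # torch.Size([2, 20])
```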
Figure 2. Schematic diagram of the FFM. The tensor u represents the outputs of the X_3 and X_4 convolution stages; u′ denotes the tensor obtained by interpolating u to a unified dimension; B_f represents the feature fusion factor; C_f represents the compensation factor; and U represents the fused tensor restored to its original dimensions.
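The exact formulas for the fusion factor B_f and the compensation factor C_f are not reproduced in this excerpt; the sketch below only illustrates the interpolate-and-fuse pattern described in the caption, modeling both factors as learnable scalars and mimicking the add(B_f)/mul(C_f) combination that Table 7 reports as the best variant. Every name and shape here is an assumption.

```python
# Illustrative FFM-style fusion (assumptions throughout: B_f and C_f are modeled
# as learnable scalars; the add(B_f) + mul(C_f) setting of Table 7 is mimicked).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFMSketch(nn.Module):
    def __init__(self, c3=1024, c4=2048):
        super().__init__()
        self.align = nn.Conv2d(c3, c4, kernel_size=1)  # match channel widths
        self.b_f = nn.Parameter(torch.zeros(1))        # fusion factor (assumed scalar)
        self.c_f = nn.Parameter(torch.ones(1))         # compensation factor (assumed scalar)

    def forward(self, x3, x4):
        # x3: (B, c3, H, W) from stage X_3; x4: (B, c4, H/2, W/2) from stage X_4.
        u = self.align(x3)
        # u': interpolate so both tensors share the same spatial dimensions.
        u_prime = F.interpolate(u, size=x4.shape[-2:], mode="bilinear",
                                align_corners=False)
        fused = (x4 + self.b_f * u_prime) * self.c_f   # add(B_f), mul(C_f)
        return fused                                   # same resolution as X_4

ffm = FFMSketch()
out = ffm(torch.randn(1, 1024, 28, 28), torch.randn(1, 2048, 14, 14))
print(out.shape)  # torch.Size([1, 2048, 14, 14])
```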
Figure 3. Schematic diagram of the GREM. Each head provides Query, Key, and Value vectors to achieve maximum attention enhancement, thereby accomplishing global relationship enhancement among labels.
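Per the caption, the GREM applies multi-head attention in which each head forms its own Query, Key, and Value projections over the label representations. A minimal sketch of that pattern, with assumed dimensions, is given below; it is not the authors' code.

```python
# Minimal sketch of GREM-style global relationship enhancement: multi-head
# self-attention over per-label representations. Shapes and hyperparameters
# (80 labels, dimension 512, 8 heads) are assumptions.
import torch
import torch.nn as nn

num_labels, dim, heads = 80, 512, 8
label_feats = torch.randn(1, num_labels, dim)          # (batch, labels, dim)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
enhanced, weights = attn(label_feats, label_feats, label_feats)

# 'enhanced' carries, for every label, information aggregated from all other
# labels, i.e., a global (not merely first-order) view of label relationships.
print(enhanced.shape, weights.shape)  # (1, 80, 512) (1, 80, 80)
```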
Figure 4. Impact of the scaling control coefficients α and β on model performance: accuracy comparisons on the VOC2007 dataset.
Figure 5. Impact of threshold τ and weight distribution coefficient p on model performance: accuracy comparisons on the VOC2007 dataset.
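The hyperparameters in Figure 5 follow the naming of the correlation-matrix construction popularized by ML-GCN [12], in which label co-occurrence statistics are binarized with a threshold τ and re-weighted with a coefficient p. The sketch below shows that standard recipe; whether TDDM uses exactly this formulation is an assumption based on the hyperparameter names.

```python
# Sketch of the common tau/p recipe for building a label correlation matrix
# (as in ML-GCN [12]); its use in TDDM is assumed from the names in Figure 5.
import numpy as np

def correlation_matrix(cooccurrence, counts, tau=0.4, p=0.2):
    # cooccurrence[i, j]: images containing labels i and j; counts[i]: images with label i.
    P = cooccurrence / np.maximum(counts[:, None], 1)   # conditional probabilities P(j|i)
    A = (P >= tau).astype(np.float32)                   # binarize with threshold tau
    np.fill_diagonal(A, 0)
    row_sum = np.maximum(A.sum(axis=1, keepdims=True), 1e-6)
    A = A / row_sum * p                                  # redistribute weight p to neighbors
    np.fill_diagonal(A, 1 - p)                           # keep 1 - p on the node itself
    return A

# Shape-only demo with synthetic statistics for 20 labels.
A = correlation_matrix(np.random.randint(0, 50, (20, 20)).astype(float),
                       np.full(20, 100.0))
print(A.shape)  # (20, 20)
```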
Figure 6. CAM visualization results. We compare the visualization results of ResNet-101, MobileNet, and our proposed model. The visualizations demonstrate that our model provides more precise regions of interest.
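For reference, a class activation map of the kind visualized in Figure 6 can be produced by weighting the final convolutional feature map with the classifier weights of a chosen class. The sketch below uses a plain ResNet-101 and a hypothetical target class; it is a generic CAM recipe, not the authors' visualization code.

```python
# Minimal class activation map (CAM) sketch: the final conv feature map is
# weighted by the classifier weights of one class. ResNet-101 layout assumed.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet101(weights=None).eval()
features = torch.nn.Sequential(*list(model.children())[:-2])   # (B, 2048, h, w)

img = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    fmap = features(img)                                        # (1, 2048, 14, 14)
    w = model.fc.weight                                         # (num_classes, 2048)
    cls = 0                                                      # hypothetical target class
    cam = torch.einsum("c,bchw->bhw", w[cls], fmap)              # weight the channels
    cam = F.relu(cam)
    cam = F.interpolate(cam[:, None], size=img.shape[-2:], mode="bilinear",
                        align_corners=False)[:, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)     # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 448, 448])
```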
Table 1. Comparisons of the AP and mAP with the state-of-the-art methods on the VOC 2007 dataset. (Bold indicates the best performance in its category, and the same applies hereafter).
| Methods | Aero | Bike | Bird | Boat | Bot | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mot | Pers | Plant | Sheep | Sofa | Train | Tv | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-RNN [39] | 96.7 | 83.1 | 94.2 | 92.8 | 61.2 | 82.1 | 89.1 | 94.2 | 64.2 | 83.6 | 70.0 | 92.4 | 91.7 | 84.2 | 93.7 | 59.8 | 93.2 | 75.3 | 99.7 | 78.6 | 84.0 |
| ResNet-101 [26] | 99.5 | 97.7 | 97.8 | 96.4 | 65.7 | 91.8 | 96.1 | 97.6 | 74.2 | 80.9 | 85.0 | 98.4 | 96.5 | 95.9 | 98.4 | 70.1 | 88.3 | 80.2 | 98.9 | 89.2 | 89.9 |
| FeV+LV [50] | 97.9 | 97.0 | 96.6 | 94.6 | 73.6 | 93.9 | 96.5 | 95.5 | 73.7 | 90.3 | 82.8 | 95.4 | 97.7 | 95.9 | 98.6 | 77.6 | 88.7 | 78.0 | 98.3 | 89.0 | 90.6 |
| Atten-Reinforce [51] | 98.6 | 97.1 | 97.1 | 95.5 | 75.6 | 92.8 | 96.8 | 97.3 | 78.3 | 92.2 | 87.6 | 96.9 | 96.5 | 93.6 | 98.5 | 81.6 | 93.1 | 83.2 | 98.5 | 89.3 | 92.0 |
| ATL [52] | 99.0 | 97.2 | 96.6 | 96.2 | 75.4 | 92.0 | 96.8 | 97.2 | 79.0 | 93.6 | 89.3 | 97.0 | 97.5 | 94.0 | 98.8 | 81.6 | 94.3 | 85.8 | 98.7 | 90.6 | 92.5 |
| ML-GCN [12] | 99.5 | 98.5 | 98.6 | 98.1 | 80.8 | 94.6 | 97.2 | 98.2 | 82.3 | 95.7 | 86.4 | 98.2 | 98.4 | 96.7 | 99.0 | 84.7 | 96.7 | 84.3 | 98.9 | 93.7 | 94.0 |
| GCN-MS-SGA [13] | 99.6 | 98.3 | 98.0 | 97.5 | 81.0 | 93.1 | 97.5 | 98.5 | 86.3 | 88.3 | 89.2 | 95.5 | 98.0 | 96.1 | 98.3 | 89.0 | 96.7 | 91.6 | 97.9 | 92.3 | 94.2 |
| LGR [49] | 99.6 | 95.6 | 97.3 | 96.4 | 84.0 | 95.8 | 94.1 | 98.9 | 86.9 | 96.8 | 86.8 | 98.7 | 98.6 | 96.9 | 98.8 | 84.8 | 97.2 | 83.7 | 98.8 | 93.6 | 94.2 |
| FLNet [53] | 99.6 | 98.7 | 98.9 | 97.9 | 84.6 | 95.3 | 96.2 | 96.5 | 85.6 | 96.1 | 87.2 | 97.7 | 98.6 | 97.0 | 98.1 | 86.5 | 97.4 | 86.5 | 98.8 | 90.8 | 94.4 |
| CFMIC [18] | 99.7 | 98.5 | 98.8 | 98.3 | 83.9 | 96.5 | 97.5 | 98.8 | 83.1 | 96.1 | 87.4 | 98.6 | 98.9 | 97.2 | 99.0 | 85.4 | 97.1 | 84.9 | 99.2 | 94.2 | 94.7 |
| SSGRL [14] | 99.7 | 98.4 | 98.0 | 97.6 | 85.7 | 96.2 | 98.2 | 98.8 | 82.0 | 98.1 | 89.7 | 98.8 | 98.7 | 97.0 | 99.0 | 86.9 | 98.1 | 85.8 | 99.0 | 93.7 | 95.0 |
| VSGCN [48] | 99.8 | 98.6 | 98.7 | 98.6 | 85.2 | 96.9 | 98.2 | 98.6 | 83.6 | 96.5 | 87.8 | 99.1 | 99.0 | 97.5 | 99.3 | 85.8 | 96.3 | 87.6 | 98.6 | 95.2 | 95.1 |
| MulCon [47] | 99.8 | 98.3 | 99.3 | 98.6 | 83.3 | 98.4 | 98.0 | 98.3 | 85.8 | 98.3 | 90.5 | 99.3 | 98.9 | 96.6 | 98.8 | 86.3 | 99.8 | 87.3 | 99.8 | 96.1 | 95.6 |
| Ours | 99.6 | 99.1 | 98.8 | 98.9 | 85.8 | 97.9 | 98.5 | 99.1 | 87.9 | 98.1 | 92.9 | 99.1 | 98.6 | 98.6 | 99.3 | 90.9 | 98.8 | 91.3 | 99.8 | 97.0 | 96.5 |
Table 2. Comparisons of the AP and mAP with state-of-the-art methods on the VOC 2012 dataset.
| Methods | Aero | Bike | Bird | Boat | Bot | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mot | Pers | Plant | Sheep | Sofa | Train | Tv | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FeV+LV [50] | 98.4 | 92.8 | 93.4 | 90.7 | 74.9 | 93.2 | 90.2 | 96.1 | 78.2 | 89.8 | 80.6 | 95.7 | 96.1 | 95.3 | 97.5 | 73.1 | 91.2 | 75.4 | 97.0 | 88.2 | 89.4 |
| HCP [54] | 99.1 | 92.8 | 97.4 | 94.4 | 79.9 | 93.6 | 89.8 | 98.2 | 78.2 | 94.9 | 79.8 | 97.8 | 97.0 | 93.8 | 96.4 | 74.3 | 94.7 | 71.9 | 96.7 | 88.6 | 90.5 |
| RCP [55] | 99.3 | 92.2 | 97.5 | 94.9 | 82.3 | 94.1 | 92.4 | 98.5 | 83.8 | 93.5 | 83.1 | 98.1 | 97.3 | 96.0 | 98.8 | 77.7 | 95.1 | 79.4 | 97.7 | 92.4 | 92.2 |
| MCAR [10] | 99.6 | 97.1 | 98.3 | 96.6 | 87.0 | 95.5 | 94.4 | 98.8 | 87.0 | 96.9 | 85.0 | 98.7 | 98.3 | 97.3 | 99.0 | 83.8 | 96.8 | 83.7 | 98.3 | 93.5 | 94.3 |
| SSGRL [14] | 99.7 | 96.1 | 97.7 | 96.5 | 86.9 | 95.8 | 95.0 | 98.9 | 88.3 | 97.6 | 87.4 | 99.1 | 99.2 | 97.3 | 99.0 | 84.8 | 98.3 | 85.8 | 99.2 | 94.1 | 94.8 |
| Ours | 99.7 | 97.5 | 99.0 | 97.6 | 87.1 | 97.7 | 96.2 | 99.7 | 90.2 | 98.4 | 89.3 | 99.1 | 99.3 | 97.4 | 99.0 | 88.0 | 99.1 | 88.5 | 99.7 | 97.3 | 96.0 |
Table 3. Comparisons with state-of-the-art methods on the MS-COCO dataset.
| Methods | mAP (All) | CP (All) | CR (All) | CF1 (All) | OP (All) | OR (All) | OF1 (All) | CP (Top-3) | CR (Top-3) | CF1 (Top-3) | OP (Top-3) | OR (Top-3) | OF1 (Top-3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-RNN [39] | 61.2 | – | – | – | – | – | – | 66.0 | 55.6 | 66.4 | 69.2 | 66.4 | 67.8 |
| SRN [40] | 77.1 | 81.6 | 65.4 | 71.2 | 82.7 | 69.9 | 75.8 | 85.2 | 58.8 | 67.4 | 87.4 | 62.5 | 72.9 |
| ResNet-101 [26] | 77.3 | 80.2 | 66.7 | 72.8 | 83.9 | 70.8 | 76.8 | 84.1 | 59.4 | 69.7 | 89.1 | 62.8 | 73.6 |
| ML-GCN [12] | 83.0 | 85.1 | 72.0 | 78.0 | 85.8 | 75.4 | 80.3 | 89.2 | 64.1 | 74.6 | 90.5 | 66.5 | 76.7 |
| GCN-MS-SGA [13] | 83.4 | 85.1 | 71.6 | 77.8 | 84.0 | 75.0 | 79.3 | 88.8 | 63.4 | 75.7 | 88.8 | 66.0 | 75.7 |
| MCAR [10] | 83.8 | 85.0 | 72.1 | 78.0 | 88.0 | 73.9 | 80.3 | 88.1 | 65.5 | 75.1 | 91.0 | 66.3 | 76.7 |
| CFMIC [18] | 83.8 | 85.8 | 72.7 | 78.7 | 86.3 | 76.3 | 81.0 | 89.7 | 64.5 | 75.0 | 90.7 | 67.3 | 77.3 |
| LGR [49] | 83.9 | 85.0 | 73.3 | 78.7 | 86.2 | 76.4 | 81.0 | 89.0 | 64.8 | 75.0 | 90.7 | 67.0 | 77.1 |
| FLNet [53] | 84.1 | 84.9 | 73.9 | 79.0 | 85.5 | 77.4 | 81.1 | 89.0 | 65.2 | 75.2 | 90.4 | 67.5 | 77.3 |
| LGLM [56] | 84.2 | 85.7 | 72.8 | 78.7 | 86.6 | 76.7 | 81.3 | 89.4 | 64.7 | 75.0 | 90.7 | 67.4 | 77.3 |
| STMG [57] | 84.3 | 85.8 | 72.7 | 78.7 | 86.7 | 76.8 | 81.5 | 89.3 | 64.8 | 75.1 | 90.8 | 67.4 | 77.4 |
| Ours | 85.2 | 84.1 | 75.9 | 79.8 | 85.2 | 78.7 | 81.8 | 88.7 | 66.5 | 76.0 | 90.5 | 68.1 | 77.7 |
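Table 3 uses the standard MLIC evaluation protocol: per-class precision/recall/F1 (CP, CR, CF1) and overall precision/recall/F1 (OP, OR, OF1), reported both over all predicted labels and when only each image's three highest-scored labels are kept. The sketch below implements these textbook definitions; the thresholding choices are illustrative and not taken from the paper.

```python
# Standard MLIC metrics (per-class vs. overall precision/recall/F1) computed
# from binary prediction and ground-truth matrices; the top-3 protocol keeps
# only each image's three highest-scored labels. Thresholds are illustrative.
import numpy as np

def mlic_metrics(pred, gt, eps=1e-9):
    # pred, gt: (num_images, num_classes) binary arrays.
    tp = (pred * gt).sum(0); fp = (pred * (1 - gt)).sum(0); fn = ((1 - pred) * gt).sum(0)
    cp = np.mean(tp / (tp + fp + eps)); cr = np.mean(tp / (tp + fn + eps))
    cf1 = 2 * cp * cr / (cp + cr + eps)
    TP, FP, FN = tp.sum(), fp.sum(), fn.sum()
    op = TP / (TP + FP + eps); orr = TP / (TP + FN + eps)
    of1 = 2 * op * orr / (op + orr + eps)
    return cp, cr, cf1, op, orr, of1

def top3(scores):
    # Keep the three highest-scored labels per image.
    pred = np.zeros_like(scores)
    idx = np.argsort(-scores, axis=1)[:, :3]
    np.put_along_axis(pred, idx, 1, axis=1)
    return pred

scores = np.random.rand(8, 80)
gt = (np.random.rand(8, 80) > 0.9).astype(float)
print(mlic_metrics((scores > 0.5).astype(float), gt))   # "ALL" setting
print(mlic_metrics(top3(scores), gt))                   # "TOP-3" setting
```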
Table 4. Wilcoxon signed-rank test results.
| Methods | p-value | Confidence interval |
|---|---|---|
| CNN-RNN [39] | 8.83 × 10⁻⁵ | [8.52, 16.50] |
| ResNet-101 [26] | 8.84 × 10⁻⁵ | [3.38, 9.76] |
| FeV+LV [50] | 8.83 × 10⁻⁵ | [3.72, 8.12] |
| Atten-Reinforce [51] | 8.84 × 10⁻⁵ | [3.03, 5.96] |
| ATL [52] | 8.82 × 10⁻⁵ | [2.57, 5.37] |
| ML-GCN [12] | 8.78 × 10⁻⁵ | [1.38, 3.57] |
| GCN-MS-SGA [13] | 1.54 × 10⁻⁴ | [1.26, 3.41] |
| LGR [49] | 1.96 × 10⁻⁴ | [1.30, 3.37] |
| FLNet [53] | 2.32 × 10⁻⁴ | [1.24, 2.96] |
| CFMIC [18] | 2.91 × 10⁻⁴ | [0.88, 2.81] |
| SSGRL [14] | 2.44 × 10⁻⁴ | [0.66, 2.40] |
| VSGCN [48] | 6.26 × 10⁻⁴ | [0.63, 2.28] |
| MulCon [47] | 2.28 × 10⁻² | [0.21, 1.64] |
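The Wilcoxon signed-rank test in Table 4 compares paired performance values of our method against each baseline. A minimal sketch using SciPy is shown below; pairing the per-class APs from Table 1 (here, the first seven classes against ML-GCN) is an assumption about the pairing unit, and the confidence intervals in Table 4 are not reproduced.

```python
# Sketch of a Wilcoxon signed-rank test like those in Table 4, applied to paired
# per-class AP values. The seven values are the first seven per-class APs from
# Table 1 (Ours vs. ML-GCN); the pairing unit is an assumption.
import numpy as np
from scipy.stats import wilcoxon

ap_ours     = np.array([99.6, 99.1, 98.8, 98.9, 85.8, 97.9, 98.5])
ap_baseline = np.array([99.5, 98.5, 98.6, 98.1, 80.8, 94.6, 97.2])   # ML-GCN [12]

stat, p_value = wilcoxon(ap_ours, ap_baseline, alternative="greater")
print(f"W={stat:.1f}, p={p_value:.4f}")
```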
Table 5. Ablation study on the impact of different modules (FFM, FEM, GCN, and GREM) on the model performance when using the VOC2007 dataset.
SFDM (FFM, FEM) | LSDM (GCN, GREM) | mAP
96.5
95.0
94.3
94.9
94.6
95.7
Table 6. Wilcoxon signed-rank test results for the ablation study.
| Modules | LSDM | SFDM | FFM | FEM | GREM |
|---|---|---|---|---|---|
| p-value | 2.44 × 10⁻⁴ | 6.34 × 10⁻⁴ | 5.36 × 10⁻⁴ | 3.50 × 10⁻⁴ | 3.44 × 10⁻² |
Table 7. Impact of different calculation methods for the fusion factor B_f and compensation factor C_f on the model performance when using the VOC2007 dataset.
| B_f | C_f | mAP | OF1 | CF1 |
|---|---|---|---|---|
| add(B_f) | add(C_f) | 94.8 | 92.2 | 90.5 |
| mul(B_f) | add(C_f) | 96.3 | 93.2 | 92.0 |
| mul(B_f) | mul(C_f) | 95.5 | 93.0 | 91.6 |
| add(B_f) | mul(C_f) | 96.5 | 93.2 | 92.2 |
Table 8. Inference time comparison results.
| Methods | ResNet-101 [26] | ML-GCN [12] | MCAR [10] | GCN-MS-SGA [13] | Ours |
|---|---|---|---|---|---|
| Inference time (ms) | 32.22 | 139.28 | 251.70 | 351.18 | 172.34 |
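Inference times such as those in Table 8 are typically measured per image on a GPU with warm-up iterations and explicit synchronization. The sketch below shows one common protocol with an off-the-shelf ResNet-101; the paper's exact measurement setup (hardware, batch size, input resolution) is not specified in this excerpt, so those choices are assumptions.

```python
# Generic sketch of per-image inference timing (warm-up + synchronization);
# not the paper's measurement protocol.
import time
import torch
import torchvision.models as models

model = models.resnet101(weights=None).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
x = torch.randn(1, 3, 448, 448, device=device)

with torch.no_grad():
    for _ in range(10):                     # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"average inference time: {elapsed_ms:.2f} ms")
```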
Table 9. Comparison of the number of parameters.
| Methods | ResNet-101 [26] | ML-GCN [12] | LGLM [56] | Ours |
|---|---|---|---|---|
| Number of parameters (M) | 44.50 | 42.50 | 44.04 | 45.21 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
