Review

Recent Research Progress of Graph Neural Networks in Computer Vision

by Zhiyong Jia 1, Chuang Wang 2, Yang Wang 2, Xinrui Gao 3, Bingtao Li 4,*, Lifeng Yin 2 and Huayue Chen 5

1 Library, Kaifeng Vocational College, Kaifeng 475100, China
2 School of Intelligent Rail Transit, Dalian Jiaotong University, Dalian 116028, China
3 School of Electronic Information and Automation, Civil Aviation University of China, Tianjin 300300, China
4 Aviation Maintenance NCO School, Air Force Engineering University, Xinyang 464000, China
5 School of Computer Science, China West Normal University, Nanchong 637002, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1742; https://doi.org/10.3390/electronics14091742
Submission received: 25 March 2025 / Revised: 17 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

Abstract

Graph neural networks (GNNs) have demonstrated significant potential in the field of computer vision in recent years, particularly in handling non-Euclidean data and capturing complex spatial and semantic relationships. This paper provides a comprehensive review of the latest research on GNNs in computer vision, with a focus on their applications in image processing, video analysis, and multimodal data fusion. First, we briefly introduce common GNN models, such as graph convolutional networks (GCN) and graph attention networks (GAT), and analyze their advantages in image and video data processing. Subsequently, this paper delves into the applications of GNNs in tasks such as object detection, image segmentation, and video action recognition, particularly in capturing inter-region dependencies and spatiotemporal dynamics. Finally, the paper discusses the applications of GNNs in multimodal data fusion tasks such as image–text matching and cross-modal retrieval, and highlights the main challenges faced by GNNs in computer vision, including computational complexity, dynamic graph modeling, heterogeneous graph processing, and interpretability issues. This paper provides a comprehensive understanding of the applications of GNNs in computer vision for both academia and industry and envisions future research directions.

1. Introduction

In recent years, deep learning has made significant progress in the field of computer vision (CV), particularly with convolutional neural networks (CNNs), which have become the core technology of modern visual systems, excelling in tasks such as image classification, object detection, and semantic segmentation [1]. CNNs, by simulating the function of biological visual neurons, effectively capture the spatial structural information of images. Through convolutional layers, they extract local features, reduce computational complexity with pooling layers, and make image-level decisions through fully connected layers, giving CNNs a significant advantage when processing two-dimensional image data.
However, despite CNNs’ excellence in handling structured grid data (e.g., images), their limitations become evident when dealing with complex topological visual data, particularly non-Euclidean data (such as relationships between different objects in an image or between video frames) [2]. CNNs assume that the input data have a regular grid structure, making it difficult for traditional CNN methods to effectively model the relationships and dependencies in complex graph-structured data.
To address this issue, graph neural networks (GNNs) have emerged as a deep learning approach that effectively handles irregular graph-structured data. GNNs model the relationships between nodes and edges, allowing them to capture deep information in image and video data processing, especially in multimodal inputs and complex scenarios, showing enormous potential [3]. Unlike CNNs, GNNs can process non-Euclidean data, adapting to object relationships in images and temporal dependencies in videos [4]. As a result, GNNs are playing an increasingly important role in tasks such as object recognition, video analysis, and temporal action localization.
What makes GNNs unique is their message passing mechanism, which allows information to propagate between nodes in the graph, capturing the dependencies between nodes. This enables GNNs to capture not only local spatial information but also the global dependencies between different nodes (e.g., objects, people, video frames, etc.) when processing visual data. Particularly in multimodal tasks, GNNs can better integrate data from different sources, providing more precise predictions and analyses. In video analysis, the graph convolution module can effectively capture temporal dependencies, significantly improving the performance of temporal action localization tasks [5]. In scene understanding tasks, GNNs, through efficient reasoning mechanisms, can optimize the fusion of data from different modalities, enhancing analytical efficiency [6].
Despite the progress GNNs have made in various fields, several challenges remain. Improving computational efficiency and addressing bottlenecks in large-scale data processing continue to be key issues. Additionally, the interpretability of GNN models and their adaptability across domains are also core challenges [7]. Future research will focus on optimizing the computational capabilities of GNNs, enhancing their adaptability to complex tasks, and further addressing the interpretability issues in their practical applications.
The structure of this paper is organized as follows: Section 2 reviews the basic concepts and development of GNNs; Section 3 discusses the applications of GNNs in computer vision, particularly in image processing; Section 4 analyzes the challenges faced by GNNs and explores future research directions; Section 5 concludes the paper, summarizing the findings and proposing directions for future work.

2. Development and Basic Concepts of Graph Neural Networks

2.1. Concept and Development of GNNs

Graph neural networks (GNNs) [8] were first proposed by Marco Gori et al. in 2005. The core idea is to iteratively aggregate the information from neighboring nodes and perform message passing to improve classification accuracy [9]. GNNs have shown excellent performance in image data processing, especially in capturing relationships between different regions. However, GNNs still face challenges, such as over-smoothing, over-compression, difficulty in handling heterogeneity, and difficulty in capturing long-range dependencies [10,11,12,13].
In 2009, F. Scarselli et al. introduced a graph-based data processing method that improved the accuracy of graph data processing [14]. Early GNNs, based on recurrent neural networks (RNNs), aggregated node features to generate vector representations, but their performance in complex image data processing was limited. To address this limitation, Bruna et al. proposed applying convolutional neural networks (CNNs) to graph data, introducing the graph convolutional network (GCN), which effectively reduced computational complexity [15]. However, GCN still had limitations when processing large-scale image data and was typically suitable only for shallow applications.
To overcome this limitation, Hamilton et al. introduced the GraphSAGE model [16], which aggregates node features layer by layer using sampling and adjacency information, addressing computational and memory bottlenecks. However, GraphSAGE does not account for differences in node importance, which limits its performance on data with large weight disparities. To address this, Veličković et al. introduced the graph attention network (GAT) [17], which assigns different weights to neighboring nodes through an attention mechanism, achieving better results. Zhang et al. further proposed the gated attention network (GaAN) [18], incorporating multi-head attention mechanisms and yielding significant results.

2.2. Evolution and Progress of GNN Models

To efficiently process complex graph-structured data, several classic GNN models have emerged, demonstrating strong capabilities in computer vision, including graph convolutional networks (GCN) [19], graph sample and aggregation (GraphSAGE) [20], graph attention networks (GAT) [21], and relational graph convolutional networks (R-GCN) [22]. These models significantly improve the efficiency and flexibility of feature learning through different aggregation and update strategies, especially in tasks such as image classification, object detection, and scene understanding. This section briefly discusses the working principles of these classic models, focusing on how they optimize the message passing and node feature update processes to capture the relationships between nodes in images. The meanings of the symbols used are provided in Table 1.

2.2.1. Graph Convolutional Networks (GCN)

GCN updates node features by aggregating information from neighboring nodes, capturing structural relationships in images, and improving task performance. Its core mechanism involves message passing and feature updating. The message function and update function of GCN are given by Equations (1) and (2), respectively:
$m_v^{(l)} = \sum_{u \in N(v)} H_u^{(l)}$ (1)
$H_v^{(l+1)} = \sigma\big(\tilde{A} H^{(l)} W^{(l)}\big)$ (2)
where $\tilde{A} = D^{-1/2}(A + I)D^{-1/2}$ represents the normalized adjacency matrix, $I$ is the identity matrix, and $D$ is the degree matrix.
GCN aggregates neighboring information progressively through multiple layers, but it suffers from high computational complexity and may encounter the problem of over-smoothing, which leads to the loss of node features.
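To make the propagation rule concrete, the following is a minimal NumPy sketch of a single GCN layer implementing Equations (1) and (2); the toy graph, feature dimensions, and tanh activation are illustrative assumptions rather than the configuration of any specific model discussed in this review.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN propagation step: H' = sigma(A_hat @ H @ W), cf. Equations (1)-(2).

    A: (N, N) binary adjacency matrix of the node graph
    H: (N, F_in) node features, W: (F_in, F_out) learnable weights
    """
    N = A.shape[0]
    A_self = A + np.eye(N)                       # add self-loops: A + I
    deg = A_self.sum(axis=1)                     # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv_sqrt @ A_self @ D_inv_sqrt     # symmetric normalization D^-1/2 (A+I) D^-1/2
    return activation(A_hat @ H @ W)             # aggregate neighbors, transform, activate

# toy example: 4 nodes, 3-dim input features, 2-dim output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(gcn_layer(A, H, W).shape)                  # (4, 2)
```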

2.2.2. Graph Sampling and Aggregation (GraphSAGE)

GraphSAGE learns node representations by neighbor sampling and feature aggregation, making it suitable for large-scale datasets. The key steps include the following:
(a) Sampling—random or stratified sampling is used to reduce computational complexity.
(b) Feature Aggregation—the features of neighboring nodes are aggregated, with common methods including sum, average, and max pooling.
(c) Message Function—the message passing process in GraphSAGE is given by Equation (3):
$m_v^{(l)} = \mathrm{Aggregate}\big(\{H_u^{(l)} : u \in N(v)\}\big)$ (3)
(d) Update Function—the function for updating node features is given by Equation (4):
$H_v^{(l+1)} = \sigma\big(W^{(l)} \cdot \mathrm{Concat}(H_v^{(l)}, m_v^{(l)})\big)$ (4)
Here, Concat represents the concatenation of the current node features and the aggregated message. This update method helps to progressively refine the node representation in each layer.
GraphSAGE effectively handles large-scale graph data through these steps, but the treatment of neighbor node importance is relatively simple, which may fail to capture complex dependency relationships.
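The sample-then-aggregate scheme can be sketched in a few lines of NumPy; the mean aggregator, the fixed sample size, and the ReLU activation below are illustrative assumptions following Equations (3) and (4), not the exact configuration of the original GraphSAGE implementation.

```python
import numpy as np

def graphsage_layer(neighbors, H, W, num_samples=2, rng=np.random.default_rng(0)):
    """One GraphSAGE step with mean aggregation, cf. Equations (3)-(4):
    m_v = mean({H_u : u in sampled N(v)}),  H_v' = sigma(W @ concat(H_v, m_v)).

    neighbors: dict node -> list of neighbor ids
    H: (N, F) node features, W: (F_out, 2F) weights applied to the concatenation
    """
    N, F = H.shape
    H_next = np.zeros((N, W.shape[0]))
    for v in range(N):
        nbrs = neighbors[v]
        # sample a fixed-size neighborhood to bound the per-node cost
        sampled = rng.choice(nbrs, size=min(num_samples, len(nbrs)), replace=False)
        m_v = H[sampled].mean(axis=0)                 # mean aggregator over sampled neighbors
        h_cat = np.concatenate([H[v], m_v])           # Concat(H_v, m_v)
        H_next[v] = np.maximum(W @ h_cat, 0.0)        # ReLU activation
    return H_next

neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
H = np.random.randn(4, 3)
W = np.random.randn(2, 6)                             # output dim 2, input dim 2 * 3
print(graphsage_layer(neighbors, H, W).shape)         # (4, 2)
```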

2.2.3. Graph Attention Networks (GAT)

GAT combines graph neural networks and self-attention mechanisms, enabling it to adaptively assign weights to neighboring nodes, enhancing the flexibility of information aggregation. Its core components are as follows:
(a) Attention Mechanism—by calculating the attention weights between each node and its neighboring nodes, GAT can dynamically adjust the flow of information. The attention coefficient is computed as shown in Equation (5):
$a_{vu} = \dfrac{\exp\big(\mathrm{LeakyReLU}(a^{T}[W H_v \,\|\, W H_u])\big)}{\sum_{k \in N(v)} \exp\big(\mathrm{LeakyReLU}(a^{T}[W H_v \,\|\, W H_k])\big)}$ (5)
Here, $W$ is a learnable weight matrix used for linearly transforming node features, $a$ is a learnable vector used to compute the attention coefficients, and $[\cdot\,\|\,\cdot]$ denotes concatenation.
(b) Message Function—using the calculated attention coefficients $a_{vu}$, GAT aggregates information from the neighboring nodes of node $v$. The message function is given by Equation (6):
$m_v^{(l)} = \sum_{u \in N(v)} a_{vu} W H_u$ (6)
(c) Update Function—the update function of GAT is given by Equation (7):
$H_v^{(l+1)} = \sigma\big(W^{(l)} H_v^{(l)} + m_v^{(l)}\big)$ (7)
GAT performs well in image tasks but has higher computational complexity, which can impact training efficiency, especially on large-scale data.
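A minimal sketch of Equations (5)-(7) with a single attention head is shown below; the toy graph, feature sizes, and tanh update activation are illustrative assumptions rather than the reference GAT implementation.

```python
import numpy as np

def gat_layer(A, H, W, a):
    """One single-head GAT step, cf. Equations (5)-(7).

    A: (N, N) adjacency (1 where an edge exists), H: (N, F_in) features,
    W: (F_in, F_out) shared linear transform, a: (2*F_out,) attention vector.
    """
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)

    Z = H @ W                                    # W H_v for every node
    N = Z.shape[0]
    alpha = np.zeros((N, N))
    for v in range(N):
        nbrs = np.where(A[v] > 0)[0]
        # unnormalized logits e_vu = LeakyReLU(a^T [W H_v || W H_u])
        logits = np.array([leaky_relu(a @ np.concatenate([Z[v], Z[u]])) for u in nbrs])
        alpha[v, nbrs] = np.exp(logits) / np.exp(logits).sum()   # softmax over neighborhood
    m = alpha @ Z                                # message: sum_u a_vu * W H_u
    return np.tanh(Z + m)                        # update sigma(W H_v + m_v), cf. Equation (7)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.random.randn(3, 4)
W = np.random.randn(4, 2)
a = np.random.randn(4)                           # length 2 * F_out
print(gat_layer(A, H, W, a).shape)               # (3, 2)
```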

2.2.4. Relational Graph Convolutional Networks (R-GCN)

R-GCN is used to process graph data with multiple types of relationships. Its operation is as follows:
(a) Message Function—the message function of R-GCN is given by Equation (8):
$m_v^{(l)} = \sum_{r \in R} \sum_{u \in N_r(v)} H_u^{(l)} W_r^{(l)}$ (8)
Here, $R$ is the set of relation types, $N_r(v)$ is the set of neighboring nodes of $v$ connected through relation $r$, and $W_r^{(l)}$ is the learnable weight matrix associated with relation $r$.
(b) Update Function—the update function of R-GCN is given by Equation (9):
$H_v^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{u \in N_r(v)} W_r^{(l)} h_u + b\Big)$ (9)
where $\sigma$ is the activation function (e.g., ReLU) and $b$ is the bias term. This mechanism enables R-GCN to handle graph data with multiple types of relationships, excelling in tasks like image segmentation and object recognition.
R-GCN is effective at processing graph data with multiple relationship types and performs well in tasks such as image segmentation and object recognition. However, it has high computational complexity and requires more computational resources.
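The relation-specific aggregation of Equations (8) and (9) can be sketched as follows; the two relation types and the ReLU update are illustrative assumptions, and regularization tricks used by the original R-GCN (e.g., basis decomposition) are omitted.

```python
import numpy as np

def rgcn_layer(rel_neighbors, H, W_r, b):
    """One R-GCN step, cf. Equations (8)-(9): relation-specific aggregation.

    rel_neighbors: dict relation r -> dict node v -> list of neighbors under r
    H: (N, F_in) features, W_r: dict relation r -> (F_in, F_out) weight, b: (F_out,) bias.
    """
    N = H.shape[0]
    M = np.zeros((N, b.shape[0]))
    for r, nbrs_of in rel_neighbors.items():
        for v, nbrs in nbrs_of.items():
            for u in nbrs:
                M[v] += H[u] @ W_r[r]            # sum over relations r and neighbors N_r(v)
    return np.maximum(M + b, 0.0)                # ReLU(sum + b), cf. Equation (9)

# toy graph with two hypothetical relation types over 3 nodes
rel_neighbors = {"spatial": {0: [1], 1: [0, 2], 2: [1]},
                 "semantic": {0: [2], 2: [0]}}
H = np.random.randn(3, 4)
W_r = {"spatial": np.random.randn(4, 2), "semantic": np.random.randn(4, 2)}
b = np.zeros(2)
print(rgcn_layer(rel_neighbors, H, W_r, b).shape)   # (3, 2)
```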

2.2.5. Model Comparison

To evaluate the performance of different GNN architectures in computer vision tasks, as shown in Table 2, this paper provides a detailed analysis of four mainstream models from aspects such as computational efficiency and relationship modeling capabilities, offering key reference points for selecting the appropriate model.

2.3. Spectral and Spatial Graph Neural Networks

As the importance of graph-structured data in computer vision continues to increase, spectral and spatial graph neural networks (GNNs) have gradually become research hotspots. Spectral GNNs define convolution operations through graph spectral theory, effectively handling the non-Euclidean characteristics of image data and excelling at capturing image structures and spatial relationships. Spatial GNNs, on the other hand, handle the relationships between nodes through a message passing mechanism, making them suitable for pixel-level relationships and inter-object associations. These two approaches complement each other and drive the application of GNNs in computer vision. This section will briefly introduce the key advancements of these two types of GNNs in computer vision.

2.3.1. Spectral Graph Neural Networks

With the widespread application of convolutional neural networks (CNNs) in image processing and text analysis, researchers began to extend these methods to graph-structured data. Bruna et al. [15] introduced spectral convolutional networks, which perform spectral decomposition using the graph’s Laplacian matrix and define convolution operations in the spectral domain to better capture structural features in images [23]. However, this method has high computational complexity, especially when dealing with large-scale data, making efficiency a significant challenge. To address this issue, Defferrard et al. [24] proposed ChebyNet, which improves computational efficiency by using polynomial-form convolution kernels. Kipf and Welling [25] simplified this approach by introducing graph convolutional networks (GCN), which optimize computation through a first-order approximation of the convolution kernel, demonstrating significant advantages in tasks such as image classification and object detection.
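As a concrete illustration of the polynomial-form filtering mentioned above, the sketch below applies a Chebyshev filter of order K to a graph signal, assuming the common simplification λ_max ≈ 2; the coefficients and toy graph are placeholders rather than values taken from ChebyNet itself.

```python
import numpy as np

def cheb_filter(A, X, thetas):
    """Chebyshev spectral filtering: y = sum_k theta_k * T_k(L_tilde) @ X.

    A: (N, N) adjacency, X: (N, F) graph signals, thetas: list of K scalar coefficients.
    L_tilde is the scaled normalized Laplacian 2L/lambda_max - I (lambda_max ~= 2 assumed).
    """
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt          # normalized Laplacian
    L_tilde = L - np.eye(N)                              # 2L/lambda_max - I with lambda_max = 2
    # Chebyshev recursion: T_0 X = X, T_1 X = L_tilde X, T_k X = 2 L_tilde T_{k-1} X - T_{k-2} X
    T_prev, T_curr = X, L_tilde @ X
    out = thetas[0] * T_prev + (thetas[1] * T_curr if len(thetas) > 1 else 0.0)
    for theta in thetas[2:]:
        T_prev, T_curr = T_curr, 2 * L_tilde @ T_curr - T_prev
        out = out + theta * T_curr
    return out

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.randn(3, 2)
print(cheb_filter(A, X, thetas=[0.5, 0.3, 0.2]).shape)   # (3, 2)
```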
Table 2. Model comparison.
| Model | Features | Computational Complexity | Inference Time and Memory Consumption | Advantages | Limitations |
| --- | --- | --- | --- | --- | --- |
| GCN [24] | Approximate method based on spectral graph convolution, capturing local graph structure. | O(E) | Fast inference, influenced by the number of edges E and the structure of the graph; relatively low memory consumption. | Computationally efficient, suitable for large-scale graphs, easy to implement. | Only applicable to fixed-structure graphs, cannot handle dynamic graphs, and ignores global relationships during inference. |
| GraphSAGE [16] | Generates node embeddings by sampling local neighborhood information. | O(E·K) | Inference time proportional to the number of edges E and the neighborhood sampling depth K; moderate memory consumption. | Supports inductive learning, capable of handling large-scale and dynamically changing graph data. | The aggregation function has high complexity and computational overhead. |
| GAT [17] | Introduces a self-attention mechanism, capable of handling irregular graphs, suitable for dynamic structures. | O(V·F + E) | Inference time influenced by the number of nodes V, edges E, and feature dimensions F; higher computational complexity, lower memory consumption. | Self-attention adapts to irregular graphs and supports dynamic structures. | Attention computation is limited, unable to scale to large graphs, and inference speed is slow. |
| R-GCN [22] | Designed specifically for multi-relational graph data, applied to knowledge graphs. | O(E·F) | Higher inference time, dependent on the number of edges E and feature dimensions F; moderate memory consumption. | Capable of handling complex multi-relational graph data, suitable for knowledge graph tasks. | Requires a large number of parameters during training, with slower inference and higher model complexity. |

2.3.2. Spatial Graph Neural Networks

Meanwhile, spatial graph neural networks have gained widespread attention in computer vision. Li et al. [26] introduced the gated graph neural network (Gated GNN), replacing the traditional RNN node update mechanism with gated recurrent units (GRU), which eliminates the limitations of compressed mappings and improves the efficiency of GNN in processing complex image data, especially in image and video analysis. Subsequently, several variants emerged, such as PATCHY-SAN [26], which mimics CNN convolution by node ordering and adjacent point selection; MoNet, which defines pseudo-coordinates for adjacent points and unifies multiple GNN models, showing advantages in scene understanding and object detection; graph attention network (GAT), which uses attention mechanisms to adaptively assign weights to neighboring nodes; and GraphSAGE, which extends to inductive learning via neighbor sampling to accelerate large-scale image data learning. The message passing network [27] unifies spatial GNNs into a message-passing framework, enhancing the efficiency of information exchange between nodes; Xu et al. [28] demonstrated that the expressive power of GNNs is equivalent to the Weisfeiler–Lehman graph isomorphism test, showing its strong capabilities in image classification and object detection.
After the advancements in spectral and spatial GNNs, researchers introduced various variants to address complex graph-structured data and dynamic application scenarios. For example, graph diffusion equations (Graph DEs) [29] improve image structure understanding via diffusion equations; graph transformation networks (GTN) [30] enhance the ability to process heterogeneous graphs; temporal graph networks (TGN) [31] are suitable for dynamic graph data; graph multi-layer perceptron (Graph-MLP) [32] enhances expressive power; and graph transformer (Graphormer) [33] leverages self-attention mechanisms to enhance long-range dependency modeling. Additionally, GPSE (graph positional encoding) [34] improves GNN performance in handling complex structural data by incorporating positional information into the graph. Through these variants, GNN has not only enriched its model architecture but also greatly expanded its application potential across various computer vision fields. Overall, the development chart of graph neural networks (see Figure 1) illustrates the evolution from basic models to various variants, reflecting the ongoing innovation and exploration by researchers in handling image data.

3. Applications of Graph Neural Networks in Computer Vision

In recent years, convolutional neural networks (CNNs) have made significant breakthroughs in the field of computer vision (CV). However, they still have limitations in capturing complex visual relationships, such as inter-region associations and temporal dependencies between video frames. For example, images can be viewed as spatial graphs, where each node represents an area of interest, and the edges represent the relationships between these regions, capturing inter-region dependencies. Graph neural networks (GNNs) are thus naturally suited to extract patterns from such graphs, aiding in the completion of computer vision tasks.
This section will explore the applications of GNNs in image processing, video analysis, and cross-media tasks [35], and discuss how they are adapted in representative algorithms to significantly improve task performance.

3.1. Image Processing

3.1.1. Object Detection

Object detection is one of the core tasks in computer vision and has garnered widespread attention in both academia and industry in recent years [36]. This task typically involves object detection and instance segmentation, with the goal of identifying the category and location of objects in an image. Deep learning methods, such as Faster-RCNN [37] and YOLO [38], have made significant progress in this field. Despite the success of these methods in terms of performance, most approaches overlook the interrelationships between categories, which can lead to performance degradation, especially under class imbalance or long-tail data distributions [39]. To address this issue, graph neural networks (GNNs) have been introduced to object detection tasks to model the dependencies between categories and enhance model performance. SGRN [40] (as shown in Figure 2) is a typical example, where the introduction of GNN improves detection accuracy and strengthens the modeling of inter-category relationships.
The challenge in object detection tasks lies not only in recognizing individual objects but also in reasoning about the semantic dependencies, co-occurrence relationships, and relative positions between objects. GNN-based methods address these issues through local and global relationship reasoning. The equation is shown in Equation (10):
$G = (I, O), \quad \hat{X} = \mathrm{GNN}(G, A)$ (10)
GNN effectively captures dependencies between objects by modeling the edges in the graph. For example, Reasoning-RCNN proposes an adaptive global reasoning network that enhances reasoning ability by combining a knowledge graph for global propagation of visual information. SGRN [40] introduces a mechanism for adaptive discovery of semantic and spatial relationships, without relying on manually constructed knowledge bases. RelationNet [41] incorporates an adaptive attention module to explicitly learn the relationships between objects. RelationNet++ [42] introduces a self-attention-based decoder module, enabling the fusion of different object/part representations within a single detection framework. Li et al. [43] introduce heterogeneous graphs to jointly model the relationships between objects and scenes. Recently, GraphFPN [44] proposed a graphical feature pyramid network that facilitates within-scale and cross-scale feature interactions through superpixel hierarchies and spatial and channel attention mechanisms.
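To illustrate the general recipe behind Equation (10), the following sketch builds a sparse appearance-similarity graph over RoI features produced by a detector backbone and performs one normalized message-passing step; the function name, the top-k neighborhood rule, and the residual refinement are illustrative assumptions, not the specific design of SGRN, Reasoning-RCNN, or the other methods cited above.

```python
import numpy as np

def region_graph_refine(region_feats, W, top_k=3):
    """Illustrative region-level reasoning for detection (cf. Equation (10)).

    region_feats: (N, F) RoI features from a detector backbone (assumed given).
    Builds an appearance-similarity graph over the N proposals and performs one
    normalized message-passing step to inject context into each region feature.
    """
    Z = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    S = Z @ Z.T                                            # cosine similarity between regions
    np.fill_diagonal(S, -np.inf)
    A = np.zeros_like(S)
    for v in range(S.shape[0]):
        A[v, np.argsort(S[v])[-top_k:]] = 1.0              # keep the top-k most similar regions
    A = np.maximum(A, A.T)                                 # symmetrize the edge set
    A_hat = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)   # row-normalize
    context = A_hat @ region_feats @ W                     # aggregated neighbor context
    return region_feats + context                          # residual refinement X_hat

feats = np.random.randn(6, 8)                              # 6 proposals, 8-dim RoI features
W = np.random.randn(8, 8)
print(region_graph_refine(feats, W).shape)                 # (6, 8)
```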
As shown in Table 3, a performance comparison of several representative object detection methods on the COCO [45] dataset is presented, covering key evaluation metrics such as mAP accuracy, inference speed (FPS), and memory consumption.
Based on the comparison data in Table 3, we can observe the differences in accuracy and inference speed among different object detection methods. YOLO excels in inference speed, making it suitable for real-time detection applications, but has relatively lower accuracy. In contrast, Faster R-CNN and Mask R-CNN have an advantage in accuracy, particularly for tasks that require high detection precision, although their inference speeds are slower. RelationNet++ introduces an attention-based decoder module and performs well in terms of mAP, capable of integrating representations of different objects/parts, making it suitable for tasks that require precise relationship modeling, though with slower inference speed. SGRN stands out in enhancing inter-class relationship modeling, making it suitable for tasks that require complex relationship reasoning, though its inference speed is relatively low. GraphFPN improves performance through cross-scale feature interaction, performing well in complex scenarios, but with higher memory consumption and moderate inference speed.
Additionally, domain-adaptive object detection (DAOD) has attracted considerable attention for cross-domain data processing. A natural method for modeling cross-domain relationships is to construct a bipartite graph $G = (V_s, V_t, E)$, where the vertices are divided into two disjoint sets and the edges connect vertices from different sets. Chen et al. [46] constructed a relational graph based on cross-domain consistency, modeling the dependencies between objects in the pixel and semantic spaces through bipartite graph convolution networks and graph attention mechanisms. SIGMA [47] formulates DAOD as a graph matching problem, modeling category distributions through cross-image graphs. SRR-FSD [48] introduces a semantic relationship reasoning module that integrates the relationships between base and novel classes, improving the detection of novel objects.
Due to the advantages of GNNs in modeling long-range dependencies, GNN-based detection methods can effectively leverage relationships between objects, across scales, and across domains [49]. However, as additional learning modules are introduced, the optimization process becomes more complex, and compatibility issues between Euclidean and non-Euclidean data still need to be addressed. Future research can explore better feature mapping methods, combine Transformer or pure GNN encoders to enhance node feature representation, or perform reasoning directly in the original feature space, avoiding reliance on feature mappings, thus better preserving the intrinsic structure of the image.

3.1.2. Image Segmentation

Inspired by the groundbreaking advancements in deep learning technologies, significant progress has been made in the field of image classification, with models like ResNet [50] and DenseNet [51] demonstrating outstanding performance in image classification tasks. However, models based on convolutional neural networks (CNNs) have limitations in modeling complex relationships between samples. To address this issue, graph neural networks (GNNs) have been introduced to image classification tasks, enhancing the classification performance of fine-grained image instances through semi-supervised learning [52].
Image segmentation tasks involve partitioning an image into semantic regions and performing pixel-level labeling. Despite advances in CNN architectures, they are unable to reason about distant regions with arbitrary shapes, which affects overall scene understanding. GNNs provide a unified framework capable of modeling both object appearance and image context, improving semantic segmentation performance.
$G = (I, A), \quad A = E(R_a, R_c), \quad \hat{I} = P\big(\mathrm{GNN}(G, A)\big)$
Here, $E$ is a function that simultaneously encodes $R_a$ and $R_c$, $P$ is a pixel-level predictor, and $\hat{I}$ represents the predicted result.
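A minimal sketch of the graph-reasoning pipeline $\hat{I} = P(\mathrm{GNN}(G, A))$ above is shown below: feature-map positions act as nodes, the adjacency is built from pairwise feature affinities, and a linear pixel-level predictor produces the label map. The dense affinity graph and the single propagation step are simplifying assumptions for illustration, not the design of any particular method cited in this subsection.

```python
import numpy as np

def pixel_graph_segment(feat_map, W_gnn, W_cls):
    """Illustrative pixel-graph reasoning for semantic segmentation.

    feat_map: (H, W, C) backbone features. Nodes are spatial positions, the
    adjacency A is a row-wise softmax over pairwise feature affinities (a dense
    choice kept for brevity), and the predictor P is a linear pixel classifier.
    """
    H_dim, W_dim, C = feat_map.shape
    X = feat_map.reshape(-1, C)                               # (H*W, C) node features
    affinity = X @ X.T / np.sqrt(C)                           # pairwise affinities
    A = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)                      # row-stochastic adjacency
    X_ctx = np.maximum(A @ X @ W_gnn, 0.0)                    # one graph propagation step
    logits = (X + X_ctx) @ W_cls                              # pixel-level predictor P
    return logits.argmax(axis=1).reshape(H_dim, W_dim)        # predicted label map I_hat

feat_map = np.random.randn(8, 8, 16)
W_gnn = np.random.randn(16, 16)
W_cls = np.random.randn(16, 21)                               # e.g., 21 PASCAL VOC classes
print(pixel_graph_segment(feat_map, W_gnn, W_cls).shape)      # (8, 8)
```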
Zhang et al. [53] proposed a dual-GCN framework to separately model the spatial relationships between pixels and the dependencies between channel dimensions. Chen et al. [54] designed a global reasoning unit to perform relational reasoning, enhancing global feature aggregation. To avoid constructing a fully connected graph, DGMN [55] employs dynamic neighborhood sampling to predict node dependencies and information propagation affinity. Yu et al. [56] used dynamic sampling of representative nodes for relational modeling. Li et al. [57] improved the Laplacian formula to enable graph reasoning in the original feature space. Hu et al. [58] introduced a category dynamic graph convolution module for dynamic aggregation of pixels with the same category.
For comparison and analysis, Table 4 presents the performance comparison of several GNN-based semantic segmentation methods on the PASCAL VOC 2012 dataset [59]. From the table, it can be seen that Dual GCN achieves a good balance between mIoU accuracy and parameter count through spatial-channel dual graph modeling; GRN introduces a global relationship reasoning unit, enhancing global feature aggregation capabilities, but with a larger parameter count; and CDGC improves mIoU accuracy through a category dynamic graph convolution module, particularly suitable for dynamic feature aggregation, though it comes with relatively higher computational overhead.
For single-shot image segmentation, Zhang et al. [60] introduced the pyramid graph attention module, which combines unlabeled pixels with semantic and contextual information. Xie et al. [61] proposed scale-aware GNNs, using cross-scale relationship reasoning and a node cooperation mechanism to perceive different resolutions of the same object. Zhang et al. [62] proposed affinity attention GNNs, which convert the image into a weighted graph and propagate semantic information to unlabeled pixels. Wu et al. [63] designed a bidirectional graph reasoning network for diffuse segmentation. Table 5 presents the mIoU comparison of two single-shot segmentation models on the PASCAL-5i 1-shot dataset [64]. PGAM enhances the representation capability of images through the pyramid graph attention mechanism, although its mIoU accuracy is relatively lower; scale-aware GNN, on the other hand, achieves higher mIoU accuracy through cross-scale node collaboration, making it suitable for handling cross-scale relationships.
Many studies explore local and global context information using pyramid pooling, dilated convolutions, or self-attention mechanisms. For example, the non-local network [65] achieves this through a self-attention mechanism, although it comes with high computational cost. In contrast, GNN-based methods offer clear advantages in relationship modeling and training efficiency.

3.2. Video Analysis

3.2.1. Video Action Recognition

Human action recognition in videos is one of the core tasks in video processing and understanding, aimed at recognizing human actions in RGB/depth videos or skeletal data.
Regardless of the data format, modeling the spatiotemporal context of humans, objects, and joints is crucial for behavior recognition. Typically, a graph G is composed of objects, people, video frames, or skeletons, and GNNs predict actions by modeling the relationships between these elements.
$G = (\hat{V}, \hat{E}), \quad F = \mathrm{GNN}(G)$
Here, $\hat{V}$ represents nodes that can correspond to objects, people, video frames, or skeletons; $\hat{E}$ represents edges that capture spatial relationships between objects within a single frame or temporal relationships across multiple frames; and $F$ represents the aggregated information used for the final prediction.
Early action recognition methods, such as hand-crafted improved dense trajectory (iDT) [66], two-stream ConvNets [67], C3D [68], and I3D [69] focused on using spatiotemporal appearance features. To better model long-term temporal information, researchers attempted to use recurrent neural networks (RNNs) to model videos as ordered frame sequences [70]. However, these methods overlooked the spatial and temporal relationships between object instances, which are especially important in actions involving interactions between objects and people, such as “opening a book”.
In recent years, researchers have proposed various deep learning models [54] that utilize graph neural networks (GNNs) to build spatiotemporal representations of videos [71]. These models treat objects in the video as nodes in a graph and learn spatiotemporal relationships to optimize video analysis tasks. Wang et al. [72] proposed a graph-based reasoning model that captures long temporal context, reasoning about human–object and object–object relationships. Figure 3 illustrates a video action recognition model based on CNN, where a 3D convolutional neural network (I3D) processes video frames to obtain feature maps $I \in \mathbb{R}^{T \times H \times W \times d}$, where $T$ represents the temporal dimension, $H \times W$ represents the spatial dimensions, and $d$ represents the number of channels. A region proposal network (RPN) [37] then extracts object bounding boxes, and RoIAlign [39] extracts features for each candidate object. The N output candidate objects correspond to N nodes in the graph, with connections based on appearance similarity and spatiotemporal relationships, and human behavior is predicted through graph convolutional networks (GCN). Ou et al. [73] proposed an object-level graph centered on the participants and applied GCNs to capture object context, improving spatiotemporal graphs. Zhang et al. [74] proposed a multi-scale temporal graph reasoning model that utilizes multiple GAT heads and temporal adjacency matrices to capture both short-term and long-term dependencies.
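The object-graph recipe described above can be sketched as follows, assuming the RoIAlign features of the candidate objects are already extracted by an I3D + RPN front end; the similarity threshold, single GCN layer, and mean pooling are illustrative simplifications of the graph-based reasoning models discussed in this subsection.

```python
import numpy as np

def video_object_graph(obj_feats, W_gcn, W_cls, sim_threshold=0.5):
    """Illustrative spatiotemporal object graph for clip-level action recognition.

    obj_feats: (N, d) features of object proposals pooled over the clip (assumed given).
    Edges connect proposals whose appearance similarity exceeds a threshold; one GCN
    step followed by mean pooling over nodes yields clip-level action scores.
    """
    Z = obj_feats / np.linalg.norm(obj_feats, axis=1, keepdims=True)
    A = (Z @ Z.T > sim_threshold).astype(float)            # appearance-similarity edges
    np.fill_diagonal(A, 1.0)                               # self-loops
    A_hat = A / A.sum(axis=1, keepdims=True)               # row-normalized adjacency
    H = np.maximum(A_hat @ obj_feats @ W_gcn, 0.0)         # one GCN layer
    clip_repr = H.mean(axis=0)                             # pool nodes into a clip descriptor
    return clip_repr @ W_cls                               # action logits

obj_feats = np.random.randn(10, 32)                        # 10 proposals, 32-dim features
W_gcn = np.random.randn(32, 32)
W_cls = np.random.randn(32, 400)                           # e.g., Kinetics-400 classes
print(video_object_graph(obj_feats, W_gcn, W_cls).shape)   # (400,)
```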
For comparison and summary, Table 6 presents the performance comparison of several GNN-based video action recognition models on the Kinetics-400 dataset [69]. The region-graphs model has a lower parameter count but relatively lower Top-1 accuracy; the object-relation model improves performance by introducing object-level graphs, at the cost of a larger parameter count; the temporal-reasoning model handles temporal information well and achieves higher accuracy, making it suitable for long temporal tasks, though with relatively higher computational overhead.
In summary, the application of graph neural networks (GNNs) in video action recognition provides a novel approach to addressing the limitations of traditional methods in capturing long-term dependencies and spatial relationships. GNNs can effectively model the relationships between objects and the spatiotemporal context information.

3.2.2. Temporal Action Localization

The task of temporal action localization aims to predict the boundaries and categories of action instances in a video. Most existing methods [75] adopt a two-stage pipeline; first, a set of candidate regions is generated, followed by classification and regression of the temporal boundaries for each candidate. However, these methods fail to effectively utilize the semantic relationships between candidates. To address this issue, graph neural networks (GNNs) have been used to model the interactions between candidates in a video, thereby enhancing the recognition ability of each candidate. P-GCN [76] proposes a method that uses graph convolutional networks (GCN) to model the relationships between candidates. The method first constructs a candidate action graph, treating each candidate action as a node, with the relationships between candidates represented by edges. P-GCN performs reasoning via GCN, transmitting information between candidates and updating node representations, ultimately optimizing the temporal boundaries and classification scores based on the dependencies between candidates, thus improving the accuracy of temporal action localization. In Table 7, the performance comparison of the above temporal action localization models on the ActivityNet-1.3 dataset [77] is presented. It can be seen that CDC has a more conservative performance in temporal action localization, with an mAP@0.5 of 45.3%, but its computation is relatively light, making it suitable for applications with high inference speed requirements; P-GCN improves performance through candidate graph reasoning, achieving an mAP@0.5 of 48.3%, but compared to CDC, its computational complexity and inference time are higher. Both models have their advantages; CDC is suitable for real-time applications, while P-GCN is more suitable for tasks that require higher accuracy.
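A sketch of the candidate-graph idea behind P-GCN is given below: temporal proposals become nodes, edges connect proposals whose temporal IoU exceeds a threshold, and one graph convolution step exchanges context before scoring. The threshold, feature sizes, and scoring head are illustrative assumptions rather than the published P-GCN configuration.

```python
import numpy as np

def temporal_iou(p, q):
    """Intersection-over-union of two temporal segments p = (start, end), q = (start, end)."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = (p[1] - p[0]) + (q[1] - q[0]) - inter
    return inter / union if union > 0 else 0.0

def proposal_graph_scores(segments, feats, W_gcn, W_score, iou_thresh=0.3):
    """Illustrative proposal-graph reasoning for temporal action localization.

    segments: list of (start, end) times, feats: (N, F) proposal features.
    Edges link proposals with sufficient temporal overlap; one GCN step mixes
    context between proposals before the classification head scores each node.
    """
    N = len(segments)
    A = np.eye(N)
    for i in range(N):
        for j in range(i + 1, N):
            if temporal_iou(segments[i], segments[j]) > iou_thresh:
                A[i, j] = A[j, i] = 1.0                    # contextual edge between proposals
    A_hat = A / A.sum(axis=1, keepdims=True)
    H = np.maximum(A_hat @ feats @ W_gcn, 0.0)             # exchange information between proposals
    return H @ W_score                                     # refined classification scores per proposal

segments = [(0.0, 2.0), (1.5, 3.5), (8.0, 10.0)]
feats = np.random.randn(3, 16)
W_gcn = np.random.randn(16, 16)
W_score = np.random.randn(16, 5)                           # 5 hypothetical action classes
print(proposal_graph_scores(segments, feats, W_gcn, W_score).shape)   # (3, 5)
```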

3.3. Other Related Work: Cross-Media

Graph-structured data are widely present in various modalities such as images, videos, and text, playing an important role in tasks such as visual description, visual question answering, and cross-media retrieval. Effectively utilizing graph neural networks (GNNs) can significantly enhance the performance of cross-media tasks.

3.3.1. Visual Description

The task of visual description aims to automatically generate natural language descriptions of images or videos, helping visually impaired individuals understand visual content, and has gained significant attention due to its challenge in visual understanding.
Many existing methods [78], inspired by machine translation, attempt to “translate” image information into natural language. These methods typically use convolutional neural networks (CNNs) or region-based CNNs (R-CNNs) to encode the image, followed by a recurrent neural network (RNN) decoder to generate the description. However, the interrelationships and interactions between objects in an image are a natural foundation for descriptions, and how to effectively utilize these visual relationships remains an underexplored issue.
To address this, Yao et al. [79] proposed a graph convolutional network–long short-term memory (GCN-LSTM) framework to enhance image description by exploring visual relationships. This framework starts by modeling the interactions between objects and regions, enriching region-level feature representations, and passing them to the sentence decoder. Figure 4 shows how this method learns semantic and spatial relationships through graph convolution layers, ultimately generating natural language descriptions.
Subsequently, Yang et al. [80] proposed the scene graph auto-encoder (SGAE) framework for image description tasks. This framework consists of two main steps:
(a)
Extract the scene graph from the image and encode it using graph convolutional networks (GCN) to re-represent the image content, ultimately generating a natural language description;
(b)
Incorporate the scene graph into the description generation model, combining image and language information to generate the description.
As shown in Table 8, the performance comparison of each model on standard evaluation metrics such as BLEU-4, METEOR, and CIDEr is presented based on their performance on the COCO dataset. It can be seen that SGAE performs the best across BLEU-4, METEOR, and CIDEr, particularly with a CIDEr score of 114.8, making it suitable for tasks that require high-precision description generation; GCN-LSTM follows closely behind, with slightly lower metrics but still offering high performance and lower computational complexity; global–local shows relatively weaker performance, especially on the CIDEr metric, making it more suitable for applications with higher speed requirements.

3.3.2. Visual Question Answering (VQA)

The task of visual question answering (VQA) aims to enable a system to automatically answer natural language questions about visual information. Due to the need to understand and reason about the relationship between visual and textual modalities, this task presents significant challenges. In recent years, deep learning has driven advancements in VQA technology. Early methods [81] mainly focused on jointly representing the visual and language modalities, combined with encoder–decoder architectures and attention mechanisms, achieving significant success.
However, these methods did not fully consider the graph-structured information in VQA tasks. To address this, Zhang et al. [82] proposed a new method based on knowledge graphs, utilizing scene graphs to encode information and provide structured reasoning. As shown in Figure 5, experiments demonstrate that when scene graphs are combined with graph neural networks (GNNs), VQA performance significantly improves, particularly in tasks involving object counting, existence, attributes, and multi-object relationships [83].
Another study [84] proposed the relation-aware graph attention network (ReGAT) for visual question answering, aiming to model the relationships between multiple types of objects through a question-adaptive attention mechanism. This method first uses Faster R-CNN to generate candidate object regions and embeds the question into the model using a question encoder. The convolutional features of the object regions and the bounding box features are then fed into a relation encoder, learning relation-aware region-level features relevant to the question. Finally, these relation-aware visual features, along with the question embedding, are input into a multimodal fusion module to generate a joint representation, which is then processed by an answer prediction module to generate the final answer. As shown in Table 9, the accuracy comparison of each model is presented based on their performance on the VQA v2.0 dataset [85]. From Table 9, it can be seen that ReGAT performs the best in terms of accuracy, achieving 70.2%, by effectively improving cross-modal alignment performance through the introduction of a question-adaptive graph attention mechanism; cycle-consistency has an accuracy of 68.5%, enhancing cross-modal alignment through cycle consistency constraints, but its performance is slightly lower than that of ReGAT. The innovations of both models focus on different aspects; ReGAT is more suitable for tasks that require higher accuracy, while cycle-consistency is ideal for scenarios that need efficient alignment.

3.3.3. Cross-Media Retrieval

Image–text retrieval is an important direction in cross-modal research, aiming to retrieve relevant content from multimodal databases of images and texts. The main challenge of this task lies in understanding and measuring the semantic similarity between images and texts. Traditional methods [86] typically describe images and texts through global or local features. However, these methods neglect the object relationships within images and texts, resulting in poor retrieval performance.
To address this issue, as shown in Figure 6, Yu et al. [87] proposed a dual-channel neural network model based on graph convolutional networks. This model extracts structured features from both the visual and text modalities and couples these features through a graph neural network for matching in a shared semantic space. This method effectively captures the deep relationships between images and texts, improving the accuracy of cross-modal retrieval.
Additionally, Wang et al. [88] proposed a method to retrieve objects and their relationships from both images and text, forming visual and textual scene graphs. They designed a scene graph matching (SGM) model, which uses two customized graph encoders to encode the visual and textual scene graphs into feature maps. In each graph, it learns the features of objects and relationships at both the object and relational levels. This approach allows SGM to more effectively match corresponding features between different modalities on two levels.
For comparison and summary, Table 10 presents the performance comparison of three image–text retrieval models on the Flickr30K dataset [89]. From Table 10, it can be seen that SGM performs the best on both metrics, making it suitable for high-precision cross-modal retrieval tasks. Its innovative scene graph hierarchical matching method effectively improves retrieval accuracy; dual-channel GNN introduces graph convolutional networks to couple multimodal features, achieving good results, though slightly behind SGM in performance; VSE++ shows relatively weaker performance, particularly in the Text→Image R@1 metric, making it more suitable for applications where computational efficiency is prioritized.

3.3.4. Applications in Different Fields

Neural networks are computational models that mimic the structure and function of biological neural networks, connecting a large number of artificial neurons (nodes) to process complex data and recognize patterns. In the field of image classification, Alhatemi et al. [91] designed a multiple pre-trained deep learning approach that combines ensemble learning with several classifiers for stroke classification. Zheng et al. [92] proposed a broad sparse fine-grained image classification model based on a dictionary selection strategy. Chen et al. [93] proposed a novel dual-scale complementary spatial-spectral joint classification model to mitigate detail loss and insufficient utilization of spatial information. Chattopadhyay [94] used a machine learning-based approach that mimics diagnosing a case with the help of senior doctors. Zhao et al. [95] proposed a fuzzy broad neuroevolutionary network built via multiobjective evolutionary algorithms. Li et al. [96] proposed an approach that uses pulse signals containing emotional cues and deep learning to automatically detect the severity of stress in college students. Aher [97] proposed a political deer hunting optimization algorithm-based deep Q-network for detecting heart disease. Long et al. [98] proposed a novel principal space approximation ensemble discriminant edge least-squares regression for hyperspectral image classification. Long et al. [99] proposed a dual-model collaborative label adaptive correction BLS, based on the collaboration of KRPBLS and SKBLS, to significantly improve the robustness of BLS in label-noise environments. In the field of fault diagnosis, Yao et al. [100] designed a parallel multiscale convolutional transfer neural network for cross-domain fault diagnosis by integrating 1-D feature maps into 2-D feature maps. Li et al. [101] proposed a hybrid framework for early depression detection that integrates multiple deep learning techniques and ensemble learning. In the field of traffic, Guo et al. [102] proposed an automation paradigm that integrates controlling intent into the information processing loop through a spoken instruction-aware flight trajectory prediction framework. Deng et al. [103] proposed an autonomous path planning method for unmanned aerial vehicles in dynamic environments. Lin et al. [104] proposed an automatic speech recognition method for improving air traffic safety. Zhu et al. [105] proposed an effective and robust genetic algorithm with a hybrid multi-strategy mechanism for airport gate allocation. Huang et al. [106] proposed a novel cylindrical coordinate particle swarm optimization planner with gene targeting for obtaining feasible and accurate flight routes for multiple UAVs. In other fields, Li et al. [107] proposed an anti-poisoning-attack, decentralized, privacy-enhanced federated learning model for data sharing. Deng et al. [108] proposed a quantum differential evolutionary algorithm for high-dimensional problems. Huang et al. [109] proposed a multiple-level competitive swarm optimizer for large-scale optimization problems. In addition, some other methods have been proposed for different applications [110,111,112,113].

3.4. Frontier Issues of Graph Neural Networks in Computer Vision

This section introduces frontier issues of graph neural networks (GNNs) in computer vision, including advanced GNNs for computer vision and the broader applications of GNNs in the field.

3.4.1. Advanced Graph Neural Networks for Computer Vision

In computer vision, graph neural networks (GNNs) typically represent visual information as graph structures. Traditional methods often treat pixels, object bounding boxes, or image frames as nodes, constructing homogeneous graphs to model the relationships between them. In recent years, some novel modeling approaches have been applied to GNNs:
(a)
Person Feature Blocks: Yan et al. [114], Yang et al. [115], and Yan et al. [116] constructed spatial and temporal graphs for person re-identification (Re-ID) tasks. By horizontally dividing person feature maps into several small blocks, these blocks are treated as nodes, and graph convolutional networks (GCN) are used to model the relationships between body parts across frames.
(b)
Irregular Clustering Regions: Liu et al. [117] proposed a bipartite graph GNN for breast X-ray quality detection. Using k-nearest neighbors (kNN) forward mapping, the image feature map is divided into irregular regions, and the features of these regions are integrated as nodes. Nodes from images across different views represent geometric constraints and appearance similarities through the model.
(c)
Neural Architecture Search (NAS) Units: Lin et al. [118] introduced a graph-based neural architecture search algorithm (NAS), where operation units are represented as nodes, and GCNs are used to model the relationships between units, thus improving the efficiency and performance of network structure search.

3.4.2. Broader Applications of Graph Neural Networks in Computer Vision

(a) Point Cloud Analysis: The goal of point cloud analysis is to identify a set of points in a coordinate system, where points are represented by their coordinates and other features. Early studies, such as PointNet++ [119] and VoxelNet [120], attempted to convert point clouds into regular grids (such as images and voxels) to utilize convolutional neural networks (CNNs). Recent research [121] has adopted graph representations to preserve the irregularity of point clouds, with GCN aggregating local information in a manner similar to CNNs in image processing. Chen et al. [122] proposed a hierarchical graph network structure for 3D object detection, Lin et al. [123] introduced learnable GCN kernels and used max pooling to process the receptive field of k-nearest neighbor nodes, Xu et al. [124] proposed coverage-aware mesh queries and mesh context aggregation to accelerate 3D scene segmentation, and Shi and Rajkumar [125] designed Point-GNN for detecting multiple objects in a single sample.
In Table 11, the performance comparison of several point cloud 3D object detection models on the ScanNet v2 dataset [126] is presented. It can be seen that Point-GNN performs the best on the mAP@0.5 metric, achieving 64.7%, and effectively improves 3D object detection accuracy through an end-to-end graph neural network detection method; Grid-GCN improves performance through efficient grid context aggregation, and although its inference speed is faster, its mAP is slightly lower than that of Point-GNN; PointNet++ employs hierarchical point set feature learning, performing well in terms of accuracy, making it suitable for more complex scenarios, but with slower inference speed; and VoxelNet processes point cloud data through voxelization and 3D CNN methods, achieving lower accuracy, but with less computational overhead, making it suitable for applications with higher speed requirements.
(b) Low-Resource Learning: Low-resource learning aims to learn from minimal data or prior knowledge. Wang et al. [127] and Kampffmeyer et al. [128] used knowledge graphs to guide zero-shot learning. Garcia and Bruna [129], Liu et al. [130], and Kim et al. [131] designed similarity metrics, modeling probabilistic learning problems as label propagation or edge-labeling problems, applied to facial recognition tasks. Wang et al. [132] inferred the link probability of nodes in face subgraphs using GCNs. Yang et al. [133] proposed a similarity-graph-based candidate-detection-segmentation framework for face clustering. Zhang et al. [134] proposed a global–local GCN framework for label cleaning in face recognition.
(c) Other Scenarios: Wei et al. [135] proposed the View-GCN model for recognizing 3D shapes through projected 2D images. Wald et al. [136] extended scene graphs to 3D indoor scenes. Ulutan et al. [137] utilized GCNs to reason about interactions between people and objects. Cucurull et al. [90] predicted fashion compatibility between two items by modeling edge prediction problems. Sun et al. [138] constructed social behavior graphs and used GNNs to propagate interaction information for trajectory prediction. Zhang et al. [139] established visual–language relationship graphs to mitigate hallucination problems in video description tasks.
Graph neural networks (GNNs) have demonstrated enormous potential in the field of computer vision but still face challenges related to interpretability. Due to the high connectivity and complexity of nodes and edges, the interpretability issue of GNNs is particularly prominent in decision-making tasks, such as medical diagnosis. Therefore, improving the interpretability and robustness of GNNs in visual tasks has become a key issue that needs to be addressed.

4. Challenges and Future Directions

Although graph neural networks (GNNs) have made significant progress in the field of computer vision, they still face numerous challenges that need to be addressed. Firstly, issues related to computational complexity and scalability limit the application of GNNs to large-scale image and video data. The traditional message passing mechanism leads to exponential growth in computational costs as the number of nodes V and edges E in the graph increases, particularly when handling large-scale video data, where the computational cost is often prohibitive. Although low-rank approximation methods, such as first-order approximated convolution kernels, significantly reduce the computational complexity of GCNs [140], existing algorithms still face significant memory bottlenecks when processing large-scale 3D point clouds or long temporal video data. Recent research by Zhou et al. [141] shows that through neural architecture search, redundant computations can be reduced by more than 80%, providing a new approach for developing real-time GNN systems.
When processing cross-modal spatiotemporal data, temporal graph networks (TGN) [31] achieve short-term temporal dependency modeling through a memory mechanism. However, in long temporal tasks (such as U-Action video analysis), the limitations of the memory mechanism lead to a decrease in action recognition accuracy [142]. Notably, Graphormer [33] introduced spatiotemporal positional encoding, achieving a 5.2% performance improvement on the OGB molecular dataset, demonstrating the potential of the Transformer architecture in dynamic graph modeling. However, balancing computational overhead with model accuracy remains an unresolved challenge, requiring joint optimization through graph sparsification and adaptive sampling [143].
Another key issue is the fusion of heterogeneous graphs and multimodal data. Currently, most GNN models assume that nodes and edges in a graph belong to the same modality, but in cross-modal tasks (such as visual question answering and image–text retrieval), the heterogeneity of graph structures and the integration of multimodal information have not been effectively addressed. Research on cross-modal graph structures [144] suggests that the design of heterogeneous graph GNNs and cross-modal GNNs will be a crucial direction for future research, as these methods can better handle data from different modalities, enhancing GNNs’ capability in cross-modal understanding and multimodal fusion.
The lack of interpretability significantly hinders the deployment of GNNs in high-risk domains such as medical diagnosis. Although GNNExplainer [145] provides local explanations for model behavior, its global interpretability is limited by the topological heterogeneity of complex graph structures. To address this challenge, Zhang et al. [146] innovatively combined medical knowledge graphs with GNNs, using symbolic clinical rule constraints to guide the message-passing process. This approach achieved a traceability accuracy of 92.3% in breast cancer pathology image classification tasks, providing an important practical basis for the collaborative paradigm of connectionism and symbolism. In terms of model credibility, uncertainty management in open environments has become a key bottleneck. Recent research [147] suggests that building trustworthy GNNs requires multi-dimensional collaborative optimization to improve robustness, defend against topological perturbation attacks, perform probabilistic calibration to quantify predictive uncertainty, and ensure privacy protection through differential privacy to secure sensitive medical data. The system framework developed by Wang et al. [148] partially validated this, with the GOODAT system controlling the misjudgment rate in autonomous driving scenarios to below 6.3%. However, the computational latency (averaging 83 ms/frame) still fails to meet real-time requirements.
Solving these challenges requires interdisciplinary collaborative innovation. On one hand, hardware-aware algorithm designs (such as GPU-accelerated graph partitioning [149]) can break computational bottlenecks. On the other hand, cognitive science-inspired explanation frameworks (such as visual concept extraction [150]) can enhance model transparency. Of particular interest is the fusion architecture of large language models with GNNs (such as GraphLLM [151]), which is opening new paradigms for multimodal reasoning. This could reshape the approach to solving computer vision tasks. With advancements in algorithms, hardware, and cross-domain integration, GNNs are expected to show broader application potential in the field of computer vision, driving further technological innovation and industry development.

5. Conclusions

This paper reviews the latest research progress of graph neural networks (GNNs) in the field of computer vision and provides an in-depth analysis of their broad application potential in image processing, video analysis, and multimodal data fusion. GNNs, with their advantages in handling non-Euclidean data structures, can effectively capture complex spatial relationships and semantic dependencies in images, particularly excelling in tasks such as object recognition, object detection, image segmentation, and video action recognition. By introducing the message passing mechanism, GNNs can efficiently transfer information between nodes in the graph, thus enabling a more comprehensive understanding and analysis of visual data.
However, despite this significant progress, GNNs still face several challenges in computer vision. First, computational complexity and scalability constrain their application to large-scale data, and these issues are particularly prominent for high-dimensional image data and sequential video data. Second, the interpretability of GNN models has not been fully addressed; in high-risk fields such as medical diagnostics, the lack of transparent explanations of model behavior limits practical application. Furthermore, the effective fusion of heterogeneous graphs and multimodal data remains a major challenge, and integrating data from different modalities still requires further theoretical innovation and technical breakthroughs.
Future research should focus on improving the computational efficiency of GNN models, especially when processing large-scale datasets, and explore more efficient graph neural network architectures. To address the computational complexity issue, researchers could focus on optimizing the message passing mechanism, exploring low-complexity graph convolution methods, and incorporating sparsification techniques to reduce redundant computations. At the same time, enhancing the interpretability and robustness of GNN models remains a core challenge. In high-risk fields such as medical diagnostics and autonomous driving, interpretability issues limit the practical deployment of GNNs. Therefore, future research needs to enhance model transparency, develop interpretability frameworks, and ensure model stability and reliability in various environments. With the advent of new algorithms, advances in hardware acceleration, and cross-domain integration, GNNs are expected to play a greater role in practical applications in computer vision and other fields, driving continuous technological innovation and industry development.
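As one concrete, hedged example of the low-complexity graph convolution direction mentioned above, the following sketch precomputes K steps of feature propagation once, after which training reduces to fitting a plain linear classifier; this mirrors the spirit of simplified graph convolutions in general rather than any specific method surveyed here, and all names and values are illustrative.

```python
import torch
import torch.nn as nn

def precompute_propagation(x, adj, k=2):
    # Row-normalized propagation applied k times, computed once offline;
    # the simple mean-aggregation normalization here is an illustrative choice.
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    a_norm = adj / deg
    for _ in range(k):
        x = a_norm @ x
    return x

# After precomputation, each training step costs no more than logistic regression.
x = torch.randn(500, 64)                              # node features (e.g., region descriptors)
adj = (torch.rand(500, 500) < 0.01).float()           # toy adjacency for illustration
x_prop = precompute_propagation(x, adj, k=2)
classifier = nn.Linear(64, 10)
logits = classifier(x_prop)                           # standard cross-entropy training follows
```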

Author Contributions

Conceptualization, Z.J. and Y.W.; methodology, Z.J.; software, Y.W.; validation, X.G.; formal analysis, L.Y.; investigation, C.W.; resources, X.G.; data curation, H.C.; writing—original draft preparation, Z.J.; writing—review and editing, B.L. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Henan Provincial Science and Technology Key Research Project (242102211009) and Henan Provincial Key Scientific Research Project for Higher Education Institutions (24A520049).

Conflicts of Interest

The authors declare that they have no known competing financial interests.

References

  1. DiFilippo, N.M.; Jouaneh, M.K.; Jedson, A.D. Optimizing Automated Detection of Cross-Recessed Screws in Laptops Using a Neural Network. Appl. Sci. 2024, 14, 6301. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Cui, P.; Zhu, W. Deep learning on graphs: A survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 249–270. [Google Scholar] [CrossRef]
  3. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24. [Google Scholar] [CrossRef]
  4. Li, A.; Xu, Z.; Li, W.; Chen, Y.; Pan, Y. Urban Signalized Intersection Traffic State Prediction: A Spatial-Temporal Graph Model Integrating the Cell Transmission Model and Transformer. SSRN. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5189471 (accessed on 10 April 2025).
  5. Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; Gan, C. Graph convolutional module for temporal action localization in videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2028–2040. [Google Scholar] [CrossRef] [PubMed]
  6. Li, X.; Zhang, X.; Liu, Y. Towards efficient scene understanding via squeeze reasoning. IEEE Trans. Image Process. 2021, 30, 7050–7063. [Google Scholar] [CrossRef] [PubMed]
  7. Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. Artif. Intell. 2020, 293, 103468. [Google Scholar] [CrossRef]
  8. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; IEEE: Piscataway, NJ, USA; Volume 2, pp. 729–734. [Google Scholar] [CrossRef]
  9. Dai, H.; Kozareva, Z.; Dai, B.; Smola, A.; Song, L. Learning steady-states of iterative algorithms over graphs. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1106–1114. Available online: https://proceedings.mlr.press/v80/dai18a.html (accessed on 10 April 2025).
  10. Xie, Y.; Yao, C.; Gong, M.; Chen, C.; Qin, A.K. Graph convolutional networks with multi-level coarsening for graph classification. Knowl.-Based Syst. 2020, 194, 105578. [Google Scholar] [CrossRef]
  11. Ma, Y.; Wang, S.; Aggarwal, C.C.; Tang, J. Graph convolutional networks with eigen pooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 723–731. [Google Scholar] [CrossRef]
  12. Ran, X.J.; Suyaroj, N.; Tepsan, W.; Ma, J.H.; Zhou, X.B.; Deng, W. A hybrid genetic-fuzzy ant colony optimization algorithm for automatic K-means clustering in urban global positioning system. Eng. Appl. Artif. Intell. 2024, 137, 109237. [Google Scholar] [CrossRef]
  13. Li, Q.; Han, Z.; Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  14. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef]
  15. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014; Available online: https://openreview.net/forum?id=DQNsQf-UsoDBa (accessed on 10 April 2025).
  16. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1024–1034. [Google Scholar]
  17. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; Available online: https://openreview.net/forum?id=rJXMpikCZ (accessed on 10 April 2025).
  18. Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; Yeung, D.Y. GAAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv 2018, arXiv:1803.07294. [Google Scholar]
  19. Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 974–983. [Google Scholar] [CrossRef]
  20. Ding, Y.; Zhang, Z.; Hu, H.; He, F.; Cheng, S.; Zhang, Y. Graph sample and aggregate-attention network for hyperspectral image classification. In Graph Neural Network for Feature Extraction and Classification of Hyperspectral Remote Sensing Images; Springer: Singapore, 2021; pp. 29–41. Available online: https://link.springer.com/chapter/10.1007/978-981-97-8009-9_2 (accessed on 10 April 2025).
  21. Liu, Z.; Zhou, J. Graph attention networks. In Introduction to Graph Neural Networks; Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool; Springer: Berlin/Heidelberg, Germany, 2022; pp. 39–41. [Google Scholar]
  22. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; van den Berg, R.; Titov, I.; Welling, M. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference, Heraklion, Crete, Greece, 3–7 June 2018; Springer: Cham, Switzerland; pp. 1–15. [Google Scholar] [CrossRef]
  23. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163. [Google Scholar]
  24. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852. [Google Scholar]
  25. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2017, arXiv:1609.02907. [Google Scholar]
  26. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2016, arXiv:1511.05493. [Google Scholar]
  27. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1263–1272. [Google Scholar]
  28. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  29. Poli, M.; Massaroli, S.; Park, J.; Yamashita, A.; Asama, H.; Park, J. Graph neural ordinary differential equations. arXiv 2020, arXiv:2006.10637. [Google Scholar]
  30. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph Transformer Networks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Available online: https://proceedings.neurips.cc/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf (accessed on 10 April 2025).
  31. Rossi, E.; Chamberlain, B.; Frasca, F.; Eynard, D.; Monti, F.; Bronstein, M. Temporal graph networks for deep learning on dynamic graphs. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  32. Hu, Y.; You, H.; Wang, Z.; Wang, Z.; Zhou, E.; Gao, Y. Graph-MLP: Node classification without message passing in graph. arXiv 2021, arXiv:2106.04051. [Google Scholar]
  33. Li, J.; Deng, W.; Dang, X.J.; Zhao, H.M. Fault diagnosis with maximum classifier discrepancy and deep feature alignment for cross-domain adaptation under variable working conditions. IEEE Trans. Reliab. 2025, in press. [Google Scholar]
  34. Cantürk, S.; Liu, R.; Lapointe-Gagné, O.; Létourneau, V.; Wolf, G.; Beaini, D.; Rampášek, L. Graph positional and structural encoder. arXiv 2023, arXiv:2307.07107. [Google Scholar] [CrossRef]
  35. Zhuang, Y.; Jain, R.; Gao, W.; Ren, L.; Aizawa, K. Panel: Cross-media intelligence. In Proceedings of the Web Conference 2021, ACM, WWW ’21, 25th ACM International Conference on Multimedia, Montreal, QC, Canada, 11–15 April 2016; p. 1173. [Google Scholar]
  36. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  37. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  38. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  39. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  40. Xu, H.; Jiang, C.; Liang, X.; Li, Z. Spatial-aware graph relation network for large-scale object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9290–9299. [Google Scholar] [CrossRef]
  41. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; Volume 2018, pp. 3588–3597. [Google Scholar] [CrossRef]
  42. Chi, C.; Wei, F.; Hu, H. RelationNet++: Bridging visual representations for object detection via transformer decoder. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 13564–13574. [Google Scholar]
  43. Li, Z.; Du, X.; Cao, Y. GAR: Graph assisted reasoning for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1295–1304. [Google Scholar] [CrossRef]
  44. Zhao, G.; Ge, W.; Yu, Y. GraphFPN: Graph feature pyramid network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2743–2752. [Google Scholar] [CrossRef]
  45. Lin, T.Y.; Ma, L.; Belongie, S. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 1–20. [Google Scholar]
  46. Chen, C.; Li, J.; Zheng, Z.; Huang, Y.; Ding, X.; Yu, Y. Dual bipartite graph learning: A general approach for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2683–2692. [Google Scholar] [CrossRef]
  47. Li, W.; Liu, X.; Yuan, Y. SIGMA: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5281–5290. [Google Scholar] [CrossRef]
  48. Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic relation reasoning for shot-stable few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8778–8787. [Google Scholar] [CrossRef]
  49. Chen, C.; Li, J.; Zhou, H.-Y.; Han, X.; Huang, Y.; Ding, X.; Yu, Y. Relation matters: Foreground-aware graph-based relational reasoning for domain adaptive object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3677–3694. [Google Scholar] [CrossRef]
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  51. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
  52. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. In International Conference on Learning Representations. 2018. Available online: https://openreview.net/forum?id=BJj6qGbRW (accessed on 10 April 2025).
  53. Zhang, L.; Li, X.; Arnab, A.; Yang, K.; Tong, Y.; Torr, P.H.S. Dual graph convolutional network for semantic segmentation. arXiv 2020, arXiv:1909.06121. [Google Scholar]
  54. Chen, Y.; Rohrbach, M.; Yan, Z.; Shuicheng, Y.; Feng, J.; Kalantidis, Y. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 433–442. [Google Scholar] [CrossRef]
  55. Zhang, L.; Xu, D.; Arnab, A.; Torr, P.H. Dynamic graph message passing networks. arXiv 2019, arXiv:1908.06955. [Google Scholar]
  56. Yu, C.; Liu, Y.; Gao, C.; Shen, C.; Sang, N. Representative graph neural network. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 379–396. [Google Scholar] [CrossRef]
  57. Li, X.; Yang, Y.; Zhao, Q.; Shen, T.; Lin, Z.; Liu, H. Spatial pyramid based graph reasoning for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8950–8959. [Google Scholar] [CrossRef]
  58. Hu, H.; Ji, D.; Gan, W.; Bai, S.; Wu, W.; Yan, J. Class-wise dynamic graph convolution for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–17. [Google Scholar] [CrossRef]
  59. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  60. Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9586–9594. [Google Scholar] [CrossRef]
  61. Xie, G.-S.; Liu, J.; Xiong, H.; Shao, L. Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 5471–5480. [Google Scholar] [CrossRef]
  62. Zhang, B.; Xiao, J.; Jiao, J.; Wei, Y.; Zhao, Y. Affinity Attention Graph Neural Network for Weakly Supervised Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3000–3015. [Google Scholar] [CrossRef] [PubMed]
  63. Wu, Y.; Zhang, G.; Gao, Y.; Deng, X.; Gong, K.; Liang, X.; Lin, L. Bidirectional graph reasoning network for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7794–7803. [Google Scholar] [CrossRef]
  64. Hsu, P.M. One-Shot Object Detection Using Multi-Functional Attention Mechanism. Master’s Thesis, National Taiwan University, Taipei Taiwan, China, 2020. [Google Scholar] [CrossRef]
  65. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  66. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar] [CrossRef]
  67. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
  68. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 4489–4497. [Google Scholar] [CrossRef]
  69. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef]
  70. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar] [CrossRef]
  71. Herzig, R.; Levi, E.; Xu, H.; Gao, H.; Brosh, E.; Wang, X.; Globerson, A.; Darrell, T. Spatio-temporal networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2347–2356. [Google Scholar] [CrossRef]
  72. Wang, X.; Gupta, A. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 399–417. [Google Scholar] [CrossRef]
  73. Ou, Y.; Mi, L.; Chen, Z. Object-relation reasoning graph for action recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 20101–20110. [Google Scholar] [CrossRef]
  74. Zhang, J.; Shen, F.; Xu, X.; Shen, H.T. Temporal reasoning graph for activity recognition. IEEE Trans. Image Process 2020, 29, 5491–5506. [Google Scholar] [CrossRef]
  75. Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 21–26 June 2017; pp. 5734–5743. [Google Scholar]
  76. Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; Gan, C. Graph convolutional networks for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7094–7103. Available online: https://openaccess.thecvf.com/content_ICCV_2019/html/Zeng_Graph_Convolutional_Networks_for_Temporal_Action_Localization_ICCV_2019_paper.html (accessed on 10 April 2025).
  77. He, X.; Sun, Z. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 961–970. [Google Scholar] [CrossRef]
  78. Li, L.; Tang, S.; Deng, L.; Zhang, Y.; Tian, Q. Image caption with global-local attention. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/11236 (accessed on 10 April 2025).
  79. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Computer Vision–ECCV 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11218, pp. 711–727. [Google Scholar] [CrossRef]
  80. Yang, X.; Tang, K.; Zhang, H.; Cai, J. Autoencoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10685–10694. Available online: https://openaccess.thecvf.com/content_CVPR_2019/html/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.html (accessed on 10 April 2025).
  81. Shah, M.; Chen, X.; Rohrbach, M.; Parikh, D. Cycle-consistency for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6642–6651. [Google Scholar] [CrossRef]
  82. Zhang, C.; Chao, W.L.; Xuan, D. An empirical study on leveraging scene graphs for visual question answering. arXiv 2019, arXiv:1907.12133. [Google Scholar]
  83. Teney, D.; Liu, L.; van den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3239–3247. [Google Scholar] [CrossRef]
  84. Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10312–10321. [Google Scholar] [CrossRef]
  85. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. arXiv 2016, arXiv:1612.00837. [Google Scholar]
  86. Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612. [Google Scholar]
  87. Yu, J.; Lu, Y.; Qin, Z.; Zhang, W.; Liu, Y.; Tan, J.; Guo, L. Modeling text with graph convolutional network for cross-modal information retrieval. In Advances in Multimedia Information Processing–PCM 2018; Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11164, pp. 223–234. [Google Scholar] [CrossRef]
  88. Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1497–1506. [Google Scholar] [CrossRef]
  89. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
  90. Cucurull, G.; Taslakian, P.; Vazquez, D. Context-aware visual compatibility prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12609–12618. [Google Scholar] [CrossRef]
  91. Alhatemi, R.A.J.; Savaş, S. A weighted ensemble approach with multiple pre-trained deep learning models for classification of stroke. Medinformatics 2023, 2023, 10–19. [Google Scholar] [CrossRef]
  92. Zheng, J.; Liang, P.; Zhao, H.; Deng, W. A broad sparse fine-grained image classification model based on dictionary selection strategy. IEEE Trans. Reliab. 2024, 73, 576–588. [Google Scholar] [CrossRef]
  93. Chen, H.; Sun, Y.; Li, X.; Zheng, B.; Chen, T. Dual-Scale Complementary Spatial-Spectral Joint Model for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6772–6789. [Google Scholar] [CrossRef]
  94. Chattopadhyay, S. Decoding Medical Diagnosis with Machine Learning Classifiers. Medinformatics 2024. Online First. [Google Scholar] [CrossRef]
  95. Zhao, H.; Wu, Y.; Deng, W. Fuzzy Broad Neuroevolution Networks via Multiobjective Evolutionary Algorithms: Balancing Structural Simplification and Performance. IEEE Trans. Instrum. Meas. 2025, 74, 2505910. [Google Scholar] [CrossRef]
  96. Li, M.; Li, J.; Chen, Y.; Hu, B. Stress Severity Detection in College Students Using Emotional Pulse Signals and Deep Learning. IEEE Trans. Affect. Comput. 2025. early access. [Google Scholar] [CrossRef]
  97. Aher, C.N. Enhancing Heart Disease Detection Using Political Deer Hunting Optimization-Based Deep Q-Network with High Accuracy and Sensitivity. Medinformatics 2023. Online First. [Google Scholar] [CrossRef]
  98. Long, H.; Chen, T.; Chen, H.; Zhou, X.; Deng, W. Principal space approximation ensemble discriminative marginalized least-squares regression for hyperspectral image classification. Eng. Appl. Artif. Intell. 2024, 133, 108031. [Google Scholar] [CrossRef]
  99. Deng, W.; Shen, J.; Ding, J.; Zhao, H. Robust Dual-Model Collaborative Broad Learning System for Classification Under Label Noise Environments. IEEE Internet Things J. 2025. early access. [Google Scholar] [CrossRef]
  100. Yao, R.; Zhao, H.; Zhao, Z.; Guo, C.; Deng, W. Parallel convolutional transfer network for bearing fault diagnosis under varying operation states. IEEE Trans. Instrum. Meas. 2024, 73, 3540713. [Google Scholar] [CrossRef]
  101. Li, M.; Chen, Y.; Lu, Z.; Ding, F.; Hu, B. ADED: Method and Device for Automatically Detecting Early Depression Using Multimodal Physiological Signals Evoked and Perceived via Various Emotional Scenes in Virtual Reality. IEEE Trans. Instrum. Meas. 2025, 74, 2524016. [Google Scholar] [CrossRef]
  102. Guo, D.; Zhang, Z.; Yang, B.; Zhang, J.; Yang, H.; Lin, Y. Integrating spoken instructions into flight trajectory prediction to optimize automation in air traffic control. Nat. Commun. 2024, 15, 9662. [Google Scholar] [CrossRef] [PubMed]
  103. Deng, W.; Feng, J.; Zhao, H. Autonomous Path Planning via Sand Cat Swarm Optimization with Multi-Strategy Mechanism for Un-manned Aerial Vehicles in Dynamic Environment. IEEE Internet Things J. 2025. early access. [Google Scholar] [CrossRef]
  104. Lin, Y.; Ruan, M.; Cai, K.; Li, D.; Zeng, Z.; Li, F.; Yang, B. Identifying and managing risks of AI-driven operations: A case study of auto-matic speech recognition for improving air traffic safety. Chin. J. Aeronaut. 2023, 36, 366–386. [Google Scholar] [CrossRef]
  105. Zhu, Z.; Li, X.; Chen, H.; Zhou, X.; Deng, W. An effective and robust genetic algorithm with hybrid multi-strategy and mechanism for airport gate allocation. Inf. Sci. 2024, 654, 119892. [Google Scholar] [CrossRef]
  106. Huang, C.; Ma, H.; Zhou, X.; Deng, W. Cooperative Path Planning of Multiple Unmanned Aerial Vehicles Using Cylinder Vector Particle Swarm Optimization with Gene Targeting. IEEE Sens. J. 2025, 25, 8470–8480. [Google Scholar] [CrossRef]
  107. Li, X.; Zhao, H.; Xu, J.; Zhu, G.; Deng, W. APDPFL: Anti-Poisoning Attack Decentralized Privacy Enhanced Federated Learning Scheme for Flight Operation Data Sharing. IEEE Trans. Wirel. Commun. 2024, 23, 19098–19109. [Google Scholar] [CrossRef]
  108. Deng, W.; Wang, J.; Guo, A.; Zhao, H. Quantum differential evolutionary algorithm with quantum-adaptive mutation strategy and population state evaluation framework for high-dimensional problems. Inf. Sci. 2024, 676, 120787. [Google Scholar] [CrossRef]
  109. Huang, C.; Song, Y.; Ma, H.; Zhou, X.; Deng, W. A multiple level competitive swarm optimizer based on dual evaluation criteria and global optimization for large-scale optimization problem. Inf. Sci. 2025, 708, 122068. [Google Scholar] [CrossRef]
  110. Chen, Y.; Ding, Y.; Hu, Z.-Z.; Ren, Z. Geometrized Task Scheduling and Adaptive Resource Allocation for Large-Scale Edge Computing in Smart Cities. IEEE Internet Things J. 2025. early access. [Google Scholar] [CrossRef]
  111. Huang, C.; Wu, D.; Zhou, X.; Song, Y.; Chen, H.; Deng, W. Competitive swarm optimizer with dynamic multi-competitions and convergence accelerator for large-scale optimization problems. Appl. Soft Comput. 2024, 167, 112252. [Google Scholar] [CrossRef]
  112. Ma, C.; Zhang, T.; Jiang, Z.; Ren, Z. Dynamic analysis of lowering operations during floating offshore wind turbine assembly mating. Renew. Energy 2025, 243, 122528. [Google Scholar] [CrossRef]
  113. Deng, W.; Li, X.; Xu, J.; Li, W.; Zhu, G.; Zhao, H. BFKD: Blockchain-Based Federated Knowledge Distillation for Aviation Internet of Things. IEEE Trans. Reliab. 2024. early access. [Google Scholar] [CrossRef]
  114. Yan, Y.; Zhang, Q.; Ni, B.; Zhang, W.; Xu, M.; Yang, X. Learning context graph for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2153–2162. [Google Scholar] [CrossRef]
  115. Yang, J.; Zheng, W.-S.; Yang, Q.; Chen, Y.-C.; Tian, Q. Spatial-temporal graph convolutional network for video-based person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3286–3296. [Google Scholar] [CrossRef]
  116. Yan, Y.; Zhang, Q.; Ni, B.; Zhang, W.; Xu, M.; Yang, X. Learning multi-granular hypergraphs for video-based person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2896–2905. [Google Scholar] [CrossRef]
  117. Liu, Y.; Zhang, F.; Zhang, Q.; Wang, S.; Wang, Y.; Yu, Y. Cross-view correspondence reasoning based on bipartite graph convolutional network for mammogram mass detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3811–3821. [Google Scholar] [CrossRef]
  118. Lin, P.; Sun, P.; Cheng, G.; Xie, S.; Li, X.; Shi, J. Graph-guided architecture search for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4202–4211. [Google Scholar] [CrossRef]
  119. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
  120. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  121. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  122. Chen, J.; Lei, B.; Song, Q.; Ying, H.; Chen, D.Z.; Wu, J. A hierarchical graph network for 3D object detection on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 389–398. [Google Scholar] [CrossRef]
  123. Lin, Z.-H.; Huang, S.-Y.; Wang, Y.-C.F. Convolution in the cloud: Learning deformable kernels in 3D graph convolution networks for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1797–1806. [Google Scholar] [CrossRef]
  124. Xu, Q.; Sun, X.; Wu, C.-Y.; Wang, P.; Neumann, U. Grid-GCN for fast and scalable point cloud learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5660–5669. [Google Scholar] [CrossRef]
  125. Shi, W.; Rajkumar, R. Point-GNN: Graph neural network for 3D object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1708–1716. [Google Scholar] [CrossRef]
  126. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar] [CrossRef]
  127. Wang, X.; Ye, Y.; Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6857–6866. [Google Scholar] [CrossRef]
  128. Kampffmeyer, M.; Chen, Y.; Liang, X.; Wang, H.; Zhang, Y.; Xing, E.P. Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11479–11488. [Google Scholar] [CrossRef]
  129. Garcia, V.; Bruna, J. Few-shot learning with graph neural networks. arXiv 2017, arXiv:1711.04043. [Google Scholar]
  130. Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.J.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002. [Google Scholar]
  131. Kim, J.; Kim, T.; Kim, S.; Yoo, C.D. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11–20. [Google Scholar] [CrossRef]
  132. Wang, Z.; Zheng, L.; Li, Y.; Wang, S. Linkage based face clustering via graph convolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1117–1125. [Google Scholar] [CrossRef]
  133. Yang, L.; Zhan, X.; Chen, D.; Yan, J.; Loy, C.C.; Lin, D. Learning to cluster faces on an affinity graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2293–2301. [Google Scholar] [CrossRef]
  134. Zhang, N.; Deng, S.; Li, J.; Chen, X.; Zhang, W.; Chen, H. Summarizing Chinese medical answer with graph convolution networks and question-focused dual attention. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 15–24. [Google Scholar] [CrossRef]
  135. Wei, X.; Yu, R.; Sun, J. View-GCN: View-based graph convolutional network for 3D shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1847–1856. [Google Scholar] [CrossRef]
  136. Wald, J.; Dhamo, H.; Navab, N.; Tombari, F. Learning 3D semantic scene graphs from 3D indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3960–3969. [Google Scholar] [CrossRef]
  137. Ulutan, O.; Iftekhar, A.S.M.; Manjunath, B.S. VSGNet: Spatial attention network for detecting human-object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13614–13623. [Google Scholar] [CrossRef]
  138. Sun, J.; Jiang, Q.; Lu, C. Recursive social behavior graph for trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 657–666. [Google Scholar] [CrossRef]
  139. Zhang, W.; Wang, X.E.; Tang, S.; Shi, H.; Shi, H.; Xiao, J.; Zhuang, Y.; Wang, W.Y. Relational graph learning for grounded video description generation. In Proceedings of the 28th ACM International Conference on Multimedia (MM ‘20), Seattle, WA, USA, 12–16 October 2020; pp. 3807–3828. [Google Scholar] [CrossRef]
  140. Chen, J.; Ma, T.; Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv 2018, arXiv:1801.10247. [Google Scholar] [CrossRef]
  141. Zhou, K.; Huang, X.; Song, Q.; Chen, R.; Hu, X. Auto-GNN: Neural architecture search of graph neural networks. Front. Big Data 2022, 5, 1029307. [Google Scholar] [CrossRef]
  142. Dwivedi, V.P.; Rampášek, L.; Galkin, M.; Parviz, A.; Wolf, G.; Luu, A.T.; Beaini, D. Long range graph benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 22326–22340. [Google Scholar]
  143. Srinivasa, R.S.; Xiao, C.; Glass, L.; Romberg, J.; Sun, J. Fast graph attention networks using effective resistance based graph sparsification. arXiv 2020, arXiv:2006.08796. [Google Scholar] [CrossRef]
  144. Tang, J.; Yang, Y.; Wei, W.; Shi, L.; Xia, L.; Yin, D.; Huang, C. HiGPT: Heterogeneous graph language model. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 2842–2853. [Google Scholar] [CrossRef]
  145. Ying, Z.; Bourgeois, D.; You, J.; Zitnik, M.; Leskovec, J. GNNExplainer: Generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 9240–9251. [Google Scholar]
  146. Zhang, L.; Zhang, Z.; Song, R.; Cheng, J.; Liu, Y.; Chen, X. Knowledge-infused graph neural networks for medical diagnosis. Sci. Rep. 2021, 11, 53042. [Google Scholar]
  147. Zhang, H.; Wu, B.; Yuan, X.; Pan, S.; Tong, H.; Pei, J. Trustworthy graph neural networks: Aspects, methods, and trends. Proc. IEEE 2024, 112, 97–139. [Google Scholar] [CrossRef]
  148. Wang, X.; Wu, Y.; Zhang, A.; Wang, H. GOODAT: Graph out-of-distribution anomaly detection for autonomous driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11234–11247. [Google Scholar]
  149. Zhu, Z.; Wang, P.; Hu, Q.; Li, G.; Liang, X.; Cheng, J. FastGL: A GPU-efficient framework for accelerating sampling-based GNN training at large scale. arXiv 2023, arXiv:2409.14939. [Google Scholar]
  150. Magister, L.C.; Kazhdan, D.; Singh, V.; Liò, P. GCExplainer: Human-in-the-loop concept-based explanations for graph neural networks. arXiv 2021, arXiv:2107.11889. [Google Scholar] [CrossRef]
  151. Chai, Z.; Zhang, T.; Wu, L.; Han, K.; Hu, X.; Huang, X.; Yang, Y. GraphLLM: Boosting graph reasoning ability of large language model. arXiv 2023, arXiv:2310.05845. [Google Scholar] [CrossRef]
Figure 1. The development of graph neural networks. References: [8,15,16,17,24,25,26,27,28,29,30,31,32,33,34].
Figure 2. Structure of SGRN.
Figure 3. GNN-based video action recognition model.
Figure 4. GCN–LSTM framework.
Figure 5. GNN-based visual question answering.
Figure 6. Dual-channel GNN for image–text retrieval.
Table 1. Symbols and their meanings.
Symbol | Meaning
v | Current node
u | Neighboring node
N(v) | Neighbor set of node v
H_v^(l) | Feature representation of node v at layer l
H_u^(l) | Feature representation of neighboring node u at layer l
W^(l) | Learnable weight matrix at layer l
σ | Activation function (e.g., ReLU)
m_v^(l) | Message received by node v at layer l
H_v^(l+1) | Feature representation of node v at layer l + 1
Table 3. Object detection model comparison.
Method | mAP (%) | Inference Speed (FPS) | Memory Consumption (MB)
Faster R-CNN [37] | 36.2 | 5 | 200
YOLO [38] | 57.9 | 40+ | 250
Mask R-CNN [39] | 37.1 | 10 | 300
SGRN [40] | 38.3 | 15 | 320
RelationNet++ [42] | 45.2 | 12 | 350
GraphFPN [44] | 46.2 | 13 | 370
Table 4. Performance comparison of GNN-based semantic segmentation models.
Model | mIoU (%) | Parameter Count (M) | Key Innovation
Dual GCN [51] | 78.5 | 28.1 | Spatial–Channel Dual Graph Modeling
GRN [52] | 76.2 | 45.3 | Global Relationship Reasoning Unit
CDGC [56] | 80.1 | 29.5 | Category Dynamic Graph Convolution
Table 5. Few-shot segmentation model comparison.
Model | mIoU (%) | Key Innovation
PGAM [57] | 56.3 | Pyramid Graph Attention
Scale-Aware GNN [58] | 61.7 | Cross-Scale Node Collaboration
Table 6. Comparison of GNN-based video action recognition models.
Model | Top-1 Acc (%) | Parameter Count (M) | Key Innovation
Region Graphs [72] | 76.3 | 23.5 | Spatio-Temporal Region Graph Reasoning
Object-Relation [73] | 78.1 | 32.7 | Participant-Centric Object-Level Graph
Temporal Reasoning [74] | 77.5 | 28.4 | Multi-Scale Temporal Graph Reasoning
Table 7. Comparison of core temporal action localization models.
Model | mAP@0.5 (%) | mAP@0.75 (%) | Key Innovation
CDC [75] | 45.3 | 26.1 | Deconvolutional Temporal Localization
P-GCN [76] | 48.3 | 27.5 | Candidate Graph Reasoning
Table 8. Comparison of visual description models’ performance.
Model | BLEU-4 | METEOR | CIDEr
Global–Local [78] | 36.5 | 27.7 | 108.2
GCN-LSTM [79] | 38.2 | 28.3 | 112.5
SGAE [80] | 39.1 | 28.9 | 114.8
Table 9. Comparison of visual question answering models’ performance.
Model | Accuracy (%) | Key Innovation
Cycle-Consistency [81] | 68.5 | Cycle Consistency Constraint for Enhanced Cross-Modal Alignment
ReGAT [84] | 70.2 | Question-Adaptive Graph Attention Mechanism
Table 10. Comparison of image–text retrieval models’ performance.
Model | Image→Text R@1 | Text→Image R@1 | Key Innovation
VSE++ [87] | 52.9 | 39.6 | Hard Negative Sample Mining for Optimized Embedding Space
Dual-Channel GNN [89] | 58.3 | 42.7 | Graph Convolutional Network Coupling Multimodal Features
SGM [90] | 62.1 | 46.9 | Scene Graph Hierarchical Matching
Table 11. Comparison of point cloud 3D object detection performance.
Model | mAP@0.5 (%) | Speed (FPS) | Key Innovation
VoxelNet [120] | 53.1 | 4.2 | Voxelization + 3D CNN
PointNet++ [119] | 58.3 | 12.4 | Hierarchical Point Set Feature Learning
Point-GNN [125] | 64.7 | 8.6 | End-to-End Graph Neural Network Detection
Grid-GCN [124] | 63.9 | 15.3 | Efficient Grid Context Aggregation