Techniques and Challenges of Image Segmentation: A Review

Abstract: Image segmentation, which has become a research hotspot in the field of image processing and computer vision, refers to the process of dividing an image into meaningful and non-overlapping regions; it is an essential step in natural scene understanding. Despite decades of effort and many achievements, there are still challenges in feature extraction and model design. In this paper, we systematically review the advancement of image segmentation methods. According to the segmentation principles and image data characteristics, we mainly review three important stages of image segmentation: classic segmentation, collaborative segmentation, and semantic segmentation based on deep learning. We elaborate on the main algorithms and key techniques in each stage, compare and summarize the advantages and defects of different segmentation models, and discuss their applicability. Finally, we analyze the main challenges and development trends of image segmentation techniques.


Introduction
Image segmentation is one of the most popular research fields in computer vision and forms the basis of pattern recognition and image understanding. The development of image segmentation techniques is closely related to many disciplines and fields, e.g., autonomous vehicles [1], intelligent medical technology [2,3], image search engines [4], industrial inspection, and augmented reality.
Image segmentation divides images into regions with different features and extracts the regions of interest (ROIs). These regions, according to human visual perception, are meaningful and non-overlapping. There are two difficulties in image segmentation: (1) how to define "meaningful regions": the uncertainty of visual perception and the diversity of human comprehension mean that there is no clear definition of objects, which makes image segmentation an ill-posed problem; and (2) how to effectively represent the objects in an image. Digital images are made up of pixels, which can be grouped into larger sets based on their color, texture, and other information. These sets are referred to as "pixel sets" or "superpixels". Such low-level features reflect the local attributes of an image, but it is difficult to obtain global information (e.g., shape and position) from these local attributes alone.
Since the 1970s, image segmentation has received continuous attention from computer vision researchers. The classic segmentation methods mainly focus on highlighting and extracting the information contained in a single image, which often requires professional knowledge and human intervention, and it is difficult for them to obtain high-level semantic information from images. Co-segmentation methods identify common objects from a set of images, which requires the acquisition of certain prior knowledge. Since image annotation is dispensable for these methods, they are classed as semi-supervised or weakly supervised methods. With the growth of large-scale, finely annotated image datasets, image segmentation methods based on deep neural networks have gradually become a popular topic.
Although many achievements have been made in image segmentation research, there are still many challenges, e.g., feature representation, model design, and optimization. In particular, semantic segmentation remains challenging due to limited or sparse annotations, class imbalance, overfitting, long training times, and vanishing gradients. The authors of [5][6][7] introduced semantic segmentation methods and commonly used datasets, and [8] analyzed the evaluation metrics and methods of semantic segmentation, but existing reviews have not yet sorted and summarized image segmentation algorithms from the perspective of how the technology has evolved and developed to the present day. Therefore, it is necessary to systematically summarize the existing segmentation methods, especially the state-of-the-art ones. We analyze and reclassify the existing image segmentation methods from the perspective of algorithm development, elaborate on their working mechanisms, enumerate some influential image segmentation algorithms, and systematically introduce the essential techniques of semantic segmentation based on deep neural networks, as shown in Figure 1.

Classic Segmentation Methods
The classic segmentation algorithms were proposed for grayscale images; they mainly consider gray-level similarity within the same region and gray-level discontinuity between different regions. In general, region division is based on gray-level similarity, and edge detection is based on gray-level discontinuity. Color image segmentation uses the similarity between pixels to segment the image into different regions or superpixels, and then merges these superpixels.

Edge Detection
The positions where the gray level changes sharply in an image are generally the boundaries of different regions. The task of edge detection is to identify the points on these boundaries. Edge detection is one of the earliest segmentation methods and is also called the parallel boundary technique. The derivative or differential of the gray level is used to identify the obvious changes at the boundary. In practice, the derivative of a digital image is obtained by using a difference approximation for the differential. Examples of edge detection results are presented in Figure 2.
These operators are sensitive to noise and are only suitable for images with low noise and complexity. The Canny operator performs best among the operators shown in Figure 2. It has strong denoising ability and also segments lines well, with good continuity, fineness, and straightness. However, the Canny operator is more complex and takes longer to execute. In actual industrial production, a thresholded gradient is usually used when the real-time requirement is high; conversely, the more advanced Canny operator is selected when the quality requirement is high.
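As a concrete illustration of the difference approximation discussed above, the following sketch (ours, not from the paper; the function name and threshold parameter are assumptions) estimates the gray-level derivative with Sobel difference kernels and thresholds the gradient magnitude. The kernel loop is written out explicitly for clarity rather than speed:

```python
import numpy as np

def sobel_edges(img, thresh=0.5):
    """Approximate the gray-level derivative with Sobel difference kernels
    and keep pixels whose gradient magnitude exceeds a fraction of the max."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    padded = np.pad(img.astype(float), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()   # horizontal difference
            gy[i, j] = (win * ky).sum()   # vertical difference
    mag = np.hypot(gx, gy)
    return mag >= thresh * mag.max()
```

On a synthetic image with a vertical step edge, the detector marks the two pixel columns adjacent to the discontinuity, which also shows why such boundaries are two pixels thick without further thinning.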
Although differential operators can locate the boundaries of different regions efficiently, the closure and continuity of the boundaries cannot be guaranteed due to numerous discontinuous points and lines in high-detail regions. Therefore, it is necessary to smooth the image before using a differential operator to detect edges.
Another edge detection method is the serial boundary technique, which concatenates edge points to form a closed boundary. Serial boundary techniques mainly include graph-searching algorithms and dynamic programming algorithms. In graph-searching algorithms, the points on the edges are represented by a graph structure, and the path with the minimum cost is searched in the graph to determine the closed boundaries, which is always computationally intensive. Dynamic programming algorithms utilize heuristic rules to reduce the search computation.
The active contours method approximates the actual contours of objects by matching a closed curve (i.e., the initial contour based on the gradient) with the local features of the image, and finds the closed curve with the minimum energy by minimizing the energy function to achieve image segmentation. The method is sensitive to the location of the initial contour, so the initialization must be close to the target contour. Moreover, its non-convexity easily leads to local minima, so it is difficult to converge to concave boundaries. Lankton and Tannenbaum [9] proposed a framework that considers local segmentation energies to evolve contours; it produces the initial localization according to a locally based global active contour energy and effectively segments objects with heterogeneous feature profiles.
Graph cuts marks the target nodes (i.e., source nodes) and background nodes (i.e., sink nodes), and uses the connections between different nodes to represent the degree of fit between the nodes and the corresponding pixels (i.e., the penalty function). Minimizing the multi-label graph cuts energy is an NP-hard problem, so efficient approximation algorithms must be sought, e.g., a swap algorithm applicable when the pairwise term is a semimetric and an expansion algorithm applicable when it is a metric. Freedman [10] proposed an interactive graph cuts segmentation algorithm combined with prior knowledge of shapes, which to a certain extent solved the problem of inaccurate segmentation in the case of diffuse edges or multiple close, similar objects. Graph cuts algorithms are widely used in the field of medical image analysis.
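The source/sink construction can be illustrated on a toy graph. The sketch below is our own simplification (real systems use the swap/expansion moves or dedicated max-flow solvers): source-to-pixel capacities encode the cost of a background label, pixel-to-sink capacities the cost of a foreground label, and neighbor capacities the smoothness penalty; an s-t min-cut found via Edmonds-Karp max-flow then yields the binary labeling.

```python
import numpy as np
from collections import deque

def min_cut_labels(unary_fg, unary_bg, pairwise, edges):
    """Binary labeling of n 'pixels' by s-t min-cut on a tiny graph."""
    n = len(unary_fg)
    S, T = n, n + 1
    cap = np.zeros((n + 2, n + 2))
    for i in range(n):
        cap[S, i] = unary_bg[i]   # cut when i ends up background (sink side)
        cap[i, T] = unary_fg[i]   # cut when i ends up foreground (source side)
    for i, j in edges:
        cap[i, j] += pairwise
        cap[j, i] += pairwise
    flow = np.zeros_like(cap)
    while True:
        # breadth-first search for an augmenting path in the residual graph
        parent = {S: None}
        q = deque([S])
        while q and T not in parent:
            u = q.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u, v] - flow[u, v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if T not in parent:
            break
        bottleneck, v = float("inf"), T
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u, v] - flow[u, v])
            v = u
        v = T
        while parent[v] is not None:
            u = parent[v]
            flow[u, v] += bottleneck
            flow[v, u] -= bottleneck
            v = u
    # pixels still reachable from S in the residual graph are foreground
    seen, q = {S}, deque([S])
    while q:
        u = q.popleft()
        for v in range(n + 2):
            if v not in seen and cap[u, v] - flow[u, v] > 1e-12:
                seen.add(v)
                q.append(v)
    return [int(i in seen) for i in range(n)]
```

On a four-pixel chain whose unary terms strongly prefer foreground for the first two pixels and background for the last two, the minimum cut severs only the single pairwise edge between them.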

Region Division
The region division strategy includes serial region division and parallel region division. Thresholding is a typical parallel region division algorithm. The threshold is generally defined by a trough value in the gray histogram, with some processing applied to make the troughs in the histogram deeper or to convert troughs into peaks. The optimal grayscale threshold can be determined from the zeroth-order and first-order cumulative moments of the gray histogram so as to maximize the discriminability between the different categories.
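The moment-based threshold selection just described corresponds to Otsu's method; a minimal NumPy version (function and parameter names are ours) picks the gray level that maximizes the between-class variance computed from the cumulative moments:

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Threshold maximizing between-class variance, from the zeroth-order
    (omega) and first-order (mu) cumulative moments of the histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # zeroth-order cumulative moment
    mu = np.cumsum(p * np.arange(bins))   # first-order cumulative moment
    mu_t = mu[-1]                         # global mean gray level
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)    # undefined at empty classes
    return int(np.argmax(sigma_b2))
```

For a bimodal image the returned level separates the two modes, and `img > t` yields the binary region division.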
The serial region technique divides the region segmentation task into multiple steps performed sequentially; the representative steps are region growing and region merging.
Region growing takes multiple seeds (single pixels or regions) as starting points and, according to a predefined growth rule, merges into the seed's region the neighboring pixels with the same or similar features, until no more pixels can be merged. The principle of region merging is similar to region growing, except that region merging measures similarity by judging whether the difference between the average gray value of the pixels in the region obtained in the previous step and the gray value of an adjacent pixel is less than a given threshold K. Region merging can be used to solve the problems of hard noise loss and object occlusion, and it has a good effect on controlling the segmentation scale and processing unconventional data; however, its computational cost is high, and the stopping rule is difficult to define.
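The growth rule above can be sketched in a few lines; this is our own minimal version (seed, tolerance, and 4-connectivity are assumptions) that grows a region breadth-first from one seed, absorbing neighbors whose gray value stays within a tolerance of the seed value:

```python
import numpy as np
from collections import deque

def region_grow(img, seed, tol=10):
    """Grow a region from `seed`, absorbing 4-neighbours whose gray value
    differs from the seed's value by at most `tol`."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    ref = float(img[seed])
    q = deque([seed])
    while q:
        i, j = q.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < h and 0 <= nj < w and not mask[ni, nj]
                    and abs(float(img[ni, nj]) - ref) <= tol):
                mask[ni, nj] = True
                q.append((ni, nj))
    return mask
```

Seeding inside one homogeneous block of a two-block image returns a mask covering exactly that block, which is the "grow until no more pixels can be merged" stopping rule in action.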
Watershed is based on the concept of topography. When water rises from low places, dams need to be built to prevent the water from crossing the peaks; the dams built on the peaks divide the entire image into several regions. The watershed algorithm can obtain closed contours and has high processing efficiency. However, for more complex images it is prone to false segmentation, which can be alleviated by establishing a Gaussian mixture model (GMM). The improved watershed has high generalization performance, is often used in the segmentation of MRI images and digital elevation maps, and is especially effective for segmenting medical images containing overlapping cells (e.g., blood cell segmentation).
A superpixel is a small irregular region composed of pixels with similar positions and features (e.g., brightness, color, and texture). Using superpixels instead of pixels to represent features reduces the complexity of image processing, so superpixels are often used in the preprocessing stage of image segmentation. Image segmentation methods based on superpixel generation mainly include clustering and graph theory.

Graph Theory
The image segmentation methods based on graph theory map an image to a graph, representing pixels or regions as vertices and the similarity between vertices as edge weights. Image segmentation is then regarded as a division of the vertices in the graph: the weighted graph is analyzed with principles and methods from graph theory, and the optimal segmentation is obtained through global optimization on the graph (e.g., the min-cut).
Graph-based region merging uses different metrics to obtain an optimal global grouping, instead of using fixed merging rules as in clustering. Felzenszwalb et al. [11] used a minimum spanning tree (MST) to merge pixels after the image was represented as a graph.
Image segmentation based on MRFs (Markov random fields) introduces probabilistic graphical models (PGMs) into region division to represent the randomness of the low-level features in images. It maps the image to an undirected graph, where each vertex represents the feature at the corresponding location in the image, and each edge represents the relationship between two vertices. According to the Markov property of the graph, the feature of each point is related only to its adjacent features.
Leordeanu et al. [12] proposed a method based on spectral graph partitioning to find the correspondence between two sets of features. An adjacency matrix M is built for the weighted graph corresponding to the image, and the mapping constraints required for the overall mapping are imposed on the principal eigenvector of M, so that the correct assignments are recovered according to how strongly they belong to the main cluster of M.

Clustering Method
K-means clustering is a special thresholding segmentation algorithm based on the Lloyd algorithm. The algorithm operates as follows: (i) initialize K points as cluster centers; (ii) calculate the distance between each point i in the image and the K cluster centers, and assign the point to the nearest center, yielding its classification k_i; (iii) average the points of each category (the centroid) and move the cluster center to the centroid; and (iv) repeat steps (ii) and (iii) until convergence. Simply put, K-means is an iterative process for computing the cluster centers. K-means is robust to noise and converges quickly, but it is not well suited to processing non-adjacent regions, and it converges only to a local optimum instead of the global optimum.
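Steps (i)-(iv) translate almost line for line into NumPy; the sketch below is ours (seeding strategy and iteration cap are assumptions, not from the paper):

```python
import numpy as np

def kmeans(points, K, iters=100, seed=0):
    """Lloyd iterations: assign points to nearest center, move centers
    to the centroids of their points, repeat until convergence."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), K, replace=False)]  # step (i)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                # step (ii)
        new = np.array([points[labels == k].mean(axis=0)
                        if (labels == k).any() else centers[k]
                        for k in range(K)])                      # step (iii)
        if np.allclose(new, centers):                            # step (iv)
            break
        centers = new
    return labels, centers
```

On two well-separated point clouds the iterations converge in a few passes and the labels split the clouds cleanly, though with a different initialization the same code can land in a poorer local optimum, which is exactly the limitation noted above.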
Mean-shift [13] is a clustering algorithm based on density estimation, which models the image feature space with a probability density function. Chuang [14] proposed a fuzzy C-means algorithm that integrated spatial information into the membership function for clustering, generating more uniform region segmentation.
Spectral clustering is a common clustering method based on graph theory, which divides the weighted graph into subgraphs with low coupling and high cohesion. Achanta et al. [15] proposed the simple linear iterative clustering (SLIC) algorithm, which uses K-means to generate superpixels; its segmentation results are shown in Figure 3. SLIC can also be applied to 3D supervoxel generation. Li et al. [16] proposed a superpixel segmentation algorithm named linear spectral clustering (LSC), which uses a kernel function to map the pixel values and coordinates into a high-dimensional feature space and weights each point appropriately, so that the objective functions of K-means and the normalized cut share the same optimal solution.
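The core of spectral clustering, splitting a weighted graph along its weakest connection, can be shown on a toy affinity matrix. The sketch below (our own minimal normalized-cut style bipartition, not SLIC or LSC) thresholds the sign of the Fiedler vector, i.e., the eigenvector of the normalized graph Laplacian with the second-smallest eigenvalue:

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way graph cut from the sign of the Fiedler vector of the
    symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_isqrt = np.diag(1.0 / np.sqrt(d))
    l_sym = np.eye(len(W)) - d_isqrt @ W @ d_isqrt
    vals, vecs = np.linalg.eigh(l_sym)        # eigenvalues in ascending order
    fiedler = d_isqrt @ vecs[:, 1]            # back-transform the eigenvector
    return (fiedler > 0).astype(int)
```

Two tightly connected vertex groups joined by one weak edge are separated exactly along that edge, mirroring the "low coupling, high cohesion" criterion.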

Random Walks
Random walks is a segmentation algorithm based on graph theory that is commonly used in image segmentation, image denoising [17,18], and image matching [19]. By assigning labels to adjacent pixels according to predefined rules, pixels with the same label can be grouped together to distinguish different objects.
Grady et al. [20] transformed the segmentation problem into a discrete Dirichlet problem. They converted the image into a connected weighted undirected graph, and marked the foreground and background of the image with one or a group of points, respectively, as boundary conditions. For each unmarked point, they calculated the probability that a random walker starting there first reaches the foreground or the background, and then took the category with the highest probability. Yang et al. [21] proposed a constrained random walks algorithm that takes user input as auxiliary conditions, e.g., users can mark the foreground and background in the image, or draw regions that the boundaries must pass through (hard constraint) or may pass through (soft constraint). The framework contains a constrained random walks algorithm and a local editing algorithm, which results in more accurate region contours and better interactivity.
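The Dirichlet formulation is compact enough to sketch directly. The toy below (ours; a 1-D "image" with Gaussian intensity weights, `beta` is an assumed parameter) builds the graph Laplacian L, fixes the seeded entries, and solves the linear system L_u x = -B b for the first-arrival probabilities of the unseeded pixels, as in Grady's method:

```python
import numpy as np

def random_walker_1d(intensity, seeds, beta=1.0):
    """seeds maps pixel index -> 1 (foreground) or 0 (background).
    Returns hard labels and the foreground-arrival probabilities."""
    n = len(intensity)
    W = np.zeros((n, n))
    for i in range(n - 1):   # chain graph: weights follow intensity similarity
        w = np.exp(-beta * (float(intensity[i]) - float(intensity[i + 1])) ** 2)
        W[i, i + 1] = W[i + 1, i] = w
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    marked = sorted(seeds)
    unmarked = [i for i in range(n) if i not in seeds]
    b = np.array([seeds[i] for i in marked], dtype=float)
    Lu = L[np.ix_(unmarked, unmarked)]
    B = L[np.ix_(unmarked, marked)]
    probs = np.zeros(n)
    for i, s in zip(marked, b):
        probs[i] = s
    probs[unmarked] = np.linalg.solve(Lu, -B @ b)        # discrete Dirichlet problem
    return (probs > 0.5).astype(int), probs
```

With one foreground and one background seed on either side of an intensity step, the low-weight edge at the step blocks the walker and the labels split exactly at the edge.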
Lai et al. [22] extended the random walks idea to 3D mesh segmentation. They represented each face of the mesh as a vertex in the graph, defined the edge weights using the dihedral angle between adjacent faces, and sought a harmonic function adapted to the boundary conditions. On this basis, Zhang et al. [23] proposed a fast geodesic curvature flow (FGCF) algorithm, which takes mesh vertices as the graph vertices to reduce the number of vertices in the graph, and moves the cutting contour to a local minimum of the weighted curve to smooth zigzag contours. FGCF therefore requires less user input and achieves higher efficiency and robustness on the mesh segmentation benchmark dataset.


Co-Segmentation Methods
The classic segmentation methods usually focus on feature extraction from a single image, which makes it difficult to obtain high-level semantic information. In 2006, Rother et al. [24] proposed the concept of collaborative segmentation for the first time. Collaborative segmentation, or co-segmentation for short, involves extracting the common foreground regions from multiple images with no human intervention, to obtain prior knowledge. Figure 4 shows a set of examples of co-segmentation results. To achieve co-segmentation, it is necessary to extract the features of the foreground of single or multiple images (the seed image(s)) as prior knowledge using a classic segmentation method, and then utilize the prior knowledge to process a set of images containing the same or similar objects. The extended model can be expressed as

E = E_s + E_g,

where E_s represents the energy function of seed image segmentation, which describes the difference between the foreground and background of the image and the smoothness of the image, and E_g represents the energy function of co-segmentation, which describes the similarity between the foregrounds in a set of images. To achieve a good co-segmentation effect, the segmentation energy E should be minimized. This can be achieved in two ways: improving the classic segmentation method to minimize E_s, or optimizing the unsupervised learning method to learn good representations in image sets to minimize E_g.
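A common concrete choice for the global term E_g is a norm between foreground color histograms; the sketch below (our own illustration; function name, bin count, and the L1 distance are assumptions) penalizes the inconsistency between the normalized foreground histograms of two images:

```python
import numpy as np

def coseg_global_term(fg_pixels_a, fg_pixels_b, bins=8):
    """E_g as the L1 distance between normalized foreground histograms:
    0 for identical foreground appearance, up to 2 for disjoint ones."""
    ha, _ = np.histogram(fg_pixels_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(fg_pixels_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.abs(ha - hb).sum())
```

Minimizing E over candidate foreground masks then trades off per-image segmentation quality (E_s) against this cross-image consistency penalty.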
The energy function of a classic segmentation model serves as E_s; e.g., when the MRF segmentation method is used as E_s,

E_s = E_u^MRF + E_p^MRF,

where E_u^MRF and E_p^MRF are the unary potential and the pairwise potential, respectively. The former measures the properties of a pixel itself, and the latter measures its relation to other pixels. In the MRF, the unary potential represents the probability of a pixel belonging to class x_i when its feature is y_i, i.e., ∑_{x_i} E_u(x_i); the pairwise potential represents the probability that two adjacent pixels belong to the same category, i.e., ∑_{x_i, x_j ∈ Ψ} E_p(x_i, x_j). The co-segmentation term E_g is used to penalize the inconsistency of the foreground color histograms across images. In MRF-based co-segmentation models, a variety of co-segmentation terms and corresponding minimization methods have been proposed.
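To make the unary-plus-pairwise decomposition concrete, the toy below (ours; a Potts pairwise term with weight `lam` over 4-connected neighbors is an assumed simplification) evaluates E_s for a given label map:

```python
import numpy as np

def mrf_energy(labels, unary, lam=1.0):
    """E_s = sum of unary potentials + Potts pairwise penalties.
    unary[c, i, j] is the cost of assigning class c to pixel (i, j)."""
    h, w = labels.shape
    e = sum(unary[labels[i, j], i, j] for i in range(h) for j in range(w))
    for i in range(h):
        for j in range(w):
            if i + 1 < h:   # vertical neighbour pair
                e += lam * (labels[i, j] != labels[i + 1, j])
            if j + 1 < w:   # horizontal neighbour pair
                e += lam * (labels[i, j] != labels[i, j + 1])
    return float(e)
```

Flipping a single pixel against uniform unary evidence raises the energy by its unary cost plus one Potts penalty per disagreeing neighbor, which is exactly the smoothness pressure the pairwise term provides.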

MRF-Based Co-Segmentation
Rother et al. [24] extended MRF segmentation and utilized prior knowledge to solve the ill-posed problem in multiple image segmentation. First, they segmented the foreground of the seed image and assumed that the foreground objects of a set of images are similar; then, they built the energy function according to the consistency of the MRF probability distribution and a global constraint on foreground feature similarity; finally, they estimated whether each pixel belongs to the foreground or background by minimizing the energy function to achieve segmentation of the foreground and background.
Subsequent research on MRF co-segmentation focused on the optimization of the global constraints. Vicente et al. [25] proposed an extended Boykov-Jolly model using multiscale decomposition, based on the L1-norm model [24], the L2-norm model [26], and the reward model [27]. Compared with these three models, the extended Boykov-Jolly model made great strides in reducing the number of parameters and improving robustness. Rubio et al. [28] evaluated foreground similarity through high-order graph matching and introduced it into the MRF model as a global term. Chang et al. [29] proposed a universal saliency measure for images as prior knowledge, which adds foreground positional information to the MRF model and addresses significant differences in the appearance, shape, and scale of objects across images. Yu et al. [30] combined a co-saliency model to achieve co-segmentation: they represented the dissimilarity between the foreground objects in each image and the common objects in the dataset with a Gaussian mixture model as a new global constraint, added this constraint to the co-segmentation energy E, and used graph cuts to minimize the energy function iteratively.
Co-segmentation based on MRFs has good universality and is commonly used in video object detection and segmentation [30,31] and interactive image editing [32].

Co-Segmentation Based on Random Walks
Collins et al. [33] extended the random walks model to the co-segmentation problem, further utilized quasiconvexity to optimize the segmentation algorithm, and provided a dedicated CUDA library to compute the linear operations on the images' sparse features. Fabijanska et al. [34] proposed an optimized random walks algorithm for 3D voxel image segmentation that uses supervoxels instead of single voxels, which greatly saves computing time and memory. Dong et al. [35] proposed a sub-Markov random walks (subRW) algorithm with prior label knowledge, which combines subRW with other random walks algorithms for seeded image segmentation, and it achieved a good segmentation effect on images containing slender objects.
Co-segmentation methods based on random walks have good flexibility and robustness. They have achieved good results in some areas of medical image segmentation, especially 3D medical image segmentation [36,37].

Co-Segmentation Based on Active Contours
Meng et al. [38] extended the active contours method to co-segmentation, constructed an energy function based on foreground consistency between images and background inconsistency within each image, and minimized the energy function with a level set method. Zhang et al. [39] proposed a deformable co-segmentation algorithm that transforms the prior heuristic information of brain anatomy contained in multiple images into constraints controlling brain MRI segmentation, and acquires the minimum energy function by level set, solving the problem of brain MRI image segmentation. Zhang et al. [40] introduced the saliency of the region of interest into the active contours algorithm to improve the co-segmentation of multiple images, and proposed a level set optimization method based on superpixels, hierarchical computing, and convergence judgment to minimize the energy function.
Co-segmentation methods based on active contours extract the boundaries of complex shapes well, but the unidirectional movement of the contour severely limits their flexibility, which is not conducive to recognizing and processing objects with weak edges.

Clustering-Based Co-Segmentation
Clustering-based co-segmentation is an extension of the clustering segmentation of a single image. Joulin et al. [41] proposed a co-segmentation method based on spectral clustering and discriminative clustering. They used spectral clustering to segment a single image based on local spatial information, and then used discriminative clustering to propagate the segmentation results through a set of images to achieve co-segmentation. Kim et al. [42] divided the images into superpixels, used a weighted graph to describe the relevance of the superpixels, converted the weighted graph into an affinity matrix describing the intra-image relations, and then adopted spectral clustering to achieve co-segmentation; the final representation can be seen in Figure 5. If the number of initial cluster centers is not limited, the clustering method can be applied to the multi-object co-segmentation problem, as follows. First, the image is segmented into local regions of multiple superpixel blocks through preprocessing. Then, these local regions are clustered by a clustering algorithm to form the corresponding prior information. Finally, the prior information is propagated through a set of images to achieve multi-object co-segmentation.

Co-Segmentation Based on Graph Theory
Co-segmentation based on graph theory represents an image as a directed graph (digraph). In contrast to the digraph mentioned earlier, Meng et al. [44] divided each image into several local regions based on object detection, and then used these local regions as nodes to construct a digraph, instead of using superpixels or pixels as nodes. Nodes are connected by directed edges, and the weight of an edge represents the local region similarity and saliency between the two objects. Thereupon, the image co-segmentation problem was converted into the problem of finding the shortest path on the digraph. Finally, they obtained the shortest path through the dynamic programming (DP) algorithm. The flowchart is shown in Figure 6.
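The shortest-path-by-DP idea can be illustrated on a layered digraph: one layer of candidate regions per image, directed edges only between consecutive layers. The function and edge weights below are hypothetical stand-ins, a sketch of the general DP recurrence rather than the specific graph construction of [44]:

```python
def shortest_region_path(weights):
    """Dynamic-programming shortest path through a layered digraph:
    weights[i][u][v] is the edge cost between candidate region u in
    image i and region v in image i+1 (e.g., a dissimilarity score)."""
    cost = [0.0] * len(weights[0])      # best cost to each first-layer node
    back = []                            # backpointers per layer
    for layer in weights:
        new_cost, ptr = [], []
        for v in range(len(layer[0])):
            cands = [cost[u] + layer[u][v] for u in range(len(cost))]
            best = min(range(len(cands)), key=cands.__getitem__)
            new_cost.append(cands[best])
            ptr.append(best)
        cost, back = new_cost, back + [ptr]
    # Trace the optimal region in each image back from the best endpoint.
    end = min(range(len(cost)), key=cost.__getitem__)
    path = [end]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), cost[end]

# Three images, two candidate regions each; regions 0 -> 0 -> 1 are cheapest.
w01 = [[0.1, 0.9], [0.8, 0.7]]
w12 = [[0.6, 0.2], [0.9, 0.9]]
path, total = shortest_region_path([w01, w12])  # path == [0, 0, 1]
```

The DP runs in O(images × regions²), which is what makes the exhaustive path search over all region combinations tractable.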

In the same year, Meng et al. [45] proposed a new co-saliency model to extract co-saliency maps from pairwise-constrained images. The co-saliency map consists of two terms, i.e., a saliency map based on a single image and a saliency map based on multiple images, so it can also be called a dual-constrained saliency map. Compared to [44], the co-saliency map obtained by pairwise-constrained graph matching is more accurate. They extracted multiple saliency maps by matching similar regions between images, transformed this into a pairwise-constrained graph matching problem, and solved it using the DP algorithm.

Co-Segmentation Based on Thermal Diffusion
Thermal diffusion image segmentation maximizes the temperature of the system by changing the location of the heat source, and its goal is to find the optimal location of the heat source to achieve the best segmentation effect. Anisotropic diffusion is a nonlinear filter that can not only reduce Gaussian noise but also preserve image edges, so it is often used in image processing to reduce noise while enhancing image details. Kim et al. [46] proposed a method called CoSand, which adopted temperature maximization modeling on anisotropic diffusion, where k heat sources maximize the temperature corresponding to the segmentation of k categories; they achieved large-scale multi-category co-segmentation by maximizing the segmentation confidence of each pixel in the image. Kim et al. [47] realized multi-foreground co-segmentation by iteratively implementing the two tasks of scene modeling and region labeling according to the similarity of the foreground objects in multiple images. In the process of foreground modeling, a spatial pyramid matching algorithm was used to extract local features, a linear support vector machine (SVM) was used for feature matching, and a Gaussian mixture model was used for object classification and detection. This method achieved good evaluation results on Flickr MFC and ImageNet, and still segmented accurately when foreground objects did not appear in every image.
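The anisotropic diffusion that CoSand builds on is the classic Perona-Malik scheme: diffusion strength is modulated by an edge-stopping function of the local gradient, so flat regions are smoothed while edges survive. A minimal NumPy sketch (the function name and all parameter values are illustrative assumptions):

```python
import numpy as np

def anisotropic_diffusion(img, n_iter=20, kappa=0.2, lam=0.2):
    """Perona-Malik anisotropic diffusion: smooths homogeneous regions
    while the edge-stopping function g(|grad|) = exp(-(|grad|/kappa)^2)
    suppresses diffusion across strong edges."""
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)   # conduction coefficient
    for _ in range(n_iter):
        # Finite-difference gradients toward the four neighbours.
        dn = np.roll(u, -1, 0) - u
        ds = np.roll(u, 1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u, 1, 1) - u
        # Flow is large in flat areas (g ~ 1) and ~0 across edges.
        u += lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

# A noisy two-region image: diffusion reduces noise inside each half
# while keeping the step edge between the halves sharp.
rng = np.random.default_rng(0)
img = np.hstack([np.zeros((16, 8)), np.ones((16, 8))]) + rng.normal(0, 0.05, (16, 16))
out = anisotropic_diffusion(img)
```

After a few iterations the intra-region variance drops while the contrast across the step edge is preserved, which is exactly the property that makes the diffusion useful as a segmentation prior.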

Object-Based Co-Segmentation
Alexe et al. [48] proposed an object-based measurement method to quantify the possibility that an image window contains objects of any category. The probability of each sampling window containing an object was calculated in advance, and the highest-scoring window was used as the feature calibration for each category of objects according to Bayesian theory. The method could distinguish between objects with clear spatial boundaries, e.g., telephones, and amorphous background elements, e.g., grass, which greatly reduced the number of candidate object detection windows. Vicente et al. [49] used foreground objects, measured the similarity between objects, extracted the features with the highest score from multiple candidate object classes, and achieved good experimental results on the iCoseg dataset.
To solve the problem of multi-object segmentation, binary segmentation methods based on object similarity ranking were proposed, which build a model using parametric maximum flow and train a scoring function to obtain the optimal prediction result. The scoring function is determined by properties such as the convexity of the foreground objects, the continuity of their contours, the contrast between the foreground and the background, and the positions of the objects in the image. Meng et al. [50] proposed a multi-group image co-segmentation framework that could obtain inter-image information in each set of images, generating more accurate prior knowledge; they used an MRF and a dense mapping model, used EM to solve the energy minimization problem of co-segmentation, and achieved co-segmentation with multiple-foreground recognition. The main methods in co-segmentation are shown in Table 1.

Table 1. The main methods in co-segmentation.

Methods                      Ref.  Foreground Feature            Co-Information       Optimization
MRF-based co-segmentation    [24]  color histogram               L1 norm              graph cuts
                             [26]  color histogram               L2 norm              quadratic pseudo-Boolean
                             [27]  color and texture histograms  reward model         maximum flow
                             [25]  color histogram               Boykov-Jolly model   dual decomposition
                             [46]  color and SIFT features       region matching      graph cuts

Semantic Segmentation Based on Deep Learning
With the continuous development of image acquisition equipment, there has been a great increase in the complexity of image details and the variation of objects (e.g., in scale and posture). It is difficult to obtain good segmentation results from low-level features (e.g., color, brightness, and texture) alone, and feature extraction methods based on manual or heuristic rules cannot meet the complex needs of current image segmentation, which demands higher generalization ability from image segmentation models.
Semantic texton forests [51] and random forests [52] were generally used to construct semantic segmentation classifiers before deep learning was applied to the field of image segmentation. In the past few years, deep learning algorithms have been increasingly applied to segmentation tasks, and the segmentation effect and performance have been significantly improved. The original approach divides the image into small patches to train a neural network and then classifies the pixels. This patch classification algorithm [53] was adopted because the fully connected layers of the neural network require fixed-size images.
In 2015, Long et al. [54] proposed fully convolutional networks (FCNs), which replace fully connected layers with convolutional layers, making it possible to input images of any size; the FCN architecture is shown in Figure 7. FCNs proved that neural networks can perform end-to-end semantic segmentation training, laying a foundation for deep neural networks in semantic segmentation.

Encoder-Decoder Architecture
Encoder-decoder architecture is based on FCNs. Prior to FCNs, convolutional neural networks (CNNs) achieved good results in image classification, e.g., LeNet-5 [55], AlexNet [56], and VGG [57], whose output layers are the categories of images. However, semantic segmentation needs to map the high-level features back to the original image size after obtaining high-level semantic information, which requires an encoder-decoder architecture.
In the encoder stage, convolution and pooling operations are mainly performed to extract high-dimensional features containing semantic information. The convolution operation multiplies and sums an image region with different convolution kernels element by element, and then applies an activation function to obtain a feature map. The pooling operation samples within a certain region (the pooling window) and uses a sampling statistic as the representative feature of the region. The backbone blocks commonly used in segmentation network encoders are VGG, Inception [58,59], and ResNet [60].
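The two encoder primitives just described can be written out directly. This is a didactic NumPy sketch (single channel, valid padding, mean-filter kernel chosen only for a checkable result), not the implementation of any particular network:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (cross-correlation, as CNNs compute it):
    slide the kernel over the image, multiply and sum per window."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max-pooling: each window is represented
    by its maximum activation."""
    H, W = x.shape
    x = x[:H - H % s, :W - W % s]
    return x.reshape(H // s, s, W // s, s).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
feat = np.maximum(conv2d(x, np.ones((3, 3)) / 9.0), 0)  # conv + ReLU
pooled = max_pool(feat, 2)
```

Each conv + pool stage halves the spatial resolution while enlarging the receptive field, which is why the decoder later needs up-sampling to recover pixel-level predictions.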
In the decoder stage, the high-dimensional feature vectors are used to generate a semantic segmentation mask. The process of mapping the multi-level features extracted by the encoder back to the original image size is called up-sampling. Common up-sampling approaches include the following:

•	The interpolation method uses a specified interpolation strategy to insert new elements between the pixels of the original image, thereby expanding the size of the image and achieving the effect of up-sampling. Interpolation requires no trainable parameters and was often used in early up-sampling tasks;
•	The FCN adopts deconvolution for up-sampling. Deconvolution, also known as transposed convolution, flips the original convolution kernel vertically and horizontally, and fills the spaces between and around the elements of the input;
•	SegNet [61] adopts unpooling for up-sampling. Unpooling is the inverse operation of max-pooling in the CNN: during max-pooling, not only the maximum value of the pooling window but also its coordinate position is recorded; during unpooling, the value at that recorded position is activated, and the values at all other positions are set to 0;
•	Wang et al. [62] proposed dense up-sampling convolution (DUC), the core idea of which is to convert the label map in the feature map into a smaller label map with multiple channels. This transformation is achieved by convolutions between the input feature map and the output label map directly, without interpolating extra values during up-sampling.
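Two of these up-sampling schemes are easy to contrast in code. The sketch below (illustrative function names, NumPy only) shows nearest-neighbour interpolation, which invents no information, against SegNet-style unpooling, which restores each maximum to the exact position recorded by the encoder:

```python
import numpy as np

def nearest_upsample(x, s=2):
    """Interpolation-based up-sampling: repeat each pixel s times
    along both axes (nearest neighbour; no learned parameters)."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def max_pool_with_indices(x, s=2):
    """Max-pooling that also records where each maximum came from,
    as SegNet's encoder does."""
    H, W = x.shape
    out = np.zeros((H // s, W // s))
    idx = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            win = x[i * s:(i + 1) * s, j * s:(j + 1) * s]
            r, c = np.unravel_index(win.argmax(), win.shape)
            out[i, j] = win[r, c]
            idx[i, j] = (i * s + r, j * s + c)
    return out, idx

def unpool(x, idx, shape):
    """SegNet-style unpooling: place each value back at its recorded
    max location; every other position stays 0."""
    out = np.zeros(shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            r, c = idx[i, j]
            out[r, c] = x[i, j]
    return out

x = np.array([[1.0, 3.0], [4.0, 2.0]])
pooled, idx = max_pool_with_indices(x, 2)    # pooled = [[4.]]
restored = unpool(pooled, idx, x.shape)      # 4 returns to position (1, 0)
up = nearest_upsample(pooled)                # [[4., 4.], [4., 4.]]
```

Unpooling preserves boundary localization (the maximum goes back where it came from), whereas interpolation smears the value uniformly over the window.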

Dilated Convolution
Dilated convolution, also known as atrous convolution, is constructed by inserting holes into the convolution kernel to expand the receptive field without increasing computation during down-sampling. In FCNs, the max-pooling layers are replaced by dilated convolution to maintain the receptive field of the corresponding layer and the high resolution of the feature map.
The DeepLab series [65-68] are classic models in the field of semantic segmentation. Prior to DeepLab V1, semantic segmentation results were usually rough, owing to the loss of spatial detail in the pooling process and to the probabilistic relationships between labels not being used in prediction. To ameliorate these problems, DeepLab V1 [65] uses dilated convolution to counter the resolution reduction caused by down-sampling, and uses fully connected conditional random fields (fully connected CRFs) as post-processing to optimize segmented images and capture multi-scale objects and context information.
Yu et al. [69] used dilated convolution to aggregate multiscale context information. They adopted a context module with eight convolutional layers, seven of which applied 3 × 3 convolution kernels with different dilation factors (i.e., 1, 1, 2, 4, 8, 16, and 1), and showed that the simplified adaptive network could further improve segmentation accuracy without any loss of resolution. In [70], they proposed a dilated residual network (DRN) based on ResNet, comprising five groups of convolutional layers. The down-sampling in the latter two groups (i.e., G4 and G5) was removed to maintain the spatial resolution of the feature map; instead, the subsequent convolutions of G4 and G5 used dilated convolutions with dilation rates r = 2 and r = 4, respectively.
Wang et al. [62] proposed hybrid dilated convolution (HDC) to deal effectively with the "gridding" problem caused by dilated convolution. HDC makes the final receptive field of a series of convolution operations completely cover a square region without any holes or missing edges. To enable this, they used a different dilation rate for each layer, instead of using the same dilation rate for all layers after each down-sampling.
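Both ideas are mechanical enough to verify numerically. The sketch below (helper names are illustrative) dilates a 3 × 3 kernel and computes the receptive field of a stack of dilated layers; note that HDC-style rates [1, 2, 3] and constant rates [2, 2, 2] reach the same receptive-field size, but the constant-rate stack samples only a sparse grid of pixels inside it, which is the gridding artifact:

```python
import numpy as np

def dilate_kernel(k, rate):
    """Dilated (atrous) kernel: insert rate-1 zeros ('holes') between
    the taps of k, enlarging the footprint without extra weights."""
    kh, kw = k.shape
    out = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1))
    out[::rate, ::rate] = k
    return out

def receptive_field(rates, ksize=3):
    """Receptive field of stacked dilated ksize x ksize convolutions:
    each layer adds (ksize - 1) * rate pixels."""
    rf = 1
    for r in rates:
        rf += (ksize - 1) * r
    return rf

k = np.ones((3, 3))
kd = dilate_kernel(k, 2)                # 5x5 footprint, still 9 weights
rf_hdc = receptive_field([1, 2, 3])     # HDC-style mixed rates
rf_const = receptive_field([2, 2, 2])   # same size, but leaves grid holes
```

Here both stacks reach a 13 × 13 receptive field, so the benefit of HDC is not size but dense coverage.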

Multiscale Feature Extraction
Spatial pyramid pooling (SPP) was proposed to solve the problem of CNNs requiring fixed-size input images. He et al. [71] developed SPP-net and verified its effectiveness in semantic segmentation and object detection. To make the most of image context information, Zhao et al. [72] developed PSPNet with a pyramid pooling module (PPM), as shown in Figure 9. Using ResNet as the backbone network, PSPNet utilizes the PPM to extract and aggregate subregion features at different scales, which are then up-sampled and concatenated to form a feature map carrying both local and global context information. It is particularly worth noting that the number of pyramid levels and the size of each level are variable, depending on the size of the feature map input to the PPM. Ghiasi and Fowlkes [73] described a multi-resolution reconstruction architecture based on a Laplacian pyramid, which used skip connections from higher-resolution feature maps and multiplicative gating to successively refine segmentation boundaries reconstructed from lower-resolution maps.
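The PPM mechanics can be sketched on a single-channel map: pool at several bin sizes, resize each pooled map back, and stack everything as extra context channels. This is a toy NumPy rendering (nearest-neighbour resize, illustrative function names), not PSPNet's learned implementation:

```python
import numpy as np

def adaptive_avg_pool(x, out_size):
    """Average-pool a square feature map down to out_size x out_size."""
    H = x.shape[0]
    edges = np.linspace(0, H, out_size + 1).round().astype(int)
    return np.array([[x[edges[i]:edges[i + 1], edges[j]:edges[j + 1]].mean()
                      for j in range(out_size)] for i in range(out_size)])

def pyramid_pooling(x, bins=(1, 2, 3, 6)):
    """PSPNet-style pyramid pooling: pool the map at several scales,
    resize each pooled map back to full size (nearest neighbour here),
    and stack them with the input as extra context channels."""
    H = x.shape[0]
    outs = [x]
    for b in bins:
        p = adaptive_avg_pool(x, b)
        ids = np.arange(H) * b // H          # nearest-neighbour resize
        outs.append(p[np.ix_(ids, ids)])
    return np.stack(outs)                     # (1 + len(bins), H, H)

x = np.arange(36.0).reshape(6, 6)
feat = pyramid_pooling(x)
```

The 1 × 1 level contributes the global mean at every position (pure global context), while the finest level reproduces the input, so the stacked output mixes local detail with progressively coarser summaries.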
DeepLab V2 [66] introduced atrous spatial pyramid pooling (ASPP) to expand the receptive field and capture multiscale features. The ASPP module contains four parallel dilated convolutions with different dilation rates, as shown in Figure 10. Referring to the HDC method, DeepLab V3 [67] applied both cascaded and parallel modules of dilated convolution, grouped the parallel convolutions in the ASPP module, and added a 1 × 1 convolution layer and batch normalization to the ASPP module. DeepLab V3 significantly improved on the previous DeepLab versions even without DenseCRF post-processing. Moreover, using Xception as the backbone network and DeepLab V3 as the encoder, DeepLab V3+ [68] adopted dilated depthwise separable convolutions instead of max-pooling and batch normalization to refine the segmentation boundaries.
DeepLab V2 [66] introduced atrous spatial pyramid pooling (ASPP) to expand the receptive field and capture multiscale features.The ASPP module contained four parallel dilated convolutions with different dilation rates, as shown in Figure 10.Referring to the HDC method, DeepLab V3 [67] applied both cascade modules and parallel modules of dilated convolution, grouped the parallel convolution in the ASPP module, and added the 1 × 1 convolution layer and batch normalization in the ASPP module.The DeepLab V3 significantly improved on the previous DeepLab versions without DenseCRF post-processing.Moreover, using Xception as the backbone network and DeepLab V3 as the decoder, DeepLab V3+ [68] adopted dilated depth wise separable convolutions instead of max-pooling and batch normalization to refine the segmentation boundaries.The scheme of FPN (feature pyramid network) [74] is similar to the skip connections of the U-Net model, that is beneficial for obtaining high resolution and strong semantic features for object detection with significant size differences in the images.He et al. [75] proposed an adaptive pyramid context network (APCNet) to solve the optimal solution of semantic segmentation.They utilized multiple adaptive context modules (ACMs) to build multiscale contextual feature representations; each ACM used the global image representation to estimate the local affinity weights of each subregion and calculated the optimal context vector according to these local affinity weights.
Ye et al. [76] developed an enhanced feature pyramid network (EFPN), which combines a semantic enhancement module (SEM), an edge extraction module (EEM), and a context aggregation model (CAM) in the decoder network to improve the robustness of multi-level feature fusion, and adds a global fusion model (GFM) to the encoder network to capture deeper semantic information and transmit it to each layer efficiently. Among them, the SEM upgrades the ASPP module by using smaller dilation rates to enhance low-level features and by replacing the pooling layer with a short residual connection in post-processing to avoid the loss of shallow semantic information, which simplifies the network with denser connections.
Wu et al. [77] proposed FPANet, a feature pyramid aggregation network for real-time semantic segmentation. FPANet is also an encoder-decoder model, using ResNet and ASPP in the encoder stage and a semantic bidirectional feature pyramid network (SeBiFPN) in the decoder stage. Reducing the number of feature channels with a lightweight feature pyramid fusion module (FPFM), the SeBiFPN is utilized to obtain both the semantic and spatial information of images and to fuse features of different levels.

Attention Mechanisms
To represent the dependency between different regions in an image, especially long-distance regions, and obtain their semantic relevance, some methods commonly used in the field of natural language processing (NLP) have been applied to computer vision and have achieved good results in semantic segmentation. The attention mechanism was first introduced into the computer vision field in 2014, when the Google DeepMind team [78] adopted a recurrent neural network (RNN) model to apply attention mechanisms to image classification, making attention mechanisms gradually popular in image processing tasks.
An RNN can model the short-term dependence between pixels, connecting pixels and processing them sequentially so as to establish a global context relationship. Visin et al. [79] proposed the ReSeg network based on ReNet [80]; each ReNet layer consists of four RNNs that sweep horizontally and vertically in both directions across the image to obtain global information. The ReSeg architecture is shown in Figure 11.
Both RNNs and LSTMs have their limitations, e.g., weakened long-distance dependence, a large number of parameters, and no support for parallel operation. Oktay et al. [83] proposed attention U-Net, as shown in Figure 12, which introduces an attention mechanism into U-Net. Before splicing the features at each resolution of the encoder with the corresponding features in the decoder, they used attention gate (AG) modules to supervise the features of the previous layer through the features of the next layer, thus readjusting the output features of the encoder. The AG modules adjust the activation values adaptively by generating a gating signal and progressively suppress the feature responses of unrelated background regions to control the importance of different spatial features. Pal et al. [84] proposed attention UW-Net, which achieved good performance on medical chest X-ray images. Attention UW-Net improves a skip connection of the U-Net segmentation network, i.e., a dense connection is added between the B-5 and B-6 blocks of the original U-Net architecture, which allows the network to learn the details lost in the previous max-pooling and effectively reduces information loss. In addition, an improved attention gate is designed, which modifies the resampling of the attention vectors by copying the vector space in the channel attention, thereby better attending to salient regions and suppressing irrelevant background regions.
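The additive attention gate just described can be sketched in a few lines: coarser gating features score each spatial position of the skip features, and the skip connection is re-weighted by the resulting (0, 1) attention map. All weight matrices and shapes below are random illustrative assumptions, in the spirit of attention U-Net rather than its exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, Wx, Wg, psi):
    """Additive attention gate: x is the (H*W, Cx) skip-connection
    feature map, g the (H*W, Cg) gating features from the coarser
    decoder level. Returns the re-weighted skip features and the
    per-position attention map that suppresses background responses."""
    q = np.maximum(x @ Wx + g @ Wg, 0)     # joint features + ReLU
    alpha = sigmoid(q @ psi)               # (H*W, 1) attention in (0, 1)
    return x * alpha, alpha

rng = np.random.default_rng(0)
H = W = 4
Cx, Cg, Ci = 8, 8, 4                       # channel sizes (illustrative)
x = rng.normal(size=(H * W, Cx))
g = rng.normal(size=(H * W, Cg))
Wx = rng.normal(size=(Cx, Ci))
Wg = rng.normal(size=(Cg, Ci))
psi = rng.normal(size=(Ci, 1))
gated, alpha = attention_gate(x, g, Wx, Wg, psi)
```

Because alpha multiplies the skip features element-wise per position, positions the gating signal deems irrelevant are driven toward zero before the encoder-decoder concatenation.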
Self-attention mechanisms are mostly used in the encoder network to represent the correlation between different regions (pixels) or different channels of the feature maps. Self-attention computes a weighted sum of pairwise affinities across all positions of a single sample to update the feature at each position. Self-attention mechanisms have produced many influential achievements in image segmentation, e.g., PSANet [85], DANet [86], APCNet [75], CARAFE [87], and CARAFE++ [88].
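The "weighted sum of pairwise affinities" is exactly scaled dot-product attention, which can be written out directly. A minimal single-head NumPy sketch (random projection matrices, illustrative dimensions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: each position's output is a
    weighted sum over ALL positions, with weights softmax(QK^T/sqrt(d)).
    This pairwise affinity matrix is what lets one pixel attend to
    arbitrarily distant pixels in a single step."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, dm, d = 6, 8, 4              # 6 positions, model dim 8, head dim 4
X = rng.normal(size=(n, dm))
out, A = self_attention(X,
                        rng.normal(size=(dm, d)),
                        rng.normal(size=(dm, d)),
                        rng.normal(size=(dm, d)))
```

Note the O(n²) affinity matrix A: for an H × W feature map, n = H·W, which is the cost that window-based designs such as the swin transformer are built to avoid.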
In 2017, Vaswani et al. [89] proposed the transformer, a deep neural network based solely on a self-attention mechanism, dispensing with convolutions and recurrence entirely. Thereafter, the transformer and its variants (i.e., X-transformers) were applied in the field of computer vision. Combining the self-attention mechanism of the transformer with CNN pre-training models, improved networks [90,91] achieved breakthroughs. Dosovitskiy et al. [92] proposed the vision transformer (ViT), which proved that a transformer could substitute for a CNN in the classification and prediction of image patch sequences. As shown in Figure 13, they divided the image into patches of fixed size, flattened the image patches into a sequence, and input the patch sequence vectors into a transformer encoder (the right-hand diagram), which consists of alternating multi-head attention layers and multi-layer perceptrons (MLPs). Liu et al. [93] developed the swin transformer, which has achieved impressive performance in image semantic segmentation and instance segmentation. The swin transformer advances the sliding window approach: it builds hierarchical feature maps by merging image patches in deeper layers, calculates self-attention within each local window, and alternates cyclic-shifting window partitions in consecutive swin transformer blocks to introduce cross-window connections between neighboring non-overlapping windows. The swin transformer replaces the standard multi-head self-attention (MSA) module in a transformer block with the shifted-window approach, with the other layers remaining the same, as shown in Figure 14.
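The patch-partition step shared by ViT and the swin transformer, and the cyclic shift that swin alternates between blocks, are both simple tensor reshapes. A toy NumPy sketch (single channel, illustrative function names; real models add a learned linear embedding after flattening):

```python
import numpy as np

def patchify(img, p):
    """ViT-style tokenization: split an H x W x C image into
    non-overlapping p x p patches and flatten each into a vector,
    giving the (num_patches, p*p*C) token sequence for the encoder."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)

def shift_windows(img, p):
    """Swin-style cyclic shift: roll the map by p//2 before windowing,
    so consecutive blocks see different (cross-window) partitions."""
    return np.roll(img, shift=(-(p // 2), -(p // 2)), axis=(0, 1))

img = np.arange(4 * 4 * 1.0).reshape(4, 4, 1)
tokens = patchify(img, 2)                    # 4 patches of 4 values each
shifted_tokens = patchify(shift_windows(img, 2), 2)
```

In swin, attention is computed only within each window's tokens; alternating the shifted partition is what propagates information between neighboring windows without ever paying the global O(n²) attention cost.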

Conclusions
Following the chronological evolution of image segmentation technology, we have comprehensively reviewed the classic segmentation algorithms and the currently popular deep learning algorithms, elaborated on the representative solutions of each stage, and enumerated the influential classic algorithms. In general, the development of image segmentation shows a trend from coarse-grained to fine-grained, from manual feature extraction to adaptive learning, and from single-image-oriented segmentation to segmentation based on the common features of big data.
With the development of image acquisition technology, the types of images are becoming more varied, which brings new segmentation challenges across different dimensions, scales, resolutions, and imaging modes. Researchers therefore expect a general network with improved adaptability and generalization ability [94]. Since the FCN was proposed, deep neural network research has shown clear advantages in scene understanding and object recognition. Future research still focuses on deep neural networks, aiming to further improve their accuracy, real-time performance, and robustness. With the breakthrough made by the Swin Transformer in computer vision in 2021, image segmentation has entered the transformer stage from the CNN stage, and transformers may bring new advances to computer vision research. Nevertheless, deep learning also has its shortcomings; for example, its decisions are hard to interpret, which limits the robustness, reliability, and performance optimization of downstream tasks. The current research directions and challenges of image segmentation are as follows:
1. Semantic segmentation, instance segmentation, and panoptic segmentation are still the research hotspots of image segmentation. Instance segmentation predicts the pixel regions contained in each instance; panoptic segmentation integrates semantic and instance segmentation, assigning both a category label and an instance ID to each pixel of the image. Especially in panoptic segmentation, countable objects ("things") and uncountable regions ("stuff") are difficult to recognize in a single workflow, so building an effective network that simultaneously handles large inter-category differences and small intra-category differences remains challenging.
2. With the popularization of image acquisition equipment (e.g., LiDAR cameras), RGB-depth, 3D point cloud, voxel, and mesh segmentation have gradually become research hotspots, with wide demand in face recognition [95], autonomous vehicles, VR, AR, architectural modeling, etc. Although there has been some progress in 3D image segmentation, e.g., region growing, random walks, and clustering among classic algorithms, and SVM, random forest, and AdaBoost among machine learning algorithms, the representation and processing of 3D data, which are unstructured, redundant, disordered, and unevenly distributed, remain a major challenge.
3. In some fields, it is difficult to train networks with supervised learning due to a lack of datasets or fine-grained annotations. Semi-supervised and unsupervised semantic segmentation can be chosen in these cases: the network is first trained on a benchmark dataset, the lower-level parameters are then fixed, and the fully connected layer or some high-level parameters are trained on the small-sample dataset. This is transfer learning, which does not require abundant labeled samples. Reinforcement learning is another possible solution, but it is rarely studied in the field of image segmentation. In addition, few-shot image semantic segmentation is also a hot research direction.
4. Deep learning networks require a significant amount of computing resources during training, which reflects the computational complexity of deep neural networks. Real-time (or near real-time) segmentation is required in many fields; e.g., video processing must reach at least 25 fps to match human visual perception, and most current networks are far below this frame rate. Some lightweight networks have improved segmentation speed to a certain extent, but there is still considerable room for improvement in balancing model accuracy and real-time performance.
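The 25 fps requirement translates into a budget of 40 ms per frame. A rough way to check a model against that budget is to time its forward pass directly; `dummy_segment` below is a hypothetical stand-in for a real segmentation network.

```python
import time

def measure_fps(segment_fn, frame, n_frames=50):
    """Rough throughput check: frames per second that a segmentation
    routine sustains, versus the ~25 fps (40 ms/frame) video budget."""
    start = time.perf_counter()
    for _ in range(n_frames):
        segment_fn(frame)
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

# Toy thresholding stands in for a real network's forward pass
dummy_segment = lambda frame: [p > 128 for p in frame]
fps = measure_fps(dummy_segment, frame=list(range(256)))
print(f"{fps:.0f} fps -> per-frame time: {1000 / fps:.2f} ms (real-time limit: 40 ms)")
```

For a real network the same harness would wrap the model's inference call, ideally after a few warm-up iterations so that lazy initialization does not distort the measurement.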

Figure 1. The categories of image segmentation methods.

Figure 4. Two examples of co-segmentation results.

Figure 5. An illustration of hierarchical graph clustering constructed between two images. Figure from [42].

The method of [43] used a similarity matrix based on feature positions and color vectors to represent the local information in a single image, i.e., spectral clustering. According to the local information and the feature mapping relation, expectation maximization (EM) was used to minimize the classification discriminant function and obtain a set of parameters. The algorithm could effectively co-segment multiple classes and a significantly larger number of images.
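As an illustration of the kind of similarity matrix used in such spectral methods, the sketch below combines color and position distances with Gaussian kernels. The bandwidths `sigma_c` and `sigma_p` are illustrative choices, not values from [43], and the EM fitting step is omitted.

```python
import numpy as np

def affinity_matrix(colors, positions, sigma_c=0.5, sigma_p=2.0):
    """Pairwise similarity from color vectors and feature positions:
    W_ij = exp(-|c_i - c_j|^2 / sigma_c^2) * exp(-|p_i - p_j|^2 / sigma_p^2).
    Features that are close in both color and space get weights near 1.
    """
    dc = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1) ** 2
    dp = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1) ** 2
    return np.exp(-dc / sigma_c**2) * np.exp(-dp / sigma_p**2)

# Four features: two reddish points close together, two bluish points far away
colors = np.array([[1.0, 0, 0], [0.9, 0, 0], [0, 0, 1.0], [0, 0, 0.9]])
positions = np.array([[0.0, 0], [1, 0], [10, 0], [11, 0]])
W = affinity_matrix(colors, positions)
print(np.round(W, 3))  # near block-diagonal: high within groups, ~0 across
```

Spectral clustering then operates on the eigenvectors of (a normalized form of) this matrix, so a cleanly block-structured affinity directly yields a clean segmentation.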


Electronics 2023, 25
Figure 6. Framework of the co-segmentation based on the shortest path algorithm. Figure from [44].

Figure 7. Fully convolutional networks architecture.

Subsequent networks were advanced based on the FCN model. The following section introduces the main technologies and representative models from the perspective of how semantic segmentation networks work. The main semantic segmentation algorithms based on deep learning are shown in Table 2.
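The core FCN idea, a per-pixel classifier implemented as convolution followed by upsampling back to input resolution, can be sketched as follows. Nearest-neighbor upsampling stands in for the learned (deconvolutional) upsampling of the actual network, and all arrays are random toy data.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: a per-pixel linear classifier over channels,
    which is how an FCN turns fully connected layers into convolutions."""
    return features @ weights            # (H, W, C_in) @ (C_in, n_classes)

def upsample_nearest(score_map, factor):
    """Nearest-neighbor upsampling back to input resolution (a simple
    stand-in for the FCN's learned deconvolution)."""
    return score_map.repeat(factor, axis=0).repeat(factor, axis=1)

rng = np.random.default_rng(1)
features = rng.normal(size=(4, 4, 8))    # coarse feature map after downsampling
weights = rng.normal(size=(8, 3))        # 3-class classification head
scores = conv1x1(features, weights)      # (4, 4, 3) coarse class scores
full = upsample_nearest(scores, 8)       # (32, 32, 3) dense per-pixel scores
labels = full.argmax(axis=-1)            # (32, 32) predicted segmentation map
print(labels.shape)
```

Because every operation is convolutional, the same network accepts inputs of arbitrary size and emits a correspondingly sized dense prediction, which is precisely what separates the FCN from classification networks with fixed-size fully connected heads.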

Figure 11. The ReSeg architecture. Figure from [79].

LSTM (long short-term memory) adds a mechanism for recording long-term memory, which can represent long-distance dependence. Byeon et al. [81] used LSTM to achieve pixel-by-pixel segmentation of scene images, proving that image texture information and spatial model parameters could be learned in a 2D LSTM model. Liang et al. [82] proposed a semantic segmentation model based on the graph LSTM model, which extended LSTM from sequential or multidimensional data to a general graph structure, further enhancing the global context visual features.

Both RNN and LSTM have their limitations, e.g., weakened long-distance dependence, too many parameters, and no support for parallel operations. Oktay et al. [83] proposed attention U-Net, as shown in Figure 12, which introduced an attention mechanism into U-Net. Before splicing the features at each resolution of the encoder with the corresponding features in the decoder, they used attention gate (AG) modules to supervise the features of the previous layer through the features of the next layer, thus readjusting the output features of the encoder. The AG modules adjust the activation value adaptively by generating a gated signal and progressively suppress the feature responses of unrelated background regions to control the importance of different spatial features. Pal et al. [84] proposed attention UW-Net, which achieved a good performance on medical chest X-ray images. Attention UW-Net improves the skip connections of the U-Net segmentation network, i.e., a dense connection is added between the B-5 and B-6 blocks of the original U-Net architecture, which allows the network to learn details lost in the previous max-pooling and effectively reduces information loss. In addition, an improved attention gate is designed that modifies the resampling of the attention vectors by copying the vector space in the channel attention, which better realizes attention to the salient region and suppression of the irrelevant background region.
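The additive attention gate described above can be sketched as follows. The weight shapes and the single-scale setting are simplifications: in attention U-Net the gating signal comes from a coarser decoder level and is resampled to the skip connection's resolution first, and all parameters here are random toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate in the spirit of attention U-Net:
    alpha = sigmoid(psi(relu(W_x x + W_g g))), output = x * alpha.
    x: encoder skip features (H, W, C); g: decoder gating signal (H, W, C_g).
    """
    s = np.maximum(x @ W_x + g @ W_g, 0.0)    # (H, W, C_int), ReLU activation
    alpha = sigmoid(s @ psi)                   # (H, W, 1) attention coefficients
    return x * alpha, alpha                    # gated skip features, attention map

rng = np.random.default_rng(7)
H, W_, C, Cg, Ci = 8, 8, 16, 32, 8
x = rng.normal(size=(H, W_, C))               # skip connection features
g = rng.normal(size=(H, W_, Cg))              # gating signal from the decoder
gated, alpha = attention_gate(
    x, g,
    W_x=rng.normal(size=(C, Ci)),
    W_g=rng.normal(size=(Cg, Ci)),
    psi=rng.normal(size=(Ci, 1)),
)
print(gated.shape, alpha.shape)
```

Since `alpha` lies in (0, 1) per pixel, multiplying it into the skip features suppresses background responses while passing salient regions through, which is the adaptive readjustment of encoder outputs described above.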

Table 2. Comparison and analysis of semantic segmentation methods based on deep learning.