Article

FP-Deeplab: A Novel Face Parsing Network for Fine-Grained Boundary Detection and Semantic Understanding

1 College of Information Engineering, Sichuan Agricultural University, Ya’an 625000, China
2 College of Art and Media, Sichuan Agricultural University, Ya’an 625000, China
3 Sichuan Key Laboratory of Agricultural Information Engineering, Ya’an 625014, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(11), 6016; https://doi.org/10.3390/app15116016
Submission received: 27 April 2025 / Revised: 21 May 2025 / Accepted: 26 May 2025 / Published: 27 May 2025

Abstract

Facial semantic segmentation, as a critical technology in high-level visual understanding, plays an important role in applications such as facial editing, augmented reality, and identity recognition. However, due to the complexity of facial structures, ambiguous boundaries, and the inconsistent scales of facial components, traditional methods still suffer from significant limitations in detail preservation and contextual modeling. To address these challenges, this paper proposes a facial parsing network based on the Deeplabv3+ framework, named FP-Deeplab, which aims to improve segmentation performance and generalization capability through structurally enhanced modules. Specifically, two key modules are designed: (1) the Context-Channel Refine Feature Enhancement (CCR-FE) module, which integrates multi-scale contextual strip convolutions with Cross-Axis Attention and introduces a channel attention mechanism to strengthen the modeling of long-range spatial dependencies and to enhance the perception and representation of boundary regions; (2) the Self-Modulation Attention Feature Integration with Regularization (SimFA) module, which combines local detail modeling with a parameter-free channel attention modulation mechanism to achieve fine-grained reconstruction and enhancement of semantic features, effectively mitigating boundary blur and information loss during the upsampling stage. Experimental results on two public facial segmentation datasets, CelebAMask-HQ and HELEN, demonstrate that FP-Deeplab improves the baseline model by 3.8% in Mean IoU and 2.3% in overall F1-score on the HELEN dataset, and it achieves a Mean F1-score of 84.8% on the CelebAMask-HQ dataset. Furthermore, the proposed method shows superior accuracy and robustness in multiple key component categories, especially in long-tailed regions, validating its effectiveness.

1. Introduction

Face parsing, which is also referred to as fine-grained facial semantic segmentation, is a fundamental task in the field of computer vision. Its goal is to perform pixel-level classification of various semantic components in facial images, such as eyes, nose, mouth, eyebrows, and more. Compared with traditional face detection and recognition tasks, face parsing places greater emphasis on structural hierarchy modeling and part-level detail representation, facilitating a high-precision semantic understanding of facial regions. This technique has been widely applied in various practical scenarios, including virtual reality [1], augmented reality [2], intelligent beautification, expression-driven [3] face synthesis, human–computer interaction, and identity recognition, and holds significant research value and application prospects.
However, compared to general semantic segmentation tasks [4], face parsing presents more stringent challenges in structural complexity, semantic boundary clarity, and spatial dependency modeling. First, facial components are densely arranged with fine-grained structures, and large inter-individual variations in facial features such as shape, scale, and spatial layout make it difficult for models to learn a unified and effective feature representation. Second, real-world conditions such as facial expression variation, pose changes, lighting fluctuations, and occlusions (e.g., hair and glasses) introduce severe interference in local detail extraction, undermining model robustness. Furthermore, facial regions exhibit strong semantic interdependencies—for instance, structural correlations exist between the eyebrows and eyes and between the lips and mouth region—which demand powerful global modeling capabilities to capture long-range relationships. In boundary modeling, the visual similarity of facial parts, like hairline and eyebrows, to the background in color or texture often leads to boundary ambiguity and misclassification. Especially in high-resolution image processing scenarios [5], coarse feature restoration and boundary information loss during upsampling exacerbate the deficiency in key structural representation, thereby degrading overall segmentation quality.
In recent years, despite the significant progress made in semantic segmentation methods based on deep learning, widely adopted strategies such as multi-scale feature extraction, self-attention mechanisms, and global context modeling still face many challenges in facial parsing tasks. Conventional CNN-based networks [6] are limited by their local receptive fields, making it difficult to effectively model long-range dependencies. Although Transformer-based attention mechanisms [7] possess strong contextual modeling capabilities, their high computational cost on high-resolution feature maps often results in resource bottlenecks. Moreover, current upsampling stages generally suffer from coarse feature fusion and loss of boundary details.
To address the above challenges, this paper proposes a novel facial parsing network named FP-Deeplab. Built upon the DeepLabv3+ framework [8], the proposed network integrates contextual awareness, multi-scale information fusion, and fine-grained attention-guided strategies to systematically enhance both global structure modeling and local detail preservation in facial semantic segmentation. FP-Deeplab constructs hierarchical modules to enrich semantic representations and recover structural details, which are particularly suitable for challenging scenarios such as blurry facial boundaries, large-scale variations, and hard-to-segment long-tail categories. Specifically, the main contributions of this paper are as follows:
  • A Cross-Axis Attention mechanism is introduced, which establishes axial attention along both horizontal and vertical spatial dimensions to enhance long-range pixel-level dependency modeling, improving the semantic consistency and segmentation robustness of facial structures.
  • A Context-Channel Refine Feature Enhancement (CCR-FE) module is proposed to optimize the original ASPP structure by combining multi-scale strip convolutions and a channel attention mechanism, effectively enhancing the perception and representation of local structures and complex facial regions.
  • A SimFA module is developed to refine the feature fusion process in the upsampling stage. By leveraging local feature enhancement and a self-modulated attention mechanism, this module enables adaptive semantic restoration, effectively alleviating boundary blur and structural discontinuities, thus producing more stable and clearer face segmentation results.

2. Related Work

2.1. Semantic Segmentation Methods

Semantic segmentation is a fundamental task in computer vision, aiming to assign a precise class label to each pixel in an image. Traditional approaches primarily relied on handcrafted features and graphical models such as Conditional Random Fields (CRFs) [9] and Markov Random Fields (MRFs) [10]. While effective in regular structures or simplified scenes, these methods perform poorly in complex backgrounds, non-rigid structures, and fine-grained categories. With the rise of deep learning, Fully Convolutional Networks (FCNs) [11] marked the beginning of end-to-end pixel-wise classification. Subsequently, U-Net [12] employed a symmetric encoder–decoder architecture to efficiently fuse low-level and high-level features, achieving remarkable success in medical image segmentation. The DeepLab series [13] further enhanced multi-scale context modeling through atrous convolution, Atrous Spatial Pyramid Pooling (ASPP), and encoder–decoder designs. PSPNet [14] introduced pyramid pooling modules to aggregate global contextual information, thereby strengthening semantic representation.
In addition, to capture larger receptive fields and model long-range dependencies, models such as OCRNet [15] and SegFormer [16] introduced contextual reconstruction mechanisms, attention modules, and Transformer architectures, significantly improving segmentation accuracy. However, these general-purpose models are often designed for large-scale object segmentation in urban scenes or natural images, and they may underperform in domains like facial parsing, which require highly detailed and compact semantic structure modeling.

2.2. Face Parsing Techniques

Face parsing is a fine-grained subtask of semantic segmentation that aims to decompose facial images into multiple semantic components, including eyes, eyebrows, nose, mouth, hairline, ears, and so forth. Compared with general semantic segmentation, face parsing involves more complex structural relationships and shape deformations, demanding higher spatial parsing accuracy and semantic modeling capabilities. Early approaches were mainly based on shape priors and handcrafted features, such as Active Shape Models (ASMs) and Active Appearance Models (AAMs). However, these techniques performed poorly under variations in expression, lighting, and pose. With the advancement of deep learning, CNN-based face parsing methods have become mainstream. Representative works include extended applications of FCNs, Interlinked CNN (ICNN) [17], and EHANet [18], which achieved more stable and efficient feature extraction and semantic prediction under end-to-end frameworks.
Recently, Transformer-based architectures have also been introduced into facial parsing tasks. Architectures such as ViT [19] and Swin Transformer [20] leverage global attention mechanisms to improve semantic consistency. Representative models include Parsing-R-CNN [21] and HRFormer [22]. However, Transformer models tend to be computationally expensive, rely heavily on large-scale pretraining, and still face difficulties in balancing fine boundary detail preservation with high-level semantic completeness.

2.3. Feature Fusion and Attention Mechanisms

To enhance a model’s ability to represent multi-scale information and respond to key regions, researchers have extensively adopted feature fusion strategies and attention mechanisms. Feature pyramid structures, such as FPN, strengthen feature representations across different semantic layers. Spatial Pyramid Pooling (SPP) and Atrous Spatial Pyramid Pooling (ASPP) further improve the integration and perception of contextual information. On this basis, various attention mechanisms have been employed to enhance the model’s ability to focus on discriminative features. Self-attention, channel attention mechanisms, such as SE-Net [23], and spatial attention modules have shown efficacy in improving feature representation across different dimensions. Axial attention [24], which decomposes attention computation along horizontal and vertical directions, significantly reduces computational complexity while preserving strong spatial modeling capacity, and it has achieved promising results in segmentation tasks.
Although the aforementioned methods effectively improve the model’s understanding of semantics and structure, face parsing remains uniquely challenging due to the complex structural dependencies and semantic interactions among pixels. Therefore, how to simultaneously enhance global context modeling and preserve boundary details and local texture remains an urgent problem that needs to be addressed.

3. Methods

To address the challenges of long-range dependency modeling and detail preservation in face parsing, we propose FP-Deeplab, which, as shown in Figure 1, integrates three core components. First, the Cross-Axis Attention mechanism [25] constructs dual attention paths along horizontal and vertical directions to capture global dependencies through cross-directional feature interaction, enabling more accurate modeling of complex facial structures. Second, the Context-Channel Refine Feature Enhancement (CCR-FE) module enhances the ASPP structure by combining multi-scale strip convolutions and channel attention to strengthen feature representation and highlight critical regions. Finally, the Self-Modulation Attention Feature Integration with Regularization (SimFA) module fuses high- and low-level features using a parameter-free attention mechanism, improving semantic recovery and preserving boundary details during upsampling. Together, these modules enable FP-Deeplab to achieve fine-grained facial parsing with strong structural awareness and robust detail fidelity.
Through the above innovations, FP-Deeplab constructs a cross-dimensional context modeling mechanism and a fine-grained semantic enhancement strategy, achieving a coordinated optimization of global structural perception and local detail representation. Even under complex conditions such as occlusion and class imbalance (long-tail categories), the network maintains high segmentation accuracy, demonstrating strong practicality and generalization ability.

3.1. Cross-Axis Attention

Establishing long-range interactions between pixels is crucial for capturing spatial structures or shapes in segmentation tasks. However, in the specific context of face parsing, the significant distance variations among different facial components make such interactions particularly challenging. Axial attention has been proposed as an alternative to the standard self-attention mechanism by decomposing attention into two separate branches, each operating along the horizontal or vertical dimension. Nevertheless, conventional axial attention suffers from limited information integration when fusing features from these two directions, which restricts its capacity for global dependency modeling.
Cross-Axis Attention [25] constructs dual cross-attention mechanisms along two spatial dimensions to better exploit directional information extracted from axial attention. As illustrated in Figure 2, it consists of two parallel branches that compute horizontal and vertical axial attention separately. $F_x$ and $F_y$ represent multi-scale contextual features captured using strip convolutions with different shapes. Taking the upper branch in Figure 2 as an example, $F_x$ is passed to compute attention along the y-axis. To better leverage the multi-scale contextual features from both spatial directions, Cross-Axis Attention computes cross-attention between $F_x$ and $F_y$. Specifically, $F_x$ is used as both the key and value matrices, while $F_y$ serves as the query matrix. The computation is as follows:
$F_A = \mathrm{CA}_y(F_y, F_x, F_x),$
Here, $\mathrm{CA}_y$ denotes the cross-attention operation along the y-axis. Similarly, in the lower branch, the contextual encoding is performed along the x-axis using an analogous approach. The computation is as follows:
$F_B = \mathrm{CA}_x(F_x, F_y, F_y),$
Here, $\mathrm{CA}_x$ denotes the cross-attention operation along the x-axis.
In this study, Cross-Axis Attention is employed to capture global dependencies, thereby enhancing the network’s ability to extract features from irregular facial components. This approach effectively addresses the accuracy issues caused by boundary ambiguity in segmentation tasks, making the network more suitable for complex face parsing scenarios and improving overall segmentation performance.
It is worth noting that facial images typically exhibit a natural bilateral symmetry. The design of Cross-Axis Attention, which computes attention along both horizontal and vertical directions, inherently supports the modeling of semantic relationships across symmetric facial regions, such as the eyes or the corners of the mouth. This bidirectional structure enhances feature interaction across spatially mirrored parts, thereby improving the consistency of structural understanding and reinforcing boundary delineation in symmetrical areas.
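To make the two cross-attention branches concrete, the following is a minimal PyTorch sketch of Cross-Axis Attention under stated assumptions: multi-head attention is applied column-wise for $\mathrm{CA}_y$ and row-wise for $\mathrm{CA}_x$, and the head count, tensor layout, and use of nn.MultiheadAttention are illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CrossAxisAttention(nn.Module):
    """Minimal sketch of Cross-Axis Attention producing F_A and F_B.

    F_x, F_y: multi-scale contextual features of shape (B, C, H, W).
    The head count and nn.MultiheadAttention are illustrative assumptions.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn_y = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.attn_x = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    @staticmethod
    def _columns(t):  # (B, C, H, W) -> (B*W, H, C): each column is a sequence
        b, c, h, w = t.shape
        return t.permute(0, 3, 2, 1).reshape(b * w, h, c)

    @staticmethod
    def _rows(t):     # (B, C, H, W) -> (B*H, W, C): each row is a sequence
        b, c, h, w = t.shape
        return t.permute(0, 2, 3, 1).reshape(b * h, w, c)

    def forward(self, f_x, f_y):
        b, c, h, w = f_x.shape
        # F_A = CA_y(F_y, F_x, F_x): query from F_y, key/value from F_x, along the y-axis
        q, kv = self._columns(f_y), self._columns(f_x)
        f_a, _ = self.attn_y(q, kv, kv)
        f_a = f_a.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # F_B = CA_x(F_x, F_y, F_y): query from F_x, key/value from F_y, along the x-axis
        q, kv = self._rows(f_x), self._rows(f_y)
        f_b, _ = self.attn_x(q, kv, kv)
        f_b = f_b.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return f_a, f_b
```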

3.2. CCR-FE Module

The traditional Atrous Spatial Pyramid Pooling (ASPP) module expands the receptive field using different dilation rates to extract multi-scale local features. However, ASPP mainly relies on simple channel concatenation to integrate information across scales without fully exploiting long-range dependencies and semantic interactions between channels. This limitation restricts its representational capacity, particularly in scenarios involving blurred boundaries or fine-grained structural segmentation, potentially leading to information loss.
To address these issues, we propose the Context-Channel Refine Feature Enhancement (CCR-FE) module. Built upon the multi-scale features generated by ASPP, CCR-FE employs strip convolutions of varying scales to capture contextual information related to the segmentation targets. These multi-scale contextual features are subsequently fed into the Cross-Axis Attention module to extract spatial global dependencies. Additionally, a lightweight channel attention mechanism is incorporated to optimize the features and enhance the model’s sensitivity to critical regions.
Given the input feature map $F$ (i.e., the output of ASPP), CCR-FE first applies multi-scale strip convolutions to extract contextual information for the segmentation targets. Specifically, depthwise separable strip convolutions of different kernel sizes are performed along the horizontal and vertical directions to enhance the model’s perception of both local and global structures. The horizontal branch ($\mathrm{MSC}_x$) and vertical branch ($\mathrm{MSC}_y$) use kernel sizes of $\{1\times7, 1\times11, 1\times21\}$ and $\{7\times1, 11\times1, 21\times1\}$, respectively, to simulate multi-scale receptive fields while keeping the computational cost low. These operations are conducted on the 256-dimensional input features produced by ASPP. The output is then processed by a $1\times1$ convolution, and the operations can be formulated as follows:
$F_x = \mathrm{Conv}_{1\times1}(\mathrm{MSC}_x(\mathrm{Norm}(F))),$
$F_y = \mathrm{Conv}_{1\times1}(\mathrm{MSC}_y(\mathrm{Norm}(F))),$
Specifically, $\mathrm{MSC}_x$ denotes the multi-scale strip convolution operation along the x-axis, while $\mathrm{MSC}_y$ represents the same along the y-axis. $\mathrm{Norm}(\cdot)$ refers to the layer normalization operation. The resulting features $F_x$ and $F_y$ encode rich, multi-scale contextual information. These extracted contextual features are subsequently fed into the Cross-Axis Attention module [25] to model spatial global dependencies. By performing cross-attention along both horizontal and vertical directions, Cross-Axis Attention effectively enhances long-range pixel-wise interactions. In particular, Cross-Axis Attention is implemented via dual attention branches: one computes attention maps along rows and the other along columns, followed by a reshape and weighted aggregation. This design allows interaction between distant spatial regions without introducing a large computational burden.
Given the outputs $F_A$ and $F_B$ obtained from the aforementioned process, the final contextual representation can be formulated as
$F_{\mathrm{Cont}} = F_A + F_B + F,$
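As a complement, the sketch below illustrates how the multi-scale strip-convolution branches $\mathrm{MSC}_x$ and $\mathrm{MSC}_y$ could be realized in PyTorch and combined with the Cross-Axis Attention sketch above to form $F_{\mathrm{Cont}}$; summing the three strip outputs and using GroupNorm as the normalization layer are assumptions made for illustration, not details taken from the released code.

```python
import torch
import torch.nn as nn

class StripContextBranch(nn.Module):
    """Sketch of one CCR-FE context branch: Conv1x1(MSC(Norm(F))).

    Kernel sizes {1x7, 1x11, 1x21} (or their transposes) follow the text;
    summing the strip outputs and GroupNorm(1, C) are illustrative assumptions.
    """
    def __init__(self, channels: int = 256, horizontal: bool = True):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)  # layer-norm-like normalization over channels
        self.strips = nn.ModuleList()
        for k in (7, 11, 21):
            ksize = (1, k) if horizontal else (k, 1)
            pad = (0, k // 2) if horizontal else (k // 2, 0)
            # depthwise strip convolution (groups=channels keeps the cost low)
            self.strips.append(nn.Conv2d(channels, channels, ksize, padding=pad, groups=channels))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f):
        f = self.norm(f)
        return self.proj(sum(strip(f) for strip in self.strips))

# F_x and F_y feed the Cross-Axis Attention sketch above; the residual sum
# then gives F_Cont = F_A + F_B + F.
msc_x, msc_y = StripContextBranch(256, True), StripContextBranch(256, False)
```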
In addition, to further optimize feature representations, this study incorporates a channel attention mechanism within the CCR-FE module to compute the importance weights of individual channels. Specifically, each channel of the input feature map is first compressed into a global descriptor using adaptive average pooling. This is followed by a $1\times1$ convolution and a ReLU activation function [26] to generate channel-wise attention weights. These weights are then upsampled to match the spatial dimensions of the input and applied via element-wise multiplication, thereby recalibrating channel-wise features and enhancing the expression of more informative channels. The computation can be formulated as follows:
$F_{\mathrm{Chan}} = F \cdot \mathrm{Interpolate}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(\mathrm{AAP}(F)))),$
The attention branch reduces the channel dimension from $C$ to $C/r$ (with $r = 4$) and then restores it back to $C$ to ensure both parameter efficiency and expressive power. The use of interpolation instead of transposed convolutions further maintains a lightweight design and smooth gradients. To enable the model to focus on more informative features, the CCR-FE module employs an element-wise maximum fusion strategy to compute the final enhanced features. Subsequently, the fused attention features are used to reweight the multi-scale features generated by ASPP. This process can be expressed as follows:
$F_{\mathrm{fuse}} = \max(F_{\mathrm{Cont}}, F_{\mathrm{Chan}}) \cdot F,$
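The channel branch and the maximum fusion can likewise be sketched as follows; the reduce-then-restore $1\times1$ convolutions with $r = 4$ follow the text, while pooling to a single $1\times1$ descriptor before interpolation is an assumption of this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelRefine(nn.Module):
    """Sketch of the CCR-FE channel branch and element-wise maximum fusion."""
    def __init__(self, channels: int = 256, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # AAP: global channel descriptor
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.restore = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, feat, f_cont):
        # F_Chan = F * Interpolate(ReLU(Conv1x1(AAP(F))))
        w = self.restore(F.relu(self.reduce(self.pool(feat))))
        w = F.interpolate(w, size=feat.shape[-2:], mode="nearest")
        f_chan = feat * w
        # F_fuse = max(F_Cont, F_Chan) * F   (element-wise maximum fusion)
        return torch.maximum(f_cont, f_chan) * feat
```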
By jointly leveraging both spatial and channel contexts, CCR-FE establishes complementary attention mechanisms that adaptively enhance salient features and suppress irrelevant background responses.
The CCR-FE module is not complex in structure, yet it effectively enhances the model’s feature representation capability by emphasizing more informative dimensions. This improvement significantly boosts the model’s performance in segmenting regions with blurred boundaries and fine-grained targets.

3.3. SimFA Module

In face parsing tasks, the semantic information of facial regions is highly diverse. For instance, areas such as the eyes, lips, and eyebrows typically exhibit sharp boundaries and rich textures, whereas regions like the cheeks and forehead tend to be smoother and lack distinct edges. Moreover, facial morphology varies significantly under different conditions such as pose, illumination, and occlusion, requiring the segmentation model to possess strong feature representation capabilities that can simultaneously capture the global structure and finely model local details.
However, existing feature fusion methods still face numerous challenges in expressing local details, enabling effective feature interaction, and recovering information during upsampling. Conventional fusion strategies for high- and low-level features mostly rely on simple concatenation or weighted summation, which fail to adequately model pixel-wise fine-grained relationships, thus limiting the model’s expressiveness for critical regions and resulting in blurred boundaries and loss of texture information. On the other hand, using only nonlocal attention mechanisms or local convolutional operations struggles to balance global and local representations, leading to information mismatch during multi-scale feature integration. Additionally, the semantic recovery capability during the upsampling process remains limited, often causing feature blurring and unclear boundaries, thereby degrading the final segmentation accuracy.
Inspired by SMFANet [27], this study designs a Self-Modulation Attention Feature Integration with Regularization (SimFA) module to address the aforementioned challenges, as illustrated in Figure 3. The SimFA module introduces a novel adaptive modulation strategy built on feature interaction and estimation, enabling the model to more precisely enhance the representation of critical facial regions during multi-scale feature integration. Simultaneously, it effectively suppresses background interference, thereby improving boundary sharpness and detail preservation in the segmentation results. Specifically, SimFA first extracts global structural information and local detailed features through two separate branches: nonlocal feature interaction and local detail estimation. Given an input feature map $H$, it is first expanded in the channel dimension via a $1\times1$ convolution and then divided into a global branch $E$ and a local branch $L$, as follows:
$\{E, L\} = \mathrm{Split}(\mathrm{Conv}_{1\times1}(H)),$
The $E$ branch acquires global structural information through an efficient approximation of the self-attention mechanism and enhances nonlocal feature interaction via a variance-based modulation strategy. Specifically, it first employs adaptive max pooling to obtain low-frequency features and then applies a $3\times3$ depthwise separable convolution [28] to extract structural information, resulting in $E_s$. Subsequently, the global variance of $E$, denoted as $\sigma^2(E)$, is computed as follows:
$\sigma^2(E) = \frac{1}{N}\sum_{i=1}^{N}(e_i - \mu)^2,$
where $N$ is the number of elements in $E$, $e_i$ is the $i$-th element, and $\mu$ is their mean. The modulation is then performed through a $1\times1$ convolution as follows:
$E_m = \mathrm{Conv}_{1\times1}(E_s + \sigma^2(E)).$
Subsequently, the GELU activation function [29] and nearest-neighbor upsampling are applied to obtain the final global feature representation $E_n$, enabling adaptive nonlocal feature aggregation.
The $L$ branch captures edge and texture details using a $3\times3$ depthwise separable convolution, followed by nonlinear activation to enhance feature expression. The operations are defined as
$L_p = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(L)),$
$L_d = \mathrm{Conv}_{1\times1}(\mathrm{GELU}(L_p)).$
The global feature $E_n$ and the local detail feature $L_d$ are then fused via element-wise addition to generate the integrated feature map $H_s$:
$H_s = E_n + L_d.$
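For concreteness, a minimal PyTorch sketch of the two SimFA branches is given below; the pooled spatial size and the exact placement of the $1\times1$ projections are assumptions, while the variance-modulated global branch and the depthwise local branch follow the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimFADualBranch(nn.Module):
    """Sketch of the SimFA global (E) and local (L) branches producing H_s."""
    def __init__(self, channels: int, pooled: int = 8):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * 2, 1)   # expand, then split into E and L
        self.pooled = pooled                                  # assumed pooled spatial size
        self.dw_e = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.mod = nn.Conv2d(channels, channels, 1)           # E_m = Conv1x1(E_s + sigma^2(E))
        self.dw_l = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.proj_l1 = nn.Conv2d(channels, channels, 1)
        self.proj_l2 = nn.Conv2d(channels, channels, 1)

    def forward(self, h):
        e, l = self.expand(h).chunk(2, dim=1)
        # Global branch: low-frequency structure plus variance-based modulation
        e_s = self.dw_e(F.adaptive_max_pool2d(e, self.pooled))
        var = e.var(dim=(2, 3), keepdim=True)                 # sigma^2(E), per channel
        e_n = F.interpolate(F.gelu(self.mod(e_s + var)), size=h.shape[-2:], mode="nearest")
        # Local branch: edge and texture details
        l_d = self.proj_l2(F.gelu(self.proj_l1(self.dw_l(l))))
        return e_n + l_d                                       # H_s
```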
To better integrate with low-level features, SimFA performs unified channel alignment after feature concatenation and adopts a structured dual-branch mechanism to semantically distinguish between global modeling and local enhancement, thereby avoiding the information conflict often caused by simple concatenation. Meanwhile, the module plays a fine-grained guidance role during the high-resolution feature reconstruction stage, enabling a more effective combination of shallow features with upsampled semantic information and improving the preservation of boundary details. To ensure semantic consistency during feature integration, the similarity between the upsampled decoder features and the encoder features is implicitly modeled through both spatial alignment and adaptive response modulation. This is achieved by capturing contextual correlation in the global branch and enhancing boundary-localized similarity in the local branch, which jointly determines the semantic similarity distribution across spatial positions. The resulting fused representation H s thus encodes both semantic consistency and discriminative detail, forming a basis for effective low–high feature merging.
Although the above mechanism achieves efficient feature aggregation through nonlocal interaction and local detail estimation, it still lacks dynamic modeling along the channel dimension, which may result in insufficient activation of fine-grained features and affect segmentation quality. To address this issue, SimFA further incorporates a channel-adaptive modulation mechanism inspired by SimAM [30].
Leveraging the ability of SimAM to model channel-wise attention, this mechanism computes intra-channel variance to adaptively modulate feature responses, assigning more appropriate activation strengths to different pixels. This allows the model to emphasize critical facial regions, such as boundaries and lips, while suppressing redundant background information, such as surrounding hair or earrings, thereby reducing interference and improving both segmentation accuracy and boundary clarity.
Additionally, the module performs fine-grained modulation during the upsampling phase, where the fused channel attention mask is used to reweight pixel responses, enabling adaptive feature recalibration. This effectively mitigates information redundancy and semantic conflict during semantic feature fusion, enhancing high-resolution feature recovery and yielding sharper edges and more precise contours of facial regions. The modulation process is calculated as
$X_{\mathrm{simam}} = \dfrac{(H_s - \mu)^2}{4\left(\frac{\sigma^2}{N} + \epsilon\right)} + 0.5,$
where $\mu$ is the channel-wise mean, $\sigma^2$ is the channel variance, $N$ is the number of pixels per channel, and $\epsilon$ is a small constant for numerical stability. Finally, the modulated feature map is obtained by element-wise multiplication with the Sigmoid-activated attention map [31]:
$H_{\mathrm{SimFA}} = H_s \cdot \mathrm{Sigmoid}(X_{\mathrm{simam}}),$
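A short sketch of this parameter-free modulation, implementing the two equations above directly, is shown below; $\epsilon$ is an assumed small stabilizing constant (playing the role of $\lambda$ in SimAM).

```python
import torch

def simam_modulation(h_s: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Parameter-free channel-adaptive modulation used at the end of SimFA.

    h_s: fused feature map of shape (B, C, H, W). eps is an assumed constant.
    """
    n = h_s.shape[2] * h_s.shape[3]                     # N: pixels per channel
    mu = h_s.mean(dim=(2, 3), keepdim=True)             # channel-wise mean
    d = (h_s - mu).pow(2)
    var = d.mean(dim=(2, 3), keepdim=True)              # sigma^2: channel variance
    x_simam = d / (4 * (var / n + eps)) + 0.5           # X_simam
    return h_s * torch.sigmoid(x_simam)                 # H_SimFA
```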
Overall, SimFA is not a simple fusion module but a comprehensive feature refinement framework that balances global structure and fine-grained details. It adaptively recalibrates responses, filters semantic noise, and enhances edge fidelity, significantly improving the semantic consistency and visual clarity of the final segmentation outputs.

4. Experiments

4.1. Datasets and Evaluation Metrics

Datasets. In this study, the proposed method was evaluated on the HELEN dataset [32], which is a challenging face parsing dataset that includes 11 semantic facial categories such as skin, left/right eyes, left/right eyebrows, upper/lower lips, inner mouth, nose, and hair. To ensure consistency and comparability with existing studies, the “hair” category was excluded from segmentation in our experiments. The dataset consists of 2330 high-quality images, among which 2000 were used for training, 230 for validation, and 100 for testing.
The model was also applied to the CelebAMask-HQ dataset [33], a large-scale facial semantic segmentation dataset with a resolution of 512 × 512. It contains 30,000 manually annotated images covering 19 semantic categories, including both facial components and accessories, such as skin, nose, eyes, eyebrows, ears, mouth, lips, hair, hat, earrings, necklace, neck, and clothing. According to the official data split, 24,183 images were used for training, 2993 for validation, and 2824 for testing.
Evaluation Metrics. For the CelebAMask-HQ dataset, two widely adopted evaluation metrics were employed: mean Intersection over Union (mIoU) [34] and F1-score [35]. For the HELEN dataset, the overall F1-score was used to assess the model’s performance, where the metric is computed over merged facial components: eyebrows (left + right), eyes (left + right), nose, and mouth (upper lip + lower lip + inner mouth). The formulas for the above metrics are defined as follows:
$\mathrm{mIoU} = \dfrac{1}{k+1}\sum_{i=0}^{k}\dfrac{p_{ii}}{\sum_{j=0}^{k}p_{ij} + \sum_{j=0}^{k}p_{ji} - p_{ii}}$
$F_1 = \dfrac{2 \times \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
where $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$, and $k+1$ is the total number of classes.
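For reference, both metrics can be computed from a pixel-level confusion matrix as in the short sketch below; this is an illustrative routine, not the authors' evaluation script.

```python
import numpy as np

def metrics_from_confusion(conf: np.ndarray):
    """Compute mIoU and per-class F1 from a (k+1) x (k+1) confusion matrix.

    conf[i, j] counts pixels of ground-truth class i predicted as class j
    (the p_ij of the mIoU formula above).
    """
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class i but belonging to another class
    fn = conf.sum(axis=1) - tp          # class-i pixels predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return iou.mean(), f1               # mIoU and per-class F1 scores
```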
Implementation Details. In this study, all experiments were conducted using PyTorch version 1.12.1. The hardware environment consists of an Intel Xeon Platinum 8352V CPU and an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). MobileNetV2 [36] was adopted as the backbone for the proposed network architecture. The model was optimized for 200 epochs using the Adam optimizer [37], with an initial learning rate (base lr) of $4\times10^{-4}$ and a momentum of 0.9. A cosine annealing schedule was applied to decay the learning rate during training. The batch size was set to 16. The loss function was the standard pixel-wise cross-entropy loss. To prevent overfitting and enhance generalization, standard data augmentation was employed, including random horizontal flipping, random scaling (ranging from 0.75 to 1.25), random rotations within ±10 degrees, random cropping, and color jittering (adjustments to brightness, contrast, and saturation). Weight decay was set to $1\times10^{-4}$ to regularize the training process.
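The optimization setup described above can be reproduced with a few lines of PyTorch; the snippet below is a minimal sketch in which `model` is a placeholder and the Adam betas of (0.9, 0.999) are assumed to correspond to the reported momentum of 0.9.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training(model: nn.Module, epochs: int = 200, base_lr: float = 4e-4):
    """Minimal sketch of the stated training configuration."""
    criterion = nn.CrossEntropyLoss()                        # pixel-wise cross-entropy loss
    optimizer = Adam(model.parameters(), lr=base_lr,
                     betas=(0.9, 0.999), weight_decay=1e-4)  # base lr 4e-4, weight decay 1e-4
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)   # cosine annealing over 200 epochs
    return criterion, optimizer, scheduler
```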

4.2. Results on HELEN

4.2.1. Comparison with Mainstream Methods

To verify the practical effectiveness of the proposed model in face parsing, we first compared our method with several representative state-of-the-art approaches on the HELEN dataset. Table 1 presents the overall F1-score performance of each method. It can be observed that, although our model scores slightly lower than the method of Yin et al. [38] on the upper lip, it achieves comparable or better results on the other facial components. Compared with the method proposed by Wei et al. [39], our model demonstrates significant improvements in the segmentation of facial organs. These results indicate that the proposed method exhibits stronger adaptability in structural recognition and boundary preservation of facial regions, especially in fine-grained areas such as the eyes and lips.

4.2.2. Ablation Study

Quantitative Analysis. To further analyze the effectiveness of the key components in our proposed method, we conducted ablation experiments on the HELEN dataset, focusing on the contributions of the CCR-FE and SimFA modules. The baseline model is the standard Deeplabv3+ network. As shown in Table 2, on the one hand, the CCR-FE module enhances the representation of facial boundary details by focusing on critical facial features, resulting in improvements of 1.5% in Mean IoU and 1.2% in the overall F1-score. On the other hand, the SimFA module significantly improves the recovery of semantic information, leading to gains of 3.3% in Mean IoU and 2.0% in the overall F1-score. Furthermore, when both modules are integrated, the model achieves the best performance with a Mean IoU of 82.5% and an overall F1-score of 92.2%. These results clearly demonstrate the positive impact of feature enhancement and detail preservation on segmentation performance.
Qualitative Analysis. To further illustrate the effect of different modules on facial segmentation performance, we selected representative samples from the HELEN test set for visual analysis. As shown in Figure 4, the baseline model without any enhancement modules exhibits noticeable boundary blurring and prediction discontinuities in critical regions such as the lip contours and eye outlines. After introducing the CCR-FE module, the model’s ability to preserve structural integrity in boundary areas is significantly enhanced, enabling more accurate capture of the geometric shapes and boundaries of facial organs. When the SimFA module is incorporated, the overall semantic structure of the segmentation result becomes more complete and coherent, with substantially improved performance in texture modeling and local detail representation. Finally, when both modules are jointly integrated into the proposed FP-Deeplab network, the model achieves sharper boundary precision while significantly improving the restoration of local facial morphology and the consistency of spatial semantics; this results in clearer structural segmentation contours and more accurate region delineation.
The consistency between the qualitative and quantitative results demonstrates that FP-Deeplab fully leverages the complementary strengths of the CCR-FE and SimFA modules in spatial structure modeling and channel-wise semantic modulation. This significantly enhances the accuracy and robustness of facial semantic segmentation and offers a practical and effective solution for fine-grained face parsing.

4.3. Results on CelebAMask-HQ

To further validate the effectiveness of the proposed method on face parsing tasks with a larger number of categories, higher resolution, and more complex structures, we conducted comprehensive experiments on the CelebAMask-HQ dataset. This dataset contains richer semantic labels and more diverse facial structures, enabling a more thorough evaluation of the model’s capability in fine-grained region analysis and overall segmentation accuracy.

4.3.1. Performance Comparison with State-of-the-Art Methods

We compared our method with several representative state-of-the-art approaches, including Zhao et al. [14], Lee et al. [33], Wei et al. [39], Luo et al. [18], and FaRL [43]. These methods are characterized by their modeling strategies for complex facial regions and demonstrate strengths in boundary refinement, context modeling, and semantic enhancement. Under the same evaluation metrics and testing protocol, we assessed the segmentation performance of all methods on the CelebAMask-HQ test set. As shown in Table 3, the proposed FP-Deeplab achieves superior performance across multiple key facial components, particularly in structurally complex regions such as hair, lips, and nose. Moreover, for long-tailed categories, such as “necklace” and “clothes”, which appear less frequently in the dataset, FP-Deeplab also demonstrates excellent performance, highlighting its strong generalization ability under class-imbalanced semantic distributions.
Figure 5 presents several visualized segmentation results of our proposed FP-Deeplab on the CelebAMask-HQ test set. To intuitively evaluate the model’s ability to parse different facial components, we selected multiple challenging samples, including cases with occluded facial parts and complex hair textures. As shown in the figure, FP-Deeplab accurately identifies and segments fine-grained structures, such as eyebrows, eyes, and lips, and also maintains good shape consistency and semantic integrity in smaller regions like hair boundaries, mouth contours, and earrings. Additionally, the model demonstrates smooth and continuous predictions in boundary transition areas and shows stable performance on long-tailed categories such as earrings, necklaces, and clothing, reflecting its strong structural perception and semantic discrimination capabilities.
Furthermore, the precise restoration of complex facial structures further indicates that FP-Deeplab excels in fine-detail modeling, boundary preservation, and feature fusion. Compared to several advanced methods listed in Table 3 that rely on large-scale pretraining (e.g., FaRL), FP-Deeplab achieves similarly robust performance on multiple key regions and long-tailed classes, even without the aid of large-scale pretrained models. These results not only highlight the model’s competitiveness in quantitative evaluation metrics but also emphasize its practicality and robustness in real-world applications.

4.3.2. Visualization of Learned Features

To further investigate how our proposed modules contribute to semantic segmentation, we visualize the class-specific attention maps produced by CCR-FE and SimFA for a representative input image. As shown in Figure 6, we display the heatmaps corresponding to seven key facial components: background, skin, right eyebrow, left eye, nose, mouth, and hair.
The top row illustrates the attention responses of the CCR-FE module. It can be observed that CCR-FE effectively captures long-range spatial dependencies and highlights semantically consistent regions, particularly on symmetric structures such as eyebrows and eyes. For example, both eyes and brows exhibit strong bilateral activation, demonstrating the module’s ability to model cross-axis spatial relationships, which are essential for precise facial parsing.
The bottom row shows the heatmaps generated by the SimFA module during the upsampling phase. These visualizations highlight the strengths of boundary-aware refinement and detail enhancement of SimFA. Specifically, SimFA attends more precisely to fine-grained structures such as the mouth, eyes, and nose contours while also suppressing background distractions. This illustrates the ability of SimFA to adaptively modulate features based on both global context and local texture information.
Overall, the visualizations offer intuitive evidence that both CCR-FE and SimFA contribute distinct yet complementary roles: CCR-FE enhances multi-scale contextual encoding during decoding, while SimFA focuses on fine-structure recovery and semantic consistency during upsampling. These properties jointly enable FP-Deeplab to produce high-resolution segmentation with sharp and accurate facial boundaries.

4.3.3. Comparison of Different Backbone Networks

To evaluate the impact of backbone network architectures on the performance of facial semantic segmentation, we conducted comparative experiments on the CelebAMask-HQ dataset using two different types of feature extraction backbones: the lightweight MobileNetV2 and the more powerful but computationally intensive Xception [28]. In these experiments, we kept all other architectural components unchanged and integrated the proposed CCR-FE and SimFA modules under both configurations to ensure a fair comparison.
As shown in Table 4, although using Xception as the backbone yields slightly better results in terms of Mean IoU and Mean F1-score compared to MobileNetV2, the overall performance difference remains minimal. Specifically, the segmentation results obtained with MobileNetV2 are already very close to those of Xception, indicating that the choice of backbone is not a decisive factor in the final performance. This outcome further confirms that the significant performance improvement is primarily attributed to the synergistic effect of the CCR-FE and SimFA modules, which play essential roles in context modeling and semantic enhancement, respectively, handling the majority of feature refinement and integration tasks.
Therefore, it can be concluded that the core advantage of FP-Deeplab lies in its structural module design rather than its reliance on a more complex backbone. This also validates the proposed method’s strong adaptability and architectural independence across different backbones, demonstrating its promising potential for broad transferability and practical deployment.

5. Discussion

To gain a deeper understanding of the limitations of FP-Deeplab in fine-grained facial semantic segmentation, we performed a detailed error analysis based on the model’s predictions on the CelebAMask-HQ dataset. As shown in the confusion matrix in Figure 7, several representative error patterns emerge.
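For reproducibility, a normalized confusion matrix of the kind shown in Figure 7 can be accumulated from per-pixel predictions with a routine such as the following (an illustrative sketch, not the exact script used for the figure).

```python
import numpy as np

def normalized_confusion(preds, gts, num_classes: int) -> np.ndarray:
    """Accumulate and row-normalize a pixel-level confusion matrix.

    preds and gts are iterables of integer label maps (NumPy arrays) of equal shape.
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        idx = g.reshape(-1) * num_classes + p.reshape(-1)   # joint (ground-truth, prediction) index
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    row_sum = conf.sum(axis=1, keepdims=True)
    return conf / np.maximum(row_sum, 1)                    # each row sums to 1
```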
One prominent issue is the confusion between semantically and visually similar facial parts, especially those with symmetrical spatial distributions. Misclassifications frequently occur between left and right brows, left and right eyes, and upper and lower lips. These categories share highly similar visual features and are often difficult to distinguish, particularly in low-resolution or occluded regions. Additionally, confusion between ears and earrings is observed, likely due to their spatial adjacency and overlapping textures. These inter-class ambiguities indicate that the current feature representations, though effective globally, are occasionally insufficient for capturing subtle local distinctions.
Boundary-level errors are another source of segmentation inaccuracy. In particular, the transitions between hair and background or between neck and clothing tend to be imprecise. These regions often exhibit weak texture contrast and complex edge patterns, which pose challenges for the network in maintaining spatial consistency. Such boundary imprecision suggests the need for explicit edge modeling or auxiliary supervision that enhances the network’s sensitivity to fine structural transitions.
Moreover, FP-Deeplab demonstrates reduced accuracy in detecting small or infrequent facial components such as hats, earrings, and necklaces. Due to their limited spatial extent and class imbalance in the dataset, these regions are more prone to omission or misclassification. This observation reflects the model’s difficulty in learning reliable representations for spatially constrained or under-represented categories and highlights the potential benefit of introducing class-aware weighting strategies or attention mechanisms to amplify such features.
In summary, this error analysis reveals that FP-Deeplab’s main performance bottlenecks stem from symmetric structure confusion, boundary localization imprecision, and small-object segmentation weakness. These findings not only provide a clearer picture of the model’s current challenges but also inform future directions for improvement, such as integrating boundary-aware refinement, semantic decoupling strategies, and enhanced training schemes for long-tailed categories.

6. Conclusions

This paper proposes FP-Deeplab, a novel face parsing network designed to address feature detail loss, boundary blurring, and inadequate context modeling in face semantic segmentation. Built upon the Deeplabv3+ framework, FP-Deeplab integrates two specially designed modules: the Context-Channel Refine Feature Enhancement (CCR-FE) module, which improves multi-scale contextual representation, and the Self-Modulation Attention Feature Integration with Regularization (SimFA) module, which enhances semantic restoration and edge detail preservation during upsampling. Extensive experiments conducted on challenging benchmarks demonstrate that FP-Deeplab achieves competitive or superior performance compared to several state-of-the-art methods, particularly in the segmentation of fine-grained structures and long-tailed categories. Notably, even in the absence of large-scale pretraining (e.g., unlike FaRL), FP-Deeplab maintains robust segmentation accuracy and semantic consistency. Ablation studies confirm the critical role of the CCR-FE and SimFA modules, and backbone comparisons show that the observed improvements are primarily due to the proposed architectural innovations rather than the backbone itself. These findings validate the model’s effectiveness and generalizability across complex facial structures.
In future work, we plan to investigate cross-modal information fusion and adaptive feature guidance mechanisms to further enhance the model’s robustness, scalability, and practical deployment in real-world face analysis scenarios.

Author Contributions

Conceptualization, B.Z. and C.S.; Methodology, B.Z., C.S. and Z.L. (Ziqi Liao); Software, B.Z., C.S. and Z.L. (Ziqi Liao); Validation, B.Z.; Formal Analysis, J.Y. and Z.L. (Zhiyu Liu); Investigation, X.C.; Writing—Original Draft Preparation, B.Z.; Writing—Review & Editing, B.Z. and C.S.; Visualization, B.Z. and J.Y.; Resources, B.Z. and X.C.; Supervision, X.C.; Project Administration, B.Z.; Funding Acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code and datasets used in this study are publicly available at https://github.com/borry30/FP-Deeplab (accessed on 26 April 2025).

Acknowledgments

We would like to express our sincere gratitude to Xiaoyan Chen for her valuable guidance during the course of this study. We also appreciate the open-source community for providing publicly available datasets and tools that supported the development and evaluation of our work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anthes, C.; García-Hernández, R.J.; Wiedemann, M.; Kranzlmüller, D. State of the art of virtual reality technology. In Proceedings of the 2016 IEEE Aerospace Conference, Big Sky, MT, USA, 5–12 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–19. [Google Scholar]
  2. Arena, F.; Collotta, M.; Pau, G.; Termine, F. An overview of augmented reality. Computers 2022, 11, 28. [Google Scholar] [CrossRef]
  3. Qiu, Y.; Hui, Y.; Zhao, P.; Cai, C.-H.; Dai, B.; Dou, J.; Bhattacharya, S.; Yu, J. A novel image expression-driven modeling strategy for coke quality prediction in the smart cokemaking process. Energy 2024, 294, 130866. [Google Scholar] [CrossRef]
  4. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93. [Google Scholar] [CrossRef]
  5. Wiley, V.; Lucas, T. Computer vision and image processing: A paper review. Int. J. Artif. Intell. Res. 2018, 2, 29–36. [Google Scholar] [CrossRef]
  6. Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional neural networks: An overview and application in radiology. Insights Imaging 2018, 9, 611–629. [Google Scholar] [CrossRef]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  8. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  9. Yuan, H.; Ji, S. StructPool: Structured graph pooling via conditional random fields. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  10. Liu, Z.; Li, X.; Luo, P.; Loy, C.C.; Tang, X. Deep learning Markov random field for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1814–1828. [Google Scholar] [CrossRef]
  11. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: London, UK, 2015; pp. 234–241. [Google Scholar]
  13. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  14. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  15. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16; Springer: London, UK, 2020; pp. 173–190. [Google Scholar]
  16. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  17. Zhou, Y.; Hu, X.; Zhang, B. Interlinked convolutional neural networks for face parsing. In Advances in Neural Networks–ISNN 2015, Proceedings of the 12th International Symposium on Neural Networks, ISNN 2015, Jeju, South Korea, 15–18 October 2015; Springer: London, UK, 2015; pp. 222–231. [Google Scholar]
  18. Luo, L.; Xue, D.; Feng, X. Ehanet: An effective hierarchical aggregation network for face parsing. Appl. Sci. 2020, 10, 3135. [Google Scholar] [CrossRef]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Yang, L.; Song, Q.; Wang, Z.; Jiang, M. Parsing R-CNN for instance-level human analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 364–373. [Google Scholar]
  22. Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. Hrformer: High-resolution transformer for dense prediction. arXiv 2021, arXiv:2110.09408. [Google Scholar]
  23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  24. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
  25. Shao, H.; Zeng, Q.; Hou, Q.; Yang, J. Mcanet: Medical image segmentation with multi-scale cross-axis attention. arXiv 2023, arXiv:2312.08866. [Google Scholar]
  26. Agarap, A.F. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  27. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Computer Vision–ECCV 2024, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: London, UK, 2024; pp. 359–375. [Google Scholar]
  28. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  29. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  30. Qin, X.; Li, N.; Weng, C.; Su, D.; Li, M. Simple attention module based speaker verification with iterative noisy label detection. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6722–6726. [Google Scholar]
  31. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
  32. Smith, B.M.; Zhang, L.; Brandt, J.; Lin, Z.; Yang, J. Exemplar-based face parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3484–3491. [Google Scholar]
  33. Lee, C.-H.; Liu, Z.; Wu, L.; Luo, P. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5549–5558. [Google Scholar]
  34. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  35. Opitz, J.; Burst, S. Macro f1 and macro f1. arXiv 2019, arXiv:1911.03347. [Google Scholar]
  36. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  38. Yin, Z.; Yiu, V.; Hu, X.; Tang, L. End-to-end face parsing via interlinked convolutional neural networks. Cogn. Neurodyn. 2021, 15, 169–179. [Google Scholar] [CrossRef]
  39. Wei, Z.; Liu, S.; Sun, Y.; Ling, H. Accurate facial image parsing at real-time speed. IEEE Trans. Image Process. 2019, 28, 4659–4670. [Google Scholar] [CrossRef] [PubMed]
  40. Liu, S.; Yang, J.; Huang, C.; Yang, M.-H. Multi-objective convolutional learning for face labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3451–3459. [Google Scholar]
  41. Liu, S.; Shi, J.; Liang, J.; Yang, M.-H. Face parsing via recurrent propagation. arXiv 2017, arXiv:1708.01936. [Google Scholar]
  42. Guo, T.; Kim, Y.; Zhang, H.; Qian, D.; Yoo, B.; Xu, J.; Zou, D.; Han, J.-J.; Choi, C. Residual encoder decoder network and adaptive prior for face parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  43. Zheng, Y.; Yang, H.; Zhang, T.; Bao, J.; Chen, D.; Huang, Y.; Yuan, L.; Chen, D.; Zeng, M.; Wen, F. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18697–18709. [Google Scholar]
Figure 1. The overall architecture of the FP-Deeplab network. In the proposed network, the input image is processed by the feature extractor, ASPP, and CCR-FE modules, fused with low-level features, and finally refined by the SimFA module. Notably, the SimFA module replaces the conventional 3 × 3 convolution block in the original Deeplabv3+ decoder, enabling more effective feature integration during the upsampling stage.
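To make the data flow in the Figure 1 caption concrete, the following PyTorch-style sketch outlines the encoder, ASPP, CCR-FE, low-level fusion, and SimFA stages. It is an illustrative skeleton only: the backbone, ASPP, CCR-FE, and SimFA modules are assumed to be supplied as submodules, and the channel sizes and class count (19 for CelebAMask-HQ) are placeholder choices rather than the authors' exact configuration.

```python
# Minimal sketch of the FP-Deeplab forward pass shown in Figure 1 (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPDeeplab(nn.Module):
    def __init__(self, backbone, aspp, ccr_fe, simfa, num_classes=19,
                 low_ch=24, high_ch=256):
        super().__init__()
        self.backbone = backbone          # returns (low_level_feat, high_level_feat)
        self.aspp = aspp                  # multi-scale context on the high-level feature
        self.ccr_fe = ccr_fe              # Context-Channel Refine Feature Enhancement
        self.reduce_low = nn.Sequential(  # project low-level features before fusion
            nn.Conv2d(low_ch, 48, kernel_size=1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.simfa = simfa                # replaces the decoder's 3x3 conv block;
                                          # assumed to preserve the channel count
        self.classifier = nn.Conv2d(high_ch + 48, num_classes, kernel_size=1)

    def forward(self, x):
        low, high = self.backbone(x)
        ctx = self.ccr_fe(self.aspp(high))                     # refined context features
        ctx = F.interpolate(ctx, size=low.shape[-2:],
                            mode="bilinear", align_corners=False)
        fused = torch.cat([ctx, self.reduce_low(low)], dim=1)  # encoder-decoder fusion
        fused = self.simfa(fused)                              # fine-grained refinement
        logits = self.classifier(fused)
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
```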
Figure 2. Detailed structure of Cross-Axis Attention. R denotes the Reshape operation.
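The reshape-and-attend pattern in Figure 2 can be illustrated with a generic axial attention sketch that attends along each spatial axis in turn. This is only a simplified stand-in: the paper's Cross-Axis Attention additionally incorporates multi-scale strip convolutions and its own reshaping details, and the head count below is an assumption (the channel count must be divisible by the number of heads).

```python
# Simplified axis-wise attention in the spirit of Figure 2 (not the exact published design).
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Horizontal axis: every row becomes a sequence of length W.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical axis: every column becomes a sequence of length H.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
```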
Figure 3. The architecture of the proposed SimFA module. The input features are first split into two parallel branches: E (global branch), which captures semantic structures via nonlocal interactions, and L (local branch), which focuses on boundary and texture refinement. The outputs of the two branches are subsequently fused and adaptively modulated through a channel-wise attention mechanism to enhance fine-grained segmentation accuracy.
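A minimal sketch of the two-branch layout described in the Figure 3 caption is given below. The specific choices here are assumptions, not the published SimFA design: a pooled-context gate stands in for the global branch E, depthwise convolutions for the local branch L, and a variance-based reweighting for the parameter-free channel modulation.

```python
# Illustrative two-branch sketch following the Figure 3 description of SimFA (assumptions noted).
import torch
import torch.nn as nn

class SimFASketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # E (global) branch: pooled-context gating as a lightweight non-local stand-in.
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        # L (local) branch: depthwise 3x3 convolution for boundary/texture detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        e = x * self.global_ctx(x)               # globally modulated features
        l = self.local(x)                        # locally refined features
        y = self.fuse(torch.cat([e, l], dim=1))  # branch fusion
        # Parameter-free channel modulation: reweight channels by their spatial variance
        # (an assumption standing in for the paper's self-modulation attention).
        var = y.var(dim=(2, 3), keepdim=True)
        return y * torch.sigmoid(var / (var.mean(dim=1, keepdim=True) + 1e-5))
```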
Figure 4. Qualitative comparison of facial segmentation results under different module configurations. Here, ‘+’ indicates that only the corresponding component is added to the baseline.
Figure 5. Visualization examples of the proposed FP-Deeplab model on the CelebAMask-HQ test set.
Figure 6. Visualizations of class-specific activation maps from CCR-FE (top) and SimFA (bottom) on representative facial regions. The input image is shown on the left. Each column corresponds to a target semantic class.
Figure 7. Normalized confusion matrix of the FP-Deeplab model on the CelebAMask-HQ test set. High values along the diagonal indicate strong prediction accuracy for major facial components such as background, skin, hair, and nose. However, notable inter-class confusion is observed between symmetric or spatially adjacent parts, such as the left and right brows, eyes, and lips, between ears and earrings, and among neck-related categories. This highlights common challenges in fine-grained facial parsing, especially for small or visually similar regions.
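The row-normalized confusion matrix in Figure 7, as well as per-class IoU and F1 scores of the kind reported in the tables below, can be derived from predicted and ground-truth label maps as sketched here. This is a generic NumPy implementation, not the authors' evaluation code; Mean IoU and Mean F1 are simply the averages of the per-class values.

```python
# Generic per-class segmentation metrics from integer label maps (illustrative).
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    idx = gt.reshape(-1).astype(np.int64) * num_classes + pred.reshape(-1).astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c but labeled otherwise
    fn = cm.sum(axis=1) - tp          # labeled as class c but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return iou, f1

def normalized_cm(cm):
    # Row-normalize by ground-truth counts, as visualized in Figure 7.
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```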
Table 1. Comparison with different methods on the HELEN dataset in terms of the overall F1-score.
| Methods | Skin | Nose | U-Lip | I-Mouth | L-Lip | Eyes | Brows | Mouth | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|
| Zhou et al. [17] | - | 92.0 | 82.4 | 77.7 | 80.8 | 77.8 | 86.3 | 88.9 | 84.5 |
| Liu et al. [40] | 91.0 | 90.9 | 62.3 | 80.8 | 69.4 | 76.8 | 71.3 | 84.1 | 84.7 |
| Liu et al. [41] | 92.1 | 93.0 | 74.3 | 79.2 | 81.7 | 86.8 | 77.0 | 89.1 | 88.6 |
| Guo et al. [42] | 93.8 | 94.1 | 75.8 | 83.7 | 83.1 | 80.4 | 87.1 | 92.4 | 90.5 |
| Yin et al. [38] | - | 96.3 | 82.4 | 85.6 | 86.6 | 89.5 | 84.8 | 92.8 | 91.0 |
| Wei et al. [39] | 95.6 | 95.2 | 80.0 | 86.7 | 86.4 | 89.0 | 82.6 | 93.6 | 91.6 |
| FP-Deeplab (Ours) | 94.9 | 96.0 | 81.3 | 87.5 | 88.1 | 89.6 | 84.2 | 94.3 | 92.2 |
Table 2. Ablation study of FP-Deeplab on the HELEN dataset.
| Baseline | CCR-FE | SimFA | Mean IoU | Overall F1 |
|---|---|---|---|---|
| ✓ | | | 78.7 | 89.9 |
| ✓ | ✓ | | 80.2 (+1.5) | 91.1 (+1.2) |
| ✓ | | ✓ | 82.0 (+3.3) | 91.9 (+2.0) |
| ✓ | ✓ | ✓ | 82.5 (+3.8) | 92.2 (+2.3) |
Table 3. Comparison with different methods on the CelebAMask-HQ dataset in terms of Mean F1. Our method demonstrates strong effectiveness in handling long-tailed categories such as Necklace and Clothes, as well as head-related classes, including Face and Hair.
| Methods | Face | Nose | Glasses | L-Eye | R-Eye | L-Brow | R-Brow | L-Ear | R-Ear | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Zhao et al. [14] | 94.8 | 90.3 | 75.8 | 79.9 | 80.1 | 77.3 | 78.0 | 75.6 | 73.1 | 76.2 |
| Lee et al. [33] | 95.5 | 85.6 | 92.9 | 84.3 | 85.2 | 81.4 | 81.2 | 84.9 | 83.1 | 80.3 |
| Wei et al. [39] | 96.4 | 91.9 | 89.5 | 87.1 | 85.0 | 80.8 | 82.5 | 84.1 | 83.3 | 82.1 |
| Luo et al. [18] | 96.0 | 93.7 | 90.6 | 86.2 | 86.5 | 83.2 | 83.1 | 86.5 | 84.1 | 84.0 |
| FaRLscratch [43] | 96.2 | 93.8 | 92.3 | 89.0 | 89.0 | 85.3 | 85.4 | 86.9 | 87.3 | 84.7 |
| FP-Deeplab (Ours) | 96.4 | 94.0 | 94.2 | 82.2 | 82.5 | 79.4 | 79.0 | 82.2 | 80.6 | 84.8 |

| Methods | I-Mouth | U-Lip | L-Lip | Hair | Hat | Earring | Necklace | Neck | Cloth |
|---|---|---|---|---|---|---|---|---|---|
| Zhao et al. [14] | 89.8 | 87.1 | 88.8 | 90.4 | 58.2 | 65.7 | 19.4 | 82.7 | 64.2 |
| Lee et al. [33] | 63.4 | 88.9 | 90.1 | 86.6 | 91.3 | 63.2 | 26.1 | 92.8 | 68.3 |
| Wei et al. [39] | 90.6 | 87.9 | 91.0 | 91.1 | 83.9 | 65.4 | 17.8 | 88.1 | 80.6 |
| Luo et al. [18] | 93.8 | 88.6 | 90.3 | 93.9 | 85.9 | 67.8 | 30.1 | 88.8 | 83.5 |
| FaRLscratch [43] | 91.7 | 88.1 | 90.0 | 94.9 | 82.7 | 63.1 | 33.5 | 90.8 | 85.9 |
| FP-Deeplab (Ours) | 93.2 | 90.2 | 91.6 | 95.7 | 85.8 | 59.6 | 59.7 | 92.0 | 88.3 |
Table 4. The impact of using Xception and MobileNetV2 as backbones on the segmentation performance of FP-Deeplab.
| Backbone | Mean F1 | Mean IoU | Params (M) | GFLOPs |
|---|---|---|---|---|
| MobileNetV2 | 84.8 | 75.64 | 7.25 | 1.8 |
| Xception | 85.0 | 75.89 | 56.71 | 65.8 |
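The parameter counts in Table 4 can be reproduced for any backbone configuration with a short PyTorch helper such as the one below; the model construction itself is assumed to happen elsewhere, and the GFLOPs figures additionally require a profiling tool, which is not shown here.

```python
# Helper for the parameter counts reported in Table 4 (model construction assumed elsewhere).
import torch

def count_params_millions(model: torch.nn.Module) -> float:
    """Total number of parameters, expressed in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```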
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
