LCAM: Low-Complexity Attention Module for Lightweight Face Recognition Networks

Inspired by the ability of the human visual system to concentrate on the important regions of a scene, attention modules recalibrate the weights of either the channel features alone or together with spatial features, prioritizing informative regions while suppressing unimportant information. However, the floating-point operations (FLOPs) and parameter counts increase considerably when these modules are incorporated into a baseline model, especially for modules with both channel and spatial attention. Despite the success of attention modules in general ImageNet classification tasks, emphasis should be given to incorporating these modules in face recognition tasks. Hence, a novel attention mechanism with three parallel branches, known as the Low-Complexity Attention Module (LCAM), is proposed. Note that there is only one convolution operation in each branch. Therefore, LCAM is lightweight, yet it is still able to achieve better performance. Experiments on face verification tasks indicate that LCAM achieves results similar to or better than those of previous modules that incorporate both channel and spatial attention. Moreover, compared with the baseline models without attention modules, LCAM improves the average accuracy over seven image-based face recognition datasets by 0.84% on ConvFaceNeXt, 1.15% on MobileFaceNet, and 0.86% on ProxylessFaceNAS.


Introduction
Attention modules have proven to be useful in enhancing the performance of convolutional neural networks [1]. Instead of adding more layers to improve the network's performance, which consumes more computing resources, an attention module can be plugged into the network. As a result, important regions of an image can be emphasized while the feature representation capability is boosted. Most attention modules can be categorized into channel attention, spatial attention, or a combination of both. Channel attention enables the network to emphasize the inter-channel relationship of features, while spatial attention underlines the importance of the inter-spatial relationship of features [2]. Previous works [3][4][5][6] demonstrate the effectiveness of attention modules for general ImageNet [7] classification tasks, but there is a lack of studies on different attention modules in face recognition tasks, especially for lightweight face recognition models [8][9][10][11]. Moreover, most of the modules that integrate channel and spatial attention require high computation, which makes them unsuitable for real-world deployment in these lightweight models.

• To propose an attention module with low complexity. The proposed attention module is the Low-Complexity Attention Module, also known as LCAM. Notably, LCAM has significantly fewer FLOPs and parameters yet exhibits comparable or better performance than other modules that combine both channel and spatial attention.

• To preserve and enhance the information interaction in the spatial (vertical and horizontal) branches of LCAM so as to avoid information loss.
The proposed LCAM is incorporated into three existing lightweight face recognition models, namely ConvFaceNeXt [11], MobileFaceNet [8], and ProxylessFaceNAS [10]. These lightweight mobile technologies play an important role in various mobile applications [15][16][17][18] with constrained computational resources. The remainder of the paper is organized as follows: In Section 2, the general face recognition pipeline and several lightweight face recognition models, as well as attention modules, are reviewed. Section 3 introduces the proposed attention module, namely LCAM. In Section 4, the experimental results of LCAM and other previous attention modules are presented and analyzed. Finally, Section 5 summarizes and concludes this work.

Related Work
First, the general face recognition pipeline is described. Next, some lightweight face recognition models are briefly discussed. Finally, previous attention modules are reviewed.

General Face Recognition Pipeline
The general face recognition pipeline consists of face detection, face alignment, and facial representation for verification or identification. The first step in face recognition is face detection, which aims to locate all the faces within a given image or video frame. Generally, this process involves locating human faces, whereby each person's face is enclosed with a bounding box. Aside from the frontal face, a robust detector must be able to detect faces with different illuminations, poses, and scales [19]. Most current face detection approaches are based on deep learning. Some examples include Multi-Task Cascaded Convolutional Neural Networks, also known as MTCNN [20], and the single-stage headless face detector [21]. After that, face alignment modifies the position of the face to a normalized canonical coordinate with the aim of eliminating variations in scale, rotation, and translation [22]. Normally, the alignment transformation is carried out with respect to discriminant facial landmarks, such as the centers of the eyes, the tip of the nose, and the corners of the mouth.
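As a concrete illustration of the alignment step, the similarity transform (scale, rotation, translation) that maps detected landmarks onto canonical coordinates can be estimated in closed form with the Umeyama least-squares method. The NumPy sketch below is illustrative only (2D points, toy landmarks) and is not the paper's implementation:

```python
import numpy as np

def similarity_transform(src, dst):
    """Estimate scale s, rotation R, translation t mapping src -> dst
    (least-squares Umeyama method, 2D points)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = len(src)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / n                      # cross-covariance matrix
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))             # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / (src_c ** 2).sum() * n
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Toy landmarks: dst is src scaled by 2 and shifted by (10, 20).
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 2]])
dst = 2 * src + np.array([10, 20])
s, R, t = similarity_transform(src, dst)
```

In practice, the source points would be the five MTCNN landmarks and the destination points the canonical template coordinates of the 112 × 112 crop.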
Finally, facial representation is implemented by obtaining the face descriptors from the extracted features. Specifically, this process involves the mapping of aligned face images to a new feature space. Given a pair of face images, face verification is a one-to-one matching procedure to determine whether both faces belong to the same person [23]. On the other hand, face identification is a one-to-many matching procedure in order to match the given unknown face against known faces in the gallery [18].

Lightweight Face Recognition Models
With the advent of face recognition systems in mobile and embedded devices, lightweight face recognition models have become one of the active and popular research fields in computing. These models are constructed based on efficient blocks. Moreover, these lightweight face recognition models should have low complexity, with no more than 1 G of FLOPs and less than a 19.8 MB model size [24]. Some examples of lightweight face recognition models are MobileFaceNet [8], ShuffleFaceNet [9], MobileFaceNetV1 [10], ProxylessFaceNAS [10], and ConvFaceNeXt [11]. First, MobileFaceNet [8] was built upon an inverted residual block [25], in addition to introducing global depthwise convolution that efficiently reduced the final spatial dimension. After that, ShuffleFaceNet [9] utilized an inverted residual block nestled between a channel split unit on the top and a channel shuffle unit [26] at the bottom. Later, MobileFaceNetV1 [10] deployed separable convolution [27] to decrease the computational complexity. Concurrently, ProxylessFaceNAS [10] added the inverted residual block to the search space of ProxylessNAS [28] for a more efficient architecture. Recently, ConvFaceNeXt [11] employed an enhanced form of the ConvNeXt block [29] to further reduce the FLOPs, parameters, and model size. Note that all of the aforementioned lightweight face recognition models have low complexity, and no attention modules are integrated into these baseline models. In addition, these models are based on the Convolutional Neural Network (CNN) technique, where the extracted features are learned automatically from the given dataset. Other approaches, such as face recognition algorithms based on fast computation of orthogonal moments [30], are not considered in this research because of the lower recognition performance of handcrafted models compared to that of CNNs in general [31].
Moreover, the design of a handcrafted model is difficult because expert knowledge in the corresponding domain is required to manually extract the feature [32].

Attention Modules
Attention modules can extract informative details from an image region, thus enriching the representation power of the overall model. Basically, there are two types of attention. Channel attention focuses on 'what' is important, given different feature maps from an input image. Conversely, spatial attention addresses 'where' the important region is positioned [2]. Some modules encode only informative channel details, such as SE [3] and ECA [4]. On the other hand, other modules complement the channel with spatial information, namely CBAM [2], ECBAM [12], CA [5], SCA [13], TA [6], and DAA [14]. Let F_input ∈ R^(H×W×C) and F_output ∈ R^(H×W×C) denote the input and output tensors, respectively, where H, W, and C represent their height, width, and channel dimensions. With these notations, each attention module is briefly presented and discussed as follows, and the outlines for all the attention modules are depicted in Figure 1. Other acronyms in Figure 1 include 'GAP' and 'GMP', which refer to Global Average Pooling and Global Max Pooling, respectively, while 'BN' stands for Batch Normalization. In addition, 'Channel Pool' abbreviates channel pooling, 'Concat.' abbreviates concatenation, and 'Avg. Pool' and 'Max. Pool' refer to the average pooling and maximum pooling operations, respectively. These pooling operations are performed over one or two dimensions, as indicated by H, W, or C. Finally, notation ⊗ represents element-wise multiplication, whereas ⊕ represents element-wise summation.
The SE [3] module was introduced to capture channel-wise relationships and choose the best representation by means of recalibrating the channel weight. Specifically, the squeeze operation aggregates information across spatial dimensions through GAP, while the excitation operation utilizes reduction ratio r to scale the channel dimension. With the aim of reducing complexity through 1D convolution, ECA [4] was developed to capture cross-channel interactions. Unlike SE, the dimensionality reduction operation was excluded from ECA to generate appropriate and feasible channel attention maps.
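The squeeze-and-excitation recalibration described above can be sketched in a few lines of NumPy: squeeze by GAP, excite through two fully connected layers with reduction ratio r, then rescale each channel by a sigmoid weight. The weight shapes and random inputs are illustrative assumptions, not the SE reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(x, w1, w2):
    """Squeeze-and-Excitation sketch. x: (H, W, C); w1: (C, C//r); w2: (C//r, C)."""
    z = x.mean(axis=(0, 1))                 # squeeze: GAP over H, W -> (C,)
    a = np.maximum(z @ w1, 0.0)             # excitation: reduce by ratio r + ReLU
    s = 1.0 / (1.0 + np.exp(-(a @ w2)))     # restore + sigmoid -> per-channel weights
    return x * s                            # recalibrate the channels

H, W, C, r = 7, 7, 16, 4
x = rng.standard_normal((H, W, C))
out = se_block(x, rng.standard_normal((C, C // r)), rng.standard_normal((C // r, C)))
```

ECA replaces the two fully connected layers with a single 1D convolution across the pooled channel vector, avoiding the dimensionality reduction.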
Instead of only attaining a channel-wise relationship, as in SE and ECA, CBAM [2] further improved the representational power by combining channel and spatial attention modules in sequence. The channel attention of CBAM was similar to that of SE, except that there were two pooling operations at the initial stage, namely GAP and GMP. On the other hand, spatial attention involved applying average pooling and max pooling on the channel dimension, collectively known as channel pooling. Furthermore, ECBAM [12] was conceived to extract channel and spatial information in a more robust way. Different from CBAM, which adopted SE as the channel attention, ECBAM utilized ECA with both GAP and GMP operations. In addition, ECBAM followed the same spatial attention setting as that of CBAM.
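The channel pooling used by CBAM's spatial attention is simply the stack of average- and max-pooling taken over the channel axis. A minimal NumPy sketch (illustrative, not the reference implementation):

```python
import numpy as np

def channel_pool(x):
    """'Channel Pool' in CBAM: stack average- and max-pooling over the
    channel axis, giving an (H, W, 2) descriptor for spatial attention."""
    return np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)

x = np.arange(24, dtype=float).reshape(2, 3, 4)   # H=2, W=3, C=4
desc = channel_pool(x)
```

In CBAM, this (H, W, 2) descriptor is then passed through a 2D convolution and a sigmoid to produce the spatial attention map.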
Considering the importance of positional information, CA [5] was introduced. Since convolution only extracts local information, as observed in the computation of spatial attention in CBAM and ECBAM modules [5], CA was suggested to overcome this problem by capturing long-range interactions. Similar to CA, SCA [13] was developed to fuse the channel attention maps with spatial information in two branches. With this arrangement, SCA aimed to capture cross-dimensional information from each branch simultaneously. As it is different from other attention modules, the pooling operation was omitted from SCA to preserve more information.
Instead of capturing attention with single branch (SE, ECA, CBAM, and ECBAM) or two branches arranged in parallel (CA and SCA), a module with three parallel branches known as TA [6] was introduced. TA was developed to capture the interdependencies between the channel and spatial dimensions concurrently. Akin to SCA, cross-dimensional interactions were captured by both channel branches, which were encoded with either height or width information. The remaining third branch adopted the same operating procedure as that of spatial attention in CBAM. Another three-branch structure arranged in parallel, known as DAA [14], was developed. Each of the three branches encoded information in the height, width, and channel dimensions separately. Unlike TA, which incorporated inter-dependency relationships, DAA built intra-dependencies for each dimension of the input tensor.

Proposed Approach
A triplet-branch Low-Complexity Attention Module, known as LCAM, is proposed. Figure 2 shows the graphical outline for LCAM. The input and output tensors for LCAM are denoted as F_input and F_output, respectively, where both tensors have the same spatial (H and W) and channel (C) dimensions. Notably, LCAM has three parallel branches, where each branch encodes information in the height, width, and channel dimensions, respectively. Compared to TA and DAA, each branch of LCAM consists of only one convolution operation with a smaller kernel size. Consequently, the number of FLOPs and parameters is reduced, yielding lower complexity and fewer memory requirements. Figure 3 shows the detailed block structure of LCAM, where the entire operation of LCAM is summarized in Equation (1), in which F_H, F_W, and F_C are the weighted attention maps for the vertical, horizontal, and channel branches, respectively. In the following subsections, the details of each of the three branches are presented and discussed with respect to the graphical outline and block structure of LCAM.

Channel Attention Branch
The first unit of LCAM is the Channel Attention Branch (CAB), which exploits the inter-channel interaction of different feature maps. Motivated by ECA, additional batch normalization is appended in between the 1D convolution and sigmoid activation function to promote stability and facilitate the training process. Unlike DAA, the reduction ratio is not applied in the channel branch of LCAM. In this way, more effective channel attention can be learned while preserving channel information.

First, GAP is performed on the input tensor F_input, generating a pooling feature F_GAP with dimension 1 × 1 × C. Next, a 1D convolution with adaptive kernel size is deployed to capture cross-channel interaction. Following ECA [4], the kernel size for the 1D convolution is adaptively determined by:

k = | log2(C)/γ + g/γ |_odd, (2)

where g and γ are hyperparameters assigned the values 1 and 2, respectively, and |·|_odd denotes the nearest odd number. Note that the adaptive kernel size for LCAM is set to be at least 3. In essence, the predicted attention for each channel is based on a local neighborhood of k channels. Finally, the channel attention map F_C is obtained by applying a sigmoid activation function to scale the weight of each channel. The whole process of CAB can be mathematically formulated as:

F_GAP = GAP(F_input), (3)

F_C = σ(b(f^k_1d(F_GAP))), (4)

where f^k_1d denotes 1D convolution with kernel size k determined adaptively by Equation (2), b is the batch normalization operation, and σ indicates the sigmoid activation function.
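The adaptive kernel-size rule can be sketched as follows. The rounding behaviour here mirrors the public ECA reference implementation (truncate, then force odd), combined with the minimum of 3 stated above; LCAM's exact rounding may differ:

```python
import math

def adaptive_kernel_size(C, g=1, gamma=2):
    """Adaptive 1D-convolution kernel size following the ECA-style rule:
    k = |log2(C)/gamma + g/gamma| rounded to an odd number,
    clamped to a minimum of 3 as in LCAM."""
    t = int(abs(math.log2(C) / gamma + g / gamma))
    k = t if t % 2 else t + 1     # force an odd kernel size
    return max(k, 3)              # LCAM uses at least 3
```

For example, with g = 1 and γ = 2, a 64-channel tensor gives k = 3, while a 128-channel tensor gives k = 5, so deeper stages attend over wider channel neighborhoods.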

Vertical Attention Branch
With the intention of encoding height information in the spatial dimension, the Vertical Attention Branch (VAB) is the second unit of LCAM. As opposed to other attention modules with 2D symmetric convolution, LCAM utilizes only a 1D asymmetric convolution in each spatial branch. This approach reduces the computational complexity and memory footprint. Moreover, LCAM capitalizes fully on the spatial information in every 1D asymmetric convolution operation. Hence, LCAM is not susceptible to information loss because more information can be preserved. Specifically, average pooling is performed on the channel dimension while retaining both the height and width information. Moreover, the permutation operations also improve robustness to face pose variations, which leads to better face recognition performance.
Initially, the channel dimension of the input tensor is compressed through the channel average pooling operation. The generated pooling feature F_CAP has a shape of H × W × 1. After that, the pooled feature is permuted with respect to the height dimension. Intuitively, this operation involves rotation along the height dimension, which swaps the positions of the width and channel dimensions. As a result, the dimension of the permuted feature F_PH is rearranged to H × 1 × W. Subsequently, a 1D asymmetric convolution with a kernel of size 3 × 1 followed by a sigmoid activation function is deployed. The convoluted feature F_CPH has a dimension of H × 1 × 1. Finally, another permutation operation rotates the feature back to the original position. The generated attention map F_H can then be used to calibrate the height dimension. In essence, the aforementioned steps can be mathematically summarized as:

F_CAP = CAP(F_input), (5)

F_PH = p_h(F_CAP), (6)

F_CPH = σ(f_3×1(F_PH)), (7)

F_H = p_h(F_CPH), (8)

where CAP denotes the channel average pooling operation, p_h represents the permutation operation along the height dimension, and f_3×1 corresponds to the 3 × 1 asymmetric convolution.
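The VAB data flow can be sketched in NumPy as below. After the permutation, the width axis plays the role of the convolution's input channels, so the single 3 × 1 kernel has shape (3, W); the zero padding ('same') is an assumption of this sketch, not a detail confirmed by the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vertical_attention(x, w):
    """VAB sketch. x: (H, W, C); w: (3, W) kernel of the single 3x1
    asymmetric convolution (the permuted width axis acts as its channels).
    Returns an (H, 1, 1) attention map for the height dimension."""
    H, W, C = x.shape
    f_cap = x.mean(axis=2)                    # channel average pooling -> (H, W)
    # after permutation the width axis is treated as the channel axis
    padded = np.pad(f_cap, ((1, 1), (0, 0)))  # zero-pad the height axis ('same')
    conv = np.array([np.sum(padded[i:i + 3] * w) for i in range(H)])
    return sigmoid(conv).reshape(H, 1, 1)     # per-row (height) attention weights

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 6, 4))            # H=8, W=6, C=4
f_h = vertical_attention(x, rng.standard_normal((3, 6)))
```

Each of the H output weights lies in (0, 1) and rescales one row of the feature map.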

Horizontal Attention Branch
Another counterpart unit for the spatial dimension in LCAM is the Horizontal Attention Branch (HAB). The purpose of the HAB is to capture the width information. The workflow of the HAB is similar to that of the VAB, with some minor adjustments. Instead of permutation along the height dimension, the rotation operation of the HAB is carried out with respect to the width dimension. In addition, a 1 × 3 asymmetric convolution is utilized for the permuted feature.
Akin to VAB, the workflow of the HAB starts with the channel average pooling operation. Next, the pooling feature F CAP is rotated along the width dimension to yield a permuted feature, F PW , with a shape of 1 × W × H. Then, a 1D asymmetric convolution of 1 × 3 kernel size is utilized before appending a batch normalization and sigmoid activation function. Hence, a convoluted feature, F CPW , with a dimension of 1 × W × 1 is generated. Finally, the feature is rearranged to the original position through permutation operation.
The entire flow of the HAB starts from Equation (5), while the subsequent steps are given as:

F_PW = p_w(F_CAP), (9)

F_CPW = σ(b(f_1×3(F_PW))), (10)

F_W = p_w(F_CPW), (11)

where p_w refers to the permutation operation along the width dimension, while f_1×3 is the 1 × 3 asymmetric convolution.
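Putting the three branches together, the whole module can be sketched as follows. One caveat: the final fusion here averages the three recalibrated tensors, which is an assumption borrowed from TA-style fusion; the paper's Equation (1) is the authoritative definition. Kernel shapes and padding are likewise illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lcam(x, wc, wh, ww):
    """LCAM sketch on x of shape (H, W, C).
    wc: (k,) channel-branch 1D kernel; wh: (3, W) and ww: (3, H) spatial kernels.
    ASSUMPTION: the fusion averages the three recalibrated tensors."""
    H, W, C = x.shape
    # channel branch: GAP -> 1D conv (zero-padded) -> sigmoid
    z = np.pad(x.mean(axis=(0, 1)), len(wc) // 2)
    f_c = sigmoid(np.array([z[i:i + len(wc)] @ wc for i in range(C)]))
    # vertical branch: channel avg pool -> 3x1 conv over height -> sigmoid
    cap = x.mean(axis=2)                                   # (H, W)
    hp = np.pad(cap, ((1, 1), (0, 0)))
    f_h = sigmoid(np.array([np.sum(hp[i:i + 3] * wh) for i in range(H)]))
    # horizontal branch: the same operation along the width axis
    wp = np.pad(cap.T, ((1, 1), (0, 0)))                   # (W + 2, H)
    f_w = sigmoid(np.array([np.sum(wp[i:i + 3] * ww) for i in range(W)]))
    # fuse: average of the three recalibrated tensors (assumed)
    return (x * f_c + x * f_h[:, None, None] + x * f_w[None, :, None]) / 3.0

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 16))
out = lcam(x, rng.standard_normal(3), rng.standard_normal((3, 6)),
           rng.standard_normal((3, 8)))
```

The output keeps the input's H × W × C shape, so the module can be dropped after any block without changing downstream layer dimensions.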

Experiments and Analysis
In this section, the training and evaluation datasets for LCAM and previous attention modules are initially introduced. Next, the experimental settings for all these modules are presented. Then, ablation studies are conducted to assess the most suitable design structure for the LCAM module. After that, quantitative analysis is implemented to evaluate the verification accuracy. Finally, qualitative analysis is performed on the output images by visual inspection.

Dataset
All the models in this work employ UMD Faces [36] as the training dataset. This medium-sized dataset contains 367,888 face images of 8277 individuals. In comparison with other similar-sized datasets, such as CASIA WebFace [37] and VGGFace [38], the images of UMD Faces contain more pose variations, which facilitates the learning capability of a model. Face images with 112 × 112 × 3 dimensions are used for training. These faces are detected and aligned with a Multi-Task Cascaded Convolutional Neural Network (MTCNN) [20], which is available from the Face.evoLVe library [39].

Experimental Settings
The effectiveness and adaptability of LCAM and the previous eight attention modules discussed in Section 2.3 are evaluated by plugging them into three different lightweight face recognition models. Notably, this is carried out by evaluating the verification accuracies with respect to seven image-based and two template-based face recognition datasets, as mentioned in Section 4.1. These three lightweight face recognition models are ConvFaceNeXt [11], MobileFaceNet [8], and ProxylessFaceNAS [10]. Note that the attention module is placed in the rear position after the core building block of a model. Following previous works [5,14], the reduction ratios for SE, CBAM, CA, and DAA are fixed at 24, 24, 32, and 8, respectively, to ensure similar model complexities.
All experiments are conducted with the TensorFlow framework on an Nvidia Tesla P100 GPU. The Stochastic Gradient Descent optimizer is used to train these models from scratch, with a momentum of 0.9 and a weight decay of 0.0005. Additionally, a cosine learning schedule with an initial value of 0.1 and a decrease factor of 0.5 is adopted. Note that ConvFaceNeXt and MobileFaceNet are trained with a batch size of 256 for 49 epochs, whereas the batch size for ProxylessFaceNAS is 64 due to GPU memory limitations. Furthermore, the loss function used for all models is ArcFace [48], which boosts the discriminative power of learned face features through the additive angular margin m. Given a feature x_i belonging to identity class y_i, ArcFace is formulated as:

L = -(1/N) Σ_{i=1}^{N} ln [ e^{h·cos(θ_{y_i}+m)} / ( e^{h·cos(θ_{y_i}+m)} + Σ_{j=1, j≠y_i}^{C} e^{h·cos(θ_j)} ) ], (12)

where N is the batch size, C is the total number of classes in the training dataset, h is the scaling hyperparameter, and θ_{y_i} is the angle between the feature x_i and the y_i-th class center.

Ablation Studies
In this section, the optimum design structure and configuration for LCAM are studied. These evaluations are measured in terms of performance, FLOPs, and parameters. Among the four family members of ConvFaceNeXt, the base model selected to conduct all the experiments is ConvFaceNeXt_PE. This model will be referred to as ConvFaceNeXt in the following sections for simplicity. Three sets of experiments are carried out based on the hill climbing technique [49]. First, the effect of different kernel sizes in the 1D asymmetric convolution of LCAM is assessed. Second, the influence of each of the three branches of LCAM, along with different combinations, is examined. Finally, the impact of different location configurations of LCAM in ConvFaceNeXt is investigated. The overview of these three sets of experiments is shown in Figure 4.

Effect of Different Kernel Size
Generally, most of the attention modules operate on 2D symmetric convolution. Hence, 1D asymmetric convolution is deployed in each of the vertical and horizontal branches of LCAM to reduce the computational complexity. Different from DAA, each of these branches exploits spatial information to the fullest extent. This is made possible by the permutation operation, which enables the 1D asymmetric convolution to operate on both height and width information. Four experiments with different kernel sizes are conducted, involving kernel sizes of three, five, seven, and nine. The verification accuracies for the various kernel sizes are reported in Tables 1 and 2 and Figure 5. Note that the first model with LCAM, which operates on a kernel of size three for the 1D asymmetric convolution, is known as ConvFaceNeXt_LCAM. The subsequent models, which incorporate LCAM with kernel sizes of five, seven, and nine, are denoted as ConvFaceNeXt_L5K, ConvFaceNeXt_L7K, and ConvFaceNeXt_L9K, respectively.
From the results of the image-based dataset in Table 1, it can be seen that ConvFaceNeXt_LCAM and ConvFaceNeXt_L5K perform similarly well. However, when the kernel size is increased further to seven and nine, there is a drop in the verification accuracy. Note that DAA adopts 1D asymmetric convolution with a kernel of size seven for performance gain in the ImageNet classification task. Nevertheless, the conducted experiments show that larger kernel sizes of seven and nine lead to performance deterioration in the face recognition task. This suggests that performance saturates beyond a kernel of size five when evaluating on face datasets. Another possible reason might be that DAA is plugged into MobileNetV2 with a large (224 × 224 × 3) input image, whereas the input is only 112 × 112 × 3 for ConvFaceNeXt.
Other lightweight face recognition models, such as MobileFaceNet and ProxylessFaceNAS, are fed with the same small input image size of 112 × 112 × 3 to reduce the computational complexity. In this situation, a smaller kernel size for the 1D asymmetric convolution is better suited to extracting local face information than a larger kernel size. Although there is a minor performance gain of 0.05% when switching from a kernel size of three to five, this increment is negligible compared to the additional 45 K parameters introduced by the larger kernel. In addition, as indicated in Table 2 and Figure 5 for the template-based datasets, ConvFaceNeXt_LCAM with a kernel size of three performs better than the models with larger kernel sizes. Based on these results and the intention to propose a low-complexity attention module, a kernel of size three is adopted for LCAM so that more detailed and minute face features can be extracted.

Effect on Different Combination of Branches
In this section, experiments are conducted to determine the optimum combination of branches for LCAM. Based on the three branches of LCAM, namely CAB, VAB, and HAB, the effectiveness of seven different combinations is examined. These seven variations include three single-branch modules, three double-branch modules, and a triplet-branch module. Specifically, the first to third models incorporate single-branch modules: ConvFaceNeXt_CAB (channel), ConvFaceNeXt_VAB (height), and ConvFaceNeXt_HAB (width). Next, the fourth to sixth models are integrated with double-branch modules, namely ConvFaceNeXt_CAB+VAB, ConvFaceNeXt_CAB+HAB, and ConvFaceNeXt_VAB+HAB. Finally, the last model comprises all three branches and is denoted as ConvFaceNeXt_LCAM. Note that the baseline model is represented as ConvFaceNeXt. The results of applying different combinations of branches are shown in Tables 3 and 4 and Figure 6.
It is observed that the model ConvFaceNeXt_LCAM with all three branches has the best overall performance for the image-based dataset, as shown in Table 3. In addition, the parameters for ConvFaceNeXt_LCAM are almost identical to those of the other variations, with only a slight 0.5% increase in FLOPs compared to the baseline model. This implies that the combination of CAB, VAB, and HAB in LCAM is capable of improving the face verification performance, because the scaling weight of each dimension is highlighted properly. Hence, channel, height, and width (spatial) information can be captured effectively by LCAM. In addition, the remaining six models derived from LCAM improve the verification performance on the image-based dataset compared to the baseline model. Among the single-branch modules, ConvFaceNeXt_CAB and ConvFaceNeXt_VAB show comparable performances, while ConvFaceNeXt_HAB has mediocre performance.
Meanwhile, among the three double-branch modules, ConvFaceNeXt_CAB+HAB performs better than the other two. From the perspective of the channel and width dimensions, there is a gradual performance improvement from a single branch to all three branches in LCAM. Concretely, when the models consider only a single branch with CAB or HAB, the verification accuracies are 92.84% and 92.65%, respectively. When two branches are incorporated, as in the combination of CAB and HAB, the accuracy increases to 93.00%. Eventually, the best result of 93.13% is obtained by integrating all three branches. Additionally, ConvFaceNeXt_LCAM achieves the highest accuracy on the template-based datasets, as shown in Table 4. These observations prove that it is crucial to combine all three branches to effectively recalibrate each dimension.
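To make the branch fusion concrete, the sketch below shows one plausible way the three parallel branches could jointly rescale a feature map: each branch yields a per-dimension attention vector, which is passed through a sigmoid and applied multiplicatively. This is a minimal pure-Python illustration; the logits `a_c`, `a_h`, and `a_w` stand in for the outputs of CAB, VAB, and HAB, and the multiplicative fusion is an assumption, not necessarily the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lcam_rescale(feat, a_c, a_h, a_w):
    """Rescale a C x H x W feature map (nested lists) with per-dimension
    attention logits from the channel, vertical, and horizontal branches.
    Fusion is assumed multiplicative here."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[[feat[c][h][w] * sigmoid(a_c[c]) * sigmoid(a_h[h]) * sigmoid(a_w[w])
              for w in range(W)] for h in range(H)] for c in range(C)]

# Toy example: 2 channels, 2x2 spatial; the second channel gets a large
# channel logit, so its responses are suppressed far less than the first's.
feat = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
out = lcam_rescale(feat, a_c=[0.0, 5.0], a_h=[0.0, 0.0], a_w=[0.0, 0.0])
```

With zero logits, every sigmoid evaluates to 0.5, so the first channel is scaled by 0.5 in each dimension, illustrating how all three branches contribute to the final scaling weight.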

Effect of Different Integration Strategies
In order to gain insight into the optimum location for LCAM in the core building block of ConvFaceNeXt, several integration strategies are investigated. As a brief introduction, ECN is the main building block of ConvFaceNeXt, with three convolution operations, as shown in Figure 7a. The ECN block starts with a depthwise convolution, followed by two pointwise convolutions. Note that the first pointwise convolution is used for channel expansion, while the second is for channel reduction. Three variants of integration strategies are studied. The first model, ConvFaceNeXt_D1, integrates LCAM after the depthwise convolution and its batch normalization operation. The second model, ConvFaceNeXt_P1, incorporates LCAM after the first pointwise convolution and its PReLU activation function. Finally, the last model, ConvFaceNeXt_LCAM, deploys the attention module at the end of the ECN block. For better clarity, these three variants are depicted in Figure 7b-d, and the structure of LCAM is shown in Figure 2.
Through the reported results in Tables 5 and 6, ConvFaceNeXt_LCAM has the best verification accuracy for almost all the datasets. One reason for this performance gain might be the placement of LCAM at the rear of the ECN block. Specifically, this placement ensures a richer feature representation because information can be fully extracted by all three convolution layers in the ECN block. In contrast, the convolution operation is performed only once or twice prior to the attention module in ConvFaceNeXt_D1 and ConvFaceNeXt_P1, respectively, leading to suboptimal information extraction. Intuitively, the verification accuracy of ConvFaceNeXt_P1 is supposed to be better than that of ConvFaceNeXt_D1, since more information can be encoded by using more convolution layers. However, this is not the case, as the performance of ConvFaceNeXt_P1, with two convolution operations prior to the attention module, is lower than that of ConvFaceNeXt_D1 with one convolution operation. Upon closer observation, LCAM is integrated after the PReLU activation function in ConvFaceNeXt_P1. In contrast, the integration of the attention module in ConvFaceNeXt_D1 and ConvFaceNeXt_LCAM occurs after the batch normalization operation, which yields a better performance. This observation suggests that the nonlinear activation function causes information loss to a certain extent [25]. Moreover, ConvFaceNeXt_P1, with the attention module placed after the first pointwise convolution, has higher complexity than the other two models. This is because the channel expansion induced by the first pointwise convolution inevitably increases the number of parameters in the corresponding attention module. Apart from that, ConvFaceNeXt_LCAM performs equally well on the template-based datasets, as shown in Table 6 and Figure 8. With respect to the aforementioned reasons, the attention module is thus integrated at the end of the core building block.
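The three placement variants can be summarized schematically. The helper below (illustrative stage names, not the authors' code) returns the operation ordering of an ECN block for each variant: D1 inserts LCAM after the depthwise convolution and its batch normalization, P1 after the first pointwise convolution and its PReLU, and the chosen LCAM variant at the very end of the block.

```python
def ecn_block_stages(variant):
    """Operation ordering of an ECN block for each LCAM placement variant.
    Stage names are illustrative; the real block interleaves further
    normalization layers."""
    stages = ["dwconv", "bn", "pwconv_expand", "prelu", "pwconv_reduce", "bn"]
    insert_at = {"D1": 2, "P1": 4, "LCAM": 6}[variant]
    return stages[:insert_at] + ["lcam"] + stages[insert_at:]
```

Note that in both of the better-performing variants (D1 and the final LCAM placement), `"lcam"` immediately follows a `"bn"` stage rather than an activation, matching the observation above about information loss through the nonlinearity.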

Quantitative Analysis
In this section, the experimental results for all attention modules are reported. Based on the ablation study, LCAM adopts the optimum settings described in Section 4.3. This section further reviews the adaptability of LCAM to three different lightweight face recognition models, namely ConvFaceNeXt [11], MobileFaceNet [8], and ProxylessFaceNAS [10]. Specifically, the performance of each model plugged with LCAM or another attention module is examined. A performance bias tends to occur when comparing models trained and tested on different datasets. For the sake of fairness, all the models incorporating the different attention modules are trained from scratch on the UMD face dataset with the experiment settings presented in Section 4.2. These models are then evaluated on seven image-based and two template-based face datasets, as described in Section 4.1.
With the aim of reducing computational complexity, the ECN block is deployed as the core structure of ConvFaceNeXt. Moreover, blocks with the same output dimension are aggregated for comprehensive feature correlation. The results of ConvFaceNeXt plugged with different attention modules are reported in Tables 7 and 8. For the image-based datasets, the verification accuracies are based on seven face datasets, and the average accuracy for each attention module is shown in the last column of Table 7. A glance at the average accuracy shows that all of the attention modules improve the results compared to the baseline model with no attention mechanism. This conforms to the fact that attention modules increase the representational power of neural networks in general, and of lightweight face recognition models in particular. Typically, modules combining both channel and spatial attentions perform much better than those with channel attention alone. For example, models integrated with channel attention, such as SE and ECA, have lower average accuracy than SCA, TA, DAA, and LCAM, which consider both attentions. This proves that the two attentions complement each other for richer channel and spatial representations. With respect to the arrangement of channel and spatial attentions, it is observed that the parallel arrangements of SCA, TA, DAA, and LCAM yield better results than CBAM and ECBAM with their sequential channel-spatial configuration. This validates the superiority of factorizing the attention module into several parallel branches for effective attention map generation. In terms of complexity, models with channel attention alone have lower FLOPs and similar parameter counts compared with models incorporating both attentions. However, this low complexity comes at the cost of suboptimal performance.
It is interesting to note that besides having the highest average accuracy on the image-based datasets, the performance improvement of LCAM is achieved with the lowest FLOPs among the modules that integrate both channel and spatial attentions. For the template-based datasets, the performance of LCAM is among the best, as indicated by the verification accuracies in Table 8. As a whole, ConvFaceNeXt incorporating the LCAM attention module has the best overall result, taking into account the accuracies as well as the parameter and FLOP counts.
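The complexity gap between reduction-based channel attention and 1D-convolution-based branches can be illustrated with a rough parameter count. SE uses two fully connected layers C -> C/r -> C, so its cost grows quadratically with the channel count, whereas an ECA-style 1D convolution, like each LCAM branch as described here, needs only its kernel weights. The functions below are back-of-the-envelope estimates (biases omitted; LCAM's exact counts may differ).

```python
def se_params(c, r=16):
    """SE block: two fully connected layers c -> c//r -> c, biases omitted."""
    return 2 * c * (c // r)

def conv1d_branch_params(k=3, branches=3):
    """A 1D-conv attention branch needs only k weights, independent of the
    channel count; three parallel branches stay tiny."""
    return k * branches

# For a 256-channel stage, SE needs thousands of parameters, while three
# kernel-size-3 branches need only nine.
print(se_params(256), conv1d_branch_params())
```

This is why the text can report channel-plus-spatial performance at nearly channel-attention-only cost: the per-branch overhead does not scale with the width of the network.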
The core structure of MobileFaceNet is the inverted residual block. The performance of MobileFaceNet integrated with each attention module is shown in Tables 9 and 10. Likewise, all attention modules plugged into MobileFaceNet tend to achieve performance improvements compared to the baseline model. Several modules with both attentions, particularly ECBAM, TA, DAA, and LCAM, outperform the SE and ECA modules with a single attention mechanism. Another notable observation is that although ECA, ECBAM, and LCAM utilize almost identical channel attentions, the spatial attention of ECBAM, as well as the Vertical and Horizontal Attention Branches of LCAM, contribute towards a better performance. The same can be perceived for the SE and DAA modules, which adopt nearly identical channel attentions. In terms of arrangement, the parallel combinations in TA, DAA, and LCAM surpass the sequential combinations in CBAM and ECBAM. Among these three parallel-branch attention modules, LCAM attains the highest average accuracy. This validates the benefits of using only one convolution operation per branch and fully exploiting the height and width information in each spatial branch of LCAM. Intriguingly, LCAM is accurate despite face pose variations, as observed from the highest accuracy on CPLFW and VGG2-FP, in addition to the next best result on CFP-FP. Among the modules with both attentions, LCAM has the least complexity in terms of FLOPs without sacrificing verification performance, which is notably the best.
In addition, with reference to the results in Table 10, the performance of LCAM is among the best for the template-based datasets. This proves that LCAM is well adapted to the MobileFaceNet architecture. With the purpose of learning discriminative face features, an efficient model known as ProxylessFaceNAS has been suggested. The core structure of ProxylessFaceNAS is also an inverted residual block. Note that for ProxylessFaceNAS, the expansion ratio and kernel size of the block are larger than those of ConvFaceNeXt and MobileFaceNet. Consequently, ProxylessFaceNAS is the largest and heaviest among the three baseline models. The results for ProxylessFaceNAS attached with various attention modules are shown in Tables 11 and 12. For the image-based datasets, all the models with attention modules perform better than the baseline model, as shown in Table 11. Among these modules, TA, ECBAM, and LCAM outperform the others in terms of average verification accuracy. In addition, these three modules, with both attentions, outdo SE and ECA with only channel attention. In comparison with ECA, the spatial attention of ECBAM and LCAM contributes by complementing the channel attention for performance improvements. Moreover, in the context of ProxylessFaceNAS, ECBAM with its sequential arrangement has a similar performance to LCAM and TA with their parallel arrangements. Another interesting observation is that modules without a reduction ratio perform better than their corresponding counterparts with a reduction ratio. Specifically, the average verification accuracies of ECA, ECBAM, and LCAM are higher than those of SE, CBAM, and DAA, respectively. The reason is that some information might be lost due to dimensionality reduction when applying a reduction ratio in channel attention.
Besides the higher gain, the computation costs of models integrated with ECA, ECBAM, and LCAM are lower, which suits the requirements of lightweight face recognition models. Although the average accuracy of TA is comparatively better than that of LCAM, this comes at the cost of more computational complexity. Concretely, among modules with both attentions, TA has the highest number of FLOPs, while LCAM has the least. For the template-based datasets, LCAM achieves superior performance in comparison with the other attention modules, as shown in Table 12. These results demonstrate the flexibility of LCAM to perform well not only in the smaller ConvFaceNeXt and MobileFaceNet models, but also in larger lightweight face recognition models such as ProxylessFaceNAS.
In essence, LCAM is robust and can be plugged into any lightweight face recognition model for a more accurate and superior feature representation compared to the other attention modules. This shows that LCAM has great adaptability without worsening performance, as evidenced by the highest average verification accuracy on the image-based datasets when LCAM is integrated in ConvFaceNeXt and MobileFaceNet. For ProxylessFaceNAS, LCAM performs equally well, with results comparable to TA, albeit with much lower complexity. With regard to the template-based datasets, LCAM again shows a good and competitive performance as measured by verification accuracies.

Qualitative Analysis
In this section, the Grad-CAM [50] technique is used to visualize the effectiveness of the proposed LCAM attention module in recognizing and localizing important facial features, thus enhancing the model's representational power. Note that the Grad-CAM color scale ranges from red, through green, to blue, indicating the most to the least significant regions, respectively. Random examples of positive face image pairs, corresponding to the same individual, are taken. The visualization comprises the proposed LCAM and eight previous attention modules, all plugged into the ConvFaceNeXt model trained with the same dataset and settings as those described in Sections 4.1 and 4.2.
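As a reminder of what these visualizations compute, a bare-bones Grad-CAM fits in a few lines: each activation map of the chosen layer is weighted by the global average of its gradient with respect to the class score, the weighted maps are summed, and a ReLU keeps only positively contributing regions. The sketch below works on plain nested lists; real implementations obtain `activations` and `gradients` via framework hooks.

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM: weight each activation map by the global average
    of its gradients, sum over channels, then apply ReLU.
    activations, gradients: C x H x W nested lists."""
    C, H, W = len(activations), len(activations[0]), len(activations[0][0])
    cam = [[0.0] * W for _ in range(H)]
    for c in range(C):
        # channel importance: global-average-pooled gradient
        w_c = sum(sum(row) for row in gradients[c]) / (H * W)
        for h in range(H):
            for w in range(W):
                cam[h][w] += w_c * activations[c][h][w]
    # ReLU keeps only regions that positively support the class score
    return [[max(0.0, v) for v in row] for row in cam]

# One channel whose gradient is uniformly positive: the heatmap simply
# mirrors that channel's activation pattern.
cam = grad_cam([[[1.0, 0.0], [0.0, 0.0]]], [[[1.0, 1.0], [1.0, 1.0]]])
```

In practice the resulting map is upsampled to the input resolution and overlaid on the face image to produce figures like Figures 9-11.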
Two pairs of face images from LFW are shown in Figure 9. The first pair shows the faces of a lady from the front. For the top face image, LCAM can highlight the eyes, mouth, and ears properly. Note that it is vital to distinguish discriminative face parts such as the eyes, nose, mouth, and ears for face recognition tasks [51]. Although CA exhibits a similar ability to emphasize those face parts, unnecessary regions encompassing the whole upper hair area are excited as well. Some other attention modules, such as SE and ECBAM, are observed to perform excitation only on the eye parts. Similarly, ECA and TA focus only on the ear parts. Even though the whole face region is covered by DAA, the level of excitation is lower on discriminative face parts. For the face image of the lady at the bottom, LCAM shows the most excitation, particularly on the eyes, nose, and ears, as shown in the orange and yellow regions. ECA and TA highlight the whole face region as well, albeit with less excitation. Other attention modules show a less strong response towards distinctive face parts.
The second pair of images from the LFW dataset shows the frontal face of a guy. It is observed that LCAM has the capacity to locate discriminative face parts of the guy in both the top and bottom images. Although CA highlights the nose and mouth parts properly in the bottom image, the background regions are equally emphasized, which is not desirable. In addition, other attention modules, such as TA and DAA, only focus on the forehead and the lower right part of the chin, with no emphasis given to discriminative face parts, as observed in the bottom face images.
Two additional cross-pose face images from CPLFW are shown in Figure 10. The first pair shows the face of a guy from the front on top and a side view of his face on the bottom. In the front-view image, LCAM shows the most excitation on the eyes, nose, and mouth compared to the other attention modules. Similarly, TA highlights those parts, albeit with a weaker response towards the right eye and mouth. In contrast, other attention modules only focus on certain face regions; for example, CBAM and SCA focus on the less relevant forehead area. For the side-view image of the guy, LCAM shows more attentiveness to the eye and nose. DAA has comparable performance, while some other attention modules highlight face regions to a certain extent. The second pair consists of two side-view images of a lady. It is observed that LCAM focuses correctly on the face region in the top image. In addition, LCAM is the only method with a strong reaction to the nose tip in the bottom image. Although ECBAM is able to highlight most of the side face region on the right, discriminative face parts are neglected.
Lastly, two pairs of images of faces at different ages are shown in Figure 11. In each pair, the image on top corresponds to a face that is younger than the one on the bottom. For the first pair, consisting of images of a guy, LCAM exhibits more excitation on the eyes and nose in comparison with the other attention modules, as illustrated by the top images. Likewise, CA yields an equivalent performance. In contrast, the excitation of DAA on the face region is especially poor, focusing instead on irrelevant background and shirt areas. For the bottom face image with a hat, LCAM again shows significant excitation on the eyes, nose, and mouth. Generally, the other attention modules are able to highlight the face region, albeit some with less excitation. The second pair represents the face of a girl. For the top image, discriminative face parts such as the eyes and mouth are well emphasized by LCAM and CA. Regarding the bottom image, LCAM shows more excitation on the face region, particularly on the eyes. Contrarily, other attention modules have less excitation or only highlight certain facial parts. For instance, TA only responds to the eyes, while ignoring other discriminative parts.
Qualitative observations attest to LCAM's superiority not only in highlighting the face region, but also in emphasizing important face parts such as the eyes, nose, mouth, and ears. Concretely, these targeted parts play an important role in obtaining pose- and age-invariant features to boost the accuracy, as well as the overall performance, of a model [52].

Conclusions
An attention module known as the Low-Complexity Attention Module (LCAM) is proposed for mobile-based networks in general, and specifically, for lightweight face recognition models. The LCAM consists of three parallel branches to encode scaling information in the channel, height, and width dimensions. In order to ensure low complexity, each LCAM branch utilizes only one convolution operation. Concretely, the Channel Attention Branch deploys a 1D convolution, while the Vertical and Horizontal Attention Branches each employ a 1D asymmetric convolution. As a result, LCAM has fewer FLOPs and fewer parameters than other modules that consider both channel and spatial attentions. Aside from that, although the Vertical Attention Branch and the Horizontal Attention Branch are separate entities, each of these branches makes full use of the height and width information through its 1D asymmetric convolution. In this way, LCAM promotes information interaction within each spatial branch and minimizes information loss. Several integral attributes of LCAM are examined in the ablation studies. First, 1D asymmetric convolution with a kernel size of three is adopted with the aim of extracting more detailed information. Second, a comprehensive attention module is obtained by combining all three branches of LCAM to improve the overall performance. Finally, LCAM is integrated at the rear of the core building block to ensure richer feature representations and low computational costs. In the quantitative analysis, LCAM achieved the highest verification accuracy among all the attention modules, irrespective of the network model. Moreover, the qualitative observations imply that LCAM is capable of highlighting face regions while simultaneously emphasizing important face parts. Although LCAM shows a better performance, there are some limitations of the proposed module, which provide room for improvements.
Specifically, the pooling operations in the channel and spatial branches of LCAM might cause information loss to a certain extent. For instance, the global average pooling in the Channel Attention Branch could possibly cause a loss of spatial information, and vice versa for the spatial attention branches. In the future, mechanisms for further reducing information loss will be explored in LCAM, so as to achieve higher accuracy while retaining important information, which is crucial for face recognition tasks. In addition, the performance of LCAM in other vision applications, such as object detection and semantic segmentation, will be analyzed to investigate the generality of this attention module.