Given breast tumor ultrasound images, ResNet extracts multi-scale feature maps that capture both local and global information. These feature maps are then fed into the CLEP module for lesion region perception. The lesion enhancement perception feature map obtained from the CLEP module is further utilized in the breast tumor classification task to improve the performance of lesion type classification. Notably, feature fusion explicitly associates the classification and lesion perception branches through a module that takes ResNet's feature map and the CLEP module's feature map as inputs. This design alleviates potential conflicts between the two branches and ensures enhanced performance for the breast tumor classification task. The following sections provide a detailed discussion of each module.
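To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of how the two branches could be wired together. It is an illustration only: the class name, the 1 × 1 convolutional stand-in for CLEP, and the multiplicative stand-in for fusion are assumptions, not the implementation described in this paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class BreastTumorNet(nn.Module):
    """Sketch of the two-branch design: backbone -> CLEP -> fusion -> classifier.
    CLEP and fusion are reduced to simple stand-ins; only the wiring is shown."""
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # 512-channel feature map
        # Stand-in for CLEP: produces a 1-channel lesion probability map
        self.clep = nn.Sequential(nn.Conv2d(512, 1, kernel_size=1), nn.Sigmoid())
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                 # (B, 512, H/32, W/32) backbone features
        lesion_map = self.clep(feats)            # (B, 1, H/32, W/32) lesion probabilities
        fused = feats * lesion_map               # hypothetical fusion stand-in
        pooled = fused.mean(dim=(2, 3))          # global average pooling
        return self.classifier(pooled), lesion_map
```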
2.2.2. Contextual Lesion Enhancement Perception (CLEP) Module
As shown in Figure 3, the CLEP module incorporates a Coordinate Attention (CA) module, a Convolutional Block Attention Module (CBAM), and a feature fusion operation. This design is motivated by the versatility of CBAM, which can be integrated into various CNNs and seamlessly enhance both classification and localization performance [24]. However, CBAM only captures local relationships and cannot model the long-range dependencies crucial for visual tasks. Therefore, we introduce a Coordinate Attention (CA) module alongside CBAM, embedding positional information into channel attention. This enables the network to capture direction-aware and position-sensitive information, facilitating the model's ability to recognize targets of interest [25].
Specifically, the multi-scale feature maps
extracted from the FE are initially fed into the CA module, as illustrated in
Figure 4. This module encodes both channel relationships and long-range dependencies through two steps: coordinate information embedding and coordinate attention generation.
Global average pooling is commonly used in channel attention to encode spatial information in a global manner. However, this approach compresses global spatial information into a single channel descriptor, making it difficult to retain positional information, which is crucial for capturing spatial structure in visual tasks. To facilitate the attention block in accurately capturing long-range interactions in space, we decompose the global pooling into a pair of 1D feature encoding operations. Specifically, for a given feature $F$, we use two pooling kernels with spatial extents of $(H, 1)$ and $(1, W)$ to encode each channel along the horizontal and vertical coordinates, respectively. Therefore, the output of the $c$-th channel at height $h$ can be expressed as

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \quad (2)$$

Similarly, the output of the $c$-th channel at width $w$ can be written as

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \quad (3)$$
These two transformations aggregate features along two spatial dimensions, resulting in a pair of directionally aware feature maps. These transformations also allow our attention block to capture long-range dependencies in one spatial direction while preserving precise positional information in the other. This enables the network to more accurately localize objects of interest.
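A minimal PyTorch sketch of this pair of 1D pooling operations is shown below; the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

# Coordinate information embedding: pool each channel along one spatial axis
# while keeping the other, so positional information is preserved (Eqs. (2)-(3)).
x = torch.randn(2, 64, 32, 48)            # (B, C, H, W) example input

pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (1, W) extent: average over width -> z^h
pool_w = nn.AdaptiveAvgPool2d((1, None))  # (H, 1) extent: average over height -> z^w

z_h = pool_h(x)                           # (B, C, H, 1): one descriptor per height index
z_w = pool_w(x)                           # (B, C, 1, W): one descriptor per width index
print(z_h.shape, z_w.shape)
```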
To leverage these expressive representations, the module incorporates a second transformation, referred to as coordinate attention generation. Specifically, the aggregated feature maps derived from Equations (2) and (3) are concatenated and then fed into a shared 1 × 1 convolutional transformation function $F_1$. This process generates a feature map $f$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, and $\delta$ denotes a nonlinear activation function. The resulting feature map $f \in \mathbb{R}^{C/r \times (H+W)}$ serves as an intermediate representation that encodes spatial information in both the horizontal and vertical directions. Here, $r$ is a reduction ratio that controls the block size. Next, we split $f$ along the spatial dimension into two separate tensors, $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. Two 1 × 1 convolution transformations, $F_h$ and $F_w$, are used to convert $f^h$ and $f^w$ into tensors with the same number of channels as the input feature map $F$. This results in

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$

where $\sigma$ represents the sigmoid function. To limit computational complexity, an appropriate reduction ratio $r$ (e.g., 32) is used to reduce the number of channels in $f$. The outputs $g^h$ and $g^w$ are then expanded and used as attention weights. Ultimately, the output $Y$ of the Coordinate Attention block can be expressed as

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
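The complete block can then be sketched as follows; the use of BatchNorm, ReLU as the non-linearity $\delta$, and the minimum width of the reduced channels are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Simplified Coordinate Attention: directional pooling, shared 1x1 conv,
    split, per-direction 1x1 convs, and sigmoid gating (see equations above)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)      # shared F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                          # non-linearity (delta)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)     # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)     # F_w

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                  # (B, C, H, 1), Eq. (2)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (B, C, W, 1), Eq. (3)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # concat along space
        f_h, f_w = torch.split(f, [h, w], dim=2)           # split back into two tensors
        g_h = torch.sigmoid(self.conv_h(f_h))              # (B, C, H, 1) attention weights
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))  # (B, C, 1, W) attention weights
        return x * g_h * g_w                               # y_c(i, j) = x_c(i, j) * g_h * g_w

# usage
y = CoordinateAttention(64)(torch.randn(2, 64, 32, 48))
```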
Then, the feature maps refined by the Coordinate Attention (CA) module are fed into the Convolutional Block Attention Module (CBAM). CBAM consists of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), which process the input feature maps sequentially, as shown in Figure 5. CBAM effectively compresses the input feature maps and selectively highlights the discriminative inter-channel and inter-spatial relationships within them.
Specifically, in the CBAM's Channel Attention Module (CAM), the input feature map $F$ undergoes a max pooling layer and an average pooling layer in parallel. This yields $F_{\mathrm{max}}^{c}$ and $F_{\mathrm{avg}}^{c}$, each a vector of dimensions $C \times 1 \times 1$. These two vectors are then processed by a shared multi-layer perceptron (MLP) consisting of an input layer, a hidden layer, and an output layer, with the respective weights denoted as $W_0$ and $W_1$. After passing through the MLP, the two vectors are merged by element-wise addition. Finally, a sigmoid activation function is applied, resulting in the final output of the Channel Attention Module. It can be represented as

$$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right) = \sigma\left(W_1\left(W_0\left(F_{\mathrm{avg}}^{c}\right)\right) + W_1\left(W_0\left(F_{\mathrm{max}}^{c}\right)\right)\right)$$
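A compact sketch of this channel attention is given below; the reduction ratio of the hidden layer is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: parallel avg/max pooling, a shared MLP (W0, W1),
    element-wise addition, then sigmoid (see the equation above)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP: W1(W0(.))
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))               # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1) # M_c(F), shape (B, C, 1, 1)

# usage: re-weight the input channels
x = torch.randn(2, 64, 32, 48)
x = x * ChannelAttention(64)(x)
```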
After the CAM, the output $M_c(F)$ is a tensor of dimensions $C \times 1 \times 1$. Element-wise multiplication is then performed between $M_c(F)$ and the corresponding channels of the original feature map, which has dimensions $C \times H \times W$. The result is a new feature map, denoted as $F'$. In the Spatial Attention Module (SAM) of CBAM, the feature map $F'$ undergoes two separate pooling operations along the channel axis: a max-pooling and an average-pooling operation. Each pooling operation produces a 2D map of dimensions $1 \times H \times W$, generating $F_{\mathrm{max}}^{s}$ and $F_{\mathrm{avg}}^{s}$, respectively. These two maps are then concatenated and passed through a $7 \times 7$ convolution layer, followed by a sigmoid activation function, resulting in a $1 \times H \times W$ spatial attention map. This can be represented as

$$M_s(F') = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')\right]\right)\right) = \sigma\left(f^{7 \times 7}\left(\left[F_{\mathrm{avg}}^{s}; F_{\mathrm{max}}^{s}\right]\right)\right)$$
In a manner similar to the Channel Attention module, the final output needs to be element-wise multiplied with the original input feature map to merge the two, producing the final output feature map for the CBAM module.
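The spatial attention can be sketched analogously; the $7 \times 7$ kernel follows the standard CBAM setting, and the final multiplication merges the attention map with the input feature map.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise avg/max pooling, concatenation,
    a 7x7 convolution, and sigmoid, producing a 1 x H x W attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # (B, 1, H, W): average over channels
        mx = x.amax(dim=1, keepdim=True)         # (B, 1, H, W): max over channels
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F')
        return x * m_s                           # merge with the input feature map

# usage
out = SpatialAttention()(torch.randn(2, 64, 32, 48))
```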
The feature maps processed by CA and CBAM are fed into a convolutional layer that compresses them to a common number of channels, thereby extracting the regions of interest (ROIs) related to the lesion and refining the learned knowledge. Each compressed intermediate feature map is resized to a common spatial resolution and then concatenated into a merged feature map. Finally, the merged feature map passes through a series of convolutional layers, batch normalization (BN), and a sigmoid activation layer to produce the lesion perception feature map $P$, in which each pixel represents the probability of belonging to the lesion or the background. A binarization function then determines the predicted lesion position mask $Y$. Since the confidence threshold affects the model's final results (a higher threshold makes the model more conservative, reducing false positives, while a lower threshold makes it more aggressive, capturing all possible lesions), we adopt the commonly used default threshold of 0.5. This process can be formulated as

$$P = \sigma\left(D\left(F;\ \theta_D\right)\right), \qquad Y = \mathrm{binarize}(P)$$

where $\sigma$ denotes the sigmoid function, $P$ represents the probability that each pixel belongs to the lesion, $D$ is the mapping function of the CLEP module, parameterized by the trainable parameters $\theta_D$, and $\mathrm{binarize}(\cdot)$ is a threshold function.
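A minimal sketch of this perception head and the thresholding step follows; the channel sizes and number of layers are assumptions.

```python
import torch
import torch.nn as nn

class LesionPerceptionHead(nn.Module):
    """Sketch of the CLEP output head: conv + BN + sigmoid produce a per-pixel
    lesion probability map P, which is binarized into the mask Y."""
    def __init__(self, in_channels, threshold=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, kernel_size=1),
            nn.Sigmoid(),                       # P: per-pixel lesion probability
        )
        self.threshold = threshold              # default confidence threshold of 0.5

    def forward(self, merged_feats):
        p = self.layers(merged_feats)           # lesion perception feature map P
        y = (p >= self.threshold).float()       # binarize(P): predicted lesion mask Y
        return p, y

# usage
p, y = LesionPerceptionHead(128)(torch.randn(2, 128, 64, 64))
```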
2.2.3. Multi-Feature Fusion Module
Subsequently, we use the Patch Embedding module to convert the feature map from the feature extractor and the lesion enhancement perception feature map from the CLEP module into two sequences of tokenized representations, denoted $T_{\mathrm{FE}}$ and $T_{\mathrm{CLEP}}$, respectively.
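Patch embedding can be realized, as is common for transformer-style tokenization, by projecting non-overlapping patches with a strided convolution and flattening the result; the patch size, embedding dimension, and shared projection below are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turns a (B, C, H, W) feature map into a (B, N, D) token sequence by
    projecting non-overlapping patches with a strided convolution."""
    def __init__(self, in_channels, embed_dim=256, patch_size=2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) tokens

# usage: tokenize the backbone (FE) and CLEP feature maps
embed = PatchEmbedding(in_channels=512)
t_fe = embed(torch.randn(2, 512, 16, 16))     # T_FE
t_clep = embed(torch.randn(2, 512, 16, 16))   # T_CLEP
```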
Next, the tokenized representations $T_{\mathrm{FE}}$ and $T_{\mathrm{CLEP}}$ are fed into the Multi-Feature Fusion (MFF) module to generate the fused feature $F_{\mathrm{fused}}$. The fusion process can be represented as

$$F_{\mathrm{fused}} = \mathrm{MFF}\left(T_{\mathrm{FE}},\ T_{\mathrm{CLEP}}\right)$$

where MFF consists of a Discrepancy Feature Fusion Module (DFFM) and a pair of alternating Common Feature Enhancement Fusion Modules (CFEFMs), which are designed to extract global dependency features, namely, discrepancy features and common features. A schematic diagram of this fusion module is shown in Figure 6.
The goal of feature fusion is to obtain a composite feature map that captures prominent targets while preserving fine texture details. Therefore, leveraging the differences and shared features present in different feature maps is crucial for achieving optimal fusion performance. Inspired by the effectiveness of cross-attention mechanisms in extracting common features between images, we introduce the DFFM and the CFEFM.
To effectively capture the differential features between $T_{\mathrm{FE}}$ and $T_{\mathrm{CLEP}}$ generated in the previous stage, we employ a DFFM in the form of cross-attention, as shown in Figure 7. It takes $T_{\mathrm{FE}}$ and $T_{\mathrm{CLEP}}$ as inputs and outputs features that highlight their differences.
Specifically, to explore the long-distance relationships within these features, we partition $T_{\mathrm{FE}}$ and $T_{\mathrm{CLEP}}$ into $s$ local feature segments, as follows:

$$T_{\mathrm{FE}} = \left[t_1^{\mathrm{FE}}, t_2^{\mathrm{FE}}, \ldots, t_s^{\mathrm{FE}}\right], \qquad T_{\mathrm{CLEP}} = \left[t_1^{\mathrm{CLEP}}, t_2^{\mathrm{CLEP}}, \ldots, t_s^{\mathrm{CLEP}}\right]$$

where $t_i^{\mathrm{FE}}$ and $t_i^{\mathrm{CLEP}}$ denote the $i$-th local segments and $i = 1, \ldots, s$. Subsequently, we employ a linear layer to transform the token segments into the query $Q$, key $K$, and value $V$, where the $\mathrm{Linear}(\cdot)$ function denotes a linear projection operator shared among the different segments.
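A sketch of the segment partition and the shared linear projection is given below; the segment count, dimensions, and the assignment of streams to query versus key/value are assumptions.

```python
import torch
import torch.nn as nn

def partition(tokens, s):
    """Split a (B, N, D) token sequence into s local segments of N // s tokens."""
    b, n, d = tokens.shape
    return tokens.view(b, s, n // s, d)          # (B, s, N/s, D)

# example token sequences (B=2, N=64 tokens, D=256 channels), s=4 segments
t_fe = torch.randn(2, 64, 256)
t_clep = torch.randn(2, 64, 256)
seg_fe = partition(t_fe, s=4)
seg_clep = partition(t_clep, s=4)

# shared linear projections applied segment-wise to obtain Q, K, V
to_q = nn.Linear(256, 256)
to_k = nn.Linear(256, 256)
to_v = nn.Linear(256, 256)
q = to_q(seg_fe)       # queries from one stream (assumed assignment)
k = to_k(seg_clep)     # keys from the other stream (assumed assignment)
v = to_v(seg_clep)     # values from the other stream (assumed assignment)
print(q.shape, k.shape, v.shape)   # each (2, 4, 16, 256)
```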
To explore the shared information between the two tokenized features while considering long-range relationships, we first apply Softmax to the scaled dot product $Q_i K_j^{T}/\sqrt{d}$ (where $i$ and $j$ range from 1 to $s$) to normalize each element into a probability distribution, and then multiply the result with $V$ to infer the shared feature information between $Q$ and $V$. This process can be expressed as

$$F_{\mathrm{share}} = \mathrm{Softmax}\left(\frac{Q_i K_j^{T}}{\sqrt{d}}\right) V$$

where $\sqrt{d}$ is a scaling factor that mitigates gradient saturation in the Softmax function as the dot product grows. After this, we can easily extract the difference information between $Q$ and $V$ by removing the shared feature information. This process can be represented as

$$F_{\mathrm{diff}} = Q - F_{\mathrm{share}}$$
To obtain complementary feature information from the two features, we inject the differential features into $Q$, which can be represented as

$$\hat{F} = Q + F_{\mathrm{diff}}$$

Then, we generate the DFFM output by applying a multi-layer perceptron (MLP) with layer normalization (LN) to $\hat{F}$ and adding $\hat{F}$ again:

$$F_{\mathrm{DFFM}} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{F}\right)\right) + \hat{F}$$

where $F_{\mathrm{DFFM}}$ is the output of the DFFM.
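Putting these steps together, the DFFM can be sketched as a single-head cross-attention block; the residual and normalization placement follows the description above, while the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DFFM(nn.Module):
    """Discrepancy Feature Fusion sketch: shared features via scaled dot-product
    cross-attention, difference = Q - shared, injection into Q, then LN + MLP."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, t_a, t_b):
        q, k, v = self.to_q(t_a), self.to_k(t_b), self.to_v(t_b)
        scale = q.shape[-1] ** 0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)
        shared = attn @ v                    # F_share: common information between streams
        diff = q - shared                    # F_diff: discrepancy information
        injected = q + diff                  # inject the differences back into Q
        return self.mlp(self.norm(injected)) + injected   # MLP(LN(.)) + residual

# usage with (B, N, D) token sequences
out = DFFM()(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```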
To integrate the shared feature information of the two representations into the fused feature and enhance it, the proposed fusion module adopts a CFEFM after the DFFM to alternately extract common feature information from the two tokenized representations. The structure of the CFEFM is shown in Figure 8.
To infuse shared feature information into the fused feature, we first use the segments of one tokenized representation as the query $Q$ and the segments of the other as the key $K$ and value $V$. The shared feature information between $Q$ and $K$ can then be expressed, analogously to the DFFM, as

$$F_{\mathrm{share}}' = \mathrm{Softmax}\left(\frac{Q_i K_j^{T}}{\sqrt{d}}\right) V$$
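One CFEFM pass can be sketched by combining the cross-attention above with the addition and MLP steps described next; the roles of the inputs and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CFEFM(nn.Module):
    """Common Feature Enhancement Fusion sketch: extract shared features with
    cross-attention, add them to the fused feature, then apply LN + MLP."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, fused, t_a, t_b):
        q, k, v = self.to_q(t_a), self.to_k(t_b), self.to_v(t_b)
        scale = q.shape[-1] ** 0.5
        shared = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1) @ v  # F'_share
        enhanced = fused + shared            # infuse shared information into the fused feature
        return self.mlp(self.norm(enhanced)) # LN + MLP produce the CFEFM output

# usage: first pass after the DFFM; a second pass can swap which stream supplies Q
out = CFEFM()(torch.randn(2, 64, 256), torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```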
Next, we add the shared feature information $F_{\mathrm{share}}'$ to the fused feature, and the result is passed through a multi-layer perceptron (MLP) with layer normalization (LN) to produce the output of the first CFEFM. Subsequently, we infuse the shared feature information between the fused feature and the other tokenized representation into the fused feature to further enrich it; this process follows the same formulation as Formulas (25) and (26). This design enables the learned features to emphasize the lesion region while mitigating the negative effects caused by noise or artifacts in non-lesion areas of breast tumor ultrasound images. The classification head comprises an average pooling layer and a fully connected layer. It leverages the enriched feature map to predict the probability of each category for the input breast tumor ultrasound image. The classification process can be described as

$$p = C\left(F_{\mathrm{fused}};\ \theta_C\right)$$
where $C$ is the mapping function of the classifier parameterized by $\theta_C$, $F_{\mathrm{fused}}$ represents the final fused feature, and $p$ indicates the probability that the input breast tumor ultrasound image belongs to each category.
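The classification head itself reduces to average pooling over the fused tokens followed by a fully connected layer; a minimal sketch is shown below, where the embedding dimension and number of classes are assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average pooling over the fused tokens followed by a fully connected layer,
    producing per-class probabilities for the input ultrasound image."""
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, fused_tokens):               # (B, N, D) final fused feature
        pooled = fused_tokens.mean(dim=1)          # average pooling over tokens
        return torch.softmax(self.fc(pooled), dim=-1)  # p: class probabilities

# usage
probs = ClassificationHead()(torch.randn(2, 64, 256))
```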