3.2. Overview of the Framework
The proposed model is illustrated in Figure 2. The model has a dual-branch structure consisting of a PvT branch, a CNN branch, a feature fusion module (FFM), two branch classifiers, and a fusion classifier based on second-order pooling (SOP). The structure of the PvT branch is described in detail in Section 3.3, while the CNN branch is based on the ResNet50 architecture. Both the PvT and the ResNet50 model were pretrained on the ImageNet dataset. In a colonoscopy image, the local features extracted by the CNN represent the relationship between the intestinal wall at the near-shot point and the coaxial extension line, while the global representations modeled by the PvT capture similar information in the cross-sectional space of the intestinal cavity. The FFM enhances the visual representation ability by combining local features with global representations in an interactive manner. The PvT and CNN branches both contain four stages, and the FFM performs feature fusion on the outputs of Stages 1 to 3; the fused features are then transmitted back to the two main branches.
The model has three outputs: the auxiliary outputs of the two branches are obtained with average pooling, while the outputs of the fourth stage are concatenated along the channel dimension and, after SOP and the fully connected (FC) layers, serve as the main output of the model. The total loss of the model is therefore the weighted sum of three losses, $L_{total} = \lambda_{pvt} L_{pvt} + \lambda_{cnn} L_{cnn} + \lambda_{combine} L_{combine}$, and the cross-entropy loss function with label smoothing is applied to all three. $\lambda_{pvt}$, $\lambda_{cnn}$, and $\lambda_{combine}$ are the weight coefficients of the three losses, respectively, and their proportions are discussed in Section 4.1.
Suppose that the true label corresponding to the $n$-th sample is $y_n$, and $\hat{y}_n$ is the final output of the network, that is, the prediction result for sample $n$. The calculation is expressed as follows:
$$L_{*} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\left[(1-\varepsilon)\,\mathbb{1}(y_n = k) + \frac{\varepsilon}{K}\right]\log \hat{y}_{n,k},$$
where $N$ is the number of samples, $K$ is the number of classification categories, and $\varepsilon$ is the coefficient of label smoothing. The subscript $*$ can stand for $pvt$, $cnn$, or $combine$.
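For illustration, the objective above can be assembled in PyTorch as in the minimal sketch below; the helper name `total_loss`, the default weight values, and the use of `nn.CrossEntropyLoss` with its `label_smoothing` argument are assumptions made here for clarity, with the actual loss weights being those discussed in Section 4.1.

```python
import torch.nn as nn

# Label-smoothed cross-entropy shared by the three outputs (epsilon = 0.1).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def total_loss(out_pvt, out_cnn, out_combine, target,
               lambda_pvt=1.0, lambda_cnn=1.0, lambda_combine=1.0):
    """Weighted sum of the losses of the two auxiliary branch outputs and
    the main (fused) output; the default weights are placeholders."""
    return (lambda_pvt * criterion(out_pvt, target)
            + lambda_cnn * criterion(out_cnn, target)
            + lambda_combine * criterion(out_combine, target))
```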
3.3. Pyramid Vision Transformer
The PvT was originally proposed as a pure Transformer backbone for dense prediction tasks, such as semantic segmentation and object detection [23]. The progressive shrinking pyramid and spatial-reduction attention layer in the PvT can learn high-resolution representations while reducing the computational costs. Although the architecture of the PvT is favorable for dense prediction tasks, it does not show a strong advantage for image classification tasks. The improved version of the PvT [24] introduces overlapping patch embedding (OPE) and a convolutional feed-forward network, thereby reducing the computational costs and exhibiting excellent performance in classification tasks. Each stage of the PvT consists of one OPE layer, one block containing $N_i$ basic component blocks, and one normalization layer.
Figure 3 shows the components of the $i$-th stage of the PvT in our study. Let $H_i$, $W_i$, and $d_i$ be the height, width, and embedding dimension of the features in the $i$-th stage, respectively. The flattened token output of the $(i-1)$-th stage is reshaped, and then OPE is carried out. In contrast to the linear embedding in the ViT, OPE is realized mainly by convolution operations with a kernel size larger than the stride. When $i = 1$, the convolution kernel size in OPE is 7 and the stride is 4, while when $i = 2, 3, 4$, the convolution kernel size is 3 and the stride is 2.
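To make this concrete, OPE can be approximated by a strided convolution whose kernel is larger than its stride, followed by flattening and layer normalization, as in the sketch below; the module name, the padding choice, and the exact stage settings follow the standard PvTv2 configuration and are assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Simplified overlapping patch embedding (OPE): the convolution kernel is
    larger than the stride, so adjacent patches overlap."""
    def __init__(self, in_chans, embed_dim, kernel_size, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H', W')
        _, _, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)     # token sequence (B, H'*W', embed_dim)
        return self.norm(x), h, w

# Stage 1 uses kernel 7 / stride 4; Stages 2-4 use kernel 3 / stride 2, e.g.:
# stage1_embed = OverlapPatchEmbed(in_chans=3, embed_dim=64, kernel_size=7, stride=4)
```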
In the basic component block, spatial reduction (SR) is performed first, and then multihead attention (MHA) is implemented. The MHA mechanism receives a query $Q$, a key $K$, and a value $V$ as the input, and the SR operation greatly reduces the scales of $K$ and $V$, effectively reducing the computational overhead and encouraging the model to learn higher-resolution representations. The MHA operation can be formulated as follows:
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_{h_i}\big)\,W^{O},$$
$$\mathrm{head}_j = \mathrm{Attention}\big(Q W_j^{Q},\ \mathrm{SR}(K)\,W_j^{K},\ \mathrm{SR}(V)\,W_j^{V}\big),$$
where $h_i$ is the number of heads in the attention layer at stage $i$ and Concat(·) is the concatenation operation. $W_j^{Q}$, $W_j^{K}$, and $W_j^{V}$ are the linear projection parameters of the $j$-th head, and $W^{O}$ is the output projection. The SR(·) operation can be formulated as follows:
$$\mathrm{SR}(x) = \mathrm{LN}\big(\mathrm{Reshape}_2\big(\mathrm{Conv}(x')\big)\big)\,W^{S}, \qquad x' = \mathrm{Reshape}_1(x),$$
where $x \in \mathbb{R}^{(H_i W_i) \times d_i}$ denotes the input sequence and $\mathrm{Reshape}_1(\cdot)$ and $\mathrm{Reshape}_2(\cdot)$ are reshape operations. $x'$ denotes the feature after $\mathrm{Reshape}_1(\cdot)$, i.e., $x$ rearranged to the size $H_i \times W_i \times d_i$, and $R_i$ represents the spatial reduction ratio, which is also the size of the kernel and stride in the convolution operation Conv(·). At the end of the spatial reduction operation, $\mathrm{Reshape}_2(\cdot)$ changes the convolved feature of size $\frac{H_i}{R_i} \times \frac{W_i}{R_i} \times d_i$ back to a sequence of size $\frac{H_i W_i}{R_i^{2}} \times d_i$, LN(·) refers to layer normalization, and $W^{S}$ is a linear projection. Note that the Attention(·) calculation is consistent with the original paper [42]:
$$\mathrm{Attention}(q, k, v) = \mathrm{Softmax}\!\left(\frac{q k^{\top}}{\sqrt{d_{head}}}\right)v,$$
where $d_{head}$ is the dimension of each attention head.
Depthwise convolution is introduced into the convolutional feed-forward network to capture the local continuity of the input tensor. The dimensional expansion factor between the two fully connected (FC) layers at the $i$-th stage is $E_i$. In our study, the settings of $d_i$, $h_i$, $R_i$, and $E_i$ in the four stages were [64, 128, 320, 512], [1, 2, 5, 8], [8, 4, 2, 1], and [8, 8, 4, 4], respectively, while the numbers of basic component blocks $N_i$ in the four stages were 3, 8, 27, and 3.
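As a reference for the equations above, the following sketch shows how spatial-reduction attention can be written in PyTorch, with the SR step realized by a strided convolution of kernel and stride $R_i$ followed by layer normalization. It is a simplified approximation of the PvTv2 block (the fused key/value projection and other details are assumptions), not the authors' exact code.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Simplified spatial-reduction attention: K and V are computed from a
    spatially reduced copy of the input, which shrinks the attention cost."""
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        self.num_heads = num_heads
        self.sr_ratio = sr_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)   # key and value projections, fused
        self.proj = nn.Linear(dim, dim)     # output projection W^O
        if sr_ratio > 1:
            # Conv with kernel = stride = R_i reduces H and W by a factor of R_i.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):             # x: (B, N, C) tokens with N = h * w
        b, n, c = x.shape
        head_dim = c // self.num_heads
        q = self.q(x).reshape(b, n, self.num_heads, head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(b, c, h, w)            # Reshape_1
            x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)    # Conv + Reshape_2
            x_ = self.norm(x_)                                    # LN
        else:
            x_ = x
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                  # each: (B, heads, N', head_dim)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # scaled dot-product
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```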
3.4. Feature Fusion Module
As shown in Figure 2, we developed a complementary design between the two branches, namely the feature fusion module (FFM). The structure of the FFM is illustrated in Figure 4. The FFM receives feature maps from the PvT and CNN branches, and after fusion at the channel scale, the output is sent back to the two main branches to enhance the complementary representation. Concretely, $F_{pvt}^{i}$ and $F_{cnn}^{i}$ are the intermediate feature maps from the PvT and CNN branches in stage $i$, which are aggregated by $f(\cdot)$:
$$\big[\hat{F}_{pvt}^{i},\ \hat{F}_{cnn}^{i}\big] = \mathrm{Split}\Big(f\big(\mathrm{Concat}\big(F_{pvt}^{i}, F_{cnn}^{i}\big)\big)\Big),$$
where Concat(·) is the concatenation operation and Split(·) is the tensor split operation. $f(\cdot)$ consists of 1 × 1 convolutions and ReLU activation functions and is designed for channel-level fusion. The $\hat{F}_{pvt}^{i}$ and $\hat{F}_{cnn}^{i}$ obtained by splitting along the channel dimension are then transmitted back to their respective branches and added to the original input features $F_{pvt}^{i}$ and $F_{cnn}^{i}$. It is worth noting that the FFM aggregates local features and global representations only in Stages 1 to 3; in Stage 4, the outputs of the two branches are concatenated, followed by SOP and the final classification.
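A minimal sketch of this channel-level fusion is given below; the number of 1 × 1 convolution–ReLU pairs and the assumption that the two branches share the same spatial resolution at each stage are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Simplified FFM: concatenate the PvT and CNN feature maps along the
    channel axis, fuse them with 1x1 convolutions + ReLU, split the result
    back into the two channel groups, and add each part to its branch."""
    def __init__(self, pvt_channels, cnn_channels):
        super().__init__()
        total = pvt_channels + cnn_channels
        self.pvt_channels = pvt_channels
        self.fuse = nn.Sequential(
            nn.Conv2d(total, total, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(total, total, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_pvt, f_cnn):          # both: (B, C_*, H, W), same H and W
        fused = self.fuse(torch.cat([f_pvt, f_cnn], dim=1))
        f_pvt_hat, f_cnn_hat = torch.split(
            fused, [self.pvt_channels, fused.shape[1] - self.pvt_channels], dim=1)
        # Residual-style feedback to each branch.
        return f_pvt + f_pvt_hat, f_cnn + f_cnn_hat
```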
3.5. Second-Order Pooling
In typical CNN structures, global average pooling applies first-order statistics to the extracted features to determine the final classification. However, compared with the complex learning process of the CNN, first-order statistics are relatively crude. Inspired by [37,38], in our study, second-order pooling was applied for the final abstraction of the features obtained by each branch. Specifically, $X_{pvt}$ and $X_{cnn}$ are the output features of the PvT and CNN branches in Stage 4, and $X$ is obtained after their concatenation along the channel dimension. We reshaped $X$ into a feature matrix $X \in \mathbb{R}^{d \times S}$, where $S = H \times W$ is the number of spatial positions and $d$ is the number of channels. The covariance matrix is calculated as follows:
$$\Sigma = X\,\bar{I}\,X^{\top}, \qquad \bar{I} = \frac{1}{S}\left(I - \frac{1}{S}\mathbf{1}\right),$$
where $I$ and $\mathbf{1}$ are the identity and all-ones matrices of size $S \times S$, respectively. Covariance normalization is beneficial for discriminative representations [39], and this normalization often relies on eigenvalue decomposition (EIG) or singular-value decomposition of the matrix. However, since graphics processing units (GPUs) are not well suited to EIG implementations, Newton–Schulz iterations were adopted to accelerate the covariance normalization process. To ensure the convergence of the Newton–Schulz iteration, $\Sigma$ is first normalized by its trace:
$$A = \frac{1}{\mathrm{tr}(\Sigma)}\,\Sigma,$$
where $\mathrm{tr}(\Sigma)$ denotes the trace of $\Sigma$. Given $Y_0 = A$ and $Z_0 = I$, for $l = 1, \ldots, L$, the Newton–Schulz iteration is given as follows:
$$Y_l = \frac{1}{2}\,Y_{l-1}\big(3I - Z_{l-1}Y_{l-1}\big), \qquad Z_l = \frac{1}{2}\big(3I - Z_{l-1}Y_{l-1}\big)Z_{l-1},$$
where $Y_L$ converges to the square root of $A$.
Note that the pre-normalization step nontrivially changes the magnitudes of the data, which adversely affects the network. Therefore, after the Newton–Schulz iteration, post-compensation is applied to produce the final normalized covariance matrix:
$$C = \sqrt{\mathrm{tr}(\Sigma)}\;Y_L,$$
where $C$ is a symmetric matrix; we extracted its upper triangular elements to use as the input to the final fully connected layer.
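The complete pooling step, from covariance computation through Newton–Schulz normalization to extraction of the upper-triangular elements, can be sketched in PyTorch as follows; the number of iterations $L$ and the numerical safeguard on the trace are assumptions made for illustration.

```python
import torch

def second_order_pooling(x, num_iter=5):
    """Covariance pooling with Newton-Schulz square-root normalization.
    x: (B, C, H, W) feature map; returns the upper-triangular part of the
    normalized covariance matrix for each sample."""
    b, c, h, w = x.shape
    s = h * w
    x = x.reshape(b, c, s)                                    # feature matrix, C x S
    i_bar = (torch.eye(s, device=x.device) - 1.0 / s) / s     # (1/S)(I - (1/S)1)
    sigma = x @ i_bar @ x.transpose(1, 2)                     # covariance, C x C

    # Pre-normalization by the trace to ensure Newton-Schulz convergence.
    trace = sigma.diagonal(dim1=-2, dim2=-1).sum(-1).clamp(min=1e-8)
    a = sigma / trace.view(b, 1, 1)

    # Newton-Schulz iteration: Y_L approximates the matrix square root of A.
    identity = torch.eye(c, device=x.device).expand(b, c, c)
    y, z = a, identity
    for _ in range(num_iter):
        t = 0.5 * (3.0 * identity - z @ y)
        y, z = y @ t, t @ z

    # Post-compensation restores the magnitude changed by pre-normalization.
    cov = y * trace.sqrt().view(b, 1, 1)

    # Keep only the upper-triangular elements of the symmetric matrix.
    rows, cols = torch.triu_indices(c, c, device=x.device)
    return cov[:, rows, cols]
```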
3.6. Implementation Details
During training, randomly cropped 224 × 224 regions were fed into the network. During testing, the colonoscopy images were resized to 256 × 256, and center-cropped 224 × 224 regions were used for prediction (for the InceptionV4 model, the images were resized to 320 × 320 and the input size was 300 × 300). For data augmentation, random horizontal and vertical flips were adopted, each with a probability of 0.3. In addition, contrast-limited adaptive histogram equalization (CLAHE) was applied to enhance the representation of the color features. The work of Mokter et al. [43] suggests that the vascular pattern is an important characteristic. As shown in Figure 5, the CLAHE algorithm improved the image contrast and could highlight the vascular texture or white ulcers, while random flipping helped increase image diversity. The model was trained using the stochastic gradient descent (SGD) optimizer with weight decay and a momentum of 0.9. The learning rate was initialized to 0.001 and decayed by a factor of 0.1 every 10 epochs over the 50 training epochs, with 128 images in each mini-batch. The cross-entropy loss with label smoothing was adopted, and the coefficient $\varepsilon$ was set to 0.1. Our work was implemented in PyTorch on two NVIDIA Quadro GV100 32 GB graphics cards.
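For reference, the preprocessing and optimization settings described above can be sketched with torchvision and PyTorch as follows; the resize applied before random cropping during training and the weight-decay value are not specified in the text, so they are left as placeholders, and CLAHE (available, e.g., through OpenCV's cv2.createCLAHE) is only noted in a comment.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# Training-time preprocessing: random 224 x 224 crop plus random flips with
# probability 0.3 each.  CLAHE would be applied to the image beforehand
# (e.g., via OpenCV's cv2.createCLAHE); a prior resize may also be needed if
# the source images are smaller than 224 x 224.
train_transform = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.ToTensor(),
])

# Test-time preprocessing: resize to 256 x 256, then center-crop 224 x 224.
test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_optimizer(model, weight_decay=0.0):
    """SGD with momentum 0.9; the learning rate starts at 0.001 and is
    multiplied by 0.1 every 10 epochs.  The weight-decay value is a
    placeholder, as the exact value is not given in the text."""
    optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9,
                    weight_decay=weight_decay)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
    return optimizer, scheduler
```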
For the classification task, we evaluated the performance of all methods according to the overall accuracy and the per-class accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), negative predictive value (NPV), and F1-score (F1). These metrics can be calculated as follows:
$$\mathrm{Overall\ Accuracy} = \frac{T}{N}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP},$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{PPV} \times \mathrm{SEN}}{\mathrm{PPV} + \mathrm{SEN}}.$$
Here, $N$ is the number of samples, while $T$ denotes the number of samples with correct predictions. TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative samples for each prediction category, respectively.
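A straightforward way to compute these metrics from predicted and true labels is sketched below in a one-vs-rest fashion; the function name and output layout are arbitrary choices for illustration.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Compute ACC, SEN, SPE, PPV, NPV, and F1 for each class (one-vs-rest),
    plus the overall accuracy, from integer label arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall_acc = float((y_true == y_pred).mean())
    metrics = {}
    for k in range(num_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        tn = np.sum((y_pred != k) & (y_true != k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        sen = tp / (tp + fn) if tp + fn else 0.0      # sensitivity (recall)
        spe = tn / (tn + fp) if tn + fp else 0.0      # specificity
        ppv = tp / (tp + fp) if tp + fp else 0.0      # positive predictive value
        npv = tn / (tn + fn) if tn + fn else 0.0      # negative predictive value
        acc = (tp + tn) / (tp + tn + fp + fn)
        f1 = 2 * ppv * sen / (ppv + sen) if ppv + sen else 0.0
        metrics[k] = dict(ACC=acc, SEN=sen, SPE=spe, PPV=ppv, NPV=npv, F1=f1)
    return overall_acc, metrics
```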