PHF3 Technique: A Pyramid Hybrid Feature Fusion Framework for Severity Classification of Ulcerative Colitis Using Endoscopic Images

Evaluating the severity of ulcerative colitis (UC) through the Mayo endoscopic subscore (MES) is crucial for understanding patient conditions and providing effective treatment. However, UC lesions present different characteristics in endoscopic images, exacerbating interclass similarities and intraclass differences in MES classification. In addition, inexperience and review fatigue in endoscopists introduce nontrivial challenges to the reliability and repeatability of MES evaluations. In this paper, we propose a pyramid hybrid feature fusion framework (PHF3) as an auxiliary diagnostic tool for clinical UC severity classification. Specifically, the PHF3 model has a dual-branch hybrid architecture with ResNet50 and a pyramid vision Transformer (PvT), where the local features extracted by ResNet50 represent the relationship between the intestinal wall at the near-shot point and its depth, and the global representations modeled by the PvT capture similar information in the cross-section of the intestinal cavity. Furthermore, a feature fusion module (FFM) is designed to combine local features with global representations, while second-order pooling (SOP) is applied to enhance discriminative information in the classification process. The experimental results show that, compared with existing methods, the proposed PHF3 model has competitive performance. The area under the receiver operating characteristic curve (AUC) of MES 0, MES 1, MES 2, and MES 3 reached 0.996, 0.972, 0.967, and 0.990, respectively, and the overall accuracy reached 88.91%. Thus, our proposed method is valuable for developing an auxiliary assessment system for UC severity.


Introduction
Ulcerative colitis (UC) is a chronic inflammatory bowel disease characterized by mucosal inflammation, which begins in the rectum and extends proximally into the colon in a continuous manner. Bloody diarrhea is the most common early symptom of UC, and other clinical symptoms include abdominal pain, faecal urgency, tenesmus, and vomiting [1,2]. In recent years, although the incidence of UC has stabilized in developed regions, the disease burden remains high [3]. Moreover, in developing regions, with the acceleration of urbanization, the incidence of UC continues to increase, reaching 5.41 cases per 100,000 persons in India [4]. Endoscopy plays a fundamental role in the diagnosis, treatment, and management of UC, especially in monitoring disease activity and responses to treatment [5]. Endoscopic mucosal remission is an important therapeutic goal for UC, as well as the basis for evaluating future colorectal cancer risk and improving the prognostic quality of life [6]. Therefore, accurately assessing UC activity and the overall severity of the disease is critical for selecting the best management strategy for patients [7].
At present, the most commonly used evaluation index for assessing the severity of UC in clinical practice is the Mayo score [8], and the Mayo endoscopic subscore (MES) is the most important component of the overall Mayo score [6,9]. The MES evaluates the degree of damage to the intestinal mucosa. As shown in Figure 1, the MES classifies mucosal injury into four levels: normal or inactive, mild disease, moderate disease, and severe disease. However, using the MES for endoscopic evaluation is difficult and requires that endoscopists be trained. The reliance on subjective interpretations by endoscopists also hinders the reliability and repeatability of MES classification [10,11]. In addition, inexperience and review fatigue may lead endoscopists to misjudge the severity of UC, which may result in delayed treatment and missing the best time to change treatment decisions. Fortunately, artificial intelligence technology has been used to assist endoscopists in the rapid and accurate determination of UC severity. In recent years, convolutional neural networks (CNNs) have made substantial progress in the field of computer vision and are widely used in medical image classification [12,13], segmentation, registration, reconstruction, and object detection [14][15][16]. Due to the powerful feature extraction ability of CNNs, CNN-based deep learning models have been applied to colonoscopies to identify a variety of diseases in the small intestine [17] and detect polyps [18], significantly reducing the workload of endoscopists. The excellent long-range dependency modeling capabilities of the Vision Transformer (ViT) have popularized this approach in the field of computer vision [19]. Moreover, the number of technical reports on the ViT in medical image analysis has increased exponentially [20].
While the ViT can compensate for the CNN's inability to capture global representations, the input mode of patch embedding ignores local details and lacks local inductive bias and an overall hierarchical structure. As a result, many excellent studies that combine the ViT and CNN to utilize their complementary advantages have emerged [21,22].
Compared with other medical imaging modalities, colonoscopy images are closer to natural images and have three color channels. The progressive shooting characteristics of colonoscopy cause colonoscopy images to appear diversified, and many shooting points contain not only the features of the same intestinal lumen cross-section but also the depth features of the intestinal cavity. In one colonoscopy image, although the upper left corner may be far from the lower right corner, the characteristics of these regions may be similar in the bowel lumen sectional space. Therefore, long-range relationship modeling may be critical for extracting information from colonoscopy images. Inspired by the pyramid vision Transformer (PvT) proposed by Wang et al. [23,24] and rich CNN-Transformer feature aggregation networks [25], we propose a pyramid hybrid feature fusion framework (PHF3) for UC severity classification. Compared with the ViT, the PvT can learn high-resolution representations while limiting the computational cost and has a more nuanced local feature extraction process, but its relationship modeling is still global and lacks the inductive bias unique to convolution. The pyramidal hierarchical structure of the PvT creates a unique condition for its fusion with the CNN, avoiding the problem of feature dimension mismatch when the CNN is combined with the Transformer. The design of the PHF3 dual-branch stream hybrid architecture ensures that the local features and the global representations remain relatively independent while complementing each other, providing a simple and effective form of feature fusion. The main contributions of this article can be summarized as follows: (1) The dual-branch pyramid hybrid architecture combines two feature extractors, namely a CNN and a PvT, to extract the deep features and cross-sectional spatial features of the intestinal cavity in colonoscopy images.
(2) A feature fusion module was designed to integrate the local features extracted by the CNN and the global dependencies modeled by the PvT, thereby improving the classification accuracy by enhancing the feature representation ability. (3) At the output of the model, the second-order aggregation of the features was applied to enhance discriminative representations, which is effective for classifying the UC severity. In addition, an iterative method for covariance normalization was utilized to accelerate network training.
The remainder of this article is organized as follows. In Section 2, related work on UC severity classification and hybrid architectures combining the CNN and ViT is summarized. The proposed PHF3 model for UC severity classification is described in detail in Section 3. The performance of the PHF3 model is evaluated in Section 4 and compared with that of conventional deep models. A discussion and some conclusions are presented in Sections 5 and 6, respectively.

Deep Learning for UC Severity Classification
In recent years, deep-learning-based algorithms have replaced traditional machine learning methods due to their superior recognition ability and end-to-end training strategies and have shown promise in gastroenteroscopy image diagnosis applications [26,27]. However, few studies have assessed the severity of UC. Ozawa et al. [28] constructed a computer-aided diagnosis system based on GoogleNet to identify MES 0 and MES 0-1, which was the first study that explored the performance of CNNs in evaluating different disease activity levels in UC. Subsequently, Stidham [29] showed that a deep learning model performed similarly to experienced human reviewers in grading the endoscopic severity of UC. In the recognition of MES 0-1 and MES 2-3, an Inception V3-based image classification architecture achieved an area under the receiver operating characteristic curve (AUC) of 0.970 (95% CI, 0.967-0.972). Bhambhvani et al. [6] developed a deep learning model based on ResNeXt101, with the aim of automatically classifying the MES of individuals with UC. However, this study included only three categories (MES 1-3), and the number of samples was small. Thus, recent deep-learning-based UC severity classification strategies are based on CNNs, and there have been few reports on four-level MES assessments. In a recent study, Luo et al. [30] designed an efficient attention mechanism network (EAM-Net), feeding the features extracted by convolutional neural networks into EAM-Net and recurrent neural networks, respectively, and achieved advanced results in the UC severity classification task, with overall accuracies of 0.906 and 0.916 on two datasets. However, its DenseNet-based backbone may struggle to model global relationships [31].

Dual-Branch Stream Hybrid Architecture of CNN and ViT
Since the advantages and disadvantages of the CNN and ViT have been revealed, combining a CNN and ViT to develop a model with better performance has become a popular research topic. In general, these diverse works can be divided into three categories: conv-like Transformers, Transformer-like ConvNets, and conv-Transformer hybrids [32]. Among them, the conv-Transformer hybrid takes advantage of the CNN and ViT in a more direct and simpler way, and the dual-branch stream structure is one example of this type of model. In this kind of structure, an effective feature fusion module is critical. Peng et al. [31] proposed Conformer, which utilizes a convolution operation and a self-attention mechanism to enhance representation learning. The feature coupling unit of Conformer integrates local features and global representations at different resolutions in an interactive manner. Chen et al. [33] presented Mobile-Former, which adopts a parallel lightweight bidirectional bridge design between MobileNet and Transformer. Yoo et al. [25] designed a more concise feature aggregation method in which the flat features of Transformer linear embeddings are rearranged, concatenated, and combined with CNN features. Due to the mismatch between the intermediate feature dimensions of the CNN and ViT, the design of the feature fusion module in these studies was relatively complex. Liu et al. [34] developed a hybrid architecture named CVM-Cervix, which does not include any interactions or fusion between the CNN and ViT branches, with a multilayer perceptron applied only at the output to combine the features of the two branches. The excellent performance of CVM-Cervix in cervical cancer classification tasks suggests that effective fusion at the output may be indispensable.

Higher-Order Statistics in Deep Learning
Since Lin et al. [35] proposed bilinear CNNs, many studies [36,37] have found that high-order pooling representations and deep CNN integration introduce promising improvements in challenging fine-grained visual classification tasks. Li et al. [37,38] conducted global covariance pooling for convolution features, achieving better improvements than those achieved by first-order pooling, and proposed a covariance iterative normalization method. Dai et al. [39], inspired by the work of Li et al., considered learning the feature interdependencies through the second-order statistics of the features and designed a second-order channel attention module for single-image super-resolution. Fang et al. [40] introduced a novel bilinear attention block for person retrieval, adopting the bilinear pooling method to model local feature interactions in each channel while preserving spatial structure information. Chen et al. [41] developed a new approach, fitting higher-order statistics with linear polynomials, and constructed a higher-order attention module for person re-identification, which can be simply realized by 1 × 1 convolution and an element-level addition/product. These encouraging studies demonstrate that higher-order statistics play a significant role in deep learning in enhancing the representations of discriminative features.

Dataset Details
This study was approved by the Ethics Committee of Daping Hospital affiliated with Army Medical University and was performed according to the Declaration of Helsinki. A total of 15,120 high-quality colonoscopy images from 768 cases were collected from the Daping Hospital affiliated with Army Medical University and the Sir Run Run Shaw Hospital of Zhejiang University from January 2018 to December 2021. Each colonoscopy image was independently annotated by two endoscopic experts; when their labels were inconsistent, a third expert assisted in the discussion, and they made the final decision together. Finally, the whole dataset included 4124 MES 0 images, 6669 MES 1 images, 1773 MES 2 images, and 2554 MES 3 images. Table 1 illustrates the specific data distribution. The whole dataset was randomly divided into training and test datasets at a ratio of 8:2, with the training dataset containing 12,090 images and the test dataset containing 3030 images. Details on the datasets are presented in Table 2.

Table 2. Dataset split by MES category.

                    MES 0   MES 1   MES 2   MES 3    Total
Training dataset     3298    5332    1419    2041   12,090
Test dataset          826    1337     354     513     3030
Total                4124    6669    1773    2554   15,120

Overview of the Framework
The proposed PHF3 model is illustrated in Figure 2. The PHF3 model has a dual-branch structure, which consists of a PvT branch, a CNN branch, a feature fusion module (FFM), two dual-branch classifiers, and a fusion classifier based on second-order pooling (SOP). The structure of the PvT is described in detail in Section 3.2, while the CNN branch is based on the ResNet50 architecture. The PvT and ResNet50 models were both pretrained on the ImageNet dataset. In a colonoscopy image, the local features extracted by the CNN represent the relationship between the intestinal wall at the near-shot point and the coaxial extension line, while the global representations modeled by the PvT capture similar information in the cross-sectional space of the intestinal cavity. The FFM enhances the visual representation ability by combining local features with global representations in an interactive manner. The PvT and CNN branches both contain four stages, and the FFM performs feature fusion on the outputs of Stages 1 to 3. The fused features are then transmitted back to the two main branches. The PHF3 model has three outputs: the auxiliary outputs of the dual branches adopt average pooling, while the outputs of the fourth stage are concatenated at the channel level and serve as the main output of the model after SOP and the fully connected (FC) layers. Therefore, the total loss of the model is the sum of three losses, Loss_all = α·Loss_pvt + β·Loss_cnn + γ·Loss_combine, and the cross-entropy loss function with label smoothing is applied for all losses. α, β, and γ are the weight coefficients of the three losses, and their proportions are discussed in Section 4.1. Suppose that the true label corresponding to the n-th sample is y_n ∈ {1, 2, . . . , K}, and v = (v_1, v_2, . . . , v_K) is the final output of the network, that is, the prediction result of sample n.
The calculation is expressed as follows:

Loss_Ω = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} q_k log(Softmax(v)_k), with q_k = 1 − ε + ε/K if k = y_n and q_k = ε/K otherwise,

where N is the number of samples, K is the number of classification categories, and ε is the coefficient of label smoothing. Ω can stand for pvt, cnn, or combine. Figure 2. Illustration of the pyramid hybrid feature fusion framework (PHF3). The stem consists of a convolution, a batch normalization, a ReLU activation function, and maximum pooling; FFM: feature fusion module, SOP: second-order pooling, LayerNorm: layer normalization, FC: fully connected layer.
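The label-smoothed cross-entropy and the weighted total loss described above can be sketched in a few lines of NumPy (an illustrative reimplementation, not the authors' code; `label_smoothing_ce` and `total_loss` are our names):

```python
import numpy as np

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy with label smoothing, as used for all three losses.

    logits:  (N, K) raw network outputs
    targets: (N,)   integer class labels in [0, K)
    """
    n, k = logits.shape
    # softmax probabilities, computed stably
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # smoothed target distribution: eps/K everywhere, plus (1 - eps) on the true class
    q = np.full((n, k), eps / k)
    q[np.arange(n), targets] += 1.0 - eps
    return float(-(q * np.log(p + 1e-12)).sum(axis=1).mean())

def total_loss(loss_pvt, loss_cnn, loss_combine, alpha, beta, gamma):
    # Loss_all = alpha*Loss_pvt + beta*Loss_cnn + gamma*Loss_combine
    return alpha * loss_pvt + beta * loss_cnn + gamma * loss_combine
```

In training, `label_smoothing_ce` would be evaluated once per output head and the three results combined by `total_loss` with the weights discussed in Section 4.1.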

Pyramid Vision Transformer
The PvT was originally proposed for dense prediction tasks, such as semantic segmentation and object detection, and is a pure Transformer backbone [23]. The progressive shrinking pyramid and spatial-reduction attention layer in the PvT can learn high-resolution representations while reducing the computational costs. Although the architecture of the PvT is favorable for dense prediction tasks, it does not show a strong advantage for image classification tasks. The improved version of the PvT [24] includes overlapping patch embedding (OPE) and a convolutional feed-forward network, further reducing the computational costs and exhibiting excellent performance in classification tasks. Each stage of the PvT consists of one OPE, one block, and one normalization layer, with the block containing N basic component blocks. Figure 3 shows the components of the i-th stage of the PvT in our study. Let H_i, W_i, and d_i be the height, width, and embedding dimensions of the features in the i-th stage, respectively. The flattened token output of the (i−1)-th stage is reshaped, and then OPE is carried out. In contrast to the linear embeddings in the ViT, OPE is realized mainly by convolution operations with a kernel size larger than the stride. When i = 1, the convolution kernel size in OPE is 7 and the stride is 4, while when i = 2, 3, 4, the kernel size is 3 and the stride is 2. In the basic component block, spatial reduction (SR) is performed first; then, multihead attention (MHA) is implemented. The MHA mechanism receives a query Q, key K, and value V as the input, and the SR operation greatly reduces the scale of K and V, effectively reducing the computational overhead and encouraging the model to learn higher-resolution representations. The MHA operation can be formulated as follows:

MHA(Q, K, V) = Concat(head_1, . . . , head_{h_i}) W^O, head_j = Attention(Q W_j^Q, SR(K) W_j^K, SR(V) W_j^V),

where h_i is the number of heads in the attention layer at stage i, Concat(·) is the concatenation operation, and W_j^Q, W_j^K, W_j^V ∈ R^{d_i×(d_i/h_i)} and W^O ∈ R^{d_i×d_i} are linear projection parameters. The SR(·) operation can be formulated as follows:

SR(x) = LN(RP_2(Conv(RP_1(x)))) W_S,

where x ∈ R^{H_i W_i × d_i} denotes the input sequence and RP_1(·) and RP_2(·) are reshape operations. The feature after RP_1(·) lies in R^{d_i × H_i × W_i}, and R_s^i represents the spatial reduction ratio, which is also the kernel and stride size of the convolution operation Conv(·). At the end of the spatial reduction operation, RP_2(·) reshapes the feature to R^{(H_i W_i / (R_s^i)^2) × d_i}; LN(·) refers to layer normalization, while W_S ∈ R^{d_i × d_i} is a linear projection. Note that the Attention(·) calculation is consistent with the original paper [42]:

Attention(Q, K, V) = Softmax(Q K^T / √(d_i/h_i)) V.

Depthwise convolution is introduced into the convolutional feed-forward network to capture the local continuity of the input tensor. The dimensional expansion factor between the two fully connected (FC) layers at the i-th stage is R_m^i. In our study, the settings of d, h, R_s, and R_m in the four stages were [64, 128, 320, 512], [1, 2, 5, 8], [8, 4, 2, 1], and [8, 8, 4, 4], respectively, while the numbers of basic component blocks in the four stages were 3, 8, 27, and 3.
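The spatial-reduction attention described above can be sketched in NumPy (a single-head illustration under our own simplifications: the strided convolution in SR is implemented as a non-overlapping patch projection, and all weight matrices are random placeholders rather than trained parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the embedding (last) axis."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def spatial_reduction(x, H, W, R, W_conv, W_s):
    """SR(x): reshape (H*W, d) -> (d, H, W), apply a kernel=stride=R 'conv'
    (expressed as a linear map on flattened R x R patches), reshape back,
    then LayerNorm and the linear projection W_S."""
    d = x.shape[1]
    img = x.T.reshape(d, H, W)                              # RP_1
    patches = img.reshape(d, H // R, R, W // R, R)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, d * R * R)
    reduced = patches @ W_conv                              # conv with kernel = stride = R
    return layer_norm(reduced) @ W_s                        # RP_2 + LN + W_S

def attention(q, k, v):
    """Scaled dot-product attention; k and v are the spatially reduced tokens."""
    d = q.shape[-1]
    a = q @ k.T / np.sqrt(d)
    a = np.exp(a - a.max(-1, keepdims=True))
    a = a / a.sum(-1, keepdims=True)
    return a @ v
```

With reduction ratio R, the key/value sequence shrinks from H·W to H·W/R² tokens, which is where the computational saving comes from.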

Feature Fusion Module
As shown in Figure 2, we developed a complementary design between the two branches, namely the feature fusion module (FFM). The structure of the FFM is illustrated in Figure 4. The FFM receives feature maps from the PvT and CNN branches, and after fusion at the channel scale, the output is sent back to the two main branches to enhance the complementary representation. Concretely, F^i_pvt ∈ R^{C^i_pvt × H_i × W_i} and F^i_cnn ∈ R^{C^i_cnn × H_i × W_i} are intermediate feature maps from the PvT and CNN branches in stage i, which are aggregated by G_fuse:

M^i_pvt, M^i_cnn = Split(G_fuse(Concat(F^i_pvt, F^i_cnn))),

where Concat(·) is the concatenation operation and Split(·) is the tensor split operation. G_fuse consists of 1 × 1 convolutions and ReLU activation functions and is designed for channel-level fusion. The M^i_pvt ∈ R^{C^i_pvt × H_i × W_i} and M^i_cnn ∈ R^{C^i_cnn × H_i × W_i} obtained by splitting along the channel dimension are then processed by G_pvt and G_cnn, respectively, and the fused features are transmitted back to each branch and added to the original input features F^i_pvt and F^i_cnn. It is worth noting that the FFM aggregates local features and global representations only in Stages 1 to 3; in Stage 4, the outputs of the two branches are concatenated, followed by SOP and the final classification.
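The channel-level fusion performed by the FFM can be illustrated with a minimal NumPy sketch (the 1 × 1 convolutions are expressed as per-pixel channel projections; the weight matrices are hypothetical placeholders, not trained parameters, and `ffm` is our name):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffm(f_pvt, f_cnn, W_fuse, W_pvt, W_cnn):
    """Feature fusion module sketch.
    f_pvt: (C_p, H, W) PvT features; f_cnn: (C_c, H, W) CNN features.
    A 1x1 conv acts independently at each pixel, so it is a matrix
    multiply over the channel dimension of the flattened map."""
    m = np.concatenate([f_pvt, f_cnn], axis=0)        # Concat along channels
    C, H, W = m.shape
    flat = m.reshape(C, H * W)
    fused = relu(W_fuse @ flat)                       # G_fuse: 1x1 conv + ReLU
    c_p = f_pvt.shape[0]
    m_pvt, m_cnn = fused[:c_p], fused[c_p:]           # Split along channels
    # G_pvt / G_cnn, then residual addition back into each branch
    out_pvt = f_pvt + relu(W_pvt @ m_pvt).reshape(f_pvt.shape)
    out_cnn = f_cnn + relu(W_cnn @ m_cnn).reshape(f_cnn.shape)
    return out_pvt, out_cnn
```

Because the fused maps are added back to the original inputs, each branch keeps its own representation while receiving complementary information from the other.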

Second-Order Pooling
In typical CNN structures, global average pooling implements first-order statistics on the extracted features to determine the final classification. However, compared with the complex learning process of the CNN, first-order statistics are relatively crude. Inspired by [37,38], in our study, second-order pooling was applied for the final abstraction of the features obtained by each branch. Specifically, F^4_pvt ∈ R^{C^4_pvt × H_4 × W_4} and F^4_cnn ∈ R^{C^4_cnn × H_4 × W_4} are the output features of the PvT and CNN branches in Stage 4, and F_final ∈ R^{(C^4_pvt + C^4_cnn) × H_4 × W_4} is obtained after the concatenation operation along the channel dimension. We reshaped F_final to a feature matrix X ∈ R^{C×S}, where C = C^4_pvt + C^4_cnn and S = H_4 W_4. The covariance matrix is calculated as follows:

Σ = X Ī X^T, Ī = (1/S)(I − (1/S)1),

where I and 1 are identity and all-ones matrices of size S × S, respectively. Covariance normalization is beneficial for discriminative representations [39], and this normalization often relies on eigenvalue decomposition (EIG) or singular-value decomposition of the matrices. However, since graphics processing units (GPUs) are not ideal for EIG implementations, Newton-Schulz iterations were adopted to accelerate the covariance normalization process. To ensure the convergence of the Newton-Schulz iteration, Σ is first pre-normalized as follows:

A = Σ / tr(Σ),

where tr(Σ) = ∑_{i=1}^{C} λ_i denotes the trace of Σ. Given Y_0 = A and Z_0 = I, for l = 1, . . . , L, the Newton-Schulz iteration is given as follows:

Y_l = (1/2) Y_{l−1}(3I − Z_{l−1} Y_{l−1}), Z_l = (1/2)(3I − Z_{l−1} Y_{l−1}) Z_{l−1}.

Note that the pre-normalization process has an adverse effect on the network since it nontrivially changes the magnitude of the data. Therefore, after the Newton-Schulz iteration, post-compensation is applied to produce the final normalized covariance matrix:

Ŷ = √(tr(Σ)) Y_L,

where Ŷ is a symmetric matrix, and we extracted its upper triangular elements to use as the input to the final fully connected layer.
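The full SOP pipeline — covariance, pre-normalization, Newton-Schulz iteration, and post-compensation — can be sketched in NumPy as follows (an illustrative reimplementation of the cited iSQRT-COV procedure [38], not the authors' code; `second_order_pooling` is our name):

```python
import numpy as np

def second_order_pooling(X, L=8):
    """Covariance pooling with Newton-Schulz square-root normalization.
    X: (C, S) feature matrix with C channels and S = H*W spatial positions."""
    C, S = X.shape
    I_bar = (np.eye(S) - np.ones((S, S)) / S) / S   # centering: (1/S)(I - (1/S)1)
    sigma = X @ I_bar @ X.T                          # C x C covariance matrix
    tr = np.trace(sigma)
    Y, Z = sigma / tr, np.eye(C)                     # pre-normalization A = sigma/tr(sigma)
    for _ in range(L):                               # Newton-Schulz iterations
        T = 0.5 * (3.0 * np.eye(C) - Z @ Y)
        Y, Z = Y @ T, T @ Z                          # Y -> sqrt(A), Z -> sqrt(A)^-1
    Y_hat = np.sqrt(tr) * Y                          # post-compensation
    return Y_hat[np.triu_indices(C)]                 # upper-triangular features for the FC layer
```

The trace pre-normalization keeps all eigenvalues of A in (0, 1], which guarantees convergence of the iteration; the post-compensation then restores the original scale.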

Implementation Details
During training, randomly cropped 224 × 224 areas were fed into the network. During testing, the colonoscopy images were resized to 256 × 256, and center-cropped 224 × 224 areas were used for prediction (for the InceptionV4 model, the images were resized to 320 × 320 and the input size was 300 × 300). For data augmentation, random horizontal and vertical flips were adopted with a probability of 0.3. In addition, contrast-limited adaptive histogram equalization (CLAHE) was applied to enhance the representation of the color features. The work of Mokter et al. [43] suggests that the vascular pattern provides important characteristic information. As shown in Figure 5, the CLAHE algorithm improved the image contrast and could highlight the vascular texture or white ulcers, while random flipping increased image diversity. The PHF3 model was trained using the stochastic gradient descent (SGD) optimizer with a weight decay of 1 × 10^−5 and a momentum of 0.9. The learning rate was initialized to 0.001 and decayed by a factor of 0.1 every 10 epochs over the 50 training epochs, with 128 images in each mini-batch. The cross-entropy loss with label smoothing was adopted, and the coefficient ε was set to 0.1. Our work was implemented in PyTorch with two NVIDIA Quadro GV100 32G graphics cards.
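The step schedule described above (initial learning rate 0.001, multiplied by 0.1 every 10 epochs) can be written as a small helper (an illustrative sketch; `step_lr` is our name, and in PyTorch the same behavior is provided by `torch.optim.lr_scheduler.StepLR`):

```python
def step_lr(epoch, base_lr=1e-3, decay=0.1, step=10):
    """Step learning-rate schedule from the training setup:
    lr starts at base_lr and is multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 50-epoch run this yields five plateaus, from 1e-3 down to 1e-7.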

Preliminary Study
In the final feature fusion process, the Newton-Schulz iterative procedure was applied to achieve fast covariance normalization, where the number of iterations is a tunable hyperparameter. Therefore, we first explored the impact of the number of Newton-Schulz iterations L on model performance. The results are shown in Figure 6, where L = 0 indicates that SOP was not adopted. When L = 1-8, the accuracy fluctuated to some extent, but the overall trend was upward. Beyond L = 8, the overall accuracy decreased, indicating that further increasing the number of iterations is not conducive to improving accuracy, which is consistent with the discussion in [37]. Therefore, in our final model, L was set to 8.
In addition, we also explored the proportions of the weight coefficients α, β, and γ for the three losses Loss_pvt, Loss_cnn, and Loss_combine. Intuitively, we believed that Loss_combine was more important, and the experimental results are shown in Table 3. To a certain extent, increasing the weight of Loss_combine effectively improved the performance of the model. Therefore, the weighting scheme that emphasized Loss_combine was adopted in the final model.

The Ablation Experiments and Comparison Experiments with Classical CNNs
The results of the ablation experiments are shown in Table 4. Compared with either branch alone, the PHF3 model was able to combine the local features extracted by ResNet50 with the global representations modeled by the PvT to achieve better classification performance. The overall accuracies of ResNet50, the PvT, and the PHF3 model were 86.01%, 87.29%, and 88.91%, respectively. A performance comparison with some representative CNN models is shown in Table 5. Compared with ResNet50, ResNet101 showed no significant improvement in recognition performance, suggesting that increasing the network depth did not enhance the representations of discriminative features. Considering the balance between performance and efficiency, our proposed PHF3 model adopted ResNet50 and the PvT as the two main branches and outperformed both individual models.
The ROC curves of the five models on the test set are shown in Figure 8, and the AUC values of the PHF3 model for MES 0, MES 1, MES 2, and MES 3 reached 0.996, 0.972, 0.967, and 0.990, respectively.

Comparative Experiments with Advanced Models
Even the basic ViT (ViT-B) exhibited better prediction accuracy than most CNNs. Moreover, compared with the ViT-B, the PvT, which learns higher-resolution representations, improved the overall accuracy by approximately 1.0 percentage point and exhibited performance comparable to that of the basic Swin Transformer (Swin-B) [44]. To highlight the advantages of the PHF3 model, we also compared it with advanced models such as the VAN [45] and MViT [46]; Conformer [31], as a representative of the dual-branch CNN-Transformer structure, was also included in the comparison. The results are shown in Table 6, which shows that the VAN did not achieve strong performance. In comparison, the Swin-B, MViT, and PvT performed better, implying that the pyramid structure and the overlap and transformation of patches may be beneficial for learning diffuse lesions. The confusion matrices of the six models are shown in Figure 9, visually illustrating the differences in their predictions. Compared with the Swin-B, the PHF3 model had slightly more false negatives for MES 1 and false positives for MES 3, but the Swin-B was more likely to predict MES 0 as MES 1 and MES 3 as MES 2. As a trade-off between recall and precision, the F1-score accounts for both false positives and false negatives, and the PHF3 model had better F1-scores than the comparison models in all categories.

Visualization of Feature Maps and Heat Maps
The feature maps in the model inference process helped in understanding the feature capture characteristics of the CNN and PvT branches and the effectiveness of the FFM. As shown in Figure 10, we plotted partial feature maps before and after fusion for each stage. Overall, the extracted features of both branches became increasingly abstract as the model deepened, with the CNN branch focusing on local features and highlighting local details, while the PvT branch focused on global representations, and its feature maps appeared more disordered overall. The FFM clearly introduced global information into the CNN branch, while the local details introduced into the PvT branch were more difficult to notice due to the disordered feature representations. Furthermore, as shown in Figure 11, to enhance the interpretability of the model, Grad-CAM [47] was used to draw the heat maps. Some local ulcers and bleeding areas were highlighted, which are important features when the model makes decisions.

A Novel Deep Learning Framework for UC Severity Assessment
In clinical practice, evaluating the severity of UC through the MES is of great significance for understanding patient conditions and providing effective treatment. Ensuring the reliability and reproducibility of UC severity classification remains a nontrivial challenge, and previous works have been limited to convolutional neural networks, thus ignoring the global dependencies of features in the intestinal lumen. Our work provides a novel solution to this challenge. In colonoscopy images, we considered not only the local relationships among the depth features of the intestinal cavity but also the global dependencies of the features in the same intestinal cavity cross-section. The proposed hybrid architecture combines local features and global representations in a simple and effective manner, achieving better performance than the baseline models. In the ablation study, the overall accuracy of the PHF3 model was 2.90 and 1.62 percentage points higher than that of ResNet50 and the PvT, respectively. The feature fusion process of the dual-branch hybrid framework can be clearly observed in the visualized feature maps. Compared with the classical CNNs, the accuracy of the proposed method was improved by 2.38-4.26 percentage points, and the AUC of the proposed method was the highest in all categories. Even compared with advanced models based on the Transformer architecture and combined CNN-Transformer frameworks, the PHF3 model maintained an advantage. Our approach highlights the importance of fusing local features and global representations when capturing the features of diffuse lesions and the effectiveness of second-order information in enhancing fused discriminative features. These encouraging results suggest that our study can further advance the application of deep learning as an auxiliary diagnostic tool for intestinal digestive diseases.

Multi-Branch Hybrid Architectures May Be Irreplaceable
The ViT and CNN are two mainstream deep learning models at present. However, each has its own shortcomings, such as the difficulty of global relationship modeling for the CNN and the lack of local inductive bias for the ViT. Therefore, a natural idea is to combine them to complement each other's advantages, for example, by introducing convolution operators into the ViT or adding a global attention mechanism to the CNN. In fact, convolution operations are already introduced into the PvT structure, making it a conv-like Transformer. Our experiments suggest that, even when convolutions are introduced to capture local relations in Transformers, the effects may be limited, and there is still room for improvement. Similarly, it is worth exploring whether Transformer-like ConvNets can be further enhanced by introducing the ViT. In studies combining the CNN and ViT, a hybrid structure with multiple branches may be a fusion method that is difficult to replace. Another issue worth noting is how to weight the losses incurred by multiple branches. Although we tested weightings with different proportions, this manual setup may not be optimal. If possible, designing adaptive weighting methods may yield surprising results. In addition, due to the complex attention mechanism and model design, the training and inference of the ViT are not as fast as those of the CNN, which increases the computational costs of the CNN-ViT fusion structure. The SR strategy in the PvT effectively reduced the computational load. In a recent report, inspired by the SR in the PvT, Li et al. [48] proposed the Next-ViT, a new paradigm that fuses convolutional and Transformer modules in every stage, aiming to improve model efficiency and achieve industrial-scale deployment of the CNN-Transformer hybrid architecture. Therefore, how to achieve efficient computation in a multi-branch hybrid framework is also a problem that needs more research and experiments.

Higher-Order Statistics Require More Exploration
It is well known that first-order statistics may limit modeling capabilities; however, second-order statistics are difficult to apply on GPUs because they introduce additional computation. In our study, covariance normalization realized by Newton-Schulz iterations achieved end-to-end training at an acceptable computational cost, providing a new approach for applying higher-order statistics. In a preliminary study, we explored the influence of the number of Newton-Schulz iterations. Although more iterations yield a better fit, the experimental results showed that a certain fitting bias is beneficial to the generalization performance of the model. Our research suggests that second-order information provides greater discriminative representation power for the fused features of the two branches. A natural question is whether statistics of order higher than two can perform better, or whether an effective combination of first-order and second-order information is preferable; both require further exploration. In addition, once computational cost is no longer a hindrance, attention mechanisms based on higher-order statistics are an exciting prospect. At present, some studies have used second-order statistical information to construct attention modules [39,40]. Furthermore, the kinds of feature representations that can be enhanced by higher-order information should be investigated.
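As a concrete illustration of the covariance normalization discussed above, the following minimal NumPy sketch approximates the matrix square root of a symmetric positive-definite covariance matrix with coupled Newton-Schulz iterations. The trace pre-normalization follows common iSQRT-COV-style practice; the function name and iteration count are ours and are not taken verbatim from our implementation.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=5):
    """Approximate the matrix square root of an SPD matrix A via
    coupled Newton-Schulz iterations, avoiding the GPU-unfriendly
    eigendecomposition. num_iters controls the fitting bias noted
    in the text: fewer iterations give a rougher (biased) fit."""
    n = A.shape[0]
    I = np.eye(n)
    norm = np.trace(A)      # pre-normalize so the iteration converges
    Y = A / norm
    Z = I.copy()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T           # Y -> approximates sqrt(A / norm)
        Z = T @ Z           # Z -> approximates inverse sqrt
    return Y * np.sqrt(norm)  # undo the normalization
```

Because every iterate is a polynomial in `A`, only matrix multiplications are needed, which is why this scheme trains end-to-end on GPUs at an acceptable cost.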

Limitations and Future Work
Despite the encouraging classification results, our research has several limitations. First, the number of samples in each category of our dataset was imbalanced, a characteristic of most clinical disease data: there are always more remission samples than severe cases. UC severity classification is a fine-grained recognition task, and simple upsampling or downsampling of the data cannot provide effective improvements. Although loss-based class weighting can improve the classification accuracy of small-sample categories, it inevitably reduces the overall accuracy. Second, due to differences in equipment and in endoscopists' operating techniques, it is challenging to construct a model with strong generalizability. In our study, the data were obtained from only two large centers, and multicenter verification was lacking. In the future, we will consider collecting multisource data and applying our model to colonoscopy video processing. Moreover, we will further explore the irreplaceability of the dual-branch hybrid architecture and the application of higher-order attention mechanisms.
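The loss-based class weighting mentioned above is commonly realized as inverse-frequency weights applied to the cross-entropy loss. The sketch below is a generic illustration of that idea with hypothetical names, not the exact scheme used in our experiments.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights inversely proportional to class frequency,
    normalized so that the weights average to 1 over classes. Rare
    classes (e.g., severe MES grades) receive larger weights."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Mean class-weighted cross-entropy over a batch, given
    predicted probability vectors (rows summing to 1)."""
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * -np.log(p)))
```

Upweighting rare classes trades overall accuracy for per-class accuracy, which is exactly the tension described in the limitation above.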

Conclusions
In this paper, we proposed PHF3, a novel dual-branch hybrid feature fusion framework combining the PvT and ResNet50 for UC severity classification in endoscopic images. Compared with the ViT, the PvT can learn higher-resolution representations, which is beneficial for learning the diffuse lesion features of UC, and it also operates efficiently. The designed FFM structure solves the problem of fusing CNN and PvT features as the feature resolution pyramid changes, thereby effectively combining local features and global representations, while the SOP module enhances discriminative information by using second-order statistics. The comparative and ablation studies were encouraging, showing that the PHF3 model outperformed the comparison methods and can thus serve as an auxiliary tool for UC severity assessment. We hope that our work provides new ideas for combining convolutions and Transformers and for their application to assisted recognition in colonoscopy images.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data cannot be made available due to privacy restrictions.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
ROC	Receiver operating characteristic curve
AUC	Area under the receiver operating characteristic curve
CLAHE	Contrast-limited adaptive histogram equalization
GPUs	Graphics processing units