HMT-Net: Transformer and MLP Hybrid Encoder for Skin Disease Segmentation

Convolutional neural networks (CNNs) have been widely applied to skin disease image segmentation because of their powerful information discrimination abilities, and they have achieved good results. However, CNNs struggle to capture long-range contextual connections when extracting deep semantic features of lesion images, and the resulting semantic gap blurs the segmentation of skin lesion images. To solve these problems, we designed a hybrid encoder network based on the transformer and the multilayer perceptron (MLP) architecture, which we call HMT-Net. In HMT-Net, we use the attention mechanism of the CTrans module to learn the global relevance of the feature map, which improves the network's ability to understand the overall foreground information of the lesion. We also use the TokMLP module to enhance the network's ability to learn the boundary features of lesion images. In the TokMLP module, the tokenized MLP axial displacement operation strengthens the connection between pixels, which facilitates the extraction of local feature information. To verify the network's segmentation performance, we conducted extensive experiments with HMT-Net and several recently proposed transformer and MLP networks on three public datasets (ISIC2018, ISBI2017, and ISBI2016). Our method achieves 82.39%, 75.53%, and 83.98% on the Dice index and 89.35%, 84.93%, and 91.33% on the IOU. Compared with the latest skin disease segmentation network, FAC-Net, our method improves the Dice index by 1.99%, 1.68%, and 1.6%, respectively, and the IOU by 0.45%, 2.36%, and 1.13%, respectively. The experimental results show that HMT-Net achieves state-of-the-art performance and outperforms the other segmentation methods.


Introduction
Dermoscopy is the primary method for increasing skin cancer diagnosis and decreasing skin cancer mortality [1]. Dermoscopy is an imaging technique that eliminates skin surface reflection and enhances deep skin visualization. Due to the different diagnostic equipment for skin diseases and the differences in the quality and size of pictures, different doctors tend to make subjective decisions based on their own experiences in the diagnosis process [2], which often consumes many human and material resources. Because of these problems, dermoscopic image segmentation technology has emerged in recent years.
The currently used standard skin lesion inspection method consists of five steps: imaging, preprocessing, segmentation, feature extraction, and classification [3]. In this process, doctors can only focus on the lesions obtained by segmenting dermoscopic image lesions for a timely diagnosis. Early segmentation methods for skin lesions usually use algorithms based on optimal thresholding [4], region growing [5], and edge detection [6].
Because these traditional methods necessitate human intervention, an increasing number of experts and researchers began to investigate more effective segmentation methods.
With the success of CNNs in the visual field, people have begun to propose and improve various CNN-based segmentation methods, such as UNet [7], R2U-Net [8], and FAC-Net [9]. However, mainstream CNNs also have their own limitations. The reason is that CNNs do not consider the relationship between remote context information and extracted lesion image features, resulting in a certain semantic gap that leads to the problem of segmentation blur in lesion image segmentation.
In response to the inherent defects of CNNs, many scholars have tried to apply the transformer, originally developed for natural language processing, to the field of vision. The core idea of the transformer is to use the self-attention mechanism to explore the correlation between global elements. Visual transformers such as UTNet [10], MCTrans [11], and UNETR [12] have shown that the transformer can achieve excellent results in image classification, image segmentation, image detection, and image tracking.
In addition, many researchers have tried to apply the MLP to image segmentation. The MLP not only models long-range dependencies but also has excellent classification ability. MLP-based networks such as MLP-Mixer [13], UNeXt [14], and S2-MLP [15] are often used in image classification, image segmentation, target detection, and other fields, and they have also achieved very good results.
Skin lesion tissue often appears in the form of local blocks, and the shape and size of the lesion area vary. In addition, when only convolution operations are used to extract lesion-region features and discard useless background information, some important foreground information is often lost or too much background information is retained. To solve these problems, we designed a hybrid encoder network, HMT-Net. The transformer we use overcomes the difficulty CNNs have in capturing remote context connections. Meanwhile, we take advantage of the MLP's focus on local feature learning to make up for the transformer's shortcomings in understanding local feature information. We effectively combined these two methods to achieve better segmentation of skin lesions. We summarize the contributions of this paper as follows:
• We used the CTrans module in a skin disease dataset segmentation task. The transformer in the CTrans module can calculate the correlation between global elements, which enhances the network's understanding of global information and thereby improves the network's overall recognition ability for foreground information;
• We used the TokMLP module in the network. The TokMLP module performs shift operations on different axial elements in the feature map to improve the correlation between a given element and its surrounding elements, which improves the network's ability to learn lesion edge information;
• By effectively connecting the CTrans module with the TokMLP block, we designed a new approach based on DLL architecture.
We present existing work in Section 2. We then describe the method and its corresponding analysis in Section 3. The experimental setup, details, and evaluation metrics are presented in Section 4. In Section 5, we present the results of the experiments. Finally, in Section 6, we discuss the network, draw some conclusions, and look to the future.

Related Work
Skin lesion segmentation [16] delineates the lesion area accurately, either manually or by other means, to prepare for the doctor's diagnosis and treatment. Traditional skin lesion segmentation algorithms include threshold-based [4], region-growing [5], edge detection [6], and active contour-based [17] segmentation methods. Although these methods are still widely used in image segmentation tasks, they usually rely on a specific threshold or judge elements based on contours. Such algorithms struggle to cope with the large number of data points in some lesion areas, and their segmentation results vary significantly. Ben Cohen et al. [18] first explored the use of fully convolutional networks (FCNs) for liver and tumor segmentation in CT images. An FCN turns a classification network into an image segmentation network by rewriting the fully connected layer as a convolutional layer and using deconvolution to complete the upsampling process. Because of the simple network structure, its segmentation results are not fine enough, and there is still a lot of room for improvement. Yuan et al. [19] proposed an end-to-end skin melanoma segmentation method based on a 19-layer FCN. Because this network is deep, its complexity increases accordingly.
In recent years, with the continuous development of deep learning in the field of computer vision, segmentation algorithms for skin lesions have achieved remarkable results. UNet [7] is a landmark work that first used skip connections to pass the feature information in the encoder to the decoder. Such connections can effectively alleviate the inevitable information loss in the downsampling process. However, different medical image segmentation tasks require different segmentation networks to enhance feature learning for specialized data, so after UNet, many more important medical image segmentation models emerged, such as UNet++ [20], UNet3+ [21], R2U-Net [8], SA-UNet [22], CE-Net [23], FAC-Net [9], SMU-Net [24], and BA-Net [25]. Although these network architectures can achieve good performance in different tasks, their generalization ability is limited. This is because these networks are all built from different forms of convolution operations, and convolution operations have an obvious limitation: it is difficult for them to learn the correlation between global elements. Because of this inherent flaw of the convolution operation, many scholars have proposed alternative methods to solve the problem.
Recently, many transformer-based networks have been proposed and used for medical image segmentation. The transformer's overall structure is typically an encoder and decoder structure. The transformer can effectively establish long-distance dependencies through the self-attention mechanism, and the model's learning ability can be strengthened through multiple heads. The transformer obtains its initial embeddings either from a convolution block applied to the image or by operating directly on the pixels. This operation does not always need to maintain the original structure of the feature map, which gives the transformer a good modality fusion ability. The transformer can perform a variety of tasks by using the attention mechanism. Therefore, it has a wide range of applications in the field of medical image segmentation, such as TransUNet [26], UCTransNet [27], and Swin-Unet [28]. Although these methods can better learn global foreground information in medical image segmentation tasks, they tend to neglect the learning of local feature information. How to effectively model both the overall understanding of foreground information and the acquisition of local feature information is a problem we need to consider.
In addition, networks based on the MLP architecture have also been proposed and used in the field of image segmentation. MLP has structures such as an input layer, a hidden layer, and an output layer. In image segmentation work, methods based on MLP architecture are often widely used. For example, UNeXt [14], AS-MLP [29], and CycleMLP [30] all use different axial shift operations to interact with spatial information flows in different directions to strengthen the connection between pixels. It can effectively make up for transformer's weakness in ignoring the understanding of local feature information.
Because of occlusions such as hair and air bubbles in skin disease images, lesions differ in shape, their boundaries are blurred, and the characteristics of their inner and outer layers differ. Moreover, when a CNN extracts lesion feature information, it loses some important foreground information while discarding useless background information, which leads to blurred lesion image segmentation.
Aiming at the problems existing in the skin disease data and the CNNs method, we designed a hybrid encoder network, HMT-Net, using the CTrans module to enhance the network's learning of global elements and improve the network's ability to understand the global feature information of lesions. Using the TokMLP module can promote the correlation between adjacent elements and make up for the weakness of the transformer ignoring local information through axial shift operations in different directions, thereby improving the network's ability to learn feature information about the lesion edge.

Method
This section begins with a brief introduction to the overall framework of HMT-Net. Then, the composition, structure, and implementation details of the TokMLP block are introduced. Finally, we introduce the CTrans module in detail.

The Overall Structure of the HMT Network
The HMT-Net skin lesion segmentation network is depicted in Figure 1. We tried various ways of connecting the TokMLP block and the CTrans module and finally connected them effectively, applying the combination to the skip connections. First, the CTrans module computes the interrelationship of global feature information, which enhances the network's ability to learn global information and improves its overall understanding of the foreground information in skin lesion images. Second, the TokMLP block is connected with the CTrans module (as shown in Figure 1), which enhances the network's understanding of local feature information by strengthening the connection between adjacent element points and thereby improves its ability to identify lesion boundaries. The distinguishing characteristic of this network is that the CTrans module and the TokMLP block are combined in the encoder stage, which allows effective and accurate segmentation of skin lesion images.

TokMLP Block
We first shift the convolutional features in an orderly manner along the channel axis. Because this operation focuses the MLP on particular local feature information, the MLP block acquires locality. The feature map is shifted along the width and height axes, respectively (the shift operation is shown in Figure 2), in a way that closely resembles axial attention [31]: the feature map is divided into m partitions, which are then sequentially moved by k positions along the specified axis.
In the TokMLP block (Figure 3), which is made up of two MLP modules, we first map the shifted feature information into tokens. These tokens are passed to the shifted MLP (across width). The resulting features then go through a depthwise convolution layer (DWConv); note that the DWConv operation includes a GELU [31] activation layer. GELU, the Gaussian error linear unit, is a common activation function that incorporates a form of stochastic regularization, and state-of-the-art architectures such as ViT [32] and BERT [33] use it with good results. We use GELU instead of ReLU because it is a smoother alternative with better performance. Next, the features are passed into another shifted MLP (across height). Finally, we feed the layer-normalized (LN) features into the next module. LN normalizes the activations of an intermediate layer within a single sample, and we use it because, in the TokMLP block, normalizing along tokens makes more sense than batch normalization (BN).
The calculation in the TokMLP block uses the following notation: G denotes height, K denotes width, DWC denotes depthwise convolution, Shift_K denotes the shift operation in the width direction, Shift_G denotes the shift operation in the height direction, Tok denotes the feature mapping operation, F denotes the tokens obtained by the feature mapping operation, and LN denotes layer normalization. First, the convolutional features undergo a shift operation and a feature mapping operation along the width direction to obtain T_K, and the output Y is obtained by applying the MLP and DWConv operations to T_K. We then apply a shift operation and a feature mapping operation along the height direction to Y to obtain T_G, apply the MLP and GELU operations to T_G, add the result to T_K, and apply an LN operation. By using TokMLP blocks to strengthen the association of each element with its surrounding elements, our network enhances the recognition of local features of skin lesions.
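The axial shift and tokenization steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `axial_shift` and `tokenize`, the zero filling of vacated positions, and the choice of offset `g - m // 2` for channel group `g` are assumptions made for illustration (the text only states that m partitions are moved by k positions along a chosen axis).

```python
def axial_shift(feature_map, m, axis):
    """Shift channel groups of a C x H x W nested list along one spatial axis.

    Channels are split into m contiguous groups; group g is moved by
    (g - m // 2) positions (assumed offsets). Vacated positions are zero-filled.
    axis is "width" or "height".
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    group_size = -(-channels // m)  # ceiling division
    shifted = []
    for c, plane in enumerate(feature_map):
        k = c // group_size - m // 2  # offset for this channel's group
        new_plane = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if axis == "width":
                    src_y, src_x = y, x - k
                else:
                    src_y, src_x = y - k, x
                if 0 <= src_y < height and 0 <= src_x < width:
                    new_plane[y][x] = plane[src_y][src_x]
        shifted.append(new_plane)
    return shifted


def tokenize(feature_map):
    """Flatten C x H x W into H*W tokens, each a length-C vector."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


# Three channels of size 1 x 3, one channel per group (m = 3).
fmap = [[[1.0, 2.0, 3.0]] for _ in range(3)]
shifted_w = axial_shift(fmap, m=3, axis="width")
tokens = tokenize(shifted_w)
```

After the shift, each token mixes values that originally sat at neighbouring width positions, which is how a per-token MLP can pick up local context.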

CTrans Module
The CTrans module we used is composed of two modules, CCT and CCA. CCT is a channel cross-fusion transformer, which is composed of three parts: multiscale feature embedding, multihead channel cross-attention, and a multilayer perceptron (MLP) (see Figure 4). It fuses features from the multiscale encoder stages. CCA is channel cross-attention, which fuses the CCT outputs with the decoder features. First, the feature maps $G_i$ (i = 1, 2, 3, 4) of the four skip-connection layers (including those operated on by the TokMLP block), whose spatial resolution is halved at each successive level (j = 0, 1, 2, 3), are reconstructed by a multiscale feature embedding operation.
We first take the feature map $G_1 \in \mathbb{R}^{64\times224\times224}$ (as shown in Figure 4) and perform the patch_embeddings operation on it (a convolution with a kernel size of (16, 16) and a stride of 16), which yields $P_1 \in \mathbb{R}^{64\times14\times14}$. Flattening $P_1$ gives $F_1 \in \mathbb{R}^{64\times196}$, and swapping its two dimensions gives $J_1 \in \mathbb{R}^{196\times64}$. Finally, we apply position encoding to obtain $X_1 \in \mathbb{R}^{196\times64}$. The feature maps $G_2 \in \mathbb{R}^{128\times112\times112}$, $G_3 \in \mathbb{R}^{256\times56\times56}$, and $G_4 \in \mathbb{R}^{512\times28\times28}$ are processed in the same way to obtain $X_2 \in \mathbb{R}^{196\times128}$, $X_3 \in \mathbb{R}^{196\times256}$, and $X_4 \in \mathbb{R}^{196\times512}$. The four results are entered into the encoder in sequence and are also concatenated along the channel dimension, which gives $X_c \in \mathbb{R}^{196\times960}$.
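The shape bookkeeping above can be checked with a few lines of arithmetic. In this sketch the patch size 16 for $G_1$ comes from the text; the patch sizes 8, 4, and 2 for $G_2$ to $G_4$ are an assumption, chosen because they are the values that turn the stated spatial sizes into the stated 196 tokens. The helper name `patch_tokens` is hypothetical.

```python
def patch_tokens(height, width, patch):
    """Token count after a non-overlapping patch x patch convolution (stride = patch)."""
    return (height // patch) * (width // patch)


# (channels, spatial size) of the four skip-connection feature maps given in the text
scales = [(64, 224), (128, 112), (256, 56), (512, 28)]
# 16 is stated in the text for G_1; 8, 4, 2 are assumed for G_2..G_4
patches = [16, 8, 4, 2]

# Shape of each embedded sequence X_i as (tokens, channels)
token_shapes = [(patch_tokens(s, s, p), c) for (c, s), p in zip(scales, patches)]
# Channel width of the concatenated sequence X_c
concat_channels = sum(c for c, _ in scales)

print(token_shapes)
print(concat_channels)
```

Every scale yields a 14 × 14 grid, so the four sequences share the same token count and can be concatenated along the channel dimension.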
We first input $X_i$ and $X_c$ into the CCT module for the transformer operation; its working principle is shown in Figure 5. In Formula (5), $M_d \in \mathbb{R}^{L_i\times b}$, $M_K \in \mathbb{R}^{L_c\times b}$, and $M_V \in \mathbb{R}^{L_c\times b}$ represent the weights of the respective inputs, which produce $Q_i \in \mathbb{R}^{L_i\times b}$, $K \in \mathbb{R}^{L_c\times b}$, and $V \in \mathbb{R}^{L_c\times b}$. Here b represents the sequence length (i.e., the number of patches) and $L_i$ (i = 1, 2, 3, 4) represents the channel size of each skip-connection layer. The channel sizes of the four skip-connection layers in our experiment were 64, 128, 256, and 512, respectively.
The matrix $Z_i$ is generated using the cross-attention (CA) mechanism in the multihead cross-attention, and we use $Z_i$ to weight the value V (Formula (6)), where ψ represents the instance normalization operation and σ is the softmax function.
$O_i$ is computed for each $Q_i$ (i = 1, 2, 3, 4), so four outputs $O_i$ are generated, one for each of the four inputs. With S attention heads, the output is calculated as in Formula (7), where $NC_i$ represents the output of the multihead cross-attention. In Formula (8), MLP represents the multilayer perceptron in the CCT module, LN represents layer normalization, and $K_i$ (i = 1, 2, 3, 4) represents the four outputs of the CCT module.
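For orientation, one plausible form of Formulas (6) to (8) is sketched below. This is a reconstruction, not the authors' equations: it assumes the channel-wise cross-attention form used in UCTransNet, from which the CTrans module is taken, and it assumes a scaling factor $\sqrt{L_c}$ and a token-by-channel layout, under which $Q_i^{\top} K$ has size $L_i \times L_c$ and relates channels rather than patches. The symbols $Z_i$, $O_i$, $NC_i$, $K_i$, ψ, σ, and S follow the text.

```latex
% Assumed reconstruction; layout, scaling, and residual form are not confirmed by the source.
\begin{align}
Z_i  &= \sigma\!\left[\psi\!\left(\frac{Q_i^{\top} K}{\sqrt{L_c}}\right)\right],
\qquad O_i = Z_i V^{\top} \tag{6}\\
NC_i &= \frac{O_i^{1} + O_i^{2} + \cdots + O_i^{S}}{S} \tag{7}\\
K_i  &= NC_i + \mathrm{MLP}\!\left(\mathrm{LN}\!\left(Q_i + NC_i\right)\right) \tag{8}
\end{align}
```

Read this way, the softmax normalizes similarities between channels of the i-th scale and channels of the concatenated sequence, the S heads are averaged, and a residual MLP produces the CCT output for each scale.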
The working principle of the CCA module is shown in Figure 6. The inputs of the CCA module are each output $K_i \in \mathbb{R}^{H\times W\times C}$ (i = 1, 2, 3, 4) of the CCT module and each feature map $N_i \in \mathbb{R}^{H\times W\times C}$ (i = 1, 2, 3, 4) in the decoder. The vectors $L_x \in \mathbb{R}^{1\times1\times C}$ (x = 1, 2) are obtained by compressing the spatial dimensions with a global average pooling (GAP) layer. A fully connected layer must both expand the convolutional features into vectors and classify each feature map; the idea of GAP is to combine these two steps into one. We use GAP because it has far fewer parameters than a fully connected layer, which reduces the risk of overfitting, while achieving the same conversion function. Then, linear and sigmoid operations are carried out, respectively, and finally (as shown in Figure 3), pixel-wise addition and pixel-wise multiplication are carried out. We obtain four outputs $E_i \in \mathbb{R}^{H\times W\times C}$ (i = 1, 2, 3, 4) from the CCA module, and we concatenate each of them along the channel dimension with the corresponding decoder output $N_i \in \mathbb{R}^{H\times W\times C}$ (i = 1, 2, 3, 4) obtained by the up-sampling operation. After a convolution (kernel 1) and a sigmoid operation on the final output, we obtain the final segmentation result.
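The gating described above (GAP, linear layers, addition, sigmoid, multiplication) can be sketched in pure Python as follows. This is an illustrative sketch under assumptions, not the authors' code: the function names, the use of one C × C weight matrix per input without bias, and the choice to re-weight the CCT output (rather than the decoder feature) are all assumptions.

```python
import math


def global_average_pool(feature_map):
    """Reduce a C x H x W nested list to a length-C vector of channel means."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in feature_map]


def linear(vector, weights):
    """Apply a C x C weight matrix (list of rows) to a length-C vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_cross_attention(k_feat, n_feat, w_k, w_n):
    """Re-weight the channels of k_feat using pooled statistics of k_feat and n_feat."""
    pooled_k = global_average_pool(k_feat)  # descriptor of the CCT output
    pooled_n = global_average_pool(n_feat)  # descriptor of the decoder feature
    # Element-wise addition of the two linear projections
    mixed = [a + b for a, b in zip(linear(pooled_k, w_k), linear(pooled_n, w_n))]
    # Sigmoid turns each channel score into a gate in (0, 1)
    gate = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # Multiply every value of channel c by gate[c]
    return [[[gate[c] * value for value in row] for row in plane]
            for c, plane in enumerate(k_feat)]


k_feat = [[[1.0, 3.0]], [[2.0, 2.0]]]   # 2 channels, 1 x 2 spatial size
n_feat = [[[4.0, 0.0]], [[1.0, 1.0]]]
zero = [[0.0, 0.0], [0.0, 0.0]]
gated = channel_cross_attention(k_feat, n_feat, zero, zero)
```

With all-zero weights every gate equals 0.5, so the output is simply half of `k_feat`; learned weights would instead emphasize or suppress individual channels.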

Experiments
In this section, we first introduce the datasets used in our experiments. Then, the evaluation metrics for the experiments are introduced. Next, we describe the parameter settings of the training process in detail. Finally, the loss function is introduced.

Datasets
We used three widely accepted skin lesion datasets (ISBI2016 [34], ISBI2017 [35], and ISIC2018 [36]) to validate the proposed network. The composition of the training, validation, and test sets is shown in Table 1, and the main types of skin lesions in the datasets are shown in Figure 7. The samples present the following difficulties: (1) There are small differences between normal skin and lesion images in the sample, which adds some challenges to the work of lesion image segmentation. (2) There are obvious multilevel features inside and outside the lesion in the sample, which may cause the network to mislocate the lesion boundary. (3) The size and shape of the lesions are basically irregular, and the boundaries are also blurred, making them difficult to identify. (4) The images in the sample contain interference factors such as hair, air bubbles, and other occluders, which affect the segmentation accuracy of the network.
The training set of the ISIC2018 dataset provides 2594 labeled lesion images. Its validation set consists of 100 unlabeled lesion images, and its test set consists of 1000 unlabeled lesion images. We therefore redivided the training data of the original dataset into a training set, a validation set, and a test set with a ratio of 7:1:2.
The ISBI2017 training set includes 2000 dermoscopic images of different resolutions with segmentation labels. Its test set consists of 600 dermoscopic images with corresponding segmentation labels. We used the training data of the original dataset as the experimental training set and divided the test data of the original dataset into a validation set and a test set with a ratio of 1:4.
The ISBI2016 dataset contains 900 training images in JPEG format with corresponding segmentation label images in PNG format, as well as 379 test images. We used the 900 training images provided by ISBI2016 as the training set and used the test images of the original data as both the validation and the test set.
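The split sizes implied by these ratios can be computed as in the sketch below. The helper `split_counts` and its rounding rule (floor each part, give the remainder to the first part) are assumptions; the exact counts used in the experiments are those of Table 1 and may differ by a few images for ISIC2018, where 2594 is not divisible by 10.

```python
def split_counts(total, ratios):
    """Split `total` items by integer ratios; the rounding remainder goes to the first part."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)
    return counts


# ISIC2018: 2594 labeled images split 7:1:2 into train / validation / test
isic2018 = split_counts(2594, [7, 1, 2])
# ISBI2017: 600 test images split 1:4 into validation / test
isbi2017_val_test = split_counts(600, [1, 4])

print(isic2018, isbi2017_val_test)
```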

Evaluation Metrics
In this experiment, we used the Dice coefficient (Dice) and the intersection over union (IoU) to evaluate the segmentation performance of the network. Both indices are calculated from four values: TP, FP, TN, and FN. A true positive (TP) is a pixel that is predicted positive and is actually positive, so the prediction is correct. A false positive (FP) is a pixel that is predicted positive but is actually negative, so the prediction is wrong. A true negative (TN) is a pixel that is predicted negative and is actually negative, so the prediction is correct. A false negative (FN) is a pixel that is predicted negative but is actually positive, so the prediction is wrong. The Dice and IoU indices are calculated as follows:
$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
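As a worked illustration of these definitions, the following sketch counts TP, FP, TN, and FN on two small binary masks and derives Dice and IoU from them. The function names and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def confusion_counts(pred, target):
    """Count TP, FP, TN, FN over two binary masks given as nested lists of 0/1."""
    tp = fp = tn = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def dice_and_iou(pred, target):
    """Dice = 2TP / (2TP + FP + FN); IoU = TP / (TP + FP + FN)."""
    tp, fp, _, fn = confusion_counts(pred, target)
    union = tp + fp + fn
    if union == 0:  # both masks empty: treated as a perfect match (assumed convention)
        return 1.0, 1.0
    return 2 * tp / (2 * tp + fp + fn), tp / union


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
dice, iou = dice_and_iou(pred, target)
```

Here one pixel is a true positive, one a false positive, and one a false negative, so Dice is 2/4 and IoU is 1/3. For a single mask pair the two are related by IoU = Dice / (2 - Dice), so IoU never exceeds Dice.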

Experimental Setup
In our experiments, we did not use any pretrained weights. Because the images in the skin lesion datasets have different resolutions, we uniformly resized the lesion images to 224 × 224. The experiments ran on Python 3.6 and PyTorch 1.8.1 with two Tesla T4 GPUs (15 GB of memory), and we used the Adam optimizer for training. We set the learning rate, batch size, and number of epochs to 0.001, 24, and 200, respectively. The datasets consist of images, and we applied online data augmentation methods such as horizontal and vertical flipping and random rotation.

Loss Function
In addition to the experimental setup, we describe the loss functions used in the experiments. In our medical image segmentation model, we used the binary cross-entropy loss (BCE) and the Dice loss (Dice) to train the network. The Dice loss is defined as
$$L_{Dice} = 1 - \frac{2TP}{(FN + TP) + (FP + TP)} \quad (12)$$
and, in the binary cross-entropy term, $y_i$ represents the ground-truth pixel.
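A minimal sketch of a combined BCE and Dice loss on flattened pixel lists is given below. It rests on assumptions that the text does not confirm: the two terms are summed with equal weight, the Dice term uses the common soft form (predicted probabilities in place of hard counts), and a small `eps` guards the logarithm and the division. The function name `bce_dice_loss` is hypothetical.

```python
import math


def bce_dice_loss(probs, labels, eps=1e-7):
    """Sum of binary cross-entropy and soft Dice loss (equal weighting is assumed).

    probs  : predicted foreground probabilities in [0, 1], one per pixel
    labels : ground-truth pixels y_i, each 0 or 1
    """
    n = len(probs)
    # Binary cross-entropy averaged over pixels; eps avoids log(0)
    bce = -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
               for p, y in zip(probs, labels)) / n
    # Soft Dice: sum(p * y) plays the role of TP, sum(p) of TP + FP, sum(y) of TP + FN
    intersection = sum(p * y for p, y in zip(probs, labels))
    dice = 1 - (2 * intersection + eps) / (sum(probs) + sum(labels) + eps)
    return bce + dice


perfect = bce_dice_loss([1.0, 0.0], [1, 0])
uncertain = bce_dice_loss([0.5, 0.5], [1, 0])
```

A perfect prediction drives both terms to zero, while a maximally uncertain prediction is penalized by both: the BCE term reacts to per-pixel confidence and the Dice term to region overlap.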

Results
In this section, we first perform ablation experiments on connections at four different locations in the CTrans module. We then present the experimental results of the HMT network.

Ablation Experiment
In order to better demonstrate the effect of the experimental method mentioned in this paper. We conducted related ablation experiments on the ISIC2018 dataset. In our ablation experiment, as shown in the connection position of the CTrans module in Figure 1, the sequence numbers of its connection positions are 1, 2, 3, and 4 from bottom to top. We tested the segmentation effect of concatenating TokMLP blocks at different locations on the dataset. We compared the effects of different connection methods between the two modules on the effect of lesion segmentation. As shown in Figure 8, we can observe that there are obvious differences in the segmentation results due to the different connection methods of the two modules. From the figure, it can be clearly seen that when the TokMLP block is connected to the first position of the CTrans module, it can obtain clearer predicted segmentation maps, especially when lesions have different locations and shapes.  2) The original image and grayscale image that were fed into the network. (3)(4)(5) The CTrans module is connected with the TokMLP block at the second, third, and fourth positions. Their segmentation effect is slightly worse, and their segmentation prediction graph is slightly greater than the labelcol graph. Compared with Figure 6, the segmentation performance is not ideal, and Figure 6 is closer to the label-column diagram. By comparing (3,7), Figure 6 is closer to the label diagram. By comparison (3-7), we can clearly see that the segmentation prediction graph is slowly approaching the label graph. By comparison, we can find that the position-one connection method can not only overcome the shortcoming of the CNN's difficulty in capturing context connection but also achieve accurate segmentation of focal local features.
In addition, we compared the DC and IOU values obtained on the three datasets (ISBI2016, ISBI2017, and ISIC2018). As shown in Tables 2-4, both indices increase from position 4 to position 1. This indicates that our network can learn the association between long-range contexts while also capturing the local features of lesions, which enables accurate segmentation of skin lesions. The statistics confirm that connecting the TokMLP block at the first position yields better segmentation results than the other three schemes.
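For readers who want to reproduce this kind of comparison, the two indices can be computed from a binary prediction mask and its label mask as sketched below. This is a minimal illustration under our own assumptions, not the evaluation code used in the experiments: the function name `dice_and_iou`, the smoothing constant `eps`, and the toy masks are hypothetical, and the masks are assumed to be already thresholded to 0/1.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks of the same shape.

    `eps` is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    inter = np.logical_and(pred, target).sum()   # |P ∩ G|
    pred_area = pred.sum()                       # |P|
    target_area = target.sum()                   # |G|
    union = pred_area + target_area - inter      # |P ∪ G|

    dice = (2.0 * inter + eps) / (pred_area + target_area + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

# Toy 4x4 masks (illustrative values only, not data from the paper).
# The prediction covers the whole label plus one extra column, i.e. it
# is slightly larger than the label.
label = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 1],
                 [0, 1, 1, 1],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])

dice, iou = dice_and_iou(pred, label)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

In this toy case the intersection is 4 pixels, the label covers 4 pixels, and the prediction covers 6, so an over-segmented prediction is penalized in both indices. A dataset-level score would typically be the mean of these per-image values over the test set, although the averaging scheme is an assumption here.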

Comparative Experiment
We conducted comparative experiments between the proposed segmentation network and mainstream segmentation networks, including U-Net [7], R2U-Net [8], FAC-Net [9], UCTransNet [26], TransUNet [27], Swin-Unet [28], UNeXt [14], AS-MLP [29], and CycleMLP [30]. To ensure a fair comparison, we ran all experiments in the same computing environment with the most suitable parameters for each network. We applied each network to three datasets (ISBI2016 [33], ISBI2017 [34], and ISIC2018 [35]). The predicted skin lesion segmentations obtained after training are shown in Figures 9-11. U-Net [7] was frequently unable to segment difficult, complex lesion boundaries accurately. Among the popular CNNs, R2U-Net [8] and U-Net [7] segmented worse than FAC-Net [9]. However, the figures show that CNNs, represented by FAC-Net, often lost important foreground information when discarding useless background information. The transformer-based approach can effectively remedy this problem because it captures the connections between long-range contexts. Among current mainstream networks, UCTransNet [26] and TransUNet [27], based on the transformer architecture, and UNeXt [14], based on the MLP architecture, all achieve relatively high segmentation accuracy. Their performance can match or exceed that of FAC-Net [9], the latest CNN-based network. In addition, the figures show that the MLP-based UNeXt can accurately locate the local features of lesions through its axial displacement operation. However, in the first row of Figure 5, we can see that U-Net [7], Swin-Unet [28], and UNeXt [14] all produce unclear or erroneous segmentations.
This is caused by interference from the marks placed outside the lesion site in the original image, which prevents these networks from accurately identifying the characteristic information of the lesion, whereas our method still maintains accurate segmentation performance. The segmentation results of AS-MLP [29], CycleMLP [30], and similar networks were slightly worse than those of FAC-Net [9]. Although FAC-Net [9] segmented better, it still failed to delineate lesion boundaries in detail when compared with the proposed HMT-Net. The result figures clearly show that the proposed hybrid network HMT-Net segmented noticeably better than the commonly used CNN segmentation networks, and also better than the popular transformer and MLP segmentation networks.

In addition to the visual comparisons, we compared the IOU and DC values obtained on the three datasets (ISBI2016, ISBI2017, and ISIC2018). As shown in Tables 5-7, our method outperformed the recent transformer-based and MLP-based networks on both indicators. Moreover, compared with the state-of-the-art convolutional neural network, FAC-Net, our method shows clear improvements on all three datasets. The IOU and DC values on the three datasets (ISIC2018, ISBI2017, and ISBI2016) indicate that the module we used is more robust in improving skin lesion segmentation.

Conclusions
Accurate and efficient segmentation of medical images is a key step in clinical skin lesion analysis and diagnosis. In this work, we designed a hybrid encoding network model, HMT-Net, from the perspective of channel encoders, with the aim of providing an accurate and reliable automatic segmentation algorithm for skin lesion images. By using the CTrans module based on the transformer architecture, we improved the network's ability to model the global correlation of lesion information and thus its ability to identify the global foreground information of lesions. At the same time, we used a tokenized MLP block with axial displacement. Its displacement operations along different axes strengthen the correlation between each pixel and its neighbors, which increases the network's attention to local feature information and helps address the weak recognition of boundary information in lesion image segmentation. The ablation and comparative experiments demonstrate the advantages of HMT-Net: it learns the global information of the image more effectively, and it improves the understanding of local information by strengthening the connection between adjacent pixels. Combining these two ideas achieves precise and effective segmentation of skin lesions. Our future work will focus on two aspects: making the network more lightweight and further improving its understanding of local information.